Demystifying Tokenization: Part 2 - Algorithms from BPE to Unigram
Table of Contents
This is Part 2 of a 3-part series on tokenization:
- Part 1: Why Tokenization Matters
- Part 2: Algorithms from BPE to Unigram ← You are here
- Part 3: Building & Auditing Tokenizers
In Part 1, we explored why tokenization matters, the fundamental problems it solves, and how to think about it as lossy compression. In this post, we will dive deeper into different algorithms to strengthen our understanding of tokenization.
There are two broad categories of tokenization algorithms: deterministic and probabilistic.
Deterministic Tokenization Approaches
Deterministic tokenization algorithms produce the same segmentation every time they encounter the same input.
Given a fixed vocabulary and merge rules learned during training, these methods apply a predefined sequence of operations (splitting on characters or whitespace, iteratively applying learned merges) and deterministically transform text into tokens.
There is no randomness, no probability distributions, and no exploration of alternative segmentations. This predictability makes them fast, simple to implement, and easy to debug, which is why they form the foundation of modern tokenization systems. However, their greedy, rule-based nature also means they optimize for local patterns rather than global statistical optimality.
We will start with the simplest approaches and build up to the sophisticated byte-pair encoding methods that power modern LLMs.
Character-level Tokenization
This is the most straightforward approach: we treat each character as a token. For example, the word "cat" is split into ['c', 'a', 't'] and each character is mapped to an integer ID.
The code is simple to implement. We keep two dictionaries, mappings and reverse_mappings, storing {character -> integer} and {integer -> character} respectively.
```python
class CharacterTokenization:
    def __init__(self):
        # Dictionary to hold character to integer mappings
        self.mappings: dict[str, int] = {}
        # Dictionary to hold integer to character mappings
        self.reverse_mappings: dict[int, str] = {}
        # Index to keep track of the next integer to assign
        self.index: int = 0

    def fit(self, corpus: list[str]) -> None:
        # Traverse through each character in the text
        for text in corpus:
            for char in text:
                if char not in self.mappings:
                    self.mappings[char] = self.index
                    self.reverse_mappings[self.index] = char
                    self.index += 1
```
Once training is complete via the fit method, we can encode and decode any word or phrase.
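The encode and decode methods are not shown above; a minimal sketch of what they might look like (here an unseen character simply raises a KeyError, which is an assumption on my part) is:

```python
    def encode(self, text: str) -> list[int]:
        # Look up each character's integer ID (raises KeyError for unseen characters)
        return [self.mappings[char] for char in text]

    def decode(self, ids: list[int]) -> str:
        # Map each ID back to its character and join them into a string
        return "".join(self.reverse_mappings[i] for i in ids)
```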
Let's take an example:
1corpus = ["the cat sat on the mat"]
2tokenizer = CharacterTokenization()
3tokenizer.fit(corpus)
4
5# We will get something like this
6
7# 't' -> 0
8# 'h' -> 1
9# 'e' -> 2
10# ' ' -> 3 (space is a character too!)
11# 'c' -> 4
12# 'a' -> 5
13# 's' -> 6
14# ...
15
16# Encode "cat"
17encoded = tokenizer.encode("cat")
18print(encoded) # [4, 5, 0] (c->4, a->5, t->0)
19
20# Decode back
21decoded = tokenizer.decode(encoded)
22print(decoded) # "cat"1corpus = ["the cat sat on the mat"]
2tokenizer = CharacterTokenization()
3tokenizer.fit(corpus)
4
5# We will get something like this
6
7# 't' -> 0
8# 'h' -> 1
9# 'e' -> 2
10# ' ' -> 3 (space is a character too!)
11# 'c' -> 4
12# 'a' -> 5
13# 's' -> 6
14# ...
15
16# Encode "cat"
17encoded = tokenizer.encode("cat")
18print(encoded) # [4, 5, 0] (c->4, a->5, t->0)
19
20# Decode back
21decoded = tokenizer.decode(encoded)
22print(decoded) # "cat"Since we are processing at the character level, 100% coverage can be done. Every possible string - English, δΈζ, emoji π, typos, even random gibberish can be represented because we are just mapping individual characters.
If someone writes "recieve" (misspelled), it encodes perfectly as ['r', 'e', 'c', 'i', 'e', 'v', 'e']. The model might not know it's a misspelling, but at least it can process it.
But this simplicity comes at a severe computational cost. Consider the sentence "Tokenization is essential". As words, it is 3 tokens; as characters, it is 25 tokens. That's roughly an 8x increase in sequence length. For a model with an attention mechanism (whose cost scales quadratically with sequence length), this explosion is devastating: an 8x longer sequence means roughly 64x more computation in the attention layers.
Real-world impact:
- Longer sequences = more memory needed.
- More computation = slower training and inference.
- Limited context window gets consumed faster.
- Models struggle to capture long-range dependencies (information diffuses across more steps).
This is why character-level tokenization, despite its elegance, is rarely used for large language models. The computational trade-off is simply too expensive.
Word Level Tokenization
As the name suggests, word-level tokenization treats every word as a token. A dictionary of unique words is built by splitting the text on whitespace, and an integer ID is assigned to each token.
The implementation is straightforward: we maintain frequency counts of each word and assign IDs deterministically (sorted by frequency, then alphabetically). The key addition here is the <UNK> token: a special reserved token that represents all unknown words encountered during inference.
When add_unk_token=True (the default), we reserve an ID for <UNK> before building the vocabulary. This allows the model to handle out-of-vocabulary words gracefully by mapping them all to the same token, though at the cost of losing the distinction between different unknown words.
```python
from collections import Counter

class WordLevelEncoding:
    def __init__(self, add_unk_token: bool = True):
        self.mappings: dict[str, int] = {}
        self.reverse_mappings: dict[int, str] = {}
        self.frequencies: Counter[str] = Counter()
        self.index: int = 0

        # Special handling for unknown words
        self.add_unk_token = add_unk_token
        if add_unk_token:
            self.unk_token = "<UNK>"
            self.unk_id = self._reserve_token(self.unk_token)

    def _reserve_token(self, token: str) -> int:
        """Reserve token id if not already present and return its id."""
        if token not in self.mappings:
            token_id = self.index
            self.mappings[token] = token_id
            self.reverse_mappings[token_id] = token
            self.index += 1
            return token_id
        return self.mappings[token]

    def fit(self, corpus: list[str]) -> None:
        """Build vocabulary from corpus"""
        for text in corpus:
            words = text.split()
            self.frequencies.update(words)

        # Assign IDs to words (by frequency, then alphabetically, for determinism)
        for word in sorted(self.frequencies.keys(), key=lambda w: (-self.frequencies[w], w)):
            if word not in self.mappings:
                self.mappings[word] = self.index
                self.reverse_mappings[self.index] = word
                self.index += 1
```
Here is an example run:
```python
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug"
]

tokenizer = WordLevelEncoding()
tokenizer.fit(corpus)

# Vocabulary learned:
# "<UNK>" -> 0 (reserved)
# "the"   -> 1 (most frequent)
# "on"    -> 2
# "sat"   -> 3
# "cat"   -> 4
# ...
```
This is efficient: "Tokenization is essential" becomes just three tokens. But it introduces a critical failure mode.
The OOV (Out of Vocabulary) Problem
What happens when we encounter a word during deployment that never appeared in training?
```python
# Training corpus
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug"
]
tokenizer.fit(corpus)

# Now try encoding a held-out sentence
test = "the platypus sat under the tree"
```
The words "platypus", "under", and "tree" are unknown. They're not in our vocabulary. Now what? There are three policies to handle it.
1. Raise an error:
```python
encoded = tokenizer.encode(test, unknown_policy="raise")
# KeyError: Unknown word during encode: 'platypus'
```
This makes the problem explicit but breaks the system - we cannot process the input at all.
2. Add unknown words dynamically:
```python
encoded = tokenizer.encode(test, unknown_policy="add")
# Dynamically adds 'platypus', 'under', 'tree' to vocabulary
print("New vocabulary size:", tokenizer.vocabulary_size())
# Vocabulary grew from 8 to 11
```
This approach works but has serious problems:
- the vocabulary grows unbounded over time.
- the model never saw these words during training, so their embeddings are random/untrained.
- different deployments will have different vocabularies (non-deterministic).
3. Map to special <UNK> token:
```python
encoded = tokenizer.encode(test, unknown_policy="unk")
# All unknown words become <UNK>
# "the <UNK> sat <UNK> the <UNK>"
```
This keeps vocabulary fixed but loses information. The model cannot distinguish between "platypus", "under" and "tree" - they are all the same token. If the unknown word is critical to understanding the sentence, the model is blind.
The Vocabulary Explosion
Even if we could solve the OOV problem, we face an exploding vocabulary once we account for morphological variants ("walk", "walking"), proper nouns ("Anthropic", "OpenAI", "NVIDIA"), and technical terms ("HTTP", "JSON").
On top of that, if we add other languages, each needs its own dictionary. A multilingual model could need millions of tokens.
If each token has a 1024-dimensional embedding at 4 bytes per float, that's:
- 50k tokens = 200 MB
- 100k tokens = 400 MB
- 1M tokens = 4 GB
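The arithmetic behind those numbers is simply vocabulary size × embedding dimension × bytes per float:

```python
# Embedding table size = vocab_size × embedding_dim × bytes_per_float
for vocab_size in (50_000, 100_000, 1_000_000):
    size_mb = vocab_size * 1024 * 4 / 1e6
    print(f"{vocab_size:>9,} tokens -> {size_mb:,.0f} MB")
# 50,000 tokens -> 205 MB, 100,000 tokens -> 410 MB, 1,000,000 tokens -> 4,096 MB
```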
Note this is just the embedding table, before any other model parameters. And with 100,000+ tokens, each token appears less frequently in training; rare words get poor embeddings because the model sees them infrequently. This sparsity makes learning inefficient.
This is why word-level tokenization, while intuitive, doesn't scale to modern LLMs.
We need something that combines the efficiency of word-level with the coverage of character-level.
Subword Tokenization: Byte Pair Encoding (BPE)
This is the breakthrough that makes modern tokenization work. Instead of choosing characters or words, we let the data tell us which fragments are meaningful by iteratively merging the most frequent pairs of symbols.
Start with a simple corpus: ["low", "lower", "newest", "newest", "newest", "widest"]
Step 0: Initialize with characters
Break each word into characters plus an end-of-word marker </w>:
Note on </w>: This is a training-time delimiter that helps BPE distinguish word boundaries (e.g., "low" as a complete word vs. "low" as a prefix of "lower"). In the examples below we show </w> for clarity during training, but when counting tokens we typically don't count it separately; it gets absorbed into the preceding symbol during the merge process.
low → ['l', 'o', 'w', '</w>']
lower → ['l', 'o', 'w', 'e', 'r', '</w>']
newest → ['n', 'e', 'w', 'e', 's', 't', '</w>'] (appears 3 times)
widest → ['w', 'i', 'd', 'e', 's', 't', '</w>']
Step 1: Count pair frequencies
Look at all adjacent symbol pairs across the corpus:
('e', 's'): 4 times (3 from "newest", 1 from "widest")
('s', 't'): 4 times (3 from "newest", 1 from "widest")
('e', 'w'): 3 times (from "newest" × 3)
('l', 'o'): 2 times (from "low" and "lower")
('o', 'w'): 2 times (from "low" and "lower")
...
Step 2: Merge most frequent pair
The pairs ('e', 's') and ('s', 't') each appear 4 times. Suppose we merge ('e', 's') first, into a single symbol 'es':
newest → ['n', 'e', 'w', 'es', 't', '</w>']
widest → ['w', 'i', 'd', 'es', 't', '</w>']
Step 3: Repeat
Count pairs again. Now ('es', 't') is frequent. Merge it:
newest → ['n', 'e', 'w', 'est', '</w>']
widest → ['w', 'i', 'd', 'est', '</w>']
Continue this process. Over many iterations, common patterns emerge as tokens:
- 'est' becomes a token (suffix for superlatives)
- 'low' becomes a token (common word)
- 'er' becomes a token (suffix for comparatives)
After 10 merges, encoding "lowest" might give ['low', 'est', '</w>']: just 2 tokens (plus the </w> marker) for a 6-character word.
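A few lines of Python reproduce the pair counting from Step 1 on this toy corpus (a standalone illustration, not the class implementation shown later):

```python
from collections import Counter

# Toy corpus as {word-as-symbol-tuple: frequency}, matching Step 0 above
corpus = {
    ("l", "o", "w", "</w>"): 1,
    ("l", "o", "w", "e", "r", "</w>"): 1,
    ("n", "e", "w", "e", "s", "t", "</w>"): 3,
    ("w", "i", "d", "e", "s", "t", "</w>"): 1,
}

pair_counts = Counter()
for symbols, freq in corpus.items():
    for a, b in zip(symbols, symbols[1:]):
        pair_counts[(a, b)] += freq  # Weight each pair by the word's frequency

print(pair_counts[("e", "s")], pair_counts[("s", "t")], pair_counts[("l", "o")])  # 4 4 2
```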
Figure: BPE iteratively merges the most frequent pairs, gradually building up a vocabulary of meaningful subword units.
Implementation Details
Complexity: A naive BPE implementation rescans the corpus for every merge, so training costs roughly O(num_merges × total_symbols). With priority queues or other incremental data structures this can be optimized substantially, but the basic approach repeats a full find-and-replace pass per merge.
The core implementation of the BPE algorithm is below:
```python
def train(self, corpus, num_merges):
    """Learn BPE merges from corpus"""
    for i in range(num_merges):
        # Count all adjacent pairs
        pair_frequencies = self.get_pair_frequencies(corpus)
        if not pair_frequencies:
            break  # No more pairs to merge

        # Find most frequent pair
        most_frequent_pair = max(pair_frequencies, key=lambda p: (pair_frequencies[p], p))

        # Merge this pair throughout the corpus
        corpus = self.merge_most_frequent_pair(corpus, most_frequent_pair)
        # Record the merge for encoding later
        self.merges.append(most_frequent_pair)

    return corpus

def get_pair_frequencies(self, corpus):
    """Count adjacent symbol pairs across corpus"""
    pair_frequencies = Counter()
    for word_symbols, frequency in corpus.items():
        if len(word_symbols) < 2:
            continue

        # Sliding window over symbols
        for i in range(len(word_symbols) - 1):
            pair = (word_symbols[i], word_symbols[i + 1])
            pair_frequencies[pair] += frequency  # Weighted by word frequency

    return dict(pair_frequencies)

def merge_most_frequent_pair(self, corpus, target_pair):
    updated_corpus: dict[tuple[str, ...], int] = {}
    for word_symbols, count in corpus.items():
        symbols = list(word_symbols)
        i = 0
        while i < len(symbols) - 1:
            # Join the symbols to form the pair
            pair = (symbols[i], symbols[i + 1])
            if pair == target_pair:
                merged_symbol = "".join(pair)
                symbols[i:i+2] = [merged_symbol]  # Replace two elements with one merged element
                i += 1
                continue
            else:
                i += 1
        # Add new tuple to the updated corpus
        updated_corpus[tuple(symbols)] = count

    return updated_corpus
```
Key insight: Pair frequencies are weighted by word frequency. If "newest" appears 1,000 times in the corpus, the pair ('e', 's') picks up 1,000 counts from that word alone. This ensures we merge patterns that are genuinely common in the data, not just patterns that happen to appear in many unique words.
Encoding New Words
Once training is complete, we have an ordered list of learned merges. To encode a new word, apply these merges in the same order.
```python
def encode(self, word: str) -> list[str]:
    """Encode word using learned merges"""
    symbols = list(word) + ["</w>"]

    # Apply each learned merge in sequence
    for pair in self.merges:
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == pair:
                merged = "".join(pair)
                symbols[i:i+2] = [merged]  # Replace pair with merged symbol
                # Don't advance i; re-check at same position after merge
            else:
                i += 1

    return symbols
```
To take an example:
```python
bpe = BytePairEncoding()
# fit_corpus prepares the corpus and returns processed word-frequency pairs
processed_corpus = bpe.fit_corpus(["low", "lower", "newest", "newest", "newest", "widest"])
# train learns merges from the processed corpus
bpe.train(processed_corpus, num_merges=10)

# Encode a word seen during training
encoded = bpe.encode("lowest")
print(encoded)  # ['low', 'est']

# Encode a completely new word
encoded = bpe.encode("slowest")
print(encoded)  # ['s', 'low', 'est']
# Note: 's' stays separate, but 'low' and 'est' are recognized
```
This is the magic of BPE: new words are decomposed using learned patterns. Even though "slowest" never appeared in training, BPE recognizes "low" and "est" as meaningful fragments and encodes them as single tokens. Only the 's' prefix remains as a character-level token.
Why BPE Works
BPE is greedy statistical compression. By merging frequent pairs, we are building a vocabulary that efficiently encodes the training distribution.
Common words like "the" quickly become single tokens (after merging 't'+'h', then 'th'+'e'). Rare words like "platypus" get fragmented, but the fragments themselves are learned from common patterns elsewhere in the corpus.
The trade-off in action:
| Word | Frequency | Typical BPE Encoding | Token Count |
|---|---|---|---|
| "the" | Very high | ['the'] | 1 token |
| "playing" | High | ['play', 'ing'] | 2 tokens |
| "platypus" | Rare | ['plat', 'y', 'pus'] | 3 tokens |
| "quokka" | Never seen | ['qu', 'ok', 'ka'] | 3 tokens |
Frequent words compress well. Rare words are longer but still representable. This is optimal from an information theory perspective - we are using shorter codes for common patterns and longer codes for rare patterns, exactly what Shannon's theory prescribes.
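To make the Shannon intuition concrete: the optimal code length for a symbol with probability p is -log2(p) bits, so frequent tokens deserve short codes and rare ones long codes. The probabilities below are made up purely for illustration:

```python
import math

for token, p in [("the", 0.05), ("play", 0.001), ("platypus", 0.000001)]:
    print(f"{token}: -log2({p}) ≈ {-math.log2(p):.1f} bits")
# the ≈ 4.3 bits, play ≈ 10.0 bits, platypus ≈ 19.9 bits
```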
Limitations of Character Level BPE
BPE as described above operates on characters. This works great for English and similar languages, but it has a subtle problem: different character encodings.
Consider these strings:
"cafΓ©"(Γ©as a single Unicode character: U+00E9)"cafΓ©"(e+ combining accent: U+0065 + U+0301)
They look identical but have different character sequences! A character-level BPE would treat them differently, learning separate tokens. This inconsistency wastes vocabulary and confuses the model.
Additionally, character-level BPE struggles with non-Latin scripts. Chinese, Arabic, and emoji have tens of thousands of unique characters. Starting with all of them as base tokens means our initial vocabulary is already huge before any merging.
This leads us to the final evolution: byte-level BPE.
Byte Level BPE: Universal Coverage
We know that every character in any language is ultimately represented as a sequence of bytes. So why not operate on bytes (the UTF-8 encoding) instead of characters? Since UTF-8 uses only 256 possible byte values (0-255), we can start with a base vocabulary of just 256 tokens.
A few examples of UTF-8 encoding (common characters use 1 byte, less common ones use 2-4 bytes):
'a' → [97] (1 byte: ASCII)
'é' → [195, 169] (2 bytes: Latin extended)
'世' → [228, 184, 150] (3 bytes: Chinese)
'🌍' → [240, 159, 140, 141] (4 bytes: emoji)
No matter what the input is - English, Hindi, emoji, code, even binary data - it becomes a sequence of bytes in the range 0-255.
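A quick sanity check in Python shows that arbitrary text, regardless of script, always reduces to byte values between 0 and 255:

```python
for text in ["hello", "नमस्ते", "🌍", "print('hi')"]:
    byte_values = list(text.encode("utf-8"))
    assert all(0 <= b <= 255 for b in byte_values)
    print(f"{text!r} -> {byte_values}")
```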
Figure: Byte-level BPE operates on UTF-8 bytes, providing 100% coverage while learning common byte sequences as tokens.
Implementation Details
The algorithm is nearly identical to character level BPE, but operates on bytes:
```python
def fit_corpus(self, corpus: list[str]):
    """Convert words to byte sequences"""
    word_frequencies = Counter(corpus)
    processed = {}
    for word, frequency in word_frequencies.items():
        # Convert to bytes
        word_bytes = word.encode("utf-8")
        symbols = list(word_bytes) + ["</w>"]  # Bytes + end marker
        processed[tuple(symbols)] = frequency
    return processed
```
Let's take an example to see this in action:
1word = "cafΓ©"
2word_bytes = word.encode("utf-8")
3# [99, 97, 102, 195, 169] (c, a, f, Γ© as 2 bytes)
4
5symbols = list(word_bytes) + ["</w>"]
6# [99, 97, 102, 195, 169, "</w>"]1word = "cafΓ©"
2word_bytes = word.encode("utf-8")
3# [99, 97, 102, 195, 169] (c, a, f, Γ© as 2 bytes)
4
5symbols = list(word_bytes) + ["</w>"]
6# [99, 97, 102, 195, 169, "</w>"]Important nuance about Unicode normalization: UTF-8 encoding does not automatically normalize text. The word "cafΓ©" can be represented in Unicode using two different normalization forms:
- NFC (Canonical Composition): é as a single codepoint (U+00E9) → UTF-8 bytes: [99, 97, 102, 195, 169]
- NFD (Canonical Decomposition): e + combining acute accent (U+0065 + U+0301) → UTF-8 bytes: [99, 97, 102, 101, 204, 129]
These produce different byte sequences even though they display identically. Byte-level BPE treats them as different inputs.
This has implications:
Advantage: Byte-level BPE doesn't make normalization assumptions β it handles whatever UTF-8 bytes it receives. If your training data contains both NFC and NFD text, both get represented (though possibly inefficiently).
Disadvantage: If the same logical word appears in multiple normalization forms, the tokenizer may learn separate tokens for each, wasting vocabulary space and creating inconsistency.
In practice: Most modern text processing pipelines normalize to a single form (typically NFC or NFKC) before tokenization to avoid this issue; SentencePiece, for example, applies NFKC normalization by default before its subword model runs. The key insight is that byte-level operation makes tokenization robust to any input, but normalization is still a separate decision in the preprocessing pipeline.
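A quick check with Python's standard unicodedata module makes the NFC/NFD difference concrete:

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "cafe\u0301")  # compose e + accent into é (U+00E9)
nfd = unicodedata.normalize("NFD", "caf\u00e9")   # decompose é into e + combining accent

print(nfc == nfd)                 # False: different codepoint sequences
print(list(nfc.encode("utf-8")))  # [99, 97, 102, 195, 169]
print(list(nfd.encode("utf-8")))  # [99, 97, 102, 101, 204, 129]
```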
Training and Merging
The merging logic is the same as before, but now we merge byte values instead of characters:
```python
def train(self, corpus, num_merges):
    """Learn BPE merges on bytes"""
    for i in range(num_merges):
        pair_frequencies = self.get_pair_frequencies(corpus)
        if not pair_frequencies:
            break

        # Most frequent byte pair
        most_frequent_pair = max(pair_frequencies, key=pair_frequencies.get)
        print(f"Merge {i+1}: {most_frequent_pair}")

        corpus = self.merge_most_frequent_pair(corpus, most_frequent_pair)
        self.merges.append(most_frequent_pair)

    return corpus
```
After training, common byte sequences become tokens. For English text, we will see merges like:
- (116, 104) → "th" (bytes for 't' and 'h')
- (101, 114) → "er" (bytes for 'e' and 'r')
- (105, 110, 103) → "ing" (bytes for 'i', 'n', 'g' after successive merges)
The same merging behavior applies to Chinese, Arabic, and emoji bytes as well.
The algorithm learns whatever is statistically common in the training data, regardless of language.
Decoding Bytes Back to Text
After encoding, we have a sequence of tokens (single bytes or merged groups of bytes). To decode:
```python
def decode(self, tokens: list) -> str:
    """Decode tokens back to text"""
    byte_sequence = []
    for token in tokens:
        if token == "</w>":
            continue
        byte_sequence.extend(self._flatten(token))  # Flatten nested merges

    return bytes(byte_sequence).decode("utf-8", errors="replace")

def _flatten(self, symbol):
    """Recursively flatten nested tuples of bytes"""
    if isinstance(symbol, int):
        return [symbol]  # Base case: single byte
    elif isinstance(symbol, tuple):
        result = []
        for s in symbol:
            result.extend(self._flatten(s))
        return result
    else:
        return []
```
The _flatten() helper recursively unpacks merged tokens back to individual bytes. This is necessary because merged tokens are stored as nested tuples: after merging (116, 104) into "th", then merging "th" with 101 into "the", we get ((116, 104), 101). Flattening gives us back [116, 104, 101], which decodes as "the".
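To see the flattening in isolation, here is a standalone version of the same recursion (a hypothetical free function, separate from the class method above) applied to one nested merge:

```python
def flatten(symbol):
    """Recursively unpack nested tuples of byte values into a flat list."""
    if isinstance(symbol, int):
        return [symbol]
    if isinstance(symbol, tuple):
        out = []
        for s in symbol:
            out.extend(flatten(s))
        return out
    return []

token = ((116, 104), 101)  # "th" merged with byte 101 ('e')
print(flatten(token))                         # [116, 104, 101]
print(bytes(flatten(token)).decode("utf-8"))  # "the"
```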
Why Byte-Level BPE Is the Gold Standard
GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020), GPT-4 (OpenAI, 2023), and many modern models use byte-level BPE because it solves every coverage problem:
- 100% coverage: Any UTF-8 text can be encoded (in the worst case, it falls back to individual bytes)
- Language-agnostic: A single tokenizer handles English, Chinese, Arabic, code, and emoji - everything
- Works on raw UTF-8 bytes: Ensures coverage of any input. Normalization (e.g., NFC) is still a separate, recommended preprocessing step for consistency.
- Compact vocabulary: Starts with 256 base tokens, grows to ~50K through merging
- Robust to typos: Misspelled words decompose into bytes; no special handling needed
The trade-off: byte-level tokenization is slightly less efficient than character-level tokenization for languages with simple scripts (like English). The word "the" might be 1 token with character-level BPE, but requires merging 3 bytes [116, 104, 101] with byte-level BPE. In practice, after sufficient merge iterations this difference is minimal; both converge to similar vocabulary structures.
What We Have Learned
These are all deterministic approaches: given the same training corpus and merge count, BPE produces the same vocabulary every time. But we can do better. What if, instead of greedily merging the most frequent pair, we used probabilistic reasoning to find optimal segmentations?
That's where probabilistic tokenization comes in. We will explore WordPiece and Unigram Language Models next: algorithms that optimize for likelihood rather than raw frequency, and that produce better vocabularies for modern LLMs.
Probabilistic Tokenization Approaches
BPE is elegant and simple: it merges the most frequent pairs until we reach our target vocabulary size. But "most frequent" has a subtle flaw - it optimizes for how often symbols appear together, not for how meaningful those combinations are.
Consider two pairs in a corpus:
- ('e', 's') appears 1,000 times (mostly from common words like "best", "rest", "test").
- ('q', 'u') appears 100 times ('q' almost never appears without 'u' in English).
BPE merges ('e', 's') first because it is more frequent. But linguistically, "qu" is a much stronger unit: 'q' and 'u' are statistically dependent, while 'e' and 's' often appear separately.
Mental model: I think of it in terms of conditional probability (Bayes' theorem): seeing 'q' makes 'u' almost certain, while seeing 'e' tells us little about whether 's' comes next.
This is where probabilistic tokenization comes in. Instead of counting raw frequencies, these methods use likelihood and statistical independence to decide what to merge or keep.
Let's explore the two most widely used algorithms in this category: WordPiece and Unigram LM.
WordPiece
This was developed by Google and is used in BERT. It asks a better question than BPE: "Do these symbols carry more information together than apart?"
The intuition is simple. In English, the probability of seeing the letter u after the letter q is nearly 100%. The two letters are statistically dependent - seeing q tells us a lot about what comes next. By contrast, the probability of seeing s after e is much lower; these letters are more independent.
WordPiece's insight: Merge symbols that are dependent, not just frequent. This captures meaningful linguistic units.
The Math: Pointwise Mutual Information (PMI)
Note on PMI as a teaching proxy: We use PMI here as an intuitive explanation. In practice, production WordPiece implementations (as used in BERT) greedily add tokens that maximize the likelihood gain under a simple language model objective. PMI captures the core intuition of merging statistically dependent symbols, but the actual algorithm optimizes corpus likelihood. For the full details, see Schuster & Nakajima (2012).
WordPiece uses a score based on PMI, which measures how much more likely two symbols are to appear together than they would be independently:

PMI(x, y) = log( P(xy) / (P(x) × P(y)) ) = log P(xy) - log P(x) - log P(y)

- If x and y are independent: P(xy) = P(x) × P(y), so PMI(x, y) = 0.
- If x and y are dependent: P(xy) > P(x) × P(y), so PMI(x, y) > 0.
- The higher the PMI, the stronger the association.

WordPiece's scoring function weights PMI by frequency:

score(x, y) = f(xy) × PMI(x, y) = f(xy) × ( log P(xy) - log P(x) - log P(y) )

Where:
- f(xy) = frequency of the pair in the corpus.
- P(xy) = probability of seeing the merged token.
- P(x), P(y) = probabilities of seeing each symbol individually.
Why multiply by frequency? We still want to prioritize common patterns (like BPE), but among patterns with similar frequency, we prefer those with high mutual information.
Example
1Example: "play" + "ing" vs "pl" + "ay"
2
3Let's compare two potential merges in a corpus with these counts:
4
5Pair frequencies:
6('play', 'ing'): 500 times
7('pl', 'ay'): 600 times
8
9Individual token frequencies:
10'play': 800 times (also appears in "play", "playful", "player")
11'ing': 2000 times (appears in "running", "walking", "talking"...)
12'pl': 650 times (mostly from "play" words, but also "plus", "place")
13'ay': 620 times (mostly from "play" words, but also "day", "say", "may")
14
15BPE would choose ('pl', 'ay') because it appears 600 times vs 500 for ('play', 'ing').
16
17WordPiece computes PMI:
18
19For ('play', 'ing'):
20P(play) = 800 / total β 0.016
21P(ing) = 2000 / total β 0.040
22P(playΒ·ing) = 500 / total β 0.010
23
24PMI = log(0.010) - log(0.016) - log(0.040)
25 = log(0.010 / (0.016 Γ 0.040))
26 = log(15.625) β 2.75
27
28score = 500 Γ 2.75 = 1,375
29
30For ('pl', 'ay'):
31P(pl) = 650 / total β 0.013
32P(ay) = 620 / total β 0.012
33P(plΒ·ay) = 600 / total β 0.012
34
35PMI = log(0.012) - log(0.013) - log(0.012)
36 = log(0.012 / (0.013 Γ 0.012))
37 = log(7.69) β 2.04
38
39score = 600 Γ 2.04 = 1,2241Example: "play" + "ing" vs "pl" + "ay"
2
3Let's compare two potential merges in a corpus with these counts:
4
5Pair frequencies:
6('play', 'ing'): 500 times
7('pl', 'ay'): 600 times
8
9Individual token frequencies:
10'play': 800 times (also appears in "play", "playful", "player")
11'ing': 2000 times (appears in "running", "walking", "talking"...)
12'pl': 650 times (mostly from "play" words, but also "plus", "place")
13'ay': 620 times (mostly from "play" words, but also "day", "say", "may")
14
15BPE would choose ('pl', 'ay') because it appears 600 times vs 500 for ('play', 'ing').
16
17WordPiece computes PMI:
18
19For ('play', 'ing'):
20P(play) = 800 / total β 0.016
21P(ing) = 2000 / total β 0.040
22P(playΒ·ing) = 500 / total β 0.010
23
24PMI = log(0.010) - log(0.016) - log(0.040)
25 = log(0.010 / (0.016 Γ 0.040))
26 = log(15.625) β 2.75
27
28score = 500 Γ 2.75 = 1,375
29
30For ('pl', 'ay'):
31P(pl) = 650 / total β 0.013
32P(ay) = 620 / total β 0.012
33P(plΒ·ay) = 600 / total β 0.012
34
35PMI = log(0.012) - log(0.013) - log(0.012)
36 = log(0.012 / (0.013 Γ 0.012))
37 = log(7.69) β 2.04
38
39score = 600 Γ 2.04 = 1,224WordPiece chooses ('play', 'ing') despite lower frequency, because the PMI is higher - "play" and "ing" are more strongly associated than "pl" and "ay".
Implementation Details
Complexity: WordPiece has similar training complexity to BPE, roughly O(num_merges × total_symbols). The additional PMI computation is O(number of candidate pairs) per iteration, which is typically much smaller than the corpus size. The main difference from BPE is the scoring function, not the algorithmic complexity.
Here's the core of WordPiece implementation:
```python
def _get_best_pair(self, token_frequencies: dict[str, int],
                   pair_frequencies: dict[tuple[str, str], int]) -> tuple[str, str]:
    """Find pair with highest PMI-weighted score"""
    scores: dict[tuple, float] = {}

    for (x, y), f_xy in pair_frequencies.items():
        f_x = token_frequencies.get(x, 1)
        f_y = token_frequencies.get(y, 1)

        # Compute PMI score
        epsilon = 1e-10  # Avoid log(0)
        score = f_xy * (
            math.log(f_xy + epsilon) - math.log(f_x + epsilon) - math.log(f_y + epsilon)
        )
        scores[(x, y)] = score

    # Return pair with maximum score
    best_pair = max(scores, key=scores.get)
    return best_pair
```
Key implementation notes:
- Adding a small epsilon (1e-10) prevents log(0) errors when frequencies are zero.
- We work in log space for numerical stability.
- Like BPE, we still merge one pair at a time β the difference is which pair we choose.
The rest of the algorithm is identical to BPE:
```python
def fit(self, corpus: list[str]) -> None:
    """Train WordPiece tokenizer"""
    # Start with character-level tokenization
    tokenized: list[list[str]] = [list(word) + ["</w>"] for word in corpus]
    self.vocab = set(ch for word in tokenized for ch in word)

    while len(self.vocab) < self.target_vocab_size:
        # Count token and pair frequencies
        token_frequencies = self._get_token_frequencies(tokenized)
        pair_frequencies = self._get_pair_frequencies(tokenized)

        if not pair_frequencies:
            break

        # Find best pair using PMI (not just frequency!)
        best_pair = self._get_best_pair(token_frequencies, pair_frequencies)
        # Merge the best pair
        tokenized = self._merge_pair(tokenized, best_pair)
        self.vocab.add("".join(best_pair))
```
At encoding time, WordPiece uses greedy longest-prefix matching: for each position in the word, try the longest possible substring that is in the vocabulary first. This ensures we use the most "complete" tokens available.
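The encoding step itself is not part of the training snippet. Here is a minimal sketch of greedy longest-prefix matching over a vocabulary set (a hypothetical standalone helper; real WordPiece implementations additionally mark continuation pieces with a "##" prefix, omitted here for simplicity):

```python
def wordpiece_encode(word: str, vocab: set[str], unk_token: str = "[UNK]") -> list[str]:
    """Greedily take the longest vocabulary match at each position."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate substring until it matches the vocabulary
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return [unk_token]  # No prefix matches at all: give up on the word
        tokens.append(word[start:end])
        start = end
    return tokens

print(wordpiece_encode("playing", {"play", "ing", "p", "l", "a", "y"}))  # ['play', 'ing']
```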
Strengths and Limitations
- Better linguistic units: PMI captures meaningful combinations (morphemes, common phrases).
- Less sensitive to frequency spikes: A rare but highly associated pair can be preferred over a frequent but weakly associated pair.
- Still greedy: Like BPE, WordPiece merges one pair at a time without considering global optimization.
- Deterministic: Given the same corpus, produces the same vocabulary - no exploration of alternative segmentations.
This last limitation is what Unigram Language Models address.
Unigram Language Model
WordPiece improves on BPE by using mutual information, but it is still a bottom-up, greedy algorithm: start with characters, iteratively merge pairs. The Unigram approach (used in Google's SentencePiece and models like T5) flips this on its head.
Here, we start with a large vocabulary (all possible subwords up to some length), then iteratively prune low-probability tokens until we reach the target size.
Core Idea: Multiple Segmentations
Unlike BPE and WordPiece, which produce a single deterministic segmentation for each word, Unigram considers all possible segmentations and treats tokenization as a probabilistic model.
Given a word like "playing", there are many ways to segment it:
- ['playing'] (if 'playing' is in the vocabulary)
- ['play', 'ing']
- ['pla', 'ying']
- ['pl', 'ay', 'ing']
- ['p', 'l', 'a', 'y', 'i', 'n', 'g'] (character-level fallback)
- ...
Each segmentation has a probability equal to the product of the individual token probabilities:

P(x1, x2, ..., xn) = P(x1) × P(x2) × ... × P(xn)

For example:

P(['play', 'ing']) = P('play') × P('ing')
The goal is to learn token probabilities that maximize the likelihood of the training corpus under the most probable segmentations.
The EM Algorithm: Expectation-Maximization
This is a chicken and egg problem:
- To find the best segmentation, we need to know token probabilities.
- To learn token probabilities, we need to know which segmentations to count.
The solution is the EM (Expectation-Maximization) algorithm, which iteratively refines both:
- E-Step (Expectation): Given current token probabilities, compute the expected count of each token across all possible segmentations of all words in the corpus.
- M-Step (Maximization): Given the expected counts, update the token probabilities.
Repeat until convergence, then prune low-probability tokens.
E-Step: Forward-Backward Algorithm
Complexity: The forward-backward algorithm for a single word runs in O(L × m), where L is the word length and m is the maximum token length (so at most m candidate tokens start at each position).
For the entire corpus over k EM iterations, the total cost is roughly O(k × N × m), where N is the total number of characters in the corpus. This is significantly more expensive than BPE/WordPiece, but produces globally optimal probabilities. Memory requirements are also higher due to storing the large candidate vocabulary.
The E-step uses dynamic programming to efficiently compute expected counts without enumerating all segmentations explicitly.
Forward pass computes the probability of reaching each position in the word:
```python
def forward_backward_expected_counts(self, matches, logP):
    L = len(matches)

    # Forward probabilities: α[i] = sum of probabilities of all paths to position i
    log_alpha = [-math.inf] * (L + 1)
    log_alpha[0] = 0.0  # log(1) - we start at position 0 with probability 1

    for i in range(L):
        if log_alpha[i] == -math.inf:
            continue

        # Try all tokens starting at position i
        for tid, l in matches[i]:
            j = i + l  # Token takes us to position j
            # Update α[j] by adding this path's probability
            log_alpha[j] = self._log_sum_exp(log_alpha[j], log_alpha[i] + logP[tid])

    logZ = log_alpha[L]  # Total probability of reaching the end

    # Backward pass computes the probability of completing the word from each position:
    # Backward probabilities: β[i] = sum of probabilities of all paths from i to end
    log_beta = [-math.inf] * (L + 1)
    log_beta[L] = 0.0  # log(1) - we end at position L with probability 1

    for i in range(L - 1, -1, -1):
        for tid, l in matches[i]:
            j = i + l
            # Update β[i] by adding probability of going i -> j -> end
            log_beta[i] = self._log_sum_exp(log_beta[i], logP[tid] + log_beta[j])

    # Expected counts combine forward and backward probabilities:
    # Expected count for each token
    expected = defaultdict(float)
    if logZ == -math.inf:
        return expected, -math.inf  # Word cannot be segmented

    for i in range(L):
        for tid, l in matches[i]:
            j = i + l
            # Probability of using this token in a random segmentation:
            # P(path to i) × P(token) × P(path from j to end) / P(word)
            log_contribution = log_alpha[i] + logP[tid] + log_beta[j] - logZ
            contribution = math.exp(log_contribution)

            if contribution > 0.0:
                expected[tid] += contribution

    return expected, logZ
```
What's happening here? For each token at each position, we compute:
- Forward probability: How likely is it to reach this position?
- Token probability: How likely is this token?
- Backward probability: How likely is it to complete the word from here?
Multiplying these together gives the probability of using this token in a random segmentation. Summing across all segmentations gives the expected count.
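The helper _log_sum_exp used in the code above is not shown; a standard numerically stable implementation (my assumption about what the author's helper does, relying on the math module as elsewhere) looks like this:

```python
    def _log_sum_exp(self, a: float, b: float) -> float:
        """Numerically stable log(exp(a) + exp(b)), treating -inf as log(0)."""
        if a == -math.inf:
            return b
        if b == -math.inf:
            return a
        m = max(a, b)
        return m + math.log(math.exp(a - m) + math.exp(b - m))
```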
```
Example walkthrough for "play":

Vocabulary: {p, l, a, y, pl, la, ay, play}
Token probabilities (simplified):
P(p) = 0.05, P(l) = 0.05, P(a) = 0.05, P(y) = 0.05
P(pl) = 0.10, P(la) = 0.08, P(ay) = 0.12
P(play) = 0.50

Possible segmentations:
1. ['play']             → P = 0.50
2. ['pl', 'ay']         → P = 0.10 × 0.12 = 0.012
3. ['p', 'l', 'ay']     → P = 0.05 × 0.05 × 0.12 = 0.0003
4. ['p', 'l', 'a', 'y'] → P = 0.05^4 = 0.00000625
...

Total probability: Z ≈ 0.50 + 0.012 + 0.0003 + ... ≈ 0.5123

Expected count of 'play':
(0.50 / 0.5123) ≈ 0.976 (appears in ~98% of sampled segmentations)

Expected count of 'pl':
(0.012 / 0.5123) ≈ 0.023 (appears in ~2% of sampled segmentations)
```
The forward-backward algorithm computes these efficiently without enumerating all segmentations.
Figure: The forward-backward algorithm computes expected token counts by summing probabilities across all possible segmentation paths.
M-Step: Update Probabilities
Given expected counts across the entire corpus, update token probabilities:
```python
def m_step_update(self, expected_counts, vocab_size):
    total = sum(expected_counts.values()) + 1e-20  # Avoid division by zero

    # Normalize to probabilities
    P = [expected_counts.get(i, 0.0) / total for i in range(vocab_size)]
    logP = [math.log(p + 1e-20) for p in P]  # Log-space for numerical stability

    return P, logP
```
This is standard maximum likelihood estimation: the probability of a token is proportional to how often we expect to see it.
Pruning: Reducing Vocabulary Size
After several EM iterations, we have token probabilities. Now prune the lowest-probability tokens:
```python
def prune_vocabulary(self, expected_counts, id_to_token, target_vocabulary_size):
    """
    Keep top-K tokens by expected count, discard the rest
    """
    # Sort tokens by expected count (descending)
    ranked = sorted(expected_counts.items(), key=lambda x: -x[1])
    top = ranked[:target_vocabulary_size]
    kept_ids = [tid for tid, _ in top]

    # Rebuild vocabulary with only kept tokens
    new_vocabulary = [id_to_token[tid] for tid in kept_ids]
    token_to_id = {t: i for i, t in enumerate(new_vocabulary)}

    return token_to_id, new_vocabulary, kept_ids
```
After pruning, we run EM again with the smaller vocabulary. Repeat until we reach the target size.
Encoding with Viterbi Algorithm
At inference time, we want the single most probable segmentation (not expected counts over all segmentations). This is found using the Viterbi algorithm - dynamic programming to find the highest-probability path:
```python
def encode(self, word):
    L = len(word)
    matches = self.precompute_matches_for_word(word, self.token_to_id, max_token_length=10)
    logP_dict = {tid: math.log(p + 1e-10) for tid, p in enumerate(self.probability)}

    # cost[i] = negative log probability of best path to position i
    cost = [math.inf] * (L + 1)
    previous = [None] * (L + 1)  # Backpointers for reconstruction
    cost[0] = 0.0

    # Forward pass: find minimum cost (maximum probability) path
    for i in range(L):
        if cost[i] == math.inf:
            continue

        for tid, l in matches[i]:
            j = i + l
            new_cost = cost[i] - logP_dict[tid]  # Negative log: minimize cost = maximize prob
            if new_cost < cost[j]:
                cost[j] = new_cost
                previous[j] = (i, tid)  # Remember best token to reach j

    # Backward pass: reconstruct best segmentation
    if previous[L] is None:
        return ["[UNK]"]  # No valid segmentation found

    tokens = []
    position = L
    while position > 0:
        i, tid = previous[position]
        tokens.append(self.vocabulary[tid])
        position = i

    tokens.reverse()
    return tokens
```
Let's take a concrete example to understand this:
```python
tokenizer = UnigramLMTokenizer(target_vocab_size=20)
tokenizer.fit(["playing", "player", "played"], max_token_length=5)

# Final learned vocabulary (sorted by probability):
# play -> 0.35 (high probability - common root)
# ing  -> 0.15 (common suffix)
# ed   -> 0.12 (common suffix)
# er   -> 0.10 (common suffix)
# p    -> 0.05 (fallback character)
# l    -> 0.05
# ...

# Encoding uses Viterbi to find most probable segmentation:
tokenizer.encode("playing")
# ['play', 'ing'] (highest probability path)

tokenizer.encode("player")
# ['play', 'er']

tokenizer.encode("played")
# ['play', 'ed']
```
Why Unigram LM Is Powerful
1. Global optimization: Considers all possible segmentations, not just greedy choices.
2. Probabilistic: Provides uncertainty estimates; it is less confident on rare words.
3. Robust: Automatically handles unknown words by falling back to character-level tokens.
4. Flexible: Can sample different segmentations (useful for data augmentation during training).
Limitations
1. Computationally expensive: EM iterations over all words and all segmentations.
2. Requires large initial vocabulary: Must start with many candidate tokens (memory-intensive).
WordPiece vs Unigram LM
Let's summarize the key differences:
| Aspect | WordPiece | Unigram LM |
|---|---|---|
| Merge style | Bottom-up greedy (like BPE) | Top-down pruning |
| Objective | Local PMI (mutual information) | Global likelihood (EM) |
| Segmentation | Deterministic (longest-prefix matching) | Probabilistic (Viterbi for best, or sampling) |
| Training speed | Fast (greedy merging) | Slower (EM iterations) |
| Vocabulary quality | Good linguistic units | Globally optimal probabilities |
| Implementation | Simpler | More complex (forward-backward, Viterbi) |
| Used in | BERT, DistilBERT, ELECTRA | T5, ALBERT, XLNet (via SentencePiece) |
When to use each:
WordPiece is a good choice when:
- We want better linguistic units than BPE without much added complexity.
- Training speed matters (large corpus, limited compute).
- We are building an English-focused model (where PMI works well).
Unigram LM is preferred when:
- We need the best possible vocabulary quality.
- We are building multilingual models (as in T5, mT5).
- We want probabilistic segmentations for data augmentation.
- We have compute budget for EM training.
In practice, both are significant improvements over pure frequency-based BPE. The choice often comes down to implementation convenience (WordPiece is simpler) versus quality (Unigram LM is theoretically more principled).
Figure: Comprehensive comparison of BPE, WordPiece, and Unigram LM across key dimensions.
Putting It All Together: Tradeoffs and Design Insights
We've seen four different tokenization strategies: character-level, word-level, BPE, and probabilistic methods (WordPiece and Unigram LM). Each makes different tradeoffs between coverage, efficiency, and linguistic quality.
What's Next?
In Part 3, we'll apply these algorithms to real-world problems:
- Why subwords win: cross-linguistic generalization and frequency-based learning
- Practical metrics: vocabulary size, tokens/sentence, compression ratios
- Lessons from building your own tokenizer
- Measuring fairness across languages
- Production deployment and monitoring
Continue to Part 3: Building & Auditing Tokenizers →
Subscribe to my newsletter The Main Thread for more deep dives. Namaste!
References
- Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units (BPE). https://arxiv.org/abs/1508.07909
- Schuster, M., & Nakajima, K. (2012). Japanese and Korean Voice Search (WordPiece). https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf
- Kudo, T. (2018). Subword Regularization: Improving Neural Network Translation Models (Unigram LM). https://arxiv.org/abs/1804.10959
- Kudo, T., & Richardson, J. (2018). SentencePiece: A simple and language independent approach to subword tokenization. https://arxiv.org/abs/1808.06226
- Devlin, J., et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805
- Brown, T., et al. (2020). Language Models are Few-Shot Learners (GPT-3). https://arxiv.org/abs/2005.14165
- OpenAI (2023). GPT-4 Technical Report. https://arxiv.org/abs/2303.08774
- HuggingFace Tokenizers Documentation. https://huggingface.co/docs/tokenizers/
Written by Anirudh Sharma
Published on November 11, 2025