Demystifying Tokenization: Part 2 - Algorithms from BPE to Unigram
Table of Contents
This is Part 2 of a 3-part series on tokenization:
- Part 1: Why Tokenization Matters
- Part 2: Algorithms from BPE to Unigram ← You are here
- Part 3: Building & Auditing Tokenizers
In Part 1, we explored why tokenization matters, the fundamental problems it solves, and how to think about it as lossy compression. In this post, we will dive deeper into different algorithms to strengthen our understanding of tokenization.
There are two broad categories of tokenization algorithms: deterministic and probabilistic.
Deterministic Tokenization Approaches
Deterministic tokenization algorithms produce the same segmentation every time they encounter the same input.
Given a fixed vocabulary and merge rules learned during training, these methods apply a predefined sequence of operations (splitting on characters or whitespace, iteratively applying learned merges) and deterministically transform text into tokens.
There is no randomness, no probability distributions, and no exploration of alternative segmentations. This predictability makes them fast, simple to implement, and easy to debug, which is why they form the foundation of modern tokenization systems. However, their greedy, rule-based nature also means they optimize for local patterns rather than global statistical optimality.
We will start with the simplest approaches and build up to the sophisticated byte-pair encoding methods that power modern LLMs.
Character-level Tokenization
This is the most straightforward approach: we treat each character as a token. For example, the word "cat" is split into ['c', 'a', 't'] and each character is mapped to an integer ID.
The code is simple to implement. We keep two dictionaries, mappings and reverse_mappings, storing {character -> integer} and {integer -> character} respectively.
```python
class CharacterTokenization:
    def __init__(self):
        # Dictionary to hold character to integer mappings
        self.mappings: dict[str, int] = {}
        # Dictionary to hold integer to character mappings
        self.reverse_mappings: dict[int, str] = {}
        # Index to keep track of the next integer to assign
        self.index: int = 0

    def fit(self, corpus: list[str]) -> None:
        # Traverse through each character in the text
        for text in corpus:
            for char in text:
                if char not in self.mappings:
                    self.mappings[char] = self.index
                    self.reverse_mappings[self.index] = char
                    self.index += 1
```
Once training is complete via the fit method, we can encode and decode any word or phrase.
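The encode and decode methods are not shown above; a minimal sketch of what they might look like (here an unseen character simply raises a KeyError, which is an assumption on my part) is:

```python
    def encode(self, text: str) -> list[int]:
        # Look up each character's integer ID (raises KeyError for unseen characters)
        return [self.mappings[char] for char in text]

    def decode(self, ids: list[int]) -> str:
        # Map each ID back to its character and join them into a string
        return "".join(self.reverse_mappings[i] for i in ids)
```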
Let's take an example:
1corpus = ["the cat sat on the mat"]
2tokenizer = CharacterTokenization()
3tokenizer.fit(corpus)
4
5# We will get something like this
6
7# 't' -> 0
8# 'h' -> 1
9# 'e' -> 2
10# ' ' -> 3 (space is a character too!)
11# 'c' -> 4
12# 'a' -> 5
13# 's' -> 6
14# ...
15
16# Encode "cat"
17encoded = tokenizer.encode("cat")
18print(encoded) # [4, 5, 0] (c->4, a->5, t->0)
19
20# Decode back
21decoded = tokenizer.decode(encoded)
22print(decoded) # "cat"1corpus = ["the cat sat on the mat"]
2tokenizer = CharacterTokenization()
3tokenizer.fit(corpus)
4
5# We will get something like this
6
7# 't' -> 0
8# 'h' -> 1
9# 'e' -> 2
10# ' ' -> 3 (space is a character too!)
11# 'c' -> 4
12# 'a' -> 5
13# 's' -> 6
14# ...
15
16# Encode "cat"
17encoded = tokenizer.encode("cat")
18print(encoded) # [4, 5, 0] (c->4, a->5, t->0)
19
20# Decode back
21decoded = tokenizer.decode(encoded)
22print(decoded) # "cat"Since we are processing at the character level, 100% coverage can be done. Every possible string - English, δΈζ, emoji π, typos, even random gibberish can be represented because we are just mapping individual characters.
If someone writes "recieve" (misspelled), it encodes perfectly as ['r', 'e', 'c', 'i', 'e', 'v', 'e']. The model might not know it's a misspelling, but at least it can process it.
But this simplicity comes at a severe computational cost. Consider the sentence "Tokenization is essential". As words, it is 3 tokens; as characters, it is 25 tokens. That's roughly an 8x increase in sequence length. For a model with an attention mechanism (whose cost scales quadratically with sequence length), this explosion is devastating: an 8x longer sequence means roughly 64x more computation in the attention layers.
Real-world impact:
- Longer sequences = more memory needed.
- More computation = slower training and inference.
- Limited context window gets consumed faster.
- Models struggle to capture long-range dependencies (information diffuses across more steps).
This is why character-level tokenization, despite its elegance, is rarely used for large language models. The computational trade-off is simply too expensive.
Word Level Tokenization
As the name suggests, word-level tokenization treats every word as a token. A dictionary of unique words is built by splitting the text on whitespace, and an integer ID is assigned to each token.
The implementation is straightforward: we maintain frequency counts of each word and assign IDs deterministically (sorted by frequency, then alphabetically). The key addition here is the <UNK> token: a special reserved token that represents all unknown words encountered during inference.
When add_unk_token=True (the default), we reserve an ID for <UNK> before building the vocabulary. This allows the model to handle out-of-vocabulary words gracefully by mapping them all to the same token, though at the cost of losing the distinction between different unknown words.
```python
from collections import Counter

class WordLevelEncoding:
    def __init__(self, add_unk_token: bool = True):
        self.mappings: dict[str, int] = {}
        self.reverse_mappings: dict[int, str] = {}
        self.frequencies: Counter[str] = Counter()
        self.index: int = 0

        # Special handling for unknown words
        self.add_unk_token = add_unk_token
        if add_unk_token:
            self.unk_token = "<UNK>"
            self.unk_id = self._reserve_token(self.unk_token)

    def _reserve_token(self, token: str) -> int:
        """Reserve token id if not already present and return its id."""
        if token not in self.mappings:
            token_id = self.index
            self.mappings[token] = token_id
            self.reverse_mappings[token_id] = token
            self.index += 1
            return token_id
        return self.mappings[token]

    def fit(self, corpus: list[str]) -> None:
        """Build vocabulary from corpus"""
        for text in corpus:
            words = text.split()
            self.frequencies.update(words)

        # Assign IDs to words (by frequency, then alphabetically, for determinism)
        for word in sorted(self.frequencies.keys(), key=lambda w: (-self.frequencies[w], w)):
            if word not in self.mappings:
                self.mappings[word] = self.index
                self.reverse_mappings[self.index] = word
                self.index += 1
```
Here is an example run:
```python
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug"
]

tokenizer = WordLevelEncoding()
tokenizer.fit(corpus)

# Vocabulary learned:
# "<UNK>" -> 0 (reserved)
# "the"   -> 1 (most frequent)
# "on"    -> 2
# "sat"   -> 3
# "cat"   -> 4
# ...
```
This is efficient: "Tokenization is essential" becomes just three tokens. But it introduces a critical failure mode.
The OOV (Out of Vocabulary) Problem
What happens when we encounter a word during deployment that never appeared in training?
```python
# Training corpus
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug"
]
tokenizer.fit(corpus)

# Now try encoding a held-out sentence
test = "the platypus sat under the tree"
```
The words "platypus", "under", and "tree" are unknown. They're not in our vocabulary. Now what? There are three policies to handle it.
1. Raise an error:
```python
encoded = tokenizer.encode(test, unknown_policy="raise")
# KeyError: Unknown word during encode: 'platypus'
```
This makes the problem explicit but breaks the system - we cannot process the input at all.
2. Add unknown words dynamically:
```python
encoded = tokenizer.encode(test, unknown_policy="add")
# Dynamically adds 'platypus', 'under', 'tree' to vocabulary
print("New vocabulary size:", tokenizer.vocabulary_size())
# Vocabulary grew from 8 to 11
```
This approach works but has serious problems:
- the vocabulary grows unbounded over time.
- the model never saw these words during training, so their embeddings are random/untrained.
- different deployments will have different vocabularies (non-deterministic).
3. Map to special <UNK> token:
```python
encoded = tokenizer.encode(test, unknown_policy="unk")
# All unknown words become <UNK>
# "the <UNK> sat <UNK> the <UNK>"
```
This keeps vocabulary fixed but loses information. The model cannot distinguish between "platypus", "under" and "tree" - they are all the same token. If the unknown word is critical to understanding the sentence, the model is blind.
The Vocabulary Explosion
Even if we could solve the OOV problem, we face an exploding vocabulary once we account for morphological variants ("walk", "walking"), proper nouns ("Anthropic", "OpenAI", "NVIDIA"), and technical terms ("HTTP", "JSON").
On top of that, if we add other languages, each needs its own dictionary. A multilingual model could need millions of tokens.
If each token has a 1024-dimensional embedding at 4 bytes per float, that's:
- 50k tokens = 200 MB
- 100k tokens = 400 MB
- 1M tokens = 4 GB
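The arithmetic behind those numbers is simply vocabulary size × embedding dimension × bytes per float:

```python
# Embedding table size = vocab_size × embedding_dim × bytes_per_float
for vocab_size in (50_000, 100_000, 1_000_000):
    size_mb = vocab_size * 1024 * 4 / 1e6
    print(f"{vocab_size:>9,} tokens -> {size_mb:,.0f} MB")
# 50,000 tokens -> 205 MB, 100,000 tokens -> 410 MB, 1,000,000 tokens -> 4,096 MB
```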
Note this is just the embedding table, before any other model parameters. And with 100,000+ tokens, each token appears less frequently in training; rare words get poor embeddings because the model sees them infrequently. This sparsity makes learning inefficient.
This is why word-level tokenization, while intuitive, doesn't scale to modern LLMs.
We need something that combines the efficiency of word-level with the coverage of character-level.
Subword Tokenization: Byte Pair Encoding (BPE)
This is the breakthrough that makes modern tokenization work. Instead of choosing characters or words, we let the data tell us which fragments are meaningful by iteratively merging the most frequent pairs of symbols.
Start with a simple corpus: ["low", "lower", "newest", "newest", "newest", "widest"]
Step 0: Initialize with characters
Break each word into characters plus an end-of-word marker </w>:
Note on </w>: This is a training-time delimiter that helps BPE distinguish word boundaries (e.g., "low" as a complete word vs. "low" as a prefix of "lower"). In the examples below we show </w> for clarity during training, but when counting tokens we typically don't count it separately; it gets absorbed into the preceding symbol during the merge process.
low → ['l', 'o', 'w', '</w>']
lower → ['l', 'o', 'w', 'e', 'r', '</w>']
newest → ['n', 'e', 'w', 'e', 's', 't', '</w>'] (appears 3 times)
widest → ['w', 'i', 'd', 'e', 's', 't', '</w>']
Step 1: Count pair frequencies
Look at all adjacent symbol pairs across the corpus:
('e', 's'): 4 times (3 from "newest", 1 from "widest")
('s', 't'): 4 times (3 from "newest", 1 from "widest")
('e', 'w'): 3 times (from "newest" × 3)
('l', 'o'): 2 times (from "low" and "lower")
('o', 'w'): 2 times (from "low" and "lower")
...
Step 2: Merge most frequent pair
The pairs ('e', 's') and ('s', 't') each appear 4 times. Suppose we merge ('e', 's') first, into a single symbol 'es':
newest → ['n', 'e', 'w', 'es', 't', '</w>']
widest → ['w', 'i', 'd', 'es', 't', '</w>']
Step 3: Repeat
Count pairs again. Now ('es', 't') is frequent. Merge it:
newest → ['n', 'e', 'w', 'est', '</w>']
widest → ['w', 'i', 'd', 'est', '</w>']
Continue this process. Over many iterations, common patterns emerge as tokens:
- 'est' becomes a token (suffix for superlatives)
- 'low' becomes a token (common word)
- 'er' becomes a token (suffix for comparatives)
After 10 merges, encoding "lowest" might give ['low', 'est', '</w>']: just 2 tokens (plus the </w> marker) for a 6-character word.
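A few lines of Python reproduce the pair counting from Step 1 on this toy corpus (a standalone illustration, not the class implementation shown later):

```python
from collections import Counter

# Toy corpus as {word-as-symbol-tuple: frequency}, matching Step 0 above
corpus = {
    ("l", "o", "w", "</w>"): 1,
    ("l", "o", "w", "e", "r", "</w>"): 1,
    ("n", "e", "w", "e", "s", "t", "</w>"): 3,
    ("w", "i", "d", "e", "s", "t", "</w>"): 1,
}

pair_counts = Counter()
for symbols, freq in corpus.items():
    for a, b in zip(symbols, symbols[1:]):
        pair_counts[(a, b)] += freq  # Weight each pair by the word's frequency

print(pair_counts[("e", "s")], pair_counts[("s", "t")], pair_counts[("l", "o")])  # 4 4 2
```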
Figure: BPE iteratively merges the most frequent pairs, gradually building up a vocabulary of meaningful subword units.
Implementation Details
Complexity: A naive BPE implementation rescans the corpus for every merge, so training costs roughly O(num_merges × total_symbols). With priority queues or other incremental data structures this can be optimized substantially, but the basic approach repeats a full find-and-replace pass per merge.
The core implementation of the BPE algorithm is below:
```python
def train(self, corpus, num_merges):
    """Learn BPE merges from corpus"""
    for i in range(num_merges):
        # Count all adjacent pairs
        pair_frequencies = self.get_pair_frequencies(corpus)
        if not pair_frequencies:
            break  # No more pairs to merge

        # Find most frequent pair
        most_frequent_pair = max(pair_frequencies, key=lambda p: (pair_frequencies[p], p))

        # Merge this pair throughout the corpus
        corpus = self.merge_most_frequent_pair(corpus, most_frequent_pair)
        # Record the merge for encoding later
        self.merges.append(most_frequent_pair)

    return corpus

def get_pair_frequencies(self, corpus):
    """Count adjacent symbol pairs across corpus"""
    pair_frequencies = Counter()
    for word_symbols, frequency in corpus.items():
        if len(word_symbols) < 2:
            continue

        # Sliding window over symbols
        for i in range(len(word_symbols) - 1):
            pair = (word_symbols[i], word_symbols[i + 1])
            pair_frequencies[pair] += frequency  # Weighted by word frequency

    return dict(pair_frequencies)

def merge_most_frequent_pair(self, corpus, target_pair):
    updated_corpus: dict[tuple[str, ...], int] = {}
    for word_symbols, count in corpus.items():
        symbols = list(word_symbols)
        i = 0
        while i < len(symbols) - 1:
            # Join the symbols to form the pair
            pair = (symbols[i], symbols[i + 1])
            if pair == target_pair:
                merged_symbol = "".join(pair)
                symbols[i:i+2] = [merged_symbol]  # Replace two elements with one merged element
                i += 1
                continue
            else:
                i += 1
        # Add new tuple to the updated corpus
        updated_corpus[tuple(symbols)] = count

    return updated_corpus
```
Key insight: Pair frequencies are weighted by word frequency. If "newest" appears 1,000 times in the corpus, the pair ('e', 's') picks up 1,000 counts from that word alone. This ensures we merge patterns that are genuinely common in the data, not just patterns that happen to appear in many unique words.
Encoding New Words
Once training is complete, we have an ordered list of learned merges. To encode a new word, apply these merges in the same order.
```python
def encode(self, word: str) -> list[str]:
    """Encode word using learned merges"""
    symbols = list(word) + ["</w>"]

    # Apply each learned merge in sequence
    for pair in self.merges:
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == pair:
                merged = "".join(pair)
                symbols[i:i+2] = [merged]  # Replace pair with merged symbol
                # Don't advance i; re-check at same position after merge
            else:
                i += 1

    return symbols
```
To take an example:
```python
bpe = BytePairEncoding()
# fit_corpus prepares the corpus and returns processed word-frequency pairs
processed_corpus = bpe.fit_corpus(["low", "lower", "newest", "newest", "newest", "widest"])
# train learns merges from the processed corpus
bpe.train(processed_corpus, num_merges=10)

# Encode a word seen during training
encoded = bpe.encode("lowest")
print(encoded)  # ['low', 'est']

# Encode a completely new word
encoded = bpe.encode("slowest")
print(encoded)  # ['s', 'low', 'est']
# Note: 's' stays separate, but 'low' and 'est' are recognized
```
This is the magic of BPE: new words are decomposed using learned patterns. Even though "slowest" never appeared in training, BPE recognizes "low" and "est" as meaningful fragments and encodes them as single tokens. Only the 's' prefix remains as a character-level token.
Why BPE Works
BPE is greedy statistical compression. By merging frequent pairs, we are building a vocabulary that efficiently encodes the training distribution.
Common words like "the" quickly become single tokens (after merging 't'+'h', then 'th'+'e'). Rare words like "platypus" get fragmented, but the fragments themselves are learned from common patterns elsewhere in the corpus.
The trade-off in action:
| Word | Frequency | Typical BPE Encoding | Token Count |
|---|---|---|---|
| "the" | Very high | ['the'] | 1 token |
| "playing" | High | ['play', 'ing'] | 2 tokens |
| "platypus" | Rare | ['plat', 'y', 'pus'] | 3 tokens |
| "quokka" | Never seen | ['qu', 'ok', 'ka'] | 3 tokens |
Frequent words compress well. Rare words are longer but still representable. This is optimal from an information theory perspective - we are using shorter codes for common patterns and longer codes for rare patterns, exactly what Shannon's theory prescribes.
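To make the Shannon intuition concrete: the optimal code length for a symbol with probability p is -log2(p) bits, so frequent tokens deserve short codes and rare ones long codes. The probabilities below are made up purely for illustration:

```python
import math

for token, p in [("the", 0.05), ("play", 0.001), ("platypus", 0.000001)]:
    print(f"{token}: -log2({p}) ≈ {-math.log2(p):.1f} bits")
# the ≈ 4.3 bits, play ≈ 10.0 bits, platypus ≈ 19.9 bits
```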
Limitations of Character Level BPE
BPE as described above operates on characters. This works great for English and similar languages, but it has a subtle problem: different character encodings.
Consider these strings:
"cafΓ©"(Γ©as a single Unicode character: U+00E9)"cafΓ©"(e+ combining accent: U+0065 + U+0301)
They look identical but have different character sequences! A character-level BPE would treat them differently, learning separate tokens. This inconsistency wastes vocabulary and confuses the model.
Additionally, character-level BPE struggles with non-Latin scripts. Chinese, Arabic, and emoji have tens of thousands of unique characters. Starting with all of them as base tokens means our initial vocabulary is already huge before any merging.
This leads us to the final evolution: byte-level BPE.
Byte Level BPE: Universal Coverage
We know that every character in any language is ultimately represented as a sequence of bytes. So why not operate on bytes (the UTF-8 encoding) instead of characters? Since UTF-8 uses only 256 possible byte values (0-255), we can start with a base vocabulary of just 256 tokens.
A few examples of UTF-8 encoding (common characters use 1 byte, less common ones use 2-4 bytes):
'a' → [97] (1 byte: ASCII)
'é' → [195, 169] (2 bytes: Latin extended)
'世' → [228, 184, 150] (3 bytes: Chinese)
'🌍' → [240, 159, 140, 141] (4 bytes: emoji)
No matter what the input is - English, Hindi, emoji, code, even binary data - it becomes a sequence of bytes in the range 0-255.
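A quick sanity check in Python shows that arbitrary text, regardless of script, always reduces to byte values between 0 and 255:

```python
for text in ["hello", "नमस्ते", "🌍", "print('hi')"]:
    byte_values = list(text.encode("utf-8"))
    assert all(0 <= b <= 255 for b in byte_values)
    print(f"{text!r} -> {byte_values}")
```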
Figure: Byte-level BPE operates on UTF-8 bytes, providing 100% coverage while learning common byte sequences as tokens.
Implementation Details
The algorithm is nearly identical to character level BPE, but operates on bytes:
```python
def fit_corpus(self, corpus: list[str]):
    """Convert words to byte sequences"""
    word_frequencies = Counter(corpus)
    processed = {}
    for word, frequency in word_frequencies.items():
        # Convert to bytes
        word_bytes = word.encode("utf-8")
        symbols = list(word_bytes) + ["</w>"]  # Bytes + end marker
        processed[tuple(symbols)] = frequency
    return processed
```
Let's take an example to see this in action:
1word = "cafΓ©"
2word_bytes = word.encode("utf-8")
3# [99, 97, 102, 195, 169] (c, a, f, Γ© as 2 bytes)
4
5symbols = list(word_bytes) + ["</w>"]
6# [99, 97, 102, 195, 169, "</w>"]1word = "cafΓ©"
2word_bytes = word.encode("utf-8")
3# [99, 97, 102, 195, 169] (c, a, f, Γ© as 2 bytes)
4
5symbols = list(word_bytes) + ["</w>"]
6# [99, 97, 102, 195, 169, "</w>"]Important nuance about Unicode normalization: UTF-8 encoding does not automatically normalize text. The word "cafΓ©" can be represented in Unicode using two different normalization forms:
- NFC (Canonical Composition): é as a single codepoint (U+00E9) → UTF-8 bytes: [99, 97, 102, 195, 169]
- NFD (Canonical Decomposition): e + combining acute accent (U+0065 + U+0301) → UTF-8 bytes: [99, 97, 102, 101, 204, 129]
These produce different byte sequences even though they display identically. Byte-level BPE treats them as different inputs.
This has implications:
Advantage: Byte-level BPE doesn't make normalization assumptions β it handles whatever UTF-8 bytes it receives. If your training data contains both NFC and NFD text, both get represented (though possibly inefficiently).
Disadvantage: If the same logical word appears in multiple normalization forms, the tokenizer may learn separate tokens for each, wasting vocabulary space and creating inconsistency.
In practice: Most modern text processing pipelines normalize to a single form (typically NFC or NFKC) before tokenization to avoid this issue; SentencePiece, for example, applies NFKC normalization by default before its subword model runs. The key insight is that byte-level operation makes tokenization robust to any input, but normalization is still a separate decision in the preprocessing pipeline.
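A quick check with Python's standard unicodedata module makes the NFC/NFD difference concrete:

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "cafe\u0301")  # compose e + accent into é (U+00E9)
nfd = unicodedata.normalize("NFD", "caf\u00e9")   # decompose é into e + combining accent

print(nfc == nfd)                 # False: different codepoint sequences
print(list(nfc.encode("utf-8")))  # [99, 97, 102, 195, 169]
print(list(nfd.encode("utf-8")))  # [99, 97, 102, 101, 204, 129]
```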
Training and Merging
The merging logic is the same as before, but now we merge byte values instead of characters:
```python
def train(self, corpus, num_merges):
    """Learn BPE merges on bytes"""
    for i in range(num_merges):
        pair_frequencies = self.get_pair_frequencies(corpus)
        if not pair_frequencies:
            break

        # Most frequent byte pair
        most_frequent_pair = max(pair_frequencies, key=pair_frequencies.get)
        print(f"Merge {i+1}: {most_frequent_pair}")

        corpus = self.merge_most_frequent_pair(corpus, most_frequent_pair)
        self.merges.append(most_frequent_pair)

    return corpus
```
After training, common byte sequences become tokens. For English text, we will see merges like:
- (116, 104) → "th" (bytes for 't' and 'h')
- (101, 114) → "er" (bytes for 'e' and 'r')
- (105, 110, 103) → "ing" (bytes for 'i', 'n', 'g' after successive merges)
The same merging behavior applies to Chinese, Arabic, and emoji bytes as well.
The algorithm learns whatever is statistically common in the training data, regardless of language.
Decoding Bytes Back to Text
After encoding, we have a sequence of tokens (single bytes or merged groups of bytes). To decode:
```python
def decode(self, tokens: list) -> str:
    """Decode tokens back to text"""
    byte_sequence = []
    for token in tokens:
        if token == "</w>":
            continue
        byte_sequence.extend(self._flatten(token))  # Flatten nested merges

    return bytes(byte_sequence).decode("utf-8", errors="replace")

def _flatten(self, symbol):
    """Recursively flatten nested tuples of bytes"""
    if isinstance(symbol, int):
        return [symbol]  # Base case: single byte
    elif isinstance(symbol, tuple):
        result = []
        for s in symbol:
            result.extend(self._flatten(s))
        return result
    else:
        return []
```
The _flatten() helper recursively unpacks merged tokens back to individual bytes. This is necessary because merged tokens are stored as nested tuples: after merging (116, 104) into "th", then merging "th" with 101 into "the", we get ((116, 104), 101). Flattening gives us back [116, 104, 101], which decodes as "the".
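To see the flattening in isolation, here is a standalone version of the same recursion (a hypothetical free function, separate from the class method above) applied to one nested merge:

```python
def flatten(symbol):
    """Recursively unpack nested tuples of byte values into a flat list."""
    if isinstance(symbol, int):
        return [symbol]
    if isinstance(symbol, tuple):
        out = []
        for s in symbol:
            out.extend(flatten(s))
        return out
    return []

token = ((116, 104), 101)  # "th" merged with byte 101 ('e')
print(flatten(token))                         # [116, 104, 101]
print(bytes(flatten(token)).decode("utf-8"))  # "the"
```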
Why Byte-Level BPE Is the Gold Standard
GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020), GPT-4 (OpenAI, 2023), and many modern models use byte-level BPE because it solves every coverage problem:
- 100% coverage: Any UTF-8 text can be encoded (in the worst case, it falls back to individual bytes)
- Language-agnostic: A single tokenizer handles English, Chinese, Arabic, code, and emoji - everything
- Works on raw UTF-8 bytes: Ensures coverage of any input. Normalization (e.g., NFC) is still a separate, recommended preprocessing step for consistency.
- Compact vocabulary: Starts with 256 base tokens, grows to ~50K through merging
- Robust to typos: Misspelled words decompose into bytes; no special handling needed
The trade-off: byte-level tokenization is slightly less efficient than character-level tokenization for languages with simple scripts (like English). The word "the" might be 1 token with character-level BPE, but requires merging 3 bytes [116, 104, 101] with byte-level BPE. In practice, after sufficient merge iterations this difference is minimal; both converge to similar vocabulary structures.
What We Have Learned
These are all deterministic approaches: given the same training corpus and merge count, BPE produces the same vocabulary every time. But we can do better. What if, instead of greedily merging the most frequent pair, we used probabilistic reasoning to find optimal segmentations?
That's where probabilistic tokenization comes in. We will explore WordPiece and Unigram Language Models next: algorithms that optimize for likelihood rather than raw frequency, and that produce better vocabularies for modern LLMs.
Probabilistic Tokenization Approaches
BPE is elegant and simple: it merges the most frequent pairs until we reach our target vocabulary size. But "most frequent" has a subtle flaw - it optimizes for how often symbols appear together, not for how meaningful those combinations are.
Consider two pairs in a corpus:
- ('e', 's') appears 1,000 times (mostly from common words like "best", "rest", "test").
- ('q', 'u') appears 100 times ('q' almost never appears without 'u' in English).
BPE merges ('e', 's') first because it is more frequent. But linguistically, "qu" is a much stronger unit: 'q' and 'u' are statistically dependent, while 'e' and 's' often appear separately.
Mental model: I think of it in terms of conditional probability (Bayes' theorem): seeing 'q' makes 'u' almost certain, while seeing 'e' tells us little about whether 's' comes next.
This is where probabilistic tokenization comes in. Instead of counting raw frequencies, these methods use likelihood and statistical independence to decide what to merge or keep.
Let's explore the two most widely used algorithms in this category: WordPiece and Unigram LM.
WordPiece
This was developed by Google and is used in BERT. It asks a better question than BPE: "Do these symbols carry more information together than apart?"
The intuition is simple. In English, the probability of seeing the letter u after the letter q is nearly 100%. The two letters are statistically dependent - seeing q tells us a lot about what comes next. By contrast, the probability of seeing s after e is much lower; these letters are more independent.
WordPiece's insight: Merge symbols that are dependent, not just frequent. This captures meaningful linguistic units.
The Math: Pointwise Mutual Information (PMI)
Note on PMI as a teaching proxy: We use PMI here as an intuitive explanation. In practice, production WordPiece implementations (as used in BERT) greedily add tokens that maximize the likelihood gain under a simple language model objective. PMI captures the core intuition of merging statistically dependent symbols, but the actual algorithm optimizes corpus likelihood. For the full details, see Schuster & Nakajima (2012).
WordPiece uses a score based on PMI, which measures how much more likely two symbols are to appear together than they would be independently:

PMI(x, y) = log( P(xy) / (P(x) × P(y)) ) = log P(xy) - log P(x) - log P(y)

- If x and y are independent: P(xy) = P(x) × P(y), so PMI(x, y) = 0.
- If x and y are dependent: P(xy) > P(x) × P(y), so PMI(x, y) > 0.
- The higher the PMI, the stronger the association.

WordPiece's scoring function weights PMI by frequency:

score(x, y) = f(xy) × PMI(x, y) = f(xy) × ( log P(xy) - log P(x) - log P(y) )

Where:
- f(xy) = frequency of the pair in the corpus.
- P(xy) = probability of seeing the merged token.
- P(x), P(y) = probabilities of seeing each symbol individually.
Why multiply by frequency? We still want to prioritize common patterns (like BPE), but among patterns with similar frequency, we prefer those with high mutual information.
Example
1Example: "play" + "ing" vs "pl" + "ay"
2
3Let's compare two potential merges in a corpus with these counts:
4
5Pair frequencies:
6('play', 'ing'): 500 times
7('pl', 'ay'): 600 times
8
9Individual token frequencies:
10'play': 800 times (also appears in "play", "playful", "player")
11'ing': 2000 times (appears in "running", "walking", "talking"...)
12'pl': 650 times (mostly from "play" words, but also "plus", "place")
13'ay': 620 times (mostly from "play" words, but also "day", "say", "may")
14
15BPE would choose ('pl', 'ay') because it appears 600 times vs 500 for ('play', 'ing').
16
17WordPiece computes PMI:
18
19For ('play', 'ing'):
20P(play) = 800 / total β 0.016
21P(ing) = 2000 / total β 0.040
22P(playΒ·ing) = 500 / total β 0.010
23
24PMI = log(0.010) - log(0.016) - log(0.040)
25 = log(0.010 / (0.016 Γ 0.040))
26 = log(15.625) β 2.75
27
28score = 500 Γ 2.75 = 1,375
29
30For ('pl', 'ay'):
31P(pl) = 650 / total β 0.013
32P(ay) = 620 / total β 0.012
33P(plΒ·ay) = 600 / total β 0.012
34
35PMI = log(0.012) - log(0.013) - log(0.012)
36 = log(0.012 / (0.013 Γ 0.012))
37 = log(7.69) β 2.04
38
39score = 600 Γ 2.04 = 1,2241Example: "play" + "ing" vs "pl" + "ay"
2
3Let's compare two potential merges in a corpus with these counts:
4
5Pair frequencies:
6('play', 'ing'): 500 times
7('pl', 'ay'): 600 times
8
9Individual token frequencies:
10'play': 800 times (also appears in "play", "playful", "player")
11'ing': 2000 times (appears in "running", "walking", "talking"...)
12'pl': 650 times (mostly from "play" words, but also "plus", "place")
13'ay': 620 times (mostly from "play" words, but also "day", "say", "may")
14
15BPE would choose ('pl', 'ay') because it appears 600 times vs 500 for ('play', 'ing').
16
17WordPiece computes PMI:
18
19For ('play', 'ing'):
20P(play) = 800 / total β 0.016
21P(ing) = 2000 / total β 0.040
22P(playΒ·ing) = 500 / total β 0.010
23
24PMI = log(0.010) - log(0.016) - log(0.040)
25 = log(0.010 / (0.016 Γ 0.040))
26 = log(15.625) β 2.75
27
28score = 500 Γ 2.75 = 1,375
29
30For ('pl', 'ay'):
31P(pl) = 650 / total β 0.013
32P(ay) = 620 / total β 0.012
33P(plΒ·ay) = 600 / total β 0.012
34
35PMI = log(0.012) - log(0.013) - log(0.012)
36 = log(0.012 / (0.013 Γ 0.012))
37 = log(7.69) β 2.04
38
39score = 600 Γ 2.04 = 1,224WordPiece chooses ('play', 'ing') despite lower frequency, because the PMI is higher - "play" and "ing" are more strongly associated than "pl" and "ay".
Implementation Details
Complexity: WordPiece has similar training complexity to BPE, roughly O(num_merges × total_symbols). The additional PMI computation is O(number of candidate pairs) per iteration, which is typically much smaller than the corpus size. The main difference from BPE is the scoring function, not the algorithmic complexity.
Here's the core of WordPiece implementation:
```python
def _get_best_pair(self, token_frequencies: dict[str, int],
                   pair_frequencies: dict[tuple[str, str], int]) -> tuple[str, str]:
    """Find pair with highest PMI-weighted score"""
    scores: dict[tuple, float] = {}

    for (x, y), f_xy in pair_frequencies.items():
        f_x = token_frequencies.get(x, 1)
        f_y = token_frequencies.get(y, 1)

        # Compute PMI score
        epsilon = 1e-10  # Avoid log(0)
        score = f_xy * (
            math.log(f_xy + epsilon) - math.log(f_x + epsilon) - math.log(f_y + epsilon)
        )
        scores[(x, y)] = score

    # Return pair with maximum score
    best_pair = max(scores, key=scores.get)
    return best_pair
```
Key implementation notes:
- Adding a small epsilon (1e-10) prevents log(0) errors when frequencies are zero.
- We work in log space for numerical stability.
- Like BPE, we still merge one pair at a time β the difference is which pair we choose.
The rest of the algorithm is identical to BPE:
```python
def fit(self, corpus: list[str]) -> None:
    """Train WordPiece tokenizer"""
    # Start with character-level tokenization
    tokenized: list[list[str]] = [list(word) + ["</w>"] for word in corpus]
    self.vocab = set(ch for word in tokenized for ch in word)

    while len(self.vocab) < self.target_vocab_size:
        # Count token and pair frequencies
        token_frequencies = self._get_token_frequencies(tokenized)
        pair_frequencies = self._get_pair_frequencies(tokenized)

        if not pair_frequencies:
            break

        # Find best pair using PMI (not just frequency!)
        best_pair = self._get_best_pair(token_frequencies, pair_frequencies)
        # Merge the best pair
        tokenized = self._merge_pair(tokenized, best_pair)
        self.vocab.add("".join(best_pair))
```
At encoding time, WordPiece uses greedy longest-prefix matching: for each position in the word, try the longest possible substring that is in the vocabulary first. This ensures we use the most "complete" tokens available.
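The encoding step itself is not part of the training snippet. Here is a minimal sketch of greedy longest-prefix matching over a vocabulary set (a hypothetical standalone helper; real WordPiece implementations additionally mark continuation pieces with a "##" prefix, omitted here for simplicity):

```python
def wordpiece_encode(word: str, vocab: set[str], unk_token: str = "[UNK]") -> list[str]:
    """Greedily take the longest vocabulary match at each position."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate substring until it matches the vocabulary
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return [unk_token]  # No prefix matches at all: give up on the word
        tokens.append(word[start:end])
        start = end
    return tokens

print(wordpiece_encode("playing", {"play", "ing", "p", "l", "a", "y"}))  # ['play', 'ing']
```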
Strengths and Limitations
- Better linguistic units: PMI captures meaningful combinations (morphemes, common phrases).
- Less sensitive to frequency spikes: A rare but highly associated pair can be preferred over a frequent but weakly associated pair.
- Still greedy: Like BPE, WordPiece merges one pair at a time without considering global optimization.
- Deterministic: Given the same corpus, produces the same vocabulary - no exploration of alternative segmentations.
This last limitation is what Unigram Language Models address.
Unigram Language Model
WordPiece improves on BPE by using mutual information, but it is still a bottom-up, greedy algorithm: start with characters, iteratively merge pairs. The Unigram approach (used in Google's SentencePiece and models like T5) flips this on its head.
Here, we start with a large vocabulary (all possible subwords up to some length), then iteratively prune low-probability tokens until we reach the target size.
Core Idea: Multiple Segmentations
Unlike BPE and WordPiece, which produce a single deterministic segmentation for each word, Unigram considers all possible segmentations and treats tokenization as a probabilistic model.
Given a word like "playing", there are many ways to segment it:
- ['playing'] (if 'playing' is in the vocabulary)
- ['play', 'ing']
- ['pla', 'ying']
- ['pl', 'ay', 'ing']
- ['p', 'l', 'a', 'y', 'i', 'n', 'g'] (character-level fallback)
- ...
Each segmentation has a probability equal to the product of the individual token probabilities:

P(x1, x2, ..., xn) = P(x1) × P(x2) × ... × P(xn)

For example:

P(['play', 'ing']) = P('play') × P('ing')
The goal is to learn token probabilities that maximize the likelihood of the training corpus under the most probable segmentations.
The EM Algorithm: Expectation-Maximization
This is a chicken and egg problem:
- To find the best segmentation, we need to know token probabilities.
- To learn token probabilities, we need to know which segmentations to count.
The solution is the EM (Expectation-Maximization) algorithm, which iteratively refines both:
- E-Step (Expectation): Given current token probabilities, compute the expected count of each token across all possible segmentations of all words in the corpus.
- M-Step (Maximization): Given the expected counts, update the token probabilities.
Repeat until convergence, then prune low-probability tokens.
E-Step: Forward-Backward Algorithm
Complexity: The forward-backward algorithm for a single word runs in O(L × m), where L is the word length and m is the maximum token length (so at most m candidate tokens start at each position).
For the entire corpus over k EM iterations, the total cost is roughly O(k × N × m), where N is the total number of characters in the corpus. This is significantly more expensive than BPE/WordPiece, but produces globally optimal probabilities. Memory requirements are also higher due to storing the large candidate vocabulary.
The E-step uses dynamic programming to efficiently compute expected counts without enumerating all segmentations explicitly.
Forward pass computes the probability of reaching each position in the word:
```python
def forward_backward_expected_counts(self, matches, logP):
    L = len(matches)

    # Forward probabilities: α[i] = sum of probabilities of all paths to position i
    log_alpha = [-math.inf] * (L + 1)
    log_alpha[0] = 0.0  # log(1) - we start at position 0 with probability 1

    for i in range(L):
        if log_alpha[i] == -math.inf:
            continue

        # Try all tokens starting at position i
        for tid, l in matches[i]:
            j = i + l  # Token takes us to position j
            # Update α[j] by adding this path's probability
            log_alpha[j] = self._log_sum_exp(log_alpha[j], log_alpha[i] + logP[tid])

    logZ = log_alpha[L]  # Total probability of reaching the end

    # Backward pass computes the probability of completing the word from each position:
    # Backward probabilities: β[i] = sum of probabilities of all paths from i to end
    log_beta = [-math.inf] * (L + 1)
    log_beta[L] = 0.0  # log(1) - we end at position L with probability 1

    for i in range(L - 1, -1, -1):
        for tid, l in matches[i]:
            j = i + l
            # Update β[i] by adding probability of going i -> j -> end
            log_beta[i] = self._log_sum_exp(log_beta[i], logP[tid] + log_beta[j])

    # Expected counts combine forward and backward probabilities:
    # Expected count for each token
    expected = defaultdict(float)
    if logZ == -math.inf:
        return expected, -math.inf  # Word cannot be segmented

    for i in range(L):
        for tid, l in matches[i]:
            j = i + l
            # Probability of using this token in a random segmentation:
            # P(path to i) × P(token) × P(path from j to end) / P(word)
            log_contribution = log_alpha[i] + logP[tid] + log_beta[j] - logZ
            contribution = math.exp(log_contribution)

            if contribution > 0.0:
                expected[tid] += contribution

    return expected, logZ
```
What's happening here? For each token at each position, we compute:
- Forward probability: How likely is it to reach this position?
- Token probability: How likely is this token?
- Backward probability: How likely is it to complete the word from here?
Multiplying these together gives the probability of using this token in a random segmentation. Summing across all segmentations gives the expected count.
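The helper _log_sum_exp used in the code above is not shown; a standard numerically stable implementation (my assumption about what the author's helper does, relying on the math module as elsewhere) looks like this:

```python
    def _log_sum_exp(self, a: float, b: float) -> float:
        """Numerically stable log(exp(a) + exp(b)), treating -inf as log(0)."""
        if a == -math.inf:
            return b
        if b == -math.inf:
            return a
        m = max(a, b)
        return m + math.log(math.exp(a - m) + math.exp(b - m))
```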
```
Example walkthrough for "play":

Vocabulary: {p, l, a, y, pl, la, ay, play}
Token probabilities (simplified):
P(p) = 0.05, P(l) = 0.05, P(a) = 0.05, P(y) = 0.05
P(pl) = 0.10, P(la) = 0.08, P(ay) = 0.12
P(play) = 0.50

Possible segmentations:
1. ['play']             → P = 0.50
2. ['pl', 'ay']         → P = 0.10 × 0.12 = 0.012
3. ['p', 'l', 'ay']     → P = 0.05 × 0.05 × 0.12 = 0.0003
4. ['p', 'l', 'a', 'y'] → P = 0.05^4 = 0.00000625
...

Total probability: Z ≈ 0.50 + 0.012 + 0.0003 + ... ≈ 0.5123

Expected count of 'play':
(0.50 / 0.5123) ≈ 0.976 (appears in ~98% of sampled segmentations)

Expected count of 'pl':
(0.012 / 0.5123) ≈ 0.023 (appears in ~2% of sampled segmentations)
```
The forward-backward algorithm computes these efficiently without enumerating all segmentations.
Figure: The forward-backward algorithm computes expected token counts by summing probabilities across all possible segmentation paths.
M-Step: Update Probabilities
Given expected counts across the entire corpus, update token probabilities:
```python
def m_step_update(self, expected_counts, vocab_size):
    total = sum(expected_counts.values()) + 1e-20  # Avoid division by zero

    # Normalize to probabilities
    P = [expected_counts.get(i, 0.0) / total for i in range(vocab_size)]
    logP = [math.log(p + 1e-20) for p in P]  # Log-space for numerical stability

    return P, logP
```
This is standard maximum likelihood estimation: the probability of a token is proportional to how often we expect to see it.
Pruning: Reducing Vocabulary Size
After several EM iterations, we have token probabilities. Now prune the lowest-probability tokens:
```python
def prune_vocabulary(self, expected_counts, id_to_token, target_vocabulary_size):
    """
    Keep top-K tokens by expected count, discard the rest
    """
    # Sort tokens by expected count (descending)
    ranked = sorted(expected_counts.items(), key=lambda x: -x[1])
    top = ranked[:target_vocabulary_size]
    kept_ids = [tid for tid, _ in top]

    # Rebuild vocabulary with only kept tokens
    new_vocabulary = [id_to_token[tid] for tid in kept_ids]
    token_to_id = {t: i for i, t in enumerate(new_vocabulary)}

    return token_to_id, new_vocabulary, kept_ids
```
After pruning, we run EM again with the smaller vocabulary. Repeat until we reach the target size.
Encoding with Viterbi Algorithm
At inference time, we want the single most probable segmentation (not expected counts over all segmentations). This is found using the Viterbi algorithm - dynamic programming to find the highest-probability path:
```python
def encode(self, word):
    L = len(word)
    matches = self.precompute_matches_for_word(word, self.token_to_id, max_token_length=10)
    logP_dict = {tid: math.log(p + 1e-10) for tid, p in enumerate(self.probability)}

    # cost[i] = negative log probability of best path to position i
    cost = [math.inf] * (L + 1)
    previous = [None] * (L + 1)  # Backpointers for reconstruction
    cost[0] = 0.0

    # Forward pass: find minimum cost (maximum probability) path
    for i in range(L):
        if cost[i] == math.inf:
            continue

        for tid, l in matches[i]:
            j = i + l
            new_cost = cost[i] - logP_dict[tid]  # Negative log: minimize cost = maximize prob
            if new_cost < cost[j]:
                cost[j] = new_cost
                previous[j] = (i, tid)  # Remember best token to reach j

    # Backward pass: reconstruct best segmentation
    if previous[L] is None:
        return ["[UNK]"]  # No valid segmentation found

    tokens = []
    position = L
    while position > 0:
        i, tid = previous[position]
        tokens.append(self.vocabulary[tid])
        position = i

    tokens.reverse()
    return tokens
```
Let's take a concrete example to understand this:
```python
tokenizer = UnigramLMTokenizer(target_vocab_size=20)
tokenizer.fit(["playing", "player", "played"], max_token_length=5)

# Final learned vocabulary (sorted by probability):
# play -> 0.35 (high probability - common root)
# ing  -> 0.15 (common suffix)
# ed   -> 0.12 (common suffix)
# er   -> 0.10 (common suffix)
# p    -> 0.05 (fallback character)
# l    -> 0.05
# ...

# Encoding uses Viterbi to find most probable segmentation:
tokenizer.encode("playing")
# ['play', 'ing'] (highest probability path)

tokenizer.encode("player")
# ['play', 'er']

tokenizer.encode("played")
# ['play', 'ed']
```
Why Unigram LM Is Powerful
1. Global optimization: Considers all possible segmentations, not just greedy choices.
2. Probabilistic: Provides uncertainty estimates; it is less confident on rare words.
3. Robust: Automatically handles unknown words by falling back to character-level tokens.
4. Flexible: Can sample different segmentations (useful for data augmentation during training).
Limitations
1. Computationally expensive: EM iterations over all words and all segmentations.
2. Requires large initial vocabulary: Must start with many candidate tokens (memory-intensive).
WordPiece vs Unigram LM
Let's summarize the key differences:
| Aspect | WordPiece | Unigram LM |
|---|---|---|
| Merge style | Bottom-up greedy (like BPE) | Top-down pruning |
| Objective | Local PMI (mutual information) | Global likelihood (EM) |
| Segmentation | Deterministic (longest-prefix matching) | Probabilistic (Viterbi for best, or sampling) |
| Training speed | Fast (greedy merging) | Slower (EM iterations) |
| Vocabulary quality | Good linguistic units | Globally optimal probabilities |
| Implementation | Simpler | More complex (forward-backward, Viterbi) |
| Used in | BERT, DistilBERT, ELECTRA | T5, ALBERT, XLNet (via SentencePiece) |
When to use each:
WordPiece is a good choice when:
- We want better linguistic units than BPE without much added complexity.
- Training speed matters (large corpus, limited compute).
- We are building an English-focused model (where PMI works well).
Unigram LM is preferred when:
- We need the best possible vocabulary quality.
- We are building multilingual models (as in T5, mT5).
- We want probabilistic segmentations for data augmentation.
- We have compute budget for EM training.
In practice, both are significant improvements over pure frequency-based BPE. The choice often comes down to implementation convenience (WordPiece is simpler) versus quality (Unigram LM is theoretically more principled).
Figure: Comprehensive comparison of BPE, WordPiece, and Unigram LM across key dimensions.
Putting It All Together: Tradeoffs and Design Insights
We've seen four different tokenization strategies: character-level, word-level, BPE, and probabilistic methods (WordPiece and Unigram LM). Each makes different tradeoffs between coverage, efficiency, and linguistic quality.
What's Next?
In Part 3, we'll apply these algorithms to real-world problems:
- Why subwords win: cross-linguistic generalization and frequency-based learning
- Practical metrics: vocabulary size, tokens/sentence, compression ratios
- Lessons from building your own tokenizer
- Measuring fairness across languages
- Production deployment and monitoring
Continue to Part 3: Building & Auditing Tokenizers →
Subscribe to my newsletter The Main Thread for more deep dives. Namaste!
References
- Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units (BPE). https://arxiv.org/abs/1508.07909
- Schuster, M., & Nakajima, K. (2012). Japanese and Korean Voice Search (WordPiece). https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf
- Kudo, T. (2018). Subword Regularization: Improving Neural Network Translation Models (Unigram LM). https://arxiv.org/abs/1804.10959
- Kudo, T., & Richardson, J. (2018). SentencePiece: A simple and language independent approach to subword tokenization. https://arxiv.org/abs/1808.06226
- Devlin, J., et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805
- Brown, T., et al. (2020). Language Models are Few-Shot Learners (GPT-3). https://arxiv.org/abs/2005.14165
- OpenAI (2023). GPT-4 Technical Report. https://arxiv.org/abs/2303.08774
- HuggingFace Tokenizers Documentation. https://huggingface.co/docs/tokenizers/
Written by Anirudh Sharma
Published on November 11, 2025