
Demystifying Tokenization: Part 3 - Building & Auditing Tokenizers


This is Part 3 of a 3-part series on tokenization.

In Part 1, we understood why tokenization matters and how to think about it through a first-principles lens. In Part 2, we implemented popular tokenization algorithms like BPE, WordPiece, and Unigram LM from scratch.

Now we will tackle the practical challenges:

  • How do we know if our tokenizer is good?
  • How do we measure fairness across languages?
  • When should we build a custom tokenizer?
  • How do we monitor it in production?

Before diving in, let's take a step back and understand why subword tokenization has become the dominant approach, what metrics matter when evaluating tokenizers, and what deeper lessons emerge from building these systems.

Why Do Subwords Win?

The triumph of subword tokenization is not accidental - it is the result of solving three seemingly incompatible requirements simultaneously: universality, efficiency, and robustness.

The Impossible Trinity

Think about what we need from a tokenizer:

1. Universality means handling any input without failure. We can't predict what users will type - misspellings, rare proper nouns, code snippets, emoji, text in languages we didn't train on. A production system needs to process all of this gracefully. Character-level tokenization achieves perfect universality by construction — every string is just a sequence of characters. But this comes at a prohibitive computational cost.

2. Efficiency means short sequences. Models have quadratic attention complexity O(n²), limited context windows, and per-token API pricing. A tokenizer that represents "machine learning" as 16 character tokens instead of two word tokens makes everything slower and more expensive. Word-level tokenization is maximally efficient for in-vocabulary words where one semantic unit becomes one token. But it fails catastrophically on out-of-vocabulary (OOV) inputs.

3. Robustness means degrading gracefully on challenging inputs. If the tokenizer sees a misspelled word ("receive" misspelled as "recieve"), it shouldn't break or lose all information. If it encounters a language rarely seen during training, it should still produce reasonable tokens rather than treating every word as [UNK]. Both character-level and word-level tokenizers struggle here—characters are robust but inefficient, words are efficient but fragile.

Subword tokenization is the only widely adopted approach known to balance all three requirements in practice.

Figure: The Tokenization Impossible Trinity - visualization of the three incompatible requirements.

  • Character-level maximizes universality and robustness but sacrifices efficiency.
  • Word-level maximizes efficiency but fails on universality and robustness.
  • Subword tokenization finds the practical middle ground.

Cross-Linguistic Generalization

Subwords don't just generalize within a language - they generalize across languages, especially those with rich morphology.

Consider German compound words. German productively creates new words by concatenation: "Donaudampfschifffahrtsgesellschaft" (Danube steamship company) is a real word formed by gluing together "Donau" (Danube) + "Dampf" (steam) + "Schiff" (ship) + "Fahrt" (journey) + "Gesellschaft" (company). A word-level tokenizer would never see this exact compound during training as there are infinite possible compounds. Character-level would produce 34 tokens.

BPE with a multilingual corpus learns fragments like "dampf", "schiff", "fahrt", "schaft" as individual tokens because these components appear across many compounds. The full word becomes 6-8 tokens instead of 34, and the model sees familiar building blocks it has learned from other contexts.

The same principle applies to morphologically rich languages like Finnish, Turkish, or Arabic, where a single word can encode what English expresses in a full sentence. Turkish "Çekoslovakyalılaştıramadıklarımızdanmışsınızcasına" (meaning roughly "as if you were one of those we could not make Czechoslovakian") is grammatically valid.

Subword tokenization decomposes it into meaningful morphemes that the model has seen in other contexts.
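
To see this concretely, we can inspect how an off-the-shelf multilingual tokenizer splits such words. This is a minimal sketch assuming the transformers package is installed and using the public xlm-roberta-base checkpoint purely as an example; the exact fragments depend on the tokenizer, but each word should decompose into a handful of reusable pieces rather than dozens of characters.

```python
# Sketch: inspect how a multilingual subword vocabulary decomposes long words.
# Assumes `transformers` is installed; "xlm-roberta-base" is just one example
# of a multilingual SentencePiece tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

for word in ["Donaudampfschifffahrtsgesellschaft",
             "Çekoslovakyalılaştıramadıklarımızdanmışsınızcasına"]:
    pieces = tokenizer.tokenize(word)
    print(f"{len(pieces):2d} tokens: {pieces}")
```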

Why Does Frequency-Based Learning Work?

There is a deeper reason subwords generalize so well: natural language has compressible structure. This is not a superficial observation but a consequence of how human languages evolve.

Languages minimize effort. We reuse common sounds, morphemes, and word parts rather than inventing completely new forms for every concept. The distribution of linguistic units follows Zipf's law (Zipf, 1949): a small number of units (like "ing", "the", "tion") appear extremely frequently, while the long tail of rare units (like "quokka", "xylem") appears only occasionally. This non-uniform distribution is exactly what compression algorithms exploit.

When BPE merges the most frequent character pairs, it's essentially building a Huffman code for the language - short codes for frequent patterns, longer codes for rare patterns. This is information-theoretically optimal (Shannon, 1948) under the assumption that training distribution matches deployment distribution.
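
To make the analogy concrete, here is a toy sketch of the frequency counting that drives BPE's first merge (the corpus is an illustrative placeholder): the most frequent adjacent pair sits at the head of the Zipfian distribution and gets merged first, so common patterns end up with the shortest encodings.

```python
from collections import Counter

# Toy corpus; real tokenizer training uses millions of words.
corpus = ["low", "lower", "lowest", "newer", "wider", "new"]

# Count adjacent-character pairs, exactly as BPE does before its first merge.
pair_counts = Counter()
for word in corpus:
    for a, b in zip(word, word[1:]):
        pair_counts[(a, b)] += 1

# The most frequent pair is the one BPE would merge first.
print(pair_counts.most_common(3))
# -> e.g. [(('l', 'o'), 3), (('o', 'w'), 3), (('w', 'e'), 3)]
```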

WordPiece and Unigram LM refine this by using mutual information and likelihood instead of raw frequency, but the underlying principle remains: statistical regularity in language enables learned compression, and learned compression enables generalization.

This is why subword tokenizers trained on English generalize reasonably well to English they haven't seen before. The morphological patterns ("pre-", "post-", "-tion", "-ment") and common word fragments ("play", "work", "think") transfer. It's also why they generalize poorly to radically different languages: if our training corpus is 95% English, then learned fragments optimize for English structure, not Arabic or Chinese.

Now that we understand why subword tokenization dominates, let's turn to measurement: how do we know if a tokenizer is actually good?

Practical Metrics: What Actually Matters?

When we build or evaluate a tokenizer, hit rate is just one metric. The real question is: does this tokenizer enable the model to learn efficiently and generalize effectively? Here are the metrics that actually correlate with downstream performance.

| Metric | Purpose | Ideal Range | Failure Mode |
| --- | --- | --- | --- |
| Vocabulary Size | Balance memory vs sequence length | 30K-50K (general); 50K-100K (specialized) | Too small: long sequences; too large: sparse embeddings |
| Tokens/Sentence | Predict computational cost | 15-25 (English); varies by language | >30: inefficient; 3x disparity: unfair |
| Effective OOV Rate | Measure robustness | 70-80% single-token words | <50%: vocab too small or misaligned |
| Bytes/Token | Measure compression efficiency | 4-6 (English); language-dependent | <3: poor compression; 2x disparity: bias |

Vocabulary Size: The Goldilocks Problem

Vocabulary size determines everything else: model capacity, memory usage, sequence length, and learning efficiency. Too small and we get long sequences; too large and we get sparse, poorly-learned embeddings.

The theoretical lower bound is 256 tokens (byte-level BPE's base vocabulary). Every possible input can be represented as bytes, so you technically need nothing more. But at 256 tokens, common words decompose into many fragments. "the" becomes 3 tokens [116, 104, 101], and "Tokenization" becomes 12 tokens. Our context window fills up fast, and attention becomes expensive.

The theoretical upper bound is unbounded: we could have a unique token for every possible word. But real-world constraints kick in immediately. Each token needs an embedding (typically 768 to 4096 dimensions). With 1 million tokens at 2048 dimensions and 4 bytes per float, we are using 8 GB just for the embedding table.

The memory cost scales linearly:

embedding_memory_gb = (vocab_size × embedding_dim × 4) / 1024³

Example: 50K vocab × 2048 dims × 4 bytes ≈ 400 MB
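
A quick sanity check of this formula (pure arithmetic, no dependencies):

```python
def embedding_memory_gb(vocab_size: int, embedding_dim: int, bytes_per_float: int = 4) -> float:
    """Memory for the embedding table alone, in GiB."""
    return vocab_size * embedding_dim * bytes_per_float / 1024**3

for vocab in (30_000, 50_000, 100_000, 1_000_000):
    print(f"{vocab:>9,} tokens x 2048 dims -> {embedding_memory_gb(vocab, 2048):.2f} GB")
# 50,000 tokens    -> ~0.38 GB (the ~400 MB above)
# 1,000,000 tokens -> ~7.6 GB  (the ~8 GB above)
```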

More importantly, rare tokens appear infrequently during training, so their embeddings never learn meaningful representations.

Empirically, 30,000-50,000 tokens is the sweet spot for most modern LLMs (Devlin et al., 2018; Liu et al., 2019; Raffel et al., 2020; Brown et al., 2020):

```plaintext
GPT-2:   50,257 tokens
BERT:    30,522 tokens
RoBERTa: 50,265 tokens
T5:      32,000 tokens
GPT-3:   50,257 tokens
```

Why this range? It balances sequence length efficiency with embedding quality. Common words and morphemes get single-token representations. Rare words decompose into 2-4 fragments. Character-level fallback handles the infinite tail.

How to choose vocabulary size for our domain?

Start with the baseline question: how specialized is our domain? If we are building a general-purpose chatbot, use a standard vocabulary (30K-50K). If we are building a medical coding assistant, we need to handle terms like "cholecystectomy" and "electroencephalography" efficiently - these shouldn't decompose into 8 generic fragments.

Run a simple experiment: encode a sample of our target domain with different vocabulary sizes and measure average tokens per sentence. If our 30K-token vocabulary produces 40 tokens per sentence while a 100K-token vocabulary produces 25 tokens per sentence, the larger vocabulary might be worth the memory cost. But if the difference is only 40 vs 38, the smaller vocabulary is fine.
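
Here is one way to run that experiment with the HuggingFace tokenizers library. This is a sketch, not a prescription: `domain_corpus` and `held_out_sentences` are hypothetical placeholders for your own data, and in practice you would train on far more text.

```python
# Sketch: train BPE at several vocabulary sizes and compare average tokens per
# sentence on a held-out sample from the deployment distribution.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def avg_tokens_per_sentence(vocab_size, corpus, sentences):
    tok = Tokenizer(models.BPE(unk_token="[UNK]"))
    tok.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tok.train_from_iterator(corpus, trainer)
    return sum(len(tok.encode(s).ids) for s in sentences) / len(sentences)

domain_corpus = ["replace with your domain text, one string per document"]
held_out_sentences = ["replace with a held-out sample of deployment sentences"]

for size in (30_000, 50_000, 100_000):
    print(size, round(avg_tokens_per_sentence(size, domain_corpus, held_out_sentences), 1))
```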

The second consideration is training data size. Larger vocabularies need more data to learn good embeddings. If we have 1 billion tokens of training data, we can support 100K vocabulary tokens (each appears ~10,000 times on average). If we have 10 million tokens, we should stick to 10K vocabulary (each appears ~1,000 times).

Takeaway: Target 30K-50K tokens for general use. Scale up to 100K only if you have 1B+ training tokens and domain-specific needs justify the memory cost.

Average Tokens Per Sentence: The Efficiency Metric

This is the single best predictor of computational cost. If tokenizer A produces 20 tokens per sentence and tokenizer B produces 40, tokenizer B pays roughly 4x more attention compute (quadratic scaling) and consumes about 2x more memory.

We should measure this on a held-out sample that represents our deployment distribution. Download 10,000 examples of the kind of text our model will process in production, tokenize them, and compute the mean and standard deviation of sequence lengths.

Average tokens per sentence:
avg_tokens_per_sentence = total_tokens / num_sentences

What counts as "good"?

For English text with a 50K vocabulary, we can target around 15-25 tokens per sentence. If we are consistently above 30, we should either increase vocabulary size or investigate whether we are fragmenting common domain-specific terms unnecessarily.

For multilingual models, we can expect asymmetry. English, Spanish, and French typically produce shorter sequences (efficient tokenization). Chinese, Japanese, and Korean produce moderate sequences (character-heavy scripts but compressible). Morphologically rich languages like Finnish, Turkish, or Hungarian produce longer sequences even with good subword vocabularies.

If we care about fairness, we must measure this per language. A tokenizer that produces 15 tokens per English sentence but 45 tokens per Swahili sentence is charging Swahili speakers 3x more for API usage and giving them 3x less context window. This isn't hypothetical - researchers have documented exactly this problem in GPT-3's tokenizer (Ahia et al., 2023; Ahuja et al., 2023).

Example: Multilingual Token Count Disparity

| Language | Sentence | Tokens | Bytes/Token | Cost Multiplier |
| --- | --- | --- | --- | --- |
| English | "Hello, how are you?" | 6 | 3.2 | 1.0x (baseline) |
| Swahili | "Hujambo, u hali gani?" | 11 | 2.0 | 1.8x |
| Arabic | "مرحبا، كيف حالك؟" | 9 | 2.4 | 1.5x |

Data approximations based on GPT-3 tokenizer behavior (Ahuja et al., 2023).

Figure: Multilingual tokenization fairness comparison - languages underrepresented in training data require significantly more tokens for equivalent semantic content, translating directly to higher API costs and reduced effective context windows.

Takeaway: Measure on your deployment distribution and target 15-25 tokens/sentence for English. Audit per language to detect fairness issues (a 3x disparity signals problems).
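
A minimal version of that per-language audit, assuming tiktoken as the tokenizer under test and a hypothetical dict of parallel sentences per language (in practice, use a few thousand sentences each):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # example tokenizer; substitute your own

# Hypothetical parallel samples keyed by language.
samples = {
    "English": ["Hello, how are you?"],
    "Swahili": ["Hujambo, u hali gani?"],
}

avg = {lang: sum(len(enc.encode(s)) for s in sents) / len(sents)
       for lang, sents in samples.items()}
baseline = avg["English"]

for lang, tokens in avg.items():
    print(f"{lang:8s} {tokens:5.1f} tokens/sentence   {tokens / baseline:.1f}x relative cost")
```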

Out-of-Vocabulary Rate: The Robustness Check

Technically, subword tokenizers don't have OOV in the traditional sense - every string is representable, even if it decomposes to character-level tokens. But we should still measure effective OOV: the rate at which words decompose into fragments smaller than some threshold (e.g., 2+ tokens for single words).

Take a held-out test set and encode each word individually. Count how many words become a single token versus multiple tokens. A well-designed tokenizer for English should encode 70-80% of words as single tokens, with the remaining 20-30% splitting into 2-3 tokens for morphologically complex or rare words.
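
A sketch of that measurement, assuming transformers is installed and using "gpt2" purely as an example checkpoint; the word list is a placeholder for your own held-out sample:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# Placeholder sample; replace with words drawn from your held-out test set.
held_out_words = ["running", "receive", "cholecystectomy", "the", "quokka"]

def single_token_rate(words):
    # Prepend a space so GPT-2's byte-level BPE treats each word as word-initial.
    return sum(1 for w in words if len(tok.tokenize(" " + w)) == 1) / len(words)

print(f"{single_token_rate(held_out_words):.1%} of words are single tokens")  # target: 70-80%
```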

If we are seeing 40%+ of common words splitting, our vocabulary is either too small or poorly optimized for our domain. If only 50% of words are single tokens, we need to consider increasing vocabulary size or retraining on domain-specific data.

For multilingual tokenizers, we should compute this metric per language as well. If 75% of English words are single tokens but only 30% of Hindi words are, our tokenizer is biased toward English - it has allocated vocabulary slots to English fragments and underrepresents other scripts.

Takeaway: Aim for 70-80% of common words as single tokens. If below 50%, increase vocab size or retrain on domain data.

The Hidden Metric: Compression Ratio

Here's a metric that's rarely discussed but deeply informative: bytes per token. Measure the average number of UTF-8 bytes represented by each token.

For English with a well-tuned tokenizer:

  • "the cat sat on the mat" = 26 bytes, 7 tokens = 3.7 bytes/token
  • "Tokenization enables compression" = 33 bytes, 4 tokens = 8.3 bytes/token

bytes_per_token = total_utf8_bytes / total_tokens

A higher bytes-per-token ratio means better compression - we are encoding more information per token, which means shorter sequences and cheaper computation.

Compare this across different domains. If our tokenizer achieves 6 bytes/token on Wikipedia but only 3 bytes/token on medical records, it's fragmenting medical terminology inefficiently. We need to consider training a specialized medical tokenizer.

We can also compare across languages. If English achieves 5 bytes/token but Arabic achieves 2 bytes/token, our vocabulary is underrepresenting Arabic. This directly translates to cost - an Arabic user pays 2.5x more tokens for the same semantic content.
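
Bytes per token is straightforward to compute. A sketch comparing two domains, again using tiktoken only as an example tokenizer and placeholder samples in place of real corpora:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # example tokenizer; substitute your own

def bytes_per_token(texts):
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    total_tokens = sum(len(enc.encode(t)) for t in texts)
    return total_bytes / total_tokens

wikipedia_sample = ["The cat sat on the mat."]                   # placeholder samples
medical_sample = ["Status post laparoscopic cholecystectomy."]

print("Wikipedia:", round(bytes_per_token(wikipedia_sample), 2))
print("Medical:  ", round(bytes_per_token(medical_sample), 2))
```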

Takeaway: Higher bytes/token = better compression. Track across domains and languages to spot inefficiencies and fairness gaps.

Metrics tell us what to measure, but building tokenizers teaches us what really matters. Here are the non-obvious insights from the implementation work we did in Part 2 of this series.

Lessons From Building Our Own Tokenizer

While implementing our own tokenizer, we encountered several non-obvious insights that shape how we should think about tokenization going forward.

Tokenization is Representation Design, Not Preprocessing

This is the most important conceptual shift. Most engineers think of tokenization as a boring preprocessing step: "just split the text so the model can process it." But tokenization determines what the model can learn.

Consider the phrase "strawberry has three r's". If "strawberry" tokenizes as a single token, the model never sees the internal letters - it must memorize that this specific token relates to the letter 'r', with no compositional reasoning. If "strawberry" tokenizes as ["straw", "berry"], the model sees fragments that don't contain 'r' in the expected positions. To answer correctly, it must learn a mapping from token boundaries to character positions, which is a harder compositional task.

This is why GPT-4 and Claude struggle with "How many r's in strawberry?" - their tokenizers fragment the word in ways that obscure character-level structure. Character-level models answer these questions trivially because their representation aligns with the task.
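
We can inspect this directly. The sketch below uses tiktoken's cl100k_base encoding as an example; the exact split varies by tokenizer, but in every case the model receives opaque token IDs, never individual letters.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # example tokenizer

ids = enc.encode("strawberry")
# The token pieces the model actually sees - no character-level view of the word.
print([enc.decode([i]) for i in ids])
```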

The lesson: Tokenization choices create inductive biases. If we want our model to excel at character-level tasks (spelling, anagrams, character counting), we should use character-level or byte-level tokenization.

If we want semantic tasks (translation, summarization, reasoning), we should use subword tokenization that exposes morphological structure. We can't optimize for both simultaneously with a single tokenizer.

Statistical Structure Enables Compression, But Compression Loses Information

BPE and its variants are lossy compression schemes. They preserve high-frequency patterns and discard low-frequency patterns. This is optimal for the average case but suboptimal for the worst case.

When we train a tokenizer on English Wikipedia, we learn excellent representations for words like "artificial", "intelligence", "algorithm" - these appear thousands of times, so they become single tokens or efficient 2-token splits. But domain-specific jargon ("methylprednisolone", "CRISPR-Cas9", "Schrödinger") fragments badly unless it appeared in training.

The compression is adaptive to the training distribution. If our deployment distribution matches our training distribution, compression is nearly lossless - the model sees tokens that align with semantic units. If distributions diverge (e.g., deploying a Wikipedia-trained tokenizer on medical records), compression becomes very lossy as semantic units fragment into meaningless pieces.

The lesson: Tokenizer quality degrades when deployment distribution shifts from training distribution. If we are building a specialized system (legal contracts, code generation, medical diagnosis), we should train a custom tokenizer on domain-specific data.

The performance gains can be substantial: 15-30% reduction in sequence length and corresponding improvements in latency and cost.

Vocabulary Allocation Is a Zero-Sum Game

When we set vocab_size=50000, we are making 50,000 hard choices about what linguistic units deserve atomic representation. Every token allocated to a low-value fragment is a token not allocated to a high-value fragment.

In multilingual tokenizers, this becomes acute. If our training corpus is 60% English, 20% European languages, 10% Chinese, and 10% everything else, BPE will allocate vocabulary proportionally. English morphemes get efficient single-token representations while underrepresented languages fragment excessively.

SentencePiece-based multilingual pipelines mitigate this with sampling: they oversample rare languages in the tokenizer training corpus so those languages get a fairer vocabulary allocation. But the fundamental tradeoff remains: vocabulary is finite, and every slot allocated to one pattern is unavailable for another.

The lesson: If we care about multilingual fairness or domain coverage, we can't just train on a massive unbalanced corpus. We need to curate our training data deliberately and oversample underrepresented languages or domains.

We need to inspect the learned vocabulary to ensure critical terminology gets atomic tokens. Vocabulary design is active curation, not passive learning.

Special Tokens Are a Necessary Compromise

Real tokenizers need special tokens such as [PAD], [UNK], [CLS], [SEP], <|endoftext|>, etc. These serve structural roles (sentence boundaries, padding for batching, document separators) that aren't expressible in raw text.

But every special token is a vocabulary slot that could have been allocated to a real linguistic unit. BERT reserves roughly 1,000 of its 30,522 slots for special and [unused] placeholder tokens - about 3% overhead. GPT-2 uses a single <|endoftext|> token. T5 uses ~100 task-specific sentinels (<extra_id_0>, <extra_id_1>, etc.).

There's no free lunch here. Special tokens enable architectural features (masked language modeling, multi-task prompting, efficient batching), but they cost vocabulary capacity and complicate the tokenizer logic.

The lesson: Minimize special tokens. Use them only when they enable critical architectural features. Don't allocate 500 special tokens "just in case" - every unused special token is wasted vocabulary.

Pre-tokenization and Normalization Matter More Than We Think

Before BPE or WordPiece runs, text undergoes pre-tokenization (splitting into words) and normalization (case folding, Unicode normalization, whitespace handling). These choices shape what the subword algorithm learns.

If we pre-tokenize by splitting on whitespace, "don't" remains intact and might become a single token. If we split on all punctuation, it becomes "don" + "'" + "t", and BPE must relearn that these fragments combine. Different choices lead to different vocabularies and different model behavior.
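
The tokenizers library makes this easy to see. The sketch below compares its WhitespaceSplit pre-tokenizer (splits on whitespace only, keeping "don't" intact) with the Whitespace pre-tokenizer (which also splits at punctuation):

```python
from tokenizers import pre_tokenizers

text = "don't"
for pt in (pre_tokenizers.WhitespaceSplit(), pre_tokenizers.Whitespace()):
    pieces = [piece for piece, _span in pt.pre_tokenize_str(text)]
    print(type(pt).__name__, "->", pieces)
# WhitespaceSplit keeps "don't" intact; Whitespace splits it at the apostrophe.
```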

Normalization is even subtler. Should we lowercase everything (losing case information but reducing vocabulary size)? Should we normalize Unicode (ensuring "café" and "café" map to the same tokens, but losing distinctions that matter in some languages)? Should we strip accents (simplifying European languages but destroying semantic distinctions in Vietnamese or Arabic)?
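
The "café" example is a one-liner with Python's standard unicodedata module: the composed and decomposed forms look identical on screen but have different byte sequences until normalized.

```python
import unicodedata

composed = "caf\u00e9"       # 'é' as a single code point
decomposed = "cafe\u0301"    # 'e' followed by a combining acute accent

print(composed == decomposed)                                   # False
print(unicodedata.normalize("NFC", composed) ==
      unicodedata.normalize("NFC", decomposed))                 # True
```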

The lesson: Pre-tokenization and normalization are first-order design decisions, not implementation details. We must test different strategies on held-out data and measure downstream task performance.

The "right" choice depends on your domain—case might be critical for code (CamelCase variables) but irrelevant for casual chat.

Tokenization Is Where Bias Enters the Model

Because tokenizers are trained on internet text (Wikipedia, Common Crawl, books, etc.), they inherit the biases of that data. English is overrepresented, so English gets better tokens. Formal written language is overrepresented, so slang and dialectal variants fragment inefficiently.

This creates representational harm. A user writing in African American Vernacular English might see their text fragment 2x more than Standard American English, signaling to them that the system treats their language as "abnormal" or "broken." A user writing in Yoruba or Igbo might see every word fragment into 5+ tokens, making the system prohibitively expensive to use.

These aren't bugs; they are direct consequences of optimizing tokenizers on biased corpora. The only solution is intentional data curation and fairness auditing.

The lesson: If we are deploying tokenization systems at scale, we should measure performance across demographic groups. Track sequence length, compression ratio, and effective OOV rate by language, dialect, and formality level.

If we find significant disparities, we should curate training data to oversample underrepresented groups. Tokenization fairness is model fairness.

These lessons aren't just technical — they are about how representation choices shape what AI systems can and cannot do.

Reflections

We've journeyed from raw text to integer sequences — from the messy, infinite space of human language to the discrete, finite space of token IDs. We have seen how different algorithms make different tradeoffs: BPE optimizes for frequency, WordPiece for mutual information, and Unigram LM for global likelihood. But this is only the first transformation in a longer pipeline.

Tokenization solves the representation problem: how do we encode language computably? But it creates a new challenge — tokens are just discrete symbols. The integer [30642] for "Tokenization" has no inherent meaning to a neural network. It's an arbitrary label, not a meaningful quantity.

This is where embeddings enter. An embedding converts each token ID into a continuous vector in high-dimensional space—typically 768 to 4096 dimensions. Instead of the integer 30642, the model sees a vector like [0.21, -0.45, 0.82, ..., 0.13]. These vectors are learned during training such that tokens with similar meanings or distributional patterns end up geometrically close (we will discuss embeddings in detail in the next blog post).

Tokenization is the interface between human language and machine learning. On one side is the infinite, messy creativity of human communication—slang, code-switching, emoji, morphology. On the other side is the finite, structured world of tensors and gradients. Tokenization is the bridge, and like all interfaces, it reflects design choices and values.

This is systems thinking applied to language. Understanding tokenization means understanding how linguistic structure (morphology, frequency distributions, compositionality) interacts with computational constraints (vocabulary size, sequence length, memory) to shape what AI systems can and cannot do.

So the next time you see a model struggle with a task, ask: is this a modeling failure, or a tokenization failure? Is the model too small, or did the tokenizer fragment critical information into pieces the model can't reassemble?

Tokenization isn't everything. But it's the first step, and first steps shape everything that follows.



Series Conclusion

Across this three-part series, we have built tokenization from first principles:

Part 1 showed why tokenization is representation design, not preprocessing — and how the compression lens helps us understand the fundamental tradeoffs.

Part 2 implemented all major algorithms with working code, from deterministic (BPE) to probabilistic (WordPiece, Unigram LM).

Part 3 (this post) applied these to production: measuring quality, ensuring fairness, and building domain-specific tokenizers.

The meta-lesson: Tokenization choices propagate through your entire system. Token boundaries determine what models can learn, how fairly they treat different users, and whether certain tasks are tractable. Treat tokenizer design with the same rigor you give model architecture — it matters just as much.

Theory is only half the story. To truly understand tokenization, implement the algorithms yourself. Test them on different languages, measure compression ratios, observe where they fail.

I have implemented complete BPE, WordPiece, and Unigram LM tokenizers from scratch in my GitHub repository: ai-engineering/tokenization. The code includes extensive documentation, worked examples, and experiments you can run on your own data.

👉 Subscribe to my newsletter The Main Thread to keep learning with me. Namaste!


References

  1. Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units (BPE). https://arxiv.org/abs/1508.07909
  2. Schuster, M., & Nakajima, K. (2012). Japanese and Korean Voice Search (WordPiece). https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf
  3. Kudo, T. (2018). Subword Regularization: Improving Neural Network Translation Models (Unigram LM). https://arxiv.org/abs/1804.10959
  4. Kudo, T., & Richardson, J. (2018). SentencePiece: A simple and language independent approach to subword tokenization. https://arxiv.org/abs/1808.06226
  5. Devlin, J., et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805
  6. Liu, Y., et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. https://arxiv.org/abs/1907.11692
  7. Raffel, C., et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5). https://arxiv.org/abs/1910.10683
  8. Brown, T., et al. (2020). Language Models are Few-Shot Learners (GPT-3). https://arxiv.org/abs/2005.14165
  9. OpenAI (2023). GPT-4 Technical Report. https://arxiv.org/abs/2303.08774
  10. Ahia, O., et al. (2023). Do Localization and Tokenization Improve Multilingual Machine Translation? https://arxiv.org/abs/2305.15024
  11. Ahuja, K., et al. (2023). MEGA: Multilingual Evaluation of Generative AI. https://arxiv.org/abs/2303.12528
  12. HuggingFace Tokenizers Documentation. https://huggingface.co/docs/tokenizers/
  13. OpenAI Tokenizer Tool. https://platform.openai.com/tokenizer

Written by Anirudh Sharma

Published on November 12, 2025
