
Demystifying Tokenization: Part 1 - Why Tokenization Matters


This is Part 1 of a 3-part series on tokenization.


Introduction

Tokenization errors cascade into model errors.

  • When GPT-4 struggles to count the r's in "strawberry".
  • When a medical AI misinterprets "hypoglycemia" as gibberish.
  • When a Swahili speaker pays 3x more than an English speaker for the same API call.

These aren't model failures but tokenization failures propagating through the entire system.

Before a language model can predict anything, it must first see the text. Not as words, but as discrete tokens — fragments that determine what the model can learn, how much inference costs, and which languages get fair treatment. This transformation happens in a layer most engineers ignore: tokenization.

Here's what actually happens: when you type "Tokenization is essential" into GPT, the model doesn't see those three words. It sees integers like [30642, 1340, 374, 7718] mapping to subword fragments (see OpenAI Tokenizer). The word "strawberry" becomes ["straw", "berry"] and suddenly counting characters becomes nearly impossible because the model never sees the individual letters as atomic units.
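You can see this for yourself with a few lines of Python. This is a minimal sketch assuming the tiktoken package is installed; the exact IDs and splits depend on which encoding you load.

```python
# Minimal sketch using OpenAI's tiktoken package (pip install tiktoken).
# The exact IDs and fragments depend on the encoding you choose.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # an encoding used by GPT-4-era models

text = "Tokenization is essential"
ids = enc.encode(text)
pieces = [enc.decode_single_token_bytes(i).decode("utf-8", errors="replace") for i in ids]

print(ids)     # a short list of integers, not three words
print(pieces)  # the subword fragments those integers map to
```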

The tokenization pipeline, from raw text to model input: Text → UTF-8 Bytes → Subword Tokens → Integer IDs.

The way we split text into atomic units fundamentally shapes what a model can and cannot learn.

This is not preprocessing. It's representation design - and it creates what we call tokenization debt: long-term constraints on model capabilities, cost structure, and fairness that are painful to change once baked into production systems.

The Core Problem

Neural networks can't compute on raw strings; they need numbers. Thus, we need a mapping:

Text → Tokens → Integer IDs → [Model Processes Numbers]

On the surface, this sounds simple, but natural language is messy for several reasons:

  • Morphology: words like "walk", "walking", "walked" are related but different.
  • Compound Words: is "New Delhi" one thing or two?
  • Rare Words: a word like "platypus" might appear only once in our training data.
  • Typos and Variants: "receive" vs "recieve" or "color" vs "colour".
  • Multiple Languages: language models need to support many languages and symbol systems - English, हिंदी, 中文, العربية, emoji (🌍), and even code such as def hello():

Creating a mapping from text to Integer IDs is challenging for several reasons:

  1. We need to handle unknown words - words that were never part of our training corpus.
  2. There are effectively infinite combinations of words, but models cannot store infinite embeddings, so the vocabulary has to be finite.
  3. Frequent patterns should be atomic. Therefore, we need to capture the statistical nature of the training corpus.
  4. The model should work across languages including ones we haven't seen.

Thus, the naive, brute-force approach fails immediately. We need to come up with something clever.

If we reason about this from first principles, we arrive at the following candidate solutions:

Character-Level Tokenization

In this approach, we split the text into individual characters. For example, "cat" becomes ['c', 'a', 't']. The good thing about this approach is that it works for any language, emoji, or typo.

But this approach has flaws: sequences explode in length - the phrase "Tokenization is essential" has 25 characters but only three words. That means 5-10x longer sequences → 5-10x more computation.

And models have finite context windows, so those longer sequences either cost more or simply don't fit.

Word Level Tokenization

Another fairly intuitive approach is to split the text on spaces, i.e., "Tokenization is essential" becomes ["tokenization", "is", "essential"].

On the surface, this approach seems efficient: each semantic unit is one token.

But if we think deeper, this approach also has flaws: English alone has ~170k words, and once we include proper nouns, technical terms, typos, etc., the vocabulary is effectively unbounded. It is impractical to store an unbounded vocabulary.

Also, if we encounter a rare word like "platypus" that never appeared during training, the model cannot represent it at all. Furthermore, every language needs its own dictionary.

Subword Tokenization

One of the best things about natural language is that it has compressible structure: some patterns repeat constantly, others hardly at all.

For example, the word "the" appears a massive number of times in any corpus, so we treat the entire word as a single token. On the other hand, "platypus" appears rarely, so we split it into common subwords like ["plat", "y", "pus"].

This approach hits the sweet spot: a bounded vocabulary, with token boundaries adapted to how often words actually occur in the corpus.

Figure: Sequence Length Comparison - sequence lengths for "Tokenization is essential": Characters (25 tokens) vs Words (3 tokens) vs Subwords (4-5 tokens).
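To make the comparison concrete, here is a toy sketch. The subword vocabulary below is hand-picked purely for illustration - real tokenizers learn it from data, which is exactly what Part 2 covers.

```python
# Toy illustration: a hand-picked subword vocabulary and a greedy
# longest-match tokenizer. Real vocabularies are learned from data (Part 2).
SUBWORDS = {"token", "ization", "is", "essential", "plat", "y", "pus"}

def greedy_subword(word: str) -> list[str]:
    """Greedy longest-match segmentation, falling back to single characters."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in SUBWORDS or j == i + 1:  # single char = guaranteed fallback
                pieces.append(word[i:j])
                i = j
                break
    return pieces

text = "Tokenization is essential"
chars = list(text)                                        # character-level
words = text.lower().split()                              # word-level
subwords = [p for w in words for p in greedy_subword(w)]  # subword-level

print(len(chars), chars)        # long sequence, tiny symbol set
print(len(words), words)        # short sequence, unbounded vocabulary
print(len(subwords), subwords)  # short sequence, finite vocabulary

print(greedy_subword("platypus"))  # a rare word decomposes into known pieces
```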

This approach solves all of our four major headaches:

1. 100% Coverage: any string can be decomposed to subwords (the worst case: individual characters).

2. Finite Vocabulary: typically 30k-50k tokens.

3. Statistical Structure: frequent patterns get atomic representation.

4. Cross Language: works for any language with enough training data.

The trade-off is that we are compressing language and like any compression, we choose what to preserve (frequent patterns) and what to fragment (rare patterns). This choice shapes what models can learn.

What Will You Learn in This Series?

This series builds tokenization from first principles - showing the mathematical objectives, algorithmic intuitions, and implementation details that papers gloss over.

In Part 1 (this post), we will explore why tokenization matters, the fundamental problems it solves, and how to think about it as lossy compression.

In Part 2, we'll implement the major algorithms: character-level, word-level, BPE, byte-level BPE, WordPiece, and Unigram Language Models—with complete working code.

In Part 3, we'll apply these to real-world problems: measuring fairness across languages, tuning for domain-specific text, and monitoring tokenizer health in production.

If you are dealing with models that fail on certain inputs, building domain-specific models (medical, legal, code), wondering why OpenAI/Anthropic charge per token, care about multilingual fairness in AI systems, or are simply curious, then you will enjoy this series.

And if you just want to use transformers.AutoTokenizer and don't care how it works under the hood, that's fine - skip this series and get on with your life.

Without further ado, let's jump in.

TL;DR

Tokenization is the bridge between human language and machine computation - converting text into discrete units that models can process. Three core insights:

1. Subwords beat characters and words: Character-level tokenization has 100% coverage but creates sequences 5-10x longer (computationally expensive). Word-level tokenization is efficient but breaks on unknown words. Subword tokenization (BPE, WordPiece, Unigram LM) achieves both coverage and efficiency by treating common words as single tokens and rare words as compositions of frequent fragments.

2. Statistical patterns enable compression: Natural language has compressible structure. Frequent morphemes like "ing", "un-", "tion" appear across thousands of words. Tokenization algorithms learn these patterns from data, creating vocabularies of 30K-50K tokens that efficiently encode most text while gracefully degrading on rare inputs.

3. Probabilistic methods optimize for meaning, not just frequency: BPE greedily merges the most frequent character pairs. WordPiece improves this by merging pairs with high mutual information (statistical dependence). Unigram LM goes further, using the EM algorithm to learn globally optimal token probabilities. Modern LLMs use these probabilistic methods because they produce linguistically meaningful units.

The meta-lesson: Tokenization isn't preprocessing - it's representation design. The tokens we choose determine what patterns our model can learn, how fairly it treats different languages, and whether tasks like "count the r's in strawberry" are tractable or impossible.

Reading Guide for This Series

⚡ Quick learner? Read Part 1 (this post) + Part 2: Subword Tokenization section + Part 3: Why Subwords Win

🔢 Math-oriented? Focus on Part 2: WordPiece (PMI scoring) + Part 2: Unigram LM (EM algorithm) + Part 3: Practical Metrics

🛠️ Practitioner? Jump to Part 2: BPE Implementation + Part 3: Practical Metrics + Part 3: Lessons from Building

🎓 Academic depth? Read all three parts including Part 2: Forward-Backward Algorithm + Part 2: Viterbi Decoding

Real-World Failure Case: When Tokenization Breaks in Production

In this section, we will walk through a hypothetical case of how a tokenizer can break in production.

The Medical AI That Couldn't Read Prescriptions

A healthcare startup deployed a model to extract medication names from doctor's notes. It worked perfectly in testing but failed catastrophically in production. The culprit? Tokenization.

Medical terms like "methylprednisolone" (a common corticosteroid) were fragmented into 8-10 generic subword tokens by their GPT-based tokenizer: "methylprednisolone" → ["m", "ethyl", "pred", "nis", "ol", "one", "</w>"]

Meanwhile, common words like "take" or "daily" were single tokens. The model learned strong associations with complete tokens but struggled to compose meaning from fragmented medical terminology. Accuracy dropped from 94% (on Wikipedia-style text) to 67% (on medical notes).

The cost impact: Each prescription required 3x more tokens than estimated, tripling their API costs. Patients in clinical trials had to wait 2-3x longer for medication verification because the longer sequences slowed inference.

The fix: They trained a custom tokenizer on 50M medical documents, allocating vocabulary specifically to drug names, symptoms, and procedures. Accuracy recovered to 91%, and token count per prescription dropped by 40%.
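For reference, here is roughly what such a fix looks like, sketched with the Hugging Face tokenizers library. The file path, vocabulary size, and special tokens are placeholders, not the startup's actual configuration.

```python
# Sketch of training a domain-specific BPE tokenizer with the Hugging Face
# `tokenizers` library (pip install tokenizers).
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Start from an empty BPE model and learn merges from in-domain text.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,                                   # placeholder size
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]"],
)

# "medical_corpus.txt" is a placeholder for your own in-domain corpus,
# e.g. de-identified clinical notes.
tokenizer.train(files=["medical_corpus.txt"], trainer=trainer)
tokenizer.save("medical_tokenizer.json")

# Frequent domain terms should now survive as far fewer pieces than they
# would under a general-purpose tokenizer.
print(tokenizer.encode("methylprednisolone 4mg daily").tokens)
```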

The lesson: Tokenization isn't universal. A tokenizer optimized for one domain (Wikipedia, web text) can catastrophically underperform in another (medical, legal, code). This is tokenization debt - the hidden cost of representation choices made early in model development.

Figure: Tokenization Comparison - same text, different tokenizers: general-purpose vs domain-specific tokenization of medical terms.

Why This Matters for Practitioners Today

If you're building or deploying LLMs, tokenization directly impacts your bottom line and model behavior:

Inference Cost: Every token costs money. GPT-4 charges $0.03/1K input tokens. If poor tokenization inflates your sequences by 30%, you're paying 30% more for every API call. At scale, this is thousands of dollars per month.

Latency: Attention scales as O(n^2) with sequence length. Doubling token count quadruples attention computation. For real-time applications (chatbots, autocomplete), tokenization efficiency determines whether we meet latency SLAs.
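A quick back-of-the-envelope sketch of both effects. The request volume, baseline token count, and inflation factor below are illustrative assumptions, not measured values.

```python
# Back-of-the-envelope sketch; volume, baseline length, and inflation factor
# are illustrative assumptions.
PRICE_PER_1K_INPUT_TOKENS = 0.03   # USD, the GPT-4 input rate quoted above
REQUESTS_PER_MONTH = 1_000_000
BASELINE_TOKENS_PER_REQUEST = 500
INFLATION = 1.30                   # 30% more tokens from poor tokenization

def monthly_cost(tokens_per_request: float) -> float:
    return REQUESTS_PER_MONTH * tokens_per_request / 1000 * PRICE_PER_1K_INPUT_TOKENS

baseline = monthly_cost(BASELINE_TOKENS_PER_REQUEST)
inflated = monthly_cost(BASELINE_TOKENS_PER_REQUEST * INFLATION)
print(f"baseline: ${baseline:,.0f}/mo, inflated: ${inflated:,.0f}/mo, "
      f"extra: ${inflated - baseline:,.0f}/mo")

# Attention work grows roughly with the square of sequence length,
# so 30% more tokens means ~1.3^2 = 1.69x the attention compute.
print(f"attention compute factor: {INFLATION ** 2:.2f}x")
```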

Model Failure Modes: Tasks requiring character-level reasoning (spelling, anagrams, counting letters) fail when tokenization fragments words unpredictably. The model never sees "strawberry" as ['s','t','r','a','w','b','e','r','r','y']—it sees ["straw", "berry"] and must infer character positions through indirect reasoning.

Fine-Tuning Stability: If our fine-tuning data has different tokenization characteristics than pretraining (e.g., more code, more medical terms), the model encounters many previously-rare tokens. Their embeddings are poorly initialized, causing training instability and requiring more epochs to converge.

Multilingual Fairness: English tokenizes efficiently (~15 tokens/sentence). Swahili, Turkish, or Finnish often tokenize at 3-4x that rate. This means non-English users pay more, experience higher latency, and consume context windows faster—a direct fairness and accessibility issue.
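If you want to measure this on your own stack, a simple sketch with transformers.AutoTokenizer is enough to get started. The gpt2 checkpoint and the sample sentences below are placeholders - the non-English sentences are rough translations of the English one - so treat the output as illustrative, and run it on the tokenizer and evaluation set you actually use.

```python
# Sketch of measuring tokens-per-sentence across languages.
from transformers import AutoTokenizer

# "gpt2" is just a convenient public checkpoint; swap in the tokenizer you deploy.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

samples = {
    "English": "The weather is beautiful today and I am going for a walk.",
    "Hindi":   "आज मौसम बहुत सुहावना है और मैं टहलने जा रहा हूँ।",
    "Swahili": "Hali ya hewa ni nzuri leo na ninaenda kutembea.",
}

for lang, sentence in samples.items():
    n_tokens = len(tokenizer.encode(sentence))
    print(f"{lang:8s} {n_tokens:3d} tokens for {len(sentence.split()):2d} words")
```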

Understanding tokenization is not academic - it is the difference between a model that works and one that fails silently, between affordable deployment and runaway costs.

Foundations: What is Tokenization, Really?

Now that we understand why tokenization matters, let's dig deeper into what makes it challenging and what we are actually trying to optimize for.

The Problem Space

At the lowest level, computers see everything as bytes - sequences of 0s and 1s. When we save a text file containing "hello", the computer stores it as bytes: [104, 101, 108, 108, 111] (UTF-8 encoding). But these bytes don't carry the structure that language models need.
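You can inspect these bytes directly in Python:

```python
# Raw UTF-8 bytes for a few strings; multi-byte characters expand to
# several bytes each.
print(list("hello".encode("utf-8")))  # [104, 101, 108, 108, 111]
print(list("नमस्ते".encode("utf-8")))   # longer: each Devanagari character is 3 bytes
print(list("🌍".encode("utf-8")))      # 4 bytes for a single emoji
```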

Models don't learn patterns over bytes directly. They learn patterns over symbols - discrete units that represent linguistic meaning. The question is: what should these symbols be?

If we choose symbols that are too fine-grained (like individual bytes or characters), we lose linguistic structure. The model has to learn that 'c', 'a', 't' frequently appear together to mean something, rather than being told upfront that "cat" is a meaningful unit.

If we choose symbols that are too coarse-grained (like entire words), we get an explosion of possible symbols and lose the ability to handle anything we haven't seen before.

Here's a deeper challenge: natural language doesn't come pre-segmented into neat units. Unlike a programming language where syntax is explicit and unambiguous, natural language is fluid. Consider the following cases:

Ambiguous Boundaries

Is "New York" one token or two? What about "NewYork" (no spaces) or "new york" (lowercase)? Each appear in real text and ideally they map to similar representations since they refer to the same place.

Morphological Variations

Words like "play", "playing", "played", "player" and "playful" are all related as they share the same root "play". Should each get its own token? Or should we decompose them into "play" + suffixes? The answer affects how efficiently the model learns patterns.

Rare and Unseen Words

During training, we might see "elephant" 10,000 times but "platypus" only five times. During deployment, someone asks about a "quokka" - a word that never appeared in training. How do we represent it? If we can't, the model is blind to it.

Typographical Noise

Real-world text has misspellings, creative spellings ("u r gr8"), and variants ("organize" vs "organise"). A tokenizer that treats each variant as a completely separate unit wastes vocabulary space and loses the connection between them.

Cross-Linguistic Complexity

Some languages (like Chinese) don't use spaces between words. Some (like German) create long compound words by concatenation. Some (like Finnish or Turkish) have rich morphology, where a single word can encode what English expresses in a whole sentence. A tokenizer needs to handle all of this - ideally with a single algorithm and vocabulary.

The problem space, then, is this: given raw text as a sequence of bytes, produce a sequence of Integer IDs that preserves linguistic structure, generalizes to unseen inputs, and uses a finite vocabulary that a model can actually learn from.

Design Goals of a Tokenizer

When we design a tokenizer, we are balancing multiple competing objectives. No tokenizer is perfect - each makes trade-offs. Understanding these goals helps us reason about why different systems make different choices.

Coverage

The first requirement is coverage. Every possible input string must map to some sequence of tokens. This sounds trivial, but it is not. If we use word-level tokenization with a fixed dictionary of 50,000 words, what happens when we encounter the 50,001st word?

Early neural translation systems used a special [UNK] (unknown) token for out-of-vocabulary (OOV) words. But this loses information as both "platypus" and "quokka" become [UNK], so the model cannot distinguish them. Worse, if the unknown word is critical to understanding the sentence ("The [UNK] attacked the village"), the model is blind to the key information.

Subword tokenization achieves perfect coverage by ensuring that in the worst case, any word can be decomposed into individual characters. If "quokka" wasn't in the training data as a whole word, it becomes something like ["qu", "ok", "ka"] - the model doesn't know it is an animal but at least it can represent and process the word.
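Here is a minimal sketch of the two behaviours, using tiny hypothetical vocabularies chosen purely for illustration:

```python
# Tiny hypothetical vocabularies, just to contrast the two behaviours.
WORD_VOCAB = {"the", "quick", "brown", "fox"}
SUBWORD_VOCAB = {"qu", "ok", "ka", "the"} | set("abcdefghijklmnopqrstuvwxyz")

def word_level(word: str) -> list[str]:
    # Anything outside the fixed dictionary collapses to [UNK].
    return [word] if word in WORD_VOCAB else ["[UNK]"]

def subword_level(word: str) -> list[str]:
    # Greedy longest match; single characters guarantee full coverage.
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in SUBWORD_VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:  # nothing matched at all: keep the raw character
            pieces.append(word[i])
            i += 1
    return pieces

print(word_level("quokka"))     # ['[UNK]']        - the word is lost entirely
print(subword_level("quokka"))  # ['qu', 'ok', 'ka'] - still representable
```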

Efficiency

Coverage alone isn't enough; we need efficiency too. Why? Two reasons: memory and computation.

Every token in the vocabulary needs an embedding vector - typically 768 to 1536 dimensions in modern models. If our vocabulary has 100k tokens, and each embedding is 1024 dimensions at 4 bytes per float, that's 400 MB just for the embedding table. Larger vocabularies mean larger models, which means higher cost for training and deployment.

Models process sequences of tokens. If our tokenizer produces 10 tokens per sentence instead of five, we have doubled the sequence length. Attention mechanisms scale as O(n^2) with sequence length, so longer sequences mean quadratically more computation. For a model serving millions of requests per day, this translates directly to hardware cost and latency.
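The arithmetic is worth doing once; a quick sketch using the numbers above:

```python
# Quick sanity check on the numbers from the two paragraphs above.
vocab_size = 100_000
embedding_dim = 1024
bytes_per_float = 4

embedding_table_mb = vocab_size * embedding_dim * bytes_per_float / 1e6
print(f"embedding table: ~{embedding_table_mb:.0f} MB")  # roughly 400 MB

# Attention work grows roughly with the square of sequence length.
for n_tokens in (5, 10):
    print(f"{n_tokens} tokens per sentence -> ~{n_tokens ** 2} units of attention work")
```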

An efficient tokenizer minimizes sequence length without sacrificing coverage. This is why subword tokenization is a sweet spot - common words are single tokens (efficient), rare words are multiple tokens (coverage maintained).

Robustness: Graceful Degradation

A good tokenizer should degrade gracefully when it encounters challenging inputs. If the text has a typo ("recieve" instead of "receive"), the tokenizer should not break. It should produce reasonable tokens that still allow the model to make sense of the sentence.

Robustness also means handling different domains and languages without catastrophic failure. A tokenizer trained primarily on English Wikipedia should still work - albeit less efficiently - on English Twitter slang, English legal documents, or even code. It might not be optimal, but it should be usable.

Byte-level tokenization (which we will cover in Part 2) is the ultimate robustness strategy - by operating on UTF-8 bytes directly, it can handle any text in any language, plus emoji, symbols, binary data, or even malformed Unicode. There's nothing that can break it.

Learnability: Statistical Consistency

Finally, a tokenizer should create tokens with statistically consistent patterns that a model can learn from. If the tokenization makes arbitrary or inconsistent choices, it makes the model's job harder.

Consider two sentences:

  • "The economy is strong"
  • "The economy's strength"

Ideally, "strong" and "strength" should tokenize in a way that reveals their relationship. Perhaps "strength" becomes ["strong", "th"] or both share a common root token. If they tokenize as completely unrelated fragments, the model has to learn their connections from scratch using only distributional statistics.

This is where frequency-based tokenization shines. By merging character pairs that appear frequently together, we create tokens that capture real linguistic patterns. The token "ing" appears constantly in English (walking, talking, singing), so it gets its own ID. The sequence "xq" almost never appears, so it doesn't. This aligns vocabulary with the statistical structure of language, making patterns easier for the model to learn.
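The statistic behind frequency-based merging is simple to compute. Here is a sketch of counting adjacent character pairs over a toy corpus - the first step of BPE, whose full algorithm comes in Part 2:

```python
# Counting adjacent character pairs over a toy corpus - the statistic that
# drives frequency-based merges like BPE (full algorithm in Part 2).
from collections import Counter

# Toy corpus; in practice this would be millions of words.
corpus = ["walking", "talking", "singing", "walked", "talked"]

pair_counts = Counter()
for word in corpus:
    symbols = list(word)                       # start from individual characters
    for a, b in zip(symbols, symbols[1:]):
        pair_counts[(a, b)] += 1

# Pairs inside the "-ing" suffix land near the top of the list, which is why
# "ing" quickly earns its own token, while a pair like ('x', 'q') never appears.
print(pair_counts.most_common(5))
```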

First Principles View: Tokenization as Compression

Tokenization is lossy compression of text.

The above is a key insight that unifies everything.

Think about it. We start with infinite possible strings (every combination of Unicode characters). We compress this infinite space down to sequences drawn from a finite vocabulary of 30k-50k tokens. Information is necessarily lost in this compression. Specifically, we lose information about rare patterns.

This is exactly analogous to image compression. JPEG compresses images by keeping the low-frequency structure that dominates what we see and discarding the fine, high-frequency detail we barely notice. The compression is lossy - we can't perfectly reconstruct the original - but it preserves what matters for human perception.

Tokenization does the same thing for language. Common words like "the", "is" are stored as single tokens - no compression loss. Rare words like "quokka" are decomposed into subwords with some information loss (the model doesn't immediately know it's a single word), but the content is still representable.

The compression ratio is determined by vocabulary size:

  • Small vocabulary (8k tokens): more compression, longer sequences, more information loss on rare patterns.
  • Large vocabulary (100k tokens): less compression, shorter sequences, but harder for model to learn (sparse embeddings).

The vocabulary size is typically 30k-50k tokens because this balances compression efficiency with learnability. Empirically, this is where we get the best trade-off between sequence length and model performance across diverse tasks.

Trade-off Between Resolution and Efficiency

Just like choosing JPEG quality settings, choosing vocabulary size and tokenization strategy is about deciding what level of detail we need. If we are building a model for medical text with lots of technical terminology, we might want a larger, more specialized vocabulary. If we are building a multilingual model, we want a vocabulary that efficiently covers multiple languages rather than one over-optimized for English.

Tokenization isn't Just Preprocessing

The compression scheme we choose determines what patterns are easy vs. hard for the model to learn. If "strawberry" compresses to ["straw", "berry"], the model has to learn to compose those fragments to understand the fruit. If it were a single token, that composition is unnecessary. The tokenizer has made a choice about what the model needs to learn, and that choice has consequences.

Understanding Tokenization Debt

Just as technical debt refers to future costs incurred by choosing quick solutions over better approaches, tokenization debt refers to the long-term constraints created by tokenization choices made early in model development.

Once we have trained a model with a specific tokenizer:

We are locked in. Changing the tokenizer requires retraining the entire model from scratch. The embedding table, which maps token IDs to vectors, is specific to that vocabulary. Add a new token? We need a new embedding. Merge tokens? We need to retrain how those fragments compose. For models costing millions of dollars to train, this isn't a trivial decision.

Costs compound over time. If our tokenizer fragments domain-specific terms inefficiently, every single inference request pays the penalty in more tokens, more compute, more API charges. At scale, a 20% tokenization inefficiency means 20% higher infrastructure costs forever, until we retrain.

Fairness issues persist. If our tokenizer treats English efficiently but fragments other languages, that bias is baked into the model. Non-English users experience worse latency, higher costs, and reduced context window — and this can't be fixed without retraining with a better tokenizer.

Capability gaps emerge. Tasks that require character-level reasoning (counting, spelling, anagrams) will remain hard if our tokenizer never exposes individual characters as atomic units. We can't fix this with better training data or more parameters — the representational capacity simply isn't there.

Migration is painful. Imagine we have deployed a model in production with thousands of users, fine-tuned versions for specific customers, and cached embeddings for fast inference. Changing the tokenizer means:

  • Invalidating all cached embeddings.
  • Retraining all fine-tuned models.
  • Updating client applications that depend on specific token IDs.
  • Managing backward compatibility during rollout.

This is why tokenization is representation design, not preprocessing. The choices we make at this layer propagate through our entire system and become increasingly expensive to change over time.

The lesson: Invest time upfront to get tokenization right. Measure tokens-per-sentence on your target domain. Test on multiple languages if you plan to be multilingual. Understand whether your application needs character-level reasoning. The hour you spend analyzing tokenization before training can save months of expensive retraining or permanently constrained capabilities.

Now that we understand the problem space, the design goals, the compression lens, and the long-term implications of tokenization choices, we're ready to explore the actual algorithms in Part 2.


What's Next?

In Part 2, we will implement the major tokenization algorithms from scratch:

  • Character-level and word-level tokenization (with the OOV problem).
  • Byte Pair Encoding (BPE) and byte-level BPE.
  • Probabilistic methods: WordPiece (PMI scoring) and Unigram Language Models (EM algorithm).

We will see complete working code, understand the mathematical foundations, and learn when to use each approach.

Continue to Part 2: Algorithms from BPE to Unigram →


👉 Subscribe to my newsletter The Main Thread to get weekly emails for byte-sized content on topics like this. Namaste!

References

  1. Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units (BPE). https://arxiv.org/abs/1508.07909
  2. Devlin, J., et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805
  3. OpenAI Tokenizer Tool. https://platform.openai.com/tokenizer

Written by Anirudh Sharma

Published on November 10, 2025
