Understanding Embeddings: Part 4 - The Limits of Static Embeddings
Table of Contents
This is Part 4 of a 4-part series on Embeddings:
- Part 1: Why Embeddings Exist
- Part 2: Learning Meaning From Context
- Part 3: Global Statistics and GloVe
- Part 4: The Limits of Static Embeddings ← You are here
Hey curious engineers, welcome to the fourth and final part of this series on embeddings.
In Part 3, we explored how GloVe learns embeddings from global co-occurrence statistics.
We saw that Word2Vec and GloVe are two paths to the same destination: both produce embeddings where dot products approximate PMI, capturing semantic relationships through geometric structure.
The results seemed remarkable. Word analogies worked: king − man + woman ≈ queen. Similarity search worked: the nearest neighbors to "dog" were {puppy, cat, pet, animal}. Downstream tasks improved: sentiment analysis, named entity recognition, and document classification all benefited from pre-trained embeddings.
But beneath this success lurked a fundamental problem: one that wouldn't become fully apparent until researchers tried to push embeddings further.
The problem is deceptively simple: every word gets exactly one vector.
"Dog" gets one embedding. "Cat" gets one embedding. "Bank" gets... one embedding. Always the same embedding, regardless of context.
And this is where things break.
Core Realization: Static embeddings solved the wrong level of the problem. They captured word-level distributional patterns perfectly.
But language operates at the sentence level, where meaning depends on context. One vector per word cannot represent one meaning per usage.
This post explores why static embeddings hit a fundamental ceiling, why this limitation is structural (not accidental), and what this failure teaches us about the nature of meaning itself.
We won't introduce the solution yet (that's transformers, BERT, contextual embeddings). Instead, we will understand the problem deeply enough that the need for a new approach becomes inevitable.
One Word, Many Meanings
Let's start with the problem that breaks static embeddings: polysemy, i.e., one word, multiple meanings.
The "Bank" Problem
Consider the word "bank". In English, it has at least two common meanings:
Meaning 1: Financial institution
- I deposited money at the bank.
- The bank approved my loan application.
- She works for an investment bank.
Meaning 2: River edge
- We sat on the bank of the river.
- The boat ran aground on the muddy bank.
- Wildflowers grew along the bank.
These are completely different concepts. The financial institution has nothing to do with river edges. They share a word form by historical accident (both derive from different roots that converged in English), not because of semantic relatedness.
Question: What embedding does Word2Vec learn for "bank"?
Let's think through the training process. During corpus scanning, Word2Vec encounters both contexts:
Context 1: "deposited money at the bank approved"
Context 2: "sat on the bank of the"
Context 3: "works for an investment bank in"
Context 4: "along the bank grew wildflowers"
...
Skip-Gram tries to learn an embedding that predicts:
- Financial contexts: {money, deposited, loan, investment, approved}
- River contexts: {river, water, edge, shore, muddy, wildflowers}
The model faces an impossible optimization problem. It must find a single vector that simultaneously:
- Has high dot product with "money", "loan", "investment"
- Has high dot product with "river", "water", "shore"
But "money" and "river" are semantically unrelated. Their embeddings point in different directions. There is no single direction in embedding space that aligns with both.
To overcome this dilemma, the model compromises. It learns an averaged embedding that sits somewhere between the two meanings, not quite aligning with either.
This averaged vector:
- Is moderately similar to "money" (but less than it should be for the financial sense)
- Is moderately similar to "river" (but less than it should be for the river sense)
- Doesn't accurately represent either meaning
Why One Vector Cannot Represent Both
Let's make this concrete with a visualization. Imagine a 2D semantic space with two dimensions:
- Dimension 1 (x-axis): Finance-related (high values = financial concepts)
- Dimension 2 (y-axis): Nature-related (high values = natural/geographic concepts)
Where should different words lie?
"money": high finance, low nature → (0.9, 0.1)
"loan": high finance, low nature → (0.8, 0.2)
"river": low finance, high nature → (0.1, 0.8)
"shore": low finance, high nature → (0.2, 0.7)"money": high finance, low nature → (0.9, 0.1)
"loan": high finance, low nature → (0.8, 0.2)
"river": low finance, high nature → (0.1, 0.8)
"shore": low finance, high nature → (0.2, 0.7)Where should "bank" lie?
- For financial usage: should be near "money", "loan" → (0.85, 0.15)
- For river usage: should be near "river", "shore" → (0.15, 0.75)
But we only get one vector. The model averages:
v_bank ≈ ((0.85, 0.15) + (0.15, 0.75)) / 2 = (0.5, 0.45)
This embedding sits in the middle of the space, roughly equidistant from both clusters. When we compute similarity:
cos(v_bank, v_money) ≈ 0.81
cos(v_bank, v_river) ≈ 0.76
Neither is as high as a dedicated sense vector would score (a purely financial "bank" vector would reach ≈ 0.99 against "money"). The embedding is a muddy average that doesn't represent either sense faithfully.
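To see the compromise numerically, here is a minimal NumPy sketch using the toy 2D coordinates above (these vectors are illustrative stand-ins, not real embeddings):

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

money       = np.array([0.9, 0.1])    # high finance, low nature
river       = np.array([0.1, 0.8])    # low finance, high nature
bank_fin    = np.array([0.85, 0.15])  # ideal vector for the financial sense
bank_river  = np.array([0.15, 0.75])  # ideal vector for the river sense
bank_static = (bank_fin + bank_river) / 2   # what one-vector-per-word forces

print(cos(bank_fin, money))      # ~0.998  a dedicated sense vector is nearly perfect
print(cos(bank_static, money))   # ~0.81   the averaged vector is noticeably worse
print(cos(bank_static, river))   # ~0.76   and not great for the river sense either
```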
Static embeddings collapse multiple word senses into a single averaged vector. "Bank" must simultaneously represent financial and river meanings, resulting in a vector that accurately captures neither.
How Word2Vec and GloVe Collapse Meaning
This isn't specific to "bank". Every polysemous word suffers the same fate:
"Bat":
- Animal (mammal that flies) → contexts:
{cave, wings, nocturnal, vampire} - Sports equipment (used in cricket) → contexts:
{cricket, swing, hit, pull, stroke} - Embedding: averaged, not great for either
"Spring":
- Season (after winter) → contexts:
{flowers, warm, April, bloom} - Coiled metal (bounces) → contexts:
{coil, bounce, mattress, tension} - Water source (natural) → contexts:
{water, fountain, source, fresh} - Embedding: three-way average, terrible for all
"Run":
- To jog (physical motion) → contexts:
{fast, sprint, marathon, exercise} - To operate (machines) → contexts:
{computer, program, execute, process} - To manage (organizations) → contexts:
{company, business, lead, manage} - Embedding: highly ambiguous
The problem scales with frequency. Common words tend to have more meanings (because they get extended to new contexts over time). So the words we encounter most often and care most about disambiguating are precisely the ones that suffer most from averaging.
The Training Process Guarantees This
Let's trace through how GloVe (or Word2Vec) processes "bank":
Step 1: Count co-occurrences
X[bank, money] = 4,523 (from financial contexts)
X[bank, loan] = 3,891
X[bank, river] = 2,134 (from geographic contexts)
X[bank, shore] = 1,567
The co-occurrence matrix treats all instances of "bank" as the same word. It doesn't distinguish:
- "bank" in "deposited at the bank"
- "bank" in "sat on the bank"
They all increment the same row: X[bank, :].
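Here is a minimal sketch of that counting step, assuming a tiny two-sentence corpus and a window of 3 (the corpus, window size, and counts are illustrative, not the real GloVe pipeline):

```python
from collections import defaultdict

# Toy corpus: one financial and one geographic use of "bank".
corpus = [
    "i deposited money at the bank yesterday".split(),
    "we sat on the bank of the river".split(),
]

window = 3
X = defaultdict(int)  # co-occurrence counts keyed by (word, context_word)

for sentence in corpus:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                X[(word, sentence[j])] += 1

# Every occurrence of "bank" increments the same row, regardless of sense.
print({pair: n for pair, n in X.items() if pair[0] == "bank"})
# {('bank', 'money'): 1, ('bank', 'at'): 1, ('bank', 'the'): 3, ('bank', 'yesterday'): 1,
#  ('bank', 'sat'): 1, ('bank', 'on'): 1, ('bank', 'of'): 1, ('bank', 'river'): 1}
```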
Step 2: Factorize
GloVe minimizes:
J = Σ_{i,j} f(X[i,j]) · (w_i · w̃_j + b_i + b̃_j − log X[i,j])²
This objective forces w_bank to simultaneously satisfy:
w_bank · w̃_money + b_bank + b̃_money ≈ log X[bank, money]   (high, from financial contexts)
w_bank · w̃_river + b_bank + b̃_river ≈ log X[bank, river]   (high, from river contexts)
But w̃_money and w̃_river point in essentially unrelated directions (the concepts are orthogonal). The optimization finds a compromise vector that partially satisfies both constraints but fully satisfies neither.
Step 3: Result
The learned embedding is a weighted average of the ideal embeddings for each sense, weighted by their corpus frequencies. If financial senses appear 60% of the time and river senses 40%:
v_bank ≈ 0.6 · v_bank(financial) + 0.4 · v_bank(river)
The more frequent sense dominates, but both are diluted. The less frequent sense (river) is underrepresented, while the more frequent sense (financial) is contaminated by the rare sense.
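Continuing the toy 2D example, a rough sketch of that frequency-weighted compromise (the 60/40 split and the sense vectors are assumptions for illustration):

```python
import numpy as np

bank_fin   = np.array([0.85, 0.15])   # ideal vector for the financial sense
bank_river = np.array([0.15, 0.75])   # ideal vector for the river sense

# If 60% of training occurrences are financial and 40% geographic,
# the single learned vector ends up roughly frequency-weighted:
bank_static = 0.6 * bank_fin + 0.4 * bank_river
print(bank_static)   # ~[0.57, 0.39] -- closer to finance, but faithful to neither sense
```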
This is not a bug. This is not even a failure of optimization. This is the inevitable consequence of assigning one vector per word type, regardless of word usage.
Why This Failure Is Structural, Not Accidental
The "bank" problem might seem like an edge case, a quirk of English where unrelated words happen to share the same spelling. But the failure runs deeper.
Context Is Discarded During Training
Let's revisit how Word2Vec learns embeddings. Recall from Part 2 that Skip-Gram processes sliding windows:
1Sentence: "I deposited money at the bank yesterday"
2Window (center = "bank", size = 2):
3 Context: {money, at, the, yesterday}
4
5Training signal:
6 Maximize P(money | bank)
7 Maximize P(at | bank)
8 Maximize P(the | bank)
9 Maximize P(yesterday | bank)1Sentence: "I deposited money at the bank yesterday"
2Window (center = "bank", size = 2):
3 Context: {money, at, the, yesterday}
4
5Training signal:
6 Maximize P(money | bank)
7 Maximize P(at | bank)
8 Maximize P(the | bank)
9 Maximize P(yesterday | bank)Notice what's not in the training signal: the other context words when predicting each individual word. When the model predicts , it doesn't know that "deposited" also appeared nearby. The context words are used as independent targets, not as disambiguating information.
This is the fundamental information bottleneck: the model sees that "bank" appears near "money" and "river", but it never learns when to use which meaning because context is thrown away before the embedding is retrieved.
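A short sketch of how Skip-Gram generates its training pairs makes the bottleneck visible: each pair is an isolated example, with no record of the rest of the window (the helper below is illustrative, not Word2Vec's actual implementation):

```python
def skipgram_pairs(sentence, window=3):
    """Generate (center, context) training pairs the way Skip-Gram does."""
    pairs = []
    for i, center in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

sentence = "i deposited money at the bank yesterday".split()
for pair in skipgram_pairs(sentence):
    if pair[0] == "bank":
        print(pair)
# ('bank', 'money'), ('bank', 'at'), ('bank', 'the'), ('bank', 'yesterday')
# Each pair is trained as an independent example: when asked to predict
# "yesterday" from "bank", the model no longer sees that "money" was nearby.
```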
Compare to how humans understand "bank":
Sentence A: "I deposited money at the bank."
- We see "deposited" and "money" → conclude financial sense
Sentence B: "We sat on the bank of the river."
- We see "river" and "sat on" → conclude geographic sense
The difference: Humans use context to disambiguate. We compute meaning conditionally: meaning of "bank" given surrounding words.
Static embeddings compute meaning unconditionally: the embedding for "bank" is fixed, independent of context. There is no mechanism to say "in this sentence, 'bank' means X."
Averaging Destroys Specificity
Even methods that use context during training (like CBOW) suffer the same failure. CBOW averages the context word embeddings:
v_context = (v_money + v_at + v_the + v_yesterday) / 4
Then it predicts the center word from this average. But averaging is lossy:
- If the context is {money, at, the, yesterday}, the average is dominated by the function words {at, the}, which appear everywhere and contribute little semantic information
- The informative word "money" is diluted to 1/4 of its weight
- The resulting average loses specificity
Worse, during training on the river sense the model averages a different context:
v_context = (v_sat + v_on + v_the + v_of + v_river) / 5
Both averaged context vectors are used to predict the same target: "bank". The model learns that something shared by both averages predicts "bank".
This "something" is the intersection of the two contexts, which is mostly function words like "the", "of", "at". The distinctive semantic content ("money" vs "river") is washed out by averaging.
Global Statistics Erase Local Nuance
GloVe's global co-occurrence matrix makes the problem even clearer. The matrix has one row per word, so "bank" gets a single row X[bank, :].
This row aggregates counts across all occurrences of "bank" in the corpus:
- "bank" in "investment bank" → increments X[bank, investment]
- "bank" in "river bank" → increments X[bank, river]
- Both counts go into the same row
The matrix has no way to say: "When 'bank' appears with 'money', it should have a different representation than when it appears with 'river'." All occurrences are pooled.
This pooling is efficient (count once, use globally), but it fundamentally cannot capture polysemy. The co-occurrence counts are sense-agnostic: they don't distinguish which usage contributed to which count.
When GloVe factorizes this matrix, it learns embeddings that reconstruct the aggregate statistics, not the conditional statistics. The embedding for "bank" predicts the average co-occurrence across all senses, not the specific co-occurrence for each sense.
Key Message: Static Embeddings Must Collapse Context
This is not a limitation of Word2Vec or GloVe specifically. It's a limitation of the static embedding paradigm: one vector per word type.
The fundamental assumptions:
- Each word type (vocabulary entry) gets exactly one embedding
- This embedding is fixed (does not change based on context)
- The embedding is learned from aggregate corpus statistics
These assumptions guarantee that polysemous words will be poorly represented. There is no way around it within the static framework.
To fix this, we would need:
- Multiple embeddings per word (one per sense)
- Context-dependent selection (choose embedding based on sentence)
- Dynamic computation (compute representation on-the-fly from context)
But then we are no longer doing static embeddings; we are doing something fundamentally different. We will come back to this.
Static embeddings are structurally incapable of handling polysemy.
Subword Embeddings: A Partial Fix
Before declaring static embeddings a complete failure, let's examine an improvement that emerged around 2017: subword embeddings, popularized by Facebook's FastText.
The idea: instead of learning embeddings for whole words, learn embeddings for character n-grams, then compose them to represent words.
Why Subwords Help with Morphology
Consider the word "playing". In traditional Word2Vec:
"playing"gets one embedding:"play"gets a separate embedding:"played"gets yet another:
If "playing" rarely appears in the corpus, its embedding is poorly trained. The model doesn't know that "playing" and "play" are related.
FastText's approach: Represent "playing" as the sum of character n-gram embeddings:
"playing" = <pl + pla + lay + ayi + yin + ing + ng> + <playing>
Where:
- <pl, pla, etc. are character 3-grams (substrings of length 3, with < and > marking the word boundaries)
- <playing> is the whole word (optional, for frequent words)
- Each n-gram has a learned embedding
The final embedding is the sum of the n-gram embeddings:
v_playing = v(<pl) + v(pla) + v(lay) + v(ayi) + v(yin) + v(ing) + v(ng>) + v(<playing>)
Why this helps:
1. Morphological sharing: The n-gram "ing" appears in "playing", "running", "jumping". All these words share the embedding for "ing", which captures the progressive aspect.
2. Rare word generalization: If "jogger" is rare but "jog" and "-er" words are common, FastText composes an embedding from known parts: v_jogger ≈ v(<jo) + v(jog) + v(ogg) + v(gge) + v(ger) + v(er>), where <jo and jog are shared with "jog" and er> is shared with other agent nouns. The model can make a reasonable guess even for unseen words.
3. Out-of-vocabulary robustness: For a completely new word like "COVID-19", traditional embeddings fail (the word is not in the vocabulary). FastText composes an embedding from character n-grams such as {<CO, COV, OVI, VID, ID-, D-1, -19, 19>}, giving a non-random embedding.
This is genuinely useful for morphologically rich languages (German, Turkish, Finnish) where words have many inflected forms. It is also helpful for domain-specific jargon and typos.
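A minimal sketch of the n-gram decomposition (real FastText uses n-grams of lengths 3 to 6 and hashes them into a fixed-size table; this simplified version shows only 3-grams):

```python
def char_ngrams(word, n=3):
    """Character n-grams with FastText-style word boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("playing"))
# ['<pl', 'pla', 'lay', 'ayi', 'yin', 'ing', 'ng>']
print(set(char_ngrams("playing")) & set(char_ngrams("running")))
# {'ing', 'ng>'} (set order may vary) -- the shared suffix n-grams carry the "-ing" pattern
```

The word vector is then the sum of the embeddings of these n-grams (plus the whole-word embedding, when present), which is how the sharing across "playing", "running", "jumping" arises.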
FastText represents words as sums of character n-gram embeddings. "playing" = ⟨pl⟩ + ⟨pla⟩ + ⟨lay⟩ + ... + ⟨ing⟩. Morphological patterns like "-ing" are shared across words, enabling generalization.
What Subwords Still Can't Fix
Even subword embeddings do not solve polysemy. Let's see why.
Example: "bank"
FastText represents "bank" as:
This is still one fixed vector. The character n-grams {ba, ban, ank, nk} are the same whether "bank" appears in:
"deposited at the bank"(financial)"sat on the bank"(river)
The n-grams have no semantic content related to finance vs. geography. They capture orthographic patterns (spelling), not meaning.
Result: The FastText embedding for "bank" is still an averaged representation that collapses both senses, just like Word2Vec. It's just that now the averaging is happening at the n-gram level.
Polysemy Remains Unsolved
Subword embeddings help with:
- Morphological variation ("play" → "playing" → "played")
- Rare word composition ("unhappiness" = "un-" + "happy" + "-ness")
- Typo robustness ("happyness" ≈ "happiness" due to shared n-grams)
Subword embeddings do not help with:
- Semantic ambiguity ("bank" in finance vs. river contexts)
- Homonyms ("bat" as animal vs. sports equipment)
- Context-dependent meaning ("run" as jog vs. operate vs. manage)
These are fundamentally different problems.
Why Doesn't Composition Help?
One might think: "Can't we compose embeddings differently based on context? Use "bank" + "money" for financial sense, "bank" + "river" for geographic sense?"
This is a step in the right direction (and foreshadows contextual embeddings), but it doesn't work within the FastText framework for the following reasons:
1. Fixed composition: FastText uses simple summation. There's no mechanism to weight n-grams differently based on context. The embedding is a fixed sum of n-gram vectors that is independent of surrounding words.
2. No context in the model: During training, FastText still uses Skip-Gram or CBOW objectives. It predicts context from words or vice versa, but the word embedding doesn't change based on the context being predicted. The composition happens at retrieval time, after context is discarded.
To truly fix polysemy, we would need context-aware composition that computes the embedding for "bank" as a function of the surrounding words {deposited, money, at, the} or {sat, on, river}.
But that's a fundamentally different architecture; it is no longer static embeddings. It's what we will call contextual embeddings (though we are not introducing them yet).
Where Subwords Shine
Despite not fixing polysemy, subword embeddings are valuable for practical applications:
1. Machine translation: Morphologically rich languages benefit from shared subword structure. Translating "unglaublich" (German: unbelievable) is easier if the model knows "un-" (negation), "glaub" (believe), "-lich" (adjective suffix).
2. Named entity recognition: New entity names ("SpaceX", "ChatGPT") appear constantly. Subword embeddings provide reasonable representations for unseen names based on character patterns.
3. Low-resource languages: When training data is scarce, subword sharing helps generalize across related words.
4. Social media text: Typos, slang, creative spellings ("coooool", "happppy") are handled gracefully because character n-grams overlap with standard spellings.
Key point: Subword embeddings are an orthogonal improvement to static embeddings. They help with morphology and rare words, but they don't address the fundamental limitation: one vector per word type, regardless of usage.
Embeddings Are Inputs, Not Understanding
Before we discuss what comes next, it is crucial to reset expectations about what embeddings actually do and what they don't.
Embeddings Don't Reason
Embeddings encode distributional similarity, i.e., which words appear in similar contexts. This captures useful semantic information (synonyms cluster together, analogies work), but it's not reasoning.
Example: Causal inference
Consider these statements:
1. It rained, so the ground is wet.
2. The ground is wet, so it rained.
A reasoning system should know:
- Statement 1 is valid causal inference (rain causes wetness)
- Statement 2 is invalid (wet ground doesn't cause rain; could be from sprinklers, snow, etc.)
What do embeddings tell us?
"rain"and"wet"have high cosine similarity (they co-occur frequently)- The direction of causation is invisible
Embeddings cannot distinguish correlation from causation. They encode co-occurrence, not causality.
Example: Logical consistency
Consider:
1. All birds can fly.
2. Penguins are birds.
3. Penguins can fly. (false conclusion)
Embeddings might show:
"bird"and"fly"are similar (high dot product)"penguin"and"bird"are similar
But embeddings have no mechanism for logical deduction. They can't derive statement 3 from statements 1 and 2, nor can they recognize that statement 3 is false (penguins don't fly).
The geometry of embeddings represents association, not logical entailment.
They Don't Model Syntax or Compositional Semantics
Embeddings for individual words don't tell us how to combine them into sentences.
Example: Word order
Consider:
1. The dog chased the cat.
2. The cat chased the dog.
If we represent each sentence as the average of its word embeddings:
v(sentence 1) = (v_the + v_dog + v_chased + v_the + v_cat) / 5
v(sentence 2) = (v_the + v_cat + v_chased + v_the + v_dog) / 5
These are identical (averaging is commutative: order doesn't matter). But the sentences have opposite meanings!
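A quick sketch confirming the point: with any fixed per-word vectors, the two sentence averages come out identical (the random vectors below are placeholders for real embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy embeddings; the argument holds for any fixed per-word vectors.
vecs = {w: rng.normal(size=4) for w in ["the", "dog", "chased", "cat"]}

def sentence_avg(words):
    return np.mean([vecs[w] for w in words], axis=0)

s1 = "the dog chased the cat".split()
s2 = "the cat chased the dog".split()

print(np.allclose(sentence_avg(s1), sentence_avg(s2)))  # True -- averaging is order-blind
```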
Embeddings alone cannot capture syntax. They don't know:
- Subject vs. object
- Active vs. passive voice
- Temporal order (before vs. after)
To model syntax, we need additional structure: parse trees, dependency graphs, or sequence models (RNNs, transformers) that preserve word order.
Example: Negation
1. I like dogs. (positive sentiment)
2. I don't like dogs. (negative sentiment)
If we average word embeddings, the only difference between the two sentence vectors is the added contribution of "don't":
v(sentence 2) − v(sentence 1) ≈ v_don't / 4  (plus a small rescaling of the other words)
The embedding for "don't" shifts the sentence vector, but there's no inherent "flipping" operation. The model needs to learn that "don't" inverts sentiment, but embeddings provide no privileged mechanism for this.
Negation requires compositional semantics: understanding how logical operators combine with predicates. Embeddings are just vectors; they don't have compositional structure.
They Encode Correlation, Not Meaning
Embeddings learn that "king" and "queen" are related, but they don't know:
- Why they are related (both are monarchs)
- What a monarch is (a hereditary ruler)
- How monarchies work (systems of governance)
They know the pattern (these words co-occur with similar words), but not the grounding (what these words refer to in the world).
Recall from Part 1: We discussed the Symbol Grounding Problem: how symbols (words) connect to their referents (objects, concepts, actions in the world). Embeddings don't solve this. They remain ungrounded symbols, pointing to other symbols in a web of correlations.
An embedding model trained purely on text has never seen a dog, touched ice, or heard music. It knows:
"dog"appears with"bark","pet","tail""ice"appears with"cold","frozen","water""music"appears with"sound","melody","rhythm"
But it doesn't know:
- The sensory experience of a dog barking (auditory)
- The tactile sensation of ice (cold, slippery)
- The emotional response to music (joy, sadness)
This is why modern multimodal models (CLIP, DALL-E, Gemini) train on text and images to ground linguistic representations in perceptual experience.
Embeddings from text alone remain symbolic, not experiential.
Preventing Future Hype Confusion
It is tempting to anthropomorphize embeddings: "The model knows that 'king' and 'queen' are related!" But precision matters.
What embeddings actually do:
- Compress co-occurrence statistics into dense vectors
- Preserve distributional similarity geometrically
- Enable efficient similarity search and clustering
- Provide useful features for downstream tasks
What embeddings don't do:
- Understand causality or logical reasoning
- Model syntax or compositional meaning
- Ground language in perceptual or physical reality
- Perform inference beyond pattern matching
Embeddings are incredibly useful statistical artifacts, and they revolutionized NLP, but they are not understanding in any deep sense.
The key insight: Embeddings don't provide the reasoning themselves. They are just inputs to models that perform reasoning.
They provide a starting representation, which downstream models (classifiers, sequence models, transformers) then process.
The success of embeddings taught us that good representations matter. But representation is not comprehension.
The next challenge was building models that could use representations to perform actual language understanding tasks: reading comprehension, translation, question answering.
This sets up the need for architectures beyond static embeddings, the need for models that can combine representations dynamically, attend to relevant context, and reason over sequences. But we are not there yet.
The Inevitable Next Question
We have established that static embeddings have a fundamental limitation: one vector per word, regardless of context. This means polysemy cannot be handled, syntax is invisible, and compositional meaning is lost.
But we have also seen what embeddings get right: distributional similarity, semantic relatedness, and efficient dense representations.
The natural question now is: can we keep what works (vector representations, geometric similarity) while fixing what's broken (context-independence, polysemy)?
If Context Matters, How Do We Keep It?
Let's think through what a better system might look like.
Observation 1
When humans read "I deposited money at the bank", we don't retrieve a fixed meaning for "bank". We compute its meaning in context.
We see "deposited" and "money" so we activate our financial sense and suppress our river sense because it is not relevant here.
Observation 2
Different context words have different importance. In "I deposited money at the bank yesterday":
- "deposited" and "money" are highly informative (they disambiguate "bank")
- "at" and "the" are uninformative (they appear everywhere)
- "yesterday" is moderately informative (temporal, but not sense-disambiguating)
Current approaches (CBOW, FastText) average all context words equally:
v_context = (v_deposited + v_money + v_at + v_the + v_yesterday) / 5
This dilutes signal (informative words) with noise (function words).
What if we could weight them?
v_context = w_1·v_deposited + w_2·v_money + w_3·v_at + w_4·v_the + w_5·v_yesterday
Essentially, we give high weight to the informative words ("deposited", "money") and low weight to the function words ("at", "the"). This weighted average should preserve the signal.
But how do we determine the weights? We need a mechanism that:
- Looks at all context words
- Decides which are relevant for the current word
- Weights them accordingly
This sounds like... attention. But we are not introducing attention yet.
We are just recognizing that learned, context-dependent weighting would be valuable.
What If Embeddings Depended on Surrounding Words?
Here's a more radical idea: instead of retrieving a fixed embedding for "bank", we compute the embedding dynamically based on context.
Static approach (current):
Sentence: "I deposited money at the bank yesterday"
Embedding for "bank": v_bank = [0.23, -0.41, 0.08, ...] (fixed)Sentence: "I deposited money at the bank yesterday"
Embedding for "bank": v_bank = [0.23, -0.41, 0.08, ...] (fixed)Dynamic approach (hypothetical):
1Sentence: "I deposited money at the bank yesterday"
2Embedding for "bank":
3 f(bank, [deposited, money, at, the, yesterday]) = [0.89, 0.12, -0.34, ...]
4
5Sentence: "We sat on the bank of the river"
6Embedding for "bank":
7 f(bank, [sat, on, the, of, river]) = [0.15, 0.76, 0.21, ...]1Sentence: "I deposited money at the bank yesterday"
2Embedding for "bank":
3 f(bank, [deposited, money, at, the, yesterday]) = [0.89, 0.12, -0.34, ...]
4
5Sentence: "We sat on the bank of the river"
6Embedding for "bank":
7 f(bank, [sat, on, the, of, river]) = [0.15, 0.76, 0.21, ...]The function computes the embedding conditioned on context. The same word "bank" gets different embeddings in different sentences.
How might f work?
One simple idea: start with a base embedding v_bank, then adjust it based on the context:
f(bank, context) = v_bank + Δ(context)
Where Δ(context) is a learned function that shifts the base embedding in a context-specific direction.
Example:
Base embedding: v_bank = [0.5, 0.45] (average of both senses)

Financial context: Δ = [+0.4, -0.3] (shift toward finance)
Contextual embedding: [0.9, 0.15] (now close to "money")

River context: Δ = [-0.35, +0.3] (shift toward geography)
Contextual embedding: [0.15, 0.75] (now close to "river")
This is no longer a lookup table.
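As a toy illustration of such an f, here is a hand-coded stand-in that shifts the base vector by a context-dependent Δ. In a real system the shift function would be learned; the word lists and shift values below are assumptions that merely reproduce the numbers above:

```python
import numpy as np

# Hypothetical sketch: a context-dependent embedding as base vector + shift.
v_base = np.array([0.5, 0.45])          # static "bank" vector (average of both senses)
finance_words = {"deposited", "money", "loan", "investment"}
nature_words  = {"river", "shore", "water", "sat"}

def contextual_bank(context):
    """Return a context-shifted embedding for "bank" (hand-coded, not learned)."""
    if finance_words & set(context):
        delta = np.array([+0.4, -0.3])   # shift toward the finance region
    elif nature_words & set(context):
        delta = np.array([-0.35, +0.3])  # shift toward the nature region
    else:
        delta = np.zeros(2)
    return v_base + delta

print(contextual_bank(["deposited", "money", "at", "the", "yesterday"]))  # ~[0.9, 0.15]
print(contextual_bank(["sat", "on", "the", "of", "river"]))               # ~[0.15, 0.75]
```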
What If Importance Was Learned, Not Averaged?
Going further down this road, we ask: what if the model learned which words to attend to, rather than treating all context words equally or hand-designing the weights?
The current averaging approach works like this:
v_context = (1/N) · (v_1 + v_2 + ... + v_N)
The proposed (learned weighting) approach might work something like this:
v_context = α_1·v_1 + α_2·v_2 + ... + α_N·v_N
Where α_i are learned weights (not fixed, not uniform).
How do we compute α_i?
The weights should depend on:
- The word being represented ("bank")
- Each context word ("deposited", "money", etc.)
- Their relationship (is "deposited" relevant for disambiguating "bank"?)
This suggests a similarity or compatibility score between the target word and each context word, normalized to sum to 1:
α_i = exp(score(bank, context_i)) / Σ_j exp(score(bank, context_j))
(This is softmax over compatibility scores.)
What is the score function?
It could be a dot product:
score(bank, context_i) = v_bank · v_context_i
Or a learned function:
score(bank, context_i) = g(v_bank, v_context_i), where g is a small neural network with trainable parameters
This mechanism:
- Automatically discovers which context words are informative
- Weights them accordingly
- Computes context-sensitive representations
Now, we are describing attention. But notice that we arrived here by asking what properties a better embedding system should have, not by imposing a pre-designed architecture.
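Here is a minimal sketch of that weighting, using the toy 2D vectors again and a plain dot-product score. In a real attention mechanism the score function has learned parameters; this hand-rolled version only shows the mechanics:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Toy 2D vectors: the target word and its context words (illustrative values).
v_bank  = np.array([0.5, 0.45])
context = {
    "deposited": np.array([0.8, 0.1]),
    "money":     np.array([0.9, 0.1]),
    "at":        np.array([0.05, 0.05]),
    "the":       np.array([0.05, 0.05]),
}

# Compatibility score = dot product between the target word and each context word.
words  = list(context)
scores = np.array([v_bank @ context[w] for w in words])
alphas = softmax(scores)   # normalized weights that sum to 1

# Context representation = weighted sum instead of a plain average.
weighted = sum(a * context[w] for a, w in zip(alphas, words))

print({w: round(float(a), 2) for w, a in zip(words, alphas)})
# {'deposited': 0.29, 'money': 0.31, 'at': 0.2, 'the': 0.2}
print(np.round(weighted, 2))
# [0.53 0.08] -- pulled toward the finance region, unlike the plain average ~[0.45, 0.075]
```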
No Transformers Yet — Just Curiosity
Let's understand the requirements for a better system. We will introduce the solution later (in future blog posts).
- Context-dependent representations: Same word, different embeddings in different sentences
- Selective attention: Weight informative context words higher than uninformative ones
- Learned weighting: The model discovers which words matter, rather than using fixed rules
- Dynamic computation: Representations are computed at inference time, not retrieved from a static table
These requirements point toward a fundamentally different architecture than static embeddings. We need:
- A model that processes sequences (to preserve word order and context)
- A mechanism for selective attention (to weight context words)
- Bidirectional context (to use both left and right context)
- Multiple layers (to build increasingly abstract representations)
This is what would become transformers and contextual embeddings (ELMo, BERT, GPT). The goal here is to make the need for these architectures feel inevitable, not to explain how they work.
The limitations of static embeddings should create intellectual tension: "There must be a better way."
What Embeddings Prepared Us For
Despite their limitations, static embeddings were far from a failure. They were a necessary step: a proof of concept that shaped everything that came after.
Why Were Embeddings Required?
Before Word2Vec (2013), NLP systems used:
- One-hot encodings: They were sparse, high-dimensional and didn't have semantic structure
- Hand-crafted features: WordNet synonyms, part-of-speech tags, dependency parse features—labor-intensive, brittle, language-specific
- Count-based methods: Methods like TF-IDF, PMI matrices were interpretable but crude
These methods worked (to some extent), but they didn't scale. Each new task required new feature engineering. Each new language required new linguistic resources. Models were shallow (logistic regression, SVMs).
Word2Vec changed the game by proving:
- Unsupervised pre-training works: We can learn useful representations from raw text, without labels
- Dense representations outperform sparse ones: 300 dimensions beat 50,000 dimensions for most tasks
- Transfer learning is possible: Embeddings trained on one corpus (Wikipedia) improve performance on unrelated tasks (sentiment analysis, NER)
- Semantic structure can be learned: Synonyms, analogies, relationships emerge automatically from distributional patterns
This was a paradigm shift. It showed that neural networks could learn linguistic structure from data instead of just memorizing labels.
What Problem Did They Solve Perfectly?
Embeddings solved the representation problem which means how to convert discrete symbols (words) into continuous vectors that machines can process.
Before embeddings:
- Words are arbitrary symbols: "dog" = token ID 4521
- No notion of similarity: "dog" and "puppy" are as different as "dog" and "asteroid"
- Models can't generalize: never seen "puppy" → fail
After embeddings:
- Words are points in semantic space: "dog" = [0.23, -0.41, 0.08, ...]
- Similarity is measurable: cos(v_dog, v_puppy) is high, cos(v_dog, v_asteroid) is low
- Models generalize: "puppy" is close to "dog" → similar behavior
This solved the input representation problem for neural networks. Now we could feed sentences into deep models: embed each word, then process the sequence of embeddings with RNNs, CNNs, or transformers.
Embeddings became the input layer for virtually all NLP models. Even modern transformers (BERT, GPT) start with embeddings; they just compute better embeddings by making them contextual.
Why Was the Next Leap Unavoidable?
Once embeddings showed that unsupervised learning of representations was possible, the next questions became obvious:
Q1: Can we go beyond static embeddings to context-dependent ones?
A1: Yes → ELMo (2018), contextual embeddings from LSTMs
Q2: Can we use attention instead of recurrence?
A2: Yes → Transformers (2017), self-attention mechanisms
Q3: Can we scale this to massive corpora and model sizes?
A3: Yes → BERT (2018), GPT series (2018-2023)
Q4: Can we make models truly generalist, not task-specific?
A4: Yes → Large Language Models (GPT-3, GPT-4, Claude, Gemini)
Each leap built on embeddings. Without proving that:
- Dense vectors work better than sparse ones
- Unsupervised pre-training transfers to supervised tasks
- Semantic structure is learnable from co-occurrence
...we wouldn't have known where to go next.
Embeddings were the foundation. Transformers are the building. But you can't build without a foundation.
The Lesson Embeddings Taught
The most important lesson from embeddings is philosophical, not technical.
Before embeddings: Meaning is symbolic, discrete, hand-coded (WordNet, ontologies, rules)
After embeddings: Meaning is geometric, continuous, learned (vectors, distances, gradients)
This shift in perspective enabled deep learning for NLP. It showed that geometry is a valid representation of semantics: dot products approximate meaning, distances reflect similarity, directions encode relationships.
This geometric view unlocked:
- Differentiable models: Gradients flow through continuous vectors
- End-to-end learning: No manual feature engineering
- Compositionality: Combine vectors to represent phrases, sentences, documents
Modern LLMs inherit this geometric foundation. When GPT-4 generates text, it's navigating a high-dimensional semantic space, computing probability distributions over vector representations. The geometry comes from embeddings.
The arc of progress:
- Symbolic NLP (pre-2013): Words are discrete, meaning is rule-based
- Distributional embeddings (2013-2017): Words are vectors, meaning is geometric
- Contextual embeddings (2018-2020): Vectors depend on context, meaning is dynamic
- Large language models (2020-present): Scale + context + attention → emergent capabilities
Embeddings were step 2. Without step 2, we never reach step 4.
Final Line
Embeddings taught machines what words look like, their distributional signatures, their statistical fingerprints, their geometric structure.
The next step was teaching them what words mean in context, how meaning shifts with usage, how syntax and semantics interact, how understanding requires attending to the right information at the right time.
Static embeddings couldn't do this. They were frozen representations, fixed vectors, context-blind.
But their failure was productive. It showed us exactly what was missing: context-awareness, selective attention, dynamic computation.
And that pointed the way forward.
Key Takeaways
After covering so much ground in this post, it's time to summarize the key takeaways.
The Polysemy Problem:
- Static embeddings assign one vector per word type, regardless of usage
- Polysemous words ("bank", "bat", "spring") get averaged embeddings that don't represent any sense well
- This is a structural limitation of the static embedding paradigm
Why Averaging Is Inevitable:
- Word2Vec and GloVe process all occurrences of a word together (same row in co-occurrence matrix)
- Context is discarded during training (only used to generate targets, not to condition embeddings)
- The learned embedding must satisfy constraints from all senses simultaneously → compromise → averaging
Subword Embeddings (Partial Fix):
- FastText represents words as sums of character n-gram embeddings
- Helps with morphology ("playing" → "play" + "-ing"), rare words, typos
- Does not help with polysemy (n-grams have no semantic content)
Embeddings Are Inputs, Not Understanding:
- Encode correlation (co-occurrence), not causation
- No syntax, no compositionality, no logical reasoning
- Ungrounded symbols (no connection to perceptual reality)
- Useful statistical artifacts, not cognitive models
The Inevitable Next Questions:
- Can embeddings be context-dependent (different vectors in different sentences)?
- Can we weight context words by importance (learned, not averaged)?
- Can we preserve word order and syntactic structure?
- Can we compute representations dynamically?
What Embeddings Prepared Us For:
- Proved unsupervised representation learning works
- Showed dense vectors outperform sparse features
- Enabled transfer learning across tasks and domains
- Established geometric view of semantics (vectors, distances, angles)
- Became the input layer for all modern NLP models
Conclusion
If you have come this far, thank you for reading all the way.
We have completed the journey through static embeddings, from their inception to their limitations.
Part 1: Why embeddings exist—the need to convert discrete symbols into continuous geometric representations
Part 2: How to learn embeddings from local context windows (Word2Vec, Skip-Gram, CBOW, negative sampling)
Part 3: How to learn embeddings from global statistics (GloVe, co-occurrence matrices, matrix factorization)
Part 4: Why static embeddings hit a ceiling—polysemy, context-blindness, lack of compositionality
The arc of this series:
- Problem: Words are symbols; machines need numbers
- Solution: Embeddings (distributional vectors)
- Success: Semantic structure emerges (analogies, similarity, transfer learning)
- Limitation: Context-independence breaks for polysemy and compositional meaning
Static embeddings were transformative. They enabled:
- Modern NLP (pre-training, transfer learning)
- Neural architectures (differentiable representations)
- Semantic search (efficient similarity via dot products)
But they couldn't be the final answer because language is not context-free. Words don't have fixed meanings; they have meanings in context.
Recognizing this limitation pointed toward the next generation of models: ELMo, Transformers, BERT, GPTs.
But all of these build on the geometric foundation that embeddings established. Vectors, dot products, cosine similarity, gradient descent on distributional objectives—these concepts originated with Word2Vec and GloVe.
Embeddings didn't fail. They solved the problem they were designed for: learning distributional representations from raw text.
They just revealed a deeper problem: context matters.
And that revelation launched the next decade of NLP research.
Namaste!
References
Polysemy and Word Sense Disambiguation
Polysemy in NLP - Navigli, R. (2009). Word Sense Disambiguation: A Survey. ACM Computing Surveys.
Limitations of Static Embeddings - Camacho-Collados, J., & Pilehvar, M. T. (2018). From Word to Sense Embeddings: A Survey on Vector Representations of Meaning. JAIR.
Subword Embeddings
FastText - Bojanowski, P., et al. (2017). Enriching Word Vectors with Subword Information. TACL.
Character-level Models - Kim, Y., et al. (2016). Character-Aware Neural Language Models. AAAI 2016.
Contextual Embeddings (Next Generation)
ELMo - Peters, M. E., et al. (2018). Deep Contextualized Word Representations. NAACL 2018.
BERT - Devlin, J., et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint.
GPT - Radford, A., et al. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI.
Attention Is All You Need - Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS 2017.
Multimodal Grounding
CLIP - Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021.
Symbol Grounding Problem - Harnad, S. (1990). The Symbol Grounding Problem. Physica D.
Evaluation and Analysis
Embedding Evaluation - Schnabel, T., et al. (2015). Evaluation Methods for Unsupervised Word Embeddings. EMNLP 2015.
Probing Word Embeddings - Conneau, A., et al. (2018). What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. ACL 2018.
Written by Anirudh Sharma
Published on January 18, 2026