Understanding Embeddings: Part 3 - Global Statistics and GloVe
Table of Contents
This is Part 3 of a 4-part series on Embeddings:
- Part 1: Why Embeddings Exist
- Part 2: Learning Meaning From Context
- Part 3: Global Statistics and GloVe ← You are here
- Part 4: The Limits of Static Embeddings
Hey curious engineers, welcome to the third part of this series on understanding embeddings.
In Part 2, we explored how Word2Vec learns embeddings from local context windows.
The core insight we have internalized so far is that words appearing in similar contexts must have similar embeddings because they solve similar prediction tasks.
After training on billions of examples, the geometry of embedding space reflects distributional patterns:
- similarity between words that appear in similar contexts (like "king" and "queen") is high
- similarity between words that appear in very different contexts (like "king" and "asteroid") is low.
But you may have noticed that Word2Vec has a subtle inefficiency: it rediscovers the same information repeatedly.
Let's understand this with an example: consider the word pair ("king", "queen"). In a large corpus, these words might co-occur hundreds of times across different sentences:
- Sentence 1: "The king and queen attended the ceremony."
- Sentence 2: "Both the queen and king gave speeches."
- Sentence 3: "The royal couple, king and queen, waved..."
- ...
- Sentence 347: "The king consulted with the queen on matters..."
Each time Skip-Gram encounters this pair, it performs a gradient update pulling "king" and "queen" closer together.
The issue is, it processes each occurrence independently, as if seeing the relationship for the first time.
The 347th occurrence provides roughly the same signal as the 1st, yet the model treats it as a fresh training example, computing gradients, updating parameters, spending computation.
This raises a fundamental question: What if we aggregated all co-occurrence evidence first, then trained once on the global statistics?
This is precisely what GloVe (Global Vectors for Word Representation) does. Introduced by Pennington, Socher, and Manning at Stanford in 2014, GloVe takes a fundamentally different approach: instead of learning from individual context windows, it learns from a global co-occurrence matrix that counts how many times every word pair appears together across the entire corpus.
Core Insight: Word2Vec and GloVe are not competing ideas. Rather, they are two ways of extracting the same latent structure.
Word2Vec discovers global patterns implicitly through many local updates while GloVe discovers them explicitly by counting first, then factorizing.
This post explores why global statistics matter, how co-occurrence matrices capture meaning, why raw counts need transformation, and how GloVe's objective unifies prediction and matrix factorization into a single elegant framework.
Why Local Context Is Not the Whole Story
Word2Vec's local context windows are powerful, but they capture patterns, not structure. Let's understand why this matters.
Local Windows Capture Patterns, Not Structure
When Skip-Gram processes this sentence: The quick brown fox jumps over the lazy dog.
With a window size of 2, it generates training pairs:
(fox, quick)
(fox, brown)
(fox, jumps)
(fox, over)

These pairs capture local distributional patterns: "fox" appears near motion verbs ("jumps"), color adjectives ("brown"), and speed descriptors ("quick").
Over millions of sentences, the model learns that "fox" belongs to a cluster of animal nouns that appear with similar modifiers.
But notice what's missing here: global structure. Each window is a tiny snapshot, a 5-word slice of a sentence.
It doesn't know:
- How many times "fox" and "jumps" appear together in total across the entire corpus
- Whether this co-occurrence is more frequent than chance would predict
- How "fox" compares to "wolf" in terms of overall distributional similarity
The model builds up this understanding gradually, through millions of independent gradient updates.
It does learn structure, but it does so inefficiently.
Context Windows Are Small and Noisy
A typical Word2Vec window size is 5 (five words on each side of center). This captures immediate syntactic and semantic context, but misses longer-range dependencies.
Consider this sentence: The fox, a very clever creature, quickly jumped over the fence.
With window size 5 around "fox", the context would be: {The, a, very, clever, creature, quickly}
Notice that the word "jumped" is six positions away from "fox". It is outside the window, so Word2Vec never creates the pair (fox, jumped) from this sentence. The model loses evidence about the fox-action relationship.
Why not just increase window size?
Because larger windows introduce noise. If we used window size 10, we would capture "jumped", but also other irrelevant words.
In a complex sentence with multiple clauses, a large window might include words from a completely different semantic context.
This creates a fundamental trade-off in Word2Vec: small windows miss long-range patterns; large windows add noise. There is no perfect setting.
Meaning Emerges Across Many Windows
The distributional hypothesis tells us that meaning comes from aggregate patterns, not individual co-occurrences. A single sentence tells us little; a million sentences reveal structure.
Let's see this concretely. Suppose we want to learn that "ice" and "snow" are semantically related. We might encounter:
- Sentence 1: The ice melted in the sun.
- Sentence 2: Snow covered the mountains.
- Sentence 3: Children played in the snow and ice.
- Sentence 4: The cold ice and snow lasted all winter.
- ...
- Sentence 1000: Ice and snow are both forms of frozen water.
"Ice" and "snow" rarely appear together directly (except in coordinated phrases like "ice and snow"). But they appear with similar context words:
- Shared contexts: {cold, frozen, winter, melt, water}
- "Ice"-specific contexts: {cube, skating, hockey}
- "Snow"-specific contexts: {flake, ski, shovel}
Word2Vec learns their similarity through second-order co-occurrence (discussed in Part 2) where words that share context words become similar, even without direct co-occurrence. But this requires seeing these patterns repeatedly across many sentences.
But this is inefficient. The fact that both "ice" and "snow" appear with "cold" is discovered through many independent training examples. Sentence 1 updates embeddings for (ice, cold). Sentence 4 updates embeddings for (ice, cold) again, then (snow, cold). Each update is local and the global pattern emerges slowly.
The Same Information Appears Repeatedly
Let's make the redundancy concrete. Suppose in a 1 billion word corpus:
("king", "queen")co-occur 347 times("king", "royal")co-occur 892 times("king", "throne")co-occur 423 times
Word2Vec processes each occurrence as a separate training example:
```python
# Pseudocode for Skip-Gram's redundant updates
for sentence in corpus:
    for center_word, context in sliding_windows(sentence):
        if center_word == "king" and "queen" in context:
            # Update embeddings to increase the dot product
            gradient_step(v_king, v_queen)
            # ... occurrence 1, occurrence 2, ..., occurrence 347:
            # the same update is recomputed every single time
```

By occurrence 100, the model has already learned that "king" and "queen" are strongly associated. Occurrences 101-347 provide diminishing returns because the relationship is already well established. Yet Word2Vec keeps processing them, computing gradients, updating parameters, spending computation.
This is computationally wasteful. More importantly, it treats common patterns differently than rare patterns based purely on repetition, not informativeness.
Local Models Rediscover Global Patterns Slowly
Consider learning the relationship "Paris → France" (capital-of). In a large corpus, we might see:
- "Paris, the capital of France..." (appears 1,234 times)
- "Paris is located in France..." (appears 876 times)
- "France's capital city, Paris..." (appears 654 times)
These are the same semantic relationship expressed with different syntax.
A human reading the first instance would generalize: "Ah, Paris is the capital of France." Word2Vec must see this pattern thousands of times before the geometric relationship stabilizes.
Why? Because each occurrence provides a small gradient update pushing embeddings in the right direction. The relationship emerges through accumulation of evidence, not through direct recognition of the pattern.
This slow discovery is a fundamental property of stochastic gradient descent on streaming data — this is simply how neural networks learn from examples.
But one might ask, wouldn't it be better if we count all evidence first, then learn from the aggregate statistics? That's where GloVe comes into the picture.
The Question GloVe Asks
Here's the central insight that motivated GloVe:
What if we aggregate all co-occurrence evidence first?
Instead of processing ("king", "queen") 347 separate times, what if we:
- Count all co-occurrences in one pass through the corpus
- Store a single aggregate count: X("king", "queen") = 347
- Train on the aggregate statistics, not individual occurrences
This changes the learning paradigm:
- Word2Vec: Local prediction is repeated billions of times and global patterns emerge implicitly
- GloVe: Global counting is done once and the global patterns are used explicitly
Mathematically, both methods converge to similar embeddings (we'll see why later). But GloVe's approach has several advantages:
- Efficiency: Count once, train on aggregates (no redundant processing)
- Transparency: The co-occurrence matrix is interpretable (we can inspect it, visualize it, analyze it)
- Weighting: We can weight frequent and rare pairs differently (more control than sampling)
Let's visualize this shift:
Word2Vec: many sliding windows processed independently, rediscovering patterns repeatedly. GloVe: windows collapsed into a single global co-occurrence matrix, each cell storing aggregate evidence.
The next section formalizes this: what exactly is a co-occurrence matrix, and how does it capture meaning?
The Co-occurrence Matrix (The True Data Source)
The co-occurrence matrix is not a learned object but a direct measurement of corpus statistics.
Before we train any model, before we learn any embeddings, we can compute this matrix by simply counting word co-occurrences.
Let's build intuition for what it represents.
What Is a Co-occurrence Count?
At its core, a co-occurrence count answers a simple question:
How often does word B appear near word A?
"Near" is defined by a context window, just like Word2Vec. With window size , we count how many times word appears within positions of word (on either side). So you see, no learning here, just pure statistics.
A Concrete Example
Consider a tiny corpus with just three sentences and window size 2 (two words on each side).
1. the cat sat on the mat
2. the dog sat on the log
3. the cat and the dog played

For the word "cat":

- Sentence 1: the cat sat on the mat
  Context window: [the, ___] cat [sat, on]
  Co-occurrences: cat-the, cat-sat, cat-on
- Sentence 3: the cat and the dog played
  Context window: [the, ___] cat [and, the]
  Co-occurrences: cat-the (×2), cat-and

Total counts for "cat":
- ("cat", "the"): 3 times
- ("cat", "sat"): 1 time
- ("cat", "on"): 1 time
- ("cat", "and"): 1 time
For the word "dog":

- Sentence 2: the dog sat on the log
  Context window: [the, ___] dog [sat, on]
  Co-occurrences: dog-the, dog-sat, dog-on
- Sentence 3: the cat and the dog played
  Context window: [and, the] dog [played, ___]
  Co-occurrences: dog-and, dog-the, dog-played

Total counts for "dog":
- ("dog", "the"): 2 times
- ("dog", "sat"): 1 time
- ("dog", "on"): 1 time
- ("dog", "and"): 1 time
- ("dog", "played"): 1 time
Notice that "cat" and "dog" have very similar co-occurrence patterns!
Both appear with {the, sat, on, and}. This similarity in counts reflects their semantic similarity: both are domestic animals that perform similar actions.
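As a quick sanity check, here is a short Python snippet that reproduces these counts for the three-sentence corpus. It assumes simple whitespace tokenization and no distance weighting, just raw counts within the window:

```python
from collections import Counter

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat and the dog played".split(),
]
window = 2

counts = Counter()
for sentence in corpus:
    for i, center in enumerate(sentence):
        # Count every word within `window` positions of the center word
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                counts[(center, sentence[j])] += 1

print({pair: n for pair, n in counts.items() if pair[0] == "cat"})
# {('cat', 'the'): 3, ('cat', 'sat'): 1, ('cat', 'on'): 1, ('cat', 'and'): 1}
print({pair: n for pair, n in counts.items() if pair[0] == "dog"})
# {('dog', 'the'): 2, ('dog', 'sat'): 1, ('dog', 'on'): 1, ('dog', 'and'): 1, ('dog', 'played'): 1}
```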
Window Size Still Matters: But Now Globally
Just like Word2Vec, the choice of window size in GloVe affects which co-occurrences we capture.
But there is a crucial difference. In Word2Vec, window size affects which training pairs we generate—each window creates local training examples. In GloVe, by contrast, window size affects which cells of the co-occurrence matrix get incremented. We count once, globally, across the entire corpus.
A common extension in GloVe is distance-weighted counting, which means words closer to the center contribute more than words farther away.
If word $j$ is at distance $d$ from word $i$, we increment the count by $1/d$ instead of 1:
```
Example: "the quick brown fox jumps"
For center word "fox" with window size 2:

Distance 1: brown (weight = 1/1 = 1.0), jumps (weight = 1.0)
Distance 2: quick (weight = 1/2 = 0.5)

Co-occurrence counts:
X[fox, brown] += 1.0
X[fox, jumps] += 1.0
X[fox, quick] += 0.5
```

This decay captures the intuition that immediate neighbors are more relevant than distant words within the window.
From Sentences to a Matrix
After processing the entire corpus, we have a vocabulary-by-vocabulary matrix $X$ where:
- Rows: Target words (the center words)
- Columns: Context words (words appearing in the window)
- Cells: Counts $X_{ij}$ = the number of times word $j$ appears in the context of word $i$
For our tiny corpus with vocabulary {the, cat, dog, sat, on, mat, log, and, played}, the co-occurrence matrix $X$ (window size 2, symmetric) looks like:
| | the | cat | dog | sat | on | mat | log | and | played |
|---|---|---|---|---|---|---|---|---|---|
| the | 0 | 3 | 2 | 4 | 2 | 1 | 1 | 2 | 1 |
| cat | 3 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| dog | 2 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
| sat | 4 | 1 | 1 | 0 | 2 | 0 | 0 | 0 | 0 |
| on | 2 | 1 | 1 | 2 | 0 | 1 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Reading the matrix:
- Row for "cat": X[cat, the] = 3, X[cat, sat] = 1, X[cat, on] = 1, X[cat, and] = 1
- Row for "dog": X[dog, the] = 2, X[dog, sat] = 1, X[dog, on] = 1, X[dog, and] = 1, X[dog, played] = 1
Observation: The rows for "cat" and "dog" are similar (high values in the same columns). This row similarity reflects semantic similarity!
Why This Matrix Already Contains Meaning
This is the crucial insight: similar rows imply similar usage, which implies similar meaning.
Let's formalize this. If word $w$ has row vector $X_w$ (the $w$-th row of the matrix), this vector is the distributional signature of word $w$: how it co-occurs with every word in the vocabulary.
Two words with similar distributional signatures must have similar meanings (distributional hypothesis from Part 1).
Example: "ice" vs "snow"
Suppose we computed co-occurrence counts for a real corpus and extracted these rows:
| | cold | frozen | water | winter | hot | desert | tropical |
|---|---|---|---|---|---|---|---|
| ice | 1247 | 892 | 2341 | 654 | 23 | 12 | 8 |
| snow | 1456 | 734 | 1987 | 891 | 18 | 15 | 11 |
| sand | 45 | 3 | 234 | 67 | 234 | 876 | 123 |
Observations:
- "ice" and "snow" have high counts for {cold, frozen, water, winter} and low counts for {hot, desert, tropical}
- "sand" has the opposite pattern: high for {hot, desert}, low for {cold, frozen}
- The rows for "ice" and "snow" are similar, which means they are semantically related
- The row for "sand" is different because it's semantically distinct
This similarity can be measured with cosine similarity (from Part 1):

$$\text{cosine}(X_{\text{ice}}, X_{\text{snow}}) = \frac{X_{\text{ice}} \cdot X_{\text{snow}}}{\lVert X_{\text{ice}} \rVert \, \lVert X_{\text{snow}} \rVert}$$

If we compute this for the rows above, we find cosine(X_ice, X_snow) ≈ 0.98 (high) and cosine(X_ice, X_sand) ≈ 0.25 (low).
The matrix already encodes semantic relationships—before any neural network, before any training. This is pure statistics, pure counting.
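To verify this, here is a minimal NumPy check using the three rows from the table above (column order: cold, frozen, water, winter, hot, desert, tropical):

```python
import numpy as np

ice  = np.array([1247, 892, 2341, 654, 23, 12, 8], dtype=float)
snow = np.array([1456, 734, 1987, 891, 18, 15, 11], dtype=float)
sand = np.array([45, 3, 234, 67, 234, 876, 123], dtype=float)

def cosine(a, b):
    # cos(a, b) = (a . b) / (||a|| * ||b||)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(ice, snow))  # ≈ 0.99: similar rows, similar meanings
print(cosine(ice, sand))  # ≈ 0.24: dissimilar rows, distant meanings
```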
So why do we need GloVe at all? Why not just use the co-occurrence matrix directly for similarity computations?
Two problems:
- Dimensionality: For a vocabulary of size $V = 50{,}000$, each word is represented by a 50,000-dimensional row vector. This is sparse, high-dimensional, and computationally expensive.
- Raw counts are problematic: Frequent words like "the" dominate the counts. We need transformation (logarithms, normalization, weighting) to extract meaningful signal.
GloVe addresses both issues: it learns low-dimensional dense embeddings that approximate the transformed co-occurrence statistics. The embeddings are compressed, efficient, and normalized but they preserve the distributional relationships from the matrix.
Co-occurrence matrix for a tiny corpus. Rows represent words' distributional signatures. Similar rows (e.g., "cat" and "dog") indicate similar meanings. The matrix is sparse and high-dimensional; GloVe compresses it into dense, low-dimensional embeddings.
Pseudocode for Building the Matrix
Here's how we construct $X$ in practice:
```python
from collections import defaultdict

# Co-occurrence matrix, stored sparsely (most word pairs never co-occur)
X = defaultdict(lambda: defaultdict(float))

for sentence in corpus:  # each sentence is a list of tokens
    for center_position in range(len(sentence)):
        center_word = sentence[center_position]

        # Look at the context window around the center word
        for offset in range(-window_size, window_size + 1):
            if offset == 0:
                continue  # skip the center word itself

            context_position = center_position + offset
            if context_position < 0 or context_position >= len(sentence):
                continue  # out of bounds

            context_word = sentence[context_position]

            # Optionally weight by distance: closer words get higher weight
            distance = abs(offset)
            weight = 1.0 / distance

            # Increment the (weighted) co-occurrence count
            X[center_word][context_word] += weight

# Result: X[i][j] = weighted count of how often word j appears near word i
```

Key point: This is a single pass through the corpus. We count once, then train on $X$. Compare this to Word2Vec, which makes multiple passes (epochs) through the corpus, reprocessing the same co-occurrences repeatedly.
Why Raw Counts Don't Work (and Logarithms Do)
We have established that the co-occurrence matrix encodes semantic relationships.
But if we use raw counts directly, we run into serious problems. Let's understand why, and how logarithms fix this.
The Frequency Problem
Not all co-occurrences are equally meaningful. Consider the counts involving "dog" in a real corpus: the count with "the" dwarfs the count with "barked".
You see, the word "the" appears everywhere. It is the most frequent word in English, co-occurring with nearly every noun. The count X("dog", "the") is high not because "the" is semantically related to "dog", but because "the" is frequent.
Contrast this with "barked": X("dog", "barked") is lower in absolute terms, but far more informative. "Barked" is selective; it only appears with words that can bark (dogs, wolves, seals, but not cats, asteroids, or ideas). That co-occurrence is meaningful.
If we used raw counts to measure similarity, common words would dominate: the dot product of two raw count vectors would be inflated by shared high-frequency words, swamping the signal from rare but meaningful words.
In Part 1, we introduced Pointwise Mutual Information (PMI) to address this:

$$\text{PMI}(i, j) = \log \frac{P(i, j)}{P(i)\,P(j)}$$

PMI measures how much more likely two words are to co-occur than they would be if they were independent:
- High PMI: the words co-occur more than chance predicts, a meaningful association
- Low or negative PMI: the words co-occur less than chance predicts, i.e., they are unrelated or anti-correlated
Raw counts don't account for the base frequencies $P(i)$ and $P(j)$.
Logarithms, combined with proper normalization, approximate PMI, which is what we actually want.
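To make the PMI formula concrete, here is a tiny sketch with made-up counts. The numbers below are purely illustrative assumptions, not measurements from any corpus:

```python
import numpy as np

N = 1_000_000                                   # total co-occurrence events (hypothetical)
x_dog, x_the, x_barked = 80_000, 600_000, 500   # marginal counts (hypothetical)
x_dog_the, x_dog_barked = 50_000, 400           # joint counts (hypothetical)

def pmi(x_ij, x_i, x_j, n):
    # PMI(i, j) = log [ P(i, j) / (P(i) * P(j)) ]
    return np.log((x_ij / n) / ((x_i / n) * (x_j / n)))

print(pmi(x_dog_the, x_dog, x_the, N))        # ≈ 0.04: barely above chance
print(pmi(x_dog_barked, x_dog, x_barked, N))  # ≈ 2.30: strong association
```

Even though the raw count for ("dog", "the") is more than 100× larger, PMI tells us the ("dog", "barked") association is the informative one.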
Human Perception is Logarithmic
There is a deeper reason why logarithms are the right transformation: human perception of quantity is logarithmic, not linear.
Consider these questions:
Question 1: Which difference is more significant?
- Difference A: Seeing a word 1 time vs. 10 times (change of +9)
- Difference B: Seeing a word 10,000 times vs. 10,009 times (change of +9)
Intuitively, Difference A is far more significant. Going from 1 occurrence to 10 is a 10× increase, we now have substantial evidence the word exists.
Going from 10,000 to 10,009 is a 0.09% increase which is statistically insignificant.
Question 2: Which is more surprising?
- Seeing a co-occurrence increase from 2 to 20 (10× increase)
- Seeing a co-occurrence increase from 1,000 to 1,010 (1% increase)
Again, the first is more surprising. Multiplicative changes matter more than absolute changes.
This is formalized in the Weber-Fechner law from psychophysics: the perceived intensity of a stimulus is proportional to the logarithm of the physical intensity.
Brightness, loudness, and perceived quantity all follow logarithmic scaling in human perception.
For language, this means:
- The difference between co-occurring 1 time and 10 times is as perceptually significant as the difference between 10 times and 100 times, or between 100 and 1,000
- Each 10× increase feels like "one step" in perceptual space
Logarithms capture this: $\log(10x) = \log(x) + \log(10)$, so equal multiplicative changes become equal additive changes in log space. This aligns with how we actually interpret co-occurrence frequencies.
Log Counts Compress Reality
Let's see the compression effect concretely. Here are raw counts and their logarithms (base 10 for interpretability):
| Co-occurrence | Raw count | $\log_{10}(\text{count})$ |
|---|---|---|
| Rare pair | 1 | 0.00 |
| Uncommon pair | 10 | 1.00 |
| Moderate pair | 100 | 2.00 |
| Common pair | 1,000 | 3.00 |
| Very common pair | 10,000 | 4.00 |
| Extremely common | 100,000 | 5.00 |
Below are a few observations from the table above.
Raw counts span 5 orders of magnitude (1 to 100,000): The extremely common pair is 100,000× larger than the rare pair. If we used these directly in training, the gradient updates would be dominated by the frequent pairs and rare pairs would contribute negligibly.
Log counts compress to a range of 5 (0 to 5): The extremely common pair is only 5 "steps" away from the rare pair in log space. This makes training numerically stable: all co-occurrences contribute meaningfully to the loss, not just the most frequent ones.
Why This Makes Learning Numerically Stable
In gradient descent, the magnitude of the gradients scales with the size of the errors, and hence with the scale of the target values.
If we are trying to predict a raw count of 100,000 and our model predicts 50,000, the error is 50,000: a huge value that can cause unstable updates (exploding gradients).
If we instead predict $\log_{10}(100{,}000) = 5$ and our model predicts 4.7, the error is 0.3, a much smaller, more stable gradient.
This is why GloVe's objective (which we will see next) minimizes squared error on log counts, not raw counts.
The $\log$ transformation ensures that:
- Frequent and rare pairs contribute comparably to the loss
- Gradients remain bounded and stable
- The objective aligns with human perception of co-occurrence significance
Log Counts Align Better With Semantic Differences
Finally, logarithms align with how semantic relationships scale. Consider the semantic "distance" between these words:
("dog", "wolf"): Both canines, closely related("dog", "cat"): Both pets, moderately related("dog", "asteroid"): Completely unrelated
If the raw co-occurrence counts are, say, X(dog, wolf) = 1,000, X(dog, cat) = 100, and X(dog, asteroid) = 0:
The ratio suggests "dog" is 10× more related to "wolf" than to "cat". But semantically, both are animals; the difference is less dramatic than 10×.
In log space, $\log_{10}(1{,}000) = 3$ and $\log_{10}(100) = 2$.
The difference is a single "step," reflecting that both are animal-related. The gap to "asteroid" ($\log_{10}(0 + 1) = 0$, assuming we add a small constant to avoid $\log 0$) is much larger.
This logarithmic spacing better reflects semantic structure than raw counts.
Raw co-occurrence counts span many orders of magnitude, dominated by frequent pairs. Log transformation compresses the range, making frequent and rare pairs contribute comparably. The curve shows saturation: doubling a huge count barely changes the log value.
The GloVe Objective (Explained, Not Derived)
Finally, with all the theory above, now is the time to learn the objective function that transforms global co-occurrence statistics into dense word embeddings.
We won't derive this from first principles (the original paper does that). Instead, we will build intuition for what each component does and why the equation makes sense.
The Core Equation
GloVe minimizes this weighted least-squares objective:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$
Damn, this looks intimidating, so let's unpack it piece by piece.
Breaking It Down
The sum runs over all word pairs $(i, j)$ in the vocabulary. For a 50,000-word vocabulary, that's a whopping 2.5 billion pairs (though most counts are zero and can be skipped).
The inner term is a squared error:

$$\left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

This measures how well the dot product of the embeddings (plus bias terms) matches the log co-occurrence count.
Let's break down what's being compared:
Left side (model prediction):
- $w_i^\top \tilde{w}_j$: dot product of the word embedding for $i$ and the context embedding for $j$ (our learned vectors)
- $b_i$: bias term for word $i$
- $\tilde{b}_j$: bias term for context word $j$
Right side (target):
- $\log X_{ij}$: the log of the co-occurrence count (the transformed statistic from the matrix)
The objective is to make these two as close as possible.
The weighting function $f(X_{ij})$ controls how much each pair contributes to the total loss. We will discuss it in detail later, but intuitively:
- Very rare pairs ($X_{ij}$ close to 0): low weight (noisy, unreliable)
- Very frequent pairs ($X_{ij}$ very large): capped weight (saturated, less informative)
- Mid-frequency pairs: highest weight (the sweet spot of informativeness)
What the Model is Really Doing
At a high level, GloVe is performing matrix factorization. Let's understand how.
The co-occurrence matrix $X$ is $V \times V$ (vocabulary-by-vocabulary). Taking its logarithm element-wise gives $\log X$ (with some handling for zero entries).
GloVe learns two embedding matrices:
- $W$: word embeddings ($V \times d$, with $d$ typically 100-300)
- $\tilde{W}$: context embeddings ($V \times d$)
The objective tries to make:

$$W \tilde{W}^\top \approx \log X$$

This is low-rank matrix factorization, where we approximate a large matrix with the product of two much smaller matrices.
Why is this useful?
1. Dimensionality reduction: Instead of storing $V^2$ co-occurrence counts, we store $2 \times V \times d$ embedding values. For $V = 50{,}000$ and $d = 300$: 2.5 billion counts vs. 30 million embedding values (an ~83× compression).
2. Generalization: The low-rank constraint forces the model to find shared structure. Instead of memorizing every co-occurrence, it learns latent dimensions that explain many co-occurrences simultaneously.
3. Noise reduction: Rare co-occurrences might be noisy (statistical flukes). Factorization smooths over noise by finding the best low-dimensional approximation.
This is similar to applying Singular Value Decomposition (SVD) to $\log X$ (as used in Latent Semantic Analysis, mentioned in Part 1).
The key differences are:
- SVD: Unweighted, solves for an exact factorization with a closed-form solution
- GloVe: Weighted (via $f(X_{ij})$), optimized via gradient descent, includes bias terms
The bias terms $b_i$ and $\tilde{b}_j$ are crucial; let's understand them next.
Bias Terms Absorb Word Popularity
Why do we need $b_i$ and $\tilde{b}_j$? Why not just use the dot product $w_i^\top \tilde{w}_j$?
They are needed to account for word frequency independently of word meaning.
What?
Consider the word "the". It is extremely frequent, appearing in nearly every sentence, so its co-occurrence counts with all words are inflated simply because it appears everywhere.
If we tried to model these counts with the dot product $w_{\text{the}}^\top \tilde{w}_j$ alone, the embedding $w_{\text{the}}$ would need a large magnitude to produce large dot products with everything. This pollutes the semantic space.
The bias term $b_{\text{the}}$ absorbs this frequency: $b_{\text{the}}$ can be large (because "the" is frequent) while $w_{\text{the}}$ remains normalized. The embedding captures the semantic properties of "the" (a determiner, appears before nouns), not its raw frequency.
Similarly, $\tilde{b}_j$ absorbs the context word's frequency. This is symmetric: both $b_i$ and $\tilde{b}_j$ contribute their own frequency offsets.
Mathematical intuition:
Recall that PMI is:

$$\text{PMI}(i, j) = \log \frac{P(i, j)}{P(i)\,P(j)} = \log P(i, j) - \log P(i) - \log P(j)$$

The co-occurrence count $X_{ij}$ is proportional to $P(i, j)$, and the word frequencies $X_i$ and $X_j$ are proportional to $P(i)$ and $P(j)$.
So we can write:

$$\log X_{ij} = \text{PMI}(i, j) + \log X_i + \log X_j - \log N$$

where $N$ is the total number of co-occurrence events. The bias terms $b_i$ and $\tilde{b}_j$ learn to approximate the $\log X_i$ and $\log X_j$ terms (plus constants), leaving the dot product $w_i^\top \tilde{w}_j$ to approximate the PMI: the frequency-independent association.
This design ensures that dot products reflect semantic similarity, not word frequency.
Why Squared Error Makes Sense
GloVe uses squared error as its loss function.
Why does GloVe use squared error, rather than cross-entropy (used in Word2Vec)?
GloVe is reconstruction, not prediction.
Word2Vec (Skip-Gram/CBOW): Predicts one word from another. This is a classification problem: given context, which of 50,000 words is most likely? Cross-entropy is natural for classification.
GloVe: Reconstructs log co-occurrence counts. This is a regression problem: given a word pair $(i, j)$, predict the continuous value $\log X_{ij}$. Squared error is natural for regression.
Squared error has a nice geometric interpretation: it measures Euclidean distance between prediction and target.
Minimizing squared error is equivalent to maximum likelihood estimation under Gaussian noise assumptions (the target is the true value plus Gaussian noise, and we want to estimate the mean).
Additionally, squared error is convex in the parameters (for fixed weighting), making optimization well-behaved.
The Full Objective
Putting it all together:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$
What we're minimizing: A weighted sum of squared errors between dot products (plus biases) and log counts.
Over what? We are optimizing the embedding matrices $W$ and $\tilde{W}$ (word and context embeddings) plus the bias vectors $b$ and $\tilde{b}$.
Why this works: By minimizing this objective, we force the geometric quantity $w_i^\top \tilde{w}_j + b_i + \tilde{b}_j$ to match the statistical quantity $\log X_{ij}$. Words that co-occur frequently will have high dot products (small angle); words that don't co-occur will have low dot products (large angle).
The embeddings emerge as a compressed, low-dimensional representation of co-occurrence statistics, exactly like Word2Vec, but learned from aggregated global data instead of local predictions.
GloVe factorizes the log co-occurrence matrix into two low-rank embedding matrices $W$ and $\tilde{W}$. The reconstruction arrow shows that $W \tilde{W}^\top \approx \log X$, compressing a huge sparse matrix into dense embeddings.
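To tie the pieces together, here is a minimal NumPy sketch of one training pass over the nonzero co-occurrence triples, using plain SGD. It is illustrative only: the reference implementation uses AdaGrad, and the names here (cooc as a list of (i, j, count) triples, the learning rate, and so on) are assumptions of this sketch, not GloVe's actual API.

```python
import numpy as np

def glove_epoch(cooc, W, W_ctx, b, b_ctx, lr=0.05, x_max=100.0, alpha=0.75):
    """One SGD pass over the nonzero co-occurrence triples (i, j, X_ij)."""
    total_loss = 0.0
    for i, j, x in cooc:
        # Weighting function f(X_ij): sublinear ramp below x_max, capped at 1
        f = (x / x_max) ** alpha if x < x_max else 1.0

        # Model prediction vs. target: w_i . w~_j + b_i + b~_j should match log(X_ij)
        diff = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(x)
        total_loss += f * diff ** 2

        # Gradients of the weighted squared error for this single pair
        grad = 2.0 * f * diff
        W[i], W_ctx[j] = W[i] - lr * grad * W_ctx[j], W_ctx[j] - lr * grad * W[i]
        b[i] -= lr * grad
        b_ctx[j] -= lr * grad
    return total_loss
```

One design note from the original paper: after training, the word and context vectors are summed, so the final embedding for word i is typically W[i] + W_ctx[i], which the authors found gives a small additional boost.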
The Weighting Function: Why Some Pairs Matter More
Not all word pairs are equally informative. GloVe's weighting function ensures that the model focuses on the most useful co-occurrences while down-weighting noise and saturation.
Let's understand why this is necessary and how $f$ is designed.
Rare Pairs Are Noisy
Consider a word pair that co-occurs only once in a 1-billion-word corpus, say ("quokka", "asteroid") with a count of 1.
This co-occurrence might be a statistical fluke: In a bizarre news story, a quokka was named after the newly discovered asteroid...
This is not a meaningful semantic relationship, it's just a one-off coincidence in a quirky sentence. If we treat this pair with high importance, we will learn spurious associations.
The weighting function should assign low weight to very rare pairs ($X_{ij}$ of 1 or 2), effectively telling the model: "Don't trust this signal; it's too noisy."
Very Frequent Pairs Saturate
On the other end, consider an extremely frequent pair: two function words that co-occur 847,293 times across the corpus.
These words co-occur hundreds of thousands of times. While this confirms they are strongly associated (both are function words), the 847,293rd occurrence doesn't add much new information beyond what the first 1,000 occurrences already told us.
Treating all 847,293 occurrences with equal weight would mean this pair dominates the loss function. The model would spend most of its optimization capacity fitting the most frequent pairs, potentially underfitting less common but semantically rich pairs.
The weighting function should cap the influence of very frequent pairs, preventing saturation from overwhelming the training signal.
The Sweet Spot: Mid-Frequency Pairs
The most informative co-occurrences are in the middle:
- Not too rare: Reliable signal, not statistical noise
- Not too frequent: Each occurrence still provides new information
For example, consider a content-word pair that co-occurs a few hundred times across the corpus.
This is frequent enough to be trustworthy (not a fluke), but not so common that additional occurrences are redundant. Pairs like this should receive high weight.
Explaining f(x) Intuitively
GloVe uses this weighting function:

$$f(x) = \begin{cases} \left( x / x_{\max} \right)^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$

where $x = X_{ij}$ (the co-occurrence count), $x_{\max}$ is a threshold (typically 100), and $\alpha$ is an exponent (typically 0.75).
Let's break this down:
For Rare Pairs
For $x < x_{\max}$: $f(x) = (x / x_{\max})^{\alpha}$
Example with $x_{\max} = 100$ and $\alpha = 0.75$:
- $x = 1$: $f \approx 0.03$ (very low weight)
- $x = 10$: $f \approx 0.18$
- $x = 50$: $f \approx 0.59$
- $x = 100$: $f = 1.0$ (maximum weight)
Behavior:
- As the co-occurrence count increases from 0 to $x_{\max}$, the weight grows smoothly from 0 to 1
- The exponent $\alpha = 0.75$ makes the growth sublinear: doubling $x$ doesn't double $f(x)$; it multiplies it by $2^{0.75} \approx 1.68$
- This means rare pairs get very low weight (near 0), while mid-frequency pairs quickly ramp up
Why sublinear? If we used $\alpha = 1$ (linear), rare pairs would still contribute significantly, and the function would weight pairs proportionally to their raw counts, reintroducing the frequency problem.
The exponent $\alpha = 0.75$ balances giving rare pairs some weight (acknowledging they exist) while heavily down-weighting the noisiest ones.
For Frequent Pairs
For $x \geq x_{\max}$: $f(x) = 1$
Once $x$ reaches the threshold $x_{\max}$, the weight is capped at 1. This prevents very frequent pairs from dominating.
Example:
- $x = 100$: $f = 1.0$
- $x = 1{,}000$: $f = 1.0$ (capped)
- $x = 100{,}000$: $f = 1.0$ (capped)
All pairs with $x \geq 100$ contribute with the same weight to the loss, regardless of how much larger than 100 they are.
Why cap at $x_{\max} = 100$? This is an empirical choice from the original GloVe paper. The authors found that co-occurrences beyond 100 provide diminishing returns, i.e., the first 100 occurrences capture the relationship, and additional occurrences mostly add redundancy.
Capping prevents the loss from being dominated by the most common words.
The Exponent 0.75: Where Did It Come From?
The choice of $\alpha = 0.75$ echoes the smoothed unigram distribution used in Word2Vec's negative sampling (from Part 2). There, we sampled negative words with probability proportional to their frequency raised to the power 0.75, which balanced frequent and rare words.
Here, the same exponent serves a similar purpose: it smooths the contribution of rare vs. frequent pairs. If $\alpha = 1$, the weight grows linearly with the count (too sensitive to frequency). If $\alpha$ is much smaller, the weight grows too slowly (under-weighting mid-frequency pairs). $\alpha = 0.75$ is a sweet spot found empirically.
Remember, this is not a theoretically derived value but a hyperparameter tuned on downstream tasks (word similarity benchmarks). But it works robustly across different corpora.
Putting It Together
The weighted loss for a single pair $(i, j)$ is:

$$f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

- If $X_{ij} = 1$ (very rare): $f \approx 0.03$, so this pair contributes almost nothing to the total loss (it is effectively ignored as noise)
- If $X_{ij} = 50$ (mid-frequency): $f \approx 0.59$, so this pair contributes substantially
- If $X_{ij} = 100$ (frequent): $f = 1.0$, maximum contribution
- If $X_{ij} = 10{,}000$ (very frequent): $f = 1.0$, the same as $X_{ij} = 100$ (capped to prevent dominance)
This ensures that the model learns primarily from reliable, informative co-occurrences while tolerating rare noise and preventing frequent pairs from overwhelming the signal.
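The same behavior in code, as a direct transcription of the definition above:

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe's weighting function f(x): sublinear ramp below x_max, capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

for x in [1, 10, 50, 100, 10_000]:
    print(x, round(glove_weight(x), 3))
# prints: 1 0.032, 10 0.178, 50 0.595, 100 1.0, 10000 1.0
```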
GloVe's weighting function $f(x)$. For rare co-occurrences ($x < x_{\max}$), the weight grows sublinearly with the count (exponent 0.75). For frequent co-occurrences ($x \geq x_{\max}$), the weight is capped at 1. Mid-frequency pairs in the "sweet spot" receive the highest effective influence.
GloVe vs Word2Vec
We have now seen both Word2Vec (Part 2) and GloVe in detail. On the surface, they seem quite different, yet empirically, both produce similar embeddings.
The famous word analogies work with both:

$$v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} \approx v_{\text{queen}}$$
Word similarity rankings are nearly identical. Cosine similarities match closely.
Why? Because they are extracting the same underlying structure from different angles.
Local Prediction vs Global Reconstruction
The key philosophical difference between the two is this:
Word2Vec (local prediction):
- Processes sentences sequentially, one window at a time
- Learns to predict: $P(\text{context} \mid \text{word})$ (Skip-Gram) or $P(\text{word} \mid \text{context})$ (CBOW)
- Global patterns emerge implicitly through many local updates
- Never explicitly constructs the co-occurrence matrix
GloVe (global reconstruction):
- Processes the entire corpus once to build the co-occurrence matrix $X$
- Learns to reconstruct: $w_i^\top \tilde{w}_j + b_i + \tilde{b}_j \approx \log X_{ij}$
- Global patterns are explicit from the start (stored in $X$)
- Training optimizes a single global objective
It is like learning geography:
- Word2Vec approach: Walk around a city, street by street, remembering which neighborhoods connect. After many walks, you build a mental map.
- GloVe approach: Get the full city map first, then learn a compressed representation (like memorizing major landmarks and highways instead of every street).
Both give us navigational knowledge, but through different learning processes.
Why Embeddings Look Similar in Practice
When researchers train Word2Vec and GloVe on the same corpus with similar hyperparameters, the learned embeddings are highly correlated.
For example, if we compute the cosine similarity between "dog" and "puppy" using Word2Vec embeddings and again using GloVe embeddings trained on the same corpus, we typically find that the two scores are close.
The nearest neighbors for "dog" in both embedding spaces are usually {puppy, cat, dogs, pet, animal} (perhaps in slightly different order).
Why? Because both methods:
- Use the same data source (co-occurrence patterns in text)
- Optimize for similar objectives (dot product reflects co-occurrence statistics)
- Produce similar geometry (PMI approximation)
The minor differences come from:
- Sampling noise in Word2Vec (which pairs get selected during training depends on random window sampling)
- Weighting differences (GloVe's $f(X_{ij})$ vs. Word2Vec's negative-sampling distribution)
- Optimization dynamics (SGD on streaming data vs batch optimization on a fixed matrix)
In practice, these differences are often negligible for downstream tasks. Both produce high-quality embeddings.
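If you want to see this for yourself, the gensim downloader ships pretrained vectors for both methods. Two caveats and assumptions for this sketch: the two models below were trained on different corpora (Google News vs. Wikipedia+Gigaword), so treat the comparison as illustrative rather than controlled, and the model names assume the standard gensim-data catalog.

```python
import gensim.downloader as api

# Both downloads are large (hundreds of MB to ~1.6 GB)
w2v = api.load("word2vec-google-news-300")    # Skip-Gram with negative sampling
glove = api.load("glove-wiki-gigaword-300")   # GloVe

print(w2v.most_similar("dog", topn=5))
print(glove.most_similar("dog", topn=5))
print(w2v.similarity("dog", "puppy"), glove.similarity("dog", "puppy"))
```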
Comparison Table
| Aspect | Word2Vec (Skip-Gram) | GloVe |
|---|---|---|
| Training data | Sliding context windows | Pre-computed co-occurrence matrix |
| Objective | Predict context from word (or vice versa) | Reconstruct log co-occurrence counts |
| Loss function | Negative log-likelihood (cross-entropy) | Weighted mean squared error |
| Matrix factorization | Implicit (PMI matrix) | Explicit (log co-occurrence matrix) |
| Computational cost | Proportional to corpus size × epochs | Matrix construction + optimization on matrix |
| Memory usage | Low (only stores embeddings during training) | High (stores full co-occurrence matrix) |
| Handling rare words | Better (direct gradient updates) | Worse (down-weighted by $f$) |
| Handling frequent words | Good (negative sampling balances) | Good (capped weighting prevents dominance) |
| Interpretability | Opaque (what does a gradient update "mean"?) | Transparent (can inspect ) |
| Parallelization | Difficult (sequential sentence processing) | Easy (matrix operations parallelize well) |
| Typical performance | Slightly better on rare words | Slightly better on frequent words |
| Training time (small corpus) | Faster (no matrix construction) | Slower (matrix overhead) |
| Training time (large corpus) | Slower (many epochs over corpus) | Faster (optimize on matrix, not raw text) |
When to use Word2Vec:
- Small to medium corpora (< 1 billion words)
- Rare words are critical (medical, legal, scientific domains)
- Limited memory (can't store large co-occurrence matrix)
- Streaming data (corpus grows over time; can update incrementally)
When to use GloVe:
- Large corpora (> 1 billion words)
- Need interpretability (want to inspect co-occurrence statistics)
- Have sufficient memory for the matrix
- Parallelization is important (can distribute matrix operations)
- Want deterministic results (Word2Vec has random sampling; GloVe is deterministic given $X$)
Modern practice: Both are now largely superseded by contextual embeddings (BERT, GPT) for most tasks. But for applications requiring static embeddings (efficient similarity search, interpretability, low-resource settings), GloVe is often preferred due to its speed and transparency.
What GloVe Teaches Us About Embeddings
Beyond the specific algorithm, GloVe reveals deeper lessons about what embeddings are and how they work. Let's extract the key insights.
1. Embeddings Are Compressed Statistics
This is the most fundamental takeaway: embeddings are compressed statistical summaries of co-occurrence patterns.
The co-occurrence matrix $X$ is a complete record of which words appear together in the corpus. It's huge ($V \times V$), sparse (most pairs never co-occur), and high-dimensional. But it contains all the distributional information we need.
GloVe compresses this into dense vectors ($d$-dimensional, with $d \ll V$) by finding a low-rank factorization:

$$W \tilde{W}^\top \approx \log X$$

The embeddings are a compressed representation of $X$. We have gone from $V^2$ numbers to roughly $2Vd$ numbers, losing some information but preserving the most important structure.
This is analogous to image compression (JPEG) or audio compression (MP3):
- Raw image: Millions of pixel values
- JPEG: Compressed representation using frequency components (DCT coefficients)
- Result: 10× smaller file, perceptually similar image
Similarly:
- Raw co-occurrence matrix: Billions of count values
- GloVe embeddings: Compressed representation using latent dimensions
- Result: 100× smaller representation, semantically similar relationships
The compression is lossy (we can't perfectly reconstruct $\log X$ from the embeddings), but the loss is acceptable because:
- Rare co-occurrences are noisy anyway (we want to smooth over them)
- The low-rank structure captures the systematic patterns (semantic relationships)
- Downstream tasks (similarity, analogy) don't need perfect reconstruction—they need semantic structure
Key insight: When you see an embedding like $v_{\text{dog}}$, you are looking at a compressed summary of how "dog" co-occurs with all other words in the vocabulary. Each dimension captures a latent pattern of co-occurrence.
2. Geometry Reflects Usage Patterns
The geometric relationships in embedding space such as angles, distances, directions are not arbitrary. They directly reflect statistical patterns in language use.
High dot product ↔ frequent co-occurrence:
If $w_i^\top \tilde{w}_j$ is large, then $\log X_{ij}$ is large (from GloVe's objective), meaning the co-occurrence count $X_{ij}$ is large. Words with high dot product co-occur frequently.
Parallel directions ↔ similar distributional profiles:
If $v_{\text{ice}}$ and $v_{\text{snow}}$ point in similar directions (high cosine similarity), it's because they have similar rows in $X$: both co-occur with the same words ({cold, frozen, winter, ...}).
Offsets capture relations:
The famous analogy $v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} \approx v_{\text{queen}}$ works because:
- $v_{\text{king}} - v_{\text{man}}$ is a direction in embedding space
- This direction represents "remove male-associated co-occurrences, keep royal-associated co-occurrences"
- Adding $v_{\text{woman}}$ adds back female-associated co-occurrences
- The result points toward $v_{\text{queen}}$, which has the royal + female co-occurrence pattern
This arithmetic works because co-occurrence patterns are compositional.
The contrast between the contexts of "king" and the contexts of "queen" shows up in the data (different words are used around kings vs. queens), and GloVe's factorization preserves this structure in the geometry.
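Here is a minimal sketch of how that arithmetic is used in practice, assuming you already have a trained embedding matrix emb of shape (V, d), a list vocab of V words, and a dict idx mapping each word to its row (all three are assumptions of this sketch):

```python
import numpy as np

def nearest(query, emb, vocab, exclude=(), k=3):
    """Return the k vocabulary words whose embeddings are most cosine-similar to query."""
    sims = emb @ query / (np.linalg.norm(emb, axis=1) * np.linalg.norm(query) + 1e-9)
    order = np.argsort(-sims)  # indices sorted by descending similarity
    return [vocab[i] for i in order if vocab[i] not in exclude][:k]

analogy = emb[idx["king"]] - emb[idx["man"]] + emb[idx["woman"]]
print(nearest(analogy, emb, vocab, exclude={"king", "man", "woman"}))
# With good embeddings, "queen" should appear at or near the top
```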
Every geometric property has a statistical interpretation:
- Cosine similarity → row similarity in $X$
- Dot product → PMI approximation
- Vector arithmetic → compositional co-occurrence patterns
- Clustering → shared contexts (second-order co-occurrence)
This is why embeddings "just work" for downstream tasks: the geometry encodes real linguistic patterns, not random structure.
3. Neural Networks Are Not Required for Meaning
This is perhaps the most liberating insight: we don't need a neural network to learn semantic embeddings.
Word2Vec uses a shallow neural network (one hidden layer), but GloVe doesn't. It's just weighted matrix factorization: a classical linear algebra technique, optimized with gradient descent.
The "secret sauce" is the distributional hypothesis applied to large-scale data. Meaning comes from co-occurrence patterns; any method that compresses those patterns into dense vectors will capture semantics.
Why did Word2Vec seem revolutionary in 2013?
Not because of the neural network (neural language models existed since Bengio et al. 2003), but because:
- Scale: Trained on billions of words (previous work used millions)
- Efficiency: Negative sampling made training feasible at scale
- Empirical success: Achieved state-of-the-art results on word similarity and analogy benchmarks
- Accessibility: Released as open-source code, easy to use (gensim, word2vec toolkit)
GloVe showed that the same results could be achieved with matrix factorization, reinforcing that the core insight is global co-occurrence statistics, not any particular model architecture.
Implication for modern LLMs:
Even today's massive transformer models (BERT, GPT-4) fundamentally rely on the distributional hypothesis. They learn more complex patterns (contextual usage, syntactic structure, long-range dependencies), but the foundation is still statistical co-occurrence in text data.
Neural networks provide flexibility (nonlinear transformations, attention mechanisms), but they are not the source of meaning. Meaning comes from data. The architecture is just a tool for extracting and representing that meaning efficiently.
Key Takeaways
We are nearing the end of this post, so now is a good time to look at the key takeaways.
Why Global Statistics Matter:
- Word2Vec rediscovers the same co-occurrence patterns repeatedly through millions of local updates
- GloVe aggregates all evidence first, then trains once on global statistics
- Both approaches converge to similar embeddings because they extract the same latent structure (PMI approximation)
The Co-occurrence Matrix:
- Built by counting word co-occurrences across the entire corpus
- Rows represent distributional signatures: how each word co-occurs with all others
- Similar rows → similar meanings (distributional hypothesis in action)
- The matrix is interpretable, sparse, and high-dimensional
Why Raw Counts Fail:
- Frequent words (like "the") dominate raw counts, overwhelming meaningful signal
- Human perception is logarithmic: 1→10 feels like 10→100 (multiplicative scaling)
- $\log(X_{ij})$ compresses counts, making frequent and rare pairs contribute comparably
- Aligns with PMI (frequency-independent association measure)
GloVe's Objective:
- Minimizes weighted squared error between dot products and log counts
- Performs low-rank matrix factorization: $W \tilde{W}^\top \approx \log X$
- Bias terms absorb word frequency, leaving dot products to capture semantic association
- Squared error is natural for regression (reconstruction task, not classification)
The Weighting Function:
- $f(X_{ij})$ gives low weight to rare pairs (noisy) and caps the weight for frequent pairs (saturated)
- Mid-frequency pairs get highest effective weight (sweet spot of informativeness)
- Prevents very common words from dominating loss, prevents rare flukes from adding noise
What GloVe Teaches Us:
- Embeddings are compressed statistics: Dense vectors summarize sparse co-occurrence patterns
- Geometry reflects usage: Dot products ≈ co-occurrence frequency, angles ≈ distributional similarity
- Neural networks not required: Matrix factorization suffices; the key is distributional hypothesis + scale
- Meaning comes from data, not architecture: Statistical patterns in text, not model complexity, drive semantic learning
Conclusion
We have completed the journey from local context windows to global statistical structures.
Part 2 (Word2Vec): Embeddings emerge from predicting contexts. Sliding windows generate millions of training pairs; neural networks learn representations where similar words solve similar prediction tasks. The geometry (dot products ≈ PMI) emerges implicitly through gradient descent on local objectives.
Part 3 (GloVe): The same geometry can be learned explicitly from global co-occurrence statistics. Count all word pairs once, store the counts in a matrix $X$, and factorize it into low-dimensional embeddings. The result: compressed statistical summaries that preserve semantic relationships.
The unification: Word2Vec and GloVe are not competing methods. Rather, they are two paths to the same destination. Both extract latent structure from distributional patterns; one discovers it through many local updates, the other computes it from global aggregates. Both converge to embeddings where dot products approximate PMI, the core statistical signal of semantic association.
This unification reveals a profound truth: embeddings are inevitable.
Given the distributional hypothesis (meaning comes from context) and large-scale data (millions of sentences), any reasonable learning algorithm will discover roughly the same geometric structure.
The architecture (neural network, matrix factorization) is secondary; the primary driver is the statistical regularity in how humans use language.
But all the methods we have seen so far (Word2Vec, GloVe, LSA) share fundamental limitations:
- Static embeddings: Each word gets a single vector, conflating all senses ("bank" as institution vs. river edge)
- Bag-of-words context: Word order is ignored, losing syntactic structure
- Fixed window size: Long-range dependencies beyond 5-10 words are invisible
- No compositionality: Sentence meaning is just averaged word vectors (loses "dog chased cat" vs "cat chased dog")
These limitations motivated the next revolution in NLP: contextual embeddings.
In Part 4 (The Limits of Static Embeddings), we will explore:
- Why polysemy breaks static embeddings (one vector per word can't capture multiple meanings)
- How ELMo, BERT, and GPT create dynamic representations that change based on sentence context
- Why attention mechanisms outperform fixed windows for capturing dependencies
- How transformers preserve word order through positional encodings
- The trade-offs: contextual embeddings are powerful but computationally expensive; static embeddings are limited but efficient
The story doesn't end with Word2Vec and GloVe; it begins there.
These methods proved that meaning has geometry, and that geometry can be learned from data.
Everything since has been refinement, scaling, and contextualization of that core insight.
Namaste!
References
Foundational Papers
GloVe: Global Vectors for Word Representation - Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global Vectors for Word Representation. EMNLP 2014.
Word2Vec (for comparison) - Mikolov, T., et al. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv preprint.
Theoretical Analysis
Neural Word Embedding as Implicit Matrix Factorization - Levy, O., & Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization. NeurIPS 2014.
Improving Distributional Similarity - Levy, O., Goldberg, Y., & Dagan, I. (2015). Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the ACL.
Don't count, predict! - Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. ACL 2014.
Classical Methods (for context)
Latent Semantic Analysis - Deerwester, S., et al. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science.
Pointwise Mutual Information - Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics.
Practical Resources
GloVe Official Implementation - Stanford NLP Group. GloVe: Global Vectors for Word Representation.
GloVe Explained - GloVe: Global Vectors for Word Representation Explained. Toward Data Science tutorial.
Evaluation Benchmarks
Word Similarity - SimLex-999: Human-annotated word similarity scores for evaluation.
Written by Anirudh Sharma
Published on January 17, 2026