
Understanding Embeddings: Part 3 - Global Statistics and GloVe


This is Part 3 of a 4-part series on Embeddings.



Hey curious engineers, welcome to the third part of this series on understanding embeddings.

In Part 2, we explored how Word2Vec learns embeddings from local context windows.

The core insight we have internalized so far is that words appearing in similar contexts must have similar embeddings because they solve similar prediction tasks.

After training on billions of examples, the geometry of embedding space reflects distributional patterns:

  • \mathbf{v}_{\text{dog}} \cdot \mathbf{v}_{\text{cat}} is high
  • \mathbf{v}_{\text{dog}} \cdot \mathbf{v}_{\text{asteroid}} is low.

You may have noticed that Word2Vec has a subtle inefficiency: it rediscovers the same information repeatedly.

Let's understand this with an example: consider the word pair ("king", "queen"). In a large corpus, these words might co-occur hundreds of times across different sentences:

  • Sentence 1: "The king and queen attended the ceremony."
  • Sentence 2: "Both the queen and king gave speeches."
  • Sentence 3: "The royal couple, king and queen, waved..."
  • ...
  • Sentence 347: "The king consulted with the queen on matters..."

Each time Skip-Gram encounters this pair, it performs a gradient update pulling "king" and "queen" closer together.

The issue is, it processes each occurrence independently, as if seeing the relationship for the first time.

The 347th occurrence provides roughly the same signal as the 1st, yet the model treats it as a fresh training example, computing gradients, updating parameters, spending computation.

This raises a fundamental question: What if we aggregated all co-occurrence evidence first, then trained once on the global statistics?

This is precisely what GloVe (Global Vectors for Word Representation) does. Introduced by Pennington, Socher, and Manning at Stanford in 2014, GloVe takes a fundamentally different approach: instead of learning from individual context windows, it learns from a global co-occurrence matrix that counts how many times every word pair appears together across the entire corpus.

Core Insight: Word2Vec and GloVe are not competing ideas. Rather, they are two ways of extracting the same latent structure.

Word2Vec discovers global patterns implicitly through many local updates while GloVe discovers them explicitly by counting first, then factorizing.

This post explores why global statistics matter, how co-occurrence matrices capture meaning, why raw counts need transformation, and how GloVe's objective unifies prediction and matrix factorization into a single elegant framework.

Why Local Context Is Not the Whole Story

Word2Vec's local context windows are powerful, but they capture patterns, not structure. Let's understand why this matters.

Local Windows Capture Patterns, Not Structure

When Skip-Gram processes this sentence: The quick brown fox jumps over the lazy dog.

With a window size of 2, it generates training pairs:

plaintext
(fox, quick) (fox, brown) (fox, jumps) (fox, over)

These pairs capture local distributional patterns: "fox" appears near motion verbs ("jumps"), color adjectives ("brown"), and speed descriptors ("quick").
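
To make the pair-generation step concrete, here is a minimal sketch (illustrative only, not the actual word2vec or gensim internals) that reproduces the pairs above:

python
# Illustrative sketch of Skip-Gram (center, context) pair generation
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
print([p for p in skipgram_pairs(sentence) if p[0] == "fox"])
# [('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumps'), ('fox', 'over')]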

Over millions of sentences, the model learns that "fox" belongs to a cluster of animal nouns that appear with similar modifiers.

But notice what's missing here: global structure. Each window is a tiny snapshot, a five-word slice of a sentence.

It doesn't know:

  • How many times total "fox" and "jumps" appear together across the entire corpus
  • Whether this co-occurrence is more frequent than chance would predict
  • How "fox" compares to "wolf" in terms of overall distributional similarity

The model builds up this understanding gradually, through millions of independent gradient updates.

It does learn structure, but it does so inefficiently.

Context Windows Are Small and Noisy

A typical Word2Vec window size is 5 (five words on each side of center). This captures immediate syntactic and semantic context, but misses longer-range dependencies.

Consider this sentence: The fox, a notoriously clever and elusive creature, quickly jumped over the fence.

With window size 5 around "fox" (ignoring punctuation), the context would be: {The, a, notoriously, clever, and, elusive}

Notice that the word "jumped" is too far away in the sentence. It is outside the window, so Word2Vec never creates the pair (fox, jumped) from this sentence. The model loses evidence about the fox-action relationship.

Why not just increase window size?
Because larger windows introduce noise. If we used window size 10, we would capture "jumped", but also other irrelevant words.
In a complex sentence with multiple clauses, a large window might include words from a completely different semantic context.

This creates a fundamental trade-off in Word2Vec: small windows miss long-range patterns; large windows add noise. There is no perfect setting.
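
A quick way to see the trade-off is to extract contexts at two window sizes for the sentence above (a toy sketch; the helper function is purely illustrative):

python
# Context words around "fox" at two window sizes (punctuation stripped)
def context(tokens, center_idx, window):
    lo, hi = max(0, center_idx - window), min(len(tokens), center_idx + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != center_idx]

tokens = ("the fox a notoriously clever and elusive creature "
          "quickly jumped over the fence").split()
print(context(tokens, 1, 5))   # window 5: misses 'jumped'
print(context(tokens, 1, 10))  # window 10: captures 'jumped', plus more distant, noisier words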

Meaning Emerges Across Many Windows

The distributional hypothesis tells us that meaning comes from aggregate patterns, not individual co-occurrences. A single sentence tells us little; a million sentences reveal structure.

Let's see this concretely. Suppose we want to learn that "ice" and "snow" are semantically related. We might encounter:

  • Sentence 1: The ice melted in the sun.
  • Sentence 2: Snow covered the mountains.
  • Sentence 3: Children played in the snow and ice.
  • Sentence 4: The cold ice and snow lasted all winter.
  • ...
  • Sentence 1000: Ice and snow are both forms of frozen water.

"Ice" and "snow" rarely appear together directly (except in coordinated phrases like "ice and snow"). But they appear with similar context words:

  • Shared contexts: {cold, frozen, winter, melt, water}
  • "Ice" contexts: {cube, skating, hockey}
  • "Snow" contexts: {flake, ski, shovel}

Word2Vec learns their similarity through second-order co-occurrence (discussed in Part 2) where words that share context words become similar, even without direct co-occurrence. But this requires seeing these patterns repeatedly across many sentences.

But this is inefficient. The fact that both "ice" and "snow" appear with "cold" is discovered through many independent training examples. Sentence 1 updates embeddings for (ice, cold). Sentence 4 updates embeddings for (ice, cold) again, then (snow, cold). Each update is local and the global pattern emerges slowly.

The Same Information Appears Repeatedly

Let's make the redundancy concrete. Suppose in a 1 billion word corpus:

  • ("king", "queen") co-occur 347 times
  • ("king", "royal") co-occur 892 times
  • ("king", "throne") co-occur 423 times

Word2Vec processes each occurrence as a separate training example:

python
# Pseudocode for Skip-Gram
for sentence in corpus:
    for position in sentence:
        if center_word == "king" and "queen" in context:
            # Update embeddings to increase dot product
            gradient_step(v_king, v_queen)  # Occurrence 1

        if center_word == "king" and "queen" in context:
            # Later in corpus...
            gradient_step(v_king, v_queen)  # Occurrence 2

        # ... 345 more times ...

By occurrence 100, the model has already learned that "king" and "queen" are strongly associated. Occurrences 101-347 provide diminishing returns because the relationship is already well-established. Yet Word2Vec continues processing them, performing gradient computations, updating parameters.

This is computationally wasteful. More importantly, it treats common patterns differently than rare patterns based purely on repetition, not informativeness.

Local Models Rediscover Global Patterns Slowly

Consider learning the relationship "Paris → France" (capital-of). In a large corpus, we might see:

  • Paris, the capital of France... (appears 1,234 times)
  • Paris is located in France... (appears 876 times)
  • France's capital city, Paris... (appears 654 times)

These are the same semantic relationship expressed with different syntax.

A human reading the first instance would generalize: "Ah, Paris is the capital of France." Word2Vec must see this pattern thousands of times before the geometric relationship stabilizes.

Why? Because each occurrence provides a small gradient update pushing embeddings in the right direction. The relationship emerges through accumulation of evidence, not through direct recognition of the pattern.

This slow discovery is a fundamental property of stochastic gradient descent on streaming data — this is simply how neural networks learn from examples.

But one might ask: wouldn't it be better to count all the evidence first, then learn from the aggregate statistics? That's where GloVe comes into the picture.

The Question GloVe Asks

Here's the central insight that motivated GloVe:

What if we aggregate all co-occurrence evidence first?

Instead of processing ("king", "queen") 347 separate times, what if we:

  1. Count all co-occurrences in one pass through the corpus
  2. Store the count: X_{\text{king}, \text{queen}} = 347
  3. Train on the aggregate statistics, not individual occurrences

This changes the learning paradigm:

  • Word2Vec: Local prediction is repeated billions of times and global patterns emerge implicitly
  • GloVe: Global counting is done once and global patterns used explicitly

Mathematically, both methods converge to similar embeddings (we'll see why later). But GloVe's approach has several advantages:

  1. Efficiency: Count once, train on aggregates (no redundant processing)
  2. Transparency: The co-occurrence matrix is interpretable (we can inspect it, visualize it, analyze it)
  3. Weighting: We can weight frequent and rare pairs differently (more control than sampling)

Let's visualize this shift:

Figure: Multiple windows to matrix. Word2Vec: many sliding windows processed independently, rediscovering patterns repeatedly. GloVe: windows collapsed into a single global co-occurrence matrix, each cell storing aggregate evidence.

The next section formalizes this: what exactly is a co-occurrence matrix, and how does it capture meaning?

The Co-occurrence Matrix (The True Data Source)

The co-occurrence matrix is not a learned object but a direct measurement of corpus statistics.

Before we train any model, before we learn any embeddings, we can compute this matrix by simply counting word co-occurrences.

Let's build intuition for what it represents.

What Is a Co-occurrence Count?

At its core, a co-occurrence count answers a simple question:

How often does word B appear near word A?

"Near" is defined by a context window, just like Word2Vec. With window size kk, we count how many times word BB appears within kk positions of word AA (on either side). So you see, no learning here, just pure statistics.

A Concrete Example

Consider a tiny corpus with just three sentences and window size 2 (two words on each side).

  • the cat sat on the mat
  • the dog sat on the log
  • the cat and the dog played

For the word "cat":

  • Sentence 1: the cat sat on the mat
plaintext
Context window: [the, ___] cat [sat, on]
Co-occurrences: cat-the, cat-sat, cat-on
  • Sentence 3: the cat and the dog played
plaintext
Context window: [the, ___] cat [and, the]
Co-occurrences: cat-the (×2), cat-and

Total counts for "cat":

  • ("cat", "the"): 3 times
  • ("cat", "sat"): 1 time
  • ("cat", "on"): 1 time
  • ("cat", "and"): 1 time

For the word "dog":

  • Sentence 2: the dog sat on the log
plaintext
Context window: [the, ___] dog [sat, on]
Co-occurrences: dog-the, dog-sat, dog-on
  • Sentence 3: the cat and the dog played
plaintext
Context window: [and, the] dog [played, ___]
Co-occurrences: dog-and, dog-the, dog-played

Total counts for "dog":

  • ("dog", "the"): 2 times
  • ("dog", "sat"): 1 time
  • ("dog", "on"): 1 time
  • ("dog", "and"): 1 time
  • ("dog", "played"): 1 time

Notice that "cat" and "dog" have very similar co-occurrence patterns!

Both appear with {the, sat, on, and}. This similarity in counts reflects their semantic similarity which suggests they are both domestic animals that perform similar actions.
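
You can reproduce these counts with a few lines of Python (a toy sketch over the three-sentence corpus above):

python
from collections import defaultdict

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat and the dog played",
]

window = 2
counts = defaultdict(int)
for sentence in corpus:
    tokens = sentence.split()
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[(center, tokens[j])] += 1

print({pair: c for pair, c in counts.items() if pair[0] == "cat"})
# {('cat', 'the'): 3, ('cat', 'sat'): 1, ('cat', 'on'): 1, ('cat', 'and'): 1}
print({pair: c for pair, c in counts.items() if pair[0] == "dog"})
# {('dog', 'the'): 2, ('dog', 'sat'): 1, ('dog', 'on'): 1, ('dog', 'and'): 1, ('dog', 'played'): 1}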

Window Size Still Matters: But Now Globally

Just like Word2Vec, the choice of window size in GloVe affects which co-occurrences we capture.

But there is a crucial difference. In Word2Vec, window size affects which training pairs we generate—each window creates local training examples. In GloVe, by contrast, window size affects which cells of the co-occurrence matrix get incremented. We count once, globally, across the entire corpus.

A common extension in GloVe is distance-weighted counting, where words closer to the center contribute more than words farther away.

If word B is at distance d from word A, we increment the count by \frac{1}{d} instead of 1:

plaintext
Example: "the quick brown fox jumps"
For center word "fox" with window size 2:

Distance 1: brown (weight = 1/1 = 1.0), jumps (weight = 1.0)
Distance 2: quick (weight = 1/2 = 0.5)

Co-occurrence counts:
X[fox, brown] += 1.0
X[fox, jumps] += 1.0
X[fox, quick] += 0.5

This decay captures the intuition that immediate neighbors are more relevant than distant words within the window.

From Sentences to a Matrix

After processing the entire corpus, we have a vocabulary-by-vocabulary matrix \mathbf{X} where:

  • Rows: Target words (the center words)
  • Columns: Context words (words appearing in the window)
  • Cells: Counts X_{ij} = number of times word j appears in the context of word i

For our tiny corpus with vocabulary \{\text{the, cat, dog, sat, on, mat, log, and, played}\}, the co-occurrence matrix (window size 2, symmetric) looks like:

|     | the | cat | dog | sat | on | mat | log | and | played |
|-----|-----|-----|-----|-----|----|-----|-----|-----|--------|
| the | 0 | 3 | 2 | 2 | 2 | 1 | 1 | 2 | 1 |
| cat | 3 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| dog | 2 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
| sat | 2 | 1 | 1 | 0 | 2 | 1 | 1 | 0 | 0 |
| on  | 2 | 1 | 1 | 2 | 0 | 1 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

Reading the matrix:

  • Row for "cat": Xcat,the=3X_{\text{cat}, \text{the}} = 3, Xcat,sat=1X_{\text{cat}, \text{sat}} = 1, Xcat,on=1X_{\text{cat}, \text{on}} = 1
  • Row for "dog": Xdog,the=2X_{\text{dog}, \text{the}} = 2, Xdog,sat=1X_{\text{dog}, \text{sat}} = 1, Xdog,on=1X_{\text{dog}, \text{on}} = 1

Observation: The rows for "cat" and "dog" are similar (high values in the same columns). This row similarity reflects semantic similarity!

Why This Matrix Already Contains Meaning

This is the crucial insight: similar rows imply similar usage, which implies similar meaning.

Let's formalize this. If word i has row vector \mathbf{X}_i = [X_{i1}, X_{i2}, \ldots, X_{iV}], this vector represents the distributional signature of word i: how it co-occurs with every word in the vocabulary.

Two words with similar distributional signatures must have similar meanings (distributional hypothesis from Part 1).

Example: "ice" vs "snow"

Suppose we computed co-occurrence counts for a real corpus and extracted these rows:

|      | cold | frozen | water | winter | hot | desert | tropical |
|------|------|--------|-------|--------|-----|--------|----------|
| ice  | 1247 | 892 | 2341 | 654 | 23 | 12 | 8 |
| snow | 1456 | 734 | 1987 | 891 | 18 | 15 | 11 |
| sand | 45 | 32 | 34 | 67 | 234 | 876 | 123 |

Observations:

  • "ice" and "snow" have high counts for {cold, frozen, water, winter} and low counts for {hot, desert, tropical}
  • "sand" has the opposite pattern: high for {hot, desert}, low for {cold, frozen}
  • The rows for "ice" and "snow" are similar; this means they are semantically related
  • The row for "sand" is different because it's semantically distinct

This similarity can be measured with cosine similarity (from Part 1):

\text{similarity}(\text{ice}, \text{snow}) = \frac{\mathbf{X}_{\text{ice}} \cdot \mathbf{X}_{\text{snow}}}{|\mathbf{X}_{\text{ice}}| |\mathbf{X}_{\text{snow}}|}

If we computed this for our real matrix, we'd find \text{similarity}(\text{ice}, \text{snow}) \approx 0.85 (high) and \text{similarity}(\text{ice}, \text{sand}) \approx 0.23 (low).

The matrix already encodes semantic relationships—before any neural network, before any training. This is pure statistics, pure counting.
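
As a quick sanity check, here is the cosine computation on the illustrative rows from the table above (the counts are made up, so treat the exact numbers as illustrative):

python
import numpy as np

# Illustrative co-occurrence rows (columns: cold, frozen, water, winter, hot, desert, tropical)
ice  = np.array([1247, 892, 2341, 654, 23, 12, 8], dtype=float)
snow = np.array([1456, 734, 1987, 891, 18, 15, 11], dtype=float)
sand = np.array([45, 32, 34, 67, 234, 876, 123], dtype=float)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(ice, snow))  # high: similar distributional signatures
print(cosine(ice, sand))  # much lower: very different co-occurrence patterns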

So why do we need GloVe at all? Why not just use the co-occurrence matrix directly for similarity computations?

Two problems:

  • Dimensionality: For vocabulary size V = 50,000, each word is represented by a 50,000-dimensional vector. This is sparse, high-dimensional, and computationally expensive.
  • Raw counts are problematic: Frequent words like "the" dominate the counts. We need transformation (logarithms, normalization, weighting) to extract meaningful signal.

GloVe addresses both issues: it learns low-dimensional dense embeddings that approximate the transformed co-occurrence statistics. The embeddings are compressed, efficient, and normalized but they preserve the distributional relationships from the matrix.

Figure: Co-occurrence matrix for a tiny corpus. Rows represent words' distributional signatures. Similar rows (e.g., "cat" and "dog") indicate similar meanings. The matrix is sparse and high-dimensional; GloVe compresses it into dense, low-dimensional embeddings.

Pseudocode for Building the Matrix

Here's how we construct X\mathbf{X} in practice:

python
from collections import defaultdict

# Co-occurrence "matrix" as a nested dict (sparse, since most pairs never co-occur)
X = defaultdict(lambda: defaultdict(float))

for sentence in corpus:                      # each sentence is a list of tokens
    for center_position in range(len(sentence)):
        center_word = sentence[center_position]

        # Look at context window
        for offset in range(-window_size, window_size + 1):
            if offset == 0:
                continue  # Skip the center word itself

            context_position = center_position + offset
            if context_position < 0 or context_position >= len(sentence):
                continue  # Out of bounds

            context_word = sentence[context_position]

            # Optionally weight by distance
            distance = abs(offset)
            weight = 1.0 / distance  # Closer words get higher weight

            # Increment co-occurrence count
            X[center_word][context_word] += weight

# Result: X[i][j] = weighted count of how often word j appears near word i

Key point: This is a single pass through the corpus. We count once, then use X\mathbf{X} for training. Compare to Word2Vec, which makes multiple passes (epochs) through the corpus, reprocessing the same co-occurrences repeatedly.

Why Raw Counts Don't Work (and Logarithms Do)

We have established that the co-occurrence matrix X\mathbf{X} encodes semantic relationships.

But if we use raw counts directly, we run into serious problems. Let's understand why, and how logarithms fix this.

The Frequency Problem

Not all co-occurrences are equally meaningful. Consider these counts from a real corpus:

X_{\text{dog}, \text{the}} = 12,\!456 \quad \text{(very high)} \qquad X_{\text{dog}, \text{barked}} = 892 \quad \text{(moderate)} \qquad X_{\text{dog}, \text{asteroid}} = 0 \quad \text{(zero)}

You see, the word "the" appears everywhere. It is the most frequent word in English, co-occurring with nearly every noun.

The count X_{\text{dog}, \text{the}} = 12,\!456 is high not because "the" is semantically related to "dog", but because "the" is frequent.

Contrast with "barked": X_{\text{dog}, \text{barked}} = 892 is lower in absolute terms, but far more informative. "Barked" is selective, that is, it only appears with words that can bark (dogs, wolves, seals, but not cats, asteroids, or ideas). The co-occurrence is meaningful.

If we used raw counts to measure similarity, common words would dominate:

\mathbf{v}_{\text{dog}} \cdot \mathbf{v}_{\text{cat}} \approx X_{\text{dog}, \text{the}} + X_{\text{cat}, \text{the}} + \ldots

The dot product would be inflated by shared high-frequency words, swamping the signal from rare but meaningful words.

In Part 1, we introduced Pointwise Mutual Information (PMI) to address this:

\text{PMI}(w_1, w_2) = \log \frac{P(w_1, w_2)}{P(w_1) \cdot P(w_2)}

PMI answers the question: how much more likely are these words to co-occur than if they were independent?

  • High PMI: words co-occur more than chance giving us meaningful association
  • Low or negative PMI: words co-occur less than chance, that is, unrelated or anti-correlated

Raw counts X_{ij} don't account for the base frequencies P(w_i) and P(w_j).

Logarithms, combined with proper normalization, approximate PMI, which is what we actually want.
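
A tiny numeric illustration of this correction (all probabilities below are made up purely for illustration):

python
import math

# Made-up unigram and co-occurrence probabilities, for illustration only
p = {"dog": 0.005, "the": 0.06, "barked": 0.00012}
p_joint = {("dog", "the"): 0.00035, ("dog", "barked"): 0.000003}

def pmi(w1, w2):
    return math.log(p_joint[(w1, w2)] / (p[w1] * p[w2]))

print(pmi("dog", "the"))     # ~0.15: near chance, despite the huge raw count
print(pmi("dog", "barked"))  # ~1.6: a strong, selective association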

Human Perception is Logarithmic

There is a deeper reason why logarithms are the right transformation: human perception of quantity is logarithmic, not linear.

Consider these questions:

Question 1: Which difference is more significant?

  • Difference A: Seeing a word 1 time vs. 10 times (change of +9)
  • Difference B: Seeing a word 10,000 times vs. 10,009 times (change of +9)

Intuitively, Difference A is far more significant. Going from 1 occurrence to 10 is a 10× increase; we now have substantial evidence the word exists.

Going from 10,000 to 10,009 is a 0.09% increase which is statistically insignificant.

Question 2: Which is more surprising?

  • Seeing a co-occurrence increase from 2 to 20 (10× increase)
  • Seeing a co-occurrence increase from 1,000 to 1,010 (1% increase)

Again, the first is more surprising. Multiplicative changes matter more than absolute changes.

This is formalized in the Weber-Fechner law from psychophysics: the perceived intensity of a stimulus is proportional to the logarithm of the physical intensity.

Brightness, loudness, and perceived quantity all follow logarithmic scaling in human perception.

For language, this means:

  • The difference between co-occurring 1 time and 10 times is as perceptually significant as the difference between 10 times and 100 times, or between 100 and 1,000
  • Each 10× increase feels like "one step" in perceptual space

Logarithms capture this:

\log(1) = 0, \quad \log(10) = 1, \quad \log(100) = 2, \quad \log(1000) = 3

Equal multiplicative changes become equal additive changes in log space. This aligns with how we actually interpret co-occurrence frequencies.

Log Counts Compress Reality

Let's see the compression effect concretely. Here are raw counts and their logarithms (base 10 for interpretability):

| Co-occurrence | Raw count X_{ij} | \log_{10}(X_{ij}) |
|---------------|------------------|-------------------|
| Rare pair | 1 | 0.00 |
| Uncommon pair | 10 | 1.00 |
| Moderate pair | 100 | 2.00 |
| Common pair | 1,000 | 3.00 |
| Very common pair | 10,000 | 4.00 |
| Extremely common | 100,000 | 5.00 |

Below are a few observations from the table above.

Raw counts span 5 orders of magnitude (1 to 100,000): The extremely common pair is 100,000× larger than the rare pair. If we used these directly in training, the gradient updates would be dominated by the frequent pairs and rare pairs would contribute negligibly.

Log counts compress to a range of 5 (0 to 5): The extremely common pair is only 5 "steps" away from the rare pair in log space. This makes training numerically stable: all co-occurrences contribute meaningfully to the loss, not just the most frequent ones.

Why This Makes Learning Numerically Stable

In gradient descent, the magnitude of gradients is proportional to the target values.

If we are trying to predict the raw count X_{ij} = 100,\!000, and our model predicts 50,000, the error is 100,\!000 - 50,\!000 = 50,\!000, which is a huge gradient that could cause unstable updates (exploding gradients).

If we instead predict \log(X_{ij}) = 5, and our model predicts 4.7, the error is 5 - 4.7 = 0.3, a much smaller, more stable gradient.

This is why GloVe's objective (which we will see next) minimizes squared error on log counts, not raw counts:

\text{Loss} \propto \left( \mathbf{w}_i^T \mathbf{w}_j - \log X_{ij} \right)^2

The \log X_{ij} term ensures that:

  1. Frequent and rare pairs contribute comparably to the loss
  2. Gradients remain bounded and stable
  3. The objective aligns with human perception of co-occurrence significance

Aligns Better With Semantic Differences

Finally, logarithms align with how semantic relationships scale. Consider the semantic "distance" between these words:

  • ("dog", "wolf"): Both canines, closely related
  • ("dog", "cat"): Both pets, moderately related
  • ("dog", "asteroid"): Completely unrelated

If raw co-occurrence counts are:

X_{\text{dog}, \text{wolf}} = 500, \quad X_{\text{dog}, \text{cat}} = 50, \quad X_{\text{dog}, \text{asteroid}} = 0

The ratio 500:50:0 suggests "dog" is 10× more related to "wolf" than to "cat". But semantically, both are animals; therefore, the difference is less dramatic than 10×.

In log space:

\log(500) \approx 2.7, \quad \log(50) \approx 1.7, \quad \log(0.1) \approx -1

The difference 2.7 - 1.7 = 1.0 is a single "step," reflecting that both are animal-related. The gap to "asteroid" (-1, assuming we add a small constant to avoid \log(0)) is much larger.

This logarithmic spacing better reflects semantic structure than raw counts.

Figure: Raw vs. log frequency. Raw co-occurrence counts span many orders of magnitude, dominated by frequent pairs. Log transformation compresses the range, making frequent and rare pairs contribute comparably. The curve shows saturation: doubling a huge count barely changes the log value.

The GloVe Objective (Explained, Not Derived)

Finally, with all the theory above, now is the time to learn the objective function that transforms global co-occurrence statistics into dense word embeddings.

We won't derive this from first principles (the original paper does that). Instead, we will build intuition for what each component does and why the equation makes sense.

The Core Equation

GloVe minimizes this weighted least-squares objective:

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( \mathbf{w}_i^T \mathbf{w}_j + b_i + b_j - \log X_{ij} \right)^2

Damn, this looks intimidating, so let's unpack it piece by piece.

Breaking It Down

The sum runs over all word pairs i, j in the vocabulary. For a 50,000-word vocabulary, that's a whopping 2.5 billion pairs (though most are zero and can be skipped).

The inner term is a squared error:

\left( \mathbf{w}_i^T \mathbf{w}_j + b_i + b_j - \log X_{ij} \right)^2

This measures how well the dot product of the embeddings matches the log co-occurrence count.

Let's break down what's being compared:

Left side (model prediction): \mathbf{w}_i^T \mathbf{w}_j + b_i + b_j

  • \mathbf{w}_i^T \mathbf{w}_j: Dot product of word embeddings (our learned vectors)
  • b_i: Bias term for word i
  • b_j: Bias term for word j

Right side (target): \log X_{ij}

  • The log of the co-occurrence count (transformed statistic from the matrix)

The objective is to make these two as close as possible.

The weighting function f(X_{ij}) controls how much each pair contributes to the total loss. We will discuss this in detail later, but intuitively:

  • Very rare pairs (X_{ij} close to 0): low weight (noisy, unreliable)
  • Very frequent pairs (X_{ij} very large): capped weight (saturated, less informative)
  • Mid-frequency pairs: highest weight (sweet spot of informativeness)
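
Putting these pieces together, here is a minimal sketch of the per-pair term of J in code (random vectors and illustrative values; the weighting function is only previewed here and is discussed in detail below):

python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    # f(X_ij): down-weight rare pairs, cap frequent ones
    return min((x / x_max) ** alpha, 1.0)

def pair_loss(w_i, w_j, b_i, b_j, x_ij):
    # Weighted squared error between the model's score and log X_ij
    err = w_i @ w_j + b_i + b_j - np.log(x_ij)
    return glove_weight(x_ij) * err ** 2

rng = np.random.default_rng(0)
w_king, w_queen = rng.normal(size=50), rng.normal(size=50)
print(pair_loss(w_king, w_queen, 0.1, 0.1, x_ij=347.0))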

What the Model is Really Doing

At a high level, GloVe is performing matrix factorization. Let's understand how.

The co-occurrence matrix \mathbf{X} is V \times V (vocabulary-by-vocabulary). Taking its logarithm element-wise gives \log \mathbf{X} (with some handling for zero entries).

GloVe learns two embedding matrices:

  • \mathbf{W} \in \mathbb{R}^{V \times d}: Word embeddings (d dimensions, typically 100-300)
  • \mathbf{\tilde{W}} \in \mathbb{R}^{V \times d}: Context embeddings

The objective tries to make:

\mathbf{W} \mathbf{\tilde{W}}^T \approx \log \mathbf{X}

This is low-rank matrix factorization, where we approximate a large V \times V matrix with the product of two smaller V \times d matrices.

Why is this useful?

1. Dimensionality reduction: Instead of storing V^2 co-occurrence counts, we store 2Vd embedding values. For V = 50,\!000 and d = 300: 50,\!000^2 = 2.5 billion vs. 2 \times 50,\!000 \times 300 = 30 million (83× compression).

2. Generalization: The low-rank constraint forces the model to find shared structure. Instead of memorizing every co-occurrence, it learns latent dimensions that explain many co-occurrences simultaneously.

3. Noise reduction: Rare co-occurrences might be noisy (statistical flukes). Factorization smooths over noise by finding the best low-dimensional approximation.

This is similar to Singular Value Decomposition (SVD) on \log \mathbf{X} (as used in Latent Semantic Analysis, mentioned in Part 1).

The key differences are:

  • SVD: Unweighted, solves for exact factorization with closed-form solution
  • GloVe: Weighted (via f(X_{ij})), optimized via gradient descent, includes bias terms
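
For comparison, an LSA-style baseline is just a truncated SVD of the log-transformed matrix. A sketch, using a random stand-in for a real co-occurrence matrix:

python
import numpy as np

V, d = 1000, 50
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(V, V)).astype(float)  # stand-in for a real co-occurrence matrix

logX = np.log1p(X)                                # log(1 + X) sidesteps log(0)
U, S, Vt = np.linalg.svd(logX, full_matrices=False)
embeddings = U[:, :d] * np.sqrt(S[:d])            # one common convention for forming vectors

# Unlike GloVe: every cell is weighted equally (no f(X_ij)) and there are no bias terms.
print(embeddings.shape)  # (1000, 50)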

The bias terms b_i and b_j are crucial; let's understand them next.

Bias Terms Absorb Word Popularity

Why do we need b_i and b_j? Why not just use the dot product \mathbf{w}_i^T \mathbf{w}_j?

They are needed to account for word frequency independently of word meaning.

What?

Consider the word "the". It is extremely frequent, appearing in nearly every sentence. Its co-occurrence counts with all words are inflated simply because it appears everywhere:

X_{\text{the}, \text{dog}} = 45,\!000, \quad X_{\text{the}, \text{cat}} = 38,\!000, \quad X_{\text{the}, \text{asteroid}} = 12,\!000

If we tried to model these with just \mathbf{w}_{\text{the}}^T \mathbf{w}_j, the embedding \mathbf{w}_{\text{the}} would need large magnitude to produce large dot products with everything. This pollutes the semantic space.

The bias term b_{\text{the}} absorbs this frequency:

\mathbf{w}_{\text{the}}^T \mathbf{w}_{\text{dog}} + \underbrace{b_{\text{the}}}_{\text{captures frequency of "the"}} + b_{\text{dog}} \approx \log X_{\text{the}, \text{dog}}

Now b_{\text{the}} can be large (because "the" is frequent), while \mathbf{w}_{\text{the}} remains normalized. The embedding captures the semantic properties of "the" (a determiner, appears before nouns), not its raw frequency.

Similarly, b_j absorbs the context word's frequency. This is symmetric: both i and j contribute their own frequency offsets.

Mathematical intuition:

Recall that PMI is:

\text{PMI}(i, j) = \log \frac{P(i,j)}{P(i) P(j)} = \log P(i,j) - \log P(i) - \log P(j)

The co-occurrence count X_{ij} is proportional to P(i,j). The word frequencies are proportional to P(i) and P(j).

So we can write:

\log X_{ij} \approx \text{PMI}(i,j) + \log P(i) + \log P(j) + \text{constant}

The bias terms b_i and b_j learn to approximate \log P(i) and \log P(j), leaving \mathbf{w}_i^T \mathbf{w}_j to approximate the PMI (the frequency-independent association).

This design ensures that dot products reflect semantic similarity, not word frequency.

Why Squared Error Makes Sense

GloVe uses mean squared error (MSE) as the loss function:

\text{Loss} = \left( \text{prediction} - \text{target} \right)^2

Why does GloVe use squared error, rather than cross-entropy (used in Word2Vec)?

GloVe is reconstruction, not prediction.

Word2Vec (Skip-Gram/CBOW): Predicts one word from another. This is a classification problem: given context, which of 50,000 words is most likely? Cross-entropy is natural for classification.

GloVe: Reconstructs log co-occurrence counts. This is a regression problem: given word pair (i, j), predict the continuous value \log X_{ij}. MSE is natural for regression.

Squared error has a nice geometric interpretation: it measures Euclidean distance between prediction and target.

Minimizing squared error is equivalent to maximum likelihood estimation under Gaussian noise assumptions (the target \log X_{ij} is the true value plus Gaussian noise, and we want to estimate the mean).

Additionally, squared error is convex in the parameters (for fixed weighting), making optimization well-behaved.

The Full Objective

Putting it all together:

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( \mathbf{w}_i^T \mathbf{w}_j + b_i + b_j - \log X_{ij} \right)^2

What we're minimizing: A weighted sum of squared errors between dot products (plus biases) and log counts.

Over what? We are optimizing the embedding matrices \mathbf{W} and \mathbf{\tilde{W}} (word and context embeddings) plus the bias vectors \mathbf{b} and \mathbf{\tilde{b}}.

Why this works: By minimizing this objective, we force the geometric relationship \mathbf{w}_i^T \mathbf{w}_j to reflect the statistical relationship \log X_{ij}. Words that co-occur frequently will have high dot products (small angle); words that don't co-occur will have low dot products (large angle).

The embeddings emerge as a compressed, low-dimensional representation of co-occurrence statistics, exactly like Word2Vec, but learned from aggregated global data instead of local predictions.
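
To make the whole objective tangible, here is a minimal SGD sketch of GloVe training on a dictionary of nonzero co-occurrence counts. It is a toy version: the released GloVe code uses AdaGrad and many more tricks, and all names and numbers below are illustrative.

python
import numpy as np

def train_glove(cooc, vocab_size, dim=50, epochs=20, lr=0.05, x_max=100.0, alpha=0.75):
    rng = np.random.default_rng(0)
    W  = rng.normal(scale=0.1, size=(vocab_size, dim))  # word embeddings
    Wc = rng.normal(scale=0.1, size=(vocab_size, dim))  # context embeddings
    b, bc = np.zeros(vocab_size), np.zeros(vocab_size)  # bias terms

    pairs = list(cooc.items())  # cooc: {(i, j): X_ij}, nonzero entries only
    for _ in range(epochs):
        for idx in rng.permutation(len(pairs)):
            (i, j), x = pairs[idx]
            weight = min((x / x_max) ** alpha, 1.0)        # f(X_ij)
            err = W[i] @ Wc[j] + b[i] + bc[j] - np.log(x)  # model score minus log count
            grad = weight * err                            # constant factors folded into lr
            W[i], Wc[j] = W[i] - lr * grad * Wc[j], Wc[j] - lr * grad * W[i]
            b[i]  -= lr * grad
            bc[j] -= lr * grad
    return W + Wc  # the paper recommends summing word and context vectors

# Toy usage: a handful of made-up counts over a 4-word vocabulary
cooc = {(0, 1): 50.0, (1, 0): 50.0, (0, 2): 3.0, (2, 0): 3.0, (1, 2): 2.0, (2, 1): 2.0}
vectors = train_glove(cooc, vocab_size=4)
print(vectors.shape)  # (4, 50)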

Figure: Matrix factorization. GloVe factorizes the log co-occurrence matrix \log \mathbf{X} into two low-rank embedding matrices \mathbf{W} and \mathbf{\tilde{W}}. The reconstruction \mathbf{W} \mathbf{\tilde{W}}^T \approx \log \mathbf{X} compresses a huge sparse matrix into dense embeddings.

The Weighting Function: Why Some Pairs Matter More

Not all word pairs are equally informative. GloVe's weighting function f(X_{ij}) ensures that the model focuses on the most useful co-occurrences while down-weighting noise and saturation.

Let's understand why this is necessary and how f is designed.

Rare Pairs Are Noisy

Consider a word pair that co-occurs only once in a 1-billion-word corpus:

X_{\text{quokka}, \text{asteroid}} = 1

This co-occurrence might be a statistical fluke: In a bizarre news story, a quokka was named after the newly discovered asteroid...

This is not a meaningful semantic relationship, it's just a one-off coincidence in a quirky sentence. If we treat this pair with high importance, we will learn spurious associations.

The weighting function f(X_{ij}) should assign low weight to very rare pairs (X_{ij} \approx 0), effectively telling the model: "Don't trust this signal; it's too noisy."

Very Frequent Pairs Saturate

On the other end, consider extremely frequent pairs:

X_{\text{the}, \text{of}} = 847,\!293

These words co-occur hundreds of thousands of times. While this confirms they are strongly associated (both are function words), the 847,293rd occurrence doesn't add much new information beyond what the first 1,000 occurrences already told us.

Treating all 847,293 occurrences with equal weight would mean this pair dominates the loss function. The model would spend most of its optimization capacity fitting the most frequent pairs, potentially underfitting less common but semantically rich pairs.

The weighting function should cap the influence of very frequent pairs, preventing saturation from overwhelming the training signal.

The Sweet Spot: Mid-Frequency Pairs

The most informative co-occurrences are in the middle:

  • Not too rare: Reliable signal, not statistical noise
  • Not too frequent: Each occurrence still provides new information

For example:

X_{\text{dog}, \text{barked}} = 1,\!234

This is frequent enough to be trustworthy (not a fluke), but not so common that additional occurrences are redundant. Pairs like this should receive high weight.

Explaining f(x) Intuitively

GloVe uses this weighting function:

f(x) = \begin{cases} \left( \frac{x}{x_{\max}} \right)^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{if } x \geq x_{\max} \end{cases}

Where x = X_{ij} (the co-occurrence count), x_{\max} is a threshold (typically 100), and \alpha is an exponent (typically 0.75).

Let's break this down:

For Rare Pairs

For x < x_{\max}: f(x) = \left( \frac{x}{x_{\max}} \right)^{\alpha}

Example with x_{\max} = 100 and \alpha = 0.75:

  • x = 1: f(1) = \left(\frac{1}{100}\right)^{0.75} \approx 0.032 (very low weight)
  • x = 10: f(10) = \left(\frac{10}{100}\right)^{0.75} \approx 0.178
  • x = 50: f(50) = \left(\frac{50}{100}\right)^{0.75} \approx 0.595
  • x = 100: f(100) = 1.0 (maximum weight)
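
These values are easy to verify with a couple of lines (a direct transcription of the piecewise definition above):

python
def f(x, x_max=100.0, alpha=0.75):
    return (x / x_max) ** alpha if x < x_max else 1.0

for x in [1, 10, 50, 100, 1000]:
    print(x, round(f(x), 4))
# 1 0.0316, 10 0.1778, 50 0.5946, 100 1.0, 1000 1.0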

Behavior:

  • As the co-occurrence count increases from 0 to x_{\max}, the weight grows smoothly from 0 to 1
  • The exponent \alpha = 0.75 makes the growth sublinear: doubling x doesn't double f(x); it grows by a factor of 2^{0.75} \approx 1.68
  • This means rare pairs get very low weight (near 0), while mid-frequency pairs quickly ramp up

Why sublinear? If we used \alpha = 1 (linear), the function would weight pairs in direct proportion to their raw counts, reintroducing the frequency problem.

The exponent 0.75 balances giving rare pairs some weight (acknowledging they exist) while heavily down-weighting the noisiest ones.

For Frequent Pairs

For x \geq x_{\max}: f(x) = 1

Once x exceeds the threshold x_{\max}, the weight is capped at 1. This prevents very frequent pairs from dominating.

Example:

  • x = 100: f(100) = 1.0
  • x = 1,\!000: f(1,\!000) = 1.0 (capped)
  • x = 100,\!000: f(100,\!000) = 1.0 (capped)

All pairs with X_{ij} \geq 100 contribute equally to the loss, regardless of how much larger than 100 they are.

Why cap at x_{\max} = 100? This is an empirical choice from the original GloVe paper. The authors found that co-occurrences beyond 100 provide diminishing returns, i.e., the first 100 occurrences capture the relationship, and additional occurrences mostly add redundancy.

Capping prevents the loss from being dominated by the most common words.

The Exponent \alpha = 0.75: Where Did It Come From?

The choice of \alpha = 0.75 echoes the smoothed unigram distribution used in Word2Vec's negative sampling (from Part 2). There, we sampled negative words with probability \propto f(w)^{0.75}, which balanced frequent and rare words.

Here, the same exponent serves a similar purpose: it smooths the contribution of rare vs. frequent pairs. If \alpha = 1, the weight grows linearly with count (too sensitive to frequency). If \alpha = 0.5, the weight grows too slowly (under-weighting mid-frequency pairs). \alpha = 0.75 is a sweet spot found empirically.

Remember, this is not a theoretically derived value but a hyperparameter tuned on downstream tasks (word similarity benchmarks). But it works robustly across different corpora.

Putting It Together

The weighted loss for a single pair (i, j) is:

f(X_{ij}) \times \left( \mathbf{w}_i^T \mathbf{w}_j + b_i + b_j - \log X_{ij} \right)^2

  • If X_{ij} = 1 (very rare): f(1) \approx 0.032, so this pair contributes almost nothing to the total loss (ignored as noise)
  • If X_{ij} = 50 (mid-frequency): f(50) \approx 0.595, so this pair contributes substantially
  • If X_{ij} = 100 (frequent): f(100) = 1.0, maximum contribution
  • If X_{ij} = 10,\!000 (very frequent): f(10,\!000) = 1.0, same as X_{ij} = 100 (capped to prevent dominance)

This ensures that the model learns primarily from reliable, informative co-occurrences while tolerating rare noise and preventing frequent pairs from overwhelming the signal.

Figure: Weight vs. frequency. GloVe's weighting function f(x). For rare co-occurrences (x < x_{\max}), the weight grows sublinearly with count (exponent 0.75). For frequent co-occurrences (x \geq x_{\max}), the weight is capped at 1. Mid-frequency pairs in the "sweet spot" receive the highest effective influence.

GloVe vs Word2Vec

We have now seen both Word2Vec (Part 2) and GloVe in detail. On the surface, they seem quite different, yet empirically, both produce similar embeddings.

The famous word analogies work with both:

\mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}} \approx \mathbf{v}_{\text{queen}}

Word similarity rankings are nearly identical. Cosine similarities match closely.

Why? Because they are extracting the same underlying structure from different angles.
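
If you want to poke at this yourself, gensim's downloader ships pretrained GloVe vectors (internet access required; the exact neighbors and scores depend on the particular vector set, so treat the printed values as indicative):

python
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")  # pretrained 100-d GloVe vectors
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
print(glove.similarity("dog", "cat"))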

Local Prediction vs Global Reconstruction

The key philosophical difference between the two is this:

Word2Vec (local prediction):

  • Processes sentences sequentially, one window at a time
  • Learns to predict: P(\text{context} \mid \text{word}) or P(\text{word} \mid \text{context})
  • Global patterns emerge implicitly through many local updates
  • Never explicitly constructs the co-occurrence matrix

GloVe (global reconstruction):

  • Processes the entire corpus once to build the co-occurrence matrix \mathbf{X}
  • Learns to reconstruct: \mathbf{w}_i^T \mathbf{w}_j \approx \log X_{ij}
  • Global patterns are explicit from the start (stored in \mathbf{X})
  • Training optimizes a single global objective

It is like learning geography:

  • Word2Vec approach: Walk around a city, street by street, remembering which neighborhoods connect. After many walks, you build a mental map.
  • GloVe approach: Get the full city map first, then learn a compressed representation (like memorizing major landmarks and highways instead of every street).

Both give us navigational knowledge, but through different learning processes.

Why Embeddings Look Similar in Practice

When researchers train Word2Vec and GloVe on the same corpus with similar hyperparameters, the learned embeddings are highly correlated.

For example, if we compute:

\text{cosine}(\mathbf{v}_{\text{dog}}^{\text{Word2Vec}}, \mathbf{v}_{\text{cat}}^{\text{Word2Vec}}) = 0.76

We typically find:

\text{cosine}(\mathbf{v}_{\text{dog}}^{\text{GloVe}}, \mathbf{v}_{\text{cat}}^{\text{GloVe}}) \approx 0.74

The similarity scores are close. The nearest neighbors for "dog" in both embedding spaces are usually {puppy, cat, dogs, pet, animal} (perhaps in slightly different order).

Why? Because both methods:

  1. Use the same data source (co-occurrence patterns in text)
  2. Optimize for similar objectives (dot product reflects co-occurrence statistics)
  3. Produce similar geometry (PMI approximation)

The minor differences come from:

  • Sampling noise in Word2Vec (which pairs get selected during training depends on random window sampling)
  • Weighting differences (GloVe's f(X_{ij}) vs Word2Vec's negative sampling distribution)
  • Optimization dynamics (SGD on streaming data vs batch optimization on a fixed matrix)

In practice, these differences are often negligible for downstream tasks. Both produce high-quality embeddings.

Comparison Table

| Aspect | Word2Vec (Skip-Gram) | GloVe |
|--------|----------------------|-------|
| Training data | Sliding context windows | Pre-computed co-occurrence matrix |
| Objective | Predict context from word (or vice versa) | Reconstruct log co-occurrence counts |
| Loss function | Negative log-likelihood (cross-entropy) | Weighted mean squared error |
| Matrix factorization | Implicit (PMI matrix) | Explicit (log co-occurrence matrix) |
| Computational cost | Proportional to corpus size × epochs | Matrix construction + optimization on matrix |
| Memory usage | Low (only stores embeddings during training) | High (stores full co-occurrence matrix) |
| Handling rare words | Better (direct gradient updates) | Worse (down-weighted by f(X_{ij})) |
| Handling frequent words | Good (negative sampling balances) | Good (capped weighting prevents dominance) |
| Interpretability | Opaque (what does a gradient update "mean"?) | Transparent (can inspect \mathbf{X}) |
| Parallelization | Difficult (sequential sentence processing) | Easy (matrix operations parallelize well) |
| Typical performance | Slightly better on rare words | Slightly better on frequent words |
| Training time (small corpus) | Faster (no matrix construction) | Slower (matrix overhead) |
| Training time (large corpus) | Slower (many epochs over corpus) | Faster (optimize on matrix, not raw text) |

When to use Word2Vec:

  • Small to medium corpora (< 1 billion words)
  • Rare words are critical (medical, legal, scientific domains)
  • Limited memory (can't store large co-occurrence matrix)
  • Streaming data (corpus grows over time; can update incrementally)

When to use GloVe:

  • Large corpora (> 1 billion words)
  • Need interpretability (want to inspect co-occurrence statistics)
  • Have sufficient memory for the matrix
  • Parallelization is important (can distribute matrix operations)
  • Want deterministic results (Word2Vec has random sampling; GloVe is deterministic given \mathbf{X})

Modern practice: Both are now largely superseded by contextual embeddings (BERT, GPT) for most tasks. But for applications requiring static embeddings (efficient similarity search, interpretability, low-resource settings), GloVe is often preferred due to its speed and transparency.

What GloVe Teaches Us About Embeddings

Beyond the specific algorithm, GloVe reveals deeper lessons about what embeddings are and how they work. Let's extract the key insights.

1. Embeddings Are Compressed Statistics

This is the most fundamental takeaway: embeddings are compressed statistical summaries of co-occurrence patterns.

The co-occurrence matrix \mathbf{X} is a complete record of which words appear together in the corpus. It's huge (V \times V), sparse (most pairs never co-occur), and high-dimensional. But it contains all the distributional information we need.

GloVe compresses this into dense vectors (d-dimensional, d \ll V) by finding a low-rank factorization:

\mathbf{W} \mathbf{\tilde{W}}^T \approx \log \mathbf{X}

The embeddings \mathbf{W} are a compressed representation of \log \mathbf{X}. We have gone from V^2 numbers to Vd numbers, losing some information but preserving the most important structure.

This is analogous to image compression (JPEG) or audio compression (MP3):

  • Raw image: Millions of pixel values
  • JPEG: Compressed representation using frequency components (DCT coefficients)
  • Result: 10× smaller file, perceptually similar image

Similarly:

  • Raw co-occurrence matrix: Billions of count values
  • GloVe embeddings: Compressed representation using latent dimensions
  • Result: 100× smaller representation, semantically similar relationships

The compression is lossy (we can't perfectly reconstruct \mathbf{X} from \mathbf{W}), but the loss is acceptable because:

  1. Rare co-occurrences are noisy anyway (we want to smooth over them)
  2. The low-rank structure captures the systematic patterns (semantic relationships)
  3. Downstream tasks (similarity, analogy) don't need perfect reconstruction—they need semantic structure

Key insight: When you see an embedding like \mathbf{v}_{\text{dog}} = [0.23, -0.41, 0.08, \ldots], you are looking at a compressed summary of how "dog" co-occurs with all other words in the vocabulary. Each dimension captures a latent pattern of co-occurrence.

2. Geometry Reflects Usage Patterns

The geometric relationships in embedding space such as angles, distances, directions are not arbitrary. They directly reflect statistical patterns in language use.

High dot product ↔ frequent co-occurrence:

If \mathbf{w}_i^T \mathbf{w}_j is large, then \log X_{ij} is large (from GloVe's objective), meaning X_{ij} (the co-occurrence count) is large. Words with high dot product co-occur frequently.

Parallel directions ↔ similar distributional profiles:

If \mathbf{v}_{\text{king}} and \mathbf{v}_{\text{queen}} point in similar directions (high cosine similarity), it's because they have similar rows in \mathbf{X}: both co-occur with \{\text{royal, throne, crown, reign}\}.

Offsets capture relations:

The famous analogy \mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}} \approx \mathbf{v}_{\text{queen}} works because:

  • \mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} is a direction in embedding space
  • This direction represents "remove male-associated co-occurrences, keep royal-associated co-occurrences"
  • Adding \mathbf{v}_{\text{woman}} adds back female-associated co-occurrences
  • The result points toward "queen", which has a royal + female co-occurrence pattern

This arithmetic works because co-occurrence patterns are compositional.

The pattern royal + male vs. royal + female shows up in the data (different words used with kings vs. queens), and GloVe's factorization preserves this structure in the geometry.

Every geometric property has a statistical interpretation:

  • Cosine similarity → row similarity in \mathbf{X}
  • Dot product → PMI approximation
  • Vector arithmetic → compositional co-occurrence patterns
  • Clustering → shared contexts (second-order co-occurrence)

This is why embeddings "just work" for downstream tasks: the geometry encodes real linguistic patterns, not random structure.

3. Neural Networks Are Not Required for Meaning

This is perhaps the most liberating insight: we don't need a neural network to learn semantic embeddings.

Word2Vec uses a shallow neural network (one hidden layer), but GloVe doesn't. It's just weighted matrix factorization: a classical linear algebra technique, optimized with gradient descent.

The "secret sauce" is the distributional hypothesis applied to large-scale data. Meaning comes from co-occurrence patterns; any method that compresses those patterns into dense vectors will capture semantics.

Why did Word2Vec seem revolutionary in 2013?

Not because of the neural network (neural language models existed since Bengio et al. 2003), but because:

  1. Scale: Trained on billions of words (previous work used millions)
  2. Efficiency: Negative sampling made training feasible at scale
  3. Empirical success: Achieved state-of-the-art results on word similarity and analogy benchmarks
  4. Accessibility: Released as open-source code, easy to use (gensim, word2vec toolkit)

GloVe showed that the same results could be achieved with matrix factorization, reinforcing that the core insight is global co-occurrence statistics, not any particular model architecture.

Implication for modern LLMs:

Even today's massive transformer models (BERT, GPT-4) fundamentally rely on the distributional hypothesis. They learn more complex patterns (contextual usage, syntactic structure, long-range dependencies), but the foundation is still statistical co-occurrence in text data.

Neural networks provide flexibility (nonlinear transformations, attention mechanisms), but they are not the source of meaning. Meaning comes from data. The architecture is just a tool for extracting and representing that meaning efficiently.

Key Takeaways

We are nearing the end of this post. Now is a good time to look at the key takeaways.

Why Global Statistics Matter:

  • Word2Vec rediscovers the same co-occurrence patterns repeatedly through millions of local updates
  • GloVe aggregates all evidence first, then trains once on global statistics
  • Both approaches converge to similar embeddings because they extract the same latent structure (PMI approximation)

The Co-occurrence Matrix:

  • Built by counting word co-occurrences across the entire corpus
  • Rows represent distributional signatures: how each word co-occurs with all others
  • Similar rows → similar meanings (distributional hypothesis in action)
  • The matrix is interpretable, sparse, and high-dimensional

Why Raw Counts Fail:

  • Frequent words (like "the") dominate raw counts, overwhelming meaningful signal
  • Human perception is logarithmic: 1→10 feels like 10→100 (multiplicative scaling)
  • \log X_{ij} compresses counts, making frequent and rare pairs contribute comparably
  • Aligns with PMI (frequency-independent association measure)

GloVe's Objective:

  • J = \sum_{i,j} f(X_{ij}) \left( \mathbf{w}_i^T \mathbf{w}_j + b_i + b_j - \log X_{ij} \right)^2
  • Minimizes weighted squared error between dot products and log counts
  • Performs low-rank matrix factorization: \mathbf{W} \mathbf{\tilde{W}}^T \approx \log \mathbf{X}
  • Bias terms absorb word frequency, leaving dot products to capture semantic association
  • Squared error is natural for regression (reconstruction task, not classification)

The Weighting Function:

  • f(X_{ij}) gives low weight to rare pairs (noisy), caps weight for frequent pairs (saturated)
  • Mid-frequency pairs get highest effective weight (sweet spot of informativeness)
  • Prevents very common words from dominating loss, prevents rare flukes from adding noise

What GloVe Teaches Us:

  • Embeddings are compressed statistics: Dense vectors summarize sparse co-occurrence patterns
  • Geometry reflects usage: Dot products ≈ co-occurrence frequency, angles ≈ distributional similarity
  • Neural networks not required: Matrix factorization suffices; the key is distributional hypothesis + scale
  • Meaning comes from data, not architecture: Statistical patterns in text, not model complexity, drive semantic learning

Conclusion

We have completed the journey from local context windows to global statistical structures.

Part 2 (Word2Vec): Embeddings emerge from predicting contexts. Sliding windows generate millions of training pairs; neural networks learn representations where similar words solve similar prediction tasks. The geometry (dot products ≈ PMI) emerges implicitly through gradient descent on local objectives.

Part 3 (GloVe): The same geometry can be learned explicitly from global co-occurrence statistics. Count all word pairs once, store the counts in the matrix \mathbf{X}, and factorize \log \mathbf{X} into low-dimensional embeddings. The result: compressed statistical summaries that preserve semantic relationships.

The unification: Word2Vec and GloVe are not competing methods. Rather, they are two paths to the same destination. Both extract latent structure from distributional patterns; one discovers it through many local updates, the other computes it from global aggregates. Both converge to embeddings where dot products approximate PMI, the core statistical signal of semantic association.

This unification reveals a profound truth: embeddings are inevitable.

Given the distributional hypothesis (meaning comes from context) and large-scale data (millions of sentences), any reasonable learning algorithm will discover roughly the same geometric structure.

The architecture (neural network, matrix factorization) is secondary; the primary driver is the statistical regularity in how humans use language.

But all the methods we have seen so far, Word2Vec, GloVe, LSA, share fundamental limitations:

  1. Static embeddings: Each word gets a single vector, conflating all senses ("bank" as institution vs. river edge)
  2. Bag-of-words context: Word order is ignored, losing syntactic structure
  3. Fixed window size: Long-range dependencies beyond 5-10 words are invisible
  4. No compositionality: Sentence meaning is just averaged word vectors (loses "dog chased cat" vs "cat chased dog")

These limitations motivated the next revolution in NLP: contextual embeddings.

In Part 4 (The Limits of Static Embeddings), we will explore:

  • Why polysemy breaks static embeddings (one vector per word can't capture multiple meanings)
  • How ELMo, BERT, and GPT create dynamic representations that change based on sentence context
  • Why attention mechanisms outperform fixed windows for capturing dependencies
  • How transformers preserve word order through positional encodings
  • The trade-offs: contextual embeddings are powerful but computationally expensive; static embeddings are limited but efficient

The story doesn't end with Word2Vec and GloVe; it begins there.

These methods proved that meaning has geometry, and that geometry can be learned from data.

Everything since has been refinement, scaling, and contextualization of that core insight.

Namaste!


References

Foundational Papers

GloVe: Global Vectors for Word Representation - Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global Vectors for Word Representation. EMNLP 2014.

Word2Vec (for comparison) - Mikolov, T., et al. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv preprint.

Theoretical Analysis

Neural Word Embedding as Implicit Matrix Factorization - Levy, O., & Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization. NeurIPS 2014.

Improving Distributional Similarity - Levy, O., Goldberg, Y., & Dagan, I. (2015). Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the ACL.

Don't count, predict! - Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. ACL 2014.

Classical Methods (for context)

Latent Semantic Analysis - Deerwester, S., et al. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science.

Pointwise Mutual Information - Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics.

Practical Resources

GloVe Official Implementation - Stanford NLP Group. GloVe: Global Vectors for Word Representation.

GloVe Explained - GloVe: Global Vectors for Word Representation Explained. Toward Data Science tutorial.

Evaluation Benchmarks

Word Similarity - SimLex-999: Human-annotated word similarity scores for evaluation.

Written by Anirudh Sharma

Published on January 17, 2026
