40 min read · By Anirudh Sharma

Understanding Embeddings: Part 2 - Learning Meaning from Context


This is Part 2 of a 4-part series on Embeddings:



In the previous part of this four-part series, we explored why embeddings need to exist.

To recap: words need numerical representations that capture semantic similarity, and vectors provide the natural mathematical structure for this — hence, embeddings.

We also learned that meaning emerges from distributional patterns: "You shall know a word by the company it keeps" and that geometric relationships (cosine similarity, vector arithmetic) mirror semantic relationships.

But we left a critical question unanswered: Where do embeddings actually come from?

We know what properties embeddings should have (dense, low-dimensional, semantically structured). We also know that similar words should have similar vectors. But how do we learn these vectors from raw text? How does a neural network discover that "dog" and "puppy" should be close together, while "dog" and "asteroid" should be far apart?

This is where Word2Vec enters the picture. It was introduced by Tomas Mikolov and colleagues at Google in 2013.

Word2Vec is a family of related techniques for learning word embeddings from unlabeled text. It revolutionized NLP by showing that simple, shallow neural networks trained on a straightforward prediction task could capture rich semantic relationships.

We don't tell the model that "dog" and "cat" are similar. Instead, we force it to predict context words, and it discovers similarity on its own because similar words appear in similar contexts.

Core Insight: Word2Vec doesn't encode semantics directly. Instead, it creates a prediction task (predicting context from words, or vice versa) whose solution requires learning semantic representations.

The embeddings emerge as a byproduct of solving this task well.

Similar words must have similar embeddings because they appear in similar contexts and solve similar prediction problems.

This post explores the learning dynamics of Word2Vec: how local context windows create training signal, how Skip-Gram and CBOW formulations differ, why negative sampling is necessary, and what geometric pressures emerge during training.

Context Windows as Learning Signal

The distributional hypothesis tells us that meaning comes from context. But to train a model, we need to operationalize "context" into concrete training examples. This is where the context window comes in.

Local Context as Supervision

Consider the sentence: The quick brown fox jumps over the lazy dog.

If we want to learn an embedding for the word "fox", we need training data that captures its typical contexts. The simplest approach is to slide a fixed-size window across the sentence and extract (center word, context words) pairs.

Defining Center and Context

At any position in a sentence, we define:

Center word: The target word we are currently focusing on (the word at the current position in our sliding window).

Context words: The words surrounding the center word within a fixed distance (the window size).

Window size: The maximum distance on each side of the center word. A window size of k means we look at k words to the left and k words to the right, giving us up to 2k total context words.

With a window size of 2 (two words on each side), when "fox" is the center word:

plaintext
Context window for "fox":

    [quick, brown]   fox   [jumps, over]
    <----- 2 ----->        <---- 2 ---->
    (left context)         (right context)

Here:

  • Center word: "fox"
  • Context words: {quick, brown, jumps, over} (4 words total: 2 on left + 2 on right)
  • Window size: 2 (the radius, not the total count)

Two Prediction Formulations

This (center, context) pair becomes our training signal, but we can formulate the prediction task in two ways:

1. Predict context from center (Skip-Gram): Given the center word, predict the surrounding context words.

P(\text{context words} | \text{center word})

P(\text{quick, brown, jumps, over} | \text{fox})

This asks: "If I see the word "fox", what other words are likely to appear nearby?"

NOTE: P(X | Y) is a conditional probability: the probability of event X given that event Y has occurred.

2. Predict center from context (CBOW): Given the context words, predict the center word.

P(\text{center word} | \text{context words})

P(\text{fox} | \text{quick, brown, jumps, over})

This asks: "If I see the words {quick, brown, jumps, over} nearby, what word is likely in the middle?"

Both formulations capture the same distributional intuition: words that appear in similar contexts should have similar meanings. The difference is which direction we predict, and this creates different learning dynamics (which we will explore in detail later).

This approach creates semantic learning because words that appear in similar contexts will get similar representations.

Second-Order Co-occurrence

This is crucial: words don't need to appear together directly to be recognized as similar. Consider "fox" and "wolf":

  • Sentence 1: "The quick brown fox jumps through the forest"
  • Sentence 2: "The gray timber wolf runs through the forest"

"Fox" and "wolf" never co-occur directly, yet they should have similar embeddings. Why? Because they share second-order co-occurrence as they appear with similar context words:

  • Both appear with: {the, through, forest} (shared contexts)
  • "Fox" appears with: {quick, brown, jumps}
  • "Wolf" appears with: {gray, timber, runs}

Even the non-shared contexts are semantically similar:

  • "quick" ≈ "runs" (speed and motion),
  • "brown" ≈ "gray" (color adjectives),
  • "jumps" ≈ "runs" (motion verbs).

Over millions of sentences, the model learns that both animals appear in similar distributional environments (subjects of action verbs, modified by adjectives, followed by location phrases).

This shared pattern forces their embeddings to be similar, even without direct co-occurrence.
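A toy sketch makes this concrete. The snippet below is illustrative only: it lowercases the two sentences above and uses a window of 4 so the shared words are visible, then counts the context words around each animal and shows that they overlap even though "fox" and "wolf" never co-occur.

python
from collections import Counter

def context_counts(tokens, target, window_size=4):
    """Count the words that appear within `window_size` of each occurrence of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window_size), i + window_size + 1
            counts.update(tokens[lo:i] + tokens[i + 1:hi])
    return counts

s1 = "the quick brown fox jumps through the forest".split()
s2 = "the gray timber wolf runs through the forest".split()

fox_ctx = context_counts(s1, "fox")
wolf_ctx = context_counts(s2, "wolf")

print(sorted(fox_ctx))               # ['brown', 'forest', 'jumps', 'quick', 'the', 'through']
print(sorted(wolf_ctx))              # ['forest', 'gray', 'runs', 'the', 'through', 'timber']
print(set(fox_ctx) & set(wolf_ctx))  # shared contexts: {'the', 'through', 'forest'} (order may vary)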

Sliding Window Intuition

The window slides across the entire corpus, generating millions of training examples. At each position, the word at that position becomes the center word, and the surrounding words within the window become the context words.

plaintext
Sentence: "The quick brown fox jumps over the lazy dog"

Window size = 2:

Position 1: [∅, ∅]          The    [quick, brown]   center: "The"
Position 2: [∅, The]        quick  [brown, fox]     center: "quick"
Position 3: [The, quick]    brown  [fox, jumps]     center: "brown"
Position 4: [quick, brown]  fox    [jumps, over]    center: "fox"    ← our example
Position 5: [brown, fox]    jumps  [over, the]      center: "jumps"
...and so on
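Here is a minimal Python sketch (using a hypothetical get_context helper, not from any library) that reproduces the pairs in the table above:

python
def get_context(tokens, center_position, window_size):
    """Return the context words within `window_size` positions of the center."""
    start = max(0, center_position - window_size)
    left = tokens[start:center_position]
    right = tokens[center_position + 1:center_position + 1 + window_size]
    return left + right

sentence = "The quick brown fox jumps over the lazy dog".split()

for i, center in enumerate(sentence):
    print(f"center: {center!r:<8} context: {get_context(sentence, i, window_size=2)}")
# The "fox" line prints: center: 'fox'    context: ['quick', 'brown', 'jumps', 'over']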

Each position generates a training example with its (center, context) pair. For Word2Vec, we have two key choices for the prediction task:

1. Skip-Gram: Given the center word, predict each context word independently.

For position 4 (center = "fox"):

P(\text{quick} | \text{fox}), P(\text{brown} | \text{fox}), P(\text{jumps} | \text{fox}), P(\text{over} | \text{fox})

2. CBOW (Continuous Bag-of-Words): Given the context words, predict the center word.

For position 4 (center = "fox"):

P(\text{fox} | \text{quick, brown, jumps, over})

The key difference is that Skip-Gram makes multiple predictions per window (one for each context word), while CBOW makes a single prediction (the center word from all context words). This asymmetry creates different learning dynamics.

We will explore both, but first, let's understand why order initially doesn't matter.

Why Order Initially Doesn't Matter

Notice that in the context window {quick, brown, jumps, over}, we treat all four words symmetrically. We don't distinguish between "quick" (two positions left) and "over" (two positions right). The context is a bag of words: an unordered set.

This might seem like a limitation. After all, word order carries meaning:

  • "The dog chased the cat""The cat chased the dog"

But for learning basic semantic embeddings, positional information is second-order. The primary signal is co-occurrence itself: which words tend to appear near each other. Whether "quick" is left or right of "fox" matters less than the fact that they co-occur at all.

This bag-of-words assumption simplifies the learning problem enormously. Instead of modeling:

P(\text{word at position } i | \text{center word}),

we model the simpler:

P(\text{word in context} | \text{center word}).

The first requires learning positional encodings; the second requires only learning which words tend to appear together.

Later architectures (ELMo, BERT, GPT) add positional encodings to capture word order. But Word2Vec deliberately ignores order to focus on the core distributional signal. This design choice enables training on massive corpora with minimal computational cost.

Figure: Sliding a context window (size = 2) across a sentence. Each position generates a training example: the center word paired with its surrounding context words. Order within the context is ignored.

Skip-Gram: Predicting Context from a Word

Skip-Gram is the most influential Word2Vec variant. Before diving into its mathematics, let's build intuition for what it's actually trying to accomplish.

The Big Picture: A Prediction Game

Imagine you are learning a new language by reading books. You notice that certain words appear near each other frequently. When you see the word "fox", nearby words are often "quick", "brown", "forest", "jumps". When you see "wolf", nearby words are "gray", "pack", "howls", "forest".

Skip-Gram turns this observation into a learning task: If I show you a word, can you guess which words appear nearby?

This might seem backwards: why predict context from words instead of learning meanings directly? The reason is simple: words that appear in similar contexts must have similar meanings.

By forcing the model to predict contexts, we automatically cluster semantically related words.

Walking Through a Concrete Example

Let's use our familiar sentence: The quick brown fox jumps over the lazy dog

When we encounter "fox" as the center word with window size 2: Context: [quick, brown] fox [jumps, over]

Skip-Gram's task: Given only the word "fox", predict each of the four context words.

| Question | Model Answer |
| --- | --- |
| What is the probability I will find "quick" nearby? | High probability (foxes are often described as quick) |
| What's the probability I will find "asteroid" nearby? | Very low probability (foxes and asteroids rarely appear together) |
| What's the probability I will find "brown" nearby? | High probability (foxes are often brown) |
| What's the probability I will find "calculus" nearby? | Very low probability (unrelated concepts) |

The model makes four independent predictions, one for each context word. It doesn't try to predict all four simultaneously. Rather, it treats each prediction separately.

Why This Creates Semantic Learning

Here's the magic: Consider what happens when the model also sees "wolf": The gray timber wolf runs through the forest with context: [gray, timber] wolf [runs, through]

The model must now predict {gray, timber, runs, through} given "wolf".

Notice: Both "fox" and "wolf" appear with similar types of words:

  • Color adjectives: "brown" vs. "gray"
  • Motion verbs: "jumps" vs. "runs"
  • Natural settings: Both might appear with "forest"

To make good predictions for both words, the model learns similar representations for "fox" and "wolf".

If it learns that "fox" should predict animal-related contexts, that same knowledge helps predict contexts for "wolf".

This is semantic pressure in action: Words with similar meanings must have similar embeddings because they solve similar prediction tasks.

The Mathematical Formulation

Now that we understand the intuition, let's formalize it mathematically.

For a center word at position c and a window size of k, the context words reside in the window [c - k, c + k].

If the word at position c is represented as w_c, Skip-Gram tries to maximize the probability of observing the actual context words \{w_{c-k}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+k}\}.

The objective is to maximize:

P(\text{context} | w_c) = \prod_{-k \leq j \leq k,\ j \neq 0} P(w_{c+j} | w_c)

In our "fox" example with window size 2:

P(\text{context} | \text{fox}) = P(\text{quick} | \text{fox}) \times P(\text{brown} | \text{fox}) \times P(\text{jumps} | \text{fox}) \times P(\text{over} | \text{fox})

The \prod (product) symbol means we multiply all the individual probabilities together. Each context word is predicted independently; this is the bag-of-words assumption we discussed earlier.

How the Model Works: The Neural Architecture

Now let's see how the model actually computes these probabilities. We will use a tiny example with just 5 words to make the mathematics concrete.

A Tiny Example: 5-Word Vocabulary

Imagine we have a vocabulary of only 5 words:

plaintext
0: "the" 1: "fox" 2: "quick" 3: "brown" 4: "jumps"

And we want to use 3-dimensional embeddings (in practice, we'd use 100-300 dimensions, but 3 makes the math visible).

Step 1: Represent the Center Word (One-Hot Encoding)

Suppose our center word is "fox" (index 1). We represent it as a one-hot vector i.e., all zeros except a 1 at position 1:

\mathbf{x} = [0, 1, 0, 0, 0]

This is just a way to say "I'm talking about word #1" in a format the neural network can process.

Step 2: Look Up the Embedding

We have an embedding matrix \mathbf{W}_{\text{in}} that stores one embedding vector for each word. With 5 words and 3-dimensional embeddings, it's a 5 \times 3 matrix:

\mathbf{W}_{\text{in}} = \begin{bmatrix} 0.1 & 0.2 & 0.3 \\ 0.5 & 0.8 & 0.2 \\ 0.3 & 0.1 & 0.9 \\ 0.4 & 0.7 & 0.1 \\ 0.2 & 0.6 & 0.5 \end{bmatrix}

Where each row corresponds to: row 0 = "the", row 1 = "fox", row 2 = "quick", row 3 = "brown", row 4 = "jumps"

When we multiply its transpose (a 3 \times 5 matrix) by the one-hot vector:

\mathbf{h} = \mathbf{W}_{\text{in}}^\top \mathbf{x} = \begin{bmatrix} 0.1 & 0.2 & 0.3 \\ 0.5 & 0.8 & 0.2 \\ 0.3 & 0.1 & 0.9 \\ 0.4 & 0.7 & 0.1 \\ 0.2 & 0.6 & 0.5 \end{bmatrix}^\top \cdot \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 0.5 \\ 0.8 \\ 0.2 \end{bmatrix}

What just happened? The multiplication by the one-hot vector [0, 1, 0, 0, 0] effectively picks out row 1 of \mathbf{W}_{\text{in}} (the second row, since we count from 0). So \mathbf{h} = [0.5, 0.8, 0.2] is simply the embedding for "fox".

Key insight: One-hot multiplication is just a fancy way of doing a table lookup! We're retrieving the embedding row for "fox".
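A quick NumPy check (using the toy values above) confirms that multiplying the transposed embedding matrix by a one-hot vector is just a row lookup:

python
import numpy as np

# Toy input embedding matrix from the example (5 words x 3 dimensions).
W_in = np.array([
    [0.1, 0.2, 0.3],   # row 0: "the"
    [0.5, 0.8, 0.2],   # row 1: "fox"
    [0.3, 0.1, 0.9],   # row 2: "quick"
    [0.4, 0.7, 0.1],   # row 3: "brown"
    [0.2, 0.6, 0.5],   # row 4: "jumps"
])

x = np.array([0, 1, 0, 0, 0])            # one-hot vector for "fox" (index 1)

h_matmul = W_in.T @ x                    # the matrix-multiplication route
h_lookup = W_in[1]                       # the table-lookup route

print(h_matmul)                          # [0.5 0.8 0.2]
print(np.allclose(h_matmul, h_lookup))   # True: one-hot multiplication == row lookup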

Step 3: Compute Probabilities for Context Words

Now we need to predict which words appear in the context. We have another matrix \mathbf{W}_{\text{out}} (also 5 \times 3) for output:

Note: If you are wondering where these values come from, read the section after this.

\mathbf{W}_{\text{out}} = \begin{bmatrix} 0.2 & 0.5 & 0.1 \\ 0.4 & 0.3 & 0.7 \\ 0.6 & 0.8 & 0.3 \\ 0.5 & 0.7 & 0.4 \\ 0.3 & 0.4 & 0.6 \end{bmatrix}

Where each row corresponds to: row 0 = "the", row 1 = "fox", row 2 = "quick", row 3 = "brown", row 4 = "jumps"

For each word, we compute its score by taking the dot product of its row in \mathbf{W}_{\text{out}} with \mathbf{h}:

Score for "the": [0.2,0.5,0.1][0.5,0.8,0.2]=0.2×0.5+0.5×0.8+0.1×0.2=0.1+0.4+0.02=0.52[0.2, 0.5, 0.1] \cdot [0.5, 0.8, 0.2] = 0.2 \times 0.5 + 0.5 \times 0.8 + 0.1 \times 0.2 = 0.1 + 0.4 + 0.02 = 0.52

Score for "fox": [0.4,0.3,0.7][0.5,0.8,0.2]=0.4×0.5+0.3×0.8+0.7×0.2=0.2+0.24+0.14=0.58[0.4, 0.3, 0.7] \cdot [0.5, 0.8, 0.2] = 0.4 \times 0.5 + 0.3 \times 0.8 + 0.7 \times 0.2 = 0.2 + 0.24 + 0.14 = 0.58

Score for "quick": [0.6,0.8,0.3][0.5,0.8,0.2]=0.6×0.5+0.8×0.8+0.3×0.2=0.3+0.64+0.06=1.0[0.6, 0.8, 0.3] \cdot [0.5, 0.8, 0.2] = 0.6 \times 0.5 + 0.8 \times 0.8 + 0.3 \times 0.2 = 0.3 + 0.64 + 0.06 = 1.0

Score for "brown": [0.5,0.7,0.4][0.5,0.8,0.2]=0.5×0.5+0.7×0.8+0.4×0.2=0.25+0.56+0.08=0.89[0.5, 0.7, 0.4] \cdot [0.5, 0.8, 0.2] = 0.5 \times 0.5 + 0.7 \times 0.8 + 0.4 \times 0.2 = 0.25 + 0.56 + 0.08 = 0.89

Score for "jumps": [0.3,0.4,0.6][0.5,0.8,0.2]=0.3×0.5+0.4×0.8+0.6×0.2=0.15+0.32+0.12=0.59[0.3, 0.4, 0.6] \cdot [0.5, 0.8, 0.2] = 0.3 \times 0.5 + 0.4 \times 0.8 + 0.6 \times 0.2 = 0.15 + 0.32 + 0.12 = 0.59

These scores tell us how compatible each word is with "fox". Higher scores mean the word is more likely to appear in fox's context.

Step 4: Convert Scores to Probabilities (Softmax)

We apply the softmax function to convert raw scores into probabilities that sum to 1:

P(w_i | \text{fox}) = \frac{\exp(\text{score}_i)}{\sum_{j=0}^{4} \exp(\text{score}_j)}

First, compute exponentials:

  • \exp(0.52) \approx 1.68
  • \exp(0.58) \approx 1.79
  • \exp(1.0) \approx 2.72
  • \exp(0.89) \approx 2.44
  • \exp(0.59) \approx 1.80

Sum: 1.68 + 1.79 + 2.72 + 2.44 + 1.80 = 10.43

Probabilities:

  • P("the""fox")=1.68/10.43=0.16P(\text{"the"} | \text{"fox"}) = 1.68 / 10.43 = 0.16 (16%)
  • P("fox""fox")=1.79/10.43=0.17P(\text{"fox"} | \text{"fox"}) = 1.79 / 10.43 = 0.17 (17%)
  • P("quick""fox")=2.72/10.43=0.26P(\text{"quick"} | \text{"fox"}) = 2.72 / 10.43 = 0.26 (26%) ← highest!
  • P("brown""fox")=2.44/10.43=0.23P(\text{"brown"} | \text{"fox"}) = 2.44 / 10.43 = 0.23 (23%)
  • P("jumps""fox")=1.80/10.43=0.17P(\text{"jumps"} | \text{"fox"}) = 1.80 / 10.43 = 0.17 (17%)

Interpretation: Given the word "fox", the model predicts:

  • 26% chance of seeing "quick" nearby (highest probability)
  • 23% chance of seeing "brown"
  • Lower chances for other words
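The entire forward pass fits in a few lines of NumPy. This sketch simply reproduces the hand calculation above with the same toy matrices:

python
import numpy as np

vocab = ["the", "fox", "quick", "brown", "jumps"]

W_in = np.array([[0.1, 0.2, 0.3],
                 [0.5, 0.8, 0.2],
                 [0.3, 0.1, 0.9],
                 [0.4, 0.7, 0.1],
                 [0.2, 0.6, 0.5]])

W_out = np.array([[0.2, 0.5, 0.1],
                  [0.4, 0.3, 0.7],
                  [0.6, 0.8, 0.3],
                  [0.5, 0.7, 0.4],
                  [0.3, 0.4, 0.6]])

h = W_in[vocab.index("fox")]                   # Step 2: embedding lookup -> [0.5, 0.8, 0.2]
scores = W_out @ h                             # Step 3: one dot product per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()  # Step 4: softmax

for word, score, p in zip(vocab, scores, probs):
    print(f"{word:6}  score = {score:.2f}   P(word | fox) = {p:.2f}")
# "quick" gets the highest probability (~0.26), matching the table above.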

Where Do These Embedding Values Come From?

You might be wondering: where did we get the numbers in \mathbf{W}_{\text{in}} (0.1, 0.2, 0.3, etc.)? Are they arbitrary?

The answer has three parts:

1. Initial Values: Random Initialization

When training starts, we don't know what good embeddings look like yet. So we initialize \mathbf{W}_{\text{in}} and \mathbf{W}_{\text{out}} with small random numbers (typically drawn from a uniform or normal distribution, like values between -0.5 and +0.5).

At this point, the embeddings are essentially meaningless: "fox" might have embedding [0.23, -0.41, 0.08] and "asteroid" might have [0.19, -0.38, 0.12], making them appear similar despite being completely unrelated!

2. Learning Through Training: Gradient Descent

As we process millions of text examples, the model makes predictions (like we just calculated) and compares them to the actual context words in the text:

  • If "fox" actually appears with "quick" in the training data, but the model only predicted 26% probability, the loss function says: "You were too uncertain! Increase this probability."
  • The model then adjusts the embeddings in \mathbf{W}_{\text{in}} and \mathbf{W}_{\text{out}} using backpropagation to make better predictions next time.

This adjustment happens through gradient descent: we compute how much each number in the matrices contributed to the error, then nudge those numbers in the direction that reduces the error.

3. Emergence of Semantic Meaning

After training on billions of word co-occurrences:

  • Words that appear in similar contexts (like "fox" and "wolf") naturally get pushed toward similar embedding vectors because they need to predict similar context words
  • Words that appear in different contexts (like "fox" and "asteroid") get pushed apart

The semantic structure emerges as a side effect of optimizing the prediction task! We never told the model that "fox" and "wolf" are similar animals — it discovered this by noticing they appear with similar words like {the, through, forest}.

The numbers in our tiny example above are made up for illustration. In real training, you'd start with random values and let gradient descent find the optimal embeddings through millions of updates.

How Training Works

If the actual context word is "quick", the model gets rewarded (low loss). If the actual context word is "the", the model gets penalized (high loss because it only predicted 16% probability).

Through many training examples, the model adjusts the numbers in \mathbf{W}_{\text{in}} and \mathbf{W}_{\text{out}} to make correct predictions more likely.

The General Formula

For real Skip-Gram with vocabulary size V and embedding dimension d:

Step 1 - Look up the embedding:

\mathbf{h} = \mathbf{W}_{\text{in}}^\top \mathbf{x}

Where:

  • \mathbf{x} is the one-hot input vector for our center word w_c
  • \mathbf{W}_{\text{in}} is a V \times d weight matrix (the embedding table)
  • \mathbf{h} is the d-dimensional embedding we just looked up

Step 2 - Compute probability for each context word w_o:

P(w_o | w_c) = \frac{\exp(\mathbf{v}_{w_o} \cdot \mathbf{h})}{\sum_{w=1}^{V} \exp(\mathbf{v}_w \cdot \mathbf{h})}

Where:

  • \mathbf{v}_{w_o} is row w_o from \mathbf{W}_{\text{out}} (the output embedding)
  • The numerator is the score for word w_o appearing in context
  • The denominator sums scores across all vocabulary words to normalize

The magic: After training on millions of examples, \mathbf{W}_{\text{in}} contains meaningful embeddings where similar words (like "fox" and "wolf") have similar vectors because they predict similar contexts!

Common Hyperparameters:

  • Window size: 5-10 (original papers), 2-5 (modern practice for efficiency)
  • Embedding dimensions: 300 (standard), 100-768 (task-dependent; smaller for speed, larger for complex semantics)
  • Negative samples: 5-20 (depends on corpus size; larger corpora use more negatives)
  • Learning rate: 0.025 (with linear decay)
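For reference, these hyperparameters map directly onto Gensim's Word2Vec implementation. The sketch below assumes Gensim 4.x (parameter names changed across versions) and a toy corpus, so treat it as illustrative rather than a recipe:

python
from gensim.models import Word2Vec

# Toy corpus; in practice this would be an iterable over millions of tokenized sentences.
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "gray", "timber", "wolf", "runs", "through", "the", "forest"],
]

model = Word2Vec(
    sentences,
    vector_size=300,   # embedding dimensions
    window=5,          # context window size (radius)
    sg=1,              # 1 = Skip-Gram, 0 = CBOW
    negative=5,        # number of negative samples per positive pair
    alpha=0.025,       # initial learning rate (decays linearly)
    min_count=1,       # keep every word in this tiny corpus
    epochs=5,
)

print(model.wv["fox"][:5])            # first few dimensions of the learned vector
print(model.wv.most_similar("fox"))   # nearest neighbours by cosine similarity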

Why Prediction Creates Semantic Embeddings

Here's the crucial question: Why does training a model to predict contexts force it to learn meaningful embeddings?

Let's think through what happens during training:

Day 1 of Training (Random Embeddings):

The model starts with random embeddings. When it sees "fox" in different sentences:

  • Sentence 1: "The quick brown fox jumps..." → context: {quick, brown, jumps}
  • Sentence 2: "The sly fox hunted..." → context: {sly, hunted}
  • Sentence 3: "An arctic fox lives..." → context: {arctic, lives}

With random embeddings, the model can't predict these contexts consistently. It makes wild guesses. High error.

What the model learns:

To reduce error, the model adjusts the embedding for "fox" to better predict ALL the contexts where "fox" appears. It learns an embedding that:

  • Has high probability for animal-related words (quick, sly, hunted)
  • Has high probability for motion verbs (jumps, lives, runs)
  • Has low probability for unrelated words (calculus, asteroid, philosophy)

The Key Realization:

When the model also sees "wolf":

  • Sentence: "The gray wolf runs through the forest"

It faces the same optimization problem! "Wolf" also appears with animal adjectives, motion verbs, and nature settings.

The Efficient Solution:

Instead of learning completely different embeddings for "fox" and "wolf", the model discovers it can reuse patterns. If it learns that dimension 1 represents "animal-ness" and dimension 2 represents "motion capability", both "fox" and "wolf" can have high values in these dimensions.

This sharing of learned features across similar words is what creates semantic clustering. Words that solve similar prediction tasks must have similar embeddings, not because we told the model they are related, but because it is the most efficient way to minimize prediction error across the entire corpus.

Push-Pull Geometry

During training, each (center, context) pair creates geometric forces:

  • Pull: If "fox" appears with "quick", gradient descent increases their dot product \mathbf{v}_{\text{fox}} \cdot \mathbf{v}_{\text{quick}}, pulling their embeddings closer (smaller angle).
  • Push: Simultaneously, the softmax denominator pushes down probabilities for all other words. Words that never co-occur with "fox" (like "asteroid", "calculus") have their dot products decreased, pushing them away.

Over millions of training examples, these push-pull forces converge to a stable geometry:

  • Positive pressure: Words that frequently co-occur are pulled together
  • Negative pressure: Words that never co-occur are pushed apart
  • Balance: The embedding space finds an equilibrium where dot products reflect co-occurrence statistics

This is why the final geometry encodes semantic relationships. The dot product \mathbf{v}_i \cdot \mathbf{v}_j approximates the Pointwise Mutual Information (PMI) between words i and j:

\mathbf{v}_i \cdot \mathbf{v}_j \approx \text{PMI}(i, j) = \log \frac{P(i, j)}{P(i) \cdot P(j)}

As we discussed in Part 1, PMI measures how much more likely two words co-occur than chance. Skip-Gram implicitly factorizes this statistical relationship into vector geometry.
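To make the PMI connection tangible, here is a small sketch that computes PMI from a toy co-occurrence table. The counts are invented purely for illustration; a real corpus would supply them from windowed co-occurrence statistics:

python
import math
from collections import Counter

# Invented co-occurrence counts within a context window (illustration only).
pair_counts = Counter({
    ("fox", "quick"): 40,
    ("fox", "forest"): 30,
    ("wolf", "forest"): 35,
    ("wolf", "gray"): 25,
    ("fox", "asteroid"): 1,
    ("asteroid", "space"): 50,
})

word_counts = Counter()
for (w1, w2), c in pair_counts.items():
    word_counts[w1] += c
    word_counts[w2] += c

total_pairs = sum(pair_counts.values())
total_words = sum(word_counts.values())

def pmi(w1, w2):
    """log [ P(w1, w2) / (P(w1) * P(w2)) ] estimated from the toy counts."""
    p_joint = pair_counts[(w1, w2)] / total_pairs
    p1 = word_counts[w1] / total_words
    p2 = word_counts[w2] / total_words
    return math.log(p_joint / (p1 * p2))

print(pmi("fox", "quick"))     # positive: they co-occur far more often than chance
print(pmi("fox", "asteroid"))  # negative: they co-occur less often than chance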

Figure: Skip-Gram creates push-pull forces: the center word "fox" is pulled toward its context words (quick, brown, jumps) and pushed away from non-context words (asteroid, calculus). After millions of updates, semantically related words cluster together.

Pseudocode

Here's the core Skip-Gram training loop (simplified, without negative sampling):

python
# Initialize embeddings randomly
W_in = random_matrix(vocab_size, embedding_dim)   # Input (center word) embeddings
W_out = random_matrix(vocab_size, embedding_dim)  # Output (context word) embeddings

for sentence in corpus:
    for center_position, center_word in enumerate(sentence):
        context_words = get_context(sentence, center_position, window_size)

        # Get center word embedding
        h = W_in[center_word]  # Shape: (embedding_dim,)

        for context_word in context_words:
            # Compute softmax probabilities over the whole vocabulary
            scores = dot(W_out, h)  # Shape: (vocab_size,)
            probs = softmax(scores)

            # Loss: negative log-likelihood of the observed context word
            loss = -log(probs[context_word])

            # Backprop: update W_in and W_out to increase P(context_word | center_word)
            gradients = compute_gradients(loss, W_in, W_out)
            W_in -= learning_rate * gradients.W_in
            W_out -= learning_rate * gradients.W_out

# Final embeddings: W_in (or the average of W_in and W_out)

The critical issue with the Skip-Gram approach is that computing the softmax denominator \sum_{w=1}^{V} \exp(\mathbf{v}_w \cdot \mathbf{h}) requires iterating over the entire vocabulary (50,000+ words) for every training example.

With billions of training pairs, this is computationally prohibitive. The solution to this problem brings us to the concept of negative sampling.

Negative Sampling

The softmax bottleneck is a fundamental problem in large-vocabulary language models. For every training example, we must:

  1. Compute scores for all V vocabulary words
  2. Exponentiate all scores
  3. Sum them (partition function)
  4. Divide to normalize

For V = 50,000 and embedding dimension d = 300, each softmax requires 15M multiplications. With billions of training examples, this is infeasible.

Why Softmax is Expensive

The computational cost comes from the normalization term:

P(w_o | w_c) = \frac{\exp(\mathbf{v}_{w_o} \cdot \mathbf{h})}{\sum_{w=1}^{V} \exp(\mathbf{v}_w \cdot \mathbf{h})}

The numerator \exp(\mathbf{v}_{w_o} \cdot \mathbf{h}) is cheap: one dot product, one exponentiation. The denominator requires computing \mathbf{v}_w \cdot \mathbf{h} for every word w in the vocabulary.

Why is the denominator necessary? Because it ensures the probabilities sum to 1: \sum_{w=1}^{V} P(w | w_c) = 1. Without normalization, we would have arbitrary scores, not probabilities.

But here's the key insight: we don't actually need proper probabilities for learning embeddings. We only need a signal that says: "This (center, context) pair should have higher score than random pairs."

Contrastive Learning Intuition

Instead of computing probabilities over the entire vocabulary, we can reframe the problem as binary classification: distinguish true (center, context) pairs from fake pairs.

For each true pair (w_c, w_o) from the corpus (e.g., "fox" and "quick"), we sample k negative examples: random words that don't appear in the context (e.g., "asteroid", "calculus", "piano").

The model learns to:

  • Assign high probability to the true pair: P(\text{true} | \text{fox}, \text{quick}) should be high
  • Assign low probability to fake pairs: P(\text{true} | \text{fox}, \text{asteroid}) should be low

This is contrastive learning: we learn by contrasting positive examples (real context) against negative examples (random noise).

The Objective Function: Breaking Down the Math

The negative sampling objective is:

\log \sigma(\mathbf{v}_{w_o} \cdot \mathbf{h}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-\mathbf{v}_{w_i} \cdot \mathbf{h}) \right]

This looks complex, but let's build intuition piece by piece.

Step 1: Binary Classification with Sigmoid

First, recall that we are doing binary classification: for each word pair, we ask, "Did these two words actually appear together in the text?"

Instead of softmax (which computes probabilities over all V words), we use the sigmoid function for each pair independently:

\sigma(x) = \frac{1}{1 + e^{-x}}

The sigmoid converts a score x (the dot product) into a probability between 0 and 1:

  • If x = 5 (large positive), \sigma(5) \approx 0.993 ≈ "very confident YES"
  • If x = 0 (neutral), \sigma(0) = 0.5 ≈ "uncertain"
  • If x = -5 (large negative), \sigma(-5) \approx 0.007 ≈ "very confident NO"

Key difference from softmax: Sigmoid doesn't need to sum over all vocabulary words. Each pair gets its own independent probability.

Step 2: The Positive Term (True Context)

For a true pair like ("fox", "quick") that appeared in the corpus:

\log \sigma(\mathbf{v}_{\text{quick}} \cdot \mathbf{h}_{\text{fox}})

What does this do?

  • \mathbf{v}_{\text{quick}} \cdot \mathbf{h}_{\text{fox}} is the dot product (higher = more similar)
  • \sigma(\ldots) converts it to a probability: "How likely is it that these words appeared together?"
  • \log(\ldots) converts to log-probability for numerical stability

Goal: Maximize this term → increase the dot product → pull "fox" and "quick" embeddings closer together

Step 3: The Negative Term (Random Words)

For a negative sample like ("fox", "asteroid") that did NOT appear together:

\log \sigma(-\mathbf{v}_{\text{asteroid}} \cdot \mathbf{h}_{\text{fox}})

Notice the negative sign before the dot product! Why?

  • \mathbf{v}_{\text{asteroid}} \cdot \mathbf{h}_{\text{fox}} might be positive (embeddings accidentally similar)
  • The negative sign flips it: -\mathbf{v}_{\text{asteroid}} \cdot \mathbf{h}_{\text{fox}}
  • \sigma(-\text{dot product}) = the probability that these words did NOT appear together

Goal: Maximize \log \sigma(-\text{dot product}) → make the dot product more negative → push "fox" and "asteroid" apart

Step 4: Putting It Together

The full objective for one training example with k negative samples:

\underbrace{\log \sigma(\mathbf{v}_{w_o} \cdot \mathbf{h})}_{\text{reward if positive pair has high dot product}} + \underbrace{\sum_{i=1}^{k} \log \sigma(-\mathbf{v}_{w_i} \cdot \mathbf{h})}_{\text{reward if negative pairs have low dot product}}

Where:

  • w_o is the true context word (1 positive example)
  • w_i are the k randomly sampled negative words
  • P_n(w) is the sampling distribution (smoothed unigram, explained below)

A Concrete Example

Suppose we are training on the pair ("fox", "quick") with k = 2 negative samples: "asteroid" and "piano".

Current dot products (before training):

  • \mathbf{v}_{\text{quick}} \cdot \mathbf{h}_{\text{fox}} = 1.5 → \sigma(1.5) = 0.82 → \log(0.82) = -0.20
  • \mathbf{v}_{\text{asteroid}} \cdot \mathbf{h}_{\text{fox}} = 0.5 → \sigma(-0.5) = 0.38 → \log(0.38) = -0.97
  • \mathbf{v}_{\text{piano}} \cdot \mathbf{h}_{\text{fox}} = 0.3 → \sigma(-0.3) = 0.43 → \log(0.43) = -0.84

Objective value: -0.20 + (-0.97) + (-0.84) = -2.01

What gradient descent does:

  1. Increase \mathbf{v}_{\text{quick}} \cdot \mathbf{h}_{\text{fox}} (currently 1.5) → maybe adjust to 1.8
  2. Decrease \mathbf{v}_{\text{asteroid}} \cdot \mathbf{h}_{\text{fox}} (currently 0.5) → maybe adjust to 0.2
  3. Decrease \mathbf{v}_{\text{piano}} \cdot \mathbf{h}_{\text{fox}} (currently 0.3) → maybe adjust to 0.0

After one gradient update:

  • \mathbf{v}_{\text{quick}} \cdot \mathbf{h}_{\text{fox}} = 1.8 → \sigma(1.8) = 0.86 → \log(0.86) = -0.15
  • \mathbf{v}_{\text{asteroid}} \cdot \mathbf{h}_{\text{fox}} = 0.2 → \sigma(-0.2) = 0.45 → \log(0.45) = -0.80
  • \mathbf{v}_{\text{piano}} \cdot \mathbf{h}_{\text{fox}} = 0.0 → \sigma(0.0) = 0.50 → \log(0.50) = -0.69

New objective: -0.15 + (-0.80) + (-0.69) = -1.64 (improved from -2.01!)

The model successfully:

  • Pulled "fox" and "quick" closer (dot product increased)
  • Pushed "fox" away from "asteroid" and "piano" (dot products decreased)

Why Logarithms?

Two reasons:

  1. Numerical stability: Probabilities can be very small (e.g., 0.0001). Logs convert them to manageable numbers.
  2. Gradient behavior: Logarithm amplifies gradients when predictions are wrong (high learning signal) and reduces them when predictions are correct (low learning signal). This makes training more efficient.

Why This Works Better Than Softmax

Softmax for vocabulary size V = 50,000:

  • Compute 50,000 dot products
  • Exponentiate and sum all of them
  • Total: ~50,000 dot products per training example

Negative sampling with k = 5:

  • Compute 1 + 5 = 6 dot products (1 positive + 5 negatives)
  • No expensive sum over the vocabulary
  • Total: ~6 dot products per training example

Speedup: \frac{50,000}{6} \approx 8,300 \times faster!

Insight: We don't need to know the exact probability of every word in the vocabulary. We just need to know that the true context word scores higher than random words. This relative ranking is enough to learn good embeddings.
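To show what a single negative-sampling update actually does to the vectors, here is a hedged sketch of one gradient step on the objective above, with made-up random vectors standing in for "fox", "quick", and "asteroid". It is not Word2Vec's reference implementation, just the update rule written out:

python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def objective(h, u_pos, u_negs):
    """log sigma(u_pos . h) + sum_i log sigma(-u_neg_i . h)"""
    return np.log(sigmoid(u_pos @ h)) + sum(np.log(sigmoid(-u @ h)) for u in u_negs)

def sgns_update(h, u_pos, u_negs, lr=0.1):
    """One gradient-ascent step of Skip-Gram with negative sampling for a single pair."""
    g_pos = 1.0 - sigmoid(u_pos @ h)           # how far the positive pair is from "yes"
    g_negs = [sigmoid(u @ h) for u in u_negs]  # how much each negative wrongly looks like "yes"

    new_h = h + lr * (g_pos * u_pos - sum(g * u for g, u in zip(g_negs, u_negs)))
    new_u_pos = u_pos + lr * g_pos * h                             # pulled toward h
    new_u_negs = [u - lr * g * h for g, u in zip(g_negs, u_negs)]  # pushed away from h
    return new_h, new_u_pos, new_u_negs

rng = np.random.default_rng(0)
h = rng.normal(size=3)        # center word, e.g. "fox"
u_quick = rng.normal(size=3)  # true context word
u_ast = rng.normal(size=3)    # sampled negative word

print(objective(h, u_quick, [u_ast]))
h, u_quick, (u_ast,) = sgns_update(h, u_quick, [u_ast])
print(objective(h, u_quick, [u_ast]))  # the objective should be higher after the step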

True vs Fake Pairs in Embedding Space

Geometrically, negative sampling creates clear separation:

Positive pairs (true context):

  • (w_c, w_o) appears in the corpus → increase \mathbf{v}_{w_c} \cdot \mathbf{v}_{w_o} → pull the vectors together → small angle

Negative pairs (random words):

  • (w_c, w_{\text{neg}}) doesn't appear in the corpus → decrease \mathbf{v}_{w_c} \cdot \mathbf{v}_{\text{neg}} → push the vectors apart → large angle

Over many iterations, embeddings converge to a state where:

  • Related words cluster together (high dot product)
  • Unrelated words are far apart (low or negative dot product)
  • The geometry reflects corpus statistics, not random initialization

Critically, we only compute k + 1 dot products per training example (1 positive + k negatives), instead of V dot products for the full softmax. With k = 5 and V = 50,000, that is a speedup of roughly 8,000×.

Choosing Negative Samples

Now, the question arises: how do we sample negative words?

The naive approach is to sample uniformly at random from the vocabulary. But this is suboptimal for the following reasons:

  1. Rare words are over-sampled (every word has equal probability)
  2. Common words like "the", "and" are under-sampled
  3. The model wastes time learning to push away extremely rare words that never co-occur anyway

Mikolov's original Word2Vec uses a smoothed unigram distribution:

P_n(w) = \frac{f(w)^{0.75}}{\sum_{w'} f(w')^{0.75}}

Where f(w) is the frequency of word w in the corpus. The 0.75 exponent smooths the distribution: it reduces the probability of very common words and increases the probability of rare words, balancing the negative sampling.

Why 0.75 specifically? This was found empirically to work better than the extremes:

  • Exponent = 1.0 (unsmoothed): over-samples common words like "the" and "and", wasting computation on obvious negatives
  • Exponent = 0.5: over-samples rare words, creating too-easy negatives that don't provide useful signal
  • Exponent = 0.75: the sweet spot that balances frequent and rare words, creating informative negative examples

This ensures the model learns meaningful contrasts: distinguishing true context from plausible-but-wrong alternatives, rather than from nonsense words.
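A minimal sketch of the smoothed distribution with toy frequencies (real implementations typically precompute a large sampling table for speed, but the idea is the same):

python
import random
from collections import Counter

# Toy corpus frequencies (illustrative counts only).
freqs = Counter({"the": 50000, "fox": 300, "quick": 800, "asteroid": 40, "quokka": 5})

def noise_distribution(freqs, power=0.75):
    """Smoothed unigram distribution P_n(w) = f(w)^0.75 / sum_w' f(w')^0.75."""
    weights = {w: f ** power for w, f in freqs.items()}
    total = sum(weights.values())
    return {w: wt / total for w, wt in weights.items()}

P_n = noise_distribution(freqs)
for w, p in sorted(P_n.items(), key=lambda kv: -kv[1]):
    print(f"{w:10} raw freq: {freqs[w]:6}   P_n: {p:.4f}")
# "the" is down-weighted relative to its raw frequency; rare words are up-weighted.

# Draw k = 5 negative samples for one (center, context) pair.
words, probs = zip(*P_n.items())
print(random.choices(words, weights=probs, k=5))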

Figure: Negative sampling. The positive pair (fox, quick) is pulled together; the negative pairs (fox, asteroid) and (fox, calculus) are pushed apart. After training, the embedding space separates semantically related words from unrelated words.

CBOW: Predicting the Word from Context

Continuous Bag-of-Words (CBOW) inverts the Skip-Gram formulation. Here, instead of predicting context from a word, we predict the word from context.

Inverting the Prediction Task

Given the context words \{w_{c-k}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+k}\}, CBOW predicts the center word w_c.

For our "fox" example:

plaintext
Context: {quick, brown, jumps, over}
Target:  fox

The model must learn: "Given that the surrounding words are {quick, brown, jumps, over}, what word is in the center?"

This is a classification problem: predict 1-of-V words given the context.

Averaging Context Embeddings

The key design choice in CBOW: how do we combine multiple context words into a single representation? The simplest approach is to average their embeddings.

\mathbf{h} = \frac{1}{2k} \sum_{-k \leq j \leq k,\ j \neq 0} \mathbf{v}_{w_{c+j}}

Where:

  • \mathbf{v}_{w_{c+j}} is the embedding of the context word at position c+j
  • 2k is the total number of context words (k on each side)
  • \mathbf{h} is the averaged context representation

For "fox":

\mathbf{h} = \frac{1}{4} \left( \mathbf{v}_{\text{quick}} + \mathbf{v}_{\text{brown}} + \mathbf{v}_{\text{jumps}} + \mathbf{v}_{\text{over}} \right)

This averaged vector \mathbf{h} becomes the input to the output layer, which predicts the center word:

P(w_c | \text{context}) = \frac{\exp(\mathbf{u}_{w_c} \cdot \mathbf{h})}{\sum_{w=1}^{V} \exp(\mathbf{u}_w \cdot \mathbf{h})}

Where \mathbf{u}_{w_c} is the output embedding for the center word w_c.

Like Skip-Gram, CBOW uses negative sampling to avoid computing the full softmax.
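A short sketch of the CBOW forward pass, reusing the illustrative toy matrices from the Skip-Gram example (only the context words that exist in that 5-word vocabulary are used; with untrained values the output distribution is nearly uniform):

python
import numpy as np

vocab = ["the", "fox", "quick", "brown", "jumps"]

# Toy context (input) and center (output) embedding tables, as in the Skip-Gram example.
W_in = np.array([[0.1, 0.2, 0.3],
                 [0.5, 0.8, 0.2],
                 [0.3, 0.1, 0.9],
                 [0.4, 0.7, 0.1],
                 [0.2, 0.6, 0.5]])
W_out = np.array([[0.2, 0.5, 0.1],
                  [0.4, 0.3, 0.7],
                  [0.6, 0.8, 0.3],
                  [0.5, 0.7, 0.4],
                  [0.3, 0.4, 0.6]])

context = ["quick", "brown", "jumps"]                     # "over" is not in the toy vocabulary
h = W_in[[vocab.index(w) for w in context]].mean(axis=0)  # average the context embeddings

scores = W_out @ h
probs = np.exp(scores) / np.exp(scores).sum()             # softmax over candidate center words
print(dict(zip(vocab, probs.round(2))))                   # P(center word | context); training would raise P("fox")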

Why CBOW Converges Faster

CBOW typically trains faster than Skip-Gram for several reasons:

1. Fewer Updates Per Training Example

For a window size k, Skip-Gram generates 2k training examples (one for each context word):

plaintext
Skip-Gram pairs from center "fox" with context {quick, brown, jumps, over}:

(fox → quick)
(fox → brown)
(fox → jumps)
(fox → over)

CBOW generates a single training example:

plaintext
CBOW: ({quick, brown, jumps, over} → fox)

This means Skip-Gram performs 2k gradient updates per window position, while CBOW performs just one. For large corpora, this 4-8× difference significantly accelerates training.

2. Smoothing Effect of Averaging

Averaging context embeddings creates a smoothing effect. Individual words might be noisy (ambiguous, rare, or polysemous), but their average is more stable.

Consider predicting the center word from context {the, ___, barked, loudly}. The word "the" is extremely common and appears in countless contexts, providing minimal information. But combined with "barked" and "loudly", the average embedding strongly suggests an animal noun. CBOW automatically learns to weight informative context words higher (through the learned embeddings) while diluting noisy words in the average.

This smoothing stabilizes gradients, allowing larger learning rates and faster convergence.

3. Better Performance on Frequent Words

CBOW learns better representations for frequent words because it sees them as targets many times. A common word like "dog" appears in millions of different contexts. Each time, CBOW updates its embedding to be predictable from diverse contexts, creating a robust, generalized representation.

Skip-Gram, in contrast, learns better representations for rare words. When a rare word appears, Skip-Gram treats it as the center and predicts its context, getting strong gradients that update the rare word's embedding. CBOW averages rare words into the context, diluting their signal.

This leads to a key trade-off, which we will explore in the next section.

Figure: CBOW averages the context word embeddings (quick, brown, jumps, over) into a single vector h, then predicts the center word "fox". The averaging smooths noise and stabilizes training.

Pseudocode

python
# Initialize embeddings randomly
W_in = random_matrix(vocab_size, embedding_dim)   # Context word embeddings
W_out = random_matrix(vocab_size, embedding_dim)  # Center word embeddings

for sentence in corpus:
    for center_position, center_word in enumerate(sentence):
        context_words = get_context(sentence, center_position, window_size)

        # Average context word embeddings
        h = mean([W_in[w] for w in context_words])  # Shape: (embedding_dim,)

        # Predict center word (with negative sampling)
        positive_score = dot(W_out[center_word], h)
        positive_loss = -log(sigmoid(positive_score))

        # Sample k negative words
        negative_words = sample_negatives(k, noise_distribution)
        negative_loss = 0
        for neg_word in negative_words:
            neg_score = dot(W_out[neg_word], h)
            negative_loss += -log(sigmoid(-neg_score))

        total_loss = positive_loss + negative_loss

        # Backprop: update W_in (context embeddings) and W_out (center embeddings)
        gradients = compute_gradients(total_loss, W_in, W_out)
        W_in -= learning_rate * gradients.W_in
        W_out -= learning_rate * gradients.W_out

# Final embeddings: W_in (or the average of W_in and W_out)

Skip-Gram vs CBOW

Both Skip-Gram and CBOW learn embeddings from local context windows, but their formulations create different inductive biases. Understanding when to use each requires examining their trade-offs.

When Each Works Better

The choice depends on our corpus characteristics and downstream task:

| Criterion | Skip-Gram | CBOW |
| --- | --- | --- |
| Training speed | Slower (2k updates per position) | Faster (1 update per position) |
| Rare word quality | Better (rare words as center get strong gradients) | Worse (rare words in context are averaged out) |
| Frequent word quality | Worse (common words diluted across many contexts) | Better (frequent targets updated many times) |
| Small corpus | Preferred (better data efficiency) | Acceptable |
| Large corpus | Slower but higher quality | Much faster, still good quality |
| Syntactic tasks | Moderate (focuses on individual word contexts) | Better (averaging captures syntactic patterns) |
| Semantic tasks | Better (sharp distinctions between rare words) | Moderate (smoothed representations) |
| Example: Medical NER | ✓ Better (rare disease names need sharp embeddings) | Acceptable (may conflate similar rare terms) |
| Example: Sentiment Analysis | Acceptable (common sentiment words well-represented) | ✓ Better (frequent words like "good", "bad" dominate) |
| Example: Search/Retrieval | ✓ Better (precise matching of rare entity names) | Good (robust for common query terms) |

Sharp vs Smooth Embeddings

The averaging in CBOW creates smooth embeddings: representations that blend multiple contextual signals. This is beneficial for frequent words, which appear in diverse contexts and benefit from aggregation.

Skip-Gram creates sharp embeddings: representations that preserve fine-grained distinctions. This is beneficial for rare words, which appear in limited contexts and need strong signal from each occurrence.

Empirically, Skip-Gram embeddings tend to have higher variance: large differences between similar-but-not-identical words. CBOW embeddings have lower variance: similar words are tightly clustered.

For tasks requiring fine-grained semantic distinctions (e.g., identifying subtle differences between synonyms), Skip-Gram often performs better.

For tasks requiring robust generalizations (e.g., part-of-speech tagging, where all verbs should cluster), CBOW often performs better.

Rare vs Frequent Words

This is the most important trade-off. Let's make it concrete with examples.

Skip-Gram Favors Rare Words

Consider the rare word "quokka" (a small marsupial). It might appear only 100 times in a large corpus, always in contexts like:

  • "The quokka is a small marsupial native to Australia."
  • "Quokkas are related to kangaroos and wallabies."

In Skip-Gram, when "quokka" is the center word, we predict {small, marsupial, Australia, kangaroos}. The gradient flows directly into the "quokka" embedding, giving it a strong update even from limited data. After 100 occurrences, "quokka" has a well-trained embedding that clusters near {kangaroo, wallaby, marsupial}.

In CBOW, "quokka" might appear in the context, but it is averaged with other words:

  • Context: {the, is, a, small, quokka, native, to, Australia}
  • Average: (the + is + a + small + quokka + native + to + Australia) / 8

The "quokka" signal is diluted by common words like "the", "is", "a". The gradient to "quokka" is weaker. After 100 occurrences, its embedding is less distinctive.

CBOW Favors Frequent Words

Consider the common word "dog", appearing 1 million times in diverse contexts:

  • "The dog barked loudly."
  • "She adopted a friendly dog."
  • "Dogs are loyal companions."

In CBOW, every time "dog" is the center word, its embedding gets updated to be predictable from that specific context. After 1 million updates from diverse contexts, "dog" has a robust, generalized embedding that captures all its typical usages.

In Skip-Gram, "dog" appears as the center word, predicting context words like {barked, adopted, loyal, companions}. But these contexts are diverse. The embedding must simultaneously predict "barked" (action), "adopted" (acquisition), "loyal" (attribute). The gradients pull in many directions, potentially creating a less focused representation.

For very frequent words, CBOW's smoothing stabilizes the learning, while Skip-Gram's sharp updates can create noisy embeddings.

Practical Recommendations

Based on these trade-offs, here are practical guidelines:

Use Skip-Gram when:

  • You have a small corpus (< 100 million words)
  • Rare words are important for your task (e.g., medical NER extracting rare disease names, legal document retrieval matching obscure case citations, scientific literature search with specialized terminology)
  • You need fine-grained semantic distinctions
  • You can afford longer training time
  • Your downstream task benefits from sharp, distinctive embeddings

Use CBOW when:

  • You have a large corpus (> 100 million words)
  • Training speed is critical
  • Frequent words are more important than rare words (e.g., sentiment analysis where common words like "excellent", "terrible" carry most signal, spam classification detecting common spam patterns, POS tagging where syntactic categories matter more than rare vocabulary)
  • You need robust, stable embeddings for syntactic tasks
  • Your downstream task benefits from smooth, generalized embeddings

Modern practice: Most production systems use Skip-Gram for quality, even though CBOW is faster. The quality gain for rare words outweighs the speed cost. However, for truly massive corpora (billions of words), CBOW's speed advantage becomes significant.

Let's visualize how loss decreases during training for both models (conceptual, not real data):

plaintext
Training Loss over Time

Loss
  |
  |  CBOW ----___
  |              ----___
  |                     ----___
  |  Skip-Gram                 ----___
  |     -----______                   ----___
  |                ------______              ----___
  |                            ------_______----___
  |__________________________________________________ Iterations
  0                                                1M

CBOW:       Faster initial convergence due to fewer updates per example
            Plateaus earlier at slightly higher loss (smoothing effect)

Skip-Gram:  Slower initial convergence (more updates per example)
            Continues improving longer (sharper gradients for rare words)
            Lower final loss (better fit to data)

CBOW reaches "good enough" embeddings quickly. Skip-Gram reaches "better" embeddings slowly. The choice depends on your resource constraints and quality requirements.

Key Takeaways

Having looked at both Word2Vec formulations, let's summarize the key takeaways:

Context Windows Create Training Signal

  • Sliding windows over text extract (center, context) pairs
  • Local co-occurrence becomes supervision for embedding learning
  • Bag-of-words assumption: order within context is initially ignored
  • Millions of training pairs provide rich distributional signal

Skip-Gram (Center → Context)

  • Predict each context word independently given center word
  • Creates 2k training examples per window position
  • Better for rare words (direct gradients to rare center words)
  • Sharper embeddings with fine-grained distinctions
  • Slower training but higher quality

Negative Sampling

  • Replaces expensive softmax with contrastive learning
  • Binary classification: true pairs vs. random pairs
  • k negative samples per positive example (typically k = 5-20)
  • Roughly 8,000× speedup for a 50,000-word vocabulary with k = 5
  • Smoothed unigram distribution for sampling negatives

CBOW (Context → Center)

  • Predict center word from averaged context embeddings
  • Single training example per window position (4-8× faster)
  • Better for frequent words (robust from many updates)
  • Smoother embeddings with stable generalizations
  • Faster training, slightly lower quality for rare words

The Geometry of Learning

  • Push-pull forces: co-occurring words pulled together, non-co-occurring pushed apart
  • Dot product approximates PMI after convergence
  • Semantic similarity emerges from distributional similarity
  • No explicit supervision for semantics—learned as side effect of prediction

Conclusion

We have answered the question: Where do embeddings come from?

They emerge from self-supervised learning on local context windows. By training a shallow neural network to predict context from words (Skip-Gram) or words from context (CBOW), we force the model to discover distributional patterns. The learned embeddings encode co-occurrence statistics in geometric form: similar words cluster together because they solve similar prediction tasks.

Negative sampling makes this computationally tractable, replacing expensive softmax normalization with contrastive learning against random samples.

But Word2Vec has fundamental limitations:

  1. Local context only: Only words within a small window (typically 5-10) contribute to learning. Long-range dependencies and document-level semantics are ignored.
  2. Bag-of-words: Word order is discarded, losing syntactic structure (can't distinguish "dog chased cat" from "cat chased dog").
  3. Static embeddings: Each word gets a single vector, conflating all senses ("bank" as institution and "bank" as river edge get the same representation).

These limitations motivated the next generation of embedding methods:

GloVe (Part 3): Instead of local windows, use global co-occurrence statistics across the entire corpus, combining the benefits of matrix factorization (LSA) with predictive models (Word2Vec).

Contextual Embeddings (Part 4): Instead of static vectors, compute dynamic representations based on sentence context:

  • ELMo (2018): Bidirectional LSTMs generate context-dependent embeddings
  • BERT (2018): Transformer encoders with masked language modeling; attention mechanisms capture long-range dependencies
  • GPT (2018+): Transformer decoders with autoregressive prediction; positional encodings preserve word order

These modern architectures address Word2Vec's limitations (dynamic representations for polysemy, attention for long-range dependencies, positional encodings for word order) while building on its core insight: prediction tasks drive semantic learning.

Word2Vec was a breakthrough, proving that simple prediction tasks could learn rich semantic structure. But it's just the beginning of the embedding story.

See you in the next post. Namaste!


References

Foundational Papers

Word2Vec (Skip-Gram and CBOW) - Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv preprint.

Distributed Representations of Words and Phrases - Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. NeurIPS 2013.

Theoretical Analysis

Word2Vec Explained - Goldberg, Y., & Levy, O. (2014). word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint.

Neural Word Embedding as Implicit Matrix Factorization - Levy, O., & Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization. NeurIPS 2014.

Improving Distributional Similarity - Levy, O., Goldberg, Y., & Dagan, I. (2015). Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the ACL.

Practical Guides

Word2Vec Tutorial - McCormick, C. (2016). Word2Vec Tutorial - The Skip-Gram Model.

Gensim Word2Vec Implementation - Řehůřek, R. Gensim: Word2Vec. Production-quality Python implementation.

Negative Sampling Intuition - Understanding Negative Sampling in Word2Vec. Detailed walkthrough with examples.

Contrastive Learning

Noise Contrastive Estimation - Gutmann, M., & Hyvärinen, A. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. AISTATS 2010.

  • Theoretical foundation for negative sampling

Contrastive Learning Survey - Le-Khac, P. H., Healy, G., & Smeaton, A. F. (2020). Contrastive Representation Learning: A Framework and Review. IEEE Access.

Embeddings Evolution

ELMo (Contextual Embeddings) - Peters, M. E., et al. (2018). Deep contextualized word representations. NAACL 2018.

BERT - Devlin, J., et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint.

Historical Context

Neural Language Models - Bengio, Y., et al. (2003). A Neural Probabilistic Language Model. JMLR 2003.

  • Early neural language model that learned word embeddings as a byproduct

Collobert & Weston (2008) - A Unified Architecture for Natural Language Processing. ICML 2008.

Written by Anirudh Sharma

Published on December 26, 2025
