Understanding Embeddings: Part 2 - Learning Meaning from Context
Table of Contents
This is Part 2 of a 4-part series on Embeddings:
- Part 1: Why Embeddings Exist
- [Part 2: Learning Meaning From Context] ← You are here
- Part 3: Global Statistics and GloVe
- Part 4: The Limits of Static Embeddings
In the previous part of this four-part blog series, we explored the reasons why embeddings have to exist.
To recap: words need numerical representations that capture semantic similarity, and vectors provide the natural mathematical structure for this — hence, embeddings.
We also learned that meaning emerges from distributional patterns: "You shall know a word by the company it keeps" and that geometric relationships (cosine similarity, vector arithmetic) mirror semantic relationships.
But we left a critical question unanswered: Where do embeddings actually come from?
We know what properties embeddings should have (dense, low-dimensional, semantically structured). We also know that similar words should have similar vectors. But how do we learn these vectors from raw text? How does a neural network discover that "dog" and "puppy" should be close together, while "dog" and "asteroid" should be far apart?
This is where Word2Vec enters the picture. It was introduced by Tomas Mikolov and colleagues at Google in 2013.
Word2Vec is a family of related techniques for learning word embeddings from unlabeled text. It revolutionized NLP by showing that simple, shallow neural networks trained on a straightforward prediction task could capture rich semantic relationships.
We don't tell the model that "dog" and "cat" are similar. Instead, we force it to predict context words, and it discovers similarity on its own because similar words appear in similar contexts.
Core Insight: Word2Vec doesn't encode semantics directly. Instead, it creates a prediction task (predicting context from words, or vice versa) whose solution requires learning semantic representations.
The embeddings emerge as a byproduct of solving this task well.
Similar words must have similar embeddings because they appear in similar contexts and solve similar prediction problems.
This post explores the learning dynamics of Word2Vec: how local context windows create training signal, how Skip-Gram and CBOW formulations differ, why negative sampling is necessary, and what geometric pressures emerge during training.
Context Windows as Learning Signal
The distributional hypothesis tells us that meaning comes from context. But to train a model, we need to operationalize "context" into concrete training examples. This is where the context window comes in.
Local Context as Supervision
Consider the sentence: The quick brown fox jumps over the lazy dog.
If we want to learn an embedding for the word "fox", we need training data that captures its typical contexts. The simplest approach is to slide a fixed-size window across the sentence and extract (center word, context words) pairs.
Defining Center and Context
At any position in a sentence, we define:
Center word: The target word we are currently focusing on (the word at the current position in our sliding window).
Context words: The words surrounding the center word within a fixed distance (the window size).
Window size: The maximum distance on each side of the center word. A window size of $m$ means we look at $m$ words to the left and $m$ words to the right, giving us up to $2m$ total context words.
With a window size of 2 (two words on each side), when "fox" is the center word:
Context window for "fox":

    [quick, brown]  fox  [jumps, over]
    <----- 2 ----->      <---- 2 ---->
    (left context)       (right context)

Here:
- Center word: "fox"
- Context words: {quick, brown, jumps, over} (4 words total: 2 on left + 2 on right)
- Window size: 2 (the radius, not the total count)
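To make this concrete, here is a minimal Python sketch (tokenizing naively with `split()`; the variable names are mine, not from any library) that extracts the context window for "fox":

```python
sentence = "The quick brown fox jumps over the lazy dog".split()
window_size = 2

center_position = 3                                # "fox"
center_word = sentence[center_position]
left = sentence[max(0, center_position - window_size):center_position]
right = sentence[center_position + 1:center_position + 1 + window_size]
context_words = left + right

print(center_word, context_words)
# fox ['quick', 'brown', 'jumps', 'over']
```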
Two Prediction Formulations
This (center, context) pair becomes our training signal, but we can formulate the prediction task in two ways:
1. Predict context from center (Skip-Gram): Given the center word, predict the surrounding context words, i.e., model $P(w_{\text{context}} \mid w_{\text{center}})$.
This asks: "If I see the word "fox", what other words are likely to appear nearby?"
NOTE: $P(A \mid B)$ is the conditional probability of event $A$ given that event $B$ has occurred.
2. Predict center from context (CBOW): Given the context words, predict the center word, i.e., model $P(w_{\text{center}} \mid w_{\text{context}})$.
This asks: "If I see the words {quick, brown, jumps, over} nearby, what word is likely in the middle?"
Both formulations capture the same distributional intuition: words that appear in similar contexts should have similar meanings. The difference is which direction we predict, and this creates different learning dynamics (which we will explore in detail later).
This approach creates semantic learning because words that appear in similar contexts will get similar representations.
Second-Order Co-occurrence
This is crucial: words don't need to appear together directly to be recognized as similar. Consider "fox" and "wolf":
- Sentence 1: "The quick brown fox jumps through the forest"
- Sentence 2: "The gray timber wolf runs through the forest"
"Fox" and "wolf" never co-occur directly, yet they should have similar embeddings. Why? Because they share second-order co-occurrence as they appear with similar context words:
- Both appear with: {the, through, forest} (shared contexts)
- "Fox" appears with: {quick, brown, jumps}
- "Wolf" appears with: {gray, timber, runs}
Even the non-shared contexts are semantically similar:
"brown" ≈ "gray" (color adjectives),
"jumps" ≈ "runs" (motion verbs),
and "quick" relates to "runs" (both describe fast movement).
Over millions of sentences, the model learns that both animals appear in similar distributional environments (subjects of action verbs, modified by adjectives, followed by location phrases).
This shared pattern forces their embeddings to be similar, even without direct co-occurrence.
Sliding Window Intuition
The window slides across the entire corpus, generating millions of training examples. At each position, the word at that position becomes the center word, and the surrounding words within the window become the context words.
Sentence: "The quick brown fox jumps over the lazy dog"

Window size = 2:

Position 1: [∅, ∅]         The    [quick, brown]   center: "The"
Position 2: [∅, The]       quick  [brown, fox]     center: "quick"
Position 3: [The, quick]   brown  [fox, jumps]     center: "brown"
Position 4: [quick, brown] fox    [jumps, over]    center: "fox"  ← our example
Position 5: [brown, fox]   jumps  [over, the]      center: "jumps"
...and so on

Each position generates a training example with its (center, context) pair. For Word2Vec, we have two key choices for the prediction task:
1. Skip-Gram: Given the center word, predict each context word independently.
For position 4 (center = "fox"): predict $P(\text{quick} \mid \text{fox})$, $P(\text{brown} \mid \text{fox})$, $P(\text{jumps} \mid \text{fox})$, and $P(\text{over} \mid \text{fox})$.
2. CBOW (Continuous Bag-of-Words): Given the context words, predict the center word.
For position 4 (center = "fox"): predict $P(\text{fox} \mid \text{quick, brown, jumps, over})$.
The key difference is that Skip-Gram makes multiple predictions per window (one for each context word), while CBOW makes a single prediction (the center word from all context words). This asymmetry creates different learning dynamics.
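As a rough sketch (again with made-up variable names), here is how the same window yields different training examples under the two formulations:

```python
center = "fox"
context = ["quick", "brown", "jumps", "over"]

# Skip-Gram: one (input, target) pair per context word -> 4 examples
skipgram_examples = [(center, ctx) for ctx in context]
# [('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumps'), ('fox', 'over')]

# CBOW: all context words jointly predict the center -> 1 example
cbow_example = (context, center)
# (['quick', 'brown', 'jumps', 'over'], 'fox')
```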
We will explore both, but first, let's understand why order initially doesn't matter.
Why Order Initially Doesn't Matter
Notice that in the context window {quick, brown, jumps, over}, we treat all four words symmetrically. We don't distinguish between "quick" (two positions left) versus "over" (two positions right). The context is a bag of words which is an unordered set.
This might seem like a limitation. After all, word order carries meaning:
"The dog chased the cat"≠"The cat chased the dog"
But for learning basic semantic embeddings, positional information is second-order. The primary signal is co-occurrence itself: which words tend to appear near each other. Whether "quick" is left or right of "fox" matters less than the fact that they co-occur at all.
This bag-of-words assumption simplifies the learning problem enormously. Instead of modeling
$$P(w_{\text{context}} \mid w_{\text{center}}, \text{position}),$$
we model the simpler
$$P(w_{\text{context}} \mid w_{\text{center}}).$$
The first requires learning positional encodings; the second requires only learning which words tend to appear together.
Later architectures (ELMo, BERT, GPT) add positional encodings to capture word order. But Word2Vec deliberately ignores order to focus on the core distributional signal. This design choice enables training on massive corpora with minimal computational cost.
Sliding a context window (size = 2) across a sentence. Each position generates a training example: center word paired with surrounding context words. Order within the context is ignored.
Skip-Gram: Predicting Context from a Word
Skip-Gram is the most influential Word2Vec variant. Before diving into its mathematics, let's build intuition for what it's actually trying to accomplish.
The Big Picture: A Prediction Game
Imagine you are learning a new language by reading books. You notice that certain words appear near each other frequently. When you see the word "fox", nearby words are often "quick", "brown", "forest", "jumps". When you see "wolf", nearby words are "gray", "pack", "howls", "forest".
Skip-Gram turns this observation into a learning task: If I show you a word, can you guess which words appear nearby?
This might seem backwards: why predict context from words instead of learning meanings directly? The reason is simple: words that appear in similar contexts must have similar meanings.
By forcing the model to predict contexts, we automatically cluster semantically related words.
Walking Through a Concrete Example
Let's use our familiar sentence: The quick brown fox jumps over the lazy dog
When we encounter "fox" as the center word with window size 2: Context: [quick, brown] fox [jumps, over]
Skip-Gram's task: Given only the word "fox", predict each of the four context words.
| Question | Model Answer |
|---|---|
| What is the probability I will find "quick" nearby? | High probability (foxes are often described as quick) |
| What is the probability I will find "asteroid" nearby? | Very low probability (foxes and asteroids rarely appear together) |
| What is the probability I will find "brown" nearby? | High probability (foxes are often brown) |
| What is the probability I will find "calculus" nearby? | Very low probability (unrelated concepts) |
The model makes four independent predictions, one for each context word. It doesn't try to predict all four simultaneously. Rather, it treats each prediction separately.
Why This Creates Semantic Learning
Here's the magic: Consider what happens when the model also sees "wolf": The gray timber wolf runs through the forest with context: [gray, timber] wolf [runs, through]
The model must now predict {gray, timber, runs, through} given "wolf".
Notice: Both "fox" and "wolf" appear with similar types of words:
- Color adjectives: "brown" vs. "gray"
- Motion verbs: "jumps" vs. "runs"
- Natural settings: Both might appear with "forest"
To make good predictions for both words, the model learns similar representations for "fox" and "wolf".
If it learns that "fox" should predict animal-related contexts, that same knowledge helps predict contexts for "wolf".
This is semantic pressure in action: Words with similar meanings must have similar embeddings because they solve similar prediction tasks.
The Mathematical Formulation
Now that we understand the intuition, let's formalize it mathematically.
For a center word at position $t$ and a window size of $m$, the context words are the words at positions $t-m, \dots, t-1, t+1, \dots, t+m$.
If the word at position $t$ is represented as $w_t$, Skip-Gram tries to maximize the probability of observing the actual context words $w_{t+j}$ (for $-m \le j \le m$, $j \ne 0$) given $w_t$.
The objective is to maximize:
$$\prod_{-m \le j \le m,\; j \ne 0} P(w_{t+j} \mid w_t)$$
In our "fox" example with window size 2:
$$P(\text{quick} \mid \text{fox}) \cdot P(\text{brown} \mid \text{fox}) \cdot P(\text{jumps} \mid \text{fox}) \cdot P(\text{over} \mid \text{fox})$$
The $\prod$ (product) symbol means we multiply all the individual probabilities together. Each context word is predicted independently; this is the bag-of-words assumption we discussed earlier.
How the Model Works: The Neural Architecture
Now let's see how the model actually computes these probabilities. We will use a tiny example with just 5 words to make the mathematics concrete.
A Tiny Example: 5-Word Vocabulary
Imagine we have a vocabulary of only 5 words:
0: "the"
1: "fox"
2: "quick"
3: "brown"
4: "jumps"0: "the"
1: "fox"
2: "quick"
3: "brown"
4: "jumps"And we want to use 3-dimensional embeddings (in practice, we'd use 100-300 dimensions, but 3 makes the math visible).
Step 1: Represent the Center Word (One-Hot Encoding)
Suppose our center word is "fox" (index 1). We represent it as a one-hot vector, i.e., all zeros except a 1 at position 1:
$$x = [0,\ 1,\ 0,\ 0,\ 0]$$
This is just a way to say "I'm talking about word #1" in a format the neural network can process.
Step 2: Look Up the Embedding
We have an embedding matrix $W_{in}$ that stores one embedding vector for each word. With 5 words and 3-dimensional embeddings, it's a $5 \times 3$ matrix filled with small made-up values for illustration, where each row corresponds to one word: row 0 = "the", row 1 = "fox", row 2 = "quick", row 3 = "brown", row 4 = "jumps".
When we multiply $W_{in}^T$ by the one-hot vector:
$$h = W_{in}^T x = \text{row 1 of } W_{in}$$
What just happened? The multiplication by the one-hot vector effectively picks out row 1 (the second row, since we count from 0). So $h$ is simply the embedding for "fox".
Key insight: One-hot multiplication is just a fancy way of doing a table lookup! We're retrieving the embedding row for "fox".
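A quick numpy check makes the lookup equivalence concrete; the matrix values below are arbitrary placeholders, not the ones used in the worked example:

```python
import numpy as np

# 5 words x 3 dimensions; arbitrary placeholder values
W_in = np.array([
    [0.1, 0.4, 0.7],   # row 0: "the"
    [0.2, 0.5, 0.8],   # row 1: "fox"
    [0.3, 0.6, 0.9],   # row 2: "quick"
    [0.4, 0.7, 1.0],   # row 3: "brown"
    [0.5, 0.8, 1.1],   # row 4: "jumps"
])

x = np.array([0, 1, 0, 0, 0])           # one-hot vector for "fox" (index 1)

h_matmul = W_in.T @ x                   # matrix multiplication with the one-hot vector
h_lookup = W_in[1]                      # direct row lookup

print(h_matmul)                         # [0.2 0.5 0.8]
print(np.allclose(h_matmul, h_lookup))  # True -- identical results
```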
Step 3: Compute Probabilities for Context Words
Now we need to predict which words appear in the context. We have another matrix $W_{out}$ (also $5 \times 3$, again with made-up values) for output, where each row corresponds to one word: row 0 = "the", row 1 = "fox", row 2 = "quick", row 3 = "brown", row 4 = "jumps".
Note: If you are wondering where these values come from, read the section after this.
For each word $w$, we compute its score by taking the dot product of its output vector $u_w$ (the corresponding row of $W_{out}$) with $h$:
Score for "the":
Score for "fox":
Score for "quick":
Score for "brown":
Score for "jumps":
These scores tell us how compatible each word is with "fox". Higher scores mean the word is more likely to appear in fox's context.
Step 4: Convert Scores to Probabilities (Softmax)
We apply the softmax function to convert the raw scores into probabilities that sum to 1:
$$P(w \mid \text{fox}) = \frac{e^{\text{score}(w)}}{\sum_{w'} e^{\text{score}(w')}}$$
First, compute the exponential of each score. Then sum the exponentials, and divide each exponential by that sum. With the made-up values in this example, the resulting probabilities are:
- $P(\text{"the"} \mid \text{"fox"}) \approx 0.16$ (16%)
- $P(\text{"fox"} \mid \text{"fox"}) \approx 0.17$ (17%)
- $P(\text{"quick"} \mid \text{"fox"}) \approx 0.26$ (26%) ← highest!
- $P(\text{"brown"} \mid \text{"fox"}) \approx 0.23$ (23%)
- $P(\text{"jumps"} \mid \text{"fox"}) \approx 0.17$ (17%)
Interpretation: Given the word "fox", the model predicts:
- 26% chance of seeing "quick" nearby (highest probability)
- 23% chance of seeing "brown" nearby
- Lower chances for other words
Where Do These Embedding Values Come From?
You might be wondering: where did we get the numbers in $W_{in}$ and $W_{out}$? Are they arbitrary?
The answer has three parts:
1. Initial Values: Random Initialization
When training starts, we don't know what good embeddings look like yet. So we initialize and with small random numbers (typically drawn from a uniform or normal distribution, like values between -0.5 and +0.5).
At this point, the embeddings are essentially meaningless — "fox" and "asteroid" might happen to start with nearly identical random vectors, making them appear similar despite being completely unrelated!
2. Learning Through Training: Gradient Descent
As we process millions of text examples, the model makes predictions (like we just calculated) and compares them to the actual context words in the text:
- If "fox" actually appears with "quick" in the training data, but the model only predicted 26% probability, the loss function says: "You were too uncertain! Increase this probability."
- The model then adjusts the embeddings in $W_{in}$ and $W_{out}$ using backpropagation to make better predictions next time.
This adjustment happens through gradient descent: we compute how much each number in the matrices contributed to the error, then nudge those numbers in the direction that reduces the error.
3. Emergence of Semantic Meaning
After training on billions of word co-occurrences:
- Words that appear in similar contexts (like "fox" and "wolf") naturally get pushed toward similar embedding vectors because they need to predict similar context words
- Words that appear in different contexts (like "fox" and "asteroid") get pushed apart
The semantic structure emerges as a side effect of optimizing the prediction task! We never told the model that "fox" and "wolf" are similar animals — it discovered this by noticing they appear with similar words like {the, through, forest}.
The numbers in our tiny example above are made up for illustration. In real training, you'd start with random values and let gradient descent find the optimal embeddings through millions of updates.
How Training Works
If the actual context word is "quick", the model gets rewarded (low loss). If the actual context word is "the", the model gets penalized (high loss because it only predicted 16% probability).
Through many training examples, the model adjusts the numbers in $W_{in}$ and $W_{out}$ to make correct predictions more likely.
The General Formula
For real Skip-Gram with vocabulary size $V$ and embedding dimension $d$:
Step 1 - Look up the embedding:
$$h = W_{in}^T x$$
Where:
- $x \in \{0, 1\}^V$ is the one-hot input vector for our center word
- $W_{in} \in \mathbb{R}^{V \times d}$ is a weight matrix (the embedding table)
- $h \in \mathbb{R}^d$ is the $d$-dimensional embedding we just looked up
Step 2 - Compute probability for each context word $c$:
$$P(c \mid \text{center}) = \frac{\exp(u_c \cdot h)}{\sum_{j=1}^{V} \exp(u_j \cdot h)}$$
Where:
- $u_c$ is row $c$ from $W_{out}$ (the output embedding)
- The numerator is the score for word $c$ appearing in context
- The denominator sums scores across all vocabulary words to normalize
The magic: After training on millions of examples, $W_{in}$ contains meaningful embeddings where similar words (like "fox" and "wolf") have similar vectors because they predict similar contexts!
Common Hyperparameters:
- Window size: 5-10 (original papers), 2-5 (modern practice for efficiency)
- Embedding dimensions: 300 (standard), 100-768 (task-dependent; smaller for speed, larger for complex semantics)
- Negative samples: 5-20 (depends on corpus size; larger corpora use more negatives)
- Learning rate: 0.025 (with linear decay)
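If you want to experiment with these hyperparameters without implementing anything yourself, the Gensim library (listed in the references) provides a widely used implementation. A minimal sketch, assuming a small tokenized corpus; the parameter values simply mirror the ranges above:

```python
from gensim.models import Word2Vec

# Each sentence is a list of tokens; a real corpus would have millions of sentences
corpus = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "gray", "timber", "wolf", "runs", "through", "the", "forest"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=300,   # embedding dimensions
    window=5,          # context window size (the radius)
    sg=1,              # 1 = Skip-Gram, 0 = CBOW
    negative=5,        # number of negative samples
    min_count=1,       # keep rare words (only because this toy corpus is tiny)
    epochs=5,
)

vector = model.wv["fox"]              # the learned embedding for "fox"
print(model.wv.most_similar("fox"))   # nearest neighbors by cosine similarity
```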
Why Prediction Creates Semantic Embeddings
Here's the crucial question: Why does training a model to predict contexts force it to learn meaningful embeddings?
Let's think through what happens during training:
Day 1 of Training (Random Embeddings):
The model starts with random embeddings. When it sees "fox" in different sentences:
- Sentence 1: "The quick brown fox jumps..." → context: {quick, brown, jumps}
- Sentence 2: "The sly fox hunted..." → context: {sly, hunted}
- Sentence 3: "An arctic fox lives..." → context: {arctic, lives}
With random embeddings, the model can't predict these contexts consistently. It makes wild guesses. High error.
What the model learns:
To reduce error, the model adjusts the embedding for "fox" to better predict ALL the contexts where "fox" appears. It learns an embedding that:
- Has high probability for animal-related words (quick, sly, hunted)
- Has high probability for motion verbs (jumps, lives, runs)
- Has low probability for unrelated words (calculus, asteroid, philosophy)
The Key Realization:
When the model also sees "wolf":
- Sentence: "The gray wolf runs through the forest"
It faces the same optimization problem! "Wolf" also appears with animal adjectives, motion verbs, and nature settings.
The Efficient Solution:
Instead of learning completely different embeddings for "fox" and "wolf", the model discovers it can reuse patterns. If it learns that dimension 1 represents "animal-ness" and dimension 2 represents "motion capability", both "fox" and "wolf" can have high values in these dimensions.
This sharing of learned features across similar words is what creates semantic clustering. Words that solve similar prediction tasks must have similar embeddings, not because we told the model they are related, but because it is the most efficient way to minimize prediction error across the entire corpus.
Push-Pull Geometry
During training, each (center, context) pair creates geometric forces:
- Pull: If "fox" appears with "quick", gradient descent increases their dot product $v_{\text{fox}} \cdot u_{\text{quick}}$, pulling their embeddings closer (smaller angle).
- Push: Simultaneously, the softmax denominator pushes down probabilities for all other words. Words that never co-occur with "fox" (like "asteroid", "calculus") have their dot products decreased, pushing them away.
Over millions of training examples, these push-pull forces converge to a stable geometry:
- Positive pressure: Words that frequently co-occur are pulled together
- Negative pressure: Words that never co-occur are pushed apart
- Balance: The embedding space finds an equilibrium where dot products reflect co-occurrence statistics
This is why the final geometry encodes semantic relationships. The dot product approximates the Pointwise Mutual Information (PMI) between words $w$ and $c$:
$$v_w \cdot u_c \approx \text{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)}$$
As we discussed in Part 1, PMI measures how much more likely two words co-occur than chance. Skip-Gram implicitly factorizes this statistical relationship into vector geometry.
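To make the PMI connection concrete, here is a small sketch that estimates PMI from co-occurrence counts; all counts below are hypothetical:

```python
import numpy as np

# Hypothetical corpus statistics
total_pairs = 1_000_000          # total (center, context) pairs observed
count_fox = 2_000                # times "fox" appears as a center word
count_quick = 5_000              # times "quick" appears as a context word
count_fox_quick = 300            # times the pair ("fox", "quick") is observed

p_fox = count_fox / total_pairs
p_quick = count_quick / total_pairs
p_fox_quick = count_fox_quick / total_pairs

pmi = np.log(p_fox_quick / (p_fox * p_quick))
print(f"PMI(fox, quick) ≈ {pmi:.2f}")   # ≈ 3.4: they co-occur ~30x more than chance
```

After training converges, the dot product between the "fox" and "quick" vectors should land near this value, up to a constant shift that depends on the number of negative samples.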
Skip-Gram creates push-pull forces: center word "fox" pulled toward context words (quick, brown, jumps) while pushed away from non-context words (asteroid, calculus). After millions of updates, semantically related words cluster together.
Pseudocode
Here's the core Skip-Gram training loop (simplified, without negative sampling; the corpus is assumed to be tokenized into word indices, and `get_context`, `window_size`, `learning_rate`, `vocab_size`, and `embedding_dim` are assumed to be defined):

```python
import numpy as np

# Initialize embeddings randomly
W_in = np.random.uniform(-0.5, 0.5, (vocab_size, embedding_dim))   # Input embeddings
W_out = np.random.uniform(-0.5, 0.5, (vocab_size, embedding_dim))  # Output embeddings

for sentence in corpus:                              # sentence: list of word indices
    for center_position, center_word in enumerate(sentence):
        context_words = get_context(sentence, center_position, window_size)

        # Get center word embedding
        h = W_in[center_word]                        # Shape: (embedding_dim,)

        for context_word in context_words:
            # Compute softmax probabilities over the full vocabulary
            scores = W_out @ h                       # Shape: (vocab_size,)
            exp_scores = np.exp(scores - scores.max())
            probs = exp_scores / exp_scores.sum()

            # Loss: negative log-likelihood of the observed context word
            loss = -np.log(probs[context_word])

            # Backprop: gradient of the loss w.r.t. the scores is (probs - one_hot)
            grad_scores = probs.copy()
            grad_scores[context_word] -= 1.0

            grad_W_out = np.outer(grad_scores, h)    # Shape: (vocab_size, embedding_dim)
            grad_h = W_out.T @ grad_scores           # Shape: (embedding_dim,)

            # Update W_in and W_out to increase P(context_word | center_word)
            W_out -= learning_rate * grad_W_out
            W_in[center_word] -= learning_rate * grad_h

# Final embeddings: W_in (or the average of W_in and W_out)
```

The critical issue with the Skip-Gram approach is that computing the softmax denominator requires iterating over the entire vocabulary (50,000+ words) for every training example.
With billions of training pairs, this is computationally prohibitive. The solution to this problem brings us to the concept of negative sampling.
Negative Sampling
The softmax bottleneck is a fundamental problem in large-vocabulary language models. For every training example, we must:
- Compute scores for all vocabulary words
- Exponentiate all scores
- Sum them (partition function)
- Divide to normalize
For a vocabulary of $V$ words and embedding dimension $d$, each softmax requires on the order of $V \times d$ multiplications (roughly 15 million for $V = 50{,}000$ and $d = 300$). With billions of training examples, this is infeasible.
Why Softmax is Expensive
The computational cost comes from the normalization term in the softmax:
$$P(c \mid \text{center}) = \frac{\exp(u_c \cdot h)}{\sum_{j=1}^{V} \exp(u_j \cdot h)}$$
The numerator is cheap: one dot product, one exponentiation. The denominator requires computing $\exp(u_j \cdot h)$ for every word $j$ in the vocabulary.
Why is the denominator necessary? Because it ensures the probabilities sum to 1: $\sum_{c=1}^{V} P(c \mid \text{center}) = 1$. Without normalization, we would have arbitrary scores, not probabilities.
But here's the key insight: we don't actually need proper probabilities for learning embeddings. We only need a signal that says: "This (center, context) pair should have higher score than random pairs."
Contrastive Learning Intuition
Instead of computing probabilities over the entire vocabulary, we can reframe the problem as binary classification: distinguish true (center, context) pairs from fake pairs.
For each true pair from the corpus (e.g., "fox" and "quick"), we sample $k$ negative examples: random words that don't appear in the context (e.g., "asteroid", "calculus", "piano").
The model learns to:
- Assign high probability to the true pair: $P(\text{real pair} \mid \text{fox}, \text{quick})$ should be high
- Assign low probability to fake pairs: $P(\text{real pair} \mid \text{fox}, \text{asteroid})$ should be low
This is contrastive learning: we learn by contrasting positive examples (real context) against negative examples (random noise).
The Objective Function: Breaking Down the Math
The negative sampling objective is:
$$\log \sigma(u_{w_O} \cdot v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-u_{w_i} \cdot v_{w_I}) \right]$$
This looks complex, but let's build intuition piece by piece.
Step 1: Binary Classification with Sigmoid
First, recall we are doing binary classification which means that for each word pair, we ask "Did these two words actually appear together in the text?"
Instead of softmax (which computes probabilities over all words), we use the sigmoid function for each pair independently:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
The sigmoid converts a score $x$ (the dot product) into a probability between 0 and 1:
- If $x \gg 0$ (large positive), $\sigma(x) \approx 1$: "very confident YES"
- If $x = 0$ (neutral), $\sigma(x) = 0.5$: "uncertain"
- If $x \ll 0$ (large negative), $\sigma(x) \approx 0$: "very confident NO"
Key difference from softmax: Sigmoid doesn't need to sum over all vocabulary words. Each pair gets its own independent probability.
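A quick numeric check of those three regimes (the input values are arbitrary):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(5.0))    # ≈ 0.993 -> "very confident YES"
print(sigmoid(0.0))    # = 0.5   -> "uncertain"
print(sigmoid(-5.0))   # ≈ 0.007 -> "very confident NO"
```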
Step 2: The Positive Term (True Context)
For a true pair like ("fox", "quick") that appeared in the corpus, the positive term is:
$$\log \sigma(v_{\text{fox}} \cdot u_{\text{quick}})$$
What does this do?
- $v_{\text{fox}} \cdot u_{\text{quick}}$ is the dot product (higher = more similar)
- $\sigma(\cdot)$ converts it to a probability: "How likely is it these words appeared together?"
- $\log(\cdot)$ converts the probability to a log-probability for numerical stability
Goal: Maximize this term → increase the dot product → pull "fox" and "quick" embeddings closer together
Step 3: The Negative Term (Random Words)
For a negative sample like ("fox", "asteroid") that did NOT appear together, the term is:
$$\log \sigma(-v_{\text{fox}} \cdot u_{\text{asteroid}})$$
Notice the negative sign before the dot product! Why?
- $v_{\text{fox}} \cdot u_{\text{asteroid}}$ might be positive (embeddings accidentally similar)
- The negative sign flips it: $-v_{\text{fox}} \cdot u_{\text{asteroid}}$
- $\sigma(-v_{\text{fox}} \cdot u_{\text{asteroid}})$ is the probability that these words did NOT appear together
Goal: Maximize this term → make the dot product more negative → push "fox" and "asteroid" apart
Step 4: Putting It Together
The full objective for one training example with $k$ negative samples:
$$\log \sigma(u_{w_O} \cdot v_{w_I}) + \sum_{i=1}^{k} \log \sigma(-u_{w_i} \cdot v_{w_I})$$
Where:
- $w_I$ is the center (input) word and $w_O$ is the true context word (1 positive example)
- $w_1, \dots, w_k$ are randomly sampled negative words
- $P_n(w)$ is the sampling distribution they are drawn from (smoothed unigram, explained below)
A Concrete Example
Suppose we are training on the pair ("fox", "quick") with $k = 2$ negative samples: "asteroid" and "piano".
Current dot products (before training):
- $v_{\text{fox}} \cdot u_{\text{quick}} = 1.5$ → $\sigma(1.5) \approx 0.82$ → $\log(0.82) \approx -0.20$
- $v_{\text{fox}} \cdot u_{\text{asteroid}} = 0.5$ → $\sigma(-0.5) \approx 0.38$ → $\log(0.38) \approx -0.97$
- $v_{\text{fox}} \cdot u_{\text{piano}} = 0.3$ → $\sigma(-0.3) \approx 0.43$ → $\log(0.43) \approx -0.84$
Objective value: $-0.20 - 0.97 - 0.84 = -2.01$
What gradient descent does:
- Increase $v_{\text{fox}} \cdot u_{\text{quick}}$ (currently 1.5) → maybe adjust to 1.8
- Decrease $v_{\text{fox}} \cdot u_{\text{asteroid}}$ (currently 0.5) → maybe adjust to 0.2
- Decrease $v_{\text{fox}} \cdot u_{\text{piano}}$ (currently 0.3) → maybe adjust to 0.0
After one gradient update:
- $v_{\text{fox}} \cdot u_{\text{quick}} = 1.8$ → $\sigma(1.8) \approx 0.86$ → $\log(0.86) \approx -0.15$
- $v_{\text{fox}} \cdot u_{\text{asteroid}} = 0.2$ → $\sigma(-0.2) \approx 0.45$ → $\log(0.45) \approx -0.80$
- $v_{\text{fox}} \cdot u_{\text{piano}} = 0.0$ → $\sigma(0.0) = 0.50$ → $\log(0.50) \approx -0.69$
New objective: $-0.15 - 0.80 - 0.69 = -1.64$ (improved from $-2.01$!)
The model successfully:
- Pulled "fox" and "quick" closer (dot product increased)
- Pushed "fox" away from "asteroid" and "piano" (dot products decreased)
Why Logarithms?
Two reasons:
- Numerical stability: Probabilities can be very small (e.g., 0.0001). Logs convert them to manageable numbers.
- Gradient behavior: Logarithm amplifies gradients when predictions are wrong (high learning signal) and reduces them when predictions are correct (low learning signal). This makes training more efficient.
Why This Works Better Than Softmax
Softmax for vocabulary size $V$:
- Compute $V$ dot products
- Exponentiate and sum all of them
- Total: ~$V \times d$ operations per training example
Negative sampling with $k = 5$:
- Compute $k + 1 = 6$ dot products (1 positive + 5 negatives)
- No expensive sum over the vocabulary
- Total: ~$6d$ operations per training example
Speedup: roughly $V / (k + 1)$, i.e., on the order of 10,000× faster for a large vocabulary!
Insight: We don't need to know the exact probability of every word in the vocabulary. We just need to know that the true context word scores higher than random words. This relative ranking is enough to learn good embeddings.
True vs Fake Pairs in Embedding Space
Geometrically, negative sampling creates clear separation:
Positive pairs (true context):
- $(w_{\text{center}}, w_{\text{context}})$ appears in the corpus → increase $v_{\text{center}} \cdot u_{\text{context}}$ → pull vectors together → small angle
Negative pairs (random words):
- $(w_{\text{center}}, w_{\text{negative}})$ doesn't appear in the corpus → decrease $v_{\text{center}} \cdot u_{\text{negative}}$ → push vectors apart → large angle
Over many iterations, embeddings converge to a state where:
- Related words cluster together (high dot product)
- Unrelated words are far apart (low or negative dot product)
- The geometry reflects corpus statistics, not random initialization
Critically, we only compute $k + 1$ dot products per training example (1 positive + $k$ negatives), instead of $V$ dot products for the full softmax. With $V = 50{,}000$ and $k = 5$, this is a speedup on the order of 10,000×.
Choosing Negative Samples
Now, the question arises: how do we sample negative words?
The naive approach is uniform random sampling from the vocabulary. But this is suboptimal for several reasons:
- Rare words are over-sampled (every word has equal probability)
- Common words like "the", "and" are under-sampled
- The model wastes time learning to push away extremely rare words that never co-occur anyway
Mikolov's original Word2Vec uses a smoothed unigram distribution:
$$P_n(w) = \frac{f(w)^{3/4}}{\sum_{w'} f(w')^{3/4}}$$
Where $f(w)$ is the frequency of word $w$ in the corpus. The $3/4$ exponent smooths the distribution: it reduces the probability of very common words and increases the probability of rare words, balancing the negative sampling.
Why $3/4$ specifically? This was found empirically to work better than the extremes:
- Exponent $1$ (unsmoothed): Over-samples common words like "the", "and" — wastes computation on obvious negatives
- Exponent $0$ (uniform): Over-samples rare words; creates too-easy negatives that don't provide useful signal
- Exponent $3/4$: Sweet spot that balances frequent and rare words, creating informative negative examples
This ensures the model learns meaningful contrasts: distinguishing true context from plausible-but-wrong alternatives, rather than from nonsense words.
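A minimal sketch of the smoothed sampler; the vocabulary and counts are hypothetical:

```python
import numpy as np

# Hypothetical corpus frequencies
vocab = ["the", "and", "fox", "wolf", "quokka"]
counts = np.array([1_000_000, 800_000, 2_000, 1_500, 100], dtype=float)

def noise_distribution(counts, power=0.75):
    smoothed = counts ** power
    return smoothed / smoothed.sum()

uniform = counts / counts.sum()
smoothed = noise_distribution(counts)

for word, p_raw, p_smooth in zip(vocab, uniform, smoothed):
    print(f"{word:8s} raw: {p_raw:.4f}  smoothed: {p_smooth:.4f}")
# Common words get slightly less probability, rare words get noticeably more

# Sample 5 negative words from the smoothed distribution
negatives = np.random.choice(vocab, size=5, p=smoothed)
print(negatives)
```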
Positive pair (fox, quick) pulled together; negative pairs (fox, asteroid), (fox, calculus) pushed apart. After training, embedding space separates semantically related words from unrelated words.
CBOW: Predicting the Word from Context
Continuous Bag-of-Words (CBOW) inverts the Skip-Gram formulation. Here, instead of predicting context from a word, we predict the word from context.
Inverting the Prediction Task
Given the context words $\{w_{t-m}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+m}\}$, CBOW predicts the center word $w_t$.
For our "fox" example:
Context: {quick, brown, jumps, over}
Target: fox

The model must learn: "Given that the surrounding words are {quick, brown, jumps, over}, what word is in the center?"
This is a classification problem: predict 1-of-$V$ words given the context.
Averaging Context Embeddings
The key design choice in CBOW: how do we combine multiple context words into a single representation? The simplest approach is to average their embeddings:
$$h = \frac{1}{C} \sum_{-m \le j \le m,\; j \ne 0} v_{t+j}$$
Where:
- $v_{t+j}$ is the embedding of the context word at position $t+j$
- $C = 2m$ is the total number of context words ($m$ on each side)
- $h$ is the averaged context representation
For "fox":
$$h = \frac{1}{4}\left(v_{\text{quick}} + v_{\text{brown}} + v_{\text{jumps}} + v_{\text{over}}\right)$$
This averaged vector becomes the input to the output layer, which predicts the center word:
$$P(w_t \mid \text{context}) = \frac{\exp(u_{w_t} \cdot h)}{\sum_{j=1}^{V} \exp(u_j \cdot h)}$$
Where $u_{w_t}$ is the output embedding for center word $w_t$.
Like Skip-Gram, CBOW uses negative sampling to avoid computing the full softmax.
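Here is a minimal numpy sketch of the CBOW forward pass for this window; the embedding values are random placeholders:

```python
import numpy as np

embedding_dim = 3
rng = np.random.default_rng(0)

# Placeholder input embeddings for the four context words
v_quick, v_brown, v_jumps, v_over = rng.uniform(-0.5, 0.5, (4, embedding_dim))

# Average the context embeddings into a single vector h
h = (v_quick + v_brown + v_jumps + v_over) / 4

# Score the candidate center word "fox" against h using its output embedding
u_fox = rng.uniform(-0.5, 0.5, embedding_dim)
score_fox = u_fox @ h   # fed into a softmax (or negative sampling) over the vocabulary
print(h, score_fox)
```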
Why CBOW Converges Faster
CBOW typically trains faster than Skip-Gram for several reasons:
1. Fewer Updates Per Training Example
For a window size $m$, Skip-Gram generates $2m$ training examples (one for each context word):

Skip-Gram from "fox" with context {quick, brown, jumps, over}:
(fox → quick)
(fox → brown)
(fox → jumps)
(fox → over)

CBOW generates a single training example:

CBOW:
({quick, brown, jumps, over} → fox)

This means Skip-Gram performs $2m$ gradient updates per window position, while CBOW performs just one. For large corpora, this 4-8× difference significantly accelerates training.
2. Smoothing Effect of Averaging
Averaging context embeddings creates a smoothing effect. Individual words might be noisy (ambiguous, rare, or polysemous), but their average is more stable.
Consider predicting the center word from context {the, ___, barked, loudly}. The word "the" is extremely common and appears in countless contexts, providing minimal information. But combined with "barked" and "loudly", the average embedding strongly suggests an animal noun. CBOW automatically learns to weight informative context words higher (through the learned embeddings) while diluting noisy words in the average.
This smoothing stabilizes gradients, allowing larger learning rates and faster convergence.
3. Better Performance on Frequent Words
CBOW learns better representations for frequent words because it sees them as targets many times. A common word like "dog" appears in millions of different contexts. Each time, CBOW updates its embedding to be predictable from diverse contexts, creating a robust, generalized representation.
Skip-Gram, in contrast, learns better representations for rare words. When a rare word appears, Skip-Gram treats it as the center and predicts its context, getting strong gradients that update the rare word's embedding. CBOW averages rare words into the context, diluting their signal.
This leads to a key trade-off, which we will explore in the next section.
CBOW averages the context word embeddings (quick, brown, jumps, over) into a single vector h, then predicts the center word "fox". The averaging smooths noise and stabilizes training.
Pseudocode
```python
import numpy as np

# Assumes vocab_size, embedding_dim, window_size, learning_rate, k, corpus,
# noise_distribution, get_context(...), and sample_negatives(...) are defined
W_in = np.random.uniform(-0.5, 0.5, (vocab_size, embedding_dim))   # Context word embeddings
W_out = np.random.uniform(-0.5, 0.5, (vocab_size, embedding_dim))  # Center word embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for sentence in corpus:                              # sentence: list of word indices
    for center_position, center_word in enumerate(sentence):
        context_words = get_context(sentence, center_position, window_size)

        # Average context word embeddings
        h = W_in[context_words].mean(axis=0)         # Shape: (embedding_dim,)

        # Predict center word (with negative sampling)
        positive_score = W_out[center_word] @ h
        positive_loss = -np.log(sigmoid(positive_score))

        # Sample k negative words
        negative_words = sample_negatives(k, noise_distribution)
        negative_scores = W_out[negative_words] @ h  # Shape: (k,)
        negative_loss = -np.log(sigmoid(-negative_scores)).sum()

        total_loss = positive_loss + negative_loss

        # Backprop: gradient w.r.t. each score is (sigmoid(score) - label)
        grad_pos = sigmoid(positive_score) - 1.0     # label = 1 for the true pair
        grad_negs = sigmoid(negative_scores)         # label = 0 for negative pairs

        # Gradient w.r.t. h, accumulated over the positive and negative words
        grad_h = grad_pos * W_out[center_word] + grad_negs @ W_out[negative_words]

        # Update W_out (center embeddings) and W_in (context embeddings)
        W_out[center_word] -= learning_rate * grad_pos * h
        W_out[negative_words] -= learning_rate * np.outer(grad_negs, h)
        W_in[context_words] -= learning_rate * grad_h / len(context_words)

# Final embeddings: W_in (or the average of W_in and W_out)
```

Skip-Gram vs CBOW
Both Skip-Gram and CBOW learn embeddings from local context windows, but their formulations create different inductive biases. Understanding when to use each requires examining their trade-offs.
When Each Works Better
The choice depends on our corpus characteristics and downstream task:
| Criterion | Skip-Gram | CBOW |
|---|---|---|
| Training speed | Slower ($2m$ updates per position) | Faster (1 update per position) |
| Rare word quality | Better (rare words as center get strong gradients) | Worse (rare words in context are averaged out) |
| Frequent word quality | Worse (common words diluted across many contexts) | Better (frequent targets updated many times) |
| Small corpus | Preferred (better data efficiency) | Acceptable |
| Large corpus | Slower but higher quality | Much faster, still good quality |
| Syntactic tasks | Moderate (focuses on individual word contexts) | Better (averaging captures syntactic patterns) |
| Semantic tasks | Better (sharp distinctions between rare words) | Moderate (smoothed representations) |
| Example: Medical NER | ✓ Better (rare disease names need sharp embeddings) | Acceptable (may conflate similar rare terms) |
| Example: Sentiment Analysis | Acceptable (common sentiment words well-represented) | ✓ Better (frequent words like "good", "bad" dominate) |
| Example: Search/Retrieval | ✓ Better (precise matching of rare entity names) | Good (robust for common query terms) |
Sharp vs Smooth Embeddings
The averaging in CBOW creates smooth embeddings: representations that blend multiple contextual signals. This is beneficial for frequent words, which appear in diverse contexts and benefit from aggregation.
Skip-Gram creates sharp embeddings: representations that preserve fine-grained distinctions. This is beneficial for rare words, which appear in limited contexts and need strong signal from each occurrence.
Empirically, Skip-Gram embeddings tend to have higher variance: large differences between similar-but-not-identical words. CBOW embeddings have lower variance: similar words are tightly clustered.
For tasks requiring fine-grained semantic distinctions (e.g., identifying subtle differences between synonyms), Skip-Gram often performs better.
For tasks requiring robust generalizations (e.g., part-of-speech tagging, where all verbs should cluster), CBOW often performs better.
Rare vs Frequent Words
This is the most important trade-off. Let's make it concrete with examples.
Skip-Gram Favors Rare Words
Consider the rare word "quokka" (a small marsupial). It might appear only 100 times in a large corpus, always in contexts like:
- "The quokka is a small marsupial native to Australia."
- "Quokkas are related to kangaroos and wallabies."
In Skip-Gram, when "quokka" is the center word, we predict {small, marsupial, Australia, kangaroos}. The gradient flows directly into the "quokka" embedding, giving it a strong update even from limited data. After 100 occurrences, "quokka" has a well-trained embedding that clusters near {kangaroo, wallaby, marsupial}.
In CBOW, "quokka" might appear in the context, but it is averaged with other words:
- Context: {the, is, a, small, quokka, native, to, Australia}
- Average: (the + is + a + small + quokka + native + to + Australia) / 8
The "quokka" signal is diluted by common words like "the", "is", "a". The gradient to "quokka" is weaker. After 100 occurrences, its embedding is less distinctive.
CBOW Favors Frequent Words
Consider the common word "dog", appearing 1 million times in diverse contexts:
- "The dog barked loudly."
- "She adopted a friendly dog."
- "Dogs are loyal companions."
In CBOW, every time "dog" is the center word, its embedding gets updated to be predictable from that specific context. After 1 million updates from diverse contexts, "dog" has a robust, generalized embedding that captures all its typical usages.
In Skip-Gram, "dog" appears as the center word, predicting context words like {barked, adopted, loyal, companions}. But these contexts are diverse. The embedding must simultaneously predict "barked" (action), "adopted" (acquisition), "loyal" (attribute). The gradients pull in many directions, potentially creating a less focused representation.
For very frequent words, CBOW's smoothing stabilizes the learning, while Skip-Gram's sharp updates can create noisy embeddings.
Practical Recommendations
Based on these trade-offs, here are practical guidelines:
Use Skip-Gram when:
- You have a small corpus (< 100 million words)
- Rare words are important for your task (e.g., medical NER extracting rare disease names, legal document retrieval matching obscure case citations, scientific literature search with specialized terminology)
- You need fine-grained semantic distinctions
- You can afford longer training time
- Your downstream task benefits from sharp, distinctive embeddings
Use CBOW when:
- You have a large corpus (> 100 million words)
- Training speed is critical
- Frequent words are more important than rare words (e.g., sentiment analysis where common words like "excellent", "terrible" carry most signal, spam classification detecting common spam patterns, POS tagging where syntactic categories matter more than rare vocabulary)
- You need robust, stable embeddings for syntactic tasks
- Your downstream task benefits from smooth, generalized embeddings
Modern practice: Most production systems use Skip-Gram for quality, even though CBOW is faster. The quality gain for rare words outweighs the speed cost. However, for truly massive corpora (billions of words), CBOW's speed advantage becomes significant.
Visualization: Loss Trends
Let's visualize how loss decreases during training for both models (conceptual, not real data):
Training Loss over Time

Loss
 |
 |   CBOW        ----___
 |                      ----___
 |                             ----___
 |   Skip-Gram                        ----___
 |       -----______                         ----___
 |                  ------______                    ----___
 |                              ------_______----___
 |__________________________________________________ Iterations
  0                                                 1M

CBOW:      Faster initial convergence due to fewer updates per example
           Plateaus earlier at slightly higher loss (smoothing effect)

Skip-Gram: Slower initial convergence (more updates per example)
           Continues improving longer (sharper gradients for rare words)
           Lower final loss (better fit to data)

CBOW reaches "good enough" embeddings quickly. Skip-Gram reaches "better" embeddings slowly. The choice depends on your resource constraints and quality requirements.
Key Takeaways
Having covered both Word2Vec formulations, let's look at the key takeaways:
Context Windows Create Training Signal
- Sliding windows over text extract (center, context) pairs
- Local co-occurrence becomes supervision for embedding learning
- Bag-of-words assumption: order within context is initially ignored
- Millions of training pairs provide rich distributional signal
Skip-Gram (Center → Context)
- Predict each context word independently given center word
- Creates $2m$ training examples per window position
- Better for rare words (direct gradients to rare center words)
- Sharper embeddings with fine-grained distinctions
- Slower training but higher quality
Negative Sampling
- Replaces expensive softmax with contrastive learning
- Binary classification: true pairs vs. random pairs
- $k$ negative samples per positive example (typically $k = 5$ to $20$)
- 10,000× speedup for large vocabularies
- Smoothed unigram distribution ($f(w)^{3/4}$) for sampling negatives
CBOW (Context → Center)
- Predict center word from averaged context embeddings
- Single training example per window position (4-8× faster)
- Better for frequent words (robust from many updates)
- Smoother embeddings with stable generalizations
- Faster training, slightly lower quality for rare words
The Geometry of Learning
- Push-pull forces: co-occurring words pulled together, non-co-occurring pushed apart
- Dot product approximates PMI after convergence
- Semantic similarity emerges from distributional similarity
- No explicit supervision for semantics—learned as side effect of prediction
Conclusion
We have answered the question: Where do embeddings come from?
They emerge from self-supervised learning on local context windows. By training a shallow neural network to predict context from words (Skip-Gram) or words from context (CBOW), we force the model to discover distributional patterns. The learned embeddings encode co-occurrence statistics in geometric form: similar words cluster together because they solve similar prediction tasks.
Negative sampling makes this computationally tractable, replacing expensive softmax normalization with contrastive learning against random samples.
But Word2Vec has fundamental limitations:
- Local context only: Only words within a small window (typically 5-10) contribute to learning. Long-range dependencies and document-level semantics are ignored.
- Bag-of-words: Word order is discarded, losing syntactic structure (can't distinguish "dog chased cat" from "cat chased dog").
- Static embeddings: Each word gets a single vector, conflating all senses ("bank" as institution and "bank" as river edge get the same representation).
These limitations motivated the next generation of embedding methods:
GloVe (Part 3): Instead of local windows, use global co-occurrence statistics across the entire corpus, combining the benefits of matrix factorization (LSA) with predictive models (Word2Vec).
Contextual Embeddings (Part 4): Instead of static vectors, compute dynamic representations based on sentence context:
- ELMo (2018): Bidirectional LSTMs generate context-dependent embeddings
- BERT (2018): Transformer encoders with masked language modeling; attention mechanisms capture long-range dependencies
- GPT (2018+): Transformer decoders with autoregressive prediction; positional encodings preserve word order
These modern architectures address Word2Vec's limitations (dynamic representations for polysemy, attention for long-range dependencies, positional encodings for word order) while building on its core insight: prediction tasks drive semantic learning.
Word2Vec was a breakthrough, proving that simple prediction tasks could learn rich semantic structure. But it's just the beginning of the embedding story.
See you in the next post. Namaste!
References
Foundational Papers
Word2Vec (Skip-Gram and CBOW) - Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv preprint.
Distributed Representations of Words and Phrases - Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. NeurIPS 2013.
Theoretical Analysis
Word2Vec Explained - Goldberg, Y., & Levy, O. (2014). word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint.
Neural Word Embedding as Implicit Matrix Factorization - Levy, O., & Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization. NeurIPS 2014.
Improving Distributional Similarity - Levy, O., Goldberg, Y., & Dagan, I. (2015). Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the ACL.
Practical Guides
Word2Vec Tutorial - McCormick, C. (2016). Word2Vec Tutorial - The Skip-Gram Model.
Gensim Word2Vec Implementation - Řehůřek, R. Gensim: Word2Vec. Production-quality Python implementation.
Negative Sampling Intuition - Understanding Negative Sampling in Word2Vec. Detailed walkthrough with examples.
Contrastive Learning
Noise Contrastive Estimation - Gutmann, M., & Hyvärinen, A. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. AISTATS 2010.
- Theoretical foundation for negative sampling
Contrastive Learning Survey - Le-Khac, P. H., Healy, G., & Smeaton, A. F. (2020). Contrastive Representation Learning: A Framework and Review. IEEE Access.
Embeddings Evolution
ELMo (Contextual Embeddings) - Peters, M. E., et al. (2018). Deep contextualized word representations. NAACL 2018.
BERT - Devlin, J., et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint.
Historical Context
Neural Language Models - Bengio, Y., et al. (2003). A Neural Probabilistic Language Model. JMLR 2003.
- Early neural language model that learned word embeddings as a byproduct
Collobert & Weston (2008) - A Unified Architecture for Natural Language Processing. ICML 2008.
Written by Anirudh Sharma
Published on December 26, 2025