40 min read · By Anirudh Sharma

Understanding Embeddings: Part 2 - Learning Meaning from Context


This is Part 2 of a 4-part series on Embeddings:



In the previous part of this four-part series, we explored why embeddings need to exist.

To recap: words need numerical representations that capture semantic similarity, and vectors provide the natural mathematical structure for this — hence, embeddings.

We also learned that meaning emerges from distributional patterns: "You shall know a word by the company it keeps" and that geometric relationships (cosine similarity, vector arithmetic) mirror semantic relationships.

But we left a critical question unanswered: Where do embeddings actually come from?

We know what properties embeddings should have (dense, low-dimensional, semantically structured). We also know that similar words should have similar vectors. But how do we learn these vectors from raw text? How does a neural network discover that "dog" and "puppy" should be close together, while "dog" and "asteroid" should be far apart?

This is where Word2Vec enters the picture. It was introduced by Tomas Mikolov and colleagues at Google in 2013.

Word2Vec is a family of related techniques for learning word embeddings from unlabeled text. It revolutionized NLP by showing that simple, shallow neural networks trained on a straightforward prediction task could capture rich semantic relationships.

We don't tell the model that "dog" and "cat" are similar. Instead, we force it to predict context words, and it discovers similarity on its own because similar words appear in similar contexts.

Core Insight: Word2Vec doesn't encode semantics directly. Instead, it creates a prediction task (predicting context from words, or vice versa) whose solution requires learning semantic representations.

The embeddings emerge as a byproduct of solving this task well.

Similar words must have similar embeddings because they appear in similar contexts and solve similar prediction problems.

This post explores the learning dynamics of Word2Vec: how local context windows create training signal, how Skip-Gram and CBOW formulations differ, why negative sampling is necessary, and what geometric pressures emerge during training.

Context Windows as Learning Signal

The distributional hypothesis tells us that meaning comes from context. But to train a model, we need to operationalize "context" into concrete training examples. This is where the context window comes in.

Local Context as Supervision

Consider the sentence: The quick brown fox jumps over the lazy dog.

If we want to learn an embedding for the word "fox", we need training data that captures its typical contexts. The simplest approach is to slide a fixed-size window across the sentence and extract (center word, context words) pairs.

Defining Center and Context

At any position in a sentence, we define:

Center word: The target word we are currently focusing on (the word at the current position in our sliding window).

Context words: The words surrounding the center word within a fixed distance (the window size).

Window size: The maximum distance on each side of the center word. A window size of k means we look at k words to the left and k words to the right, giving us up to 2k total context words.

With a window size of 2 (two words on each side), when "fox" is the center word:

plaintext
Context window for "fox":

    [quick, brown]   fox   [jumps, over]
    <----- 2 ----->        <---- 2 ---->
    (left context)         (right context)

Here:

  • Center word: "fox"
  • Context words: {quick, brown, jumps, over} (4 words total: 2 on left + 2 on right)
  • Window size: 2 (the radius, not the total count)

Two Prediction Formulations

This (center, context) pair becomes our training signal, but we can formulate the prediction task in two ways:

1. Predict context from center (Skip-Gram): Given the center word, predict the surrounding context words.

P(\text{context words} | \text{center word})

P(\text{quick, brown, jumps, over} | \text{fox})

This asks: "If I see the word "fox", what other words are likely to appear nearby?"

NOTE: P(X | Y) is a conditional probability: the probability of event X given that event Y has occurred.

2. Predict center from context (CBOW): Given the context words, predict the center word.

P(\text{center word} | \text{context words})

P(\text{fox} | \text{quick, brown, jumps, over})

This asks: "If I see the words {quick, brown, jumps, over} nearby, what word is likely in the middle?"

Both formulations capture the same distributional intuition: words that appear in similar contexts should have similar meanings. The difference is which direction we predict, and this creates different learning dynamics (which we will explore in detail later).

This approach creates semantic learning because words that appear in similar contexts will get similar representations.

Second-Order Co-occurrence

This is crucial: words don't need to appear together directly to be recognized as similar. Consider "fox" and "wolf":

  • Sentence 1: "The quick brown fox jumps through the forest"
  • Sentence 2: "The gray timber wolf runs through the forest"

"Fox" and "wolf" never co-occur directly, yet they should have similar embeddings. Why? Because they share second-order co-occurrence as they appear with similar context words:

  • Both appear with: {the, through, forest} (shared contexts)
  • "Fox" appears with: {quick, brown, jumps}
  • "Wolf" appears with: {gray, timber, runs}

Even the non-shared contexts are semantically similar:

  • "quick" ≈ "runs" (speed and motion),
  • "brown" ≈ "gray" (color adjectives),
  • "jumps" ≈ "runs" (motion verbs).

Over millions of sentences, the model learns that both animals appear in similar distributional environments (subjects of action verbs, modified by adjectives, followed by location phrases).

This shared pattern forces their embeddings to be similar, even without direct co-occurrence.
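A toy sketch makes this concrete. The snippet below is illustrative only: it lowercases the two sentences above and uses a window of 4 so the shared words are visible, then counts the context words around each animal and shows that they overlap even though "fox" and "wolf" never co-occur.

python
from collections import Counter

def context_counts(tokens, target, window_size=4):
    """Count the words that appear within `window_size` of each occurrence of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window_size), i + window_size + 1
            counts.update(tokens[lo:i] + tokens[i + 1:hi])
    return counts

s1 = "the quick brown fox jumps through the forest".split()
s2 = "the gray timber wolf runs through the forest".split()

fox_ctx = context_counts(s1, "fox")
wolf_ctx = context_counts(s2, "wolf")

print(sorted(fox_ctx))               # ['brown', 'forest', 'jumps', 'quick', 'the', 'through']
print(sorted(wolf_ctx))              # ['forest', 'gray', 'runs', 'the', 'through', 'timber']
print(set(fox_ctx) & set(wolf_ctx))  # shared contexts: {'the', 'through', 'forest'} (order may vary)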

Sliding Window Intuition

The window slides across the entire corpus, generating millions of training examples. At each position, the word at that position becomes the center word, and the surrounding words within the window become the context words.

plaintext
Sentence: "The quick brown fox jumps over the lazy dog"

Window size = 2:

Position 1: [∅, ∅]          The    [quick, brown]   center: "The"
Position 2: [∅, The]        quick  [brown, fox]     center: "quick"
Position 3: [The, quick]    brown  [fox, jumps]     center: "brown"
Position 4: [quick, brown]  fox    [jumps, over]    center: "fox"    ← our example
Position 5: [brown, fox]    jumps  [over, the]      center: "jumps"
...and so on
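Here is a minimal Python sketch (using a hypothetical get_context helper, not from any library) that reproduces the pairs in the table above:

python
def get_context(tokens, center_position, window_size):
    """Return the context words within `window_size` positions of the center."""
    start = max(0, center_position - window_size)
    left = tokens[start:center_position]
    right = tokens[center_position + 1:center_position + 1 + window_size]
    return left + right

sentence = "The quick brown fox jumps over the lazy dog".split()

for i, center in enumerate(sentence):
    print(f"center: {center!r:<8} context: {get_context(sentence, i, window_size=2)}")
# The "fox" line prints: center: 'fox'    context: ['quick', 'brown', 'jumps', 'over']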

Each position generates a training example with its (center, context) pair. For Word2Vec, we have two key choices for the prediction task:

1. Skip-Gram: Given the center word, predict each context word independently.

For position 4 (center = "fox"):

P(\text{quick} | \text{fox}), P(\text{brown} | \text{fox}), P(\text{jumps} | \text{fox}), P(\text{over} | \text{fox})

2. CBOW (Continuous Bag-of-Words): Given the context words, predict the center word.

For position 4 (center = "fox"):

P(\text{fox} | \text{quick, brown, jumps, over})

The key difference is that Skip-Gram makes multiple predictions per window (one for each context word), while CBOW makes a single prediction (the center word from all context words). This asymmetry creates different learning dynamics.

We will explore both, but first, let's understand why order initially doesn't matter.

Why Order Initially Doesn't Matter

Notice that in the context window {quick, brown, jumps, over}, we treat all four words symmetrically. We don't distinguish between "quick" (two positions left) and "over" (two positions right). The context is a bag of words: an unordered set.

This might seem like a limitation. After all, word order carries meaning:

  • "The dog chased the cat""The cat chased the dog"

But for learning basic semantic embeddings, positional information is second-order. The primary signal is co-occurrence itself: which words tend to appear near each other. Whether "quick" is left or right of "fox" matters less than the fact that they co-occur at all.

This bag-of-words assumption simplifies the learning problem enormously. Instead of modeling:

P(\text{word at position } i | \text{center word}),

we model the simpler:

P(\text{word in context} | \text{center word}).

The first requires learning positional encodings; the second requires only learning which words tend to appear together.

Later architectures (ELMo, BERT, GPT) add positional encodings to capture word order. But Word2Vec deliberately ignores order to focus on the core distributional signal. This design choice enables training on massive corpora with minimal computational cost.

Figure: Sliding a context window (size = 2) across a sentence. Each position generates a training example: the center word paired with its surrounding context words. Order within the context is ignored.

Skip-Gram: Predicting Context from a Word

Skip-Gram is the most influential Word2Vec variant. Before diving into its mathematics, let's build intuition for what it's actually trying to accomplish.

The Big Picture: A Prediction Game

Imagine you are learning a new language by reading books. You notice that certain words appear near each other frequently. When you see the word "fox", nearby words are often "quick", "brown", "forest", "jumps". When you see "wolf", nearby words are "gray", "pack", "howls", "forest".

Skip-Gram turns this observation into a learning task: If I show you a word, can you guess which words appear nearby?

This might seem backwards: why predict context from words instead of learning meanings directly? The reason is simple: words that appear in similar contexts must have similar meanings.

By forcing the model to predict contexts, we automatically cluster semantically related words.

Walking Through a Concrete Example

Let's use our familiar sentence: The quick brown fox jumps over the lazy dog

When we encounter "fox" as the center word with window size 2: Context: [quick, brown] fox [jumps, over]

Skip-Gram's task: Given only the word "fox", predict each of the four context words.

| Question | Model Answer |
| --- | --- |
| What is the probability I will find "quick" nearby? | High probability (foxes are often described as quick) |
| What's the probability I will find "asteroid" nearby? | Very low probability (foxes and asteroids rarely appear together) |
| What's the probability I will find "brown" nearby? | High probability (foxes are often brown) |
| What's the probability I will find "calculus" nearby? | Very low probability (unrelated concepts) |

The model makes four independent predictions, one for each context word. It doesn't try to predict all four simultaneously. Rather, it treats each prediction separately.

Why This Creates Semantic Learning

Here's the magic: Consider what happens when the model also sees "wolf": The gray timber wolf runs through the forest with context: [gray, timber] wolf [runs, through]

The model must now predict {gray, timber, runs, through} given "wolf".

Notice: Both "fox" and "wolf" appear with similar types of words:

  • Color adjectives: "brown" vs. "gray"
  • Motion verbs: "jumps" vs. "runs"
  • Natural settings: Both might appear with "forest"

To make good predictions for both words, the model learns similar representations for "fox" and "wolf".

If it learns that "fox" should predict animal-related contexts, that same knowledge helps predict contexts for "wolf".

This is semantic pressure in action: Words with similar meanings must have similar embeddings because they solve similar prediction tasks.

The Mathematical Formulation

Now that we understand the intuition, let's formalize it mathematically.

For a center word at position c and a window size of k, the context words reside in the window [c - k, c + k].

If the word at position c is represented as w_c, Skip-Gram tries to maximize the probability of observing the actual context words \{w_{c-k}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+k}\}.

The objective is to maximize:

P(\text{context} | w_c) = \prod_{-k \leq j \leq k,\ j \neq 0} P(w_{c+j} | w_c)

In our "fox" example with window size 2:

P(\text{context} | \text{fox}) = P(\text{quick} | \text{fox}) \times P(\text{brown} | \text{fox}) \times P(\text{jumps} | \text{fox}) \times P(\text{over} | \text{fox})

The \prod (product) symbol means we multiply all the individual probabilities together. Each context word is predicted independently; this is the bag-of-words assumption we discussed earlier.

How the Model Works: The Neural Architecture

Now let's see how the model actually computes these probabilities. We will use a tiny example with just 5 words to make the mathematics concrete.

A Tiny Example: 5-Word Vocabulary

Imagine we have a vocabulary of only 5 words:

plaintext
0: "the" 1: "fox" 2: "quick" 3: "brown" 4: "jumps"

And we want to use 3-dimensional embeddings (in practice, we'd use 100-300 dimensions, but 3 makes the math visible).

Step 1: Represent the Center Word (One-Hot Encoding)

Suppose our center word is "fox" (index 1). We represent it as a one-hot vector i.e., all zeros except a 1 at position 1:

\mathbf{x} = [0, 1, 0, 0, 0]

This is just a way to say "I'm talking about word #1" in a format the neural network can process.

Step 2: Look Up the Embedding

We have an embedding matrix \mathbf{W}_{\text{in}} that stores one embedding vector for each word. With 5 words and 3-dimensional embeddings, it's a 5 \times 3 matrix:

\mathbf{W}_{\text{in}} = \begin{bmatrix} 0.1 & 0.2 & 0.3 \\ 0.5 & 0.8 & 0.2 \\ 0.3 & 0.1 & 0.9 \\ 0.4 & 0.7 & 0.1 \\ 0.2 & 0.6 & 0.5 \end{bmatrix}

Where each row corresponds to: row 0 = "the", row 1 = "fox", row 2 = "quick", row 3 = "brown", row 4 = "jumps"

When we multiply its transpose (a 3 \times 5 matrix) by the one-hot vector:

\mathbf{h} = \mathbf{W}_{\text{in}}^\top \mathbf{x} = \begin{bmatrix} 0.1 & 0.2 & 0.3 \\ 0.5 & 0.8 & 0.2 \\ 0.3 & 0.1 & 0.9 \\ 0.4 & 0.7 & 0.1 \\ 0.2 & 0.6 & 0.5 \end{bmatrix}^\top \cdot \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 0.5 \\ 0.8 \\ 0.2 \end{bmatrix}

What just happened? The multiplication by the one-hot vector [0, 1, 0, 0, 0] effectively picks out row 1 of \mathbf{W}_{\text{in}} (the second row, since we count from 0). So \mathbf{h} = [0.5, 0.8, 0.2] is simply the embedding for "fox".

Key insight: One-hot multiplication is just a fancy way of doing a table lookup! We're retrieving the embedding row for "fox".
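A quick NumPy check (using the toy values above) confirms that multiplying the transposed embedding matrix by a one-hot vector is just a row lookup:

python
import numpy as np

# Toy input embedding matrix from the example (5 words x 3 dimensions).
W_in = np.array([
    [0.1, 0.2, 0.3],   # row 0: "the"
    [0.5, 0.8, 0.2],   # row 1: "fox"
    [0.3, 0.1, 0.9],   # row 2: "quick"
    [0.4, 0.7, 0.1],   # row 3: "brown"
    [0.2, 0.6, 0.5],   # row 4: "jumps"
])

x = np.array([0, 1, 0, 0, 0])            # one-hot vector for "fox" (index 1)

h_matmul = W_in.T @ x                    # the matrix-multiplication route
h_lookup = W_in[1]                       # the table-lookup route

print(h_matmul)                          # [0.5 0.8 0.2]
print(np.allclose(h_matmul, h_lookup))   # True: one-hot multiplication == row lookup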

Step 3: Compute Probabilities for Context Words

Now we need to predict which words appear in the context. We have another matrix \mathbf{W}_{\text{out}} (also 5 \times 3) for output:

Note: If you are wondering where these values come from, read the section after this.

\mathbf{W}_{\text{out}} = \begin{bmatrix} 0.2 & 0.5 & 0.1 \\ 0.4 & 0.3 & 0.7 \\ 0.6 & 0.8 & 0.3 \\ 0.5 & 0.7 & 0.4 \\ 0.3 & 0.4 & 0.6 \end{bmatrix}

Where each row corresponds to: row 0 = "the", row 1 = "fox", row 2 = "quick", row 3 = "brown", row 4 = "jumps"

For each word, we compute its score by taking the dot product of its row in \mathbf{W}_{\text{out}} with \mathbf{h}:

Score for "the": [0.2,0.5,0.1][0.5,0.8,0.2]=0.2×0.5+0.5×0.8+0.1×0.2=0.1+0.4+0.02=0.52[0.2, 0.5, 0.1] \cdot [0.5, 0.8, 0.2] = 0.2 \times 0.5 + 0.5 \times 0.8 + 0.1 \times 0.2 = 0.1 + 0.4 + 0.02 = 0.52

Score for "fox": [0.4,0.3,0.7][0.5,0.8,0.2]=0.4×0.5+0.3×0.8+0.7×0.2=0.2+0.24+0.14=0.58[0.4, 0.3, 0.7] \cdot [0.5, 0.8, 0.2] = 0.4 \times 0.5 + 0.3 \times 0.8 + 0.7 \times 0.2 = 0.2 + 0.24 + 0.14 = 0.58

Score for "quick": [0.6,0.8,0.3][0.5,0.8,0.2]=0.6×0.5+0.8×0.8+0.3×0.2=0.3+0.64+0.06=1.0[0.6, 0.8, 0.3] \cdot [0.5, 0.8, 0.2] = 0.6 \times 0.5 + 0.8 \times 0.8 + 0.3 \times 0.2 = 0.3 + 0.64 + 0.06 = 1.0

Score for "brown": [0.5,0.7,0.4][0.5,0.8,0.2]=0.5×0.5+0.7×0.8+0.4×0.2=0.25+0.56+0.08=0.89[0.5, 0.7, 0.4] \cdot [0.5, 0.8, 0.2] = 0.5 \times 0.5 + 0.7 \times 0.8 + 0.4 \times 0.2 = 0.25 + 0.56 + 0.08 = 0.89

Score for "jumps": [0.3,0.4,0.6][0.5,0.8,0.2]=0.3×0.5+0.4×0.8+0.6×0.2=0.15+0.32+0.12=0.59[0.3, 0.4, 0.6] \cdot [0.5, 0.8, 0.2] = 0.3 \times 0.5 + 0.4 \times 0.8 + 0.6 \times 0.2 = 0.15 + 0.32 + 0.12 = 0.59

These scores tell us how compatible each word is with "fox". Higher scores mean the word is more likely to appear in fox's context.

Step 4: Convert Scores to Probabilities (Softmax)

We apply the softmax function to convert raw scores into probabilities that sum to 1:

P(w_i | \text{fox}) = \frac{\exp(\text{score}_i)}{\sum_{j=0}^{4} \exp(\text{score}_j)}

First, compute exponentials:

  • \exp(0.52) \approx 1.68
  • \exp(0.58) \approx 1.79
  • \exp(1.0) \approx 2.72
  • \exp(0.89) \approx 2.44
  • \exp(0.59) \approx 1.80

Sum: 1.68 + 1.79 + 2.72 + 2.44 + 1.80 = 10.43

Probabilities:

  • P("the""fox")=1.68/10.43=0.16P(\text{"the"} | \text{"fox"}) = 1.68 / 10.43 = 0.16 (16%)
  • P("fox""fox")=1.79/10.43=0.17P(\text{"fox"} | \text{"fox"}) = 1.79 / 10.43 = 0.17 (17%)
  • P("quick""fox")=2.72/10.43=0.26P(\text{"quick"} | \text{"fox"}) = 2.72 / 10.43 = 0.26 (26%) ← highest!
  • P("brown""fox")=2.44/10.43=0.23P(\text{"brown"} | \text{"fox"}) = 2.44 / 10.43 = 0.23 (23%)
  • P("jumps""fox")=1.80/10.43=0.17P(\text{"jumps"} | \text{"fox"}) = 1.80 / 10.43 = 0.17 (17%)

Interpretation: Given the word "fox", the model predicts:

  • 26% chance of seeing "quick" nearby (highest probability)
  • 23% chance of seeing "brown"
  • Lower chances for other words
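The entire forward pass fits in a few lines of NumPy. This sketch simply reproduces the hand calculation above with the same toy matrices:

python
import numpy as np

vocab = ["the", "fox", "quick", "brown", "jumps"]

W_in = np.array([[0.1, 0.2, 0.3],
                 [0.5, 0.8, 0.2],
                 [0.3, 0.1, 0.9],
                 [0.4, 0.7, 0.1],
                 [0.2, 0.6, 0.5]])

W_out = np.array([[0.2, 0.5, 0.1],
                  [0.4, 0.3, 0.7],
                  [0.6, 0.8, 0.3],
                  [0.5, 0.7, 0.4],
                  [0.3, 0.4, 0.6]])

h = W_in[vocab.index("fox")]                   # Step 2: embedding lookup -> [0.5, 0.8, 0.2]
scores = W_out @ h                             # Step 3: one dot product per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()  # Step 4: softmax

for word, score, p in zip(vocab, scores, probs):
    print(f"{word:6}  score = {score:.2f}   P(word | fox) = {p:.2f}")
# "quick" gets the highest probability (~0.26), matching the table above.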

Where Do These Embedding Values Come From?

You might be wondering: where did we get the numbers in \mathbf{W}_{\text{in}} (0.1, 0.2, 0.3, etc.)? Are they arbitrary?

The answer has three parts:

1. Initial Values: Random Initialization

When training starts, we don't know what good embeddings look like yet. So we initialize \mathbf{W}_{\text{in}} and \mathbf{W}_{\text{out}} with small random numbers (typically drawn from a uniform or normal distribution, like values between -0.5 and +0.5).

At this point, the embeddings are essentially meaningless: "fox" might have embedding [0.23, -0.41, 0.08] and "asteroid" might have [0.19, -0.38, 0.12], making them appear similar despite being completely unrelated!

2. Learning Through Training: Gradient Descent

As we process millions of text examples, the model makes predictions (like we just calculated) and compares them to the actual context words in the text:

  • If "fox" actually appears with "quick" in the training data, but the model only predicted 26% probability, the loss function says: "You were too uncertain! Increase this probability."
  • The model then adjusts the embeddings in \mathbf{W}_{\text{in}} and \mathbf{W}_{\text{out}} using backpropagation to make better predictions next time.

This adjustment happens through gradient descent: we compute how much each number in the matrices contributed to the error, then nudge those numbers in the direction that reduces the error.

3. Emergence of Semantic Meaning

After training on billions of word co-occurrences:

  • Words that appear in similar contexts (like "fox" and "wolf") naturally get pushed toward similar embedding vectors because they need to predict similar context words
  • Words that appear in different contexts (like "fox" and "asteroid") get pushed apart

The semantic structure emerges as a side effect of optimizing the prediction task! We never told the model that "fox" and "wolf" are similar animals — it discovered this by noticing they appear with similar words like {the, through, forest}.

The numbers in our tiny example above are made up for illustration. In real training, you'd start with random values and let gradient descent find the optimal embeddings through millions of updates.

How Training Works

If the actual context word is "quick", the model gets rewarded (low loss). If the actual context word is "the", the model gets penalized (high loss because it only predicted 16% probability).

Through many training examples, the model adjusts the numbers in \mathbf{W}_{\text{in}} and \mathbf{W}_{\text{out}} to make correct predictions more likely.

The General Formula

For real Skip-Gram with vocabulary size V and embedding dimension d:

Step 1 - Look up the embedding:

\mathbf{h} = \mathbf{W}_{\text{in}}^\top \mathbf{x}

Where:

  • \mathbf{x} is the one-hot input vector for our center word w_c
  • \mathbf{W}_{\text{in}} is a V \times d weight matrix (the embedding table)
  • \mathbf{h} is the d-dimensional embedding we just looked up

Step 2 - Compute probability for each context word w_o:

P(w_o | w_c) = \frac{\exp(\mathbf{v}_{w_o} \cdot \mathbf{h})}{\sum_{w=1}^{V} \exp(\mathbf{v}_w \cdot \mathbf{h})}

Where:

  • \mathbf{v}_{w_o} is row w_o from \mathbf{W}_{\text{out}} (the output embedding)
  • The numerator is the score for word w_o appearing in context
  • The denominator sums scores across all vocabulary words to normalize

The magic: After training on millions of examples, \mathbf{W}_{\text{in}} contains meaningful embeddings where similar words (like "fox" and "wolf") have similar vectors because they predict similar contexts!

Common Hyperparameters:

  • Window size: 5-10 (original papers), 2-5 (modern practice for efficiency)
  • Embedding dimensions: 300 (standard), 100-768 (task-dependent; smaller for speed, larger for complex semantics)
  • Negative samples: 5-20 (depends on corpus size; larger corpora use more negatives)
  • Learning rate: 0.025 (with linear decay)
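For reference, these hyperparameters map directly onto Gensim's Word2Vec implementation. The sketch below assumes Gensim 4.x (parameter names changed across versions) and a toy corpus, so treat it as illustrative rather than a recipe:

python
from gensim.models import Word2Vec

# Toy corpus; in practice this would be an iterable over millions of tokenized sentences.
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "gray", "timber", "wolf", "runs", "through", "the", "forest"],
]

model = Word2Vec(
    sentences,
    vector_size=300,   # embedding dimensions
    window=5,          # context window size (radius)
    sg=1,              # 1 = Skip-Gram, 0 = CBOW
    negative=5,        # number of negative samples per positive pair
    alpha=0.025,       # initial learning rate (decays linearly)
    min_count=1,       # keep every word in this tiny corpus
    epochs=5,
)

print(model.wv["fox"][:5])            # first few dimensions of the learned vector
print(model.wv.most_similar("fox"))   # nearest neighbours by cosine similarity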

Why Prediction Creates Semantic Embeddings

Here's the crucial question: Why does training a model to predict contexts force it to learn meaningful embeddings?

Let's think through what happens during training:

Day 1 of Training (Random Embeddings):

The model starts with random embeddings. When it sees "fox" in different sentences:

  • Sentence 1: "The quick brown fox jumps..." → context: {quick, brown, jumps}
  • Sentence 2: "The sly fox hunted..." → context: {sly, hunted}
  • Sentence 3: "An arctic fox lives..." → context: {arctic, lives}

With random embeddings, the model can't predict these contexts consistently. It makes wild guesses. High error.

What the model learns:

To reduce error, the model adjusts the embedding for "fox" to better predict ALL the contexts where "fox" appears. It learns an embedding that:

  • Has high probability for animal-related words (quick, sly, hunted)
  • Has high probability for motion verbs (jumps, lives, runs)
  • Has low probability for unrelated words (calculus, asteroid, philosophy)

The Key Realization:

When the model also sees "wolf":

  • Sentence: "The gray wolf runs through the forest"

It faces the same optimization problem! "Wolf" also appears with animal adjectives, motion verbs, and nature settings.

The Efficient Solution:

Instead of learning completely different embeddings for "fox" and "wolf", the model discovers it can reuse patterns. If it learns that dimension 1 represents "animal-ness" and dimension 2 represents "motion capability", both "fox" and "wolf" can have high values in these dimensions.

This sharing of learned features across similar words is what creates semantic clustering. Words that solve similar prediction tasks must have similar embeddings, not because we told the model they are related, but because it is the most efficient way to minimize prediction error across the entire corpus.

Push-Pull Geometry

During training, each (center, context) pair creates geometric forces:

  • Pull: If "fox" appears with "quick", gradient descent increases their dot product \mathbf{v}_{\text{fox}} \cdot \mathbf{v}_{\text{quick}}, pulling their embeddings closer (smaller angle).
  • Push: Simultaneously, the softmax denominator pushes down probabilities for all other words. Words that never co-occur with "fox" (like "asteroid", "calculus") have their dot products decreased, pushing them away.

Over millions of training examples, these push-pull forces converge to a stable geometry:

  • Positive pressure: Words that frequently co-occur are pulled together
  • Negative pressure: Words that never co-occur are pushed apart
  • Balance: The embedding space finds an equilibrium where dot products reflect co-occurrence statistics

This is why the final geometry encodes semantic relationships. The dot product \mathbf{v}_i \cdot \mathbf{v}_j approximates the Pointwise Mutual Information (PMI) between words i and j:

\mathbf{v}_i \cdot \mathbf{v}_j \approx \text{PMI}(i, j) = \log \frac{P(i, j)}{P(i) \cdot P(j)}

As we discussed in Part 1, PMI measures how much more likely two words co-occur than chance. Skip-Gram implicitly factorizes this statistical relationship into vector geometry.
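To make the PMI connection tangible, here is a small sketch that computes PMI from a toy co-occurrence table. The counts are invented purely for illustration; a real corpus would supply them from windowed co-occurrence statistics:

python
import math
from collections import Counter

# Invented co-occurrence counts within a context window (illustration only).
pair_counts = Counter({
    ("fox", "quick"): 40,
    ("fox", "forest"): 30,
    ("wolf", "forest"): 35,
    ("wolf", "gray"): 25,
    ("fox", "asteroid"): 1,
    ("asteroid", "space"): 50,
})

word_counts = Counter()
for (w1, w2), c in pair_counts.items():
    word_counts[w1] += c
    word_counts[w2] += c

total_pairs = sum(pair_counts.values())
total_words = sum(word_counts.values())

def pmi(w1, w2):
    """log [ P(w1, w2) / (P(w1) * P(w2)) ] estimated from the toy counts."""
    p_joint = pair_counts[(w1, w2)] / total_pairs
    p1 = word_counts[w1] / total_words
    p2 = word_counts[w2] / total_words
    return math.log(p_joint / (p1 * p2))

print(pmi("fox", "quick"))     # positive: they co-occur far more often than chance
print(pmi("fox", "asteroid"))  # negative: they co-occur less often than chance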

Figure: Skip-Gram creates push-pull forces: the center word "fox" is pulled toward its context words (quick, brown, jumps) and pushed away from non-context words (asteroid, calculus). After millions of updates, semantically related words cluster together.

Pseudocode

Here's the core Skip-Gram training loop (simplified, without negative sampling):

python
# Initialize embeddings randomly
W_in = random_matrix(vocab_size, embedding_dim)   # Input (center word) embeddings
W_out = random_matrix(vocab_size, embedding_dim)  # Output (context word) embeddings

for sentence in corpus:
    for center_position, center_word in enumerate(sentence):
        context_words = get_context(sentence, center_position, window_size)

        # Get center word embedding
        h = W_in[center_word]  # Shape: (embedding_dim,)

        for context_word in context_words:
            # Compute softmax probabilities over the whole vocabulary
            scores = dot(W_out, h)  # Shape: (vocab_size,)
            probs = softmax(scores)

            # Loss: negative log-likelihood of the observed context word
            loss = -log(probs[context_word])

            # Backprop: update W_in and W_out to increase P(context_word | center_word)
            gradients = compute_gradients(loss, W_in, W_out)
            W_in -= learning_rate * gradients.W_in
            W_out -= learning_rate * gradients.W_out

# Final embeddings: W_in (or the average of W_in and W_out)

The critical issue with the Skip-Gram approach is that computing the softmax denominator \sum_{w=1}^{V} \exp(\mathbf{v}_w \cdot \mathbf{h}) requires iterating over the entire vocabulary (50,000+ words) for every training example.

With billions of training pairs, this is computationally prohibitive. The solution to this problem brings us to the concept of negative sampling.

Negative Sampling

The softmax bottleneck is a fundamental problem in large-vocabulary language models. For every training example, we must:

  1. Compute scores for all V vocabulary words
  2. Exponentiate all scores
  3. Sum them (partition function)
  4. Divide to normalize

For V = 50,000 and embedding dimension d = 300, each softmax requires 15M multiplications. With billions of training examples, this is infeasible.

Why Softmax is Expensive

The computational cost comes from the normalization term:

P(w_o | w_c) = \frac{\exp(\mathbf{v}_{w_o} \cdot \mathbf{h})}{\sum_{w=1}^{V} \exp(\mathbf{v}_w \cdot \mathbf{h})}

The numerator \exp(\mathbf{v}_{w_o} \cdot \mathbf{h}) is cheap: one dot product, one exponentiation. The denominator requires computing \mathbf{v}_w \cdot \mathbf{h} for every word w in the vocabulary.

Why is the denominator necessary? Because it ensures the probabilities sum to 1: \sum_{w=1}^{V} P(w | w_c) = 1. Without normalization, we would have arbitrary scores, not probabilities.

But here's the key insight: we don't actually need proper probabilities for learning embeddings. We only need a signal that says: "This (center, context) pair should have higher score than random pairs."

Contrastive Learning Intuition

Instead of computing probabilities over the entire vocabulary, we can reframe the problem as binary classification: distinguish true (center, context) pairs from fake pairs.

For each true pair (w_c, w_o) from the corpus (e.g., "fox" and "quick"), we sample k negative examples: random words that don't appear in the context (e.g., "asteroid", "calculus", "piano").

The model learns to:

  • Assign high probability to the true pair: P(\text{true} | \text{fox}, \text{quick}) should be high
  • Assign low probability to fake pairs: P(\text{true} | \text{fox}, \text{asteroid}) should be low

This is contrastive learning: we learn by contrasting positive examples (real context) against negative examples (random noise).

The Objective Function: Breaking Down the Math

The negative sampling objective is:

\log \sigma(\mathbf{v}_{w_o} \cdot \mathbf{h}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-\mathbf{v}_{w_i} \cdot \mathbf{h}) \right]

This looks complex, but let's build intuition piece by piece.

Step 1: Binary Classification with Sigmoid

First, recall that we are doing binary classification: for each word pair, we ask, "Did these two words actually appear together in the text?"

Instead of softmax (which computes probabilities over all V words), we use the sigmoid function for each pair independently:

\sigma(x) = \frac{1}{1 + e^{-x}}

The sigmoid converts a score x (the dot product) into a probability between 0 and 1:

  • If x = 5 (large positive), \sigma(5) \approx 0.993 ≈ "very confident YES"
  • If x = 0 (neutral), \sigma(0) = 0.5 ≈ "uncertain"
  • If x = -5 (large negative), \sigma(-5) \approx 0.007 ≈ "very confident NO"

Key difference from softmax: Sigmoid doesn't need to sum over all vocabulary words. Each pair gets its own independent probability.

Step 2: The Positive Term (True Context)

For a true pair like ("fox", "quick") that appeared in the corpus:

\log \sigma(\mathbf{v}_{\text{quick}} \cdot \mathbf{h}_{\text{fox}})

What does this do?

  • \mathbf{v}_{\text{quick}} \cdot \mathbf{h}_{\text{fox}} is the dot product (higher = more similar)
  • \sigma(\ldots) converts it to a probability: "How likely is it that these words appeared together?"
  • \log(\ldots) converts to log-probability for numerical stability

Goal: Maximize this term → increase the dot product → pull "fox" and "quick" embeddings closer together

Step 3: The Negative Term (Random Words)

For a negative sample like ("fox", "asteroid") that did NOT appear together:

\log \sigma(-\mathbf{v}_{\text{asteroid}} \cdot \mathbf{h}_{\text{fox}})

Notice the negative sign before the dot product! Why?

  • \mathbf{v}_{\text{asteroid}} \cdot \mathbf{h}_{\text{fox}} might be positive (embeddings accidentally similar)
  • The negative sign flips it: -\mathbf{v}_{\text{asteroid}} \cdot \mathbf{h}_{\text{fox}}
  • \sigma(-\text{dot product}) = the probability that these words did NOT appear together

Goal: Maximize \log \sigma(-\text{dot product}) → make the dot product more negative → push "fox" and "asteroid" apart

Step 4: Putting It Together

The full objective for one training example with k negative samples:

\underbrace{\log \sigma(\mathbf{v}_{w_o} \cdot \mathbf{h})}_{\text{reward if positive pair has high dot product}} + \underbrace{\sum_{i=1}^{k} \log \sigma(-\mathbf{v}_{w_i} \cdot \mathbf{h})}_{\text{reward if negative pairs have low dot product}}

Where:

  • w_o is the true context word (1 positive example)
  • w_i are the k randomly sampled negative words
  • P_n(w) is the sampling distribution (smoothed unigram, explained below)

A Concrete Example

Suppose we are training on the pair ("fox", "quick") with k = 2 negative samples: "asteroid" and "piano".

Current dot products (before training):

  • \mathbf{v}_{\text{quick}} \cdot \mathbf{h}_{\text{fox}} = 1.5 → \sigma(1.5) = 0.82 → \log(0.82) = -0.20
  • \mathbf{v}_{\text{asteroid}} \cdot \mathbf{h}_{\text{fox}} = 0.5 → \sigma(-0.5) = 0.38 → \log(0.38) = -0.97
  • \mathbf{v}_{\text{piano}} \cdot \mathbf{h}_{\text{fox}} = 0.3 → \sigma(-0.3) = 0.43 → \log(0.43) = -0.84

Objective value: -0.20 + (-0.97) + (-0.84) = -2.01

What gradient descent does:

  1. Increase \mathbf{v}_{\text{quick}} \cdot \mathbf{h}_{\text{fox}} (currently 1.5) → maybe adjust to 1.8
  2. Decrease \mathbf{v}_{\text{asteroid}} \cdot \mathbf{h}_{\text{fox}} (currently 0.5) → maybe adjust to 0.2
  3. Decrease \mathbf{v}_{\text{piano}} \cdot \mathbf{h}_{\text{fox}} (currently 0.3) → maybe adjust to 0.0

After one gradient update:

  • \mathbf{v}_{\text{quick}} \cdot \mathbf{h}_{\text{fox}} = 1.8 → \sigma(1.8) = 0.86 → \log(0.86) = -0.15
  • \mathbf{v}_{\text{asteroid}} \cdot \mathbf{h}_{\text{fox}} = 0.2 → \sigma(-0.2) = 0.45 → \log(0.45) = -0.80
  • \mathbf{v}_{\text{piano}} \cdot \mathbf{h}_{\text{fox}} = 0.0 → \sigma(0.0) = 0.50 → \log(0.50) = -0.69

New objective: -0.15 + (-0.80) + (-0.69) = -1.64 (improved from -2.01!)

The model successfully:

  • Pulled "fox" and "quick" closer (dot product increased)
  • Pushed "fox" away from "asteroid" and "piano" (dot products decreased)

Why Logarithms?

Two reasons:

  1. Numerical stability: Probabilities can be very small (e.g., 0.0001). Logs convert them to manageable numbers.
  2. Gradient behavior: Logarithm amplifies gradients when predictions are wrong (high learning signal) and reduces them when predictions are correct (low learning signal). This makes training more efficient.

Why This Works Better Than Softmax

Softmax for vocabulary size V = 50,000:

  • Compute 50,000 dot products
  • Exponentiate and sum all of them
  • Total: ~50,000 dot products per training example

Negative sampling with k = 5:

  • Compute 1 + 5 = 6 dot products (1 positive + 5 negatives)
  • No expensive sum over the vocabulary
  • Total: ~6 dot products per training example

Speedup: \frac{50,000}{6} \approx 8,300 \times faster!

Insight: We don't need to know the exact probability of every word in the vocabulary. We just need to know that the true context word scores higher than random words. This relative ranking is enough to learn good embeddings.
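To show what a single negative-sampling update actually does to the vectors, here is a hedged sketch of one gradient step on the objective above, with made-up random vectors standing in for "fox", "quick", and "asteroid". It is not Word2Vec's reference implementation, just the update rule written out:

python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def objective(h, u_pos, u_negs):
    """log sigma(u_pos . h) + sum_i log sigma(-u_neg_i . h)"""
    return np.log(sigmoid(u_pos @ h)) + sum(np.log(sigmoid(-u @ h)) for u in u_negs)

def sgns_update(h, u_pos, u_negs, lr=0.1):
    """One gradient-ascent step of Skip-Gram with negative sampling for a single pair."""
    g_pos = 1.0 - sigmoid(u_pos @ h)           # how far the positive pair is from "yes"
    g_negs = [sigmoid(u @ h) for u in u_negs]  # how much each negative wrongly looks like "yes"

    new_h = h + lr * (g_pos * u_pos - sum(g * u for g, u in zip(g_negs, u_negs)))
    new_u_pos = u_pos + lr * g_pos * h                             # pulled toward h
    new_u_negs = [u - lr * g * h for g, u in zip(g_negs, u_negs)]  # pushed away from h
    return new_h, new_u_pos, new_u_negs

rng = np.random.default_rng(0)
h = rng.normal(size=3)        # center word, e.g. "fox"
u_quick = rng.normal(size=3)  # true context word
u_ast = rng.normal(size=3)    # sampled negative word

print(objective(h, u_quick, [u_ast]))
h, u_quick, (u_ast,) = sgns_update(h, u_quick, [u_ast])
print(objective(h, u_quick, [u_ast]))  # the objective should be higher after the step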

True vs Fake Pairs in Embedding Space

Geometrically, negative sampling creates clear separation:

Positive pairs (true context):

  • (w_c, w_o) appears in the corpus → increase \mathbf{v}_{w_c} \cdot \mathbf{v}_{w_o} → pull the vectors together → small angle

Negative pairs (random words):

  • (w_c, w_{\text{neg}}) doesn't appear in the corpus → decrease \mathbf{v}_{w_c} \cdot \mathbf{v}_{\text{neg}} → push the vectors apart → large angle

Over many iterations, embeddings converge to a state where:

  • Related words cluster together (high dot product)
  • Unrelated words are far apart (low or negative dot product)
  • The geometry reflects corpus statistics, not random initialization

Critically, we only compute k + 1 dot products per training example (1 positive + k negatives), instead of V dot products for the full softmax. With k = 5 and V = 50,000, that is a speedup of roughly 8,000×.

Choosing Negative Samples

Now, the question arises: how do we sample negative words?

The naive approach is to sample uniformly at random from the vocabulary. But this is suboptimal for the following reasons:

  1. Rare words are over-sampled (every word has equal probability)
  2. Common words like "the", "and" are under-sampled
  3. The model wastes time learning to push away extremely rare words that never co-occur anyway

Mikolov's original Word2Vec uses a smoothed unigram distribution:

P_n(w) = \frac{f(w)^{0.75}}{\sum_{w'} f(w')^{0.75}}

Where f(w) is the frequency of word w in the corpus. The 0.75 exponent smooths the distribution: it reduces the probability of very common words and increases the probability of rare words, balancing the negative sampling.

Why 0.75 specifically? This was found empirically to work better than the extremes:

  • Exponent = 1.0 (unsmoothed): over-samples common words like "the" and "and", wasting computation on obvious negatives
  • Exponent = 0.5: over-samples rare words, creating too-easy negatives that don't provide useful signal
  • Exponent = 0.75: the sweet spot that balances frequent and rare words, creating informative negative examples

This ensures the model learns meaningful contrasts: distinguishing true context from plausible-but-wrong alternatives, rather than from nonsense words.
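A minimal sketch of the smoothed distribution with toy frequencies (real implementations typically precompute a large sampling table for speed, but the idea is the same):

python
import random
from collections import Counter

# Toy corpus frequencies (illustrative counts only).
freqs = Counter({"the": 50000, "fox": 300, "quick": 800, "asteroid": 40, "quokka": 5})

def noise_distribution(freqs, power=0.75):
    """Smoothed unigram distribution P_n(w) = f(w)^0.75 / sum_w' f(w')^0.75."""
    weights = {w: f ** power for w, f in freqs.items()}
    total = sum(weights.values())
    return {w: wt / total for w, wt in weights.items()}

P_n = noise_distribution(freqs)
for w, p in sorted(P_n.items(), key=lambda kv: -kv[1]):
    print(f"{w:10} raw freq: {freqs[w]:6}   P_n: {p:.4f}")
# "the" is down-weighted relative to its raw frequency; rare words are up-weighted.

# Draw k = 5 negative samples for one (center, context) pair.
words, probs = zip(*P_n.items())
print(random.choices(words, weights=probs, k=5))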

Figure: Negative sampling. The positive pair (fox, quick) is pulled together; the negative pairs (fox, asteroid) and (fox, calculus) are pushed apart. After training, the embedding space separates semantically related words from unrelated words.

CBOW: Predicting the Word from Context

Continuous Bag-of-Words (CBOW) inverts the Skip-Gram formulation. Here, instead of predicting context from a word, we predict the word from context.

Inverting the Prediction Task

Given the context words \{w_{c-k}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+k}\}, CBOW predicts the center word w_c.

For our "fox" example:

plaintext
Context: {quick, brown, jumps, over}
Target:  fox

The model must learn: "Given that the surrounding words are {quick, brown, jumps, over}, what word is in the center?"

This is a classification problem: predict 1-of-V words given the context.

Averaging Context Embeddings

The key design choice in CBOW: how do we combine multiple context words into a single representation? The simplest approach is to average their embeddings.

\mathbf{h} = \frac{1}{2k} \sum_{-k \leq j \leq k,\ j \neq 0} \mathbf{v}_{w_{c+j}}

Where:

  • \mathbf{v}_{w_{c+j}} is the embedding of the context word at position c+j
  • 2k is the total number of context words (k on each side)
  • \mathbf{h} is the averaged context representation

For "fox":

\mathbf{h} = \frac{1}{4} \left( \mathbf{v}_{\text{quick}} + \mathbf{v}_{\text{brown}} + \mathbf{v}_{\text{jumps}} + \mathbf{v}_{\text{over}} \right)

This averaged vector \mathbf{h} becomes the input to the output layer, which predicts the center word:

P(w_c | \text{context}) = \frac{\exp(\mathbf{u}_{w_c} \cdot \mathbf{h})}{\sum_{w=1}^{V} \exp(\mathbf{u}_w \cdot \mathbf{h})}

Where \mathbf{u}_{w_c} is the output embedding for the center word w_c.

Like Skip-Gram, CBOW uses negative sampling to avoid computing the full softmax.
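A short sketch of the CBOW forward pass, reusing the illustrative toy matrices from the Skip-Gram example (only the context words that exist in that 5-word vocabulary are used; with untrained values the output distribution is nearly uniform):

python
import numpy as np

vocab = ["the", "fox", "quick", "brown", "jumps"]

# Toy context (input) and center (output) embedding tables, as in the Skip-Gram example.
W_in = np.array([[0.1, 0.2, 0.3],
                 [0.5, 0.8, 0.2],
                 [0.3, 0.1, 0.9],
                 [0.4, 0.7, 0.1],
                 [0.2, 0.6, 0.5]])
W_out = np.array([[0.2, 0.5, 0.1],
                  [0.4, 0.3, 0.7],
                  [0.6, 0.8, 0.3],
                  [0.5, 0.7, 0.4],
                  [0.3, 0.4, 0.6]])

context = ["quick", "brown", "jumps"]                     # "over" is not in the toy vocabulary
h = W_in[[vocab.index(w) for w in context]].mean(axis=0)  # average the context embeddings

scores = W_out @ h
probs = np.exp(scores) / np.exp(scores).sum()             # softmax over candidate center words
print(dict(zip(vocab, probs.round(2))))                   # P(center word | context); training would raise P("fox")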

Why CBOW Converges Faster

CBOW typically trains faster than Skip-Gram for several reasons:

1. Fewer Updates Per Training Example

For a window size k, Skip-Gram generates 2k training examples (one for each context word):

plaintext
Skip-Gram pairs from center "fox" with context {quick, brown, jumps, over}:

(fox → quick)
(fox → brown)
(fox → jumps)
(fox → over)

CBOW generates a single training example:

plaintext
CBOW: ({quick, brown, jumps, over} → fox)

This means Skip-Gram performs 2k gradient updates per window position, while CBOW performs just one. For large corpora, this 4-8× difference significantly accelerates training.

2. Smoothing Effect of Averaging

Averaging context embeddings creates a smoothing effect. Individual words might be noisy (ambiguous, rare, or polysemous), but their average is more stable.

Consider predicting the center word from context {the, ___, barked, loudly}. The word "the" is extremely common and appears in countless contexts, providing minimal information. But combined with "barked" and "loudly", the average embedding strongly suggests an animal noun. CBOW automatically learns to weight informative context words higher (through the learned embeddings) while diluting noisy words in the average.

This smoothing stabilizes gradients, allowing larger learning rates and faster convergence.

3. Better Performance on Frequent Words

CBOW learns better representations for frequent words because it sees them as targets many times. A common word like "dog" appears in millions of different contexts. Each time, CBOW updates its embedding to be predictable from diverse contexts, creating a robust, generalized representation.

Skip-Gram, in contrast, learns better representations for rare words. When a rare word appears, Skip-Gram treats it as the center and predicts its context, getting strong gradients that update the rare word's embedding. CBOW averages rare words into the context, diluting their signal.

This leads to a key trade-off, which we will explore in the next section.

Figure: CBOW averages the context word embeddings (quick, brown, jumps, over) into a single vector h, then predicts the center word "fox". The averaging smooths noise and stabilizes training.

Pseudocode

python
# Initialize embeddings randomly
W_in = random_matrix(vocab_size, embedding_dim)   # Context word embeddings
W_out = random_matrix(vocab_size, embedding_dim)  # Center word embeddings

for sentence in corpus:
    for center_position, center_word in enumerate(sentence):
        context_words = get_context(sentence, center_position, window_size)

        # Average context word embeddings
        h = mean([W_in[w] for w in context_words])  # Shape: (embedding_dim,)

        # Predict center word (with negative sampling)
        positive_score = dot(W_out[center_word], h)
        positive_loss = -log(sigmoid(positive_score))

        # Sample k negative words
        negative_words = sample_negatives(k, noise_distribution)
        negative_loss = 0
        for neg_word in negative_words:
            neg_score = dot(W_out[neg_word], h)
            negative_loss += -log(sigmoid(-neg_score))

        total_loss = positive_loss + negative_loss

        # Backprop: update W_in (context embeddings) and W_out (center embeddings)
        gradients = compute_gradients(total_loss, W_in, W_out)
        W_in -= learning_rate * gradients.W_in
        W_out -= learning_rate * gradients.W_out

# Final embeddings: W_in (or the average of W_in and W_out)

Skip-Gram vs CBOW

Both Skip-Gram and CBOW learn embeddings from local context windows, but their formulations create different inductive biases. Understanding when to use each requires examining their trade-offs.

When Each Works Better

The choice depends on our corpus characteristics and downstream task:

| Criterion | Skip-Gram | CBOW |
| --- | --- | --- |
| Training speed | Slower (2k updates per position) | Faster (1 update per position) |
| Rare word quality | Better (rare words as center get strong gradients) | Worse (rare words in context are averaged out) |
| Frequent word quality | Worse (common words diluted across many contexts) | Better (frequent targets updated many times) |
| Small corpus | Preferred (better data efficiency) | Acceptable |
| Large corpus | Slower but higher quality | Much faster, still good quality |
| Syntactic tasks | Moderate (focuses on individual word contexts) | Better (averaging captures syntactic patterns) |
| Semantic tasks | Better (sharp distinctions between rare words) | Moderate (smoothed representations) |
| Example: Medical NER | ✓ Better (rare disease names need sharp embeddings) | Acceptable (may conflate similar rare terms) |
| Example: Sentiment Analysis | Acceptable (common sentiment words well-represented) | ✓ Better (frequent words like "good", "bad" dominate) |
| Example: Search/Retrieval | ✓ Better (precise matching of rare entity names) | Good (robust for common query terms) |

Sharp vs Smooth Embeddings

The averaging in CBOW creates smooth embeddings: representations that blend multiple contextual signals. This is beneficial for frequent words, which appear in diverse contexts and benefit from aggregation.

Skip-Gram creates sharp embeddings: representations that preserve fine-grained distinctions. This is beneficial for rare words, which appear in limited contexts and need strong signal from each occurrence.

Empirically, Skip-Gram embeddings tend to have higher variance: large differences between similar-but-not-identical words. CBOW embeddings have lower variance: similar words are tightly clustered.

For tasks requiring fine-grained semantic distinctions (e.g., identifying subtle differences between synonyms), Skip-Gram often performs better.

For tasks requiring robust generalizations (e.g., part-of-speech tagging, where all verbs should cluster), CBOW often performs better.

Rare vs Frequent Words

This is the most important trade-off. Let's make it concrete with examples.

Skip-Gram Favors Rare Words

Consider the rare word "quokka" (a small marsupial). It might appear only 100 times in a large corpus, always in contexts like:

  • "The quokka is a small marsupial native to Australia."
  • "Quokkas are related to kangaroos and wallabies."

In Skip-Gram, when "quokka" is the center word, we predict {small, marsupial, Australia, kangaroos}. The gradient flows directly into the "quokka" embedding, giving it a strong update even from limited data. After 100 occurrences, "quokka" has a well-trained embedding that clusters near {kangaroo, wallaby, marsupial}.

In CBOW, "quokka" might appear in the context, but it is averaged with other words:

  • Context: {the, is, a, small, quokka, native, to, Australia}
  • Average: (the + is + a + small + quokka + native + to + Australia) / 8

The "quokka" signal is diluted by common words like "the", "is", "a". The gradient to "quokka" is weaker. After 100 occurrences, its embedding is less distinctive.

CBOW Favors Frequent Words

Consider the common word "dog", appearing 1 million times in diverse contexts:

  • "The dog barked loudly."
  • "She adopted a friendly dog."
  • "Dogs are loyal companions."

In CBOW, every time "dog" is the center word, its embedding gets updated to be predictable from that specific context. After 1 million updates from diverse contexts, "dog" has a robust, generalized embedding that captures all its typical usages.

In Skip-Gram, "dog" appears as the center word, predicting context words like {barked, adopted, loyal, companions}. But these contexts are diverse. The embedding must simultaneously predict "barked" (action), "adopted" (acquisition), "loyal" (attribute). The gradients pull in many directions, potentially creating a less focused representation.

For very frequent words, CBOW's smoothing stabilizes the learning, while Skip-Gram's sharp updates can create noisy embeddings.

Practical Recommendations

Based on these trade-offs, here are practical guidelines:

Use Skip-Gram when:

  • You have a small corpus (< 100 million words)
  • Rare words are important for your task (e.g., medical NER extracting rare disease names, legal document retrieval matching obscure case citations, scientific literature search with specialized terminology)
  • You need fine-grained semantic distinctions
  • You can afford longer training time
  • Your downstream task benefits from sharp, distinctive embeddings

Use CBOW when:

  • You have a large corpus (> 100 million words)
  • Training speed is critical
  • Frequent words are more important than rare words (e.g., sentiment analysis where common words like "excellent", "terrible" carry most signal, spam classification detecting common spam patterns, POS tagging where syntactic categories matter more than rare vocabulary)
  • You need robust, stable embeddings for syntactic tasks
  • Your downstream task benefits from smooth, generalized embeddings

Modern practice: Most production systems use Skip-Gram for quality, even though CBOW is faster. The quality gain for rare words outweighs the speed cost. However, for truly massive corpora (billions of words), CBOW's speed advantage becomes significant.

Let's visualize how loss decreases during training for both models (conceptual, not real data):

plaintext
Training Loss over Time

Loss
  |
  |  CBOW ----___
  |              ----___
  |                     ----___
  |  Skip-Gram                 ----___
  |     -----______                   ----___
  |                ------______              ----___
  |                            ------_______----___
  |__________________________________________________ Iterations
  0                                                1M

CBOW:       Faster initial convergence due to fewer updates per example
            Plateaus earlier at slightly higher loss (smoothing effect)

Skip-Gram:  Slower initial convergence (more updates per example)
            Continues improving longer (sharper gradients for rare words)
            Lower final loss (better fit to data)

CBOW reaches "good enough" embeddings quickly. Skip-Gram reaches "better" embeddings slowly. The choice depends on your resource constraints and quality requirements.

Key Takeaways

Having looked at both Word2Vec formulations, let's summarize the key takeaways:

Context Windows Create Training Signal

  • Sliding windows over text extract (center, context) pairs
  • Local co-occurrence becomes supervision for embedding learning
  • Bag-of-words assumption: order within context is initially ignored
  • Millions of training pairs provide rich distributional signal

Skip-Gram (Center → Context)

  • Predict each context word independently given center word
  • Creates 2k training examples per window position
  • Better for rare words (direct gradients to rare center words)
  • Sharper embeddings with fine-grained distinctions
  • Slower training but higher quality

Negative Sampling

  • Replaces expensive softmax with contrastive learning
  • Binary classification: true pairs vs. random pairs
  • k negative samples per positive example (typically k = 5-20)
  • Roughly 8,000× speedup for a 50,000-word vocabulary with k = 5
  • Smoothed unigram distribution for sampling negatives

CBOW (Context → Center)

  • Predict center word from averaged context embeddings
  • Single training example per window position (4-8× faster)
  • Better for frequent words (robust from many updates)
  • Smoother embeddings with stable generalizations
  • Faster training, slightly lower quality for rare words

The Geometry of Learning

  • Push-pull forces: co-occurring words pulled together, non-co-occurring pushed apart
  • Dot product approximates PMI after convergence
  • Semantic similarity emerges from distributional similarity
  • No explicit supervision for semantics—learned as side effect of prediction

Conclusion

We have answered the question: Where do embeddings come from?

They emerge from self-supervised learning on local context windows. By training a shallow neural network to predict context from words (Skip-Gram) or words from context (CBOW), we force the model to discover distributional patterns. The learned embeddings encode co-occurrence statistics in geometric form: similar words cluster together because they solve similar prediction tasks.

Negative sampling makes this computationally tractable, replacing expensive softmax normalization with contrastive learning against random samples.

But Word2Vec has fundamental limitations:

  1. Local context only: Only words within a small window (typically 5-10) contribute to learning. Long-range dependencies and document-level semantics are ignored.
  2. Bag-of-words: Word order is discarded, losing syntactic structure (can't distinguish "dog chased cat" from "cat chased dog").
  3. Static embeddings: Each word gets a single vector, conflating all senses ("bank" as institution and "bank" as river edge get the same representation).

These limitations motivated the next generation of embedding methods:

GloVe (Part 3): Instead of local windows, use global co-occurrence statistics across the entire corpus, combining the benefits of matrix factorization (LSA) with predictive models (Word2Vec).

Contextual Embeddings (Part 4): Instead of static vectors, compute dynamic representations based on sentence context:

  • ELMo (2018): Bidirectional LSTMs generate context-dependent embeddings
  • BERT (2018): Transformer encoders with masked language modeling; attention mechanisms capture long-range dependencies
  • GPT (2018+): Transformer decoders with autoregressive prediction; positional encodings preserve word order

These modern architectures address Word2Vec's limitations (dynamic representations for polysemy, attention for long-range dependencies, positional encodings for word order) while building on its core insight: prediction tasks drive semantic learning.

Word2Vec was a breakthrough, proving that simple prediction tasks could learn rich semantic structure. But it's just the beginning of the embedding story.

See you in the next post. Namaste!


References

Foundational Papers

Word2Vec (Skip-Gram and CBOW) - Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv preprint.

Distributed Representations of Words and Phrases - Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. NeurIPS 2013.

Theoretical Analysis

Word2Vec Explained - Goldberg, Y., & Levy, O. (2014). word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint.

Neural Word Embedding as Implicit Matrix Factorization - Levy, O., & Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization. NeurIPS 2014.

Improving Distributional Similarity - Levy, O., Goldberg, Y., & Dagan, I. (2015). Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the ACL.

Practical Guides

Word2Vec Tutorial - McCormick, C. (2016). Word2Vec Tutorial - The Skip-Gram Model.

Gensim Word2Vec Implementation - Řehůřek, R. Gensim: Word2Vec. Production-quality Python implementation.

Negative Sampling Intuition - Understanding Negative Sampling in Word2Vec. Detailed walkthrough with examples.

Contrastive Learning

Noise Contrastive Estimation - Gutmann, M., & Hyvärinen, A. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. AISTATS 2010.

  • Theoretical foundation for negative sampling

Contrastive Learning Survey - Le-Khac, P. H., Healy, G., & Smeaton, A. F. (2020). Contrastive Representation Learning: A Framework and Review. IEEE Access.

Embeddings Evolution

ELMo (Contextual Embeddings) - Peters, M. E., et al. (2018). Deep contextualized word representations. NAACL 2018.

BERT - Devlin, J., et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint.

Historical Context

Neural Language Models - Bengio, Y., et al. (2003). A Neural Probabilistic Language Model. JMLR 2003.

  • Early neural language model that learned word embeddings as a byproduct

Collobert & Weston (2008) - A Unified Architecture for Natural Language Processing. ICML 2008.

Written by Anirudh Sharma

Published on December 26, 2025
