34 min read · By Anirudh Sharma

Understanding Embeddings: Part 1 - Why Do Embeddings Exist?


This is Part 1 of a 4-part series on Embeddings:



After demystifying Tokenization in the previous three posts (1, 2, 3), now is the time to take the next logical step: learning Embeddings from first principles.

To understand the relationships and meanings of complex data such as text, images, sounds, etc., we require numerical representations known as embeddings.

These representations allow machine learning models to process and understand the complex data by placing similar items closer together in a mathematical space.

This definition may seem abstract (and it is), but as we go deeper down this rabbit hole, its meaning and necessity will become clear.

The Fundamental Problem

We all know that computers don't understand language; they operate on numbers - binary representations of electrical states.

When we type the word "dog" in a text file, the computer doesn't understand mammal, pet, or loyalty. It stores the UTF-8 byte sequence [100, 111, 103] representing the characters 'd', 'o', 'g'. These bytes are just indices into a lookup table. They are arbitrary symbols with no intrinsic connection to the concept they represent. In the above example, just by looking at the numbers [100, 111, 103], we cannot understand the concept of "dog".

This is the Symbol Grounding Problem, formalized by cognitive scientist Stevan Harnad. It asks: how can the semantic interpretation of a formal symbol system be made intrinsic to the system, rather than just parasitic on the meanings in our minds?

Harnad illustrated this with a powerful thought experiment: imagine trying to learn Chinese from a Chinese-Chinese dictionary alone. We go from one meaningless symbol to another in an endless "symbolic merry-go-round" where nothing ever grounds to actual meaning.

In the Tokenization series, we saw how text becomes integers: "dog" → [4521]. This solves the computational representation problem because now we have a discrete unit a computer can process.

However, the token [4521] is just as semantically empty as the bytes [100, 111, 103]. The integer 4521 doesn't tell us that dogs bark, have four legs, or are related to wolves. It is just an arbitrary index into a vocabulary table.

Natural language models need more than symbolic lookup. They need representations that capture relationships. We need to know that a "dog" is semantically closer to a "cat" than to an asteroid, that "king" is related to "queen" in the same way "man" relates to "woman". Pure symbols can't express all of these nuances.

We need a mathematical structure where similarity is measurable and where semantic relationships have geometric interpretations.

The Failure of Integers and One-Hot Vectors

The naive solution is integer encoding, where we assign each word a unique integer, for example:

plaintext
"dog"      = 4521
"cat"      = 2387
"asteroid" = 9876

This is exactly what tokenization gives us. But these integers impose false mathematical relationships. In standard arithmetic, 4521 < 9876, suggesting "dog" is somehow "less than" "asteroid". Or if we compute the mean of "dog" and "cat", (4521 + 2387) / 2 = 3454, we'd get... what? Some word that's "between" dog and cat? This is nonsense. The ordering is arbitrary; there's no natural sequence where "dog" comes before "cat" (see: Ordinal and One-Hot Encodings for Categorical Data).

Linear models trained on integer-encoded features assume this ordering is meaningful. If we use integers as inputs to a neural network, the model learns that multiplying "dog" by a weight treats 4521 as a magnitude. But word identity isn't a quantity; it's a category. The arbitrary assignment of "dog" = 4521 versus "dog" = 1000 would produce completely different learned weights, even though the semantic content is identical.

The standard solution in ML is One-Hot Encoding: represent each word as a vector of zeroes with a single 1 at the word's index position. For a vocabulary of size V, this gives:

plaintext
"dog"      = [0, 0, 0, ..., 1, ..., 0]   (1 at position 4521)
"cat"      = [0, 0, ..., 1, ..., 0, 0]   (1 at position 2387)
"asteroid" = [0, 0, ..., 0, 0, ..., 1]   (1 at position 9876)

This solves the false ordering problem; there's no magnitude relationship between these vectors. But it creates three devastating new problems:

Problem 1: Orthogonality means zero similarity everywhere

One-hot vectors are mathematically orthogonal, which means the dot product between any two distinct vectors is always zero because they share no active dimension. The dot product v_dog · v_cat = 0 is identical to v_dog · v_asteroid = 0. Geometrically, all words are equidistant, pointing in completely different directions in V-dimensional space, like axes in a coordinate system.

This means there is no measurable similarity between any two words. Words "dog" and "cat" are as semantically distant as "dog" and "asteroid". The representation contains zero information about semantic relationships. Every word is an island.
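To make this concrete, here's a minimal sketch with a toy 10-word vocabulary (the indices are made up for illustration); every pair of distinct one-hot vectors scores exactly zero:

python
import numpy as np

VOCAB_SIZE = 10  # toy vocabulary; real LLM vocabularies are ~50,000
word_to_index = {"dog": 3, "cat": 5, "asteroid": 8}  # arbitrary toy indices

def one_hot(word: str) -> np.ndarray:
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(VOCAB_SIZE)
    vec[word_to_index[word]] = 1.0
    return vec

dog, cat, asteroid = one_hot("dog"), one_hot("cat"), one_hot("asteroid")

# Every pair of distinct one-hot vectors has dot product 0:
print(dog @ cat)       # 0.0 -> "dog" vs "cat": no measurable similarity
print(dog @ asteroid)  # 0.0 -> "dog" vs "asteroid": exactly the same score
print(dog @ dog)       # 1.0 -> only identity gives a non-zero score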

[Figure: One-hot vectors are orthogonal: all words are equidistant with zero dot product. No semantic similarity can be measured.]

Problem 2: The curse of dimensionality

For a vocabulary of 50,000 words (typical for modern LLMs), each one-hot vector has 50,000 dimensions (a single one and 49,999 zeroes). This is extremely sparse. At 4 bytes per float, storing one vector requires 200 KB. A single sentence with 20 words requires 4 MB just for one-hot representations, before any computation begins.

Training neural networks on such high-dimensional, sparse data is computationally prohibitive. The first layer of the network has to learn a 50,000 × hidden_dim weight matrix. For a 512-dimensional hidden layer, that's a whopping 25.6 million parameters just to transform the input.

On top of that, this layer learns almost nothing useful because the input vectors contain no semantic structure to exploit.

Problem 3: No generalization

When the model sees "puppy" (a word that might not be in the training vocabulary), there's no way to infer it's related to "dog". Each word is a completely isolated dimension. Rare words appear infrequently during training, so their corresponding weights barely update. The model can't transfer knowledge from "dog" to "puppy" because their one-hot vectors share zero dimensions.

Recall from our Tokenization series: subword tokenization (BPE, WordPiece) solves the out-of-vocabulary problem by decomposing rare words into familiar fragments (e.g., "playing" → ["play", "ing"]). One-hot encoding has no such mechanism. It's all or nothing: either the word is in the vocabulary or it's completely unseen.

Similarity Is Not Equality

The fundamental issue is that semantic similarity is not a binary property. In formal logic, two symbols are either identical (TRUE) or different (FALSE). But in natural language, similarity is continuous and graded:

  • "dog" and "puppy" are very similar (both canines, related by age)
  • "dog" and "cat" are moderately similar (both pets, both mammals)
  • "dog" and "wolf" are somewhat similar (same family, different domestication)
  • "dog" and "asteroid" are completely dissimilar (one is animate, one is celestial)

One-hot vectors can only express equality (v_dog · v_dog = 1) versus inequality (v_dog · v_puppy = 0). There is absolutely no way to encode that "dog" is more similar to "puppy" than to "asteroid".

Thus, to overcome this problem, what we need is a representation where:

1. Similarity is measurable (a continuous value, not binary)

2. Related words have high similarity (numerically quantifiable)

3. Unrelated words have low similarity

4. The representation is dense (low-dimensional, not 50,000-sparse dimensions)

5. Rare words can generalize from related common words

This is where distributed representations come in: vectors where every dimension contributes to the meaning, and similar words have similar patterns across dimensions.

But before we can build such representations, we need a theory of where meaning comes from.

Recall from Tokenization: Part 1: tokenization is a lossy compression where common patterns are preserved and rare patterns are fragmented. Embeddings face the same trade-off: we compress an infinite space of possible meanings into a finite vector space (typically 768-1536 dimensions).

The compression is semantic: instead of preserving raw frequencies, we preserve relationships among the words.

Meaning Comes From Usage

If symbols don't have intrinsic meaning, where does the meaning come from?

The answer emerged from mid-20th-century linguistics: meaning is usage. This is the distributional hypothesis, independently formulated by two linguists in the 1950s: Zellig Harris and J. R. Firth.

Harris observed that words that occur in the same contexts tend to have similar meanings. If we see the sentence "The ___ chased the ball", we know that the blank could be "dog", "cat" or "puppy" but probably not "asteroid" or "philosophy". Clearly, words that fit the same linguistic environments share semantic properties.

Harris argued that distribution reflects deeper semantic relationships - that difference of meaning correlates with difference of distribution.

Firth approached this from a different angle. He emphasized that context includes not just neighboring words, but broader situational and cultural factors.

His famous formulation: "You shall know a word by the company it keeps."

If "bank" appears near "river", "water" and "erosion", it means one thing, but if it appears near "loan", "interest" and "deposit", it means something else.

The surrounding linguistic environment disambiguates and defines meaning.

While both linguists are cited in every NLP textbook, their theories diverge in important ways. Harris focused on internal linguistic forms: the structural distribution of words within texts. Firth emphasized external situational context: how words relate to cultural knowledge and real-world situations. Harris's approach is syntagmatic (what words appear next to each other). Firth's approach is paradigmatic (what words can substitute for each other in similar situations).

Modern NLP inherits both perspectives. Word embeddings (Word2Vec, GloVe) are fundamentally Harrisian; they learn from word co-occurrence statistics within a text window. But they also capture Firthian properties: words that substitute for each other in diverse contexts (like "good" and "great") end up nearby in vector space even if they rarely directly co-occur.

"You Shall Know a Word by the Company It Keeps"

Let's make this concrete. Consider how we would explain the word "quokka" to someone who's never heard of it. We might say: A quokka is a small marsupial native to Australia, about the size of a cat, with a friendly appearance. It's related to kangaroos and wallabies.

Notice what we have done: we have identified related words to "quokka" - "marsupial", "cat", "Australia", "kangaroo". This is the distributional hypothesis in action. Meaning emerges from a web of associations, not from intrinsic definitions.

Now imagine we encounter "quokka" in many sentences:

  • The quokka hopped across the grass.
  • Quokkas are native to Australia.
  • This small marsupial is related to kangaroos.
  • Quokkas have a friendly, smiling appearance.

Even if we didn't know the definition, we could infer:

  • It's an animal (appears with "hopped", "marsupial")
  • It's from Australia (stated explicitly)
  • It's small (contrasted with larger animals)
  • It's related to kangaroos (shares contexts with "wallaby", "marsupial")

This is semantic learning from distribution. The word's meaning is inferred from the statistical pattern of contexts where it appears. This is how distributional models like Word2Vec learn. The only difference is that they process millions of such sentences, extracting patterns that would take humans years to notice manually.
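Here's a minimal sketch of that statistical process: counting co-occurrences within a small window over the four sentences above (tokenization is deliberately simplified to lowercasing and splitting on spaces):

python
from collections import Counter, defaultdict

sentences = [
    "the quokka hopped across the grass",
    "quokkas are native to australia",
    "this small marsupial is related to kangaroos",
    "quokkas have a friendly smiling appearance",
]

window = 2  # how many neighbors on each side count as "context"
cooc = defaultdict(Counter)

for sentence in sentences:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                cooc[word][tokens[j]] += 1

# The contexts collected for the quokka-related tokens already hint at its meaning:
print(cooc["quokka"].most_common())   # [('the', 1), ('hopped', 1), ('across', 1)]
print(cooc["quokkas"].most_common())  # e.g. [('are', 1), ('native', 1), ('have', 1), ('a', 1)]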

Frequency Alone Is Insufficient

The crucial point here is that raw frequency doesn't mean equal meaningfulness.

Consider the following word pairs:

  • ("the", "cat") — "the" is the most frequent word in English, appearing before many nouns
  • ("small", "cat") — "small" is less frequent than "the" but more informative about "cat"

This is where Pointwise Mutual Information (PMI) comes in:

\text{PMI}(w_1, w_2) = \log \frac{P(w_1, w_2)}{P(w_1) \cdot P(w_2)}

PMI answers the question: how much more likely are these two words to appear together than if they were independent?

  • If P(w1, w2) = P(w1) × P(w2), they are independent → PMI = 0
  • If P(w1, w2) > P(w1) × P(w2), they are associated → PMI > 0
  • If P(w1, w2) < P(w1) × P(w2), they are anti-correlated → PMI < 0

For ("the", "cat"): "the" appears everywhere, so P("the") is huge. Even though they co-occur often, their PMI is low because their joint probability is close to what we would expect by chance.

For ("small", "cat"): "small" is selective — it appears with size-relevant nouns. Their co-occurrence is higher than chance, so PMI is positive.

This is why modern embeddings use PMI-based objectives instead of raw counts. They answer the question: which words provide surprising information about each other?
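Here's a minimal sketch of the PMI computation with made-up corpus counts (the numbers are invented for illustration, and probabilities are estimated as simple count ratios):

python
import math

# Hypothetical corpus statistics (counts are made up for illustration):
total_pairs = 1_000_000
count = {"the": 60_000, "small": 2_000, "cat": 3_000}
pair_count = {("the", "cat"): 220, ("small", "cat"): 90}

def pmi(w1: str, w2: str) -> float:
    """log [ P(w1, w2) / (P(w1) * P(w2)) ] estimated from corpus counts."""
    p_w1 = count[w1] / total_pairs
    p_w2 = count[w2] / total_pairs
    p_joint = pair_count[(w1, w2)] / total_pairs
    return math.log(p_joint / (p_w1 * p_w2))

print(pmi("the", "cat"))    # low (~0.2): "the" co-occurs with everything, little surprise
print(pmi("small", "cat"))  # higher (~2.7): "small" is selective, the pairing is informative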

[Figure: Co-occurrence matrix for "the dog chased the cat" with window size 2. Notice how "dog" and "cat" never co-occur directly but share identical contexts; second-order co-occurrence reveals semantic similarity.]

Why Geometry is the Right Abstraction

Once we have established that meaning comes from distributional patterns, the next question is: How do we represent these patterns mathematically? The answer may seem non-obvious at first, but it is geometry - specifically, the vector space.

The reason for choosing vectors is that distributional patterns are inherently multi-dimensional. A word is characterized by many properties, not just a single one. A few such properties are below:

  • Syntactic role (noun, verb, adjective)
  • Semantic domain (animal, artifact, abstract concept)
  • Connotation (positive, negative, neutral)
  • Register (formal, informal, technical)
  • Frequency (common, rare)

Each of these is a dimension along which words vary. Therefore, a vector is a natural mathematical object for representing multi-dimensional quantities.

In physics, a force has both magnitude (strength) and direction (orientation). Similarly, a word has magnitude (frequency in corpus) and direction (semantic properties).

But the key point to note is that for semantic similarity, direction matters more than magnitude.

Consider two documents:

  • Document A: "Dog. Dog. Dog." (3 words, all the same)
  • Document B: "The dog barked loudly." (4 words, semantically related)

If we represent these as term-frequency vectors:

A = [dog: 3, barked: 0, loudly: 0, the: 0]
B = [dog: 1, barked: 1, loudly: 1, the: 1]

The magnitude of A (√(3²) = 3) is different from B (√(1² + 1² + 1² + 1²) = 2). But their semantic content is closely related because both are about dogs.

If we normalize these vectors to unit length (divide by magnitude), they point in similar directions in the semantic space. The direction encodes what the text is about; the magnitude encodes how much text there is.

This is why most embedding systems normalize vectors before computing similarity. We want to know "Are these two words conceptually aligned?", not "Does one appear more frequently?"
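Here's the same example as a minimal NumPy sketch: normalize the term-frequency vectors and compare their directions:

python
import numpy as np

# Term-frequency vectors over the vocabulary [dog, barked, loudly, the]
A = np.array([3.0, 0.0, 0.0, 0.0])  # "Dog. Dog. Dog."
B = np.array([1.0, 1.0, 1.0, 1.0])  # "The dog barked loudly."

print(np.linalg.norm(A))  # 3.0 -> magnitude reflects how much text there is
print(np.linalg.norm(B))  # 2.0

# Normalize to unit length: direction (what the text is about) is all that's left
A_hat = A / np.linalg.norm(A)
B_hat = B / np.linalg.norm(B)

print(A_hat @ B_hat)  # 0.5 -> both documents share the "dog" direction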

Similarity as Alignment, Not Distance

In high school geometry, we learn that similarity means same shape, different size. In Euclidean geometry, we measure distance: two points are close if the straight line joining them is short. But in high-dimensional semantic spaces, distance is a bad metric.

This is the curse of dimensionality. As dimensions increase, distances concentrate: most pairs of points become nearly equidistant.

In 2D, some points are close, some are far. In 1000D, almost all points are roughly the same distance apart (with small variance). This is because distance grows as the square root of the sum of squared differences across dimensions, and in high dimensions, this sum is dominated by random noise.

For text data, the problem is worse because vectors are sparse. A document about dogs has non-zero values in only four dimensions like {dog, barked, tail, pet} and zeroes everywhere else. A document about cats has non-zero values in only four dimensions like {cat, meowed, whiskers, pet} and zeroes everywhere else.

In a 50,000-dimensional vector space, these two documents overlap in only one dimension ("pet"); their Euclidean distance is dominated by the dimensions where they don't overlap, yet semantically, they're closely related (both about pets).

Instead of measuring distance, we must measure angle. Two vectors are similar if they point in the same direction, regardless of their lengths. This is Cosine Similarity:

\text{cosine}(\mathbf{v}_1, \mathbf{v}_2) = \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{|\mathbf{v}_1| |\mathbf{v}_2|} = \cos(\theta)

where θ is the angle between the vectors. This ranges from -1 (opposite directions) to +1 (same direction), with 0 meaning orthogonal (unrelated).

Crucially, angle is invariant to magnitude. If we scale a vector by 2 (doubling all its components), its direction remains the same. This aligns with our intuition: a document that mentions "dog" 10 times versus 5 times is still about dogs. The frequency may differ, but the semantic direction remains unchanged.

Why Language Naturally Maps to Vector Space

There is a deeper reason why vectors work for languages: linearity of semantic relationships. Many linguistic relationships are compositional, which means we can combine meaning vectors using vector arithmetic.

The famous example is:

\vec{king} - \vec{man} + \vec{woman} \approx \vec{queen}

This suggests that the gender offset (man → woman) is a direction in vector space; applying it to "king" moves us toward "queen".

Similarly:

\vec{Paris} - \vec{France} + \vec{Italy} \approx \vec{Rome}

The capital-of relationship is also a learnable direction.

This works because the distributional hypothesis creates parallel structures. "King" appears in contexts like "The king ruled the kingdom" while "queen" appears in "The queen ruled the kingdom". The words {ruled, kingdom, throne} create a shared context, while {he, his, male} vs. {she, her, female} create the gender offset. These patterns are linear in distributional space - we can subtract one and add the other.

Geometric structure mirrors linguistic structure

Syntax (word order, grammar) imposes relational patterns. Semantics (meaning) create clusters of related concepts. Vector spaces naturally represent both: syntax as directional relationships, semantics as clustering.
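To make the arithmetic concrete, here's a minimal sketch with tiny 2-dimensional vectors invented purely for illustration; real embeddings are learned and have hundreds of dimensions, and in practice you would query a trained model such as Word2Vec:

python
import numpy as np

# Toy vectors invented for illustration; real embeddings are learned, 100-1000+ dims.
emb = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.2]),
    "apple": np.array([-0.7, 0.1]),
}

def nearest(vec, exclude):
    """Return the vocabulary word whose vector has the highest cosine with `vec`."""
    def cos(a, b):
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude), key=lambda w: cos(emb[w], vec))

target = emb["king"] - emb["man"] + emb["woman"]   # apply the gender offset to "king"
print(nearest(target, exclude={"king", "man", "woman"}))  # -> "queen" (approximately)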

[Figure: 2D semantic space showing how similar words (dog, cat) have small angles (high cosine similarity) while unrelated words (dog, rock) have large angles (low cosine similarity).]

Dot Product and Cosine Similarity

The dot product is one of the most fundamental operations in linear algebra, yet its geometric interpretation is often glossed over in undergraduate education. At the computational level, the dot product of two vectors is simple: multiply corresponding components and sum:

\mathbf{a} \cdot \mathbf{b} = a_1 b_1 + a_2 b_2 + \dots + a_n b_n = \sum_{i=1}^{n} a_i b_i

But this arithmetic definition obscures the profound geometric meaning: the dot product measures how much two vectors agree.

Think about what happens when we compute a_i × b_i for each dimension.

  • If both components are positive or both are negative, the product is positive - they agree in that dimension.
  • If one is positive and the other is negative, the product is negative - they disagree in that dimension.
  • If either is zero, the product is zero; that dimension contributes nothing.

The sum aggregates these dimension-wise agreements and disagreements into a single scalar.

This gives three cases:

1. Maximum Agreement

When vectors point in the same direction, all components have the same sign. Every term a_i × b_i is positive, and the dot product is maximized. For unit vectors (length 1), this maximum is exactly 1.

2. No Agreement

When vectors are perpendicular (orthogonal), the positive agreements exactly cancel the negative disagreements. The dot product is zero. This is the state of one-hot vectors we discussed earlier: they share no active dimensions, so their dot product is always zero.

3. Disagreement

When vectors point in opposite directions, most terms a_i × b_i are negative. The dot product is negative, reaching a minimum of -1 for unit vectors pointing in exactly opposite directions.

This interpretation reveals why the dot product is perfect for measuring semantic alignment. When two word vectors have a high dot product, it means their distributional patterns agree across many dimensions. They appear in similar contexts, modify similar words, and participate in similar syntactic structures. Conversely, a dot product near 0 means the words occupy independent semantic subspaces; knowing one tells you nothing about the other.
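Here's a minimal sketch of this "agreement" view, using made-up 4-dimensional vectors (the dimensions and their values are invented for illustration):

python
import numpy as np

# Made-up 4-dimensional vectors; imagine the dimensions loosely encoding
# [animacy, size, domestication, "celestial-ness"].
dog      = np.array([ 0.9,  0.3,  0.8, -0.7])
cat      = np.array([ 0.8,  0.2,  0.7, -0.6])
asteroid = np.array([-0.9,  0.6, -0.8,  0.9])

# Per-dimension agreement: positive where the vectors agree, negative where they clash.
print(dog * cat)       # [ 0.72  0.06  0.56  0.42] -> all positive: agreement everywhere
print(dog * asteroid)  # [-0.81  0.18 -0.64 -0.63] -> mostly negative: disagreement

# The dot product sums those per-dimension agreements into one scalar.
print(np.dot(dog, cat))       #  1.76 -> high alignment
print(np.dot(dog, asteroid))  # -1.90 -> opposing directions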

Physics Analogy

The deepest intuition for the dot product comes from physics. Let's take the classical example of work.

In this example, we are pushing a box across the floor. Two cases emerge:

When we push a box across the floor, we are applying a force F. If we push horizontally with 10 Newtons and the box moves 3 meters, we have done W = F · d = 10 N × 3 m = 30 Joules of work.

Now imagine pushing at 45° above horizontal. Not all our force contributes to horizontal motion, as some of it is "wasted" pushing the box into the floor.

Only the horizontal component of our force does useful work. If our total force is 10 N at 45°, the horizontal component is 10 × cos(45°) ≈ 7.07 N. The work done is:

W = \mathbf{F} \cdot \mathbf{d} = |\mathbf{F}| |\mathbf{d}| \cos(\theta)

This is the geometric definition of the dot product. It captures the idea that only the aligned portion of the force contributes to the motion in the direction of displacement. The cosine term cos(θ) acts as a "percentage of alignment":

  • θ = 0° (parallel): cos(0°) = 1 → 100% of the force contributes
  • θ = 45°: cos(45°) ≈ 0.707 → 70.7% of the force contributes
  • θ = 90° (perpendicular): cos(90°) = 0 → 0% of the force contributes
  • θ = 180° (opposite): cos(180°) = -1 → the force opposes the motion

This applies directly to word embeddings.

When we compute embedding("dog") · embedding("cat"), we are asking: "How much of 'dog's' semantic direction aligns with 'cat's' semantic direction?"

Both words have many semantic properties (animacy, size, domestication, etc.). The dot product aggregates across all properties, measuring total alignment.

Projection: The Shadow Interpretation

Imagine shining a light perpendicular to vector b. Vector a casts a shadow onto b. The length of this shadow is the projection of a onto b:

\text{proj}_\mathbf{b}(\mathbf{a}) = \frac{\mathbf{a} \cdot \mathbf{b}}{|\mathbf{b}|}

If b is a unit vector (length 1), this simplifies to just a · b: the dot product is the projection length.

This interpretation is powerful because it tells us: the dot product measures how much of vector a lies in the direction of vector b. If the shadow is long, the vectors are aligned. If the shadow is short (or zero), they are nearly perpendicular.

In NLP, this has practical applications. Imagine we have learned a gender direction in our embedding space: a vector pointing from average masculine words toward average feminine words. We can project any word onto this direction to measure its gender loading.

Words like "king," "father," "brother" project in one direction and words like "queen," "mother," "sister" project in the opposite direction. Words like "chair," "table," "algorithm" project orthogonal (gender-neutral).

Why Cosine Similarity Dominates NLP

Raw dot product has a problem: it is not normalized. The dot product of two vectors depends on both their angle and their magnitudes. If we double the length of vector a, the dot product doubles even though the semantic content hasn't changed.

This is problematic for word embeddings. Common words like "the" or "and" appear millions of times in training corpora, potentially leading to vectors with large magnitude. Rare words like "quokka" appear infrequently, leading to smaller magnitude vectors.

If we used raw dot product for similarity, common words would dominate simply because their vectors are longer, not because they're semantically more similar.

Cosine similarity solves this by normalizing away the magnitude:

\text{cosine}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{|\mathbf{a}| |\mathbf{b}|} = \cos(\theta)

Dividing by the product of magnitudes converts the dot product into a pure angle measure. Now similarity depends only on direction.

This ranges from -1 to +1:

  • +1: Vectors point in exactly the same direction (θ = 0°) → maximum similarity
  • 0: Vectors are perpendicular (θ = 90°) → orthogonal, unrelated
  • -1: Vectors point in opposite directions (θ = 180°) → antonyms or negation

Notice that cosine similarity is symmetric: cos(θ) is the same as cos(-θ). This matches our intuition: the similarity between "dog" and "cat" should equal the similarity between "cat" and "dog."
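As a quick sketch, here is a small cosine-similarity helper (using NumPy) that demonstrates the range, the magnitude-invariance, and the symmetry:

python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) between two vectors: dot product divided by both magnitudes."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0])

print(cosine_similarity(a, 2 * a))                   # ~ 1.0 -> same direction, magnitude ignored
print(cosine_similarity(a, np.array([-2.0, 1.0])))   #   0.0 -> perpendicular
print(cosine_similarity(a, -a))                      # ~-1.0 -> opposite direction

# Symmetric: similarity(a, b) == similarity(b, a)
b = np.array([3.0, -1.0])
print(cosine_similarity(a, b) == cosine_similarity(b, a))  # True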

[Figure: Cosine similarity measures the angle θ between vectors. Values range from +1 (identical direction) to 0 (perpendicular) to -1 (opposite direction).]

Why Cosine Works Better Than Euclidean Distance in High Dimensions

In principle, we could measure similarity using Euclidean distance instead of cosine similarity. Euclidean distance between two vectors is:

d(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}

Small distance means similar; large distance means dissimilar. This works fine in low dimensions (2D, 3D). But in the high-dimensional spaces used for word embeddings (300-1536 dimensions), Euclidean distance breaks down.

The problem is distance concentration: as dimensionality increases, the distance between most pairs of points converges to nearly the same value.

Imagine we have 1000-dimensional vectors with random components. The average distance between any two random vectors approaches a constant (proportional to √1000), with very small variance. This means nearest-neighbor search becomes meaningless, as every point is roughly equidistant from every other point.
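A quick simulation (using NumPy with random Gaussian vectors) makes the concentration effect visible: the spread of pairwise distances, relative to their mean, collapses as dimensionality grows:

python
import numpy as np

rng = np.random.default_rng(0)

for dim in (2, 1000):
    points = rng.normal(size=(500, dim))   # 500 random points in `dim` dimensions
    a, b = points[:250], points[250:]      # 250 random pairs
    dists = np.linalg.norm(a - b, axis=1)
    # Relative spread of pairwise distances shrinks as dimensionality grows.
    print(dim, dists.std() / dists.mean())
    # e.g. dim=2    -> spread around ~0.5 of the mean
    #      dim=1000 -> spread around ~0.02 of the mean: nearly all pairs equidistant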

Even worse for text: embeddings are sparse at the semantic level. A document about dogs has high values in dimensions related to animals, pets, mammals, and zeroes elsewhere. A document about coding has high values in dimensions related to programming, computers, and zeroes elsewhere. The Euclidean distance between them is dominated by the dimensions where the two documents don't overlap, which tells us nothing about semantic similarity.

On the other hand, cosine similarity is immune to this problem because it only cares about direction. Two sparse vectors can be extremely similar (high cosine) even if they're far apart in Euclidean space (for example, because their magnitudes differ greatly).

This aligns with our semantic intuition: a short paragraph about dogs and a long essay about dogs should be semantically similar, even though the essay has much higher word counts (larger magnitude).

Additionally, for text represented as word counts or TF-IDF vectors, the magnitude often reflects document length rather than semantic content. A Grokipedia article has high word counts simply because it's long, not because it's "more about" the topic.

Cosine similarity removes this confound by normalizing, focusing purely on the distribution of semantic content.

Mathematical Connection: Normalized Vectors

There's an elegant mathematical insight here. If we normalize all vectors to unit length (divide by their magnitude), then Euclidean distance and cosine similarity become monotonically related.

For unit vectors u and v:

d(\mathbf{u}, \mathbf{v})^2 = |\mathbf{u} - \mathbf{v}|^2 = (\mathbf{u} - \mathbf{v}) \cdot (\mathbf{u} - \mathbf{v})

Expanding:

= \mathbf{u} \cdot \mathbf{u} - 2\,\mathbf{u} \cdot \mathbf{v} + \mathbf{v} \cdot \mathbf{v} = 1 - 2\,\mathbf{u} \cdot \mathbf{v} + 1 = 2(1 - \mathbf{u} \cdot \mathbf{v})

Since for unit vectors, u · v = cos(θ):

d(\mathbf{u}, \mathbf{v})^2 = 2(1 - \cos\theta)

This means: smaller Euclidean distance ⟺ larger cosine similarity.

They provide identical rankings of similarity. But cosine similarity is computationally cheaper for normalized vectors: it's just the dot product, while Euclidean distance requires computing squared differences and a square root.

This is why modern embedding systems (Sentence-BERT, CLIP, etc.) typically normalize embeddings to unit length during training. Then at inference time, computing similarity is a single dot-product operation, which is extremely fast for retrieval over millions of vectors.
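As a quick numeric sanity check of the identity above, here is a sketch with random vectors normalized to unit length:

python
import numpy as np

rng = np.random.default_rng(1)
u, v = rng.normal(size=50), rng.normal(size=50)

# Normalize to unit length, as modern embedding systems typically do.
u /= np.linalg.norm(u)
v /= np.linalg.norm(v)

cos_theta = u @ v
squared_distance = np.linalg.norm(u - v) ** 2

print(squared_distance)      # squared Euclidean distance between the unit vectors
print(2 * (1 - cos_theta))   # the same number, via the identity d^2 = 2(1 - cos theta)
print(np.isclose(squared_distance, 2 * (1 - cos_theta)))  # True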

What Embeddings Are (And Are Not)

Word embeddings are learned continuous representations that compress distributional statistics into dense, low-dimensional vectors.

Let's be precise about what information they actually capture.

What Embeddings Encode

1. Distributional Co-occurrence Patterns

At their core, embeddings learn from the distributional hypothesis: words appearing in similar contexts get similar vectors. Word2Vec's skip-gram model explicitly optimizes this: given a word, predict its context. GloVe explicitly factorizes a co-occurrence matrix. Both result in vectors where dot product approximates the PMI.

Mathematically, Word2Vec's skip-gram objective can be shown to implicitly approximate:

\mathbf{w}_i \cdot \mathbf{c}_j \approx \text{PMI}(w_i, c_j) = \log \frac{P(w_i, c_j)}{P(w_i) P(c_j)}

where w_i is a word vector and c_j is a context vector.

This means the geometry of embedding space reflects statistical co-occurrence. A high dot product means the words co-occur more frequently than chance would predict.

Don't worry about these models yet. We will explore them deeply in part 2 and part 3 of this series.

2. Syntactic and Semantic Regularities

Embeddings capture compositional relationships through vector arithmetic:

  • Semantic: king - man + woman ≈ queen (gender offset)
  • Syntactic: walking - walk ≈ swimming - swim (verb conjugation)
  • Geographic: Paris - France + Italy ≈ Rome (capital-of relationship)

This works because the distributional contexts create parallel structures.

But there is a critical caveat: these algebraic relationships are approximate and noisy, not mathematical laws.

The expression king - man + woman doesn't give exactly "queen"; it gives a vector near "queen" in the embedding space.

When we find the nearest neighbor to this result vector, "queen" is often in the top results, but not always.

The geometry is fuzzy, not precise.

3. Semantic Similarity and Relatedness

Words with high cosine similarity tend to be semantically related, but the nature of that relationship varies:

  • Synonyms: "happy", "joyful", "cheerful"
  • Hypernyms/Hyponyms: "animal", "dog", "mammal" (hierarchical)
  • Functional similarity: "car", "vehicle", "transportation"
  • Topical association: "hospital", "doctor", "patient", "medicine"

Embeddings don't distinguish between these types of similarity; they just capture distributional relatedness.

"Coffee" and "cup" are nearby in embedding space not because they are synonyms, but because they frequently co-occur ("a cup of coffee").

This is both a strength (captures real-world associations) and a limitation (conflates different semantic relations).

4. Compressed, Low-Dimensional Approximations

Unlike one-hot vectors, embeddings are dense and low-dimensional. This compression is lossy; we cannot perfectly reconstruct the full co-occurrence matrix from embeddings, but it's semantically informed compression.

The dimensionality reduction is similar to what SVD does in Latent Semantic Analysis (LSA). LSA, introduced by Deerwester et al. (1990), applied Singular Value Decomposition (SVD) to term-document matrices, reducing dimensionality from tens of thousands to hundreds while preserving semantic structure.

Modern neural embeddings do the same thing more flexibly, learning non-linear compressions through backpropagation rather than linear projections through SVD.
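To see the LSA idea in miniature, here is a sketch that applies NumPy's SVD to a made-up term-document count matrix (the terms, documents, and counts are all invented for illustration):

python
import numpy as np

# Tiny made-up term-document count matrix (rows = terms, columns = documents).
terms = ["dog", "cat", "pet", "stock", "market"]
X = np.array([
    [3, 2, 0, 0],   # dog
    [2, 3, 0, 0],   # cat
    [2, 2, 1, 0],   # pet
    [0, 0, 3, 2],   # stock
    [0, 0, 2, 3],   # market
], dtype=float)

# LSA-style reduction: keep only the top-k singular directions.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vectors = U[:, :k] * S[:k]    # dense k-dimensional term representations

def cos(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(term_vectors[0], term_vectors[1]))  # dog vs cat   -> close to 1
print(cos(term_vectors[0], term_vectors[3]))  # dog vs stock -> close to 0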

Insight: Most semantic distinctions don't require vocabulary-size dimensions. The difference between "dog" and "cat" can be captured in a few dimensions (domestication, size, temperament), not 50,000 independent axes.

Embeddings learn which dimensions matter for semantic tasks and discard irrelevant variance.

What Embeddings Do Not Encode

Now is the time to understand the critical part: what embeddings fail to capture. This is where the limitations of purely distributional models become apparent.

1. Grounded, Embodied Meaning (The Symbol Grounding Problem)

Embeddings learn from text statistics, that is, patterns of word co-occurrence in written language.

But text is itself a symbolic system. When we read "The dog chased the ball", we understand this because we have sensorimotor grounding:

  • we have seen dogs
  • we have chased things
  • we understand motion and causality from physical experience.

An embedding model, in contrast, only sees that "dog" frequently co-occurs with "chased", "ball", "barked", and "pet".

It learns that these symbols are statistically related. But it has no connection to the physical referents, no concept of a four-legged mammal, no understanding of motion through space, no experience of a ball's roundness or a dog's bark.

This is Harnad's Symbol Grounding Problem (discussed above) in modern form. Harnad described it as trying to learn Chinese from a Chinese-Chinese dictionary: we go from one meaningless symbol to another in an endless "merry-go-round."

Modern language models face the same issue: they are caught in a "merry-go-round" of vectors. Their embeddings point to other embeddings, which point to other embeddings, but nothing ever grounds to the physical world, perceptual experience, or goal-oriented action.

A human can look at a calendar and identify "summer" without reading the word. A human can see a photo and recognize "dog" without labels.

Embeddings trained purely on text have no such grounding; they cannot connect "summer" to warmth, sunshine, or a specific temporal period, except insofar as these concepts are mentioned in text.

2. Causal Reasoning and World Knowledge

Embeddings capture correlations, not causation. They know that "rain" and "wet" co-occur frequently, but they don't know that rain causes wetness. They can't answer: "If it didn't rain, would the ground still be wet from the rain?" This requires causal reasoning, understanding that removing the cause removes the effect.

Similarly, embeddings lack structured world knowledge:

  • They don't know that all dogs are mammals, but not all mammals are dogs (taxonomic hierarchy)
  • They don't know that 1 + 1 = 2 (mathematical facts)
  • They don't know that Paris is in France in 2025, but wasn't part of France in 200 BCE (temporal, historical knowledge)

These facts might be implicitly encoded if they appear frequently in training text ("Paris, the capital of France..."), but they are not stored as verifiable, logical facts.

The model can't deduce new facts through logical inference. It only has distributional statistics.

3. Compositionality Beyond Bag-Of-Words

Word embeddings represent individual words. To represent a sentence, early methods used simple aggregation: average all word vectors. But this is a bag-of-words approach that loses word order and syntactic structure:

  • "The dog chased the cat" → average(dog, chased, cat)
  • "The cat chased the dog" → average(dog, chased, cat)

These give identical representations, even though the meaning is different (who chased whom?). Word embeddings can't capture this without additional machinery (recurrent networks, attention mechanisms, transformer architectures).
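Here's a minimal sketch of this failure mode, using toy word vectors invented for illustration: averaging word vectors produces exactly the same sentence representation regardless of word order:

python
import numpy as np

# Toy word vectors invented for illustration.
emb = {
    "the":    np.array([0.1, 0.1]),
    "dog":    np.array([0.9, 0.2]),
    "cat":    np.array([0.8, 0.3]),
    "chased": np.array([0.2, 0.9]),
}

def sentence_embedding(tokens):
    """Bag-of-words sentence embedding: the average of the word vectors."""
    return np.mean([emb[t] for t in tokens], axis=0)

s1 = sentence_embedding(["the", "dog", "chased", "the", "cat"])
s2 = sentence_embedding(["the", "cat", "chased", "the", "dog"])

print(np.allclose(s1, s2))  # True: averaging discards word order entirely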

Even modern contextual embeddings (BERT, GPT) struggle with true compositionality which is the systematic way meanings combine. They learn contextualized representations (the embedding for "bank" changes based on surrounding words), but they don't have explicit compositional semantics like formal semantic theories (λ-calculus, Montague grammar).

4. Negation, Quantification, and Logical Structure

Distributional semantics struggles with negation. Consider:

  • "I like dogs" (positive sentiment toward dogs)
  • "I don't like dogs" (negative sentiment toward dogs)

The word "don't" flips the polarity, but its embedding doesn't inherently encode "negation operator." In a bag-of-words model, both sentences have similar embeddings, they both mention "I," "like," and "dogs," differing only by the presence of "don't." We need syntactic structure and compositional operators to correctly interpret negation.

Similarly, quantification is hard:

  • "All dogs are mammals" (universal quantifier)
  • "Some dogs are brown" (existential quantifier)

Embeddings don't naturally represent the difference between ∀ (all) and ∃ (some). These require logical forms, not just distributional statistics.

5. Multiple Senses and Polysemy

Traditional word embeddings (Word2Vec, GloVe) assign one vector per word, conflating all senses:

  • "bank" (financial institution) and "bank" (river edge) get the same vector
  • "bat" (animal) and "bat" (sports equipment) get the same vector

The vector averages across contexts where each sense appears. This is problematic for disambiguation tasks. Contextual embeddings (ELMo, BERT) address this by computing different embeddings for the same word in different sentences, but even they don't explicitly separate senses—they just modulate the representation based on context.

6. Rare Words and Long-Tail Distributions

Embeddings are learned from corpus statistics.

Frequent words like "the," "and," "is" appear millions of times, so their embeddings are well-trained.

Rare words like "quokka," "axolotl," or domain-specific jargon appear infrequently, leading to poorly estimated embeddings.

For words that appear fewer than ~10-100 times, the embedding is essentially random noise as the model hasn't seen enough contexts to learn meaningful patterns.

This is the long tail problem. Natural language follows Zipf's Law: a small number of words are extremely common, and a vast number of words are extremely rare.

Embeddings work well for the head of the distribution but poorly for the tail.

Embeddings ≠ Knowledge ≠ Reasoning

Embeddings are learned statistical patterns.

They capture which words co-occur, which contexts are similar, which relationships are distributionally parallel.

Knowledge is structured, verifiable information: "Water boils at 100°C," "Paris is the capital of France," "All mammals are warm-blooded." Knowledge supports logical inference: if A implies B, and B implies C, then A implies C.

Reasoning is the process of deriving new conclusions from existing knowledge through deduction, induction, or abduction.

Reasoning requires compositionality (combining meanings), negation (flipping truth values), quantification (all, some, none), and causality (X causes Y).

Embeddings provide a foundation: a continuous semantic space where similar meanings cluster. But they are not sufficient for true understanding, knowledge representation, or logical reasoning. They are a starting point, not an ending point.

Modern LLMs (GPT-4, Claude) go far beyond static embeddings by using contextual representations, attention mechanisms, and massive scale. But even they inherit the fundamental limitation: they are trained on text statistics, not grounded experience.

They don't "know" the world; they know the textual patterns that humans have written about the world.

Key Takeaways

As we near the end of this post, let's look at the key takeaways we've explored.

Why Embeddings Had to Exist:

  • Computers need numbers, but integers and one-hot vectors cannot measure semantic similarity
  • Meaning emerges from usage patterns (distributional hypothesis: "You shall know a word by the company it keeps")
  • Multi-dimensional distributional patterns require vectors as the natural mathematical representation
  • High-dimensional similarity is angular (cosine), not distance-based (Euclidean)
  • Geometric structure enables semantic algebra: king - man + woman ≈ queen

What Embeddings Encode:

  • Distributional co-occurrence patterns (dot product ≈ PMI)
  • Syntactic and semantic regularities via vector arithmetic
  • Semantic similarity and relatedness (though they conflate different types)
  • Compressed, dense approximations of sparse co-occurrence statistics

Critical Limitations:

Embeddings Have                  | Embeddings Lack
Statistical correlation patterns | Grounded meaning (symbol grounding problem)
Distributional similarity        | Causal reasoning (correlation ≠ causation)
Vector arithmetic                | Compositional semantics (bag-of-words loses structure)
Continuous representations       | Logical operators (negation, quantification)

The Bottom Line: Embeddings are correlation engines that provide a geometric foundation for semantic similarity; powerful for measuring relatedness, but not sufficient for reasoning or true understanding.

Conclusion

If you are here, thank you for sticking with the post until the very end.

In this post, we covered a lot of ground, understanding embeddings from first principles: what works, what doesn't and why, why we chose the dot product (cosine similarity) over Euclidean distance, and more.

We started with understanding why embeddings have to exist:

  1. Words are arbitrary symbols, which means integers and one-hot vectors can't measure similarity
  2. Meaning comes from usage (distributional hypothesis), i.e., statistical patterns in context
  3. Patterns are multi-dimensional, and vectors are their natural representation
  4. Similarity in high dimensions is angular, so we choose cosine similarity

Then, we looked at the limits of embeddings:

  1. They have no grounding in the physical world (symbol grounding problem)
  2. They don't understand causal reasoning (correlation ≠ causation)
  3. No compositional semantics (bag-of-words loses structure)
  4. No logical operators (negation, quantification)

Now that we have the fundamentals out of the way, in the next post we will go even deeper down the rabbit hole and look at different embedding algorithms, what they are good at, and what their limitations are.

Namaste!


References

Foundational Theory

Symbol Grounding Problem - Harnad, S. (1990). The symbol grounding problem. Physica D: Nonlinear Phenomena.

Distributional Hypothesis - Harris, Z. S. (1954). Distributional Structure. WORD, 10(2-3), 146-162.

Distributional Semantics - Firth, J. R. (1957). You shall know a word by the company it keeps. Studies in Linguistic Analysis.

Encoding and Representation

One-Hot Encoding Limitations - Ordinal and One-Hot Encodings for Categorical Data.

Curse of Dimensionality - An Intuitive Exploration of the Curse of Dimensionality.

Statistical Measures

Pointwise Mutual Information (PMI) - Co-occurrence Matrices and PMI-SVD Embeddings.

Cosine Similarity - Cosine Similarity.

Euclidean vs Cosine Distance - Cha, S. H. (2007). Comprehensive Survey on Distance/Similarity Measures.

Geometric Interpretations

Dot Product in Physics - Understanding the Dot Product.

Vector Space Model - Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing.

Dimensionality Reduction

Latent Semantic Analysis (LSA) - Deerwester, S. et al. (1990). Indexing by Latent Semantic Analysis.

Singular Value Decomposition (SVD) - Reducing dimensionality of text documents using latent semantic analysis

Word Embeddings Evolution

Word2Vec - Mikolov, T. et al. (2013). Efficient Estimation of Word Representations in Vector Space.

GloVe - Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global Vectors for Word Representation. EMNLP

Word Embedding Limitations - On the Limitations of Word Embeddings. PMC.

Language Model Theory

Zipf's Law - Distribution of word frequencies in natural language corpora

Symbols and Grounding in LLMs - Symbols and grounding in large language models. Royal Society Publishing.

Written by Anirudh Sharma

Published on December 19, 2025

