Understanding Embeddings: Part 4 - The Limits of Static Embeddings
Why one vector per word cannot capture meaning—and what this limitation reveals about the nature of language understanding.
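To make the limitation concrete up front, here is a minimal sketch (the vectors and the `embed` helper are invented for illustration, not real Word2Vec or GloVe weights): a static embedding is a fixed lookup table, so "bank" maps to the same vector whether it appears beside "river" or "money".

```python
# Toy illustration of the static-embedding limitation.
# All numbers are made up; this is not a real trained model.
import numpy as np

# A static embedding is just a lookup table: one fixed vector per word type.
embeddings = {
    "bank":  np.array([0.7, 0.1, 0.4]),  # hypothetical vector
    "river": np.array([0.8, 0.0, 0.3]),
    "money": np.array([0.1, 0.9, 0.2]),
}

def embed(word, context):
    # The context argument is ignored -- that is the whole problem.
    return embeddings[word]

v1 = embed("bank", context="she sat on the river bank")
v2 = embed("bank", context="she deposited money at the bank")

# Identical vectors for two different senses of "bank".
print(np.array_equal(v1, v2))  # True
```

No matter how the surrounding sentence changes, the lookup returns the same point in space, so the two senses of "bank" are indistinguishable by construction.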