Feb 05, 2026
So here’s the thing about Bag-of-Words and TF-IDF, they’re solid starting points. They work, they’re fast, and they’ve powered enough production systems that we all have scars from tuning their parameters. But there’s this fundamental problem that bugs you after a while - they have zero semantic awareness.
When you represent words as one-hot vectors, king, queen, and banana are equally different from each other. Your model can technically count them, but it doesn’t know that king and queen are kindred spirits while banana is just… hanging out alone.
This is where word embeddings come in, and honestly, this is where NLP stopped being just counting things and started being something closer to understanding.
Word embeddings represent each word as a dense vector of learned numbers. You don’t hand-craft these, the network learns them by watching patterns in how words appear around each other. Words that show up in similar contexts gradually learn similar vectors. It’s like the model is picking up on linguistic vibes without you explicitly telling it what those vibes are.
In this tutorial, we’re building this from scratch. No frameworks, no magical libraries doing the heavy lifting behind the scenes. We’ll implement the two classic architectures that started this whole embedding revolution:
Let’s say you’re looking at: the king loves the queen
A TF-IDF model will happily count that king appears once and queen appears once. But ask it whether those two words are similar, and it’ll shrug. In the math, they’re just different columns in a huge sparse matrix.
Embeddings fix this by giving each word its own learned vector:
\[\mathbf{v}_w \in \mathbb{R}^d\]Instead of living in a 50,000-dimensional one-hot space (where one dimension is 1 and the rest are 0), a word now has a compact representation in, say, 300 dimensions. And here’s the clever part, those 300 dimensions actually encode meaning.
In a well-trained embedding space, something magical happens:
king - man + woman ≈ queen (the classic example)This shift from manually counting things to learning what words mean from context is why embeddings were such a inflection point in NLP. It’s the difference between having a spell-checker and having something that actually understands language.
Word2Vec’s foundation is surprisingly simple. It’s based on something linguists figured out a while ago:
You can understand what a word means by looking at what words show up around it.
This is the distributional hypothesis, and it’s kind of obvious once you think about it. If you see:
the king rules the kingdomthe queen rules the kingdomthe king and queen are royaltyYou notice that king and queen tend to appear near the same kinds of words: rules, kingdom, royalty. Your brain picks up on this pattern immediately and concludes that these words are probably related. They hang out with similar crowds.
Word2Vec exploits this idea by turning it into a game the network can play:
The beautiful part? You never explicitly tell the model what words mean. You just keep feeding it context clues, and the meaning figures itself out. It’s like learning a language by osmosis.
Before embeddings, we had one-hot encoding. It looks deceptively clean:
Vocabulary: ["king", "queen", "man", "woman", "cat", "dog"]
Representations:
king -> [1, 0, 0, 0, 0, 0]queen -> [0, 1, 0, 0, 0, 0]cat -> [0, 0, 0, 0, 1, 0]Great for indexing. Terrible for understanding. Every word is equally distant from every other word in this space. According to the math, king is just as different from queen as king is from cat. There’s no structure, no meaning baked in.
Word2Vec flips this entirely. Instead of one-hot vectors, we learn an embedding matrix:
\[\mathbf{W} \in \mathbb{R}^{V \times d}\]Where $V$ is your vocabulary size and $d$ is your embedding dimension (usually something like 100-300). Each row of this matrix is the learned vector for one word.
The core challenge becomes:
How do we train this matrix so words that are semantically similar actually get similar vectors?
That’s the problem CBOW and Skip-Gram solve, just coming at it from slightly different angles.
CBOW (Continuous Bag of Words) is the simpler of the two approaches. It’s basically a fill-in-the-blank game.
Take: the king loves the queen
If we treat loves as the target word and use a window of 2 words on each side:
the, king, the, queenlovesCBOW learns:
\[(\text{the}, \text{king}, \text{the}, \text{queen}) \rightarrow \text{loves}\]The network’s job is simple: given these context words, predict what should go in the middle.
CBOW takes the embeddings of all the context words, averages them together (or sometimes sums), and uses that combined representation to predict the center word. It’s elegant and efficient - you’re getting multiple training signals with each prediction.
CBOW is genuinely fast. Because it combines context words upfront, frequent words get stable representations quickly. You’re effectively learning from multiple context examples in a single forward pass.
The tradeoff? CBOW smooths out fine details. When you average context, you’re sort of losing information about word order and specific relationships. the king loves and loves the king produce roughly the same averaged context, which isn’t always what you want.
Skip-Gram inverts the prediction direction. Instead of “given context, find the center word,” it asks “given this word, what should appear around it?”
Same sentence: the king loves the queen
With loves as the center and window size 2, Skip-Gram creates four separate training examples:
loves -> the (left context)loves -> king (left context)loves -> the (right context)loves -> queen (right context)So for every word that appears in your text, you get multiple mini-prediction tasks scattered around it.
This generates way more training examples per sentence. If CBOW gives you one “guess the middle word” task per window, Skip-Gram gives you multiple “guess the neighbor” tasks from the same position.
Why does this matter? Rare words. If a rare word only appears a few times, CBOW might not see enough examples to learn a stable representation. But Skip-Gram turns each appearance into 2-4 prediction tasks, so even infrequent words get decent training signals.
The downside is obvious: more training examples means more computation. Skip-Gram is slower to train than CBOW. It’s a classic engineering tradeoff — more data accuracy versus faster training.
Despite their different architectures, CBOW and Skip-Gram follow basically the same workflow. Here’s what actually happens:
Both models use two weight matrices:
\[\mathbf{W}_1 \in \mathbb{R}^{V \times d}, \quad \mathbf{W}_2 \in \mathbb{R}^{d \times V}\]Think of it like this:
When training is done, you throw away $\mathbf{W}_2$ and keep $\mathbf{W}_1$. Those learned word vectors are your final product — they’re dense, semantic, and ready to feed into downstream tasks.
Word2Vec is honestly just a softmax classifier wrapped around context windows. Let me show you what’s actually happening.
You have context words at positions around $t$:
\[w_{t-m}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+m}\]CBOW averages their embeddings:
\[\mathbf{h} = \frac{1}{2m} \sum_{-m \leq j \leq m,\; j \neq 0} \mathbf{v}_{w_{t+j}}\]Then uses that average as the input to a softmax prediction layer:
\[P(w_t \mid \text{context}) = \text{softmax}(\mathbf{W}_2^\top \mathbf{h})\]During training, you compute the loss (cross-entropy) and backprop to make the true center word have high probability.
Skip-Gram starts with the center word embedding and predicts each context position independently:
\[P(w_{t+j} \mid w_t) = \text{softmax}(\mathbf{W}_2^\top \mathbf{v}_{w_t})\]You do this for every $j$ in your window. The loss is the sum of log-probabilities for all real context words:
\[\sum_{-m \leq j \leq m,\; j \neq 0} \log P(w_{t+j} \mid w_t)\]They’re cousins, not clones. Here’s the honest breakdown:
CBOW averages context, so it trains faster and converges with fewer examples. That efficiency buys you something real in a production setting.
Skip-Gram gives each context position its own prediction, so even infrequent words see multiple training signals.
CBOW: “Given the, king, the, queen, what word goes here?”
Skip-Gram: “Given loves, what word appears here? And here? And here?”
It’s a tiny change in perspective that cascades into different behavior. Same algorithm, different data flow.
bank in “river bank” vs “savings bank” gets the same vector. Later methods (ELMo, BERT) fixed this.These aren’t bugs in Word2Vec. They’re exactly the problems that motivated the next generation of methods. Contextual embeddings, subword tokenization, and pre-training emerged because people kept bumping into these walls.
Word2Vec’s genius is its simplicity. One core insight:
Words that hang out together in text learn similar vectors.
That’s it. No hand-crafted features, no linguistic rules, no expensive annotations. Just context clues and gradient descent.
CBOW and Skip-Gram are two flavors of the same idea:
This is where the story gets interesting. After implementing Word2Vec, you realize:
Those are the natural progressions, and they’re all built on understanding Word2Vec first. You’re now positioned to understand why later methods were invented and what problems they solve.