Feb 03, 2026
Continuing our series on NLP, let’s dive into text representation. Imagine you are a world-class chef standing in a massive, chaotic pantry filled with thousands of ingredients from across the globe. Some ingredients, like salt and water, are in almost every dish you make. Others, like rare saffron or white truffles, appear only in your most exquisite masterpieces. If a food critic asked you to describe a dish without using its name, how would you do it? You might list the ingredients and their quantities: “This dish has 200g of pasta, 50g of parmesan, and a pinch of black pepper.”
To the critic, this “list of ingredients” is a fingerprint of the dish. It doesn’t tell them the order in which you added the ingredients or the technique you used, but it gives them a very good idea of what the dish is.
This is exactly how machines “read” text. A computer cannot understand the poetic beauty of a Shakespearean sonnet or the urgency of a breaking news headline in their raw, string-based form. To a machine, a sentence is just a sequence of characters. To perform tasks like sentiment analysis, document classification, or building a search engine, we must transform these “dishes” of text into a numerical “list of ingredients.” This process is known as Text Vectorization or Text Representation.
In this article, we will explore the foundational techniques of text representation: the Bag-of-Words (BoW) model and its more sophisticated cousin, TF-IDF (Term Frequency-Inverse Document Frequency). We will delve into the mathematical elegance of these methods, understand their limitations, and see how they form the backbone of many modern NLP systems.
Computers are essentially glorified calculators. They excel at performing billions of mathematical operations per second, but they have no innate concept of “meaning.” A string like "Machine Learning" is just a sequence of Unicode values (M=77, a=97, c=99...).
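To make this concrete, here is a minimal sketch that prints the numbers behind the string from the paragraph above. It uses only Python's built-in `ord`; the variable names are illustrative.

```python
# Each character in a string is stored as a number (its Unicode code point).
text = "Machine Learning"
code_points = [ord(ch) for ch in text]

print(code_points[:3])  # [77, 97, 99]  -> 'M', 'a', 'c'
```

These numbers identify characters, not meaning, which is why a separate representation step is needed.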
To bridge the gap between human language and machine logic, we need a Mapping Function ($f$) that transforms a piece of text ($T$) into a vector ($\mathbf{v}$): \(\mathbf{v} = f(T)\) where $\mathbf{v} \in \mathbb{R}^n$, and $n$ is the dimensionality of our representation.
A “good” vector representation should ideally satisfy three properties:
The Bag-of-Words model is the simplest form of text representation. It treats a document as an unordered collection (a “bag”) of words, disregarding grammar, word order, and sentence structure. It only cares about presence and frequency.
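Suppose we have a small collection of documents (a Corpus):
- Doc 1: “The cat sat on the mat.”
- Doc 2: “The dog sat on the log.”
- Doc 3: “The cat and the dog are friends.”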
We collect all unique words across the entire corpus (after basic preprocessing like lowercasing):
Vocabulary = {"the", "cat", "sat", "on", "mat", "dog", "log", "and", "are", "friends"}
Size of Vocabulary ($V$) = 10.
Each document is now represented by a vector of length $V$, where each index corresponds to a word in the vocabulary.
| Document | the | cat | sat | on | mat | dog | log | and | are | friends |
|---|---|---|---|---|---|---|---|---|---|---|
| Doc 1 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| Doc 2 | 2 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
| Doc 3 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 |
Vector for Doc 1: $[2, 1, 1, 1, 1, 0, 0, 0, 0, 0]$
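The table above can be reproduced with a few lines of plain Python. This is a minimal sketch, not a production tokenizer: the `tokenize` and `bow_vector` helpers are illustrative names, and the regular expression simply lowercases the text and keeps runs of letters.

```python
import re
from collections import Counter

corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "The cat and the dog are friends.",
]

def tokenize(text):
    # Lowercase the text and keep runs of letters; punctuation is dropped.
    return re.findall(r"[a-z]+", text.lower())

# Build the vocabulary in order of first appearance across the corpus.
vocabulary = []
for doc in corpus:
    for token in tokenize(doc):
        if token not in vocabulary:
            vocabulary.append(token)

def bow_vector(text):
    # Count each token, then read the counts out in vocabulary order.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

bow = [bow_vector(doc) for doc in corpus]

print(vocabulary)
print(bow[0])  # [2, 1, 1, 1, 1, 0, 0, 0, 0, 0]
```

Each position in the vector corresponds to one vocabulary word, so two documents can be compared index by index.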
In a real-world corpus (e.g., all Wikipedia articles), the vocabulary might consist of 100,000+ words. However, a single paragraph might only contain 50 unique words. With a vocabulary of exactly 100,000 words, the vector for that paragraph would have 50 non-zero values and 99,950 zeros. This is called Sparsity. Storing such vectors densely wastes memory and computation, so libraries use specialized sparse matrix formats such as CSR (Compressed Sparse Row), which store only the non-zero entries.
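The idea behind sparse storage can be sketched with a plain dictionary that keeps only the non-zero entries. This is a simplified illustration, not the CSR layout itself (CSR packs values and column indices into flat arrays); the 100,000-word vocabulary and the positions of the non-zero entries are made up for the example.

```python
def to_sparse(dense):
    # Keep only {index: value} pairs for the non-zero entries.
    return {i: v for i, v in enumerate(dense) if v != 0}

vocab_size = 100_000           # hypothetical vocabulary size
dense = [0] * vocab_size
for i in range(50):            # a paragraph with 50 distinct words
    dense[i * 1000] = 1        # arbitrary positions, chosen for illustration

sparse = to_sparse(dense)

print(len(dense), len(sparse))  # 100000 50
```

The sparse form stores 50 entries instead of 100,000, which is the saving that sparse matrix formats exploit.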
Notice in our example that “the” appears in every document with a high frequency. In English, words like “the”, “is”, “at”, “which”, and “on” are extremely common but carry very little semantic value. In a Bag-of-Words model, these Stopwords often dominate the vectors, making it harder for the model to focus on the truly descriptive words (like “cat” or “log”).
Standard practice is to remove these words during preprocessing, but as we’ll see, TF-IDF provides a more elegant mathematical solution.
The biggest weakness of BoW is the total loss of word order. The sentences “The dog bit the man” and “The man bit the dog” result in identical BoW vectors, even though their meanings are drastically different (and one is much more news-worthy!).
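A quick check of this claim, as a minimal sketch using `collections.Counter` as the "bag" (the `bag` helper is an illustrative name):

```python
import re
from collections import Counter

def bag(text):
    # A Counter is an unordered multiset of tokens: exactly a "bag" of words.
    return Counter(re.findall(r"[a-z]+", text.lower()))

a = bag("The dog bit the man")
b = bag("The man bit the dog")
same = (a == b)

print(same)  # True
```

Both sentences contain the same tokens with the same counts, so nothing in the representation distinguishes who bit whom.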
N-grams attempt to fix this by considering sequences of $n$ adjacent tokens.
If we use both Unigrams and Bigrams, our vocabulary size explodes. For a vocabulary of size $V$, the number of possible bigrams is $V^2$. While most of these won’t appear in the data, the effective vocabulary still grows significantly.
| Document | the | cat | the cat | cat sat | sat on | … |
|---|---|---|---|---|---|---|
| Doc 1 | 2 | 1 | 1 | 1 | 1 | … |
Advantage: Captures phrases like “Machine Learning”, “New York”, or “Not happy” (negation).
Disadvantage: Drastically increases dimensionality and sparsity.
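Extracting n-grams amounts to sliding a window of size $n$ over the token list. The sketch below assumes the text has already been tokenized; `ngrams` is an illustrative helper name.

```python
def ngrams(tokens, n):
    # Slide a window of n tokens across the list and join each window.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "cat", "sat", "on", "the", "mat"]
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(bigrams)  # ['the cat', 'cat sat', 'sat on', 'on the', 'the mat']
```

Counting these strings instead of single words gives the bigram columns of the table above, and it also separates “dog bit” from “man bit”.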
If BoW is an “ingredient list,” TF-IDF (Term Frequency-Inverse Document Frequency) is a “weighted importance list.” It acknowledges that not all words are created equal.
The core intuition is: A word is important if it appears frequently in a specific document, but it is less important if it appears frequently across many documents in the entire corpus.
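TF measures how frequently a term occurs in a document. There are several ways to calculate this; two of them are used in this article:
- Relative frequency: the number of times $t$ appears in $d$ divided by the total number of words in $d$.
- Log scaling: $1 + \log(\text{count})$ for a non-zero count, which dampens the effect of very high counts.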
IDF measures how “rare” or “informative” a term is across the whole corpus. If a word appears in every single document (like “the”), its IDF should be very low (close to 0).
The standard formula for IDF is: \(IDF(t, D) = \log\left(\frac{N}{\lvert \{d \in D \mid t \in d\} \rvert}\right)\)
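Where:
- $N$ is the total number of documents in the corpus $D$.
- $\lvert \{d \in D \mid t \in d\} \rvert$ is the number of documents that contain the term $t$.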
Why the Logarithm? The log function ensures that the IDF doesn’t explode for very rare words and stays within a manageable range. If $N=1,000,000$ and a word appears in only 1 document, the raw ratio is $1,000,000$. The $\log_{10}$ of that is $6$. This makes the weights much more stable for machine learning models.
The TF-IDF score for a term $t$ in document $d$ is: \(w_{t,d} = TF(t,d) \times IDF(t)\)
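The formulas can be sketched directly in Python. This is an illustrative implementation under a few assumptions: TF is the relative frequency (count divided by document length), the logarithm is base 10, and every queried term occurs in at least one document (otherwise the division in `idf` would fail). The three-sentence corpus and helper names are chosen for the example.

```python
import math
import re

corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "The cat and the dog are friends.",
]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

docs = [tokenize(d) for d in corpus]
N = len(docs)

def tf(term, doc):
    # Relative frequency of the term within one document.
    return doc.count(term) / len(doc)

def idf(term):
    # Number of documents containing the term (assumed to be at least 1).
    df = sum(1 for doc in docs if term in doc)
    return math.log10(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(tf_idf("the", docs[0]))            # 0.0 ("the" is in every document)
print(round(tf_idf("mat", docs[0]), 4))  # 0.0795
```

“the” is the most frequent token in the first document, yet its weight is zero because it occurs in every document.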
Suppose we have a corpus of 1,000 documents. We are looking at Document A, which has 100 words. The word “Algorithm” appears 5 times in Document A. The word “Algorithm” appears in 10 documents across the entire corpus.
Calculate TF: \(TF(\text{"Algorithm"}, \text{Doc A}) = \frac{5}{100} = 0.05\)
Calculate IDF (using a base-10 logarithm): \(IDF(\text{"Algorithm"}) = \log_{10}\left(\frac{1000}{10}\right) = \log_{10}(100) = 2\)
Calculate TF-IDF: \(w = 0.05 \times 2 = 0.10\)
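Now compare this to the word “The”, which appears 10 times in Document A but appears in all 1,000 documents.
Calculate TF: \(TF(\text{"The"}, \text{Doc A}) = \frac{10}{100} = 0.10\)
Calculate IDF: \(IDF(\text{"The"}) = \log_{10}\left(\frac{1000}{1000}\right) = \log_{10}(1) = 0\)
Calculate TF-IDF: \(w = 0.10 \times 0 = 0\)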
Result: TF-IDF successfully filtered out the “noise” word (“the”) and highlighted the “informative” word (“algorithm”), even though “the” appeared more frequently in the document!
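The arithmetic in this example is easy to verify in code. The function below is a small sketch that takes the four numbers from the example as arguments and assumes a base-10 logarithm; the parameter names are illustrative.

```python
import math

def tf_idf(term_count, doc_length, n_docs, doc_freq):
    tf = term_count / doc_length          # relative frequency in the document
    idf = math.log10(n_docs / doc_freq)   # base-10 inverse document frequency
    return tf * idf

w_algorithm = tf_idf(term_count=5, doc_length=100, n_docs=1000, doc_freq=10)
w_the = tf_idf(term_count=10, doc_length=100, n_docs=1000, doc_freq=1000)

print(round(w_algorithm, 4), w_the)  # 0.1 0.0
```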
We will use scikit-learn, the industry standard for classical ML, to implement these vectors.
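import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# The vectorizers below use scikit-learn's built-in 'english' stopword list.
# Optional: to use NLTK's list instead, uncomment these lines (one-time download)
# and pass stop_words=stop_words to the vectorizers.
# import nltk
# nltk.download('stopwords')
# from nltk.corpus import stopwords
# stop_words = list(stopwords.words('english'))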
corpus = [
"The cat sat on the mat.",
"The dog sat on the log.",
"The cat and the dog are friends."
]
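# 1. Initialize CountVectorizer
# stop_words='english' drops common words such as "the" and "on", so the columns
# will differ from the hand-built table earlier in the article.
# Pass ngram_range=(1, 2) to include bigrams as well as unigrams.
vectorizer = CountVectorizer(stop_words='english')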
# 2. Fit and Transform
X_bow = vectorizer.fit_transform(corpus)
# 3. View the Result
df_bow = pd.DataFrame(X_bow.toarray(), columns=vectorizer.get_feature_names_out())
print("Bag-of-Words Representation:")
print(df_bow)
# 1. Initialize TfidfVectorizer
# norm='l2' ensures that the vectors have a length of 1 (useful for Cosine Similarity)
tfidf_vectorizer = TfidfVectorizer(stop_words='english', norm='l2')
# 2. Fit and Transform
X_tfidf = tfidf_vectorizer.fit_transform(corpus)
# 3. View the Result
df_tfidf = pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
print("\nTF-IDF Representation:")
print(df_tfidf.round(4))
In scikit-learn, you can use sublinear_tf=True which applies the $1 + \log(TF)$ scaling we discussed earlier.
tfidf_vectorizer_log = TfidfVectorizer(sublinear_tf=True, stop_words='english')
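Note that the numbers scikit-learn prints will not match the hand calculations above. By default, `TfidfVectorizer` uses raw counts for TF, the natural logarithm, a smoothed IDF of $\ln\frac{1 + N}{1 + df} + 1$, and then L2-normalizes each row. The sketch below reimplements just the IDF and sublinear-TF formulas in plain Python so the differences are visible; the function names are illustrative and are not part of scikit-learn.

```python
import math

def sklearn_style_idf(n_docs, doc_freq):
    # Smoothed IDF as documented for TfidfVectorizer(smooth_idf=True), the default.
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def sublinear_tf(count):
    # 1 + ln(count) for a non-zero count; a zero count stays zero.
    return 1 + math.log(count) if count > 0 else 0.0

print(sklearn_style_idf(3, 3))     # 1.0 (a term in every document still gets weight)
print(sublinear_tf(1))             # 1.0
print(round(sublinear_tf(10), 3))  # 3.303
```

With smoothing, a word that appears in every document keeps an IDF of 1 rather than 0, so it is down-weighted rather than removed entirely.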
| Feature | Bag-of-Words (BoW) | TF-IDF |
|---|---|---|
| Logic | Simple counting of occurrences. | Statistical weighting of importance. |
| Weighting | All words are treated equally. | Frequent words in corpus are penalized. |
| Noise Sensitivity | High (Stopwords dominate). | Low (Automatically suppresses noise). |
| Interpretability | High (Numbers = Counts). | Medium (Numbers = Relative importance). |
| Best For | Very small datasets, simple tasks. | Most classical NLP tasks, Search engines. |
BoW is surprisingly effective for spam detection. If a message contains “win”, “prize”, and “cash” multiple times, those counts are strong evidence of spam, and a simple Naive Bayes classifier trained on BoW vectors can reach very high accuracy (98%+ is achievable on some benchmark datasets).
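Before modern neural search, many search engines relied on TF-IDF or its refinement, BM25. In the classic vector-space approach, the engine represents your query and every document as TF-IDF vectors (document vectors are computed ahead of time, at indexing) and ranks the documents by Cosine Similarity to the query. BM25 keeps the same intuition but scores each document by summing, over the query terms, an IDF weight multiplied by a term-frequency component that saturates and is normalized by document length.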
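Ranking by cosine similarity can be sketched in a few lines. The vectors below are hypothetical TF-IDF weights over a three-term vocabulary, invented for illustration; a real engine would compute them from its index and would use sparse data structures.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF weights for three documents and one query.
doc_vectors = {
    "doc_a": [0.8, 0.0, 0.1],
    "doc_b": [0.0, 0.9, 0.2],
    "doc_c": [0.4, 0.4, 0.0],
}
query = [1.0, 0.0, 0.0]

ranking = sorted(
    doc_vectors,
    key=lambda name: cosine_similarity(query, doc_vectors[name]),
    reverse=True,
)

print(ranking)  # ['doc_a', 'doc_c', 'doc_b']
```

Because cosine similarity depends only on the angle between vectors, a long document is not favoured merely for containing more words.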
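In topic-modeling algorithms like Latent Dirichlet Allocation (LDA), the input is typically a BoW count matrix (TF-IDF weights are sometimes used too, although LDA's generative model assumes raw counts). The algorithm then infers a set of topics, each a distribution over words, and describes every document as a mixture of those topics.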
Text representation is the bridge between the fluid, ambiguous world of human language and the rigid, mathematical world of machines. While we have moved towards more advanced techniques like Word Embeddings (Word2Vec, GloVe) and Contextual Embeddings (BERT, GPT), the principles of Bag-of-Words and TF-IDF remain foundational.
They teach us that frequency matters, that rarity implies information, and that even a simple “ingredient list” can capture a surprising amount of meaning.