Dec 12, 2025
Continuing our journey through the fascinating world of machine learning, today we’re going to explore one of the simplest yet most powerful algorithms: K-Nearest Neighbors (KNN). Imagine you’re new to a city and looking for a good restaurant. Instead of analyzing complex culinary theories, you simply ask: “Where do people similar to me like to eat?” You look at your neighbors, people with similar tastes, and choose based on their preferences. This intuitive approach is exactly how the K-Nearest Neighbors (KNN) algorithm works: it makes predictions by finding the most similar examples in your training data.
K-Nearest Neighbors (KNN) is a non-parametric, instance-based learning algorithm used for both classification and regression tasks. Unlike algorithms that learn a mathematical function during training (like linear regression or neural networks), KNN is a “lazy learner”, meaning it doesn’t really “learn” in the traditional sense. Instead, it simply stores all the training data and makes predictions by finding similar examples when needed.
The “K” in KNN refers to the number of nearest neighbors to consider when making a prediction. For example, if K=5, the algorithm looks at the 5 most similar training examples to make its prediction.
The fundamental assumption of KNN is simple yet powerful: similar things exist in close proximity. In other words, data points that are close to each other in feature space are likely to belong to the same class or have similar values. Let’s consider a simple example of classifying fruits as apples or oranges based on weight and color.
Feature Space:
Color (Redness 0-10)
↑
10 | 🍎 🍎
9 | 🍎 🍎 🍎
8 | 🍎 🍎
7 |
6 |
5 | ❓
4 |
3 | 🍊 🍊
2 | 🍊 🍊 🍊
1 | 🍊 🍊
0 |____________→ Weight (grams)
0 100 200
If we have a new fruit (❓) at position (180g, 5), KNN would find the K fruits closest to it in this feature space and assign the majority class among them.
KNN is called a lazy learner because it does no computation during training: it simply memorizes the data and defers all work (distance calculations, voting) to prediction time.
This contrasts with eager learners (like linear regression or decision trees) that spend time during training to build a model, then make fast predictions.
Now let’s dive into how KNN works for classification and regression. Let’s start with classification first.
For a classification problem with K neighbors:
Training Phase: simply store the training examples; no model is built.
Prediction Phase (for a new point $x_{new}$): compute the distance from $x_{new}$ to every training point, select the K closest, and return the majority class among their labels.
Given a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$ where $x^{(i)} \in \mathbb{R}^d$ is a feature vector and $y^{(i)}$ is its class label:
For a new point $x_{new}$, the prediction is:
\[\hat{y} = \text{mode}(\{y^{(i_1)}, y^{(i_2)}, ..., y^{(i_K)}\})\]

Where $i_1, i_2, \ldots, i_K$ are the indices of the K nearest neighbors, determined by:
\[\text{distance}(x_{new}, x^{(i_j)}) \leq \text{distance}(x_{new}, x^{(k)}) \quad \forall j \in \{1,...,K\},\; k \notin \{i_1,...,i_K\}\]

Let’s classify a new data point using K=3:
Training Data (2D: [Age, Income], Label):
New Point: [32, 55k] → ?
Step 1: Calculate distances (using Euclidean distance, normalized):
Step 2: Sort by distance:
Step 3: Majority vote (K=3):
Prediction: “No Purchase”
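The three steps above can be sketched end-to-end. Since the original training table isn’t reproduced here, the age/income values below are hypothetical stand-ins chosen only to illustrate the mechanics:

```python
import numpy as np
from collections import Counter

# Hypothetical training data: [age, income in $k] -> purchase label
X_train = np.array([[25, 40], [35, 60], [45, 80],
                    [20, 20], [30, 50], [52, 110]], dtype=float)
y_train = np.array(["No", "Yes", "Yes", "No", "No", "Yes"])

x_new = np.array([32, 55], dtype=float)

# Step 0: min-max scale features so income (in $k) doesn't dominate age
mins, maxs = X_train.min(axis=0), X_train.max(axis=0)

def scale(v):
    return (v - mins) / (maxs - mins)

# Step 1: Euclidean distances from the query to every training point
d = np.sqrt(((scale(X_train) - scale(x_new)) ** 2).sum(axis=1))

# Step 2: indices of the K=3 closest points
nearest = np.argsort(d)[:3]

# Step 3: majority vote among their labels
prediction = Counter(y_train[nearest]).most_common(1)[0][0]
print(prediction)  # No
```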
KNN can also predict continuous values. Instead of voting, we average the target values of K nearest neighbors.
Prediction Phase (for a new point $x_{new}$):
For regression, the prediction is:
\[\hat{y} = \frac{1}{K} \sum_{i \in \mathcal{N}_K(x_{new})} y^{(i)}\]

Where $\mathcal{N}_K(x_{new})$ denotes the set of K nearest neighbors of $x_{new}$.
Training Data (House features → Price):
New House: [1700 sqft, 3 beds] → ?
Using K=3 and Euclidean distance, suppose the 3 nearest neighbors have prices $320k, $300k, and $380k.
Prediction (simple average):
\[\hat{y} = \frac{320 + 300 + 380}{3} = \frac{1000}{3} = 333.33k\]

The choice of distance metric is crucial in KNN. Different metrics can lead to very different results.
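Returning briefly to the regression example: with uniform weights the prediction is just the mean of the three neighbor prices, which a couple of lines can confirm:

```python
import numpy as np

# Prices (in $k) of the K=3 nearest houses from the worked example
neighbor_prices = np.array([320.0, 300.0, 380.0])

# KNN regression with uniform weights is simply the neighbors' mean
prediction = neighbor_prices.mean()
print(round(prediction, 2))  # 333.33
```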
The most common metric, measuring “straight-line” distance:
\[d_{euclidean}(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2} = \|x - y\|_2\]

Properties:
- Sensitive to feature scale, so standardize features first
- Treats all dimensions equally
- A good default for dense, low-dimensional numeric data
Example: Distance between [3, 4] and [0, 0]: \(d = \sqrt{(3-0)^2 + (4-0)^2} = \sqrt{9 + 16} = 5\)
Also called “taxicab” or “city block” distance, measures the sum of absolute differences:
\[d_{manhattan}(x, y) = \sum_{i=1}^{d} |x_i - y_i| = \|x - y\|_1\]

Properties:
- Less sensitive to outliers than Euclidean (differences are not squared)
- Matches settings where movement follows a grid (hence “city block”)
Example: Distance between [3, 4] and [0, 0]: \(d = |3-0| + |4-0| = 3 + 4 = 7\)
A generalization of both Euclidean and Manhattan distances:
\[d_{minkowski}(x, y) = \left(\sum_{i=1}^{d} |x_i - y_i|^p\right)^{1/p}\]

Where:
- $p = 1$ gives Manhattan distance
- $p = 2$ gives Euclidean distance
- $p \to \infty$ gives Chebyshev distance
The maximum absolute difference along any dimension:
\[d_{chebyshev}(x, y) = \max_{1 \leq i \leq d} |x_i - y_i| = \|x - y\|_\infty\]

Example: Distance between [3, 4] and [0, 0]: \(d = \max(|3-0|, |4-0|) = \max(3, 4) = 4\)
Measures the angle between vectors, useful for text and high-dimensional data:
\[d_{cosine}(x, y) = 1 - \frac{x \cdot y}{\|x\| \|y\|} = 1 - \frac{\sum_{i=1}^{d} x_i y_i}{\sqrt{\sum_{i=1}^{d} x_i^2} \sqrt{\sum_{i=1}^{d} y_i^2}}\]

Properties:
- Compares direction only, ignoring vector magnitude
- Ranges from 0 (same direction) to 2 (opposite direction)
- Well suited to text represented as sparse term-frequency vectors
Used for categorical features, counts the number of positions at which symbols differ:
\[d_{hamming}(x, y) = \sum_{i=1}^{d} \mathbb{1}_{x_i \neq y_i}\]

Where $\mathbb{1}$ is the indicator function.
Example: Distance between “KAROLIN” and “KATHRIN”: the strings differ at positions 3, 4, and 5 (R/T, O/H, L/R), so $d = 3$.
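A minimal check of the string example:

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length inputs")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))

print(hamming("KAROLIN", "KATHRIN"))  # 3
```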
Consider points A = [0, 0], B = [3, 4]:
| Metric | Distance | Interpretation |
|---|---|---|
| Euclidean | 5.00 | Straight-line distance |
| Manhattan | 7.00 | Grid-based path |
| Minkowski (p=3) | 4.50 | Between Euclidean and Manhattan |
| Chebyshev | 4.00 | Maximum coordinate difference |
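The table values can be reproduced directly with NumPy:

```python
import numpy as np

A = np.array([0.0, 0.0])
B = np.array([3.0, 4.0])
diff = np.abs(A - B)

euclidean = np.sqrt((diff ** 2).sum())      # L2 norm
manhattan = diff.sum()                      # L1 norm
minkowski3 = (diff ** 3).sum() ** (1 / 3)   # Lp norm with p=3
chebyshev = diff.max()                      # L-infinity norm

print(euclidean, manhattan, round(minkowski3, 2), chebyshev)  # 5.0 7.0 4.5 4.0
```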
Now let’s dive into the mathematics of distances: understanding their properties will help you choose the right distance metric for your problem.
In KNN, the entire algorithm depends on measuring similarity. The distance function defines what “similar” means in your feature space. Let’s explore the mathematical properties that make a good distance metric.
A function $d(x, y)$ is a metric if it satisfies these properties:
- Non-negativity: $d(x, y) \geq 0$
- Identity: $d(x, y) = 0 \iff x = y$
- Symmetry: $d(x, y) = d(y, x)$
- Triangle inequality: $d(x, z) \leq d(x, y) + d(y, z)$
Consider two features with very different ranges, say age (tens of years) and income (tens of thousands of dollars). Without scaling, income dominates the distance calculation!

Example: compare two people who differ by 5 years in age but $1,000 in income.

Euclidean distance (unscaled): $\sqrt{5^2 + 1000^2} \approx 1000.01$

Income completely dominates! The age difference means nothing.
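A quick numeric check of this effect; the ages, incomes, and feature ranges below are illustrative assumptions:

```python
import numpy as np

# Two hypothetical customers: [age in years, income in dollars]
a = np.array([25.0, 50_000.0])
b = np.array([30.0, 51_000.0])

# Unscaled: the $1,000 income gap swamps the 5-year age gap
unscaled = np.sqrt(((a - b) ** 2).sum())
print(round(unscaled, 2))  # 1000.01

# Min-max scaled per feature (assuming age spans 18-70 and income
# spans $20k-$200k): both features now contribute comparably
ranges = np.array([70.0 - 18.0, 200_000.0 - 20_000.0])
scaled = np.sqrt((((a - b) / ranges) ** 2).sum())
print(round(scaled, 4))
```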
1. Min-Max Normalization (scales to [0, 1]):
\[x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}\]

2. Standardization (Z-score normalization):
\[x_{scaled} = \frac{x - \mu}{\sigma}\]

Where $\mu$ is the mean and $\sigma$ is the standard deviation.
3. Robust Scaling (uses median and IQR):
\[x_{scaled} = \frac{x - \text{median}(x)}{\text{IQR}(x)}\]

As dimensionality increases, distance metrics behave differently: all points tend to become equidistant (the curse of dimensionality).
Theorem: In high dimensions, the ratio of distances to nearest and farthest neighbors approaches 1:
\[\lim_{d \to \infty} \frac{d_{max} - d_{min}}{d_{min}} = 0\]

This means KNN becomes less effective as dimensionality increases without more data!
The choice of K dramatically affects model performance. It’s a critical hyperparameter that balances bias and variance.
Small K (e.g., K=1):
- Low bias, high variance
- Jagged decision boundary, sensitive to noise
- Prone to overfitting

Large K (e.g., K=100):
- High bias, low variance
- Smooth decision boundary that can wash out local structure
- Prone to underfitting
Error
↑
| Training Error
| ╱
| ╱
| ╱____________ Validation Error
| ╲
| ╲_____
| ╲
| ╲
+─────────────────────────→ K
1 100
← Complex Simple →
Optimal K: Where validation error is minimized.
Algorithm:
for K in [1, 3, 5, 7, 9, 11, 13, 15]:
    cv_scores = []
    for fold in cross_validation_folds:
        train_data, val_data = split(fold)
        model = KNN(K=K)
        model.fit(train_data)
        score = model.evaluate(val_data)
        cv_scores.append(score)
    mean_score = average(cv_scores)
    print(f"K={K}: Mean CV Score = {mean_score}")
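A runnable version of this search, using scikit-learn’s `cross_val_score`; the synthetic dataset is a stand-in for whatever data you actually have:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in dataset
X, y = make_classification(n_samples=300, n_features=6, random_state=42)

results = {}
for k in [1, 3, 5, 7, 9, 11, 13, 15]:
    # Scaling lives inside the pipeline so each CV fold is scaled
    # using only its own training portion (no leakage)
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    results[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(results, key=results.get)
print(f"Best K = {best_k} (CV accuracy {results[best_k]:.3f})")
```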
For classification, the probability of misclassification with K neighbors:
\[P(\text{error}) = 1 - P(y = c_{true})\]

Where $c_{true}$ is the true class. As K increases:
- Bias increases: the neighborhood grows and may cross class boundaries
- Variance decreases: the vote averages over more samples, so predictions are more stable
In standard KNN, all K neighbors get equal votes regardless of distance. But intuitively, closer neighbors should have more influence than distant ones.
Instead of equal voting, weight each neighbor by the inverse of its distance:
\[w_i = \frac{1}{d(x_{new}, x^{(i)})}\]

For Classification (weighted voting):
\[\hat{y} = \arg\max_c \sum_{i \in \mathcal{N}_K(x_{new})} w_i \cdot \mathbb{1}_{y^{(i)} = c}\]

For Regression (weighted average):
\[\hat{y} = \frac{\sum_{i \in \mathcal{N}_K(x_{new})} w_i \cdot y^{(i)}}{\sum_{i \in \mathcal{N}_K(x_{new})} w_i}\]

K=3 neighbors for regression, with target values 10, 15, 20 at distances 2, 5, 10 (so weights 0.5, 0.2, 0.1):
Simple average: $(10 + 15 + 20) / 3 = 15$
Weighted average: \(\hat{y} = \frac{0.5 \times 10 + 0.2 \times 15 + 0.1 \times 20}{0.5 + 0.2 + 0.1} = \frac{5 + 3 + 2}{0.8} = \frac{10}{0.8} = 12.5\)
The weighted prediction is closer to the nearest neighbor!
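Spelling out the computation above (assuming, as implied by the example’s weights 0.5, 0.2, 0.1, that the neighbor distances are 2, 5, and 10):

```python
import numpy as np

values = np.array([10.0, 15.0, 20.0])    # neighbor target values
distances = np.array([2.0, 5.0, 10.0])   # distances to the query

weights = 1.0 / distances                # 0.5, 0.2, 0.1
weighted_pred = (weights * values).sum() / weights.sum()
simple_pred = values.mean()

print(simple_pred, round(weighted_pred, 2))  # 15.0 12.5
```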
A more sophisticated approach uses Gaussian (RBF) weighting:
\[w_i = \exp\left(-\frac{d(x_{new}, x^{(i)})^2}{2\sigma^2}\right)\]

Where $\sigma$ controls the decay rate. This ensures:
- Weights decay smoothly with distance (no division by zero at $d = 0$)
- Weights stay in $(0, 1]$, with the nearest neighbors dominating
- $\sigma$ tunes how quickly influence falls off
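A sketch of Gaussian weighting for the same three neighbors; $\sigma = 5$ is an arbitrary bandwidth choice for illustration:

```python
import numpy as np

distances = np.array([2.0, 5.0, 10.0])
values = np.array([10.0, 15.0, 20.0])

sigma = 5.0  # arbitrary bandwidth; larger sigma flattens the weights
weights = np.exp(-distances ** 2 / (2 * sigma ** 2))
prediction = (weights * values).sum() / weights.sum()
print(weights.round(3), round(prediction, 2))
```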
The curse of dimensionality refers to various phenomena that arise when analyzing data in high-dimensional spaces. For KNN, it means that as the number of features grows, the algorithm becomes less effective.
In high dimensions:
- Data becomes sparse: a fixed number of points spreads over exponentially more volume
- Distances concentrate: nearest and farthest neighbors become nearly equidistant
- “Local” neighborhoods stop being local
Consider a unit hypercube in d dimensions $[0, 1]^d$. The volume of a hypersphere of radius r inscribed in this cube is:
\[V(r, d) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)} r^d\]

As $d$ increases:
- The inscribed sphere occupies a vanishing fraction of the cube’s volume
- Almost all of the cube’s volume ends up in its corners, far from the center
Example: to capture 10% of uniformly distributed data with a sub-cube, you need edge length $0.1^{1/d}$ per dimension: about 0.32 in 2 dimensions, but about 0.80 in 10 dimensions.
This means in high dimensions, you need exponentially more data to maintain the same density!
In high dimensions, distances between points become similar:
\[\frac{d_{max} - d_{min}}{d_{min}} \xrightarrow{d \to \infty} 0\]

Example: Generate 1000 random points in d dimensions and compute pairwise distances:
| Dimensions | Mean Distance | Std. Deviation | Coefficient of Variation |
|---|---|---|---|
| 2 | 0.52 | 0.17 | 0.33 |
| 10 | 1.47 | 0.20 | 0.14 |
| 100 | 4.69 | 0.24 | 0.05 |
| 1000 | 14.87 | 0.38 | 0.03 |
As dimensions increase, relative variance in distances decreases!
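This experiment is easy to reproduce. The exact numbers depend on the random draw, but the shrinking coefficient of variation is robust:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

def distance_cv(dim, n_points=1000):
    """Coefficient of variation of pairwise distances for uniform random points."""
    pts = rng.random((n_points, dim))
    dists = pdist(pts)  # all pairwise Euclidean distances
    return dists.std() / dists.mean()

for dim in [2, 10, 100, 1000]:
    # CV shrinks as dimension grows: distances concentrate
    print(f"d={dim:4d}: CV = {distance_cv(dim):.3f}")
```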
1. Dimensionality Reduction: project onto fewer dimensions with PCA, or use nonlinear methods like t-SNE/UMAP.
2. Feature Engineering: keep only relevant features; irrelevant dimensions add noise to every distance.
3. Distance Metrics: consider cosine distance or fractional norms, which can stay more discriminative in high dimensions.
4. Increase Sample Size: density requirements grow exponentially with dimension, so more data directly helps.
Let’s implement KNN from scratch in Python to understand every detail.
import numpy as np
from collections import Counter
class KNearestNeighbors:
"""
K-Nearest Neighbors implementation for classification and regression.
Parameters:
-----------
k : int
Number of neighbors to consider
metric : str
Distance metric ('euclidean', 'manhattan', 'minkowski', 'cosine')
p : int
Parameter for Minkowski distance (2 = Euclidean, 1 = Manhattan)
weighted : bool
Whether to use distance-weighted voting
task : str
'classification' or 'regression'
"""
def __init__(self, k=5, metric='euclidean', p=2, weighted=False,
task='classification'):
self.k = k
self.metric = metric
self.p = p
self.weighted = weighted
self.task = task
self.X_train = None
self.y_train = None
def fit(self, X, y):
"""
Store training data (lazy learning - no actual training).
Parameters:
-----------
X : numpy.ndarray of shape (n_samples, n_features)
Training feature vectors
y : numpy.ndarray of shape (n_samples,)
Target values
"""
self.X_train = np.array(X)
self.y_train = np.array(y)
return self
def predict(self, X):
"""
Predict labels for test data.
Parameters:
-----------
X : numpy.ndarray of shape (n_samples, n_features)
Test feature vectors
Returns:
--------
predictions : numpy.ndarray of shape (n_samples,)
Predicted labels
"""
X = np.array(X)
predictions = [self._predict_single(x) for x in X]
return np.array(predictions)
def _predict_single(self, x):
"""Predict label for a single data point."""
# Calculate distances to all training points
distances = self._calculate_distances(x)
# Get indices of k nearest neighbors
k_indices = np.argsort(distances)[:self.k]
# Get labels of k nearest neighbors
k_nearest_labels = self.y_train[k_indices]
if self.weighted:
# Distance-weighted prediction
k_nearest_distances = distances[k_indices]
# Avoid division by zero
weights = 1 / (k_nearest_distances + 1e-10)
if self.task == 'classification':
# Weighted voting
return self._weighted_vote(k_nearest_labels, weights)
else:
# Weighted average
return np.sum(weights * k_nearest_labels) / np.sum(weights)
else:
# Simple prediction (no weighting)
if self.task == 'classification':
# Majority vote
return Counter(k_nearest_labels).most_common(1)[0][0]
else:
# Simple average
return np.mean(k_nearest_labels)
def _calculate_distances(self, x):
"""Calculate distances from x to all training points."""
if self.metric == 'euclidean':
return np.sqrt(np.sum((self.X_train - x) ** 2, axis=1))
elif self.metric == 'manhattan':
return np.sum(np.abs(self.X_train - x), axis=1)
elif self.metric == 'minkowski':
return np.power(np.sum(np.abs(self.X_train - x) ** self.p,
axis=1), 1/self.p)
elif self.metric == 'cosine':
# Cosine distance = 1 - cosine similarity
dot_products = np.dot(self.X_train, x)
x_norm = np.linalg.norm(x)
train_norms = np.linalg.norm(self.X_train, axis=1)
cosine_similarities = dot_products / (train_norms * x_norm + 1e-10)
return 1 - cosine_similarities
else:
raise ValueError(f"Unknown metric: {self.metric}")
def _weighted_vote(self, labels, weights):
"""Perform weighted voting for classification."""
unique_labels = np.unique(labels)
weighted_votes = {}
for label in unique_labels:
# Sum weights for this label
mask = (labels == label)
weighted_votes[label] = np.sum(weights[mask])
# Return label with highest weighted vote
return max(weighted_votes, key=weighted_votes.get)
def score(self, X, y):
"""Calculate accuracy (classification) or R² (regression)."""
predictions = self.predict(X)
if self.task == 'classification':
# Accuracy
return np.mean(predictions == y)
else:
# R² score
ss_res = np.sum((y - predictions) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
return 1 - (ss_res / ss_tot)
# Example Usage: Classification
if __name__ == "__main__":
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Generate synthetic classification data
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
n_redundant=1, n_classes=3, random_state=42)
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Standardize features (critical for KNN!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train and evaluate KNN
knn = KNearestNeighbors(k=5, metric='euclidean', weighted=True,
task='classification')
knn.fit(X_train_scaled, y_train)
accuracy = knn.score(X_test_scaled, y_test)
print(f"KNN Accuracy: {accuracy:.4f}")
# Try different K values
print("\nAccuracy vs K:")
for k in [1, 3, 5, 7, 9, 11, 15, 21]:
knn = KNearestNeighbors(k=k, task='classification')
knn.fit(X_train_scaled, y_train)
acc = knn.score(X_test_scaled, y_test)
print(f"K={k:2d}: {acc:.4f}")
Training: $O(1)$ - Just stores data
Prediction (for one sample):
- Distance computation: $O(n \times d)$
- Finding the K smallest distances: $O(n \log n)$ with a full sort, or $O(n \log K)$ with a heap
For m test samples: $O(m \times n \times d)$
This is expensive for large datasets!
$O(n \times d)$ - Must store all training data
Build a binary tree that partitions space for faster neighbor search:
- Splits on one coordinate at a time (typically at the median)
- Query time is $O(\log n)$ on average in low dimensions
- Degrades toward brute force as dimensionality grows
Similar to KD-trees but uses hyperspheres instead of hyperplanes:
- Handles higher dimensions somewhat better than KD-trees
- Works with any metric satisfying the triangle inequality
Approximate nearest neighbor search:
- Hashes points so that nearby points collide with high probability
- Trades a little accuracy for sub-linear query time
Libraries like FAISS (Facebook AI Similarity Search) or Annoy (Spotify) provide highly optimized approximate nearest neighbor search that scales to millions of vectors.
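In practice you rarely implement these structures yourself. scikit-learn’s `NearestNeighbors` exposes KD-tree and ball-tree backends through its `algorithm` parameter (FAISS and Annoy have their own APIs, not shown here):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
X = rng.random((10_000, 8))

# 'kd_tree' and 'ball_tree' trade build time for fast queries;
# 'brute' is the O(n*d) linear scan; 'auto' picks heuristically
nn = NearestNeighbors(n_neighbors=5, algorithm="kd_tree").fit(X)
distances, indices = nn.kneighbors(rng.random((3, 8)))
print(indices.shape)  # (3, 5)
```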
| Aspect | KNN | Decision Tree | SVM | Neural Network |
|---|---|---|---|---|
| Training Speed | Instant | Fast | Medium | Slow |
| Prediction Speed | Slow | Fast | Medium | Fast |
| Memory | High | Low | Medium | Low |
| Interpretability | High | High | Medium | Low |
| Non-linear | Yes | Yes | Yes (kernel) | Yes |
| Curse of Dimensionality | High | Medium | Low | Low |
Now that we have a solid understanding of KNN and its fundamentals, we can apply it to real-world problems. In the next post, we will explore K-Means Clustering, a popular unsupervised learning algorithm used for clustering.