Machine Learning Series

by Mayank Sharma

K-Nearest Neighbors: Instance-Based Learning

Dec 12, 2025

Continuing our journey through the fascinating world of machine learning, today we’re going to explore one of the simplest yet most powerful algorithms: K-Nearest Neighbors (KNN). Imagine you’re new to a city and looking for a good restaurant. Instead of analyzing complex culinary theories, you simply ask: “Where do people similar to me like to eat?” You look at your neighbors, people with similar tastes, and choose based on their preferences. This intuitive approach is exactly how the K-Nearest Neighbors (KNN) algorithm works: it makes predictions by finding the most similar examples in your training data.

Table of Contents

  1. Introduction: Learning from Your Neighbors
  2. The Intuition Behind KNN
  3. KNN for Classification
  4. KNN for Regression
  5. Distance Metrics: Measuring Similarity
  6. The Mathematics of Distance
  7. Choosing the Optimal K
  8. Weighted KNN: Not All Neighbors Are Equal
  9. The Curse of Dimensionality
  10. Implementation from Scratch
  11. Computational Considerations
  12. Advantages and Limitations
  13. Conclusion

Introduction: Learning from Your Neighbors

What is K-Nearest Neighbors?

K-Nearest Neighbors (KNN) is a non-parametric, instance-based learning algorithm used for both classification and regression tasks. Unlike algorithms that learn a mathematical function during training (like linear regression or neural networks), KNN is a “lazy learner,” meaning it doesn’t really “learn” in the traditional sense. Instead, it simply stores all the training data and makes predictions by finding similar examples when needed.

The “K” in KNN refers to the number of nearest neighbors to consider when making a prediction. For example, if K=5, the algorithm looks at the 5 most similar training examples to make its prediction.

The Intuition Behind KNN

The Neighborhood Principle

The fundamental assumption of KNN is simple yet powerful: similar things exist in close proximity. In other words, data points that are close to each other in feature space are likely to belong to the same class or have similar values. Let’s consider a simple example of classifying fruits as apples or oranges based on weight and color.

Feature Space:
              Color (Redness 0-10)
                    
                10  |     🍎 🍎
                 9  |   🍎 🍎 🍎
                 8  |   🍎 🍎
                 7  |
                 6  |
                 5  |     
                 4  |
                 3  |   🍊 🍊
                 2  | 🍊 🍊 🍊
                 1  |   🍊 🍊
                 0  |____________ Weight (grams)
                    0  100  200

If we have a new fruit (❓) at position (180g, 5), KNN would:

  1. Measure its distance to all training examples
  2. Find the K nearest neighbors
  3. For classification: Use majority vote
  4. For regression: Use average (or weighted average)
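The four-step recipe above can be sketched in a few lines of NumPy. This is a minimal illustration on hypothetical toy data, not the full implementation we build later in this post:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    # Step 1: distance from x_new to every training point (Euclidean)
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 2: indices of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    labels = y_train[nearest]
    if task == "classification":
        # Step 3: majority vote among the k neighbors
        return Counter(labels.tolist()).most_common(1)[0][0]
    # Step 4: average of neighbor targets for regression
    return labels.mean()

# Hypothetical toy data: two well-separated clusters
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([1.1, 0.9]), k=3))  # majority of the 3 nearest is class 0
```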

Why “Lazy Learning”?

KNN is called a lazy learner because:

  1. There is no training phase: it simply memorizes the training data
  2. All computation is deferred: distances are calculated only at prediction time
  3. No model parameters are learned beyond the stored data itself

This contrasts with eager learners (like linear regression or decision trees) that spend time during training to build a model, then make fast predictions.

Now let’s dive into how KNN works for classification and regression. Let’s start with classification first.

KNN for Classification

The Algorithm

For a classification problem with K neighbors:

Training Phase:

  1. Store all training data points $(x^{(i)}, y^{(i)})$

Prediction Phase (for a new point $x_{new}$):

  1. Calculate distance from $x_{new}$ to all training points
  2. Sort distances in ascending order
  3. Select the K nearest neighbors
  4. Count the class labels among these K neighbors
  5. Assign the most common class (majority vote)

Mathematical Formulation

Given a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$ where:

  - $x^{(i)} \in \mathbb{R}^d$ is a feature vector
  - $y^{(i)}$ is the corresponding class label

For a new point $x_{new}$, the prediction is:

\[\hat{y} = \text{mode}(\{y^{(i_1)}, y^{(i_2)}, ..., y^{(i_K)}\})\]

Where $i_1, i_2, \ldots, i_K$ are the indices of the K nearest neighbors, determined by:

\[\text{distance}(x_{new}, x^{(i_j)}) \leq \text{distance}(x_{new}, x^{(k)}) \quad \forall j \in \{1,\ldots,K\},\ k \notin \{i_1,\ldots,i_K\}\]

Let’s Look at an Example

Let’s classify a new data point using K=3:

Training Data (2D: [Age, Income], Label):

New Point: [32, 55k] → ?

Step 1: Calculate distances (using Euclidean distance, normalized):

Step 2: Sort by distance:

  1. B: 5.39 → “No Purchase”
  2. C: 5.83 → “Purchase”
  3. F: 10.77 → “No Purchase”

Step 3: Majority vote (K=3):

  - “No Purchase”: 2 votes (B, F)
  - “Purchase”: 1 vote (C)

Prediction: “No Purchase”

KNN for Regression

The Algorithm

KNN can also predict continuous values. Instead of voting, we average the target values of K nearest neighbors.

Prediction Phase (for a new point $x_{new}$):

  1. Find K nearest neighbors
  2. Calculate the average (or weighted average) of their target values

Mathematical Formulation

For regression, the prediction is:

\[\hat{y} = \frac{1}{K} \sum_{i \in \mathcal{N}_K(x_{new})} y^{(i)}\]

Where $\mathcal{N}_K(x_{new})$ denotes the set of K nearest neighbors to $x_{new}$.

Concrete Example

Training Data (House features → Price):

New House: [1700 sqft, 3 beds] → ?

Using K=3 and Euclidean distance, suppose the 3 nearest neighbors have target prices of 320k, 300k, and 380k.

Prediction (simple average):

\[\hat{y} = \frac{320 + 300 + 380}{3} = \frac{1000}{3} = 333.33k\]

Distance Metrics: Measuring Similarity

The choice of distance metric is crucial in KNN. Different metrics can lead to very different results.

Let’s Look at Some Common Distance Metrics

1. Euclidean Distance (L2 Norm)

The most common metric, measuring “straight-line” distance:

\[d_{euclidean}(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2} = \|x - y\|_2\]

Properties:

  - Sensitive to feature scale (large-range features dominate)
  - Treats all dimensions equally; assumes an isotropic space
  - The default choice for continuous, low-dimensional data

Example: Distance between [3, 4] and [0, 0]: \(d = \sqrt{(3-0)^2 + (4-0)^2} = \sqrt{9 + 16} = 5\)

2. Manhattan Distance (L1 Norm)

Also called “taxicab” or “city block” distance, measures the sum of absolute differences:

\[d_{manhattan}(x, y) = \sum_{i=1}^{d} |x_i - y_i| = \|x - y\|_1\]

Properties:

  - Less sensitive to outliers than Euclidean distance
  - Often preferable when movement along each axis is independent (like a city grid)

Example: Distance between [3, 4] and [0, 0]: \(d = |3-0| + |4-0| = 3 + 4 = 7\)

3. Minkowski Distance (Generalized)

A generalization of both Euclidean and Manhattan distances:

\[d_{minkowski}(x, y) = \left(\sum_{i=1}^{d} |x_i - y_i|^p\right)^{1/p}\]

Where:

  - $p = 1$ gives Manhattan distance
  - $p = 2$ gives Euclidean distance
  - $p \to \infty$ gives Chebyshev distance

4. Chebyshev Distance (L∞ Norm)

The maximum absolute difference along any dimension:

\[d_{chebyshev}(x, y) = \max_{1 \leq i \leq d} |x_i - y_i| = \|x - y\|_\infty\]

Example: Distance between [3, 4] and [0, 0]: \(d = \max(|3-0|, |4-0|) = \max(3, 4) = 4\)

5. Cosine Distance

Measures the angle between vectors, useful for text and high-dimensional data:

\[d_{cosine}(x, y) = 1 - \frac{x \cdot y}{\|x\| \|y\|} = 1 - \frac{\sum_{i=1}^{d} x_i y_i}{\sqrt{\sum_{i=1}^{d} x_i^2} \sqrt{\sum_{i=1}^{d} y_i^2}}\]

Properties:

  - Depends only on the direction of the vectors, not their magnitude
  - Ranges from 0 (same direction) to 2 (opposite direction)
  - Well suited to sparse, high-dimensional data such as text

6. Hamming Distance

Used for categorical features, counts the number of positions at which symbols differ:

\[d_{hamming}(x, y) = \sum_{i=1}^{d} \mathbb{1}_{x_i \neq y_i}\]

Where $\mathbb{1}$ is the indicator function.

Example: Distance between “KAROLIN” and “KATHRIN”: the strings differ at positions 3, 4, and 5 (R↔T, O↔H, L↔R), so the Hamming distance is 3.

Comparing Distance Metrics

Consider points A = [0, 0], B = [3, 4]:

| Metric | Distance | Interpretation |
|---|---|---|
| Euclidean | 5.00 | Straight-line distance |
| Manhattan | 7.00 | Grid-based path |
| Minkowski (p=3) | 4.50 | Between Euclidean and Manhattan |
| Chebyshev | 4.00 | Maximum coordinate difference |
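These values are easy to verify in code; a quick NumPy sketch for A = [0, 0], B = [3, 4]:

```python
import numpy as np

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])

euclidean = np.sqrt(((a - b) ** 2).sum())            # L2 norm: 5.00
manhattan = np.abs(a - b).sum()                      # L1 norm: 7.00
minkowski3 = (np.abs(a - b) ** 3).sum() ** (1 / 3)   # p=3: ~4.50
chebyshev = np.abs(a - b).max()                      # L-infinity norm: 4.00

for name, d in [("Euclidean", euclidean), ("Manhattan", manhattan),
                ("Minkowski (p=3)", minkowski3), ("Chebyshev", chebyshev)]:
    print(f"{name}: {d:.2f}")
```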

When to Use Which Metric?

  - Euclidean: continuous features, low dimensionality, properly scaled data
  - Manhattan: grid-like data, or when robustness to outliers matters
  - Cosine: text, embeddings, and other sparse high-dimensional data
  - Hamming: categorical or binary features

Now let’s dive into the mathematics of distances. Understanding these properties will help you choose the right metric for your problem.

The Mathematics of Distance

Why Distance Matters

In KNN, the entire algorithm depends on measuring similarity. The distance function defines what “similar” means in your feature space. Let’s explore the mathematical properties that make a good distance metric.

Properties of a Valid Distance Metric

A function $d(x, y)$ is a metric if it satisfies these properties:

  1. Non-negativity: $d(x, y) \geq 0$ for all $x, y$
  2. Identity: $d(x, y) = 0$ if and only if $x = y$
  3. Symmetry: $d(x, y) = d(y, x)$
  4. Triangle Inequality: $d(x, z) \leq d(x, y) + d(y, z)$

Feature Scaling: Critical for Distance-Based Methods

Consider two features:

  - Age: typically ranges from about 18 to 80
  - Income: typically ranges from about 20,000 to 200,000

Without scaling, income dominates the distance calculation!

Example: compare a person aged 25 earning 50,000 with a person aged 55 earning 51,000.

Euclidean distance (unscaled): \(d = \sqrt{(25-55)^2 + (50000-51000)^2} = \sqrt{900 + 1000000} \approx 1000.4\)

Income completely dominates! Age difference means nothing.

Common Scaling Methods

1. Min-Max Normalization (scales to [0, 1]):

\[x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}\]

2. Standardization (Z-score normalization):

\[x_{scaled} = \frac{x - \mu}{\sigma}\]

Where $\mu$ is mean and $\sigma$ is standard deviation.

3. Robust Scaling (uses median and IQR):

\[x_{scaled} = \frac{x - \text{median}(x)}{\text{IQR}(x)}\]
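Here is a small sketch of the scaling effect using min-max normalization. The two example people and the assumed feature ranges (age 18–80, income 20k–200k) are illustrative assumptions, not data from this article:

```python
import numpy as np

# Hypothetical comparison of two people: [age, income]
A = np.array([25.0, 50_000.0])
B = np.array([55.0, 51_000.0])

unscaled = np.sqrt(((A - B) ** 2).sum())
print(f"Unscaled distance: {unscaled:.1f}")  # ~1000.4, income dominates

# Min-max normalization with assumed feature ranges
mins = np.array([18.0, 20_000.0])
maxs = np.array([80.0, 200_000.0])
A_s = (A - mins) / (maxs - mins)
B_s = (B - mins) / (maxs - mins)

scaled = np.sqrt(((A_s - B_s) ** 2).sum())
print(f"Scaled distance: {scaled:.3f}")  # now driven by the 30-year age gap
```

After scaling, the 30-year age difference (about 0.48 of the age range) dominates the tiny relative income difference, which is what we want.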

Distance in High Dimensions

As dimensionality increases, distance metrics behave differently. All points tend to become equidistant (curse of dimensionality).

Theorem: In high dimensions, the ratio of distances to nearest and farthest neighbors approaches 1:

\[\lim_{d \to \infty} \frac{d_{max} - d_{min}}{d_{min}} = 0\]

This means KNN becomes less effective as dimensionality increases without more data!

Choosing the Optimal K

The choice of K dramatically affects model performance. It’s a critical hyperparameter that balances bias and variance.

The Bias-Variance Tradeoff

Small K (e.g., K=1):

  - Low bias: the decision boundary follows the training data closely
  - High variance: sensitive to noise and outliers (overfitting)

Large K (e.g., K=100):

  - High bias: the boundary is over-smoothed and can miss local structure
  - Low variance: predictions are stable but may underfit

The Sweet Spot

Error
  |
  | \                        _ Validation Error
  |  \                    _/
  |   \_              ___/
  |     \____________/
  |                       ____ Training Error
  |              ________/
  |  __________/
  +─────────────────────────→ K
  1                       100
     Complex        Simple

Optimal K: Where validation error is minimized.

Guidelines for Choosing K

  1. Try odd numbers for binary classification (avoids tied votes)
  2. Start with $K = \sqrt{n}$ where n is the number of training samples
  3. Use cross-validation to find optimal K
  4. Consider class imbalance: Larger K can dilute minority class
  5. Domain knowledge: Sometimes K has interpretable meaning

Cross-Validation for K Selection

Algorithm:

for K in [1, 3, 5, 7, 9, 11, 13, 15]:
    cv_scores = []
    for fold in cross_validation_folds:
        train_data, val_data = split(fold)
        model = KNN(K=K)
        model.fit(train_data)
        score = model.evaluate(val_data)
        cv_scores.append(score)
    mean_score = average(cv_scores)
    print(f"K={K}: Mean CV Score = {mean_score}")
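The pseudocode above can be made concrete with a self-contained NumPy version. This sketch uses leave-one-out cross-validation on synthetic two-blob data, so the data and the resulting scores are purely illustrative:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(42)
# Two Gaussian blobs of 40 points each, centered at (0, 0) and (3, 3)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(3, 1, (40, 2))])
y = np.array([0] * 40 + [1] * 40)

def loo_accuracy(X, y, k):
    """Leave-one-out accuracy of brute-force KNN for a given k."""
    correct = 0
    for i in range(len(X)):
        dists = np.sqrt(((X - X[i]) ** 2).sum(axis=1))
        dists[i] = np.inf  # exclude the held-out point itself
        neighbors = y[np.argsort(dists)[:k]]
        pred = Counter(neighbors.tolist()).most_common(1)[0][0]
        correct += (pred == y[i])
    return correct / len(X)

for k in [1, 3, 5, 7, 9]:
    print(f"K={k}: LOO accuracy = {loo_accuracy(X, y, k):.3f}")
```

In practice you would use `sklearn.model_selection.cross_val_score` with `KNeighborsClassifier`, but the loop above shows exactly what that call does.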

Mathematical Analysis of K

For classification, the probability of misclassification with K neighbors:

\[P(\text{error}) = 1 - P(y = c_{true})\]

Where $c_{true}$ is the true class. As K increases:

  - The variance of the majority vote decreases (roughly as $1/K$)
  - The bias increases, since votes come from increasingly distant, less similar points

Weighted KNN: Not All Neighbors Are Equal

The Problem with Simple KNN

In standard KNN, all K neighbors get equal votes regardless of distance. But intuitively, closer neighbors should have more influence than distant ones.

Distance-Weighted KNN

Instead of equal voting, weight each neighbor by the inverse of its distance:

\[w_i = \frac{1}{d(x_{new}, x^{(i)})}\]

For Classification (weighted voting):

\[\hat{y} = \arg\max_c \sum_{i \in \mathcal{N}_K(x_{new})} w_i \cdot \mathbb{1}_{y^{(i)} = c}\]

For Regression (weighted average):

\[\hat{y} = \frac{\sum_{i \in \mathcal{N}_K(x_{new})} w_i \cdot y^{(i)}}{\sum_{i \in \mathcal{N}_K(x_{new})} w_i}\]

Let’s Understand with an Example

K=3 neighbors for regression, with target values 10, 15, and 20 at distances 2, 5, and 10, so the inverse-distance weights are 0.5, 0.2, and 0.1:

Simple average: $(10 + 15 + 20) / 3 = 15$

Weighted average: \(\hat{y} = \frac{0.5 \times 10 + 0.2 \times 15 + 0.1 \times 20}{0.5 + 0.2 + 0.1} = \frac{5 + 3 + 2}{0.8} = \frac{10}{0.8} = 12.5\)

The weighted prediction is closer to the nearest neighbor!
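A quick check of this arithmetic in code, assuming the three neighbors sit at distances 2, 5, and 10 (which is what yields weights 0.5, 0.2, and 0.1):

```python
import numpy as np

# Neighbor distances and target values from the worked example
distances = np.array([2.0, 5.0, 10.0])
values = np.array([10.0, 15.0, 20.0])

weights = 1.0 / distances  # inverse-distance weights: 0.5, 0.2, 0.1
simple = values.mean()
weighted = (weights * values).sum() / weights.sum()
print(simple, weighted)  # 15.0 12.5
```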

Gaussian Weighting

A more sophisticated approach uses Gaussian (RBF) weighting:

\[w_i = \exp\left(-\frac{d(x_{new}, x^{(i)})^2}{2\sigma^2}\right)\]

Where $\sigma$ controls the decay rate. This ensures:

  - Weights decay smoothly rather than abruptly with distance
  - A neighbor at distance 0 gets weight exactly 1 (no division by zero)
  - $\sigma$ gives explicit control over how local the prediction is
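A minimal sketch of Gaussian weighting, showing how $\sigma$ controls the decay:

```python
import numpy as np

def gaussian_weights(distances, sigma=1.0):
    # RBF weighting: decays smoothly with squared distance; weight is 1 at d=0
    return np.exp(-np.asarray(distances) ** 2 / (2 * sigma ** 2))

d = np.array([0.0, 1.0, 2.0, 4.0])
for sigma in [0.5, 1.0, 2.0]:
    print(f"sigma={sigma}: {np.round(gaussian_weights(d, sigma), 4)}")
```

Small $\sigma$ concentrates nearly all weight on the closest neighbors; large $\sigma$ approaches uniform (unweighted) KNN.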

The Curse of Dimensionality

What Is It?

The curse of dimensionality refers to various phenomena that arise when analyzing data in high-dimensional spaces. For KNN, it means that as the number of features grows, the algorithm becomes less effective.

Why It Happens

In high dimensions:

  1. Data becomes sparse: The available data is spread thinly across the feature space
  2. Distances become uniform: All points are approximately equidistant
  3. Nearest neighbors aren’t really “near”: Even the closest points are far apart
  4. Computational cost explodes: Distance calculations become expensive

Mathematical Insight

Consider a unit hypercube in d dimensions $[0, 1]^d$. The volume of a hypersphere of radius r inscribed in this cube is:

\[V(r, d) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)} r^d\]

As $d$ increases:

  - The inscribed sphere’s volume shrinks toward zero relative to the cube’s, so almost all of the cube’s volume sits in its corners
  - Points drawn uniformly from the cube end up far from any fixed center

Example: To capture 10% of uniformly distributed data, you need a sub-cube with edge length $0.1^{1/d}$ in each dimension:

  - $d = 1$: edge = 0.10
  - $d = 10$: edge ≈ 0.79
  - $d = 100$: edge ≈ 0.98 (nearly the full range of every feature!)

This means in high dimensions, you need exponentially more data to maintain the same density!
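Both effects can be computed directly with the standard library (`math.gamma` implements the $\Gamma$ function in the volume formula):

```python
import math

def ball_volume(r, d):
    # Volume of a d-dimensional ball: pi^(d/2) / Gamma(d/2 + 1) * r^d
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d

# Fraction of the unit cube [0, 1]^d filled by its inscribed ball (r = 0.5)
for d in [2, 5, 10, 20]:
    print(f"d={d:2d}: inscribed-ball volume = {ball_volume(0.5, d):.2e}")

# Edge length of a sub-cube capturing 10% of uniformly distributed data
for d in [1, 10, 100]:
    print(f"d={d:3d}: required edge length = {0.1 ** (1 / d):.3f}")
```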

Distance Concentration

In high dimensions, distances between points become similar:

\[\frac{d_{max} - d_{min}}{d_{min}} \xrightarrow{d \to \infty} 0\]

Example: Generate 1000 random points in d dimensions, compute pairwise distances:

| Dimensions | Mean Distance | Std. Deviation | Coefficient of Variation |
|---|---|---|---|
| 2 | 0.52 | 0.17 | 0.33 |
| 10 | 1.47 | 0.20 | 0.14 |
| 100 | 4.69 | 0.24 | 0.05 |
| 1000 | 14.87 | 0.38 | 0.03 |

As dimensions increase, relative variance in distances decreases!
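This concentration effect is easy to reproduce. The sketch below draws random points in the unit cube and reports the coefficient of variation of their pairwise distances; exact values depend on the random sample, but the downward trend is robust:

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_cv(d, n=500):
    """Coefficient of variation of pairwise distances among n uniform points."""
    X = rng.random((n, d))
    # Pairwise squared distances via the Gram-matrix identity ||x-y||^2 = ||x||^2 + ||y||^2 - 2<x,y>
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    dists = np.sqrt(d2[np.triu_indices(n, k=1)])
    return dists.std() / dists.mean()

for d in [2, 10, 100, 1000]:
    print(f"d={d:4d}: coefficient of variation = {distance_cv(d):.3f}")
```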

Mitigating the Curse

1. Dimensionality Reduction:

  - Project to fewer dimensions with PCA, t-SNE, or UMAP before applying KNN

2. Feature Engineering:

  - Keep only relevant features; irrelevant ones add noise to every distance

3. Distance Metrics:

  - Cosine distance or fractional norms ($p < 1$) can remain more discriminative in high dimensions

4. Increase Sample Size:

  - More data partially offsets sparsity, but the amount needed grows exponentially with $d$

Implementation from Scratch

Let’s implement KNN from scratch in Python to understand every detail.

Complete KNN Implementation

import numpy as np
from collections import Counter

class KNearestNeighbors:
    """
    K-Nearest Neighbors implementation for classification and regression.

    Parameters:
    -----------
    k : int
        Number of neighbors to consider
    metric : str
        Distance metric ('euclidean', 'manhattan', 'minkowski', 'cosine')
    p : int
        Parameter for Minkowski distance (2 = Euclidean, 1 = Manhattan)
    weighted : bool
        Whether to use distance-weighted voting
    task : str
        'classification' or 'regression'
    """

    def __init__(self, k=5, metric='euclidean', p=2, weighted=False,
                 task='classification'):
        self.k = k
        self.metric = metric
        self.p = p
        self.weighted = weighted
        self.task = task
        self.X_train = None
        self.y_train = None

    def fit(self, X, y):
        """
        Store training data (lazy learning - no actual training).

        Parameters:
        -----------
        X : numpy.ndarray of shape (n_samples, n_features)
            Training feature vectors
        y : numpy.ndarray of shape (n_samples,)
            Target values
        """
        self.X_train = np.array(X)
        self.y_train = np.array(y)
        return self

    def predict(self, X):
        """
        Predict labels for test data.

        Parameters:
        -----------
        X : numpy.ndarray of shape (n_samples, n_features)
            Test feature vectors

        Returns:
        --------
        predictions : numpy.ndarray of shape (n_samples,)
            Predicted labels
        """
        X = np.array(X)
        predictions = [self._predict_single(x) for x in X]
        return np.array(predictions)

    def _predict_single(self, x):
        """Predict label for a single data point."""
        # Calculate distances to all training points
        distances = self._calculate_distances(x)

        # Get indices of k nearest neighbors
        k_indices = np.argsort(distances)[:self.k]

        # Get labels of k nearest neighbors
        k_nearest_labels = self.y_train[k_indices]

        if self.weighted:
            # Distance-weighted prediction
            k_nearest_distances = distances[k_indices]
            # Avoid division by zero
            weights = 1 / (k_nearest_distances + 1e-10)

            if self.task == 'classification':
                # Weighted voting
                return self._weighted_vote(k_nearest_labels, weights)
            else:
                # Weighted average
                return np.sum(weights * k_nearest_labels) / np.sum(weights)
        else:
            # Simple prediction (no weighting)
            if self.task == 'classification':
                # Majority vote
                return Counter(k_nearest_labels).most_common(1)[0][0]
            else:
                # Simple average
                return np.mean(k_nearest_labels)

    def _calculate_distances(self, x):
        """Calculate distances from x to all training points."""
        if self.metric == 'euclidean':
            return np.sqrt(np.sum((self.X_train - x) ** 2, axis=1))

        elif self.metric == 'manhattan':
            return np.sum(np.abs(self.X_train - x), axis=1)

        elif self.metric == 'minkowski':
            return np.power(np.sum(np.abs(self.X_train - x) ** self.p,
                                   axis=1), 1/self.p)

        elif self.metric == 'cosine':
            # Cosine distance = 1 - cosine similarity
            dot_products = np.dot(self.X_train, x)
            x_norm = np.linalg.norm(x)
            train_norms = np.linalg.norm(self.X_train, axis=1)
            cosine_similarities = dot_products / (train_norms * x_norm + 1e-10)
            return 1 - cosine_similarities

        else:
            raise ValueError(f"Unknown metric: {self.metric}")

    def _weighted_vote(self, labels, weights):
        """Perform weighted voting for classification."""
        unique_labels = np.unique(labels)
        weighted_votes = {}

        for label in unique_labels:
            # Sum weights for this label
            mask = (labels == label)
            weighted_votes[label] = np.sum(weights[mask])

        # Return label with highest weighted vote
        return max(weighted_votes, key=weighted_votes.get)

    def score(self, X, y):
        """Calculate accuracy (classification) or R² (regression)."""
        predictions = self.predict(X)

        if self.task == 'classification':
            # Accuracy
            return np.mean(predictions == y)
        else:
            # R² score
            ss_res = np.sum((y - predictions) ** 2)
            ss_tot = np.sum((y - np.mean(y)) ** 2)
            return 1 - (ss_res / ss_tot)


# Example Usage: Classification
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # Generate synthetic classification data
    X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                               n_redundant=1, n_classes=3, random_state=42)

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    # Standardize features (critical for KNN!)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Train and evaluate KNN
    knn = KNearestNeighbors(k=5, metric='euclidean', weighted=True,
                            task='classification')
    knn.fit(X_train_scaled, y_train)

    accuracy = knn.score(X_test_scaled, y_test)
    print(f"KNN Accuracy: {accuracy:.4f}")

    # Try different K values
    print("\nAccuracy vs K:")
    for k in [1, 3, 5, 7, 9, 11, 15, 21]:
        knn = KNearestNeighbors(k=k, task='classification')
        knn.fit(X_train_scaled, y_train)
        acc = knn.score(X_test_scaled, y_test)
        print(f"K={k:2d}: {acc:.4f}")

Computational Considerations

Time Complexity

Training: $O(1)$ - Just stores data

Prediction (for one sample):

  - Distances to all n training points: $O(n \times d)$
  - Finding the K nearest: $O(n \log n)$ with a full sort, or $O(n \log K)$ with a heap

For m test samples: $O(m \times n \times d)$

This is expensive for large datasets!

Space Complexity

$O(n \times d)$ - Must store all training data

Optimization Techniques

1. KD-Trees

Build a binary tree that partitions space along coordinate axes for faster neighbor search:

  - Average query time: $O(\log n)$ in low dimensions
  - Performance degrades toward brute force as dimensionality grows (roughly $d > 20$)

2. Ball Trees

Similar to KD-trees but uses hyperspheres instead of hyperplanes:

  - More robust than KD-trees in moderately high dimensions
  - Works with any metric that satisfies the triangle inequality

3. Locality-Sensitive Hashing (LSH)

Approximate nearest neighbor search:

  - Hashes similar points into the same buckets with high probability
  - Sublinear query time, at the cost of occasionally missing a true neighbor

4. Approximate Nearest Neighbors

Libraries like FAISS (Facebook AI Similarity Search) or Annoy (Spotify):

  - Trade a small amount of accuracy for orders-of-magnitude faster queries
  - Standard practice for large-scale similarity search over embeddings

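As a quick sanity check that tree-based search returns the same neighbors as brute force, here is a sketch using scikit-learn’s `NearestNeighbors` (the same library used in the example earlier); both strategies are exact, so only the speed profile differs:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(7)
X = rng.random((10_000, 3))      # low-dimensional data, where KD-trees shine
query = rng.random((1, 3))

# Same exact search, two different strategies
brute = NearestNeighbors(n_neighbors=5, algorithm="brute").fit(X)
tree = NearestNeighbors(n_neighbors=5, algorithm="kd_tree").fit(X)

dist_b, idx_b = brute.kneighbors(query)
dist_t, idx_t = tree.kneighbors(query)
print(np.array_equal(idx_b, idx_t))  # both find the same 5 neighbors
```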
When KNN is Computationally Feasible

  - Small-to-moderate datasets (up to roughly $10^5$–$10^6$ points with a tree or ANN index)
  - Low-to-moderate dimensionality, or after dimensionality reduction
  - When prediction latency is not critical, or queries can be batched
Advantages and Limitations

Advantages

  1. Simple and Intuitive: Easy to understand and explain
  2. No Training: Instant “training” (just store data)
  3. Non-Parametric: Makes no assumptions about data distribution
  4. Naturally Multi-Class: Handles multiple classes without modification
  5. Versatile: Works for classification and regression
  6. Interpretable: Can inspect which neighbors influenced prediction
  7. Adapts to New Data: Simply add new samples to training set

Limitations

  1. Computationally Expensive: Slow predictions, especially with large datasets
  2. Memory Intensive: Must store all training data
  3. Curse of Dimensionality: Performance degrades in high dimensions
  4. Sensitive to Scaling: Features must be normalized
  5. Sensitive to Irrelevant Features: All features affect distance equally
  6. Imbalanced Data: Majority class dominates with large K
  7. No Model: Can’t inspect learned patterns like decision trees
  8. Outliers: Sensitive to noisy data and outliers

Comparison with Other Algorithms

| Aspect | KNN | Decision Tree | SVM | Neural Network |
|---|---|---|---|---|
| Training Speed | Instant | Fast | Medium | Slow |
| Prediction Speed | Slow | Fast | Medium | Fast |
| Memory | High | Low | Medium | Low |
| Interpretability | High | High | Medium | Low |
| Non-linear | Yes | Yes | Yes (kernel) | Yes |
| Curse of Dimensionality | High | Medium | Low | Low |

Conclusion

Now that we have a solid understanding of KNN and its fundamentals, we can apply it to real-world problems. In the next post, we will explore K-Means Clustering, a popular unsupervised learning algorithm.