Machine Learning Series

by Mayank Sharma

K-Nearest Neighbors: Instance-Based Learning

Dec 12, 2025

Continuing our journey through the fascinating world of machine learning, today we’re going to explore one of the simplest yet most powerful algorithms: K-Nearest Neighbors (KNN). Imagine you’re new to a city and looking for a good restaurant. Instead of analyzing complex culinary theories, you simply ask: “Where do people similar to me like to eat?” You look at your neighbors, people with similar tastes, and choose based on their preferences. This intuitive approach is exactly how the K-Nearest Neighbors (KNN) algorithm works: it makes predictions by finding the most similar examples in your training data.

Table of Contents

  1. Introduction: Learning from Your Neighbors
  2. The Intuition Behind KNN
  3. KNN for Classification
  4. KNN for Regression
  5. Distance Metrics: Measuring Similarity
  6. The Mathematics of Distance
  7. Choosing the Optimal K
  8. Weighted KNN: Not All Neighbors Are Equal
  9. The Curse of Dimensionality
  10. Implementation from Scratch
  11. Computational Considerations
  12. Advantages and Limitations
  13. Conclusion

Introduction: Learning from Your Neighbors

What is K-Nearest Neighbors?

K-Nearest Neighbors (KNN) is a non-parametric, instance-based learning algorithm used for both classification and regression tasks. Unlike algorithms that learn a mathematical function during training (like linear regression or neural networks), KNN is a “lazy learner,” meaning it doesn’t really “learn” in the traditional sense. Instead, it simply stores all the training data and makes predictions by finding similar examples when needed.

The “K” in KNN refers to the number of nearest neighbors to consider when making a prediction. For example, if K=5, the algorithm looks at the 5 most similar training examples to make its prediction.

The Intuition Behind KNN

The Neighborhood Principle

The fundamental assumption of KNN is simple yet powerful: similar things exist in close proximity. In other words, data points that are close to each other in feature space are likely to belong to the same class or have similar values. Let’s consider a simple example of classifying fruits as apples or oranges based on weight and color.

Feature Space:
              Color (Redness 0-10)
                    
                10  |     🍎 🍎
                 9  |   🍎 🍎 🍎
                 8  |   🍎 🍎
                 7  |
                 6  |
                 5  |     
                 4  |
                 3  |   🍊 🍊
                 2  | 🍊 🍊 🍊
                 1  |   🍊 🍊
                 0  |____________ Weight (grams)
                    0  100  200

If we have a new fruit (❓) at position (180g, 5), KNN would:

  1. Measure its distance to all training examples
  2. Find the K nearest neighbors
  3. For classification: Use majority vote
  4. For regression: Use average (or weighted average)
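The four-step recipe above can be sketched in a few lines of NumPy. This is a minimal illustration on hypothetical toy data, not the full implementation we build later in this post:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    # Step 1: distance from x_new to every training point (Euclidean)
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 2: indices of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    labels = y_train[nearest]
    if task == "classification":
        # Step 3: majority vote among the k neighbors
        return Counter(labels.tolist()).most_common(1)[0][0]
    # Step 4: average of neighbor targets for regression
    return labels.mean()

# Hypothetical toy data: two well-separated clusters
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([1.1, 0.9]), k=3))  # majority of the 3 nearest is class 0
```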

Why “Lazy Learning”?

KNN is called a lazy learner because:

  1. There is no training phase: it simply memorizes the training data
  2. All computation is deferred: distances are calculated only at prediction time
  3. No model parameters are learned beyond the stored data itself

This contrasts with eager learners (like linear regression or decision trees) that spend time during training to build a model, then make fast predictions.

Now let’s dive into how KNN works for classification and regression. Let’s start with classification first.

KNN for Classification

The Algorithm

For a classification problem with K neighbors:

Training Phase:

  1. Store all training data points $(x^{(i)}, y^{(i)})$

Prediction Phase (for a new point $x_{new}$):

  1. Calculate distance from $x_{new}$ to all training points
  2. Sort distances in ascending order
  3. Select the K nearest neighbors
  4. Count the class labels among these K neighbors
  5. Assign the most common class (majority vote)

Mathematical Formulation

Given a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$ where:

  - $x^{(i)} \in \mathbb{R}^d$ is a feature vector
  - $y^{(i)}$ is the corresponding class label

For a new point $x_{new}$, the prediction is:

\[\hat{y} = \text{mode}(\{y^{(i_1)}, y^{(i_2)}, ..., y^{(i_K)}\})\]

Where $i_1, i_2, \ldots, i_K$ are the indices of the K nearest neighbors, determined by:

\[\text{distance}(x_{new}, x^{(i_j)}) \leq \text{distance}(x_{new}, x^{(k)}) \quad \forall j \in \{1,\ldots,K\},\ k \notin \{i_1,\ldots,i_K\}\]

Let’s Look at an Example

Let’s classify a new data point using K=3:

Training Data (2D: [Age, Income], Label):

New Point: [32, 55k] → ?

Step 1: Calculate distances (using Euclidean distance, normalized):

Step 2: Sort by distance:

  1. B: 5.39 → “No Purchase”
  2. C: 5.83 → “Purchase”
  3. F: 10.77 → “No Purchase”

Step 3: Majority vote (K=3):

  - “No Purchase”: 2 votes (B, F)
  - “Purchase”: 1 vote (C)

Prediction: “No Purchase”

KNN for Regression

The Algorithm

KNN can also predict continuous values. Instead of voting, we average the target values of K nearest neighbors.

Prediction Phase (for a new point $x_{new}$):

  1. Find K nearest neighbors
  2. Calculate the average (or weighted average) of their target values

Mathematical Formulation

For regression, the prediction is:

\[\hat{y} = \frac{1}{K} \sum_{i \in \mathcal{N}_K(x_{new})} y^{(i)}\]

Where $\mathcal{N}_K(x_{new})$ denotes the set of K nearest neighbors to $x_{new}$.

Concrete Example

Training Data (House features → Price):

New House: [1700 sqft, 3 beds] → ?

Using K=3 and Euclidean distance, suppose the 3 nearest neighbors have target prices of 320k, 300k, and 380k.

Prediction (simple average):

\[\hat{y} = \frac{320 + 300 + 380}{3} = \frac{1000}{3} = 333.33k\]

Distance Metrics: Measuring Similarity

The choice of distance metric is crucial in KNN. Different metrics can lead to very different results.

Let’s Look at Some Common Distance Metrics

1. Euclidean Distance (L2 Norm)

The most common metric, measuring “straight-line” distance:

\[d_{euclidean}(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2} = \|x - y\|_2\]

Properties:

  - Sensitive to feature scale (large-range features dominate)
  - Treats all dimensions equally; assumes an isotropic space
  - The default choice for continuous, low-dimensional data

Example: Distance between [3, 4] and [0, 0]: \(d = \sqrt{(3-0)^2 + (4-0)^2} = \sqrt{9 + 16} = 5\)

2. Manhattan Distance (L1 Norm)

Also called “taxicab” or “city block” distance, measures the sum of absolute differences:

\[d_{manhattan}(x, y) = \sum_{i=1}^{d} |x_i - y_i| = \|x - y\|_1\]

Properties:

  - Less sensitive to outliers than Euclidean distance
  - Often preferable when movement along each axis is independent (like a city grid)

Example: Distance between [3, 4] and [0, 0]: \(d = |3-0| + |4-0| = 3 + 4 = 7\)

3. Minkowski Distance (Generalized)

A generalization of both Euclidean and Manhattan distances:

\[d_{minkowski}(x, y) = \left(\sum_{i=1}^{d} |x_i - y_i|^p\right)^{1/p}\]

Where:

  - $p = 1$ gives Manhattan distance
  - $p = 2$ gives Euclidean distance
  - $p \to \infty$ gives Chebyshev distance

4. Chebyshev Distance (L∞ Norm)

The maximum absolute difference along any dimension:

\[d_{chebyshev}(x, y) = \max_{1 \leq i \leq d} |x_i - y_i| = \|x - y\|_\infty\]

Example: Distance between [3, 4] and [0, 0]: \(d = \max(|3-0|, |4-0|) = \max(3, 4) = 4\)

5. Cosine Distance

Measures the angle between vectors, useful for text and high-dimensional data:

\[d_{cosine}(x, y) = 1 - \frac{x \cdot y}{\|x\| \|y\|} = 1 - \frac{\sum_{i=1}^{d} x_i y_i}{\sqrt{\sum_{i=1}^{d} x_i^2} \sqrt{\sum_{i=1}^{d} y_i^2}}\]

Properties:

  - Depends only on the direction of the vectors, not their magnitude
  - Ranges from 0 (same direction) to 2 (opposite direction)
  - Well suited to sparse, high-dimensional data such as text

6. Hamming Distance

Used for categorical features, counts the number of positions at which symbols differ:

\[d_{hamming}(x, y) = \sum_{i=1}^{d} \mathbb{1}_{x_i \neq y_i}\]

Where $\mathbb{1}$ is the indicator function.

Example: Distance between “KAROLIN” and “KATHRIN”: the strings differ at positions 3, 4, and 5 (R↔T, O↔H, L↔R), so the Hamming distance is 3.

Comparing Distance Metrics

Consider points A = [0, 0], B = [3, 4]:

| Metric | Distance | Interpretation |
|---|---|---|
| Euclidean | 5.00 | Straight-line distance |
| Manhattan | 7.00 | Grid-based path |
| Minkowski (p=3) | 4.50 | Between Euclidean and Manhattan |
| Chebyshev | 4.00 | Maximum coordinate difference |
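These values are easy to verify in code; a quick NumPy sketch for A = [0, 0], B = [3, 4]:

```python
import numpy as np

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])

euclidean = np.sqrt(((a - b) ** 2).sum())            # L2 norm: 5.00
manhattan = np.abs(a - b).sum()                      # L1 norm: 7.00
minkowski3 = (np.abs(a - b) ** 3).sum() ** (1 / 3)   # p=3: ~4.50
chebyshev = np.abs(a - b).max()                      # L-infinity norm: 4.00

for name, d in [("Euclidean", euclidean), ("Manhattan", manhattan),
                ("Minkowski (p=3)", minkowski3), ("Chebyshev", chebyshev)]:
    print(f"{name}: {d:.2f}")
```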

When to Use Which Metric?

  - Euclidean: continuous features, low dimensionality, properly scaled data
  - Manhattan: grid-like data, or when robustness to outliers matters
  - Cosine: text, embeddings, and other sparse high-dimensional data
  - Hamming: categorical or binary features

Now let’s dive into the mathematics of distances. Understanding these properties will help you choose the right metric for your problem.

The Mathematics of Distance

Why Distance Matters

In KNN, the entire algorithm depends on measuring similarity. The distance function defines what “similar” means in your feature space. Let’s explore the mathematical properties that make a good distance metric.

Properties of a Valid Distance Metric

A function $d(x, y)$ is a metric if it satisfies these properties:

  1. Non-negativity: $d(x, y) \geq 0$ for all $x, y$
  2. Identity: $d(x, y) = 0$ if and only if $x = y$
  3. Symmetry: $d(x, y) = d(y, x)$
  4. Triangle Inequality: $d(x, z) \leq d(x, y) + d(y, z)$

Feature Scaling: Critical for Distance-Based Methods

Consider two features:

  - Age: typically ranges from about 18 to 80
  - Income: typically ranges from about 20,000 to 200,000

Without scaling, income dominates the distance calculation!

Example: compare a person aged 25 earning 50,000 with a person aged 55 earning 51,000.

Euclidean distance (unscaled): \(d = \sqrt{(25-55)^2 + (50000-51000)^2} = \sqrt{900 + 1000000} \approx 1000.4\)

Income completely dominates! Age difference means nothing.

Common Scaling Methods

1. Min-Max Normalization (scales to [0, 1]):

\[x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}\]

2. Standardization (Z-score normalization):

\[x_{scaled} = \frac{x - \mu}{\sigma}\]

Where $\mu$ is mean and $\sigma$ is standard deviation.

3. Robust Scaling (uses median and IQR):

\[x_{scaled} = \frac{x - \text{median}(x)}{\text{IQR}(x)}\]
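Here is a small sketch of the scaling effect using min-max normalization. The two example people and the assumed feature ranges (age 18–80, income 20k–200k) are illustrative assumptions, not data from this article:

```python
import numpy as np

# Hypothetical comparison of two people: [age, income]
A = np.array([25.0, 50_000.0])
B = np.array([55.0, 51_000.0])

unscaled = np.sqrt(((A - B) ** 2).sum())
print(f"Unscaled distance: {unscaled:.1f}")  # ~1000.4, income dominates

# Min-max normalization with assumed feature ranges
mins = np.array([18.0, 20_000.0])
maxs = np.array([80.0, 200_000.0])
A_s = (A - mins) / (maxs - mins)
B_s = (B - mins) / (maxs - mins)

scaled = np.sqrt(((A_s - B_s) ** 2).sum())
print(f"Scaled distance: {scaled:.3f}")  # now driven by the 30-year age gap
```

After scaling, the 30-year age difference (about 0.48 of the age range) dominates the tiny relative income difference, which is what we want.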

Distance in High Dimensions

As dimensionality increases, distance metrics behave differently. All points tend to become equidistant (curse of dimensionality).

Theorem: In high dimensions, the ratio of distances to nearest and farthest neighbors approaches 1:

\[\lim_{d \to \infty} \frac{d_{max} - d_{min}}{d_{min}} = 0\]

This means KNN becomes less effective as dimensionality increases without more data!

Choosing the Optimal K

The choice of K dramatically affects model performance. It’s a critical hyperparameter that balances bias and variance.

The Bias-Variance Tradeoff

Small K (e.g., K=1):

  - Low bias: the decision boundary follows the training data closely
  - High variance: sensitive to noise and outliers (overfitting)

Large K (e.g., K=100):

  - High bias: the boundary is over-smoothed and can miss local structure
  - Low variance: predictions are stable but may underfit

The Sweet Spot

Error
  |
  | \                        _ Validation Error
  |  \                    _/
  |   \_              ___/
  |     \____________/
  |                       ____ Training Error
  |              ________/
  |  __________/
  +─────────────────────────→ K
  1                       100
     Complex        Simple

Optimal K: Where validation error is minimized.

Guidelines for Choosing K

  1. Try odd numbers for binary classification (avoids tied votes)
  2. Start with $K = \sqrt{n}$ where n is the number of training samples
  3. Use cross-validation to find optimal K
  4. Consider class imbalance: Larger K can dilute minority class
  5. Domain knowledge: Sometimes K has interpretable meaning

Cross-Validation for K Selection

Algorithm:

for K in [1, 3, 5, 7, 9, 11, 13, 15]:
    cv_scores = []
    for fold in cross_validation_folds:
        train_data, val_data = split(fold)
        model = KNN(K=K)
        model.fit(train_data)
        score = model.evaluate(val_data)
        cv_scores.append(score)
    mean_score = average(cv_scores)
    print(f"K={K}: Mean CV Score = {mean_score}")
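The pseudocode above can be made concrete with a self-contained NumPy version. This sketch uses leave-one-out cross-validation on synthetic two-blob data, so the data and the resulting scores are purely illustrative:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(42)
# Two Gaussian blobs of 40 points each, centered at (0, 0) and (3, 3)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(3, 1, (40, 2))])
y = np.array([0] * 40 + [1] * 40)

def loo_accuracy(X, y, k):
    """Leave-one-out accuracy of brute-force KNN for a given k."""
    correct = 0
    for i in range(len(X)):
        dists = np.sqrt(((X - X[i]) ** 2).sum(axis=1))
        dists[i] = np.inf  # exclude the held-out point itself
        neighbors = y[np.argsort(dists)[:k]]
        pred = Counter(neighbors.tolist()).most_common(1)[0][0]
        correct += (pred == y[i])
    return correct / len(X)

for k in [1, 3, 5, 7, 9]:
    print(f"K={k}: LOO accuracy = {loo_accuracy(X, y, k):.3f}")
```

In practice you would use `sklearn.model_selection.cross_val_score` with `KNeighborsClassifier`, but the loop above shows exactly what that call does.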

Mathematical Analysis of K

For classification, the probability of misclassification with K neighbors:

\[P(\text{error}) = 1 - P(y = c_{true})\]

Where $c_{true}$ is the true class. As K increases:

  - The variance of the majority vote decreases (roughly as $1/K$)
  - The bias increases, since votes come from increasingly distant, less similar points

Weighted KNN: Not All Neighbors Are Equal

The Problem with Simple KNN

In standard KNN, all K neighbors get equal votes regardless of distance. But intuitively, closer neighbors should have more influence than distant ones.

Distance-Weighted KNN

Instead of equal voting, weight each neighbor by the inverse of its distance:

\[w_i = \frac{1}{d(x_{new}, x^{(i)})}\]

For Classification (weighted voting):

\[\hat{y} = \arg\max_c \sum_{i \in \mathcal{N}_K(x_{new})} w_i \cdot \mathbb{1}_{y^{(i)} = c}\]

For Regression (weighted average):

\[\hat{y} = \frac{\sum_{i \in \mathcal{N}_K(x_{new})} w_i \cdot y^{(i)}}{\sum_{i \in \mathcal{N}_K(x_{new})} w_i}\]

Let’s Understand with an Example

K=3 neighbors for regression, with target values 10, 15, and 20 at distances 2, 5, and 10, so the inverse-distance weights are 0.5, 0.2, and 0.1:

Simple average: $(10 + 15 + 20) / 3 = 15$

Weighted average: \(\hat{y} = \frac{0.5 \times 10 + 0.2 \times 15 + 0.1 \times 20}{0.5 + 0.2 + 0.1} = \frac{5 + 3 + 2}{0.8} = \frac{10}{0.8} = 12.5\)

The weighted prediction is closer to the nearest neighbor!
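A quick check of this arithmetic in code, assuming the three neighbors sit at distances 2, 5, and 10 (which is what yields weights 0.5, 0.2, and 0.1):

```python
import numpy as np

# Neighbor distances and target values from the worked example
distances = np.array([2.0, 5.0, 10.0])
values = np.array([10.0, 15.0, 20.0])

weights = 1.0 / distances  # inverse-distance weights: 0.5, 0.2, 0.1
simple = values.mean()
weighted = (weights * values).sum() / weights.sum()
print(simple, weighted)  # 15.0 12.5
```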

Gaussian Weighting

A more sophisticated approach uses Gaussian (RBF) weighting:

\[w_i = \exp\left(-\frac{d(x_{new}, x^{(i)})^2}{2\sigma^2}\right)\]

Where $\sigma$ controls the decay rate. This ensures:

  - Weights decay smoothly rather than abruptly with distance
  - A neighbor at distance 0 gets weight exactly 1 (no division by zero)
  - $\sigma$ gives explicit control over how local the prediction is
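A minimal sketch of Gaussian weighting, showing how $\sigma$ controls the decay:

```python
import numpy as np

def gaussian_weights(distances, sigma=1.0):
    # RBF weighting: decays smoothly with squared distance; weight is 1 at d=0
    return np.exp(-np.asarray(distances) ** 2 / (2 * sigma ** 2))

d = np.array([0.0, 1.0, 2.0, 4.0])
for sigma in [0.5, 1.0, 2.0]:
    print(f"sigma={sigma}: {np.round(gaussian_weights(d, sigma), 4)}")
```

Small $\sigma$ concentrates nearly all weight on the closest neighbors; large $\sigma$ approaches uniform (unweighted) KNN.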

The Curse of Dimensionality

What Is It?

The curse of dimensionality refers to various phenomena that arise when analyzing data in high-dimensional spaces. For KNN, it means that as the number of features grows, the algorithm becomes less effective.

Why It Happens

In high dimensions:

  1. Data becomes sparse: The available data is spread thinly across the feature space
  2. Distances become uniform: All points are approximately equidistant
  3. Nearest neighbors aren’t really “near”: Even the closest points are far apart
  4. Computational cost explodes: Distance calculations become expensive

Mathematical Insight

Consider a unit hypercube in d dimensions $[0, 1]^d$. The volume of a hypersphere of radius r inscribed in this cube is:

\[V(r, d) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)} r^d\]

As $d$ increases:

  - The inscribed sphere’s volume shrinks toward zero relative to the cube’s, so almost all of the cube’s volume sits in its corners
  - Points drawn uniformly from the cube end up far from any fixed center

Example: To capture 10% of uniformly distributed data, you need a sub-cube with edge length $0.1^{1/d}$ in each dimension:

  - $d = 1$: edge = 0.10
  - $d = 10$: edge ≈ 0.79
  - $d = 100$: edge ≈ 0.98 (nearly the full range of every feature!)

This means in high dimensions, you need exponentially more data to maintain the same density!
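Both effects can be computed directly with the standard library (`math.gamma` implements the $\Gamma$ function in the volume formula):

```python
import math

def ball_volume(r, d):
    # Volume of a d-dimensional ball: pi^(d/2) / Gamma(d/2 + 1) * r^d
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d

# Fraction of the unit cube [0, 1]^d filled by its inscribed ball (r = 0.5)
for d in [2, 5, 10, 20]:
    print(f"d={d:2d}: inscribed-ball volume = {ball_volume(0.5, d):.2e}")

# Edge length of a sub-cube capturing 10% of uniformly distributed data
for d in [1, 10, 100]:
    print(f"d={d:3d}: required edge length = {0.1 ** (1 / d):.3f}")
```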

Distance Concentration

In high dimensions, distances between points become similar:

\[\frac{d_{max} - d_{min}}{d_{min}} \xrightarrow{d \to \infty} 0\]

Example: Generate 1000 random points in d dimensions, compute pairwise distances:

| Dimensions | Mean Distance | Std. Deviation | Coefficient of Variation |
|---|---|---|---|
| 2 | 0.52 | 0.17 | 0.33 |
| 10 | 1.47 | 0.20 | 0.14 |
| 100 | 4.69 | 0.24 | 0.05 |
| 1000 | 14.87 | 0.38 | 0.03 |

As dimensions increase, relative variance in distances decreases!
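This concentration effect is easy to reproduce. The sketch below draws random points in the unit cube and reports the coefficient of variation of their pairwise distances; exact values depend on the random sample, but the downward trend is robust:

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_cv(d, n=500):
    """Coefficient of variation of pairwise distances among n uniform points."""
    X = rng.random((n, d))
    # Pairwise squared distances via the Gram-matrix identity ||x-y||^2 = ||x||^2 + ||y||^2 - 2<x,y>
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    dists = np.sqrt(d2[np.triu_indices(n, k=1)])
    return dists.std() / dists.mean()

for d in [2, 10, 100, 1000]:
    print(f"d={d:4d}: coefficient of variation = {distance_cv(d):.3f}")
```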

Mitigating the Curse

1. Dimensionality Reduction:

  - Project to fewer dimensions with PCA, t-SNE, or UMAP before applying KNN

2. Feature Engineering:

  - Keep only relevant features; irrelevant ones add noise to every distance

3. Distance Metrics:

  - Cosine distance or fractional norms ($p < 1$) can remain more discriminative in high dimensions

4. Increase Sample Size:

  - More data partially offsets sparsity, but the amount needed grows exponentially with $d$

Implementation from Scratch

Let’s implement KNN from scratch in Python to understand every detail.

Complete KNN Implementation

import numpy as np
from collections import Counter

class KNearestNeighbors:
    """
    K-Nearest Neighbors implementation for classification and regression.

    Parameters:
    -----------
    k : int
        Number of neighbors to consider
    metric : str
        Distance metric ('euclidean', 'manhattan', 'minkowski', 'cosine')
    p : int
        Parameter for Minkowski distance (2 = Euclidean, 1 = Manhattan)
    weighted : bool
        Whether to use distance-weighted voting
    task : str
        'classification' or 'regression'
    """

    def __init__(self, k=5, metric='euclidean', p=2, weighted=False,
                 task='classification'):
        self.k = k
        self.metric = metric
        self.p = p
        self.weighted = weighted
        self.task = task
        self.X_train = None
        self.y_train = None

    def fit(self, X, y):
        """
        Store training data (lazy learning - no actual training).

        Parameters:
        -----------
        X : numpy.ndarray of shape (n_samples, n_features)
            Training feature vectors
        y : numpy.ndarray of shape (n_samples,)
            Target values
        """
        self.X_train = np.array(X)
        self.y_train = np.array(y)
        return self

    def predict(self, X):
        """
        Predict labels for test data.

        Parameters:
        -----------
        X : numpy.ndarray of shape (n_samples, n_features)
            Test feature vectors

        Returns:
        --------
        predictions : numpy.ndarray of shape (n_samples,)
            Predicted labels
        """
        X = np.array(X)
        predictions = [self._predict_single(x) for x in X]
        return np.array(predictions)

    def _predict_single(self, x):
        """Predict label for a single data point."""
        # Calculate distances to all training points
        distances = self._calculate_distances(x)

        # Get indices of k nearest neighbors
        k_indices = np.argsort(distances)[:self.k]

        # Get labels of k nearest neighbors
        k_nearest_labels = self.y_train[k_indices]

        if self.weighted:
            # Distance-weighted prediction
            k_nearest_distances = distances[k_indices]
            # Avoid division by zero
            weights = 1 / (k_nearest_distances + 1e-10)

            if self.task == 'classification':
                # Weighted voting
                return self._weighted_vote(k_nearest_labels, weights)
            else:
                # Weighted average
                return np.sum(weights * k_nearest_labels) / np.sum(weights)
        else:
            # Simple prediction (no weighting)
            if self.task == 'classification':
                # Majority vote
                return Counter(k_nearest_labels).most_common(1)[0][0]
            else:
                # Simple average
                return np.mean(k_nearest_labels)

    def _calculate_distances(self, x):
        """Calculate distances from x to all training points."""
        if self.metric == 'euclidean':
            return np.sqrt(np.sum((self.X_train - x) ** 2, axis=1))

        elif self.metric == 'manhattan':
            return np.sum(np.abs(self.X_train - x), axis=1)

        elif self.metric == 'minkowski':
            return np.power(np.sum(np.abs(self.X_train - x) ** self.p,
                                   axis=1), 1/self.p)

        elif self.metric == 'cosine':
            # Cosine distance = 1 - cosine similarity
            dot_products = np.dot(self.X_train, x)
            x_norm = np.linalg.norm(x)
            train_norms = np.linalg.norm(self.X_train, axis=1)
            cosine_similarities = dot_products / (train_norms * x_norm + 1e-10)
            return 1 - cosine_similarities

        else:
            raise ValueError(f"Unknown metric: {self.metric}")

    def _weighted_vote(self, labels, weights):
        """Perform weighted voting for classification."""
        unique_labels = np.unique(labels)
        weighted_votes = {}

        for label in unique_labels:
            # Sum weights for this label
            mask = (labels == label)
            weighted_votes[label] = np.sum(weights[mask])

        # Return label with highest weighted vote
        return max(weighted_votes, key=weighted_votes.get)

    def score(self, X, y):
        """Calculate accuracy (classification) or R² (regression)."""
        predictions = self.predict(X)

        if self.task == 'classification':
            # Accuracy
            return np.mean(predictions == y)
        else:
            # R² score
            ss_res = np.sum((y - predictions) ** 2)
            ss_tot = np.sum((y - np.mean(y)) ** 2)
            return 1 - (ss_res / ss_tot)


# Example Usage: Classification
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # Generate synthetic classification data
    X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                               n_redundant=1, n_classes=3, random_state=42)

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    # Standardize features (critical for KNN!)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Train and evaluate KNN
    knn = KNearestNeighbors(k=5, metric='euclidean', weighted=True,
                            task='classification')
    knn.fit(X_train_scaled, y_train)

    accuracy = knn.score(X_test_scaled, y_test)
    print(f"KNN Accuracy: {accuracy:.4f}")

    # Try different K values
    print("\nAccuracy vs K:")
    for k in [1, 3, 5, 7, 9, 11, 15, 21]:
        knn = KNearestNeighbors(k=k, task='classification')
        knn.fit(X_train_scaled, y_train)
        acc = knn.score(X_test_scaled, y_test)
        print(f"K={k:2d}: {acc:.4f}")

Computational Considerations

Time Complexity

Training: $O(1)$ - Just stores data

Prediction (for one sample):

  - Distances to all n training points: $O(n \times d)$
  - Finding the K nearest: $O(n \log n)$ with a full sort, or $O(n \log K)$ with a heap

For m test samples: $O(m \times n \times d)$

This is expensive for large datasets!

Space Complexity

$O(n \times d)$ - Must store all training data

Optimization Techniques

1. KD-Trees

Build a binary tree that partitions space along coordinate axes for faster neighbor search:

  - Average query time: $O(\log n)$ in low dimensions
  - Performance degrades toward brute force as dimensionality grows (roughly $d > 20$)

2. Ball Trees

Similar to KD-trees but uses hyperspheres instead of hyperplanes:

  - More robust than KD-trees in moderately high dimensions
  - Works with any metric that satisfies the triangle inequality

3. Locality-Sensitive Hashing (LSH)

Approximate nearest neighbor search:

  - Hashes similar points into the same buckets with high probability
  - Sublinear query time, at the cost of occasionally missing a true neighbor

4. Approximate Nearest Neighbors

Libraries like FAISS (Facebook AI Similarity Search) or Annoy (Spotify):

  - Trade a small amount of accuracy for orders-of-magnitude faster queries
  - Standard practice for large-scale similarity search over embeddings

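As a quick sanity check that tree-based search returns the same neighbors as brute force, here is a sketch using scikit-learn’s `NearestNeighbors` (the same library used in the example earlier); both strategies are exact, so only the speed profile differs:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(7)
X = rng.random((10_000, 3))      # low-dimensional data, where KD-trees shine
query = rng.random((1, 3))

# Same exact search, two different strategies
brute = NearestNeighbors(n_neighbors=5, algorithm="brute").fit(X)
tree = NearestNeighbors(n_neighbors=5, algorithm="kd_tree").fit(X)

dist_b, idx_b = brute.kneighbors(query)
dist_t, idx_t = tree.kneighbors(query)
print(np.array_equal(idx_b, idx_t))  # both find the same 5 neighbors
```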
When KNN is Computationally Feasible

  - Small-to-moderate datasets (up to roughly $10^5$–$10^6$ points with a tree or ANN index)
  - Low-to-moderate dimensionality, or after dimensionality reduction
  - When prediction latency is not critical, or queries can be batched
Advantages and Limitations

Advantages

  1. Simple and Intuitive: Easy to understand and explain
  2. No Training: Instant “training” (just store data)
  3. Non-Parametric: Makes no assumptions about data distribution
  4. Naturally Multi-Class: Handles multiple classes without modification
  5. Versatile: Works for classification and regression
  6. Interpretable: Can inspect which neighbors influenced prediction
  7. Adapts to New Data: Simply add new samples to training set

Limitations

  1. Computationally Expensive: Slow predictions, especially with large datasets
  2. Memory Intensive: Must store all training data
  3. Curse of Dimensionality: Performance degrades in high dimensions
  4. Sensitive to Scaling: Features must be normalized
  5. Sensitive to Irrelevant Features: All features affect distance equally
  6. Imbalanced Data: Majority class dominates with large K
  7. No Model: Can’t inspect learned patterns like decision trees
  8. Outliers: Sensitive to noisy data and outliers

Comparison with Other Algorithms

| Aspect | KNN | Decision Tree | SVM | Neural Network |
|---|---|---|---|---|
| Training Speed | Instant | Fast | Medium | Slow |
| Prediction Speed | Slow | Fast | Medium | Fast |
| Memory | High | Low | Medium | Low |
| Interpretability | High | High | Medium | Low |
| Non-linear | Yes | Yes | Yes (kernel) | Yes |
| Curse of Dimensionality | High | Medium | Low | Low |

Conclusion

Now that we have a solid understanding of KNN and its fundamentals, we can apply it to real-world problems. In the next post, we will explore K-Means Clustering, a popular unsupervised learning algorithm.