
Machine Learning Series

by Mayank Sharma

K-Means Clustering: Unsupervised Learning Fundamentals

Dec 13, 2025

Continuing our journey through Machine Learning, let’s now venture into the realm of unsupervised learning. Today we will learn about the K-Means clustering algorithm. Imagine walking into a large library where books are scattered everywhere with no organization. Your task is to group similar books together (novels with novels, textbooks with textbooks, cookbooks with cookbooks) without any labels telling you which book belongs where. You’d naturally look at features like size, cover design, and content to create these groups. This is exactly what K-Means clustering does: it discovers natural groupings in data without any prior labels.

Table of Contents

  1. Introduction: From Supervised to Unsupervised Learning
  2. What is Clustering?
  3. The K-Means Algorithm: Intuition
  4. Mathematical Formulation
  5. The K-Means Algorithm: Step by Step
  6. Initialization Strategies
  7. Choosing the Optimal K
  8. Convergence and Computational Complexity
  9. Implementation from Scratch
  10. Evaluating Clustering Quality
  11. Advantages and Limitations
  12. Variants and Extensions
  13. Conclusion

Introduction: From Supervised to Unsupervised Learning

The Paradigm Shift

In our previous tutorial on K-Nearest Neighbors (KNN), we worked with supervised learning where we had labeled data and predicted labels for new examples. Now we enter the realm of unsupervised learning, where we have data but no labels. Our goal shifts from prediction to discovery: finding hidden patterns and structure in the data.

What is Unsupervised Learning?

Unsupervised learning involves finding patterns in data without explicit labels or targets. The algorithm must discover structure on its own. Common tasks where we can use unsupervised learning include:

  • Clustering: grouping similar data points together
  • Dimensionality reduction: compressing data while preserving its structure
  • Anomaly detection: identifying unusual data points
  • Association rule mining: discovering relationships between variables

Why Unsupervised Learning Matters

In the real world, labeled data is expensive: annotating examples requires human time, expertise, and money, while unlabeled data accumulates constantly and essentially for free.

Unsupervised learning allows us to:

  • explore datasets we know little about
  • discover structure without costly annotation
  • preprocess or compress data for downstream supervised tasks

Real-World Applications

K-Means clustering powers countless applications:

  • customer segmentation for targeted marketing
  • image compression via color quantization
  • grouping similar documents or news articles
  • anomaly detection, by flagging points far from every centroid

What is Clustering?

Definition

Clustering is the task of dividing a dataset into groups (clusters) such that:

  • points within the same cluster are similar to each other
  • points in different clusters are dissimilar

Formally, given a dataset $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$ where $x^{(i)} \in \mathbb{R}^d$, clustering aims to partition the data into $K$ disjoint clusters:

\[\mathcal{C} = \{C_1, C_2, ..., C_K\}\]

where:

  • each point belongs to exactly one cluster: $C_j \cap C_{j'} = \emptyset$ for $j \neq j'$
  • the clusters cover the whole dataset: $C_1 \cup C_2 \cup \ldots \cup C_K = \mathcal{D}$

Types of Clustering

1. Partitional Clustering

Divides data into non-overlapping clusters; each point belongs to exactly one cluster (e.g., K-Means).

2. Hierarchical Clustering

Creates a tree-like structure (dendrogram) of nested clusters (e.g., agglomerative clustering).

3. Density-Based Clustering

Defines clusters as dense regions separated by sparse regions (e.g., DBSCAN).

4. Model-Based Clustering

Assumes the data is generated from a mixture of probability distributions (e.g., Gaussian Mixture Models).

The K-Means Algorithm: Intuition

The Core Idea

K-Means is beautifully simple: it finds K cluster centers (called centroids) and assigns each data point to its nearest centroid. The algorithm iteratively:

  1. Assigns each point to the closest centroid
  2. Updates centroids to be the mean of assigned points
  3. Repeats until convergence

You can think of it like organizing a dinner party where K = 3 (three tables). Initially, you randomly place three chairs as “table centers.” Then:

  • each guest sits at the nearest chair
  • you move each chair to the middle of the guests gathered around it
  • guests re-seat themselves at whichever chair is now nearest, and the process repeats until nobody moves

Visual Intuition

Consider points scattered in 2D space:

Initial State (random centroids):

    . *       .
 .     .  .     .
   .  .     .
    .   * .  .

After Assignment:

    1 *       2
 1     1  2     2
   1  1     2
    1   * 2  2

After Update (move centroids):

    1         2
 1  *  1  2     2
   1  1    *  2
    1     2  2

Final Clusters (converged):

    A         B
 A  *  A  B     B
   A  A    *  B
    A     B  B

The asterisks (*) are centroids; numbers and letters show cluster assignments.

Why “K-Means”?

The name comes from:

  • K: the number of clusters, chosen in advance
  • Means: each cluster is represented by the mean (centroid) of its points

Mathematical Formulation

Objective Function

K-Means aims to minimize the within-cluster sum of squares (WCSS), also called inertia:

\[J = \sum_{k=1}^{K} \sum_{x \in C_k} \|x - \mu_k\|^2\]

Where:

  • $K$ is the number of clusters
  • $C_k$ is the set of points assigned to cluster $k$
  • $\mu_k$ is the centroid (mean) of cluster $k$
  • $\|x - \mu_k\|$ is the Euclidean distance from point $x$ to its centroid

Intuition: We want points to be as close as possible to their cluster centroids.

Expanded Form

Let’s denote $r_{ik} \in \{0, 1\}$ as an indicator variable: $r_{ik} = 1$ if point $x^{(i)}$ is assigned to cluster $k$, and $r_{ik} = 0$ otherwise.

The objective function can be rewritten as:

\[J = \sum_{i=1}^{n} \sum_{k=1}^{K} r_{ik} \|x^{(i)} - \mu_k\|^2\]

With the constraint: $\sum_{k=1}^{K} r_{ik} = 1$ for all $i$ (each point belongs to exactly one cluster).

Centroid Calculation

For a given cluster assignment, the optimal centroid is the mean of all points in that cluster:

\[\mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x\]

Or equivalently:

\[\mu_k = \frac{\sum_{i=1}^{n} r_{ik} x^{(i)}}{\sum_{i=1}^{n} r_{ik}}\]

Proof: To minimize $J$ with respect to $\mu_k$, take the derivative and set to zero:

\[\frac{\partial J}{\partial \mu_k} = \frac{\partial}{\partial \mu_k} \sum_{i=1}^{n} r_{ik} \|x^{(i)} - \mu_k\|^2 = 0\]

\[\sum_{i=1}^{n} r_{ik} \cdot 2(x^{(i)} - \mu_k) \cdot (-1) = 0\]

\[\sum_{i=1}^{n} r_{ik} x^{(i)} = \mu_k \sum_{i=1}^{n} r_{ik}\]

\[\mu_k = \frac{\sum_{i=1}^{n} r_{ik} x^{(i)}}{\sum_{i=1}^{n} r_{ik}}\]

This is exactly the mean (average) of all points assigned to cluster $k$.
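We can sanity-check this result numerically: for a small set of illustrative points, the sum of squared distances is smallest at the mean, and any perturbation away from it increases the objective.

```python
import numpy as np

# Numerical check that the cluster mean minimizes the sum of squared
# distances; the three points are illustrative.
points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0]])
mean = points.mean(axis=0)  # the optimal centroid

def sse(mu):
    """Sum of squared distances from each point to mu."""
    return np.sum((points - mu) ** 2)

# Any perturbation away from the mean increases the objective
for shift in ([0.1, 0.0], [0.0, -0.2], [0.3, 0.3]):
    assert sse(mean) < sse(mean + np.array(shift))
print(sse(mean))
```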

Assignment Rule

For a given set of centroids, the optimal assignment is:

\[r_{ik} = \begin{cases} 1 & \text{if } k = \arg\min_j \|x^{(i)} - \mu_j\|^2 \\ 0 & \text{otherwise} \end{cases}\]

Each point is assigned to its nearest centroid.
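The assignment rule translates directly into vectorized NumPy; the toy points and centroids below are illustrative.

```python
import numpy as np

# Each point goes to the centroid with the smallest squared Euclidean distance
X = np.array([[1.0, 1.0], [2.0, 1.0], [8.0, 8.0], [9.0, 9.0]])
centroids = np.array([[1.5, 1.0], [8.5, 8.5]])

# distances[i, k] = ||x_i - mu_k||^2, computed via broadcasting
distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
labels = distances.argmin(axis=1)
print(labels)  # -> [0 0 1 1]
```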

The Optimization Problem

K-Means is solving:

\[\min_{C_1, ..., C_K} \sum_{k=1}^{K} \sum_{x \in C_k} \|x - \mu_k\|^2\]

This is a non-convex optimization problem (has local minima), which is why initialization matters!

The K-Means Algorithm: Step by Step

Algorithm Pseudocode

Input: Dataset D = {x(1), x(2), ..., x(n)}, number of clusters K
Output: Cluster assignments and centroids

1. Initialize K centroids μ_1, μ_2, ..., μ_K (randomly or using K-Means++)

2. Repeat until convergence:

   a. Assignment Step:
      For each data point x(i):
          Assign x(i) to the nearest centroid:
          c(i) = argmin_k ||x(i) - μ_k||²

   b. Update Step:
      For each cluster k:
          Update centroid to mean of assigned points:
          μ_k = (1/|C_k|) Σ_{x ∈ C_k} x

3. Return final cluster assignments and centroids

Convergence Criteria

The algorithm stops when one of these conditions is met:

  1. Centroids don’t change: $\|\mu_k^{new} - \mu_k^{old}\| < \epsilon$ for all $k$
  2. Assignments don’t change: No points switch clusters
  3. Maximum iterations reached: Prevent infinite loops
  4. Objective function stabilizes: Change in $J$ is below threshold

Concrete Example

Let’s cluster 6 points in 2D with K=2:

Data: [1, 1], [1.5, 2], [2, 1], [8, 8], [8.5, 8], [9, 9]

Iteration 0 (Initialization): suppose we pick two of the points as initial centroids, say μ₁ = [1, 1] and μ₂ = [8, 8].

Iteration 1 - Assignment:

Computing distances to each centroid, the points [1, 1], [1.5, 2], and [2, 1] are closest to μ₁, while [8, 8], [8.5, 8], and [9, 9] are closest to μ₂.

Iteration 1 - Update:

\[\mu_1 = \frac{[1,1] + [1.5,2] + [2,1]}{3} = [1.5, 1.33]\]

\[\mu_2 = \frac{[8,8] + [8.5,8] + [9,9]}{3} = [8.5, 8.33]\]

Iteration 2: Repeat assignment and update…

The algorithm continues until centroids stabilize!
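The update step above can be checked in a couple of lines of NumPy:

```python
import numpy as np

# Reproducing the update step from the worked example above
cluster_1 = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0]])
cluster_2 = np.array([[8.0, 8.0], [8.5, 8.0], [9.0, 9.0]])

mu_1 = cluster_1.mean(axis=0)
mu_2 = cluster_2.mean(axis=0)
print(np.round(mu_1, 2), np.round(mu_2, 2))  # mu_1 ≈ [1.5, 1.33], mu_2 ≈ [8.5, 8.33]
```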

Initialization Strategies

Initialization is crucial because K-Means can get stuck in local minima. Different starting points lead to different final clusters.

1. Random Initialization

Method: Randomly select K data points as initial centroids.

Advantages:

  • Simple and fast, with no extra setup cost

Disadvantages:

  • Initial centroids may land close together, splitting a natural cluster
  • Different runs can produce very different (and poor) clusterings
  • A bad start can slow convergence

Implementation:

# Randomly select K distinct data points as initial centroids
indices = np.random.choice(n, K, replace=False)
centroids = X[indices]

2. K-Means++ Initialization

Key Idea: Choose initial centroids that are far apart from each other.

Algorithm:

  1. Choose first centroid uniformly at random from data points
  2. For each remaining centroid:
    • Calculate distance $D(x)$ from each point $x$ to nearest existing centroid
    • Choose next centroid with probability proportional to $D(x)^2$
  3. Repeat until K centroids are chosen

Mathematical Formulation:

For choosing the $(j+1)$-th centroid:

\[P(x^{(i)}) = \frac{D(x^{(i)})^2}{\sum_{i=1}^{n} D(x^{(i)})^2}\]

Where $D(x^{(i)}) = \min_{k=1,\ldots,j} \|x^{(i)} - \mu_k\|$

Advantages:

  • Spreads the initial centroids apart, leading to better final clusterings
  • Typically converges in fewer iterations
  • Carries a provable $O(\log K)$ approximation guarantee in expectation

Disadvantages:

  • Seeding is sequential and adds extra distance computations up front
  • Still randomized, so results can vary between runs

Why It Works:

K-Means++ ensures centroids start far apart, which helps:

  • avoid several centroids competing for the same natural cluster
  • reduce the number of iterations needed to converge
  • reach lower final inertia on average

3. Multiple Random Runs

Method: Run K-Means multiple times with different random initializations and keep the best result.

Selection Criterion: Choose the clustering with lowest inertia (objective function value).

Trade-off: Increases computational cost but improves quality.

Comparing Initializations

For the same dataset:

Initialization        Avg Inertia   Avg Iterations   Consistency
Random                450.2         12.3             Low
K-Means++             398.7         8.1              High
Random (10 runs)      401.3         9.5              Medium

Choosing the Optimal K

One of the biggest challenges with K-Means: How do we choose K?

Unlike supervised learning (where we have validation accuracy), there’s no single “correct” number of clusters. We use heuristics and domain knowledge.

1. The Elbow Method

Idea: Plot inertia (WCSS) vs. K and look for an “elbow” where the rate of decrease sharply changes.

Mathematical Intuition: Inertia never increases as K grows, because an extra centroid can only reduce each point’s distance to its nearest centroid. The elbow marks the K beyond which additional clusters buy only marginal reductions.

Procedure:

  1. Run K-Means for K = 1, 2, 3, …, K_max
  2. Calculate inertia for each K
  3. Plot inertia vs. K
  4. Choose K at the “elbow” (point of maximum curvature)

Example:

Inertia
5000 | x
4000 |
3000 |    x
2000 |       x
1000 |          x
 800 |             x__x__x__x
   0 +-------------------------- K
       1  2  3  4  5  6  7  8

Elbow at K=4

Limitation: The elbow is often not clearly defined!
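A minimal sketch of the elbow method, using scikit-learn’s KMeans (assumed available) on illustrative blob data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate well-separated synthetic clusters (illustrative data)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Inertia always decreases with K; the elbow is where the drop flattens
for k, j in enumerate(inertias, start=1):
    print(f"K={k}: inertia={j:.1f}")
```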

2. Silhouette Score

Idea: Measure how similar each point is to its own cluster compared to other clusters.

For a single point $x^{(i)}$ in cluster $C_k$:

\[a(i) = \frac{1}{|C_k| - 1} \sum_{j \in C_k, j \neq i} d(i, j)\]

Average distance to points in own cluster (lower is better).

\[b(i) = \min_{k' \neq k} \frac{1}{|C_{k'}|} \sum_{j \in C_{k'}} d(i, j)\]

Average distance to points in nearest other cluster (higher is better).

Silhouette coefficient for point $i$:

\[s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}\]

Properties:

Average silhouette score for the entire clustering:

\[\bar{s} = \frac{1}{n} \sum_{i=1}^{n} s(i)\]

Usage: Choose K that maximizes average silhouette score.
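As a sketch (scikit-learn assumed available; the blob data is illustrative), we can pick K by maximizing the average silhouette score:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative data with 4 well-separated clusters
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

scores = {}
for k in range(2, 7):  # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # should recover the number of generated blobs
```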

3. Gap Statistic

Idea: Compare inertia of your clustering to that of random data.

\[\text{Gap}(K) = \mathbb{E}[\log(W_K^*)] - \log(W_K)\]

Where:

  • $W_K$ is the within-cluster sum of squares (inertia) of your clustering with $K$ clusters
  • $W_K^*$ is the inertia obtained by clustering reference data drawn uniformly over the same range
  • the expectation is taken over several such reference datasets

Choose K where the gap is largest (your clustering is much better than random).
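The gap statistic can be sketched in a few lines, drawing the reference data uniformly over the bounding box of the dataset. This uses scikit-learn’s KMeans (an assumption; any K-Means implementation works), and the two-blob data below is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def inertia(X, k):
    """Within-cluster sum of squares for a K-Means fit with k clusters."""
    return KMeans(n_clusters=k, n_init=5, random_state=0).fit(X).inertia_

def gap(X, k, n_refs=5):
    """Gap(K) = E[log W*_K] - log W_K, with uniform reference datasets."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    ref_logs = [np.log(inertia(rng.uniform(mins, maxs, size=X.shape), k))
                for _ in range(n_refs)]
    return np.mean(ref_logs) - np.log(inertia(X, k))

# Two well-separated blobs: the gap should peak at the true K = 2
X = np.vstack([rng.normal([0, 0], 0.3, size=(50, 2)),
               rng.normal([5, 5], 0.3, size=(50, 2))])
print(gap(X, 1), gap(X, 2))
```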

4. Domain Knowledge

Often the best approach: use domain expertise!

Examples:

  • T-shirt sizing: K = 3 (small, medium, large) is a natural choice
  • Customer tiers: a business may already plan for a fixed number of segments
  • Known categories: a catalog with 5 product lines suggests starting at K = 5

Practical Recommendations

  1. Start with elbow method for quick insight
  2. Validate with silhouette score for confirmation
  3. Try multiple K values and visualize results
  4. Use domain knowledge to guide final choice
  5. Consider interpretability: Fewer clusters are easier to understand and act upon

Convergence and Computational Complexity

Convergence Properties

Theorem: K-Means algorithm is guaranteed to converge.

Proof Sketch:

  1. Each assignment step decreases (or maintains) the objective function $J$
  2. Each update step decreases (or maintains) $J$
  3. There are finitely many possible assignments
  4. $J$ is bounded below by 0
  5. Therefore, the algorithm must converge in finite steps

However: K-Means converges to a local minimum, not necessarily the global minimum!

Number of Iterations

Typical: 10-20 iterations for convergence

Factors affecting convergence speed:

  • quality of initialization (K-Means++ usually converges faster)
  • how well-separated the clusters are
  • dataset size and dimensionality
  • the convergence tolerance $\epsilon$

Time Complexity

Per iteration:

  • Assignment step: $O(n \cdot K \cdot d)$ (distance from every point to every centroid)
  • Update step: $O(n \cdot d)$ (one pass to average the points in each cluster)

Total: $O(I \cdot n \cdot K \cdot d)$

Where:

  • $I$ = number of iterations until convergence
  • $n$ = number of data points
  • $K$ = number of clusters
  • $d$ = number of dimensions

Space Complexity: $O(n \cdot d + K \cdot d)$

Scalability Considerations

K-Means is relatively efficient, but challenges arise with:

  1. Large n (millions of points): Use mini-batch K-Means
  2. Large d (high dimensions): Apply dimensionality reduction first
  3. Large K (many clusters): Consider hierarchical approaches

Implementation from Scratch

Let’s implement K-Means from scratch to understand every detail.

import numpy as np

class KMeans:
    """
    K-Means clustering implementation from scratch.

    Parameters:
    -----------
    n_clusters : int, default=3
        The number of clusters to form
    max_iters : int, default=300
        Maximum number of iterations
    init : str, default='kmeans++'
        Initialization method ('random' or 'kmeans++')
    n_init : int, default=10
        Number of times to run algorithm with different initializations
    tol : float, default=1e-4
        Convergence tolerance
    random_state : int, default=None
        Random seed for reproducibility
    """

    def __init__(self, n_clusters=3, max_iters=300, init='kmeans++',
                 n_init=10, tol=1e-4, random_state=None):
        self.n_clusters = n_clusters
        self.max_iters = max_iters
        self.init = init
        self.n_init = n_init
        self.tol = tol
        self.random_state = random_state

        self.centroids = None
        self.labels_ = None
        self.inertia_ = None
        self.n_iter_ = None

    def fit(self, X):
        """
        Compute K-Means clustering.

        Parameters:
        -----------
        X : numpy.ndarray of shape (n_samples, n_features)
            Training data
        """
        X = np.array(X)

        if self.random_state is not None:
            np.random.seed(self.random_state)

        best_inertia = np.inf
        best_centroids = None
        best_labels = None
        best_iters = 0

        # Run multiple times with different initializations
        for _ in range(self.n_init):
            centroids, labels, inertia, n_iters = self._kmeans_single(X)

            if inertia < best_inertia:
                best_inertia = inertia
                best_centroids = centroids
                best_labels = labels
                best_iters = n_iters

        self.centroids = best_centroids
        self.labels_ = best_labels
        self.inertia_ = best_inertia
        self.n_iter_ = best_iters

        return self

    def _kmeans_single(self, X):
        """Run K-Means algorithm once."""
        # Initialize centroids
        if self.init == 'kmeans++':
            centroids = self._kmeans_plus_plus(X)
        else:
            centroids = self._random_init(X)

        # Iterate until convergence
        for iteration in range(self.max_iters):
            # Assignment step
            labels = self._assign_clusters(X, centroids)

            # Update step
            new_centroids = self._update_centroids(X, labels)

            # Check convergence
            centroid_shift = np.linalg.norm(new_centroids - centroids)
            if centroid_shift < self.tol:
                centroids = new_centroids
                break

            centroids = new_centroids

        # Re-assign with the final centroids, then compute inertia
        labels = self._assign_clusters(X, centroids)
        inertia = self._calculate_inertia(X, centroids, labels)

        return centroids, labels, inertia, iteration + 1

    def _random_init(self, X):
        """Random initialization: choose K random points as centroids."""
        n_samples = X.shape[0]
        indices = np.random.choice(n_samples, self.n_clusters, replace=False)
        return X[indices].copy()

    def _kmeans_plus_plus(self, X):
        """K-Means++ initialization."""
        n_samples, n_features = X.shape
        centroids = np.empty((self.n_clusters, n_features))

        # Choose first centroid randomly
        centroids[0] = X[np.random.randint(n_samples)]

        # Choose remaining centroids
        for k in range(1, self.n_clusters):
            # Calculate distances to nearest existing centroid
            distances = np.min([np.linalg.norm(X - c, axis=1)
                               for c in centroids[:k]], axis=0)

            # Square distances for probability calculation
            probabilities = distances ** 2
            probabilities /= probabilities.sum()

            # Choose next centroid with probability proportional to distance²
            next_centroid_idx = np.random.choice(n_samples, p=probabilities)
            centroids[k] = X[next_centroid_idx]

        return centroids

    def _assign_clusters(self, X, centroids):
        """Assign each point to nearest centroid."""
        distances = np.array([np.linalg.norm(X - c, axis=1)
                             for c in centroids])
        return np.argmin(distances, axis=0)

    def _update_centroids(self, X, labels):
        """Update each centroid to the mean of its points; reseed empty clusters."""
        centroids = np.empty((self.n_clusters, X.shape[1]))
        for k in range(self.n_clusters):
            members = X[labels == k]
            # An empty cluster would yield NaN; reseed it with a random point
            centroids[k] = (members.mean(axis=0) if len(members) > 0
                            else X[np.random.randint(X.shape[0])])
        return centroids

    def _calculate_inertia(self, X, centroids, labels):
        """Calculate within-cluster sum of squares."""
        inertia = 0
        for k in range(self.n_clusters):
            cluster_points = X[labels == k]
            if len(cluster_points) > 0:
                inertia += np.sum((cluster_points - centroids[k]) ** 2)
        return inertia

    def predict(self, X):
        """
        Predict cluster labels for new data.

        Parameters:
        -----------
        X : numpy.ndarray of shape (n_samples, n_features)
            New data to predict

        Returns:
        --------
        labels : numpy.ndarray of shape (n_samples,)
            Cluster labels
        """
        X = np.array(X)
        return self._assign_clusters(X, self.centroids)

    def fit_predict(self, X):
        """Fit and return cluster labels."""
        self.fit(X)
        return self.labels_


# Example usage
if __name__ == "__main__":
    from sklearn.datasets import make_blobs

    # Generate synthetic data
    X, y_true = make_blobs(n_samples=300, centers=4, n_features=2,
                          cluster_std=0.60, random_state=42)

    # Fit K-Means
    kmeans = KMeans(n_clusters=4, init='kmeans++', random_state=42)
    kmeans.fit(X)

    print(f"Converged in {kmeans.n_iter_} iterations")
    print(f"Final inertia: {kmeans.inertia_:.2f}")
    print(f"Cluster sizes: {np.bincount(kmeans.labels_)}")

Evaluating Clustering Quality

Internal Evaluation (No Ground Truth)

When we don’t have true labels, we use internal metrics:

1. Inertia (Within-Cluster Sum of Squares)

\[\text{Inertia} = \sum_{k=1}^{K} \sum_{x \in C_k} \|x - \mu_k\|^2\]

Lower is better, but inertia always decreases as K grows, so compare it only between clusterings with the same K.

2. Silhouette Score

Already discussed above. Range: [-1, 1], higher is better.

3. Davies-Bouldin Index

\[DB = \frac{1}{K} \sum_{k=1}^{K} \max_{k' \neq k} \frac{\sigma_k + \sigma_{k'}}{d(c_k, c_{k'})}\]

Where:

  • $\sigma_k$ is the average distance of points in cluster $k$ to its centroid $c_k$
  • $d(c_k, c_{k'})$ is the distance between centroids $c_k$ and $c_{k'}$

Lower is better (indicates better separation).

4. Calinski-Harabasz Index

\[CH = \frac{SS_B / (K-1)}{SS_W / (n-K)}\]

Where:

  • $SS_B$ is the between-cluster dispersion (how spread out the centroids are around the overall mean)
  • $SS_W$ is the within-cluster dispersion (inertia)

Higher is better (more separation, less variance within clusters).

External Evaluation (With Ground Truth)

When we have true labels (for validation):

1. Adjusted Rand Index (ARI)

Measures similarity between the predicted and true clusterings, adjusted for chance. A score of 1 means identical clusterings; around 0 means chance-level agreement.

2. Normalized Mutual Information (NMI)

Measures the mutual information between predicted and true labels, normalized to [0, 1]; 1 means perfect agreement.
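All of these metrics are available in scikit-learn (assumed installed); here is a quick demonstration on illustrative blob data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, calinski_harabasz_score,
                             davies_bouldin_score,
                             normalized_mutual_info_score, silhouette_score)

# Synthetic data where the true labels are known (illustrative)
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60,
                       random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Internal metrics (no ground truth needed)
print(f"Silhouette:        {silhouette_score(X, labels):.3f}")      # higher is better
print(f"Davies-Bouldin:    {davies_bouldin_score(X, labels):.3f}")  # lower is better
print(f"Calinski-Harabasz: {calinski_harabasz_score(X, labels):.1f}")  # higher is better

# External metrics (compare against the true labels)
print(f"ARI: {adjusted_rand_score(y_true, labels):.3f}")
print(f"NMI: {normalized_mutual_info_score(y_true, labels):.3f}")
```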

Advantages and Limitations

Advantages

  1. Simple and Intuitive: Easy to understand and implement
  2. Efficient: Each iteration costs $O(n \cdot K \cdot d)$, linear in the number of data points
  3. Scalable: Works well with large datasets
  4. Guaranteed Convergence: Always converges to a local minimum
  5. Versatile: Works for many types of data
  6. Well-Established: Extensive research and implementations available

Limitations

  1. Must Choose K: Number of clusters must be specified beforehand
  2. Local Minima: Sensitive to initialization (use K-Means++)
  3. Assumes Spherical Clusters: Works best with convex, isotropic clusters
  4. Equal Size Bias: Tends to create equal-sized clusters
  5. Outlier Sensitive: Outliers can distort centroids
  6. Fixed Cluster Shapes: Cannot handle non-spherical clusters well
  7. Distance Metric: Only works with Euclidean distance (standard implementation)

When K-Means Fails

Problem cases:

  1. Non-convex clusters (e.g., moon shapes): Use DBSCAN or spectral clustering
  2. Varying densities: Use DBSCAN
  3. Varying sizes: Use Gaussian Mixture Models
  4. Nested clusters: Use hierarchical clustering

Comparison with Other Algorithms

Aspect          K-Means     Hierarchical   DBSCAN     GMM
Choose K        Yes         No             No         Yes
Cluster Shape   Spherical   Any            Any        Elliptical
Scalability     Excellent   Poor           Good       Medium
Outliers        Sensitive   Sensitive      Robust     Medium
Deterministic   No          Yes            Yes        No

Variants and Extensions

1. Mini-Batch K-Means

Problem: Standard K-Means slow for massive datasets

Solution: Use random subsets (mini-batches) of data in each iteration

Advantages:

  • Much faster on large datasets
  • Lower memory footprint
  • Supports online (streaming) updates

Trade-off: Slightly worse clustering quality
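As a sketch with scikit-learn (assumed available), mini-batch and full K-Means can be compared directly; the data below is illustrative:

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# A larger synthetic dataset (illustrative)
X, _ = make_blobs(n_samples=10_000, centers=5, random_state=0)

full = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
mini = MiniBatchKMeans(n_clusters=5, batch_size=256, n_init=10,
                       random_state=0).fit(X)

# Mini-batch trades a small amount of inertia for much cheaper updates
print(f"full inertia: {full.inertia_:.0f}, mini-batch inertia: {mini.inertia_:.0f}")
```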

2. K-Medoids (PAM)

Problem: K-Means sensitive to outliers (mean affected by extreme values)

Solution: Use actual data points (medoids) as cluster centers instead of means

Advantages:

  • Robust to outliers, since medoids are actual data points rather than means
  • Works with arbitrary distance (dissimilarity) metrics

Disadvantages:

  • Much more expensive: classic PAM costs roughly $O(K(n-K)^2)$ per iteration

3. Fuzzy C-Means

Problem: Hard assignment—each point belongs to exactly one cluster

Solution: Soft (fuzzy) assignment—each point has membership degree to each cluster

Membership function:

\[u_{ik} = \frac{1}{\sum_{j=1}^{K} \left(\frac{d(x_i, c_k)}{d(x_i, c_j)}\right)^{2/(m-1)}}\]

Where $m > 1$ is the fuzziness parameter.
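The membership formula above can be sketched directly in NumPy; the function name and the toy points are illustrative assumptions:

```python
import numpy as np

def fuzzy_memberships(X, centers, m=2.0):
    """u[i, k]: degree to which point i belongs to cluster k (rows sum to 1)."""
    # d[i, k] = distance from point i to center k
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)  # guard against division by zero
    # ratio[i, k, j] = d[i, k] / d[i, j], summed over j per the formula
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

X = np.array([[0.0, 0.0], [5.0, 5.0], [2.5, 2.5]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
u = fuzzy_memberships(X, centers)
print(np.round(u, 2))  # the midpoint [2.5, 2.5] gets membership 0.5 in each cluster
```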

4. K-Means with Different Distance Metrics

Standard K-Means uses Euclidean distance, but variants exist:

5. Hierarchical K-Means

Combine K-Means with hierarchical clustering:

  1. Run K-Means to get initial clusters
  2. Recursively apply K-Means within each cluster
  3. Build tree structure

Useful for: Multi-level clustering, topic modeling

Conclusion

We have now covered K-Means clustering end to end: the objective it minimizes, why the mean is the optimal centroid, how K-Means++ initialization and multiple restarts mitigate local minima, how to choose K, and how to evaluate the result. With the from-scratch implementation above, you are well equipped to apply K-Means to real-world problems, and to recognize when a variant such as DBSCAN or a Gaussian Mixture Model is a better fit.
