Dec 13, 2025
Continuing our journey through Machine Learning, let’s now venture into the realm of unsupervised learning. Today we will learn about the K-Means clustering algorithm. Imagine walking into a large library where books are scattered everywhere with no organization. Your task is to group similar books together, novels with novels, textbooks with textbooks, cookbooks with cookbooks, without any labels telling you which book belongs where. You’d naturally look at features like size, cover design, and content to create these groups. This is exactly what K-Means clustering does: it discovers natural groupings in data without any prior labels.
In our previous tutorial on K-Nearest Neighbors (KNN), we worked with supervised learning where we had labeled data and predicted labels for new examples. Now we enter the realm of unsupervised learning, where we have data but no labels. Our goal shifts from prediction to discovery: finding hidden patterns and structure in the data.
Unsupervised learning involves finding patterns in data without explicit labels or targets. The algorithm must discover structure on its own. Some of the common tasks where we can use unsupervised learning include:

- Clustering: grouping similar data points together
- Dimensionality reduction: compressing data while preserving its structure
- Anomaly detection: flagging points that do not fit the overall pattern

In the real world, labeled data is expensive:

- Labeling requires human time and often domain expertise
- Annotation costs grow with dataset size, while unlabeled data is abundant

Unsupervised learning allows us to:

- Exploit large amounts of unlabeled data
- Discover structure we did not know to look for
- Build features or segments for downstream supervised tasks

K-Means clustering powers countless applications:

- Customer segmentation in marketing
- Image compression via color quantization
- Grouping documents or news articles by topic
- Data preprocessing and anomaly detection
Clustering is the task of dividing a dataset into groups (clusters) such that:

- Points within the same cluster are similar to each other
- Points in different clusters are dissimilar
Formally, given a dataset $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$ where $x^{(i)} \in \mathbb{R}^d$, clustering aims to partition the data into $K$ disjoint clusters:
\[\mathcal{C} = \{C_1, C_2, ..., C_K\}\]

where:

- $C_k \cap C_j = \emptyset$ for all $k \neq j$ (clusters do not overlap)
- $\bigcup_{k=1}^{K} C_k = \mathcal{D}$ (every point belongs to some cluster)
- Partitional (e.g., K-Means): divides data into non-overlapping clusters; each point belongs to exactly one cluster.
- Hierarchical: creates a tree-like structure (dendrogram) of nested clusters.
- Density-based (e.g., DBSCAN): defines clusters as dense regions separated by sparse regions.
- Model-based (e.g., Gaussian Mixture Models): assumes data is generated from a mixture of probability distributions.
K-Means is beautifully simple: it finds K cluster centers (called centroids) and assigns each data point to its nearest centroid. The algorithm iteratively:

1. Assigns each point to its nearest centroid
2. Moves each centroid to the mean of the points assigned to it
You can think of it like organizing a dinner party where K = 3 (three tables). Initially, you randomly place three chairs as “table centers.” Then:

1. Each guest sits at the chair nearest to them
2. You move each chair to the middle of the guests seated around it
3. Guests re-seat themselves at their now-nearest chair, and you repeat until nobody moves
Consider points scattered in 2D space:
Initial State (random centroids):
○ ● ○
○ ○ ○ ●
○ ○ ●
● ○ ○
After Assignment:
1 * 2
1 1 2 2
1 1 *
* 2 2
After Update (move centroids):
1 * 2
1 1 2 2
1 * 2
1 * 2
Final Clusters (converged):
A A B
A A B B
A A B
A B B
The asterisks (*) are centroids, numbers/letters show cluster assignments.
The name comes from:

- K: the number of clusters we want to find
- Means: each cluster is represented by the mean (centroid) of its points
K-Means aims to minimize the within-cluster sum of squares (WCSS), also called inertia:
\[J = \sum_{k=1}^{K} \sum_{x \in C_k} \|x - \mu_k\|^2\]

Where:

- $K$ is the number of clusters
- $C_k$ is the set of points assigned to cluster $k$
- $\mu_k$ is the centroid (mean) of cluster $k$
- $\|x - \mu_k\|^2$ is the squared Euclidean distance from $x$ to its centroid
Intuition: We want points to be as close as possible to their cluster centroids.
Let’s denote:

- $r_{ik} \in \{0, 1\}$: an indicator variable that equals 1 if point $x^{(i)}$ is assigned to cluster $k$, and 0 otherwise
The objective function can be rewritten as:
\[J = \sum_{i=1}^{n} \sum_{k=1}^{K} r_{ik} \|x^{(i)} - \mu_k\|^2\]

With the constraint $\sum_{k=1}^{K} r_{ik} = 1$ for all $i$ (each point belongs to exactly one cluster).
For a given cluster assignment, the optimal centroid is the mean of all points in that cluster:
\[\mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x\]

Or equivalently:
\[\mu_k = \frac{\sum_{i=1}^{n} r_{ik} x^{(i)}}{\sum_{i=1}^{n} r_{ik}}\]

Proof: To minimize $J$ with respect to $\mu_k$, take the derivative and set it to zero:
\[\frac{\partial J}{\partial \mu_k} = \frac{\partial}{\partial \mu_k} \sum_{i=1}^{n} r_{ik} \|x^{(i)} - \mu_k\|^2 = 0\]

\[\sum_{i=1}^{n} r_{ik} \cdot 2(x^{(i)} - \mu_k) \cdot (-1) = 0\]

\[\sum_{i=1}^{n} r_{ik} x^{(i)} = \mu_k \sum_{i=1}^{n} r_{ik}\]

\[\mu_k = \frac{\sum_{i=1}^{n} r_{ik} x^{(i)}}{\sum_{i=1}^{n} r_{ik}}\]

This is exactly the mean (average) of all points assigned to cluster $k$.
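This result is easy to sanity-check numerically: for a fixed cluster of points, no candidate center beats the mean. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(50, 2))  # one cluster of 50 two-dimensional points

def wcss(points, center):
    """Sum of squared distances from every point to a candidate center."""
    return np.sum((points - center) ** 2)

mean = points.mean(axis=0)
j_mean = wcss(points, mean)

# Perturbing the center in any direction can only increase the objective
for _ in range(100):
    assert wcss(points, mean + rng.normal(scale=0.5, size=2)) >= j_mean
print(f"J at the mean: {j_mean:.3f}")
```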
For a given set of centroids, the optimal assignment is:
\[r_{ik} = \begin{cases} 1 & \text{if } k = \arg\min_j \|x^{(i)} - \mu_j\|^2 \\ 0 & \text{otherwise} \end{cases}\]

Each point is assigned to its nearest centroid.
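The assignment rule is a single vectorized argmin in NumPy. A minimal sketch (the toy points and centroids are made up for illustration):

```python
import numpy as np

# Toy points and centroids (made up for illustration)
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.0]])
centroids = np.array([[1.0, 1.5], [8.5, 8.5]])

# Squared distance from every point to every centroid, shape (n, K)
d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
labels = d2.argmin(axis=1)  # r_ik = 1 exactly where k == labels[i]
print(labels)  # → [0 0 1 1]
```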
K-Means is solving:
\[\min_{C_1, ..., C_K} \sum_{k=1}^{K} \sum_{x \in C_k} \|x - \mu_k\|^2\]

This is a non-convex optimization problem (it has local minima), which is why initialization matters!
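To see the non-convexity concretely, here is a small sketch (the four corner points and both initializations are made up for illustration): one start converges to the optimal split, while the other is a stationary point with far worse inertia.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [10, 0], [10, 1]], dtype=float)

def lloyd(X, centroids, iters=20):
    """Plain Lloyd iterations with K = 2; returns the final inertia."""
    for _ in range(iters):
        d2 = ((X[:, None] - centroids[None]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    return ((X - centroids[labels]) ** 2).sum()

bad_J = lloyd(X, np.array([[5.0, 0.0], [5.0, 1.0]]))    # splits top/bottom
good_J = lloyd(X, np.array([[0.0, 0.5], [10.0, 0.5]]))  # splits left/right
print(bad_J, good_J)  # → 100.0 1.0
```

The first initialization is a fixed point of the update equations, so the algorithm never escapes it despite the 100× worse objective.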
Input: Dataset D = {x(1), x(2), ..., x(n)}, number of clusters K
Output: Cluster assignments and centroids
1. Initialize K centroids μ₁, μ₂, ..., μ_K (randomly or using K-Means++)
2. Repeat until convergence:
a. Assignment Step:
For each data point x(i):
Assign x(i) to the nearest centroid:
c(i) = argmin_k ||x(i) - μ_k||²
b. Update Step:
For each cluster k:
Update centroid to mean of assigned points:
μ_k = (1/|C_k|) Σ(x∈C_k) x
3. Return final cluster assignments and centroids
The algorithm stops when one of these conditions is met:

- Cluster assignments no longer change between iterations
- The centroids move less than a tolerance threshold
- A maximum number of iterations is reached
Let’s cluster 6 points in 2D with K=2:
Data: $x^{(1)} = [1, 1]$, $x^{(2)} = [1.5, 2]$, $x^{(3)} = [2, 1]$, $x^{(4)} = [8, 8]$, $x^{(5)} = [8.5, 8]$, $x^{(6)} = [9, 9]$
Iteration 0 (Initialization): suppose the random draw picks two data points as centroids, say $\mu_1 = [1, 1]$ and $\mu_2 = [9, 9]$.

Iteration 1 - Assignment: computing each point’s distance to both centroids, $[1,1]$, $[1.5,2]$, and $[2,1]$ are closest to $\mu_1$, while $[8,8]$, $[8.5,8]$, and $[9,9]$ are closest to $\mu_2$.

Iteration 1 - Update:
\[\mu_1 = \frac{[1,1] + [1.5,2] + [2,1]}{3} = [1.5, 1.33]\]

\[\mu_2 = \frac{[8,8] + [8.5,8] + [9,9]}{3} = [8.5, 8.33]\]

Iteration 2: Repeat assignment and update…
The algorithm continues until centroids stabilize!
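The iteration-1 update above can be checked in a couple of lines of NumPy:

```python
import numpy as np

# Points assigned to each cluster after the iteration-1 assignment step
cluster1 = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0]])
cluster2 = np.array([[8.0, 8.0], [8.5, 8.0], [9.0, 9.0]])

mu1 = cluster1.mean(axis=0)  # → [1.5, 1.33...]
mu2 = cluster2.mean(axis=0)  # → [8.5, 8.33...]
print(np.round(mu1, 2), np.round(mu2, 2))
```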
Initialization is crucial because K-Means can get stuck in local minima. Different starting points lead to different final clusters.
Method: Randomly select K data points as initial centroids.
Advantages:

- Simple and fast
- No extra computation beyond picking K indices

Disadvantages:

- Can pick poor starting points (e.g., two centroids inside the same true cluster)
- Results vary widely between runs and may converge to bad local minima
Implementation:
# Randomly select K points
indices = np.random.choice(n, K, replace=False)
centroids = X[indices]
Key Idea: Choose initial centroids that are far apart from each other.
Algorithm:

1. Choose the first centroid uniformly at random from the data
2. For each point $x$, compute $D(x)$, its distance to the nearest already-chosen centroid
3. Choose the next centroid randomly with probability proportional to $D(x)^2$
4. Repeat steps 2-3 until K centroids have been chosen
Mathematical Formulation:
For choosing the $(j+1)$-th centroid:
\[P(x^{(i)}) = \frac{D(x^{(i)})^2}{\sum_{i'=1}^{n} D(x^{(i')})^2}\]

Where $D(x^{(i)}) = \min_{k=1,\ldots,j} \|x^{(i)} - \mu_k\|$ is the distance from $x^{(i)}$ to its nearest already-chosen centroid.
Advantages:

- Better final clusterings on average
- Faster convergence
- Comes with a provable approximation guarantee (expected $O(\log K)$-competitive with the optimum)

Disadvantages:

- Initialization is more expensive (one pass over the data per centroid)
- Still randomized, so results can vary between runs

Why It Works:

K-Means++ ensures centroids start far apart, which helps:

- Avoid placing multiple centroids inside the same true cluster
- Reduce the number of iterations needed to converge
- Reach lower-inertia solutions more reliably
Method: Run K-Means multiple times with different random initializations and keep the best result.
Selection Criterion: Choose the clustering with lowest inertia (objective function value).
Trade-off: Increases computational cost but improves quality.
For the same dataset:
| Initialization | Avg Inertia | Avg Iterations | Consistency |
|---|---|---|---|
| Random | 450.2 | 12.3 | Low |
| K-Means++ | 398.7 | 8.1 | High |
| Random (10 runs) | 401.3 | 9.5 | Medium |
One of the biggest challenges with K-Means: How do we choose K?
Unlike supervised learning (where we have validation accuracy), there’s no single “correct” number of clusters. We use heuristics and domain knowledge.
Idea: Plot inertia (WCSS) vs. K and look for an “elbow” where the rate of decrease sharply changes.
Mathematical Intuition: inertia always decreases as K grows (more centroids can only reduce distances), but beyond the true number of clusters the marginal decrease becomes small.

Procedure:

1. Run K-Means for K = 1, 2, …, up to some maximum K
2. Record the inertia for each K
3. Plot inertia vs. K
4. Pick the K at the “elbow,” where the curve bends and flattens
Example:
Inertia
↑
5000|●
4000| ●
3000| ●
2000| ●___
1000| ●___●___●
0|________________________→ K
1 2 3 4 5 6 7 8
Elbow at K=4
Limitation: The elbow is often not clearly defined!
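The procedure itself is easy to script; a sketch with scikit-learn on synthetic data generated with a known K = 4 (an illustrative choice):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with a known K = 4 (illustrative choice)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Inertia always decreases with K; the elbow is where the drop flattens
print([round(v) for v in inertias])
```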
Idea: Measure how similar each point is to its own cluster compared to other clusters.
For a single point $x^{(i)}$ in cluster $C_k$:
\[a(i) = \frac{1}{|C_k| - 1} \sum_{j \in C_k, j \neq i} d(i, j)\]

Average distance to points in its own cluster (lower is better).
\[b(i) = \min_{k' \neq k} \frac{1}{|C_{k'}|} \sum_{j \in C_{k'}} d(i, j)\]

Average distance to points in the nearest other cluster (higher is better).
Silhouette coefficient for point $i$:
\[s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}\]

Properties:

- $s(i) \in [-1, 1]$
- $s(i) \approx 1$: the point is well matched to its cluster
- $s(i) \approx 0$: the point lies on the boundary between two clusters
- $s(i) < 0$: the point is probably assigned to the wrong cluster
Average silhouette score for the entire clustering:
\[\bar{s} = \frac{1}{n} \sum_{i=1}^{n} s(i)\]

Usage: Choose the K that maximizes the average silhouette score.
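A sketch of this selection rule using scikit-learn’s silhouette_score, on synthetic blob data built with K = 4 so we know the answer in advance:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

scores = {}
for k in range(2, 8):  # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```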
Idea: Compare inertia of your clustering to that of random data.
\[\text{Gap}(K) = \mathbb{E}[\log(W_K^*)] - \log(W_K)\]

Where:

- $W_K$ is the within-cluster sum of squares of your clustering with $K$ clusters
- $W_K^*$ is the same quantity computed on reference data drawn uniformly at random over the data’s bounding box
Choose K where the gap is largest (your clustering is much better than random).
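A minimal sketch of the gap statistic, assuming a uniform reference distribution over the data’s bounding box (synthetic three-cluster data for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
rng = np.random.default_rng(0)
lo, hi = X.min(axis=0), X.max(axis=0)

def w(data, k):
    """Within-cluster sum of squares W_K for a K-Means fit."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_

def gap(k, n_refs=5):
    """Gap(K) = E[log W_K*] - log W_K, with reference data drawn
    uniformly over the bounding box of X."""
    ref_logs = [np.log(w(rng.uniform(lo, hi, size=X.shape), k))
                for _ in range(n_refs)]
    return float(np.mean(ref_logs) - np.log(w(X, k)))

gaps = {k: gap(k) for k in range(1, 6)}
print({k: round(v, 2) for k, v in gaps.items()})
```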
Often the best approach: use domain expertise!
Examples:

- A clothing retailer may want exactly K = 5 sizes (XS, S, M, L, XL)
- A marketing team may only have the budget to run K = 3 distinct campaigns
Theorem: K-Means algorithm is guaranteed to converge.
Proof Sketch:

1. The assignment step never increases $J$ (each point either stays put or moves to a closer centroid)
2. The update step never increases $J$ (the mean minimizes within-cluster squared distance, as shown earlier)
3. $J$ is bounded below by 0, and there are only finitely many possible partitions of $n$ points into $K$ clusters
4. A monotonically non-increasing, bounded objective over finitely many states must converge
However: K-Means converges to a local minimum, not necessarily the global minimum!
Typical: 10-20 iterations for convergence
Factors affecting convergence speed:

- Initialization quality (K-Means++ typically converges in fewer iterations)
- How well separated the clusters are
- Dataset size and dimensionality
- The convergence tolerance
Per iteration:

- Assignment step: $O(n \cdot K \cdot d)$ (distance from every point to every centroid)
- Update step: $O(n \cdot d)$ (one pass to average the points in each cluster)

Total: $O(I \cdot n \cdot K \cdot d)$

Where:

- $I$ = number of iterations
- $n$ = number of data points
- $K$ = number of clusters
- $d$ = number of features (dimensions)
Space Complexity: $O(n \cdot d + K \cdot d)$
K-Means is relatively efficient, but challenges arise with:

- Very large datasets (many points $n$)
- High-dimensional data, where Euclidean distances become less informative
- Large numbers of clusters $K$
Let’s implement K-Means from scratch to understand every detail.
import numpy as np
from collections import defaultdict
class KMeans:
"""
K-Means clustering implementation from scratch.
Parameters:
-----------
n_clusters : int, default=3
The number of clusters to form
max_iters : int, default=300
Maximum number of iterations
init : str, default='kmeans++'
Initialization method ('random' or 'kmeans++')
n_init : int, default=10
Number of times to run algorithm with different initializations
tol : float, default=1e-4
Convergence tolerance
random_state : int, default=None
Random seed for reproducibility
"""
def __init__(self, n_clusters=3, max_iters=300, init='kmeans++',
n_init=10, tol=1e-4, random_state=None):
self.n_clusters = n_clusters
self.max_iters = max_iters
self.init = init
self.n_init = n_init
self.tol = tol
self.random_state = random_state
self.centroids = None
self.labels_ = None
self.inertia_ = None
self.n_iter_ = None
def fit(self, X):
"""
Compute K-Means clustering.
Parameters:
-----------
X : numpy.ndarray of shape (n_samples, n_features)
Training data
"""
X = np.array(X)
if self.random_state is not None:
np.random.seed(self.random_state)
best_inertia = np.inf
best_centroids = None
best_labels = None
best_iters = 0
# Run multiple times with different initializations
for _ in range(self.n_init):
centroids, labels, inertia, n_iters = self._kmeans_single(X)
if inertia < best_inertia:
best_inertia = inertia
best_centroids = centroids
best_labels = labels
best_iters = n_iters
self.centroids = best_centroids
self.labels_ = best_labels
self.inertia_ = best_inertia
self.n_iter_ = best_iters
return self
def _kmeans_single(self, X):
"""Run K-Means algorithm once."""
# Initialize centroids
if self.init == 'kmeans++':
centroids = self._kmeans_plus_plus(X)
else:
centroids = self._random_init(X)
# Iterate until convergence
for iteration in range(self.max_iters):
# Assignment step
labels = self._assign_clusters(X, centroids)
# Update step
new_centroids = self._update_centroids(X, labels)
# Check convergence
centroid_shift = np.linalg.norm(new_centroids - centroids)
if centroid_shift < self.tol:
centroids = new_centroids
break
centroids = new_centroids
        # Recompute assignments for the final centroids, then the inertia
        labels = self._assign_clusters(X, centroids)
        inertia = self._calculate_inertia(X, centroids, labels)
        return centroids, labels, inertia, iteration + 1
def _random_init(self, X):
"""Random initialization: choose K random points as centroids."""
n_samples = X.shape[0]
indices = np.random.choice(n_samples, self.n_clusters, replace=False)
return X[indices].copy()
def _kmeans_plus_plus(self, X):
"""K-Means++ initialization."""
n_samples, n_features = X.shape
centroids = np.empty((self.n_clusters, n_features))
# Choose first centroid randomly
centroids[0] = X[np.random.randint(n_samples)]
# Choose remaining centroids
for k in range(1, self.n_clusters):
# Calculate distances to nearest existing centroid
distances = np.min([np.linalg.norm(X - c, axis=1)
for c in centroids[:k]], axis=0)
# Square distances for probability calculation
probabilities = distances ** 2
probabilities /= probabilities.sum()
# Choose next centroid with probability proportional to distance²
next_centroid_idx = np.random.choice(n_samples, p=probabilities)
centroids[k] = X[next_centroid_idx]
return centroids
def _assign_clusters(self, X, centroids):
"""Assign each point to nearest centroid."""
distances = np.array([np.linalg.norm(X - c, axis=1)
for c in centroids])
return np.argmin(distances, axis=0)
    def _update_centroids(self, X, labels):
        """Update centroids to the mean of assigned points."""
        centroids = np.empty((self.n_clusters, X.shape[1]))
        for k in range(self.n_clusters):
            points = X[labels == k]
            if len(points) > 0:
                centroids[k] = points.mean(axis=0)
            else:
                # Guard against empty clusters: re-seed from a random point
                centroids[k] = X[np.random.randint(X.shape[0])]
        return centroids
def _calculate_inertia(self, X, centroids, labels):
"""Calculate within-cluster sum of squares."""
inertia = 0
for k in range(self.n_clusters):
cluster_points = X[labels == k]
if len(cluster_points) > 0:
inertia += np.sum((cluster_points - centroids[k]) ** 2)
return inertia
def predict(self, X):
"""
Predict cluster labels for new data.
Parameters:
-----------
X : numpy.ndarray of shape (n_samples, n_features)
New data to predict
Returns:
--------
labels : numpy.ndarray of shape (n_samples,)
Cluster labels
"""
X = np.array(X)
return self._assign_clusters(X, self.centroids)
def fit_predict(self, X):
"""Fit and return cluster labels."""
self.fit(X)
return self.labels_
# Example usage
if __name__ == "__main__":
from sklearn.datasets import make_blobs
# Generate synthetic data
X, y_true = make_blobs(n_samples=300, centers=4, n_features=2,
cluster_std=0.60, random_state=42)
# Fit K-Means
kmeans = KMeans(n_clusters=4, init='kmeans++', random_state=42)
kmeans.fit(X)
print(f"Converged in {kmeans.n_iter_} iterations")
print(f"Final inertia: {kmeans.inertia_:.2f}")
print(f"Cluster sizes: {np.bincount(kmeans.labels_)}")
When we don’t have true labels, we use internal metrics:
Already discussed above. Range: [-1, 1], higher is better.

Davies-Bouldin Index: the average, over clusters, of each cluster’s worst-case similarity to another cluster:

\[DB = \frac{1}{K} \sum_{k=1}^{K} \max_{j \neq k} \frac{s_k + s_j}{d(\mu_k, \mu_j)}\]

Where:

- $s_k$ is the average distance of points in cluster $k$ to its centroid $\mu_k$
- $d(\mu_k, \mu_j)$ is the distance between centroids $\mu_k$ and $\mu_j$

Lower is better (indicates better separation).

Calinski-Harabasz Index: the ratio of between-cluster to within-cluster dispersion:

\[CH = \frac{\text{tr}(B_K) / (K - 1)}{\text{tr}(W_K) / (n - K)}\]

Where:

- $\text{tr}(B_K)$ is the between-cluster dispersion (squared distances of centroids to the overall mean, weighted by cluster size)
- $\text{tr}(W_K)$ is the within-cluster dispersion (squared distances of points to their centroids)

Higher is better (more separation, less variance within clusters).
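These internal metrics are all available in scikit-learn; a quick sketch on synthetic blob data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)        # higher is better, in [-1, 1]
db = davies_bouldin_score(X, labels)     # lower is better
ch = calinski_harabasz_score(X, labels)  # higher is better
print(f"silhouette={sil:.3f}  davies_bouldin={db:.3f}  calinski_harabasz={ch:.1f}")
```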
When we have true labels (for validation):
Adjusted Rand Index (ARI): measures similarity between two clusterings, adjusted for chance.

Normalized Mutual Information (NMI): measures the mutual information between predicted and true labels, normalized to [0, 1].
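Both external metrics are available in scikit-learn; since make_blobs returns ground-truth labels, we can sketch them directly:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Both scores are invariant to permutations of the cluster IDs
ari = adjusted_rand_score(y_true, labels)
nmi = normalized_mutual_info_score(y_true, labels)
print(f"ARI={ari:.3f}  NMI={nmi:.3f}")
```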
Problem cases:

- Non-spherical (elongated or irregularly shaped) clusters
- Clusters with very different sizes or densities
- Data containing many outliers
| Aspect | K-Means | Hierarchical | DBSCAN | GMM |
|---|---|---|---|---|
| Choose K | Yes | No | No | Yes |
| Cluster Shape | Spherical | Any | Any | Elliptical |
| Scalability | Excellent | Poor | Good | Medium |
| Outliers | Sensitive | Sensitive | Robust | Medium |
| Deterministic | No | Yes | Yes | No |
Problem: Standard K-Means slow for massive datasets
Solution: Use random subsets (mini-batches) of data in each iteration
Advantages:

- Much faster on large datasets
- Lower memory footprint
- Naturally suited to streaming data
Trade-off: Slightly worse clustering quality
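scikit-learn ships this variant as MiniBatchKMeans; a quick sketch comparing it with full K-Means on synthetic data:

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=5, random_state=42)

full = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)
mini = MiniBatchKMeans(n_clusters=5, batch_size=256, n_init=10,
                       random_state=42).fit(X)

# Mini-batch trades a little inertia (quality) for a large speedup
print(f"full inertia: {full.inertia_:.0f}")
print(f"mini inertia: {mini.inertia_:.0f}")
```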
Problem: K-Means sensitive to outliers (mean affected by extreme values)
Solution: Use actual data points (medoids) as cluster centers instead of means
Advantages:

- Robust to outliers
- Works with arbitrary distance or dissimilarity measures
- Cluster centers are actual, interpretable data points

Disadvantages:

- Much more expensive per iteration than K-Means
- Scales poorly to very large datasets
Problem: Hard assignment—each point belongs to exactly one cluster
Solution: Soft (fuzzy) assignment—each point has membership degree to each cluster
Membership function:
\[u_{ik} = \frac{1}{\sum_{j=1}^{K} \left(\frac{d(x_i, c_k)}{d(x_i, c_j)}\right)^{2/(m-1)}}\]

Where $m > 1$ is the fuzziness parameter.
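The membership formula can be evaluated directly in NumPy; a minimal sketch with made-up points, centers, and $m = 2$:

```python
import numpy as np

# Toy points and centers (made up for illustration); m = 2 is a common choice
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
centers = np.array([[0.5, 0.0], [10.0, 0.0]])
m = 2

d = np.linalg.norm(X[:, None] - centers[None], axis=2)  # (n, K) distances
d = np.maximum(d, 1e-12)  # avoid division by zero for points on a center

# u[i, k] = 1 / sum_j (d_ik / d_ij)^(2/(m-1))
u = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1))).sum(axis=2)
print(np.round(u, 3))  # each row sums to 1
```

Points near a center get membership close to 1 for that cluster, while points between centers split their membership.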
Standard K-Means uses Euclidean distance, but variants exist:

- K-Medians: uses Manhattan (L1) distance; centroids become component-wise medians
- Spherical K-Means: uses cosine distance; popular for text data
- Kernel K-Means: applies the kernel trick to find non-linearly separable clusters
Combine K-Means with hierarchical clustering: for example, recursively split the data by running K-Means with K = 2 inside each cluster, building a tree of progressively finer clusters (as in bisecting K-Means).

Useful for: Multi-level clustering, topic modeling
We have now seen how K-Means clustering works, from its objective function and convergence guarantees to initialization strategies, choosing K, and evaluation metrics, giving us a solid foundation for applying it to real-world problems.