
Machine Learning Series

by Mayank Sharma

K-Means Clustering: Unsupervised Learning Fundamentals

Dec 13, 2025

Continuing our journey through Machine Learning, let’s now venture into the realm of unsupervised learning. Today we will learn about the K-Means clustering algorithm. Imagine walking into a large library where books are scattered everywhere with no organization. Your task is to group similar books together (novels with novels, textbooks with textbooks, cookbooks with cookbooks) without any labels telling you which book belongs where. You’d naturally look at features like size, cover design, and content to create these groups. This is exactly what K-Means clustering does: it discovers natural groupings in data without any prior labels.

Table of Contents

  1. Introduction: From Supervised to Unsupervised Learning
  2. What is Clustering?
  3. The K-Means Algorithm: Intuition
  4. Mathematical Formulation
  5. The K-Means Algorithm: Step by Step
  6. Initialization Strategies
  7. Choosing the Optimal K
  8. Convergence and Computational Complexity
  9. Implementation from Scratch
  10. Evaluating Clustering Quality
  11. Advantages and Limitations
  12. Variants and Extensions
  13. Conclusion

Introduction: From Supervised to Unsupervised Learning

The Paradigm Shift

In our previous tutorial on K-Nearest Neighbors (KNN), we worked with supervised learning where we had labeled data and predicted labels for new examples. Now we enter the realm of unsupervised learning, where we have data but no labels. Our goal shifts from prediction to discovery: finding hidden patterns and structure in the data.

What is Unsupervised Learning?

Unsupervised learning involves finding patterns in data without explicit labels or targets. The algorithm must discover structure on its own. Common tasks where we can use unsupervised learning include:

  • Clustering: grouping similar data points together
  • Dimensionality reduction: compressing data while preserving its structure
  • Anomaly detection: identifying unusual data points
  • Association rule mining: discovering relationships between variables

Why Unsupervised Learning Matters

In the real world, labeled data is expensive: annotating examples requires human time, expertise, and money, while unlabeled data accumulates constantly and essentially for free.

Unsupervised learning allows us to:

  • explore datasets we know little about
  • discover structure without costly annotation
  • preprocess or compress data for downstream supervised tasks

Real-World Applications

K-Means clustering powers countless applications:

  • customer segmentation for targeted marketing
  • image compression via color quantization
  • grouping similar documents or news articles
  • anomaly detection, by flagging points far from every centroid

What is Clustering?

Definition

Clustering is the task of dividing a dataset into groups (clusters) such that:

  • points within the same cluster are similar to each other
  • points in different clusters are dissimilar

Formally, given a dataset $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$ where $x^{(i)} \in \mathbb{R}^d$, clustering aims to partition the data into $K$ disjoint clusters:

\[\mathcal{C} = \{C_1, C_2, ..., C_K\}\]

where:

  • each point belongs to exactly one cluster: $C_j \cap C_{j'} = \emptyset$ for $j \neq j'$
  • the clusters cover the whole dataset: $C_1 \cup C_2 \cup \ldots \cup C_K = \mathcal{D}$

Types of Clustering

1. Partitional Clustering

Divides data into non-overlapping clusters; each point belongs to exactly one cluster (e.g., K-Means).

2. Hierarchical Clustering

Creates a tree-like structure (dendrogram) of nested clusters (e.g., agglomerative clustering).

3. Density-Based Clustering

Defines clusters as dense regions separated by sparse regions (e.g., DBSCAN).

4. Model-Based Clustering

Assumes the data is generated from a mixture of probability distributions (e.g., Gaussian Mixture Models).

The K-Means Algorithm: Intuition

The Core Idea

K-Means is beautifully simple: it finds K cluster centers (called centroids) and assigns each data point to its nearest centroid. The algorithm iteratively:

  1. Assigns each point to the closest centroid
  2. Updates centroids to be the mean of assigned points
  3. Repeats until convergence

You can think of it like organizing a dinner party where K = 3 (three tables). Initially, you randomly place three chairs as “table centers.” Then:

  • each guest sits at the nearest chair
  • you move each chair to the middle of the guests gathered around it
  • guests re-seat themselves at whichever chair is now nearest, and the process repeats until nobody moves

Visual Intuition

Consider points scattered in 2D space:

Initial State (random centroids):

    . *       .
 .     .  .     .
   .  .     .
    .   * .  .

After Assignment:

    1 *       2
 1     1  2     2
   1  1     2
    1   * 2  2

After Update (move centroids):

    1         2
 1  *  1  2     2
   1  1    *  2
    1     2  2

Final Clusters (converged):

    A         B
 A  *  A  B     B
   A  A    *  B
    A     B  B

The asterisks (*) are centroids; numbers and letters show cluster assignments.

Why “K-Means”?

The name comes from:

  • K: the number of clusters, chosen in advance
  • Means: each cluster is represented by the mean (centroid) of its points

Mathematical Formulation

Objective Function

K-Means aims to minimize the within-cluster sum of squares (WCSS), also called inertia:

\[J = \sum_{k=1}^{K} \sum_{x \in C_k} \|x - \mu_k\|^2\]

Where:

  • $K$ is the number of clusters
  • $C_k$ is the set of points assigned to cluster $k$
  • $\mu_k$ is the centroid (mean) of cluster $k$
  • $\|x - \mu_k\|$ is the Euclidean distance from point $x$ to its centroid

Intuition: We want points to be as close as possible to their cluster centroids.

Expanded Form

Let’s denote $r_{ik} \in \{0, 1\}$ as an indicator variable: $r_{ik} = 1$ if point $x^{(i)}$ is assigned to cluster $k$, and $r_{ik} = 0$ otherwise.

The objective function can be rewritten as:

\[J = \sum_{i=1}^{n} \sum_{k=1}^{K} r_{ik} \|x^{(i)} - \mu_k\|^2\]

With the constraint: $\sum_{k=1}^{K} r_{ik} = 1$ for all $i$ (each point belongs to exactly one cluster).

Centroid Calculation

For a given cluster assignment, the optimal centroid is the mean of all points in that cluster:

\[\mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x\]

Or equivalently:

\[\mu_k = \frac{\sum_{i=1}^{n} r_{ik} x^{(i)}}{\sum_{i=1}^{n} r_{ik}}\]

Proof: To minimize $J$ with respect to $\mu_k$, take the derivative and set to zero:

\[\frac{\partial J}{\partial \mu_k} = \frac{\partial}{\partial \mu_k} \sum_{i=1}^{n} r_{ik} \|x^{(i)} - \mu_k\|^2 = 0\]

\[\sum_{i=1}^{n} r_{ik} \cdot 2(x^{(i)} - \mu_k) \cdot (-1) = 0\]

\[\sum_{i=1}^{n} r_{ik} x^{(i)} = \mu_k \sum_{i=1}^{n} r_{ik}\]

\[\mu_k = \frac{\sum_{i=1}^{n} r_{ik} x^{(i)}}{\sum_{i=1}^{n} r_{ik}}\]

This is exactly the mean (average) of all points assigned to cluster $k$.
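We can sanity-check this result numerically: for a small set of illustrative points, the sum of squared distances is smallest at the mean, and any perturbation away from it increases the objective.

```python
import numpy as np

# Numerical check that the cluster mean minimizes the sum of squared
# distances; the three points are illustrative.
points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0]])
mean = points.mean(axis=0)  # the optimal centroid

def sse(mu):
    """Sum of squared distances from each point to mu."""
    return np.sum((points - mu) ** 2)

# Any perturbation away from the mean increases the objective
for shift in ([0.1, 0.0], [0.0, -0.2], [0.3, 0.3]):
    assert sse(mean) < sse(mean + np.array(shift))
print(sse(mean))
```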

Assignment Rule

For a given set of centroids, the optimal assignment is:

\[r_{ik} = \begin{cases} 1 & \text{if } k = \arg\min_j \|x^{(i)} - \mu_j\|^2 \\ 0 & \text{otherwise} \end{cases}\]

Each point is assigned to its nearest centroid.
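The assignment rule translates directly into vectorized NumPy; the toy points and centroids below are illustrative.

```python
import numpy as np

# Each point goes to the centroid with the smallest squared Euclidean distance
X = np.array([[1.0, 1.0], [2.0, 1.0], [8.0, 8.0], [9.0, 9.0]])
centroids = np.array([[1.5, 1.0], [8.5, 8.5]])

# distances[i, k] = ||x_i - mu_k||^2, computed via broadcasting
distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
labels = distances.argmin(axis=1)
print(labels)  # -> [0 0 1 1]
```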

The Optimization Problem

K-Means is solving:

\[\min_{C_1, ..., C_K} \sum_{k=1}^{K} \sum_{x \in C_k} \|x - \mu_k\|^2\]

This is a non-convex optimization problem (has local minima), which is why initialization matters!

The K-Means Algorithm: Step by Step

Algorithm Pseudocode

Input: Dataset D = {x(1), x(2), ..., x(n)}, number of clusters K
Output: Cluster assignments and centroids

1. Initialize K centroids μ_1, μ_2, ..., μ_K (randomly or using K-Means++)

2. Repeat until convergence:

   a. Assignment Step:
      For each data point x(i):
          Assign x(i) to the nearest centroid:
          c(i) = argmin_k ||x(i) - μ_k||²

   b. Update Step:
      For each cluster k:
          Update centroid to mean of assigned points:
          μ_k = (1/|C_k|) Σ_{x ∈ C_k} x

3. Return final cluster assignments and centroids

Convergence Criteria

The algorithm stops when one of these conditions is met:

  1. Centroids don’t change: $\|\mu_k^{new} - \mu_k^{old}\| < \epsilon$ for all $k$
  2. Assignments don’t change: No points switch clusters
  3. Maximum iterations reached: Prevent infinite loops
  4. Objective function stabilizes: Change in $J$ is below threshold

Concrete Example

Let’s cluster 6 points in 2D with K=2:

Data: [1, 1], [1.5, 2], [2, 1], [8, 8], [8.5, 8], [9, 9]

Iteration 0 (Initialization): suppose we pick two of the points as initial centroids, say μ₁ = [1, 1] and μ₂ = [8, 8].

Iteration 1 - Assignment:

Computing distances to each centroid, the points [1, 1], [1.5, 2], and [2, 1] are closest to μ₁, while [8, 8], [8.5, 8], and [9, 9] are closest to μ₂.

Iteration 1 - Update:

\[\mu_1 = \frac{[1,1] + [1.5,2] + [2,1]}{3} = [1.5, 1.33]\]

\[\mu_2 = \frac{[8,8] + [8.5,8] + [9,9]}{3} = [8.5, 8.33]\]

Iteration 2: Repeat assignment and update…

The algorithm continues until centroids stabilize!
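The update step above can be checked in a couple of lines of NumPy:

```python
import numpy as np

# Reproducing the update step from the worked example above
cluster_1 = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0]])
cluster_2 = np.array([[8.0, 8.0], [8.5, 8.0], [9.0, 9.0]])

mu_1 = cluster_1.mean(axis=0)
mu_2 = cluster_2.mean(axis=0)
print(np.round(mu_1, 2), np.round(mu_2, 2))  # mu_1 ≈ [1.5, 1.33], mu_2 ≈ [8.5, 8.33]
```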

Initialization Strategies

Initialization is crucial because K-Means can get stuck in local minima. Different starting points lead to different final clusters.

1. Random Initialization

Method: Randomly select K data points as initial centroids.

Advantages:

  • Simple and fast, with no extra setup cost

Disadvantages:

  • Initial centroids may land close together, splitting a natural cluster
  • Different runs can produce very different (and poor) clusterings
  • A bad start can slow convergence

Implementation:

# Randomly select K distinct data points as initial centroids
indices = np.random.choice(n, K, replace=False)
centroids = X[indices]

2. K-Means++ Initialization

Key Idea: Choose initial centroids that are far apart from each other.

Algorithm:

  1. Choose first centroid uniformly at random from data points
  2. For each remaining centroid:
    • Calculate distance $D(x)$ from each point $x$ to nearest existing centroid
    • Choose next centroid with probability proportional to $D(x)^2$
  3. Repeat until K centroids are chosen

Mathematical Formulation:

For choosing the $(j+1)$-th centroid:

\[P(x^{(i)}) = \frac{D(x^{(i)})^2}{\sum_{i=1}^{n} D(x^{(i)})^2}\]

Where $D(x^{(i)}) = \min_{k=1,\ldots,j} \|x^{(i)} - \mu_k\|$

Advantages:

  • Spreads the initial centroids apart, leading to better final clusterings
  • Typically converges in fewer iterations
  • Carries a provable $O(\log K)$ approximation guarantee in expectation

Disadvantages:

  • Seeding is sequential and adds extra distance computations up front
  • Still randomized, so results can vary between runs

Why It Works:

K-Means++ ensures centroids start far apart, which helps:

  • avoid several centroids competing for the same natural cluster
  • reduce the number of iterations needed to converge
  • reach lower final inertia on average

3. Multiple Random Runs

Method: Run K-Means multiple times with different random initializations and keep the best result.

Selection Criterion: Choose the clustering with lowest inertia (objective function value).

Trade-off: Increases computational cost but improves quality.

Comparing Initializations

For the same dataset:

Initialization        Avg Inertia   Avg Iterations   Consistency
Random                450.2         12.3             Low
K-Means++             398.7         8.1              High
Random (10 runs)      401.3         9.5              Medium

Choosing the Optimal K

One of the biggest challenges with K-Means: How do we choose K?

Unlike supervised learning (where we have validation accuracy), there’s no single “correct” number of clusters. We use heuristics and domain knowledge.

1. The Elbow Method

Idea: Plot inertia (WCSS) vs. K and look for an “elbow” where the rate of decrease sharply changes.

Mathematical Intuition: Inertia never increases as K grows, because an extra centroid can only reduce each point’s distance to its nearest centroid. The elbow marks the K beyond which additional clusters buy only marginal reductions.

Procedure:

  1. Run K-Means for K = 1, 2, 3, …, K_max
  2. Calculate inertia for each K
  3. Plot inertia vs. K
  4. Choose K at the “elbow” (point of maximum curvature)

Example:

Inertia
5000 | x
4000 |
3000 |    x
2000 |       x
1000 |          x
 800 |             x__x__x__x
   0 +-------------------------- K
       1  2  3  4  5  6  7  8

Elbow at K=4

Limitation: The elbow is often not clearly defined!
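A minimal sketch of the elbow method, using scikit-learn’s KMeans (assumed available) on illustrative blob data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate well-separated synthetic clusters (illustrative data)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Inertia always decreases with K; the elbow is where the drop flattens
for k, j in enumerate(inertias, start=1):
    print(f"K={k}: inertia={j:.1f}")
```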

2. Silhouette Score

Idea: Measure how similar each point is to its own cluster compared to other clusters.

For a single point $x^{(i)}$ in cluster $C_k$:

\[a(i) = \frac{1}{|C_k| - 1} \sum_{j \in C_k, j \neq i} d(i, j)\]

Average distance to points in own cluster (lower is better).

\[b(i) = \min_{k' \neq k} \frac{1}{|C_{k'}|} \sum_{j \in C_{k'}} d(i, j)\]

Average distance to points in nearest other cluster (higher is better).

Silhouette coefficient for point $i$:

\[s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}\]

Properties:

Average silhouette score for the entire clustering:

\[\bar{s} = \frac{1}{n} \sum_{i=1}^{n} s(i)\]

Usage: Choose K that maximizes average silhouette score.
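As a sketch (scikit-learn assumed available; the blob data is illustrative), we can pick K by maximizing the average silhouette score:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative data with 4 well-separated clusters
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

scores = {}
for k in range(2, 7):  # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # should recover the number of generated blobs
```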

3. Gap Statistic

Idea: Compare inertia of your clustering to that of random data.

\[\text{Gap}(K) = \mathbb{E}[\log(W_K^*)] - \log(W_K)\]

Where:

  • $W_K$ is the within-cluster sum of squares (inertia) of your clustering with $K$ clusters
  • $W_K^*$ is the inertia obtained by clustering reference data drawn uniformly over the same range
  • the expectation is taken over several such reference datasets

Choose K where the gap is largest (your clustering is much better than random).
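The gap statistic can be sketched in a few lines, drawing the reference data uniformly over the bounding box of the dataset. This uses scikit-learn’s KMeans (an assumption; any K-Means implementation works), and the two-blob data below is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def inertia(X, k):
    """Within-cluster sum of squares for a K-Means fit with k clusters."""
    return KMeans(n_clusters=k, n_init=5, random_state=0).fit(X).inertia_

def gap(X, k, n_refs=5):
    """Gap(K) = E[log W*_K] - log W_K, with uniform reference datasets."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    ref_logs = [np.log(inertia(rng.uniform(mins, maxs, size=X.shape), k))
                for _ in range(n_refs)]
    return np.mean(ref_logs) - np.log(inertia(X, k))

# Two well-separated blobs: the gap should peak at the true K = 2
X = np.vstack([rng.normal([0, 0], 0.3, size=(50, 2)),
               rng.normal([5, 5], 0.3, size=(50, 2))])
print(gap(X, 1), gap(X, 2))
```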

4. Domain Knowledge

Often the best approach: use domain expertise!

Examples:

  • T-shirt sizing: K = 3 (small, medium, large) is a natural choice
  • Customer tiers: a business may already plan for a fixed number of segments
  • Known categories: a catalog with 5 product lines suggests starting at K = 5

Practical Recommendations

  1. Start with elbow method for quick insight
  2. Validate with silhouette score for confirmation
  3. Try multiple K values and visualize results
  4. Use domain knowledge to guide final choice
  5. Consider interpretability: Fewer clusters are easier to understand and act upon

Convergence and Computational Complexity

Convergence Properties

Theorem: K-Means algorithm is guaranteed to converge.

Proof Sketch:

  1. Each assignment step decreases (or maintains) the objective function $J$
  2. Each update step decreases (or maintains) $J$
  3. There are finitely many possible assignments
  4. $J$ is bounded below by 0
  5. Therefore, the algorithm must converge in finite steps

However: K-Means converges to a local minimum, not necessarily the global minimum!

Number of Iterations

Typical: 10-20 iterations for convergence

Factors affecting convergence speed:

  • quality of initialization (K-Means++ usually converges faster)
  • how well-separated the clusters are
  • dataset size and dimensionality
  • the convergence tolerance $\epsilon$

Time Complexity

Per iteration:

  • Assignment step: $O(n \cdot K \cdot d)$ (distance from every point to every centroid)
  • Update step: $O(n \cdot d)$ (one pass to average the points in each cluster)

Total: $O(I \cdot n \cdot K \cdot d)$

Where:

  • $I$ = number of iterations until convergence
  • $n$ = number of data points
  • $K$ = number of clusters
  • $d$ = number of dimensions

Space Complexity: $O(n \cdot d + K \cdot d)$

Scalability Considerations

K-Means is relatively efficient, but challenges arise with:

  1. Large n (millions of points): Use mini-batch K-Means
  2. Large d (high dimensions): Apply dimensionality reduction first
  3. Large K (many clusters): Consider hierarchical approaches

Implementation from Scratch

Let’s implement K-Means from scratch to understand every detail.

import numpy as np

class KMeans:
    """
    K-Means clustering implementation from scratch.

    Parameters:
    -----------
    n_clusters : int, default=3
        The number of clusters to form
    max_iters : int, default=300
        Maximum number of iterations
    init : str, default='kmeans++'
        Initialization method ('random' or 'kmeans++')
    n_init : int, default=10
        Number of times to run algorithm with different initializations
    tol : float, default=1e-4
        Convergence tolerance
    random_state : int, default=None
        Random seed for reproducibility
    """

    def __init__(self, n_clusters=3, max_iters=300, init='kmeans++',
                 n_init=10, tol=1e-4, random_state=None):
        self.n_clusters = n_clusters
        self.max_iters = max_iters
        self.init = init
        self.n_init = n_init
        self.tol = tol
        self.random_state = random_state

        self.centroids = None
        self.labels_ = None
        self.inertia_ = None
        self.n_iter_ = None

    def fit(self, X):
        """
        Compute K-Means clustering.

        Parameters:
        -----------
        X : numpy.ndarray of shape (n_samples, n_features)
            Training data
        """
        X = np.array(X)

        if self.random_state is not None:
            np.random.seed(self.random_state)

        best_inertia = np.inf
        best_centroids = None
        best_labels = None
        best_iters = 0

        # Run multiple times with different initializations
        for _ in range(self.n_init):
            centroids, labels, inertia, n_iters = self._kmeans_single(X)

            if inertia < best_inertia:
                best_inertia = inertia
                best_centroids = centroids
                best_labels = labels
                best_iters = n_iters

        self.centroids = best_centroids
        self.labels_ = best_labels
        self.inertia_ = best_inertia
        self.n_iter_ = best_iters

        return self

    def _kmeans_single(self, X):
        """Run K-Means algorithm once."""
        # Initialize centroids
        if self.init == 'kmeans++':
            centroids = self._kmeans_plus_plus(X)
        else:
            centroids = self._random_init(X)

        # Iterate until convergence
        for iteration in range(self.max_iters):
            # Assignment step
            labels = self._assign_clusters(X, centroids)

            # Update step
            new_centroids = self._update_centroids(X, labels)

            # Check convergence
            centroid_shift = np.linalg.norm(new_centroids - centroids)
            if centroid_shift < self.tol:
                centroids = new_centroids
                break

            centroids = new_centroids

        # Re-assign with the final centroids, then compute inertia
        labels = self._assign_clusters(X, centroids)
        inertia = self._calculate_inertia(X, centroids, labels)

        return centroids, labels, inertia, iteration + 1

    def _random_init(self, X):
        """Random initialization: choose K random points as centroids."""
        n_samples = X.shape[0]
        indices = np.random.choice(n_samples, self.n_clusters, replace=False)
        return X[indices].copy()

    def _kmeans_plus_plus(self, X):
        """K-Means++ initialization."""
        n_samples, n_features = X.shape
        centroids = np.empty((self.n_clusters, n_features))

        # Choose first centroid randomly
        centroids[0] = X[np.random.randint(n_samples)]

        # Choose remaining centroids
        for k in range(1, self.n_clusters):
            # Calculate distances to nearest existing centroid
            distances = np.min([np.linalg.norm(X - c, axis=1)
                               for c in centroids[:k]], axis=0)

            # Square distances for probability calculation
            probabilities = distances ** 2
            probabilities /= probabilities.sum()

            # Choose next centroid with probability proportional to distance²
            next_centroid_idx = np.random.choice(n_samples, p=probabilities)
            centroids[k] = X[next_centroid_idx]

        return centroids

    def _assign_clusters(self, X, centroids):
        """Assign each point to nearest centroid."""
        distances = np.array([np.linalg.norm(X - c, axis=1)
                             for c in centroids])
        return np.argmin(distances, axis=0)

    def _update_centroids(self, X, labels):
        """Update each centroid to the mean of its points; reseed empty clusters."""
        centroids = np.empty((self.n_clusters, X.shape[1]))
        for k in range(self.n_clusters):
            members = X[labels == k]
            # An empty cluster would yield NaN; reseed it with a random point
            centroids[k] = (members.mean(axis=0) if len(members) > 0
                            else X[np.random.randint(X.shape[0])])
        return centroids

    def _calculate_inertia(self, X, centroids, labels):
        """Calculate within-cluster sum of squares."""
        inertia = 0
        for k in range(self.n_clusters):
            cluster_points = X[labels == k]
            if len(cluster_points) > 0:
                inertia += np.sum((cluster_points - centroids[k]) ** 2)
        return inertia

    def predict(self, X):
        """
        Predict cluster labels for new data.

        Parameters:
        -----------
        X : numpy.ndarray of shape (n_samples, n_features)
            New data to predict

        Returns:
        --------
        labels : numpy.ndarray of shape (n_samples,)
            Cluster labels
        """
        X = np.array(X)
        return self._assign_clusters(X, self.centroids)

    def fit_predict(self, X):
        """Fit and return cluster labels."""
        self.fit(X)
        return self.labels_


# Example usage
if __name__ == "__main__":
    from sklearn.datasets import make_blobs

    # Generate synthetic data
    X, y_true = make_blobs(n_samples=300, centers=4, n_features=2,
                          cluster_std=0.60, random_state=42)

    # Fit K-Means
    kmeans = KMeans(n_clusters=4, init='kmeans++', random_state=42)
    kmeans.fit(X)

    print(f"Converged in {kmeans.n_iter_} iterations")
    print(f"Final inertia: {kmeans.inertia_:.2f}")
    print(f"Cluster sizes: {np.bincount(kmeans.labels_)}")

Evaluating Clustering Quality

Internal Evaluation (No Ground Truth)

When we don’t have true labels, we use internal metrics:

1. Inertia (Within-Cluster Sum of Squares)

\[\text{Inertia} = \sum_{k=1}^{K} \sum_{x \in C_k} \|x - \mu_k\|^2\]

Lower is better, but inertia always decreases as K grows, so compare it only between clusterings with the same K.

2. Silhouette Score

Already discussed above. Range: [-1, 1], higher is better.

3. Davies-Bouldin Index

\[DB = \frac{1}{K} \sum_{k=1}^{K} \max_{k' \neq k} \frac{\sigma_k + \sigma_{k'}}{d(c_k, c_{k'})}\]

Where:

  • $\sigma_k$ is the average distance of points in cluster $k$ to its centroid $c_k$
  • $d(c_k, c_{k'})$ is the distance between centroids $c_k$ and $c_{k'}$

Lower is better (indicates better separation).

4. Calinski-Harabasz Index

\[CH = \frac{SS_B / (K-1)}{SS_W / (n-K)}\]

Where:

  • $SS_B$ is the between-cluster dispersion (how spread out the centroids are around the overall mean)
  • $SS_W$ is the within-cluster dispersion (inertia)

Higher is better (more separation, less variance within clusters).

External Evaluation (With Ground Truth)

When we have true labels (for validation):

1. Adjusted Rand Index (ARI)

Measures similarity between the predicted and true clusterings, adjusted for chance. A score of 1 means identical clusterings; around 0 means chance-level agreement.

2. Normalized Mutual Information (NMI)

Measures the mutual information between predicted and true labels, normalized to [0, 1]; 1 means perfect agreement.
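All of these metrics are available in scikit-learn (assumed installed); here is a quick demonstration on illustrative blob data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, calinski_harabasz_score,
                             davies_bouldin_score,
                             normalized_mutual_info_score, silhouette_score)

# Synthetic data where the true labels are known (illustrative)
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60,
                       random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Internal metrics (no ground truth needed)
print(f"Silhouette:        {silhouette_score(X, labels):.3f}")      # higher is better
print(f"Davies-Bouldin:    {davies_bouldin_score(X, labels):.3f}")  # lower is better
print(f"Calinski-Harabasz: {calinski_harabasz_score(X, labels):.1f}")  # higher is better

# External metrics (compare against the true labels)
print(f"ARI: {adjusted_rand_score(y_true, labels):.3f}")
print(f"NMI: {normalized_mutual_info_score(y_true, labels):.3f}")
```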

Advantages and Limitations

Advantages

  1. Simple and Intuitive: Easy to understand and implement
  2. Efficient: Each iteration costs $O(n \cdot K \cdot d)$, linear in the number of data points
  3. Scalable: Works well with large datasets
  4. Guaranteed Convergence: Always converges to a local minimum
  5. Versatile: Works for many types of data
  6. Well-Established: Extensive research and implementations available

Limitations

  1. Must Choose K: Number of clusters must be specified beforehand
  2. Local Minima: Sensitive to initialization (use K-Means++)
  3. Assumes Spherical Clusters: Works best with convex, isotropic clusters
  4. Equal Size Bias: Tends to create equal-sized clusters
  5. Outlier Sensitive: Outliers can distort centroids
  6. Fixed Cluster Shapes: Cannot handle non-spherical clusters well
  7. Distance Metric: Only works with Euclidean distance (standard implementation)

When K-Means Fails

Problem cases:

  1. Non-convex clusters (e.g., moon shapes): Use DBSCAN or spectral clustering
  2. Varying densities: Use DBSCAN
  3. Varying sizes: Use Gaussian Mixture Models
  4. Nested clusters: Use hierarchical clustering

Comparison with Other Algorithms

Aspect          K-Means     Hierarchical   DBSCAN     GMM
Choose K        Yes         No             No         Yes
Cluster Shape   Spherical   Any            Any        Elliptical
Scalability     Excellent   Poor           Good       Medium
Outliers        Sensitive   Sensitive      Robust     Medium
Deterministic   No          Yes            Yes        No

Variants and Extensions

1. Mini-Batch K-Means

Problem: Standard K-Means slow for massive datasets

Solution: Use random subsets (mini-batches) of data in each iteration

Advantages:

  • Much faster on large datasets
  • Lower memory footprint
  • Supports online (streaming) updates

Trade-off: Slightly worse clustering quality
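As a sketch with scikit-learn (assumed available), mini-batch and full K-Means can be compared directly; the data below is illustrative:

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# A larger synthetic dataset (illustrative)
X, _ = make_blobs(n_samples=10_000, centers=5, random_state=0)

full = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
mini = MiniBatchKMeans(n_clusters=5, batch_size=256, n_init=10,
                       random_state=0).fit(X)

# Mini-batch trades a small amount of inertia for much cheaper updates
print(f"full inertia: {full.inertia_:.0f}, mini-batch inertia: {mini.inertia_:.0f}")
```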

2. K-Medoids (PAM)

Problem: K-Means sensitive to outliers (mean affected by extreme values)

Solution: Use actual data points (medoids) as cluster centers instead of means

Advantages:

  • Robust to outliers, since medoids are actual data points rather than means
  • Works with arbitrary distance (dissimilarity) metrics

Disadvantages:

  • Much more expensive: classic PAM costs roughly $O(K(n-K)^2)$ per iteration

3. Fuzzy C-Means

Problem: Hard assignment—each point belongs to exactly one cluster

Solution: Soft (fuzzy) assignment—each point has membership degree to each cluster

Membership function:

\[u_{ik} = \frac{1}{\sum_{j=1}^{K} \left(\frac{d(x_i, c_k)}{d(x_i, c_j)}\right)^{2/(m-1)}}\]

Where $m > 1$ is the fuzziness parameter.
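The membership formula above can be sketched directly in NumPy; the function name and the toy points are illustrative assumptions:

```python
import numpy as np

def fuzzy_memberships(X, centers, m=2.0):
    """u[i, k]: degree to which point i belongs to cluster k (rows sum to 1)."""
    # d[i, k] = distance from point i to center k
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)  # guard against division by zero
    # ratio[i, k, j] = d[i, k] / d[i, j], summed over j per the formula
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

X = np.array([[0.0, 0.0], [5.0, 5.0], [2.5, 2.5]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
u = fuzzy_memberships(X, centers)
print(np.round(u, 2))  # the midpoint [2.5, 2.5] gets membership 0.5 in each cluster
```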

4. K-Means with Different Distance Metrics

Standard K-Means uses Euclidean distance, but variants exist:

5. Hierarchical K-Means

Combine K-Means with hierarchical clustering:

  1. Run K-Means to get initial clusters
  2. Recursively apply K-Means within each cluster
  3. Build tree structure

Useful for: Multi-level clustering, topic modeling

Conclusion

We have now covered K-Means clustering end to end: the objective it minimizes, why the mean is the optimal centroid, how K-Means++ initialization and multiple restarts mitigate local minima, how to choose K, and how to evaluate the result. With the from-scratch implementation above, you are well equipped to apply K-Means to real-world problems, and to recognize when a variant such as DBSCAN or a Gaussian Mixture Model is a better fit.
