Machine Learning Series

by Mayank Sharma

Support Vector Machines: Maximizing the Margin

Dec 10, 2025

Continuing our series on machine learning, today we delve into the fascinating world of Support Vector Machines (SVMs). Imagine you’re a city planner tasked with building a border wall between two countries. You want to place it as far as possible from both sides to create a neutral buffer zone, maximizing the distance to the nearest towns on either side. This is precisely what an SVM does: it finds the decision boundary that’s as far as possible from the closest data points of each class. This “maximum margin” principle makes SVMs one of the most elegant and powerful classification algorithms in machine learning.

Table of Contents

  1. Introduction: The Geometry of Classification
  2. Linear Separability and Hyperplanes
  3. Maximum Margin Classifier
  4. Support Vectors: The Critical Points
  5. Hard Margin vs Soft Margin SVM
  6. The Optimization Problem
  7. Lagrange Multipliers and the Dual Problem
  8. The Kernel Trick
  9. Popular Kernel Functions
  10. Non-linear Classification with Kernels
  11. Multi-class SVM
  12. Support Vector Regression (SVR)
  13. Conclusion

Introduction: The Geometry of Classification

The Quest for the Best Boundary

Consider a binary classification problem: separating apples from oranges based on weight and color. Many different lines (or hyperplanes in higher dimensions) could separate these classes. But which one is best?

Naive approach: Any line that separates the classes works.

SVM approach: Find the line with the maximum margin - the largest possible distance to the nearest points from both classes.

Why maximum margin? A wider buffer means small perturbations of unseen points are less likely to push them across the boundary, so the classifier generalizes better.

What Makes SVMs Special?

Support Vector Machines excel because they:

  1. Maximize the margin → Better generalization
  2. Use kernel trick → Handle non-linear boundaries without explicit feature engineering
  3. Have theoretical guarantees → Grounded in statistical learning theory
  4. Work in high dimensions → Effective even when features ≫ samples
  5. Are memory efficient → Only store support vectors (not all training data)
  6. Provide unique solution → Convex optimization (no local minima)

Linear Separability and Hyperplanes

What is a Hyperplane?

In $d$-dimensional space, a hyperplane is a $(d-1)$-dimensional subspace.

Examples:

  • In 2D, a hyperplane is a line
  • In 3D, a hyperplane is a plane
  • In $d$ dimensions, it is a flat $(d-1)$-dimensional surface

Mathematical Definition:

A hyperplane in $\mathbb{R}^d$ is defined by:

\[\mathbf{w}^T \mathbf{x} + b = 0\]

Where:

  • $\mathbf{w}$ is the weight vector, normal (perpendicular) to the hyperplane
  • $b$ is the bias term, shifting the hyperplane away from the origin
  • $\mathbf{x}$ is a point in $\mathbb{R}^d$

2D Example:

\[w_1 x_1 + w_2 x_2 + b = 0\]

This is a line with slope $-w_1/w_2$ and intercept $-b/w_2$ (solve for $x_2$).

Decision Function

For classification, we use the signed distance from a point to the hyperplane:

\[f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b\]

Decision rule:

  • $f(\mathbf{x}) > 0$ → predict class $+1$
  • $f(\mathbf{x}) < 0$ → predict class $-1$
  • $f(\mathbf{x}) = 0$ → the point lies exactly on the boundary

Linear Separability

A dataset is linearly separable if there exists a hyperplane that perfectly separates the two classes.

Example (Linearly Separable):

Class +1: (2,3), (3,3), (4,4)
Class -1: (1,1), (2,1), (1,2)

Can find: $w_1 x_1 + w_2 x_2 + b = 0$ that separates them

Example (Not Linearly Separable):

XOR problem:
Class +1: (0,0), (1,1)
Class -1: (0,1), (1,0)

No line can separate these!

For non-linearly separable data, we’ll use the kernel trick later.
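To make linear separability concrete, here is a minimal sketch (assuming scikit-learn and NumPy are available; not code from the article itself) that fits a near-hard-margin linear SVM to the separable toy points above:

```python
# Fit a linear SVM on the toy points from the text to confirm
# they are linearly separable.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2, 3], [3, 3], [4, 4],   # class +1
              [1, 1], [2, 1], [1, 2]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)  # very large C approximates a hard margin
clf.fit(X, y)
print(clf.score(X, y))  # 1.0 -> a separating line exists
```

A perfect training score here is exactly the statement that some line $w_1 x_1 + w_2 x_2 + b = 0$ separates the two classes.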

Maximum Margin Classifier

The Margin

The margin is the perpendicular distance from the decision boundary to the nearest data point.

Formal definition:

For a hyperplane $\mathbf{w}^T \mathbf{x} + b = 0$, the distance from a point $\mathbf{x}_i$ to the hyperplane is:

\[\text{distance} = \frac{|f(\mathbf{x}_i)|}{\lVert\mathbf{w}\rVert} = \frac{|\mathbf{w}^T \mathbf{x}_i + b|}{\lVert\mathbf{w}\rVert}\]

Where $\lVert\mathbf{w}\rVert = \sqrt{w_1^2 + w_2^2 + \cdots + w_d^2}$ is the Euclidean norm.

For correctly classified points with labels $y_i \in \{-1, +1\}$:

\[\text{margin} = \frac{y_i(\mathbf{w}^T \mathbf{x}_i + b)}{\lVert\mathbf{w}\rVert}\]

The $y_i$ ensures the margin is always positive for correct classifications.
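The distance formula is easy to check numerically; a small sketch with made-up values for $\mathbf{w}$, $b$, and a point:

```python
# Numeric check of the distance formula |w.x + b| / ||w||
# (illustrative values, not from the article).
import numpy as np

w = np.array([3.0, 4.0])   # normal vector, ||w|| = 5
b = -5.0
x = np.array([1.0, 2.0])

dist = abs(w @ x + b) / np.linalg.norm(w)
print(dist)  # |3*1 + 4*2 - 5| / 5 = 6/5 = 1.2
```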

Maximum Margin Hyperplane

The maximum margin hyperplane is the one that maximizes the smallest margin over all training points.

Geometric Margin:

\[\gamma = \min_{i=1,\ldots,n} \frac{y_i(\mathbf{w}^T \mathbf{x}_i + b)}{\lVert\mathbf{w}\rVert}\]

Goal: Find $\mathbf{w}$ and $b$ that maximize $\gamma$.

Canonical Representation

We can scale $\mathbf{w}$ and $b$ arbitrarily without changing the hyperplane. To make the optimization unique, we use the canonical form:

\[y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1 \quad \forall i\]

With equality for the support vectors (closest points):

\[y_i(\mathbf{w}^T \mathbf{x}_i + b) = 1\]

In this representation, the margin is:

\[\gamma = \frac{1}{\lVert\mathbf{w}\rVert}\]

Width of margin: $\frac{2}{\lVert\mathbf{w}\rVert}$ (distance between the two parallel hyperplanes $\mathbf{w}^T\mathbf{x} + b = \pm 1$)
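Under the canonical representation, the margin width can be read directly off a fitted model; a sketch on hypothetical toy data where, by construction, the separating line is $x_1 = 1$ and the true width is 2:

```python
# Recover the margin width 2/||w|| from a fitted linear SVC.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.0, 1.0],   # class -1 at x1 = 0
              [2.0, 0.0], [2.0, 1.0]])  # class +1 at x1 = 2
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
w = clf.coef_[0]
margin_width = 2.0 / np.linalg.norm(w)
print(margin_width)  # boundary is x1 = 1, so the width is ~2.0
```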

Visualization in 2D

        Class -1         H-   H    H+         Class +1
                          |    |    |
           x              |    |    |              o
              x       [SV]|    |    |[SV]       o
           x              |    |    |              o
                          |    |    |
                          <---------->
                       margin = 2/||w||

Support Vectors: The Critical Points

What are Support Vectors?

Support vectors are the data points that lie exactly on the margin boundaries.

Mathematically, support vectors satisfy:

\[y_i(\mathbf{w}^T \mathbf{x}_i + b) = 1\]

Key properties to remember:

  1. Only support vectors matter: Removing non-support vectors doesn’t change the solution
  2. Typically few: Often only 5-10% of training data
  3. Define the decision boundary: The hyperplane depends only on support vectors
  4. Memory efficient: Only need to store support vectors for prediction

Why “Support” Vectors?

They “support” or define the maximum margin hyperplane like pillars supporting a roof.

Analogy: Imagine pushing two groups of points apart with parallel walls. The support vectors are the points that touch the walls and resist further separation.

Mathematical Insight

In the dual formulation (discussed later), the decision function becomes:

\[f(\mathbf{x}) = \sum_{i \in SV} \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b\]

Where:

  • $\alpha_i > 0$ are the dual coefficients of the support vectors
  • $SV$ denotes the set of support-vector indices
  • every non-support vector has $\alpha_i = 0$

This shows that only support vectors contribute to predictions!
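scikit-learn exposes the support vectors of a fitted model directly; an illustrative sketch (the attribute names are scikit-learn's, the data is made up):

```python
# Inspect which training points become support vectors.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2, 3], [3, 3], [4, 4], [1, 1], [2, 1], [1, 2]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
print(clf.support_)          # indices of the support vectors
print(clf.support_vectors_)  # their coordinates
print(clf.n_support_)        # count per class
```

Deleting any training point whose index is not in `support_` and refitting would leave the decision boundary unchanged.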

Hard Margin vs Soft Margin SVM

Hard Margin SVM

Assumption: Data is perfectly linearly separable.

Constraints: All points must be on the correct side with margin $\geq 1$:

\[y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1 \quad \forall i\]

Problem: Very restrictive! If the data is not perfectly separable (noise, outliers, overlapping classes), no feasible solution exists, and even a single outlier can drastically shrink the margin.

Soft Margin SVM

Idea: Allow some points to violate the margin or even be misclassified.

Introduce slack variables $\xi_i \geq 0$ for each training point:

\[y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1 - \xi_i\]

Interpretation:

  • $\xi_i = 0$: correctly classified, outside the margin
  • $0 < \xi_i \leq 1$: inside the margin, but still on the correct side
  • $\xi_i > 1$: misclassified

Penalty: Pay cost $C \sum_i \xi_i$ for violations.

The C Parameter

Regularization parameter $C$ controls the trade-off:

  • Large $C$: violations are expensive → narrow margin that fits the training data tightly (risk of overfitting)
  • Small $C$: violations are cheap → wide margin that tolerates misclassifications (risk of underfitting)

Extreme cases:

  • $C \to \infty$: recovers the hard margin SVM
  • $C \to 0$: the margin term dominates and the data is essentially ignored

Typical values: $C \in [0.01, 100]$ (often found via cross-validation)
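One visible effect of $C$ is the number of support vectors: with a loose margin (small $C$), many points sit inside the margin and become support vectors. A sketch on synthetic overlapping data (all values illustrative):

```python
# How C changes the number of support vectors on noisy data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)),   # class -1 cloud
               rng.normal(+1.0, 1.0, (50, 2))])  # class +1 cloud
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # smaller C -> wider margin -> more points end up as support vectors
    print(C, len(clf.support_))
```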

The Optimization Problem

Primal Formulation (Soft Margin)

Objective: Maximize margin while minimizing violations.

\[\min_{\mathbf{w}, b, \xi} \frac{1}{2}||\mathbf{w}||^2 + C \sum_{i=1}^{n} \xi_i\]

Subject to:

\(y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1 - \xi_i \quad \forall i\) \(\xi_i \geq 0 \quad \forall i\)

Breakdown:

  1. $\frac{1}{2}\lVert\mathbf{w}\rVert^2$: Maximize margin (minimize $\lVert\mathbf{w}\rVert$)
    • Factor $\frac{1}{2}$ for mathematical convenience
  2. $C \sum_i \xi_i$: Penalty for margin violations
    • Larger $C$ → heavier penalty
  3. Constraints: Ensure correct classification up to slack

Why $\lVert\mathbf{w}\rVert^2$ instead of $\lVert\mathbf{w}\rVert$? The squared norm is differentiable everywhere and turns the objective into a convex quadratic; minimizing either yields the same hyperplane.

Quadratic Programming

This is a convex quadratic program with linear constraints:

Standard form:

\[\min_{\mathbf{x}} \frac{1}{2}\mathbf{x}^T Q \mathbf{x} + \mathbf{c}^T \mathbf{x}\]

Subject to: $A\mathbf{x} \leq \mathbf{b}$

Properties:

  • Convex: any local minimum is the global minimum
  • Solvable with off-the-shelf QP solvers, or specialized algorithms such as SMO
  • The optimal $\mathbf{w}$ is unique

Lagrange Multipliers and the Dual Problem

Why the Dual Formulation?

Primal problem: Optimize over $\mathbf{w}, b, \xi$ (dimension = $d + 1 + n$)

Dual problem: Optimize over $\alpha$ (dimension = $n$)

Advantages of dual:

  1. Kernel trick becomes possible (only need inner products $\mathbf{x}_i^T \mathbf{x}_j$)
  2. Fewer variables when $n < d$ (common in high-dimensional problems)
  3. Explicit identification of support vectors ($\alpha_i > 0$)

Lagrangian Formulation

Introduce Lagrange multipliers $\alpha_i \geq 0$ for each constraint:

\[L(\mathbf{w}, b, \xi, \alpha, \beta) = \frac{1}{2}||\mathbf{w}||^2 + C\sum_i \xi_i - \sum_i \alpha_i[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i] - \sum_i \beta_i \xi_i\]

Where $\alpha_i, \beta_i \geq 0$ are Lagrange multipliers.

KKT Conditions

At the optimum, the Karush-Kuhn-Tucker (KKT) conditions must hold:

Stationarity:

\[\frac{\partial L}{\partial \mathbf{w}} = 0 \implies \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i\] \[\frac{\partial L}{\partial b} = 0 \implies \sum_i \alpha_i y_i = 0\] \[\frac{\partial L}{\partial \xi_i} = 0 \implies C - \alpha_i - \beta_i = 0\]

Complementary slackness:

\[\alpha_i[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i] = 0\]

Feasibility: All constraints satisfied

Dual Formulation

After substituting the stationarity conditions, we get the dual problem:

\[\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j\]

Subject to:

\(0 \leq \alpha_i \leq C \quad \forall i\) \(\sum_{i=1}^{n} \alpha_i y_i = 0\)

Key observation: Only depends on inner products $\mathbf{x}_i^T \mathbf{x}_j$!

This is where the kernel trick comes in.

Recovering the Solution

Once we solve for $\alpha^*$:

Weight vector: \(\mathbf{w}^* = \sum_{i=1}^{n} \alpha_i^* y_i \mathbf{x}_i\)

Bias (using any support vector with $0 < \alpha_i < C$): \(b^* = y_i - \mathbf{w}^{*T} \mathbf{x}_i\)

Decision function: \(f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i^* y_i \mathbf{x}_i^T \mathbf{x} + b^*\)
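The recovery formula $\mathbf{w}^* = \sum_i \alpha_i^* y_i \mathbf{x}_i$ can be verified against a library solver; in scikit-learn, `dual_coef_` already stores $\alpha_i y_i$ for the support vectors (a sketch on made-up data):

```python
# Verify w* = sum_i alpha_i y_i x_i using scikit-learn's dual variables.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2, 3], [3, 3], [4, 4], [1, 1], [2, 1], [1, 2]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
# dual_coef_ has shape (1, n_SV) and holds alpha_i * y_i
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))  # True
```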

The Kernel Trick

Motivation: Non-linear Data

Many real-world datasets are not linearly separable in their original space.

Example: XOR Problem

Original space (2D):
  (+)   (-)
  (-)   (+)

Not linearly separable!

Idea: Map to higher-dimensional space where data becomes linearly separable.

Feature Mapping

Feature map $\phi: \mathbb{R}^d \to \mathbb{R}^D$ transforms features:

\[\mathbf{x} \mapsto \phi(\mathbf{x})\]

Example ($d=2$ to $D=5$):

\[\phi(x_1, x_2) = (x_1, x_2, x_1^2, x_2^2, x_1 x_2)\]

In the transformed space, we solve:

\[f(\mathbf{x}) = \mathbf{w}^T \phi(\mathbf{x}) + b\]

Problem: If $D$ is very large (or infinite!), computing $\phi(\mathbf{x})$ explicitly is expensive or impossible.

The Kernel Trick Solution

Key Insight: We never need $\phi(\mathbf{x})$ explicitly! Only need inner products $\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$.

Kernel function:

\[K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)\]

We can compute $K(\mathbf{x}_i, \mathbf{x}_j)$ directly without computing $\phi$!

Dual formulation with kernel:

\[\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)\]

Decision function with kernel:

\[f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b\]

Magic: Work in infinite-dimensional space with finite computation!

Example: Polynomial Kernel

Kernel: $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T \mathbf{z} + c)^2$

For $\mathbf{x} = (x_1, x_2)$ and $\mathbf{z} = (z_1, z_2)$:

\[K(\mathbf{x}, \mathbf{z}) = (x_1 z_1 + x_2 z_2 + c)^2\]

Expanding:

\[= x_1^2 z_1^2 + x_2^2 z_2^2 + 2x_1x_2z_1z_2 + 2cx_1z_1 + 2cx_2z_2 + c^2\]

This corresponds to feature map:

\[\phi(\mathbf{x}) = (x_1^2, x_2^2, \sqrt{2}x_1x_2, \sqrt{2c}x_1, \sqrt{2c}x_2, c)\]

Computing $\phi$ explicitly: 6 operations
Computing $K$ directly: 3 operations (much faster!)
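The equivalence $K(\mathbf{x}, \mathbf{z}) = \phi(\mathbf{x})^T\phi(\mathbf{z})$ is easy to confirm numerically; a sketch with the feature map above and $c = 1$ chosen for concreteness:

```python
# Check that (x.z + c)^2 equals phi(x).phi(z) for the feature map above.
import numpy as np

def phi(v, c=1.0):
    x1, x2 = v
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
c = 1.0

K_direct = (x @ z + c) ** 2       # kernel evaluated directly
K_via_phi = phi(x, c) @ phi(z, c)  # inner product in feature space
print(np.isclose(K_direct, K_via_phi))  # True
```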

Popular Kernel Functions

1. Linear Kernel

\[K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^T \mathbf{z}\]

Use when:

  • The data is (approximately) linearly separable
  • Features vastly outnumber samples (e.g. text classification)
  • Training speed and interpretability matter

Equivalent to: No kernel (standard linear SVM)

2. Polynomial Kernel

\[K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T \mathbf{z} + c)^d\]

Parameters:

  • $d$: polynomial degree
  • $c \geq 0$: trades off the influence of higher-order vs lower-order terms

Use when:

  • Feature interactions up to a known degree matter
  • A polynomial decision boundary is plausible

Example ($d=2$): Captures pairwise feature interactions

3. Radial Basis Function (RBF) / Gaussian Kernel

\[K(\mathbf{x}, \mathbf{z}) = \exp\left(-\gamma ||\mathbf{x} - \mathbf{z}||^2\right)\]

Parameter: $\gamma > 0$ (often $\gamma = \frac{1}{2\sigma^2}$)

Properties:

  • Corresponds to an infinite-dimensional feature space
  • Values lie in $(0, 1]$: near 1 for nearby points, near 0 for distant ones
  • $\gamma$ controls locality: large $\gamma$ → tight, wiggly boundaries; small $\gamma$ → smooth boundaries

Use when:

  • You have no strong prior about the boundary shape
  • A flexible, non-linear boundary is needed

Most popular kernel in practice!

4. Sigmoid Kernel

\[K(\mathbf{x}, \mathbf{z}) = \tanh(\alpha \mathbf{x}^T \mathbf{z} + c)\]

Parameters:

  • $\alpha$: scaling of the inner product
  • $c$: offset

Use when:

  • You want behaviour reminiscent of a single-layer neural network (mostly of historical interest)

Note: Not always positive semi-definite (may not be valid kernel)

Choosing a Kernel

Decision flow:

  1. Start with RBF: Usually best for small-medium datasets
  2. Try linear if:
    • High dimensions (features ≫ samples)
    • Text data
    • RBF is too slow
  3. Try polynomial if:
    • Know polynomial relationship exists
    • RBF overfits but linear underfits
  4. Custom kernel if domain knowledge suggests specific similarity measure

Non-linear Classification with Kernels

XOR Problem Solved

Original space (2D): Not linearly separable

X = [[0,0], [0,1], [1,0], [1,1]]
y = [  -1,    +1,    +1,    -1 ]

With RBF kernel: Becomes separable!

The kernel implicitly maps to a high-dimensional space where a hyperplane can separate the classes.
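This is straightforward to demonstrate (a sketch using scikit-learn; the gamma and C values are illustrative choices, not from the article):

```python
# XOR is not linearly separable, but an RBF-kernel SVM fits it perfectly.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, 1, 1, -1])

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=2.0, C=100.0).fit(X, y)

print(linear.score(X, y))  # < 1.0: no line separates XOR
print(rbf.score(X, y))     # 1.0: separable in the implicit feature space
```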

Concentric Circles Example

Problem: Inner circle = class +1, outer ring = class -1

Linear kernel: Cannot separate (no straight line works)

RBF kernel: Easily creates circular decision boundary

Mathematical insight:

The RBF kernel measures similarity based on Euclidean distance: $K(\mathbf{x}, \mathbf{z})$ approaches 1 as the points coincide and decays toward 0 as they move apart.

This allows creating complex, localized decision regions.
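The concentric-circles case can be reproduced with scikit-learn's `make_circles` helper (a sketch; the noise and radius settings are illustrative):

```python
# Concentric circles: a linear kernel fails, an RBF kernel succeeds.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)  # default gamma="scale"

print(linear.score(X, y))  # near 0.5: no straight line separates the rings
print(rbf.score(X, y))     # near 1.0: a circular boundary emerges
```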

Decision Boundary Visualization

For 2D data with RBF kernel:

\[f(x_1, x_2) = \sum_{i \in SV} \alpha_i y_i \exp(-\gamma ||(x_1, x_2) - (x_{i1}, x_{i2})||^2) + b\]

Decision boundary: Contour where $f(x_1, x_2) = 0$

The boundary can be highly non-linear: closed loops, disjoint islands, and curved regions are all possible, since each support vector contributes a localized Gaussian bump.

Multi-class SVM

The Problem

SVM is inherently a binary classifier (+1 vs -1).

Challenge: Extend to $K > 2$ classes.

Strategy 1: One-vs-Rest (OvR)

Approach: Train $K$ binary classifiers.

For class $k$: train a binary SVM treating class $k$ as $+1$ and all remaining classes as $-1$, giving decision function $f_k(\mathbf{x})$.

Prediction: Choose class with highest decision function value

\[\hat{y} = \arg\max_{k} f_k(\mathbf{x})\]

Advantages:

  • Only $K$ classifiers to train
  • Simple to implement and easy to parallelize

Disadvantages:

  • Each binary problem is heavily class-imbalanced (one class vs all the rest)
  • Decision values from different classifiers are not calibrated against each other

Strategy 2: One-vs-One (OvO)

Approach: Train $\binom{K}{2} = \frac{K(K-1)}{2}$ binary classifiers.

For each pair $(i, j)$: train a binary SVM using only the training points of classes $i$ and $j$.

Prediction: Majority voting

Each classifier votes for one class, choose class with most votes.

Advantages:

  • Each sub-problem sees only two classes → smaller, faster individual fits
  • Sub-problems are more balanced than in OvR

Disadvantages:

  • $\frac{K(K-1)}{2}$ classifiers to train and store
  • Voting ties are possible and need a tie-breaking rule

Strategy 3: Crammer-Singer (Direct Multi-class)

Approach: Solve single optimization problem for all classes simultaneously.

\[\min_{\mathbf{w}_1, \ldots, \mathbf{w}_K} \frac{1}{2}\sum_{k=1}^{K}||\mathbf{w}_k||^2 + C\sum_{i=1}^{n}\xi_i\]

Subject to: $\mathbf{w}_{y_i}^T \mathbf{x}_i - \mathbf{w}_k^T \mathbf{x}_i \geq 1 - \xi_i$ for all $k \neq y_i$

Advantages:

  • A single, principled optimization problem
  • Directly enforces a margin between the true class and every other class

Disadvantages:

  • A larger, more expensive optimization
  • Rarely more accurate than OvR or OvO in practice

sklearn default: One-vs-One for SVC
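The classifier counts for OvR and OvO can be seen directly with scikit-learn's meta-estimators (a sketch on synthetic 4-class data):

```python
# Count the binary sub-problems for OvR vs OvO with K = 4 classes.
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = make_blobs(n_samples=200, centers=4, random_state=0)

ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)

print(len(ovr.estimators_))  # 4: one classifier per class
print(len(ovo.estimators_))  # 6: K(K-1)/2 pairwise classifiers
```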

Support Vector Regression (SVR)

The Regression Problem

Goal: Predict continuous values $y \in \mathbb{R}$.

SVM approach: Instead of maximizing margin between classes, create a tube around predictions.

$\epsilon$-insensitive Loss

Key idea: Don’t penalize predictions within $\epsilon$ of true value.

Loss function:

\[L_\epsilon(y, f(\mathbf{x})) = \begin{cases} 0 & \text{if } |y - f(\mathbf{x})| \leq \epsilon \\ |y - f(\mathbf{x})| - \epsilon & \text{otherwise} \end{cases}\]

Visualization:

        |
  y + ε |_________ No penalty zone (tube)
        |
    y   |* * * * * * Prediction
        |
  y - ε |_________ No penalty zone (tube)
        |

Points inside the tube contribute nothing to the loss!
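The loss function above is a one-liner; a pure-NumPy sketch evaluated at a few residuals:

```python
# The epsilon-insensitive loss from the formula above.
import numpy as np

def eps_insensitive(y_true, y_pred, eps=0.5):
    # zero inside the tube, linear penalty beyond it
    return np.maximum(np.abs(y_true - y_pred) - eps, 0.0)

y_true = np.array([1.0, 1.0, 1.0])
y_pred = np.array([1.2, 1.5, 2.0])  # errors 0.2, 0.5, 1.0
print(eps_insensitive(y_true, y_pred))  # [0.  0.  0.5]
```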

SVR Optimization Problem

\[\min_{\mathbf{w}, b, \xi, \xi^*} \frac{1}{2}||\mathbf{w}||^2 + C\sum_{i=1}^{n}(\xi_i + \xi_i^*)\]

Subject to:

\(y_i - (\mathbf{w}^T\mathbf{x}_i + b) \leq \epsilon + \xi_i\) \((\mathbf{w}^T\mathbf{x}_i + b) - y_i \leq \epsilon + \xi_i^*\) \(\xi_i, \xi_i^* \geq 0\)

Slack variables:

  • $\xi_i$: violation when the point lies above the tube ($y_i - f(\mathbf{x}_i) > \epsilon$)
  • $\xi_i^*$: violation when the point lies below the tube ($f(\mathbf{x}_i) - y_i > \epsilon$)

Parameters

$\epsilon$: Width of insensitive tube

$C$: Penalty for violations (same as classification)

Kernel: Can use any kernel (RBF, polynomial, etc.)

Prediction

\[f(\mathbf{x}) = \sum_{i \in SV} (\alpha_i - \alpha_i^*) K(\mathbf{x}_i, \mathbf{x}) + b\]

Where $\alpha_i, \alpha_i^*$ are dual variables.
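Putting it together, a minimal SVR sketch on a noisy sine curve (scikit-learn assumed; the C and epsilon values are illustrative):

```python
# Fit an RBF-kernel SVR to a noisy sine curve.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2 * np.pi, 100)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 100)

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print(svr.score(X, y))  # R^2 close to 1 on this smooth target
```

Points falling inside the epsilon-tube never become support vectors, so widening `epsilon` shrinks the model.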

Conclusion

Support Vector Machines represent a beautiful marriage of:

  • Geometric intuition: maximize the margin
  • Convex optimization: a unique, global solution
  • Kernel methods: non-linearity without explicit feature maps
  • Statistical learning theory: generalization guarantees

While deep learning has taken over image and NLP tasks, SVMs remain a powerful tool for specific scenarios where their strengths shine:

  • Small-to-medium tabular datasets
  • High-dimensional problems where features outnumber samples (text, genomics)
  • Settings that demand a reproducible, convex training procedure

Master SVMs, and you have a sophisticated tool that excels in many practical scenarios where simpler models fail and complex models are overkill.
