Dec 10, 2025
Continuing our series on machine learning, today we delve into the fascinating world of Support Vector Machines (SVMs). Imagine you’re a city planner tasked with building a border wall between two countries. You want to place it as far as possible from both sides to create a neutral buffer zone, maximizing the distance to the nearest towns on either side. This is precisely what SVMs do: they find the decision boundary that’s as far as possible from the closest data points of each class. This “maximum margin” principle makes SVMs one of the most elegant and powerful classification algorithms in machine learning.
Consider a binary classification problem: separating apples from oranges based on weight and color. Many different lines (or hyperplanes in higher dimensions) could separate these classes. But which one is best?
Naive approach: Any line that separates the classes works.
SVM approach: Find the line with the maximum margin - the largest possible distance to the nearest points from both classes.
Why maximum margin?
Support Vector Machines excel because they: maximize the margin, which tends to generalize well; depend only on a small subset of the training data (the support vectors); reduce training to a convex optimization problem with a global optimum; and extend to non-linear boundaries via the kernel trick.
In $d$-dimensional space, a hyperplane is a $(d-1)$ dimensional subspace.
Examples: in 2D ($d=2$) a hyperplane is a line; in 3D ($d=3$) it is a plane; in 1D it is a single point.
Mathematical Definition:
A hyperplane in $\mathbb{R}^d$ is defined by:
\[\mathbf{w}^T \mathbf{x} + b = 0\]Where $\mathbf{w} \in \mathbb{R}^d$ is the normal (weight) vector, which determines the orientation of the hyperplane, and $b \in \mathbb{R}$ is the bias, which determines its offset from the origin.
2D Example:
\[w_1 x_1 + w_2 x_2 + b = 0\]This is a line with slope $-w_1/w_2$ and intercept $-b/w_2$ (assuming $w_2 \neq 0$).
For classification, we use the signed distance from a point to the hyperplane:
\[f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b\]Decision rule: predict $\hat{y} = \text{sign}(f(\mathbf{x}))$, i.e. $+1$ if $f(\mathbf{x}) > 0$ and $-1$ if $f(\mathbf{x}) < 0$.
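A minimal numeric sketch of this decision rule; the hyperplane parameters `w` and `b` here are made-up values for illustration:

```python
import numpy as np

# Hypothetical 2D hyperplane parameters (illustrative values only).
w = np.array([1.0, -1.0])
b = -0.5

def f(x):
    """Signed score f(x) = w^T x + b."""
    return w @ x + b

def predict(x):
    """Decision rule: the sign of the score."""
    return 1 if f(x) > 0 else -1

print(predict(np.array([2.0, 1.0])))  # f = 2 - 1 - 0.5 = 0.5 > 0  -> 1
print(predict(np.array([0.0, 1.0])))  # f = 0 - 1 - 0.5 = -1.5 < 0 -> -1
```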
A dataset is linearly separable if there exists a hyperplane that perfectly separates the two classes.
Example (Linearly Separable):
Class +1: (2,3), (3,3), (4,4)
Class -1: (1,1), (2,1), (1,2)
We can find a line $w_1 x_1 + w_2 x_2 + b = 0$ that separates them, e.g. $x_1 + x_2 - 4.5 = 0$.
Example (Not Linearly Separable):
XOR problem:
Class +1: (0,1), (1,0)
Class -1: (0,0), (1,1)
No line can separate these!
For non-linearly separable data, we’ll use the kernel trick later.
The margin is the perpendicular distance from the decision boundary to the nearest data point.
Formal definition:
For a hyperplane $\mathbf{w}^T \mathbf{x} + b = 0$, the distance from a point $\mathbf{x}_i$ to the hyperplane is:
\[\text{distance} = \frac{|f(\mathbf{x}_i)|}{\lVert\mathbf{w}\rVert} = \frac{|\mathbf{w}^T \mathbf{x}_i + b|}{\lVert\mathbf{w}\rVert}\]Where $\lVert\mathbf{w}\rVert = \sqrt{w_1^2 + w_2^2 + \cdots + w_d^2}$ is the Euclidean norm.
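The distance formula can be checked numerically; the hyperplane $x_1 + x_2 - 3 = 0$ below is an illustrative choice:

```python
import numpy as np

# Distance from a point to the (illustrative) hyperplane x1 + x2 - 3 = 0.
w = np.array([1.0, 1.0])
b = -3.0
x_i = np.array([2.0, 3.0])

distance = abs(w @ x_i + b) / np.linalg.norm(w)
print(distance)  # |2 + 3 - 3| / sqrt(2) = sqrt(2) ≈ 1.414
```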
For correctly classified points with labels $y_i \in \{-1, +1\}$:
\[\text{margin} = \frac{y_i(\mathbf{w}^T \mathbf{x}_i + b)}{\lVert\mathbf{w}\rVert}\]The $y_i$ ensures the margin is always positive for correct classifications.
The maximum margin hyperplane is the one that maximizes the smallest margin over all training points.
Geometric Margin:
\[\gamma = \min_{i=1,\ldots,n} \frac{y_i(\mathbf{w}^T \mathbf{x}_i + b)}{\lVert\mathbf{w}\rVert}\]Goal: Find $\mathbf{w}$ and $b$ that maximize $\gamma$.
We can scale $\mathbf{w}$ and $b$ arbitrarily without changing the hyperplane. To make the optimization unique, we use the canonical form:
\[y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1 \quad \forall i\]With equality for the support vectors (closest points):
\[y_i(\mathbf{w}^T \mathbf{x}_i + b) = 1\]In this representation, the margin is:
\[\gamma = \frac{1}{\lVert\mathbf{w}\rVert}\]Width of margin: $\frac{2}{\lVert\mathbf{w}\rVert}$ (distance between the two parallel hyperplanes $\mathbf{w}^T\mathbf{x} + b = \pm 1$)
           Class -1    |    Class +1
                       |
              •        |        ◦
            •          |          ◦
              • [SV]   |   [SV] ◦
               <-------|------->    margin width = 2/||w||
              •        |        ◦
            •          |          ◦
                       |
Support vectors are the data points that lie exactly on the margin boundaries.
Mathematically, support vectors satisfy:
\[y_i(\mathbf{w}^T \mathbf{x}_i + b) = 1\]Key properties to remember: removing any non-support vector leaves the solution unchanged; typically only a small fraction of the training points are support vectors; and in the dual formulation, $\alpha_i > 0$ only for support vectors.
They “support” or define the maximum margin hyperplane like pillars supporting a roof.
Analogy: Imagine pushing two groups of points apart with parallel walls. The support vectors are the points that touch the walls and resist further separation.
In the dual formulation (discussed later), the decision function becomes:
\[f(\mathbf{x}) = \sum_{i \in SV} \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b\]Where $SV$ is the set of support vectors and $\alpha_i > 0$ are their dual coefficients (zero for all other points).
This shows that only support vectors contribute to predictions!
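We can verify this with sklearn on the toy data from earlier; the value of `C` is an illustrative choice. `dual_coef_` stores $\alpha_i y_i$ for each support vector, so the decision function can be rebuilt from the support vectors alone:

```python
import numpy as np
from sklearn.svm import SVC

# The toy separable data from earlier; C is an illustrative choice.
X = np.array([[2, 3], [3, 3], [4, 4], [1, 1], [2, 1], [1, 2]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1000.0).fit(X, y)

# Rebuild f(x) = sum_{i in SV} alpha_i y_i x_i^T x + b using only the
# support vectors (dual_coef_ holds alpha_i * y_i).
f_manual = (clf.dual_coef_ @ (clf.support_vectors_ @ X.T) + clf.intercept_).ravel()
assert np.allclose(f_manual, clf.decision_function(X))

print(f"{len(clf.support_vectors_)} of {len(X)} points are support vectors")
```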
Assumption: Data is perfectly linearly separable.
Constraints: All points must be on the correct side with margin $\geq 1$:
\[y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1 \quad \forall i\]Problem: Very restrictive! If the data is not perfectly separable the problem is infeasible, and even a single outlier can drastically shrink the margin.
Idea: Allow some points to violate the margin or even be misclassified.
Introduce slack variables $\xi_i \geq 0$ for each training point:
\[y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1 - \xi_i\]Interpretation: $\xi_i = 0$ means the point is on the correct side with full margin; $0 < \xi_i \leq 1$ means it is inside the margin but still correctly classified; $\xi_i > 1$ means it is misclassified.
Penalty: Pay cost $C \sum_i \xi_i$ for violations.
Regularization parameter $C$ controls the trade-off: large $C$ penalizes violations heavily (narrow margin, risk of overfitting), while small $C$ tolerates violations (wide margin, risk of underfitting).
Extreme cases: $C \to \infty$ recovers the hard-margin SVM; $C \to 0$ ignores the data and simply maximizes the margin.
Typical values: $C \in [0.01, 100]$ (often found via cross-validation)
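A minimal cross-validation sketch for choosing $C$; the data is synthetic and the grid values are illustrative, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary classification data and an illustrative grid of C values.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print("best C:", grid.best_params_["C"], "cv accuracy:", round(grid.best_score_, 3))
```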
Objective: Maximize margin while minimizing violations.
\[\min_{\mathbf{w}, b, \xi} \frac{1}{2}||\mathbf{w}||^2 + C \sum_{i=1}^{n} \xi_i\]Subject to:
\(y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1 - \xi_i \quad \forall i\) \(\xi_i \geq 0 \quad \forall i\)
Breakdown: the first term $\frac{1}{2}\lVert\mathbf{w}\rVert^2$ maximizes the margin (recall $\gamma = 1/\lVert\mathbf{w}\rVert$), while the second term $C\sum_i \xi_i$ penalizes margin violations.
Why $\lVert\mathbf{w}\rVert^2$ instead of $\lVert\mathbf{w}\rVert$? The squared norm is differentiable everywhere and makes the objective a convex quadratic; both have the same minimizer.
This is a convex quadratic program with linear constraints:
Standard form:
\[\min_{\mathbf{x}} \frac{1}{2}\mathbf{x}^T Q \mathbf{x} + \mathbf{c}^T \mathbf{x}\]Subject to: $A\mathbf{x} \leq \mathbf{b}$
Properties: the objective is convex, so every local minimum is a global minimum; off-the-shelf QP solvers apply; and the optimal $\mathbf{w}$ is unique.
Primal problem: Optimize over $\mathbf{w}, b, \xi$ (dimension = $d + 1 + n$)
Dual problem: Optimize over $\alpha$ (dimension = $n$)
Advantages of dual: it depends on the data only through inner products $\mathbf{x}_i^T \mathbf{x}_j$ (enabling the kernel trick), it has $n$ variables regardless of the feature dimension $d$, and its solution is sparse (most $\alpha_i = 0$).
Introduce Lagrange multipliers $\alpha_i \geq 0$ (for the margin constraints) and $\beta_i \geq 0$ (for $\xi_i \geq 0$):
\[L(\mathbf{w}, b, \xi, \alpha, \beta) = \frac{1}{2}||\mathbf{w}||^2 + C\sum_i \xi_i - \sum_i \alpha_i[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i] - \sum_i \beta_i \xi_i\]Where $\alpha_i, \beta_i \geq 0$ are Lagrange multipliers.
At the optimum, the Karush-Kuhn-Tucker (KKT) conditions must hold:
Stationarity:
\[\frac{\partial L}{\partial \mathbf{w}} = 0 \implies \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i\] \[\frac{\partial L}{\partial b} = 0 \implies \sum_i \alpha_i y_i = 0\] \[\frac{\partial L}{\partial \xi_i} = 0 \implies C - \alpha_i - \beta_i = 0\]Complementary slackness:
\[\alpha_i[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i] = 0\]Feasibility: All constraints satisfied
After substituting the stationarity conditions, we get the dual problem:
\[\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j\]Subject to:
\(0 \leq \alpha_i \leq C \quad \forall i\) \(\sum_{i=1}^{n} \alpha_i y_i = 0\)
Key observation: Only depends on inner products $\mathbf{x}_i^T \mathbf{x}_j$!
This is where the kernel trick comes in.
Once we solve for $\alpha^*$:
Weight vector: \(\mathbf{w}^* = \sum_{i=1}^{n} \alpha_i^* y_i \mathbf{x}_i\)
Bias (using any support vector with $0 < \alpha_i < C$): \(b^* = y_i - \mathbf{w}^{*T} \mathbf{x}_i\)
Decision function: \(f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i^* y_i \mathbf{x}_i^T \mathbf{x} + b^*\)
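These recovery formulas can be checked against sklearn on the toy data from earlier (`C` is an illustrative choice): `dual_coef_` stores $\alpha_i^* y_i$, so $\mathbf{w}^*$ follows by a single matrix product, and $b^*$ from any support vector:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2, 3], [3, 3], [4, 4], [1, 1], [2, 1], [1, 2]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])
clf = SVC(kernel="linear", C=1000.0).fit(X, y)

# w* = sum_i alpha_i* y_i x_i  (dual_coef_ stores alpha_i* y_i):
w_star = (clf.dual_coef_ @ clf.support_vectors_).ravel()
assert np.allclose(w_star, clf.coef_.ravel())

# b* = y_i - w*^T x_i, using one support vector (all are free here since
# the data is separable and C is large):
i = clf.support_[0]
b_star = y[i] - w_star @ X[i]
print("w* =", w_star, " b* =", b_star)
```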
Many real-world datasets are not linearly separable in their original space.
Example: XOR Problem
Original space (2D):
(+) (-)
(-) (+)
Not linearly separable!
Idea: Map to higher-dimensional space where data becomes linearly separable.
Feature map $\phi: \mathbb{R}^d \to \mathbb{R}^D$ transforms features:
\[\mathbf{x} \mapsto \phi(\mathbf{x})\]Example ($d=2$ to $D=5$):
\[\phi(x_1, x_2) = (x_1, x_2, x_1^2, x_2^2, x_1 x_2)\]In the transformed space, we solve:
\[f(\mathbf{x}) = \mathbf{w}^T \phi(\mathbf{x}) + b\]Problem: If $D$ is very large (or infinite!), computing $\phi(\mathbf{x})$ explicitly is expensive or impossible.
Key Insight: We never need $\phi(\mathbf{x})$ explicitly! Only need inner products $\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$.
Kernel function:
\[K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)\]We can compute $K(\mathbf{x}_i, \mathbf{x}_j)$ directly without computing $\phi$!
Dual formulation with kernel:
\[\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)\]Decision function with kernel:
\[f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b\]Magic: Work in infinite-dimensional space with finite computation!
Kernel: $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T \mathbf{z} + c)^2$
For $\mathbf{x} = (x_1, x_2)$ and $\mathbf{z} = (z_1, z_2)$:
\[K(\mathbf{x}, \mathbf{z}) = (x_1 z_1 + x_2 z_2 + c)^2\]Expanding:
\[= x_1^2 z_1^2 + x_2^2 z_2^2 + 2x_1x_2z_1z_2 + 2cx_1z_1 + 2cx_2z_2 + c^2\]This corresponds to feature map:
\[\phi(\mathbf{x}) = (x_1^2, x_2^2, \sqrt{2}x_1x_2, \sqrt{2c}x_1, \sqrt{2c}x_2, c)\]Computing the inner product via $\phi$ explicitly: ~6 multiplications in the 6-dimensional feature space. Computing $K$ directly: ~3 operations in the original 2-dimensional space (much faster!). The gap grows rapidly with dimension and polynomial degree.
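The identity $K(\mathbf{x}, \mathbf{z}) = \phi(\mathbf{x})^T\phi(\mathbf{z})$ can be verified numerically for this feature map:

```python
import numpy as np

def phi(x, c):
    """Explicit 6-D feature map for the degree-2 polynomial kernel."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

def K(x, z, c):
    """Kernel computed directly in the original 2-D space."""
    return (x @ z + c) ** 2

x, z, c = np.array([1.0, 2.0]), np.array([3.0, 0.5]), 1.0
print(K(x, z, c))             # (3 + 1 + 1)^2 = 25.0
print(phi(x, c) @ phi(z, c))  # same value, via the explicit feature map
```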
Linear kernel: $K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^T \mathbf{z}$. Use when the data is high-dimensional and (approximately) linearly separable, e.g. text classification with bag-of-words features.
Equivalent to: No kernel (standard linear SVM)
Polynomial kernel: $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T \mathbf{z} + c)^d$. Parameters: degree $d$ and offset $c \geq 0$. Use when interactions between features are informative.
Example ($d=2$): Captures pairwise feature interactions
RBF (Gaussian) kernel: $K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma \lVert\mathbf{x} - \mathbf{z}\rVert^2)$. Parameter: $\gamma > 0$ (often $\gamma = \frac{1}{2\sigma^2}$).
Properties: $K(\mathbf{x}, \mathbf{x}) = 1$; the implicit feature space is infinite-dimensional; similarity decays smoothly with distance.
Use when there is no strong prior knowledge about the data and the decision boundary may be highly non-linear.
Most popular kernel in practice!
Sigmoid kernel: $K(\mathbf{x}, \mathbf{z}) = \tanh(\kappa\, \mathbf{x}^T \mathbf{z} + c)$. Parameters: slope $\kappa$ and offset $c$. Use when a neural-network-style similarity is desired.
Note: Not always positive semi-definite (may not be a valid kernel)
Decision flow: start with a linear kernel as a fast baseline; if it underfits, try the RBF kernel; tune $\gamma$ and $C$ jointly via cross-validation; reserve the polynomial or sigmoid kernel for cases with specific domain knowledge.
Original space (2D): Not linearly separable
X = [[0,0], [0,1], [1,0], [1,1]]
y = [ -1, +1, +1, -1 ]
With RBF kernel: Becomes separable!
The kernel implicitly maps to a high-dimensional space where a hyperplane can separate the classes.
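This is easy to demonstrate with sklearn on the XOR data above; `gamma` and `C` are illustrative choices:

```python
import numpy as np
from sklearn.svm import SVC

# XOR data from above; gamma and C are illustrative choices.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)

# No line separates XOR, so the linear SVM cannot reach 100% accuracy,
# while the RBF kernel classifies all four points correctly.
print("linear training accuracy:", linear.score(X, y))
print("rbf training accuracy:", rbf.score(X, y))
```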
Problem: Inner circle = class +1, outer ring = class -1
Linear kernel: Cannot separate (no straight line works)
RBF kernel: Easily creates circular decision boundary
Mathematical insight:
RBF kernel measures similarity based on Euclidean distance: $K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma\lVert\mathbf{x} - \mathbf{z}\rVert^2)$ is close to 1 for nearby points and decays toward 0 for distant ones.
This allows creating complex, localized decision regions.
For 2D data with RBF kernel:
\[f(x_1, x_2) = \sum_{i \in SV} \alpha_i y_i \exp(-\gamma ||(x_1, x_2) - (x_{i1}, x_{i2})||^2) + b\]Decision boundary: Contour where $f(x_1, x_2) = 0$
Can be highly non-linear: closed curves, multiple disconnected regions, or boundaries that wrap tightly around clusters of support vectors.
SVM is inherently a binary classifier (+1 vs -1).
Challenge: Extend to $K > 2$ classes.
Approach: Train $K$ binary classifiers.
For class $k$: label the points of class $k$ as $+1$ and all other points as $-1$, then train a binary SVM $f_k$.
Prediction: Choose class with highest decision function value
\[\hat{y} = \arg\max_{k} f_k(\mathbf{x})\]Advantages: only $K$ classifiers to train; simple to implement.
Disadvantages: each binary problem is class-imbalanced, and the decision values of different classifiers are not calibrated against each other.
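A minimal sketch of One-vs-Rest by hand on the Iris dataset, using `LinearSVC` for each binary problem and predicting by argmax of the decision scores:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

# One-vs-Rest by hand: one binary SVM per class, predict via argmax.
X, y = load_iris(return_X_y=True)

scores = []
for k in np.unique(y):
    y_k = np.where(y == k, 1, -1)                      # class k vs. the rest
    clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y_k)
    scores.append(clf.decision_function(X))

pred = np.argmax(np.column_stack(scores), axis=1)
print("training accuracy:", round((pred == y).mean(), 3))
```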
Approach: Train $\binom{K}{2} = \frac{K(K-1)}{2}$ binary classifiers.
For each pair $(i, j)$: train a binary SVM using only the training points from classes $i$ and $j$.
Prediction: Majority voting
Each classifier votes for one class, choose class with most votes.
Advantages: each binary problem is small and more balanced, so individual classifiers train quickly.
Disadvantages: $\frac{K(K-1)}{2}$ classifiers can be expensive to train and store when $K$ is large.
Approach: Solve single optimization problem for all classes simultaneously.
\[\min_{\mathbf{w}_1, \ldots, \mathbf{w}_K} \frac{1}{2}\sum_{k=1}^{K}||\mathbf{w}_k||^2 + C\sum_{i=1}^{n}\xi_i\]Subject to: $\mathbf{w}_{y_i}^T \mathbf{x}_i - \mathbf{w}_k^T \mathbf{x}_i \geq 1 - \xi_i$ for all $k \neq y_i$
Advantages: a single, principled optimization over all classes.
Disadvantages: the joint problem is expensive to solve and rarely outperforms One-vs-One or One-vs-Rest in practice.
sklearn default: SVC uses One-vs-One internally (while LinearSVC uses One-vs-Rest).
Goal: Predict continuous values $y \in \mathbb{R}$.
SVM approach: Instead of maximizing margin between classes, create a tube around predictions.
Key idea: Don’t penalize predictions within $\epsilon$ of true value.
Loss function:
\[L_\epsilon(y, f(\mathbf{x})) = \begin{cases} 0 & \text{if } |y - f(\mathbf{x})| \leq \epsilon \\ |y - f(\mathbf{x})| - \epsilon & \text{otherwise} \end{cases}\]Visualization:
  y + ε  -----------   upper tube boundary (no penalty inside)
         * * * * * *   prediction f(x)
  y - ε  -----------   lower tube boundary (no penalty inside)
Points inside the tube contribute nothing to the loss!
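The $\epsilon$-insensitive loss is straightforward to implement and check:

```python
import numpy as np

def epsilon_insensitive(y_true, y_pred, eps=0.1):
    """Zero loss inside the epsilon tube, linear loss outside it."""
    return np.maximum(np.abs(y_true - y_pred) - eps, 0.0)

print(epsilon_insensitive(1.0, 1.05))  # inside the tube -> 0.0
print(epsilon_insensitive(1.0, 1.30))  # 0.30 off, 0.10 forgiven -> ~0.2
```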
The primal problem minimizes $\frac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_{i=1}^{n}(\xi_i + \xi_i^*)$, subject to:
\(y_i - (\mathbf{w}^T\mathbf{x}_i + b) \leq \epsilon + \xi_i\) \((\mathbf{w}^T\mathbf{x}_i + b) - y_i \leq \epsilon + \xi_i^*\) \(\xi_i, \xi_i^* \geq 0\)
Slack variables: $\xi_i$ measures how far a point lies above the tube, $\xi_i^*$ how far below; both are zero for points inside the tube.
$\epsilon$: Width of insensitive tube
$C$: Penalty for violations (same as classification)
Kernel: Can use any kernel (RBF, polynomial, etc.)
The resulting decision function is \(f(\mathbf{x}) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) K(\mathbf{x}_i, \mathbf{x}) + b\), where $\alpha_i, \alpha_i^*$ are dual variables.
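A minimal SVR sketch on noiseless linear data; the kernel, `C`, and `epsilon` values are illustrative choices:

```python
import numpy as np
from sklearn.svm import SVR

# SVR on noiseless linear data y = 2x + 1; hyperparameters are illustrative.
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(50, 1))
y = 2.0 * X.ravel() + 1.0

reg = SVR(kernel="linear", C=100.0, epsilon=0.1).fit(X, y)
print(reg.predict([[2.0]]))  # close to 2*2 + 1 = 5, within the epsilon tube
```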
Support Vector Machines represent a beautiful marriage of geometry (the maximum-margin principle), convex optimization (a quadratic program with a global optimum), and kernel methods (non-linear power at linear cost).
While deep learning has taken over image and NLP tasks, SVMs remain a powerful tool for specific scenarios where their strengths shine: small-to-medium datasets, high-dimensional feature spaces, and settings where a convex problem with a reproducible global optimum matters.
Master SVMs, and you have a sophisticated tool that excels in many practical scenarios where simpler models fail and complex models are overkill.