Dec 10, 2025
Continuing our series on machine learning, today we delve into the fascinating world of Support Vector Machines (SVMs). Imagine you’re a city planner tasked with building a border wall between two countries. You want to place it as far as possible from both sides to create a neutral buffer zone, maximizing the distance to the nearest towns on either side. This is precisely what SVMs do: they find the decision boundary that’s as far as possible from the closest data points of each class. This “maximum margin” principle makes SVMs one of the most elegant and powerful classification algorithms in machine learning.
Consider a binary classification problem: separating apples from oranges based on weight and color. Many different lines (or hyperplanes in higher dimensions) could separate these classes. But which one is best?
Naive approach: Any line that separates the classes works.
SVM approach: Find the line with the maximum margin - the largest possible distance to the nearest points from both classes.
Why maximum margin?
Support Vector Machines excel because they: maximize the margin, which tends to generalize well; depend only on a small subset of the training data (the support vectors); reduce training to a convex optimization problem with a global optimum; and extend to non-linear boundaries via the kernel trick.
In $d$-dimensional space, a hyperplane is a $(d-1)$ dimensional subspace.
Examples: in 2D ($d=2$) a hyperplane is a line; in 3D ($d=3$) it is a plane; in 1D it is a single point.
Mathematical Definition:
A hyperplane in $\mathbb{R}^d$ is defined by:
\[\mathbf{w}^T \mathbf{x} + b = 0\]Where $\mathbf{w} \in \mathbb{R}^d$ is the normal (weight) vector, which determines the orientation of the hyperplane, and $b \in \mathbb{R}$ is the bias, which determines its offset from the origin.
2D Example:
\[w_1 x_1 + w_2 x_2 + b = 0\]This is a line with slope $-w_1/w_2$ and intercept $-b/w_2$ (assuming $w_2 \neq 0$).
For classification, we use the signed distance from a point to the hyperplane:
\[f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b\]Decision rule: predict $\hat{y} = \text{sign}(f(\mathbf{x}))$, i.e. $+1$ if $f(\mathbf{x}) > 0$ and $-1$ if $f(\mathbf{x}) < 0$.
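A minimal numeric sketch of this decision rule; the hyperplane parameters `w` and `b` here are made-up values for illustration:

```python
import numpy as np

# Hypothetical 2D hyperplane parameters (illustrative values only).
w = np.array([1.0, -1.0])
b = -0.5

def f(x):
    """Signed score f(x) = w^T x + b."""
    return w @ x + b

def predict(x):
    """Decision rule: the sign of the score."""
    return 1 if f(x) > 0 else -1

print(predict(np.array([2.0, 1.0])))  # f = 2 - 1 - 0.5 = 0.5 > 0  -> 1
print(predict(np.array([0.0, 1.0])))  # f = 0 - 1 - 0.5 = -1.5 < 0 -> -1
```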
A dataset is linearly separable if there exists a hyperplane that perfectly separates the two classes.
Example (Linearly Separable):
Class +1: (2,3), (3,3), (4,4)
Class -1: (1,1), (2,1), (1,2)
We can find a line $w_1 x_1 + w_2 x_2 + b = 0$ that separates them, e.g. $x_1 + x_2 - 4.5 = 0$.
Example (Not Linearly Separable):
XOR problem:
Class +1: (0,1), (1,0)
Class -1: (0,0), (1,1)
No line can separate these!
For non-linearly separable data, we’ll use the kernel trick later.
The margin is the perpendicular distance from the decision boundary to the nearest data point.
Formal definition:
For a hyperplane $\mathbf{w}^T \mathbf{x} + b = 0$, the distance from a point $\mathbf{x}_i$ to the hyperplane is:
\[\text{distance} = \frac{|f(\mathbf{x}_i)|}{\lVert\mathbf{w}\rVert} = \frac{|\mathbf{w}^T \mathbf{x}_i + b|}{\lVert\mathbf{w}\rVert}\]Where $\lVert\mathbf{w}\rVert = \sqrt{w_1^2 + w_2^2 + \cdots + w_d^2}$ is the Euclidean norm.
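The distance formula can be checked numerically; the hyperplane $x_1 + x_2 - 3 = 0$ below is an illustrative choice:

```python
import numpy as np

# Distance from a point to the (illustrative) hyperplane x1 + x2 - 3 = 0.
w = np.array([1.0, 1.0])
b = -3.0
x_i = np.array([2.0, 3.0])

distance = abs(w @ x_i + b) / np.linalg.norm(w)
print(distance)  # |2 + 3 - 3| / sqrt(2) = sqrt(2) ≈ 1.414
```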
For correctly classified points with labels $y_i \in \{-1, +1\}$:
\[\text{margin} = \frac{y_i(\mathbf{w}^T \mathbf{x}_i + b)}{\lVert\mathbf{w}\rVert}\]The $y_i$ ensures the margin is always positive for correct classifications.
The maximum margin hyperplane is the one that maximizes the smallest margin over all training points.
Geometric Margin:
\[\gamma = \min_{i=1,\ldots,n} \frac{y_i(\mathbf{w}^T \mathbf{x}_i + b)}{\lVert\mathbf{w}\rVert}\]Goal: Find $\mathbf{w}$ and $b$ that maximize $\gamma$.
We can scale $\mathbf{w}$ and $b$ arbitrarily without changing the hyperplane. To make the optimization unique, we use the canonical form:
\[y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1 \quad \forall i\]With equality for the support vectors (closest points):
\[y_i(\mathbf{w}^T \mathbf{x}_i + b) = 1\]In this representation, the margin is:
\[\gamma = \frac{1}{\lVert\mathbf{w}\rVert}\]Width of margin: $\frac{2}{\lVert\mathbf{w}\rVert}$ (distance between the two parallel hyperplanes $\mathbf{w}^T\mathbf{x} + b = \pm 1$)
           Class -1    |    Class +1
                       |
              •        |        ◦
            •          |          ◦
              • [SV]   |   [SV] ◦
               <-------|------->    margin width = 2/||w||
              •        |        ◦
            •          |          ◦
                       |
Support vectors are the data points that lie exactly on the margin boundaries.
Mathematically, support vectors satisfy:
\[y_i(\mathbf{w}^T \mathbf{x}_i + b) = 1\]Key properties to remember: removing any non-support vector leaves the solution unchanged; typically only a small fraction of the training points are support vectors; and in the dual formulation, $\alpha_i > 0$ only for support vectors.
They “support” or define the maximum margin hyperplane like pillars supporting a roof.
Analogy: Imagine pushing two groups of points apart with parallel walls. The support vectors are the points that touch the walls and resist further separation.
In the dual formulation (discussed later), the decision function becomes:
\[f(\mathbf{x}) = \sum_{i \in SV} \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b\]Where $SV$ is the set of support vectors and $\alpha_i > 0$ are their dual coefficients (zero for all other points).
This shows that only support vectors contribute to predictions!
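We can verify this with sklearn on the toy data from earlier; the value of `C` is an illustrative choice. `dual_coef_` stores $\alpha_i y_i$ for each support vector, so the decision function can be rebuilt from the support vectors alone:

```python
import numpy as np
from sklearn.svm import SVC

# The toy separable data from earlier; C is an illustrative choice.
X = np.array([[2, 3], [3, 3], [4, 4], [1, 1], [2, 1], [1, 2]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1000.0).fit(X, y)

# Rebuild f(x) = sum_{i in SV} alpha_i y_i x_i^T x + b using only the
# support vectors (dual_coef_ holds alpha_i * y_i).
f_manual = (clf.dual_coef_ @ (clf.support_vectors_ @ X.T) + clf.intercept_).ravel()
assert np.allclose(f_manual, clf.decision_function(X))

print(f"{len(clf.support_vectors_)} of {len(X)} points are support vectors")
```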
Assumption: Data is perfectly linearly separable.
Constraints: All points must be on the correct side with margin $\geq 1$:
\[y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1 \quad \forall i\]Problem: Very restrictive! If the data is not perfectly separable the problem is infeasible, and even a single outlier can drastically shrink the margin.
Idea: Allow some points to violate the margin or even be misclassified.
Introduce slack variables $\xi_i \geq 0$ for each training point:
\[y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1 - \xi_i\]Interpretation: $\xi_i = 0$ means the point is on the correct side with full margin; $0 < \xi_i \leq 1$ means it is inside the margin but still correctly classified; $\xi_i > 1$ means it is misclassified.
Penalty: Pay cost $C \sum_i \xi_i$ for violations.
Regularization parameter $C$ controls the trade-off: large $C$ penalizes violations heavily (narrow margin, risk of overfitting), while small $C$ tolerates violations (wide margin, risk of underfitting).
Extreme cases: $C \to \infty$ recovers the hard-margin SVM; $C \to 0$ ignores the data and simply maximizes the margin.
Typical values: $C \in [0.01, 100]$ (often found via cross-validation)
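A minimal cross-validation sketch for choosing $C$; the data is synthetic and the grid values are illustrative, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary classification data and an illustrative grid of C values.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print("best C:", grid.best_params_["C"], "cv accuracy:", round(grid.best_score_, 3))
```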
Objective: Maximize margin while minimizing violations.
\[\min_{\mathbf{w}, b, \xi} \frac{1}{2}||\mathbf{w}||^2 + C \sum_{i=1}^{n} \xi_i\]Subject to:
\(y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1 - \xi_i \quad \forall i\) \(\xi_i \geq 0 \quad \forall i\)
Breakdown: the first term $\frac{1}{2}\lVert\mathbf{w}\rVert^2$ maximizes the margin (recall $\gamma = 1/\lVert\mathbf{w}\rVert$), while the second term $C\sum_i \xi_i$ penalizes margin violations.
Why $\lVert\mathbf{w}\rVert^2$ instead of $\lVert\mathbf{w}\rVert$? The squared norm is differentiable everywhere and makes the objective a convex quadratic; both have the same minimizer.
This is a convex quadratic program with linear constraints:
Standard form:
\[\min_{\mathbf{x}} \frac{1}{2}\mathbf{x}^T Q \mathbf{x} + \mathbf{c}^T \mathbf{x}\]Subject to: $A\mathbf{x} \leq \mathbf{b}$
Properties: the objective is convex, so every local minimum is a global minimum; off-the-shelf QP solvers apply; and the optimal $\mathbf{w}$ is unique.
Primal problem: Optimize over $\mathbf{w}, b, \xi$ (dimension = $d + 1 + n$)
Dual problem: Optimize over $\alpha$ (dimension = $n$)
Advantages of dual: it depends on the data only through inner products $\mathbf{x}_i^T \mathbf{x}_j$ (enabling the kernel trick), it has $n$ variables regardless of the feature dimension $d$, and its solution is sparse (most $\alpha_i = 0$).
Introduce Lagrange multipliers $\alpha_i \geq 0$ (for the margin constraints) and $\beta_i \geq 0$ (for $\xi_i \geq 0$):
\[L(\mathbf{w}, b, \xi, \alpha, \beta) = \frac{1}{2}||\mathbf{w}||^2 + C\sum_i \xi_i - \sum_i \alpha_i[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i] - \sum_i \beta_i \xi_i\]Where $\alpha_i, \beta_i \geq 0$ are Lagrange multipliers.
At the optimum, the Karush-Kuhn-Tucker (KKT) conditions must hold:
Stationarity:
\[\frac{\partial L}{\partial \mathbf{w}} = 0 \implies \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i\] \[\frac{\partial L}{\partial b} = 0 \implies \sum_i \alpha_i y_i = 0\] \[\frac{\partial L}{\partial \xi_i} = 0 \implies C - \alpha_i - \beta_i = 0\]Complementary slackness:
\[\alpha_i[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i] = 0\]Feasibility: All constraints satisfied
After substituting the stationarity conditions, we get the dual problem:
\[\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j\]Subject to:
\(0 \leq \alpha_i \leq C \quad \forall i\) \(\sum_{i=1}^{n} \alpha_i y_i = 0\)
Key observation: Only depends on inner products $\mathbf{x}_i^T \mathbf{x}_j$!
This is where the kernel trick comes in.
Once we solve for $\alpha^*$:
Weight vector: \(\mathbf{w}^* = \sum_{i=1}^{n} \alpha_i^* y_i \mathbf{x}_i\)
Bias (using any support vector with $0 < \alpha_i < C$): \(b^* = y_i - \mathbf{w}^{*T} \mathbf{x}_i\)
Decision function: \(f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i^* y_i \mathbf{x}_i^T \mathbf{x} + b^*\)
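These recovery formulas can be checked against sklearn on the toy data from earlier (`C` is an illustrative choice): `dual_coef_` stores $\alpha_i^* y_i$, so $\mathbf{w}^*$ follows by a single matrix product, and $b^*$ from any support vector:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2, 3], [3, 3], [4, 4], [1, 1], [2, 1], [1, 2]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])
clf = SVC(kernel="linear", C=1000.0).fit(X, y)

# w* = sum_i alpha_i* y_i x_i  (dual_coef_ stores alpha_i* y_i):
w_star = (clf.dual_coef_ @ clf.support_vectors_).ravel()
assert np.allclose(w_star, clf.coef_.ravel())

# b* = y_i - w*^T x_i, using one support vector (all are free here since
# the data is separable and C is large):
i = clf.support_[0]
b_star = y[i] - w_star @ X[i]
print("w* =", w_star, " b* =", b_star)
```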
Many real-world datasets are not linearly separable in their original space.
Example: XOR Problem
Original space (2D):
(+) (-)
(-) (+)
Not linearly separable!
Idea: Map to higher-dimensional space where data becomes linearly separable.
Feature map $\phi: \mathbb{R}^d \to \mathbb{R}^D$ transforms features:
\[\mathbf{x} \mapsto \phi(\mathbf{x})\]Example ($d=2$ to $D=5$):
\[\phi(x_1, x_2) = (x_1, x_2, x_1^2, x_2^2, x_1 x_2)\]In the transformed space, we solve:
\[f(\mathbf{x}) = \mathbf{w}^T \phi(\mathbf{x}) + b\]Problem: If $D$ is very large (or infinite!), computing $\phi(\mathbf{x})$ explicitly is expensive or impossible.
Key Insight: We never need $\phi(\mathbf{x})$ explicitly! Only need inner products $\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$.
Kernel function:
\[K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)\]We can compute $K(\mathbf{x}_i, \mathbf{x}_j)$ directly without computing $\phi$!
Dual formulation with kernel:
\[\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)\]Decision function with kernel:
\[f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b\]Magic: Work in infinite-dimensional space with finite computation!
Kernel: $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T \mathbf{z} + c)^2$
For $\mathbf{x} = (x_1, x_2)$ and $\mathbf{z} = (z_1, z_2)$:
\[K(\mathbf{x}, \mathbf{z}) = (x_1 z_1 + x_2 z_2 + c)^2\]Expanding:
\[= x_1^2 z_1^2 + x_2^2 z_2^2 + 2x_1x_2z_1z_2 + 2cx_1z_1 + 2cx_2z_2 + c^2\]This corresponds to feature map:
\[\phi(\mathbf{x}) = (x_1^2, x_2^2, \sqrt{2}x_1x_2, \sqrt{2c}x_1, \sqrt{2c}x_2, c)\]Computing the inner product via $\phi$ explicitly: ~6 multiplications in the 6-dimensional feature space. Computing $K$ directly: ~3 operations in the original 2-dimensional space (much faster!). The gap grows rapidly with dimension and polynomial degree.
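The identity $K(\mathbf{x}, \mathbf{z}) = \phi(\mathbf{x})^T\phi(\mathbf{z})$ can be verified numerically for this feature map:

```python
import numpy as np

def phi(x, c):
    """Explicit 6-D feature map for the degree-2 polynomial kernel."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

def K(x, z, c):
    """Kernel computed directly in the original 2-D space."""
    return (x @ z + c) ** 2

x, z, c = np.array([1.0, 2.0]), np.array([3.0, 0.5]), 1.0
print(K(x, z, c))             # (3 + 1 + 1)^2 = 25.0
print(phi(x, c) @ phi(z, c))  # same value, via the explicit feature map
```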
Linear kernel: $K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^T \mathbf{z}$. Use when the data is high-dimensional and (approximately) linearly separable, e.g. text classification with bag-of-words features.
Equivalent to: No kernel (standard linear SVM)
Polynomial kernel: $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T \mathbf{z} + c)^d$. Parameters: degree $d$ and offset $c \geq 0$. Use when interactions between features are informative.
Example ($d=2$): Captures pairwise feature interactions
RBF (Gaussian) kernel: $K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma \lVert\mathbf{x} - \mathbf{z}\rVert^2)$. Parameter: $\gamma > 0$ (often $\gamma = \frac{1}{2\sigma^2}$).
Properties: $K(\mathbf{x}, \mathbf{x}) = 1$; the implicit feature space is infinite-dimensional; similarity decays smoothly with distance.
Use when there is no strong prior knowledge about the data and the decision boundary may be highly non-linear.
Most popular kernel in practice!
Sigmoid kernel: $K(\mathbf{x}, \mathbf{z}) = \tanh(\kappa\, \mathbf{x}^T \mathbf{z} + c)$. Parameters: slope $\kappa$ and offset $c$. Use when a neural-network-style similarity is desired.
Note: Not always positive semi-definite (may not be a valid kernel)
Decision flow: start with a linear kernel as a fast baseline; if it underfits, try the RBF kernel; tune $\gamma$ and $C$ jointly via cross-validation; reserve the polynomial or sigmoid kernel for cases with specific domain knowledge.
Original space (2D): Not linearly separable
X = [[0,0], [0,1], [1,0], [1,1]]
y = [ -1, +1, +1, -1 ]
With RBF kernel: Becomes separable!
The kernel implicitly maps to a high-dimensional space where a hyperplane can separate the classes.
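This is easy to demonstrate with sklearn on the XOR data above; `gamma` and `C` are illustrative choices:

```python
import numpy as np
from sklearn.svm import SVC

# XOR data from above; gamma and C are illustrative choices.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)

# No line separates XOR, so the linear SVM cannot reach 100% accuracy,
# while the RBF kernel classifies all four points correctly.
print("linear training accuracy:", linear.score(X, y))
print("rbf training accuracy:", rbf.score(X, y))
```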
Problem: Inner circle = class +1, outer ring = class -1
Linear kernel: Cannot separate (no straight line works)
RBF kernel: Easily creates circular decision boundary
Mathematical insight:
RBF kernel measures similarity based on Euclidean distance: $K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma\lVert\mathbf{x} - \mathbf{z}\rVert^2)$ is close to 1 for nearby points and decays toward 0 for distant ones.
This allows creating complex, localized decision regions.
For 2D data with RBF kernel:
\[f(x_1, x_2) = \sum_{i \in SV} \alpha_i y_i \exp(-\gamma ||(x_1, x_2) - (x_{i1}, x_{i2})||^2) + b\]Decision boundary: Contour where $f(x_1, x_2) = 0$
Can be highly non-linear: closed curves, multiple disconnected regions, or boundaries that wrap tightly around clusters of support vectors.
SVM is inherently a binary classifier (+1 vs -1).
Challenge: Extend to $K > 2$ classes.
Approach: Train $K$ binary classifiers.
For class $k$: label the points of class $k$ as $+1$ and all other points as $-1$, then train a binary SVM $f_k$.
Prediction: Choose class with highest decision function value
\[\hat{y} = \arg\max_{k} f_k(\mathbf{x})\]Advantages: only $K$ classifiers to train; simple to implement.
Disadvantages: each binary problem is class-imbalanced, and the decision values of different classifiers are not calibrated against each other.
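A minimal sketch of One-vs-Rest by hand on the Iris dataset, using `LinearSVC` for each binary problem and predicting by argmax of the decision scores:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

# One-vs-Rest by hand: one binary SVM per class, predict via argmax.
X, y = load_iris(return_X_y=True)

scores = []
for k in np.unique(y):
    y_k = np.where(y == k, 1, -1)                      # class k vs. the rest
    clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y_k)
    scores.append(clf.decision_function(X))

pred = np.argmax(np.column_stack(scores), axis=1)
print("training accuracy:", round((pred == y).mean(), 3))
```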
Approach: Train $\binom{K}{2} = \frac{K(K-1)}{2}$ binary classifiers.
For each pair $(i, j)$: train a binary SVM using only the training points from classes $i$ and $j$.
Prediction: Majority voting
Each classifier votes for one class, choose class with most votes.
Advantages: each binary problem is small and more balanced, so individual classifiers train quickly.
Disadvantages: $\frac{K(K-1)}{2}$ classifiers can be expensive to train and store when $K$ is large.
Approach: Solve single optimization problem for all classes simultaneously.
\[\min_{\mathbf{w}_1, \ldots, \mathbf{w}_K} \frac{1}{2}\sum_{k=1}^{K}||\mathbf{w}_k||^2 + C\sum_{i=1}^{n}\xi_i\]Subject to: $\mathbf{w}_{y_i}^T \mathbf{x}_i - \mathbf{w}_k^T \mathbf{x}_i \geq 1 - \xi_i$ for all $k \neq y_i$
Advantages: a single, principled optimization over all classes.
Disadvantages: the joint problem is expensive to solve and rarely outperforms One-vs-One or One-vs-Rest in practice.
sklearn default: SVC uses One-vs-One internally (while LinearSVC uses One-vs-Rest).
Goal: Predict continuous values $y \in \mathbb{R}$.
SVM approach: Instead of maximizing margin between classes, create a tube around predictions.
Key idea: Don’t penalize predictions within $\epsilon$ of true value.
Loss function:
\[L_\epsilon(y, f(\mathbf{x})) = \begin{cases} 0 & \text{if } |y - f(\mathbf{x})| \leq \epsilon \\ |y - f(\mathbf{x})| - \epsilon & \text{otherwise} \end{cases}\]Visualization:
  y + ε  -----------   upper tube boundary (no penalty inside)
         * * * * * *   prediction f(x)
  y - ε  -----------   lower tube boundary (no penalty inside)
Points inside the tube contribute nothing to the loss!
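The $\epsilon$-insensitive loss is straightforward to implement and check:

```python
import numpy as np

def epsilon_insensitive(y_true, y_pred, eps=0.1):
    """Zero loss inside the epsilon tube, linear loss outside it."""
    return np.maximum(np.abs(y_true - y_pred) - eps, 0.0)

print(epsilon_insensitive(1.0, 1.05))  # inside the tube -> 0.0
print(epsilon_insensitive(1.0, 1.30))  # 0.30 off, 0.10 forgiven -> ~0.2
```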
The primal problem minimizes $\frac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_{i=1}^{n}(\xi_i + \xi_i^*)$, subject to:
\(y_i - (\mathbf{w}^T\mathbf{x}_i + b) \leq \epsilon + \xi_i\) \((\mathbf{w}^T\mathbf{x}_i + b) - y_i \leq \epsilon + \xi_i^*\) \(\xi_i, \xi_i^* \geq 0\)
Slack variables: $\xi_i$ measures how far a point lies above the tube, $\xi_i^*$ how far below; both are zero for points inside the tube.
$\epsilon$: Width of insensitive tube
$C$: Penalty for violations (same as classification)
Kernel: Can use any kernel (RBF, polynomial, etc.)
The resulting decision function is \(f(\mathbf{x}) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) K(\mathbf{x}_i, \mathbf{x}) + b\), where $\alpha_i, \alpha_i^*$ are dual variables.
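A minimal SVR sketch on noiseless linear data; the kernel, `C`, and `epsilon` values are illustrative choices:

```python
import numpy as np
from sklearn.svm import SVR

# SVR on noiseless linear data y = 2x + 1; hyperparameters are illustrative.
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(50, 1))
y = 2.0 * X.ravel() + 1.0

reg = SVR(kernel="linear", C=100.0, epsilon=0.1).fit(X, y)
print(reg.predict([[2.0]]))  # close to 2*2 + 1 = 5, within the epsilon tube
```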
Support Vector Machines represent a beautiful marriage of geometry (the maximum-margin principle), convex optimization (a quadratic program with a global optimum), and kernel methods (non-linear power at linear cost).
While deep learning has taken over image and NLP tasks, SVMs remain a powerful tool for specific scenarios where their strengths shine: small-to-medium datasets, high-dimensional feature spaces, and settings where a convex problem with a reproducible global optimum matters.
Master SVMs, and you have a sophisticated tool that excels in many practical scenarios where simpler models fail and complex models are overkill.