Jan 19, 2026
Continuing in our Deep Learning Series, we now turn our attention to the most critical component of neural network training: optimizers. Imagine you’re blindfolded on a mountain range, trying to find the lowest valley. You can’t see the landscape, but you can feel the slope beneath your feet. Each step you take is guided by the gradient, the direction of steepest descent. This is exactly how neural networks learn: optimizers are the algorithms that decide how to take each step based on the gradient information available.
Training a neural network is fundamentally an optimization problem. We have a loss function $L(\theta)$ that measures how wrong our model is, and we want to find parameters $\theta^*$ that minimize this loss:
\[\theta^* = \arg\min_{\theta} L(\theta)\]The optimizer is the algorithm that finds this minimum. It’s like a navigation system for the loss landscape: a high-dimensional surface where every point represents a different set of model parameters.
Consider the loss landscape of a neural network. For a simple network with just two parameters, we could visualize this as a 3D surface, like a mountain range where elevation represents loss. Real neural networks have millions of parameters, creating a surface in millions of dimensions that we cannot visualize but must navigate.
This landscape has several challenging features:
- Local minima and saddle points, where the gradient vanishes even though the loss is not at its lowest
- Plateaus, large flat regions where gradients are tiny and progress stalls
- Ravines, narrow valleys that are steep in one direction and shallow in another
- Noisy gradients, since in practice we estimate them from finite batches of data

Good optimizers must handle all these challenges efficiently. Now, let’s explore each major optimizer and understand how it works.
Gradient descent is the simplest optimization algorithm. The key insight is that the gradient $\nabla L(\theta)$ points in the direction of steepest ascent. To minimize the loss, we move in the opposite direction:
\[\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)\]Where:
- $\theta_t$ are the parameters at step $t$
- $\eta$ is the learning rate, which controls the step size
- $\nabla L(\theta_t)$ is the gradient of the loss with respect to the parameters
Let’s say we’re training a simple linear regression $y = wx + b$ with MSE loss:
\[L(w, b) = \frac{1}{n} \sum_{i=1}^{n} (y_i - (wx_i + b))^2\]The gradients are:
\[\frac{\partial L}{\partial w} = -\frac{2}{n} \sum_{i=1}^{n} x_i(y_i - (wx_i + b))\] \[\frac{\partial L}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - (wx_i + b))\]Starting from random values, we update:
\(w_{t+1} = w_t - \eta \frac{\partial L}{\partial w}\) \(b_{t+1} = b_t - \eta \frac{\partial L}{\partial b}\)
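To make this concrete, here is a minimal NumPy sketch of these update rules on synthetic data; the data, learning rate, and number of steps are illustrative assumptions, not values from the text.

import numpy as np

# Synthetic data for y = 3x + 2 (the "true" w and b are illustrative)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3 * x + 2 + 0.1 * rng.normal(size=100)

w, b = 0.0, 0.0   # arbitrary starting values
eta = 0.1         # learning rate

for step in range(500):
    error = y - (w * x + b)
    # Gradients of the MSE loss, matching the formulas above
    grad_w = -2.0 * np.mean(x * error)
    grad_b = -2.0 * np.mean(error)
    # Gradient descent updates
    w -= eta * grad_w
    b -= eta * grad_b

print(w, b)  # should approach 3 and 2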
The learning rate $\eta$ is critical: if it’s too large, the algorithm overshoots, oscillates, or diverges; if it’s too small, training is extremely slow. What we need is something “just right” that ensures smooth convergence (but what is “just right”?).
This is called the “learning rate scheduling problem”, and it’s one of the most challenging aspects of deep learning; modern optimizers and learning rate schedules address it in sophisticated ways.
Unlike Vanilla Gradient Descent, which uses the entire dataset to compute gradients, SGD uses a single randomly chosen sample (or mini-batch):
\[\theta_{t+1} = \theta_t - \eta \nabla L_i(\theta_t)\]Where $L_i$ is the loss for sample (or mini-batch) $i$.
The important thing to note here is that the expected value of the stochastic gradient equals the true gradient:
\[\mathbb{E}[\nabla L_i(\theta)] = \nabla L(\theta)\]This means SGD will converge to the same solution (on average), but each step is much faster to compute.
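A quick numeric check of this unbiasedness claim, reusing the toy regression data from the sketch above (the specific parameter values are illustrative):

# Any parameter values work for this check
w, b = 0.5, -0.5
per_sample_grads = [-2.0 * xi * (yi - (w * xi + b)) for xi, yi in zip(x, y)]
full_grad_w = -2.0 * np.mean(x * (y - (w * x + b)))
# The average of the per-sample gradients equals the full-batch gradient
print(np.mean(per_sample_grads), full_grad_w)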
The stochastic nature of SGD has unexpected benefits:
- The gradient noise can help the optimizer escape shallow local minima and saddle points
- The noise acts as a form of implicit regularization, which often improves generalization
In practice, we use mini-batches (typically 32-256 samples):
\[\theta_{t+1} = \theta_t - \eta \frac{1}{|B|} \sum_{i \in B} \nabla L_i(\theta_t)\]This balances three important factors:
- Gradient quality: averaging over the batch reduces the variance of the gradient estimate
- Computational efficiency: batches map well onto parallel hardware such as GPUs
- Memory: the batch must fit in device memory
import numpy as np
class SGD:
"""Vanilla Stochastic Gradient Descent optimizer."""
def __init__(self, learning_rate=0.01):
self.learning_rate = learning_rate
def update(self, params, grads):
"""
Update parameters using gradients.
Args:
params: Dictionary of parameters {name: value}
grads: Dictionary of gradients {name: gradient}
Returns:
Updated parameters
"""
for name in params:
params[name] -= self.learning_rate * grads[name]
return params
# Example usage
params = {'W': np.random.randn(10, 5), 'b': np.zeros(5)}
grads = {'W': np.random.randn(10, 5) * 0.1, 'b': np.random.randn(5) * 0.1}
optimizer = SGD(learning_rate=0.01)
params = optimizer.update(params, grads)
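And here is a rough sketch of a mini-batch training loop built around this optimizer; the data, batch size, and linear/MSE model are assumptions made for this example, not part of the class above.

X = np.random.randn(1000, 10)
y_data = np.random.randn(1000, 5)
batch_size = 64
params = {'W': np.random.randn(10, 5) * 0.01, 'b': np.zeros(5)}
optimizer = SGD(learning_rate=0.01)

for epoch in range(5):
    order = np.random.permutation(len(X))           # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        Xb, yb = X[batch], y_data[batch]
        # Forward pass and MSE gradients for a linear model
        err = Xb @ params['W'] + params['b'] - yb
        grads = {'W': 2 * Xb.T @ err / len(Xb),
                 'b': 2 * err.mean(axis=0)}
        params = optimizer.update(params, grads)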
SGD often tends to oscillate wildly in ravines: narrow valleys where the gradient is steep in one direction but shallow in another. The algorithm zig-zags, making slow progress toward the minimum. Imagine a ball rolling down a hill. It doesn’t just follow the gradient, it builds up momentum. If it’s been moving in one direction consistently, it should continue in that direction even if the current gradient is small.
Momentum introduces a “velocity” term that accumulates past gradients:
\(v_t = \beta v_{t-1} + \nabla L(\theta_t)\) \(\theta_{t+1} = \theta_t - \eta v_t\)
Where:
- $v_t$ is the velocity, an accumulation of past gradients
- $\beta$ is the momentum coefficient (typically 0.9)
The velocity is an exponentially weighted moving average of gradients. Expanding:
\[v_t = \nabla L(\theta_t) + \beta \nabla L(\theta_{t-1}) + \beta^2 \nabla L(\theta_{t-2}) + ...\]Recent gradients have more influence, but past gradients still contribute. This smooths out oscillations.
The momentum term has three key effects:
- It accelerates movement along directions where successive gradients agree
- It dampens oscillations in directions where gradients keep changing sign
- It helps the optimizer roll through small plateaus and local bumps
class MomentumSGD:
"""SGD with Momentum."""
def __init__(self, learning_rate=0.01, momentum=0.9):
self.learning_rate = learning_rate
self.momentum = momentum
self.velocity = {}
def update(self, params, grads):
"""Update parameters using momentum."""
for name in params:
if name not in self.velocity:
self.velocity[name] = np.zeros_like(params[name])
# Update velocity
self.velocity[name] = (self.momentum * self.velocity[name] +
grads[name])
# Update parameters
params[name] -= self.learning_rate * self.velocity[name]
return params
Standard momentum can overshoot the minimum because it keeps moving even when it should slow down. It’s like a ball that builds up too much speed and rolls past the valley bottom.
Nesterov realized we could do better. Instead of computing the gradient at the current position, compute it at the “lookahead” position—where momentum would take you:
\(v_t = \beta v_{t-1} + \nabla L(\theta_t - \eta \beta v_{t-1})\) \(\theta_{t+1} = \theta_t - \eta v_t\)
Nesterov momentum says that “Before computing the gradient, peek ahead to see where momentum is taking me. If I’m about to overshoot, I’ll know and can correct.”
This provides a form of “anticipatory” update that leads to better convergence.
| Aspect | Standard Momentum | Nesterov Momentum |
|---|---|---|
| Gradient computed at | Current position | Lookahead position |
| Correction | After overshooting | Before overshooting |
| Convergence | Good | Better (theoretically optimal for convex) |
class NesterovMomentum:
"""Nesterov Accelerated Gradient optimizer."""
def __init__(self, learning_rate=0.01, momentum=0.9):
self.learning_rate = learning_rate
self.momentum = momentum
self.velocity = {}
def update(self, params, grads):
"""
Update using Nesterov momentum.
Note: In practice, we use a reformulation that doesn't
require computing gradients at the lookahead position.
"""
for name in params:
if name not in self.velocity:
self.velocity[name] = np.zeros_like(params[name])
v_prev = self.velocity[name].copy()
# Update velocity
self.velocity[name] = (self.momentum * self.velocity[name] -
self.learning_rate * grads[name])
# Nesterov update: use velocity correction
params[name] += (-self.momentum * v_prev +
(1 + self.momentum) * self.velocity[name])
return params
AdaGrad adapts the learning rate for each parameter based on its gradient history. The key idea is that different parameters may need different learning rates.
AdaGrad accumulates squared gradients:
\[G_t = G_{t-1} + (\nabla L(\theta_t))^2\]And divides the learning rate by the square root:
\[\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \nabla L(\theta_t)\]Where:
- $G_t$ is the element-wise sum of squared gradients seen so far
- $\epsilon$ is a small constant (e.g., $10^{-8}$) that prevents division by zero
This means:
- Parameters that have received large gradients get a smaller effective learning rate
- Parameters that are updated rarely, or with small gradients, keep a larger effective learning rate
This is especially useful for sparse features (e.g., NLP with rare words).
The accumulated gradient $G_t$ only grows, never shrinks. Eventually, the effective learning rate becomes so small that learning stops entirely. This is called “premature learning rate decay.”
class AdaGrad:
"""AdaGrad optimizer with per-parameter learning rates."""
def __init__(self, learning_rate=0.01, epsilon=1e-8):
self.learning_rate = learning_rate
self.epsilon = epsilon
self.accumulated_grads = {}
def update(self, params, grads):
"""Update parameters with adaptive learning rates."""
for name in params:
if name not in self.accumulated_grads:
self.accumulated_grads[name] = np.zeros_like(params[name])
# Accumulate squared gradients
self.accumulated_grads[name] += grads[name] ** 2
# Adaptive update
params[name] -= (self.learning_rate * grads[name] /
(np.sqrt(self.accumulated_grads[name]) +
self.epsilon))
return params
RMSprop (proposed by Geoff Hinton in a Coursera lecture) uses an exponentially decaying average instead of accumulating all squared gradients:
\[E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) (\nabla L(\theta_t))^2\] \[\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \nabla L(\theta_t)\]Where $\gamma$ is the decay rate (typically 0.9).
class RMSprop:
"""RMSprop optimizer with exponentially decaying gradient average."""
def __init__(self, learning_rate=0.001, decay=0.9, epsilon=1e-8):
self.learning_rate = learning_rate
self.decay = decay
self.epsilon = epsilon
self.moving_avg_sq = {}
def update(self, params, grads):
"""Update parameters using RMSprop."""
for name in params:
if name not in self.moving_avg_sq:
self.moving_avg_sq[name] = np.zeros_like(params[name])
# Update moving average of squared gradients
self.moving_avg_sq[name] = (self.decay * self.moving_avg_sq[name] +
(1 - self.decay) * grads[name] ** 2)
# Update parameters
params[name] -= (self.learning_rate * grads[name] /
(np.sqrt(self.moving_avg_sq[name]) + self.epsilon))
return params
Adam (Adaptive Moment Estimation) was introduced by Diederik Kingma and Jimmy Ba in 2014 and combines the advantages of momentum and RMSprop. It maintains two moving averages:
- The first moment $m_t$: an exponentially decaying average of the gradients (momentum-like)
- The second moment $v_t$: an exponentially decaying average of the squared gradients (RMSprop-like)

Plus bias correction to account for initialization at zero.
First moment (momentum-like): \(m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla L(\theta_t)\)
Second moment (RMSprop-like): \(v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla L(\theta_t))^2\)
Bias correction: \(\hat{m}_t = \frac{m_t}{1 - \beta_1^t}\) \(\hat{v}_t = \frac{v_t}{1 - \beta_2^t}\)
Update: \(\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t\)
At $t=0$, $m_0 = v_0 = 0$, so in early steps the moving averages are biased toward zero. Bias correction compensates: dividing by $1 - \beta^t$ rescales the early estimates, and as $t$ grows the correction factor approaches 1 and has no effect.
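A quick worked example with $\beta_1 = 0.9$: after the first step the raw first moment is only 10% of the gradient, and the correction restores its scale:
\[m_1 = 0.9 \cdot 0 + 0.1 \cdot g_1 = 0.1\, g_1, \qquad \hat{m}_1 = \frac{m_1}{1 - 0.9^1} = \frac{0.1\, g_1}{0.1} = g_1\]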
The original Adam paper recommends:
- $\eta = 0.001$
- $\beta_1 = 0.9$
- $\beta_2 = 0.999$
- $\epsilon = 10^{-8}$
These work well for most problems without tuning.
class Adam:
"""Adam optimizer combining momentum and adaptive learning rates."""
def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999,
epsilon=1e-8):
self.learning_rate = learning_rate
self.beta1 = beta1
self.beta2 = beta2
self.epsilon = epsilon
self.m = {} # First moment
self.v = {} # Second moment
self.t = 0 # Time step
def update(self, params, grads):
"""Update parameters using Adam."""
self.t += 1
for name in params:
if name not in self.m:
self.m[name] = np.zeros_like(params[name])
self.v[name] = np.zeros_like(params[name])
# Update first moment (momentum)
self.m[name] = (self.beta1 * self.m[name] +
(1 - self.beta1) * grads[name])
# Update second moment (RMSprop-like)
self.v[name] = (self.beta2 * self.v[name] +
(1 - self.beta2) * grads[name] ** 2)
# Bias correction
m_hat = self.m[name] / (1 - self.beta1 ** self.t)
v_hat = self.v[name] / (1 - self.beta2 ** self.t)
# Update parameters
params[name] -= (self.learning_rate * m_hat /
(np.sqrt(v_hat) + self.epsilon))
return params
Despite its popularity, Adam has known issues:
- It can generalize worse than SGD with momentum on some vision tasks
- Its original convergence analysis had gaps, later addressed by variants such as AMSGrad
- Its handling of weight decay is subtly wrong, which motivates AdamW below
AdamW is a variant of Adam that fixes the weight decay issue. The key insight is that implementing weight decay as L2 regularization inside Adam is not mathematically equivalent to true weight decay.
In standard SGD, L2 regularization and weight decay are equivalent:
L2 Regularization: Add $\frac{\lambda}{2}||\theta||^2$ to loss \(\nabla L_{reg} = \nabla L + \lambda \theta\)
Weight Decay: Shrink weights after each update \(\theta_{t+1} = \theta_t - \eta \nabla L - \eta \lambda \theta_t\)
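To see why the two coincide for SGD, substitute the regularized gradient into the plain update:
\[\theta_{t+1} = \theta_t - \eta \nabla L_{reg} = \theta_t - \eta (\nabla L + \lambda \theta_t) = \theta_t - \eta \nabla L - \eta \lambda \theta_t\]which is exactly the weight decay update above.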
In Adam, these are NOT equivalent because the gradient is divided by $\sqrt{v_t}$:
L2 in Adam: the regularization term enters through the gradient, so it is also scaled by $\frac{1}{\sqrt{v_t}}$. Weights with a large gradient history are regularized less, which is wrong; we want consistent regularization regardless of the gradient history.
AdamW applies weight decay directly to the weights, not through the gradient:
\[\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t - \eta \lambda \theta_t\]
class AdamW:
"""AdamW optimizer with decoupled weight decay."""
def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999,
epsilon=1e-8, weight_decay=0.01):
self.learning_rate = learning_rate
self.beta1 = beta1
self.beta2 = beta2
self.epsilon = epsilon
self.weight_decay = weight_decay
self.m = {}
self.v = {}
self.t = 0
def update(self, params, grads):
"""Update parameters using AdamW."""
self.t += 1
for name in params:
if name not in self.m:
self.m[name] = np.zeros_like(params[name])
self.v[name] = np.zeros_like(params[name])
# Update moments
self.m[name] = (self.beta1 * self.m[name] +
(1 - self.beta1) * grads[name])
self.v[name] = (self.beta2 * self.v[name] +
(1 - self.beta2) * grads[name] ** 2)
# Bias correction
m_hat = self.m[name] / (1 - self.beta1 ** self.t)
v_hat = self.v[name] / (1 - self.beta2 ** self.t)
# Adam update
params[name] -= (self.learning_rate * m_hat /
(np.sqrt(v_hat) + self.epsilon))
# Decoupled weight decay
params[name] -= self.learning_rate * self.weight_decay * params[name]
return params
AdamW is now the default optimizer for many state-of-the-art models, particularly transformers such as BERT-style and GPT-style language models.
The “W” in AdamW stands for “weight decay”, a small change with big implications.
As we train larger models, we want to use larger batches to speed up training (more parallelism). But large batches often hurt generalization: the model converges to sharper, less generalizable minima.
LARS scales the learning rate differently for each layer based on the ratio of weight norm to gradient norm:
\[\lambda_l = \frac{||\theta_l||}{||\nabla L(\theta_l)||}\] \[\theta_l^{t+1} = \theta_l^t - \eta \lambda_l \frac{\nabla L(\theta_l)}{||\nabla L(\theta_l)||}\]This prevents layers with small gradients from getting left behind.
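A minimal sketch of this layer-wise scaling, in the same style as the optimizers above. It follows the simplified formula here; real LARS implementations also fold in momentum and weight decay, which are omitted.

class LARS:
    """Simplified layer-wise adaptive rate scaling (sketch: no momentum or weight decay)."""
    def __init__(self, learning_rate=0.1, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.epsilon = epsilon
    def update(self, params, grads):
        """Scale each layer's step by the ratio of weight norm to gradient norm."""
        for name in params:
            weight_norm = np.linalg.norm(params[name])
            grad_norm = np.linalg.norm(grads[name]) + self.epsilon
            trust_ratio = weight_norm / grad_norm   # lambda_l from the formula above
            # Normalized gradient step scaled by the per-layer trust ratio
            params[name] -= (self.learning_rate * trust_ratio *
                             grads[name] / grad_norm)
        return params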
LAMB combines LARS with Adam, enabling large batch training with adaptive learning rates:
\[r_t = \frac{m_t}{\sqrt{v_t} + \epsilon}\] \[\theta_{t+1} = \theta_t - \eta \frac{||\theta_t||}{||r_t + \lambda \theta_t||} (r_t + \lambda \theta_t)\]LAMB was used to train BERT in 76 minutes (vs 3+ days with standard Adam).
There are other popular optimizers in modern deep learning that are out of scope for this series, but they are worth knowing about, and if you are curious you can explore them on your own.
A fixed learning rate has limitations:
- Early in training, a large learning rate speeds up progress but can be unstable
- Near convergence, that same rate makes the parameters bounce around the minimum instead of settling into it
Learning rate scheduling adjusts $\eta$ during training.
Step Decay: \(\eta_t = \eta_0 \times \gamma^{\lfloor t / s \rfloor}\)
Drop the learning rate by $\gamma$ every $s$ epochs.
Exponential Decay: \(\eta_t = \eta_0 \times e^{-kt}\)
Continuous exponential decrease.
Cosine Annealing: \(\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})(1 + \cos(\frac{t\pi}{T}))\)
Smooth cosine decay from $\eta_{max}$ to $\eta_{min}$.
Warmup + Decay: Start with a low learning rate, increase linearly for warmup steps, then decay:
\[\eta_t = \begin{cases} \eta_{max} \times \frac{t}{T_{warmup}} & t < T_{warmup} \\ \text{decay schedule} & t \geq T_{warmup} \end{cases}\]
class LearningRateScheduler:
"""Learning rate schedulers for training."""
@staticmethod
def step_decay(initial_lr, epoch, drop_rate=0.5, epochs_drop=10):
"""Step decay: drop lr by drop_rate every epochs_drop epochs."""
return initial_lr * (drop_rate ** (epoch // epochs_drop))
@staticmethod
def exponential_decay(initial_lr, epoch, decay_rate=0.95):
"""Exponential decay."""
return initial_lr * (decay_rate ** epoch)
@staticmethod
def cosine_annealing(initial_lr, epoch, total_epochs, min_lr=0):
"""Cosine annealing from initial_lr to min_lr."""
import math
return min_lr + 0.5 * (initial_lr - min_lr) * (
1 + math.cos(math.pi * epoch / total_epochs)
)
@staticmethod
def warmup_cosine(initial_lr, epoch, total_epochs,
warmup_epochs=5, min_lr=0):
"""Linear warmup followed by cosine decay."""
import math
if epoch < warmup_epochs:
return initial_lr * epoch / warmup_epochs
else:
progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
return min_lr + 0.5 * (initial_lr - min_lr) * (
1 + math.cos(math.pi * progress)
)
The one cycle policy (by Leslie Smith) uses:
- A single cycle in which the learning rate ramps up from a small value to a maximum, then anneals back down to a value well below the starting rate
- Momentum varied in the opposite direction: high when the learning rate is low, low when the learning rate is high
This often achieves faster convergence and better results.
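A rough sketch of such a schedule, in the same style as the schedulers above; the 30% ramp-up fraction and the minimum learning rate are illustrative assumptions.

def one_cycle_lr(max_lr, step, total_steps, pct_up=0.3, min_lr=1e-5):
    """One cycle sketch: linear ramp up to max_lr, then cosine anneal back down."""
    import math
    up_steps = max(1, int(total_steps * pct_up))
    if step < up_steps:
        # Phase 1: increase linearly from min_lr to max_lr
        return min_lr + (max_lr - min_lr) * step / up_steps
    # Phase 2: cosine anneal from max_lr back down to min_lr
    progress = (step - up_steps) / max(1, total_steps - up_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))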
What type of problem?
├── Computer Vision (CNNs)
│ ├── Training from scratch → SGD + Momentum + LR Schedule
│ ├── Fine-tuning → AdamW with low LR
│ └── Large batch → LARS or LAMB
│
├── NLP / Transformers
│ ├── Training from scratch → AdamW + Warmup + Cosine Decay
│ ├── Fine-tuning → AdamW, LR = 1e-5 to 5e-5
│ └── Large models (GPT-3 scale) → AdamW with careful tuning
│
├── Reinforcement Learning
│ ├── Policy gradients → Adam
│ └── Value functions → RMSprop or Adam
│
└── General / Unsure
├── Start with AdamW (safest default)
├── If poor generalization → Try SGD + Momentum
└── If sparse features → AdaGrad or Adam
| Optimizer | Learning Rate | Momentum | Adaptive LR | Weight Decay | Best For |
|---|---|---|---|---|---|
| SGD | 0.01-0.1 | No | No | L2 | Simple problems |
| SGD+Momentum | 0.01-0.1 | Yes | No | L2 | CNNs, CV |
| Nesterov | 0.01-0.1 | Yes (better) | No | L2 | Convex problems |
| AdaGrad | 0.01 | No | Yes | L2 | Sparse features |
| RMSprop | 0.001 | No | Yes | L2 | RNNs, RL |
| Adam | 0.001 | Yes | Yes | L2 (issues) | General default |
| AdamW | 0.001 | Yes | Yes | Decoupled | Transformers, BERT |
| LAMB | 0.001 | Yes | Yes | Decoupled | Large batch training |
We’ve traveled from simple gradient descent to sophisticated adaptive methods. Each optimizer represents a different strategy for navigating the loss landscape:
- SGD takes simple steps along the (stochastic) gradient
- Momentum and Nesterov smooth the path and accelerate progress in consistent directions
- AdaGrad and RMSprop adapt the learning rate per parameter
- Adam and AdamW combine both ideas, with AdamW fixing weight decay
- LARS and LAMB extend these ideas to very large batch training
The best optimizer is the one that works for your specific problem. Experiment, monitor, and adapt.