
Deep Learning Series

by Mayank Sharma

Understanding Optimizers: How Neural Networks Actually Learn

Jan 19, 2026

Continuing our Deep Learning Series, we now turn our attention to one of the most critical components of neural network training: the optimizer. Imagine you’re blindfolded on a mountain range, trying to find the lowest valley. You can’t see the landscape, but you can feel the slope beneath your feet. Each step you take is guided by the gradient, the direction of steepest descent. This is exactly how neural networks learn: optimizers are the algorithms that decide how to take each step based on the available gradient information.

Table of Contents

  1. Introduction: The Art of Finding Minima
  2. Gradient Descent: The Foundation
  3. Stochastic Gradient Descent (SGD)
  4. Momentum: Learning from the Past
  5. Nesterov Accelerated Gradient
  6. AdaGrad: Adaptive Learning Rates
  7. RMSprop: Solving AdaGrad’s Problems
  8. Adam: The Best of Both Worlds
  9. AdamW: Decoupled Weight Decay
  10. Advanced Optimizers: LAMB, LARS, and Beyond
  11. Learning Rate Scheduling
  12. Choosing The Right Optimizer
  13. Conclusion

Introduction: The Art of Finding Minima

Why Optimizers Matter

Training a neural network is fundamentally an optimization problem. We have a loss function $L(\theta)$ that measures how wrong our model is, and we want to find parameters $\theta^*$ that minimize this loss:

\[\theta^* = \arg\min_{\theta} L(\theta)\]

The optimizer is the algorithm that finds this minimum. It acts as a navigation system for the loss landscape: a high-dimensional surface where every point represents a different set of model parameters.

The Loss Landscape

Consider the loss landscape of a neural network. For a simple network with just two parameters, we could visualize this as a 3D surface, like a mountain range where elevation represents loss. Real neural networks have millions of parameters, creating a surface in millions of dimensions that we cannot visualize but must navigate.

This landscape has several challenging features:

  1. Local Minima: Valleys that aren’t the lowest point
  2. Saddle Points: Flat regions that are neither maxima nor minima
  3. Plateaus: Extended flat regions where gradients vanish
  4. Ravines: Narrow valleys with steep walls

Good optimizers must handle all these challenges efficiently. Now, let’s explore each major optimizer and understand how it works.

Gradient Descent: The Foundation

The Core Idea

Gradient descent is the simplest optimization algorithm. The key insight is that the gradient $\nabla L(\theta)$ points in the direction of steepest ascent. To minimize the loss, we move in the opposite direction:

\[\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)\]

Where $\theta_t$ are the current parameters, $\eta$ is the learning rate (the step size), and $\nabla L(\theta_t)$ is the gradient of the loss at $\theta_t$.

A Concrete Example With Vanilla Gradient Descent

Let’s say we’re training a simple linear regression $y = wx + b$ with MSE loss:

\[L(w, b) = \frac{1}{n} \sum_{i=1}^{n} (y_i - (wx_i + b))^2\]

The gradients are:

\[\frac{\partial L}{\partial w} = -\frac{2}{n} \sum_{i=1}^{n} x_i(y_i - (wx_i + b))\] \[\frac{\partial L}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - (wx_i + b))\]

Starting from random values, we update:

\(w_{t+1} = w_t - \eta \frac{\partial L}{\partial w}\) \(b_{t+1} = b_t - \eta \frac{\partial L}{\partial b}\)
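
To make this concrete, here is a minimal NumPy sketch of these updates on synthetic data; the data, learning rate, and iteration count are illustrative assumptions.

import numpy as np

# Synthetic data for y = 3x + 2 with a little noise (illustrative values)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3 * x + 2 + 0.1 * rng.standard_normal(100)

w, b = 0.0, 0.0   # start from arbitrary values
eta = 0.1         # learning rate

for step in range(200):
    error = y - (w * x + b)
    # Gradients of the MSE loss derived above
    grad_w = -2 * np.mean(x * error)
    grad_b = -2 * np.mean(error)
    # Gradient descent updates
    w -= eta * grad_w
    b -= eta * grad_b

print(w, b)  # should approach 3 and 2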

Now Comes The Learning Rate Dilemma

The learning rate $\eta$ is critical: if it is too large, the algorithm overshoots, oscillates, or diverges; if it is too small, training is extremely slow. What we need is something “just right” that ensures smooth convergence, but what counts as “just right”?

This is often called the “learning rate scheduling problem”, and it is one of the most challenging aspects of deep learning. Modern optimizers and learning rate schedules address it in sophisticated ways.
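
A small illustration of the dilemma on the one-dimensional quadratic $L(\theta) = \theta^2$; the specific learning rates below are illustrative assumptions.

import numpy as np

def run_gd(eta, steps=20, theta0=5.0):
    """Gradient descent on L(theta) = theta^2, whose gradient is 2*theta."""
    theta = theta0
    for _ in range(steps):
        theta -= eta * 2 * theta
    return theta

print(run_gd(eta=1.1))    # too large: |1 - 2*eta| > 1, the iterates diverge
print(run_gd(eta=0.001))  # too small: barely moves from 5.0 in 20 steps
print(run_gd(eta=0.4))    # reasonable: converges close to 0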

Limitations of Vanilla Gradient Descent

  1. Requires full dataset: Must compute gradients over all training examples
  2. Same learning rate for all parameters: Some parameters may need faster or slower updates
  3. Gets stuck in local minima: No mechanism to escape poor solutions
  4. Oscillates in ravines: Zig-zags instead of going directly to the minimum

Stochastic Gradient Descent (SGD)

Unlike Vanilla Gradient Descent, which uses the entire dataset to compute gradients, SGD uses a single randomly chosen sample (or mini-batch):

\[\theta_{t+1} = \theta_t - \eta \nabla L_i(\theta_t)\]

Where $L_i$ is the loss for sample (or mini-batch) $i$.

The important thing to note here is that the expected value of the stochastic gradient equals the true gradient:

\[\mathbb{E}[\nabla L_i(\theta)] = \nabla L(\theta)\]

This means SGD will converge to the same solution (on average), but each step is much faster to compute.
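
A quick numerical check of this unbiasedness for the linear-regression loss above; the data and parameter values are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(200)
y = 2 * x + 1 + 0.1 * rng.standard_normal(200)
w, b = 0.5, 0.0  # arbitrary parameter values

def grad_w(idx):
    """Gradient of the MSE loss w.r.t. w over the samples in idx."""
    xi, yi = x[idx], y[idx]
    return -2 * np.mean(xi * (yi - (w * xi + b)))

full = grad_w(np.arange(len(x)))                                  # true (full-batch) gradient
single = [grad_w(rng.integers(len(x), size=1)) for _ in range(10_000)]

print(full, np.mean(single))  # the average of noisy single-sample estimates approaches the true gradient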

The Benefits of Noise

The stochastic nature of SGD has unexpected benefits:

  1. Escapes local minima: Random fluctuations can push the optimizer out of poor solutions
  2. Better generalization: The noise acts as implicit regularization
  3. Faster iterations: Each step is cheap to compute

Mini-Batch SGD

In practice, we use mini-batches (typically 32-256 samples):

\[\theta_{t+1} = \theta_t - \eta \frac{1}{|B|} \sum_{i \in B} \nabla L_i(\theta_t)\]

This balances three important factors:

  1. Gradient accuracy: larger batches give less noisy gradient estimates
  2. Computational efficiency: batches exploit GPU parallelism far better than single samples
  3. Memory usage: the batch must fit in device memory

Implementation

import numpy as np

class SGD:
    """Vanilla Stochastic Gradient Descent optimizer."""

    def __init__(self, learning_rate=0.01):
        self.learning_rate = learning_rate

    def update(self, params, grads):
        """
        Update parameters using gradients.

        Args:
            params: Dictionary of parameters {name: value}
            grads: Dictionary of gradients {name: gradient}

        Returns:
            Updated parameters
        """
        for name in params:
            params[name] -= self.learning_rate * grads[name]
        return params

# Example usage
params = {'W': np.random.randn(10, 5), 'b': np.zeros(5)}
grads = {'W': np.random.randn(10, 5) * 0.1, 'b': np.random.randn(5) * 0.1}

optimizer = SGD(learning_rate=0.01)
params = optimizer.update(params, grads)
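
To connect the optimizer to training, here is a hedged sketch of a mini-batch loop driving the SGD class above on a linear model; the synthetic data, model, and batch size are illustrative assumptions.

# Mini-batch loop (assumes the SGD class and numpy import above)
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))
true_w = rng.standard_normal((10, 1))
y = X @ true_w + 0.1 * rng.standard_normal((1000, 1))

params = {'W': np.zeros((10, 1)), 'b': np.zeros(1)}
optimizer = SGD(learning_rate=0.1)
batch_size = 32

for epoch in range(5):
    perm = rng.permutation(len(X))           # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        error = Xb @ params['W'] + params['b'] - yb   # predictions minus targets
        grads = {
            'W': 2 * Xb.T @ error / len(idx),         # dL/dW for MSE
            'b': 2 * error.mean(axis=0),              # dL/db for MSE
        }
        params = optimizer.update(params, grads)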

Momentum: Learning from the Past

The Problem with SGD

SGD often tends to oscillate wildly in ravines: narrow valleys where the gradient is steep in one direction but shallow in another. The algorithm zig-zags, making slow progress toward the minimum. Now imagine a ball rolling down a hill. It doesn’t just follow the local gradient; it builds up momentum. If it has been moving in one direction consistently, it keeps moving in that direction even when the current gradient is small.

Mathematical Formulation

Momentum introduces a “velocity” term that accumulates past gradients:

\(v_t = \beta v_{t-1} + \nabla L(\theta_t)\) \(\theta_{t+1} = \theta_t - \eta v_t\)

Where $v_t$ is the velocity, $\beta$ is the momentum coefficient (typically 0.9), and $\eta$ is the learning rate.

Exponential Moving Average

The velocity is an exponentially weighted moving average of gradients. Expanding:

\[v_t = \nabla L(\theta_t) + \beta \nabla L(\theta_{t-1}) + \beta^2 \nabla L(\theta_{t-2}) + ...\]

Recent gradients have more influence, but past gradients still contribute. This smooths out oscillations.

Why It Works

The momentum term has three key effects:

  1. Acceleration: gradient components that point consistently in one direction accumulate, speeding up progress
  2. Damping: components that oscillate from step to step partially cancel, smoothing the path through ravines
  3. Escape: accumulated velocity can carry the optimizer across small bumps and flat regions

Implementation

class MomentumSGD:
    """SGD with Momentum."""

    def __init__(self, learning_rate=0.01, momentum=0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.velocity = {}

    def update(self, params, grads):
        """Update parameters using momentum."""
        for name in params:
            if name not in self.velocity:
                self.velocity[name] = np.zeros_like(params[name])

            # Update velocity
            self.velocity[name] = (self.momentum * self.velocity[name] +
                                   grads[name])

            # Update parameters
            params[name] -= self.learning_rate * self.velocity[name]

        return params
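
A small comparison, assuming the SGD and MomentumSGD classes above, on the ill-conditioned quadratic $L(\theta) = 0.5(100\,\theta_1^2 + \theta_2^2)$, whose elongated shape mimics a ravine; the coefficients, learning rate, and step count are illustrative assumptions.

import numpy as np

def ravine_grad(theta):
    """Gradient of L(theta) = 0.5 * (100 * theta[0]**2 + theta[1]**2)."""
    return np.array([100.0 * theta[0], theta[1]])

for opt in (SGD(learning_rate=0.015), MomentumSGD(learning_rate=0.015, momentum=0.9)):
    params = {'theta': np.array([1.0, 1.0])}
    for _ in range(100):
        params = opt.update(params, {'theta': ravine_grad(params['theta'])})
    # Momentum makes far more progress along the shallow theta[1] direction
    print(type(opt).__name__, params['theta'])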

Choosing the Momentum Coefficient

A value of $\beta = 0.9$ works well for most problems. Values closer to 1 (such as 0.99) average over a longer gradient history and smooth more aggressively, but they also react more slowly to new gradient information and can overshoot.

Nesterov Accelerated Gradient

The Problem with Standard Momentum

Standard momentum can overshoot the minimum because it keeps moving even when it should slow down, like a ball that builds up too much speed and rolls past the valley bottom.

Nesterov’s Insight

Nesterov realized we could do better. Instead of computing the gradient at the current position, compute it at the “lookahead” position—where momentum would take you:

\(v_t = \beta v_{t-1} + \nabla L(\theta_t - \eta \beta v_{t-1})\) \(\theta_{t+1} = \theta_t - \eta v_t\)

Nesterov momentum effectively says: “Before computing the gradient, peek ahead to see where momentum is taking me. If I’m about to overshoot, I’ll know and can correct.”

This provides a form of “anticipatory” update that leads to better convergence.

Comparison

Aspect                | Standard Momentum    | Nesterov Momentum
Gradient computed at  | Current position     | Lookahead position
Correction            | After overshooting   | Before overshooting
Convergence           | Good                 | Better (theoretically optimal for convex problems)

Implementation

class NesterovMomentum:
    """Nesterov Accelerated Gradient optimizer."""

    def __init__(self, learning_rate=0.01, momentum=0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.velocity = {}

    def update(self, params, grads):
        """
        Update using Nesterov momentum.

        Note: In practice, we use a reformulation that doesn't
        require computing gradients at the lookahead position.
        """
        for name in params:
            if name not in self.velocity:
                self.velocity[name] = np.zeros_like(params[name])

            v_prev = self.velocity[name].copy()

            # Update velocity
            self.velocity[name] = (self.momentum * self.velocity[name] -
                                   self.learning_rate * grads[name])

            # Nesterov update: use velocity correction
            params[name] += (-self.momentum * v_prev +
                            (1 + self.momentum) * self.velocity[name])

        return params

AdaGrad: Adaptive Learning Rates

AdaGrad adapts the learning rate for each parameter individually, based on the historical gradient information for that parameter. The key idea is that different parameters may need different learning rates: frequently updated parameters should take smaller steps, while rarely updated parameters should take larger ones.

Mathematical Formulation

AdaGrad accumulates squared gradients:

\[G_t = G_{t-1} + (\nabla L(\theta_t))^2\]

And divides the learning rate by the square root:

\[\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \nabla L(\theta_t)\]

Where $G_t$ is the element-wise sum of squared gradients up to step $t$, and $\epsilon$ is a small constant for numerical stability.

This means parameters that have accumulated large gradients get smaller effective learning rates, while parameters with small or infrequent gradients keep larger ones.

This is especially useful for sparse features (e.g., NLP with rare words), where some parameters are updated only occasionally.

The Problem

The accumulated gradient $G_t$ only grows, never shrinks. Eventually, the effective learning rate becomes so small that learning stops entirely. This is called “premature learning rate decay.”
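
A tiny illustration of this effect, assuming a constant gradient of 1.0 for a single parameter: the accumulator grows linearly, so the effective step size shrinks like $1/\sqrt{t}$.

import numpy as np

eta, eps = 0.1, 1e-8
G = 0.0
g = 1.0                              # constant gradient, purely illustrative
for step in range(1, 10_001):
    G += g ** 2                      # the accumulator only ever grows
    effective_step = eta * g / (np.sqrt(G) + eps)
    if step in (1, 100, 10_000):
        print(step, effective_step)  # 0.1, 0.01, 0.001: the step size decays like 1/sqrt(t)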

Implementation

class AdaGrad:
    """AdaGrad optimizer with per-parameter learning rates."""

    def __init__(self, learning_rate=0.01, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.epsilon = epsilon
        self.accumulated_grads = {}

    def update(self, params, grads):
        """Update parameters with adaptive learning rates."""
        for name in params:
            if name not in self.accumulated_grads:
                self.accumulated_grads[name] = np.zeros_like(params[name])

            # Accumulate squared gradients
            self.accumulated_grads[name] += grads[name] ** 2

            # Adaptive update
            params[name] -= (self.learning_rate * grads[name] /
                            (np.sqrt(self.accumulated_grads[name]) +
                             self.epsilon))

        return params

RMSprop: Solving AdaGrad’s Problems

RMSprop (proposed by Geoff Hinton in a Coursera lecture) uses an exponentially decaying average instead of accumulating all squared gradients:

\[E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) (\nabla L(\theta_t))^2\] \[\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \nabla L(\theta_t)\]

Where $\gamma$ is the decay rate (typically 0.9).

Why It Works

Because the average decays exponentially, old squared gradients are gradually forgotten: $E[g^2]_t$ tracks the recent magnitude of the gradients instead of growing without bound, so the effective learning rate no longer shrinks toward zero the way AdaGrad’s does.

Implementation

class RMSprop:
    """RMSprop optimizer with exponentially decaying gradient average."""

    def __init__(self, learning_rate=0.001, decay=0.9, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.decay = decay
        self.epsilon = epsilon
        self.moving_avg_sq = {}

    def update(self, params, grads):
        """Update parameters using RMSprop."""
        for name in params:
            if name not in self.moving_avg_sq:
                self.moving_avg_sq[name] = np.zeros_like(params[name])

            # Update moving average of squared gradients
            self.moving_avg_sq[name] = (self.decay * self.moving_avg_sq[name] +
                                        (1 - self.decay) * grads[name] ** 2)

            # Update parameters
            params[name] -= (self.learning_rate * grads[name] /
                            (np.sqrt(self.moving_avg_sq[name]) + self.epsilon))

        return params
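
The contrast with AdaGrad is easy to see numerically. Under the same constant-gradient assumption as before, RMSprop’s moving average settles at $g^2$, so the effective step size stabilizes near $\eta$ instead of decaying toward zero.

import numpy as np

eta, gamma, eps = 0.001, 0.9, 1e-8
avg_sq = 0.0
g = 1.0                              # constant gradient, purely illustrative
for step in range(10_000):
    avg_sq = gamma * avg_sq + (1 - gamma) * g ** 2
    effective_step = eta * g / (np.sqrt(avg_sq) + eps)

print(effective_step)  # ~0.001: unlike AdaGrad, the step size stabilizes instead of shrinking to zero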

Adam: The Best of Both Worlds

Adam (Adaptive Moment Estimation) was introduced by Diederik Kingma and Jimmy Ba in 2014 and combines the advantages of momentum and RMSprop. It maintains two moving averages:

  1. Momentum: Exponential moving average of gradients (first moment)
  2. RMSprop: Exponential moving average of squared gradients (second moment)

Plus bias correction to account for initialization at zero.

Mathematical Formulation

First moment (momentum-like): \(m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla L(\theta_t)\)

Second moment (RMSprop-like): \(v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla L(\theta_t))^2\)

Bias correction: \(\hat{m}_t = \frac{m_t}{1 - \beta_1^t}\) \(\hat{v}_t = \frac{v_t}{1 - \beta_2^t}\)

Update: \(\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t\)

Understanding Bias Correction

At $t=0$, we initialize $m_0 = v_0 = 0$, so in early steps the moving averages are biased toward zero. Dividing by $1 - \beta^t$ compensates for this.
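
For example, with $\beta_1 = 0.9$ the first moment after one step is only $m_1 = 0.1\, g_1$, ten times too small; dividing by $1 - \beta_1^1 = 0.1$ recovers the right scale:

\[\hat{m}_1 = \frac{0.1\, g_1}{1 - 0.9} = g_1\]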

Default Hyperparameters

The original Adam paper recommends:

  1. Learning rate $\eta = 0.001$
  2. $\beta_1 = 0.9$
  3. $\beta_2 = 0.999$
  4. $\epsilon = 10^{-8}$

These work well for most problems without tuning.

Implementation

class Adam:
    """Adam optimizer combining momentum and adaptive learning rates."""

    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999,
                 epsilon=1e-8):
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = {}  # First moment
        self.v = {}  # Second moment
        self.t = 0   # Time step

    def update(self, params, grads):
        """Update parameters using Adam."""
        self.t += 1

        for name in params:
            if name not in self.m:
                self.m[name] = np.zeros_like(params[name])
                self.v[name] = np.zeros_like(params[name])

            # Update first moment (momentum)
            self.m[name] = (self.beta1 * self.m[name] +
                           (1 - self.beta1) * grads[name])

            # Update second moment (RMSprop-like)
            self.v[name] = (self.beta2 * self.v[name] +
                           (1 - self.beta2) * grads[name] ** 2)

            # Bias correction
            m_hat = self.m[name] / (1 - self.beta1 ** self.t)
            v_hat = self.v[name] / (1 - self.beta2 ** self.t)

            # Update parameters
            params[name] -= (self.learning_rate * m_hat /
                            (np.sqrt(v_hat) + self.epsilon))

        return params

When Adam Struggles

Despite its popularity, Adam has known issues:

  1. Poor generalization: Sometimes converges to sharper minima than SGD with momentum
  2. Non-convergence: In some pathological cases, doesn’t converge (though rare in practice)
  3. Weight decay interaction: L2 regularization doesn’t work as expected

AdamW: Decoupled Weight Decay

AdamW is a variant of Adam that fixes the weight decay issue. The key insight is that implementing weight decay as L2 regularization inside Adam does not behave like true weight decay.

The Problem with L2 Regularization in Adam

In standard SGD, L2 regularization and weight decay are equivalent:

L2 Regularization: Add $\frac{\lambda}{2}||\theta||^2$ to loss \(\nabla L_{reg} = \nabla L + \lambda \theta\)

Weight Decay: Shrink weights after each update \(\theta_{t+1} = \theta_t - \eta \nabla L - \eta \lambda \theta_t\)

In Adam, these are NOT equivalent because the gradient is divided by $\sqrt{v_t}$:

L2 in Adam: The regularization gradient is also scaled by $\frac{1}{\sqrt{v_t}}$, which is wrong. We want consistent regularization regardless of the gradient history.
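
The difference is easy to quantify. The sketch below compares the size of the decay term under the two schemes for two parameters with the same value but very different second-moment estimates $v$; the values of $v$, $\eta$, and $\lambda$ are illustrative assumptions, not a full training run.

import numpy as np

eta, wd, eps = 0.001, 0.01, 1e-8
w = 1.0  # same weight value in both scenarios

# v = large: a parameter that has seen big gradients; v = tiny: one that has not
for v in (100.0, 0.0001):
    l2_decay = eta * (wd * w) / (np.sqrt(v) + eps)   # L2 folded into the Adam gradient: rescaled by 1/sqrt(v)
    decoupled_decay = eta * wd * w                   # AdamW: shrinkage independent of v
    print(f"v={v}: L2-style shrinkage={l2_decay:.1e}, decoupled shrinkage={decoupled_decay:.1e}")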

The Fix: Decoupled Weight Decay

AdamW applies weight decay directly to the weights, not through the gradient:

\[\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t - \eta \lambda \theta_t\]

Implementation

class AdamW:
    """AdamW optimizer with decoupled weight decay."""

    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999,
                 epsilon=1e-8, weight_decay=0.01):
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.weight_decay = weight_decay
        self.m = {}
        self.v = {}
        self.t = 0

    def update(self, params, grads):
        """Update parameters using AdamW."""
        self.t += 1

        for name in params:
            if name not in self.m:
                self.m[name] = np.zeros_like(params[name])
                self.v[name] = np.zeros_like(params[name])

            # Update moments
            self.m[name] = (self.beta1 * self.m[name] +
                           (1 - self.beta1) * grads[name])
            self.v[name] = (self.beta2 * self.v[name] +
                           (1 - self.beta2) * grads[name] ** 2)

            # Bias correction
            m_hat = self.m[name] / (1 - self.beta1 ** self.t)
            v_hat = self.v[name] / (1 - self.beta2 ** self.t)

            # Adam update
            params[name] -= (self.learning_rate * m_hat /
                            (np.sqrt(v_hat) + self.epsilon))

            # Decoupled weight decay
            params[name] -= self.learning_rate * self.weight_decay * params[name]

        return params

Why AdamW Matters

AdamW is now the default optimizer for many state-of-the-art models, including most Transformer-based language and vision models.

The “W” in AdamW stands for “weight decay”, a small change with big implications.

Advanced Optimizers: LAMB, LARS, and Beyond

The Large Batch Challenge

As we train larger models, we want to use larger batches to speed up training (more parallelism). But large batches often hurt generalization: the model converges to sharper, less generalizable minima.

LARS (Layer-wise Adaptive Rate Scaling)

LARS scales the learning rate differently for each layer based on the ratio of weight norm to gradient norm:

\[\lambda_l = \frac{||\theta_l||}{||\nabla L(\theta_l)||}\] \[\theta_l^{t+1} = \theta_l^t - \eta \, \lambda_l \, \nabla L(\theta_l)\]

This prevents layers with small gradients from getting left behind.
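
A minimal sketch of one LARS-style step (without momentum or a trust coefficient, which the full algorithm adds); the layer values and learning rate are illustrative assumptions.

import numpy as np

def lars_update(theta, grad, eta=0.1, eps=1e-9):
    """One LARS-style step: the trust ratio makes the step size proportional to the weight norm."""
    trust_ratio = np.linalg.norm(theta) / (np.linalg.norm(grad) + eps)  # lambda_l above
    return theta - eta * trust_ratio * grad

# Layers with very different gradient magnitudes still take comparably sized steps
layer = np.ones(100)                                            # ||theta|| = 10
step_small = layer - lars_update(layer, 1e-6 * np.ones(100))    # tiny gradients
step_large = layer - lars_update(layer, 1e+2 * np.ones(100))    # huge gradients
print(np.linalg.norm(step_small), np.linalg.norm(step_large))   # both ~ eta * ||theta|| = 1.0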

LAMB (Layer-wise Adaptive Moments for Batch training)

LAMB combines LARS with Adam, enabling large batch training with adaptive learning rates:

\[r_t = \frac{m_t}{\sqrt{v_t} + \epsilon}\] \[\theta_{t+1} = \theta_t - \eta \frac{||\theta_t||}{||r_t + \lambda \theta_t||} (r_t + \lambda \theta_t)\]

LAMB was used to train BERT in 76 minutes (vs 3+ days with standard Adam).

Other Modern Optimizers

Several other optimizers are popular in modern deep learning but fall outside the scope of this series; if you are curious, they are worth exploring on your own.

Learning Rate Scheduling

Why Schedule?

A fixed learning rate has limitations: early in training a large rate helps make fast progress, while later a smaller rate is needed to settle into a good minimum rather than bouncing around it. No single fixed value serves both phases well.

Learning rate scheduling adjusts $\eta$ during training.

Common Schedules

Step Decay: \(\eta_t = \eta_0 \times \gamma^{\lfloor t / s \rfloor}\)

Drop the learning rate by $\gamma$ every $s$ epochs.

Exponential Decay: \(\eta_t = \eta_0 \times e^{-kt}\)

Continuous exponential decrease.

Cosine Annealing: \(\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})(1 + \cos(\frac{t\pi}{T}))\)

Smooth cosine decay from $\eta_{max}$ to $\eta_{min}$.

Warmup + Decay: Start with a low learning rate, increase linearly for warmup steps, then decay:

\[\eta_t = \begin{cases} \eta_{max} \times \frac{t}{T_{warmup}} & t < T_{warmup} \\ \text{decay schedule} & t \geq T_{warmup} \end{cases}\]

Implementation

class LearningRateScheduler:
    """Learning rate schedulers for training."""

    @staticmethod
    def step_decay(initial_lr, epoch, drop_rate=0.5, epochs_drop=10):
        """Step decay: drop lr by drop_rate every epochs_drop epochs."""
        return initial_lr * (drop_rate ** (epoch // epochs_drop))

    @staticmethod
    def exponential_decay(initial_lr, epoch, decay_rate=0.95):
        """Exponential decay."""
        return initial_lr * (decay_rate ** epoch)

    @staticmethod
    def cosine_annealing(initial_lr, epoch, total_epochs, min_lr=0):
        """Cosine annealing from initial_lr to min_lr."""
        import math
        return min_lr + 0.5 * (initial_lr - min_lr) * (
            1 + math.cos(math.pi * epoch / total_epochs)
        )

    @staticmethod
    def warmup_cosine(initial_lr, epoch, total_epochs,
                      warmup_epochs=5, min_lr=0):
        """Linear warmup followed by cosine decay."""
        import math
        if epoch < warmup_epochs:
            return initial_lr * epoch / warmup_epochs
        else:
            progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
            return min_lr + 0.5 * (initial_lr - min_lr) * (
                1 + math.cos(math.pi * progress)
            )

One Cycle Policy

The one cycle policy (by Leslie Smith) uses:

  1. Warmup from low LR to high LR
  2. Annealing from high LR to very low LR
  3. Often combined with momentum cycling (high momentum when LR is low)

This often achieves faster convergence and better results.
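
A hedged sketch of a simple one cycle schedule for both the learning rate and momentum; the peak values, warmup fraction, and final annealing floor are illustrative assumptions.

import math

def one_cycle(step, total_steps, max_lr=0.1, min_lr=0.004, final_lr=1e-4,
              max_momentum=0.95, min_momentum=0.85, warmup_frac=0.3):
    """One cycle sketch: LR rises then falls while momentum moves in the opposite direction."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Phase 1: LR warms up linearly, momentum decreases
        progress = step / warmup_steps
        lr = min_lr + (max_lr - min_lr) * progress
        momentum = max_momentum - (max_momentum - min_momentum) * progress
    else:
        # Phase 2: cosine anneal LR down to final_lr, momentum back up
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        cos = 0.5 * (1 + math.cos(math.pi * progress))
        lr = final_lr + (max_lr - final_lr) * cos
        momentum = max_momentum - (max_momentum - min_momentum) * cos
    return lr, momentum

for step in (0, 300, 600, 999):
    print(step, one_cycle(step, total_steps=1000))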

Choosing The Right Optimizer

Decision Framework

What type of problem?
├── Computer Vision (CNNs)
│   ├── Training from scratch → SGD + Momentum + LR Schedule
│   ├── Fine-tuning → AdamW with low LR
│   └── Large batch → LARS or LAMB
├── NLP / Transformers
│   ├── Training from scratch → AdamW + Warmup + Cosine Decay
│   ├── Fine-tuning → AdamW, LR = 1e-5 to 5e-5
│   └── Large models (GPT-3 scale) → AdamW with careful tuning
├── Reinforcement Learning
│   ├── Policy gradients → Adam
│   └── Value functions → RMSprop or Adam
└── General / Unsure
    ├── Start with AdamW (safest default)
    ├── If poor generalization → Try SGD + Momentum
    └── If sparse features → AdaGrad or Adam

Optimizer Comparison Table

Optimizer     | Learning Rate | Momentum     | Adaptive LR | Weight Decay | Best For
SGD           | 0.01-0.1      | No           | No          | L2           | Simple problems
SGD+Momentum  | 0.01-0.1      | Yes          | No          | L2           | CNNs, CV
Nesterov      | 0.01-0.1      | Yes (better) | No          | L2           | Convex problems
AdaGrad       | 0.01          | No           | Yes         | L2           | Sparse features
RMSprop       | 0.001         | No           | Yes         | L2           | RNNs, RL
Adam          | 0.001         | Yes          | Yes         | L2 (issues)  | General default
AdamW         | 0.001         | Yes          | Yes         | Decoupled    | Transformers, BERT
LAMB          | 0.001         | Yes          | Yes         | Decoupled    | Large batch training

Conclusion

We’ve traveled from simple gradient descent to sophisticated adaptive methods. Each optimizer represents a different strategy for navigating the loss landscape: SGD follows the raw, noisy gradient; momentum and Nesterov smooth it with a velocity term; AdaGrad and RMSprop adapt the step size per parameter; Adam and AdamW combine both ideas; and LARS and LAMB extend them to very large batches.

The best optimizer is the one that works for your specific problem. Experiment, monitor, and adapt.
