Deep Learning Series

by Mayank Sharma

Batch and Layer Normalization: Stabilizing Deep Neural Network Training

Jan 08, 2026

Introduction: The Challenge of Training Deep Networks

Imagine you’re conducting an orchestra where each musician plays at a wildly different volume: some whisper softly while others blast at maximum intensity. The conductor struggles to balance the ensemble, constantly shifting attention between the quietest flute and the loudest timpani. Training a deep neural network without normalization is analogous: as data flows through the layers, its statistical properties change unpredictably, making optimization difficult and unstable.

Normalization techniques solve this problem by standardizing the inputs to each layer, ensuring consistent statistical properties throughout the network. This seemingly simple idea has revolutionized deep learning, enabling the training of much deeper networks, faster convergence, and better generalization.

In 2015, Sergey Ioffe and Christian Szegedy introduced Batch Normalization, which became one of the most impactful innovations in modern deep learning. Later, Layer Normalization and other variants emerged to address specific limitations of batch normalization, particularly in recurrent networks and scenarios with small batch sizes.

Why Normalization Matters

Training deep neural networks presents several fundamental challenges:

  1. Internal Covariate Shift: As parameters update during training, the distribution of inputs to each layer changes, forcing subsequent layers to continuously adapt to these shifting distributions.

  2. Vanishing/Exploding Gradients: In deep networks, gradients can become extremely small or large as they propagate backward through many layers, making training unstable.

  3. Sensitivity to Initialization: Poor weight initialization can significantly slow down training or prevent convergence entirely.

  4. Slow Convergence: Without normalization, networks often require careful tuning of learning rates and take much longer to train.

Normalization techniques address these challenges by maintaining stable activation distributions throughout training, leading to faster and more stable convergence, higher usable learning rates, reduced sensitivity to initialization, and often better generalization.

The Internal Covariate Shift Problem

To understand why normalization is crucial, we must first understand the internal covariate shift problem.

What is Internal Covariate Shift?

Internal covariate shift refers to the change in the distribution of network activations due to parameter updates during training. Consider a simple feedforward network:

Input → Layer 1 → Activation → Layer 2 → Activation → Output

When we update the parameters of Layer 1 during backpropagation, the distribution of outputs from Layer 1 changes. This means Layer 2 must continuously adapt to a “moving target”: the input distribution keeps changing even though the learning task remains the same.

A Concrete Example

Suppose Layer 2 has learned to expect inputs with mean 0 and standard deviation 1. After one training iteration, Layer 1’s updated parameters produce outputs with a different mean and spread, so the expectations Layer 2 has built no longer match its inputs and it must readjust.

This continuous readjustment across all layers slows down training and can lead to instability.
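To make this drift tangible, here is a tiny NumPy sketch (purely illustrative, not from the original paper): a fixed batch of inputs is pushed through a layer before and after a simulated weight update, and the output statistics shift.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(1000, 64))    # a fixed batch of inputs
W = rng.normal(0.0, 0.1, size=(64, 64))      # "Layer 1" weights

h = np.maximum(x @ W, 0.0)                   # ReLU activations before the update
print(f"before update: mean={h.mean():.3f}, std={h.std():.3f}")

W = W + rng.normal(0.0, 0.05, size=W.shape)  # simulate a parameter update to Layer 1
h = np.maximum(x @ W, 0.0)                   # same inputs, shifted output distribution
print(f"after update:  mean={h.mean():.3f}, std={h.std():.3f}")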

Mathematical Formulation

For a layer computing $y = f(Wx + b)$ where $f$ is an activation function:

\[\mu_y = \mathbb{E}[y], \quad \sigma_y^2 = \text{Var}[y]\]

During training, as $W$ and $b$ update, $\mu_y$ and $\sigma_y^2$ change, causing the covariate shift problem. This shift compounds across layers, becoming more severe in deeper networks.

Batch Normalization: Theory and Mathematics

Batch Normalization (BN) addresses internal covariate shift by normalizing layer inputs to have zero mean and unit variance, computed across the mini-batch.

Core Idea

For a mini-batch of activations $\mathcal{B} = \{x_1, x_2, \ldots, x_m\}$:

  1. Compute batch statistics: \(\mu_\mathcal{B} = \frac{1}{m} \sum_{i=1}^{m} x_i\) \(\sigma_\mathcal{B}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_\mathcal{B})^2\)

  2. Normalize: \(\hat{x}_i = \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}}\) where $\epsilon$ (typically $10^{-5}$) prevents division by zero.

  3. Scale and shift (learnable parameters): \(y_i = \gamma \hat{x}_i + \beta\)

The parameters $\gamma$ (scale) and $\beta$ (shift) are learned during training, allowing the network to undo the normalization if needed. This is crucial because simply forcing all activations to have zero mean and unit variance might limit the network’s representational power.

Why Scale and Shift Parameters?

The learned parameters $\gamma$ and $\beta$ give the network flexibility. In the extreme case where: \(\gamma = \sqrt{\sigma_\mathcal{B}^2 + \epsilon}, \quad \beta = \mu_\mathcal{B}\)

The normalization is completely undone: $y_i = x_i$. This allows the network to learn the optimal amount of normalization for each layer.

Batch Normalization Algorithm

Forward Pass (Training):

Input: Mini-batch B = {x₁, x₂, ..., xₘ}
Parameters: γ (scale), β (shift)
Hyperparameter: ε (small constant)

1. Compute batch mean:
   μ_B = (1/m) Σᵢ xᵢ

2. Compute batch variance:
   σ²_B = (1/m) Σᵢ (xᵢ - μ_B)²

3. Normalize:
   x̂ᵢ = (xᵢ - μ_B) / √(σ²_B + ε)

4. Scale and shift:
   yᵢ = γ x̂ᵢ + β

5. Update running statistics (for inference):
   μ_running = momentum × μ_running + (1 - momentum) × μ_B
   σ²_running = momentum × σ²_running + (1 - momentum) × σ²_B

Forward Pass (Inference):

Use pre-computed running statistics instead of batch statistics:
x̂ = (x - μ_running) / √(σ²_running + ε)
y = γ x̂ + β
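The algorithm above translates almost line for line into code. Here is a minimal NumPy sketch of the training and inference forward passes (the function and variable names are mine, not from any particular framework):

import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      momentum=0.9, eps=1e-5, training=True):
    """x: (m, d) mini-batch; gamma, beta: (d,) learnable parameters."""
    if training:
        mu = x.mean(axis=0)                          # step 1: batch mean
        var = x.var(axis=0)                          # step 2: batch variance
        # step 5: exponential moving averages used later at inference
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var          # inference: use running stats
    x_hat = (x - mu) / np.sqrt(var + eps)            # step 3: normalize
    y = gamma * x_hat + beta                         # step 4: scale and shift
    return y, running_mean, running_var

d = 8
x = np.random.randn(32, d)
y, rm, rv = batchnorm_forward(x, np.ones(d), np.zeros(d), np.zeros(d), np.ones(d))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ≈ 0 and ≈ 1 per feature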

Backpropagation Through Batch Normalization

Computing gradients for batch normalization requires careful application of the chain rule. Given the loss gradient $\frac{\partial \mathcal{L}}{\partial y_i}$:

1. Gradient w.r.t. scale and shift parameters: \(\frac{\partial \mathcal{L}}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial \mathcal{L}}{\partial y_i} \cdot \hat{x}_i\) \(\frac{\partial \mathcal{L}}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial \mathcal{L}}{\partial y_i}\)

2. Gradient w.r.t. normalized input: \(\frac{\partial \mathcal{L}}{\partial \hat{x}_i} = \frac{\partial \mathcal{L}}{\partial y_i} \cdot \gamma\)

3. Gradient w.r.t. variance: \(\frac{\partial \mathcal{L}}{\partial \sigma_\mathcal{B}^2} = \sum_{i=1}^{m} \frac{\partial \mathcal{L}}{\partial \hat{x}_i} \cdot (x_i - \mu_\mathcal{B}) \cdot \frac{-1}{2} (\sigma_\mathcal{B}^2 + \epsilon)^{-3/2}\)

4. Gradient w.r.t. mean: \(\frac{\partial \mathcal{L}}{\partial \mu_\mathcal{B}} = \sum_{i=1}^{m} \frac{\partial \mathcal{L}}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}} + \frac{\partial \mathcal{L}}{\partial \sigma_\mathcal{B}^2} \cdot \frac{-2}{m} \sum_{i=1}^{m} (x_i - \mu_\mathcal{B})\)

5. Finally, gradient w.r.t. input: \(\frac{\partial \mathcal{L}}{\partial x_i} = \frac{\partial \mathcal{L}}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}} + \frac{\partial \mathcal{L}}{\partial \sigma_\mathcal{B}^2} \cdot \frac{2(x_i - \mu_\mathcal{B})}{m} + \frac{\partial \mathcal{L}}{\partial \mu_\mathcal{B}} \cdot \frac{1}{m}\)

These gradients might look intimidating, but modern deep learning frameworks compute them automatically using automatic differentiation.
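Still, it can be instructive to transcribe the five gradient steps directly. The NumPy sketch below does exactly that, with no attempt at the algebraic simplifications real frameworks use:

import numpy as np

def batchnorm_backward(dL_dy, x, gamma, eps=1e-5):
    """dL_dy, x: (m, d); gamma: (d,). Returns gradients w.r.t. x, gamma, beta."""
    m = x.shape[0]
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)

    dgamma = np.sum(dL_dy * x_hat, axis=0)                                 # step 1
    dbeta = np.sum(dL_dy, axis=0)
    dx_hat = dL_dy * gamma                                                 # step 2
    dvar = np.sum(dx_hat * (x - mu) * -0.5 * (var + eps) ** -1.5, axis=0)  # step 3
    dmu = (np.sum(dx_hat * -1.0 / np.sqrt(var + eps), axis=0)
           + dvar * -2.0 * np.sum(x - mu, axis=0) / m)                     # step 4
    dx = (dx_hat / np.sqrt(var + eps)
          + dvar * 2.0 * (x - mu) / m
          + dmu / m)                                                       # step 5
    return dx, dgamma, dbeta

m, d = 16, 4
x = np.random.randn(m, d)
dx, dg, db = batchnorm_backward(np.ones((m, d)), x, np.ones(d))
print(dx.shape, dg.shape, db.shape)   # (16, 4) (4,) (4,)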

Where to Place Batch Normalization?

The original paper suggested placing BN before the activation function:

Linear layer: z = Wx + b
Batch Norm: z_norm = BN(z)
Activation: a = f(z_norm)

However, empirical studies have shown that placing BN after the activation can also work well:

Linear layer: z = Wx + b
Activation: a = f(z)
Batch Norm: a_norm = BN(a)

The choice often depends on the specific architecture and task. Modern implementations typically place BN before the activation, allowing the activation function to operate on normalized inputs.

Batch Normalization in Convolutional Networks

For convolutional layers, batch normalization normalizes across both the batch dimension and spatial dimensions (height and width), but separately for each feature map (channel).

Given a 4D tensor of shape (N, C, H, W), where N is the batch size, C is the number of channels (feature maps), H is the height, and W is the width:

BN computes statistics over dimensions (N, H, W) for each channel independently. This means we learn C pairs of $(\gamma, \beta)$ parameters.

Example: For a feature map with shape (32, 64, 28, 28):

BN computes mean and variance over 32 × 28 × 28 = 25,088 values for each of the 64 channels, resulting in 64 mean values and 64 variance values.
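A quick NumPy sketch confirms the counting for this example (shapes only; the learnable γ and β are omitted):

import numpy as np

x = np.random.randn(32, 64, 28, 28)   # (N, C, H, W)
mean = x.mean(axis=(0, 2, 3))         # one mean per channel
var = x.var(axis=(0, 2, 3))           # one variance per channel
print(mean.shape, var.shape)          # (64,) (64,)
# each statistic is estimated from 32 * 28 * 28 = 25,088 values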

Layer Normalization: An Alternative Approach

While batch normalization works excellently for feedforward and convolutional networks with large batch sizes, it has limitations:

  1. Batch size dependency: Performance degrades with small batch sizes (common in distributed training or resource-constrained settings)
  2. Recurrent networks: RNNs have different sequence lengths and temporal dynamics that make batch statistics problematic
  3. Training-inference discrepancy: Different behavior during training (uses batch statistics) and inference (uses running statistics)

Layer Normalization (LN), introduced by Ba, Kiros, and Hinton in 2016, addresses these issues by normalizing across the feature dimension instead of the batch dimension.

Core Difference from Batch Normalization

Batch Normalization: Normalizes across the batch for each feature \(\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_{ij}, \quad \text{(across batch for feature } j \text{)}\)

Layer Normalization: Normalizes across all features for each example \(\mu_i = \frac{1}{d} \sum_{j=1}^{d} x_{ij}, \quad \text{(across features for example } i \text{)}\)

where $d$ is the number of features (layer width).

Layer Normalization Algorithm

Forward Pass:

Input: Single example x = [x₁, x₂, ..., x_d]
Parameters: γ (scale), β (shift)
Hyperparameter: ε (small constant)

1. Compute mean across features:
   μ = (1/d) Σⱼ xⱼ

2. Compute variance across features:
   σ² = (1/d) Σⱼ (xⱼ - μ)²

3. Normalize:
   x̂ⱼ = (xⱼ - μ) / √(σ² + ε)

4. Scale and shift:
   yⱼ = γ x̂ⱼ + β

Key advantage: The same computation applies during both training and inference since statistics are computed per-example, not per-batch.

Mathematical Formulation

For an input vector $\mathbf{x} \in \mathbb{R}^d$:

\(\mu = \frac{1}{d} \sum_{i=1}^{d} x_i\) \(\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2\) \(\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}\) \(y_i = \gamma \hat{x}_i + \beta\)

where $\gamma$ and $\beta$ are learned parameters of the same dimensionality as $\mathbf{x}$.
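A minimal NumPy sketch of this formulation (names are mine; note that nothing depends on the batch dimension):

import numpy as np

def layernorm_forward(x, gamma, beta, eps=1e-5):
    """x: (..., d); gamma, beta: (d,). Statistics are per example, over features."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

d = 16
x = np.random.randn(4, d)                    # works for any batch size, even 1
y = layernorm_forward(x, np.ones(d), np.zeros(d))
print(y.mean(axis=-1).round(3), y.std(axis=-1).round(3))   # ≈ 0 and ≈ 1 per example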

Layer Normalization in Transformers

Layer normalization became the standard choice for Transformer architectures (the foundation of modern language models like GPT and BERT). In Transformers:

# Self-attention sub-layer
z = LayerNorm(x + SelfAttention(x))

# Feed-forward sub-layer
output = LayerNorm(z + FeedForward(z))

The “Add & Norm” pattern (residual connection + layer normalization) is crucial for training very deep Transformers.
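As a rough PyTorch sketch of this post-LN “Add & Norm” pattern (the module structure and dimensions here are illustrative, not taken from any specific model):

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Post-LN encoder block: sub-layer -> residual add -> LayerNorm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        z = self.norm1(x + attn_out)         # Add & Norm after self-attention
        return self.norm2(z + self.ff(z))    # Add & Norm after feed-forward

x = torch.randn(2, 10, 512)                  # (batch, sequence, d_model)
print(TransformerBlock()(x).shape)           # torch.Size([2, 10, 512])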

Comparing Normalization Techniques

Batch Normalization vs. Layer Normalization

| Aspect | Batch Normalization | Layer Normalization |
|---|---|---|
| Normalization axis | Across batch (same feature) | Across features (same example) |
| Batch size sensitivity | High (fails with small batches) | None (batch-independent) |
| Training vs. inference | Different (uses running statistics at inference) | Same (per-example statistics) |
| Best for | CNNs, large batch sizes | RNNs, Transformers, small batches |
| Learnable parameters | 2 per feature | 2 per feature |
| Computational cost | Low | Low |
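The “Normalization axis” row is easy to see in code: for the same (batch, features) tensor, the two techniques simply reduce over different axes. A short NumPy illustration:

import numpy as np

x = np.random.randn(32, 64)                        # (batch, features)

# Batch Norm: statistics per feature, computed over the batch axis
bn_mean, bn_var = x.mean(axis=0), x.var(axis=0)    # shape (64,)

# Layer Norm: statistics per example, computed over the feature axis
ln_mean, ln_var = x.mean(axis=1), x.var(axis=1)    # shape (32,)

print(bn_mean.shape, ln_mean.shape)                # (64,) (32,)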

When to Use Each

Use Batch Normalization when training feedforward or convolutional networks with reasonably large batch sizes, where mini-batch statistics are reliable estimates of the overall data distribution.

Use Layer Normalization when working with recurrent networks, Transformers, or other sequence models; when batch sizes are small or variable; or when identical behavior during training and inference matters.

Other Normalization Variants

Several other normalization techniques have been proposed:

  1. Instance Normalization: Normalizes each channel in each example independently (popular in style transfer) \(\text{Normalize across spatial dimensions for each (example, channel) pair}\)

  2. Group Normalization: Divides channels into groups and normalizes within each group (between Layer Norm and Instance Norm; see the sketch after this list) \(\text{Normalize across channels within groups for each example}\)

  3. Weight Normalization: Normalizes weight matrices instead of activations \(\mathbf{w} = \frac{g}{\|\mathbf{v}\|} \mathbf{v}\)

  4. Spectral Normalization: Constrains the spectral norm (largest singular value) of weight matrices
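To make the distinction between these variants concrete, here is a minimal NumPy sketch of Group Normalization (the learnable per-channel scale and shift are omitted); Instance Normalization is the special case of one channel per group, and Layer Normalization the case of a single group:

import numpy as np

def groupnorm(x, num_groups, eps=1e-5):
    """x: (N, C, H, W). Normalize over (channels in group, H, W) per example."""
    N, C, H, W = x.shape
    g = x.reshape(N, num_groups, C // num_groups, H, W)
    mu = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    g_hat = (g - mu) / np.sqrt(var + eps)
    return g_hat.reshape(N, C, H, W)      # learnable per-channel γ, β omitted

x = np.random.randn(8, 32, 14, 14)
print(groupnorm(x, num_groups=4).shape)   # (8, 32, 14, 14)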

Implementation Details and Best Practices

Training vs. Inference Mode

Batch Normalization requires careful handling of training vs. inference:

# Training mode
model.train()
# Uses batch statistics
# Updates running statistics

# Inference mode
model.eval()
# Uses running statistics computed during training
# No updates to running statistics

Layer Normalization uses the same computation in both modes, simplifying deployment.
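A small PyTorch check illustrates the difference (exact values will vary with the random input; the point is that BatchNorm’s output changes between modes while LayerNorm’s does not):

import torch
import torch.nn as nn

bn, ln = nn.BatchNorm1d(4), nn.LayerNorm(4)
x = torch.randn(8, 4)

bn.train(); y_bn_train = bn(x)
bn.eval();  y_bn_eval = bn(x)                  # now uses running statistics
print(torch.allclose(y_bn_train, y_bn_eval))   # False: outputs differ by mode

ln.train(); y_ln_train = ln(x)
ln.eval();  y_ln_eval = ln(x)
print(torch.allclose(y_ln_train, y_ln_eval))   # True: same computation in both modes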

Momentum for Running Statistics

Batch normalization maintains exponential moving averages of mean and variance:

running_mean = momentum × running_mean + (1 - momentum) × batch_mean
running_var = momentum × running_var + (1 - momentum) × batch_var

Typical momentum values: 0.9 or 0.99
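One caveat: frameworks differ in how they define momentum. PyTorch, for instance, writes the update as running = (1 - momentum) × running + momentum × batch and defaults to momentum = 0.1, which corresponds to 0.9 in the convention above:

import torch
import torch.nn as nn

# PyTorch convention: momentum weights the *new* batch statistic
bn = nn.BatchNorm1d(4, momentum=0.1)   # equivalent to 0.9 in the formula above
x = torch.randn(64, 4)
bn.train()
_ = bn(x)
print(bn.running_mean)                 # ≈ 0.9 × 0 + 0.1 × batch mean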

Initialization

Normalization parameters are typically initialized as γ = 1 (scale) and β = 0 (shift).

This makes normalization act as the identity function initially, allowing the network to learn gradually.
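In PyTorch, for example, both BatchNorm and LayerNorm already follow this default, which a quick check confirms:

import torch.nn as nn

bn = nn.BatchNorm1d(4)
ln = nn.LayerNorm(4)
print(bn.weight.data, bn.bias.data)   # γ initialized to ones, β to zeros
print(ln.weight.data, ln.bias.data)   # same defaults for LayerNorm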

Placement in the Network

For feedforward layers:

Linear → BatchNorm → ReLU → Dropout

For convolutional layers:

Conv2D → BatchNorm2D → ReLU → MaxPool

For Transformer layers:

x = x + Attention(LayerNorm(x))
x = x + FeedForward(LayerNorm(x))

Note: Recent research has explored “Pre-LN” (layer norm before sub-layers) vs. “Post-LN” (layer norm after sub-layers) in Transformers.

Hyperparameter: Epsilon (ε)

The small constant $\epsilon$ prevents division by zero: \(\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}\)

Typical values: $10^{-5}$ to $10^{-8}$

Too large: affects normalization quality.
Too small: numerical instability.

Computational Complexity

Both Batch Norm and Layer Norm have O(d) complexity where $d$ is the feature dimension, making them very efficient compared to the layer computations themselves (O(d²) for fully connected layers).

Practical Applications and Impact

Training Deep Residual Networks

Batch normalization was crucial for training the original ResNet (Residual Networks) architectures with 50, 101, or even 152 layers. Without BN, such deep networks were nearly impossible to train.

Transformer Models

Layer normalization enabled the explosive growth of Transformer-based models, including BERT, the GPT family, and the many large language models built on them.

These models would be nearly impossible to train without layer normalization.

Higher Learning Rates

Normalization allows using substantially higher learning rates, often an order of magnitude or more, dramatically reducing training time.

Regularization Effect

Both BN and LN act as implicit regularizers. In particular, the noise that batch normalization injects through mini-batch statistics has a mild, dropout-like effect, often reducing the need for other regularization.

Advantages and Limitations

Advantages of Batch Normalization

  1. Enables much higher learning rates and faster convergence
  2. Reduces sensitivity to weight initialization
  3. Provides a mild, implicit regularization effect
  4. Works extremely well for CNNs trained with large batch sizes

Limitations of Batch Normalization

  1. Performance degrades with small batch sizes
  2. Requires maintaining running statistics, so training and inference behave differently
  3. Awkward to apply to recurrent networks and variable-length sequences

Advantages of Layer Normalization

  1. Independent of batch size, so it works even with a batch size of one
  2. Identical computation during training and inference
  3. Well suited to RNNs, Transformers, and other sequence models

Limitations of Layer Normalization

  1. Often less effective than batch normalization in convolutional networks trained with large batches
  2. Lacks the mild regularization that noisy batch statistics provide

Conclusion

Batch and layer normalization represent fundamental breakthroughs in deep learning, transforming how we train neural networks. They address critical challenges in training dynamics and deep architecture design, and as deep learning continues to evolve, normalization remains a core component of modern architectures. Understanding these techniques, from their mathematical foundations to their practical implementation, is essential for anyone building state-of-the-art models.

The journey from simple feedforward networks to today’s massive Transformer models was only possible because of innovations like batch and layer normalization. As we push toward even larger and more capable models, these fundamental techniques continue to play a vital role in making the impossible possible.


Further Reading and Resources

Seminal Papers

  1. Ioffe, S., & Szegedy, C. (2015). “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”
  2. Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). “Layer Normalization”
  3. Wu, Y., & He, K. (2018). “Group Normalization”