
Deep Learning Series

by Mayank Sharma

Weight Initialization: The Critical First Step in Neural Network Training

Jan 15, 2026

Introduction: The Starting Point Matters

Continuing our Deep Learning Series, we now turn to weight initialization. Imagine you're dropped into an unfamiliar mountain range with the goal of finding the lowest valley. Where you start dramatically affects your journey: you might begin near a good path and descend smoothly to your destination, or you might start on a steep cliff or a flat plateau and tumble uncontrollably or wander aimlessly forever.

This is precisely the situation neural networks face. Before any training begins, we must assign initial values to the millions (or billions) of weights in the network. These initial values (the weight initialization) determine the starting point in the optimization landscape. A good initialization leads to smooth, stable training; a poor one can cause the network to fail catastrophically before it learns anything at all.

For decades, weight initialization was treated as a minor detail. Researchers would initialize weights with small random numbers and hope for the best. But as networks grew deeper, this casual approach led to mysterious training failures; the gradients would either explode to infinity or vanish to zero, making learning impossible. The breakthrough came when researchers realized that the statistics of activations and gradients must be carefully controlled across layers. This insight led to principled initialization schemes, such as Xavier/Glorot initialization for sigmoid/tanh networks and He/Kaiming initialization for ReLU networks, that transformed deep learning from an unreliable art into a more predictable science.

Why Initialization Matters

Consider a network with 100 layers. During forward propagation, the input signal passes through all 100 layers. During backpropagation, gradients flow backward through the same 100 layers. If each layer slightly amplifies or attenuates the signal, these effects compound exponentially: a per-layer factor of 1.1 grows to $1.1^{100} \approx 13{,}781$, while a factor of 0.9 shrinks to $0.9^{100} \approx 0.00003$.

Proper initialization ensures that signals neither explode nor vanish, maintaining healthy magnitudes throughout the network.

The Vanishing and Exploding Gradient Problem

Before diving into initialization techniques, we must understand the problems they solve. To do that, we first need to explore the dynamics of forward and backward propagation in neural networks.

Forward Propagation Dynamics

Let’s consider a simple feedforward network without biases:

\[\mathbf{h}^{[l]} = f(\mathbf{W}^{[l]} \mathbf{h}^{[l-1]})\]

where $f$ is the activation function and $\mathbf{W}^{[l]}$ is the weight matrix for layer $l$.

The variance of activations at layer $l$ depends on:

  1. The variance of the previous layer’s activations
  2. The variance of the weights
  3. The number of input connections (fan-in)
  4. The activation function

Now, for a linear network (no activation) with $n$ inputs per neuron:

\[\text{Var}(h^{[l]}) = n \cdot \text{Var}(W) \cdot \text{Var}(h^{[l-1]})\]

If $n \cdot \text{Var}(W) > 1$, activations grow exponentially with depth. If $n \cdot \text{Var}(W) < 1$, activations shrink exponentially with depth.

Backward Propagation Dynamics

During backpropagation, gradients flow in reverse:

\[\frac{\partial \mathcal{L}}{\partial \mathbf{h}^{[l-1]}} = (\mathbf{W}^{[l]})^T \left( \frac{\partial \mathcal{L}}{\partial \mathbf{h}^{[l]}} \odot f'(\mathbf{z}^{[l]}) \right)\]

where $\mathbf{z}^{[l]} = \mathbf{W}^{[l]} \mathbf{h}^{[l-1]}$ is the pre-activation at layer $l$.

The gradient magnitude depends on:

  1. The magnitude of weights (transpose)
  2. The derivative of the activation function
  3. The gradient from the next layer

Similar exponential effects occur: if gradients grow or shrink at each layer, they explode or vanish over many layers.

A Concrete Example

Let’s trace through a 5-layer network with poor initialization:

Too Large Initialization ($W \sim \mathcal{N}(0, 1)$, 100 neurons per layer):

| Layer | Expected Activation Variance |
|-------|------------------------------|
| 1 | $100 \times 1 = 100$ |
| 2 | $100 \times 100 = 10,000$ |
| 3 | $100 \times 10,000 = 1,000,000$ |
| 4 | $100 \times 1,000,000 = 10^8$ |
| 5 | $100 \times 10^8 = 10^{10}$ |

Activations explode! By layer 5, values are astronomically large.

Too Small Initialization ($W \sim \mathcal{N}(0, 0.001)$):

| Layer | Expected Activation Variance |
|-------|------------------------------|
| 1 | $100 \times 0.001 = 0.1$ |
| 2 | $100 \times 0.001 \times 0.1 = 0.01$ |
| 3 | $100 \times 0.001 \times 0.01 = 0.001$ |
| 4 | $100 \times 0.001 \times 0.001 = 0.0001$ |
| 5 | $100 \times 0.001 \times 0.0001 = 0.00001$ |

Activations vanish! Information is lost.
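
This compounding is easy to verify numerically. Below is a minimal NumPy sketch (the helper name, batch size, and seed are my own choices) that passes a unit-variance batch through five linear layers of 100 units and prints the activation variance at each layer for both weight scales used above.

import numpy as np

def layer_variances(w_var, depth=5, width=100, batch=1000, seed=0):
    """Measure activation variance per layer in a purely linear network."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, width))             # unit-variance inputs
    variances = []
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * np.sqrt(w_var)
        h = h @ W                                        # linear layer, no activation
        variances.append(h.var())
    return variances

print(layer_variances(w_var=1.0))     # roughly 1e2, 1e4, 1e6, 1e8, 1e10 (explodes)
print(layer_variances(w_var=0.001))   # roughly 0.1, 0.01, 0.001, ... (vanishes)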

Why Zero Initialization Fails

An obvious first thought might be: "Why not initialize all weights to zero?" This seems safe: no explosions, no vanishing. However, zero initialization causes a fundamental problem: symmetry.

The Symmetry Problem

If all weights are initialized to zero (or any identical value), then:

  1. All neurons in a layer receive the same input
  2. All neurons compute the same output
  3. All neurons receive the same gradient during backpropagation
  4. All weights update identically

The neurons remain identical forever; they are symmetric. The network effectively has only one neuron per layer, regardless of how many we specified.

Mathematical Proof

Consider a layer with weights $\mathbf{W}$ initialized to zeros. For input $\mathbf{x}$:

\[\mathbf{z} = \mathbf{W}\mathbf{x} = \mathbf{0}\]

All pre-activations are zero. After activation:

\[\mathbf{h} = f(\mathbf{0})\]

All activations are identical (whatever $f(0)$ equals).

During backpropagation, the gradient with respect to each weight:

\[\frac{\partial \mathcal{L}}{\partial W_{ij}} = \frac{\partial \mathcal{L}}{\partial z_i} \cdot x_j\]

Since all $\frac{\partial \mathcal{L}}{\partial z_i}$ are identical (neurons are symmetric), and $x_j$ is the same input, all weight gradients in a column are identical. Updates preserve symmetry.
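
The argument is easy to check numerically. The sketch below (a hypothetical one-hidden-layer network with a tanh hidden layer, a linear output, and a squared-error loss; all names are mine) initializes every weight to the same constant and shows that all hidden units receive identical gradients.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)          # single input with 4 features
y = 1.0                             # scalar target

W1 = np.full((3, 4), 0.5)           # constant init: all hidden weights identical
w2 = np.full(3, 0.5)                # output weights also identical

# Forward pass: tanh hidden layer, linear output
z1 = W1 @ x
h1 = np.tanh(z1)
y_hat = w2 @ h1

# Backward pass for the squared-error loss (y_hat - y)^2
d_yhat = 2 * (y_hat - y)
d_z1 = d_yhat * w2 * (1 - h1**2)    # gradient w.r.t. each hidden pre-activation
d_W1 = np.outer(d_z1, x)            # gradient w.r.t. the hidden weight matrix

print(d_W1)   # every row is identical: the three hidden units stay clones

Every row of d_W1 comes out identical, so a gradient step keeps the hidden units interchangeable, exactly as the proof above predicts.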

Breaking Symmetry

To break symmetry, we must initialize weights randomly. Different random values mean different neurons, different gradients, and different learning trajectories, so the network can learn diverse features. Random initialization isn't just about setting a good starting point; it's essential for allowing the network to learn different features.

Random Initialization

The simplest approach beyond zero initialization is random initialization from a uniform or Gaussian distribution.

Uniform Random Initialization

\[W_{ij} \sim \text{Uniform}(-a, a)\]

The variance of a uniform distribution on $(-a, a)$ is:

\[\text{Var}(W) = \frac{a^2}{3}\]

Gaussian Random Initialization

\[W_{ij} \sim \mathcal{N}(0, \sigma^2)\]

The variance is simply $\sigma^2$.
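
In code, both choices are one line each. Here is a small sketch (the layer sizes and target variance are arbitrary) that matches the two distributions to the same variance:

import numpy as np

n_in, n_out = 128, 64
target_var = 0.01                          # chosen variance

# Gaussian: variance is sigma^2, so scale by sqrt(target_var)
W_gauss = np.random.randn(n_out, n_in) * np.sqrt(target_var)

# Uniform(-a, a): variance is a^2 / 3, so pick a = sqrt(3 * target_var)
a = np.sqrt(3 * target_var)
W_unif = np.random.uniform(-a, a, size=(n_out, n_in))

print(W_gauss.var(), W_unif.var())         # both close to 0.01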

The Problem: What Values to Use?

Random initialization breaks symmetry, but we still face the vanishing/exploding gradient problem. The key question is: what variance should we use?

Too large → exploding activations/gradients
Too small → vanishing activations/gradients

We need to choose the variance carefully based on the network architecture. This is where principled initialization schemes come in.

Xavier/Glorot Initialization

In 2010, Xavier Glorot and Yoshua Bengio published “Understanding the difficulty of training deep feedforward neural networks,” which introduced what we now call Xavier initialization (or Glorot initialization).

The Key Insight

Glorot and Bengio reasoned that to maintain stable signal flow, we need:

  1. Forward pass: Activation variance should stay constant across layers
  2. Backward pass: Gradient variance should stay constant across layers

Derivation for Linear Networks

Consider a layer with $n_{in}$ inputs and $n_{out}$ outputs. For the forward pass, assuming inputs have unit variance and weights have variance $\text{Var}(W)$:

\[\text{Var}(z) = n_{in} \cdot \text{Var}(W) \cdot \text{Var}(x)\]

To maintain $\text{Var}(z) = \text{Var}(x)$, we need:

\[\text{Var}(W) = \frac{1}{n_{in}}\]

For the backward pass, similar analysis shows we need:

\[\text{Var}(W) = \frac{1}{n_{out}}\]

The Xavier Compromise

Since we can't satisfy both conditions simultaneously (unless $n_{in} = n_{out}$), Xavier initialization uses the harmonic mean of the two target variances, $\frac{1}{n_{in}}$ and $\frac{1}{n_{out}}$:

\[\text{Var}(W) = \frac{2}{n_{in} + n_{out}}\]

Xavier Initialization Formulas

Gaussian (Normal) Distribution:

\[W \sim \mathcal{N}\left(0, \frac{2}{n_{in} + n_{out}}\right)\]

Uniform Distribution:

For variance $\frac{2}{n_{in} + n_{out}}$, the uniform bounds are:

\[W \sim \text{Uniform}\left(-\sqrt{\frac{6}{n_{in} + n_{out}}}, \sqrt{\frac{6}{n_{in} + n_{out}}}\right)\]

(Since $\text{Var}(\text{Uniform}(-a, a)) = \frac{a^2}{3}$, we solve for $a$.)
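
To see what the formula buys you, here is a small sketch (my own setup: a 20-layer tanh network with 256 units per layer fed a random batch) comparing Xavier initialization against a naive small-standard-deviation Gaussian:

import numpy as np

def tanh_net_final_variance(init, depth=20, width=256, batch=512, seed=0):
    """Push a random batch through a deep tanh network and return the
    activation variance at the last layer."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, width))
    for _ in range(depth):
        W = init(rng, width, width)
        h = np.tanh(h @ W)
    return h.var()

xavier = lambda rng, n_in, n_out: rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / (n_in + n_out))
naive = lambda rng, n_in, n_out: rng.standard_normal((n_in, n_out)) * 0.01

print(tanh_net_final_variance(xavier))  # stays at a healthy, non-vanishing scale
print(tanh_net_final_variance(naive))   # collapses to essentially zero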

When to Use Xavier Initialization

Xavier initialization was derived assuming:

  1. Activations that are linear, or approximately linear around zero (as tanh and sigmoid are near the origin)
  2. Weights and inputs that are independent with zero mean

For tanh, the derivative near zero is approximately 1, so the linear analysis holds reasonably well. For sigmoid, the derivative is at most 0.25, which Xavier doesn’t fully account for, but it still works much better than naive random initialization.

Important Note: Do not use Xavier initialization with ReLU. The asymmetry of ReLU violates Xavier's assumptions, as the next section explains.

He/Kaiming Initialization

When ReLU (Rectified Linear Unit) became the dominant activation function, researchers noticed that Xavier initialization didn’t work well. In 2015, Kaiming He and colleagues published “Delving Deep into Rectifiers,” introducing He initialization (also called Kaiming initialization).

Why Xavier Fails for ReLU

ReLU zeros out all negative inputs:

\[\text{ReLU}(x) = \max(0, x)\]

This means roughly half of the activations become zero (assuming symmetric input distribution). The variance after ReLU is approximately halved:

\[\text{Var}(\text{ReLU}(z)) \approx \frac{1}{2} \text{Var}(z)\]

With Xavier initialization, this variance reduction at each layer causes activations to vanish in deep networks.
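
A quick numerical check makes the factor of one half visible. Strictly, it is the second moment of the post-ReLU activations (the quantity that drives the variance of the next layer's pre-activations) that is halved:

import numpy as np

z = np.random.randn(1_000_000)        # zero-mean, unit-variance pre-activations
relu = np.maximum(0.0, z)

print(np.mean(z**2))       # ~1.0 : second moment before ReLU
print(np.mean(relu**2))    # ~0.5 : halved, because the negative half is zeroed out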

He Initialization Derivation

He et al. derived that to maintain variance through ReLU layers:

\[\text{Var}(W) = \frac{2}{n_{in}}\]

The factor of 2 compensates for ReLU zeroing out half the values.

He Initialization Formulas

Gaussian (Normal) Distribution:

\[W \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right)\]

Standard deviation: $\sigma = \sqrt{\frac{2}{n_{in}}}$

Uniform Distribution:

\[W \sim \text{Uniform}\left(-\sqrt{\frac{6}{n_{in}}}, \sqrt{\frac{6}{n_{in}}}\right)\]

Fan-In vs. Fan-Out Modes

He initialization can use either:

  1. Fan-in mode: $\text{Var}(W) = \frac{2}{n_{in}}$, which preserves activation variance in the forward pass
  2. Fan-out mode: $\text{Var}(W) = \frac{2}{n_{out}}$, which preserves gradient variance in the backward pass

Fan-in mode is standard and recommended for most cases.

Variants for Different Activations

He initialization can be adapted for other activations by adjusting the gain factor:

| Activation | Gain | Variance Formula |
|------------|------|------------------|
| ReLU | $\sqrt{2}$ | $\frac{2}{n_{in}}$ |
| Leaky ReLU (slope $a$) | $\sqrt{\frac{2}{1+a^2}}$ | $\frac{2}{(1+a^2) \cdot n_{in}}$ |
| SELU | $\frac{3}{4}$ | Custom formula |
| Linear/Tanh/Sigmoid | 1 | $\frac{1}{n_{in}}$ (Xavier-like) |
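
In PyTorch, such gains are available through torch.nn.init.calculate_gain and can be passed to any initializer that accepts a gain argument; a short sketch (the tensor size is arbitrary):

import torch
import torch.nn as nn

w = torch.empty(128, 256)

print(nn.init.calculate_gain('relu'))              # sqrt(2) ~ 1.414
print(nn.init.calculate_gain('leaky_relu', 0.1))   # sqrt(2 / (1 + 0.1**2))

# Combine the gain with an initializer that accepts it
nn.init.orthogonal_(w, gain=nn.init.calculate_gain('relu'))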

Orthogonal Initialization

Orthogonal initialization takes a different approach: instead of controlling variance statistically, it initializes weights as orthogonal matrices.

What is an Orthogonal Matrix?

A matrix $\mathbf{W}$ is orthogonal if:

\[\mathbf{W}^T \mathbf{W} = \mathbf{W} \mathbf{W}^T = \mathbf{I}\]

Orthogonal matrices preserve vector norms:

\[\|\mathbf{W}\mathbf{x}\| = \|\mathbf{x}\|\]

Why Orthogonal Initialization Works

For linear networks, orthogonal weights perfectly preserve signal magnitude through layers. Even with nonlinear activations, orthogonal initialization provides excellent starting points because:

  1. No amplification or attenuation: Signals maintain their magnitude
  2. Gradient preservation: Gradients flow without explosion or vanishing
  3. Maximum information flow: Different dimensions remain distinguishable

Simple Python Implementation

For a weight matrix of shape $(n_{out}, n_{in})$:

  1. Generate a random matrix $\mathbf{A} \sim \mathcal{N}(0, 1)$
  2. Compute QR decomposition: $\mathbf{A} = \mathbf{Q}\mathbf{R}$
  3. Use $\mathbf{Q}$ (or its appropriate submatrix) as the weight matrix
  4. Optionally scale by a gain factor

import numpy as np

def orthogonal_init(shape, gain=1.0):
    """Generate an orthogonally initialized weight matrix."""
    rows, cols = shape
    # Generate a random Gaussian matrix; use the transposed shape when
    # rows < cols so that QR returns enough orthonormal vectors
    flat_shape = (cols, rows) if rows < cols else (rows, cols)
    a = np.random.randn(*flat_shape)
    # QR decomposition
    q, r = np.linalg.qr(a)
    # Make Q uniformly distributed (handle the sign ambiguity of QR)
    d = np.diag(r)
    q *= np.sign(d)
    # Transpose back for wide matrices
    if rows < cols:
        q = q.T
    # Apply gain
    return gain * q[:rows, :cols]

When to Use Orthogonal Initialization

Orthogonal initialization is particularly effective for:

  1. Recurrent neural networks, where the same weight matrix is applied at every time step
  2. Very deep networks, where norm preservation keeps signals and gradients stable across many layers

LeCun Initialization

Before Xavier and He, Yann LeCun proposed an initialization scheme in his 1998 work “Efficient BackProp”:

\[\text{Var}(W) = \frac{1}{n_{in}}\]

This is essentially Xavier initialization using only the fan-in, equivalent to He initialization with a gain of 1 instead of $\sqrt{2}$.

LeCun initialization is appropriate for:

  1. SELU activations (self-normalizing networks)
  2. Linear layers and activations that are close to linear around zero

LSUV: Layer-Sequential Unit-Variance

LSUV (Layer-Sequential Unit-Variance) is a data-driven initialization approach introduced by Mishkin and Matas in 2015.

The Idea

Instead of deriving variance formulas mathematically, LSUV empirically ensures unit variance at each layer:

  1. Initialize weights with orthogonal initialization
  2. Pass a mini-batch through the network
  3. For each layer, measure the actual output variance
  4. Scale weights to achieve unit variance
  5. Repeat until all layers have approximately unit variance

Algorithm

Input: Network with L layers, mini-batch of data X

1. For each layer l = 1 to L:
   a. Initialize W[l] with orthogonal initialization
   b. While |Var(output[l]) - 1| > tolerance:
      - Forward pass X through layers 1 to l
      - Compute v = Var(output[l])
      - Scale: W[l] = W[l] / sqrt(v)
   c. Output[l] now has unit variance

Output: Initialized network
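
Here is a compact, self-contained sketch of the procedure for a plain feedforward network with tanh activations (all helper and variable names are my own; it measures the variance of each layer's linear output on a real mini-batch and rescales until it is close to 1):

import numpy as np

def lsuv_init(layer_sizes, X, tol=0.01, max_iters=10, seed=0):
    """LSUV-style initialization sketch for a tanh feedforward network."""
    rng = np.random.default_rng(seed)
    weights = []
    h = X
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # Step (a): start from an orthogonal matrix (QR of a Gaussian matrix)
        a = rng.standard_normal((max(n_in, n_out), min(n_in, n_out)))
        q, _ = np.linalg.qr(a)
        W = q[:n_in, :n_out] if n_in >= n_out else q[:n_out, :n_in].T

        # Step (b): forward the mini-batch, measure variance, rescale
        for _ in range(max_iters):
            v = (h @ W).var()
            if abs(v - 1.0) < tol:
                break
            W = W / np.sqrt(v)

        weights.append(W)
        h = np.tanh(h @ W)          # this layer's activations feed the next
    return weights

# Usage (hypothetical data): weights = lsuv_init([64, 128, 128, 10], X_batch)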

Advantages of LSUV

Disadvantages

Initialization for Specific Architectures

Convolutional Neural Networks (CNNs)

For convolutional layers, fan-in is computed as:

\[n_{\text{in}} = k_h \times k_w \times c_{\text{in}}\]

Where $k_h$ = kernel height, $k_w$ = kernel width, and $c_{\text{in}}$ = input channels.

For a 3×3 convolution with 64 input channels:

\[n_{in} = 3 \times 3 \times 64 = 576\]

He initialization variance: $\frac{2}{576} \approx 0.00347$
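
The same numbers can be reproduced and applied directly; a short sketch (hypothetical layer sizes, using PyTorch's built-in Kaiming initializer, which computes the same fan-in internally):

import torch.nn as nn

conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)

# fan_in = k_h * k_w * c_in = 3 * 3 * 64 = 576, so the He variance is 2/576
fan_in = conv.kernel_size[0] * conv.kernel_size[1] * conv.in_channels
print(fan_in, 2.0 / fan_in)    # 576  ~0.00347

nn.init.kaiming_normal_(conv.weight, mode='fan_in', nonlinearity='relu')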

Recurrent Neural Networks (RNNs)

RNNs have two weight matrices:

  1. Input-to-hidden weights, applied to the input at each time step
  2. Hidden-to-hidden (recurrent) weights, applied to the previous hidden state

The recurrent weight matrix is applied many times (once per time step), so non-orthogonal initialization causes exponential explosion or vanishing over the sequence length; orthogonal initialization is the usual remedy, as sketched below.
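
A common pattern in PyTorch (sketched here for a standard nn.LSTM; the 'weight_hh'/'weight_ih' matching follows PyTorch's parameter naming) is orthogonal initialization for the recurrent weights and Xavier for the input weights:

import torch.nn as nn

rnn = nn.LSTM(input_size=128, hidden_size=256, num_layers=2)

for name, param in rnn.named_parameters():
    if 'weight_hh' in name:        # hidden-to-hidden (recurrent) weights
        nn.init.orthogonal_(param)
    elif 'weight_ih' in name:      # input-to-hidden weights
        nn.init.xavier_uniform_(param)
    elif 'bias' in name:
        nn.init.zeros_(param)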

Transformers

Transformers typically use:

  1. Xavier/Glorot initialization for the attention and feedforward projection matrices
  2. Small-standard-deviation normal initialization for token and positional embeddings
  3. Depth-dependent scaling of residual-branch weights in very deep models

The scaling for deep Transformers prevents gradient accumulation from residual connections.

Residual Networks (ResNets)

For residual connections $\mathbf{y} = \mathbf{x} + f(\mathbf{x})$, special considerations apply: initializing the final layer of each residual branch $f(\mathbf{x})$ to zero (for example, the scale parameter of its last batch normalization layer) makes every block compute the identity function at the start of training, so the network initially behaves like a much shallower one.

This technique, called zero initialization of residual branches, significantly improves training stability for very deep ResNets.
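
In PyTorch this can be done by zeroing the scale of the last normalization layer in each residual branch; a sketch (the attribute name bn2 follows torchvision's BasicBlock convention and is an assumption about your model):

import torch.nn as nn

def zero_init_residual_branches(model):
    """Zero the scale of the final BatchNorm in each residual branch."""
    for module in model.modules():
        # Assumes blocks whose residual branch ends in an attribute named bn2
        if hasattr(module, 'bn2') and isinstance(module.bn2, nn.BatchNorm2d):
            nn.init.zeros_(module.bn2.weight)   # branch output starts at zero, so y = x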

Practical Implementation

NumPy Implementation

import numpy as np

# Note: xavier_init, he_init, and lecun_init below take shape = (fan_in, fan_out).

def xavier_init(shape, gain=1.0):
    """Xavier/Glorot initialization."""
    fan_in, fan_out = shape[0], shape[1]
    std = gain * np.sqrt(2.0 / (fan_in + fan_out))
    return np.random.randn(*shape) * std

def he_init(shape, gain=np.sqrt(2)):
    """He/Kaiming initialization for ReLU."""
    fan_in = shape[0]
    std = gain / np.sqrt(fan_in)
    return np.random.randn(*shape) * std

def lecun_init(shape):
    """LeCun initialization."""
    fan_in = shape[0]
    std = 1.0 / np.sqrt(fan_in)
    return np.random.randn(*shape) * std

def orthogonal_init(shape, gain=1.0):
    """Orthogonal initialization."""
    rows, cols = shape
    if rows < cols:
        flat_shape = (cols, rows)
    else:
        flat_shape = shape

    a = np.random.randn(*flat_shape)
    q, r = np.linalg.qr(a)
    d = np.diag(r)
    ph = np.sign(d)
    q *= ph

    if rows < cols:
        q = q.T

    return gain * q[:rows, :cols]

PyTorch Implementation

import torch
import torch.nn as nn

# Example layer to initialize (hypothetical size)
layer = nn.Linear(256, 128)

# Xavier initialization
nn.init.xavier_uniform_(layer.weight)
nn.init.xavier_normal_(layer.weight)

# He/Kaiming initialization
nn.init.kaiming_uniform_(layer.weight, mode='fan_in', nonlinearity='relu')
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')

# Orthogonal initialization
nn.init.orthogonal_(layer.weight, gain=1.0)

# Zero initialization (for biases)
nn.init.zeros_(layer.bias)

# Custom initialization for a network
def init_weights(module):
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, mode='fan_out', nonlinearity='relu')

# Apply to a model (hypothetical example)
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))
model.apply(init_weights)

Initialization and Normalization

Modern networks often use batch normalization or layer normalization. How does this affect initialization?

The good news is that normalization layers standardize activations, reducing sensitivity to initialization. A network with normalization after every layer is much more forgiving of initialization choices. However, initialization still matters because:

  1. First layer: Before the first normalization, initialization directly affects activations
  2. Gradient flow: Poor initialization can still cause gradient issues
  3. Convergence speed: Good initialization speeds up training even with normalization
  4. Residual connections: Normalization doesn’t fully address residual paths

It is good practice to use appropriate initialization (He for ReLU, Xavier for tanh/sigmoid) even with normalization; it costs nothing and can only help.

Common Initialization Mistakes To Avoid

Mistake 1: Using Zero Initialization

Never initialize all weights to zero or any constant. Symmetry breaking requires randomness.

Mistake 2: Wrong Initialization for Activation

Match the initialization to the activation function: use He initialization for ReLU-family activations and Xavier for tanh or sigmoid.

Mistake 3: Forgetting Biases

Biases should typically be initialized to zero. Non-zero bias initialization can cause issues, such as shifting pre-activations away from zero and pushing saturating activations like sigmoid or tanh toward their flat regions.

Mistake 4: Ignoring Architecture

Different architectures need different approaches: CNNs compute fan-in from the kernel size and input channels, RNNs benefit from orthogonal recurrent weights, Transformers scale residual-branch weights with depth, and ResNets benefit from zero-initialized residual branches (see the architecture-specific section above).

Mistake 5: Not Considering Data Scale

If input data isn’t normalized, even perfect weight initialization may fail. Always normalize inputs.

Comparison of Initialization Methods

| Method | Formula | Best For | Key Property |
|--------|---------|----------|--------------|
| Xavier (Normal) | $\mathcal{N}(0, \frac{2}{n_{in}+n_{out}})$ | Tanh, Sigmoid | Balances forward/backward |
| Xavier (Uniform) | $U(-\sqrt{\frac{6}{n_{in}+n_{out}}}, \sqrt{\frac{6}{n_{in}+n_{out}}})$ | Tanh, Sigmoid | Same as above |
| He (Normal) | $\mathcal{N}(0, \frac{2}{n_{in}})$ | ReLU, Leaky ReLU | Compensates for ReLU |
| He (Uniform) | $U(-\sqrt{\frac{6}{n_{in}}}, \sqrt{\frac{6}{n_{in}}})$ | ReLU, Leaky ReLU | Same as above |
| LeCun | $\mathcal{N}(0, \frac{1}{n_{in}})$ | SELU, Linear | Self-normalizing |
| Orthogonal | QR decomposition | RNNs, very deep networks | Preserves norms |
| LSUV | Data-driven scaling | Any | Empirically robust |

Advantages and Limitations

Advantages of Proper Initialization

  1. Enables deep networks: Without good initialization, very deep networks cannot train
  2. Faster convergence: Good starting point means fewer iterations needed
  3. Better final performance: Avoids bad local minima from poor starts
  4. Stable training: Prevents gradient explosion/vanishing
  5. Reproducibility: Principled approach yields consistent results

Limitations

  1. Not a silver bullet: Bad architectures won’t be saved by good initialization
  2. Doesn’t replace normalization: Both are complementary
  3. Architecture-dependent: No single method works universally
  4. Data-agnostic (except LSUV): May not account for input distribution
  5. One-time effect: Only matters at the start of training

Conclusion

Weight initialization is the critical first step that determines whether a neural network will train successfully. What seems like a minor detail, the initial values of the weights, has profound implications for the entire training process. The journey from naive random initialization to principled methods like Xavier and He represents a fundamental advance in our understanding of neural networks. These techniques transformed deep learning from unreliable experimentation into a more predictable engineering discipline.

As you build neural networks, remember: the starting point matters. Take the time to initialize properly, and your networks will thank you with stable, efficient training.


Further Reading and Resources

Seminal Papers

  1. Glorot, X., & Bengio, Y. (2010). “Understanding the difficulty of training deep feedforward neural networks”

  2. He, K., Zhang, X., Ren, S., & Sun, J. (2015). “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”

  3. Saxe, A. M., McClelland, J. L., & Ganguli, S. (2013). “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks”

  4. Mishkin, D., & Matas, J. (2015). “All you need is a good init”

Now that you understand initialization, experiment with different methods on your networks. Visualize activation and gradient statistics to see how initialization affects training dynamics. The best way to internalize these concepts is through hands-on practice.