Jan 15, 2026
Continuing our Deep Learning series, we now turn to weight initialization. Imagine you’re dropped into an unfamiliar mountain range with the goal of finding the lowest valley. Where you start dramatically affects your journey: begin near a good path and you’ll descend smoothly to your destination; start on a steep cliff or a flat plateau and you may tumble uncontrollably or wander aimlessly forever.
This is precisely the situation neural networks face. Before any training begins, we must assign initial values to the millions (or billions) of weights in the network. These initial values (the weight initialization) determine the starting point in the optimization landscape. A good initialization leads to smooth, stable training; a poor one can cause the network to fail catastrophically before it learns anything at all.
For decades, weight initialization was treated as a minor detail. Researchers would initialize weights with small random numbers and hope for the best. But as networks grew deeper, this casual approach led to mysterious training failures: gradients would either explode to infinity or vanish to zero, making learning impossible. The breakthrough came when researchers realized that the statistics of activations and gradients must be carefully controlled across layers. This insight led to principled schemes, such as Xavier/Glorot initialization for sigmoid/tanh networks and He/Kaiming initialization for ReLU networks, that transformed deep learning from an unreliable art into a more predictable science.
Consider a network with 100 layers. During forward propagation, the input signal passes through all 100 layers; during backpropagation, gradients flow backward through the same 100 layers. If each layer slightly amplifies or attenuates the signal, these effects compound exponentially: amplifying by a factor of 1.1 per layer gives $1.1^{100} \approx 13{,}781$, while attenuating by 0.9 per layer gives $0.9^{100} \approx 2.7 \times 10^{-5}$.
Proper initialization ensures that signals neither explode nor vanish, maintaining healthy magnitudes throughout the network.
Before diving into initialization techniques, we must understand the problems they solve, which means first exploring the dynamics of forward and backward propagation in neural networks.
Let’s consider a simple feedforward network without biases:
\[\mathbf{h}^{[l]} = f(\mathbf{W}^{[l]} \mathbf{h}^{[l-1]})\]where $f$ is the activation function and $\mathbf{W}^{[l]}$ is the weight matrix for layer $l$.
The variance of activations at layer $l$ depends on the variance of the previous layer’s activations, the variance of the weights, the number of inputs per neuron (the fan-in), and the activation function.
Now, for a linear network (no activation) with $n$ inputs per neuron:
\[\text{Var}(h^{[l]}) = n \cdot \text{Var}(W) \cdot \text{Var}(h^{[l-1]})\]If $n \cdot \text{Var}(W) > 1$, activations grow exponentially with depth. If $n \cdot \text{Var}(W) < 1$, activations shrink exponentially with depth.
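This recurrence is easy to check empirically. The sketch below uses illustrative values of our own choosing (a 100-input linear layer with $n \cdot \text{Var}(W) = 0.5$) and compares the measured output variance against the formula:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100              # inputs per neuron
var_w = 0.5 / n      # illustrative weight variance, chosen so n * Var(W) = 0.5

x = rng.normal(0.0, 1.0, size=(n, 10_000))         # inputs with unit variance
W = rng.normal(0.0, np.sqrt(var_w), size=(n, n))   # weights with variance var_w
h = W @ x                                          # one linear layer

print(n * var_w * x.var())   # predicted Var(h): ~0.5
print(h.var())               # measured  Var(h): ~0.5
```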
During backpropagation, by contrast, gradients flow in reverse:
\[\frac{\partial \mathcal{L}}{\partial \mathbf{h}^{[l-1]}} = (\mathbf{W}^{[l]})^T \frac{\partial \mathcal{L}}{\partial \mathbf{h}^{[l]}} \odot f'(\mathbf{z}^{[l-1]})\]The gradient magnitude at each layer therefore depends on the weight matrix $(\mathbf{W}^{[l]})^T$ and the activation derivative $f'(\mathbf{z}^{[l-1]})$.
Similar exponential effects occur here: if gradients grow or shrink at each layer, they explode or vanish over many layers.
Let’s trace through a 5-layer network with poor initialization:
Too Large Initialization ($W \sim \mathcal{N}(0, 1)$, 100 neurons per layer):
| Layer | Expected Activation Variance |
|---|---|
| 1 | $100 \times 1 = 100$ |
| 2 | $100 \times 100 = 10,000$ |
| 3 | $100 \times 10,000 = 1,000,000$ |
| 4 | $100 \times 1,000,000 = 10^8$ |
| 5 | $100 \times 10^8 = 10^{10}$ |
Activations explode! By layer 5, values are astronomically large.
Too Small Initialization ($W \sim \mathcal{N}(0, 0.001)$):
| Layer | Expected Activation Variance |
|---|---|
| 1 | $100 \times 0.001 = 0.1$ |
| 2 | $100 \times 0.001 \times 0.1 = 0.01$ |
| 3 | $100 \times 0.001 \times 0.01 = 0.001$ |
| 4 | $100 \times 0.001 \times 0.001 = 0.0001$ |
| 5 | $100 \times 0.001 \times 0.0001 = 0.00001$ |
Activations vanish! Information is lost.
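Both regimes can be reproduced numerically. Here is a minimal sketch (the helper name and batch size are ours, chosen for illustration):

```python
import numpy as np

def layer_variances(weight_std, n=100, depth=5, seed=0):
    """Propagate a unit-variance signal through `depth` linear layers."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(n, 1_000))
    variances = []
    for _ in range(depth):
        W = rng.normal(0.0, weight_std, size=(n, n))
        h = W @ h
        variances.append(float(h.var()))
    return variances

print(layer_variances(weight_std=1.0))              # ~1e2, 1e4, 1e6, 1e8, 1e10
print(layer_variances(weight_std=np.sqrt(0.001)))   # ~0.1, 0.01, 0.001, ...
```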
An obvious first thought might be: “Why not initialize all weights to zero?” This seems safe, with no risk of explosions or vanishing. However, zero initialization causes a fundamental problem: symmetry.
If all weights are initialized to zero (or any identical value), then every neuron in a layer computes the same output, receives the same gradient, and gets the same weight update.
The neurons remain identical forever; they are symmetric. The network effectively has only one neuron per layer, regardless of how many we specified.
Consider a layer with weights $\mathbf{W}$ initialized to zeros. For input $\mathbf{x}$:
\[\mathbf{z} = \mathbf{W}\mathbf{x} = \mathbf{0}\]All pre-activations are zero. After activation:
\[\mathbf{h} = f(\mathbf{0})\]All activations are identical (whatever $f(0)$ equals).
During backpropagation, the gradient with respect to each weight:
\[\frac{\partial \mathcal{L}}{\partial W_{ij}} = \frac{\partial \mathcal{L}}{\partial z_i} \cdot x_j\]Since all $\frac{\partial \mathcal{L}}{\partial z_i}$ are identical (neurons are symmetric), and $x_j$ is the same input, all weight gradients in a column are identical. Updates preserve symmetry.
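A tiny numerical demonstration of the symmetry trap, using toy shapes and a constant weight value of 0.5 chosen purely for illustration:

```python
import numpy as np

x = np.array([[0.5], [-1.0], [2.0], [0.3]])   # one input sample (4 features)
y = np.array([[1.0]])                          # target

# Symmetric initialization: every weight is the same constant
W1 = np.full((3, 4), 0.5)                      # 3 hidden neurons
W2 = np.full((1, 3), 0.5)

h = np.tanh(W1 @ x)                            # all hidden activations identical
dout = (W2 @ h) - y                            # gradient of squared loss
dh = W2.T @ dout
dW1 = (dh * (1 - h ** 2)) @ x.T

print(h.ravel())   # three identical activations
print(dW1)         # three identical rows: every neuron gets the same update
```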
To break symmetry, we must initialize weights randomly. Different random values mean different neurons, different gradients, and different learning trajectories, so the network can learn diverse features. Remember: random initialization isn’t just about picking a good starting point; it’s essential for allowing the network to learn different features at all.
The simplest approach beyond zero initialization is random initialization from a uniform or Gaussian distribution.
The variance of a uniform distribution on $(-a, a)$ is:
\[\text{Var}(W) = \frac{a^2}{3}\]For a Gaussian distribution $\mathcal{N}(0, \sigma^2)$, the variance is simply $\sigma^2$.
Random initialization breaks symmetry, but we still face the vanishing/exploding gradient problem. The key question is: what variance should we use?
Too large → exploding activations/gradients.
Too small → vanishing activations/gradients.
We need to choose the variance carefully based on the network architecture. This is where principled initialization schemes come in.
In 2010, Xavier Glorot and Yoshua Bengio published “Understanding the difficulty of training deep feedforward neural networks,” which introduced what we now call Xavier initialization (or Glorot initialization).
Glorot and Bengio reasoned that to maintain stable signal flow, we need the variance of activations to remain constant across layers in the forward pass, and the variance of gradients to remain constant across layers in the backward pass.
Consider a layer with $n_{in}$ inputs and $n_{out}$ outputs. For the forward pass, assuming inputs have unit variance and weights have variance $\text{Var}(W)$:
\[\text{Var}(z) = n_{in} \cdot \text{Var}(W) \cdot \text{Var}(x)\]To maintain $\text{Var}(z) = \text{Var}(x)$, we need:
\[\text{Var}(W) = \frac{1}{n_{in}}\]For the backward pass, similar analysis shows we need:
\[\text{Var}(W) = \frac{1}{n_{out}}\]Since we can’t satisfy both conditions simultaneously (unless $n_{in} = n_{out}$), Xavier initialization compromises by taking the harmonic mean of the two candidate variances:
\[\text{Var}(W) = \frac{2}{n_{in} + n_{out}}\]Gaussian (Normal) Distribution:
\[W \sim \mathcal{N}\left(0, \frac{2}{n_{in} + n_{out}}\right)\]Uniform Distribution:
For variance $\frac{2}{n_{in} + n_{out}}$, the uniform bounds are:
\[W \sim \text{Uniform}\left(-\sqrt{\frac{6}{n_{in} + n_{out}}}, \sqrt{\frac{6}{n_{in} + n_{out}}}\right)\](Since $\text{Var}(\text{Uniform}(-a, a)) = \frac{a^2}{3}$, we solve for $a$.)
Xavier initialization was derived assuming activations that are roughly linear around zero, with zero-centered, independent weights and inputs.
For tanh, the derivative near zero is approximately 1, so the linear analysis holds reasonably well. For sigmoid, the derivative is at most 0.25, which Xavier doesn’t fully account for, but it still works much better than naive random initialization.
Important Note: do NOT use Xavier initialization with ReLU. Why? Because the asymmetry of ReLU violates Xavier’s assumptions, as the next section explains.
When ReLU (Rectified Linear Unit) became the dominant activation function, researchers noticed that Xavier initialization didn’t work well. In 2015, Kaiming He and colleagues published “Delving Deep into Rectifiers,” introducing He initialization (also called Kaiming initialization).
ReLU zeros out all negative inputs:
\[\text{ReLU}(x) = \max(0, x)\]This means roughly half of the activations become zero (assuming symmetric input distribution). The variance after ReLU is approximately halved:
\[\text{Var}(\text{ReLU}(z)) \approx \frac{1}{2} \text{Var}(z)\]With Xavier initialization, this variance reduction at each layer causes activations to vanish in deep networks.
He et al. derived that to maintain variance through ReLU layers:
\[\text{Var}(W) = \frac{2}{n_{in}}\]The factor of 2 compensates for ReLU zeroing out half the values.
Gaussian (Normal) Distribution:
\[W \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right)\]Standard deviation: $\sigma = \sqrt{\frac{2}{n_{in}}}$
Uniform Distribution:
\[W \sim \text{Uniform}\left(-\sqrt{\frac{6}{n_{in}}}, \sqrt{\frac{6}{n_{in}}}\right)\]He initialization can use either fan-in mode (normalizing by $n_{in}$, which preserves forward-pass activation variance) or fan-out mode (normalizing by $n_{out}$, which preserves backward-pass gradient variance). Fan-in mode is standard and recommended for most cases.
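To see why the factor of 2 matters, the sketch below (sizes are illustrative; for square layers Xavier’s variance reduces to $1/n$) pushes a signal through 50 ReLU layers under both schemes:

```python
import numpy as np

def deep_relu_std(var_w, n=256, depth=50, seed=0):
    """Final activation std after `depth` ReLU layers."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(n, 1_000))
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(var_w), size=(n, n))
        h = np.maximum(0.0, W @ h)
    return float(h.std())

print(deep_relu_std(var_w=1.0 / 256))   # Xavier-like: collapses toward zero
print(deep_relu_std(var_w=2.0 / 256))   # He: stays at a healthy scale
```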
He initialization can be adapted for other activations by adjusting the gain factor:
| Activation | Gain | Variance Formula |
|---|---|---|
| ReLU | $\sqrt{2}$ | $\frac{2}{n_{in}}$ |
| Leaky ReLU (slope $a$) | $\sqrt{\frac{2}{1+a^2}}$ | $\frac{2}{(1+a^2) \cdot n_{in}}$ |
| SELU | $\frac{3}{4}$ | Custom formula |
| Linear/Tanh/Sigmoid | 1 | $\frac{1}{n_{in}}$ (Xavier-like) |
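Several of these gains can be checked directly against PyTorch’s `nn.init.calculate_gain`:

```python
import torch.nn as nn

# PyTorch exposes these gain factors directly
print(nn.init.calculate_gain('relu'))              # sqrt(2) ≈ 1.414
print(nn.init.calculate_gain('leaky_relu', 0.1))   # sqrt(2 / (1 + 0.1**2)) ≈ 1.407
print(nn.init.calculate_gain('selu'))              # 0.75
print(nn.init.calculate_gain('linear'))            # 1.0
```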
Orthogonal initialization takes a different approach: instead of controlling variance statistically, it initializes weights as orthogonal matrices.
A matrix $\mathbf{W}$ is orthogonal if:
\[\mathbf{W}^T \mathbf{W} = \mathbf{W} \mathbf{W}^T = \mathbf{I}\]Orthogonal matrices preserve vector norms:
\[\|\mathbf{W}\mathbf{x}\| = \|\mathbf{x}\|\]For linear networks, orthogonal weights perfectly preserve signal magnitude through layers. Even with nonlinear activations, orthogonal initialization provides an excellent starting point: every singular value of the matrix equals 1, so no direction of the signal is stretched or compressed at initialization.
For a weight matrix of shape $(n_{out}, n_{in})$:
import numpy as np

def orthogonal_init(shape, gain=1.0):
    """Generate an orthogonally initialized weight matrix."""
    rows, cols = shape
    # Draw the random matrix transposed when it is wide, so QR
    # yields a full set of orthonormal vectors
    flat_shape = (cols, rows) if rows < cols else shape
    a = np.random.randn(*flat_shape)
    # QR decomposition gives an orthonormal factor Q
    q, r = np.linalg.qr(a)
    # Fix QR's sign ambiguity so Q is uniformly distributed
    d = np.diag(r)
    q *= np.sign(d)
    # Transpose back for wide matrices
    if rows < cols:
        q = q.T
    # Apply gain
    return gain * q[:rows, :cols]
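A quick sanity check of the two properties we care about (shapes illustrative):

```python
W = orthogonal_init((64, 64))
x = np.random.randn(64, 10)
print(np.allclose(W.T @ W, np.eye(64)))            # True: columns are orthonormal
print(np.linalg.norm(W @ x) / np.linalg.norm(x))   # ≈ 1.0: norm preserved
```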
Orthogonal initialization is particularly effective for recurrent networks, where the same matrix is applied at every time step, and for very deep networks where norm preservation matters most.
Before Xavier and He, Yann LeCun proposed an initialization scheme in his 1998 work “Efficient BackProp”:
\[\text{Var}(W) = \frac{1}{n_{in}}\]This is essentially Xavier initialization using only the fan-in, equivalent to He initialization with a gain of 1 instead of $\sqrt{2}$.
LeCun initialization is appropriate for SELU activations, where it underpins the self-normalizing property, and for linear layers.
LSUV (Layer-Sequential Unit-Variance) is a data-driven initialization approach introduced by Mishkin and Matas in 2015.
Instead of deriving variance formulas mathematically, LSUV empirically ensures unit variance at each layer:
Input: Network with L layers, mini-batch of data X
1. For each layer l = 1 to L:
a. Initialize W[l] with orthogonal initialization
b. While |Var(output[l]) - 1| > tolerance:
- Forward pass X through layers 1 to l
- Compute v = Var(output[l])
- Scale: W[l] = W[l] / sqrt(v)
c. Output[l] now has unit variance
Output: Initialized network
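Here is a minimal NumPy sketch of this procedure, assuming ReLU activations and reusing the `orthogonal_init` helper from earlier; the function name and all sizes are ours, chosen for illustration:

```python
import numpy as np

def lsuv_init(weights, X, tol=0.05, max_iters=10):
    """Minimal LSUV pass for a ReLU MLP.

    weights: list of weight matrices, pre-initialized orthogonally
    X:       mini-batch of shape (input_dim, batch_size)
    """
    h = X
    for W in weights:
        for _ in range(max_iters):
            v = np.maximum(0.0, W @ h).var()   # variance of this layer's output
            if abs(v - 1.0) < tol:
                break
            W /= np.sqrt(v)                    # rescale toward unit variance
        h = np.maximum(0.0, W @ h)             # propagate for the next layer
    return weights

# Usage: orthogonal pre-init, then LSUV rescaling on a data batch
layers = [orthogonal_init((256, 256)) for _ in range(5)]
X = np.random.randn(256, 128)
lsuv_init(layers, X)
```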
For convolutional layers, fan-in is computed as:
\[n_{\text{in}} = k_h \times k_w \times c_{\text{in}}\]Where $k_h$ = kernel height, $k_w$ = kernel width, and $c_{\text{in}}$ = input channels.
For a 3×3 convolution with 64 input channels:
\[n_{in} = 3 \times 3 \times 64 = 576\]He initialization variance: $\frac{2}{576} \approx 0.00347$
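PyTorch computes this fan-in automatically for convolutional weights; a quick check (layer sizes illustrative):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3)
nn.init.kaiming_normal_(conv.weight, mode='fan_in', nonlinearity='relu')

fan_in = 3 * 3 * 64                  # k_h * k_w * c_in = 576
print((2 / fan_in) ** 0.5)           # theoretical std ≈ 0.0589
print(conv.weight.std().item())      # measured std, close to theory
```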
RNNs have two weight matrices: the input-to-hidden matrix $\mathbf{W}_{xh}$ and the hidden-to-hidden (recurrent) matrix $\mathbf{W}_{hh}$.
The recurrent weight matrix is applied many times (once per time step), so non-orthogonal initialization causes exponential explosion or vanishing over the sequence length. Orthogonal initialization of $\mathbf{W}_{hh}$ avoids this, as the demo below shows.
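The effect is easy to demonstrate (sizes illustrative): repeated application of an orthogonal matrix leaves the state norm untouched, while a Gaussian matrix of matched scale drifts:

```python
import numpy as np

def norm_ratio(W, steps=100, seed=0):
    """Norm of the state after `steps` applications, relative to its start."""
    h = np.random.default_rng(seed).normal(size=(W.shape[0], 1))
    start = np.linalg.norm(h)
    for _ in range(steps):
        h = W @ h
    return float(np.linalg.norm(h) / start)

n = 128
rng = np.random.default_rng(1)
W_gauss = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))   # Xavier-scale Gaussian
W_orth, _ = np.linalg.qr(rng.normal(size=(n, n)))          # orthogonal matrix

print(norm_ratio(W_gauss))   # drifts far from 1 over 100 steps
print(norm_ratio(W_orth))    # exactly 1: orthogonal matrices preserve norms
```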
Transformers typically use Xavier initialization for the attention and feed-forward projections, often with an additional depth-dependent scaling applied to the weights on residual branches.
The scaling prevents activations and gradients from accumulating across the many residual connections of deep Transformers; a sketch follows.
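One common convention, used by GPT-2, scales residual projections by $1/\sqrt{N}$ for $N$ residual layers. The helper below is an illustrative sketch of that idea, not the only scheme in use:

```python
import torch
import torch.nn as nn

def init_residual_projection(linear, num_res_layers):
    """Xavier init with GPT-2-style depth scaling (illustrative sketch):
    residual-branch weights divided by sqrt(N) for N residual layers."""
    nn.init.xavier_uniform_(linear.weight)
    with torch.no_grad():
        linear.weight /= num_res_layers ** 0.5
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)

proj = nn.Linear(512, 512)
init_residual_projection(proj, num_res_layers=24)
```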
For residual connections $\mathbf{y} = \mathbf{x} + f(\mathbf{x})$, a special trick applies: initialize the final layer of each residual branch $f$ to zero, so that every block starts as an identity mapping.
This technique, called zero initialization of residual branches, significantly improves training stability for very deep ResNets.
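A minimal sketch of the idea (the class name and sizes are ours):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + f(x); the last layer of f is zero-initialized so the
    block is an identity mapping at the start of training."""
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        nn.init.kaiming_normal_(self.fc1.weight, nonlinearity='relu')
        nn.init.zeros_(self.fc1.bias)
        nn.init.zeros_(self.fc2.weight)   # residual branch outputs 0 initially
        nn.init.zeros_(self.fc2.bias)

    def forward(self, x):
        return x + self.fc2(torch.relu(self.fc1(x)))

block = ResidualBlock(64)
x = torch.randn(8, 64)
print(torch.allclose(block(x), x))   # True: block starts as identity
```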
import numpy as np

def xavier_init(shape, gain=1.0):
    """Xavier/Glorot initialization. Assumes shape = (fan_in, fan_out)."""
    fan_in, fan_out = shape[0], shape[1]
    std = gain * np.sqrt(2.0 / (fan_in + fan_out))
    return np.random.randn(*shape) * std

def he_init(shape, gain=np.sqrt(2)):
    """He/Kaiming initialization for ReLU. Assumes shape = (fan_in, fan_out)."""
    fan_in = shape[0]
    std = gain / np.sqrt(fan_in)
    return np.random.randn(*shape) * std

def lecun_init(shape):
    """LeCun initialization. Assumes shape = (fan_in, fan_out)."""
    fan_in = shape[0]
    std = 1.0 / np.sqrt(fan_in)
    return np.random.randn(*shape) * std
def orthogonal_init(shape, gain=1.0):
"""Orthogonal initialization."""
rows, cols = shape
if rows < cols:
flat_shape = (cols, rows)
else:
flat_shape = shape
a = np.random.randn(*flat_shape)
q, r = np.linalg.qr(a)
d = np.diag(r)
ph = np.sign(d)
q *= ph
if rows < cols:
q = q.T
return gain * q[:rows, :cols]
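A quick check that these helpers produce the expected scales (note they assume shape = (fan_in, fan_out)):

```python
shape = (512, 256)
print(xavier_init(shape).std())   # ≈ sqrt(2/768) ≈ 0.051
print(he_init(shape).std())       # ≈ sqrt(2/512) ≈ 0.0625
print(lecun_init(shape).std())    # ≈ sqrt(1/512) ≈ 0.0442
```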
import torch
import torch.nn as nn

layer = nn.Linear(128, 64)   # example layer to initialize

# Xavier initialization
nn.init.xavier_uniform_(layer.weight)
nn.init.xavier_normal_(layer.weight)
# He/Kaiming initialization
nn.init.kaiming_uniform_(layer.weight, mode='fan_in', nonlinearity='relu')
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
# Orthogonal initialization
nn.init.orthogonal_(layer.weight, gain=1.0)
# Zero initialization (for biases)
nn.init.zeros_(layer.bias)
# Custom initialization for a whole network
def init_weights(module):
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, mode='fan_out', nonlinearity='relu')

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))  # example model
model.apply(init_weights)
Modern networks often use batch normalization or layer normalization. How does this affect initialization?
The good news: normalization layers standardize activations, reducing sensitivity to initialization, so a network with normalization after every layer is much more forgiving of initialization choices. However, initialization still matters: the earliest forward and backward passes happen before training has adapted anything, layers outside the normalized path (embeddings, output heads) get no such protection, and a poor start can still slow early training.
It is always good practice, even with normalization, to use the appropriate initialization (He for ReLU, Xavier for tanh/sigmoid). It costs nothing and can only help.
Never initialize all weights to zero or any constant. Symmetry breaking requires randomness.
It is also very important to match the initialization to the activation function: He for ReLU-family activations, Xavier for tanh and sigmoid.
Biases should typically be initialized to zero; the random weights already break symmetry, and non-zero biases can shift pre-activations into saturated regions of the nonlinearity before training even begins.
Different architectures need different approaches: compute convolutional fan-in from the kernel size and channel count, use orthogonal matrices for recurrent weights, and use scaled schemes for Transformers, as covered above.
If input data isn’t normalized, even perfect weight initialization may fail. Always normalize inputs.
| Method | Formula | Best For | Key Property |
|---|---|---|---|
| Xavier (Normal) | $\mathcal{N}(0, \frac{2}{n_{in}+n_{out}})$ | Tanh, Sigmoid | Balances forward/backward |
| Xavier (Uniform) | $U(-\sqrt{\frac{6}{n_{in}+n_{out}}}, \sqrt{\frac{6}{n_{in}+n_{out}}})$ | Tanh, Sigmoid | Same as above |
| He (Normal) | $\mathcal{N}(0, \frac{2}{n_{in}})$ | ReLU, Leaky ReLU | Compensates for ReLU |
| He (Uniform) | $U(-\sqrt{\frac{6}{n_{in}}}, \sqrt{\frac{6}{n_{in}}})$ | ReLU, Leaky ReLU | Same as above |
| LeCun | $\mathcal{N}(0, \frac{1}{n_{in}})$ | SELU, Linear | Self-normalizing |
| Orthogonal | QR decomposition | RNNs, Very deep | Preserves norms |
| LSUV | Data-driven scaling | Any | Empirically robust |
Weight initialization is the critical first step that determines whether a neural network will train successfully. What seems like a minor detail, the initial values of the weights, has profound implications for the entire training process. The journey from naive random initialization to principled methods like Xavier and He represents a fundamental advance in our understanding of neural networks. These techniques transformed deep learning from unreliable experimentation into a more predictable engineering discipline.
As you build neural networks, remember that the starting point matters. Take the time to initialize properly, and your networks will reward you with stable, efficient training.
Now that you understand initialization, experiment with different methods on your networks. Visualize activation and gradient statistics to see how initialization affects training dynamics. The best way to internalize these concepts is through hands-on practice.