Jan 15, 2026
Continuing our Deep Learning series, we now turn to weight initialization. Imagine you’re dropped into an unfamiliar mountain range with the goal of finding the lowest valley. Where you start dramatically affects your journey: begin near a good path and you’ll descend smoothly to your destination; start on a steep cliff or a flat plateau and you may tumble uncontrollably or wander aimlessly forever.
This is precisely the situation neural networks face. Before any training begins, we must assign initial values to the millions (or billions) of weights in the network. These initial values (the weight initialization) determine the starting point in the optimization landscape. A good initialization leads to smooth, stable training; a poor one can cause the network to fail catastrophically before it learns anything at all.
For decades, weight initialization was treated as a minor detail. Researchers would initialize weights with small random numbers and hope for the best. But as networks grew deeper, this casual approach led to mysterious training failures: gradients would either explode to infinity or vanish to zero, making learning impossible. The breakthrough came when researchers realized that the statistics of activations and gradients must be carefully controlled across layers. This insight led to principled schemes, such as Xavier/Glorot initialization for sigmoid/tanh networks and He/Kaiming initialization for ReLU networks, that transformed deep learning from an unreliable art into a more predictable science.
Consider a network with 100 layers. During forward propagation, the input signal passes through all 100 layers; during backpropagation, gradients flow backward through the same 100 layers. If each layer slightly amplifies or attenuates the signal, these effects compound exponentially: amplifying by a factor of 1.1 per layer gives $1.1^{100} \approx 13{,}781$, while attenuating by 0.9 per layer gives $0.9^{100} \approx 2.7 \times 10^{-5}$.
Proper initialization ensures that signals neither explode nor vanish, maintaining healthy magnitudes throughout the network.
Before diving into initialization techniques, we must understand the problems they solve, which means first exploring the dynamics of forward and backward propagation in neural networks.
Let’s consider a simple feedforward network without biases:
\[\mathbf{h}^{[l]} = f(\mathbf{W}^{[l]} \mathbf{h}^{[l-1]})\]where $f$ is the activation function and $\mathbf{W}^{[l]}$ is the weight matrix for layer $l$.
The variance of activations at layer $l$ depends on the variance of the previous layer’s activations, the variance of the weights, the number of inputs per neuron (the fan-in), and the activation function.
Now, for a linear network (no activation) with $n$ inputs per neuron:
\[\text{Var}(h^{[l]}) = n \cdot \text{Var}(W) \cdot \text{Var}(h^{[l-1]})\]If $n \cdot \text{Var}(W) > 1$, activations grow exponentially with depth. If $n \cdot \text{Var}(W) < 1$, activations shrink exponentially with depth.
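This recurrence is easy to check empirically. The sketch below uses illustrative values of our own choosing (a 100-input linear layer with $n \cdot \text{Var}(W) = 0.5$) and compares the measured output variance against the formula:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100              # inputs per neuron
var_w = 0.5 / n      # illustrative weight variance, chosen so n * Var(W) = 0.5

x = rng.normal(0.0, 1.0, size=(n, 10_000))         # inputs with unit variance
W = rng.normal(0.0, np.sqrt(var_w), size=(n, n))   # weights with variance var_w
h = W @ x                                          # one linear layer

print(n * var_w * x.var())   # predicted Var(h): ~0.5
print(h.var())               # measured  Var(h): ~0.5
```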
During backpropagation, by contrast, gradients flow in reverse:
\[\frac{\partial \mathcal{L}}{\partial \mathbf{h}^{[l-1]}} = (\mathbf{W}^{[l]})^T \frac{\partial \mathcal{L}}{\partial \mathbf{h}^{[l]}} \odot f'(\mathbf{z}^{[l-1]})\]The gradient magnitude at each layer therefore depends on the weight matrix $(\mathbf{W}^{[l]})^T$ and the activation derivative $f'(\mathbf{z}^{[l-1]})$.
Similar exponential effects occur here: if gradients grow or shrink at each layer, they explode or vanish over many layers.
Let’s trace through a 5-layer network with poor initialization:
Too Large Initialization ($W \sim \mathcal{N}(0, 1)$, 100 neurons per layer):
| Layer | Expected Activation Variance |
|---|---|
| 1 | $100 \times 1 = 100$ |
| 2 | $100 \times 100 = 10,000$ |
| 3 | $100 \times 10,000 = 1,000,000$ |
| 4 | $100 \times 1,000,000 = 10^8$ |
| 5 | $100 \times 10^8 = 10^{10}$ |
Activations explode! By layer 5, values are astronomically large.
Too Small Initialization ($W \sim \mathcal{N}(0, 0.001)$):
| Layer | Expected Activation Variance |
|---|---|
| 1 | $100 \times 0.001 = 0.1$ |
| 2 | $100 \times 0.001 \times 0.1 = 0.01$ |
| 3 | $100 \times 0.001 \times 0.01 = 0.001$ |
| 4 | $100 \times 0.001 \times 0.001 = 0.0001$ |
| 5 | $100 \times 0.001 \times 0.0001 = 0.00001$ |
Activations vanish! Information is lost.
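Both regimes can be reproduced numerically. Here is a minimal sketch (the helper name and batch size are ours, chosen for illustration):

```python
import numpy as np

def layer_variances(weight_std, n=100, depth=5, seed=0):
    """Propagate a unit-variance signal through `depth` linear layers."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(n, 1_000))
    variances = []
    for _ in range(depth):
        W = rng.normal(0.0, weight_std, size=(n, n))
        h = W @ h
        variances.append(float(h.var()))
    return variances

print(layer_variances(weight_std=1.0))              # ~1e2, 1e4, 1e6, 1e8, 1e10
print(layer_variances(weight_std=np.sqrt(0.001)))   # ~0.1, 0.01, 0.001, ...
```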
An obvious first thought might be: “Why not initialize all weights to zero?” This seems safe, with no risk of explosions or vanishing. However, zero initialization causes a fundamental problem: symmetry.
If all weights are initialized to zero (or any identical value), then every neuron in a layer computes the same output, receives the same gradient, and gets the same weight update.
The neurons remain identical forever; they are symmetric. The network effectively has only one neuron per layer, regardless of how many we specified.
Consider a layer with weights $\mathbf{W}$ initialized to zeros. For input $\mathbf{x}$:
\[\mathbf{z} = \mathbf{W}\mathbf{x} = \mathbf{0}\]All pre-activations are zero. After activation:
\[\mathbf{h} = f(\mathbf{0})\]All activations are identical (whatever $f(0)$ equals).
During backpropagation, the gradient with respect to each weight:
\[\frac{\partial \mathcal{L}}{\partial W_{ij}} = \frac{\partial \mathcal{L}}{\partial z_i} \cdot x_j\]Since all $\frac{\partial \mathcal{L}}{\partial z_i}$ are identical (neurons are symmetric), and $x_j$ is the same input, all weight gradients in a column are identical. Updates preserve symmetry.
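A tiny numerical demonstration of the symmetry trap, using toy shapes and a constant weight value of 0.5 chosen purely for illustration:

```python
import numpy as np

x = np.array([[0.5], [-1.0], [2.0], [0.3]])   # one input sample (4 features)
y = np.array([[1.0]])                          # target

# Symmetric initialization: every weight is the same constant
W1 = np.full((3, 4), 0.5)                      # 3 hidden neurons
W2 = np.full((1, 3), 0.5)

h = np.tanh(W1 @ x)                            # all hidden activations identical
dout = (W2 @ h) - y                            # gradient of squared loss
dh = W2.T @ dout
dW1 = (dh * (1 - h ** 2)) @ x.T

print(h.ravel())   # three identical activations
print(dW1)         # three identical rows: every neuron gets the same update
```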
To break symmetry, we must initialize weights randomly. Different random values mean different neurons, different gradients, and different learning trajectories, so the network can learn diverse features. Remember: random initialization isn’t just about picking a good starting point; it’s essential for allowing the network to learn different features at all.
The simplest approach beyond zero initialization is random initialization from a uniform or Gaussian distribution.
The variance of a uniform distribution on $(-a, a)$ is:
\[\text{Var}(W) = \frac{a^2}{3}\]For a Gaussian distribution $\mathcal{N}(0, \sigma^2)$, the variance is simply $\sigma^2$.
Random initialization breaks symmetry, but we still face the vanishing/exploding gradient problem. The key question is: what variance should we use?
Too large → exploding activations/gradients.
Too small → vanishing activations/gradients.
We need to choose the variance carefully based on the network architecture. This is where principled initialization schemes come in.
In 2010, Xavier Glorot and Yoshua Bengio published “Understanding the difficulty of training deep feedforward neural networks,” which introduced what we now call Xavier initialization (or Glorot initialization).
Glorot and Bengio reasoned that to maintain stable signal flow, we need the variance of activations to remain constant across layers in the forward pass, and the variance of gradients to remain constant across layers in the backward pass.
Consider a layer with $n_{in}$ inputs and $n_{out}$ outputs. For the forward pass, assuming inputs have unit variance and weights have variance $\text{Var}(W)$:
\[\text{Var}(z) = n_{in} \cdot \text{Var}(W) \cdot \text{Var}(x)\]To maintain $\text{Var}(z) = \text{Var}(x)$, we need:
\[\text{Var}(W) = \frac{1}{n_{in}}\]For the backward pass, similar analysis shows we need:
\[\text{Var}(W) = \frac{1}{n_{out}}\]Since we can’t satisfy both conditions simultaneously (unless $n_{in} = n_{out}$), Xavier initialization compromises by taking the harmonic mean of the two candidate variances:
\[\text{Var}(W) = \frac{2}{n_{in} + n_{out}}\]Gaussian (Normal) Distribution:
\[W \sim \mathcal{N}\left(0, \frac{2}{n_{in} + n_{out}}\right)\]Uniform Distribution:
For variance $\frac{2}{n_{in} + n_{out}}$, the uniform bounds are:
\[W \sim \text{Uniform}\left(-\sqrt{\frac{6}{n_{in} + n_{out}}}, \sqrt{\frac{6}{n_{in} + n_{out}}}\right)\](Since $\text{Var}(\text{Uniform}(-a, a)) = \frac{a^2}{3}$, we solve for $a$.)
Xavier initialization was derived assuming activations that are roughly linear around zero, with zero-centered, independent weights and inputs.
For tanh, the derivative near zero is approximately 1, so the linear analysis holds reasonably well. For sigmoid, the derivative is at most 0.25, which Xavier doesn’t fully account for, but it still works much better than naive random initialization.
Important Note: do NOT use Xavier initialization with ReLU. Why? Because the asymmetry of ReLU violates Xavier’s assumptions, as the next section explains.
When ReLU (Rectified Linear Unit) became the dominant activation function, researchers noticed that Xavier initialization didn’t work well. In 2015, Kaiming He and colleagues published “Delving Deep into Rectifiers,” introducing He initialization (also called Kaiming initialization).
ReLU zeros out all negative inputs:
\[\text{ReLU}(x) = \max(0, x)\]This means roughly half of the activations become zero (assuming symmetric input distribution). The variance after ReLU is approximately halved:
\[\text{Var}(\text{ReLU}(z)) \approx \frac{1}{2} \text{Var}(z)\]With Xavier initialization, this variance reduction at each layer causes activations to vanish in deep networks.
He et al. derived that to maintain variance through ReLU layers:
\[\text{Var}(W) = \frac{2}{n_{in}}\]The factor of 2 compensates for ReLU zeroing out half the values.
Gaussian (Normal) Distribution:
\[W \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right)\]Standard deviation: $\sigma = \sqrt{\frac{2}{n_{in}}}$
Uniform Distribution:
\[W \sim \text{Uniform}\left(-\sqrt{\frac{6}{n_{in}}}, \sqrt{\frac{6}{n_{in}}}\right)\]He initialization can use either fan-in mode (normalizing by $n_{in}$, which preserves forward-pass activation variance) or fan-out mode (normalizing by $n_{out}$, which preserves backward-pass gradient variance). Fan-in mode is standard and recommended for most cases.
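To see why the factor of 2 matters, the sketch below (sizes are illustrative; for square layers Xavier’s variance reduces to $1/n$) pushes a signal through 50 ReLU layers under both schemes:

```python
import numpy as np

def deep_relu_std(var_w, n=256, depth=50, seed=0):
    """Final activation std after `depth` ReLU layers."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(n, 1_000))
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(var_w), size=(n, n))
        h = np.maximum(0.0, W @ h)
    return float(h.std())

print(deep_relu_std(var_w=1.0 / 256))   # Xavier-like: collapses toward zero
print(deep_relu_std(var_w=2.0 / 256))   # He: stays at a healthy scale
```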
He initialization can be adapted for other activations by adjusting the gain factor:
| Activation | Gain | Variance Formula |
|---|---|---|
| ReLU | $\sqrt{2}$ | $\frac{2}{n_{in}}$ |
| Leaky ReLU (slope $a$) | $\sqrt{\frac{2}{1+a^2}}$ | $\frac{2}{(1+a^2) \cdot n_{in}}$ |
| SELU | $\frac{3}{4}$ | Custom formula |
| Linear/Tanh/Sigmoid | 1 | $\frac{1}{n_{in}}$ (Xavier-like) |
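Several of these gains can be checked directly against PyTorch’s `nn.init.calculate_gain`:

```python
import torch.nn as nn

# PyTorch exposes these gain factors directly
print(nn.init.calculate_gain('relu'))              # sqrt(2) ≈ 1.414
print(nn.init.calculate_gain('leaky_relu', 0.1))   # sqrt(2 / (1 + 0.1**2)) ≈ 1.407
print(nn.init.calculate_gain('selu'))              # 0.75
print(nn.init.calculate_gain('linear'))            # 1.0
```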
Orthogonal initialization takes a different approach: instead of controlling variance statistically, it initializes weights as orthogonal matrices.
A matrix $\mathbf{W}$ is orthogonal if:
\[\mathbf{W}^T \mathbf{W} = \mathbf{W} \mathbf{W}^T = \mathbf{I}\]Orthogonal matrices preserve vector norms:
\[\|\mathbf{W}\mathbf{x}\| = \|\mathbf{x}\|\]For linear networks, orthogonal weights perfectly preserve signal magnitude through layers. Even with nonlinear activations, orthogonal initialization provides an excellent starting point: every singular value of the matrix equals 1, so no direction of the signal is stretched or compressed at initialization.
For a weight matrix of shape $(n_{out}, n_{in})$:
import numpy as np

def orthogonal_init(shape, gain=1.0):
    """Generate an orthogonally initialized weight matrix."""
    rows, cols = shape
    # Draw the random matrix transposed when it is wide, so QR
    # yields a full set of orthonormal vectors
    flat_shape = (cols, rows) if rows < cols else shape
    a = np.random.randn(*flat_shape)
    # QR decomposition gives an orthonormal factor Q
    q, r = np.linalg.qr(a)
    # Fix QR's sign ambiguity so Q is uniformly distributed
    d = np.diag(r)
    q *= np.sign(d)
    # Transpose back for wide matrices
    if rows < cols:
        q = q.T
    # Apply gain
    return gain * q[:rows, :cols]
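A quick sanity check of the two properties we care about (shapes illustrative):

```python
W = orthogonal_init((64, 64))
x = np.random.randn(64, 10)
print(np.allclose(W.T @ W, np.eye(64)))            # True: columns are orthonormal
print(np.linalg.norm(W @ x) / np.linalg.norm(x))   # ≈ 1.0: norm preserved
```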
Orthogonal initialization is particularly effective for recurrent networks, where the same matrix is applied at every time step, and for very deep networks where norm preservation matters most.
Before Xavier and He, Yann LeCun proposed an initialization scheme in his 1998 work “Efficient BackProp”:
\[\text{Var}(W) = \frac{1}{n_{in}}\]This is essentially Xavier initialization using only the fan-in, equivalent to He initialization with a gain of 1 instead of $\sqrt{2}$.
LeCun initialization is appropriate for SELU activations, where it underpins the self-normalizing property, and for linear layers.
LSUV (Layer-Sequential Unit-Variance) is a data-driven initialization approach introduced by Mishkin and Matas in 2015.
Instead of deriving variance formulas mathematically, LSUV empirically ensures unit variance at each layer:
Input: Network with L layers, mini-batch of data X
1. For each layer l = 1 to L:
a. Initialize W[l] with orthogonal initialization
b. While |Var(output[l]) - 1| > tolerance:
- Forward pass X through layers 1 to l
- Compute v = Var(output[l])
- Scale: W[l] = W[l] / sqrt(v)
c. Output[l] now has unit variance
Output: Initialized network
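Here is a minimal NumPy sketch of this procedure, assuming ReLU activations and reusing the `orthogonal_init` helper from earlier; the function name and all sizes are ours, chosen for illustration:

```python
import numpy as np

def lsuv_init(weights, X, tol=0.05, max_iters=10):
    """Minimal LSUV pass for a ReLU MLP.

    weights: list of weight matrices, pre-initialized orthogonally
    X:       mini-batch of shape (input_dim, batch_size)
    """
    h = X
    for W in weights:
        for _ in range(max_iters):
            v = np.maximum(0.0, W @ h).var()   # variance of this layer's output
            if abs(v - 1.0) < tol:
                break
            W /= np.sqrt(v)                    # rescale toward unit variance
        h = np.maximum(0.0, W @ h)             # propagate for the next layer
    return weights

# Usage: orthogonal pre-init, then LSUV rescaling on a data batch
layers = [orthogonal_init((256, 256)) for _ in range(5)]
X = np.random.randn(256, 128)
lsuv_init(layers, X)
```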
For convolutional layers, fan-in is computed as:
\[n_{\text{in}} = k_h \times k_w \times c_{\text{in}}\]Where $k_h$ = kernel height, $k_w$ = kernel width, and $c_{\text{in}}$ = input channels.
For a 3×3 convolution with 64 input channels:
\[n_{in} = 3 \times 3 \times 64 = 576\]He initialization variance: $\frac{2}{576} \approx 0.00347$
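PyTorch computes this fan-in automatically for convolutional weights; a quick check (layer sizes illustrative):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3)
nn.init.kaiming_normal_(conv.weight, mode='fan_in', nonlinearity='relu')

fan_in = 3 * 3 * 64                  # k_h * k_w * c_in = 576
print((2 / fan_in) ** 0.5)           # theoretical std ≈ 0.0589
print(conv.weight.std().item())      # measured std, close to theory
```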
RNNs have two weight matrices: the input-to-hidden matrix $\mathbf{W}_{xh}$ and the hidden-to-hidden (recurrent) matrix $\mathbf{W}_{hh}$.
The recurrent weight matrix is applied many times (once per time step), so non-orthogonal initialization causes exponential explosion or vanishing over the sequence length. Orthogonal initialization of $\mathbf{W}_{hh}$ avoids this, as the demo below shows.
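The effect is easy to demonstrate (sizes illustrative): repeated application of an orthogonal matrix leaves the state norm untouched, while a Gaussian matrix of matched scale drifts:

```python
import numpy as np

def norm_ratio(W, steps=100, seed=0):
    """Norm of the state after `steps` applications, relative to its start."""
    h = np.random.default_rng(seed).normal(size=(W.shape[0], 1))
    start = np.linalg.norm(h)
    for _ in range(steps):
        h = W @ h
    return float(np.linalg.norm(h) / start)

n = 128
rng = np.random.default_rng(1)
W_gauss = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))   # Xavier-scale Gaussian
W_orth, _ = np.linalg.qr(rng.normal(size=(n, n)))          # orthogonal matrix

print(norm_ratio(W_gauss))   # drifts far from 1 over 100 steps
print(norm_ratio(W_orth))    # exactly 1: orthogonal matrices preserve norms
```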
Transformers typically use Xavier initialization for the attention and feed-forward projections, often with an additional depth-dependent scaling applied to the weights on residual branches.
The scaling prevents activations and gradients from accumulating across the many residual connections of deep Transformers; a sketch follows.
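One common convention, used by GPT-2, scales residual projections by $1/\sqrt{N}$ for $N$ residual layers. The helper below is an illustrative sketch of that idea, not the only scheme in use:

```python
import torch
import torch.nn as nn

def init_residual_projection(linear, num_res_layers):
    """Xavier init with GPT-2-style depth scaling (illustrative sketch):
    residual-branch weights divided by sqrt(N) for N residual layers."""
    nn.init.xavier_uniform_(linear.weight)
    with torch.no_grad():
        linear.weight /= num_res_layers ** 0.5
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)

proj = nn.Linear(512, 512)
init_residual_projection(proj, num_res_layers=24)
```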
For residual connections $\mathbf{y} = \mathbf{x} + f(\mathbf{x})$, a special trick applies: initialize the final layer of each residual branch $f$ to zero, so that every block starts as an identity mapping.
This technique, called zero initialization of residual branches, significantly improves training stability for very deep ResNets.
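A minimal sketch of the idea (the class name and sizes are ours):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + f(x); the last layer of f is zero-initialized so the
    block is an identity mapping at the start of training."""
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        nn.init.kaiming_normal_(self.fc1.weight, nonlinearity='relu')
        nn.init.zeros_(self.fc1.bias)
        nn.init.zeros_(self.fc2.weight)   # residual branch outputs 0 initially
        nn.init.zeros_(self.fc2.bias)

    def forward(self, x):
        return x + self.fc2(torch.relu(self.fc1(x)))

block = ResidualBlock(64)
x = torch.randn(8, 64)
print(torch.allclose(block(x), x))   # True: block starts as identity
```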
import numpy as np

def xavier_init(shape, gain=1.0):
    """Xavier/Glorot initialization. Assumes shape = (fan_in, fan_out)."""
    fan_in, fan_out = shape[0], shape[1]
    std = gain * np.sqrt(2.0 / (fan_in + fan_out))
    return np.random.randn(*shape) * std

def he_init(shape, gain=np.sqrt(2)):
    """He/Kaiming initialization for ReLU. Assumes shape = (fan_in, fan_out)."""
    fan_in = shape[0]
    std = gain / np.sqrt(fan_in)
    return np.random.randn(*shape) * std

def lecun_init(shape):
    """LeCun initialization. Assumes shape = (fan_in, fan_out)."""
    fan_in = shape[0]
    std = 1.0 / np.sqrt(fan_in)
    return np.random.randn(*shape) * std
def orthogonal_init(shape, gain=1.0):
"""Orthogonal initialization."""
rows, cols = shape
if rows < cols:
flat_shape = (cols, rows)
else:
flat_shape = shape
a = np.random.randn(*flat_shape)
q, r = np.linalg.qr(a)
d = np.diag(r)
ph = np.sign(d)
q *= ph
if rows < cols:
q = q.T
return gain * q[:rows, :cols]
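A quick check that these helpers produce the expected scales (note they assume shape = (fan_in, fan_out)):

```python
shape = (512, 256)
print(xavier_init(shape).std())   # ≈ sqrt(2/768) ≈ 0.051
print(he_init(shape).std())       # ≈ sqrt(2/512) ≈ 0.0625
print(lecun_init(shape).std())    # ≈ sqrt(1/512) ≈ 0.0442
```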
import torch
import torch.nn as nn

layer = nn.Linear(128, 64)   # example layer to initialize

# Xavier initialization
nn.init.xavier_uniform_(layer.weight)
nn.init.xavier_normal_(layer.weight)
# He/Kaiming initialization
nn.init.kaiming_uniform_(layer.weight, mode='fan_in', nonlinearity='relu')
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
# Orthogonal initialization
nn.init.orthogonal_(layer.weight, gain=1.0)
# Zero initialization (for biases)
nn.init.zeros_(layer.bias)
# Custom initialization for a whole network
def init_weights(module):
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, mode='fan_out', nonlinearity='relu')

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))  # example model
model.apply(init_weights)
Modern networks often use batch normalization or layer normalization. How does this affect initialization?
The good news: normalization layers standardize activations, reducing sensitivity to initialization, so a network with normalization after every layer is much more forgiving of initialization choices. However, initialization still matters: the earliest forward and backward passes happen before training has adapted anything, layers outside the normalized path (embeddings, output heads) get no such protection, and a poor start can still slow early training.
It is always good practice, even with normalization, to use the appropriate initialization (He for ReLU, Xavier for tanh/sigmoid). It costs nothing and can only help.
Never initialize all weights to zero or any constant. Symmetry breaking requires randomness.
It is also very important to match the initialization to the activation function: He for ReLU-family activations, Xavier for tanh and sigmoid.
Biases should typically be initialized to zero; the random weights already break symmetry, and non-zero biases can shift pre-activations into saturated regions of the nonlinearity before training even begins.
Different architectures need different approaches: compute convolutional fan-in from the kernel size and channel count, use orthogonal matrices for recurrent weights, and use scaled schemes for Transformers, as covered above.
If input data isn’t normalized, even perfect weight initialization may fail. Always normalize inputs.
| Method | Formula | Best For | Key Property |
|---|---|---|---|
| Xavier (Normal) | $\mathcal{N}(0, \frac{2}{n_{in}+n_{out}})$ | Tanh, Sigmoid | Balances forward/backward |
| Xavier (Uniform) | $U(-\sqrt{\frac{6}{n_{in}+n_{out}}}, \sqrt{\frac{6}{n_{in}+n_{out}}})$ | Tanh, Sigmoid | Same as above |
| He (Normal) | $\mathcal{N}(0, \frac{2}{n_{in}})$ | ReLU, Leaky ReLU | Compensates for ReLU |
| He (Uniform) | $U(-\sqrt{\frac{6}{n_{in}}}, \sqrt{\frac{6}{n_{in}}})$ | ReLU, Leaky ReLU | Same as above |
| LeCun | $\mathcal{N}(0, \frac{1}{n_{in}})$ | SELU, Linear | Self-normalizing |
| Orthogonal | QR decomposition | RNNs, Very deep | Preserves norms |
| LSUV | Data-driven scaling | Any | Empirically robust |
Weight initialization is the critical first step that determines whether a neural network will train successfully. What seems like a minor detail, the initial values of the weights, has profound implications for the entire training process. The journey from naive random initialization to principled methods like Xavier and He represents a fundamental advance in our understanding of neural networks. These techniques transformed deep learning from unreliable experimentation into a more predictable engineering discipline.
As you build neural networks, remember that the starting point matters. Take the time to initialize properly, and your networks will reward you with stable, efficient training.
Now that you understand initialization, experiment with different methods on your networks. Visualize activation and gradient statistics to see how initialization affects training dynamics. The best way to internalize these concepts is through hands-on practice.