Dec 30, 2025
Continuing in our Deep Learning Series, we now turn our attention to Recurrent Neural Networks (RNNs). So, imagine you’re watching a movie. To understand what’s happening in the current scene, you don’t just look at that single frame in isolation; you remember what happened in the previous scenes. You know who the characters are, what their relationships are, and what conflicts are unfolding. This ability to maintain context over time is fundamental to understanding sequences, and it’s exactly what Recurrent Neural Networks bring to artificial intelligence.
Much of the data we encounter in real life is sequential, with an inherent order in which position matters: words in a sentence, frames in a movie, prices in a time series.
Traditional machine learning models treat each input independently, making them poorly suited for sequential data. If you wanted to predict the next word in a sentence using a standard neural network, it would only see the current word without any memory of what came before. That’s like trying to understand a conversation by only hearing every fifth word!
Sequential data has three key characteristics: the order of elements matters, each element depends on the context set by earlier elements, and sequences vary in length.
Here comes our hero: RNNs were specifically designed to handle exactly these challenges.
Let’s consider a traditional feedforward neural network:
Input Layer → Hidden Layer(s) → Output Layer
This architecture has a fundamental limitation: it requires fixed-size inputs. But how do you handle sentences of different lengths, movies with different numbers of frames, or time series with varying numbers of observations?
You could try padding all inputs to the same maximum length, but this is wasteful and doesn’t truly capture temporal relationships.
Even if you solve the input size issue, feedforward networks have no memory. Each prediction is made independently:
Input 1 → Network → Output 1
Input 2 → Network → Output 2
Input 3 → Network → Output 3
The network processing Input 2 has no knowledge that Input 1 came before it. This is catastrophic for sequential tasks.
Imagine analyzing the sentiment of this sentence:
“The movie started great, but the ending was terrible.”
A bag-of-words model (treating words independently) might see roughly balanced positive (“great”) and negative (“terrible”) signals, with no sense of which one comes last.
But a human understands the sequence: The sentence structure (“but”) indicates the negative sentiment about the ending is the final judgment. A sequential model should understand this temporal structure.
The breakthrough idea of RNNs is beautifully simple: feed the network’s output from the previous time step back into the network as input at the current time step. This creates a loop that allows information to persist:
          ┌─────────────────┐
          │                 │
          ↓                 │
Input → RNN Cell → Output   │
          │                 │
          └── Hidden State ─┘
At each time step $t$, the RNN receives two inputs: the current input $x_t$ and the hidden state $h_{t-1}$ carried over from the previous time step.
This hidden state acts as the network’s memory, carrying information from all previous time steps.
While RNNs are defined by their recurrent structure, it’s helpful to “unfold” them through time to see how they process sequences:
  x₁           x₂           x₃
  ↓            ↓            ↓
[RNN] → h₁ → [RNN] → h₂ → [RNN] → h₃
  ↓            ↓            ↓
  y₁           y₂           y₃
At each time step, the cell takes the current input $x_t$ and the previous hidden state $h_{t-1}$, produces a new hidden state $h_t$, and (optionally) emits an output $y_t$.
It’s important to note that the same weights are shared across all time steps. The network isn’t learning different operations for each position in the sequence; it’s learning a single operation that it applies repeatedly.
The core RNN computation at each time step is surprisingly simple:
Hidden State Update: \(h_t = \tanh(W_{hh} \cdot h_{t-1} + W_{xh} \cdot x_t + b_h)\)
Output (if needed): \(y_t = W_{hy} \cdot h_t + b_y\)
Let’s break this down:
$x_t$ is the input at time $t$, and $h_{t-1}$ is the hidden state carried over from the previous step.
$W_{xh}$ projects the input into the hidden space, $W_{hh}$ transforms the previous hidden state, and $b_h$ is a bias term.
$\tanh$ squashes the pre-activation into the range $(-1, 1)$, keeping the hidden state bounded.
$W_{hy}$ and $b_y$ map the hidden state to the output $y_t$ whenever an output is needed.
The key insight: The hidden state $h_t$ depends on both the current input $x_t$ AND the previous hidden state $h_{t-1}$, which itself depended on $h_{t-2}$, and so on. This creates a chain of dependencies that allows the network to maintain memory.
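To see the chain explicitly, unroll the recurrence for three steps using the update rule above:

\[h_3 = \tanh\big(W_{hh}\tanh\big(W_{hh}\tanh(W_{hh} h_0 + W_{xh} x_1 + b_h) + W_{xh} x_2 + b_h\big) + W_{xh} x_3 + b_h\big)\]

Every occurrence of $W_{hh}$ is nested inside the next, which is exactly why gradients must later flow backward through the same matrix again and again.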
Let’s work through a tiny example to make this concrete. Suppose we have an input size of 2 and a hidden size of 3, with the following weights and inputs:
Weights (randomly initialized):
W_xh = [[ 0.5, -0.3],
        [ 0.2,  0.4],
        [-0.1,  0.6]]      # Shape: (3, 2)

W_hh = [[ 0.3, -0.2,  0.1],
        [ 0.4,  0.5, -0.3],
        [-0.2,  0.1,  0.4]]  # Shape: (3, 3)

b_h = [0.1, -0.1, 0.2]       # Shape: (3,)
Input Sequence:
x_1 = [1.0, 0.5]
x_2 = [0.8, 1.2]
x_3 = [0.3, 0.9]
Time Step 1:
Initial hidden state: $h_0 = [0, 0, 0]$ (all zeros)
# Input contribution
W_xh @ x_1 = [0.5*1.0 + (-0.3)*0.5,
              0.2*1.0 + 0.4*0.5,
              -0.1*1.0 + 0.6*0.5]
           = [0.35, 0.40, 0.20]

# Recurrent contribution (h_0 is zero, so this is zero)
W_hh @ h_0 = [0, 0, 0]

# Combine with bias
h_1_pre = [0.35 + 0 + 0.1, 0.40 + 0 + (-0.1), 0.20 + 0 + 0.2]
        = [0.45, 0.30, 0.40]

# Apply tanh activation
h_1 = tanh([0.45, 0.30, 0.40])
    ≈ [0.42, 0.29, 0.38]
Time Step 2:
Now $h_1 = [0.42, 0.29, 0.38]$
# Input contribution
W_xh @ x_2 = [0.5*0.8 + (-0.3)*1.2,
              0.2*0.8 + 0.4*1.2,
              -0.1*0.8 + 0.6*1.2]
           = [0.04, 0.64, 0.64]

# Recurrent contribution
W_hh @ h_1 = [0.3*0.42 + (-0.2)*0.29 + 0.1*0.38,
              0.4*0.42 + 0.5*0.29 + (-0.3)*0.38,
              -0.2*0.42 + 0.1*0.29 + 0.4*0.38]
           ≈ [0.106, 0.199, 0.097]

# Combine with bias
h_2_pre = [0.04 + 0.106 + 0.1, 0.64 + 0.199 + (-0.1), 0.64 + 0.097 + 0.2]
        ≈ [0.246, 0.739, 0.937]

# Apply tanh
h_2 = tanh([0.246, 0.739, 0.937])
    ≈ [0.24, 0.63, 0.73]
Notice how $h_2$ depends on both $x_2$ (current input) and $h_1$ (which contains information from $x_1$). This is how RNNs maintain memory!
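If you want to double-check the arithmetic, here is a minimal NumPy sketch that reproduces the same recurrence (small last-digit differences come from rounding $h_1$ by hand):

import numpy as np

# Reproduce the hand-worked recurrence above.
W_xh = np.array([[0.5, -0.3], [0.2, 0.4], [-0.1, 0.6]])                   # (3, 2)
W_hh = np.array([[0.3, -0.2, 0.1], [0.4, 0.5, -0.3], [-0.2, 0.1, 0.4]])   # (3, 3)
b_h = np.array([0.1, -0.1, 0.2])

h = np.zeros(3)  # h_0
for x in [np.array([1.0, 0.5]), np.array([0.8, 1.2]), np.array([0.3, 0.9])]:
    h = np.tanh(W_xh @ x + W_hh @ h + b_h)
    print(np.round(h, 2))
# First two printed states: [0.42 0.29 0.38] and approximately [0.24 0.63 0.73]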
The fundamental building block is the RNN cell:
        h_{t-1} (previous hidden state)
            │
            ↓
        ┌───────┐
x_t  →  │ tanh  │  →  h_t (current hidden state)
        └───────┘
Inside the cell, the previous hidden state is multiplied by $W_{hh}$, the current input by $W_{xh}$, the two are summed together with the bias $b_h$, and the result is passed through tanh to produce $h_t$.
Just like feedforward networks, RNNs can be stacked to create deeper architectures:
Layer 2:  h₂¹ → h₂² → h₂³
           ↑     ↑     ↑
Layer 1:  h₁¹ → h₁² → h₁³
           ↑     ↑     ↑
Input:     x₁    x₂    x₃
Each layer processes the outputs of the previous layer as its input sequence. Deeper RNNs can learn more complex patterns and hierarchical representations.
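To make the stacking concrete, here is a small sketch with placeholder sizes and weights (not tied to this post’s implementation): layer 2 treats layer 1’s hidden states as its input sequence.

import numpy as np

# A minimal two-layer (stacked) RNN step with placeholder parameters.
def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    return np.tanh(x @ W_xh + h_prev @ W_hh + b_h)

np.random.seed(0)
input_size, hidden1, hidden2 = 3, 4, 5
layer1 = (np.random.randn(input_size, hidden1) * 0.1,
          np.random.randn(hidden1, hidden1) * 0.1,
          np.zeros(hidden1))
layer2 = (np.random.randn(hidden1, hidden2) * 0.1,
          np.random.randn(hidden2, hidden2) * 0.1,
          np.zeros(hidden2))

h1, h2 = np.zeros(hidden1), np.zeros(hidden2)
for x_t in np.random.randn(6, input_size):   # a toy 6-step input sequence
    h1 = rnn_step(x_t, h1, *layer1)          # layer 1 reads the raw input
    h2 = rnn_step(h1, h2, *layer2)           # layer 2 reads layer 1's output
print(h2.shape)  # (5,)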
Standard RNNs process sequences in one direction (left-to-right, or past-to-future). But sometimes, the future context is also important!
Example: “The animal didn’t cross the street because it was too _____”
To fill in the blank, we need to know what comes after: “tired” (the animal) vs. “wide” (the street).
Bidirectional RNNs solve this by processing the sequence in both directions:
Forward:   h₁⁺ → h₂⁺ → h₃⁺
            ↑     ↑     ↑
Input:      x₁    x₂    x₃
            ↓     ↓     ↓
Backward:  h₁⁻ ← h₂⁻ ← h₃⁻
At each time step, we concatenate the forward and backward hidden states:
\[h_t = [h_t^+; h_t^-]\]

This gives the network access to both past and future context.
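A toy sketch of the idea with placeholder weights: run one sweep left-to-right, one right-to-left, then concatenate the two states at each position.

import numpy as np

# Toy bidirectional pass over a 5-step sequence (all weights are placeholders).
def rnn_step(x, h, W_xh, W_hh, b):
    return np.tanh(x @ W_xh + h @ W_hh + b)

np.random.seed(0)
in_size, hidden, T = 3, 4, 5
X = np.random.randn(T, in_size)
fwd = (np.random.randn(in_size, hidden) * 0.1, np.random.randn(hidden, hidden) * 0.1, np.zeros(hidden))
bwd = (np.random.randn(in_size, hidden) * 0.1, np.random.randn(hidden, hidden) * 0.1, np.zeros(hidden))

h_fwd, h_bwd = [], []
h = np.zeros(hidden)
for t in range(T):                 # past -> future
    h = rnn_step(X[t], h, *fwd)
    h_fwd.append(h)
h = np.zeros(hidden)
for t in reversed(range(T)):       # future -> past
    h = rnn_step(X[t], h, *bwd)
    h_bwd.append(h)
h_bwd.reverse()                    # re-align with time order

H = np.stack([np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)])
print(H.shape)  # (5, 8): each position sees both past and future context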
Training RNNs is more complex than training feedforward networks because of the temporal dependencies. We need to compute gradients that flow backward through time, accounting for how earlier time steps affect later ones.
BPTT is the algorithm for training RNNs. The key insight is to unfold the RNN through time and treat it as a very deep feedforward network, then apply standard backpropagation.
Forward Pass:
t=1: h₁ = tanh(W_hh·h₀ + W_xh·x₁ + b)
t=2: h₂ = tanh(W_hh·h₁ + W_xh·x₂ + b)
t=3: h₃ = tanh(W_hh·h₂ + W_xh·x₃ + b)
...
Loss = L(y_true, y_pred)
Backward Pass:
Starting from the final time step, we compute gradients:
\[\frac{\partial L}{\partial h_t} = \frac{\partial L}{\partial y_t} \frac{\partial y_t}{\partial h_t} + \frac{\partial L}{\partial h_{t+1}} \frac{\partial h_{t+1}}{\partial h_t}\]

This recursive formula shows that the gradient at time $t$ has two contributions: the direct contribution through the output $y_t$ at that step, and the contribution that flows back from the next hidden state $h_{t+1}$.
The gradient flows backward through time, accumulating contributions from all future time steps.
Critically, the same weights $W_{hh}$ and $W_{xh}$ are used at every time step, so their gradients accumulate across all time steps:
\[\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial W_{hh}}\]

This weight sharing is both a blessing (fewer parameters) and a curse (gradients can vanish or explode).
For very long sequences, computing gradients through the entire sequence is computationally expensive and memory-intensive. Truncated BPTT is a practical solution:
Instead of backpropagating through the entire sequence, we split the sequence into chunks of a fixed length $k$, run the forward pass chunk by chunk while carrying the hidden state forward, and backpropagate only within each chunk.
This approximation makes training tractable while still allowing the hidden state to carry information across the entire sequence.
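Here is a self-contained sketch of the chunking pattern with toy sizes (no weight updates, just the bookkeeping): the hidden state is carried across chunks, while gradients would only ever be computed inside a single chunk.

import numpy as np

# Truncated-BPTT bookkeeping sketch with random placeholder weights.
np.random.seed(0)
seq_length, k, in_size, hidden_size = 100, 20, 4, 8
X = np.random.randn(1, seq_length, in_size)
W_xh = np.random.randn(in_size, hidden_size) * 0.1
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1

h = np.zeros((1, hidden_size))
for start in range(0, seq_length, k):
    chunk = X[:, start:start + k, :]
    for t in range(chunk.shape[1]):
        h = np.tanh(chunk[:, t, :] @ W_xh + h @ W_hh)
    # In a real implementation: backpropagate through these k steps here,
    # then detach h (treat it as a constant) before the next chunk.
    print(f"chunk starting at t={start}: ||h|| = {np.linalg.norm(h):.3f}")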
The vanishing gradient problem is the Achilles’ heel of basic RNNs. To understand it, let’s examine how gradients flow backward through time.
When we compute $\frac{\partial L}{\partial h_1}$ (the gradient at time step 1) for a sequence of length $T$, we need to apply the chain rule through all intermediate time steps:
\[\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial h_T} \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}}\]

Each term $\frac{\partial h_t}{\partial h_{t-1}}$ involves the recurrent weight matrix $W_{hh}$ and the derivative of the activation function:
\[\frac{\partial h_t}{\partial h_{t-1}} = W_{hh}^T \cdot \text{diag}\left(1 - h_t^2\right)\]

where $1 - h_t^2$ is the derivative of tanh evaluated at the pre-activation. The problem arises from repeated multiplication:
Activation Function Saturation: The derivative of tanh has a maximum of 1 (at 0) and approaches 0 for large positive or negative inputs. Repeatedly multiplying by values less than 1 causes exponential decay.
Weight Matrix Norm: If the largest eigenvalue of $W_{hh}$ is less than 1, repeated multiplication by $W_{hh}$ will cause the gradient to shrink exponentially.
Numerical Example:
Suppose each $\frac{\partial h_t}{\partial h_{t-1}} \approx 0.5$ (a plausible per-step factor once tanh begins to saturate).
After 10 time steps: \(\text{gradient} \propto 0.5^{10} = 0.00098\)
After 20 time steps: \(\text{gradient} \propto 0.5^{20} = 0.00000095\)
The gradient has effectively vanished! The network can’t learn long-term dependencies because the learning signal from distant time steps is too weak.
Conversely, if each $\frac{\partial h_t}{\partial h_{t-1}} > 1$, gradients can explode:
After 10 time steps: \(\text{gradient} \propto 2^{10} = 1024\)
After 20 time steps: \(\text{gradient} \propto 2^{20} = 1{,}048{,}576\)
This causes numerical instability and training divergence.
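You can reproduce both failure modes with a couple of lines: a per-step factor below 1 decays exponentially, and a factor above 1 blows up exponentially.

# Reproduce the decay/explosion arithmetic above.
for factor in (0.5, 2.0):
    for T in (10, 20):
        print(f"factor={factor}, T={T}: {factor ** T:g}")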
For Exploding Gradients: gradient clipping rescales the gradient whenever its norm exceeds a threshold:

if gradient.norm() > threshold:
    gradient = gradient * (threshold / gradient.norm())
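A runnable NumPy version of that rule might look like this sketch (it clips a single gradient array; in practice the global norm over all parameter gradients is usually clipped):

import numpy as np

def clip_gradient(grad: np.ndarray, threshold: float = 5.0) -> np.ndarray:
    """Rescale grad so its L2 norm never exceeds threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

print(clip_gradient(np.full(100, 1.0)).round(2)[0])  # 0.5: norm 10 clipped to 5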
For Vanishing Gradients: there is no equally simple fix inside a vanilla RNN. Careful weight initialization and shorter sequences help at the margins, but the real remedy is architectural: gated cells such as LSTM and GRU, which we introduce in later posts.
The vanishing gradient problem means basic RNNs struggle with long-term dependencies. They can remember information from 5-10 time steps back, but not 50 or 100 steps back.
Example Tasks Where This Fails:
❌ “I grew up in France… [50 words later]… I speak fluent _____”
❌ Long-form text generation
❌ Long-term stock price prediction
This limitation motivated the development of LSTM and GRU architectures, which we cover in subsequent tutorials.
RNNs can be configured in different ways depending on the task. Let’s explore the main architectures:
One-to-One:
Input → Network → Output
This is just a standard feedforward network. No temporal dependencies.
Example: Image classification (single input, single output)
One-to-Many:
Input → [RNN] → [RNN] → [RNN] → [RNN]
          ↓       ↓       ↓       ↓
          y₁      y₂      y₃      y₄
A single input generates a sequence of outputs.
Examples: image captioning (one image in, a sequence of words out), generating a melody from a single seed note.
Many-to-One:
  x₁      x₂      x₃      x₄
  ↓       ↓       ↓       ↓
[RNN] → [RNN] → [RNN] → [RNN] → Output
A sequence of inputs produces a single output (usually at the last time step).
Examples: sentiment analysis (like the movie-review sentence earlier), spam detection, and other sequence classification tasks.
Many-to-Many (synchronized):
  x₁      x₂      x₃
  ↓       ↓       ↓
[RNN] → [RNN] → [RNN]
  ↓       ↓       ↓
  y₁      y₂      y₃
Sequence input to sequence output, where outputs are synchronized with inputs.
Examples: part-of-speech tagging, named entity recognition, per-frame video labeling.
Many-to-Many (encoder-decoder, sequence-to-sequence):
Encoder:   x₁      x₂      x₃
            ↓       ↓       ↓
          [RNN] → [RNN] → [RNN] → context
                                     ↓
Decoder:                           [RNN] → [RNN] → [RNN]
                                     ↓       ↓       ↓
                                     y₁      y₂      y₃
Input and output sequences can have different lengths. The encoder processes the input into a fixed-size context vector, which the decoder uses to generate output.
Examples: machine translation, text summarization, speech-to-text transcription.
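To make the encoder-decoder flow concrete, here is a toy NumPy sketch with placeholder weights (a real decoder would also feed back the token it just generated and stop on an end-of-sequence symbol):

import numpy as np

def rnn_step(x, h, W_xh, W_hh, b):
    return np.tanh(x @ W_xh + h @ W_hh + b)

np.random.seed(1)
in_size, hidden, out_size = 3, 4, 2
enc = (np.random.randn(in_size, hidden) * 0.1, np.random.randn(hidden, hidden) * 0.1, np.zeros(hidden))
W_hh_dec = np.random.randn(hidden, hidden) * 0.1
W_hy = np.random.randn(hidden, out_size) * 0.1

# Encoder: compress a 5-step input sequence into a single context vector.
h = np.zeros(hidden)
for x_t in np.random.randn(5, in_size):
    h = rnn_step(x_t, h, *enc)
context = h

# Decoder: start from the context and unroll 3 output steps of its own.
h = context
for _ in range(3):
    h = np.tanh(h @ W_hh_dec)
    print(np.round(h @ W_hy, 3))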
Let’s implement a basic RNN cell in Python using only NumPy. This will solidify our understanding of the mechanics.
import numpy as np


class SimpleRNNCell:
    """
    A single RNN cell implementing:
        h_t = tanh(W_hh @ h_{t-1} + W_xh @ x_t + b_h)
    """

    def __init__(self, input_size: int, hidden_size: int, seed: int = 42):
        """
        Initialize RNN cell with random weights.

        Args:
            input_size: Dimension of input vectors
            hidden_size: Dimension of hidden state
            seed: Random seed for reproducibility
        """
        np.random.seed(seed)
        self.input_size = input_size
        self.hidden_size = hidden_size

        # Xavier/Glorot initialization for better gradient flow
        # W_xh: transforms input to hidden space
        self.W_xh = np.random.randn(input_size, hidden_size) * np.sqrt(2.0 / (input_size + hidden_size))

        # W_hh: transforms previous hidden state to current hidden state
        self.W_hh = np.random.randn(hidden_size, hidden_size) * np.sqrt(2.0 / (hidden_size + hidden_size))

        # Bias term
        self.b_h = np.zeros((1, hidden_size))

        # Cache for backward pass
        self.cache = {}

    def forward(self, x_t: np.ndarray, h_prev: np.ndarray) -> np.ndarray:
        """
        Forward pass for a single time step.

        Args:
            x_t: Input at time t, shape (batch_size, input_size)
            h_prev: Previous hidden state, shape (batch_size, hidden_size)

        Returns:
            h_t: Current hidden state, shape (batch_size, hidden_size)
        """
        # Linear combination
        h_linear = h_prev @ self.W_hh + x_t @ self.W_xh + self.b_h

        # Non-linear activation
        h_t = np.tanh(h_linear)

        # Cache for backward pass
        self.cache['x_t'] = x_t
        self.cache['h_prev'] = h_prev
        self.cache['h_linear'] = h_linear
        self.cache['h_t'] = h_t

        return h_t

    def backward(self, dh_t: np.ndarray) -> tuple:
        """
        Backward pass for a single time step.

        Args:
            dh_t: Gradient of loss w.r.t. h_t, shape (batch_size, hidden_size)

        Returns:
            dh_prev: Gradient w.r.t. previous hidden state
            dW_xh: Gradient w.r.t. input weights
            dW_hh: Gradient w.r.t. recurrent weights
            db_h: Gradient w.r.t. bias
        """
        # Retrieve cached values
        x_t = self.cache['x_t']
        h_prev = self.cache['h_prev']
        h_linear = self.cache['h_linear']
        batch_size = x_t.shape[0]

        # Gradient through tanh activation
        # d(tanh(x))/dx = 1 - tanh²(x)
        dtanh = 1 - np.tanh(h_linear) ** 2
        dh_linear = dh_t * dtanh

        # Gradients w.r.t. weights and bias
        dW_xh = x_t.T @ dh_linear
        dW_hh = h_prev.T @ dh_linear
        db_h = np.sum(dh_linear, axis=0, keepdims=True)

        # Gradient w.r.t. previous hidden state (for backprop through time)
        dh_prev = dh_linear @ self.W_hh.T

        return dh_prev, dW_xh, dW_hh, db_h
class SimpleRNN:
    """
    Complete RNN for sequence processing.
    """

    def __init__(self, input_size: int, hidden_size: int, output_size: int, seed: int = 42):
        """
        Initialize RNN.

        Args:
            input_size: Dimension of input features
            hidden_size: Dimension of hidden state
            output_size: Dimension of output
            seed: Random seed
        """
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size

        # RNN cell
        self.cell = SimpleRNNCell(input_size, hidden_size, seed)

        # Output projection layer
        np.random.seed(seed)
        self.W_hy = np.random.randn(hidden_size, output_size) * np.sqrt(2.0 / (hidden_size + output_size))
        self.b_y = np.zeros((1, output_size))

        # For BPTT
        self.hidden_states = []

    def forward(self, X: np.ndarray, h_0: np.ndarray = None) -> tuple:
        """
        Forward pass through entire sequence.

        Args:
            X: Input sequence, shape (batch_size, seq_length, input_size)
            h_0: Initial hidden state, shape (batch_size, hidden_size).
                 If None, initialized to zeros.

        Returns:
            outputs: Output sequence, shape (batch_size, seq_length, output_size)
            h_t: Final hidden state, shape (batch_size, hidden_size)
        """
        batch_size, seq_length, _ = X.shape

        # Initialize hidden state
        if h_0 is None:
            h_t = np.zeros((batch_size, self.hidden_size))
        else:
            h_t = h_0

        # Store all hidden states for BPTT
        self.hidden_states = [h_t]
        outputs = []

        # Process sequence
        for t in range(seq_length):
            x_t = X[:, t, :]  # Input at time t

            # RNN cell forward
            h_t = self.cell.forward(x_t, h_t)
            self.hidden_states.append(h_t)

            # Compute output
            y_t = h_t @ self.W_hy + self.b_y
            outputs.append(y_t)

        # Stack outputs
        outputs = np.stack(outputs, axis=1)  # (batch_size, seq_length, output_size)

        return outputs, h_t

    def compute_loss(self, predictions: np.ndarray, targets: np.ndarray) -> float:
        """
        Compute mean squared error loss.

        Args:
            predictions: Model predictions, shape (batch_size, seq_length, output_size)
            targets: Ground truth, shape (batch_size, seq_length, output_size)

        Returns:
            loss: Scalar loss value
        """
        return np.mean((predictions - targets) ** 2)
# Example usage
def example_rnn_usage():
    """
    Demonstrate RNN on a simple sequence prediction task.
    """
    # Hyperparameters
    batch_size = 2
    seq_length = 5
    input_size = 3
    hidden_size = 4
    output_size = 2

    # Create synthetic data
    X = np.random.randn(batch_size, seq_length, input_size)
    Y_true = np.random.randn(batch_size, seq_length, output_size)

    # Initialize RNN
    rnn = SimpleRNN(input_size, hidden_size, output_size)

    # Forward pass
    Y_pred, final_hidden = rnn.forward(X)

    # Compute loss
    loss = rnn.compute_loss(Y_pred, Y_true)

    print(f"Input shape: {X.shape}")
    print(f"Output shape: {Y_pred.shape}")
    print(f"Final hidden state shape: {final_hidden.shape}")
    print(f"Loss: {loss:.4f}")

    return rnn, X, Y_pred, loss
# Example: Character-level language model
def character_level_example():
    """
    Simple character-level language model using RNN.
    """
    # Vocabulary
    text = "hello world"
    chars = sorted(list(set(text)))
    vocab_size = len(chars)

    # Create mappings
    char_to_idx = {ch: i for i, ch in enumerate(chars)}
    idx_to_char = {i: ch for i, ch in enumerate(chars)}

    # Convert text to indices
    data = [char_to_idx[ch] for ch in text]

    print("Character-level RNN Example")
    print(f"Text: '{text}'")
    print(f"Vocabulary: {chars}")
    print(f"Vocab size: {vocab_size}")
    print(f"Encoded: {data}")

    # Create training pairs: predict next character
    seq_length = 3
    X_data = []
    Y_data = []

    for i in range(len(data) - seq_length):
        # One-hot encode input sequence
        x_seq = np.zeros((seq_length, vocab_size))
        for t, idx in enumerate(data[i:i+seq_length]):
            x_seq[t, idx] = 1

        # One-hot encode target (next character)
        y_seq = np.zeros((seq_length, vocab_size))
        for t, idx in enumerate(data[i+1:i+seq_length+1]):
            y_seq[t, idx] = 1

        X_data.append(x_seq)
        Y_data.append(y_seq)

    X_data = np.array(X_data)  # Shape: (num_samples, seq_length, vocab_size)
    Y_data = np.array(Y_data)

    print(f"\nTraining data shape: {X_data.shape}")
    print(f"Target data shape: {Y_data.shape}")

    # Initialize RNN
    hidden_size = 10
    rnn = SimpleRNN(vocab_size, hidden_size, vocab_size)

    # Forward pass
    predictions, _ = rnn.forward(X_data)
    loss = rnn.compute_loss(predictions, Y_data)

    print(f"\nInitial loss: {loss:.4f}")
if __name__ == "__main__":
    print("=" * 60)
    print("Simple RNN Implementation from Scratch")
    print("=" * 60)
    print()

    # Run basic example
    example_rnn_usage()
    print()

    # Run character-level example
    character_level_example()
Weight Initialization: We use Xavier/Glorot initialization to help prevent vanishing/exploding gradients at the start of training.
Tanh Activation: The derivative is $1 - \tanh^2(x)$, which we compute during backpropagation.
Caching: We store intermediate values during the forward pass to use in the backward pass.
Backpropagation Through Time: The backward method computes gradients for a single time step. To train the full sequence, we’d need to call this recursively for all time steps.
Gradient Accumulation: Gradients for $W_{xh}$ and $W_{hh}$ accumulate across all time steps because these weights are shared.
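Putting those notes together, here is one possible sketch of full BPTT for the SimpleRNN above (a standalone function, not part of the original implementation); it assumes rnn.forward(X) has just been called so rnn.hidden_states is populated, and uses the mean-squared-error loss from compute_loss. Treat it as a sketch rather than a tested training loop.

import numpy as np

def bptt(rnn, X, Y, outputs):
    """Accumulate gradients over all time steps for the shared RNN weights."""
    batch_size, seq_length, _ = X.shape
    n_elems = batch_size * seq_length * rnn.output_size

    dW_xh = np.zeros_like(rnn.cell.W_xh)
    dW_hh = np.zeros_like(rnn.cell.W_hh)
    db_h = np.zeros_like(rnn.cell.b_h)
    dW_hy = np.zeros_like(rnn.W_hy)
    db_y = np.zeros_like(rnn.b_y)
    dh_next = np.zeros((batch_size, rnn.hidden_size))

    for t in reversed(range(seq_length)):
        h_t = rnn.hidden_states[t + 1]     # hidden state produced at step t
        h_prev = rnn.hidden_states[t]
        x_t = X[:, t, :]

        # MSE loss gradient w.r.t. the output at step t
        dy = 2 * (outputs[:, t, :] - Y[:, t, :]) / n_elems
        dW_hy += h_t.T @ dy
        db_y += dy.sum(axis=0, keepdims=True)

        # Gradient w.r.t. h_t: from this step's output plus from the future
        dh = dy @ rnn.W_hy.T + dh_next

        # Backprop through tanh (tanh' = 1 - h_t**2), then into the shared weights
        dh_raw = dh * (1 - h_t ** 2)
        dW_xh += x_t.T @ dh_raw
        dW_hh += h_prev.T @ dh_raw
        db_h += dh_raw.sum(axis=0, keepdims=True)

        # Pass the gradient to the previous time step
        dh_next = dh_raw @ rnn.cell.W_hh.T

    return dW_xh, dW_hh, db_h, dW_hy, db_y

The returned gradients could then be applied with plain gradient descent and, for long sequences, clipped as described earlier.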
Sequential Processing: RNNs are specifically designed for sequential data, unlike feedforward networks.
Variable Length Input: Can handle sequences of any length without architectural changes.
Parameter Sharing: The same weights are used at every time step, reducing the total number of parameters.
Memory: The hidden state acts as memory, allowing the network to maintain context.
Theoretical Power: RNNs are Turing-complete, meaning they can theoretically compute any computable function given enough time and resources.
Versatility: Can be configured for many different sequence tasks (one-to-many, many-to-one, many-to-many).
Vanishing/Exploding Gradients: Basic RNNs struggle to learn long-term dependencies due to gradient issues during backpropagation through time.
Sequential Processing: Unlike CNNs or Transformers, RNNs must process sequences step-by-step, which is slower and harder to parallelize.
Limited Memory: In practice, vanilla RNNs only remember information from ~5-10 steps back effectively.
Training Difficulty: BPTT is computationally expensive for long sequences, and the model is sensitive to hyperparameters.
Slow Inference: Generating long sequences requires many sequential forward passes, making real-time generation challenging.
Difficulty with Long-Range Dependencies: Even with gradient clipping and careful initialization, capturing dependencies spanning hundreds of time steps is difficult.
Recurrent Neural Networks represent a fundamental breakthrough in handling sequential data. By introducing recurrence and feeding the hidden state back in at every step, RNNs create a form of memory that allows them to process sequences of any length while maintaining context. The basic RNN architecture we implemented here forms the foundation for more advanced architectures like LSTMs and GRUs, which address many of the limitations of vanilla RNNs; we will explore them in future posts.
For hands-on practice, check out the companion notebooks - Understanding RNN Architecture From Scratch