
Deep Learning Series

by Mayank Sharma

Understanding Recurrent Neural Networks: A Complete Guide from Theory to Practice

Dec 30, 2025

Continuing our Deep Learning Series, we now turn our attention to Recurrent Neural Networks (RNNs). Imagine you’re watching a movie. To understand what’s happening in the current scene, you don’t just look at that single frame in isolation; you remember what happened in the previous scenes. You know who the characters are, what their relationships are, and what conflicts are unfolding. This ability to maintain context over time is fundamental to understanding sequences, and it’s exactly what Recurrent Neural Networks bring to artificial intelligence.

Table of Contents

  1. Introduction: Why Sequential Data Matters
  2. The Limitation of Traditional Neural Networks
  3. Enter Recurrent Neural Networks
  4. The Architecture of RNNs
  5. How RNNs Learn: Backpropagation Through Time
  6. The Vanishing Gradient Problem
  7. Types of RNN Architectures
  8. Building an RNN from Scratch
  9. Advantages and Limitations
  10. Conclusion
  11. Jupyter Notebook

Introduction: Why Sequential Data Matters

The World is Sequential

Much of the data we encounter in real life is sequential: it has an inherent order in which position matters, whether it’s the words of a sentence, the frames of a movie, or the prices in a financial time series.

Traditional machine learning models treat each input independently, making them poorly suited for sequential data. If you wanted to predict the next word in a sentence using a standard neural network, it would only see the current word without any memory of what came before. That’s like trying to understand a conversation by only hearing every fifth word!

What Makes Sequential Data Special?

Sequential data has three key characteristics:

  1. Temporal Dependencies: Earlier elements influence later ones
  2. Variable Length: Sequences can be different lengths (a sentence can be 5 or 50 words)
  3. Context Matters: The meaning of an element depends on its position in the sequence

Here comes our hero: RNNs were specifically designed to handle exactly these challenges.

The Limitation of Traditional Neural Networks

The Fixed Input Problem

Let’s consider a traditional feedforward neural network:

Input Layer → Hidden Layer(s) → Output Layer

This architecture has a fundamental limitation: it requires fixed-size inputs. But how do you handle inputs whose length varies, like a sentence that might be 5 words long or 50?

You could try padding all inputs to the same maximum length, but this is wasteful and doesn’t truly capture temporal relationships.

The Memory Problem

Even if you solve the input size issue, feedforward networks have no memory. Each prediction is made independently:

Input 1 → Network → Output 1
Input 2 → Network → Output 2
Input 3 → Network → Output 3

The network processing Input 2 has no knowledge that Input 1 came before it. This is catastrophic for sequential tasks.

A Concrete Example: Sentiment Analysis

Imagine analyzing the sentiment of this sentence:

“The movie started great, but the ending was terrible.”

A bag-of-words model (treating words independently) might see one strongly positive word (“great”) and one strongly negative word (“terrible”) and call the overall sentiment mixed or neutral.

But a human understands the sequence: The sentence structure (“but”) indicates the negative sentiment about the ending is the final judgment. A sequential model should understand this temporal structure.

Enter Recurrent Neural Networks

The Core Innovation: Recurrence

The breakthrough idea of RNNs is beautifully simple: feed the network’s output from the previous time step back into the network as input at the current time step. This creates a loop that allows information to persist:

          ┌── Hidden State ──┐
          │                  │
          ▼                  │
Input ──► RNN Cell ──────────┘
              │
              ▼
           Output

At each time step $t$, the RNN receives two inputs:

  1. The current input $x_t$
  2. The hidden state from the previous time step $h_{t-1}$

This hidden state acts as the network’s memory, carrying information from all previous time steps.

Unfolding Through Time

While RNNs are defined by their recurrent structure, it’s helpful to “unfold” them through time to see how they process sequences:

 x₁          x₂          x₃
 │           │           │
 ▼           ▼           ▼
[RNN] ─h₁─► [RNN] ─h₂─► [RNN] ─► h₃
 │           │           │
 ▼           ▼           ▼
 y₁          y₂          y₃

At each time step, the cell takes the current input $x_t$ and the previous hidden state $h_{t-1}$, produces a new hidden state $h_t$, and (optionally) emits an output $y_t$.

It’s important to note that the same weights are shared across all time steps. The network isn’t learning different operations for each position in the sequence; it’s learning a single operation that it applies repeatedly.

The Mathematics of RNNs

The core RNN computation at each time step is surprisingly simple:

Hidden State Update: \(h_t = \tanh(W_{hh} \cdot h_{t-1} + W_{xh} \cdot x_t + b_h)\)

Output (if needed): \(y_t = W_{hy} \cdot h_t + b_y\)

Let’s break this down:

  * $x_t$: the input at time step $t$
  * $h_{t-1}$, $h_t$: the previous and current hidden states
  * $W_{xh}$: the input-to-hidden weight matrix
  * $W_{hh}$: the hidden-to-hidden (recurrent) weight matrix
  * $b_h$: the hidden-state bias
  * $W_{hy}$, $b_y$: the hidden-to-output weights and bias (used only when an output is needed)
  * $\tanh$: the non-linearity that keeps each hidden unit bounded between -1 and 1

The key insight: The hidden state $h_t$ depends on both the current input $x_t$ AND the previous hidden state $h_{t-1}$, which itself depended on $h_{t-2}$, and so on. This creates a chain of dependencies that allows the network to maintain memory.

A Concrete Numerical Example

Let’s work through a tiny example to make this concrete. Suppose we have an input of size 2 and a hidden state of size 3, with the following setup:

Weights (randomly initialized):

W_xh = [[0.5, -0.3],
        [0.2,  0.4],
        [-0.1, 0.6]]  # Shape: (3, 2)

W_hh = [[0.3, -0.2, 0.1],
        [0.4,  0.5, -0.3],
        [-0.2, 0.1, 0.4]]  # Shape: (3, 3)

b_h = [0.1, -0.1, 0.2]  # Shape: (3,)

Input Sequence:

x_1 = [1.0, 0.5]
x_2 = [0.8, 1.2]
x_3 = [0.3, 0.9]

Time Step 1:

Initial hidden state: $h_0 = [0, 0, 0]$ (all zeros)

# Input contribution
W_xh @ x_1 = [0.5*1.0 + (-0.3)*0.5,
              0.2*1.0 + 0.4*0.5,
              -0.1*1.0 + 0.6*0.5]
           = [0.35, 0.40, 0.20]

# Recurrent contribution (h_0 is zero, so this is zero)
W_hh @ h_0 = [0, 0, 0]

# Combine with bias
h_1_pre = [0.35+0+0.1, 0.40+0+(-0.1), 0.20+0+0.2]
        = [0.45, 0.30, 0.40]

# Apply tanh activation
h_1 = tanh([0.45, 0.30, 0.40])
    ≈ [0.42, 0.29, 0.38]

Time Step 2:

Now $h_1 = [0.42, 0.29, 0.38]$

# Input contribution
W_xh @ x_2 = [0.5*0.8 + (-0.3)*1.2,
              0.2*0.8 + 0.4*1.2,
              -0.1*0.8 + 0.6*1.2]
           = [0.04, 0.64, 0.64]

# Recurrent contribution
W_hh @ h_1 = [0.3*0.42 + (-0.2)*0.29 + 0.1*0.38,
              0.4*0.42 + 0.5*0.29 + (-0.3)*0.38,
              -0.2*0.42 + 0.1*0.29 + 0.4*0.38]
           ≈ [0.106, 0.199, 0.097]

# Combine
h_2_pre = [0.04+0.106+0.1, 0.64+0.199+(-0.1), 0.64+0.097+0.2]
        ≈ [0.246, 0.739, 0.937]

# Apply tanh
h_2 = tanh([0.246, 0.739, 0.937])
    ≈ [0.24, 0.63, 0.73]

Notice how $h_2$ depends on both $x_2$ (current input) and $h_1$ (which contains information from $x_1$). This is how RNNs maintain memory!
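
If you want to check these numbers yourself, the whole example fits in a few lines of NumPy. This little script is just a verification of the hand computation above, using the exact same weights and inputs; it keeps full precision internally, and the printed values match the hand computation after rounding to two decimals.

import numpy as np

# Same weights and inputs as the worked example above
W_xh = np.array([[0.5, -0.3], [0.2, 0.4], [-0.1, 0.6]])                  # (3, 2)
W_hh = np.array([[0.3, -0.2, 0.1], [0.4, 0.5, -0.3], [-0.2, 0.1, 0.4]])  # (3, 3)
b_h  = np.array([0.1, -0.1, 0.2])

h = np.zeros(3)  # h_0
for t, x in enumerate([np.array([1.0, 0.5]),
                       np.array([0.8, 1.2]),
                       np.array([0.3, 0.9])], start=1):
    h = np.tanh(W_hh @ h + W_xh @ x + b_h)   # the single RNN update rule
    print(f"h_{t} ≈ {np.round(h, 2)}")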

The Architecture of RNNs

Basic RNN Cell

The fundamental building block is the RNN cell:

        h_{t-1} (previous hidden state)
            │
            ▼
         ┌──────┐
 x_t ──► │ tanh │ ──► h_t (current hidden state)
         └──────┘

Inside the cell:

  1. Concatenate or combine $x_t$ and $h_{t-1}$
  2. Apply linear transformation (weight matrices)
  3. Apply non-linear activation (typically tanh or ReLU)
  4. Output new hidden state $h_t$

Stacking RNN Layers

Just like feedforward networks, RNNs can be stacked to create deeper architectures:

Layer 2:  h₂¹ ─► h₂² ─► h₂³
           ▲      ▲      ▲
Layer 1:  h₁¹ ─► h₁² ─► h₁³
           ▲      ▲      ▲
Input:     x₁     x₂     x₃

Each layer processes the outputs of the previous layer as its input sequence. Deeper RNNs can learn more complex patterns and hierarchical representations.
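
As a quick illustration, here is a minimal NumPy sketch of a two-layer stack (the dimensions are arbitrary choices for the example): layer 2 runs exactly the same recurrence as layer 1, but its input sequence is the sequence of hidden states produced by layer 1.

import numpy as np

def rnn_layer(xs, W_xh, W_hh, b_h):
    """Run one recurrent layer over a sequence and return all hidden states."""
    h, hs = np.zeros(W_hh.shape[0]), []
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)
        hs.append(h)
    return hs

np.random.seed(0)
xs = [np.random.randn(2) for _ in range(4)]     # input sequence: 4 steps, dim 2

# Layer 1 maps inputs (dim 2) to hidden states (dim 3)
layer1 = rnn_layer(xs, np.random.randn(3, 2) * 0.5, np.random.randn(3, 3) * 0.5, np.zeros(3))
# Layer 2 treats layer 1's hidden states as its own input sequence (dim 3 -> dim 3)
layer2 = rnn_layer(layer1, np.random.randn(3, 3) * 0.5, np.random.randn(3, 3) * 0.5, np.zeros(3))

print(len(layer2), layer2[-1].shape)            # 4 hidden states, each of dimension 3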

Bidirectional RNNs

Standard RNNs process sequences in one direction (left-to-right, or past-to-future). But sometimes, the future context is also important!

Example: “The animal didn’t cross the street because it was too _____”

To fill in the blank, we need to know what comes after: “tired” (the animal) vs. “wide” (the street).

Bidirectional RNNs solve this by processing the sequence in both directions:

Forward:  h₁⁺ ─► h₂⁺ ─► h₃⁺
           ▲      ▲      ▲
Input:     x₁     x₂     x₃
           ▼      ▼      ▼
Backward: h₁⁻ ◄─ h₂⁻ ◄─ h₃⁻

At each time step, we concatenate the forward and backward hidden states:

\[h_t = [h_t^+; h_t^-]\]

This gives the network access to both past and future context.
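
A minimal NumPy sketch of that concatenation, with toy dimensions and separately initialized parameters for each direction:

import numpy as np

def run_direction(xs, W_xh, W_hh, b_h):
    # Run one direction over the sequence and return the hidden state at every step.
    h, hs = np.zeros(W_hh.shape[0]), []
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)
        hs.append(h)
    return hs

np.random.seed(0)
xs = [np.random.randn(2) for _ in range(4)]              # 4 time steps, input dim 2

# Separate parameters for each direction (hidden dim 3 each)
fwd = run_direction(xs,       np.random.randn(3, 2) * 0.5, np.random.randn(3, 3) * 0.5, np.zeros(3))
bwd = run_direction(xs[::-1], np.random.randn(3, 2) * 0.5, np.random.randn(3, 3) * 0.5, np.zeros(3))[::-1]

# h_t = [h_t^+ ; h_t^-]: concatenate the aligned forward and backward states
combined = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
print(combined[0].shape)                                 # (6,)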

How RNNs Learn: Backpropagation Through Time

The Challenge of Training RNNs

Training RNNs is more complex than training feedforward networks because of the temporal dependencies. We need to compute gradients that flow backward through time, accounting for how earlier time steps affect later ones.

Backpropagation Through Time (BPTT)

BPTT is the algorithm for training RNNs. The key insight is to unfold the RNN through time and treat it as a very deep feedforward network, then apply standard backpropagation.

Forward Pass:

t=1: h₁ = tanh(W_hh·h₀ + W_xh·x₁ + b_h)
t=2: h₂ = tanh(W_hh·h₁ + W_xh·x₂ + b_h)
t=3: h₃ = tanh(W_hh·h₂ + W_xh·x₃ + b_h)
...
Loss = L(y_true, y_pred)

Backward Pass:

Starting from the final time step, we compute gradients:

\[\frac{\partial L}{\partial h_t} = \frac{\partial L}{\partial y_t} \frac{\partial y_t}{\partial h_t} + \frac{\partial L}{\partial h_{t+1}} \frac{\partial h_{t+1}}{\partial h_t}\]

This recursive formula shows that the gradient at time $t$ depends on:

  1. The loss at time $t$ (if there’s an output)
  2. The gradient from the next time step $t+1$

The gradient flows backward through time, accumulating contributions from all future time steps.

Weight Updates

Critically, the same weights $W_{hh}$ and $W_{xh}$ are used at every time step, so their gradients accumulate across all time steps:

\[\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial W_{hh}}\]

This weight sharing is both a blessing (fewer parameters) and a curse (gradients can vanish or explode).
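
Here is a compact NumPy sketch of that accumulation. The loss is an assumed toy choice (squared error on the final hidden state), picked only to make the example runnable; the point is that every pass through the backward loop adds one more time step’s contribution to the gradients of the shared weights.

import numpy as np

np.random.seed(0)
T, input_size, hidden_size = 6, 2, 3
xs = [np.random.randn(input_size) for _ in range(T)]
target = np.random.randn(hidden_size)                 # made-up target for the toy loss

W_xh = np.random.randn(hidden_size, input_size) * 0.5
W_hh = np.random.randn(hidden_size, hidden_size) * 0.5
b_h = np.zeros(hidden_size)

# Forward pass: keep every hidden state for the backward pass
hs = [np.zeros(hidden_size)]
for x in xs:
    hs.append(np.tanh(W_hh @ hs[-1] + W_xh @ x + b_h))
loss = 0.5 * np.sum((hs[-1] - target) ** 2)

# Backward pass: walk back through time, accumulating into the shared weights
dW_xh = np.zeros_like(W_xh)
dW_hh = np.zeros_like(W_hh)
db_h = np.zeros_like(b_h)
dh = hs[-1] - target                                  # dL/dh_T
for t in reversed(range(T)):
    dpre = dh * (1 - hs[t + 1] ** 2)                  # back through tanh at step t+1
    dW_xh += np.outer(dpre, xs[t])                    # step t's contribution
    dW_hh += np.outer(dpre, hs[t])
    db_h += dpre
    dh = W_hh.T @ dpre                                # dL/dh_{t-1}, passed further back

print(f"loss = {loss:.4f}, ||dW_hh|| = {np.linalg.norm(dW_hh):.4f}")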

Truncated BPTT

For very long sequences, computing gradients through the entire sequence is computationally expensive and memory-intensive. Truncated BPTT is a practical solution:

Instead of backpropagating through the entire sequence, we:

  1. Process the sequence in chunks (e.g., 50 time steps)
  2. Backpropagate only through the chunk
  3. Pass the hidden state forward to the next chunk (but don’t backpropagate through it)

This approximation makes training tractable while still allowing the hidden state to carry information across the entire sequence.
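
A runnable sketch of truncated BPTT under assumed toy settings (a sequence of 20 steps processed in chunks of 5, with a made-up squared-error target per chunk): gradients are computed only inside each chunk, but the final hidden state is carried forward so later chunks still see earlier context.

import numpy as np

np.random.seed(0)
T, k, input_size, hidden_size = 20, 5, 2, 3
xs = [np.random.randn(input_size) for _ in range(T)]
targets = [np.random.randn(hidden_size) for _ in range(T // k)]   # toy per-chunk targets

W_xh = np.random.randn(hidden_size, input_size) * 0.5
W_hh = np.random.randn(hidden_size, hidden_size) * 0.5
b_h = np.zeros(hidden_size)
lr = 0.05

h_carry = np.zeros(hidden_size)          # carried across chunks; no gradient flows through it
for chunk, target in zip(range(0, T, k), targets):
    # Forward over this chunk only, starting from the carried hidden state
    hs = [h_carry]
    for x in xs[chunk:chunk + k]:
        hs.append(np.tanh(W_hh @ hs[-1] + W_xh @ x + b_h))

    # Backward over this chunk only (BPTT truncated at the chunk boundary)
    dW_xh, dW_hh, db_h = np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(b_h)
    dh = hs[-1] - target
    for t in reversed(range(k)):
        dpre = dh * (1 - hs[t + 1] ** 2)
        dW_xh += np.outer(dpre, xs[chunk + t])
        dW_hh += np.outer(dpre, hs[t])
        db_h += dpre
        dh = W_hh.T @ dpre               # stops at t = 0 of the chunk

    # Update the shared weights, then carry the hidden state into the next chunk
    W_xh -= lr * dW_xh; W_hh -= lr * dW_hh; b_h -= lr * db_h
    h_carry = hs[-1]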

The Vanishing Gradient Problem

The Core Issue

The vanishing gradient problem is the Achilles’ heel of basic RNNs. To understand it, let’s examine how gradients flow backward through time.

When we compute $\frac{\partial L}{\partial h_1}$ (the gradient at time step 1) for a sequence of length $T$, we need to apply the chain rule through all intermediate time steps:

\[\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial h_T} \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}}\]

Each term $\frac{\partial h_t}{\partial h_{t-1}}$ involves the recurrent weight matrix $W_{hh}$ and the derivative of the activation function:

\[\frac{\partial h_t}{\partial h_{t-1}} = W_{hh}^T \cdot \text{diag}(\tanh'(h_t))\]

Why Gradients Vanish

The problem arises from repeated multiplication:

  1. Activation Function Saturation: The derivative of tanh is at most 1 (attained at 0) and approaches 0 for large positive or negative inputs. Repeatedly multiplying by factors that are at most 1, and usually well below it, causes exponential decay.

  2. Weight Matrix Norm: If the largest eigenvalue of $W_{hh}$ is less than 1, repeated multiplication by $W_{hh}$ will cause the gradient to shrink exponentially.

Numerical Example:

Suppose each $\frac{\partial h_t}{\partial h_{t-1}} \approx 0.5$ (a plausible magnitude once the tanh derivative and the recurrent weights are taken into account).

After 10 time steps: \(\text{gradient} \propto 0.5^{10} = 0.00098\)

After 20 time steps: \(\text{gradient} \propto 0.5^{20} = 0.00000095\)

The gradient has effectively vanished! The network can’t learn long-term dependencies because the learning signal from distant time steps is too weak.

The Exploding Gradient Problem

Conversely, if each $\frac{\partial h_t}{\partial h_{t-1}} > 1$, gradients can explode. With a factor of 2, for example:

After 10 time steps: \(\text{gradient} \propto 2^{10} = 1024\)

After 20 time steps: \(\text{gradient} \propto 2^{20} = 1{,}048{,}576\)

This causes numerical instability and training divergence.
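
Both effects are easy to reproduce numerically. The first loop below simply repeats the multiplications from the examples above; the second chains actual per-step Jacobians of the form $W_{hh}^T \cdot \text{diag}(1 - h_t^2)$ for a small, randomly initialized $W_{hh}$ (an illustrative setup, not taken from the earlier worked example).

import numpy as np

# Repeated multiplication by a factor below or above 1
for factor in (0.5, 2.0):
    for steps in (10, 20):
        print(f"factor {factor}: after {steps} steps -> {factor ** steps:.2e}")

# The same effect with a chain of recurrent Jacobians W_hh^T · diag(1 - h_t^2)
np.random.seed(0)
W_hh = np.random.randn(3, 3) * 0.3          # small recurrent weights
J = np.eye(3)
for _ in range(20):
    h = np.tanh(np.random.randn(3))         # a stand-in hidden state at this step
    J = (W_hh.T * (1 - h ** 2)) @ J         # multiply in this step's Jacobian
print(f"norm of the 20-step Jacobian chain: {np.linalg.norm(J):.2e}")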

Solutions

For Exploding Gradients:

  * Gradient clipping: rescale the gradients whenever their combined norm exceeds a threshold (a minimal sketch follows this list)
  * Careful weight initialization and more conservative learning rates

For Vanishing Gradients:

  * Better initialization of the recurrent weights (for example, identity or orthogonal initialization)
  * ReLU activations in place of tanh, used with care
  * Gated architectures such as LSTM and GRU, which we cover in later posts
  * Skip connections that shorten the path gradients must travel
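
Here is the promised sketch of gradient clipping by global norm (the threshold of 5.0 is an arbitrary example value): if the combined norm of all gradients exceeds the threshold, every gradient is rescaled by the same factor before the weight update.

import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    # Rescale all gradients by a common factor if their combined norm is too large.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-8)
        grads = [g * scale for g in grads]
    return grads, total_norm

# Example: clip the accumulated gradients (dW_xh, dW_hh, db_h) before the update
np.random.seed(0)
grads = [np.random.randn(3, 2) * 10, np.random.randn(3, 3) * 10, np.random.randn(3) * 10]
clipped, before = clip_by_global_norm(grads)
after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(f"norm before clipping: {before:.2f}, after: {after:.2f}")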

Why This Matters

The vanishing gradient problem means basic RNNs struggle with long-term dependencies. They can remember information from 5-10 time steps back, but not 50 or 100 steps back.

Example Tasks Where This Fails:

❌ “I grew up in France… [50 words later]… I speak fluent _____”

❌ Long-form text generation

❌ Long-term stock price prediction

This limitation motivated the development of LSTM and GRU architectures, which we cover in subsequent tutorials.

Types of RNN Architectures

RNNs can be configured in different ways depending on the task. Let’s explore the main architectures:

1. One-to-One (Not Really an RNN)

Input → Network → Output

This is just a standard feedforward network. No temporal dependencies.

Example: Image classification (single input, single output)

2. One-to-Many

Input ─► [RNN] ─► [RNN] ─► [RNN] ─► [RNN]
           │        │        │        │
           ▼        ▼        ▼        ▼
           y₁       y₂       y₃       y₄

A single input generates a sequence of outputs.

Examples:

  * Image captioning (a single image in, a sequence of words out)
  * Music generation from a single seed note or genre label

3. Many-to-One

 x₁       x₂       x₃       x₄
 │        │        │        │
 ▼        ▼        ▼        ▼
[RNN] ─► [RNN] ─► [RNN] ─► [RNN] ─► Output

A sequence of inputs produces a single output (usually at the last time step).

Examples:

  * Sentiment analysis (a whole sentence in, one sentiment label out; a minimal sketch follows)
  * Sequence classification, such as spam detection
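
A minimal many-to-one sketch with made-up dimensions: the recurrence consumes the entire sequence, and only the final hidden state is projected to class scores.

import numpy as np

np.random.seed(0)
seq = [np.random.randn(4) for _ in range(10)]    # e.g. 10 word embeddings of dim 4

hidden_size, num_classes = 8, 2
W_xh = np.random.randn(hidden_size, 4) * 0.3
W_hh = np.random.randn(hidden_size, hidden_size) * 0.3
b_h = np.zeros(hidden_size)
W_hy = np.random.randn(num_classes, hidden_size) * 0.3

h = np.zeros(hidden_size)
for x in seq:                                    # consume the entire sequence
    h = np.tanh(W_hh @ h + W_xh @ x + b_h)

logits = W_hy @ h                                # single output from the last state
probs = np.exp(logits) / np.sum(np.exp(logits))  # softmax over e.g. positive/negative
print(probs)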

4. Many-to-Many (Synced)

 x₁       x₂       x₃
 │        │        │
 ▼        ▼        ▼
[RNN] ─► [RNN] ─► [RNN]
 │        │        │
 ▼        ▼        ▼
 y₁       y₂       y₃

Sequence input to sequence output, where outputs are synchronized with inputs.

Examples:

  * Part-of-speech tagging (one tag per input word)
  * Named entity recognition
  * Per-frame video classification

5. Many-to-Many (Encoder-Decoder)

Encoder:  x₁       x₂       x₃
           │        │        │
           ▼        ▼        ▼
         [RNN] ─► [RNN] ─► [RNN] ─► context
                                       │
                                       ▼
Decoder:                             [RNN] ─► [RNN] ─► [RNN]
                                       │        │        │
                                       ▼        ▼        ▼
                                       y₁       y₂       y₃

Input and output sequences can have different lengths. The encoder processes the input into a fixed-size context vector, which the decoder uses to generate output.

Examples:

  * Machine translation (a sentence in one language in, a sentence in another language out; see the sketch below)
  * Text summarization
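
And here is a minimal encoder-decoder sketch with made-up dimensions: the encoder compresses a length-5 input into its final hidden state (the context), and the decoder, seeded with that context, unrolls for 3 steps while feeding back its own previous output.

import numpy as np

np.random.seed(0)
src = [np.random.randn(4) for _ in range(5)]     # source sequence, length 5
hidden_size, out_steps, out_size = 6, 3, 4

def step(h, x, W_xh, W_hh, b):
    return np.tanh(W_hh @ h + W_xh @ x + b)

# Encoder: run over the whole source and keep only the final hidden state
We_xh = np.random.randn(hidden_size, 4) * 0.3
We_hh = np.random.randn(hidden_size, hidden_size) * 0.3
be = np.zeros(hidden_size)
h = np.zeros(hidden_size)
for x in src:
    h = step(h, x, We_xh, We_hh, be)
context = h                                      # fixed-size summary of the input

# Decoder: starts from the context; here it feeds back its own previous output
Wd_xh = np.random.randn(hidden_size, out_size) * 0.3
Wd_hh = np.random.randn(hidden_size, hidden_size) * 0.3
bd = np.zeros(hidden_size)
W_hy = np.random.randn(out_size, hidden_size) * 0.3

h, y, outputs = context, np.zeros(out_size), []
for _ in range(out_steps):
    h = step(h, y, Wd_xh, Wd_hh, bd)
    y = W_hy @ h
    outputs.append(y)

print(len(outputs), outputs[0].shape)            # 3 outputs of dimension 4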

Building an RNN from Scratch

Let’s implement a basic RNN cell in Python using only NumPy. This will solidify our understanding of the mechanics.

Basic RNN Cell Implementation

import numpy as np

class SimpleRNNCell:
    """
    A single RNN cell implementing:
    h_t = tanh(W_hh @ h_{t-1} + W_xh @ x_t + b_h)
    """

    def __init__(self, input_size: int, hidden_size: int, seed: int = 42):
        """
        Initialize RNN cell with random weights.

        Args:
            input_size: Dimension of input vectors
            hidden_size: Dimension of hidden state
            seed: Random seed for reproducibility
        """
        np.random.seed(seed)

        self.input_size = input_size
        self.hidden_size = hidden_size

        # Xavier/Glorot initialization for better gradient flow
        # W_xh: transforms input to hidden space
        self.W_xh = np.random.randn(input_size, hidden_size) * np.sqrt(2.0 / (input_size + hidden_size))

        # W_hh: transforms previous hidden state to current hidden state
        self.W_hh = np.random.randn(hidden_size, hidden_size) * np.sqrt(2.0 / (hidden_size + hidden_size))

        # Bias term
        self.b_h = np.zeros((1, hidden_size))

        # Cache for backward pass
        self.cache = {}

    def forward(self, x_t: np.ndarray, h_prev: np.ndarray) -> np.ndarray:
        """
        Forward pass for a single time step.

        Args:
            x_t: Input at time t, shape (batch_size, input_size)
            h_prev: Previous hidden state, shape (batch_size, hidden_size)

        Returns:
            h_t: Current hidden state, shape (batch_size, hidden_size)
        """
        # Linear combination
        h_linear = h_prev @ self.W_hh + x_t @ self.W_xh + self.b_h

        # Non-linear activation
        h_t = np.tanh(h_linear)

        # Cache for backward pass
        self.cache['x_t'] = x_t
        self.cache['h_prev'] = h_prev
        self.cache['h_linear'] = h_linear
        self.cache['h_t'] = h_t

        return h_t

    def backward(self, dh_t: np.ndarray) -> tuple:
        """
        Backward pass for a single time step.

        Args:
            dh_t: Gradient of loss w.r.t. h_t, shape (batch_size, hidden_size)

        Returns:
            dh_prev: Gradient w.r.t. previous hidden state
            dW_xh: Gradient w.r.t. input weights
            dW_hh: Gradient w.r.t. recurrent weights
            db_h: Gradient w.r.t. bias
        """
        # Retrieve cached values
        x_t = self.cache['x_t']
        h_prev = self.cache['h_prev']
        h_linear = self.cache['h_linear']

        batch_size = x_t.shape[0]

        # Gradient through tanh activation
        # d(tanh(x))/dx = 1 - tanh²(x)
        dtanh = 1 - np.tanh(h_linear) ** 2
        dh_linear = dh_t * dtanh

        # Gradients w.r.t. weights and bias
        dW_xh = x_t.T @ dh_linear
        dW_hh = h_prev.T @ dh_linear
        db_h = np.sum(dh_linear, axis=0, keepdims=True)

        # Gradient w.r.t. previous hidden state (for backprop through time)
        dh_prev = dh_linear @ self.W_hh.T

        return dh_prev, dW_xh, dW_hh, db_h


class SimpleRNN:
    """
    Complete RNN for sequence processing.
    """

    def __init__(self, input_size: int, hidden_size: int, output_size: int, seed: int = 42):
        """
        Initialize RNN.

        Args:
            input_size: Dimension of input features
            hidden_size: Dimension of hidden state
            output_size: Dimension of output
            seed: Random seed
        """
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size

        # RNN cell
        self.cell = SimpleRNNCell(input_size, hidden_size, seed)

        # Output projection layer
        np.random.seed(seed)
        self.W_hy = np.random.randn(hidden_size, output_size) * np.sqrt(2.0 / (hidden_size + output_size))
        self.b_y = np.zeros((1, output_size))

        # For BPTT
        self.hidden_states = []

    def forward(self, X: np.ndarray, h_0: np.ndarray = None) -> tuple:
        """
        Forward pass through entire sequence.

        Args:
            X: Input sequence, shape (batch_size, seq_length, input_size)
            h_0: Initial hidden state, shape (batch_size, hidden_size).
                 If None, initialized to zeros.

        Returns:
            outputs: Output sequence, shape (batch_size, seq_length, output_size)
            h_t: Final hidden state, shape (batch_size, hidden_size)
        """
        batch_size, seq_length, _ = X.shape

        # Initialize hidden state
        if h_0 is None:
            h_t = np.zeros((batch_size, self.hidden_size))
        else:
            h_t = h_0

        # Store all hidden states for BPTT
        self.hidden_states = [h_t]
        outputs = []

        # Process sequence
        for t in range(seq_length):
            x_t = X[:, t, :]  # Input at time t

            # RNN cell forward
            h_t = self.cell.forward(x_t, h_t)
            self.hidden_states.append(h_t)

            # Compute output
            y_t = h_t @ self.W_hy + self.b_y
            outputs.append(y_t)

        # Stack outputs
        outputs = np.stack(outputs, axis=1)  # (batch_size, seq_length, output_size)

        return outputs, h_t

    def compute_loss(self, predictions: np.ndarray, targets: np.ndarray) -> float:
        """
        Compute mean squared error loss.

        Args:
            predictions: Model predictions, shape (batch_size, seq_length, output_size)
            targets: Ground truth, shape (batch_size, seq_length, output_size)

        Returns:
            loss: Scalar loss value
        """
        return np.mean((predictions - targets) ** 2)


# Example usage
def example_rnn_usage():
    """
    Demonstrate RNN on a simple sequence prediction task.
    """
    # Hyperparameters
    batch_size = 2
    seq_length = 5
    input_size = 3
    hidden_size = 4
    output_size = 2

    # Create synthetic data
    X = np.random.randn(batch_size, seq_length, input_size)
    Y_true = np.random.randn(batch_size, seq_length, output_size)

    # Initialize RNN
    rnn = SimpleRNN(input_size, hidden_size, output_size)

    # Forward pass
    Y_pred, final_hidden = rnn.forward(X)

    # Compute loss
    loss = rnn.compute_loss(Y_pred, Y_true)

    print(f"Input shape: {X.shape}")
    print(f"Output shape: {Y_pred.shape}")
    print(f"Final hidden state shape: {final_hidden.shape}")
    print(f"Loss: {loss:.4f}")

    return rnn, X, Y_pred, loss


# Example: Character-level language model
def character_level_example():
    """
    Simple character-level language model using RNN.
    """
    # Vocabulary
    text = "hello world"
    chars = sorted(list(set(text)))
    vocab_size = len(chars)

    # Create mappings
    char_to_idx = {ch: i for i, ch in enumerate(chars)}
    idx_to_char = {i: ch for i, ch in enumerate(chars)}

    # Convert text to indices
    data = [char_to_idx[ch] for ch in text]

    print("Character-level RNN Example")
    print(f"Text: '{text}'")
    print(f"Vocabulary: {chars}")
    print(f"Vocab size: {vocab_size}")
    print(f"Encoded: {data}")

    # Create training pairs: predict next character
    seq_length = 3
    X_data = []
    Y_data = []

    for i in range(len(data) - seq_length):
        # One-hot encode input sequence
        x_seq = np.zeros((seq_length, vocab_size))
        for t, idx in enumerate(data[i:i+seq_length]):
            x_seq[t, idx] = 1

        # One-hot encode target (next character)
        y_seq = np.zeros((seq_length, vocab_size))
        for t, idx in enumerate(data[i+1:i+seq_length+1]):
            y_seq[t, idx] = 1

        X_data.append(x_seq)
        Y_data.append(y_seq)

    X_data = np.array(X_data)  # Shape: (num_samples, seq_length, vocab_size)
    Y_data = np.array(Y_data)

    print(f"\nTraining data shape: {X_data.shape}")
    print(f"Target data shape: {Y_data.shape}")

    # Initialize RNN
    hidden_size = 10
    rnn = SimpleRNN(vocab_size, hidden_size, vocab_size)

    # Forward pass
    predictions, _ = rnn.forward(X_data)
    loss = rnn.compute_loss(predictions, Y_data)

    print(f"\nInitial loss: {loss:.4f}")


if __name__ == "__main__":
    print("=" * 60)
    print("Simple RNN Implementation from Scratch")
    print("=" * 60)
    print()

    # Run basic example
    example_rnn_usage()
    print()

    # Run character-level example
    character_level_example()

Key Implementation Details

  1. Weight Initialization: We use Xavier/Glorot initialization to help prevent vanishing/exploding gradients at the start of training.

  2. Tanh Activation: The derivative is $1 - \tanh^2(x)$, which we compute during backpropagation.

  3. Caching: We store intermediate values during the forward pass to use in the backward pass.

  4. Backpropagation Through Time: The backward method computes gradients for a single time step. To train the full sequence, we’d need to call this recursively for all time steps; a quick finite-difference check of the single-step gradients is sketched after this list.

  5. Gradient Accumulation: Gradients for $W_{xh}$ and $W_{hh}$ accumulate across all time steps because these weights are shared.
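
As the quick check mentioned above (and not part of the original implementation), we can compare the analytic gradient from SimpleRNNCell.backward against a finite-difference estimate. The loss here is simply the sum of the entries of $h_t$, so the upstream gradient passed to backward is a matrix of ones.

import numpy as np

# Assumes the SimpleRNNCell class defined earlier in this post is available.
cell = SimpleRNNCell(input_size=3, hidden_size=4)
x_t = np.random.randn(2, 3)          # batch of 2
h_prev = np.random.randn(2, 4)

h_t = cell.forward(x_t, h_prev)
_, dW_xh, _, _ = cell.backward(np.ones_like(h_t))   # dL/dh_t = 1 for L = sum(h_t)

# Finite-difference estimate for one entry of W_xh
i, j, eps = 0, 0, 1e-5
cell.W_xh[i, j] += eps
loss_plus = np.sum(cell.forward(x_t, h_prev))
cell.W_xh[i, j] -= 2 * eps
loss_minus = np.sum(cell.forward(x_t, h_prev))
cell.W_xh[i, j] += eps                               # restore the original weight

numeric = (loss_plus - loss_minus) / (2 * eps)
print(f"analytic: {dW_xh[i, j]:.6f}, numeric: {numeric:.6f}")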

Advantages and Limitations

Advantages

  1. Sequential Processing: RNNs are specifically designed for sequential data, unlike feedforward networks.

  2. Variable Length Input: Can handle sequences of any length without architectural changes.

  3. Parameter Sharing: The same weights are used at every time step, reducing the total number of parameters.

  4. Memory: The hidden state acts as memory, allowing the network to maintain context.

  5. Theoretical Power: RNNs are Turing-complete, meaning they can theoretically compute any computable function given enough time and resources.

  6. Versatility: Can be configured for many different sequence tasks (one-to-many, many-to-one, many-to-many).

Limitations

  1. Vanishing/Exploding Gradients: Basic RNNs struggle to learn long-term dependencies due to gradient issues during backpropagation through time.

  2. Sequential Processing: Unlike CNNs or Transformers, RNNs must process sequences step-by-step, which is slower and harder to parallelize.

  3. Limited Memory: In practice, vanilla RNNs only remember information from ~5-10 steps back effectively.

  4. Training Difficulty: BPTT is computationally expensive for long sequences, and the model is sensitive to hyperparameters.

  5. Slow Inference: Generating long sequences requires many sequential forward passes, making real-time generation challenging.

  6. Difficulty with Long-Range Dependencies: Even with gradient clipping and careful initialization, capturing dependencies spanning hundreds of time steps is difficult.

Conclusion

Recurrent Neural Networks represent a fundamental breakthrough in handling sequential data. By introducing recurrence and feeding outputs back as inputs, RNNs create a form of memory that allows them to process sequences of any length while maintaining context. The basic RNN architecture we implemented here forms the foundation for more advanced architectures like LSTMs and GRUs, which address many of the limitations of vanilla RNNs and which we will explore in future posts.

Jupyter Notebook

For hands-on practice, check out the companion notebook: Understanding RNN Architecture From Scratch