Deep Learning Series

by Mayank Sharma

Understanding Recurrent Neural Networks: A Complete Guide from Theory to Practice

Jan 17, 2026

Continuing our Deep Learning Series, we now turn to Recurrent Neural Networks (RNNs). Imagine you’re watching a movie. To understand what’s happening in the current scene, you don’t look at a single frame in isolation; you remember what happened in the previous scenes. You know who the characters are, what their relationships are, and what conflicts are unfolding. This ability to maintain context over time is fundamental to understanding sequences, and it’s exactly what Recurrent Neural Networks bring to artificial intelligence.

Table of Contents

  1. Introduction: Why Sequential Data Matters
  2. The Limitation of Traditional Neural Networks
  3. Enter Recurrent Neural Networks
  4. The Architecture of RNNs
  5. How RNNs Learn: Backpropagation Through Time
  6. The Vanishing Gradient Problem
  7. Building an RNN from Scratch
  8. Advantages and Limitations
  9. Conclusion
  10. Jupyter Notebook

Introduction: Why Sequential Data Matters

The World is Sequential

Much of the data we encounter in real life is sequential: it has an inherent order in which position matters. Text, speech, and time series such as stock prices are familiar examples.

Traditional machine learning models treat each input independently, which makes them poorly suited to sequential data. If you tried to predict the next word in a sentence with a standard neural network, it would see only the current word, with no memory of what came before. This is like trying to understand a conversation by hearing only every fifth word.

What Makes Sequential Data Special?

To understand why we need special tools for sequential data, let’s look at what makes it different from regular data. Sequential data has three key characteristics:

  1. Temporal Dependencies: Earlier elements influence later ones
  2. Variable Length: Sequences can be different lengths (a sentence can be 5 or 50 words)
  3. Context Matters: The meaning of an element depends on its position in the sequence

This is where RNNs come in: they were designed specifically to handle these challenges.

The Limitation of Traditional Neural Networks

The Fixed Input Problem

Let’s consider a traditional feedforward neural network:

Input Layer → Hidden Layer(s) → Output Layer

This architecture has a fundamental limitation: it requires fixed-size inputs. For sequential data, how do you handle inputs whose length varies, such as a sentence that may be 5 words long or 50?

You could try padding all inputs to the same maximum length, but this is wasteful and doesn’t truly capture temporal relationships.
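To make the cost of padding concrete, here is a minimal sketch. It assumes NumPy and three made-up sequences of token ids; `pad_sequences` is a hypothetical helper written for this illustration, not a library function.

```python
import numpy as np

def pad_sequences(seqs, pad_value=0):
    """Right-pad variable-length integer sequences to a common length (hypothetical helper)."""
    max_len = max(len(s) for s in seqs)
    padded = np.full((len(seqs), max_len), pad_value, dtype=int)
    for i, s in enumerate(seqs):
        padded[i, :len(s)] = s
    return padded

# Toy token-id sequences of different lengths (made-up data)
seqs = [[4, 7], [3, 9, 2, 8, 5, 1], [6]]
padded = pad_sequences(seqs)

real_tokens = sum(len(s) for s in seqs)
wasted = padded.size - real_tokens   # slots that hold only padding

print(padded)
print(f"{wasted} of {padded.size} slots are padding")
```

In this toy batch half of the slots carry no information, and the network still has no notion of which token came before which.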

The Memory Problem

Even if you solve the input-size issue, feedforward networks have no memory. Each prediction is made independently:

Input 1 → Network → Output 1
Input 2 → Network → Output 2
Input 3 → Network → Output 3

The network processing Input 2 has no knowledge that Input 1 came before it. This is catastrophic for sequential tasks. Let’s look at an example to understand this.

Imagine analyzing the sentiment of this sentence:

“The movie started great, but the ending was terrible.”

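A bag-of-words model (treating words independently) might see one positive word (“great”) and one negative word (“terrible”), with no basis for deciding which of the two carries the final verdict.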

But a human understands the sequence. The sentence structure (“but”) indicates the negative sentiment about the ending is the final judgment. A sequential model should understand this temporal structure.
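The point can be checked with a few lines of standard-library Python. This is a small sketch: `bag_of_words` is a hypothetical helper, and the second sentence is a made-up variant with the two sentiment words swapped.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count words, ignoring order and basic punctuation (hypothetical helper)."""
    words = sentence.lower().replace(",", "").replace(".", "").split()
    return Counter(words)

original = "The movie started great, but the ending was terrible."
swapped = "The movie started terrible, but the ending was great."

# The two sentences reach opposite verdicts, yet their word counts are identical.
same_bag = bag_of_words(original) == bag_of_words(swapped)
print(same_bag)  # prints True
```

Any model that sees only these counts cannot tell the two sentences apart, which is exactly the information a sequential model is meant to keep.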

Enter Recurrent Neural Networks

The breakthrough idea of RNNs is beautifully simple: feed the network’s hidden state from the previous time step back into the network as an input at the current time step. This creates a loop that allows information to persist:

        ┌──────────────┐
        │              │
        ▼              │
Input → RNN Cell → Output
        │
    Hidden State

At each time step $t$, the RNN receives two inputs: the current input $x_t$ and the hidden state $h_{t-1}$ from the previous time step.

This hidden state acts as the network’s memory, carrying information from all previous time steps.

Unfolding the Recurrent Structure

While RNNs are defined by their recurrent structure, it’s helpful to “unfold” them through time to see how they process sequences:

x₁ → [RNN] → h₁ → [RNN] → h₂ → [RNN] → h₃
       ↓             ↓             ↓
       y₁            y₂            y₃

At each time step, the cell takes the current input and the previous hidden state, computes a new hidden state, and optionally produces an output from it.

Note that the same weights are shared across all time steps. The network isn’t learning different operations for each position in the sequence; it’s learning a single operation that it applies repeatedly.
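Weight sharing can be illustrated with a short sketch. It assumes NumPy, the tanh update that the next section formalizes, and arbitrary toy sizes; `run_rnn` is a hypothetical helper. One set of parameters is created once and then reused for a 3-step and a 50-step sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 2, 3   # arbitrary toy sizes

# One set of parameters, created once
W_xh = rng.standard_normal((hidden_size, input_size)) * 0.5
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.5
b_h = np.zeros(hidden_size)

def run_rnn(xs):
    """Apply the same update at every time step; return the final hidden state."""
    h = np.zeros(hidden_size)
    for x_t in xs:   # one iteration per time step
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)
    return h

short_seq = rng.standard_normal((3, input_size))    # 3 time steps
long_seq = rng.standard_normal((50, input_size))    # 50 time steps

h_short = run_rnn(short_seq)
h_long = run_rnn(long_seq)
n_params = W_xh.size + W_hh.size + b_h.size

print(h_short.shape, h_long.shape, n_params)
```

The parameter count depends only on the input and hidden sizes, not on how long the sequence is.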

The Mathematics of RNNs

Mathematically, the core RNN computation at each time step is surprisingly simple:

Hidden State Update: \(h_t = \tanh(W_{hh} \cdot h_{t-1} + W_{xh} \cdot x_t + b_h)\)

Output (if needed): \(y_t = W_{hy} \cdot h_t + b_y\)

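Let’s break this down to understand each component. $x_t$ is the input at time $t$ and $h_{t-1}$ is the previous hidden state. $W_{xh}$ maps the input into the hidden space, $W_{hh}$ maps the previous hidden state to the current one, and $b_h$ is the hidden bias. The tanh activation squashes each component into the range $(-1, 1)$. For the output, $W_{hy}$ projects the hidden state to the output space and $b_y$ is the output bias.

The important thing to note is that the hidden state $h_t$ depends on both the current input $x_t$ and the previous hidden state $h_{t-1}$, which itself depended on $h_{t-2}$, and so on. This creates a chain of dependencies that allows the network to maintain memory. Let’s look at a concrete example to make this clearer.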

Suppose we have a simple RNN with an input size of 2 and a hidden size of 3.

Weights (randomly initialized):

W_xh = [[0.5, -0.3],
        [0.2,  0.4],
        [-0.1, 0.6]]  # Shape: (3, 2)

W_hh = [[0.3, -0.2, 0.1],
        [0.4,  0.5, -0.3],
        [-0.2, 0.1, 0.4]]  # Shape: (3, 3)

b_h = [0.1, -0.1, 0.2]  # Shape: (3,)

Input Sequence:

x_1 = [1.0, 0.5]
x_2 = [0.8, 1.2]
x_3 = [0.3, 0.9]

Time Step 1:

Initial hidden state: $h_0 = [0, 0, 0]$ (all zeros)

# Input contribution
W_xh @ x_1 = [0.5*1.0 + (-0.3)*0.5,
              0.2*1.0 + 0.4*0.5,
              -0.1*1.0 + 0.6*0.5]
           = [0.35, 0.40, 0.20]

# Recurrent contribution (h_0 is zero, so this is zero)
W_hh @ h_0 = [0, 0, 0]

# Combine with bias
h_1_pre = [0.35+0+0.1, 0.40+0+(-0.1), 0.20+0+0.2]
        = [0.45, 0.30, 0.40]

# Apply tanh activation
h_1 = tanh([0.45, 0.30, 0.40])
    ≈ [0.42, 0.29, 0.38]

Time Step 2:

Now $h_1 = [0.42, 0.29, 0.38]$

# Input contribution
W_xh @ x_2 = [0.5*0.8 + (-0.3)*1.2,
              0.2*0.8 + 0.4*1.2,
              -0.1*0.8 + 0.6*1.2]
           = [0.04, 0.64, 0.64]

# Recurrent contribution
W_hh @ h_1 = [0.3*0.42 + (-0.2)*0.29 + 0.1*0.38,
              0.4*0.42 + 0.5*0.29 + (-0.3)*0.38,
              -0.2*0.42 + 0.1*0.29 + 0.4*0.38]
           ≈ [0.106, 0.199, 0.097]

# Combine
h_2_pre = [0.04+0.106+0.1, 0.64+0.199+(-0.1), 0.64+0.097+0.2]
        ≈ [0.246, 0.739, 0.937]

# Apply tanh
h_2 = tanh([0.246, 0.739, 0.937])
    ≈ [0.24, 0.63, 0.73]

Notice how $h_2$ depends on both $x_2$ (current input) and $h_1$ (which contains information from $x_1$). This is how RNNs maintain memory!
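The hand calculation can be checked numerically. The sketch below assumes NumPy and reuses the weights and inputs from the example; it also runs the third time step, which the walkthrough does not cover.

```python
import numpy as np

W_xh = np.array([[0.5, -0.3],
                 [0.2,  0.4],
                 [-0.1, 0.6]])          # shape (3, 2)
W_hh = np.array([[0.3, -0.2, 0.1],
                 [0.4,  0.5, -0.3],
                 [-0.2, 0.1, 0.4]])     # shape (3, 3)
b_h = np.array([0.1, -0.1, 0.2])

xs = np.array([[1.0, 0.5],
               [0.8, 1.2],
               [0.3, 0.9]])             # x_1, x_2, x_3

h = np.zeros(3)                          # h_0
hidden_states = []
for t, x_t in enumerate(xs, start=1):
    h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)
    hidden_states.append(h)
    print(f"h_{t} = {np.round(h, 2)}")
```

Small differences in the last digit are possible because the hand calculation rounds $h_1$ before reusing it, while the code carries full precision.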

The Architecture of RNNs

Basic RNN Cell

Let’s look at the architecture of a simple RNN. The fundamental building block is the RNN cell:

        h_{t-1} (previous hidden state)
           │
           ▼
        ┌──────┐
 x_t →  │ tanh │ → h_t (current hidden state)
        └──────┘

Inside the cell:

  1. Concatenate or combine $x_t$ and $h_{t-1}$
  2. Apply linear transformation (weight matrices)
  3. Apply non-linear activation (typically tanh or ReLU)
  4. Output new hidden state $h_t$
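Step 1 says "concatenate or combine" because the two formulations are equivalent. A quick sketch, assuming NumPy and random toy values, shows that summing two matrix products gives the same result as one product with a stacked weight matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
input_size, hidden_size = 2, 3   # arbitrary toy sizes

W_xh = rng.standard_normal((hidden_size, input_size))
W_hh = rng.standard_normal((hidden_size, hidden_size))
b_h = rng.standard_normal(hidden_size)
x_t = rng.standard_normal(input_size)
h_prev = rng.standard_normal(hidden_size)

# Option A: two separate matrix products, then add
h_sum = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

# Option B: concatenate [h_prev; x_t] and use one stacked weight matrix
W = np.concatenate([W_hh, W_xh], axis=1)   # shape (hidden, hidden + input)
z = np.concatenate([h_prev, x_t])          # shape (hidden + input,)
h_cat = np.tanh(W @ z + b_h)

print(np.allclose(h_sum, h_cat))  # prints True
```

Which form an implementation uses is a matter of convenience; the single stacked product can be slightly easier to optimize.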

Stacking RNN Layers

Just like feedforward networks, RNNs can be stacked to create deeper architectures:

Layer 2:  h₂¹ → h₂² → h₂³
           ↑     ↑     ↑
Layer 1:  h₁¹ → h₁² → h₁³
           ↑     ↑     ↑
Input:    x¹    x²    x³

Each layer processes the outputs of the previous layer as its input sequence. Deeper RNNs can learn more complex patterns and hierarchical representations.
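As a rough sketch of stacking, assuming NumPy, arbitrary toy sizes, and hypothetical helpers `rnn_layer` and `init`, the hidden states of the first layer simply become the input sequence of the second:

```python
import numpy as np

def rnn_layer(xs, W_xh, W_hh, b_h):
    """Run one tanh RNN layer over a sequence; return all hidden states, shape (T, hidden)."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(2)
T, input_size, hidden1, hidden2 = 4, 2, 5, 3   # arbitrary toy sizes

def init(n_in, n_hidden):
    """Small random weights and a zero bias for one layer."""
    return (rng.standard_normal((n_hidden, n_in)) * 0.5,
            rng.standard_normal((n_hidden, n_hidden)) * 0.5,
            np.zeros(n_hidden))

xs = rng.standard_normal((T, input_size))
layer1_states = rnn_layer(xs, *init(input_size, hidden1))           # (T, hidden1)
layer2_states = rnn_layer(layer1_states, *init(hidden1, hidden2))   # (T, hidden2)

print(layer1_states.shape, layer2_states.shape)
```

Each layer has its own weights and its own hidden size; only the sequence length is shared between them.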

Bidirectional RNNs

Standard RNNs process sequences in one direction (left to right, or past to future). Sometimes the future context matters too, and this is where bidirectional RNNs come in. Consider an example:

Example: “The animal didn’t cross the street because it was too _____”

To work out what “it” refers to, we need the word that comes after it: “tired” points to the animal, while “wide” points to the street.

Bidirectional RNNs solve this by processing the sequence in both directions:

Forward:  h₁⁺ → h₂⁺ → h₃⁺
           ↑     ↑     ↑
Input:    x₁    x₂    x₃
           ↓     ↓     ↓
Backward: h₁⁻ ← h₂⁻ ← h₃⁻

At each time step, we concatenate the forward and backward hidden states:

\[h_t = [h_t^+; h_t^-]\]

This gives the network access to both past and future context.
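A minimal sketch of the bidirectional pass, assuming NumPy, toy sizes, and hypothetical helpers `rnn_layer` and `init_params`: the backward direction is the same recurrence run over the reversed sequence, with its outputs flipped back into time order before concatenation.

```python
import numpy as np

def rnn_layer(xs, W_xh, W_hh, b_h):
    """Run one tanh RNN layer over a sequence; return all hidden states, shape (T, hidden)."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(3)
T, input_size, hidden_size = 5, 2, 3   # arbitrary toy sizes

def init_params():
    """Small random weights and a zero bias for one direction."""
    return (rng.standard_normal((hidden_size, input_size)) * 0.5,
            rng.standard_normal((hidden_size, hidden_size)) * 0.5,
            np.zeros(hidden_size))

fwd_params, bwd_params = init_params(), init_params()   # separate weights per direction
xs = rng.standard_normal((T, input_size))

h_fwd = rnn_layer(xs, *fwd_params)               # reads x_1 ... x_T
h_bwd = rnn_layer(xs[::-1], *bwd_params)[::-1]   # reads x_T ... x_1, then re-aligned to time order
h_bi = np.concatenate([h_fwd, h_bwd], axis=1)    # (T, 2 * hidden_size)

print(h_bi.shape)  # prints (5, 6)
```

Note that this requires the whole sequence up front, so a bidirectional RNN cannot be used when outputs must be produced before future inputs arrive.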

How RNNs Learn: Backpropagation Through Time

Training RNNs is more complex than training feedforward networks because of the temporal dependencies: we need to compute gradients that flow backward through time, accounting for how earlier time steps affect later ones. The algorithm for this is Backpropagation Through Time (BPTT). The key idea is to unfold the RNN through time, treat it as a very deep feedforward network with shared weights, and then apply standard backpropagation.

Forward Pass:

t=1: h₁ = tanh(W_hh·h₀ + W_xh·x₁ + b)
t=2: h₂ = tanh(W_hh·h₁ + W_xh·x₂ + b)
t=3: h₃ = tanh(W_hh·h₂ + W_xh·x₃ + b)
...
Loss = L(y_true, y_pred)

Backward Pass:

Starting from the final time step, we compute gradients:

\[\frac{\partial L}{\partial h_t} = \frac{\partial L}{\partial y_t} \frac{\partial y_t}{\partial h_t} + \frac{\partial L}{\partial h_{t+1}} \frac{\partial h_{t+1}}{\partial h_t}\]

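This recursive formula shows that the gradient at time $t$ depends on two things: the loss contributed directly by the output at time $t$ (the first term), and the gradient arriving from time $t+1$ through the recurrent connection (the second term).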

The gradient flows backward through time, accumulating contributions from all future time steps.

Weight Update Strategy

Critically, the same weights $W_{hh}$ and $W_{xh}$ are used at every time step, so their gradients accumulate across all time steps:

\[\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial W_{hh}}\]
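The recursion and the summed weight gradient can be sanity-checked on a scalar RNN. The sketch below is an illustration under simplifying assumptions, not a full BPTT implementation: one hidden unit, the output taken to be the hidden state itself, a squared-error loss at every step, and made-up inputs and targets. It compares the BPTT gradient with a finite-difference estimate.

```python
import math

def forward(w, u, xs):
    """Scalar RNN: h_t = tanh(w * h_{t-1} + u * x_t), with h_0 = 0."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs   # hs[t] is h_t

def loss(w, u, xs, targets):
    """Sum over time of 0.5 * (h_t - target_t)^2."""
    hs = forward(w, u, xs)
    return sum(0.5 * (h - y) ** 2 for h, y in zip(hs[1:], targets))

def bptt_grad_w(w, u, xs, targets):
    """dL/dw via the backward recursion, summing contributions from all time steps."""
    hs = forward(w, u, xs)
    dw = 0.0
    dh_from_future = 0.0                                   # gradient arriving from step t+1
    for t in range(len(xs), 0, -1):
        dh = (hs[t] - targets[t - 1]) + dh_from_future     # direct term + future term
        da = dh * (1.0 - hs[t] ** 2)                       # back through tanh
        dw += da * hs[t - 1]                               # this step's share of dL/dw
        dh_from_future = da * w                            # passed back to h_{t-1}
    return dw

xs = [0.5, -1.0, 0.8, 0.3]          # made-up inputs
targets = [0.1, -0.2, 0.4, 0.0]     # made-up targets
w, u = 0.7, 0.9

analytic = bptt_grad_w(w, u, xs, targets)
eps = 1e-6
numeric = (loss(w + eps, u, xs, targets) - loss(w - eps, u, xs, targets)) / (2 * eps)
print(abs(analytic - numeric) < 1e-6)
```

The single shared weight `w` receives one contribution per time step, which is the sum in the formula above.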

This weight sharing is both a blessing (fewer parameters) and a curse (gradients can vanish or explode). Backpropagating through every time step is also costly for long sequences, which is where truncated BPTT comes in.

Truncated BPTT

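For very long sequences, computing gradients through the entire sequence is computationally expensive and memory-intensive. Truncated BPTT is a practical solution. Instead of backpropagating through the entire sequence, we split it into chunks of a fixed number of time steps, run the forward pass chunk by chunk while carrying the hidden state from one chunk into the next, and backpropagate only within the current chunk.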

This approximation makes training tractable while still allowing the hidden state to carry information across the entire sequence.
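The sketch below illustrates only the chunking and the hidden-state carry-over, not the gradient computation. It assumes NumPy, random toy weights, and an arbitrary truncation length `k`; `run_chunk` is a hypothetical helper. It confirms that the forward pass is unchanged by chunking, since only the backward pass is truncated.

```python
import numpy as np

rng = np.random.default_rng(4)
input_size, hidden_size, T, k = 2, 3, 10, 4   # k = truncation length (arbitrary)

W_xh = rng.standard_normal((hidden_size, input_size)) * 0.5
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.5
b_h = np.zeros(hidden_size)

def run_chunk(xs, h):
    """Advance the hidden state over a chunk of time steps."""
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)
    return h

xs = rng.standard_normal((T, input_size))

# Full pass over the whole sequence
h_full = run_chunk(xs, np.zeros(hidden_size))

# Chunked pass: the hidden state is carried across chunk boundaries
h = np.zeros(hidden_size)
chunk_lengths = []
for start in range(0, T, k):
    chunk = xs[start:start + k]
    h = run_chunk(chunk, h)
    # During training, backpropagation and a weight update would happen here,
    # using only the steps inside `chunk`; the carried-in state is treated as a constant.
    chunk_lengths.append(len(chunk))

print(chunk_lengths, np.allclose(h, h_full))
```

The last chunk is shorter when the sequence length is not a multiple of `k`.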

The Vanishing Gradient Problem

The vanishing gradient problem is the Achilles’ heel of basic RNNs. To understand it, let’s examine how gradients flow backward through time. When we compute $\frac{\partial L}{\partial h_1}$ (the gradient at time step 1) for a sequence of length $T$, we need to apply the chain rule through all intermediate time steps:

\[\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial h_T} \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}}\]

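Each term $\frac{\partial h_t}{\partial h_{t-1}}$ involves the recurrent weight matrix $W_{hh}$ and the derivative of the activation function, evaluated at the pre-activation $a_t = W_{hh} \cdot h_{t-1} + W_{xh} \cdot x_t + b_h$:

\[\frac{\partial h_t}{\partial h_{t-1}} = \text{diag}(\tanh'(a_t)) \cdot W_{hh}, \qquad \tanh'(a_t) = 1 - h_t^2\]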

Why Gradients Vanish

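Vanishing gradients occur because the product of many small numbers becomes negligible under repeated multiplication. There are two main causes: the derivative of tanh is at most 1 and approaches 0 when a unit saturates, and the recurrent weights in $W_{hh}$ are often small in magnitude, so each factor in the product tends to shrink the gradient.

Suppose each $\frac{\partial h_t}{\partial h_{t-1}} \approx 0.5$ (an illustrative value, treating each factor as a scalar).

After 10 time steps: \(\text{gradient} \propto 0.5^{10} \approx 0.00098\)

After 20 time steps: \(\text{gradient} \propto 0.5^{20} \approx 0.00000095\)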

The gradient has effectively vanished! The network can’t learn long-term dependencies because the learning signal from distant time steps is too weak.
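The same effect shows up with matrices. The sketch below is an illustration under stated assumptions: NumPy, random toy inputs, and a recurrent matrix deliberately set to 0.5 times the identity so that its spectral norm is exactly 0.5. It multiplies the per-step Jacobians, written as diag(1 - h_t²) · W_hh for column vectors, and tracks the norm of the product.

```python
import numpy as np

rng = np.random.default_rng(5)
hidden_size, input_size, T = 4, 2, 20   # arbitrary toy sizes

W_hh = 0.5 * np.eye(hidden_size)        # chosen so the spectral norm is exactly 0.5
W_xh = rng.standard_normal((hidden_size, input_size))
xs = rng.standard_normal((T, input_size))

h = np.zeros(hidden_size)
J = np.eye(hidden_size)                 # accumulates d h_t / d h_0
norms = []
for x_t in xs:
    h = np.tanh(W_hh @ h + W_xh @ x_t)
    step_jac = np.diag(1.0 - h ** 2) @ W_hh   # d h_t / d h_{t-1}
    J = step_jac @ J
    norms.append(np.linalg.norm(J, 2))        # spectral norm of the product

print(f"after 1 step:   {norms[0]:.2e}")
print(f"after {T} steps: {norms[-1]:.2e}")
```

Because the tanh derivative never exceeds 1, the norm after T steps is bounded by 0.5 to the power T here, and saturation makes it smaller still.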

The Exploding Gradient Problem

The exploding gradient problem is the opposite of the vanishing gradient problem: gradients grow exponentially during backpropagation through time. If each $\frac{\partial h_t}{\partial h_{t-1}}$ has magnitude greater than 1, the product grows instead of shrinking. For example, with a factor of 2 per step:

After 10 time steps: \(\text{gradient} \propto 2^{10} = 1024\)

After 20 time steps: \(\text{gradient} \propto 2^{20} = 1,048,576\)

This causes numerical instability and training divergence.

Practical Solutions

Let’s look at some practical solutions to these problems:

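For Exploding Gradients: clip the gradient norm to a fixed threshold before each weight update (gradient clipping), and use truncated BPTT to limit how many time steps the gradient is multiplied through.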
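One widely used remedy for exploding gradients is gradient clipping by global norm. Here is a minimal sketch, assuming NumPy; `clip_by_global_norm` is a hypothetical helper, and the threshold of 5.0 and the gradient values are made up for illustration.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))   # leave small gradients untouched
    return [g * scale for g in grads], total_norm

# Made-up, deliberately large gradients for two weight matrices
grads = [np.full((3, 3), 100.0), np.full((3, 2), -50.0)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

print(f"{norm_before:.1f} -> {norm_after:.1f}")
```

All gradients are scaled by the same factor, so the update keeps its direction and only its size is limited.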

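For Vanishing Gradients: initialize the weights carefully, and use gated architectures such as LSTM and GRU, which were designed to preserve the learning signal over longer spans.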

The vanishing gradient problem means basic RNNs struggle with long-term dependencies. As a rough rule of thumb, they can use information from about 5-10 time steps back, but not from 50 or 100 steps back. Here are some concrete cases where this limitation becomes a problem:

❌ “I grew up in France… [50 words later]… I speak fluent _____”

❌ Long-form text generation

❌ Long-term stock price prediction

This limitation motivated the development of LSTM and GRU architectures, which we cover in subsequent tutorials.

Building an RNN from Scratch

Let’s implement a basic RNN cell in Python using only NumPy. This will solidify our understanding of the mechanics.

Basic RNN Cell Implementation

import numpy as np

class SimpleRNNCell:
    """
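    A single RNN cell implementing:
    h_t = tanh(W_hh @ h_{t-1} + W_xh @ x_t + b_h)

    Note: inputs are stored as row vectors (one row per batch item), so the
    code computes h_prev @ W_hh + x_t @ W_xh, with W_xh of shape
    (input_size, hidden_size) and W_hh of shape (hidden_size, hidden_size).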
    """

    def __init__(self, input_size: int, hidden_size: int, seed: int = 42):
        """
        Initialize RNN cell with random weights.

        Args:
            input_size: Dimension of input vectors
            hidden_size: Dimension of hidden state
            seed: Random seed for reproducibility
        """
        np.random.seed(seed)

        self.input_size = input_size
        self.hidden_size = hidden_size

        # Xavier/Glorot initialization for better gradient flow
        # W_xh: transforms input to hidden space
        self.W_xh = np.random.randn(input_size, hidden_size) * np.sqrt(2.0 / (input_size + hidden_size))

        # W_hh: transforms previous hidden state to current hidden state
        self.W_hh = np.random.randn(hidden_size, hidden_size) * np.sqrt(2.0 / (hidden_size + hidden_size))

        # Bias term
        self.b_h = np.zeros((1, hidden_size))

        # Cache for backward pass
        self.cache = {}

    def forward(self, x_t: np.ndarray, h_prev: np.ndarray) -> np.ndarray:
        """
        Forward pass for a single time step.

        Args:
            x_t: Input at time t, shape (batch_size, input_size)
            h_prev: Previous hidden state, shape (batch_size, hidden_size)

        Returns:
            h_t: Current hidden state, shape (batch_size, hidden_size)
        """
        # Linear combination
        h_linear = h_prev @ self.W_hh + x_t @ self.W_xh + self.b_h

        # Non-linear activation
        h_t = np.tanh(h_linear)

        # Cache for backward pass
        self.cache['x_t'] = x_t
        self.cache['h_prev'] = h_prev
        self.cache['h_linear'] = h_linear
        self.cache['h_t'] = h_t

        return h_t

    def backward(self, dh_t: np.ndarray) -> tuple:
        """
        Backward pass for a single time step.

        Args:
            dh_t: Gradient of loss w.r.t. h_t, shape (batch_size, hidden_size)

        Returns:
            dh_prev: Gradient w.r.t. previous hidden state
            dW_xh: Gradient w.r.t. input weights
            dW_hh: Gradient w.r.t. recurrent weights
            db_h: Gradient w.r.t. bias
        """
        # Retrieve cached values
        x_t = self.cache['x_t']
        h_prev = self.cache['h_prev']
        h_linear = self.cache['h_linear']

        batch_size = x_t.shape[0]

        # Gradient through tanh activation
        # d(tanh(x))/dx = 1 - tanh²(x)
        dtanh = 1 - np.tanh(h_linear) ** 2
        dh_linear = dh_t * dtanh

        # Gradients w.r.t. weights and bias
        dW_xh = x_t.T @ dh_linear
        dW_hh = h_prev.T @ dh_linear
        db_h = np.sum(dh_linear, axis=0, keepdims=True)

        # Gradient w.r.t. previous hidden state (for backprop through time)
        dh_prev = dh_linear @ self.W_hh.T

        return dh_prev, dW_xh, dW_hh, db_h


class SimpleRNN:
    """
    Complete RNN for sequence processing.
    """

    def __init__(self, input_size: int, hidden_size: int, output_size: int, seed: int = 42):
        """
        Initialize RNN.

        Args:
            input_size: Dimension of input features
            hidden_size: Dimension of hidden state
            output_size: Dimension of output
            seed: Random seed
        """
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size

        # RNN cell
        self.cell = SimpleRNNCell(input_size, hidden_size, seed)

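        # Output projection layer. A separate generator is used so that W_hy
        # does not repeat the random draws already used for the cell's weights.
        rng = np.random.default_rng(seed + 1)
        self.W_hy = rng.standard_normal((hidden_size, output_size)) * np.sqrt(2.0 / (hidden_size + output_size))
        self.b_y = np.zeros((1, output_size))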

        # For BPTT
        self.hidden_states = []

    def forward(self, X: np.ndarray, h_0: np.ndarray = None) -> tuple:
        """
        Forward pass through entire sequence.

        Args:
            X: Input sequence, shape (batch_size, seq_length, input_size)
            h_0: Initial hidden state, shape (batch_size, hidden_size).
                 If None, initialized to zeros.

        Returns:
            outputs: Output sequence, shape (batch_size, seq_length, output_size)
            h_t: Final hidden state, shape (batch_size, hidden_size)
        """
        batch_size, seq_length, _ = X.shape

        # Initialize hidden state
        if h_0 is None:
            h_t = np.zeros((batch_size, self.hidden_size))
        else:
            h_t = h_0

        # Store all hidden states for BPTT
        self.hidden_states = [h_t]
        outputs = []

        # Process sequence
        for t in range(seq_length):
            x_t = X[:, t, :]  # Input at time t

            # RNN cell forward
            h_t = self.cell.forward(x_t, h_t)
            self.hidden_states.append(h_t)

            # Compute output
            y_t = h_t @ self.W_hy + self.b_y
            outputs.append(y_t)

        # Stack outputs
        outputs = np.stack(outputs, axis=1)  # (batch_size, seq_length, output_size)

        return outputs, h_t

    def compute_loss(self, predictions: np.ndarray, targets: np.ndarray) -> float:
        """
        Compute mean squared error loss.

        Args:
            predictions: Model predictions, shape (batch_size, seq_length, output_size)
            targets: Ground truth, shape (batch_size, seq_length, output_size)

        Returns:
            loss: Scalar loss value
        """
        return np.mean((predictions - targets) ** 2)


# Example usage
def example_rnn_usage():
    """
    Demonstrate RNN on a simple sequence prediction task.
    """
    # Hyperparameters
    batch_size = 2
    seq_length = 5
    input_size = 3
    hidden_size = 4
    output_size = 2

    # Create synthetic data
    X = np.random.randn(batch_size, seq_length, input_size)
    Y_true = np.random.randn(batch_size, seq_length, output_size)

    # Initialize RNN
    rnn = SimpleRNN(input_size, hidden_size, output_size)

    # Forward pass
    Y_pred, final_hidden = rnn.forward(X)

    # Compute loss
    loss = rnn.compute_loss(Y_pred, Y_true)

    print(f"Input shape: {X.shape}")
    print(f"Output shape: {Y_pred.shape}")
    print(f"Final hidden state shape: {final_hidden.shape}")
    print(f"Loss: {loss:.4f}")

    return rnn, X, Y_pred, loss


# Example: Character-level language model
def character_level_example():
    """
    Simple character-level language model using RNN.
    """
    # Vocabulary
    text = "hello world"
    chars = sorted(list(set(text)))
    vocab_size = len(chars)

    # Create mappings
    char_to_idx = {ch: i for i, ch in enumerate(chars)}
    idx_to_char = {i: ch for i, ch in enumerate(chars)}

    # Convert text to indices
    data = [char_to_idx[ch] for ch in text]

    print("Character-level RNN Example")
    print(f"Text: '{text}'")
    print(f"Vocabulary: {chars}")
    print(f"Vocab size: {vocab_size}")
    print(f"Encoded: {data}")

    # Create training pairs: predict next character
    seq_length = 3
    X_data = []
    Y_data = []

    for i in range(len(data) - seq_length):
        # One-hot encode input sequence
        x_seq = np.zeros((seq_length, vocab_size))
        for t, idx in enumerate(data[i:i+seq_length]):
            x_seq[t, idx] = 1

        # One-hot encode target (next character)
        y_seq = np.zeros((seq_length, vocab_size))
        for t, idx in enumerate(data[i+1:i+seq_length+1]):
            y_seq[t, idx] = 1

        X_data.append(x_seq)
        Y_data.append(y_seq)

    X_data = np.array(X_data)  # Shape: (num_samples, seq_length, vocab_size)
    Y_data = np.array(Y_data)

    print(f"\nTraining data shape: {X_data.shape}")
    print(f"Target data shape: {Y_data.shape}")

    # Initialize RNN
    hidden_size = 10
    rnn = SimpleRNN(vocab_size, hidden_size, vocab_size)

    # Forward pass
    predictions, _ = rnn.forward(X_data)
    loss = rnn.compute_loss(predictions, Y_data)

    print(f"\nInitial loss: {loss:.4f}")


if __name__ == "__main__":
    print("=" * 60)
    print("Simple RNN Implementation from Scratch")
    print("=" * 60)
    print()

    # Run basic example
    example_rnn_usage()
    print()

    # Run character-level example
    character_level_example()

Advantages and Limitations

Advantages

  1. Sequential Processing: RNNs are specifically designed for sequential data, unlike feedforward networks.

  2. Variable Length Input: Can handle sequences of any length without architectural changes.

  3. Parameter Sharing: The same weights are used at every time step, reducing the total number of parameters.

  4. Memory: The hidden state acts as memory, allowing the network to maintain context.

  5. Theoretical Power: RNNs are Turing-complete in theory, meaning that with unbounded precision and computation time they can compute any computable function. Networks trained in practice fall well short of this ideal.

  6. Versatility: Can be configured for many different sequence tasks (one-to-many, many-to-one, many-to-many).

Limitations

  1. Vanishing/Exploding Gradients: Basic RNNs struggle to learn long-term dependencies due to gradient issues during backpropagation through time.

  2. Sequential Processing: Unlike CNNs or Transformers, RNNs must process sequences step-by-step, which is slower and harder to parallelize.

  3. Limited Memory: In practice, vanilla RNNs only remember information from ~5-10 steps back effectively.

  4. Training Difficulty: BPTT is computationally expensive for long sequences, and the model is sensitive to hyperparameters.

  5. Slow Inference: Generating long sequences requires many sequential forward passes, making real-time generation challenging.

  6. Difficulty with Long-Range Dependencies: Even with gradient clipping and careful initialization, capturing dependencies spanning hundreds of time steps is difficult.

Conclusion

Recurrent Neural Networks represent a fundamental breakthrough in handling sequential data. By feeding the hidden state back into the network at each step, RNNs create a form of memory that lets them process sequences of any length while maintaining context. The basic RNN architecture we implemented here forms the foundation for more advanced architectures such as LSTMs and GRUs, which address many of the limitations of vanilla RNNs and which we will explore in future posts.

Jupyter Notebook

For hands-on practice, check out the companion notebook: Understanding RNN Architecture From Scratch