Jan 28, 2026
Continuing our journey through sequence models in this Deep Learning Series, this final post explores Bidirectional Recurrent Neural Networks (BiRNNs). Imagine you’re reading a mystery novel and trying to understand each sentence as you go. Normally, you read from left to right, building understanding based on what came before. But what if you could also peek ahead to see what comes next? You’d have a much richer understanding of each word and phrase, because you’d know both the past and the future context. This is exactly what BiRNNs do: they process sequences in both directions to capture complete contextual information.
Traditional RNNs, LSTMs, and GRUs process sequences in one direction, typically from beginning to end (left to right). While this works well for many tasks, it has a fundamental limitation: at any given time step, the model only knows about the past, not the future.
Consider these examples where future context is crucial:
- Sentence completion: in “The bank was steep and covered in wildflowers,” only the later words reveal that “bank” means a riverbank rather than a financial institution.
- Sentiment analysis: “The movie was not terrible” reads negative until the final word flips it.
- Named entity recognition: in “Washington signed the bill into law,” the words that follow indicate that “Washington” is a person, not a place.
Bidirectional RNNs solve this by processing the sequence twice:
- a forward RNN reads from left to right, accumulating past context, and
- a backward RNN reads from right to left, accumulating future context.
At each time step, the model combines information from both directions, giving it access to complete contextual information.
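In PyTorch, this two-pass scheme is a single flag on the recurrent layer. A minimal sketch (toy shapes, randomly initialized weights) showing how `bidirectional=True` doubles the output features, with the two directions living in separate halves of the last dimension:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.randn(4, 5, 8)  # 4 sequences, 5 time steps, 8 features each

uni = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
bi = nn.LSTM(input_size=8, hidden_size=16, batch_first=True,
             bidirectional=True)

uni_out, _ = uni(x)
bi_out, _ = bi(x)

print(uni_out.shape)  # torch.Size([4, 5, 16])
print(bi_out.shape)   # torch.Size([4, 5, 32]) -- both directions concatenated

forward_states = bi_out[..., :16]   # left-to-right pass
backward_states = bi_out[..., 16:]  # right-to-left pass
```

Note that every time step of `bi_out` already contains both directions’ views of the sequence.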
To better understand bidirectional processing, let’s use an analogy of two readers working together. Imagine two friends helping you understand a complex document: one reads from the first page forward, the other from the last page backward. When they discuss each sentence, they combine:
- the forward reader’s knowledge of everything that led up to that sentence, and
- the backward reader’s knowledge of everything that follows it.
Together, they have a complete picture that neither could achieve alone.
Let’s visualize how bidirectional processing works:
Input Sequence:  [x₁]      [x₂]      [x₃]      [x₄]      [x₅]
                  ↓         ↓         ↓         ↓         ↓
Forward RNN:   →  h₁f   →  h₂f   →  h₃f   →  h₄f   →  h₅f
Backward RNN:     h₁b   ←  h₂b   ←  h₃b   ←  h₄b   ←  h₅b  ←
                  ↓         ↓         ↓         ↓         ↓
Combined:      [h₁f,h₁b] [h₂f,h₂b] [h₃f,h₃b] [h₄f,h₄b] [h₅f,h₅b]
The forward and backward hidden states are typically combined using one of:
- Concatenation: $h_t = [h_t^f; h_t^b]$ (most common)
- Summation: $h_t = h_t^f + h_t^b$
- Element-wise averaging: $h_t = (h_t^f + h_t^b)/2$
Mathematically, a bidirectional RNN combines two separate RNNs running in opposite directions. To build intuition, let’s break down the mathematics:
The forward RNN processes the sequence normally:
\[h_t^f = \tanh(W_{hh}^f h_{t-1}^f + W_{xh}^f x_t + b_h^f)\]
where $h_t^f$ is the forward hidden state at time $t$, $W_{hh}^f$ and $W_{xh}^f$ are the hidden-to-hidden and input-to-hidden weight matrices, $x_t$ is the current input, and $b_h^f$ is the bias.
The backward RNN processes the sequence in reverse:
\[h_t^b = \tanh(W_{hh}^b h_{t+1}^b + W_{xh}^b x_t + b_h^b)\]
where the superscript $b$ marks the backward direction’s own states and parameters, analogous to the forward case.
It’s important to note that the backward RNN uses $h_{t+1}^b$ (future state) instead of $h_{t-1}^b$ (past state).
The most common approach is to concatenate the forward and backward hidden states at each time step, creating a combined hidden state:
\[h_t = [h_t^f; h_t^b]\]
This creates a hidden state of size $2 \times \text{hidden\_size}$, containing information from both directions.
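To make the recurrences concrete, here is a hand-rolled bidirectional tanh RNN in PyTorch, with randomly initialized weights named after the symbols in the equations above; it is an illustrative sketch, not a trained model:

```python
import torch

torch.manual_seed(0)
T, input_size, hidden_size = 5, 3, 4
xs = [torch.randn(input_size) for _ in range(T)]

# Separate randomly initialized parameters for each direction
W_hh_f = torch.randn(hidden_size, hidden_size)
W_xh_f = torch.randn(hidden_size, input_size)
b_h_f = torch.randn(hidden_size)
W_hh_b = torch.randn(hidden_size, hidden_size)
W_xh_b = torch.randn(hidden_size, input_size)
b_h_b = torch.randn(hidden_size)

# Forward pass: h_t^f depends on h_{t-1}^f
h_f = [torch.zeros(hidden_size)]
for t in range(T):
    h_f.append(torch.tanh(W_hh_f @ h_f[-1] + W_xh_f @ xs[t] + b_h_f))
h_f = h_f[1:]  # h_1^f .. h_T^f

# Backward pass: h_t^b depends on h_{t+1}^b, so iterate in reverse
h_b = [torch.zeros(hidden_size) for _ in range(T + 1)]
for t in reversed(range(T)):
    h_b[t] = torch.tanh(W_hh_b @ h_b[t + 1] + W_xh_b @ xs[t] + b_h_b)
h_b = h_b[:T]  # h_1^b .. h_T^b

# Combined state at each step: concatenation [h_t^f ; h_t^b]
h = [torch.cat([f, b]) for f, b in zip(h_f, h_b)]
print(len(h), h[0].shape)  # 5 time steps, each of size 2 * hidden_size
```

The only asymmetry between the two loops is the iteration order, which is exactly the $h_{t-1}^f$ versus $h_{t+1}^b$ difference in the formulas.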
For tasks like classification or sequence labeling:
\[y_t = \text{softmax}(W_y h_t + b_y)\]Where $W_y$ maps the combined hidden state to the output space.
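As a quick sanity check of the output layer, the following sketch (arbitrary toy sizes) applies a linear map and softmax to a combined hidden state:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size, num_classes = 4, 3

h_t = torch.randn(2 * hidden_size)                      # combined [h_t^f; h_t^b]
output_layer = nn.Linear(2 * hidden_size, num_classes)  # computes W_y h_t + b_y
y_t = torch.softmax(output_layer(h_t), dim=-1)

print(y_t.shape)         # torch.Size([3])
print(float(y_t.sum()))  # sums to 1 (up to float precision)
```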
Below are some important properties of bidirectional RNNs that make them powerful and useful:
- Every time step’s representation encodes both past and future context.
- The two directions have separate parameters, so the combined hidden state (and the parameter count) roughly doubles.
- The entire input sequence must be available before processing begins, which suits offline tasks but rules out streaming use.
Let’s take a practical example of sentiment analysis where future context significantly impacts meaning.
We’ll create a model that can handle sentences where sentiment isn’t clear without complete context:
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader, TensorDataset
from sklearn.metrics import accuracy_score, classification_report
import seaborn as sns
# Set random seeds
torch.manual_seed(42)
np.random.seed(42)
class BidirectionalRNN(nn.Module):
"""
Bidirectional RNN for sentiment analysis.
This model processes sequences in both directions to capture
complete contextual information for better sentiment understanding.
"""
def __init__(self, vocab_size, embedding_dim=100, hidden_size=64,
num_layers=1, num_classes=3, rnn_type='LSTM', dropout=0.3):
super(BidirectionalRNN, self).__init__()
self.hidden_size = hidden_size
self.num_layers = num_layers
self.rnn_type = rnn_type
# Embedding layer
self.embedding = nn.Embedding(vocab_size, embedding_dim)
# Bidirectional RNN layer
if rnn_type == 'LSTM':
self.rnn = nn.LSTM(
embedding_dim,
hidden_size,
num_layers,
batch_first=True,
bidirectional=True, # This is the key!
dropout=dropout if num_layers > 1 else 0
)
elif rnn_type == 'GRU':
self.rnn = nn.GRU(
embedding_dim,
hidden_size,
num_layers,
batch_first=True,
bidirectional=True, # This is the key!
dropout=dropout if num_layers > 1 else 0
)
else:
self.rnn = nn.RNN(
embedding_dim,
hidden_size,
num_layers,
batch_first=True,
bidirectional=True, # This is the key!
dropout=dropout if num_layers > 1 else 0
)
        # Note: bidirectional=True doubles the size of the RNN's output features
self.dropout = nn.Dropout(dropout)
# Classification layer
# Hidden size is doubled due to bidirectional processing
self.classifier = nn.Sequential(
nn.Linear(hidden_size * 2, hidden_size),
nn.ReLU(),
nn.Dropout(dropout),
nn.Linear(hidden_size, num_classes)
)
def forward(self, x, lengths=None):
"""
Forward pass with bidirectional processing.
Args:
x: Input sequences (batch_size, seq_length)
lengths: Actual lengths of sequences (for packing)
Returns:
logits: Classification logits (batch_size, num_classes)
"""
batch_size = x.size(0)
# Embedding
embedded = self.embedding(x) # (batch_size, seq_length, embedding_dim)
# Pack sequences if lengths provided (for variable-length sequences)
if lengths is not None:
embedded = nn.utils.rnn.pack_padded_sequence(
embedded, lengths, batch_first=True, enforce_sorted=False
)
        # Bidirectional RNN
        num_directions = 2 if self.rnn.bidirectional else 1
        if self.rnn_type == 'LSTM':
            # Initial hidden and cell states: one per layer per direction,
            # created on the same device as the input
            h0 = torch.zeros(self.num_layers * num_directions, batch_size,
                             self.hidden_size, device=x.device)
            c0 = torch.zeros_like(h0)
            rnn_out, (hidden, cell) = self.rnn(embedded, (h0, c0))
        else:
            h0 = torch.zeros(self.num_layers * num_directions, batch_size,
                             self.hidden_size, device=x.device)
            rnn_out, hidden = self.rnn(embedded, h0)
        # Unpack if we packed
        if lengths is not None:
            rnn_out, _ = nn.utils.rnn.pad_packed_sequence(rnn_out, batch_first=True)
        # rnn_out shape: (batch_size, seq_length, hidden_size * num_directions)
        # For sentiment analysis, several pooling strategies are possible:
        #   1. Last hidden state      3. Max pooling over time steps
        #   2. Mean pooling           4. Attention mechanism
        # Strategy 1 (most common): use the final hidden state(s).
        # `hidden` shape: (num_layers * num_directions, batch_size, hidden_size);
        # its last entries hold the last layer's per-direction final states.
        if num_directions == 2:
            forward_hidden = hidden[-2]   # last layer, forward direction
            backward_hidden = hidden[-1]  # last layer, backward direction
            combined_hidden = torch.cat([forward_hidden, backward_hidden], dim=1)
        else:
            combined_hidden = hidden[-1]
# Apply dropout
combined_hidden = self.dropout(combined_hidden)
# Classification
logits = self.classifier(combined_hidden)
return logits
# Create synthetic sentiment data with context-dependent examples
class SentimentDataGenerator:
"""
Generate synthetic sentiment data where bidirectional context is crucial.
"""
def __init__(self):
# Simple vocabulary
self.vocab = {
'<PAD>': 0, '<UNK>': 1, 'the': 2, 'movie': 3, 'was': 4, 'is': 5,
'not': 6, 'really': 7, 'very': 8, 'quite': 9, 'absolutely': 10,
'terrible': 11, 'bad': 12, 'awful': 13, 'horrible': 14,
'amazing': 15, 'great': 16, 'excellent': 17, 'fantastic': 18,
'good': 19, 'okay': 20, 'fine': 21, 'decent': 22, 'average': 23,
'but': 24, 'however': 25, 'although': 26, 'despite': 27, 'until': 28,
'it': 29, 'that': 30, 'this': 31, 'i': 32, 'thought': 33, 'think': 34,
'would': 35, 'could': 36, 'should': 37, 'be': 38, 'been': 39,
'ending': 40, 'beginning': 41, 'middle': 42, 'plot': 43, 'acting': 44,
'ruined': 45, 'saved': 46, 'made': 47, 'everything': 48, 'nothing': 49
}
self.word_to_idx = self.vocab
self.idx_to_word = {idx: word for word, idx in self.vocab.items()}
self.vocab_size = len(self.vocab)
# Sentiment labels: 0=negative, 1=neutral, 2=positive
self.sentiment_labels = ['negative', 'neutral', 'positive']
def create_context_dependent_examples(self):
"""
Create examples where bidirectional context is crucial for correct classification.
"""
examples = [
# Positive examples that start negatively
(['the', 'movie', 'was', 'not', 'terrible'], 2), # positive
(['i', 'thought', 'it', 'would', 'be', 'awful', 'but', 'it', 'was', 'amazing'], 2),
(['not', 'bad', 'at', 'all'], 2),
(['the', 'acting', 'was', 'not', 'horrible'], 2),
# Negative examples that start positively
(['great', 'movie', 'until', 'the', 'ending', 'ruined', 'everything'], 0),
(['i', 'thought', 'it', 'was', 'excellent', 'but', 'it', 'was', 'terrible'], 0),
(['amazing', 'beginning', 'but', 'awful', 'ending'], 0),
# Neutral examples with mixed sentiment
(['the', 'movie', 'was', 'okay'], 1),
(['average', 'acting', 'decent', 'plot'], 1),
(['not', 'great', 'not', 'terrible'], 1),
# Clear positive examples
(['absolutely', 'amazing', 'movie'], 2),
(['fantastic', 'acting', 'excellent', 'plot'], 2),
(['really', 'great', 'film'], 2),
# Clear negative examples
(['terrible', 'movie', 'awful', 'acting'], 0),
(['really', 'bad', 'horrible', 'plot'], 0),
(['absolutely', 'terrible'], 0),
]
# Convert words to indices
processed_examples = []
for words, label in examples:
indices = [self.word_to_idx.get(word, self.word_to_idx['<UNK>']) for word in words]
processed_examples.append((indices, label))
return processed_examples
def pad_sequences(self, sequences, max_length=None):
"""Pad sequences to the same length."""
if max_length is None:
max_length = max(len(seq) for seq in sequences)
padded = []
lengths = []
for seq in sequences:
length = len(seq)
lengths.append(length)
if length < max_length:
# Pad with <PAD> token
padded_seq = seq + [self.word_to_idx['<PAD>']] * (max_length - length)
else:
padded_seq = seq[:max_length]
padded.append(padded_seq)
return padded, lengths
def demonstrate_bidirectional_vs_unidirectional():
"""
Demonstrate the difference between bidirectional and unidirectional processing.
"""
print("🔍 BIDIRECTIONAL vs UNIDIRECTIONAL COMPARISON")
print("=" * 60)
# Create data generator
data_gen = SentimentDataGenerator()
examples = data_gen.create_context_dependent_examples()
# Separate sequences and labels
sequences = [ex[0] for ex in examples]
labels = [ex[1] for ex in examples]
# Pad sequences
padded_sequences, lengths = data_gen.pad_sequences(sequences)
# Convert to tensors
X = torch.LongTensor(padded_sequences)
y = torch.LongTensor(labels)
lengths_tensor = torch.LongTensor(lengths)
print(f"Dataset created:")
print(f" Sequences: {len(X)}")
print(f" Vocabulary size: {data_gen.vocab_size}")
print(f" Classes: {len(data_gen.sentiment_labels)}")
# Create models
bidirectional_model = BidirectionalRNN(
vocab_size=data_gen.vocab_size,
embedding_dim=50,
hidden_size=32,
num_classes=3,
rnn_type='LSTM'
)
    unidirectional_model = BidirectionalRNN(
        vocab_size=data_gen.vocab_size,
        embedding_dim=50,
        hidden_size=32,
        num_classes=3,
        rnn_type='LSTM'
    )
    # NOTE: flipping `rnn.bidirectional = False` after construction does NOT
    # work -- the RNN weights and the classifier are already sized for two
    # directions -- so replace both modules with single-direction versions
    unidirectional_model.rnn = nn.LSTM(
        50, 32, 1, batch_first=True, bidirectional=False
    )
    unidirectional_model.classifier = nn.Sequential(
        nn.Linear(32, 32), nn.ReLU(), nn.Dropout(0.3), nn.Linear(32, 3)
    )
# Compare model complexities
bi_params = sum(p.numel() for p in bidirectional_model.parameters())
uni_params = sum(p.numel() for p in unidirectional_model.parameters())
print(f"\nModel Comparison:")
print(f" Bidirectional parameters: {bi_params:,}")
print(f" Unidirectional parameters: {uni_params:,}")
print(f" Parameter increase: {((bi_params - uni_params) / uni_params * 100):.1f}%")
# Quick training function
def quick_train(model, model_name, epochs=50):
        print(f"\nTraining {model_name}...")
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
model.train()
for epoch in range(epochs):
optimizer.zero_grad()
if 'Bidirectional' in model_name:
logits = model(X, lengths_tensor)
else:
logits = model(X)
loss = criterion(logits, y)
loss.backward()
optimizer.step()
if (epoch + 1) % 10 == 0:
print(f" Epoch {epoch+1}: Loss = {loss.item():.4f}")
# Test accuracy
model.eval()
with torch.no_grad():
if 'Bidirectional' in model_name:
test_logits = model(X, lengths_tensor)
else:
test_logits = model(X)
predictions = torch.argmax(test_logits, dim=1)
accuracy = (predictions == y).float().mean().item()
return accuracy
# Train both models
bi_accuracy = quick_train(bidirectional_model, "Bidirectional LSTM")
uni_accuracy = quick_train(unidirectional_model, "Unidirectional LSTM")
print(f"\nRESULTS:")
print(f" Bidirectional accuracy: {bi_accuracy:.3f}")
print(f" Unidirectional accuracy: {uni_accuracy:.3f}")
print(f" Improvement: {((bi_accuracy - uni_accuracy) / uni_accuracy * 100):.1f}%")
# Analyze specific examples
print(f"\nDETAILED ANALYSIS:")
print("-" * 40)
bidirectional_model.eval()
unidirectional_model.eval()
with torch.no_grad():
bi_logits = bidirectional_model(X, lengths_tensor)
uni_logits = unidirectional_model(X)
bi_preds = torch.argmax(bi_logits, dim=1)
uni_preds = torch.argmax(uni_logits, dim=1)
# Show examples where bidirectional helps
for i, (seq_indices, true_label) in enumerate(examples):
words = [data_gen.idx_to_word[idx] for idx in seq_indices]
sentence = ' '.join(words)
bi_pred = bi_preds[i].item()
uni_pred = uni_preds[i].item()
true_sentiment = data_gen.sentiment_labels[true_label]
bi_sentiment = data_gen.sentiment_labels[bi_pred]
uni_sentiment = data_gen.sentiment_labels[uni_pred]
print(f"\nExample {i+1}: \"{sentence}\"")
print(f" True: {true_sentiment}")
print(f" Bidirectional: {bi_sentiment} {'correct' if bi_pred == true_label else 'wrong'}")
print(f" Unidirectional: {uni_sentiment} {'correct' if uni_pred == true_label else 'wrong'}")
if bi_pred == true_label and uni_pred != true_label:
print(f"Bidirectional model benefits from future context!")
# Advanced Bidirectional RNN with Attention
class AttentionalBidirectionalRNN(nn.Module):
"""
Bidirectional RNN with attention mechanism for better sequence understanding.
"""
def __init__(self, vocab_size, embedding_dim=100, hidden_size=64,
num_classes=3, dropout=0.3):
super(AttentionalBidirectionalRNN, self).__init__()
self.hidden_size = hidden_size
# Embedding
self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Bidirectional LSTM (single layer; PyTorch applies RNN dropout only
        # between stacked layers, so none is passed here)
        self.lstm = nn.LSTM(
            embedding_dim,
            hidden_size,
            batch_first=True,
            bidirectional=True
        )
# Attention mechanism
self.attention = nn.Linear(hidden_size * 2, 1)
# Classification
self.classifier = nn.Sequential(
nn.Dropout(dropout),
nn.Linear(hidden_size * 2, hidden_size),
nn.ReLU(),
nn.Dropout(dropout),
nn.Linear(hidden_size, num_classes)
)
def forward(self, x, lengths=None):
# Embedding
embedded = self.embedding(x)
# Bidirectional LSTM
lstm_out, _ = self.lstm(embedded)
# lstm_out shape: (batch_size, seq_length, hidden_size * 2)
# Attention mechanism
attention_weights = torch.softmax(self.attention(lstm_out), dim=1)
# attention_weights shape: (batch_size, seq_length, 1)
# Weighted sum of all time steps
attended_output = torch.sum(attention_weights * lstm_out, dim=1)
# attended_output shape: (batch_size, hidden_size * 2)
# Classification
logits = self.classifier(attended_output)
return logits, attention_weights
def visualize_attention(model, data_gen, example_idx=0):
"""
Visualize attention weights for a specific example.
"""
examples = data_gen.create_context_dependent_examples()
sequences = [ex[0] for ex in examples]
labels = [ex[1] for ex in examples]
padded_sequences, lengths = data_gen.pad_sequences(sequences)
X = torch.LongTensor(padded_sequences)
model.eval()
with torch.no_grad():
logits, attention_weights = model(X)
# Get the example
seq_indices = sequences[example_idx]
words = [data_gen.idx_to_word[idx] for idx in seq_indices]
attention = attention_weights[example_idx, :len(words), 0].numpy()
# Plot attention
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
bars = plt.bar(range(len(words)), attention, alpha=0.7)
plt.xlabel('Words')
plt.ylabel('Attention Weight')
plt.title('Attention Weights Visualization')
plt.xticks(range(len(words)), words, rotation=45)
# Color bars by attention strength
max_attention = max(attention)
for i, (bar, att) in enumerate(zip(bars, attention)):
intensity = att / max_attention
bar.set_color(plt.cm.Reds(intensity))
plt.subplot(1, 2, 2)
# Show sentence with attention
sentence = ' '.join(words)
predicted_class = torch.argmax(logits[example_idx]).item()
true_class = labels[example_idx]
plt.text(0.1, 0.7, f"Sentence: {sentence}", fontsize=12, wrap=True)
plt.text(0.1, 0.5, f"True sentiment: {data_gen.sentiment_labels[true_class]}", fontsize=12)
plt.text(0.1, 0.3, f"Predicted: {data_gen.sentiment_labels[predicted_class]}", fontsize=12)
plt.text(0.1, 0.1, f"Most attended word: '{words[np.argmax(attention)]}'", fontsize=12,
fontweight='bold', color='red')
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.axis('off')
plt.title('Prediction Results')
plt.tight_layout()
plt.show()
# Main demonstration
if __name__ == "__main__":
print("BIDIRECTIONAL RNN TUTORIAL")
print("=" * 40)
# 1. Basic demonstration
demonstrate_bidirectional_vs_unidirectional()
# 2. Advanced model with attention
print(f"\n\nADVANCED: Bidirectional RNN with Attention")
print("=" * 50)
data_gen = SentimentDataGenerator()
examples = data_gen.create_context_dependent_examples()
sequences = [ex[0] for ex in examples]
labels = [ex[1] for ex in examples]
padded_sequences, lengths = data_gen.pad_sequences(sequences)
X = torch.LongTensor(padded_sequences)
y = torch.LongTensor(labels)
# Create attention model
attention_model = AttentionalBidirectionalRNN(
vocab_size=data_gen.vocab_size,
embedding_dim=50,
hidden_size=32,
num_classes=3
)
print(f"Attention model parameters: {sum(p.numel() for p in attention_model.parameters()):,}")
# Train attention model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(attention_model.parameters(), lr=0.01)
attention_model.train()
for epoch in range(30):
optimizer.zero_grad()
logits, _ = attention_model(X)
loss = criterion(logits, y)
loss.backward()
optimizer.step()
if (epoch + 1) % 10 == 0:
print(f"Attention model epoch {epoch+1}: Loss = {loss.item():.4f}")
# Test accuracy
attention_model.eval()
with torch.no_grad():
test_logits, _ = attention_model(X)
predictions = torch.argmax(test_logits, dim=1)
accuracy = (predictions == y).float().mean().item()
print(f"Attention model accuracy: {accuracy:.3f}")
# Visualize attention for a context-dependent example
print(f"\nVisualizing attention for context-dependent example...")
visualize_attention(attention_model, data_gen, example_idx=1) # Example with "but" contrast
print(f"\nBidirectional RNN tutorial completed!")
print(f"Key insights:")
print(f" • Bidirectional processing captures complete context")
print(f" • Especially important for tasks where future context matters")
print(f" • ~2x parameters but often significantly better performance")
print(f" • Attention mechanisms can further improve interpretability")
Let’s look at specific applications where bidirectional processing provides significant advantages:
Natural Language Processing
- Sentiment Analysis: negations and contrasts (“not bad”, “great until the ending”) only resolve with the full sentence.
- Named Entity Recognition (NER): words on both sides of a name help decide whether it is a person, place, or organization.
- Part-of-Speech Tagging: a word’s tag often depends on the words that follow it.
Speech Recognition
- Phoneme Classification: a phoneme’s realization is shaped by the sounds both before and after it.
- Word Recognition: later acoustic frames help disambiguate earlier ones.
Time Series Analysis
- Anomaly Detection: in offline analysis, an event may only look anomalous in light of what happens afterwards.
- Financial Prediction: useful for retrospective modeling, though not for live forecasting, where future values are unavailable.
Bioinformatics
- Protein Structure Prediction: a residue’s local structure depends on the surrounding sequence in both directions.
- DNA Sequence Analysis: regulatory motifs are defined by their upstream and downstream context.
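For token-level tasks such as NER and POS tagging, the per-timestep outputs of the bidirectional layer feed a shared classifier, producing one tag distribution per token. A minimal sketch with a hypothetical vocabulary and tag set:

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Bidirectional LSTM that emits one tag distribution per token."""
    def __init__(self, vocab_size, num_tags, embedding_dim=32, hidden_size=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_size,
                            batch_first=True, bidirectional=True)
        # Each token's representation carries both directions
        self.tag_head = nn.Linear(hidden_size * 2, num_tags)

    def forward(self, token_ids):
        out, _ = self.lstm(self.embedding(token_ids))
        return self.tag_head(out)  # (batch, seq_len, num_tags)

tagger = BiLSTMTagger(vocab_size=100, num_tags=5)
tokens = torch.randint(0, 100, (2, 7))  # 2 sentences, 7 tokens each
tag_logits = tagger(tokens)
print(tag_logits.shape)  # torch.Size([2, 7, 5])
```

Unlike the sentiment model above, no pooling is applied: the classifier is applied at every time step, which is exactly where per-token access to future context pays off.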
Let’s systematically examine the trade-offs of bidirectional RNNs:
Complete Context Awareness: Access to both past and future information at every time step, providing richer representations.
Improved Performance: often substantial accuracy improvements on tasks where full-sequence context is crucial, especially in NLP and sequence labeling.
Better Feature Learning: Can learn more sophisticated patterns that depend on bidirectional context.
Disambiguation Power: Excellent at resolving ambiguities that require complete sentence or sequence context.
Versatile Architecture: Can be combined with different RNN types (LSTM, GRU) and additional mechanisms (attention).
Strong Empirical Results: Consistently outperforms unidirectional models on benchmark tasks like sentiment analysis and NER.
Computational Cost: Approximately twice the computational requirements due to processing in both directions.
Memory Requirements: Roughly double the memory usage compared to unidirectional models.
Training Time: Significantly longer training times, especially for long sequences.
No Real-time Processing: Cannot be used for online/streaming applications where future context is unavailable.
Increased Complexity: More parameters to tune and potential for overfitting on smaller datasets.
Diminishing Returns: Benefits may be limited for tasks where local context is sufficient.
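The computational-cost point is easy to verify: a bidirectional layer keeps one full set of recurrent weights per direction, so its recurrent parameter count is exactly double that of its unidirectional counterpart (sizes here are arbitrary):

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

uni = nn.LSTM(input_size=100, hidden_size=64, batch_first=True)
bi = nn.LSTM(input_size=100, hidden_size=64, batch_first=True,
             bidirectional=True)

print(n_params(bi) / n_params(uni))  # 2.0 -- one weight set per direction
```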
In our deep learning journey through sequence modeling, we have assembled a toolkit that now includes vanilla RNNs for basic sequence modeling, LSTMs for long-range memory, GRUs as a lighter-weight alternative, and bidirectional variants for complete contextual awareness.
Bidirectional RNNs taught us that context is king in sequence understanding. This insight paved the way for attention mechanisms and transformers, which take the concept of “looking at the whole sequence” to its logical conclusion.
The journey from simple RNNs to bidirectional architectures shows how incremental improvements in deep learning often come from better ways of incorporating information, whether it’s through better memory (LSTM), efficiency (GRU), or context (bidirectional processing).
As we move forward, we’ll see how these concepts continue to evolve in architectures like transformers, which take the idea of complete context awareness to new heights.
For hands-on practice, check out the companion notebooks - Understanding Bidirectional RNNs with PyTorch