Dec 26, 2025
Continuing our Deep Learning Series, we’ll delve into the fascinating world of Convolutional Neural Networks (CNNs) in this article. Imagine you’re trying to identify a friend in a crowded photograph. Your brain doesn’t analyze every single pixel independently. Instead, it looks for distinctive patterns: perhaps the shape of their face, the color of their hair, or the style of their clothing. You naturally recognize these features regardless of where your friend stands in the photo or whether they’re slightly turned to the side. This hierarchical, translation-invariant way of processing visual information is exactly what CNNs were designed to replicate.
Let’s consider the simple task of recognizing a handwritten digit. To a human, distinguishing a 3 from an 8 seems trivial. But to a computer, an image is just a grid of numbers representing pixel intensities. A 28×28 grayscale image of a digit contains 784 numbers, and a 224×224 color image contains 150,528 numbers!
The challenges multiply when we consider:
Variation in Appearance: The same object can look vastly different depending on lighting, viewpoint, scale, pose, and partial occlusion.
Spatial Structure: Pixels aren’t independent; nearby pixels are highly correlated. A wheel makes sense in the context of a car, and eyes make sense in the context of a face.
Convolutional Neural Networks (CNNs), inspired by the visual cortex of animals, were specifically designed to address these challenges. The key insight, dating back to Hubel and Wiesel’s Nobel Prize-winning research beginning in 1959, was that neurons in the visual cortex have local receptive fields: they respond to stimuli only in restricted regions of the visual field. Later, Yann LeCun’s LeNet-5 (1998), designed for handwritten digit recognition, demonstrated that this biological inspiration could be translated into a powerful machine learning architecture. Today, CNNs power everything from facial recognition in your phone to autonomous vehicles and medical image analysis.
Let’s see what happens if we try to use a traditional fully connected neural network for image classification:
Problem 1: Parameter Explosion
For a tiny 28×28 grayscale image with 100 hidden neurons, the first fully connected layer alone needs 784 × 100 weights plus 100 biases = 78,500 parameters.
And for a more realistic 224×224 RGB image, that same layer needs 150,528 × 100 + 100 ≈ 15 million parameters.
This is unsustainable and leads to overfitting, enormous memory requirements, and slow training. The quick calculation below makes the numbers concrete.
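Here’s that back-of-the-envelope count in plain Python (the helper name is just for illustration):
def first_layer_params(height, width, channels, hidden_neurons=100):
    # (inputs × hidden neurons) weights, plus one bias per hidden neuron
    inputs = height * width * channels
    return inputs * hidden_neurons + hidden_neurons

print(first_layer_params(28, 28, 1))    # 78500 parameters for a 28×28 grayscale image
print(first_layer_params(224, 224, 3))  # 15052900 parameters (≈15 million) for 224×224 RGB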
Problem 2: Loss of Spatial Structure
When we flatten an image into a vector, we destroy the 2D spatial structure:
# Original image: 28x28 with spatial relationships (7-pixel-wide excerpt shown)
image = [
[0, 0, 5, 8, 5, 0, 0], # Row 1: Top of digit
[0, 3, 9, 9, 9, 3, 0], # Row 2: Upper part
[5, 9, 5, 0, 5, 9, 5], # Row 3: Middle
...
]
# After flattening: 784-length vector, no spatial info
flattened = [0, 0, 5, 8, 5, 0, 0, 0, 3, 9, 9, 9, 3, 0, 5, 9, 5, ...]
Now, the pixels that were neighbors are treated no differently than pixels from opposite corners. The network must relearn spatial relationships from scratch!
Problem 3: No Translation Invariance
A fully connected network learns that “edge at position (10, 15)” is important, but it is incapable of recognizing that the same edge at position (11, 16) represents the same feature. The network must learn the same pattern separately for every possible position.
CNNs solve all three problems elegantly through local connectivity (each neuron looks at a small region), weight sharing (the same filter is reused at every position), and pooling (which builds in tolerance to small translations).
Convolution is a mathematical operation that combines two functions to produce a third function. In the context of CNNs, we convolve an image with a small filter (also called a kernel) to detect specific features.
Imagine you have a magnifying glass (the filter) and you’re systematically scanning across a page (the image), looking for a specific pattern (like vertical edges). You slide the glass across the page one small region at a time, note how strongly each region matches the pattern, and record that score before moving on.
That’s exactly how the convolution operation works!
For a 2D image $I$ and a filter $K$, the convolution operation is:
\[(I * K)(i, j) = \sum_m \sum_n I(i+m, j+n) \cdot K(m, n)\]
where $I(i+m, j+n)$ is the image value under the filter and $K(m, n)$ is the corresponding filter weight. (Strictly speaking this is cross-correlation; true convolution flips the kernel, but deep learning libraries use this form and still call it convolution.)
Let’s detect vertical edges in a simple image using a 3×3 filter.
Input Image (7×7):
10 10 10 0 0 0 0
10 10 10 0 0 0 0
10 10 10 0 0 0 0
10 10 10 0 0 0 0
10 10 10 0 0 0 0
10 10 10 0 0 0 0
10 10 10 0 0 0 0
This image has a clear vertical edge in the middle (transition from 10 to 0).
Vertical Edge Detection Filter (3×3):
1 0 -1
1 0 -1
1 0 -1
How Convolution Works:
Step 1: Place filter at top-left (position 0,0):
Filter over image:
10 10 10
10 10 10
10 10 10
Computation:
(10×1) + (10×0) + (10×-1) +
(10×1) + (10×0) + (10×-1) +
(10×1) + (10×0) + (10×-1) = 0
Step 2: Slide one position right (position 0,1):
Filter over image:
10 10 0
10 10 0
10 10 0
Computation:
(10×1) + (10×0) + (0×-1) +
(10×1) + (10×0) + (0×-1) +
(10×1) + (10×0) + (0×-1) = 30
The high value (30) indicates we found an edge! We repeat this across the entire image.
Output Feature Map (5×5, due to valid convolution):
0 30 30 0 0
0 30 30 0 0
0 30 30 0 0
0 30 30 0 0
0 30 30 0 0
The bright values highlight where the vertical edge is located!
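If you want to verify this worked example yourself, here is a minimal NumPy sketch of valid cross-correlation (the operation deep learning calls “convolution”); the function name is mine, not from any library:
import numpy as np

def conv2d_valid(image, kernel):
    """Valid cross-correlation: slide the kernel, multiply element-wise, and sum."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.array([[10, 10, 10, 0, 0, 0, 0]] * 7, dtype=float)  # the 7×7 image above
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)                   # vertical edge filter
print(conv2d_valid(image, kernel))
# Every row comes out as [ 0. 30. 30.  0.  0.], matching the feature map above.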
Different filters detect different features:
Horizontal Edge Detection:
1 1 1
0 0 0
-1 -1 -1
Diagonal Edge Detection:
2 1 0
1 0 -1
0 -1 -2
Blur (Average):
1/9 1/9 1/9
1/9 1/9 1/9
1/9 1/9 1/9
Sharpen:
0 -1 0
-1 5 -1
0 -1 0
The magic of CNNs: we don’t hand-design these filters; the network learns them during training!
Padding: Adding zeros around the image border
Why padding?
Without padding: information at the corners is used less.
With padding: all positions are treated more equally.
Stride: How many pixels to move the filter at each step.
Output size formula: \(\text{output size} = \left\lfloor \frac{\text{input size} + 2 \times \text{padding} - \text{filter size}}{\text{stride}} \right\rfloor + 1\)
Example: a 28×28 input with a 3×3 filter, padding 1, and stride 1 gives ⌊(28 + 2×1 − 3)/1⌋ + 1 = 28, so the spatial size is preserved.
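Here’s an illustrative helper for that formula (not from any framework):
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    # Floor division implements the floor in the formula above
    return (input_size + 2 * padding - filter_size) // stride + 1

print(conv_output_size(28, 3, padding=1, stride=1))  # 28: "same" spatial size preserved
print(conv_output_size(7, 3, padding=0, stride=1))   # 5: the edge-detection example above
print(conv_output_size(28, 3, padding=0, stride=2))  # 13: stride 2 roughly halves the size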
Purpose: Extract spatial features from input
How it works: each filter slides across the input, computes a dot product with every local region, and produces one feature map per filter.
Parameters: the number of filters, the kernel size, the stride, and the padding.
Example:
Input: 32 × 32 × 3 (RGB image)
Filter: 64 filters of size 3×3×3
Output: 32 × 32 × 64 (64 feature maps)
Parameters: (3×3×3 + 1 bias) × 64 = 1,792
What happens: each of the 64 filters spans the full depth of the input (3 channels), slides over every spatial position, and produces its own 32×32 feature map; stacking them gives the 32×32×64 output.
Purpose: Introduce non-linearity
Formula: $\text{ReLU}(x) = \max(0, x)$
Why it’s crucial: without a non-linearity, any stack of convolutions collapses into a single linear operation, so the network could never learn complex patterns. ReLU is cheap to compute and avoids the vanishing gradients of saturating activations.
Applied element-wise:
Before ReLU: [-2.5, 0.3, -0.1, 4.2]
After ReLU: [ 0.0, 0.3, 0.0, 4.2] # Negative values zeroed out
Purpose: shrink the spatial dimensions, cut computation, and make the representation more tolerant to small translations.
Max Pooling (most common):
Example with 2×2 window, stride 2:
Input (4×4): Output (2×2):
[1 3 2 4] [6 8]
[5 6 7 8] → [9 11]
[9 2 1 3]
[4 5 11 2]
Each 2×2 region is replaced by its maximum value.
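Here’s a minimal NumPy sketch of that operation (2×2 window, stride 2); the helper name is mine:
import numpy as np

def max_pool2d(x, pool=2, stride=2):
    out_h = (x.shape[0] - pool) // stride + 1
    out_w = (x.shape[1] - pool) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Take the maximum of each pooling window
            out[i, j] = x[i*stride:i*stride+pool, j*stride:j*stride+pool].max()
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [9, 2, 1, 3],
              [4, 5, 11, 2]], dtype=float)
print(max_pool2d(x))  # [[ 6.  8.] [ 9. 11.]]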
Why max pooling works: it keeps the strongest response in each neighborhood, so a feature is still detected even if it shifts by a pixel or two, and the smaller maps are cheaper for later layers to process.
Average Pooling: Takes the average instead of maximum. Less common but used in some architectures (e.g., global average pooling).
We use Batch Normalization to stabilize and accelerate training. During training, the distribution of layer inputs changes as previous layers’ weights update; this internal covariate shift slows training. For each mini-batch, we normalize the activations:
\[\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}\]
where $\mu_B$ and $\sigma_B^2$ are the mini-batch mean and variance and $\epsilon$ is a small constant for numerical stability.
Then apply learnable scale ($\gamma$) and shift ($\beta$):
\[y = \gamma \hat{x} + \beta\]
Benefits: faster and more stable training, tolerance to higher learning rates, reduced sensitivity to weight initialization, and a mild regularization effect.
You can read about Batch Normalization in detail here
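As a rough sketch of the idea (training-time statistics only, ignoring the running averages used at inference and the per-channel handling in conv layers), batch normalization over a mini-batch might look like this in NumPy:
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Minimal training-time batch norm over the batch axis; x has shape (batch, features)."""
    mu = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                    # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to roughly zero mean, unit variance
    return gamma * x_hat + beta            # learnable scale and shift

x = np.random.randn(32, 4) * 10 + 5       # badly scaled activations
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ≈ 0 and ≈ 1 per feature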
We use Dropout to prevent overfitting. During training we randomly “drop” (set to zero) a fraction of neurons, with the dropout rate typically 0.2–0.5 (20%–50% of neurons dropped). Why it works is intuitive: it’s like studying for an exam while randomly covering different parts of your notes each time; you build a more robust understanding rather than memorizing specific patterns.
Benefits: it prevents neurons from co-adapting, acts like training an ensemble of smaller networks, and usually improves generalization.
You can read about Dropout in detail here
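A minimal sketch of inverted dropout in NumPy (the scaling by 1/(1 − rate) at training time means nothing needs to change at inference; the helper name is mine):
import numpy as np

def dropout_forward(x, rate=0.5, training=True):
    """Inverted dropout: zero a random fraction of units and rescale the survivors."""
    if not training or rate == 0.0:
        return x
    mask = (np.random.rand(*x.shape) >= rate) / (1.0 - rate)
    return x * mask

a = np.ones((2, 8))
print(dropout_forward(a, rate=0.5))        # roughly half the entries zeroed, the rest scaled to 2.0
print(dropout_forward(a, training=False))  # unchanged at inference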
Purpose: the final layers make predictions from the extracted features. They sit at the end of the network, take the flattened feature maps from the convolutional layers, and combine those features into class scores. The final layer has one neuron per class.
Example:
Input: 7 × 7 × 512 = 25,088 features (flattened)
Hidden: 4096 neurons (fully connected)
Output: 1000 neurons (one per class, e.g., ImageNet)
Parameters: 25,088 × 4096 ≈ 100 million weights!
This is why modern architectures use Global Average Pooling to reduce parameters.
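To see why, compare the parameter count of a flatten-plus-FC head with a global-average-pooling head (numbers follow the example above; this is a sketch, not a full model):
import numpy as np

# Feature maps from the last conv block: (batch, channels, height, width)
features = np.random.randn(4, 512, 7, 7)

# Classic flatten + FC to 4096 hidden units (VGG-style head)
flatten_params = (512 * 7 * 7) * 4096 + 4096   # ≈ 102.8 million weights

# Global average pooling: one value per channel, then FC straight to 1000 classes
gap = features.mean(axis=(2, 3))               # shape (4, 512)
gap_params = 512 * 1000 + 1000                 # ≈ 0.5 million weights

print(gap.shape, flatten_params, gap_params)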
CNNs automatically learn a hierarchy of increasingly complex features:
Layer 1 (Early Layers): Simple features such as edges, corners, and color blobs
Layer 2-3 (Middle Layers): Complex features such as textures, simple shapes, and repeated patterns
Layer 4-5 (Deep Layers): High-level features such as object parts (wheels, eyes, faces) and entire objects
Example: Face Recognition
Input Image (face photo)
↓
Layer 1: Detects edges and gradients
[Recognizes: "vertical lines", "curves", "color transitions"]
↓
Layer 2: Combines edges into simple shapes
[Recognizes: "circular shapes", "parallel lines"]
↓
Layer 3: Recognizes face parts
[Recognizes: "eyes", "nose", "mouth shapes"]
↓
Layer 4: Recognizes spatial arrangements
[Recognizes: "two eyes above a nose above a mouth"]
↓
Output: "This is a face!"
Intuitively, each layer “sees” a larger region of the original image:
Layer 1 (3×3 filter): a 3×3 region = 9 pixels
Layer 2 (3×3 filter): a 5×5 region = 25 pixels (after layer 1)
Layer 3 (3×3 filter): a 7×7 region = 49 pixels (after layers 1-2)
...with pooling in between: grows even faster
Deep layers: effectively the entire image
This allows deep layers to integrate information from the full image context.
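One common way to compute this growth is the standard receptive-field recurrence; the helper below is an illustrative sketch (names are mine), assuming a stack of layers given as (kernel_size, stride) pairs:
def receptive_field(layers):
    """Receptive field of a stack of layers given as (kernel_size, stride) pairs."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the field by (k-1) input-pixel jumps
        jump *= s              # stride multiplies how far one output step moves in the input
    return rf

print(receptive_field([(3, 1)]))                    # 3  -> 3×3 region
print(receptive_field([(3, 1), (3, 1)]))            # 5  -> 5×5 region
print(receptive_field([(3, 1), (3, 1), (3, 1)]))    # 7  -> 7×7 region
print(receptive_field([(3, 1), (2, 2), (3, 1)]))    # 8  -> a stride-2 pool makes later layers grow faster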
Now, let’s look at a complete CNN architecture. It’s a very common structure for image classification tasks.
Input Image (224×224×3)
↓
[CONV → ReLU → CONV → ReLU → POOL] × N (Feature extraction)
↓
[CONV → ReLU → CONV → ReLU → POOL] × M (Higher-level features)
↓
[CONV → ReLU → CONV → ReLU → POOL] × K (Abstract features)
↓
Flatten
↓
[FC → ReLU → Dropout] × L (Classification)
↓
FC (Output layer)
↓
Softmax
↓
Class Probabilities
Input: 28×28×1 (grayscale)
Conv1: 32 filters, 3×3, padding=1, stride=1
Output: 28×28×32
ReLU activation
MaxPool1: 2×2, stride=2
Output: 14×14×32
Conv2: 64 filters, 3×3, padding=1, stride=1
Output: 14×14×64
ReLU activation
MaxPool2: 2×2, stride=2
Output: 7×7×64
Flatten: 7×7×64 = 3,136 features
FC1: 128 neurons
ReLU activation
Dropout (0.5)
FC2 (Output): 10 neurons (one per digit)
Softmax activation
Total Parameters:
Conv1: (3×3×1 + 1) × 32 = 320
Conv2: (3×3×32 + 1) × 64 = 18,496
FC1: (3,136 + 1) × 128 = 401,536
FC2: (128 + 1) × 10 = 1,290
Total: 421,642 parameters
This architecture balances depth and width to effectively learn hierarchical features from MNIST digits.
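You can reproduce these counts with a couple of lines of Python (helper names are mine):
def conv_params(kernel, in_ch, out_ch):
    return (kernel * kernel * in_ch + 1) * out_ch   # +1 for each filter's bias

def fc_params(in_features, out_features):
    return (in_features + 1) * out_features         # +1 for each output neuron's bias

total = (conv_params(3, 1, 32)          # Conv1:     320
         + conv_params(3, 32, 64)       # Conv2:  18,496
         + fc_params(7 * 7 * 64, 128)   # FC1:   401,536
         + fc_params(128, 10))          # FC2:     1,290
print(total)                            # 421642, matching the count above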
Let us review the forward pass of a convolutional layer below.
# Convolution
Z = conv2d(Input, Filter) + bias
# Activation
A = ReLU(Z)
# Pooling
P = max_pool(A)
How do we compute gradients through the convolution operation?
Answer: The chain rule, but applied to shared weights!
For a standard layer: $Z = W \cdot X + b$,
the gradient is:
\[\frac{\partial L}{\partial W} = X^T \cdot \frac{\partial L}{\partial Z}\]
For convolution, it’s similar, but we must account for weight sharing (the same filter is applied at every spatial position) and the sliding-window structure of the operation.
So, the gradient for the filter is the convolution of the input with the gradient from the next layer.
Mathematically: \(\frac{\partial L}{\partial K} = I * \frac{\partial L}{\partial O}\)
where $I$ is the input, $O$ is the output feature map, and $K$ is the filter.
Gradient flowing backward (to previous layer): \(\frac{\partial L}{\partial I} = \frac{\partial L}{\partial O} * K_{\text{flipped}}\)
The filter is flipped (rotated 180°) for the backward pass.
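The NumPy implementation later in this article only includes forward passes, so here is an illustrative single-channel, stride-1, no-padding sketch of exactly these two gradient formulas (helper names are mine):
import numpy as np

def conv2d_valid(x, k):
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)
    return out

def conv2d_backward(x, k, d_out):
    """Gradients for a single-channel, stride-1, valid convolution."""
    # dL/dK: correlate the input with the upstream gradient
    d_k = conv2d_valid(x, d_out)
    # dL/dI: 'full' convolution of the upstream gradient with the 180°-flipped filter
    pad = k.shape[0] - 1
    d_out_padded = np.pad(d_out, pad)
    d_x = conv2d_valid(d_out_padded, np.flip(k))
    return d_k, d_x

x = np.random.randn(5, 5)
k = np.random.randn(3, 3)
d_out = np.ones((3, 3))          # stand-in for the gradient from the next layer
d_k, d_x = conv2d_backward(x, k, d_out)
print(d_k.shape, d_x.shape)      # (3, 3) (5, 5): same shapes as the filter and the input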
Max Pooling Gradient:
In the forward pass we record which position held the maximum:
Input (4×4): Max Pool Output (2×2): Max Positions:
[1 3 2 4] [6 8] [(1,1) (1,3)]
[5 6 7 8] → [9 11] [(2,0) (3,2)]
[9 2 1 3]
[4 5 11 2]
So, in the backward pass, the gradient flows only to the position that held the maximum; all other positions get zero gradient. This makes sense intuitively: only the maximum value “mattered” in the forward pass, so only it receives gradient in the backward pass.
Gradient from next layer: Gradient to previous layer:
[2 5] [0 0 0 0]
[3 7] → [0 2 0 5]
[3 0 0 0]
[0 0 7 0]
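In code, a minimal NumPy sketch of this routing (assuming non-overlapping 2×2 windows; the helper name is mine) reproduces the example above:
import numpy as np

def max_pool2d_backward(x, d_out, pool=2, stride=2):
    """Route each upstream gradient to the position that held its window's maximum."""
    d_x = np.zeros_like(x)
    for i in range(d_out.shape[0]):
        for j in range(d_out.shape[1]):
            window = x[i*stride:i*stride+pool, j*stride:j*stride+pool]
            mask = (window == window.max())   # 1 at the argmax, 0 elsewhere (ties share the gradient)
            d_x[i*stride:i*stride+pool, j*stride:j*stride+pool] += mask * d_out[i, j]
    return d_x

x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [9, 2, 1, 3],
              [4, 5, 11, 2]], dtype=float)
d_out = np.array([[2, 5],
                  [3, 7]], dtype=float)
print(max_pool2d_backward(x, d_out))
# [[ 0.  0.  0.  0.]
#  [ 0.  2.  0.  5.]
#  [ 3.  0.  0.  0.]
#  [ 0.  0.  7.  0.]]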
After computing all gradients:
\(K^{new} = K^{old} - \alpha \frac{\partial L}{\partial K}\) \(b^{new} = b^{old} - \alpha \frac{\partial L}{\partial b}\)
Where $\alpha$ is the learning rate.
Let’s implement a complete CNN for MNIST digit classification using only NumPy to truly understand every operation.
import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple, List
import time
np.random.seed(42)
class Conv2D:
"""
2D Convolutional Layer
Applies learnable filters to extract spatial features from input.
"""
def __init__(
self,
in_channels: int,
out_channels: int,
kernel_size: int = 3,
stride: int = 1,
padding: int = 0
):
"""
Initialize convolutional layer.
Args:
in_channels: Number of input channels
out_channels: Number of output channels (filters)
kernel_size: Size of the convolutional kernel (assumed square)
stride: Stride of convolution
padding: Amount of zero padding
"""
self.in_channels = in_channels
self.out_channels = out_channels
self.kernel_size = kernel_size
self.stride = stride
self.padding = padding
# He initialization for ReLU
self.weights = np.random.randn(
out_channels, in_channels, kernel_size, kernel_size
) * np.sqrt(2.0 / (in_channels * kernel_size * kernel_size))
self.bias = np.zeros((out_channels, 1))
# Cache for backpropagation
self.cache = {}
def forward(self, input: np.ndarray) -> np.ndarray:
"""
Forward pass: perform convolution.
Args:
input: Input tensor of shape (batch, in_channels, height, width)
Returns:
output: Convolved output of shape (batch, out_channels, out_h, out_w)
"""
batch_size, _, h, w = input.shape
# Add padding if specified
if self.padding > 0:
input = np.pad(
input,
((0, 0), (0, 0), (self.padding, self.padding), (self.padding, self.padding)),
mode='constant'
)
# Calculate output dimensions
out_h = (h + 2 * self.padding - self.kernel_size) // self.stride + 1
out_w = (w + 2 * self.padding - self.kernel_size) // self.stride + 1
# Initialize output
output = np.zeros((batch_size, self.out_channels, out_h, out_w))
# Perform convolution
for b in range(batch_size):
for f in range(self.out_channels):
for i in range(out_h):
for j in range(out_w):
# Extract region
h_start = i * self.stride
h_end = h_start + self.kernel_size
w_start = j * self.stride
w_end = w_start + self.kernel_size
region = input[b, :, h_start:h_end, w_start:w_end]
# Convolution: element-wise multiply and sum
output[b, f, i, j] = np.sum(region * self.weights[f]) + self.bias[f, 0]
# Cache for backward pass
self.cache['input'] = input
return output
def get_num_parameters(self) -> int:
"""Return total number of parameters."""
return self.weights.size + self.bias.size
class MaxPool2D:
"""
2D Max Pooling Layer
Downsamples input by taking maximum value in each window.
"""
def __init__(self, pool_size: int = 2, stride: int = 2):
"""
Initialize max pooling layer.
Args:
pool_size: Size of pooling window
stride: Stride of pooling operation
"""
self.pool_size = pool_size
self.stride = stride
self.cache = {}
def forward(self, input: np.ndarray) -> np.ndarray:
"""
Forward pass: perform max pooling.
Args:
input: Input tensor of shape (batch, channels, height, width)
Returns:
output: Pooled output
"""
batch_size, channels, h, w = input.shape
# Calculate output dimensions
out_h = (h - self.pool_size) // self.stride + 1
out_w = (w - self.pool_size) // self.stride + 1
# Initialize output
output = np.zeros((batch_size, channels, out_h, out_w))
# Perform max pooling
for b in range(batch_size):
for c in range(channels):
for i in range(out_h):
for j in range(out_w):
# Extract region
h_start = i * self.stride
h_end = h_start + self.pool_size
w_start = j * self.stride
w_end = w_start + self.pool_size
region = input[b, c, h_start:h_end, w_start:w_end]
# Take maximum
output[b, c, i, j] = np.max(region)
# Cache for backward pass
self.cache['input'] = input
return output
class ReLU:
"""ReLU activation function."""
def __init__(self):
self.cache = {}
def forward(self, input: np.ndarray) -> np.ndarray:
"""Apply ReLU: max(0, x)"""
self.cache['input'] = input
return np.maximum(0, input)
class Flatten:
"""Flatten multi-dimensional input to 2D (batch_size, features)."""
def __init__(self):
self.cache = {}
def forward(self, input: np.ndarray) -> np.ndarray:
"""Flatten all dimensions except batch."""
batch_size = input.shape[0]
self.cache['input_shape'] = input.shape
return input.reshape(batch_size, -1)
class Linear:
"""Fully connected linear layer."""
def __init__(self, in_features: int, out_features: int):
"""
Initialize linear layer.
Args:
in_features: Number of input features
out_features: Number of output features
"""
self.in_features = in_features
self.out_features = out_features
# He initialization
self.weights = np.random.randn(in_features, out_features) * np.sqrt(2.0 / in_features)
self.bias = np.zeros((1, out_features))
self.cache = {}
def forward(self, input: np.ndarray) -> np.ndarray:
"""Forward pass: linear transformation."""
self.cache['input'] = input
return input @ self.weights + self.bias
def get_num_parameters(self) -> int:
"""Return total number of parameters."""
return self.weights.size + self.bias.size
class SimpleCNN:
"""
Simple CNN for MNIST digit classification.
Architecture:
- Conv2D (1 -> 32 channels, 3x3)
- ReLU
- MaxPool (2x2)
- Conv2D (32 -> 64 channels, 3x3)
- ReLU
- MaxPool (2x2)
- Flatten
- Linear (64*7*7 -> 128)
- ReLU
- Linear (128 -> 10)
"""
def __init__(self):
"""Initialize CNN architecture."""
# Feature extraction layers
self.conv1 = Conv2D(in_channels=1, out_channels=32, kernel_size=3, padding=1)
self.relu1 = ReLU()
self.pool1 = MaxPool2D(pool_size=2, stride=2)
self.conv2 = Conv2D(in_channels=32, out_channels=64, kernel_size=3, padding=1)
self.relu2 = ReLU()
self.pool2 = MaxPool2D(pool_size=2, stride=2)
# Classification layers
self.flatten = Flatten()
self.fc1 = Linear(in_features=64*7*7, out_features=128)
self.relu3 = ReLU()
self.fc2 = Linear(in_features=128, out_features=10)
# Store layers for easy iteration
self.layers = [
self.conv1, self.relu1, self.pool1,
self.conv2, self.relu2, self.pool2,
self.flatten,
self.fc1, self.relu3,
self.fc2
]
def forward(self, x: np.ndarray) -> np.ndarray:
"""
Forward pass through the network.
Args:
x: Input images of shape (batch, 1, 28, 28)
Returns:
output: Class logits of shape (batch, 10)
"""
for layer in self.layers:
x = layer.forward(x)
return x
def predict(self, x: np.ndarray) -> np.ndarray:
"""
Make predictions.
Args:
x: Input images
Returns:
predictions: Predicted class indices
"""
logits = self.forward(x)
return np.argmax(logits, axis=1)
def get_num_parameters(self) -> int:
"""Calculate total number of trainable parameters."""
total = 0
for layer in [self.conv1, self.conv2, self.fc1, self.fc2]:
total += layer.get_num_parameters()
return total
def softmax(x: np.ndarray) -> np.ndarray:
"""Numerically stable softmax."""
exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
return exp_x / np.sum(exp_x, axis=1, keepdims=True)
def cross_entropy_loss(predictions: np.ndarray, targets: np.ndarray) -> float:
"""
Compute cross-entropy loss.
Args:
predictions: Model logits of shape (batch, classes)
targets: True labels of shape (batch,)
Returns:
loss: Average cross-entropy loss
"""
batch_size = predictions.shape[0]
probs = softmax(predictions)
# Select probability of correct class
correct_probs = probs[np.arange(batch_size), targets]
# Compute loss
loss = -np.mean(np.log(correct_probs + 1e-8))
return loss
def compute_accuracy(predictions: np.ndarray, targets: np.ndarray) -> float:
"""Compute classification accuracy."""
pred_classes = np.argmax(predictions, axis=1)
return np.mean(pred_classes == targets)
# Demonstration (would need actual training loop with backprop for full implementation)
def demonstrate_cnn():
"""
Demonstrate CNN forward pass.
Note: This is a simplified demonstration. A complete implementation
would include:
- Backward pass for all layers
- Optimizer (SGD, Adam, etc.)
- Training loop
- Data loading
"""
print("=" * 70)
print("CONVOLUTIONAL NEURAL NETWORK DEMONSTRATION")
print("=" * 70)
# Create model
model = SimpleCNN()
print(f"\nModel Architecture:")
print(f" Conv1: 1 → 32 channels, 3×3 kernel")
print(f" ReLU + MaxPool (2×2)")
print(f" Conv2: 32 → 64 channels, 3×3 kernel")
print(f" ReLU + MaxPool (2×2)")
print(f" Flatten")
print(f" FC1: 3136 → 128")
print(f" ReLU")
print(f" FC2: 128 → 10")
print(f"\nTotal Parameters: {model.get_num_parameters():,}")
# Create dummy input (batch of 4 MNIST images)
batch_size = 4
dummy_input = np.random.randn(batch_size, 1, 28, 28)
dummy_labels = np.array([0, 1, 2, 3])
print(f"\nInput Shape: {dummy_input.shape}")
# Forward pass
print("\nPerforming forward pass...")
start_time = time.time()
output = model.forward(dummy_input)
forward_time = time.time() - start_time
print(f"Output Shape: {output.shape}")
print(f"Forward pass time: {forward_time:.4f} seconds")
# Compute loss and accuracy
loss = cross_entropy_loss(output, dummy_labels)
accuracy = compute_accuracy(output, dummy_labels)
print(f"\nLoss: {loss:.4f}")
print(f"Accuracy: {accuracy:.4f}")
# Show predictions
predictions = model.predict(dummy_input)
print(f"\nPredictions: {predictions}")
print(f"True Labels: {dummy_labels}")
print("\n" + "=" * 70)
print("Note: This is an untrained network with random weights.")
print("In practice, you would train this network on MNIST data using")
print("backpropagation and gradient descent for 10-20 epochs to achieve")
print("98-99% accuracy.")
print("=" * 70)
if __name__ == "__main__":
demonstrate_cnn()
You now have a deep understanding of Convolutional Neural Networks, a powerful tool for computer vision tasks like image classification, object detection, and segmentation.
For hands-on practice, check out the companion notebooks - Understanding Convolutional Neural Networks