Dec 26, 2025
Continuing our Deep Learning Series, we’ll delve into the fascinating world of Convolutional Neural Networks (CNNs) in this article. Imagine you’re trying to identify a friend in a crowded photograph. Your brain doesn’t analyze every single pixel independently. Instead, it looks for distinctive patterns: perhaps the shape of their face, the color of their hair, or the style of their clothing. You naturally recognize these features regardless of where your friend stands in the photo or whether they’re slightly turned to the side. This hierarchical, translation-invariant way of processing visual information is exactly what CNNs were designed to replicate.
Let’s consider the simple task of recognizing a handwritten digit. To a human, distinguishing a 3 from an 8 seems trivial. But to a computer, an image is just a grid of numbers representing pixel intensities. A 28×28 grayscale image of a digit contains 784 numbers, and a 224×224 color image contains 150,528 numbers!
The challenges multiply when we consider:
Variation in Appearance: The same object can look vastly different depending on lighting, viewpoint, scale, pose, and partial occlusion.
Spatial Structure: Pixels aren’t independent; nearby pixels are highly correlated. A wheel makes sense in the context of a car, and eyes make sense in the context of a face.
Convolutional Neural Networks (CNNs), inspired by the visual cortex of animals, were specifically designed to address these challenges. The key insight, dating back to Hubel and Wiesel’s Nobel Prize-winning research beginning in 1959, was that neurons in the visual cortex have local receptive fields: they respond to stimuli only in restricted regions of the visual field. Later, Yann LeCun’s LeNet-5 (1998), designed for handwritten digit recognition, demonstrated that this biological inspiration could be translated into a powerful machine learning architecture. Today, CNNs power everything from facial recognition in your phone to autonomous vehicles and medical image analysis.
Let’s see what happens if we try to use a traditional fully connected neural network for image classification:
Problem 1: Parameter Explosion
For a tiny 28×28 grayscale image with 100 hidden neurons, the first fully connected layer alone needs 784 × 100 weights plus 100 biases = 78,500 parameters.
And for a more realistic 224×224 RGB image, that same layer needs 150,528 × 100 + 100 ≈ 15 million parameters.
This is unsustainable and leads to overfitting, enormous memory requirements, and slow training. The quick calculation below makes the numbers concrete.
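Here’s that back-of-the-envelope count in plain Python (the helper name is just for illustration):
def first_layer_params(height, width, channels, hidden_neurons=100):
    # (inputs × hidden neurons) weights, plus one bias per hidden neuron
    inputs = height * width * channels
    return inputs * hidden_neurons + hidden_neurons

print(first_layer_params(28, 28, 1))    # 78500 parameters for a 28×28 grayscale image
print(first_layer_params(224, 224, 3))  # 15052900 parameters (≈15 million) for 224×224 RGB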
Problem 2: Loss of Spatial Structure
When we flatten an image into a vector, we destroy the 2D spatial structure:
# Original image: 28x28 with spatial relationships (7-pixel-wide excerpt shown)
image = [
[0, 0, 5, 8, 5, 0, 0], # Row 1: Top of digit
[0, 3, 9, 9, 9, 3, 0], # Row 2: Upper part
[5, 9, 5, 0, 5, 9, 5], # Row 3: Middle
...
]
# After flattening: 784-length vector, no spatial info
flattened = [0, 0, 5, 8, 5, 0, 0, 0, 3, 9, 9, 9, 3, 0, 5, 9, 5, ...]
Now, the pixels that were neighbors are treated no differently than pixels from opposite corners. The network must relearn spatial relationships from scratch!
Problem 3: No Translation Invariance
A fully connected network learns that “edge at position (10, 15)” is important, but it is incapable of recognizing that the same edge at position (11, 16) represents the same feature. The network must learn the same pattern separately for every possible position.
CNNs solve all three problems elegantly through local connectivity (each neuron looks at a small region), weight sharing (the same filter is reused at every position), and pooling (which builds in tolerance to small translations).
Convolution is a mathematical operation that combines two functions to produce a third function. In the context of CNNs, we convolve an image with a small filter (also called a kernel) to detect specific features.
Imagine you have a magnifying glass (the filter) and you’re systematically scanning across a page (the image), looking for a specific pattern (like vertical edges). You slide the glass across the page one small region at a time, note how strongly each region matches the pattern, and record that score before moving on.
That’s exactly how the convolution operation works!
For a 2D image $I$ and a filter $K$, the convolution operation is:
\[(I * K)(i, j) = \sum_m \sum_n I(i+m, j+n) \cdot K(m, n)\]
where $I(i+m, j+n)$ is the image value under the filter and $K(m, n)$ is the corresponding filter weight. (Strictly speaking this is cross-correlation; true convolution flips the kernel, but deep learning libraries use this form and still call it convolution.)
Let’s detect vertical edges in a simple image using a 3×3 filter.
Input Image (7×7):
10 10 10 0 0 0 0
10 10 10 0 0 0 0
10 10 10 0 0 0 0
10 10 10 0 0 0 0
10 10 10 0 0 0 0
10 10 10 0 0 0 0
10 10 10 0 0 0 0
This image has a clear vertical edge in the middle (transition from 10 to 0).
Vertical Edge Detection Filter (3×3):
1 0 -1
1 0 -1
1 0 -1
How Convolution Works:
Step 1: Place filter at top-left (position 0,0):
Filter over image:
10 10 10
10 10 10
10 10 10
Computation:
(10×1) + (10×0) + (10×-1) +
(10×1) + (10×0) + (10×-1) +
(10×1) + (10×0) + (10×-1) = 0
Step 2: Slide one position right (position 0,1):
Filter over image:
10 10 0
10 10 0
10 10 0
Computation:
(10×1) + (10×0) + (0×-1) +
(10×1) + (10×0) + (0×-1) +
(10×1) + (10×0) + (0×-1) = 30
The high value (30) indicates we found an edge! We repeat this across the entire image.
Output Feature Map (5×5, due to valid convolution):
0 30 30 0 0
0 30 30 0 0
0 30 30 0 0
0 30 30 0 0
0 30 30 0 0
The bright values highlight where the vertical edge is located!
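If you want to verify this worked example yourself, here is a minimal NumPy sketch of valid cross-correlation (the operation deep learning calls “convolution”); the function name is mine, not from any library:
import numpy as np

def conv2d_valid(image, kernel):
    """Valid cross-correlation: slide the kernel, multiply element-wise, and sum."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.array([[10, 10, 10, 0, 0, 0, 0]] * 7, dtype=float)  # the 7×7 image above
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)                   # vertical edge filter
print(conv2d_valid(image, kernel))
# Every row comes out as [ 0. 30. 30.  0.  0.], matching the feature map above.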
Different filters detect different features:
Horizontal Edge Detection:
1 1 1
0 0 0
-1 -1 -1
Diagonal Edge Detection:
2 1 0
1 0 -1
0 -1 -2
Blur (Average):
1/9 1/9 1/9
1/9 1/9 1/9
1/9 1/9 1/9
Sharpen:
0 -1 0
-1 5 -1
0 -1 0
The magic of CNNs: we don’t hand-design these filters; the network learns them during training!
Padding: Adding zeros around the image border
Why padding?
Without padding: information at the corners is used less.
With padding: all positions are treated more equally.
Stride: How many pixels to move the filter at each step.
Output size formula: \(\text{output size} = \left\lfloor \frac{\text{input size} + 2 \times \text{padding} - \text{filter size}}{\text{stride}} \right\rfloor + 1\)
Example: a 28×28 input with a 3×3 filter, padding 1, and stride 1 gives ⌊(28 + 2×1 − 3)/1⌋ + 1 = 28, so the spatial size is preserved.
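Here’s an illustrative helper for that formula (not from any framework):
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    # Floor division implements the floor in the formula above
    return (input_size + 2 * padding - filter_size) // stride + 1

print(conv_output_size(28, 3, padding=1, stride=1))  # 28: "same" spatial size preserved
print(conv_output_size(7, 3, padding=0, stride=1))   # 5: the edge-detection example above
print(conv_output_size(28, 3, padding=0, stride=2))  # 13: stride 2 roughly halves the size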
Purpose: Extract spatial features from input
How it works: each filter slides across the input, computes a dot product with every local region, and produces one feature map per filter.
Parameters: the number of filters, the kernel size, the stride, and the padding.
Example:
Input: 32 × 32 × 3 (RGB image)
Filter: 64 filters of size 3×3×3
Output: 32 × 32 × 64 (64 feature maps)
Parameters: (3×3×3 + 1 bias) × 64 = 1,792
What happens: each of the 64 filters spans the full depth of the input (3 channels), slides over every spatial position, and produces its own 32×32 feature map; stacking them gives the 32×32×64 output.
Purpose: Introduce non-linearity
Formula: $\text{ReLU}(x) = \max(0, x)$
Why it’s crucial: without a non-linearity, any stack of convolutions collapses into a single linear operation, so the network could never learn complex patterns. ReLU is cheap to compute and avoids the vanishing gradients of saturating activations.
Applied element-wise:
Before ReLU: [-2.5, 0.3, -0.1, 4.2]
After ReLU: [ 0.0, 0.3, 0.0, 4.2] # Negative values zeroed out
Purpose: shrink the spatial dimensions, cut computation, and make the representation more tolerant to small translations.
Max Pooling (most common):
Example with 2×2 window, stride 2:
Input (4×4): Output (2×2):
[1 3 2 4] [6 8]
[5 6 7 8] → [9 11]
[9 2 1 3]
[4 5 11 2]
Each 2×2 region is replaced by its maximum value.
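Here’s a minimal NumPy sketch of that operation (2×2 window, stride 2); the helper name is mine:
import numpy as np

def max_pool2d(x, pool=2, stride=2):
    out_h = (x.shape[0] - pool) // stride + 1
    out_w = (x.shape[1] - pool) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Take the maximum of each pooling window
            out[i, j] = x[i*stride:i*stride+pool, j*stride:j*stride+pool].max()
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [9, 2, 1, 3],
              [4, 5, 11, 2]], dtype=float)
print(max_pool2d(x))  # [[ 6.  8.] [ 9. 11.]]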
Why max pooling works: it keeps the strongest response in each neighborhood, so a feature is still detected even if it shifts by a pixel or two, and the smaller maps are cheaper for later layers to process.
Average Pooling: Takes the average instead of maximum. Less common but used in some architectures (e.g., global average pooling).
We use Batch Normalization to stabilize and accelerate training. During training, the distribution of layer inputs changes as previous layers’ weights update; this internal covariate shift slows training. For each mini-batch, we normalize the activations:
\[\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}\]
where $\mu_B$ and $\sigma_B^2$ are the mini-batch mean and variance and $\epsilon$ is a small constant for numerical stability.
Then apply learnable scale ($\gamma$) and shift ($\beta$):
\[y = \gamma \hat{x} + \beta\]
Benefits: faster and more stable training, tolerance to higher learning rates, reduced sensitivity to weight initialization, and a mild regularization effect.
You can read about Batch Normalization in detail here
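As a rough sketch of the idea (training-time statistics only, ignoring the running averages used at inference and the per-channel handling in conv layers), batch normalization over a mini-batch might look like this in NumPy:
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Minimal training-time batch norm over the batch axis; x has shape (batch, features)."""
    mu = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                    # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to roughly zero mean, unit variance
    return gamma * x_hat + beta            # learnable scale and shift

x = np.random.randn(32, 4) * 10 + 5       # badly scaled activations
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ≈ 0 and ≈ 1 per feature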
We use Dropout to prevent overfitting. During training we randomly “drop” (set to zero) a fraction of neurons, with the dropout rate typically 0.2–0.5 (20%–50% of neurons dropped). Why it works is intuitive: it’s like studying for an exam while randomly covering different parts of your notes each time; you build a more robust understanding rather than memorizing specific patterns.
Benefits: it prevents neurons from co-adapting, acts like training an ensemble of smaller networks, and usually improves generalization.
You can read about Dropout in detail here
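A minimal sketch of inverted dropout in NumPy (the scaling by 1/(1 − rate) at training time means nothing needs to change at inference; the helper name is mine):
import numpy as np

def dropout_forward(x, rate=0.5, training=True):
    """Inverted dropout: zero a random fraction of units and rescale the survivors."""
    if not training or rate == 0.0:
        return x
    mask = (np.random.rand(*x.shape) >= rate) / (1.0 - rate)
    return x * mask

a = np.ones((2, 8))
print(dropout_forward(a, rate=0.5))        # roughly half the entries zeroed, the rest scaled to 2.0
print(dropout_forward(a, training=False))  # unchanged at inference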
Purpose: the final layers make predictions from the extracted features. They sit at the end of the network, take the flattened feature maps from the convolutional layers, and combine those features into class scores. The final layer has one neuron per class.
Example:
Input: 7 × 7 × 512 = 25,088 features (flattened)
Hidden: 4096 neurons (fully connected)
Output: 1000 neurons (one per class, e.g., ImageNet)
Parameters: 25,088 × 4096 ≈ 100 million weights!
This is why modern architectures use Global Average Pooling to reduce parameters.
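To see why, compare the parameter count of a flatten-plus-FC head with a global-average-pooling head (numbers follow the example above; this is a sketch, not a full model):
import numpy as np

# Feature maps from the last conv block: (batch, channels, height, width)
features = np.random.randn(4, 512, 7, 7)

# Classic flatten + FC to 4096 hidden units (VGG-style head)
flatten_params = (512 * 7 * 7) * 4096 + 4096   # ≈ 102.8 million weights

# Global average pooling: one value per channel, then FC straight to 1000 classes
gap = features.mean(axis=(2, 3))               # shape (4, 512)
gap_params = 512 * 1000 + 1000                 # ≈ 0.5 million weights

print(gap.shape, flatten_params, gap_params)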
CNNs automatically learn a hierarchy of increasingly complex features:
Layer 1 (Early Layers): Simple features such as edges, corners, and color blobs
Layer 2-3 (Middle Layers): Complex features such as textures, simple shapes, and repeated patterns
Layer 4-5 (Deep Layers): High-level features such as object parts (wheels, eyes, faces) and entire objects
Example: Face Recognition
Input Image (face photo)
↓
Layer 1: Detects edges and gradients
[Recognizes: "vertical lines", "curves", "color transitions"]
↓
Layer 2: Combines edges into simple shapes
[Recognizes: "circular shapes", "parallel lines"]
↓
Layer 3: Recognizes face parts
[Recognizes: "eyes", "nose", "mouth shapes"]
↓
Layer 4: Recognizes spatial arrangements
[Recognizes: "two eyes above a nose above a mouth"]
↓
Output: "This is a face!"
Intuitively, each layer “sees” a larger region of the original image:
Layer 1 (3×3 filter): a 3×3 region = 9 pixels
Layer 2 (3×3 filter): a 5×5 region = 25 pixels (after layer 1)
Layer 3 (3×3 filter): a 7×7 region = 49 pixels (after layers 1-2)
...with pooling in between: grows even faster
Deep layers: effectively the entire image
This allows deep layers to integrate information from the full image context.
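One common way to compute this growth is the standard receptive-field recurrence; the helper below is an illustrative sketch (names are mine), assuming a stack of layers given as (kernel_size, stride) pairs:
def receptive_field(layers):
    """Receptive field of a stack of layers given as (kernel_size, stride) pairs."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the field by (k-1) input-pixel jumps
        jump *= s              # stride multiplies how far one output step moves in the input
    return rf

print(receptive_field([(3, 1)]))                    # 3  -> 3×3 region
print(receptive_field([(3, 1), (3, 1)]))            # 5  -> 5×5 region
print(receptive_field([(3, 1), (3, 1), (3, 1)]))    # 7  -> 7×7 region
print(receptive_field([(3, 1), (2, 2), (3, 1)]))    # 8  -> a stride-2 pool makes later layers grow faster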
Now, let’s look at a complete CNN architecture. It’s a very common structure for image classification tasks.
Input Image (224×224×3)
↓
[CONV → ReLU → CONV → ReLU → POOL] × N (Feature extraction)
↓
[CONV → ReLU → CONV → ReLU → POOL] × M (Higher-level features)
↓
[CONV → ReLU → CONV → ReLU → POOL] × K (Abstract features)
↓
Flatten
↓
[FC → ReLU → Dropout] × L (Classification)
↓
FC (Output layer)
↓
Softmax
↓
Class Probabilities
Input: 28×28×1 (grayscale)
Conv1: 32 filters, 3×3, padding=1, stride=1
Output: 28×28×32
ReLU activation
MaxPool1: 2×2, stride=2
Output: 14×14×32
Conv2: 64 filters, 3×3, padding=1, stride=1
Output: 14×14×64
ReLU activation
MaxPool2: 2×2, stride=2
Output: 7×7×64
Flatten: 7×7×64 = 3,136 features
FC1: 128 neurons
ReLU activation
Dropout (0.5)
FC2 (Output): 10 neurons (one per digit)
Softmax activation
Total Parameters:
Conv1: (3×3×1 + 1) × 32 = 320
Conv2: (3×3×32 + 1) × 64 = 18,496
FC1: (3,136 + 1) × 128 = 401,536
FC2: (128 + 1) × 10 = 1,290
Total: 421,642 parameters
This architecture balances depth and width to effectively learn hierarchical features from MNIST digits.
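You can reproduce these counts with a couple of lines of Python (helper names are mine):
def conv_params(kernel, in_ch, out_ch):
    return (kernel * kernel * in_ch + 1) * out_ch   # +1 for each filter's bias

def fc_params(in_features, out_features):
    return (in_features + 1) * out_features         # +1 for each output neuron's bias

total = (conv_params(3, 1, 32)          # Conv1:     320
         + conv_params(3, 32, 64)       # Conv2:  18,496
         + fc_params(7 * 7 * 64, 128)   # FC1:   401,536
         + fc_params(128, 10))          # FC2:     1,290
print(total)                            # 421642, matching the count above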
Let us review the forward pass of a convolutional layer below.
# Convolution
Z = conv2d(Input, Filter) + bias
# Activation
A = ReLU(Z)
# Pooling
P = max_pool(A)
How do we compute gradients through the convolution operation?
Answer: The chain rule, but applied to shared weights!
For a standard layer: $Z = W \cdot X + b$,
the gradient is:
\[\frac{\partial L}{\partial W} = X^T \cdot \frac{\partial L}{\partial Z}\]
For convolution, it’s similar, but we must account for weight sharing (the same filter is applied at every spatial position) and the sliding-window structure of the operation.
So, the gradient for the filter is the convolution of the input with the gradient from the next layer.
Mathematically: \(\frac{\partial L}{\partial K} = I * \frac{\partial L}{\partial O}\)
where $I$ is the input, $O$ is the output feature map, and $K$ is the filter.
Gradient flowing backward (to previous layer): \(\frac{\partial L}{\partial I} = \frac{\partial L}{\partial O} * K_{\text{flipped}}\)
The filter is flipped (rotated 180°) for the backward pass.
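The NumPy implementation later in this article only includes forward passes, so here is an illustrative single-channel, stride-1, no-padding sketch of exactly these two gradient formulas (helper names are mine):
import numpy as np

def conv2d_valid(x, k):
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)
    return out

def conv2d_backward(x, k, d_out):
    """Gradients for a single-channel, stride-1, valid convolution."""
    # dL/dK: correlate the input with the upstream gradient
    d_k = conv2d_valid(x, d_out)
    # dL/dI: 'full' convolution of the upstream gradient with the 180°-flipped filter
    pad = k.shape[0] - 1
    d_out_padded = np.pad(d_out, pad)
    d_x = conv2d_valid(d_out_padded, np.flip(k))
    return d_k, d_x

x = np.random.randn(5, 5)
k = np.random.randn(3, 3)
d_out = np.ones((3, 3))          # stand-in for the gradient from the next layer
d_k, d_x = conv2d_backward(x, k, d_out)
print(d_k.shape, d_x.shape)      # (3, 3) (5, 5): same shapes as the filter and the input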
Max Pooling Gradient:
In the forward pass we record which position held the maximum:
Input (4×4): Max Pool Output (2×2): Max Positions:
[1 3 2 4] [6 8] [(1,1) (1,3)]
[5 6 7 8] → [9 11] [(2,0) (3,2)]
[9 2 1 3]
[4 5 11 2]
So, in the backward pass, the gradient flows only to the position that held the maximum; all other positions get zero gradient. This makes sense intuitively: only the maximum value “mattered” in the forward pass, so only it receives gradient in the backward pass.
Gradient from next layer: Gradient to previous layer:
[2 5] [0 0 0 0]
[3 7] → [0 2 0 5]
[3 0 0 0]
[0 0 7 0]
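In code, a minimal NumPy sketch of this routing (assuming non-overlapping 2×2 windows; the helper name is mine) reproduces the example above:
import numpy as np

def max_pool2d_backward(x, d_out, pool=2, stride=2):
    """Route each upstream gradient to the position that held its window's maximum."""
    d_x = np.zeros_like(x)
    for i in range(d_out.shape[0]):
        for j in range(d_out.shape[1]):
            window = x[i*stride:i*stride+pool, j*stride:j*stride+pool]
            mask = (window == window.max())   # 1 at the argmax, 0 elsewhere (ties share the gradient)
            d_x[i*stride:i*stride+pool, j*stride:j*stride+pool] += mask * d_out[i, j]
    return d_x

x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [9, 2, 1, 3],
              [4, 5, 11, 2]], dtype=float)
d_out = np.array([[2, 5],
                  [3, 7]], dtype=float)
print(max_pool2d_backward(x, d_out))
# [[ 0.  0.  0.  0.]
#  [ 0.  2.  0.  5.]
#  [ 3.  0.  0.  0.]
#  [ 0.  0.  7.  0.]]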
After computing all gradients:
\(K^{new} = K^{old} - \alpha \frac{\partial L}{\partial K}\) \(b^{new} = b^{old} - \alpha \frac{\partial L}{\partial b}\)
Where $\alpha$ is the learning rate.
Let’s implement a complete CNN for MNIST digit classification using only NumPy to truly understand every operation.
import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple, List
import time
np.random.seed(42)
class Conv2D:
"""
2D Convolutional Layer
Applies learnable filters to extract spatial features from input.
"""
def __init__(
self,
in_channels: int,
out_channels: int,
kernel_size: int = 3,
stride: int = 1,
padding: int = 0
):
"""
Initialize convolutional layer.
Args:
in_channels: Number of input channels
out_channels: Number of output channels (filters)
kernel_size: Size of the convolutional kernel (assumed square)
stride: Stride of convolution
padding: Amount of zero padding
"""
self.in_channels = in_channels
self.out_channels = out_channels
self.kernel_size = kernel_size
self.stride = stride
self.padding = padding
# He initialization for ReLU
self.weights = np.random.randn(
out_channels, in_channels, kernel_size, kernel_size
) * np.sqrt(2.0 / (in_channels * kernel_size * kernel_size))
self.bias = np.zeros((out_channels, 1))
# Cache for backpropagation
self.cache = {}
def forward(self, input: np.ndarray) -> np.ndarray:
"""
Forward pass: perform convolution.
Args:
input: Input tensor of shape (batch, in_channels, height, width)
Returns:
output: Convolved output of shape (batch, out_channels, out_h, out_w)
"""
batch_size, _, h, w = input.shape
# Add padding if specified
if self.padding > 0:
input = np.pad(
input,
((0, 0), (0, 0), (self.padding, self.padding), (self.padding, self.padding)),
mode='constant'
)
# Calculate output dimensions
out_h = (h + 2 * self.padding - self.kernel_size) // self.stride + 1
out_w = (w + 2 * self.padding - self.kernel_size) // self.stride + 1
# Initialize output
output = np.zeros((batch_size, self.out_channels, out_h, out_w))
# Perform convolution
for b in range(batch_size):
for f in range(self.out_channels):
for i in range(out_h):
for j in range(out_w):
# Extract region
h_start = i * self.stride
h_end = h_start + self.kernel_size
w_start = j * self.stride
w_end = w_start + self.kernel_size
region = input[b, :, h_start:h_end, w_start:w_end]
# Convolution: element-wise multiply and sum
output[b, f, i, j] = np.sum(region * self.weights[f]) + self.bias[f, 0]
# Cache for backward pass
self.cache['input'] = input
return output
def get_num_parameters(self) -> int:
"""Return total number of parameters."""
return self.weights.size + self.bias.size
class MaxPool2D:
"""
2D Max Pooling Layer
Downsamples input by taking maximum value in each window.
"""
def __init__(self, pool_size: int = 2, stride: int = 2):
"""
Initialize max pooling layer.
Args:
pool_size: Size of pooling window
stride: Stride of pooling operation
"""
self.pool_size = pool_size
self.stride = stride
self.cache = {}
def forward(self, input: np.ndarray) -> np.ndarray:
"""
Forward pass: perform max pooling.
Args:
input: Input tensor of shape (batch, channels, height, width)
Returns:
output: Pooled output
"""
batch_size, channels, h, w = input.shape
# Calculate output dimensions
out_h = (h - self.pool_size) // self.stride + 1
out_w = (w - self.pool_size) // self.stride + 1
# Initialize output
output = np.zeros((batch_size, channels, out_h, out_w))
# Perform max pooling
for b in range(batch_size):
for c in range(channels):
for i in range(out_h):
for j in range(out_w):
# Extract region
h_start = i * self.stride
h_end = h_start + self.pool_size
w_start = j * self.stride
w_end = w_start + self.pool_size
region = input[b, c, h_start:h_end, w_start:w_end]
# Take maximum
output[b, c, i, j] = np.max(region)
# Cache for backward pass
self.cache['input'] = input
return output
class ReLU:
"""ReLU activation function."""
def __init__(self):
self.cache = {}
def forward(self, input: np.ndarray) -> np.ndarray:
"""Apply ReLU: max(0, x)"""
self.cache['input'] = input
return np.maximum(0, input)
class Flatten:
"""Flatten multi-dimensional input to 2D (batch_size, features)."""
def __init__(self):
self.cache = {}
def forward(self, input: np.ndarray) -> np.ndarray:
"""Flatten all dimensions except batch."""
batch_size = input.shape[0]
self.cache['input_shape'] = input.shape
return input.reshape(batch_size, -1)
class Linear:
"""Fully connected linear layer."""
def __init__(self, in_features: int, out_features: int):
"""
Initialize linear layer.
Args:
in_features: Number of input features
out_features: Number of output features
"""
self.in_features = in_features
self.out_features = out_features
# He initialization
self.weights = np.random.randn(in_features, out_features) * np.sqrt(2.0 / in_features)
self.bias = np.zeros((1, out_features))
self.cache = {}
def forward(self, input: np.ndarray) -> np.ndarray:
"""Forward pass: linear transformation."""
self.cache['input'] = input
return input @ self.weights + self.bias
def get_num_parameters(self) -> int:
"""Return total number of parameters."""
return self.weights.size + self.bias.size
class SimpleCNN:
"""
Simple CNN for MNIST digit classification.
Architecture:
- Conv2D (1 -> 32 channels, 3x3)
- ReLU
- MaxPool (2x2)
- Conv2D (32 -> 64 channels, 3x3)
- ReLU
- MaxPool (2x2)
- Flatten
- Linear (64*7*7 -> 128)
- ReLU
- Linear (128 -> 10)
"""
def __init__(self):
"""Initialize CNN architecture."""
# Feature extraction layers
self.conv1 = Conv2D(in_channels=1, out_channels=32, kernel_size=3, padding=1)
self.relu1 = ReLU()
self.pool1 = MaxPool2D(pool_size=2, stride=2)
self.conv2 = Conv2D(in_channels=32, out_channels=64, kernel_size=3, padding=1)
self.relu2 = ReLU()
self.pool2 = MaxPool2D(pool_size=2, stride=2)
# Classification layers
self.flatten = Flatten()
self.fc1 = Linear(in_features=64*7*7, out_features=128)
self.relu3 = ReLU()
self.fc2 = Linear(in_features=128, out_features=10)
# Store layers for easy iteration
self.layers = [
self.conv1, self.relu1, self.pool1,
self.conv2, self.relu2, self.pool2,
self.flatten,
self.fc1, self.relu3,
self.fc2
]
def forward(self, x: np.ndarray) -> np.ndarray:
"""
Forward pass through the network.
Args:
x: Input images of shape (batch, 1, 28, 28)
Returns:
output: Class logits of shape (batch, 10)
"""
for layer in self.layers:
x = layer.forward(x)
return x
def predict(self, x: np.ndarray) -> np.ndarray:
"""
Make predictions.
Args:
x: Input images
Returns:
predictions: Predicted class indices
"""
logits = self.forward(x)
return np.argmax(logits, axis=1)
def get_num_parameters(self) -> int:
"""Calculate total number of trainable parameters."""
total = 0
for layer in [self.conv1, self.conv2, self.fc1, self.fc2]:
total += layer.get_num_parameters()
return total
def softmax(x: np.ndarray) -> np.ndarray:
"""Numerically stable softmax."""
exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
return exp_x / np.sum(exp_x, axis=1, keepdims=True)
def cross_entropy_loss(predictions: np.ndarray, targets: np.ndarray) -> float:
"""
Compute cross-entropy loss.
Args:
predictions: Model logits of shape (batch, classes)
targets: True labels of shape (batch,)
Returns:
loss: Average cross-entropy loss
"""
batch_size = predictions.shape[0]
probs = softmax(predictions)
# Select probability of correct class
correct_probs = probs[np.arange(batch_size), targets]
# Compute loss
loss = -np.mean(np.log(correct_probs + 1e-8))
return loss
def compute_accuracy(predictions: np.ndarray, targets: np.ndarray) -> float:
"""Compute classification accuracy."""
pred_classes = np.argmax(predictions, axis=1)
return np.mean(pred_classes == targets)
# Demonstration (would need actual training loop with backprop for full implementation)
def demonstrate_cnn():
"""
Demonstrate CNN forward pass.
Note: This is a simplified demonstration. A complete implementation
would include:
- Backward pass for all layers
- Optimizer (SGD, Adam, etc.)
- Training loop
- Data loading
"""
print("=" * 70)
print("CONVOLUTIONAL NEURAL NETWORK DEMONSTRATION")
print("=" * 70)
# Create model
model = SimpleCNN()
print(f"\nModel Architecture:")
print(f" Conv1: 1 → 32 channels, 3×3 kernel")
print(f" ReLU + MaxPool (2×2)")
print(f" Conv2: 32 → 64 channels, 3×3 kernel")
print(f" ReLU + MaxPool (2×2)")
print(f" Flatten")
print(f" FC1: 3136 → 128")
print(f" ReLU")
print(f" FC2: 128 → 10")
print(f"\nTotal Parameters: {model.get_num_parameters():,}")
# Create dummy input (batch of 4 MNIST images)
batch_size = 4
dummy_input = np.random.randn(batch_size, 1, 28, 28)
dummy_labels = np.array([0, 1, 2, 3])
print(f"\nInput Shape: {dummy_input.shape}")
# Forward pass
print("\nPerforming forward pass...")
start_time = time.time()
output = model.forward(dummy_input)
forward_time = time.time() - start_time
print(f"Output Shape: {output.shape}")
print(f"Forward pass time: {forward_time:.4f} seconds")
# Compute loss and accuracy
loss = cross_entropy_loss(output, dummy_labels)
accuracy = compute_accuracy(output, dummy_labels)
print(f"\nLoss: {loss:.4f}")
print(f"Accuracy: {accuracy:.4f}")
# Show predictions
predictions = model.predict(dummy_input)
print(f"\nPredictions: {predictions}")
print(f"True Labels: {dummy_labels}")
print("\n" + "=" * 70)
print("Note: This is an untrained network with random weights.")
print("In practice, you would train this network on MNIST data using")
print("backpropagation and gradient descent for 10-20 epochs to achieve")
print("98-99% accuracy.")
print("=" * 70)
if __name__ == "__main__":
demonstrate_cnn()
You now have a deep understanding of Convolutional Neural Networks, a powerful tool for computer vision tasks like image classification, object detection, and segmentation.
For hands-on practice, check out the companion notebooks - Understanding Convolutional Neural Networks