
Deep Learning Series

by Mayank Sharma

Understanding Convolutional Neural Networks: From Pixels to Patterns

Dec 26, 2025

Continuing our Deep Learning Series, in this article we delve into the fascinating world of Convolutional Neural Networks (CNNs). Imagine you’re trying to identify a friend in a crowded photograph. Your brain doesn’t analyze every single pixel independently. Instead, it looks for distinctive patterns: perhaps the shape of their face, the color of their hair, or the style of their clothing. You naturally recognize these features regardless of where your friend stands in the photo or whether they’re slightly turned to the side. This hierarchical, translation-invariant way of processing visual information is exactly what Convolutional Neural Networks (CNNs) were designed to replicate.

Table of Contents

  1. Introduction: The Vision Problem
  2. The Fundamental Problem with Fully Connected Networks
  3. Convolution: The Core Operation
  4. Let’s look deeper into the building blocks of CNNs
  5. How CNNs Learn Visual Hierarchies
  6. Simple CNN Classification Architecture
  7. Let’s Understand Backpropagation Through Convolutions
  8. Final Step: Building a Complete CNN from Scratch
  9. Conclusion
  10. Jupyter Notebook

Introduction: The Vision Problem

Why Computer Vision is Hard

Let’s consider a simple task: recognizing a handwritten digit. To a human, distinguishing a 3 from an 8 seems trivial. But to a computer, an image is just a grid of numbers representing pixel intensities. A 28×28 grayscale image of a digit contains 784 numbers, and a 224×224 color image contains 150,528 numbers!

The challenges multiply when we consider:

Variation in Appearance: The same object can look vastly different depending on lighting, viewpoint, scale, deformation, and occlusion.

Spatial Structure: Pixels aren’t independent; nearby pixels are highly correlated. A wheel makes sense in the context of a car, and eyes make sense in the context of a face.

The Birth of CNNs

Convolutional Neural Networks (CNNs), inspired by the visual cortex of animals, were specifically designed to address these challenges. The key insight, dating back to Hubel and Wiesel’s Nobel Prize-winning research in 1959, was that neurons in the visual cortex have local receptive fields: they respond to stimuli in restricted regions of the visual field. Later, Yann LeCun’s LeNet-5 (1998), designed for handwritten digit recognition, demonstrated that this biological inspiration could be translated into a powerful machine learning architecture. Today, CNNs power everything from facial recognition in your phone to autonomous vehicles and medical image analysis.

The Fundamental Problem with Fully Connected Networks

Why Regular Neural Networks Fail for Images

Let’s see what happens if we try to use a traditional fully connected neural network for image classification:

Problem 1: Parameter Explosion

For a tiny 28×28 grayscale image with 100 hidden neurons, the first fully connected layer alone needs 784 × 100 = 78,400 weights (plus 100 biases).

For a more realistic 224×224 RGB image, that same layer needs 150,528 × 100 ≈ 15 million parameters, and that is just one hidden layer.

This is unsustainable: it wastes memory, slows training, and makes the network prone to overfitting.
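
To make these numbers concrete, here is a quick back-of-the-envelope calculation (plain Python; the helper name is just for illustration):

def fc_params(num_inputs: int, num_hidden: int) -> int:
    """Weights plus one bias per hidden neuron in a single fully connected layer."""
    return num_inputs * num_hidden + num_hidden

print(fc_params(28 * 28, 100))        # 78,500 parameters for a 28x28 grayscale input
print(fc_params(224 * 224 * 3, 100))  # 15,052,900 parameters for a 224x224 RGB input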

Problem 2: Loss of Spatial Structure

When we flatten an image into a vector, we destroy the 2D spatial structure:

# Original image: a 7x7 toy patch with spatial relationships
image = [
    [0, 0, 5, 8, 5, 0, 0],  # Row 1: Top of digit
    [0, 3, 9, 9, 9, 3, 0],  # Row 2: Upper part
    [5, 9, 5, 0, 5, 9, 5],  # Row 3: Middle
    ...
]

# After flattening: a 49-length vector, no spatial info
flattened = [0, 0, 5, 8, 5, 0, 0, 0, 3, 9, 9, 9, 3, 0, 5, 9, 5, ...]

Now, the pixels that were neighbors are treated no differently than pixels from opposite corners. The network must relearn spatial relationships from scratch!

Problem 3: No Translation Invariance

A fully connected network learns that “edge at position (10, 15)” is important, but it is incapable of recognizing that the same edge at position (11, 16) represents the same feature. The network must learn the same pattern separately for every possible position.

Enter our hero: Convolutional Neural Networks (CNNs)

CNNs solve all three problems elegantly:

  1. Parameter Sharing: Use the same small filter across the entire image
  2. Local Connectivity: Each neuron only looks at a small region (like biological neurons)
  3. Translation Invariance: Features are detected regardless of position

Convolution: The Core Operation

What is Convolution?

Convolution is a mathematical operation that combines two functions to produce a third function. In the context of CNNs, we convolve an image with a small filter (also called a kernel) to detect specific features.

The Intuition: Sliding Windows

Imagine you have a magnifying glass (the filter) and you’re systematically scanning across a page (the image), looking for a specific pattern (like vertical edges). How would you find it?

  1. Place the magnifying glass at the top-left corner
  2. Look at what’s underneath and compute a score for “how much does this match my pattern?”
  3. Slide the magnifying glass one pixel to the right
  4. Repeat across the entire image
  5. The collection of all these scores forms a new image (a feature map)

That’s how the convolution operation works!

Mathematical Definition

For a 2D image $I$ and a filter $K$, the convolution operation is:

\[(I * K)(i, j) = \sum_m \sum_n I(i+m, j+n) \cdot K(m, n)\]

Where $I(i+m, j+n)$ is the pixel value at offset $(m, n)$ from output position $(i, j)$, $K(m, n)$ is the filter weight, and the sums run over the filter’s height and width. (Strictly speaking this is cross-correlation; true convolution flips the filter first, but deep learning frameworks use this form, and the distinction doesn’t matter because the filters are learned anyway.)

Concrete Example: Edge Detection

Let’s detect vertical edges in a simple image using a 3×3 filter.

Input Image (7×7):

10  10  10  0   0   0   0
10  10  10  0   0   0   0
10  10  10  0   0   0   0
10  10  10  0   0   0   0
10  10  10  0   0   0   0
10  10  10  0   0   0   0
10  10  10  0   0   0   0

This image has a clear vertical edge in the middle (transition from 10 to 0).

Vertical Edge Detection Filter (3×3):

1   0  -1
1   0  -1
1   0  -1

How Convolution Works:

Step 1: Place filter at top-left (position 0,0):

Filter over image:
10  10  10
10  10  10
10  10  10

Computation:
(10×1) + (10×0) + (10×-1) +
(10×1) + (10×0) + (10×-1) +
(10×1) + (10×0) + (10×-1) = 0

Step 2: Slide one position right (position 0,1):

Filter over image:
10  10   0
10  10   0
10  10   0

Computation:
(10×1) + (10×0) + (0×-1) +
(10×1) + (10×0) + (0×-1) +
(10×1) + (10×0) + (0×-1) = 30

The high value (30) indicates we found an edge! We repeat this across the entire image.

Output Feature Map (5×5, due to valid convolution):

0   30  30  0   0
0   30  30  0   0
0   30  30  0   0
0   30  30  0   0
0   30  30  0   0

The bright values highlight where the vertical edge is located!
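
If you want to verify this yourself, here is a minimal NumPy sketch of the same valid convolution that reproduces the feature map above:

import numpy as np

image = np.array([[10, 10, 10, 0, 0, 0, 0]] * 7)  # 7x7 image with a vertical edge
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])                    # vertical edge detection filter

out_h, out_w = image.shape[0] - 2, image.shape[1] - 2  # 5x5 output (valid convolution)
feature_map = np.zeros((out_h, out_w))

for i in range(out_h):
    for j in range(out_w):
        region = image[i:i + 3, j:j + 3]             # 3x3 window under the filter
        feature_map[i, j] = np.sum(region * kernel)  # element-wise multiply and sum

print(feature_map)  # columns 1 and 2 contain 30, everything else is 0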

Different Types of Filters

Different filters detect different features:

Horizontal Edge Detection:

 1   1   1
 0   0   0
-1  -1  -1

Diagonal Edge Detection:

 2   1   0
 1   0  -1
 0  -1  -2

Blur (Average):

1/9  1/9  1/9
1/9  1/9  1/9
1/9  1/9  1/9

Sharpen:

 0  -1   0
-1   5  -1
 0  -1   0

The magic of CNNs: we don’t hand-design these filters; the network learns them during training!

Padding and Stride

Padding: Adding zeros around the image border.

Why padding? Without it, the output shrinks after every convolution, and border pixels appear in far fewer windows than central pixels. Padding preserves the spatial size and keeps border information.

Stride: How many pixels the filter moves at each step. A stride of 1 slides the filter one pixel at a time; larger strides downsample the output.

Output size formula: \(\text{output_size} = \left\lfloor \frac{\text{input_size} + 2 \times \text{padding} - \text{filter_size}}{\text{stride}} \right\rfloor + 1\)

Example: a 28×28 input with a 3×3 filter, padding 1, and stride 1 gives ⌊(28 + 2×1 − 3)/1⌋ + 1 = 28, so the spatial size is preserved.
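
As a sketch, the same formula as a tiny helper function (the example values are illustrative):

def conv_output_size(input_size: int, filter_size: int, padding: int = 0, stride: int = 1) -> int:
    """Spatial output size of a convolution, following the formula above."""
    return (input_size + 2 * padding - filter_size) // stride + 1

print(conv_output_size(28, 3, padding=1, stride=1))   # 28 ("same" padding preserves size)
print(conv_output_size(7, 3))                         # 5  (valid convolution, as in the edge example)
print(conv_output_size(224, 7, padding=3, stride=2))  # 112 (a typical aggressive first layer)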

Let’s look deeper into the building blocks of CNNs

1. Convolutional Layer

Purpose: Extract spatial features from input

How it works: each filter slides across the input, computing a dot product between its weights and the local region under it; the resulting scores form one feature map, and the feature maps from all filters are stacked along the channel dimension.

Parameters: the number of filters, the filter (kernel) size, the stride, and the padding; the learnable parameters are the filter weights plus one bias per filter.

Example:

Input:  32 × 32 × 3  (RGB image)
Filter: 64 filters of size 3×3×3
Output: 32 × 32 × 64 (64 feature maps)

Parameters: (3×3×3 + 1 bias) × 64 = 1,792

What happens: each of the 64 filters spans all 3 input channels, padding of 1 keeps the spatial size at 32×32, and every filter produces one feature map, giving a 32×32×64 output with only 1,792 parameters.

2. Activation Function (ReLU)

Purpose: Introduce non-linearity

Formula: $\text{ReLU}(x) = \max(0, x)$

Why it’s crucial: without a non-linearity, any stack of convolutions collapses into a single linear operation, no matter how deep the network is. ReLU is cheap to compute and avoids the saturation (and vanishing gradients) of sigmoid or tanh.

Applied element-wise:

Before ReLU: [-2.5, 0.3, -0.1, 4.2]
After ReLU:  [ 0.0, 0.3,  0.0, 4.2]  # Negative values zeroed out

3. Pooling Layer

Purpose: progressively reduce the spatial dimensions, which cuts computation and parameters in later layers and adds a small amount of translation invariance.

Max Pooling (most common):

Example with 2×2 window, stride 2:

Input (4×4):              Output (2×2):
[1  3  2  4]              [6   8]
[5  6  7  8]              [9  11]
[9  2  1  3]
[4  5  11 2]

Each 2×2 region is replaced by its maximum value.
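
A compact NumPy sketch of 2×2 max pooling with stride 2 (it assumes the input size divides evenly by the pool size):

import numpy as np

x = np.array([[1, 3,  2, 4],
              [5, 6,  7, 8],
              [9, 2,  1, 3],
              [4, 5, 11, 2]])

# Split the 4x4 input into non-overlapping 2x2 blocks and take each block's maximum
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[ 6  8]
               #  [ 9 11]]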

Why max pooling works: the maximum in each window records whether a feature was detected anywhere in that region, so small shifts in position don’t change the output, and only the strongest responses are kept.

Average Pooling: Takes the average instead of maximum. Less common but used in some architectures (e.g., global average pooling).

4. Batch Normalization

We use Batch Normalization to stabilize and accelerate training. During training, the distribution of each layer’s inputs changes as the previous layers’ weights update; this internal covariate shift slows training. For each mini-batch, we normalize the activations:

\[\hat{x} = \frac{x - \mu_{\text{batch}}}{\sqrt{\sigma_{\text{batch}}^2 + \epsilon}}\]

Then apply learnable scale ($\gamma$) and shift ($\beta$):

\[y = \gamma \hat{x} + \beta\]
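
A rough NumPy sketch of the forward computation (per-feature statistics over the mini-batch; in a real layer $\gamma$ and $\beta$ are learned and running averages are kept for inference):

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # learnable scale and shift

x = np.random.randn(32, 128) * 5 + 3       # a mini-batch of 32 examples, 128 features
y = batch_norm_forward(x, gamma=np.ones(128), beta=np.zeros(128))
print(y.mean(), y.std())                   # approximately 0 and 1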

Benefits: allows higher learning rates, reduces sensitivity to weight initialization, and acts as a mild regularizer.

You can read about Batch Normalization in detail here

5. Dropout

We use Dropout to prevent overfitting. During training, we randomly “drop” (set to zero) a fraction of the neurons; the dropout rate is typically 0.2–0.5 (20%–50% of neurons dropped). Why it works is simple: think of studying for an exam by randomly covering different parts of your notes each time; you develop a more robust understanding instead of memorizing specific patterns. A sketch follows below.
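
Here is a minimal sketch of inverted dropout (the common implementation): survivors are scaled by 1/(1−p) during training so nothing has to change at test time.

import numpy as np

def dropout_forward(x, p=0.5, training=True):
    """Randomly zero a fraction p of activations during training."""
    if not training:
        return x                                     # dropout is disabled at test time
    mask = (np.random.rand(*x.shape) > p) / (1 - p)  # keep with probability 1-p, rescale survivors
    return x * mask

activations = np.random.randn(4, 8)
print(dropout_forward(activations, p=0.5))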

Benefits: prevents neurons from co-adapting, behaves like training an ensemble of smaller networks, and improves generalization.

You can read about Dropout in detail here

6. Fully Connected Layer

Purpose: Make the final prediction from the extracted features. This layer typically sits at the end of the network: it takes the flattened feature maps from the convolutional layers and combines them into class scores, with one output neuron per class in the final layer.

Example:

Input:  7 × 7 × 512 = 25,088 features (flattened)
Hidden: 4096 neurons (fully connected)
Output: 1000 neurons (one per class, e.g., ImageNet)

Parameters: 25,088 × 4096 ≈ 100 million weights!

This is why modern architectures use Global Average Pooling to reduce parameters.
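
As a quick illustration of that trick, global average pooling collapses each feature map to a single number before the classifier (a sketch; the shapes are illustrative):

import numpy as np

feature_maps = np.random.randn(1, 512, 7, 7)  # (batch, channels, height, width)
gap = feature_maps.mean(axis=(2, 3))          # (1, 512): one value per channel
print(gap.shape)                              # the classifier then needs only 512 x num_classes weights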

How CNNs Learn Visual Hierarchies

The Hierarchical Feature Learning

CNNs automatically learn a hierarchy of increasingly complex features:

Layer 1 (Early Layers): Simple features such as edges, corners, and color blobs

Layer 2-3 (Middle Layers): More complex features such as textures and simple shapes built from those edges

Layer 4-5 (Deep Layers): High-level features such as object parts and whole objects

Example: Face Recognition

Input Image (face photo)
        ↓
Layer 1: Detects edges and gradients
        [Recognizes: "vertical lines", "curves", "color transitions"]
        ↓
Layer 2: Combines edges into simple shapes
        [Recognizes: "circular shapes", "parallel lines"]
        ↓
Layer 3: Recognizes face parts
        [Recognizes: "eyes", "nose", "mouth shapes"]
        ↓
Layer 4: Recognizes spatial arrangements
        [Recognizes: "two eyes above a nose above a mouth"]
        ↓
Output: "This is a face!"

Receptive Field Growth

Intuitively, each layer “sees” a larger region of the original image:

Layer 1 (3×3 filter):   3×3 = 9 pixels
Layer 2 (3×3 filter):   5×5 = 25 pixels of the original image
Layer 3 (3×3 filter):   7×7 = 49 pixels of the original image
...with pooling:        grows even faster
Deep layers:            eventually cover the entire image

This allows deep layers to integrate information from the full image context.
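
A small sketch of the standard recursive receptive-field calculation (the layer configurations below are illustrative):

def receptive_field(layers):
    """Receptive field (one side, in input pixels) given (kernel_size, stride) per layer."""
    rf, jump = 1, 1
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump  # each layer widens the field by (k-1) * current jump
        jump *= stride
    return rf

print(receptive_field([(3, 1)] * 3))                              # 7: three 3x3 convs see a 7x7 region
print(receptive_field([(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]))  # 18: interleaved pooling grows it faster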

Simple CNN Classification Architecture

Now let’s look at a complete CNN architecture; it’s a very common structure for image classification tasks.

Input Image (224×224×3)
        ↓
[CONV → ReLU → CONV → ReLU → POOL] × N   (Feature extraction)
        ↓
[CONV → ReLU → CONV → ReLU → POOL] × M   (Higher-level features)
        ↓
[CONV → ReLU → CONV → ReLU → POOL] × K   (Abstract features)
        ↓
Flatten
        ↓
[FC → ReLU → Dropout] × L                (Classification)
        ↓
FC (Output layer)
        ↓
Softmax
        ↓
Class Probabilities

Example: Small CNN for MNIST

Input: 28×28×1 (grayscale)

Conv1: 32 filters, 3×3, padding=1, stride=1
    Output: 28×28×32
    ReLU activation

MaxPool1: 2×2, stride=2
    Output: 14×14×32

Conv2: 64 filters, 3×3, padding=1, stride=1
    Output: 14×14×64
    ReLU activation

MaxPool2: 2×2, stride=2
    Output: 7×7×64

Flatten: 7×7×64 = 3,136 features

FC1: 128 neurons
    ReLU activation
    Dropout (0.5)

FC2 (Output): 10 neurons (one per digit)
    Softmax activation

Total Parameters:
    Conv1: (3×3×1 + 1) × 32 = 320
    Conv2: (3×3×32 + 1) × 64 = 18,496
    FC1: (3,136 + 1) × 128 = 401,536
    FC2: (128 + 1) × 10 = 1,290
    Total: 421,642 parameters

This architecture balances depth and width to effectively learn hierarchical features from MNIST digits.

Let’s Understand Backpropagation Through Convolutions

Let us review the forward pass of a convolutional layer below.

# Convolution
Z = conv2d(Input, Filter) + bias

# Activation
A = ReLU(Z)

# Pooling
P = max_pool(A)

Backward Pass: Computing Gradients

How do we compute gradients through the convolution operation?

Answer: The chain rule, but applied to shared weights!

Gradient for Convolution

For a standard fully connected layer $Z = XW + b$ (the row-vector convention used in our code below),

the gradient is:

\[\frac{\partial L}{\partial W} = X^T \cdot \frac{\partial L}{\partial Z}\]

For convolution, it’s similar but we must account for:

  1. Weight sharing: Same filter used at multiple positions
  2. Local connectivity: Each output depends on a local input region

So, the gradient for the filter is the convolution of the input with the gradient arriving from the next layer.

Mathematically: \(\frac{\partial L}{\partial K} = I * \frac{\partial L}{\partial O}\)

Where $I$ is the input, $K$ is the filter, $O$ is the output feature map, and $L$ is the loss.

Gradient flowing backward (to previous layer): \(\frac{\partial L}{\partial I} = \frac{\partial L}{\partial O} * K_{\text{flipped}}\)

The filter is flipped (rotated 180°) for the backward pass.
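
Assuming SciPy is available, here is a hedged single-channel sketch of both gradients; scipy.signal.correlate2d matches the sliding-window operation used in the forward pass:

import numpy as np
from scipy.signal import convolve2d, correlate2d

I = np.random.randn(7, 7)                    # input
K = np.random.randn(3, 3)                    # filter
O = correlate2d(I, K, mode='valid')          # forward pass: 5x5 feature map

dL_dO = np.random.randn(*O.shape)            # gradient arriving from the next layer

# Filter gradient: slide the upstream gradient over the input
dL_dK = correlate2d(I, dL_dO, mode='valid')  # shape (3, 3), same as K

# Input gradient: full convolution with the filter
# (equivalent to correlating with the 180-degree-rotated filter)
dL_dI = convolve2d(dL_dO, K, mode='full')    # shape (7, 7), same as I

print(dL_dK.shape, dL_dI.shape)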

Gradient for Pooling

Max Pooling Gradient:

In the forward pass, we record which position in each window held the maximum:

Input (4×4):          Max Pool Output (2×2):    Max Positions:
[1  3  2  4]          [6   8]                   [(1,1) (1,3)]
[5  6  7  8]          [9  11]                   [(2,0) (3,2)]
[9  2  1  3]
[4  5  11 2]

So, in the backward pass the gradient flows only to the position that held the maximum; all other positions get zero gradient. This makes sense intuitively: only the maximum value “mattered” in the forward pass, so only it receives gradient in the backward pass.

Gradient from next layer:    Gradient to previous layer:
[2  5]                        [0  0  0  0]
[3  7]                        [0  2  0  5]
                              [3  0  0  0]
                              [0  0  7  0]
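
A small NumPy sketch of this routing, using np.unravel_index to recover each window’s argmax position (variable names are illustrative):

import numpy as np

x = np.array([[1, 3,  2, 4],
              [5, 6,  7, 8],
              [9, 2,  1, 3],
              [4, 5, 11, 2]])
upstream = np.array([[2, 5],
                     [3, 7]])               # gradient arriving from the next layer (2x2)

grad_input = np.zeros_like(x, dtype=float)
for i in range(2):
    for j in range(2):
        window = x[2*i:2*i + 2, 2*j:2*j + 2]
        r, c = np.unravel_index(np.argmax(window), window.shape)  # where the max was
        grad_input[2*i + r, 2*j + c] = upstream[i, j]             # route gradient only there

print(grad_input)  # matches the "gradient to previous layer" grid above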

Parameter Updates

After computing all gradients:

\(K^{new} = K^{old} - \alpha \frac{\partial L}{\partial K}\) \(b^{new} = b^{old} - \alpha \frac{\partial L}{\partial b}\)

Where $\alpha$ is the learning rate.

Final Step: Building a Complete CNN from Scratch

Let’s implement a complete CNN for MNIST digit classification using only NumPy to truly understand every operation.

import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple, List
import time

np.random.seed(42)


class Conv2D:
    """
    2D Convolutional Layer

    Applies learnable filters to extract spatial features from input.
    """

    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size: int = 3,
        stride: int = 1,
        padding: int = 0
    ):
        """
        Initialize convolutional layer.

        Args:
            in_channels: Number of input channels
            out_channels: Number of output channels (filters)
            kernel_size: Size of the convolutional kernel (assumed square)
            stride: Stride of convolution
            padding: Amount of zero padding
        """
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding

        # He initialization for ReLU
        self.weights = np.random.randn(
            out_channels, in_channels, kernel_size, kernel_size
        ) * np.sqrt(2.0 / (in_channels * kernel_size * kernel_size))

        self.bias = np.zeros((out_channels, 1))

        # Cache for backpropagation
        self.cache = {}

    def forward(self, input: np.ndarray) -> np.ndarray:
        """
        Forward pass: perform convolution.

        Args:
            input: Input tensor of shape (batch, in_channels, height, width)

        Returns:
            output: Convolved output of shape (batch, out_channels, out_h, out_w)
        """
        batch_size, _, h, w = input.shape

        # Add padding if specified
        if self.padding > 0:
            input = np.pad(
                input,
                ((0, 0), (0, 0), (self.padding, self.padding), (self.padding, self.padding)),
                mode='constant'
            )

        # Calculate output dimensions
        out_h = (h + 2 * self.padding - self.kernel_size) // self.stride + 1
        out_w = (w + 2 * self.padding - self.kernel_size) // self.stride + 1

        # Initialize output
        output = np.zeros((batch_size, self.out_channels, out_h, out_w))

        # Perform convolution
        for b in range(batch_size):
            for f in range(self.out_channels):
                for i in range(out_h):
                    for j in range(out_w):
                        # Extract region
                        h_start = i * self.stride
                        h_end = h_start + self.kernel_size
                        w_start = j * self.stride
                        w_end = w_start + self.kernel_size

                        region = input[b, :, h_start:h_end, w_start:w_end]

                        # Convolution: element-wise multiply and sum
                        output[b, f, i, j] = np.sum(region * self.weights[f]) + self.bias[f]

        # Cache for backward pass
        self.cache['input'] = input

        return output

    def get_num_parameters(self) -> int:
        """Return total number of parameters."""
        return self.weights.size + self.bias.size


class MaxPool2D:
    """
    2D Max Pooling Layer

    Downsamples input by taking maximum value in each window.
    """

    def __init__(self, pool_size: int = 2, stride: int = 2):
        """
        Initialize max pooling layer.

        Args:
            pool_size: Size of pooling window
            stride: Stride of pooling operation
        """
        self.pool_size = pool_size
        self.stride = stride
        self.cache = {}

    def forward(self, input: np.ndarray) -> np.ndarray:
        """
        Forward pass: perform max pooling.

        Args:
            input: Input tensor of shape (batch, channels, height, width)

        Returns:
            output: Pooled output
        """
        batch_size, channels, h, w = input.shape

        # Calculate output dimensions
        out_h = (h - self.pool_size) // self.stride + 1
        out_w = (w - self.pool_size) // self.stride + 1

        # Initialize output
        output = np.zeros((batch_size, channels, out_h, out_w))

        # Perform max pooling
        for b in range(batch_size):
            for c in range(channels):
                for i in range(out_h):
                    for j in range(out_w):
                        # Extract region
                        h_start = i * self.stride
                        h_end = h_start + self.pool_size
                        w_start = j * self.stride
                        w_end = w_start + self.pool_size

                        region = input[b, c, h_start:h_end, w_start:w_end]

                        # Take maximum
                        output[b, c, i, j] = np.max(region)

        # Cache for backward pass
        self.cache['input'] = input

        return output


class ReLU:
    """ReLU activation function."""

    def __init__(self):
        self.cache = {}

    def forward(self, input: np.ndarray) -> np.ndarray:
        """Apply ReLU: max(0, x)"""
        self.cache['input'] = input
        return np.maximum(0, input)


class Flatten:
    """Flatten multi-dimensional input to 2D (batch_size, features)."""

    def __init__(self):
        self.cache = {}

    def forward(self, input: np.ndarray) -> np.ndarray:
        """Flatten all dimensions except batch."""
        batch_size = input.shape[0]
        self.cache['input_shape'] = input.shape
        return input.reshape(batch_size, -1)


class Linear:
    """Fully connected linear layer."""

    def __init__(self, in_features: int, out_features: int):
        """
        Initialize linear layer.

        Args:
            in_features: Number of input features
            out_features: Number of output features
        """
        self.in_features = in_features
        self.out_features = out_features

        # He initialization
        self.weights = np.random.randn(in_features, out_features) * np.sqrt(2.0 / in_features)
        self.bias = np.zeros((1, out_features))

        self.cache = {}

    def forward(self, input: np.ndarray) -> np.ndarray:
        """Forward pass: linear transformation."""
        self.cache['input'] = input
        return input @ self.weights + self.bias

    def get_num_parameters(self) -> int:
        """Return total number of parameters."""
        return self.weights.size + self.bias.size


class SimpleCNN:
    """
    Simple CNN for MNIST digit classification.

    Architecture:
    - Conv2D (1 -> 32 channels, 3x3)
    - ReLU
    - MaxPool (2x2)
    - Conv2D (32 -> 64 channels, 3x3)
    - ReLU
    - MaxPool (2x2)
    - Flatten
    - Linear (64*7*7 -> 128)
    - ReLU
    - Linear (128 -> 10)
    """

    def __init__(self):
        """Initialize CNN architecture."""

        # Feature extraction layers
        self.conv1 = Conv2D(in_channels=1, out_channels=32, kernel_size=3, padding=1)
        self.relu1 = ReLU()
        self.pool1 = MaxPool2D(pool_size=2, stride=2)

        self.conv2 = Conv2D(in_channels=32, out_channels=64, kernel_size=3, padding=1)
        self.relu2 = ReLU()
        self.pool2 = MaxPool2D(pool_size=2, stride=2)

        # Classification layers
        self.flatten = Flatten()
        self.fc1 = Linear(in_features=64*7*7, out_features=128)
        self.relu3 = ReLU()
        self.fc2 = Linear(in_features=128, out_features=10)

        # Store layers for easy iteration
        self.layers = [
            self.conv1, self.relu1, self.pool1,
            self.conv2, self.relu2, self.pool2,
            self.flatten,
            self.fc1, self.relu3,
            self.fc2
        ]

    def forward(self, x: np.ndarray) -> np.ndarray:
        """
        Forward pass through the network.

        Args:
            x: Input images of shape (batch, 1, 28, 28)

        Returns:
            output: Class logits of shape (batch, 10)
        """
        for layer in self.layers:
            x = layer.forward(x)
        return x

    def predict(self, x: np.ndarray) -> np.ndarray:
        """
        Make predictions.

        Args:
            x: Input images

        Returns:
            predictions: Predicted class indices
        """
        logits = self.forward(x)
        return np.argmax(logits, axis=1)

    def get_num_parameters(self) -> int:
        """Calculate total number of trainable parameters."""
        total = 0
        for layer in [self.conv1, self.conv2, self.fc1, self.fc2]:
            total += layer.get_num_parameters()
        return total


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax."""
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)


def cross_entropy_loss(predictions: np.ndarray, targets: np.ndarray) -> float:
    """
    Compute cross-entropy loss.

    Args:
        predictions: Model logits of shape (batch, classes)
        targets: True labels of shape (batch,)

    Returns:
        loss: Average cross-entropy loss
    """
    batch_size = predictions.shape[0]
    probs = softmax(predictions)

    # Select probability of correct class
    correct_probs = probs[np.arange(batch_size), targets]

    # Compute loss
    loss = -np.mean(np.log(correct_probs + 1e-8))

    return loss


def compute_accuracy(predictions: np.ndarray, targets: np.ndarray) -> float:
    """Compute classification accuracy."""
    pred_classes = np.argmax(predictions, axis=1)
    return np.mean(pred_classes == targets)


# Demonstration (would need actual training loop with backprop for full implementation)
def demonstrate_cnn():
    """
    Demonstrate CNN forward pass.

    Note: This is a simplified demonstration. A complete implementation
    would include:
    - Backward pass for all layers
    - Optimizer (SGD, Adam, etc.)
    - Training loop
    - Data loading
    """

    print("=" * 70)
    print("CONVOLUTIONAL NEURAL NETWORK DEMONSTRATION")
    print("=" * 70)

    # Create model
    model = SimpleCNN()

    print(f"\nModel Architecture:")
    print(f"  Conv1: 1 → 32 channels, 3×3 kernel")
    print(f"  ReLU + MaxPool (2×2)")
    print(f"  Conv2: 32 → 64 channels, 3×3 kernel")
    print(f"  ReLU + MaxPool (2×2)")
    print(f"  Flatten")
    print(f"  FC1: 3136 → 128")
    print(f"  ReLU")
    print(f"  FC2: 128 → 10")

    print(f"\nTotal Parameters: {model.get_num_parameters():,}")

    # Create dummy input (batch of 4 MNIST images)
    batch_size = 4
    dummy_input = np.random.randn(batch_size, 1, 28, 28)
    dummy_labels = np.array([0, 1, 2, 3])

    print(f"\nInput Shape: {dummy_input.shape}")

    # Forward pass
    print("\nPerforming forward pass...")
    start_time = time.time()
    output = model.forward(dummy_input)
    forward_time = time.time() - start_time

    print(f"Output Shape: {output.shape}")
    print(f"Forward pass time: {forward_time:.4f} seconds")

    # Compute loss and accuracy
    loss = cross_entropy_loss(output, dummy_labels)
    accuracy = compute_accuracy(output, dummy_labels)

    print(f"\nLoss: {loss:.4f}")
    print(f"Accuracy: {accuracy:.4f}")

    # Show predictions
    predictions = model.predict(dummy_input)
    print(f"\nPredictions: {predictions}")
    print(f"True Labels: {dummy_labels}")

    print("\n" + "=" * 70)
    print("Note: This is an untrained network with random weights.")
    print("In practice, you would train this network on MNIST data using")
    print("backpropagation and gradient descent for 10-20 epochs to achieve")
    print("98-99% accuracy.")
    print("=" * 70)


if __name__ == "__main__":
    demonstrate_cnn()

Conclusion

You now have a solid understanding of Convolutional Neural Networks: why fully connected networks struggle with images, how convolution, activation, pooling, batch normalization, dropout, and fully connected layers fit together, how gradients flow back through these operations, and how to assemble the pieces into a working classifier. CNNs power computer vision tasks like image classification, object detection, and segmentation.

Jupyter Notebook

For hands-on practice, check out the companion notebook: Understanding Convolutional Neural Networks