Dec 15, 2025
If you ever decide to work with modern deep learning, especially NLP, vision, or generative AI, you will almost certainly encounter PyTorch. Over the past few years, PyTorch has become the default framework for AI research and is rapidly becoming just as important in production systems. But this dominance didn’t happen by accident.
PyTorch succeeds because it feels natural to humans.
You write normal Python code.
You run it line by line.
And what you write is exactly what the model executes.
That design choice shapes everything we’ll learn in this article.
PyTorch uses dynamic computation graphs, often described as define-by-run. This means the computation graph is built as your code runs, not ahead of time.
Why does this matter?
Because the graph is built as your code runs, standard Python tools keep working: print(), pdb, and stack traces all behave normally. This is why PyTorch feels more like programming and less like configuration.
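To make this concrete, here is a minimal sketch (the function name forward is just illustrative) showing how ordinary Python control flow and print() participate directly in the graph:
import torch

def forward(x):
    # Ordinary Python control flow becomes part of the graph:
    # the graph is built as these lines execute (define-by-run).
    if x.sum() > 0:
        y = x * 2
    else:
        y = x - 1
    print(y)              # plain print() works mid-computation
    return y.mean()

x = torch.randn(4, requires_grad=True)
loss = forward(x)
loss.backward()           # gradients flow through whichever branch actually ran
print(x.grad)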
PyTorch is no longer “just for research”:
features like torch.compile deliver major performance gains while keeping the define-by-run programming model. With that context, let’s start from the absolute foundation.
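As a quick taste of that production story, here is a minimal sketch of torch.compile (available in PyTorch 2.x); the function f below is purely illustrative:
import torch

def f(x):
    return torch.sin(x) ** 2 + torch.cos(x) ** 2

f_compiled = torch.compile(f)   # first call triggers compilation; later calls reuse the optimized code

x = torch.randn(1000)
print(torch.allclose(f_compiled(x), f(x)))  # same results, potentially much faster execution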
At its core, a tensor is a multi-dimensional array. If you’ve used NumPy arrays, you already understand the idea. Think of tensors as containers for numbers, arranged in different dimensions:
Tensor Hierarchy:
5.0 → Single number (0D scalar)
[1, 2, 3] → 1D array (vector)
[[1, 2], [3, 4]] → 2D array (matrix)
[[[...]]] → 3D+ arrays

In deep learning, tensors represent everything: inputs, model weights, activations, and gradients.
If you understand tensors, you understand PyTorch.
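A quick way to see the hierarchy is to check each tensor’s ndim and shape:
import torch

scalar = torch.tensor(5.0)               # 0D: a single number
vector = torch.tensor([1, 2, 3])         # 1D
matrix = torch.tensor([[1, 2], [3, 4]])  # 2D
cube = torch.zeros(2, 3, 4)              # 3D

for t in (scalar, vector, matrix, cube):
    print(t.ndim, tuple(t.shape))
# 0 ()
# 1 (3,)
# 2 (2, 2)
# 3 (2, 3, 4)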
PyTorch provides multiple ways to create tensors:
import torch
import numpy as np
# From Python lists
tensor_from_list = torch.tensor([[1, 2], [3, 4]])
print(tensor_from_list)
# tensor([[1, 2],
# [3, 4]])
# From NumPy arrays (shares memory by default)
numpy_array = np.array([[1.0, 2.0], [3.0, 4.0]])
tensor_from_numpy = torch.from_numpy(numpy_array)
print(tensor_from_numpy)
# tensor([[1., 2.],
# [3., 4.]], dtype=torch.float64)
# Common initialization patterns
zeros = torch.zeros(3, 4) # 3x4 matrix of zeros
ones = torch.ones(2, 3) # 2x3 matrix of ones
random_uniform = torch.rand(2, 2) # Uniform [0, 1)
random_normal = torch.randn(3, 3) # Normal N(0, 1)
# Create tensor with same shape as another
x = torch.tensor([1, 2, 3])
y = torch.zeros_like(x)
print(y) # tensor([0, 0, 0])
# Linearly spaced values
linear = torch.linspace(0, 10, steps=5)
print(linear) # tensor([ 0.0, 2.5, 5.0, 7.5, 10.0])
Every tensor carries metadata that controls how it behaves.
x = torch.randn(3, 4)
# Shape: dimensions of the tensor
print(x.shape) # torch.Size([3, 4])
print(x.size()) # Equivalent to .shape
# Data type: precision and numeric format
print(x.dtype) # torch.float32 (default)
# Device: where tensor lives (CPU or GPU)
print(x.device) # cpu
# Gradient tracking: whether autograd records operations
print(x.requires_grad) # False (default)
# Memory layout
print(x.is_contiguous()) # True (C-contiguous memory)
Common data types:
# Integer types
int_tensor = torch.tensor([1, 2], dtype=torch.int32) # 32-bit integer
long_tensor = torch.tensor([1, 2], dtype=torch.int64) # 64-bit integer
# Floating-point types
float_tensor = torch.tensor([1.0, 2.0], dtype=torch.float32) # Single precision
double_tensor = torch.tensor([1.0, 2.0], dtype=torch.float64) # Double precision
# Boolean and complex types
bool_tensor = torch.tensor([True, False], dtype=torch.bool)
complex_tensor = torch.tensor([1+2j, 3+4j], dtype=torch.complex64)
The dtypes you will use most often:
float32 → default for deep learning
float64 → higher precision, slower
int64 → indices, token IDs
bool → masks

Understanding dtype and device early will save you countless bugs later.
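As a small illustrative sketch of one conversion you will hit constantly: NumPy defaults to float64 while most PyTorch code expects float32, so convert explicitly:
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0])    # NumPy defaults to float64
t = torch.from_numpy(arr)          # inherits torch.float64
t32 = t.float()                    # convert to torch.float32
# equivalently: t.to(torch.float32)
print(t.dtype, t32.dtype)          # torch.float64 torch.float32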
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
# Basic arithmetic (element-wise)
print(a + b) # tensor([5., 7., 9.])
print(a - b) # tensor([-3., -3., -3.])
print(a * b) # tensor([4., 10., 18.])
print(a / b) # tensor([0.25, 0.40, 0.50])
print(a ** 2) # tensor([1., 4., 9.])
# In-place operations (append underscore)
a.add_(1) # a becomes [2., 3., 4.]
print(a)
# Mathematical functions
print(torch.sqrt(a)) # Element-wise square root
print(torch.exp(a)) # Element-wise exponential
print(torch.log(a)) # Element-wise natural log
# Matrix multiplication: THE most important operation
A = torch.randn(3, 4) # 3 rows, 4 columns
B = torch.randn(4, 2) # 4 rows, 2 columns
# Three equivalent ways to multiply matrices
C1 = torch.matmul(A, B) # Explicit function
C2 = A @ B # Python 3.5+ operator (RECOMMENDED)
C3 = A.mm(B) # Method version
print(C1.shape) # torch.Size([3, 2])
# Batch matrix multiplication (crucial for neural networks)
batch_A = torch.randn(10, 3, 4) # 10 matrices of shape 3×4
batch_B = torch.randn(10, 4, 2) # 10 matrices of shape 4×2
batch_C = torch.bmm(batch_A, batch_B) # 10 matrices of shape 3×2
print(batch_C.shape) # torch.Size([10, 3, 2])
# Dot product (1D tensors)
v1 = torch.tensor([1.0, 2.0, 3.0])
v2 = torch.tensor([4.0, 5.0, 6.0])
dot_product = torch.dot(v1, v2) # 1*4 + 2*5 + 3*6 = 32.0
print(dot_product)
# Transpose operations
matrix = torch.randn(3, 5)
print(matrix.T.shape) # torch.Size([5, 3])
print(matrix.transpose(0, 1).shape) # Same as .T
# Advanced: Einstein summation (powerful but complex)
# Matrix multiplication using einsum
result = torch.einsum('ij,jk->ik', A, B)
Broadcasting allows operations on tensors of different shapes:
# Broadcasting rules:
# 1. Dimensions are compared from right to left
# 2. Dimensions must be equal, one of them is 1, or one doesn't exist
a = torch.randn(3, 4)
b = torch.randn(4) # Can broadcast to (3, 4)
result = a + b # b is broadcast to each row of a
print(result.shape) # torch.Size([3, 4])
# Common patterns
matrix = torch.randn(5, 3)
column_vector = torch.randn(5, 1)
row_vector = torch.randn(1, 3)
print((matrix + column_vector).shape) # (5, 3) - add to each column
print((matrix + row_vector).shape) # (5, 3) - add to each row
# WARNING: Broadcasting can hide bugs!
x = torch.randn(3, 4)
y = torch.randn(3, 1) # Intended: (3, 4)?
z = x + y # Silently broadcasts instead of erroring
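One cheap defensive habit (just a sketch, not something PyTorch requires): assert the shape you expect before element-wise operations so an unintended broadcast fails loudly:
import torch

x = torch.randn(3, 4)
y = torch.randn(3, 1)
# This assert fires for the (3, 1) tensor above, turning a silent bug into a loud error:
assert y.shape == x.shape, f"shape mismatch: {tuple(y.shape)} vs {tuple(x.shape)}"
z = x + y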
x = torch.randn(2, 3, 4)
# View: returns a new tensor sharing the same data
y = x.view(6, 4) # Reshape to 6×4 (must preserve total elements)
print(y.shape) # torch.Size([6, 4])
# Reshape: like view but copies data if necessary
z = x.reshape(2, 12) # More flexible than view
print(z.shape) # torch.Size([2, 12])
# Squeeze and unsqueeze: remove/add dimensions of size 1
a = torch.randn(1, 3, 1, 4)
b = a.squeeze() # Remove all dimensions of size 1
print(b.shape) # torch.Size([3, 4])
c = torch.randn(3, 4)
d = c.unsqueeze(0) # Add dimension at position 0
print(d.shape) # torch.Size([1, 3, 4])
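# One subtlety (a small sketch): view() requires contiguous memory, reshape() does not
t = torch.randn(3, 4)
t_t = t.transpose(0, 1)            # transposing makes the tensor non-contiguous
print(t_t.is_contiguous())         # False
# t_t.view(12)                     # would raise a RuntimeError here
flat = t_t.reshape(12)             # reshape copies when needed, so this works
flat2 = t_t.contiguous().view(12)  # or make it contiguous first, then view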
# Advanced indexing
tensor = torch.arange(12).reshape(3, 4)
print(tensor)
# tensor([[ 0, 1, 2, 3],
# [ 4, 5, 6, 7],
# [ 8, 9, 10, 11]])
print(tensor[0, :]) # First row: tensor([0, 1, 2, 3])
print(tensor[:, 1]) # Second column: tensor([1, 5, 9])
print(tensor[0:2, 2:4]) # Submatrix slice
# Boolean indexing
mask = tensor > 5
print(tensor[mask]) # tensor([ 6, 7, 8, 9, 10, 11])
# Fancy indexing
indices = torch.tensor([0, 2])
print(tensor[indices]) # Rows 0 and 2
GPU acceleration is PyTorch’s killer feature for deep learning. Here we will look at how to use GPUs in PyTorch.
import torch
# Check if CUDA is available
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Version: {torch.version.cuda}")
    print(f"Number of GPUs: {torch.cuda.device_count()}")
else:
    print("No GPU available, using CPU")

# Check if MPS (Apple Silicon GPU) is available
if torch.backends.mps.is_available():
    print("Apple Silicon GPU available")
# Create tensor on CPU (default)
cpu_tensor = torch.randn(3, 4)
print(cpu_tensor.device) # cpu
# Move to GPU
if torch.cuda.is_available():
    gpu_tensor = cpu_tensor.to('cuda')
    # Alternative: gpu_tensor = cpu_tensor.cuda()
    print(gpu_tensor.device)    # cuda:0

    # Move back to CPU
    back_to_cpu = gpu_tensor.to('cpu')
    # Alternative: back_to_cpu = gpu_tensor.cpu()
    print(back_to_cpu.device)   # cpu
# For Apple Silicon
if torch.backends.mps.is_available():
    mps_tensor = cpu_tensor.to('mps')
    print(mps_tensor.device)    # mps:0
# Set device dynamically
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
# Create tensors directly on the target device
x = torch.randn(1000, 1000, device=device)
# Move existing tensors
y = torch.randn(1000, 1000)
y = y.to(device)
# Operations must be on the same device
z = x @ y # Works: both on same device
# This would ERROR:
# a = torch.randn(10, device='cpu')
# b = torch.randn(10, device='cuda')
# c = a + b # RuntimeError: Expected all tensors on the same device
import time
# CPU computation
cpu_x = torch.randn(5000, 5000)
cpu_y = torch.randn(5000, 5000)
start = time.time()
cpu_z = cpu_x @ cpu_y
print(f"CPU time: {time.time() - start:.4f} seconds")
# GPU computation (with proper synchronization)
if torch.cuda.is_available():
    gpu_x = cpu_x.to('cuda')
    gpu_y = cpu_y.to('cuda')

    # Warm-up run (GPU kernel compilation)
    _ = gpu_x @ gpu_y
    torch.cuda.synchronize()    # Wait for GPU to finish

    start = time.time()
    gpu_z = gpu_x @ gpu_y
    torch.cuda.synchronize()
    print(f"GPU time: {time.time() - start:.4f} seconds")
# Typical speedup: 10-100x for large matrices
Key takeaways:
Always check torch.cuda.is_available() before using the GPU, keep every tensor in an operation on the same device, and call torch.cuda.synchronize() when timing GPU code.

PyTorch’s automatic differentiation engine, autograd, is what makes training neural networks practical. Instead of manually deriving and coding gradients, autograd computes them automatically.
When you perform operations on tensors with requires_grad=True, PyTorch builds a directed acyclic graph (DAG) tracking the computation:
import torch
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
# Forward pass: build computational graph
z = x * y + y ** 2
print(z) # tensor(15., grad_fn=<AddBackward0>)
# The graph looks like:
#
#   x (2.0)     y (3.0)
#       \        /    \
#        \      /      \
#         \    /        \
#          mul          pow
#             \          /
#              \        /
#               \      /
#                 add
#                  |
#               z (15.0)
So what is grad_fn? It’s a reference to the function that created the tensor: z has grad_fn=<AddBackward0> because it was produced by an addition. Every tensor that results from an operation stores its grad_fn, and this chain of grad_fns is what PyTorch follows to compute gradients during backpropagation.
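You can inspect this bookkeeping yourself; the snippet below just re-creates z and peeks at its graph:
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
z = x * y + y ** 2

print(z.grad_fn)                 # <AddBackward0 ...>: the op that produced z
# Each grad_fn links back to the grad_fns of its inputs, which is how
# backward() can walk the graph in reverse.
print(z.grad_fn.next_functions)  # ((<MulBackward0 ...>, 0), (<PowBackward0 ...>, 0))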
The magic happens when we call .backward():
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x + 1
# Compute gradients
y.backward()
# Access gradient: dy/dx = 2x + 3
print(x.grad) # tensor(7.) = 2(2) + 3
How it works: when you call y.backward(), PyTorch traverses the graph in reverse (topological order), applies the chain rule at each node, and accumulates the results into the .grad attribute of the leaf tensors.

a = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(4.0, requires_grad=True)
# Function: f(a, b) = a^2 * b + b^3
f = a ** 2 * b + b ** 3
f.backward()
# Gradients:
# df/da = 2a * b = 2(3)(4) = 24
# df/db = a^2 + 3b^2 = 9 + 3(16) = 57
print(a.grad) # tensor(24.)
print(b.grad) # tensor(57.)
For scalar outputs, .backward() is straightforward. But for vector outputs, you must pass a gradient argument:
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2
# y is a vector, so backward needs a gradient vector
# This represents the Jacobian-vector product
gradient = torch.tensor([1.0, 1.0, 1.0])
y.backward(gradient)
print(x.grad) # tensor([2., 4., 6.]) = 2x evaluated at each element
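In practice you will more often reduce the output to a scalar first, which yields the same gradients without passing a gradient argument:
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()   # scalar output
y.backward()         # no gradient argument needed
print(x.grad)        # tensor([2., 4., 6.]), same result as above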
CRITICAL: Gradients accumulate by default. You must zero them between iterations.
x = torch.tensor(2.0, requires_grad=True)
# First computation
y1 = x ** 2
y1.backward()
print(x.grad) # tensor(4.)
# Second computation WITHOUT zeroing
y2 = x ** 3
y2.backward()
print(x.grad) # tensor(16.) = 4 + 12 (ACCUMULATED!)
# Proper way: zero gradients
x.grad.zero_()
y3 = x ** 3
y3.backward()
print(x.grad) # tensor(12.) (correct)
When gradient accumulation is useful: it lets you simulate a larger effective batch size by accumulating gradients over several mini-batches before making a single parameter update, as sketched below.
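Here is a minimal sketch (the sizes and learning rate are made up for illustration): accumulate gradients over four mini-batches, then make a single update:
import torch

w = torch.randn(5, requires_grad=True)
accum_steps = 4

for step in range(accum_steps):
    x = torch.randn(8, 5)                        # one small mini-batch
    loss = (x @ w).pow(2).mean() / accum_steps   # scale so the accumulated gradient matches a full batch
    loss.backward()                              # gradients keep adding up in w.grad

with torch.no_grad():
    w -= 0.01 * w.grad                           # one update for the "big" batch
w.grad.zero_()                                   # reset before the next accumulation cycle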
Sometimes you want to use a tensor’s value without tracking gradients:
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2
# Detach y from computational graph
y_detached = y.detach()
print(y_detached.requires_grad) # False
# No gradients will flow through y_detached
z = y_detached * 3
z.backward() # RuntimeError: z does not require grad and has no grad_fn (it was detached)
Use cases: stopping gradients from flowing through part of a graph (for example, a frozen sub-network) and extracting values for logging or plotting without affecting training.
For inference or evaluation, you can disable gradient tracking entirely with torch.no_grad():
x = torch.tensor([1.0, 2.0], requires_grad=True)
# Normal operation (gradients tracked)
y = x ** 2
# Disable gradient tracking
with torch.no_grad():
    y_no_grad = x ** 2
print(y_no_grad.requires_grad) # False
# Also useful for updating parameters without tracking
y.sum().backward()        # populate x.grad first so the update below has a gradient to use
with torch.no_grad():
    x -= 0.1 * x.grad
Benefits: no computation graph is built, so memory usage drops and forward passes run faster.
Now that we have a solid understanding of PyTorch’s core concepts, let’s implement linear regression using only tensors and autograd, without torch.nn yet.
We want to learn a linear function: y = w * x + b
import torch
import matplotlib.pyplot as plt
# Generate synthetic data: y = 3x + 2 + noise
torch.manual_seed(42)
X = torch.randn(100, 1)
y_true = 3 * X + 2 + 0.5 * torch.randn(100, 1)
# Initialize parameters (random initialization)
w = torch.randn(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
print(f"Initial w: {w.item():.4f}, b: {b.item():.4f}")
learning_rate = 0.01
num_epochs = 100
losses = []
for epoch in range(num_epochs):
    # Forward pass: compute predictions
    y_pred = w * X + b

    # Compute loss (Mean Squared Error)
    loss = ((y_pred - y_true) ** 2).mean()
    losses.append(loss.item())

    # Backward pass: compute gradients
    loss.backward()

    # Update parameters (gradient descent)
    with torch.no_grad():
        w -= learning_rate * w.grad
        b -= learning_rate * b.grad

    # CRITICAL: Zero gradients for next iteration
    w.grad.zero_()
    b.grad.zero_()

    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {loss.item():.4f}")
print(f"\\nFinal parameters: w={w.item():.4f}, b={b.item():.4f}")
print(f"True parameters: w=3.0000, b=2.0000")
# Plot loss curve
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(losses)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Loss')
plt.grid(True)
# Plot predictions vs. true values
plt.subplot(1, 2, 2)
with torch.no_grad():
    y_final = w * X + b
plt.scatter(X.numpy(), y_true.numpy(), alpha=0.5, label='True data')
plt.scatter(X.numpy(), y_final.numpy(), alpha=0.5, label='Predictions')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression Results')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.savefig('linear_regression_results.png')
print("Visualization saved to 'linear_regression_results.png'")
The loop is always the same: forward pass, compute the loss, call loss.backward(), update the parameters, zero the gradients. This pattern forms the foundation of ALL deep learning training in PyTorch.
By implementing linear regression from scratch, you’ve learned:
how tensors, autograd, and gradient descent fit together: the forward pass builds the graph, .backward() propagates gradients, and you must always zero them between iterations.

For hands-on practice, check out the companion notebook - Part 1: PyTorch Foundation