Jan 03, 2026
Continuing our Deep Learning Series, we now turn our attention to loss functions. Loss functions are the compass that guides our neural networks toward better performance. Imagine you’re teaching a child to play darts. After each throw, you need to give feedback, but how do you measure how far off they were? You could simply count hits and misses, but that doesn’t tell the whole story. A dart that barely misses the bullseye is very different from one that hits the wall. You need a scoring system that captures the nuance of their performance and guides them toward improvement. This is exactly what loss functions do for neural networks: they measure how far predictions are from the truth and guide the learning process.
Every machine learning model follows the same fundamental process:
1. Make predictions with the current parameters.
2. Measure how wrong those predictions are.
3. Adjust the parameters to be less wrong, and repeat.
The loss function is the critical component in step 2: it quantifies “how wrong” the model is. Without a good loss function, the model has no way to improve, no matter how sophisticated its architecture.
Think of a loss function as a GPS system. Just as GPS tells you not only that you’re off course but also by how much and in which direction, a loss function tells the model not just that it’s wrong, but provides a gradient: a direction for improvement.
An effective loss function should:
- Be differentiable, so gradients can flow backward through the network
- Faithfully reflect the true objective of the task
- Produce informative gradients, with larger errors generally creating stronger corrective signals
Let’s understand how loss functions fit into the training process:
1. Forward Pass:
Input → Model → Prediction
2. Loss Calculation:
Loss = LossFunction(Prediction, True Value)
3. Backward Pass:
∂Loss/∂Weights → Gradients
4. Parameter Update:
Weights = Weights - LearningRate × Gradients
The loss function appears simple, just one number, but it encodes the entire objective of learning.
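Here is the loop in miniature: a minimal sketch of one full cycle for a single-weight linear model ($\hat{y} = w \cdot x$) trained with MSE on one toy sample (all values here are illustrative):

```python
import numpy as np

# Toy setup: model is y_hat = w * x, loss is MSE over one sample
x, y_true = 2.0, 10.0
w = 1.0                # initial weight
learning_rate = 0.1

for step in range(3):
    y_pred = w * x                        # 1. forward pass
    loss = (y_pred - y_true) ** 2         # 2. loss calculation
    grad = 2 * (y_pred - y_true) * x      # 3. backward pass: dLoss/dw
    w = w - learning_rate * grad          # 4. parameter update
    print(f"step {step}: loss={loss:.3f}, w={w:.3f}")
# The loss shrinks each step as w approaches the ideal value of 5.0
```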
The loss value itself is just a number. What makes it powerful is its gradient:
\[\frac{\partial L}{\partial w}\]
This tells us:
- The direction in which to move each weight (the sign of the gradient)
- How strongly to move it (the magnitude of the gradient)
Imagine predicting house prices, where one prediction is off by \$10k and another by \$50k. Different loss functions treat these errors differently:
Mean Squared Error (MSE): penalties of $10^2 = 100$ and $50^2 = 2500$ (working in \$k), a 25x ratio.
Notice how MSE punishes the larger error 25x more (even though it’s only 5x larger), creating strong pressure to fix big mistakes.
Mean Absolute Error (MAE): penalties of 10 and 50, a 5x ratio.
MAE punishes proportionally: 5x the error means 5x the penalty. This makes it more robust to outliers.
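A quick numerical check of this asymmetry, using the illustrative \$10k and \$50k errors above:

```python
import numpy as np

errors = np.array([10.0, 50.0])            # prediction errors in $k
print("MSE penalties:", errors ** 2)       # [ 100. 2500.] -> 25x ratio
print("MAE penalties:", np.abs(errors))    # [10. 50.]     -> 5x ratio
```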
Regression is about predicting continuous values, such as house prices, temperatures, and distances. Let’s explore the main regression losses.
The Intuition
MSE is like using a quadratic penalty function. Small errors get small penalties, but large errors get punished disproportionately hard.
Mathematical Definition
\[L_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]
Where:
- $y_i$ is the true value
- $\hat{y}_i$ is the predicted value
- $n$ is the number of samples
The Gradient
\[\frac{\partial L_{\text{MSE}}}{\partial \hat{y}_i} = \frac{2}{n}(\hat{y}_i - y_i)\]
Notice the gradient is proportional to the error, so larger errors create larger gradients.
Practical Example
```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Compute Mean Squared Error."""
    return np.mean((y_pred - y_true) ** 2)

def mse_gradient(y_true, y_pred):
    """Compute MSE gradient."""
    n = len(y_true)
    return (2 / n) * (y_pred - y_true)

# Example
y_true = np.array([2.5, 1.0, 3.2, 0.5, 4.1])
y_pred = np.array([2.3, 1.2, 3.0, 0.8, 3.9])

loss = mse_loss(y_true, y_pred)
print(f"MSE Loss: {loss:.4f}")  # MSE Loss: 0.0500
```
The Intuition
MAE penalizes every unit of error equally, like measuring distances in a city grid where you can only move along streets (Manhattan distance).
Mathematical Definition
\[L_{\text{MAE}} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|\]
The Gradient
\[\frac{\partial L_{\text{MAE}}}{\partial \hat{y}_i} = \frac{1}{n} \text{sign}(\hat{y}_i - y_i)\]
Notice the gradient is constant (±1/n); it doesn’t grow with the error size. This makes MAE more robust to outliers.
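Practical Example
The Huber comparison later in this post calls an mae_loss helper, so here is a minimal implementation matching the formulas above:

```python
def mae_loss(y_true, y_pred):
    """Compute Mean Absolute Error."""
    return np.mean(np.abs(y_pred - y_true))

def mae_gradient(y_true, y_pred):
    """Compute MAE gradient (undefined at exactly 0; sign convention used)."""
    n = len(y_true)
    return np.sign(y_pred - y_true) / n

y_true = np.array([2.5, 1.0, 3.2, 0.5, 4.1])
y_pred = np.array([2.3, 1.2, 3.0, 0.8, 3.9])
print(f"MAE Loss: {mae_loss(y_true, y_pred):.4f}")  # MAE Loss: 0.2200
```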
MAE vs MSE: A Visual Comparison
Imagine predicting temperatures, where Prediction 1 is off by 2° and Prediction 2 is off by 10°:
| Loss | Prediction 1 (error = 2°) | Prediction 2 (error = 10°) | Ratio |
|---|---|---|---|
| MSE | 4 | 100 | 25x |
| MAE | 2 | 10 | 5x |
MSE says “fix the big error ASAP!” while MAE says “both errors matter proportionally.”
The Intuition
Huber loss is like a diplomatic compromise between MSE and MAE. For small errors, it acts like MSE (quadratic). For large errors, it acts like MAE (linear). It’s the Swiss Army knife of regression losses.
Mathematical Definition
\[L_{\delta}(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \leq \delta \\ \delta |y - \hat{y}| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}\]
Where $\delta$ is a threshold parameter that controls the transition point.
The Gradient
\[\frac{\partial L_{\delta}}{\partial \hat{y}} = \begin{cases} (\hat{y} - y) & \text{if } |y - \hat{y}| \leq \delta \\ \delta \cdot \text{sign}(\hat{y} - y) & \text{otherwise} \end{cases}\]
Choosing Delta
The $\delta$ parameter is critical:
- Small $\delta$: behaves mostly like MAE (robust, but weaker gradients near the optimum)
- Large $\delta$: behaves mostly like MSE (sensitive to outliers)
- $\delta = 1.0$ is a common default
Practical Example
```python
def huber_loss(y_true, y_pred, delta=1.0):
    """Compute Huber Loss."""
    error = y_pred - y_true
    abs_error = np.abs(error)
    # Quadratic for small errors, linear for large
    quadratic = 0.5 * error ** 2
    linear = delta * abs_error - 0.5 * delta ** 2
    return np.mean(np.where(abs_error <= delta, quadratic, linear))

# Compare with outliers
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 2.2, 2.9, 4.1, 15.0])  # Last one is an outlier

print(f"MSE:   {mse_loss(y_true, y_pred):.4f}")    # MSE:   20.0140
print(f"MAE:   {mae_loss(y_true, y_pred):.4f}")    # MAE:   2.1000
print(f"Huber: {huber_loss(y_true, y_pred):.4f}")  # Huber: 1.9070
# Unlike MSE, both MAE and Huber keep the outlier's influence linear,
# which is why their values stay close while MSE explodes.
```
Binary classification is about making yes/no decisions: Is this email spam? Will this customer churn? Is this tumor malignant? Let’s explore the losses designed for this task.
The Intuition
Binary Cross-Entropy comes from information theory. It measures the “surprise” of seeing the true label given your predicted probability. If you predict 99% probability and it happens, low surprise (low loss). If you predict 1% and it happens, high surprise (high loss).
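To put numbers on that intuition:

```python
import numpy as np

# Surprise = -log(probability assigned to what actually happened)
print(-np.log(0.99))  # ≈ 0.0101: confident and right -> low loss
print(-np.log(0.01))  # ≈ 4.6052: confident and wrong -> high loss
```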
Mathematical Definition
\[L_{\text{BCE}} = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)]\]
Where:
- $y_i \in \{0, 1\}$ is the true label
- $\hat{y}_i \in (0, 1)$ is the predicted probability
Breaking Down the Formula
Let’s understand each part:
- When $y_i = 1$, only the first term is active: the loss is $-\log(\hat{y}_i)$, small when $\hat{y}_i$ is near 1 and exploding as it approaches 0.
- When $y_i = 0$, only the second term is active: the loss is $-\log(1 - \hat{y}_i)$, punishing predictions near 1.
The Gradient
\[\frac{\partial L_{\text{BCE}}}{\partial \hat{y}_i} = -\frac{1}{n}\left(\frac{y_i}{\hat{y}_i} - \frac{1 - y_i}{1 - \hat{y}_i}\right)\]
When combined with sigmoid activation, this simplifies beautifully to:
\[\frac{\partial L}{\partial z_i} = \frac{1}{n}(\sigma(z_i) - y_i)\]
Where $z_i$ is the logit (pre-activation value). This elegant simplification is why BCE and sigmoid are so commonly paired.
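You can sanity-check this simplification numerically; here is a minimal sketch for a single sample (so the $1/n$ factor drops out):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce_from_logit(z, y):
    """BCE for one sample, expressed as a function of the logit z."""
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

z, y, h = 0.7, 1.0, 1e-5
numeric = (bce_from_logit(z + h, y) - bce_from_logit(z - h, y)) / (2 * h)
analytic = sigmoid(z) - y
print(f"numeric: {numeric:.6f}, analytic: {analytic:.6f}")  # both ≈ -0.331812
```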
Numerical Stability Concern
Computing $\log(0)$ is undefined. Always clip predictions:
```python
def binary_cross_entropy(y_true, y_pred, epsilon=1e-7):
    """Numerically stable BCE."""
    # Clip predictions to avoid log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    loss = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return np.mean(loss)
```
The Intuition
Instead of having the network output probabilities (via sigmoid) and then computing BCE, we combine both operations. This is more numerically stable and efficient.
Why It’s Better
Computing sigmoid + BCE separately can cause numerical issues:
```python
# Problematic approach
logits = model(x)           # Raw outputs
probs = sigmoid(logits)     # Can overflow/underflow
loss = bce(probs, y_true)   # Can have log(0)
```
BCE with logits avoids this by using the log-sum-exp trick:
Mathematical Definition
\[L_{\text{BCE-Logits}}(z, y) = \max(z, 0) - z \cdot y + \log(1 + e^{-|z|})\]
This formulation is numerically stable for any value of $z$ (the logit).
The Gradient
It’s remarkably simple:
\[\frac{\partial L}{\partial z} = \sigma(z) - y\]
Practical Example
```python
def bce_with_logits(logits, y_true):
    """Numerically stable BCE with logits."""
    # Using the log-sum-exp trick
    max_val = np.maximum(logits, 0)
    loss = max_val - logits * y_true + np.log(1 + np.exp(-np.abs(logits)))
    return np.mean(loss)

# Example
logits = np.array([2.3, -1.5, 0.8, -0.3, 1.7])
y_true = np.array([1, 0, 1, 0, 1])
print(f"BCE with Logits: {bce_with_logits(logits, y_true):.4f}")
# This is stable even for extreme logit values
```
The Problem
Imagine training a cancer detection model where 99% of samples are healthy. A naive model could achieve 99% accuracy by always predicting “healthy”, which is useless in practice!
Standard BCE treats all examples equally. In imbalanced datasets, the majority class dominates training, and the model never learns to detect the rare class.
The Solution
Focal Loss, introduced in the RetinaNet paper (Lin et al., 2017), down-weights easy examples and focuses training on hard ones.
Mathematical Definition
\[L_{\text{Focal}} = -\alpha_t (1 - p_t)^\gamma \log(p_t)\]
Where:
- $p_t$ is the predicted probability of the true class
- $\alpha_t$ is a class-balancing weight
- $\gamma$ is the focusing parameter (typically 2)
Understanding the Components
- $(1 - p_t)^\gamma$ is the modulating factor: for easy examples ($p_t$ near 1) it approaches 0, shrinking their loss; for hard examples ($p_t$ near 0) it stays near 1, leaving their loss intact.
- $\alpha_t$ re-weights the rare class so it isn’t drowned out by the majority class.
Practical Example
```python
def focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0, epsilon=1e-7):
    """Compute Focal Loss."""
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    # Probability of the correct class
    p_t = np.where(y_true == 1, y_pred, 1 - y_pred)
    # Focal weight
    focal_weight = alpha * (1 - p_t) ** gamma
    # Cross-entropy
    ce = -np.log(p_t)
    return np.mean(focal_weight * ce)

# Imbalanced dataset example
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # 80% class 0
y_pred = np.array([0.1, 0.2, 0.15, 0.1, 0.05, 0.12, 0.08, 0.11, 0.85, 0.9])

print(f"BCE:   {binary_cross_entropy(y_true, y_pred):.4f}")
print(f"Focal: {focal_loss(y_true, y_pred):.4f}")
# Focal loss is lower because easy examples are down-weighted
```
When you have more than two classes, whether classifying images into 1,000 categories or predicting which of 50 customers will buy a product, you’ll need multi-class losses.
The Intuition
CCE is the generalization of binary cross-entropy to multiple classes. Instead of predicting one probability, you predict a probability distribution over all classes.
Mathematical Definition
\[L_{\text{CCE}} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})\]
Where:
- $C$ is the number of classes
- $y_{i,c}$ is 1 if sample $i$ belongs to class $c$ and 0 otherwise (one-hot encoding)
- $\hat{y}_{i,c}$ is the predicted probability that sample $i$ belongs to class $c$
Understanding with an Example
Imagine classifying animals into 3 classes: {cat, dog, bird}
True label: cat → One-hot: $[1, 0, 0]$
Prediction 1 (good): $[0.8, 0.15, 0.05]$ \(L = -(1 \times \log(0.8) + 0 \times \log(0.15) + 0 \times \log(0.05)) = 0.223\)
Prediction 2 (bad): $[0.2, 0.5, 0.3]$ \(L = -(1 \times \log(0.2) + 0 + 0) = 1.609\)
Notice how only the probability of the correct class matters!
The Gradient
\[\frac{\partial L_{\text{CCE}}}{\partial \hat{y}_{i,c}} = -\frac{y_{i,c}}{\hat{y}_{i,c}}\]
When combined with softmax activation:
\[\frac{\partial L}{\partial z_{i,c}} = \hat{y}_{i,c} - y_{i,c}\]
This beautiful simplification (the same form as BCE + sigmoid) is why softmax and CCE are paired.
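As with the sigmoid case, this is easy to verify numerically; a minimal sketch for a single sample:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])   # logits
y = np.array([1.0, 0.0, 0.0])   # one-hot: true class is 0

analytic = softmax(z) - y

# Central finite differences on L(z) = -log(softmax(z) · y)
h, numeric = 1e-5, np.zeros_like(z)
for k in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[k] += h
    zm[k] -= h
    numeric[k] = (-np.log(softmax(zp) @ y) + np.log(softmax(zm) @ y)) / (2 * h)

print(analytic)  # ≈ [-0.3410  0.2424  0.0986]
print(numeric)   # matches the analytic gradient
```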
The Intuition
Exactly the same as CCE, but accepts integer class labels instead of one-hot vectors. This saves memory and computation.
Mathematical Definition
\[L_{\text{Sparse-CCE}} = -\frac{1}{n} \sum_{i=1}^{n} \log(\hat{y}_{i, y_i})\]
Where $y_i$ is the integer class label (e.g., 0, 1, 2 instead of [1,0,0], [0,1,0], [0,0,1]).
Memory Comparison
For 1 million samples with 1000 classes:
- One-hot labels: $10^6 \times 1000$ values, roughly 4 GB as float32
- Integer labels: $10^6$ values, roughly 4 MB as int32
That’s 1000× memory savings!
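A back-of-the-envelope check of those numbers:

```python
n_samples, n_classes = 1_000_000, 1000
bytes_per_value = 4  # float32 / int32

one_hot_bytes = n_samples * n_classes * bytes_per_value
integer_bytes = n_samples * bytes_per_value

print(f"One-hot: {one_hot_bytes / 1e9:.1f} GB")       # One-hot: 4.0 GB
print(f"Integer: {integer_bytes / 1e6:.1f} MB")       # Integer: 4.0 MB
print(f"Savings: {one_hot_bytes // integer_bytes}x")  # Savings: 1000x
```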
Practical Example
```python
def sparse_categorical_crossentropy(y_true, y_pred, epsilon=1e-7):
    """
    Args:
        y_true: Integer class labels, shape (n_samples,)
        y_pred: Predicted probabilities, shape (n_samples, n_classes)
    """
    n = y_pred.shape[0]
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    # Extract probability of the correct class for each sample
    correct_probs = y_pred[np.arange(n), y_true.astype(int)]
    return -np.mean(np.log(correct_probs))

# Example: 3-class classification
y_true = np.array([0, 2, 1, 0, 2])  # Integer labels
y_pred = np.array([
    [0.7, 0.2, 0.1],  # Predicts class 0 (correct)
    [0.1, 0.2, 0.7],  # Predicts class 2 (correct)
    [0.2, 0.6, 0.2],  # Predicts class 1 (correct)
    [0.8, 0.1, 0.1],  # Predicts class 0 (correct)
    [0.3, 0.3, 0.4],  # Predicts class 2 (correct)
])

loss = sparse_categorical_crossentropy(y_true, y_pred)
print(f"Sparse CCE: {loss:.4f}")  # Low loss; all predictions are good
```
Beyond standard regression and classification, specialized tasks require specialized losses.
The Intuition
Hinge loss comes from Support Vector Machines (SVMs). Instead of just getting the answer right, it wants the model to be confidently right, with a “margin” of safety.
Mathematical Definition
\[L_{\text{Hinge}} = \frac{1}{n} \sum_{i=1}^{n} \max(0, 1 - y_i \cdot \hat{y}_i)\]
Where:
- $y_i \in \{-1, +1\}$ is the true label
- $\hat{y}_i$ is the raw model score (not a probability)
Understanding the Margin
- If $y_i \cdot \hat{y}_i \geq 1$: confidently correct, zero loss
- If $0 < y_i \cdot \hat{y}_i < 1$: correct but inside the margin, small loss
- If $y_i \cdot \hat{y}_i < 0$: wrong, with loss growing linearly in the wrong direction
Example
True label: +1
- Prediction $2.0$: $\max(0, 1 - 2.0) = 0$ (confidently correct)
- Prediction $0.5$: $\max(0, 1 - 0.5) = 0.5$ (correct, but inside the margin)
- Prediction $-1.0$: $\max(0, 1 - (-1.0)) = 2.0$ (confidently wrong)
These three cases are computed in the sketch below.
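Since the other sections include a practical example, here is a minimal hinge loss sketch covering the three cases above:

```python
import numpy as np

def hinge_loss(y_true, y_pred):
    """Compute Hinge Loss. Labels must be -1 or +1; predictions are raw scores."""
    return np.mean(np.maximum(0, 1 - y_true * y_pred))

y_true = np.array([1, 1, 1])
y_pred = np.array([2.0, 0.5, -1.0])
for y, p in zip(y_true, y_pred):
    print(f"label={y}, score={p}: loss={max(0, 1 - y * p):.1f}")
print(f"Mean hinge loss: {hinge_loss(y_true, y_pred):.4f}")  # (0 + 0.5 + 2.0) / 3 ≈ 0.8333
```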
The Intuition
Kullback-Leibler (KL) Divergence measures how different two probability distributions are. It’s important to note that KL divergence is not symmetric: $\text{KL}(P||Q) \neq \text{KL}(Q||P)$, which makes it useful for specific applications.
Mathematical Definition
\[D_{\text{KL}}(P||Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}\]
Where:
- $P$ is the true (target) distribution
- $Q$ is the predicted (approximating) distribution
Interpretation
KL divergence measures the “extra bits” needed to encode samples from $P$ using a code optimized for $Q$. It’s always non-negative and equals zero only when $P = Q$.
Practical Example
```python
def kl_divergence(p_true, q_pred, epsilon=1e-7):
    """
    Compute KL(P||Q).
    Args:
        p_true: True distribution P
        q_pred: Predicted distribution Q
    """
    p_true = np.clip(p_true, epsilon, 1 - epsilon)
    q_pred = np.clip(q_pred, epsilon, 1 - epsilon)
    return np.sum(p_true * np.log(p_true / q_pred))

# Example: Matching distributions
p_true = np.array([0.5, 0.3, 0.2])
q_pred1 = np.array([0.5, 0.3, 0.2])     # Perfect match
q_pred2 = np.array([0.33, 0.33, 0.34])  # Uniform-ish

print(f"KL(P||Q1): {kl_divergence(p_true, q_pred1):.4f}")  # ~0 (perfect)
print(f"KL(P||Q2): {kl_divergence(p_true, q_pred2):.4f}")  # >0 (different)
```
The Intuition
Dice loss is based on the Dice coefficient (equivalent to the F1 score), which measures overlap between two sets. It’s particularly popular in image segmentation, where we care about pixel-level overlap.
Mathematical Definition
\[L_{\text{Dice}} = 1 - \frac{2|X \cap Y| + \epsilon}{|X| + |Y| + \epsilon}\]
Where:
- $X$ is the set of predicted pixels and $Y$ the set of ground-truth pixels
- $\epsilon$ is a small smoothing term that avoids division by zero
Expanding for Continuous Predictions
\[L_{\text{Dice}} = 1 - \frac{2 \sum_i y_i \hat{y}_i + \epsilon}{\sum_i y_i + \sum_i \hat{y}_i + \epsilon}\]
Why It Works for Imbalance
In segmentation, background often dominates (95% background, 5% object). Dice loss focuses on the overlap, not the total accuracy, making it robust to class imbalance.
Practical Example
```python
def dice_loss(y_true, y_pred, smooth=1.0):
    """Compute Dice Loss."""
    intersection = np.sum(y_true * y_pred)
    union = np.sum(y_true) + np.sum(y_pred)
    dice_coefficient = (2.0 * intersection + smooth) / (union + smooth)
    return 1 - dice_coefficient

# Example: Binary segmentation
y_true = np.array([[0, 0, 1, 1],
                   [0, 1, 1, 0]])
y_pred_good = np.array([[0.1, 0.1, 0.9, 0.8],
                        [0.2, 0.8, 0.9, 0.1]])
y_pred_bad = np.array([[0.9, 0.8, 0.1, 0.2],
                       [0.9, 0.1, 0.2, 0.8]])

print(f"Good prediction Dice: {dice_loss(y_true, y_pred_good):.4f}")
print(f"Bad prediction Dice:  {dice_loss(y_true, y_pred_bad):.4f}")
```
Metric learning is about learning embeddings where similar items are close and dissimilar items are far apart. These losses are crucial for face recognition, recommendation systems, and similarity search.
The Intuition
Contrastive loss trains on pairs of samples. For similar pairs (same person’s face), pull them together. For dissimilar pairs (different people), push them apart.
Mathematical Definition
\[L_{\text{Contrastive}} = \frac{1}{2}[y \cdot D^2 + (1-y) \cdot \max(0, m - D)^2]\]
Where:
- $D$ is the Euclidean distance between the two embeddings
- $y = 1$ if the pair is similar, $y = 0$ if dissimilar
- $m$ is the margin
Understanding the Two Terms
- Similar pairs ($y = 1$): the loss is $\frac{1}{2}D^2$, so any distance between them is penalized, pulling them together.
- Dissimilar pairs ($y = 0$): the loss is nonzero only when $D < m$, pushing them apart until they are at least a margin away.
Why the Margin?
The margin $m$ prevents the model from collapsing all dissimilar pairs to zero distance. It ensures a minimum separation between dissimilar items.
Practical Example
```python
def contrastive_loss(embedding1, embedding2, label, margin=1.0):
    """
    Compute Contrastive Loss.
    Args:
        embedding1, embedding2: Embedding vectors
        label: 1 if similar, 0 if dissimilar
        margin: Minimum distance for dissimilar pairs
    """
    # Euclidean distance
    distance = np.sqrt(np.sum((embedding1 - embedding2) ** 2))
    # Similar: pull together; dissimilar: push apart
    if label == 1:
        loss = 0.5 * distance ** 2
    else:
        loss = 0.5 * max(0, margin - distance) ** 2
    return loss

# Example
emb1 = np.array([1.0, 2.0, 3.0])
emb2_similar = np.array([1.1, 2.1, 2.9])     # Close
emb2_dissimilar = np.array([5.0, 6.0, 7.0])  # Far

print(f"Similar pair loss:    {contrastive_loss(emb1, emb2_similar, label=1):.4f}")
print(f"Dissimilar pair loss: {contrastive_loss(emb1, emb2_dissimilar, label=0):.4f}")
```
The Intuition
Triplet loss goes beyond pairs. It uses triplets: (anchor, positive, negative). The anchor should be closer to the positive than to the negative by at least a margin.
Mathematical Definition
\[L_{\text{Triplet}} = \max(0, D(a, p) - D(a, n) + m)\]
Where:
- $a$ is the anchor, $p$ a positive (same class as the anchor), and $n$ a negative (different class)
- $D(\cdot, \cdot)$ is a distance between embeddings
- $m$ is the margin
Relative vs Absolute
Unlike contrastive loss which cares about absolute distances, triplet loss cares about relative distances:
“Make the anchor closer to the positive than to the negative by at least margin $m$.”
Triplet Mining
The hardest part of triplet loss is choosing good triplets:
- Easy triplets: the negative is already far away; the loss is 0 and nothing is learned
- Hard triplets: the negative is closer to the anchor than the positive; informative, but can destabilize training
- Semi-hard triplets: the negative is farther than the positive but still within the margin; often the sweet spot (a mining sketch follows the practical example below)
Practical Example
```python
def triplet_loss(anchor, positive, negative, margin=1.0):
    """
    Compute Triplet Loss.
    Args:
        anchor, positive, negative: Embedding vectors
        margin: Minimum separation margin
    """
    pos_distance = np.sum((anchor - positive) ** 2)
    neg_distance = np.sum((anchor - negative) ** 2)
    loss = max(0, pos_distance - neg_distance + margin)
    return loss

# Example: Face recognition
anchor = np.array([1.0, 2.0, 3.0])
positive = np.array([1.1, 2.1, 2.9])  # Same person
negative = np.array([5.0, 6.0, 7.0])  # Different person

loss = triplet_loss(anchor, positive, negative, margin=1.0)
print(f"Triplet loss: {loss:.4f}")
# If the positive is too far or the negative too close, loss > 0
# If anchor-positive distance + margin < anchor-negative distance, loss = 0
```
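And here is the mining sketch promised above: a minimal, deliberately unoptimized way to pick one semi-hard negative inside a batch. The helper name, the toy embeddings, and the integer identity labels are all illustrative assumptions:

```python
import numpy as np

def semi_hard_negative(anchor_idx, positive_idx, embeddings, labels, margin=1.0):
    """Return the index of a semi-hard negative for (anchor, positive), or None.
    Semi-hard: farther from the anchor than the positive, but within the margin."""
    d_ap = np.sum((embeddings[anchor_idx] - embeddings[positive_idx]) ** 2)
    candidates = []
    for j, label in enumerate(labels):
        if label == labels[anchor_idx]:
            continue  # same identity: not a negative
        d_an = np.sum((embeddings[anchor_idx] - embeddings[j]) ** 2)
        if d_ap < d_an < d_ap + margin:
            candidates.append((d_an, j))
    # Hardest of the semi-hard candidates: the closest qualifying negative
    return min(candidates)[1] if candidates else None

embeddings = np.array([[0.0, 0.0], [0.1, 0.0], [0.5, 0.4], [3.0, 3.0]])
labels = np.array([0, 0, 1, 1])
print(semi_hard_negative(0, 1, embeddings, labels))  # 2: farther than the positive, inside the margin
```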
Selecting the appropriate loss function is crucial for model performance. Here’s a comprehensive decision guide.
```
Task Type?
├── Regression
│   ├── Clean data, Gaussian errors → MSE
│   ├── Outliers present → MAE or Huber Loss
│   └── Mixed (some outliers) → Huber Loss
│
├── Binary Classification
│   ├── Balanced classes → BCE with Logits
│   ├── Imbalanced classes → Focal Loss
│   └── Need margin → Hinge Loss
│
├── Multi-Class Classification
│   ├── Balanced classes → Sparse CCE
│   ├── Imbalanced classes → Weighted CCE or Focal Loss
│   └── Many classes → Sparse CCE (memory efficient)
│
├── Segmentation
│   ├── Balanced pixels → BCE
│   ├── Imbalanced pixels → Dice Loss or Focal Loss
│   └── Multiple objects → Dice Loss + BCE combination
│
└── Similarity Learning
    ├── Pairs available → Contrastive Loss
    ├── Triplets available → Triplet Loss
    └── Distribution matching → KL Divergence
```
| Task | Loss Function | When to Use | Avoid When |
|---|---|---|---|
| Regression | MSE | Standard regression, Gaussian errors | Outliers present |
| Regression | MAE | Outliers, robust fitting | Need to penalize large errors |
| Regression | Huber | Mixed (some outliers) | Purely clean data |
| Binary Classification | BCE with Logits | Standard binary classification | Imbalanced data |
| Binary Classification | Focal Loss | Severe class imbalance | Balanced data |
| Binary Classification | Hinge Loss | Margin-based learning, SVM | Need probabilities |
| Multi-Class | Sparse CCE | Standard multi-class | Binary (use BCE) |
| Multi-Class | Weighted CCE | Known class weights | Unknown imbalance |
| Segmentation | Dice Loss | Pixel imbalance | Standard classification |
| Segmentation | Focal Loss | Severe imbalance | Balanced classes |
| Metric Learning | Contrastive | Pair-based similarity | Have triplet info |
| Metric Learning | Triplet | Face recognition, ranking | Small datasets |
| Distribution | KL Divergence | VAE, distillation | Distance metric needed |
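For reference, here is a quick map from the table to PyTorch’s built-in loss modules. Dice and contrastive losses have no built-in and are typically hand-rolled as shown earlier, and a focal loss is available in torchvision as sigmoid_focal_loss:

```python
import torch.nn as nn

# Built-in equivalents of the losses discussed above
pytorch_losses = {
    "MSE": nn.MSELoss(),
    "MAE": nn.L1Loss(),
    "Huber": nn.SmoothL1Loss(),          # or nn.HuberLoss(delta=1.0)
    "BCE with logits": nn.BCEWithLogitsLoss(),
    "CCE (expects logits)": nn.CrossEntropyLoss(),
    "KL divergence": nn.KLDivLoss(reduction="batchmean"),
    "Triplet": nn.TripletMarginLoss(margin=1.0),
    "Hinge (for embeddings)": nn.HingeEmbeddingLoss(margin=1.0),
}
```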
Let’s implement a complete training pipeline showcasing different loss functions.
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# 1. Define a simple model
class SimpleModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        # Return raw logits: BCEWithLogitsLoss / CrossEntropyLoss apply
        # sigmoid / softmax internally. Only apply them at inference time.
        return self.layers(x)

# 2. Training function with different losses
def train_model(model, dataloader, loss_fn, optimizer, epochs=10):
    """Generic training loop."""
    model.train()
    for epoch in range(epochs):
        total_loss = 0.0
        for batch_x, batch_y in dataloader:
            # Forward pass
            predictions = model(batch_x)
            loss = loss_fn(predictions, batch_y)
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        avg_loss = total_loss / len(dataloader)
        if (epoch + 1) % 2 == 0:
            print(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.4f}")

# 3. Example: Binary Classification with BCE
print("Example 1: Binary Classification with BCE")
print("-" * 60)

# Generate synthetic data
torch.manual_seed(42)
X_binary = torch.randn(1000, 10)
# Targets must match the model's (batch, 1) output shape for BCEWithLogitsLoss
y_binary = (X_binary.sum(dim=1) > 0).float().unsqueeze(1)

dataset = TensorDataset(X_binary, y_binary)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# Model and training
model_binary = SimpleModel(input_dim=10, hidden_dim=20, output_dim=1)
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model_binary.parameters(), lr=0.001)
train_model(model_binary, dataloader, criterion, optimizer, epochs=10)

# 4. Example: Multi-Class with Sparse CCE
print("\nExample 2: Multi-Class Classification")
print("-" * 60)

X_multi = torch.randn(1000, 10)
# 3 learnable classes (0, 1, 2), derived from the first feature
y_multi = torch.bucketize(X_multi[:, 0], torch.tensor([-0.5, 0.5]))

dataset_multi = TensorDataset(X_multi, y_multi)
dataloader_multi = DataLoader(dataset_multi, batch_size=32, shuffle=True)

model_multi = SimpleModel(input_dim=10, hidden_dim=20, output_dim=3)
criterion_multi = nn.CrossEntropyLoss()  # expects logits + integer labels
optimizer_multi = optim.Adam(model_multi.parameters(), lr=0.001)
train_model(model_multi, dataloader_multi, criterion_multi, optimizer_multi, epochs=10)

# 5. Example: Regression with MSE vs Huber
print("\nExample 3: Regression with Huber Loss")
print("-" * 60)

X_reg = torch.randn(1000, 10)
y_reg = X_reg.mean(dim=1, keepdim=True) + torch.randn(1000, 1) * 0.5
# Add outliers
outlier_indices = torch.randperm(1000)[:50]
y_reg[outlier_indices] += torch.randn(50, 1) * 5

dataset_reg = TensorDataset(X_reg, y_reg)
dataloader_reg = DataLoader(dataset_reg, batch_size=32, shuffle=True)

model_reg = SimpleModel(input_dim=10, hidden_dim=20, output_dim=1)
criterion_huber = nn.SmoothL1Loss()  # Huber loss
optimizer_reg = optim.Adam(model_reg.parameters(), lr=0.001)
train_model(model_reg, dataloader_reg, criterion_huber, optimizer_reg, epochs=10)
```
Sometimes you need a custom loss. Here’s a template:
```python
class CustomLoss(nn.Module):
    def __init__(self, hyperparameter=1.0):
        super().__init__()
        self.hyperparameter = hyperparameter

    def forward(self, predictions, targets):
        """
        Args:
            predictions: Model outputs, shape (batch_size, ...)
            targets: Ground truth, shape (batch_size, ...)
        Returns:
            Scalar loss value
        """
        # Your custom loss computation
        loss = torch.mean((predictions - targets) ** 2)  # Example: MSE
        # Can add regularization, weighting, etc.
        loss = loss * self.hyperparameter
        return loss

# Usage
custom_loss = CustomLoss(hyperparameter=2.0)
predictions = torch.randn(8, 1)  # dummy model outputs
targets = torch.randn(8, 1)      # dummy ground truth
loss_value = custom_loss(predictions, targets)
```
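As a concrete use of this template, here is a sketch of the “Dice Loss + BCE combination” mentioned in the segmentation branch of the decision guide above. The class name, the weighting scheme, and the bce_weight default are illustrative choices, not a standard:

```python
import torch
import torch.nn as nn

class DiceBCELoss(nn.Module):
    """Weighted combination of Dice loss and BCE for binary segmentation."""
    def __init__(self, smooth=1.0, bce_weight=0.5):
        super().__init__()
        self.smooth = smooth
        self.bce_weight = bce_weight
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, logits, targets):
        bce_loss = self.bce(logits, targets)
        probs = torch.sigmoid(logits)
        intersection = (probs * targets).sum()
        dice_coeff = (2.0 * intersection + self.smooth) / (
            probs.sum() + targets.sum() + self.smooth)
        dice_loss = 1.0 - dice_coeff
        return self.bce_weight * bce_loss + (1 - self.bce_weight) * dice_loss

# Usage: logits from a segmentation model, binary pixel targets
logits = torch.randn(2, 1, 4, 4)
targets = torch.randint(0, 2, (2, 1, 4, 4)).float()
print(DiceBCELoss()(logits, targets))
```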
Loss functions are more than mathematical formulas—they encode what we want our models to learn. Choosing and understanding them is as important as designing the neural network architecture itself. Master loss functions, and you master the language of machine learning optimization. Every model, every task, every breakthrough started with someone asking: “What should I optimize for?” Now you have the knowledge to answer that question.
Happy learning, and may your losses always converge!