Dec 14, 2025
Continuing our series on machine learning, today we explore different types of model evaluation metrics. Imagine you built a cancer screening model that achieves 99% accuracy. Sounds impressive, right? Then you discover that the dataset has 99% healthy patients and 1% with cancer, and that your model just predicts “healthy” for everyone. It is 99% accurate and completely useless. This is the central lesson of evaluation metrics: a model is only as good as the metric used to measure it. Choosing the right metric is not just a technical afterthought; it is a fundamental design decision that determines whether your model actually solves the real-world problem.
When you train a model, you minimise a loss function. When you evaluate it, you compute a metric. These are distinct quantities:
The disconnect between loss and metric is where most real-world ML failures originate. Optimising cross-entropy does not directly optimise recall. A model with the lowest validation loss may not have the highest AUC.
Consider a fraud detection system: 99.9% of transactions are legitimate; 0.1% are fraudulent.
The naïve model is worse than useless: it catches zero fraud, yet it dominates on accuracy. This motivates metrics that account for class imbalance and for the different costs of different errors.
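The accuracy paradox can be reproduced in a few lines. The sketch below assumes a hypothetical set of 100,000 transactions with exactly 0.1% fraud and a naive model that labels everything as legitimate; the counts are illustrative, not taken from a real dataset.

```python
import numpy as np

# Hypothetical data: 100,000 transactions, 0.1% fraudulent (label 1).
n_total = 100_000
n_fraud = 100
y_true = np.zeros(n_total, dtype=int)
y_true[:n_fraud] = 1

# Naive model: always predict "legitimate" (label 0).
y_pred = np.zeros(n_total, dtype=int)

accuracy = float(np.mean(y_pred == y_true))
caught = int(np.sum((y_pred == 1) & (y_true == 1)))
recall = caught / n_fraud

print(f"Accuracy: {accuracy:.3%}")
print(f"Recall:   {recall:.1%}")
```

Accuracy rewards the model for the 99,900 easy negatives, while recall exposes that not a single fraud was caught.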
Every binary classifier makes two types of mistakes:
| Error Type | Also Called | Consequence |
|---|---|---|
| Predicting positive when actually negative | False Positive (Type I) | Wasted resources, false alarms |
| Predicting negative when actually positive | False Negative (Type II) | Missed threats, missed cases |
The relative cost of these errors depends entirely on the application:
For a binary classifier with classes Positive (P) and Negative (N), the confusion matrix tabulates the four possible prediction outcomes:
\[\begin{array}{c|cc} & \text{Predicted Positive} & \text{Predicted Negative} \\ \hline \text{Actual Positive} & \text{TP} & \text{FN} \\ \text{Actual Negative} & \text{FP} & \text{TN} \end{array}\]Total predictions: $N = \text{TP} + \text{TN} + \text{FP} + \text{FN}$
You test a spam filter on 1,000 emails:
| | Predicted Spam | Predicted Legitimate |
|---|---|---|
| Actual Spam | 150 (TP) | 50 (FN) |
| Actual Legitimate | 30 (FP) | 770 (TN) |
From these four numbers, every classification metric derives.
The confusion matrix is the complete summary of binary classifier behaviour. Every scalar metric (accuracy, precision, recall, F1, MCC) is a function of TP, TN, FP, FN. Understanding the matrix lets you:
The fraction of all predictions that are correct:
\[\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} = \frac{\text{TP} + \text{TN}}{N}\]Spam example: $(150 + 770) / 1000 = 0.92$
Failure mode: Misleading when classes are imbalanced. A model predicting all-negative on a 95/5 split gets 95% accuracy while being useless.
Use when: Classes are roughly balanced and both error types have similar cost.
Of all instances the model predicted as positive, what fraction are actually positive?
\[\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}\]Spam example: $150 / (150 + 30) = 0.833$
Intuition: If I flag an email as spam, how likely is it to actually be spam? A precision of 83.3% means that about 1 in 6 flagged emails is actually legitimate, which is a problem if important messages end up in the spam folder.
High precision requires: Few false positives. Achieved by being conservative, i.e., predicting positive only when very confident.
Of all instances that are actually positive, what fraction did the model correctly identify?
\[\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}\]Spam example: $150 / (150 + 50) = 0.75$
Intuition: Of all the spam emails, what fraction did we catch? A recall of 75% means 1 in 4 spam emails slips through.
High recall requires: Few false negatives. Achieved by being aggressive, i.e., predicting positive whenever there is any chance.
Of all actual negatives, what fraction did the model correctly identify?
\[\text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}}\]Spam example: $770 / (770 + 30) = 0.9625$
Intuition: Of all legitimate emails, what fraction did we correctly leave in the inbox? The complement $1 - \text{Specificity} = \text{FPR}$ is the False Positive Rate, one axis of the ROC curve.
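False Positive Rate, spam example: $\text{FPR} = \text{FP} / (\text{FP} + \text{TN}) = 30 / (30 + 770) = 0.0375$
False Negative Rate (miss rate, $1 - \text{Recall}$), spam example: $\text{FNR} = \text{FN} / (\text{FN} + \text{TP}) = 50 / (50 + 150) = 0.25$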
A balanced metric that works well even for very imbalanced classes:
\[\text{MCC} = \frac{\text{TP} \cdot \text{TN} - \text{FP} \cdot \text{FN}}{\sqrt{(\text{TP}+\text{FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN}+\text{FN})}}\]Range: $[-1, +1]$
Spam example: \(\text{MCC} = \frac{150 \cdot 770 - 30 \cdot 50}{\sqrt{180 \cdot 200 \cdot 800 \cdot 820}} = \frac{115500 - 1500}{\sqrt{23616000000}} \approx \frac{114000}{153675} \approx 0.742\)
MCC is often the single most informative metric for binary classification, especially with imbalanced data.
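The spam-example figures quoted so far can be checked directly from the four confusion-matrix counts. This is a minimal sketch using plain arithmetic; the variable names are illustrative.

```python
import math

# Counts from the spam-filter example.
TP, FN, FP, TN = 150, 50, 30, 770
N = TP + TN + FP + FN

accuracy = (TP + TN) / N
precision = TP / (TP + FP)
recall = TP / (TP + FN)           # true positive rate
specificity = TN / (TN + FP)      # true negative rate
fpr = FP / (FP + TN)              # 1 - specificity
fnr = FN / (FN + TP)              # 1 - recall
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)
)

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("specificity", specificity),
                    ("FPR", fpr), ("FNR", fnr), ("MCC", mcc)]:
    print(f"{name:>11}: {value:.4f}")
```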
Most classifiers output a probability or score, not a hard class label. The decision threshold $\tau$ converts this:
\[\hat{y} = \begin{cases} 1 & \text{if } P(\text{positive} \mid \mathbf{x}) \geq \tau \\ 0 & \text{otherwise} \end{cases}\]The default is $\tau = 0.5$, but this is rarely optimal.
The trade-off is unavoidable: lowering $\tau$ raises recall but typically lowers precision, and raising $\tau$ does the reverse. You cannot simultaneously maximise both without a better model.
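To make the effect of the threshold concrete, the sketch below sweeps $\tau$ over three values on a small set of hypothetical labels and scores (invented for illustration) and reports precision and recall at each. The helper name `precision_recall_at` is an assumption of this example.

```python
import numpy as np

# Hypothetical labels and model scores, for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_scores = np.array([0.9, 0.1, 0.8, 0.4, 0.2, 0.7, 0.6, 0.3, 0.85, 0.15])


def precision_recall_at(y_true, y_scores, tau):
    """Precision and recall after thresholding the scores at tau."""
    y_pred = (y_scores >= tau).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return float(precision), float(recall)


results = {tau: precision_recall_at(y_true, y_scores, tau)
           for tau in (0.25, 0.5, 0.75)}
for tau, (p, r) in results.items():
    print(f"tau={tau:.2f}  precision={p:.3f}  recall={r:.3f}")
```

On this toy data, raising the threshold from 0.25 to 0.75 moves precision up and recall down, which is the trade-off in miniature.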
Given the cost of a false positive $C_{FP}$ and of a false negative $C_{FN}$, and assuming the model's probabilities are well calibrated and correct decisions cost nothing, the cost-minimising threshold is:
\[\tau^* = \frac{C_{FP}}{C_{FP} + C_{FN}}\]
Example: In fraud detection, missing a \$10,000 fraud ($C_{FN} = 10{,}000$) is far worse than triggering a review incorrectly ($C_{FP} = 50$): \(\tau^* = \frac{50}{50 + 10000} \approx 0.005\)
With such a low threshold ($\tau^* \approx 0.005$), any transaction whose predicted fraud probability exceeds about 0.5% is flagged for review, accepting many false alarms in order to minimise missed frauds.
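The threshold formula follows from comparing expected costs: flagging a transaction with fraud probability $p$ costs $(1 - p)\,C_{FP}$ in expectation, while not flagging costs $p\,C_{FN}$. The sketch below checks this numerically, assuming calibrated probabilities and zero cost for correct decisions; the function names are hypothetical.

```python
def cost_optimal_threshold(cost_fp, cost_fn):
    """Threshold at which flagging and not flagging cost the same."""
    return cost_fp / (cost_fp + cost_fn)


def expected_costs(p_fraud, cost_fp, cost_fn):
    """Expected cost of (flagging, not flagging) a transaction."""
    cost_if_flagged = (1 - p_fraud) * cost_fp      # pay only if it was legitimate
    cost_if_not_flagged = p_fraud * cost_fn        # pay only if it was fraud
    return cost_if_flagged, cost_if_not_flagged


C_FP, C_FN = 50, 10_000
tau_star = cost_optimal_threshold(C_FP, C_FN)
print(f"tau* = {tau_star:.4f}")

for p in (0.001, 0.01):
    flag, skip = expected_costs(p, C_FP, C_FN)
    decision = "flag" if flag < skip else "do not flag"
    print(f"p={p:.3f}: flag={flag:.2f}, skip={skip:.2f} -> {decision}")
```

A transaction with a 1% fraud probability is worth flagging under these costs, while one at 0.1% is not, consistent with a threshold near 0.5%.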
The F1-score is the harmonic mean of precision and recall:
\[F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\,\text{TP}}{2\,\text{TP} + \text{FP} + \text{FN}}\]Why harmonic mean, not arithmetic mean?
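The harmonic mean punishes extreme imbalances between precision and recall. A model with Precision = 1.0 and Recall = 0.0 is useless, yet its arithmetic mean is a respectable-looking 0.5, whereas its harmonic mean (and hence its $F_1$) is 0.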
Spam example: $F_1 = 2 \cdot 0.833 \cdot 0.75 / (0.833 + 0.75) = 0.789$
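A quick numerical comparison shows how the two means differ. This sketch evaluates both on the degenerate case (precision 1.0, recall 0.0) and on the spam example; the helper names are illustrative.

```python
def arithmetic_mean(p, r):
    return (p + r) / 2


def harmonic_mean(p, r):
    # Defined as 0 when both inputs are 0 to avoid division by zero.
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


cases = {
    "degenerate (P=1.0, R=0.0)": (1.0, 0.0),
    "spam example": (150 / 180, 150 / 200),
}
for name, (p, r) in cases.items():
    print(f"{name}: arithmetic={arithmetic_mean(p, r):.4f}, "
          f"harmonic={harmonic_mean(p, r):.4f}")
```

When precision and recall are close, as in the spam example, the two means nearly agree; they diverge sharply only when one of the two collapses.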
Use when: You want a single number that balances precision and recall and classes are imbalanced.
When one error type is more costly than the other, use the generalised $F_\beta$-score:
\[F_\beta = \frac{(1 + \beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}\]F2-score ($\beta = 2$, recall twice as important):
\[F_2 = \frac{5 \cdot \text{Precision} \cdot \text{Recall}}{4 \cdot \text{Precision} + \text{Recall}}\]F0.5-score ($\beta = 0.5$, precision twice as important):
\[F_{0.5} = \frac{1.25 \cdot \text{Precision} \cdot \text{Recall}}{0.25 \cdot \text{Precision} + \text{Recall}}\]The $F_\beta$ score is defined as the weighted harmonic mean of precision and recall:
\[\frac{1}{F_\beta} = \frac{1}{1+\beta^2} \cdot \frac{1}{\text{Precision}} + \frac{\beta^2}{1+\beta^2} \cdot \frac{1}{\text{Recall}}\]
The weight $\beta^2$ on the recall term means recall is treated as $\beta$ times as important as precision. Solving for $F_\beta$ gives the formula above.
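As a worked example, the sketch below evaluates $F_{0.5}$, $F_1$ and $F_2$ on the spam filter using the closed-form $F_\beta$ formula. The function name `f_beta_score` is an assumption of this example.

```python
def f_beta_score(precision, recall, beta=1.0):
    """F-beta from precision and recall; returns 0 if both are 0."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom > 0 else 0.0


# Spam example: TP = 150, FP = 30, FN = 50.
P = 150 / (150 + 30)
R = 150 / (150 + 50)

f_half = f_beta_score(P, R, beta=0.5)
f_one = f_beta_score(P, R, beta=1.0)
f_two = f_beta_score(P, R, beta=2.0)
print(f"F0.5={f_half:.4f}  F1={f_one:.4f}  F2={f_two:.4f}")
```

Because this filter has higher precision than recall, the precision-leaning $F_{0.5}$ comes out highest and the recall-leaning $F_2$ lowest.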
The ROC curve plots True Positive Rate (Recall) against False Positive Rate as the decision threshold varies from 1 to 0:
\[\text{TPR}(\tau) = \frac{\text{TP}(\tau)}{\text{TP}(\tau) + \text{FN}(\tau)}, \qquad \text{FPR}(\tau) = \frac{\text{FP}(\tau)}{\text{FP}(\tau) + \text{TN}(\tau)}\]Tracing the curve: As $\tau$ decreases from 1 to 0, both TPR and FPR increase, tracing a curve from $(0, 0)$ to $(1, 1)$.
| Point | $\tau$ | Meaning |
|---|---|---|
| $(0, 0)$ | $\tau = 1$ | Never predict positive: TPR = 0, FPR = 0 |
| $(1, 1)$ | $\tau = 0$ | Always predict positive: TPR = 1, FPR = 1 |
| $(0, 1)$ | Optimal | Perfect classifier: TPR = 1, FPR = 0 |
The diagonal line $\text{TPR} = \text{FPR}$ represents a random classifier.
The AUC summarises the ROC curve as a single number:
\[\text{AUC} = \int_0^1 \text{TPR}(\text{FPR}) \, d(\text{FPR})\]Range: $[0, 1]$
Probabilistic interpretation: AUC equals the probability that the model assigns a higher score to a randomly chosen positive instance than to a randomly chosen negative instance:
\[\text{AUC} = P(\text{score}(\text{positive}) > \text{score}(\text{negative}))\]This makes AUC a threshold-free metric — it evaluates the quality of the ranking, not the quality of any particular threshold.
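The ranking interpretation can be computed literally by comparing every positive with every negative. The sketch below does so on hypothetical labels and scores (invented for illustration) and counts ties as half a win, which is a common convention but an assumption here.

```python
import numpy as np


def auc_pairwise(y_true, y_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    y_scores = np.asarray(y_scores, dtype=float)
    pos = y_scores[y_true == 1]
    neg = y_scores[y_true == 0]
    # diff[i, j] = score of positive i minus score of negative j
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (len(pos) * len(neg)))


# Hypothetical data, for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_scores = [0.9, 0.1, 0.8, 0.4, 0.2, 0.7, 0.6, 0.3, 0.85, 0.15]

auc_value = auc_pairwise(y_true, y_scores)
print(f"Pairwise AUC: {auc_value:.2f}")
```

Of the 25 positive-negative pairs, only one is ranked the wrong way round (the positive scored 0.4 against the negative scored 0.6), giving 24/25. This pairwise form is quadratic in the number of samples, so it is meant for understanding rather than for large datasets.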
Given sorted thresholds with corresponding $(FPR_i, TPR_i)$ pairs:
\[\text{AUC} \approx \sum_{i=1}^{m} (FPR_i - FPR_{i-1}) \cdot \frac{TPR_i + TPR_{i-1}}{2}\]This is the trapezoidal approximation of the integral.
For highly imbalanced datasets (e.g., fraud at 0.1%), the ROC curve can be misleading because FPR involves TN, which is enormous and makes FPR appear small even when many false positives exist.
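A small arithmetic example shows how a healthy-looking ROC operating point can hide poor precision. The counts below are hypothetical, chosen only to match the 0.1% fraud rate mentioned above.

```python
# Hypothetical counts: 1,000,000 transactions, 1,000 of them fraudulent.
n_total = 1_000_000
n_fraud = 1_000

TP, FN = 800, 200                 # 80% of fraud is caught
FP = 9_990                        # false alarms among legitimate transactions
TN = n_total - n_fraud - FP

tpr = TP / (TP + FN)
fpr = FP / (FP + TN)
precision = TP / (TP + FP)

print(f"TPR (recall): {tpr:.3f}")
print(f"FPR:          {fpr:.3f}")
print(f"Precision:    {precision:.3f}")
```

An operating point of 80% TPR at 1% FPR looks excellent on a ROC plot, yet fewer than 1 in 12 flagged transactions is actually fraudulent, which is what the precision axis of a PR curve makes visible.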
The Precision-Recall curve plots Precision vs. Recall across thresholds, and is more informative when:
The Average Precision is the area under the PR curve, approximated as:
\[\text{AP} = \sum_{k=1}^{n} (R_k - R_{k-1}) \cdot P_k\]Where $P_k$ and $R_k$ are precision and recall at the $k$-th threshold.
Range: $[0, 1]$; higher is better. A random classifier’s baseline is the class prevalence $\pi = (\text{TP} + \text{FN}) / N$, the fraction of instances that are actually positive.
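The AP sum can be sketched by treating each rank in the score-sorted list as a threshold. The example below assumes there are no tied scores and at least one positive; the data and the function name `average_precision` are hypothetical.

```python
import numpy as np


def average_precision(y_true, y_scores):
    """AP = sum_k (R_k - R_{k-1}) * P_k over ranks in descending score order."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(y_scores, dtype=float), kind="stable")
    hits = y_true[order]                      # 1 where the k-th ranked item is positive
    tp_cum = np.cumsum(hits)
    ranks = np.arange(1, len(hits) + 1)
    precision_at_k = tp_cum / ranks
    recall_at_k = tp_cum / hits.sum()
    recall_prev = np.concatenate([[0.0], recall_at_k[:-1]])
    return float(np.sum((recall_at_k - recall_prev) * precision_at_k))


# Hypothetical data, for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_scores = [0.9, 0.1, 0.8, 0.4, 0.2, 0.7, 0.6, 0.3, 0.85, 0.15]

ap = average_precision(y_true, y_scores)
prevalence = float(np.mean(y_true))
print(f"AP = {ap:.4f} (random baseline = {prevalence:.2f})")
```

Only ranks where a positive is found change the recall, so only those ranks contribute to the sum.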
The PR curve is typically jagged. The interpolated precision at recall level $r$ takes the maximum precision achievable at recall $\geq r$:
\[P_{interp}(r) = \max_{\tilde{r} \geq r} P(\tilde{r})\]This is used in object detection evaluation (mAP).
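Interpolation amounts to a running maximum taken from the high-recall end of the curve. The sketch below applies it to a handful of hypothetical PR points (sorted by increasing recall); the numbers are invented for illustration.

```python
import numpy as np

# Hypothetical PR points, sorted by increasing recall.
recall = np.array([0.2, 0.4, 0.4, 0.6, 0.8, 1.0])
precision = np.array([1.0, 1.0, 0.667, 0.75, 0.8, 0.714])

# Running maximum from right to left: max precision at any recall >= r.
precision_interp = np.maximum.accumulate(precision[::-1])[::-1]

for r, p, pi in zip(recall, precision, precision_interp):
    print(f"recall={r:.1f}  precision={p:.3f}  interpolated={pi:.3f}")
```

The dips at 0.667 and 0.75 are lifted to 0.8, because a higher precision is achievable at an even higher recall.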
When there are $K > 2$ classes, metrics extend through averaging strategies.
For each class $k$ treated as the “positive” class (all others are “negative”):
\[\text{Precision}_k = \frac{\text{TP}_k}{\text{TP}_k + \text{FP}_k}, \qquad \text{Recall}_k = \frac{\text{TP}_k}{\text{TP}_k + \text{FN}_k}, \qquad F_{1,k} = \frac{2 \cdot \text{Precision}_k \cdot \text{Recall}_k}{\text{Precision}_k + \text{Recall}_k}\]Compute metric per class, then take the unweighted mean:
\[\text{Precision}_{\text{macro}} = \frac{1}{K} \sum_{k=1}^{K} \text{Precision}_k\]Properties: Treats all classes equally regardless of support (number of instances). Good when all classes are equally important. Penalises poor performance on rare classes.
Pool all TP, FP, FN across classes, then compute metric:
\[\text{Precision}_{\text{micro}} = \frac{\sum_{k=1}^{K} \text{TP}_k}{\sum_{k=1}^{K} (\text{TP}_k + \text{FP}_k)}\]Properties: Gives equal weight to each instance, not each class. Dominated by the most frequent classes. For multi-class, micro-precision equals micro-recall equals accuracy (when each instance is assigned exactly one class).
Compute metric per class, then take the support-weighted mean:
\[\text{Precision}_{\text{weighted}} = \frac{\sum_{k=1}^{K} n_k \cdot \text{Precision}_k}{\sum_{k=1}^{K} n_k}\]Where $n_k$ is the number of actual instances of class $k$.
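Properties: Accounts for class imbalance while still computing per-class metrics. Useful as an aggregate for imbalanced multi-class problems, but because frequent classes dominate the average, it can hide poor performance on rare classes.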
| Strategy | Use When |
|---|---|
| Macro | All classes equally important; want to penalise poor performance on rare classes |
| Micro | Individual instances equally important; dominated by frequent classes |
| Weighted | Imbalanced classes; want aggregate that reflects class distribution |
| Per-class | Need to understand performance on a specific class |
The confusion matrix generalises naturally to $K$ classes: a $K \times K$ matrix where row $i$ is the true class and column $j$ is the predicted class. Diagonal elements are correct predictions; off-diagonal elements show which classes the model confuses.
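The three averaging strategies can be compared on one small multi-class confusion matrix. The 3-class matrix below is hypothetical (rows are true classes, columns are predicted classes), and the sketch assumes every class is predicted at least once so that no per-class precision divides by zero.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
cm = np.array([[8, 1, 1],
               [2, 6, 2],
               [0, 0, 5]])

tp = np.diag(cm)                  # correct predictions per class
support = cm.sum(axis=1)          # actual instances per class (n_k)
predicted = cm.sum(axis=0)        # predictions per class (TP_k + FP_k)

precision_per_class = tp / predicted
macro = float(precision_per_class.mean())
micro = float(tp.sum() / predicted.sum())
weighted = float(np.sum(support * precision_per_class) / support.sum())
accuracy = float(tp.sum() / cm.sum())

print("per-class:", np.round(precision_per_class, 3))
print(f"macro={macro:.4f}  micro={micro:.4f}  weighted={weighted:.4f}")
print(f"accuracy={accuracy:.4f}")
```

Micro-precision coincides with accuracy here because each instance receives exactly one predicted class, while macro and weighted differ according to how much weight the smallest class gets.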
When the target is continuous, we need metrics that measure the magnitude of errors.
Properties:
Example: Predicting house prices in \$1,000s. MAE = 15 means predictions are off by \$15,000 on average.
Properties:
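Mathematical relationship: for an estimator, $\text{MSE} = \text{Bias}^2 + \text{Variance}$ when decomposed; for predictions of a noisy target, the expected squared error additionally includes an irreducible noise term.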
Properties:
Properties:
Where:
Interpretation: Fraction of the target’s variance explained by the model.
Range: $(-\infty, 1]$
Warning: On the training data, $R^2$ never decreases when you add more features to a least-squares linear model, even if they are pure noise. Use Adjusted R² to penalise unnecessary complexity:
\[R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}\]Where $p$ is the number of predictors and $n$ is sample size.
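The effect of adding noise features can be observed directly. The sketch below fits ordinary least squares with NumPy on synthetic data (one informative feature, then the same feature plus five pure-noise columns) and reports training $R^2$ and adjusted $R^2$. All data and helper names are assumptions of this example.

```python
import numpy as np


def ols_train_r2(X, y):
    """Training R² of a least-squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    ss_res = np.sum((y - X1 @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1 - ss_res / ss_tot)


def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Synthetic data: y depends on x only; the extra columns are pure noise.
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=(n, 1))
y = 2.0 * x[:, 0] + rng.normal(size=n)
noise = rng.normal(size=(n, 5))

r2_base = ols_train_r2(x, y)
r2_noise = ols_train_r2(np.column_stack([x, noise]), y)

print(f"1 feature : R2={r2_base:.4f}  adj={adjusted_r2(r2_base, n, 1):.4f}")
print(f"6 features: R2={r2_noise:.4f}  adj={adjusted_r2(r2_noise, n, 6):.4f}")
```

Training $R^2$ cannot go down when columns are added, because the smaller model is a special case of the larger one; adjusted $R^2$ charges for the extra parameters and may go down.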
A compromise between MAE and MSE that is quadratic for small errors and linear for large ones:
\[L_\delta(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \leq \delta \\ \delta \cdot \left(|y - \hat{y}| - \frac{\delta}{2}\right) & \text{otherwise} \end{cases}\]Properties: Differentiable everywhere, robust to outliers, controlled by $\delta$ (the transition point).
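The piecewise definition translates almost line for line into NumPy. This is a minimal sketch returning the per-sample loss; the function name `huber_loss` and the sample values are illustrative.

```python
import numpy as np


def huber_loss(y_true, y_pred, delta=1.0):
    """Per-sample Huber loss: quadratic within delta, linear beyond it."""
    err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return np.where(err <= delta, quadratic, linear)


# Errors of 0.5 (quadratic branch) and 3.0 (linear branch) with delta = 1.
losses = huber_loss([0.0, 0.0], [0.5, 3.0], delta=1.0)
print(losses)
```

For the large error of 3.0, the Huber loss is 2.5 where the squared-error term $\frac{1}{2}e^2$ would be 4.5, which is how the linear branch limits the influence of outliers.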
| Metric | Units | Outlier Sensitivity | Interpretability | Use When |
|---|---|---|---|---|
| MAE | Same as $y$ | Low | High | Outliers present; interpretability needed |
| MSE | Squared | High | Low | Differentiable training loss; outliers matter |
| RMSE | Same as $y$ | High | Medium | Standard reporting; same scale as MAE |
| MAPE | % | Medium | Very High | Comparing across scales; no zeros in $y$ |
| R² | Unitless | High | Very High | Explaining variance; comparing models |
A single train/test split can give misleading estimates. Cross-validation uses the data more efficiently to get a reliable performance estimate with uncertainty.
Divide the data into $k$ equal folds. For each fold $i$:
Final estimate: mean and standard deviation across $k$ scores.
\[\text{CV score} = \frac{1}{k} \sum_{i=1}^{k} \text{metric}_i, \qquad \text{std} = \sqrt{\frac{1}{k}\sum_{i=1}^{k}(\text{metric}_i - \text{CV score})^2}\]Common choices: $k = 5$ or $k = 10$. Larger $k$:
When classes are imbalanced, standard k-fold may create folds with no minority class instances. Stratified k-fold ensures each fold has approximately the same class proportions as the full dataset.
Essential for: Imbalanced classification. Gives more reliable estimates than standard k-fold.
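One simple way to build stratified folds is to shuffle the indices of each class separately and deal them out across the folds. The sketch below does this with NumPy; it is one possible approach rather than a reference implementation, and the function name and toy labels are hypothetical.

```python
import numpy as np


def stratified_kfold_indices(y, k=5, random_state=0):
    """Return k index arrays whose class proportions mirror those of y."""
    y = np.asarray(y)
    rng = np.random.default_rng(random_state)
    folds = [[] for _ in range(k)]
    for label in np.unique(y):
        # Shuffle this class's indices, then split them evenly across folds.
        idx = rng.permutation(np.flatnonzero(y == label))
        for i, chunk in enumerate(np.array_split(idx, k)):
            folds[i].extend(chunk.tolist())
    return [np.array(sorted(f)) for f in folds]


# Toy labels: 90 negatives and 10 positives.
y = np.array([0] * 90 + [1] * 10)
folds = stratified_kfold_indices(y, k=5)
for i, f in enumerate(folds):
    print(f"fold {i}: size={len(f)}, positives={int(y[f].sum())}")
```

Each fold serves once as the validation set, with the remaining folds concatenated for training, exactly as in standard k-fold.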
Special case of k-fold where $k = n$: each fold has one sample as validation.
\[\text{LOO-CV} = \frac{1}{n} \sum_{i=1}^{n} \text{metric}(y_i, \hat{y}_{-i})\]Where $\hat{y}_{-i}$ is the prediction for sample $i$ when trained on all other samples.
Properties:
For time-series data, future data must never be used to predict the past. Standard k-fold would leak future information.
Walk-forward validation uses an expanding window:
This strictly respects temporal ordering.
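An expanding-window split can be sketched as a generator of index pairs. The parameter names (`n_splits`, `min_train`) and the equal-sized test blocks are assumptions of this example; samples are assumed to be in chronological order.

```python
import numpy as np


def walk_forward_splits(n_samples, n_splits, min_train):
    """Yield (train_idx, test_idx) pairs with an expanding training window."""
    test_size = (n_samples - min_train) // n_splits
    if test_size < 1:
        raise ValueError("not enough samples for the requested splits")
    for i in range(n_splits):
        train_end = min_train + i * test_size
        test_end = train_end + test_size
        # Train on everything before train_end, test on the next block.
        yield np.arange(0, train_end), np.arange(train_end, test_end)


splits = list(walk_forward_splits(n_samples=10, n_splits=3, min_train=4))
for train_idx, test_idx in splits:
    print(f"train={train_idx.tolist()}  test={test_idx.tolist()}")
```

Every test index is later than every training index in the same split, so no future information reaches the model.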
Run k-fold $m$ times with different random splits, giving $k \times m$ estimates. Reduces variance of the CV estimate at the cost of $m \times$ computation. Common: $5 \times 2$-fold or $10 \times 3$-fold.
When tuning hyperparameters and evaluating simultaneously, use nested CV:
This prevents the optimistic bias that arises when the best hyperparameters are selected on the same folds used to report performance.
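Nested CV can be sketched as two loops: an inner k-fold that picks a hyperparameter using only the outer training data, and an outer k-fold that scores the refitted model on data the selection never saw. The example below uses closed-form ridge regression without an intercept on synthetic data as a stand-in model; the model, the data and all function names are assumptions of this sketch.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge weights (no intercept, for brevity)."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)


def mse_score(X, y, w):
    return float(np.mean((y - X @ w) ** 2))


def split_folds(n, k, rng):
    return np.array_split(rng.permutation(n), k)


def select_alpha(X, y, alphas, k, rng):
    """Inner loop: pick the alpha with the lowest mean validation MSE."""
    folds = split_folds(len(y), k, rng)
    mean_scores = []
    for alpha in alphas:
        scores = []
        for i in range(k):
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            w = ridge_fit(X[train], y[train], alpha)
            scores.append(mse_score(X[folds[i]], y[folds[i]], w))
        mean_scores.append(np.mean(scores))
    return alphas[int(np.argmin(mean_scores))]


def nested_cv(X, y, alphas, outer_k=5, inner_k=3, random_state=0):
    """Outer loop: score on folds that played no part in selecting alpha."""
    rng = np.random.default_rng(random_state)
    outer = split_folds(len(y), outer_k, rng)
    outer_scores = []
    for i in range(outer_k):
        train = np.concatenate([outer[j] for j in range(outer_k) if j != i])
        best_alpha = select_alpha(X[train], y[train], alphas, inner_k, rng)
        w = ridge_fit(X[train], y[train], best_alpha)
        outer_scores.append(mse_score(X[outer[i]], y[outer[i]], w))
    return np.array(outer_scores)


# Synthetic regression data, for illustration only.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * data_rng.normal(size=60)

outer_scores = nested_cv(X, y, alphas=[0.01, 0.1, 1.0, 10.0])
print(f"Nested CV MSE: {outer_scores.mean():.4f} +/- {outer_scores.std():.4f}")
```

The selected alpha may differ between outer folds; the nested estimate describes the whole procedure (selection plus fitting), not any single hyperparameter value.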
Start: What type of problem?
├── Classification
│   ├── Binary
│   │   ├── Balanced classes?
│   │   │   ├── Yes → Accuracy or F1
│   │   │   └── No → F1, MCC, or AUC-PR
│   │   ├── FP cost >> FN cost? → High Precision, use F-beta (β < 1)
│   │   ├── FN cost >> FP cost? → High Recall, use F-beta (β > 1)
│   │   └── Need threshold-free eval? → ROC-AUC
│   └── Multi-class
│       ├── All classes equally important? → Macro F1
│       ├── Instance-level importance? → Micro F1 (= Accuracy)
│       └── Imbalanced? → Weighted F1 or per-class metrics
└── Regression
    ├── Outliers present? → MAE or Huber
    ├── Outliers important? → MSE or RMSE
    ├── Need scale-invariant? → MAPE or RMSLE
    └── Need explained variance? → R²
| Domain | Preferred Metrics | Reason |
|---|---|---|
| Medical diagnosis | Recall, F2-score | Missing a disease (FN) is worse than over-diagnosing (FP) |
| Spam filtering | Precision, F0.5-score | Misclassifying legitimate email (FP) is worse than missing spam (FN) |
| Fraud detection | Recall, AUC-PR | Rare positives; missing fraud is costly |
| Search ranking | NDCG, MAP | Ranking quality matters; not binary |
| Autonomous driving | Recall, Specificity | Both FP and FN are dangerous |
| House price prediction | RMSE, R² | Symmetric errors; interpretability |
| Demand forecasting | MAPE | Need scale-independent comparison across products |
import numpy as np


def confusion_matrix_binary(y_true, y_pred):
    """Compute TP, TN, FP, FN from binary (0/1) arrays."""
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    TP = int(np.sum((y_pred == 1) & (y_true == 1)))
    TN = int(np.sum((y_pred == 0) & (y_true == 0)))
    FP = int(np.sum((y_pred == 1) & (y_true == 0)))
    FN = int(np.sum((y_pred == 0) & (y_true == 1)))
    return TP, TN, FP, FN


def accuracy(y_true, y_pred):
    TP, TN, FP, FN = confusion_matrix_binary(y_true, y_pred)
    return (TP + TN) / (TP + TN + FP + FN)


def precision(y_true, y_pred):
    TP, _, FP, _ = confusion_matrix_binary(y_true, y_pred)
    # Convention: return 0 when nothing was predicted positive.
    return TP / (TP + FP) if (TP + FP) > 0 else 0.0


def recall(y_true, y_pred):
    TP, _, _, FN = confusion_matrix_binary(y_true, y_pred)
    # Convention: return 0 when there are no actual positives.
    return TP / (TP + FN) if (TP + FN) > 0 else 0.0


def f_beta(y_true, y_pred, beta=1.0):
    """
    F-beta score: weighted harmonic mean of precision and recall,
    with recall treated as beta times as important as precision.
    """
    p = precision(y_true, y_pred)
    r = recall(y_true, y_pred)
    b2 = beta ** 2
    denom = b2 * p + r
    return (1 + b2) * p * r / denom if denom > 0 else 0.0


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient (0 when any marginal is empty)."""
    TP, TN, FP, FN = confusion_matrix_binary(y_true, y_pred)
    # Multiply as floats so the product cannot overflow on large datasets.
    denom = np.sqrt(float(TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
    return (TP * TN - FP * FN) / denom if denom > 0 else 0.0
def roc_curve_scratch(y_true, y_scores):
    """
    Compute ROC curve points by sweeping the decision threshold.
    Returns (fpr, tpr, thresholds) sorted ascending by FPR.
    fpr and tpr include the (0, 0) and (1, 1) endpoints, so they are
    two elements longer than thresholds.
    """
    y_true = np.array(y_true)
    y_scores = np.array(y_scores)
    thresholds = np.sort(np.unique(y_scores))[::-1]  # descending
    P = y_true.sum()
    N = len(y_true) - P
    fpr_list, tpr_list = [0.0], [0.0]
    for tau in thresholds:
        y_pred = (y_scores >= tau).astype(int)
        TP, TN, FP, FN = confusion_matrix_binary(y_true, y_pred)
        fpr_list.append(FP / N if N > 0 else 0.0)
        tpr_list.append(TP / P if P > 0 else 0.0)
    fpr_list.append(1.0)
    tpr_list.append(1.0)
    return np.array(fpr_list), np.array(tpr_list), thresholds


def auc_trapezoid(fpr, tpr):
    """Area under the curve using the trapezoid rule (fpr ascending)."""
    fpr, tpr = np.asarray(fpr), np.asarray(tpr)
    # Written out explicitly: np.trapz is deprecated since NumPy 2.0.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
# ── Regression metrics ─────────────────────────────────────────────────────────
def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(np.array(y_true) - np.array(y_pred)))


def mse(y_true, y_pred):
    """Mean squared error."""
    return np.mean((np.array(y_true) - np.array(y_pred)) ** 2)


def rmse(y_true, y_pred):
    """Root mean squared error (same units as y)."""
    return np.sqrt(mse(y_true, y_pred))


def mape(y_true, y_pred):
    """Mean absolute percentage error; samples with y_true == 0 are skipped."""
    y_true = np.array(y_true, dtype=float)
    y_pred = np.array(y_pred, dtype=float)
    mask = y_true != 0
    return np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100


def r_squared(y_true, y_pred):
    """Coefficient of determination; returns 0 for a constant target."""
    y_true = np.array(y_true, dtype=float)
    y_pred = np.array(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot if ss_tot != 0 else 0.0


def adjusted_r_squared(y_true, y_pred, n_features):
    """Adjusted R²; requires more than n_features + 1 samples."""
    n = len(y_true)
    if n - n_features - 1 <= 0:
        raise ValueError("need more than n_features + 1 samples")
    r2 = r_squared(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
# ── k-Fold Cross-Validation ────────────────────────────────────────────────────
def kfold_cv(model, X, y, k=5, metric_fn=None, random_state=42):
    """
    k-fold cross-validation returning per-fold scores.

    Parameters
    ----------
    model : object with .fit(X, y) and .predict(X)
    X : array (n_samples, n_features)
    y : array (n_samples,)
    k : int, number of folds
    metric_fn : callable(y_true, y_pred) → float; defaults to accuracy
    random_state : int, seed for shuffling the samples
    """
    if metric_fn is None:
        metric_fn = accuracy
    # A local generator avoids mutating NumPy's global random state.
    rng = np.random.default_rng(random_state)
    X, y = np.array(X), np.array(y)
    indices = rng.permutation(len(y))
    # array_split spreads any remainder across folds, so every sample is
    # validated exactly once even when n is not divisible by k.
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X[train_idx], y[train_idx])
        y_pred = model.predict(X[val_idx])
        scores.append(metric_fn(y[val_idx], y_pred))
    return np.array(scores)
# Example usage
if __name__ == "__main__":
    y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
    y_scores = [0.9, 0.1, 0.8, 0.4, 0.2, 0.7, 0.6, 0.3, 0.85, 0.15]

    print("Classification Metrics:")
    print(f" Accuracy : {accuracy(y_true, y_pred):.4f}")
    print(f" Precision : {precision(y_true, y_pred):.4f}")
    print(f" Recall : {recall(y_true, y_pred):.4f}")
    print(f" F1-Score : {f_beta(y_true, y_pred, beta=1.0):.4f}")
    print(f" F2-Score : {f_beta(y_true, y_pred, beta=2.0):.4f}")
    print(f" MCC : {mcc(y_true, y_pred):.4f}")
    fpr, tpr, _ = roc_curve_scratch(y_true, y_scores)
    print(f" AUC-ROC : {auc_trapezoid(fpr, tpr):.4f}")

    y_reg_true = [3.0, -0.5, 2.0, 7.0]
    y_reg_pred = [2.5, 0.0, 2.0, 8.0]
    print("\nRegression Metrics:")
    print(f" MAE : {mae(y_reg_true, y_reg_pred):.4f}")
    print(f" MSE : {mse(y_reg_true, y_reg_pred):.4f}")
    print(f" RMSE : {rmse(y_reg_true, y_reg_pred):.4f}")
    print(f" R² : {r_squared(y_reg_true, y_reg_pred):.4f}")
| Metric | Range | Imbalance-robust | Threshold-free | Interpretable |
|---|---|---|---|---|
| Accuracy | [0,1] | ❌ | ❌ | ✅ |
| Precision | [0,1] | Partly | ❌ | ✅ |
| Recall | [0,1] | Partly | ❌ | ✅ |
| F1 | [0,1] | ✅ | ❌ | ✅ |
| F-beta | [0,1] | ✅ | ❌ | ✅ |
| MCC | [-1,1] | ✅ | ❌ | ✅ |
| ROC-AUC | [0,1] | Partly | ✅ | Medium |
| AUC-PR | [0,1] | ✅ | ✅ | Medium |
In summary, every evaluation metric, for classification and regression alike, has its own advantages and limitations. Choosing the right one depends on the problem, the relative cost of each type of error, and the evaluation goals.
Here are some best practices to follow when evaluating classification models: