Dec 07, 2025
Continuing our journey through machine learning, today we explore Random Forests, the ensemble method that combines the wisdom of multiple decision trees. Imagine you’re trying to decide which movie to watch. You could ask one friend for their opinion, but they might have peculiar taste. Instead, you poll ten friends and go with the majority vote. Even if some friends have quirky preferences, the group consensus is usually more reliable than any single opinion. This is the power of ensemble learning: combining multiple “weak” predictors to create a strong one. Random Forests apply this wisdom to decision trees, creating one of machine learning’s most powerful and widely-used algorithms.
In the previous tutorial, we learned that decision trees are:
But they have a critical weakness:
Random Forests solve this by training many trees and combining their predictions. The result is a model that:
Random Forests excel because they:
Decision trees are highly sensitive to small changes in training data. Consider training a decision tree on slightly different data samples:
The trees look completely different despite coming from the same distribution! This is high variance.
Mathematical Definition:
In statistical learning theory, for a learning algorithm that produces an estimator $\hat{f}$ from a training set $\mathcal{D}$, variance measures how much its predictions change across different training sets:
\[\text{Variance}[\hat{f}(x)] = \mathbb{E}_{\mathcal{D}}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2]\]

Where:
Decision trees have high variance because their greedy, hierarchical structure amplifies small data changes.
Decision trees are low bias but high variance models. Total prediction error decomposes into:
\[\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}\]

Single Decision Trees:
Random Forests:
Ensemble methods combine multiple models to improve performance. The key idea is that a group of weak learners can become a strong learner through aggregation: if you have $N$ predictors whose errors are largely independent, each with error rate $\epsilon$ better than chance, averaging or voting over them yields an error rate lower than $\epsilon$.
For Regression:
Suppose we have $B$ independent models $\hat{f}_1(x), \hat{f}_2(x), \ldots, \hat{f}_B(x)$, each with variance $\sigma^2$.
The average prediction: \(\hat{f}_{\text{avg}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x)\)
Has variance: \(\text{Var}[\hat{f}_{\text{avg}}(x)] = \frac{\sigma^2}{B}\)
Result: Variance decreases by factor of $B$!
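To make this concrete, here is a small NumPy sketch (not part of the original derivation; the values are arbitrary) that treats each model's prediction as an independent noisy estimate and checks empirically that averaging $B$ of them shrinks the variance by roughly a factor of $B$:

```python
import numpy as np

rng = np.random.default_rng(42)

true_value = 10.0  # the quantity every model is trying to predict at a fixed x
sigma = 2.0        # standard deviation of each individual model's prediction
B = 100            # number of models in the ensemble
n_trials = 10_000  # repeated "training runs" to estimate variance empirically

# Each trial: B independent model predictions, then their average
single_preds = true_value + sigma * rng.standard_normal((n_trials, B))
ensemble_preds = single_preds.mean(axis=1)

print(f"Single-model variance:     {single_preds[:, 0].var():.3f}  (theory: {sigma**2:.3f})")
print(f"Ensemble variance (B={B}): {ensemble_preds.var():.3f}  (theory: {sigma**2 / B:.3f})")
```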
With Correlation:
In reality, models are correlated (they see similar patterns). With correlation $\rho$:
\[\text{Var}[\hat{f}_{\text{avg}}(x)] = \rho \sigma^2 + \frac{1-\rho}{B} \sigma^2\]

As $B \to \infty$: \(\text{Var}[\hat{f}_{\text{avg}}(x)] \to \rho \sigma^2\)
So no matter how many models we add, the ensemble variance is bounded below by $\rho \sigma^2$: the correlation between the models limits how much averaging can help.
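A quick way to feel this limit is to plug numbers into the formula; the tiny helper below (a hypothetical function, simply evaluating the expression above) shows the variance flattening out near $\rho\sigma^2$ once $B$ is large:

```python
import numpy as np

def ensemble_variance(sigma2: float, rho: float, B: int) -> float:
    """Variance of the average of B models, each with variance sigma2
    and pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) / B * sigma2

sigma2 = 1.0
for rho in (0.0, 0.3, 0.7):
    for B in (10, 100, 1000):
        print(f"rho={rho:.1f}, B={B:>4}: Var = {ensemble_variance(sigma2, rho, B):.3f}")
# With rho > 0, adding trees stops helping: the variance approaches rho * sigma2.
```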
1. Bagging (Bootstrap Aggregating):
2. Boosting:
3. Stacking:
Random Forests use Bagging plus random feature selection.
Bootstrap Sampling: Sample $n$ observations with replacement from dataset of size $n$.
Example:
Original data: [1, 2, 3, 4, 5]
Bootstrap sample 1: [1, 1, 3, 5, 5]
Bootstrap sample 2: [2, 2, 3, 4, 5]
Bootstrap sample 3: [1, 2, 2, 4, 5]
If we have 1000 observations, we can expect each bootstrap sample to contain about 632 unique observations, with about 368 left out.
Properties:
Mathematical: Probability an observation is NOT selected in one draw: $(1 - \frac{1}{n})$
After $n$ draws with replacement: \(P(\text{not selected}) = \left(1 - \frac{1}{n}\right)^n \approx e^{-1} \approx 0.368\)
So about 63.2% of data appears in each bootstrap sample.
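A quick empirical check (a small sketch, not from the original text) reproduces the $1 - e^{-1} \approx 0.632$ figure:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Draw a bootstrap sample (n indices with replacement) and count
# how many distinct original observations ended up in it.
indices = rng.integers(0, n, size=n)
unique_fraction = len(np.unique(indices)) / n

print(f"Fraction of observations in the bootstrap sample: {unique_fraction:.3f}")
print(f"Theoretical value 1 - e^(-1):                     {1 - np.exp(-1):.3f}")
```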
function Bagging(data, B):
models = []
for b = 1 to B:
# 1. Create bootstrap sample
sample_b = bootstrap_sample(data)
# 2. Train model on this sample
model_b = train_model(sample_b)
# 3. Store model
models.append(model_b)
return models
function PredictRegression(x, models):
    # Regression: average the models' predictions
    predictions = [model.predict(x) for model in models]
    return mean(predictions)

function PredictClassification(x, models):
    # Classification: majority vote across the models
    votes = [model.predict(x) for model in models]
    return most_common(votes)
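In practice you rarely hand-roll this loop: scikit-learn's `BaggingClassifier` implements the same procedure, with decision trees as the default base learner. A minimal sketch, assuming scikit-learn is available:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each of the 100 trees is trained on its own bootstrap sample,
# and predictions are aggregated by majority vote.
bagging = BaggingClassifier(n_estimators=100, random_state=42)
bagging.fit(X_train, y_train)

print(f"Bagged trees accuracy: {bagging.score(X_test, y_test):.3f}")
```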
Intuition: Different bootstrap samples → different trees → averaging smooths out individual quirks.
Mathematical Proof (simplified):
For $B$ independent models with variance $\sigma^2$:
\[\text{Var}[\text{avg}] = \frac{\sigma^2}{B}\]

With $B = 100$ trees: variance is 1% of single tree variance!
Even with correlation $\rho = 0.5$:
\[\text{Var}[\text{avg}] = 0.5\sigma^2 + \frac{0.5}{100}\sigma^2 = 0.505\sigma^2\]

Still roughly a 50% variance reduction!
The problem with bagging decision trees is that individual trees are still correlated because bootstrap samples are still quite similar. If one feature is very strong (e.g., “income” for loan prediction), all trees will split on it first. This creates correlation between trees, limiting variance reduction.
How do we fix this? Randomly select features at each split! We modify the tree-building process: at each split, randomly sample $m$ features from the total $p$ features and consider only those as split candidates.
Typical values: $m \approx \sqrt{p}$ for classification and $m \approx p/3$ for regression (setting $m = p$ recovers plain bagging).
Example:
Total features: 16
Random Forest samples: $\sqrt{16} = 4$ features per split
Tree 1, Root split: Consider features [2, 7, 10, 15]
Tree 1, Next split: Consider features [1, 3, 8, 12]
Tree 2, Root split: Consider features [4, 6, 9, 11]
…
By randomly selecting features at each split, we ensure that each tree is built from different subsets of features. Therefore, trees become more diverse and less correlated with one another.
Mathematical Insight:
Mathematically, the correlation $\rho$ between trees decreases as $m$ decreases:
\[m \downarrow \implies \rho \downarrow \implies \text{Var}[\text{ensemble}] \downarrow\]

But there’s a tradeoff:
\[m \downarrow \text{ too much} \implies \text{individual tree quality} \downarrow\]

The optimal $m$ balances tree quality and diversity.
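One way to see this tradeoff empirically is to sweep `max_features` (the $m$ above) on scikit-learn's `RandomForestClassifier` and compare cross-validated accuracy. A minimal sketch with an arbitrary synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=0)

# max_features = m, the number of features considered at each split.
for m in (1, 2, 4, "sqrt", 10, 20):
    rf = RandomForestClassifier(n_estimators=200, max_features=m, random_state=0)
    score = cross_val_score(rf, X, y, cv=5).mean()
    print(f"max_features={m!s:>4}: CV accuracy = {score:.3f}")
```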
function RandomForest(X, y, B, m):
"""
X: feature matrix (n × p)
y: target vector (n)
B: number of trees
m: number of features to sample per split
"""
forest = []
for b = 1 to B:
# 1. Bootstrap sample
X_b, y_b = bootstrap_sample(X, y)
# 2. Build tree with random feature selection
tree_b = build_tree(X_b, y_b, m)
# 3. Add to forest
forest.append(tree_b)
return forest
function build_tree(X, y, m):
"""Build single decision tree with random features."""
if stopping_criterion_met:
return leaf_node(y)
    # Randomly select m of the p total features (p = number of columns of X)
    features = random_sample(p, m)
# Find best split among these m features
best_feature, best_threshold = find_best_split(X[:, features], y)
# Split data
X_left, y_left, X_right, y_right = split(X, y, best_feature, best_threshold)
# Recursively build subtrees
left_tree = build_tree(X_left, y_left, m)
right_tree = build_tree(X_right, y_right, m)
return decision_node(best_feature, best_threshold, left_tree, right_tree)
function predict_forest_classification(x, forest):
    """Classification: majority vote across trees."""
    votes = [tree.predict(x) for tree in forest]
    return most_common(votes)

function predict_forest_regression(x, forest):
    """Regression: average the tree predictions."""
    predictions = [tree.predict(x) for tree in forest]
    return mean(predictions)
Classification:
Soft Voting Example:
Tree 1: P(Class A) = 0.8, P(Class B) = 0.2
Tree 2: P(Class A) = 0.6, P(Class B) = 0.4
Tree 3: P(Class A) = 0.7, P(Class B) = 0.3
Average: P(Class A) = 0.7, P(Class B) = 0.3 → Predict Class A
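Soft voting is simply an element-wise average of each tree's probability vector; a tiny NumPy sketch with the numbers above:

```python
import numpy as np

# Per-tree class probabilities: rows are trees, columns are [P(A), P(B)]
tree_probas = np.array([[0.8, 0.2],
                        [0.6, 0.4],
                        [0.7, 0.3]])

avg_proba = tree_probas.mean(axis=0)            # -> [0.7, 0.3]
predicted_class = ["A", "B"][int(avg_proba.argmax())]
print(avg_proba, "-> Predict Class", predicted_class)
```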
Regression:
Mean Example:
Tree 1: 145,000
Tree 2: 152,000
Tree 3: 148,000
Average: (145,000 + 152,000 + 148,000) / 3 = 148,333
OOB data is a powerful feature of random forests that allows us to estimate the model’s performance without using a separate validation set. Remember each bootstrap sample uses ~63% of data. The remaining ~37% are Out-of-Bag (OOB) samples.
For each observation $i$: collect the trees whose bootstrap samples did not include $i$, predict $i$ using only those trees, and compare the aggregated prediction with the true label $y_i$.
function calculate_oob_error(X, y, forest):
n = len(y)
oob_predictions = [None] * n
for i in range(n):
# Find trees where observation i was OOB
oob_trees = [tree for tree in forest if i not in tree.training_indices]
if len(oob_trees) > 0:
# Predict using only OOB trees
votes = [tree.predict(X[i]) for tree in oob_trees]
oob_predictions[i] = most_common(votes)
# Calculate error on OOB predictions
oob_error = error_rate(y, oob_predictions)
return oob_error
It’s important to note that each observation is predicted by trees it wasn’t trained on. Therefore, OOB error is an unbiased estimate of test error!
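scikit-learn exposes this directly: pass `oob_score=True` to `RandomForestClassifier` and read `oob_score_` after fitting. A minimal sketch:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# bootstrap=True is the default; oob_score=True scores each observation
# using only the trees that did not see it during training.
rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
rf.fit(X, y)

print(f"OOB score (accuracy estimate without a validation set): {rf.oob_score_:.3f}")
```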
Advantages:
It’s remarkable how closely OOB error approximates leave-one-out cross-validation error.
OOB Score = 1 - OOB Error Rate
Typically:
Feature importance is a powerful tool for understanding which features drive predictions. Random Forests provide this kind of interpretability out of the box, answering:
Idea: Measure total impurity reduction from splits using each feature.
Algorithm:
For each feature $j$:
\[\text{Importance}(j) = \sum_{t \in \text{Trees}} \sum_{s \in \text{Splits using } j} \Delta \text{Impurity}(s)\]

Where:

\[\Delta \text{Impurity}(s) = \text{Gini}_{\text{parent}} - \left(p_{\text{left}} \cdot \text{Gini}_{\text{left}} + p_{\text{right}} \cdot \text{Gini}_{\text{right}}\right)\]
Normalize: \(\text{Importance}_{\text{normalized}}(j) = \frac{\text{Importance}(j)}{\sum_{k=1}^{p} \text{Importance}(k)}\)
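scikit-learn's impurity-based `feature_importances_` attribute follows the same idea (average impurity decrease per feature, normalized to sum to 1). A minimal sketch:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=8,
                           n_informative=4, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0)
rf.fit(X, y)

# Importances are normalized to sum to 1 across all features.
for idx in np.argsort(rf.feature_importances_)[::-1]:
    print(f"feature_{idx}: {rf.feature_importances_[idx]:.3f}")
```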
Advantages:
Disadvantages:
Idea: Measure error increase when feature values are randomly permuted.
Algorithm:
function permutation_importance(model, X_test, y_test, j):
    # 1. Get baseline error on the untouched test set
    baseline_error = error(model.predict(X_test), y_test)

    # 2. Permute feature j, breaking its relationship with the target
    X_permuted = X_test.copy()
    X_permuted[:, j] = shuffle(X_permuted[:, j])

    # 3. Calculate error with the permuted feature
    permuted_error = error(model.predict(X_permuted), y_test)

    # 4. Importance = increase in error caused by permuting feature j
    importance_j = permuted_error - baseline_error
    return importance_j
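scikit-learn ships this procedure as `sklearn.inspection.permutation_importance`. A minimal sketch on a held-out test set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8,
                           n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

# Each feature is shuffled n_repeats times; importance = mean drop in score.
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: {result.importances_mean[idx]:.3f} "
          f"± {result.importances_std[idx]:.3f}")
```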
Advantages:
Disadvantages:
Suppose we’re predicting house prices with features:
Interpretation:
Strategy:
Benefits:
Classification:
Regression:
For regression, at each node:
\[\text{MSE}(S) = \frac{1}{|S|} \sum_{i \in S} (y_i - \bar{y})^2\]

Where $\bar{y}$ is the mean of target values in node $S$.
MSE Reduction (like information gain):
\[\text{MSE Reduction} = \text{MSE}(S) - \left(\frac{|S_L|}{|S|}\text{MSE}(S_L) + \frac{|S_R|}{|S|}\text{MSE}(S_R)\right)\]

Predicting house prices:
Tree 1 predicts: $245,000
Tree 2 predicts: $238,000
Tree 3 predicts: $251,000
…
Tree 100 predicts: $247,000
Final prediction: Average = $245,500
Spread of the tree predictions: standard deviation $\sigma = \$8{,}000$ (measures uncertainty)
Prediction interval: $\$245{,}500 \pm 2 \times \$8{,}000 = [\$229{,}500,\ \$261{,}500]$
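With scikit-learn's `RandomForestRegressor` you can reproduce this by querying each fitted tree through the `estimators_` attribute. A minimal sketch; note that the $\pm 2\sigma$ interval is a rough heuristic, not a calibrated prediction interval:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

x_new = X[:1]  # one query point
per_tree = np.array([tree.predict(x_new)[0] for tree in rf.estimators_])

mean_pred = per_tree.mean()
std_pred = per_tree.std()
print(f"Prediction: {mean_pred:.1f}")
print(f"Rough interval: [{mean_pred - 2*std_pred:.1f}, {mean_pred + 2*std_pred:.1f}]")
```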
import numpy as np
from collections import Counter
class SimpleRandomForest:
"""Simplified Random Forest implementation."""
def __init__(self, n_trees=10, max_depth=5, min_samples_split=2,
max_features='sqrt'):
self.n_trees = n_trees
self.max_depth = max_depth
self.min_samples_split = min_samples_split
self.max_features = max_features
self.trees = []
def _bootstrap_sample(self, X, y):
"""Create bootstrap sample."""
n_samples = X.shape[0]
indices = np.random.choice(n_samples, n_samples, replace=True)
return X[indices], y[indices]
def _calculate_max_features(self, n_features):
"""Calculate number of features to sample."""
if self.max_features == 'sqrt':
return int(np.sqrt(n_features))
elif self.max_features == 'log2':
return int(np.log2(n_features))
elif isinstance(self.max_features, int):
return self.max_features
else:
return n_features
def fit(self, X, y):
"""Train random forest."""
self.trees = []
for _ in range(self.n_trees):
# Create bootstrap sample
X_sample, y_sample = self._bootstrap_sample(X, y)
# Build tree
tree = DecisionTree(
max_depth=self.max_depth,
min_samples_split=self.min_samples_split,
max_features=self._calculate_max_features(X.shape[1])
)
tree.fit(X_sample, y_sample)
self.trees.append(tree)
return self
def predict(self, X):
"""Predict using majority vote."""
# Get predictions from all trees
tree_predictions = np.array([tree.predict(X) for tree in self.trees])
# Majority vote for each sample
predictions = []
for i in range(X.shape[0]):
votes = tree_predictions[:, i]
majority_vote = Counter(votes).most_common(1)[0][0]
predictions.append(majority_vote)
return np.array(predictions)
def predict_proba(self, X):
"""Predict class probabilities."""
# Get predictions from all trees
tree_predictions = np.array([tree.predict(X) for tree in self.trees])
# Calculate probabilities
n_samples = X.shape[0]
n_classes = len(np.unique(tree_predictions))
probabilities = np.zeros((n_samples, n_classes))
for i in range(n_samples):
votes = tree_predictions[:, i]
vote_counts = Counter(votes)
for class_label, count in vote_counts.items():
probabilities[i, class_label] = count / self.n_trees
return probabilities
class DecisionTree:
"""Simple decision tree with random feature selection."""
class Node:
def __init__(self, feature_idx=None, threshold=None,
left=None, right=None, value=None):
self.feature_idx = feature_idx
self.threshold = threshold
self.left = left
self.right = right
self.value = value
def __init__(self, max_depth=5, min_samples_split=2, max_features=None):
self.max_depth = max_depth
self.min_samples_split = min_samples_split
self.max_features = max_features
self.root = None
def _gini(self, y):
"""Calculate Gini impurity."""
m = len(y)
if m == 0:
return 0
counts = np.bincount(y)
probabilities = counts / m
return 1 - np.sum(probabilities ** 2)
def _best_split(self, X, y):
"""Find best split with random feature sampling."""
m, n = X.shape
if m <= 1:
return None, None
# Sample random features
if self.max_features and self.max_features < n:
feature_indices = np.random.choice(n, self.max_features, replace=False)
else:
feature_indices = np.arange(n)
best_gain = 0
best_feature = None
best_threshold = None
parent_gini = self._gini(y)
# Try each sampled feature
for feature_idx in feature_indices:
thresholds = np.unique(X[:, feature_idx])
for threshold in thresholds:
left_mask = X[:, feature_idx] <= threshold
right_mask = ~left_mask
if np.sum(left_mask) == 0 or np.sum(right_mask) == 0:
continue
# Calculate Gini gain
n_left = np.sum(left_mask)
n_right = np.sum(right_mask)
gini_left = self._gini(y[left_mask])
gini_right = self._gini(y[right_mask])
weighted_gini = (n_left/m) * gini_left + (n_right/m) * gini_right
gain = parent_gini - weighted_gini
if gain > best_gain:
best_gain = gain
best_feature = feature_idx
best_threshold = threshold
return best_feature, best_threshold
def _build_tree(self, X, y, depth=0):
"""Recursively build decision tree."""
n_samples, n_features = X.shape
n_classes = len(np.unique(y))
# Stopping criteria
if (depth >= self.max_depth or
n_classes == 1 or
n_samples < self.min_samples_split):
leaf_value = np.bincount(y).argmax()
return self.Node(value=leaf_value)
# Find best split
feature_idx, threshold = self._best_split(X, y)
if feature_idx is None:
leaf_value = np.bincount(y).argmax()
return self.Node(value=leaf_value)
# Split data
left_mask = X[:, feature_idx] <= threshold
right_mask = ~left_mask
# Build subtrees
left_child = self._build_tree(X[left_mask], y[left_mask], depth + 1)
right_child = self._build_tree(X[right_mask], y[right_mask], depth + 1)
return self.Node(
feature_idx=feature_idx,
threshold=threshold,
left=left_child,
right=right_child
)
def fit(self, X, y):
"""Build decision tree."""
self.root = self._build_tree(X, y)
return self
def _traverse_tree(self, x, node):
"""Traverse tree for prediction."""
if node.value is not None:
return node.value
if x[node.feature_idx] <= node.threshold:
return self._traverse_tree(x, node.left)
else:
return self._traverse_tree(x, node.right)
def predict(self, X):
"""Predict class labels."""
return np.array([self._traverse_tree(x, self.root) for x in X])
# Generate sample data
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=500, n_features=10,
n_informative=7, n_redundant=3,
random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Train Random Forest
rf = SimpleRandomForest(n_trees=50, max_depth=10, max_features='sqrt')
rf.fit(X_train, y_train)
# Predictions
y_pred = rf.predict(X_test)
accuracy = np.mean(y_pred == y_test)
print(f"Accuracy: {accuracy:.2%}")
1. Excellent Performance
2. Versatility
3. Feature Insights
4. Robustness
5. Parallelization
6. No Assumptions
1. Interpretability
2. Prediction Speed
3. Memory Usage
4. Extrapolation
5. Bias Toward Categorical Features
6. Not Optimal for All Tasks
| Aspect | Single Tree | Random Forest |
|---|---|---|
| Variance | High | Low (averaging) |
| Bias | Low | Low |
| Interpretability | High (visualize) | Low (ensemble) |
| Accuracy | Moderate | High |
| Training Time | Fast | Slower |
| Prediction Time | Very Fast | Moderate |
| Overfitting | Prone | Resistant |
| Aspect | Random Forest | Gradient Boosting |
|---|---|---|
| Training | Parallel | Sequential |
| Speed | Faster training | Slower training |
| Accuracy | Very good | Often better |
| Overfitting | Less prone | More prone |
| Tuning | Less sensitive | More sensitive |
| Interpretability | Moderate | Moderate |
| Aspect | Random Forest | Linear Models |
|---|---|---|
| Linear Relationships | Overkill | Optimal |
| Non-linear Relationships | Excellent | Poor |
| High-Dimensional Sparse | Moderate | Excellent |
| Interpretability | Feature importance | Coefficients |
| Preprocessing | Minimal | Scaling required |
Random Forests represent a crucial step in the evolution of machine learning algorithms - moving from single trees to ensembles that combine the strengths of multiple models. Their ability to handle complex, non-linear relationships while maintaining robustness to noise makes them one of the most widely used algorithms in practice.
Random Forests remain a go-to algorithm because they:
Master Random Forests, and you have a powerful tool that will serve you well across countless machine learning problems.
Original Papers: