
Machine Learning Series

by Mayank Sharma

Random Forests: Ensemble Learning with Decision Trees

Dec 07, 2025

Continuing our journey through machine learning, today we explore Random Forests, the ensemble method that combines the wisdom of multiple decision trees. Imagine you’re trying to decide which movie to watch. You could ask one friend for their opinion, but they might have peculiar taste. Instead, you poll ten friends and go with the majority vote. Even if some friends have quirky preferences, the group consensus is usually more reliable than any single opinion. This is the power of ensemble learning: combining multiple “weak” predictors to create a strong one. Random Forests apply this wisdom to decision trees, creating one of machine learning’s most powerful and widely used algorithms.

Table of Contents

  1. Introduction: From Single Trees to Forests
  2. The Problem with Single Decision Trees
  3. Ensemble Learning Fundamentals
  4. Bootstrap Aggregating (Bagging)
  5. Random Forests: Adding Random Feature Selection
  6. The Random Forest Algorithm
  7. Out-of-Bag (OOB) Error Estimation
  8. Feature Importance in Random Forests
  9. Random Forests for Regression
  10. Implementation from Scratch
  11. Advantages and Limitations
  12. Comparison with Other Methods
  13. Conclusion

Introduction: From Single Trees to Forests

The Journey from Trees to Forests

In the previous tutorial, we learned that decision trees are intuitive, easy to interpret, able to capture non-linear relationships, and require little preprocessing.

But they have a critical weakness: they overfit easily, so small changes in the training data can produce a completely different tree (high variance).

Random Forests solve this by training many trees and combining their predictions. The result is a model that keeps the low bias of deep trees while averaging away most of their variance, and therefore generalizes far better.

What Makes Random Forests Special?

Random Forests excel because they:

  1. Reduce overfitting through averaging many overfit trees
  2. Handle high-dimensional data naturally
  3. Provide feature importance rankings
  4. Require minimal preprocessing (no scaling, handles missing values)
  5. Are robust to outliers and noise
  6. Parallelize easily (train trees independently)
  7. Work well “out of the box” with default parameters

The Problem with Single Decision Trees

High Variance Demonstration

Decision trees are highly sensitive to small changes in training data. Consider training a decision tree on slightly different data samples:

Trees trained on such samples often look completely different despite the samples coming from the same distribution! This is high variance.

Mathematical Definition:

In statistical learning theory, for a learning algorithm producing an estimator $\hat{f}$ from a training set $\mathcal{D}$, variance measures how much its predictions change across different training sets:

\[\text{Variance}[\hat{f}(x)] = \mathbb{E}_{\mathcal{D}}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2]\]

Where $\hat{f}(x)$ is the prediction of a model trained on a random training set $\mathcal{D}$, and the expectation $\mathbb{E}_{\mathcal{D}}$ is taken over all possible training sets drawn from the data distribution.

Decision trees have high variance because their greedy, hierarchical structure amplifies small data changes.

The Bias-Variance Tradeoff

Decision trees are low bias but high variance models. Total prediction error decomposes into:

\[\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}\]

Single Decision Trees: low bias (a deep tree can fit the training data almost perfectly) but high variance.

Random Forests: keep the low bias of deep trees, while averaging across many trees dramatically reduces the variance.

Ensemble Learning Fundamentals

Ensemble methods combine multiple models to improve performance. The key idea is that a group of weak learners can become a strong learner through aggregation: if you have $B$ independent predictors, each with error rate $\epsilon$, aggregating them reduces the overall error.

For Regression:

Suppose we have $B$ independent models $\hat{f}_1(x), \hat{f}_2(x), \ldots, \hat{f}_B(x)$, each with variance $\sigma^2$.

The average prediction: \(\hat{f}_{\text{avg}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x)\)

Has variance: \(\text{Var}[\hat{f}_{\text{avg}}(x)] = \frac{\sigma^2}{B}\)

Result: Variance decreases by factor of $B$!

With Correlation:

In reality, models are correlated (they see similar patterns). With correlation $\rho$:

\[\text{Var}[\hat{f}_{\text{avg}}(x)] = \rho \sigma^2 + \frac{1-\rho}{B} \sigma^2\]

As $B \to \infty$: \(\text{Var}[\hat{f}_{\text{avg}}(x)] \to \rho \sigma^2\)

So it’s important to remember that the ensemble variance is bounded below by $\rho \sigma^2$: beyond a point, adding more trees does not help, only decorrelating them does.

Types of Ensemble Methods

1. Bagging (Bootstrap Aggregating): train models in parallel on bootstrap samples of the data and aggregate their predictions by averaging or voting.

2. Boosting: train models sequentially, with each new model focusing on the examples the previous models got wrong (e.g., AdaBoost, Gradient Boosting).

3. Stacking: train a meta-model that learns how best to combine the predictions of several base models.

Random Forests use Bagging plus random feature selection.

Bootstrap Aggregating (Bagging)

What is Bootstrapping?

Bootstrap Sampling: Sample $n$ observations with replacement from dataset of size $n$.

Example:

Original data: [1, 2, 3, 4, 5]

Bootstrap sample 1: [1, 1, 3, 5, 5]
Bootstrap sample 2: [2, 2, 3, 4, 5]
Bootstrap sample 3: [1, 2, 2, 4, 5]

If we have 1000 observations, we can expect each bootstrap sample to contain about 632 distinct observations, with the remaining ~368 left out (these become the out-of-bag samples discussed later).

Properties: sampling is done with replacement, each sample has the same size as the original dataset, some observations appear multiple times, and others not at all.

Mathematically, the probability that a given observation is NOT selected in a single draw is $1 - \frac{1}{n}$.

After $n$ draws with replacement: \(P(\text{not selected}) = \left(1 - \frac{1}{n}\right)^n \approx e^{-1} \approx 0.368\)

So about 63.2% of data appears in each bootstrap sample.
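A quick way to sanity-check the 63.2% figure is to simulate bootstrap sampling with NumPy. This is a minimal sketch; the dataset size and number of trials are arbitrary choices for the demo.

import numpy as np

rng = np.random.default_rng(42)
n = 1000          # dataset size (arbitrary for this demo)
n_trials = 200    # number of bootstrap samples to draw

fractions = []
for _ in range(n_trials):
    # Draw n indices with replacement, then count how many distinct ones appear
    sample = rng.integers(0, n, size=n)
    fractions.append(len(np.unique(sample)) / n)

print(f"Average fraction of data in a bootstrap sample: {np.mean(fractions):.3f}")
# Prints roughly 0.632, matching 1 - e^(-1)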

The Bagging Algorithm

function Bagging(data, B):
    models = []

    for b = 1 to B:
        # 1. Create bootstrap sample
        sample_b = bootstrap_sample(data)

        # 2. Train model on this sample
        model_b = train_model(sample_b)

        # 3. Store model
        models.append(model_b)

    return models

function Predict(x, models, task):
    predictions = [model.predict(x) for model in models]

    if task == "regression":
        # Regression: average the predictions
        return mean(predictions)
    else:
        # Classification: majority vote
        return most_common(predictions)
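For comparison, scikit-learn packages this exact recipe as BaggingClassifier, which bags decision trees by default. Here is a minimal sketch on synthetic data; the hyperparameters are illustrative, not tuned.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# A single fully grown tree vs. a bag of 100 such trees
single_tree = DecisionTreeClassifier(random_state=42)
bagged_trees = BaggingClassifier(n_estimators=100, random_state=42)  # decision trees by default

print("Single tree CV accuracy:", cross_val_score(single_tree, X, y, cv=5).mean())
print("Bagged trees CV accuracy:", cross_val_score(bagged_trees, X, y, cv=5).mean())

On most datasets the bagged ensemble scores noticeably higher, which is the variance reduction at work.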

Why Bagging Reduces Variance

Intuition: Different bootstrap samples → different trees → averaging smooths out individual quirks.

Mathematical Proof (simplified):

For $B$ independent models with variance $\sigma^2$:

\[\text{Var}[\text{avg}] = \frac{\sigma^2}{B}\]

With $B = 100$ trees: variance is 1% of single tree variance!

Even with correlation $\rho = 0.5$:

\[\text{Var}[\text{avg}] = 0.5\sigma^2 + \frac{0.5}{100}\sigma^2 = 0.505\sigma^2\]

Still 50% variance reduction!
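These formulas are easy to play with numerically. The sketch below simply evaluates the ensemble-variance expression for a few values of B and ρ; all numbers are illustrative.

sigma2 = 1.0  # variance of a single tree (illustrative)

for rho in (0.0, 0.25, 0.5, 0.75):
    for B in (1, 10, 100, 1000):
        # Variance of the average of B trees with pairwise correlation rho
        ensemble_var = rho * sigma2 + (1 - rho) / B * sigma2
        print(f"rho={rho:.2f}, B={B:4d} -> ensemble variance = {ensemble_var:.3f}")

# With rho = 0 the variance keeps shrinking as B grows;
# with rho > 0 it flattens out near rho * sigma2.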

Random Forests: Adding Random Feature Selection

The Limitation of Pure Bagging

The problem with bagging decision trees is that individual trees are still correlated because bootstrap samples are still quite similar. If one feature is very strong (e.g., “income” for loan prediction), all trees will split on it first. This creates correlation between trees, limiting variance reduction.

Random Feature Selection

How do we fix this? Randomly select features at each split! We modify the tree-building process: at each split, randomly sample $m$ features from the total $p$ features and consider only those as split candidates.

Typical values: $m = \sqrt{p}$ for classification and $m = p/3$ for regression (using $m = p$ recovers plain bagging).

Example:

Total features: 16
Random Forest samples: $\sqrt{16} = 4$ features per split

Tree 1, Root split: Consider features [2, 7, 10, 15]
Tree 1, Next split: Consider features [1, 3, 8, 12]
Tree 2, Root split: Consider features [4, 6, 9, 11]
…
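To make the per-split sampling concrete, here is how it might look in NumPy; the feature count and printed indices are illustrative.

import numpy as np

rng = np.random.default_rng(0)
p = 16                   # total number of features
m = int(np.sqrt(p))      # features considered at each split (sqrt rule)

# Every split draws a fresh subset of m features, sampled without replacement
for split in range(3):
    candidate_features = rng.choice(p, size=m, replace=False)
    print(f"Split {split}: consider features {sorted(candidate_features.tolist())}")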

Why This Works

By randomly selecting features at each split, we ensure that different splits, and therefore different trees, are built from different subsets of features, even when one feature dominates. As a result, the trees become more diverse.

Mathematical Insight:

Mathematically, the correlation between trees decreases as $m$ decreases:

\[m \downarrow \implies \rho \downarrow \implies \text{Var}[\text{ensemble}] \downarrow\]

But there’s a tradeoff:

\[m \downarrow \text{ too much} \implies \text{individual tree quality} \downarrow\]

The optimal $m$ balances tree quality and diversity.

The Random Forest Algorithm

Complete Algorithm

function RandomForest(X, y, B, m):
    """
    X: feature matrix (n × p)
    y: target vector (n)
    B: number of trees
    m: number of features to sample per split
    """
    forest = []

    for b = 1 to B:
        # 1. Bootstrap sample
        X_b, y_b = bootstrap_sample(X, y)

        # 2. Build tree with random feature selection
        tree_b = build_tree(X_b, y_b, m)

        # 3. Add to forest
        forest.append(tree_b)

    return forest

function build_tree(X, y, m):
    """Build single decision tree with random features."""
    if stopping_criterion_met:
        return leaf_node(y)

    # Randomly select m of the p total features as split candidates
    features = random_sample(all_features, m)

    # Find best split, searching only the selected candidate features
    best_feature, best_threshold = find_best_split(X, y, features)

    # Split data
    X_left, y_left, X_right, y_right = split(X, y, best_feature, best_threshold)

    # Recursively build subtrees
    left_tree = build_tree(X_left, y_left, m)
    right_tree = build_tree(X_right, y_right, m)

    return decision_node(best_feature, best_threshold, left_tree, right_tree)

function predict_forest(x, forest, task):
    """Make prediction with random forest."""
    predictions = [tree.predict(x) for tree in forest]

    if task == "classification":
        # Majority vote over the tree predictions
        return most_common(predictions)
    else:
        # Regression: average of the tree predictions
        return mean(predictions)

Classification vs Regression

Classification: each tree votes for a class, and the forest predicts the majority class (hard voting) or averages the trees’ predicted class probabilities (soft voting).

Soft Voting Example:

Tree 1: P(Class A) = 0.8, P(Class B) = 0.2
Tree 2: P(Class A) = 0.6, P(Class B) = 0.4
Tree 3: P(Class A) = 0.7, P(Class B) = 0.3

Average: P(Class A) = 0.7, P(Class B) = 0.3 → Predict Class A

Regression:

Mean Example:

Tree 1: 145,000
Tree 2: 152,000
Tree 3: 148,000

Average: (145,000 + 152,000 + 148,000) / 3 = 148,333
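In code, both aggregation rules are one-liners over the per-tree predictions; the numbers below mirror the examples above.

import numpy as np
from collections import Counter

# Classification, hard voting: the most common class among the trees wins
class_votes = ["A", "A", "B", "A", "B"]
print(Counter(class_votes).most_common(1)[0][0])    # -> "A"

# Classification, soft voting: average the per-class probabilities (one row per tree)
proba = np.array([[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]])
print(proba.mean(axis=0))                           # -> [0.7, 0.3], predict Class A

# Regression: average the tree predictions
prices = np.array([145_000, 152_000, 148_000])
print(prices.mean())                                # -> 148333.33...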

Out-of-Bag (OOB) Error Estimation

What is OOB Data?

OOB data is a powerful feature of Random Forests that lets us estimate the model’s performance without a separate validation set. Remember that each bootstrap sample uses ~63% of the data; the remaining ~37% are the Out-of-Bag (OOB) samples for that tree.

For each observation $i$: find the trees whose bootstrap sample did not include $i$, predict $i$ using only those trees, and compare the aggregated prediction with the true label.

OOB Error Calculation

function calculate_oob_error(X, y, forest):
    n = len(y)
    oob_predictions = [None] * n

    for i in range(n):
        # Find trees where observation i was OOB
        oob_trees = [tree for tree in forest if i not in tree.training_indices]

        if len(oob_trees) > 0:
            # Predict using only OOB trees
            votes = [tree.predict(X[i]) for tree in oob_trees]
            oob_predictions[i] = most_common(votes)

    # Calculate error on OOB predictions
    oob_error = error_rate(y, oob_predictions)
    return oob_error

OOB as Cross-Validation

It’s important to note that each observation is predicted by trees it wasn’t trained on. Therefore, OOB error is an unbiased estimate of test error!

Advantages: no separate validation set is needed, all data is used for training, and the estimate comes essentially for free during fitting.

Remarkably, the OOB error closely approximates the leave-one-out cross-validation error.

OOB Score

OOB Score = 1 - OOB Error Rate

Typically, the OOB score is close to the accuracy you would measure on a held-out test set, which makes it a convenient signal for hyperparameter tuning.
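In scikit-learn, OOB scoring is a single constructor flag. A minimal sketch on synthetic data follows; the number of trees is illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# oob_score=True evaluates each sample using only the trees that never saw it
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)

print(f"OOB score: {rf.oob_score_:.3f}")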

Feature Importance in Random Forests

Feature importance is a powerful tool for understanding which features drive predictions. Random Forests provide this interpretability out of the box, answering questions such as: which features matter most for the predictions, and which could be dropped without hurting performance?

Method 1: Mean Decrease in Impurity (Gini Importance)

Idea: Measure total impurity reduction from splits using each feature.

Algorithm:

For each feature $j$:

\[\text{Importance}(j) = \sum_{t \in \text{Trees}} \sum_{s \in \text{Splits using } j} \Delta \text{Impurity}(s)\]

Where: \(\Delta \text{Impurity}(s) = \text{Gini}_{\text{parent}} - \left(p_{\text{left}} \cdot \text{Gini}_{\text{left}} + p_{\text{right}} \cdot \text{Gini}_{\text{right}}\right)\)

Normalize: \(\text{Importance}_{\text{normalized}}(j) = \frac{\text{Importance}(j)}{\sum_{k=1}^{p} \text{Importance}(k)}\)
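scikit-learn exposes this normalized Gini importance as the feature_importances_ attribute of a fitted forest. A minimal sketch on synthetic data; hyperparameters are illustrative.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X, y)

# Mean decrease in impurity per feature, normalized to sum to 1
for idx in np.argsort(rf.feature_importances_)[::-1]:
    print(f"feature {idx}: {rf.feature_importances_[idx]:.3f}")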

Advantages: computed for free during training, fast, and available directly from the fitted model.

Disadvantages: computed on the training data (so it can reward features the model overfits to) and biased toward features with many possible split points, such as continuous or high-cardinality features.

Method 2: Permutation Importance

Idea: Measure error increase when feature values are randomly permuted.

Algorithm:

function permutation_importance(model, X_test, y_test, j):
    # 1. Get baseline error
    baseline_error = error(model.predict(X_test), y_test)

    # 2. Permute feature j
    X_permuted = X_test.copy()
    X_permuted[:, j] = shuffle(X_permuted[:, j])

    # 3. Calculate error with permuted feature
    permuted_error = error(model.predict(X_permuted), y_test)

    # 4. Importance = increase in error
    importance_j = permuted_error - baseline_error

    return importance_j
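scikit-learn implements this procedure as sklearn.inspection.permutation_importance, which repeats the shuffle several times and reports the mean and standard deviation of the error increase. A minimal sketch; hyperparameters are illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Each feature is shuffled n_repeats times on the held-out set
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")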

Advantages: can be computed on held-out data, applies to any model (not just trees), and is less biased toward high-cardinality features.

Disadvantages: more expensive (the model must be re-evaluated for every feature and repeat), and results can be misleading when features are strongly correlated, since permuting one still leaves its information available through the others.

Practical Example

Suppose we’re predicting house prices with features:

Interpretation:

Feature Selection with Random Forests

Strategy:

  1. Train Random Forest on all features
  2. Compute feature importances
  3. Remove features with importance < threshold (e.g., < 0.01)
  4. Retrain on selected features
  5. Compare performance

Benefits: a simpler and faster model, less noise from irrelevant features, and often equal or better accuracy.
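One way to implement this strategy is scikit-learn’s SelectFromModel, which keeps the features whose importance exceeds a threshold. A minimal sketch; the threshold of 0.01 mirrors the example above and should be tuned for your data.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=500, n_features=30, n_informative=5, random_state=0)

# Steps 1-3: fit a forest, compute importances, keep features above the threshold
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=200, random_state=0),
    threshold=0.01,
)
X_selected = selector.fit_transform(X, y)
print(f"Kept {X_selected.shape[1]} of {X.shape[1]} features")

# Steps 4-5: retrain on the selected features and compare performance
rf_small = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_selected, y)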

Random Forests for Regression

Algorithm Differences

Classification: splits are chosen to reduce Gini impurity (or entropy), leaves predict a class, and the forest aggregates by voting.

Regression: splits are chosen to reduce mean squared error, leaves predict the mean target value of their samples, and the forest aggregates by averaging.

MSE Splitting Criterion

For regression, at each node:

\[\text{MSE}(S) = \frac{1}{|S|} \sum_{i \in S} (y_i - \bar{y})^2\]

Where $\bar{y}$ is the mean of target values in node $S$.

MSE Reduction (like information gain):

\[\text{MSE Reduction} = \text{MSE}(S) - \left(\frac{|S_L|}{|S|}\text{MSE}(S_L) + \frac{|S_R|}{|S|}\text{MSE}(S_R)\right)\]
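A small NumPy helper makes the criterion concrete. This is a sketch for clarity, not an optimized implementation.

import numpy as np

def mse(y):
    """Mean squared error of a node: variance of the targets around their mean."""
    return float(np.mean((y - y.mean()) ** 2)) if len(y) > 0 else 0.0

def mse_reduction(y_parent, y_left, y_right):
    """Decrease in MSE achieved by splitting the parent node into left and right."""
    n = len(y_parent)
    weighted_child_mse = (len(y_left) / n) * mse(y_left) + (len(y_right) / n) * mse(y_right)
    return mse(y_parent) - weighted_child_mse

# Example: a split that cleanly separates small and large targets scores highly
y = np.array([100.0, 110.0, 105.0, 300.0, 310.0, 305.0])
print(mse_reduction(y, y[:3], y[3:]))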

Regression Example

Predicting house prices:

Tree 1 predicts: $245,000
Tree 2 predicts: $238,000
Tree 3 predicts: $251,000
…
Tree 100 predicts: $247,000

Final prediction: Average = $245,500

Standard deviation of the tree predictions: σ ≈ $8,000 (a measure of uncertainty)

Rough prediction interval: $245,500 ± 2 × $8,000 = [$229,500, $261,500]
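With scikit-learn you can reproduce this kind of rough interval by querying the individual trees through the estimators_ attribute. This is a heuristic sketch on synthetic data, not a calibrated prediction interval.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

x_new = X[:1]  # one query point
# One prediction per tree in the forest
tree_preds = np.array([tree.predict(x_new)[0] for tree in rf.estimators_])

mean_pred = tree_preds.mean()
std_pred = tree_preds.std()
print(f"Prediction: {mean_pred:.1f}  "
      f"rough interval: [{mean_pred - 2 * std_pred:.1f}, {mean_pred + 2 * std_pred:.1f}]")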

Advantages for Regression

For regression tasks, Random Forests capture non-linear relationships without requiring a functional form, are robust to outliers, and the spread of the individual tree predictions gives a rough sense of uncertainty. The main caveat is that they cannot extrapolate beyond the range of target values seen during training.
Implementation from Scratch

Simplified Random Forest Classifier

import numpy as np
from collections import Counter

class SimpleRandomForest:
    """Simplified Random Forest implementation."""

    def __init__(self, n_trees=10, max_depth=5, min_samples_split=2,
                 max_features='sqrt'):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.max_features = max_features
        self.trees = []

    def _bootstrap_sample(self, X, y):
        """Create bootstrap sample."""
        n_samples = X.shape[0]
        indices = np.random.choice(n_samples, n_samples, replace=True)
        return X[indices], y[indices]

    def _calculate_max_features(self, n_features):
        """Calculate number of features to sample."""
        if self.max_features == 'sqrt':
            return int(np.sqrt(n_features))
        elif self.max_features == 'log2':
            return int(np.log2(n_features))
        elif isinstance(self.max_features, int):
            return self.max_features
        else:
            return n_features

    def fit(self, X, y):
        """Train random forest."""
        self.trees = []

        for _ in range(self.n_trees):
            # Create bootstrap sample
            X_sample, y_sample = self._bootstrap_sample(X, y)

            # Build tree
            tree = DecisionTree(
                max_depth=self.max_depth,
                min_samples_split=self.min_samples_split,
                max_features=self._calculate_max_features(X.shape[1])
            )
            tree.fit(X_sample, y_sample)

            self.trees.append(tree)

        return self

    def predict(self, X):
        """Predict using majority vote."""
        # Get predictions from all trees
        tree_predictions = np.array([tree.predict(X) for tree in self.trees])

        # Majority vote for each sample
        predictions = []
        for i in range(X.shape[0]):
            votes = tree_predictions[:, i]
            majority_vote = Counter(votes).most_common(1)[0][0]
            predictions.append(majority_vote)

        return np.array(predictions)

    def predict_proba(self, X):
        """Predict class probabilities."""
        # Get predictions from all trees
        tree_predictions = np.array([tree.predict(X) for tree in self.trees])

        # Calculate probabilities
        n_samples = X.shape[0]
        # Assumes integer class labels 0..K-1; size by the largest label seen
        n_classes = int(np.max(tree_predictions)) + 1
        probabilities = np.zeros((n_samples, n_classes))

        for i in range(n_samples):
            votes = tree_predictions[:, i]
            vote_counts = Counter(votes)
            for class_label, count in vote_counts.items():
                probabilities[i, class_label] = count / self.n_trees

        return probabilities


class DecisionTree:
    """Simple decision tree with random feature selection."""

    class Node:
        def __init__(self, feature_idx=None, threshold=None,
                     left=None, right=None, value=None):
            self.feature_idx = feature_idx
            self.threshold = threshold
            self.left = left
            self.right = right
            self.value = value

    def __init__(self, max_depth=5, min_samples_split=2, max_features=None):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.max_features = max_features
        self.root = None

    def _gini(self, y):
        """Calculate Gini impurity."""
        m = len(y)
        if m == 0:
            return 0
        counts = np.bincount(y)
        probabilities = counts / m
        return 1 - np.sum(probabilities ** 2)

    def _best_split(self, X, y):
        """Find best split with random feature sampling."""
        m, n = X.shape

        if m <= 1:
            return None, None

        # Sample random features
        if self.max_features and self.max_features < n:
            feature_indices = np.random.choice(n, self.max_features, replace=False)
        else:
            feature_indices = np.arange(n)

        best_gain = 0
        best_feature = None
        best_threshold = None
        parent_gini = self._gini(y)

        # Try each sampled feature
        for feature_idx in feature_indices:
            thresholds = np.unique(X[:, feature_idx])

            for threshold in thresholds:
                left_mask = X[:, feature_idx] <= threshold
                right_mask = ~left_mask

                if np.sum(left_mask) == 0 or np.sum(right_mask) == 0:
                    continue

                # Calculate Gini gain
                n_left = np.sum(left_mask)
                n_right = np.sum(right_mask)
                gini_left = self._gini(y[left_mask])
                gini_right = self._gini(y[right_mask])

                weighted_gini = (n_left/m) * gini_left + (n_right/m) * gini_right
                gain = parent_gini - weighted_gini

                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature_idx
                    best_threshold = threshold

        return best_feature, best_threshold

    def _build_tree(self, X, y, depth=0):
        """Recursively build decision tree."""
        n_samples, n_features = X.shape
        n_classes = len(np.unique(y))

        # Stopping criteria
        if (depth >= self.max_depth or
            n_classes == 1 or
            n_samples < self.min_samples_split):
            leaf_value = np.bincount(y).argmax()
            return self.Node(value=leaf_value)

        # Find best split
        feature_idx, threshold = self._best_split(X, y)

        if feature_idx is None:
            leaf_value = np.bincount(y).argmax()
            return self.Node(value=leaf_value)

        # Split data
        left_mask = X[:, feature_idx] <= threshold
        right_mask = ~left_mask

        # Build subtrees
        left_child = self._build_tree(X[left_mask], y[left_mask], depth + 1)
        right_child = self._build_tree(X[right_mask], y[right_mask], depth + 1)

        return self.Node(
            feature_idx=feature_idx,
            threshold=threshold,
            left=left_child,
            right=right_child
        )

    def fit(self, X, y):
        """Build decision tree."""
        self.root = self._build_tree(X, y)
        return self

    def _traverse_tree(self, x, node):
        """Traverse tree for prediction."""
        if node.value is not None:
            return node.value

        if x[node.feature_idx] <= node.threshold:
            return self._traverse_tree(x, node.left)
        else:
            return self._traverse_tree(x, node.right)

    def predict(self, X):
        """Predict class labels."""
        return np.array([self._traverse_tree(x, self.root) for x in X])

Usage Example

# Generate sample data
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=7, n_redundant=3,
                           random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train Random Forest
rf = SimpleRandomForest(n_trees=50, max_depth=10, max_features='sqrt')
rf.fit(X_train, y_train)

# Predictions
y_pred = rf.predict(X_test)
accuracy = np.mean(y_pred == y_test)
print(f"Accuracy: {accuracy:.2%}")

Advantages and Limitations

Advantages

1. Excellent Performance: strong accuracy out of the box on a wide range of tabular problems.

2. Versatility: handles classification and regression, numeric and categorical features, and high-dimensional data.

3. Feature Insights: built-in feature importance rankings aid interpretation and feature selection.

4. Robustness: resilient to outliers, noisy features, and (in suitable implementations) missing values.

5. Parallelization: trees are independent, so training parallelizes trivially across cores or machines.

6. No Assumptions: no linearity, scaling, or distributional assumptions about the data.

Limitations

1. Interpretability: hundreds of trees are much harder to inspect than a single tree.

2. Prediction Speed: every prediction must traverse many trees, which can be slow for latency-sensitive applications.

3. Memory Usage: storing many deep trees can require substantial memory.

4. Extrapolation: regression forests cannot predict values outside the range seen in training.

5. Bias Toward Categorical Features: impurity-based splits (and importances) favor features with many possible split points, such as high-cardinality categoricals and continuous features.

6. Not Optimal for All Tasks: linear models can beat forests on truly linear problems, and deep learning usually wins on images, audio, and text.

Comparison with Other Methods

Random Forests vs Single Decision Tree

Aspect           | Single Tree      | Random Forest
Variance         | High             | Low (averaging)
Bias             | Low              | Low
Interpretability | High (visualize) | Low (ensemble)
Accuracy         | Moderate         | High
Training Time    | Fast             | Slower
Prediction Time  | Very Fast        | Moderate
Overfitting      | Prone            | Resistant

Random Forests vs Gradient Boosting

Aspect           | Random Forest   | Gradient Boosting
Training         | Parallel        | Sequential
Speed            | Faster training | Slower training
Accuracy         | Very good       | Often better
Overfitting      | Less prone      | More prone
Tuning           | Less sensitive  | More sensitive
Interpretability | Moderate        | Moderate

Random Forests vs Linear Models

Aspect                   | Random Forest      | Linear Models
Linear Relationships     | Overkill           | Optimal
Non-linear Relationships | Excellent          | Poor
High-Dimensional Sparse  | Moderate           | Excellent
Interpretability         | Feature importance | Coefficients
Preprocessing            | Minimal            | Scaling required

Conclusion

Random Forests represent a crucial step in the evolution of machine learning algorithms - moving from single trees to ensembles that combine the strengths of multiple models. Their ability to handle complex, non-linear relationships while maintaining robustness to noise makes them one of the most widely used algorithms in practice.

Random Forests remain a go-to algorithm because they deliver strong accuracy with minimal tuning, handle high-dimensional and messy data, provide feature importance for interpretation, and parallelize easily.

Master Random Forests, and you have a powerful tool that will serve you well across countless machine learning problems.
