Dec 07, 2025
Continuing our journey through machine learning, today we explore Random Forests, the ensemble method that combines the wisdom of multiple decision trees. Imagine you’re trying to decide which movie to watch. You could ask one friend for their opinion, but they might have peculiar taste. Instead, you poll ten friends and go with the majority vote. Even if some friends have quirky preferences, the group consensus is usually more reliable than any single opinion. This is the power of ensemble learning: combining multiple “weak” predictors to create a strong one. Random Forests apply this wisdom to decision trees, creating one of machine learning’s most powerful and widely-used algorithms.
In the previous tutorial, we learned that decision trees are:
But they have a critical weakness:
Random Forests solve this by training many trees and combining their predictions. The result is a model that:
Random Forests excel because they:
Decision trees are highly sensitive to small changes in training data. Consider training a decision tree on slightly different data samples:
The trees look completely different despite coming from the same distribution! This is high variance.
Mathematical Definition:
In statistical learning theory, for a learning algorithm that produces an estimator $\hat{f}$ from a training set $\mathcal{D}$, variance measures how much its predictions change across different training sets:
\[\text{Variance}[\hat{f}(x)] = \mathbb{E}_{\mathcal{D}}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2]\]

Where:
Decision trees have high variance because their greedy, hierarchical structure amplifies small data changes.
Decision trees are low bias but high variance models. Total prediction error decomposes into:
\[\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}\]

Single Decision Trees:
Random Forests:
Ensemble methods combine multiple models to improve performance. The key idea is that a group of weak learners can become a strong learner through aggregation: if you have $N$ predictors whose errors are largely independent, each with error rate $\epsilon$ better than chance, averaging or voting over them yields an error rate lower than $\epsilon$.
For Regression:
Suppose we have $B$ independent models $\hat{f}_1(x), \hat{f}_2(x), \ldots, \hat{f}_B(x)$, each with variance $\sigma^2$.
The average prediction: \(\hat{f}_{\text{avg}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x)\)
Has variance: \(\text{Var}[\hat{f}_{\text{avg}}(x)] = \frac{\sigma^2}{B}\)
Result: Variance decreases by factor of $B$!
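To make this concrete, here is a small NumPy sketch (not part of the original derivation; the values are arbitrary) that treats each model's prediction as an independent noisy estimate and checks empirically that averaging $B$ of them shrinks the variance by roughly a factor of $B$:

```python
import numpy as np

rng = np.random.default_rng(42)

true_value = 10.0  # the quantity every model is trying to predict at a fixed x
sigma = 2.0        # standard deviation of each individual model's prediction
B = 100            # number of models in the ensemble
n_trials = 10_000  # repeated "training runs" to estimate variance empirically

# Each trial: B independent model predictions, then their average
single_preds = true_value + sigma * rng.standard_normal((n_trials, B))
ensemble_preds = single_preds.mean(axis=1)

print(f"Single-model variance:     {single_preds[:, 0].var():.3f}  (theory: {sigma**2:.3f})")
print(f"Ensemble variance (B={B}): {ensemble_preds.var():.3f}  (theory: {sigma**2 / B:.3f})")
```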
With Correlation:
In reality, models are correlated (they see similar patterns). With correlation $\rho$:
\[\text{Var}[\hat{f}_{\text{avg}}(x)] = \rho \sigma^2 + \frac{1-\rho}{B} \sigma^2\]

As $B \to \infty$: \(\text{Var}[\hat{f}_{\text{avg}}(x)] \to \rho \sigma^2\)
So no matter how many models we add, the ensemble variance is bounded below by $\rho \sigma^2$: the correlation between the models limits how much averaging can help.
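A quick way to feel this limit is to plug numbers into the formula; the tiny helper below (a hypothetical function, simply evaluating the expression above) shows the variance flattening out near $\rho\sigma^2$ once $B$ is large:

```python
import numpy as np

def ensemble_variance(sigma2: float, rho: float, B: int) -> float:
    """Variance of the average of B models, each with variance sigma2
    and pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) / B * sigma2

sigma2 = 1.0
for rho in (0.0, 0.3, 0.7):
    for B in (10, 100, 1000):
        print(f"rho={rho:.1f}, B={B:>4}: Var = {ensemble_variance(sigma2, rho, B):.3f}")
# With rho > 0, adding trees stops helping: the variance approaches rho * sigma2.
```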
1. Bagging (Bootstrap Aggregating):
2. Boosting:
3. Stacking:
Random Forests use Bagging plus random feature selection.
Bootstrap Sampling: Sample $n$ observations with replacement from dataset of size $n$.
Example:
Original data: [1, 2, 3, 4, 5]
Bootstrap sample 1: [1, 1, 3, 5, 5]
Bootstrap sample 2: [2, 2, 3, 4, 5]
Bootstrap sample 3: [1, 2, 2, 4, 5]
If we have 1000 observations, we can expect each bootstrap sample to contain about 632 unique observations, with about 368 left out.
Properties:
Mathematical: Probability an observation is NOT selected in one draw: $(1 - \frac{1}{n})$
After $n$ draws with replacement: \(P(\text{not selected}) = \left(1 - \frac{1}{n}\right)^n \approx e^{-1} \approx 0.368\)
So about 63.2% of data appears in each bootstrap sample.
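A quick empirical check (a small sketch, not from the original text) reproduces the $1 - e^{-1} \approx 0.632$ figure:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Draw a bootstrap sample (n indices with replacement) and count
# how many distinct original observations ended up in it.
indices = rng.integers(0, n, size=n)
unique_fraction = len(np.unique(indices)) / n

print(f"Fraction of observations in the bootstrap sample: {unique_fraction:.3f}")
print(f"Theoretical value 1 - e^(-1):                     {1 - np.exp(-1):.3f}")
```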
function Bagging(data, B):
models = []
for b = 1 to B:
# 1. Create bootstrap sample
sample_b = bootstrap_sample(data)
# 2. Train model on this sample
model_b = train_model(sample_b)
# 3. Store model
models.append(model_b)
return models
function PredictRegression(x, models):
    # Regression: average the models' predictions
    predictions = [model.predict(x) for model in models]
    return mean(predictions)

function PredictClassification(x, models):
    # Classification: majority vote across the models
    votes = [model.predict(x) for model in models]
    return most_common(votes)
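In practice you rarely hand-roll this loop: scikit-learn's `BaggingClassifier` implements the same procedure, with decision trees as the default base learner. A minimal sketch, assuming scikit-learn is available:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each of the 100 trees is trained on its own bootstrap sample,
# and predictions are aggregated by majority vote.
bagging = BaggingClassifier(n_estimators=100, random_state=42)
bagging.fit(X_train, y_train)

print(f"Bagged trees accuracy: {bagging.score(X_test, y_test):.3f}")
```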
Intuition: Different bootstrap samples → different trees → averaging smooths out individual quirks.
Mathematical Proof (simplified):
For $B$ independent models with variance $\sigma^2$:
\[\text{Var}[\text{avg}] = \frac{\sigma^2}{B}\]

With $B = 100$ trees: variance is 1% of single tree variance!
Even with correlation $\rho = 0.5$:
\[\text{Var}[\text{avg}] = 0.5\sigma^2 + \frac{0.5}{100}\sigma^2 = 0.505\sigma^2\]

Still roughly a 50% variance reduction!
The problem with bagging decision trees is that individual trees are still correlated because bootstrap samples are still quite similar. If one feature is very strong (e.g., “income” for loan prediction), all trees will split on it first. This creates correlation between trees, limiting variance reduction.
How do we fix this? Randomly select features at each split! We modify the tree-building process: at each split, randomly sample $m$ features from the total $p$ features and consider only those as split candidates.
Typical values: $m \approx \sqrt{p}$ for classification and $m \approx p/3$ for regression (setting $m = p$ recovers plain bagging).
Example:
Total features: 16
Random Forest samples: $\sqrt{16} = 4$ features per split
Tree 1, Root split: Consider features [2, 7, 10, 15]
Tree 1, Next split: Consider features [1, 3, 8, 12]
Tree 2, Root split: Consider features [4, 6, 9, 11]
…
By randomly selecting features at each split, we ensure that each tree is built from different subsets of features. Therefore, trees become more diverse and less correlated with one another.
Mathematical Insight:
Mathematically, the correlation $\rho$ between trees decreases as $m$ decreases:
\[m \downarrow \implies \rho \downarrow \implies \text{Var}[\text{ensemble}] \downarrow\]

But there’s a tradeoff:
\[m \downarrow \text{ too much} \implies \text{individual tree quality} \downarrow\]

The optimal $m$ balances tree quality and diversity.
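One way to see this tradeoff empirically is to sweep `max_features` (the $m$ above) on scikit-learn's `RandomForestClassifier` and compare cross-validated accuracy. A minimal sketch with an arbitrary synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=0)

# max_features = m, the number of features considered at each split.
for m in (1, 2, 4, "sqrt", 10, 20):
    rf = RandomForestClassifier(n_estimators=200, max_features=m, random_state=0)
    score = cross_val_score(rf, X, y, cv=5).mean()
    print(f"max_features={m!s:>4}: CV accuracy = {score:.3f}")
```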
function RandomForest(X, y, B, m):
"""
X: feature matrix (n × p)
y: target vector (n)
B: number of trees
m: number of features to sample per split
"""
forest = []
for b = 1 to B:
# 1. Bootstrap sample
X_b, y_b = bootstrap_sample(X, y)
# 2. Build tree with random feature selection
tree_b = build_tree(X_b, y_b, m)
# 3. Add to forest
forest.append(tree_b)
return forest
function build_tree(X, y, m):
"""Build single decision tree with random features."""
if stopping_criterion_met:
return leaf_node(y)
    # Randomly select m of the p total features (p = number of columns of X)
    features = random_sample(p, m)
# Find best split among these m features
best_feature, best_threshold = find_best_split(X[:, features], y)
# Split data
X_left, y_left, X_right, y_right = split(X, y, best_feature, best_threshold)
# Recursively build subtrees
left_tree = build_tree(X_left, y_left, m)
right_tree = build_tree(X_right, y_right, m)
return decision_node(best_feature, best_threshold, left_tree, right_tree)
function predict_forest_classification(x, forest):
    """Classification: majority vote across trees."""
    votes = [tree.predict(x) for tree in forest]
    return most_common(votes)

function predict_forest_regression(x, forest):
    """Regression: average the tree predictions."""
    predictions = [tree.predict(x) for tree in forest]
    return mean(predictions)
Classification:
Soft Voting Example:
Tree 1: P(Class A) = 0.8, P(Class B) = 0.2
Tree 2: P(Class A) = 0.6, P(Class B) = 0.4
Tree 3: P(Class A) = 0.7, P(Class B) = 0.3
Average: P(Class A) = 0.7, P(Class B) = 0.3 → Predict Class A
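Soft voting is simply an element-wise average of each tree's probability vector; a tiny NumPy sketch with the numbers above:

```python
import numpy as np

# Per-tree class probabilities: rows are trees, columns are [P(A), P(B)]
tree_probas = np.array([[0.8, 0.2],
                        [0.6, 0.4],
                        [0.7, 0.3]])

avg_proba = tree_probas.mean(axis=0)            # -> [0.7, 0.3]
predicted_class = ["A", "B"][int(avg_proba.argmax())]
print(avg_proba, "-> Predict Class", predicted_class)
```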
Regression:
Mean Example:
Tree 1: 145,000
Tree 2: 152,000
Tree 3: 148,000
Average: (145,000 + 152,000 + 148,000) / 3 = 148,333
OOB data is a powerful feature of random forests that allows us to estimate the model’s performance without using a separate validation set. Remember each bootstrap sample uses ~63% of data. The remaining ~37% are Out-of-Bag (OOB) samples.
For each observation $i$: collect the trees whose bootstrap samples did not include $i$, predict $i$ using only those trees, and compare the aggregated prediction with the true label $y_i$.
function calculate_oob_error(X, y, forest):
n = len(y)
oob_predictions = [None] * n
for i in range(n):
# Find trees where observation i was OOB
oob_trees = [tree for tree in forest if i not in tree.training_indices]
if len(oob_trees) > 0:
# Predict using only OOB trees
votes = [tree.predict(X[i]) for tree in oob_trees]
oob_predictions[i] = most_common(votes)
# Calculate error on OOB predictions
oob_error = error_rate(y, oob_predictions)
return oob_error
It’s important to note that each observation is predicted by trees it wasn’t trained on. Therefore, OOB error is an unbiased estimate of test error!
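scikit-learn exposes this directly: pass `oob_score=True` to `RandomForestClassifier` and read `oob_score_` after fitting. A minimal sketch:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# bootstrap=True is the default; oob_score=True scores each observation
# using only the trees that did not see it during training.
rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
rf.fit(X, y)

print(f"OOB score (accuracy estimate without a validation set): {rf.oob_score_:.3f}")
```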
Advantages:
It’s remarkable how closely OOB error approximates leave-one-out cross-validation error.
OOB Score = 1 - OOB Error Rate
Typically:
Feature importance is a powerful tool for understanding which features drive predictions. Random Forests provide this kind of interpretability out of the box, answering:
Idea: Measure total impurity reduction from splits using each feature.
Algorithm:
For each feature $j$:
\[\text{Importance}(j) = \sum_{t \in \text{Trees}} \sum_{s \in \text{Splits using } j} \Delta \text{Impurity}(s)\]

Where:

\[\Delta \text{Impurity}(s) = \text{Gini}_{\text{parent}} - \left(p_{\text{left}} \cdot \text{Gini}_{\text{left}} + p_{\text{right}} \cdot \text{Gini}_{\text{right}}\right)\]
Normalize: \(\text{Importance}_{\text{normalized}}(j) = \frac{\text{Importance}(j)}{\sum_{k=1}^{p} \text{Importance}(k)}\)
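scikit-learn's impurity-based `feature_importances_` attribute follows the same idea (average impurity decrease per feature, normalized to sum to 1). A minimal sketch:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=8,
                           n_informative=4, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0)
rf.fit(X, y)

# Importances are normalized to sum to 1 across all features.
for idx in np.argsort(rf.feature_importances_)[::-1]:
    print(f"feature_{idx}: {rf.feature_importances_[idx]:.3f}")
```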
Advantages:
Disadvantages:
Idea: Measure error increase when feature values are randomly permuted.
Algorithm:
function permutation_importance(model, X_test, y_test, j):
    # 1. Get baseline error on the untouched test set
    baseline_error = error(model.predict(X_test), y_test)

    # 2. Permute feature j, breaking its relationship with the target
    X_permuted = X_test.copy()
    X_permuted[:, j] = shuffle(X_permuted[:, j])

    # 3. Calculate error with the permuted feature
    permuted_error = error(model.predict(X_permuted), y_test)

    # 4. Importance = increase in error caused by permuting feature j
    importance_j = permuted_error - baseline_error
    return importance_j
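scikit-learn ships this procedure as `sklearn.inspection.permutation_importance`. A minimal sketch on a held-out test set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8,
                           n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

# Each feature is shuffled n_repeats times; importance = mean drop in score.
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: {result.importances_mean[idx]:.3f} "
          f"± {result.importances_std[idx]:.3f}")
```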
Advantages:
Disadvantages:
Suppose we’re predicting house prices with features:
Interpretation:
Strategy:
Benefits:
Classification:
Regression:
For regression, at each node:
\[\text{MSE}(S) = \frac{1}{|S|} \sum_{i \in S} (y_i - \bar{y})^2\]

Where $\bar{y}$ is the mean of target values in node $S$.
MSE Reduction (like information gain):
\[\text{MSE Reduction} = \text{MSE}(S) - \left(\frac{|S_L|}{|S|}\text{MSE}(S_L) + \frac{|S_R|}{|S|}\text{MSE}(S_R)\right)\]

Predicting house prices:
Tree 1 predicts: $245,000
Tree 2 predicts: $238,000
Tree 3 predicts: $251,000
…
Tree 100 predicts: $247,000
Final prediction: Average = $245,500
Spread of the tree predictions: standard deviation $\sigma = \$8{,}000$ (measures uncertainty)
Prediction interval: $\$245{,}500 \pm 2 \times \$8{,}000 = [\$229{,}500,\ \$261{,}500]$
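With scikit-learn's `RandomForestRegressor` you can reproduce this by querying each fitted tree through the `estimators_` attribute. A minimal sketch; note that the $\pm 2\sigma$ interval is a rough heuristic, not a calibrated prediction interval:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

x_new = X[:1]  # one query point
per_tree = np.array([tree.predict(x_new)[0] for tree in rf.estimators_])

mean_pred = per_tree.mean()
std_pred = per_tree.std()
print(f"Prediction: {mean_pred:.1f}")
print(f"Rough interval: [{mean_pred - 2*std_pred:.1f}, {mean_pred + 2*std_pred:.1f}]")
```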
import numpy as np
from collections import Counter
class SimpleRandomForest:
"""Simplified Random Forest implementation."""
def __init__(self, n_trees=10, max_depth=5, min_samples_split=2,
max_features='sqrt'):
self.n_trees = n_trees
self.max_depth = max_depth
self.min_samples_split = min_samples_split
self.max_features = max_features
self.trees = []
def _bootstrap_sample(self, X, y):
"""Create bootstrap sample."""
n_samples = X.shape[0]
indices = np.random.choice(n_samples, n_samples, replace=True)
return X[indices], y[indices]
def _calculate_max_features(self, n_features):
"""Calculate number of features to sample."""
if self.max_features == 'sqrt':
return int(np.sqrt(n_features))
elif self.max_features == 'log2':
return int(np.log2(n_features))
elif isinstance(self.max_features, int):
return self.max_features
else:
return n_features
def fit(self, X, y):
"""Train random forest."""
self.trees = []
for _ in range(self.n_trees):
# Create bootstrap sample
X_sample, y_sample = self._bootstrap_sample(X, y)
# Build tree
tree = DecisionTree(
max_depth=self.max_depth,
min_samples_split=self.min_samples_split,
max_features=self._calculate_max_features(X.shape[1])
)
tree.fit(X_sample, y_sample)
self.trees.append(tree)
return self
def predict(self, X):
"""Predict using majority vote."""
# Get predictions from all trees
tree_predictions = np.array([tree.predict(X) for tree in self.trees])
# Majority vote for each sample
predictions = []
for i in range(X.shape[0]):
votes = tree_predictions[:, i]
majority_vote = Counter(votes).most_common(1)[0][0]
predictions.append(majority_vote)
return np.array(predictions)
def predict_proba(self, X):
"""Predict class probabilities."""
# Get predictions from all trees
tree_predictions = np.array([tree.predict(X) for tree in self.trees])
# Calculate probabilities
n_samples = X.shape[0]
n_classes = len(np.unique(tree_predictions))
probabilities = np.zeros((n_samples, n_classes))
for i in range(n_samples):
votes = tree_predictions[:, i]
vote_counts = Counter(votes)
for class_label, count in vote_counts.items():
probabilities[i, class_label] = count / self.n_trees
return probabilities
class DecisionTree:
"""Simple decision tree with random feature selection."""
class Node:
def __init__(self, feature_idx=None, threshold=None,
left=None, right=None, value=None):
self.feature_idx = feature_idx
self.threshold = threshold
self.left = left
self.right = right
self.value = value
def __init__(self, max_depth=5, min_samples_split=2, max_features=None):
self.max_depth = max_depth
self.min_samples_split = min_samples_split
self.max_features = max_features
self.root = None
def _gini(self, y):
"""Calculate Gini impurity."""
m = len(y)
if m == 0:
return 0
counts = np.bincount(y)
probabilities = counts / m
return 1 - np.sum(probabilities ** 2)
def _best_split(self, X, y):
"""Find best split with random feature sampling."""
m, n = X.shape
if m <= 1:
return None, None
# Sample random features
if self.max_features and self.max_features < n:
feature_indices = np.random.choice(n, self.max_features, replace=False)
else:
feature_indices = np.arange(n)
best_gain = 0
best_feature = None
best_threshold = None
parent_gini = self._gini(y)
# Try each sampled feature
for feature_idx in feature_indices:
thresholds = np.unique(X[:, feature_idx])
for threshold in thresholds:
left_mask = X[:, feature_idx] <= threshold
right_mask = ~left_mask
if np.sum(left_mask) == 0 or np.sum(right_mask) == 0:
continue
# Calculate Gini gain
n_left = np.sum(left_mask)
n_right = np.sum(right_mask)
gini_left = self._gini(y[left_mask])
gini_right = self._gini(y[right_mask])
weighted_gini = (n_left/m) * gini_left + (n_right/m) * gini_right
gain = parent_gini - weighted_gini
if gain > best_gain:
best_gain = gain
best_feature = feature_idx
best_threshold = threshold
return best_feature, best_threshold
def _build_tree(self, X, y, depth=0):
"""Recursively build decision tree."""
n_samples, n_features = X.shape
n_classes = len(np.unique(y))
# Stopping criteria
if (depth >= self.max_depth or
n_classes == 1 or
n_samples < self.min_samples_split):
leaf_value = np.bincount(y).argmax()
return self.Node(value=leaf_value)
# Find best split
feature_idx, threshold = self._best_split(X, y)
if feature_idx is None:
leaf_value = np.bincount(y).argmax()
return self.Node(value=leaf_value)
# Split data
left_mask = X[:, feature_idx] <= threshold
right_mask = ~left_mask
# Build subtrees
left_child = self._build_tree(X[left_mask], y[left_mask], depth + 1)
right_child = self._build_tree(X[right_mask], y[right_mask], depth + 1)
return self.Node(
feature_idx=feature_idx,
threshold=threshold,
left=left_child,
right=right_child
)
def fit(self, X, y):
"""Build decision tree."""
self.root = self._build_tree(X, y)
return self
def _traverse_tree(self, x, node):
"""Traverse tree for prediction."""
if node.value is not None:
return node.value
if x[node.feature_idx] <= node.threshold:
return self._traverse_tree(x, node.left)
else:
return self._traverse_tree(x, node.right)
def predict(self, X):
"""Predict class labels."""
return np.array([self._traverse_tree(x, self.root) for x in X])
# Generate sample data
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=500, n_features=10,
n_informative=7, n_redundant=3,
random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Train Random Forest
rf = SimpleRandomForest(n_trees=50, max_depth=10, max_features='sqrt')
rf.fit(X_train, y_train)
# Predictions
y_pred = rf.predict(X_test)
accuracy = np.mean(y_pred == y_test)
print(f"Accuracy: {accuracy:.2%}")
1. Excellent Performance
2. Versatility
3. Feature Insights
4. Robustness
5. Parallelization
6. No Assumptions
1. Interpretability
2. Prediction Speed
3. Memory Usage
4. Extrapolation
5. Bias Toward Categorical Features
6. Not Optimal for All Tasks
| Aspect | Single Tree | Random Forest |
|---|---|---|
| Variance | High | Low (averaging) |
| Bias | Low | Low |
| Interpretability | High (visualize) | Low (ensemble) |
| Accuracy | Moderate | High |
| Training Time | Fast | Slower |
| Prediction Time | Very Fast | Moderate |
| Overfitting | Prone | Resistant |
| Aspect | Random Forest | Gradient Boosting |
|---|---|---|
| Training | Parallel | Sequential |
| Speed | Faster training | Slower training |
| Accuracy | Very good | Often better |
| Overfitting | Less prone | More prone |
| Tuning | Less sensitive | More sensitive |
| Interpretability | Moderate | Moderate |
| Aspect | Random Forest | Linear Models |
|---|---|---|
| Linear Relationships | Overkill | Optimal |
| Non-linear Relationships | Excellent | Poor |
| High-Dimensional Sparse | Moderate | Excellent |
| Interpretability | Feature importance | Coefficients |
| Preprocessing | Minimal | Scaling required |
Random Forests represent a crucial step in the evolution of machine learning algorithms - moving from single trees to ensembles that combine the strengths of multiple models. Their ability to handle complex, non-linear relationships while maintaining robustness to noise makes them one of the most widely used algorithms in practice.
Random Forests remain a go-to algorithm because they:
Master Random Forests, and you have a powerful tool that will serve you well across countless machine learning problems.
Original Papers: