Nov 26, 2025
Today we embark on a new series on Machine Learning. Without wasting any time, let’s dive right in. Imagine you’re teaching a child to recognize different types of fruit. You show them many examples (“This is an apple; it’s round and red”), and over time the child learns to identify apples even in pictures they’ve never seen before. This is exactly how machine learning works: computers learn patterns from examples rather than following explicit programming instructions.
Machine Learning (ML) is a field of artificial intelligence that enables computers to learn and improve from experience without being explicitly programmed. Instead of writing specific rules like “if temperature > 30°C, then hot,” ML algorithms discover patterns in data and make decisions based on those patterns.
Traditional Programming:
Rules + Data → Output
Example: You write code that says “if credit score > 700 and income > $50,000, approve loan”
Machine Learning:
Data + Output → Rules (Model)
Example: You provide examples of approved/rejected loans, and the algorithm learns the patterns that determine approval
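To make the contrast concrete, here is a minimal sketch of both approaches; the loan data and the decision-tree choice are made up for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

# Traditional programming: we write the rule ourselves
def approve_loan(credit_score, income):
    return credit_score > 700 and income > 50000

# Machine learning: the algorithm infers the rule from labeled examples
X = [[720, 60000], [650, 80000], [710, 40000], [750, 90000]]  # [score, income]
y = [1, 0, 0, 1]  # 1 = approved, 0 = rejected
model = DecisionTreeClassifier().fit(X, y)

print(model.predict([[730, 70000]]))  # the model applies the pattern it learned
```

The hand-written function encodes our assumptions; the tree discovers a similar boundary purely from the four examples.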
Machine learning has transformed technology in ways that would be impossible with traditional programming:
The key advantage? ML systems improve automatically as they see more data, adapting to new patterns without human intervention.
Machine learning algorithms fall into three main categories based on how they learn.
Definition: Learning from labeled examples where the correct answer is provided.
Analogy: Like studying with a teacher who provides correct answers to practice problems.
How it works:
Common Tasks:
Classification: Predicting categories
Regression: Predicting continuous values
Popular Algorithms:
Example:
# Training data: houses with features and prices
from sklearn.linear_model import LinearRegression

X_train = [[1500, 3, 2],   # [sqft, bedrooms, bathrooms]
           [2000, 4, 3],
           [1200, 2, 1]]
y_train = [300000, 450000, 250000]  # prices

# The algorithm learns: price ≈ f(sqft, bedrooms, bathrooms)
model = LinearRegression().fit(X_train, y_train)

# New prediction: what's the price of a 1800 sqft, 3 bed, 2 bath house?
print(model.predict([[1800, 3, 2]]))
Definition: Learning from unlabeled data by finding hidden patterns or structure.
Analogy: Like organizing books by genre when no one has told you what the genres are: you notice similarities and group them yourself.
How it works:
Common Tasks:
Clustering: Grouping similar items
Dimensionality Reduction: Simplifying data while preserving information
Popular Algorithms:
Example:
from sklearn.cluster import KMeans

# Customer data: spending patterns
customers = [[100, 20, 5],    # [grocery, electronics, clothing] spending
             [95, 25, 10],
             [30, 200, 150],
             [25, 180, 200]]

# K-Means discovers 2 groups:
# Group 1: customers 1-2 (grocery shoppers)
# Group 2: customers 3-4 (electronics/clothing shoppers)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(customers)
print(kmeans.labels_)  # customers 1-2 share one label, customers 3-4 the other
Definition: Learning through trial and error by receiving rewards or penalties.
Analogy: Like training a dog: treats for good behavior, corrections for bad behavior.
How it works:
Common Applications:
Popular Algorithms:
Example:
Robot learning to walk:
- Action: Move leg forward → Falls → Penalty (-10)
- Action: Balance + small step → Stays up → Reward (+5)
- Action: Series of balanced steps → Walks 10m → Reward (+100)
Over time, learns walking strategy that maximizes rewards.
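The reward loop above can be sketched as tabular Q-learning on a toy, made-up environment: a five-cell corridor where the agent starts at the left end and earns a reward for reaching the right end. The rewards, learning rate, and exploration rate are all illustrative choices:

```python
import random

random.seed(0)  # reproducibility

# Toy environment: states 0..4, goal at state 4; actions: 0 = left, 1 = right
n_states, n_actions, goal = 5, 2, 4
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount, exploration

for episode in range(500):
    state = 0
    while state != goal:
        # Epsilon-greedy: mostly exploit the best known action, sometimes explore
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: Q[state][a])
        next_state = max(0, min(goal, state + (1 if action == 1 else -1)))
        reward = 10 if next_state == goal else -1  # step penalty, goal reward
        # Q-learning update: nudge Q toward reward + discounted best future value
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

# After training, the greedy policy in every non-goal state is "move right"
policy = [max(range(n_actions), key=lambda a: Q[s][a]) for s in range(goal)]
print(policy)  # [1, 1, 1, 1]
```

The same reward-driven update, scaled up with function approximation, is what powers game-playing and robotics agents.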
Semi-Supervised: Combines small labeled dataset with large unlabeled dataset
Self-Supervised: Creates labels from the data itself
Every ML project follows a systematic process. Understanding this workflow is crucial for success.
Key Questions:
Example:
Data Collection:
Data Cleaning:
Exploratory Data Analysis (EDA):
import pandas as pd
import matplotlib.pyplot as plt

# Load data
data = pd.read_csv('customers.csv')

# Understand structure
data.info()             # Data types, missing values (prints its own summary)
print(data.describe())  # Statistical summary

# Visualize distributions
data['age'].hist(bins=30)
plt.show()

# Check correlations (numeric columns only)
correlation_matrix = data.corr(numeric_only=True)
Feature Selection: Choose relevant variables
Feature Creation: Build new meaningful features
# Example: Creating features for house price prediction
data['price_per_sqft'] = data['price'] / data['sqft']
data['age'] = 2025 - data['year_built']  # age relative to the current year
data['is_luxury'] = (data['price'] > 1000000).astype(int)
Feature Transformation:
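A common transformation is standardization (rescaling each feature to zero mean and unit variance); a minimal sketch with scikit-learn's StandardScaler on made-up housing features:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1500.0, 3], [2000.0, 4], [1200.0, 2]])  # [sqft, bedrooms]

# Standardize: subtract each column's mean, divide by its standard deviation
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # ≈ [0, 0]: each column now has mean 0
print(X_scaled.std(axis=0))   # ≈ [1, 1]: and standard deviation 1
```

In a real pipeline, fit the scaler on the training split only and reuse it to transform validation and test data, so no test information leaks into training.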
Divide data into training, validation, and test sets:
from sklearn.model_selection import train_test_split
# 70% train, 15% validation, 15% test
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)
Select Algorithm based on:
Train Model:
from sklearn.ensemble import RandomForestClassifier
# Initialize model
model = RandomForestClassifier(n_estimators=100, random_state=42)
# Train on training data
model.fit(X_train, y_train)
Evaluate Performance:
from sklearn.metrics import accuracy_score, classification_report
# Predictions on validation set
y_pred = model.predict(X_val)
# Evaluate
accuracy = accuracy_score(y_val, y_pred)
print(f"Validation Accuracy: {accuracy:.2%}")
print(classification_report(y_val, y_pred))
Hyperparameter Tuning:
from sklearn.model_selection import GridSearchCV
# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10]
}
# Grid search with cross-validation
grid_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid,
    cv=5,
    scoring='accuracy'
)
grid_search.fit(X_train, y_train)
# Best model
best_model = grid_search.best_estimator_
Test Set Evaluation (only once!):
# Final evaluation on held-out test set
test_accuracy = best_model.score(X_test, y_test)
print(f"Test Accuracy: {test_accuracy:.2%}")
Deployment: Integrate into production system
import joblib
joblib.dump(model, 'model.pkl')

The Golden Rule: Never test on data you trained on!
If you evaluate a model on the same data it learned from, you’ll get misleadingly optimistic results. It’s like giving students an exam made up of the exact questions they practiced with: high scores don’t prove they truly understand the material.
Training Set (60-80% of data)
Validation Set (10-20% of data)
Test Set (10-20% of data)
Random Split (most common):
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 20% for testing
    random_state=42,  # Reproducibility
    stratify=y        # Preserve class distribution
)
Stratified Split (for imbalanced data): Ensures each split has same proportion of each class
# If 70% class A, 30% class B in full data
# Stratified split maintains 70%-30% in train and test
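The comment above can be verified directly; a small sketch with synthetic labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 70 + [1] * 30)  # 70% class A (0), 30% class B (1)
X = np.arange(100).reshape(-1, 1)  # dummy feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(y_tr.mean(), y_te.mean())  # 0.3 0.3: both splits keep the 70/30 ratio
```

Without `stratify=y`, a small test set can easily end up with too few (or zero) minority-class examples.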
Time-Based Split (for temporal data):
# Don't randomize time series!
# Train on past, test on future
train_data = data[data['date'] < '2025-01-01']
test_data = data[data['date'] >= '2025-01-01']
A single train-test split can be “unlucky”: what if your test set happens to be particularly easy or hard? Cross-validation solves this by testing on multiple different splits.
Process:
Visualization:
Fold 1: [Test][Train][Train][Train][Train]
Fold 2: [Train][Test][Train][Train][Train]
Fold 3: [Train][Train][Test][Train][Train]
Fold 4: [Train][Train][Train][Test][Train]
Fold 5: [Train][Train][Train][Train][Test]
Average performance across all 5 folds
Implementation:
from sklearn.model_selection import cross_val_score
# 5-fold cross-validation
scores = cross_val_score(
    model,
    X_train,
    y_train,
    cv=5,               # Number of folds
    scoring='accuracy'
)
print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
Output:
CV Scores: [0.85, 0.87, 0.84, 0.88, 0.86]
Mean: 0.860 (+/- 0.028)
Maintains class distribution in each fold, which is crucial for imbalanced datasets:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=skf)
Extreme case where K = number of samples:
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)
Use CV for:
Don’t use CV for:
One of the most fundamental concepts in machine learning.
Bias: Error from wrong assumptions in the learning algorithm
Variance: Error from sensitivity to small fluctuations in training data
Analogy: Throwing darts at a target
- Low bias, low variance: all hits cluster near the bullseye (accurate and precise)
- High bias, low variance: hits cluster tightly but consistently off to one side (precise but inaccurate)
- Low bias, high variance: hits scatter around the bullseye (accurate on average but imprecise)
- High bias, high variance: hits scatter far from the target (neither accurate nor precise)
\(\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}\)
Where:
The Tradeoff Curve: as model complexity grows, bias² falls while variance rises; their sum, the total error, traces a U-shape over complexity.
Sweet Spot: the minimum of the total error curve, where bias and variance are balanced.
Reduce Bias (if underfitting):
Reduce Variance (if overfitting):
Consider fitting a polynomial to data:
Linear Model (degree 1): \(y = w_0 + w_1 x\)
Quadratic Model (degree 2): \(y = w_0 + w_1 x + w_2 x^2\)
High-Degree Polynomial (degree 15): \(y = w_0 + w_1 x + \dots + w_{15} x^{15}\)
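To see the tradeoff in numbers, here is a sketch on synthetic data (a made-up quadratic relationship with noise; the interleaved split and the three degrees are illustrative choices):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = np.sort(rng.uniform(-3, 3, 40)).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + rng.normal(0, 0.5, 40)  # true quadratic + noise

# Interleave points into train and test halves
X_train, X_test = X[::2], X[1::2]
y_train, y_test = y[::2], y[1::2]

results = {}
for degree in (1, 2, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:.2f}, test MSE {test_mse:.2f}")

# Degree 1 underfits (both errors high); degree 2 is about right;
# degree 15 can chase noise, pushing train error down while test error suffers
```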
Definition: Model is too simple to capture data patterns
Signs:
Example:
# Predicting house prices with only one feature
# Actual relationship: price depends on size, location, age, etc.
# Underfitting model: price = w * size + b
# This ignores important factors like location!
Solutions:
Definition: Model learns training data too well, including noise
Signs:
Example:
# Training accuracy: 99%
# Test accuracy: 65%
# → Model memorized training data instead of learning general patterns
Visual Example:
- Underfitting: a straight line that misses the curvature of the data
- Just right: a smooth curve that captures the overall trend
- Overfitting: a wiggly curve that fits every point, noise included
Solutions:
Goal: Find model that is “just right”
Strategy: Monitor both training and validation performance
# Pseudocode: track both metrics during training
# (train_model and evaluate_model stand in for your actual training step)
for epoch in range(100):
    train_loss = train_model(X_train, y_train)
    val_loss = evaluate_model(X_val, y_val)
    # If val_loss starts increasing while train_loss keeps decreasing,
    # the model is overfitting → stop training (early stopping)
Accuracy: Percentage of correct predictions
accuracy = (correct_predictions / total_predictions)
# Example: 85/100 = 0.85 or 85% accuracy
When accuracy is misleading: Imbalanced datasets
Dataset: 95% non-fraud, 5% fraud
Model: Always predict "non-fraud"
Accuracy: 95% (looks great!)
But: Misses ALL fraud cases (actually terrible!)
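This failure mode is easy to reproduce with synthetic labels; a sketch mirroring the 95/5 split above:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 95 + [1] * 5)  # 95% non-fraud (0), 5% fraud (1)
y_pred = np.zeros(100, dtype=int)      # "model" that always predicts non-fraud

print(accuracy_score(y_true, y_pred))  # 0.95: looks great
print(recall_score(y_true, y_pred))    # 0.0: catches zero fraud cases
```

This is why imbalanced problems are judged on recall, precision, or F1 rather than raw accuracy.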
Confusion Matrix: Detailed breakdown of predictions
Predicted
Neg Pos
Actual Neg [ TN | FP ]
Pos [ FN | TP ]
TN: True Negatives (correctly predicted negative)
TP: True Positives (correctly predicted positive)
FN: False Negatives (missed positives - Type II error)
FP: False Positives (false alarms - Type I error)
Precision: Of predicted positives, how many were correct? \(\text{Precision} = \frac{TP}{TP + FP}\)
Use when: False positives are costly
Recall (Sensitivity): Of actual positives, how many did we find? \(\text{Recall} = \frac{TP}{TP + FN}\)
Use when: False negatives are costly
F1-Score: Harmonic mean of precision and recall \(F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\)
Use when: Need balance between precision and recall
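These formulas are easy to check by hand; a sketch using hypothetical confusion-matrix counts:

```python
# Assumed counts for a hypothetical binary classifier
TN, FP, FN, TP = 45, 5, 3, 47

precision = TP / (TP + FP)                          # 47/52 ≈ 0.904
recall = TP / (TP + FN)                             # 47/50 = 0.94
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.922
accuracy = (TP + TN) / (TP + TN + FP + FN)          # 92/100 = 0.92

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} acc={accuracy:.2f}")
```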
Example:
from sklearn.metrics import classification_report, confusion_matrix
y_pred = model.predict(X_test)
# Confusion matrix
print(confusion_matrix(y_test, y_pred))
# [[45 5] TN=45, FP=5
# [ 3 47]] FN=3, TP=47
# Detailed metrics
print(classification_report(y_test, y_pred))
# precision recall f1-score
# Class 0 0.94 0.90 0.92
# Class 1 0.90 0.94 0.92
# accuracy 0.92
Mean Absolute Error (MAE): Average absolute difference \(MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|\)
Mean Squared Error (MSE): Average squared difference \(MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\)
Root Mean Squared Error (RMSE): Square root of MSE \(RMSE = \sqrt{MSE}\)
R² Score (Coefficient of Determination): Proportion of variance explained \(R^2 = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}\)
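A quick hand-check of these definitions on three made-up points:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.5, 5.0, 4.0])

mae = np.abs(y_true - y_pred).mean()   # (0.5 + 0 + 2) / 3 ≈ 0.833
mse = ((y_true - y_pred) ** 2).mean()  # (0.25 + 0 + 4) / 3 ≈ 1.417
rmse = np.sqrt(mse)                    # ≈ 1.19, back in the units of y

ss_res = ((y_true - y_pred) ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot               # ≈ 0.089: barely better than predicting the mean

print(mae, rmse, r2)
```

Note that MSE and RMSE punish large errors more than MAE does: the single 2.0-unit miss dominates both squared metrics.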
Example:
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"MAE: ${mae:,.0f}")
print(f"RMSE: ${rmse:,.0f}")
print(f"R²: {r2:.3f}")
# Complete ML workflow in ~20 lines
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# 1. Load data
iris = load_iris()
X, y = iris.data, iris.target
# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# 3. Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# 4. Evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
# Data manipulation
import numpy as np # Numerical operations
import pandas as pd # Data structures and analysis
# Machine learning
import sklearn           # Scikit-learn: ML algorithms (import submodules as needed)
import xgboost as xgb # Gradient boosting
import lightgbm as lgb # Fast gradient boosting
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Model deployment
import joblib # Save/load models
import pickle # Serialization
You now understand the fundamental concepts that underpin all of machine learning:
The ML Workflow: Systematic process from problem definition to deployment
Data Splitting: Train/validation/test sets prevent overfitting
Cross-Validation: Robust performance estimation through multiple splits
Bias-Variance Tradeoff: Balance between simplicity and complexity
Overfitting vs Underfitting: Finding the “just right” model complexity
Now that you have this foundation, you’re ready to dive deeper into specific algorithms and techniques.
Books:
Online Courses:
Practice Platforms:
Remember, machine learning is a journey, not a destination. Every practitioner, from beginners to experts, continuously learns and improves. The field evolves rapidly, but the fundamentals you’ve learned here remain constant.
Start simple. Stay curious. Keep building.