Machine Learning Series

by Mayank Sharma

Introduction to Machine Learning: Understanding the Fundamentals

Nov 26, 2025

Today we embark on a new series on Machine Learning. Without wasting any time, let’s dive right in. Imagine you’re teaching a child to recognize different types of fruits. You show them many examples: “This is an apple; it’s round and red.” Over time, the child learns to identify apples even in pictures they’ve never seen before. This is exactly how machine learning works: computers learn patterns from examples rather than following explicit programming instructions.

Table of Contents

  1. What is Machine Learning?
  2. Types of Machine Learning
  3. The Machine Learning Workflow
  4. Data Splitting: Train, Validation, and Test Sets
  5. Cross-Validation: Robust Model Evaluation
  6. The Bias-Variance Tradeoff
  7. Overfitting and Underfitting
  8. Model Evaluation Fundamentals
  9. Getting Started with Your First ML Project
  10. Conclusion and Resources

What is Machine Learning?

The Core Idea

Machine Learning (ML) is a field of artificial intelligence that enables computers to learn and improve from experience without being explicitly programmed. Instead of writing specific rules like “if temperature > 30°C, then hot,” ML algorithms discover patterns in data and make decisions based on those patterns.

Traditional Programming vs. Machine Learning

Traditional Programming:

Rules + Data → Output

Example: You write code that says “if credit score > 700 and income > $50,000, approve loan”

Machine Learning:

Data + Output → Rules (Model)

Example: You provide examples of approved/rejected loans, and the algorithm learns the patterns that determine approval
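The contrast can be sketched in a few lines of Python. This is a minimal illustration, not a real loan model; the threshold rule and the tiny dataset are made up:

```python
# Traditional programming: we hand-write the rule.
def approve_loan_rule(credit_score, income):
    return credit_score > 700 and income > 50000

# Machine learning: we hand the algorithm examples, and it learns the rule.
from sklearn.tree import DecisionTreeClassifier

X = [[720, 60000], [650, 40000], [710, 80000], [580, 30000]]  # [score, income]
y = [1, 0, 1, 0]                                              # 1 = approved

model = DecisionTreeClassifier().fit(X, y)
print(model.predict([[705, 55000]]))  # the decision boundary was learned, not written
```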

Why Machine Learning Matters

Machine learning has transformed technology in ways that would be impossible with traditional programming.

The key advantage? ML systems improve automatically as they see more data, adapting to new patterns without human intervention.

Types of Machine Learning

Machine learning algorithms fall into three main categories based on how they learn.

1. Supervised Learning

Definition: Learning from labeled examples where the correct answer is provided.

Analogy: Like studying with a teacher who provides correct answers to practice problems.

How it works:

Common Tasks:

Classification: Predicting categories

Regression: Predicting continuous values

Popular Algorithms:

Example:

# Training data: houses with features and prices
X_train = [[1500, 3, 2],  # [sqft, bedrooms, bathrooms]
           [2000, 4, 3],
           [1200, 2, 1]]
y_train = [300000, 450000, 250000]  # prices

# The algorithm learns: price ≈ f(sqft, bedrooms, bathrooms)
# New prediction: What's the price of a 1800 sqft, 3 bed, 2 bath house?
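A sketch of actually fitting this with scikit-learn’s LinearRegression on the toy data above. With only three examples the fit is exact, so treat the prediction as purely illustrative:

```python
from sklearn.linear_model import LinearRegression

X_train = [[1500, 3, 2],  # [sqft, bedrooms, bathrooms]
           [2000, 4, 3],
           [1200, 2, 1]]
y_train = [300000, 450000, 250000]  # prices

# The algorithm learns: price ≈ f(sqft, bedrooms, bathrooms)
model = LinearRegression().fit(X_train, y_train)

# New prediction: 1800 sqft, 3 bed, 2 bath
predicted = model.predict([[1800, 3, 2]])[0]
print(f"Predicted price: ${predicted:,.0f}")
```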

2. Unsupervised Learning

Definition: Learning from unlabeled data to find hidden patterns or structure.

Analogy: Like organizing books by genre when no one told you what the genres are: you notice similarities and group them yourself.

How it works:

Common Tasks:

Clustering: Grouping similar items

Dimensionality Reduction: Simplifying data while preserving information

Popular Algorithms:

Example:

# Customer data: spending patterns
customers = [[100, 20, 5],   # [grocery, electronics, clothing] spending
             [95, 25, 10],
             [30, 200, 150],
             [25, 180, 200]]

# K-Means discovers 2 groups:
# Group 1: [customers 1,2] - grocery shoppers
# Group 2: [customers 3,4] - electronics/clothing shoppers
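This grouping can be reproduced with scikit-learn’s KMeans. A sketch; note that which group gets label 0 and which gets label 1 can come out in either order:

```python
import numpy as np
from sklearn.cluster import KMeans

customers = np.array([[100, 20, 5],    # [grocery, electronics, clothing] spending
                      [95, 25, 10],
                      [30, 200, 150],
                      [25, 180, 200]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(customers)
print(labels)  # first two customers share one label, last two the other
```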

3. Reinforcement Learning

Definition: Learning through trial and error by receiving rewards or penalties.

Analogy: Like training a dog: give treats for good behavior, corrections for bad behavior.

How it works:

Common Applications:

Popular Algorithms:

Example:

Robot learning to walk:
- Action: Move leg forward → Falls → Penalty (-10)
- Action: Balance + small step → Stays up → Reward (+5)
- Action: Series of balanced steps → Walks 10m → Reward (+100)

Over time, the robot learns a walking strategy that maximizes rewards.
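The same trial-and-error loop can be sketched with a much simpler problem than a walking robot: a two-action bandit where the agent learns, purely from noisy rewards, which action to prefer. The action names and reward values below are hypothetical:

```python
import random

random.seed(0)

# Two candidate actions with hidden average rewards the agent must discover.
true_rewards = {"small_step": 5.0, "lunge": -10.0}
estimates = {"small_step": 0.0, "lunge": 0.0}  # agent's learned value of each action
counts = {"small_step": 0, "lunge": 0}

for step in range(200):
    # Epsilon-greedy: explore 10% of the time, otherwise pick the best so far.
    if random.random() < 0.1:
        action = random.choice(list(estimates))
    else:
        action = max(estimates, key=estimates.get)

    reward = true_rewards[action] + random.gauss(0, 1)  # noisy feedback
    counts[action] += 1
    # Incremental average: nudge the estimate toward the observed reward.
    estimates[action] += (reward - estimates[action]) / counts[action]

best_action = max(estimates, key=estimates.get)
print(f"learned to prefer: {best_action}")
```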

Semi-Supervised and Self-Supervised Learning

Semi-Supervised: Combines small labeled dataset with large unlabeled dataset

Self-Supervised: Creates labels from the data itself

The Machine Learning Workflow

Every ML project follows a systematic process. Understanding this workflow is crucial for success.

Step 1: Define the Problem

Key Questions:

Example:

Step 2: Collect and Prepare Data

Data Collection:

Data Cleaning:

Exploratory Data Analysis (EDA):

import pandas as pd
import matplotlib.pyplot as plt

# Load data
data = pd.read_csv('customers.csv')

# Understand structure
print(data.info())        # Data types, missing values
print(data.describe())    # Statistical summary

# Visualize distributions
data['age'].hist(bins=30)
plt.show()

# Check correlations
correlation_matrix = data.corr(numeric_only=True)  # correlations between numeric columns

Step 3: Feature Engineering

Feature Selection: Choose relevant variables

Feature Creation: Build new meaningful features

# Example: Creating features for house price prediction
data['price_per_sqft'] = data['price'] / data['sqft']
data['age'] = 2025 - data['year_built']
data['is_luxury'] = (data['price'] > 1000000).astype(int)

Feature Transformation:
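Two common transformations are scaling numeric features and encoding categorical ones. A sketch of scaling with StandardScaler, on toy numbers, so that large-valued columns (like sqft) don’t dominate small ones (like bathrooms):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: [sqft, bathrooms]
X = np.array([[1500.0, 2], [2000.0, 3], [1200.0, 1]])

# Rescale each column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # ≈ [0, 0]
print(X_scaled.std(axis=0))   # ≈ [1, 1]
```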

Step 4: Split Data

Divide data into training, validation, and test sets:

from sklearn.model_selection import train_test_split

# 70% train, 15% validation, 15% test
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

Step 5: Choose and Train Model

Select Algorithm based on:

Train Model:

from sklearn.ensemble import RandomForestClassifier

# Initialize model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train on training data
model.fit(X_train, y_train)

Step 6: Evaluate and Tune

Evaluate Performance:

from sklearn.metrics import accuracy_score, classification_report

# Predictions on validation set
y_pred = model.predict(X_val)

# Evaluate
accuracy = accuracy_score(y_val, y_pred)
print(f"Validation Accuracy: {accuracy:.2%}")
print(classification_report(y_val, y_pred))

Hyperparameter Tuning:

from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10]
}

# Grid search with cross-validation
grid_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid,
    cv=5,
    scoring='accuracy'
)
grid_search.fit(X_train, y_train)

# Best model
best_model = grid_search.best_estimator_

Step 7: Final Evaluation and Deployment

Test Set Evaluation (only once!):

# Final evaluation on held-out test set
test_accuracy = best_model.score(X_test, y_test)
print(f"Test Accuracy: {test_accuracy:.2%}")

Deployment: Integrate into production system

Data Splitting: Train, Validation, and Test Sets

Why Split Data?

The Golden Rule: Never test on data you trained on!

If you evaluate a model on the same data it learned from, you’ll get misleadingly optimistic results. It’s like giving students an exam made up of the exact practice questions they studied: high scores wouldn’t prove they truly understand the material.

The Three Sets

Training Set (60-80% of data)

Validation Set (10-20% of data)

Test Set (10-20% of data)

Split Strategies

Random Split (most common):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% for testing
    random_state=42,    # Reproducibility
    stratify=y          # Preserve class distribution
)

Stratified Split (for imbalanced data): Ensures each split has same proportion of each class

# If 70% class A, 30% class B in full data
# Stratified split maintains 70%-30% in train and test
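A quick check, on made-up labels, that stratify=y preserves the class ratio in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 70 + [1] * 30)   # 70% class 0, 30% class 1
X = np.arange(100).reshape(-1, 1)   # dummy features

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(np.bincount(y_tr))  # [56 24] → still 70%-30%
print(np.bincount(y_te))  # [14  6] → still 70%-30%
```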

Time-Based Split (for temporal data):

# Don't randomize time series!
# Train on past, test on future
train_data = data[data['date'] < '2025-01-01']
test_data = data[data['date'] >= '2025-01-01']
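For cross-validating temporal data, scikit-learn also provides TimeSeriesSplit, which yields successive splits where the training window always precedes the test window (a sketch on ten time-ordered samples):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # 10 time-ordered samples

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # Every training index comes before every test index.
    print(f"train={train_idx.tolist()} test={test_idx.tolist()}")
```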

Cross-Validation: Robust Model Evaluation

The Problem with Single Split

A single train-test split can be “unlucky”; what if your test set happens to be particularly easy or hard? Cross-validation solves this by testing on multiple different splits.

K-Fold Cross-Validation

Process:

  1. Split data into K equal parts (folds)
  2. Train K times, each time using different fold as validation
  3. Average results across all folds

Visualization:

Fold 1: [Test][Train][Train][Train][Train]
Fold 2: [Train][Test][Train][Train][Train]
Fold 3: [Train][Train][Test][Train][Train]
Fold 4: [Train][Train][Train][Test][Train]
Fold 5: [Train][Train][Train][Train][Test]

Average performance across all 5 folds

Implementation:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation
scores = cross_val_score(
    model,
    X_train,
    y_train,
    cv=5,              # Number of folds
    scoring='accuracy'
)

print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")

Output:

CV Scores: [0.85, 0.87, 0.84, 0.88, 0.86]
Mean: 0.860 (+/- 0.028)

Stratified K-Fold

Maintains class distribution in each fold, which is crucial for imbalanced datasets:

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=skf)

Leave-One-Out Cross-Validation (LOOCV)

Extreme case where K = number of samples:

from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)

When to Use Cross-Validation

Use CV for:

Don’t use CV for:

The Bias-Variance Tradeoff

One of the most fundamental concepts in machine learning.

Understanding Bias and Variance

Bias: Error from wrong assumptions in the learning algorithm

Variance: Error from sensitivity to small fluctuations in training data

Analogy: Throwing darts at a target

Low Bias, Low Variance:     High Bias, Low Variance:
    Accurate & Precise           Precise but Inaccurate
         🎯                              ·
        · · ·                          · · ·
         ·                              ·
                                    (all hits consistently
(all hits near bullseye)            off to the side)

Low Bias, High Variance:    High Bias, High Variance:
   Accurate but Imprecise        Neither Accurate nor Precise
      ·     ·                          ·   ·
        🎯                               ·
    ·       ·                        ·       ·
      ·                                  ·
(scattered around bullseye)        (scattered, off-target)

The Tradeoff

Total Error = Bias² + Variance + Irreducible Error

Where: Bias² is error from overly simplistic assumptions, Variance is error from sensitivity to the particular training sample, and Irreducible Error is noise in the data that no model can remove.

The Tradeoff Curve:

Error
  |
  |  Bias²      : high for simple models, falls as complexity grows
  |  Variance   : low for simple models, rises as complexity grows
  |  Total Error: U-shaped, reaching its minimum at the sweet spot
  |
  |______________________________ Model Complexity
   Simple                  Complex
Sweet Spot: Minimum total error balances bias and variance

Managing the Tradeoff

Reduce Bias (if underfitting):

Reduce Variance (if overfitting):

Mathematical Example

Consider fitting a polynomial to data:

Linear Model (degree 1): $y = w_0 + w_1x$

Quadratic Model (degree 2): $y = w_0 + w_1x + w_2x^2$

High-Degree Polynomial (degree 15): $y = w_0 + w_1x + … + w_{15}x^{15}$
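The three model families can be compared directly. A sketch on noisy quadratic data (the data-generating function and seed are arbitrary): the linear model underfits, the quadratic fits well, and the degree-15 polynomial drives training error down, typically at the expense of test error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = np.sort(rng.uniform(-3, 3, 40)).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + rng.normal(0, 0.5, size=40)  # quadratic truth + noise

# Alternate points between train and test sets
X_train, X_test = X[::2], X[1::2]
y_train, y_test = y[::2], y[1::2]

results = {}
for degree in (1, 2, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    results[degree] = (
        mean_squared_error(y_train, model.predict(X_train)),
        mean_squared_error(y_test, model.predict(X_test)),
    )
    print(f"degree {degree:2d}: train MSE {results[degree][0]:.2f}, "
          f"test MSE {results[degree][1]:.2f}")
```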

Overfitting and Underfitting

Underfitting (High Bias)

Definition: Model is too simple to capture data patterns

Signs:

Example:

# Predicting house prices with only one feature
# Actual relationship: price depends on size, location, age, etc.
# Underfitting model: price = w * size + b

# This ignores important factors like location!

Solutions:

Overfitting (High Variance)

Definition: Model learns training data too well, including noise

Signs:

Example:

# Training accuracy: 99%
# Test accuracy: 65%
# → Model memorized training data instead of learning general patterns

Visual Example:

Underfitting: a straight line that misses the trend entirely
Just Right:   a smooth curve that captures the underlying trend
Overfitting:  a wiggly curve that passes through every point, noise included

Solutions:

The Goldilocks Principle

Goal: Find model that is “just right”

Strategy: Monitor both training and validation performance

# During training, track both metrics
# (train_model / evaluate_model stand in for your actual training loop)
for epoch in range(100):
    train_loss = train_model(X_train, y_train)
    val_loss = evaluate_model(X_val, y_val)

    # If val_loss starts increasing while train_loss keeps decreasing,
    # the model is overfitting → stop training.

Model Evaluation Fundamentals

Evaluation Metrics for Classification

Accuracy: Percentage of correct predictions

accuracy = (correct_predictions / total_predictions)

# Example: 85/100 = 0.85 or 85% accuracy

When accuracy is misleading: Imbalanced datasets

Dataset: 95% non-fraud, 5% fraud
Model: Always predict "non-fraud"
Accuracy: 95% (looks great!)
But: Misses ALL fraud cases (actually terrible!)
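This scenario takes only a few lines to reproduce with synthetic labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 95 + [1] * 5)   # 95% non-fraud, 5% fraud
y_pred = np.zeros(100, dtype=int)       # always predict "non-fraud"

print(accuracy_score(y_true, y_pred))   # 0.95 — looks great
print(recall_score(y_true, y_pred))     # 0.0, catches zero fraud
```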

Confusion Matrix: Detailed breakdown of predictions

                Predicted
                 Neg   Pos
Actual  Neg    [ TN  | FP ]
        Pos    [ FN  | TP ]

TN: True Negatives (correctly predicted negative)
TP: True Positives (correctly predicted positive)
FN: False Negatives (missed positives - Type II error)
FP: False Positives (false alarms - Type I error)

Precision: Of predicted positives, how many were correct? \(\text{Precision} = \frac{TP}{TP + FP}\)

Use when: False positives are costly

Recall (Sensitivity): Of actual positives, how many did we find? \(\text{Recall} = \frac{TP}{TP + FN}\)

Use when: False negatives are costly

F1-Score: Harmonic mean of precision and recall \(F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\)

Use when: Need balance between precision and recall
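These formulas are easy to verify by hand. Using the counts TN=45, FP=5, FN=3, TP=47 from the example that follows:

```python
# Counts from a confusion matrix
TP, FP, FN = 47, 5, 3

precision = TP / (TP + FP)                          # 47/52 ≈ 0.904
recall = TP / (TP + FN)                             # 47/50 = 0.94
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```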

Example:

from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test)

# Confusion matrix
print(confusion_matrix(y_test, y_pred))
# [[45  5]   TN=45, FP=5
#  [ 3 47]]  FN=3,  TP=47

# Detailed metrics
print(classification_report(y_test, y_pred))
#               precision  recall  f1-score
# Class 0         0.94      0.90     0.92
# Class 1         0.90      0.94     0.92
# accuracy                           0.92

Evaluation Metrics for Regression

Mean Absolute Error (MAE): Average absolute difference \(MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|\)

Mean Squared Error (MSE): Average squared difference \(MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\)

Root Mean Squared Error (RMSE): Square root of MSE \(RMSE = \sqrt{MSE}\)

R² Score (Coefficient of Determination): Proportion of variance explained \(R^2 = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}\)

Example:

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"MAE: ${mae:,.0f}")
print(f"RMSE: ${rmse:,.0f}")
print(f"R²: {r2:.3f}")

Getting Started with Your First ML Project

Beginner-Friendly Project Ideas

  1. Iris Flower Classification (Classic starter)
    • Dataset: 150 samples, 4 features
    • Task: Classify into 3 species
    • Algorithms to try: Logistic Regression, Decision Trees, KNN
  2. House Price Prediction
    • Dataset: California Housing (the older Boston Housing dataset has been removed from scikit-learn) or Kaggle datasets
    • Task: Predict price from features
    • Algorithms: Linear Regression, Random Forest
  3. Titanic Survival Prediction
    • Dataset: Passenger data from Titanic
    • Task: Predict survival (yes/no)
    • Practice feature engineering and handling missing data

Minimal Working Example

# Complete ML workflow in ~20 lines
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# 1. Load data
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 4. Evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.2%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

Essential Libraries to Learn

# Data manipulation
import numpy as np           # Numerical operations
import pandas as pd          # Data structures and analysis

# Machine learning
import sklearn               # Scikit-learn: ML algorithms (import submodules as needed)
import xgboost as xgb        # Gradient boosting
import lightgbm as lgb       # Fast gradient boosting

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Model deployment
import joblib               # Save/load models
import pickle              # Serialization

Best Practices Checklist

Common Beginner Mistakes to Avoid

Conclusion and Resources

You now understand the fundamental concepts that underpin all of machine learning:

  1. Machine Learning Types:
    • Supervised: Learn from labeled examples
    • Unsupervised: Find patterns without labels
    • Reinforcement: Learn through trial and error
  2. The ML Workflow: Systematic process from problem definition to deployment

  3. Data Splitting: Train/validation/test sets prevent overfitting

  4. Cross-Validation: Robust performance estimation through multiple splits

  5. Bias-Variance Tradeoff: Balance between simplicity and complexity

  6. Overfitting vs Underfitting: Finding the “just right” model complexity

  7. Evaluation Metrics: Measuring model performance appropriately

Now that you have this foundation, you’re ready to dive deeper into specific algorithms and techniques.

Resources for Deeper Learning

Books:

Online Courses:

Practice Platforms:

Remember, machine learning is a journey, not a destination. Every practitioner, from beginners to experts, continuously learns and improves. The field evolves rapidly, but the fundamentals you’ve learned here remain constant.

Start simple. Stay curious. Keep building.