Edited By Siddhartha Reddy Jonnalagadda, PhD
Written By Hundreds of Parents
Welcome to the next chapter of your journey into the world of AI. In our last book, you learned about the language of uncertainty—the principles of probability and distributions that form the foundation of intelligent systems.
This book is a bridge from theory to practice. You’ll learn that a machine learning model isn’t a mysterious black box; it’s a tool that uses probabilistic reasoning to find patterns in data and make predictions. We’ll introduce the most important libraries for a beginner’s AI toolkit: Pandas for data handling and Scikit-learn for machine learning algorithms.
Before you can build, you need the right tools. This chapter will get you set up with everything you need. We’ll work in a Jupyter Notebook or Google Colab, browser-based environments that let you write and run Python code interactively, which makes them perfect for experimenting with data.
1.1 Preparing Your Workspace
We’ll start by ensuring you have all the necessary libraries installed. Think of this as unpacking a toolbox of specialized tools.
# Installing key libraries
!pip install numpy pandas scikit-learn matplotlib
This single command will set up your environment to handle numerical data, data manipulation, machine learning, and data visualization.
1.2 A Tour of Pandas
Data in the real world is messy. Pandas is a powerful library that helps you organize, clean, and analyze data efficiently. You’ll learn about the DataFrame, a central object in Pandas that’s like a spreadsheet or a table you can manipulate in Python. Each row represents a data point, and each column is a feature. You’ll get hands-on with basic operations like loading a dataset, inspecting its contents, and handling missing values.
import pandas as pd
# Load a sample dataset
df = pd.read_csv('sample_data.csv')
# View the first 5 rows
print(df.head())
# Get a summary of the data
print(df.info())
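Handling missing values is just as easy. Here’s a minimal sketch, continuing with the DataFrame df loaded above (which columns actually have gaps will depend on your own data):
# Count the missing values in each column
print(df.isnull().sum())
# Option 1: drop any row that contains a missing value
df_clean = df.dropna()
# Option 2: fill gaps in numeric columns with each column's average
df_filled = df.fillna(df.mean(numeric_only=True))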
By the end of this section, you’ll be comfortable with preparing your data for a machine learning model.
1.3 Introduction to Scikit-learn
Scikit-learn is a high-level library that provides a consistent interface for a wide range of machine learning algorithms. We’ll introduce its core principles: Estimators (the "model" or "learner" object), the fit() method (for training), and the predict() method (for making predictions).
The Estimator: Think of an Estimator as a blank blueprint for a model, like a LinearRegression() object.
The fit() method: This is the learning step. You feed the estimator your data and the correct answers, and it "fits" the blueprint to the data. It’s like a chef learning to cook a new dish by following a recipe.
The predict() method: Once the model is trained, you can give it new, unseen data, and it will use what it learned to make a prediction. It’s like the chef now cooking the dish on their own.
This section will give you a mental model for how all Scikit-learn algorithms work, preparing you for the hands-on projects in the next chapters.
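To make that Estimator, fit(), predict() pattern concrete, here is a minimal sketch on a tiny, made-up dataset:
from sklearn.linear_model import LinearRegression

# Toy data: inputs (X) and the "correct answers" (y)
X = [[1], [2], [3], [4]]
y = [2, 4, 6, 8]

# 1. The Estimator: a blank blueprint for a model
model = LinearRegression()
# 2. fit(): the learning step
model.fit(X, y)
# 3. predict(): apply what was learned to new, unseen input
print(model.predict([[5]]))  # roughly 10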
This chapter introduces supervised learning, the most common type of machine learning. Imagine you’re learning to identify different types of fruit. Your parent gives you a pile of apples, bananas, and oranges, and for each one, they tell you its name. After seeing enough examples, you can identify a new fruit you’ve never seen before. Supervised learning works in the same way; the model learns to predict an output based on a set of input data that has already been labeled with the correct answers.
The core idea of supervised learning is to build a model that can map an input to an output. A collection of data points, or examples, forms a dataset. A single data point consists of an input, which is often a set of features like the color or weight of a fruit, and its corresponding output, which is the correct label. In mathematical terms, a single data point is often represented as $(x, y)$, where $x$ is the input and $y$ is the output. The entire dataset is a collection of these pairs, $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}$, where $m$ is the total number of examples.
Our goal is to build a predictor, often called a "classifier" when the output is a category, that learns how to predict the correct output $y$ from a new input $x$ that it hasn’t seen before. The type of prediction a model makes determines what kind of model it is.
| Type of Prediction | Outcome | Examples |
| --- | --- | --- |
| Regression | Continuous | Linear Regression, predicting a number |
| Classification | Discrete (a class) | Logistic Regression, predicting a category |
In machine learning, the model we choose is called the hypothesis, and it is often denoted as $h_{\theta}$. The $\theta$ (theta) represents the parameters or settings that the model learns during training. Think of these parameters as the model’s internal adjustments, like the dials on a radio that you turn to tune into the right station. For a given input $x$, the model’s prediction is the output of this hypothesis, $h_{\theta}(x)$.
The h is the blueprint, and θ is what makes the blueprint unique to the data it has learned from. For example, a linear regression model’s blueprint is a straight line, but the specific values of θ determine the slope and y-intercept of that line, making it a unique line that fits the data.
How does a model know if it’s right or wrong? It uses a loss function. A loss function takes the model’s prediction and the real value and tells you how different they are. If a model predicts a house price of $300,000 for a house that actually costs $310,000, the loss function calculates the "error" or "mistake" for that single prediction.
The cost function ($J$) is a way to measure the overall performance of the model across the entire dataset. It’s the sum of the loss for every single data point. The goal of training a model is to find the parameters (θ) that minimize this cost function. A low cost means the model’s predictions are, on average, very close to the actual values.
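To make this concrete, here is a minimal sketch of a squared-error cost on made-up house prices; averaging the per-example losses instead of summing them changes nothing about which parameters are best:
import numpy as np

# Made-up predictions and actual house prices
predictions = np.array([300_000, 255_000, 410_000])
actuals = np.array([310_000, 250_000, 400_000])

# Loss for each data point: the squared error
losses = (predictions - actuals) ** 2
# Cost J: the loss aggregated (here, averaged) over the whole dataset
cost = losses.mean()
print(losses)  # per-example errors
print(cost)    # 75,000,000.0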
For simple models, common loss functions include:
Least squared error: For a regression problem, this calculates the square of the difference between the predicted value and the real value. The square is important because it treats positive and negative errors the same and penalizes larger errors more heavily.
Logistic loss: Used in classification, this loss function is designed to penalize confident wrong predictions heavily.
Hinge loss: Used by Support Vector Machines, it penalizes predictions that are on the wrong side of a decision boundary.
Cross-entropy: A common loss function for neural networks, it measures the difference between two probability distributions.
We’ll start with a classic example: predicting house prices. This is a regression problem because the output is a continuous number. Our goal is to train a model that can predict a house’s price based on its size.
The Idea
Imagine a graph with house size on the x-axis and price on the y-axis. As a house’s size increases, its price generally increases too. The data points on this graph would roughly form a diagonal line. Linear Regression finds the best-fit line through the data points to predict the price of a new house size. The model assumes a linear relationship between the input and output.
The Learning Process: Gradient Descent
While some models have a simple, direct formula to find the perfect parameters, a more flexible and common way to find the best parameters is with Gradient Descent. Think of this as a hiker trying to get to the bottom of a valley. The valley’s shape represents the cost function, and the bottom of the valley is where the cost is at its minimum.
Step 1: Start Anywhere: The process begins by picking a random starting point in the valley, which corresponds to some random initial parameters for our model.
Step 2: Look at the Slope: At the current location, the hiker looks at the steepness of the ground (the gradient). The gradient tells us the direction of the steepest ascent.
Step 3: Take a Step Down: The hiker then takes a small step in the opposite direction of the gradient—the steepest descent—to move closer to the bottom of the valley.
The Learning Rate ($\alpha$): This is how big of a step the hiker takes. A step that is too big can cause the hiker to overshoot the bottom of the valley, while a step that is too small will make the process very slow.
For linear regression, this iterative process of taking small steps to minimize the cost is also known as the LMS (Least Mean Squares) algorithm or the Widrow-Hoff learning rule.
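Here is a minimal sketch of gradient descent for a one-feature linear model, using made-up data; the learning rate and the number of steps are arbitrary choices for illustration:
import numpy as np

# Made-up data: house sizes (x, in thousands of sq ft) and prices (y, in thousands of dollars)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([210.0, 390.0, 610.0, 805.0])

theta0, theta1 = 0.0, 0.0   # Step 1: start anywhere (here, at zero)
alpha = 0.1                 # the learning rate: how big each step is

for _ in range(1000):
    predictions = theta0 + theta1 * x
    error = predictions - y
    # Step 2: the gradient of the squared-error cost with respect to each parameter
    grad0 = error.mean()
    grad1 = (error * x).mean()
    # Step 3: take a small step in the opposite direction of the gradient (downhill)
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(theta0, theta1)  # roughly 2.5 and 200.5: the best-fit intercept and slope
In practice, Scikit-learn handles this optimization for you; the snippet below trains the same kind of model in just a few lines.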
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Illustrative data: house sizes in square feet (X) and prices (y); the numbers are made up
X = np.array([[800], [1000], [1200], [1500], [1800], [2000], [2200], [2500]])
y = np.array([150_000, 180_000, 215_000, 255_000, 290_000, 325_000, 355_000, 400_000])

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the unseen test data
predictions = model.predict(X_test)
print(f"Mean Squared Error: {mean_squared_error(y_test, predictions)}")
Next, you’ll tackle a classification problem: predicting a category. We’ll build a simple spam filter.
The Idea
We’ll use Logistic Regression to make a "yes" or "no" decision (Is this email spam or not?). The model calculates the probability of an email being spam and classifies it as such if the probability crosses a certain threshold (e.g., 0.5).
The Key Tool: The Sigmoid Function
The Sigmoid function, also known as the logistic function, is what makes this all possible. It takes any number and "squishes" it into a value between 0 and 1, which we can then interpret as a probability. A large positive number becomes a value close to 1, and a large negative number becomes a value close to 0. This is perfect for our classification task.
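A quick sketch of that squashing behaviour, with a hand-written sigmoid:
import numpy as np

def sigmoid(z):
    # Squash any real number into the range (0, 1)
    return 1 / (1 + np.exp(-z))

print(sigmoid(-5))  # close to 0 -> "not spam"
print(sigmoid(0))   # exactly 0.5 -> the decision threshold
print(sigmoid(5))   # close to 1 -> "spam"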
Finding the Parameters
Unlike linear regression, logistic regression doesn’t have a simple, direct formula to find the best parameters. It relies on iterative methods like Gradient Descent, just like we discussed for linear regression. The goal is to find the parameters that minimize the cost function, which in this case is the Cross-Entropy loss. This loss function is particularly good for classification because it heavily penalizes the model for being confidently wrong (e.g., predicting an email is 99% spam when it’s not).
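Putting it together, here is a minimal sketch of a logistic regression spam filter on a tiny, made-up set of emails (the texts and labels are purely illustrative):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up emails and labels (1 = spam, 0 = not spam)
emails = ["win money now", "meeting at noon", "claim your free prize", "lunch tomorrow?"]
labels = [1, 0, 1, 0]

# Turn the text into word-count features
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Train the classifier, then estimate the probability that a new email is spam
clf = LogisticRegression()
clf.fit(X, labels)
new_email = vectorizer.transform(["free money prize"])
print(clf.predict_proba(new_email)[:, 1])  # probability of spam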
Now that you’ve built a few models, it’s time to understand the two main philosophical approaches to supervised learning: Generative and Discriminative models. The key difference lies in what they learn about the data.
| Type of Model | Goal | What’s Learned | Example |
| --- | --- | --- | --- |
| Discriminative | Directly estimate $P(y \mid x)$ | The decision boundary | Regressions, SVMs |
| Generative | First estimate how the data is generated, then deduce $P(y \mid x)$ | Probability distributions of the data | GDA, Naive Bayes |
Think of it this way: a Discriminative model is like a security guard. Their job is to stand at the border and learn to tell the difference between people who are allowed in and people who are not. They don’t need to know everything about every person—just the key features that help them make a decision. They learn the boundary. A Generative model is like a master artist. They learn to draw a picture of a person who is allowed in, and a picture of a person who is not. By understanding the characteristics of each group, they can tell which group a new person belongs to. They learn the full distribution of the data.
This is a subtle but crucial distinction.
3.1 Gaussian Discriminant Analysis (GDA)
GDA is a generative model that makes a few key assumptions. It assumes that the data for each class is drawn from a Gaussian (Normal) distribution. To make a prediction, it calculates which class has the higher probability for a given input.
The model learns two main things for each class: the average value (mean) and the spread (covariance) of the data points. For example, in a problem to distinguish between men and women based on their height and weight, GDA would learn the mean height and weight for men and the mean height and weight for women, as well as the spread of that data. When a new person comes along, the model sees if their height and weight are a better fit for the "men’s" distribution or the "women’s" distribution, and then makes a prediction.
This approach is powerful because it builds a full model of the data’s probability. It can also be used to generate new data that looks like the original.
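Scikit-learn doesn’t ship an estimator literally named GDA, but its LinearDiscriminantAnalysis is essentially this model (one Gaussian per class with a shared covariance). A minimal sketch on made-up height and weight data:
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Made-up (height cm, weight kg) measurements for two classes
X = np.array([[160, 55], [165, 60], [158, 52], [163, 57],
              [178, 80], [182, 85], [175, 78], [180, 82]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Learn the mean and spread of each class, then classify by the better-fitting Gaussian
gda = LinearDiscriminantAnalysis()
gda.fit(X, y)
print(gda.predict([[170, 68]]))        # which class fits this new person better?
print(gda.predict_proba([[170, 68]]))  # probability for each class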
3.2 Naive Bayes
Naive Bayes is another powerful generative model, famous for its use in spam filtering. The "naive" part comes from a simplifying assumption: it assumes that all the features (e.g., words in an email) are independent of each other. While this assumption is often false in reality (the word "poker" is not independent of the word "online"), the model works surprisingly well and is incredibly fast.
Here’s how it works for a spam filter: the model learns the probability of each word appearing in a spam email versus a non-spam email. When a new email arrives, it goes through each word and multiplies the probabilities together to get a total probability of the email being spam. Even with the "naive" independence assumption, the model is very effective at classifying text.
The Naive Bayes algorithm first estimates the probability of each class (e.g., P(spam)) and the probability of each feature given the class (e.g., P("Viagra" | spam)). It then uses Bayes’ theorem to calculate the probability of the class given the features (e.g., P(spam | "Viagra")), which is what we ultimately want.
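Here is a minimal sketch with scikit-learn’s MultinomialNB, again on a tiny made-up email set:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training emails and labels (1 = spam, 0 = not spam)
emails = ["claim your free prize now", "project meeting at 10am",
          "free money if you click", "notes from today's meeting"]
labels = [1, 0, 1, 0]

# Learn word counts per class, then apply Bayes' theorem to a new email
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
nb = MultinomialNB()
nb.fit(X, labels)

new_email = vectorizer.transform(["free prize meeting"])
print(nb.predict(new_email))        # the predicted class
print(nb.predict_proba(new_email))  # P(not spam), P(spam)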
Now that you have a solid foundation, let’s look at some of the core ideas that tie everything together and introduce more advanced models. You’ll see how many of the models we’ve already discussed are part of a larger, unified framework.
Many of the models you’ve already seen, like Linear Regression and Logistic Regression, are actually special cases of a broader framework called Generalized Linear Models (GLMs). This framework provides a unified way to understand and build a wide range of models. It’s built on three key assumptions that allow a variety of models to be constructed on a common foundation.
The core idea of a GLM is to model an output variable y as a function of the input features x. The model relies on three assumptions:
The outcome variable y comes from a distribution in the Exponential Family. This family includes many common distributions like the Gaussian (for Linear Regression) and Bernoulli (for Logistic Regression) distributions. This assumption provides a consistent mathematical form for a wide range of problems.
The model’s prediction is the expected value of the outcome variable.
The natural parameter of the distribution is a linear combination of the input features, $\eta = \theta^T x$. The function that connects this linear combination to the model’s prediction is known as the link function.
This framework elegantly explains why Linear and Logistic Regression, despite their different uses, are both special cases of a single, overarching model type.
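For instance, when the outcome $y$ is a Bernoulli variable (spam or not spam), writing the Bernoulli distribution in exponential-family form gives a natural parameter $\eta = \log\frac{\phi}{1-\phi}$, and setting $\eta = \theta^T x$ recovers exactly the logistic regression hypothesis:
$$
h_\theta(x) = E[y \mid x] = \phi = \frac{1}{1 + e^{-\eta}} = \frac{1}{1 + e^{-\theta^T x}}
$$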
Support Vector Machines (SVMs) are a type of discriminative model that’s highly effective for both classification and regression. Unlike models that focus on finding the average of the data, SVMs focus on the edges.
The Idea: Imagine you have two groups of data points, and you want to draw a line to separate them. There might be many possible lines, but the goal of an SVM is to find the one that has the largest possible margin. The margin is the distance from the line to the nearest data point of either class. This "optimal margin classifier" is defined by a vector w and a bias b.
The Hinge Loss: To help the SVM find this optimal line, it uses a specific loss function called the Hinge Loss. This function penalizes points that are on the wrong side of the margin or too close to the decision boundary.
The Kernel Trick: What if the data isn’t easily separated by a straight line? SVMs can handle this with a clever shortcut called the Kernel Trick. This trick allows the model to find a decision boundary in a higher-dimensional space without ever having to explicitly do the complex calculations. It uses a kernel function, like the Gaussian kernel, to calculate the "similarity" between data points. This allows the model to learn complex, non-linear boundaries.
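A minimal sketch of the kernel trick in scikit-learn, on a tiny XOR-style dataset that no single straight line can separate:
import numpy as np
from sklearn.svm import SVC

# XOR-style data: opposite corners belong to the same class
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# A linear boundary fails here, but the Gaussian (RBF) kernel finds one
clf = SVC(kernel="rbf", gamma=2.0)
clf.fit(X, y)
print(clf.predict(X))  # recovers [0, 1, 1, 0]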
These methods are incredibly powerful and form the backbone of many winning solutions in machine learning competitions. They’re often called "tree-based" because they’re built on the idea of a Decision Tree, a simple model that makes a series of yes/no decisions.
Decision Trees (CART): A decision tree is a model that can be represented as a flowchart. It asks a series of simple questions about the data and follows a path to a final prediction. For example, a tree to decide what to wear might first ask "Is it raining?". If yes, it follows one path; if no, it follows another. This makes them very interpretable.
Random Forest: This is an ensemble method that combines many individual decision trees. It works by building a "forest" of trees, each trained on a random subset of the data and a random subset of the features. The final prediction is a consensus from all the trees, making the model much more robust and accurate. It’s a classic example of how combining many individual models into an ensemble can outperform any single one of them.
Boosting: Another popular ensemble method that builds a strong learner by combining several "weak" learners in a sequential way. The idea is to train a weak model, then train a second model to correct the first one’s mistakes, then train a third to correct the second one’s mistakes, and so on. Popular variants of this idea include Adaptive Boosting (AdaBoost) and Gradient Boosting, and the result is a highly accurate model. A short sketch comparing all three tree-based approaches follows below.
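A minimal sketch comparing a single decision tree, a random forest, and a gradient-boosted ensemble on scikit-learn’s built-in Iris dataset:
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A small built-in dataset: flower measurements and their species
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# One tree: a single flowchart of yes/no questions
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
# A forest: many trees trained on random slices of the data, voting together
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
# Boosting: trees added one after another, each correcting the previous ones' mistakes
boosted = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

print("Decision tree:    ", tree.score(X_test, y_test))
print("Random forest:    ", forest.score(X_test, y_test))
print("Gradient boosting:", boosted.score(X_test, y_test))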
Some models don’t make assumptions about the underlying data distribution. This makes them very flexible but can also make them more complex.
k-Nearest Neighbors (k-NN): The k-nearest neighbors algorithm, or k-NN, is a simple and intuitive approach. For a new data point, it looks at the k closest data points in the training set. The new data point’s class is determined by a majority vote of its k neighbors. This is a non-parametric approach because it doesn’t learn a fixed set of parameters; it simply memorizes the training data and uses it for new predictions.
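A minimal sketch of k-NN in scikit-learn, again using the built-in Iris dataset:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# "Training" here is essentially memorizing the training data
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predicting = a majority vote among the 5 nearest memorized neighbors
print(knn.score(X_test, y_test))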
Building a model is just the first step; knowing if it’s good is crucial. A model that looks great on the data it was trained on but fails completely on new, unseen data is useless. This chapter will teach you how to evaluate your models, understand their shortcomings, and make them better.
Every model has a tradeoff between bias and variance. Understanding this balance is fundamental to building a reliable model.
Bias: When a model is too simple and can’t capture the underlying patterns in the data. Think of a simple model trying to fit a curvy line with a straight one. It consistently misses the mark because its assumptions are too rigid. This leads to underfitting, where the model performs poorly on both the training data and new data.
Variance: When a model is too complex and learns the noise and random fluctuations in the training data too well. It’s like a person who memorizes a book word-for-word but doesn’t understand the concepts. They’ll ace a test on the exact questions from the book but fail miserably on any new questions. This leads to overfitting, where the model performs great on the training data but fails miserably on new data.
The goal is to find the sweet spot, the right balance between bias and variance. A model with low bias and low variance is what we aim for.
You’ll need a set of tools to measure your model’s performance. The right tool depends on the type of problem you’re solving.
Regression Metrics
For regression problems, where the model predicts a continuous number, we often use error metrics that calculate how far off the predictions are from the actual values.
Mean Squared Error (MSE): This metric calculates the average of the squared differences between the predicted and actual values. It’s a popular choice because it heavily penalizes large errors.
Root Mean Squared Error (RMSE): This is simply the square root of the MSE. It’s often preferred because the result is in the same units as the original data, making it easier to interpret.
R-squared ($R^2$): This metric provides a measure of how well the model’s predictions replicate the actual outcomes. An $R^2$ of 1 means the model perfectly captures all the variance in the data, while an $R^2$ of 0 means the model performs no better than a simple average.
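Here is a minimal sketch of all three regression metrics on made-up house prices:
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted house prices
y_true = np.array([310_000, 250_000, 400_000, 280_000])
y_pred = np.array([300_000, 255_000, 410_000, 275_000])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                 # back in the original units (dollars)
r2 = r2_score(y_true, y_pred)

print(f"MSE:  {mse:,.0f}")
print(f"RMSE: {rmse:,.0f}")
print(f"R^2:  {r2:.3f}")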
Classification Metrics
For classification problems, where the model predicts a category, we use metrics that evaluate its ability to correctly sort data points into classes.
The Confusion Matrix: This is a table that gives you a complete picture of your model’s performance on a set of data. It shows you exactly where your model is making mistakes. For a two-class problem (e.g., spam vs. not spam), it has four key components:
True Positives (TP): The model correctly predicted a positive outcome.
True Negatives (TN): The model correctly predicted a negative outcome.
False Positives (FP): The model incorrectly predicted a positive outcome (Type I error).
False Negatives (FN): The model incorrectly predicted a negative outcome (Type II error).
Accuracy: This is the most common metric. It’s the ratio of correct predictions to the total number of predictions. While easy to understand, it can be misleading on imbalanced datasets (e.g., a dataset with 99% non-spam emails).
Precision: Of all the positive predictions your model made, what percentage were actually correct?
Recall (Sensitivity): Of all the actual positive cases, what percentage did your model correctly identify?
F1 Score: This is a single metric that balances both Precision and Recall, making it useful for evaluating models on imbalanced datasets.
ROC Curve and AUC: The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate against the False Positive Rate. The Area Under the Curve (AUC) is a single number that summarizes the ROC curve; a higher AUC indicates a better model.
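A minimal sketch of these classification metrics on made-up labels and predictions (1 = spam, 0 = not spam):
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Made-up true labels and the model's hard predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

print(confusion_matrix(y_true, y_pred))          # [[TN, FP], [FN, TP]]
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1 score: ", f1_score(y_true, y_pred))

# AUC needs predicted probabilities (or scores) rather than hard labels
y_scores = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3, 0.2, 0.95]
print("ROC AUC:  ", roc_auc_score(y_true, y_scores))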
Choosing the right model for your problem is a critical step. To avoid relying on luck, we use a structured process for evaluating models.
Training, Validation, and Testing
To get a reliable estimate of your model’s performance on new data, you must split your dataset into three parts:
Training Set: The largest portion of the data (e.g., 80%) used to train the model.
Validation Set: A smaller portion (e.g., 10%) used to fine-tune the model’s settings.
Testing Set: A final, unseen portion (e.g., 10%) used only once at the very end to get a final, unbiased measure of performance.
Cross-Validation
Cross-validation is a more robust way to evaluate a model’s performance without relying on a single validation set.
The Idea: The most common type is k-fold cross-validation. The training data is split into k equal-sized "folds" (e.g., 5 or 10). The model is trained k times. In each round, one fold is used for validation, and the remaining k-1 folds are used for training. The final performance is the average of the results from all k rounds.
Why It Works: This method ensures that every data point is used for both training and validation, leading to a more reliable and stable estimate of the model’s true performance.
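Scikit-learn can run the whole procedure in one call; here is a minimal sketch of 5-fold cross-validation on the built-in Iris dataset:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Train 5 times, each time holding out a different fold for validation
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)         # one accuracy score per fold
print(scores.mean())  # the averaged, more stable estimate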
Regularization is a technique used to combat overfitting. It works by adding a penalty to the cost function that discourages the model from becoming too complex. Think of it as a set of rules that prevent a student from just memorizing the test questions.
Lasso (L1 Regularization): This technique can actually shrink some of the model’s coefficients to zero, effectively performing variable selection by eliminating irrelevant features.
Ridge (L2 Regularization): This technique makes the coefficients smaller without forcing them to zero. It’s good for making a model more stable and less sensitive to small changes in the data.
Elastic Net: This method combines both Lasso and Ridge regularization, offering a flexible balance between the two.
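A minimal sketch of all three on made-up data where only the first feature actually matters; the alpha values here are arbitrary illustrative choices:
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Made-up data: y depends on the first feature, the other four are pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=50)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("Lasso:      ", lasso.coef_)   # irrelevant coefficients shrunk toward (often exactly) 0
print("Ridge:      ", ridge.coef_)   # coefficients made small, but rarely exactly 0
print("Elastic Net:", enet.coef_)    # a blend of the two behaviours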
This final chapter will summarize the key projects and concepts you’ve mastered. It will provide a roadmap for your future learning, suggesting pathways into more advanced topics like deep learning (using libraries like TensorFlow or PyTorch), natural language processing, or computer vision. The goal is to encourage you to continue building and exploring, using the solid foundation you’ve now established.