Programming: How to Create a Machine Learning Model from Scratch in Python
Note: This educational material is comprehensive and detailed. It’s meant for absolute beginners—people who may have never written a line of code before or who have never heard of Python or machine learning. We will take it step by step, explaining each concept and line of code, and by the end, you will not only understand how to build a simple machine learning model, but you’ll also know how to create a graphical interface (a small window with buttons and input boxes) to interact with your model.
What will you learn?
What machine learning (ML) is and why it’s useful.
How to set up Python and a code editor.
How to load and understand a dataset.
How to prepare data for modeling.
How to build, train, and evaluate a simple ML model (Linear Regression) from scratch.
How to save and load the trained model so you don’t have to rebuild it every time.
How to create a Graphical User Interface (GUI) with Python’s built-in tkinter library, so users can easily interact with your model by entering a house size and seeing the predicted price.
Who is this for?
Complete beginners with no coding experience.
People who are curious about machine learning and want to understand the process, not just copy and paste code.
Anyone who wants a gentle, step-by-step guide to building a small end-to-end ML project.
Introduction to Machine Learning
What is Machine Learning (ML)?
Machine Learning is a field of artificial intelligence that teaches computers to make predictions or decisions based on data, without being explicitly programmed for every possible scenario.
Think of it this way:
If you show a child many pictures of cats and dogs, eventually the child learns to identify a cat or a dog. You didn’t give the child a long list of rules; instead, the child learned from examples. In machine learning, we give the computer a lot of examples (a dataset) and let it figure out patterns.
Real-World Examples of ML:
Spam Filters in Email: Your email service uses ML to guess which incoming messages are spam.
Movie Recommendations: Services like Netflix analyze your viewing habits and recommend shows or movies you might like.
Voice Assistants (Siri, Alexa): They improve their understanding of your speech over time by learning patterns in your voice commands.
Price Prediction: Predicting the price of a product or a house based on historical data.
Our Project’s Scenario:
We will build a model that predicts the price of a house given its size (in square feet). The model will learn from past data of houses: their sizes and their selling prices. Then, when we give it a new size, it will predict a selling price based on what it learned.
Why Learn to Build a Model?
Demystify AI: Understanding how a model makes predictions helps you know when to trust or question those predictions.
Improve Problem-Solving: You learn a systematic approach to tackle data problems.
Career Boost: Even basic ML knowledge can stand out in many fields—marketing, finance, operations.
Empowerment: You can build tools for personal projects, research, or business.
You do not need to become a math expert or a full-time programmer. This guide focuses on concepts and practical steps, and we’ll show you how to rely on user-friendly Python libraries and even ask AI tools (like ChatGPT) for help when you get stuck.
Overview of What You’ll Achieve
Data Preparation: You’ll start with a CSV file containing house sizes and prices. You’ll learn how to load that file into Python, examine it, and ensure it’s ready for ML.
Building a Model: You’ll use a simple ML algorithm called Linear Regression to learn a relationship between house size and price.
Training and Evaluation: You’ll train the model on historical data and see how well it predicts prices for new houses it hasn’t seen.
Saving the Model: Instead of rebuilding every time, you’ll save the trained model to a file.
Creating a GUI: You’ll use Python’s tkinter library to create a small window where users can type a house size and press a button to see the predicted price.
Optional Next Steps: Learn where to go from here—add more features, try different models, or explore other ML tools.
By the end of this journey, you’ll have a simple program that anyone can run: enter a house size, click a button, and see the predicted price.
Prerequisites and Tools
You need:
A computer (Windows, macOS, or Linux).
Python installed on your machine.
A code editor (we recommend Visual Studio Code, commonly called VS Code).
Installing Python
Go to https://www.python.org/downloads/ and download the latest Python 3.x version.
Run the installer.
Important: On Windows, check the box “Add Python to PATH” during installation.
Once installed, open a terminal (Command Prompt on Windows, Terminal on macOS/Linux) and type:
python --version
You should see something like Python 3.11.x.
Installing Visual Studio Code (VS Code)
Go to https://code.visualstudio.com and download the installer for your OS.
Install VS Code.
Open VS Code.
Click the Extensions icon (left sidebar) or press Ctrl+Shift+X (Windows/Linux) or Cmd+Shift+X (macOS).
In the search box, type “Python” and install the official Python extension by Microsoft.
Installing Python Libraries
We will use libraries to make our life easier:
pandas: For reading and handling data.
scikit-learn: For machine learning algorithms.
matplotlib: For plotting charts.
joblib: For saving and loading our trained model.
tkinter: Built into Python, no extra install needed for basic usage.
In your terminal, type:
pip install pandas scikit-learn matplotlib joblib
If you face issues, you can ask an AI tool:
“How do I fix a pip install error on Windows?”
After successful installation, we’re ready to start.
Step 1: Loading and Inspecting the Data
Before we start building a machine learning model, we need data. Machine learning relies on learning patterns from historical examples, so having data in a suitable format is our foundation.
What is Data?
Data, in the simplest sense, is information. For our project, our data consists of a set of houses, each with certain characteristics (in this case, just one characteristic: the house’s size in square feet) and their associated selling price. This kind of information is often stored in files. One common format is called a CSV file, which stands for “Comma-Separated Values.” A CSV file is like a simple spreadsheet: each line represents a row (an individual example), and the columns are separated by commas.
For example, a CSV file named house_prices.csv could look like this:
Size,Price
1000,250000
1200,270000
1500,300000
1800,330000
2000,360000
Here’s what each part means:
The first line, Size,Price, indicates that we have two columns: “Size” and “Price.”
Each subsequent line represents a specific house. For instance, 1000,250000 means this particular house is 1000 square feet and sold for $250,000.
This CSV file is our dataset. We will use it to teach the computer how to predict the price of a house if we know its size.
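If you’d like Python to create this file for you, here is a minimal, optional sketch; you can equally well type the five data lines into any text editor and save the file as house_prices.csv. The script name is just an illustrative suggestion:
# optional_make_csv.py (hypothetical helper, not part of our project files)
csv_text = """Size,Price
1000,250000
1200,270000
1500,300000
1800,330000
2000,360000
"""
with open("house_prices.csv", "w") as f:
    f.write(csv_text)  # write the five example houses to disk
print("house_prices.csv created.")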
Why Are We Creating a Separate Python File?
When working on a programming project, we usually write instructions for the computer in code files. A file in this context is just a document that contains code. By creating separate files for different steps, we keep our work organized. Each file will focus on a particular task, making it easier to understand and maintain our project as it grows. For example, in this step, we’re focusing on loading and inspecting data only. Later, we’ll have another file or add code that handles training the model. By splitting tasks this way, if something goes wrong in a future step, we know exactly which file to check without getting lost in a single massive code file.
We might name our first file 01_load_data.py. The name 01_load_data.py is chosen so that it’s clear this file corresponds to the first step of the process—loading and inspecting the data. The .py extension tells us it’s a Python file. You can think of a Python file as a set of instructions that Python can execute. Make sure that house_prices.csv is in the same folder (directory) as your Python file so that it can be found easily when you run the code.
What is Python Going to Do With This File?
When you “run” a Python file, you are asking Python to follow the instructions written in that file from top to bottom. These instructions can include:
Importing libraries (explained below)
Reading data from a file
Performing calculations
Printing output (displaying information on your screen)
By organizing the instructions into a file like 01_load_data.py, you can re-run those instructions anytime without retyping them. This is a more systematic way than typing commands one by one interactively.
Understanding Libraries and Imports
Python has a core set of features built-in, but for specialized tasks, we use libraries (also known as packages or modules). A library is a collection of code that someone else wrote to solve common problems and provide useful features. Instead of reinventing the wheel, we can import these libraries to do work for us quickly.
We will use a library called pandas. Pandas is extremely popular for working with data because it provides powerful and intuitive tools for reading, organizing, and analyzing datasets. It’s like having a friend who is very good at handling spreadsheets right inside your code.
What does “import pandas as pd” mean?
import pandas: This tells Python we want to use the pandas library.
as pd: This is a common shorthand. Instead of writing “pandas” every time we refer to something from this library, we can just write “pd.” It makes the code shorter and easier to read.
So import pandas as pd means: “Bring in the pandas library and let me refer to it as pd in my code.”
Reading a CSV File
Once we have pandas available as pd, we can use its functions to read data. One of these functions is called read_csv. A function in programming is like a predefined action. You give it some input, it does some work, and then it gives you an output. Think of it as a machine: you feed in something, and it returns something else.
pd.read_csv("house_prices.csv")
is a function call. It tells pandas:
“Please read the file named house_prices.csv
and give me the data in a form I can work with.”
What do we get back when we call pd.read_csv(...)?
We get a DataFrame. A DataFrame is a special data structure in pandas. You can imagine a DataFrame as a spreadsheet or a table of data in memory, with rows and columns that you can manipulate directly with code. It’s very convenient because it allows us to filter, calculate statistics, find missing values, and do much more, all by using straightforward commands.
We will store this DataFrame in a variable. A variable is a name that holds a value. It’s like a labeled box in your memory where you keep some data. For example, we can write:
data = pd.read_csv("house_prices.csv")
Here:
data is the variable’s name.
= means we are assigning something to data.
pd.read_csv("house_prices.csv") is the function call that returns a DataFrame of our dataset.
The variable data now holds a pandas DataFrame, a versatile structure that lets us easily perform a variety of data analysis operations. As we proceed, you’ll see how this makes tasks that would be complex in raw Python much simpler.
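To give a feel for that versatility, here is a small optional sketch (not needed for our project) of two one-line operations on the data DataFrame once it is loaded:
# Optional DataFrame conveniences, assuming 'data' was loaded as above:
print(data[data["Size"] > 1400])  # filter: keep only houses larger than 1400 sq ft
print(data["Price"].mean())       # statistic: the average price across all rows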
Inspecting the Data with head()
Now that we have the data loaded, we want to get a quick look at it. Instead of printing the entire dataset (which might be huge in real-world scenarios), we can just look at the first few rows. Pandas provides a function called head() that returns the first 5 rows of the DataFrame by default. This gives us a small preview, just enough to confirm that the file was read correctly and see what the columns and values look like.
When we write:
print(data.head())
We are doing three things:
Calling the head() function on data.
Getting the first 5 rows of the DataFrame.
Using print(...) to display it on the screen.
What is print()?
print() is a built-in Python function that writes the given message or data to the console (the screen where we see text output). It’s our way of asking, “Hey Python, show me what you have in data.head()!”
So print(data.head()) means: “Get the first five rows of data and show them to me.”
This step is crucial because it gives us visual feedback that we successfully loaded the dataset. We should see something like:
Size Price
0 1000 250000
1 1200 270000
2 1500 300000
3 1800 330000
4 2000 360000
This confirms that data has the right columns (“Size” and “Price”), and we can see a sample of the data’s content.
Checking for Missing Values
Real-world data often has missing information. For example, maybe one house’s price wasn’t recorded. Missing values can cause problems later when we train our model, so we want to know if we have any such issues.
Pandas provides a convenient way to check for missing values. We can write:
print(data.isnull().sum())
What does data.isnull() do?
isnull() is another function that checks each cell in the DataFrame and returns True if that cell is empty (null) and False otherwise.
What does .sum() do?
When applied to a DataFrame of True/False values, .sum() treats True as 1 and False as 0 and adds them up. The result is the number of missing values in each column.
If we get all zeros, that means no missing values. If we see a nonzero number, say 2 in the “Price” column, that means 2 of the houses have no recorded price.
By printing this information, we are staying informed about the quality of our data.
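If the counts were not all zero, pandas also offers quick remedies. Here is a hedged sketch of two common ones; since our dataset is complete, these lines would change nothing here:
# Two illustrative ways to handle missing values (not needed for our clean data):
data_without_gaps = data.dropna()                           # drop any row that has a missing cell
data_filled = data.fillna({"Price": data["Price"].mean()})  # or fill missing prices with the column average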
Summarizing the Data with describe()
Another helpful tool is data.describe(). This function gives us basic statistics about the numeric columns in our DataFrame, such as the count (how many rows), the mean (average), the minimum, the maximum, and other helpful summaries. This helps us understand the range and distribution of our data.
print(data.describe())
This will print something like:
Size Price
count 5.00000 5.000000
mean 1500.00000 302000.000000
std 412.31056 44384.682045
min 1000.00000 250000.000000
25% 1200.00000 270000.000000
50% 1500.00000 300000.000000
75% 1800.00000 330000.000000
max 2000.00000 360000.000000
From this:
count: There are 5 houses total.
mean (average): The average size is 1500 sq ft and the average price is $302,000.
min and max: The smallest house is 1000 sq ft and the largest is 2000 sq ft. The cheapest is $250,000 and the most expensive is $360,000.
This quick overview helps us get a feel for the dataset’s scale and variety before moving on.
Putting It All Together in 01_load_data.py
Now, let’s write the full code for step 1 in a file named 01_load_data.py
. Before writing it down, remember what we’ve discussed:
We’re creating a separate file because it’s good practice to keep each step of the project organized.
By running this file, we’ll have Python load and inspect our data.
We’re introducing concepts like imports, functions, variables, and printing output.
Full code with comments:
# Step 1: Loading and Inspecting the Data
# 1. Import the pandas library as pd so we can use its functions.
import pandas as pd
# 2. Use pandas to read the CSV file into a DataFrame named 'data'.
data = pd.read_csv("house_prices.csv")
# 3. Print the first few rows to ensure the data loaded correctly.
print("First 5 rows of the dataset:")
print(data.head())
# 4. Check for missing values. This helps us know if we need to clean the data.
print("\nChecking for missing values in each column:")
print(data.isnull().sum())
# 5. Show basic statistics to understand the distribution and range of the data.
print("\nStatistical summary of the data:")
print(data.describe())
How to Run the Code
If you’ve never run a Python file before, here’s what it means and why we do it:
Running a Python file means asking Python to execute all the instructions in that file. Think of the file as a recipe and running it as following that recipe step-by-step.
After running the code, we’ll see output (text) printed on the screen—this gives us confidence that our data is loaded and we understand its basic properties.
How to actually run it in practice (Skip down if using VS Code):
Open your terminal or command prompt.
Navigate to the folder where 01_load_data.py and house_prices.csv are located. (If you saved them in the same folder, just go there.)
Type:
python 01_load_data.py
Press Enter.
How to run it with VS Code:
Click on “Run” then “Run Without Debugging”, or press Ctrl + F5.
You should see the output we discussed.
Why are we doing this? Because we want to confirm that our data is read correctly and looks as expected. If something is wrong at this stage—like the file not found or data looks suspicious—this is where we catch it before moving forward. It’s always a good idea to verify your data before building your model.
Recap of What We Learned in Step 1
We learned what a CSV file is and how it stores data in a structured, tabular form.
We understood why we create separate Python files for different steps: organization and clarity.
We introduced the concept of importing a library (pandas) to leverage existing functions and code.
We learned that functions like pd.read_csv() help us load data, and data.head(), data.isnull(), and data.describe() help us inspect it.
We discussed what printing means and why we print results to see what the code is doing.
By the end of this step, we have verified that our dataset is loaded into a pandas DataFrame, we have checked for missing values, and we have a basic understanding of its contents.
This sets a strong foundation for the next steps, where we will prepare the data further, introduce the concepts of training and testing sets, and ultimately build and evaluate a machine learning model.
Code Update for Step 1:
Create a file named model_building.py (if you haven’t created it yet) and add the following code to it:
# model_building.py after Step 1
import pandas as pd
# Load the dataset
data = pd.read_csv("house_prices.csv")
print("First 5 rows of the dataset:")
print(data.head())
print("\nChecking for missing values:")
print(data.isnull().sum())
print("\nStatistical summary of the data:")
print(data.describe())
Where to Add It:
Since this is our first code addition, just place it in a new file, model_building.py. This will be our base code going forward.
Step 2: Understanding the Data
In Step 1, we loaded our dataset from a CSV file and inspected it briefly. We confirmed that it contains two columns—Size and Price—and verified that there are no missing values. We also saw some basic statistics, such as the average house size and the minimum and maximum prices.
Why is this step necessary?
It’s one thing to load data and peek at it; it’s another to truly understand what it represents, how it will be used in a machine learning model, and why we organized it in this particular way. Understanding the data helps us:
Clarify what we want to predict (the target).
Recognize what information we are using to make that prediction (the features).
Understand the type of machine learning problem we’re dealing with.
Anticipate potential challenges, like whether the relationship between the feature(s) and target might be simple or complex.
Set the stage for how we will train and evaluate our model in upcoming steps.
Let’s break this down piece by piece.
Features and Targets: Defining Our Inputs and Outputs
In machine learning, we typically separate our data into two main categories: features and targets.
Features (Inputs): These are the pieces of information that we provide to the model so it can make a prediction. You can think of features as the “clues” we give the computer. In our dataset, the main feature is Size—the size of the house in square feet.
Target (Output): This is what we want the model to predict. In our case, the target is Price—the selling price of the house. Once the model is trained, when we give it a new size (a feature), it should return a predicted price (the target).
Why define these terms now?
Because understanding the difference between what you’re giving to the model (features) and what you want it to output (target) is fundamental. This distinction will guide how we split our data, train the model, and evaluate it. Without a clear understanding of your features and target, you can’t properly frame the machine learning problem.
The Nature of Our Problem: A Regression Task
There are different kinds of machine learning problems. The two most common categories in supervised learning are:
Regression: Predicting a numeric value. Examples include predicting house prices, predicting someone’s height based on their age, or forecasting next month’s sales.
Classification: Predicting a category or class. Examples include deciding if an email is spam or not, or determining if an image contains a cat or a dog.
Since our target—Price—is a continuous number (like $250,000 or $310,000), we are dealing with a regression problem. The model’s job is to produce a numerical output given the numerical input of house size.
Why identify the problem type now?
Because the type of problem influences which algorithms and evaluation metrics we will use. For regression problems, certain algorithms (like linear regression) and metrics (like mean squared error or RMSE) are more suitable. By knowing we’re tackling regression, we narrow down our future choices and expectations.
Distribution and Range of Our Data
From Step 1, we saw basic statistics (like mean, min, and max) using data.describe(). Let’s reflect on that information conceptually:
Mean Size (~1500 sq ft): This suggests our houses are mostly around medium-sized homes.
Mean Price (~$302,000): An average price gives us a sense of scale—these are not million-dollar mansions, nor are they super cheap tiny homes.
Min and Max Values: Sizes ranged from 1000 to 2000 sq ft, and prices from $250,000 to $360,000. This range tells us how varied our examples are. If the range is very narrow, predictions might be easier. If it were extremely wide, the model might have a harder time capturing the relationship.
It’s important to understand these distributions because they help us guess how “predictable” the price might be. If all houses are similarly sized and priced, the model might not have a lot of complexity to learn. If there were extreme outliers (like a 10,000 sq ft mansion priced at $2,000,000 in a dataset of mostly 1,000-2,000 sq ft homes), the model might struggle or require special handling.
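One way to check this feel visually is a scatter plot. Here is an optional sketch using matplotlib (which we installed earlier); it is a separate scratch script, not part of model_building.py, and the file name is just a suggestion. A roughly straight, upward-sloping cloud of points suggests a linear model is a reasonable starting choice:
# plot_data.py (optional scratch script, hypothetical name)
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("house_prices.csv")
plt.scatter(data["Size"], data["Price"])  # one dot per house
plt.xlabel("Size (sq ft)")
plt.ylabel("Price ($)")
plt.title("House Size vs. Price")
plt.show()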
Why Understanding the Data Matters Before Modeling
If we skipped directly to model building without understanding our data, we’d be working blindly. By taking the time to understand our data, we can anticipate some questions and strategies:
Is the Relationship Likely to be Linear?
We’re considering linear regression in upcoming steps. Linear regression fits a straight line to the data. If the relationship between Size and Price is roughly linear (i.e., price tends to increase by a certain amount per additional square foot), this approach can work well. By understanding the data—seeing that as size increases, price generally increases—we can be more confident in starting with a linear model.
Is There Only One Feature?
Right now, we only have one feature: house size. It’s a very simple dataset. In the real world, predicting house prices might require more features: number of bedrooms, location, age of the house, quality of construction, etc. Understanding that we have a very simplified scenario helps us set our expectations. Our predictions won’t be as accurate as a real-world model because we’re using less information. But that’s fine for a learning example.
Do We Have Enough Data Points?
Our example dataset is tiny—just a handful of rows. In real machine learning scenarios, you might have thousands or millions of examples. Fewer data points mean the model might not generalize well. Understanding that we have very limited data means we should be cautious about drawing strong conclusions from our model. This also helps us realize that this project is more about learning the process than achieving high accuracy.
Is There Any Data Cleaning Required?
If we had missing values, duplicates, or strange outliers, we would need to fix them before training. Understanding our data means checking if everything looks reasonable. Since our tiny dataset is clean (no missing values, no weird outliers), we can proceed confidently.
Relating Understanding to Future Steps
In the upcoming steps, we will:
Split the Data into Training and Testing Sets: Knowing our target and features will help us split the data so that some examples are used to train the model and others are used to test it. Understanding what the data represents ensures we keep the right columns for features and the right column for the target.
Choose and Train a Model: Our understanding that this is a regression problem will guide us to pick a suitable algorithm (like linear regression) and an appropriate metric (like RMSE).
Evaluate and Improve the Model: If we understand the range and scale of our data, we can better interpret whether an error of $20,000 in price predictions is large or small. (Is $20,000 a big deal compared to a $300,000 house? It might be, but at least we have context.)
Create a GUI for Predictions: Eventually, we’ll let users input a house size and get a predicted price. Understanding what Size and Price mean ensures that the final application makes sense to the end-user. If the user enters a size completely outside our known range (e.g., 10 sq ft or 100,000 sq ft), we’ll understand that our model might not produce reasonable predictions, because we have no data close to that range (see the sketch after this list).
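Here is a hedged sketch of the kind of range check we might add around predictions later; the variable names are illustrative, and this is not yet part of our project code:
# Illustrative range check (hypothetical names, not part of model_building.py yet):
size_min, size_max = data["Size"].min(), data["Size"].max()
user_size = 10000  # example of an out-of-range input
if not (size_min <= user_size <= size_max):
    print(f"Warning: {user_size} sq ft is outside the range the model was "
          f"trained on ({size_min}-{size_max} sq ft); treat its prediction with caution.")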
By taking a moment to truly understand what our data is and what it represents, we lay a conceptual foundation that will make the next steps—splitting, training, evaluating, and deploying the model—much more meaningful. We won’t just be following instructions blindly; we’ll know why we’re doing what we’re doing.
Summary of What We Learned in Step 2
We clarified the concepts of features (inputs) and target (output).
We identified that this is a regression problem, suitable for predicting continuous values.
We considered how the distribution and range of the data affect the complexity of the prediction task.
We connected our understanding of the data to the choice of model (linear regression) and evaluation metrics.
We prepared ourselves for future steps, such as splitting the data and training a model, with a solid conceptual understanding.
Now that we have a firm grasp on what our data represents and what we aim to do with it, we’re ready to proceed to the next step. In the next steps, we will start making the jump from understanding our data to preparing it for machine learning, ensuring that we set the stage for a successful training process.
Code Update for Step 2:
No code changes are required. The model_building.py file remains:
# model_building.py after Step 1
import pandas as pd
# Load the dataset
data = pd.read_csv("house_prices.csv")
print("First 5 rows of the dataset:")
print(data.head())
print("\nChecking for missing values:")
print(data.isnull().sum())
print("\nStatistical summary of the data:")
print(data.describe())
Where to Add It:
No additions needed. Your file remains the same.
Step 3: Defining Our Machine Learning Goal
In the previous steps, we accomplished the following:
Step 1: Loaded our data and learned how to inspect it. We verified the data structure, ensured it was readable in Python, and understood what our CSV file looked like.
Step 2: Developed a conceptual understanding of the data. We identified what our features (Size) and target (Price) are and recognized that we are dealing with a regression problem.
At this stage, we need to explicitly define our machine learning goal. This goal acts like a compass, guiding every decision we make moving forward—from how we prepare the data and choose our algorithm to how we measure success.
What Does “Goal” Mean in Machine Learning?
A machine learning goal is a clear statement of what we want our model to achieve. It answers questions like:
What are we trying to predict?
What input information are we using to make that prediction?
How will we know if the model is doing well?
By stating the goal in plain language, we ensure that we, as developers and learners, know what the finished product should look like and what success means in measurable terms.
Our Specific Goal in This Project
We have one main feature (house size) and one target (house price). Our learning setup is:
Given: A house size (in square feet).
Predict: The selling price of that house.
In other words, our machine learning model should take a numeric input (Size) and produce a numeric output (predicted Price). We want the model to learn a relationship such that, when we provide it with a new size that it has never seen before, it can give a reasonable estimate of how much a house with that size should cost, based on the patterns discovered in the historical data.
Why this goal?
If we can achieve this goal, we have a practical use case: if a real estate agent or a homeowner wants a quick estimate of a property’s value based on its size, our model can provide a ballpark figure. Although our dataset is very simple and may not lead to extremely accurate predictions, it’s a perfect demonstration project for learning the machine learning process.
Key Components of Our Goal
Type of Prediction:
We are predicting a continuous number (price). This reconfirms that our problem is regression-oriented. Defining the goal helps us choose an algorithm and evaluation metrics suitable for regression tasks.
Simplicity of the Relationship:
We have just one input feature (Size). We aim to find a function (a mathematical relationship) that maps Size to Price. The simplest model we will start with is linear regression, which tries to fit a straight line to the data.
Generalization to New Cases:
Our model will learn from past examples (houses we already know the size and price of). Our goal includes the model’s ability to generalize. That is, even though the model learned from certain houses, we want it to predict the price of a new house that wasn’t in the training data. The better it generalizes, the more useful it is.
Balancing Accuracy and Complexity:
Sometimes, we can have very complex models, but that’s not always necessary or even beneficial. With our simple dataset, our goal is to implement a basic model that provides reasonable predictions. We do not need extreme accuracy for this educational exercise. The main goal is understanding the process and establishing a baseline. Later, if we want to improve accuracy, we can consider adding more features or trying different methods.
How Will We Measure Success?
Defining a goal also implies defining metrics for success. A metric is a way of numerically measuring how good our predictions are.
For regression tasks (like ours), a common choice is the Root Mean Squared Error (RMSE) or the Mean Absolute Error (MAE). These metrics give us a sense of how far off our predictions are from the actual prices.
If our goal is “predict the house price as closely as possible,” then a smaller RMSE indicates that our model is doing better. For example, if the RMSE is $20,000, that means on average, the predicted price is off by about $20,000 from the real price. If we can reduce that to $10,000 by using a better model or more data, we know we have improved.
By defining the metric before we start training the model, we give ourselves a clear way to know if we’re achieving our goal. Without a defined metric, we might say “the model seems good” or “I think it’s accurate,” but that would be guesswork. With RMSE, we have a concrete number to improve upon.
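To make the metric concrete, here is a minimal sketch of computing RMSE with scikit-learn and NumPy; the actual and predicted values below are made up purely for illustration, not output from our model:
# Illustrative RMSE computation (made-up numbers):
import numpy as np
from sklearn.metrics import mean_squared_error

actual = np.array([250000, 300000, 360000])     # true prices (hypothetical)
predicted = np.array([262000, 291000, 372000])  # model guesses (hypothetical)
rmse = np.sqrt(mean_squared_error(actual, predicted))
print(f"RMSE: ${rmse:,.2f}")  # roughly how far off predictions are, in dollars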
Why State the Goal Now?
You might wonder: why do we state the goal explicitly at this stage, rather than when we first loaded the data? There are a few reasons:
Context Building:
In Step 1, we focused on getting our data into Python and seeing what it looked like. In Step 2, we deepened our understanding of the data itself. Only now do we have enough context to confidently specify what we want from the model. We know what we have and what we’re dealing with, so we can set a realistic goal.
Guiding Future Decisions:
In the next steps, we will split our data into training and testing sets, choose a modeling technique, and eventually evaluate the results. Having a defined goal keeps these steps aligned. For instance, we know we’ll pick a regression algorithm (like linear regression) and a regression metric (like RMSE) to match our goal of predicting prices.
Setting Expectations:
By defining the goal now, we manage our expectations. We know that the model can’t read minds—it only looks at Size. If the price of a house depends on many factors (location, condition, etc.) that we haven’t provided, we shouldn’t expect perfect predictions. By acknowledging this up front, we won’t be disappointed later.
Connecting the Goal to Real-World Use
While our project is a simplified example, defining the goal mirrors what happens in real data science projects. Before building a model that, say, predicts stock prices or recommends products to customers, a data scientist must clearly state the objective. Without a clearly defined goal, you can easily waste time building models that don’t solve the real problem.
In our house price scenario, our goal is very clear:
Input: House size.
Output: Predicted selling price.
Success Criterion: Achieve a relatively low RMSE or a reasonably close prediction on unseen houses of similar sizes.
With this in mind, every line of code we write from now on will be in service of this goal: from preparing the data to training the model, saving it, and finally building a GUI that allows a user to enter a size and get a predicted price.
Summary of Step 3
We defined our machine learning goal in concrete terms.
We identified that we want to predict house prices from house size, which is a regression goal.
We recognized the importance of having a clear metric (like RMSE) to measure success.
We established that this goal guides our future steps, from choosing the model type to evaluating performance.
We managed expectations by understanding the limitations of predicting price solely from size.
Now that we have a well-defined goal, we can move forward with confidence. Next, we will focus on preparing our data for modeling—this often involves splitting the data into training and testing sets so that we can fairly evaluate our model’s performance. This preparation step will bring us one step closer to building our actual predictive model.
Code Update for Step 3:
No code changes. The file model_building.py remains exactly the same as after Step 2.
# model_building.py after Step 1
import pandas as pd
# Load the dataset
data = pd.read_csv("house_prices.csv")
print("First 5 rows of the dataset:")
print(data.head())
print("\nChecking for missing values:")
print(data.isnull().sum())
print("\nStatistical summary of the data:")
print(data.describe())
Where to Add It:
No additions needed again. We will add more code in the next step.
Step 4: Splitting the Data into Training and Testing Sets
Up to this point, we have:
Loaded and inspected our dataset (Step 1).
Understood our data, defining features and targets (Step 2).
Clarified our machine learning goal, deciding we want to predict house prices from house size (Step 3).
Now, before we build and train our model, we need to think about how we will evaluate whether our model is any good. If we simply train the model on all our data and then test it on the exact same data, we might end up fooling ourselves into believing the model is better than it really is. Why? Because the model has already “seen” all that data—it could just be memorizing it rather than learning genuine patterns.
To avoid this pitfall, we use a common practice in machine learning: splitting our data into two sets:
Training Set: The portion of data the model will learn from.
Testing Set: A separate portion of data that the model will never see during training and will only be used for evaluation after the model has been trained.
Let’s unpack why this is important and how we do it.
Why Do We Split the Data?
Imagine this scenario: You’re studying for a math test. If you memorize the answers to the practice problems and then your teacher gives you the exact same problems on the test, you’ll easily get a perfect score. But what if you get new, slightly different problems on the actual exam? If you only memorized previous answers, you might struggle with new questions. To truly measure how well you learned the underlying concepts, you need to be tested on problems you haven’t seen before.
Machine learning is similar. The “exam” here is how well the model predicts house prices for new, unseen houses. If we test the model on data it’s already seen, we can’t reliably judge how well it will do on new data in the future. By splitting the data beforehand, we create a fair test: the model trains on the training set, and then we measure its performance on the testing set, which it has never encountered. The result gives us a better understanding of how the model will perform in the real world.
Key Concept:
The training set is for learning the relationship (e.g., how size relates to price).
The testing set is for evaluating whether that learned relationship generalizes to unseen examples.
Choosing the Size of the Split
There is no hard rule for exactly how to split your data, but common practice is to use about 70-80% of your data for training and 20-30% for testing. Since our example dataset is very small, even a simple 80/20 split will suffice.
Why these ratios?
Training the model requires enough data to detect patterns.
Testing the model requires some data that’s kept aside to ensure reliability and fairness.
The exact ratio can vary depending on how much data you have. With more data, you can afford a larger training set while still having plenty of test data.
For our example, we’ll choose an 80/20 split: 80% training, 20% testing.
How Do We Perform the Split in Code?
Python’s scikit-learn library provides a convenient function called train_test_split that does this for us. Since we introduced the concept of libraries and functions in previous steps, let’s recap briefly:
Library (scikit-learn): A collection of machine learning tools and functions we can use rather than writing everything from scratch.
Function (train_test_split): A piece of code that we call with certain inputs (our data and our desired test size) and that returns the training and testing subsets.
Let’s say we have our entire dataset in a variable called data. We know data is a DataFrame with columns like Size and Price. To use machine learning tools effectively, we typically separate our features and target:
Features (X): In our case, that’s just Size.
Target (y): That’s Price.
We can write:
from sklearn.model_selection import train_test_split
X = data[["Size"]] # This means we take only the 'Size' column as our features.
y = data["Price"] # This means we take the 'Price' column as our target.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
What happens here?
We call train_test_split(X, y, test_size=0.2, random_state=42).
X is our features (house sizes).
y is our target (house prices).
test_size=0.2 means that 20% of the data should go into the test set. The remaining 80% will be training data.
random_state=42 is a way to make our results reproducible. It sets a “seed” for the random number generator, so every time we run this code with random_state=42, we get the same split. This is helpful for consistency during learning and demonstration.
The function returns four outputs:
X_train: The portion of X used for training.
X_test: The portion of X used for testing.
y_train: The portion of y corresponding to X_train.
y_test: The portion of y corresponding to X_test.
By having these four variables, we maintain the alignment between features and target for both training and testing sets. That means each row in X_train matches with a row in y_train, and similarly for X_test and y_test.
Conceptualizing the Split
If your data originally had, say, 100 rows (100 houses), and you use test_size=0.2, then approximately 80 houses end up in X_train and y_train (these are the training examples), and about 20 houses end up in X_test and y_test (these are the testing examples).
The model will “look at” (train on) those 80 training examples. It will never see the 20 testing examples during the training phase. After training, we’ll use X_test to predict prices and compare those predictions to the actual y_test values. This comparison tells us how well the model learned the general relationship, not just memorized the training data.
How Does This Align with Our Goal?
Recall our machine learning goal from Step 3: we want to predict house prices from house sizes and measure how close these predictions are to the real prices. By splitting the data, we set ourselves up for a fair assessment:
After training the model on the training set, we’ll ask: “Does the model accurately predict house prices for houses it hasn’t seen?”
The test set answers this question. If the model does well on X_test and y_test, we can be more confident that it learned a true pattern rather than memorizing specific examples.
When to Perform the Split
It’s best practice to split the data before we do any modeling or advanced analysis. Why? Because we want our test set to remain completely untouched and unbiased. If we use the entire dataset to decide on certain parameters or features, we might unknowingly influence the model in a way that makes the test set less representative of true “new” data.
By splitting now—before we train the model—we ensure an unbiased evaluation later. This is why we are doing this step at this point in the process:
We have the data loaded (Step 1).
We understand it conceptually (Step 2).
We know our goal (Step 3).
Now, we prepare for modeling by splitting (Step 4).
This logical sequence ensures each step builds upon the previous one without jumping ahead prematurely.
Summary of What We Learned in Step 4
We discovered why we need separate training and testing sets: to prevent overestimating the model’s performance and to ensure fair evaluation.
We learned how to use train_test_split from scikit-learn to split our data into X_train, X_test, y_train, and y_test.
We connected this data-splitting practice to our overall goal of building a reliable model that generalizes well to new data.
We established that performing the split now sets the stage for honest and effective training and testing in upcoming steps.
With our data split into training and testing sets, we’re now ready to move forward. The next logical step is to choose a machine learning algorithm—linear regression in our case—and train it using X_train and y_train. After training, we’ll use X_test and y_test to measure how well our model meets our goal of predicting house prices accurately.
Code Update for Step 4:
We will add code to separate our features and target, and then split into training and testing sets. Add the following code after the data inspection code in model_building.py.
# model_building.py after Step 4
import pandas as pd
from sklearn.model_selection import train_test_split
# Load the dataset
data = pd.read_csv("house_prices.csv")
# Inspect the data (from previous steps)
print("First 5 rows of the dataset:")
print(data.head())
print("\nChecking for missing values:")
print(data.isnull().sum())
print("\nStatistical summary of the data:")
print(data.describe())
# Define features (X) and target (y)
X = data[["Size"]] # X is a DataFrame with one column: Size
y = data["Price"] # y is a Series with house prices
# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("\nTraining and testing data shapes:")
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
Where to Add It:
After the data inspection code that we wrote in Step 1, add the code for defining X and y and the train/test split. Now model_building.py shows both the initial data loading and inspection (from Step 1) and the splitting (from Step 4).
Step 5: Introducing Linear Regression
Up to now, we have:
Loaded and inspected our dataset (Step 1).
Understood what our data represents and recognized our problem as a regression task (Step 2).
Defined our machine learning goal (Step 3): to predict house prices from their size.
Split our data into training and testing sets (Step 4), setting the stage for fair evaluation.
Now it’s time to discuss the specific machine learning algorithm we’ll use first: Linear Regression.
What is an Algorithm in Machine Learning?
Before talking about linear regression specifically, let’s clarify what we mean by a “machine learning algorithm.”
In everyday language, an algorithm is just a set of instructions or a procedure for solving a problem step-by-step. In machine learning, algorithms are methods that allow a computer to learn patterns from data. Different algorithms have different strategies for finding these patterns. You can think of an algorithm as the “approach” or “recipe” the computer follows to learn from the examples you provide.
When we say we’re going to “use linear regression,” we mean we’re going to apply the linear regression algorithm to our training data. This algorithm will try to figure out how house size relates to house price by examining the examples we give it.
What is Linear Regression?
Linear Regression is one of the simplest and most popular algorithms for regression tasks. It tries to model the relationship between input variables (features) and an output variable (target) by fitting a straight line (also known as a linear function) to the data.
In our case:
Input (Feature): Size of the house.
Output (Target): Price of the house.
Linear regression will try to find a straight line that best represents how price changes as size changes. Mathematically, a line in two dimensions can be represented as:
Price = (Slope × Size) + Intercept
Slope: This tells us how much the price is expected to change when the size increases by one unit (for example, one square foot).
Intercept: This is the price the model predicts for a house of size zero (not realistic physically, but mathematically necessary). It sets the baseline level of the line.
The challenge in linear regression is to find the best Slope and Intercept that make the line fit our data as closely as possible.
How Does Linear Regression Find the Best Line?
It’s not just guessing. Linear regression uses a method known as least squares to find the best fit. Here’s the general idea:
Start with a guess: Initially, the algorithm might start with a random line through the data.
Measure the error: For each house in the training set, the model uses the current line to predict a price and then compares that predicted price to the house’s actual selling price. The difference between the predicted price and the actual price is called an error or residual.
Combine the errors into a single measure: Instead of looking at each error individually, linear regression squares these errors (to avoid negative values and emphasize larger errors more) and then sums them all up. This sum of squared errors is a way of measuring how poorly the line fits the data.
Adjust the line to reduce the error: The algorithm then tries changing the Slope and Intercept slightly to see if it can make the total error smaller. It repeats this adjustment process many times, gradually moving towards a line that produces the smallest possible sum of squared errors.
By the end of this process, linear regression finds values for Slope and Intercept that minimize the total squared error. This gives us a line that best fits the overall pattern of the data (according to the least squares criterion). In practice, scikit-learn doesn’t literally nudge the line step by step; ordinary least squares has a direct formula that yields the best Slope and Intercept in one calculation. The step-by-step picture is simply a helpful way to visualize what “minimizing the error” means.
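For the mathematically curious, here is an optional sketch of that least-squares calculation done by hand with NumPy, using our five example houses. Scikit-learn will do the equivalent work for us later, so this is purely to demystify the formula:
# Optional: the one-feature least-squares formula, spelled out with NumPy.
import numpy as np

sizes = np.array([1000, 1200, 1500, 1800, 2000], dtype=float)
prices = np.array([250000, 270000, 300000, 330000, 360000], dtype=float)

# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
slope = np.sum((sizes - sizes.mean()) * (prices - prices.mean())) / np.sum((sizes - sizes.mean()) ** 2)
intercept = prices.mean() - slope * sizes.mean()  # the best-fit line passes through the mean point
print(f"Slope: {slope:.2f} dollars per sq ft, Intercept: {intercept:.2f} dollars")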
Why Linear Regression for Our Problem?
Simplicity: We have only one feature (Size). A simple straight line might be a reasonable first guess for relating Size to Price. If a bigger house generally costs more, and smaller houses cost less, a line might capture this trend well enough as a starting point.
Interpretability: Linear regression is easy to understand and explain. Once we have our Slope and Intercept, we can say something like, “For every additional square foot, the price increases by approximately X dollars.” This kind of explanation is helpful when you’re learning.
A Good Starting Point: Linear regression is often the first algorithm taught in machine learning because it is conceptually straightforward. Even if it doesn’t produce perfect predictions, it helps us understand the process of training a model, evaluating it, and possibly improving it later.
What Will Using Linear Regression Look Like in Code?
In practical terms, we don’t have to code the mathematical details of linear regression ourselves. The scikit-learn library provides a LinearRegression class that implements this algorithm. Here is conceptually what we will do in code (not the final code yet, just an outline):
Import the model class:
from sklearn.linear_model import LinearRegression
This line means we’re bringing in the LinearRegression model from scikit-learn’s linear_model module so we can use it in our code.
Create an instance of the model:
model = LinearRegression()
Here, we create a “model” variable that represents an untrained linear regression model. It’s like a blank slate waiting to learn.
Train (fit) the model:
model.fit(X_train, y_train)
This is where the magic happens. We call the .fit() method on the model and give it our training data (X_train and y_train). The model will go through the process of finding the best Slope and Intercept by minimizing the error, as described above.
Use the trained model to make predictions:
y_pred = model.predict(X_test)
Once the model is trained, we can use .predict() to get predictions for new inputs. For example, we’ll provide X_test (house sizes the model has never seen before) and get predicted prices in y_pred.
How Does This Tie Into Our Overall Goal?
Recall our goal: We want to accurately predict house prices from house size and measure success, for example, using something like RMSE. By choosing linear regression, we are taking a structured, well-known approach to solve a regression problem.
If linear regression performs well, great! If it doesn’t, we can try improving the model, adding more features, or using more advanced algorithms. But starting with something simple and interpretable is a great way to learn the process and understand how machine learning algorithms work in general.
What’s Next?
Now that we know what linear regression is and why we’re using it, the next steps will involve actually training this model with our training data, evaluating how well it does on our testing data, and calculating metrics like RMSE to see if it meets our goal.
After that, we’ll move on to saving the trained model, and eventually, we’ll build a graphical user interface (GUI) that allows anyone to input a house size and get a predicted price, all powered by the linear regression model we’ll have trained.
Summary of Step 5
We introduced the concept of a machine learning algorithm and what linear regression is.
We learned how linear regression tries to fit a straight line through our data by minimizing the sum of squared errors.
We understood why linear regression is a good starting point for our simple house price prediction problem.
We got a preview of how we’ll implement linear regression in code using scikit-learn’s LinearRegression class.
With this conceptual understanding in place, we’re ready to proceed to the practical side: actually training the linear regression model on our training data and evaluating its performance on the test data.
Code Update for Step 5:
We will now import the LinearRegression class from sklearn.linear_model. Although we will train the model in the next step, let’s just set up the code structure so that our model_building.py is prepared.
We don’t have to train yet, just import and create the model instance to show progress. After the changes, model_building.py looks like this:
# model_building.py after Step 5
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression # Added this line in Step 5
# Load the dataset
data = pd.read_csv("house_prices.csv")
# Inspect the data (from previous steps)
print("First 5 rows of the dataset:")
print(data.head())
print("\nChecking for missing values:")
print(data.isnull().sum())
print("\nStatistical summary of the data:")
print(data.describe())
# Define features (X) and target (y)
X = data[["Size"]]
y = data["Price"]
# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("\nTraining and testing data shapes:")
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
# Initialize the Linear Regression model
model = LinearRegression()
print("\nLinearRegression model created. Not trained yet.")
Where to Add It:
Add the import for LinearRegression near the top with the other imports. Then, after splitting the data, create an instance of LinearRegression() and print a message. We are not fitting the model yet; we’re just preparing the code structure. Training will come in a future step.
Step 6: Training the Linear Regression Model
In the previous steps, we set the stage for this crucial moment. Let’s briefly recall what we’ve done so far:
Step 1: Loaded and inspected the data, ensuring we know what our dataset looks like.
Step 2: Understood the data conceptually, identifying features and target.
Step 3: Defined our machine learning goal (predict house prices from house size).
Step 4: Split our data into training and testing sets to ensure fair evaluation later.
Step 5: Introduced linear regression as our chosen algorithm for making predictions.
Now, in Step 6, we will actually train the model. Training is the process by which the algorithm “learns” the relationship between the input feature (Size) and the target (Price).
What Does “Training” Mean in Practice?
Training is when we feed the model a set of examples (the training data) and let it figure out the best parameters (in linear regression, these parameters are the Slope and Intercept of the line) that minimize the prediction errors on those examples. Remember:
We have X_train (the house sizes for the training set).
We have y_train (the corresponding house prices for those training houses).
The model will use these pairs (Size, Price) to adjust its internal parameters so that the predicted price is as close as possible to the actual price for the houses in the training set.
Conceptual Recap of Linear Regression:
The model starts without knowledge of the correct Slope and Intercept.
It tries different values internally and computes how far off its predictions are.
Through a mathematical process (often involving something called “least squares”), it finds the combination of Slope and Intercept that yields the lowest overall error.
After training, the model will have a final Slope and Intercept that define the best-fit line:
Predicted Price = (Slope * Size) + Intercept
Using .fit() to Train the Model
We’re using scikit-learn’s LinearRegression class. The LinearRegression object, once created, has a method called .fit() that performs the training. The .fit() method expects two arguments:
X_train: The training features (in our case, sizes of houses we kept for training).
y_train: The corresponding training targets (the known prices for those training houses).
By calling:
model.fit(X_train, y_train)
we are asking the model to learn from the training data. Behind the scenes, the model goes through the process we described: it tries to minimize the error between predicted and actual prices. This process happens very quickly, especially for simple datasets like ours.
Interpreting the Results
Once the training is complete, we can inspect the learned parameters:
model.intercept_: The model’s predicted price if Size was 0. While a zero-sized house doesn’t make real-world sense, mathematically it’s just the line’s intercept with the vertical axis. It sets a baseline.
model.coef_: This is a list (or array) of coefficients for each feature. Since we have only one feature (Size), model.coef_ will have just one value—the Slope. This Slope tells us how much the predicted Price changes for each additional square foot of Size.
For example, if model.coef_[0] (the first and only coefficient) is 100, that suggests that for each increase of 1 sq ft in size, the predicted price goes up by $100, all else being equal.
Once trained, we can print out these parameters and see if they make intuitive sense. Does it sound reasonable that a bigger house leads to a higher price? Yes, and the numbers the model gives will reflect the specific relationship it found in the training data.
Running the Code and Viewing the Output
When you run the training code, you’ll see output similar to:
Model trained successfully!
Intercept (the base price when Size=0): 150000
Slope (price increase per square foot): 100
This would mean the model’s prediction formula is:
Predicted Price = 100 * Size + 150000
If you plug in a size of 1600 sq ft:
Predicted Price = (100 * 1600) + 150000 = 160000 + 150000 = 310000
So, the model would predict $310,000 for a 1600 sq ft house. The exact numbers will depend on your dataset.
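If you want to double-check this yourself, here is an optional sanity check you could append to model_building.py after training; it assumes the trained model and the pandas import from that file are already in scope:
# Optional sanity check: compute the prediction by hand from the learned
# parameters and compare it with model.predict().
size = 1600
price_by_formula = model.coef_[0] * size + model.intercept_
price_by_predict = model.predict(pd.DataFrame({"Size": [size]}))[0]
print("Price by formula:", price_by_formula)
print("Price by model.predict():", price_by_predict)  # should match the formula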
Note: Just because the model gives these numbers doesn’t mean it’s perfectly accurate. We’ll still need to test it on the X_test and y_test data we set aside to see how well it generalizes to unseen data. That comes later.
Code Update for Step 6
We’ve previously been building a single file called model_building.py to keep track of all changes as we go. After Step 5, model_building.py looked like this:
# model_building.py after Step 5
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Load the dataset
data = pd.read_csv("house_prices.csv")
# Inspect the data
print("First 5 rows of the dataset:")
print(data.head())
print("\nChecking for missing values:")
print(data.isnull().sum())
print("\nStatistical summary of the data:")
print(data.describe())
# Define features (X) and target (y)
X = data[["Size"]]
y = data["Price"]
# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("\nTraining and testing data shapes:")
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
# Initialize the Linear Regression model
model = LinearRegression()
print("\nLinearRegression model created. Not trained yet.")
What to Add for Step 6:
Now, we will add the training code (model.fit(X_train, y_train)) and print the resulting intercept and slope. Place this new code right after we initialize the model:
# model_building.py after Step 6
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Load the dataset
data = pd.read_csv("house_prices.csv")
# Inspect the data
print("First 5 rows of the dataset:")
print(data.head())
print("\nChecking for missing values:")
print(data.isnull().sum())
print("\nStatistical summary of the data:")
print(data.describe())
# Define features (X) and target (y)
X = data[["Size"]]
y = data["Price"]
# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("\nTraining and testing data shapes:")
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
# Initialize the Linear Regression model
model = LinearRegression()
print("\nTraining the LinearRegression model...")
model.fit(X_train, y_train) # This is where the model learns the relationship
print("Model trained successfully!")
# Print out the learned parameters
print("Intercept (the base price when Size=0):", model.intercept_)
print("Slope (price increase per square foot):", model.coef_[0])
Where to Add It:
Keep everything from previous steps as is.
Insert the training line (model.fit(X_train, y_train)) and the print statements for model.intercept_ and model.coef_ right after the model = LinearRegression() line and after you print “Training the LinearRegression model...”.
Run the Code
You should now see the output from the data inspection steps, the shapes of your training and testing sets, and finally confirmation that the model has trained, along with its learned parameters.
This step confirms that your model is no longer just a concept—it’s now a trained model that has “seen” your training data and derived a relationship between house size and price!
Recap of Step 6
We learned that training involves the model finding the best-fit line that relates Size to Price.
We used model.fit(X_train, y_train) to train the linear regression model.
We printed the Intercept and Slope to understand the model’s learned parameters.
We integrated this code into our model_building.py file, building upon all previous steps.
With the model now trained, the next step will be to see how well it predicts prices for unseen data (X_test and y_test). This evaluation will help us measure the model’s accuracy and see if we’ve achieved our machine learning goal.
Step 7: Evaluating the Model
Now that our model is trained (from Step 6), we need to find out how well it performs on data it hasn’t seen before. The goal of machine learning isn’t just to memorize the training data; it’s to generalize from that data to make good predictions on new, unseen examples.
Remember, we split our data into training and testing sets in Step 4. We trained the model only on the training set. This means we have a “fresh” set of examples (the test set) that the model never saw during training. Testing the model’s predictions on this separate set gives us a realistic idea of how well it might perform on real-world data.
Why Evaluate the Model?
Assess Generalization: Just because the model fits the training data well doesn’t mean it will predict well for new data. The test set evaluation tells us if the model learned true underlying patterns or just memorized the training examples.
Compare Models: If we try different models or different techniques later, we need a metric to compare them. A consistent evaluation method allows us to see which approach is better.
Identify Next Steps: If the model’s performance is poor, we might need to add more features, gather more data, or try more complex algorithms. If performance is good, we can have more confidence in our results.
Introducing RMSE (Root Mean Squared Error)
For regression tasks (like predicting house prices), we need a measure of how close the predictions are to the actual values. One common metric is the Root Mean Squared Error (RMSE). It works as follows:
Calculate Errors for Each Example: For each house in the test set, the model predicts a price. The difference between the predicted price and the actual price is called the error or residual.
Square the Errors: By squaring the errors, we ensure that large errors get more emphasis (since squaring a big number makes it even bigger) and we get rid of negative signs.
Take the Mean (Average) of the Squared Errors: Adding all the squared errors and dividing by the number of examples gives us the mean squared error (MSE).
Take the Square Root: The square root of the MSE is the RMSE. This brings the metric back to the same units as our target variable (dollars in this case), making it easier to interpret.
For example: If RMSE = 20,000, it means on average our predictions are off by about $20,000.
Why RMSE?
RMSE gives a sense of the magnitude of the errors in the same units as the original predictions. A lower RMSE means the model’s predictions are closer to the actual values on average.
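To make the four-step recipe concrete, here is a minimal sketch that computes RMSE by hand on a few made-up numbers; our real script will use scikit-learn’s helper instead:
# Illustrative RMSE calculation with made-up actual and predicted prices.
import numpy as np

actual = np.array([300000.0, 250000.0, 400000.0])
predicted = np.array([310000.0, 240000.0, 420000.0])

errors = predicted - actual   # residuals: 10000, -10000, 20000
mse = np.mean(errors ** 2)    # mean of the squared errors
rmse = np.sqrt(mse)           # back to dollars
print("RMSE:", rmse)          # about 14142.14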
Calculating RMSE in Code
Scikit-learn provides a function called mean_squared_error to calculate MSE. We will then take the square root using NumPy’s np.sqrt() function to get RMSE.
Steps to Evaluate:
Use model.predict(X_test): We call model.predict(X_test) to generate predictions for the houses in the test set. Remember, the model has never seen these examples during training. This gives us y_pred, an array of predicted prices.
Calculate MSE using mean_squared_error(y_test, y_pred): This function compares the actual prices (y_test) with the predicted prices (y_pred) and returns the average of the squared errors.
Take the square root of MSE to get RMSE:
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
Print the RMSE: This lets us know, on average, how far off we are in our predictions. If the number seems large compared to typical house prices, we might want to improve the model.
Connecting to Previous Steps
In Step 4, we created X_test and y_test. Now we’re using them for evaluation.
In Step 6, we trained the model. Now we see if that training paid off in terms of good performance.
This evaluation step is crucial because it reveals whether our model is just good at memorizing the training data or truly good at predicting unseen examples.
Code Update for Step 7
Previously, after Step 6, our model_building.py looked like this:
# model_building.py after Step 6
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Load the dataset
data = pd.read_csv("house_prices.csv")
# Inspect the data
print("First 5 rows of the dataset:")
print(data.head())
print("\nChecking for missing values:")
print(data.isnull().sum())
print("\nStatistical summary of the data:")
print(data.describe())
# Define features (X) and target (y)
X = data[["Size"]]
y = data["Price"]
# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("\nTraining and testing data shapes:")
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
# Initialize the Linear Regression model
model = LinearRegression()
print("\nTraining the LinearRegression model...")
model.fit(X_train, y_train) # Train the model
print("Model trained successfully!")
# Print out the learned parameters
print("Intercept (the base price when Size=0):", model.intercept_)
print("Slope (price increase per square foot):", model.coef_[0])
What to Add for Step 7:
We will import what we need (mean_squared_error from sklearn.metrics, and numpy as np), then use model.predict(X_test) to get predictions, calculate the RMSE, and print it out. Add this code after printing the model’s learned parameters:
# model_building.py after Step 7
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error # Added in Step 7
import numpy as np # Added in Step 7
# Load the dataset
data = pd.read_csv("house_prices.csv")
# Inspect the data
print("First 5 rows of the dataset:")
print(data.head())
print("\nChecking for missing values:")
print(data.isnull().sum())
print("\nStatistical summary of the data:")
print(data.describe())
# Define features (X) and target (y)
X = data[["Size"]]
y = data["Price"]
# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("\nTraining and testing data shapes:")
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
# Initialize the Linear Regression model
model = LinearRegression()
print("\nTraining the LinearRegression model...")
model.fit(X_train, y_train) # Train the model
print("Model trained successfully!")
# Print out the learned parameters
print("Intercept (the base price when Size=0):", model.intercept_)
print("Slope (price increase per square foot):", model.coef_[0])
# Evaluate the model using the test data
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("\nEvaluating the model on test data:")
print("RMSE:", rmse)
Where to Add It:
Add the from sklearn.metrics import mean_squared_error and import numpy as np lines at the top with the other imports.
After printing the slope and intercept, add the code to predict y_pred, calculate rmse, and print the RMSE.
Run the Code
Make sure model_building.py and house_prices.csv are in the same directory.
You will now see the RMSE printed at the end of the output. If it’s large, it suggests the model’s predictions aren’t very precise. If it’s small, the model might be doing a decent job.
Interpreting the RMSE
If RMSE = 20,000, the model’s predictions are off by about $20,000 on average. Is that good or bad? It depends on the typical prices in your dataset. If houses cost around $300,000, being off by $20,000 might not be terrible for a first attempt, but there’s room for improvement.
If you had more features (like number of bedrooms, location, etc.), your model might achieve a lower RMSE.
Remember, our dataset and setup are simplified for learning purposes, so we’re not aiming for state-of-the-art accuracy right now. The important takeaway is that we’ve learned how to evaluate our model quantitatively.
Recap of Step 7
We introduced the concept of evaluating the model using a separate test set.
We explained and calculated the RMSE metric.
We integrated evaluation code into model_building.py.
By checking RMSE, we understand how well our model performs on unseen data, giving us insights into its real-world utility.
With our evaluation done, we now have a sense of where we stand. In the next steps, we can explore saving the model, building a GUI, or refining our approach. But at this moment, we have a fully trained and evaluated machine learning model!
Step 8: Visualizing the Results (Optional)
Machine learning models often produce numeric outputs and metrics, which can be hard to intuitively grasp. Visualization can make the results more tangible. By plotting the data points and the model’s predicted line, we gain a better understanding of how well the model fits the data.
Remember our model: we trained a linear regression model to predict house prices based on house sizes. If our model is doing a decent job, the red line (representing the model’s predictions) should go through the middle of the data points, capturing the overall trend.
Why Visualize?
Intuition: Numbers like slopes, intercepts, and RMSE help, but a visual makes it instantly clear whether the model is fitting the data well.
Identifying Patterns or Outliers: A plot might reveal if most data points align well with the fitted line or if there are unusual points far away from it.
Communication: If you were explaining your results to someone else, a visual often conveys information more effectively than just metrics.
What Are We Plotting?
The Training Data: We have X_train and y_train, which represent the houses used to train the model. We’ll plot these points as a scatter plot (blue dots). Each dot corresponds to one house, with the horizontal axis showing the house size (sq ft) and the vertical axis showing the selling price ($).
The Model Line: The trained model gives us a line:
Predicted Price = Slope * Size + Intercept
We can draw this line over our training data. To do this, we choose two sizes (the minimum and maximum from the training set) and use the model to predict the corresponding prices. Connecting these two predicted points creates the regression line.
Using matplotlib for Plotting
We will use a popular Python library called matplotlib to create our visualization. Just as we used pandas for handling data and scikit-learn for machine learning, matplotlib specializes in creating charts and plots.
import matplotlib.pyplot as plt: This line imports the plotting module from matplotlib. By convention, we refer to it as plt.
Key Functions:
plt.scatter(X_train, y_train, ...): Plots a scatter of points.
plt.plot(x_values, y_values, ...): Plots a line connecting given x and y values.
plt.xlabel(...), plt.ylabel(...): Label the x-axis and y-axis.
plt.title(...): Add a title to the plot.
plt.legend(): Add a legend to explain what different colors or markers mean.
plt.show(): Display the plot in a window.
Code Update for Step 8
After Step 7, our model_building.py includes code for loading data, splitting it, training the model, and evaluating it. Now we’ll add code to visualize the results.
Here is what model_building.py looked like after Step 7:
# model_building.py after Step 7
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
# Load the dataset
data = pd.read_csv("house_prices.csv")
# Inspect the data
print("First 5 rows of the dataset:")
print(data.head())
print("\nChecking for missing values:")
print(data.isnull().sum())
print("\nStatistical summary of the data:")
print(data.describe())
# Define features (X) and target (y)
X = data[["Size"]]
y = data["Price"]
# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("\nTraining and testing data shapes:")
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
# Initialize and train the Linear Regression model
model = LinearRegression()
print("\nTraining the LinearRegression model...")
model.fit(X_train, y_train)
print("Model trained successfully!")
# Print out the learned parameters
print("Intercept (the base price when Size=0):", model.intercept_)
print("Slope (price increase per square foot):", model.coef_[0])
# Evaluate the model using the test data
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("\nEvaluating the model on test data:")
print("RMSE:", rmse)
What to Add for Step 8:
Import matplotlib:
import matplotlib.pyplot as plt
Add the visualization code after the RMSE calculation and printing, so we can see the plot once we know how the model performed numerically:
# Visualize the training data and the model's fitted line
plt.scatter(X_train, y_train, color='blue', label='Training data')
# Determine the line's coordinates
line_sizes = [X_train["Size"].min(), X_train["Size"].max()]
# This gives us two points: the smallest and largest house size in the training data.
line_prices = model.predict(pd.DataFrame({"Size": line_sizes}))
# We predict what the model would say for the smallest and largest house sizes.
# This gives us two points: (min_size, predicted_price_for_min_size) and (max_size, predicted_price_for_max_size).
plt.plot(line_sizes, line_prices, color='red', linewidth=2, label='Model line')
# Label the axes and add a title
plt.xlabel("Size (sq ft)")
plt.ylabel("Price ($)")
plt.title("House Size vs. Price")
plt.legend()
plt.show()
Where to Add It:
At the end of model_building.py, right after printing the RMSE, add these lines. The final model_building.py after Step 8 will look like this:
# model_building.py after Step 8
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
import matplotlib.pyplot as plt # Added in Step 8
# Load the dataset
data = pd.read_csv("house_prices.csv")
# Inspect the data
print("First 5 rows of the dataset:")
print(data.head())
print("\nChecking for missing values:")
print(data.isnull().sum())
print("\nStatistical summary of the data:")
print(data.describe())
# Define features (X) and target (y)
X = data[["Size"]]
y = data["Price"]
# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("\nTraining and testing data shapes:")
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
# Initialize and train the Linear Regression model
model = LinearRegression()
print("\nTraining the LinearRegression model...")
model.fit(X_train, y_train)
print("Model trained successfully!")
# Print out the learned parameters
print("Intercept (the base price when Size=0):", model.intercept_)
print("Slope (price increase per square foot):", model.coef_[0])
# Evaluate the model using the test data
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("\nEvaluating the model on test data:")
print("RMSE:", rmse)
# Visualize the training data and the model's fitted line
plt.scatter(X_train, y_train, color='blue', label='Training data')
line_sizes = [X_train["Size"].min(), X_train["Size"].max()]
line_prices = model.predict(pd.DataFrame({"Size": line_sizes}))
plt.plot(line_sizes, line_prices, color='red', linewidth=2, label='Model line')
plt.xlabel("Size (sq ft)")
plt.ylabel("Price ($)")
plt.title("House Size vs. Price")
plt.legend()
plt.show()
Run the Code
You’ll see all the previous text outputs. Then, a separate window should appear with the scatterplot of your training data and the red line representing the model’s best fit. Close the window when you’re done examining it.
Interpreting the Visualization
If the red line goes roughly through the middle of the blue points and matches their upward or downward trend, the model is doing a decent job.
If the line doesn’t seem to match the data points at all, it might mean linear regression isn’t capturing the relationship well, or you might need more or better features.
This visual check complements our numerical metrics by giving us a more intuitive feel for the model’s performance.
Recap of Step 8
We introduced visualization to intuitively understand how well the model fits the data.
We used matplotlib to create a scatterplot of training points and draw the model’s regression line.
By inspecting the plot, we get a more nuanced understanding of the model’s strengths and weaknesses, beyond what RMSE alone can tell us.
With this step, we have a complete picture: we loaded and prepared data, trained and evaluated a model, and now visually confirmed how the model’s predictions align with the training data.
Step 9: Saving the Model
Up until now, we’ve successfully built, trained, evaluated, and even visualized our model. Each time we run the code, the model is trained from scratch. For our small dataset, this isn’t a big deal—it trains almost instantly. But for larger datasets or more complex models, training can take a lot of time, sometimes hours or even days.
Why Save the Model?
Efficiency: If training is time-consuming, you don’t want to repeat it every time you need a prediction.
Convenience: After you train the model once, you can load it instantly and use it for predictions without re-running all the data loading and training steps.
Deployment: If you want to integrate your model into another application or share it with others, providing a ready-to-use model file is easier than sharing code and data, expecting them to retrain it.
By saving the model, you essentially “freeze” what the model learned and store it in a file. Then, you can load it later for making predictions or further analysis.
Using joblib to Save and Load Models
Python’s joblib library is a convenient tool for serializing and deserializing Python objects, including trained machine learning models. Serialization means converting the model’s internal state into a format that can be written to a file. Deserialization means reading that file and reconstructing the model object in memory, ready to use.
Key Functions:
joblib.dump(model, "filename.pkl"): Saves the model to a file named filename.pkl.
joblib.load("filename.pkl"): Loads the model from a file back into memory, ready for predictions.
The .pkl (pickle) extension is a common choice for Python object files. It’s not required by joblib, but it’s a helpful convention.
Code Update for Step 9
After Step 8, our model_building.py has code to load data, split it, train the model, evaluate its performance, and even visualize the results. Now we’ll add the saving step at the end, after everything else is done.
Here’s what model_building.py looked like after Step 8:
# model_building.py after Step 8
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
import matplotlib.pyplot as plt
# Load the dataset
data = pd.read_csv("house_prices.csv")
# Inspect the data
print("First 5 rows of the dataset:")
print(data.head())
print("\nChecking for missing values:")
print(data.isnull().sum())
print("\nStatistical summary of the data:")
print(data.describe())
# Define features (X) and target (y)
X = data[["Size"]]
y = data["Price"]
# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("\nTraining and testing data shapes:")
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
# Initialize and train the Linear Regression model
model = LinearRegression()
print("\nTraining the LinearRegression model...")
model.fit(X_train, y_train)
print("Model trained successfully!")
# Print out the learned parameters
print("Intercept (the base price when Size=0):", model.intercept_)
print("Slope (price increase per square foot):", model.coef_[0])
# Evaluate the model using the test data
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("\nEvaluating the model on test data:")
print("RMSE:", rmse)
# Visualize the training data and the model's fitted line
plt.scatter(X_train, y_train, color='blue', label='Training data')
line_sizes = [X_train["Size"].min(), X_train["Size"].max()]
line_prices = model.predict(pd.DataFrame({"Size": line_sizes}))
plt.plot(line_sizes, line_prices, color='red', linewidth=2, label='Model line')
plt.xlabel("Size (sq ft)")
plt.ylabel("Price ($)")
plt.title("House Size vs. Price")
plt.legend()
plt.show()
What to Add for Step 9:
Import joblib at the top along with the other imports.
After everything is done (training, evaluation, visualization), add the code to save the model.
import joblib
# Save the trained model to a file
joblib.dump(model, "house_price_model.pkl")
print("Model saved to house_price_model.pkl")
Place these lines at the very end of the file, after showing the plot. The final model_building.py after Step 9 will look like this:
# model_building.py after Step 9
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
import matplotlib.pyplot as plt
import joblib # Added in Step 9
# Load the dataset
data = pd.read_csv("house_prices.csv")
# Inspect the data
print("First 5 rows of the dataset:")
print(data.head())
print("\nChecking for missing values:")
print(data.isnull().sum())
print("\nStatistical summary of the data:")
print(data.describe())
# Define features (X) and target (y)
X = data[["Size"]]
y = data["Price"]
# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("\nTraining and testing data shapes:")
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
# Initialize and train the Linear Regression model
model = LinearRegression()
print("\nTraining the LinearRegression model...")
model.fit(X_train, y_train)
print("Model trained successfully!")
# Print out the learned parameters
print("Intercept (the base price when Size=0):", model.intercept_)
print("Slope (price increase per square foot):", model.coef_[0])
# Evaluate the model using the test data
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("\nEvaluating the model on test data:")
print("RMSE:", rmse)
# Visualize the training data and the model's fitted line
plt.scatter(X_train, y_train, color='blue', label='Training data')
line_sizes = [X_train["Size"].min(), X_train["Size"].max()]
line_prices = model.predict(pd.DataFrame({"Size": line_sizes}))
plt.plot(line_sizes, line_prices, color='red', linewidth=2, label='Model line')
plt.xlabel("Size (sq ft)")
plt.ylabel("Price ($)")
plt.title("House Size vs. Price")
plt.legend()
plt.show()
# Save the trained model to a file
joblib.dump(model, "house_price_model.pkl")
print("Model saved to house_price_model.pkl")
Where to Add It:
Place the import joblib line with the other imports at the top, and the joblib.dump() line at the very end of the script, after all other steps have been completed.
Run the Code
This will run through the entire process—loading data, training the model, evaluating it, visualizing the results, and finally saving the trained model to house_price_model.pkl.
Confirming the Model File
After the code finishes running, check your project folder. You should see a new file named house_price_model.pkl. This file contains all the information the model learned during training. You can now load this file later, even in a different program or script, without having to retrain the model.
Example of Loading the Model Later:
import joblib
import pandas as pd

loaded_model = joblib.load("house_price_model.pkl")
# Now loaded_model can be used to predict prices directly:
predicted_price = loaded_model.predict(pd.DataFrame({"Size": [1600]}))
print("Predicted price for 1600 sq ft:", predicted_price[0])
Recap of Step 9
We introduced the concept of saving a trained model to avoid retraining in the future.
We used joblib.dump() to store the model’s parameters into a .pkl file.
We explained how this approach makes sharing and deploying models easier.
This completes the training and preparation process: we now have a fully trained, evaluated, visualized, and saved model.
With the model saved, we are well-prepared to move on to steps like building a GUI, deploying the model as a service, or further refining and experimenting with other models or features.
Step 10: Loading the Model and Making Predictions
In the previous step, we saved our trained model to a file (house_price_model.pkl). Now it’s time to see how we can load that model in a separate script and use it to make predictions without retraining. This demonstrates the real power of saving models: you can quickly and easily deploy, share, and integrate them into other applications.
Why Load the Model Instead of Retraining?
Efficiency: If training takes a long time, you don’t want to repeat that process every time you need a single prediction.
Convenience: You can keep a model file (.pkl) ready to go and load it whenever needed—no data loading, splitting, or model fitting required.
Deployment: In a production environment (like a web service that predicts house prices), you’d typically load a pre-trained model at startup and use it to handle user requests instantly.
Creating a Separate File for Testing
To showcase this, we’ll create a new file—let’s call it 04_test_loaded_model.py. This file’s sole purpose is to:
Load the saved model from the .pkl file.
Provide a new input (house size) and get a predicted price immediately.
What Does This Demonstrate?
You can run this script on another computer or at another time without having to rerun the entire training process.
It proves that the model’s learned parameters (slope and intercept) are preserved and can be reused anytime.
Code Explanation
import joblib: We need joblib because that’s what we used to save the model. joblib.load() will deserialize the model from the file.
import pandas as pd: We use pandas to create a small DataFrame for the new input. Even though it’s just one house size, the model expects data in the same format as training.
loaded_model = joblib.load("house_price_model.pkl"): This line reads the model file and returns the model object as it was after training.
Creating Input Data: We create a DataFrame with a single row, representing a house of size 1600 sq ft:
input_data = pd.DataFrame({"Size": [1600]})
This mimics the shape of the training data. The model expects a DataFrame with a column “Size.”
Predicting the Price:
predicted_price = loaded_model.predict(input_data)[0]
loaded_model.predict() returns an array of predictions, but since we have just one house, we take the first (and only) prediction using [0].
Printing the Result: We print the predicted price so we can see the model’s output. This confirms that the model is functional, no training required.
The Code (04_test_loaded_model.py)
import joblib
import pandas as pd
# Load the saved model from the file
loaded_model = joblib.load("house_price_model.pkl")
print("Model loaded successfully from house_price_model.pkl")
# Create input data for a house of size 1600 sq ft
input_data = pd.DataFrame({"Size": [1600]})
# Use the loaded model to make a prediction
predicted_price = loaded_model.predict(input_data)[0]
print(f"Predicted price for a 1600 sq ft house: ${predicted_price:,.2f}")
What’s New Here?
We added a confirmation print (“Model loaded successfully...”) to reassure us that the load worked.
We formatted the predicted price using :,.2f to make it look nicer (adding commas and two decimal places). This is optional but makes the output more readable.
Run the Code
You will see something like:
Model loaded successfully from house_price_model.pkl
Predicted price for a 1600 sq ft house: $310,000.00
The exact number depends on how your model was trained, but this demonstrates the concept.
Why Is This Useful?
Speed: You get an instant prediction without going through training steps.
Portability: You can share house_price_model.pkl with others, and as long as they have Python, joblib, scikit-learn, and pandas installed, they can load and use the model.
Integration: In a real-world scenario, you could integrate this loaded model into a web application, command-line tool, or desktop app. The user inputs a house size, and you immediately provide a predicted price.
Recap of Step 10
We loaded a previously saved model using joblib.load().
We created a new script dedicated to testing the loaded model.
We confirmed that we can make predictions instantly, without retraining.
This step shows how models can be reused in different contexts, which is a key benefit of machine learning in production environments.
With this, you’ve completed a full machine learning workflow: from loading and understanding data to training, evaluating, visualizing, saving the model, and finally loading it again to make predictions. You are now prepared to take the model and integrate it into more user-friendly tools, such as a graphical user interface or a web service.
Step 11: Creating a GUI with tkinter
Until now, we’ve built a machine learning model, evaluated it, visualized it, and saved it. We even loaded the model in a separate script to make predictions without retraining. That’s great for developers and data scientists, but what if we want to share this predictive tool with people who don’t know how to run Python scripts or write code?
Enter the GUI (Graphical User Interface):
A GUI allows users to interact with your machine learning model through simple buttons, text boxes, and labels—no coding knowledge required. They can just enter a house size, click a button, and see the predicted price.
Why a GUI?
User-Friendliness: GUIs provide an intuitive way for users to input data and receive predictions, without ever seeing the code.
Wider Accessibility: By packaging your model into a GUI, you can share it with colleagues, clients, or friends who are not tech-savvy.
Professional Touch: A GUI makes your project feel more like a real application rather than a coding exercise.
What is tkinter?
tkinter is a Python library that comes pre-installed with most Python distributions. It allows you to create windows, labels, buttons, text entry boxes, and handle events like button clicks. You don’t need to install anything extra.
Key Concepts:
root = tk.Tk(): Creates the main application window.
Widgets: Elements like labels, entries (text boxes), and buttons that appear in the window.
pack(), grid(), place(): Methods to arrange widgets in the window. We’ll use pack() for simplicity.
Events and Callbacks: When the user clicks a button, it triggers a function (a callback) that performs an action. In our case, that action will be to predict the price.
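To see these concepts in isolation before building our real GUI, here is a tiny standalone sketch (a hypothetical demo, not one of our project files) with a window, an entry box, a button, and a callback:
# Minimal tkinter demo: one entry, one button, one label.
import tkinter as tk

def on_click():
    # Callback: runs when the button is clicked.
    greeting_label.config(text="Hello, " + name_entry.get() + "!")

root = tk.Tk()  # the main application window
root.title("tkinter Demo")

name_entry = tk.Entry(root)  # a text box for user input
name_entry.pack(pady=5)

greet_button = tk.Button(root, text="Greet", command=on_click)  # clicking triggers on_click()
greet_button.pack(pady=5)

greeting_label = tk.Label(root, text="--")  # label updated by the callback
greeting_label.pack(pady=5)

root.mainloop()  # keep the window open and responsive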
Our Plan for the GUI
A Main Window: A simple window with a title.
User Input Field: A text box (entry widget) where the user types the house size in square feet.
Predict Button: A button labeled “Predict Price.” When clicked, it uses our loaded model to predict the price for the entered size.
Result Label: A label that displays the predicted price once calculated.
Code Overview
We’ll create a new file named 05_gui.py. This script will:
Load the trained model from house_price_model.pkl (created and saved in previous steps).
Create the GUI using tkinter.
When the user inputs a size and clicks “Predict Price,” the code will:
Validate the input.
Use the model to predict the price.
Update a label to show the predicted value.
No retraining is needed, because we’re loading a model that was already trained and saved. This is exactly why we saved the model: to reuse it anywhere, anytime.
Code Explanation (05_gui.py)
import tkinter as tk
from tkinter import messagebox
import joblib
import pandas as pd
# Load the model we saved previously
model = joblib.load("house_price_model.pkl")
print("Model loaded successfully for GUI usage!")
import tkinter as tk: Imports the tkinter library. We’ll call it tk for convenience.
from tkinter import messagebox: Imports messagebox, a tkinter module for showing pop-up messages (like errors).
import joblib: We use joblib to load our previously saved model.
import pandas as pd: We need pandas to create a DataFrame for the input before predicting.
model = joblib.load("house_price_model.pkl"): Loads the trained linear regression model. No training required—this is instant and ready to go.
def predict_price():
    # This function is called when the user clicks the "Predict" button.
    try:
        size_str = entry_size.get()  # Get the text the user typed in the entry box.
        size = float(size_str)  # Convert it to a float (number).
        # If the user typed something that is not a number (e.g., 'abc'), float() will raise a ValueError.
        # Prepare data as a DataFrame, similar to how we prepared training data.
        input_data = pd.DataFrame({"Size": [size]})
        # Predict using the loaded model.
        pred = model.predict(input_data)[0]
        # Update the result label with the predicted price formatted nicely.
        result_label.config(text=f"Predicted Price: ${pred:,.2f}")
    except ValueError:
        # If conversion to float failed, the user didn’t type a valid number.
        messagebox.showerror("Input Error", "Please enter a valid numeric size.")
How predict_price() works:
entry_size.get(): Retrieves whatever the user typed in the text box.
float(size_str): Tries to convert the input to a floating-point number.
input_data: We create a one-row DataFrame with a column “Size” because the model expects this format.
model.predict(...): Produces a predicted price.
result_label.config(...): Updates the GUI label to show the predicted price. If the user input is invalid, we show a message box with an error.
# Create the main window
root = tk.Tk()
root.title("House Price Predictor")
# Create a title label
title_label = tk.Label(root, text="House Price Predictor", font=("Arial", 16, "bold"))
title_label.pack(pady=10)
# Create a label and entry for house size input
size_label = tk.Label(root, text="Enter House Size (sq ft):", font=("Arial", 12))
size_label.pack(pady=5)
entry_size = tk.Entry(root, font=("Arial", 12))
entry_size.pack(pady=5)
# Create a button that calls predict_price() when clicked
predict_button = tk.Button(root, text="Predict Price", font=("Arial", 12), command=predict_price)
predict_button.pack(pady=10)
# Create a label to show the prediction result
result_label = tk.Label(root, text="Predicted Price: --", font=("Arial", 12, "bold"))
result_label.pack(pady=10)
# Run the GUI event loop
root.mainloop()
Step-by-step GUI construction:
root = tk.Tk(): Creates the main application window.
root.title("House Price Predictor"): Sets the window’s title.
title_label, size_label, entry_size, predict_button, result_label: These lines create and arrange the widgets. .pack() places them in the window with some spacing (pady adds vertical space).
predict_button: The command=predict_price means that when the button is clicked, predict_price() is called.
root.mainloop(): Starts the event loop, which keeps the window open and responsive.
Running the GUI
Ensure 05_gui.py and house_price_model.pkl are in the same directory.
Ensure the 05_gui.py window is active in VS Code, then run the script.
A window will appear with “House Price Predictor” as the title. Type a size (e.g., 1600) in the text box and click “Predict Price.”
If everything’s correct, the label will update to show the predicted price. If you type something non-numeric (like ‘abc’), it will show an error message.
Why Is This Step Important?
Ease of Use: Now even non-technical people can use your model by interacting with a simple window, no code required.
Showcasing Your Work: If you want to demonstrate your model’s capabilities to others, a GUI makes it straightforward and professional.
Integration with Other Tools: Eventually, you could integrate this GUI into larger applications, add more features, or even deploy it on a system where multiple users can make predictions easily.
Recap of Step 11
We learned what tkinter is and how to use it for GUI development in Python.
We created a simple GUI that loads our saved model, takes user input, and displays predictions.
We handled errors (non-numeric input) gracefully with a message box.
We’ve come full circle: from raw data to a fully interactive application that delivers ML predictions on demand.
This final step shows how machine learning models can be integrated into user-friendly applications, making your project accessible to everyone, not just coders or data scientists.
Optional: Practice Exercises
Congratulations! You’ve built a complete machine learning project—from loading and understanding the data to training, evaluating, visualizing, saving, and deploying your model in a GUI. If you want to strengthen your skills and gain more confidence, consider trying these optional exercises. They will help you apply what you’ve learned and explore new possibilities.
1. Change the Dataset
What to Do:
Add more rows to your house_prices.csv file. For example, add houses smaller than 1000 sq ft or larger than 2000 sq ft. You can also add rows with slightly different prices that break the pattern (see the example rows after the hint below).
Retrain your model by running model_building.py again and see if the predictions or RMSE change.
Why This Helps:
Adding more data, especially data that covers a wider range, can help the model learn a better relationship between Size and Price.
You’ll see how the model’s parameters (intercept and slope) adapt to new data, and whether the RMSE improves or gets worse.
Hint:
Try adding at least 5-10 new rows. Think about prices that might not follow a perfect linear pattern and observe how it affects the model’s fit line and RMSE.
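For example, you might append rows like these to house_prices.csv (the numbers are made up; keep the same Size,Price header your file already has):
Size,Price
800,195000
950,240000
2400,415000
2800,465000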
2. Add Input Validation in the GUI
What to Do:
In 05_gui.py, if the user currently enters a non-numeric value (like “abc”), we show an error message. But what if the user enters a negative number for the house size?
Modify the predict_price() function to check if size <= 0. If so, display a message box telling the user to enter a positive house size.
Why This Helps:
Real-world users might provide unexpected inputs. Adding input validation makes your application more robust and user-friendly.
You’ll gain practice in adding logical conditions and error handling in your GUI code.
Hint:
Inside the predict_price() function, after converting size_str to size, write something like:
if size <= 0:
    messagebox.showerror("Input Error", "Please enter a positive house size.")
    return
This ensures you exit the function early if the input is invalid.
3. Improve the Model with Additional Features
What to Do:
Add another column to your CSV file, for example NumberOfBedrooms.
Update your code to use two features instead of one:
X = data[["Size", "NumberOfBedrooms"]]
Retrain the model and see if RMSE improves. Perhaps a bigger house with more bedrooms sells for more, and including this feature might lead to better predictions.
Why This Helps:
Most real-world datasets have multiple features. By adding another feature, you learn how to handle more complex input data.
You’ll see how the model’s complexity and accuracy can change when given more information.
Hint:
Ensure that every row in house_prices.csv now has three columns: Size, NumberOfBedrooms, and Price.
Keep your code changes minimal: just update how you define X, and let everything else remain the same. Observe if RMSE goes down after training (see the sketch below for a starting point).
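If you want a starting point, here is a minimal sketch of the changed lines, assuming your house_prices.csv now contains a NumberOfBedrooms column:
# Sketch: training with two features instead of one.
# Assumes house_prices.csv now has Size, NumberOfBedrooms, and Price columns.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

data = pd.read_csv("house_prices.csv")
X = data[["Size", "NumberOfBedrooms"]]  # two features now
y = data["Price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
print("Coefficients (one per feature):", model.coef_)  # now two values: one for Size, one for bedrooms
print("Intercept:", model.intercept_)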
4. Experiment with Different Metrics
What to Do:
Research another evaluation metric called R² (R-squared). This metric measures how much of the variance in the target variable is explained by the model.
Calculate and print R² in addition to RMSE.
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print("R²:", r2)
Why This Helps:
Understanding different metrics gives you a more complete picture of model performance.
While RMSE shows you the error in terms of the original units (dollars), R² shows you how well the model fits the data in relative terms (1.0 means perfect fit, 0 means no better than a basic guess).
Hint:
Add the r2_score import line and print the R² value just after printing RMSE. Compare different runs and see how R² correlates with RMSE.
5. Try Other Models
What to Do:
Explore other regression algorithms in scikit-learn, such as DecisionTreeRegressor or RandomForestRegressor.
Replace LinearRegression with one of these models in your code:
from sklearn.tree import DecisionTreeRegressor
# model = DecisionTreeRegressor()
# or
from sklearn.ensemble import RandomForestRegressor
# model = RandomForestRegressor()
Retrain and see if RMSE improves.
Why This Helps:
Different models have different ways of capturing patterns. Some might handle more complex relationships better than a straight line.
Trying multiple models is common practice in machine learning to find the best approach for your data.
Hint:
Start with DecisionTreeRegressor() since it’s simpler. Notice if RMSE changes significantly. Then try RandomForestRegressor().
Remember to import the necessary classes and adjust your code as needed, but the rest of the process (train/test split, fitting, predicting, evaluating) remains the same.
Summary of the Practice Exercises
These exercises are designed to help you go beyond the basic example. By modifying the dataset, improving input validation, adding features, experimenting with metrics, and trying new models, you will:
Deepen your understanding of the machine learning workflow.
Gain confidence adapting code to new situations.
Get comfortable troubleshooting and iterating on your approach.
Move closer to a real-world scenario where data and requirements constantly change.
Feel free to pick one exercise or try them all. Each challenge you tackle will make you a more resourceful and skilled machine learning practitioner.
Extras: Helpful Tips and Resources
As you continue your machine learning journey, you’ll inevitably encounter challenges, new questions, and more advanced concepts. Below are some helpful tips, resources, and tools to guide you as you grow more confident and independent.
Asking AI for Help
Modern AI assistants, such as ChatGPT, can be valuable tools for troubleshooting and learning. You can use these tools to:
Debug Errors: “Why am I getting a ValueError when converting a string to float in Python?” If you run into an error message you don’t understand, an AI assistant can explain what it means and suggest solutions.
Understand Concepts: “How do I interpret linear regression coefficients?” If you’re unsure about the math or meaning behind certain model parameters, you can ask for a plain-language explanation.
Tip:
Be as specific as possible when asking questions. Include the exact error message or describe the scenario you’re dealing with. This helps AI tools provide more accurate and useful answers.
Cheat Sheet: Common Python and ML Commands
Below is a quick reference for common tasks. Keep this handy as you experiment with data, models, and workflows:
Working with Data:
import pandas as pd: Import pandas for data manipulation.
data = pd.read_csv("file.csv"): Load CSV data into a DataFrame.
data.head(): Show the first 5 rows of the DataFrame.
data.isnull().sum(): Check for missing values in each column.
Splitting and Modeling:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2): Split into train/test sets.
from sklearn.linear_model import LinearRegression
model = LinearRegression(): Create a linear regression model.
model.fit(X_train, y_train): Train the model on training data.
y_pred = model.predict(X_test): Predict on the test set.
Evaluating and Saving Models:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred): Calculate MSE.
rmse = np.sqrt(mse): Calculate RMSE for error in original units.
import joblib
joblib.dump(model, "model.pkl"): Save the trained model.
loaded_model = joblib.load("model.pkl"): Load the saved model.
Tip:
If you forget a function’s usage, try help(function_name) in a Python shell, or search the documentation.
Recommended Resources
1. pandas Documentation:
https://pandas.pydata.org/docs/
Dive deep into data manipulation and analysis. Learn about advanced indexing, merging datasets, and efficient data cleaning techniques.
2. scikit-learn Documentation:
https://scikit-learn.org/stable/
Explore a wide range of machine learning algorithms, best practices, and tutorials. Learn how to tune hyperparameters, handle imbalanced data, and choose the right model for your problem.
3. Python Official Docs:
https://docs.python.org/3/
If you’re ever confused about Python syntax, standard libraries, or data types, the official Python documentation is the ultimate reference.
4. Visual Studio Code Docs:
https://code.visualstudio.com/docs
VS Code offers extensions, debugging tools, and shortcuts that streamline your coding workflow. Explore these docs to increase your productivity.
Keep Exploring!
You’ve completed a full ML project, but this is just the beginning. The machine learning field is vast and evolving:
Experiment with different algorithms and datasets.
Learn about new libraries and frameworks (like TensorFlow or PyTorch for deep learning).
Take on projects like image classification, sentiment analysis, or clustering.
Each new challenge you tackle will build your skills and confidence. Use the resources above as stepping stones to navigate the ever-growing landscape of machine learning.
Conclusion and Next Steps
Congratulations! You’ve successfully completed a full machine learning project from start to finish. Let’s recap what you’ve accomplished:
Machine Learning Concept: You started by understanding what machine learning is—the idea that a computer can learn patterns from data rather than following explicit instructions. You learned how this concept applies to predicting house prices based on size.
Data Handling: Using pandas, you loaded and inspected a dataset stored in a CSV file. You learned how to check for missing values, look at summary statistics, and ensure your data is ready for modeling.
Model Training: By applying linear regression from scikit-learn, you found a relationship between house size (input) and house price (output). You trained the model to fit a line that best represents that relationship.
Evaluation: You measured how well your model predicts unseen data using the RMSE metric. This gave you a numerical sense of how accurate (or inaccurate) your predictions are.
Model Persistence: You used joblib to save your trained model, allowing you to load it later without retraining. This step emphasized efficiency and reusability, crucial for real-world applications.
GUI Creation: Finally, you built a graphical user interface (GUI) with tkinter, making your model accessible to non-technical users. With a simple window, text input, and a button, anyone can now predict house prices without writing code.
This is a remarkable achievement! You’ve gone from having no knowledge to building a fully functional ML application that is both interactive and reusable.
Where to Go From Here
Machine learning is a vast field, and you’ve just scratched the surface. As you continue exploring, consider these next steps:
Add More Features: Enhance your model by including more information, such as the number of bedrooms or location data. More features can lead to more accurate predictions.
Try Different ML Algorithms: Experiment with other models like Decision Trees, Random Forests, or Gradient Boosted Trees. This will show you how different approaches might improve your results.
Deploy Your Model Online: For a more advanced challenge, learn how to deploy your model as a web application using frameworks like Flask or FastAPI. This way, anyone with a web browser can access your predictor.
Explore Deep Learning: Libraries like TensorFlow or PyTorch open the door to neural networks and more complex tasks, such as image recognition or natural language processing.
Practice on Real Datasets: Platforms like Kaggle provide a rich variety of datasets and competitions. Practicing on real-world data will help you refine your skills and learn best practices from a community of ML enthusiasts.
Celebrate Your Achievement
You started with no prior experience and now have a working machine learning application. This is no small feat. By understanding each step—from loading data and training a model to evaluating performance and building a GUI—you’ve gained a foundation that will serve you well as you tackle more complex projects.
Keep learning, keep experimenting, and don’t be afraid to try new ideas. The more you practice, the more confident and skilled you’ll become.
Great job and happy coding!