In this tutorial, you will learn how to create a machine learning linear regression model using Python. You will analyze a house price prediction dataset to estimate the price of a house from several parameters. Along the way you will do exploratory data analysis, split the data into training and testing sets, evaluate the model and make predictions.
What Is a Linear Regression Model in Machine Learning?
Linear regression is a supervised machine learning model that finds the relationship between one or more independent variables and a dependent variable. It predicts the value of the response (dependent) variable (y) from the given explanatory (independent) variables (x) by fitting a linear relationship between x (input) and y (output).
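As a minimal sketch of this idea, the snippet below fits a straight line y = b0 + b1 * x to a handful of made-up points by ordinary least squares with NumPy. The data values and variable names are invented for illustration and are not part of the housing dataset.

```python
import numpy as np

# Toy data (assumed for illustration): y is exactly 2*x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Design matrix with a column of ones so the model can learn an intercept
A = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: find beta that minimizes ||A @ beta - y||^2
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = beta

print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

Because the toy points lie exactly on a line, the fit should recover an intercept close to 1 and a slope close to 2. With real data the points scatter around the line and the fit gives the best compromise.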
About the House Price Prediction Dataset
Problem Statement – A real estate agent wants help predicting house prices for regions in the USA. He gave you the dataset to work on, and you decided to use a linear regression model. Create a model that will help him estimate what a house would sell for.
The dataset is a CSV file with 7 columns and 5000 rows. It contains the following columns:
- ‘Avg. Area Income’ – Avg. income of the residents of the city the house is located in.
- ‘Avg. Area House Age’ – Avg. Age of Houses in the same city.
- ‘Avg. Area Number of Rooms’ – Avg. Number of Rooms for Houses in the same city.
- ‘Avg. Area Number of Bedrooms’ – Avg. Number of Bedrooms for Houses in the same city.
- ‘Area Population’ – Population of the city.
- ‘Price’ – Price that the house sold at.
- ‘Address’ – Address of the houses.
You can download the dataset from here – USA_Housing.csv.
An Example: Predicting house prices with linear regression using SciKit-Learn, Pandas, Seaborn and NumPy
Install the required libraries and set up the environment for the project. We will be importing SciKit-Learn, Pandas, Seaborn, Matplotlib and NumPy.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
The purpose of “%matplotlib inline” is to display plots directly inside your Jupyter notebook.
Importing the Data and Checking It Out
Since the data is in a CSV file, we will read it with the pandas read_csv() function and check the first 5 rows of the data frame using head().
HouseDF = pd.read_csv('USA_Housing.csv')
HouseDF.head()
HouseDF.info() summarizes the columns, their data types and the number of non-null values:
OUTPUT
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
Avg. Area Income                5000 non-null float64
Avg. Area House Age             5000 non-null float64
Avg. Area Number of Rooms       5000 non-null float64
Avg. Area Number of Bedrooms    5000 non-null float64
Area Population                 5000 non-null float64
Price                           5000 non-null float64
Address                         5000 non-null object
dtypes: float64(6), object(1)
memory usage: 273.6+ KB
HouseDF.columns lists the column names:
OUTPUT
Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
       'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
      dtype='object')
Exploratory Data Analysis for House Price Prediction
We will create some simple plots to visualize the data.
Get Data Ready For Training a Linear Regression Model
Let’s now begin to train the regression model. We first need to split our data into X, which contains the features to train on, and y, which contains the target variable, in this case the Price column. We will ignore the Address column because it only contains text, which is not useful for linear regression modeling.
X and y List
X = HouseDF[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
             'Avg. Area Number of Bedrooms', 'Area Population']]
y = HouseDF['Price']
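An equivalent way to build the same X and y is to drop the columns you do not want instead of listing the ones you keep. The sketch below shows this on a two-row data frame with made-up values (demo_df, X_demo and y_demo are hypothetical names used only for illustration).

```python
import pandas as pd

# Two made-up rows using the dataset's column names (values are illustrative only)
demo_df = pd.DataFrame({
    "Avg. Area Income": [60000.0, 75000.0],
    "Avg. Area House Age": [5.5, 6.2],
    "Avg. Area Number of Rooms": [6.8, 7.4],
    "Avg. Area Number of Bedrooms": [3.0, 4.0],
    "Area Population": [30000.0, 42000.0],
    "Price": [1000000.0, 1400000.0],
    "Address": ["1 Example St", "2 Sample Ave"],
})

# Features: everything except the target and the free-text address
X_demo = demo_df.drop(columns=["Price", "Address"])
# Target: the column we want to predict
y_demo = demo_df["Price"]

print(X_demo.shape, y_demo.shape)  # (2, 5) (2,)
```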
Split Data into Train, Test
Now we will split our dataset into a training set and a testing set using sklearn’s train_test_split(). The training set is used to train the model and the testing set to test it. With test_size=0.4, we hold out 40% of the data for testing and keep the remaining 60% for training.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
X_train and y_train contain the data used to train the model. X_test and y_test contain the data used to test it. X and y are the feature and target variables defined above.
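To see what the split produces, here is a small sketch that applies train_test_split() with the same test_size and random_state to placeholder arrays of the same shape as the housing features (5000 rows, 5 columns). The random arrays and the names X_demo, y_demo, X_tr and so on are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): 5000 rows and 5 features, like the housing set
rng = np.random.default_rng(101)
X_demo = rng.normal(size=(5000, 5))
y_demo = rng.normal(size=5000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=101
)

print(X_tr.shape, X_te.shape)  # (3000, 5) (2000, 5)
print(y_tr.shape, y_te.shape)  # (3000,) (2000,)
```

Fixing random_state makes the shuffle reproducible, so rerunning the split returns the same rows in each set.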
Creating and Training the LinearRegression Model
We will import LinearRegression from sklearn.linear_model, create a LinearRegression object and fit it to the training data.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
OUTPUT LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
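To get a feel for what fit() learns, the sketch below trains a LinearRegression on synthetic data whose true coefficients are known, then reads them back from coef_ and intercept_. The data-generating formula is an assumption chosen for illustration and has nothing to do with the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): y = 3*x1 - 2*x2 + 5 plus a little noise
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 2))
noise = rng.normal(scale=0.1, size=500)
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 5.0 + noise

model = LinearRegression()
model.fit(X_demo, y_demo)

print("coefficients:", model.coef_)      # close to [3, -2]
print("intercept:", model.intercept_)    # close to 5
print("R^2 on the training data:", model.score(X_demo, y_demo))
```

The learned values are close to, but not exactly, the true ones because of the added noise.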
LinearRegression Model Evaluation
Now let’s evaluate the model by checking out its coefficients and how we can interpret them.
coeff_df = pd.DataFrame(lm.coef_, X.columns, columns=['Coefficient'])
coeff_df
What the coefficients tell us:
- Holding all other features fixed, a 1 unit increase in Avg. Area Income is associated with an increase of $21.52 in Price.
- Holding all other features fixed, a 1 unit increase in Avg. Area House Age is associated with an increase of $164883.28 in Price.
- Holding all other features fixed, a 1 unit increase in Avg. Area Number of Rooms is associated with an increase of $122368.67 in Price.
- Holding all other features fixed, a 1 unit increase in Avg. Area Number of Bedrooms is associated with an increase of $2233.80 in Price.
- Holding all other features fixed, a 1 unit increase in Area Population is associated with an increase of $15.15 in Price.
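The "holding all other features fixed" reading can be checked directly: raise one feature by one unit, leave the rest alone, and the prediction changes by exactly that feature's coefficient. The sketch below demonstrates this on synthetic data (the arrays and the true coefficients are assumptions made for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): three features with known true coefficients
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
noise = rng.normal(scale=0.5, size=200)
y_demo = X_demo @ np.array([10.0, -4.0, 0.5]) + 7.0 + noise

model = LinearRegression().fit(X_demo, y_demo)

base = X_demo[:1].copy()   # one row, shape (1, 3)
bumped = base.copy()
bumped[0, 0] += 1.0        # raise only the first feature by one unit

# Change in the prediction caused by the one-unit increase
delta = model.predict(bumped)[0] - model.predict(base)[0]
print(np.isclose(delta, model.coef_[0]))  # True
```

This holds for any starting row, which is what makes the coefficients of a linear model easy to interpret.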
Predictions from our Linear Regression Model
Let’s compute the predictions for our test set and see how well the model performs.
predictions = lm.predict(X_test)
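In a scatter plot of the actual test prices against the predictions, the points fall roughly along a straight line, which means our model has made good predictions.
In a histogram of the residuals (the differences between the actual and predicted prices), the data is bell shaped (approximately normally distributed), which also indicates that the model fits well.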
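The plotting code for these two diagnostics is not shown here, so the following is only a sketch of how such plots could be produced with Matplotlib. The arrays actual and predicted are synthetic stand-ins (assumptions) for y_test and predictions. With the real model you would use those variables instead.

```python
import matplotlib.pyplot as plt
import numpy as np

# Stand-ins for y_test and predictions (assumption): predictions = truth + noise
rng = np.random.default_rng(7)
actual = rng.normal(1_200_000, 350_000, size=300)
predicted = actual + rng.normal(0, 100_000, size=300)
residuals = actual - predicted

fig, (ax_scatter, ax_hist) = plt.subplots(1, 2, figsize=(12, 4))

# Actual vs. predicted: points near a straight line mean accurate predictions
ax_scatter.scatter(actual, predicted, s=10)
ax_scatter.set_xlabel("Actual price")
ax_scatter.set_ylabel("Predicted price")

# Residual histogram: a bell shape centered near zero is the desired pattern
ax_hist.hist(residuals, bins=50)
ax_hist.set_xlabel("Residual (actual - predicted)")

fig.tight_layout()
```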
Regression Evaluation Metrics
Here are three common evaluation metrics for regression problems:
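Mean Absolute Error (MAE) is the mean of the absolute value of the errors:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$
Mean Squared Error (MSE) is the mean of the squared errors:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$
Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$
Here $y_i$ is the actual value, $\hat{y}_i$ is the predicted value and $n$ is the number of samples.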
Comparing these metrics:
- MAE is the easiest to understand because it’s the average error.
- MSE is more popular than MAE because MSE “punishes” larger errors, which tends to be useful in the real world.
- RMSE is even more popular than MSE because RMSE is interpretable in the “y” units.
All of these are loss functions because we want to minimize them.
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
OUTPUT
MAE: 82288.22251914957
MSE: 10460958907.209501
RMSE: 102278.82922291153
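To make the definitions concrete, the three metrics can also be computed by hand with NumPy. The sketch below uses four made-up actual and predicted values (not taken from the housing data) that are small enough to check on paper.

```python
import numpy as np

# Made-up values for illustration
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 360.0])

errors = y_true - y_pred         # [-10, 10, -30, 40]

mae = np.mean(np.abs(errors))    # (10 + 10 + 30 + 40) / 4
mse = np.mean(errors ** 2)       # (100 + 100 + 900 + 1600) / 4
rmse = np.sqrt(mse)              # back in the same units as y

print("MAE:", mae)               # MAE: 22.5
print("MSE:", mse)               # MSE: 675.0
print("RMSE:", round(rmse, 2))   # RMSE: 25.98
```

The single large error of 40 contributes more than half of the MSE, which is the sense in which squared-error metrics "punish" larger errors.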
We have created a linear regression model that will help the real estate agent estimate house prices.
You can find this project on GitHub.