In this tutorial, you will learn how to build a machine learning linear regression model using **Python**. You will analyze a house price prediction dataset to estimate the price of a house from several parameters. Along the way you will perform exploratory data analysis, split the data into training and testing sets, evaluate the model, and make predictions.

**What Is a Linear Regression Model in Machine Learning?**

Linear regression is a supervised machine learning model that finds the relationship between one or more independent variables and a dependent variable. It predicts the value of the response (dependent) variable y from the given explanatory (independent) variables x by fitting a linear relationship between x (input) and y (output).
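To make the idea of "finding a linear relationship" concrete, here is a minimal sketch that fits a line by ordinary least squares with NumPy. The data is synthetic and generated for illustration only (the true slope of 3 and intercept of 2 are assumptions of this example, not values from the housing dataset).

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 3x + 2 plus small noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, size=200)

# Design matrix with a column of ones so the fit includes an intercept.
A = np.column_stack([x, np.ones_like(x)])

# Ordinary least squares: minimise the sum of squared residuals ||A @ w - y||^2.
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

The fitted slope and intercept should land close to the values used to generate the data, which is what "learning the relationship between x and y" means here.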

### About House Prediction Data Set

Problem statement: A real estate agent wants help predicting house prices for regions in the USA. He gave you a dataset to work on, and you decided to use a linear regression model. Create a model that will help him estimate what a house would sell for.

The dataset is a CSV file with 7 columns and 5000 rows. It contains the following columns:

- **‘Avg. Area Income’**: Average income of the residents of the city where the house is located.
- **‘Avg. Area House Age’**: Average age of houses in the same city.
- **‘Avg. Area Number of Rooms’**: Average number of rooms for houses in the same city.
- **‘Avg. Area Number of Bedrooms’**: Average number of bedrooms for houses in the same city.
- **‘Area Population’**: Population of the city.
- **‘Price’**: Price that the house sold at.
- **‘Address’**: Address of the house.

*You can download the dataset from here – USA_Housing.csv.*

## An Example: Predicting house prices with linear regression using SciKit-Learn, Pandas, Seaborn and NumPy

#### Import Libraries

Install the required libraries and set up the environment for the project. We will start by importing Pandas, NumPy, Seaborn and Matplotlib; the SciKit-Learn modules are imported later, at the point where they are first used.

    import pandas as pd
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt
    %matplotlib inline

The “%matplotlib inline” magic command makes plots render directly inside your Jupyter notebook.

#### Importing and Checking Out the Data

Because the data is in a CSV file, we will read it with the pandas read_csv function and check the first 5 rows of the data frame using head().

    HouseDF = pd.read_csv('USA_Housing.csv')
    HouseDF.head()

    HouseDF.info()

Output:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 5000 entries, 0 to 4999
    Data columns (total 7 columns):
    Avg. Area Income                5000 non-null float64
    Avg. Area House Age             5000 non-null float64
    Avg. Area Number of Rooms       5000 non-null float64
    Avg. Area Number of Bedrooms    5000 non-null float64
    Area Population                 5000 non-null float64
    Price                           5000 non-null float64
    Address                         5000 non-null object
    dtypes: float64(6), object(1)
    memory usage: 273.6+ KB

    HouseDF.describe()

    HouseDF.columns

Output:

    Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
           'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
          dtype='object')

**Exploratory Data Analysis for House Price Prediction**

We will create some simple plots to visualize the data.

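Pairwise scatter plots of all numeric columns:

    sns.pairplot(HouseDF)

Distribution of the target variable (histplot replaces distplot, which is deprecated in recent seaborn releases):

    sns.histplot(HouseDF['Price'], kde=True)

Correlation heatmap (numeric_only=True, available in recent pandas versions, skips the text Address column, which would otherwise raise an error):

    sns.heatmap(HouseDF.corr(numeric_only=True), annot=True)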
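The heatmap is a picture of a correlation matrix, so the same information can be read as numbers. The sketch below builds a small synthetic frame (hypothetical values and only a subset of the tutorial's columns, not the real dataset), computes Pearson correlations over the numeric columns, and ranks the features by their correlation with Price.

```python
import numpy as np
import pandas as pd

# Small synthetic frame (hypothetical values, not the real dataset).
rng = np.random.default_rng(1)
income = rng.normal(68000, 10000, size=500)
population = rng.normal(36000, 9000, size=500)
df = pd.DataFrame({
    "Avg. Area Income": income,
    "Area Population": population,
    # Price is driven mostly by income here, by construction.
    "Price": 20 * income + rng.normal(0, 50000, size=500),
    "Address": ["unknown"] * 500,  # text column, excluded from correlations
})

# Keep numeric columns only, then compute pairwise Pearson correlations.
corr = df.select_dtypes(include="number").corr()

# Correlation of every feature with the target, strongest first.
print(corr["Price"].drop("Price").sort_values(ascending=False))
```

On the real data you would replace `df` with `HouseDF`; the ranking gives a quick first impression of which features move together with the price.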

### Get Data Ready For Training a Linear Regression Model

Let’s now begin to train the regression model. We first need to split our data into X, a DataFrame containing the features to train on, and y, a Series containing the target variable, in this case the Price column. We will ignore the Address column because it contains only text, which is not useful for linear regression modeling.

#### Features (X) and Target (y)

    X = HouseDF[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
                 'Avg. Area Number of Bedrooms', 'Area Population']]
    y = HouseDF['Price']
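As an alternative to listing every feature by name, the features can be selected by dropping the target and keeping only numeric columns. This is a sketch on three made-up rows that reuse the tutorial's column names; the values are hypothetical and the variable name `target` is introduced here for illustration.

```python
import pandas as pd

# Three made-up rows using the tutorial's column names (values are hypothetical).
df = pd.DataFrame({
    "Avg. Area Income": [60000.0, 70000.0, 80000.0],
    "Avg. Area House Age": [5.0, 6.0, 7.0],
    "Avg. Area Number of Rooms": [6.0, 7.0, 8.0],
    "Avg. Area Number of Bedrooms": [3.0, 4.0, 4.0],
    "Area Population": [30000.0, 35000.0, 40000.0],
    "Price": [1000000.0, 1200000.0, 1400000.0],
    "Address": ["address A", "address B", "address C"],
})

target = "Price"

# Drop the target, then keep only numeric columns; this removes Address automatically.
X = df.drop(columns=[target]).select_dtypes(include="number")
y = df[target]

print(list(X.columns))
```

Listing the columns explicitly, as the tutorial does, is easier to read; the approach above is convenient when a dataset has many features.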

#### Split Data into Train, Test

Now we will split our dataset into a training set and a testing set using sklearn's train_test_split(). The training set will be used to train the model and the testing set to test it. With test_size=0.4, we keep 60% of the data for training and hold out 40% for testing.

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)

X_train and y_train contain the data used to train the model. X_test and y_test contain the data used to test it. X and y are the names of the feature and target variables.
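To see what test_size=0.4 and random_state do in practice, here is a small sketch on a synthetic stand-in with the same shape as the housing data (5000 rows, 5 features). The random values are placeholders, not the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the tutorial's shape: 5000 rows, 5 features.
rng = np.random.default_rng(101)
X = rng.normal(size=(5000, 5))
y = rng.normal(size=5000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=101
)

# 60% of the rows go to training, 40% to testing.
print(X_train.shape, X_test.shape)  # (3000, 5) (2000, 5)

# The same random_state reproduces exactly the same split.
X_train_again, _, _, _ = train_test_split(X, y, test_size=0.4, random_state=101)
print(np.array_equal(X_train, X_train_again))  # True
```

Fixing random_state is what makes results such as the intercept and error metrics reproducible from one run to the next.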

### Creating and Training the LinearRegression Model

We will import LinearRegression from sklearn's linear_model module, create a LinearRegression object, and fit it to the training dataset.

    from sklearn.linear_model import LinearRegression

    lm = LinearRegression()
    lm.fit(X_train, y_train)

Output (the exact text depends on your scikit-learn version):

    LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

### LinearRegression Model Evaluation

Now let’s evaluate the model by checking out its coefficients and how we can interpret them.

    print(lm.intercept_)

Output:

    -2640159.796851911

    coeff_df = pd.DataFrame(lm.coef_, X.columns, columns=['Coefficient'])
    coeff_df

What the coefficients tell us:

- Holding all other features fixed, a 1 unit increase in **Avg. Area Income** is associated with an **increase of $21.52** in price.
- Holding all other features fixed, a 1 unit increase in **Avg. Area House Age** is associated with an **increase of $164883.28** in price.
- Holding all other features fixed, a 1 unit increase in **Avg. Area Number of Rooms** is associated with an **increase of $122368.67** in price.
- Holding all other features fixed, a 1 unit increase in **Avg. Area Number of Bedrooms** is associated with an **increase of $2233.80** in price.
- Holding all other features fixed, a 1 unit increase in **Area Population** is associated with an **increase of $15.15** in price.
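The "holding all other features fixed" reading can be checked numerically. The sketch below uses synthetic data with two hypothetical features and known coefficients (5 and -2, chosen for this example), raises only the first feature by one unit, and compares the change in the prediction with the fitted coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with two hypothetical features and known coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = 5.0 * X[:, 0] - 2.0 * X[:, 1] + 10.0 + rng.normal(0, 0.1, size=300)

model = LinearRegression().fit(X, y)

# Take one sample and raise only the first feature by exactly 1 unit.
base = X[:1].copy()
bumped = base.copy()
bumped[0, 0] += 1.0

change = model.predict(bumped)[0] - model.predict(base)[0]

# The change in the prediction equals the first coefficient.
print(change, model.coef_[0])
```

The same check applied to the housing model would show the predicted price moving by the corresponding entry of `coeff_df` for each one-unit change in a single feature.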

### Predictions from our Linear Regression Model

Let’s get the predictions for our test set and see how well the model performs.

    predictions = lm.predict(X_test)

    plt.scatter(y_test, predictions)

In the scatter plot above, the points fall roughly along a straight line, which means the predicted prices track the actual prices closely and the model has made good predictions.

    # Histogram of the residuals; histplot replaces the deprecated distplot
    sns.histplot(y_test - predictions, bins=50, kde=True)

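In the histogram above, the residuals (actual minus predicted prices) form a bell shape, approximately normally distributed and centered near zero. This suggests the errors are mostly random noise rather than a systematic bias, which is a good sign for the model's predictions.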
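The visual check can be complemented with a couple of numbers. This is a rough sketch on synthetic values (the prices and errors are hypothetical and Gaussian by construction); it looks at the mean of the residuals and at the share that falls within one standard deviation, which is about 68% for a normal distribution.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical actual prices, and predictions that miss by Gaussian noise.
y_true = rng.normal(1_200_000, 350_000, size=2000)
y_pred = y_true - rng.normal(0, 100_000, size=2000)

residuals = y_true - y_pred

mean = residuals.mean()
std = residuals.std()

# Share of residuals within one standard deviation of the mean.
within_one_std = np.mean(np.abs(residuals - mean) <= std)

print(f"mean={mean:.0f}, std={std:.0f}, within one std={within_one_std:.2%}")
```

With the tutorial's variables you would use `residuals = y_test - predictions`. A mean far from zero or a share far from 68% would be a hint that the bell-shaped picture is misleading.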

### Regression Evaluation Metrics

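Here are three common evaluation metrics for regression problems. In each formula, n is the number of samples, y_i is the actual value and ŷ_i is the predicted value for sample i.

**Mean Absolute Error** (MAE) is the mean of the absolute value of the errors:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

**Mean Squared Error** (MSE) is the mean of the squared errors:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

**Root Mean Squared Error** (RMSE) is the square root of the mean of the squared errors:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$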

Comparing these metrics:

- **MAE** is the easiest to understand because it’s the average error.
- **MSE** is more popular than MAE because MSE “punishes” larger errors, which tends to be useful in the real world.
- **RMSE** is even more popular than MSE because RMSE is interpretable in the “y” units.

All of these are **loss functions** because we want to minimize them.

    from sklearn import metrics

    print('MAE:', metrics.mean_absolute_error(y_test, predictions))
    print('MSE:', metrics.mean_squared_error(y_test, predictions))
    print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

Output:

    MAE: 82288.22251914957
    MSE: 10460958907.209501
    RMSE: 102278.82922291153
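The three metrics are simple enough to compute by hand, which helps when checking what the library functions return. Here is a minimal NumPy sketch on four made-up values (not taken from the housing data).

```python
import numpy as np

# Four made-up actual and predicted values.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # mean of absolute errors
mse = np.mean(errors ** 2)      # mean of squared errors
rmse = np.sqrt(mse)             # back in the units of y

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f}")
# MAE=0.500 MSE=0.375 RMSE=0.612
```

Note how the single error of 1.0 contributes most of the MSE, which is the "punishing larger errors" effect, and how RMSE lands back on the same scale as the errors themselves.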

### Conclusion

We have created a linear regression model that will help the real estate agent estimate house prices.

You can find this project on GitHub.