How to Create and Run a Pickle File for a Machine Learning Model

In the field of machine learning, saving and loading trained models is essential for reusing them in different applications without having to retrain them from scratch. One effective way to achieve this is by creating a pickle file (.pkl), which allows us to serialize the model and store it as a binary file. In this blog, we will explore how to create a pickle file for a machine learning model, enabling us to save and load the model easily.

What is a Pickle File?

A pickle file is a binary file produced by Python's built-in pickle module, which serializes Python objects. Pickling converts an object into a byte stream, making it easy to save, transfer, and reconstruct later. For a machine learning model, this preserves the model's state, including its hyperparameters and the parameters learned during training (for a linear regression, the coefficients and intercept).
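
As a minimal sketch of that byte-stream round trip, the snippet below pickles a plain dictionary in memory. The dictionary and its values are hypothetical stand-ins for a model's learned state, not a real scikit-learn object.

```python
import pickle

# Hypothetical stand-in for a model's learned state (illustrative values only).
params = {"coef": [120.5, -3.2], "intercept": 10000.0}

blob = pickle.dumps(params)    # serialize the object to a bytes object
restored = pickle.loads(blob)  # rebuild an equal object from those bytes

print(type(blob).__name__)     # bytes
print(restored == params)      # True
print(restored is params)      # False: it is a new, reconstructed object
```

pickle.dumps() and pickle.loads() work on bytes in memory; pickle.dump() and pickle.load(), used in the rest of this post, do the same thing against an open file.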

For this tutorial, we use the example from our House Price Prediction using Linear Regression Machine Learning blog post.

Creating a Pickle File for a Machine Learning Model:

Step 1: Train the Machine Learning Model

Before creating a pickle file, we need to have a trained machine learning model. Let’s go through a simple example of training a Linear Regression model for house price prediction:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Read the dataset into a pandas DataFrame (adjust the path to wherever your CSV file is stored):
df = pd.read_csv('/kaggle/input/housedata/data.csv')

# Let's take a look at the first few rows of the dataset to understand its structure
print(df.head())

# Now, let's extract the relevant columns from the DataFrame to create our feature matrix X and target variable y.
# 'X' will contain the independent variables (bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, condition)
# 'y' will contain the target variable (price)

X = df[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition']]
y = df['price']

# Next, you can split the dataset into training and testing sets using sklearn's train_test_split function:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

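# X_train and X_test already inherit these column names from X, so the assignments below only make them explicit.
# Fitting on named columns, and later predicting on a DataFrame with the same names, avoids scikit-learn's feature-name warning: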
feature_names = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition']
X_train.columns = feature_names
X_test.columns = feature_names

# Now, you can build your linear regression model using sklearn's LinearRegression class:
model = LinearRegression()

# Fit the model on the training data
model.fit(X_train, y_train)

# Once the model is trained, you can use it to make predictions on new data (test set in this case):
y_pred = model.predict(X_test)

# To evaluate the model's performance, you can use metrics such as Mean Squared Error (MSE) and R-squared:
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R-squared:", r2)

# Save the trained model to a .pkl file
import pickle
with open('linear_regression_model.pkl', 'wb') as file:
    pickle.dump(model, file)

Step 2: Save the Trained Model to a Pickle File

Now that we have trained the Linear Regression model, we can proceed to save it as a pickle file:

import pickle

# Save the trained model to a .pkl file
with open('linear_regression_model.pkl', 'wb') as file:
    pickle.dump(model, file)

The pickle.dump() function takes two arguments – the model object and the file object (opened in binary write mode ‘wb’). It serializes the model and writes it to the specified file.

Note: The last four lines of the Step 1 code already perform this save; they are repeated here on their own for clarity.
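
To illustrate why the file must be opened in binary mode, here is a small sketch that tries text mode first and then writes the pickle properly. The dictionary, the file name demo_model.pkl, and the use of a temporary directory are assumptions made for the demo; the protocol argument is optional.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model; any picklable object behaves the same way.
model_state = {"coef": [120.5, -3.2], "intercept": 10000.0}

# Work in a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name

    # Text mode ('w') is rejected because pickle writes bytes, not str.
    text_mode_failed = False
    try:
        with open(path, "w") as text_file:
            pickle.dump(model_state, text_file)
    except TypeError:
        text_mode_failed = True

    # Binary write mode ('wb') works. HIGHEST_PROTOCOL selects the newest
    # pickle format supported by the running Python version.
    with open(path, "wb") as file:
        pickle.dump(model_state, file, protocol=pickle.HIGHEST_PROTOCOL)

    size_in_bytes = os.path.getsize(path)

print("Text mode failed:", text_mode_failed)
print("Bytes written in binary mode:", size_in_bytes)
```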

Step 3: Load the Model from the Pickle File (Optional)

If you want to use the model in a different script or at a later time, you can load it from the pickle file:

import pickle

# Load the model from the .pkl file saved in Step 2
with open('linear_regression_model.pkl', 'rb') as file:
    loaded_model = pickle.load(file)

The pickle.load() function reads the serialized model from the file and reconstructs the model object, allowing you to use it for predictions or other tasks.
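
One optional pattern, not part of the original tutorial, is to pickle the column order together with the model so that a later script knows how to arrange its inputs. The sketch below uses a placeholder dictionary instead of a real estimator, and the file name bundle.pkl is hypothetical.

```python
import os
import pickle
import tempfile

# Hypothetical bundle: a placeholder "model" plus the column order it was trained on.
bundle = {
    "model": {"coef": [1.0, 2.0], "intercept": 0.5},  # placeholder, not a real estimator
    "feature_names": ["bedrooms", "bathrooms"],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "bundle.pkl")  # hypothetical file name

    with open(path, "wb") as file:
        pickle.dump(bundle, file)

    # 'rb' = read, binary; pickle.load rebuilds the whole object from the file.
    with open(path, "rb") as file:
        loaded_bundle = pickle.load(file)

print(loaded_bundle["feature_names"])  # ['bedrooms', 'bathrooms']
print(loaded_bundle == bundle)         # True
print(loaded_bundle is bundle)         # False
```

With a real model, the "model" entry would hold the fitted estimator itself, and the loading script would read both entries from the same file.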

How to Run a .pkl File to Predict on New Data

To make predictions on new data, for example new_data given as [[3, 2, 1500, 4000, 1, 0, 0, 3]], follow these steps:

Step 1: Load the Trained Model from the Pickle File

import pickle

# Load the model from the .pkl file
with open('linear_regression_model.pkl', 'rb') as file:
    loaded_model = pickle.load(file)

Step 2: Prepare the New Data for Prediction

Make sure new_data has the same format as the training data: the same features, in the same order.

import pandas as pd

# Create a DataFrame with the new data in the same format as the training data
new_data = pd.DataFrame([[3, 2, 1500, 4000, 1, 0, 0, 3]], columns=['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition'])
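
If the incoming values arrive keyed by name rather than already ordered, a small helper step can put them into the training order before building the DataFrame. This is a sketch under that assumption; the record dictionary is hypothetical, while the feature list is the one used in training.

```python
# Column order used at training time (the tutorial's feature list).
feature_names = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
                 'floors', 'waterfront', 'view', 'condition']

# Hypothetical incoming record whose keys arrive in a different order.
record = {'sqft_living': 1500, 'bedrooms': 3, 'condition': 3, 'floors': 1,
          'bathrooms': 2, 'view': 0, 'sqft_lot': 4000, 'waterfront': 0}

# Fail early if a feature the model expects is absent.
missing = [name for name in feature_names if name not in record]
if missing:
    raise ValueError(f"Missing features: {missing}")

# Arrange the values in training order; the outer list makes it one row of eight columns.
row = [record[name] for name in feature_names]
ordered_rows = [row]

print(ordered_rows)  # [[3, 2, 1500, 4000, 1, 0, 0, 3]]
```

The resulting list can then be passed to pd.DataFrame(ordered_rows, columns=feature_names), which matches the DataFrame built above.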

Step 3: Make Predictions

Now, use the loaded model to make predictions on the new_data.

# Make predictions using the loaded model
predicted_price = loaded_model.predict(new_data)

print("Predicted Price:", predicted_price[0])
# Predicted Price: 331038.9687692916

In this step, predicted_price will contain the predicted value for the new_data.
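
For intuition about what predict() computes for a linear regression, the value is the intercept plus the sum of each coefficient times its feature. The sketch below uses made-up coefficients and only three features; a fitted scikit-learn LinearRegression exposes the real values as coef_ and intercept_.

```python
# Hypothetical coefficients and intercept, chosen only for illustration.
coef = [1000.0, 2000.0, 100.0]   # one weight per feature
intercept = 50000.0
features = [3, 2, 1500]          # e.g. bedrooms, bathrooms, sqft_living

# Linear regression prediction: intercept + sum(coefficient * feature value)
predicted = intercept + sum(c * x for c, x in zip(coef, features))

print(predicted)  # 207000.0
```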

By following these steps, you can easily use the trained model from the .pkl file to make predictions on new data provided in the new_data format. The pickle file allows you to load and reuse the model without the need for retraining, making it a convenient way to apply the model to various datasets.

Conclusion

Creating a pickle file for a machine learning model is a straightforward and effective way to save and reuse trained models. By following the steps outlined in this blog, you can easily serialize and deserialize your models, making them portable and enabling you to apply them to different projects or share them with others. Pickle files are a valuable tool in the machine learning workflow, contributing to efficient model management and deployment.

Find this tutorial on GitHub, where you will find the pickle file and the code to run it.
