Creating a prediction model as a service using Flask and Docker
Introduction¶
The aim of this notebook is to create a machine learning model and transform it into an API which, when given some novel input parameters, returns the model’s prediction.
The model¶
The model that is going to be used is a Random Forest, built on a data set of Titanic passengers (data/train.csv). It predicts the probability that a given passenger would have survived.
The goal¶
Input is an API call such as
/predict?class=2&sex=male&age=22&sibsp=2&parch=0&title=mr
with a response in the form of
{ "probabilityOfSurvival": 0.95 }
Creating the model¶
import pandas as pd
Import the data from the CSV file train.csv.
train = pd.read_csv('data/train.csv')
Explore the data.
train.head()
train.dtypes
Get the mapping for the gender by creating a function that retrieves the categories of a column of type category. This returns a dictionary that maps each category value to its numeric code.
def _get_category_mapping(column):
""" Return the mapping of a category """
    return {cat: code for code, cat in enumerate(column.cat.categories)}
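For example, applied to a small toy column (a minimal illustration, not part of the Titanic data), pandas orders the categories alphabetically:
colors = pd.Series(['red', 'green', 'red']).astype('category')
_get_category_mapping(colors)  # {'green': 0, 'red': 1}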
The unique values before converting to a category:
train['Sex'].unique()
train['Sex'] = train['Sex'].astype('category')
sex_mapping = _get_category_mapping(train['Sex'])
train['Sex'] = train['Sex'].cat.codes
and the values after converting:
sex_mapping
We will keep the mapping so we can use it later once we deploy the Model as a Service.
train['Sex'].unique()
Create categories for the titles by extracting them from the names.
train['Name'].head(10)
# Keys and values are lowercase and dot-free so that they match the
# titles produced by _extract_title below
FRENCH_MAPPING = {
    'mme': 'mrs',    # Madame
    'mlle': 'miss',  # Mademoiselle
    'm': 'mr',       # Monsieur
}
FINAL_TITLES = [
'master',
'miss',
'mr',
'mrs'
]
import re
def _extract_title(column):
""" Extract the title """
    # Keep only the title: strip the surname before the comma and everything from the dot on, then lowercase
title_column = column.apply(lambda x: re.sub(r'(.*, )|(\..*)', '', x).lower()).astype(str)
# Map the French to English titles
title_column = title_column.replace(FRENCH_MAPPING)
# Create the categories based on the final titles and the rare title
title_column = title_column.apply(lambda x: 'rare title' if x not in FINAL_TITLES else x)
return title_column
train['Title'] = _extract_title(train['Name'])
train[['Name', 'Title']].head()
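As a quick sanity check of the extraction (including the French-to-English mapping), the helper can be applied to a couple of names in the data set's format:
_extract_title(pd.Series(['Braund, Mr. Owen Harris', 'Sagesser, Mlle. Emma']))
# 0      mr
# 1    miss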
import matplotlib.pyplot as plt
%matplotlib inline
train.groupby('Title')['Name'].count().plot.pie(title="Distribution of titles")
Let's now convert the titles to a category as we did for the gender:
train['Title'] = train['Title'].astype("category")
title_mapping = _get_category_mapping(train['Title'])
train['Title'] = train['Title'].cat.codes
title_mapping
Let us now investigate the ages of the people aboard the Titanic.
train.groupby('Age')['Name'].count().nlargest(20).plot.bar(title="Top 20 ages")
train[train['Age'].isnull()]
In the data set the age is missing for 177 passengers. Use a linear model, fitted on the rows that do have an age, to estimate the age from the class, gender and number of siblings/spouses.
LIN_MOD_FEATURES = [
'Pclass',
'Sex',
'SibSp'
]
LIN_MOD_TARGET = [
'Age'
]
from sklearn import linear_model
def _create_linear_model(frame):
""" Create linear model """
imput = frame[frame.Age.notnull()]
features = imput[LIN_MOD_FEATURES]
target = imput[LIN_MOD_TARGET]
model = linear_model.LinearRegression()
model.fit(features, target)
return model
linear_mod = _create_linear_model(train)
Calculate the predicted age for all the rows:
train['PredictedAge'] = linear_mod.predict(train[LIN_MOD_FEATURES])
Merge the predicted age into the dataframe where there is no age yet:
train['Age'] = train.apply(lambda x: x.Age if pd.notnull(x.Age) else x.PredictedAge, axis=1)
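An equivalent, vectorized alternative to the apply above would be:
train['Age'] = train['Age'].fillna(train['PredictedAge'])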
Drop the prediction column since it is not needed anymore:
train.drop(['PredictedAge'], axis=1, inplace=True)
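A quick check confirms that no ages are missing anymore:
assert train['Age'].isnull().sum() == 0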
To estimate the chance of survival we will use a Random Forest Classifier.
FEATURES = [
'Pclass',
'Sex',
'Age',
'SibSp',
'Parch',
'Title',
]
TARGET = 'Survived'
NUM_TREES = 500
MAX_FEATURES = 2
from sklearn.ensemble import RandomForestClassifier
def _create_random_forest_classifier(frame):
""" Build a random forest classifier """
features = frame[FEATURES]
target = frame[TARGET]
model = RandomForestClassifier(n_estimators=NUM_TREES,
max_features=MAX_FEATURES,
random_state=754)
model.fit(features, target)
return model
random_forest_classifier = _create_random_forest_classifier(train)
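The original notebook does not evaluate the model, but a quick cross-validation gives a rough sense of its quality (an optional sanity check, not required for the service):
from sklearn.model_selection import cross_val_score
scores = cross_val_score(random_forest_classifier, train[FEATURES], train[TARGET], cv=5)
print("Mean CV accuracy: {:.3f}".format(scores.mean()))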
As a final step for the MaaS we will save the model and the two mappings to disk. This way we can ship them with the microservice in the next step.
from sklearn.externals import joblib
def _save_variable(variable, filename):
""" Save a variable to a file """
joblib.dump(variable, filename)
_save_variable(random_forest_classifier, 'random_forest.mdl')
_save_variable(title_mapping, 'title_mapping.pkl')
_save_variable(sex_mapping, 'sex_mapping.pkl')
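To make sure the artifacts round-trip cleanly, they can be loaded straight back (an optional check):
assert joblib.load('sex_mapping.pkl') == sex_mapping
assert joblib.load('title_mapping.pkl') == title_mapping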
Deploying the model as a service¶
To deploy the model as a service, I am going to use the web framework Flask. It makes it easy to interact with the variables we saved in the previous step, and it is straightforward to create a simple web app with only a few routes. The app.py contains all the magic and will be used in the Dockerfile to get the container with the model online. This is the code for the main application:
%%file app.py
#!/usr/bin/env python
# # -*- coding: utf-8 -*-
""" Flask API for predicting probability of survival """
import sys
from flask import Flask, jsonify, request
from sklearn.externals import joblib
import numpy as np
try:
saved_model = joblib.load('random_forest.mdl')
sex_mapping = joblib.load('sex_mapping.pkl')
title_mapping = joblib.load('title_mapping.pkl')
except Exception:
    print("Error loading application. Please run `python create_random_forest.py` first!")
    sys.exit(1)
app = Flask(__name__)
@app.route('/')
def main():
""" Main page of the API """
return "This is the main page"
@app.route('/predict', methods=['GET'])
def predict():
""" Predict the probability of survival """
args = request.args
required_args = ['class', 'sex', 'age', 'sibsp', 'parch', 'title']
# Simple error handling for the arguments
diff = set(required_args).difference(set(args.keys()))
    if diff:
        return "Error: missing arguments {}".format(diff)
person_features = np.array([args['class'],
sex_mapping[args['sex']],
args['age'],
args['sibsp'],
args['parch'],
title_mapping[args['title'].lower()]
]).reshape(1, -1)
probability = saved_model.predict_proba(person_features)[:, 1][0]
return jsonify({'probabilityOfSurvival': probability})
if __name__ == '__main__':
app.run(host='0.0.0.0')
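Before containerizing, the service can be smoke-tested locally: run python app.py in one terminal (with the three saved files in the same folder) and query it from another, for example:
import requests
params = {'class': 2, 'sex': 'male', 'age': 22, 'sibsp': 2, 'parch': 0, 'title': 'mr'}
print(requests.get('http://localhost:5000/predict', params).json())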
Breakdown¶
The first bit
try:
saved_model = joblib.load('random_forest.mdl')
sex_mapping = joblib.load('sex_mapping.pkl')
title_mapping = joblib.load('title_mapping.pkl')
except Exception:
    print("Error loading application. Please run `python create_random_forest.py` first!")
    sys.exit(1)
will load the three variables that we saved while training the model. We can use the mappings to map the input arguments of the API to the proper fields of the model.
The next part is default for a Flask application:
app = Flask(__name__)
@app.route('/')
def main():
""" Main page of the API """
return "This is the main page"
The predict route is slightly more advanced. It takes parameters via the GET method and verifies that all six required arguments are present.
args = request.args
required_args = ['class', 'sex', 'age', 'sibsp', 'parch', 'title']
# Simple error handling for the arguments
diff = set(required_args).difference(set(args.keys()))
if diff:
    return "Error: missing arguments {}".format(diff)
If all arguments are present, it builds a numpy feature array to feed to the prediction model. This is where the mappings come in: the sex and title arguments arrive as strings, and each is mapped to its numeric code so that it matches the corresponding feature the model was trained on.
person_features = np.array([args['class'],
sex_mapping[args['sex']],
args['age'],
args['sibsp'],
args['parch'],
title_mapping[args['title'].lower()]
]).reshape(1, -1)
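For the example request from the goal section, the assembled feature vector conceptually looks as follows (the codes follow pandas' alphabetical category ordering in this data set, so 'male' maps to 1 and 'mr' maps to 2):
# /predict?class=2&sex=male&age=22&sibsp=2&parch=0&title=mr
# columns: Pclass, Sex, Age, SibSp, Parch, Title
np.array([2, 1, 22, 2, 0, 2]).reshape(1, -1)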
Finally, the probability is calculated from the features and returned as a JSON object; the [:, 1] index selects the probability of the positive class (Survived == 1).
probability = saved_model.predict_proba(person_features)[:, 1][0]
return jsonify({'probabilityOfSurvival': probability})
The last bit of app.py is the default way of starting the Flask server. I have changed the host from localhost to 0.0.0.0 so that the API is reachable from outside the container, i.e. from any IP address.
app.run(host='0.0.0.0')
Docker¶
To run the model as a service, I will use Docker to create a container where the server is running and the endpoint for the prediction is exposed. The Dockerfile
is very basic. It will use Python 3, copy the contents to the container, install the requirements and start the server.
# Base image
FROM python:3
# Copy contents
COPY . /app
# Change work directory
WORKDIR /app
# Install the requirements
RUN pip install -r requirements.txt
# Start the application
CMD ["python", "app.py"]
where requirements.txt
contains the following:
certifi==2018.1.18
click==6.7
Flask==0.12.2
itsdangerous==0.24
Jinja2==2.10
MarkupSafe==1.0
numpy==1.14.0
pandas==0.22.0
python-dateutil==2.6.1
pytz==2017.3
scikit-learn==0.19.1
scipy==1.0.0
six==1.11.0
Werkzeug==0.14.1
The docker-compose.yml
will simply build the current Dockerfile
and expose port 5000.
version: '2'
services:
flask:
build: .
ports:
- "5000:5000"
Running docker-compose up -d in this folder will start the server; the endpoint should then be reachable at the machine's IP on port 5000 under the /predict route.
Execution¶
Let's put it to the test: create the files, spin up Docker and check the API response.
%%file Dockerfile
# Base image
FROM python:3
# Copy contents
COPY . /app
# Change work directory
WORKDIR /app
# Install the requirements
RUN pip install -r requirements.txt
# Start the application
CMD ["python", "app.py"]
%%file requirements.txt
certifi==2018.1.18
click==6.7
Flask==0.12.2
itsdangerous==0.24
Jinja2==2.10
MarkupSafe==1.0
numpy==1.14.0
pandas==0.22.0
python-dateutil==2.6.1
pytz==2017.3
scikit-learn==0.19.1
scipy==1.0.0
six==1.11.0
Werkzeug==0.14.1
%%file docker-compose.yml
version: '2'
services:
flask:
build: .
ports:
- "5000:5000"
!docker-compose up -d
!docker ps | grep flask
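Once the container is running, a quick request to the root route confirms the server is reachable (this assumes Docker runs on the local machine; substitute the host otherwise):
import requests
print(requests.get('http://localhost:5000/').text)  # expect "This is the main page"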
Verification¶
The following code will call the API with some parameters and print the result.
import requests
params = {
'class': 2,
'age': 22,
'sibsp': 2,
'parch': 0,
'title': 'mr',
'sex': 'male',
}
url = 'http://hub.jitsejan.com:5000/predict'
r = requests.get(url, params)
print(r.url)
print(r.json())
And that is it! The JSON object can now be returned to the front-end of the web application and be displayed in a fancy way, but that is outside the scope of this notebook.
Let's clean up by retrieving the Docker ID and removing the container.
!docker stop $(docker ps -aqf "name=flask")
!docker rm $(docker ps -aqf "name=flask")
Check my GitHub for the original notebook.