Train & Deploy Census Income Classification Model
Overview
In this tutorial, you will learn how to train and deploy a model for a classification task based on the Adult Dataset. The process consists of four steps: data preparation, model training, uploading the model to the cluster, and making predictions on test samples.
By the end of this tutorial you will know how to:
- Prepare data
- Train a model
- Deploy a model with SDK
- Explore models via UI
- Deploy a model with CLI and resource definition
Prerequisites
For this tutorial, you need to have Hydrosphere Platform deployed and Hydrosphere CLI (`hs`) along with Python SDK (`hydrosdk`) installed on your local machine. If you don't have them yet, please follow these guides first:
For this tutorial, you can use a local cluster. To ensure that you are using one, run `hs cluster` in your terminal. This command shows the name and server address of the cluster you're currently using. If it shows that you're not using a local cluster, you can configure one with the following commands:
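For example, assuming the platform is running on http://localhost and you want to name the cluster `local` (both are assumptions; adjust to your setup):

```sh
hs cluster add --name local --server http://localhost
hs cluster use local
```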
Data preparation
Let's start by downloading the dataset and moving it to a folder of your choice, e.g. a `data/` folder. Next, you need to set up your working environment with the following packages:
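A minimal environment sketch; the exact package set is an assumption based on the libraries used later in this tutorial:

```sh
pip install pandas scikit-learn joblib hydrosdk
```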
Model training always requires some amount of initial preparation, most of which is data preparation. The Adult Dataset consists of 14 descriptors, 5 of which are numerical and 9 categorical, including the class column. Categorical features are usually represented as strings. This is not an appropriate data type for sending into a model, so we need to transform them first. Note that we pass a specific dtype (`int64`) to `OrdinalEncoder` to obtain integers for the categorical descriptors after transformation. Transforming the class column is usually not necessary. We also remove rows that contain question marks, which mark missing values in this dataset. Once the preprocessing is complete, you can delete the original DataFrame (`df`):
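A sketch of this preparation step. It assumes the dataset was saved as `data/adult.data` and uses the standard Adult column names (both are assumptions; adjust to your layout):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed column names; adult.data ships without a header row
columns = [
    "age", "workclass", "fnlwgt", "education", "educational-num",
    "marital-status", "occupation", "relationship", "race", "gender",
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "income",
]

df = pd.read_csv("data/adult.data", header=None, names=columns, skipinitialspace=True)

# Drop rows where any field holds the '?' placeholder for missing values
df = df[~(df == "?").any(axis=1)].reset_index(drop=True)

# Encode categorical descriptors as int64; the class column stays as strings
categorical = df.select_dtypes(include="object").columns.drop("income")
df[categorical] = OrdinalEncoder(dtype=np.int64).fit_transform(df[categorical])

X, y = df.drop(columns="income"), df["income"]
del df  # the raw DataFrame is no longer needed
```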
Training a model
There are many classifiers that you could use at this stage. In this example, we'll apply a Random Forest classifier. After preprocessing, the dataset is split into train and test subsets. The test set will be used to check whether our deployed model can process requests on the cluster. The training step usually requires instantiating your model class and calling a specific training method, which is the `fit()` method in our case. After training, we can save the model with `joblib.dump()` in a `model/` folder. The training data can be saved as a `csv` file, but don't forget to pass `index=False` to omit the index column and avoid confusion when reading the file back.
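A minimal sketch of this step; the file names (`model.joblib`, `train.csv`), split ratio, and hyperparameters are assumptions:

```python
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hold out a test set for checking the deployed model later
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Persist the trained model and the training data for upload
joblib.dump(clf, "model/model.joblib")
X_train.to_csv("model/train.csv", index=False)
```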
Deploy a model with SDK
The easiest way to upload a model to your cluster is by using Hydrosphere SDK. SDK allows Python developers to configure and manage the model lifecycle on the Hydrosphere platform. Before uploading a model, you need to connect to your cluster:
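A minimal sketch, assuming a local cluster available at http://localhost:

```python
from hydrosdk.cluster import Cluster

cluster = Cluster("http://localhost")
```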
Next, we need to create an inference script to be uploaded to the Hydrosphere platform. This script will be executed each time you instantiate a model servable. Let's name our function file `func_main.py` and store it in the `src` folder inside the directory where your model is stored. Your directory structure should look like this:
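Assuming the file names used in the training step above:

```
model
├── model.joblib
├── train.csv
└── src
    └── func_main.py
```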
The code in `func_main.py` should be as follows:
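A sketch of `func_main.py`. It assumes that Hydrosphere's Python runtime mounts the uploaded payload under `/model/files/` and reuses the column names from the training DataFrame:

```python
import pandas as pd
from joblib import load

# The payload is mounted under /model/files/ inside the runtime container
clf = load("/model/files/model.joblib")

# Column names in the order they appear in the training DataFrame
cols = [
    "age", "workclass", "fnlwgt", "education", "educational-num",
    "marital-status", "occupation", "relationship", "race", "gender",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
]


def predict(**kwargs):
    # Rebuild a single-row DataFrame with the columns in the right order
    X = pd.DataFrame.from_dict({"input": kwargs}, orient="index", columns=cols)
    predicted = clf.predict(X)
    return {"y": predicted[0]}
```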
It’s important to make sure that the variables are in the right order after we transform the incoming dictionary for a prediction. For that purpose, `cols` preserves the column names as a list, ordered by their appearance in the DataFrame.
To start working with the model in a cluster, we need to install the necessary libraries used in `func_main.py`. You need to create `requirements.txt` in the folder with your model and add the following libraries to it:
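For example (the version pins are assumptions; use the versions you actually trained with):

```
pandas==1.2.4
scikit-learn==0.24.2
joblib==1.0.1
```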
After this, your model directory with all necessary dependencies should look as follows:
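```
model
├── model.joblib
├── train.csv
├── requirements.txt
└── src
    └── func_main.py
```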
Now we are ready to upload our model to the cluster.
Hydrosphere Serving has a strictly typed inference engine, so before uploading our model we need to specify its signature with `SignatureBuilder`. A signature contains information about which function inside `func_main.py` should be called, as well as the shapes and types of its inputs and outputs. You can use `X.dtypes` to check what type of data you have in each column. After the transformation, all our independent variables can be described as `int64` fields. Our class variable (`income`) initially consists of two classes with text names instead of numbers, which means it should be defined as a string (`str`) in the signature. In addition, you can specify the type of profiling for each variable using `ProfilingType`, so Hydrosphere knows what the variable is about and can analyze it accordingly. For this purpose, we can create a dictionary whose keys are our variables and whose values are their profiling types. Alternatively, you can describe them one by one as a parameter of each input. Finally, we complete our signature by assigning the output variable with the `with_output()` method, giving it a name (e.g. `y`), type, shape, and profiling type, and then build the signature with the `build()` method.
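A sketch of the signature, assuming the hydrosdk 3.x API, scalar `int64` inputs, and an assumed split between numerical and categorical descriptors; the method name `predict` must match the function defined in `func_main.py`:

```python
from hydrosdk.signature import SignatureBuilder, ProfilingType

signature_builder = SignatureBuilder("predict")

# Keys are variable names, values are profiling types; every column not
# listed here is treated as categorical (an assumption; adjust to your data)
profiling = {
    "age": ProfilingType.NUMERICAL,
    "fnlwgt": ProfilingType.NUMERICAL,
    "educational-num": ProfilingType.NUMERICAL,
    "capital-gain": ProfilingType.NUMERICAL,
    "capital-loss": ProfilingType.NUMERICAL,
    "hours-per-week": ProfilingType.NUMERICAL,
}

for name in X.columns:
    signature_builder.with_input(
        name, "int64", "scalar", profiling.get(name, ProfilingType.CATEGORICAL)
    )

signature = (
    signature_builder
    .with_output("y", "str", "scalar", ProfilingType.NONE)
    .build()
)
```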
Next, we need to specify which files will be uploaded to the cluster. We use the `path` variable to define the root model folder and `payload` to point out the paths of all the files we need to upload. At this point, we can combine all our efforts in a `ModelVersionBuilder` object, which describes our model and the objects associated with it before the upload step. It has different methods responsible for assigning and uploading the different components. For example, we can:

- Specify the runtime environment for our model with the `with_runtime()` method
- Assign the previously built signature with `with_signature()`
- Add the model's files with `with_payload()`
- Lastly, attach the training data that was used for the model's training with `with_training_data()`. Please note that training data is required if you want to use services such as Data Drift, Automatic Outlier Detection, and Data Visualization.
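Putting this together; the model name and the runtime image tag below are assumptions (use the Python runtime version matching your cluster):

```python
from hydrosdk.image import DockerImage
from hydrosdk.modelversion import ModelVersionBuilder

path = "model"  # root model folder
payload = ["src/func_main.py", "requirements.txt", "model.joblib"]

model_version_builder = (
    ModelVersionBuilder("adult_model", path)
    .with_runtime(DockerImage("hydrosphere/serving-runtime-python-3.7", "3.0.0-dev4", None))
    .with_signature(signature)
    .with_payload(payload)
    .with_training_data("train.csv")
)
```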
Now we are ready to upload our model to the cluster. This process consists of several steps:
- Once `ModelVersionBuilder` is prepared, we can apply the `upload` method to upload it.
- Then we can lock any interaction with the model until it is successfully uploaded.
- `ModelVersion` helps to check whether our model was successfully uploaded to the platform by looking it up.
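A sketch of these steps. Exact method names vary between hydrosdk releases (the upload step is exposed as `build()` in some versions and `upload()` in others), so treat this as an outline rather than a definitive API reference:

```python
from hydrosdk.modelversion import ModelVersion

# Upload the model version and block until the build is released
mv = model_version_builder.build(cluster)
mv.lock_till_released()

# Look the version up on the platform to confirm the upload
uploaded = ModelVersion.find(cluster, "adult_model", mv.version)
print(uploaded)
```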
To deploy a model, you should create an Application - a linear pipeline of `ModelVersions` with monitoring and other benefits. For that purpose, we can apply `ExecutionStageBuilder`, which describes the model pipeline of an application. In turn, applications provide Predictor objects, which should be used for data inference. Don't pay much attention to the `weight` parameter; it is needed for A/B testing.
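A sketch, assuming the hydrosdk 3.x application API and a hypothetical application name `adult-app`:

```python
from hydrosdk.application import ApplicationBuilder, ExecutionStageBuilder

# A single-stage pipeline; weight=100 routes all traffic to this version
stage = ExecutionStageBuilder().with_model_variant(mv, weight=100).build()

app = ApplicationBuilder("adult-app").with_stage(stage).build(cluster)
app.lock_while_starting()
```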
Predictors provide a `predict` method that we can use to send data to the model. We can make predictions for our test set after converting it to a list of dictionaries. You can fetch the results using the name we gave to the output of the signature and store them in any format you prefer. Before making a prediction, give the application a short pause to finish all the necessary loading.
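For example (the output key `y` matches the signature built above):

```python
import time

time.sleep(10)  # give the servable time to finish loading

predictor = app.predictor()

# Send each test sample and collect the predicted class labels
results = [predictor.predict(sample)["y"] for sample in X_test.to_dict("records")]
print(results[:5])
```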
Explore the UI
If you want to interact with your model via the Hydrosphere UI, you can go to http://localhost. Here you can find all your models. Click on a model to view some basic information about it: versions, building logs, created applications, the model's environments, and other services associated with deployed models.
You might notice that after some time an additional model appears with the `metric` suffix at the end of its name. This is your automatically created monitoring model for outlier detection. Learn more about the Automatic Outlier Detection feature here.
🎉 You have successfully finished this tutorial! 🎉
Next Steps
Next, you can:
- Go to the next tutorial and learn how to create a custom Monitoring Metric and attach it to your deployed model.
- Explore the extended part of this tutorial to learn how to use YAML resource definitions to upload a ModelVersion and create an Application.
Deploy a model with CLI and Resource Definitions
Another way to upload your model is to apply a resource definition. This process repeats all the previous steps, such as data preparation and training. The difference is that instead of the SDK, we use the CLI to apply a resource definition.
A resource definition is a file that defines the inputs and outputs of a model, a signature function, and some other metadata required for serving. Go to the root directory of the model and create a `serving.yaml` file. You should get the following file structure:
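```
model
├── model.joblib
├── train.csv
├── requirements.txt
├── serving.yaml
└── src
    └── func_main.py
```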
Model deployment with a resource definition repeats all the steps of deployment with the SDK, but in one file. A considerable advantage of using a resource definition is that, besides describing your model, it lets you create an application by simply adding an Application object after the `---` separator at the bottom. Just name your application and provide the name and version of the model you want to tie to it.
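A sketch of `serving.yaml` under these assumptions: the model name `adult_model`, the runtime used earlier, and only two inputs shown (repeat the pattern for the remaining descriptors):

```yaml
kind: Model
name: adult_model
runtime: hydrosphere/serving-runtime-python-3.7:3.0.0-dev4
install-command: pip install -r requirements.txt
payload:
  - src/
  - requirements.txt
  - model.joblib
training-data: train.csv
contract:
  name: predict
  inputs:
    age:
      shape: scalar
      type: int64
      profile: numerical
    workclass:
      shape: scalar
      type: int64
      profile: categorical
    # ... repeat for the remaining descriptors
  outputs:
    y:
      shape: scalar
      type: string
      profile: text
---
kind: Application
name: adult_application
singular:
  model: adult_model:1
```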
To start uploading, run `hs apply -f serving.yaml`. To monitor your model, you can use the Hydrosphere UI as shown previously.