In this tutorial, you will learn how to train and deploy a model for a classification task based on the Adult Dataset. The main steps of this process are data preparation, training a model, uploading a model to the cluster, and making a prediction on test samples.
By the end of this tutorial you will know how to:
Prepare data
Train a model
Deploy a model with SDK
Explore models via UI
Deploy a model with CLI and resource definition
For this tutorial, you need to have Hydrosphere Platform deployed and Hydrosphere CLI (hs) along with Python SDK (hydrosdk) installed on your local machine. If you don't have them yet, please follow these guides first:
For this tutorial, you can use a local cluster. To check which cluster is active, run hs cluster in your terminal. This command shows the name and server address of the cluster you’re currently using. If it shows that you're not using a local cluster, you can configure one with the following commands:
Model training always requires some amount of initial preparation, most of which is data preparation. The Adult Dataset consists of 14 descriptors, 5 of which are numerical and 9 categorical, including the class column.
Categorical features are usually presented as strings. This is not an appropriate data type for feeding into a model, so we need to transform them first. We also remove the rows that contain question marks, which stand for missing values in this dataset. Once the preprocessing is complete, you can delete the DataFrame (df).
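As a rough illustration of these two steps, here is a dependency-free sketch that drops rows containing the "?" placeholder and maps each string category to an integer code. The toy rows, the column subset, and the helper names drop_missing and encode_categorical are assumptions made for this example; the tutorial itself works on a pandas DataFrame, and the sketch assumes values have already been stripped of surrounding whitespace.

```python
# Toy rows standing in for the Adult Dataset (values are illustrative only).
rows = [
    {"age": "39", "workclass": "State-gov", "income": "<=50K"},
    {"age": "50", "workclass": "?", "income": ">50K"},
    {"age": "38", "workclass": "Private", "income": ">50K"},
    {"age": "28", "workclass": "State-gov", "income": "<=50K"},
]
categorical = ["workclass", "income"]  # hypothetical subset of the columns


def drop_missing(rows):
    """Keep only rows in which no value is the '?' placeholder."""
    return [row for row in rows if "?" not in row.values()]


def encode_categorical(rows, columns):
    """Replace each string category with an integer code, column by column."""
    encoded = [dict(row) for row in rows]  # copy so the input stays untouched
    mappings = {}
    for column in columns:
        categories = sorted({row[column] for row in rows})
        mappings[column] = {name: code for code, name in enumerate(categories)}
        for row in encoded:
            row[column] = mappings[column][row[column]]
    return encoded, mappings


clean = drop_missing(rows)
encoded, mappings = encode_categorical(clean, categorical)
for row in encoded:
    row["age"] = int(row["age"])  # numerical columns only need a type cast
```

Keeping the mappings around is useful because the same category-to-code mapping has to be applied to any sample you later send to the deployed model.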
There are many classifiers that you can potentially use for this step. In this example, we’ll apply the Random Forest classifier. After preprocessing, the dataset is split into train and test subsets. The test set will be used to check whether our deployed model can process requests on the cluster. After the training step, we can save the model with joblib.dump() in the model/ folder.
The easiest way to upload a model to your cluster is by using the Hydrosphere SDK. The SDK allows Python developers to configure and manage the model lifecycle on the Hydrosphere platform. Before uploading a model, you need to connect to your cluster:
Next, we need to create an inference script to be uploaded to the Hydrosphere platform. This script will be executed each time a model servable is instantiated. Let's name our function file func_main.py and store it in the src folder inside the directory where your model is stored. Your directory structure should look like this:
The code in func_main.py should be as follows:
It’s important to make sure that the variables are in the right order after we transform our dictionary for a prediction. So in cols we preserve the column names as a list, in the order in which they appear in the DataFrame.
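To make the ordering concern concrete, here is a minimal sketch of turning a request dictionary into a feature row. The column names and the helper to_feature_row are hypothetical and chosen only for illustration; the real cols list should hold the full set of training columns.

```python
# Hypothetical subset of the training columns, in DataFrame order.
cols = ["age", "workclass", "education", "hours_per_week"]


def to_feature_row(request, columns=cols):
    """Arrange a request dictionary into the column order used for training."""
    missing = [name for name in columns if name not in request]
    if missing:
        raise KeyError(f"missing inputs: {missing}")
    return [request[name] for name in columns]


# Inputs can arrive in any order; the output order is fixed by cols.
request = {"hours_per_week": 40, "age": 39, "education": 9, "workclass": 5}
row = to_feature_row(request)
```

Indexing by the stored column list, rather than relying on the order of the incoming dictionary, keeps each value in the position the model saw during training.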
To start working with the model in a cluster, we need to install the necessary libraries used in func_main.py. Create a requirements.txt in the folder with your model and add the following libraries to it:
After this, your model directory with all necessary dependencies should look as follows:
Now we are ready to upload our model to the cluster.
Hydrosphere Serving has a strictly typed inference engine, so before uploading our model we need to specify its signature with SignatureBuilder. A signature contains information about which method inside func_main.py should be called, as well as the shapes and types of its inputs and outputs.
Use X.dtypes to check what types of data you have for each column. You can use int64 fields for all variables, including income. Income is our dependent variable, and we can name it 'y' in the signature for further prediction.
In addition, you can specify the type of profiling for each variable using ProfilingType, so that Hydrosphere knows what each variable represents and analyzes it accordingly. For this purpose, we can create a dictionary whose keys are our variables and whose values are our profiling types. Alternatively, you can describe them one by one as a parameter of each input.
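The dictionary approach can be sketched as follows. The ProfileKind enum is a stand-in defined here so that the example runs without hydrosdk installed; in real code you would use the SDK's ProfilingType instead. The column names are hypothetical as well.

```python
from enum import Enum


class ProfileKind(Enum):
    """Stand-in for the SDK's ProfilingType so the sketch runs without hydrosdk."""
    NUMERICAL = "numerical"
    CATEGORICAL = "categorical"


# Hypothetical column lists; use the real column names from your DataFrame.
numerical = ["age", "hours_per_week"]
categorical = ["workclass", "education"]
columns = ["age", "workclass", "education", "hours_per_week"]

# Keys are variable names, values are the profiling type for that variable.
profiles = {name: ProfileKind.NUMERICAL for name in numerical}
profiles.update({name: ProfileKind.CATEGORICAL for name in categorical})

# One (name, dtype, profile) triple per input, in column order.
input_specs = [(name, "int64", profiles[name]) for name in columns]
```

Each triple could then be fed to the signature builder one input at a time, which avoids repeating the profiling type by hand for every variable.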
Finally, we can complete our signature with the .build() method.
Next, we need to specify which files will be uploaded to the cluster. We use path to define the root model folder and payload to list the paths to all files that we need to upload.
At this point, we can combine all our efforts into the LocalModel object. LocalModels are models before they get uploaded to the cluster. They contain all the information required to instantiate a ModelVersion in a Hydrosphere cluster. We’ll name this model adult_model.
Additionally, we need to specify the environment in which our model will run. Such environments are called Runtimes. In this tutorial, we will use the default Python 3.7 runtime. This runtime uses the src/func_main.py script as an entry point, which is the reason we organized our files the way we did.
One more parameter that you can define is a path to the training data of your model, required if you want to utilize additional services of Hydrosphere (for example, Automatic Outlier Detection).
Now we are ready to upload our model to the cluster. This process consists of several steps:
Once LocalModel is prepared, we can apply the upload method to upload it.
Then we can lock any interaction with the model until it has been successfully uploaded.
ModelVersion lets us check whether our model was successfully uploaded by looking it up on the platform.
To deploy a model, you should create an Application, which is a linear pipeline of ModelVersions with monitoring and other benefits. Applications provide Predictor objects, which should be used for data inference purposes.
Predictors provide a predict method, which we can use to send our data to the model. We can try to make predictions for our test set, which has first been converted to a list of dictionaries. You can read the results using the name you gave to the output in the signature and store them in any format you prefer. Before making a prediction, don't forget to pause briefly so that everything can finish loading.
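The request preparation and the result lookup can be sketched without a running cluster. In the example below, fake_predict is a hypothetical stand-in for the predictor's predict method, and the column names, sample values, and helper predict_all are assumptions made for illustration only.

```python
import time

columns = ["age", "workclass", "hours_per_week"]  # hypothetical column names
test_rows = [[39, 5, 40], [50, 3, 13]]            # illustrative encoded samples

# One dictionary per sample, keyed by input name.
requests = [dict(zip(columns, row)) for row in test_rows]


def fake_predict(request):
    """Hypothetical stand-in for the predictor's predict method; not a real model."""
    return {"y": int(request["hours_per_week"] >= 40)}


def predict_all(predict, requests, output_name="y", pause_seconds=0.0):
    """Wait briefly, then send requests one by one and collect the named output."""
    time.sleep(pause_seconds)
    return [predict(request)[output_name] for request in requests]


results = predict_all(fake_predict, requests)
```

With a real predictor you would pass its predict method in place of fake_predict and set pause_seconds to a value that gives the application time to become ready.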
If you want to interact with your model via Hydrosphere UI, you can go to http://localhost. Here you can find all your models. Click on a model to view information about it: versions, building logs, created applications, model's environments, and other services associated with deployed models.
You might notice that after some time an additional model appears with the metric suffix at the end of its name. This is your automatically created monitoring model for outlier detection. Learn more about the Automatic Outlier Detection feature here.
🎉 You have successfully finished this tutorial! 🎉
Next, you can:
Go to the next tutorial and learn how to create a custom Monitoring Metric and attach it to your deployed model:
Explore the extended part of this tutorial to learn how to use YAML resource definitions to upload a ModelVersion and create an Application.
Another way to upload your model is to apply a resource definition. This process repeats all the previous steps, such as data preparation and training. The difference is that instead of the SDK, we use the CLI to apply a resource definition.
A resource definition is a file that defines the inputs and outputs of a model, a signature function, and some other metadata required for serving. Go to the root directory of the model and create a serving.yaml file. You should get the following file structure:
Model deployment with a resource definition repeats all the steps of that with SDK, but in one file. A considerable advantage of using a resource definition is that besides describing your model it allows creating an application by simply adding an object to the contract after the separation line at the bottom. Just name your application and provide the name and version of a model you want to tie to it.
To start uploading, run hs apply -f serving.yaml. To monitor your model, you can use the Hydrosphere UI as shown previously.