This is an entry-point tutorial to the Hydrosphere platform. Estimated completion time: 13 min.
In this tutorial, you will learn the basics of working with Hydrosphere. We will prepare an example model for serving, deploy it to Hydrosphere, turn it into an application, invoke it locally, and use monitoring. As an example model, we will take a simple logistic regression model fit with randomly generated data, with some noise added to it.
By the end of this tutorial you will know how to:
Prepare a model for Hydrosphere
Serve a model on Hydrosphere
Create an Application
Invoke an Application
Use basic monitoring
For this tutorial, you need to have Hydrosphere Platform deployed and Hydrosphere CLI (hs) along with Python SDK (hydrosdk) installed on your local machine. If you don't have them yet, please follow their installation guides first.
To let hs know where the Hydrosphere platform runs, configure a new cluster entity:
In the next two sections, we will prepare a model for deployment to Hydrosphere. It is important to stick to a specific folder structure during this process to let hs parse and upload the model correctly. Make sure that the structure of your local model directory looks like this by the end of the model preparation section:
train.py - a training script for our model
requirements.txt - provides dependencies for our model
model.joblib - a model artifact that we get as a result of model training
src/func_main.py - an inference script that defines a function for making model predictions
serving.yaml - a resource definition file to let Hydrosphere know which function to call from the func_main.py script and let the model manager understand the model's inputs and outputs
While Hydrosphere is a post-training platform, let's start with basic training steps to have a shared context.
As mentioned before, we will use a logistic regression model, sklearn.linear_model.LogisticRegression. For data generation, we will use the sklearn.datasets.make_regression method.
First, create a directory for your model and create a new train.py inside:
Put the following code for your model in the train.py file:
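The original listing is not reproduced here, so the following is only a minimal sketch of what train.py could look like. It assumes two input features, a target binarized at its mean (make_regression returns a continuous target, while logistic regression needs classes), and a fixed random seed. None of these choices come from the tutorial itself.

```python
import joblib
from sklearn.datasets import make_regression
from sklearn.linear_model import LogisticRegression


def make_data(n_samples=1000, n_features=2, seed=42):
    """Generate noisy synthetic features and a binary target (assumed setup)."""
    X, y = make_regression(
        n_samples=n_samples, n_features=n_features, noise=10.0, random_state=seed
    )
    # make_regression returns a continuous target; threshold it at the mean
    # so that LogisticRegression has two classes to separate.
    return X, (y > y.mean()).astype(int)


def train(path="model.joblib"):
    """Fit the model and save it as the artifact that serving will load."""
    X, y = make_data()
    model = LogisticRegression()
    model.fit(X, y)
    joblib.dump(model, path)
    return model


if __name__ == "__main__":
    train()
```

Fixing the seed keeps the generated data reproducible, which is convenient if you want to export the same training set later.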
Next, we need to install all the necessary libraries for our model. In your logistic_regression folder, create a requirements.txt file and provide dependencies inside:
Install all the dependencies to your local environment with pip install -r requirements.txt.
Train the model by running python train.py.
As soon as the script finishes, you will get the model saved to a model.joblib file.
Every model in the Hydrosphere cluster is deployed as an individual container. After a request is sent from the client application, it is passed to the appropriate Docker container with your model deployed on it. An important detail is that all model files are stored in the /model/files directory inside the container, so we will look there to load the model.
To run our model we will use a Python runtime that can execute any Python code you provide. Model preparation is pretty straightforward, but you have to create a specific folder structure described in the "Before you start" section.
Let's create the main file func_main.py in the src folder of your model directory:
Hydrosphere communicates with the model using TensorProto messages. Normally, to perform a transformation or inference on a received TensorProto message, you would have to retrieve its contents, process them, and pack the result back into a TensorProto message. The pre-built Python runtime automatically converts TensorProto messages to NumPy arrays, so you don't need to interact with TensorProto messages directly.
To do inference you have to define a function that will be invoked every time Hydrosphere handles a request and passes it to the model. Inside that function, you have to call a predict (or similar) method of your model and return your predictions:
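As a rough illustration, a func_main.py along these lines would fit the description. It is a sketch under several assumptions: the input fields are named x1 and x2, the output field is named y, the runtime passes each input as a keyword argument and expects a dict back, and the MODEL_PATH override exists only so the file can be imported outside the container.

```python
import os

import joblib
import numpy as np

# Inside the serving container the model files live in /model/files.
# The MODEL_PATH override is an assumption added for running this locally.
MODEL_PATH = os.environ.get("MODEL_PATH", "/model/files/model.joblib")

# Loaded once at import time, not on every request.
model = joblib.load(MODEL_PATH) if os.path.exists(MODEL_PATH) else None


def infer(x1, x2):
    """Serving function: receives the request fields and returns predictions."""
    # Arrange the two scalar inputs as a single row of features.
    features = np.array([[x1, x2]], dtype=float)
    y = model.predict(features)
    # Return a dict keyed by the (assumed) output field name.
    return {"y": int(y[0])}
```

Whatever names you pick for the function and its fields have to match the ones declared in the resource definition file.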
Inside func_main.py we initialize our model outside of the serving function infer, so the model is loaded once, when the module is imported, rather than every time a new request comes in.
The infer function takes the actual request, unpacks it, makes a prediction, packs the answer, and returns it. There is no strict rule for naming this function; it just has to be a valid Python function name.
To let Hydrosphere know which function to call from the func_main.py file, we have to provide a resource definition file. This file will define a function to be called, inputs and outputs of a model, a signature function, and some other metadata required for serving.
Create a resource definition file serving.yaml in the root of your model directory logistic_regression:
Inside serving.yaml we also provide requirements.txt and model.joblib as payload files to our model:
At this point make sure that the overall structure of your local model directory looks as shown in the "Before you start" section.
Although we have train.py inside the model directory, it will not be uploaded to the cluster since we are not listing it under payload in the resource definition file.
Now we are ready to upload our model to Hydrosphere. To do so, inside the logistic_regression model directory run:
To see your uploaded model, open http://localhost/models.
If your newly uploaded model is listed on the models page but you cannot use it yet, it is probably still in the building stage. Wait until the model changes its status to Released, then you can use it.
Once you have opened your model in the UI, you can create an application for it. Basically, an application represents an endpoint to your model, so you can invoke it from anywhere. To learn more about advanced features, go to the Applications page.
Open http://localhost/applications and press the "Add New Application" button. In the opened window select the logistic_regression model, name your application logistic_regression and click the "Add Application" button.
Invoking applications is available via different interfaces. For this tutorial, we will cover calling the created Application by gRPC via our Python SDK.
To install the SDK, run pip install hydrosdk.
Define a gRPC client on your side and make a call from it:
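A client along these lines could serve as a starting point. Treat it as a sketch: the hydrosdk names used in connect (Cluster, Application.find, predictor), the localhost:9090 gRPC address, and the x1/x2 field names are assumptions that you should verify against your installed SDK version and your model's signature. The sending loop accepts any callable, so it can be tried without a running cluster.

```python
def connect(app_name="logistic_regression"):
    """Return a predict callable for a deployed application (assumed SDK calls)."""
    # Imported lazily so the rest of the module works without hydrosdk installed.
    from hydrosdk import Application, Cluster

    cluster = Cluster("http://localhost", grpc_address="localhost:9090")
    app = Application.find(cluster, app_name)
    return app.predictor().predict


def send_all(predict, payloads):
    """Call predict once per payload and collect the responses in order."""
    return [predict(payload) for payload in payloads]


# Hypothetical payloads; field names must match the model's declared inputs.
SAMPLE_PAYLOADS = [
    {"x1": 0.1, "x2": -0.2},
    {"x1": 1.5, "x2": 0.7},
]
```

With a running cluster, send_all(connect(), SAMPLE_PAYLOADS) would send both payloads to the application and return its responses.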
Hydrosphere Platform has multiple tools for data drift monitoring:
Data Drift Report
Automatic Outlier Detection
Profiling
In this tutorial, we'll look at the monitoring dashboard and Automatic Outlier Detection feature.
Hydrosphere Monitoring relies heavily on training data. Users must provide training data to enable monitoring features.
To provide training data, users need to add the training-data=<path_to_csv> field to the serving.yaml file. Run the following script to save the training data used in previous steps as a training_data.csv file:
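One way to produce that file is sketched below. It assumes the training set can be regenerated deterministically (same make_regression parameters and seed as during training, with the target binarized at its mean) and that the columns are named x1, x2, and y to match the model's fields. All of these are illustrative choices, not values taken from the tutorial.

```python
import pandas as pd
from sklearn.datasets import make_regression


def save_training_data(path="training_data.csv", seed=42):
    """Regenerate the training set and write it as CSV with named columns."""
    X, y = make_regression(
        n_samples=1000, n_features=2, noise=10.0, random_state=seed
    )
    # Column names are assumed to match the model's input and output fields.
    df = pd.DataFrame(X, columns=["x1", "x2"])
    df["y"] = (y > y.mean()).astype(int)
    df.to_csv(path, index=False)
    return df


if __name__ == "__main__":
    save_training_data()
```

If your training script does not use a fixed seed, it is safer to write the CSV from inside train.py at training time so the file matches the fitted model exactly.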
Next, add the training data field to the model definition inside the serving.yaml file:
Now we are ready to upload our model. Run the following command to create a new version of the logistic_regression model:
Open the http://localhost/models page to see that there are now two versions of the logistic_regression model.
For each model with uploaded training data, Hydrosphere creates an outlier detection metric, which assigns an outlier score to each request. This metric labels a request as an outlier if its outlier score is greater than the 97th percentile of the outlier scores computed on the training data.
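The thresholding rule itself can be illustrated in a few lines of NumPy. This sketch only shows the percentile comparison; how Hydrosphere computes the outlier scores is not described here, so the scores below are made-up numbers and the function names are hypothetical.

```python
import numpy as np


def outlier_threshold(training_scores, percentile=97.0):
    """Return the score value at the given percentile of the training scores."""
    return float(np.percentile(training_scores, percentile))


def is_outlier(score, threshold):
    """A request is an outlier when its score is strictly above the threshold."""
    return score > threshold


# Made-up training scores 1, 2, ..., 100 for illustration only.
training_scores = np.arange(1, 101)
threshold = outlier_threshold(training_scores)

print(threshold)                    # about 97.03 with linear interpolation
print(is_outlier(99.0, threshold))  # True
print(is_outlier(50.0, threshold))  # False
```

By construction, roughly 3% of requests that follow the training distribution will still be flagged, so a handful of outliers in healthy traffic is expected.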
Let's send some data to our new model version. To do so, we need to update our logistic_regression application. To update it, we can go to the Application tab and click the "Update" button.
After updating our Application, we can reuse our old code to send some data:
You can monitor your data quality in the Monitoring Dashboard:
The Monitoring dashboard plots all requests streaming through a model version as rectangles colored according to how "healthy" they are. On the horizontal axis, we group our data by batches and on the vertical axis, we group data by signature fields. In this plot, cells are determined by their batch and field. Cells are colored from green to red, depending on the average request health inside this batch.
To check whether our metric will be able to detect data drifts, let's simulate one and send data from another distribution. To do so, let's slightly modify our code:
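A simple way to simulate drift is to draw request features from a distribution with a different mean and spread than the training data. The sketch below assumes two input fields named x1 and x2 and training features that are roughly standard normal (which is what make_regression produces). The shift values are arbitrary illustrations.

```python
import numpy as np


def sample_requests(n, mean=0.0, std=1.0, seed=0):
    """Build n request payloads with normally distributed x1 and x2 values."""
    rng = np.random.default_rng(seed)
    rows = rng.normal(loc=mean, scale=std, size=(n, 2))
    return [{"x1": float(a), "x2": float(b)} for a, b in rows]


# Traffic resembling the training data versus deliberately shifted traffic.
normal_traffic = sample_requests(100)
drifted_traffic = sample_requests(100, mean=10.0, std=3.0)  # hypothetical shift
```

Each payload can then be passed to whatever client you use to invoke the application. The larger the shift, the more clearly the drifted batch should stand out from the normal one.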
You can validate that your model was able to detect data drifts on the monitoring dashboard.