Estimated Completion Time: 18m.
In this tutorial, you will learn how to create a custom anomaly detection metric for a specific use case.
Let's take the problem described in the previous Train & Deploy Census Income Classification Model tutorial as a use case, with the census income dataset as a data source. We will monitor a model that classifies whether the income of a given person exceeds $50,000 per year.
By the end of this tutorial you will know how to:
Train a monitoring model
Deploy a monitoring model with SDK
Manage custom metrics with UI
Upload a monitoring model with CLI
For this tutorial, you need to have Hydrosphere Platform deployed and Hydrosphere CLI (hs) along with Python SDK (hydrosdk) installed on your local machine. If you don't have them yet, please follow these guides first:
This tutorial is a sequel to the previous tutorial. Please complete it first to have a prepared dataset and a trained model deployed to the cluster:
We start with the steps we used for the common model. First, let's create a directory structure for our monitoring model with a /src folder containing an inference script func_main.py. We also need the training data from the previous tutorial, which we can copy directly to the newly created directory:
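If you prefer to prepare the layout from Python, the step above can be sketched as follows. The directory name monitoring_model and the file name train.csv are assumptions made for illustration, so substitute the names you used in the previous tutorial. The demo runs in a temporary directory so that nothing on disk is overwritten.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_monitoring_dir(root: Path, train_csv: Path) -> Path:
    """Create <root>/src/func_main.py and copy the training data into <root>."""
    src_dir = root / "src"
    src_dir.mkdir(parents=True, exist_ok=True)
    # Empty placeholder; the inference code is added later in the tutorial.
    (src_dir / "func_main.py").touch()
    # Copy the training data next to the src/ folder.
    shutil.copy(train_csv, root / train_csv.name)
    return root


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    train_csv = workspace / "train.csv"  # hypothetical file name
    train_csv.write_text("age,hours-per-week\n39,40\n")
    model_dir = prepare_monitoring_dir(workspace / "monitoring_model", train_csv)
    created = sorted(p.relative_to(model_dir).as_posix() for p in model_dir.rglob("*"))
    print(created)
```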
As a monitoring metric, we will use IsolationForest. You can learn how it works here. In this example we are going to use the PyOD library, which is dedicated specifically to anomaly detection algorithms. Let's install it first.
The whole process is similar to what we usually do with common machine learning models. Let's import the necessary libraries, train our outlier detection model, and save it in our working directory. For training, we should use the same training data as for our prediction model.
This is what the probability distribution of our inliers looks like. It depends directly on the conversion method you choose. In our case we have applied a linear conversion, which scales outlier scores to the range [0, 1] using Min-Max values. Remember that the model must be fitted first. By choosing a contamination parameter we can adjust the threshold that separates inliers from outliers. Choose it carefully to avoid critical prediction mistakes. Otherwise, you can also stay with 'auto'. To create a monitoring metric, we have to deploy that Isolation Forest model as a separate model on the Hydrosphere platform.
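To make the linear conversion and the role of the contamination parameter more concrete, here is a minimal pure-Python sketch. It is not the PyOD implementation. It assumes that a higher score means a more anomalous sample and that contamination is the fraction of training samples expected to be outliers, so the threshold is taken as the corresponding percentile of the training scores. All function names are hypothetical.

```python
from typing import List, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0-100) using linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def contamination_threshold(train_scores: Sequence[float], contamination: float) -> float:
    """Score above which a sample is treated as an outlier."""
    return percentile(train_scores, 100.0 * (1.0 - contamination))


def linear_probability(train_scores: Sequence[float], scores: Sequence[float]) -> List[float]:
    """Min-Max scale scores to [0, 1] using the range seen during training."""
    lo, hi = min(train_scores), max(train_scores)
    span = (hi - lo) or 1.0  # avoid division by zero for constant scores
    # Scores outside the training range are clipped to the [0, 1] interval.
    return [min(1.0, max(0.0, (s - lo) / span)) for s in scores]


train_scores = [float(i) for i in range(11)]  # toy training scores 0..10
threshold = contamination_threshold(train_scores, contamination=0.1)
probabilities = linear_probability(train_scores, [5.0, 12.0, -3.0])
print(threshold, probabilities)
```

With a contamination of 0.1, roughly the top 10% of training scores fall above the threshold. The scaling needs the minimum and maximum of the training scores, which is why the model has to be fitted before scores can be converted.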
First, let's create a new directory where we will store our inference script with the declared serving function and its definitions. Put the following code inside the src/func_main.py file:
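As a rough guide to the shape of such a script, here is a hedged sketch of a serving function. It assumes the trained detector exposes a decision_function method, as PyOD and scikit-learn detectors do, and that the script returns a single field named value. The feature list, the make_predict helper, and the StubModel class are hypothetical and exist only to keep the example runnable. In a real script you would load the saved model once at import time and list the exact inputs of the monitored model, with any categorical features already encoded as they were for training.

```python
from typing import Any, Callable, Dict, Sequence

# Hypothetical feature list; use the exact inputs of the monitored model.
FEATURES = ["age", "education-num", "hours-per-week"]


def make_predict(model: Any, features: Sequence[str]) -> Callable[..., Dict[str, float]]:
    """Build a serving function that returns one anomaly score per request."""

    def predict(**kwargs: Any) -> Dict[str, float]:
        # Keep only the feature inputs, in training order. Extra fields, such as
        # the monitored model's own output, are ignored in this sketch.
        row = [kwargs[name] for name in features]
        score = model.decision_function([row])[0]
        return {"value": float(score)}

    return predict


class StubModel:
    """Stand-in for the trained detector, used only to make the sketch runnable."""

    def decision_function(self, rows):
        return [sum(row) / 100.0 for row in rows]


predict = make_predict(StubModel(), FEATURES)
result = predict(**{"age": 39, "education-num": 13, "hours-per-week": 40, "class": 0})
print(result)
```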
Next, we need to install the necessary libraries. Create a requirements.txt and add the following libraries to it:
Just like with common models, we can use the SDK to upload our monitoring model and bind it to the trained one. The steps are almost the same, with some slight differences:
First, since we want to predict the anomaly score instead of the sample class, we need to change the type of the output field from 'str' to 'float64'.
Next, we need to find our model on the cluster before assigning it to our prediction model. The ModelVersion class has a specific method for this called .find().
Finally, we need to apply a couple of new methods to create a metric. MetricSpec is responsible for creating a metric for a specific model with a specific MetricSpecConfig, which describes the parameters of our metric, such as the probability threshold and the principle by which the metric should detect outliers. In this case, .LESS denotes that every value below the provided threshold is defined as an inlier.
Anomaly scores are obtained through traffic shadowing inside Hydrosphere's engine after a Servable is created, so you don't need to perform any additional manipulations.
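The signature change from the first step can be pictured with plain dictionaries. This is only an illustration and does not use the hydrosdk API. It assumes that the monitoring model receives the inputs of the monitored model together with its outputs and returns a single value field of type 'float64'. The field names and the helper function are hypothetical.

```python
from typing import Dict


def monitoring_signature(
    model_inputs: Dict[str, str], model_outputs: Dict[str, str]
) -> Dict[str, Dict[str, str]]:
    """Derive the monitoring model's fields from the monitored model's fields."""
    clash = set(model_inputs) & set(model_outputs)
    if clash:
        raise ValueError(f"fields used as both input and output: {sorted(clash)}")
    return {
        # The monitoring model sees everything the monitored model saw and produced.
        "inputs": {**model_inputs, **model_outputs},
        # It returns an anomaly score instead of a class label.
        "outputs": {"value": "float64"},
    }


signature = monitoring_signature(
    {"age": "int64", "hours-per-week": "int64"},  # hypothetical subset of features
    {"class": "str"},  # hypothetical output field of the monitored model
)
print(signature)
```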
Go to the UI to observe and manage all your models. Here you will find 3 models on the left panel:
adult_model - a model that we trained for prediction in the previous tutorial
adult_monitoring_model - our monitoring model
adult_model_metric - a model that was created by Automatic Outlier Detection
Click on the trained model and then on Monitoring. On the monitoring dashboard you now have two external metrics: the first one is auto_od_metric, which was automatically generated by Automatic Outlier Detection, and the new one is custom_metric, which we have just created. You can also change settings for existing metrics and configure new ones in the Configure Metrics section:
During prediction, you will get anomaly scores for each sample in the form of a chart with two lines. The curved line shows the scores, while the horizontal dotted line is our threshold. When the curve crosses the threshold, it might be a sign of a potential anomaly. However, this is not always the case, since many factors can affect the score, so be careful with your final interpretation.
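If you export the scores and want to locate the same crossings outside the UI, a small sketch like the one below may help. It assumes that values below the threshold are inliers, as described for the metric above. The function name and the sample scores are made up for illustration.

```python
from typing import List, Sequence


def threshold_crossings(scores: Sequence[float], threshold: float) -> List[int]:
    """Indices where the score rises from below the threshold to at or above it."""
    crossings = []
    for i in range(1, len(scores)):
        if scores[i - 1] < threshold <= scores[i]:
            crossings.append(i)
    return crossings


scores = [0.21, 0.35, 0.58, 0.64, 0.40, 0.71, 0.75]  # made-up anomaly scores
crossings = threshold_crossings(scores, threshold=0.6)
print(crossings)
```

Each returned index marks a request where the score moved into the suspicious range, which is a starting point for inspection rather than proof of an anomaly.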
Just like in the case with all other types of models, we can define and upload a monitoring model using a resource definition. We have to pack our model with a model definition, like in the previous tutorial.
The inputs of this model are the inputs of the target monitored model plus the outputs of that model. We will use the value field as an output for the monitoring model. The final directory structure should look like this:
From that folder, upload the model to the cluster:
Now we have to attach the deployed monitoring model as a custom metric. Let's create a monitoring metric for our pre-deployed classification model in the UI:
From the Models section, select the target model you would like to deploy and select the desired model version.
Open the Monitoring tab.
At the bottom of the page click the Configure Metric button.
From the opened window click the Add Metric button.
Specify the name of the metric.
Choose the monitoring model.
Choose the version of the monitoring model.
Select a comparison operator Greater. This means that if you have a metric value greater than a specified threshold, an alarm should be fired.
Set the threshold value. In this case, it should be equal to the value of monitoring_model.threshold_.
Click the Add Metric button.
That's it. Now you have a monitored income classifier deployed on the Hydrosphere platform.