Getting Started
This is an entry-point tutorial to the Hydrosphere platform. Estimated completion time: 13 min.
In this tutorial, you will learn the basics of working with Hydrosphere. We will prepare an example model for serving, deploy it to Hydrosphere, turn it into an application, invoke it locally, and use monitoring. As an example model, we will take a simple logistic regression model fit with randomly generated data, with some noise added to it.
By the end of this tutorial you will know how to:
- Prepare a model for Hydrosphere
- Serve a model on Hydrosphere
- Create an Application
- Invoke an Application
- Use basic monitoring
For this tutorial, you need to have the Hydrosphere platform deployed and the Hydrosphere CLI (`hs`) along with the Python SDK (`hydrosdk`) installed on your local machine. If you don't have them yet, please follow these guides first.

To let `hs` know where the Hydrosphere platform runs, configure a new cluster entity:

```shell
hs cluster add --name local --server http://localhost
hs cluster use local
```
In the next two sections, we will prepare a model for deployment to Hydrosphere. It is important to stick to a specific folder structure during this process to let `hs` parse and upload the model correctly. Make sure that the structure of your local model directory looks like this by the end of the model preparation section:

```
logistic_regression
├── model.joblib
├── train.py
├── requirements.txt
├── serving.yaml
└── src
    └── func_main.py
```

- `train.py` - a training script for our model
- `requirements.txt` - provides dependencies for our model
- `model.joblib` - a model artifact that we get as a result of model training
- `src/func_main.py` - an inference script that defines a function for making model predictions
- `serving.yaml` - a resource definition file that lets Hydrosphere know which function to call from the `func_main.py` script and lets the model manager understand the model's inputs and outputs
While Hydrosphere is a post-training platform, let's start with basic training steps to have a shared context.
As mentioned before, we will use the logistic regression model `sklearn.linear_model.LogisticRegression`. For data generation, we will use the `sklearn.datasets.make_blobs` method.

First, create a directory for your model and create a new `train.py` inside:

```shell
mkdir logistic_regression
cd logistic_regression
touch train.py
```

Put the following code for your model in the `train.py` file:

```python
import joblib
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Initialize data
X, y = make_blobs(n_samples=300, n_features=2, centers=[[-5, 1], [5, -1]])

# Create and fit a model
model = LogisticRegression()
model.fit(X, y)

# Save the trained model to a file
joblib.dump(model, "model.joblib")
```
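For intuition about what the fitted model computes at prediction time: logistic regression applies a sigmoid to a linear combination of the inputs. A minimal standard-library sketch (the coefficients here are hand-picked for illustration, not taken from the trained model):

```python
import math

# Hand-picked illustrative coefficients (NOT the fitted model's values)
w1, w2, b = 1.0, -1.0, 0.0

def predict_proba(x1, x2):
    """Probability of class 1 via the sigmoid of a linear combination."""
    z = w1 * x1 + w2 * x2 + b
    return 1.0 / (1.0 + math.exp(-z))

def predict(x1, x2):
    """Class label: 1 if the probability crosses 0.5, else 0."""
    return int(predict_proba(x1, x2) >= 0.5)

print(predict(5.0, -1.0))   # a point near the second blob center -> 1
print(predict(-5.0, 1.0))   # a point near the first blob center -> 0
```

The real coefficients live inside `model.joblib`; `sklearn` fits them from the generated blobs during `model.fit(X, y)`.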
Next, we need to install all the necessary libraries for our model. In your `logistic_regression` folder, create a `requirements.txt` file and provide the dependencies inside:

```
numpy~=1.18
scipy==1.4.1
scikit-learn~=0.23
joblib~=0.15
```

Install all the dependencies to your local environment:

```shell
pip install -r requirements.txt
```

Train the model:

```shell
python train.py
```
As soon as the script finishes, you will get the model saved to a `model.joblib` file.

Every model in the Hydrosphere cluster is deployed as an individual container. After a request is sent from the client application, it is passed to the appropriate Docker container with your model deployed on it. An important detail is that all model files are stored in the `/model/files` directory inside the container, so we will look there to load the model.

To run our model, we will use a Python runtime that can execute any Python code you provide. Model preparation is pretty straightforward, but you have to create the specific folder structure described in the "Before you start" section.

Let's create the main file `func_main.py` in the `src` folder of your model directory:

```shell
mkdir src
cd src
touch func_main.py
```
Hydrosphere communicates with the model using TensorProto messages. If you want to perform a transformation or inference on a received TensorProto message, you have to retrieve its contents, transform them, and pack the result back into a TensorProto message. The pre-built Python runtime automatically converts TensorProto messages to NumPy arrays, so the end user doesn't need to interact with TensorProto messages directly.
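Schematically, the round trip the runtime handles for you looks like this (plain dicts stand in for the real protobuf classes; this is an illustration of the idea, not runtime code):

```python
# Plain-dict stand-in for a scalar TensorProto message (illustrative only)
tensor_proto_like = {"dtype": "DT_DOUBLE", "tensor_shape": [], "double_val": [0.5]}

def unpack(msg):
    # Retrieve the scalar payload from the message
    return msg["double_val"][0]

def pack(value):
    # Wrap a scalar result back into the same message shape
    return {"dtype": "DT_DOUBLE", "tensor_shape": [], "double_val": [value]}

x = unpack(tensor_proto_like)
result = pack(x * 2)
print(result["double_val"][0])  # 1.0
```

With the pre-built Python runtime, both `unpack` and `pack` steps happen automatically, which is why `func_main.py` below works with plain numbers.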
To do inference, you have to define a function that will be invoked every time Hydrosphere handles a request and passes it to the model. Inside that function, you have to call the `predict` (or similar) method of your model and return your predictions:

```python
import joblib

# Load the model once, at container startup
model = joblib.load("/model/files/model.joblib")

def infer(x1, x2):
    # Make a prediction
    y = model.predict([[x1, x2]])
    # Return the scalar representation of y
    return {"y": y.item()}
```
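Before uploading, you can unit-test the response shape of `infer` locally with a stub in place of the real estimator (a sketch; `StubModel` is invented here purely for the test, and we index the list instead of calling `.item()` on a NumPy array):

```python
class StubModel:
    """Stands in for the fitted LogisticRegression during a local test."""
    def predict(self, rows):
        return [1 for _ in rows]  # pretend every sample is class 1

model = StubModel()

def infer(x1, x2):
    y = model.predict([[x1, x2]])
    # The real func_main.py calls y.item() on a NumPy array;
    # the stub returns a plain list, so we index it instead
    return {"y": y[0]}

response = infer(0.5, -1.2)
print(response)  # {'y': 1}
```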
Inside `func_main.py` we initialize our model outside of the serving function `infer`, so loading is not triggered every time a new request comes in.

The `infer` function takes the actual request, unpacks it, makes a prediction, packs the answer, and returns it. There is no strict rule for naming this function; it just has to be a valid Python function name.

To let Hydrosphere know which function to call from the `func_main.py` file, we have to provide a resource definition file. This file defines the function to be called, the inputs and outputs of the model, the signature function, and some other metadata required for serving.

Create a resource definition file `serving.yaml` in the root of your model directory `logistic_regression`:

```shell
cd ..
touch serving.yaml
```
Inside `serving.yaml` we also provide `requirements.txt` and `model.joblib` as payload files to our model:

```yaml
kind: Model
name: logistic_regression
runtime: hydrosphere/serving-runtime-python-3.7:3.0.0
install-command: pip install -r requirements.txt
payload:
  - src/
  - requirements.txt
  - model.joblib
contract:
  name: infer
  inputs:
    x1:
      shape: scalar
      type: double
      profile: numerical
    x2:
      shape: scalar
      type: double
      profile: numerical
  outputs:
    y:
      shape: scalar
      type: int64
      profile: categorical
```
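Conceptually, the `contract` section constrains every request to exactly the declared fields and types. A small sketch of that idea in plain Python (an illustration of the contract's meaning, not Hydrosphere's actual validation code):

```python
# Mirror of the contract's inputs section from serving.yaml
CONTRACT_INPUTS = {
    "x1": {"shape": "scalar", "type": "double"},
    "x2": {"shape": "scalar", "type": "double"},
}

def validate_request(request):
    """Raise ValueError if the request does not match the contract."""
    missing = set(CONTRACT_INPUTS) - set(request)
    if missing:
        raise ValueError(f"missing inputs: {sorted(missing)}")
    extra = set(request) - set(CONTRACT_INPUTS)
    if extra:
        raise ValueError(f"unexpected inputs: {sorted(extra)}")
    for name, spec in CONTRACT_INPUTS.items():
        if spec["type"] == "double" and not isinstance(request[name], float):
            raise ValueError(f"{name} must be a double")

validate_request({"x1": 0.5, "x2": -1.2})  # a valid request passes silently
```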
At this point make sure that the overall structure of your local model directory looks as shown in the "Before you start" section.
Although we have `train.py` inside the model directory, it will not be uploaded to the cluster, since we are not listing it under `payload` in the resource definition file.

Now we are ready to upload our model to Hydrosphere. To do so, run the following inside the `logistic_regression` model directory:

```shell
hs apply -f serving.yaml
```

If you cannot find your newly uploaded model on your models page, it is probably still in the building stage. Wait until the model changes its status to `Released`, then you can use it.

Once you have opened your model in the UI, you can create an application for it. Basically, an application represents an endpoint to your model, so you can invoke it from anywhere. To learn more about advanced features, go to the Applications page.

Creating an Application from the uploaded model
Open http://localhost/applications and press the `Add New Application` button. In the opened window select the `logistic_regression` model, name your application `logistic_regression`, and click the "Add Application" button.

Invoking applications is available via different interfaces. For this tutorial, we will cover calling the created Application via gRPC using our Python SDK.
To install the SDK, run:

```shell
pip install hydrosdk
```

Define a gRPC client on your side and make a call from it (`send_data.py`):

```python
from sklearn.datasets import make_blobs
from hydrosdk import Cluster, Application

cluster = Cluster("http://localhost", grpc_address="localhost:9090")
app = Application.find(cluster, "logistic_regression")
predictor = app.predictor()

X, _ = make_blobs(n_samples=300, n_features=2, centers=[[-5, 1], [5, -1]])
for sample in X:
    y = predictor.predict({"x1": sample[0], "x2": sample[1]})
    print(y)
```
Hydrosphere Platform has multiple tools for data drift monitoring:
1. Data Drift Report
2. Automatic Outlier Detection
3. Profiling
In this tutorial, we'll look at the monitoring dashboard and Automatic Outlier Detection feature.
Hydrosphere Monitoring relies heavily on training data. Users must provide training data to enable monitoring features.
To provide training data, users need to add the `training-data=<path_to_csv>` field to the `serving.yaml` file. Run the following script (`save_training_data.py`) to save the training data used in previous steps as a `training_data.csv` file:

```python
import pandas as pd
from sklearn.datasets import make_blobs

# Create training data
X, y = make_blobs(n_samples=300, n_features=2, centers=[[-5, 1], [5, -1]])

# Create a pandas.DataFrame from it
df = pd.DataFrame(X, columns=['x1', 'x2'])
df['y'] = y

# Save it as .csv
df.to_csv("training_data.csv", index=False)
```
Next, add the training data field to the model definition inside the `serving.yaml` file:

```yaml
kind: Model
name: logistic_regression
runtime: hydrosphere/serving-runtime-python-3.7:3.0.0
install-command: pip install -r requirements.txt
training-data: training_data.csv
payload:
  - src/
  - requirements.txt
  - model.joblib
contract:
  name: infer
  inputs:
    x1:
      shape: scalar
      type: double
      profile: numerical
    x2:
      shape: scalar
      type: double
      profile: numerical
  outputs:
    y:
      shape: scalar
      type: int64
      profile: categorical
```
Now we are ready to upload our model. Run the following command to create a new version of the `logistic_regression` model:

```shell
hs apply -f serving.yaml
```

Open the http://localhost/models page to see that there are now two versions of the `logistic_regression` model.

For each model with uploaded training data, Hydrosphere creates an outlier detection metric, which assigns an outlier score to each request. This metric labels a request as an outlier if its outlier score is greater than the 97th percentile of the training data outlier score distribution.
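The 97th-percentile rule itself is simple to illustrate with the standard library (the scores below are synthetic stand-ins; real outlier scores come from Hydrosphere's auto-generated metric):

```python
import random
import statistics

random.seed(42)

# Synthetic stand-in outlier scores for the training data
training_scores = [random.gauss(0.0, 1.0) for _ in range(1000)]

# statistics.quantiles with n=100 returns the 1st..99th percentiles;
# index 96 is the 97th percentile
threshold = statistics.quantiles(training_scores, n=100)[96]

def is_outlier(score):
    return score > threshold

print(is_outlier(0.0))   # a typical score -> False
print(is_outlier(10.0))  # far outside the training distribution -> True
```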
Let's send some data to our new model version. To do so, we need to update our `logistic_regression` application. To update it, go to the Applications tab and click the "Update" button:
Upgrading an application stage to a newer version
After updating our Application, we can reuse our old code (`send_data.py`) to send some data:

```python
from sklearn.datasets import make_blobs
from hydrosdk import Cluster, Application

cluster = Cluster("http://localhost", grpc_address="localhost:9090")
app = Application.find(cluster, "logistic_regression")
predictor = app.predictor()

X, _ = make_blobs(n_samples=300, n_features=2, centers=[[-5, 1], [5, -1]])
for sample in X:
    y = predictor.predict({"x1": sample[0], "x2": sample[1]})
    print(y)
```
You can monitor your data quality in the Monitoring Dashboard:

The Monitoring dashboard plots all requests streaming through a model version as rectangles colored according to how "healthy" they are. On the horizontal axis, we group our data by batches and on the vertical axis, we group data by signature fields. In this plot, cells are determined by their batch and field. Cells are colored from green to red, depending on the average request health inside this batch.
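To make the batch-averaging idea concrete, here is a toy sketch (the health scores and color thresholds are invented for illustration; the dashboard computes its own):

```python
# Invented per-request health scores in [0, 1]; 1.0 means fully healthy
health = [1.0, 0.9, 1.0, 0.2, 0.1, 0.3, 1.0, 0.8, 0.9]
BATCH_SIZE = 3

def batch_color(scores):
    """Map a batch's average health to a dashboard cell color."""
    avg = sum(scores) / len(scores)
    if avg >= 0.8:
        return "green"
    if avg >= 0.5:
        return "yellow"
    return "red"

# Group requests into fixed-size batches, one cell color per batch
batches = [health[i:i + BATCH_SIZE] for i in range(0, len(health), BATCH_SIZE)]
print([batch_color(b) for b in batches])  # ['green', 'red', 'green']
```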
To check whether our metric will be able to detect data drifts, let's simulate one and send data from another distribution. To do so, let's slightly modify our code (`send_bad_data.py`):

```python
from sklearn.datasets import make_blobs
from hydrosdk import Cluster, Application

cluster = Cluster("http://localhost", grpc_address="localhost:9090")
app = Application.find(cluster, "logistic_regression")
predictor = app.predictor()

# Change make_blobs centers to simulate a different distribution
X, _ = make_blobs(n_samples=300, n_features=2, centers=[[-10, 10], [0, 0]])
for sample in X:
    y = predictor.predict({"x1": sample[0], "x2": sample[1]})
    print(y)
```
You can validate that your model was able to detect data drifts on the monitoring dashboard.