Monitor anomalies with a custom Isolation Forest metric

On this page you will learn how to create a custom anomaly detection metric for a specific use case.

Overview

For this use case we have chosen a sample regression problem. We will monitor a model that predicts how many taxi pickups will occur in the next hour, based on observations from the past 5 hours. As a data source we will use a dataset from the NYC Taxi & Limousine Commission.

Before you start

We assume you already have a deployed instance of the Hydrosphere platform and the hs CLI installed on your local machine.

To let hs know where the Hydrosphere platform runs, configure a new cluster entity.

$ hs cluster add --name local --server http://localhost
$ hs cluster use local

Also, you should have a target regression model already deployed on the cluster. You can find the source code of the target model here.

Model training

As a monitoring model we will use an autoregressive stateful IsolationForest model, which will be continuously retrained on a window of 5 consecutive data samples.

We will skip most of the data preparation steps, for the sake of simplicity.

Python
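import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("../data/taxi_pickups.csv")
df.set_index(pd.to_datetime(df.pickup_datetime), inplace=True)
df.drop(["pickup_datetime"], axis=1, inplace=True)

# transform_to_sliding_windows belongs to the skipped data preparation steps
data, _ = transform_to_sliding_windows(df)
# behaviour="new" is accepted by scikit-learn 0.20.2; newer releases removed it
iforest = IsolationForest(
    n_jobs=-1, random_state=42, behaviour="new", contamination=0.03)
# fit_predict returns 1 for inliers and -1 for outliers
is_outlier = iforest.fit_predict(data)
# Find outliers in training data
outlier_indices = df.index[6:][is_outlier == -1]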
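
The data preparation that produces `data` is not shown on this page. As a rough illustration only, a sliding-window transform could look like the sketch below. The function name `make_sliding_windows` and the choice of the next value as the target are assumptions made for this example. It is not the tutorial's actual `transform_to_sliding_windows` helper, whose offsets may differ.

```python
def make_sliding_windows(values, window_len=5):
    """Hypothetical helper: split a series into fixed-length windows.

    Returns (windows, targets), where each window holds `window_len`
    consecutive values and the target is the value that follows it.
    """
    values = list(values)
    windows, targets = [], []
    for start in range(len(values) - window_len):
        windows.append(values[start:start + window_len])
        targets.append(values[start + window_len])
    return windows, targets


# Seven hourly pickup counts (made-up numbers) give two windows of five values.
windows, targets = make_sliding_windows([10, 12, 15, 11, 9, 14, 20])
```

Each row of `windows` plays the role of one training sample for the outlier detector, which is why the serving code later needs five buffered values before it can score anything.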

Model evaluation

To check that our model works properly, let’s plot the training data and the outliers.

Python
import matplotlib.pyplot as plt

plt.plot(df.index, df.pickups, label="Training data")
# Mark every detected outlier with a translucent vertical line
plt.vlines(outlier_indices, 0, 600, colors="red", alpha=0.2, label="Outliers")

plt.gcf().set_size_inches(25, 5)
plt.legend()

From the plot you can see a large number of anomalies at the end of January 2016. These outliers were caused by a travel ban due to “Snowzilla”.

Model release

To create a monitoring metric, we have to deploy the IsolationForest model as a separate model in Hydrosphere. Let’s save the trained model for serving.

Python
import joblib
joblib.dump(iforest, '../monitoring_model/iforest.joblib')

Create a new directory where we will declare the serving function and its definitions.

$ mkdir -p monitoring_model/src
$ cd monitoring_model
$ touch src/func_main.py

Inside the src/func_main.py file put the following code:

Python
import collections

import numpy as np
import hydro_serving_grpc as hs
from joblib import load

init_value = 1.0  # Default value, means that the sample is an 'inlier'
window_len = 5  # Length of the data sequence required by the model

# Rolling buffer of the most recent inputs; persists between infer() calls
window = collections.deque(maxlen=window_len)
outlier_detection_model = load('/model/files/iforest.joblib')


def infer(pickups_last_hour, pickups_next_hour):
    # serving.yaml defines that the type of input is int, so we take int_val
    # from the input sample. The pickups_next_hour parameter is the prediction
    # of the monitored target model; it is not used for scoring here.
    input_value = int(pickups_last_hour.int_val[0])

    window.append(input_value)
    if len(window) < window_len:
        # Not enough history yet, report the default 'inlier' value
        return pack_predict(init_value)

    prediction_vector = np.array(window)
    # Make a prediction: 1 for an inlier, -1 for an outlier
    result = outlier_detection_model.predict(
        prediction_vector.reshape(1, window_len))
    # Pack the answer as a plain Python float
    return pack_predict(float(result[0]))


def pack_predict(result):
    tensor = hs.TensorProto(
        dtype=hs.DT_DOUBLE,
        double_val=[result],
        tensor_shape=hs.TensorShapeProto()
    )
    return hs.PredictResponse(outputs={"value": tensor})
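
The warm-up behaviour of the serving function can be checked without Hydrosphere or a trained model. The sketch below reproduces only the buffering logic with a `collections.deque`. The names `make_scorer` and `stub_predict` are hypothetical, and the stub rule (a window containing 0 is an outlier) merely stands in for the IsolationForest.

```python
import collections

WINDOW_LEN = 5      # same window length as in func_main.py
INIT_VALUE = 1.0    # returned while the window is still filling up


def make_scorer(predict_fn):
    """Return a stateful scoring function around a stand-in `predict_fn`."""
    window = collections.deque(maxlen=WINDOW_LEN)

    def score(value):
        window.append(value)
        if len(window) < WINDOW_LEN:
            return INIT_VALUE            # not enough history yet
        return float(predict_fn(list(window)))

    return score


# Stand-in for the IsolationForest: flag windows containing a zero as outliers.
def stub_predict(window):
    return -1 if 0 in window else 1


score = make_scorer(stub_predict)
results = [score(v) for v in [30, 42, 37, 51, 45, 0, 48]]
```

The first four calls return the default value because fewer than five samples are buffered. From the fifth call on, every new sample pushes the oldest one out of the deque and the full window is scored.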

This model also has to be packed with a model definition in serving.yaml.

kind: Model
name: nyc_taxi_monitoring
runtime: "hydrosphere/serving-runtime-python-3.6:2.2.1"
install-command: "pip install -r requirements.txt"
payload:
  - "src/"
  - "requirements.txt"
  - "iforest.joblib"

contract:
  name: infer
  inputs:
    pickups_last_hour:
      shape: scalar
      type: int32
      profile: numerical
    pickups_next_hour:
      shape: scalar
      type: int32
      profile: numerical
  outputs:
    value:
      shape: scalar
      type: double
      profile: numerical

The inputs of this model are the inputs of the monitored model plus its outputs. As the output of the monitoring model we will use the value field.
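
To make the contract concrete, the sketch below shows which fields a single monitoring call carries for one request. The dictionaries and numbers are made up, and how Hydrosphere assembles the call internally is not covered here, so treat this purely as an illustration of the field names declared in the contract.

```python
# Input that was sent to the monitored regression model (made-up value).
monitored_inputs = {"pickups_last_hour": 312}
# Prediction returned by the monitored model (made-up value).
monitored_outputs = {"pickups_next_hour": 298}

# The monitoring model receives both sets of fields together.
monitoring_inputs = {**monitored_inputs, **monitored_outputs}

# Field names declared under contract.inputs in serving.yaml.
contract_inputs = {"pickups_last_hour", "pickups_next_hour"}
missing = contract_inputs - set(monitoring_inputs)
```

An empty `missing` set means every input named in the contract is covered by the monitored model's request and response.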

Pay attention to the model’s payload. It contains the src folder that we have just created, requirements.txt with all dependencies, and iforest.joblib, i.e. our newly trained serialized IsolationForest model.

requirements.txt looks like this:

joblib==0.13.2
numpy==1.16.2
scikit-learn==0.20.2

The final directory structure should look like this:

.
├── iforest.joblib
├── requirements.txt
├── serving.yaml
└── src
    └── func_main.py

From that folder, upload the model to Hydrosphere.

$ hs upload

Monitoring

Let’s create a monitoring metric for our pre-deployed regression model.

UI

  1. From the Models section, select the target model you would like to monitor, and select the desired model version;
  2. Open the Monitoring tab;
  3. At the bottom of the page, click the Add Metric button;
  4. In the window that opens, click the Add Metric button;
    1. Specify the name of the metric;
    2. Choose the monitoring model;
    3. Choose a version of the monitoring model;
    4. Select the comparison operator GreaterEq. Together with the threshold below, this means that whenever the metric value drops below 0, an alarm is fired;
    5. Set the threshold value to 0;
    6. Click the Add Metric button.
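
The effect of the GreaterEq operator with a threshold of 0 can be sketched in a few lines. This is only an illustration of the comparison, not the platform's implementation, and the name `is_healthy` is hypothetical.

```python
import operator

# Comparison operator and threshold chosen in the UI steps above.
COMPARISON = operator.ge   # GreaterEq
THRESHOLD = 0


def is_healthy(metric_value):
    """True when the metric satisfies the GreaterEq condition."""
    return COMPARISON(metric_value, THRESHOLD)


# The monitoring model outputs 1.0 for inliers and -1.0 for outliers.
checks = {value: is_healthy(value) for value in (1.0, -1.0)}
```

Because the monitoring model only ever emits 1.0 or -1.0, a threshold of 0 separates the two cases cleanly: inliers pass the check and outliers trigger the alarm.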

That’s it. Now you have a monitored taxi pickups regression model deployed on the Hydrosphere platform.