Monitor anomalies with a custom KNN metric

On this page you will learn how to create a custom anomaly detection metric for a specific use case.

Overview

For this use case, we have chosen a sample classification problem: we will monitor a model that classifies whether a given person's income exceeds $50,000 per year. As a data source, we will use the census income dataset.

Before you start

We assume you already have a deployed instance of the Hydrosphere platform and the hs CLI on your local machine.

To let hs know where the Hydrosphere platform runs, configure a new cluster entity.

$ hs cluster add --name local --server http://localhost
$ hs cluster use local

You should also have a target classification model already deployed on the cluster. You can find the source code of the target model here.

Model training

As a monitoring metric, we will use the KNN outlier detection algorithm from the pyod package. Each incoming sample will be scored against its nearest neighbors in the training data, and the resulting outlier score will be exposed as a monitoring value.

For the sake of simplicity, we will skip most of the data preparation steps.
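
The training code below relies on two names produced by those skipped steps: features_to_use, the list of dataset columns to keep, and transformations, a mapping from column to encoding function. Here is a minimal hypothetical sketch of what they might look like; the exact indices and encoders are assumptions, chosen to match the twelve features used later in func_main.py.

Python
# Hypothetical reconstruction of the skipped preparation steps.
# Column indices follow the census income dataset layout: 0=age,
# 1=workclass, 3=education, 5=marital_status, 6=occupation,
# 7=relationship, 8=race, 9=sex, 10=capital_gain, 11=capital_loss,
# 12=hours_per_week, 13=country.
features_to_use = [0, 1, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13]

def encode(column):
    # Encode a categorical column as integer codes.
    return column.astype("category").cat.codes

# Apply integer encoding to the categorical columns only.
transformations = {index: encode for index in [1, 3, 5, 6, 7, 8, 9, 13]}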

Python
import numpy as np
import pandas as pd
from pyod.models.knn import KNN
from sklearn.model_selection import train_test_split

df = pd.read_csv("../data/adult.data", header=None)
target_labels = pd.Series(df.iloc[:, -1], index=df.index)

# Keep only the monitored features and drop incomplete rows.
df = df.iloc[:, features_to_use]
df.dropna(inplace=True)

# Apply the prepared transformations to the selected features.
for feature, func in transformations.items():
    df[feature] = func(df[feature])

X_train, X_test = train_test_split(np.array(df, dtype="float"), test_size=0.2)

# Fit a KNN outlier detector: 5% expected outliers, 15 neighbors,
# Minkowski distance with p=5.
monitoring_model = KNN(contamination=0.05, n_neighbors=15, p=5)
monitoring_model.fit(X_train)

Model evaluation

To check that our model works properly, let's plot histograms of the outlier scores for the training and test datasets.

Python
import matplotlib.pyplot as plt

y_train_pred = monitoring_model.labels_  # binary labels (0: inliers, 1: outliers)
y_train_scores = monitoring_model.decision_scores_  # raw outlier scores

# Get the prediction on the test data
y_test_pred = monitoring_model.predict(X_test)  # outlier labels (0 or 1)
y_test_scores = monitoring_model.decision_function(X_test)  # outlier scores

plt.hist(
    y_test_scores,
    bins=30,
    alpha=0.5,
    density=True,
    label="Test data outlier scores",
)
plt.hist(
    y_train_scores,
    bins=30,
    alpha=0.5,
    density=True,
    label="Train data outlier scores",
)

plt.vlines(monitoring_model.threshold_, 0, 0.1, label="Threshold for marking outliers")
plt.gcf().set_size_inches(10, 5)
plt.legend()
plt.show()

Model release

To create a monitoring metric, we have to deploy the KNN model as a separate model on the Hydrosphere platform. Let's save the trained model for serving.

Python
import joblib

joblib.dump(monitoring_model, "../monitoring_model/monitoring_model.joblib")

Create a new directory where we will declare the serving function and its definitions.

$ mkdir -p monitoring_model/src
$ cd monitoring_model
$ touch src/func_main.py

Inside the src/func_main.py file, put the following code:

Python
import hydro_serving_grpc as hs
import numpy as np
from joblib import load

monitoring_model = load('/model/files/monitoring_model.joblib')

features = ['age',
            'workclass',
            'education',
            'marital_status',
            'occupation',
            'relationship',
            'race',
            'sex',
            'capital_gain',
            'capital_loss',
            'hours_per_week',
            'country']


def extract_value(proto):
    # Every input is a scalar int64 tensor, so take its first element.
    return np.array(proto.int64_val, dtype='int64')[0]


def predict(**kwargs):
    # Assemble the incoming scalars into a single (1, n_features) sample
    # and score it with the trained KNN detector.
    extracted = np.array([extract_value(kwargs[feature]) for feature in features])
    transformed = extracted.reshape(1, len(features))
    predicted = monitoring_model.decision_function(transformed)

    response = hs.TensorProto(
        double_val=[predicted.item()],
        dtype=hs.DT_DOUBLE,
        tensor_shape=hs.TensorShapeProto())

    return hs.PredictResponse(outputs={"value": response})
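
Before packaging the model, you can smoke-test the function locally, with the joblib path in func_main.py temporarily pointed at your local file. Below is a minimal sketch, assuming hydro_serving_grpc exposes a DT_INT64 dtype alongside the DT_DOUBLE used above:

Python
# Hypothetical local smoke test for the serving function.
import hydro_serving_grpc as hs
from src.func_main import predict, features

# Build a fake request: every input is a scalar int64 tensor.
request = {
    feature: hs.TensorProto(int64_val=[1], dtype=hs.DT_INT64)
    for feature in features
}

response = predict(**request)
print(response.outputs["value"].double_val[0])  # raw outlier score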

This model also has to be packed with a model definition. Create a serving.yaml file with the following contents:

kind: Model
name: "census_monitoring"
payload:
  - "src/"
  - "requirements.txt"
  - "monitoring_model.joblib"
runtime: "hydrosphere/serving-runtime-python-3.6:2.2.1"
install-command: "pip install -r requirements.txt"
contract:
  name: "predict"
  inputs:
    age:
      shape: scalar
      type: int64
      profile: numerical
    workclass:
      shape: scalar
      type: int64
      profile: numerical
    education:
      shape: scalar
      type: int64
      profile: numerical
    marital_status:
      shape: scalar
      type: int64
      profile: numerical
    occupation:
      shape: scalar
      type: int64
      profile: numerical
    relationship:
      shape: scalar
      type: int64
      profile: numerical
    race:
      shape: scalar
      type: int64
      profile: numerical
    sex:
      shape: scalar
      type: int64
      profile: numerical
    capital_gain:
      shape: scalar
      type: int64
      profile: numerical
    capital_loss:
      shape: scalar
      type: int64
      profile: numerical
    hours_per_week:
      shape: scalar
      type: int64
      profile: numerical
    country:
      shape: scalar
      type: int64
      profile: numerical
    classes:
      shape: scalar
      type: int64
      profile: numerical
  outputs:
    value:
      shape: scalar
      type: double
      profile: numerical

The inputs of this model are the inputs of the target monitored model plus the output of that model (the classes field). As the output of the monitoring model itself, we will use the value field.

Pay attention to the model's payload. It contains the src folder that we have just created, a requirements.txt with all the dependencies, and the monitoring_model.joblib file, i.e. our newly trained serialized KNN model.

requirements.txt looks like this:

joblib==0.13.2
numpy==1.16.2
pyod==0.7.4

The final directory structure should look like this:

.
├── monitoring_model.joblib
├── requirements.txt
├── serving.yaml
└── src
    └── func_main.py

From that folder, upload the model to the cluster.

$ hs upload

Monitoring

Let’s create a monitoring metric for our pre-deployed classification model.

UI

  1. From the Models section, select the target model you would like to monitor and choose the desired model version.
  2. Open the Monitoring tab.
  3. At the bottom of the page, click the Add Metric button.
  4. In the window that opens:
    1. Specify the name of the metric.
    2. Choose the monitoring model.
    3. Choose the version of the monitoring model.
    4. Select the Greater comparison operator. This means that an alarm will be fired whenever the metric value exceeds the specified threshold.
    5. Set the threshold value. In this case it should be equal to the value of monitoring_model.threshold_ (see the snippet after this list).
    6. Click the Add Metric button.
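
You can get the exact threshold value from the training session:

Python
# Decision threshold fitted by the KNN detector during training;
# scores above it are treated as outliers.
print(monitoring_model.threshold_)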

That’s it. Now you have a monitored income classifier deployed on the Hydrosphere platform.