
Moving Data Science into Production: Part 2: Deploying


This blog post series describes best practices for the transition from data science research activities to end-to-end production applications at big data scale. There are 3 parts to this series; this post is Part 2, on deployment.

What exactly should be deployed?

The definition of a machine learning model has been extended to the concept of a pipeline. To deliver a model into production, we need to deploy the following artifacts:

  • Training pipelines
  • Large scale prediction pipelines
  • Single row serving/scoring pipelines
  • Batch on-demand pipelines


Training Pipelines

Apache Spark has become the de facto standard for building large scale training pipelines. Besides MLlib, popular frameworks such as TensorFlow, H2O and XGBoost can be plugged into distributed estimation processes powered by Apache Spark.
Cron-based deployments are currently the most popular way to run training pipelines in the community, because they are extremely simple: you can schedule re-training jobs in cron or in a workflow manager like Airflow. Packaging, versioning and automation are layered on top of this.
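
As a minimal sketch of the cron-based approach, the Python wrapper below builds a versioned `spark-submit` command that a cron entry could invoke. The script name, the `--output` argument, the storage path and the schedule are all assumptions made for illustration; they are not part of any Hydrosphere tooling.

```python
from datetime import datetime, timezone
from typing import List


def model_version(now: datetime) -> str:
    """Derive a sortable model version from the run timestamp."""
    return now.strftime("%Y%m%d-%H%M%S")


def build_retrain_command(script: str, output_root: str, now: datetime) -> List[str]:
    """Build the spark-submit invocation for one re-training run.

    `script` and `output_root` are hypothetical. The training job is assumed
    to accept an --output argument telling it where to write the model.
    """
    version = model_version(now)
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        script,
        "--output", f"{output_root}/{version}",
    ]


if __name__ == "__main__":
    # Example crontab entry (assumed paths), re-training every night at 02:00:
    #   0 2 * * * /usr/bin/python3 /opt/jobs/retrain.py
    cmd = build_retrain_command(
        "train_pipeline.py", "hdfs:///models/churn", datetime.now(timezone.utc)
    )
    print(" ".join(cmd))
```

Writing each run to its own timestamped directory is one simple way to get versioning: older models stay available for rollback and the newest one is easy to find by sorting.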
The main challenge here is to manage computation resources between different jobs elastically and effectively, since a Spark application is essentially a single-user program. The ability to run multiple training jobs with different settings alongside streaming pipelines and ad-hoc requests is crucial for a seamless data science experience. Therefore, a cluster of Spark clusters is required.
Cluster managers such as YARN and Mesos aim to solve the problem of re-scheduling resources between different Spark jobs on a statically allocated pool of nodes. The AWS and GCE clouds make it possible to build standalone, independently scalable Spark clusters. But there are no open source blueprints or out-of-the-box cloud offerings for a cost-effective, on-demand cluster of Spark clusters. So be prepared to fight with DC/OS and Kubernetes while trying to marry Spark with AWS auto scaling thresholds and YARN dynamic resource allocation.
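
For reference, dynamic resource allocation is driven by a handful of Spark configuration properties. The sketch below renders them as `spark-submit` flags. The property names are standard Spark settings; the helper functions and the executor bounds are placeholders invented for this example and would need tuning per job.

```python
from typing import Dict, List


def dynamic_allocation_conf(min_executors: int, max_executors: int) -> Dict[str, str]:
    """Return Spark properties that enable dynamic allocation."""
    if min_executors > max_executors:
        raise ValueError("min_executors must not exceed max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        # The external shuffle service lets executors be removed
        # without losing the shuffle files they have written.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


def to_submit_flags(conf: Dict[str, str]) -> List[str]:
    """Turn a property dict into repeated `--conf key=value` arguments."""
    flags: List[str] = []
    for key, value in sorted(conf.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags


flags = to_submit_flags(dynamic_allocation_conf(min_executors=1, max_executors=10))
print(" ".join(flags))
```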

At Hydrosphere.io we are not in the business of managing Spark clusters, but we’ve built our own cloud-agnostic infrastructure that orchestrates multiple on-demand Spark clusters and their corresponding Spark contexts.


Prediction Pipelines

Once a model has been built, it can be used to generate predictions. There are 2 different contexts in which machine learning models may be deployed: large scale and real-time serving/scoring.

Large scale prediction pipelines. Examples are an offline batch job that classifies user profiles for an email campaign, or a streaming micro-batch job that analyses incoming payment transactions.

This canonical use case for Apache Spark has already been implemented many times for single-user word count applications. The production challenges, again, lie in effective resource management and isolation. There are dozens of options for where and how to run Spark, such as AWS EMR, Azure, IBM Bluemix, GCE, DC/OS, Kubernetes, Cloudera, Hortonworks, Databricks and Qubole. But there is still a gap. For real use cases, we need multiple on-demand Spark clusters and an orchestrator on top of them. Training, prediction, streaming and ad-hoc jobs should co-exist and expose a high-level REST API to their users. The Databricks platform allows the management of multiple Spark clusters on AWS and the attachment of a notebook to a particular cluster.

Hydrosphere.io is open source and can run on any cloud provider. It operates at the level of REST services rather than notebooks. It manages the Spark cluster for every prediction API call to guarantee its completion, so you do not need to reserve and configure a cluster for a particular prediction pipeline. Spark itself can be deployed on any infrastructure (AWS Spot instances, EMR, MapR and others) in YARN, Mesos or standalone mode.

Real-time single row serving pipelines. An example of this might be an online request from a user for image recognition, or an API call from an event processing engine such as Apache Flink that classifies log events in a low latency context.

Training a model in Apache Spark while having it automatically available for real-time serving is an essential feature for end-to-end solutions. However, single row serving is not a focus of Apache Spark. The Spark community has partially completed the separation of ml-local and is having ongoing discussions about the priority of that project. Databricks has also released proprietary tools for exporting a model and serving it outside of Apache Spark.
There is an option to export the model into PMML and then import it into a separate scoring engine. The idea of interoperability is great, but it comes with multiple challenges, such as:

  • Code duplication. Two different implementations and exporting functions of the same model in different languages need to be maintained. This leads to inconsistencies, bugs and a lot of excess work.
  • Limited extensibility. The prediction pipeline is not only a machine learning model. It has pre-processing steps, ad-hoc logic and very specific dependencies that cannot be encoded in XML.
  • Inconsistency. Different implementations for the streaming, large scale and scoring use cases produce different outputs in different contexts for the same inputs.
  • Extra moving parts. Exporting/importing pipelines create additional complexity and points of failure. They should be eliminated rather than automated.
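
To make the duplication and inconsistency risks concrete, here is a deliberately small, hypothetical illustration: a feature scaler written once for training and re-implemented for a separate scoring engine. The two functions look equivalent but use different standard deviation conventions, so the same input yields different features. The function names and data are invented for the example.

```python
import statistics
from typing import List


def scale_training(history: List[float], x: float) -> float:
    """Scaler as written in the training pipeline: population std dev."""
    mean = statistics.mean(history)
    return (x - mean) / statistics.pstdev(history)


def scale_scoring(history: List[float], x: float) -> float:
    """Re-implementation in the scoring engine: sample std dev."""
    mean = statistics.mean(history)
    return (x - mean) / statistics.stdev(history)


history = [10.0, 12.0, 14.0, 16.0, 18.0]
trained_on = scale_training(history, 20.0)
served_with = scale_scoring(history, 20.0)

# The model was fitted on one feature value but is served another.
print(trained_on, served_with, trained_on == served_with)
```

A discrepancy like this raises no error; it only surfaces as slightly different predictions, which is what makes it expensive to debug across two codebases.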

People also talk about PFA as the next generation of PMML. It inherits the same issues as its predecessor but adds the flexibility of a Turing-complete language for model encoding. This is a step forward, but it is not clear why we have to invent yet another pseudo language to serialize algorithms. Again, in order to add or fix a model we have to change both the training algorithm and the PMML/PFA/MLeap exporting library, and then debug it further through serialized files and a separate scoring engine.

Hydrosphere Mist has a different approach:

  • No custom formats and new standards. It reads native MLlib, TensorFlow and XGBoost metadata files using the original local implementations of the corresponding libraries. If MLlib has a bug in a random forest, it will show the same consistent behavior in the training and the prediction pipeline.
  • No exports/imports for algorithms. The model is as good and as fast as its original implementation. An abstract syntax tree is a better way to express an algorithm than XML or JSON. If it is not type safe or has side effects, the machine learning framework is responsible for fixing it.
  • Shared Spark API. Data scientists are getting familiar with the Apache Spark API and its ecosystem. It does not make sense to learn new tools and frameworks just to run the same things in production.
  • Convergence with the Apache Spark roadmap. Hydrosphere Mist intends to use the separate ml-local module once it becomes independent of SparkContext and distributed DataFrames. Until then it uses its own local implicit model loaders.
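
The idea of reusing one implementation across contexts can be sketched in a few lines. The `Pipeline` class below is a hypothetical stand-in, not Mist's or Spark's API: the single-row path and the batch path both run the same stage functions, so they cannot drift apart. The stages and field names are invented for the example.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, float]
Stage = Callable[[Row], Row]


class Pipeline:
    """An ordered list of stages applied identically in every context."""

    def __init__(self, stages: List[Stage]) -> None:
        self.stages = stages

    def score_row(self, row: Row) -> Row:
        # Real-time serving path: one row in, one row out.
        for stage in self.stages:
            row = stage(row)
        return row

    def score_batch(self, rows: Iterable[Row]) -> List[Row]:
        # Batch path: delegates to the single-row path, no second implementation.
        return [self.score_row(row) for row in rows]


def add_ratio(row: Row) -> Row:
    """Pre-processing step: derive a feature from two raw fields."""
    return {**row, "ratio": row["amount"] / row["limit"]}


def flag_risky(row: Row) -> Row:
    """Toy 'model': a fixed threshold on the derived feature."""
    return {**row, "risky": float(row["ratio"] > 0.8)}


pipeline = Pipeline([add_ratio, flag_risky])
single = pipeline.score_row({"amount": 90.0, "limit": 100.0})
batch = pipeline.score_batch([
    {"amount": 90.0, "limit": 100.0},
    {"amount": 10.0, "limit": 100.0},
])
print(single == batch[0])
```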

From the user's perspective, Hydrosphere Mist adds a minimal footprint on top of the common Apache Spark flow. You can train the model the same way you did before and have it automatically deployed as a REST service.


Online on-demand Pipelines

Interactive business applications that execute analytics pipelines are another context for deploying data science code. Examples of this might be a web application for ad campaign prediction, measurement and analytics, bank stress testing, pricing optimization or advanced reporting. These can be treated as interactive smart applications that take input from business users and run parameterized Apache Spark jobs on their backend. For business users, this is a much more sophisticated and user-friendly alternative to hosted notebooks.

From the deployment standpoint, the following aspects have to be considered:

  • Cron-based jobs have to be exposed for online access through an instant HTTP or messaging (Kafka, AMQP, MQTT) API that triggers Apache Spark jobs and returns results to the end user.
  • The robust infrastructure discussed in previous sections has to route API requests to the appropriate Spark cluster for completion.
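
A minimal sketch of the routing step described in the list above, assuming a hypothetical in-memory registry of clusters keyed by workload type. A real orchestrator would provision actual Spark clusters and submit jobs to them; here a cluster is just a named record created on first use, and the request fields are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Cluster:
    """Stand-in for a running Spark cluster."""
    name: str
    jobs: List[str] = field(default_factory=list)


class Router:
    """Routes each job to the cluster dedicated to its workload type."""

    def __init__(self) -> None:
        self.clusters: Dict[str, Cluster] = {}

    def route(self, workload: str, job_id: str) -> Cluster:
        cluster = self.clusters.get(workload)
        if cluster is None:
            # On-demand: create the cluster the first time it is needed.
            cluster = Cluster(name=f"spark-{workload}")
            self.clusters[workload] = cluster
        cluster.jobs.append(job_id)
        return cluster


def handle_request(router: Router, request: Dict[str, str]) -> Dict[str, str]:
    """Shape of an HTTP or message handler: accept a job, report where it went."""
    cluster = router.route(request["workload"], request["job_id"])
    return {
        "job_id": request["job_id"],
        "cluster": cluster.name,
        "status": "accepted",
    }


router = Router()
first = handle_request(router, {"workload": "training", "job_id": "j1"})
second = handle_request(router, {"workload": "adhoc", "job_id": "j2"})
third = handle_request(router, {"workload": "training", "job_id": "j3"})
print(first, second, third)
```

Keeping the handler independent of the transport means the same routing logic can sit behind an HTTP endpoint or a message queue consumer.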

Hydrosphere Mist covers this use case by providing a cluster of Spark clusters where any Spark pipeline could be deployed as an online on-demand REST service.
