
Moving Data Science into Production. Part 1: Strategy

#data science , #Mist , #production

This blog post series describes best practices for moving from data science research to end-to-end production applications at big data scale. The series has three parts:

  1. Part 1: Overview of common considerations and key points.
  2. Part 2: Deployment of training, prediction and on-demand pipelines.
  3. Part 3: Testing and quality monitoring of data science based applications.

Part 1: Go to Production Strategy

Any startup needs a clear go-to-market strategy from the beginning. Similarly, any data science project needs a go-to-production strategy from its first days. Preparing one means answering the following questions:

  • What are the cost and the timeframe of moving from proof of concept to actual production scale?
  • How quickly can change requests and bug fixes be handled and released into production?
  • How can we guarantee the quality of machine learning predictions and insights, so that the business can trust them?

These questions usually have to be answered by the company CTO, since data scientists are often unfamiliar with the production side of the business.

Fundamentally, a business hires data scientists to make its products smarter. So, in an ideal world, we should place the data scientist in an ecosystem where they can deliver analytics services that can be tested, monitored and maintained as part of end-to-end products.

What does it mean to go to Production?

Production is not a state. It is a continuous process of delivering reproducible and autonomous data pipelines, model training workflows and machine learning based predictions to end users.

Blurring the line between Environments

Incompatible research and production environments are show stoppers for operating data science. It is not realistic to assume that R or scikit-learn notebooks will magically turn into distributed C++, Java or Scala implementations and be successfully deployed at real scale.

It’s more than Docker. Packaging R into Docker does simplify reproducing data science experiments, but it does not by itself let you scale out to cluster operations.
Exports and imports do not work well either, since you cannot convert all the data pipelines, machine learning models, dependencies and ad-hoc algorithms into a production version.

The only solution is to unify the frameworks, data sets and APIs between the research and operational environments. Everything is software, and it does not matter what IDE, code editor or hosted notebook you use as long as, at the end of the day, you deliver code that works in the local, test and production environments.

There are 2 strategic options here:

  • Pair a data scientist with a data engineer, so that they can develop machine learning pipelines together using low-level, production-ready frameworks.
  • Invest in an abstraction layer on top of existing low-level distributed algorithms to make them as simple and rich as scikit-learn or R. It should be independent of the execution mode (standalone, Spark, Flink, others) but pluggable into all of them. SystemML has the right motivation here: to build a platform-agnostic ML library that can be used in distributed environments as well as in real-time production.
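To make the second option more concrete, here is a minimal sketch of what "independent of the execution mode but pluggable" could look like. It is not SystemML's API or any real library's: `Backend`, `LocalBackend`, `ThreadPoolBackend`, `Pipeline` and the two example steps are all hypothetical names, and the thread pool only stands in for a distributed engine such as Spark or Flink.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Protocol

Row = Dict[str, float]
Step = Callable[[Row], Row]


class Backend(Protocol):
    """Hypothetical execution backend: applies a row function to many rows."""

    def map_rows(self, fn: Step, rows: Iterable[Row]) -> List[Row]: ...


class LocalBackend:
    """Runs everything in the current process, e.g. on a research laptop."""

    def map_rows(self, fn: Step, rows: Iterable[Row]) -> List[Row]:
        return [fn(row) for row in rows]


class ThreadPoolBackend:
    """Stand-in for a distributed engine; a real one would wrap Spark or Flink."""

    def __init__(self, workers: int = 4) -> None:
        self.workers = workers

    def map_rows(self, fn: Step, rows: Iterable[Row]) -> List[Row]:
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            # pool.map preserves input order, so results match LocalBackend.
            return list(pool.map(fn, rows))


class Pipeline:
    """The data scientist writes steps once; the backend is chosen at run time."""

    def __init__(self, steps: List[Step]) -> None:
        self.steps = steps

    def apply(self, row: Row) -> Row:
        for step in self.steps:
            row = step(row)
        return row

    def run(self, rows: Iterable[Row], backend: Backend) -> List[Row]:
        return backend.map_rows(self.apply, rows)


def add_ratio(row: Row) -> Row:
    # Example feature step (made up for illustration).
    return {**row, "ratio": row["clicks"] / max(row["views"], 1)}


def flag_engaged(row: Row) -> Row:
    # Example rule step (made up for illustration).
    return {**row, "engaged": row["ratio"] > 0.1}


pipeline = Pipeline([add_ratio, flag_engaged])
rows = [{"clicks": 5, "views": 20}, {"clicks": 1, "views": 100}]
local_result = pipeline.run(rows, LocalBackend())
pooled_result = pipeline.run(rows, ThreadPoolBackend())
```

The point of the design is that the pipeline definition never mentions where it runs, so the same steps can be exercised locally in a notebook and handed to a different backend in production.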

Offline, Online and Real-time Production

Production can be handled in different ways. Nightly cron jobs, stream analytics pipelines, online on-demand simulation services and real-time prediction applications are all different deployment contexts for the code a data scientist produces. The right choice varies with whether the service is single-user or multi-tenant, and whether it is a high-throughput environment.

However, underneath all of them there is one machine learning and data processing codebase. Sharing the same codebase, libraries and algorithms between large-scale offline training engines and real-time single-row scoring APIs simplifies and accelerates the transition from research into production.
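As a rough illustration of sharing one codebase, the sketch below uses the same feature function and the same scoring function for an offline batch job and for a single-row, request-time call. The column names (`sqft`, `price`), the function names and the toy data are assumptions made up for this example, and the model is a plain one-variable least-squares fit rather than anything a real project would ship.

```python
from statistics import mean
from typing import Dict, List

Row = Dict[str, float]
Model = Dict[str, float]


def features(row: Row) -> float:
    """Shared feature logic, used by training and by both scoring paths."""
    return row["sqft"] / 1000.0


def train(rows: List[Row]) -> Model:
    """Offline batch job: fit price = a * feature + b by least squares."""
    xs = [features(r) for r in rows]
    ys = [r["price"] for r in rows]
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return {"a": a, "b": my - a * mx}


def score_row(model: Model, row: Row) -> float:
    """Real-time path: score one row, e.g. inside an HTTP handler."""
    return model["a"] * features(row) + model["b"]


def score_batch(model: Model, rows: List[Row]) -> List[float]:
    """Offline path: reuses score_row, so both paths cannot drift apart."""
    return [score_row(model, r) for r in rows]


# Toy training data, invented for the example.
training_rows = [
    {"sqft": 1000, "price": 200},
    {"sqft": 2000, "price": 400},
    {"sqft": 3000, "price": 600},
]
model = train(training_rows)
single = score_row(model, {"sqft": 1500})
batch = score_batch(model, [{"sqft": 1500}, {"sqft": 2500}])
```

Because `score_batch` is defined in terms of `score_row`, a fix to the feature logic reaches the nightly job and the real-time API in the same release.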
