Bridge the gap between operations and data science
Lets the data scientist own the data product from data exploration through to production, decoupled from big data engineering and the cluster operations layer. They use the underlying platform as a service and concentrate on data analytics challenges.
Eliminate most dependencies and handoffs between data scientists and IT, cutting the inefficiency and idle waiting those handoffs create.
Safely share data lake and cluster resources between production and sandbox environments. Use Hydrosphere Notebooks as a scratch pad for big data ideas: the data scientist builds up a model by running extractions and transformations in an interactive, shell-like environment, while the notebook handles the details of connecting to the big data cluster.
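To make the exploratory loop concrete, here is a minimal sketch of the kind of extract-and-transform steps a data scientist might run cell by cell. Plain Python stands in for the cluster-backed API here; the record fields and function names are illustrative, not part of the Hydrosphere product.

```python
# Hypothetical exploratory session: extract valid records, then aggregate.
raw_events = [
    {"user": "a", "amount": "12.50"},
    {"user": "b", "amount": "oops"},   # a bad record to be filtered out
    {"user": "a", "amount": "7.25"},
]

def is_valid(record):
    """Extraction step: keep only records with a parseable amount."""
    try:
        float(record["amount"])
        return True
    except ValueError:
        return False

def total_by_user(records):
    """Transformation step: aggregate amounts per user."""
    totals = {}
    for r in filter(is_valid, records):
        totals[r["user"]] = totals.get(r["user"], 0.0) + float(r["amount"])
    return totals

print(total_by_user(raw_events))  # {'a': 19.75}
```

In a real session the collections would be cluster-side DataFrames rather than Python lists, but the iterative shape of the work is the same.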
Hydrosphere Notebooks embrace the Analytics as Code principle. When data scientists work in a Hydrosphere Notebook, they are working with the same code that will be pushed into the CI/CD pipeline. This does not slow them down, because the notebook hides the complexity of the Spark cluster.
Algorithms and models only work if the data is valid. The assumptions you made in development have to hold in production, so you need to monitor both statistical metrics, such as error variance, and data quality metrics, such as the number of bad records.
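A sketch of what such monitoring might compute, using only the standard library; the metric names and the schema check are illustrative assumptions, not Hydrosphere APIs:

```python
import statistics

def monitoring_metrics(predictions, actuals, raw_records):
    """Hypothetical production checks combining a statistical metric
    (variance of model residuals, which should match what was seen in
    development) and a data quality metric (count of bad records)."""
    errors = [p - a for p, a in zip(predictions, actuals)]
    bad = sum(1 for r in raw_records if r.get("amount") is None)
    return {
        "error_variance": statistics.pvariance(errors),
        "bad_records": bad,
    }

metrics = monitoring_metrics(
    predictions=[1.0, 2.0, 3.5],
    actuals=[1.1, 1.9, 3.0],
    raw_records=[{"amount": 5}, {"amount": None}],
)
```

A drift in `error_variance` or a spike in `bad_records` is the signal that a development-time assumption no longer holds in production.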
With Hydrosphere you add unit and integration tests directly into the notebook. The platform also performs stateful runtime validation, supports running different versions of data models at the same time, and facilitates feedback and regression testing.
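Because the notebook code is plain Python, standard test tooling applies. A minimal sketch, assuming a hypothetical `clean_amount` transformation, shows how development-time assumptions become assertions that a CI pipeline can re-check on every release:

```python
import unittest

def clean_amount(value):
    """Transformation under test: parse an amount, None if unparseable."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

class CleanAmountTest(unittest.TestCase):
    def test_parses_numbers(self):
        self.assertEqual(clean_amount("12.5"), 12.5)

    def test_rejects_garbage(self):
        self.assertIsNone(clean_amount("oops"))

# Run the suite in-process, as a notebook cell could.
result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(CleanAmountTest)
)
```

The same test class runs unchanged under a Jenkins test runner, which is the point: the notebook and the pipeline exercise identical code.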
With existing practices, data scientists do not usually follow release cycles, and that is a missed opportunity. They may be running three-year-old algorithms they are afraid to touch for fear of breaking something. That approach fails to use monitoring and feedback to continually improve the model.
The notebooks fix that and support the way data scientists work by letting them push different versions of a model to production at the same time.
Provide a feedback loop: re-use the checkpoints you visualised in the discovery phase by applying production monitoring and performance analytics against them.
Traditional hosted notebooks are great for experiments, but they are not designed for developing production code in Python or Scala. In Zeppelin and IPython, the visualization is tied too closely to the code by their proprietary formats. The code cannot be executed outside the notebook: you cannot run it from the command line, run unit tests in a Jenkins pipeline, or edit it in vim. You lose compatibility with the rest of the development world and the project team.
We fix that by storing Hydrosphere notebooks as plain .scala and .py text files, so they can be executed on a cluster independently of the notebook.
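A sketch of what such a plain-text notebook file might look like. The file name, the `# %%` cell markers, and the function are illustrative assumptions about the on-disk layout; the point is that a plain .py file can be edited in vim, imported by a test suite, or run from the command line:

```python
# churn_model.py - hypothetical notebook stored as an ordinary Python file.

# %% cell: transformation logic
def featurize(row):
    """Turn a raw record into a feature vector."""
    return [row["age"] / 100.0, 1.0 if row["active"] else 0.0]

# %% cell: entry point, so the same file also runs from the command line
if __name__ == "__main__":
    sample = {"age": 42, "active": True}
    print(featurize(sample))
```

Because nothing here depends on a notebook runtime, `python churn_model.py` works in a Jenkins stage exactly as it does in the hosted notebook.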
Hydrosphere Notebooks support the full release cycle. The data scientist works independently from the development team yet commits code to the build pipeline per Agile iteration or any other scheduling mechanism. That imposes some discipline on data scientists to align their work with strategic and tactical planning and overall project schedules.
Hydrosphere Notebooks are open source software, available on GitHub under the Apache 2.0 License.
Start with Hydrosphere Hosted, then move on-premises at any time.