Hydrosphere Notebooks

Bridge the gap between operations and data science


Hydro Notebooks is a web IDE that lets data scientists connect to Spark and tap into the CI/CD development model.

Q: Do you have circular dependencies and handoffs between data scientists, engineers, and operations?

A: Give data scientists self service.

Let the data scientist own the data product from data exploration to production, decoupled from big data engineering and the cluster operations layer. Data scientists use the underlying platform as a service, so they can concentrate on data analytics challenges.

Eliminate most dependencies and handoffs between data scientists and IT. The goal is to cut the delays those handoffs cause, so that nobody is left waiting on another team.

Q: Do you experiment in siloed environments with limited data sets?

A: Walk through and discover actual data on an actual cluster.

Safely share data lake and cluster resources between production and sandbox environments. Use Hydrosphere Notebooks as a scratch pad for big data ideas. The data scientist builds up a model by running extractions and transformations in an interactive session, much like a shell. The notebook already knows about the big data cluster, which frees the data scientist from those details.

Q: How do you turn experiments into production code?

A: Allow data scientists and big data engineers to contribute to the same code base.

Hydro Notebooks embraces the Analytics as Code principle. When data scientists use a Hydrosphere Notebook, they work with the same code that will be pushed into the CI/CD pipeline. This does not slow them down, because the notebook hides the complexity of the Spark cluster.

Q: Do you test your jobs, models and algorithms?

A: Implement Continuous Integration for data science scripts.

Algorithms and models only work if the data is valid. You have to test that the assumptions you made in development are valid in production. You need to monitor metrics that are specific to statistics, like error variance, and data quality, like the number of bad records.

With Hydro you add unit and integration tests to the notebook. Hydro also performs stateful runtime validation, supports running different versions of data models at the same time, and facilitates feedback and regression testing.
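As a rough illustration of the kind of check described above, the following Python sketch tests two assumptions on a batch of data: the share of bad records and the variance of the prediction errors. It is not Hydro's API. The function names, the `amount` field, and both thresholds are hypothetical and chosen only for the example.

```python
from statistics import pvariance

# Hypothetical thresholds; in practice they would come from the discovery phase.
MAX_BAD_RECORD_RATIO = 0.05
MAX_ERROR_VARIANCE = 4.0


def is_bad_record(record):
    """A record counts as bad if the (assumed) 'amount' field is missing or not numeric."""
    value = record.get("amount")
    return isinstance(value, bool) or not isinstance(value, (int, float))


def bad_record_ratio(records):
    """Fraction of records that fail the basic validity check."""
    if not records:
        return 0.0
    return sum(is_bad_record(r) for r in records) / len(records)


def error_variance(predictions, actuals):
    """Population variance of the prediction errors (prediction minus actual)."""
    errors = [p - a for p, a in zip(predictions, actuals)]
    return pvariance(errors)


def validate_batch(records, predictions, actuals):
    """Return the list of violated assumptions; an empty list means the batch passes."""
    problems = []
    if bad_record_ratio(records) > MAX_BAD_RECORD_RATIO:
        problems.append("too many bad records")
    if error_variance(predictions, actuals) > MAX_ERROR_VARIANCE:
        problems.append("error variance too high")
    return problems
```

Because these are plain functions, the same checks could run as a unit test in a build pipeline and as a validation step against production batches.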

Q: Are you able to safely deploy the model to production with one click?

A: Deploy with confidence.

With existing practices, data scientists rarely work in release cycles, which is a missed opportunity. For example, they might keep running three-year-old algorithms that they are afraid to touch for fear of breaking something. That approach does not take advantage of monitoring and feedback to continually improve the model.

Hydro Notebooks fixes that. It fits the way data scientists work by letting them push different versions of their model to production at the same time.
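To make the idea of several model versions running side by side concrete, here is a minimal Python sketch. It does not show how Hydro deploys models. The two models, the `DEPLOYED` registry, and the feature names `x` and `y` are all hypothetical.

```python
def model_v1(features):
    """Hypothetical older model: a fixed linear rule on one feature."""
    return 2.0 * features["x"] + 1.0


def model_v2(features):
    """Hypothetical candidate model that also uses a second feature."""
    return 2.0 * features["x"] + 0.5 * features["y"] + 1.0


# Every version that is currently deployed, keyed by a version label.
DEPLOYED = {"v1": model_v1, "v2": model_v2}


def score_all(features, models=DEPLOYED):
    """Score one record with every deployed version."""
    return {version: model(features) for version, model in models.items()}


def mean_absolute_error_by_version(rows, models=DEPLOYED):
    """rows is an iterable of (features, actual) pairs; returns the MAE per version."""
    totals = {version: 0.0 for version in models}
    count = 0
    for features, actual in rows:
        count += 1
        for version, prediction in score_all(features, models).items():
            totals[version] += abs(prediction - actual)
    if count == 0:
        raise ValueError("no rows to evaluate")
    return {version: total / count for version, total in totals.items()}
```

Scoring the same production rows with each version gives a like-for-like comparison, which is one way the monitoring and feedback mentioned above could inform the decision to retire an old model.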

Q: Can you compare your assumptions from the discovery phase with actual production performance?

A: Use continuous monitoring with a tight feedback loop.

Close the feedback loop by reusing the checkpoints you visualised in the discovery phase: apply production monitoring and performance analytics to those same checkpoints.

Q: Why do we need yet another hosted notebook?

A: Roll out devops culture to big data analytics teams.

Traditional hosted notebooks are great for experiments, but they are not designed for developing production code in Python or Scala. In Zeppelin and IPython, the visualization is too closely tied to the code because of their notebook-specific file formats. You cannot execute that code without the notebook: you cannot run it from the command line, run unit tests in a Jenkins pipeline, or edit it in vim. As a result, you lose compatibility with the rest of the development world and the project team.

We fix that by storing Hydro notebooks as plain-text .scala and .py source files, so they can be executed on a cluster independently of the notebook.

Hydro Notebooks supports the full release cycle. This lets the data scientist work independently from the development team yet commit code to the build pipeline per Agile iteration or any other scheduling mechanism. That imposes some discipline on data scientists to align their work with strategic and tactical planning and overall project schedules.
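The practical benefit of a plain .py file can be sketched as follows. This is an illustrative example, not an actual Hydro notebook: the file name `transform.py`, the function names, and the `name,amount` input format are assumptions made for the sketch.

```python
"""transform.py: a hypothetical notebook saved as a plain Python file."""
import sys


def parse_amounts(lines):
    """Parse 'name,amount' lines; skip malformed lines and non-numeric amounts."""
    parsed = []
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) != 2:
            continue
        name, raw_amount = parts
        try:
            parsed.append((name, float(raw_amount)))
        except ValueError:
            continue
    return parsed


def total_by_name(pairs):
    """Sum the amounts for each name."""
    totals = {}
    for name, amount in pairs:
        totals[name] = totals.get(name, 0.0) + amount
    return totals


def main(argv):
    """Entry point for command-line use; expects the input file path as argv[1]."""
    if len(argv) != 2:
        print("usage: python transform.py <input.csv>", file=sys.stderr)
        return 2
    with open(argv[1]) as handle:
        totals = total_by_name(parse_amounts(handle))
    for name in sorted(totals):
        print(f"{name},{totals[name]}")
    return 0
```

Since the file is ordinary source code, a test runner in the build pipeline can import `parse_amounts` and `total_by_name` directly, and any text editor can open it. Adding the usual `if __name__ == "__main__":` guard that calls `main(sys.argv)` would make the same file runnable from the command line.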

Open Source

Hydro Notebook is open source software, available on GitHub under the Apache License 2.0.

Hosted Version

Start with Hydrosphere Hosted, then move on-premises at any time.