
State-Safe Data Pipelines

I am going to give a talk at the O’Reilly Software Architecture Conference in London, and I would like to use this occasion to introduce you to the first part of the talk: state-safe data pipelines on microservices.

For some odd reason, the topic of “Data Pipelines” does not seem to be appealing to audiences any more, even though the state of the matter is that there is no open-source framework available with which one could build pipelines that are state- and type-safe. We find building one an exciting and thrilling challenge, and I am going to tell you a lot about it during the first part of the talk (and wait until you see what is in store for the second part!)

Data Flow Programming Deadfall

Let’s start by recapping what the evolution cycle of a typical data project looks like. We can identify a similar pattern in most organizations whenever a preliminary idea or need emerges, with a specific data shape requirement, either for reporting or for a machine learning use case:

  • Move on to Hive or Spark as the dataset grows and matures
  • Deploy with Cron and DAG runners


Data Flow Programming makes nearly everyone excited today, especially when there are black-box components connected into DAGs and a nice web UI. In practice, however, solutions built on AWS Data Pipeline, Airflow, Azure Data Factory, Jenkins and Luigi are not much better than cron-based systems. This is because:

  • The state between pipeline tasks is unmanageable, as data is state stored in a shared folder
  • Task scheduling and data flow management have different objectives
  • No framework for engineers – too much freedom to shoot yourself in the foot inside a black box
  • Not suitable for unit testing
  • Limited error handling. Only an exit code is available, and that is not enough
  • Absence of contracts between tasks
  • Tight coupling between infrastructure code and data business logic (see the sketch after this list)
  • Manual management (isolation & cleanup) of shared sessions between tasks
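
To make the coupling point concrete, here is a sketch of the kind of black-box Spark task such schedulers typically run. Everything in it – the paths, the object name, the aggregation – is invented for illustration, but the shape is familiar: transport, state and business logic live in one script, and the only error signal the scheduler ever sees is the exit code.

    import org.apache.spark.sql.SparkSession

    // Hypothetical black-box task: infrastructure and business logic tangled together.
    object DailyAggregateTask {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("daily-aggregate").getOrCreate()
        try {
          // State is a file in a shared folder that every downstream task must know about.
          val events = spark.read.parquet("/shared/staging/events/2019-10-01")
          // The actual business logic is two lines buried in infrastructure code.
          val report = events.groupBy("country").count()
          report.write.parquet("/shared/reports/daily/2019-10-01")
        } finally {
          spark.stop() // on failure the scheduler only sees a non-zero exit code
        }
      }
    }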


Stateless black boxes that are testable and decoupled from the data storage and transport layer would be a much better fit. This is how the pipeline could be re-designed:

Step 1: Get rid of the workflow manager!

Step 2: Turn black-box tasks and scripts into reactive microservices.

Step 3: Use Avro data contracts between stages. Data is also an API to be standardized, versioned and validated.
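
As an illustration, such a contract could be expressed with the plain Apache Avro API; the record name and fields below are made up for the example, but the idea is that both the producing and the consuming stage share one versioned schema instead of guessing at file layouts.

    import org.apache.avro.SchemaBuilder
    import org.apache.avro.generic.GenericRecordBuilder

    // Hypothetical data contract between two stages, versioned as part of the data API.
    val eventSchemaV1 = SchemaBuilder
      .record("PageViewEvent").namespace("com.example.pipeline.v1")
      .fields()
      .requiredString("userId")
      .requiredString("country")
      .requiredLong("timestamp")
      .endRecord()

    // Records are built and validated against the contract, not against an ad-hoc layout.
    val event = new GenericRecordBuilder(eventSchemaV1)
      .set("userId", "u-42")
      .set("country", "UK")
      .set("timestamp", 1571130000000L)
      .build()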

Step 4: Segregate black-box tasks into separate read, process and write services.
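
One possible way to express that segregation is three narrow interfaces, so a processing stage never touches a source or a sink directly; the trait names below are illustrative rather than taken from an existing framework.

    import org.apache.spark.sql.{Dataset, Row}

    // Illustrative roles: each former black-box task becomes exactly one of these.
    trait ReadStage    { def read(): Dataset[Row] }                    // owns the source
    trait ProcessStage { def process(in: Dataset[Row]): Dataset[Row] } // pure transformation
    trait WriteStage   { def write(out: Dataset[Row]): Unit }          // owns the sink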

Step 5: Make the state in a shared folder/topic/session manageable by the framework rather than by data engineers.

Step 6: Abstract the engineer away from data transport and provide a pure function to work with.
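
With such a framework in place, a data engineer’s entire contribution to a stage could shrink to a pure function like the sketch below (the object and its logic are hypothetical, matching the shape of the illustrative ProcessStage above); where the input Dataset comes from and where the result goes is the framework’s concern.

    import org.apache.spark.sql.{Dataset, Row}
    import org.apache.spark.sql.functions.col

    // Hypothetical processing stage: no topics, no paths, no sessions – just data in, data out.
    object CountByCountry {
      def process(events: Dataset[Row]): Dataset[Row] =
        events
          .filter(col("country").isNotNull)
          .groupBy("country")
          .count()
    }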

Streaming nerds might point me to Kafka, Spark Structured Streaming, Kafka Streams and Kafka Connect best practices. However, the devil is in the details, and Kafka is not a silver bullet. Machine Learning Engineers, Data Scientists and SQL-driven Data Engineers should not have to learn or care about the following while implementing a data pipeline:

  • Which topic to read from or write to
  • Which topic settings to specify
  • Where data is physically stored
  • How to wait for a complete batch of offsets before running Spark calculations, aggregations or an ML training pipeline
  • How to manage multiple input sources required for a Spark stage
  • How to apply, validate and version Avro schemas at runtime and compile time
  • How to use a different transport, e.g. in-memory Datasets managed by Spark Sessions rather than Kafka topics, in order to optimise latency
  • How to manage multiple Spark Contexts for parallel pipelines
  • How to write unit tests (see the sketch after this list)
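
Because a stage reduced to a pure function carries no transport code, it can be unit tested with nothing but a local SparkSession. A minimal sketch, assuming the hypothetical CountByCountry stage from above and ScalaTest 3.1+ on the classpath:

    import org.apache.spark.sql.SparkSession
    import org.scalatest.funsuite.AnyFunSuite

    // Minimal unit test for the hypothetical CountByCountry stage.
    class CountByCountrySpec extends AnyFunSuite {

      private val spark = SparkSession.builder()
        .master("local[1]")
        .appName("count-by-country-test")
        .getOrCreate()

      import spark.implicits._

      test("counts events per country") {
        val input = Seq(("u-1", "UK"), ("u-2", "UK"), ("u-3", "DE"))
          .toDF("userId", "country")

        val counts = CountByCountry.process(input)
          .collect()
          .map(row => row.getString(0) -> row.getLong(1))
          .toMap

        assert(counts == Map("UK" -> 2L, "DE" -> 1L))
      }
    }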


Please stay tuned, as the show is anything but over. The second part will be even more exciting! We’ll talk about ML FaaS: machine learning functions deployed as services.


Please contact me by email at spushkarev@hydrosphere.io before or after the talk. I would be happy to chat and follow up.


Please use discount code Pushkarev20 to save 20%.