Highlights

Why do we have orchestrators? tl;dr: legacy.

Keep state: keeping tabs on what has happened and what hasn’t is likely the core feature of the orchestrator.
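To make that "keep state" job concrete, here is a minimal, hypothetical sketch (not any real orchestrator's API): the scheduler's essential trick is just remembering which tasks have already succeeded so a rerun skips them.

```python
# Hypothetical sketch: an orchestrator's state-keeping reduced to its
# essence -- remember what ran so a retry doesn't redo finished work.
completed: set[str] = set()

def run_task(name: str, action) -> bool:
    """Run `action` once; record success so a rerun of the whole
    pipeline skips tasks that already happened."""
    if name in completed:
        return False  # already done, skip
    action()
    completed.add(name)
    return True

# A rerun only executes what is still missing.
ran_first = run_task("extract", lambda: None)   # executes
ran_again = run_task("extract", lambda: None)   # skipped
```

Everything else an orchestrator does (retries, backfills, alerting) layers on top of this ledger of what has and hasn't happened.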

But, to keep it brief, the orchestrator’s problem boils down to two things: it’s just another brick in the stack and, most importantly, new data technologies don’t need it.

Displacing the orchestrator means removing synchronization overhead, and the chief way to do this is to make the time of execution matter less. Today, this means implementing either Asynchronous Processing or High-Frequency batches.

Let’s cover high-frequency batches first, because I see it as an aberration.

First, latency adds up fast - it doesn’t take a deep dependency graph before high-frequency tasks stop correlating with high-availability datasets. And second, pricing hurts. Most data solutions aren’t designed for this usage for a simple reason: running a LEFT JOIN every 5 minutes means redundantly scanning a ton of data.
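A rough back-of-the-envelope sketch of why the rescanning hurts. All figures here are made up for illustration; the point is only the ratio between rescanning full tables on a schedule versus touching just the delta.

```python
# Hypothetical cost sketch: a LEFT JOIN rescheduled every 5 minutes
# rescans both full tables each run, even if only a few rows changed.
rows_left = 100_000_000        # fact table (assumed size)
rows_right = 1_000_000         # dimension table (assumed size)
new_rows_per_run = 5_000       # rows that actually changed in 5 min

runs_per_day = 24 * 60 // 5    # 288 runs per day

# Pull model: every run scans everything.
scanned_pull = runs_per_day * (rows_left + rows_right)

# Incremental (push-style) model: each run only touches the delta.
scanned_incremental = runs_per_day * new_rows_per_run

ratio = scanned_pull / scanned_incremental  # how much redundant work
```

Under these (invented) numbers the scheduled full join reads four orders of magnitude more data per day than an incremental approach would, which is exactly where the pricing pain comes from.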

For anyone who isn’t too familiar with streaming yet, that’s the difference between Pull and Push data operations: the former requires you to constantly trigger the actions you want performed (as in the ETL / ELT model, for instance), while the latter computes and propagates data points passively and incrementally. No orchestration! Each task is almost like a microservice, a data service of sorts.
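The Pull / Push distinction can be sketched in a few lines. This is an illustrative toy, not a real streaming framework: `pull_loop`, `subscribe`, and `emit` are made-up names standing in for a poller and a pub/sub hook.

```python
# Hypothetical sketch of Pull vs Push data operations.

# --- Pull: the consumer keeps asking "anything new?" on a schedule,
# re-checking the source on every tick whether or not data changed.
def pull_loop(source: list[int], polls: int) -> list[int]:
    seen, out = 0, []
    for _ in range(polls):
        out.extend(source[seen:])   # triggered check, orchestrated
        seen = len(source)
    return out

# --- Push: the producer calls the consumer the moment data exists;
# no scheduler decides when work happens.
subscribers = []

def subscribe(handler):
    subscribers.append(handler)

def emit(value: int):
    for handler in subscribers:     # propagate incrementally
        handler(value)

received: list[int] = []
subscribe(received.append)
emit(1)
emit(2)
```

In the push model each handler behaves like the "data service" described above: it reacts when data arrives, so there is nothing for an orchestrator to trigger.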

Imagine if everything you run in prod had to be orchestrated; that would be the death of agile. Really, why is this still a thing in the data world?

The solutions are already out there, like the Streaming Workflow Builders Hubert mentioned or what we’re up to at Popsink.