Highlights

How to choose among options for a data lake? Transcript: Speaker 1 I like to think in terms of human productivity rather than just machine time, because I think that's what you should really be optimizing for. The performance question is really just, how long are we making a human wait? There's a cost angle to it certainly as well, but I really like to focus on the human factor. (Time 0:18:42)

The Difficulty of Maintaining Up-to-Date Comparisons of Delta Features Transcript: Speaker 1 Streaming. That's a hard one, because Databricks and Delta are going to say you want to use Spark streaming in this micro-batch approach. Then you've got Hudi saying, you know, row-level upsert is the model you should go with. With Iceberg, we mostly say (of course, I think of everything as a nuanced-answer kind of thing), well, by default we're going to use copy-on-write, because that's going to give you the best read performance with no overhead. Merge-on-read is going to give you better write performance, but then you have to have something going and compacting that data regularly. What we also recommend is a third approach, which is writing a changelog table and periodically ETLing that state to compact down and, you know, snapshot the state. And that last one actually gets you the semantics of being able to coordinate with transactions in the upstream CDC system. So in terms of that, you've got these three vastly different views of what you should use and why. And so it's not even enough to say, we support streaming or we support this feature, right? It's, well, are you even using the approach to the problem that makes the most sense? Speaker 2 And to make it even muddier, it's also, are you using the engine that actually supports this particular implementation of that feature? Yes, exactly. (Time 0:21:40)
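The copy-on-write versus merge-on-read choice he describes is something Iceberg exposes as per-operation table properties, and the third option is just an ordinary append-only table that you compact later. A minimal PySpark sketch, assuming a Spark session with the Iceberg runtime and a catalog named `demo`; the table and column names are made up for illustration:

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg Spark runtime and a catalog named `demo` are already
# configured; `db.events` and its columns are hypothetical.
spark = SparkSession.builder.appName("iceberg-write-modes").getOrCreate()

# Default behavior is copy-on-write: updates/deletes rewrite the affected
# data files, so reads stay fast with no merge work at query time.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (
        id BIGINT, payload STRING, updated_at TIMESTAMP
    ) USING iceberg
""")

# Opt specific operations into merge-on-read: writes get cheaper (delete
# files instead of file rewrites), but something has to compact regularly.
spark.sql("""
    ALTER TABLE demo.db.events SET TBLPROPERTIES (
        'write.update.mode' = 'merge-on-read',
        'write.delete.mode' = 'merge-on-read',
        'write.merge.mode'  = 'merge-on-read'
    )
""")

# The third approach from the transcript: stream raw changes into a plain
# append-only changelog table, then periodically compact/snapshot it
# (see the materialization sketch further down).
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events_changelog (
        op STRING, txn_id BIGINT, id BIGINT, payload STRING, updated_at TIMESTAMP
    ) USING iceberg
""")
```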

Lineage < Tagging for data quality Transcript: Speaker 1 But lineage, it seems like 50% of the use cases for it are, how do I know, when bad data leaks, where it went? And this pattern sort of flips that and says, well, let's make sure that data is good before we release it, right? So if you can produce the data without publishing the data, it's really powerful. So that led us to branching and tagging as well. Again, because we're using a Git-like model, we've recently added branches and tags to table metadata. So you can say, okay, I want to create a branch off of main, test code, you know, build up a whole bunch of commits, and say, okay, yeah, that looks good, and then deploy that code to production and delete the branch. Or you could do a longer version of that, the write-audit-publish workflow, where you create a branch for this hour's worth of data, you accumulate the data, make sure it's internally consistent, and then you fast-forward main to the state of that branch. (Time 0:31:21)
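This is roughly what write-audit-publish with Iceberg branches looks like from Spark. A hedged sketch, assuming the Iceberg SQL extensions are enabled; the catalog, tables, branch name, and quality check are all placeholders, and the `fast_forward` procedure and `spark.wap.branch` property depend on a reasonably recent Iceberg version:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wap-branch-sketch").getOrCreate()

# 1. Create a branch off main for this hour's worth of data.
spark.sql("ALTER TABLE demo.db.events CREATE BRANCH hourly_load")

# 2. Route this session's writes to the branch instead of main.
spark.conf.set("spark.wap.branch", "hourly_load")
spark.sql("INSERT INTO demo.db.events SELECT * FROM demo.db.events_staging")

# 3. Audit: the data is committed to the branch but not published, so
#    readers of main never see it. `id IS NULL` stands in for a real check.
bad_rows = spark.sql("""
    SELECT count(*) AS n
    FROM demo.db.events VERSION AS OF 'hourly_load'
    WHERE id IS NULL
""").first()["n"]

# 4. Publish by fast-forwarding main to the branch head, then clean up.
if bad_rows == 0:
    spark.sql("CALL demo.system.fast_forward('db.events', 'main', 'hourly_load')")
spark.conf.unset("spark.wap.branch")
spark.sql("ALTER TABLE demo.db.events DROP BRANCH hourly_load")
```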

The Complexity of Aligning Commits on Tables with Different Data Structures Transcript: Speaker 1 You shouldn't always just try to put your downstream table into the exact state that you want. Sometimes it's better to just defer that work until later and, you know, use an ETL process to say, okay, well, I know that we've processed through transaction 10,011, and we're going to materialize those table states in the analytic tables and do it that way. As a two-step process, it's much, much easier than trying to put all of that workload on your Flink writer. (Time 0:43:16)
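A sketch of that two-step idea, continuing the hypothetical changelog table above: the streaming writer only appends raw change rows (with a `txn_id` carried over from the upstream CDC system), and a separate batch job later folds everything up to a known transaction into the analytic table. The MERGE below is Spark SQL against Iceberg; the watermark value and column names are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("changelog-materialize").getOrCreate()

# Step 1 happens elsewhere: a streaming (e.g. Flink) job appends raw CDC
# rows to demo.db.events_changelog with their upstream transaction ids.

# Step 2: pick a transaction id we know has been fully written, keep only
# the latest change per key up to that point, and merge it in.
watermark = 10011

spark.sql(f"""
    MERGE INTO demo.db.events t
    USING (
        SELECT * FROM (
            SELECT *,
                   row_number() OVER (PARTITION BY id ORDER BY txn_id DESC) AS rn
            FROM demo.db.events_changelog
            WHERE txn_id <= {watermark}
        ) ranked
        WHERE rn = 1
    ) s
    ON t.id = s.id
    WHEN MATCHED AND s.op = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET
        t.payload = s.payload, t.updated_at = s.updated_at
    WHEN NOT MATCHED AND s.op != 'D' THEN
        INSERT (id, payload, updated_at) VALUES (s.id, s.payload, s.updated_at)
""")
```

Because the analytic table only ever moves to states that reflect everything through a known transaction, downstream consumers can reason about it in the upstream system's transaction terms instead of expecting the streaming writer to produce exact final states.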

Reminded me of this idea in The Black Swan and a bit of Bullshit Jobs

If you do data eng. right, you don’t exist Transcript: Speaker 2 As somebody who has worked in operations and data engineering for a long time, if I’m doing my job right, nobody knows I exist. (Time 0:47:45)