Highlights

All of the above are instances where we’ve cleaned our data by making a decision. You might classify some as “cleaning” and some as “analysis”, but they’re all very similar classes of transformations. What gets labeled as “cleaning” vs “analysis” seems more a judgement of its value — the good stuff that makes us sound smart is called “analysis” (View Highlight)

cleaning operations themselves impose value judgments upon the data. (View Highlight)
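A minimal sketch of that point: the same line of code reads as "cleaning" or as "analysis" depending on how it is framed. The data and the cutoff below are hypothetical, chosen only to make the judgment visible.

```python
# Hypothetical session durations in seconds, with one suspicious outlier.
session_seconds = [3, 45, 60, 52, 48, 7200, 41]

# Framed as cleaning: "drop obviously broken tracking rows."
# Framed as analysis: "exclude sessions outside normal engagement."
# Either way, the cutoff is a value judgment baked into the data.
MAX_PLAUSIBLE = 3600  # an assumed one-hour cutoff, not a fact about the data

kept = [s for s in session_seconds if s <= MAX_PLAUSIBLE]
print(kept)  # the 7200-second session is gone under either framing
```

Whoever downstream consumes `kept` inherits that cutoff whether or not they agree with it, which is the sense in which the cleaning operation imposes a value judgment.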

The only way to get that level of familiarity is to do the hard work and get deeply familiar with the data set at hand. People usually attribute this work to the “Exploratory Data Analysis” phase, but EDA often happens after the data has already been partially cleaned. Truly raw, unclean data often doesn’t even function properly within software, because something unexpected usually trips things up. Fixing those issues brings a huge amount of knowledge (and questions) about the data set. (Okay, astute readers will notice that this isn’t the only way — the best way to get ultimate familiarity with a data set is to actually collect the data yourself — but that’s a topic for another day.) (View Highlight)

It doesn’t help that safeguarding against these situations is extremely difficult in a modern distributed-systems infrastructure setup. Something is always in the process of changing or breaking, and those changes have unpredictable side effects. (View Highlight)

the act of “cleaning data” is actually building a library of transformations that are largely reusable across multiple analyses on the same data set (View Highlight)
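That "library of transformations" idea can be sketched as named functions composed into a pipeline. The function names and rules below are hypothetical examples, not taken from the article; the point is that each cleaning decision gets written down once and reused across analyses of the same data set.

```python
# Hypothetical canonical spellings -- each entry is a recorded decision.
COUNTRY_ALIASES = {"USA": "US", "U.S.": "US"}

def strip_test_accounts(records):
    """Drop internal test users (a value judgment: they aren't 'real' users)."""
    return [r for r in records if not r["email"].endswith("@internal.test")]

def normalize_country(records):
    """Collapse country spellings to one canonical code."""
    return [{**r, "country": COUNTRY_ALIASES.get(r["country"], r["country"])}
            for r in records]

def pipeline(records, steps):
    """Apply cleaning steps in order -- the reusable 'library' in action."""
    for step in steps:
        records = step(records)
    return records

users = [
    {"email": "a@example.com", "country": "USA"},
    {"email": "qa@internal.test", "country": "US"},
]
clean = pipeline(users, [strip_test_accounts, normalize_country])
print(clean)  # one real user remains, country normalized to "US"
```

A second analysis on the same data set reuses the same `steps` list, so both analyses inherit identical (and auditable) cleaning decisions instead of silently diverging.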