Why are we doing this workshop

Our metrics are now hard to change, explain, and trust:

  • They are hard to understand:
    • We have very long queries.
    • Some are overly complex in logic.
    • They are not expressive enough.
  • They have no tests.

Although there are also external reasons for this (time pressures, pushing business logic into analytics instead of dealing with it on backends), we can focus on what we control and write more maintainable metrics. Specifically, we will review:

  • Data modelling: common ways to structure tables so that metrics are easier to compute.
  • SQL best practices.
  • dbt as a tool for metric development and as a shared medium with data engineering.

Apart from this, there are several things we should already be doing, particularly regarding metrics work:

  • Peer reviews: a good check that we all understand what one person did.
  • Pairing with DE on these matters, instead of just giving them the queries.

Data modeling: why do we need it

OLAP vs OLTP and normalization

In the vast majority of applications, the main data access pattern is looking up a small number of records by some key; records are inserted, updated, or deleted based on the user’s input. This pattern is known as online transaction processing (OLTP). The term transaction comes from commercial transactions.

To optimize for this access pattern, data is highly normalized.

Imagine storing sales without normalization, like this:

(from Database – Normalisation – 1NF, 2NF, 3NF and 4NF – Journey to Data Science)

What would you need to do to update a supplier’s phone number? However, if you store sales like this:

…change only occurs in one place.

This makes sure data integrity is preserved and there is less data to move around.
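A minimal sketch of the difference (table and column names here are made up for illustration):

```sql
-- Denormalized: the supplier's phone is repeated on every sale row,
-- so a change means updating many rows and hoping none are missed.
UPDATE sales
SET supplier_phone = '+34 600 000 000'
WHERE supplier_name = 'Acme';

-- Normalized: the phone lives in exactly one place.
UPDATE suppliers
SET phone = '+34 600 000 000'
WHERE supplier_id = 42;
```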

Unfortunately, normalization is not very friendly to business queries.

[Here we would have a data model and an example query, showing how difficult it is to answer a simple question].
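Until that example is filled in, a hedged placeholder of the flavor of query this tends to produce (hypothetical normalized tables): even “revenue by product category last month” needs a chain of joins just to gather the context.

```sql
-- "Revenue by product category last month" over a normalized (OLTP-style) schema:
-- every piece of context lives in a separate table, so we join our way to it.
SELECT
    c.category_name,
    SUM(oi.quantity * oi.unit_price) AS revenue
FROM order_items oi
JOIN orders o             ON o.order_id = oi.order_id
JOIN products p           ON p.product_id = oi.product_id
JOIN product_categories c ON c.category_id = p.category_id
WHERE o.created_at >= date_trunc('month', current_date) - interval '1 month'
  AND o.created_at <  date_trunc('month', current_date)
GROUP BY c.category_name;
```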

Your life can be better than this.

For business queries, the data access pattern is going to be very different from OLTP: for example, we will want aggregates over a large number of records. This different access pattern is known as OLAP: online analytical processing.

We cross the chasm between OLTP and OLAP by transforming the data (data modeling) so that we make business analysis easier:

(Data Modeling Layer & Concepts | The Analytics Setup Guidebook)

In the past, this was also required for performance reasons, but that no longer applies in modern data warehouses. However, it still applies for logical reasons.

🔗 to original context

Transform data to make analysis easier, how?

Ad-hoc doesn’t work

“In the past”, there was no modeling, and Data Analysts controlled all reporting in an ad-hoc way.

If you were on a “traditional data team” pre 2012, your first data hire was probably a data engineer. (…) Analysts, like me, would maintain a mess of SQL files with names like monthly_revenue_final.sql, or maybe just bookmark their queries in a SQL web editor. Often we would need to supplement data in the warehouse with fancy Excel work. The people consuming the data–CEOs, Marketing VPs, CFOs–would receive monthly reports, request special analysis as-needed, and send analysts a never-ending stream of requests to “segment by this” or “cut by that” or “hey, we’ve updated our definition of ‘account’”. Being a data analyst was a hard and thankless job, and it didn’t have a ton of leverage.

🔗 to original context

Imagine doing this ad-hoc for each business question…

For example: when asking about sales commissions, the CEO will think “sales commissions is 5% of closed deals”. However, the data analyst will think “closed deals are stored in table closed_deals, so I need to take the amount column and multiply that with months to figure out the final amount; oh, and I need to check the payment_received column to make sure that only the received payment is counted”.

(Data Modeling Layer & Concepts | The Analytics Setup Guidebook)
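A sketch of that translation in SQL, taking the table and column names from the quoted example and guessing at the rest of the logic:

```sql
-- "Sales commissions are 5% of closed deals", as the analyst has to translate it.
-- (Illustrative only: the business logic here is inferred from the example above.)
SELECT
    SUM(amount * months) * 0.05 AS sales_commissions
FROM closed_deals
WHERE payment_received = true;
```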

But this has two main problems. One is being a bottleneck:

Every time the CEO wants something, she needs to wait hours (or even days) for Daniel [the DA] to crunch the numbers and get back to her. At one point, Daniel might be too busy crunching out numbers for different business stakeholders, instead of focusing his time on more valuable, long-term impact work. (Data Modeling Layer & Concepts | The Analytics Setup Guidebook)

The other is entropy:

Build awareness about a (cultural) problem: data anarchy

People will show up in a meeting to talk through a business problem, and different people will bring mismatched numbers. All too often, data slows decisions down rather than speeds them up, because people spend more time debating if the data is right than they do talking about what to do about it. And all too often, data teams will update core datasets or KPIs, and old versions will linger on for months.

Incidents like these erode trust in data and data teams. It teaches people not to participate in the data democracy, and to make decisions entirely on their own, often without using data at all. (Good Data Citizenship Doesn’t Work | by Benn Stancil | Jan, 2022 | Towards Data Science)

We kinda know why this happens:

…continually creating new combinations of models in all directions as new analytic opportunities arise.

It’s not difficult to imagine why this happens — people have different opinions, habits, and diligence about naming and when developing, it is often easier to build a new thing fit for the new purpose than to integrate your changes into the pre-existing ecosystem, test that yours works without breaking everyone else’s, etc. (On the Importance of Naming: Model Naming Conventions (Part 1) | dbt Developer Blog)

And it happens to us (here there was an internal example of pipeline jungles and inconsistent metrics).

🔗 to original context

We will later see that a very important thing to support changing requirements is to expose atomic data.

The star schema / Dimensional modeling

Isn’t there a way to structure information so that answering most business questions is easy, without the need to predict in advance which types of questions are going to be asked? Yes: dimensional modeling, created by Ralph Kimball and described in The Data Warehouse Toolkit, 3rd Edition.

It is called both dimensional modeling and star schema, given the star-like shape of the final result.

(from What is a star schema and how does it work?)

🔗 to original context

Where does this go in our layered data warehouse?

We will later see more about these “layers”, but this belongs in the “marts” layer of the warehouse. A mart is just a subset of tables/views (models) for a given domain, and these are the only ones intended to be queried by end users. That is, all the other layers are just internal intermediate transformations that we shouldn’t query. However, both facts and dimensions can be shared by different marts (we will see “conformed dimensions”).
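As a rough sketch of what a mart model could look like in dbt (which we cover later; model and column names are hypothetical):

```sql
-- models/marts/sales/fct_orders.sql
-- A mart-level fact model: end users query this, not the staging or
-- intermediate models it is built from.
SELECT
    o.order_id,
    o.customer_id,   -- points at dim_customers, a dimension other marts could share
    o.ordered_at,
    o.amount
FROM {{ ref('stg_orders') }} AS o
```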

Core concepts

Kimball Dimensional modeling

(Nowadays we don’t really need dimensional modeling for query performance. See also: OBT - One Big Table).

Fact tables

The fact table in a dimensional model stores the performance measurements resulting from an organization’s business process events. (p. 10)

The most useful facts are numeric and additive, such as dollar sales amount. (…) You will see that facts are sometimes semi-additive or even non-additive. Semi-additive facts, such as account balances, cannot be summed across the time dimension. Non-additive facts, such as unit prices, can never be added. (p. 11)
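A quick illustration of (semi-)additivity, assuming a hypothetical fct_account_balances table with one row per account per day:

```sql
-- Fine: balances are additive across accounts for a single day.
SELECT balance_date, SUM(balance) AS total_balance
FROM fct_account_balances
GROUP BY balance_date;

-- Wrong: summing a semi-additive fact across the time dimension inflates it.
-- Take the balance at a point in time (or an average over the period) instead.
SELECT account_id, SUM(balance) AS meaningless_total
FROM fct_account_balances
GROUP BY account_id;
```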

Notice how “fact” denotes a column (a measurement), not a row (which could be different things; normally, a transaction).
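A sketch of what such a fact table could look like (hypothetical names): one row per sales transaction, with the facts as numeric, additive columns.

```sql
CREATE TABLE fct_sales (
    -- the grain: one row per sales transaction
    date_key      integer NOT NULL,  -- FK to dim_date
    product_key   integer NOT NULL,  -- FK to dim_product
    store_key     integer NOT NULL,  -- FK to dim_store
    -- the facts: numeric measurements of the event
    quantity      integer,
    sales_amount  numeric(12, 2)
);
```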

Dimension tables

The dimension tables contain the textual context associated with a business process measurement event. They describe the “who, what, where, when, how, and why” associated with the event. (p. 13)

Dimension attributes serve as the primary source of query constraints, groupings, and report labels. In a query or report request, attributes are identified as the by words. (p. 13)

Because they are the source of virtually all constraints and report labels, dimension attributes are critical to making the DW/BI system usable and understandable. Attributes should consist of real words rather than cryptic abbreviations. You should strive to minimize the use of codes in dimension tables by replacing them with more verbose textual attributes (p. 14)

When triaging operational source data, it is sometimes unclear whether a numeric data element is a fact or dimension attribute. You often make the decision by asking whether the column is a measurement that takes on lots of values and participates in calculations (making it a fact) or is a discretely valued description that is more or less constant and participates in constraints and row labels (making it a dimensional attribute). For example, the standard cost for a product seems like a constant attribute of the product but may be changed so often that you decide it is more like a measured fact. Occasionally, you can’t be certain of the classification; it is possible to model the data element either way (or both ways) as a matter of the designer’s prerogative. (p. 14)
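A sketch of a matching dimension table and of how its attributes become the “by” words of a query (names are hypothetical):

```sql
CREATE TABLE dim_product (
    product_key   integer PRIMARY KEY,  -- surrogate key
    product_name  text,
    category      text,   -- verbose textual attributes, not cryptic codes
    brand         text
);

-- "Sales BY category and BY brand": dimension attributes drive grouping and filtering.
SELECT d.category, d.brand, SUM(f.sales_amount) AS sales_amount
FROM fct_sales f
JOIN dim_product d ON d.product_key = f.product_key
GROUP BY d.category, d.brand;
```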

Wow, mind blown:

Contrary to popular folklore, Ralph Kimball didn’t invent the terms fact and dimension. As best as can be determined, the dimension and fact terminology originated from a joint research project conducted by General Mills and Dartmouth University in the 1960s. In the 1970s, both AC Nielsen and IRI used the terms consistently to describe their syndicated data offerings and gravitated to dimensional models for simplifying the presentation of their analytic information. (p. 15)

🔗 to original context

Who is responsible for data modeling?

From Analysts Make the Best Analytics Engineers:

You’ve got three steps that stand between you and your finished curated dataset. If you don’t have an analytics engineer, then the work may be split up like this:

🔗 to original context

The problem is that this is far from a linear process. It looks more like this:

(diagram from Analysts Make the Best Analytics Engineers)

🔗 to original context

Which is even worse if we leave validation in the hands of yet another person.

As a curiosity, the Modern Data Stack made it somewhat easier to collapse these two functions into a single role:

Today, if you’re a “modern data team” your first data hire will be someone who ends up owning the entire data stack. This person can set up Stitch or Fivetran to start ingesting data, keep a data warehouse tidy, write complex data transformations in SQL using dbt, and build reports on top of a clean data layer in Looker, Mode, Redash, etc.

🔗 to original context

This job is neither data engineering, nor analysis. It’s somewhere in the middle, and it needed a new title. Starting in 2018, we and a few of our friends in the Locally Optimistic community started calling this role the analytics engineer.

🔗 to original context

But adding such a role on top of the existing Data Engineering and Data Analysis roles (as well as other roles) has been criticized as overspecialization. Since the skills required are closer to engineering, data modeling has traditionally been in Data Engineering’s hands, but Data Analysts can pick up certain skills so that this collaboration is tighter and the feedback loops are shorter.

Some techniques from dimensional modelling worth knowing

Types of fact tables

Keeping the past from changing: slowly changing dimensions

  • Types of fact tables.
  • Handling slowly changing dimensions (see the sketch after this list).
  • Surrogate keys.
  • OBTs/Wide tables
  • Nulls in Fact Tables
  • …
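As a taste of one of these techniques: a type 2 slowly changing dimension keeps history by adding validity columns instead of overwriting attributes. In dbt this is usually done with a snapshot; a hedged sketch, with hypothetical source and column names:

```sql
-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}
{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}
-- dbt adds dbt_valid_from / dbt_valid_to columns, so facts can keep joining
-- to the customer attributes that were valid at the time of the event.
SELECT * FROM {{ source('crm', 'customers') }}
{% endsnapshot %}
```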

Metrics and Looker

dbt

The vision

It’s for analysts too! From The dbt Viewpoint | dbt Developer Hub:

We believe that analytics teams have a workflow problem. Too often, analysts operate in isolation, and this creates suboptimal outcomes. Knowledge is siloed. We too often rewrite analyses that a colleague had already written.

It’s not a tool to do analysis (meaning queries), but to transform data (e.g., compute metrics).

Writing a model

…
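(Not written yet, but as a hedged sketch: a dbt model is just a SELECT saved as a .sql file; dbt materializes it as a view or table and resolves dependencies through ref()/source(). All names below are hypothetical.)

```sql
-- models/staging/stg_payments.sql
SELECT
    id              AS payment_id,
    order_id,
    amount / 100.0  AS amount,   -- cents to currency units
    status
FROM {{ source('app', 'payments') }}
WHERE status != 'failed'
```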

Jinja and macros

…
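(As a hedged sketch: Jinja macros let us reuse SQL fragments across models, e.g. a hypothetical macro for converting amounts stored in cents.)

```sql
-- macros/cents_to_currency.sql
{% macro cents_to_currency(column_name) %}
    ({{ column_name }} / 100.0)::numeric(16, 2)
{% endmacro %}

-- usage inside a model:
-- SELECT {{ cents_to_currency('amount') }} AS amount FROM {{ source('app', 'payments') }}
```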

Packages
Regular tests

…
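(As a hedged sketch: besides dbt’s built-in generic tests such as unique and not_null, a “singular” test is just a query that must return zero rows. Model and column names are hypothetical.)

```sql
-- tests/assert_no_negative_order_amounts.sql
-- A dbt singular test: it fails if this query returns any rows.
SELECT order_id, amount
FROM {{ ref('fct_orders') }}
WHERE amount < 0
```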

Unit testing

…

Docs and tribal knowledge
Best practices SQL

…

An IDE for data analytics: Looker SQL Runner / Google Colab

…

Very good refs:

Improvements to our own

  • The grain of the tables is not clear.
  • We don’t have FKs that help us understand how to relate them (see the sketch below).
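A hedged sketch of how both points could be made explicit and checked with a dbt singular test (model and column names are hypothetical):

```sql
-- tests/assert_fct_orders_grain_and_fk.sql
-- Passes when it returns zero rows.

-- Rows that violate the declared grain (one row per order_id):
SELECT order_id AS offending_id
FROM {{ ref('fct_orders') }}
GROUP BY order_id
HAVING COUNT(*) > 1

UNION ALL

-- Rows whose customer_id has no match in the dimension (a broken "FK"):
SELECT f.order_id AS offending_id
FROM {{ ref('fct_orders') }} f
LEFT JOIN {{ ref('dim_customers') }} c ON c.customer_id = f.customer_id
WHERE c.customer_id IS NULL
```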