Why are we doing this workshop?
Our metrics are now hard to change, explain, and trust:
- They are hard to understand:
  - We have very long queries.
  - Some are overly complex in logic.
  - They are not expressive enough.
- They have no tests.
Although there are also external reasons for this (time pressures, pushing business logic into analytics instead of dealing with it on backends), we can focus on what we control and write more maintainable metrics. Specifically, we will review:
- Data modelling: common ways to structure tables so that metrics are easier to compute.
- SQL best practices.
- dbt as a tool for metric development and a shared medium with data engineering.
Apart from this, there are several things we should already be doing, particularly regarding metrics work:
- Peer reviews: a good check that we all understand what one person did.
- Pairing with DE on these matters, instead of just handing them the queries.
Data modeling: why do we need it
OLAP vs OLTP and normalization
In the vast majority of applications, the main data access pattern is looking up a small number of records by some key, and records are inserted, updated or deleted based on the user's input. This pattern is known as online transaction processing (OLTP). The term transaction comes from commercial transactions.
To optimize for this access pattern, data is highly normalized.
Imagine storing sales without normalization, like this:
(from Database Normalisation – 1NF, 2NF, 3NF and 4NF, Journey to Data Science)
What would you need to do to update a supplier's phone number? However, if you store sales like this:
…change only occurs in one place.
This makes sure data integrity is preserved and there is less data to move around.
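A minimal sketch of that normalized layout (table and column names are illustrative, not our actual schema):

```sql
-- Hypothetical normalized schema: supplier details live in exactly one place.
create table suppliers (
    supplier_id   integer primary key,
    supplier_name text,
    phone         text
);

create table sales (
    sale_id     integer primary key,
    product_id  integer,
    supplier_id integer references suppliers (supplier_id),
    quantity    integer,
    amount      numeric
);

-- Updating a supplier's phone number now touches exactly one row:
update suppliers set phone = '555-0100' where supplier_id = 42;
```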
Unfortunately, normalization is not very friendly to business queries.
[Here we would have a data model and an example query, showing how difficult it is to answer a simple question].
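To give a flavor of what that example would show, here is an illustrative query against a hypothetical normalized sales model: even "revenue by supplier per month" needs a chain of joins to reassemble the data:

```sql
-- Hypothetical normalized model: a simple business question
-- forces us to stitch the data back together through joins.
select
    s.supplier_name,
    date_trunc('month', o.created_at) as order_month,
    sum(ol.quantity * p.unit_price)   as revenue
from orders o
join order_lines ol on ol.order_id   = o.order_id
join products p     on p.product_id  = ol.product_id
join suppliers s    on s.supplier_id = p.supplier_id
group by 1, 2;
```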
Your life can be better than this.
For business queries, the data access pattern is going to be very different from OLTP's: for example, we will want aggregates over a large number of records. This different access pattern is known as OLAP: online analytical processing.
We cross the chasm between OLTP and OLAP by transforming the data (data modeling) so that we make business analysis easier:
(Data Modeling Layer & Concepts | The Analytics Setup Guidebook)
In the past, this was also required for performance reasons, but that no longer applies in modern data warehouses. However, it still applies for logical purposes.
Transform data to make analysis easier, how?
Ad-hoc doesn't work
"In the past", there was no modeling, and Data Analysts controlled all reporting in an ad-hoc way.
If you were on a "traditional data team" pre-2012, your first data hire was probably a data engineer. (…) Analysts, like me, would maintain a mess of SQL files with names like `monthly_revenue_final.sql`, or maybe just bookmark their queries in a SQL web editor. Often we would need to supplement data in the warehouse with fancy Excel work. The people consuming the data (CEOs, Marketing VPs, CFOs) would receive monthly reports, request special analysis as-needed, and send analysts a never-ending stream of requests to "segment by this" or "cut by that" or "hey, we've updated our definition of 'account'". Being a data analyst was a hard and thankless job, and it didn't have a ton of leverage.
Imagine doing this ad-hoc for each business question…
For example: when asking about sales commissions, the CEO will think "sales commissions is 5% of closed deals". However, the data analyst will think "closed deals are stored in table `closed_deals`, so I need to take the `amount` column and multiply that with `months` to figure out the final amount; oh, and I need to check the `payment_received` column to make sure that only the received payment is counted".
(Data Modeling Layer & Concepts | The Analytics Setup Guidebook)
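A sketch of the query the analyst composes in their head, using only the column names mentioned in the quote (the exact schema and types are assumptions):

```sql
-- Hypothetical translation of "5% of closed deals" into the real schema.
select
    sum(amount * months) * 0.05 as sales_commissions
from closed_deals
where payment_received = true;
```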
But this has two main problems. One is being a bottleneck:
Every time the CEO wants something, she needs to wait hours (or even days) for Daniel [the DA] to crunch the numbers and get back to her. At one point, Daniel might be too busy crunching out numbers for different business stakeholders, instead of focusing his time on more valuable, long-term impact work. (Data Modeling Layer & Concepts | The Analytics Setup Guidebook)
The other is entropy:
Build awareness about a (cultural) problem: data anarchy
People will show up in a meeting to talk through a business problem, and different people will bring mismatched numbers. All too often, data slows decisions down rather than speeds them up, because people spend more time debating if the data is right than they do talking about what to do about it. And all too often, data teams will update core datasets or KPIs, and old versions will linger on for months.
Incidents like these erode trust in data and data teams. It teaches people not to participate in the data democracy, and to make decisions entirely on their own, often without using data at all. (Good Data Citizenship Doesn't Work | by Benn Stancil | Jan 2022 | Towards Data Science)
We kinda know why this happens:
…continually creating new combinations of models in all directions as new analytic opportunities arise.
It's not difficult to imagine why this happens: people have different opinions, habits, and diligence about naming, and when developing, it is often easier to build a new thing fit for the new purpose than to integrate your changes into the pre-existing ecosystem, test that yours works without breaking everyone else's, etc. (On the Importance of Naming: Model Naming Conventions (Part 1) | dbt Developer Blog)
And it happens to us (here there was an internal example of pipeline jungles and inconsistent metrics).
We will later see that a very important way to support changing requirements is to expose atomic data.
The star schema / Dimensional modeling
Isn't there a way to structure information so that answering most business questions is easy, without the need to predict in advance which types of questions are going to be asked? Yes; it was created by this guy, Ralph Kimball, in The Data Warehouse Toolkit (3rd Edition).
It is called both dimensional modeling and star schema, after the shape of the final result.
(from What is a star schema and how does it work?)
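For intuition, a minimal sales star might look like this (illustrative names, not a complete design):

```sql
-- Fact table: one row per sale (the grain), holding numeric measurements
-- plus foreign keys that point out to the dimensions.
create table fct_sales (
    sale_id      bigint primary key,
    date_key     integer,  -- points to dim_date
    product_key  integer,  -- points to dim_product
    customer_key integer,  -- points to dim_customer
    quantity     integer,
    amount       numeric
);

-- Dimension table: verbose, descriptive context around the event.
create table dim_product (
    product_key   integer primary key,
    product_name  text,
    category      text,
    supplier_name text
);
```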
Where does this go in our layered data warehouse?
We will later see more about these "layers", but this belongs in the "marts" layer of the warehouse. Those marts are just subsets of tables/views (models) grouped by domain, and they are the only ones intended to be queried by end users. That is, all the other layers are internal intermediate transformations that we shouldn't query. However, both facts and dimensions can be shared by different marts (we will see "conformed dimensions").
Core concepts
Kimball Dimensional modeling
(Nowadays we don't really need dimensional modeling for query performance. See also: OBT, One Big Table.)
Fact tables
The fact table in a dimensional model stores the performance measurements resulting from an organizationâs business process events. (p. 10)
The most useful facts are numeric and additive, such as dollar sales amount. (…) You will see that facts are sometimes semi-additive or even non-additive. Semi-additive facts, such as account balances, cannot be summed across the time dimension. Non-additive facts, such as unit prices, can never be added. (p. 11)
Notice how "fact" denotes a column (a measurement), not a row (which could be different things; normally, a transaction).
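To make additivity concrete, two illustrative queries (hypothetical tables): a dollar amount can be summed across any dimension, while a balance must not be summed across time:

```sql
-- Additive fact: safe to sum across every dimension.
select product_key, sum(amount) as total_sales
from fct_sales
group by product_key;

-- Semi-additive fact: daily balances must not be summed over time;
-- average them (or take the period-end balance) instead.
select account_key, avg(balance) as avg_daily_balance
from fct_account_balances
where balance_date between date '2022-01-01' and date '2022-01-31'
group by account_key;
```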
Dimension tables
The dimension tables contain the textual context associated with a business process measurement event. They describe the "who, what, where, when, how, and why" associated with the event. (p. 13)
Dimension attributes serve as the primary source of query constraints, groupings, and report labels. In a query or report request, attributes are identified as the by words. (p. 13)
Because they are the source of virtually all constraints and report labels, dimension attributes are critical to making the DW/BI system usable and understandable. Attributes should consist of real words rather than cryptic abbreviations. You should strive to minimize the use of codes in dimension tables by replacing them with more verbose textual attributes. (p. 14)
When triaging operational source data, it is sometimes unclear whether a numeric data element is a fact or dimension attribute. You often make the decision by asking whether the column is a measurement that takes on lots of values and participates in calculations (making it a fact) or is a discretely valued description that is more or less constant and participates in constraints and row labels (making it a dimensional attribute). For example, the standard cost for a product seems like a constant attribute of the product but may be changed so often that you decide it is more like a measured fact. Occasionally, you can't be certain of the classification; it is possible to model the data element either way (or both ways) as a matter of the designer's prerogative. (p. 14)
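In a query, those by words become the WHERE and GROUP BY clauses; for example, "sales by category, by month" (again with illustrative names):

```sql
select
    p.category,            -- "by category": a dimension attribute
    d.year_month,          -- "by month": an attribute of the date dimension
    sum(f.amount) as sales -- the fact being aggregated
from fct_sales f
join dim_product p on p.product_key = f.product_key
join dim_date d    on d.date_key    = f.date_key
where d.year = 2022        -- constraints also come from dimension attributes
group by p.category, d.year_month;
```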
Wow, mind blown:
Contrary to popular folklore, Ralph Kimball didn't invent the terms fact and dimension. As best as can be determined, the dimension and fact terminology originated from a joint research project conducted by General Mills and Dartmouth University in the 1960s. In the 1970s, both AC Nielsen and IRI used the terms consistently to describe their syndicated data offerings and gravitated to dimensional models for simplifying the presentation of their analytic information. (p. 15)
Who is responsible for data modeling?
From Analysts Make the Best Analytics Engineers:
You've got three steps that stand between you and your finished curated dataset. If you don't have an analytics engineer, then the work may be split up like this:
The problem is that this is far from a linear process. It looks more like this:
Which is even worse if we leave validation in the hands of yet another person.
As a curiosity, the Modern Data Stack made it somewhat easier to collapse these two functions into a single role:
Today, if you're a "modern data team" your first data hire will be someone who ends up owning the entire data stack. This person can set up Stitch or Fivetran to start ingesting data, keep a data warehouse tidy, write complex data transformations in SQL using dbt, and build reports on top of a clean data layer in Looker, Mode, Redash, etc.
This job is neither data engineering, nor analysis. It's somewhere in the middle, and it needed a new title. Starting in 2018, we and a few of our friends in the Locally Optimistic community started calling this role the analytics engineer.
But adding such a role on top of the existing Data Engineering and Data Analysis roles (among others) has been criticized as overspecialization. Since the skills required are closer to engineering, data modeling has traditionally been in Data Engineering's hands, but Data Analysts can pick up certain skills so that this collaboration is tighter and feedback loops are shorter.
Some techniques from dimensional modelling worth knowing
Types of fact tables
Keeping the past from changing: slowly changing dimensions
- Types of fact tables.
- Handling slowly changing dimensions (see the dbt snapshot sketch after this list).
- Surrogate keys.
- OBTs/Wide tables
- Nulls in Fact Tables
- …
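As a preview of the slowly changing dimensions item above: dbt implements type 2 SCDs with snapshots, which track row changes over time. A minimal sketch (the source and column names are assumptions):

```sql
{% snapshot suppliers_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='supplier_id',
        strategy='check',
        check_cols=['supplier_name', 'phone'],
    )
}}

-- dbt adds dbt_valid_from / dbt_valid_to columns on each run,
-- preserving the full history of every supplier row (SCD type 2).
select * from {{ source('erp', 'suppliers') }}

{% endsnapshot %}
```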
Metrics and Looker
dbt
The vision
It's for analysts too! From The dbt Viewpoint | dbt Developer Hub:
We believe that analytics teams have a workflow problem. Too often, analysts operate in isolation, and this creates suboptimal outcomes. Knowledge is siloed. We too often rewrite analyses that a colleague had already written.
It's not a tool to do analysis (meaning queries), but to transform data (e.g., compute metrics).
Writing a model
…
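A dbt model is just a SELECT statement saved as a .sql file; dbt takes care of materializing it as a view or table. A minimal sketch (model, source, and column names are illustrative):

```sql
-- models/marts/fct_sales.sql
{{ config(materialized='table') }}

select
    order_id as sale_id,
    customer_id,
    amount
from {{ ref('stg_orders') }}  -- ref() builds the dependency graph for us
where status = 'completed'
```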
Jinja and macros
…
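Jinja lets us template SQL and extract repeated logic into macros. An illustrative macro (adapted from the dbt documentation):

```sql
-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name, precision=2) %}
    round({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}
```

In a model it would be used as `select {{ cents_to_dollars('amount_cents') }} as amount_dollars from ...`.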
Packages
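Packages let us pull in models and macros maintained by the community, dbt_utils being the most common. A sketch of packages.yml (pin whichever version matches your dbt-core):

```yaml
# packages.yml -- installed with `dbt deps`
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
```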
Regular tests
…
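dbt's built-in generic tests (unique, not_null, accepted_values, relationships) are declared in YAML next to the models. A sketch against the hypothetical fct_sales model from above:

```yaml
# models/marts/schema.yml
version: 2

models:
  - name: fct_sales
    columns:
      - name: sale_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customer')
              field: customer_key
```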
Unit testing
…
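Native unit tests with mocked inputs exist since dbt 1.8 (before that, packages such as dbt-unit-testing filled the gap). A sketch for the hypothetical fct_sales model, assuming dbt >= 1.8:

```yaml
# models/marts/schema.yml (dbt >= 1.8)
unit_tests:
  - name: only_completed_orders_become_sales
    model: fct_sales
    given:
      - input: ref('stg_orders')
        rows:
          - {order_id: 1, customer_id: 10, status: 'completed', amount: 100}
          - {order_id: 2, customer_id: 11, status: 'cancelled', amount: 50}
    expect:
      rows:
        - {sale_id: 1, customer_id: 10, amount: 100}
```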
Docs and tribal knowledge
SQL best practices
…
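One practice worth singling out: structure queries as named CTEs, "importing" sources at the top and transforming step by step, as the dbt style guide recommends. An illustrative shape:

```sql
with

-- "import" CTEs: one per upstream table, kept dumb on purpose
orders as (
    select * from {{ ref('stg_orders') }}
),

customers as (
    select * from {{ ref('stg_customers') }}
),

-- one logical step per CTE, so each can be read and debugged alone
customer_orders as (
    select
        customer_id,
        count(*)    as order_count,
        sum(amount) as total_amount
    from orders
    group by customer_id
)

select
    customers.customer_id,
    customers.customer_name,
    customer_orders.order_count,
    customer_orders.total_amount
from customers
left join customer_orders
    on customer_orders.customer_id = customers.customer_id
```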
An IDE for data analytics: Looker SQL Runner / Google Colab
…
Very good refs:
Improvements to our own models
- The grain of our tables is not clear.
- We don't have foreign keys that help us understand how to relate the tables.