Why are we doing this workshop
Our metrics are now hard to change, explain and to trust:
- They are hard to understand:
- We have very long queries.
- Some are overly complex in logic.
- They are not expressive enough.
- They have no tests.
Although there are also external reasons for this (time pressures, pushing business logic into analytics instead of dealing with it on backends), we can focus on what we control and write more maintainable metrics. Specifically, we will review:
- Data modelling: common ways to structure tables so that metrics are easier to compute.
- SQL best practices.
- dbt as a tool for metric development and shared medium with data engineering.
Apart from this, there are several things we should already be doing, particularly regarding metrics work:
- Peer reviews: a good check that we all understand what one person did.
- Pairing with DE on this matters, instead of just giving them the queries.
Data modeling: why do we need it
![[fight-the-entropyolap-vs-oltp-and-normalization|Fight The Entropy#olap Vs Oltp And Normalization]]
Transform data to make analysis easier, how?
Ad-hoc doesnāt work
āIn the pastā, there was no modeling, and Data Analysts controlled all reporting in an ad-hoc way.
If you were on a ātraditional data teamā pre 2012, your first data hire was probably a data engineer. (ā¦) Analysts, like me, would maintain a mess of SQL files with names likeĀ
š to original contextmonthly_revenue_final.sql
, or maybe just bookmark their queries in a SQL web editor. Often we would need to supplement data in the warehouse with fancy Excel work. The people consuming the dataāCEOs, Marketing VPs, CFOsāwould receive monthly reports, request special analysis as-needed, and send analysts a never-ending stream of requests to āsegment by thisā or ācut by thatā or āhey, weāve updated our definition of āaccountāā. Being a data analyst was a hard and thankless job, and it didnāt have a ton of leverage.
Imagine doing this ad-hoc for each business questionā¦
For example: when asking about sales commissions, the CEO will thinkĀ āsales commissions is 5% of closed dealsā.Ā However, the data analyst will thinkĀ āclosed deals are stored in tableĀ
closed_deals
, so I need to take theĀamount
Ā column and multiply that withĀmonths
Ā to figure out the final amount; oh, and I need to check theĀpayment_received
Ā column to make sure that only the received payment is countedā.
(Data Modeling Layer & Concepts | The Analytics Setup Guidebook)
But this has two main problems. One is being a bottleneck:
Every time the CEO wants something, she needs to wait hours (or even days) for Daniel [the DA] to crunch the numbers and get back to her. At one point, Daniel might be too busy crunching out numbers for different business stakeholders, instead of focusing his time on more valuable, long-term impact work. (Data Modeling Layer & Concepts | The Analytics Setup Guidebook)
The other is entropy: ![[fight-the-entropybuild-awareness-about-a-cultural-problem-data-anarchy|Fight The Entropy#build Awareness About A (cultural) Problem Data Anarchy]]
We will later see that a very important thing to support changing requirements in to expose atomic data.
The star schema / Dimensional modeling
Isnāt there a way to structure information so that answering most business questions is easy, without the need to predict in advance which type of questions are going to be asked? Yes, created by this guy, Ralph Kimball in The Data Warehouse Toolkit 3rd Edition.
It is called both dimensional modeling and star schema, given the final result.
(from What is a star schema and how does it work?)
š to original context
Where does this go in our layered data warehouse?
We will later see more about these ālayersā, but this belongs into the āmartsā layer of the warehouse. Those marts are just a subset of tables/views (models) by domain, which are the only ones intented to be queried by the end users. That is, all the other layers are just internal intermediate transformations that we shouldnĀ“t query. However, both facts and dimensions could be shared by different marts (we will see āconformed dimensionsā).
Core concepts
![[the-data-warehouse-toolkit-3rd-editionkimball-dimensional-modeling|The Data Warehouse Toolkit 3rd Edition#kimball Dimensional Modeling]]
Who is responsible of data modeling?
From Analysts Make The Best Analytics Engineers:
š to original contextYouāve got three steps that stand between you and your finished curated dataset. If you donāt have an analytics engineer, then the work may be split up like this: (View Highlight)
The problem is that this is far from a linear process. It looks more like this:
š to original context
Which is even worse if we leave validation in the hands of yet another person.
As a curiosity, the Modern Data Stack made somewhat easier to collapse these two functions into a single role:
š to original contextToday, if youāre a āmodern data teamā your first data hire will be someone who ends up owning the entire data stack. This person can set up Stitch or Fivetran to start ingesting data, keep a data warehouse tidy, write complexĀ data transformationsĀ in SQL using dbt, and build reports on top of a clean data layer in Looker, Mode, Redash, etc.
š to original contextThis job is neither data engineering, nor analysis. Itās somewhere in the middle, and it needed a new title. Starting in 2018, we and a few of our friends in theĀ Locally Optimistic communityĀ started calling this role theĀ analytics engineer.
But adding such role to the existing Data Engineering and Data Analysis roles (as well as other roles) has been criticized as What Is The Right Level Of Specialization For Data Teams And Anyone Else. Since the skills required are closer to engineering, data modeling traditionally has been in Data Engineering hands, but Data Analysis can pick up certain skills so that this collaboration is tighter and feedback loops are shorter.
Some techniques from dimensional modelling worth knowing
Types of fact tables
Keeping the past from changing: slowly changing dimensiones
- Types of fact tables.
- Handling slowly changing dimensions.
- Surrogate Keys.
- OBTs/Wide tables
- Nulls in Fact Tables
- ā¦
Metrics and Looker
dbt
The vision
Itās for analysts too! From The dbt Viewpoint | dbt Developer Hub:
We believe that analytics teams have a workflow problem. Too often, analysts operate in isolation, and this creates suboptimal outcomes. Knowledge is siloed. We too often rewrite analyses that a colleague had already written.
Itās not a tool to do analysis (meaning queries), but to transform data (e.g., compute metrics).
Writing a model
ā¦
Jinja and macros
ā¦
Packages
Regular tests
ā¦
Unit testing
ā¦
Docs and tribal knowledge
Best practices SQL
ā¦
An IDE for data analytics: Looker SQL Runner / Google Colab
ā¦
Very good refs:
Improvements to our own
- No queda claro el grano de las tablas
- No tenemos FK que nos ayudan a entender cómo relacionarlas.