## Why are we doing this workshop

Our metrics are now hard to change, explain, and trust:

- They are hard to understand:
    - We have very long queries.
    - Some are overly complex in logic.
    - They are not expressive enough.
- They have no tests.

Although there are also external reasons for this (time pressures, pushing business logic into analytics instead of dealing with it on backends), we can focus on what we control and **write more maintainable metrics**. Specifically, we will review:

- Data modelling: common ways to structure tables so that metrics are easier to compute.
- SQL best practices.
- dbt as a tool for metric development and a shared medium with data engineering.

Apart from this, there are several things we should already be doing, particularly regarding metrics work:

- Peer reviews: a good check that we all understand what one person did.
- Pairing with DE on these matters, instead of just handing them the queries.

## Data modeling: why do we need it

![[fight-the-entropyolap-vs-oltp-and-normalization|Fight The Entropy#olap Vs Oltp And Normalization]]

## Transform data to make analysis easier, how?

### Ad-hoc doesn't work

"In the past", there was no modeling, and Data Analysts controlled all reporting in an ad-hoc way.

![[What is analytics engineering#^97e77c]]

Imagine doing this ad-hoc for each business question...

>For example: when asking about sales commissions, the CEO will think _“sales commissions is 5% of closed deals”._ However, the data analyst will think _“closed deals are stored in table `closed_deals`, so I need to take the `amount` column and multiply that with `months` to figure out the final amount; oh, and I need to check the `payment_received` column to make sure that only the received payment is counted”_.

![[Pasted image 20230731071504.png|400]]

([Data Modeling Layer & Concepts | The Analytics Setup Guidebook](https://www.holistics.io/books/setup-analytics/data-modeling-layer-and-concepts/ "https://www.holistics.io/books/setup-analytics/data-modeling-layer-and-concepts/"))

(We sketch this translation as a query at the end of this section.)

But this has two main problems. One is being a bottleneck:

> Every time the CEO wants something, she needs to wait hours (or even days) for Daniel \[the DA\] to crunch the numbers and get back to her. At one point, Daniel might be too busy crunching out numbers for different business stakeholders, instead of focusing his time on more valuable, long-term impact work.

([Data Modeling Layer & Concepts | The Analytics Setup Guidebook](https://www.holistics.io/books/setup-analytics/data-modeling-layer-and-concepts/ "https://www.holistics.io/books/setup-analytics/data-modeling-layer-and-concepts/"))

The other is entropy:

![[fight-the-entropybuild-awareness-about-a-cultural-problem-data-anarchy|Fight The Entropy#build Awareness About A (cultural) Problem Data Anarchy]]

We will later see that a very important way to support changing requirements is to expose atomic data.

### The star schema / Dimensional modeling

Isn't there a way to structure information so that answering most business questions is easy, without the need to predict in advance which types of questions are going to be asked?

Yes, and it was created by this guy, [[Ralph kimball|Ralph Kimball]], in [[The data warehouse toolkit 3rd edition|The Data Warehouse Toolkit 3rd Edition]].

![[Pasted image 20230728075806.png|400]]

It is called both dimensional modeling and star schema, given the shape of the final result.

![[Fight the entropy#^9534a7]]
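To make the commissions example concrete, here is a minimal sketch of the ad-hoc query the analyst ends up writing. The table and column names (`closed_deals`, `amount`, `months`, `payment_received`) come from the quoted example; everything else (treating `payment_received` as a boolean flag, the exact formula) is assumed for illustration.

```sql
-- Ad-hoc version: the business logic (the 5% rate, the payment filter)
-- lives only inside this one query, in the analyst's head.
select
    0.05 * sum(amount * months) as sales_commission
from closed_deals
where payment_received = true  -- assumed boolean flag
```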
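For contrast, a sketch of what the same question could look like against a star schema. `fct_deals` and `dim_sales_rep` are hypothetical fact and dimension tables; the point is that the measure is pre-computed at a known grain, so most business questions become simple joins plus group-bys.

```sql
-- Star schema version: one row per deal in the fact table,
-- descriptive attributes in the dimension, measures ready to aggregate.
select
    r.sales_rep_name,
    sum(f.commission_amount) as total_commission
from fct_deals as f
join dim_sales_rep as r
    on f.sales_rep_key = r.sales_rep_key
group by r.sales_rep_name
```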
#### Where does this go in our layered data warehouse?

We will later see more about these "layers", but this belongs in the "marts" layer of the warehouse. Those marts are just a subset of tables/views (models) grouped by domain, which are the only ones intended to be queried by end users. That is, all the other layers are just internal intermediate transformations that we shouldn't query. However, both facts and dimensions can be shared by different marts (we will see "conformed dimensions").

#### Core concepts

![[the-data-warehouse-toolkit-3rd-editionkimball-dimensional-modeling|The Data Warehouse Toolkit 3rd Edition#kimball Dimensional Modeling]]

## Who is responsible for data modeling?

From [[Analysts make the best analytics engineers|Analysts Make The Best Analytics Engineers]]:

![[Analysts make the best analytics engineers#^d82b4b]]

The problem is that this is far from a linear process. It looks more like this:

![[Analysts make the best analytics engineers#^bb8e95]]

This is even worse if we leave validation in the hands of yet another person.

As a curiosity, the [[Modern data stack|Modern Data Stack]] made it somewhat easier to collapse these two functions into a single role:

![[What is analytics engineering#^049b0d]]

![[What is analytics engineering#^281a2f]]

But adding such a role on top of the existing Data Engineering and Data Analysis roles (as well as others) has been criticized; see [[What is the right level of specialization for data teams and anyone else.|What Is The Right Level Of Specialization For Data Teams And Anyone Else]].

Since the skills required are closer to engineering, data modeling has traditionally been in Data Engineering's hands, but Data Analysts can pick up certain skills so that this collaboration is tighter and feedback loops are shorter.

## Some techniques from dimensional modelling worth knowing

- Types of fact tables.
- Handling slowly changing dimensions.
- [[Surrogate keys|Surrogate Keys]].
- OBTs/wide tables.
- NULLs in fact tables.
- ...

### Types of fact tables

### Keeping the past from changing: slowly changing dimensions

#### Metrics and Looker

#### dbt

##### The vision

It's for analysts too! From [The dbt Viewpoint | dbt Developer Hub](https://docs.getdbt.com/community/resources/viewpoint):

>We believe that analytics teams have a workflow problem. Too often, analysts operate in isolation, and this creates suboptimal outcomes. Knowledge is siloed. We too often rewrite analyses that a colleague had already written.

It's not a tool for doing analysis (i.e., running queries), but for transforming data (e.g., computing metrics). (A minimal model sketch, reusing the commissions example, appears at the end of this note.)

##### Writing a model

...

##### Jinja and macros

...

##### Packages

###### Regular tests

...

###### Unit testing

...

##### Docs and tribal knowledge

##### SQL best practices

...

#### An IDE for data analytics: Looker SQL Runner / Google Colab

...

Very good refs:

- https://about.gitlab.com/handbook/business-technology/data-team/platform/edw/
- [Is Kimball Still Relevant in the Modern Data Warehouse?](https://www.advancinganalytics.co.uk/blog/2019/6/17/is-kimball-still-relevant-in-the-modern-data-warehouse)

## Improvements to our own

- The grain of the tables is not clear.
- We don't have FKs that help us understand how to relate them.
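Coming back to dbt: a minimal sketch of what a model looks like, reusing the commissions example from earlier. The model path, the upstream staging model `stg_closed_deals`, and the columns are all assumptions for illustration; the real point is that a model is just a `SELECT`, and `{{ ref() }}` is how dbt wires models into a dependency graph.

```sql
-- models/marts/sales/fct_deals.sql  (hypothetical model)
-- dbt materializes this SELECT as a table or view; ref() declares
-- the dependency on an upstream staging model.
select
    deal_id,
    sales_rep_key,
    amount * months as deal_value,
    0.05 * amount * months as commission_amount
from {{ ref('stg_closed_deals') }}
where payment_received  -- only count received payments
```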
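Both gaps listed under "Improvements to our own" are things dbt tests can make explicit. Idiomatically you would declare `unique`/`not_null` tests on the grain columns and a `relationships` test for the FK in the model's `.yml` file; as an all-SQL illustration, a singular test (a query that fails the build if it returns any rows) could check the FK like this, again with hypothetical names:

```sql
-- tests/fct_deals_sales_rep_fk.sql  (hypothetical singular test)
-- dbt fails the build if this query returns rows:
-- every deal must point at an existing sales rep.
select f.deal_id
from {{ ref('fct_deals') }} as f
left join {{ ref('dim_sales_rep') }} as r
    on f.sales_rep_key = r.sales_rep_key
where r.sales_rep_key is null
```

The model's `description` field in the same `.yml` file is also a natural place to state each table's grain, so it stops living only in people's heads.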