Metadata
- Author: Tristan Handy
- Full Title:: The Next Big Step Forwards for Analytics Engineering
- Category:: 🗞️Articles
- Document Tags:: Data culture
- URL:: https://www.getdbt.com/blog/analytics-engineering-next-step-forwards/
- Finished date:: 2023-04-20
Highlights
- Create teams of ~5-9 people that own their own code base and can push code to production without being blocked.
- Every codebase is responsible for exposing interfaces to other teams to build on top of.
- That team owns the entire lifecycle of their assigned surface area, including maintaining code in production and “holding the pager.” (View Highlight)
This socio-technical architecture has allowed software engineers to build globe-spanning systems of unbelievable sophistication (View Highlight)
Can dbt developers do this? In sufficiently large organizations with sufficiently complex dbt projects, they cannot (View Highlight)
Each team must be able to take ownership of the complexity in its own domain, and then expose its finished work through easy-to-consume interfaces that other teams and their codebases can reference (View Highlight)
Many teams today try to deconstruct dbt projects, but they fail when they attempt to reference models in other projects (which is inevitably critical). The two solutions that teams attempt are both bad options:
- They reference a dataset from another project directly (using schema.table) or register it as a source. Either of these breaks the DAG, environment management, etc.
- They import the project they want to reference as a package and use ref(). This re-couples the two projects together; they are now effectively a single monolithic project again. (View Highlight)
dbt Core v1.5 is slated for release at the end of April, and it will include three new constructs:
- Access: Choose which models ought to be “private” (implementation details, handling complexity within one team or domain) and “public” (an intentional interface, shared with other teams). Other groups and projects can only ref a model (that is, take a critical dependency on it) in accordance with its access.
- Contracts: Define the structure of a model explicitly. If your model’s SQL doesn’t match the specified column names and data types, it will fail to build. Breaking changes (removing, renaming, retyping a column) will be caught during CI. On data platforms that support build-time constraints, ensure that columns are not null or pass custom checks while a model is being built, in addition to more flexible testing after.
- Versions: A single model can have multiple versioned definitions, with the same name for downstream reference. When a mature model with an enforced contract and public access needs to undergo a breaking change, rather than breaking downstream queriers immediately, facilitate their migration by bumping the version and communicating a deprecation window. (View Highlight)
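Note to self: a rough sketch of how the three constructs might sit together in one properties file. This is not from the article. The file path, the group, the model names, the columns, and the owner details are all hypothetical, and the keys reflect my understanding of the dbt Core v1.5 YAML spec, so check them against the docs for the version in use.

```yaml
# Hypothetical models/schema.yml; every name below is illustrative.
groups:
  - name: customer_success            # hypothetical owning team
    owner:
      name: Customer Success Data
      email: cs-data@example.com

models:
  - name: dim_customers               # hypothetical public model
    group: customer_success
    access: public                    # intentional interface that other teams may ref
    latest_version: 2
    config:
      contract:
        enforced: true                # build fails if the SQL's columns or types differ
    columns:
      - name: customer_id
        data_type: int
        constraints:
          - type: not_null            # build-time check where the platform supports it
      - name: customer_name
        data_type: string
    versions:
      - v: 1                          # kept around during the deprecation window
      - v: 2                          # current definition

  - name: int_customers_deduped       # hypothetical helper model
    group: customer_success
    access: private                   # implementation detail; refs limited to the owning group
```

Under these assumptions, a downstream model that is not ready to migrate could pin the old definition with something like `ref('dim_customers', v=1)`, while an unpinned `ref('dim_customers')` would resolve to the latest version.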
In the future, individual teams will own their own data. Data engineering will own “core tables” or “conformed dimensions” that will be used by other teams. Ecommerce will own models related to site visits and conversion rate. Ops will own data related to fulfillment. Etc.
Each of these teams will reference the public interfaces exposed by other teams as a part of their work, and periodically release upgrades as versions are incremented on upstream dependencies. Teams will review PRs for their own models, and so have more context for what “good” looks like. Monitoring and alerting will happen in alignment with teams and codebases, so there will be real accountability to delivering a high quality, high reliability data product. Teams will manage their own warehouse spend and optimize accordingly. And teams will be able to publish their own metrics all the way into their analytics tool of choice.
Teams owning their own data. This is how analytics scales. I’m excited about this future and hope you are too. (View Highlight)