![rw-book-cover](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3fe6497-a167-47a8-818f-aada570f6341%2Ffavicon-32x32.png)

## Metadata
- Author: [[Chad Sanderson]]
- Full Title:: Data is not a Microservice
- Category:: #🗞️Articles
- Document Tags:: [[✍️ Ser data-driven no es de guapas]],
- URL:: https://dataproducts.substack.com/p/data-is-not-a-microservice
- Annotated link:: https://readwise.io/reader/shared/01hcmyktqkzy3mj6acwdn0v0sy
- Read date:: [[2023-10-13]]

## Highlights
> *The motivations of a software engineering team are inherently different than a data team* ([View Highlight](https://read.readwise.io/read/01hjzn5gqmxbgp1akyx0xgjjrz)) ^104d20

>Why Software Engineering Can't Solve Data's Problems. **The motivations of a software engineering team are inherently different than a data team.**

>By minimizing interdependencies within the code base, developers can evolve their services with minimal constraints. As a result, organizations scale easily, integrate with off-the-shelf tooling as and when it’s needed, and organize their engineering teams around service ownership.

>Sometime later, Zhamak Dehghani, the pioneer behind [Data Mesh](https://martinfowler.com/articles/data-mesh-principles.html) (and also a Thoughtworker) released her book on the data equivalent to microservices - the Data Mesh.

>My thesis is based on three core arguments:
>1. Data teams require a source of truth, which microservices cannot provide without an overhaul of the software engineering discipline
>2. We can’t know in advance when data will become valuable, which makes up-front ownership of data microservices overly restrictive
>3. The data development lifecycle is different from the software engineering development lifecycle, and microservices are a poor fit to facilitate the needs of data teams

>The purpose of a microservice is to power an aspect of some customer experience. Its primary function is operational.

>**The purpose of data is decision-making. Its primary function is TRUTH.** How that truth is used can be operational (like an ML model) or analytical (answering some interesting question). ^41e075

>Data developers struggle because the data they have taken dependencies on has no ownership, the underlying meaning is not clear, and when something changes from a source system very few people know why and what they should expect the new 'truth' to be as a result. **In data, our largest problems are rooted in a lack of** ***trust.***

>nothing that prevents the same data from being defined by multiple microservices in different ways, from being called different names, or from being changed at any time for any reason without the downstream consumers being told about it.

> For instance, at Convoy, a metric called shipment_margin was calculated as the revenue we made servicing a load minus the costs of servicing the load. Many teams had a separate view of which costs were germane to their particular revenue stream. These teams would add dimensions, stack CASE statements on top of their SQL queries like Jenga blocks, rename columns, and ultimately push data to new models where it was reused later, often with vastly different assumptions.
>
> As a data consumer, this made life miserable. It was impossible to tell which data could be depended on, which was production-grade and which was experimental, how columns or tables with similar names differed from each other without exploring the underlying query and resulted in the analyst spending weeks contacting upstream developers to understand what the incoming data meant and how to use it in order to recreate the wheel all over again.
>
>[![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcecc1647-8e81-4554-bf22-edb824102f46_966x408.png)](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcecc1647-8e81-4554-bf22-edb824102f46_966x408.png)
>Which one of these should I use if I want margin? ^ffa9eb

Note: Entropy. Clean after use.

>the Lifecycle of Data Development.

>A majority of companies have hundreds if not thousands of scattered dashboards that were leveraged at one point but no longer. There is so much clutter it becomes difficult to know what questions have already been answered or not.

>1. Ask an interesting question about the business
>2. Understand the data that already exists, where it comes from, and what it means
>3. Construct a query (code) that answers the question
>4. Decide if the answer to the question has operational value
>5. If yes, deploy the query into a production environment
>6. Decide if the query requires data quality and governance
>7. If yes, build a robust data model and data quality checks/alerting throughout the pipeline *(upstream ownership is required here)*
>8. As new data becomes available or changes, continuously evaluate and reconstruct the query accordingly

But again, the worst part is that people do not make clear decisions. Steps 6 to 8 are almost never done.

>The two lifecycles are very different. While the SDL produces fit-for-purpose software, data engineering is all about discovering and reusing what already exists for a new use case. Data is always changing as we acquire more of it! It is expected that data implementations will evolve over time, sometimes radically. Thus, it is not self-sufficient and downstream teams are tightly coupled to upstream producers.

>- Once a strong use case has been established downstream, data consumers should be able to ‘promote’ data assets to a higher quality
>- The promotion that occurs should establish the data asset as a source of truth. Any future promotions should modify the source of truth asset instead of creating multiple versions
>- If a data asset no longer becomes useful to consumers, data producers should not be required to support it as a product
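
Note: a minimal sketch of how I read the three promotion rules above, not something taken from the article. The names `DataAsset`, `SourceOfTruthRegistry`, `promote` and `release` are hypothetical, and so are the example definitions, teams and consumers. The assumption is one asset per name: a second promotion updates the existing asset in place, and the asset is retired once no consumer depends on it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DataAsset:
    name: str
    definition: str  # e.g. the query that computes the asset (illustrative)
    owner: str       # upstream team accountable for it (illustrative)
    consumers: Set[str] = field(default_factory=set)


class SourceOfTruthRegistry:
    """Hypothetical registry that keeps exactly one asset per name."""

    def __init__(self) -> None:
        self._assets: Dict[str, DataAsset] = {}

    def promote(self, name: str, definition: str, owner: str, consumer: str) -> DataAsset:
        """Promote an asset; a repeat promotion modifies the existing one."""
        asset = self._assets.get(name)
        if asset is None:
            asset = DataAsset(name, definition, owner)
            self._assets[name] = asset
        else:
            # Rule 2: change the source of truth instead of adding a version.
            asset.definition = definition
            asset.owner = owner
        asset.consumers.add(consumer)
        return asset

    def release(self, name: str, consumer: str) -> bool:
        """Drop a consumer; return True if the producer must still support the asset."""
        asset = self._assets[name]
        asset.consumers.discard(consumer)
        if asset.consumers:
            return True
        # Rule 3: nobody depends on it any more, so it is retired.
        del self._assets[name]
        return False

    def names(self) -> List[str]:
        return sorted(self._assets)


registry = SourceOfTruthRegistry()
first = registry.promote("shipment_margin", "revenue - carrier_cost", "pricing", "analyst_a")
second = registry.promote(
    "shipment_margin", "revenue - carrier_cost - insurance", "pricing", "analyst_b"
)
still_supported = registry.release("shipment_margin", "analyst_a")
retired = not registry.release("shipment_margin", "analyst_b")
```

Here `first` and `second` are the same object, so there is never a second `shipment_margin` to choose between. Whether a real promotion should overwrite the definition or go through review first is a design choice the highlights leave open.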