
Highlights

There are three primary stakeholders in the data management value chain:

  1. Producers: The teams generating the data
  2. Platforms: The teams maintaining the data infrastructure
  3. Consumers: The teams leveraging the data to accomplish tasks

Conway’s Law would dictate that the data management, governance, and quality systems implemented in a company will reflect how these various groups work together. In most businesses, data producers have no idea who their consumers are or why they need the data in the first place. They are unaware of which data is important for AI/BI, nor do they understand what it should look like. Platform teams are rarely informed about how their infrastructure is being leveraged and have little knowledge of the business context surrounding data, while consumers have business context but don’t know where the data is coming from or whether it is of acceptable quality. (View Highlight)

The early 2000s marked a fundamental shift in how software engineering teams were structured. As technology companies scaled, they recognized that high-quality software required rapid iteration and continuous delivery. (View Highlight)

To facilitate this velocity, companies embraced Agile methodologies, which dismantled the traditional, slow-moving hierarchical structures and replaced them with autonomous, cross-functional teams. (View Highlight)

Out of this decentralized model emerged federated engineering structures and the adoption of microservices architectures. (View Highlight)

The trade-off, however, was that many centralized cost centers—teams and functions designed for a monolithic, tightly controlled architecture—struggled to adapt. (View Highlight)

This same dynamic played out in the world of data. Historically, data teams had ownership over the organization’s entire data architecture—curating data models, defining schema governance, and managing a centralized data warehouse. But as engineering teams began making independent decisions about which events to log, what databases to use, and how to structure data, the once-cohesive data ecosystem fragmented overnight. Without centralized oversight, engineering teams optimized for their immediate needs rather than long-term data quality. Events were collected inconsistently, naming conventions varied wildly, and different teams structured their data models based on what was most convenient for their service, rather than what was best for the organization as a whole. This led to massive data silos and duplicated effort. (View Highlight)
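
To make that fragmentation concrete, here is a hypothetical sketch (the services, event names, and fields are invented for illustration): two teams record what is logically the same "order completed" event, but with different names, types, and conventions, and a downstream consumer ends up writing glue code to reconcile them.

```python
# Hypothetical payloads: two teams logging the same business event independently.
# Neither shape is wrong on its own; the problem is that nothing forces them to agree.

# Checkout service: snake_case names, amount in cents, ISO-8601 timestamp string.
checkout_order_completed = {
    "event": "order_completed",
    "order_id": "ord_123",
    "amount_cents": 4999,
    "completed_at": "2024-05-01T12:30:00Z",
}

# Payments service: camelCase names, amount as a float in dollars, epoch timestamp.
payments_order_finished = {
    "eventType": "OrderFinished",
    "orderId": "ord_123",
    "amount": 49.99,
    "timestamp": 1714566600,
}

def to_canonical(event: dict) -> dict:
    """Downstream glue code a consumer writes to reconcile the two shapes."""
    if event.get("event") == "order_completed":
        return {"order_id": event["order_id"], "amount_cents": event["amount_cents"]}
    if event.get("eventType") == "OrderFinished":
        return {"order_id": event["orderId"], "amount_cents": round(event["amount"] * 100)}
    raise ValueError("unknown event shape")
```

Every consumer that needs this event maintains its own version of `to_canonical`, which is exactly the duplicated effort and silent divergence described above.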

In the early days of the cloud, this reactive data engineering model was sufficient. Most organizations primarily used data for dashboarding and reporting, where occasional inconsistencies could be tolerated. (View Highlight)

Instead of being solely the responsibility of downstream data organizations, the treatment of data becomes a shared responsibility across producers, data platform teams, and consumers. (View Highlight)

While Shifting Left may sound too good to be true, this pattern has happened on three notable occasions in software engineering. The first is DevOps, the second is DevSecOps, and the third is Feature Management. (View Highlight)

The primary reason for this delay lies in the inherent complexity and multi-faceted nature of data, which makes the shift left difficult to manage. (View Highlight)

Most data quality problems are code quality problems. There are really two types of root causes for production-impacting bugs:

  1. A lapse in judgment—an engineer skips writing a test and a bug slips through.
  2. A broken dependency—an engineer unknowingly changes something a downstream team relied on (see the sketch after this list). (View Highlight)
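
As a rough illustration of the second root cause, the sketch below (all table and field names are hypothetical) shows a producer renaming a column that downstream consumers depend on. A small check in the producer's own CI, run against the fields consumers have declared they rely on, would turn this silent breakage into a failing build before the change ships, which is the essence of shifting the problem left.

```python
# Hypothetical sketch: a producer-side check that fails when a change removes
# a field downstream consumers have declared they depend on.

# Fields downstream teams have registered as dependencies (assumed to live
# somewhere both sides can see, e.g. checked into the producer's repo).
CONSUMER_DEPENDENCIES = {
    "orders": {"order_id", "customer_id", "amount_cents"},
}

def current_schema() -> dict[str, set[str]]:
    """Stand-in for however the producer derives its current output schema."""
    return {
        # The producer renamed customer_id -> buyer_id without telling anyone.
        "orders": {"order_id", "buyer_id", "amount_cents", "created_at"},
    }

def check_downstream_dependencies() -> list[str]:
    """Return a list of broken dependencies; intended to run in the producer's CI."""
    broken = []
    schema = current_schema()
    for table, required in CONSUMER_DEPENDENCIES.items():
        missing = required - schema.get(table, set())
        for field in sorted(missing):
            broken.append(f"{table}.{field} is required downstream but is no longer produced")
    return broken

if __name__ == "__main__":
    problems = check_downstream_dependencies()
    for problem in problems:
        print("BROKEN:", problem)
    raise SystemExit(1 if problems else 0)
```

The first root cause, a skipped test, is harder to automate away, but this second class is mechanical: if consumers' expectations are visible to the producer, a change that breaks them can be caught at commit time rather than in a dashboard weeks later.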