
Highlights

  • A data contract is:
    1. a data quality specification for a dataset that describes
       • the schema of the dataset
       • simple column-oriented quality rules (is_not_null, values_between, etc.)
    2. anchored around a dataset, which is typically a physical event stream produced by a team / domain
    3. also a carrier for semantic (aka business) metadata
       • ownership
       • classification tags
    4. can also describe operational SLOs
       • freshness goals (e.g. must be available for processing by 7am in the warehouse)
    5. can also be a specifier for provisioning configuration for a dataset (e.g. provision this dataset on Kafka and BigQuery)
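The five facets above can be sketched as a single contract document. Below is a minimal illustration in Python; every field name, rule name, and value here is hypothetical and not taken from any particular contract standard:

```python
# Hypothetical data contract for an event stream, covering the five
# facets listed above: schema, quality rules, semantic metadata
# (ownership, classification), operational SLOs, and provisioning.
order_events_contract = {
    "dataset": "orders.order_placed.v1",          # physical event stream
    "owner": "checkout-team",                     # producing team / domain
    "classification": ["pii:none", "tier:gold"],  # business metadata tags
    "schema": {
        "order_id": "string",
        "amount_cents": "int",
        "placed_at": "timestamp",
    },
    "quality_rules": [
        {"column": "order_id", "rule": "is_not_null"},
        {"column": "amount_cents", "rule": "values_between",
         "min": 0, "max": 10_000_000},
    ],
    "slos": {"freshness": "available in warehouse by 07:00"},
    "provisioning": {"targets": ["kafka", "bigquery"]},
}

def check_row(contract: dict, row: dict) -> list[str]:
    """Return the quality-rule violations for one event."""
    violations = []
    for rule in contract["quality_rules"]:
        value = row.get(rule["column"])
        if rule["rule"] == "is_not_null" and value is None:
            violations.append(f"{rule['column']} is null")
        elif rule["rule"] == "values_between" and value is not None:
            if not (rule["min"] <= value <= rule["max"]):
                violations.append(f"{rule['column']} out of range")
    return violations
```

A clean event passes with no violations, while a malformed one surfaces each broken rule, which is the kind of column-oriented check the highlight describes.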
  • One might ask whether there is a specific reason the community is hyper-focused on attaching data contracts only to this edge. One reason is that much of the intention behind the conversation around data contracts is to shift responsibility and accountability for data to the application teams that produce it
  • Use a git-based process for creating and managing the lifecycle of a data contract
  • shift-left approaches to metadata
    • Note: The practice of capturing metadata as close to the source as possible
  • Similar techniques for batch datasets exist using the staging → publish table pattern
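The staging → publish pattern mentioned above can be sketched as follows, using sqlite3 purely for illustration (the table and column names are hypothetical): a new batch lands in a staging table, contract checks run there, and only a clean batch is promoted to the publish table that consumers read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_staging (order_id TEXT, amount_cents INT)")
conn.execute("CREATE TABLE orders (order_id TEXT, amount_cents INT)")

# 1. The producer loads the new batch into the staging table only.
conn.executemany(
    "INSERT INTO orders_staging VALUES (?, ?)",
    [("a1", 500), ("a2", 1200)],
)

# 2. Contract checks run against staging (e.g. is_not_null, value range).
bad = conn.execute(
    "SELECT COUNT(*) FROM orders_staging"
    " WHERE order_id IS NULL OR amount_cents < 0"
).fetchone()[0]

# 3. Promote the batch to the publish table only if the contract holds.
if bad == 0:
    conn.execute("INSERT INTO orders SELECT * FROM orders_staging")
    conn.execute("DELETE FROM orders_staging")
conn.commit()
```

Consumers never see a half-loaded or contract-violating batch, which mirrors what a validated event stream gives you in the streaming case.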
  • Data producers should own the specification of the contract
  • They should be accountable for all the characteristics of the dataset as laid out in the contract
  • Data Contracts are currently closely aligned to the Source Aligned Data Products as laid out by Zhamak.