Data contracts bring data providers and data consumers together.
A data contract is a document that defines the structure, format, semantics, quality, and terms of use for exchanging data between a data provider and their consumers. A data contract is implemented by a data product’s output port or other data technologies. Data contracts can also be used for the input port to specify the expectations of data dependencies and verify given guarantees.
The data contract specification defines a YAML format to describe attributes of provided data sets. It is data platform neutral and can be used with any data platform, such as AWS S3, Google BigQuery, Microsoft Fabric, Databricks, and Snowflake. The data contract specification is an open initiative to define a common data contract format. It follows OpenAPI and AsyncAPI conventions.
Data contracts come into play when data is exchanged between different teams or organizational units, such as in a data mesh architecture. First and foremost, data contracts are a communication tool to express a common understanding of how data should be structured and interpreted. They make semantic and quality expectations explicit. They are often created collaboratively in workshops with data providers and data consumers. Later, in development and production, they also serve as the basis for code generation, testing, schema validations, quality checks, monitoring, access control, and computational governance policies.
The specification comes along with the Data Contract CLI, an open-source tool to develop, validate, and enforce data contracts.
Note: The term “data contract” refers to a specification that is usually owned by the data provider and thus does not align with a “contract” in a legal sense as a mutual agreement between two parties. The term “contract” may be somewhat misleading, but it is how it is used in practice. The mutual agreement between one data provider and one data consumer is the “data usage agreement” that refers to a data contract. Data usage agreements have a defined lifecycle, start/end date, and help the data provider to track who accesses their data and for which purposes.
0.9.1 (Changelog)
dataContractSpecification: 0.9.1
id: urn:datacontract:checkout:orders-latest-npii
info:
  title: Orders Latest NPII
  version: 1.0.0
  description: Successful customer orders in the webshop. All orders since 2020-01-01. Orders with their line items are in their current state (no history included). PII data is removed.
  owner: Checkout Team
  contact:
    name: John Doe (Data Product Owner)
    email: [email protected]
servers:
  production:
    type: BigQuery
    project: acme_orders_prod
    dataset: bigquery_orders_latest_npii_v1
terms:
  usage: >
    Data can be used for reports, analytics and machine learning use cases.
    Order may be linked and joined by other tables
  limitations: >
    Not suitable for real-time use cases.
    Data may not be used to identify individual customers.
    Max data processing per day: 10 TiB
  billing: 5000 USD per month
  noticePeriod: P3M
models:
  orders:
    description: One record per order. Includes cancelled and deleted orders.
    type: table
    fields:
      order_id:
        $ref: '#/definitions/order_id'
      order_timestamp:
        type: timestamp
        description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful.
      order_total:
        type: long
        description: Total amount of the order in the smallest monetary unit (e.g., cents).
  line_items:
    description: A single article that is part of an order.
    type: table
    fields:
      lines_item_id:
        type: string
        description: Primary key of the lines_item_id table
      order_id:
        $ref: '#/definitions/order_id'
      sku:
        description: The purchased article number
        $ref: '#/definitions/sku'
definitions:
  order_id:
    domain: checkout
    name: order_id
    title: Order ID
    type: string
    description: An internal ID that identifies an order in the online shop.
    example: 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2
    pii: true
    classification: restricted
  sku:
    domain: inventory
    name: sku
    title: Stock Keeping Unit
    type: string
    example: AC1212ME1
    description: |
      A Stock Keeping Unit (SKU) is an internal unique identifier for an article.
      It is typically associated with an article's barcode, such as the EAN/GTIN.
examples:
  - type: csv # csv, json, yaml, custom
    model: orders
    data: |- # expressed as string or inline yaml or via "$ref: data.csv"
      order_id,order_timestamp,order_total
      "1001","2023-09-09T08:30:00Z",2500
      "1002","2023-09-08T15:45:00Z",1800
      "1003","2023-09-07T12:15:00Z",3200
      "1004","2023-09-06T19:20:00Z",1500
      "1005","2023-09-05T10:10:00Z",4200
      "1006","2023-09-04T14:55:00Z",2800
      "1007","2023-09-03T21:05:00Z",1900
      "1008","2023-09-02T17:40:00Z",3600
      "1009","2023-09-01T09:25:00Z",3100
      "1010","2023-08-31T22:50:00Z",2700
  - type: csv
    model: line_items
    data: |-
      lines_item_id,order_id,sku
      "1","1001","5901234123457"
      "2","1001","4001234567890"
      "3","1002","5901234123457"
      "4","1002","2001234567893"
      "5","1003","4001234567890"
      "6","1003","5001234567892"
      "7","1004","5901234123457"
      "8","1005","2001234567893"
      "9","1005","5001234567892"
      "10","1005","6001234567891"
quality:
  type: SodaCL # data quality check format: SodaCL, montecarlo, custom
  specification: # expressed as string or inline yaml or via "$ref: checks.yaml"
    checks for orders:
      - freshness(order_timestamp) < 24h
      - row_count > 500000
      - duplicate_count(order_id) = 0
    checks for line_items:
      - row_count > 500000
JSON Schema of the Data Contract Specification.
This is the root document.
It is RECOMMENDED that the root document be named: datacontract.yaml.
Field | Type | Description |
---|---|---|
dataContractSpecification | string | REQUIRED. Specifies the Data Contract Specification being used. |
id | string | REQUIRED. An organization-wide unique technical identifier, such as a UUID, URN, slug, string, or number. |
info | Info Object | REQUIRED. Specifies the metadata of the data contract. May be used by tooling. |
servers | Map[string, Server Object] | Specifies the servers of the data contract. |
terms | Terms Object | Specifies the terms and conditions of the data contract. |
models | Map[string, Model Object] | Specifies the logical data model. |
definitions | Map[string, Definition Object] | Specifies definitions. |
schema | Schema Object | Specifies the physical schema. The specification supports different schema formats. |
examples | Array of Example Objects | Specifies example data sets for the data model. The specification supports different example types. |
quality | Quality Object | Specifies the quality attributes and checks. The specification supports different quality check DSLs. |
This object MAY be extended with Specification Extensions.
Metadata and life cycle information about the data contract.
Field | Type | Description |
---|---|---|
title | string | REQUIRED. The title of the data contract. |
version | string | REQUIRED. The version of the data contract document (which is distinct from the Data Contract Specification version or the Data Product implementation version). |
description | string | A description of the data contract. |
owner | string | The owner or team responsible for managing the data contract and providing the data. |
contact | Contact Object | Contact information for the data contract. |
This object MAY be extended with Specification Extensions.
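The REQUIRED markers in the root document and Info Object tables can be checked mechanically. The following is a minimal sketch, assuming the contract has already been parsed into a Python dict (for example by a YAML parser). The function name `missing_required_fields` is hypothetical and is not part of the specification or the Data Contract CLI.

```python
# Fields marked REQUIRED in the root document and Info Object tables.
REQUIRED_ROOT_FIELDS = ("dataContractSpecification", "id", "info")
REQUIRED_INFO_FIELDS = ("title", "version")


def missing_required_fields(contract: dict) -> list[str]:
    """Return dotted paths of REQUIRED fields that are absent or empty."""
    missing = [name for name in REQUIRED_ROOT_FIELDS if not contract.get(name)]
    info = contract.get("info") or {}
    missing += [f"info.{name}" for name in REQUIRED_INFO_FIELDS if not info.get(name)]
    return missing


contract = {
    "dataContractSpecification": "0.9.1",
    "id": "urn:datacontract:checkout:orders-latest-npii",
    "info": {"title": "Orders Latest NPII"},  # version intentionally omitted
}
print(missing_required_fields(contract))  # ['info.version']
```

An empty result means that all REQUIRED fields are present; it says nothing about whether their values are meaningful.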
Contact information for the data contract.
Field | Type | Description |
---|---|---|
name | string | The identifying name of the contact person/organization. |
url | string | The URL pointing to the contact information. This MUST be in the form of a URL. |
email | string | The email address of the contact person/organization. This MUST be in the form of an email address. |
This object MAY be extended with Specification Extensions.
The fields are dependent on the defined type.
Field | Type | Description |
---|---|---|
type | string | The type of the data product technology that implements the data contract. Well-known server types are: bigquery, s3, redshift, snowflake, databricks, kafka |
description | string | An optional string describing the server. |
This object MAY be extended with Specification Extensions.
Field | Type | Description |
---|---|---|
type | string | bigquery |
project | string | |
dataset | string | |

Field | Type | Description |
---|---|---|
type | string | s3 |
location | string | S3 URL, starting with s3:// |

Example:
servers:
  production:
    type: s3
    location: s3://acme-orders-prod/orders/

Field | Type | Description |
---|---|---|
type | string | redshift |
account | string | |
database | string | |
schema | string | |

Field | Type | Description |
---|---|---|
type | string | snowflake |
account | string | |
database | string | |
schema | string | |

Field | Type | Description |
---|---|---|
type | string | databricks |
share | string | |

Field | Type | Description |
---|---|---|
type | string | kafka |
host | string | |
topic | string | |
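Because the fields of a Server Object depend on its type, a small lookup table is enough to check a server entry. The sketch below is illustrative only: the specification does not mark the type-specific fields as REQUIRED, so treating them as mandatory is an assumption, as is comparing the type case-insensitively. The function name `check_server` and the host value `broker:9092` are hypothetical.

```python
# Type-specific fields listed in the server tables above.
# Treating all of them as mandatory is an assumption of this sketch;
# the specification does not mark them REQUIRED.
SERVER_FIELDS = {
    "bigquery": ("project", "dataset"),
    "s3": ("location",),
    "redshift": ("account", "database", "schema"),
    "snowflake": ("account", "database", "schema"),
    "databricks": ("share",),
    "kafka": ("host", "topic"),
}


def check_server(server: dict) -> list[str]:
    """Return human-readable problems found in one Server Object."""
    server_type = str(server.get("type", "")).lower()
    if server_type not in SERVER_FIELDS:
        # Other types are allowed by the specification; they are only reported.
        return [f"unknown server type: {server.get('type')!r}"]
    problems = [
        f"missing field: {name}"
        for name in SERVER_FIELDS[server_type]
        if not server.get(name)
    ]
    location = server.get("location")
    if server_type == "s3" and location and not location.startswith("s3://"):
        problems.append("location must start with s3://")
    return problems


print(check_server({"type": "s3", "location": "s3://acme-orders-prod/orders/"}))
print(check_server({"type": "kafka", "host": "broker:9092"}))
```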
The terms and conditions of the data contract.
Field | Type | Description |
---|---|---|
usage | string | The usage describes the way the data is expected to be used. Can contain business and technical information. |
limitations | string | The limitations describe the restrictions on how the data can be used. They can be technical, or they can state what the data may not be used for. |
billing | string | The billing describes the pricing model for using the data, such as whether it’s free, having a monthly fee, or metered pay-per-use. |
noticePeriod | string | The period of time that must be given by either party to terminate or modify a data usage agreement. Uses ISO-8601 period format, e.g., P3M for a period of three months. |
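Since `noticePeriod` uses the ISO-8601 period format, tooling can reject malformed values early. The following is a minimal sketch using only the Python standard library; the function name `parse_notice_period` is hypothetical, and fractional components are deliberately not supported.

```python
import re

# Matches ISO-8601 periods such as P3M, P1Y6M, P2W, or P1DT12H.
# Fractional values are not handled in this sketch.
_PERIOD = re.compile(
    r"^P(?:(?P<weeks>\d+)W|"
    r"(?:(?P<years>\d+)Y)?(?:(?P<months>\d+)M)?(?:(?P<days>\d+)D)?"
    r"(?:T(?=\d)(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?)$"
)


def parse_notice_period(value: str) -> dict[str, int]:
    """Parse an ISO-8601 period into the components that are present."""
    match = _PERIOD.match(value)
    # A bare "P" matches the pattern but carries no component.
    if not match or not any(match.groupdict().values()):
        raise ValueError(f"not an ISO-8601 period: {value!r}")
    return {unit: int(n) for unit, n in match.groupdict().items() if n is not None}


print(parse_notice_period("P3M"))  # {'months': 3}
```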
The Model Object describes the structure and semantics of a data model, such as tables, views, or structured files.
The name of the data model (table name) is defined by the key that refers to this Model Object.
Field | Type | Description |
---|---|---|
type | string | The type of the model. Examples: table, object. Default: table. |
description | string | An optional string describing the data model. |
fields | Map[string, Field Object] | The fields (e.g. columns) of the data model. |
The Field Object describes one field (column, property, nested field) of a data model.
Field | Type | Description |
---|---|---|
type | Data Type | The logical data type of the field. |
description | string | An optional string describing the semantics of the data in this field. |
pii | boolean | An indication of whether this field contains Personally Identifiable Information (PII). |
classification | string | The data class defining the sensitivity level for this field, according to the organization’s classification scheme. Examples may be: sensitive, restricted, internal, public. |
tags | Array of string | Custom metadata to provide additional context. |
$ref | string | A reference URI to a definition in the specification, internally or externally. Properties will be inherited from the definition. |
The Definition Object provides a clear and concise explanation of the syntax, semantics, and classification of a business object in a given domain.
It serves as a reference for a common understanding of terminology, ensures consistent usage, and helps to identify joinable fields.
Model fields can refer to definitions using the $ref field to link to existing definitions and avoid duplicate documentation.
Field | Type | Description |
---|---|---|
domain | string | The domain in which this definition is valid. Default: global. |
name | string | The technical name of this definition. |
title | string | The business name of this definition. |
type | Data Type | The logical data type. |
description | string | Clear and concise explanations related to the domain. |
example | string | An example value. |
pii | boolean | An indication of whether this field contains Personally Identifiable Information (PII). |
classification | string | The data class defining the sensitivity level for this field, according to the organization’s classification scheme. |
tags | Array of string | Custom metadata to provide additional context. |
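To illustrate how a field inherits properties from a definition through `$ref`, here is a minimal sketch that resolves internal references of the form `#/definitions/<name>` on an already parsed contract. The function name `resolve_field` is hypothetical, external references are not handled, and letting properties set on the field override inherited ones is an assumption that the specification does not spell out.

```python
def resolve_field(field: dict, contract: dict) -> dict:
    """Merge a field with the definition its $ref points to.

    Only internal references of the form '#/definitions/<name>' are handled.
    Letting field-level properties override inherited ones is an assumption
    of this sketch.
    """
    ref = field.get("$ref")
    if ref is None:
        return dict(field)
    prefix = "#/definitions/"
    if not ref.startswith(prefix):
        raise ValueError(f"unsupported reference: {ref}")
    definition = contract.get("definitions", {})[ref[len(prefix):]]
    own = {key: value for key, value in field.items() if key != "$ref"}
    return {**definition, **own}


contract = {
    "definitions": {
        "sku": {
            "domain": "inventory",
            "name": "sku",
            "title": "Stock Keeping Unit",
            "type": "string",
            "example": "AC1212ME1",
        }
    }
}
field = {"description": "The purchased article number", "$ref": "#/definitions/sku"}
resolved = resolve_field(field, contract)
print(resolved["type"], "-", resolved["description"])
```

The resolved field carries the `type` from the definition and keeps its own `description`.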
The schema of the data contract describes the physical schema. The type of the schema depends on the data platform.
Field | Type | Description |
---|---|---|
type | string | REQUIRED. The type of the schema. Typical values are: dbt, bigquery, json-schema, sql-ddl, avro, protobuf, custom |
specification | dbt Schema Object, BigQuery Schema Object, JSON Schema Schema Object, SQL DDL Schema Object, or string | REQUIRED. The specification of the schema. The schema specification can be encoded as a string or as inline YAML. |
https://docs.getdbt.com/reference/model-properties
Example (inline YAML):
schema:
  type: dbt
  specification:
    version: 2
    models:
      - name: "My Table"
        description: "My description"
        columns:
          - name: "My column"
            data_type: text
            description: "My description"
Example (string):
schema:
  type: dbt
  specification: |-
    version: 2
    models:
      - name: "My Table"
        description: "My description"
        columns:
          - name: "My column"
            data_type: text
            description: "My description"
The schema structure is defined by the Google BigQuery Table object. You can extract such a Table object via the tables.get endpoint.
Instead of providing a single Table object, you can also provide an array of such objects. Be aware that tables.list only returns a subset of the full Table object. To get the full Table object, including the actual schema, call tables.get for each table.
Learn more: Google BigQuery REST Reference v2
Example:
schema:
  type: bigquery
  specification: |-
    {
      "tableReference": {
        "projectId": "my-project",
        "datasetId": "my_dataset",
        "tableId": "my_table"
      },
      "description": "This is a description",
      "type": "TABLE",
      "schema": {
        "fields": [
          {
            "name": "name",
            "type": "STRING",
            "mode": "NULLABLE",
            "description": "This is a description"
          }
        ]
      }
    }
JSON Schema can be defined as JSON or rendered as YAML, following the OpenAPI Schema Object dialect.
Example (inline YAML):
schema:
  type: json-schema
  specification:
    orders:
      description: One record per order. Includes cancelled and deleted orders.
      type: object
      properties:
        order_id:
          type: string
          description: Primary key of the orders table
        order_timestamp:
          type: string
          format: date-time
          description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful.
        order_total:
          type: integer
          description: Total amount of the order in the smallest monetary unit (e.g., cents).
    line_items:
      type: object
      properties:
        lines_item_id:
          type: string
          description: Primary key of the lines_item_id table
        order_id:
          type: string
          description: Foreign key to the orders table
        sku:
          type: string
          description: The purchased article number
Example (string):
schema:
  type: json-schema
  specification: |-
    {
      "$schema": "http://json-schema.org/draft-07/schema#",
      "type": "object",
      "properties": {
        "orders": {
          "type": "object",
          "description": "One record per order. Includes cancelled and deleted orders.",
          "properties": {
            "order_id": {
              "type": "string",
              "description": "Primary key of the orders table"
            },
            "order_timestamp": {
              "type": "string",
              "format": "date-time",
              "description": "The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful."
            },
            "order_total": {
              "type": "integer",
              "description": "Total amount of the order in the smallest monetary unit (e.g., cents)."
            }
          },
          "required": ["order_id", "order_timestamp", "order_total"]
        },
        "line_items": {
          "type": "object",
          "properties": {
            "lines_item_id": {
              "type": "string",
              "description": "Primary key of the lines_item_id table"
            },
            "order_id": {
              "type": "string",
              "description": "Foreign key to the orders table"
            },
            "sku": {
              "type": "string",
              "description": "The purchased article number"
            }
          },
          "required": ["lines_item_id", "order_id", "sku"]
        }
      },
      "required": ["orders", "line_items"]
    }
Classical SQL DDLs can be used to describe the structure.
Example (string):
schema:
  type: sql-ddl
  specification: |-
    -- One record per order. Includes cancelled and deleted orders.
    CREATE TABLE orders (
      order_id TEXT PRIMARY KEY, -- Primary key of the orders table
      order_timestamp TIMESTAMPTZ NOT NULL, -- The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful.
      order_total INTEGER NOT NULL -- Total amount of the order in the smallest monetary unit (e.g., cents)
    );
    -- The items that are part of an order
    CREATE TABLE line_items (
      lines_item_id TEXT PRIMARY KEY, -- Primary key of the lines_item_id table
      order_id TEXT REFERENCES orders(order_id), -- Foreign key to the orders table
      sku TEXT NOT NULL -- The purchased article number
    );
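A DDL specification like the one above can be given a quick syntax check by executing it in an in-memory SQLite database. This is only a sketch under clear assumptions: SQLite is a stand-in for the real target platform, it accepts unknown type names such as TIMESTAMPTZ without complaint, and the helper `columns_by_table` is hypothetical. The inline comments of the DDL are shortened here.

```python
import sqlite3

# DDL taken from the example above. SQLite accepts unknown type names such
# as TIMESTAMPTZ, so this only checks syntax and table structure, not the
# semantics of the target platform.
DDL = """
-- One record per order. Includes cancelled and deleted orders.
CREATE TABLE orders (
  order_id TEXT PRIMARY KEY,
  order_timestamp TIMESTAMPTZ NOT NULL,
  order_total INTEGER NOT NULL
);
-- The items that are part of an order
CREATE TABLE line_items (
  lines_item_id TEXT PRIMARY KEY,
  order_id TEXT REFERENCES orders(order_id),
  sku TEXT NOT NULL
);
"""


def columns_by_table(ddl: str) -> dict[str, list[str]]:
    """Execute the DDL in an in-memory SQLite database and list its columns."""
    connection = sqlite3.connect(":memory:")
    try:
        connection.executescript(ddl)
        tables = [
            row[0]
            for row in connection.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        return {
            table: [col[1] for col in connection.execute(f"PRAGMA table_info({table})")]
            for table in tables
        }
    finally:
        connection.close()


print(columns_by_table(DDL))
```

A syntax error in the DDL surfaces as `sqlite3.OperationalError`; platform-specific features may still fail on the real target even if this check passes.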
Field | Type | Description |
---|---|---|
type | string | The type of the example data. Well-known types are: csv, json, yaml, custom |
description | string | An optional string describing the example. |
model | string | The reference to the model in the schema, e.g. a table name. |
data | string | Example data for this model. |
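The `data` of a `csv` example is plain CSV text, so it can be parsed with a standard CSV reader and compared against the fields of the referenced model. The sketch below assumes a contract that has already been parsed into a Python dict; the shortened `contract` and the function name `load_csv_example` are hypothetical.

```python
import csv
import io

# Hypothetical, shortened contract holding one model and one CSV example.
contract = {
    "models": {
        "orders": {
            "fields": {"order_id": {}, "order_timestamp": {}, "order_total": {}}
        }
    },
    "examples": [
        {
            "type": "csv",
            "model": "orders",
            "data": (
                "order_id,order_timestamp,order_total\n"
                '"1001","2023-09-09T08:30:00Z",2500\n'
                '"1002","2023-09-08T15:45:00Z",1800'
            ),
        }
    ],
}


def load_csv_example(contract: dict, example: dict) -> list[dict]:
    """Parse a CSV example and check its header against the model's fields."""
    reader = csv.DictReader(io.StringIO(example["data"]))
    rows = list(reader)
    header = set(reader.fieldnames or [])
    unknown = header - set(contract["models"][example["model"]]["fields"])
    if unknown:
        raise ValueError(f"columns not in model {example['model']}: {sorted(unknown)}")
    return rows


rows = load_csv_example(contract, contract["examples"][0])
print(len(rows), rows[0]["order_id"])
```

Note that the CSV reader returns every value as a string; converting `order_total` to a number according to the model's type is left out of this sketch.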
Example:
examples:
  - type: csv
    model: orders
    data: |-
      order_id,order_timestamp,order_total
      "1001","2023-09-09T08:30:00Z",2500
      "1002","2023-09-08T15:45:00Z",1800
      "1003","2023-09-07T12:15:00Z",3200
      "1004","2023-09-06T19:20:00Z",1500
      "1005","2023-09-05T10:10:00Z",4200
      "1006","2023-09-04T14:55:00Z",2800
      "1007","2023-09-03T21:05:00Z",1900
      "1008","2023-09-02T17:40:00Z",3600
      "1009","2023-09-01T09:25:00Z",3100
      "1010","2023-08-31T22:50:00Z",2700
The quality object contains quality attributes and checks.
Field | Type | Description |
---|---|---|
type | string | REQUIRED. The type of the quality check format. Typical values are: SodaCL, montecarlo, custom |
specification | SodaCL Quality Object, Monte Carlo Schema Object, or string | REQUIRED. The specification of the quality attributes. The quality specification can be encoded as a string or as inline YAML. |
Quality attributes in Soda Checks Language.
The specification represents the content of a checks.yml file.
Example (inline):
quality:
  type: SodaCL # data quality check format: SodaCL, montecarlo, dbt-tests, custom
  specification: # expressed as string or inline yaml or via "$ref: checks.yaml"
    checks for orders:
      - row_count > 0
      - duplicate_count(order_id) = 0
    checks for line_items:
      - row_count > 0
Example (string):
quality:
  type: SodaCL
  specification: |-
    checks for search_queries:
      - freshness(search_timestamp) < 1d
      - row_count > 100000
      - missing_count(search_query) = 0
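To make the intent of checks such as `row_count > 0` and `duplicate_count(order_id) = 0` concrete, the sketch below re-expresses them in plain Python over in-memory rows. It is not the Soda engine and does not parse SodaCL; the helper functions and sample rows are hypothetical. Here `duplicate_count` counts distinct values that occur more than once, which may differ from Soda's exact counting rule but agrees with it on whether the result is zero.

```python
from collections import Counter


def row_count(rows: list[dict]) -> int:
    """Number of rows in the data set."""
    return len(rows)


def duplicate_count(rows: list[dict], column: str) -> int:
    """Number of distinct values in `column` that occur more than once."""
    counts = Counter(row[column] for row in rows)
    return sum(1 for occurrences in counts.values() if occurrences > 1)


# Hypothetical sample rows for the line_items model.
line_items = [
    {"lines_item_id": "1", "order_id": "1001"},
    {"lines_item_id": "2", "order_id": "1001"},
    {"lines_item_id": "3", "order_id": "1002"},
]

print(row_count(line_items) > 0)                          # check passes
print(duplicate_count(line_items, "lines_item_id") == 0)  # check passes
print(duplicate_count(line_items, "order_id") == 0)       # check fails
```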
Quality attributes defined as Monte Carlo Monitors as Code.
The specification represents the content of a montecarlo.yml file.
Example (string):
quality:
  type: montecarlo
  specification: |-
    montecarlo:
      field_health:
        - table: project:dataset.table_name
          timestamp_field: created
      dimension_tracking:
        - table: project:dataset.table_name
          timestamp_field: created
          field: order_status
The following data types are supported for model fields and definitions:
- string, text, varchar
- number, decimal, numeric
- int, integer
- long, bigint
- float
- double
- boolean
- timestamp, timestamp_tz
- timestamp_ntz
- date
- array
- bytes
- object, record, struct
- null
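The grouping of the type names above suggests that names in the same group are interchangeable. Under that assumption, which the specification does not state explicitly, tooling could normalize each name to the first name of its group. The names `TYPE_GROUPS`, `CANONICAL`, and `canonical_type` in this sketch are hypothetical.

```python
# Each tuple is one group from the list above; treating the names in a group
# as aliases of the first name is an assumption of this sketch.
TYPE_GROUPS = [
    ("string", "text", "varchar"),
    ("number", "decimal", "numeric"),
    ("int", "integer"),
    ("long", "bigint"),
    ("float",),
    ("double",),
    ("boolean",),
    ("timestamp", "timestamp_tz"),
    ("timestamp_ntz",),
    ("date",),
    ("array",),
    ("bytes",),
    ("object", "record", "struct"),
    ("null",),
]
CANONICAL = {alias: group[0] for group in TYPE_GROUPS for alias in group}


def canonical_type(name: str) -> str:
    """Map a supported type name to the first name of its group."""
    try:
        return CANONICAL[name]
    except KeyError:
        raise ValueError(f"unsupported data type: {name!r}") from None


print(canonical_type("varchar"), canonical_type("bigint"))
```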
While the Data Contract Specification tries to accommodate most use cases, additional data can be added to extend the specification at certain points.
Custom fields can be added with any name. The value can be null, a primitive, an array, or an object.
The Data Contract Specification follows these design principles:
The Data Contract Specification was originally created by Jochen Christ and Dr. Simon Harrer, and is currently maintained by them.
Contributions are welcome! Please open an issue or a pull request.