Keeping the data lake from becoming a swamp
I have been thinking a lot about how data consumers work with producers to ensure they have access to the data they need in a format that is understandable and reliable. We have some great new tools like Snowflake, Alation, and Looker for exploring, documenting, and visualizing data. As someone who has worked with both REST services and data processing systems, I find myself a little paranoid about the prospect of the data lake becoming a swamp under a more distributed, open system of ingesting and publishing data.
Our data consumers should enjoy the same kinds of guarantees around data in the lake that they do when consuming data from our APIs. These guarantees matter less when exploring data or running ad hoc queries than when building data processing jobs, training machine learning models, or building BI dashboards. I want the contract to be versioned and stable, so that I can be assured functionality won’t break in unexpected ways. Moreover, when a new version is introduced, I want to access all my data through the new version, including data created under the old one. So how do we achieve these guarantees in the lake without hampering discovery and ad hoc exploration?
Data used in production data processing or published business intelligence should obey a contract to give those data consumers long-term stability and confidence in their dependence on the source data. The data producer should publish the contract to insulate the consumer from multiple data sources, multiple source formats, and data production gaps or errors. Stabilizing the data with a data contract is the first step the producer should take before other processing. Data contracts should satisfy the following needs:
- Rich and flexible schema for describing data — handle all common data types and structures including graphs, maps, and lists
- Self documenting — stored data should be clearly linked to the schema that describes it
- Smooth schema evolution with clear rules for resolution and conflict detection — as the schema changes it is important that the contract prevents breaking changes
- Support generation of contract classes in multiple programming languages — tool support for generating code from contracts
- Adapted to both batch and streaming data — efficient and performant when reading large amounts of data sequentially or dealing with small amounts of data incrementally
- Compact, fast storage format supporting compression — stores data efficiently yet performs well during reading and writing
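Serialization formats like Apache Avro are designed around these needs, particularly schema evolution with conflict detection. As an illustrative sketch (not Avro’s full resolution algorithm, which also covers type promotion, unions, and aliases), here is a simplified, Avro-style backward-compatibility check; the `Order` schemas are hypothetical:

```python
# Simplified, Avro-style schema compatibility check (illustrative only).
V1 = {
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
}

# v2 adds a field WITH a default, so data written under v1 stays readable.
V2 = {
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "currency", "type": "string", "default": "USD"},
    ],
}

def backward_compatible(reader: dict, writer: dict) -> list:
    """Return conflicts preventing `reader` from decoding data written
    with `writer`; an empty list means the evolution is safe."""
    writer_fields = {f["name"]: f for f in writer["fields"]}
    conflicts = []
    for f in reader["fields"]:
        w = writer_fields.get(f["name"])
        if w is None:
            # Field absent from old data: reader must supply a default.
            if "default" not in f:
                conflicts.append(f"field '{f['name']}' added without a default")
        elif w["type"] != f["type"]:
            conflicts.append(f"field '{f['name']}' changed type "
                             f"{w['type']} -> {f['type']}")
    return conflicts

print(backward_compatible(V2, V1))  # [] -> safe evolution
```

A contract registry can run a check like this at publish time, rejecting breaking changes before any consumer ever sees them.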
Stabilized data is not the only data in need of a contract. Many outputs of data curation should be published with a contract so that other data consumers can rely on them. Both data producers and data consumers may perform curation that then feeds into further curation, data processing, BI, and/or machine learning. Any time these data products will be used by consumers in a long-term use case, the curated data should be published with its own contract. But doesn’t all this sound a lot like schema-on-write, which I thought we wanted to avoid?
Schema-on-Read vs Schema-on-Write
In many traditional systems, data was written to a fixed schema in a data warehouse (i.e., schema-on-write). This worked fine for OLTP systems backed by a relational database, since the ETL was usually simple replication and aggregation. However, these systems were brittle to change and did not accommodate the new world of big data. OLTP systems wanted to publish their data to long-term storage without worrying about the current schema of the data warehouse, and consumers of OLAP systems wanted to handle more types and quantities of data than the typical data warehouse could manage. Decoupling data production from consumption was necessary, but it brought additional challenges around the evolving nature of systems and the data they produce.
Schema-on-read brought relief to data producers, who could publish all their data to a data lake without fitting a strict schema. It also gave data consumers more freedom to discover, explore, and use the data however they choose, without unnecessary barriers and overhead. So this seems like a win-win, right? The trouble is that data consumers now carry the entire burden of understanding and unifying data as it evolves. How does a consumer know to merge orders from an old legacy system with orders created in a new system to calculate year-over-year order growth across dimensions that may no longer map 1:1? Schema-on-read will only get us so far with these issues. The more time passes, the more rules are needed to deal with changing data, and those rules cannot all be handled at read time.
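To make the burden concrete, here is a sketch of the kind of normalization logic a consumer is forced to write under pure schema-on-read. All record shapes, field names, and the region mapping are hypothetical; the point is that this producer knowledge ends up living in consumer code:

```python
from collections import defaultdict

# Hypothetical record shapes: a legacy system and a new system that no
# longer map 1:1 -- field names, units, and region codes all differ.
legacy_orders = [
    {"ord_no": 1001, "total_cents": 2500, "yr": 2022, "rgn": "W"},
    {"ord_no": 1002, "total_cents": 7500, "yr": 2022, "rgn": "E"},
]
new_orders = [
    {"order_id": "A-1", "amount": 40.0, "year": 2023, "region": "west"},
    {"order_id": "A-2", "amount": 60.0, "year": 2023, "region": "east"},
]

REGION_MAP = {"W": "west", "E": "east"}  # assumed legacy -> new mapping

def to_contract(rec: dict) -> dict:
    """Normalize either shape to one canonical contract record."""
    if "ord_no" in rec:  # legacy shape
        return {"order_id": str(rec["ord_no"]),
                "amount": rec["total_cents"] / 100.0,
                "year": rec["yr"],
                "region": REGION_MAP[rec["rgn"]]}
    return rec

def yoy_growth(orders, year: int) -> dict:
    """Year-over-year order growth per region, over normalized records."""
    totals = defaultdict(lambda: defaultdict(float))
    for o in map(to_contract, orders):
        totals[o["region"]][o["year"]] += o["amount"]
    return {r: (t[year] - t[year - 1]) / t[year - 1]
            for r, t in totals.items() if t[year - 1]}

print(yoy_growth(legacy_orders + new_orders, 2023))
```

If the producer instead published both systems’ data under one contract, `to_contract` would live with the producer, written once, instead of being rediscovered and duplicated by every consumer.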
Schema-on-read and schema-on-write aren’t mutually exclusive. It’s important that ingest and raw storage can accept any data and that good tools are available in the data lake to explore and use that data. Schema-on-write in a format that supports evolution is another tool to optimize the consumption of data that is bound to change over time. But even this stable data can be reproduced at any time from the source data when problems arise or back-scrubbing is necessary. By using both of these strategies correctly, you can provide flexibility and stability at the same time. Most importantly, data consumers need to be informed about the class of data they are dealing with and its guarantees, so they can make smart decisions about how they use it.
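The combined strategy can be sketched in a few lines: the raw zone accepts anything (schema-on-read), and a stabilization step validates records against the published contract (schema-on-write), quarantining the rest for inspection or reprocessing. The contract shape and field names below are assumptions for illustration:

```python
import json

# Hypothetical contract for the stable zone: required fields and types.
ORDER_CONTRACT = {"order_id": str, "amount": float}

raw_zone = [                      # raw ingest accepts anything
    '{"order_id": "A-1", "amount": 19.99}',
    '{"order_id": "A-2"}',        # missing amount
    'not even json',
]

def stabilize(raw_records):
    """Validate raw records against the contract; conforming records go
    to the stable zone, everything else to quarantine."""
    stable, quarantine = [], []
    for line in raw_records:
        try:
            rec = json.loads(line)
            if all(isinstance(rec.get(k), t) for k, t in ORDER_CONTRACT.items()):
                stable.append(rec)
            else:
                quarantine.append(line)
        except (json.JSONDecodeError, AttributeError):
            quarantine.append(line)
    return stable, quarantine

stable, quarantine = stabilize(raw_zone)
print(len(stable), len(quarantine))  # 1 2
```

Because the raw zone is retained, the stable zone can always be rebuilt from it when the contract changes or a bug in stabilization is found.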