In my last post, I discussed the importance of data contracts to ensure longevity and ease of data consumption. Now I’d like to turn attention to the hot topic of Data Products and how they get built in a decentralized world. Data contracts are integral to Data Products as well.
What the heck is a Data Product?
Why do I keep capitalizing Data Products and not data contracts? Perhaps it’s because I can see data contracts taking many forms, but I want to convey that Data Products are not merely software products whose output is data. They are that, of course, but also something more. A Data Product should meet certain standards of maturity in documentation, freshness, support, governance, and monitoring. In short, Data Products should meet the same standards as other quality software products. Data consumers should be treated like proper customers, not an ad hoc use case.
Rarely will Data Product domains live in a clearly delineated bubble with no dependencies on or relationships to other domains. It should be accepted that some Data Products will be more basic and tied to the source systems that create them. These products should certainly belong to the same data domain as their source. For example, a B2B order management group could offer order data as a product. More complex Data Products, like product recommendations, could be built with a dependency on the orders and other data domains, with an understanding of the SLAs and guarantees that those products provide.
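To make this concrete, here is a minimal sketch of what a data contract for that hypothetical orders product might look like in Python. The field names, types, and validation rules are illustrative assumptions, not an actual schema from any order management system:

```python
from dataclasses import dataclass
from datetime import datetime

# Illustrative contract for a hypothetical "orders" Data Product.
# Fields and rules are assumptions for the sake of example.
@dataclass(frozen=True)
class OrderRecord:
    order_id: str
    customer_id: str
    total_amount_cents: int   # integer cents avoids float rounding issues
    currency: str             # ISO 4217 code, e.g. "USD"
    created_at: datetime      # event time in UTC

def validate_order(record: OrderRecord) -> list[str]:
    """Return a list of contract violations; an empty list means the record conforms."""
    errors = []
    if not record.order_id:
        errors.append("order_id must be non-empty")
    if record.total_amount_cents < 0:
        errors.append("total_amount_cents must be non-negative")
    if len(record.currency) != 3:
        errors.append("currency must be a 3-letter ISO 4217 code")
    return errors
```

A downstream consumer, like the recommendations product, can program against this schema and the producer's stated SLA rather than against the internals of the order management system.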
Each Data Product could contain multiple components that require a variety of skills:
- Data Product owner — to help articulate the business value and drivers
- Software engineer — to instrument the source system
- Data engineer — to build the data pipeline
- Data scientist — to build an algorithm on the data
- Machine learning engineer — to optimize the algorithm for the runtime environment
- UI Engineer or Data Analyst — to build visualizations
Not all Data Products will require all of these components or skill sets. However, many will share common components that could be generalized to avoid duplicating code and resources.
Decentralized Data Products
We value decentralized decision making and the autonomy of each domain to implement the best solution. Each Data Product should exist wholly within its particular domain and be built by the group that owns that domain. The data pipeline, services, and visualizations that make up a Data Product should be cohesive and have a single owner. So how do we go about building these products without duplicating code and resources across the organization?
The answer is to abstract the common components into self-service capabilities that make up a data platform. Rather than leave each domain to build every component from scratch, self-service capabilities can be offered to handle the generic needs that cut across many (if not all) Data Products. Our central Data Platform group is doing that now with Databricks for data processing, as it has already done for data ingestion and with Snowflake. My team is starting out by offering self-service deployment of machine learning models, so domain teams need not concern themselves with infrastructure or operations.
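The post doesn't show our platform's actual API, but as a rough sketch, a self-service deployment capability might expose an interface like the one below. Every name and field here is hypothetical; the point is only that a domain team declares what to deploy, and the platform handles where and how:

```python
from dataclasses import dataclass

# Hypothetical self-service deployment request -- these fields are
# illustrative, not our platform's real API.
@dataclass
class DeploymentRequest:
    model_name: str
    model_version: str
    owning_domain: str          # ties the deployment back to its data domain
    cpu_request: str = "500m"   # resource defaults chosen by the platform team
    memory_request: str = "1Gi"

class SelfServeDeployer:
    """Registers model deployments so domain teams never touch infrastructure directly."""

    def __init__(self) -> None:
        self._deployments: dict[str, DeploymentRequest] = {}

    def deploy(self, req: DeploymentRequest) -> str:
        # The platform would provision serving infrastructure here; this sketch
        # just records the deployment and returns a predictable endpoint path.
        endpoint = f"/models/{req.owning_domain}/{req.model_name}/{req.model_version}"
        self._deployments[endpoint] = req
        return endpoint
```

The domain team owns the model and its business logic; the platform owns provisioning, scaling, and monitoring behind the `deploy` call.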
Eventually, as the platform becomes more capable and accessible to domain-aligned teams, they will be able to focus on the code and content that drive business value rather than the cross-cutting concerns that pervade every Data Product.
If you would like to learn more, I suggest reading this more in-depth article by Thoughtworks from which I shamelessly borrowed ideas.