Building a Data Product can involve machine learning and other data science work that differs a bit from traditional software development. Much of the work may be done in an experimental fashion that is not conducive to the CI/CD process typical of the SDLC. In addition, the people doing the work may not have a deep software engineering background or be comfortable with the tools and processes that promote quality software.
There is a natural conflict between giving data scientists the freedom to explore and experiment with the tools and techniques they favor and the need to operationalize machine learning code that can be built iteratively and methodically.
The way to achieve both these goals is to make a collection of self-service capabilities that promote software development best practices but can be composed with other tools and practices that the data scientist favors. These capabilities make up the data platform discussed in my last post.
The machine learning model lifecycle involves a lot more than training an algorithm. If you search the internet, you will find many diagrams that lay out the phases and tasks that make up this effort. The image below is one such diagram. Although it appears linear, any step in the lifecycle can loop back to the first phase and start the process over again. Each phase in this simplified diagram results in an ML model that is more mature.
The first phase of the lifecycle, and the one that requires the most flexibility for data scientists, is experimentation. However, there are many tasks that could be supported by data platform capabilities. The output of this phase in the diagram is one or, hopefully, more saved models (and the experiments that created them). Although training and model evaluation are performed in this phase, it differs from the dedicated phases later in its ad hoc nature.
Data Acquisition and Exploration
Easing ingestion of new data sources without data engineering effort lets data scientists expand their exploration to new domains. We have started with River for ingestion but will need to add new capabilities in this space. One recent addition has been an S3-based ingestion to the data lake. Ingestion capabilities should accept anything and leave data contracts for data curation.
Capabilities to organize, annotate, and share data promote the goals to make all data in the lake discoverable and accessible via common methods. Data scientists shouldn’t be concerned with the details of where data is physically located when the platform provides a unified interface to consume data domains in the data lake. Airbnb published
Feature stores are a recent hot topic that builds on these ideas and provides capabilities specifically around metadata for datasets used in machine learning. Offerings like Feast and FeatureTools provide capabilities to describe these datasets (and their relationships) and ease consuming them from common data science environments.
Experiment tracking can help data scientists organize and evaluate their hypotheses by providing capabilities to store and compare experiment parameters and metrics. MLflow is an open source option that we are evaluating; it is also bundled with Databricks for easy adoption by data scientists using that platform.
Experiments may also benefit from capabilities such as hyperparameter tuning or other automation. AutoML tools aim to simplify and automate some of the more mundane tasks, but this is a new area where many of the promises may not be delivered.
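To make the idea of storing and comparing experiment parameters and metrics concrete, here is a minimal sketch using only the Python standard library. The `Run` and `ExperimentLog` names are hypothetical and exist only for illustration; this is not the API of MLflow or any other tracking tool, which would also persist runs and artifacts.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One experiment run: the parameters tried and the metrics observed."""
    name: str
    params: dict
    metrics: dict = field(default_factory=dict)


class ExperimentLog:
    """Hypothetical in-memory tracker; a real tool would persist runs."""

    def __init__(self):
        self.runs = []

    def log_run(self, name, params, metrics):
        # Copy the dicts so later mutation by the caller cannot alter history.
        run = Run(name, dict(params), dict(metrics))
        self.runs.append(run)
        return run

    def best(self, metric, higher_is_better=True):
        """Return the run with the best value for the given metric."""
        candidates = [r for r in self.runs if metric in r.metrics]
        if not candidates:
            raise ValueError(f"no run recorded metric {metric!r}")
        pick = max if higher_is_better else min
        return pick(candidates, key=lambda r: r.metrics[metric])


# Illustrative values only; they are not results from a real experiment.
log = ExperimentLog()
log.log_run("baseline", {"max_depth": 3}, {"auc": 0.81})
log.log_run("deeper", {"max_depth": 8}, {"auc": 0.86})
best_run = log.best("auc")
print(best_run.name, best_run.params)
```

Keeping parameters and metrics together per run is what makes the later comparison a one-line query.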
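As an illustration of the simplest form of hyperparameter tuning, the sketch below runs an exhaustive grid search. The `grid_search` helper and the `toy_score` function are assumptions made for this example; `toy_score` stands in for training and validating a real model.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Try every combination in `grid` and return (best_params, best_score).

    `evaluate` is any callable that takes keyword parameters and returns a
    score where higher is better.
    """
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(learning_rate, depth):
    # Stand-in for training and validating a model; peaks at (0.1, 4).
    return -((learning_rate - 0.1) ** 2) - 0.01 * (depth - 4) ** 2


best_params, best_score = grid_search(
    toy_score,
    {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]},
)
print(best_params)
```

Grid search grows multiplicatively with each added parameter, which is one reason platforms tend to offer smarter search strategies and parallel compute for this step.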
After experimentation has been completed and the models have been evaluated to choose the best one, a training phase should commence to create a released model. Whereas experimentation is typically more ad hoc, training a released model should be done in a versioned and controlled environment so that it is repeatable via CI. Training may also need scalable on-demand compute to handle the full dataset in an efficient and parallel manner.
Databricks has many of these features built-in, such as compute pools, quotas, versioned workspaces, and point-in-time datasets. Our Data Platform will be filling the gaps to connect version control, CI/CD, and orchestration to provide jobs the reliable environment they need.
Offering a simple contract to describe machine learning projects enables us to ensure portability across execution environments such as Databricks, Docker, and the data scientist's PC. MLflow offers one such open source project format that is a declarative approach to dependencies and execution. It also provides a model registry for storing and retrieving the released models. The model registry defines a flexible packaging format that supports all major ML serialization formats (ONNX, TF, Sklearn, etc.), Docker, and custom-defined ones. These capabilities make it simpler to move from experimentation to the final released model.
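The idea of a declarative project contract can be sketched in a few lines. The `ProjectSpec` and `EntryPoint` classes below are hypothetical and are not MLflow's project format; the project name, file names, and parameters are made up for illustration. The point is only that a project declares its dependencies and parameterized commands once, and any execution environment can render the same command from that declaration.

```python
import shlex
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EntryPoint:
    """A named command template with default parameter values."""
    command: str
    defaults: dict = field(default_factory=dict)

    def render(self, **overrides):
        """Fill the template with defaults plus caller overrides."""
        params = {**self.defaults, **overrides}
        return shlex.split(self.command.format(**params))


@dataclass(frozen=True)
class ProjectSpec:
    """Hypothetical project contract: what to install and how to run it."""
    name: str
    dependencies: tuple
    entry_points: dict


spec = ProjectSpec(
    name="churn-model",  # illustrative name
    dependencies=("scikit-learn",),
    entry_points={
        "train": EntryPoint(
            "python train.py --alpha {alpha} --data {data}",
            {"alpha": 0.5},
        )
    },
)

# The same declaration yields the same argv on a laptop, in a container,
# or on a cluster; only the runner that executes argv differs.
argv = spec.entry_points["train"].render(data="data/train.csv")
print(argv)
```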
Released models need to be deployed into a staging environment where inference can be performed on production data but without impacting the production applications. Data scientists may want to have one or more candidate models operating in this way, so that they can be evaluated against live data and validated before moving to production.
CI/CD tools provide important capabilities to orchestrate this promotion of released models to the staging environment. These tools can also integrate with governance workflows to ensure the right people and processes are involved in the promotion. The model registry in MLflow provides some capabilities to implement this workflow, but a more complete solution will be required.
From the perspective of the data scientist, this workflow shouldn’t be a burden. By declaring the ML project as they did in the train phase, they should be able to deploy and serve the project via common tools rather than a bespoke process. The serving or execution environment is where inference is done, either in a staging environment during evaluation or in production. The serving environment should treat ML models as cattle, not pets, like other microservices. Autoscaling, routing, monitoring, logging, and other features should be the concern of the platform, not of data scientists.
The evaluation phase is similar to that during experimentation but is automated rather than ad hoc. Using the environment established in the stage phase, data scientists can get final metrics on live data prior to moving the ML model to production.
The main capability required in this phase is shadowing the requests from the production environment. This capability lets data scientists replay live traffic against candidate models without affecting production. The platform should provide basic metrics without effort and custom metrics if desired. Seldon Deploy is one option that we will be evaluating to provide this capability.
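A promotion workflow with a governance check can be sketched as a small state machine. The stage names, the allowed transitions, and the approval counts below are assumptions chosen for illustration; they do not describe the MLflow model registry or any specific CI/CD tool.

```python
# Assumed lifecycle stages and the transitions allowed between them.
ALLOWED = {
    "released": {"staging"},
    "staging": {"production", "archived"},
    "production": {"archived"},
}

# Assumed governance policy: distinct approvers needed to enter a stage.
REQUIRED_APPROVALS = {"staging": 1, "production": 2}


class PromotionError(Exception):
    """Raised when a promotion breaks the workflow or governance rules."""


def promote(current, target, approvers):
    """Return the new stage, or raise PromotionError if not permitted."""
    if target not in ALLOWED.get(current, set()):
        raise PromotionError(f"cannot move from {current!r} to {target!r}")
    needed = REQUIRED_APPROVALS.get(target, 0)
    distinct = len(set(approvers))  # the same person cannot approve twice
    if distinct < needed:
        raise PromotionError(
            f"{target!r} needs {needed} approvals, got {distinct}"
        )
    return target


stage = promote("released", "staging", ["alice"])
stage = promote(stage, "production", ["alice", "bob"])
print(stage)
```

A CI/CD pipeline could call a check like this before deploying, so that the policy lives in one place instead of in each project.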
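The request shadowing described above can be sketched as follows. This is a simplified, synchronous illustration with hypothetical names; it is not how Seldon Deploy works internally, and a real platform would mirror traffic asynchronously so that candidates add no latency to the production path.

```python
def serve_with_shadows(request, champion, shadows, shadow_log):
    """Answer with the champion model and mirror the request to candidates.

    Candidate predictions (or failures) are only recorded in `shadow_log`;
    they never change the response returned to the caller.
    """
    response = champion(request)
    for name, model in shadows.items():
        try:
            shadow_log.append({
                "model": name,
                "request": request,
                "prediction": model(request),
                "champion": response,
            })
        except Exception as exc:  # a broken candidate must not hurt production
            shadow_log.append(
                {"model": name, "request": request, "error": repr(exc)}
            )
    return response


# Toy models standing in for real inference endpoints.
shadow_log = []
result = serve_with_shadows(
    2,
    champion=lambda x: x * 2,
    shadows={"candidate-a": lambda x: x * 3, "candidate-b": lambda x: 1 / 0},
    shadow_log=shadow_log,
)
print(result, len(shadow_log))
```

The logged pairs of champion and candidate predictions are the raw material for the evaluation metrics in this phase.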
The breadth of testing and metrics done in this stage may also differ from experimentation:
- Source Data Validation — Ensure the data in production fits the expectation of the model
- Integration Tests — Testing the contracts where the algorithm integrates with other systems
- Technical Performance — Ensure the model inference performs and scales within the expected resource constraints
- Model Quality — Model accuracy, error rate, precision, recall
- Model Interpretability — Model explanations and bias detection
Capabilities to automate some of this validation could give data scientists greater confidence in promoting the models to production.
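Two of the checks in the list above, source data validation and model quality, are easy to sketch. The column names, rules, and labels below are made up for illustration, and the helper names are hypothetical rather than part of any validation library.

```python
def validate_rows(rows, rules):
    """Return (row_index, problem) pairs for rows that break the rules.

    `rules` maps a column name to a predicate that accepts a valid value.
    """
    problems = []
    for i, row in enumerate(rows):
        for column, is_valid in rules.items():
            if column not in row:
                problems.append((i, f"{column}: missing"))
            elif not is_valid(row[column]):
                problems.append((i, f"{column}: invalid value {row[column]!r}"))
    return problems


def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


rules = {
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
    "plan": lambda v: v in {"free", "pro"},
}
rows = [{"age": 34, "plan": "pro"}, {"age": -1, "plan": "pro"}, {"plan": "free"}]
problems = validate_rows(rows, rules)

precision, recall = precision_recall([1, 0, 1, 1, 0], [1, 1, 1, 0, 0])
print(problems, precision, recall)
```

An automated gate could block promotion whenever `problems` is non-empty or either metric falls below a threshold agreed for the model.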
Once the data scientist is comfortable with the model evaluation, they can deploy the ML model in an A/B test, where a challenger model is compared against the current champion model in production. Again, a common set of deployment capabilities and standards around serving models for inference is crucial to offer data scientists a self-service method to run these tests. Data scientists conducting the test should be able to easily control model deployment and the routing of requests to model versions.
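One common way to route requests in an A/B test is to hash a stable identifier so that each user consistently sees the same model. The sketch below assumes a hypothetical `assign_variant` helper, a made-up salt, and a 10% challenger share; none of these come from a specific serving tool.

```python
import hashlib

BUCKETS = 10_000  # routing resolution: shares are honored to 0.01%


def assign_variant(user_id, challenger_share=0.1, salt="ab-test-1"):
    """Deterministically map a user to 'champion' or 'challenger'.

    Hashing (salt, user_id) keeps a user on the same model for the whole
    test, and changing the salt reshuffles users for the next test.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS
    return "challenger" if bucket < challenger_share * BUCKETS else "champion"


assignments = [assign_variant(f"user-{i}") for i in range(10_000)]
share = assignments.count("challenger") / len(assignments)
print(f"challenger share: {share:.3f}")
```

Because the assignment is a pure function of the identifier, the data scientist can change the traffic split by adjusting a single parameter, with no routing state to store.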
For all models in production, a standard set of capabilities should ensure that they continue to run smoothly and contribute back into the next iteration of the lifecycle. Metrics, observability, logging, monitoring, and feedback are all important to the successful operation of the running models.
- Metrics — Model quality or technical performance metrics can be used to declare a new champion model
- Observability — Provides the capability to introspect the internal state of the model runtime in real-time in order to diagnose issues or answer other questions.
- Logging — Speaks for itself
- Monitoring — Automated monitoring and alerting based upon aggregated metrics or outlier detection
- Feedback — A standard method for the model to feed predictions and other data into the feature store to be used for the next lifecycle iteration or other model development.
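As a small illustration of the monitoring item above, the sketch below flags a metric value as an outlier when it sits too far from its recent history, using a z-score. The threshold of three standard deviations and the latency values are assumptions for this example; production monitoring would usually rely on a dedicated alerting system.

```python
from statistics import mean, stdev


def is_outlier(history, value, threshold=3.0):
    """Return True if `value` is more than `threshold` sample standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Illustrative p95 latencies in milliseconds for recent time windows.
latency_history = [100, 102, 98, 101, 99]
for observed in (101, 150):
    if is_outlier(latency_history, observed):
        print(f"ALERT: latency {observed} ms deviates from recent history")
```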
In addition to A/B testing, data scientists may want to run more sophisticated scenarios, such as a multi-armed bandit, that maximize a metric across a number of algorithms rather than purely determine whether one is statistically better. Capabilities such as model metrics, dynamic model routing, and monitoring contribute to implementing these scenarios. These capabilities are also starting to be offered more directly by open source projects, such as Seldon Core.
Of course, the runtime of the execution environment is a central capability that enables many of the other ones above. There is no lack of options in this space, but the central goal is that data scientists care only about the model. The standard packaging format and deployment capabilities enable us to execute the model inference in Docker, Spark UDFs, Fargate, SageMaker, Kubernetes, etc. without any additional effort from users.
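To show how a bandit differs from a fixed A/B split, here is a minimal epsilon-greedy sketch, one of the simplest multi-armed bandit strategies. The class name, the model names, and the simulated success rates are all assumptions for illustration; this is not Seldon Core's implementation.

```python
import random


class EpsilonGreedyRouter:
    """Route most traffic to the best-performing model so far, while
    exploring the others a fraction `epsilon` of the time."""

    def __init__(self, models, epsilon=0.1, seed=None):
        self.models = list(models)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {m: 0 for m in self.models}
        self.values = {m: 0.0 for m in self.models}  # running mean reward

    def choose(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.models)  # explore
        return max(self.models, key=lambda m: self.values[m])  # exploit

    def update(self, model, reward):
        """Fold one observed reward into the model's running mean."""
        self.counts[model] += 1
        n = self.counts[model]
        self.values[model] += (reward - self.values[model]) / n


# Simulated environment: assumed true success rates per model.
true_rates = {"model-a": 0.2, "model-b": 0.8}
env = random.Random(1)
router = EpsilonGreedyRouter(true_rates, epsilon=0.1, seed=0)
for _ in range(2000):
    chosen = router.choose()
    reward = 1.0 if env.random() < true_rates[chosen] else 0.0
    router.update(chosen, reward)
print(router.counts)
```

Unlike an A/B test, the traffic split here shifts on its own toward the better model, which is why the approach depends on the live metrics and dynamic routing capabilities mentioned above.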
If you would like to read more about this topic, I recommend this very good article on CD for ML which inspired some of this post.