Improving the Machine Learning Lifecycle

ML Model Lifecycle


The first phase of the lifecycle, and the one that requires the most flexibility for data scientists, is experimentation. Even so, many of its tasks could be supported by data platform capabilities. The output of this phase in the diagram is one or, hopefully, more saved models (and the experiments that created them). Although training and model evaluation are performed in this phase, it differs from the dedicated phases later in its ad hoc nature.

Data Acquisition and Exploration

Easing ingestion of new data sources without data engineering effort lets data scientists expand their exploration to new domains. We have started with River for ingestion but will need to add new capabilities in this space. One recent addition has been S3-based ingestion into the data lake. Ingestion capabilities should accept anything and leave data contracts to data curation.
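To make the "accept anything, enforce contracts later" split concrete, here is a minimal sketch in plain Python. It is not how River or the S3 ingestion works; `ingest_raw`, `curate`, and `ORDER_CONTRACT` are hypothetical names invented for illustration, and the landing zone is just a list standing in for the data lake.

```python
import json
from datetime import datetime, timezone


def ingest_raw(payload: bytes, source: str, landing_zone: list) -> dict:
    """Store the payload untouched, attaching only provenance metadata."""
    record = {
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "raw": payload,  # no parsing or schema check at ingestion time
    }
    landing_zone.append(record)
    return record


# Hypothetical data contract, applied later during curation.
ORDER_CONTRACT = {"order_id": int, "amount": float}


def curate(record: dict, contract: dict):
    """Parse a raw record; return the row only if it satisfies the contract."""
    try:
        row = json.loads(record["raw"])
    except ValueError:  # covers malformed JSON and undecodable bytes
        return None
    if not isinstance(row, dict):
        return None
    for field_name, expected_type in contract.items():
        if not isinstance(row.get(field_name), expected_type):
            return None
    return row
```

Ingestion never rejects a payload, so a new source can land data immediately; only the curation step decides which records are fit for downstream use.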

Experiment Tracking

Experiment tracking helps data scientists organize and evaluate their hypotheses by providing capabilities to store and compare experiment parameters and metrics. MLflow is an open source option that we are evaluating; it is also bundled with Databricks, which makes adoption easy for data scientists using that platform.
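The core idea of storing and comparing parameters and metrics can be sketched in a few lines of plain Python. This is a conceptual stand-in, not the MLflow API; `Run` and `ExperimentLog` are hypothetical names, and a real tracker would also persist runs and artifacts.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One experiment run: the parameters tried and the metrics observed."""
    params: dict
    metrics: dict = field(default_factory=dict)


class ExperimentLog:
    """In-memory store of runs that can be compared on a chosen metric."""

    def __init__(self):
        self.runs = []

    def log_run(self, params: dict, metrics: dict) -> Run:
        # Copy the dicts so later mutation by the caller cannot alter history.
        run = Run(params=dict(params), metrics=dict(metrics))
        self.runs.append(run)
        return run

    def best(self, metric: str, higher_is_better: bool = True) -> Run:
        candidates = [r for r in self.runs if metric in r.metrics]
        if not candidates:
            raise ValueError(f"no run recorded the metric {metric!r}")
        pick = max if higher_is_better else min
        return pick(candidates, key=lambda r: r.metrics[metric])
```

A data scientist would log one run per hypothesis and then ask, for example, `log.best("rmse", higher_is_better=False)` to find the parameters worth carrying forward.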


After experimentation is complete and the models have been evaluated to choose the best one, a training phase should commence to create a released model. Whereas experimentation is typically ad hoc, training a released model should be done in a versioned and controlled environment so that it is repeatable via CI. Training may also need scalable on-demand compute to process the full dataset efficiently and in parallel.

Model Packaging

Offering a simple contract to describe machine learning projects lets us ensure portability across execution environments such as Databricks, Docker, and the data scientist's PC. MLflow offers one such open source project format, which takes a declarative approach to dependencies and execution. It also provides a model registry for storing and retrieving released models. The model registry defines a flexible packaging format that supports all major ML serialization formats (ONNX, TF, Sklearn, etc.), Docker, and custom-defined ones. These capabilities make it simpler to move from experimentation to the final released model.


Released models need to be deployed into a staging environment where inference can be performed on production data but without impacting the production applications. Data scientists may want to have one or more candidate models operating in this way, so that they can be evaluated against live data and validated before moving to production.


The evaluation phase is similar to evaluation during experimentation but is automated rather than ad hoc. Using the environment established in the staging phase, data scientists can get final metrics on live data before moving the ML model to production.

  • Source Data Validation — Ensure the data in production fits the expectation of the model
  • Integration Tests — Testing the contracts where the algorithm integrates with other systems
  • Technical Performance — Ensure the model inference performs and scales within the expected resource constraints
  • Model Quality — Model accuracy, error rate, precision, recall
  • Model Interpretability — Model explanations and bias detection
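As an illustration of the model quality check in the list above, the following sketch computes accuracy, error rate, precision, and recall for a binary classifier and compares them to minimum thresholds. The function names and the threshold values are assumptions chosen for the example, not part of any particular platform.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, error rate, precision, and recall for binary labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        # Report 0.0 when a ratio is undefined (no predicted or actual positives).
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def passes_quality_gate(metrics: dict, minimums: dict) -> bool:
    """Return True only if every gated metric meets its minimum value."""
    return all(metrics[name] >= floor for name, floor in minimums.items())
```

Because the evaluation phase is automated, a gate like this could run in the pipeline and block promotion when, say, recall on live data falls below an agreed floor.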


Once the data scientist is comfortable with the model evaluation, they can deploy the ML model in an A/B test, where a challenger model is compared against the current champion model in production. Again, a common set of deployment capabilities and standards around serving models for inference is crucial to offer data scientists a self-service method to run these tests. Data scientists conducting the test should be able to easily control model deployment and the routing of requests to model versions.
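One common way to route requests between a champion and a challenger is deterministic hash-based bucketing, sketched below. This is an assumed approach for illustration rather than a description of any specific serving platform; `assign_model` and its parameters are hypothetical names.

```python
import hashlib

BUCKETS = 10_000  # resolution of the traffic split (0.01% steps)


def assign_model(request_key: str, challenger_share: float) -> str:
    """Route a request to 'challenger' or 'champion' based on a stable hash.

    The same request_key (for example a user ID) always maps to the same
    model while challenger_share is unchanged, keeping the experience stable.
    """
    if not 0.0 <= challenger_share <= 1.0:
        raise ValueError("challenger_share must be between 0 and 1")
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS
    return "challenger" if bucket < challenger_share * BUCKETS else "champion"
```

Exposing `challenger_share` as a setting the data scientist controls is one way to provide the self-service routing described above: start small, then raise the share as confidence grows.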

  • Metrics — Model quality or technical performance metrics can be used to declare a new champion model
  • Observability — Provides the capability to introspect the internal state of the model runtime in real-time in order to diagnose issues or answer other questions.
  • Logging — Speaks for itself
  • Monitoring — Automated monitoring and alerting based upon aggregated metrics or outlier detection
  • Feedback — A standard method for the model to feed predictions and other data into the feature store to be used for the next lifecycle iteration or other model development.
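The monitoring item above mentions alerting based on outlier detection. A minimal sketch, assuming a simple z-score rule over a baseline window of a metric such as inference latency, could look like this; the function name and the default threshold of 3 are illustrative assumptions, and production systems often use more robust methods.

```python
from statistics import mean, stdev


def is_outlier(baseline, value, z_threshold=3.0):
    """Flag a value that lies more than z_threshold standard deviations
    away from the mean of the baseline window."""
    if len(baseline) < 2:
        raise ValueError("baseline needs at least two observations")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any deviation at all is unusual.
        return value != mu
    return abs(value - mu) / sigma > z_threshold
```

An alerting job could call this on each new aggregated data point and notify the model owner when it returns `True`.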



Andres March


bringing science and data together