(Figure: the relative search popularity of "MLOps" on Google over the past five years.)
Introduction
In modern businesses, there is a great need for getting machine learning (ML) models into production. Some typical pitfalls include:
The project depends on a specific person who has played a key role in development.
The architecture is too intertwined, so changing one part might break another.
Data drift means that there is often a need for retraining.
Updating the models is done manually, which is time-consuming.
MLOps is a work strategy that borrows concepts from traditional DevOps but is much more tailored towards machine learning. The goals of MLOps are:
To experiment faster with many different ML models.
To deploy ML solutions into production effectively.
To assure model quality, covering aspects such as performance metrics and explainable artificial intelligence.
Adopting MLOps can be beneficial. However, MLOps is not a step-by-step guide, but a concept that involves many elements. Which elements are most important depends on your company's specific use case and goals.
DevOps
CI/CD
MLOps builds upon traditional DevOps. Therefore, continuous integration and continuous delivery/deployment (CI/CD) pipelines are important. In practice, this typically means setting up automated build and test pipelines with services such as GitHub Actions or similar SaaS solutions.
CI encourages team members to develop the code as a set of microservices. Breaking the code up into many small functional pieces is expedient because it makes the entire project easier to understand and faster to develop. Automated unit testing can pick up errors in the code that are difficult to spot with the human eye. All in all, the end product becomes more robust. A minimal example of such a unit test is sketched below.
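To make the idea of automated unit testing concrete, here is a hedged sketch of a small pytest test; the fill_missing helper and its behavior are hypothetical examples, not taken from the original text. A CI service such as GitHub Actions could run tests like this on every push.

```python
# test_preprocessing.py -- run with `pytest`.
import numpy as np
import pandas as pd


def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: replace missing numeric values with the column median."""
    return df.fillna(df.median(numeric_only=True))


def test_fill_missing_replaces_nans_with_median():
    df = pd.DataFrame({"age": [20.0, np.nan, 40.0]})
    result = fill_missing(df)
    assert not result["age"].isna().any()   # no missing values remain
    assert result.loc[1, "age"] == 30.0     # filled with the median of 20 and 40
```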
CD manages the delivery of the code to different environments, such as staging/testing and then deployment into production, in an automated manner. Version control is a cornerstone for this concept.
Infrastructure as Code
Infrastructure as code (IaC) means setting up scripts that deploy your cloud resources. In other words, IaC manages your cloud infrastructure in a declarative manner instead of through web interfaces (the Azure portal, the Google Cloud console, etc.). This minimizes the chance of human error and makes it much simpler to reuse and redeploy everything from a single resource up to complex cloud architectures. A popular IaC tool is HashiCorp's Terraform, and its role is illustrated below.
A common first step in implementing IaC is to put the configuration files in a Git repository (GitHub, GitLab, etc.). The next step is to set up a service, e.g., Azure Pipelines, which reads these files and deploys cloud resources to the respective cloud provider. A small sketch of declarative infrastructure follows below.
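The article mentions Terraform, which uses its own configuration language (HCL). Purely to keep the illustration in Python, the sketch below instead uses Pulumi, an alternative IaC tool; the resource names and region are hypothetical, and the same resources could equally be declared in Terraform.

```python
# __main__.py -- a minimal Pulumi program declaring Azure resources in Python.
# Deployed with `pulumi up` once a Pulumi project and stack are configured.
import pulumi
from pulumi_azure_native import resources, storage

# Hypothetical names and region.
resource_group = resources.ResourceGroup("ml-rg", location="westeurope")

data_store = storage.StorageAccount(
    "mldatastore",
    resource_group_name=resource_group.name,
    location=resource_group.location,
    sku=storage.SkuArgs(name="Standard_LRS"),
    kind="StorageV2",
)

# Export the generated account name so other pipelines can reference it.
pulumi.export("storage_account_name", data_store.name)
```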
ML Pipelines
MLOps is typically considered a cross-field between data science, data engineering, and DevOps. The main difference between software development and machine learning development is that the latter must constantly change and evolve because it depends on real-world data. To minimize the amount of rework, one should use reusable pipelines.
Setting up automated data ingestion and data processing pipelines allows for reusability. This is a job for a data engineer, and it is fundamental for any ML project. It involves gathering, cleaning, and combining data from many sources. On Azure Machine Learning, data processing pipelines can include common steps such as removing or substituting missing data, one-hot encoding, splitting up data, normalization, standardization, and much more.

When experimenting with different models, you learn a lot about what works and what does not. Experimenting with models is the ML engineer's job. To keep track of this process, training pipelines can be set up. These can be used for automated training of new models (scheduled or triggered), and they provide reusability and reproducibility. Some common features are k-fold cross-validation, tuning of hyperparameters, stopping criteria, and logging along the way.
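As a hedged sketch of these ideas, the example below uses scikit-learn rather than Azure Machine Learning; the synthetic data, column names, and model choice are assumptions made purely for illustration. It chains imputation of missing values, one-hot encoding, standardization, and k-fold cross-validated hyperparameter tuning into one reusable pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic placeholder data with missing values and a categorical feature.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "country": rng.choice(["NO", "SE", "DK"], n),
})
df.loc[::17, "age"] = None  # inject some missing values
y = (df["income"] + rng.normal(0, 5_000, n) > 50_000).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, random_state=0, stratify=y
)

# Reusable preprocessing: impute missing values, standardize numerics, one-hot encode categoricals.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["age", "income"]),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["country"]),
])

# Training pipeline with k-fold cross-validation and hyperparameter tuning.
model = Pipeline([("preprocess", preprocess),
                  ("classifier", RandomForestClassifier(random_state=0))])
search = GridSearchCV(model, {"classifier__n_estimators": [50, 100]}, cv=5)
search.fit(X_train, y_train)
print("Held-out test score:", search.score(X_test, y_test))
```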
Explainable Artificial Intelligence
The term explainable AI (XAI) has emerged alongside AI and the many discussions surrounding it. Advanced ML models have demonstrated superhuman performance in specific domains. However, the models are often considered black boxes, which means that they are hard to interpret. For critical or sensitive domains such as health, there is immense value in an increased understanding of the models.
In cloud development, built-in monitoring features or connected dashboards are gold mines. In fast-moving enterprises, everything of interest should be followed closely. Some examples include data distribution shifts, data quality, production statuses, training logs, model scoring, and deviations. Proper monitoring brings a ton of value. It is particularly important to ask these questions:
Which features are the main contributors to the results?
Are the models under- or overfitted?
What are the uncertainties of the results?
Are the models following the principles of responsible AI?
These are some of the central questions in XAI. To properly answer the questions, monitoring the models is fundamental. This is (again) where dashboards come in handy.
Feature Importance
There are many ways of discovering feature importance in ML. Scikit-learn has a useful framework for feature importance: permutation feature importance. Additionally, models such as decision trees and random forests inherently find feature importance alongside training of the models. Since data distributions can change over time, so can feature importance. Therefore, you could set up a dashboard that plots the relative feature importance vs. time.
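As a rough sketch of permutation feature importance with scikit-learn (the dataset and model below are placeholders chosen only for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Placeholder dataset; in practice this would be your own production features.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and measure how much the validation score drops.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
ranked = sorted(zip(X.columns, result.importances_mean, result.importances_std),
                key=lambda item: -item[1])
for name, mean, std in ranked[:5]:
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

Logging these numbers from each pipeline run would make the feature-importance-vs.-time dashboard mentioned above straightforward to build.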
Underfitting and Overfitting
It is important to realize when models are under- or overfitted because the goal of any model is to generalize well, i.e., to perform well on unseen data. This is the main reason for splitting the data set into training and validation sets (or using cross-validation), plus a test set for the final scoring of the model. One way of automatically monitoring this is to record either loss or performance metrics from your ML pipelines. Visualizing loss or performance vs. model complexity is also helpful. For instance, when training a neural network, one can plot loss vs. the number of training epochs.
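As a hedged sketch of recording such metrics (synthetic data and an assumed model choice), scikit-learn's MLPClassifier exposes the training loss and validation score per epoch, which could be logged from a training pipeline and plotted on a dashboard:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic placeholder data.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# early_stopping=True holds out a validation fraction and scores it every epoch.
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=200,
                    early_stopping=True, validation_fraction=0.2, random_state=0)
clf.fit(X, y)

# Per-epoch curves: a training loss that keeps falling while the validation score
# stagnates or drops is a typical sign of overfitting.
print("training loss per epoch:", [round(v, 3) for v in clf.loss_curve_[:5]], "...")
print("validation accuracy per epoch:", [round(v, 3) for v in clf.validation_scores_[:5]], "...")
```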
Uncertainty
Any measurable metric is much more valuable if it comes with an uncertainty estimate. This is fundamental in science, so there is really no excuse for not providing uncertainty estimates for ML models. One approach is to set up an ensemble of N models and train them independently. The resulting output can be the average score or majority vote among the models, and the uncertainty can be the variance between them.
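A minimal sketch of the ensemble idea, assuming a regression task on synthetic data and scikit-learn models (all names and numbers are placeholders):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic placeholder data.
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train N models independently on bootstrap resamples of the training data.
N = 5
rng = np.random.default_rng(0)
models = []
for seed in range(N):
    idx = rng.integers(0, len(X_train), size=len(X_train))
    models.append(GradientBoostingRegressor(random_state=seed).fit(X_train[idx], y_train[idx]))

preds = np.stack([m.predict(X_test) for m in models])  # shape: (N, n_test_samples)
mean_prediction = preds.mean(axis=0)                   # ensemble output
uncertainty = preds.var(axis=0)                        # variance between the models
print("prediction:", mean_prediction[0], "variance:", uncertainty[0])
```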
Responsible AI
Machine learning is subject to direct and probing ethical questions, as it should be! Company policies must be followed, but more importantly, ethical principles such as non-discrimination and privacy regulations such as GDPR must be adhered to.
(Image from Microsoft, https://www.microsoft.com/en-us/ai/responsible-ai-resources). Microsoft has categorized six principles for responsible AI:
Fairness
Reliability & Safety
Privacy & Security
Inclusiveness
Transparency
Accountability
It is crucial not to develop a bunch of ML models blindly, but to plan, reflect along the way, and then regularly reassess. Make sure not to underestimate the importance of ethics in machine learning.
Conclusion
Adopting MLOps is not a pre-defined process. Start out by picking the most important elements for your company's specific use case. Here are some tips to help you along the way:
Automate wherever possible. Set up CI/CD pipelines.
Split functionality into microservices and implement unit tests for them.
Create reusable data processing pipelines and model training pipelines.
Monitor everything.