AI and machine learning (ML) are everywhere, whether you notice them or not. Amazon uses machine learning to recommend products. TGI Friday’s used machine learning to make a virtual bartender. Car manufacturers use machine learning in prolonged, painful attempts to teach cars how to drive themselves (made extra difficult by humans on the road).
The use cases are endless. ML algorithms offer unique benefits, but they’re not simple to create, maintain, or deploy. This is why you need a machine learning operations (MLOps) process.
MLOps is a tricky beast in and of itself. Blindly deploying an MLOps framework—even a good one—is a recipe for chaos. You need to plan how you’re going to set up MLOps. You need an MLOps strategy.
What is MLOps?
If you’re new to the idea of MLOps or machine learning in general, take a moment to read our What Is MLOps? article. We explain both concepts in simple terms, demonstrating how and why businesses use them.
Machine learning models need specialized development, dataset preparation, evaluation, and maintenance. Getting ML models to provide reliable business value isn’t easy. It requires continuous collaboration between data specialists, machine learning engineers, and software developers.
This collaboration is new and complex, which causes delays, friction, errors, and a heap of other problems. MLOps is an attempt to fix that. It’s the procedures, workflows, and tools that help businesses efficiently integrate reliable ML into software.
MLOps vs DevOps
While MLOps takes its name from DevOps, they are not the same. DevOps is concerned with making the traditional software development process work better. It helps software engineers and operations staff work together to develop, deploy, and update software.
MLOps has a similar goal with a more complex scope. ML models can do incredible things, but they need traditional software to interact with the rest of the world. Imagine a giant robot with a human pilot. The ML model is the pilot, and the traditional software is the robot.
The pilot needs to train, learn, study, and train some more. The robot needs to be built, fine-tuned, and upgraded with cooler guns and laser swords. Both the pilot trainers and the robot builders must work closely together. Imagine that the robot team swaps the Cool Sword button with the Self-Destruct button. It’s a great change. It cuts power use by 20%! But if they forget to tell the pilot training team, our rad mecha anime becomes a tragic cautionary tale about the importance of interdepartmental communication real fast.
MLOps is data engineers, machine learning developers, and software engineers working together smoothly and reliably to put machine learning projects into practice. For a more detailed explanation with fewer anime references, check out this MLOps: Motivation article.
Key Components of MLOps Strategy
Your MLOps strategy should be documented somewhere visible to all stakeholders. It’s not going to be a single set-and-forget process. An MLOps strategy can and should evolve with time!
The key areas it should cover include:
Current Friction Points
If your organization is already using ML models, gather information from your teams about the pain points they’re currently facing. Without this knowledge, you might end up with an MLOps strategy that solves future problems before present ones.
Next, describe your ideal workflow. What would a perfect MLOps solution look like for your business? Don’t worry about getting it perfect the first time around. This part of the strategy will naturally change with your teams, work scope, and business goals.
Budget is the bummer part, but constraints are important. You probably don’t have unlimited cash to buy software and hire experts. At the very least, know how much money you have and how much your ideal workflow would cost.
Take your current pain points, your ideal workflow, and your budget, and put them together. Find your most troublesome problems, and identify the most readily available solutions. If your situation is complicated enough, you might break this up into immediate and medium-term solutions.
You probably can’t fix all your problems at once. You might also know what problems are likely to develop as you grow. Record both in your strategy: outline which pain points you’ll solve later (and how you’ll fix them), as well as pain points you can foresee (and how you’ll avoid them).
Ownership and Team Structure
Nothing will get done if nobody’s in charge. You need to assign responsibility for each problem and solution. Depending on the complexity of your strategy, you might simply distribute this ownership among your teams. If you can afford it, though, hiring an MLOps architect can make the process smoother.
Finally, figure out when everything will happen. This timeline should have goal dates for all new processes, tools, and people. Short-term goals should be specific; long-term goals may simply target specific quarters.
There you have it. It’s simple!
Next, let’s take a look at the three overall stages of an MLOps strategy: manual, automated, and CI/CD.
Manual MLOps Strategies
A manual MLOps structure has no automation. Every part of the process is manual: data collection, data preparation, model training, model deployment, and so on. While manual processes aren’t ideal for the long term, they’re cheap and easy to implement. They can be an acceptable way to start addressing immediate problems.
However, manual MLOps processes are prone to human error. If you use manual processes to train and deploy your ML model, you’ll have to repeat those processes every time you want to retrain or update your model. Manual model performance monitoring might miss telltale signs of model degradation. On top of everything else, manual processes are slow. They take up a lot of time, every time.
In practice, manual MLOps can reduce friction between data, ML, and software teams. It falls short of other MLOps goals, though. Without automation, it’s a hassle to iterate on an ML model after it’s deployed—and ML models are particularly prone to data drift and decay over time. Imagine an app that actively creates new bugs in its code. You’d want to make it as painless as possible to fix and re-deploy that software!
Manual MLOps is like driving to a car factory every time you need new tires. The factory is great at making new cars, but it’s terrible at maintaining existing cars.
If you’re going to be using multiple ML models or updating them frequently, manual MLOps processes are only acceptable when you’re starting out. You want to start automating these processes as soon as possible.
Automated MLOps Strategy
For many organizations, this will be the sweet spot. An automated MLOps strategy involves the creation of an automated ML pipeline. Integrating multiple software solutions together can remove much of the manual work required to iterate on ML models.
You might start by setting up Argo Workflows or NATS. These tools can manage the flow of training data from storage to Python processing libraries like Pandas. They can automatically run custom scripts that ingest, validate, clean, and split the data into discrete sets.
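To make the validate, clean, and split steps concrete, here is a minimal sketch of the kind of custom script a workflow tool might run. It uses only the Python standard library so it stays self-contained; a real pipeline would more likely use Pandas. The schema (`feature`, `label`), the function names, and the 80/20 split are all assumptions made for illustration.

```python
import random

# Hypothetical schema: every record needs these two fields.
REQUIRED_FIELDS = ("feature", "label")


def validate(rows):
    """Keep only rows where every required field is present and not None."""
    return [r for r in rows if all(r.get(f) is not None for f in REQUIRED_FIELDS)]


def clean(rows):
    """Normalize types and drop exact duplicates, preserving order.

    Values that cannot be parsed as numbers raise ValueError here.
    """
    seen, cleaned = set(), []
    for r in rows:
        record = (float(r["feature"]), int(r["label"]))
        if record not in seen:
            seen.add(record)
            cleaned.append(record)
    return cleaned


def split(rows, train_fraction=0.8, seed=42):
    """Shuffle with a fixed seed, then split into train and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def prepare(rows):
    """Run the three steps in order: validate, clean, split."""
    return split(clean(validate(rows)))
```

Calling `prepare(raw_rows)` returns a `(train, test)` pair. The fixed seed keeps the split reproducible between runs, which makes it easier to compare model iterations fairly.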
If data isn’t your pain point, those tools can also automate ML model training, experiment tracking, and evaluation. You can link this process to a trigger, such as the availability of new data. Software like TensorBoard can create visualizations for the results of each iteration. This information helps your ML team experiment by tweaking algorithms and hyperparameters.
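As a rough illustration of a "new data" trigger, the sketch below fingerprints a dataset snapshot and only starts training when the fingerprint changes. Each run is logged, which is a very simple stand-in for experiment tracking. `RetrainTrigger`, `fingerprint`, and `train_fn` are hypothetical names, not part of Argo Workflows, NATS, or TensorBoard.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


def fingerprint(rows):
    """Order-insensitive SHA-256 hash of a dataset snapshot."""
    digest = hashlib.sha256()
    for line in sorted(repr(row) for row in rows):
        digest.update(line.encode("utf-8"))
    return digest.hexdigest()


@dataclass
class RetrainTrigger:
    """Runs training only when the data differs from the last trained snapshot."""

    last_fingerprint: Optional[str] = None
    runs: list = field(default_factory=list)  # minimal experiment log

    def maybe_train(self, rows, train_fn):
        current = fingerprint(rows)
        if current == self.last_fingerprint:
            return False  # nothing new, skip this cycle
        metrics = train_fn(rows)  # train_fn is supplied by the caller
        self.runs.append({"fingerprint": current, "metrics": metrics})
        self.last_fingerprint = current
        return True
```

A scheduler could call `maybe_train` on every cycle. Training only happens when the data has actually changed, and `runs` keeps the metrics for each iteration so they can be compared or plotted later.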
Similar workflow steps can also automate ML model validation. For example, you can choose only to show performance results for iterations of the model that pass validation.
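A validation gate like the one described above can be sketched in a few lines. The thresholds and the `Candidate` fields below are assumptions chosen for illustration; your own gate would check whatever metrics matter to your business.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One trained iteration of a model (hypothetical fields)."""

    name: str
    accuracy: float
    latency_ms: float


# Hypothetical thresholds; tune these for your own use case.
MIN_ACCURACY = 0.90
MAX_LATENCY_MS = 50.0


def passes_validation(candidate):
    """A candidate passes only if it meets every threshold."""
    return (
        candidate.accuracy >= MIN_ACCURACY
        and candidate.latency_ms <= MAX_LATENCY_MS
    )


def report(candidates):
    """Return only the validated candidates, best accuracy first."""
    passed = [c for c in candidates if passes_validation(c)]
    return sorted(passed, key=lambda c: c.accuracy, reverse=True)
```

Iterations that fail the gate never show up in the report, so the ML team only spends time comparing models that could actually be deployed.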
CI/CD and MLOps
A fully mature MLOps implementation will have an automated continuous integration/continuous delivery (CI/CD) pipeline. It should also include continuous training (CT) and continuous monitoring (CM) components. These components often communicate through APIs.
This kind of MLOps system can be extremely powerful. For instance, a previously deployed ML model might miss a performance metric. Model monitoring tools detect this drift in real time and alert your team. From there, new data can be automatically ingested and used to retrain the model until it’s back on track. With proper setup, this retraining can be done without ever taking the model out of production.
The problems with a full CI/CD pipeline for MLOps are the same as with any complex system. If it’s not planned and implemented thoughtfully, you can end up with lots of technical debt hidden among the components. It’s a lot of interlocking pieces, so it works great, when it works. You’ll need to keep system experts on hand for maintenance and updates. When something breaks, you’ll need time and money to do costly repairs.
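The monitoring half of that loop might look something like the sketch below. It tracks accuracy over a rolling window of recent predictions and flags the model for retraining once accuracy falls too far below a baseline. `DriftMonitor`, the tolerance, and the window size are hypothetical, and the sketch assumes ground-truth labels eventually arrive for production predictions, which is not always the case.

```python
from collections import deque


class DriftMonitor:
    """Flags a model for retraining when rolling accuracy drops below baseline."""

    def __init__(self, baseline_accuracy, tolerance=0.05, window=100):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        # Only the most recent `window` outcomes are kept.
        self.outcomes = deque(maxlen=window)

    def record(self, prediction, actual):
        """Store whether a production prediction matched the true label."""
        self.outcomes.append(prediction == actual)

    def rolling_accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        """True once a full window shows accuracy below baseline minus tolerance."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        return self.rolling_accuracy() < self.baseline - self.tolerance
```

In a full pipeline, a `True` result from `needs_retraining()` would alert the team and kick off the continuous training step. Waiting for a full window before raising the flag avoids retraining on a handful of unlucky predictions.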
If done right, the benefits are worth it. You’ll get value from multiple ML models with high reproducibility and reliability. You’ll also have efficient, effective ways to deploy new models and maintain existing ones.
Ownership and Leadership
MLOps strategies are complex. You might have picked up on that already.
Doing MLOps right involves lots of work. Coordinating multiple teams! Implementing and integrating software across different disciplines! Planning and executing a critical years-long strategy! With so many factors in play, you might need one person to be in charge of directing it all. That person is your MLOps architect.
An MLOps architect can be an incredibly valuable resource. This person is a cross-team expert. They help write the strategy and answer cross-team questions. They also plan the addition of new software and keep implementation moving at an even pace. As the MLOps pipeline evolves, you might even need a dedicated MLOps team to keep things running smoothly.
You might already have your MLOps architect in your organization! If someone has the skills, promoting from within can be a great choice. Insiders will already be familiar with the people, goals, and workflow in your business. On the other hand, hiring an outside expert can bring a fresh view to your team. You’ll have to decide which option is best for you.
This has been a lot, so let’s switch gears and talk about example pipelines and software.
If you want an all-in-one solution that’s purpose-built for MLOps, Amazon SageMaker is a popular choice. SageMaker has a wealth of built-in tools for multiple roles: business analysts, data scientists, and ML teams.
SageMaker has tools for every stage of the machine learning lifecycle. It covers development, training, deployment, retraining, and more. SageMaker can even automatically tune ML models and identify the best-performing versions.
Amazon SageMaker is priced by usage, so its cost will be proportionate to your needs. It also has a free-to-try option. The pricing details are super complicated, as is usually the case with AWS services. Amazon does provide pricing examples, which range from $0 to over $300 per month.
If Amazon isn’t your platform of choice, Kubeflow is a popular open-source MLOps platform. Kubeflow integrates with and enhances other MLOps tools, like Jupyter, TensorFlow, and Kubernetes. It’s designed to serve as the orchestrator of an open-source MLOps toolchain.
Kubeflow is classic open-source: it’s free, but you have to know what you’re doing. You’ll be relying on community support and development. If your team is already comfortable with complex open-source tools, Kubeflow is a great choice. If you need professional support and services, stick with paid software providers.
More MLOps Resources
Machine learning isn’t hypothetical, untested ground. ML models have provided incredible benefits for businesses of all sizes. They’ve also ground projects to a halt because of unforeseen complexities.
If your business is using ML models in software or planning to do so, don’t take chances. Learn from the successes and failures of ML pioneers. Start building your MLOps strategy before things fall apart.
Remember, you’re not alone. Others have gone through this process before. If you’re still unsure what tools or platforms are right for your MLOps implementation, check out some reviews for guidance:
Even if your business is already using ML models in its software, it’s not too late. Remember the old saying: the best time to plant a tree is ten years ago; the second best time is today.