
Introduction to MLOps

Lesson 31/50 | Study Time: 30 Min

Building a machine learning model is only a small part of the challenge.

Getting that model reliably into production, keeping it performing well over time, and updating it when it degrades: that is where most ML projects fail.

MLOps applies the same principles that made DevOps successful (automation, version control, continuous delivery, and monitoring) to the machine learning lifecycle.

The Problem MLOps Solves

A data scientist builds a model that performs well in a notebook. Then the challenges begin:


1. How does the model get deployed to production reliably?

2. What happens when the data it was trained on changes over time and the model's accuracy drops?

3. How do you retrain and redeploy without breaking the live system?

4. How do you track which version of the model is running in production?

5. How do you reproduce a model from six months ago?


These are operational problems, not scientific ones. MLOps exists to solve them by treating models, data, and training pipelines with the same engineering discipline applied to application code in DevOps.
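Questions 4 and 5 come down to lineage: tying each model version to the exact code and data that produced it. A minimal sketch of such a registry entry, with illustrative field names and values (not a real registry schema):

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass
class ModelRecord:
    """One registry entry: enough metadata to reproduce the model later."""
    model_version: str
    code_commit: str   # Git SHA of the training code (hypothetical value below)
    data_hash: str     # content fingerprint of the exact training data
    params: dict       # hyperparameters used for this run
    metrics: dict      # evaluation results for this run

def fingerprint_data(raw: bytes) -> str:
    """Version the data by its content, not by its filename."""
    return hashlib.sha256(raw).hexdigest()

record = ModelRecord(
    model_version="v1.4.0",
    code_commit="9f2c1ab",
    data_hash=fingerprint_data(b"...training data bytes..."),
    params={"max_depth": 6, "eta": 0.3},
    metrics={"auc": 0.91},
)
print(json.dumps(asdict(record), indent=2))
```

Given a record like this, reproducing a six-month-old model means checking out the recorded commit and retraining on the dataset whose fingerprint matches.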

What is MLOps?

MLOps (Machine Learning Operations) is the practice of applying DevOps principles to machine learning systems. It covers the full lifecycle of an ML model: data ingestion, training, evaluation, deployment, monitoring, and retraining.


Just as DevOps makes software delivery continuous and automated, MLOps makes the ML lifecycle continuous and automated — from new data arriving to a retrained model being deployed without manual intervention.
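That loop can be sketched with stand-in functions for each stage; the data, the "model", and the quality gate below are toy placeholders, not a real training pipeline.

```python
def ingest():
    # New data arrives; here, a toy dataset where y = 2x.
    return [(x, 2 * x) for x in range(1, 100)]

def train(data):
    # Fit a trivial one-parameter model: slope = mean of y/x.
    slope = sum(y / x for x, y in data) / len(data)
    return lambda x: slope * x

def evaluate(model, data):
    # Mean absolute error on the evaluation data.
    return sum(abs(model(x) - y) for x, y in data) / len(data)

def deploy(model):
    print("deployed new model")

# The MLOps loop: each new batch of data flows through every stage
# without manual intervention.
data = ingest()
candidate = train(data)
error = evaluate(candidate, data)
if error < 0.1:   # quality gate before deployment
    deploy(candidate)
```

In a real system each stand-in becomes a pipeline step (a SageMaker processing or training job, for example), but the shape of the loop is the same.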

How MLOps Differs from Standard DevOps

MLOps adds three dimensions that standard DevOps does not need to handle:


1. Data versioning: In software, the code is the artifact. In ML, the data is equally important. The same code trained on different data produces a different model. MLOps requires versioning both the code and the data used to train each model.


2. Model versioning: Each trained model is an artifact that needs to be stored, tracked, and compared. A model registry stores every version of every model with its performance metrics, training parameters, and the data it was trained on.


3. Model drift: Software does not degrade on its own. ML models do. As the real-world data they encounter changes over time — user behaviour shifts, market conditions change, seasonal patterns evolve — model accuracy degrades. This is called model drift, and it requires continuous monitoring and periodic retraining.
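A minimal sketch of drift monitoring: compare a live window of one input feature against the distribution seen at training time, and flag retraining when the shift exceeds a threshold. The statistic and threshold here are illustrative assumptions; production systems typically use richer tests over many features.

```python
import statistics

def drift_score(reference, live):
    """Shift of the live mean, in units of the reference standard deviation."""
    ref_mean = statistics.mean(reference)
    ref_std = statistics.stdev(reference)
    return abs(statistics.mean(live) - ref_mean) / ref_std

# Feature values observed at training time vs. two live windows.
reference  = [10.0, 11.0, 9.5, 10.5, 10.0, 11.5, 9.0, 10.0]
live_ok    = [10.2, 9.8, 10.4, 10.1]
live_drift = [14.0, 15.2, 14.8, 15.5]

THRESHOLD = 2.0   # retrain when the live mean moves beyond 2 std devs
print(drift_score(reference, live_ok))     # small: no action
print(drift_score(reference, live_drift))  # large: trigger retraining
```

On AWS, this kind of check is what a scheduled monitoring job runs against captured endpoint traffic; the retraining trigger then kicks off the training pipeline automatically.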

The Three Levels of MLOps Maturity

Not every organisation starts at the same level. MLOps maturity is typically described in three levels:


Level 0 — Manual: Data scientists train models manually in notebooks. Deployment is a manual, one-time process. No automation, no monitoring, no reproducibility. Most organisations start here.


Level 1 — Automated Training: The training pipeline is automated. When new data arrives, the model retrains automatically. Deployment is still partially manual. Model performance is monitored.


Level 2 — Full CI/CD for ML: The entire pipeline is automated — data ingestion, training, evaluation, deployment, and monitoring. A new model is only deployed if it outperforms the current production model. Retraining triggers happen automatically when drift is detected.


The goal is to reach Level 2. AWS SageMaker provides the tools to get there.
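The Level 2 rule — a new model is only deployed if it outperforms the current production model — can be sketched as a simple champion/challenger gate. The metric name and minimum-gain margin are illustrative assumptions:

```python
def should_promote(challenger, champion, min_gain=0.01):
    """Level 2 gate: deploy the retrained model only if it beats production
    by at least min_gain on the chosen evaluation metric."""
    return challenger["auc"] >= champion["auc"] + min_gain

champion_metrics   = {"auc": 0.90}   # current production model
challenger_metrics = {"auc": 0.93}   # freshly retrained candidate

if should_promote(challenger_metrics, champion_metrics):
    print("promote challenger to production")
else:
    print("keep champion; discard challenger")
```

The margin guards against promoting a model whose improvement is within evaluation noise; a tie or a marginal gain keeps the champion in place.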

MLOps on AWS — The Key Services

SageMaker is the core platform. Everything else connects to and supports it.

Where MLOps Fits in a DevOps Pipeline

MLOps does not replace your existing DevOps pipeline — it extends it. A mature ML system has two parallel pipelines running alongside each other:


1. Application pipeline: The standard CI/CD pipeline covered in Module 03. Deploys the application code that calls the ML model endpoint.

2. ML pipeline: The MLOps pipeline. Retrains the model when new data arrives, evaluates it against the current production model, and deploys the new version if it performs better.


Both pipelines are automated, version-controlled, and monitored. Together they ensure both the application and the model stay current, healthy, and performant.
