Amazon CloudWatch

Lesson 37/50 | Study Time: 60 Min

Course: AI DevOps on AWS: Automation, CI/CD and Cloud Engineering

Amazon CloudWatch is the central observability service on AWS. It collects metrics and logs from virtually every AWS service automatically and gives you the tools to visualise, alert on, and analyse that data.

Whether you are monitoring a single Lambda function or an entire production platform, CloudWatch is where that visibility lives.

How CloudWatch Works

CloudWatch sits at the centre of AWS observability. AWS services such as EC2, Lambda, ECS, RDS, and API Gateway publish their standard metrics to CloudWatch automatically, and many of them can also deliver logs there once logging is enabled.

Standard metrics need no setup. Custom metrics and application logs must be sent explicitly, either by your application or by an agent such as the CloudWatch agent.

CloudWatch Metrics

A metric is a time-series of numerical values — CPU utilisation measured every minute, request count measured every second, error rate measured every five minutes.

1. Namespaces: Metrics are organised into namespaces — logical groupings by service. For example:

AWS/EC2 — EC2 instance metrics.

AWS/Lambda — Lambda function metrics.

AWS/RDS — Database metrics.

AWS/ApiGateway — API Gateway metrics.

Custom metrics your application sends live in a namespace you define. For example, MyApp/Orders for business-level metrics like orders per minute.

2. Dimensions: Dimensions are key-value pairs that identify a specific resource within a namespace. For example, the EC2 namespace uses InstanceId as a dimension — so you can view CPU utilisation for a specific instance rather than all instances combined.

3. Standard vs. Detailed Monitoring: The publishing interval depends on the service. EC2 sends metrics every 5 minutes by default, which is called basic (standard) monitoring; enabling detailed monitoring reduces this to 1-minute intervals. Many other services, Lambda among them, already publish at 1-minute intervals. Detailed monitoring costs extra but gives you faster visibility into issues as they develop.

4. Custom Metrics: Your application can send custom metrics to CloudWatch — business metrics, application-level counters, performance measurements that AWS cannot collect automatically. Custom metrics are priced per metric per month, so be deliberate about what you track.
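As a rough illustration of the namespace, dimension, and custom-metric ideas above, the following Python sketch builds one datapoint for the MyApp/Orders namespace and sends it with boto3's put_metric_data call. The metric name OrdersPerMinute, the Environment dimension, and both function names are assumptions made for this example; they do not come from the lesson.

```python
from datetime import datetime, timezone


def build_order_metric(order_count, environment):
    """Build one MetricData entry for the PutMetricData API."""
    return {
        "MetricName": "OrdersPerMinute",  # hypothetical metric name
        # The dimension narrows the metric to one environment.
        "Dimensions": [{"Name": "Environment", "Value": environment}],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(order_count),
        "Unit": "Count",
    }


def publish_order_metric(order_count, environment="production"):
    """Send the datapoint to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works offline

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(
        Namespace="MyApp/Orders",  # custom namespace from the text above
        MetricData=[build_order_metric(order_count, environment)],
    )


# Build (but do not send) a sample datapoint.
datum = build_order_metric(42, "production")
print(datum["MetricName"], datum["Value"], datum["Dimensions"])
```

Each distinct combination of namespace, metric name, and dimension values counts as a separate metric, so adding high-cardinality dimensions (a customer ID, for instance) can multiply the monthly cost.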

CloudWatch Dashboards

A dashboard is a visual display of metrics — graphs, numbers, and charts assembled into a single view. Dashboards give your team a real-time picture of system health without needing to navigate to each service individually.

What to Put on a Dashboard

A well-designed production dashboard shows the metrics that matter most for your system. A typical dashboard for a web application includes:

1. Request count and error rate: is traffic normal and is the error rate acceptable?

2. API latency: are response times within acceptable bounds?

3. EC2 or ECS CPU and memory: are resources under pressure?

4. Lambda invocations, errors, and duration: are functions executing correctly?

5. Database connection count and query latency: is the database healthy?
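Dashboards can also be defined in code rather than assembled by hand in the console. The sketch below builds a dashboard body covering part of the list above (API traffic, errors, latency, and Lambda health) and would submit it with boto3's put_dashboard call. The dashboard name, API name, function name, and region are placeholders assumed for this example.

```python
import json


def metric_widget(title, metrics, x, y, stat="Sum", region="us-east-1"):
    """Return one metric widget in the dashboard body format."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": 12, "height": 6,
        "properties": {
            "title": title,
            "metrics": metrics,  # rows of [namespace, metric, dim name, dim value]
            "stat": stat,
            "period": 300,
            "region": region,  # assumed region
        },
    }


def build_dashboard_body(function_name, api_name):
    """Assemble a small web-application dashboard as a JSON string."""
    widgets = [
        metric_widget(
            "API requests and 5XX errors",
            [["AWS/ApiGateway", "Count", "ApiName", api_name],
             ["AWS/ApiGateway", "5XXError", "ApiName", api_name]],
            x=0, y=0),
        metric_widget(
            "API latency",
            [["AWS/ApiGateway", "Latency", "ApiName", api_name]],
            x=12, y=0, stat="Average"),
        metric_widget(
            "Lambda invocations and errors",
            [["AWS/Lambda", "Invocations", "FunctionName", function_name],
             ["AWS/Lambda", "Errors", "FunctionName", function_name]],
            x=0, y=6),
    ]
    return json.dumps({"widgets": widgets})


def publish_dashboard(name, body):
    """Create or update the dashboard (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the body can be built offline

    boto3.client("cloudwatch").put_dashboard(DashboardName=name, DashboardBody=body)


# Hypothetical resource names, used only to show the resulting structure.
body = build_dashboard_body("checkout-handler", "orders-api")
print(len(json.loads(body)["widgets"]), "widgets")
```

Keeping the body in version control makes it easier to review dashboard changes alongside the infrastructure they describe.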

Dashboard Best Practices

Organise dashboards by audience and purpose. A high-level executive dashboard shows availability and business metrics.

An engineering on-call dashboard shows detailed technical metrics grouped by service. Keep each dashboard focused — too many metrics on one screen makes it harder to spot issues, not easier.

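Dashboards can be shared with people who do not have direct access to your AWS account, either through a public link or with specific users who sign in with a password or through your single sign-on provider.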

For teams using third-party tools like Grafana, CloudWatch is a supported data source — you can build dashboards in Grafana pulling data directly from CloudWatch.

CloudWatch Alarms

An alarm watches a single metric and performs an action when the metric crosses a threshold you define. Alarms are the primary way CloudWatch notifies your team when something needs attention.

Alarm States

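Every alarm is always in one of three states:

1. OK: The metric is within the threshold you defined.

2. ALARM: The metric has breached the threshold.

3. INSUFFICIENT_DATA: The alarm has only just been created, or not enough data is available to determine its state.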
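To make the three states (OK, ALARM, and INSUFFICIENT_DATA) concrete, here is a deliberately simplified model of how a threshold alarm might be evaluated. It assumes the alarm fires only when every one of the most recent datapoints is above the threshold; the real service has further options, such as how missing data is treated, that this sketch ignores. The function name and the sample numbers are invented for illustration.

```python
def evaluate_alarm(datapoints, threshold, evaluation_periods):
    """Return a simplified alarm state for a list of recent datapoints."""
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        # Not enough data yet to make a decision.
        return "INSUFFICIENT_DATA"
    if all(value > threshold for value in recent):
        # Every datapoint in the evaluation window breaches the threshold.
        return "ALARM"
    return "OK"


# CPU utilisation samples against a 90% threshold over 3 periods.
print(evaluate_alarm([10, 95, 97, 99], threshold=90, evaluation_periods=3))
print(evaluate_alarm([95, 50, 99], threshold=90, evaluation_periods=3))
print(evaluate_alarm([95], threshold=90, evaluation_periods=3))
```

Requiring several consecutive breaching periods is a common way to avoid paging on a single short spike.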

Alarm Actions

When an alarm transitions to the ALARM state, it can trigger one or more actions:

1. SNS notification: Send an email, SMS, or trigger a webhook to notify your team.

2. Auto Scaling action: Add or remove EC2 instances in response to load.

3. EC2 action: Stop, terminate, reboot, or recover an EC2 instance.

4. Lambda function: Trigger a Lambda function for custom automated remediation.

5. Systems Manager action: Create an OpsItem in OpsCenter or an incident in Incident Manager, from which automation runbooks can be run to resolve the issue.
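An alarm with an SNS notification action could be defined along these lines. The sketch builds the arguments for boto3's put_metric_alarm call for a Lambda error alarm; the function name, the topic ARN, the threshold of 5 errors, and the three one-minute evaluation periods are all assumptions chosen for the example.

```python
def build_error_alarm(function_name, sns_topic_arn):
    """Build put_metric_alarm arguments for a Lambda error alarm."""
    return {
        "AlarmName": f"{function_name}-errors",
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,              # seconds per evaluation period
        "EvaluationPeriods": 3,    # consecutive periods that must breach
        "Threshold": 5.0,          # assumed threshold: more than 5 errors
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no invocations means no errors
        "AlarmActions": [sns_topic_arn],     # notified on entering ALARM
    }


def create_alarm(alarm_args):
    """Create or update the alarm (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the arguments can be built offline

    boto3.client("cloudwatch").put_metric_alarm(**alarm_args)


# Hypothetical function name and topic ARN.
alarm = build_error_alarm(
    "checkout-handler", "arn:aws:sns:us-east-1:123456789012:on-call"
)
print(alarm["AlarmName"], alarm["AlarmActions"])
```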

Composite Alarms

A composite alarm combines the states of multiple alarms using AND, OR, and NOT logic.

Instead of being paged for every individual alarm, you define a composite alarm that only fires when a combination of conditions is true. For example, alert only when both error rate is high AND latency is elevated. This reduces alert noise significantly.
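The error-rate-and-latency example above can be written as a composite alarm rule. This sketch builds the rule expression and would pass it to boto3's put_composite_alarm call; the two child alarm names are hypothetical and would need to exist already.

```python
def all_in_alarm(*alarm_names):
    """Build a rule that is true only when every named alarm is in ALARM."""
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


def create_composite_alarm(name, rule, sns_topic_arn):
    """Create the composite alarm (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the rule can be built offline

    boto3.client("cloudwatch").put_composite_alarm(
        AlarmName=name,
        AlarmRule=rule,
        AlarmActions=[sns_topic_arn],
    )


# Hypothetical child alarms for error rate and latency.
rule = all_in_alarm("high-error-rate", "high-latency")
print(rule)
```

With this arrangement the child alarms can be left without notification actions of their own, so only the composite alarm pages the on-call engineer.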

Anomaly Detection Alarms

Rather than setting a fixed threshold, you can create an alarm based on CloudWatch Anomaly Detection. The alarm fires when the metric deviates from its learned baseline rather than when it crosses a static number.
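An anomaly detection alarm compares a metric against a band computed by the ANOMALY_DETECTION_BAND metric math function instead of a static threshold. The sketch below builds put_metric_alarm arguments for Lambda duration, under the assumption that a band two standard deviations wide and three five-minute periods are sensible; treat those numbers, the function name, and the alarm name as illustrative only.

```python
def build_anomaly_alarm(function_name, band_width=2):
    """Build put_metric_alarm arguments for an anomaly detection alarm."""
    return {
        "AlarmName": f"{function_name}-duration-anomaly",
        # Fire when the metric rises above the upper edge of the band.
        "ComparisonOperator": "GreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        "ThresholdMetricId": "band",  # the band replaces a static Threshold
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/Lambda",
                        "MetricName": "Duration",
                        "Dimensions": [
                            {"Name": "FunctionName", "Value": function_name}
                        ],
                    },
                    "Period": 300,
                    "Stat": "Average",
                },
            },
            {
                "Id": "band",
                # A wider band tolerates more deviation before alarming.
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": "Expected duration",
                "ReturnData": True,
            },
        ],
    }


# Hypothetical function name; pass the result to put_metric_alarm(**args).
args = build_anomaly_alarm("checkout-handler")
print(args["Metrics"][1]["Expression"])
```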

CloudWatch Log Groups

A log group is a container for log streams that share the same retention policy and access controls. Every application or service that sends logs to CloudWatch gets its own log group.

Log Streams

Within a log group, each log stream represents a single source: one Lambda execution environment, a specific EC2 instance, a specific container. Log streams are normally created automatically by the service or agent that writes the logs.

Retention Policies

By default, CloudWatch logs are kept indefinitely — which gets expensive. Always set a retention policy on every log group. Common retention periods:

1. Development environments — 7 to 14 days.

2. Production application logs — 30 to 90 days.

3. Security and compliance logs — 1 to 7 years depending on requirements.
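One way to enforce the retention periods above is a small script that maps each environment to a number of days and applies it with boto3's put_retention_policy call. The mapping below picks one value from each suggested range, and the log group name is a placeholder; the API accepts only certain day counts, and the three used here are believed to be among them.

```python
# Assumed mapping: one value chosen from each range in the list above.
RETENTION_DAYS = {
    "development": 14,
    "production": 90,
    "compliance": 2557,  # roughly 7 years
}


def retention_for(environment):
    """Look up the retention period, rejecting unknown environments."""
    try:
        return RETENTION_DAYS[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment!r}") from None


def apply_retention(log_group_name, environment):
    """Set the retention policy (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the lookup works offline

    boto3.client("logs").put_retention_policy(
        logGroupName=log_group_name,
        retentionInDays=retention_for(environment),
    )


# Hypothetical log group; call apply_retention(...) to make the change.
print("/aws/lambda/checkout-handler", retention_for("production"), "days")
```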

CloudWatch Logs Insights

Logs Insights is a query engine for your log data. Instead of scrolling through raw log lines, you write queries to filter, aggregate, and analyse logs quickly. It is essential for debugging production issues at scale.

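A simple query to find errors in the last hour filters for log messages that contain an error marker, sorts them by timestamp in descending order, and limits the output to 50 rows.

This returns the 50 most recent error messages sorted by time. What would take minutes of manual scrolling takes seconds with Logs Insights.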
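The query text itself is not present in this copy of the lesson, so the following is a reconstruction from its description: filter for messages containing ERROR, sort newest first, and keep 50 rows. The Python sketch wraps that query in boto3's start_query and get_query_results calls; the log group name and the literal string ERROR are assumptions.

```python
import time

# Reconstructed Logs Insights query (not the lesson's original text).
ERROR_QUERY = """
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 50
""".strip()


def build_query_params(log_group, now=None, window_seconds=3600):
    """Build start_query arguments covering the last hour by default."""
    end = int(now if now is not None else time.time())
    return {
        "logGroupName": log_group,
        "startTime": end - window_seconds,  # epoch seconds
        "endTime": end,
        "queryString": ERROR_QUERY,
    }


def run_query(log_group):
    """Run the query and poll for results (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the parameters can be built offline

    logs = boto3.client("logs")
    query_id = logs.start_query(**build_query_params(log_group))["queryId"]
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            return response["results"]
        time.sleep(1)  # queries run asynchronously, so wait and retry


# Hypothetical log group and a fixed clock value for a repeatable example.
params = build_query_params("/aws/lambda/checkout-handler", now=1_700_000_000)
print(params["startTime"], params["endTime"])
```

Logs Insights charges by the amount of data scanned, so a narrow time range and a specific log group keep queries both fast and cheap.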


Drew Collins

Product Designer
