Getting code to work in development is very different from running it reliably in production. Production systems handle real users, real data, and real consequences when things go wrong.
Before any system goes live, a structured review across every dimension — security, reliability, performance, cost, and operations — is essential.
What Production Readiness Means
A production-ready system is not just one that works. It is one that:
1. Handles failures gracefully without losing data or user trust.
2. Recovers automatically without requiring manual intervention at 3am.
3. Is observable enough that problems are detected before users report them.
4. Is secure enough that sensitive data is protected and access is controlled.
5. Is cost-efficient enough that it does not waste money on idle or oversized resources.
6. Is documented well enough that any team member — not just the person who built it — can operate and troubleshoot it.
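The first two points — graceful failure handling and automatic recovery — usually start with retry logic around calls to flaky dependencies. A minimal sketch of exponential backoff with full jitter; `retry_with_backoff` and `flaky` are illustrative names, not a specific SDK API:

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry a zero-argument callable with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Full jitter: sleep a random fraction of the capped exponential delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))

# Simulate a dependency that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(retry_with_backoff(flaky))  # "ok" after two retried failures
```

Jitter matters here: without it, many clients retrying in lockstep can hammer a recovering dependency at the same instant.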
Production Readiness Checklist
Work through each area systematically before declaring a system production-ready.
Infrastructure and Architecture
1. Infrastructure is fully defined in IaC — Terraform, CloudFormation, or CDK. No manually created resources in production.
2. Resources are deployed across at least two Availability Zones.
3. No single points of failure exist in the architecture.
4. Auto Scaling is configured with appropriate minimum, maximum, and desired capacity values.
5. An Application Load Balancer distributes traffic across healthy targets with health checks configured.
6. All infrastructure has been reviewed against the AWS Well-Architected Framework.
7. Resources are tagged consistently with environment, team, and application tags.
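The tagging check in item 7 is easy to automate. A minimal sketch that flags resources missing required tag keys, assuming an `environment`/`team`/`application` standard; in a real audit the tag dict would come from an AWS describe/list API:

```python
REQUIRED_TAGS = {"environment", "team", "application"}  # assumed tagging standard

def missing_tags(resource_tags):
    """Return the required tag keys absent (or empty) on a resource."""
    present = {k for k, v in resource_tags.items() if v}  # ignore empty values
    return sorted(REQUIRED_TAGS - present)

print(missing_tags({"environment": "prod", "team": "platform"}))
# ['application']
```

Run as a scheduled job, a check like this catches untagged resources before they become unattributable line items on the bill.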
Security
1. No secrets, credentials, or API keys exist in source code, environment files, or container images.
2. All secrets are stored in AWS Secrets Manager or Parameter Store and retrieved at runtime.
3. IAM roles follow least-privilege — every role has only the permissions it specifically needs.
4. MFA is enabled for all IAM users with console access.
5. The root account has MFA enabled and is not used for daily operations.
6. Security groups follow least-privilege — no ports open to 0.0.0.0/0 that are not required.
7. All data at rest is encrypted using KMS or service-level encryption.
8. All data in transit uses TLS.
9. GuardDuty is enabled in every account and region.
10. AWS Config rules are active and compliance violations have been remediated.
11. Security Hub is enabled and critical findings have been resolved.
12. Container images have been scanned — no critical CVEs in production images.
13. SAST and SCA scans pass in the CI/CD pipeline.
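The least-privilege checks (items 3 and 6) can be partially linted in CI. A sketch that flags `Allow` statements with a bare wildcard action or resource — a local heuristic that complements, not replaces, IAM Access Analyzer:

```python
import json

def broad_statements(policy_json):
    """Return the Sids of Allow statements whose Action or Resource is '*'."""
    policy = json.loads(policy_json)
    findings = []
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        # Both fields may be a single string or a list in IAM policy JSON.
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        if "*" in actions or "*" in resources:
            findings.append(stmt.get("Sid", "<no Sid>"))
    return findings

policy = """{
  "Version": "2012-10-17",
  "Statement": [
    {"Sid": "ReadBucket", "Effect": "Allow",
     "Action": "s3:GetObject", "Resource": "arn:aws:s3:::my-bucket/*"},
    {"Sid": "TooBroad", "Effect": "Allow", "Action": "*", "Resource": "*"}
  ]
}"""
print(broad_statements(policy))  # ['TooBroad']
```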
CI/CD Pipeline
Observability
1. CloudWatch dashboards are set up for key metrics — error rates, latency, CPU, memory, and business metrics.
2. CloudWatch Alarms are configured for critical thresholds — error rate, latency, availability.
3. Alarms route to SNS and notify the right people through the right channels.
4. Composite alarms reduce noise — alerts only fire when meaningful combinations of conditions are true.
5. Log groups have retention policies set — logs are not stored indefinitely.
6. Structured JSON logging is in place — logs are searchable and filterable.
7. AWS X-Ray tracing is enabled for Lambda functions and API Gateway.
8. A runbook exists for every critical alarm — on-call engineers know what to do when it fires.
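Structured JSON logging (item 6) is the difference between grepping free text and filtering on fields in CloudWatch Logs Insights. A minimal sketch using the standard library; the `ctx` field name is an assumption of this example, not a logging-module convention:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so log queries can filter on fields."""
    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge structured context passed via `extra={"ctx": {...}}`.
        entry.update(getattr(record, "ctx", {}))
        return json.dumps(entry)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order placed", extra={"ctx": {"order_id": "o-123", "latency_ms": 42}})
```

Each line is now a queryable document — `latency_ms > 1000` becomes a filter expression rather than a regex.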
Reliability and Recovery
1. RDS Multi-AZ is enabled for all relational databases, and DynamoDB global tables are configured where multi-region resilience is required.
2. Automated backups are configured for all stateful resources — RDS, DynamoDB, EBS.
3. Backups have been tested — a successful restore from backup has been verified.
4. A disaster recovery strategy has been defined — RTO and RPO are agreed with stakeholders.
5. Route 53 health checks and failover routing are configured for multi-region workloads.
6. Connection draining (the deregistration delay on ALB target groups) is configured — in-flight requests complete before instance termination.
7. The system has been tested under failure conditions — instance termination, AZ failure, database failover.
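Items 3 and 4 can be combined into a scheduled check: is the newest backup still within the agreed RPO? A sketch with supplied timestamps; in a real check they would come from an API such as `DescribeDBSnapshots`:

```python
from datetime import datetime, timedelta, timezone

def rpo_violated(snapshot_times, rpo, now=None):
    """Return True if the newest snapshot is older than the agreed RPO."""
    now = now or datetime.now(timezone.utc)
    if not snapshot_times:
        return True  # no backups at all is always a violation
    newest = max(snapshot_times)
    return now - newest > rpo

now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
snapshots = [now - timedelta(hours=30), now - timedelta(hours=6)]
print(rpo_violated(snapshots, rpo=timedelta(hours=24), now=now))  # False
```

Note that this only proves a backup exists — item 3 still requires actually restoring one to prove it is usable.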
Cost
1. AWS Budgets are configured with alerts at 50%, 80%, and 100% of expected monthly spend.
2. Cost Explorer has been reviewed — no unexpected cost categories or unexplained spikes.
3. Dev and staging environments use smaller, cheaper instance types than production.
4. Spot Instances or Savings Plans are used where appropriate for predictable workloads.
5. No idle resources exist — no running instances or NAT Gateways in unused environments.
6. Auto Scaling ensures capacity scales down during low-traffic periods.
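The budget alerts in item 1 reduce to a simple threshold comparison. A sketch of the logic AWS Budgets applies at the 50%/80%/100% marks, useful for a pre-flight sanity check with hypothetical figures:

```python
def breached_thresholds(expected_monthly, actual_spend, thresholds=(0.5, 0.8, 1.0)):
    """Return the alert thresholds the current spend has crossed."""
    return [t for t in thresholds if actual_spend >= expected_monthly * t]

print(breached_thresholds(expected_monthly=1000.0, actual_spend=850.0))
# [0.5, 0.8]
```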
Documentation and Operational Readiness
1. A system architecture diagram exists and is up to date.
2. A README covers how to deploy, how to roll back, and how to run the system locally.
3. Runbooks exist for all common operational scenarios — deployment, rollback, incident response, DR failover.
4. On-call responsibilities are clearly defined — who is paged for what.
5. A post-incident review process is defined and the team knows how to conduct a blameless post-mortem.
6. The team has practised at least one DR scenario — tabletop exercise or actual failover test.
The Day-One vs. Day-Two Mindset
Production readiness is not a one-time gate. It is a mindset.
Day One is getting the system to production correctly — infrastructure as code, CI/CD pipeline, security controls, monitoring, and DR strategy in place from the start.
Day Two is everything that happens after — keeping the system healthy, improving reliability, tightening security, optimising costs, and evolving the architecture as requirements change.
The checklists above are your Day One foundation. Day Two never ends.