Keep this page open while drilling. SOA-C03 rewards structured operations thinking: signal -> diagnosis -> low-risk remediation -> verification.
Quick facts (SOA-C03)
| Item | Value |
|---|---|
| Questions | 65 total |
| Scoring | 50 scored + 15 unscored (unscored items are not identified) |
| Question types | Multiple choice, multiple response |
| Time | 130 minutes |
| Passing score | 720 (scaled 100-1000) |
| Cost | 150 USD |
| Domains | D1 22% - D2 22% - D3 22% - D4 16% - D5 18% |
Fast strategy
- Start from the constraint in the last sentence (availability, compliance, latency, cost, operational effort).
- Prefer the smallest safe operational change that addresses root cause.
- For noisy incidents, choose approaches that improve signal quality first (better alarms, filtering, dashboards).
- For repeated incidents, prefer automation (EventBridge + Lambda/SSM runbooks).
Final 20-minute recall (exam day)
Cue -> best answer (pattern map)
| If the question says… | Usually best answer |
|---|---|
| Alarm fatigue / noisy incidents | Composite alarms + tuned thresholds + actionable routing |
| Repeatable remediation needed | EventBridge -> SSM Automation/Lambda runbook |
| Patch governance at scale | Systems Manager Patch Manager |
| Configuration drift detection | AWS Config rules + automatic remediation |
| Need secure shell-less instance access | Systems Manager Session Manager |
| Stack update failed | CloudFormation events + change sets + rollback analysis |
| Backup policy across accounts | AWS Backup plans/policies |
| Access denial investigation | IAM policy + resource policy + KMS key policy evaluation |
| Network reachability issue | Route table -> SG -> NACL -> endpoint/NAT path validation |
| Incident postmortem prevention | Runbook updates + alarm improvements + automation |
Must-memorize SOA defaults
| Topic | Fast recall |
|---|---|
| Core observability stack | CloudWatch metrics/logs, CloudTrail, X-Ray (where relevant) |
| RTO/RPO | Recovery time and data loss objectives drive backup/DR choice |
| Safe remediation order | Detect -> triage -> fix low blast radius -> verify -> automate |
| Operational preference | Managed services + automation over manual repetitive operations |
Last-minute traps
- Acting on one symptom metric without correlation to logs/traces/deploy timeline.
- Running high-risk remediation before confirming blast radius.
- Treating backups as compliant without restore testing.
- Alerting on every metric spike instead of SLO-aligned sustained conditions.
1) CloudOps incident loop
Detect signal -> Triage severity -> Identify probable root cause -> Apply low-risk remediation -> Validate recovery -> Document + automate prevention
Use this loop in scenario questions. Wrong answers often skip validation or choose high-blast-radius changes.
2) Monitoring and logging defaults (Domain 1)
Choose the right telemetry
| Need | Best AWS signal |
|---|---|
| Resource/service health trends | CloudWatch metrics |
| Application/system event detail | CloudWatch Logs |
| API-level audit trail | CloudTrail |
| Network allow/deny and flow diagnosis | VPC Flow Logs |
Alarm design defaults
- Use alarm thresholds tied to SLO/error budgets where possible.
- Use composite alarms to reduce alert noise.
- Route alerts to SNS or EventBridge for automation paths.
- Include runbook links in alarm descriptions for faster response.
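The alarm defaults above can be sketched in plain Python. This is a minimal model, not the CloudWatch implementation: `breaching` mimics an "M out of N datapoints" evaluation (the parameter names mirror CloudWatch's `DatapointsToAlarm` and `EvaluationPeriods`), and `composite_alarm` stands in for an AND rule over two symptom alarms. The sample values and the 5.0 threshold are hypothetical.

```python
from typing import Sequence


def breaching(datapoints: Sequence[float], threshold: float,
              datapoints_to_alarm: int, evaluation_periods: int) -> bool:
    """True when at least `datapoints_to_alarm` of the most recent
    `evaluation_periods` datapoints exceed `threshold`."""
    window = list(datapoints)[-evaluation_periods:]
    if len(window) < evaluation_periods:
        return False  # assumption: too little data is treated as "not in alarm"
    return sum(1 for value in window if value > threshold) >= datapoints_to_alarm


def composite_alarm(error_rate_alarm: bool, latency_alarm: bool) -> bool:
    """Page only when both symptom alarms fire together (an AND rule)."""
    return error_rate_alarm and latency_alarm


# Hypothetical one-minute error-rate samples (percent).
spike = [0.2, 0.3, 9.0, 0.2, 0.3]       # one isolated spike
sustained = [0.2, 6.0, 7.5, 8.0, 0.4]   # three breaching datapoints in the window

spike_alarm = breaching(spike, threshold=5.0,
                        datapoints_to_alarm=3, evaluation_periods=5)
sustained_alarm = breaching(sustained, threshold=5.0,
                            datapoints_to_alarm=3, evaluation_periods=5)
```

With a 3-of-5 rule the isolated spike does not alarm while the sustained breach does, which is the behavior the "sustained conditions" guidance is after.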
Common D1 pitfalls
- Alarming on raw spikes without sustained evaluation windows.
- No distinction between symptom metrics and cause metrics.
- Missing CloudWatch agent configuration on EC2/ECS/EKS, so the expected metrics and logs never arrive.
- Automation triggers without guardrails/permissions checks.
3) Reliability and business continuity (Domain 2)
HA and scaling picks
| Requirement | Typical choice |
|---|---|
| Multi-instance failover + balancing | ELB + Auto Scaling |
| Regional DNS failover patterns | Route 53 health checks + routing policy |
| Managed DB high availability | Multi-AZ for RDS/Aurora |
| Burst read/load reduction | CloudFront or ElastiCache |
Backup/restore language you must apply correctly
- RPO: acceptable data loss window.
- RTO: acceptable restoration time.
If a question emphasizes strict RPO/RTO, prioritize the restore method and backup frequency that explicitly satisfy those targets.
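A quick way to reason about these targets: worst-case data loss is roughly the interval between backups, and recovery time is whatever a restore test actually measured. The sketch below applies that rule of thumb; the plan names and all numbers are hypothetical, and `BackupPlan` is an illustrative type, not an AWS Backup object.

```python
from dataclasses import dataclass


@dataclass
class BackupPlan:
    name: str
    backup_interval_minutes: int   # worst-case data loss if failure hits just before the next backup
    restore_duration_minutes: int  # assumed to come from a real restore test


def meets_targets(plan: BackupPlan, rpo_minutes: int, rto_minutes: int) -> bool:
    """A plan qualifies only if it satisfies BOTH the RPO and the RTO."""
    return (plan.backup_interval_minutes <= rpo_minutes
            and plan.restore_duration_minutes <= rto_minutes)


# Hypothetical candidate plans for a workload with RPO = 60 min and RTO = 60 min.
plans = [
    BackupPlan("daily-snapshot", 24 * 60, 90),
    BackupPlan("hourly-snapshot", 60, 90),
    BackupPlan("continuous-pitr", 5, 45),
]
qualifying = [p.name for p in plans
              if meets_targets(p, rpo_minutes=60, rto_minutes=60)]
```

Note that the hourly plan meets the RPO but fails the RTO, a common distractor shape: an answer that satisfies only one of the two targets is still wrong.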
Reliability anti-patterns
- Single-AZ for critical stateful production workloads.
- Backups with no restore test evidence.
- Scaling policies with no cooldown/health alignment.
4) Deployment, provisioning, and automation (Domain 3)
Core service map
| Need | Typical AWS answer |
|---|---|
| Declarative infrastructure | CloudFormation (or CDK) |
| Fleet ops and runbooks | Systems Manager |
| Event-driven operational actions | EventBridge + Lambda/SSM |
| Multi-account/region deployment sharing | StackSets / AWS RAM |
Stack failure checklist
- Validate IAM permissions for stack actions.
- Check resource dependency/order failures.
- Confirm subnet CIDR sizing and limits.
- Review event log for first failing resource (not only terminal error).
Automation rule of thumb
Automate repetitive, deterministic operations first: patching, restart/remediation runbooks, compliance drift checks, and standard incident responses.
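As an illustration of event-driven remediation, the sketch below pairs an EventBridge-style event pattern with a deliberately simplified matcher and a dispatcher. The pattern shape (lists of accepted values, nested `detail`) follows EventBridge conventions, but `matches` handles exact values only, and the runbook name is hypothetical.

```python
from typing import Any, Dict

# Event pattern in the EventBridge shape: each leaf is a list of accepted values.
PATTERN: Dict[str, Any] = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Simplified matcher: exact-value lists and nested objects only
    (no prefix, numeric, or anything-but operators)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


def choose_runbook(event: Dict[str, Any]) -> str:
    """Return a hypothetical runbook name, or 'ignore' for out-of-scope events."""
    return "restart-instance-runbook" if matches(PATTERN, event) else "ignore"


stopped_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}
running_event = {**stopped_event,
                 "detail": {"instance-id": "i-0123456789abcdef0", "state": "running"}}
```

Keeping the pattern narrow (one source, one detail-type, one state) is the guardrail: the automation only fires for the deterministic case it was written for.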
5) Security and compliance operations (Domain 4)
High-yield controls
| Control goal | Typical services |
|---|---|
| Identity and least privilege | IAM, IAM Access Analyzer |
| Auditability | CloudTrail, AWS Config |
| Secrets and key management | Secrets Manager, KMS |
| Findings detection and aggregation | GuardDuty and Inspector (detect), Security Hub (aggregate) |
| Encryption in transit | ACM/TLS |
Common exam patterns
- Access denied: check identity policy, resource policy, and KMS key policy.
- Compliance drift: Config rule failure -> remediation workflow.
- Multi-account controls: Organizations/SCP boundaries and delegated operations.
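For access-denied questions, the precedence rule worth internalizing is: an explicit Deny in any applicable policy wins, otherwise an Allow is needed, otherwise the request is implicitly denied. The sketch below models only that precedence with exact-match statements. It is not the full IAM evaluation logic: SCPs, permissions boundaries, session policies, cross-account rules, and KMS key policy specifics are left out, and the ARN is a made-up example.

```python
from typing import Iterable, List, Tuple

# Each statement is (effect, action, resource); matching is exact for simplicity.
Statement = Tuple[str, str, str]


def evaluate(action: str, resource: str,
             policies: Iterable[List[Statement]]) -> str:
    """Explicit Deny wins, then any Allow, otherwise implicit deny."""
    allowed = False
    for policy in policies:
        for effect, stmt_action, stmt_resource in policy:
            if stmt_action != action or stmt_resource != resource:
                continue  # statement does not apply to this request
            if effect == "Deny":
                return "explicit deny"
            if effect == "Allow":
                allowed = True
    return "allow" if allowed else "implicit deny"


OBJECT_ARN = "arn:aws:s3:::example-bucket/report.csv"  # hypothetical resource
identity_policy = [("Allow", "s3:GetObject", OBJECT_ARN)]
bucket_policy = [("Deny", "s3:GetObject", OBJECT_ARN)]

with_deny = evaluate("s3:GetObject", OBJECT_ARN, [identity_policy, bucket_policy])
identity_only = evaluate("s3:GetObject", OBJECT_ARN, [identity_policy])
no_policy = evaluate("s3:GetObject", OBJECT_ARN, [[]])
```

This is why the checklist says to read every policy type on the path: an Allow in the identity policy proves nothing until the resource policy and key policy have been checked for a Deny or a missing grant.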
6) Networking and content delivery troubleshooting (Domain 5)
VPC troubleshooting order
- Route tables
- Security groups (stateful)
- NACLs (stateless)
- Gateway/path (IGW, NAT, TGW, endpoints)
- DNS resolution (Route 53 / Resolver)
Network/data path service picks
| Need | Typical AWS answer |
|---|---|
| Private access to AWS services | VPC endpoints / PrivateLink |
| CDN and edge caching | CloudFront |
| Global traffic acceleration | Global Accelerator |
| Hybrid/private connectivity | Site-to-Site VPN / Transit Gateway |
Frequent D5 anti-patterns
- Allowing traffic in the security group but blocking ephemeral-port return traffic in the NACL.
- Assuming NAT provides inbound access.
- CloudFront cache issue treated as origin outage.
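The first anti-pattern is easier to spot with a small model of stateless NACL evaluation. The sketch below assumes the security group already allows the inbound request (so its stateful return path is not the problem) and ignores protocol and CIDR matching; `NaclRule` and the rule sets are illustrative only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NaclRule:
    rule_number: int
    port_from: int
    port_to: int
    allow: bool


def nacl_allows(rules: List[NaclRule], port: int) -> bool:
    """Rules are checked in ascending rule-number order; the first match wins,
    and no match means deny (the implicit final rule)."""
    for rule in sorted(rules, key=lambda r: r.rule_number):
        if rule.port_from <= port <= rule.port_to:
            return rule.allow
    return False


def request_succeeds(inbound: List[NaclRule], outbound: List[NaclRule],
                     server_port: int, client_ephemeral_port: int) -> bool:
    """Stateless check: the request must pass inbound AND the reply must pass outbound."""
    return (nacl_allows(inbound, server_port)
            and nacl_allows(outbound, client_ephemeral_port))


inbound = [NaclRule(100, 443, 443, True)]
outbound_broken = [NaclRule(100, 443, 443, True)]    # forgets ephemeral ports
outbound_fixed = [NaclRule(100, 1024, 65535, True)]  # lets replies reach clients

broken = request_succeeds(inbound, outbound_broken, 443, 50000)
fixed = request_succeeds(inbound, outbound_fixed, 443, 50000)
```

The broken rule set looks symmetric and reasonable, yet every HTTPS reply is dropped because the client's source port sits in the ephemeral range.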
7) Troubleshooting playbooks you can reuse
5xx spike behind load balancer
- Check target health first.
- Correlate LB access logs + target logs + alarm timeline.
- Validate autoscaling events and recent config/deploy changes.
Alarm noise flood
- Replace independent symptom alarms with composite alarm logic.
- Tune thresholds/evaluation periods from observed baseline.
- Route only actionable alerts to incident channels.
Intermittent connectivity failure
- Validate the route and SG path in both directions.
- Inspect NACL rules for stateless return traffic blocks.
- Use Reachability Analyzer and VPC Flow Logs for confirmation.
8) Cost-aware operations quick wins
- Delete idle, unattached EBS volumes and expire stale snapshots under a retention policy.
- Use lifecycle policies for S3/EFS where access patterns allow.
- Reduce NAT egress by using VPC endpoints where applicable.
- Right-size compute using utilization and recommendation signals.
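The first quick win can be sketched as a filter over volume records. The dictionaries below are hand-written and modeled on the fields EC2 reports for volumes (`VolumeId`, `State`, `Attachments`, `CreateTime`); the IDs, dates, and the 30-day cutoff are hypothetical, and age since creation is only a rough proxy for idleness.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def idle_unattached_volumes(volumes: List[Dict], now: datetime,
                            min_age_days: int) -> List[str]:
    """Return IDs of volumes that are 'available', have no attachments,
    and were created more than `min_age_days` ago."""
    cutoff = now - timedelta(days=min_age_days)
    return [
        v["VolumeId"]
        for v in volumes
        if v["State"] == "available"
        and not v["Attachments"]
        and v["CreateTime"] < cutoff
    ]


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
volumes = [
    {"VolumeId": "vol-aaa", "State": "available", "Attachments": [],
     "CreateTime": datetime(2024, 11, 1, tzinfo=timezone.utc)},
    {"VolumeId": "vol-bbb", "State": "in-use",
     "Attachments": [{"InstanceId": "i-123"}],
     "CreateTime": datetime(2024, 11, 1, tzinfo=timezone.utc)},
    {"VolumeId": "vol-ccc", "State": "available", "Attachments": [],
     "CreateTime": datetime(2025, 1, 25, tzinfo=timezone.utc)},
]
candidates = idle_unattached_volumes(volumes, now, min_age_days=30)
```

In practice the output would feed a review or snapshot-then-delete step rather than an immediate delete, in line with the low-blast-radius preference used throughout this page.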
Next steps
- Use Resources to stay aligned to the official exam guide and AWS operations docs.
- Use the FAQ when you need a quick reset on exam scope and candidate expectations.
- Keep this page open while you drill remediation, backup, and network-troubleshooting scenarios.