Rules for AWS infrastructure incident response and operations.
Infrastructure Debugging Order
When an incident occurs, check in the following order (RDS is the most common cause):
1. Check RDS status (storage-full is the most frequent)
2. Check EC2/ASG instance status
3. Check Target Group health checks
4. Check ALB access logs
5. Check CodeDeploy deployment status
6. Check CloudWatch logs
Quick Status Check Commands
# RDS status
aws rds describe-db-instances \
  --query 'DBInstances[*].{ID:DBInstanceIdentifier,Status:DBInstanceStatus,Storage:AllocatedStorage}' \
  --output table
# ASG instance status
aws autoscaling describe-auto-scaling-groups \
  --query 'AutoScalingGroups[?contains(AutoScalingGroupName,`kobic`)].{Name:AutoScalingGroupName,Desired:DesiredCapacity,Instances:Instances[*].{ID:InstanceId,Health:HealthStatus}}' \
  --output json
# Target Group health
aws elbv2 describe-target-health \
  --target-group-arn <target-group-arn> \
  --output table
# Recent deployment status
aws deploy list-deployments \
  --application-name kobic-app \
  --deployment-group-name kobic-<env>-dg \
  --max-items 3 \
  --query 'deployments'
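When a target group holds many targets, narrowing the health output to the non-healthy ones can make the Target Group check quicker to read. The following is a minimal sketch, assuming a POSIX shell with `awk`; `unhealthy_targets` is a hypothetical helper name, not something from this runbook, and `<target-group-arn>` remains a placeholder.

```shell
# Hypothetical helper: reads "<target-id> <state>" pairs (one per line) on stdin
# and prints only the targets whose state is not "healthy".
unhealthy_targets() {
  awk '$2 != "healthy" { print $1 ": " $2 }'
}

# Example wiring (not executed here): text output gives one target per line.
# aws elbv2 describe-target-health \
#   --target-group-arn <target-group-arn> \
#   --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State]' \
#   --output text | unhealthy_targets
```

An empty result means every registered target currently reports `healthy`.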
RDS Storage-Full Response
Symptoms
- Server 502/503 errors (app shows "server maintenance")
- RDS instance status: `storage-full`
- CloudWatch `FreeStorageSpace` metric reaches 0
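The `FreeStorageSpace` symptom can be confirmed from the command line. This is a sketch under a few assumptions: GNU `date` is available (for the `-d '1 hour ago'` syntax), the one-hour window and 300-second period are arbitrary choices, and both helper names are hypothetical.

```shell
# Convert a byte count to whole GiB, rounded down.
# Strips a trailing ".0" because CloudWatch values are printed as floats.
bytes_to_gib() {
  echo $(( ${1%%.*} / 1073741824 ))
}

# Hypothetical helper: lowest FreeStorageSpace (bytes) over the last hour
# for the DB instance identifier passed as $1. Prints "None" if no datapoints.
min_free_storage_bytes() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name FreeStorageSpace \
    --dimensions Name=DBInstanceIdentifier,Value="$1" \
    --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 300 \
    --statistics Minimum \
    --query 'min(Datapoints[].Minimum)' \
    --output text
}

# Example (not executed here):
# bytes_to_gib "$(min_free_storage_bytes kobic-<env>)"
```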
Emergency Storage Expansion
# Applied immediately, no downtime
aws rds modify-db-instance \
  --db-instance-identifier kobic-<env> \
  --allocated-storage <new_size_gb> \
  --max-allocated-storage <new_max_gb> \
  --apply-immediately
Cautions:
- RDS storage expansion is irreversible (allocated storage cannot be shrunk)
- 6-hour cooldown after expansion before storage can be modified again
- DB operates normally even in the `storage-optimization` state
- Update Terraform `config.auto.tfvars` after emergency expansion (to prevent drift)
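Because expansion is irreversible and followed by a cooldown, it helps to settle on the new size before running `modify-db-instance`. The sketch below computes a candidate value for `<new_size_gb>`. It assumes RDS rejects increases smaller than 10 percent, which matches the RDS documentation as I understand it but should be verified for your engine. `next_storage_gb` is a hypothetical helper name.

```shell
# Hypothetical helper: current allocated storage (GB) grown by a percentage,
# rounded up to the next whole GB.
# $1: current size in GB, $2: growth percentage (defaults to the assumed 10% minimum).
next_storage_gb() {
  pct=${2:-10}
  echo $(( ($1 * (100 + pct) + 99) / 100 ))
}

# Example (not executed here): read the current size, then grow it by 20 percent.
# current=$(aws rds describe-db-instances \
#   --db-instance-identifier kobic-<env> \
#   --query 'DBInstances[0].AllocatedStorage' --output text)
# next_storage_gb "$current" 20
```

Whatever value is applied should also be written back to `config.auto.tfvars` so that Terraform does not report drift.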
Replication Slot Management
CDC replication slots retain WAL until the consumer confirms it, so a stalled or abandoned slot accumulates WAL and consumes storage.
-- Check slot status
SELECT slot_name, slot_type, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
-- Delete inactive slots
SELECT pg_drop_replication_slot('slot_name_here');
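During an incident it can be handy to list only the slots that are candidates for deletion. This is a minimal sketch, assuming `psql` is installed and that connection settings come from the standard `PG*` environment variables (`PGHOST`, `PGUSER`, `PGDATABASE`, and so on). `inactive_slots` is a hypothetical helper name.

```shell
# Hypothetical helper: print "slot_name|retained_wal" for inactive slots only.
# -A: unaligned output, -t: rows only (no header or footer).
inactive_slots() {
  psql -At -c "
    SELECT slot_name,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
    FROM pg_replication_slots
    WHERE NOT active;"
}

# Example (not executed here):
# inactive_slots
```

Each returned name can then be passed to `pg_drop_replication_slot` once it is confirmed that no consumer still needs the slot.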
CodeDeploy ASG Accumulation Issue
Symptoms
- Deployments fail because CodeDeploy attempts to deploy to ASGs that have been deleted but are still listed in the deployment group
Resolution
# Check ASG list in CodeDeploy Deployment Group
aws deploy get-deployment-group \
  --application-name kobic-app \
  --deployment-group-name kobic-<env>-dg \
  --query 'deploymentGroupInfo.autoScalingGroups[].name'
# Update to include only the correct ASG
aws deploy update-deployment-group \
  --application-name kobic-app \
  --current-deployment-group-name kobic-<env>-dg \
  --auto-scaling-groups "kobic-<env>-asg"
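To see which registered ASGs no longer exist before updating the deployment group, the two name lists can be compared. This is a sketch assuming a POSIX shell with `grep`; `stale_asgs` is a hypothetical helper name, and both inputs are newline-separated lists of ASG names.

```shell
# Hypothetical helper: print names from list $1 (registered in the deployment
# group) that do not appear in list $2 (ASGs that actually exist).
stale_asgs() {
  printf '%s\n' "$1" | while IFS= read -r name; do
    [ -n "$name" ] || continue
    # -F: fixed string, -x: whole-line match, -q: quiet
    printf '%s\n' "$2" | grep -Fxq -- "$name" || printf '%s\n' "$name"
  done
}

# Example wiring (not executed here); text output is tab-separated, so split it.
# registered=$(aws deploy get-deployment-group \
#   --application-name kobic-app \
#   --deployment-group-name kobic-<env>-dg \
#   --query 'deploymentGroupInfo.autoScalingGroups[].name' \
#   --output text | tr '\t' '\n')
# existing=$(aws autoscaling describe-auto-scaling-groups \
#   --query 'AutoScalingGroups[].AutoScalingGroupName' \
#   --output text | tr '\t' '\n')
# stale_asgs "$registered" "$existing"
```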
Docker-Based Deployment Workflow
GitHub Actions → Docker Build → ECR Push → CodeDeploy → ASG Instance (Docker Compose)
| Environment | Source Branch | Deployment Method |
|---|---|---|
| Staging | development | GitHub Actions workflow_dispatch |
| Production | main | Auto-deploy on main branch push |
CloudWatch Alarm Status
| Alarm | Threshold | Status |
|---|---|---|
| database-high-cpu | 80% | alarm_actions is empty array |
| database-high-connections | 50 | Not configured |
| FreeStorageSpace | Not set | Needs to be added |
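The missing `FreeStorageSpace` alarm could look roughly like the sketch below. The alarm name `database-low-free-storage`, the threshold, and the SNS topic ARN argument are all assumptions for illustration. If the existing alarms are managed in Terraform, the same definition belongs there so that a CLI-created alarm does not become drift.

```shell
# FreeStorageSpace is reported in bytes, so convert the GiB threshold first.
gib_to_bytes() {
  echo $(( $1 * 1073741824 ))
}

# Hypothetical helper.
# $1: DB instance identifier, $2: threshold in GiB, $3: SNS topic ARN for notifications.
create_free_storage_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name "database-low-free-storage" \
    --namespace AWS/RDS \
    --metric-name FreeStorageSpace \
    --dimensions Name=DBInstanceIdentifier,Value="$1" \
    --statistic Minimum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold "$(gib_to_bytes "$2")" \
    --comparison-operator LessThanThreshold \
    --alarm-actions "$3"
}

# Example (not executed here): alarm when less than 5 GiB remains.
# create_free_storage_alarm kobic-<env> 5 <sns-topic-arn>
```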
ElastiCache Redis Cautions
- AUTH cannot be enabled when Transit Encryption is disabled
- Serverpod requires the `SERVERPOD_PASSWORD_redis` environment variable when Redis is enabled
- Set `SERVERPOD_PASSWORD_redis=""` to start the server (operates without AUTH)
- Authentication works normally on DB-based auth without Redis
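The encryption and AUTH settings can be read back before deciding how to set the password variable. This sketch assumes Redis is deployed as an ElastiCache replication group; a standalone cache cluster would need `describe-cache-clusters` instead. `redis_encryption_status` is a hypothetical helper name.

```shell
# Hypothetical helper: print "<group-id> <transit-encryption> <auth-token>" per group.
redis_encryption_status() {
  aws elasticache describe-replication-groups \
    --query 'ReplicationGroups[].[ReplicationGroupId,TransitEncryptionEnabled,AuthTokenEnabled]' \
    --output text
}

# No-AUTH case from the cautions above: the variable must exist, but stays empty.
export SERVERPOD_PASSWORD_redis=""
```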
Checklist
Incident Response
- Check RDS instance status (`describe-db-instances`)
- Check CloudWatch FreeStorageSpace metric
- Check replication slot retained WAL
- Check ASG/Target Group health checks
- Check CodeDeploy deployment status
Terraform Drift Prevention
- Update `config.auto.tfvars` after emergency AWS CLI changes
- Verify drift with `terraform plan`
- Update code via separate PR
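The drift check can be scripted using the `-detailed-exitcode` flag of `terraform plan`, which exits with 0 when there are no changes, 1 on error, and 2 when changes are present. This is a minimal sketch meant to run from the Terraform working directory; `drift_status` and `check_drift` are hypothetical helper names.

```shell
# Map a `terraform plan -detailed-exitcode` exit code to a label.
drift_status() {
  case "$1" in
    0) echo "in-sync" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Hypothetical helper: run the plan quietly and report the result.
# The "|| rc=$?" form keeps the script alive under `set -e`.
check_drift() {
  rc=0
  terraform plan -detailed-exitcode -input=false >/dev/null || rc=$?
  drift_status "$rc"
}

# Example (not executed here):
# check_drift
```

A result of `drift` after an emergency change is the cue to update `config.auto.tfvars` in a separate PR.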