AWS Infrastructure Operations Rules

Rules for AWS infrastructure incident response and operations.

Infrastructure Debugging Order

When an incident occurs, check in the following order (RDS is the most common cause):

1. Check RDS status (storage-full is the most frequent)
2. Check EC2/ASG instance status
3. Check Target Group health checks
4. Check ALB access logs
5. Check CodeDeploy deployment status
6. Check CloudWatch logs

Quick Status Check Commands

# RDS status
aws rds describe-db-instances \
  --query 'DBInstances[*].{ID:DBInstanceIdentifier,Status:DBInstanceStatus,Storage:AllocatedStorage}' \
  --output table

# ASG instance status
aws autoscaling describe-auto-scaling-groups \
  --query 'AutoScalingGroups[?contains(AutoScalingGroupName,`kobic`)].{Name:AutoScalingGroupName,Desired:DesiredCapacity,Instances:Instances[*].{ID:InstanceId,Health:HealthStatus}}' \
  --output json

# Target Group health
aws elbv2 describe-target-health \
  --target-group-arn <target-group-arn> \
  --output table

# Recent deployment status
aws deploy list-deployments \
  --application-name kobic-app \
  --deployment-group-name kobic-<env>-dg \
  --max-items 3 \
  --query 'deployments'

RDS Storage-Full Response

Symptoms

  • Server 502/503 errors (app shows "server maintenance")
  • RDS instance status: storage-full
  • CloudWatch FreeStorageSpace metric reaches 0
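
To confirm the third symptom from the command line, the FreeStorageSpace metric can be read with `aws cloudwatch get-metric-statistics`. The following is a minimal sketch; the function name `free_storage_datapoints` and its argument order are hypothetical, and the 5-minute period is an arbitrary choice.

```shell
# Print FreeStorageSpace datapoints (timestamp, minimum value in bytes)
# for one RDS instance, oldest first.
# Arguments: DB instance identifier, start time, end time (ISO 8601, UTC).
# The function name and argument order are illustrative assumptions.
free_storage_datapoints() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name FreeStorageSpace \
    --dimensions "Name=DBInstanceIdentifier,Value=$1" \
    --start-time "$2" \
    --end-time "$3" \
    --period 300 \
    --statistics Minimum \
    --query 'sort_by(Datapoints,&Timestamp)[].[Timestamp,Minimum]' \
    --output text
}
```

Called as `free_storage_datapoints kobic-<env> <start> <end>`, it should show the free space trending toward 0 in the period before the instance went storage-full.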

Emergency Storage Expansion

# Applied immediately, no downtime
aws rds modify-db-instance \
  --db-instance-identifier kobic-<env> \
  --allocated-storage <new_size_gb> \
  --max-allocated-storage <new_max_gb> \
  --apply-immediately

Cautions:

  • RDS storage expansion is irreversible (allocated storage cannot be reduced afterward)
  • After an expansion, further storage changes are blocked for a 6-hour cooldown
  • The DB keeps operating normally while the instance is in the storage-optimization state
  • Update Terraform config.auto.tfvars after an emergency expansion (to prevent drift)
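
Because an expansion cannot be undone and triggers the cooldown, it helps to pick the new size deliberately. As far as I know, RDS rejects a storage increase of less than 10 percent of the current allocation, so the sketch below computes the smallest value that should be accepted. The function name `min_new_storage_gb` is hypothetical, and the 10 percent rule should be checked against the current AWS documentation before relying on it.

```shell
# Smallest --allocated-storage value expected to be accepted:
# the current size plus 10%, rounded up to a whole GiB.
# Assumption: RDS requires an increase of at least 10 percent.
min_new_storage_gb() {
  # Integer ceiling of (current size * 1.1)
  echo $(( ($1 * 11 + 9) / 10 ))
}
```

For example, `min_new_storage_gb 100` prints 110. In an incident a larger step is usually safer, since another increase is not possible until the cooldown ends.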

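Replication Slot Management

CDC replication slots retain WAL until the consumer confirms it, so an inactive or lagging slot accumulates WAL and consumes storage.

-- Check slot status and how much WAL each slot is retaining
SELECT slot_name, slot_type, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

-- Drop an inactive slot (replace slot_name_here; an active slot cannot be dropped)
SELECT pg_drop_replication_slot('slot_name_here');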

CodeDeploy ASG Accumulation Issue

Symptoms

  • Deployments fail because CodeDeploy tries to deploy to ASGs that have been deleted but are still listed in the deployment group

Resolution

# Check ASG list in CodeDeploy Deployment Group
aws deploy get-deployment-group \
  --application-name kobic-app \
  --deployment-group-name kobic-<env>-dg \
  --query 'deploymentGroupInfo.autoScalingGroups[].name'

# Update to include only the correct ASG
aws deploy update-deployment-group \
  --application-name kobic-app \
  --current-deployment-group-name kobic-<env>-dg \
  --auto-scaling-groups "kobic-<env>-asg"

Docker-Based Deployment Workflow

GitHub Actions → Docker Build → ECR Push → CodeDeploy → ASG Instance (Docker Compose)

| Environment | Source Branch | Deployment Method |
|-------------|---------------|-------------------|
| Staging | development | GitHub Actions workflow_dispatch |
| Production | main | Auto-deploy on main branch push |

CloudWatch Alarm Status

| Alarm | Threshold | Status |
|-------|-----------|--------|
| database-high-cpu | 80% | alarm_actions is an empty array |
| database-high-connections | 50 | Not configured |
| FreeStorageSpace | Not set | Needs to be added |
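
The FreeStorageSpace alarm listed above as missing could be created with `aws cloudwatch put-metric-alarm`. This is a sketch under assumptions, not the project's actual configuration: the function names, the alarm name suffix, the 5-minute period, and the SNS topic used for notification are all hypothetical. FreeStorageSpace is reported in bytes, so the threshold is converted from GiB first.

```shell
# Convert a size in GiB to bytes (FreeStorageSpace is reported in bytes).
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Create a low-free-storage alarm for one RDS instance.
# Arguments: DB instance identifier, threshold in GiB, SNS topic ARN.
# The alarm name, period, and evaluation settings are illustrative choices.
create_free_storage_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name "$1-low-free-storage" \
    --namespace AWS/RDS \
    --metric-name FreeStorageSpace \
    --dimensions "Name=DBInstanceIdentifier,Value=$1" \
    --statistic Minimum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold "$(gib_to_bytes "$2")" \
    --comparison-operator LessThanThreshold \
    --alarm-actions "$3"
}
```

A call such as `create_free_storage_alarm kobic-<env> 10 <sns-topic-arn>` would alarm once free space drops below 10 GiB. Passing a real topic ARN matters here, since an alarm with empty actions changes state without notifying anyone. If the alarms are managed in Terraform, the equivalent resource belongs there instead, to avoid drift.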

ElastiCache Redis Cautions

  • Redis AUTH cannot be enabled while Transit Encryption is disabled
  • Serverpod requires the SERVERPOD_PASSWORD_redis environment variable whenever Redis is enabled
  • Set SERVERPOD_PASSWORD_redis="" so the server can start (Redis then operates without AUTH)
  • DB-based authentication works normally without Redis

Checklist

Incident Response

  • Check RDS instance status (describe-db-instances)
  • Check CloudWatch FreeStorageSpace metric
  • Check replication slot retained WAL
  • Check ASG/Target Group health checks
  • Check CodeDeploy deployment status

Terraform Drift Prevention

  • Update config.auto.tfvars after emergency AWS CLI changes
  • Verify drift with terraform plan
  • Update code via separate PR
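
The drift check can be scripted with `terraform plan -detailed-exitcode`, which exits with 0 when there are no changes, 1 on error, and 2 when the plan contains changes. The helper below is a minimal sketch that maps those exit codes to a label; the function name `drift_status` and the labels are hypothetical.

```shell
# Map the exit code of `terraform plan -detailed-exitcode` to a label.
#   0 = plan succeeded, no changes  -> no-drift
#   2 = plan succeeded, has changes -> drift
#   anything else (1 = error)       -> error
# Example use (not run here):
#   terraform plan -detailed-exitcode >/dev/null; drift_status $?
drift_status() {
  case "$1" in
    0) echo "no-drift" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}
```

After an emergency CLI change, a result of `drift` before config.auto.tfvars is updated is expected. The goal is to get back to `no-drift` once the PR with the updated values is merged.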