AWS Infrastructure Operations Rules

Rules for AWS infrastructure incident response and operations.

Infrastructure Debugging Order

When an incident occurs, check in the following order (RDS is the most common cause):

1. Check RDS status (storage-full is the most frequent failure state)
2. Check EC2/ASG instance status
3. Check Target Group health checks
4. Check ALB access logs
5. Check CodeDeploy deployment status
6. Check CloudWatch logs

Quick Status Check Commands

# RDS status
aws rds describe-db-instances \
  --query 'DBInstances[*].{ID:DBInstanceIdentifier,Status:DBInstanceStatus,Storage:AllocatedStorage}' \
  --output table

# ASG instance status
aws autoscaling describe-auto-scaling-groups \
  --query 'AutoScalingGroups[?contains(AutoScalingGroupName,`kobic`)].{Name:AutoScalingGroupName,Desired:DesiredCapacity,Instances:Instances[*].{ID:InstanceId,Health:HealthStatus}}' \
  --output json

# Target Group health
aws elbv2 describe-target-health \
  --target-group-arn <target-group-arn> \
  --output table

# Recent deployment status
aws deploy list-deployments \
  --application-name kobic-app \
  --deployment-group-name kobic-<env>-dg \
  --max-items 3 \
  --query 'deployments'

RDS Storage-Full Response

Symptoms

Server 502/503 errors (app shows "server maintenance")
RDS instance status: storage-full
CloudWatch FreeStorageSpace metric reaches 0
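
To confirm the third symptom from the command line, the metric can be pulled with `aws cloudwatch get-metric-statistics`. This is a minimal sketch, not part of the original runbook: the function name `free_storage_last_hour` is hypothetical, the one-hour window and 5-minute period are arbitrary choices, and the `date -d` syntax assumes GNU date.

```shell
# Print the minimum FreeStorageSpace (bytes) per 5-minute period over the past hour.
# Assumption: GNU date is available (for -d '1 hour ago').
free_storage_last_hour() {  # $1 = DB instance identifier, e.g. kobic-<env>
  aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name FreeStorageSpace \
    --dimensions "Name=DBInstanceIdentifier,Value=$1" \
    --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 300 \
    --statistics Minimum \
    --query 'sort_by(Datapoints,&Timestamp)[].{Time:Timestamp,FreeBytes:Minimum}' \
    --output table
}

# Example (not run here):
#   free_storage_last_hour kobic-<env>
```

A steadily falling `FreeBytes` column that ends near 0 matches the storage-full pattern described above.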

Emergency Storage Expansion

# Applied immediately, no downtime
aws rds modify-db-instance \
  --db-instance-identifier kobic-<env> \
  --allocated-storage <new_size_gb> \
  --max-allocated-storage <new_max_gb> \
  --apply-immediately

Cautions:

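RDS storage expansion is irreversible (allocated storage cannot be reduced afterward)
After an expansion, storage cannot be modified again for 6 hours or until storage optimization completes, whichever is longer
The DB keeps operating normally while the instance is in the storage-optimization state
After an emergency expansion, update Terraform config.auto.tfvars to match (prevents drift)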
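
Because the next storage change is blocked for hours, it helps to pick the new size deliberately. The sketch below reads the current allocation and computes the smallest value RDS should accept, assuming the documented rule that an increase must be at least 10%. The function names are hypothetical, and in practice a larger margin than the minimum is usually sensible.

```shell
# Current allocated storage in GiB for one instance.
current_allocated_gb() {  # $1 = DB instance identifier
  aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].AllocatedStorage' \
    --output text
}

# Smallest target size (GiB) that is at least 10% above the current size, rounded up.
min_expanded_size_gb() {  # $1 = current size in GiB
  echo $(( ($1 * 110 + 99) / 100 ))
}

# Example (not run here):
#   current="$(current_allocated_gb kobic-<env>)"
#   echo "request at least $(min_expanded_size_gb "$current") GiB"
```

Whatever value is finally passed to `--allocated-storage` is the value that should be copied into config.auto.tfvars.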

Replication Slot Management

CDC replication slots accumulate WAL and consume storage.

-- Check slot status
SELECT slot_name, slot_type, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

-- Delete an inactive slot (only after confirming no CDC consumer still needs it)
SELECT pg_drop_replication_slot('slot_name_here');
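
During an incident the same check can be run non-interactively. The following is a sketch only: it assumes `psql` is installed and that a connection string is available in `DATABASE_URL` (a hypothetical variable name, not something the original defines). It lists inactive slots only, largest retained WAL first, so drop candidates are easy to review.

```shell
# List inactive replication slots with their retained WAL, largest first.
# Assumption: DATABASE_URL holds a libpq connection string for the RDS instance.
list_inactive_slots() {
  psql "$DATABASE_URL" --no-psqlrc --tuples-only --no-align --command "
    SELECT slot_name,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
    FROM pg_replication_slots
    WHERE NOT active
    ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;"
}

# Example (not run here):
#   DATABASE_URL='postgresql://<user>@<host>:5432/<db>' list_inactive_slots
```

The function deliberately stops at listing. Dropping a slot remains a manual step, since the consumer of a dropped slot has to be re-initialized.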

CodeDeploy ASG Accumulation Issue

Symptoms

Names of deleted ASGs accumulate in the Deployment Group; during deployment, CodeDeploy attempts to deploy to them and the deployment fails

Resolution

# Check ASG list in CodeDeploy Deployment Group
aws deploy get-deployment-group \
  --application-name kobic-app \
  --deployment-group-name kobic-<env>-dg \
  --query 'deploymentGroupInfo.autoScalingGroups[].name'

# Update to include only the correct ASG
aws deploy update-deployment-group \
  --application-name kobic-app \
  --current-deployment-group-name kobic-<env>-dg \
  --auto-scaling-groups "kobic-<env>-asg"
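
Before updating the Deployment Group, it can be useful to see exactly which attached ASG names no longer exist. The sketch below compares the two lists. The helper names are hypothetical, and it assumes the account has few enough ASGs that the CLI's automatic pagination returns them all in one call.

```shell
# Print names from list $1 that do not appear in list $2 (one name per line).
stale_asgs() {
  printf '%s\n' "$1" | while IFS= read -r name; do
    [ -n "$name" ] || continue
    printf '%s\n' "$2" | grep -qxF -- "$name" || printf '%s\n' "$name"
  done
}

# ASG names attached to the Deployment Group.
deployment_group_asgs() {  # $1 = environment name
  aws deploy get-deployment-group \
    --application-name kobic-app \
    --deployment-group-name "kobic-$1-dg" \
    --query 'deploymentGroupInfo.autoScalingGroups[].name' \
    --output text | tr '\t' '\n'
}

# ASG names that currently exist in the account and region.
existing_asgs() {
  aws autoscaling describe-auto-scaling-groups \
    --query 'AutoScalingGroups[].AutoScalingGroupName' \
    --output text | tr '\t' '\n'
}

# Example (not run here):
#   stale_asgs "$(deployment_group_asgs <env>)" "$(existing_asgs)"
```

Any name printed is attached to the Deployment Group but missing from Auto Scaling, which is the condition that makes the deployment fail.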

Docker-Based Deployment Workflow

GitHub Actions → Docker Build → ECR Push → CodeDeploy → ASG Instance (Docker Compose)

Environment	Source Branch	Deployment Method
Staging	`development`	GitHub Actions workflow_dispatch
Production	`main`	Auto-deploy on `main` branch push

CloudWatch Alarm Status

Alarm	Threshold	Status
`database-high-cpu`	80%	alarm_actions is an empty array (alarm triggers no notification)
`database-high-connections`	50	Not configured
FreeStorageSpace	Not set	Needs to be added
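
As a stopgap for the missing FreeStorageSpace alarm, one could be created from the CLI along these lines. Everything specific here is an assumption: the alarm name `database-low-free-storage`, the 5-minute period, the three evaluation periods, and the SNS topic passed as an action. If alarms are managed in Terraform, the durable fix belongs there, and a CLI-created alarm should be removed or imported afterward.

```shell
# Convert GiB to bytes (FreeStorageSpace is reported in bytes).
gib_to_bytes() {  # $1 = size in GiB
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Create or update a low-free-storage alarm for one RDS instance.
create_free_storage_alarm() {  # $1 = DB identifier, $2 = threshold in GiB, $3 = SNS topic ARN
  aws cloudwatch put-metric-alarm \
    --alarm-name "database-low-free-storage" \
    --namespace AWS/RDS \
    --metric-name FreeStorageSpace \
    --dimensions "Name=DBInstanceIdentifier,Value=$1" \
    --statistic Minimum \
    --period 300 \
    --evaluation-periods 3 \
    --threshold "$(gib_to_bytes "$2")" \
    --comparison-operator LessThanOrEqualToThreshold \
    --alarm-actions "$3"
}

# Example (not run here):
#   create_free_storage_alarm kobic-<env> 5 <sns-topic-arn>
```

Passing a real topic to `--alarm-actions` matters: an alarm with no actions changes state silently, which is the gap noted for `database-high-cpu`.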

ElastiCache Redis Cautions

AUTH cannot be enabled when Transit Encryption is disabled
Serverpod requires the SERVERPOD_PASSWORD_redis environment variable whenever Redis is enabled
Set SERVERPOD_PASSWORD_redis="" so the server can start (Redis then operates without AUTH)
Without Redis, authentication still works normally through DB-based auth

Checklist

Incident Response

Check RDS instance status (describe-db-instances)
Check CloudWatch FreeStorageSpace metric
Check replication slot retained WAL
Check ASG/Target Group health checks
Check CodeDeploy deployment status

Terraform Drift Prevention

Update config.auto.tfvars after emergency AWS CLI changes
Verify drift with terraform plan
Update code via separate PR
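
The drift check in the second item can be scripted with `terraform plan -detailed-exitcode`, which exits with 0 when there are no changes, 2 when changes are pending, and 1 on error. A minimal sketch, assuming it is run from the Terraform working directory with credentials already configured (the function name `check_drift` is hypothetical):

```shell
# Report whether the live infrastructure differs from the Terraform configuration.
check_drift() {
  rc=0
  terraform plan -detailed-exitcode -input=false >/dev/null || rc=$?
  case "$rc" in
    0) echo "no drift" ;;
    2) echo "drift detected"; return 2 ;;
    *) echo "terraform plan failed (exit $rc)" >&2; return 1 ;;
  esac
}

# Example (not run here):
#   check_drift || echo "update config.auto.tfvars and open a PR"
```

After config.auto.tfvars has been updated to match an emergency change, the function should report "no drift" again.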