rds-operations

RDS incident response and operational management

RDS Incident Response and Operations

Covers RDS PostgreSQL fault diagnosis, storage expansion, and replication slot management.

Triggers

  • Server 502/503 errors (app shows "server maintenance")
  • RDS instance status anomalies (storage-full, modifying, etc.)
  • Database performance degradation
  • Replication slot WAL accumulation

See rules/aws-operations.md for infrastructure debugging order

RDS Storage-Full Response

Diagnosis

# 1. Check RDS instance status
aws rds describe-db-instances \
  --db-instance-identifier kobic-<env> \
  --query 'DBInstances[0].{Status:DBInstanceStatus,Storage:AllocatedStorage,MaxStorage:MaxAllocatedStorage}'

# 2. FreeStorageSpace metric (last 1 hour)
# Note: `date -v-1H` is BSD/macOS syntax; with GNU date use
#       date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name FreeStorageSpace \
  --dimensions Name=DBInstanceIdentifier,Value=kobic-<env> \
  --start-time "$(date -u -v-1H +%Y-%m-%dT%H:%M:%S)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --period 300 --statistics Minimum \
  --output table

-- 3. Check table sizes after connecting to the DB
-- quote_ident keeps mixed-case or otherwise quoted table names resolvable
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(quote_ident(schemaname)||'.'||quote_ident(tablename))) AS total_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(quote_ident(schemaname)||'.'||quote_ident(tablename)) DESC
LIMIT 20;
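
The two AWS readings above can be combined into a single utilization figure. The following is a minimal sketch, assuming `FreeStorageSpace` is reported in bytes and `AllocatedStorage` in GiB; the function name `free_storage_pct` is hypothetical and not part of any existing tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: percentage of allocated storage that is still free.
#   $1 = FreeStorageSpace in bytes (CloudWatch may print it as e.g. 2147483648.0)
#   $2 = AllocatedStorage in GiB
free_storage_pct() {
  local free_bytes=${1%.*}   # drop any fractional part so integer math works
  local allocated_gib=$2
  # 1 GiB = 1073741824 bytes; integer division rounds down.
  echo $(( free_bytes * 100 / (allocated_gib * 1073741824) ))
}
```

For example, `free_storage_pct 2147483648 20` prints `10`, meaning 2 GiB of a 20 GiB volume is still free.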

Emergency Storage Expansion

# Applied immediately, no downtime
aws rds modify-db-instance \
  --db-instance-identifier kobic-<env> \
  --allocated-storage <new_size_gb> \
  --max-allocated-storage <new_max_gb> \
  --apply-immediately

Cautions:

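  • RDS storage expansion is irreversible: allocated storage can be increased but never decreased
  • After an expansion, further storage changes are blocked for at least 6 hours (or until storage optimization finishes, whichever is longer)
  • The DB keeps operating normally while the instance is in the storage-optimization state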
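
Because the expansion cannot be undone and cannot be repeated for several hours, it helps to settle on the target size before running the command. The sketch below assumes the rule from the AWS `ModifyDBInstance` documentation that a new `AllocatedStorage` value must be at least 10% larger than the current one; the function name `min_allocated_gib` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: return the size to request, never below the assumed
# "at least 10% larger than the current allocation" floor.
#   $1 = current AllocatedStorage in GiB
#   $2 = desired AllocatedStorage in GiB
min_allocated_gib() {
  local current=$1 desired=$2
  # Ceiling of current * 1.1 using integer arithmetic.
  local floor=$(( (current * 11 + 9) / 10 ))
  if (( desired < floor )); then
    echo "$floor"
  else
    echo "$desired"
  fi
}
```

The result can be passed as `<new_size_gb>`. Since the cooldown rules out a quick second expansion, choosing a size with generous headroom is usually the safer call.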

Terraform Drift Handling

After an emergency expansion, the Terraform code no longer matches the actual infrastructure, so the declared storage values must be raised to match:

# config.auto.tfvars update required
db_configs = {
  "staging" = {
    allocated_storage       = 200  # Reflect emergency expansion
    max_allocated_storage   = 500  # Reflect emergency expansion
    # ...
  }
}
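
One way to confirm that the drift is resolved is `terraform plan -detailed-exitcode`, which exits with 0 when there are no changes, 1 on error, and 2 when changes are pending. The sketch below only interprets that exit status; the function name `plan_status` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: translate the exit status of
# `terraform plan -detailed-exitcode` into a short label.
plan_status() {
  case "$1" in
    0) echo "in-sync" ;;   # no changes: code matches infrastructure
    2) echo "drift" ;;     # plan succeeded and a diff is present
    *) echo "error" ;;     # 1 (or anything else): the plan itself failed
  esac
}

# Usage (not executed here):
#   terraform plan -detailed-exitcode >/dev/null; plan_status $?
```

After updating `config.auto.tfvars`, a result of `in-sync` for the affected environment suggests the code again reflects the expanded storage.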

Replication Slot Management

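CDC (Change Data Capture) replication slots make PostgreSQL retain WAL from each slot's restart_lsn onward. An inactive or lagging slot therefore accumulates WAL indefinitely and consumes storage.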

-- Check replication slot status
SELECT slot_name, slot_type, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as retained_wal
FROM pg_replication_slots;

-- Delete inactive slots (ClickHouse ClickPipe, etc.)
SELECT pg_drop_replication_slot('slot_name_here');
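
To shortlist candidates before dropping anything, the slot query can be run in unaligned mode and filtered. This is a minimal sketch, assuming input lines of the form `slot_name|active|retained_bytes` as produced by `psql -At -F'|'` (booleans print as `t` or `f`); the function name `flag_stale_slots` and the threshold are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print inactive slots retaining more WAL than a threshold.
#   $1    = threshold in bytes
#   stdin = lines "slot_name|active|retained_bytes", e.g. from:
#     psql -At -F'|' -c "SELECT slot_name, active,
#       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
#       FROM pg_replication_slots"
flag_stale_slots() {
  local threshold_bytes=$1
  awk -F'|' -v limit="$threshold_bytes" \
    '$2 == "f" && $3 + 0 > limit + 0 { print $1 }'
}
```

The output is only a shortlist: each slot should be confirmed as no longer used by its consumer before `pg_drop_replication_slot` is called.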

RDS Instance Specifications

Current Configuration (Graviton2 ARM64)

Environment  Instance Class  vCPU  Memory  Storage  Max Storage  Multi-AZ
Production   db.t4g.small    2     2GB     20GB     200GB        ✅
Staging      db.t4g.small    2     2GB     20GB     100GB        ❌
Development  db.t4g.micro    2     1GB     20GB     50GB         ❌

See rules/aws-operations.md for CloudWatch alarm status

Checklist

Storage-Full Response

  • Check status with aws rds describe-db-instances
  • Check CloudWatch FreeStorageSpace metric
  • Analyze table sizes (pg_total_relation_size)
  • Check replication slot retained WAL
  • Execute emergency storage expansion
  • Update Terraform config.auto.tfvars (separate PR)

Regular Monitoring

  • Monthly storage utilization check
  • Replication slot WAL size check
  • Verify that every CloudWatch alarm has alarm_actions attached, so a firing alarm actually notifies someone
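
The alarm_actions check can be partly scripted. The sketch below assumes tab-separated input of `AlarmName` and action count, which is the shape `aws cloudwatch describe-alarms --query 'MetricAlarms[].[AlarmName,length(AlarmActions)]' --output text` should produce; the function name `alarms_without_actions` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list alarms that have no actions attached.
#   stdin = lines "<AlarmName><TAB><number of AlarmActions>"
alarms_without_actions() {
  awk -F'\t' '$2 + 0 == 0 { print $1 }'
}

# Usage (not executed here):
#   aws cloudwatch describe-alarms \
#     --query 'MetricAlarms[].[AlarmName,length(AlarmActions)]' \
#     --output text | alarms_without_actions
```

Any alarm name printed would fire silently and is worth reviewing during the regular monitoring pass.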