RDS Incident Response and Operations#
Covers RDS PostgreSQL fault diagnosis, storage expansion, and replication slot management.
Triggers#
- Server 502/503 errors (app shows "server maintenance")
- RDS instance status anomalies (storage-full, modifying, etc.)
- Database performance degradation
- Replication slot WAL accumulation
See rules/aws-operations.md for the infrastructure debugging order.
RDS Storage-Full Response#
Diagnosis#
# 1. Check RDS instance status
aws rds describe-db-instances \
  --db-instance-identifier kobic-<env> \
  --query 'DBInstances[0].{Status:DBInstanceStatus,Storage:AllocatedStorage,MaxStorage:MaxAllocatedStorage}'
# 2. FreeStorageSpace metric (last 1 hour)
# Note: `date -v-1H` is BSD/macOS syntax; with GNU date use: date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name FreeStorageSpace \
  --dimensions Name=DBInstanceIdentifier,Value=kobic-<env> \
  --start-time "$(date -u -v-1H +%Y-%m-%dT%H:%M:%S)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --period 300 --statistics Minimum \
  --output table
-- 3. After connecting to the database, check table sizes
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size((quote_ident(schemaname) || '.' || quote_ident(tablename))::regclass)) AS total_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size((quote_ident(schemaname) || '.' || quote_ident(tablename))::regclass) DESC
LIMIT 20;
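The FreeStorageSpace datapoints from step 2 are reported in bytes, while AllocatedStorage from step 1 is in GiB, so the two are awkward to compare by eye. A minimal sketch of a conversion helper follows; the function name `free_storage_pct` is hypothetical and not part of this runbook.

```shell
# free_storage_pct: print free space as a percentage of allocated storage.
#   $1: FreeStorageSpace datapoint in bytes (as returned by CloudWatch)
#   $2: AllocatedStorage in GiB (as returned by describe-db-instances)
# Hypothetical helper for illustration only.
free_storage_pct() {
  awk -v free_bytes="$1" -v allocated_gib="$2" \
    'BEGIN { printf "%.1f\n", free_bytes / (allocated_gib * 1073741824) * 100 }'
}
```

For example, `free_storage_pct 2147483648 20` prints `10.0`: 2 GiB free out of 20 GiB allocated.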
Emergency Storage Expansion#
# Applied immediately, no downtime
aws rds modify-db-instance \
  --db-instance-identifier kobic-<env> \
  --allocated-storage <new_size_gb> \
  --max-allocated-storage <new_max_gb> \
  --apply-immediately
Cautions:
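- RDS storage expansion is irreversible: allocated storage can be increased but never decreased.
- After an expansion, no further storage modification is possible for 6 hours or until storage optimization completes, whichever is longer.
- The DB operates normally while the instance is in the storage-optimization state.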
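Because of the 6-hour cooldown, it helps to know how long remains before a second expansion could be attempted. The sketch below computes only the fixed 6-hour window from a known expansion time; it does not account for a storage optimization that runs longer. The function name `cooldown_remaining_min` is hypothetical.

```shell
# cooldown_remaining_min: minutes left in the 6-hour storage cooldown.
#   $1: epoch seconds at which the expansion was applied
#   $2: current epoch seconds (optional, defaults to now)
# Hypothetical helper; assumes only the fixed 6-hour rule applies.
cooldown_remaining_min() {
  applied=$1
  now=${2:-$(date +%s)}
  remaining=$(( applied + 6 * 3600 - now ))
  if [ "$remaining" -lt 0 ]; then
    remaining=0
  fi
  echo $(( remaining / 60 ))
}
```

For example, `cooldown_remaining_min 1000 4600` prints `300`, since one hour of the six has elapsed.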
Terraform Drift Handling#
After an emergency expansion, the Terraform code and the actual infrastructure are out of sync, so the new values must be written back to the Terraform variables:
# config.auto.tfvars update required
db_configs = {
  "staging" = {
    allocated_storage     = 200 # Reflect emergency expansion
    max_allocated_storage = 500 # Reflect emergency expansion
    # ...
  }
}
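One way to confirm that the updated values match the live instance is `terraform plan -detailed-exitcode`, which exits 0 when there is nothing to change, 2 when changes are pending, and 1 on error. The sketch below only interprets that exit status; the function name `drift_status` is hypothetical, and running the plan from the correct directory and workspace is assumed.

```shell
# drift_status: translate the exit status of `terraform plan -detailed-exitcode`.
#   $1: exit status of the plan command
# Hypothetical helper for illustration only.
drift_status() {
  case "$1" in
    0) echo "in-sync" ;;   # no changes: code matches infrastructure
    2) echo "drift" ;;     # plan succeeded and found pending changes
    *) echo "error" ;;     # plan itself failed
  esac
}

# Usage (not executed here):
#   terraform plan -detailed-exitcode >/dev/null; drift_status $?
```

A result of `drift` after the tfvars update suggests that some value still differs from what was applied during the incident.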
Replication Slot Management#
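CDC (Change Data Capture) replication slots make PostgreSQL retain WAL until the consumer has read it. A slot whose consumer is inactive or lagging keeps accumulating WAL and consumes storage.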
-- Check replication slot status
SELECT slot_name, slot_type, active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as retained_wal
FROM pg_replication_slots;
-- Delete inactive slots (ClickHouse ClickPipe, etc.); a slot that is still active cannot be dropped
SELECT pg_drop_replication_slot('slot_name_here');
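When there are several slots, the status query can be filtered from the shell to list only the drop candidates. The sketch below assumes the rows arrive in `psql -At` format (`slot_name|active|retained_bytes`, with booleans printed as `t` or `f`); the function name `stale_slots` and the `DATABASE_URL` variable are hypothetical.

```shell
# stale_slots: read "slot_name|active|retained_bytes" rows on stdin and print
# the names of inactive slots retaining more than the given number of bytes.
#   $1: threshold in bytes
# Hypothetical helper for illustration only.
stale_slots() {
  awk -F'|' -v limit="$1" '$2 == "f" && $3 + 0 > limit + 0 { print $1 }'
}

# Usage (not executed here; DATABASE_URL is an assumed connection string):
#   psql "$DATABASE_URL" -At -c "SELECT slot_name, active,
#     pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
#     FROM pg_replication_slots" | stale_slots 1073741824
```

The helper only lists candidates. Dropping a slot remains a manual decision, since the consumer loses its position in the WAL stream.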
RDS Instance Specifications#
Current Configuration (Graviton2 ARM64)#
| Environment | Instance Class | vCPU | Memory | Storage | Max Storage | Multi-AZ |
|---|---|---|---|---|---|---|
| Production | db.t4g.small | 2 | 2GB | 20GB | 200GB | ? |
| Staging | db.t4g.small | 2 | 2GB | 20GB | 100GB | ? |
| Development | db.t4g.micro | 2 | 1GB | 20GB | 50GB | ? |
See rules/aws-operations.md for CloudWatch alarm status.
Checklist#
Storage-Full Response#
- Check status with aws rds describe-db-instances
- Check CloudWatch FreeStorageSpace metric
- Analyze table sizes (pg_total_relation_size)
- Check replication slot retained WAL
- Execute emergency storage expansion
- Update Terraform config.auto.tfvars (separate PR)
Regular Monitoring#
- Monthly storage utilization check
- Replication slot WAL size check
- Verify that each CloudWatch alarm has alarm_actions attached
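The alarm_actions check can be partly scripted. The sketch below assumes the alarm list is produced by `aws cloudwatch describe-alarms` with a query that emits one tab-separated `name<TAB>action_count` row per alarm; the function name `alarms_without_actions` is hypothetical.

```shell
# alarms_without_actions: read "alarm_name<TAB>action_count" rows on stdin and
# print the names of alarms that have no actions attached.
# Hypothetical helper for illustration only.
alarms_without_actions() {
  awk -F'\t' '$2 == 0 { print $1 }'
}

# Usage (not executed here):
#   aws cloudwatch describe-alarms \
#     --query 'MetricAlarms[].[AlarmName, length(AlarmActions)]' \
#     --output text | alarms_without_actions
```

Any name printed is an alarm that would change state without notifying anyone.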