rds-operations

RDS incident response and operational management. Use when investigating RDS performance issues, failovers, snapshots, or parameter changes.

RDS Incident Response and Operations#

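In a Nutshell (for non-developers)#

This skill is an emergency manual for responding when something goes wrong with the "warehouse" (database) where the service's data is stored. It helps you quickly diagnose and resolve situations where the warehouse fills up, suddenly slows down, or customers (users) see an "under construction" screen. It is much like a building manager writing down response procedures for power outages or leaks ahead of time.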

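What and When (for non-developers)#

What it does
- Checks the state of the data warehouse (RDS) and expands storage immediately when space runs short.
- Finds the cause when the database slows down or stops.
- Cleans up unneeded records that pile up and eat into storage.
- After an emergency fix, brings the configuration document (infrastructure code) back in line with the actual state.
When it triggers
- When the screen shows a 502/503 error or a "server maintenance" notice
- When the data warehouse is almost full or its status looks abnormal (storage-full, etc.)
- When the service is slow across the board
- When records (WAL) keep piling up during data replication and take up space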

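Key Terms (for non-developers)#

Term	Plain explanation
RDS	A data warehouse (database) service managed by AWS
PostgreSQL	The way this warehouse organizes data (one type of database)
Storage	The space that holds the data (the warehouse's capacity)
Storage-Full	The state where storage is completely full and no more data can be written
Failover	An emergency switch in which a standby automatically takes over when one side fails
Snapshot	A backup "photo" of all the data at a specific point in time
Replication Slot	A channel for copying data changes to another system in real time
WAL	A ledger that records data changes in order (it takes up space as it accumulates)
CDC (Change Data Capture)	A method that captures each data change and sends it elsewhere
Multi-AZ	Redundancy that keeps the data in two separate locations at once, so the service survives if one fails
Terraform Drift	The state where the configuration document and the real infrastructure no longer match
CloudWatch	A monitoring dashboard that watches the state and metrics of AWS resources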

Covers RDS PostgreSQL fault diagnosis, storage expansion, and replication slot management.

Triggers#

- Server 502/503 errors (app shows "server maintenance")
- RDS instance status anomalies (storage-full, modifying, etc.)
- Database performance degradation
- Replication slot WAL accumulation

See rules/aws-operations.md for infrastructure debugging order

RDS Storage-Full Response#

Diagnosis#

# 1. Check RDS instance status (replace <env> with the environment name)
aws rds describe-db-instances \
  --db-instance-identifier kobic-<env> \
  --query 'DBInstances[0].{Status:DBInstanceStatus,Storage:AllocatedStorage,MaxStorage:MaxAllocatedStorage}'

# 2. FreeStorageSpace metric (last 1 hour)
# Note: -v-1H is BSD/macOS date syntax; on GNU/Linux use: date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name FreeStorageSpace \
  --dimensions Name=DBInstanceIdentifier,Value=kobic-<env> \
  --start-time "$(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 --statistics Minimum \
  --output table

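-- 3. After connecting to the database, check table sizes
-- format('%I.%I', ...)::regclass quotes identifiers, so mixed-case table names resolve correctly
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(format('%I.%I', schemaname, tablename)::regclass)) AS total_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(format('%I.%I', schemaname, tablename)::regclass) DESC
LIMIT 20;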
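
The two numbers gathered in steps 1 and 2 (AllocatedStorage in GiB, FreeStorageSpace in bytes) can be combined into a single free-space percentage. The following is a minimal sketch, not part of the original runbook: the function names `free_pct` and `is_storage_low` and the default 10% threshold are assumptions chosen for illustration, and the metric value is assumed to arrive as a plain decimal number.

```shell
#!/usr/bin/env bash
# Sketch: turn FreeStorageSpace (bytes) and AllocatedStorage (GiB) into a
# whole-number free-space percentage. Names and threshold are hypothetical.

free_pct() {
  # CloudWatch may report the metric as e.g. 2147483648.0; drop the fraction.
  local free_bytes=${1%.*}
  local allocated_gib=$2
  # Integer arithmetic: free bytes divided by allocated bytes, as a percent.
  echo $(( free_bytes * 100 / (allocated_gib * 1024 * 1024 * 1024) ))
}

is_storage_low() {
  # Succeeds (exit 0) when free space is below the threshold percent.
  # $3 is the threshold; 10 is an assumed default, not a documented value.
  local pct
  pct=$(free_pct "$1" "$2")
  [ "$pct" -lt "${3:-10}" ]
}
```

For example, `free_pct 1073741824 20` reports how much of a 20 GiB allocation is still free when 1 GiB remains, and `is_storage_low 1073741824 20 && echo "expand now"` could gate the emergency expansion step.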

Emergency Storage Expansion#

# Applied immediately, no downtime
aws rds modify-db-instance \
  --db-instance-identifier kobic-<env> \
  --allocated-storage <new_size_gb> \
  --max-allocated-storage <new_max_gb> \
  --apply-immediately

Cautions:

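- RDS storage expansion is irreversible: allocated storage can be increased but never decreased.
- After a storage modification, further storage changes are blocked for 6 hours or until storage optimization finishes, whichever is longer.
- The DB keeps operating normally while the instance is in the storage-optimization state.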
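
Because expansion cannot be undone and a cooldown follows, it helps to pick the new size deliberately before running the command. The sketch below assumes the rule described in the AWS CLI reference for `modify-db-instance`, namely that a new PostgreSQL `--allocated-storage` value must be at least 10% larger than the current one; verify this against the current AWS documentation. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: choose a value for --allocated-storage before an emergency expansion.
# Assumption: the new size must be at least 10% above the current size.

min_allowed_gib() {
  local current=$1
  # Ceiling of current * 1.10 using integer arithmetic only.
  echo $(( (current * 110 + 99) / 100 ))
}

next_allocated_gib() {
  # $1 = current size in GiB, $2 = desired size in GiB (optional).
  local current=$1 target=${2:-0} min
  min=$(min_allowed_gib "$current")
  # Use the desired size when it clears the minimum step, else the minimum.
  if [ "$target" -gt "$min" ]; then
    echo "$target"
  else
    echo "$min"
  fi
}
```

The result can be passed to the expansion command, for example `--allocated-storage "$(next_allocated_gib 20 200)"`. Since the size can never be reduced, choosing a target with real headroom is usually better than taking the minimum step.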

Terraform Drift Handling#

After emergency expansion, Terraform code and actual infrastructure are out of sync:

# config.auto.tfvars update required
db_configs = {
  "staging" = {
    allocated_storage       = 200  # Reflect emergency expansion
    max_allocated_storage   = 500  # Reflect emergency expansion
    # ...
  }
}
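
One way to confirm that the tfvars update really closed the gap is `terraform plan -detailed-exitcode`, which exits with 0 when there is nothing to change, 1 on error, and 2 when changes are pending. The sketch below is an illustration under assumptions: the function names are hypothetical, Terraform is assumed to be initialised in the current directory, and an exit code of 2 can also mean unapplied code changes rather than drift, so the plan output still needs a human read.

```shell
#!/usr/bin/env bash
# Sketch: interpret `terraform plan -detailed-exitcode` after updating tfvars.
# Function names are hypothetical.

drift_status() {
  # Map the documented exit codes to a short label.
  case "$1" in
    0) echo "in-sync" ;;   # plan succeeded, no changes
    2) echo "drift" ;;     # plan succeeded, changes pending
    *) echo "error" ;;     # plan failed (exit code 1 or anything unexpected)
  esac
}

check_drift() {
  # Runs a read-only plan; -lock=false avoids holding the state lock.
  local code=0
  terraform plan -detailed-exitcode -input=false -lock=false >/dev/null || code=$?
  drift_status "$code"
}
```

Running `check_drift` in the environment's Terraform directory before and after the PR merges gives a quick signal of whether the emergency values in config.auto.tfvars now match the instance.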

Replication Slot Management#

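CDC (Change Data Capture) replication slots make PostgreSQL retain WAL until the consumer confirms it has received the changes. A slot whose consumer is stopped or gone keeps retaining WAL indefinitely and can fill the storage.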

-- Check replication slot status
SELECT slot_name, slot_type, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as retained_wal
FROM pg_replication_slots;

-- Delete inactive slots (ClickHouse ClickPipe, etc.)
-- Only drop slots with active = false; dropping a slot that is still in use fails
SELECT pg_drop_replication_slot('slot_name_here');
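
To decide which slots are candidates for deletion, the status query can be narrowed to inactive slots that retain more than a chosen amount of WAL. The sketch below filters pipe-separated rows of the form `slot_name|active|retained_bytes`, which is what `psql -At -F'|'` would print for a query selecting those three columns. The function name `stale_slots`, the `DATABASE_URL` variable, and the 1 GiB threshold in the usage comment are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch: list inactive replication slots retaining more WAL than a threshold.
# Input on stdin: slot_name|active|retained_bytes (active is "t" or "f").
# The function name and the input layout are hypothetical.

stale_slots() {
  local threshold_bytes=$1
  # Keep rows where the slot is inactive and retained WAL exceeds the limit.
  awk -F'|' -v limit="$threshold_bytes" \
    '$2 == "f" && $3 + 0 > limit + 0 { print $1 }'
}

# Example usage (DATABASE_URL is an assumed environment variable):
#   psql "$DATABASE_URL" -At -F'|' -c \
#     "SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
#      FROM pg_replication_slots" | stale_slots 1073741824
```

The function only prints names; reviewing the list and running `pg_drop_replication_slot` by hand keeps the destructive step under human control.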

RDS Instance Specifications#

Current Configuration (Graviton2 ARM64)#

Environment	Instance Class	vCPU	Memory	Storage	Max Storage	Multi-AZ
Production	db.t4g.small	2	2GB	20GB	200GB	✅
Staging	db.t4g.small	2	2GB	20GB	100GB	❌
Development	db.t4g.micro	2	1GB	20GB	50GB	❌

See rules/aws-operations.md for CloudWatch alarm status

Checklist#

Storage-Full Response#

- Check status with aws rds describe-db-instances
- Check CloudWatch FreeStorageSpace metric
- Analyze table sizes (pg_total_relation_size)
- Check replication slot retained WAL
- Execute emergency storage expansion
- Update Terraform config.auto.tfvars (separate PR)

Regular Monitoring#

- Check storage utilization monthly
- Check replication slot WAL size
- Verify that CloudWatch alarms have alarm_actions attached