Runbook: Incidents, Recovery, and Backups
Operational runbook for CepatEdge on Cloudflare (Workers, Pages, R2, Durable Objects) and Neon. Use this when something goes wrong or when performing backups/restores. Rollback details are in Deployment workflow; security incident process is in Security Overview.
1. Incident Classification
| Severity | Meaning | Examples |
|---|---|---|
| P0 – Critical | System down or data loss | API/frontend unreachable; DB unreachable; data corruption. |
| P1 – High | Major feature broken or security issue | Auth broken; maintenance workflow broken; confirmed breach. |
| P2 – Medium | Degraded or partial failure | Slow responses; one workflow failing; cache issues. |
| P3 – Low | Minor/cosmetic | UI glitch; non-critical feature broken. |
2. Response Flow
- Detect — Monitoring, alerts, or user report.
- Assess — P0/P1/P2/P3; impact (who, how many, which feature).
- Communicate — Notify stakeholders; if P0/P1, start status updates (internal and, if agreed, users).
- Contain — Stop the bleed: rollback, disable feature, revoke token, scale/restart if applicable.
- Resolve — Fix root cause, deploy, verify.
- Recover — Restore data if needed (from backup); see sections below.
- Post-mortem — Short write-up: what happened, cause, what we changed (runbook, monitoring, code).
3. Quick Actions by Component
3.1 API (Workers) down or bad deploy
- Rollback Workers: `npx wrangler deployments list` → `npx wrangler rollback <previous-deployment-id>`. See Deployment workflow – Rollback.
- Check: Dashboard → Workers → Logs; check env/secrets and Neon connectivity.
3.2 Frontend (Pages) down or bad deploy
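- Rollback Pages: `npx wrangler pages deployment list --project-name <project-name>` → `npx wrangler pages deployment rollback <deployment-id>`. If your Wrangler version does not provide the rollback subcommand, roll back from the dashboard instead (Workers & Pages → project → Deployments → Rollback to this deployment). See Deployment workflow – Rollback.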
3.3 Database (Neon) issues
- Connection / pool: Check the Neon dashboard (connections, latency). Workers have no restart command; redeploy the Worker if you need to force fresh connections.
- Data corruption / bad migration: Use Neon point-in-time recovery (PITR) to restore to a time before the incident. If backups are copied to R2 (see §5), consider a restore from R2 as a secondary option.
- Rollback schema: Prefer reversible migrations; if a migration is not reversible, run a custom rollback script and then redeploy the app to match the schema.
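A quick way to separate "Neon is unreachable" from "the Worker is misbehaving" is to connect from an operator machine. The following is a minimal sketch, assuming `psql` is installed and that `DATABASE_URL` (a hypothetical variable name) holds the Neon connection string; `check_db` is an illustrative helper, not part of the existing tooling.

```shell
#!/usr/bin/env bash
# Sketch: Neon connectivity and connection-count check from an operator machine.
# Assumption: DATABASE_URL holds the Neon connection string (hypothetical name).

check_db() {
  local count
  # -t prints tuples only and -A disables alignment, so the result is a bare number.
  # PGCONNECT_TIMEOUT is libpq's connection timeout in seconds.
  if count="$(PGCONNECT_TIMEOUT=5 psql "$DATABASE_URL" -tA \
      -c 'SELECT count(*) FROM pg_stat_activity;')"; then
    echo "db ok: ${count} connections"
  else
    echo "db unreachable" >&2
    return 1
  fi
}
```

Run `check_db` after exporting `DATABASE_URL`. A non-zero exit status points at the database or network rather than at the Worker code.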
3.4 Durable Objects (sessions / cache)
- DO unhealthy / stuck: Each DO instance is addressed by ID within a namespace; identify the affected instance from logs if possible. Redeploying Workers does not clear a DO's persisted storage; if a single DO is bad, it may require a code fix (e.g. a reset path that clears that instance's state) or Cloudflare support.
- Cache inconsistency: Invalidate affected keys or namespaces; if critical, consider short TTL or bypass cache until fixed.
3.5 R2 (storage) issues
- Upload/read failures: Check R2 binding and bucket; check Worker limits (payload, CPU time). If bucket full or misconfigured, fix in dashboard and redeploy if needed.
- Backup copy to R2 failing: See §5; check cron/script and R2 credentials.
3.6 Auth / sessions broken
- JWT/secret rotation: If secret was rotated, all existing tokens invalid; users must re-login. Announce if possible.
- Session DO issues: See §3.4; if DO is the cause, fix DO or session layer.
4. Rollback Reference
- Workers: `npx wrangler deployments list` → `npx wrangler rollback <id>`.
- Pages: `npx wrangler pages deployment list --project-name <project>` → `npx wrangler pages deployment rollback <id>` (or roll back from the dashboard if your Wrangler version lacks this subcommand).
- DB: Neon PITR or restore from backup (Neon or R2).
- Full details: Deployment workflow – Rollback procedures.
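After any rollback it helps to confirm that the service is answering again before closing the incident. The sketch below polls an HTTP endpoint until it returns 200. The function name `wait_healthy` and the idea of a dedicated health URL are assumptions for illustration; use whatever endpoint the Deployment workflow defines for health checks.

```shell
#!/usr/bin/env bash
# Sketch: poll an endpoint after a rollback until it returns HTTP 200.
# Assumption: the URL passed in is a health endpoint of your deployment (hypothetical).

wait_healthy() {
  local url="$1" attempts="${2:-5}" delay="${3:-2}" i status
  for ((i = 1; i <= attempts; i++)); do
    # -s silences progress, -o discards the body, -w prints only the status code,
    # --max-time caps each attempt at 5 seconds.
    status="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url" || true)"
    if [ "$status" = "200" ]; then
      echo "healthy after ${i} attempt(s)"
      return 0
    fi
    # Wait between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
  done
  echo "still unhealthy after ${attempts} attempts (last status: ${status:-none})" >&2
  return 1
}
```

Example: `npx wrangler rollback <id> && wait_healthy "https://<your-api-host>/<health-path>"`, where the host and path are placeholders for your own deployment.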
5. Backup Strategy and R2
5.1 Current (Neon)
- Neon: Built-in backups and PITR (see Neon docs). Retention (e.g. 30 days) is configured in Neon.
- Code/config: Git + deployment history (Cloudflare retains deployment history).
5.2 R2 backup copy (recommended plan)
Use R2 as a secondary copy of backups for disaster recovery and optional institutional retention (e.g. 30–90 days). This keeps backups under your control and allows restores even if Neon is unavailable.
Options:
| Mode | How | When to use |
|---|---|---|
| Automated | Cron-triggered Worker or external job: export from Neon (e.g. pg_dump via Neon’s connection string) and upload to R2. | Daily or weekly; good for “set and forget.” |
| Manual (admin) | Admin-triggered action (e.g. “Create backup” in admin UI or script) that runs export + R2 upload. | Before major changes, or when institution requests a snapshot. |
Implementation outline:
- R2 bucket: Dedicated bucket (e.g. `cepatedge-backups`), lifecycle rule optional (e.g. delete after 90 days).
- Export: Use Neon’s Postgres connection from a secure environment (Worker with Neon binding, or a small Node script in CI/admin context): `pg_dump` (or Neon’s export API if available) → compressed file (e.g. `.sql.gz`). Note that `pg_dump` is a native binary and cannot run inside a Worker, so the `pg_dump` path needs the CI/admin environment.
- Upload to R2: Key format e.g. `backups/db/YYYY-MM-DD-HHMM.sql.gz` (and optionally `backups/db/latest.sql.gz`).
- Access control: Only deployment/admin credentials can read/write; no public access.
- Restore: Download from R2 and restore into Neon (or another Postgres) using `psql`/Neon restore process.
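The outline above can be sketched as a small operator script. This is a sketch under stated assumptions, not the project's actual backup tooling: `DATABASE_URL` and `BACKUP_BUCKET` are hypothetical variable names, `backup_key` and `backup_db` are illustrative helpers, and `pg_dump`, `gzip`, and an authenticated Wrangler are assumed to be available on the machine running it.

```shell
#!/usr/bin/env bash
# Sketch: dump Neon with pg_dump, compress, and upload to R2 via Wrangler.
# Assumptions: DATABASE_URL and BACKUP_BUCKET are hypothetical variable names;
# the default bucket name comes from the example in this runbook.
set -o pipefail  # make a pg_dump failure fail the whole dump pipeline

backup_key() {
  # Build the object key from a UTC timestamp in YYYY-MM-DD-HHMM form.
  local stamp="${1:-$(date -u +%Y-%m-%d-%H%M)}"
  echo "backups/db/${stamp}.sql.gz"
}

backup_db() {
  local bucket="${BACKUP_BUCKET:-cepatedge-backups}"
  local key tmp
  key="$(backup_key)"
  tmp="$(mktemp)"
  # --no-owner keeps the dump restorable under a different database role.
  if ! pg_dump --no-owner "$DATABASE_URL" | gzip > "$tmp"; then
    rm -f "$tmp"
    echo "dump failed" >&2
    return 1
  fi
  # Some Wrangler versions need an extra flag to target the remote bucket
  # rather than local storage; check `npx wrangler r2 object put --help`.
  if ! npx wrangler r2 object put "${bucket}/${key}" --file "$tmp"; then
    rm -f "$tmp"
    echo "upload failed" >&2
    return 1
  fi
  rm -f "$tmp"
  echo "uploaded ${bucket}/${key}"
}
```

Calling `backup_db` performs one dump and upload. Wrangler may cap the size of a single uploaded file, so a large database could need an S3-compatible client against the R2 endpoint instead.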
Manual admin backup: Same flow triggered by:
- Script run by operator (e.g. `pnpm run backup:db` that dumps and uploads to R2), or
- Future admin UI “Create backup” that calls an internal API secured for admins only.
Retention: Align with Data retention and privacy (e.g. 30–90 days for R2 backup copies).
6. Restore from R2 backup (when applicable)
- Download object from R2 (e.g. `backups/db/YYYY-MM-DD-HHMM.sql.gz`).
- Decompress and restore: `gunzip -c file.sql.gz | psql <connection-string>` (or Neon’s restore method).
- Verify data and then point Workers back at DB (or fix DB connection if restore was to new instance).
- Document in post-mortem if this was used for a real incident.
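The restore steps above can be wrapped so that a corrupt archive or a failing statement stops the restore early. This is a minimal sketch, assuming an authenticated Wrangler and `psql` on the operator machine; `restore_db` and `BACKUP_BUCKET` are hypothetical names introduced for illustration.

```shell
#!/usr/bin/env bash
# Sketch: download a backup from R2, verify it, and restore it with psql.
# Assumptions: BACKUP_BUCKET is a hypothetical variable name; the target
# connection string is passed as the second argument.
set -o pipefail  # a gunzip failure should fail the restore pipeline

restore_db() {
  local key="$1" target_url="$2"
  local bucket="${BACKUP_BUCKET:-cepatedge-backups}"
  local tmp
  tmp="$(mktemp)"
  if ! npx wrangler r2 object get "${bucket}/${key}" --file "$tmp"; then
    rm -f "$tmp"
    echo "download failed: ${key}" >&2
    return 1
  fi
  # Check archive integrity before touching the database.
  if ! gzip -t "$tmp"; then
    rm -f "$tmp"
    echo "archive is corrupt: ${key}" >&2
    return 1
  fi
  # ON_ERROR_STOP aborts at the first failing statement, and
  # --single-transaction rolls back everything if the restore fails midway.
  gunzip -c "$tmp" | psql --set ON_ERROR_STOP=1 --single-transaction "$target_url"
  local rc=$?
  rm -f "$tmp"
  return "$rc"
}
```

Example: `restore_db "backups/db/YYYY-MM-DD-HHMM.sql.gz" "<connection-string>"`. Restoring into a fresh Neon branch or a scratch database first lets you verify the data before pointing Workers at it.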
7. Useful Links
- Deployment workflow — Rollback, health checks, backup strategy.
- Security Overview – Incident response — Classification and process.
- Data retention and privacy — Retention and backups.
- System map — Components and data flow.