Runbook: Incidents, Recovery, and Backups

Operational runbook for CepatEdge on Cloudflare (Workers, Pages, R2, Durable Objects) and Neon. Use this when something goes wrong or when performing backups/restores. Rollback details are in Deployment workflow; security incident process is in Security Overview.

1. Incident Classification

Severity	Meaning	Examples
P0 – Critical	System down or data loss	API/frontend unreachable; DB unreachable; data corruption.
P1 – High	Major feature broken or security issue	Auth broken; maintenance workflow broken; confirmed breach.
P2 – Medium	Degraded or partial failure	Slow responses; one workflow failing; cache issues.
P3 – Low	Minor/cosmetic	UI glitch; non-critical feature broken.
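
The table above can be read as a simple decision order: check the most severe condition first. A minimal sketch of that ordering follows; the signal names are hypothetical and should be adapted to whatever your monitoring actually reports.

```typescript
// Hypothetical signal names; none of these come from an existing API.
type Severity = "P0" | "P1" | "P2" | "P3";

interface IncidentSignals {
  systemDown: boolean;           // API/frontend or DB unreachable
  dataLossOrCorruption: boolean; // data loss or corruption observed
  securityIssue: boolean;        // e.g. confirmed breach
  majorFeatureBroken: boolean;   // e.g. auth or maintenance workflow broken
  degraded: boolean;             // slow responses, one workflow failing, cache issues
}

// Evaluate from most to least severe so the highest matching level wins.
function classifyIncident(s: IncidentSignals): Severity {
  if (s.systemDown || s.dataLossOrCorruption) return "P0";
  if (s.securityIssue || s.majorFeatureBroken) return "P1";
  if (s.degraded) return "P2";
  return "P3";
}
```

When several signals are true at once, the incident takes the highest severity that matches.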

2. Response Flow

Detect — Monitoring, alerts, or user report.
Assess — P0/P1/P2/P3; impact (who, how many, which feature).
Communicate — Notify stakeholders; if P0/P1, start status updates (internal and, if agreed, users).
Contain — Stop the bleed: rollback, disable feature, revoke token, scale/restart if applicable.
Resolve — Fix root cause, deploy, verify.
Recover — Restore data if needed (from backup); see sections below.
Post-mortem — Short write-up: what happened, cause, what we changed (runbook, monitoring, code).

3. Quick Actions by Component

3.1 API (Workers) down or bad deploy

Rollback Workers:
npx wrangler deployments list → npx wrangler rollback <previous-deployment-id>
See Deployment workflow – Rollback.
Check: Dashboard → Workers → Logs; check env/secrets and Neon connectivity.
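
The env/secrets and Neon connectivity checks can be scripted so they run the same way after every rollback. The sketch below is one possible shape, not the project's actual health endpoint: `REQUIRED_VARS`, `checkHealth`, and the variable names `DATABASE_URL` and `JWT_SECRET` are assumptions, and the database ping is injected so the function stays independent of any particular Postgres driver.

```typescript
// Hypothetical list of variables the API needs; replace with the real ones.
const REQUIRED_VARS = ["DATABASE_URL", "JWT_SECRET"];

interface HealthReport {
  ok: boolean;
  missingVars: string[];
  dbReachable: boolean;
  dbError: string | null;
}

// pingDb is supplied by the caller, e.g. a function that runs `SELECT 1`
// against Neon and rejects on failure.
async function checkHealth(
  env: Record<string, string | undefined>,
  pingDb: () => Promise<void>,
): Promise<HealthReport> {
  const missingVars = REQUIRED_VARS.filter((name) => !env[name]);
  let dbReachable = false;
  let dbError: string | null = null;
  try {
    await pingDb();
    dbReachable = true;
  } catch (err) {
    dbError = err instanceof Error ? err.message : String(err);
  }
  return { ok: missingVars.length === 0 && dbReachable, missingVars, dbReachable, dbError };
}
```

A Worker route could return this report as JSON, which gives a quick answer to "is it config or is it the database?" before digging into logs.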

3.2 Frontend (Pages) down or bad deploy

Rollback Pages:
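npx wrangler pages deployment list --project-name <project-name> → roll back to the last good deployment in the dashboard (Workers & Pages → project → Deployments → Rollback to this deployment). If your Wrangler version provides a Pages rollback subcommand (check npx wrangler pages deployment --help), that works too.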
See Deployment workflow – Rollback.

3.3 Database (Neon) issues

Connection / pool: Check the Neon dashboard (connections, latency). Workers cannot be restarted as such; if connections look stuck, redeploy the Worker so it opens fresh connections.
Data corruption / bad migration: Use Neon point-in-time recovery (PITR) to restore to a time before the incident. If backups are copied to R2 (see §5), consider restore from R2 as a secondary option.
Rollback schema: Prefer reversible migrations; if not, run custom rollback script and then redeploy app to match.
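
For PITR, the restore target should sit slightly before the first sign of the incident rather than exactly on it. A small sketch of that calculation, assuming a hypothetical helper name `pitrTarget` and a default five-minute safety margin (choose a margin that fits how precisely the incident start is known):

```typescript
// Returns an ISO 8601 UTC timestamp to use as the restore point.
function pitrTarget(incidentStart: Date, safetyMarginMinutes: number = 5): string {
  if (Number.isNaN(incidentStart.getTime())) {
    throw new Error("invalid incident start time");
  }
  if (safetyMarginMinutes < 0) {
    throw new Error("safety margin must not be negative");
  }
  const target = new Date(incidentStart.getTime() - safetyMarginMinutes * 60_000);
  return target.toISOString();
}
```

The returned timestamp still has to fall inside the retention window configured in Neon; anything older cannot be restored with PITR.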

3.4 Durable Objects (sessions / cache)

DO unhealthy / stuck: Each DO instance is addressed by its ID within a namespace; identify the affected instance if possible (logs). Redeploying Workers does not clear DO storage; if a single DO is bad, it may require a code fix (e.g. a reset path that clears that instance's storage) or Cloudflare support.
Cache inconsistency: Invalidate affected keys or namespaces; if critical, consider short TTL or bypass cache until fixed.
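
One way to recover a stuck instance without a full redeploy is an admin-only reset path inside the DO that clears its own storage. The sketch below shows only the decision logic; `SessionObjectCore`, the `/__reset` path, and the token check are hypothetical, and `StorageLike` is a minimal stand-in for the Durable Object storage API (which exposes `deleteAll()` among other methods).

```typescript
// Minimal slice of the Durable Object storage API used by this sketch.
interface StorageLike {
  deleteAll(): Promise<void>;
}

// Hypothetical core of a session DO; a real fetch() handler would wrap this.
class SessionObjectCore {
  private storage: StorageLike;
  private adminToken: string;

  constructor(storage: StorageLike, adminToken: string) {
    this.storage = storage;
    this.adminToken = adminToken;
  }

  // Returns an HTTP-style status code for the reset request.
  async handleReset(method: string, path: string, token: string | null): Promise<number> {
    if (path !== "/__reset") return 404;
    if (method !== "POST") return 405;
    if (token !== this.adminToken) return 403;
    await this.storage.deleteAll(); // clears this instance's storage only
    return 204;
  }
}
```

Because the reset runs inside the instance, it affects only the DO that receives the request; other sessions or cache entries in the same namespace are untouched.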

3.5 R2 (storage) issues

Upload/read failures: Check R2 binding and bucket; check Worker limits (payload, CPU time). If bucket full or misconfigured, fix in dashboard and redeploy if needed.
Backup copy to R2 failing: See §5; check cron/script and R2 credentials.

3.6 Auth / sessions broken

JWT/secret rotation: If the secret was rotated, all existing tokens become invalid and users must log in again. Announce this if possible.
Session DO issues: See §3.4; if DO is the cause, fix DO or session layer.

4. Rollback Reference

Workers: npx wrangler deployments list → npx wrangler rollback <id>.
Pages: npx wrangler pages deployment list --project-name <project> → roll back to the chosen deployment in the dashboard (or via the CLI if your Wrangler version supports it).
DB: Neon PITR or restore from backup (Neon or R2).
Full details: Deployment workflow – Rollback procedures.

5. Backup Strategy and R2

5.1 Current (Neon)

Neon: Built-in backups and PITR (see Neon docs). Retention (e.g. 30 days) is configured in Neon.
Code/config: Git + deployment history (Cloudflare retains deployment history).

5.2 R2 backup copy (recommended plan)

Use R2 as a secondary copy of backups for disaster recovery and optional institutional retention (e.g. 30–90 days). This keeps backups under your control and allows restores even if Neon is unavailable.

Options:

Mode	How	When to use
Automated	Cron-triggered Worker or external job: export from Neon (e.g. `pg_dump` via Neon’s connection string) and upload to R2.	Daily or weekly; good for “set and forget.”
Manual (admin)	Admin-triggered action (e.g. “Create backup” in admin UI or script) that runs export + R2 upload.	Before major changes, or when institution requests a snapshot.

Implementation outline:

R2 bucket: Dedicated bucket (e.g. cepatedge-backups), lifecycle rule optional (e.g. delete after 90 days).
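Export: Connect to Neon's Postgres from a secure environment and write a compressed dump (e.g. .sql.gz):
- pg_dump (or Neon's export API if available) → compressed file. pg_dump is a native binary, so it must run where binaries can execute (a small script in a CI/admin context); a Worker cannot run it directly and would have to trigger such a job or export via SQL queries instead.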
Upload to R2: Key format e.g. backups/db/YYYY-MM-DD-HHMM.sql.gz (and optionally backups/db/latest.sql.gz).
Access control: Only deployment/admin credentials can read/write; no public access.
Restore: Download from R2 and restore into Neon (or another Postgres) using psql/Neon restore process.
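
The key format and the optional "latest" copy from the outline above can be sketched as follows. This is an illustration under assumptions, not the project's backup script: `backupKey` and `uploadBackup` are hypothetical names, timestamps are taken in UTC, and `BucketLike` is a minimal stand-in for an R2 bucket binding's `put` method.

```typescript
// Minimal slice of an R2 bucket binding used by this sketch.
interface BucketLike {
  put(key: string, body: Uint8Array): Promise<unknown>;
}

function pad2(n: number): string {
  return String(n).padStart(2, "0");
}

// Builds backups/db/YYYY-MM-DD-HHMM.sql.gz from a UTC timestamp.
function backupKey(now: Date): string {
  const date = `${now.getUTCFullYear()}-${pad2(now.getUTCMonth() + 1)}-${pad2(now.getUTCDate())}`;
  const time = `${pad2(now.getUTCHours())}${pad2(now.getUTCMinutes())}`;
  return `backups/db/${date}-${time}.sql.gz`;
}

// Writes the dated object first, then refreshes the "latest" copy.
async function uploadBackup(bucket: BucketLike, dump: Uint8Array, now: Date): Promise<string> {
  const key = backupKey(now);
  await bucket.put(key, dump);
  await bucket.put("backups/db/latest.sql.gz", dump);
  return key;
}
```

Zero-padded, fixed-width keys sort chronologically when listed, which makes it easy to find the newest backup or everything older than a cutoff.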

Manual admin backup: Same flow triggered by:

Script run by operator (e.g. pnpm run backup:db that dumps and uploads to R2), or
Future admin UI “Create backup” that calls an internal API secured for admins only.

Retention: Align with Data retention and privacy (e.g. 30–90 days for R2 backup copies).
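
If retention is enforced by a script rather than an R2 lifecycle rule, the expired objects can be picked out from the key names alone. A minimal sketch, assuming the backups/db/YYYY-MM-DD-HHMM.sql.gz key format with UTC timestamps; `parseBackupTime` and `expiredBackups` are hypothetical helper names.

```typescript
const KEY_PATTERN = /^backups\/db\/(\d{4})-(\d{2})-(\d{2})-(\d{2})(\d{2})\.sql\.gz$/;

// Returns the UTC time encoded in a backup key, or null if the key does not match.
function parseBackupTime(key: string): Date | null {
  const m = KEY_PATTERN.exec(key);
  if (!m) return null;
  const [y, mo, d, h, mi] = m.slice(1).map(Number);
  return new Date(Date.UTC(y, mo - 1, d, h, mi));
}

// Lists keys whose timestamp is older than the retention window.
// Keys that do not match the pattern (e.g. latest.sql.gz) are never selected.
function expiredBackups(keys: string[], now: Date, retentionDays: number): string[] {
  const cutoff = now.getTime() - retentionDays * 24 * 60 * 60 * 1000;
  return keys.filter((key) => {
    const t = parseBackupTime(key);
    return t !== null && t.getTime() < cutoff;
  });
}
```

The function only selects keys; deleting them is left to the caller so a dry run can be reviewed first.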

6. Restore from R2 backup (when applicable)

Download object from R2 (e.g. backups/db/YYYY-MM-DD-HHMM.sql.gz).
Decompress and restore: gunzip -c file.sql.gz | psql <connection-string> (or Neon’s restore method).
Verify the data, then point Workers back at the DB (if the restore went to a new instance, update the connection string secret and redeploy).
Document in post-mortem if this was used for a real incident.
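
Choosing which object to download in the first step usually means "the newest backup taken before the incident started". A sketch of that selection, assuming the backups/db/YYYY-MM-DD-HHMM.sql.gz key format with UTC timestamps; `backupTimestamp` and `latestBackupBefore` are hypothetical names.

```typescript
const BACKUP_KEY_RE = /^backups\/db\/(\d{4})-(\d{2})-(\d{2})-(\d{2})(\d{2})\.sql\.gz$/;

// Returns the UTC time (ms since epoch) encoded in a key, or null if it does not match.
function backupTimestamp(key: string): number | null {
  const m = BACKUP_KEY_RE.exec(key);
  if (!m) return null;
  const [y, mo, d, h, mi] = m.slice(1).map(Number);
  return Date.UTC(y, mo - 1, d, h, mi);
}

// Picks the newest backup strictly older than the incident start, or null if none exists.
function latestBackupBefore(keys: string[], incidentStart: Date): string | null {
  let bestKey: string | null = null;
  let bestTime = -Infinity;
  for (const key of keys) {
    const t = backupTimestamp(key);
    if (t === null || t >= incidentStart.getTime()) continue;
    if (t > bestTime) {
      bestTime = t;
      bestKey = key;
    }
  }
  return bestKey;
}
```

The latest.sql.gz copy is deliberately ignored here, since it may already contain data written after the incident began.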

7. Useful Links

Deployment workflow — Rollback, health checks, backup strategy.
Security Overview – Incident response — Classification and process.
Data retention and privacy — Retention and backups.
System map — Components and data flow.
