
Runbook: Incidents, Recovery, and Backups

Operational runbook for CepatEdge on Cloudflare (Workers, Pages, R2, Durable Objects) and Neon. Use this when something goes wrong or when performing backups/restores. Rollback details are in Deployment workflow; security incident process is in Security Overview.


1. Incident Classification

Severity | Meaning | Examples
P0 – Critical | System down or data loss | API/frontend unreachable; DB unreachable; data corruption.
P1 – High | Major feature broken or security issue | Auth broken; maintenance workflow broken; confirmed breach.
P2 – Medium | Degraded or partial failure | Slow responses; one workflow failing; cache issues.
P3 – Low | Minor/cosmetic | UI glitch; non-critical feature broken.

2. Response Flow

  1. Detect — Monitoring, alerts, or user report.
  2. Assess — P0/P1/P2/P3; impact (who, how many, which feature).
  3. Communicate — Notify stakeholders; if P0/P1, start status updates (internal and, if agreed, users).
  4. Contain — Stop the bleed: rollback, disable feature, revoke token, scale/restart if applicable.
  5. Resolve — Fix root cause, deploy, verify.
  6. Recover — Restore data if needed (from backup); see sections below.
  7. Post-mortem — Short write-up: what happened, cause, what we changed (runbook, monitoring, code).

3. Quick Actions by Component

3.1 API (Workers) down or bad deploy

  • Rollback Workers:
    npx wrangler deployments list
    npx wrangler rollback <previous-deployment-id>
    See Deployment workflow – Rollback.
  • Check: Dashboard → Workers → Logs; check env/secrets and Neon connectivity.
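A minimal sketch of pairing the rollback with a quick health check, so the operator knows whether the rollback actually restored service. The health URL is a hypothetical endpoint and the function names are illustrative; `npx wrangler rollback` may also prompt for confirmation depending on the Wrangler version.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the health URL are assumptions, not part of the project.

# Succeeds when an HTTP status code is in the 2xx range.
is_healthy_status() {
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Prints only the HTTP status code for a URL (curl prints 000 if the connection fails).
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}

# Rolls back to the given deployment ID, then checks the health URL once.
# Args: <deployment-id> <health-url>
rollback_and_verify() {
  local deployment_id="$1" health_url="$2" status
  npx wrangler rollback "$deployment_id" || return 1
  status="$(http_status "$health_url")"
  if is_healthy_status "$status"; then
    echo "rollback OK (HTTP $status)"
  else
    echo "still unhealthy after rollback (HTTP $status)" >&2
    return 1
  fi
}
```

Usage would look like `rollback_and_verify <previous-deployment-id> https://api.example.com/health`, where the URL stands in for whatever endpoint the API really exposes. If the check still fails after rollback, the cause is more likely env/secrets or Neon connectivity than the deploy itself.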

3.2 Frontend (Pages) down or bad deploy

  • Rollback Pages:
    npx wrangler pages deployment list <project-name>
    npx wrangler pages deployment rollback <deployment-id>
    See Deployment workflow – Rollback.

3.3 Database (Neon) issues

  • Connection / pool: Check the Neon dashboard (connections, latency); restart (redeploy) the Worker if needed to force new connections.
  • Data corruption / bad migration: Use Neon point-in-time recovery (PITR) to restore to a time before the incident. If backups are copied to R2 (see §5), consider restoring from R2 as a secondary option.
  • Rollback schema: Prefer reversible migrations; if a migration is not reversible, run a custom rollback script and then redeploy the app to match.
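For the connection check, a quick probe from an operator machine can complement the Neon dashboard. The following is a sketch that assumes `psql` is installed and that the Neon connection string is passed in as an argument; the function names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are assumptions; pass the Neon connection string as $1.

# Masks the password in a connection URL so it can be pasted into incident notes.
redact_url() {
  printf '%s\n' "$1" | sed -E 's#(://[^:/@]+):[^@]*@#\1:***@#'
}

# Runs a trivial query; succeeds only if the database answers with "1".
db_ping() {
  local result
  result="$(psql -X -At -c 'SELECT 1' "$1" 2>/dev/null)" || return 1
  [ "$result" = "1" ]
}

# Prints the number of sessions currently open against this database.
db_connection_count() {
  psql -X -At -c "SELECT count(*) FROM pg_stat_activity WHERE datname = current_database()" "$1"
}
```

A failing `db_ping` points to reachability or credentials, while a successful ping with a high `db_connection_count` suggests pool exhaustion rather than an outage.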

3.4 Durable Objects (sessions / cache)

  • DO unhealthy / stuck: DOs are per-namespace; identify the affected instance from logs if possible. Restarting Workers does not clear DO state; if a single DO is bad, it may require a code fix (e.g. reset/evict that namespace) or contacting support.
  • Cache inconsistency: Invalidate affected keys or namespaces; if critical, consider short TTL or bypass cache until fixed.

3.5 R2 (storage) issues

  • Upload/read failures: Check the R2 binding and bucket; check Worker limits (payload, CPU time). If the bucket is full or misconfigured, fix it in the dashboard and redeploy if needed.
  • Backup copy to R2 failing: See §5; check cron/script and R2 credentials.

3.6 Auth / sessions broken

  • JWT/secret rotation: If the secret was rotated, all existing tokens are invalid and users must log in again. Announce this if possible.
  • Session DO issues: See §3.4; if DO is the cause, fix DO or session layer.

4. Rollback Reference

  • Workers: npx wrangler deployments list, then npx wrangler rollback <id>.
  • Pages: npx wrangler pages deployment list <project>, then npx wrangler pages deployment rollback <id>.
  • DB: Neon PITR or restore from backup (Neon or R2).
  • Full details: Deployment workflow – Rollback procedures.

5. Backup Strategy and R2

5.1 Current (Neon)

  • Neon: Built-in backups and PITR (see Neon docs). Retention (e.g. 30 days) is configured in Neon.
  • Code/config: Git + deployment history (Cloudflare retains deployment history).

Use R2 as a secondary copy of backups for disaster recovery and optional institutional retention (e.g. 30–90 days). This keeps backups under your control and allows restores even if Neon is unavailable.

Options:

Mode | How | When to use
Automated | Cron-triggered Worker or external job: export from Neon (e.g. pg_dump via Neon’s connection string) and upload to R2. | Daily or weekly; good for “set and forget.”
Manual (admin) | Admin-triggered action (e.g. “Create backup” in admin UI or script) that runs export + R2 upload. | Before major changes, or when institution requests a snapshot.

Implementation outline:

  1. R2 bucket: Dedicated bucket (e.g. cepatedge-backups), lifecycle rule optional (e.g. delete after 90 days).
  2. Export: Use Neon’s Postgres connection from a secure environment (Worker with Neon binding, or a small Node script in CI/admin context):
    • pg_dump (or Neon’s export API if available) → compressed file (e.g. .sql.gz).
  3. Upload to R2: Key format e.g. backups/db/YYYY-MM-DD-HHMM.sql.gz (and optionally backups/db/latest.sql.gz).
  4. Access control: Only deployment/admin credentials can read/write; no public access.
  5. Restore: Download from R2 and restore into Neon (or another Postgres) using psql/Neon restore process.
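The export and upload steps above could be sketched as a small operator script. This assumes `pg_dump`, `gzip`, and Wrangler are available where it runs (CI or an admin machine, not inside a Worker); the function names and the bucket argument are illustrative, and newer Wrangler versions may need `--remote` on the R2 command to target the real bucket.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are assumptions; the key format follows the outline above.

# Builds the R2 object key for a timestamp in YYYY-MM-DD-HHMM form.
backup_key() {
  printf 'backups/db/%s.sql.gz\n' "$1"
}

# Dumps the database, compresses the dump, and uploads it to R2.
# Args: <connection-string> <bucket>
backup_db() {
  local conn="$1" bucket="$2" stamp key dir rc=0
  stamp="$(date -u +%Y-%m-%d-%H%M)"          # UTC keeps keys sortable and unambiguous
  key="$(backup_key "$stamp")"
  dir="$(mktemp -d)" || return 1
  # Dump to a file first so a failed pg_dump is never uploaded as a "backup".
  if pg_dump --dbname="$conn" > "$dir/dump.sql" && gzip "$dir/dump.sql"; then
    npx wrangler r2 object put "$bucket/$key" --file "$dir/dump.sql.gz" || rc=$?
  else
    rc=1
  fi
  rm -rf "$dir"                               # never leave dumps lying around on disk
  return "$rc"
}
```

Something like `backup_db "$DATABASE_URL" cepatedge-backups` would then serve both the scheduled job and a manual pre-change snapshot; `DATABASE_URL` is an assumed variable name.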

Manual admin backup: Same flow triggered by:

  • Script run by operator (e.g. pnpm run backup:db that dumps and uploads to R2), or
  • Future admin UI “Create backup” that calls an internal API secured for admins only.

Retention: Align with Data retention and privacy (e.g. 30–90 days for R2 backup copies).
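If retention is enforced by a script rather than an R2 lifecycle rule, the date embedded in each key is enough to decide what has expired. A minimal sketch, assuming keys follow the backups/db/YYYY-MM-DD-HHMM.sql.gz format; the function names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are assumptions; dates are compared as YYYYMMDD integers.

# Extracts the YYYY-MM-DD date from a key such as backups/db/2024-01-02-0304.sql.gz.
backup_date() {
  basename "$1" | cut -c1-10
}

# Succeeds when the key's date is strictly older than the cutoff (YYYY-MM-DD).
# Keys without a leading date (e.g. latest.sql.gz) are never treated as expired.
is_expired() {
  local d
  d="$(backup_date "$1")"
  case "$d" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
    *) return 1 ;;
  esac
  [ "$(printf '%s' "$d" | tr -d '-')" -lt "$(printf '%s' "$2" | tr -d '-')" ]
}

# Reads keys on stdin and prints only those older than the cutoff date.
expired_keys() {
  local cutoff="$1" key
  while IFS= read -r key; do
    if is_expired "$key" "$cutoff"; then
      printf '%s\n' "$key"
    fi
  done
}
```

The output of `expired_keys` can be reviewed before anything is deleted. With GNU date, a 90-day cutoff could be computed as `date -u -d '90 days ago' +%Y-%m-%d`; other platforms need a different invocation.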


6. Restore from R2 backup (when applicable)

  1. Download object from R2 (e.g. backups/db/YYYY-MM-DD-HHMM.sql.gz).
  2. Decompress and restore: gunzip -c file.sql.gz | psql <connection-string> (or Neon’s restore method).
  3. Verify the data, then point Workers back at the DB (or update the DB connection if the restore went to a new instance).
  4. Document in post-mortem if this was used for a real incident.
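The restore steps can be sketched as one function that refuses to touch the database unless the download is a valid gzip file. This assumes Wrangler, `gzip`, and `psql` are available on the operator machine; the function names are illustrative, `--single-transaction` is a design choice rather than something this runbook mandates, and newer Wrangler versions may need `--remote` on the R2 command.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are assumptions, not existing project scripts.

# Succeeds when the file is non-empty and a complete, uncorrupted gzip stream.
verify_dump() {
  [ -s "$1" ] && gzip -t "$1" 2>/dev/null
}

# Downloads a backup from R2, verifies it, and restores it into the target database.
# Args: <bucket> <key> <connection-string>
restore_db() {
  local bucket="$1" key="$2" conn="$3" dir file rc=0
  dir="$(mktemp -d)" || return 1
  file="$dir/$(basename "$key")"
  npx wrangler r2 object get "$bucket/$key" --file "$file" || rc=$?
  if [ "$rc" -eq 0 ] && ! verify_dump "$file"; then
    echo "downloaded backup is empty or corrupt: $key" >&2
    rc=1
  fi
  if [ "$rc" -eq 0 ]; then
    # ON_ERROR_STOP aborts on the first failing statement; the single
    # transaction then rolls back instead of leaving a half-restored DB.
    gunzip -c "$file" | psql -X -v ON_ERROR_STOP=1 --single-transaction "$conn" || rc=$?
  fi
  rm -rf "$dir"
  return "$rc"
}
```

Restoring into a fresh Neon branch or database first, and switching Workers over only after verification, keeps the damaged data available for the post-mortem.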