Failure and Resilience Tests Plan
This document plans the next layer of tests beyond happy-path and basic permission checks: failure cases, permission escalation, invalid state transitions, and (where feasible) partial failure recovery. It aligns with the Testing Guide and defers maintenance-heavy work until maintenance services are stable (see Next Session: Maintenance Stabilization).
1. Goals
- Failure-case tests: Services and routes behave correctly when dependencies fail (DB, cache, storage, mailer).
- Permission-escalation tests: Users cannot perform actions above their role (e.g. employee cannot approve, non-owner cannot delete).
- Invalid state transition tests: Workflow steps reject invalid transitions (e.g. complete without assign, approve twice).
- Partial failure recovery: Where applicable, test that one failing part does not corrupt state (e.g. cache miss falls back to DB; failed email does not roll back request creation).
Not in scope here: full race-condition tests (concurrent approve/assign on the same request). These are deferred to Phase C; start with single-request consistency.
2. Current Coverage (Summary)
| Area | Already covered | Missing / to add |
|---|---|---|
| Auth | Login/logout/register/profile; logout session cleanup failure; role validation | DB down on login; invalid token paths; rate-limit behaviour |
| Permissions | Full permissions.test.ts (RBAC, status-based, null request/user) | Route-level “forbidden” for escalation (e.g. employee calls approve endpoint) |
| Maintenance | Queries/mutations/permissions/validation; DB errors handled gracefully; permission denial in mutations | Invalid state transitions (e.g. complete → approve); partial failure (e.g. cache fail, email fail) |
| Cache | DO behaviour, instance management, session cache | Cache miss → DB fallback; DO timeout or error handling |
| Storage | Storage unavailable, key helpers | Upload success but R2 fail; partial write |
| Mailer | Email send failure | Retry/partial failure (e.g. request created but email fails) |
3. Prioritised Plan
Phase A — Low effort, high value (do first)
Auth failure cases
- Login when DB is unavailable (mock DB throw) → expect 503 or 500 and no session.
- Login with invalid credentials → expect 401, no token.
- Request with expired or malformed JWT → expect 401.
- (Optional) Register with duplicate email → expect 4xx and clear message.
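The DB-unavailable and invalid-credentials cases above can be sketched as follows. This is a minimal illustration, not the project's auth service: `UserStore`, `LoginResult`, `login`, and both test doubles are hypothetical names, the password check is a plain string comparison standing in for real hashing, and 503 is chosen from the "503 or 500" the plan allows.

```typescript
// Hypothetical shapes for illustration; the real service and types will differ.
interface User { id: string; email: string; passwordHash: string }
interface UserStore { findByEmail(email: string): Promise<User | null> }
interface LoginResult { status: number; token?: string; error?: string }

// Sketch of a login handler: dependency failure maps to 503, bad credentials to 401.
async function login(store: UserStore, email: string, password: string): Promise<LoginResult> {
  let user: User | null = null;
  try {
    user = await store.findByEmail(email);
  } catch {
    // DB unavailable: no session or token may be issued.
    return { status: 503, error: "service unavailable" };
  }
  // Plain comparison stands in for real password verification.
  if (user === null || user.passwordHash !== password) {
    return { status: 401, error: "invalid credentials" };
  }
  return { status: 200, token: `token-for-${user.id}` };
}

// Test doubles: one store that throws, one that knows a single user.
const failingStore: UserStore = {
  findByEmail: async () => { throw new Error("DB down"); },
};
const workingStore: UserStore = {
  findByEmail: async (email) =>
    email === "a@example.com" ? { id: "u1", email, passwordHash: "secret" } : null,
};
```

In a real test file the doubles would be replaced by whatever mocking the Testing Guide prescribes; the assertions of interest are the status code and the absence of a token.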
Permission escalation at route level
- As employee, call approve/assign/complete on a request they are not permitted to act on → expect 403.
- As tenant, call assign or approve → expect 403.
- Use existing permission helpers; add one or two route-level (or integration-style) tests that hit the real route and assert 403.
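A minimal sketch of the escalation check, assuming a hypothetical role-to-action table and a `handleAction` stand-in for a route handler. The `manager` role and the exact permissions per role are assumptions made for the example; the real rules live in the existing permission helpers.

```typescript
type Role = "tenant" | "employee" | "manager";
type Action = "approve" | "assign" | "complete";

// Hypothetical role-to-action table, for illustration only.
const allowedActions: Record<Role, Action[]> = {
  tenant: [],
  employee: ["complete"],
  manager: ["approve", "assign", "complete"],
};

interface RouteResponse { status: number; body: string }

// Stand-in for a route handler: the permission check runs before any state is touched.
function handleAction(role: Role, action: Action): RouteResponse {
  if (allowedActions[role].indexOf(action) === -1) {
    return { status: 403, body: "forbidden" };
  }
  return { status: 200, body: `${action} ok` };
}
```

A route-level test would make the same assertions against the HTTP response rather than a direct function call, for example that an employee calling the approve endpoint receives 403.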
Invalid state transitions (maintenance)
- Once maintenance mutations are stable: “complete” without “assigned”; “approve” when already approved; “assign” when not in assignable status.
- Assert that the mutation returns an error or 4xx and that the stored state is unchanged (check the final state explicitly).
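The transition rules above can be expressed as a lookup table, which makes both the rejection and the "state unchanged" assertion easy to test. This sketch assumes a linear workflow (open, approved, assigned, completed); the status names and the `applyTransition` helper are hypothetical and may not match the maintenance mutations.

```typescript
type Status = "open" | "approved" | "assigned" | "completed";
type TransitionName = "approve" | "assign" | "complete";

// Assumed workflow for illustration: open -> approved -> assigned -> completed.
const nextStatus: Record<Status, Partial<Record<TransitionName, Status>>> = {
  open: { approve: "approved" },
  approved: { assign: "assigned" },
  assigned: { complete: "completed" },
  completed: {},
};

interface MaintenanceRequest { id: string; status: Status }
type TransitionResult =
  | { ok: true; request: MaintenanceRequest }
  | { ok: false; error: string; request: MaintenanceRequest };

// Returns a new request on success; on failure returns the original object untouched.
function applyTransition(request: MaintenanceRequest, t: TransitionName): TransitionResult {
  const target = nextStatus[request.status][t];
  if (target === undefined) {
    return { ok: false, error: `cannot ${t} from ${request.status}`, request };
  }
  return { ok: true, request: { id: request.id, status: target } };
}
```

With this shape, "complete without assign" and "approve twice" are both a single call followed by two assertions: `ok` is false, and `request.status` equals the status before the call.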
Phase B — Medium effort
Cache failure / fallback
- When cache get throws or returns “miss”, service falls back to DB and returns correct data.
- When cache set fails (e.g. DO error), request still succeeds (e.g. create/update still written to DB); optional: log and retry later.
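A read-through lookup with fallback might look like the sketch below. `KeyValueCache`, `RecordStore`, and `readThrough` are hypothetical names; the real cache is a Durable Object and the real store is the database, so the interfaces here only approximate their shape. The write-path case (create/update succeeding when cache set fails) follows the same pattern of swallowing the cache error.

```typescript
// Hypothetical interfaces for illustration only.
interface KeyValueCache {
  get(key: string): Promise<string | undefined>;
  set(key: string, value: string): Promise<void>;
}
interface RecordStore { load(key: string): Promise<string | undefined> }

// A cache error or miss falls back to the DB, and a failed cache write
// never fails the request.
async function readThrough(
  cache: KeyValueCache,
  store: RecordStore,
  key: string,
): Promise<string | undefined> {
  try {
    const hit = await cache.get(key);
    if (hit !== undefined) return hit;
  } catch {
    // Treat a cache error like a miss.
  }
  const value = await store.load(key);
  if (value !== undefined) {
    try {
      await cache.set(key, value);
    } catch {
      // Best effort only; real code would log here.
    }
  }
  return value;
}

// Test doubles: a cache that always throws, a cache that always misses, a tiny DB.
const brokenCache: KeyValueCache = {
  get: async () => { throw new Error("DO timeout"); },
  set: async () => { throw new Error("DO error"); },
};
const emptyCache: KeyValueCache = {
  get: async () => undefined,
  set: async () => {},
};
const dbStore: RecordStore = {
  load: async (key) => (key === "req-1" ? "open" : undefined),
};
```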
Storage partial failure
- Avatar/attachment: DB update succeeds but R2 put fails → either rollback DB or mark “upload failed” and return clear error (match current design).
- Attachment list when R2 is unavailable for one key → list still returns metadata; file endpoint returns 503 for that key.
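The plan leaves open whether a failed R2 put rolls back the DB write or marks the row as failed. The sketch below shows only the second option, under assumptions: `ObjectBucket`, `AttachmentRow`, `saveAttachment`, the `upload_failed` status value, and the 201/503 status codes are all hypothetical, and the in-memory array stands in for the attachments table.

```typescript
// Hypothetical stand-in for the R2 binding; only a put method is modelled.
interface ObjectBucket { put(key: string, data: string): Promise<void> }
type UploadStatus = "pending" | "stored" | "upload_failed";
interface AttachmentRow { key: string; uploadStatus: UploadStatus }

// In-memory stand-in for the attachments table.
const attachmentRows: AttachmentRow[] = [];

async function saveAttachment(
  bucket: ObjectBucket,
  key: string,
  data: string,
): Promise<{ status: number; row: AttachmentRow }> {
  const row: AttachmentRow = { key, uploadStatus: "pending" };
  attachmentRows.push(row); // metadata is written first
  try {
    await bucket.put(key, data);
    row.uploadStatus = "stored";
    return { status: 201, row };
  } catch {
    // Keep the row but mark it, so listings still work and the failure is visible.
    row.uploadStatus = "upload_failed";
    return { status: 503, row };
  }
}

const failingBucket: ObjectBucket = {
  put: async () => { throw new Error("R2 unavailable"); },
};
const workingBucket: ObjectBucket = { put: async () => {} };
```

If the current design rolls back instead, the test asserts that no row remains after the failed put rather than checking the status field.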
Mailer partial failure
- Maintenance assigned (or similar) → DB updated, but email send fails → request state remains correct; user sees success; log or queue retry (if we add queue later, test that path).
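A sketch of the mailer case, assuming a hypothetical `assignRequest` function with an injected `Mailer`. The in-memory `requestState` object stands in for the DB, and `failedNotifications` stands in for whatever log or retry queue is eventually used; neither name comes from the codebase.

```typescript
interface Mailer { send(to: string, subject: string): Promise<void> }
interface AssignResult { status: "assigned"; emailSent: boolean }

// In-memory stand-in for the DB row state.
const requestState: { [id: string]: string } = { "req-1": "approved" };
// Failed notifications are collected so a later retry path could pick them up.
const failedNotifications: string[] = [];

async function assignRequest(id: string, assignee: string, mailer: Mailer): Promise<AssignResult> {
  requestState[id] = "assigned"; // the state change is committed first
  try {
    await mailer.send(assignee, `Request ${id} assigned`);
    return { status: "assigned", emailSent: true };
  } catch {
    // Email failure must not undo the assignment; record it for follow-up.
    failedNotifications.push(id);
    return { status: "assigned", emailSent: false };
  }
}

const failingMailer: Mailer = {
  send: async () => { throw new Error("mail provider unavailable"); },
};
```

The test then asserts three things: the caller sees success, the stored status is "assigned", and the failure was recorded somewhere observable.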
Phase C — When we have time
Race conditions (later)
- Two concurrent “approve” or “assign” on same request → one succeeds, one gets conflict or 409.
- Requires concurrency in test (e.g. Promise.all two requests); define expected behaviour first (last-write-wins vs first-wins vs 409).
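Once the expected behaviour is decided, the test shape is small. The sketch below assumes the "first wins, second gets 409" option and uses a version counter as the guard; `VersionedRequest`, `storedRequest`, and `approveWithVersion` are hypothetical, and the in-memory object only approximates what a conditional DB update would do.

```typescript
interface VersionedRequest { status: string; version: number }

// In-memory stand-in for the stored row.
const storedRequest: VersionedRequest = { status: "open", version: 1 };

// Compare-and-set: the update applies only if the caller saw the current version.
async function approveWithVersion(expectedVersion: number): Promise<number> {
  if (storedRequest.version !== expectedVersion || storedRequest.status !== "open") {
    return 409; // someone else changed the request first
  }
  storedRequest.status = "approved";
  storedRequest.version += 1;
  return 200;
}

// Two callers that both read version 1 and then race to approve.
async function raceTwoApprovals(): Promise<number[]> {
  return Promise.all([approveWithVersion(1), approveWithVersion(1)]);
}
```

Because the check and the write here run without an intervening `await`, the in-memory version cannot interleave; a test against the real DB is what would exercise an actual race.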
Partial failure recovery (deeper)
- Multi-step flow: e.g. create request → create attachment → if attachment fails, request still exists and is consistent (no orphan attachments; or orphan cleanup documented).
4. Where to Put These Tests
- Auth: auth/__tests__/login.test.ts, logout.test.ts, etc. Add describe('failure cases', …) or new auth/__tests__/failure.test.ts if it grows.
- Permissions / escalation: Reuse maintenance/__tests__/permissions.test.ts for logic; add route-level or integration tests (e.g. in routes/maintenance/ or a single __tests__/maintenance-escalation.test.ts) that call HTTP and assert 403.
- Invalid state transitions: maintenance/__tests__/mutations.test.ts or validation.test.ts, e.g. describe('invalid state transitions', …).
- Cache fallback: cache/__tests__/cache.test.ts or maintenance/__tests__/queries.test.ts (when cache is wired); mock DO to throw or return miss, assert DB is used and result correct.
- Storage: storage/__tests__/storage.test.ts, maintenance/__tests__/attachment.test.ts; mock R2 to fail after DB update and assert behaviour.
- Mailer: mailer/__tests__/mailer.test.ts already has send failure; add “request updated, email failed” scenario when maintenance + mailer are integrated in test.
5. Acceptance Criteria
- Phase A: All Phase A items have at least one test each; CI green.
- Phase B: Cache fallback and storage/mailer partial-failure behaviour documented and tested.
- Phase C: Optional; document expected behaviour first, then add tests when prioritised.
6. References
- Testing Guide — Structure, naming, type-safety.
- Next Session: Maintenance Stabilization — When to deepen maintenance tests.
- Security Overview – Incident response — Align failure behaviour with operational response.