Failure and Resilience Tests Plan

This document plans the next layer of tests beyond happy-path and basic permission checks: failure cases, permission escalation, invalid state transitions, and (where feasible) partial failure recovery. It aligns with the Testing Guide and defers maintenance-heavy work until maintenance services are stable (see Next Session: Maintenance Stabilization).

1. Goals

Failure-case tests: Services and routes behave correctly when dependencies fail (DB, cache, storage, mailer).
Permission-escalation tests: Users cannot perform actions above their role (e.g. employee cannot approve, non-owner cannot delete).
Invalid state transition tests: Workflow steps reject invalid transitions (e.g. complete without assign, approve twice).
Partial failure recovery: Where applicable, test that one failing part does not corrupt state (e.g. cache miss falls back to DB; failed email does not roll back request creation).

Not in scope here: Full race-condition tests (concurrent approve/assign on the same request). Add them later, once single-request consistency is covered.

2. Current Coverage (Summary)

Area	Already covered	Missing / to add
Auth	Login/logout/register/profile; logout session cleanup failure; role validation	DB down on login; invalid token paths; rate-limit behaviour
Permissions	Full `permissions.test.ts` (RBAC, status-based, null request/user)	Route-level “forbidden” for escalation (e.g. employee calls approve endpoint)
Maintenance	Queries/mutations/permissions/validation; DB errors handled gracefully; permission denial in mutations	Invalid state transitions (e.g. complete → approve); partial failure (e.g. cache fail, email fail)
Cache	DO behaviour, instance management, session cache	Cache miss → DB fallback; DO timeout or error handling
Storage	Storage unavailable, key helpers	Upload success but R2 fail; partial write
Mailer	Email send failure	Retry/partial failure (e.g. request created but email fails)

3. Prioritised Plan

Phase A — Low effort, high value (do first)

Auth failure cases
- Login when the DB is unavailable (mock the DB to throw) → expect 503 or 500 and no session.
- Login with invalid credentials → expect 401, no token.
- Request with expired or malformed JWT → expect 401.
- (Optional) Register with duplicate email → expect 4xx and clear message.
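The first two auth cases above could look roughly like the sketch below. It is framework-agnostic (plain `node:assert`), and `login`, `UserStore`, and the 503 choice are assumptions made for illustration, not the project's actual auth service.

```typescript
import { strict as assert } from "node:assert";

// Hypothetical shapes; the real auth service will differ.
interface User { id: string; email: string; passwordHash: string }
interface UserStore { findByEmail(email: string): Promise<User | null> }
interface LoginResult { status: number; token?: string }

// Minimal login handler: a store failure maps to 503, bad credentials to 401.
async function login(store: UserStore, email: string, password: string): Promise<LoginResult> {
  let user: User | null = null;
  try {
    user = await store.findByEmail(email);
  } catch {
    return { status: 503 }; // dependency down: never issue a token
  }
  // Plain comparison keeps the sketch short; real code would verify a hash.
  if (!user || user.passwordHash !== password) return { status: 401 };
  return { status: 200, token: `token-for-${user.id}` };
}

// Mocked dependencies: one store that always throws, one that knows a single user.
const failingStore: UserStore = {
  findByEmail: async () => { throw new Error("DB unavailable"); },
};
const workingStore: UserStore = {
  findByEmail: async (email) =>
    email === "a@example.com" ? { id: "u1", email, passwordHash: "secret" } : null,
};

async function testLoginFailureCases(): Promise<void> {
  const dbDown = await login(failingStore, "a@example.com", "secret");
  assert.equal(dbDown.status, 503);
  assert.equal(dbDown.token, undefined);

  const badPassword = await login(workingStore, "a@example.com", "wrong");
  assert.equal(badPassword.status, 401);
  assert.equal(badPassword.token, undefined);
}
```

In the real suite the body of `testLoginFailureCases` would sit inside the existing login test file, with the mock wired through whatever injection mechanism the auth service already uses.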
Permission escalation at route level
- As employee, call approve/assign/complete on a request they are not permitted to act on → expect 403.
- As tenant, call assign or approve → expect 403.
- Use existing permission helpers; add one or two route-level (or integration-style) tests that hit the real route and assert 403.
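A route-level escalation test can be sketched as follows. The `manager` role, the rule that employees may only complete requests assigned to them, and the `handleAction` stand-in for a route handler are all assumptions for illustration; the real rules live in the permission helpers.

```typescript
import { strict as assert } from "node:assert";

// Hypothetical roles and actions.
type Role = "tenant" | "employee" | "manager";
type Action = "approve" | "assign" | "complete";

interface Caller { id: string; role: Role }
interface MaintenanceRequest { id: string; assigneeId: string | null }

function canPerform(caller: Caller, action: Action, request: MaintenanceRequest): boolean {
  if (caller.role === "manager") return true;
  if (caller.role === "employee") {
    // Assumed rule: employees may only complete work assigned to them.
    return action === "complete" && request.assigneeId === caller.id;
  }
  return false; // tenants cannot approve, assign, or complete
}

// Stand-in for a route handler: look up the request, then enforce the permission.
function handleAction(
  caller: Caller,
  action: Action,
  request: MaintenanceRequest | undefined,
): { status: number } {
  if (!request) return { status: 404 };
  return { status: canPerform(caller, action, request) ? 200 : 403 };
}

function testEscalationIsForbidden(): void {
  const request: MaintenanceRequest = { id: "r1", assigneeId: "e2" };
  const employee: Caller = { id: "e1", role: "employee" };
  const tenant: Caller = { id: "t1", role: "tenant" };

  // The request is assigned to someone else, so every action is forbidden.
  for (const action of ["approve", "assign", "complete"] as const) {
    assert.equal(handleAction(employee, action, request).status, 403);
  }
  assert.equal(handleAction(tenant, "assign", request).status, 403);
  assert.equal(handleAction(tenant, "approve", request).status, 403);
}
```

Pairing each 403 case with one allowed case (for example the assigned employee completing the request) guards against a handler that simply rejects everything.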
Invalid state transitions (maintenance)
- Once maintenance mutations are stable: “complete” without “assigned”; “approve” when already approved; “assign” when not in assignable status.
- Assert that the mutation returns an error or 4xx and that the stored state is unchanged (re-read the record and check its final status).
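One way to make these transition checks concrete is a transition table plus a mutation that refuses anything outside it. The status names, the linear pending → approved → assigned → completed workflow, and the 409 code below are assumptions for illustration, not the real maintenance model.

```typescript
import { strict as assert } from "node:assert";

type Status = "pending" | "approved" | "assigned" | "completed";
type Action = "approve" | "assign" | "complete";

interface MaintenanceRequest { id: string; status: Status }
type MutationResult =
  | { ok: true; request: MaintenanceRequest }
  | { ok: false; code: number; message: string };

// Hypothetical transition table: each action is valid from exactly one status.
const transitions: Record<Action, { from: Status; to: Status }> = {
  approve: { from: "pending", to: "approved" },
  assign: { from: "approved", to: "assigned" },
  complete: { from: "assigned", to: "completed" },
};

// Returns a new request on success; never mutates its input.
function applyAction(request: MaintenanceRequest, action: Action): MutationResult {
  const rule = transitions[action];
  if (request.status !== rule.from) {
    return { ok: false, code: 409, message: `cannot ${action} from ${request.status}` };
  }
  return { ok: true, request: { ...request, status: rule.to } };
}

function testInvalidTransitions(): void {
  const approved: MaintenanceRequest = { id: "r1", status: "approved" };

  // "complete" without "assigned" is rejected and the state is unchanged.
  const completeResult = applyAction(approved, "complete");
  assert.equal(completeResult.ok, false);
  assert.equal(approved.status, "approved");

  // "approve" when already approved is rejected too.
  const approveAgain = applyAction(approved, "approve");
  assert.equal(approveAgain.ok, false);
  assert.equal(approved.status, "approved");
}
```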

Phase B — Medium effort

Cache failure / fallback
- When cache get throws or returns “miss”, service falls back to DB and returns correct data.
- When cache set fails (e.g. DO error), request still succeeds (e.g. create/update still written to DB); optional: log and retry later.
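The fallback behaviour described above can be sketched with a read-through helper. `Cache`, `Db`, `getRequestTitle`, and the key format are hypothetical names; the sketch assumes a cache error should be treated exactly like a miss and that a failed write-back must not fail the read.

```typescript
import { strict as assert } from "node:assert";

interface Cache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<void>;
}
interface Db { getTitle(id: string): Promise<string | null> }

async function getRequestTitle(cache: Cache, db: Db, id: string): Promise<string | null> {
  const key = `request:${id}:title`;
  try {
    const hit = await cache.get(key);
    if (hit !== null) return hit;
  } catch {
    // A cache error is treated like a miss.
  }
  const title = await db.getTitle(id);
  if (title !== null) {
    // Best-effort write-back: a failed set must not fail the read.
    await cache.set(key, title).catch(() => undefined);
  }
  return title;
}

// Mocks: a cache that always throws and a cache that always misses.
const brokenCache: Cache = {
  get: async () => { throw new Error("DO timeout"); },
  set: async () => { throw new Error("DO timeout"); },
};
const emptyCache: Cache = { get: async () => null, set: async () => {} };

// DB mock that counts reads so the test can prove the fallback happened.
function makeDb(): { db: Db; calls: () => number } {
  let calls = 0;
  const db: Db = {
    getTitle: async (id) => {
      calls += 1;
      return id === "r1" ? "Leaky tap" : null;
    },
  };
  return { db, calls: () => calls };
}

async function testCacheFallback(): Promise<void> {
  for (const cache of [brokenCache, emptyCache]) {
    const { db, calls } = makeDb();
    assert.equal(await getRequestTitle(cache, db, "r1"), "Leaky tap");
    assert.equal(calls(), 1); // the DB served the read
  }
}
```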
Storage partial failure
- Avatar/attachment: DB update succeeds but R2 put fails → either rollback DB or mark “upload failed” and return clear error (match current design).
- Attachment list when R2 is unavailable for one key → list still returns metadata; file endpoint returns 503 for that key.
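The first storage case leaves the design open (rollback versus marking the upload as failed). The sketch below illustrates only the rollback option; `saveAttachment`, the `attachments/` key prefix, the in-memory table, and the 503 status are assumptions, and the real test should follow whichever behaviour the current design uses.

```typescript
import { strict as assert } from "node:assert";

interface ObjectStore { put(key: string, body: Uint8Array): Promise<void> }
type UploadResult =
  | { ok: true; key: string }
  | { ok: false; status: number; message: string };

// In-memory stand-in for the attachments table: attachment id -> storage key.
type AttachmentTable = Map<string, string>;

async function saveAttachment(
  table: AttachmentTable,
  store: ObjectStore,
  id: string,
  body: Uint8Array,
): Promise<UploadResult> {
  const key = `attachments/${id}`;
  table.set(id, key); // step 1: the DB write succeeds
  try {
    await store.put(key, body); // step 2: the object storage write
  } catch {
    table.delete(id); // compensate so no row points at a missing object
    return { ok: false, status: 503, message: "upload failed, please retry" };
  }
  return { ok: true, key };
}

// Mock object store that fails every put, standing in for an R2 outage.
const failingStore: ObjectStore = {
  put: async () => { throw new Error("R2 unavailable"); },
};

async function testStorageRollback(): Promise<void> {
  const table: AttachmentTable = new Map();
  const result = await saveAttachment(table, failingStore, "a1", new Uint8Array([1, 2, 3]));
  assert.equal(result.ok, false);
  assert.equal(table.has("a1"), false); // no orphaned metadata row
}
```

If the design marks the row as "upload failed" instead, the same test shape applies: only the final assertion changes, from "row is gone" to "row carries the failed status".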
Mailer partial failure
- Maintenance assigned (or similar) → DB updated, but email send fails → request state remains correct; user sees success; log or queue retry (if we add queue later, test that path).
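A sketch of the mailer case, assuming the notification is sent after the DB write and that a failed send is recorded for a later retry instead of failing the request. `assignRequest`, the `Map`-based DB stand-in, and the `retryQueue` array are hypothetical; no queue exists yet according to the plan.

```typescript
import { strict as assert } from "node:assert";

interface Mailer { send(to: string, subject: string): Promise<void> }
interface MaintenanceRequest {
  id: string;
  status: "approved" | "assigned";
  assigneeEmail: string | null;
}
interface AssignOutcome { ok: true; emailSent: boolean }

async function assignRequest(
  db: Map<string, MaintenanceRequest>,
  mailer: Mailer,
  retryQueue: string[],
  id: string,
  assigneeEmail: string,
): Promise<AssignOutcome> {
  const current = db.get(id);
  if (!current) throw new Error(`request ${id} not found`);
  // The state change is committed before any notification is attempted.
  db.set(id, { ...current, status: "assigned", assigneeEmail });
  try {
    await mailer.send(assigneeEmail, `Request ${id} assigned`);
    return { ok: true, emailSent: true };
  } catch {
    retryQueue.push(id); // remember the notification; do not undo the assignment
    return { ok: true, emailSent: false };
  }
}

const failingMailer: Mailer = {
  send: async () => { throw new Error("SMTP down"); },
};

async function testEmailFailureKeepsState(): Promise<void> {
  const db = new Map<string, MaintenanceRequest>([
    ["r1", { id: "r1", status: "approved", assigneeEmail: null }],
  ]);
  const retryQueue: string[] = [];
  const outcome = await assignRequest(db, failingMailer, retryQueue, "r1", "e1@example.com");

  assert.equal(outcome.ok, true); // the user still sees success
  assert.equal(db.get("r1")?.status, "assigned"); // the request state is correct
  assert.deepEqual(retryQueue, ["r1"]); // the failed email is recorded
}
```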

Phase C — When we have time

Race conditions (later)
- Two concurrent “approve” or “assign” on same request → one succeeds, one gets conflict or 409.
- Requires concurrency in the test (e.g. two requests issued via Promise.all); define the expected behaviour first (last-write-wins vs first-wins vs 409).
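For illustration only, the sketch below assumes the "first wins, second gets 409" policy and enforces it with a version check (optimistic concurrency). `RequestStore`, `compareAndSet`, and `approve` are hypothetical; the real policy still has to be decided as noted above.

```typescript
import { strict as assert } from "node:assert";

interface Row { status: "pending" | "approved"; version: number }

class RequestStore {
  private rows = new Map<string, Row>([["r1", { status: "pending", version: 0 }]]);

  async read(id: string): Promise<Row | undefined> {
    const row = this.rows.get(id);
    return row ? { ...row } : undefined; // snapshot, not a live reference
  }

  // Writes only if nobody else has written since `expectedVersion` was read.
  async compareAndSet(id: string, expectedVersion: number, status: Row["status"]): Promise<boolean> {
    const row = this.rows.get(id);
    if (!row || row.version !== expectedVersion) return false;
    this.rows.set(id, { status, version: row.version + 1 });
    return true;
  }
}

// Returns an HTTP-like status: 200 on success, 409 on conflict, 404 if missing.
async function approve(store: RequestStore, id: string): Promise<number> {
  const row = await store.read(id);
  if (!row) return 404;
  if (row.status !== "pending") return 409;
  return (await store.compareAndSet(id, row.version, "approved")) ? 200 : 409;
}

async function testConcurrentApprove(): Promise<void> {
  const store = new RequestStore();
  // Both calls read version 0 before either writes, so exactly one write wins.
  const statuses = await Promise.all([approve(store, "r1"), approve(store, "r1")]);
  assert.deepEqual([...statuses].sort((a, b) => a - b), [200, 409]);
  assert.deepEqual(await store.read("r1"), { status: "approved", version: 1 });
}
```

The assertion sorts the statuses because the test should not depend on which of the two callers wins, only on there being exactly one winner.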
Partial failure recovery (deeper)
- Multi-step flow: e.g. create request → create attachment → if attachment fails, request still exists and is consistent (no orphan attachments; or orphan cleanup documented).

4. Where to Put These Tests

Auth: auth/__tests__/login.test.ts, logout.test.ts, etc. Add describe('failure cases', …) or new auth/__tests__/failure.test.ts if it grows.
Permissions / escalation: Reuse maintenance/__tests__/permissions.test.ts for logic; add route-level or integration tests (e.g. in routes/maintenance/ or a single __tests__/maintenance-escalation.test.ts) that call HTTP and assert 403.
Invalid state transitions: maintenance/__tests__/mutations.test.ts or validation.test.ts — e.g. describe('invalid state transitions', …).
Cache fallback: cache/__tests__/cache.test.ts or maintenance/__tests__/queries.test.ts (when cache is wired) — mock DO to throw or return miss, assert DB is used and result correct.
Storage: storage/__tests__/storage.test.ts, maintenance/__tests__/attachment.test.ts — mock R2 to fail after DB update and assert behaviour.
Mailer: mailer/__tests__/mailer.test.ts — already covers send failure; add a “request updated, email failed” scenario once maintenance and mailer are integrated in tests.

5. Acceptance Criteria

Phase A: All Phase A items have at least one test each; CI green.
Phase B: Cache fallback and storage/mailer partial-failure behaviour documented and tested.
Phase C: Optional; document expected behaviour first, then add tests when prioritised.

6. References

Testing Guide — Structure, naming, type-safety.
Next Session: Maintenance Stabilization — When to deepen maintenance tests.
Security Overview – Incident response — Align failure behaviour with operational response.
