
Failure and Resilience Tests Plan

This document plans the next layer of tests beyond happy-path and basic permission checks: failure cases, permission escalation, invalid state transitions, and (where feasible) partial failure recovery. It aligns with the Testing Guide and defers maintenance-heavy work until maintenance services are stable (see Next Session: Maintenance Stabilization).


1. Goals

  • Failure-case tests: Services and routes behave correctly when dependencies fail (DB, cache, storage, mailer).
  • Permission-escalation tests: Users cannot perform actions above their role (e.g. employee cannot approve, non-owner cannot delete).
  • Invalid state transition tests: Workflow steps reject invalid transitions (e.g. complete without assign, approve twice).
  • Partial failure recovery: Where applicable, test that one failing part does not corrupt state (e.g. cache miss falls back to DB; failed email does not roll back request creation).

Not in scope here: full race-condition tests (concurrent approve/assign on the same request). Add these later, when we have time; start with single-request consistency.


2. Current Coverage (Summary)

| Area | Already covered | Missing / to add |
| --- | --- | --- |
| Auth | Login/logout/register/profile; logout session cleanup failure; role validation | DB down on login; invalid token paths; rate-limit behaviour |
| Permissions | Full permissions.test.ts (RBAC, status-based, null request/user) | Route-level “forbidden” for escalation (e.g. employee calls approve endpoint) |
| Maintenance | Queries/mutations/permissions/validation; DB errors gracefully; permission denial in mutations | Invalid state transitions (e.g. complete → approve); partial failure (e.g. cache fail, email fail) |
| Cache | DO behaviour, instance management, session cache | Cache miss → DB fallback; DO timeout or error handling |
| Storage | Storage unavailable, key helpers | Upload success but R2 fail; partial write |
| Mailer | Email send failure | Retry/partial failure (e.g. request created but email fails) |

3. Prioritised Plan

Phase A — Low effort, high value (do first)

  1. Auth failure cases

    • Login when DB is unavailable (mock DB throw) → expect 503 or 500 and no session.
    • Login with invalid credentials → expect 401, no token.
    • Request with expired or malformed JWT → expect 401.
    • (Optional) Register with duplicate email → expect 4xx and clear message.
  2. Permission escalation at route level

    • As employee, call approve/assign/complete on a request they are not permitted to act on → expect 403.
    • As tenant, call assign or approve → expect 403.
    • Use existing permission helpers; add one or two route-level (or integration-style) tests that hit the real route and assert 403.
  3. Invalid state transitions (maintenance)

    • Once maintenance mutations are stable: “complete” without “assigned”; “approve” when already approved; “assign” when not in assignable status.
    • Assert: mutation returns error or 4xx, state unchanged (or assert final state).
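
The escalation and invalid-transition checks in Phase A can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `manager` role, the status names, the role-to-action map, and the choice of 409 for an invalid transition are hypothetical and may not match the real maintenance service. The point is the shape of the assertions: a forbidden role gets 403, an invalid transition gets a 4xx, and in both cases the request state is unchanged.

```typescript
// Hypothetical roles, statuses, and actions; the real service may differ.
type Role = "tenant" | "employee" | "manager";
type Status = "pending" | "approved" | "assigned" | "completed";
type Action = "approve" | "assign" | "complete";

interface MaintenanceRequest {
  id: string;
  status: Status;
}

// Assumed permission map: which actions each role may attempt at all.
const allowedActions: Record<Role, Action[]> = {
  tenant: [],
  employee: ["complete"],
  manager: ["approve", "assign", "complete"],
};

// Assumed workflow: each action is valid from exactly one status.
const transitions: Record<Action, { from: Status; to: Status }> = {
  approve: { from: "pending", to: "approved" },
  assign: { from: "approved", to: "assigned" },
  complete: { from: "assigned", to: "completed" },
};

interface ActionResponse {
  status: 200 | 403 | 409;
  request: MaintenanceRequest;
}

function handleAction(
  role: Role,
  request: MaintenanceRequest,
  action: Action
): ActionResponse {
  // Permission check first: escalation attempts never reach the workflow.
  if (allowedActions[role].indexOf(action) === -1) {
    return { status: 403, request };
  }
  // Reject transitions from the wrong status and return the request untouched.
  const rule = transitions[action];
  if (request.status !== rule.from) {
    return { status: 409, request };
  }
  // Valid transition: return a copy with the new status.
  return { status: 200, request: { ...request, status: rule.to } };
}

const pending: MaintenanceRequest = { id: "r1", status: "pending" };
const escalation = handleAction("employee", pending, "approve"); // forbidden role
const skipped = handleAction("manager", pending, "complete"); // complete without assign
const approved = handleAction("manager", pending, "approve"); // valid
const approvedTwice = handleAction("manager", approved.request, "approve"); // approve twice
```

In a real test file, each of the four calls would become its own test case inside a describe block such as describe('invalid state transitions', …), with the route or mutation under test in place of `handleAction`.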

Phase B — Medium effort

  1. Cache failure / fallback

    • When cache get throws or returns “miss”, service falls back to DB and returns correct data.
    • When cache set fails (e.g. DO error), request still succeeds (e.g. create/update still written to DB); optional: log and retry later.
  2. Storage partial failure

    • Avatar/attachment: DB update succeeds but R2 put fails → either rollback DB or mark “upload failed” and return clear error (match current design).
    • Attachment list when R2 is unavailable for one key → list still returns metadata; file endpoint returns 503 for that key.
  3. Mailer partial failure

    • Maintenance assigned (or similar): DB is updated but the email send fails → request state remains correct and the user sees success; the failure is logged or queued for retry (if we add a queue later, test that path).
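
The cache fallback behaviour in Phase B could be tested against a read-through helper along the following lines. This is a sketch under stated assumptions, not the project's cache API: `KeyValueCache`, `readThrough`, and `loadFromDb` are hypothetical names, and the failing cache stands in for a Durable Object that throws.

```typescript
// Hypothetical cache interface; the real DO-backed cache may look different.
interface KeyValueCache<T> {
  get(key: string): Promise<T | undefined>;
  set(key: string, value: T): Promise<void>;
}

// Read-through with fallback: a cache error is treated like a miss,
// and a failed cache write never fails the request.
async function readThrough<T>(
  cache: KeyValueCache<T>,
  key: string,
  loadFromDb: (key: string) => Promise<T>
): Promise<T> {
  try {
    const hit = await cache.get(key);
    if (hit !== undefined) return hit;
  } catch {
    // Cache unavailable: fall through to the DB.
  }
  const value = await loadFromDb(key);
  try {
    await cache.set(key, value);
  } catch {
    // Best-effort write; a real service might log this.
  }
  return value;
}

// Test double: a cache whose every call throws.
const failingCache: KeyValueCache<string> = {
  get: async () => {
    throw new Error("cache unavailable");
  },
  set: async () => {
    throw new Error("cache unavailable");
  },
};

// Test double: a DB loader that counts how often it is used.
let dbReads = 0;
async function loadFromDb(key: string): Promise<string> {
  dbReads += 1;
  return `row:${key}`;
}
```

A test would call `readThrough(failingCache, "req-1", loadFromDb)` and assert both that the returned value comes from the DB and that `dbReads` is 1, which shows the fallback path was taken rather than an error surfacing to the caller.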

Phase C — When we have time

  1. Race conditions (later)

    • Two concurrent “approve” or “assign” calls on the same request → one succeeds, the other gets a conflict error (e.g. 409).
    • Requires concurrency in the test (e.g. two requests via Promise.all); define expected behaviour first (last-write-wins vs first-wins vs 409).
  2. Partial failure recovery (deeper)

    • Multi-step flow: e.g. create request → create attachment → if attachment fails, request still exists and is consistent (no orphan attachments; or orphan cleanup documented).
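
Once the expected behaviour is decided, the concurrent-approve case could look roughly like this. The sketch assumes the "first wins, second gets 409" policy and models the DB as an in-memory store with a conditional update; `InMemoryStore`, `updateIfStatus`, and `approve` are hypothetical names, and a real test would issue two HTTP requests instead.

```typescript
// Hypothetical store with a conditional (compare-and-set) update.
class InMemoryStore {
  private statuses = new Map<string, string>([["r1", "pending"]]);

  async updateIfStatus(
    id: string,
    expected: string,
    next: string
  ): Promise<boolean> {
    // Yield once so that concurrent callers interleave before the check.
    await Promise.resolve();
    if (this.statuses.get(id) !== expected) return false;
    this.statuses.set(id, next);
    return true;
  }

  statusOf(id: string): string | undefined {
    return this.statuses.get(id);
  }
}

// Approve succeeds only if the request is still pending at write time.
async function approve(
  store: InMemoryStore,
  id: string
): Promise<{ status: 200 | 409 }> {
  const updated = await store.updateIfStatus(id, "pending", "approved");
  return { status: updated ? 200 : 409 };
}

// Fire two approvals at once and collect the sorted status codes.
async function runConcurrentApprove(): Promise<{
  statuses: number[];
  finalStatus: string | undefined;
}> {
  const store = new InMemoryStore();
  const responses = await Promise.all([
    approve(store, "r1"),
    approve(store, "r1"),
  ]);
  const statuses = responses.map((r) => r.status).sort((a, b) => a - b);
  return { statuses, finalStatus: store.statusOf("r1") };
}
```

Sorting the status codes keeps the assertion independent of which caller wins: the test only checks that exactly one call succeeded, one conflicted, and the final state is approved.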

4. Where to Put These Tests

  • Auth: auth/__tests__/login.test.ts, logout.test.ts, etc. Add describe('failure cases', …) or new auth/__tests__/failure.test.ts if it grows.
  • Permissions / escalation: Reuse maintenance/__tests__/permissions.test.ts for logic; add route-level or integration tests (e.g. in routes/maintenance/ or a single __tests__/maintenance-escalation.test.ts) that call HTTP and assert 403.
  • Invalid state transitions: maintenance/__tests__/mutations.test.ts or validation.test.ts — e.g. describe('invalid state transitions', …).
  • Cache fallback: cache/__tests__/cache.test.ts or maintenance/__tests__/queries.test.ts (when cache is wired) — mock DO to throw or return miss, assert DB is used and result correct.
  • Storage: storage/__tests__/storage.test.ts, maintenance/__tests__/attachment.test.ts — mock R2 to fail after DB update and assert behaviour.
  • Mailer: mailer/__tests__/mailer.test.ts — already has send failure; add a “request updated, email failed” scenario once maintenance and mailer are integrated in a test.
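
As an illustration of the “request updated, email failed” scenario, the following sketch shows the behaviour such a test would pin down. It assumes the design where a failed email does not roll back the DB write; `Mailer`, `assignRequest`, and the `emailSent` flag are hypothetical, and a `Map` stands in for the DB.

```typescript
// Hypothetical mailer interface; the real mailer module may differ.
interface Mailer {
  send(to: string, subject: string): Promise<void>;
}

interface AssignResult {
  ok: true;
  status: "assigned";
  emailSent: boolean;
}

// Assumed design: persist first, then notify; a mail failure is recorded
// but does not undo the assignment or fail the call.
async function assignRequest(
  db: Map<string, string>,
  mailer: Mailer,
  id: string,
  assigneeEmail: string
): Promise<AssignResult> {
  db.set(id, "assigned");
  let emailSent = true;
  try {
    await mailer.send(assigneeEmail, `Request ${id} assigned`);
  } catch {
    // A real service might log here or enqueue a retry.
    emailSent = false;
  }
  return { ok: true, status: "assigned", emailSent };
}

// Test double: a mailer that always fails.
const failingMailer: Mailer = {
  send: async () => {
    throw new Error("mail provider unavailable");
  },
};

const fakeDb = new Map<string, string>([["r1", "approved"]]);
```

A test would await `assignRequest(fakeDb, failingMailer, "r1", "worker@example.com")` and assert that the call still reports success, that `emailSent` is false, and that `fakeDb.get("r1")` is "assigned".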

5. Acceptance Criteria

  • Phase A: All Phase A items have at least one test each; CI green.
  • Phase B: Cache fallback and storage/mailer partial-failure behaviour documented and tested.
  • Phase C: Optional; document expected behaviour first, then add tests when prioritised.

6. References