Architectural Weak Points — To Address ASAP

This document captures known architectural pressure points in the edge/cache/init layer so we can address them in a planned way. Guiding principles: no assumption of shared state across isolates; the DB is never cached; cache instance selection is globally consistent. Target: document and plan in Phase 4; implement in Phase 4 or early Phase 5.


1. DO Instance Registry: No Isolate-Local Map for Cache Instance Selection

Problem

Today, CacheService uses an isolate-local Map (e.g. instanceCounters: Map<string, number>) to decide which DO instance (shard) to talk to. Each Worker isolate has its own Map, so:

  • Data distribution is per-isolate, not global.
  • Isolate A and Isolate B can each think they are "shard 0" for the same logical key.
  • We get pseudo-sharding, not cluster-aware scaling. State is inconsistent across the edge.

We already avoid caching DO instances (we use env.CEPATEDGE_CACHE.get(env.CEPATEDGE_CACHE.idFromName(...)) and never hold a DO reference in a global). The remaining issue is which DO name/id we choose — that choice must not be based on isolate-local memory.

Required direction

  • Main DO (registry): Introduce a single "main" or "registry" Durable Object whose only job is to store and serve the list of current active cache DO instance names/ids (the logical shards that actually exist and are in use).
  • Workers never use a local Map for instance selection. Every Worker, on every cache operation, asks the main DO: "which instance(s) should I use for this key/type?" and gets back the current active instance id(s). The Worker then calls env.CEPATEDGE_CACHE.get(env.CEPATEDGE_CACHE.idFromName(returnedName)) and uses that stub. No instanceCounters or similar state lives in Worker memory.
  • Effect: All Workers, across all isolates and data centers, agree on which DO instance handles which logical shard. No isolate-level state mismatch; cache read/write is globally consistent.
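
The selection logic the main DO would own can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: RegistryState, fnv1a, and the modulo-hash mapping are hypothetical choices, and the Map here stands for state held inside the single registry DO (persisted through its storage), never in a Worker isolate.

```typescript
// Hypothetical sketch: RegistryState and fnv1a are illustrative names,
// not existing project code.

/** 32-bit FNV-1a hash: deterministic, so every caller maps a key the same way. */
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep as unsigned 32-bit
  }
  return hash;
}

/** State the main DO would hold: active shard names per cache type. */
class RegistryState {
  private active = new Map<string, string[]>();

  registerInstance(type: string, name: string): void {
    const list = this.active.get(type) ?? [];
    if (!list.includes(name)) list.push(name); // ignore duplicates
    this.active.set(type, list);
  }

  deregisterInstance(type: string, name: string): void {
    const list = this.active.get(type) ?? [];
    this.active.set(type, list.filter((n) => n !== name));
  }

  /** The same (type, key) yields the same name for a given active list. */
  getActiveInstanceFor(type: string, key: string): string {
    const list = this.active.get(type) ?? [];
    if (list.length === 0) {
      throw new Error(`no active instances for type "${type}"`);
    }
    return list[fnv1a(key) % list.length];
  }
}
```

Because the answer depends only on the registry's list and the key, two isolates asking about the same key receive the same instance name. Plain modulo hashing remaps many keys whenever the active list changes, so whether the list is fixed or dynamic is a design decision to settle with the API.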

Tasks (to do in phase)

  • [ ] Design the main DO API: e.g. getActiveInstanceFor(type, keyOrShard) returns one instance id/name; optionally registerInstance / deregisterInstance if we want dynamic registration (or fixed shard list in main DO).
  • [ ] Remove from CacheService (and any other place) all local Map or in-memory state used to pick DO instances. Replace with: call main DO → get current active instance → use that stub.
  • [ ] Ensure no other Worker code keeps a local Map of DO instances or shard indices. Single source of truth = main DO.

2. DB: Remove from Init Cache — Per-Request Only, Independently Imported

Problem

  • In initializeAllServices we still set serviceCache.db = createDb(env). So the init cache holds a DB instance.
  • In practice, getService('db', env) ignores that cached value and returns a new createDb(env) per call (correct for Neon HTTP driver: no cross-request reuse).
  • So the cached serviceCache.db is unused and redundant. Its presence suggests the DB might be shared across requests, which muddies the mental model.

Required direction

  • Remove db from the init cache entirely. Do not assign serviceCache.db in initializeAllServices; do not treat db as part of the cached InitializedServices object that lives for the lifetime of the isolate.
  • DB is always per-request. Every service that needs the DB should get it via getDb(env) or getService('db', env) at request time, never from a global service cache.
  • Clear model: DB is not a "cached service"; it is a per-request resource created from env and passed or obtained where needed. Services that need DB should import and use the DB getter (e.g. getDb(env) / createDb(env)) explicitly — not by receiving a DB from a shared init cache.
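
The intended split between the isolate-lifetime cache and the per-request DB might look like the sketch below. The Env and DbClient shapes, the body of createDb, and the cache field are assumptions made for illustration; only the names getDb, createDb, and initializeAllServices come from this document.

```typescript
// Hypothetical shapes for illustration; the real Env, DB client, and
// createDb live in the project and may differ.
interface Env {
  DATABASE_URL: string;
}
interface DbClient {
  readonly url: string;
}

/** Stand-in for the project's createDb(env). */
function createDb(env: Env): DbClient {
  return { url: env.DATABASE_URL };
}

/** Per-request getter: always builds a fresh client, never reads a cache. */
function getDb(env: Env): DbClient {
  return createDb(env);
}

/** Isolate-lifetime cache: deliberately has no `db` field. */
interface InitializedServices {
  cache: { name: string };
}

let serviceCache: InitializedServices | null = null;

function initializeAllServices(_env: Env): InitializedServices {
  if (serviceCache === null) {
    serviceCache = { cache: { name: "cache-service" } };
  }
  return serviceCache;
}

/** A request handler obtains the DB explicitly from the request's env. */
function handleRequest(env: Env): DbClient {
  initializeAllServices(env); // cached services, reused across requests
  return getDb(env); // DB client, created for this request only
}
```

With db absent from the cached type, a call such as getCachedServices().db no longer type-checks, which makes accidental reuse harder.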

Tasks (to do in phase)

  • [ ] Remove db from the init cache: do not set serviceCache.db in initializeAllServices; remove db from the cached object type if it is only used there (or keep type for createServicesForEnv but ensure production path never reads db from cache).
  • [ ] Ensure all call sites that need DB use getService('db', env) or getDb(env) (or equivalent) with env from the request context — never getCachedServices().db.
  • [ ] Document in service/init docs: "DB is never cached; always obtain per-request via getDb(env)."

3. Usage Logs in DO: Avoid Single Giant Array

Problem

Inside the cache DO, usage logs are stored as a single key (e.g. usage:logs) holding one big array. All logs are read, mutated (push), and written back. This causes:

  • Serialization and memory pressure as the array grows.
  • Single hot DO doing large JSON read/write.
  • Risk of hitting CPU time and per-value size limits; the failure stays silent until scale exposes it.

DO storage is key-value; it is not designed for "one key = one huge array."

Required direction

  • Store logs as individual keys, e.g. usage:log:<timestamp>:<randomId> (or similar). Append = write one new key. List/recent = list by prefix and paginate. No single key holding the full array.
  • Optionally: cap total keys per day or per type; trim old keys with a scheduled DO call or TTL if supported.
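
A possible key layout and paging scheme is sketched below. The helper names are hypothetical, the zero-padded timestamp is an added assumption (so that lexicographic key order matches time order), and listPage runs against an in-memory Map purely to model the behavior; in the DO, the equivalent read would go through the storage list call with prefix, limit, and startAfter options.

```typescript
// Hypothetical helpers following the usage:log:<timestamp>:<randomId> idea.
const LOG_PREFIX = "usage:log:";
const TS_WIDTH = 15; // assumption: pad so string order equals numeric order

function padTs(timestampMs: number): string {
  return String(timestampMs).padStart(TS_WIDTH, "0");
}

/** One key per log entry; appending never touches existing keys. */
function makeLogKey(
  timestampMs: number,
  randomId: string = Math.random().toString(36).slice(2, 10),
): string {
  return `${LOG_PREFIX}${padTs(timestampMs)}:${randomId}`;
}

/** Keys that sort below this bound are older than `days` days at `nowMs`. */
function retentionCutoffKey(nowMs: number, days: number): string {
  return `${LOG_PREFIX}${padTs(nowMs - days * 24 * 60 * 60 * 1000)}`;
}

/** In-memory stand-in for a prefix listing with limit and cursor. */
function listPage(
  store: Map<string, unknown>,
  opts: { prefix: string; limit: number; startAfter?: string },
): string[] {
  const after = opts.startAfter;
  return Array.from(store.keys())
    .filter((k) => k.startsWith(opts.prefix))
    .filter((k) => after === undefined || k > after)
    .sort()
    .slice(0, opts.limit);
}
```

A reader pages by passing the last key of the previous page as startAfter, and a cleanup job can delete every key that sorts below retentionCutoffKey(now, N).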

Tasks (to do in phase)

  • [ ] Refactor DO storage for usage/analytics: replace single usage:logs array with per-log keys and prefix listing.
  • [ ] Add pagination or limit when reading logs; document retention (e.g. delete keys older than N days).
  • [ ] See DO vs KV Analysis for why DO (not KV) is correct for logs and maintenance cache, and why timestamp-indexed keys plus pagination (500 to 1,000 entries per read) are practical and scale to 10K+ logs.

4. Circuit Breaker Is Isolate-Local (Document Only for Now)

Observation

If a circuit breaker lives in Worker memory, it is isolate-local. When one DO instance starts failing, Isolate A might open its breaker and stop calling that instance, while Isolate B, whose breaker is still closed, keeps hammering it. So we do not get global cascade prevention.

Direction (document, implement later if needed)

  • For global cascade prevention, the breaker state would need to live in a shared place: e.g. inside the DO (DO opens its own "I'm unhealthy" flag), or KV, or a dedicated "breaker" DO.
  • Current behavior: breaker protects a single isolate. Document this as a known limitation; plan to move breaker state to DO or KV if we need cross-isolate protection.
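
The limitation can be illustrated with a small breaker modeled as a plain state object. The thresholds, names, and cooldown rule below are assumptions for illustration, not the project's breaker; the point is that each isolate holds its own copy, and that a plain serializable state like this could later be stored in a DO or KV instead.

```typescript
// Hypothetical sketch; thresholds and names are illustrative.
interface BreakerState {
  failures: number;
  openedAt: number | null; // ms timestamp when opened, or null while closed
}

const FAILURE_THRESHOLD = 3;
const COOLDOWN_MS = 30000;

function closedState(): BreakerState {
  return { failures: 0, openedAt: null };
}

/** Count a failure; open the breaker once the threshold is reached. */
function recordFailure(s: BreakerState, now: number): BreakerState {
  const failures = s.failures + 1;
  const openedAt = failures >= FAILURE_THRESHOLD ? s.openedAt ?? now : s.openedAt;
  return { failures, openedAt };
}

/** Any success resets the breaker to closed. */
function recordSuccess(): BreakerState {
  return closedState();
}

/** Calls pass while closed, or as a probe once the cooldown has elapsed. */
function allowsCall(s: BreakerState, now: number): boolean {
  return s.openedAt === null || now - s.openedAt >= COOLDOWN_MS;
}

/** Simulate one isolate observing `n` consecutive failures at time `now`. */
function afterFailures(n: number, now: number): BreakerState {
  let s = closedState();
  for (let i = 0; i < n; i++) s = recordFailure(s, now);
  return s;
}
```

If Isolate A has seen three failures and Isolate B none, allowsCall returns false for A's state and true for B's at the same instant, which is exactly the divergence described above.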

Tasks (to do in phase)

  • [ ] Add a short note in runbook or architecture: "Circuit breaker is isolate-local; global protection would require breaker state in DO or KV."
  • [ ] (Optional) In a later phase, implement breaker state in DO or KV if we want global cascade prevention.

5. Summary Table

| Weak point | Current | Target | Phase |
|---|---|---|---|
| Cache instance selection | Isolate-local Map in CacheService | Main DO holds list of active instances; workers ask main DO, no local Map | 4 (important) |
| DB in init | Cached in serviceCache (unused by getService) | Remove from init cache; DB only via getDb(env) per request | 4 (important) |
| Usage logs in DO | Single key = big array | Per-log keys, prefix list, paginate | 4 |
| Circuit breaker | Isolate-local | Document; later: DO/KV if global protection needed | 4 (doc only) |

6. References