Architectural Weak Points — To Address ASAP
This document captures known architectural pressure points in the edge/cache/init layer so we can address them in a planned way. Guiding constraints: no reliance on state shared across isolates; DB never cached; cache instance selection globally consistent. Target: document and plan in Phase 4; implement in Phase 4 or early Phase 5.
1. DO Instance Registry: No Isolate-Local Map for Cache Instance Selection
Problem
Today, CacheService uses an isolate-local Map (e.g. instanceCounters: Map<string, number>) to decide which DO instance (shard) to talk to. Each Worker isolate has its own Map, so:
- Data distribution is per-isolate, not global.
- Isolate A and Isolate B can each think they are "shard 0" for the same logical key.
- We get pseudo-sharding, not cluster-aware scaling. State is inconsistent across the edge.
We already avoid caching DO instances (we use env.CEPATEDGE_CACHE.get(env.CEPATEDGE_CACHE.idFromName(...)) and never hold a DO reference in a global). The remaining issue is which DO name/id we choose — that choice must not be based on isolate-local memory.
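To make the failure mode concrete, here is a minimal simulation. It assumes, purely for illustration, that each isolate picks a shard by round-robin over its own `instanceCounters` Map; the class name `IsolateLocalSelector`, the shard count, and the `type:index` naming scheme are hypothetical and not taken from the actual `CacheService` code.

```typescript
// Simulates two Worker isolates that each keep their own counter Map.
// Assumption: selection is round-robin per cache type; the real
// CacheService may use a different scheme.
class IsolateLocalSelector {
  // Lives in isolate memory, so no other isolate can see it.
  private instanceCounters = new Map<string, number>();
  private readonly shardCount: number;

  constructor(shardCount: number) {
    this.shardCount = shardCount;
  }

  // Returns a DO instance name such as "session:0".
  pickInstance(type: string): string {
    const current = this.instanceCounters.get(type) ?? 0;
    this.instanceCounters.set(type, (current + 1) % this.shardCount);
    return `${type}:${current}`;
  }
}

const isolateA = new IsolateLocalSelector(3);
const isolateB = new IsolateLocalSelector(3);

// Isolate A has already served one request; isolate B is fresh.
isolateA.pickInstance("session");

const writeTarget = isolateA.pickInstance("session"); // A writes here
const readTarget = isolateB.pickInstance("session"); // B reads here

console.log(writeTarget, readTarget); // session:1 session:0
```

The two isolates resolve the same logical request to different DO names, so a value written through isolate A is invisible to a read through isolate B.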
Required direction
- Main DO (registry): Introduce a single "main" or "registry" Durable Object whose only job is to store and serve the list of current active cache DO instance names/ids (the logical shards that actually exist and are in use).
- Workers never use local Map for instance selection. Every Worker, on every cache operation, asks the main DO: "which instance(s) should I use for this key/type?" and gets back the current active instance id(s). Workers then do `get(idFromName(returnedName))` and use that stub. No `instanceCounters` or similar in Worker memory.
- Effect: All Workers, across all isolates and data centers, agree on which DO instance handles which logical shard. No isolate-level state mismatch; cache read/write is globally consistent.
Tasks (to do in phase)
- [ ] Design the main DO API: e.g. `getActiveInstanceFor(type, keyOrShard)` returns one instance id/name; optionally `registerInstance` / `deregisterInstance` if we want dynamic registration (or fixed shard list in main DO).
- [ ] Remove from `CacheService` (and any other place) all local Map or in-memory state used to pick DO instances. Replace with: call main DO → get current active instance → use that stub.
- [ ] Ensure no other Worker code keeps a local Map of DO instances or shard indices. Single source of truth = main DO.
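One possible shape for the registry logic is sketched below. It is an in-memory model only: in the real main DO the active list would be persisted in Durable Object storage and exposed over a fetch or RPC interface, both omitted here. The class name `InstanceRegistry`, the helper `fnv1a`, and the hash-modulo selection rule are assumptions made for illustration; only the method names come from the task list above.

```typescript
// In-memory model of the registry. Assumption: the real main DO would load
// and persist `active` through Durable Object storage.
class InstanceRegistry {
  private active: string[] = [];

  registerInstance(name: string): void {
    if (!this.active.includes(name)) {
      this.active.push(name);
      // Keep a canonical order so registration order does not matter.
      this.active.sort();
    }
  }

  deregisterInstance(name: string): void {
    this.active = this.active.filter((n) => n !== name);
  }

  // Deterministic: for a given active list, the same (type, key) always
  // maps to the same instance name, whichever isolate asks.
  getActiveInstanceFor(type: string, key: string): string {
    if (this.active.length === 0) {
      throw new Error("no active cache instances registered");
    }
    const index = fnv1a(`${type}:${key}`) % this.active.length;
    return this.active[index];
  }
}

// 32-bit FNV-1a over UTF-16 code units; any stable hash would do.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Worker side (not executed here): ask the registry, then build the stub.
//   const name = /* result of getActiveInstanceFor(type, key) from main DO */;
//   const stub = env.CEPATEDGE_CACHE.get(env.CEPATEDGE_CACHE.idFromName(name));
```

With plain modulo selection, adding or removing an instance remaps most keys. If that churn matters, consistent or rendezvous hashing is a common alternative to evaluate during the API design.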
2. DB: Remove from Init Cache — Per-Request Only, Independently Imported
Problem
- In `initializeAllServices` we still set `serviceCache.db = createDb(env)`. So the init cache holds a DB instance.
- In practice, `getService('db', env)` ignores that cached value and returns a new `createDb(env)` per call (correct for Neon HTTP driver: no cross-request reuse).
- So the cached `serviceCache.db` is unused and redundant. It suggests DB might be shared across requests; the mental model is unclear.
Required direction
- Remove `db` from the init cache entirely. Do not assign `serviceCache.db` in `initializeAllServices`; do not treat `db` as part of the cached `InitializedServices` object that lives for the lifetime of the isolate.
- DB is always per-request. Every service that needs the DB should get it via `getDb(env)` or `getService('db', env)` at request time, and never from a global service cache.
- Clear model: DB is not a "cached service"; it is a per-request resource created from `env` and passed or obtained where needed. Services that need DB should import and use the DB getter (e.g. `getDb(env)` / `createDb(env)`) explicitly, not by receiving a DB from a shared init cache.
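The separation described above could look roughly like the following. This is a simplified sketch: `Env`, `Db`, the `DATABASE_URL` binding name, the placeholder `cache` service, and `handleRequest` are stand-ins invented for the example, and `createDb` here only mimics the project's real factory.

```typescript
// Stand-in types; the real Env and Db come from the project.
interface Env {
  DATABASE_URL: string; // hypothetical binding name
}
interface Db {
  readonly url: string;
}

// Stand-in for the project's createDb(env): builds a fresh client per call.
function createDb(env: Env): Db {
  return { url: env.DATABASE_URL };
}

// Per-request getter: no module-level variable, no memoization.
function getDb(env: Env): Db {
  return createDb(env);
}

// The isolate-lifetime cache deliberately has no `db` field.
interface InitializedServices {
  cache: { name: string }; // placeholder for a genuinely cacheable service
}

let serviceCache: InitializedServices | null = null;

function initializeAllServices(): InitializedServices {
  if (serviceCache === null) {
    serviceCache = { cache: { name: "cache-service" } };
  }
  return serviceCache;
}

// Example handler: services come from the init cache, DB comes from env.
function handleRequest(env: Env): { services: InitializedServices; db: Db } {
  return { services: initializeAllServices(), db: getDb(env) };
}
```

Because `InitializedServices` has no `db` member, any leftover `getCachedServices().db` style access fails at compile time rather than silently reusing a client.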
Tasks (to do in phase)
- [ ] Remove
dbfrom the init cache: do not setserviceCache.dbininitializeAllServices; removedbfrom the cached object type if it is only used there (or keep type forcreateServicesForEnvbut ensure production path never readsdbfrom cache). - [ ] Ensure all call sites that need DB use
getService('db', env)orgetDb(env)(or equivalent) withenvfrom the request context — nevergetCachedServices().db. - [ ] Document in service/init docs: "DB is never cached; always obtain per-request via getDb(env)."
3. Usage Logs in DO: Avoid Single Giant Array
Problem
Inside the cache DO, usage logs are stored as a single key (e.g. usage:logs) holding one big array. All logs are read, mutated (push), and written back. This causes:
- Serialization and memory pressure as the array grows.
- Single hot DO doing large JSON read/write.
- Risk of CPU time and size limits; silent scale killer.
DO storage is key-value; it is not designed for "one key = one huge array."
Required direction
- Store logs as individual keys, e.g. `usage:log:<timestamp>:<randomId>` (or similar). Append = write one new key. List/recent = list by prefix and paginate. No single key holding the full array.
- Optionally: cap total keys per day or per type; trim old keys with a scheduled DO call or TTL if supported.
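The per-key layout can be sketched with an in-memory stand-in for DO storage. The names `InMemoryLogStorage`, `makeLogKey`, `appendLog`, `readRecent`, and the `UsageLog` shape are hypothetical. Inside a real DO, `put` would be the storage `put(key, value)` call, and `listNewestFirst` should correspond to a prefix listing in reverse order with a limit and an exclusive end key; check the exact option names against the Durable Objects storage documentation. One detail worth deciding early: the timestamp is zero-padded so that lexicographic key order matches chronological order.

```typescript
interface UsageLog {
  type: string;
  detail: string;
}

const LOG_PREFIX = "usage:log:";

// Fixed-width timestamp keeps string ordering equal to time ordering.
function makeLogKey(timestampMs: number, randomId: string): string {
  const paddedTs = String(timestampMs).padStart(15, "0");
  return `${LOG_PREFIX}${paddedTs}:${randomId}`;
}

// In-memory stand-in for DO storage, limited to what this sketch needs.
class InMemoryLogStorage {
  private data = new Map<string, UsageLog>();

  put(key: string, value: UsageLog): void {
    this.data.set(key, value);
  }

  // Newest-first page of entries under `prefix`, with keys strictly
  // below `endExclusive` when a cursor is given.
  listNewestFirst(
    prefix: string,
    limit: number,
    endExclusive?: string
  ): Array<[string, UsageLog]> {
    return Array.from(this.data.entries())
      .filter(([key]) => key.startsWith(prefix))
      .filter(([key]) => endExclusive === undefined || key < endExclusive)
      .sort(([a], [b]) => (a < b ? 1 : a > b ? -1 : 0))
      .slice(0, limit);
  }
}

interface LogPage {
  entries: Array<[string, UsageLog]>;
  nextCursor?: string; // pass back in to fetch the next (older) page
}

// Append = one small write; nothing is read back or rewritten.
function appendLog(
  storage: InMemoryLogStorage,
  log: UsageLog,
  nowMs: number,
  randomId: string
): string {
  const key = makeLogKey(nowMs, randomId);
  storage.put(key, log);
  return key;
}

// Read one bounded page instead of the full history.
function readRecent(
  storage: InMemoryLogStorage,
  limit: number,
  cursor?: string
): LogPage {
  const entries = storage.listNewestFirst(LOG_PREFIX, limit, cursor);
  const nextCursor =
    entries.length === limit ? entries[entries.length - 1][0] : undefined;
  return { entries, nextCursor };
}
```

A caller would page by feeding `nextCursor` from one `readRecent` result into the next call until it comes back undefined. Retention can reuse the same ordering: any key that sorts below `makeLogKey(cutoffMs, "")` is older than the cutoff.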
Tasks (to do in phase)
- [ ] Refactor DO storage for usage/analytics: replace single
usage:logsarray with per-log keys and prefix listing. - [ ] Add pagination or limit when reading logs; document retention (e.g. delete keys older than N days).
- [ ] See DO vs KV Analysis for why DO (not KV) is correct for logs and maintenance cache, and why timestamp/indexed + pagination (500-1K per read) is practical and scales to 10K+ logs.
4. Circuit Breaker Is Isolate-Local (Document Only for Now)
Observation
If a circuit breaker lives in Worker memory, it is isolate-local. When one DO instance starts failing, Isolate A might open the breaker and stop calling it, while Isolate B keeps hammering. So we do not get global cascade prevention.
Direction (document, implement later if needed)
- For global cascade prevention, the breaker state would need to live in a shared place: e.g. inside the DO (DO opens its own "I'm unhealthy" flag), or KV, or a dedicated "breaker" DO.
- Current behavior: breaker protects a single isolate. Document this as a known limitation; plan to move breaker state to DO or KV if we need cross-isolate protection.
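If breaker state does move into a shared place later, the state machine itself is small. The sketch below models it as a plain class with an injected clock; the name `SharedBreaker`, the failure threshold, and the cooldown are illustrative assumptions. Hosting one such object inside a DO (with its fields persisted) would be what makes every isolate see the same answer; that wiring is not shown.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Assumption: one instance of this lives in a single shared place
// (e.g. inside a DO), not in each Worker isolate.
class SharedBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly threshold: number;
  private readonly cooldownMs: number;

  constructor(threshold: number, cooldownMs: number) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
  }

  // Time is passed in so the logic stays deterministic and testable.
  state(nowMs: number): BreakerState {
    if (this.openedAt === null) return "closed";
    return nowMs - this.openedAt >= this.cooldownMs ? "half-open" : "open";
  }

  // Callers check this before contacting the protected DO instance.
  allowRequest(nowMs: number): boolean {
    return this.state(nowMs) !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  // Opens (or re-opens after a failed half-open probe) at the threshold.
  recordFailure(nowMs: number): void {
    this.failures += 1;
    if (this.failures >= this.threshold) {
      this.openedAt = nowMs;
    }
  }
}
```

Compared with the current isolate-local breaker, the trade-off is an extra hop to consult shared state on each call, which is one reason to keep this as a later, optional step.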
Tasks (to do in phase)
- [ ] Add a short note in runbook or architecture: "Circuit breaker is isolate-local; global protection would require breaker state in DO or KV."
- [ ] (Optional) In a later phase, implement breaker state in DO or KV if we want global cascade prevention.
5. Summary Table
| Weak point | Current | Target | Phase |
|---|---|---|---|
| Cache instance selection | Isolate-local Map in CacheService | Main DO holds list of active instances; workers ask main DO, no local Map | 4 (important) |
| DB in init | Cached in serviceCache (unused by getService) | Remove from init cache; DB only via getDb(env) per request | 4 (important) |
| Usage logs in DO | Single key = big array | Per-log keys, prefix list, paginate | 4 |
| Circuit breaker | Isolate-local | Document; later: DO/KV if global protection needed | 4 (doc only) |
6. References
- System map — DO and cache in the architecture.
- ADR-005 Durable Objects — DO usage and lifecycle.
- Init service — Service creation and test setup; after changes, DB must remain per-request in tests too.