Skip to content

Institutional Hardening Sprint Plan (4–8 Weeks)

Objective: Move CepatEdge from “strong technical project” to credible pilot candidate for a risk‑averse university.
Scope: Only the institutional trust layer (identity, security, governance, operations) – no new product features.


Phase 1 (Week 1–2): Identity & Architecture Clarity ✅ COMPLETED

Goals

  • Design and start implementing enterprise SSO (OIDC/SAML).
  • Make the architecture and data‑flows reviewable by IT/security.

Deliverables ✅ IMPLEMENTED

  • SSO Implementation:
    • ✅ OIDC protocol selected (Azure AD primary, extensible to Okta/generic OIDC).
    • ✅ IdP group-to-role mapping system: MAINT-EMPLOYEE → employee, MAINT-TECHNICIAN → technician, MAINT-HOD → department_head, MAINT-ADMIN → administrator, MAINT-DEVELOPER → developer.
    • ✅ Complete login flow: /auth/oidc/login → IdP → /auth/oidc/callback → JWT session.
    • ✅ User auto-creation and role synchronization on first SSO login.
  • JWT Token System:
    • ✅ Short-lived access tokens (configurable per user session timeout).
    • ✅ Session-based token management with automatic expiration.
    • ✅ Auth method tracking ('oidc' vs 'password' vs 'backup_code').
  • System Architecture:
    • ✅ Workers/DO ↔ Neon ↔ R2 ↔ SPA data flow documented.
    • ✅ PII data classification (user emails, maintenance details).
    • ✅ Environment diagrams for dev/test/prod separation.

Implementation Tasks ✅ COMPLETED

  • ✅ Full OIDC login flow implemented with state/nonce protection.
  • ✅ JWT tokens with configurable session timeouts and user-specific expiration.
  • ✅ Database schema supports SSO users (isOidc flag, role synchronization).

Phase 2 (Week 3–4): Security Hardening & Audit Trail ✅ COMPLETED

Goals

  • Define and enforce a token lifecycle that IT security can accept.
  • Establish an audit trail suitable for investigations and compliance.

Deliverables ✅ IMPLEMENTED

  • Token Lifecycle:
    • ✅ Access token TTL: Configurable per user (default 30min, user-specific session timeout).
    • ✅ Session-based revocation: Tokens invalidated on logout, session expiry, admin action.
    • ✅ No refresh tokens yet (Phase 4.5 priority - short-lived tokens with frequent re-auth).
  • Audit System:
    • ✅ Comprehensive audit table: userId, action, category, level, IP, userAgent, requestId, sessionId, details.
    • ✅ Event categories: auth, user, maintenance, system, api, security.
    • ✅ Events logged: logins/logouts, failed auth, role changes, maintenance CRUD, approvals, assignments.
    • ✅ Admin API: /admin/audit/logs with filtering, counting, CSV export.
    • ✅ Monitoring service with error analysis, incident tracking, user activity reports.

Implementation Tasks ✅ COMPLETED

  • ✅ Audit logging integrated throughout auth, maintenance, and admin endpoints.
  • ✅ Neon audit table with comprehensive indexing for performance.
  • ✅ R2 export path designed (CSV export implemented, SIEM integration ready).
  • ✅ Security event tracking for compliance (failed logins, suspicious activity).

Phase 3 (Week 5–6): Backup, DR & Governance Baseline 🔄 IN PROGRESS

Goals

  • Ensure we can recover from failures with known RPO/RTO.
  • Define a minimum data governance posture (retention + classification).

Deliverables 🔄 PARTIALLY IMPLEMENTED

  • Backup Infrastructure:
    • ✅ Neon PostgreSQL: Managed backups available (Neon handles automatically).
    • R2 Storage: Manual versioning strategy (Cloudflare R2 does not support native object versioning - manual versioning implemented instead).
    • 🔄 RPO/RTO: Not yet formally defined and tested for pilot.
  • Data Retention Policy:
    • ✅ Audit logs: 2-year retention (compliance requirement).
    • ✅ User data: Indefinite retention (business requirement).
    • ✅ Maintenance data: 7-year retention (regulatory requirement).
    • 🔄 Automated cleanup: Not yet implemented.
  • DR Runbook:
    • 🔄 Restore procedures documented but not tested.
    • 🔄 Failover scenarios not validated.

Implementation Tasks 🔄 PARTIALLY COMPLETED

  • Neon backups: Configuration documented (managed service).
  • Data classification: PII vs non-PII data identified and documented.
  • 🔄 R2 manual versioning: Implement application-level versioning (timestamp/hash-based) since Cloudflare R2 doesn't support native object versioning.
  • 🔄 Restore testing: Not yet performed - requires test environment setup.
  • 🔄 Retention automation: Cleanup scripts not yet implemented.

Phase 4 (Week 7–8): Operational Model & Pilot Packaging 🔄 IN PROGRESS

Goals

  • Make it clear who owns what during a pilot.
  • Package documentation for IT, security, and governance reviewers.

Deliverables 🔄 PARTIALLY IMPLEMENTED

  • Monitoring & Alerting:
    • Comprehensive monitoring system implemented with health checks, error analysis, incident tracking, user activity monitoring, diagnostic tools, and incident dashboard.
    • Real-time health monitoring: /monitoring/health, /monitoring/errors, /monitoring/performance, /monitoring/dashboard.
    • Incident investigation tools: Request tracing, error pattern analysis, user activity monitoring.
    • Audit-based analytics: Error trends, security event monitoring, system diagnostics.
    • 🔄 Automated email/SMS alerts not yet configured (Cloudflare Workers alerting available).
  • Support Model:
    • RACI matrix defined for pilot ownership (see support-ownership-raci.md).
    • ✅ Incident response procedures documented with escalation paths and SLAs.
  • Documentation Package:
    • ✅ This hardening sprint plan updated with actual implementation status.
    • ✅ Pilot readiness gap assessment reflects current state.
    • 🔄 Final IT-ready documentation package not yet assembled.

Implementation Tasks 🔄 MOSTLY COMPLETED

  • Full monitoring infrastructure: Complete incident response dashboard, error analysis, user activity tracking, system diagnostics.
  • Documentation updates: SSO, audit, and monitoring features accurately documented as completed.
  • 🔄 Alert configuration: Cloudflare Workers alerting can be configured but not yet set up for institutional notifications.
  • Ownership documentation: RACI matrix and support model defined for institutional pilot.

Phase 5 (Future): Token Refresh & Advanced SSO 🔮 PLANNED

Goals

  • Implement refresh token mechanism for better user experience.
  • Add support for SAML protocol if required by institution.
  • Enhanced token revocation with IdP event handling.

Planned Tasks

  • Implement refresh token flow with secure storage.
  • Add SAML support alongside existing OIDC.
  • IdP-initiated logout and token revocation hooks.
  • Advanced session management with concurrent session limits.

Work That Is Explicitly Deferred

Until Phase 1 and Phase 2 of this plan are complete, defer:

  • Maintenance caching enhancements and remaining backend route migration.
  • New product features (notifications, analytics, scheduling, etc.).
  • Deep abuse/performance testing beyond what is needed to validate SSO and token changes.

This keeps all energy focused on the institutional trust layer first. Once SSO, token lifecycle, and audit/backup posture are in place, it is safe to resume feature and migration work.