Institutional Hardening Sprint Plan (4–8 Weeks)
Objective: Move CepatEdge from “strong technical project” to credible pilot candidate for a risk‑averse university.
Scope: Only the institutional trust layer (identity, security, governance, operations) – no new product features.
Phase 1 (Week 1–2): Identity & Architecture Clarity ✅ COMPLETED
Goals
- Design and start implementing enterprise SSO (OIDC/SAML).
- Make the architecture and data‑flows reviewable by IT/security.
Deliverables ✅ IMPLEMENTED
- SSO Implementation:
- ✅ OIDC protocol selected (Azure AD primary, extensible to Okta/generic OIDC).
- ✅ IdP group-to-role mapping system: MAINT-EMPLOYEE → employee, MAINT-TECHNICIAN → technician, MAINT-HOD → department_head, MAINT-ADMIN → administrator, MAINT-DEVELOPER → developer.
- ✅ Complete login flow: /auth/oidc/login → IdP → /auth/oidc/callback → JWT session.
- ✅ User auto-creation and role synchronization on first SSO login.
- JWT Token System:
- ✅ Short-lived access tokens (lifetime follows a configurable per-user session timeout).
- ✅ Session-based token management with automatic expiration.
- ✅ Auth method tracking ('oidc' vs 'password' vs 'backup_code').
- System Architecture:
- ✅ Workers/DO ↔ Neon ↔ R2 ↔ SPA data flow documented.
- ✅ PII data classification (user emails, maintenance details).
- ✅ Environment diagrams for dev/test/prod separation.
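The group-to-role mapping listed under SSO Implementation could be expressed roughly as follows. This is a minimal sketch, not the project's actual code: the group and role names come from the list above, while `resolveRole`, `ROLE_PRECEDENCE`, and the rule that the highest-precedence role wins when a user belongs to several groups are assumptions made for illustration.

```typescript
type Role =
  | "employee"
  | "technician"
  | "department_head"
  | "administrator"
  | "developer";

// Group names and target roles as listed in the deliverables above.
const GROUP_TO_ROLE = new Map<string, Role>([
  ["MAINT-EMPLOYEE", "employee"],
  ["MAINT-TECHNICIAN", "technician"],
  ["MAINT-HOD", "department_head"],
  ["MAINT-ADMIN", "administrator"],
  ["MAINT-DEVELOPER", "developer"],
]);

// ASSUMPTION: roles later in this array outrank earlier ones.
const ROLE_PRECEDENCE: Role[] = [
  "employee",
  "technician",
  "department_head",
  "administrator",
  "developer",
];

// Returns the highest-precedence role among the user's IdP groups,
// or null when none of the groups is a known MAINT-* group.
function resolveRole(idpGroups: string[]): Role | null {
  let best: Role | null = null;
  for (const group of idpGroups) {
    const role = GROUP_TO_ROLE.get(group);
    if (role === undefined) continue; // unrelated group, ignore
    if (
      best === null ||
      ROLE_PRECEDENCE.indexOf(role) > ROLE_PRECEDENCE.indexOf(best)
    ) {
      best = role;
    }
  }
  return best;
}
```

Returning `null` rather than a default role leaves the decision about unmapped users (reject, or create with minimal access) to the caller.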
Implementation Tasks ✅ COMPLETED
- ✅ Full OIDC login flow implemented with state/nonce protection.
- ✅ JWT tokens with configurable session timeouts and user-specific expiration.
- ✅ Database schema supports SSO users (isOidc flag, role synchronization).
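State/nonce protection on the callback can be illustrated with a small check like the one below. It is a sketch under stated assumptions: `PendingLogin`, `checkCallback`, and the 10-minute window are hypothetical, and the real handler would also verify the ID token signature, issuer, and audience before trusting the nonce claim.

```typescript
// Hypothetical record stored by /auth/oidc/login before redirecting to the IdP.
interface PendingLogin {
  state: string;
  nonce: string;
  createdAtMs: number;
}

type CallbackCheck = { ok: true } | { ok: false; reason: string };

// ASSUMPTION: a login attempt must complete within 10 minutes.
const MAX_LOGIN_AGE_MS = 10 * 60 * 1000;

// Compares the values returned to /auth/oidc/callback with what was stored.
// `returnedState` is the `state` query parameter; `idTokenNonce` is the
// `nonce` claim from the already signature-verified ID token.
function checkCallback(
  pending: PendingLogin | undefined,
  returnedState: string,
  idTokenNonce: string,
  nowMs: number
): CallbackCheck {
  if (pending === undefined) {
    return { ok: false, reason: "no pending login" };
  }
  if (nowMs - pending.createdAtMs > MAX_LOGIN_AGE_MS) {
    return { ok: false, reason: "login attempt expired" };
  }
  if (returnedState !== pending.state) {
    return { ok: false, reason: "state mismatch" }; // possible CSRF
  }
  if (idTokenNonce !== pending.nonce) {
    return { ok: false, reason: "nonce mismatch" }; // possible token replay
  }
  return { ok: true };
}
```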
Phase 2 (Week 3–4): Security Hardening & Audit Trail ✅ COMPLETED
Goals
- Define and enforce a token lifecycle that IT security can accept.
- Establish an audit trail suitable for investigations and compliance.
Deliverables ✅ IMPLEMENTED
- Token Lifecycle:
- ✅ Access token TTL: configurable per user (default 30 min; otherwise the user-specific session timeout).
- ✅ Session-based revocation: Tokens invalidated on logout, session expiry, admin action.
- ✅ No refresh tokens yet (deferred to Phase 5); access tokens stay short-lived and users re-authenticate frequently.
- Audit System:
- ✅ Comprehensive audit table: userId, action, category, level, IP, userAgent, requestId, sessionId, details.
- ✅ Event categories: auth, user, maintenance, system, api, security.
- ✅ Events logged: logins/logouts, failed auth, role changes, maintenance CRUD, approvals, assignments.
- ✅ Admin API: /admin/audit/logs with filtering, counting, CSV export.
- ✅ Monitoring service with error analysis, incident tracking, user activity reports.
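The token lifecycle rules above (a 30-minute default, a per-user override, and session-based revocation) might reduce to a check like this one. The names `SessionRecord`, `tokenExpiryMs`, and `isTokenUsable` are hypothetical, and treating a missing or non-positive timeout as "use the default" is an assumption.

```typescript
const DEFAULT_TTL_MINUTES = 30; // default stated in the token lifecycle above

// Hypothetical session row; `revoked` is set on logout or admin action.
interface SessionRecord {
  id: string;
  userId: string;
  issuedAtMs: number;
  revoked: boolean;
}

// Expiry instant for a token: the user-specific timeout when one is
// configured, otherwise the default TTL.
function tokenExpiryMs(issuedAtMs: number, userTimeoutMinutes?: number): number {
  const minutes =
    userTimeoutMinutes !== undefined && userTimeoutMinutes > 0
      ? userTimeoutMinutes
      : DEFAULT_TTL_MINUTES;
  return issuedAtMs + minutes * 60 * 1000;
}

// A token is usable only while its session is unrevoked and unexpired.
function isTokenUsable(
  session: SessionRecord,
  nowMs: number,
  userTimeoutMinutes?: number
): boolean {
  if (session.revoked) return false;
  return nowMs < tokenExpiryMs(session.issuedAtMs, userTimeoutMinutes);
}
```

Checking the session record on every request is what makes revocation immediate; a JWT validated only by signature and `exp` would remain valid until it expired.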
Implementation Tasks ✅ COMPLETED
- ✅ Audit logging integrated throughout auth, maintenance, and admin endpoints.
- ✅ Neon audit table with comprehensive indexing for performance.
- ✅ R2 export path designed (CSV export implemented, SIEM integration ready).
- ✅ Security event tracking for compliance (failed logins, suspicious activity).
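To make the audit record and its CSV export concrete, here is a minimal sketch. The field list and category names follow the deliverables above; the `AuditEntry` type, the string types chosen for each field, and the `toCsv` helper are assumptions, and quoting follows the usual RFC 4180 convention rather than any format confirmed by the project.

```typescript
type AuditCategory =
  | "auth"
  | "user"
  | "maintenance"
  | "system"
  | "api"
  | "security";

// Fields mirror the audit table columns listed above; types are assumed.
interface AuditEntry {
  userId: string;
  action: string;
  category: AuditCategory;
  level: string;
  ip: string;
  userAgent: string;
  requestId: string;
  sessionId: string;
  details: string;
}

const AUDIT_COLUMNS: (keyof AuditEntry)[] = [
  "userId", "action", "category", "level", "ip",
  "userAgent", "requestId", "sessionId", "details",
];

// Quote a field when it contains a comma, quote, or line break,
// doubling any embedded quotes.
function csvField(value: string): string {
  if (/[",\r\n]/.test(value)) {
    return '"' + value.replace(/"/g, '""') + '"';
  }
  return value;
}

// Header row followed by one row per entry, separated by CRLF.
function toCsv(entries: AuditEntry[]): string {
  const lines: string[] = [AUDIT_COLUMNS.join(",")];
  for (const entry of entries) {
    lines.push(AUDIT_COLUMNS.map((col) => csvField(entry[col])).join(","));
  }
  return lines.join("\r\n");
}
```

User agents and free-text details commonly contain commas and quotes, which is why the export escapes every field rather than joining raw values.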
Phase 3 (Week 5–6): Backup, DR & Governance Baseline 🔄 IN PROGRESS
Goals
- Ensure we can recover from failures with known RPO/RTO.
- Define a minimum data governance posture (retention + classification).
Deliverables 🔄 PARTIALLY IMPLEMENTED
- Backup Infrastructure:
- ✅ Neon PostgreSQL: Managed backups available (Neon handles automatically).
- ✅ R2 Storage: Manual versioning strategy chosen, because Cloudflare R2 does not support native object versioning (implementation status is tracked under Implementation Tasks).
- 🔄 RPO/RTO: Not yet formally defined and tested for pilot.
- Data Retention Policy:
- ✅ Audit logs: 2-year retention (compliance requirement).
- ✅ User data: Indefinite retention (business requirement).
- ✅ Maintenance data: 7-year retention (regulatory requirement).
- 🔄 Automated cleanup: Not yet implemented.
- DR Runbook:
- 🔄 Restore procedures documented but not tested.
- 🔄 Failover scenarios not validated.
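The retention periods above could drive the not-yet-implemented cleanup through a small policy table. This is only a sketch: `DataClass`, `retentionCutoff`, and `isPastRetention` are hypothetical names, and computing the cutoff by subtracting whole calendar years in UTC is an assumption about how the policy should be interpreted.

```typescript
type DataClass = "audit_log" | "user" | "maintenance";

// Retention in years per the policy above; null means indefinite retention.
const RETENTION_YEARS: Record<DataClass, number | null> = {
  audit_log: 2,
  user: null,
  maintenance: 7,
};

// Records created before the returned instant are past retention.
// Returns null for data classes that are kept indefinitely.
function retentionCutoff(dataClass: DataClass, now: Date): Date | null {
  const years = RETENTION_YEARS[dataClass];
  if (years === null) return null;
  const cutoff = new Date(now.getTime());
  cutoff.setUTCFullYear(cutoff.getUTCFullYear() - years);
  return cutoff;
}

function isPastRetention(dataClass: DataClass, createdAt: Date, now: Date): boolean {
  const cutoff = retentionCutoff(dataClass, now);
  return cutoff !== null && createdAt.getTime() < cutoff.getTime();
}
```

A cleanup job would typically pass the cutoff to a bounded `DELETE ... WHERE created_at < $1` rather than evaluating rows one by one; the per-record predicate is shown here only because it is easy to test.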
Implementation Tasks 🔄 PARTIALLY COMPLETED
- ✅ Neon backups: Configuration documented (managed service).
- ✅ Data classification: PII vs non-PII data identified and documented.
- 🔄 R2 manual versioning: Implement application-level versioning (timestamp/hash-based) since Cloudflare R2 doesn't support native object versioning.
- 🔄 Restore testing: Not yet performed - requires test environment setup.
- 🔄 Retention automation: Cleanup scripts not yet implemented.
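One possible shape for the timestamp/hash-based versioning mentioned above is to write every object under a new, sortable key instead of overwriting it. The key layout, `versionedKey`, and `latestVersion` below are assumptions for illustration; the content hash is taken as an input (for example a SHA-256 hex digest computed elsewhere).

```typescript
// Builds a key such as
//   attachments/42/photo.jpg/versions/20250615T120000000Z-abcdef012345
// The compact UTC timestamp is fixed-width, so keys for the same base
// sort chronologically as plain strings.
function versionedKey(
  baseKey: string,
  writtenAt: Date,
  contentHashHex: string
): string {
  const stamp = writtenAt.toISOString().replace(/[-:.]/g, "");
  const shortHash = contentHashHex.slice(0, 12);
  return `${baseKey}/versions/${stamp}-${shortHash}`;
}

// Given all version keys listed under one base key, returns the newest,
// or undefined when there are none.
function latestVersion(versionKeys: string[]): string | undefined {
  if (versionKeys.length === 0) return undefined;
  const sorted = versionKeys.slice().sort();
  return sorted[sorted.length - 1];
}
```

Because earlier versions are never overwritten, restoring an object means reading an older key; the cost is that old versions accumulate until a retention job removes them.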
Phase 4 (Week 7–8): Operational Model & Pilot Packaging 🔄 IN PROGRESS
Goals
- Make it clear who owns what during a pilot.
- Package documentation for IT, security, and governance reviewers.
Deliverables 🔄 PARTIALLY IMPLEMENTED
- Monitoring & Alerting:
- ✅ Comprehensive monitoring system implemented with health checks, error analysis, incident tracking, user activity monitoring, diagnostic tools, and incident dashboard.
- ✅ Real-time health monitoring: /monitoring/health, /monitoring/errors, /monitoring/performance, /monitoring/dashboard.
- ✅ Incident investigation tools: Request tracing, error pattern analysis, user activity monitoring.
- ✅ Audit-based analytics: Error trends, security event monitoring, system diagnostics.
- 🔄 Automated email/SMS alerts not yet configured (Cloudflare Workers alerting available).
- Support Model:
- ✅ RACI matrix defined for pilot ownership (see support-ownership-raci.md).
- ✅ Incident response procedures documented with escalation paths and SLAs.
- Documentation Package:
- ✅ This hardening sprint plan updated with actual implementation status.
- ✅ Pilot readiness gap assessment reflects current state.
- 🔄 Final IT-ready documentation package not yet assembled.
Implementation Tasks 🔄 MOSTLY COMPLETED
- ✅ Full monitoring infrastructure: Complete incident response dashboard, error analysis, user activity tracking, system diagnostics.
- ✅ Documentation updates: SSO, audit, and monitoring features accurately documented as completed.
- 🔄 Alert configuration: Cloudflare Workers alerting can be configured but not yet set up for institutional notifications.
- ✅ Ownership documentation: RACI matrix and support model defined for institutional pilot.
Phase 5 (Future): Token Refresh & Advanced SSO 🔮 PLANNED
Goals
- Implement refresh token mechanism for better user experience.
- Add support for SAML protocol if required by institution.
- Enhanced token revocation with IdP event handling.
Planned Tasks
- Implement refresh token flow with secure storage.
- Add SAML support alongside existing OIDC.
- IdP-initiated logout and token revocation hooks.
- Advanced session management with concurrent session limits.
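As a rough illustration of the planned concurrent session limit, the sketch below picks which sessions to revoke when a user exceeds the limit. Nothing here is decided by the plan: `ActiveSession`, `sessionsToRevoke`, and the policy of evicting the oldest sessions first are all assumptions.

```typescript
// Hypothetical minimal view of an active session.
interface ActiveSession {
  id: string;
  createdAtMs: number;
}

// Returns the ids of the oldest sessions that must be revoked so that at
// most `maxConcurrent` sessions remain. Does not mutate the input array.
function sessionsToRevoke(
  active: ActiveSession[],
  maxConcurrent: number
): string[] {
  if (maxConcurrent < 0) {
    throw new RangeError("maxConcurrent must be non-negative");
  }
  const excess = active.length - maxConcurrent;
  if (excess <= 0) return [];
  return active
    .slice()
    .sort((a, b) => a.createdAtMs - b.createdAtMs) // oldest first
    .slice(0, excess)
    .map((s) => s.id);
}
```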
Work That Is Explicitly Deferred
Until Phase 1 and Phase 2 of this plan are complete, defer:
- Maintenance caching enhancements and remaining backend route migration.
- New product features (notifications, analytics, scheduling, etc.).
- Deep abuse/performance testing beyond what is needed to validate SSO and token changes.
This keeps all energy focused on the institutional trust layer first. Once SSO, token lifecycle, and audit/backup posture are in place, it is safe to resume feature and migration work.