Observability, Monitoring & Runbooks

This document is the operations playbook for the Mr. Mentor backend. It maps the observability surface the service exposes: the /api/health endpoint, the Bull Board queue dashboard at /admin/queues, the Redis admin routes, the slot-cleanup diagnostics, the audit-trail tables, and the console-based logging. It then provides step-by-step runbooks for the incidents that actually happen in production: backed-up queues, emails not sending, WebRTC meetings that will not connect, failed payment reconciliation, Mr. Learn / Mr. Test sync stalls, DB pool exhaustion, Exotel telephony failures, deploy rollback, and stale KPI dashboards.

Status: documented from source on this branch. Endpoints, queues, env vars, and behaviours below are derived from the actual source under src/ and cross-checked against the deployment doc. Anything not directly provable from code is marked (inferred).

Overview

The backend has no third-party APM or error tracker wired in — there is no Sentry, Datadog, or New Relic integration in the code (grep -ri sentry src returns nothing, and morgan is imported in src/app.ts but its app.use(morgan('combined')) line is commented out). Observability is therefore built from a handful of first-party pieces:

| Surface | What it answers | Who uses it |
| --- | --- | --- |
| `GET /api/health` | Is the process up, and are Postgres + Redis reachable? | Load balancer / Docker `HEALTHCHECK`, blue-green deploy verification, uptime monitors |
| `/admin/queues` (Bull Board) | Are background jobs flowing? What failed and why? | On-call engineers, ops |
| `/api/redis/keys` | What is cached / what repeatable jobs exist? | Engineers debugging cache / KPI staleness |
| `/api/cleanup/statistics` | How many slots are expired / unbooked? | Admins, ops |
| `console.*` stdout logs | Everything else: request errors, worker progress, integration failures | Anyone with `docker logs` / `pm2 logs` shell access |
| Audit tables (`login_logs`, `gst_audit_log`, `lead_activity_log`, …) | Who did what, when (forensics) | Admins, finance, sales-head, compliance |

Personas. SREs/on-call engineers (shell + Bull Board), admins/superadmins (authenticated cleanup + audit views), and automated infrastructure (the load balancer and the blue-green health-check.sh that polls /api/health).

This doc sits in docs/devops/ alongside deployment-architecture.md; for the queue internals it references, see background-jobs-and-queues.md.

Key concepts & entities

Glossary

Health check — GET /api/health; returns 200 with success:true only when the database and Redis are both connected, else 503. Used as the deploy gate.
Bull Board — a self-hosted dashboard (@bull-board/express) mounted at /admin/queues, protected by HTTP Basic Auth, listing every BullMQ queue with its waiting / active / completed / failed / delayed counts and per-job payloads + stack traces.
Repeatable job — a BullMQ cron/interval job (e.g. KPI every 15 min). They are stored in Redis; re-registering on boot removes the old key first so schedules do not stack.
Audit trail — append-only DB rows recording sensitive actions. The codebase has several independent ones (see below) rather than one unified log.
Structured logging — not present. Logs are free-text console.info/warn/error/debug to stdout (≈2,250 call sites), captured by Docker / PM2.

Main TypeORM entities relevant to observability

| Entity | File | Purpose |
| --- | --- | --- |
| `LoginLog` (`login_logs`) | `src/entities/LoginLog.ts` | One row per login / logout / master-login with `ipAddress`, `userAgent`, `timestamp`. Written by `src/services/AuthService.ts`. |
| `GstAuditLog` (`superadmin.gst_audit_log`) | `src/entities/GstAuditLog.ts` | Append-only, one row per changed GST/invoice field: `action`, `field`, `oldValue`, `newValue`, `changedBy`, `changedByEmail`. |
| `LeadActivityLog` (`mas_crm.lead_activity_log`) | `src/entities/LeadActivityLog.ts` | CRM forensics: lead status/owner changes with old/new value and `changed_by`. |
| `AgentConfigurationHistory` | `src/entities/AgentConfigurationHistory.ts` | Versioned history of AI agent config edits. |
| `LeadEmailLog` / `LeadWhatsAppLog` / `AskMasLog` | `src/entities/*` | Per-channel send/usage trails. |

There is no FinanceAuditService.ts in this repo. Finance reporting/forensics is served by src/services/FinanceReportsService.ts and src/services/FinanceExportService.ts, and the GST audit trail is the GstAuditLog entity above.

Architecture

How the observability surface is wired into the app. The health, redis, cleanup, and Bull Board routers are registered in src/routes/index.ts and mounted onto the Express app in src/app.ts.

flowchart TD
    subgraph Clients["Operators & Infra"]
        LB["Load balancer / Docker HEALTHCHECK"]
        BG["Blue-green health-check.sh"]
        Eng["On-call engineer"]
        Adm["Admin / Superadmin"]
    end

    subgraph App["Express app (src/app.ts)"]
        RIdx["RouteIndex (src/routes/index.ts)"]
        HR["HealthRoutes -> /api/health"]
        RR["RedisRoutes -> /api/redis/keys"]
        CR["CleanupRoutes -> /api/cleanup/*"]
        BB["BullBoardRoutes -> /admin/queues"]
    end

    subgraph Svc["Services"]
        HS["HealthService"]
        RS["RedisService (ioredis singleton)"]
        SCS["SlotCleanupService"]
        QS["QueueService (BullMQ)"]
    end

    subgraph Infra["Infrastructure"]
        DB[("PostgreSQL pool 5-20")]
        RD[("Redis - cache + queues")]
        Stdout["stdout console logs -> docker/pm2"]
    end

    LB --> HR
    BG --> HR
    Eng --> BB
    Eng --> RR
    Adm --> CR

    RIdx --> HR & RR & CR & BB
    HR --> HS
    HS --> DB
    HS --> RD
    RR --> RS --> RD
    CR --> SCS --> DB
    CR --> QS
    BB --> QS --> RD
    App -.->|"all errors/info"| Stdout

Data model

The observability/audit footprint is a small set of append-only or status tables. They are not tightly related to one another (each domain keeps its own trail); the diagram below shows their shape and the User they reference.

erDiagram
    USER ||--o{ LOGIN_LOG : "generates"
    USER ||--o{ LEAD_ACTIVITY_LOG : "acts via changed_by"
    USER ||--o{ GST_AUDIT_LOG : "acts via changed_by"

    USER {
        uuid id PK
        string email
        string role
    }
    LOGIN_LOG {
        uuid id PK
        uuid userId FK
        enum action "login|logout|master_login"
        string ipAddress
        text userAgent
        timestamp timestamp
    }
    LEAD_ACTIVITY_LOG {
        uuid id PK
        uuid lead_id
        enum activity_type
        uuid changed_by FK
        text old_value
        text new_value
    }
    GST_AUDIT_LOG {
        uuid id PK
        string payment_id
        string invoice_number
        string action "create|update|issue"
        string field
        text old_value
        text new_value
        string changed_by FK
        string changed_by_email
        timestamp created_at
    }

Notable enums / status fields:

LogAction (src/entities/LoginLog.ts): login, logout, master_login.
GstAuditLog.action: create / update / issue (one row per changed field).
HealthResponse.status (src/types/health.types.ts): always the literal "OK" in the body — the real signal is the HTTP status code (200 vs 503), not this string.
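
As a rough illustration of that last point, a probe or uptime monitor might classify the endpoint by status code first and treat the body only as confirmation. This is a minimal sketch: the `HealthBody` shape, the `classifyHealth` name, and the `unexpected` bucket are assumptions for illustration, not code from `src/`.

```typescript
// Assumed body shape: only the `success` flag described in this doc is used.
type HealthBody = { success?: boolean };
type HealthVerdict = "healthy" | "unavailable" | "error" | "unexpected";

// Classify a /api/health response by HTTP status, as the deploy gate does.
// The body's status string is ignored because it is always "OK".
function classifyHealth(httpStatus: number, body: HealthBody): HealthVerdict {
  if (httpStatus === 200 && body.success === true) return "healthy";
  if (httpStatus === 503) return "unavailable"; // DB or Redis not connected
  if (httpStatus === 500) return "error"; // exception inside the health check
  return "unexpected"; // e.g. a proxy error page or a 200 without success:true
}
```

A monitor built this way would keep the old color live on anything other than `healthy`, which matches the gate behaviour described in this document.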

API surface

Derived from src/routes/index.ts mounts and the individual route files. Note the auth nuances called out in Edge cases below.

| Method | Path | Auth/role | Purpose |
| --- | --- | --- | --- |
| GET | `/api/health` | none (public) | Liveness + DB/Redis readiness. `200` healthy, `503` if DB or Redis down, `500` on internal error. `src/routes/health.routes.ts` |
| GET | `/api/redis/keys` | none in route file (intended admin) | Dump all Redis keys and values. `src/routes/redis.routes.ts` |
| DELETE | `/api/redis/keys/:key` | none in route file (intended admin) | Delete one Redis key. `404` if key absent. |
| DELETE | `/api/cleanup/slots` | `authMiddleware` + `adminMiddleware` | Run expired-unbooked-slot cleanup synchronously now. |
| GET | `/api/cleanup/statistics` | auth + admin | Slot counts (total/available/booked/future/expired/expiredUnbooked). |
| GET | `/api/cleanup/statistics/:mentorId` | auth + admin | Per-mentor slot stats. |
| POST | `/api/cleanup/schedule` | auth + admin | Enqueue a one-off cleanup job onto `cleanupQueue`. |
| GET | `/api/cleanup/slots/:filter` | auth + admin | List slots by filter (`all\|available\|booked\|future\|expired\|expiredUnbooked`), `?limit` default 100. |
| GET / POST / etc. | `/admin/queues/*` | HTTP Basic Auth (credentials from the `BULL_BOARD_USERNAME` / `BULL_BOARD_PASSWORD` env vars) | Bull Board UI + its internal API. `src/routes/bullBoard.routes.ts` |
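
The `:filter` parameter and the `?limit` default of `GET /api/cleanup/slots/:filter` can be sketched as a small validator. How the real controller validates these values is not documented here, so `parseSlotQuery` and its fallback behaviour are hypothetical; only the filter names and the default of 100 come from the table above.

```typescript
// Filter names as listed for GET /api/cleanup/slots/:filter.
const SLOT_FILTERS = ["all", "available", "booked", "future", "expired", "expiredUnbooked"] as const;
type SlotFilter = (typeof SLOT_FILTERS)[number];

// Hypothetical helper: returns null for an unknown filter, and falls back to
// the documented default of 100 when limit is missing or not a positive integer.
function parseSlotQuery(filter: string, limit?: string): { filter: SlotFilter; limit: number } | null {
  const parsed = limit === undefined ? NaN : parseInt(limit, 10);
  const safeLimit = !isNaN(parsed) && parsed > 0 ? parsed : 100;
  for (const candidate of SLOT_FILTERS) {
    if (candidate === filter) return { filter: candidate, limit: safeLimit };
  }
  return null;
}
```

For example, `parseSlotQuery("expiredUnbooked")` yields a limit of 100, while `parseSlotQuery("stale")` yields `null`, which a caller could map to a 400 response.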

Bull Board is mounted at the root (this.router.use('/', this.bullBoardRoutes.router) in src/routes/index.ts), and the router itself prefixes /admin/queues. The base path for its assets is set with serverAdapter.setBasePath('/admin/queues').

User journeys

1. Infrastructure health probe (load balancer / deploy gate)

The most frequently hit endpoint. The blue-green deploy script and the Docker HEALTHCHECK both rely on it returning 200 + success:true before routing traffic.

sequenceDiagram
    participant LB as Load balancer
    participant API as Express
    participant HC as HealthController
    participant HS as HealthService
    participant DB as Postgres
    participant RD as Redis

    LB->>API: GET /api/health
    API->>HC: getHealth
    HC->>HS: getHealthStatus and isHealthy
    HS->>DB: database.isConnected
    HS->>RD: client.status equals ready
    alt both connected
        HS-->>HC: status OK with uptime and env
        HC-->>LB: 200 success true
    else db or redis down
        HS-->>HC: isHealthy false
        HC-->>LB: 503 Service unavailable
    end
    Note over LB: On 503 the blue-green switch is aborted and the old color stays live

2. On-call triages a failing queue via Bull Board

When jobs stop flowing, the engineer opens the dashboard, inspects failed jobs, reads their stack traces, and retries.

sequenceDiagram
    participant Eng as On-call engineer
    participant BB as Bull Board at /admin/queues
    participant QS as QueueService
    participant RD as Redis

    Eng->>BB: Open /admin/queues
    BB->>Eng: 401 WWW-Authenticate Basic
    Eng->>BB: Resend with Basic credentials
    BB->>QS: getQueues
    QS->>RD: read waiting active failed delayed counts
    RD-->>BB: per-queue metrics
    BB-->>Eng: Render board with failed job list
    Eng->>BB: Click a failed job to read stack trace
    Eng->>BB: Retry job or Retry all failed
    BB->>RD: move job back to waiting
    Note over RD: A worker picks it up and reprocesses

3. Admin runs manual slot cleanup and reads statistics

sequenceDiagram
    participant Adm as Admin
    participant API as Express
    participant CC as CleanupController
    participant SCS as SlotCleanupService
    participant DB as Postgres

    Adm->>API: GET /api/cleanup/statistics with JWT
    API->>CC: getSlotStatistics after auth and admin checks
    CC->>SCS: getSlotStatistics
    SCS->>DB: count slots by category
    DB-->>Adm: counts available booked expired expiredUnbooked
    Adm->>API: DELETE /api/cleanup/slots
    CC->>SCS: cleanupExpiredUnbookedSlots
    SCS->>DB: delete expired unbooked rows
    DB-->>Adm: deletedCount and details
    Note over Adm: Prefer POST /api/cleanup/schedule to offload to the queue under heavy load

4. Engineer inspects and clears a stale cache key in Redis

sequenceDiagram
    participant Eng as Engineer
    participant API as Express
    participant RC as RedisController
    participant RS as RedisService
    participant RD as Redis

    Eng->>API: GET /api/redis/keys
    API->>RC: getAllKeysAndValues
    RC->>RS: getAllKeysAndValues
    RS->>RD: keys star then read each value
    RD-->>Eng: all keys and values as JSON
    Eng->>API: DELETE /api/redis/keys/mykey
    RC->>RS: delete mykey
    RS->>RD: DEL mykey
    alt key existed
        RD-->>Eng: 200 deleted with deletedCount
    else key missing
        RD-->>Eng: 404 Key not found
    end
    Note over Eng: Deleting a KPI cache key forces a recompute on next dashboard read

5. Sign-in and logout are recorded in the login_logs audit trail

sequenceDiagram
    participant U as User
    participant Auth as AuthService
    participant DB as Postgres login_logs

    U->>Auth: signIn with credentials
    Auth->>Auth: verify password and issue JWT
    Auth->>DB: insert LoginLog action login ip userAgent timestamp
    Note over DB: Later an admin queries login_logs by userId to confirm access time and source IP
    Auth->>U: token and user
    U->>Auth: logout
    Auth->>DB: insert LoginLog action logout

Background jobs & async

Observability of async work is entirely through Bull Board + worker stdout logs. The full queue catalogue lives in background-jobs-and-queues.md; the monitoring-relevant facts:

Queues are created in src/services/QueueService.ts (singleton). getQueues() returns the list surfaced in Bull Board — note it currently exposes 19 of the declared queues; some declared queues (e.g. studentRiskComputationQueue, badgeEvaluationQueue) are not in the getQueues() array and therefore do not appear in Bull Board (gotcha — see below).
Job retention is set per-queue via defaultJobOptions.removeOnComplete / removeOnFail (typically 10–200). Once trimmed, failed jobs are gone — capture stack traces before they age out.
Repeatable schedules (registered on boot in src/index.ts):
KPI dashboard + sales-overview — every 15 min (scheduleKpiCalculation).
Slot cleanup — every 15 minutes despite the method name scheduleSlotCleanup and a stale "24 hours" comment (the code uses every: 15 * 60 * 1000).
Lead auto-assignment — every 15 min.
Aarya call-sync (ElevenLabs) — every 15 min.
Workflow trigger-scan — every 5 min.
Miss Ozone reconciler — every 60 s; prune daily 03:30.
Daily warning processing — 59 23 * * * IST; assignment reminders — 0 19 * * * IST; daily cards — 0 0 * * * IST; application backfill — 0 2 * * * IST.
Salary benchmark — every 15 days.
Mr. Learn / Mr. Test sync + reminders — per-config intervals (hours), stable jobId.
Socket.IO events (meetings, recording, presence) are not captured by any dashboard — diagnose them from stdout logs only.
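
Because nothing alerts on a missed schedule, one way to reason about the intervals above is to treat a job as overdue once more than a couple of periods have passed since its last run. The sketch below is illustrative only: the `EVERY_MS` labels, `isOverdue`, and the two-period grace are assumptions, and the last-run timestamp would have to be read off Bull Board or the logs by hand.

```typescript
const MINUTE_MS = 60 * 1000;

// Interval schedules as listed above. Keys are illustrative labels, not queue names.
const EVERY_MS = {
  kpi: 15 * MINUTE_MS,
  slotCleanup: 15 * MINUTE_MS, // every: 15 * 60 * 1000 in the code, not 24 h
  leadAutoAssignment: 15 * MINUTE_MS,
  aaryaCallSync: 15 * MINUTE_MS,
  workflowTriggerScan: 5 * MINUTE_MS,
  missOzoneReconciler: 1 * MINUTE_MS,
};

// True when more than `missedRuns` periods have elapsed since the last run.
function isOverdue(lastRunMs: number, everyMs: number, nowMs: number, missedRuns = 2): boolean {
  return nowMs - lastRunMs > everyMs * missedRuns;
}
```

With these numbers, a KPI job last seen 31 minutes ago counts as overdue, while one seen 20 minutes ago does not.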

External integrations

| Integration | Env vars | Failure / fallback behaviour |
| --- | --- | --- |
| PostgreSQL | `DB_HOST/PORT/USERNAME/PASSWORD/NAME` | Pool `min 5 / max 20` (`src/config/database.ts` `extra`). `synchronize: true`. If down, `/api/health` returns 503 and most routes 500. |
| Redis | `REDIS_HOST/PORT` (password commented out) | `ioredis` with `lazyConnect`, `reconnectOnError` on `READONLY`. If down, queues stall and `/api/health` returns 503. |
| Bull Board auth | `BULL_BOARD_USERNAME`, `BULL_BOARD_PASSWORD` | Credentials are set via these env vars and must be overridden in prod. |
| Exotel (telephony) | `EXOTEL_SID/API_KEY/API_TOKEN/CALLER_ID/SUBDOMAIN/SMS_SENDER_ID/WEBHOOK_TOKEN`, `BACKEND_PUBLIC_URL` | Feature self-disables (CRM falls back to `tel:` links) when unset. `EXOTEL_SUBDOMAIN` must match the account cluster or auth 401s. See comms-telephony-exotel.md. |
| Razorpay (payments) | `RAZORPAY_KEY_ID/SECRET` | Webhook-driven; failures surface in worker/controller logs and in finance reconciliation, not health. See payments-finance-gst.md. |
| Gmail SMTP | `EMAIL_USER`, `EMAIL_PASS` | `email.worker` throws `535-5.7.8 BadCredentials` on a stale app password; jobs land in the `emailQueue` failed set. |
| ElevenLabs / Aarya / Miss Ozone (AI calling) | per-service keys | Sync workers poll and write back; failures visible in `aaryaSyncQueue` / `missOzoneQueue`. |
| Mr. Learn (Graphy) / Mr. Test | per-config credentials | Sync workers; stalls visible in `mrlearnSyncQueue` / `mrtestSyncQueue`. See integration-mr-learn.md. |

There are no formal feature flags for observability; behaviour is governed by presence or absence of the env vars above.

Status lifecycles

A BullMQ job's lifecycle — exactly what the Bull Board columns represent:

stateDiagram-v2
    [*] --> Waiting: enqueued
    Waiting --> Active: worker picks up
    Active --> Completed: success
    Active --> Failed: throws
    Failed --> Waiting: retry attempt remaining
    Failed --> [*]: attempts exhausted then trimmed by removeOnFail
    Completed --> [*]: trimmed by removeOnComplete
    Waiting --> Delayed: scheduled or backoff
    Delayed --> Waiting: delay elapsed
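
The same lifecycle can be written down as a transition table, which is handy when reasoning about what Bull Board should show after a retry. This is a sketch of the diagram above, not BullMQ's internal model; `removed` is a stand-in for the terminal state a job reaches once it is trimmed.

```typescript
type JobState = "waiting" | "active" | "completed" | "failed" | "delayed" | "removed";

// Allowed moves, mirroring the state diagram. "removed" is a hypothetical
// label for the terminal state after removeOnComplete / removeOnFail trimming.
const TRANSITIONS: Record<JobState, JobState[]> = {
  waiting: ["active", "delayed"],
  active: ["completed", "failed"],
  failed: ["waiting", "removed"], // retried, or exhausted and later trimmed
  completed: ["removed"],
  delayed: ["waiting"],
  removed: [],
};

function canTransition(from: JobState, to: JobState): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}
```

Retrying a failed job from the dashboard corresponds to the `failed` to `waiting` edge; there is no edge from `completed` back to `active`.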

The health endpoint's effective state machine:

stateDiagram-v2
    [*] --> Checking
    Checking --> Healthy: db connected and redis ready
    Checking --> Unavailable: db down or redis not ready
    Checking --> Error: exception thrown
    Healthy --> [*]: 200 success true
    Unavailable --> [*]: 503 Service unavailable
    Error --> [*]: 500 Internal server error

Runbooks

Each runbook is Symptom → Diagnosis → Fix. Shell access to the server (or docker logs / pm2 logs) and the Bull Board credentials are assumed.

Runbook A: A queue is backed up (jobs piling in "waiting")

Symptom: Bull Board shows a large/growing waiting count and a low/zero active count; downstream effects (no emails, stale KPIs) appear.

Diagnosis
1. Open /admin/queues and identify the queue with the growing backlog.
2. Check its active count. If it is 0, the worker is dead or not consuming.
3. Run docker logs <container> (or pm2 logs) and grep for the worker name (e.g. email.worker). Look for a crash on boot or an unhandled rejection.
4. Confirm the dependencies are up: GET /api/health should be 200. A 503 means Postgres or Redis is unreachable; for database problems see Runbook F.
5. Check the failed tab. A poison job that fails repeatedly can stall throughput.

Fix
- If the worker process crashed: restart the container/PM2 process; workers are started in src/index.ts on boot.
- If a poison job is the cause: open it in Bull Board, read the stack trace, then remove it (or fix the data and retry).
- If Redis was the root cause: restore Redis and confirm /api/health returns 200, then retry the failed jobs; waiting jobs are picked up once workers reconnect.
- Verify recovery: active climbs and waiting drains.
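
The first diagnosis steps amount to a simple decision on the counts Bull Board displays. A minimal sketch, assuming a hypothetical `triageQueue` helper and an arbitrary backlog threshold of 100 (neither exists in the codebase):

```typescript
// Counts as read from a queue's row in Bull Board.
type QueueCounts = { waiting: number; active: number; failed: number };
type TriageHint = "ok" | "worker-not-consuming" | "check-failed-tab" | "backlog-but-consuming";

// Hypothetical triage rule: a backlog with zero active jobs points at the
// worker; a backlog with failures points at a possible poison job.
function triageQueue(counts: QueueCounts, backlogThreshold = 100): TriageHint {
  if (counts.waiting < backlogThreshold) return "ok";
  if (counts.active === 0) return "worker-not-consuming";
  if (counts.failed > 0) return "check-failed-tab";
  return "backlog-but-consuming";
}
```

The threshold is a placeholder; a sensible value depends on the queue's normal throughput.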

Runbook B: Emails are not sending

Symptom: OTP / meeting / reminder emails not arriving; users report no mail.

Diagnosis
1. Bull Board → emailQueue. Are jobs in failed? Open one and read the error.
2. A 535-5.7.8 Username and Password not accepted (BadCredentials) error means the Gmail app password is stale or was rotated.
3. If jobs sit in waiting with no active, the email.worker is down (see Runbook A).
4. Check stdout for email.worker errors.

Fix
- For bad credentials: rotate the Gmail app password, update both AWS Secrets Manager secrets (mr-mentor-backend/development and /production), restart the blue and green containers so they re-read the env, then retry the failed emailQueue jobs in Bull Board (they do not auto-retry once exhausted). See the deployment doc for the env propagation path.
- For a dead worker: restart the process, then retry failed jobs.
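
When many failed jobs need sorting, the credential case can be separated from everything else by looking for the SMTP code in the error text. This is a small sketch with a hypothetical `classifyEmailFailure` helper; it only recognises the one error string quoted above and treats every other message as unknown.

```typescript
type EmailFailure = "bad-credentials" | "other";

// Matches the Gmail SMTP code quoted in the diagnosis steps.
function classifyEmailFailure(errorMessage: string): EmailFailure {
  return errorMessage.indexOf("535-5.7.8") !== -1 ? "bad-credentials" : "other";
}
```

A `bad-credentials` result means retrying is pointless until the app password has been rotated and the containers restarted.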

Runbook C: Meeting WebRTC will not connect

Symptom: Two participants join a meeting room but never see/hear each other; video stays black.

Diagnosis (WebRTC is not in Bull Board; use logs + the realtime doc)
1. Confirm /api/health is 200 (Socket.IO shares the HTTP server in src/index.ts; if the process is unhealthy, sockets are too).
2. Run docker logs and grep for the socket signaling events: offer, answer, ice-candidate, join-room. Confirm both peers emitted join-room and exchanged an offer/answer pair.
3. Check whether ICE candidates are being relayed. If offers/answers appear but no media flows, it is almost always a STUN/TURN / NAT-traversal problem, not the app.
4. Confirm CORS / origin: Socket.IO only allows configured origins (localhost 3000/3001/3002/8088 + production URLs). A blocked origin prevents the socket from connecting at all.
5. Confirm the client cleared any stale connection (takeover-connection is emitted when a second tab takes over).

Fix
- App-level signaling broken (no offer/answer in logs): restart the backend so Socket.IO re-initialises; have clients rejoin (leave-room then join-room).
- Media never flows despite signaling: this is infra. Verify the TURN server config on the client side; the backend cannot fix NAT traversal.
- Origin blocked: add the origin to the Socket.IO CORS allow-list and redeploy.
- See realtime-and-socketio.md and mentorship-and-meetings.md.
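
The origin check in diagnosis step 4 can be reproduced offline before touching the server. The sketch below assumes exact-match comparison and an `http://` scheme for the localhost entries; the real allow-list lives in the Socket.IO setup and may differ, and `https://app.example.com` in the usage note is a placeholder, not a real production URL.

```typescript
// Localhost ports from this runbook; the http:// scheme is an assumption.
const DEV_ORIGINS = [3000, 3001, 3002, 8088].map((port) => `http://localhost:${port}`);

// Hypothetical exact-match check against dev origins plus the supplied
// production origins.
function isOriginAllowed(origin: string, productionOrigins: string[]): boolean {
  return DEV_ORIGINS.concat(productionOrigins).indexOf(origin) !== -1;
}
```

For instance, `isOriginAllowed("http://localhost:4000", ["https://app.example.com"])` is false, which would show up on the client as a socket that never connects.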

Runbook D: Payment / Razorpay webhook failed or not reconciled

Symptom: A user paid but tokens/enrollment were not granted, or finance reports show a payment with no invoice.

Diagnosis
1. Search stdout for the Razorpay order/payment id around the payment time.
2. Confirm the webhook reached the backend (look for the webhook handler log line). If the request never arrived, the issue is at the Razorpay dashboard / public URL level.
3. If the webhook arrived but processing threw, the error is in the controller/worker logs.
4. For GST/invoice forensics, query superadmin.gst_audit_log by payment_id to see what field changes (if any) were recorded.

Fix
- Re-trigger reconciliation per the finance flow (see payments-finance-gst.md); the payment id is the key.
- If the webhook never arrived: verify the Razorpay webhook URL points at the live blue-green color and the public URL resolves; re-send the webhook from Razorpay.
- Confirm the grant landed (tokens/enrollment) and that a gst_audit_log / invoice row now exists.

Runbook E: Mr. Learn or Mr. Test sync is stuck

Symptom: New enrolments / progress from Mr. Learn (Graphy) or Mr. Test are not appearing; admin "sync" shows no recent activity.

Diagnosis
1. Bull Board → mrlearnSyncQueue / mrlearnNewStudentSyncQueue / mrtestSyncQueue. Check failed jobs and read the error (commonly a 401/403 from the upstream API, which indicates stale credentials).
2. Confirm the repeatable job exists: GET /api/redis/keys and look for the repeatable job key, or check Bull Board's repeatable section. Per-config jobs use stable ids like mrlearn-sync-<configId>.
3. Check stdout for mrlearnSync.worker / mrtestSync.worker errors.

Fix
- Stale upstream credentials: update the sync config credentials, then trigger a manual run (QueueService.triggerMrLearnSyncNow(configId) / triggerMrTestSyncNow(configId) via the admin endpoint).
- Missing schedule: re-save the sync config (admin UI) to re-register the repeatable job, or reboot the backend (boot re-wires schedules in src/index.ts).
- Poison job: remove the failed job, fix the config, retry.
- See integration-mr-learn.md and integration-mr-test.md.

Runbook F: Database connection pool exhausted

Symptom: Requests hang then 500; logs show TimeoutError: Could not acquire a connection or "remaining connection slots" errors; /api/health may flip to 503.

Diagnosis
1. GET /api/health: a 503 with database.connected:false confirms the DB is unreachable; a slow 200 suggests saturation, not an outage.
2. Recall the pool is min 5 / max 20 per process (src/config/database.ts extra). With multiple instances/colors live, total connections = 20 × instances.
3. On the DB: count active connections (SELECT count(*) FROM pg_stat_activity;) and look for long-running / idle-in-transaction queries.
4. Check for a query leak in recently deployed code (a transaction never committed/rolled back holds a connection).

Fix
- Immediate: restart the backend container(s) to drop and re-establish the pool.
- Kill stuck Postgres sessions: SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle in transaction' AND now() - state_change > interval '5 minutes';
- If legitimately at capacity, reduce live instance count or raise the DB max_connections (and re-evaluate the pool max).
- Root-cause the leaking query and roll forward a fix (or roll back, see Runbook H).
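
The arithmetic in diagnosis step 2 is worth making explicit, since both colors can hold a full pool at the same moment during a deploy. The helper names and the `reserved` parameter below are illustrative; only the per-process maximum of 20 comes from the pool config described above, and the 100 used in the usage note is an example value, not this deployment's setting.

```typescript
const POOL_MAX_PER_PROCESS = 20; // pool max from src/config/database.ts `extra`

// Worst case: every live backend process has its pool fully checked out.
function worstCaseConnections(instances: number, poolMax = POOL_MAX_PER_PROCESS): number {
  return instances * poolMax;
}

// Connections left for everything else (psql sessions, migrations, other apps).
// A negative result means the backends alone can exceed the server limit.
function connectionHeadroom(dbMaxConnections: number, instances: number, reserved = 0): number {
  return dbMaxConnections - reserved - worstCaseConnections(instances);
}
```

With a server limit of 100, two live instances leave a headroom of 60, while six instances give -20.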

Runbook G: Exotel click-to-call / SMS failing

Symptom: CRM click-to-call does nothing or returns errors; SMS not delivered; or the feature silently fell back to plain tel: links.

Diagnosis
1. If the UI shows tel: links instead of click-to-call, the Exotel env vars are unset. The feature self-disables when any of EXOTEL_SID/API_KEY/API_TOKEN/CALLER_ID is missing.
2. A 401 from Exotel almost always means EXOTEL_SUBDOMAIN does not match the account's cluster (e.g. Singapore api.exotel.com vs Mumbai api.in.exotel.com).
3. Validate creds directly: curl -u <key>:<token> https://<subdomain>/v1/Accounts/<SID>. A 200 confirms creds + cluster.
4. For status callbacks, confirm EXOTEL_WEBHOOK_TOKEN and BACKEND_PUBLIC_URL are set so Exotel can reach /api/exotel/*.

Fix
- Set/correct the env vars in AWS Secrets Manager for the right environment, ensuring EXOTEL_SUBDOMAIN matches the account cluster; restart both colors.
- Re-run the curl validation; then retry a click-to-call from the CRM.
- See comms-telephony-exotel.md.
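
The self-disable rule from diagnosis step 1 can be checked against an exported environment before restarting anything. This sketch assumes the shorthand EXOTEL_SID/API_KEY/API_TOKEN/CALLER_ID expands to the four prefixed names below; `missingExotelVars` is a hypothetical helper, not the check the backend itself runs.

```typescript
// Assumed expansion of the EXOTEL_SID/API_KEY/API_TOKEN/CALLER_ID shorthand.
const REQUIRED_EXOTEL_VARS = ["EXOTEL_SID", "EXOTEL_API_KEY", "EXOTEL_API_TOKEN", "EXOTEL_CALLER_ID"];

// Returns the required variables that are unset or empty. A non-empty result
// means click-to-call would fall back to plain tel: links.
function missingExotelVars(env: Record<string, string | undefined>): string[] {
  return REQUIRED_EXOTEL_VARS.filter((varName) => !env[varName]);
}
```

An empty array only shows the feature is enabled; a subdomain that does not match the account cluster still has to be caught with the curl validation.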

Runbook H: Deploy rollback (production)

Symptom: A new release is live but unhealthy (errors spiking, /api/health 503, or a regression).

Diagnosis
1. GET /api/health against the live color. A 503 or a response without success:true confirms the new release is bad.
2. Check /home/ubuntu/blue-green-deployment/.current_env to see which color is live.
3. Tail the new container's logs for the boot error or regression.

Fix (instant rollback)
- The previous color is stopped, not removed, so the old container is intact. Run the server-side rollback.sh (in /home/ubuntu/blue-green-deployment/), which switches Nginx back to the previous color and flips .current_env.
- Verify /api/health on the restored color returns 200 success:true.
- Leave the bad color stopped; investigate and re-deploy once fixed.
- Full topology and scripts: deployment-architecture.md.

Runbook I: KPIs / dashboard numbers are stale

Symptom: Admin/sales dashboards show outdated figures.

Diagnosis
1. KPIs are recomputed by the kpiQueue repeatable job every 15 min and cached in Redis.
2. Bull Board → kpiQueue: confirm the repeatable job ran recently and did not fail.
3. GET /api/redis/keys to inspect the cached KPI value and its freshness.

Fix
- Delete the stale cache key via DELETE /api/redis/keys/<kpiKey> to force a recompute on next read, or enqueue an immediate recompute (QueueService.addKpiJob).
- If the repeatable schedule is missing, reboot the backend (re-registers on boot) or inspect why scheduleKpiCalculation failed in stdout.

Edge cases, limits & gotchas

Hardening note (internal). A security/hardening observation for this area is tracked in the team's private notes (internal/security-and-hardening-notes.md) and is intentionally not published on this site.
Not all queues appear in Bull Board. QueueService.getQueues() omits some declared queues (e.g. studentRiskComputationQueue, badgeEvaluationQueue), so their jobs are invisible in the dashboard — diagnose those from logs only.
Health check is binary and shallow. It only checks DB isConnected() and Redis client.status === 'ready'; it does not run a query or ping Redis, so a "connected but wedged" dependency can still report healthy.
scheduleSlotCleanup runs every 15 min, not 24 h. The method name and an inline comment say 24 hours, but the code uses every: 15 * 60 * 1000.
Logs are ephemeral free-text. ≈2,250 console.* calls go to stdout with no structure, no levels enforced, no correlation ids, and no central aggregation. Capture what you need with docker logs/pm2 logs before containers are recycled.
Failed jobs age out. removeOnFail (10–200) means stack traces disappear after enough newer failures — screenshot/copy them during triage.
Multi-platform. Requests carry an x-platform header; logs do not consistently record it, so attributing an issue to mr-mentor vs my-analytics-school may require correlating with the request body/route (inferred).
Auth on cleanup is real. Unlike the /api/redis/* routes, all /api/cleanup/* routes enforce authMiddleware + adminMiddleware.
No alerting. Nothing pages on a 503 or a queue backlog; monitoring is pull-based (someone must look). External uptime monitoring on /api/health is the only proactive signal (inferred).

Related docs

deployment-architecture.md — Docker, blue-green, rollback scripts, env propagation.
background-jobs-and-queues.md — full BullMQ queue/worker catalogue and schedules.
realtime-and-socketio.md — WebRTC signaling and Socket.IO events.
request-lifecycle-and-middleware.md — middleware, auth, error handling.
multi-platform-architecture.md — x-platform routing.
comms-telephony-exotel.md — Exotel integration internals.
comms-email-and-notifications.md — email worker + templates.
payments-finance-gst.md — Razorpay + GST audit trail.
integration-mr-learn.md, integration-mr-test.md — external LMS sync.
identity-and-access.md — auth + login_logs audit trail.