Observability, Monitoring & Runbooks
This document is the operations playbook for the Mr. Mentor backend. It maps the
observability surface the service exposes (the /api/health endpoint, the Bull Board
queue dashboard at /admin/queues, the Redis admin routes, the slot-cleanup
diagnostics, the various audit-trail tables, and the console-based logging), and then
provides concrete, step-by-step runbooks for the incidents that actually happen in
production — backed-up queues, emails not sending, WebRTC meetings that will not connect,
failed payment reconciliation, Mr. Learn / Mr. Test sync stalls, DB pool exhaustion,
Exotel telephony failures, and a deploy rollback.
Status: documented from source on this branch. Endpoints, queues, env vars, and behaviours below are derived from the actual source under
src/ and cross-checked against the deployment doc. Anything not directly provable from code is marked (inferred).
Overview
The backend has no third-party APM or error tracker wired in — there is no Sentry,
Datadog, or New Relic integration in the code (grep -ri sentry src returns nothing, and
morgan is imported in src/app.ts but its app.use(morgan('combined')) line is
commented out). Observability is therefore built from a handful of first-party pieces:
| Surface | What it answers | Who uses it |
|---|---|---|
| GET /api/health | Is the process up, and are Postgres + Redis reachable? | Load balancer / Docker HEALTHCHECK, blue-green deploy verification, uptime monitors |
| /admin/queues (Bull Board) | Are background jobs flowing? What failed and why? | On-call engineers, ops |
| /api/redis/keys | What is cached / what repeatable jobs exist? | Engineers debugging cache / KPI staleness |
| /api/cleanup/statistics | How many slots are expired / unbooked? | Admins, ops |
| console.* stdout logs | Everything else: request errors, worker progress, integration failures | Anyone with docker logs / pm2 logs shell access |
| Audit tables (login_logs, gst_audit_log, lead_activity_log, …) | Who did what, when (forensics) | Admins, finance, sales-head, compliance |
Personas. SREs/on-call engineers (shell + Bull Board), admins/superadmins
(authenticated cleanup + audit views), and automated infrastructure (the load balancer and
the blue-green health-check.sh that polls /api/health).
This doc sits in docs/devops/ alongside
deployment-architecture.md; for the queue internals it
references, see background-jobs-and-queues.md.
Key concepts & entities
Glossary
- Health check: GET /api/health; returns 200 with success:true only when the database and Redis are both connected, else 503. Used as the deploy gate.
- Bull Board: a self-hosted dashboard (@bull-board/express) mounted at /admin/queues, protected by HTTP Basic Auth, listing every BullMQ queue with its waiting / active / completed / failed / delayed counts and per-job payloads + stack traces.
- Repeatable job: a BullMQ cron/interval job (e.g. KPI every 15 min). These are stored in Redis; re-registering on boot removes the old key first so schedules do not stack.
- Audit trail: append-only DB rows recording sensitive actions. The codebase has several independent ones (see below) rather than one unified log.
- Structured logging: not present. Logs are free-text console.info/warn/error/debug to stdout (≈2,250 call sites), captured by Docker / PM2.
Main TypeORM entities relevant to observability
| Entity | File | Purpose |
|---|---|---|
| LoginLog (login_logs) | src/entities/LoginLog.ts | One row per login / logout / master-login with ipAddress, userAgent, timestamp. Written by src/services/AuthService.ts. |
| GstAuditLog (superadmin.gst_audit_log) | src/entities/GstAuditLog.ts | Append-only, one row per changed GST/invoice field: action, field, oldValue, newValue, changedBy, changedByEmail. |
| LeadActivityLog (mas_crm.lead_activity_log) | src/entities/LeadActivityLog.ts | CRM forensics: lead status/owner changes with old/new value and changed_by. |
| AgentConfigurationHistory | src/entities/AgentConfigurationHistory.ts | Versioned history of AI agent config edits. |
| LeadEmailLog / LeadWhatsAppLog / AskMasLog | src/entities/* | Per-channel send/usage trails. |
There is no FinanceAuditService.ts in this repo. Finance reporting/forensics is served by src/services/FinanceReportsService.ts and src/services/FinanceExportService.ts, and the GST audit trail is the GstAuditLog entity above.
Architecture
How the observability surface is wired into the app. The health, redis, cleanup, and Bull
Board routers are registered in src/routes/index.ts and mounted onto the Express app in
src/app.ts.
flowchart TD
subgraph Clients["Operators & Infra"]
LB["Load balancer / Docker HEALTHCHECK"]
BG["Blue-green health-check.sh"]
Eng["On-call engineer"]
Adm["Admin / Superadmin"]
end
subgraph App["Express app (src/app.ts)"]
RIdx["RouteIndex (src/routes/index.ts)"]
HR["HealthRoutes -> /api/health"]
RR["RedisRoutes -> /api/redis/keys"]
CR["CleanupRoutes -> /api/cleanup/*"]
BB["BullBoardRoutes -> /admin/queues"]
end
subgraph Svc["Services"]
HS["HealthService"]
RS["RedisService (ioredis singleton)"]
SCS["SlotCleanupService"]
QS["QueueService (BullMQ)"]
end
subgraph Infra["Infrastructure"]
DB[("PostgreSQL pool 5-20")]
RD[("Redis - cache + queues")]
Stdout["stdout console logs -> docker/pm2"]
end
LB --> HR
BG --> HR
Eng --> BB
Eng --> RR
Adm --> CR
RIdx --> HR & RR & CR & BB
HR --> HS
HS --> DB
HS --> RD
RR --> RS --> RD
CR --> SCS --> DB
CR --> QS
BB --> QS --> RD
App -.->|"all errors/info"| Stdout
Data model
The observability/audit footprint is a small set of append-only or status tables. They are
not tightly related to one another (each domain keeps its own trail); the diagram below
shows their shape and the User they reference.
erDiagram
USER ||--o{ LOGIN_LOG : "generates"
USER ||--o{ LEAD_ACTIVITY_LOG : "acts via changed_by"
USER ||--o{ GST_AUDIT_LOG : "acts via changed_by"
USER {
uuid id PK
string email
string role
}
LOGIN_LOG {
uuid id PK
uuid userId FK
enum action "login|logout|master_login"
string ipAddress
text userAgent
timestamp timestamp
}
LEAD_ACTIVITY_LOG {
uuid id PK
uuid lead_id
enum activity_type
uuid changed_by FK
text old_value
text new_value
}
GST_AUDIT_LOG {
uuid id PK
string payment_id
string invoice_number
string action "create|update|issue"
string field
text old_value
text new_value
string changed_by FK
string changed_by_email
timestamp created_at
}
Notable enums / status fields:
- LogAction (src/entities/LoginLog.ts): login, logout, master_login.
- GstAuditLog.action: create / update / issue (one row per changed field).
- HealthResponse.status (src/types/health.types.ts): always the literal "OK" in the body. The real signal is the HTTP status code (200 vs 503), not this string.
API surface
Derived from src/routes/index.ts mounts and the individual route files. Note the auth
nuances called out in Edge cases below.
| Method | Path | Auth/role | Purpose |
|---|---|---|---|
| GET | /api/health | none (public) | Liveness + DB/Redis readiness. 200 healthy, 503 if DB or Redis down, 500 on internal error. src/routes/health.routes.ts |
| GET | /api/redis/keys | none in route file (intended admin) | Dump all Redis keys and values. src/routes/redis.routes.ts |
| DELETE | /api/redis/keys/:key | none in route file (intended admin) | Delete one Redis key. 404 if key absent. |
| DELETE | /api/cleanup/slots | authMiddleware + adminMiddleware | Run expired-unbooked-slot cleanup synchronously now. |
| GET | /api/cleanup/statistics | auth + admin | Slot counts (total/available/booked/future/expired/expiredUnbooked). |
| GET | /api/cleanup/statistics/:mentorId | auth + admin | Per-mentor slot stats. |
| POST | /api/cleanup/schedule | auth + admin | Enqueue a one-off cleanup job onto cleanupQueue. |
| GET | /api/cleanup/slots/:filter | auth + admin | List slots by filter (all, available, booked, future, expired, expiredUnbooked), ?limit default 100. |
| GET / POST / etc. | /admin/queues/* | HTTP Basic Auth (BULL_BOARD_USERNAME / BULL_BOARD_PASSWORD, set via env vars) | Bull Board UI + its internal API. src/routes/bullBoard.routes.ts |
Bull Board is mounted at the root (this.router.use('/', this.bullBoardRoutes.router) in src/routes/index.ts), and the router itself prefixes /admin/queues. The base path for its assets is set with serverAdapter.setBasePath('/admin/queues').
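As an illustration of the GET /api/cleanup/slots/:filter contract in the table above, the sketch below validates the filter name and applies the documented default limit of 100. The helper name parseSlotListQuery and the choice to return null for an unknown filter are assumptions for illustration; the real controller may validate differently.

```typescript
// Filter names are the ones listed for GET /api/cleanup/slots/:filter.
const SLOT_FILTERS = ["all", "available", "booked", "future", "expired", "expiredUnbooked"] as const;
type SlotFilter = (typeof SLOT_FILTERS)[number];

interface SlotListQuery {
  filter: SlotFilter;
  limit: number;
}

// Hypothetical helper: validates the :filter path param and the ?limit query value.
function parseSlotListQuery(filter: string, limit?: string): SlotListQuery | null {
  const match = SLOT_FILTERS.find((f) => f === filter);
  if (match === undefined) return null; // unknown filter: the caller decides how to reject it
  const parsed = limit === undefined ? NaN : Number.parseInt(limit, 10);
  // Fall back to the documented default of 100 when limit is absent or not a positive integer.
  const safeLimit = Number.isInteger(parsed) && parsed > 0 ? parsed : 100;
  return { filter: match, limit: safeLimit };
}
```

Calling parseSlotListQuery("expiredUnbooked") yields a limit of 100, while parseSlotListQuery("booked", "25") keeps the explicit limit.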
User journeys
1. Infrastructure health probe (load balancer / deploy gate)
The most frequently hit endpoint. The blue-green deploy script and the Docker
HEALTHCHECK both rely on it returning 200 + success:true before routing traffic.
sequenceDiagram
participant LB as Load balancer
participant API as Express
participant HC as HealthController
participant HS as HealthService
participant DB as Postgres
participant RD as Redis
LB->>API: GET /api/health
API->>HC: getHealth
HC->>HS: getHealthStatus and isHealthy
HS->>DB: database.isConnected
HS->>RD: client.status equals ready
alt both connected
HS-->>HC: status OK with uptime and env
HC-->>LB: 200 success true
else db or redis down
HS-->>HC: isHealthy false
HC-->>LB: 503 Service unavailable
end
Note over LB: On 503 the blue-green switch is aborted and the old color stays live
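The gate rule described above (HTTP 200 and success:true) can be sketched from the caller's side as follows. The HealthProbe shape is a simplification, and the "N consecutive passes" policy in shouldSwitchTraffic is a hypothetical illustration; the actual health-check.sh may use a different retry rule.

```typescript
interface HealthProbe {
  httpStatus: number;          // status code returned by GET /api/health
  body: { success?: boolean }; // only the field the gate cares about
}

// Gate rule from the text: both HTTP 200 and success:true are required.
function isDeployGateOpen(probe: HealthProbe): boolean {
  return probe.httpStatus === 200 && probe.body.success === true;
}

// Hypothetical policy: require N consecutive passing probes before switching colors.
function shouldSwitchTraffic(probes: HealthProbe[], requiredConsecutive: number): boolean {
  if (requiredConsecutive <= 0 || probes.length < requiredConsecutive) return false;
  return probes.slice(-requiredConsecutive).every(isDeployGateOpen);
}
```

Requiring several consecutive passes is one way to avoid switching traffic on a single lucky probe while a dependency is still flapping.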
2. On-call triages a failing queue via Bull Board
When jobs stop flowing, the engineer opens the dashboard, inspects failed jobs, reads their stack traces, and retries.
sequenceDiagram
participant Eng as On-call engineer
participant BB as Bull Board at /admin/queues
participant QS as QueueService
participant RD as Redis
Eng->>BB: Open /admin/queues
BB->>Eng: 401 WWW-Authenticate Basic
Eng->>BB: Resend with Basic credentials
BB->>QS: getQueues
QS->>RD: read waiting active failed delayed counts
RD-->>BB: per-queue metrics
BB-->>Eng: Render board with failed job list
Eng->>BB: Click a failed job to read stack trace
Eng->>BB: Retry job or Retry all failed
BB->>RD: move job back to waiting
Note over RD: A worker picks it up and reprocesses
3. Admin runs manual slot cleanup and reads statistics
sequenceDiagram
participant Adm as Admin
participant API as Express
participant CC as CleanupController
participant SCS as SlotCleanupService
participant DB as Postgres
Adm->>API: GET /api/cleanup/statistics with JWT
API->>CC: getSlotStatistics after auth and admin checks
CC->>SCS: getSlotStatistics
SCS->>DB: count slots by category
DB-->>Adm: counts available booked expired expiredUnbooked
Adm->>API: DELETE /api/cleanup/slots
CC->>SCS: cleanupExpiredUnbookedSlots
SCS->>DB: delete expired unbooked rows
DB-->>Adm: deletedCount and details
Note over Adm: Prefer POST /api/cleanup/schedule to offload to the queue under heavy load
4. Engineer inspects and clears a stale cache key in Redis
sequenceDiagram
participant Eng as Engineer
participant API as Express
participant RC as RedisController
participant RS as RedisService
participant RD as Redis
Eng->>API: GET /api/redis/keys
API->>RC: getAllKeysAndValues
RC->>RS: getAllKeysAndValues
RS->>RD: keys star then read each value
RD-->>Eng: all keys and values as JSON
Eng->>API: DELETE /api/redis/keys/mykey
RC->>RS: delete mykey
RS->>RD: DEL mykey
alt key existed
RD-->>Eng: 200 deleted with deletedCount
else key missing
RD-->>Eng: 404 Key not found
end
Note over Eng: Deleting a KPI cache key forces a recompute on next dashboard read
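The 200-versus-404 branch in this journey follows from the count that Redis DEL returns (the number of keys actually removed). A minimal sketch of that mapping, assuming a response body shape that is illustrative only (the "Key not found" message and deletedCount field come from the diagram; the other fields are hypothetical):

```typescript
interface DeleteKeyResponse {
  httpStatus: 200 | 404;
  body: { success: boolean; message: string; deletedCount?: number };
}

// deletedCount is what Redis DEL returns: the number of keys actually removed.
function toDeleteKeyResponse(key: string, deletedCount: number): DeleteKeyResponse {
  if (deletedCount > 0) {
    return { httpStatus: 200, body: { success: true, message: `Deleted ${key}`, deletedCount } };
  }
  // DEL on a missing key returns 0, which the route reports as 404.
  return { httpStatus: 404, body: { success: false, message: "Key not found" } };
}
```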
5. Forensic lookup of a login (audit trail)
sequenceDiagram
participant U as User
participant Auth as AuthService
participant DB as Postgres login_logs
U->>Auth: signIn with credentials
Auth->>Auth: verify password and issue JWT
Auth->>DB: insert LoginLog action login ip userAgent timestamp
Note over DB: Later an admin queries login_logs by userId to confirm access time and source IP
Auth->>U: token and user
U->>Auth: logout
Auth->>DB: insert LoginLog action logout
Background jobs & async
Observability of async work is entirely through Bull Board + worker stdout logs. The full queue catalogue lives in background-jobs-and-queues.md; the monitoring-relevant facts:
- Queues are created in src/services/QueueService.ts (singleton). getQueues() returns the list surfaced in Bull Board. Note that it currently exposes 19 of the declared queues; some declared queues (e.g. studentRiskComputationQueue, badgeEvaluationQueue) are not in the getQueues() array and therefore do not appear in Bull Board (gotcha, see below).
- Job retention is set per-queue via defaultJobOptions.removeOnComplete / removeOnFail (typically 10–200). Once trimmed, failed jobs are gone, so capture stack traces before they age out.
- Repeatable schedules (registered on boot in src/index.ts):
  - KPI dashboard + sales-overview: every 15 min (scheduleKpiCalculation).
  - Slot cleanup: every 15 minutes despite the method name scheduleSlotCleanup and a stale "24 hours" comment (the code uses every: 15 * 60 * 1000).
  - Lead auto-assignment: every 15 min.
  - Aarya call-sync (ElevenLabs): every 15 min.
  - Workflow trigger-scan: every 5 min.
  - Miss Ozone reconciler: every 60 s; prune daily 03:30.
  - Daily warning processing: 59 23 * * * IST; assignment reminders: 0 19 * * * IST; daily cards: 0 0 * * * IST; application backfill: 0 2 * * * IST.
  - Salary benchmark: every 15 days.
  - Mr. Learn / Mr. Test sync + reminders: per-config intervals (hours), stable jobId.
- Socket.IO events (meetings, recording, presence) are not captured by any dashboard; diagnose them from stdout logs only.
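To make the interval arithmetic in the schedule list concrete, the sketch below models a few of the schedules with a local type that mirrors the two BullMQ repeat shapes (a fixed every interval in milliseconds, or a cron pattern with a time zone). The schedule names are illustrative labels, and mapping "IST" to the Asia/Kolkata zone is an assumption; check src/index.ts for the exact options.

```typescript
// Local mirror of the two repeat shapes used here: fixed interval or cron pattern.
type RepeatSpec =
  | { every: number }                // milliseconds between runs
  | { pattern: string; tz: string }; // cron expression evaluated in a time zone

type ScheduleName = "kpiCalculation" | "slotCleanup" | "workflowTriggerScan" | "dailyWarningProcessing";

const MINUTE_MS = 60 * 1000;

// Intervals and cron strings come from the list above; names are illustrative.
const schedules: Record<ScheduleName, RepeatSpec> = {
  kpiCalculation: { every: 15 * MINUTE_MS },
  slotCleanup: { every: 15 * MINUTE_MS }, // 900000 ms, i.e. 15 minutes, not 24 hours
  workflowTriggerScan: { every: 5 * MINUTE_MS },
  dailyWarningProcessing: { pattern: "59 23 * * *", tz: "Asia/Kolkata" }, // tz is an assumption for "IST"
};

// Human-readable summary, handy when comparing code against stale comments.
function describeSchedule(spec: RepeatSpec): string {
  return "every" in spec
    ? `every ${spec.every / MINUTE_MS} min`
    : `cron ${spec.pattern} (${spec.tz})`;
}
```

Printing describeSchedule for each entry during a review makes a mismatch such as the stale "24 hours" comment easy to spot.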
External integrations
| Integration | Env vars | Failure / fallback behaviour |
|---|---|---|
| PostgreSQL | DB_HOST/PORT/USERNAME/PASSWORD/NAME | Pool min 5 / max 20 (src/config/database.ts extra). synchronize: true. If down, /api/health returns 503 and most routes return 500. |
| Redis | REDIS_HOST/PORT (password commented out) | ioredis with lazyConnect, reconnectOnError on READONLY. If down, queues stall and /api/health returns 503. |
| Bull Board auth | BULL_BOARD_USERNAME, BULL_BOARD_PASSWORD | Credentials are set via env vars; must be overridden in prod. |
| Exotel (telephony) | EXOTEL_SID/API_KEY/API_TOKEN/CALLER_ID/SUBDOMAIN/SMS_SENDER_ID/WEBHOOK_TOKEN, BACKEND_PUBLIC_URL | Feature self-disables (CRM falls back to tel: links) when unset. EXOTEL_SUBDOMAIN must match the account cluster or auth returns 401. See comms-telephony-exotel.md. |
| Razorpay (payments) | RAZORPAY_KEY_ID/SECRET | Webhook-driven; failures surface in worker/controller logs and in finance reconciliation, not health. See payments-finance-gst.md. |
| Gmail SMTP | EMAIL_USER, EMAIL_PASS | email.worker throws 535-5.7.8 BadCredentials on a stale app password; jobs land in the emailQueue failed set. |
| ElevenLabs / Aarya / Miss Ozone (AI calling) | per-service keys | Sync workers poll and write back; failures visible in aaryaSyncQueue / missOzoneQueue. |
| Mr. Learn (Graphy) / Mr. Test | per-config credentials | Sync workers; stalls visible in mrlearnSyncQueue / mrtestSyncQueue. See integration-mr-learn.md. |
There are no formal feature flags for observability; behaviour is governed by presence or absence of the env vars above.
Status lifecycles
A BullMQ job's lifecycle — exactly what the Bull Board columns represent:
stateDiagram-v2
[*] --> Waiting: enqueued
Waiting --> Active: worker picks up
Active --> Completed: success
Active --> Failed: throws
Failed --> Waiting: retry attempt remaining
Failed --> [*]: attempts exhausted then trimmed by removeOnFail
Completed --> [*]: trimmed by removeOnComplete
Waiting --> Delayed: scheduled or backoff
Delayed --> Waiting: delay elapsed
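The job lifecycle above can be written down as a small transition table, which is useful when reasoning about what a Bull Board action (such as Retry) is allowed to do. This is a sketch of the diagram only; the names are local to the example and are not BullMQ APIs.

```typescript
type JobState = "waiting" | "active" | "completed" | "failed" | "delayed";

// Allowed moves, copied from the state diagram above.
const transitions: Record<JobState, JobState[]> = {
  waiting: ["active", "delayed"],
  active: ["completed", "failed"],
  failed: ["waiting"],  // automatic only while retry attempts remain; otherwise a manual retry
  delayed: ["waiting"],
  completed: [],        // terminal until trimmed by removeOnComplete
};

function canTransition(from: JobState, to: JobState): boolean {
  return transitions[from].indexOf(to) !== -1;
}
```

For example, canTransition("failed", "waiting") is true (a retry), while canTransition("completed", "waiting") is false.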
The health endpoint's effective state machine:
stateDiagram-v2
[*] --> Checking
Checking --> Healthy: db connected and redis ready
Checking --> Unavailable: db down or redis not ready
Checking --> Error: exception thrown
Healthy --> [*]: 200 success true
Unavailable --> [*]: 503 Service unavailable
Error --> [*]: 500 Internal server error
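A minimal sketch of the health state machine above, assuming the two checks described in this document (a database connected flag and the ioredis client status string). The readDeps callback is a hypothetical stand-in for HealthService so that the example stays self-contained.

```typescript
interface DependencyState {
  dbConnected: boolean; // result of the database isConnected check
  redisStatus: string;  // ioredis client status, e.g. "ready" or "reconnecting"
}

interface HealthResult {
  httpStatus: 200 | 503 | 500;
  success: boolean;
}

// readDeps is injected so the sketch stays testable; it stands in for HealthService.
function resolveHealth(readDeps: () => DependencyState): HealthResult {
  try {
    const deps = readDeps();
    const healthy = deps.dbConnected && deps.redisStatus === "ready";
    return healthy ? { httpStatus: 200, success: true } : { httpStatus: 503, success: false };
  } catch {
    // Any exception while checking maps to the Error state in the diagram.
    return { httpStatus: 500, success: false };
  }
}
```

Because both checks read cached connection state rather than issuing a query, a dependency that is connected but unresponsive would still resolve to 200 in this model.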
Runbooks
Each runbook is Symptom → Diagnosis → Fix. Shell access to the server (or
docker logs / pm2 logs) and the Bull Board credentials are assumed.
Runbook A: A queue is backed up (jobs piling in "waiting")
Symptom: Bull Board shows a large/growing waiting count and a low/zero active
count; downstream effects (no emails, stale KPIs) appear.
Diagnosis
1. Open /admin/queues, identify the queue with the growing backlog.
2. Check its active count — if 0, the worker is dead or not consuming.
3. docker logs <container> (or pm2 logs) and grep for the worker name (e.g.
email.worker) — look for a crash on boot or an unhandled rejection.
4. Confirm Redis is up: GET /api/health should be 200. A 503 means Postgres or Redis is unreachable; Runbook F covers the database side.
5. Check the failed tab — a poison job repeatedly failing can stall throughput.
Fix
- If the worker process crashed: restart the container/PM2 process; workers are started in
src/index.ts on boot.
- If a poison job is the cause: open it in Bull Board, read the stack trace, then remove
it (or fix the data and retry).
- If Redis was the root cause: restore Redis first, then confirm the waiting jobs drain and retry any that failed in the meantime.
- Verify recovery: active climbs and waiting drains.
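The first triage steps can be expressed as a rule over the counts Bull Board shows for a queue. This is a sketch only: the backlogThreshold default of 100 is a hypothetical tuning value, not something taken from the codebase, and a brief zero in the active column can be normal between jobs.

```typescript
interface QueueCounts {
  waiting: number;
  active: number;
  failed: number;
}

type QueueDiagnosis = "healthy" | "worker-not-consuming" | "check-failed-jobs";

// backlogThreshold is a hypothetical tuning knob, not a value from the codebase.
function triageQueue(counts: QueueCounts, backlogThreshold = 100): QueueDiagnosis {
  const backlog = counts.waiting >= backlogThreshold;
  if (backlog && counts.active === 0) return "worker-not-consuming"; // diagnosis step 2
  if (backlog && counts.failed > 0) return "check-failed-jobs";      // diagnosis step 5
  return "healthy";
}
```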
Runbook B: Emails are not sending
Symptom: OTP / meeting / reminder emails not arriving; users report no mail.
Diagnosis
1. Bull Board → emailQueue. Are jobs in failed? Open one and read the error.
2. A 535-5.7.8 Username and Password not accepted (BadCredentials) means the Gmail app
password is stale/rotated.
3. If jobs sit in waiting with no active, the email.worker is down (see Runbook A).
4. Check stdout for email.worker errors.
Fix
- For bad credentials: rotate the Gmail app password, update both AWS Secrets Manager
secrets (mr-mentor-backend/development and /production), restart the blue and
green containers so they re-read the env, then retry the failed emailQueue jobs in
Bull Board (they do not auto-retry once exhausted). See the
deployment doc for the env propagation path.
- For a dead worker: restart the process, then retry failed jobs.
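When many emailQueue jobs have failed, it helps to separate the credential failures from everything else before retrying. The sketch below matches the SMTP reply code quoted in the diagnosis; the function names and the attempts-based retry check are illustrative assumptions, not code from email.worker.

```typescript
type EmailFailure = "bad-credentials" | "other";

// Matches the Gmail SMTP reply quoted in diagnosis step 2; other messages are left unclassified.
function classifyEmailFailure(errorMessage: string): EmailFailure {
  return /535[- ]5\.7\.8/.test(errorMessage) ? "bad-credentials" : "other";
}

// Once a job has used all its attempts it stays in the failed set until retried by hand.
function needsManualRetry(attemptsMade: number, maxAttempts: number): boolean {
  return attemptsMade >= maxAttempts;
}
```

Jobs classified as bad-credentials are only worth retrying after the app password has been rotated and the containers restarted.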
Runbook C: Meeting WebRTC will not connect
Symptom: Two participants join a meeting room but never see/hear each other; video stays black.
Diagnosis (WebRTC is not in Bull Board — use logs + the realtime doc)
1. Confirm /api/health is 200 (Socket.IO shares the HTTP server in src/index.ts; if
the process is unhealthy, sockets are too).
2. docker logs and grep for the socket signaling events: offer, answer,
ice-candidate, join-room. Confirm both peers emitted join-room and exchanged an
offer/answer pair.
3. Check whether ICE candidates are being relayed — if offers/answers appear but no media,
it is almost always a STUN/TURN / NAT-traversal problem, not the app.
4. Confirm CORS / origin: Socket.IO only allows configured origins (localhost 3000/3001/3002/8088
+ production URLs). A blocked origin prevents the socket from connecting at all.
5. Confirm the client cleared any stale connection (takeover-connection is emitted when a
second tab takes over).
Fix
- App-level signaling broken (no offer/answer in logs): restart the backend so Socket.IO
re-initialises; have clients rejoin (leave-room then join-room).
- Media never flows despite signaling: this is infra — verify the TURN server config on the
client side; backend cannot fix NAT traversal.
- Origin blocked: add the origin to the Socket.IO CORS allow-list and redeploy.
- See realtime-and-socketio.md and
mentorship-and-meetings.md.
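The log-reading steps in this runbook amount to a small decision procedure over the signaling events seen for each peer. The sketch below encodes that procedure; the PeerLog shape and verdict names are hypothetical, and extracting the events from stdout is left to whatever grep or log tooling is at hand.

```typescript
type SignalEvent = "join-room" | "offer" | "answer" | "ice-candidate";
type SignalingVerdict =
  | "peer-never-joined"
  | "no-offer-answer"
  | "no-ice-candidates"
  | "signaling-ok-suspect-turn";

interface PeerLog {
  peerId: string;
  events: SignalEvent[]; // signaling events found in the logs for this peer
}

function diagnoseSignaling(peers: PeerLog[]): SignalingVerdict {
  const has = (p: PeerLog, e: SignalEvent): boolean => p.events.indexOf(e) !== -1;
  const anyPeer = (e: SignalEvent): boolean => peers.some((p) => has(p, e));
  // Step 2: both peers must have emitted join-room.
  if (peers.length < 2 || !peers.every((p) => has(p, "join-room"))) return "peer-never-joined";
  // Step 2: an offer/answer pair must have been exchanged.
  if (!anyPeer("offer") || !anyPeer("answer")) return "no-offer-answer";
  // Step 3: ICE candidates should be relayed after the offer/answer.
  if (!anyPeer("ice-candidate")) return "no-ice-candidates";
  // Signaling looks complete, so the remaining suspect is STUN/TURN or NAT traversal.
  return "signaling-ok-suspect-turn";
}
```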
Runbook D: Payment / Razorpay webhook failed or not reconciled
Symptom: A user paid but tokens/enrollment were not granted, or finance reports show a payment with no invoice.
Diagnosis
1. Search stdout for the Razorpay order/payment id around the payment time.
2. Confirm the webhook reached the backend (look for the webhook handler log line). If the
request never arrived, the issue is at the Razorpay dashboard / public URL level.
3. If the webhook arrived but processing threw, the error is in the controller/worker logs.
4. For GST/invoice forensics, query superadmin.gst_audit_log by payment_id to see what
field changes (if any) were recorded.
Fix
- Re-trigger reconciliation per the finance flow (see
payments-finance-gst.md); the payment id is the key.
- If the webhook never arrived: verify the Razorpay webhook URL points at the live
blue-green color and the public URL resolves; re-send the webhook from Razorpay.
- Confirm the grant landed (tokens/enrollment) and that a gst_audit_log / invoice row now
exists.
Runbook E: Mr. Learn or Mr. Test sync is stuck
Symptom: New enrolments / progress from Mr. Learn (Graphy) or Mr. Test are not appearing; admin "sync" shows no recent activity.
Diagnosis
1. Bull Board → mrlearnSyncQueue / mrlearnNewStudentSyncQueue / mrtestSyncQueue.
Check failed jobs and read the error (commonly a 401/403 from the upstream API = stale
credentials).
2. Confirm the repeatable job exists: GET /api/redis/keys and look for the repeatable job
key, or check Bull Board's repeatable section. Per-config jobs use stable ids like
mrlearn-sync-<configId>.
3. Check stdout for mrlearnSync.worker / mrtestSync.worker errors.
Fix
- Stale upstream credentials: update the sync config credentials, then trigger a manual run
(QueueService.triggerMrLearnSyncNow(configId) / triggerMrTestSyncNow(configId) via the
admin endpoint).
- Missing schedule: re-save the sync config (admin UI) to re-register the repeatable job, or
reboot the backend (boot re-wires schedules in src/index.ts).
- Poison job: remove the failed job, fix the config, retry.
- See integration-mr-learn.md and
integration-mr-test.md.
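Checking for a missing schedule (diagnosis step 2) can be done by scanning the key list returned by GET /api/redis/keys for the stable job id. The sketch below assumes only the mrlearn-sync-<configId> id format quoted above; it makes no assumption about how BullMQ lays out the rest of the key, and the helper names are illustrative.

```typescript
// Stable id format quoted in diagnosis step 2.
function mrLearnSyncJobId(configId: string): string {
  return `mrlearn-sync-${configId}`;
}

// keys would come from GET /api/redis/keys. The surrounding key layout is not assumed,
// so this only checks whether some key mentions the exact job id.
function hasRepeatableJob(keys: string[], configId: string): boolean {
  const jobId = mrLearnSyncJobId(configId);
  return keys.some((key) => {
    const at = key.indexOf(jobId);
    if (at === -1) return false;
    // Guard against prefix matches, e.g. config "1" matching "mrlearn-sync-12".
    const next = key.charAt(at + jobId.length);
    return next === "" || !/[A-Za-z0-9_-]/.test(next);
  });
}
```

If hasRepeatableJob returns false for a config that should be syncing, re-saving the config or rebooting the backend (as described in the fix above) re-registers the schedule.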
Runbook F: Database connection pool exhausted
Symptom: Requests hang then 500; logs show TimeoutError: Could not acquire a
connection or "remaining connection slots" errors; /api/health may flip to 503.
Diagnosis
1. GET /api/health — a 503 with database.connected:false confirms DB unreachable; a
slow 200 suggests saturation, not outage.
2. Recall the pool is min 5 / max 20 per process (src/config/database.ts extra).
With multiple instances/colors live, total connections = 20 × instances.
3. On the DB: count active connections (SELECT count(*) FROM pg_stat_activity;) and look
for long-running / idle-in-transaction queries.
4. Check for a query leak in recently deployed code (a transaction never committed/rolled
back holds a connection).
Fix
- Immediate: restart the backend container(s) to drop and re-establish the pool.
- Kill stuck Postgres sessions: SELECT pg_terminate_backend(pid) FROM pg_stat_activity
WHERE state = 'idle in transaction' AND now() - state_change > interval '5 minutes';
- If legitimately at capacity, reduce live instance count or raise the DB max_connections
(and re-evaluate the pool max).
- Root-cause the leaking query and roll forward a fix (or roll back — Runbook H).
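The capacity arithmetic in diagnosis step 2 (total connections = 20 × instances) can be checked with a small helper before changing instance counts or pool sizes. The function name is illustrative; the default of 20 is the per-process pool maximum documented above, and the 100 used in the usage note is PostgreSQL's stock max_connections default, so substitute the value from your own server.

```typescript
interface PoolBudget {
  worstCaseConnections: number; // every instance with a completely full pool
  headroom: number;             // connections left for psql sessions, migrations, other apps
  overCommitted: boolean;
}

// poolMax defaults to the documented per-process maximum of 20.
function poolBudget(instances: number, dbMaxConnections: number, poolMax = 20): PoolBudget {
  const worstCaseConnections = instances * poolMax;
  const headroom = dbMaxConnections - worstCaseConnections;
  return { worstCaseConnections, headroom, overCommitted: headroom < 0 };
}
```

With blue and green both live, poolBudget(2, 100) gives a worst case of 40 connections and 60 of headroom; six instances against the same limit would be over-committed by 20.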
Runbook G: Exotel click-to-call / SMS failing
Symptom: CRM click-to-call does nothing or returns errors; SMS not delivered; or the
feature silently fell back to plain tel: links.
Diagnosis
1. If the UI shows tel: links instead of click-to-call, the Exotel env vars are unset —
the feature self-disables when any of EXOTEL_SID/API_KEY/API_TOKEN/CALLER_ID is missing.
2. A 401 from Exotel almost always means EXOTEL_SUBDOMAIN does not match the account's
cluster (e.g. Singapore api.exotel.com vs Mumbai api.in.exotel.com).
3. Validate creds directly:
curl -u <key>:<token> https://<subdomain>/v1/Accounts/<SID> — a 200 confirms creds +
cluster.
4. For status callbacks, confirm EXOTEL_WEBHOOK_TOKEN and BACKEND_PUBLIC_URL are set so
Exotel can reach /api/exotel/*.
Fix
- Set/correct the env vars in AWS Secrets Manager for the right environment, ensuring
EXOTEL_SUBDOMAIN matches the account cluster; restart both colors.
- Re-run the curl validation; then retry a click-to-call from the CRM.
- See comms-telephony-exotel.md.
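The self-disable rule from diagnosis step 1 and the validation URL from step 3 can be sketched as follows. The env var names are the expanded forms of EXOTEL_SID/API_KEY/API_TOKEN/CALLER_ID; the helper names are hypothetical, and the real feature check lives in the Exotel integration code.

```typescript
type Env = Record<string, string | undefined>;

// The four variables whose absence disables click-to-call, per diagnosis step 1.
const REQUIRED_EXOTEL_VARS = ["EXOTEL_SID", "EXOTEL_API_KEY", "EXOTEL_API_TOKEN", "EXOTEL_CALLER_ID"];

// Lists which required variables are unset or empty, so the log can say exactly what is missing.
function missingExotelVars(env: Env): string[] {
  return REQUIRED_EXOTEL_VARS.filter((name) => !env[name]);
}

function isClickToCallEnabled(env: Env): boolean {
  return missingExotelVars(env).length === 0;
}

// URL shape taken from the curl command in step 3; subdomain is the cluster host.
function exotelAccountUrl(subdomain: string, sid: string): string {
  return `https://${subdomain}/v1/Accounts/${encodeURIComponent(sid)}`;
}
```

Reporting the output of missingExotelVars at boot would turn the silent tel: fallback into an explicit log line.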
Runbook H: Deploy rollback (production)
Symptom: A new release is live but unhealthy (errors spiking, /api/health 503, or a
regression).
Diagnosis
1. GET /api/health against the live color — 503 or non-success:true confirms the new
release is bad.
2. Check /home/ubuntu/blue-green-deployment/.current_env to see which color is live.
3. Tail the new container's logs for the boot error or regression.
Fix (instant rollback)
- The previous color is stopped, not removed — the old container is intact. Run the
server-side rollback.sh (in /home/ubuntu/blue-green-deployment/) which switches Nginx
back to the previous color and flips .current_env.
- Verify /api/health on the restored color returns 200 success:true.
- Leave the bad color stopped; investigate and re-deploy once fixed.
- Full topology and scripts: deployment-architecture.md.
Runbook I: KPIs / dashboard numbers are stale
Symptom: Admin/sales dashboards show outdated figures.
Diagnosis
1. KPIs are recomputed by the kpiQueue repeatable job every 15 min and cached in Redis.
2. Bull Board → kpiQueue: confirm the repeatable job ran recently and did not fail.
3. GET /api/redis/keys to inspect the cached KPI value and its freshness.
Fix
- Delete the stale cache key via DELETE /api/redis/keys/<kpiKey> to force a recompute on
next read, or enqueue an immediate recompute (QueueService.addKpiJob).
- If the repeatable schedule is missing, reboot the backend (re-registers on boot) or
inspect why scheduleKpiCalculation failed in stdout.
Edge cases, limits & gotchas
- Hardening note (internal). A security/hardening observation for this area is tracked in the team's private notes (internal/security-and-hardening-notes.md) and is intentionally not published on this site.
- Not all queues appear in Bull Board. QueueService.getQueues() omits some declared queues (e.g. studentRiskComputationQueue, badgeEvaluationQueue), so their jobs are invisible in the dashboard; diagnose those from logs only.
- Health check is binary and shallow. It only checks DB isConnected() and Redis client.status === 'ready'; it does not run a query or ping Redis, so a "connected but wedged" dependency can still report healthy.
- scheduleSlotCleanup runs every 15 min, not 24 h. The method name and an inline comment say 24 hours, but the code uses every: 15 * 60 * 1000.
- Logs are ephemeral free-text. ≈2,250 console.* calls go to stdout with no structure, no levels enforced, no correlation ids, and no central aggregation. Capture what you need with docker logs / pm2 logs before containers are recycled.
- Failed jobs age out. removeOnFail (10–200) means stack traces disappear after enough newer failures, so screenshot or copy them during triage.
- Multi-platform. Requests carry an x-platform header; logs do not consistently record it, so attributing an issue to mr-mentor vs my-analytics-school may require correlating with the request body/route (inferred).
- Auth on cleanup is real. Unlike redis, all /api/cleanup/* routes enforce authMiddleware + adminMiddleware.
- No alerting. Nothing pages on a 503 or a queue backlog; monitoring is pull-based (someone must look). External uptime monitoring on /api/health is the only proactive signal (inferred).
Related docs
- deployment-architecture.md: Docker, blue-green, rollback scripts, env propagation.
- background-jobs-and-queues.md: full BullMQ queue/worker catalogue and schedules.
- realtime-and-socketio.md: WebRTC signaling and Socket.IO events.
- request-lifecycle-and-middleware.md: middleware, auth, error handling.
- multi-platform-architecture.md: x-platform routing.
- comms-telephony-exotel.md: Exotel integration internals.
- comms-email-and-notifications.md: email worker + templates.
- payments-finance-gst.md: Razorpay + GST audit trail.
- integration-mr-learn.md, integration-mr-test.md: external LMS sync.
- identity-and-access.md: auth + login_logs audit trail.