Infrastructure Topology
This document describes the runtime infrastructure that serves the MAS / Mr. Mentor backend in
production and development: how client traffic reaches the API, where it terminates, what data
stores and object storage it depends on, which external services it calls, and how new builds
reach the servers via a GitHub Container Registry (GHCR) + blue-green deployment pipeline. The
topology is synthesized from the build/deploy workflows, Docker assets, PM2 config, server-side
scripts referenced by the workflows, and runtime env. Parts that live outside the repo (nginx
config, EC2 host layout, and the blue-green shell scripts under
/home/ubuntu/blue-green-deployment on the server) are marked (inferred) or (server-side, not in repo).
Status: documented from source on this branch.
Overview
The backend is a single Node.js 20 + Express + TypeScript service (src/index.ts → src/app.ts)
that bundles, in one process:
- the HTTP REST API on port 8000 (42 route files, see src/routes/index.ts),
- a Socket.IO server on the same HTTP server at path /socket.io (WebRTC meetings, presence, recording, chat; see src/socket.ts),
- a raw WebSocket terminal relay at path /api/terminal/relay (src/services/TerminalRelayWsServer.ts), attached in noServer mode to the same HTTP server so it coexists with Socket.IO,
- five BullMQ workers running in-process against Redis.
It is consumed by three frontends and several sibling services. In production it runs as a Docker
container behind nginx, with PostgreSQL and Redis as sidecar containers, and AWS S3 (region
ap-south-1) for object storage.
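One way to picture how Socket.IO and the terminal relay share a single HTTP server is a path-based dispatch on the upgrade request. The sketch below is illustrative only and is not the code in src/index.ts: routeUpgrade and UpgradeTarget are hypothetical names, and only the two paths come from this document.

```typescript
// Hypothetical sketch: decide which WebSocket handler owns an HTTP upgrade.
// Only the two path constants are taken from this document.
type UpgradeTarget = "socket.io" | "terminal-relay" | "reject";

const SOCKET_IO_PATH = "/socket.io";
const TERMINAL_RELAY_PATH = "/api/terminal/relay";

function routeUpgrade(requestUrl: string): UpgradeTarget {
  // Drop the query string (Socket.IO clients append transport parameters).
  const queryStart = requestUrl.indexOf("?");
  const pathname = queryStart === -1 ? requestUrl : requestUrl.slice(0, queryStart);

  if (pathname === TERMINAL_RELAY_PATH) return "terminal-relay";
  if (pathname === SOCKET_IO_PATH || pathname.startsWith(SOCKET_IO_PATH + "/")) {
    return "socket.io";
  }
  // Anything else is not a WebSocket endpoint of this service.
  return "reject";
}
```

In a real server the "terminal-relay" branch would hand the socket to the noServer WebSocket instance and the "socket.io" branch would be left to Socket.IO's own upgrade listener.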
Who operates / depends on it:
| Persona | How they reach the infra |
|---|---|
| Students, mentors, public visitors | mas-website-live (Next.js, myanalyticsschool.com) → HTTPS API |
| Admins, sales, superadmin | mr-mentor-frontend (Next.js, admin dashboard) → HTTPS API + Socket.IO |
| Recruiters | mr-hire-frontend → mr-hire-backend, which also calls this backend |
| Students running code | @myanalyticsschool/connect CLI + xterm.js → terminal-relay WS |
| DevOps / CI | GitHub Actions → GHCR → SSH blue-green deploy to EC2 |
Key concepts & entities
This is a devops/operations domain; it owns no TypeORM entities. The "entities" here are infrastructure components and config artifacts.
| Concept | Where defined | Notes |
|---|---|---|
| Backend container image | Dockerfile, build.yml | Multi-stage (Bun build → Node 20 runner), pushed to GHCR |
| Production compose stack | docker-compose.prod.yml | app + postgres:16-alpine + redis:7-alpine on external mas-network |
| Dev compose stack | docker-compose.yml, docker-compose.dev.yml | Used by the development server deploy |
| PM2 ecosystem | ecosystem.config.js | Legacy/alternate process manager path (see gotchas) |
| Blue-green scripts | deploy.sh, rollback.sh, health-check.sh under /home/ubuntu/blue-green-deployment | (server-side, not in repo); invoked over SSH by deploy-production.yml |
| Secrets source | AWS Secrets Manager mr-mentor-backend/{production,development} | Fetched at deploy time, materialized as .env.original / .env on the server |
| Health endpoint | GET /api/health | Used by Docker healthcheck, nginx switch verification, pre/post-deploy checks |
| .current_env marker | server file | Holds blue or green: the currently live color |
Architecture
Production topology — clients, edge, the backend process, data stores, object storage, and external APIs. Ports are container/host ports; nginx terminates TLS and reverse-proxies to the active backend container.
flowchart TD
subgraph CLIENTS["Clients"]
WB["mas-website-live (myanalyticsschool.com)"]
AD["mr-mentor-frontend (admin dashboard)"]
HF["mr-hire-frontend"]
CLI["@myanalyticsschool/connect CLI + xterm.js"]
end
DNS["DNS: api.mrmentor.in / api.myanalyticsschool.com"]
NGINX["nginx reverse proxy + TLS (inferred)"]
subgraph HOST["Production EC2 host (Ubuntu, ap-south-1)"]
subgraph BG["Blue-Green app containers"]
BLUE["mr-mentor-backend-blue :8000"]
GREEN["mr-mentor-backend-green :8000"]
end
PG[("mr-mentor-postgres :5432 (postgres:16-alpine)")]
RD[("mr-mentor-redis :6379 (redis:7-alpine)")]
end
subgraph PROC["Inside the active backend container"]
API["Express REST API /api/*"]
IO["Socket.IO /socket.io"]
WS["Terminal relay WS /api/terminal/relay"]
WORK["5 BullMQ workers"]
end
subgraph S3["AWS S3 (ap-south-1)"]
B1["mr-mentor-recordings"]
B2["student documents bucket"]
B3["banner assets bucket"]
B4["invoice PDF bucket (private)"]
end
subgraph EXT["External APIs"]
RZP["Razorpay"]
GOOG["Google OAuth + Calendar + Drive"]
SMTP["Gmail SMTP"]
VOICE["ElevenLabs / Aarya / MissOzone"]
EXO["Exotel telephony + SMS"]
WA["WhatsApp Cloud API"]
LLM["LiteLLM gateway (dev-llm.myanalyticsschool.com)"]
GRAPHY["Graphy LMS (mrlearn.in)"]
HIRE["mr-hire-backend"]
LEEG["Leegality eSign"]
JUDGE["Judge0 / compiler API"]
CFN["CloudFront assets CDN"]
end
WB --> DNS
AD --> DNS
HF --> DNS
CLI --> DNS
DNS --> NGINX
NGINX -->|"active color"| BLUE
NGINX -.->|"standby"| GREEN
BLUE --> PROC
API --> PG
API --> RD
WORK --> RD
WORK --> PG
API --> S3
API --> EXT
IO --> RD
Request flow inside the process (routes → controllers → services → entities/external):
flowchart LR
REQ["HTTP request"] --> CORS["CORS + helmet middleware (app.ts)"]
CORS --> AUTH["auth.middleware (JWT) and role guards"]
AUTH --> ROUTES["routes/index.ts (42 route files)"]
ROUTES --> CTRL["controllers/*"]
CTRL --> SVC["services/*"]
SVC --> ORM["TypeORM DataSource (config/database.ts)"]
SVC --> REDISC["Redis (config/redis.ts)"]
SVC --> S3SVC["S3Service / s3Uploader.service"]
SVC --> QUEUE["QueueService → BullMQ"]
ORM --> PGDB[("PostgreSQL mas DB")]
S3SVC --> S3OBJ["AWS S3 ap-south-1"]
QUEUE --> WORKERS["email / database / cleanup / kpi / resumeAnalysis workers"]
Data model
This domain has no relational entities; its "data model" is the set of stateful artifacts and volumes that the infrastructure persists. The erDiagram below models the operational relationships (host → containers → volumes → buckets) for orientation.
erDiagram
HOST ||--o{ CONTAINER : "runs"
CONTAINER ||--o| VOLUME : "mounts"
HOST ||--|| NGINX : "fronts"
APP_CONTAINER ||--|| POSTGRES : "connects"
APP_CONTAINER ||--|| REDIS : "connects"
APP_CONTAINER ||--o{ S3_BUCKET : "reads_writes"
HOST {
string provider "AWS EC2 ap-south-1"
string os "Ubuntu"
string network "mas-network (docker external)"
}
CONTAINER {
string name "blue / green / postgres / redis"
int port "8000 / 5432 / 6379"
string restart "always"
}
VOLUME {
string postgres_data "PG datadir"
string redis_data "Redis AOF/RDB"
}
NGINX {
string config "sites-available/api.myanalyticsschool.com"
string upstream "active backend host port"
}
S3_BUCKET {
string recordings "mr-mentor-recordings"
string documents "student documents"
string banners "banner assets"
string invoices "private invoice PDFs"
}
API surface
Infrastructure exposes a small set of operational endpoints. Application routes are documented in
the feature docs; the ops-relevant surface mounted in src/app.ts / src/routes/index.ts:
| Method | Path | Auth/role | Purpose |
|---|---|---|---|
| GET | /api/health | Public | Liveness/readiness; returns { success: true, ... }. Used by Docker healthcheck, nginx switch verification, pre/post-deploy SSH checks |
| GET | /assets/* | Public | Static assets (logos, images) served via express.static('public/assets') |
| ALL | /admin/queues | Bull Board UI | BullMQ queue monitoring dashboard |
| WS | /socket.io | Socket.IO handshake (CORS from CORS_ALLOWED_ORIGINS) | Real-time meetings, presence, recording, chat |
| WS | /api/terminal/relay | First-frame token via TerminalRelayService | Browser ↔ CLI terminal mirroring (HTTP upgrade) |
The HTTP listener binds to HOST (0.0.0.0 in containers) and PORT (default 8000, see
src/index.ts). The container EXPOSEs 8000 and the prod compose maps ${PORT:-8000}:8000.
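As a rough illustration of the HOST / PORT resolution described above, the helper below derives listen options from an env-like object. It is a sketch, not the code in src/index.ts: resolveListenOptions is a hypothetical name, and defaulting HOST to 0.0.0.0 outside containers is an assumption (the document only states that value for containers).

```typescript
// Hypothetical helper: derive listen options from an env-like object.
interface ListenOptions {
  host: string;
  port: number;
}

function resolveListenOptions(env: Record<string, string | undefined>): ListenOptions {
  const rawPort = env["PORT"];
  // Default port 8000 is stated in this document.
  const port = rawPort === undefined || rawPort === "" ? 8000 : Number(rawPort);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT: ${String(rawPort)}`);
  }
  // Assumption: fall back to all interfaces, matching the container setting.
  const host = env["HOST"] ?? "0.0.0.0";
  return { host, port };
}
```

Passing the environment in as an argument keeps the function testable; a caller would typically hand it the process environment.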
User journeys
These are operational journeys — how code and config move through the system — rather than end-user product flows.
1. Production release (push to main → blue-green swap)
A push to main builds and pushes a Docker image to GHCR; on success the production workflow
SSHes into the EC2 host, deploys to the standby color, health-checks it, then flips nginx — zero
downtime. The previous color is kept for instant rollback.
sequenceDiagram
participant DEV as Developer
participant GH as GitHub Actions
participant GHCR as GHCR registry
participant SM as AWS Secrets Manager
participant SRV as EC2 prod host
participant NG as nginx
participant OLD as Active color
participant NEW as Standby color
DEV->>GH: push to main
GH->>GH: Build and Push Docker Image workflow
GH->>GHCR: push ghcr.io repo tag branch-sha and latest
GH->>GH: on success trigger Deploy to Production Blue-Green
GH->>SM: get-secret-value mr-mentor-backend production
GH->>SRV: scp .env.original to blue-green-deployment
GH->>SRV: ssh run health-check.sh and verify nginx pg redis
GH->>SRV: ssh docker login ghcr.io as mas-mr-mentor
GH->>SRV: ssh run deploy.sh with image URL
SRV->>GHCR: docker pull new image
SRV->>NEW: start standby container on its host port
SRV->>NEW: wait for health start-period then poll /api/health
alt standby healthy
SRV->>NG: rewrite proxy_pass to standby port and reload
NG->>NEW: traffic now flows to new color
SRV->>SRV: write .current_env to new color
SRV-->>GH: deploy success
GH->>SRV: post-deploy verify health pg users redis ping nginx port
else standby unhealthy or deploy non-zero exit
SRV->>SRV: dump container logs then run rollback.sh
SRV->>OLD: keep old color live
SRV-->>GH: exit 1 deployment failed
end
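The health gate in the sequence above can be sketched as a pure decision over the responses collected while polling the standby container. This is an illustration of the idea only; the real logic lives in the server-side deploy.sh, which is not in the repo. HealthProbe, isHealthy, and gateDecision are hypothetical names, and the only grounded detail is that GET /api/health is expected to return a JSON body with success set to true.

```typescript
// Hypothetical model of one poll of GET /api/health on the standby color.
interface HealthProbe {
  status: number;
  body: string;
}

// True only for HTTP 200 with a JSON object whose `success` field is true.
function isHealthy(probe: HealthProbe): boolean {
  if (probe.status !== 200) return false;
  try {
    const parsed: unknown = JSON.parse(probe.body);
    return (
      typeof parsed === "object" &&
      parsed !== null &&
      (parsed as { success?: unknown }).success === true
    );
  } catch {
    // Non-JSON body, for example a proxy error page.
    return false;
  }
}

// Switch nginx if any poll succeeded; otherwise keep the old color and roll back.
function gateDecision(probes: HealthProbe[]): "switch" | "rollback" {
  return probes.some(isHealthy) ? "switch" : "rollback";
}
```

For example, a run of failed polls followed by one healthy response yields "switch", while a run with no healthy response yields "rollback".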
2. Development deploy (push to development → in-place compose recreate)
The development environment does NOT use blue-green. It pulls the new image and recreates the single compose stack in place (brief downtime acceptable).
sequenceDiagram
participant DEV as Developer
participant GH as GitHub Actions
participant SM as AWS Secrets Manager
participant SRV as EC2 dev host
DEV->>GH: push to development
GH->>GH: build image and tag development
GH->>SM: get-secret-value mr-mentor-backend development
GH->>SRV: scp .env to repo dir
GH->>SRV: ssh export IMAGE_TAG and GHCR_REPOSITORY
GH->>SRV: ssh docker login ghcr.io with GHCR_PAT
SRV->>SRV: docker compose -f docker-compose.prod.yml down
SRV->>SRV: docker compose pull then up -d
SRV->>SRV: wait for health then compose ps and logs
SRV->>SRV: docker image prune old images
SRV-->>GH: deploy complete
3. Container cold start (process boot inside a container)
What happens when a freshly pulled container starts — the startup sequence from src/index.ts.
sequenceDiagram
participant DOCKER as Docker runtime
participant APP as backend process
participant PG as PostgreSQL
participant RD as Redis
participant HC as Healthcheck
DOCKER->>APP: node dist/index.js as non-root nodejs user
APP->>APP: load module aliases and reflect-metadata
APP->>PG: initialize TypeORM DataSource auto-sync on
APP->>PG: seed colleges and ensure admin superadmin users
opt ENABLE_SEEDING true
APP->>PG: seed batches and courses
end
APP->>RD: initialize Redis singleton
APP->>APP: start 5 BullMQ workers
APP->>APP: schedule cleanup 24h and kpi 15min jobs
APP->>APP: createServer then attach Socket.IO and terminal relay
APP->>APP: listen on HOST and PORT 8000
loop every 30s after 40s start period
HC->>APP: GET /api/health
APP-->>HC: 200 success
end
4. Object storage upload (recording / document / banner)
The API uses presigned direct-to-S3 uploads (USE_DIRECT_S3_UPLOAD=true) so large media bypasses
the backend process, then stores metadata in PostgreSQL.
sequenceDiagram
participant FE as Frontend or client
participant API as Backend API
participant S3SVC as S3Service
participant S3 as AWS S3 ap-south-1
participant DB as PostgreSQL
FE->>API: request presigned upload URL
API->>S3SVC: build putObject signed URL for target bucket
S3SVC-->>API: presigned URL
API-->>FE: presigned URL and object key
FE->>S3: PUT bytes directly to bucket
FE->>API: confirm upload with object key
API->>DB: persist metadata row
API-->>FE: stored confirmation
Note over API,S3: Private buckets like invoices are read back via short-lived getSignedUrlForBucket only
5. Manual rollback
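If a release misbehaves after the swap, an operator either brings the previous (now standby) color back into service or invokes rollback.sh on the prod host; the rollback path is shown here.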
sequenceDiagram
participant OPS as Operator
participant SRV as EC2 prod host
participant NG as nginx
OPS->>SRV: ssh run rollback.sh
SRV->>SRV: read .current_env and pick previous color
SRV->>NG: point proxy_pass back to previous port and reload
SRV->>SRV: write .current_env to previous color
SRV-->>OPS: previous version live again
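The color selection in the rollback flow can be sketched as follows. The actual rollback.sh is server-side and not in the repo, so this is an assumption-laden illustration: parseCurrentEnv and previousColor are hypothetical names, and the sketch assumes the marker file contains only the word blue or green, possibly with surrounding whitespace.

```typescript
type Color = "blue" | "green";

// Parse the .current_env marker contents; tolerate a trailing newline.
function parseCurrentEnv(contents: string): Color {
  const value = contents.trim();
  if (value === "blue" || value === "green") return value;
  throw new Error(`Unexpected .current_env contents: ${JSON.stringify(contents)}`);
}

// With exactly two colors, the previous color is simply the other one.
function previousColor(markerContents: string): Color {
  return parseCurrentEnv(markerContents) === "blue" ? "green" : "blue";
}
```

Rejecting unexpected contents instead of guessing avoids pointing nginx at the wrong container when the marker file is empty or corrupted.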
Background jobs & async
The five BullMQ workers run in-process inside the backend container, sharing the same Redis
instance (REDIS_HOST=redis in compose). They are part of the runtime, so they scale and restart
with the app container.
| Worker | Queue | Schedule | Purpose |
|---|---|---|---|
| email.worker | emailQueue | On demand | OTP, password reset, meeting notifications via Gmail SMTP |
| database.worker | databaseQueue | On demand | Heavy DB operations |
| cleanup.worker | cleanupQueue | Every 24h | Expired slot cleanup |
| kpi.worker | kpiQueue | Every 15min | Dashboard KPI calculations |
| resumeAnalysis.worker | resumeAnalysisQueue | On demand | AI resume processing |
The Bull Board UI is mounted at /admin/queues for queue monitoring. Real-time fan-out
(Socket.IO presence, recording state) also leans on Redis. Infra implication: Redis is a hard
dependency for both queueing and WebSocket coordination; the prod compose declares the app's
depends_on: redis with condition service_healthy.
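The two scheduled workers in the table above run on fixed intervals. A minimal sketch of how those intervals could be expressed as BullMQ repeat options follows; it assumes the schedules are registered with the fixed-interval repeat.every option (in milliseconds), which the source does not confirm, and the job names are hypothetical.

```typescript
// Hypothetical description of a fixed-interval job; queue names come from the table above.
interface RepeatableJob {
  queue: string;
  jobName: string; // hypothetical job name, not taken from the source
  everyMs: number;
}

const MINUTE_MS = 60 * 1000;
const HOUR_MS = 60 * MINUTE_MS;

const cleanupJob: RepeatableJob = {
  queue: "cleanupQueue",
  jobName: "expired-slot-cleanup",
  everyMs: 24 * HOUR_MS,
};

const kpiJob: RepeatableJob = {
  queue: "kpiQueue",
  jobName: "dashboard-kpi",
  everyMs: 15 * MINUTE_MS,
};

// Build the options object that would be passed as the third argument of queue.add().
function toRepeatOptions(job: RepeatableJob): { repeat: { every: number } } {
  return { repeat: { every: job.everyMs } };
}
```

Because repeat schedules are stored in Redis, a wiped Redis volume would drop them until the app re-registers them on its next boot.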
Webhooks the host must reach (inbound): Razorpay payment callbacks, Exotel status callbacks
(/api/exotel/*, gated by EXOTEL_WEBHOOK_TOKEN), WhatsApp webhook verification, and Leegality
eSign completion. These require BACKEND_PUBLIC_URL to be the publicly resolvable HTTPS origin
(https://api.mrmentor.in in production; an ngrok tunnel in local dev per .env).
External integrations
All third-party endpoints are configured via env (sourced from AWS Secrets Manager at deploy). Most integrations self-disable or degrade gracefully when their keys are unset.
| Service | Env vars | Failure / fallback |
|---|---|---|
| AWS S3 (ap-south-1) | AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION, AWS_S3_BUCKET_NAME, AWS_S3_STUDENT_DOCUMENTS_BUCKET, AWS_S3_BANNER_ASSETS_BUCKET, GLOBAL_INVOICE_S3_BUCKET | S3Service throws on boot if core bucket/creds missing; invoice bucket optional (PDFs emailed but not archived if unset) |
| Razorpay | RAZORPAY_KEY_ID, RAZORPAY_KEY_SECRET | Payment flows fail closed |
| Google OAuth / Calendar / Drive | GOOGLE_CLIENT_ID, GOOGLE_CLIENT_SECRET, GOOGLE_REDIRECT_URI, GOOGLE_DRIVE_REDIRECT_URI | Google login/calendar sync disabled |
| Gmail SMTP | EMAIL_USER, EMAIL_PASS | Email worker errors (535 BadCredentials on stale app password) |
| Voice AI calling | ELEVENLABS_API_KEY, AARYA_AGENT_ID, AARYA_PHONE_NUMBER_ID, MissOzone config | Voice interview calling disabled |
| Exotel telephony + SMS | EXOTEL_SID, EXOTEL_API_KEY, EXOTEL_API_TOKEN, EXOTEL_CALLER_ID, EXOTEL_SUBDOMAIN, EXOTEL_WEBHOOK_TOKEN | CRM falls back to tel: links when unset; subdomain must match the account cluster or auth 401s |
| WhatsApp Cloud API | WHATSAPP_ACCESS_TOKEN, WHATSAPP_PHONE_NUMBER_ID, WHATSAPP_BUSINESS_ACCOUNT_ID, WHATSAPP_WEBHOOK_VERIFY_TOKEN, WHATSAPP_GRAPH_API_VERSION | WhatsApp messaging disabled |
| LiteLLM gateway | LITELLM_BASE_URL, LITELLM_MASTER_KEY | LLM features fail; gateway is a separate dev EC2 service on mas-network |
| Graphy LMS | GRAPHY_BASE_URL (mrlearn.in) | Mr. Learn proxy endpoints fail |
| Judge0 / compiler | JUDGE0_API_URL, COMPILER_API_URL | Code execution disabled |
| mr-hire-backend | MR_HIRE_BACKEND_URL | Recruitment AI integration unavailable |
| Leegality eSign | LEEGALITY_SANDBOX_BASE_URL, template envs | PAP eSign flow blocked |
| class-agent | CLASS_AGENT_URL, CLASS_AGENT_API_KEY | AI classroom proxy fails |
| CloudFront assets CDN | MAS_ASSETS_CDN_URL (d1ib7yueotbmuk.cloudfront.net) | Asset URLs fall back to direct origin |
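The S3 row above describes a fail-fast check for core settings with an optional invoice bucket. A minimal sketch of that pattern is below. It is not the actual S3Service: loadS3Config is a hypothetical name, and treating exactly these four variables as the "core" set is an assumption.

```typescript
type Env = Record<string, string | undefined>;

interface S3Config {
  region: string;
  accessKeyId: string;
  secretAccessKey: string;
  coreBucket: string;
  invoiceBucket: string | null; // null means invoices are emailed but not archived
}

// Assumption: these four variables form the "core bucket/creds" set.
const REQUIRED_KEYS = [
  "AWS_ACCESS_KEY_ID",
  "AWS_SECRET_ACCESS_KEY",
  "AWS_REGION",
  "AWS_S3_BUCKET_NAME",
];

function loadS3Config(env: Env): S3Config {
  // Fail fast at boot and name every missing key at once.
  const missing = REQUIRED_KEYS.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing S3 env: ${missing.join(", ")}`);
  }
  const read = (key: string): string => env[key] ?? "";
  return {
    region: read("AWS_REGION"),
    accessKeyId: read("AWS_ACCESS_KEY_ID"),
    secretAccessKey: read("AWS_SECRET_ACCESS_KEY"),
    coreBucket: read("AWS_S3_BUCKET_NAME"),
    invoiceBucket: env["GLOBAL_INVOICE_S3_BUCKET"] || null,
  };
}
```

Listing all missing keys in one error saves a redeploy cycle per forgotten secret when a new environment is first provisioned.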
Frontend / origin env: FRONTEND_URL, MR_HIRE_FRONTEND_URL, MAS_WEBSITE_URL,
BACKEND_PUBLIC_URL, and CORS_ALLOWED_ORIGINS (JSON array consumed by both Express CORS in
src/app.ts and Socket.IO CORS in src/index.ts).
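Since CORS_ALLOWED_ORIGINS is a JSON array shared by two CORS configurations, a small parser plus an origin check illustrates the rules stated in this document: all origins allowed in development, no-origin and Origin: null always allowed, and an allow-list otherwise. The function names are hypothetical, and the allow-list passed in stands for defaultOrigins merged with the parsed env value, whose actual contents are not shown here.

```typescript
// Parse the CORS_ALLOWED_ORIGINS env value (a JSON array of origin strings).
function parseAllowedOrigins(raw: string | undefined): string[] {
  if (!raw) return [];
  const parsed: unknown = JSON.parse(raw);
  if (!Array.isArray(parsed) || !parsed.every((item) => typeof item === "string")) {
    throw new Error("CORS_ALLOWED_ORIGINS must be a JSON array of strings");
  }
  return parsed as string[];
}

// Hypothetical origin check mirroring the rules described in this document.
function isOriginAllowed(
  origin: string | undefined,
  nodeEnv: string,
  allowList: string[],
): boolean {
  // Requests without an Origin header, or with Origin: null, are always allowed.
  if (origin === undefined || origin === "null") return true;
  // Development allows every origin.
  if (nodeEnv === "development") return true;
  return allowList.includes(origin);
}
```

A malformed JSON value throws at parse time, which surfaces a bad secret at boot rather than as silent CORS rejections later.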
Status lifecycles
The most meaningful operational lifecycle is the active color of the blue-green deployment,
tracked by the server-side .current_env file and the nginx upstream.
stateDiagram-v2
[*] --> Blue
Blue --> DeployingGreen : deploy.sh pulls image and starts green
DeployingGreen --> Blue : green health check fails then rollback
DeployingGreen --> Green : green healthy and nginx switched
Green --> DeployingBlue : next deploy starts blue
DeployingBlue --> Green : blue health check fails then rollback
DeployingBlue --> Blue : blue healthy and nginx switched
Green --> Green : manual rollback keeps current
Blue --> Blue : manual rollback keeps current
Container restart lifecycle (Docker restart: always + healthcheck):
stateDiagram-v2
[*] --> Starting
Starting --> Healthy : /api/health returns 200 after start period
Starting --> Unhealthy : 3 failed health checks
Healthy --> Unhealthy : health checks start failing
Unhealthy --> Restarting : Docker restarts container
Restarting --> Starting
Healthy --> Stopped : SIGTERM graceful shutdown
Stopped --> [*]
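The Starting, Healthy, and Unhealthy transitions above follow Docker's healthcheck rules: failures inside the start period are not counted until a first success, and a container turns unhealthy after a number of consecutive failures. The sketch below models only that status calculation, using the 40s start period and 3 retries mentioned in this document as defaults; it does not model the restart step, and healthStatus and Check are hypothetical names.

```typescript
type HealthStatus = "starting" | "healthy" | "unhealthy";

interface Check {
  atSeconds: number; // seconds since container start
  ok: boolean;
}

function healthStatus(
  checks: Check[],
  startPeriodSeconds: number = 40,
  retries: number = 3,
): HealthStatus {
  let status: HealthStatus = "starting";
  let consecutiveFailures = 0;
  let started = false; // true once any check has succeeded

  for (const check of checks) {
    if (check.ok) {
      status = "healthy";
      consecutiveFailures = 0;
      started = true;
      continue;
    }
    // Failures during the start period are ignored until the first success.
    if (!started && check.atSeconds < startPeriodSeconds) continue;
    consecutiveFailures += 1;
    if (consecutiveFailures >= retries) status = "unhealthy";
  }
  return status;
}
```

With checks every 30s, a container that never answers is reported unhealthy at the 120s check: the 30s failure falls inside the start period, and the failures at 60s, 90s, and 120s are the three that count.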
Edge cases, limits & gotchas
- PM2 config is largely vestigial. ecosystem.config.js (cluster mode, instances: 'max', PORT: 3000) and deploy.sh describe a PM2-on-bare-metal path. Production today runs the Docker image (CMD ["node", "dist/index.js"]) under blue-green compose, not PM2. The PM2 PORT: 3000 is stale; the container listens on 8000. Treat the PM2 path as legacy/alternate unless a host is explicitly running it. (inferred)
- Single Node process per container, no clustering. The container runs one Node process (node dist/index.js). Only one color serves traffic at a time: blue-green swaps the live container rather than running multiple replicas simultaneously, so it provides zero-downtime releases, not horizontal scale. See the next point for why multi-replica is currently unsafe.
- Terminal relay requires sticky sessions. TerminalRelayWsServer.ts keeps an in-memory pairings map; the CLI and the browser must land on the same process to be wired together. Multi-replica scaling would need Redis pub/sub routing (documented as a v1 limitation in the source comment). The same caveat applies to any in-memory Socket.IO state not backed by Redis.
- Postgres and Redis are containers, shared across colors. In docker-compose.prod.yml they are defined as services (postgres:16-alpine, redis:7-alpine) with named volumes (postgres_data, redis_data). On the prod host both blue and green app containers connect to the same long-lived mr-mentor-postgres / mr-mentor-redis containers (the post-deploy check docker exec mr-mentor-postgres / mr-mentor-redis confirms fixed container names). The data layer is therefore NOT swapped during a deploy; only the app color is. (inferred from workflow exec names)
- mas-network is external. The compose network is declared external: true; it must be created out-of-band (docker network create mas-network) so the backend, postgres, redis, and sibling services (LiteLLM gateway, voice agent, mr-hire) can resolve each other by container name.
- TypeORM auto-sync is ON. There are no migrations; entity changes auto-apply to the live DB on boot. In production this means a deploy can mutate schema; be cautious with destructive entity edits.
- Secrets live in AWS Secrets Manager, not the repo. Deploys fetch mr-mentor-backend/{production,development} (ap-south-1) and materialize .env.original (prod, into /home/ubuntu/blue-green-deployment) or .env (dev). The add-env-var skill's older "Pattern A" docs are stale per project memory: env changes must reach Secrets Manager AND be redeployed (or the running container recreated) to take effect; docker compose restart will NOT re-read env_file, only an up -d recreate will.
- GHCR auth on the host. Pulls require docker login ghcr.io with a valid token. Prod logs in as mas-mr-mentor (GHCR_PULL_TOKEN); dev uses the long-lived GHCR_PAT (MAS-intern). A periodic refresh workflow exists as a safety net because deploy workflows can overwrite the host's docker login.
- Image tagging. build.yml tags ghcr.io/<repo>:<branch>, :<branch>-<shortsha>, plus :latest for main. The production deploy reconstructs the tag as main-<shortsha> and pulls that exact image, so re-running with workflow_dispatch derives the tag from the current main HEAD commit.
- CORS is permissive in development. src/app.ts allows ALL origins when NODE_ENV === 'development' and always allows no-origin and Origin: null requests. In production, origins must be in defaultOrigins or CORS_ALLOWED_ORIGINS.
- Body size limit 20mb. express.json / urlencoded are capped at 20mb; large media must go through presigned direct-to-S3 upload, not the JSON body.
- Health endpoint is load-bearing. Docker healthcheck, nginx switch verification, and both pre/post-deploy SSH steps assert GET /api/health → {"success":true}. A change to its shape would break the deploy gate.
- Non-root container. The runner stage runs as user nodejs (uid 1001); /app/recordings is pre-created and chowned. Any new write path inside the container needs matching ownership.
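The image-tagging point above says the production deploy reconstructs main-<shortsha> from a commit. A minimal sketch of that reconstruction follows. It rests on stated assumptions: productionImageRef is a hypothetical name, the short SHA is taken to be the first 7 hex characters (the workflow's actual length is not shown here), and the repository path is lowercased because container image names must be lowercase.

```typescript
// Hypothetical helper: rebuild the production image reference from a commit SHA.
function productionImageRef(
  repository: string, // for example "owner/name"; the real repository is not shown here
  commitSha: string,
  shortLength: number = 7, // assumption: 7-character short SHA
): string {
  if (!/^[0-9a-f]{40}$/.test(commitSha)) {
    throw new Error(`Expected a full 40-character commit SHA, got: ${commitSha}`);
  }
  const shortSha = commitSha.slice(0, shortLength);
  return `ghcr.io/${repository.toLowerCase()}:main-${shortSha}`;
}
```

Because the tag is derived from the commit rather than from :latest, the deploy pulls a specific build, and a manual re-run against a newer HEAD resolves to a different image.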