
Infrastructure Topology

This page describes the runtime infrastructure that serves the MAS / Mr. Mentor backend in production and development: how client traffic reaches the API, where it terminates, which data stores and object storage it depends on, which external services it calls, and how new builds reach the servers through a GitHub Container Registry (GHCR) + blue-green deployment pipeline. The topology is synthesized from the build/deploy workflows, Docker assets, PM2 config, server-side scripts referenced by the workflows, and runtime env. Parts that live outside the repo (nginx config, EC2 host layout, and the blue-green shell scripts under /home/ubuntu/blue-green-deployment on the server) are marked (inferred) or (server-side, not in repo).

Status: documented from source on this branch.


Overview

The backend is a single Node.js 20 + Express + TypeScript service (src/index.ts, src/app.ts) that bundles, in one process:

  • the HTTP REST API on port 8000 (42 route files, see src/routes/index.ts),
  • a Socket.IO server on the same HTTP server at path /socket.io (WebRTC meetings, presence, recording, chat — src/socket.ts),
  • a raw WebSocket terminal-relay at path /api/terminal/relay (src/services/TerminalRelayWsServer.ts), attached in noServer mode to the same HTTP server so it co-exists with Socket.IO,
  • five BullMQ workers running in-process against Redis.

It is consumed by three frontends and several sibling services. In production it runs as a Docker container behind nginx, with PostgreSQL and Redis as sidecar containers, and AWS S3 (region ap-south-1) for object storage.
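
The co-existence of Socket.IO and the raw terminal relay on one HTTP server comes down to dispatching upgrade requests by URL path. The following is a minimal sketch of that dispatch, not the code in src/socket.ts or src/services/TerminalRelayWsServer.ts; the function name routeUpgrade and its return labels are hypothetical, and only the two paths are taken from this page.

```typescript
// Hypothetical sketch: decide which handler should take an HTTP upgrade request.
// The two paths come from this document; everything else is illustrative.
type UpgradeTarget = "socket.io" | "terminal-relay" | "reject";

const SOCKET_IO_PATH = "/socket.io";
const TERMINAL_RELAY_PATH = "/api/terminal/relay";

function routeUpgrade(rawUrl: string): UpgradeTarget {
  // Strip the query string; only the path decides the target.
  const path = rawUrl.split("?")[0];
  if (path === TERMINAL_RELAY_PATH) return "terminal-relay";
  // Socket.IO requests look like /socket.io/?EIO=4&transport=websocket
  if (path === SOCKET_IO_PATH || path.startsWith(SOCKET_IO_PATH + "/")) {
    return "socket.io";
  }
  return "reject";
}
```

In a real server only the terminal-relay branch would hand the socket to a WebSocket server created in noServer mode; Socket.IO installs its own upgrade listener, so the relay handler has to leave /socket.io requests untouched.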

Who operates / depends on it:

Persona | How they reach the infra
Students, mentors, public visitors | mas-website-live (Next.js, myanalyticsschool.com) → HTTPS API
Admins, sales, superadmin | mr-mentor-frontend (Next.js, admin dashboard) → HTTPS API + Socket.IO
Recruiters | mr-hire-frontend → mr-hire-backend, which also calls this backend
Students running code | @myanalyticsschool/connect CLI + xterm.js → terminal-relay WS
DevOps / CI | GitHub Actions → GHCR → SSH blue-green deploy to EC2

Key concepts & entities

This is a devops/operations domain; it owns no TypeORM entities. The "entities" here are infrastructure components and config artifacts.

Concept | Where defined | Notes
Backend container image | Dockerfile, build.yml | Multi-stage (Bun build → Node 20 runner), pushed to GHCR
Production compose stack | docker-compose.prod.yml | app + postgres:16-alpine + redis:7-alpine on external mas-network
Dev compose stack | docker-compose.yml, docker-compose.dev.yml | Used by the development server deploy
PM2 ecosystem | ecosystem.config.js | Legacy/alternate process manager path (see gotchas)
Blue-green scripts | deploy.sh, rollback.sh, health-check.sh under /home/ubuntu/blue-green-deployment | (server-side, not in repo); invoked over SSH by deploy-production.yml
Secrets source | AWS Secrets Manager mr-mentor-backend/{production,development} | Fetched at deploy time, materialized as .env.original / .env on the server
Health endpoint | GET /api/health | Used by Docker healthcheck, nginx switch verification, pre/post-deploy checks
.current_env marker | server file | Holds blue or green, the currently live color

Architecture

Production topology — clients, edge, the backend process, data stores, object storage, and external APIs. Ports are container/host ports; nginx terminates TLS and reverse-proxies to the active backend container.

flowchart TD
    subgraph CLIENTS["Clients"]
      WB["mas-website-live (myanalyticsschool.com)"]
      AD["mr-mentor-frontend (admin dashboard)"]
      HF["mr-hire-frontend"]
      CLI["@myanalyticsschool/connect CLI + xterm.js"]
    end

    DNS["DNS: api.mrmentor.in / api.myanalyticsschool.com"]
    NGINX["nginx reverse proxy + TLS (inferred)"]

    subgraph HOST["Production EC2 host (Ubuntu, ap-south-1)"]
      subgraph BG["Blue-Green app containers"]
        BLUE["mr-mentor-backend-blue :8000"]
        GREEN["mr-mentor-backend-green :8000"]
      end
      PG[("mr-mentor-postgres :5432 (postgres:16-alpine)")]
      RD[("mr-mentor-redis :6379 (redis:7-alpine)")]
    end

    subgraph PROC["Inside the active backend container"]
      API["Express REST API /api/*"]
      IO["Socket.IO /socket.io"]
      WS["Terminal relay WS /api/terminal/relay"]
      WORK["5 BullMQ workers"]
    end

    subgraph S3["AWS S3 (ap-south-1)"]
      B1["mr-mentor-recordings"]
      B2["student documents bucket"]
      B3["banner assets bucket"]
      B4["invoice PDF bucket (private)"]
    end

    subgraph EXT["External APIs"]
      RZP["Razorpay"]
      GOOG["Google OAuth + Calendar + Drive"]
      SMTP["Gmail SMTP"]
      VOICE["ElevenLabs / Aarya / MissOzone"]
      EXO["Exotel telephony + SMS"]
      WA["WhatsApp Cloud API"]
      LLM["LiteLLM gateway (dev-llm.myanalyticsschool.com)"]
      GRAPHY["Graphy LMS (mrlearn.in)"]
      HIRE["mr-hire-backend"]
      LEEG["Leegality eSign"]
      JUDGE["Judge0 / compiler API"]
      CFN["CloudFront assets CDN"]
    end

    WB --> DNS
    AD --> DNS
    HF --> DNS
    CLI --> DNS
    DNS --> NGINX
    NGINX -->|"active color"| BLUE
    NGINX -.->|"standby"| GREEN
    BLUE --> PROC
    API --> PG
    API --> RD
    WORK --> RD
    WORK --> PG
    API --> S3
    API --> EXT
    IO --> RD

Request flow inside the process (routes → controllers → services → entities/external):

flowchart LR
    REQ["HTTP request"] --> CORS["CORS + helmet middleware (app.ts)"]
    CORS --> AUTH["auth.middleware (JWT) and role guards"]
    AUTH --> ROUTES["routes/index.ts (42 route files)"]
    ROUTES --> CTRL["controllers/*"]
    CTRL --> SVC["services/*"]
    SVC --> ORM["TypeORM DataSource (config/database.ts)"]
    SVC --> REDISC["Redis (config/redis.ts)"]
    SVC --> S3SVC["S3Service / s3Uploader.service"]
    SVC --> QUEUE["QueueService → BullMQ"]
    ORM --> PGDB[("PostgreSQL mas DB")]
    S3SVC --> S3OBJ["AWS S3 ap-south-1"]
    QUEUE --> WORKERS["email / database / cleanup / kpi / resumeAnalysis workers"]

Data model

This domain has no relational entities; its "data model" is the set of stateful artifacts and volumes that the infrastructure persists. The erDiagram below models the operational relationships (host → containers → volumes → buckets) for orientation.

erDiagram
    HOST ||--o{ CONTAINER : "runs"
    CONTAINER ||--o| VOLUME : "mounts"
    HOST ||--|| NGINX : "fronts"
    APP_CONTAINER ||--|| POSTGRES : "connects"
    APP_CONTAINER ||--|| REDIS : "connects"
    APP_CONTAINER ||--o{ S3_BUCKET : "reads_writes"

    HOST {
        string provider "AWS EC2 ap-south-1"
        string os "Ubuntu"
        string network "mas-network (docker external)"
    }
    CONTAINER {
        string name "blue / green / postgres / redis"
        int port "8000 / 5432 / 6379"
        string restart "always"
    }
    VOLUME {
        string postgres_data "PG datadir"
        string redis_data "Redis AOF/RDB"
    }
    NGINX {
        string config "sites-available/api.myanalyticsschool.com"
        string upstream "active backend host port"
    }
    S3_BUCKET {
        string recordings "mr-mentor-recordings"
        string documents "student documents"
        string banners "banner assets"
        string invoices "private invoice PDFs"
    }

API surface

Infrastructure exposes a small set of operational endpoints. Application routes are documented in the feature docs; the ops-relevant surface mounted in src/app.ts / src/routes/index.ts:

Method | Path | Auth/role | Purpose
GET | /api/health | Public | Liveness/readiness; returns { success: true, ... }. Used by Docker healthcheck, nginx switch verification, pre/post-deploy SSH checks
GET | /assets/* | Public | Static assets (logos, images) served via express.static('public/assets')
ALL | /admin/queues | Bull Board UI | BullMQ queue monitoring dashboard
WS | /socket.io | Socket.IO handshake (CORS from CORS_ALLOWED_ORIGINS) | Real-time meetings, presence, recording, chat
WS | /api/terminal/relay | First-frame token via TerminalRelayService | Browser ↔ CLI terminal mirroring (HTTP upgrade)

The HTTP listener binds to HOST (0.0.0.0 in containers) and PORT (default 8000, see src/index.ts). The container EXPOSEs 8000 and the prod compose maps ${PORT:-8000}:8000.
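
As an illustration of the HOST / PORT resolution described above, here is a small sketch. It is not the code in src/index.ts: the helper name resolveListenConfig is hypothetical, and falling back to 0.0.0.0 when HOST is unset is an assumption (the page only states that containers use 0.0.0.0).

```typescript
// Hypothetical helper: resolve the listen address from environment-style input.
interface ListenConfig {
  host: string;
  port: number;
}

function resolveListenConfig(env: Record<string, string | undefined>): ListenConfig {
  // Assumption: bind to all interfaces when HOST is not provided.
  const host = env.HOST ?? "0.0.0.0";
  const parsed = Number.parseInt(env.PORT ?? "", 10);
  // Fall back to the documented default when PORT is missing or not a valid TCP port.
  const port = Number.isInteger(parsed) && parsed > 0 && parsed < 65536 ? parsed : 8000;
  return { host, port };
}
```

Passing the environment in as a plain record keeps the helper easy to test; a caller would typically hand it the process environment at startup.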


User journeys

These are operational journeys — how code and config move through the system — rather than end-user product flows.

1. Production release (push to main → blue-green swap)

A push to main builds and pushes a Docker image to GHCR; on success the production workflow SSHes into the EC2 host, deploys to the standby color, health-checks it, then flips nginx — zero downtime. The previous color is kept for instant rollback.

sequenceDiagram
    participant DEV as Developer
    participant GH as GitHub Actions
    participant GHCR as GHCR registry
    participant SM as AWS Secrets Manager
    participant SRV as EC2 prod host
    participant NG as nginx
    participant OLD as Active color
    participant NEW as Standby color

    DEV->>GH: push to main
    GH->>GH: Build and Push Docker Image workflow
    GH->>GHCR: push ghcr.io repo tag branch-sha and latest
    GH->>GH: on success trigger Deploy to Production Blue-Green
    GH->>SM: get-secret-value mr-mentor-backend production
    GH->>SRV: scp .env.original to blue-green-deployment
    GH->>SRV: ssh run health-check.sh and verify nginx pg redis
    GH->>SRV: ssh docker login ghcr.io as mas-mr-mentor
    GH->>SRV: ssh run deploy.sh with image URL
    SRV->>GHCR: docker pull new image
    SRV->>NEW: start standby container on its host port
    SRV->>NEW: wait for health start-period then poll /api/health
    alt standby healthy
        SRV->>NG: rewrite proxy_pass to standby port and reload
        NG->>NEW: traffic now flows to new color
        SRV->>SRV: write .current_env to new color
        SRV-->>GH: deploy success
        GH->>SRV: post-deploy verify health pg users redis ping nginx port
    else standby unhealthy or deploy non-zero exit
        SRV->>SRV: dump container logs then run rollback.sh
        SRV->>OLD: keep old color live
        SRV-->>GH: exit 1 deployment failed
    end
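
The health gate in the alt branch above can be sketched as a pure decision over the poll results. This is an illustration only: the real gate lives in the server-side deploy.sh and health-check.sh, which are not in the repo, and the names ProbeResult, isHealthy, and gateDecision as well as the attempt limit are hypothetical.

```typescript
// Hypothetical sketch of the health gate; the real logic lives in server-side scripts.
interface ProbeResult {
  status: number; // HTTP status returned by GET /api/health
  body: string;   // raw response body
}

// A response passes only if it is HTTP 200 and the JSON body has success === true.
function isHealthy(probe: ProbeResult): boolean {
  if (probe.status !== 200) return false;
  try {
    const parsed: unknown = JSON.parse(probe.body);
    return (
      typeof parsed === "object" &&
      parsed !== null &&
      (parsed as { success?: unknown }).success === true
    );
  } catch {
    return false; // a non-JSON body counts as unhealthy
  }
}

// Walk the poll results in order: switch nginx on the first healthy response,
// roll back if maxAttempts polls pass without one.
function gateDecision(probes: ProbeResult[], maxAttempts: number): "switch" | "rollback" {
  for (const probe of probes.slice(0, maxAttempts)) {
    if (isHealthy(probe)) return "switch";
  }
  return "rollback";
}
```

Checking the body shape as well as the status code mirrors the point made under gotchas: the deploy gate depends on the exact response of the health endpoint.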

2. Development deploy (push to development → in-place compose recreate)

The development environment does NOT use blue-green. It pulls the new image and recreates the single compose stack in place (brief downtime acceptable).

sequenceDiagram
    participant DEV as Developer
    participant GH as GitHub Actions
    participant SM as AWS Secrets Manager
    participant SRV as EC2 dev host

    DEV->>GH: push to development
    GH->>GH: build image and tag development
    GH->>SM: get-secret-value mr-mentor-backend development
    GH->>SRV: scp .env to repo dir
    GH->>SRV: ssh export IMAGE_TAG and GHCR_REPOSITORY
    GH->>SRV: ssh docker login ghcr.io with GHCR_PAT
    SRV->>SRV: docker compose -f docker-compose.prod.yml down
    SRV->>SRV: docker compose pull then up -d
    SRV->>SRV: wait for health then compose ps and logs
    SRV->>SRV: docker image prune old images
    SRV-->>GH: deploy complete

3. Container cold start (process boot inside a container)

What happens when a freshly pulled container starts — the startup sequence from src/index.ts.

sequenceDiagram
    participant DOCKER as Docker runtime
    participant APP as backend process
    participant PG as PostgreSQL
    participant RD as Redis
    participant HC as Healthcheck

    DOCKER->>APP: node dist/index.js as non-root nodejs user
    APP->>APP: load module aliases and reflect-metadata
    APP->>PG: initialize TypeORM DataSource auto-sync on
    APP->>PG: seed colleges and ensure admin superadmin users
    opt ENABLE_SEEDING true
        APP->>PG: seed batches and courses
    end
    APP->>RD: initialize Redis singleton
    APP->>APP: start 5 BullMQ workers
    APP->>APP: schedule cleanup 24h and kpi 15min jobs
    APP->>APP: createServer then attach Socket.IO and terminal relay
    APP->>APP: listen on HOST and PORT 8000
    loop every 30s after 40s start period
        HC->>APP: GET /api/health
        APP-->>HC: 200 success
    end
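
The ordering in the diagram above can be expressed as a list of steps that run in sequence. The sketch below is hypothetical (bootSteps and runBoot are not names from src/index.ts), runs the steps synchronously for brevity although the real ones are asynchronous, and assumes that a failing step aborts the boot before the server listens.

```typescript
// Hypothetical boot runner; step names follow the diagram above.
interface BootStep {
  name: string;
  enabled: boolean;
  run: () => void; // real steps are asynchronous; kept synchronous for brevity
}

function bootSteps(
  env: Record<string, string | undefined>,
  run: (name: string) => void
): BootStep[] {
  const step = (name: string, enabled = true): BootStep => ({
    name,
    enabled,
    run: () => run(name),
  });
  return [
    step("init TypeORM DataSource"),
    step("seed colleges and ensure admin users"),
    step("seed batches and courses", env.ENABLE_SEEDING === "true"),
    step("init Redis singleton"),
    step("start BullMQ workers"),
    step("schedule cleanup and kpi jobs"),
    step("attach Socket.IO and terminal relay"),
    step("listen"),
  ];
}

// Runs enabled steps in order. A throwing step aborts the boot (assumption),
// so "listen" is never reached when PostgreSQL or Redis initialisation fails.
function runBoot(steps: BootStep[]): string[] {
  const done: string[] = [];
  for (const s of steps) {
    if (!s.enabled) continue;
    s.run();
    done.push(s.name);
  }
  return done;
}
```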

4. Object storage upload (recording / document / banner)

The API uses presigned direct-to-S3 uploads (USE_DIRECT_S3_UPLOAD=true) so large media bypasses the backend process, then stores metadata in PostgreSQL.

sequenceDiagram
    participant FE as Frontend or client
    participant API as Backend API
    participant S3SVC as S3Service
    participant S3 as AWS S3 ap-south-1
    participant DB as PostgreSQL

    FE->>API: request presigned upload URL
    API->>S3SVC: build putObject signed URL for target bucket
    S3SVC-->>API: presigned URL
    API-->>FE: presigned URL and object key
    FE->>S3: PUT bytes directly to bucket
    FE->>API: confirm upload with object key
    API->>DB: persist metadata row
    API-->>FE: stored confirmation
    Note over API,S3: Private buckets like invoices are read back via short-lived getSignedUrlForBucket only
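
The server side of this flow can be sketched with the signing and persistence calls injected as dependencies. Everything named here is hypothetical: UploadDeps, issueUploadTicket, and confirmUpload are illustration-only names, the userId/fileName key layout and the 900-second expiry are assumptions, and the real implementation goes through S3Service and TypeORM.

```typescript
// Hypothetical dependency interfaces; the real code uses S3Service and TypeORM.
interface UploadDeps {
  presignPut(bucket: string, key: string, expiresInSeconds: number): Promise<string>;
  saveMetadata(row: { bucket: string; key: string; uploadedBy: string }): Promise<void>;
}

interface UploadTicket {
  url: string;
  bucket: string;
  key: string;
}

// Step 1: the API hands out a short-lived PUT URL plus the object key.
async function issueUploadTicket(
  deps: UploadDeps,
  bucket: string,
  userId: string,
  fileName: string
): Promise<UploadTicket> {
  // Assumed key layout: <userId>/<sanitised file name>.
  const safeName = fileName.replace(/[^A-Za-z0-9._-]/g, "_");
  const key = `${userId}/${safeName}`;
  const url = await deps.presignPut(bucket, key, 900); // 900 s expiry is an assumption
  return { url, bucket, key };
}

// Step 2 happens in the client: it PUTs the bytes straight to ticket.url.

// Step 3: the client confirms, and the API stores a metadata row.
async function confirmUpload(deps: UploadDeps, ticket: UploadTicket, userId: string): Promise<void> {
  if (!ticket.key.startsWith(`${userId}/`)) {
    throw new Error("object key does not belong to this user");
  }
  await deps.saveMetadata({ bucket: ticket.bucket, key: ticket.key, uploadedBy: userId });
}
```

Because the media bytes never pass through these functions, the 20mb JSON body limit mentioned under gotchas does not apply to the upload itself.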

5. Manual rollback

If a release misbehaves after the swap, ops re-runs the standby color or invokes rollback.

sequenceDiagram
    participant OPS as Operator
    participant SRV as EC2 prod host
    participant NG as nginx

    OPS->>SRV: ssh run rollback.sh
    SRV->>SRV: read .current_env and pick previous color
    SRV->>NG: point proxy_pass back to previous port and reload
    SRV->>SRV: write .current_env to previous color
    SRV-->>OPS: previous version live again

Background jobs & async

The five BullMQ workers run in-process inside the backend container, sharing the same Redis instance (REDIS_HOST=redis in compose). They are part of the runtime, so they scale and restart with the app container.

Worker | Queue | Schedule | Purpose
email.worker | emailQueue | On demand | OTP, password reset, meeting notifications via Gmail SMTP
database.worker | databaseQueue | On demand | Heavy DB operations
cleanup.worker | cleanupQueue | Every 24h | Expired slot cleanup
kpi.worker | kpiQueue | Every 15min | Dashboard KPI calculations
resumeAnalysis.worker | resumeAnalysisQueue | On demand | AI resume processing

The Bull Board UI is mounted at /admin/queues for queue monitoring. Real-time fan-out (Socket.IO presence, recording state) also leans on Redis. Infra implication: Redis is a hard dependency for both queueing and websocket coordination; the prod compose marks the app depends_on: redis (service_healthy).
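
The two scheduled rows in the worker table translate into fixed intervals in milliseconds. The sketch below assumes the jobs are registered as BullMQ repeatable jobs through the repeat.every option of queue.add; the page does not confirm how the source schedules them, and the names RepeatSchedule and toRepeatOptions are hypothetical.

```typescript
// Interval constants for the two scheduled queues named in the worker table.
const MINUTE_MS = 60 * 1000;
const HOUR_MS = 60 * MINUTE_MS;

interface RepeatSchedule {
  queue: string;
  everyMs: number;
}

const schedules: RepeatSchedule[] = [
  { queue: "cleanupQueue", everyMs: 24 * HOUR_MS }, // every 24h
  { queue: "kpiQueue", everyMs: 15 * MINUTE_MS },   // every 15min
];

// Options shape for a BullMQ repeatable job: queue.add(name, data, { repeat: { every } }).
// Assumption: the service uses repeatable jobs rather than its own timers.
function toRepeatOptions(schedule: RepeatSchedule): { repeat: { every: number } } {
  return { repeat: { every: schedule.everyMs } };
}
```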

Inbound webhooks that must be able to reach the host: Razorpay payment callbacks, Exotel status callbacks (/api/exotel/*, gated by EXOTEL_WEBHOOK_TOKEN), WhatsApp webhook verification, and Leegality eSign completion. These require BACKEND_PUBLIC_URL to be the publicly resolvable HTTPS origin (https://api.mrmentor.in in production; an ngrok tunnel in local dev per .env).


External integrations

All third-party endpoints are configured via env (sourced from AWS Secrets Manager at deploy). Most integrations self-disable or degrade gracefully when their keys are unset.

Service | Env vars | Failure / fallback
AWS S3 (ap-south-1) | AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION, AWS_S3_BUCKET_NAME, AWS_S3_STUDENT_DOCUMENTS_BUCKET, AWS_S3_BANNER_ASSETS_BUCKET, GLOBAL_INVOICE_S3_BUCKET | S3Service throws on boot if core bucket/creds missing; invoice bucket optional (PDFs emailed but not archived if unset)
Razorpay | RAZORPAY_KEY_ID, RAZORPAY_KEY_SECRET | Payment flows fail closed
Google OAuth / Calendar / Drive | GOOGLE_CLIENT_ID, GOOGLE_CLIENT_SECRET, GOOGLE_REDIRECT_URI, GOOGLE_DRIVE_REDIRECT_URI | Google login/calendar sync disabled
Gmail SMTP | EMAIL_USER, EMAIL_PASS | Email worker errors (535 BadCredentials on stale app password)
Voice AI calling | ELEVENLABS_API_KEY, AARYA_AGENT_ID, AARYA_PHONE_NUMBER_ID, MissOzone config | Voice interview calling disabled
Exotel telephony + SMS | EXOTEL_SID, EXOTEL_API_KEY, EXOTEL_API_TOKEN, EXOTEL_CALLER_ID, EXOTEL_SUBDOMAIN, EXOTEL_WEBHOOK_TOKEN | CRM falls back to tel: links when unset; subdomain must match the account cluster or auth 401s
WhatsApp Cloud API | WHATSAPP_ACCESS_TOKEN, WHATSAPP_PHONE_NUMBER_ID, WHATSAPP_BUSINESS_ACCOUNT_ID, WHATSAPP_WEBHOOK_VERIFY_TOKEN, WHATSAPP_GRAPH_API_VERSION | WhatsApp messaging disabled
LiteLLM gateway | LITELLM_BASE_URL, LITELLM_MASTER_KEY | LLM features fail; gateway is a separate dev EC2 service on mas-network
Graphy LMS | GRAPHY_BASE_URL (mrlearn.in) | Mr. Learn proxy endpoints fail
Judge0 / compiler | JUDGE0_API_URL, COMPILER_API_URL | Code execution disabled
mr-hire-backend | MR_HIRE_BACKEND_URL | Recruitment AI integration unavailable
Leegality eSign | LEEGALITY_SANDBOX_BASE_URL, template envs | PAP eSign flow blocked
class-agent | CLASS_AGENT_URL, CLASS_AGENT_API_KEY | AI classroom proxy fails
CloudFront assets CDN | MAS_ASSETS_CDN_URL (d1ib7yueotbmuk.cloudfront.net) | Asset URLs fall back to direct origin

Frontend / origin env: FRONTEND_URL, MR_HIRE_FRONTEND_URL, MAS_WEBSITE_URL, BACKEND_PUBLIC_URL, and CORS_ALLOWED_ORIGINS (JSON array consumed by both Express CORS in src/app.ts and Socket.IO CORS in src/index.ts).
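
A sketch of how a JSON-array CORS_ALLOWED_ORIGINS value could be parsed and applied, following the rules listed under gotchas (development allows every origin; no-origin and Origin: null requests are always allowed). This is not the code in src/app.ts: parseAllowedOrigins and isOriginAllowed are hypothetical names, and treating malformed JSON as an empty list is an assumption.

```typescript
// Hypothetical helpers mirroring the documented CORS rules; not the code in src/app.ts.
function parseAllowedOrigins(raw: string | undefined): string[] {
  if (!raw) return [];
  try {
    const parsed: unknown = JSON.parse(raw);
    return Array.isArray(parsed)
      ? parsed.filter((o): o is string => typeof o === "string")
      : [];
  } catch {
    return []; // assumption: malformed JSON leaves only the default origins
  }
}

function isOriginAllowed(
  origin: string | undefined,
  nodeEnv: string,
  defaultOrigins: string[],
  extraOrigins: string[]
): boolean {
  // Requests without an Origin header, or with Origin: null, are always allowed.
  if (origin === undefined || origin === "null") return true;
  // Development is permissive.
  if (nodeEnv === "development") return true;
  return defaultOrigins.includes(origin) || extraOrigins.includes(origin);
}
```

Sharing one parsed list between the Express and Socket.IO CORS settings keeps the two in step, which is the point of reading both from the same variable.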


Status lifecycles

The most meaningful operational lifecycle is the active color of the blue-green deployment, tracked by the server-side .current_env file and the nginx upstream.

stateDiagram-v2
    [*] --> Blue
    Blue --> DeployingGreen : deploy.sh pulls image and starts green
    DeployingGreen --> Blue : green health check fails then rollback
    DeployingGreen --> Green : green healthy and nginx switched
    Green --> DeployingBlue : next deploy starts blue
    DeployingBlue --> Green : blue health check fails then rollback
    DeployingBlue --> Blue : blue healthy and nginx switched
    Green --> Green : manual rollback keeps current
    Blue --> Blue : manual rollback keeps current
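
The deploy transitions in the diagram above can be written as a small transition function. This is a hypothetical model for orientation, not code from the repo; the event names are invented for the sketch, and the manual-rollback self-loops are left out.

```typescript
// Hypothetical model of the blue-green lifecycle shown in the state diagram.
type DeployState = "Blue" | "DeployingGreen" | "Green" | "DeployingBlue";
type DeployEvent = "deploy" | "healthy" | "unhealthy";

function nextState(state: DeployState, event: DeployEvent): DeployState {
  switch (state) {
    case "Blue":
      return event === "deploy" ? "DeployingGreen" : "Blue";
    case "Green":
      return event === "deploy" ? "DeployingBlue" : "Green";
    case "DeployingGreen":
      if (event === "healthy") return "Green";  // nginx switched to green
      if (event === "unhealthy") return "Blue"; // rollback keeps blue live
      return state;
    case "DeployingBlue":
      if (event === "healthy") return "Blue";    // nginx switched to blue
      if (event === "unhealthy") return "Green"; // rollback keeps green live
      return state;
  }
}

// The color recorded in .current_env: while a deploy is in flight,
// the previously live color is still the one serving traffic.
function liveColor(state: DeployState): "blue" | "green" {
  return state === "Blue" || state === "DeployingGreen" ? "blue" : "green";
}
```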

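Container restart lifecycle (Docker restart: always + healthcheck). A restart policy on its own only reacts to the container process exiting, so the Unhealthy → Restarting transition assumes an additional watchdog or a crash of the unhealthy process (inferred):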

stateDiagram-v2
    [*] --> Starting
    Starting --> Healthy : /api/health returns 200 after start period
    Starting --> Unhealthy : 3 failed health checks
    Healthy --> Unhealthy : health checks start failing
    Unhealthy --> Restarting : Docker restarts container
    Restarting --> Starting
    Healthy --> Stopped : SIGTERM graceful shutdown
    Stopped --> [*]
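
How the health status moves between Starting, Healthy, and Unhealthy can be modelled as a small function over probe results. The sketch uses the figures given on this page (40s start period, 3 failed checks) and follows Docker's documented rule that failures inside the start period do not count until a probe has succeeded once; the names HealthState and applyProbe are hypothetical.

```typescript
// Hypothetical model of Docker health status tracking.
type Health = "starting" | "healthy" | "unhealthy";

interface HealthState {
  status: Health;
  consecutiveFailures: number;
}

const START_PERIOD_S = 40; // from the cold-start diagram
const RETRIES = 3;         // "3 failed health checks"

function applyProbe(state: HealthState, ok: boolean, secondsSinceStart: number): HealthState {
  if (ok) return { status: "healthy", consecutiveFailures: 0 };
  // Failures inside the start period are ignored while the container is still starting.
  if (state.status === "starting" && secondsSinceStart < START_PERIOD_S) return state;
  const failures = state.consecutiveFailures + 1;
  return {
    status: failures >= RETRIES ? "unhealthy" : state.status,
    consecutiveFailures: failures,
  };
}
```

With a 30s probe interval, a container that was healthy and then starts failing is marked unhealthy on the third consecutive failed probe.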

Edge cases, limits & gotchas

  • PM2 config is largely vestigial. ecosystem.config.js (cluster mode, instances: 'max', PORT: 3000) and deploy.sh describe a PM2-on-bare-metal path. Production today runs the Docker image (CMD ["node", "dist/index.js"]) under blue-green compose, not PM2. The PM2 PORT: 3000 is stale — the container listens on 8000. Treat the PM2 path as legacy/alternate unless a host is explicitly running it. (inferred)
  • Single Node process per container, no clustering. The container runs one Node process (node dist/index.js). The blue-green swap replaces one color with the other; it does not run several replicas at once, so there is no horizontal scaling today. See the next point for why multi-replica is currently unsafe.
  • Terminal relay requires sticky sessions. TerminalRelayWsServer.ts keeps an in-memory pairings map; the CLI and the browser must land on the same process to be wired together. Multi-replica scaling would need Redis pub/sub routing (documented as a v1 limitation in the source comment). Same caveat applies to any in-memory Socket.IO state not backed by Redis.
  • Postgres and Redis are containers, shared across colors. In docker-compose.prod.yml they are defined as services (postgres:16-alpine, redis:7-alpine) with named volumes (postgres_data, redis_data). On the prod host both blue and green app containers connect to the same long-lived mr-mentor-postgres / mr-mentor-redis containers (the post-deploy check docker exec mr-mentor-postgres / mr-mentor-redis confirms fixed container names). The data layer is therefore NOT swapped during a deploy — only the app color is. (inferred from workflow exec names)
  • mas-network is external. The compose network is declared external: true; it must be created out-of-band (docker network create mas-network) so the backend, postgres, redis, and sibling services (LiteLLM gateway, voice agent, mr-hire) can resolve each other by container name.
  • TypeORM auto-sync is ON. No migrations — entity changes auto-apply to the live DB on boot. In production this means a deploy can mutate schema; be cautious with destructive entity edits.
  • Secrets live in AWS Secrets Manager, not the repo. Deploys fetch mr-mentor-backend/{production,development} (ap-south-1) and materialize .env.original (prod, into /home/ubuntu/blue-green-deployment) or .env (dev). The add-env-var skill's older "Pattern A" docs are stale per project memory — env changes must reach Secrets Manager AND be redeployed (or the running container restarted) to take effect; docker compose restart will NOT re-read env_file, only up -d recreate will.
  • GHCR auth on the host. Pulls require docker login ghcr.io with a valid token. Prod logs in as mas-mr-mentor (GHCR_PULL_TOKEN); dev uses the long-lived GHCR_PAT (MAS-intern). A periodic refresh workflow exists as a safety net because deploy workflows can overwrite the host's docker login.
  • Image tagging. build.yml tags ghcr.io/<repo>:<branch>, :<branch>-<shortsha>, plus :latest for main. The production deploy reconstructs the tag as main-<shortsha> and pulls the exact image — so re-running with workflow_dispatch rebuilds the tag from the current main HEAD commit.
  • CORS is permissive in development. src/app.ts allows ALL origins when NODE_ENV === 'development' and always allows no-origin and Origin: null requests. In production, origins must be in defaultOrigins or CORS_ALLOWED_ORIGINS.
  • Body size limit 20mb. express.json / urlencoded are capped at 20mb; large media must go through presigned direct-to-S3 upload, not the JSON body.
  • Health endpoint is load-bearing. Docker healthcheck, nginx switch verification, and both pre/post-deploy SSH steps assert that GET /api/health returns {"success":true}. A change to its shape would break the deploy gate.
  • Non-root container. The runner stage runs as user nodejs (uid 1001); /app/recordings is pre-created and chowned. Any new write path inside the container needs matching ownership.
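
The image tagging scheme from the list above can be sketched as a pure function. This is an illustration, not the logic in build.yml: the helper names are hypothetical, and the 7-character short SHA is an assumption, since the page does not state the length.

```typescript
// Hypothetical helper reproducing the documented tag scheme:
// :<branch>, :<branch>-<shortsha>, plus :latest for main.
function imageTags(repo: string, branch: string, sha: string, shortLen = 7): string[] {
  // Image references must be lowercase.
  const base = `ghcr.io/${repo.toLowerCase()}`;
  const tags = [`${base}:${branch}`, `${base}:${branch}-${sha.slice(0, shortLen)}`];
  if (branch === "main") tags.push(`${base}:latest`);
  return tags;
}

// The production deploy pulls the exact main-<shortsha> image.
function productionImage(repo: string, sha: string): string {
  return imageTags(repo, "main", sha)[1];
}
```

Pinning production to the branch-plus-SHA tag rather than :latest means a re-run deploys whatever commit the tag is rebuilt from, as noted in the image tagging point.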