Deployment Architecture

This document describes how the MAS / Mr. Mentor backend and its two companion frontends are built, packaged, and run in production. It covers the multi-stage Docker build, the Docker Compose service topology (API + PostgreSQL + Redis), the legacy PM2 process model, the blue-green zero-downtime release flow driven by GitHub Actions + Nginx, and the manual deploy scripts. Read this together with cicd-pipelines (the GitHub Actions workflows that build images and trigger deploys) and infrastructure-topology (servers, networks, DNS).

Status: documented from source on this branch.

Overview

The backend (mr-mentor-backend) is a single Node.js 20 / Express service. It is packaged as one OCI image and run as a container behind Nginx on an Ubuntu EC2 host. Two persistent infrastructure containers — PostgreSQL 16 and Redis 7 — run alongside it on a shared Docker network. AWS S3 and all third-party APIs (Razorpay, Google, Gmail SMTP, Exotel, etc.) are external managed services.

There are two deployment styles in the repo, and the codebase is mid-migration between them:

| Style | Where it lives | Status |
| --- | --- | --- |
| PM2 (build on server, run with PM2 cluster) | `deploy.sh`, `ecosystem.config.js`, `CI-CD.md` | Legacy / fallback |
| Docker + GHCR (build image in CI, pull on server) | `Dockerfile`, `docker-compose.prod.yml`, `deploy-docker.sh`, `DOCKER_DEPLOYMENT.md` | Current (dev uses plain compose, prod uses blue-green) |

Two environments exist, fed by two git branches:

| Environment | Branch | Server | Strategy |
| --- | --- | --- | --- |
| Development | `development` | `DEVELOPMENT_SERVER_HOST` | `docker compose -f docker-compose.prod.yml up -d` (recreate) |
| Production | `main` | `PRODUCTION_SERVER_HOST` (`api.myanalyticsschool.com`) | Blue-green with Nginx traffic switch |

Personas who touch this domain: platform/DevOps engineers (own the servers, Nginx, secrets), release engineers (trigger and verify deploys), and on-call engineers (roll back). End users never interact with deployment machinery directly.

Where the backend sits in the suite: it is the hub. mas-website-live (:8088) and mr-mentor-frontend (:3000) call it over HTTP/WebSocket; mr-hire-backend is reachable from the backend container over the shared Docker network at MR_HIRE_BACKEND_URL.

Key concepts & entities

This is an operations domain, so the "entities" are build artifacts and runtime objects rather than TypeORM tables.

| Term | Meaning |
| --- | --- |
| Multi-stage build | `Dockerfile` has 4 stages: `builder` (Bun + esbuild bundle), `deps` (npm production node_modules for native `bcrypt`), `runner` (slim Node 20 runtime, non-root), `prod` (alias of `runner` used by compose `target: prod`). |
| esbuild bundle | `npm run build` bundles `src/index.ts` to a single `dist/index.js`, externalizing `bcrypt` and `module-alias/register`. See `package.json`. |
| GHCR | GitHub Container Registry. Images are pushed to `ghcr.io/<owner>/mr-mentor-backend:<tag>`. |
| Image tags | `<branch>`, `<branch>-<short-sha>`, plus `latest` for `main`. Generated in `.github/workflows/build.yml`. |
| `mas-network` | External Docker network shared by app + postgres + redis (+ mr-hire-backend). Must be created with `docker network create mas-network` before first deploy. |
| Blue-green | Two identical containers (`-blue` / `-green`) on different host ports; Nginx points at the active one. New release goes to the idle color, is health-checked, then traffic is switched. The old container is stopped, not removed, for instant rollback. |
| `.current_env` | A file on the prod server (`/home/ubuntu/blue-green-deployment/.current_env`) holding `blue` or `green`. It is the source of truth for which color is live. |
| PM2 ecosystem | `ecosystem.config.js`: cluster mode, `instances: 'max'`, `max_memory_restart: 1G`, auto-restart. Legacy path. |
| Healthcheck | `GET /api/health` must return 200 (and JSON `"success":true` in the prod verification step). Baked into the image `HEALTHCHECK`. |
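To make the role of `.current_env` concrete, the following is a minimal sketch of how a script could derive the idle color from that file. The function name `idle_color` and the fallback to "blue is live" when the file is missing are assumptions for illustration; the real server-side scripts are not in this repo and may behave differently.

```shell
# Hypothetical helper: print the idle color for a given state file.
# Assumption: a missing or empty file is treated as "blue is live".
idle_color() (
  state_file=$1
  current=blue
  if [ -s "$state_file" ]; then
    # Strip whitespace/newlines so "blue\n" compares cleanly.
    current=$(tr -d '[:space:]' < "$state_file")
  fi
  case "$current" in
    blue)  echo green ;;
    green) echo blue ;;
    *)     echo "unexpected value in $state_file: $current" >&2; exit 1 ;;
  esac
)

# Usage on the prod server (path as documented in the table above):
#   idle_color /home/ubuntu/blue-green-deployment/.current_env
```

Rejecting any value other than `blue` or `green` keeps a corrupted state file from silently selecting the wrong target.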

Source files of record:

- Dockerfile, Dockerfile.dev, .dockerignore
- docker-compose.yml (local build), docker-compose.dev.yml (local hot-reload), docker-compose.prod.yml (GHCR image)
- deploy.sh (PM2), deploy-docker.sh (compose on server), ecosystem.config.js
- .github/workflows/build.yml, deploy-development.yml, deploy-production.yml
- Server-side blue-green scripts: /home/ubuntu/blue-green-deployment/{deploy.sh,rollback.sh,health-check.sh} (not in this repo; analogous scripts for the frontend live in mr-mentor-frontend/deploy/blue-green/)

Architecture

Runtime topology (production)

flowchart TD
    subgraph Internet["Public Internet"]
        Browser["Browsers / Frontend apps"]
    end

    subgraph EC2["Ubuntu EC2 host (production)"]
        Nginx["Nginx reverse proxy<br/>api.myanalyticsschool.com<br/>TLS termination"]

        subgraph BG["Blue-Green pair (Docker)"]
            Blue["mr-mentor-backend-blue<br/>host :8000"]
            Green["mr-mentor-backend-green<br/>host :8001 (idle)"]
        end

        PG["mr-mentor-postgres<br/>postgres:16-alpine<br/>volume postgres_data"]
        RD["mr-mentor-redis<br/>redis:7-alpine<br/>volume redis_data"]
        Net["Docker network: mas-network"]
    end

    subgraph AWS["AWS managed services"]
        S3["S3 buckets<br/>recordings / documents / banners"]
        SM["Secrets Manager<br/>mr-mentor-backend/production"]
    end

    subgraph Ext["External APIs"]
        RZP["Razorpay"]
        GOOG["Google OAuth / Calendar"]
        SMTP["Gmail SMTP"]
        EXO["Exotel"]
        HIRE["mr-hire-backend"]
    end

    Browser -->|HTTPS / WSS| Nginx
    Nginx -->|"proxy_pass active color"| Blue
    Blue --> Net
    Green --> Net
    Net --> PG
    Net --> RD
    Net -->|"http internal"| HIRE
    Blue --> S3
    Blue --> RZP
    Blue --> GOOG
    Blue --> SMTP
    Blue --> EXO
    SM -->|"fetched in CI, written as .env"| EC2

Build and release pipeline

flowchart LR
    Dev["Developer push<br/>to main or development"] --> GH["GitHub Actions<br/>build.yml"]

    subgraph Build["Build and Push Docker Image"]
        BX["Docker Buildx<br/>multi-stage build"]
        S1["Stage builder<br/>Bun install + esbuild"]
        S2["Stage deps<br/>npm prod node_modules"]
        S3b["Stage runner / prod<br/>Node 20 non-root"]
        BX --> S1 --> S2 --> S3b
    end

    GH --> Build
    Build -->|"push tags branch, branch-sha, latest"| GHCR["GHCR<br/>ghcr.io/owner/mr-mentor-backend"]

    GHCR --> DepDev["deploy-development.yml<br/>compose up -d"]
    GHCR --> DepProd["deploy-production.yml<br/>blue-green deploy.sh"]

    DepDev --> DevServer["Dev server container"]
    DepProd --> ProdServer["Prod blue-green + Nginx switch"]

Note: the multi-stage build deliberately uses two package managers. Bun does the fast install and esbuild bundling in builder, but production node_modules are installed with npm in the deps stage so the native bcrypt prebuild resolves against the same Node 20 ABI used at runtime (node dist/index.js). The builder stage's Bun node_modules are discarded.

Data model

Deployment has no TypeORM entities. The "data model" here is the relationship between build stages, images, and runtime services.

erDiagram
    DOCKERFILE ||--|{ STAGE : "defines"
    STAGE ||--o| IMAGE : "produces"
    IMAGE ||--o{ TAG : "published as"
    IMAGE ||--|| CONTAINER_APP : "runs as"
    COMPOSE_FILE ||--|{ SERVICE : "declares"
    SERVICE ||--o| CONTAINER_APP : "app"
    SERVICE ||--o| CONTAINER_PG : "postgres"
    SERVICE ||--o| CONTAINER_RD : "redis"
    CONTAINER_PG ||--|| VOLUME_PG : "persists to"
    CONTAINER_RD ||--|| VOLUME_RD : "persists to"
    NETWORK ||--o{ SERVICE : "connects"

    DOCKERFILE {
        string path "Dockerfile"
        int stages "4 builder deps runner prod"
    }
    STAGE {
        string name "builder deps runner prod"
        string base "oven-bun-1-alpine or node-20-alpine"
    }
    IMAGE {
        string registry "ghcr.io"
        string target "prod"
    }
    TAG {
        string branch
        string branch_sha
        string latest "main only"
    }
    SERVICE {
        string name "app postgres redis"
        bool healthcheck
    }
    VOLUME_PG {
        string name "postgres_data"
    }
    VOLUME_RD {
        string name "redis_data"
    }
    NETWORK {
        string name "mas-network"
        bool external "true"
    }

API surface

Deployment exposes no business API. The only HTTP surface relevant to deployment is the health endpoint, used by the image HEALTHCHECK, the compose healthcheck, the blue-green smoke test, and the post-deploy verification step.

| Method | Path | Auth/role | Purpose |
| --- | --- | --- | --- |
| GET | `/api/health` | none (public) | Liveness/readiness probe. Returns 200 with JSON `success:true` when the app is up. Used by Docker healthcheck, blue-green `deploy.sh` smoke test, and `deploy-production.yml` post-deploy verification. |
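The health contract above (HTTP 200 plus `success:true` in the JSON body) can be expressed as a small check. This is a sketch only: `health_ok` is a hypothetical helper, and matching the body with `grep` rather than a JSON parser is a simplification chosen to avoid extra dependencies.

```shell
# Hypothetical helper: health_ok STATUS BODY
# Exits 0 only when the status is 200 and the body contains "success": true.
health_ok() (
  status=$1
  body=$2
  [ "$status" = "200" ] || exit 1
  # Tolerate optional whitespace around the colon in the JSON body.
  printf '%s' "$body" | grep -Eq '"success"[[:space:]]*:[[:space:]]*true'
)

# Usage from the host (assumes curl is installed and the app listens on 8000):
#   status=$(curl -s -o /tmp/health.json -w '%{http_code}' http://localhost:8000/api/health)
#   health_ok "$status" "$(cat /tmp/health.json)" && echo "healthy"
```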

Operational management is done over SSH and docker compose / docker commands, not HTTP. The Bull Board queue UI (/admin/queues) and the rest of the API are documented in their own feature docs.

User journeys

The "users" here are CI and operators. Each journey is an end-to-end deployment flow.

Journey 1: Build and push image (every push to a deploy branch)

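A push to main, development, or staging triggers build.yml. It runs the multi-stage build using the BuildKit GitHub Actions cache (cache-from gha) and pushes the tagged images to GHCR. Only main and development have deploy workflows described in this document.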

sequenceDiagram
    participant Dev as Developer
    participant GH as GitHub Actions
    participant BK as Docker Buildx
    participant GHCR as GHCR Registry

    Dev->>GH: push to main or development or staging
    GH->>GH: checkout code
    GH->>GH: compute tags branch and branch-sha and latest
    GH->>BK: build with target prod and cache-from gha
    BK->>BK: stage builder runs bun install then esbuild bundle
    BK->>BK: stage deps runs npm install omit dev for bcrypt
    BK->>BK: stage runner copies dist and node_modules as non-root
    BK->>GHCR: push image tags
    GHCR-->>GH: digest
    GH-->>Dev: build succeeded, triggers deploy workflow

Key facts (from build.yml): concurrency: docker-build with cancel-in-progress so only the latest build runs; auth uses the built-in GITHUB_TOKEN; OCI labels record source, revision, and version.
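The tagging rule (`<branch>`, `<branch>-<short-sha>`, plus `latest` on `main`) can be sketched as a shell function. `image_tags` is a hypothetical name, and the 7-character short SHA is an assumption; `build.yml` remains the source of truth for the actual tag format.

```shell
# Hypothetical helper: print one image tag per line for a branch and commit.
# Assumption: the short SHA is the first 7 characters of the full SHA.
image_tags() (
  branch=$1
  sha=$2
  short_sha=$(printf '%s' "$sha" | cut -c1-7)
  echo "$branch"
  echo "$branch-$short_sha"
  # Only the production branch also moves the floating "latest" tag.
  if [ "$branch" = "main" ]; then
    echo "latest"
  fi
)

# Usage:
#   image_tags development "$(git rev-parse HEAD)"
```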

Journey 2: Development deploy (recreate strategy)

When the build for development succeeds, deploy-development.yml fires via workflow_run. It pulls env from AWS Secrets Manager, copies it plus docker-compose.prod.yml to the dev server, and recreates the stack. There is a brief downtime window during recreate (acceptable for dev).

sequenceDiagram
    participant BW as build.yml
    participant DW as deploy-development.yml
    participant SM as AWS Secrets Manager
    participant SRV as Dev server
    participant DK as Docker on server

    BW-->>DW: workflow_run success on development
    DW->>SM: get-secret-value mr-mentor-backend development
    SM-->>DW: secret JSON
    DW->>DW: jq to-entries builds .env file
    DW->>SRV: scp docker-compose.prod.yml and .env
    DW->>SRV: ssh into server
    SRV->>DK: docker login ghcr.io
    SRV->>DK: docker compose -f docker-compose.prod.yml pull
    SRV->>DK: docker compose -f docker-compose.prod.yml up -d
    DK->>DK: postgres and redis healthchecks pass first
    DK->>DK: app starts, waits depends_on healthy
    DK-->>SRV: containers running
    SRV-->>DW: deploy complete
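The "jq to-entries builds .env" step could look roughly like the sketch below. The function name `secret_to_env` is hypothetical, and the exact jq program used in the workflow is not shown in this document; values are written unquoted, so multi-line secrets would need extra handling.

```shell
# Hypothetical helper: read a Secrets Manager JSON object on stdin and
# print KEY=value lines. Requires jq. Rejects non-object payloads.
secret_to_env() (
  payload=$(cat)
  printf '%s' "$payload" | jq -e 'type == "object"' >/dev/null || {
    echo "secret payload is not a JSON object" >&2
    exit 1
  }
  printf '%s' "$payload" | jq -r 'to_entries[] | "\(.key)=\(.value)"'
)

# Usage in CI (secret id as named in this document):
#   aws secretsmanager get-secret-value \
#     --secret-id mr-mentor-backend/development \
#     --query SecretString --output text | secret_to_env > .env
```

Running the type guard before writing anything means a malformed secret fails the job before any file reaches the server.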

Journey 3: Production blue-green deploy (zero downtime)

When the build for main succeeds, deploy-production.yml fires. It runs pre-checks, then calls the server-side deploy.sh which deploys to the idle color, health-checks it, switches Nginx, and keeps the old color stopped for rollback.

sequenceDiagram
    participant DW as deploy-production.yml
    participant SM as AWS Secrets Manager
    participant SRV as Prod server
    participant DS as deploy.sh on server
    participant NG as Nginx
    participant OLD as Old color container
    participant NEW as New color container

    DW->>SM: get-secret-value mr-mentor-backend production
    SM-->>DW: secret JSON
    DW->>DW: jq builds .env.original
    DW->>SRV: scp .env.original to blue-green-deployment
    DW->>SRV: ssh pre-deploy health check
    SRV->>SRV: verify nginx running and postgres pg_isready and redis ping
    DW->>SRV: docker login ghcr.io as mas-mr-mentor
    DW->>DS: run deploy.sh with image url
    DS->>DS: read .current_env to pick idle target color
    DS->>NEW: docker run target color on idle port
    DS->>NEW: poll docker health until healthy
    DS->>NEW: curl smoke test on idle port expects 200
    DS->>NG: sed switch proxy_pass to target port then nginx reload
    DS->>OLD: docker stop old container kept for rollback
    DS->>DS: write target color to .current_env
    DS-->>DW: success
    DW->>SRV: post-deploy verify health and db user count and redis ping
    SRV-->>DW: all checks pass
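Because the server-side `deploy.sh` is not in this repo, the following is only a sketch of the sequence in the diagram. It prints the commands for one release instead of running them. Container names and host ports come from the topology diagram; the `--env-file` path, the container port 8000 mapping, and the `sed` pattern are assumptions.

```shell
# Sketch only: prints the plan for one blue-green release.
# Usage: bluegreen_plan <live-color> <image>
# Assumption: any stale container with the target name was removed beforehand.
bluegreen_plan() (
  live=$1
  image=$2
  case "$live" in
    blue)  target=green; target_port=8001; live_port=8000 ;;
    green) target=blue;  target_port=8000; live_port=8001 ;;
    *)     echo "live color must be blue or green" >&2; exit 1 ;;
  esac
  new="mr-mentor-backend-$target"
  old="mr-mentor-backend-$live"
  site=/etc/nginx/sites-available/api.myanalyticsschool.com
  # 1. Start the idle color on its own host port.
  echo "docker run -d --name $new --network mas-network --env-file .env -p $target_port:8000 $image"
  # 2. Wait for Docker health, then smoke-test the idle port directly.
  echo "docker inspect --format '{{.State.Health.Status}}' $new"
  echo "curl -fsS http://localhost:$target_port/api/health"
  # 3. Point Nginx at the new port, validate, reload.
  echo "sed -i 's/localhost:$live_port/localhost:$target_port/' $site"
  echo "nginx -t && systemctl reload nginx"
  # 4. Stop (do not remove) the old color and record the new live color.
  echo "docker stop $old"
  echo "echo $target > .current_env"
)
```

Calling `bluegreen_plan blue ghcr.io/<owner>/mr-mentor-backend:main` lists the steps for deploying to green; the ordering matters because traffic is only switched after both the Docker health state and the direct smoke test pass.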

Journey 4: Failed deploy with automatic rollback

If deploy.sh returns non-zero (the new color never becomes healthy, or the smoke test fails), the workflow dumps the new container's logs and invokes rollback.sh. Because the old color is never removed (it is at most stopped), rollback only has to start the old container if needed and point Nginx back at it.

sequenceDiagram
    participant DW as deploy-production.yml
    participant DS as deploy.sh
    participant RB as rollback.sh
    participant NG as Nginx
    participant OLD as Old color
    participant NEW as New color

    DW->>DS: run deploy.sh with image url
    DS->>NEW: start new color and wait healthy
    NEW-->>DS: stays unhealthy or smoke test fails
    DS-->>DW: non-zero exit code
    DW->>NEW: docker logs tail 50 for diagnosis
    DW->>RB: run rollback.sh
    RB->>OLD: docker start old color
    RB->>NG: switch proxy_pass back to old port then reload
    RB-->>DW: traffic restored to previous version
    DW-->>DW: job marked failed for investigation
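The failure handling in this journey reduces to a small piece of control flow. The sketch below takes the three steps as function names so it can be shown without a real server; `deploy_with_rollback` and its arguments are hypothetical and do not reproduce the actual workflow YAML.

```shell
# Sketch: run a deploy step; on failure collect logs, roll back, and still fail.
# Usage: deploy_with_rollback <deploy-fn> <dump-logs-fn> <rollback-fn>
deploy_with_rollback() (
  deploy=$1
  dump_logs=$2
  rollback=$3
  if "$deploy"; then
    echo "deploy succeeded"
    exit 0
  fi
  echo "deploy failed, collecting logs and rolling back" >&2
  # Log collection is best effort and must not block the rollback.
  "$dump_logs" || true
  "$rollback" || echo "rollback failed, manual intervention needed" >&2
  # Exit non-zero so the job is marked failed for investigation.
  exit 1
)

# Example wiring (IMAGE_URL and NEW_CONTAINER are hypothetical variables):
#   deploy()    { /home/ubuntu/blue-green-deployment/deploy.sh "$IMAGE_URL"; }
#   dump_logs() { docker logs --tail 50 "$NEW_CONTAINER"; }
#   rollback()  { /home/ubuntu/blue-green-deployment/rollback.sh; }
#   deploy_with_rollback deploy dump_logs rollback
```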

Journey 5: Manual deploy via deploy-docker.sh

For ad-hoc server-side deploys (no CI), an operator runs deploy-docker.sh. It logs into GHCR, stops the stack, pulls, and brings it back up with docker-compose.prod.yml. This is the simpler recreate path, not blue-green.

sequenceDiagram
    participant Op as Operator
    participant SH as deploy-docker.sh
    participant DK as Docker

    Op->>SH: IMAGE_TAG and GHCR_REPOSITORY set then run script
    SH->>SH: verify docker and docker compose installed
    SH->>DK: docker login ghcr.io if token provided
    SH->>DK: docker compose -f docker-compose.prod.yml down
    SH->>DK: docker compose -f docker-compose.prod.yml pull
    SH->>DK: docker compose -f docker-compose.prod.yml up -d
    SH->>SH: sleep 15 then show ps and logs tail 50
    SH->>Op: prompt to prune old images
    SH-->>Op: deployment complete on configured port
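Since the manual path depends on `IMAGE_TAG` and `GHCR_REPOSITORY` being set, an operator may want a pre-flight check before invoking the script. The helper below is a hypothetical sketch and is not part of `deploy-docker.sh`.

```shell
# Hypothetical pre-flight check: fail early when a required variable is
# unset or empty, instead of letting compose resolve an invalid image name.
require_vars() (
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable named in $name (names are trusted input).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  exit "$missing"
)

# Usage before a manual deploy:
#   require_vars IMAGE_TAG GHCR_REPOSITORY && ./deploy-docker.sh
```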

Journey 6: Legacy PM2 deploy

The original flow, still present as deploy.sh + npm run deploy. It builds on the server and runs the bundle under PM2 cluster mode. Documented in CI-CD.md.

sequenceDiagram
    participant Op as Operator or CI
    participant SH as deploy.sh
    participant PM as PM2

    Op->>SH: run deploy.sh
    SH->>SH: tar backup of existing dist
    SH->>SH: git pull current branch
    SH->>SH: bun install production
    SH->>SH: bun run build esbuild to dist
    SH->>PM: pm2 restart mr-mentor-backend if exists
    PM->>PM: cluster mode instances max
    SH->>PM: pm2 save
    SH-->>Op: pm2 status printed

Background jobs & async

Deployment does not own BullMQ queues, but operators must know they exist because they affect restart behavior:

- The app container starts 5 BullMQ workers in-process (email, database, cleanup, kpi, resumeAnalysis) plus scheduled jobs (cleanup every 24h, KPI every 15min). See the backend startup sequence in the project guide.
- Restart caveat: BullMQ jobs that were in flight do not always auto-retry across a container recreate. After a deploy, stuck jobs may need a manual nudge (npm run queue:clear flushes all queues; use with care). The KPI/cleanup schedulers re-register on boot.
- Socket.IO: the app serves WebSocket traffic for meetings. During a blue-green switch, existing WebSocket connections on the old color are dropped when it is stopped; clients reconnect to the new color via Nginx. Plan production deploys outside live-meeting windows where possible.
- Bull Board queue-monitoring UI is mounted at /admin/queues and is reachable through the same Nginx proxy.

No deployment-specific webhooks exist; CI is triggered by push and workflow_run events, not inbound webhooks.

External integrations

| Integration | Used by deployment for | Env / secret | Failure / fallback |
| --- | --- | --- | --- |
| GHCR (`ghcr.io`) | Image registry; CI pushes, servers pull | `GITHUB_TOKEN` (CI push), `GHCR_PULL_TOKEN` + user `mas-mr-mentor` (prod pull), `GHCR_USERNAME`/`GHCR_TOKEN` (manual) | Pull failure aborts deploy; old container keeps serving. |
| AWS Secrets Manager | Source of truth for `.env`. CI fetches `mr-mentor-backend/{development,production}` and writes a flat `.env` via `jq` | `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION` (default `ap-south-1`) | Secret missing/malformed (`jq -e 'type==object'` guard) fails the deploy before touching the server. |
| SSH (appleboy actions) | Copy files and run remote scripts | `_SERVER_HOST`, `_SERVER_USER`, `_SSH_KEY`, `_SSH_PORT` | SSH failure aborts the workflow step. |
| Nginx | TLS termination + traffic switch between colors | `/etc/nginx/sites-available/api.myanalyticsschool.com` | `nginx -t` validates config before `systemctl reload`; a bad config blocks the switch. |
| PostgreSQL 16 | Stateful DB container | `DB_*` env; volume `postgres_data` | `pg_isready` healthcheck gates app start; verified pre/post deploy. |
| Redis 7 | Cache + BullMQ broker | `REDIS_*` env; volume `redis_data` | `redis-cli ping` healthcheck; verified pre/post deploy. |
| AWS S3 | Recordings, documents, banner assets | `AWS_S3_*` buckets | App-level; not gated by deploy. |
| mr-hire-backend | AI services over internal network | `MR_HIRE_BACKEND_URL` (container DNS on `mas-network`) | Optional at boot; affects only Mr. Hire features. |
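The Nginx traffic switch listed above can be illustrated with a small function that rewrites the upstream port in a site file. This is a sketch under a stated assumption: the site file contains a line of the form `proxy_pass http://localhost:<port>;`. The real config may use `127.0.0.1` or an `upstream` block, in which case the pattern would need to change. `switch_proxy_port` is a hypothetical name.

```shell
# Sketch: rewrite the proxy_pass port in an Nginx site file in place.
# Assumption: the file has a line like "proxy_pass http://localhost:8000;".
switch_proxy_port() (
  conf=$1
  new_port=$2
  tmp=$(mktemp)
  # Write to a temp file first, then copy back so the original keeps its
  # permissions and is untouched if sed fails.
  sed -E "s#(proxy_pass[[:space:]]+http://localhost:)[0-9]+#\1$new_port#" "$conf" > "$tmp" &&
    cat "$tmp" > "$conf"
  rm -f "$tmp"
  # Succeed only if the file now points at the requested port.
  grep -q "proxy_pass[[:space:]]*http://localhost:$new_port" "$conf"
)

# On the server the switch would be followed by validation and reload:
#   switch_proxy_port /etc/nginx/sites-available/api.myanalyticsschool.com 8001 \
#     && sudo nginx -t && sudo systemctl reload nginx
```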

Feature flags / toggles relevant at deploy time (from compose files): ENABLE_SEEDING (prod compose sets true to seed colleges/batches on first boot), USE_DIRECT_S3_UPLOAD, ALLOW_EARLY_MEETING_JOIN, MEETING_JOIN_BUFFER_MINUTES, TOKEN_VALUE.

Status lifecycles

Blue-green active color

The live color is tracked in .current_env. Each successful deploy flips it; rollback flips it back.

stateDiagram-v2
    [*] --> Blue
    Blue --> DeployingGreen : deploy.sh picks idle green
    DeployingGreen --> Green : green healthy and nginx switched
    DeployingGreen --> Blue : green unhealthy, rollback
    Green --> DeployingBlue : next deploy picks idle blue
    DeployingBlue --> Blue : blue healthy and nginx switched
    DeployingBlue --> Green : blue unhealthy, rollback

Container health (Docker HEALTHCHECK)

Every app container moves through Docker's health states; the deploy script waits up to ~80s (40 retries x 2s) for healthy before switching traffic.

stateDiagram-v2
    [*] --> starting
    starting --> healthy : /api/health returns 200 within start-period
    starting --> unhealthy : retries exhausted
    healthy --> unhealthy : 3 consecutive failed probes
    unhealthy --> healthy : probe recovers
    unhealthy --> [*] : deploy aborts and dumps logs
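The bounded wait described above (40 retries, 2 seconds apart, roughly 80 seconds) can be sketched as a generic polling loop. `wait_healthy` is a hypothetical helper; the probe is passed in as a function name so the loop itself does not depend on Docker.

```shell
# Sketch of a bounded health wait.
# Usage: wait_healthy <probe-fn> [retries] [interval-seconds]
# The probe must exit 0 when the container is healthy. Defaults: 40 x 2s.
wait_healthy() (
  probe=$1
  retries=${2:-40}
  interval=${3:-2}
  attempt=1
  while [ "$attempt" -le "$retries" ]; do
    if "$probe"; then
      echo "healthy after $attempt attempt(s)"
      exit 0
    fi
    # Sleep only between attempts, not after the final failure.
    if [ "$attempt" -lt "$retries" ]; then
      sleep "$interval"
    fi
    attempt=$((attempt + 1))
  done
  echo "still unhealthy after $retries attempts" >&2
  exit 1
)

# Example probe for a real container (name taken from the topology diagram):
#   probe() { [ "$(docker inspect --format '{{.State.Health.Status}}' mr-mentor-backend-green)" = "healthy" ]; }
#   wait_healthy probe
```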

Edge cases, limits & gotchas

- mas-network is external. All three compose files declare networks: mas-network: external: true (dev compose uses app-network for its own services but still declares mas-network external). The network must exist (docker network create mas-network) before the first up, or compose fails. This shared network is how the backend reaches mr-hire-backend by container name.
- Two package managers by design. Do not "simplify" the Dockerfile to a single Bun install. The deps stage uses npm specifically so bcrypt native prebuilds match the Node 20 runtime ABI. Bun-installed bcrypt from the builder stage is intentionally discarded.
- PORT mismatch in PM2 config. ecosystem.config.js sets PORT: 3000, but the Docker image, compose files, and Nginx all use 8000. The CI-CD doc's troubleshooting text also says "port 3000". Treat 8000 as authoritative for the containerized backend; the PM2 path is legacy. (Noted discrepancy, not a runtime bug in the Docker flow.)
- MR_HIRE_BACKEND_URL default differs across files. docker-compose.yml / .dev.yml default to http://mr-hire-backend:8001, but docker-compose.prod.yml defaults to http://mr-hire-backend:8000. Set it explicitly via Secrets Manager to avoid relying on the default. (inferred risk)
- Dev deploy has a downtime blip. deploy-development.yml uses compose up -d recreate, not blue-green. Acceptable for dev; never use this path for production.
- Old container is stopped, not removed. Blue-green rollback depends on the previous color still existing (stopped). A docker container prune between deploys would destroy the rollback target, so prune only old images, and only after a deploy is confirmed good.
- Secrets are written as a flat .env on the server. CI converts the Secrets Manager JSON object to KEY=value lines with jq and copies it over with scp. A non-object secret payload is rejected by the jq -e 'type == "object"' guard before deploy. The env file lives at /home/ubuntu/blue-green-deployment/.env(.original) (prod) and /home/ubuntu/mr-mentor-backend/.env (dev).
- docker-compose.prod.yml requires GHCR_REPOSITORY and IMAGE_TAG. If they are not set, the image: line resolves to an invalid reference and up/pull fails. The deploy scripts/workflows export these.
- Healthcheck without curl. The image and compose healthchecks shell out to node -e HTTP probes because the Alpine/Bun base images ship no curl (and only BusyBox wget). Keep this in mind when editing healthcheck commands.
- Seeding on prod boot. ENABLE_SEEDING=true in docker-compose.prod.yml means the app seeds colleges/batches if those tables are empty. With TypeORM synchronize: true also on, entity changes auto-apply to the prod DB on deploy, so review entity changes carefully before shipping.
- Logs are capped. Prod compose sets json-file logging with max-size: 10m, max-file: 3 per service; deeper history must come from external log shipping (not configured here).
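For the first gotcha, an idempotent network check avoids both the "network not found" failure on first deploy and the "already exists" error on later ones. The sketch below wraps the two real Docker commands (`docker network inspect` and `docker network create`); the `DOCKER` override variable and the function name `ensure_network` are hypothetical and exist only so the logic can be exercised without a Docker daemon.

```shell
# Sketch: create the shared network only when it does not exist yet.
# DOCKER is a hypothetical override hook (defaults to the real docker CLI).
ensure_network() (
  name=${1:-mas-network}
  docker_cmd=${DOCKER:-docker}
  if "$docker_cmd" network inspect "$name" >/dev/null 2>&1; then
    echo "network $name already exists"
  else
    "$docker_cmd" network create "$name" >/dev/null
    echo "network $name created"
  fi
)

# Usage before the first compose up on a new host:
#   ensure_network mas-network
```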

Companion frontends (deployed alongside the backend)

Both Next.js frontends are built and published to GHCR in the same way as the backend; mr-mentor-frontend additionally uses the blue-green pattern:

| Repo | Image base | Port | Runtime | Strategy | Nginx host |
| --- | --- | --- | --- | --- | --- |
| `mr-mentor-frontend` | `oven/bun:1`, Next.js standalone (`server.js`) | 3000 (blue) / 3001 (green) | `node server.js` as non-root `bun` | Blue-green (`deploy/blue-green/{deploy,rollback,health-check}.sh`, network `frontend-network`) | `mrmentor.in` |
| `mas-website-live` | `node:20-alpine`, Next.js standalone | 8088 | `node server.js` | Single container (`docker-compose-prod.yml`) | (public site) |

Both bake NEXT_PUBLIC_* values as Docker build ARGs (they must be present at build time, not just runtime), so the build workflows pass them as --build-arg from GitHub secrets. The frontend blue-green deploy.sh mirrors the backend's: pick idle color, docker run on the idle port, poll Docker health, curl smoke test, sed the Nginx upstream, reload, then stop (not remove) the old color. See mr-mentor-frontend/deploy/blue-green/deploy.sh and mas-website-live/Dockerfile.

Related documents:

- cicd-pipelines: the GitHub Actions workflows (build.yml, deploy-development.yml, deploy-production.yml) that drive this deployment.
- infrastructure-topology: servers, Nginx, DNS, networks, AWS account layout.
- ../architecture/system-overview.md: where the backend sits among the frontends and mr-hire-backend.
- ../architecture/background-jobs-and-queues.md: BullMQ workers affected by restarts.