Deployment Architecture
This document describes how the MAS / Mr. Mentor backend and its two companion frontends are built, packaged, and run in production. It covers the multi-stage Docker build, the Docker Compose service topology (API + PostgreSQL + Redis), the legacy PM2 process model, the blue-green zero-downtime release flow driven by GitHub Actions + Nginx, and the manual deploy scripts. Read this together with cicd-pipelines (the GitHub Actions that build and trigger deploys) and infrastructure-topology (servers, networks, DNS).
Status: documented from source on this branch.
Overview
The backend (mr-mentor-backend) is a single Node.js 20 / Express service. It is packaged as one
OCI image and run as a container behind Nginx on an Ubuntu EC2 host. Two persistent infrastructure
containers — PostgreSQL 16 and Redis 7 — run alongside it on a shared Docker network. AWS S3 and
all third-party APIs (Razorpay, Google, Gmail SMTP, Exotel, etc.) are external managed services.
There are two deployment styles in the repo, and the codebase is mid-migration between them:
| Style | Where it lives | Status |
|---|---|---|
| PM2 (build on server, run with PM2 cluster) | `deploy.sh`, `ecosystem.config.js`, `CI-CD.md` | Legacy / fallback |
| Docker + GHCR (build image in CI, pull on server) | `Dockerfile`, `docker-compose.prod.yml`, `deploy-docker.sh`, `DOCKER_DEPLOYMENT.md` | Current (dev uses plain compose, prod uses blue-green) |
Two environments exist, fed by two git branches:
| Environment | Branch | Server | Strategy |
|---|---|---|---|
| Development | `development` | `DEVELOPMENT_SERVER_HOST` | `docker compose -f docker-compose.prod.yml up -d` (recreate) |
| Production | `main` | `PRODUCTION_SERVER_HOST` (api.myanalyticsschool.com) | Blue-green with Nginx traffic switch |
Personas who touch this domain: platform/DevOps engineers (own the servers, Nginx, secrets), release engineers (trigger and verify deploys), and on-call engineers (roll back). End users never interact with deployment machinery directly.
Where the backend sits in the suite: it is the hub. mas-website-live (:8088) and
mr-mentor-frontend (:3000) call it over HTTP/WebSocket; mr-hire-backend is reachable from the
backend container over the shared Docker network at MR_HIRE_BACKEND_URL.
Key concepts & entities
This is an operations domain, so the "entities" are build artifacts and runtime objects rather than TypeORM tables.
| Term | Meaning |
|---|---|
| Multi-stage build | Dockerfile has 4 stages: builder (Bun + esbuild bundle), deps (npm production node_modules for native bcrypt), runner (slim Node 20 runtime, non-root), prod (alias of runner used by compose target: prod). |
| esbuild bundle | npm run build bundles src/index.ts to a single dist/index.js, externalizing bcrypt and module-alias/register. See package.json. |
| GHCR | GitHub Container Registry. Images are pushed to ghcr.io/<owner>/mr-mentor-backend:<tag>. |
| Image tags | <branch>, <branch>-<short-sha>, plus latest for main. Generated in .github/workflows/build.yml. |
| `mas-network` | External Docker network shared by app + postgres + redis (+ mr-hire-backend). Must be created with `docker network create mas-network` before first deploy. |
| Blue-green | Two identical containers (-blue / -green) on different host ports; Nginx points at the active one. New release goes to the idle color, is health-checked, then traffic is switched. The old container is stopped, not removed, for instant rollback. |
| `.current_env` | A file on the prod server (`/home/ubuntu/blue-green-deployment/.current_env`) holding `blue` or `green`; this is the source of truth for which color is live. |
| PM2 ecosystem | ecosystem.config.js — cluster mode, instances: 'max', max_memory_restart: 1G, auto-restart. Legacy path. |
| Healthcheck | GET /api/health must return 200 (and JSON "success":true in the prod verification step). Baked into the image HEALTHCHECK. |
Source files of record:
- `Dockerfile`, `Dockerfile.dev`, `.dockerignore`
- `docker-compose.yml` (local build), `docker-compose.dev.yml` (local hot-reload), `docker-compose.prod.yml` (GHCR image)
- `deploy.sh` (PM2), `deploy-docker.sh` (compose on server), `ecosystem.config.js`
- `.github/workflows/build.yml`, `deploy-development.yml`, `deploy-production.yml`
- Server-side blue-green scripts: `/home/ubuntu/blue-green-deployment/{deploy.sh,rollback.sh,health-check.sh}` (not in this repo; analogous scripts for the frontend live in `mr-mentor-frontend/deploy/blue-green/`)
Architecture
Runtime topology (production)
flowchart TD
subgraph Internet["Public Internet"]
Browser["Browsers / Frontend apps"]
end
subgraph EC2["Ubuntu EC2 host (production)"]
Nginx["Nginx reverse proxy<br/>api.myanalyticsschool.com<br/>TLS termination"]
subgraph BG["Blue-Green pair (Docker)"]
Blue["mr-mentor-backend-blue<br/>host :8000"]
Green["mr-mentor-backend-green<br/>host :8001 (idle)"]
end
PG["mr-mentor-postgres<br/>postgres:16-alpine<br/>volume postgres_data"]
RD["mr-mentor-redis<br/>redis:7-alpine<br/>volume redis_data"]
Net["Docker network: mas-network"]
end
subgraph AWS["AWS managed services"]
S3["S3 buckets<br/>recordings / documents / banners"]
SM["Secrets Manager<br/>mr-mentor-backend/production"]
end
subgraph Ext["External APIs"]
RZP["Razorpay"]
GOOG["Google OAuth / Calendar"]
SMTP["Gmail SMTP"]
EXO["Exotel"]
HIRE["mr-hire-backend"]
end
Browser -->|HTTPS / WSS| Nginx
Nginx -->|"proxy_pass active color"| Blue
Blue --> Net
Green --> Net
Net --> PG
Net --> RD
Net -->|"http internal"| HIRE
Blue --> S3
Blue --> RZP
Blue --> GOOG
Blue --> SMTP
Blue --> EXO
SM -->|"fetched in CI, written as .env"| EC2
Build and release pipeline
flowchart LR
Dev["Developer push<br/>to main or development"] --> GH["GitHub Actions<br/>build.yml"]
subgraph Build["Build and Push Docker Image"]
BX["Docker Buildx<br/>multi-stage build"]
S1["Stage builder<br/>Bun install + esbuild"]
S2["Stage deps<br/>npm prod node_modules"]
S3b["Stage runner / prod<br/>Node 20 non-root"]
BX --> S1 --> S2 --> S3b
end
GH --> Build
Build -->|"push tags branch, branch-sha, latest"| GHCR["GHCR<br/>ghcr.io/owner/mr-mentor-backend"]
GHCR --> DepDev["deploy-development.yml<br/>compose up -d"]
GHCR --> DepProd["deploy-production.yml<br/>blue-green deploy.sh"]
DepDev --> DevServer["Dev server container"]
DepProd --> ProdServer["Prod blue-green + Nginx switch"]
Note: the multi-stage build deliberately uses two package managers. Bun does the fast install and esbuild bundling in `builder`, but production `node_modules` are installed with npm in the `deps` stage so the native `bcrypt` prebuild resolves against the same Node 20 ABI used at runtime (`node dist/index.js`). The `builder` stage's Bun `node_modules` are discarded.
Data model
Deployment has no TypeORM entities. The "data model" here is the relationship between build stages, images, and runtime services.
erDiagram
DOCKERFILE ||--|{ STAGE : "defines"
STAGE ||--o| IMAGE : "produces"
IMAGE ||--o{ TAG : "published as"
IMAGE ||--|| CONTAINER_APP : "runs as"
COMPOSE_FILE ||--|{ SERVICE : "declares"
SERVICE ||--o| CONTAINER_APP : "app"
SERVICE ||--o| CONTAINER_PG : "postgres"
SERVICE ||--o| CONTAINER_RD : "redis"
CONTAINER_PG ||--|| VOLUME_PG : "persists to"
CONTAINER_RD ||--|| VOLUME_RD : "persists to"
NETWORK ||--o{ SERVICE : "connects"
DOCKERFILE {
string path "Dockerfile"
int stages "4 builder deps runner prod"
}
STAGE {
string name "builder deps runner prod"
string base "oven-bun-1-alpine or node-20-alpine"
}
IMAGE {
string registry "ghcr.io"
string target "prod"
}
TAG {
string branch
string branch_sha
string latest "main only"
}
SERVICE {
string name "app postgres redis"
bool healthcheck
}
VOLUME_PG {
string name "postgres_data"
}
VOLUME_RD {
string name "redis_data"
}
NETWORK {
string name "mas-network"
bool external "true"
}
API surface
Deployment exposes no business API. The only HTTP surface relevant to deployment is the health
endpoint, used by the image HEALTHCHECK, the compose healthcheck, the blue-green smoke test,
and the post-deploy verification step.
| Method | Path | Auth/role | Purpose |
|---|---|---|---|
| GET | `/api/health` | none (public) | Liveness/readiness probe. Returns 200 with JSON `success:true` when the app is up. Used by Docker healthcheck, blue-green `deploy.sh` smoke test, and `deploy-production.yml` post-deploy verification. |
Operational management is done over SSH and docker compose / docker commands, not HTTP. The
Bull Board queue UI (/admin/queues) and the rest of the API are documented in their own feature
docs.
User journeys
The "users" here are CI and operators. Each journey is an end-to-end deployment flow.
Journey 1: Build and push image (every push to a deploy branch)
A push to main, development, or staging triggers build.yml. It runs the multi-stage build
with BuildKit GitHub Actions cache and pushes tagged images to GHCR.
sequenceDiagram
participant Dev as Developer
participant GH as GitHub Actions
participant BK as Docker Buildx
participant GHCR as GHCR Registry
Dev->>GH: push to main or development or staging
GH->>GH: checkout code
GH->>GH: compute tags branch and branch-sha and latest
GH->>BK: build with target prod and cache-from gha
BK->>BK: stage builder runs bun install then esbuild bundle
BK->>BK: stage deps runs npm install omit dev for bcrypt
BK->>BK: stage runner copies dist and node_modules as non-root
BK->>GHCR: push image tags
GHCR-->>GH: digest
GH-->>Dev: build succeeded, triggers deploy workflow
Key facts (from build.yml): concurrency: docker-build with cancel-in-progress so only the
latest build runs; auth uses the built-in GITHUB_TOKEN; OCI labels record source, revision, and
version.
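The tagging rule described earlier (`<branch>`, `<branch>-<short-sha>`, plus `latest` for `main`) can be sketched as a small shell function. This is an illustration only: `compute_image_tags` is a hypothetical helper, and the 7-character short SHA is an assumption. The real tag generation lives in `build.yml` and may differ.

```shell
# Hypothetical helper, not taken from build.yml.
# Prints one image tag per line for a given branch and commit SHA.
# The body runs in a subshell "( ... )" so its variables do not leak.
compute_image_tags() (
  branch="$1"
  sha="$2"
  # Assumption: the short SHA is the first 7 characters of the commit SHA.
  short_sha=$(printf '%s' "$sha" | cut -c1-7)

  # Every deploy branch gets <branch> and <branch>-<short-sha>.
  printf '%s\n' "$branch" "${branch}-${short_sha}"

  # Only main additionally gets the floating "latest" tag.
  if [ "$branch" = "main" ]; then
    printf '%s\n' "latest"
  fi
)
```

For example, `compute_image_tags main 0123456789abcdef` prints `main`, `main-0123456`, and `latest` on separate lines, while the same call for `development` prints only the first two forms.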
Journey 2: Development deploy (recreate strategy)
When the build for development succeeds, deploy-development.yml fires via workflow_run. It
pulls env from AWS Secrets Manager, copies it plus docker-compose.prod.yml to the dev server, and
recreates the stack. There is a brief downtime window during recreate (acceptable for dev).
sequenceDiagram
participant BW as build.yml
participant DW as deploy-development.yml
participant SM as AWS Secrets Manager
participant SRV as Dev server
participant DK as Docker on server
BW-->>DW: workflow_run success on development
DW->>SM: get-secret-value mr-mentor-backend development
SM-->>DW: secret JSON
DW->>DW: jq to-entries builds .env file
DW->>SRV: scp docker-compose.prod.yml and .env
DW->>SRV: ssh into server
SRV->>DK: docker login ghcr.io
SRV->>DK: docker compose -f docker-compose.prod.yml pull
SRV->>DK: docker compose -f docker-compose.prod.yml up -d
DK->>DK: postgres and redis healthchecks pass first
DK->>DK: app starts, waits depends_on healthy
DK-->>SRV: containers running
SRV-->>DW: deploy complete
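The secret-to-`.env` conversion in this flow can be sketched as follows. It is a minimal sketch, assuming the secret is a flat JSON object of string or numeric values and that `jq` is installed; `secret_json_to_env` is a hypothetical name, and the actual workflow step may be written differently.

```shell
# Hypothetical helper: reads a Secrets Manager SecretString (JSON) on stdin
# and writes KEY=value lines on stdout. Requires jq.
secret_json_to_env() (
  json=$(cat)

  # Guard: reject anything that is not a JSON object (arrays, strings,
  # invalid JSON), so a malformed secret never reaches the server.
  printf '%s' "$json" | jq -e 'type == "object"' >/dev/null || exit 1

  # Flatten {"KEY": "value"} pairs into KEY=value lines.
  # Assumption: values contain no newlines; a flat .env cannot carry them.
  printf '%s' "$json" | jq -r 'to_entries[] | "\(.key)=\(.value)"'
)
```

In CI this would typically be fed from `aws secretsmanager get-secret-value --secret-id <id> --query SecretString --output text`, with the output redirected to `.env` before the copy step. The exact secret ID and output file name used by the workflows should be taken from the workflow files themselves.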
Journey 3: Production blue-green deploy (zero downtime)
When the build for main succeeds, deploy-production.yml fires. It runs pre-checks, then calls
the server-side deploy.sh which deploys to the idle color, health-checks it, switches Nginx, and
keeps the old color stopped for rollback.
sequenceDiagram
participant DW as deploy-production.yml
participant SM as AWS Secrets Manager
participant SRV as Prod server
participant DS as deploy.sh on server
participant NG as Nginx
participant OLD as Old color container
participant NEW as New color container
DW->>SM: get-secret-value mr-mentor-backend production
SM-->>DW: secret JSON
DW->>DW: jq builds .env.original
DW->>SRV: scp .env.original to blue-green-deployment
DW->>SRV: ssh pre-deploy health check
SRV->>SRV: verify nginx running and postgres pg_isready and redis ping
DW->>SRV: docker login ghcr.io as mas-mr-mentor
DW->>DS: run deploy.sh with image url
DS->>DS: read .current_env to pick idle target color
DS->>NEW: docker run target color on idle port
DS->>NEW: poll docker health until healthy
DS->>NEW: curl smoke test on idle port expects 200
DS->>NG: sed switch proxy_pass to target port then nginx reload
DS->>OLD: docker stop old container kept for rollback
DS->>DS: write target color to .current_env
DS-->>DW: success
DW->>SRV: post-deploy verify health and db user count and redis ping
SRV-->>DW: all checks pass
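The color-selection step at the start of this flow can be sketched as below. The server-side `deploy.sh` is not in this repo, so this is only an illustration: `idle_color` and `color_port` are hypothetical names, and the port mapping (blue on 8000, green on 8001) is taken from the runtime topology diagram rather than from the script.

```shell
# Hypothetical sketch of how a deploy script could pick the idle color.
# $1 is the path to the state file (.current_env on the prod server).
idle_color() (
  state_file="$1"
  current=$(cat "$state_file" 2>/dev/null)
  case "$current" in
    blue)  echo green ;;
    green) echo blue ;;
    # Assumption: refuse to guess when the state file is missing or
    # corrupt, rather than risk replacing the live container.
    *)     echo "unknown live color in $state_file" >&2; exit 1 ;;
  esac
)

# Maps a color to its host port (assumed: blue=8000, green=8001).
color_port() {
  case "$1" in
    blue)  echo 8000 ;;
    green) echo 8001 ;;
    *)     return 1 ;;
  esac
}
```

A deploy script built this way would compute `target=$(idle_color /home/ubuntu/blue-green-deployment/.current_env)`, start the new container on `$(color_port "$target")`, and write `$target` back to the state file only after the Nginx switch succeeds.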
Journey 4: Failed deploy with automatic rollback
If deploy.sh returns non-zero (new color never goes healthy, or smoke test fails), the workflow
dumps the new container logs and invokes rollback.sh. Because the old color was only stopped,
rollback only has to restart the old container and point Nginx back at its port.
sequenceDiagram
participant DW as deploy-production.yml
participant DS as deploy.sh
participant RB as rollback.sh
participant NG as Nginx
participant OLD as Old color
participant NEW as New color
DW->>DS: run deploy.sh with image url
DS->>NEW: start new color and wait healthy
NEW-->>DS: stays unhealthy or smoke test fails
DS-->>DW: non-zero exit code
DW->>NEW: docker logs tail 50 for diagnosis
DW->>RB: run rollback.sh
RB->>OLD: docker start old color
RB->>NG: switch proxy_pass back to old port then reload
RB-->>DW: traffic restored to previous version
DW-->>DW: job marked failed for investigation
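Both the deploy and the rollback change which port Nginx proxies to. A minimal sketch of that switch is shown here, assuming the site file contains a line of the form `proxy_pass http://127.0.0.1:<port>;`. The function name and the exact directive format are assumptions; the real scripts are not in this repo.

```shell
# Hypothetical helper: rewrite the proxied port in an Nginx site file.
# Usage: switch_proxy_port <conf-file> <old-port> <new-port>
switch_proxy_port() (
  conf="$1"
  old_port="$2"
  new_port="$3"
  tmp=$(mktemp) || exit 1

  # Write the edited config to a temp file first so a failed sed
  # never leaves a half-written site file behind.
  sed "s#proxy_pass http://127\.0\.0\.1:${old_port};#proxy_pass http://127.0.0.1:${new_port};#" \
    "$conf" > "$tmp" || { rm -f "$tmp"; exit 1; }

  # If nothing changed, the old port was not in the file: fail loudly
  # instead of reloading Nginx with an unchanged upstream.
  if cmp -s "$conf" "$tmp"; then
    rm -f "$tmp"
    exit 1
  fi

  # Overwrite in place (keeps the original file's owner and mode).
  cat "$tmp" > "$conf" && rm -f "$tmp"
)
```

After a successful switch, the config should be validated with `nginx -t` before `systemctl reload nginx`, so a bad edit blocks the traffic change instead of taking the site down.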
Journey 5: Manual deploy via deploy-docker.sh
For ad-hoc server-side deploys (no CI), an operator runs deploy-docker.sh. It logs into GHCR,
stops the stack, pulls, and brings it back up with docker-compose.prod.yml. This is the simpler
recreate path, not blue-green.
sequenceDiagram
participant Op as Operator
participant SH as deploy-docker.sh
participant DK as Docker
Op->>SH: IMAGE_TAG and GHCR_REPOSITORY set then run script
SH->>SH: verify docker and docker compose installed
SH->>DK: docker login ghcr.io if token provided
SH->>DK: docker compose -f docker-compose.prod.yml down
SH->>DK: docker compose -f docker-compose.prod.yml pull
SH->>DK: docker compose -f docker-compose.prod.yml up -d
SH->>SH: sleep 15 then show ps and logs tail 50
SH->>Op: prompt to prune old images
SH-->>Op: deployment complete on configured port
Journey 6: Legacy PM2 deploy
The original flow, still present as deploy.sh + npm run deploy. It builds on the server and runs
the bundle under PM2 cluster mode. Documented in CI-CD.md.
sequenceDiagram
participant Op as Operator or CI
participant SH as deploy.sh
participant PM as PM2
Op->>SH: run deploy.sh
SH->>SH: tar backup of existing dist
SH->>SH: git pull current branch
SH->>SH: bun install production
SH->>SH: bun run build esbuild to dist
SH->>PM: pm2 restart mr-mentor-backend if exists
PM->>PM: cluster mode instances max
SH->>PM: pm2 save
SH-->>Op: pm2 status printed
Background jobs & async
Deployment does not own BullMQ queues, but operators must know they exist because they affect restart behavior:
- The app container starts 5 BullMQ workers in-process (`email`, `database`, `cleanup`, `kpi`, `resumeAnalysis`) plus scheduled jobs (cleanup every 24h, KPI every 15min). See the backend startup sequence in the project guide.
- Restart caveat: BullMQ jobs that were in flight do not always auto-retry across a container recreate. After a deploy, stuck jobs may need a manual nudge (`npm run queue:clear` flushes all queues; use with care). The KPI/cleanup schedulers re-register on boot.
- Socket.IO: the app serves WebSocket traffic for meetings. During a blue-green switch, existing WebSocket connections on the old color are dropped when it is stopped; clients reconnect to the new color via Nginx. Plan production deploys outside live-meeting windows where possible.
- Bull Board queue-monitoring UI is mounted at `/admin/queues` and is reachable through the same Nginx proxy.
No deployment-specific webhooks exist; CI is triggered by push and workflow_run events, not
inbound webhooks.
External integrations
| Integration | Used by deployment for | Env / secret | Failure / fallback |
|---|---|---|---|
| GHCR (`ghcr.io`) | Image registry; CI pushes, servers pull | `GITHUB_TOKEN` (CI push), `GHCR_PULL_TOKEN` + user `mas-mr-mentor` (prod pull), `GHCR_USERNAME`/`GHCR_TOKEN` (manual) | Pull failure aborts deploy; old container keeps serving. |
| AWS Secrets Manager | Source of truth for `.env`. CI fetches `mr-mentor-backend/{development,production}` and writes a flat `.env` via `jq` | `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION` (default `ap-south-1`) | Secret missing/malformed (`jq -e 'type == "object"'` guard) fails the deploy before touching the server. |
| SSH (appleboy actions) | Copy files and run remote scripts | `*_SERVER_HOST`, `*_SERVER_USER`, `*_SSH_KEY`, `*_SSH_PORT` | SSH failure aborts the workflow step. |
| Nginx | TLS termination + traffic switch between colors | `/etc/nginx/sites-available/api.myanalyticsschool.com` | `nginx -t` validates config before `systemctl reload`; a bad config blocks the switch. |
| PostgreSQL 16 | Stateful DB container | `DB_*` env; volume `postgres_data` | `pg_isready` healthcheck gates app start; verified pre/post deploy. |
| Redis 7 | Cache + BullMQ broker | `REDIS_*` env; volume `redis_data` | `redis-cli ping` healthcheck; verified pre/post deploy. |
| AWS S3 | Recordings, documents, banner assets | `AWS_S3_*` buckets | App-level; not gated by deploy. |
| mr-hire-backend | AI services over internal network | `MR_HIRE_BACKEND_URL` (container DNS on `mas-network`) | Optional at boot; affects only Mr. Hire features. |
Feature flags / toggles relevant at deploy time (from compose files): ENABLE_SEEDING (prod compose
sets true to seed colleges/batches on first boot), USE_DIRECT_S3_UPLOAD, ALLOW_EARLY_MEETING_JOIN,
MEETING_JOIN_BUFFER_MINUTES, TOKEN_VALUE.
Status lifecycles
Blue-green active color
The live color is tracked in .current_env. Each successful deploy flips it; rollback flips it back.
stateDiagram-v2
[*] --> Blue
Blue --> DeployingGreen : deploy.sh picks idle green
DeployingGreen --> Green : green healthy and nginx switched
DeployingGreen --> Blue : green unhealthy, rollback
Green --> DeployingBlue : next deploy picks idle blue
DeployingBlue --> Blue : blue healthy and nginx switched
DeployingBlue --> Green : blue unhealthy, rollback
Container health (Docker HEALTHCHECK)
Every app container moves through Docker's health states; the deploy script waits up to ~80s
(40 retries x 2s) for healthy before switching traffic.
stateDiagram-v2
[*] --> starting
starting --> healthy : /api/health returns 200 within start-period
starting --> unhealthy : retries exhausted
healthy --> unhealthy : 3 consecutive failed probes
unhealthy --> healthy : probe recovers
unhealthy --> [*] : deploy aborts and dumps logs
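The bounded wait described above (40 retries at 2-second intervals, roughly 80 seconds) can be sketched as a generic retry loop. This is an illustrative sketch, not the server-side script: `wait_until_healthy` and `container_is_healthy` are hypothetical names, and the container name in the usage note is taken from the topology diagram.

```shell
# Hypothetical helper: run a probe command until it succeeds or the
# retry budget is exhausted.
# Usage: wait_until_healthy <retries> <interval-seconds> <probe> [args...]
wait_until_healthy() (
  retries="$1"
  interval="$2"
  shift 2

  attempt=1
  while [ "$attempt" -le "$retries" ]; do
    # "$@" is the probe command; exit status 0 means healthy.
    if "$@"; then
      exit 0
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done

  # Budget exhausted: the caller should abort the deploy and dump logs.
  exit 1
)

# Example probe: true only when Docker reports the container as healthy.
container_is_healthy() {
  [ "$(docker inspect --format '{{.State.Health.Status}}' "$1" 2>/dev/null)" = "healthy" ]
}
```

A deploy script could call `wait_until_healthy 40 2 container_is_healthy mr-mentor-backend-green` and switch Nginx only when it returns 0. Passing the probe as an argument keeps the loop reusable for the later `curl` smoke test as well.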
Edge cases, limits & gotchas
- `mas-network` is external. All three compose files declare `networks: mas-network: external: true` (dev compose uses `app-network` for its own services but still declares `mas-network` external). The network must exist (`docker network create mas-network`) before the first `up`, or compose fails. This shared network is how the backend reaches `mr-hire-backend` by container name.
- Two package managers by design. Do not "simplify" the Dockerfile to a single Bun install. The `deps` stage uses npm specifically so `bcrypt` native prebuilds match the Node 20 runtime ABI. Bun-installed `bcrypt` from the `builder` stage is intentionally discarded.
- `PORT` mismatch in PM2 config. `ecosystem.config.js` sets `PORT: 3000`, but the Docker image, compose files, and Nginx all use 8000. The CI-CD doc's troubleshooting text also says "port 3000". Treat 8000 as authoritative for the containerized backend; the PM2 path is legacy. (Noted discrepancy, not a runtime bug in the Docker flow.)
- `MR_HIRE_BACKEND_URL` default differs across files. `docker-compose.yml` / `.dev.yml` default to `http://mr-hire-backend:8001`, but `docker-compose.prod.yml` defaults to `http://mr-hire-backend:8000`. Set it explicitly via Secrets Manager to avoid relying on the default. (inferred risk)
- Dev deploy has a downtime blip. `deploy-development.yml` uses `compose up -d` recreate, not blue-green. Acceptable for dev; never use this path for production.
- Old container is stopped, not removed. Blue-green rollback depends on the previous color still existing (stopped). A `docker container prune` between deploys would destroy the rollback target; prune only old images, and only after a deploy is confirmed good.
- Secrets are written as a flat `.env` on the server. CI converts the Secrets Manager JSON object to `KEY=value` lines with `jq` and `scp`s it. A non-object secret payload is rejected by the `jq -e 'type == "object"'` guard before deploy. The env file lives at `/home/ubuntu/blue-green-deployment/.env(.original)` (prod) and `/home/ubuntu/mr-mentor-backend/.env` (dev).
- `docker-compose.prod.yml` requires `GHCR_REPOSITORY` and `IMAGE_TAG`. With neither set the `image:` line resolves to an invalid reference and `up`/`pull` fails. The deploy scripts/workflows export these.
- Healthcheck without curl. The image and compose healthchecks shell out to `node -e` HTTP probes because the Alpine/Bun base images ship no `curl`/`wget`. Keep this in mind when editing healthcheck commands.
- Seeding on prod boot. `ENABLE_SEEDING=true` in `docker-compose.prod.yml` means the app seeds colleges/batches if those tables are empty. With TypeORM `synchronize: true` also on, entity changes auto-apply to the prod DB on deploy, so review entity changes carefully before shipping.
- Logs are capped. Prod compose sets `json-file` logging with `max-size: 10m`, `max-file: 3` per service; deeper history must come from external log shipping (not configured here).
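Several of the gotchas above come down to a variable that was expected but not set (for example `GHCR_REPOSITORY` and `IMAGE_TAG` for `docker-compose.prod.yml`). A pre-flight guard along these lines can catch that before any `docker compose` command runs. `require_vars` is a hypothetical helper shown for illustration, not something the existing scripts are known to contain.

```shell
# Hypothetical pre-flight guard: fail if any named variable is unset or empty.
# Usage: require_vars GHCR_REPOSITORY IMAGE_TAG
require_vars() (
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is in $name.
    # Only pass trusted, literal variable names to this function.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  # Non-zero exit if at least one variable was missing.
  exit "$missing"
)
```

A manual deploy could start with `require_vars GHCR_REPOSITORY IMAGE_TAG || exit 1`, so the operator sees every missing name at once instead of an invalid image reference from compose.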
Companion frontends (deployed alongside the backend)
The two Next.js frontends follow the same GHCR image flow; mr-mentor-frontend additionally uses the blue-green pattern:
| Repo | Image base | Port | Runtime | Strategy | Nginx host |
|---|---|---|---|---|---|
| `mr-mentor-frontend` | `oven/bun:1`, Next.js standalone (`server.js`) | 3000 (blue) / 3001 (green) | `node server.js` as non-root `bun` | Blue-green (`deploy/blue-green/{deploy,rollback,health-check}.sh`, network `frontend-network`) | mrmentor.in |
| `mas-website-live` | `node:20-alpine`, Next.js standalone | 8088 | `node server.js` | Single container (`docker-compose-prod.yml`) | (public site) |
Both bake NEXT_PUBLIC_* values as Docker build ARGs (they must be present at build time, not just
runtime), so the build workflows pass them as --build-arg from GitHub secrets. The frontend
blue-green deploy.sh mirrors the backend's: pick idle color, docker run on the idle port,
poll Docker health, curl smoke test, sed the Nginx upstream, reload, then stop (not remove) the
old color. See mr-mentor-frontend/deploy/blue-green/deploy.sh and
mas-website-live/Dockerfile.
Related docs
- cicd-pipelines: the GitHub Actions workflows (`build.yml`, `deploy-development.yml`, `deploy-production.yml`) that drive this deployment.
- infrastructure-topology: servers, Nginx, DNS, networks, AWS account layout.
- ../architecture/system-overview.md: where the backend sits among the frontends and `mr-hire-backend`.
- ../architecture/background-jobs-and-queues.md: BullMQ workers affected by restarts.