Deployment Architecture

This document describes how the MAS / Mr. Mentor backend and its two companion frontends are built, packaged, and run in production. It covers the multi-stage Docker build, the Docker Compose service topology (API + PostgreSQL + Redis), the legacy PM2 process model, the blue-green zero-downtime release flow driven by GitHub Actions + Nginx, and the manual deploy scripts. Read this together with cicd-pipelines (the GitHub Actions workflows that build images and trigger deploys) and infrastructure-topology (servers, networks, DNS).

Status: documented from source on this branch.


Overview

The backend (mr-mentor-backend) is a single Node.js 20 / Express service. It is packaged as one OCI image and run as a container behind Nginx on an Ubuntu EC2 host. Two persistent infrastructure containers — PostgreSQL 16 and Redis 7 — run alongside it on a shared Docker network. AWS S3 and all third-party APIs (Razorpay, Google, Gmail SMTP, Exotel, etc.) are external managed services.

There are two deployment styles in the repo, and the codebase is mid-migration between them:

| Style | Where it lives | Status |
| --- | --- | --- |
| PM2 (build on server, run with PM2 cluster) | deploy.sh, ecosystem.config.js, CI-CD.md | Legacy / fallback |
| Docker + GHCR (build image in CI, pull on server) | Dockerfile, docker-compose.prod.yml, deploy-docker.sh, DOCKER_DEPLOYMENT.md | Current: dev uses plain compose, prod uses blue-green |

Two environments exist, fed by two git branches:

| Environment | Branch | Server | Strategy |
| --- | --- | --- | --- |
| Development | development | DEVELOPMENT_SERVER_HOST | docker compose -f docker-compose.prod.yml up -d (recreate) |
| Production | main | PRODUCTION_SERVER_HOST (api.myanalyticsschool.com) | Blue-green with Nginx traffic switch |

Personas who touch this domain: platform/DevOps engineers (own the servers, Nginx, secrets), release engineers (trigger and verify deploys), and on-call engineers (roll back). End users never interact with deployment machinery directly.

Where the backend sits in the suite: it is the hub. mas-website-live (:8088) and mr-mentor-frontend (:3000) call it over HTTP/WebSocket; mr-hire-backend is reachable from the backend container over the shared Docker network at MR_HIRE_BACKEND_URL.


Key concepts & entities

This is an operations domain, so the "entities" are build artifacts and runtime objects rather than TypeORM tables.

| Term | Meaning |
| --- | --- |
| Multi-stage build | Dockerfile has 4 stages: builder (Bun + esbuild bundle), deps (npm production node_modules for native bcrypt), runner (slim Node 20 runtime, non-root), prod (alias of runner used by compose target: prod). |
| esbuild bundle | npm run build bundles src/index.ts to a single dist/index.js, externalizing bcrypt and module-alias/register. See package.json. |
| GHCR | GitHub Container Registry. Images are pushed to ghcr.io/<owner>/mr-mentor-backend:<tag>. |
| Image tags | <branch>, <branch>-<short-sha>, plus latest for main. Generated in .github/workflows/build.yml. |
| mas-network | External Docker network shared by app + postgres + redis (+ mr-hire-backend). Must be created with docker network create mas-network before first deploy. |
| Blue-green | Two identical containers (-blue / -green) on different host ports; Nginx points at the active one. New release goes to the idle color, is health-checked, then traffic is switched. The old container is stopped, not removed, for instant rollback. |
| .current_env | A file on the prod server (/home/ubuntu/blue-green-deployment/.current_env) holding blue or green; it is the source of truth for which color is live. |
| PM2 ecosystem | ecosystem.config.js: cluster mode, instances: 'max', max_memory_restart: 1G, auto-restart. Legacy path. |
| Healthcheck | GET /api/health must return 200 (and JSON "success":true in the prod verification step). Baked into the image HEALTHCHECK. |
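
The mas-network prerequisite is easy to forget on a fresh server. A minimal sketch of an idempotent check is shown here; `ensure_network` is a hypothetical helper name (not a script in the repo), and it relies only on the standard `docker network inspect` and `docker network create` subcommands.

```shell
# Create the shared external network only if it does not exist yet.
# ensure_network is a hypothetical helper name, not a script from the repo.
ensure_network() {
    net="${1:-mas-network}"
    if docker network inspect "$net" >/dev/null 2>&1; then
        echo "network $net already exists"
    else
        docker network create "$net" >/dev/null && echo "network $net created"
    fi
}

# Usage on a fresh server, before the first `docker compose up`:
#   ensure_network mas-network
```

Because the function only creates the network when the inspect call fails, it should be safe to run at the top of every deploy.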

Source files of record:

  • Dockerfile, Dockerfile.dev, .dockerignore
  • docker-compose.yml (local build), docker-compose.dev.yml (local hot-reload), docker-compose.prod.yml (GHCR image)
  • deploy.sh (PM2), deploy-docker.sh (compose on server), ecosystem.config.js
  • .github/workflows/build.yml, deploy-development.yml, deploy-production.yml
  • Server-side blue-green scripts: /home/ubuntu/blue-green-deployment/{deploy.sh,rollback.sh,health-check.sh} (not in this repo; analogous scripts for the frontend live in mr-mentor-frontend/deploy/blue-green/)

Architecture

Runtime topology (production)

flowchart TD
    subgraph Internet["Public Internet"]
        Browser["Browsers / Frontend apps"]
    end

    subgraph EC2["Ubuntu EC2 host (production)"]
        Nginx["Nginx reverse proxy<br/>api.myanalyticsschool.com<br/>TLS termination"]

        subgraph BG["Blue-Green pair (Docker)"]
            Blue["mr-mentor-backend-blue<br/>host :8000"]
            Green["mr-mentor-backend-green<br/>host :8001 (idle)"]
        end

        PG["mr-mentor-postgres<br/>postgres:16-alpine<br/>volume postgres_data"]
        RD["mr-mentor-redis<br/>redis:7-alpine<br/>volume redis_data"]
        Net["Docker network: mas-network"]
    end

    subgraph AWS["AWS managed services"]
        S3["S3 buckets<br/>recordings / documents / banners"]
        SM["Secrets Manager<br/>mr-mentor-backend/production"]
    end

    subgraph Ext["External APIs"]
        RZP["Razorpay"]
        GOOG["Google OAuth / Calendar"]
        SMTP["Gmail SMTP"]
        EXO["Exotel"]
        HIRE["mr-hire-backend"]
    end

    Browser -->|HTTPS / WSS| Nginx
    Nginx -->|"proxy_pass active color"| Blue
    Blue --> Net
    Green --> Net
    Net --> PG
    Net --> RD
    Net -->|"http internal"| HIRE
    Blue --> S3
    Blue --> RZP
    Blue --> GOOG
    Blue --> SMTP
    Blue --> EXO
    SM -->|"fetched in CI, written as .env"| EC2

Build and release pipeline

flowchart LR
    Dev["Developer push<br/>to main or development"] --> GH["GitHub Actions<br/>build.yml"]

    subgraph Build["Build and Push Docker Image"]
        BX["Docker Buildx<br/>multi-stage build"]
        S1["Stage builder<br/>Bun install + esbuild"]
        S2["Stage deps<br/>npm prod node_modules"]
        S3b["Stage runner / prod<br/>Node 20 non-root"]
        BX --> S1 --> S2 --> S3b
    end

    GH --> Build
    Build -->|"push tags branch, branch-sha, latest"| GHCR["GHCR<br/>ghcr.io/owner/mr-mentor-backend"]

    GHCR --> DepDev["deploy-development.yml<br/>compose up -d"]
    GHCR --> DepProd["deploy-production.yml<br/>blue-green deploy.sh"]

    DepDev --> DevServer["Dev server container"]
    DepProd --> ProdServer["Prod blue-green + Nginx switch"]

Note: the multi-stage build deliberately uses two package managers. Bun does the fast install and esbuild bundling in builder, but production node_modules are installed with npm in the deps stage so the native bcrypt prebuild resolves against the same Node 20 ABI used at runtime (node dist/index.js). The builder stage's Bun node_modules are discarded.


Data model

Deployment has no TypeORM entities. The "data model" here is the relationship between build stages, images, and runtime services.

erDiagram
    DOCKERFILE ||--|{ STAGE : "defines"
    STAGE ||--o| IMAGE : "produces"
    IMAGE ||--o{ TAG : "published as"
    IMAGE ||--|| CONTAINER_APP : "runs as"
    COMPOSE_FILE ||--|{ SERVICE : "declares"
    SERVICE ||--o| CONTAINER_APP : "app"
    SERVICE ||--o| CONTAINER_PG : "postgres"
    SERVICE ||--o| CONTAINER_RD : "redis"
    CONTAINER_PG ||--|| VOLUME_PG : "persists to"
    CONTAINER_RD ||--|| VOLUME_RD : "persists to"
    NETWORK ||--o{ SERVICE : "connects"

    DOCKERFILE {
        string path "Dockerfile"
        int stages "4 builder deps runner prod"
    }
    STAGE {
        string name "builder deps runner prod"
        string base "oven-bun-1-alpine or node-20-alpine"
    }
    IMAGE {
        string registry "ghcr.io"
        string target "prod"
    }
    TAG {
        string branch
        string branch_sha
        string latest "main only"
    }
    SERVICE {
        string name "app postgres redis"
        bool healthcheck
    }
    VOLUME_PG {
        string name "postgres_data"
    }
    VOLUME_RD {
        string name "redis_data"
    }
    NETWORK {
        string name "mas-network"
        bool external "true"
    }

API surface

Deployment exposes no business API. The only HTTP surface relevant to deployment is the health endpoint, used by the image HEALTHCHECK, the compose healthcheck, the blue-green smoke test, and the post-deploy verification step.

| Method | Path | Auth/role | Purpose |
| --- | --- | --- | --- |
| GET | /api/health | none (public) | Liveness/readiness probe. Returns 200 with JSON "success":true when the app is up. Used by Docker healthcheck, blue-green deploy.sh smoke test, and deploy-production.yml post-deploy verification. |
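
The pass condition for the probe (status 200 plus a JSON body containing "success":true) can be sketched as a small shell predicate. `health_ok` is a hypothetical helper, and the curl invocation in the usage comment assumes the backend is listening on local port 8000; the actual checks in the deploy scripts may differ.

```shell
# Decide whether a health probe result counts as passing.
# $1 = HTTP status code, $2 = response body.
# health_ok is a hypothetical helper; the real checks live in the deploy scripts.
health_ok() {
    [ "$1" = "200" ] || return 1
    # Tolerate optional whitespace around the colon in the JSON body.
    printf '%s' "$2" | grep -Eq '"success"[[:space:]]*:[[:space:]]*true'
}

# Possible use against a running container (port 8000 assumed):
#   status=$(curl -s -o /tmp/health.json -w '%{http_code}' http://localhost:8000/api/health)
#   health_ok "$status" "$(cat /tmp/health.json)" && echo "healthy"
```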

Operational management is done over SSH and docker compose / docker commands, not HTTP. The Bull Board queue UI (/admin/queues) and the rest of the API are documented in their own feature docs.


User journeys

The "users" here are CI and operators. Each journey is an end-to-end deployment flow.

Journey 1 — Build and push image (every push to a deploy branch)

A push to main, development, or staging triggers build.yml. It runs the multi-stage build with BuildKit GitHub Actions cache and pushes tagged images to GHCR.

sequenceDiagram
    participant Dev as Developer
    participant GH as GitHub Actions
    participant BK as Docker Buildx
    participant GHCR as GHCR Registry

    Dev->>GH: push to main or development or staging
    GH->>GH: checkout code
    GH->>GH: compute tags branch and branch-sha and latest
    GH->>BK: build with target prod and cache-from gha
    BK->>BK: stage builder runs bun install then esbuild bundle
    BK->>BK: stage deps runs npm install omit dev for bcrypt
    BK->>BK: stage runner copies dist and node_modules as non-root
    BK->>GHCR: push image tags
    GHCR-->>GH: digest
    GH-->>Dev: build succeeded, triggers deploy workflow

Key facts (from build.yml): concurrency: docker-build with cancel-in-progress so only the latest build runs; auth uses the built-in GITHUB_TOKEN; OCI labels record source, revision, and version.
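
The tagging rule (branch, branch-short-sha, and latest only for main) can be illustrated with a short shell function. This is a sketch of the rule, not the workflow's actual implementation; `image_tags` is a hypothetical name, and the 7-character short SHA is an assumption, since the document does not state the length.

```shell
# Print the image tags for a branch and commit, one per line.
# Assumption: the short SHA is the first 7 characters of the commit hash.
image_tags() {
    branch="$1"
    short_sha=$(printf '%s' "$2" | cut -c1-7)
    echo "$branch"
    echo "$branch-$short_sha"
    if [ "$branch" = "main" ]; then
        echo "latest"
    fi
}

# Usage:
#   image_tags main "$(git rev-parse HEAD)"
```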

Journey 2 — Development deploy (recreate strategy)

When the build for development succeeds, deploy-development.yml fires via workflow_run. It pulls env from AWS Secrets Manager, copies it plus docker-compose.prod.yml to the dev server, and recreates the stack. There is a brief downtime window during recreate (acceptable for dev).

sequenceDiagram
    participant BW as build.yml
    participant DW as deploy-development.yml
    participant SM as AWS Secrets Manager
    participant SRV as Dev server
    participant DK as Docker on server

    BW-->>DW: workflow_run success on development
    DW->>SM: get-secret-value mr-mentor-backend development
    SM-->>DW: secret JSON
    DW->>DW: jq to-entries builds .env file
    DW->>SRV: scp docker-compose.prod.yml and .env
    DW->>SRV: ssh into server
    SRV->>DK: docker login ghcr.io
    SRV->>DK: docker compose -f docker-compose.prod.yml pull
    SRV->>DK: docker compose -f docker-compose.prod.yml up -d
    DK->>DK: postgres and redis healthchecks pass first
    DK->>DK: app starts, waits depends_on healthy
    DK-->>SRV: containers running
    SRV-->>DW: deploy complete
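
The "jq to-entries builds .env file" step can be sketched as follows. This is an illustration under stated assumptions rather than the workflow's exact commands: `secret_json_to_env` is a hypothetical name, and the secret is assumed to be a flat JSON object of scalar values.

```shell
# Convert a flat JSON object on stdin into KEY=value lines on stdout.
# Returns non-zero without printing anything if the payload is not a JSON object.
# secret_json_to_env is a hypothetical helper name.
secret_json_to_env() {
    payload=$(cat)
    printf '%s' "$payload" | jq -e 'type == "object"' >/dev/null || return 1
    printf '%s' "$payload" | jq -r 'to_entries[] | "\(.key)=\(.value)"'
}

# Possible use in CI (secret id taken from this document):
#   aws secretsmanager get-secret-value --secret-id mr-mentor-backend/development \
#       --query SecretString --output text | secret_json_to_env > .env
```

Values that contain newlines would not survive this flat format, so the sketch is only suitable for single-line secrets.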

Journey 3 — Production blue-green deploy (zero downtime)

When the build for main succeeds, deploy-production.yml fires. It runs pre-checks, then calls the server-side deploy.sh which deploys to the idle color, health-checks it, switches Nginx, and keeps the old color stopped for rollback.

sequenceDiagram
    participant DW as deploy-production.yml
    participant SM as AWS Secrets Manager
    participant SRV as Prod server
    participant DS as deploy.sh on server
    participant NG as Nginx
    participant OLD as Old color container
    participant NEW as New color container

    DW->>SM: get-secret-value mr-mentor-backend production
    SM-->>DW: secret JSON
    DW->>DW: jq builds .env.original
    DW->>SRV: scp .env.original to blue-green-deployment
    DW->>SRV: ssh pre-deploy health check
    SRV->>SRV: verify nginx running and postgres pg_isready and redis ping
    DW->>SRV: docker login ghcr.io as mas-mr-mentor
    DW->>DS: run deploy.sh with image url
    DS->>DS: read .current_env to pick idle target color
    DS->>NEW: docker run target color on idle port
    DS->>NEW: poll docker health until healthy
    DS->>NEW: curl smoke test on idle port expects 200
    DS->>NG: sed switch proxy_pass to target port then nginx reload
    DS->>OLD: docker stop old container kept for rollback
    DS->>DS: write target color to .current_env
    DS-->>DW: success
    DW->>SRV: post-deploy verify health and db user count and redis ping
    SRV-->>DW: all checks pass
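
The "read .current_env to pick idle target color" step reduces to a two-way toggle. The sketch below shows that toggle together with the host ports drawn in the runtime topology (blue on 8000, green on 8001). The function names are hypothetical, and the server-side deploy.sh is not in this repo, so its real logic may differ.

```shell
# Given the live color, print the idle color that the next release targets.
idle_color() {
    case "$1" in
        blue)  echo green ;;
        green) echo blue ;;
        *)     echo "unknown color: $1" >&2; return 1 ;;
    esac
}

# Host port per color, as drawn in the runtime topology (blue 8000, green 8001).
port_for() {
    case "$1" in
        blue)  echo 8000 ;;
        green) echo 8001 ;;
        *)     return 1 ;;
    esac
}

# Usage on the prod server (path from this document):
#   live=$(cat /home/ubuntu/blue-green-deployment/.current_env)
#   target=$(idle_color "$live") && echo "deploying $target on port $(port_for "$target")"
```

Rejecting any value other than blue or green guards against a corrupted or empty .current_env silently selecting the wrong container.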

Journey 4 — Failed deploy with automatic rollback

If deploy.sh returns non-zero (the new color never becomes healthy, or the smoke test fails), the workflow dumps the new container's logs and invokes rollback.sh. Because the old color is only ever stopped, never removed, rollback amounts to starting the old container and pointing Nginx back at it.

sequenceDiagram
    participant DW as deploy-production.yml
    participant DS as deploy.sh
    participant RB as rollback.sh
    participant NG as Nginx
    participant OLD as Old color
    participant NEW as New color

    DW->>DS: run deploy.sh with image url
    DS->>NEW: start new color and wait healthy
    NEW-->>DS: stays unhealthy or smoke test fails
    DS-->>DW: non-zero exit code
    DW->>NEW: docker logs tail 50 for diagnosis
    DW->>RB: run rollback.sh
    RB->>OLD: docker start old color
    RB->>NG: switch proxy_pass back to old port then reload
    RB-->>DW: traffic restored to previous version
    DW-->>DW: job marked failed for investigation
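
The control flow of this journey (run the deploy, roll back on a non-zero exit, and still fail the job) can be sketched as a small wrapper. `deploy_or_rollback` is a hypothetical name and the two commands are placeholders passed in as strings; the real workflow runs these steps over SSH.

```shell
# Run a deploy command; if it fails, run the rollback command and still
# report failure so the CI job is marked failed.
# $1 = deploy command, $2 = rollback command (both placeholders).
deploy_or_rollback() {
    if sh -c "$1"; then
        echo "deploy succeeded"
        return 0
    fi
    echo "deploy failed, rolling back"
    sh -c "$2"
    return 1
}

# Usage (script names from this document, argument is hypothetical):
#   deploy_or_rollback "./deploy.sh $IMAGE_URL" "./rollback.sh"
```

Returning non-zero even after a successful rollback keeps the failed release visible in CI instead of hiding it behind restored traffic.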

Journey 5 — Manual deploy via deploy-docker.sh

For ad-hoc server-side deploys (no CI), an operator runs deploy-docker.sh. It logs into GHCR, stops the stack, pulls, and brings it back up with docker-compose.prod.yml. This is the simpler recreate path, not blue-green.

sequenceDiagram
    participant Op as Operator
    participant SH as deploy-docker.sh
    participant DK as Docker

    Op->>SH: IMAGE_TAG and GHCR_REPOSITORY set then run script
    SH->>SH: verify docker and docker compose installed
    SH->>DK: docker login ghcr.io if token provided
    SH->>DK: docker compose -f docker-compose.prod.yml down
    SH->>DK: docker compose -f docker-compose.prod.yml pull
    SH->>DK: docker compose -f docker-compose.prod.yml up -d
    SH->>SH: sleep 15 then show ps and logs tail 50
    SH->>Op: prompt to prune old images
    SH-->>Op: deployment complete on configured port

Journey 6 — Legacy PM2 deploy

The original flow, still present as deploy.sh + npm run deploy. It builds on the server and runs the bundle under PM2 cluster mode. Documented in CI-CD.md.

sequenceDiagram
    participant Op as Operator or CI
    participant SH as deploy.sh
    participant PM as PM2

    Op->>SH: run deploy.sh
    SH->>SH: tar backup of existing dist
    SH->>SH: git pull current branch
    SH->>SH: bun install production
    SH->>SH: bun run build esbuild to dist
    SH->>PM: pm2 restart mr-mentor-backend if exists
    PM->>PM: cluster mode instances max
    SH->>PM: pm2 save
    SH-->>Op: pm2 status printed

Background jobs & async

Deployment does not own BullMQ queues, but operators must know they exist because they affect restart behavior:

  • The app container starts 5 BullMQ workers in-process (email, database, cleanup, kpi, resumeAnalysis) plus scheduled jobs (cleanup every 24h, KPI every 15min). See the backend startup sequence in the project guide.
  • Restart caveat: BullMQ jobs that were in flight do not always auto-retry across a container recreate. After a deploy, stuck jobs may need a manual nudge (npm run queue:clear flushes all queues; use with care). The KPI/cleanup schedulers re-register on boot.
  • Socket.IO: the app serves WebSocket traffic for meetings. During a blue-green switch, existing WebSocket connections on the old color are dropped when it is stopped; clients reconnect to the new color via Nginx. Plan production deploys outside live-meeting windows where possible.
  • Bull Board queue-monitoring UI is mounted at /admin/queues and is reachable through the same Nginx proxy.

No deployment-specific webhooks exist; CI is triggered by push and workflow_run events, not inbound webhooks.


External integrations

| Integration | Used by deployment for | Env / secret | Failure / fallback |
| --- | --- | --- | --- |
| GHCR (ghcr.io) | Image registry; CI pushes, servers pull | GITHUB_TOKEN (CI push), GHCR_PULL_TOKEN + user mas-mr-mentor (prod pull), GHCR_USERNAME/GHCR_TOKEN (manual) | Pull failure aborts deploy; old container keeps serving. |
| AWS Secrets Manager | Source of truth for .env. CI fetches mr-mentor-backend/{development,production} and writes a flat .env via jq | AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION (default ap-south-1) | Secret missing/malformed (jq -e 'type == "object"' guard) fails the deploy before touching the server. |
| SSH (appleboy actions) | Copy files and run remote scripts | *_SERVER_HOST, *_SERVER_USER, *_SSH_KEY, *_SSH_PORT | SSH failure aborts the workflow step. |
| Nginx | TLS termination + traffic switch between colors | /etc/nginx/sites-available/api.myanalyticsschool.com | nginx -t validates config before systemctl reload; a bad config blocks the switch. |
| PostgreSQL 16 | Stateful DB container | DB_* env; volume postgres_data | pg_isready healthcheck gates app start; verified pre/post deploy. |
| Redis 7 | Cache + BullMQ broker | REDIS_* env; volume redis_data | redis-cli ping healthcheck; verified pre/post deploy. |
| AWS S3 | Recordings, documents, banner assets | AWS_S3_* buckets | App-level; not gated by deploy. |
| mr-hire-backend | AI services over internal network | MR_HIRE_BACKEND_URL (container DNS on mas-network) | Optional at boot; affects only Mr. Hire features. |
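
The Nginx traffic switch (sed on proxy_pass, then nginx -t, then reload) might look like the sketch below. The server-side script is not in this repo, so this is an assumption-laden illustration: `switch_proxy_port` is a hypothetical name, and the site file is assumed to proxy to `http://localhost:<port>`.

```shell
# Rewrite the proxy_pass port in an Nginx site file from $2 to $3.
# Writes to a temp file first so a failed sed never leaves a half-written config.
# Assumption: the site file proxies to http://localhost:<port>; the real
# server-side deploy.sh may match a different pattern.
switch_proxy_port() {
    conf="$1"; old="$2"; new="$3"
    tmp="$conf.tmp.$$"
    sed "s|\(proxy_pass http://localhost:\)$old|\1$new|" "$conf" > "$tmp" && mv "$tmp" "$conf"
}

# Usage on the prod server, validating the config before reloading:
#   switch_proxy_port /etc/nginx/sites-available/api.myanalyticsschool.com 8000 8001 \
#       && nginx -t && systemctl reload nginx
```

Chaining `nginx -t` before the reload mirrors the failure behaviour in the table: a config that does not validate never reaches the running server.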

Feature flags / toggles relevant at deploy time (from compose files): ENABLE_SEEDING (prod compose sets true to seed colleges/batches on first boot), USE_DIRECT_S3_UPLOAD, ALLOW_EARLY_MEETING_JOIN, MEETING_JOIN_BUFFER_MINUTES, TOKEN_VALUE.


Status lifecycles

Blue-green active color

The live color is tracked in .current_env. Each successful deploy flips it; rollback flips it back.

stateDiagram-v2
    [*] --> Blue
    Blue --> DeployingGreen : deploy.sh picks idle green
    DeployingGreen --> Green : green healthy and nginx switched
    DeployingGreen --> Blue : green unhealthy, rollback
    Green --> DeployingBlue : next deploy picks idle blue
    DeployingBlue --> Blue : blue healthy and nginx switched
    DeployingBlue --> Green : blue unhealthy, rollback

Container health (Docker HEALTHCHECK)

Every app container moves through Docker's health states; the deploy script waits up to ~80s (40 retries x 2s) for healthy before switching traffic.

stateDiagram-v2
    [*] --> starting
    starting --> healthy : /api/health returns 200 within start-period
    starting --> unhealthy : retries exhausted
    healthy --> unhealthy : 3 consecutive failed probes
    unhealthy --> healthy : probe recovers
    unhealthy --> [*] : deploy aborts and dumps logs
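
The bounded wait described above (40 attempts, 2 seconds apart) is a plain retry loop. The sketch below is a generic version, not the actual deploy script: `wait_healthy` is a hypothetical name and the probe is passed in as a command string.

```shell
# Poll a probe command until it succeeds or the retry budget is spent.
# $1 = probe command, $2 = max attempts (default 40), $3 = seconds between attempts (default 2).
wait_healthy() {
    probe="$1"; retries="${2:-40}"; interval="${3:-2}"
    attempt=1
    while [ "$attempt" -le "$retries" ]; do
        if sh -c "$probe" >/dev/null 2>&1; then
            echo "healthy after $attempt attempt(s)"
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$interval"
    done
    echo "still unhealthy after $retries attempts" >&2
    return 1
}

# Possible probe using Docker's own health state (container name from this document):
#   wait_healthy '[ "$(docker inspect --format "{{.State.Health.Status}}" mr-mentor-backend-green)" = "healthy" ]'
```

With the defaults the loop gives up after roughly 80 seconds, which matches the budget quoted above.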

Edge cases, limits & gotchas

  • mas-network is external. All three compose files declare networks: mas-network: external: true (dev compose uses app-network for its own services but still declares mas-network external). The network must exist (docker network create mas-network) before the first up, or compose fails. This shared network is how the backend reaches mr-hire-backend by container name.
  • Two package managers by design. Do not "simplify" the Dockerfile to a single Bun install. The deps stage uses npm specifically so bcrypt native prebuilds match the Node 20 runtime ABI. Bun-installed bcrypt from the builder stage is intentionally discarded.
  • PORT mismatch in PM2 config. ecosystem.config.js sets PORT: 3000, but the Docker image, compose files, and Nginx all use 8000. The CI-CD doc's troubleshooting text also says "port 3000". Treat 8000 as authoritative for the containerized backend; the PM2 path is legacy. (Noted discrepancy, not a runtime bug in the Docker flow.)
  • MR_HIRE_BACKEND_URL default differs across files. docker-compose.yml / .dev.yml default to http://mr-hire-backend:8001, but docker-compose.prod.yml defaults to http://mr-hire-backend:8000. Set it explicitly via Secrets Manager to avoid relying on the default. (inferred risk)
  • Dev deploy has a downtime blip. deploy-development.yml uses compose up -d recreate, not blue-green. Acceptable for dev; never use this path for production.
  • Old container is stopped, not removed. Blue-green rollback depends on the previous color still existing (stopped). A docker container prune between deploys would destroy the rollback target — prune only old images, and only after a deploy is confirmed good.
  • Secrets are written as a flat .env on the server. CI converts the Secrets Manager JSON object to KEY=value lines with jq and scps it. A non-object secret payload is rejected by the jq -e 'type == "object"' guard before deploy. The env file lives at /home/ubuntu/blue-green-deployment/.env(.original) (prod) and /home/ubuntu/mr-mentor-backend/.env (dev).
  • docker-compose.prod.yml requires GHCR_REPOSITORY and IMAGE_TAG. If either is unset, the image: line resolves to an invalid reference and up/pull fails. The deploy scripts/workflows export both.
  • Healthcheck without curl. The image and compose healthchecks shell out to node -e HTTP probes because the base images ship no curl (Alpine includes only BusyBox wget). Keep this in mind when editing healthcheck commands.
  • Seeding on prod boot. ENABLE_SEEDING=true in docker-compose.prod.yml means the app seeds colleges/batches if those tables are empty. With TypeORM synchronize: true also on, entity changes auto-apply to the prod DB on deploy — review entity changes carefully before shipping.
  • Logs are capped. Prod compose sets json-file logging with max-size: 10m, max-file: 3 per service; deeper history must come from external log shipping (not configured here).
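
The GHCR_REPOSITORY / IMAGE_TAG requirement from the list above can be checked before compose is ever invoked. This is a minimal sketch: `image_ref` is a hypothetical helper, and the reference format `ghcr.io/$GHCR_REPOSITORY:$IMAGE_TAG` is an assumption about how the compose file assembles its image: line.

```shell
# Print the full image reference, or fail if either variable is missing.
# Assumption: the compose file builds the reference as ghcr.io/$GHCR_REPOSITORY:$IMAGE_TAG.
image_ref() {
    if [ -z "${GHCR_REPOSITORY:-}" ] || [ -z "${IMAGE_TAG:-}" ]; then
        echo "GHCR_REPOSITORY and IMAGE_TAG must both be set" >&2
        return 1
    fi
    echo "ghcr.io/$GHCR_REPOSITORY:$IMAGE_TAG"
}

# Usage before a manual deploy:
#   image_ref >/dev/null && docker compose -f docker-compose.prod.yml pull
```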

Companion frontends (deployed alongside the backend)

Both Next.js frontends are built in CI and pulled from GHCR, like the backend. Only mr-mentor-frontend also uses the blue-green pattern; mas-website-live runs as a single container.

| Repo | Image base | Port | Runtime | Strategy | Nginx host |
| --- | --- | --- | --- | --- | --- |
| mr-mentor-frontend | oven/bun:1, Next.js standalone (server.js) | 3000 (blue) / 3001 (green) | node server.js as non-root user bun | Blue-green (deploy/blue-green/{deploy,rollback,health-check}.sh, network frontend-network) | mrmentor.in |
| mas-website-live | node:20-alpine, Next.js standalone | 8088 | node server.js | Single container (docker-compose-prod.yml) | (public site) |

Both bake NEXT_PUBLIC_* values as Docker build ARGs (they must be present at build time, not just runtime), so the build workflows pass them as --build-arg from GitHub secrets. The frontend blue-green deploy.sh mirrors the backend's: pick idle color, docker run on the idle port, poll Docker health, curl smoke test, sed the Nginx upstream, reload, then stop (not remove) the old color. See mr-mentor-frontend/deploy/blue-green/deploy.sh and mas-website-live/Dockerfile.