
CI/CD Pipelines

This document is the canonical reference for the GitHub Actions CI/CD pipelines that build and deploy the three deployable MAS / Mr. Mentor web services: mr-mentor-backend (Express API), mr-mentor-frontend (Next.js admin/portal), and mas-website-live (Next.js public site). It covers every workflow (triggers, jobs, steps, target environment, and the GitHub Secrets they consume — names only), the build-then-deploy chaining model, the blue-green production deployment, and the claude.yml Claude automation workflow.

Status: documented from source on this branch.


Overview

All three services follow the same shape: GitHub Actions builds a Docker image, pushes it to GitHub Container Registry (GHCR), then a separate "deploy" workflow chained via workflow_run SSHes into an EC2 server and rolls the new image out. There is no Kubernetes, no managed PaaS — deployment is docker/docker compose over SSH (appleboy/ssh-action + appleboy/scp-action) onto two long-lived Ubuntu EC2 hosts:

  • Development server (dev EC2, 15.206.142.123) — single-container rolling deploys.
  • Production server (prod EC2, 13.234.60.63) — zero-downtime blue-green deploys behind nginx.

Personas / who uses this:

| Persona | Interaction with CI/CD |
| --- | --- |
| Backend / frontend engineers | Push to development (auto deploy to dev) or merge to main (auto deploy to prod). |
| Release engineer / maintainer | Manually re-trigger any workflow via workflow_dispatch; watch the blue-green deploy + verification steps; perform rollback. |
| Infra / on-call | Owns server-side scripts (deploy.sh, rollback.sh, health-check.sh), nginx config, AWS Secrets Manager, GHCR PAT rotation. |
| GitHub reviewers | Mention @claude in issues/PR comments to trigger the Claude Code automation workflow. |

Repo note. The backend repo is MAS-Mr-Mentor/mr-mentor-backend; the frontend is MAS-Mr-Mentor/mr-mentor-frontend. Pushing to GHCR/prod is restricted (a dedicated MAS-intern / mas-mr-mentor GHCR account owns the long-lived pull token). See the org memory note "mr-mentor-backend repo + deploy access".

Stale doc warning. The in-repo mr-mentor-backend/CI-CD.md and mr-mentor-frontend/CI-CD.md describe an older PM2 + ci-cd.yml / deploy.yml + Bun-over-SSH pipeline that no longer matches the actual workflow files. The split build/deploy + Docker/GHCR + blue-green model documented here (from the real .github/workflows/* files) is authoritative. Treat the legacy CI-CD.md PM2 sections as historical.


Key concepts & entities

This is a DevOps domain — it owns no TypeORM entities. The "entities" are pipeline artifacts and conventions:

| Term | Meaning |
| --- | --- |
| GHCR | GitHub Container Registry (ghcr.io/<owner>/<repo>). Holds the built Docker images. |
| Image tag convention | ghcr.io/<repo-lowercase>:<branch> plus :<branch>-<shortsha> (immutable, used by deploy), plus :latest on main. Frontend also adds floating :development / :production aliases. |
| workflow_run chaining | Deploy workflows do not trigger on push; they trigger when the matching build workflow completes successfully on the right branch. |
| workflow_dispatch | Manual "Run workflow" button; every build and deploy workflow supports it for re-runs / hotfix deploys. |
| GitHub Environment | Named development and production environments (environment: key) that gate jobs and scope environment-specific secrets/reviewers. |
| AWS Secrets Manager (backend) | Backend runtime .env is fetched at deploy time from SM secrets mr-mentor-backend/development and mr-mentor-backend/production (see AWS Secrets Manager migration). |
| Inline .env (frontends) | Frontend .env is assembled from per-key GitHub Secrets, either as Docker build-args (public NEXT_PUBLIC_* baked into the image) or written to the server at deploy time. |
| Blue-green | Two prod containers (-blue on :3000, -green on :3001). Deploy targets the standby color, health-checks it, flips nginx, and keeps the old color stopped for instant rollback. |
| .current_env | A file on the prod server recording which color is live; deploy.sh reads/writes it. |
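The Blue-green and .current_env entries amount to a small selection rule. The following is a minimal Python sketch of that rule, not the contents of deploy.sh: the function name pick_standby and the PORTS map are hypothetical, the port numbers are the ones documented above, and treating an empty .current_env as "blue is live" is an assumption.

```python
from typing import Tuple

# Hypothetical sketch; the real selection logic lives in deploy.sh on the server.
PORTS = {"blue": 3000, "green": 3001}  # ports documented for the prod containers


def pick_standby(current_env: str) -> Tuple[str, int]:
    """Return the (color, port) that the next deploy should target."""
    # Assumption: an empty or missing state file means blue is currently live.
    active = current_env.strip().lower() or "blue"
    if active not in PORTS:
        raise ValueError(f"unknown color in .current_env: {active!r}")
    standby = "green" if active == "blue" else "blue"
    return standby, PORTS[standby]
```

Given the text read from .current_env, the function returns the color and port of the container that is not currently serving traffic.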

Source files:

  • Backend: mr-mentor-backend/.github/workflows/{build.yml, deploy-development.yml, deploy-production.yml, claude.yml}
  • Frontend: mr-mentor-frontend/.github/workflows/{build-development.yml, build-production.yml, deploy-development.yml, deploy-production.yml, claude.yml} + mr-mentor-frontend/deploy/blue-green/{deploy.sh, rollback.sh, health-check.sh}
  • Website: mas-website-live/.github/workflows/{build-development.yml, build-production.yml, deploy-development.yml, deploy-production.yml, claude.yml}
  • Prod blue-green scripts for the backend live on the server only (/home/ubuntu/blue-green-deployment/{deploy.sh, rollback.sh, health-check.sh}), not in the repo.

Architecture

flowchart TD
    Dev["Engineer git push / PR merge"] --> GH["GitHub repo"]

    subgraph CI["GitHub Actions CI"]
      B["Build and Push Docker Image workflow"]
      DD["Deploy to Development workflow"]
      DP["Deploy to Production (Blue-Green) workflow"]
      CL["Claude Code workflow (claude.yml)"]
    end

    GH -->|"push to development / main / staging"| B
    GH -->|"comment or issue mentions @claude"| CL
    B -->|"docker build target prod"| GHCR["GHCR image registry"]
    B -->|"workflow_run completed success on development"| DD
    B -->|"workflow_run completed success on main"| DP

    subgraph AWS["AWS"]
      SM["Secrets Manager (backend .env)"]
    end

    DD -->|"fetch .env (backend only)"| SM
    DP -->|"fetch .env.original (backend only)"| SM

    DD -->|"scp compose + .env, ssh docker compose up"| DEVSRV["Dev EC2 server"]
    DP -->|"scp scripts + .env, ssh deploy.sh"| PRODSRV["Prod EC2 server"]

    GHCR -->|"docker pull"| DEVSRV
    GHCR -->|"docker pull"| PRODSRV

    subgraph PRODSRV["Prod EC2 server"]
      NGINX["nginx reverse proxy"]
      BLUE["container blue :3000"]
      GREEN["container green :3001"]
      NGINX --> BLUE
      NGINX -.->|"after flip"| GREEN
    end

    CL -->|"reads code, may push commits / PR comments"| GH

Pipeline catalogue

| Workflow | File | Trigger | Jobs | Target env |
| --- | --- | --- | --- | --- |
| Build and Push Docker Image (backend) | mr-mentor-backend/.github/workflows/build.yml | push to main, development, staging; workflow_dispatch | build-and-push | n/a (pushes image to GHCR) |
| Deploy to Development (backend) | mr-mentor-backend/.github/workflows/deploy-development.yml | workflow_run of build on development (success); workflow_dispatch | deploy | development |
| Deploy to Production Blue-Green (backend) | mr-mentor-backend/.github/workflows/deploy-production.yml | workflow_run of build on main (success); workflow_dispatch | deploy | production |
| Claude Code (backend) | mr-mentor-backend/.github/workflows/claude.yml | issue/PR comment, review, issue opened/assigned containing @claude | claude | n/a |
| Build (Development) (frontend) | mr-mentor-frontend/.github/workflows/build-development.yml | push to development; workflow_dispatch | build-and-push | development |
| Build (Production) (frontend) | mr-mentor-frontend/.github/workflows/build-production.yml | push to main; workflow_dispatch | build-and-push | production |
| Deploy to Development (frontend) | mr-mentor-frontend/.github/workflows/deploy-development.yml | workflow_run of dev build (success); workflow_dispatch | deploy | development |
| Deploy to Production Blue-Green (frontend) | mr-mentor-frontend/.github/workflows/deploy-production.yml | workflow_run of prod build on main (success); workflow_dispatch | deploy | production |
| Claude Code (frontend) | mr-mentor-frontend/.github/workflows/claude.yml | @claude mentions | claude | n/a |
| Build (Development) (website) | mas-website-live/.github/workflows/build-development.yml | push to development; workflow_dispatch | build-and-push | development |
| Build (Production) (website) | mas-website-live/.github/workflows/build-production.yml | push to main; workflow_dispatch | build-and-push | production |
| Deploy to Development (website) | mas-website-live/.github/workflows/deploy-development.yml | workflow_run of dev build (success); workflow_dispatch | deploy | development |
| Deploy to Production (website) | mas-website-live/.github/workflows/deploy-production.yml | workflow_run of prod build on main (success); workflow_dispatch | deploy | production |
| Claude Code (website) | mas-website-live/.github/workflows/claude.yml | @claude mentions | claude | n/a |

The three claude.yml files are byte-identical across repos. The build/deploy workflows differ slightly per repo (see per-service sections).


Promotion flow

flowchart LR
    F["feature branch"] -->|"merge / PR"| DEVB["development branch"]
    DEVB -->|"build + auto deploy"| DEVENV["DEV environment"]
    DEVB -->|"merge / PR"| MAIN["main branch"]
    MAIN -->|"build + auto deploy"| PRODENV["PROD environment (blue-green)"]
    PRODENV -.->|"rollback.sh on failure"| PREV["previous color kept stopped"]
  • Push to development → builds dev image → auto-deploys to the dev server.
  • Merge to main → builds prod image → auto blue-green deploys to prod.
  • staging is accepted as a build branch in the backend build.yml only (no chained deploy workflow exists for it; it just produces a :staging image in GHCR).
  • Any workflow can be re-run manually via the Run workflow button (workflow_dispatch).

Per-service details

Backend — build (build.yml)

  • Triggers: push to main / development / staging, or manual. concurrency: docker-build with cancel-in-progress: true (a newer push cancels an in-flight build).
  • Job build-and-push (ubuntu-latest):
      • actions/checkout@v4.
      • docker/setup-buildx-action@v3.
      • Lowercase the repo name (GHCR requires lowercase).
      • Generate tags: :<branch-safe>, :<branch-safe>-<shortsha>, plus :latest on main.
      • docker/login-action@v3 to ghcr.io as ${{ github.actor }} with secrets.GITHUB_TOKEN.
      • docker/build-push-action@v5: builds target: prod from ./Dockerfile, pushes all tags, uses GitHub Actions layer cache (cache-from/to: type=gha), stamps OCI labels.
  • Outputs: image-tag, image-digest, repository-name.
  • Secrets: GITHUB_TOKEN (auto-provided).
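The lowercase and tag-generation steps can be illustrated in a few lines. This is a hedged Python sketch rather than the shell that build.yml actually runs: generate_tags is a hypothetical name, the rule used to make a branch name "safe" (replace anything outside letters, digits, dot, underscore and hyphen with a hyphen) is an assumption, and so is the 7-character short SHA.

```python
import re
from typing import List


def generate_tags(repository: str, branch: str, sha: str) -> List[str]:
    """Build the image references described in the tag-generation step."""
    image = f"ghcr.io/{repository.lower()}"  # GHCR requires a lowercase repository path
    # Assumption: "branch-safe" means characters outside [A-Za-z0-9._-] become "-".
    branch_safe = re.sub(r"[^A-Za-z0-9._-]", "-", branch)
    short_sha = sha[:7]  # assumption: 7-character short SHA
    tags = [f"{image}:{branch_safe}", f"{image}:{branch_safe}-{short_sha}"]
    if branch == "main":
        tags.append(f"{image}:latest")  # :latest is only added on main
    return tags
```

For a push to main this yields three references (branch, branch plus short SHA, and latest); for any other branch it yields the first two.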

Backend — deploy to development (deploy-development.yml)

  • Triggers: workflow_run after Build and Push Docker Image completes on development; gated by github.event.workflow_run.conclusion == 'success'. Also workflow_dispatch with a branch input.
  • Environment: development.
  • Steps:
      • Checkout at the build's head_sha.
      • Resolve the branch into steps.env.outputs.branch.
      • scp docker-compose.prod.yml to /home/ubuntu/mr-mentor-backend/.
      • aws-actions/configure-aws-credentials@v4.
      • Fetch .env from AWS Secrets Manager secret mr-mentor-backend/development → validate it's a JSON object → convert to KEY=VALUE lines → write .env.
      • scp the .env to the server.
      • SSH: set IMAGE_TAG=<branch>, GHCR_REPOSITORY, GHCR_USERNAME/GHCR_TOKEN (prefers long-lived GHCR_PAT as user MAS-intern, falls back to github.actor + GITHUB_TOKEN), docker login, then docker compose -f docker-compose.prod.yml down && pull && up -d, wait 15s, print ps + last 50 log lines, docker image prune.
  • Secrets: DEVELOPMENT_SERVER_HOST, DEVELOPMENT_SERVER_USER, DEVELOPMENT_SSH_KEY, DEVELOPMENT_SSH_PORT (optional), AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION (optional), GHCR_PAT, GITHUB_TOKEN.
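The "validate it's a JSON object, convert to KEY=VALUE lines" step is the part most worth seeing concretely. The workflow does this in shell with jq; the Python below is only an equivalent sketch under stated assumptions: secret_to_dotenv is a hypothetical name, non-string values are written in their JSON form, and no quoting or escaping is applied to values containing spaces or newlines.

```python
import json


def secret_to_dotenv(secret_string: str) -> str:
    """Convert a Secrets Manager JSON string into KEY=VALUE lines."""
    data = json.loads(secret_string)  # raises ValueError on malformed JSON
    if not isinstance(data, dict):
        # Mirrors the "must be a JSON object" check: abort instead of writing an empty .env.
        raise ValueError("secret must be a JSON object")
    lines = []
    for key, value in data.items():
        # Assumption: numbers and booleans are emitted in their JSON form.
        text = value if isinstance(value, str) else json.dumps(value)
        lines.append(f"{key}={text}")
    return "\n".join(lines) + "\n"
```

Raising on anything that is not an object matches the documented intent that a missing or malformed secret stops the deploy rather than producing a silent empty .env.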

Backend — deploy to production blue-green (deploy-production.yml)

  • Triggers: workflow_run after the build completes with head_branch == 'main' and conclusion == 'success'; or workflow_dispatch.
  • Environment: production.
  • Steps:
      • Checkout at head_sha.
      • Compute IMAGE_TAG = main-<shortsha> (for dispatch) or <head_branch>-<shortsha>, then build image_url.
      • Configure AWS creds; fetch .env.original from SM secret mr-mentor-backend/production; scp it to /home/ubuntu/blue-green-deployment/.
      • SSH: chmod +x deploy.sh rollback.sh health-check.sh.
      • Pre-deployment health check: runs health-check.sh, reads .current_env, asserts nginx is active, runs pg_isready on mr-mentor-postgres and redis-cli ping on mr-mentor-redis.
      • GHCR login on the prod server as mas-mr-mentor using GHCR_PULL_TOKEN.
      • Deploy: runs ./deploy.sh "<image_url>"; on non-zero exit it dumps the target container logs and runs ./rollback.sh, then fails.
      • Post-deployment verification (if: always()): reads .current_env, inspects the active container's published port, curls /api/health and asserts "success":true, runs a SELECT COUNT(*) FROM users, redis-cli ping, and checks that the nginx proxy_pass port matches the active port.
      • Summary + success notification (Slack webhook is present but commented out).
  • Secrets: PRODUCTION_SERVER_HOST, PRODUCTION_SERVER_USER, PRODUCTION_SSH_KEY, PRODUCTION_SSH_PORT (optional), AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION (optional), GHCR_PULL_TOKEN.
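The last verification check, that nginx forwards to the port of the active container, is easy to get subtly wrong. The sketch below shows one way to express it in Python; it is not the workflow's shell. Both function names are hypothetical, and the regular expression assumes a proxy_pass of the form http://host:port, which may not match the real server config.

```python
import re
from typing import Optional


def proxy_pass_port(nginx_conf: str) -> Optional[int]:
    """Return the port of the first proxy_pass directive, or None if absent."""
    # Assumption: the directive looks like "proxy_pass http://<host>:<port>;".
    match = re.search(r"proxy_pass\s+https?://[^\s;:/]+:(\d+)", nginx_conf)
    return int(match.group(1)) if match else None


def nginx_matches_active(nginx_conf: str, active_port: int) -> bool:
    """True when nginx forwards to the port published by the active container."""
    return proxy_pass_port(nginx_conf) == active_port
```

A mismatch here would mean the color recorded in .current_env and the color nginx actually serves have diverged, which is the situation this verification step exists to catch.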

Frontend — builds (build-development.yml, build-production.yml)

Two separate build workflows (dev triggers on development, prod on main); each declares the matching GitHub environment so it can read env-scoped secrets. Both timeout-minutes: 20.

  • Dev build extra: restores/saves the Next.js .next/cache across runs via actions/cache@v4 + reproducible-containers/buildkit-cache-dance (mounts cache into /app/.next/cache) so a code change doesn't pay a full next build. Tags: :<branch>, :<branch>-<shortsha>, floating :development. Cache mode=max,ignore-error=true.
  • Prod build: tags :<branch>, :<branch>-<shortsha>, floating :production, plus :latest on main.
  • Both pass build-args so NEXT_PUBLIC_* values and NextAuth/Google config are baked into the image at build time (Next.js inlines public vars at build): GOOGLE_CLIENT_ID, GOOGLE_CLIENT_SECRET, GOOGLE_REDIRECT_URI, NEXT_PUBLIC_BACKEND_URL, NEXTAUTH_URL, NEXTAUTH_URL_INTERNAL, NEXTAUTH_SECRET, NEXT_PUBLIC_FRONTEND_URL, NEXT_PUBLIC_TOKEN_VALUE, NEXT_PUBLIC_CDN_URL, NEXT_PUBLIC_TURN_SERVER_HOST/PORT/USERNAME/CREDENTIAL (NEXT_PUBLIC_NODE_ENV=production is hardcoded).
  • Secrets: GITHUB_TOKEN + all the build-arg secrets above (scoped to the environment).

Frontend — deploy to development (deploy-development.yml)

  • Triggers: workflow_run after the dev build succeeds.
  • Environment: development; timeout-minutes: 15.
  • Steps:
      • scp docker-compose.prod.yml to /home/ubuntu/mr-mentor-frontend/.
      • SSH-write .env from secrets (TURN host/port/user/cred hardcoded to the dev EC2 15.206.142.123).
      • SSH deploy: log in as MAS-intern with GHCR_PAT, docker network create app-network, docker compose down/pull/up -d, wait, ps + logs, prune.
  • Secrets: DEVELOPMENT_SERVER_*, GHCR_PAT, plus the per-key NextAuth/Google/NEXT_PUBLIC_* secrets, GITHUB_TOKEN.

Frontend — deploy to production blue-green (deploy-production.yml)

  • Triggers: workflow_run after the prod build succeeds on main.
  • Environment: production.
  • Distinctive steps:
      1. SSH-write .env into /home/ubuntu/blue-green-frontend/ (TURN host hardcoded to the prod EC2 13.234.60.63).
      2. scp deploy/blue-green/{deploy.sh,rollback.sh,health-check.sh} from the repo and flatten them into /home/ubuntu/blue-green-frontend/ (frontend ships its blue-green scripts in-repo, unlike the backend).
      3. Pre-deploy health check + nginx-active assertion.
      4. Run ./deploy.sh "<image_url>" with GHCR_USERNAME=MAS-intern + GHCR_TOKEN=GHCR_PAT; on failure run ./rollback.sh.
      5. Post-deploy verification (if: always()): health-check.sh, curl https://www.mrmentor.in and https://www.mrmentor.in/api/auth/session; a 500 on the session endpoint is treated as a fatal "missing NEXTAUTH_SECRET" and dumps container logs.
  • Secrets: PRODUCTION_SERVER_*, GHCR_PAT, the NextAuth/Google/NEXT_PUBLIC_* secrets, GITHUB_TOKEN.

Website — builds & deploys

Simpler than the frontend (no blue-green; single container on mas-network).

  • Builds: use docker/metadata-action@v5 for tagging. Dev tags development + development-<sha>; prod tags production + latest + production-<sha>. platforms: linux/amd64. Build-args baked in: NEXT_PUBLIC_GOOGLE_CLIENT_ID, NEXT_PUBLIC_BACKEND_URL, NEXT_PUBLIC_NODEJS_SERVER, NEXT_PUBLIC_CLOUDFRONT_URL, NEXT_PUBLIC_IMAGE_ENV, NEXT_PUBLIC_MR_MENTOR_FRONTEND_URL, NEXT_PUBLIC_SHOW_ENROLLMENT_STEP. timeout-minutes: 30.
  • Deploys: workflow_run-chained; scp docker-compose-prod.yml to /home/ubuntu/mas-website-live/; SSH login as github.actor with GITHUB_TOKEN, docker network create mas-network, docker compose down/pull/up -d (dev uses IMAGE_TAG=<branch>, prod uses IMAGE_TAG=production), wait, ps+logs, prune.
  • Secrets: DEVELOPMENT_SERVER_* / PRODUCTION_SERVER_*, GITHUB_TOKEN, and the NEXT_PUBLIC_* build-arg secrets. The website deploy does not use a long-lived GHCR PAT or AWS Secrets Manager.

User journeys

Journey 1 — Developer pushes to development (auto deploy to dev)

sequenceDiagram
    participant Dev as Engineer
    participant GH as GitHub
    participant Build as Build workflow
    participant GHCR as GHCR
    participant Deploy as Deploy-dev workflow
    participant SM as AWS Secrets Manager
    participant Srv as Dev EC2 server

    Dev->>GH: git push to development
    GH->>Build: trigger on push
    Build->>Build: checkout then buildx then generate tags
    Build->>GHCR: docker build target prod and push branch and shortsha tags
    Build-->>GH: workflow_run completed success
    GH->>Deploy: trigger workflow_run on development
    Note over Deploy: gated on conclusion success
    Deploy->>Srv: scp docker-compose.prod.yml
    Deploy->>SM: get-secret-value mr-mentor-backend development
    SM-->>Deploy: JSON secret string
    Deploy->>Deploy: convert JSON to dotenv .env
    Deploy->>Srv: scp .env
    Deploy->>Srv: ssh docker login then compose down pull up
    Srv->>GHCR: docker pull image by tag
    GHCR-->>Srv: image layers
    Srv-->>Deploy: containers running and logs tail
    Deploy-->>Dev: deployment summary in Actions log

Frontend and website variants are identical in shape, except the .env is assembled from per-key GitHub Secrets (written over SSH) instead of fetched from Secrets Manager.

Journey 2 — Merge to main triggers blue-green production deploy

The headline production path. The deploy targets the standby color, proves it healthy, flips nginx, then keeps the previous color stopped for instant rollback.

sequenceDiagram
    participant Dev as Engineer
    participant GH as GitHub
    participant Build as Build workflow
    participant GHCR as GHCR
    participant Deploy as Deploy-prod workflow
    participant Srv as Prod EC2 server
    participant Nginx as nginx
    participant New as Standby container
    participant Old as Active container

    Dev->>GH: merge PR into main
    GH->>Build: trigger on push to main
    Build->>GHCR: push image main-shortsha and latest
    Build-->>GH: workflow_run success
    GH->>Deploy: trigger workflow_run head_branch main
    Deploy->>Srv: scp env and chmod blue-green scripts
    Deploy->>Srv: pre-deploy health-check nginx postgres redis
    Srv-->>Deploy: pre-checks passed
    Deploy->>Srv: ssh run deploy.sh with image url
    Srv->>Srv: read current_env to pick target color
    Srv->>GHCR: docker pull image url
    Srv->>New: docker run standby on standby port
    New-->>Srv: health status healthy
    Srv->>New: curl smoke test returns 200
    Srv->>Nginx: sed proxy_pass to standby port then reload
    Nginx-->>Srv: config valid and reloaded
    Srv->>Old: docker stop old kept for rollback
    Srv->>Srv: write target color to current_env
    Deploy->>Srv: post-deploy verify health api and db and redis
    Srv-->>Deploy: all verifications passed
    Deploy-->>Dev: production deployment summary

Journey 3 — Failed production deploy auto-rolls back

sequenceDiagram
    participant Deploy as Deploy-prod workflow
    participant Srv as Prod EC2 server
    participant New as Standby container
    participant Roll as rollback.sh

    Deploy->>Srv: ssh run deploy.sh with image url
    Srv->>New: docker run standby then wait for health
    New-->>Srv: health status unhealthy or smoke test not 200
    Srv->>Srv: deploy.sh exits non zero
    Srv->>Srv: dump last 50 container log lines
    Srv->>Roll: run rollback.sh
    Roll-->>Srv: previous color restarted nginx restored
    Srv-->>Deploy: step exits 1 job marked failed
    Note over Deploy: nginx still points at last known good color

Journey 4 — Manual hotfix deploy via workflow_dispatch

sequenceDiagram
    participant Eng as Release engineer
    participant GH as GitHub Actions UI
    participant Deploy as Deploy workflow
    participant Srv as Target server

    Eng->>GH: click Run workflow and choose branch
    GH->>Deploy: workflow_dispatch event
    Note over Deploy: if condition allows dispatch without a build run
    Deploy->>Deploy: derive IMAGE_TAG from branch and shortsha
    Deploy->>Srv: same scp and ssh deploy steps as auto path
    Srv-->>Deploy: deployed
    Deploy-->>Eng: summary

When dispatched without a triggering build, github.event.workflow_run.head_sha is empty so checkout falls back to github.ref, and the prod image tag is computed as main-<shortsha>. The image must already exist in GHCR for that SHA.
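The tag resolution described above can be sketched as follows. This is an illustrative Python version, not the workflow's own script: resolve_image_tag and image_url are hypothetical names, and the 7-character short SHA is an assumption.

```python
def resolve_image_tag(event_name: str, sha: str, head_branch: str = "") -> str:
    """Pick the immutable <branch>-<shortsha> tag for a prod deploy."""
    short_sha = sha[:7]  # assumption: 7-character short SHA
    if event_name == "workflow_run" and head_branch:
        # Chained run: the branch comes from the build that triggered the deploy.
        return f"{head_branch}-{short_sha}"
    # Manual dispatch: there is no triggering build, so the tag defaults to main-<shortsha>.
    return f"main-{short_sha}"


def image_url(repository: str, tag: str) -> str:
    """Compose the full GHCR reference that is handed to deploy.sh."""
    return f"ghcr.io/{repository.lower()}:{tag}"
```

Because the tag embeds the commit SHA, a dispatch for a commit that was never built resolves to a reference that does not exist in GHCR, and the pull on the server fails.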

Journey 5 — @claude automation on an issue or PR

sequenceDiagram
    participant User as Reviewer or author
    participant GH as GitHub
    participant CL as Claude Code workflow
    participant Action as claude-code-action
    participant API as Anthropic

    User->>GH: comment or open issue containing at claude
    GH->>CL: issue_comment or pull_request_review or issues event
    Note over CL: if guard requires the at claude token in body or title
    CL->>CL: checkout fetch-depth 1
    CL->>Action: run with oauth token and model sonnet
    Action->>API: send context and instructions
    API-->>Action: code edits or analysis
    Action->>GH: push commit or post PR or issue comment
    GH-->>User: Claude response visible inline
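The if guard in this journey boils down to a substring check on the triggering text. The Python sketch below is a simplified illustration and not a transcription of claude.yml: mentions_claude is a hypothetical name, only the three event types named in the diagram are handled, and the payload fields used (comment.body, review.body, issue.body, issue.title) follow the standard GitHub webhook payload shapes.

```python
TRIGGER = "@claude"


def mentions_claude(event_name: str, payload: dict) -> bool:
    """Return True when the triggering comment, review, or issue mentions @claude."""
    if event_name == "issue_comment":
        texts = [payload.get("comment", {}).get("body")]
    elif event_name == "pull_request_review":
        texts = [payload.get("review", {}).get("body")]
    elif event_name == "issues":
        issue = payload.get("issue", {})
        texts = [issue.get("body"), issue.get("title")]
    else:
        texts = []  # any other event never starts the job in this sketch
    # Missing fields arrive as None, so normalise them to empty strings first.
    return any(TRIGGER in (text or "") for text in texts)
```

When the check is false the job is skipped entirely, so no checkout happens and no token is used.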

Background jobs & async

CI/CD here has no application-level BullMQ jobs — its async mechanics are GitHub Actions constructs:

  • workflow_run chaining is the deploy trigger. Outside of a manual workflow_dispatch, a deploy runs only when its build workflow completed on the matching branch with conclusion == 'success'. A failed or cancelled build therefore silently skips deployment.
  • concurrency groups (docker-build*, with cancel-in-progress: true) cancel an in-flight build when a newer push arrives, so only the latest commit's image is built.
  • Timeouts — frontend builds 20 min, deploys 15 min; website builds 30 min — abort hung jobs.
  • The GitHub Actions layer cache (type=gha) and the frontend's persisted .next/cache are the build-time optimizations that keep build durations down.
  • No scheduled (cron) workflows exist in any of the three repos. (A separate GHCR-login refresh workflow is referenced in org memory but is not present in these repo workflow directories.)
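The workflow_run gate described in this list can be written as a single predicate. The sketch below is a Python paraphrase of the documented conditions, not the YAML if: expression itself; should_deploy and its parameter names are hypothetical.

```python
def should_deploy(
    event_name: str,
    conclusion: str = "",
    head_branch: str = "",
    required_branch: str = "main",
) -> bool:
    """Decide whether a deploy job should run for a given trigger."""
    if event_name == "workflow_dispatch":
        return True  # manual runs bypass the build gate
    return (
        event_name == "workflow_run"
        and conclusion == "success"  # failed or cancelled builds never deploy
        and head_branch == required_branch
    )
```

A build cancelled by the concurrency group reports a conclusion other than success, so under this rule no deploy starts for the superseded commit.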

There are no webhooks wired into these pipelines beyond the (commented-out) Slack notification placeholder in the backend prod deploy.


External integrations

| Integration | Used by | Env / secret names (names only) | Failure / fallback |
| --- | --- | --- | --- |
| GHCR (GitHub Container Registry) | all builds + deploys | GITHUB_TOKEN (build push), GHCR_PAT (backend+frontend server pulls, user MAS-intern), GHCR_PULL_TOKEN (backend prod, user mas-mr-mentor) | If the server's stored docker login is stale or mismatched, pulls fail with HTTP 401. Deploys re-login each run; the backend prefers the long-lived PAT precisely so the credential survives after the run. |
| AWS Secrets Manager | backend deploys only | AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION (default ap-south-1); secrets mr-mentor-backend/development, mr-mentor-backend/production | The fetch step uses set -euo pipefail and jq -e 'type=="object"', so the deploy aborts if the secret is missing or not a JSON object; there is no silent empty .env. |
| AWS EC2 over SSH | all deploys | DEVELOPMENT_SERVER_HOST/USER, PRODUCTION_SERVER_HOST/USER, DEVELOPMENT_SSH_KEY, PRODUCTION_SSH_KEY, *_SSH_PORT (default 22) | A bad key/host fails the SSH step and the deploy. |
| nginx (on servers) | prod blue-green | n/a (server config) | deploy.sh runs nginx -t before reload; a bad config aborts the flip. |
| Anthropic / Claude Code | claude.yml | CLAUDE_CODE_OAUTH_TOKEN; model: claude-sonnet-4-6 | Job only runs when @claude appears in the triggering text; otherwise the if: guard skips it. |
| Google OAuth / NextAuth / TURN | frontend builds (baked in) | GOOGLE_CLIENT_ID/SECRET, GOOGLE_REDIRECT_URI, NEXTAUTH_URL, NEXTAUTH_URL_INTERNAL, NEXTAUTH_SECRET, NEXT_PUBLIC_TURN_* | Missing NEXTAUTH_SECRET surfaces as an HTTP 500 on /api/auth/session, which the frontend prod post-deploy verification treats as a fatal error. |

Feature flags / env toggles seen in pipelines: NEXT_PUBLIC_SHOW_ENROLLMENT_STEP (website), NEXT_PUBLIC_IMAGE_ENV (website). Backend runtime flags (e.g. ENABLE_SEEDING, USE_DIRECT_S3_UPLOAD) live in the Secrets-Manager-managed .env, not in the workflow YAML.


Status lifecycles

Blue-green active color lifecycle (prod servers)

stateDiagram-v2
    [*] --> Blue
    Blue --> DeployingGreen: deploy.sh starts (blue active)
    DeployingGreen --> Green: green healthy then nginx flipped then current_env=green
    DeployingGreen --> Blue: green unhealthy then rollback.sh
    Green --> DeployingBlue: next deploy (green active)
    DeployingBlue --> Blue: blue healthy then nginx flipped then current_env=blue
    DeployingBlue --> Green: blue unhealthy then rollback.sh
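The lifecycle above has only one real rule: a healthy standby becomes the active color, and an unhealthy one leaves the active color unchanged. The following Python sketch replays that rule over a sequence of deploy outcomes. It is an illustration of the state diagram only; simulate_deploys is a hypothetical name and nothing here runs on the servers.

```python
from typing import Iterable, List


def simulate_deploys(outcomes: Iterable[bool], start: str = "blue") -> List[str]:
    """Return the active color after each deploy attempt.

    Each item in `outcomes` is True when the standby container came up healthy
    (nginx flips to it) and False when it did not (rollback keeps the old color).
    """
    active = start
    history = []
    for healthy in outcomes:
        standby = "green" if active == "blue" else "blue"
        if healthy:
            active = standby  # flip: the standby becomes the live color
        history.append(active)
    return history
```

Starting from blue, a successful deploy, a failed deploy, and another successful deploy leave the active color at green, green, and then blue.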

GitHub Actions run status (per deploy job)

stateDiagram-v2
    [*] --> Queued
    Queued --> Running: runner picks up job
    Running --> Success: all steps pass
    Running --> Failed: a step exits non zero
    Running --> Cancelled: concurrency cancel or timeout
    Failed --> [*]
    Success --> [*]
    Cancelled --> [*]
    note right of Failed
        prod deploy.sh failure also
        triggers rollback.sh on server
    end note

Edge cases, limits & gotchas

  • Stale in-repo docs. mr-mentor-backend/CI-CD.md and mr-mentor-frontend/CI-CD.md document a defunct PM2/Bun-over-SSH pipeline (ci-cd.yml, deploy.yml, SERVER_HOST/SSH_PRIVATE_KEY secrets) that does not match the live Docker/GHCR/blue-green workflows. Don't rely on them; this doc is sourced from the actual workflow YAML.
  • Deploy depends on build success, not on the commit. Because deploys trigger via workflow_run, a cancelled or failed build (including a concurrency cancellation) means no deploy happens for that commit — and the next push's build supersedes it.
  • workflow_dispatch requires a pre-existing image. Manually dispatching a deploy computes the tag as <branch>-<shortsha> / main-<shortsha> and pulls it; if no build ever pushed that tag, the pull fails.
  • NEXT_PUBLIC_* are baked at build time. Frontend/website public env values are compiled into the image via build-args. Changing them requires a rebuild, not just a redeploy. The deploy-time .env only affects server-side (non-public) values.
  • GHCR credential identity matters. Comments in the backend dev deploy and the frontend deploy workflows both note that logging the server's docker in as the ephemeral github.actor/GITHUB_TOKEN leaves a credential that expires with the run. The long-lived GHCR_PAT (user MAS-intern) or GHCR_PULL_TOKEN (user mas-mr-mentor) is used instead so that subsequent server-side pulls keep working. Rotating the PAT is an operational task (see org memory "GHCR docker-login root cause + fix").
  • Hardcoded TURN/IP values. Frontend deploy workflows hardcode the dev (15.206.142.123) and prod (13.234.60.63) TURN server hosts and webrtcuser/webrtccred credentials directly in the workflow .env heredoc rather than in secrets.
  • Backend prod verification is opinionated. The post-deploy step runs SELECT COUNT(*) FROM users with hardcoded DB user shubham/db mas and container names mr-mentor-postgres/mr-mentor-redis; renaming those infra containers would break verification (not the app).
  • staging builds an image but never deploys. Backend build.yml accepts staging; there is no chained staging deploy workflow, so a :staging image just sits in GHCR.
  • if: always() on verification. Backend/frontend prod verification + summary run even on failure so logs are captured; the job's overall status still reflects the deploy step.
  • No automated test/lint gate in the live pipelines. The current build workflows go straight from checkout to docker build; there is no separate test job in the active YAML (the legacy CI-CD.md "lint + unit test" stage is not implemented in the real workflows). Tests run locally / pre-merge, not as a CI gate (inferred from the absence of a test job).
  • Claude workflow permissions. claude.yml grants contents: write, pull-requests: write, issues: write, id-token: write; it can push commits and comment. It is strictly gated on the literal @claude token in the comment/issue body or title.