CI/CD Pipelines
This document is the canonical reference for the GitHub Actions CI/CD pipelines that build and deploy the three deployable MAS / Mr. Mentor web services: mr-mentor-backend (Express API), mr-mentor-frontend (Next.js admin/portal), and mas-website-live (Next.js public site). It covers every workflow (triggers, jobs, steps, target environment, and the GitHub Secrets they consume — names only), the build-then-deploy chaining model, the blue-green production deployment, and the claude.yml Claude automation workflow.
Status: documented from source on this branch.
Overview
All three services follow the same shape: GitHub Actions builds a Docker image, pushes it to GitHub Container Registry (GHCR), then a separate "deploy" workflow chained via workflow_run SSHes into an EC2 server and rolls the new image out. There is no Kubernetes, no managed PaaS — deployment is docker/docker compose over SSH (appleboy/ssh-action + appleboy/scp-action) onto two long-lived Ubuntu EC2 hosts:
- Development server (dev EC2, `15.206.142.123`): single-container rolling deploys.
- Production server (prod EC2, `13.234.60.63`): zero-downtime blue-green deploys behind nginx.
Personas / who uses this:
| Persona | Interaction with CI/CD |
|---|---|
| Backend / frontend engineers | Push to development (auto deploy to dev) or merge to main (auto deploy to prod). |
| Release engineer / maintainer | Manually re-trigger any workflow via workflow_dispatch; watch the blue-green deploy + verification steps; perform rollback. |
| Infra / on-call | Owns server-side scripts (deploy.sh, rollback.sh, health-check.sh), nginx config, AWS Secrets Manager, GHCR PAT rotation. |
| GitHub reviewers | Mention @claude in issues/PR comments to trigger the Claude Code automation workflow. |
Repo note. The backend repo is `MAS-Mr-Mentor/mr-mentor-backend`; the frontend is `MAS-Mr-Mentor/mr-mentor-frontend`. Pushing to GHCR/prod is restricted (a dedicated `MAS-intern`/`mas-mr-mentor` GHCR account owns the long-lived pull token). See the org memory note "mr-mentor-backend repo + deploy access".

Stale doc warning. The in-repo `mr-mentor-backend/CI-CD.md` and `mr-mentor-frontend/CI-CD.md` describe an older PM2 + `ci-cd.yml`/`deploy.yml` + Bun-over-SSH pipeline that no longer matches the actual workflow files. The split build/deploy + Docker/GHCR + blue-green model documented here (from the real `.github/workflows/*` files) is authoritative. Treat the legacy `CI-CD.md` PM2 sections as historical.
Key concepts & entities
This is a DevOps domain — it owns no TypeORM entities. The "entities" are pipeline artifacts and conventions:
| Term | Meaning |
|---|---|
| GHCR | GitHub Container Registry (`ghcr.io/<owner>/<repo>`). Holds the built Docker images. |
| Image tag convention | `ghcr.io/<repo-lowercase>:<branch>` plus `:<branch>-<shortsha>` (immutable, used by deploy), plus `:latest` on `main`. Frontend also adds floating `:development` / `:production` aliases. |
| `workflow_run` chaining | Deploy workflows do not trigger on push; they trigger when the matching build workflow completes successfully on the right branch. |
| `workflow_dispatch` | Manual "Run workflow" button; every build and deploy workflow supports it for re-runs / hotfix deploys. |
| GitHub Environment | `development` and `production` named environments (`environment:` key) that gate jobs and scope environment-specific secrets/reviewers. |
| AWS Secrets Manager (backend) | Backend runtime `.env` is fetched at deploy time from SM secrets `mr-mentor-backend/development` and `mr-mentor-backend/production` (see AWS Secrets Manager migration). |
| Inline `.env` (frontends) | Frontend `.env` is assembled from per-key GitHub Secrets, either as Docker build-args (public `NEXT_PUBLIC_*` baked into the image) or written to the server at deploy time. |
| Blue-green | Two prod containers (`-blue` on `:3000`, `-green` on `:3001`). Deploy targets the standby color, health-checks it, flips nginx, and keeps the old color stopped for instant rollback. |
| `.current_env` | A file on the prod server recording which color is live; `deploy.sh` reads/writes it. |
Source files:
- Backend: `mr-mentor-backend/.github/workflows/{build.yml, deploy-development.yml, deploy-production.yml, claude.yml}`
- Frontend: `mr-mentor-frontend/.github/workflows/{build-development.yml, build-production.yml, deploy-development.yml, deploy-production.yml, claude.yml}` + `mr-mentor-frontend/deploy/blue-green/{deploy.sh, rollback.sh, health-check.sh}`
- Website: `mas-website-live/.github/workflows/{build-development.yml, build-production.yml, deploy-development.yml, deploy-production.yml, claude.yml}`
- Prod blue-green scripts for the backend live on the server only (`/home/ubuntu/blue-green-deployment/{deploy.sh, rollback.sh, health-check.sh}`), not in the repo.
Architecture
flowchart TD
Dev["Engineer git push / PR merge"] --> GH["GitHub repo"]
subgraph CI["GitHub Actions CI"]
B["Build and Push Docker Image workflow"]
DD["Deploy to Development workflow"]
DP["Deploy to Production (Blue-Green) workflow"]
CL["Claude Code workflow (claude.yml)"]
end
GH -->|"push to development / main / staging"| B
GH -->|"comment or issue mentions @claude"| CL
B -->|"docker build target prod"| GHCR["GHCR image registry"]
B -->|"workflow_run completed success on development"| DD
B -->|"workflow_run completed success on main"| DP
subgraph AWS["AWS"]
SM["Secrets Manager (backend .env)"]
end
DD -->|"fetch .env (backend only)"| SM
DP -->|"fetch .env.original (backend only)"| SM
DD -->|"scp compose + .env, ssh docker compose up"| DEVSRV["Dev EC2 server"]
DP -->|"scp scripts + .env, ssh deploy.sh"| PRODSRV["Prod EC2 server"]
GHCR -->|"docker pull"| DEVSRV
GHCR -->|"docker pull"| PRODSRV
subgraph PRODSRV["Prod EC2 server"]
NGINX["nginx reverse proxy"]
BLUE["container blue :3000"]
GREEN["container green :3001"]
NGINX --> BLUE
NGINX -.->|"after flip"| GREEN
end
CL -->|"reads code, may push commits / PR comments"| GH
Pipeline catalogue
| Workflow | File | Trigger | Jobs | Target env |
|---|---|---|---|---|
| Build and Push Docker Image (backend) | `mr-mentor-backend/.github/workflows/build.yml` | push to `main`, `development`, `staging`; `workflow_dispatch` | `build-and-push` | n/a (pushes image to GHCR) |
| Deploy to Development (backend) | `mr-mentor-backend/.github/workflows/deploy-development.yml` | `workflow_run` of build on `development` (success); `workflow_dispatch` | `deploy` | `development` |
| Deploy to Production Blue-Green (backend) | `mr-mentor-backend/.github/workflows/deploy-production.yml` | `workflow_run` of build on `main` (success); `workflow_dispatch` | `deploy` | `production` |
| Claude Code (backend) | `mr-mentor-backend/.github/workflows/claude.yml` | issue/PR comment, review, issue opened/assigned containing `@claude` | `claude` | n/a |
| Build (Development) (frontend) | `mr-mentor-frontend/.github/workflows/build-development.yml` | push to `development`; `workflow_dispatch` | `build-and-push` | `development` |
| Build (Production) (frontend) | `mr-mentor-frontend/.github/workflows/build-production.yml` | push to `main`; `workflow_dispatch` | `build-and-push` | `production` |
| Deploy to Development (frontend) | `mr-mentor-frontend/.github/workflows/deploy-development.yml` | `workflow_run` of dev build (success); `workflow_dispatch` | `deploy` | `development` |
| Deploy to Production Blue-Green (frontend) | `mr-mentor-frontend/.github/workflows/deploy-production.yml` | `workflow_run` of prod build on `main` (success); `workflow_dispatch` | `deploy` | `production` |
| Claude Code (frontend) | `mr-mentor-frontend/.github/workflows/claude.yml` | `@claude` mentions | `claude` | n/a |
| Build (Development) (website) | `mas-website-live/.github/workflows/build-development.yml` | push to `development`; `workflow_dispatch` | `build-and-push` | `development` |
| Build (Production) (website) | `mas-website-live/.github/workflows/build-production.yml` | push to `main`; `workflow_dispatch` | `build-and-push` | `production` |
| Deploy to Development (website) | `mas-website-live/.github/workflows/deploy-development.yml` | `workflow_run` of dev build (success); `workflow_dispatch` | `deploy` | `development` |
| Deploy to Production (website) | `mas-website-live/.github/workflows/deploy-production.yml` | `workflow_run` of prod build on `main` (success); `workflow_dispatch` | `deploy` | `production` |
| Claude Code (website) | `mas-website-live/.github/workflows/claude.yml` | `@claude` mentions | `claude` | n/a |
The three `claude.yml` files are byte-identical across repos. The build/deploy workflows differ slightly per repo (see per-service sections).
Promotion flow
flowchart LR
F["feature branch"] -->|"merge / PR"| DEVB["development branch"]
DEVB -->|"build + auto deploy"| DEVENV["DEV environment"]
DEVB -->|"merge / PR"| MAIN["main branch"]
MAIN -->|"build + auto deploy"| PRODENV["PROD environment (blue-green)"]
PRODENV -.->|"rollback.sh on failure"| PREV["previous color kept stopped"]
- Push to `development` → builds dev image → auto-deploys to the dev server.
- Merge to `main` → builds prod image → auto blue-green deploys to prod.
- `staging` is accepted as a build branch in the backend `build.yml` only (no chained deploy workflow exists for it; it just produces a `:staging` image in GHCR).
- Any build or deploy workflow can be re-run manually via the Run workflow button (`workflow_dispatch`).
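
The branch rules above can be summarised as a small lookup. This is only an illustrative sketch in Python: the function and constant names are hypothetical, and the real behaviour lives in each workflow's `on:` triggers.

```python
from typing import Optional

# Branches whose successful build chains into a deploy, per the promotion flow.
DEPLOY_TARGETS = {"development": "development", "main": "production"}
# Branches that only the backend build.yml accepts, with no chained deploy.
BACKEND_BUILD_ONLY = {"staging"}


def promotion_target(branch: str) -> Optional[str]:
    """Return the environment a push to `branch` auto-deploys to, or None."""
    return DEPLOY_TARGETS.get(branch)


def is_built(branch: str, service: str) -> bool:
    """Return True if a push to `branch` produces an image in GHCR."""
    if branch in DEPLOY_TARGETS:
        return True
    return service == "backend" and branch in BACKEND_BUILD_ONLY
```

For example, `promotion_target("staging")` is `None` even though `is_built("staging", "backend")` is `True`, which matches the "image sits in GHCR" behaviour.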
Per-service details
Backend: build (`build.yml`)
- Triggers: push to `main`/`development`/`staging`, or manual. `concurrency: docker-build` with `cancel-in-progress: true` (a newer push cancels an in-flight build).
- Job `build-and-push` (`ubuntu-latest`):
    1. `actions/checkout@v4`.
    2. `docker/setup-buildx-action@v3`.
    3. Lowercase the repo name (GHCR requires lowercase).
    4. Generate tags: `:<branch-safe>`, `:<branch-safe>-<shortsha>`, plus `:latest` on `main`.
    5. `docker/login-action@v3` to `ghcr.io` as `${{ github.actor }}` with `secrets.GITHUB_TOKEN`.
    6. `docker/build-push-action@v5`: builds `target: prod` from `./Dockerfile`, pushes all tags, uses GitHub Actions layer cache (`cache-from/to: type=gha`), stamps OCI labels.
- Outputs: `image-tag`, `image-digest`, `repository-name`.
- Secrets: `GITHUB_TOKEN` (auto-provided).
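
As a rough illustration of the lowercase and tag-generation steps, the sketch below derives the tag list in Python. It makes two assumptions not confirmed by the source: that "branch-safe" means replacing `/` with `-`, and that the short SHA is the first 7 characters. The function name is hypothetical; the real workflow does this in a shell step.

```python
from typing import List


def image_tags(repository: str, branch: str, sha: str) -> List[str]:
    """Build the GHCR tags for one commit (illustrative only)."""
    # GHCR requires lowercase repository names.
    image = f"ghcr.io/{repository.lower()}"
    # Assumption: "branch-safe" replaces slashes so the tag stays valid.
    branch_safe = branch.replace("/", "-")
    # Assumption: the short SHA is the first 7 characters of the commit SHA.
    short_sha = sha[:7]
    tags = [f"{image}:{branch_safe}", f"{image}:{branch_safe}-{short_sha}"]
    if branch == "main":
        tags.append(f"{image}:latest")
    return tags
```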
Backend: deploy to development (`deploy-development.yml`)
- Triggers: `workflow_run` after Build and Push Docker Image completes on `development`; gated by `github.event.workflow_run.conclusion == 'success'`. Also `workflow_dispatch` with a `branch` input.
- Environment: `development`.
- Steps:
    1. Checkout at the build's `head_sha`.
    2. Resolve the branch into `steps.env.outputs.branch`.
    3. `scp` `docker-compose.prod.yml` to `/home/ubuntu/mr-mentor-backend/`.
    4. `aws-actions/configure-aws-credentials@v4`.
    5. Fetch `.env` from AWS Secrets Manager secret `mr-mentor-backend/development` → validate it's a JSON object → convert to `KEY=VALUE` lines → write `.env`.
    6. `scp` the `.env` to the server.
    7. SSH: set `IMAGE_TAG=<branch>`, `GHCR_REPOSITORY`, `GHCR_USERNAME`/`GHCR_TOKEN` (prefers long-lived `GHCR_PAT` as user `MAS-intern`, falls back to `github.actor` + `GITHUB_TOKEN`), `docker login`, then run `docker compose -f docker-compose.prod.yml` with `down`, `pull`, and `up -d` in sequence, wait 15s, print `ps` + last 50 log lines, `docker image prune`.
- Secrets: `DEVELOPMENT_SERVER_HOST`, `DEVELOPMENT_SERVER_USER`, `DEVELOPMENT_SSH_KEY`, `DEVELOPMENT_SSH_PORT` (optional), `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION` (optional), `GHCR_PAT`, `GITHUB_TOKEN`.
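
The fetch-validate-convert step can be pictured with the following Python sketch. The workflow itself uses shell and `jq`; this version only mirrors the described logic. The function name is hypothetical, and writing values unquoted is an assumption about the output format.

```python
import json


def secret_to_dotenv(secret_string: str) -> str:
    """Convert a Secrets Manager JSON string into KEY=VALUE lines."""
    data = json.loads(secret_string)
    # Mirror the "must be a JSON object" check: anything else aborts the deploy.
    if not isinstance(data, dict):
        raise ValueError("secret must be a JSON object")
    lines = []
    for key, value in data.items():
        # Assumption: strings are written as-is, other types as their JSON form.
        text = value if isinstance(value, str) else json.dumps(value)
        lines.append(f"{key}={text}")
    return "\n".join(lines) + "\n"
```

Raising on a non-object payload is the point of the check: a missing or malformed secret should stop the deploy rather than produce an empty `.env`.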
Backend: deploy to production blue-green (`deploy-production.yml`)
- Triggers: `workflow_run` after the build completes with `head_branch == 'main'` and `conclusion == 'success'`; or `workflow_dispatch`.
- Environment: `production`.
- Steps:
    1. Checkout at `head_sha`.
    2. Compute `IMAGE_TAG=main-<shortsha>` (for dispatch) or `<head_branch>-<shortsha>`, build `image_url`.
    3. Configure AWS creds; fetch `.env.original` from SM secret `mr-mentor-backend/production`; `scp` it to `/home/ubuntu/blue-green-deployment/`.
    4. SSH: `chmod +x deploy.sh rollback.sh health-check.sh`.
    5. Pre-deployment health check: runs `health-check.sh`, reads `.current_env`, asserts nginx active, `pg_isready` on `mr-mentor-postgres`, `redis-cli ping` on `mr-mentor-redis`.
    6. GHCR login on the prod server as `mas-mr-mentor` using `GHCR_PULL_TOKEN`.
    7. Deploy: `./deploy.sh "<image_url>"`; on non-zero exit it dumps the target container logs and runs `./rollback.sh`, then fails.
    8. Post-deployment verification (`if: always()`): reads `.current_env`, inspects the active container's published port, `curl`s `/api/health` and asserts `"success":true`, runs a `SELECT COUNT(*) FROM users`, `redis-cli ping`, and checks the nginx `proxy_pass` port matches the active port.
    9. Summary + success notification (Slack webhook is present but commented out).
- Secrets: `PRODUCTION_SERVER_HOST`, `PRODUCTION_SERVER_USER`, `PRODUCTION_SSH_KEY`, `PRODUCTION_SSH_PORT` (optional), `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION` (optional), `GHCR_PULL_TOKEN`.
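
Two of the post-deployment assertions (the health body and the nginx port match) are easy to misread, so here is a hedged Python sketch of what they check. The real step runs `curl` and shell on the server; the function names and the regular expression are illustrative assumptions.

```python
import json
import re
from typing import Optional


def health_ok(body: str) -> bool:
    """True if the /api/health body is JSON containing "success": true."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("success") is True


def proxy_pass_port(nginx_conf: str) -> Optional[int]:
    """Extract the port of the first proxy_pass directive, if present."""
    match = re.search(r"proxy_pass\s+https?://[^;\s]*?:(\d+)", nginx_conf)
    return int(match.group(1)) if match else None


def nginx_matches_active(nginx_conf: str, active_port: int) -> bool:
    """True if nginx forwards to the port published by the active container."""
    return proxy_pass_port(nginx_conf) == active_port
```

A mismatch between the two ports would mean the flip did not take effect, even if the new container itself is healthy.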
Frontend: builds (`build-development.yml`, `build-production.yml`)
Two separate build workflows (dev triggers on `development`, prod on `main`); each declares the matching GitHub environment so it can read env-scoped secrets. Both set `timeout-minutes: 20`.
- Dev build extra: restores/saves the Next.js `.next/cache` across runs via `actions/cache@v4` + `reproducible-containers/buildkit-cache-dance` (mounts cache into `/app/.next/cache`) so a code change doesn't pay for a full `next build`. Tags: `:<branch>`, `:<branch>-<shortsha>`, floating `:development`. Cache `mode=max`, `ignore-error=true`.
- Prod build: tags `:<branch>`, `:<branch>-<shortsha>`, floating `:production`, plus `:latest` on `main`.
- Both pass `build-args` so `NEXT_PUBLIC_*` values and NextAuth/Google config are baked into the image at build time (Next.js inlines public vars at build): `GOOGLE_CLIENT_ID`, `GOOGLE_CLIENT_SECRET`, `GOOGLE_REDIRECT_URI`, `NEXT_PUBLIC_BACKEND_URL`, `NEXTAUTH_URL`, `NEXTAUTH_URL_INTERNAL`, `NEXTAUTH_SECRET`, `NEXT_PUBLIC_FRONTEND_URL`, `NEXT_PUBLIC_TOKEN_VALUE`, `NEXT_PUBLIC_CDN_URL`, `NEXT_PUBLIC_TURN_SERVER_HOST/PORT/USERNAME/CREDENTIAL` (`NEXT_PUBLIC_NODE_ENV=production` is hardcoded).
- Secrets: `GITHUB_TOKEN` + all the build-arg secrets above (scoped to the environment).
Frontend: deploy to development (`deploy-development.yml`)
`workflow_run` after the dev build succeeds; `environment: development`, `timeout-minutes: 15`.
- Steps:
    1. `scp` `docker-compose.prod.yml` to `/home/ubuntu/mr-mentor-frontend/`.
    2. SSH-write `.env` from secrets (TURN host/port/user/cred hardcoded to the dev EC2 `15.206.142.123`).
    3. SSH deploy: log in as `MAS-intern` with `GHCR_PAT`, `docker network create app-network`, `docker compose` `down`/`pull`/`up -d`, wait, `ps` + logs, prune.
- Secrets: `DEVELOPMENT_SERVER_*`, `GHCR_PAT`, plus the per-key NextAuth/Google/`NEXT_PUBLIC_*` secrets, `GITHUB_TOKEN`.
Frontend: deploy to production blue-green (`deploy-production.yml`)
workflow_run after the prod build succeeds on main; environment: production. Distinctive steps:
1. SSH-write .env into /home/ubuntu/blue-green-frontend/ (TURN host hardcoded to the prod EC2 13.234.60.63).
2. scp deploy/blue-green/{deploy.sh,rollback.sh,health-check.sh} from the repo and flatten them into /home/ubuntu/blue-green-frontend/ (frontend ships its blue-green scripts in-repo, unlike the backend).
3. Pre-deploy health check + nginx-active assertion.
4. Run ./deploy.sh "<image_url>" with GHCR_USERNAME=MAS-intern + GHCR_TOKEN=GHCR_PAT; on failure run ./rollback.sh.
5. Post-deploy verification (if: always()): health-check.sh, curl https://www.mrmentor.in and https://www.mrmentor.in/api/auth/session; a 500 on the session endpoint is treated as a fatal "missing NEXTAUTH_SECRET" and dumps container logs.
Secrets: PRODUCTION_SERVER_*, GHCR_PAT, the NextAuth/Google/NEXT_PUBLIC_* secrets, GITHUB_TOKEN.
Website: builds & deploys
Simpler than the frontend (no blue-green; single container on `mas-network`).
- Builds: use `docker/metadata-action@v5` for tagging. Dev tags `development` + `development-<sha>`; prod tags `production` + `latest` + `production-<sha>`. `platforms: linux/amd64`. Build-args baked in: `NEXT_PUBLIC_GOOGLE_CLIENT_ID`, `NEXT_PUBLIC_BACKEND_URL`, `NEXT_PUBLIC_NODEJS_SERVER`, `NEXT_PUBLIC_CLOUDFRONT_URL`, `NEXT_PUBLIC_IMAGE_ENV`, `NEXT_PUBLIC_MR_MENTOR_FRONTEND_URL`, `NEXT_PUBLIC_SHOW_ENROLLMENT_STEP`. `timeout-minutes: 30`.
- Deploys: `workflow_run`-chained; `scp` `docker-compose-prod.yml` to `/home/ubuntu/mas-website-live/`; SSH login as `github.actor` with `GITHUB_TOKEN`, `docker network create mas-network`, `docker compose` `down`/`pull`/`up -d` (dev uses `IMAGE_TAG=<branch>`, prod uses `IMAGE_TAG=production`), wait, `ps` + logs, prune.
- Secrets: `DEVELOPMENT_SERVER_*` / `PRODUCTION_SERVER_*`, `GITHUB_TOKEN`, and the `NEXT_PUBLIC_*` build-arg secrets. The website deploy does not use a long-lived GHCR PAT or AWS Secrets Manager.
User journeys
Journey 1: Developer pushes to development (auto deploy to dev)
sequenceDiagram
participant Dev as Engineer
participant GH as GitHub
participant Build as Build workflow
participant GHCR as GHCR
participant Deploy as Deploy-dev workflow
participant SM as AWS Secrets Manager
participant Srv as Dev EC2 server
Dev->>GH: git push to development
GH->>Build: trigger on push
Build->>Build: checkout then buildx then generate tags
Build->>GHCR: docker build target prod and push branch and shortsha tags
Build-->>GH: workflow_run completed success
GH->>Deploy: trigger workflow_run on development
Note over Deploy: gated on conclusion success
Deploy->>Srv: scp docker-compose.prod.yml
Deploy->>SM: get-secret-value mr-mentor-backend development
SM-->>Deploy: JSON secret string
Deploy->>Deploy: convert JSON to dotenv .env
Deploy->>Srv: scp .env
Deploy->>Srv: ssh docker login then compose down pull up
Srv->>GHCR: docker pull image by tag
GHCR-->>Srv: image layers
Srv-->>Deploy: containers running and logs tail
Deploy-->>Dev: deployment summary in Actions log
Frontend and website variants are identical in shape, except the .env is assembled from per-key GitHub Secrets (written over SSH) instead of fetched from Secrets Manager.
Journey 2: Merge to main triggers blue-green production deploy
The headline production path. The deploy targets the standby color, proves it healthy, flips nginx, then keeps the previous color stopped for instant rollback.
sequenceDiagram
participant Dev as Engineer
participant GH as GitHub
participant Build as Build workflow
participant GHCR as GHCR
participant Deploy as Deploy-prod workflow
participant Srv as Prod EC2 server
participant Nginx as nginx
participant New as Standby container
participant Old as Active container
Dev->>GH: merge PR into main
GH->>Build: trigger on push to main
Build->>GHCR: push image main-shortsha and latest
Build-->>GH: workflow_run success
GH->>Deploy: trigger workflow_run head_branch main
Deploy->>Srv: scp env and chmod blue-green scripts
Deploy->>Srv: pre-deploy health-check nginx postgres redis
Srv-->>Deploy: pre-checks passed
Deploy->>Srv: ssh run deploy.sh with image url
Srv->>Srv: read current_env to pick target color
Srv->>GHCR: docker pull image url
Srv->>New: docker run standby on standby port
New-->>Srv: health status healthy
Srv->>New: curl smoke test returns 200
Srv->>Nginx: sed proxy_pass to standby port then reload
Nginx-->>Srv: config valid and reloaded
Srv->>Old: docker stop old kept for rollback
Srv->>Srv: write target color to current_env
Deploy->>Srv: post-deploy verify health api and db and redis
Srv-->>Deploy: all verifications passed
Deploy-->>Dev: production deployment summary
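
The "read current_env to pick target color" step in the sequence above amounts to choosing the color that is not live. The sketch below shows that idea in Python, using the port mapping from the key concepts table (blue on 3000, green on 3001). The actual `deploy.sh` is a shell script whose exact logic is not reproduced here; the names are hypothetical.

```python
from typing import Tuple

# Port mapping taken from the blue-green description in this document.
PORTS = {"blue": 3000, "green": 3001}


def pick_standby(current_env: str) -> Tuple[str, int]:
    """Given the contents of .current_env, return (standby color, port)."""
    active = current_env.strip()
    if active not in PORTS:
        raise ValueError(f"unknown color in .current_env: {active!r}")
    standby = "green" if active == "blue" else "blue"
    return standby, PORTS[standby]
```

Rejecting an unknown value is a deliberate choice in this sketch: guessing a color when `.current_env` is corrupt could overwrite the live container.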
Journey 3: Failed production deploy auto-rolls back
sequenceDiagram
participant Deploy as Deploy-prod workflow
participant Srv as Prod EC2 server
participant New as Standby container
participant Roll as rollback.sh
Deploy->>Srv: ssh run deploy.sh with image url
Srv->>New: docker run standby then wait for health
New-->>Srv: health status unhealthy or smoke test not 200
Srv->>Srv: deploy.sh exits non zero
Srv->>Srv: dump last 50 container log lines
Srv->>Roll: run rollback.sh
Roll-->>Srv: previous color restarted nginx restored
Srv-->>Deploy: step exits 1 job marked failed
Note over Deploy: nginx still points at last known good color
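
The failure path above is a simple "deploy, else dump logs, roll back, and fail" sequence. The following Python sketch models that control flow with injected callables so the order is explicit. It is not the workflow's shell code, and all names are hypothetical.

```python
from typing import Callable


def deploy_with_rollback(
    deploy: Callable[[], int],
    dump_logs: Callable[[], None],
    rollback: Callable[[], int],
) -> int:
    """Run deploy; on a non-zero exit dump logs, roll back, and return 1."""
    if deploy() == 0:
        return 0
    # Capture diagnostics before rollback changes the container state.
    dump_logs()
    rollback()
    # The step still fails so the job is marked failed in GitHub Actions.
    return 1
```

Note that a successful rollback does not turn the job green: the return value stays non-zero so the failed release remains visible.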
Journey 4: Manual hotfix deploy via workflow_dispatch
sequenceDiagram
participant Eng as Release engineer
participant GH as GitHub Actions UI
participant Deploy as Deploy workflow
participant Srv as Target server
Eng->>GH: click Run workflow and choose branch
GH->>Deploy: workflow_dispatch event
Note over Deploy: if condition allows dispatch without a build run
Deploy->>Deploy: derive IMAGE_TAG from branch and shortsha
Deploy->>Srv: same scp and ssh deploy steps as auto path
Srv-->>Deploy: deployed
Deploy-->>Eng: summary
When dispatched without a triggering build, github.event.workflow_run.head_sha is empty so checkout falls back to github.ref, and the prod image tag is computed as main-<shortsha>. The image must already exist in GHCR for that SHA.
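
To make the dispatch fallback concrete, here is a small Python sketch of the prod tag derivation. It assumes a 7-character short SHA and uses hypothetical function names; the workflow computes the same values in a shell step.

```python
from typing import Optional


def prod_image_tag(event_name: str, sha: str, head_branch: Optional[str] = None) -> str:
    """Derive the prod image tag for a workflow_run or workflow_dispatch event."""
    short_sha = sha[:7]  # assumption: short SHA is the first 7 characters
    if event_name == "workflow_dispatch":
        # No triggering build run, so the branch part is fixed to main.
        return f"main-{short_sha}"
    if not head_branch:
        raise ValueError("workflow_run events must carry head_branch")
    return f"{head_branch}-{short_sha}"


def image_url(repository: str, tag: str) -> str:
    """Full GHCR reference; the repository name is lowercased for GHCR."""
    return f"ghcr.io/{repository.lower()}:{tag}"
```

Because the tag is derived rather than looked up, nothing here proves the image exists; the pull on the server is the first point where a missing tag is detected.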
Journey 5: @claude automation on an issue or PR
sequenceDiagram
participant User as Reviewer or author
participant GH as GitHub
participant CL as Claude Code workflow
participant Action as claude-code-action
participant API as Anthropic
User->>GH: comment or open issue containing at claude
GH->>CL: issue_comment or pull_request_review or issues event
Note over CL: if guard requires the at claude token in body or title
CL->>CL: checkout fetch-depth 1
CL->>Action: run with oauth token and model sonnet
Action->>API: send context and instructions
API-->>Action: code edits or analysis
Action->>GH: push commit or post PR or issue comment
GH-->>User: Claude response visible inline
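
The `if:` guard in the sequence above is essentially a substring check over the text fields of the triggering event. The Python sketch below models it for the three event types named in the diagram. The payload field names (`comment.body`, `review.body`, `issue.body`, `issue.title`) follow GitHub's webhook payloads; the function names are hypothetical and the real guard is a YAML expression.

```python
from typing import Any, Dict, List

TRIGGER = "@claude"


def candidate_texts(event_name: str, payload: Dict[str, Any]) -> List[str]:
    """Collect the text fields the guard would inspect for this event."""
    if event_name == "issue_comment":
        return [payload.get("comment", {}).get("body") or ""]
    if event_name == "pull_request_review":
        return [payload.get("review", {}).get("body") or ""]
    if event_name == "issues":
        issue = payload.get("issue", {})
        return [issue.get("body") or "", issue.get("title") or ""]
    return []


def should_run_claude(event_name: str, payload: Dict[str, Any]) -> bool:
    """True only when the literal trigger token appears in the inspected text."""
    return any(TRIGGER in text for text in candidate_texts(event_name, payload))
```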
Background jobs & async
CI/CD here has no application-level BullMQ jobs; its async mechanics are GitHub Actions constructs:
- `workflow_run` chaining is the deploy trigger. A chained deploy never runs unless its build completed with `conclusion == 'success'` on the matching branch (`development` for dev, `main` for prod). This means a failed/cancelled build silently skips deployment.
- `concurrency` groups (`docker-build*`, with `cancel-in-progress: true`) cancel an in-flight build when a newer push arrives, so only the latest commit's image is built.
- Timeouts (frontend builds 20 min, deploys 15 min; website builds 30 min) abort hung jobs.
- The GitHub Actions layer cache (`type=gha`) and the frontend's persisted `.next/cache` are the optimizations that keep build times down.
- No scheduled (`cron`) workflows exist in any of the three repos. (A separate GHCR-login refresh workflow is referenced in org memory but is not present in these repo workflow directories.)
There are no webhooks wired into these pipelines beyond the (commented-out) Slack notification placeholder in the backend prod deploy.
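
The gating rule for deploy jobs described in this section can be written as a predicate. The Python sketch below is illustrative only: the real condition is an `if:` expression in the workflow YAML, and the function and parameter names are hypothetical.

```python
from typing import Optional


def should_deploy(
    event_name: str,
    conclusion: Optional[str],
    head_branch: Optional[str],
    required_branch: str,
) -> bool:
    """Decide whether a deploy job should run for the triggering event."""
    # Manual dispatch bypasses the build gate entirely.
    if event_name == "workflow_dispatch":
        return True
    if event_name != "workflow_run":
        return False
    # Chained runs need a successful build on the matching branch.
    return conclusion == "success" and head_branch == required_branch
```

For example, a build cancelled by the `concurrency` group yields `conclusion == "cancelled"`, so `should_deploy` is `False` and that commit is never deployed on its own.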
External integrations
| Integration | Used by | Env / secret names (names only) | Failure / fallback |
|---|---|---|---|
| GHCR (GitHub Container Registry) | all builds + deploys | `GITHUB_TOKEN` (build push), `GHCR_PAT` (backend+frontend server pulls, user `MAS-intern`), `GHCR_PULL_TOKEN` (backend prod, user `mas-mr-mentor`) | If the server's stored docker login is stale/mismatched, pulls return 401. Deploys re-login each run; backend prefers the long-lived PAT precisely so the credential survives after the run. |
| AWS Secrets Manager | backend deploys only | `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION` (default `ap-south-1`); secrets `mr-mentor-backend/development`, `mr-mentor-backend/production` | The fetch step (`set -euo pipefail` + `jq -e 'type=="object"'`) aborts the deploy if the secret is missing or not a JSON object, so there is no silent empty `.env`. |
| AWS EC2 over SSH | all deploys | `DEVELOPMENT_SERVER_HOST/USER`, `PRODUCTION_SERVER_HOST/USER`, `DEVELOPMENT_SSH_KEY`, `PRODUCTION_SSH_KEY`, `*_SSH_PORT` (default 22) | A bad key/host fails the SSH step and the deploy. |
| nginx (on servers) | prod blue-green | n/a (server config) | `deploy.sh` runs `nginx -t` before reload; a bad config aborts the flip. |
| Anthropic / Claude Code | `claude.yml` | `CLAUDE_CODE_OAUTH_TOKEN`; `model: claude-sonnet-4-6` | Job only runs when `@claude` appears in the triggering text; otherwise the `if:` guard skips it. |
| Google OAuth / NextAuth / TURN | frontend builds (baked in) | `GOOGLE_CLIENT_ID/SECRET`, `GOOGLE_REDIRECT_URI`, `NEXTAUTH_URL`, `NEXTAUTH_URL_INTERNAL`, `NEXTAUTH_SECRET`, `NEXT_PUBLIC_TURN_*` | Missing `NEXTAUTH_SECRET` surfaces as an HTTP 500 on `/api/auth/session`, which the frontend prod post-deploy verification treats as a fatal error. |
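
The "prefer the long-lived PAT, fall back to the run token" rule from the backend dev deploy can be sketched as follows. This is a Python illustration of shell logic inside the SSH step; the function name is hypothetical, and treating an empty string as "secret not set" is an assumption.

```python
from typing import Tuple


def ghcr_credentials(ghcr_pat: str, actor: str, github_token: str) -> Tuple[str, str]:
    """Return (username, token) for docker login on the server."""
    # Prefer the long-lived PAT so the stored login outlives the workflow run.
    if ghcr_pat:
        return "MAS-intern", ghcr_pat
    # Fallback: the ephemeral run token, which expires when the run ends.
    return actor, github_token
```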
Feature flags / env toggles seen in pipelines: NEXT_PUBLIC_SHOW_ENROLLMENT_STEP (website), NEXT_PUBLIC_IMAGE_ENV (website). Backend runtime flags (e.g. ENABLE_SEEDING, USE_DIRECT_S3_UPLOAD) live in the Secrets-Manager-managed .env, not in the workflow YAML.
Status lifecycles
Blue-green active color lifecycle (prod servers)
stateDiagram-v2
[*] --> Blue
Blue --> DeployingGreen: deploy.sh starts (blue active)
DeployingGreen --> Green: green healthy then nginx flipped then current_env=green
DeployingGreen --> Blue: green unhealthy then rollback.sh
Green --> DeployingBlue: next deploy (green active)
DeployingBlue --> Blue: blue healthy then nginx flipped then current_env=blue
DeployingBlue --> Green: blue unhealthy then rollback.sh
GitHub Actions run status (per deploy job)
stateDiagram-v2
[*] --> Queued
Queued --> Running: runner picks up job
Running --> Success: all steps pass
Running --> Failed: a step exits non zero
Running --> Cancelled: concurrency cancel or timeout
Failed --> [*]
Success --> [*]
Cancelled --> [*]
note right of Failed
prod deploy.sh failure also
triggers rollback.sh on server
end note
Edge cases, limits & gotchas
- Stale in-repo docs. `mr-mentor-backend/CI-CD.md` and `mr-mentor-frontend/CI-CD.md` document a defunct PM2/Bun-over-SSH pipeline (`ci-cd.yml`, `deploy.yml`, `SERVER_HOST`/`SSH_PRIVATE_KEY` secrets) that does not match the live Docker/GHCR/blue-green workflows. Don't rely on them; this doc is sourced from the actual workflow YAML.
- Deploy depends on build success, not on the commit. Because deploys trigger via `workflow_run`, a cancelled or failed build (including a `concurrency` cancellation) means no deploy happens for that commit, and the next push's build supersedes it.
- `workflow_dispatch` requires a pre-existing image. Manually dispatching a deploy computes the tag as `<branch>-<shortsha>` / `main-<shortsha>` and pulls it; if no build ever pushed that tag, the pull fails.
- `NEXT_PUBLIC_*` are baked at build time. Frontend/website public env values are compiled into the image via `build-args`. Changing them requires a rebuild, not just a redeploy. The deploy-time `.env` only affects server-side (non-public) values.
- GHCR credential identity matters. The backend dev deploy comment and frontend deploys both call out that logging into the server's docker as the ephemeral `github.actor`/`GITHUB_TOKEN` leaves a credential that dies with the run; the long-lived `GHCR_PAT` (user `MAS-intern`) / `GHCR_PULL_TOKEN` (user `mas-mr-mentor`) is used so subsequent server-side pulls keep working. Rotating the PAT is an operational task (see org memory "GHCR docker-login root cause + fix").
- Hardcoded TURN/IP values. Frontend deploy workflows hardcode the dev (`15.206.142.123`) and prod (`13.234.60.63`) TURN server hosts and `webrtcuser`/`webrtccred` credentials directly in the workflow `.env` heredoc rather than in secrets.
- Backend prod verification is opinionated. The post-deploy step runs `SELECT COUNT(*) FROM users` with hardcoded DB user `shubham` / db `mas` and container names `mr-mentor-postgres`/`mr-mentor-redis`; renaming those infra containers would break verification (not the app).
- `staging` builds an image but never deploys. Backend `build.yml` accepts `staging`; there is no chained `staging` deploy workflow, so a `:staging` image just sits in GHCR.
- `if: always()` on verification. Backend/frontend prod verification + summary run even on failure so logs are captured; the job's overall status still reflects the deploy step.
- No automated test/lint gate in the live pipelines. The current build workflows go straight from checkout to `docker build`; there is no separate test job in the active YAML (the legacy `CI-CD.md` "lint + unit test" stage is not implemented in the real workflows). Tests run locally / pre-merge, not as a CI gate (inferred from the absence of a test job).
- Claude workflow permissions. `claude.yml` grants `contents: write`, `pull-requests: write`, `issues: write`, `id-token: write`; it can push commits and comment. It is strictly gated on the literal `@claude` token in the comment/issue body or title.
Related docs
- Deployment Architecture: server topology, nginx, blue-green hosts, Docker Compose layout (the runtime side of what these pipelines deploy).
- System Context & Containers: how the three deployables fit the wider suite.
- Multi-Platform Architecture: the `x-platform` tenant routing the deployed backend serves.
- Background Jobs & Queues: the BullMQ workers that run inside the deployed backend container.
- System Design: overall design reference.