molecule-core/workspace-server/Dockerfile.tenant
cp-lead 12bb73d000
Some checks failed
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 4s
sop-tier-check / tier-check (pull_request) Failing after 4s
fix(dockerfile-tenant): chown /org-templates to canvas user so !external resolver can mkdir cache
Root cause:
  Dockerfile.tenant chowns /canvas /platform /memory-plugin /migrations
  to canvas:canvas (line ~119) but not /org-templates. The image is
  built as root, COPY-ed templates inherit root:root 0755. The platform
  binary then runs as the canvas user (uid 1000) because of the USER
  directive on line ~124, so when the !external resolver
  (org_external.go, internal#77 / task #222) tries
  os.MkdirAll("/org-templates/<tmpl>/.external-cache/<repo>") on first
  import, mkdir(2) returns EACCES and the import handler returns 400
  "org template expansion failed" (org.go:592). The user-facing error
  is generic; only the server log carries:

    Org import: refusing import: !include expansion failed:
    !external at line 156: fetch git.moleculesai.app/molecule-ai/molecule-dev-department@v1.0.0:
    mkdir cache root: mkdir /org-templates/molecule-dev/.external-cache: permission denied

Repro:
  Tenant staging-cplead-2 (canary AWS 004947743811, image SHA
  a93c4ce17725...). POST /org/import {"dir":"molecule-dev"} returns 400
  while POST /org/import {"dir":"free-beats-all"} returns 201 — only
  templates with !external trip the bug.

Fix:
  Add /org-templates to the chown -R argv. One-line change. Same
  ownership shape as the other writable platform-state dirs.

Why this is safe for prod:
  * The platform binary already needs read access to /org-templates,
    so canvas:canvas owning it doesn't widen any attack surface.
  * /org-templates is image-resident, not bind-mounted; chown applies
    inside the image layers and prod tenants get the fix on next
    image rebuild + redeploy. Live prod tenants are unaffected until
    the next deploy (no orgs currently using !external in prod —
    molecule-dev consumers are all internal staging).

Verification:
  After hand-applying the chown live (docker exec --user 0 ... chown -R
  canvas:canvas /org-templates/molecule-dev), POST /org/import
  {"dir":"molecule-dev"} returns 201 with 39 workspaces; cp-lead +
  CP-BE + CP-QA + CP-Security all reach status=online within ~2 min.

Refs:
  internal#77 — !external RFC (Phase 3a)
  task #222 — resolver PR (introduced the unflagged-permission
              dependency this fixes)
  Live incident 2026-05-10 — staging-cplead-2 import failed,
              chown-on-host workaround in place pending image rebuild
2026-05-09 19:40:52 -07:00

134 lines
6.6 KiB
Docker

# Dockerfile.tenant — combined platform (Go) + canvas (Next.js) image.
#
# Serves both the API (Go on :8080) and the UI (Node.js on :3000) in a
# single container. Go reverse-proxies unknown routes to canvas.
#
# Templates + plugins are NOT cloned at build time. They are pre-cloned
# in the trusted CI context (or operator host) by
# `scripts/clone-manifest.sh` into `.tenant-bundle-deps/` and COPYed in.
# The reason: post-2026-05-06, every workspace-template-* repo on Gitea
# (codex, crewai, deepagents, gemini-cli, langgraph) plus all 7
# org-template-* repos are private, so the Docker build can't `git clone`
# from inside the build context — there's no auth path that doesn't leak
# the Gitea token into an image layer. Pre-cloning keeps the token in
# the CI environment only; the resulting image carries the cloned trees
# with `.git` already stripped (see clone-manifest.sh).
#
# Build context: repo root, with `.tenant-bundle-deps/` populated by:
#
# MOLECULE_GITEA_TOKEN=<persona-PAT> scripts/clone-manifest.sh \
# manifest.json \
# .tenant-bundle-deps/workspace-configs-templates \
# .tenant-bundle-deps/org-templates \
# .tenant-bundle-deps/plugins
#
# In CI this happens in publish-workspace-server-image.yml's "Pre-clone
# manifest deps" step (uses AUTO_SYNC_TOKEN = devops-engineer persona).
# For a manual operator-host build, source the same token from
# /etc/molecule-bootstrap/agent-secrets.env first.
#
# docker buildx build --platform linux/amd64 \
# -f workspace-server/Dockerfile.tenant \
# -t <ECR>/molecule-ai/platform-tenant:latest \
# --build-arg GIT_SHA=<sha> --build-arg NEXT_PUBLIC_PLATFORM_URL= \
# --push .
# ── Stage 1: Go platform binary ──────────────────────────────────────
FROM golang:1.25-alpine@sha256:c4ea15b4a7912716eb362a022e2b12317762eca387423760bc59c0f9ae69423c AS go-builder
WORKDIR /app
COPY workspace-server/go.mod workspace-server/go.sum ./
# github-app-auth plugin removed 2026-05-07 (#157): per-agent Gitea
# identities replaced GitHub-App tokens post-suspension. The sibling
# COPY + replace directive are gone.
RUN go mod download
COPY workspace-server/ .
# GIT_SHA is baked into the binary via -ldflags so /buildinfo can return
# it at runtime. CI passes ${{ github.sha }}; local builds default to
# "dev" so an unset value never reads as a real SHA.
#
# Why this matters: the redeploy verification step compares each tenant's
# /buildinfo against the SHA the workflow expects. If GIT_SHA isn't
# threaded through here, every tenant returns "dev" and the verification
# fails closed — which is the correct fail-direction (#2395 root fix).
ARG GIT_SHA=dev
RUN CGO_ENABLED=0 GOOS=linux go build \
-ldflags "-X github.com/Molecule-AI/molecule-monorepo/platform/internal/buildinfo.GitSHA=${GIT_SHA}" \
-o /platform ./cmd/server
# Memory v2 sidecar binary (Memory v2 #2728). Bundled so an operator
# can activate cutover by flipping MEMORY_V2_CUTOVER=true without
# provisioning a separate service. See entrypoint-tenant.sh for the
# launch logic.
RUN CGO_ENABLED=0 GOOS=linux go build \
-ldflags "-X github.com/Molecule-AI/molecule-monorepo/platform/internal/buildinfo.GitSHA=${GIT_SHA}" \
-o /memory-plugin ./cmd/memory-plugin-postgres
# ── Stage 2: Canvas Next.js standalone ────────────────────────────────
FROM node:20-alpine@sha256:afdf98210b07b586eb71fa22ba2e432e058e4cd1304d31ed60888755b8c865fb AS canvas-builder
WORKDIR /canvas
COPY canvas/package.json canvas/package-lock.json* ./
RUN npm install
COPY canvas/ .
ARG NEXT_PUBLIC_PLATFORM_URL=""
ARG NEXT_PUBLIC_WS_URL=""
ENV NEXT_PUBLIC_PLATFORM_URL=$NEXT_PUBLIC_PLATFORM_URL
ENV NEXT_PUBLIC_WS_URL=$NEXT_PUBLIC_WS_URL
RUN npm run build
# ── Stage 3: Runtime ──────────────────────────────────────────────────
FROM node:20-alpine@sha256:afdf98210b07b586eb71fa22ba2e432e058e4cd1304d31ed60888755b8c865fb
RUN apk add --no-cache ca-certificates git tzdata openssh-client aws-cli
# Non-root runtime for the Node.js canvas process.
# The Go binary (started by entrypoint.sh) is also non-root — the
# entrypoint runs as root only long enough to set volume ownership,
# then exec's as the 'canvas' user via su-exec / setpriv.
# The Go platform itself drops privileges after init.
#
# node:20-alpine ships with uid/gid 1000 already taken by `node`. Delete
# it first so we can recreate `canvas` at the same uid/gid without
# conflict. Previously plain addgroup/adduser at 1000 failed with
# "group 'node' in use" — blocked the tenant image build for hours
# 2026-04-21. Picking a different uid would break mounted volumes
# that expect 1000, so we keep the slot and rename the user.
RUN deluser --remove-home node 2>/dev/null || true; \
delgroup node 2>/dev/null || true; \
addgroup -g 1000 canvas && adduser -u 1000 -G canvas -s /bin/sh -D canvas
# Go platform binary + Memory v2 sidecar
COPY --from=go-builder /platform /platform
COPY --from=go-builder /memory-plugin /memory-plugin
COPY workspace-server/migrations /migrations
# Templates + plugins (pre-cloned by scripts/clone-manifest.sh in the
# trusted CI / operator-host context, .git already stripped — see
# .tenant-bundle-deps/ in the build context). The Gitea token used to
# clone them never enters this image.
COPY .tenant-bundle-deps/workspace-configs-templates /workspace-configs-templates
COPY .tenant-bundle-deps/org-templates /org-templates
COPY .tenant-bundle-deps/plugins /plugins
# Canvas standalone
WORKDIR /canvas
COPY --from=canvas-builder /canvas/.next/standalone ./
COPY --from=canvas-builder /canvas/.next/static ./.next/static
COPY --from=canvas-builder /canvas/public ./public
COPY workspace-server/entrypoint-tenant.sh /entrypoint.sh
# /org-templates must be writable by the canvas user — the !external
# resolver mkdirs <orgBaseDir>/.external-cache/<repo>/<sha>/ on first
# import to cache cross-repo subtree fetches (org_external.go,
# internal#77 / task #222). Without this chown the resolver fails with
# "mkdir cache root: permission denied" and POST /org/import returns
# 400 "org template expansion failed" for any template that uses
# !external (e.g. molecule-dev → dev-lead). Caught on staging-cplead-2
# 2026-05-10 — see internal incident debrief.
RUN chmod +x /entrypoint.sh && \
chown -R canvas:canvas /canvas /platform /memory-plugin /migrations /org-templates
EXPOSE 8080
# entrypoint.sh starts as root to fix volume perms, then drops to
# canvas user. The Go binary (PID 1 replacement) runs as non-root.
USER canvas
CMD ["/entrypoint.sh"]