SaaS tenants cannot adopt a newly-promoted runtime-image pin without a full re-provision (soft restart + WAF-blocked refresh = no reachable trigger) #2239

Open
opened 2026-06-04 09:55:22 +00:00 by devops-engineer · 5 comments
Member

Problem

On a SaaS tenant (e.g. agents-team), promoting a new runtime-image pin (CP runtime_image_pins, migration 027) does not propagate to already-running containers through any operator-reachable path:

  1. Soft /restart (existing-volume) does not adopt a new pin. The CP digest-pin is applied at instance/thin-AMI launch (provisioner/registry.go:90, provisioner.go:110-133), not re-read on container restart. The provisioner only re-pulls when the image is absent locally OR the tag is moving (:latest); a digest-pinned cfg.Image is treated as immutable and the pull is skipped when any image is cached (provisioner.go:511-533). A soft restart recreates the container from the already-baked old cfg.Image -> stuck on the old image.
  2. IMAGE_AUTO_REFRESH (imagewatch) is off by design on SaaS (cmd/server/main.go:388) -- "SaaS deploys whose pipeline already pulls every release should leave it off." But the fleet auto-deploy rolls only the workspace-server (platform-tenant) image, not the separate workspace-template-<runtime> artifacts. So nothing auto-adopts a new runtime-template image.
  3. The manual trigger /admin/workspace-images/refresh is not edge-reachable -- the tenant WAF rewrites /admin/* to the canvas Next.js app (returns HTML), even with the Origin header. Only /workspaces/* is exposed.

Net: a freshly-published runtime-template image (correct pin, correct :latest) cannot be forced onto running SaaS-tenant containers without host access (which lives in the platform-tenant AWS account, unreachable from the operator).
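
The pull rule in point 1 can be restated as a minimal Go sketch. This is not the code in provisioner.go; the function names and the registry.example image references are hypothetical, and the digests are placeholders. It only illustrates why a cached digest-pinned image is never re-pulled while :latest always is.

```go
package main

import (
	"fmt"
	"strings"
)

// isDigestPinned reports whether ref addresses an image by content digest
// (for example "repo/name@sha256:...") instead of by tag.
func isDigestPinned(ref string) bool {
	return strings.Contains(ref, "@sha256:")
}

// isMovingTag reports whether ref uses the :latest tag, which can point at a
// different image over time. A digest-pinned ref is never treated as moving.
func isMovingTag(ref string) bool {
	return !isDigestPinned(ref) && strings.HasSuffix(ref, ":latest")
}

// shouldPull is a hypothetical restatement of the rule in point 1: pull when
// the image is absent locally or the tag is moving; otherwise reuse the cache.
func shouldPull(ref string, presentLocally bool) bool {
	return !presentLocally || isMovingTag(ref)
}

func main() {
	// Placeholder references; real digests are 64 hex characters.
	oldPin := "registry.example/workspace-template-codex@sha256:old"
	latest := "registry.example/workspace-template-codex:latest"

	fmt.Println("cached pinned digest:", shouldPull(oldPin, true))
	fmt.Println("cached :latest tag:  ", shouldPull(latest, true))
	fmt.Println("uncached pinned:     ", shouldPull(oldPin, false))
}
```

Under this rule a soft restart that carries the old pinned cfg.Image never triggers a pull, which matches the behavior reported above.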

Concrete impact

Two agents-team codex agents (codex sandbox/bwrap fix shipped in the new image: GIT_ASKPASS + sandbox network_access + runtime 0.3.9) remain on the old image. Non-blocking (swarm routes around it) but they only adopt on the next full re-provision / tenant redeploy.

Durable fix options (pick one)

  • (b) CP-internal refresh proxy (preferred): POST /cp/admin/orgs/<slug>/refresh-runtime-image?runtime=<rt> that calls the tenant's WorkspaceImageService.Refresh over the internal CP->tenant path (not the edge). Operator-triggerable, no WAF/host dependency, blast-radius-scoped per runtime.
  • (c) Enable IMAGE_AUTO_REFRESH=true on tenants whose fleet-deploy does not cover runtime templates (zero-touch; the intended SaaS knob for this case).
  • (d) Make soft /restart re-read the current CP pin and adopt a changed digest (force-pull when the resolved pin digest differs from the running container's image).
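
Option (d) hinges on one comparison: the digest of the currently resolved pin against the digest the running container was created from. A minimal sketch of that check, assuming both values are available as image reference strings (digestOf and needsForcePull are hypothetical names, not existing functions in the codebase):

```go
package main

import (
	"fmt"
	"strings"
)

// digestOf returns the digest part of an image reference such as
// "repo/name@sha256:...". ok is false when the reference has no digest.
func digestOf(ref string) (digest string, ok bool) {
	_, digest, found := strings.Cut(ref, "@")
	return digest, found && digest != ""
}

// needsForcePull is a hypothetical helper for option (d): it reports whether
// a restart should pull because the currently resolved pin differs from the
// image the running container was created from.
func needsForcePull(resolvedPin, runningImage string) bool {
	pinDigest, pinned := digestOf(resolvedPin)
	if !pinned {
		return false // no digest pin resolved, so there is nothing to compare
	}
	runningDigest, ok := digestOf(runningImage)
	return !ok || runningDigest != pinDigest
}

func main() {
	// Placeholder digests for illustration only.
	fmt.Println(needsForcePull("tmpl@sha256:new", "tmpl@sha256:old"))
	fmt.Println(needsForcePull("tmpl@sha256:old", "tmpl@sha256:old"))
	fmt.Println(needsForcePull("tmpl:latest", "tmpl@sha256:old"))
}
```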

Low priority (non-blocking); file-and-track per CTO no-regression program.

Member

Research complete — root cause refined + design decision (CTO).

Key correction: IMAGE_AUTO_REFRESH is ALREADY ON for all SaaS tenants (hardcoded controlplane ec2.go:2410), so imagewatch IS running + recreating containers on a :latest digest change. The real root cause: the CP digest-pin overrides it — recreate re-provisions from the OLD pinned Config.Image (provisioner.go:511-534 skips re-pull for an immutable digest), so containers come back on the old image. CP→tenant has NO internal bypass (it uses the same WAF-fronted public /admin/workspace-images/refresh with an Origin header; workspace_redeploy.go:240-330).

Two viable fixes:

  • D.2 (simpler, ~10 lines): for tenants whose fleet does NOT roll runtime images, CP sends Config.Image="" (unpinned) → provisioner uses :latest → imagewatch adoption works. COST: sacrifices digest-pin determinism/supply-chain reproducibility for those tenants (the pin exists per RFC internal#483 / security review 4269).
  • D.3 (proper, preferred): KEEP pinning; on runtime-image/promote, CP propagates the NEW pin to running tenants of that runtime (re-provision with new Config.Image → provisioner pulls the new digest, not cached → lands). Preserves supply-chain integrity AND achieves adoption.

CTO LEAN = D.3 (don't trade away pinning). This is a deliberate CP change with a supply-chain dimension — I'll finalize the design + implement carefully (not auto-dispatched). Tracking here. Non-blocking; agents adopt on next full re-provision meanwhile.

Member

CTO review of the D.3 draft (authored in worktree, NOT yet merged) — HELD pending mechanism-validation.

Code quality: GOOD. Reviewed the full diff:

  • Hook wiring (admin_pin_handler.go): clean — handlePromote responds 200 FIRST, then fires PostPromote as a detached goroutine (request ctx correctly not forwarded); opt-in per resource (thin-AMI excluded). Safe.
  • Propagation logic (pin_runtime_image_propagation.go): nil-guards redeployer+db, sequential per-tenant with 5-min timeout, best-effort (one failure doesn't abort rest), structured logging. Clean.
  • Fleet query: byte-identical to canonical resolveFleetEntries (workspace_redeploy.go:692-696); fly_machine_id is the legacy-named canonical live-instance marker. Faithful duplication.
  • Build clean; 15 table-driven tests pass.
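
The propagation shape described in the bullets above (detached from the request context, sequential per tenant, a timeout per tenant, best-effort) can be sketched as follows. This is an illustration of the pattern only, not the code in pin_runtime_image_propagation.go; redeployFunc, propagate, demo and the tenant names are all hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// redeployFunc is a hypothetical stand-in for the real per-tenant redeploy call.
type redeployFunc func(ctx context.Context, tenant string) error

// propagate visits tenants one at a time, gives each its own timeout, and
// records failures instead of aborting, so one bad tenant cannot block the rest.
func propagate(parent context.Context, tenants []string, perTenant time.Duration, redeploy redeployFunc) map[string]error {
	failures := make(map[string]error)
	if redeploy == nil {
		return failures // nil-guard: nothing to call
	}
	for _, tenant := range tenants {
		ctx, cancel := context.WithTimeout(parent, perTenant)
		err := redeploy(ctx, tenant)
		cancel() // release the timer before moving to the next tenant
		if err != nil {
			failures[tenant] = err
		}
	}
	return failures
}

// demo runs propagate in a goroutine rooted at context.Background(), which is
// how a handler can respond first without the request context cancelling the work.
func demo() (called []string, failures map[string]error) {
	fake := func(ctx context.Context, tenant string) error {
		called = append(called, tenant)
		if tenant == "tenant-b" {
			return errors.New("simulated refresh failure")
		}
		return nil
	}
	done := make(chan map[string]error, 1)
	go func() {
		done <- propagate(context.Background(), []string{"tenant-a", "tenant-b", "tenant-c"}, 5*time.Minute, fake)
	}()
	failures = <-done
	return called, failures
}

func main() {
	called, failures := demo()
	fmt.Println("called:", called)
	fmt.Println("failed tenants:", len(failures))
}
```

The demo waits on a channel only so the example terminates deterministically; a real handler would not block on the goroutine.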

WHY HELD (the blocker): the draft's caveat (a) — that force-remove + re-provision lands the NEW pinned digest — is UNVERIFIED and contradicts the prior root-cause research, which found this same path leaves containers on the OLD pin (re-provision reuses the persisted/baked digest-pinned Config.Image; provisioner skips re-pull for an immutable digest). D.3 only works IF re-provision-after-force-remove re-fetches the pin from CP (getting the now-new one). That runtime behavior is the crux and is NOT proven by 'tests pass' (the tests verify the propagation CALLS happen, not that containers ADOPT the new image).

Validation gate before merge: trace the molecule-core workspace-server re-provision-after-container-removal path — does it re-query CP for Config.Image (→ gets new pin → adopts) or reuse a persisted config (→ old pin → no-op churn)? OR a staging test: promote a runtime pin, confirm a RUNNING container's image digest actually changes. Only merge once adoption is proven on the real artifact. If re-provision reuses persisted config, the fix must ALSO invalidate that (e.g. the refresh/redeploy clears the tenant's persisted Config.Image so re-provision re-fetches). Non-blocking (agents adopt on next full redeploy meanwhile) — getting it RIGHT > shipping fast. Worktree retained at /tmp/cp-2239-d3.
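
The staging check in the gate above distinguishes three outcomes, which a small sketch makes explicit. It assumes the container ID and image digest can be captured before and after the promote; the snapshot type, classify function, and outcome labels are hypothetical.

```go
package main

import "fmt"

// snapshot is a hypothetical record of one container observation.
type snapshot struct {
	ContainerID string
	ImageDigest string
}

// classify compares observations taken before and after a pin promote.
// "adopted":     the container now runs the promoted digest.
// "no-op churn": the container was recreated but is still on another digest.
// "untouched":   the same container is still running; nothing reached it.
func classify(before, after snapshot, promoted string) string {
	switch {
	case after.ImageDigest == promoted:
		return "adopted"
	case after.ContainerID != before.ContainerID:
		return "no-op churn"
	default:
		return "untouched"
	}
}

func main() {
	before := snapshot{ContainerID: "c1", ImageDigest: "sha256:old"}
	fmt.Println(classify(before, snapshot{"c2", "sha256:new"}, "sha256:new"))
	fmt.Println(classify(before, snapshot{"c2", "sha256:old"}, "sha256:new"))
	fmt.Println(classify(before, snapshot{"c1", "sha256:old"}, "sha256:new"))
}
```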

Member

MECHANISM VALIDATED (static trace) — blocker cleared; proceeding to merge with a post-deploy real-artifact confirmation.

Resolved the open question (does re-provision adopt the new pin, or reuse a stale one?) by tracing both layers + reconciling two partially-conflicting subagent findings:

  • workspace-server RE-PROVISION path sets cfg.Image="" (workspace_provision.go:350) and delegates the image to CP — it does NOT carry a baked digest.
  • CP resolves the image via resolveRuntimeImage(), which does a FRESH read on EACH call (runtime_image_pin.go:117), with NO cache.
  • So a container re-provision re-reads the CURRENT pin. The earlier 'stuck on old pin' was the pin-NOT-updated scenario (imagewatch on :latest with unchanged CP pin). D.3 updates the pin FIRST → re-provision reads the NEW pin → ImageInspect(new-digest) misses locally → Docker pulls → ADOPTS.

VERDICT: D.3 should work. Reconciliation of provisioner.go:511-534 'skip re-pull for pinned digest' — that applies to an ALREADY-PRESENT digest; a freshly-promoted digest is NOT present locally → pull fires.
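
The reconciliation above can be checked with a small in-memory model. It is a sketch under stated assumptions, not the real CP or provisioner code: controlPlane, host, reprovision and scenario are hypothetical, and the only behaviors modelled are the two claims from the trace (the pin is resolved fresh on every call, and a pull fires only when the resolved image is not cached).

```go
package main

import "fmt"

// controlPlane models the pin store; resolve reads it on every call, no cache.
type controlPlane struct{ pins map[string]string }

func (cp *controlPlane) resolve(runtime string) string { return cp.pins[runtime] }

// host models the Docker host's local image cache.
type host struct{ cached map[string]bool }

// reprovision resolves the current pin and pulls only if that image is absent locally.
func (h *host) reprovision(cp *controlPlane, runtime string) (image string, pulled bool) {
	image = cp.resolve(runtime)
	if !h.cached[image] {
		h.cached[image] = true // simulate a pull
		pulled = true
	}
	return image, pulled
}

// scenario re-provisions once with the pin unchanged, then again after a promote.
func scenario() (beforeImg string, beforePulled bool, afterImg string, afterPulled bool) {
	cp := &controlPlane{pins: map[string]string{"codex": "tmpl@sha256:old"}}
	h := &host{cached: map[string]bool{"tmpl@sha256:old": true}}

	beforeImg, beforePulled = h.reprovision(cp, "codex") // pin NOT updated

	cp.pins["codex"] = "tmpl@sha256:new" // promote the new pin FIRST
	afterImg, afterPulled = h.reprovision(cp, "codex")
	return beforeImg, beforePulled, afterImg, afterPulled
}

func main() {
	bi, bp, ai, ap := scenario()
	fmt.Println("pin unchanged:", bi, "pulled:", bp)
	fmt.Println("pin promoted: ", ai, "pulled:", ap)
}
```

The model says nothing about whether the re-provision request reaches the tenant at all; it only covers what happens once it does.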

RESIDUAL (small, can't be settled by static analysis): for the agents-team shared-host topology, confirm the CONTAINER-recreate (not EC2 instance-launch) path re-resolves via CP rather than reusing an instance-launch-baked image. STATIC EVIDENCE leans strongly that it re-resolves (the restart path goes through buildProvisionerConfig → CP, not user-data).

PLAN: push D.3 branch (proper per-agent identity, NEVER founder-PAT) → PR → owner-merge after this review → deploy to CP → CONFIRM ADOPTION on the next real codex pin promote (CR2/Researcher bwrap clears = PROVEN). Do NOT close core#2239 as PROVEN until that live observation. If the shared-host residual bites (no adoption), extend: the refresh/redeploy must force the container-recreate to re-resolve from CP. Worktree /tmp/cp-2239-d3.

Member

D.3 LIVE-VALIDATION FAILED — mechanism is WAF-blocked. #2239 STAYS OPEN. (verify-the-real-artifact caught what static analysis missed.)

Test: triggered a scoped agents-team codex workspace-redeploy (POST /cp/admin/tenants/agents-team/workspaces/redeploy {runtime:codex}) on prod (D.3 already deployed via CP PR#537). Result: the CP→tenant /admin/workspace-images/refresh call returned HTTP 404 + canvas Next.js HTML — the tenant Cloudflare WAF rewrites /admin/* to the canvas app, DESPITE CP setting the Origin header (workspace_redeploy.go:327-329 sets Authorization + Origin:tenantURL + X-Molecule-Org-Id). So the refresh never reaches the tenant workspace-server → no container force-remove → no adoption. D.3's propagation, when it fires, hits the same wall.
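
One way to surface this failure mode in the caller, rather than treating any response as delivery, is to inspect the Content-Type of the reply. A minimal sketch, assuming the canvas app identifies itself with a text/html Content-Type as observed in this test; looksLikeCanvasHTML is a hypothetical helper, not something workspace_redeploy.go currently does.

```go
package main

import (
	"fmt"
	"mime"
)

// looksLikeCanvasHTML reports whether a response Content-Type is HTML, which
// in this incident meant the edge served the canvas app instead of the
// workspace-server. A missing or malformed header yields false (unknown).
func looksLikeCanvasHTML(contentType string) bool {
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return false
	}
	return mediaType == "text/html"
}

func main() {
	fmt.Println(looksLikeCanvasHTML("text/html; charset=utf-8"))
	fmt.Println(looksLikeCanvasHTML("application/json"))
	fmt.Println(looksLikeCanvasHTML(""))
}
```

Combined with the status code, a check like this would let the propagation log "edge returned canvas HTML" explicitly instead of a generic refresh failure.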

CR2/Researcher UNCHANGED (refresh never reached them; no harm, recoverable). No regression introduced.

WHY static analysis was wrong: the research subagent read the code that SETS the Origin header and assumed it works; it does not (for this endpoint). The Origin trick may work for the /admin/* paths WorkspaceLister/billing-mode use but NOT /admin/workspace-images/refresh — needs a diff of why.

FIX CANDIDATES (next):

  • (a) whitelist /admin/workspace-images/refresh at the tenant Cloudflare edge; compare against the working /admin/* paths.
  • (b) check if the refresh route is conditionally-unregistered on the tenant (router.go:571 opt-in) so the ws-server itself 404s → edge serves canvas.
  • (c) switch the propagation to the EC2 INSTANCE-redeploy path (Redeployer terminate+relaunch → re-reads pin at userdata, bypasses /admin/refresh entirely); heavier but not WAF-dependent.

Leaning (b)-investigate-then-(a)-or-(c). The D.3 PR#537 code (enumerate+call) stays; the TRANSPORT to the tenant needs fixing.

Member

D.3 v2 diagnosis refinement (sweep): my 'route opt-in-unregistered' hypothesis is WRONG. router.go:570-573 registers POST /admin/workspace-images/refresh whenever prov != nil, and on the agents-team Docker-host tenant prov IS non-nil (Docker is how the agents run). So the route IS registered on the ws-server.

The canvas-HTML 404 is therefore an EDGE-ROUTING issue: the tenant edge (Cloudflare/caddy proxy) routes /admin/workspace-images/* to the canvas Next.js app instead of the ws-server, and CP's Origin header did NOT prevent it. The other /admin/* paths that WorkspaceLister/billing-mode use DO reach the ws-server, so the edge has a per-path allowlist that excludes workspace-images.

FIX OPTIONS narrow to:

  • (a) add /admin/workspace-images/refresh to the tenant edge's ws-server route allowlist (needs tenant host / Cloudflare config; the agents-team host is in AWS acct 153263036946, hard to reach), OR
  • (c) switch D.3 propagation to the EC2 INSTANCE-redeploy path (Redeployer.RedeployTenant terminate+relaunch → re-reads pin at userdata, bypasses /admin/refresh entirely; CP has the 153263036946 creds to do it; heavier = whole-tenant restart but WAF-independent).

Leaning (c) as the robust fix. Diagnose the edge allowlist next to confirm + pick.
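
The inferred edge behavior (a per-path allowlist with a canvas fallback) can be modelled in a few lines. This is a sketch of the hypothesis only: the real edge config has not been inspected, upstreamFor is a hypothetical function, and /admin/example-allowed/ is a made-up placeholder for whichever /admin/* paths actually reach the ws-server.

```go
package main

import (
	"fmt"
	"strings"
)

// upstreamFor models the hypothesized edge rule: paths matching an allowlisted
// prefix go to the workspace-server, everything else falls through to canvas.
func upstreamFor(path string, wsServerPrefixes []string) string {
	for _, prefix := range wsServerPrefixes {
		if strings.HasPrefix(path, prefix) {
			return "ws-server"
		}
	}
	return "canvas"
}

func main() {
	// Hypothetical allowlist; only /workspaces/ is confirmed as exposed in this thread.
	allow := []string{"/workspaces/", "/admin/example-allowed/"}
	refresh := "/admin/workspace-images/refresh"

	fmt.Println(refresh, "->", upstreamFor(refresh, allow))

	// Fix option (a): add the refresh path to the allowlist.
	allow = append(allow, refresh)
	fmt.Println(refresh, "->", upstreamFor(refresh, allow))
}
```

If the real edge works this way, option (a) is a one-entry config change, while option (c) sidesteps the edge entirely.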

Reference: molecule-ai/molecule-core#2239