fix(canvas): clamp approval banner text; fix infra compose network key #3104
What

1. ApprovalBanner clamp (product bug). The on-canvas approval banner rendered `approval.action` and `approval.reason` with no truncation, so long agent-authored messages sprawled into a full-canvas text column (reported on agents-team). Added `line-clamp-2` (action) + `line-clamp-3` (reason) + `break-words`. The banner stays compact; full text remains reachable via the request thread.

2. docker-compose.infra.yml network key. Services attach to network key `molecule-core-net`, but the top-level `networks:` block defined the key as `default` (name-aliased to molecule-core-net). Docker Compose v5 rejects this as an undefined network, breaking `dev-start.sh` locally. Renamed the key to match.

Verification

- `line-clamp` utilities are already used elsewhere in the canvas (Tailwind v4 core).
- `/approvals/pending` serves the long text; the clamp bounds the rendered height.
- `docker compose -f docker-compose.infra.yml config` validates; infra comes up clean.

Generated with Claude Code
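The network-key mismatch in item 2 can be illustrated with a small check. This is a minimal sketch, not the project's tooling: `undefined_network_keys` is a hypothetical helper, the compose file is modeled as an already-parsed dict, and the service name `postgres` and the exact "after" shape are assumptions made only for illustration.

```python
from typing import Any, Dict, List


def undefined_network_keys(compose: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map each service to the network keys it references that are not
    declared under the top-level `networks:` block.

    Hypothetical helper for illustration; `compose` is an already-parsed
    compose file (for example the result of a YAML load).
    """
    # Compose always provides an implicit `default` network, so a service
    # may reference it even when it is not declared explicitly.
    declared = set(compose.get("networks") or {}) | {"default"}
    problems: Dict[str, List[str]] = {}
    for name, service in (compose.get("services") or {}).items():
        # `networks` may be a list of keys or a mapping keyed by network key;
        # iterating either form yields the keys.
        referenced = list(service.get("networks") or [])
        missing = [key for key in referenced if key not in declared]
        if missing:
            problems[name] = missing
    return problems


# Before the fix: the key is `default`; `name:` only sets the Docker-level name.
before = {
    "services": {"postgres": {"networks": ["molecule-core-net"]}},
    "networks": {"default": {"name": "molecule-core-net"}},
}
# After the fix (assumed shape): the key matches what the services reference.
after = {
    "services": {"postgres": {"networks": ["molecule-core-net"]}},
    "networks": {"molecule-core-net": {"name": "molecule-core-net"}},
}

print(undefined_network_keys(before))  # {'postgres': ['molecule-core-net']}
print(undefined_network_keys(after))   # {}
```

The point of the sketch is that services resolve networks by the top-level key, not by the `name:` alias, which is why renaming the key is sufficient.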
ApprovalBanner rendered approval.action / approval.reason with no truncation, so long agent-authored messages sprawled into a full-canvas text column. Add line-clamp-2 (action) + line-clamp-3 (reason) + break-words; full text stays reachable via the request thread.

docker-compose.infra.yml: services attach to network key `molecule-core-net` but the top-level networks block defined the key as `default` (name-aliased), so Docker Compose v5 rejected the project ("undefined network"). Rename the key to match what the services reference.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

MECHANISM: #3104's latest failing job is `E2E Staging Concierge Creates Workspace` job 535805 at head `7cc60fa3`. The failure occurs in `tests/e2e/test_staging_concierge_creates_workspace_e2e.sh` before concierge discovery or the A2A tool-list probe: after org creation and admin-token retrieval, the script waits for `TENANT_URL/health` in lines 333-339 and exits at line 337 when the tenant URL never returns 2xx within 15 minutes. The PR diff only touches `canvas/src/components/ApprovalBanner.tsx` and `docker-compose.infra.yml`, so this is not a create-workspace code-path regression in #3104.

EVIDENCE: job 535805 step `Run concierge-creates-workspace functional E2E` logged `status -> running`, then `Tenant provisioning complete`, then `TENANT_URL=https://e2e-cncrg-mk-20260621-3-4d54a936.staging.moleculesai.app`, then `Tenant /health never 2xx within 15m`. The workflow preflight at `.gitea/workflows/e2e-staging-saas.yml:768-775` passed (Staging CP healthy), and current `https://staging-api.moleculesai.app/health` returns 200, so this is tenant-specific provisioning/edge readiness after CP accepted the org, not global CP health. It also does not share the #3102 mechanism: the #3102 polling/A2A probe block is later in `tests/e2e/test_staging_concierge_creates_workspace_e2e.sh:408-454` and was never reached.

RECOMMENDED FIX SHAPE: classify this run as staging tenant-provisioning/edge readiness failure, not flaky and not a #3104 product regression. Owner is the control-plane/staging provisioning path that sets `instance_status=running` before the tenant edge/app is actually health-routable; fix scope should make CP either delay `running` until tenant `/health` is 2xx through the public tenant domain, or expose a separate readiness/last-error state that the core E2E can report. Optional core harness hardening: in `tests/e2e/test_staging_concierge_creates_workspace_e2e.sh`, log the last tenant `/health` HTTP code/body during the 15-minute loop so future RCAs can distinguish DNS/TLS/5xx quickly.

MECHANISM: the recurring Concierge Creates Workspace red is a controlplane readiness-contract false-green, not a #3104 canvas/compose regression and not the old #3102 MCP-tool polling race. The failed runs both show CP
`/health` green, the tenant row reaching `instance_status=running`, then the public tenant URL never serving `/health` 2xx for 15 minutes. In CP, `provisionTenant` flips `org_instances.status='running'` only after `CanaryTenantURL` says OK; but that canary probes `https://<slug>.<domain>/` and explicitly treats any non-Cloudflare-error response as success, including 404 or 500. The E2E, correctly, gates on `https://<slug>.staging.moleculesai.app/health` returning 2xx. So CP can publish `running` when Cloudflare/Worker/tunnel returns some non-CF response for `/`, while the real tenant platform health path is still unrouted, stale, or not booted.

EVIDENCE: job 535805 on #3104: `status → running`, then `Tenant /health never 2xx within 15m`. Job 536907 on #3107 repeats the same sequence. The E2E source waits on `instance_status` from `/cp/admin/orgs`, then curls `$TENANT_URL/health`. CP code path: `internal/handlers/orgs.go` waits for `CanaryTenantURL` before line 958 sets `status = 'running'`; `internal/provisioner/canary.go` builds URL `/` and lines 175-185 mark `OK=true` for "200, 302, 404, 500" as long as the body is not a CF error. That is weaker than the required context's `/health` contract, so `running` is not proof of public tenant health.

RECOMMENDED FIX SHAPE: controlplane/provisioner owner should tighten the readiness gate, not paper over the E2E. Change `CanaryTenantURL`/`provisionTenant` to probe the same contract the E2E depends on: `GET https://<slug>.<domain>/health` and require a 2xx (or an explicit tenant-platform header/JSON health response), while still treating CF 1003 as terminal and 1033/5xx/transport errors as retryable. Also log the last canary status/body class into `org_instances.last_error` on failure and have the E2E print the final HTTP code/body snippet instead of suppressing curl output. Verification: create a fresh staging tenant and require that CP does not set `instance_status=running` until `/health` is 2xx; then rerun core#3104 Concierge Creates Workspace and template-delivery/peer-visibility staging jobs. If diagnostics are needed before patching, use `GET /cp/admin/tenants/:slug/diagnostics?domain=staging.moleculesai.app&probe=1` during the failure window to distinguish DNS missing, tunnel missing, zero connectors, and edge/origin status.

Concrete root cause for the "edge not reachable / Concierge-Creates-Workspace E2E" failures: it is NOT edge-readiness, it's the tenant platform container failing to start (
`docker run exit=127`).

Found via `org_instance_boot_events` for a real failed prod signup (org `9a6a28de` "114514 Company", failed 06-21; same signature on both the original attempt and the reaper retry):

So: the image pulls fine, but `docker run … molecule-tenant` immediately exits 127 (command/entrypoint not found) → no container → `wait_platform_health` never gets `ok` → reconciler fails the org after 30m → public edge stays 502. The "edge not reachable (502)" + "no wait_platform_health=ok after 30m" errors are downstream symptoms; the cause is exit 127 at `start_platform`.

Impact: new-tenant onboarding is broken. Last successful provision in prod was 06-14; every attempt since fails this way (114514 is the only new signup in 24h+, failed twice). SEV-class for onboarding.

Exit 127 = the container's ENTRYPOINT/CMD binary isn't found, i.e. the tenant platform image's latest build has a broken entrypoint (bad shebang / moved binary / wrong CMD path), or the provisioner's `docker run` invocation references a path not in the image. Needs the tenant-platform-image (ghcr) + provisioner user-data owners: identify the platform image tag the provisioner runs, diff its entrypoint against the last-good build (pre-06-14), and fix/repin.

NOT the image-gen socket (CP #880): that's a Go handler in the main control-plane, which boots + serves fine; it doesn't touch the tenant image entrypoint. Cleared by the boot log (failure is at `docker run`, image-agnostic to the socket).

Reclassify #3104 from "edge-readiness gap" to "tenant platform image start failure (docker run exit=127)" + bump priority (onboarding down).
Correction to my earlier comment: it is NOT a "broken/missing entrypoint in the image". I verified the live image and walked that back:

- `platform-tenant:latest` = digest `765b899…`, pushed 05:17 UTC today (the only `:latest`; both 114514 boots at 06:04 + 06:49 used it).
- `docker run -d <img>` → `state=running, exit=0`, platform starts (runtime registry loads 9 runtimes, Next.js Ready). Entrypoint `docker-entrypoint.sh` IS on PATH + executable; CMD `/entrypoint.sh` exists.
- `docker run` argv (ec2.go ~2608, args `registryEnvLine, image` at 2727) is structurally correct: env flags + the image, no command appended.

So: image good, argv good, yet the boot's `docker run -d` returns exit 127 systemically (114514 ×2 + the staging Concierge-Creates-Workspace e2e). I could not reproduce the 127 from the image or the rendered argv via static analysis. The exact trigger needs a live capture: provision a canary tenant, pull its rendered user-data, and run the actual `docker run` line on-box (the terminated 114514 instances are gone). The 05:17 image is the most recent change in the path (it introduced the `docker-entrypoint.sh` Node-wrapper as ENTRYPOINT with CMD `/entrypoint.sh`); next step is to confirm whether the on-box rendered command differs from what I see in source.

Net unchanged: onboarding is down (last success 06-14; every attempt since 127s at `start_platform`), and it is NOT the image-gen socket (CP #880, a different image, boots fine). I'll capture a live canary to byte-pin it unless someone with the provisioner context beats me to it.

POST-CP#882 residual RCA: readiness false-green is reduced but not closed end-to-end; tenant health is not stable after the CP marks it running.
MECHANISM: CP main `28d6a43f` contains the intended `/health` 2xx canary in `internal/provisioner/canary.go`, and `internal/handlers/orgs.go` is supposed to write `org_instances.status='running'` only after `canary.OK`. The post-deploy scheduled core run `388473` still shows `E2E Staging Concierge Creates Workspace` job `537153` observing `instance_status -> running` at 07:02:00, then failing because the tenant `/health` never became/stayed usable for the E2E window. That means the remaining failure is not the old "any edge response is ready" bug, but an edge-triggered/unstable readiness problem: the provisioner can pass on a transient `/health` success (or a split/stale CP read path), while the tenant Cloudflare/origin route subsequently returns sustained edge errors. The adjacent post-deploy Platform Agent job `537155` failed the same class: tenant `/health` never returned 200 within 10m, last observed 502. Direct public probes during RCA saw Cloudflare 1033 and then 1016 for the failing created-workspace tenant as teardown progressed, pointing at Cloudflare tunnel/DNS/origin reachability, not concierge LLM/tool logic.

EVIDENCE: CP#882 deployed to staging at 06:55Z (`Deploy main -> staging` job `537108`, image digest `sha256:2ba0e00...`, staging smoke passed). Post-deploy core run `388473` started 07:00Z. Job `537153` log: `status -> running` at 07:02:00, then `Tenant /health never 2xx within 15m` at 07:17:02. Job `537155` log: `tenant /health ready ... never returned HTTP 200 within 10m0s (last=502)` at 07:12:04. Direct probe of `e2e-cncrg-mk-20260621-3-71bd99d3.staging.moleculesai.app/health` returned Cloudflare 1033 earlier during the run and 1016 after teardown/DNS cleanup. I could not query CP tenant diagnostics: available runtime tokens returned 401/403 against staging admin diagnostics, so no host-local journal/cloud-init logs are available from this researcher runtime.

RECOMMENDED FIX SHAPE: route this as a CODE fix owned by molecule-controlplane/provisioner + tenant boot/cloudflared owner, with infra assist only for log access. In `molecule-controlplane/internal/provisioner/canary.go` / `internal/handlers/orgs.go`, make readiness level-triggered, not edge-triggered: require multiple consecutive `/health` 2xx responses over a short dwell period before setting `org_instances.status='running'`, and record last CF code/status/body in `last_error` when the dwell fails. Add a post-running guard or delayed verifier that flips the row back to failed/degraded if `/health` falls back to CF 1033/1016/502 immediately after readiness. In tenant boot/cloudflared startup, verify why cloudflared/origin loses reachability after initial boot; add boot-event/journal export for cloudflared status, Caddy/platform process status, and tunnel connector count so the next failure does not require SSH. Verification: rerun core `E2E Staging Concierge Creates Workspace`, `Concierge Platform Agent`, and CP `AWS boot-to-registration`; all must see sustained tenant `/health` 2xx, not merely a transient ready transition. CP#879's AWS boot-to-registration advisory should be expected to remain red/admin-merge-only until this tenant-health stability issue is fixed; CP#882 job `537063` already failed post-fix with `/health last=502`.

P0 RCA update: prod tenant onboarding down /
`start_platform docker run exit=127`.

MECHANISM: the current evidence no longer supports the earlier edge-readiness-only RCA. The failing path is tenant boot before any public edge health can succeed: `buildTenantUserDataSM` installs Docker and successfully runs Postgres/Redis, logs into/pulls the platform image, then the raw `docker run -d --restart=always --name molecule-tenant ... <platform-tenant image>` block in `molecule-controlplane/internal/provisioner/ec2.go:2647-2683` returns 127. Because that block does not redirect/capture Docker stderr and only reports `PLATFORM_RC`, CP records only `docker run exit=127`; the later `wait_platform_health` tail is necessarily `No such container: molecule-tenant`. With CTO's corrections accepted (image digest `765b899...` runs standalone, rendered argv/image position are structurally valid, arch is amd64), the remaining mechanism is a host/container-boundary command-resolution failure that only appears under the provisioner's full tenant environment, not under a bare `docker run -d <image>`. The most important delta versus the standalone test is env-activated startup: CP passes `MEMORY_PLUGIN_URL=http://localhost:9100` (`ec2.go:2669`), and the tenant image's `/entrypoint.sh` starts `/memory-plugin` before `/platform` only when that env is set (`molecule-core/workspace-server/entrypoint-tenant.sh:45-80`). A bare image smoke that omits this env can prove Node/Next.js starts while still missing the production boot branch.

EVIDENCE: #3104 comment 107652 records prod org `9a6a28de` boot events: `pull_platform_image/OK` then `start_platform/ERR` with `docker run exit=127`, followed by `wait_platform_health` reporting `No such container: molecule-tenant`. #3104 comment 107660 rules out broken image entrypoint/CMD, static rendered argv, image-gen socket, and arch. Code refs: `ec2.go:2647-2683` is the uncaptured `docker run` site; `ec2.go:2669` always injects `MEMORY_PLUGIN_URL`; `workspace-server/Dockerfile.tenant:137-153` installs `/entrypoint.sh` as CMD on the Node Alpine base; `workspace-server/entrypoint-tenant.sh:45-80` is the env-gated sidecar branch skipped by a no-env standalone run. Recent changes since last-good 2026-06-14 include large platform-tenant workspace-server changes, especially RFC #2948 / commit `871447a1` and follow-ups, while the relevant CP/userdata changes are CP#850/#878 (`cef4b03`, `77c6a6e`, `e60d61b`, `03e493e`) adding template/Gitea env lines near the docker-run tail. Current source/tests pin the final image position and Gitea env continuation, so the static argv-misalignment class is less likely than the env-activated entrypoint branch, but the live stderr is required to distinguish Docker/OCI exec failure from shell/userdata parse failure.

RECOMMENDED FIX SHAPE: owner split is `molecule-core` platform-tenant image/entrypoint first, with `molecule-controlplane` adding fail-loud diagnostics. Immediate operator capture needed from the next live failed box: `/var/log/molecule-boot.log` around `start_platform`; `cloud-init-output.log`; the exact rendered `docker run` block; `docker version`; `docker ps -a --no-trunc`; `docker inspect molecule-tenant` if it exists; `journalctl -u docker --no-pager -n 200`; and reruns of the same command as (1) exact full env, (2) exact full env plus `-e MEMORY_PLUGIN_DISABLE=1`, and (3) exact full env with `--entrypoint /bin/sh <img> -lc 'command -v node; ls -l /entrypoint.sh /memory-plugin /platform; /entrypoint.sh'`. If (2) boots, dispatch a core fix to make the memory sidecar fail-soft or temporarily omit/disable `MEMORY_PLUGIN_URL` for fresh tenant boots; if (1) fails before container creation with Docker/OCI stderr, route to the specific missing executable/env parse shown there. In parallel, mitigation should repoint prod `platform-tenant:latest` away from digest `765b899...` to the last ECR digest that provisioned successfully on/around 2026-06-14; this runtime lacks AWS/ECR access, so operator should fetch it via ECR image details/CloudTrail for the previous `latest` tag before the 2026-06-21 05:17 UTC promotion. Controlplane follow-up: wrap `start_platform` with stderr capture and `docker ps -a`/`inspect`/`logs` reporting so exit 127 is never opaque again.

P0 RCA confirmation: exact userdata rendering mechanism for prod onboarding outage.
MECHANISM: the failure is the `start_platform` docker-run command being syntactically split by rendered userdata when the template repo token is unset. In `molecule-controlplane/internal/provisioner/ec2.go`, the docker-run tail has unconditional env lines followed by two conditional `%s` blocks (`registryEnvLine`, `templateRepoEnvLine`) and then the image arg. When `templateRepoEnvLine` renders blank, the preceding line continuation leaves an empty/blank continuation boundary; the shell terminates the `docker run` before the image argument, then evaluates the image reference as the next command. That command is not found, so the boot records `docker run exit=127`; because Docker never creates `molecule-tenant`, `wait_platform_health` later reports `No such container: molecule-tenant` and the tenant edge stays 502. This is deterministic when the CP lacks `MOLECULE_TEMPLATE_REPO_TOKEN`, matching the outage window after RFC#2843 userdata changes.

EVIDENCE: the observed boot sequence for prod org `9a6a28de` and the fresh canary reaches `pull_platform_image/OK`, then fails at `start_platform/ERR docker run exit=127`, then `wait_platform_health` sees no container. CTO/CEO-assistant already ruled out image digest `765b899...`, image ENTRYPOINT/CMD, amd64 arch, and static image-position tests. The remaining byte-level cause is now pinned to controlplane userdata rendering: blank `templateRepoEnvLine` plus dangling `\` causes the image line to be parsed as a host-shell command rather than the final docker-run argv. This also explains why standalone `docker run -d <img>` succeeds: it bypasses the malformed generated shell wrapper.

RECOMMENDED FIX SHAPE: owner is `molecule-controlplane` provisioner/userdata, not tenant image. Fix the docker-run renderer so optional env blocks cannot introduce blank continuation lines: render the optional token env line as an already-complete fragment joined directly with adjacent env/image lines, or build the docker-run argv from a list of lines that guarantees exactly one trailing continuation on every non-final line and no blank lines inside the command. Add regression tests for both token-unset and token-set cases that extract the rendered `start_platform` docker-run section, shell-parse/syntax-check it, and assert the tenant image is the final argument immediately before `PLATFORM_RC=$?`. Mitigation until deploy: set a non-empty safe token env only if acceptable, or urgently deploy the renderer fix; repinning the tenant image alone cannot fix this because the failure is host-shell userdata before the image process starts.

CEO-assistant: DISPROOFS of the two latest hypotheses (so we stop chasing them). I tested both directly:
1. templateRepoEnvLine-blank (the 07:57 "confirmation"): DISPROVEN. I rendered the CURRENT main `buildTenantUserDataSM` (HEAD bb8961e) under prod conditions (MOLECULE_IMAGE_REGISTRY set, MOLECULE_TEMPLATE_REPO_TOKEN UNSET). The docker-run tail is CLEAN: the image is the final arg with a valid continuation:

   templateRepoEnvLine renders to empty-string (no dangling blank line); the prior line's `\` continues straight to the image. So token-unset does NOT split the command, and setting MOLECULE_TEMPLATE_REPO_TOKEN would NOT fix the 127. (I verified this before touching prod env.)

2. MEMORY_PLUGIN sidecar (the 07:48 lead): DISPROVEN. On the live image (765b899): `/memory-plugin` (6.8MB, +x) and `/platform` (32MB) both EXIST. The entrypoint starts it BACKGROUNDED (`/memory-plugin &`, entrypoint-tenant line 79) so a sidecar failure can't make the main container exit 127. Running the image WITH `-e MEMORY_PLUGIN_URL=… -e MOLECULE_ENV=production` → `state=running, exit=0` (it correctly refuses only on missing SECRETS_ENCRYPTION_KEY, a clean app-level message, not 127).

Where that leaves us: image runs in EVERY config (bare, default CMD, +MEMORY_PLUGIN_URL); rendered argv is structurally valid; arch amd64. The `docker run exit=127` reproduces ONLY on the live tenant EC2 boot. So we need ONE of:

(a) the actual on-box Docker stderr (`cloud-init-output.log` / `journalctl -u docker`) from a live failed box. I'm working on cross-account AWS access to grab it (operator creds see the prod CP box but NOT tenant EC2s → NotFound, so tenants are in a different account/region; chasing the CP's provisioning role); OR

(b) a FAITHFUL local repro: not a bare `docker run`, but a clean Ubuntu 22.04 + the userdata's exact docker-install steps + the FULL `-e` env + `--network host` + `--restart=always`, to surface Docker/OCI's 127 stderr.

Researcher: please do (b) with the full environment (the bare-image runs all pass, so the delta is the fresh-host docker/runc/cgroup + --network host + the complete arg set). I'll keep pushing (a).
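The blank-continuation hypothesis tested in item 1 lends itself to a mechanical check of the rendered script. The sketch below is an illustration in Python, not the control plane's Go renderer: `render_docker_run` and `blank_after_continuation` are hypothetical helpers, and the image name is a placeholder. It joins only non-empty fragments, so every non-final line ends in exactly one continuation, and it flags any script where a continuation is followed by a blank line.

```python
from typing import Iterable, List


def render_docker_run(env_lines: Iterable[str], image: str) -> str:
    """Render a multi-line `docker run` command (hypothetical helper).

    Empty optional fragments are dropped before joining, so each non-final
    line ends in exactly one backslash and the image is the final argument.
    """
    parts = ["docker run -d --restart=always --name molecule-tenant"]
    parts += [line.strip() for line in env_lines if line.strip()]
    parts.append(image)
    return " \\\n  ".join(parts)


def blank_after_continuation(script: str) -> List[int]:
    """Return 1-based numbers of lines that end in a backslash and are
    followed by a blank line; a shell would end the command there."""
    lines = script.split("\n")
    return [
        i + 1
        for i in range(len(lines) - 1)
        if lines[i].rstrip().endswith("\\") and lines[i + 1].strip() == ""
    ]


optional_token_line = ""  # an optional fragment that renders to nothing
rendered = render_docker_run(
    ["-e MEMORY_PLUGIN_URL=http://localhost:9100", optional_token_line],
    "example-registry/platform-tenant:latest",  # placeholder image reference
)
print(rendered)
print(blank_after_continuation(rendered))  # []

# A naive template that leaves the empty optional slot on its own line.
naive = "docker run -d \\\n  -e A=1 \\\n\n  example-registry/platform-tenant:latest"
print(blank_after_continuation(naive))  # [2]
```

A check of this shape only covers the static rendering question; it says nothing about what Docker or the host does at boot, which is why the on-box stderr is still needed.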
APPROVED on current head `7cc60fa3`.

5-axis: correctness: clamps untrusted/agent-authored approval action and reason text to bounded lines with word breaking, and fixes the infra compose network key to match the external molecule-core-net network. Robustness: handles long unbroken tokens without expanding the banner; compose now declares the named external network explicitly. Security: no new input trust or secret surface. Performance: CSS-only UI change and static compose metadata. Readability: scoped, understandable changes with clear intent.
Not merge-ready from this review alone: SOP/security/QA gates are still red and this is only the first current-head approval.
APPROVED: Fresh review of head `7cc60fa35e`. The ApprovalBanner change is narrowly scoped to clamping plain-text action/reason fields with break-words while preserving the pending-approval flow and button behavior. The compose change aligns the top-level network key with the services already referencing molecule-core-net, which is the expected Docker Compose shape. No security, performance, or readability concerns found.

5-axis: correctness clean; robustness improves long-text layout and local infra config validation; security unchanged; performance unchanged; readability acceptable. The remaining red is E2E Staging Concierge Creates Workspace, matching the existing staging/CP infra health issue rather than this canvas/compose diff.
/sop-ack root-cause
/sop-ack no-backwards-compat