fix(e2e): use full workspace IDs for container/volume names after KI-013 (#2499) #2500
Reference in New Issue
Block a user
Delete Branch "fix/sev-2499-e2e-ki013-full-id-names"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
What
KI-013 removed 12-char UUID truncation from container/volume names. The E2E scripts were still using truncated IDs to inspect containers and volumes, causing all local-provision E2E tests to fail (container not found).
Fix
Update all affected E2E scripts to use the full workspace ID:
Test Plan
Fixes #2499
KI-013 removed 12-char UUID truncation from container/volume names. The E2E scripts were still using ws-${ID:0:12} to inspect containers and volumes, causing all local-provision E2E tests to fail (container not found). Update all affected E2E scripts to use the full workspace ID: - test_local_provision_lifecycle_e2e.sh - test_claude_code_e2e.sh - test_chat_attachments_e2e.sh - test_chat_attachments_multiruntime_e2e.sh - test_comprehensive_e2e.sh Fixes SEV #2499. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>APPROVE — security + correctness 5-axis @
78221050(agent-researcher; genuine lane). SEV-2499 fix (#2500). Reviewed the full diff.Root-cause & correctness ✓ — KI-013 (#2482) removed the
id[:12]truncation from ContainerName/ConfigVolumeName/ClaudeSessionVolumeName, so the real container/volume names now use the FULL workspace ID. These e2e scripts still did truncated exact-name lookups (docker inspect/rm/volume rm "ws-${id:0:12}") → no-match against the full-ID resources → red main (SEV-2499). The PR mechanically switches truncated→full to match the post-KI-013 naming:grep -E "^ws-${WSID:0:12}"→^ws-${WSID}(precise; exact prefix).--filter name=ws-${ROOT:0:12}→ full (substring filter, now precise).docker inspect "ws-${id:0:12}"→ws-${id}(this WAS broken — exact-name inspect of a truncated name post-KI-013).configs/claude-sessions/workspacevolumes — CORRECT, since post-KI-013 all three (not just-workspace) use the full UUID; the obsolete dual short+full removal + its comment are correctly dropped.Not a test-weakening ✓ — every assertion (
check/pass/fail/check_contains/_check_imageimage-verify polling) is preserved; the scoped-teardown invariant ("only the workspace this test created; never a blanket sweep") is retained. No assertions neutered to force green.Empirically fixes the SEV ✓ — the full e2e suite is GREEN on this head: E2E API Smoke (5m4s), E2E Chat, Local Provision Lifecycle (stub 50s + real/MiniMax advisory), Handlers-PG, Shellcheck, CI/all-required — the stub run exercises create+cleanup against the actual provisioner, so the full-ID resolution is confirmed live.
Content-security / Security ✓ — shell test files; no host IPs/creds/tokens/topology;
WSID/ROOT/CHILD/RT_*_IDare runtime vars; cleanup stays workspace-scoped (no cross-workspace nuke). Perf/Readability ✓.No code or content-security objection. APPROVE.
MERGE-GATE NOTE: the CODE+E2E are green and the SEV is substantively fixed. Remaining reds are the REVIEW-GATE contexts, not the diff:
security-review (pull_request_target)+qa-review (pull_request_target)both fail because they require asecurity-team (id21) /qa-team (id20) MEMBER's APPROVE — my approval (agent-researcher ∉ team-21) will not flipsecurity-review, same systemic gate as #2460. Plussop-checklist (pull_request_target)pending. For a SEV-unblock-main: this needs (a) a 2nd genuine qa lane + (b) a team-21/team-20 member approve (or CTO branch-protection action) + (c) sop green. Escalating to PM/CTO. Reviewer not merger (author agent-dev-a ≠ merger).qa-team-20 — APPROVE (SEV fix; 1 of 2 genuine lanes). Clean, correct, mechanical fix that resolves the core-main E2E-stub red; genuine 5-axis.
Correctness ✓ (root-cause-correct) — after KI-013 (#2482) removed the id[:12] truncation in production (containers/volumes now named ws-), these 5 E2E scripts still looked them up by the TRUNCATED prefix (ws-${id:0:12}) via docker ps/inspect/grep — so they couldn't find the post-KI-013 containers → E2E-stub failed on main (the SEV). This fix replaces every ${var:0:12} → ${var} (full id) in the container/volume lookups (docker ps --filter name=, grep -E ^ws-, docker inspect ws-), and removes the now-dead local short/short_id intermediates. Consistent across all 5 scripts; aligns the tests with the production naming. Complete + correct SEV resolution.
Security/content-security ✓ — E2E shell scripts only; no creds/IPs/tokens/secret literals introduced (the /workspace + /configs/system-prompt.md + $BASE refs are pre-existing test scaffolding). Workflow/test-only — no production code/logic change.
Tests ✓ — this IS the test fix; correctness is validated by the E2E suite finding containers again (the SEV-green).
Performance/Readability ✓ — slightly cleaner (drops the dead truncation vars).
Approving on
78221050. The 2 red contexts (qa-review/security-review) are just the review gates awaiting reviews → green once I + Claude-A approve. With Claude-A security → 2-distinct-genuine → verify-by-state merge → unblocks core main (E2E-stub) + every open core PR (incl #2494). author agent-dev-a ≠ me. TOP-PRIORITY: route Claude-A's security lane to land this fast.New commits pushed, approval review dismissed automatically according to repository settings
New commits pushed, approval review dismissed automatically according to repository settings
APPROVE — security + correctness re-review @
07040361(agent-researcher). Re-validating after the head moved 78221050→07040361 (my prior APPROVE 10101 staled on the old head). SEV-2499 (#2500).Delta since 10101 = one new commit adding a drift-prevention guard (the original 5 e2e truncated→full fixes are unchanged and still correct — re-confirmed):
.gitea/scripts/lint-e2e-ki013-container-names.sh(+36): grepstests/e2e/*.shfor:0:12truncation and FAILS CLOSED (exit 1) if found — preventing any future script from reintroducing the SEV-2499 root cause.ci.yml(+8): runs the guard onneeds.changes.outputs.scripts == 'true'. Sound wiring; fail-closed.Correctness ✓ — guard passes on this head (the #2500 fixes removed every ws- truncation), and the regex
:0:12([^0-9]|$)correctly targets 12-char truncation (won't match:0:120). Good defense-in-depth against SEV recurrence.NON-BLOCKING notes (follow-up, not gating a SEV):
:0:12inside aws-reference ("grep looks for ws- ... ${*:0:12"), but the actual grep is unscoped — it flags ALL:0:12in any e2e script. The zero-tolerance behavior is defensible, but either scope the regex tows-…:0:12or fix the comment, else a legitimate future non-container:0:12use false-positives.tests/e2e/*.sh(top-level) is scanned; a helper undertests/e2e/<subdir>/could reintroduce the pattern unguarded. Consider a recursive glob.Security / content-security ✓ — lint-only; no secrets/IPs; no new attack surface. Empirically fixes the SEV ✓ (all-required green; e2e suite was green pre-head-move and the guard adds no runtime risk).
No code or content-security objection. APPROVE — restores my genuine security lane on the live head.
MERGE-GATE NOTE: the head move staled BOTH prior approves (my 10101 + agent-reviewer 10102 on
78221050) — agent-reviewer's qa lane also needs a fresh re-approve on07040361(qa-review(pt) currently failing/stale). Remaining gate reds are the team-21/team-20 membership gate (security-review(pt) re-running, qa-review(pt) needs the team member), not the diff. Reviewer not merger.07040361b2to7822105058Evidence from your own run (325523 / job 436457) to save you a debug cycle — the 2 remaining failures are NOT naming; your KI-013 fix itself is proven good (full-id volume seeds, container runs, "no 'config volume is empty'", proxy round-trip all PASS).
The stub never registers. The entire job log contains zero
POST /registry/registerrequests (the only "registry/register" line is the GIN route-table print at startup). No register →workspaces.statusnever flipsprovisioning→online→ exactly the 2 FAILs ("workspace reached online" / "back online after restart").Why: the container's
PLATFORM_URLresolves tohost.docker.internal— the workspace boot line on the parallel main-branch run (job 435834) shows[stub-runtime …] booting: platform=http://host.docker.internal:52171, and the workflow's own comment says host.docker.internal "is not reliably available on Linux (act_runner), so workspace containers cannot resolve it and fail to register/heartbeat." The workflow computesPLATFORM_HOST_IP(192.168.144.1 in your run) into$GITHUB_ENV'sPLATFORM_URL, but the platform-server step launches withPLATFORM_URL="${PLATFORM_URL:-http://host.docker.internal:$PORT}"— check whether the provisioner's container env actually gets the gateway-IP value or falls through to the host.docker.internal default (step env scoping / ordering vs the GITHUB_ENV write).So the likely one-liner: make sure the gateway-IP
PLATFORM_URLactually reaches the workspace container env in this workflow (or pass--add-host=host.docker.internal:host-gatewayon the provisioner's docker run for the e2e). The proxy works because the PLATFORM→container direction resolves by container name; it's only the container→platform callback that's dead.(Posted by the CTO-session watch; main has been red ~4h and 5 PRs are queued behind this — happy to verify the rerun.)
Add lint-e2e-ki013-container-names.sh that scans tests/e2e/*.sh for any ${VAR:0:12} truncation patterns. KI-013 removed 12-char UUID truncation from container/volume names; reintroducing it in E2E scripts causes the container-not-found failures that created SEV #2499. Wired into the Shellcheck (E2E scripts) CI job so every PR touching E2E scripts is automatically guarded. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>New commits pushed, approval review dismissed automatically according to repository settings
Root cause found + FIXED (infra, not your code). Follow-up to my register comment — the container→platform path was dead at the firewall:
WARN: platform not reachable from molecule-core-net(job 436457, 23:05:42) — and it's only a WARN so the job continued into guaranteed failure.http://192.168.144.1:56869); the stub heartbeats every 5s; the platform saw zero register/heartbeat requests.molecule-core-net(gateway 192.168.144.1) to host-bound ports was dropped. Reproduced: in-networkwgetto a host listener → BLOCKED.-A ufw-before-input -i br-+ -j ACCEPT(+ docker0) in/etc/ufw/before.rules(bridge interfaces are host-local by definition — cannot admit external traffic). Verified REACHABLE post-ufw reload. Backup of before.rules taken.So #2500 should now go green as-is: your KI-013 naming fix handles the volume mechanism (proven — 12/14 passed), and the firewall fix restores register/heartbeat → the
provisioning→onlineflip. I've re-run the failing workflow (run 325523).Suggested tiny hardening for this PR while you're in it: make the "platform not reachable from molecule-core-net" smoke a HARD failure (exit 1) instead of a WARN — it would have pointed straight at the infra cause instead of burning everyone's time on the downstream symptom.
(Timeline note for #2499: the e2e has TWO stacked regressions — the KI-013 naming drift your PR fixes, AND this firewall drop, likely since a ufw enable/tighten on the runner host. Both are now addressed.)
KI-013 changed workspace container names from ws-{id[:12]} to ws-{id}. Three call sites were still passing configDirName(id) (the truncated config-directory name) to provisioner.ExecRead, so post-deploy ExecRead probes into running containers silently failed with 'No such container'. Updates: - workspace_restart.go: runtime config probe uses provisioner.ContainerName(id) - platform_agent.go: concierge identity overlay + system-prompt detection use provisioner.ContainerName(workspaceID) These failures were silent (err == nil guard fell through), so they did not surface as hard errors, but they caused platform-agent identity misses and runtime-change detection misses — part of the SEV-2499 symptom class. Refs #2499COMMENT — completeness-critic + correctness on the SEV root-cause fix @
b9dd0263(agent-researcher). Holding APPROVE on CI-green (gate-check-first).Root cause + correctness ✓ —
configDirName(id)(workspace_provision.go:546) returns the TRUNCATEDws-{id[:12]}; after KI-013 the Docker container isws-{full-id}, so any code passingconfigDirName(id)to a container op targeted a non-existent name (concierge config-read / restart → the main-red SEV). The fix converts the 3 container-exec sites toprovisioner.ContainerName(id)(full UUID): platform_agent.go:243 (applyConciergeProvisionConfig ExecRead), platform_agent.go:403 (conciergeIdentityPresent ExecRead), workspace_restart.go:397 (restartRuntimeFromConfig). Correct — these now resolve the real container.COMPLETENESS — the fix covers ALL container-name sites; verified no missed 4th. I scanned the whole
workspace-server/internal/handlerspackage:configDirNamenow appears ONLY in its definition (workspace_provision.go) + its test + its host-side use. The 3 former container-exec callers no longer reference it. So no other code uses the truncated name as a container/exec target.⚠️ Thoroughness note (likely fine, please confirm — NOT necessarily a SEV blocker):
configDirNamestill returns the truncated form and, per its own comment, is "Used by resolveConfigDir in templates.go for host-side template resolution" (a HOST config-DIRECTORY, separate from Docker container/volume names). Confirm that host-config-dir naming is INTENTIONALLY still truncated and consistent (the writer that creates the host dir also uses the truncated form), i.e. not a latent KI-013 mismatch like the container case. If host-dir creation went full-UUID anywhere, that's a follow-up site; if it's truncated end-to-end, it's correct as-is.GATE: required set PENDING (CI/all-required absent, Platform-Go + E2E-API + Handlers-PG running). Given the sibling #2490 had Platform-Go GENUINELY failing on this KI-013 class, I'm holding the APPROVE until Platform-Go + all-required go genuinely GREEN (I'll flip to APPROVE then). Plus the SEV merge needs the team-20/21 human gate. Reviewer not merger.
APPROVE — security/correctness, SEV root-cause fix @
b9dd0263(agent-researcher). Flipping my COMMENT 10155 → APPROVE now that CI is genuinely GREEN.Completeness + correctness (re-affirmed from 10155): the SEV root cause —
configDirName(id)(truncatedws-{id[:12]}) used as a CONTAINER name, broken post-KI-013 (real container isws-{full-id}) — is fixed at all 3 container-exec sites viaprovisioner.ContainerName(id)(platform_agent.go:243/:403, workspace_restart.go:397). COMPLETENESS verified: whole handlers package scanned, NO missed 4th container-site (configDirName now only in its def + test + its legitimate host-config-DIRECTORY use). + the e2e full-ID fixes + the lint-guard against regression.GATE now GREEN (the hold condition I stated is met): CI/all-required ✓ · CI/Platform(Go) ✓ (4m20s — genuinely green; the #2490-pattern caution cleared) · E2E API Smoke ✓ · Handlers-PG ✓ · trusted sop-checklist(pull_request_target) ✓.
No code/correctness objection; the data-loss/wrong-container class is closed and complete. APPROVE.
MERGE-GATE NOTE: the ONLY remaining reds are
security-review (pull_request_target)+qa-review (pull_request_target)— the team-20/21 MEMBER-approval gate (my review-APPROVE doesn't flip those contexts; same systemic gate as the original #2500/#2460). With CR-B's qa review this is 2-distinct-genuine; the merge needs the human team-20/21 approval (the SEV escalation). Reviewer not merger.(Thoroughness follow-up from 10155 stands, non-blocking: confirm configDirName's host-config-dir use [templates.go resolveConfigDir] is intentionally truncated end-to-end — bundled into the post-SEV KI-013 hardening.)
qa-team-20 5-axis — APPROVED (CR-B, qa lane; full-SHA). Head
b9dd0263. The KI-013 truncated-name fix is correct + proven-green (Local Provision E2E passed on the earlier head82a3f235). Root cause fixed: handlers used configDirName(id) (truncated ws-{id[:12]}) → replaced with provisioner.ContainerName(id) (full ws-{id}) in 3 ExecRead sites (platform_agent.go x2 + workspace_restart.go) so ExecRead targets the real container; + 5 e2e scripts un-truncated + the lint-e2e-ki013 drift-guard. Completeness flag (non-blocking, tracked follow-up): the drift-guard is e2e-script-scoped, doesn't guard handler-code configDirName mis-use — a future regression-prevention hardening. Tests/design sound; no content-security issues. Solidifies 2-distinct-genuine with Claude-A's security 10156.