The hourly Sweep stale Cloudflare Tunnels job got cancelled mid-cleanup
on 2026-05-02 (run 25248788312, killed at 5min after deleting 424/672
stale tunnels). A second manual dispatch finished the remaining 254
fine, so the immediate backlog cleared, but two underlying bugs would
re-trip on the next big cleanup.
Bug 1: serial delete loop. The execute branch was a `while read; do
curl -X DELETE; done` pipeline at ~0.7s/tunnel — fine for the
steady-state cleanup of a handful, but a 600+ backlog needs ~7-8min.
This commit fans out to $SWEEP_CONCURRENCY (default 8) workers via
`xargs -P 8 -L 1 -I {} bash -c '...' _ {} < "$DELETE_PLAN"`. With 8x
parallelism the same 600+ list drains in ~60s. Notes:
- We use stdin (`<`) not GNU's `xargs -a FILE` so the script stays
portable to BSD xargs (matters for local-runner testing on macOS).
- We pass ONLY the tunnel id on argv. xargs tokenizes on whitespace
by default; tab-separating id+name on argv risks mangling. The
name is kept in a side-channel id->name map ($NAME_MAP) and looked
up by the worker only on failure, for FAIL_LOG readability.
- Workers print exactly `OK` or `FAIL` on stdout; tally with
`grep -c '^OK$' / '^FAIL$'`.
- On non-zero FAILED, log the first 20 lines of $FAIL_LOG as
"Failure detail (first 20):" — same diagnostic surface as before
but consolidated so we don't spam logs on a flaky CF API.
Bug 2: the workflow's 5-min cap was meant as a hang detector but turned
out to be a real-job-too-slow detector. Raised to 30 min — generous
headroom for the ~60s steady-state run while still surfacing genuine
hangs (and in line with the sweep-cf-orphans companion job).
Bug 3 (drive-by): the existing trap was `trap 'rm -rf "$PAGES_DIR"'
EXIT`, which would have been silently overwritten by any later trap
registration. Replaced with a single `cleanup()` function that wipes
PAGES_DIR + all four new tempfiles (DELETE_PLAN, NAME_MAP, FAIL_LOG,
RESULT_LOG), called once via `trap cleanup EXIT`.
Verification:
- bash -n scripts/ops/sweep-cf-tunnels.sh: clean
- shellcheck -S warning scripts/ops/sweep-cf-tunnels.sh: clean
- python3 yaml.safe_load on the workflow: clean
- Synthetic 30-line delete plan with every 7th id sentinel'd to
return {"success":false}: TEST PASS, DELETED=26 FAILED=4, FAIL_LOG
side-channel name lookup verified.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Recurring failure pattern in redeploy-tenants-on-staging:
##[error]redeploy-fleet returned HTTP 500
##[error]Process completed with exit code 1.
with the per-tenant breakdown in the response body showing the failures
were on ephemeral e2e-* tenants (saas/canvas/ext) whose parent E2E run
tore them down mid-redeploy — SSM exit=2 because the EC2 was already
terminating, or healthz timeout because the CF tunnel was already gone.
The actual operator-facing tenants (dryrun-98407, demo-prep, etc.) all
rolled fine in the same call.
This shape repeats every staging push that overlaps an active E2E run.
The downstream `Verify each staging tenant /buildinfo matches published
SHA` step ALREADY distinguishes STALE vs UNREACHABLE for exactly this
reason (per #2402); only the top-level `if HTTP_CODE != 200; exit 1`
gate misclassifies the race.
Filter: HTTP 500 + every failed slug matches `^e2e-` → soft-warn and
fall through to verify. Any non-e2e-* failure or non-500 HTTP remains
a hard fail, with the failed non-e2e slugs surfaced in the error so
the operator doesn't have to dig the response body out of CI.
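For reference, the classification the gate applies — sketched here in Go
purely as illustration (the real gate is a shell/jq step in the workflow;
the type and field names below are hypothetical):

```go
package gate

import "strings"

// redeployResult is a hypothetical parse of the CP redeploy response.
type redeployResult struct {
	HTTPCode    int
	OK          bool     // top-level ok flag in the body
	FailedSlugs []string // per-tenant failures from the body
}

// classify mirrors the gate: only an HTTP 500 whose failed slugs are ALL
// e2e-* is the known teardown race; everything else stays a hard fail.
func classify(r redeployResult) string {
	if r.HTTPCode == 200 && r.OK {
		return "pass"
	}
	if r.HTTPCode != 500 || len(r.FailedSlugs) == 0 {
		return "hard-fail" // non-500 (including 200 + ok=false) is never the race
	}
	for _, slug := range r.FailedSlugs {
		if !strings.HasPrefix(slug, "e2e-") {
			return "hard-fail" // a real tenant failed; surface its slug
		}
	}
	return "soft-warn" // e2e-only race: warn and fall through to verify
}
```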
Verified the gate logic with 6 synthetic CP responses (happy / e2e-only
race / mixed real+e2e fail / non-200 / 200+ok=false / all-real-fail) —
all behave correctly.
prod's redeploy-tenants-on-main is intentionally NOT touched: prod CP
serves no e2e-* tenants, so the race can't occur there and the strict
gate is the right behavior.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous PR (#2509) flipped canvas outbound file parts to the v1
flat shape `{url, filename, mediaType}` based on a hypothesis that
a2a-sdk's JSON-RPC parser silently dropped v0 `{kind:"file", file:{...}}`
shapes. Live test shows the opposite: a2a-sdk's JSON-RPC layer
validates against the v0 Pydantic discriminated union (TextPart |
FilePart | DataPart), so v1 flat shape is rejected with:
Invalid Request:
params.message.parts.0.TextPart.text — Field required
params.message.parts.0.FilePart.file — Field required
params.message.parts.0.DataPart.data — Field required
The actual root cause of the user-visible "Error: message contained
no text content" was the missing `/workspace` chown (CP PR #381 +
test pin #382), not a wire-shape mismatch. Verified end-to-end by
sending a v0 image-only message after PR #381 + workspace re-provision
— agent receives the file, reads its bytes, and replies normally.
Reverting only the canvas outbound shape. Defensive v1-tolerance
stays in:
- workspace/executor_helpers.py — extract_attached_files still
accepts v1 protobuf parts in case a future client emits them or
a future SDK release flips internal representation. Harmless on
the v0 hot path.
- canvas/message-parser.ts — extractFilesFromTask still tolerates
v1 shape on incoming agent responses. Some agents may emit v1
when their internal serializer round-trips through protobuf.
Tests stay green (91 canvas, 86 workspace).
Image-only chats surface "Error: message contained no text content"
because canvas posts v0 `{kind:"file", file:{uri,name,mimeType}}` shapes
that the workspace runtime's a2a-sdk v1 protobuf parser silently drops:
v1 `Part` has fields `[text, raw, url, data, metadata, filename,
media_type]` and `ignore_unknown_fields=True` discards `kind`+`file`,
producing a fully-empty Part. With no text and no extracted file
attachments, the executor's "no text content" guard fires.
Three coordinated changes close the gap:
1. canvas/ChatTab.tsx — outbound file parts now carry the v1 flat
shape `{url, filename, mediaType}` so the v1 protobuf parser
populates Part fields instead of dropping them.
2. workspace/executor_helpers.py — extract_attached_files learns the
v1 detection branch (non-empty `part.url` + `filename` +
`media_type`) alongside the existing v0 RootModel and flat-file
shapes. Defends every runtime that mounts the OSS wheel against
the same drop, including any pre-fix client still on the wire.
3. canvas/message-parser.ts — extractFilesFromTask tolerates the v1
shape on incoming agent responses too, so file chips render in
chat history regardless of which Part shape the runtime emits.
Test pins:
- workspace/tests/test_executor_helpers.py:
+ v1 protobuf shape extraction
+ empty-Part defense (v0→v1 silent-drop fall-through returns [])
- canvas message-parser test:
+ v1 protobuf flat parts
+ filename fallback to URL basename for v1
The page-merge loop passed the entire accumulating tunnel JSON to
python3 -c via argv on every iteration. On a busy account (verified
2026-05-02: 672 tunnels, 14 pages on the Hongmingwangrabbit account)
this exceeds the Linux kernel's ~128 KB single-argument limit
(MAX_ARG_STRLEN) on the GH Ubuntu runner and dies with
`python3: Argument list too long` at exit 126 — the workflow has been
silently failing this way since the very first run that hit a real
account, masked earlier by a missing-CF_ACCOUNT_ID secret check.
Buffer each page response to a file under a temp dir, merge from disk
at the end. Also bumps the page cap from 20 to 40 (1000 → 2000 tunnel
ceiling) so the existing soft-cap warning has headroom; the disk-merge
shape is O(n) in tunnel count rather than the previous O(n^2) so the
larger ceiling is cheap.
Verified locally against the live account (672 tunnels): script now
runs cleanly to the existing MAX_DELETE_PCT safety gate, which trips
at 99% > 90% as designed and surfaces the actual orphan backlog for
operator-driven cleanup.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Canary started flaking 2026-05-01 22:11 with model-refusal replies:
- "I'm unable to do that."
- "I'm unable to fulfill that request. Can I assist you with anything else?"
- "I'm unable to reply with responses that don't allow me to fulfill tasks…"
3 fails / 10 recent runs ≈ 30% flake.
Trigger: 2026-04-30's Platform Capabilities preamble (#2332) added the
directive "Use them proactively" to the top of every system prompt.
Combined with the heavy A2A + HMA tool docs further down, the model
reads the contrived bare-echo prompt ("Reply with exactly: PONG") as
out-of-role and intermittently refuses.
Real user prompts don't hit this — only the synthetic smoke prompt does,
so the right fix is in the canary's prompt phrasing, not the platform's
system prompt (which is correctly priming agents toward tool use). New
phrasing explicitly tells the model "this is a smoke test" and "no
tools or memory are needed" so it has permission to comply.
Also updates the child workspace's CHILD_PONG prompt with the same
framing — same failure mode would have hit it once full-mode runs again.
No code change to system prompt, no test infra change. Just two prompt
strings + a load-bearing comment so future readers don't trim back to
the brittle phrasing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#1569 Phase 1 discovery (2026-05-02) found six historical credential
exposures in molecule-core git history. All confirmed dead — but the
reason they got committed in the first place was that the local
pre-commit hook had two gaps that the canonical CI gate (and the
runtime's hook) didn't have:
1. **Pattern set was incomplete.** Local hook checked
`sk-ant-|sk-proj-|ghp_|gho_|AKIA|mol_pk_|cfut_` — missing
`ghs_*`, `ghu_*`, `ghr_*`, `github_pat_*`, `sk-svcacct-`,
`sk-cp-`, `xox[baprs]-`, `ASIA*`. The historical leaks were 5×
`ghs_*` (App installation tokens) + 1× `github_pat_*` — none of
which the local hook would have caught even if it ran.
2. **`*.md` and `docs/` were skip-listed.** The leaked tokens lived
in `tick-reflections-temp.md`, `qa-audit-2026-04-21.md`, and
`docs/incidents/INCIDENT_LOG.md` — exactly the file types the
skip-list excluded. The hook ran and silently passed.
This commit:
- Replaces the local hook's hard-coded inline regex with the canonical
13-pattern array (byte-aligned with `.github/workflows/secret-scan.yml`
and the workspace runtime's `pre-commit-checks.sh`).
- Removes the `\.md$|docs/` skip — keeps only binary, lockfile, and
hook-self exclusions.
- Adds the local hook to `lint_secret_pattern_drift.py` as an in-repo
consumer (read-from-disk, no network — the hook lives in the same
  checkout the lint runs against). Drift now fails the lint when the
  canonical set changes without the local hook updating in lockstep.
- Adds `.githooks/pre-commit` to the drift-lint workflow's path
filter so consumer-side edits also trigger the lint.
- Adopts the canonical scanner's "don't echo the matched value" defense (the
prior version would have round-tripped a leaked credential into
scrollback / CI logs).
Verified: `python3 .github/scripts/lint_secret_pattern_drift.py`
reports both consumers aligned at 13 patterns. The hook's existing
six other gates (canvas 'use client', dark theme, SQL injection,
go-build, etc.) are untouched.
Companion change (already applied via API, no diff here):
`Scan diff for credential-shaped strings` is now in the required-checks
list on both `staging` and `main` branch protection — was previously a
soft gate (workflow ran, exited 1, but didn't block merge).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both restart paths (interactive Restart handler + auto-restart's
stopForRestart) used to log-and-continue on cpProv.Stop failure. After
PR #2500 made CPProvisioner.Stop surface CP non-2xx as an error, those
paths became the actual leak generator: every transient CP/AWS hiccup =
one orphan EC2 alongside the freshly provisioned one. The 13 zombie
workspace EC2s on demo-prep staging traced to this exact path.
Adds cpStopWithRetry helper with bounded exponential backoff (3 attempts,
1s/2s/4s). Different policy from workspace_crud.go's Delete handler:
Delete returns 500 to the client on Stop failure (loud-fail-and-block —
user asked to destroy, silent leak unacceptable), whereas Restart's
contract is "make the workspace alive again" — refusing to reprovision
strands the user with a dead workspace. So this helper retries to absorb
transient failures, then on exhaustion emits a structured `LEAK-SUSPECT`
log line for the (forthcoming) CP-side workspace orphan reconciler to
correlate. Caller proceeds to reprovision regardless.
ctx-cancel exits the retry early without sleeping the backoff (matters
during shutdown drain); the cancel path emits a distinct log line and
deliberately does NOT emit LEAK-SUSPECT — operator-cancel and
retry-exhaustion are different signals and conflating them would noise
up the orphan-reconciler queue with workspaces we never had a chance to
retry.
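A minimal sketch of the helper's shape, assuming the names above (the real
Stop signature may differ; the 1s/2s/4s sequence is read here as one
immediate try plus backed-off retries):

```go
package handlers

import (
	"context"
	"log"
	"time"
)

// CPProvisioner is reduced to the one method this sketch needs.
type CPProvisioner interface {
	Stop(ctx context.Context, workspaceID string) error
}

func cpStopWithRetry(ctx context.Context, cp CPProvisioner, wsID string) {
	backoffs := []time.Duration{time.Second, 2 * time.Second, 4 * time.Second}
	for i := 0; ; i++ {
		err := cp.Stop(ctx, wsID)
		if err == nil {
			return
		}
		if ctx.Err() != nil {
			// Operator cancel / shutdown drain: distinct log line, and
			// deliberately NO LEAK-SUSPECT — we never got a full retry window.
			log.Printf("cp stop canceled workspace=%s: %v", wsID, ctx.Err())
			return
		}
		if i >= len(backoffs) {
			// Retries exhausted: structured marker for the orphan reconciler.
			log.Printf("LEAK-SUSPECT workspace=%s cp stop exhausted: %v", wsID, err)
			return // caller proceeds to reprovision regardless
		}
		select {
		case <-time.After(backoffs[i]): // 1s / 2s / 4s
		case <-ctx.Done():
			log.Printf("cp stop canceled during backoff workspace=%s", wsID)
			return
		}
	}
}
```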
Tests: 5 behavior tests covering every branch (no-op, first-try success,
eventual success, exhaustion, ctx-cancel) + 1 AST gate that pins the
helper-only invariant (any future inline `h.cpProv.Stop(...)` in
workspace_restart.go fires the gate, mutation-tested).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Auto-trigger from publish-workspace-server-image now resolves
target_tag to the just-published `staging-<short_head_sha>` digest
instead of `:latest`. Bypasses the dead retag path that was leaving
prod tenants on a 4-day-old image.
The chain pre-fix:
publish-image → pushes :staging-<sha> + :staging-latest (NOT :latest)
canary-verify → soft-skips (CANARY_TENANT_URLS unset, fleet not stood up)
promote-latest → manual workflow_dispatch only, last run 2026-04-28
redeploy-main → pulls :latest → 2026-04-28 digest → all 3 tenants STALE
Today's incident:
e7375348 (main) → publish-image green → redeploy fired → tenants
pulled :latest (76c604fb digest from prior canary-verified state) →
hongming /buildinfo returned 76c604fb instead of e7375348 → verify
step correctly flagged 3/3 STALE → workflow failed.
Today's PRs (#2473 smoke wedge, #2487 panic recovery, #2496 sweeper
followups) shipped to GHCR as :staging-<sha> but never reached prod.
Fix:
- workflow_dispatch input default '' (was 'latest'); empty input
triggers auto-compute path
- new "Compute target tag" step resolves:
1. operator-supplied input → verbatim (rollback / pin)
2. else → staging-<short_head_sha> (auto)
- verify step's operator-pin detection now allows
staging-<short_head_sha> as a non-pin (verification still runs)
When canary fleet is real, this workflow should chain on
canary-verify completion (workflow_run from canary-verify, gated on
promote-to-latest success) instead of publish-image — separate,
smaller PR. Today's fix unblocks prod deploys without that
prerequisite.
Companion: promote-latest.yml dispatched 2026-05-02 against
e7375348 to unstick existing prod tenants. This PR prevents
recurrence.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
http.Client.Do only errors on transport failure — a CP 5xx (AWS
hiccup, missing IAM, transient outage) was silently treated as
success. Workspace row then flipped to status='removed' and the EC2
stayed alive forever with no DB pointer (the "orphan EC2 on a
0-customer account" scenario flagged in workspace_crud.go #1843).
Found while triaging 13 zombie workspace EC2s on demo-prep staging.
Adds a status-code check that returns an error tagged with the
workspace ID + status + bounded body excerpt, so the existing
loud-fail path in workspace_crud.go's Delete handler can populate
stop_failures and surface a 500. Body read is io.LimitReader-capped
at 512 bytes to keep error logs sane during a CP outage.
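The shape of the check, sketched with assumed names:

```go
package cp

import (
	"fmt"
	"io"
	"net/http"
)

// checkStopResponse turns any non-2xx CP reply into a tagged error with a
// bounded body excerpt; 200/202/204 all count as success.
func checkStopResponse(resp *http.Response, workspaceID string) error {
	if resp.StatusCode >= 200 && resp.StatusCode < 300 {
		return nil
	}
	// Cap the read so a chatty CP outage can't flood the error logs.
	excerpt, _ := io.ReadAll(io.LimitReader(resp.Body, 512))
	return fmt.Errorf("cp stop workspace=%s: status=%d body=%q",
		workspaceID, resp.StatusCode, excerpt)
}
```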
Tests: 4 new (5xx surfaces, 4xx surfaces, 2xx variants 200/202/204
all succeed, long body is truncated). Test-first verified — the
first three fail on the buggy code and all four pass on the fix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors what auto-sync-main-to-staging.yml would have produced if its
on:push trigger had fired for the GITHUB_TOKEN-initiated merge of PR
#2437 (staging→main) on 2026-05-01. Per the diagnosis in PR #2497,
that push was suppressed by GitHub's no-recursion rule, leaving
staging missing main's merge commit and deadlocking PR #2442
(Phase 2 promote) on mergeStateStatus: BEHIND.
This sync absorbs only the merge commit 76c604fb (no code-change
diff — it's a merge of staging back to itself from a prior round).
The proper fix (PR #2497) makes this self-healing for future rounds.
auto-sync-main-to-staging.yml hasn't fired since 2026-04-29 despite
multiple staging→main promotes since. The promote PR #2442 (Phase 2)
has been wedged on `mergeStateStatus: BEHIND` for hours because
staging is missing the merge commit from PR #2437.
Three compounding bugs, all fixed here:
1. **GitHub no-recursion suppresses the `on: push` trigger.**
When the merge queue lands a staging→main promote, the resulting
push to main is "by GITHUB_TOKEN", and per
https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow
that push event does NOT fire any downstream workflows. Verified
empirically against SHA 76c604fb (PR #2437): exactly ONE workflow
fired on that push — `publish-workspace-server-image`, dispatched
explicitly by auto-promote-staging.yml's polling tail with an App
token (the documented #2357 workaround). Every other `on: push`
workflow on main, including auto-sync, was silently suppressed.
Same fix extended here: auto-promote-staging.yml's polling tail
now ALSO dispatches `auto-sync-main-to-staging.yml --ref main`
via the App token after the merge lands. App-initiated dispatch
propagates `workflow_run` cascades, which is what the publish
tail relies on too. Failure path: emits `::error::` with the
recovery command — operator runs it once and the next promote
self-heals.
auto-sync.yml gains `workflow_dispatch:` so it can be invoked
from the dispatch above + manually if a future promote also
misses (defense in depth).
2. **`runs-on: [self-hosted, macos, arm64]` was wrong for this repo.**
Comment claimed "matches the rest of this repo's workflows" — false:
this is the ONLY workflow in molecule-core/.github/workflows/ with
a non-ubuntu runs-on. Copy-paste artefact from molecule-controlplane
(which IS private and has a Mac runner). molecule-core has no Mac
runner registered, so even when the trigger DID fire (the 3 historic
manual-UI merges), the job would have sat unassigned if the runner
were offline. Switched to `ubuntu-latest` to match every other
workflow in this repo.
3. **The `on: push` trigger remains** as a defense-in-depth path for
the rare case of a manual UI merge by a real user (which uses
their PAT and DOES fire downstream workflows — confirmed via the
2026-04-29 d35a2420 run with `triggering_actor=HongmingWang-Rabbit`
that fired 16 workflows including auto-sync). Belt-and-suspenders.
Long-term: switching auto-promote's `gh pr merge --auto` call to use
the App token (instead of GITHUB_TOKEN) would let `on: push` triggers
fire naturally and obviate the need for the explicit dispatches in
the polling tail. Tracked in #2357 — out of scope here.
Operator recovery for the current Phase 2 wedge: after this lands on
staging, dispatch auto-sync once via
`gh workflow run auto-sync-main-to-staging.yml --ref main` to
backfill the missed sync from 76c604fb. PR #2442 will go from
BEHIND → CLEAN and auto-merge.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-ups from PR #2494's review:
1. Two new sweep tests exercise the lookup path through
sweepStuckProvisioning end-to-end:
- ManifestOverrideSparesRow: claude-code 11min old, manifest=20min
→ no UPDATE, no broadcast (sparing works through the sweeper)
- ManifestOverrideStillFlipsPastDeadline: claude-code 21min old,
manifest=20min → flipped + payload.timeout_secs=1200
   Closes the gap that the unit test on provisioningTimeoutFor alone
   left open: a future refactor could drop the lookup arg from the
   sweeper's call and the unit test would still pass. Verified by
regression-injecting `lookup→nil` in sweepStuckProvisioning — both
new tests fail, the old ones still pass.
2. addProvisionTimeoutMs now goes through ProvisionTimeoutSecondsForRuntime
instead of calling provisionTimeouts.get directly. Single accessor
path for the same data — the canvas response and the sweeper now
resolve identically by construction.
No production behavior change; tests + accessor cleanup only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two review nits from PR #2493 that don't affect correctness but matter
for honesty in the harness's own self-documentation:
1. tenant-isolation.sh F3/F4 used assert_status for non-HTTP values.
LEAKED_INTO_ALPHA/BETA are jq-derived counts, not HTTP codes — but
the assertion ran through assert_status, which formats the result
as "(HTTP 0)". Anyone reading the test output would believe these
assertions involved an HTTP call. Adds a plain `assert` helper
matching per-tenant-independence.sh's pattern, and uses it on the
two count comparisons.
2. per-tenant-independence.sh Phase F over-claimed coverage.
The comment said the concurrent-INSERT race catches "shared-pool
corruption" + "lib/pq prepared-statement cache collision". Both
are real failure modes — but neither can fire across tenants in
THIS topology, because each tenant owns its own DATABASE_URL and
its own postgres-{alpha,beta} container. The comment now lists
only what the test actually catches (redis cross-keyspace bleed,
shared cp-stub state corruption, cf-proxy buffer mixup) and notes
that a future shared-Postgres variant is the right place for the
lib/pq cache assertion.
No behavioural change — both replays still pass 13/13 + 12/12, all six
replays pass on a clean run-all-replays.sh boot.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real wiring gap discovered while investigating the issue #2486 cluster
of prod claude-code workspaces that all failed at exactly 10 min. The
runtimeProvisionTimeoutsCache (#2054 phase 2) reads
runtime_config.provision_timeout_seconds from each template's
config.yaml so the **canvas** spinner respects per-template timeouts —
but the **sweeper** in registry/provisiontimeout.go hardcoded 10 min
(claude-code) / 30 min (hermes) and never consulted the manifest. So a
template that declared a longer window had a UI that waited correctly
but a sweeper that killed the row at the hardcoded floor anyway.
Resolution order pinned by the new TestProvisioningTimeout_ManifestOverride
(sketched after the list):
1. PROVISION_TIMEOUT_SECONDS env (ops-debug global override)
2. Template manifest lookup (per-runtime, beats hermes default too)
3. Hermes default (30 min — CP bootstrap-watcher 25 min + 5 min slack)
4. DefaultProvisioningTimeout (10 min)
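A sketch of that order, assuming the names above (the real function
threads the lookup through the sweep wiring below):

```go
package registry

import (
	"os"
	"strconv"
	"time"
)

// RuntimeTimeoutLookup resolves a per-runtime manifest timeout, if declared.
type RuntimeTimeoutLookup func(runtime string) (time.Duration, bool)

const DefaultProvisioningTimeout = 10 * time.Minute

func provisioningTimeoutFor(runtime string, lookup RuntimeTimeoutLookup) time.Duration {
	// 1. ops-debug global override
	if v := os.Getenv("PROVISION_TIMEOUT_SECONDS"); v != "" {
		if secs, err := strconv.Atoi(v); err == nil && secs > 0 {
			return time.Duration(secs) * time.Second
		}
	}
	// 2. per-runtime manifest value — beats the hermes default too
	if lookup != nil {
		if d, ok := lookup(runtime); ok {
			return d
		}
	}
	// 3. hermes default: CP bootstrap-watcher 25 min + 5 min slack
	if runtime == "hermes" {
		return 30 * time.Minute
	}
	// 4. global floor
	return DefaultProvisioningTimeout
}
```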
Wiring:
- registry: new RuntimeTimeoutLookup function type, threaded through
StartProvisioningTimeoutSweep + sweepStuckProvisioning + the
pre-existing provisioningTimeoutFor.
- handlers: ProvisionTimeoutSecondsForRuntime exposes the cache's
lookup as a method so main.go can pass it without breaking the
handlers→registry import direction.
- cmd/server/main.go: wire wh.ProvisionTimeoutSecondsForRuntime into
the sweep boot.
Verified:
- go test -race ./... passes (every workspace-server package).
- Regression-injected the lookup arm: 3 manifest-override subcases
fail with the actual-vs-expected gap, confirming the new test is
load-bearing.
- The original two timeout tests (env-override, hermes default) keep
passing — `lookup=nil` argument preserves their semantics.
Operator action enabled: a template wanting a 15-min window can now
just set `runtime_config.provision_timeout_seconds: 900` in its
config.yaml and the sweeper honours it on the next workspace-server
restart.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings the local harness from "single tenant covering the request path"
to "two tenants covering both the request path AND the per-tenant
isolation boundary" — the same shape production runs (one EC2 + one
Postgres + one MOLECULE_ORG_ID per tenant).
Why this matters: the four prior replays exercise the SaaS request
path against one tenant. They cannot prove that TenantGuard rejects
a misrouted request (production CF tunnel + AWS LB are the failure
surface), nor that two tenants doing legitimate work in parallel
keep their `activity_logs` / `workspaces` / connection-pool state
partitioned. Both are real bug classes — TenantGuard allowlist drift
shipped in #2398, lib/pq prepared-statement cache collision is documented
as an org-wide hazard.
What changed:
1. compose.yml — split into two tenants.
tenant-alpha + postgres-alpha + tenant-beta + postgres-beta + the
shared cp-stub, redis, cf-proxy. Each tenant gets a distinct
ADMIN_TOKEN + MOLECULE_ORG_ID and its own Postgres database. cf-proxy
depends on both tenants becoming healthy.
2. cf-proxy/nginx.conf — Host-header → tenant routing.
`map $host $tenant_upstream` resolves the right backend per request.
Required `resolver 127.0.0.11 valid=30s ipv6=off;` because nginx
needs an explicit DNS resolver to use a variable in `proxy_pass`
(literal hostnames resolve once at startup; variables resolve per
request — without the resolver nginx fails closed with 502).
`server_name` lists both tenants + the legacy alias so unknown Host
headers don't silently route to a default and mask routing bugs.
3. _curl.sh — per-tenant + cross-tenant-negative helpers.
`curl_alpha_admin` / `curl_beta_admin` set the right
Host + Authorization + X-Molecule-Org-Id triple.
`curl_alpha_creds_at_beta` / `curl_beta_creds_at_alpha` exist
precisely to make WRONG requests (replays use them to assert
TenantGuard rejects). `psql_exec_alpha` / `psql_exec_beta` shell out
per-tenant Postgres exec. Legacy aliases (`curl_admin`, `psql_exec`)
keep the four pre-Phase-2 replays working without edits.
4. seed.sh — registers parent+child workspaces in BOTH tenants.
Captures server-generated IDs via `jq -r '.id'` (POST /workspaces
ignores body.id, so the older client-side mint silently desynced
from the workspaces table and broke FK-dependent replays). Stashes
`ALPHA_PARENT_ID` / `ALPHA_CHILD_ID` / `BETA_PARENT_ID` /
`BETA_CHILD_ID` to .seed.env, plus legacy `ALPHA_ID` / `BETA_ID`
aliases for backwards compat with chat-history / channel-envelope.
5. New replays.
tenant-isolation.sh (13 assertions) — TenantGuard 404s any request
whose X-Molecule-Org-Id doesn't match the container's
MOLECULE_ORG_ID. Asserts the 404 body has zero
   tenant/org/forbidden/denied keywords (the existence of a tenant must
   not be probeable from the outside). Covers cross-tenant routing
misconfigure + allowlist drift + missing-org-header.
per-tenant-independence.sh (12 assertions) — both tenants seed
activity_logs in parallel with distinct row counts (3 vs 5) and
confirm each tenant's history endpoint returns exactly its own
counts. Then a concurrent INSERT race (10 rows per tenant in
parallel via `&` + wait) catches shared-pool corruption +
prepared-statement cache poisoning + redis cross-keyspace bleed.
6. Bug fix: down.sh + dump-logs SECRETS_ENCRYPTION_KEY validation.
`docker compose down -v` validates the entire compose file even
though it doesn't read the env. up.sh generates a per-run key into
its own shell — down.sh runs in a fresh shell that wouldn't see it,
so without a placeholder `compose down` exited non-zero before
removing volumes. Workspaces silently leaked into the next
./up.sh + seed.sh boot. Caught when tenant-isolation.sh F1/F2 saw
3× duplicate alpha-parent rows accumulated across three prior runs.
Same fix applied to the workflow's dump-logs step.
7. requirements.txt — pin molecule-ai-workspace-runtime>=0.1.78.
channel-envelope-trust-boundary.sh imports from `molecule_runtime.*`
(the wheel-rewritten path) so it catches the failure mode where
   the wheel build silently strips a fix while unit tests against local
   source keep passing. CI was failing this replay because the wheel
wasn't installed — caught in the staging push run from #2492.
8. .github/workflows/harness-replays.yml — Phase 2 plumbing.
* Removed /etc/hosts step (Host-header path eliminated the need;
scripts already source _curl.sh).
* Updated dump-logs to reference the new service names
(tenant-alpha + tenant-beta + postgres-alpha + postgres-beta).
* Added SECRETS_ENCRYPTION_KEY placeholder env on the dump step.
Verified: ./run-all-replays.sh from a clean state — 6/6 passed
(buildinfo-stale-image, channel-envelope-trust-boundary, chat-history,
peer-discovery-404, per-tenant-independence, tenant-isolation).
Roadmap section updated: Phase 2 marked shipped. Phase 3 promoted to
"replace cp-stub with real molecule-controlplane Docker build + env
coherence lint."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per review nit on PR #2491: the previous message ("a goroutine reached
cpProv.Start but never broadcast its failure") could mislead an
operator if Assertion 2 and 4 both fire — Assertion 4 also catches
"goroutine exited via an earlier path before reaching Start." Spell
both modes out and cross-reference Assertion 2.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three changes that bring the local harness from "covers what staging
covers minus the SaaS topology" to "exercises every surface we shipped
this session against the prod-shape Dockerfile.tenant image."
1. Drop the /etc/hosts requirement.
Replays previously needed `127.0.0.1 harness-tenant.localhost` in
/etc/hosts to resolve the cf-proxy. That gated the harness behind a
sudo step on every fresh dev box and CI runner. The cf-proxy nginx
already routes by Host header (matches production CF tunnel: URL is
public, Host carries tenant identity), so the no-sudo path is to
target loopback :8080 with `Host: harness-tenant.localhost` set as
a header.
New `tests/harness/_curl.sh` centralises this — curl_anon /
curl_admin / curl_workspace / psql_exec wrappers all set the Host
+ auth headers automatically. seed.sh, peer-discovery-404.sh,
buildinfo-stale-image.sh updated to source it. Legacy /etc/hosts
users still work via env-var override.
2. Fix the seed.sh FK regression that blocked DB-side replays.
POST /workspaces ignores any `id` in the request body and generates
one server-side. seed.sh was minting client-side UUIDs that never
reached the workspaces table, so any replay that INSERTed into
activity_logs (FK-constrained on workspace_id) failed with the
workspace-not-found error. Capture the returned id from the
response instead.
3. Two new replays cover the surfaces shipped this session.
chat-history.sh — exercises the full SaaS-shape wire that PR #2472
(peer_id filter), #2474 (chat_history client tool), and #2476
(before_ts paging) ride on. 8 phases / 16 assertions: peer_id filter,
limit cap, before_ts paging, OR-clause covering both source_id and
target_id, malformed peer_id 400, malformed before_ts 400, URL-encoded
SQLi-shape rejection. Verified PASS against the live harness.
channel-envelope-trust-boundary.sh — exercises PR #2471 + #2481 by
importing from `molecule_runtime.*` (the wheel-rewritten path) so
it catches "wheel build dropped a fix that unit tests still pass."
5 phases / 11 assertions: malicious peer_id scrubbed from envelope,
agent_card_url omitted on validation failure, XML-injection bytes
scrubbed, valid UUID preserved, _agent_card_url_for direct gate.
Verified PASS against published wheel 0.1.79.
run-all-replays.sh auto-discovers — no registration needed. Full
lifecycle (boot → seed → 4 replays → teardown) runs clean.
Roadmap section updated to reflect Phase 1 (this PR) → Phase 2
(multi-tenant + CI gate) → Phase 3 (real CP) → Phase 4 (Miniflare +
LocalStack + traffic replay).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-merge follow-up to PR #2487 review feedback:
1. guardAgainstReraise(fn) helper around every panic-test exercise. The
original RecoversAndMarksFailed had its own outer recover() to detect
re-raise; NoOpWhenNoPanic and PersistFailureLogged didn't. If a future
regression makes logProvisionPanic re-raise, those two would have
crashed the test process (taking sibling tests down) instead of
   reporting a clean failure. Now all three use the shared guard (a
   sketch follows this list).
2. Concurrent repro now asserts bcast.count == 7 — the new
concurrentSafeBroadcaster's count field was added in the race fix
but not actually consumed. Cross-checks the existing recorder-set
assertion from a different angle: a goroutine could in principle
reach cpProv.Start (recorder hits) but then lose its
WORKSPACE_PROVISION_FAILED broadcast on the failure path. Pinning
both rules out that silent-drop variant for the canvas-broadcast
contract specifically.
3. Comment on captureLog noting log.SetOutput is process-global and
incompatible with t.Parallel() — preempts a future footgun if
someone parallelizes the panic suite.
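The guard's assumed shape, for reference:

```go
package handlers

import "testing"

// guardAgainstReraise runs the exercise under its own recover so a
// re-raise from logProvisionPanic becomes a reported test failure
// instead of a process crash that takes sibling tests down.
func guardAgainstReraise(t *testing.T, fn func()) {
	t.Helper()
	defer func() {
		if r := recover(); r != nil {
			t.Errorf("panic escaped logProvisionPanic (re-raise regression): %v", r)
		}
	}()
	fn()
}
```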
Verified: all four tests pass under -race; full handlers + db packages
green under -race.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- workspace-runtime-package.md: add explicit "Where to make changes"
section documenting the mirror-only policy on
Molecule-AI/molecule-ai-workspace-runtime — direct PRs are auto-rejected
by mirror-guard CI; staging push regenerates both the mirror and the
PyPI wheel via .github/workflows/publish-runtime.yml.
- infra/workspace-terminal.md: replace dead molecule-core#1528 reference
(repo renamed to molecule-monorepo, no longer accepting issues at the
old name) with a forward-pointer to monorepo + molecule-controlplane
issue trackers.
- architecture/backends.md: bump audit date to 2026-05-02 and add rows
for channel envelope enrichment (#2471), chat_history MCP tool
(#2474), /activity before_ts paging (#2476), /activity peer_id filter
(#2472), runtime_wedge smoke gate (#2473 + #2475), and the canvas-E2E
state-file requirement (#2327).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI Platform (Go) ran with -race and the concurrent test tripped the
detector: captureBroadcaster (the sequential-test stub) writes lastData
unguarded, and the 7 fan-out goroutines' markProvisionFailed calls all
write through that stub concurrently. A local non-race run had hidden it.
Introduce concurrentSafeBroadcaster (mutex-counted) for this single
fan-out test. Sequential tests keep using captureBroadcaster — the
fix is local to the test that creates the goroutines.
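Sketch of the stub (field and method shapes assumed beyond what's cited
above):

```go
package handlers

import "sync"

// concurrentSafeBroadcaster: a mutex guards both the broadcast count and
// the last payload, so 7 concurrent markProvisionFailed calls are race-free.
type concurrentSafeBroadcaster struct {
	mu       sync.Mutex
	count    int
	lastData []byte
}

func (b *concurrentSafeBroadcaster) RecordAndBroadcast(data []byte) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.count++
	b.lastData = data
}
```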
Verified ./internal/handlers passes with -race.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three fixes addressing review of the issue #2486 observability PR:
1. CI failure: original inline UPDATE in logProvisionPanic used a hard-coded
`status='failed'` literal, which trips workspace_status_enum_drift_test
(the post-PR-#2396 gate that requires every status write to flow through
models.Status* via parameterized $N). Refactor to call
h.markProvisionFailed which uses StatusFailed parameterized.
2. Canvas-broadcast gap (review finding): inline UPDATE skipped
RecordAndBroadcast, so panic recovery marked the row failed in DB but
the canvas spinner stayed on "provisioning" until the next poll.
markProvisionFailed fires WORKSPACE_PROVISION_FAILED, so canvas now
flips to a failure card immediately.
3. Critical test bug (review finding): `defer log.SetOutput(log.Writer())`
in three test sites evaluated log.Writer() at defer-fire time AFTER the
SetOutput swap — restoring the buffer to itself, never restoring
os.Stderr. Subsequent tests in the package were running with the panic
tests' captured buffer as their writer. Extracted captureLog(t) helper
that captures `prev` BEFORE the swap and uses t.Cleanup.
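The bug and the fix side by side, sketched (helper name from this
message; exact signature assumed):

```go
package handlers

import (
	"bytes"
	"log"
	"testing"
)

// BUGGY: log.Writer() evaluates when the deferred call RUNS — i.e. after
// SetOutput already swapped in the buffer — so it "restores" the buffer
// to itself and os.Stderr never comes back:
//
//	defer log.SetOutput(log.Writer())
//	log.SetOutput(&buf)

// FIXED: capture the previous writer BEFORE the swap.
func captureLog(t *testing.T) *bytes.Buffer {
	t.Helper()
	prev := log.Writer() // grab os.Stderr (or whatever was set) first
	var buf bytes.Buffer
	log.SetOutput(&buf)
	t.Cleanup(func() { log.SetOutput(prev) })
	return &buf
}
```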
Plus: softened the "goroutine never started" comment in the concurrent
repro harness — the harness atomic-counts BEFORE the entry log fires, so
"never started" was misleading; the real failure mode is "entry log
renamed/removed or writer hijacked."
Verified: full handlers suite passes; drift gate passes (Platform Go CI
failure root-caused). Regression-injected the recover body again — both
panic tests still fail as expected, confirming the contract is gated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Goal: a deterministic, in-process reproduction of the prod incident
where 7 simultaneous claude-code provisions on the hongming tenant
produced ZERO log lines from any of the four documented exit paths.
Approach: stub CPProvisioner that records every Start() call,
sqlmock for the prepare flow, fire 7 goroutines concurrently against
provisionWorkspaceCP, then assert:
1. Entry log fired exactly 7 times (one per goroutine).
2. Stub Start() recorded all 7 distinct workspace IDs.
3. Each goroutine's entry log names its own workspace ID.
Result on staging head as of 2026-05-02: PASSES — meaning the
silent-drop class isn't reproducible against current head with stub
CP. Tenant hongming runs sha 76c604fb (725 commits behind staging),
so the bug is most likely already fixed upstream — hongming needs
a redeploy.
The test stays as a regression gate: any future refactor that
re-introduces silent goroutine swallow in the CP provision path
(rate-limit drop, channel-send-without-receiver, panic without
recover, etc.) trips it.
A safeWriter wraps the captured log buffer because raw
bytes.Buffer.Write isn't safe for concurrent goroutines — without
serialization the 7 entry-log lines interleave at byte boundaries
and the strings.Count assertion gets unreliable.
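Sketch of the wrapper (shape assumed):

```go
package handlers

import (
	"bytes"
	"sync"
)

// safeWriter serializes writes so the 7 goroutines' entry-log lines land
// whole, keeping the strings.Count assertion reliable.
type safeWriter struct {
	mu  sync.Mutex
	buf *bytes.Buffer
}

func (w *safeWriter) Write(p []byte) (int, error) {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.buf.Write(p)
}
```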
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-code-quality bot flagged 4 instances of `import a2a_mcp_server` in
the new TestStdioPipeAssertion class — every other test in the file uses
the `from a2a_mcp_server import ...` per-test pattern, so this is a real
inconsistency.
Switching the new tests to match. No behavior change; resolves the
4 unresolved review threads blocking the merge queue.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Issue #2486: 7 claude-code workspaces stuck in provisioning produced
NONE of the four documented exit-path log lines in
provisionWorkspaceCP — not prepare-failed, nor start-failed, nor
persist-instance-id-failed, nor success. Operators couldn't tell
whether the goroutine ran at all.
Add an entry log at the top of provisionWorkspaceOpts +
provisionWorkspaceCP so a missing entry distinguishes "goroutine
never started" from "started but exited via an unlogged path."
Add logProvisionPanic at the same defer site so a panic inside
either provisioner doesn't (a) crash the whole workspace-server
process, taking every other tenant workspace with it, or (b)
silently leave the row in `provisioning` until the 10-min sweeper
fires. The recover persists status='failed' with a sanitized
panic-class message via a fresh 10s context (the goroutine's own
ctx may have been the one panicking).
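Sketch of the defer site (the entry-log text and the persist helper are
stand-ins; the shipped code routes the status write through
markProvisionFailed per the follow-up earlier in this log):

```go
package handlers

import (
	"context"
	"log"
	"runtime/debug"
	"time"
)

type Handlers struct{} // stub for the sketch

// persistProvisionFailed is a hypothetical stand-in for the real
// status='failed' write.
func (h *Handlers) persistProvisionFailed(ctx context.Context, wsID string) error {
	return nil
}

func (h *Handlers) provisionWorkspaceCP(ctx context.Context, wsID string) {
	log.Printf("provisionWorkspaceCP: begin workspace=%s", wsID) // new entry log
	defer func() {
		r := recover()
		if r == nil {
			return // no-op on every non-panic exit — no spurious log line
		}
		log.Printf("provision panic workspace=%s: %v\n%s", wsID, r, debug.Stack())
		// Fresh 10s context: the goroutine's own ctx may be what panicked.
		pctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		defer cancel()
		if err := h.persistProvisionFailed(pctx, wsID); err != nil {
			// Defense in depth: a recovered-panic log with no row update
			// must itself be visible to the operator.
			log.Printf("provision panic persist FAILED workspace=%s: %v", wsID, err)
		}
	}()
	// ... prepare / Start / persist-instance-id / success paths ...
}
```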
Tests pin three contracts:
- no-op when no panic (otherwise every successful provision
emits a spurious log line)
- recovers + persists failed status on panic, with stack trace
- defense-in-depth: if the persist itself fails, log it instead
of leaving the operator with a recovered-panic log but no row
Regression-injected by neutering the recover() body — all three
tests fail until the recover + UPDATE path is restored.
This is observability + resilience only, not a root-cause fix
for #2486. The actual silent-drop class still needs reproduction
once the tenant is on a build that includes this entry log.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>