Self-review follow-ups on #2257:
- Drop `local exit_code=$?` from cleanup(). `trap`-handler return values
are ignored, so capturing $? only misled a future reader into thinking
exit-code preservation was happening.
- Replace silenced `>/dev/null 2>&1` DELETE with `-w '%{http_code}'`
capture. ADMIN_TOKEN expiring mid-run was the realistic failure mode
here — previously we swallowed it under the silenced redirect, leaving
workspaces leaked with no signal. Now a 401/403/5xx surfaces as a
`cleanup_failed` JSON event with a remediation hint pointing at
cleanup-rogue-workspaces.sh; 404 is treated as success (the
post-condition — workspace absent — holds).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three follow-ups from #2254 code review before the harness is safe to
run against staging:
1. Cleanup trap. Workspaces are now auto-deleted on EXIT/INT/TERM. A
Ctrl-C mid-run no longer leaks the PM + Researcher pair against
shared infra. KEEP_WORKSPACES=1 opts out for post-run inspection.
2. Tenant scoping + admin auth. Non-localhost PLATFORM values now
require both ADMIN_TOKEN and TENANT_ID; the script refuses to run
without them. The previous version sent unauthenticated POSTs that,
on staging, would either 401 every request or — worse — provision
into the wrong tenant. Memory `feedback_never_run_cluster_cleanup_
tests_on_live_platform` calls out the same hazard class.
3. DRY_RUN=1 mode. Prints platform target, tenant id, auth fingerprint,
and the planned actions, then exits before any state mutation. The
intended pre-flight before running against staging.
Also tightened OR_KEY check (the chained default silently accepted an
empty OPENROUTER_API_KEY) and added a heartbeat-trace caveat to the
interpretation guide explaining what `<endpoint_unavailable>` means
for the bound question.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a reproduction harness for Issue 4 of the 2026-04-28 CP review,
referenced in RFC molecule-core#2251. The RFC review (issue #2251
comment) flagged that Issue 4 was hypothesized but not reproduced
before V1.0 implementation begins — this script closes that gap.
What it does:
- Provisions a coordinator (PM, claude-code-default) + 1 child
(Researcher, langgraph) via the platform API.
- Sends an A2A kickoff with a synthesis-heavy task that requires
SYNTHESIS_DEPTH (default 3) sequential delegations followed by a
600-word post-delegation synthesis.
- Times the coordinator's full A2A round-trip with millisecond
precision and emits one JSON event per phase (machine-readable).
- Pulls the coordinator's heartbeat trace post-run so the team can
see whether any platform-side state transition fired during the
long synthesis (the V1.0 RFC's MAX_TASK_EXECUTION_SECS would
surface as such a transition; absence of one in this trace
confirms the RFC's premise).
Why a measurement harness, not a pass/fail test:
Issue 4's claim is "absence of platform-side bound", which is hard
to assert in a single CI run. Outputting structured measurement
data lets the team interpret across multiple runs / staging vs
prod / different SYNTHESIS_DEPTH values rather than relying on one
reproduction snapshot.
The script's header has the full interpretation guide:
- ELAPSED < 60s → not informative (LLM was just fast)
- 60–300s → within DELEGATION_TIMEOUT, ambiguous
- >= 300s without trace transitions → BUG CONFIRMED
- curl_failed → coordinator hung past A2A_TIMEOUT or genuinely
slow (disambiguate by querying status separately)
Doesn't run in CI by default — invoked manually against staging or a
local platform with PLATFORM=... and OPENROUTER_API_KEY=... env vars.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>