molecule-core

molecule-ai/molecule-core

Fork 2

Commit Graph

Author	SHA1	Message	Date
Hongming Wang	6e5b5c4142	fix(harness): cleanup_failed event + drop misleading exit_code capture Self-review follow-ups on #2257: - Drop `local exit_code=$?` from cleanup(). `trap`-handler return values are ignored, so capturing $? only misled a future reader into thinking exit-code preservation was happening. - Replace silenced `>/dev/null 2>&1` DELETE with `-w '%{http_code}'` capture. ADMIN_TOKEN expiring mid-run was the realistic failure mode here — previously we swallowed it under the silenced redirect, leaving workspaces leaked with no signal. Now a 401/403/5xx surfaces as a `cleanup_failed` JSON event with a remediation hint pointing at cleanup-rogue-workspaces.sh; 404 is treated as success (the post-condition — workspace absent — holds). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 21:00:38 -07:00
Hongming Wang	039a41cce3	fix(harness): cleanup trap + tenant scoping + dry-run for measure-coordinator-task-bounds Three follow-ups from #2254 code review before the harness is safe to run against staging: 1. Cleanup trap. Workspaces are now auto-deleted on EXIT/INT/TERM. A Ctrl-C mid-run no longer leaks the PM + Researcher pair against shared infra. KEEP_WORKSPACES=1 opts out for post-run inspection. 2. Tenant scoping + admin auth. Non-localhost PLATFORM values now require both ADMIN_TOKEN and TENANT_ID; the script refuses to run without them. The previous version sent unauthenticated POSTs that, on staging, would either 401 every request or — worse — provision into the wrong tenant. Memory `feedback_never_run_cluster_cleanup_ tests_on_live_platform` calls out the same hazard class. 3. DRY_RUN=1 mode. Prints platform target, tenant id, auth fingerprint, and the planned actions, then exits before any state mutation. The intended pre-flight before running against staging. Also tightened OR_KEY check (the chained default silently accepted an empty OPENROUTER_API_KEY) and added a heartbeat-trace caveat to the interpretation guide explaining what `<endpoint_unavailable>` means for the bound question. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 20:38:35 -07:00
Hongming Wang	acd7fe76a5	docs(rfc-2251): add coordinator task-bounds measurement harness Adds a reproduction harness for Issue 4 of the 2026-04-28 CP review, referenced in RFC molecule-core#2251. The RFC review (issue #2251 comment) flagged that Issue 4 was hypothesized but not reproduced before V1.0 implementation begins — this script closes that gap. What it does: - Provisions a coordinator (PM, claude-code-default) + 1 child (Researcher, langgraph) via the platform API. - Sends an A2A kickoff with a synthesis-heavy task that requires SYNTHESIS_DEPTH (default 3) sequential delegations followed by a 600-word post-delegation synthesis. - Times the coordinator's full A2A round-trip with millisecond precision and emits one JSON event per phase (machine-readable). - Pulls the coordinator's heartbeat trace post-run so the team can see whether any platform-side state transition fired during the long synthesis (the V1.0 RFC's MAX_TASK_EXECUTION_SECS would surface as such a transition; absence of one in this trace confirms the RFC's premise). Why a measurement harness, not a pass/fail test: Issue 4's claim is "absence of platform-side bound", which is hard to assert in a single CI run. Outputting structured measurement data lets the team interpret across multiple runs / staging vs prod / different SYNTHESIS_DEPTH values rather than relying on one reproduction snapshot. The script's header has the full interpretation guide: - ELAPSED < 60s → not informative (LLM was just fast) - 60–300s → within DELEGATION_TIMEOUT, ambiguous - >= 300s without trace transitions → BUG CONFIRMED - curl_failed → coordinator hung past A2A_TIMEOUT or genuinely slow (disambiguate by querying status separately) Doesn't run in CI by default — invoked manually against staging or a local platform with PLATFORM=... and OPENROUTER_API_KEY=... env vars. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 18:58:39 -07:00

Author

SHA1

Message

Date

Hongming Wang

6e5b5c4142

fix(harness): cleanup_failed event + drop misleading exit_code capture

Self-review follow-ups on #2257:

- Drop `local exit_code=$?` from cleanup(). `trap`-handler return values
  are ignored, so capturing $? only misled a future reader into thinking
  exit-code preservation was happening.

- Replace silenced `>/dev/null 2>&1` DELETE with `-w '%{http_code}'`
  capture. ADMIN_TOKEN expiring mid-run was the realistic failure mode
  here — previously we swallowed it under the silenced redirect, leaving
  workspaces leaked with no signal. Now a 401/403/5xx surfaces as a
  `cleanup_failed` JSON event with a remediation hint pointing at
  cleanup-rogue-workspaces.sh; 404 is treated as success (the
  post-condition — workspace absent — holds).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-28 21:00:38 -07:00

Hongming Wang

039a41cce3

fix(harness): cleanup trap + tenant scoping + dry-run for measure-coordinator-task-bounds

Three follow-ups from #2254 code review before the harness is safe to
run against staging:

1. Cleanup trap. Workspaces are now auto-deleted on EXIT/INT/TERM. A
   Ctrl-C mid-run no longer leaks the PM + Researcher pair against
   shared infra. KEEP_WORKSPACES=1 opts out for post-run inspection.

2. Tenant scoping + admin auth. Non-localhost PLATFORM values now
   require both ADMIN_TOKEN and TENANT_ID; the script refuses to run
   without them. The previous version sent unauthenticated POSTs that,
   on staging, would either 401 every request or — worse — provision
   into the wrong tenant. Memory `feedback_never_run_cluster_cleanup_
   tests_on_live_platform` calls out the same hazard class.

3. DRY_RUN=1 mode. Prints platform target, tenant id, auth fingerprint,
   and the planned actions, then exits before any state mutation. The
   intended pre-flight before running against staging.

Also tightened OR_KEY check (the chained default silently accepted an
empty OPENROUTER_API_KEY) and added a heartbeat-trace caveat to the
interpretation guide explaining what `<endpoint_unavailable>` means
for the bound question.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-28 20:38:35 -07:00

Hongming Wang

acd7fe76a5

docs(rfc-2251): add coordinator task-bounds measurement harness

Adds a reproduction harness for Issue 4 of the 2026-04-28 CP review,
referenced in RFC molecule-core#2251. The RFC review (issue #2251
comment) flagged that Issue 4 was hypothesized but not reproduced
before V1.0 implementation begins — this script closes that gap.

What it does:
  - Provisions a coordinator (PM, claude-code-default) + 1 child
    (Researcher, langgraph) via the platform API.
  - Sends an A2A kickoff with a synthesis-heavy task that requires
    SYNTHESIS_DEPTH (default 3) sequential delegations followed by a
    600-word post-delegation synthesis.
  - Times the coordinator's full A2A round-trip with millisecond
    precision and emits one JSON event per phase (machine-readable).
  - Pulls the coordinator's heartbeat trace post-run so the team can
    see whether any platform-side state transition fired during the
    long synthesis (the V1.0 RFC's MAX_TASK_EXECUTION_SECS would
    surface as such a transition; absence of one in this trace
    confirms the RFC's premise).

Why a measurement harness, not a pass/fail test:
  Issue 4's claim is "absence of platform-side bound", which is hard
  to assert in a single CI run. Outputting structured measurement
  data lets the team interpret across multiple runs / staging vs
  prod / different SYNTHESIS_DEPTH values rather than relying on one
  reproduction snapshot.

The script's header has the full interpretation guide:
  - ELAPSED < 60s     → not informative (LLM was just fast)
  - 60–300s           → within DELEGATION_TIMEOUT, ambiguous
  - >= 300s without trace transitions → BUG CONFIRMED
  - curl_failed       → coordinator hung past A2A_TIMEOUT or genuinely
                        slow (disambiguate by querying status separately)

Doesn't run in CI by default — invoked manually against staging or a
local platform with PLATFORM=... and OPENROUTER_API_KEY=... env vars.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-28 18:58:39 -07:00

3 Commits