Step 2 of #1815. Step 1 (instrumentation in canvas/vitest.config.ts)
already shipped — the inline comment there explicitly defers wiring
into CI to a follow-up because turning on a 70% threshold blind would
either fail CI immediately or paper over a real gap with an ad-hoc
exclude list.
This PR ships the observability half:
- Replaces `npx vitest run` with `npx vitest run --coverage` in the
canvas-build job. Coverage gets reported on every PR; no threshold
gate yet (vitest.config.ts intentionally doesn't set thresholds).
- Adds an artifact upload step for canvas/coverage/ (HTML + json-summary)
so reviewers can browse the coverage report from any PR. 7-day
retention; if-no-files-found=warn so a run that produces no coverage
files warns instead of failing the job.
Step 3 (thresholds + hard gate) is the natural follow-up — track in a
new sub-issue once we've seen ~5-10 PRs of baseline data and know
where current coverage sits. The issue body proposed lines:70 /
functions:70 / branches:65 / statements:70; that may need adjustment
once the baseline lands.
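
A minimal sketch of how the uploaded json-summary could be compared
against the proposed Step 3 numbers while collecting the baseline. The
file path and the `total.<metric>.pct` layout are assumptions about the
json-summary reporter output, and `coverage_gaps` / `load_summary` are
hypothetical helper names, not part of this PR:

```python
import json
from pathlib import Path

# Thresholds proposed in the issue body; expected to be revisited after the baseline.
PROPOSED_THRESHOLDS = {"lines": 70, "functions": 70, "branches": 65, "statements": 70}


def coverage_gaps(summary, thresholds=None):
    """Return {metric: (actual_pct, required_pct)} for metrics below threshold."""
    thresholds = PROPOSED_THRESHOLDS if thresholds is None else thresholds
    total = summary["total"]
    return {
        metric: (total[metric]["pct"], required)
        for metric, required in thresholds.items()
        if total[metric]["pct"] < required
    }


def load_summary(path="canvas/coverage/coverage-summary.json"):
    # Assumed location of the json-summary output inside the uploaded artifact.
    return json.loads(Path(path).read_text())
```

Running `coverage_gaps(load_summary())` over the artifacts from several
PRs would show which of the four metrics sit below the proposed values
before any gate is switched on.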
Closes the Step-2 portion of #1815. Step 3 stays open or gets a fresh
issue depending on your preference.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Surfaced via cross-template review of the a2a-sdk v0→v1 migration:
every adapter executor (claude-code, gemini-cli, crewai, openclaw,
autogen) builds A2A response Messages independently using
`new_text_message(text)` from the SDK, which omits `task_id` and
`context_id`. The runtime's own canonical pattern in
`workspace/a2a_executor.py:466-475` correctly threads both:
Message(
message_id=uuid.uuid4().hex,
role=Role.ROLE_AGENT,
parts=_parts,
task_id=task_id, # ← canonical
context_id=context_id, # ← canonical
)
Because adapters skip these correlation fields, the platform's a2a
proxy can't reliably tie the response back to the originating task.
This diverges from the canonical pattern and is not necessarily a
strict bug (task_id may be optional with a default), but it is enough
of a correlation/observability gap that the canonical pattern threads
both fields explicitly.
Add `new_response_message(context, text, files=None)` to
executor_helpers.py — single home for response Message construction.
Templates can migrate from `new_text_message(text)` to this helper
in stacked PRs once the runtime publishes to PyPI.
The helper:
- Reads `context.task_id`/`context.context_id` from the inbound
RequestContext, falling back to fresh UUIDs (RequestContextBuilder
always sets them in production; fallback is for unit tests).
- Sets `role=Role.ROLE_AGENT` (the v1 enum value).
- Builds text Parts via `Part(text=...)` and file Parts via
`Part(url="workspace:<path>", filename=..., media_type=...)`.
- Returns a v1 protobuf Message ready for
`event_queue.enqueue_event(...)`.
Why file Parts use the `workspace:` URI scheme (with `files=None` as
the default): it matches the canonical pattern in a2a_executor.py
exactly, so the platform's chat-attachment download path
(`resolve_attachment_uri` in executor_helpers.py) interprets responses
uniformly across all adapters.
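
A rough sketch of the helper's behaviour as described above. `Part`,
`Message` and `ROLE_AGENT` below are local stand-ins for the a2a-sdk v1
types so the example runs without the SDK, and the dict shape used for
`files` entries (`path`, `filename`, `media_type`) is an assumption,
not the real signature:

```python
import uuid
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Optional


@dataclass
class Part:  # stand-in for the SDK's Part
    text: Optional[str] = None
    url: Optional[str] = None
    filename: Optional[str] = None
    media_type: Optional[str] = None


@dataclass
class Message:  # stand-in for the SDK's v1 Message
    message_id: str
    role: str
    parts: list
    task_id: str
    context_id: str


ROLE_AGENT = "ROLE_AGENT"  # stand-in for Role.ROLE_AGENT


def new_response_message(context, text, files=None):
    # Thread the inbound correlation IDs; fall back to fresh UUIDs when the
    # context lacks them (the unit-test case).
    task_id = getattr(context, "task_id", None) or uuid.uuid4().hex
    context_id = getattr(context, "context_id", None) or uuid.uuid4().hex
    parts = []
    if text:
        parts.append(Part(text=text))
    for f in files or []:
        parts.append(
            Part(
                url=f"workspace:{f['path']}",
                filename=f.get("filename"),
                media_type=f.get("media_type"),
            )
        )
    return Message(
        message_id=uuid.uuid4().hex,
        role=ROLE_AGENT,
        parts=parts,
        task_id=task_id,
        context_id=context_id,
    )


if __name__ == "__main__":
    ctx = SimpleNamespace(task_id="task-1", context_id="ctx-1")
    print(new_response_message(ctx, "done"))
```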
Tests (5, all pass with --no-cov against the live runtime image):
- test_new_response_message_text_only
- test_new_response_message_with_files
- test_new_response_message_files_only_no_text
- test_new_response_message_falls_back_when_context_ids_unset
- test_new_response_message_handles_missing_attrs
The conftest's a2a stubs needed an extension for Message + Role +
Part with kwargs preservation. Strictly additive — no existing tests
affected. (The 19 pre-existing failures in test_executor_helpers.py
are unrelated debt from the commit_memory/recall_memory rewrite,
visible on staging baseline before this change.)
Per-template migration is the follow-up: claude-code, gemini-cli,
crewai, openclaw, autogen all call `new_text_message(text)` today;
each gets a per-repo PR replacing it with
`new_response_message(context, text)`. This PR ships the helper
first so the templates have something to import.
Refs: PR #2266/#2267 (restart-race), claude-code #15 (FilePart fix),
gemini-cli #10/crewai #8/openclaw #9/autogen #8 (rename PRs).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review caught a regression I introduced in #2266: if cycle() panics
(e.g. a future provisionWorkspace nil-deref or any runtime error from
the DB / Docker / encryption stacks it touches), the loop never reaches
`state.running = false`. The flag stays true forever, the early-return
guard at the top of coalesceRestart fires for every subsequent call,
and that workspace is permanently locked out of restarts until the
platform process restarts.
The pre-fix code had similar exposure (panic killed the goroutine
before defer wsMu.Unlock() ran in some Go versions), but my pending-
flag version made it worse: the guard is sticky, not ephemeral.
Fix: defer the state-clear so it always runs on exit, including panic.
Recover (and DON'T re-raise) so the panic doesn't propagate to the
goroutine boundary and crash the whole platform process — RestartByID
is always called via `go h.RestartByID(...)` from HTTP handlers, and
an unrecovered goroutine panic in Go terminates the program. Crashing
the platform for every tenant because one workspace's cycle panicked
is the wrong availability tradeoff. The panic message + full stack
trace via runtime/debug.Stack() are still logged for debuggability.
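
The shape of the fix, modelled in Python for illustration only (the
real code is Go and uses defer/recover). `RestartGate` is a
hypothetical name, and the pending-flag loop is left out so the sketch
shows just the clear-on-exit and log-and-suppress behaviour:

```python
import logging
import threading
import traceback

log = logging.getLogger("restart")


class RestartGate:
    def __init__(self):
        self._lock = threading.Lock()
        self._running = False

    def run(self, cycle):
        """Run cycle unless one is in flight. Returns False when skipped."""
        with self._lock:
            if self._running:
                return False
            self._running = True
        try:
            cycle()
        except Exception:
            # Log the failure with its stack and suppress it, so one
            # workspace's failure does not take down the caller.
            log.error("restart cycle failed:\n%s", traceback.format_exc())
        finally:
            # Always clear the flag, including on failure, so the guard
            # at the top does not stay set forever.
            with self._lock:
                self._running = False
        return True
```

A second `run` after a failing cycle executes normally, which is the
property the regression test checks.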
Regression test in TestCoalesceRestart_PanicInCycleClearsState:
1. First call's cycle panics. coalesceRestart's defer must swallow
the panic — assert no panic propagates out (would crash the
platform process from a goroutine in production).
2. Second call must run a fresh cycle (proves running was cleared).
All 7 tests pass with -race -count=10.
Surfaced via /code-review-and-quality self-review of #2266; the
re-raise-after-recover anti-pattern (originally argued as "don't
mask bugs") came up in the comprehensive review and was corrected
to log-with-stack-and-suppress for availability.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The naive mutex-with-TryLock pattern in RestartByID was silently dropping
the second of two close-together restart requests. SetSecret and SetModel
both fire `go restartFunc(...)` from their HTTP handlers, and both DB
writes commit before either restart goroutine reaches loadWorkspaceSecrets.
If the second goroutine arrives while the first holds the per-workspace
mutex, TryLock returns false and the second is logged-and-dropped:
Auto-restart: skipping <id> — restart already in progress
The first goroutine's loadWorkspaceSecrets ran before the second write
committed, so the new container boots without that env var. Surfaced
during the RFC #2251 V1.0 measurement as hermes returning "No LLM
provider configured" when MODEL_PROVIDER landed after the API-key write
and lost its restart to the mutex (HERMES_DEFAULT_MODEL absent →
start.sh fell back to nousresearch/hermes-4-70b → derived
provider=openrouter → no OPENROUTER_API_KEY → request-time error).
The same race hits any back-to-back secret/model save flow including
the canvas's "set MiniMax key + pick model" UX.
Fix: pending-flag / coalescing pattern. Any restart request that arrives
while one is in flight sets `pending=true` and returns. The in-flight
runner, on completion, checks the flag and runs another cycle. This
collapses N concurrent requests into at most 2 sequential cycles (the
current one + one more that picks up everyone who arrived during it),
while guaranteeing the final container always sees the latest secrets.
Concrete contract:
- 1 request, no concurrency: 1 cycle
- N concurrent requests during 1 in-flight cycle: 2 cycles total
- N sequential requests (no overlap): N cycles
- Per-workspace state — different workspaces never serialize
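
The contract above can be modelled in a few lines. This is a Python
illustration of the pending-flag pattern, not the Go implementation;
`RestartCoalescer` and its dict-based state are hypothetical, and
failure handling inside `cycle` is deliberately out of scope here:

```python
import threading
from collections import defaultdict


class RestartCoalescer:
    def __init__(self):
        self._lock = threading.Lock()
        # Per-workspace state, so different workspaces never serialize.
        self._state = defaultdict(lambda: {"running": False, "pending": False})

    def coalesce_restart(self, workspace_id, cycle):
        with self._lock:
            st = self._state[workspace_id]
            if st["running"]:
                # A cycle is in flight: record the request and return.
                st["pending"] = True
                return
            st["running"] = True
        while True:
            cycle()
            with self._lock:
                if st["pending"]:
                    # Someone arrived during the cycle: run once more.
                    st["pending"] = False
                    continue
                # Check and clear in one critical section, so a request
                # cannot slip in between the check and the reset.
                st["running"] = False
                return
```

Checking `pending` and clearing `running` under the same lock is the
detail that keeps a late request from being dropped.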
Coalescing is extracted into `coalesceRestart(workspaceID, cycle func())`
so the gate logic is testable without the full WorkspaceHandler / DB /
provisioner stack. RestartByID now wraps that with the production cycle
function. runRestartCycle calls provisionWorkspace SYNCHRONOUSLY (drops
the historical `go`) so the loop's pending-flag check happens AFTER the
new container is up — without that, the next cycle's Stop call would
race the previous cycle's still-spawning provision goroutine.
sendRestartContext stays async; it's a one-way notification.
Tests in workspace_restart_coalesce_test.go cover the contract in six
cases and are race-detector clean over 10 iterations:
- Single call → 1 cycle
- 5 concurrent during in-flight → exactly 2 cycles total
- 3 sequential → 3 cycles
- Pending-during-cycle picked up (targeted bug repro)
- State cleared after drain (running flag reset)
- Per-workspace isolation (no cross-workspace serialization)
Refs: molecule-core#2256 (V1.0 gate measurement); root cause for the
"No LLM provider configured" symptom seen during hermes/MiniMax repro.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The runner was speculatively calling `/workspaces/:id/heartbeat-history` —
that endpoint doesn't exist on workspace-server. On local dev it 404'd;
on tenant builds the platform's :8080 canvas-proxy fallback intercepted
it and returned 28KB of Next.js HTML which then landed in the JSON event
log. Neither outcome was useful trace data.
`GET /workspaces/:id/activity` is the existing endpoint that reads
activity_logs. That table already records the events the RFC §V1.0
step 6 'platform-side transition' check needs (a2a_send / a2a_receive /
task_update / agent_log / error, plus duration_ms + status). Rename
the runner's fetch + emitted event accordingly.
Verified: GET /workspaces/<uuid>/activity?since_secs=60 returns 200
with `[]` against the local platform; no SaaS skip needed since the
endpoint exists in both environments.
Refs: molecule-core#2256 (V1.0 gate #1 measurement comment).
Three review-driven fixes to the runner before #2261 merges:
1. `WAIT_ONLINE_SECS / 3` truncated; an operator passing 200 actually
waited 198s. Round up so 200 → 67 polls × 3s = 201s ≥ requested.
2. The heartbeat-history endpoint isn't on tenant workspace-servers —
the platform's :8080 fallback proxies unmatched paths to the
canvas Next.js, so the SaaS run captured 28KB of HTML in the
`heartbeat_trace` event log. Skip the fetch in MODE=saas; emit an
explicit `<skipped: ...>` placeholder. Local mode behaviour
unchanged.
3. ORG_ID and ORG_SLUG had no client-side format check, so a typo'd
value got swallowed by TenantGuard's intentionally-opaque 404
(which doesn't tell the operator whether slug, UUID, or auth was
wrong). Validate UUID and slug shape up front; matching errors
are actionable.
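
Fixes 1 and 3 can be sketched as follows. The runner itself is a shell
script; this Python version only illustrates the arithmetic and the
up-front validation. The slug pattern is an assumed shape (lowercase
alphanumerics and hyphens), and the function names are hypothetical:

```python
import re
import uuid

POLL_INTERVAL_SECS = 3


def poll_count(wait_secs, interval=POLL_INTERVAL_SECS):
    """Ceiling division: the smallest poll count that covers wait_secs."""
    return -(-wait_secs // interval)


# Assumed slug shape; the real check may be stricter or looser.
SLUG_RE = re.compile(r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?")


def validate_org(org_id, org_slug):
    """Return a list of actionable error strings (empty when both look valid)."""
    errors = []
    try:
        uuid.UUID(org_id)
    except ValueError:
        errors.append(f"ORG_ID is not a UUID: {org_id!r}")
    if not SLUG_RE.fullmatch(org_slug):
        errors.append(f"ORG_SLUG has an unexpected shape: {org_slug!r}")
    return errors
```

With truncating division, 200 // 3 gives 66 polls (198s); `poll_count(200)`
gives 67 polls, i.e. 201s.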
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two docs covering load-bearing patterns from today's work that
weren't previously discoverable:
1. workspace/platform_tools/README.md — explains the ToolSpec
single-source-of-truth pattern (#2240), the CLI-block alignment
gap that hand-maintained generation can't close (#2258), the
snapshot golden files + LF-pinning (#2260), and the add/rename/
remove playbook. The next reader who lands in
workspace/platform_tools/ now has the design rationale + the
safe-edit procedure colocated with the code.
2. scripts/README.md — disambiguates the three measure-coordinator-
task-bounds.sh files that now exist across two repos:
- scripts/measure-coordinator-task-bounds.sh (canonical OSS, this repo)
- scripts/measure-coordinator-task-bounds-runner.sh (Hermes/MiniMax variant, this repo)
- scripts/measure-coordinator-task-bounds.sh (production-shape, in molecule-controlplane)
Cross-references reference_harness_pair_pattern (auto-memory) for
the cross-repo design rationale. Documents the common safety
pattern (cleanup trap, DRY_RUN, non-target guard,
cleanup_*_failed events) and the heartbeat-trace caveat.
Refs: #2240, #2254, #2257, #2258, #2259, #2260; molecule-controlplane#321.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The original measure-coordinator-task-bounds.sh was hardcoded for
local-dev (workspace-server on :8080) with claude-code/langgraph
templates and OPENROUTER_API_KEY. Running it against staging requires
both auth-chain plumbing (per-tenant ADMIN_TOKEN + X-Molecule-Org-Id
TenantGuard header + tenant subdomain routing) and template/secret
flexibility (e.g. Hermes/MiniMax for Token Plan keys).
This adds:
* `measure-coordinator-task-bounds-runner.sh` — separate runner that
wraps the same workspace-server API calls but takes everything as
env-var inputs. Two MODE values:
- `local` → direct workspace-server (no auth/tenant scoping)
- `saas` → tenant subdomain + per-tenant ADMIN_TOKEN bearer +
X-Molecule-Org-Id TenantGuard header. Auto-fetches
tenant token via CP /cp/admin/orgs/<slug>/admin-token
given ORG_SLUG + CP_ADMIN_API_TOKEN, OR accepts a
pre-resolved TENANT_ADMIN_TOKEN.
* Configurable PM_TEMPLATE / CHILD_TEMPLATE / MODEL / SECRET_NAME /
SECRET_VALUE — defaults match the original (claude-code-default +
langgraph + OpenRouter). Hermes/MiniMax example documented in the
header.
* Per-poll status_change events during wait_online, so a workspace
that never reaches online surfaces its last status (provisioning,
failed, etc.) instead of a bare timeout.
* WAIT_ONLINE_SECS knob (default 180s; SaaS cold-start needs ~420s
for first hermes-image pull on a freshly-provisioned EC2 tenant).
* `${args[@]+...}` guard on the api() helper — avoids `set -u`
exploding on an empty header array (the local-dev hot-path).
The original script also gained a SECRET_VALUE block earlier in the
session — that change (separately staged) makes the secret-name
configurable without forcing every operator through the new runner.
V1.0 gate #1 (RFC #2251, Issue 4 repro) measurement results posted
as a separate comment on molecule-core#2256.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review follow-up on #2258 (registry snapshot tests, just merged).
The byte-exact snapshot comparisons in test_platform_tools.py would
fail mysteriously on a Windows contributor's machine with
core.autocrlf=true: checkout would convert LF → CRLF, the test would
fail locally with no useful diagnostic, and the regen instructions
in the test-file header would produce LF files that disagree with
the working copy.
Pin workspace/tests/snapshots/*.txt to text eol=lf so this can't
happen. All three current snapshots are already LF; the attribute
ensures it stays that way.
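
A tiny illustration of the failure mode, using a made-up snapshot body
(the bytes below are hypothetical, not an actual golden file):

```python
def matches_snapshot(rendered, snapshot_on_disk):
    # Byte-exact comparison, as in the snapshot tests.
    return rendered == snapshot_on_disk


rendered = b"## Tools\n- list_peers\n"            # hypothetical renderer output (LF)
lf_checkout = rendered                              # snapshot checked out with eol=lf
crlf_checkout = rendered.replace(b"\n", b"\r\n")   # same file after LF -> CRLF conversion
```

`matches_snapshot(rendered, crlf_checkout)` is False even though the
text is identical apart from line endings, which is why the attribute
pins the files to LF.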
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review follow-ups on #2257:
- Drop `local exit_code=$?` from cleanup(). `trap`-handler return values
are ignored, so capturing $? only misled a future reader into thinking
exit-code preservation was happening.
- Replace silenced `>/dev/null 2>&1` DELETE with `-w '%{http_code}'`
capture. ADMIN_TOKEN expiring mid-run was the realistic failure mode
here — previously we swallowed it under the silenced redirect, leaving
workspaces leaked with no signal. Now a 401/403/5xx surfaces as a
`cleanup_failed` JSON event with a remediation hint pointing at
cleanup-rogue-workspaces.sh; 404 is treated as success (the
post-condition — workspace absent — holds).
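
The status handling reduces to a small decision, sketched here in
Python (the script itself is shell). The event field names and the
hint wording are assumptions for illustration:

```python
def cleanup_event(workspace_id, http_code):
    """Return None when cleanup succeeded, else a cleanup_failed event dict."""
    # 2xx: deleted. 404: already absent, so the post-condition holds.
    if 200 <= http_code < 300 or http_code == 404:
        return None
    return {
        "event": "cleanup_failed",
        "workspace_id": workspace_id,
        "http_code": http_code,
        "hint": "run cleanup-rogue-workspaces.sh to remove leaked workspaces",
    }
```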
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-ups from the #2240 code review:
1. Snapshot tests for the rendered tool-instruction blocks. The
structural tests added in #2240 guarantee tool NAMES are present;
these new tests pin the SHAPE — bullet ordering, heading style,
footer placement — so a future contributor who reorders fields in
`_render_section` or rewrites a `when_to_use` paragraph sees the
diff in CI rather than shipping a silently-different system prompt.
Golden files live under workspace/tests/snapshots/.
2. CLI-block alignment test + corrected source-of-truth comment.
`_A2A_INSTRUCTIONS_CLI` is a separate hand-maintained surface for
ollama and other non-MCP runtimes — the registry can't auto-generate
it because the CLI subprocess interface uses different command
shapes (`peers` vs `list_peers`, etc.). A new
`_CLI_A2A_COMMAND_KEYWORDS` mapping declares the registry-tool →
CLI-keyword correspondence (or explicit `None` for tools not
exposed via subprocess). Two tests enforce coverage:
- every a2a tool in the registry is keyed in the mapping
- every non-None subcommand keyword literally appears in
`_A2A_INSTRUCTIONS_CLI`
Caught one real gap: `send_message_to_user` is in the registry but
has no CLI subcommand. Mapped to `None` with an explanatory comment.
The "no other source of truth" claim in registry.py's docstring
was wrong post-#2240 (the CLI block survived) — corrected to
describe the two surfaces explicitly and point at the alignment
tests as the gate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three follow-ups from #2254 code review before the harness is safe to
run against staging:
1. Cleanup trap. Workspaces are now auto-deleted on EXIT/INT/TERM. A
Ctrl-C mid-run no longer leaks the PM + Researcher pair against
shared infra. KEEP_WORKSPACES=1 opts out for post-run inspection.
2. Tenant scoping + admin auth. Non-localhost PLATFORM values now
require both ADMIN_TOKEN and TENANT_ID; the script refuses to run
without them. The previous version sent unauthenticated POSTs that,
on staging, would either 401 every request or — worse — provision
into the wrong tenant. Memory `feedback_never_run_cluster_cleanup_
tests_on_live_platform` calls out the same hazard class.
3. DRY_RUN=1 mode. Prints platform target, tenant id, auth fingerprint,
and the planned actions, then exits before any state mutation. The
intended pre-flight before running against staging.
Also tightened OR_KEY check (the chained default silently accepted an
empty OPENROUTER_API_KEY) and added a heartbeat-trace caveat to the
interpretation guide explaining what `<endpoint_unavailable>` means
for the bound question.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds structured `rfc2251_phase=...` log lines at the deterministic phase
boundaries inside route_task_to_team and check_task_status, so an
operator running scripts/measure-coordinator-task-bounds.sh against
staging can correlate the harness's external timing trace with what
phase the coordinator was in at any given second.
The harness already exists in staging and measures end-to-end response
time + heartbeat trace. What it CAN'T do without this PR is answer
"the coordinator response took 7 minutes — was it stuck delegating, or
stuck polling children, or stuck synthesizing after all children
returned?" The phase logs answer that question.
Phases instrumented (deterministic Python boundaries, no agent prompt
involvement):
route_start → enter route_task_to_team
children_fetched → after get_children() returns
routing_decided → after build_team_routing_payload
delegate_invoked → just before delegate_task_async.ainvoke
delegate_returned → after delegate_task_async returns
check_status → every check_task_status poll (per-poll)
route_returning_decision_only → fall-through path
Each line includes elapsed_ms from route_start so per-phase durations
are extractable via:
grep rfc2251_phase= <container.log> \
| awk '{...}' to compute deltas between consecutive phases
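
An equivalent delta computation in Python, as a sketch. It assumes each
log line carries `rfc2251_phase=<name>` followed somewhere by
`elapsed_ms=<int>`; the exact line layout is an assumption, and
`phase_deltas` is a hypothetical helper:

```python
import re

# Assumed line shape: "... rfc2251_phase=<name> ... elapsed_ms=<int> ..."
PHASE_RE = re.compile(r"rfc2251_phase=(\S+).*?elapsed_ms=(\d+)")


def phase_deltas(lines):
    """Return [(prev_phase, phase, delta_ms)] for consecutive phase lines."""
    deltas = []
    prev = None
    for line in lines:
        m = PHASE_RE.search(line)
        if not m:
            continue  # not a phase line
        cur = (m.group(1), int(m.group(2)))
        if prev is not None:
            deltas.append((prev[0], cur[0], cur[1] - prev[1]))
        prev = cur
    return deltas
```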
The synthesis phase (after all children return, before agent emits
final A2A response) is NOT instrumented here because it's
agent-driven (no deterministic Python boundary). The harness operator
infers synthesis_secs = total_response_secs − max(check_status_ts).
This is reproduction-harness scaffolding; it adds zero behavior. Strip
the rfc2251_phase log lines when V1.0 ships and the phase data lands
in the structured heartbeat payload instead.
Refs:
- RFC: molecule-core#2251
- Harness: scripts/measure-coordinator-task-bounds.sh (shipped earlier)
- V1.0 gate: this is deliverable #2 of the four pre-V1.0 gates
Adds a reproduction harness for Issue 4 of the 2026-04-28 CP review,
referenced in RFC molecule-core#2251. The RFC review (issue #2251
comment) flagged that Issue 4 was hypothesized but not reproduced
before V1.0 implementation begins — this script closes that gap.
What it does:
- Provisions a coordinator (PM, claude-code-default) + 1 child
(Researcher, langgraph) via the platform API.
- Sends an A2A kickoff with a synthesis-heavy task that requires
SYNTHESIS_DEPTH (default 3) sequential delegations followed by a
600-word post-delegation synthesis.
- Times the coordinator's full A2A round-trip with millisecond
precision and emits one JSON event per phase (machine-readable).
- Pulls the coordinator's heartbeat trace post-run so the team can
see whether any platform-side state transition fired during the
long synthesis (the V1.0 RFC's MAX_TASK_EXECUTION_SECS would
surface as such a transition; absence of one in this trace
confirms the RFC's premise).
Why a measurement harness, not a pass/fail test:
Issue 4's claim is "absence of platform-side bound", which is hard
to assert in a single CI run. Outputting structured measurement
data lets the team interpret across multiple runs / staging vs
prod / different SYNTHESIS_DEPTH values rather than relying on one
reproduction snapshot.
The script's header has the full interpretation guide:
- ELAPSED < 60s → not informative (LLM was just fast)
- 60–300s → within DELEGATION_TIMEOUT, ambiguous
- >= 300s without trace transitions → BUG CONFIRMED
- curl_failed → coordinator hung past A2A_TIMEOUT or genuinely
slow (disambiguate by querying status separately)
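
The guide can be read as a small decision function. This is a sketch:
the label strings are made up, and the last branch (a long run that
does show trace transitions) is not covered by the guide and is an
assumption:

```python
def interpret(elapsed_secs, trace_transitions, curl_failed=False):
    """Map one harness run onto the interpretation guide's buckets."""
    if curl_failed:
        # Hung past A2A_TIMEOUT or genuinely slow; query status separately.
        return "curl_failed"
    if elapsed_secs < 60:
        return "not_informative"
    if elapsed_secs < 300:
        return "ambiguous"
    if trace_transitions == 0:
        return "bug_confirmed"
    # Not part of the guide: a platform-side transition was seen.
    return "transition_observed"
```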
Doesn't run in CI by default — invoked manually against staging or a
local platform with PLATFORM=... and OPENROUTER_API_KEY=... env vars.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a comment block at the top of auto-promote-staging.yml naming the
load-bearing one-time repo setting that the workflow depends on:
Settings → Actions → General → Workflow permissions
→ ✅ Allow GitHub Actions to create and approve pull requests
Without this toggle, every workflow_run fails with
"GitHub Actions is not permitted to create or approve pull requests
(createPullRequest)". Observed 2026-04-29 01:43 UTC blocking the
fcd87b9 promotion (PRs #2248 + #2249); manually bridged via PR #2252.
The setting is invisible to anyone reading the workflow file, but the
workflow cannot do its job without it. Documenting here so the next
time it gets toggled off (org admin change, repo migration, audit
cleanup) the failure mode points at the cause rather than another
round of "why is auto-promote broken."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror the sweep-cf-orphans hardening (#2248) on publish-runtime's
TEMPLATE_DISPATCH_TOKEN gate. The previous behaviour was to print
::warning::skipping cascade — templates will pick up the new version
on their own next rebuild and exit 0. That message is wrong: the 8
workspace-template repos only rebuild on this repository_dispatch
fanout. Without the dispatch they stay pinned to whatever runtime
version they last saw, and the gap is invisible until someone
notices a template several versions behind weeks later.
Behaviour after this PR:
- push (auto-trigger on workspace/runtime/** changes) → exit 1
- workflow_dispatch (manual operator) → exit 0
with a warning (operator already accepted state; let them rerun
after restoring the secret)
The token-missing path now also names the consequence concretely
("templates will NOT pick up the new version until this token is
restored") so future operators see the actionable line, not the
misleading "they'll catch up on their own" message.
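
The new token-missing behaviour, modelled as a pure function. The
workflow step is shell; the exact message prefix and wording here are
assumptions apart from the quoted consequence, and
`missing_token_outcome` is a hypothetical name:

```python
CONSEQUENCE = "templates will NOT pick up the new version until this token is restored"


def missing_token_outcome(event_name):
    """Return (exit_code, message) when TEMPLATE_DISPATCH_TOKEN is unset."""
    if event_name == "workflow_dispatch":
        # Manual run: the operator accepted the state, so warn and succeed.
        return 0, f"::warning::TEMPLATE_DISPATCH_TOKEN missing; {CONSEQUENCE}"
    # Automatic trigger (push): fail loudly so the gap is visible.
    return 1, f"::error::TEMPLATE_DISPATCH_TOKEN missing; {CONSEQUENCE}"
```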
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the soft-skip-with-warning behaviour for scheduled runs of the
hourly Cloudflare orphan sweeper with an explicit failure when the six
required secrets aren't set. Manual workflow_dispatch keeps the
soft-skip path so an operator can short-circuit a deliberate rerun
without redoing the secrets dance — they accepted the state when they
clicked the button.
Why: from some-date to 2026-04-28, all six secrets were unset on the
repo. Every hourly tick printed a yellow ::warning:: and exited 0,
which GitHub registers as "completed/success" — the sweeper was
indistinguishable from a healthy janitor with nothing to do. Cloudflare
orphans accumulated unobserved to 152/200 (~76% of the zone quota),
and only surfaced via a manual audit. The mechanism to catch this kind
of regression is to make the workflow loud: red runs prompt
investigation, green runs are presumed healthy.
Schedule/workflow_run/push paths now print three ::error:: lines
naming the missing secrets, the fix, and a one-line reference to this
incident, then exit 1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the fix #2234 applied to auto-sync-main-to-staging.yml in the
reverse direction. Both workflows now use the same merge-queue path
that humans use; no special-case bypass.
Why
Every tick of auto-promote-staging.yml since main's branch protection
went stricter has been failing with:
remote: error: GH006: Protected branch update failed for refs/heads/main.
remote: - Required status checks "Analyze (go)", "Analyze (javascript-typescript)",
"Analyze (python)", "Canvas (Next.js)", "Detect changes",
"E2E API Smoke Test", "Platform (Go)", "Python Lint & Test",
and "Shellcheck (E2E scripts)" were not set by the expected
GitHub apps.
remote: - Changes must be made through a pull request.
The previous version did `git merge --ff-only origin/staging &&
git push origin main` directly. That works against a permissive
branch — it doesn't work against a ruleset that requires checks
satisfied by the expected GitHub apps. Only PR merges through the
queue produce check runs from the right apps.
As a result, today's 12+ merges to staging never propagated to
main: the auto-promote ran every tick and failed every tick, while
operators had to keep opening manual `staging → main` bridges.
Fix
- Replace the direct git push step with a step that opens (or reuses)
a PR base=main head=staging and enables auto-merge. The merge queue
lands it once gates are green on the merge_group ref.
- The PR's head IS the staging branch (no per-SHA promote branch
needed) — the whole purpose is "advance main to staging's tip".
- Add `pull-requests: write` permission so the workflow can call
gh pr create + gh pr merge --auto.
- Drop the `git merge-base --is-ancestor` divergence check — the
merge queue itself enforces branch protection now, and rejects
the PR if main has diverged from staging history.
Loop safety preserved: when this PR's merge lands on main, it
triggers auto-sync-main-to-staging.yml which opens a sync PR back
to staging. That sync PR's eventual merge is by GITHUB_TOKEN (the
merge queue) which doesn't trigger downstream workflow_run events
— so auto-promote-staging.yml does NOT re-fire from its own merge
landing.
Refs: #2234 (the parallel fix for auto-sync-main-to-staging.yml),
task #142, multiple failing runs visible in
https://github.com/Molecule-AI/molecule-core/actions/workflows/auto-promote-staging.yml
Consolidates the remaining safe-to-merge dependabot PRs from the
2026-04-28 wave into one consumable PR. Replaces three earlier
single-bump PRs (#2245, #2230, #2231) which were closed in favor of
this single batch — same pattern as #2235.
GitHub Actions majors (SHA-pinned per org convention):
github/codeql-action v3 → v4.35.2 (#2228)
actions/setup-node v4 → v6.4.0 (#2218)
actions/upload-artifact v4 → v7.0.1 (#2216)
actions/setup-python v5 → v6.2.0 (#2214)
npm dev deps (canvas/, lockfile regenerated in node:22-bookworm
container so @emnapi/* and other Linux-only optional deps are
properly resolved — Mac-native `npm install` strips them, which
caused the earlier #2235 batch to drop these two):
@types/node ^22 → ^25.6 (#2231)
jsdom ^25 → ^29.1 (#2230)
Why each is safe
setup-node v4 → v6 / setup-python v5 → v6:
Every consumer call pins node-version / python-version
explicitly. v5 / v6 changed defaults but pinned consumers
are unaffected. Confirmed via grep across .github/workflows/
— all setup-node call sites pin '20' or '22', all
setup-python call sites pin '3.11'.
codeql-action v3 → v4.35.2:
Used as init/autobuild/analyze sub-actions in codeql.yml.
v4 bundles a newer CodeQL CLI; ubuntu-latest auto-updates
so functional behavior is unchanged. The deprecated
CODEQL_ACTION_CLEANUP_TRAP_CACHES env var (per v4.35.2
release notes) is undocumented and we don't set it.
upload-artifact v4 → v7.0.1:
v6 introduced Node.js 24 runtime requiring Actions Runner
>= 2.327.1. All upload-artifact users (codeql.yml,
e2e-staging-canvas.yml) run on `ubuntu-latest` (GitHub-
hosted), which auto-updates the runner agent. Self-hosted
runners are NOT used for these jobs.
@types/node 22 → 25 / jsdom 25 → 29:
Both are dev-only — @types/node is type definitions,
jsdom backs vitest's DOM environment. Tests pass:
79 files / 1154 tests in node:22-bookworm container.
Verified locally (Linux container so the lockfile reflects what
CI's `npm ci` will install):
- cd canvas && npm install --include=optional → 169 packages
- npm test → 1154/1154 pass
- npm ci → clean install succeeds
- npm run build → Next.js prerendering succeeds
Closes when this lands (the 3 individual auto-merge PRs from earlier
were closed):
#2228 #2218 #2216 #2214 #2231 #2230
NOT included (CI failing on dependabot's own run — major framework
bumps that need code-side migration tasks, not safe auto-bumps):
#2233 next 15 → 16
#2232 tailwindcss 3 → 4
#2226 typescript 5 → 6
Branch protection on `main` requires "E2E API Smoke Test" as a status
check. With Design B's no-op + e2e-api job split, when paths-filter
excludes a commit:
- e2e-api job (name="E2E API Smoke Test"): SKIPPED
- no-op job (name="no-op"): SUCCESS
Branch protection counts the skipped check-run as not satisfied, so
auto-promote-staging's `git push origin main` was rejected with GH006.
Observed 2026-04-28 00:22 UTC: every gate green at the workflow level,
all_green=true in auto-promote-staging's gate-check, but the FF push
itself rejected with:
Required status checks "..., E2E API Smoke Test, ..." were not set
by the expected GitHub apps.
Fix: give the no-op job the same `name:` as the real one. Now both
register as check-runs named "E2E API Smoke Test" — exactly one runs
per workflow execution (mutex `if`), the other registers as skipped
with the same name. Branch protection sees at least one success,
requirement satisfied.
Same fix applied to e2e-staging-canvas.yml's no-op (name → "Canvas
tabs E2E") for symmetry, even though "Canvas tabs E2E" isn't currently
in main's required check list — kept consistent so the next time a
required-checks reshuffle pulls it in, it doesn't recreate this bug.
Note: Design B's intent was always "emit a result auto-promote can
read" — that intent was satisfied at the workflow-conclusion level
(success), but missed the per-check-run-name level. This PR closes
that second-order gap.
The PR-built wheel + import smoke gate refused the platform_tools
package because it's a new subdirectory under workspace/ that wasn't
in scripts/build_runtime_package.py:SUBPACKAGES. The drift gate (which
exists for exactly this reason) caught it cleanly:
error: SUBPACKAGES drifted from workspace/ subdirectories:
in workspace/ but NOT in SUBPACKAGES (will ship un-rewritten or
be excluded): ['platform_tools']
Adding platform_tools to SUBPACKAGES wires the package into the
runtime wheel + applies the canonical
from platform_tools.<x> -> from molecule_runtime.platform_tools.<x>
import-rewrite step that every other subpackage uses.
Verified locally: scripts/build_runtime_package.py succeeds, the
rewritten a2a_mcp_server.py reads
from molecule_runtime.platform_tools.registry import TOOLS
which matches the package layout in the wheel.
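
For readers unfamiliar with the gate, a minimal sketch of the two
mechanisms follows. It is not the code in
scripts/build_runtime_package.py. `find_drift` and `rewrite_imports`
are hypothetical helper names, the example SUBPACKAGES entries are
assumed, and only the `from <pkg>...` form shown above is handled.

```python
import re

# Assumed example contents; the real list lives in build_runtime_package.py.
SUBPACKAGES = ["builtin_tools", "policies"]


def find_drift(workspace_dirs, subpackages):
    """Return directories present in workspace/ but missing from SUBPACKAGES."""
    return sorted(set(workspace_dirs) - set(subpackages))


def rewrite_imports(source, subpackages, root="molecule_runtime"):
    """Prefix `from <subpackage>...` imports with the wheel's root package."""
    names = "|".join(re.escape(n) for n in subpackages)
    # The lookahead stops a name like platform_tools_extra from matching.
    pattern = re.compile(rf"^([ \t]*from[ \t]+)({names})(?=[.\s])", re.MULTILINE)
    return pattern.sub(rf"\g<1>{root}.\g<2>", source)


dirs_on_disk = ["builtin_tools", "policies", "platform_tools"]
print(find_drift(dirs_on_disk, SUBPACKAGES))  # ['platform_tools']

fixed = SUBPACKAGES + ["platform_tools"]
print(find_drift(dirs_on_disk, fixed))  # []
print(rewrite_imports("from platform_tools.registry import TOOLS\n", fixed), end="")
# from molecule_runtime.platform_tools.registry import TOOLS
```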
e2e-staging-canvas had a single global concurrency group:
    concurrency:
      group: e2e-staging-canvas
      cancel-in-progress: false
That meant the entire repo shared one running + one pending slot. When a
staging push queued behind an in-flight run and a third entrant (a PR
run, a follow-on push) entered the group, the staging push got
cancelled. auto-promote-staging then saw `completed/cancelled` for a
required gate and refused to advance main.
Observed 2026-04-28 23:51-23:53: staging tip 3f99fede's
e2e-staging-canvas push run was cancelled within 2:20 of starting
because a PR run on a follow-on branch entered the group.
Auto-promote-staging fired 8+
times after that, all skipped because canvas was still in the cancelled
state. The chain stayed stuck until the cancelled run was manually
re-dispatched.
e2e-api had a softer version of the same bug — `group: e2e-api-${{
github.ref }}`. Per-ref isolates push events from PR events, so this
specific scenario didn't hit it, but back-to-back pushes to staging at
SHA-A and SHA-B share refs/heads/staging and would still cancel SHA-A's
queued run when SHA-B enters.
Both workflows now use per-SHA grouping. The single-global-group's
original intent was to throttle parallel E2E provisions, but each E2E
run already isolates its state via fresh-org-per-run, and parallel
infrastructure cost at our scale (~$0.001/min × 10min × 2) is rounding
error compared to a stuck pipeline.
Per-SHA still dedupes accidental double-triggers for the SAME SHA.
It does not cancel obsolete-PR-version runs on force-push — that wasted
CI is acceptable given the alternative is losing staging-tip data that
auto-promote-staging depends on.
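
The slot behaviour can be illustrated with a small simulation. It
models only the rule described above (one running and one pending slot
per group, and a new entrant displaces the pending run) and assumes no
run finishes during the sequence. The per-SHA group key format used
here is an assumption; the workflows' actual expression is not
reproduced.

```python
def simulate(entrants):
    """entrants: list of (run_id, group). Returns run_ids that get cancelled."""
    running = {}
    pending = {}
    cancelled = []
    for run_id, group in entrants:
        if group not in running:
            running[group] = run_id          # free slot: start immediately
        elif group in pending:
            cancelled.append(pending[group])  # displace the queued run
            pending[group] = run_id
        else:
            pending[group] = run_id          # first run to queue behind running
    return cancelled


GLOBAL = "e2e-staging-canvas"
runs = [("pr-run", "sha-a"), ("staging-push", "sha-b"), ("follow-on-pr", "sha-c")]

# One global group: the queued staging push is displaced by the third entrant.
global_cancelled = simulate([(rid, GLOBAL) for rid, _ in runs])

# Per-SHA groups: each run gets its own slots, nothing is displaced.
per_sha_cancelled = simulate([(rid, f"{GLOBAL}-{sha}") for rid, sha in runs])

print(global_cancelled)   # ['staging-push']
print(per_sha_cancelled)  # []
```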
Other gate workflows: ci.yml uses `cancel-in-progress: true` which is
correct for unit tests (intentional cancellation on supersede). codeql.yml
is per-ref like e2e-api was; same fix probably applies if the same
deadlock pattern is observed there, but no incident yet so deferring.
Establishes workspace/platform_tools/registry.py as THE place tool
naming and docs live. Every consumer reads from it; nothing duplicates
the source. Closes the architectural gap behind the doc/tool drift
discussion 2026-04-28 — adding hundreds of future runtime SDK adapters
should not require touching tool names anywhere except the registry.
What the registry owns
ToolSpec dataclass with: name, short (one-line description), when_to_use
(multi-paragraph agent-facing usage guidance), input_schema (JSON Schema),
impl (the actual coroutine in a2a_tools.py), section ('a2a' | 'memory').
TOOLS list with 8 entries — delegate_task, delegate_task_async,
check_task_status, list_peers, get_workspace_info, send_message_to_user,
commit_memory, recall_memory.
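
A rough sketch of the shape, assuming a frozen dataclass. The field
names come from the description above. The types, the stub coroutine,
the placeholder strings and the `by_name` signature are assumptions,
not the contents of registry.py.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Literal


@dataclass(frozen=True)
class ToolSpec:
    name: str
    short: str                              # one-line description
    when_to_use: str                        # agent-facing usage guidance
    input_schema: dict[str, Any]            # JSON Schema
    impl: Callable[..., Awaitable[Any]]     # coroutine implementing the tool
    section: Literal["a2a", "memory"]


async def _list_peers_stub() -> list[str]:
    """Stand-in for the real coroutine in a2a_tools.py."""
    return []


# Placeholder entry; the real registry has 8 of these.
TOOLS = [
    ToolSpec(
        name="list_peers",
        short="(placeholder one-line description)",
        when_to_use="(placeholder usage guidance)",
        input_schema={"type": "object", "properties": {}},
        impl=_list_peers_stub,
        section="a2a",
    ),
]


def by_name(name: str) -> ToolSpec:
    """Look up a spec by tool name; raises KeyError if it is not registered."""
    for spec in TOOLS:
        if spec.name == name:
            return spec
    raise KeyError(name)


spec = by_name("list_peers")
print(spec.section)              # a2a
print(asyncio.run(spec.impl()))  # []
```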
What now reads from the registry
- workspace/a2a_mcp_server.py
The hardcoded TOOLS list (167 lines of hand-maintained dicts) is
gone. Replaced with a 6-line list comprehension over the registry.
MCP description = spec.short. inputSchema = spec.input_schema.
- workspace/executor_helpers.py
get_a2a_instructions(mcp=True) and get_hma_instructions() now
GENERATE the agent-facing system-prompt text from the registry.
Heading + per-tool bullet (spec.short) + per-tool when_to_use +
a section-specific footer. No more hand-maintained instruction
blocks that drift from reality.
- workspace/builtin_tools/delegation.py
Renamed delegate_to_workspace -> delegate_task_async to match
registry. check_delegation_status -> check_task_status. Added
sync delegate_task @tool wrapping a2a_tools.tool_delegate_task
(was missing for LangChain runtimes — CP review Issue 3).
- workspace/builtin_tools/memory.py
Renamed search_memory -> recall_memory to match registry.
- workspace/adapter_base.py, workspace/main.py
Bundle all 7 core tools (was 6) into all_tools / base_tools.
- workspace/coordinator.py, shared_runtime.py, policies/routing.py
Updated system-prompt-text references to use the registry names.
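
The two main consumers can be sketched roughly as follows. This is an
illustration of the pattern, not the code in a2a_mcp_server.py or
executor_helpers.py. `Spec`, `REGISTRY`, `render_instructions` and all
description strings are hypothetical stand-ins.

```python
from typing import NamedTuple


class Spec(NamedTuple):
    name: str
    short: str
    when_to_use: str
    input_schema: dict
    section: str


# Placeholder entries standing in for the real registry.
REGISTRY = [
    Spec("delegate_task", "(placeholder short text)", "(placeholder guidance)",
         {"type": "object", "properties": {}}, "a2a"),
    Spec("recall_memory", "(placeholder short text)", "(placeholder guidance)",
         {"type": "object", "properties": {}}, "memory"),
]

# MCP tool list derived from the registry instead of hand-maintained dicts.
mcp_tools = [
    {"name": s.name, "description": s.short, "inputSchema": s.input_schema}
    for s in REGISTRY
]


def render_instructions(heading: str, section: str, footer: str) -> str:
    """Heading + per-tool bullet + per-tool guidance + section footer."""
    specs = [s for s in REGISTRY if s.section == section]
    lines = [heading, ""]
    lines += [f"- {s.name}: {s.short}" for s in specs]
    lines.append("")
    lines += [f"{s.name}: {s.when_to_use}" for s in specs]
    lines += ["", footer]
    return "\n".join(lines)


print([t["name"] for t in mcp_tools])  # ['delegate_task', 'recall_memory']
print(render_instructions("Memory tools", "memory", "(section footer)"))
```

Because both outputs are computed from the same list, renaming a tool
in the registry changes the MCP surface and the prompt text together.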
Structural alignment tests
workspace/tests/test_platform_tools.py — 9 tests pin every
registry-to-adapter mapping:
- registry names are unique
- a2a + memory partition is complete (no orphans)
- by_name lookup works
- MCP server registers exactly the registry's tool set
- MCP description equals registry.short for every tool
- MCP inputSchema equals registry.input_schema for every tool
- get_a2a_instructions text contains every a2a tool name
- get_hma_instructions text contains every memory tool name
- pre-rename names (delegate_to_workspace, search_memory,
check_delegation_status) cannot leak back
Adding a future tool means adding one ToolSpec; the test failure
list tells the author exactly which adapter to update.
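
Two of the checks listed above, sketched as plain functions rather
than the actual tests in test_platform_tools.py. The helper names are
hypothetical; the tool names and retired names are taken from this
message.

```python
RETIRED_NAMES = {"delegate_to_workspace", "search_memory", "check_delegation_status"}

REGISTRY_NAMES = [
    "delegate_task", "delegate_task_async", "check_task_status", "list_peers",
    "get_workspace_info", "send_message_to_user", "commit_memory", "recall_memory",
]


def duplicate_names(names):
    """Names that appear more than once (should be empty)."""
    return sorted({n for n in names if names.count(n) > 1})


def leaked_retired_names(names):
    """Pre-rename names that have crept back in (should be empty)."""
    return sorted(RETIRED_NAMES & set(names))


print(duplicate_names(REGISTRY_NAMES))       # []
print(leaked_retired_names(REGISTRY_NAMES))  # []
print(leaked_retired_names(REGISTRY_NAMES + ["search_memory"]))  # ['search_memory']
```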
Adapter pattern for future SDK support
When (e.g.) AutoGen or Pydantic AI gets adapters, the only work
needed for tool surfacing is "wrap registry.TOOLS in your SDK's
tool format." Names, descriptions, schemas, impl come from the
registry — adapter author writes zero strings.
Why this needed to ship now
PR #2237 (already in staging) injected MCP-world docs as the
default system-prompt content. Without the registry, those docs
said "delegate_task" while LangChain runtimes only had
"delegate_to_workspace" — workers see docs for tools that don't
exist (CP review Issue 1+3). PR #2239 was a tactical rename;
this PR is the structural fix that prevents the same class of
drift from recurring as new adapters ship.
PR #2239 was closed in favor of this — same renames, plus the
registry, plus structural tests. Single coherent change.
Tests: 1232 pass, 2 xfailed (pre-existing). 9 new in
test_platform_tools.py; 4 alignment tests in test_prompt.py from
#2237 still pass; original test_executor_helpers tests adapted to
the registry-driven world.
Refs: CP review Issues 1, 2, 3, 5; project memory
project_runtime_native_pluggable.md (platform owns A2A);
project memory feedback_doc_tool_alignment.md (this is the structural
fix for the tactical lesson).
Self-review caught a real correctness bug:
publish-workspace-server-image can complete BEFORE E2E Staging SaaS
for a runtime-touching SHA. Publish typically takes ~5-10min and E2E
~10-15min, so this ordering is the common case for runtime-path PRs.
Previous gate logic:
- completed/success: proceed
- completed/failure: abort
- everything else (including in_progress): proceed ← BUG
If publish-trigger fires while E2E is still running, the gate returned
"in_progress/none" and fell through the catch-all "proceed" branch.
Result: :latest retagged on the publish signal alone. Then E2E ends
red — but :latest was already wrongly advanced; the E2E-completion
trigger's job-level if=conclusion==success filter just skips, never
rolls back.
Fix: explicit case for in_progress|queued|requested|waiting|pending
that DEFERS — sets gate.proceed=false, writes a "deferred" summary,
exits 0 (workflow run shows success, retag steps skipped). The E2E
completion trigger then fires later and either promotes (green) or
aborts (red), giving us correct ordering regardless of who finishes
first.
Subsequent steps now guarded by `if: steps.gate.outputs.proceed ==
'true'` instead of relying on `exit 1` for skip semantics.
Also added an explicit catch-all `*)` branch that aborts on unknown
states (forward-compat: GitHub adds a new status, we surface it
instead of silently promoting through it).
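
The decision table, old versus new, can be sketched as below. The real
gate is a shell `case` statement in the workflow; this Python version
only mirrors the branches described above. How the real script treats
other completed conclusions (e.g. cancelled) is not stated here, so
routing them to the catch-all abort is an assumption.

```python
DEFER_STATUSES = {"in_progress", "queued", "requested", "waiting", "pending"}


def old_gate(status: str, conclusion: str) -> str:
    """Previous logic: anything not an explicit failure falls through to proceed."""
    if status == "completed" and conclusion == "failure":
        return "abort"
    return "proceed"  # the bug: in_progress lands here too


def new_gate(status: str, conclusion: str) -> str:
    """Fixed logic: explicit defer for unfinished runs, abort on unknown states."""
    if status == "completed" and conclusion == "success":
        return "proceed"
    if status == "completed" and conclusion == "failure":
        return "abort"
    if status in DEFER_STATUSES:
        return "defer"  # proceed=false, exit 0, retag steps skipped
    return "abort"      # catch-all for states this gate does not know


# Publish finishes while E2E is still running:
print(old_gate("in_progress", "none"))  # proceed
print(new_gate("in_progress", "none"))  # defer
print(new_gate("some_future_status", "none"))  # abort
```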