Closes #132. Extends the cascade propagation probe (added in #2197
and clarified in #2198) with a content-integrity check.
The previous probe verified pip can RESOLVE the version we just
published (catches surface 1+2 propagation lag — metadata + simple
index). It did NOT verify pip can DOWNLOAD bytes that match what we
uploaded — leaving a window where a Fastly stale-content scenario
(rare but PyPI has had it: e.g. 2026-04-01 incident where a CDN node
served a previous version's wheel under the new version's URL for
~90s after upload) would pass the probe and ship corrupt builds to
all 8 receiver templates.
Two-stage check, both must pass before the cascade fans out:
(a) `pip install --no-cache-dir PACKAGE==VERSION` succeeds —
version is resolvable. (Existing, unchanged.)
(b) `pip download` of the same wheel + `sha256sum` matches the
hash captured pre-upload from `dist/*.whl`. (New.)
The expected hash is captured BEFORE upload via a new `wheel_hash`
step that exposes `steps.wheel_hash.outputs.wheel_sha256`, bubbled up
as `needs.publish.outputs.wheel_sha256`, and consumed by the cascade
probe via the EXPECTED_SHA256 env var.
`pip download` is the right primitive: it writes the actual .whl
file (vs `pip install` which unpacks and discards), so we can
sha256sum it directly. Combined with --no-cache-dir + a wiped
/tmp/probe-dl per poll, every poll re-fetches from the live Fastly
edge — no local-cache mask.
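In plain Python terms, stage (b) amounts to the sketch below; the
package name, download directory, and helper name are placeholders,
not the actual workflow step:
    import hashlib
    import pathlib
    import shutil
    import subprocess
    def wheel_bytes_match(package: str, version: str, expected_sha256: str) -> bool:
        dl_dir = pathlib.Path("/tmp/probe-dl")
        shutil.rmtree(dl_dir, ignore_errors=True)  # wiped per poll: no cache mask
        dl_dir.mkdir(parents=True)
        subprocess.run(
            ["pip", "download", "--no-cache-dir", "--no-deps",
             f"{package}=={version}", "-d", str(dl_dir)],
            check=True,
        )
        wheel = next(dl_dir.glob("*.whl"))         # the just-downloaded wheel
        return hashlib.sha256(wheel.read_bytes()).hexdigest() == expected_sha256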
Per-poll cost: ~3-5s pip install + ~3s pip download + 4s sleep.
30-poll budget = ~5-6 min wall on a slow runner (vs the previous
~4-5 min for resolve-only). Well within the cascade's tolerance for
a known-rare CDN issue, and the overwhelmingly common case (Fastly
serves matching bytes immediately) exits on the first poll.
Verified locally: pip download of the current PyPI-latest
(molecule-ai-workspace-runtime 0.1.29) produced
sha256=7e782b2d50812257…, exactly matching PyPI's own metadata
endpoint. The mismatch path is exercised inline (different builds
of the same version produce different hashes by definition — the
build_runtime_package.py output is timestamp-deterministic only
within a single CI invocation).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cascade's PyPI-propagation gate polled `/pypi/<pkg>/<ver>/json`,
which is one of THREE surfaces pip touches when resolving an install:
1. /pypi/<pkg>/<ver>/json — metadata endpoint (the old check)
2. /simple/<pkg>/ — pip's primary download index
3. files.pythonhosted.org — CDN-fronted wheel binary
Each has its own cache. Any one of them can lag behind the others,
and the previous gate would let the cascade fire while (2) or (3)
still served the previous version. Downstream `pip install` in the
template repos then resolved to the OLD wheel, the docker layer
cache locked that stale resolution in, and subsequent rebuilds kept
shipping the old runtime — the "five times in one night" cache trap
referenced in the prior comment.
Replace the metadata-only poll with an actual `pip install
--no-cache-dir --force-reinstall --no-deps PACKAGE==VERSION` from
a fresh venv. If pip can resolve and install the exact version we
just published, every receiver template will too — pip itself is
the ground truth for what the receivers will see, no proxy guessing
about which surface is lagging.
- Venv created once outside the loop; only `pip install` runs in
the poll body.
- --no-cache-dir + --force-reinstall ensures every poll hits the
live PyPI surfaces (no local-cache mask).
- --no-deps keeps each poll fast — we only care about resolving
THIS package, not its dep tree.
- Loop budget: 30 attempts × 4s ≈ 2 min (vs prior 30 × 2s = 60s).
  Generous compared with typical PyPI propagation; anything still
  failing past the budget surfaces as a real upstream issue.
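Expressed as standalone Python rather than the actual workflow step,
the poll body is roughly the following sketch (the venv's pip path is
created once, outside the loop; names are illustrative):
    import subprocess
    import time
    def wait_until_resolvable(pip: str, package: str, version: str,
                              attempts: int = 30, sleep_s: int = 4) -> bool:
        for _ in range(attempts):
            result = subprocess.run(
                [pip, "install", "--no-cache-dir", "--force-reinstall",
                 "--no-deps", f"{package}=={version}"],
            )
            if result.returncode == 0:
                return True   # pip resolved + installed the exact version
            time.sleep(sleep_s)
        return False          # budget exhausted: a real upstream issue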
Verified locally:
- Probing a non-existent version (0.1.999999) → pip exits 1, loop
retries.
- Probing the current PyPI-latest → pip exits 0, `pip show`
returns the version, loop succeeds.
Closes #130.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #128's chicken-and-egg problem. The original gate installed the
CURRENTLY-PUBLISHED molecule-ai-workspace-runtime from PyPI, then
overlaid workspace/requirements.txt, then smoke-imported. That
catches problems with the already-shipped artifact (the daily-cron
upstream-yank case), but it cannot catch problems introduced by the
PR itself: the imports it exercises are from the OLD wheel, not the
PR's source. A PR that adds `from a2a.utils.foo import bar` (where
`bar` is added in a2a-sdk 1.5 and the runtime currently pins 1.3)
slips through:
1. Pip resolves the existing PyPI wheel + a2a-sdk 1.3.
2. Smoke imports the OLD main.py — no reference to `bar` → green.
3. Merge → publish-runtime.yml ships a wheel WITH the new import.
4. Tenant images redeploy → all crash on first boot with
ImportError: cannot import name 'bar' from 'a2a.utils.foo'.
Splits the workflow into two jobs:
- pypi-latest-install (renamed from default-install): unchanged
behavior. Runs on the daily cron and on requirements.txt /
workflow edits. Catches upstream PyPI yanks + the
already-shipped artifact going stale.
- local-build-install (new): runs scripts/build_runtime_package.py
on the PR's workspace/, builds the wheel with python -m build
(mirroring publish-runtime.yml byte-for-byte), installs that
wheel, then runs the same smoke import. Tests the artifact
that WOULD be published if this PR merges.
Path filter widened to workspace/** so any runtime-source change
triggers the local-build job. The pypi-latest job's filter is the
same union; its internal logic is unchanged so the daily-cron and
upstream-detection use cases continue to work.
Verified locally: built the wheel from the current workspace/ source
via the same script + python -m build invocation, installed it into a
fresh venv, and ran `from molecule_runtime.main import main_sync`
successfully.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing wheel-smoke catches AgentCard kwarg-shape regressions
(state_transition_history, supported_protocols) but doesn't catch the
SDK-contract drift class that #2193 just fixed in production: the
a2a-sdk 1.x rename of /.well-known/agent.json →
/.well-known/agent-card.json, plus AGENT_CARD_WELL_KNOWN_PATH moving
to a2a.utils.constants. main.py's readiness probe hardcoded the old
literal and 404'd every attempt, silently dropping every workspace's
initial_prompt for ~weeks before a user reported it.
Two additions to the smoke block:
1. Mount alignment: build an AgentCard, call create_agent_card_routes(),
and assert AGENT_CARD_WELL_KNOWN_PATH is among the mounted paths.
Catches a future SDK release that decouples the constant value
from the route factory's mount path. The source-tree test
(workspace/tests/test_agent_card_well_known_path.py) catches the
main.py side; this catches the SDK side BEFORE PyPI upload.
2. Message helper smoke: import a2a.helpers.new_text_message and
instantiate one. The v0→v1 cheat sheet (memory:
reference_a2a_sdk_v0_to_v1_migration.md) flagged this as a real
migration find — main.py and a2a_executor.py call it in hot
paths, so an import break errors every reply before the message
even leaves the workspace.
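Roughly what the two checks amount to, as a hedged sketch: the
constant and the helper's import path come from this commit, while the
route factory's import location, its return shape (objects exposing a
.path attribute), and new_text_message's argument are assumptions:
    from a2a.utils.constants import AGENT_CARD_WELL_KNOWN_PATH
    from a2a.helpers import new_text_message
    def assert_mount_alignment(card, create_agent_card_routes) -> None:
        # `card` is the AgentCard the smoke block already builds; the route
        # factory is passed in because its import location isn't stated here.
        routes = create_agent_card_routes(card)
        mounted = {getattr(route, "path", None) for route in routes}
        assert AGENT_CARD_WELL_KNOWN_PATH in mounted, mounted
    def assert_message_helper() -> None:
        msg = new_text_message("smoke")  # argument shape assumed
        assert msg is not None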
Verified by running the equivalent Python inside
ghcr.io/molecule-ai/workspace-template-langgraph:latest:
✓ well-known mount alignment OK (/.well-known/agent-card.json)
✓ message helper import + call OK
Closes the structural-fix half of the #2193 finding from the code-
review-and-quality pass: "the wheel publish smoke didn't catch this.
This is the 7th a2a-sdk migration find of this kind. Task #131 is the
right root-cause fix."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Independent code review caught a real bug in the previous commit's
stale-token revoke pass. The platform's restart endpoint
(workspace_restart.go:104) stops the workspace container synchronously,
then dispatches re-provisioning to a goroutine (line 173). For a
workspace that's been idle past the 5-minute grace window — extremely
common: user comes back to a long-idle workspace and clicks Restart —
this opens a race window:
1. Container stopped → ListWorkspaceContainerIDPrefixes returns no
entry → workspace becomes a stale-token candidate.
2. issueAndInjectToken runs in the goroutine: revokes old tokens,
issues a fresh one, writes it to /configs/.auth_token.
3. If the sweeper's predicate-only UPDATE
   `WHERE workspace_id = $1 AND revoked_at IS NULL` runs AFTER
   IssueToken commits, but still inside the sweeper's own
   SELECT-then-UPDATE window, it revokes the freshly issued token
   alongside the old ones.
4. Container starts with a now-revoked token → 401 forever.
The fix carries the SAME staleness predicate from the SELECT into the
per-workspace UPDATE: a token created within the grace window can't
match `< now() - grace` and is automatically excluded. The operation
is now idempotent against fresh inserts.
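The sweeper is Go; as plain SQL, the corrected per-workspace UPDATE
has roughly this shape (table and column names follow this PR chain,
the revocation assignment and parameter positions are assumptions):
    STALE_TOKEN_REVOKE_SQL = """
    UPDATE workspace_auth_tokens
       SET revoked_at = now()
     WHERE workspace_id = $1
       AND revoked_at IS NULL
       -- same staleness bound as the SELECT: a token issued or used inside
       -- the grace window can never match, so a fresh token survives
       AND COALESCE(last_used_at, created_at) < now() - make_interval(secs => $2)
    """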
Also addresses other findings from the same review:
- Add `status NOT IN ('removed', 'provisioning')` to the SELECT
(R2 + first-line C1 defence). 'provisioning' is set synchronously
in workspace_restart.go before the async re-provision begins, so
it's a reliable in-flight signal that narrows the candidate set.
- Stop calling wsauth.RevokeAllForWorkspace from the sweeper —
that helper revokes EVERY live token unconditionally; the sweeper
needs "every STALE live token" which is a different (safer)
operation. Inline the UPDATE so we own the predicate end-to-end.
Drop the wsauth import (no longer needed in this package).
- Tighten expectStaleTokenSweepNoOp regex to anchor at start and
require the status filter, so a future query whose first line
coincidentally starts with "SELECT DISTINCT t.workspace_id" can't
silently absorb the helper's expectation (R3).
- Defensive `if reaper == nil { return }` at top of
sweepStaleTokensWithoutContainer — even though StartOrphanSweeper
already short-circuits on nil, a future refactor that wires this
pass directly without checking would otherwise mass-revoke in
CP/SaaS mode (F2).
- Comment in the function explaining why an empty `likes` list is
  intentionally NOT a short-circuit (asymmetry with the first two
  passes is the whole point — "no containers running" is the
  load-bearing case).
- Add TestSweepOnce_StaleTokenRevokeUsesStalenessPredicate that
asserts the UPDATE shape (predicate present, grace bound). A
real-Postgres integration test would prove the race resolution
end-to-end; this catches the regression where someone simplifies
the UPDATE back to predicate-only.
- Add TestSweepStaleTokens_NilReaperEarlyExit pinning the F2 guard.
Existing tests updated to match the new query/UPDATE shape with tight
regexes that pin all the safety guards (status filter, staleness
predicate in both SELECT and UPDATE).
Full Go suite green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Heals the user-reported "auth token conflict after volume wipe" failure
mode. When an operator nukes a workspace's /configs volume outside the
platform's restart endpoint (common via `docker compose down -v` or
manual cleanup scripts), the DB still holds live workspace_auth_tokens
for that workspace while the recreated container has an empty
/configs/.auth_token. Subsequent /registry/register calls 401 forever:
requireWorkspaceToken sees live tokens, container has no token to
present, and the workspace is permanently wedged until an operator
manually revokes via SQL.
The platform's restart endpoint already handles this correctly via
wsauth.RevokeAllForWorkspace inside issueAndInjectToken. This change
adds a third orphan-sweeper pass — sweepStaleTokensWithoutContainer —
as the safety net for the equivalent action taken outside the API.
Detection criterion: workspace has at least one live (non-revoked)
token whose most-recent activity (COALESCE(last_used_at, created_at))
is older than staleTokenGrace (5 minutes), AND no live Docker
container's name prefix matches the workspace ID.
Safety filters that bound the revoke radius:
1. Only runs in single-tenant Docker mode. The orphan sweeper is
wired only when prov != nil in cmd/server/main.go — CP/SaaS mode
never gets here, so an empty container list cannot be confused
with "no Docker at all" (which would otherwise revoke every
workspace's tokens in production SaaS).
2. staleTokenGrace = 5min skips tokens issued/used in the last
5 minutes. Bounds the race with mid-provisioning (token issued
moments before docker run completes) and brief restart windows
— a healthy workspace touches last_used_at every 30s heartbeat,
so 5min is 10× the heartbeat interval.
3. The query joins workspaces.status != 'removed' so deleted
workspaces are not revoked here (handled at delete time by the
explicit RevokeAllForWorkspace call).
4. make_interval(secs => $2) avoids a time.Duration.String() →
"5m0s" mismatch with Postgres interval grammar that I caught
during implementation.
5. Each revocation logs the workspace ID so operators can correlate
"workspace just lost auth" with this sweeper, not blame a
network blip.
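As plain SQL (the sweeper is Go), the candidate query is roughly the
following; the join condition and aliases are assumptions, while the
predicates mirror the criterion and filters above:
    STALE_TOKEN_CANDIDATES_SQL = """
    SELECT DISTINCT t.workspace_id
      FROM workspace_auth_tokens t
      JOIN workspaces w ON w.id = t.workspace_id
     WHERE t.revoked_at IS NULL
       AND w.status != 'removed'
       AND COALESCE(t.last_used_at, t.created_at)
           < now() - make_interval(secs => $1)
    """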
Failure mode: revoke fails (transient DB error). Loop bails to avoid
log spam; next 60s cycle retries. Worst case a workspace stays
401-blocked an extra minute.
Tests: 5 new tests covering the headline scenario, the safety gate
(workspace with container is NOT revoked), revoke-failure-bails-loop,
query-error-non-fatal, and Docker-list-failure-skips-cycle. All 11
existing sweepOnce tests updated to register the new third-pass query
expectation via a small `expectStaleTokenSweepNoOp` helper that keeps
their existing assertions readable.
Full Go test suite green: registry, wsauth, handlers, and all other
packages.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The initial-prompt readiness probe in workspace/main.py hardcoded the
pre-1.x well-known path. After the a2a-sdk 1.x bump the SDK started
mounting the agent card at the new canonical path (the value of
`a2a.utils.constants.AGENT_CARD_WELL_KNOWN_PATH`), so the probe
returned 404 every attempt and silently fell through to "server not
ready after 30s, skipping". Net effect: every workspace silently
dropped its `initial_prompt` from config.yaml — the agent never sent
the kickoff self-message, and users hit a fresh chat with no context.
Reported by an external user as "/.well-known/agent.json 404 — the
a2a-sdk agent card route was not being mounted at the expected path".
The route IS mounted; the probe was looking in the wrong place.
Fix imports `AGENT_CARD_WELL_KNOWN_PATH` from `a2a.utils.constants`
and uses it directly in the probe URL — the SDK constant is now the
single source of truth, so any future rename travels through
automatically.
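A minimal sketch of the probe shape after the fix; the base URL,
timeout, and helper name are illustrative, not main.py's actual
values:
    import httpx
    from a2a.utils.constants import AGENT_CARD_WELL_KNOWN_PATH
    def agent_card_ready(base_url: str = "http://127.0.0.1:8080") -> bool:
        # The SDK constant is the single source of truth for the path.
        try:
            resp = httpx.get(f"{base_url}{AGENT_CARD_WELL_KNOWN_PATH}", timeout=2.0)
            return resp.status_code == 200
        except httpx.HTTPError:
            return False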
Adds two static regression tests pinning the invariant:
1. No hardcoded `/.well-known/agent.json` literal anywhere in
main.py.
2. The probe URL f-string interpolates AGENT_CARD_WELL_KNOWN_PATH
(catches a "fix" that imports the constant for show but reverts
to a literal in the actual GET).
Verified manually inside ghcr.io/molecule-ai/workspace-template-langgraph
that AGENT_CARD_WELL_KNOWN_PATH == '/.well-known/agent-card.json' and
that `create_agent_card_routes(card)` mounts at exactly that path —
constant + mount are aligned in the runtime image, so the probe will
now find the server.
Full workspace test suite: 1209 passed, 2 xfailed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Manual fresh-user clean-slate test surfaced three friction points in
the existing dev-start.sh:
1. The script ran docker compose -f docker-compose.infra.yml
directly, bypassing infra/scripts/setup.sh — so the workspace
template registry was never populated and the canvas template
palette came up empty (the "Template palette is empty"
     troubleshooting entry).
2. ADMIN_TOKEN was not handled at all. Without it, the AdminAuth
fail-open gate worked initially but slammed shut the moment the
first workspace registered a token — at which point the canvas
could no longer call /workspaces or /templates. New users hit
401s with no obvious next step.
3. The script wasn't mentioned in docs/quickstart.md. New users
followed the documented 4-step manual flow and never discovered
the single command existed.
Fixes:
- dev-start.sh now calls infra/scripts/setup.sh, which brings up
full infra (postgres + redis + langfuse + clickhouse + temporal)
AND populates the template/plugin registry from manifest.json.
- On first run, dev-start.sh writes MOLECULE_ENV=development to
.env. This activates middleware.isDevModeFailOpen() which lets
the canvas keep calling admin endpoints without a bearer (the
intended local-dev escape hatch). The .env is preserved on
re-runs and sourced before the platform launches.
- The script intentionally does NOT auto-generate an ADMIN_TOKEN.
A first attempt did, and broke the canvas because isDevModeFailOpen
requires ADMIN_TOKEN empty AND MOLECULE_ENV=development together.
Setting ADMIN_TOKEN in dev would close the hatch and the canvas
has no way to read that token in a dev build (no
NEXT_PUBLIC_ADMIN_TOKEN bake step here). The .env comment block
explicitly warns future contributors not to add it.
- Both processes' logs go to /tmp/molecule-{platform,canvas}.log
     instead of being mixed into stdout, so the readiness banner
     stays clean.
- Health-poll loops cap at 30s with a clear timeout error pointing
to the log file, instead of hanging forever.
- The readiness banner now lists the log paths AND tells the user
the next step is "open localhost:3000 → add API key in Config →
Secrets & API Keys → Global", instead of just listing service
URLs.
Quickstart doc rewrite leads with:
git clone ...
cd molecule-monorepo
./scripts/dev-start.sh
The 4-step manual flow is preserved as "Manual setup (advanced)"
for contributors who want per-component logs.
Verified end-to-end from clean Docker (no containers, no volumes,
no .env) three times: total wall-clock ~12s for a re-run with
cached npm/docker layers. Platform's HTTP 200 on /workspaces
without a bearer confirms the dev-mode auth hatch is active.
Three different intermittent failures observed during a single
manual-test session — RemoteProtocolError, ReadTimeout, ConnectError —
each surfaced as a "Failed to deliver to <peer>" error chip in the
canvas Agent Comms panel even though the next attempt would have
succeeded (verified by direct probes from the same source workspace
to the same peer). The error message even told the user "Usually a
transient network blip — retry once," but it left the retry to a
human reading the error message.
Auto-retry inside send_a2a_message itself: up to 5 attempts (1
initial + 4 retries) with exponential backoff (1s, 2s, 4s, 8s,
16s-capped), each backoff jittered ±25% to break sync across
siblings. Cumulative wall-clock capped at 600s by
_DELEGATE_TOTAL_BUDGET_S so a string of 5×300s ReadTimeouts can't
make the caller wait 25 minutes — once the deadline elapses, retries
stop even if attempts remain.
Retry only on transport-layer transients:
- ConnectError / ConnectTimeout (peer's listening socket not ready)
- RemoteProtocolError (peer closed TCP without writing — observed
when a peer's prior in-flight Claude SDK session aborted)
- ReadError / WriteError (network blip on Docker bridge)
- ReadTimeout (peer wrote no response in 300s)
Application-level errors are NOT retried — they're deterministic and
retrying just wastes wall-clock:
- HTTP 4xx (peer rejected the request format)
- JSON parse failures (peer returned garbage)
- JSON-RPC error in response body (peer's runtime errored cleanly)
- Programmer-bug exceptions (ValueError, etc.)
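A condensed sketch of the loop's shape: the constants come from this
commit, while the helper names and the send callable are stand-ins,
not send_a2a_message's real signature:
    import asyncio
    import random
    import time
    import httpx
    _DELEGATE_MAX_ATTEMPTS = 5
    _DELEGATE_TOTAL_BUDGET_S = 600
    _TRANSIENT = (httpx.ConnectError, httpx.ConnectTimeout,
                  httpx.RemoteProtocolError, httpx.ReadError,
                  httpx.WriteError, httpx.ReadTimeout)
    def _delegate_backoff_seconds(attempt: int) -> float:
        # Pure schedule: 1s, 2s, 4s, 8s, capped at 16s, jittered +/-25%.
        return min(2 ** attempt, 16) * random.uniform(0.75, 1.25)
    async def _send_with_retry(do_send):
        deadline = time.monotonic() + _DELEGATE_TOTAL_BUDGET_S
        last_exc = None
        for attempt in range(_DELEGATE_MAX_ATTEMPTS):
            try:
                return await do_send()
            except _TRANSIENT as exc:  # anything else surfaces immediately
                last_exc = exc
                if time.monotonic() >= deadline:
                    break              # budget exhausted even if attempts remain
                await asyncio.sleep(_delegate_backoff_seconds(attempt))
        raise last_exc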
8 new tests pin the contract:
- retry succeeds after 2 RemoteProtocolErrors
- retry succeeds after 1 ConnectError
- all 5 attempts fail → returns formatted last-error
- capped at exactly _DELEGATE_MAX_ATTEMPTS (regression cover for
"did someone bump the constant accidentally?")
- JSON-RPC error response NOT retried (1 attempt only)
- non-httpx exception NOT retried (programmer bugs stay loud)
- total budget caps the loop even if attempts remain
- backoff schedule grows exponentially with ±25% jitter
Refactor: extracted _format_a2a_error() so the success and exhausted
paths share one error-formatting routine. _delegate_backoff_seconds()
is a pure function so the schedule is unit-testable without monkey-
patching asyncio.sleep.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Canvas Agent Comms bubbles for outbound delegation showed only
"Delegating to <peer>" boilerplate during the live update window —
the actual task text only surfaced after a refresh re-fetched the row
from /workspaces/:id/activity. Symptom flagged today during a fresh
delegation manual test where the bubble said "Delegating to Perf
Auditor" instead of the user's "audit moleculesai.app for
performance" prompt.
Root cause: LogActivity's broadcast payload at activity.go:510-518
deliberately omitted request_body and response_body, so the canvas's
live-update path (AgentCommsPanel.tsx:271-289) saw `p.request_body =
undefined` and toCommMessage fell back to the
`Delegating to ${peerName}` template string. The DB row stored the
real task / reply, which is why GET-on-mount worked.
Fix: include both bodies in the broadcast as json.RawMessage values
(no re-marshal cost — they were already encoded for the DB insert
above). Same pattern as tool_trace, which has been included since #1814.
Each side is bounded by the workspace-side caller's own caps: the
runtime's report_activity helper caps error_detail at 4096 chars and
summary at 256; request/response are constrained by the runtime's
own limits — typical delegate_task payload is hundreds of chars to a
few KB. If a much-larger broadcast becomes a concern later, a soft
cap can be added at this site without breaking the contract.
Regression tests pin the broadcast shape:
- request_body present → canvas renders the actual task text
- response_body present → canvas renders the actual reply text
- response_body nil → omitted from payload (no empty-bubble flicker)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cascade-deleting a 7-workspace org returned 500 with
"workspace marked removed, but 2 stop call(s) failed — please retry:
stop eeb99b5d-...: force-remove ws-eeb99b5d-607: Error response
from daemon: removal of container ws-eeb99b5d-607 is already in
progress"
even though the DB-side post-condition succeeded (removed_count=7) and
the containers WERE removed shortly after. The fanout fired Stop() on
every workspace concurrently and the orphan sweeper happened to reap
two of them at the same instant, so Docker rejected the second
ContainerRemove with "removal already in progress" — a race-condition
ack, not a real failure. Retrying just races the same in-flight
removal.
The post-condition we care about (the container WILL be gone) is
identical to a successful removal, so Stop() should treat it the
same way it already treats "No such container" — a no-op return nil
that lets the caller proceed with volume cleanup. Real daemon
failures (timeout, EOF, ctx cancel) still surface as errors.
Two pieces:
- New isRemovalInProgress() predicate using the same string-match
approach as isContainerNotFound (docker/docker has no typed
errdef for this; the CLI itself relies on the message).
- Stop() now treats the predicate as success, with a log line
distinct from the not-found path so debugging can tell which
race fired.
Both substrings ("removal of container" + "already in progress") must
match — "already in progress" alone would false-positive on unrelated
operations like image pulls. Truth table pinned in 7 new test cases.
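The platform code is Go; this small Python sketch only pins the
two-substring rule (the name and case handling are illustrative):
    def is_removal_in_progress(daemon_error: str) -> bool:
        msg = daemon_error.lower()
        # Both fragments must be present; "already in progress" alone would
        # false-positive on unrelated daemon operations such as image pulls.
        return "removal of container" in msg and "already in progress" in msg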
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both AgentCommsPanel and ChatTab's activity-feed opened raw
`new WebSocket(WS_URL)` instances per mount, with no onclose handler
and no reconnect logic. When the underlying connection dropped — idle
timeout, browser background-tab throttle, network jitter — the per-
panel sockets stayed dead until the panel re-mounted (refresh or
sub-tab unmount/remount). Live agent-comms bubbles and live activity
feed lines silently went missing in the gap, manifesting as "the
delegation didn't show up until I refreshed."
The global ReconnectingSocket in store/socket.ts already owns
reconnect, exponential backoff, health-check, and HTTP fallback poll.
Routing component subscribers through it gives every consumer those
guarantees for free, with one TCP connection per tab instead of N.
Three new pieces:
- store/socket-events.ts: tiny pub/sub bus. emitSocketEvent fans out
every decoded WSMessage to the listener Set; subscribeSocketEvents
returns an unsubscribe. A throwing listener is logged and isolated
so it can't break siblings.
- store/socket.ts: ws.onmessage now calls emitSocketEvent(msg) right
after applyEvent(msg), so the store's derived state and component
subscribers stay in lockstep on every event arrival.
- hooks/useSocketEvent.ts: React hook that registers exactly once
per mount, capturing the latest handler in a ref so the closure
sees current state/props without re-subscribing on every render.
Refactored sites:
- AgentCommsPanel: replaced its WebSocket-in-useEffect block with
useSocketEvent. Same parsing logic; the panel no longer opens its
own connection.
- ChatTab activity feed: split the previous useEffect in two — one
seeds the activity log when `sending` flips, the other subscribes
unconditionally and gates work on `sending` inside the handler.
Hooks can't be conditional, so the gate has to live in the body
rather than around the effect.
The ws-close graceful-close helper is no longer needed in either
site; the global socket owns its own teardown.
Tests: 6 new tests for the bus contract (single delivery, fan-out
order, unsubscribe, throwing-listener isolation, no-subscriber emit,
duplicate-subscribe Set semantics). All 27 existing socket tests
still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Manual-test failure surfaced what was hidden behind the MCP-path bug:
once delegate_task could actually fire, every cross-workspace call
came back as JSON-RPC -32600 "Invalid Request" with the underlying
pydantic ValidationError:
params.message.role
Input should be 'agent' or 'user' [type=enum,
input_value='ROLE_USER', input_type=str]
PR #2184's a2a-sdk 1.x migration sweep over-corrected: it changed
every `"role": "user"` literal in JSON-RPC payload construction to
`"role": "ROLE_USER"` to match the protobuf enum names of the 1.x
native types (a2a.types.Role.ROLE_USER / ROLE_AGENT). That was
correct for in-process Message construction (which the SDK
serialises before wire transmission) but WRONG for the 8 sites that
hand-build JSON-RPC payloads. The workspace's own a2a-sdk runs
inbound requests through the v0.3 compat adapter
(/usr/local/lib/python3.11/site-packages/a2a/compat/v0_3/) because
main.py sets enable_v0_3_compat=True for backwards compatibility,
and that adapter validates against the v0.3 Pydantic Role enum
(`agent` | `user` lowercase). The protobuf-style names blow it up.
Reverted the 8 wire-payload sites to lowercase:
- workspace/a2a_client.py:74
- workspace/a2a_cli.py:74, 111
- workspace/heartbeat.py:378
- workspace/main.py:464, 563
- workspace/builtin_tools/a2a_tools.py:60
- workspace/builtin_tools/delegation.py:272
Native-type usage at workspace/a2a_executor.py:471 (`Role.ROLE_AGENT`)
stays — that's an in-process Message construction; the SDK handles
wire serialisation correctly.
Updated the misleading comment at main.py:255-257 (which said
"outbound payloads are now 1.x-shaped (ROLE_USER)") to spell out
the actual rule: outbound JSON-RPC wire payloads MUST use v0.3
shape, native types are only for in-process construction.
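A minimal contrast of the two shapes under that rule; the surrounding
JSON-RPC fields are illustrative, only the role values matter:
    from a2a.types import Role
    # Hand-built JSON-RPC wire params: v0.3 compat shape, lowercase role.
    wire_params = {
        "message": {
            "role": "user",               # NOT "ROLE_USER": the inbound v0.3
            "parts": [{"text": "ping"}],  # compat adapter validates this enum
        }
    }
    # In-process construction keeps the native 1.x enum; the SDK serialises it.
    in_process_role = Role.ROLE_AGENT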
New regression test test_jsonrpc_wire_role_format.py greps the 6
wire-payload-emitting files for any "ROLE_USER" / "ROLE_AGENT"
string literal and fails loud — cheapest possible drift detector.
Why E2E missed it: the priority-runtimes harness sends a single
message canvas → workspace, but the canvas already used lowercase
"user" (it never went through the migration sweep). The bug only
surfaces on workspace → workspace delegation, which the harness
doesn't exercise. Same gap as #131 (extend smoke to call main()
against a stub).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same root cause as the workspace/molecule_ai_status.py docstring fix
in this PR: this doc claimed `molecule-monorepo-status` was a usable
shell alias and `from molecule_ai_status import set_status` was a
usable Python import. Both worked under the pre-#87 monolithic-template
layout (where workspace/Dockerfile created the symlink and COPY'd the
modules into /app/) but neither works in current standalone template
images that install the runtime as a wheel:
- `which molecule-monorepo-status` errors — only `a2a-db` and
`molecule-runtime` are registered console scripts.
- `from molecule_ai_status` raises ImportError — modules are under the
`molecule_runtime` package now.
Switched both examples to the canonical `python3 -m
molecule_runtime.molecule_ai_status` form (CLI) and `from
molecule_runtime.molecule_ai_status import set_status` (Python). Same
form the runtime ships in its own usage banner, so anyone discovering
this doc gets a runnable example.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Third-pass review caught a fourth WS path I missed. The original fix +
the stale-callback follow-up patched 3 sites that release the in-flight
guards (pendingAgentMsgs effect, HTTP .then() success, HTTP .catch()
success), but the ACTIVITY_LOGGED handler at lines 410-419 also clears
`sending` + `sendingFromAPIRef` when the platform logs the workspace's
a2a_receive ok/error. It only cleared 2 of the 3 refs — same exact
bug class as the original. If THIS path wins the race (a2a_receive
activity logged before pendingAgentMsgs delivers the reply text),
sendInFlightRef stays stuck true and the next sendMessage() silently
no-ops at line 464.
Fix: route both branches (ok and error) through releaseSendGuards()
so all four sites are now uniform.
Updated the helper's docstring to explicitly list all four sites and
warn that any future "I saw the reply" path that only clears the
natural pair (sending + sendingFromAPIRef) will silently re-introduce
the freeze. The disabled-button logic can't see sendInFlightRef so
the visible state diverges from the synchronous re-entry guard
otherwise.
This is exactly the drift `releaseSendGuards()` was supposed to
prevent — the helper landed in the prior commit but the activity-log
site wasn't migrated to use it. Fixing now closes the gap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-existing test_set_status_exception_prints_to_stderr asserted on the
legacy "molecule-monorepo-status: failed to update" prefix string. The
prior commit renamed it to "molecule_ai_status: failed to update" so
the printed label matches the canonical module-form invocation
(`python3 -m molecule_runtime.molecule_ai_status`) instead of a shell
alias that only ever existed in the dev-only base image. Updating the
expected substring in lockstep.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review on PR #2185 surfaced a latent race the original fix exposed:
the WS-clears-guards path now releases sendInFlightRef immediately, which
means a user can fire msg #2 between WS-arrival and HTTP-arrival for
msg #1. Without coordination, msg #1's late .then() sees
sendingFromAPIRef=true (set by msg #2's send), enters the main body,
and runs setSending(false) + appendMessageDeduped against msg #1's
response body — clobbering msg #2's in-flight UI state.
This race is realistic for claude-code SDK: the comment at line 294-298
already calls WS the "authoritative reply arrived" signal, and the user
typically reads-then-types before the trailing HTTP completes. Without
the original Send-button freeze "protecting" the race, it surfaces.
Two changes:
1. Token-keyed callbacks. sendTokenRef bumps on every sendMessage
entry; .then()/.catch() capture the token in closure and bail
without touching any flags if a newer send has superseded them.
The newer send owns the in-flight guards.
2. releaseSendGuards() helper. The guard-clearing trio
(setSending, sendingFromAPIRef, sendInFlightRef) now lives in one
useCallback so the WS handler, .then() success, and .catch()
success can't drift apart. A future contributor dropping one of
the three would silently re-introduce either the post-WS Send
freeze or the stale-callback clobber.
Skipped a unit test for this regression — ChatTab has no __tests__
file and a mount test would need WS + zustand + api mocks. The fix
is 4 logical lines (token capture + 2 guard checks) and the manual
test covers it. Follow-up to add a focused mount test when ChatTab
gets its first __tests__ file.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comprehensive sweep follow-up to the MCP server path fix. Audited every
/app/ reference in the runtime source against the live claude-code
template image and confirmed the actual /app/ contents post-#87 are
ONLY: __init__.py, adapter.py, claude_sdk_executor.py, requirements.txt
— every other workspace module ships in the wheel under
site-packages/molecule_runtime/. Two more leaks found:
1. executor_helpers.py:_A2A_INSTRUCTIONS_CLI — inter-agent system prompt
for non-MCP runtimes (Ollama, custom) had 5 lines telling the model
`python3 /app/a2a_cli.py X`. Models copy these examples verbatim, so
every CLI-runtime delegation would fail at the shell layer (no such
file). Replaced with `python3 -m molecule_runtime.a2a_cli` form,
which works regardless of where the wheel is installed.
2. molecule_ai_status.py docstring — usage examples invoked
`python3 /app/molecule_ai_status.py` and claimed a
`molecule-monorepo-status` shell alias. Both broken in current
templates: the file's at site-packages, and `which
molecule-monorepo-status` errors (the legacy symlink only existed
in the dev-only workspace/Dockerfile base image, not in the
standalone template Dockerfiles that ship to production).
Updated docstring + the __main__ usage banner + the stderr error
prefix to use the same `python3 -m molecule_runtime.X` form.
Plugins audited and clean: WORKSPACE_PLUGINS_DIR=/configs/plugins,
SHARED_PLUGINS_DIR=$PLUGINS_DIR fallback /plugins. No /app/
assumptions.
Regression test: `test_a2a_cli_instructions_use_module_invocation_not_legacy_app_path`
asserts the legacy /app/a2a_cli.py path can't drift back into the CLI
system prompt and that the canonical module form is present.
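A sketch of that drift detector; the module path and the prompt
constant's access pattern are assumptions about where the CLI
instructions live:
    from molecule_runtime import executor_helpers
    def test_a2a_cli_instructions_use_module_invocation_not_legacy_app_path():
        prompt = executor_helpers._A2A_INSTRUCTIONS_CLI
        assert "/app/a2a_cli.py" not in prompt
        assert "python3 -m molecule_runtime.a2a_cli" in prompt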
The legacy workspace/Dockerfile + workspace/entrypoint.sh + workspace/scripts/
still contain /app/-shaped paths but are dev-only base-image scaffolding
(per workspace/build-all.sh's own header comment) — not shipped to the
standalone template images. Out of scope here; can be cleaned up in a
separate dead-code pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DEFAULT_MCP_SERVER_PATH was hardcoded to /app/a2a_mcp_server.py, which
was correct under the pre-#87 monolithic-template Docker layout where
the workspace/ tree was COPY'd into /app/. After the universal-runtime
refactor (#87, #117), workspace modules ship inside the
molecule-ai-workspace-runtime wheel under
site-packages/molecule_runtime/, while /app/ now holds only
template-specific files (adapter.py + the runtime-native executor for
that template).
Net effect: in every workspace built since the wheel cutover, Claude
Code SDK's mcp_servers={"a2a": {"command": python, "args":
["/app/a2a_mcp_server.py"]}} pointed at a missing file. The subprocess
launch failed silently, the SDK registered zero MCP tools, and the
agent's list_peers / delegate_task / a2a_send_message / a2a_send_signal
all disappeared. Symptom observed today: Design Director said
"I tried to reach the perf auditor via the inter-agent MCP tools
(list_peers, delegate_task) but those tools didn't resolve in this
environment" and fell back to running the audit itself with WebFetch.
Why this slipped through E2E: the priority-runtimes harness sends a
single message and verifies a reply — it does not exercise inter-agent
delegation, so the missing MCP tools are invisible at that layer.
Fix: resolve the path relative to executor_helpers.py via __file__,
which tracks wherever the wheel is installed (site-packages today,
anywhere else tomorrow). The A2A_MCP_SERVER_PATH env override is
preserved for tests / non-default layouts.
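A sketch of the resolution rule (the env var and file names are as
described above; the exact expression is illustrative, not the real
line):
    import os
    from pathlib import Path
    DEFAULT_MCP_SERVER_PATH = os.environ.get(
        "A2A_MCP_SERVER_PATH",  # test / non-default-layout override
        str(Path(__file__).resolve().parent / "a2a_mcp_server.py"),
    )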
Regression test: assert os.path.exists(DEFAULT_MCP_SERVER_PATH) so
any future move of a2a_mcp_server.py out of the package directory
fails at unit-test time instead of silently disabling delegation in
production.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Send button + Enter both silently no-op'd after the first agent reply
on runtimes that deliver via WebSocket (claude-code SDK does this per
the comment at ChatTab.tsx:294-298). The visible disabled-state checks
(sending, uploading, agentReachable) were all clean — the freeze came
from a third synchronous reentry guard the button can't see:
if (sendInFlightRef.current) return; // ChatTab.tsx:438
The ref was set true at the start of sendMessage() and only cleared in
.then() / .catch() of the HTTP fall-through and the upload-failure
branch. The WS-push handler in the pendingAgentMsgs effect cleared
`sending` and `sendingFromAPIRef` but left `sendInFlightRef` stuck
true. The HTTP .then() then early-returned at the dedup check (line
513) without touching the ref — only the .catch() early-return path
did. Net result: refresh fixed it because the ref reset on remount.
Two-line fix:
- WS handler: also clear sendInFlightRef when the push delivers the
reply (primary fix; no race window where the ref is stuck while
the user can already type)
- .then() early-return: mirror .catch()'s cleanup as defense in
depth, so neither delivery order leaks the ref
While here: A2AEdge.test.tsx fixture was typed `as never` to dodge
EdgeProps' discriminated-union complaint, which broke spreading at
the call sites with TS2698 ("Spread types may only be created from
object types"). Replaced with `as unknown as ComponentProps<typeof
A2AEdge>` — preserves the original "skip restating every optional
field" intent and keeps a spreadable type.
All 10 A2AEdge tests pass; tsc --noEmit is clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Audited every a2a-sdk surface in workspace/ against the installed
1.0.2 wheel. Found and fixed:
main.py (the live workspace startup path):
• create_jsonrpc_routes(rpc_url='/', enable_v0_3_compat=True) —
rpc_url required in 1.x; v0.3 compat enables inbound legacy
clients (`"role": "user"` lowercase) without forcing them to
upgrade. Pairs with the outbound rename below.
a2a_executor.py:
• TextPart/FilePart/FileWithUri removed in 1.x. Part is now a
flat proto message: Part(text=…) / Part(url=…, filename=…,
media_type=…). Updated the file-attachment branch (only
reachable when an agent emits files; the harness's PONG path
didn't exercise this, but it's a latent crash).
• Message field names: messageId/taskId/contextId →
message_id/task_id/context_id (proto3 snake_case).
• Role enum: Role.agent → Role.ROLE_AGENT (proto enum).
Outbound JSON-RPC payloads (8 files):
• "role": "user" → "role": "ROLE_USER" — proto3 JSON serialization
is strict about enum values. Sites: a2a_client, a2a_cli, main
(initial+idle prompts), heartbeat, builtin_tools/a2a_tools,
builtin_tools/delegation. Wire JSON keys stay camelCase
(proto3 default), only the role enum value changed.
google-adk/adapter.py:
• new_agent_text_message → new_text_message (4 sites). This
adapter's directory has a hyphen, so it can't be imported as a
Python module — effectively dead code, but the wheel ships the
file and a future fix should keep it correct against 1.x.
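A sketch of the 1.x construction shape the a2a_executor.py items above
describe; the field and enum names are from this audit, while the
exact required-field set and the Message/Part import are assumptions:
    from a2a.types import Message, Part, Role
    reply = Message(
        message_id="m-1",                 # was messageId in 0.x
        context_id="ctx-1",
        task_id="task-1",
        role=Role.ROLE_AGENT,             # was Role.agent
        parts=[
            Part(text="done"),            # was TextPart(text=...)
            Part(url="file:///tmp/report.html",
                 filename="report.html",
                 media_type="text/html"), # was FilePart / FileWithUri
        ],
    )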
Why one PR instead of seven: every previous a2a-sdk migration find
landed as its own publish → cascade → harness → next-bug cycle.
Today's audit ran every a2a-sdk symbol/type/method in workspace/
against the installed 1.0.2 wheel in a single sweep + tested the
critical paths (Message construction, Part construction, Role enum
parsing) against the actual SDK. Should be the last migration PR.
Verified locally:
python3 scripts/build_runtime_package.py --version 0.1.99 \
--out /tmp/build-final
pip install /tmp/build-final
python -c "import molecule_runtime.main; \
from molecule_runtime.a2a_executor import LangGraphA2AExecutor"
→ ✓ all imports clean against a2a-sdk 1.0.2
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7th a2a-sdk migration find from the v0 → v1 transition.
create_jsonrpc_routes() now requires rpc_url as a positional arg
(was implicit at root in 0.x). Pass '/' to match
a2a.utils.constants.DEFAULT_RPC_URL — that's also what
workspace-server's a2a_proxy.go forwards to (POSTs to workspace URL
without appending a path).
Symptom before fix: every workspace startup crashed with
TypeError: create_jsonrpc_routes() missing 1 required positional
argument: 'rpc_url'
Caught by harness 9 phase 4 (claude-code + langgraph both on
0.1.24). The user's "use langgraph for fast iteration" call cut
the diagnose cycle from 15min to ~30s — without that, this would
have taken another hermes round-trip to surface.
Updated reference_a2a_sdk_v0_to_v1_migration.md memory with this
entry alongside the previous 6 finds.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
a2a-sdk 1.x added agent_card as a required argument to
DefaultRequestHandler.__init__. main.py constructed it with only
agent_executor + task_store, so every workspace startup that reached
the handler init step crashed with:
TypeError: DefaultRequestHandlerV2.__init__() missing 1 required
positional argument: 'agent_card'
This is the 6th a2a-sdk migration find from the v0 → v1 transition
(see reference_a2a_sdk_v0_to_v1_migration memory). Pattern is the
same: SDK exposes a new required arg, our call site needs to pass
the existing object we already construct upstream.
Why the import-only smoke gates didn't catch this: it's a call-time
constructor error inside `async def main()`, not a module load
error. The runtime-pin-compat smoke imports main_sync but doesn't
invoke main() against a real config. Worth filing a follow-up to
extend the smoke to a "construct + dispose" cycle.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CRITICAL: every workspace boot since the a2a-sdk 1.0 migration (#1974)
has been crashing at AgentCard construction with:
ValueError: Protocol message AgentCard has no "supported_protocols" field
The protobuf field is `supported_interfaces` (plural, interfaces — see
a2a-sdk types/a2a_pb2.pyi:189). The 0.3→1.0 migration left the kwarg
as `supported_protocols`, which doesn't exist in the 1.0 schema, so
the constructor raises before any subsequent line of main runs.
Why this hid for so long:
- publish-runtime.yml's smoke step only IMPORTED molecule_runtime.main;
importing the module is fine, only CONSTRUCTING the AgentCard fails
- The user-visible symptom is "Workspace failed: " with empty
last_sample_error, indistinguishable from generic boot timeouts
- The state_transition_history=True bug (fixed in #2179) was a
sibling of this — same migration, same class, just caught first
Fix is symmetric with #2179:
1. workspace/main.py: rename the kwarg + comment explaining why
2. .github/workflows/publish-runtime.yml: extend the smoke block to
instantiate AgentCard with the exact production call shape, so
the next field-rename of this class fails at publish time
instead of breaking every workspace startup
Verification:
- Constructed AgentCard against fresh a2a-sdk 1.0.2 in a clean
venv with the corrected kwarg → succeeds
- Constructed it with the original `supported_protocols` kwarg →
fails immediately with the exact error production sees
- Smoke test pinned to mirror main.py's exact call shape; main.py
+ smoke must stay in lockstep going forward
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two structural fixes for the cascade race conditions that bit us
five times today:
1. **PyPI propagation wait** (cascade job): poll PyPI for the
just-published version with a 60s budget BEFORE firing
repository_dispatch. PyPI accepts the upload but takes a few
seconds to make it available via the package index. Cascade was
firing too fast — downstream template builds ran `pip install`
against a stale index, resolved to the previous version, and
docker layer cache locked that in for subsequent rebuilds.
Pairs with the build-arg cache invalidation in molecule-ci PR
(separate change). Wait without invalidation = next build still
pip-resolves correctly. Invalidation without wait = first cascade
build may still race PyPI propagation. Together: no race, no
stale cache.
2. **Path filter expansion**: scripts/build_runtime_package.py is
the build script and changes to it (e.g. import-rewrite fixes,
manifest emit, lib/ subpackage move) directly affect what ships
in the wheel. Was missing from the path filter, so PRs touching
only scripts/ (like #2174's lib/ fix) didn't auto-publish — the
operator had to remember a manual dispatch. Add it to the closed
list of files that trigger auto-publish.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the molecule-core-side ask of controlplane #285. CP #289 already
landed migration 022 + the handler change exposing \`last_error\` in
/cp/admin/orgs responses. This makes the canary harness actually USE
that field — pre-fix the harness exited with just "Tenant provisioning
failed for <slug>" and forced operators to scrape CP server logs to
learn WHY.
The diagnostic burst dumps the matched org row from the LIST_JSON
already in scope (no extra HTTP call), pretty-printed and prefixed,
right before \`fail\`. Mirrors the TLS-readiness burst pattern from
PR #2107 at step 4. Includes a not-found fallback for DB-drift cases.
No redaction needed — adminOrgSummary is already ops-safe (id, slug,
name, plan, member_count, instance_status, last_error, timestamps;
no tokens, no encrypted fields).
Verification: smoke-tested both branches (org found with last_error +
slug-not-found fallback) with synthetic JSON; bash syntax OK; the only
shellcheck warning is pre-existing on line 93.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a comment block citing a2a-sdk's own
a2a/compat/v0_3/conversions.py, which says verbatim:
state_transition_history=None, # No longer supported in v1.0
So a future reader who notices the missing kwarg won't try to add it
back. The capability is now universal: every v1.x Task carries a
history list and tasks/get supports historyLength via the
apply_history_length helper. No flag because nothing's optional.
Confirmed by reading the SDK source directly:
- a2a/types.py AgentCapabilities exposes only: streaming,
push_notifications, extensions, extended_agent_card.
- a2a/compat/v0_3/conversions.py explicitly maps None when
down-converting v1 → v0.3 (deliberate removal, not rename).
- a2a/server/request_handlers/default_request_handler_v2.py uses
apply_history_length(task, params) — agent doesn't opt in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
a2a-sdk 1.x's AgentCapabilities only exposes 4 fields:
streaming, push_notifications, extensions, extended_agent_card.
The state_transition_history field was removed in the v1 protobuf
schema. main.py still passed it as a kwarg, so every workspace
that reached the AgentCard construction step (line 188) crashed:
ValueError: Protocol message AgentCapabilities has no
"state_transition_history" field
Symptom: every claude-code + hermes workspace stuck in `provisioning`
forever — caught when the user provisioned a Design Director crew
manually via the canvas while harness 5 was running.
Why every prior smoke gate missed it:
- runtime-pin-compat.yml smokes `from molecule_runtime.main import
main_sync` — only imports the module. AgentCapabilities() runs
inside `async def main()`, not at module load.
- Template image boot smoke does `import every /app/*.py` — same
story. main.py imports fine; the field error only fires at call.
The fix is one line — drop the kwarg. Fields we actually need
(streaming + push_notifications) are still passed.
Follow-up worth filing: smoke step that instantiates Adapter() +
calls a no-op setup() against a stub config. That would have
caught this before publish.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>