Closes #132. Extends the cascade propagation probe (added in #2197
and clarified in #2198) with a content-integrity check.
The previous probe verified pip can RESOLVE the version we just
published (catches surface 1+2 propagation lag — metadata + simple
index). It did NOT verify pip can DOWNLOAD bytes that match what we
uploaded — leaving a window where a Fastly stale-content scenario
(rare but PyPI has had it: e.g. 2026-04-01 incident where a CDN node
served a previous version's wheel under the new version's URL for
~90s after upload) would pass the probe and ship corrupt builds to
all 8 receiver templates.
Two-stage check, both must pass before the cascade fans out:
(a) `pip install --no-cache-dir PACKAGE==VERSION` succeeds —
version is resolvable. (Existing, unchanged.)
(b) `pip download` of the same wheel + `sha256sum` matches the
hash captured pre-upload from `dist/*.whl`. (New.)
The expected hash is captured BEFORE upload via a new `wheel_hash`
step that exposes `steps.wheel_hash.outputs.wheel_sha256`, bubbled up
as `needs.publish.outputs.wheel_sha256`, and consumed by the cascade
probe via the EXPECTED_SHA256 env var.
`pip download` is the right primitive: it writes the actual .whl
file (vs `pip install` which unpacks and discards), so we can
sha256sum it directly. Combined with --no-cache-dir + a wiped
/tmp/probe-dl per poll, every poll re-fetches from the live Fastly
edge — no local-cache mask.
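In plain Python terms, stage (b) amounts to the sketch below; the
package name, download directory, and helper name are placeholders,
not the actual workflow step:
    import hashlib
    import pathlib
    import shutil
    import subprocess
    def wheel_bytes_match(package: str, version: str, expected_sha256: str) -> bool:
        dl_dir = pathlib.Path("/tmp/probe-dl")
        shutil.rmtree(dl_dir, ignore_errors=True)  # wiped per poll: no cache mask
        dl_dir.mkdir(parents=True)
        subprocess.run(
            ["pip", "download", "--no-cache-dir", "--no-deps",
             f"{package}=={version}", "-d", str(dl_dir)],
            check=True,
        )
        wheel = next(dl_dir.glob("*.whl"))         # the just-downloaded wheel
        return hashlib.sha256(wheel.read_bytes()).hexdigest() == expected_sha256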
Per-poll cost: ~3-5s pip install + ~3s pip download + 4s sleep.
30-poll budget = ~5-6 min wall on a slow runner (vs the previous
~4-5 min for resolve-only). Well within the cascade's tolerance for
a known-rare CDN issue, and the overwhelmingly common case (Fastly
serves matching bytes immediately) exits on the first poll.
Verified locally: pip download of the current PyPI-latest
(molecule-ai-workspace-runtime 0.1.29) produced
sha256=7e782b2d50812257…, exactly matching PyPI's own metadata
endpoint. The mismatch path is exercised inline (different builds
of the same version produce different hashes by definition — the
build_runtime_package.py output is timestamp-deterministic only
within a single CI invocation).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cascade's PyPI-propagation gate polled `/pypi/<pkg>/<ver>/json`,
which is one of THREE surfaces pip touches when resolving an install:
1. /pypi/<pkg>/<ver>/json — metadata endpoint (the old check)
2. /simple/<pkg>/ — pip's primary download index
3. files.pythonhosted.org — CDN-fronted wheel binary
Each has its own cache. Any one of them can lag behind the others,
and the previous gate would let the cascade fire while (2) or (3)
still served the previous version. Downstream `pip install` in the
template repos then resolved to the OLD wheel, the docker layer
cache locked that stale resolution in, and subsequent rebuilds kept
shipping the old runtime — the "five times in one night" cache trap
referenced in the prior comment.
Replace the metadata-only poll with an actual `pip install
--no-cache-dir --force-reinstall --no-deps PACKAGE==VERSION` from
a fresh venv. If pip can resolve and install the exact version we
just published, every receiver template will too — pip itself is
the ground truth for what the receivers will see, no proxy guessing
about which surface is lagging.
- Venv created once outside the loop; only `pip install` runs in
the poll body.
- --no-cache-dir + --force-reinstall ensures every poll hits the
live PyPI surfaces (no local-cache mask).
- --no-deps keeps each poll fast — we only care about resolving
THIS package, not its dep tree.
- Loop budget: 30 attempts × 4s ≈ 2 min (vs prior 30 × 2s = 60s).
  Generous compared with typical PyPI propagation; anything still
  failing past the budget surfaces as a real upstream issue.
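Expressed as standalone Python rather than the actual workflow step,
the poll body is roughly the following sketch (the venv's pip path is
created once, outside the loop; names are illustrative):
    import subprocess
    import time
    def wait_until_resolvable(pip: str, package: str, version: str,
                              attempts: int = 30, sleep_s: int = 4) -> bool:
        for _ in range(attempts):
            result = subprocess.run(
                [pip, "install", "--no-cache-dir", "--force-reinstall",
                 "--no-deps", f"{package}=={version}"],
            )
            if result.returncode == 0:
                return True   # pip resolved + installed the exact version
            time.sleep(sleep_s)
        return False          # budget exhausted: a real upstream issue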
Verified locally:
- Probing a non-existent version (0.1.999999) → pip exits 1, loop
retries.
- Probing the current PyPI-latest → pip exits 0, `pip show`
returns the version, loop succeeds.
Closes #130.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #128's chicken-and-egg problem. The original gate installed the
CURRENTLY-PUBLISHED molecule-ai-workspace-runtime from PyPI, then
overlaid workspace/requirements.txt, then smoke-imported. That
catches problems with the already-shipped artifact (the daily-cron
upstream-yank case), but it cannot catch problems introduced by the
PR itself: the imports it exercises are from the OLD wheel, not the
PR's source. A PR that adds `from a2a.utils.foo import bar` (where
`bar` is added in a2a-sdk 1.5 and the runtime currently pins 1.3)
slips through:
1. Pip resolves the existing PyPI wheel + a2a-sdk 1.3.
2. Smoke imports the OLD main.py — no reference to `bar` → green.
3. Merge → publish-runtime.yml ships a wheel WITH the new import.
4. Tenant images redeploy → all crash on first boot with
ImportError: cannot import name 'bar' from 'a2a.utils.foo'.
Splits the workflow into two jobs:
- pypi-latest-install (renamed from default-install): unchanged
behavior. Runs on the daily cron and on requirements.txt /
workflow edits. Catches upstream PyPI yanks + the
already-shipped artifact going stale.
- local-build-install (new): runs scripts/build_runtime_package.py
on the PR's workspace/, builds the wheel with python -m build
(mirroring publish-runtime.yml byte-for-byte), installs that
wheel, then runs the same smoke import. Tests the artifact
that WOULD be published if this PR merges.
Path filter widened to workspace/** so any runtime-source change
triggers the local-build job. The pypi-latest job's filter is the
same union; its internal logic is unchanged so the daily-cron and
upstream-detection use cases continue to work.
Verified locally: built the wheel from the current workspace/ source
via the same script + python -m build invocation, installed it into a
fresh venv, and ran `from molecule_runtime.main import main_sync`
successfully.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing wheel-smoke catches AgentCard kwarg-shape regressions
(state_transition_history, supported_protocols) but doesn't catch the
SDK-contract drift class that #2193 just fixed in production: the
a2a-sdk 1.x rename of /.well-known/agent.json →
/.well-known/agent-card.json, plus AGENT_CARD_WELL_KNOWN_PATH moving
to a2a.utils.constants. main.py's readiness probe hardcoded the old
literal and 404'd every attempt, silently dropping every workspace's
initial_prompt for ~weeks before a user reported it.
Two additions to the smoke block:
1. Mount alignment: build an AgentCard, call create_agent_card_routes(),
and assert AGENT_CARD_WELL_KNOWN_PATH is among the mounted paths.
Catches a future SDK release that decouples the constant value
from the route factory's mount path. The source-tree test
(workspace/tests/test_agent_card_well_known_path.py) catches the
main.py side; this catches the SDK side BEFORE PyPI upload.
2. Message helper smoke: import a2a.helpers.new_text_message and
instantiate one. The v0→v1 cheat sheet (memory:
reference_a2a_sdk_v0_to_v1_migration.md) flagged this as a real
migration find — main.py and a2a_executor.py call it in hot
paths, so an import break errors every reply before the message
even leaves the workspace.
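Roughly what the two checks amount to, as a hedged sketch: the
constant and the helper's import path come from this commit, while the
route factory's import location, its return shape (objects exposing a
.path attribute), and new_text_message's argument are assumptions:
    from a2a.utils.constants import AGENT_CARD_WELL_KNOWN_PATH
    from a2a.helpers import new_text_message
    def assert_mount_alignment(card, create_agent_card_routes) -> None:
        # `card` is the AgentCard the smoke block already builds; the route
        # factory is passed in because its import location isn't stated here.
        routes = create_agent_card_routes(card)
        mounted = {getattr(route, "path", None) for route in routes}
        assert AGENT_CARD_WELL_KNOWN_PATH in mounted, mounted
    def assert_message_helper() -> None:
        msg = new_text_message("smoke")  # argument shape assumed
        assert msg is not None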
Verified by running the equivalent Python inside
ghcr.io/molecule-ai/workspace-template-langgraph:latest:
✓ well-known mount alignment OK (/.well-known/agent-card.json)
✓ message helper import + call OK
Closes the structural-fix half of the #2193 finding from the code-
review-and-quality pass: "the wheel publish smoke didn't catch this.
This is the 7th a2a-sdk migration find of this kind. Task #131 is the
right root-cause fix."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Independent code review caught a real bug in the previous commit's
stale-token revoke pass. The platform's restart endpoint
(workspace_restart.go:104) stops the workspace container synchronously,
then dispatches re-provisioning to a goroutine (line 173). For a
workspace that's been idle past the 5-minute grace window — extremely
common: user comes back to a long-idle workspace and clicks Restart —
this opens a race window:
1. Container stopped → ListWorkspaceContainerIDPrefixes returns no
entry → workspace becomes a stale-token candidate.
2. issueAndInjectToken runs in the goroutine: revokes old tokens,
issues a fresh one, writes it to /configs/.auth_token.
3. If the sweeper's predicate-only UPDATE
   `WHERE workspace_id = $1 AND revoked_at IS NULL` runs AFTER
   IssueToken commits, but still inside the sweeper's own
   SELECT-then-UPDATE window, it revokes the freshly issued token
   alongside the old ones.
4. Container starts with a now-revoked token → 401 forever.
The fix carries the SAME staleness predicate from the SELECT into the
per-workspace UPDATE: a token created within the grace window can't
match `< now() - grace` and is automatically excluded. The operation
is now idempotent against fresh inserts.
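The sweeper is Go; as plain SQL, the corrected per-workspace UPDATE
has roughly this shape (table and column names follow this PR chain,
the revocation assignment and parameter positions are assumptions):
    STALE_TOKEN_REVOKE_SQL = """
    UPDATE workspace_auth_tokens
       SET revoked_at = now()
     WHERE workspace_id = $1
       AND revoked_at IS NULL
       -- same staleness bound as the SELECT: a token issued or used inside
       -- the grace window can never match, so a fresh token survives
       AND COALESCE(last_used_at, created_at) < now() - make_interval(secs => $2)
    """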
Also addresses other findings from the same review:
- Add `status NOT IN ('removed', 'provisioning')` to the SELECT
(R2 + first-line C1 defence). 'provisioning' is set synchronously
in workspace_restart.go before the async re-provision begins, so
it's a reliable in-flight signal that narrows the candidate set.
- Stop calling wsauth.RevokeAllForWorkspace from the sweeper —
that helper revokes EVERY live token unconditionally; the sweeper
needs "every STALE live token" which is a different (safer)
operation. Inline the UPDATE so we own the predicate end-to-end.
Drop the wsauth import (no longer needed in this package).
- Tighten expectStaleTokenSweepNoOp regex to anchor at start and
require the status filter, so a future query whose first line
coincidentally starts with "SELECT DISTINCT t.workspace_id" can't
silently absorb the helper's expectation (R3).
- Defensive `if reaper == nil { return }` at top of
sweepStaleTokensWithoutContainer — even though StartOrphanSweeper
already short-circuits on nil, a future refactor that wires this
pass directly without checking would otherwise mass-revoke in
CP/SaaS mode (F2).
- Comment in the function explaining why an empty `likes` list is
  intentionally NOT a short-circuit (asymmetry with the first two
  passes is the whole point — "no containers running" is the
  load-bearing case).
- Add TestSweepOnce_StaleTokenRevokeUsesStalenessPredicate that
asserts the UPDATE shape (predicate present, grace bound). A
real-Postgres integration test would prove the race resolution
end-to-end; this catches the regression where someone simplifies
the UPDATE back to predicate-only.
- Add TestSweepStaleTokens_NilReaperEarlyExit pinning the F2 guard.
Existing tests updated to match the new query/UPDATE shape with tight
regexes that pin all the safety guards (status filter, staleness
predicate in both SELECT and UPDATE).
Full Go suite green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Heals the user-reported "auth token conflict after volume wipe" failure
mode. When an operator nukes a workspace's /configs volume outside the
platform's restart endpoint (common via `docker compose down -v` or
manual cleanup scripts), the DB still holds live workspace_auth_tokens
for that workspace while the recreated container has an empty
/configs/.auth_token. Subsequent /registry/register calls 401 forever:
requireWorkspaceToken sees live tokens, container has no token to
present, and the workspace is permanently wedged until an operator
manually revokes via SQL.
The platform's restart endpoint already handles this correctly via
wsauth.RevokeAllForWorkspace inside issueAndInjectToken. This change
adds a third orphan-sweeper pass — sweepStaleTokensWithoutContainer —
as the safety net for the equivalent action taken outside the API.
Detection criterion: workspace has at least one live (non-revoked)
token whose most-recent activity (COALESCE(last_used_at, created_at))
is older than staleTokenGrace (5 minutes), AND no live Docker
container's name prefix matches the workspace ID.
Safety filters that bound the revoke radius:
1. Only runs in single-tenant Docker mode. The orphan sweeper is
wired only when prov != nil in cmd/server/main.go — CP/SaaS mode
never gets here, so an empty container list cannot be confused
with "no Docker at all" (which would otherwise revoke every
workspace's tokens in production SaaS).
2. staleTokenGrace = 5min skips tokens issued/used in the last
5 minutes. Bounds the race with mid-provisioning (token issued
moments before docker run completes) and brief restart windows
— a healthy workspace touches last_used_at every 30s heartbeat,
so 5min is 10× the heartbeat interval.
3. The query joins workspaces.status != 'removed' so deleted
workspaces are not revoked here (handled at delete time by the
explicit RevokeAllForWorkspace call).
4. make_interval(secs => $2) avoids a time.Duration.String() →
"5m0s" mismatch with Postgres interval grammar that I caught
during implementation.
5. Each revocation logs the workspace ID so operators can correlate
"workspace just lost auth" with this sweeper, not blame a
network blip.
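As plain SQL (the sweeper is Go), the candidate query is roughly the
following; the join condition and aliases are assumptions, while the
predicates mirror the criterion and filters above:
    STALE_TOKEN_CANDIDATES_SQL = """
    SELECT DISTINCT t.workspace_id
      FROM workspace_auth_tokens t
      JOIN workspaces w ON w.id = t.workspace_id
     WHERE t.revoked_at IS NULL
       AND w.status != 'removed'
       AND COALESCE(t.last_used_at, t.created_at)
           < now() - make_interval(secs => $1)
    """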
Failure mode: revoke fails (transient DB error). Loop bails to avoid
log spam; next 60s cycle retries. Worst case a workspace stays
401-blocked an extra minute.
Tests: 5 new tests covering the headline scenario, the safety gate
(workspace with container is NOT revoked), revoke-failure-bails-loop,
query-error-non-fatal, and Docker-list-failure-skips-cycle. All 11
existing sweepOnce tests updated to register the new third-pass query
expectation via a small `expectStaleTokenSweepNoOp` helper that keeps
their existing assertions readable.
Full Go test suite green: registry, wsauth, handlers, and all other
packages.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The initial-prompt readiness probe in workspace/main.py hardcoded the
pre-1.x well-known path. After the a2a-sdk 1.x bump the SDK started
mounting the agent card at the new canonical path (the value of
`a2a.utils.constants.AGENT_CARD_WELL_KNOWN_PATH`), so the probe
returned 404 every attempt and silently fell through to "server not
ready after 30s, skipping". Net effect: every workspace silently
dropped its `initial_prompt` from config.yaml — the agent never sent
the kickoff self-message, and users hit a fresh chat with no context.
Reported by an external user as "/.well-known/agent.json 404 — the
a2a-sdk agent card route was not being mounted at the expected path".
The route IS mounted; the probe was looking in the wrong place.
Fix imports `AGENT_CARD_WELL_KNOWN_PATH` from `a2a.utils.constants`
and uses it directly in the probe URL — the SDK constant is now the
single source of truth, so any future rename travels through
automatically.
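A minimal sketch of the probe shape after the fix; the base URL,
timeout, and helper name are illustrative, not main.py's actual
values:
    import httpx
    from a2a.utils.constants import AGENT_CARD_WELL_KNOWN_PATH
    def agent_card_ready(base_url: str = "http://127.0.0.1:8080") -> bool:
        # The SDK constant is the single source of truth for the path.
        try:
            resp = httpx.get(f"{base_url}{AGENT_CARD_WELL_KNOWN_PATH}", timeout=2.0)
            return resp.status_code == 200
        except httpx.HTTPError:
            return False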
Adds two static regression tests pinning the invariant:
1. No hardcoded `/.well-known/agent.json` literal anywhere in
main.py.
2. The probe URL f-string interpolates AGENT_CARD_WELL_KNOWN_PATH
(catches a "fix" that imports the constant for show but reverts
to a literal in the actual GET).
Verified manually inside ghcr.io/molecule-ai/workspace-template-langgraph
that AGENT_CARD_WELL_KNOWN_PATH == '/.well-known/agent-card.json' and
that `create_agent_card_routes(card)` mounts at exactly that path —
constant + mount are aligned in the runtime image, so the probe will
now find the server.
Full workspace test suite: 1209 passed, 2 xfailed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Manual fresh-user clean-slate test surfaced three friction points in
the existing dev-start.sh:
1. The script ran docker compose -f docker-compose.infra.yml
directly, bypassing infra/scripts/setup.sh — so the workspace
template registry was never populated and the canvas template
palette came up empty (the "Template palette is empty"
     troubleshooting entry).
2. ADMIN_TOKEN was not handled at all. Without it, the AdminAuth
fail-open gate worked initially but slammed shut the moment the
first workspace registered a token — at which point the canvas
could no longer call /workspaces or /templates. New users hit
401s with no obvious next step.
3. The script wasn't mentioned in docs/quickstart.md. New users
followed the documented 4-step manual flow and never discovered
the single command existed.
Fixes:
- dev-start.sh now calls infra/scripts/setup.sh, which brings up
full infra (postgres + redis + langfuse + clickhouse + temporal)
AND populates the template/plugin registry from manifest.json.
- On first run, dev-start.sh writes MOLECULE_ENV=development to
.env. This activates middleware.isDevModeFailOpen() which lets
the canvas keep calling admin endpoints without a bearer (the
intended local-dev escape hatch). The .env is preserved on
re-runs and sourced before the platform launches.
- The script intentionally does NOT auto-generate an ADMIN_TOKEN.
A first attempt did, and broke the canvas because isDevModeFailOpen
requires ADMIN_TOKEN empty AND MOLECULE_ENV=development together.
Setting ADMIN_TOKEN in dev would close the hatch and the canvas
has no way to read that token in a dev build (no
NEXT_PUBLIC_ADMIN_TOKEN bake step here). The .env comment block
explicitly warns future contributors not to add it.
- Both processes' logs go to /tmp/molecule-{platform,canvas}.log
     instead of being mixed into stdout, so the readiness banner
     stays clean.
- Health-poll loops cap at 30s with a clear timeout error pointing
to the log file, instead of hanging forever.
- The readiness banner now lists the log paths AND tells the user
the next step is "open localhost:3000 → add API key in Config →
Secrets & API Keys → Global", instead of just listing service
URLs.
Quickstart doc rewrite leads with:
git clone ...
cd molecule-monorepo
./scripts/dev-start.sh
The 4-step manual flow is preserved as "Manual setup (advanced)"
for contributors who want per-component logs.
Verified end-to-end from clean Docker (no containers, no volumes,
no .env) three times: total wall-clock ~12s for a re-run with
cached npm/docker layers. Platform's HTTP 200 on /workspaces
without a bearer confirms the dev-mode auth hatch is active.
Three different intermittent failures observed during a single
manual-test session — RemoteProtocolError, ReadTimeout, ConnectError —
each surfaced as a "Failed to deliver to <peer>" error chip in the
canvas Agent Comms panel even though the next attempt would have
succeeded (verified by direct probes from the same source workspace
to the same peer). The error message even told the user "Usually a
transient network blip — retry once," but it left the retry to a
human reading the error message.
Auto-retry inside send_a2a_message itself: up to 5 attempts (1
initial + 4 retries) with exponential backoff (1s, 2s, 4s, 8s,
16s-capped), each backoff jittered ±25% to break sync across
siblings. Cumulative wall-clock capped at 600s by
_DELEGATE_TOTAL_BUDGET_S so a string of 5×300s ReadTimeouts can't
make the caller wait 25 minutes — once the deadline elapses, retries
stop even if attempts remain.
Retry only on transport-layer transients:
- ConnectError / ConnectTimeout (peer's listening socket not ready)
- RemoteProtocolError (peer closed TCP without writing — observed
when a peer's prior in-flight Claude SDK session aborted)
- ReadError / WriteError (network blip on Docker bridge)
- ReadTimeout (peer wrote no response in 300s)
Application-level errors are NOT retried — they're deterministic and
retrying just wastes wall-clock:
- HTTP 4xx (peer rejected the request format)
- JSON parse failures (peer returned garbage)
- JSON-RPC error in response body (peer's runtime errored cleanly)
- Programmer-bug exceptions (ValueError, etc.)
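A condensed sketch of the loop's shape: the constants come from this
commit, while the helper names and the send callable are stand-ins,
not send_a2a_message's real signature:
    import asyncio
    import random
    import time
    import httpx
    _DELEGATE_MAX_ATTEMPTS = 5
    _DELEGATE_TOTAL_BUDGET_S = 600
    _TRANSIENT = (httpx.ConnectError, httpx.ConnectTimeout,
                  httpx.RemoteProtocolError, httpx.ReadError,
                  httpx.WriteError, httpx.ReadTimeout)
    def _delegate_backoff_seconds(attempt: int) -> float:
        # Pure schedule: 1s, 2s, 4s, 8s, capped at 16s, jittered +/-25%.
        return min(2 ** attempt, 16) * random.uniform(0.75, 1.25)
    async def _send_with_retry(do_send):
        deadline = time.monotonic() + _DELEGATE_TOTAL_BUDGET_S
        last_exc = None
        for attempt in range(_DELEGATE_MAX_ATTEMPTS):
            try:
                return await do_send()
            except _TRANSIENT as exc:  # anything else surfaces immediately
                last_exc = exc
                if time.monotonic() >= deadline:
                    break              # budget exhausted even if attempts remain
                await asyncio.sleep(_delegate_backoff_seconds(attempt))
        raise last_exc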
8 new tests pin the contract:
- retry succeeds after 2 RemoteProtocolErrors
- retry succeeds after 1 ConnectError
- all 5 attempts fail → returns formatted last-error
- capped at exactly _DELEGATE_MAX_ATTEMPTS (regression cover for
"did someone bump the constant accidentally?")
- JSON-RPC error response NOT retried (1 attempt only)
- non-httpx exception NOT retried (programmer bugs stay loud)
- total budget caps the loop even if attempts remain
- backoff schedule grows exponentially with ±25% jitter
Refactor: extracted _format_a2a_error() so the success and exhausted
paths share one error-formatting routine. _delegate_backoff_seconds()
is a pure function so the schedule is unit-testable without monkey-
patching asyncio.sleep.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Canvas Agent Comms bubbles for outbound delegation showed only
"Delegating to <peer>" boilerplate during the live update window —
the actual task text only surfaced after a refresh re-fetched the row
from /workspaces/:id/activity. Symptom flagged today during a fresh
delegation manual test where the bubble said "Delegating to Perf
Auditor" instead of the user's "audit moleculesai.app for
performance" prompt.
Root cause: LogActivity's broadcast payload at activity.go:510-518
deliberately omitted request_body and response_body, so the canvas's
live-update path (AgentCommsPanel.tsx:271-289) saw `p.request_body =
undefined` and toCommMessage fell back to the
`Delegating to ${peerName}` template string. The DB row stored the
real task / reply, which is why GET-on-mount worked.
Fix: include both bodies in the broadcast as json.RawMessage values
(no re-marshal cost — they were already encoded for the DB insert
above). Same pattern as tool_trace, which has been included since #1814.
Each side is bounded by the workspace-side caller's own caps: the
runtime's report_activity helper caps error_detail at 4096 chars and
summary at 256; request/response are constrained by the runtime's
own limits — typical delegate_task payload is hundreds of chars to a
few KB. If a much-larger broadcast becomes a concern later, a soft
cap can be added at this site without breaking the contract.
Regression tests pin the broadcast shape:
- request_body present → canvas renders the actual task text
- response_body present → canvas renders the actual reply text
- response_body nil → omitted from payload (no empty-bubble flicker)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cascade-deleting a 7-workspace org returned 500 with
"workspace marked removed, but 2 stop call(s) failed — please retry:
stop eeb99b5d-...: force-remove ws-eeb99b5d-607: Error response
from daemon: removal of container ws-eeb99b5d-607 is already in
progress"
even though the DB-side post-condition succeeded (removed_count=7) and
the containers WERE removed shortly after. The fanout fired Stop() on
every workspace concurrently and the orphan sweeper happened to reap
two of them at the same instant, so Docker rejected the second
ContainerRemove with "removal already in progress" — a race-condition
ack, not a real failure. Retrying just races the same in-flight
removal.
The post-condition we care about (the container WILL be gone) is
identical to a successful removal, so Stop() should treat it the
same way it already treats "No such container" — a no-op return nil
that lets the caller proceed with volume cleanup. Real daemon
failures (timeout, EOF, ctx cancel) still surface as errors.
Two pieces:
- New isRemovalInProgress() predicate using the same string-match
approach as isContainerNotFound (docker/docker has no typed
errdef for this; the CLI itself relies on the message).
- Stop() now treats the predicate as success, with a log line
distinct from the not-found path so debugging can tell which
race fired.
Both substrings ("removal of container" + "already in progress") must
match — "already in progress" alone would false-positive on unrelated
operations like image pulls. Truth table pinned in 7 new test cases.
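The platform code is Go; this small Python sketch only pins the
two-substring rule (the name and case handling are illustrative):
    def is_removal_in_progress(daemon_error: str) -> bool:
        msg = daemon_error.lower()
        # Both fragments must be present; "already in progress" alone would
        # false-positive on unrelated daemon operations such as image pulls.
        return "removal of container" in msg and "already in progress" in msg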
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both AgentCommsPanel and ChatTab's activity-feed opened raw
`new WebSocket(WS_URL)` instances per mount, with no onclose handler
and no reconnect logic. When the underlying connection dropped — idle
timeout, browser background-tab throttle, network jitter — the per-
panel sockets stayed dead until the panel re-mounted (refresh or
sub-tab unmount/remount). Live agent-comms bubbles and live activity
feed lines silently went missing in the gap, manifesting as "the
delegation didn't show up until I refreshed."
The global ReconnectingSocket in store/socket.ts already owns
reconnect, exponential backoff, health-check, and HTTP fallback poll.
Routing component subscribers through it gives every consumer those
guarantees for free, with one TCP connection per tab instead of N.
Three new pieces:
- store/socket-events.ts: tiny pub/sub bus. emitSocketEvent fans out
every decoded WSMessage to the listener Set; subscribeSocketEvents
returns an unsubscribe. A throwing listener is logged and isolated
so it can't break siblings.
- store/socket.ts: ws.onmessage now calls emitSocketEvent(msg) right
after applyEvent(msg), so the store's derived state and component
subscribers stay in lockstep on every event arrival.
- hooks/useSocketEvent.ts: React hook that registers exactly once
per mount, capturing the latest handler in a ref so the closure
sees current state/props without re-subscribing on every render.
Refactored sites:
- AgentCommsPanel: replaced its WebSocket-in-useEffect block with
useSocketEvent. Same parsing logic; the panel no longer opens its
own connection.
- ChatTab activity feed: split the previous useEffect in two — one
seeds the activity log when `sending` flips, the other subscribes
unconditionally and gates work on `sending` inside the handler.
Hooks can't be conditional, so the gate has to live in the body
rather than around the effect.
The ws-close graceful-close helper is no longer needed in either
site; the global socket owns its own teardown.
Tests: 6 new tests for the bus contract (single delivery, fan-out
order, unsubscribe, throwing-listener isolation, no-subscriber emit,
duplicate-subscribe Set semantics). All 27 existing socket tests
still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Manual-test failure surfaced what was hidden behind the MCP-path bug:
once delegate_task could actually fire, every cross-workspace call
came back as JSON-RPC -32600 "Invalid Request" with the underlying
pydantic ValidationError:
params.message.role
Input should be 'agent' or 'user' [type=enum,
input_value='ROLE_USER', input_type=str]
PR #2184's a2a-sdk 1.x migration sweep over-corrected: it changed
every `"role": "user"` literal in JSON-RPC payload construction to
`"role": "ROLE_USER"` to match the protobuf enum names of the 1.x
native types (a2a.types.Role.ROLE_USER / ROLE_AGENT). That was
correct for in-process Message construction (which the SDK
serialises before wire transmission) but WRONG for the 8 sites that
hand-build JSON-RPC payloads. The workspace's own a2a-sdk runs
inbound requests through the v0.3 compat adapter
(/usr/local/lib/python3.11/site-packages/a2a/compat/v0_3/) because
main.py sets enable_v0_3_compat=True for backwards compatibility,
and that adapter validates against the v0.3 Pydantic Role enum
(`agent` | `user` lowercase). The protobuf-style names blow it up.
Reverted the 8 wire-payload sites to lowercase:
- workspace/a2a_client.py:74
- workspace/a2a_cli.py:74, 111
- workspace/heartbeat.py:378
- workspace/main.py:464, 563
- workspace/builtin_tools/a2a_tools.py:60
- workspace/builtin_tools/delegation.py:272
Native-type usage at workspace/a2a_executor.py:471 (`Role.ROLE_AGENT`)
stays — that's an in-process Message construction; the SDK handles
wire serialisation correctly.
Updated the misleading comment at main.py:255-257 (which said
"outbound payloads are now 1.x-shaped (ROLE_USER)") to spell out
the actual rule: outbound JSON-RPC wire payloads MUST use v0.3
shape, native types are only for in-process construction.
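A minimal contrast of the two shapes under that rule; the surrounding
JSON-RPC fields are illustrative, only the role values matter:
    from a2a.types import Role
    # Hand-built JSON-RPC wire params: v0.3 compat shape, lowercase role.
    wire_params = {
        "message": {
            "role": "user",               # NOT "ROLE_USER": the inbound v0.3
            "parts": [{"text": "ping"}],  # compat adapter validates this enum
        }
    }
    # In-process construction keeps the native 1.x enum; the SDK serialises it.
    in_process_role = Role.ROLE_AGENT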
New regression test test_jsonrpc_wire_role_format.py greps the 6
wire-payload-emitting files for any "ROLE_USER" / "ROLE_AGENT"
string literal and fails loud — cheapest possible drift detector.
Why E2E missed it: the priority-runtimes harness sends a single
message canvas → workspace, but the canvas already used lowercase
"user" (it never went through the migration sweep). The bug only
surfaces on workspace → workspace delegation, which the harness
doesn't exercise. Same gap as #131 (extend smoke to call main()
against a stub).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same root cause as the workspace/molecule_ai_status.py docstring fix
in this PR: this doc claimed `molecule-monorepo-status` was a usable
shell alias and `from molecule_ai_status import set_status` was a
usable Python import. Both worked under the pre-#87 monolithic-template
layout (where workspace/Dockerfile created the symlink and COPY'd the
modules into /app/) but neither works in current standalone template
images that install the runtime as a wheel:
- `which molecule-monorepo-status` errors — only `a2a-db` and
`molecule-runtime` are registered console scripts.
- `from molecule_ai_status` raises ImportError — modules are under the
`molecule_runtime` package now.
Switched both examples to the canonical `python3 -m
molecule_runtime.molecule_ai_status` form (CLI) and `from
molecule_runtime.molecule_ai_status import set_status` (Python). Same
form the runtime ships in its own usage banner, so anyone discovering
this doc gets a runnable example.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Third-pass review caught a fourth WS path I missed. The original fix +
the stale-callback follow-up patched 3 sites that release the in-flight
guards (pendingAgentMsgs effect, HTTP .then() success, HTTP .catch()
success), but the ACTIVITY_LOGGED handler at lines 410-419 also clears
`sending` + `sendingFromAPIRef` when the platform logs the workspace's
a2a_receive ok/error. It only cleared 2 of the 3 refs — same exact
bug class as the original. If THIS path wins the race (a2a_receive
activity logged before pendingAgentMsgs delivers the reply text),
sendInFlightRef stays stuck true and the next sendMessage() silently
no-ops at line 464.
Fix: route both branches (ok and error) through releaseSendGuards()
so all four sites are now uniform.
Updated the helper's docstring to explicitly list all four sites and
warn that any future "I saw the reply" path that only clears the
natural pair (sending + sendingFromAPIRef) will silently re-introduce
the freeze. The disabled-button logic can't see sendInFlightRef so
the visible state diverges from the synchronous re-entry guard
otherwise.
This is exactly the drift `releaseSendGuards()` was supposed to
prevent — the helper landed in the prior commit but the activity-log
site wasn't migrated to use it. Fixing now closes the gap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-existing test_set_status_exception_prints_to_stderr asserted on the
legacy "molecule-monorepo-status: failed to update" prefix string. The
prior commit renamed it to "molecule_ai_status: failed to update" so
the printed label matches the canonical module-form invocation
(`python3 -m molecule_runtime.molecule_ai_status`) instead of a shell
alias that only ever existed in the dev-only base image. Updating the
expected substring in lockstep.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review on PR #2185 surfaced a latent race the original fix exposed:
the WS-clears-guards path now releases sendInFlightRef immediately, which
means a user can fire msg #2 between WS-arrival and HTTP-arrival for
msg #1. Without coordination, msg #1's late .then() sees
sendingFromAPIRef=true (set by msg #2's send), enters the main body,
and runs setSending(false) + appendMessageDeduped against msg #1's
response body — clobbering msg #2's in-flight UI state.
This race is realistic for claude-code SDK: the comment at line 294-298
already calls WS the "authoritative reply arrived" signal, and the user
typically reads-then-types before the trailing HTTP completes. Without
the original Send-button freeze "protecting" the race, it surfaces.
Two changes:
1. Token-keyed callbacks. sendTokenRef bumps on every sendMessage
entry; .then()/.catch() capture the token in closure and bail
without touching any flags if a newer send has superseded them.
The newer send owns the in-flight guards.
2. releaseSendGuards() helper. The guard-clearing trio
(setSending, sendingFromAPIRef, sendInFlightRef) now lives in one
useCallback so the WS handler, .then() success, and .catch()
success can't drift apart. A future contributor dropping one of
the three would silently re-introduce either the post-WS Send
freeze or the stale-callback clobber.
Skipped a unit test for this regression — ChatTab has no __tests__
file and a mount test would need WS + zustand + api mocks. The fix
is 4 logical lines (token capture + 2 guard checks) and the manual
test covers it. Follow-up to add a focused mount test when ChatTab
gets its first __tests__ file.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comprehensive sweep follow-up to the MCP server path fix. Audited every
/app/ reference in the runtime source against the live claude-code
template image and confirmed the actual /app/ contents post-#87 are
ONLY: __init__.py, adapter.py, claude_sdk_executor.py, requirements.txt
— every other workspace module ships in the wheel under
site-packages/molecule_runtime/. Two more leaks found:
1. executor_helpers.py:_A2A_INSTRUCTIONS_CLI — inter-agent system prompt
for non-MCP runtimes (Ollama, custom) had 5 lines telling the model
`python3 /app/a2a_cli.py X`. Models copy these examples verbatim, so
every CLI-runtime delegation would fail at the shell layer (no such
file). Replaced with `python3 -m molecule_runtime.a2a_cli` form,
which works regardless of where the wheel is installed.
2. molecule_ai_status.py docstring — usage examples invoked
`python3 /app/molecule_ai_status.py` and claimed a
`molecule-monorepo-status` shell alias. Both broken in current
templates: the file's at site-packages, and `which
molecule-monorepo-status` errors (the legacy symlink only existed
in the dev-only workspace/Dockerfile base image, not in the
standalone template Dockerfiles that ship to production).
Updated docstring + the __main__ usage banner + the stderr error
prefix to use the same `python3 -m molecule_runtime.X` form.
Plugins audited and clean: WORKSPACE_PLUGINS_DIR=/configs/plugins,
SHARED_PLUGINS_DIR=$PLUGINS_DIR fallback /plugins. No /app/
assumptions.
Regression test: `test_a2a_cli_instructions_use_module_invocation_not_legacy_app_path`
asserts the legacy /app/a2a_cli.py path can't drift back into the CLI
system prompt and that the canonical module form is present.
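A sketch of that drift detector; the module path and the prompt
constant's access pattern are assumptions about where the CLI
instructions live:
    from molecule_runtime import executor_helpers
    def test_a2a_cli_instructions_use_module_invocation_not_legacy_app_path():
        prompt = executor_helpers._A2A_INSTRUCTIONS_CLI
        assert "/app/a2a_cli.py" not in prompt
        assert "python3 -m molecule_runtime.a2a_cli" in prompt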
The legacy workspace/Dockerfile + workspace/entrypoint.sh + workspace/scripts/
still contain /app/-shaped paths but are dev-only base-image scaffolding
(per workspace/build-all.sh's own header comment) — not shipped to the
standalone template images. Out of scope here; can be cleaned up in a
separate dead-code pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DEFAULT_MCP_SERVER_PATH was hardcoded to /app/a2a_mcp_server.py, which
was correct under the pre-#87 monolithic-template Docker layout where
the workspace/ tree was COPY'd into /app/. After the universal-runtime
refactor (#87, #117), workspace modules ship inside the
molecule-ai-workspace-runtime wheel under
site-packages/molecule_runtime/, while /app/ now holds only
template-specific files (adapter.py + the runtime-native executor for
that template).
Net effect: in every workspace built since the wheel cutover, Claude
Code SDK's mcp_servers={"a2a": {"command": python, "args":
["/app/a2a_mcp_server.py"]}} pointed at a missing file. The subprocess
launch failed silently, the SDK registered zero MCP tools, and the
agent's list_peers / delegate_task / a2a_send_message / a2a_send_signal
all disappeared. Symptom observed today: Design Director said
"I tried to reach the perf auditor via the inter-agent MCP tools
(list_peers, delegate_task) but those tools didn't resolve in this
environment" and fell back to running the audit itself with WebFetch.
Why this slipped through E2E: the priority-runtimes harness sends a
single message and verifies a reply — it does not exercise inter-agent
delegation, so the missing MCP tools are invisible at that layer.
Fix: resolve the path relative to executor_helpers.py via __file__,
which tracks wherever the wheel is installed (site-packages today,
anywhere else tomorrow). The A2A_MCP_SERVER_PATH env override is
preserved for tests / non-default layouts.
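A sketch of the resolution rule (the env var and file names are as
described above; the exact expression is illustrative, not the real
line):
    import os
    from pathlib import Path
    DEFAULT_MCP_SERVER_PATH = os.environ.get(
        "A2A_MCP_SERVER_PATH",  # test / non-default-layout override
        str(Path(__file__).resolve().parent / "a2a_mcp_server.py"),
    )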
Regression test: assert os.path.exists(DEFAULT_MCP_SERVER_PATH) so
any future move of a2a_mcp_server.py out of the package directory
fails at unit-test time instead of silently disabling delegation in
production.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Send button + Enter both silently no-op'd after the first agent reply
on runtimes that deliver via WebSocket (claude-code SDK does this per
the comment at ChatTab.tsx:294-298). The visible disabled-state checks
(sending, uploading, agentReachable) were all clean — the freeze came
from a third synchronous reentry guard the button can't see:
if (sendInFlightRef.current) return; // ChatTab.tsx:438
The ref was set true at the start of sendMessage() and only cleared in
.then() / .catch() of the HTTP fall-through and the upload-failure
branch. The WS-push handler in the pendingAgentMsgs effect cleared
`sending` and `sendingFromAPIRef` but left `sendInFlightRef` stuck
true. The HTTP .then() then early-returned at the dedup check (line
513) without touching the ref — only the .catch() early-return path
did. Net result: refresh fixed it because the ref reset on remount.
Two-line fix:
- WS handler: also clear sendInFlightRef when the push delivers the
reply (primary fix; no race window where the ref is stuck while
the user can already type)
- .then() early-return: mirror .catch()'s cleanup as defense in
depth, so neither delivery order leaks the ref
While here: A2AEdge.test.tsx fixture was typed `as never` to dodge
EdgeProps' discriminated-union complaint, which broke spreading at
the call sites with TS2698 ("Spread types may only be created from
object types"). Replaced with `as unknown as ComponentProps<typeof
A2AEdge>` — preserves the original "skip restating every optional
field" intent and keeps a spreadable type.
All 10 A2AEdge tests pass; tsc --noEmit is clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Audited every a2a-sdk surface in workspace/ against the installed
1.0.2 wheel. Found and fixed:
main.py (the live workspace startup path):
• create_jsonrpc_routes(rpc_url='/', enable_v0_3_compat=True) —
rpc_url required in 1.x; v0.3 compat enables inbound legacy
clients (`"role": "user"` lowercase) without forcing them to
upgrade. Pairs with the outbound rename below.
a2a_executor.py:
• TextPart/FilePart/FileWithUri removed in 1.x. Part is now a
flat proto message: Part(text=…) / Part(url=…, filename=…,
media_type=…). Updated the file-attachment branch (only
reachable when an agent emits files; the harness's PONG path
didn't exercise this, but it's a latent crash).
• Message field names: messageId/taskId/contextId →
message_id/task_id/context_id (proto3 snake_case).
• Role enum: Role.agent → Role.ROLE_AGENT (proto enum).
Outbound JSON-RPC payloads (8 files):
• "role": "user" → "role": "ROLE_USER" — proto3 JSON serialization
is strict about enum values. Sites: a2a_client, a2a_cli, main
(initial+idle prompts), heartbeat, builtin_tools/a2a_tools,
builtin_tools/delegation. Wire JSON keys stay camelCase
(proto3 default), only the role enum value changed.
google-adk/adapter.py:
• new_agent_text_message → new_text_message (4 sites). This
adapter's directory has a hyphen, so it can't be imported as a
Python module — effectively dead code, but the wheel ships the
file and a future fix should keep it correct against 1.x.
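A sketch of the 1.x construction shape the a2a_executor.py items above
describe; the field and enum names are from this audit, while the
exact required-field set and the Message/Part import are assumptions:
    from a2a.types import Message, Part, Role
    reply = Message(
        message_id="m-1",                 # was messageId in 0.x
        context_id="ctx-1",
        task_id="task-1",
        role=Role.ROLE_AGENT,             # was Role.agent
        parts=[
            Part(text="done"),            # was TextPart(text=...)
            Part(url="file:///tmp/report.html",
                 filename="report.html",
                 media_type="text/html"), # was FilePart / FileWithUri
        ],
    )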
Why one PR instead of seven: every previous a2a-sdk migration find
landed as its own publish → cascade → harness → next-bug cycle.
Today's audit ran every a2a-sdk symbol/type/method in workspace/
against the installed 1.0.2 wheel in a single sweep + tested the
critical paths (Message construction, Part construction, Role enum
parsing) against the actual SDK. Should be the last migration PR.
Verified locally:
python3 scripts/build_runtime_package.py --version 0.1.99 \
--out /tmp/build-final
pip install /tmp/build-final
python -c "import molecule_runtime.main; \
from molecule_runtime.a2a_executor import LangGraphA2AExecutor"
→ ✓ all imports clean against a2a-sdk 1.0.2
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7th a2a-sdk migration find from the v0 → v1 transition.
create_jsonrpc_routes() now requires rpc_url as a positional arg
(was implicit at root in 0.x). Pass '/' to match
a2a.utils.constants.DEFAULT_RPC_URL — that's also what
workspace-server's a2a_proxy.go forwards to (POSTs to workspace URL
without appending a path).
Symptom before fix: every workspace startup crashed with
TypeError: create_jsonrpc_routes() missing 1 required positional
argument: 'rpc_url'
Caught by harness 9 phase 4 (claude-code + langgraph both on
0.1.24). The user's "use langgraph for fast iteration" call cut
the diagnose cycle from 15min to ~30s — without that, this would
have taken another hermes round-trip to surface.
Updated reference_a2a_sdk_v0_to_v1_migration.md memory with this
entry alongside the previous 6 finds.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
a2a-sdk 1.x added agent_card as a required argument to
DefaultRequestHandler.__init__. main.py constructed it with only
agent_executor + task_store, so every workspace startup that reached
the handler init step crashed with:
TypeError: DefaultRequestHandlerV2.__init__() missing 1 required
positional argument: 'agent_card'
This is the 6th a2a-sdk migration find from the v0 → v1 transition
(see reference_a2a_sdk_v0_to_v1_migration memory). Pattern is the
same: SDK exposes a new required arg, our call site needs to pass
the existing object we already construct upstream.
Why the import-only smoke gates didn't catch this: it's a call-time
constructor error inside `async def main()`, not a module load
error. The runtime-pin-compat smoke imports main_sync but doesn't
invoke main() against a real config. Worth filing a follow-up to
extend the smoke to a "construct + dispose" cycle.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CRITICAL: every workspace boot since the a2a-sdk 1.0 migration (#1974)
has been crashing at AgentCard construction with:
ValueError: Protocol message AgentCard has no "supported_protocols" field
The protobuf field is `supported_interfaces` (plural, interfaces — see
a2a-sdk types/a2a_pb2.pyi:189). The 0.3→1.0 migration left the kwarg
as `supported_protocols`, which doesn't exist in the 1.0 schema, so
the constructor raises before any subsequent line of main runs.
Why this hid for so long:
- publish-runtime.yml's smoke step only IMPORTED molecule_runtime.main;
importing the module is fine, only CONSTRUCTING the AgentCard fails
- The user-visible symptom is "Workspace failed: " with empty
last_sample_error, indistinguishable from generic boot timeouts
- The state_transition_history=True bug (fixed in #2179) was a
sibling of this — same migration, same class, just caught first
Fix is symmetric with #2179:
1. workspace/main.py: rename the kwarg + comment explaining why
2. .github/workflows/publish-runtime.yml: extend the smoke block to
instantiate AgentCard with the exact production call shape, so
the next field-rename of this class fails at publish time
instead of breaking every workspace startup
Verification:
- Constructed AgentCard against fresh a2a-sdk 1.0.2 in a clean
venv with the corrected kwarg → succeeds
- Constructed it with the original `supported_protocols` kwarg →
fails immediately with the exact error production sees
- Smoke test pinned to mirror main.py's exact call shape; main.py
+ smoke must stay in lockstep going forward
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two structural fixes for the cascade race conditions that bit us
five times today:
1. **PyPI propagation wait** (cascade job): poll PyPI for the
just-published version with a 60s budget BEFORE firing
repository_dispatch. PyPI accepts the upload but takes a few
seconds to make it available via the package index. Cascade was
firing too fast — downstream template builds ran `pip install`
against a stale index, resolved to the previous version, and
docker layer cache locked that in for subsequent rebuilds.
Pairs with the build-arg cache invalidation in molecule-ci PR
(separate change). Wait without invalidation = next build still
pip-resolves correctly. Invalidation without wait = first cascade
build may still race PyPI propagation. Together: no race, no
stale cache.
2. **Path filter expansion**: scripts/build_runtime_package.py is
the build script and changes to it (e.g. import-rewrite fixes,
manifest emit, lib/ subpackage move) directly affect what ships
in the wheel. Was missing from the path filter, so PRs touching
only scripts/ (like #2174's lib/ fix) didn't auto-publish — the
operator had to remember a manual dispatch. Add it to the closed
list of files that trigger auto-publish.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the molecule-core-side ask of controlplane #285. CP #289 already
landed migration 022 + the handler change exposing \`last_error\` in
/cp/admin/orgs responses. This makes the canary harness actually USE
that field — pre-fix the harness exited with just "Tenant provisioning
failed for <slug>" and forced operators to scrape CP server logs to
learn WHY.
The diagnostic burst dumps the matched org row from the LIST_JSON
already in scope (no extra HTTP call), pretty-printed and prefixed,
right before \`fail\`. Mirrors the TLS-readiness burst pattern from
PR #2107 at step 4. Includes a not-found fallback for DB-drift cases.
No redaction needed — adminOrgSummary is already ops-safe (id, slug,
name, plan, member_count, instance_status, last_error, timestamps;
no tokens, no encrypted fields).
Verification: smoke-tested both branches (org found with last_error +
slug-not-found fallback) with synthetic JSON; bash syntax OK; the only
shellcheck warning is pre-existing on line 93.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a comment block citing a2a-sdk's own
a2a/compat/v0_3/conversions.py, which says verbatim:
state_transition_history=None, # No longer supported in v1.0
So a future reader who notices the missing kwarg won't try to add it
back. The capability is now universal: every v1.x Task carries a
history list and tasks/get supports historyLength via the
apply_history_length helper. No flag because nothing's optional.
Confirmed by reading the SDK source directly:
- a2a/types.py AgentCapabilities exposes only: streaming,
push_notifications, extensions, extended_agent_card.
- a2a/compat/v0_3/conversions.py explicitly maps None when
down-converting v1 → v0.3 (deliberate removal, not rename).
- a2a/server/request_handlers/default_request_handler_v2.py uses
apply_history_length(task, params) — agent doesn't opt in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
a2a-sdk 1.x's AgentCapabilities only exposes 4 fields:
streaming, push_notifications, extensions, extended_agent_card.
The state_transition_history field was removed in the v1 protobuf
schema. main.py still passed it as a kwarg, so every workspace
that reached the AgentCard construction step (line 188) crashed:
ValueError: Protocol message AgentCapabilities has no
"state_transition_history" field
Symptom: every claude-code + hermes workspace stuck in `provisioning`
forever — caught when the user provisioned a Design Director crew
manually via the canvas while harness 5 was running.
Why every prior smoke gate missed it:
- runtime-pin-compat.yml smokes `from molecule_runtime.main import
main_sync` — only imports the module. AgentCapabilities() runs
inside `async def main()`, not at module load.
- Template image boot smoke does `import every /app/*.py` — same
story. main.py imports fine; the field error only fires at call.
The fix is one line — drop the kwarg. Fields we actually need
(streaming + push_notifications) are still passed.
Follow-up worth filing: smoke step that instantiates Adapter() +
calls a no-op setup() against a stub config. That would have
caught this before publish.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>