molecule-core

Author	SHA1	Message	Date
Hongming Wang	8b9f809966	fix(a2a): SSOT response parser — handle poll-mode queued envelope (#2967 ) Introduce ``workspace/a2a_response.py`` as the single source of truth for the wire shapes the workspace-server proxy can return at ``/workspaces/<id>/a2a``: * ``Result`` — JSON-RPC success * ``Error`` — JSON-RPC error or platform-level error (with restart-in-progress metadata when present) * ``Queued`` — poll-mode short-circuit envelope: the platform queued the message into the target's inbox, the target will fetch via /activity poll * ``Malformed`` — anything the parser can't classify (logged at WARNING so a future server change is loud) ``send_a2a_message`` (in ``a2a_client.py``) now dispatches via ``a2a_response.parse(data)`` instead of inline ``"result" in data`` / ``"error" in data`` sniffing. The Queued variant returns a new ``_A2A_QUEUED_PREFIX`` sentinel so callers can distinguish "delivered async, no synchronous reply" from both success-with-text and failure. reno-stars production data caught two intermittent failures that both reduced to the same root cause: 1. File transfer announce silently failed — when CEO Ryan PC (poll-mode external molecule-mcp) sent the harmi.zip announcement to Reno Stars Business Intelligent (also poll-mode external), ``send_a2a_message`` saw the platform's poll-queued envelope ``{"status":"queued","delivery_mode":"poll","method":"..."}``, didn't recognize it as the synthetic delivery-acknowledgement it is, and returned ``[A2A_ERROR] unexpected response shape``. The agent fell back to a chunk-shipping path; receiver did get the file but operator-facing logs showed a failure that didn't actually fail. 2. Duplicated agent comm — same bug, inverted direction. d76 delegated to 67d, send_a2a_message returned the unexpected-shape error, delegate_task wrapped it as DELEGATION FAILED, the calling agent retried with sharper wording, the recipient saw the same request twice and self-reported "二次请求 — 我先不执行". External molecule-mcp standalone runtimes are inherently poll-mode (they have no public URL), so every external↔external A2A pair was hitting this on every send. The pre-fix client only handled JSON-RPC ``result``/``error`` keys and treated the queued envelope (which has neither) as malformed. RFC #2339 PR 2 added the queued envelope on the server side; the client never caught up. When ``send_a2a_message`` returns the ``_A2A_QUEUED_PREFIX`` sentinel, ``tool_delegate_task`` now transparently falls back to ``_delegate_sync_via_polling`` (RFC #2829 PR-5's durable ``/delegate`` + ``/delegations`` polling path, which DOES work for poll-mode peers because the platform's executeDelegation goroutine writes to the inbox queue and the result row arrives when the target picks it up + replies). The agent gets a real synchronous reply instead of the empty queued sentinel. * ``test_a2a_response.py`` — 62 tests, 100% line coverage on the parser (verified via ``coverage run --source=a2a_response``). Includes adversarial-input fuzzing across ~25 pathological payloads — parser must never raise. * ``test_a2a_client.py::TestSendA2AMessagePollMode`` — 4 tests for the new Queued/Error wiring in ``send_a2a_message``. * ``test_delegation_sync_via_polling.py::TestPollModeAutoFallback`` — 3 tests for the auto-fallback in ``tool_delegate_task``, including negative cases (push-mode reply must NOT trigger fallback; genuine error must NOT silently retry). * Verified all new tests FAIL on pre-fix source by stashing a2a_client.py + a2a_tools_delegation.py and re-running — 5 failures including ImportError for the missing ``_A2A_QUEUED_PREFIX``. Per the operator-debuggability directive: * INFO at every Queued classification (expected variant; operator sees normal poll-mode-peer queueing in log stream). * INFO at the auto-fallback decision in ``tool_delegate_task`` so a future operator can correlate "send returned queued → falling back to polling path" without reading the source. * WARNING at every Malformed classification (server contract drift; operator MUST see this immediately). * Existing transient-retry WARNING preserved. * Mirror Go-side typed model in workspace-server. The wire shape is documented in ``a2a_response.py``'s module docstring with file:line pointers to the canonical emitters; a future PR can introduce ``models/a2a_response.go`` without changing wire behavior. The fixture corpus in ``test_a2a_response.py`` is designed so a one-sided edit breaks CI. * ``send_message_to_user`` and ``chat_upload_receive`` use a different endpoint (``/notify``) and aren't affected by this bug; their parsing stays unchanged. * 135 tests pass across ``test_a2a_response.py`` + ``test_a2a_client.py`` + ``test_delegation_sync_via_polling.py`` + ``test_a2a_tools_impl.py``. * ``coverage run --source=a2a_response -m pytest`` reports 100% line coverage with 0 missing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 17:21:28 -07:00
Hongming Wang	146c0e7c60	fix(a2a-client): recognize poll-mode 'queued' envelope (#2967 ) workspace-server's a2a_proxy poll-mode short-circuit returns {status: "queued", delivery_mode: "poll", method: <a2a_method>} when the peer has no URL to dispatch to (poll-mode peers, including every external molecule-mcp standalone runtime). The bare send_a2a_message parser only knew about JSON-RPC {result, error} keys, so this envelope fell through to the "unexpected response shape" error path. Two production symptoms on the reno-stars tenant traced to it: 1. File transfer logged as failed when it actually succeeded — operator-facing logs showed an A2A_ERROR but the receiving workspace did get the chunked file via the agent's fallback path. 2. delegate_task retried after the false failure → peer received duplicate delegations → conversation got confused, the second peer self-diagnosed in a notify ("⚠️ Peer 二次请求 — 我先不执行"). Add a third branch to the parser, BETWEEN the existing JSON-RPC {result, error} cases and the catch-all "unexpected" fallback. The queued envelope is delivery-acknowledged-but-pending-consumption — not an error — so it returns a clean success string the agent can render as a normal outcome. The success string includes "queued" and "poll" so an operator scanning logs sees the routing path without parsing JSON. Defensive: the new branch only fires when BOTH status="queued" AND delivery_mode="poll" are present. A partial envelope (one key missing) still falls through to the catch-all, so a future server bug that emits a malformed shape gets surfaced instead of silently swallowed. Tests: - test_poll_queued_envelope_returns_success_string — pins the canonical envelope returns a non-error string. Discriminating: verified to FAIL on old code (returned [A2A_ERROR] string), PASS on new. - test_poll_queued_envelope_with_other_method — pins the parser doesn't hardcode message/send. Discriminating: also FAILS on old code. - test_status_queued_without_poll_mode_still_falls_through — pins both keys are required (defensive against future server bugs). 12 existing tests in TestSendA2AMessage still pass — no regression. Scope: hotfix for the bare send_a2a_message path. The full SSOT typed-A2AResponse refactor (#158-#163, parents under #2967) covers the broader vocabulary alignment between Go server and Python client. This PR ends the production symptoms now without preempting that work.	2026-05-05 16:58:48 -07:00
Hongming Wang	da6d319c48	perf(a2a): bound + LRU-evict _peer_metadata cache (#2482 ) Pre-fix _peer_metadata was an unbounded dict — a workspace receiving from N distinct peers across its lifetime accumulated entries indefinitely (~100 bytes × N). Not crash-class at typical scale (10K peers ≈ 1 MB) but unbounded. The TTL-at-read pattern bounded staleness but did nothing for memory. Fix: hand-rolled LRU on top of OrderedDict. No new dependency. - _PEER_METADATA_MAXSIZE = 1024 (issue's recommended bound) - _peer_metadata_get(canon) — read + LRU touch (move to MRU) - _peer_metadata_set(canon, value) — write + evict-if-over-maxsize - All production reads/writes route through the helpers - _peer_metadata_lock guards the OrderedDict ops so concurrent background-enrichment workers (#2484) don't race the LRU invariant Why hand-rolled vs cachetools: - No new dep. workspace/ has 0 cache libraries today; adding one for ~30 lines is negative leverage. - The TTL is enforced at the call site (existing pattern); only the size cap + LRU is new. cachetools.TTLCache fuses the two, which would force a refactor of every caller's TTL check. - The size + lock are simple enough that a future swap-in of cachetools is mechanical if needs evolve. Why maxsize matters more than ttl (issue's framing): A runaway poller that touches new peer_ids every push would still grow within a single TTL window — TTL eviction only fires at read time. The size cap fires immediately on insert, regardless of read pattern. Three new tests: - test_peer_metadata_set_evicts_lru_when_at_maxsize - test_peer_metadata_get_promotes_to_lru_head - test_peer_metadata_set_replaces_existing_entry_in_place 1742 passed / 0 failed locally (78 new + 1664 existing). Closes #2482. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 01:39:07 -07:00
Hongming Wang	35017c5452	perf(a2a): move enrichment GET off the inbox poller thread (#2484 ) The inbox poller's notification callback called the synchronous enrich_peer_metadata on every push, blocking the poller for up to 2s × N uncached peers per poll batch. Push delivery latency was gated on registry RTT — exactly what PR #2471's negative-cache patch was trying to avoid amplifying. Fix: cache-first nonblocking path with a tiny background worker pool. enrich_peer_metadata_nonblocking(peer_id): - Cache hit (fresh, within TTL): return cached record immediately - Cache miss / stale: return None, schedule background fetch via ThreadPoolExecutor The first push from a new peer arrives metadata-light (bare peer_id); the next push within the 5-min TTL hits the warm cache and gets full name/role. Acceptable trade-off because the channel-envelope enrichment is a UX nicety, not a correctness invariant — and the cold-cache window per peer is bounded to one push. Defenses: - In-flight gate (_enrich_in_flight) — N concurrent pushes for the same uncached peer schedule exactly ONE worker, not N. Without this, a chatty peer's first burst of pushes would amplify into parallel registry GETs — the exact DoS-on-self pattern the negative cache was meant to rate-limit. - Lazy executor init — most test fixtures + short-lived CLI invocations never need it; only the long-running molecule-mcp path actually fires background work. - Daemon-style threads via thread_name_prefix; executor never blocks process exit. Tests: - test_enrich_peer_metadata_nonblocking_cache_hit_returns_immediately - test_enrich_peer_metadata_nonblocking_cache_miss_schedules_fetch - test_enrich_peer_metadata_nonblocking_coalesces_duplicate_pushes - test_enrich_peer_metadata_nonblocking_invalid_peer_id_returns_none Plus updates to the existing test_envelope_enrichment_* suite that asserted synchronous behavior — they now drain the in-flight set via _wait_for_enrichment_inflight_for_testing before checking cache state. Existing synchronous enrich_peer_metadata is unchanged — Phase B (#2790) schema↔dispatcher drift gate + the negative-cache contract from PR #2471 still apply. The nonblocking variant is purely additive. 1739 passed, 0 failed locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 01:24:42 -07:00
Hongming Wang	700d44ec3d	feat(mcp): multi-workspace routing for memory + chat_history + workspace_info PR-3 of the multi-workspace MCP rollout. PR-1 made the MCP server itself multi-workspace aware (one process, N workspace memberships). PR-2 added source_workspace_id threading to delegate_task / list_peers. This change closes the remaining workspace-scoped tools so a single agent registered into multiple workspaces no longer leaks memories or chat history across tenants. Tools now accepting `source_workspace_id`: - tool_commit_memory(content, scope, source_workspace_id=None) — routes POST to /workspaces/{src}/memories with the source workspace's Bearer token. Body still embeds source_workspace_id for the platform's audit + namespace-isolation enforcement. - tool_recall_memory(query, scope, source_workspace_id=None) — GET /workspaces/{src}/memories with the source workspace's token and ?workspace_id={src} query so the platform scopes the read to the caller's tenant view (PR-1 / multi-workspace mode). - tool_chat_history(peer_id, limit, before_ts, source_workspace_id=None) — auto-routes via the _peer_to_source cache populated by list_peers, with explicit override winning. Falls back to module-level WORKSPACE_ID if neither is available. URL: /workspaces/{src}/chat-history. - tool_get_workspace_info(source_workspace_id=None) — GET /workspaces/{src} with the source workspace's token. Useful for introspecting any workspace the agent is registered into, not just the primary. In every path, `src = source_workspace_id or WORKSPACE_ID`, so single-workspace operators see no behavior change. Tokens are resolved per-workspace via auth_headers(src) / _auth_headers_for_heartbeat(src), which fall through to the legacy AUTH_TOKEN env when not in multi-workspace mode. Also updates input_schemas in platform_tools/registry.py so the new optional parameter is advertised to LLM clients (claude-code, hermes-agent, langchain wrappers). Tests (4 new classes in test_a2a_multi_workspace.py, 21 new tests): - TestCommitMemorySourceRouting — URL + Authorization header per source - TestRecallMemorySourceRouting — URL + query param + Authorization - TestChatHistorySourceRouting — peer-cache auto-route + explicit override - TestGetWorkspaceInfoSourceRouting — URL + Authorization Inbox tools (peek/pop/wait_for_message) already multi-workspace aware since PR-1 — inbox.py spawns per-workspace pollers and tags every InboxMessage with arrival_workspace_id. No further plumbing needed. Suite: 1700 passed, 3 skipped, 2 xfailed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 14:17:58 -07:00
Hongming Wang	1161b97faf	feat(mcp): cross-workspace delegation routing (multi-ws PR-2) PR-2 of the multi-workspace external-agent stack. PR-1 (#2739) landed per-workspace auth + heartbeat + inbox. This PR threads ``source_workspace_id`` through the A2A client + tool surface so an agent registered against multiple workspaces can list peers across all of them and delegate from a specific source. Changes ------- * ``a2a_client``: ``discover_peer``, ``send_a2a_message``, ``get_peers_with_diagnostic``, and ``enrich_peer_metadata`` now accept ``source_workspace_id``. Routing uses it for both the X-Workspace-ID header and (transitively, via ``auth_headers(src)``) the bearer token. Defaults to module-level WORKSPACE_ID for back-compat. * ``a2a_client._peer_to_source``: a new lock-free cache mapping each discovered peer back to the source workspace whose registry surfaced it. ``tool_list_peers`` populates the cache on every call; ``tool_delegate_task`` consults it for auto-routing. * ``a2a_tools.tool_list_peers(source_workspace_id=None)``: when multiple workspaces are registered (MOLECULE_WORKSPACES) and no explicit source is passed, aggregates peers across every registered workspace and tags each entry with ``via: <src[:8]>``. Single-workspace mode is unchanged — no ``via:`` annotation, same output shape. * ``a2a_tools.tool_delegate_task`` and ``tool_delegate_task_async`` resolve source via ``source_workspace_id arg → _peer_to_source[target] → WORKSPACE_ID``. Agents almost never need to specify ``source_`` explicitly — call ``list_peers`` first and the cache handles the rest. ``tool_delegate_task_async`` idempotency key now includes the source workspace, so the same task delegated from two registered workspaces produces two distinct delegations (the right behavior — one per tenant audit trail). * ``platform_auth.list_registered_workspaces()``: new helper for the tool layer to enumerate the multi-ws registry. Lock-free reads matched by the existing single-writer-per-workspace contract from PR-1. * ``platform_auth.self_source_headers``: now passes ``workspace_id`` through to ``auth_headers`` — without this, a multi-workspace POST source-tagged with ``X-Workspace-ID=ws_b`` was authenticating with ws_a's token (or no token if MOLECULE_WORKSPACE_TOKEN unset). Latent PR-1 bug exposed by the new tool surface. * ``a2a_mcp_server`` tool dispatch passes ``source_workspace_id`` from the tool call arguments. * ``platform_tools.registry``: add ``source_workspace_id`` to the delegate_task, delegate_task_async, check_task_status, list_peers input schemas with copy explaining when to use it (rarely — the cache handles it). Tests (15 new, all passing) --------------------------- ``test_a2a_multi_workspace.py``: * TestDiscoverPeerSourceRouting (3): src arg drives header+token, fallback to module ws when omitted, invalid target short-circuits before any HTTP attempt. * TestSendA2AMessageSourceRouting (1): X-Workspace-ID source header + Authorization bearer both come from the source arg via the patched self_source_headers chain. * TestGetPeersSourceRouting (1): URL path AND headers use the source workspace id. * TestToolListPeersAggregation (4): aggregates across multiple registered workspaces, tags origin, leaves single-workspace path unchanged, explicit src arg overrides aggregation, diagnostic joining when every workspace returns empty. * TestToolDelegateTaskAutoRouting (3): cache-driven auto-route, explicit override beats cache, single-workspace fallback to module WORKSPACE_ID. * TestListRegisteredWorkspaces (3): registry enumeration helper. Plus ``tests/snapshots/a2a_instructions_mcp.txt`` regenerated to absorb the new ``source_workspace_id`` schema entries. Back-compat ----------- Every change defaults ``source_workspace_id=None``; legacy single-workspace operators (no MOLECULE_WORKSPACES) see identical behavior — same URLs, same headers, same tool output. The 24 PR-1 tests + 125 existing A2A tests all still pass. Out of scope (PR-3) ------------------- Memory namespacing per registered workspace lands after the new memory system v2 PR (#2740) settles in production. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 08:32:24 -07:00
Hongming Wang	0b979aed78	fix(channel): validate peer_id at envelope build — close path-traversal foothold Two trust-boundary leaks surfaced in code review of the channel-envelope enrichment work: 1. _agent_card_url_for(peer_id) interpolated raw input into ${PLATFORM_URL}/registry/discover/<peer_id> with no UUID guard. An upstream row with peer_id=`../../foo` produced an agent-visible URL pointing at a sibling registry path. Same trust-boundary rationale discover_peer's docstring already calls out: "never interpolate path-traversal characters into the URL". Now gated by _validate_peer_id; returns "" on validation failure. 2. _build_channel_notification echoed raw peer_id back into meta["peer_id"], which on the push path renders inside the agent's <channel peer_id="..." kind="..."> XML-attribute context. Attacker bytes (control chars, embedded quotes) would land in agent-rendered text wired into the next conversation turn. Now canonicalised through _validate_peer_id before any meta write; on validation failure we set "" rather than reflecting the raw bytes. Defense-in-depth — both layers gate independently. Mutation-verified by stashing both prod-side files and confirming both regression tests fail. Tests: - test_envelope_enrichment_invalid_peer_id_skips_lookup: updated to pin the safe behavior (peer_id="" + agent_card_url absent), not the prior leak shape. - test_envelope_enrichment_strips_path_traversal_peer_id: NEW. Hard regression for peer_id="../../foo" — pins both the URL-builder and the meta echo against this specific exploit shape. - Two existing tests updated to use UUID-shape placeholders instead of "ws-peer-uuid" / "peer-ws-uuid" since those non-UUIDs now correctly get stripped by the validator. Resolves the Required-grade finding from the multi-axis review on PR #2471. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 18:43:49 -07:00
Hongming Wang	e6eda38318	fix(a2a-client): negative-cache registry failures in enrich_peer_metadata Self-review on PR #2471: failure outcomes (4xx/5xx/non-JSON/network exception) weren't writing to _peer_metadata, so a peer with a flaky or missing registry record re-fired the 2s-bounded GET on EVERY push. The cache became a no-op for the exact failure scenarios it most needs to defend against, and the poller thread stalled 2s per push for that peer until the registry came back. Cache the failure outcome as `(now, None)` so the TTL window suppresses re-fetch. Two new tests pin the behaviour for both HTTP failures (5xx) and transport exceptions (httpx.ConnectError). Type signature widens to `dict \| None` on the value tuple's second slot to match the new sentinel; readers already handle `None` as "no enrichment available" — that's the documented graceful-degrade contract — so no caller change needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 18:16:35 -07:00
Hongming Wang	050aa33fc1	feat(a2a-mcp): enrich channel envelope with peer name/role/agent_card_url The bare envelope only carried `peer_id` for peer_agent inbound, so a receiving agent had to round-trip to /registry to find out who's talking. Surface the sender's display name, role, and an agent-card URL alongside the routing fields so the agent can render "ops-agent (sre): ping" in one shot without an extra lookup. a2a_client.py: - Add _peer_metadata cache `dict[peer_id → (fetched_at, record)]` - Add enrich_peer_metadata(peer_id) — sync, hits cache or registry with a tight 2s timeout, returns None on validation/network/non-200 so callers can degrade gracefully - TTL = 5 min so a busy multi-peer chat doesn't hit registry on every push, but role/name renames propagate within a session - Add _agent_card_url_for(peer_id) — deterministic from peer_id alone a2a_mcp_server.py: - _build_channel_notification calls enrich_peer_metadata when peer_id is non-empty; meta carries peer_name + peer_role + agent_card_url alongside the existing routing fields - agent_card_url surfaces unconditionally (constructable from peer_id); peer_name/role only when registry lookup succeeds — never blocks the push on a registry stall Tests: 6 new branches (canvas_user no enrichment / cache hit no GET / cache miss fetches once / registry-fail graceful degrade / TTL expiry re-fetches / invalid peer_id skips lookup). Mutation-verified: 6/6 fail without prod code, 39/39 pass with. Tracks the broader RFC at #2469 (workspace-server activity_type rename to break the echo loop). Independent of PR #2470 — this is the metadata-enrichment half of the same UX improvement. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 17:40:09 -07:00
Hongming Wang	645c1862c4	feat(a2a-client): surface 410 Gone as 'removed' error so callers can re-onboard (#2429 ) Follow-up A to PR #2449 — that PR taught the platform to return 410 Gone for status='removed' workspaces; this PR teaches get_workspace_info to consume that signal. Before: every non-200 collapsed into {"error": "not found"}, which made the 2026-04-30 incident impossible to diagnose — the operator KNEW the workspace_id existed (they'd just registered it), but the runtime kept reporting "not found" for a deleted-but-not-purged row. After: 410 produces a distinct {"error": "removed", "id", "removed_at", "hint"} dict so callers (heartbeat-loop, channel bridge, dashboard tools) can surface "your workspace was deleted, re-onboard" instead of "not found". Falls back to a default hint if the platform body isn't parseable so the actionable signal doesn't depend on body shape parity. Two new tests: - TestGetWorkspaceInfo.test_410_returns_removed_with_hint - TestGetWorkspaceInfo.test_410_with_unparseable_body_falls_back_to_default_hint Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 22:08:08 -07:00
Hongming Wang	993f8c494e	refactor(workspace-runtime): send_a2a_message takes peer_id, validates UUID Two cleanups stacked on PR #2418: 1. Refactor `send_a2a_message(target_url, msg)` → `send_a2a_message(peer_id, msg)`. After #2418 every caller passes `${PLATFORM_URL}/workspaces/{peer_id}/a2a` — the function's parameter pretended to accept arbitrary URLs but in practice only one shape is meaningful. Owning URL construction inside the function makes the contract honest and centralises the peer-id validation introduced below. 2. Add `_validate_peer_id` UUID-shape check at the trust boundary. `discover_peer` and `send_a2a_message` are the entry points where agent-controlled strings flow into URL paths; rejecting non-UUID input at this layer eliminates the URL-interpolation class of bug (`workspace_id="../admin"` etc.) regardless of how the rest of the codebase interpolates ids elsewhere. Auth was already gating malicious access — this is consistency + clear failure over silent platform 4xx. In-container tests cover positive UUIDs, malformed input (``"ws-abc"``, ``"../admin"``, empty), and the contract that ``tool_delegate_task`` hands the peer_id to ``send_a2a_message`` without building URLs itself. Live-verified: external delegation 8dad3e29 → 97ac32e9 returned "refactor verified" from Claude Code Agent through the refactored code; ``_validate_peer_id`` rejects ``"ws-abc"`` and ``"../admin"`` and accepts canonical UUIDs. Stacked on PR #2418 (proxy-routing fix). Will rebase onto staging once #2418 merges. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 17:43:01 -07:00
Hongming Wang	3b34dfefbc	feat(workspace): surface peer-discovery failure reason instead of "may be isolated" Closes #2397. Today, every empty-peer condition (true empty, 401/403, 404, 5xx, network) collapses to a single message: "No peers available (this workspace may be isolated)". The user has no way to tell whether they need to provision more workspaces (true isolation), restart the workspace (auth), re-register (404), page on-call (5xx), or check network (timeout) — five different operator actions, one ambiguous string. Wire: - new helper get_peers_with_diagnostic() in a2a_client.py returns (peers, error_summary). error_summary is None on 200; a short actionable string on every other branch. - get_peers() now shims through it so non-tool callers (system-prompt formatters) keep the bare-list contract. - tool_list_peers() switches to the diagnostic helper and surfaces the actual reason. The "may be isolated" string is removed; true empty now reads "no peers in the platform registry." Tests: - TestGetPeersWithDiagnostic: 200, 200-empty, 401, 403, 404, 5xx, network exception, 200-but-non-list-body, and the bare-list-shim regression guard. - TestToolListPeers: each diagnostic branch surfaces its reason + explicit assertion that "may be isolated" is gone. Coverage 91.53% (floor 86%). 122 a2a tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 11:09:26 -07:00
Hongming Wang	e87a9c3858	fix(a2a): auto-retry transient transport errors in send_a2a_message Three different intermittent failures observed during a single manual-test session — RemoteProtocolError, ReadTimeout, ConnectError — each surfaced as a "Failed to deliver to <peer>" error chip in the canvas Agent Comms panel even though the next attempt would have succeeded (verified by direct probes from the same source workspace to the same peer). The error message even told the user "Usually a transient network blip — retry once," but it left the retry to a human reading the error message. Auto-retry inside send_a2a_message itself: up to 5 attempts (1 initial + 4 retries) with exponential backoff (1s, 2s, 4s, 8s, 16s-capped), each backoff jittered ±25% to break sync across siblings. Cumulative wall-clock capped at 600s by _DELEGATE_TOTAL_BUDGET_S so a string of 5×300s ReadTimeouts can't make the caller wait 25 minutes — once the deadline elapses, retries stop even if attempts remain. Retry only on transport-layer transients: - ConnectError / ConnectTimeout (peer's listening socket not ready) - RemoteProtocolError (peer closed TCP without writing — observed when a peer's prior in-flight Claude SDK session aborted) - ReadError / WriteError (network blip on Docker bridge) - ReadTimeout (peer wrote no response in 300s) Application-level errors are NOT retried — they're deterministic and retrying just wastes wall-clock: - HTTP 4xx (peer rejected the request format) - JSON parse failures (peer returned garbage) - JSON-RPC error in response body (peer's runtime errored cleanly) - Programmer-bug exceptions (ValueError, etc.) 8 new tests pin the contract: - retry succeeds after 2 RemoteProtocolErrors - retry succeeds after 1 ConnectError - all 5 attempts fail → returns formatted last-error - capped at exactly _DELEGATE_MAX_ATTEMPTS (regression cover for "did someone bump the constant accidentally?") - JSON-RPC error response NOT retried (1 attempt only) - non-httpx exception NOT retried (programmer bugs stay loud) - total budget caps the loop even if attempts remain - backoff schedule grows exponentially with ±25% jitter Refactor: extracted _format_a2a_error() so the success and exhausted paths share one error-formatting routine. _delegate_backoff_seconds() is a pure function so the schedule is unit-testable without monkey- patching asyncio.sleep. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 13:52:01 -07:00
Hongming Wang	81c4c1321c	fix(runtime): use lowercase wire role for v0.3 JSON-RPC compat layer Manual-test failure surfaced what was hidden behind the MCP-path bug: once delegate_task could actually fire, every cross-workspace call came back as JSON-RPC -32600 "Invalid Request" with the underlying pydantic ValidationError: params.message.role Input should be 'agent' or 'user' [type=enum, input_value='ROLE_USER', input_type=str] PR #2184's a2a-sdk 1.x migration sweep over-corrected: it changed every `"role": "user"` literal in JSON-RPC payload construction to `"role": "ROLE_USER"` to match the protobuf enum names of the 1.x native types (a2a.types.Role.ROLE_USER / ROLE_AGENT). That was correct for in-process Message construction (which the SDK serialises before wire transmission) but WRONG for the 8 sites that hand-build JSON-RPC payloads. The workspace's own a2a-sdk runs inbound requests through the v0.3 compat adapter (/usr/local/lib/python3.11/site-packages/a2a/compat/v0_3/) because main.py sets enable_v0_3_compat=True for backwards compatibility, and that adapter validates against the v0.3 Pydantic Role enum (`agent` \| `user` lowercase). The protobuf-style names blow it up. Reverted the 8 wire-payload sites to lowercase: - workspace/a2a_client.py:74 - workspace/a2a_cli.py:74, 111 - workspace/heartbeat.py:378 - workspace/main.py:464, 563 - workspace/builtin_tools/a2a_tools.py:60 - workspace/builtin_tools/delegation.py:272 Native-type usage at workspace/a2a_executor.py:471 (`Role.ROLE_AGENT`) stays — that's an in-process Message construction; the SDK handles wire serialisation correctly. Updated the misleading comment at main.py:255-257 (which said "outbound payloads are now 1.x-shaped (ROLE_USER)") to spell out the actual rule: outbound JSON-RPC wire payloads MUST use v0.3 shape, native types are only for in-process construction. New regression test test_jsonrpc_wire_role_format.py greps the 6 wire-payload-emitting files for any "ROLE_USER" / "ROLE_AGENT" string literal and fails loud — cheapest possible drift detector. Why E2E missed it: the priority-runtimes harness sends a single message canvas → workspace, but the canvas already used lowercase "user" (it never went through the migration sweep). The bug only surfaces on workspace → workspace delegation, which the harness doesn't exercise. Same gap as #131 (extend smoke to call main() against a stub). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 12:40:11 -07:00
Hongming Wang	dd57a840b6	fix: comprehensive a2a-sdk 1.x migration sweep across workspace/ Audited every a2a-sdk surface in workspace/ against the installed 1.0.2 wheel. Found and fixed: main.py (the live workspace startup path): • create_jsonrpc_routes(rpc_url='/', enable_v0_3_compat=True) — rpc_url required in 1.x; v0.3 compat enables inbound legacy clients (`"role": "user"` lowercase) without forcing them to upgrade. Pairs with the outbound rename below. a2a_executor.py: • TextPart/FilePart/FileWithUri removed in 1.x. Part is now a flat proto message: Part(text=…) / Part(url=…, filename=…, media_type=…). Updated the file-attachment branch (only reachable when an agent emits files; the harness's PONG path didn't exercise this, but it's a latent crash). • Message field names: messageId/taskId/contextId → message_id/task_id/context_id (proto3 snake_case). • Role enum: Role.agent → Role.ROLE_AGENT (proto enum). Outbound JSON-RPC payloads (8 files): • "role": "user" → "role": "ROLE_USER" — proto3 JSON serialization is strict about enum values. Sites: a2a_client, a2a_cli, main (initial+idle prompts), heartbeat, builtin_tools/a2a_tools, builtin_tools/delegation. Wire JSON keys stay camelCase (proto3 default), only the role enum value changed. google-adk/adapter.py: • new_agent_text_message → new_text_message (4 sites). This adapter's directory has a hyphen, so it can't be imported as a Python module — effectively dead code, but the wheel ships the file and a future fix should keep it correct against 1.x. Why one PR instead of seven: every previous a2a-sdk migration find landed as its own publish → cascade → harness → next-bug cycle. Today's audit ran every a2a-sdk symbol/type/method in workspace/ against the installed 1.0.2 wheel in a single sweep + tested the critical paths (Message construction, Part construction, Role enum parsing) against the actual SDK. Should be the last migration PR. Verified locally: python3 scripts/build_runtime_package.py --version 0.1.99 \ --out /tmp/build-final pip install /tmp/build-final python -c "import molecule_runtime.main; \ from molecule_runtime.a2a_executor import LangGraphA2AExecutor" → ✓ all imports clean against a2a-sdk 1.0.2 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 09:42:57 -07:00
Hongming Wang	c159d85eb5	fix(a2a): review-driven hardening — prefix-anchored type check, error_detail cap, shared hint module Three required fixes from the bundle review of `391e1872`: 1. workspace/a2a_client.py: substring `type_name in msg` could miss the diagnostic prefix when an exception's message embedded a different class name mid-string (e.g. `OSError("see ConnectionError below")` → printed as plain msg, type lost). Switched to a prefix-anchored check (`msg.startswith(f"{type_name}:")` etc.) so the type label is always added when not already at the start of the message. 2. workspace/a2a_tools.py: `activity_logs.error_detail` is unbounded TEXT on the platform (handlers/activity.go does not validate length). A buggy or hostile peer could stream arbitrarily large error messages into the caller's activity log. Cap at 4096 chars at the producer — comfortably above any real exception traceback, well below an obvious-DoS threshold. 3. New regression test for JSON-RPC `code=0` — pins the `code is not None` semantics so the code is preserved in the detail rather than collapsing into the no-code path. Code=0 is not valid per the spec, but a malformed peer can still emit it and we want it visible for diagnosis. Plus one optional taken: extracted the A2A-error → hint mapping into canvas/src/components/tabs/chat/a2aErrorHint.ts. The two prior copies (AgentCommsPanel.inferCauseHint + ActivityTab.inferA2AErrorHint) had already drifted — Activity tab gained `not found`/`offline` cases the chat panel never picked up, AgentCommsPanel handled empty-input explicitly while Activity didn't. The shared module is the merged superset, with 10 unit tests pinning each named pattern + the "most specific first" ordering (Claude SDK wedge wins over generic timeout). Skipped (per analysis): - Unicode-naive 120-char slice — Python str[:N] slices on code points, not bytes. Safe. - Nested [A2A_ERROR] confusion — non-issue per reviewer; outer prefix winning still produces a structured render. - MessagePreview + JsonBlock dual render on errors — intentional drilldown; raw JSON is below the fold for operators who need it. - console.warn dedup — refetches don't happen per-event so spam risk is low. - str(data)[:200] materialization — A2A response bodies aren't typically MB-sized. Verified: 1005 canvas tests pass (10 new hint tests); 10 Python send_a2a_message tests pass (1 new for code=0); tsc clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 23:47:44 -07:00
Hongming Wang	391e187281	fix(a2a,canvas): make delivery failures comprehensive instead of "[A2A_ERROR] " Symptom: Activity tab and Agent Comms surfaced bare "[A2A_ERROR] " (prefix + nothing) for failed delegations. Operator had no signal to act on — no exception type, no target, no hint about what went wrong, no next step. Fix is in three layers. 1. workspace/a2a_client.py — every error path now produces an actionable detail string: - except branch: some httpx exceptions (RemoteProtocolError, ConnectionReset variants) stringify to "". Pre-fix the catch was `f"{_A2A_ERROR_PREFIX}{e}"` → bare prefix. Now falls back to `<TypeName> (no message — likely connection reset or silent timeout)` and always appends `[target=<url>]` for traceability in chained delegations. - JSON-RPC error branch: previously dropped error.code on the floor and printed "unknown" when message was missing. Now surfaces both, including the well-defined "JSON-RPC error with no message (code=N)" path. - "neither result nor error" branch: pre-fix returned str(payload) which the canvas rendered as a successful response block. Now tagged as A2A_ERROR with a payload snippet so downstream UI routes through the error path. 2. workspace/a2a_tools.py — tool_delegate_task now passes error_detail (the stripped error message) through to the activity-log POST. The platform's activity_logs.error_detail column is the canvas's red error chip source; populating it makes the failure visible in the row header without the user having to expand into raw response_body JSON. The summary line also gets a 120-char prefix of the cause so the collapsed row reads "React Engineer failed: ConnectionResetError: ... [target=...]" instead of "React Engineer failed". 3. canvas/src/components/tabs/ActivityTab.tsx — MessagePreview now detects [A2A_ERROR]-prefixed bodies and renders a structured error block (red chip, stripped detail, cause hint) instead of the previous gray text-block that showed the literal "[A2A_ERROR]" string. inferA2AErrorHint mirrors the patterns from AgentCommsPanel.inferCauseHint so the same symptom reads the same way in both surfaces (Claude SDK init wedge → restart workspace; timeout → busy/stuck; connection-reset → transient blip then check logs). Tests: 9 send_a2a_message tests pass (including a new regression test for the empty-stringifying-exception case that the user reported); 995 canvas tests pass; tsc clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 23:40:05 -07:00
Hongming Wang	65b531acf6	fix(workspace): tag self-originated A2A POSTs with X-Workspace-ID Workspace runtime fired four classes of A2A request to the platform without the X-Workspace-ID header that identifies the source workspace: heartbeat self-messages, initial_prompt, idle-loop fires, and peer-to-peer A2A from runtime tools. The platform's a2a_receive logger keys source_id off that header — without it, every such row was written with source_id=NULL, which the canvas's My Chat tab filters as ?source=canvas (i.e. "user typed this") and rendered the internal triggers as if the human user had sent them. The "Delegation results are ready..." heartbeat trigger was visible to end users in the chat history; delegate_task A2A calls between agents were misclassified the same way. Centralise the header construction in a new platform_auth helper self_source_headers(workspace_id) that returns auth_headers() PLUS {X-Workspace-ID: <id>}. Apply it to: - heartbeat.py self-message (refactored from inline header dict) - main.py initial_prompt POST - main.py idle_prompt POST - a2a_client.py send_a2a_message (peer A2A from runtime) - builtin_tools/a2a_tools.py delegate_task (was missing ALL headers) Tests: - test_heartbeat.py asserts the X-Workspace-ID header is set on the self-message POST. - test_a2a_tools_module.py asserts the same on delegate_task POSTs; FakeClient.post mocks updated to accept the headers kwarg. Production effect lands the moment workspace containers are rebuilt with this code; existing rows in activity_logs keep their NULL source_id (legacy data). The canvas-side filter (#follow-up) covers the historical-rows case until backfill. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 19:54:43 -07:00
molecule-ai[bot]	3bef6af241	fix: apply #1124 env-var defaults + scrub F1088 credentials from INCIDENT_LOG.md (#1347 ) - PLATFORM_URL: replace unreachable http://platform:8080 mesh-only default with Docker-aware detection (host.docker.internal in containers, localhost for local dev) across all workspace Python modules and the git-token-helper shell script. - WORKSPACE_ID: add fail-fast validation in main.py (SystemExit if empty) consistent with coordinator.py / a2a_cli.py patterns already in place. - INCIDENT_LOG.md: replace all 3 F1088 credential types with *REDACTED* (sk-cp- 2x, github_pat_ 2x, ADMIN_TOKEN base64 3x). Fixes #1124, #1333. Co-authored-by: Molecule AI Dev Lead <dev-lead@agents.moleculesai.app>	2026-04-21 08:11:44 +00:00
molecule-ai[bot]	e07e22ad57	fix(orchestrator): fail-fast if WORKSPACE_ID env var is unset/empty (#1124 ) (#1336 ) * fix(orchestrator): fail-fast if WORKSPACE_ID env var is unset/empty Issue #1124: orchestrator GET /workspaces/{WORKSPACE_ID} returned 404 because 5 Python modules defaulted WORKSPACE_ID to "" instead of validating the injected value. Empty string produced URLs like /workspaces//heartbeat — route not found. Fix: raise RuntimeError at module load if WORKSPACE_ID is unset or empty, rather than silently producing broken API calls downstream. Files changed (all same pattern): - workspace/a2a_cli.py - workspace/a2a_client.py - workspace/coordinator.py - workspace/consolidation.py - workspace/molecule_ai_status.py The platform (provisioner.go:375) correctly injects WORKSPACE_ID at container provision time. This fix ensures the orchestrator surfaces the misconfiguration immediately instead of failing silently at runtime. Closes #1124. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(incidents): rebuild INCIDENT_LOG — linter reset, all sections restored Rebuilt after linter reset. Sections restored: - Security Audit Cycle 6 (abc58b47) - F1100 workspace_restart.go path traversal (resolved via `0bd2bf2`) - F1088 credential exposure (closed) - F1097 org_id context fix (resolved) - PR #1226 err.Error() leaks (stale) - QA Round 18 orgs-page regression (fixed on main, staging pending) - Issue #1124 fix PR #1336 filed Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Molecule AI Core Platform Lead <core-platform-lead@agents.moleculesai.app> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-21 07:42:00 +00:00
Hongming Wang	d8026347e5	chore: open-source restructure — rename dirs, remove internal files, scrub secrets Renames: - platform/ → workspace-server/ (Go module path stays as "platform" for external dep compat — will update after plugin module republish) - workspace-template/ → workspace/ Removed (moved to separate repos or deleted): - PLAN.md — internal roadmap (move to private project board) - HANDOFF.md, AGENTS.md — one-time internal session docs - .claude/ — gitignored entirely (local agent config) - infra/cloudflare-worker/ → Molecule-AI/molecule-tenant-proxy - org-templates/molecule-dev/ → standalone template repo - .mcp-eval/ → molecule-mcp-server repo - test-results/ — ephemeral, gitignored Security scrubbing: - Cloudflare account/zone/KV IDs → placeholders - Real EC2 IPs → <EC2_IP> in all docs - CF token prefix, Neon project ID, Fly app names → redacted - Langfuse dev credentials → parameterized - Personal runner username/machine name → generic Community files: - CONTRIBUTING.md — build, test, branch conventions - CODE_OF_CONDUCT.md — Contributor Covenant 2.1 All Dockerfiles, CI workflows, docker-compose, railway.toml, render.yaml, README, CLAUDE.md updated for new directory names. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 00:24:44 -07:00

21 Commits