Commit Graph

19 Commits

Author SHA1 Message Date
Hongming Wang
da6d319c48 perf(a2a): bound + LRU-evict _peer_metadata cache (#2482)
Pre-fix _peer_metadata was an unbounded dict — a workspace receiving
from N distinct peers across its lifetime accumulated entries
indefinitely (~100 bytes × N). Not crash-class at typical scale (10K
peers ≈ 1 MB) but unbounded. The TTL-at-read pattern bounded
staleness but did nothing for memory.

Fix: hand-rolled LRU on top of OrderedDict. No new dependency.

  - _PEER_METADATA_MAXSIZE = 1024 (issue's recommended bound)
  - _peer_metadata_get(canon)  — read + LRU touch (move to MRU)
  - _peer_metadata_set(canon, value) — write + evict-if-over-maxsize
  - All production reads/writes route through the helpers
  - _peer_metadata_lock guards the OrderedDict ops so concurrent
    background-enrichment workers (#2484) don't race the LRU
    invariant

Why hand-rolled vs cachetools:
  - No new dep. workspace/ has 0 cache libraries today; adding one
    for ~30 lines is negative leverage.
  - The TTL is enforced at the call site (existing pattern); only
    the size cap + LRU is new. cachetools.TTLCache fuses the two,
    which would force a refactor of every caller's TTL check.
  - The size + lock are simple enough that a future swap-in of
    cachetools is mechanical if needs evolve.

Why maxsize matters more than ttl (issue's framing):
  A runaway poller that touches new peer_ids every push would still
  grow within a single TTL window — TTL eviction only fires at
  read time. The size cap fires immediately on insert, regardless
  of read pattern.

Three new tests:
  - test_peer_metadata_set_evicts_lru_when_at_maxsize
  - test_peer_metadata_get_promotes_to_lru_head
  - test_peer_metadata_set_replaces_existing_entry_in_place

1742 passed / 0 failed locally (78 new + 1664 existing).

Closes #2482.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 01:39:07 -07:00
Hongming Wang
35017c5452 perf(a2a): move enrichment GET off the inbox poller thread (#2484)
The inbox poller's notification callback called the synchronous
enrich_peer_metadata on every push, blocking the poller for up to
2s × N uncached peers per poll batch. Push delivery latency was
gated on registry RTT — exactly what PR #2471's negative-cache patch
was trying to avoid amplifying.

Fix: cache-first nonblocking path with a tiny background worker pool.

  enrich_peer_metadata_nonblocking(peer_id):
    - Cache hit (fresh, within TTL):  return cached record immediately
    - Cache miss / stale:              return None, schedule background
                                       fetch via ThreadPoolExecutor

The first push from a new peer arrives metadata-light (bare peer_id);
the next push within the 5-min TTL hits the warm cache and gets full
name/role. Acceptable trade-off because the channel-envelope
enrichment is a UX nicety, not a correctness invariant — and the
cold-cache window per peer is bounded to one push.

Defenses:
  - In-flight gate (_enrich_in_flight) — N concurrent pushes for the
    same uncached peer schedule exactly ONE worker, not N. Without
    this, a chatty peer's first burst of pushes would amplify into
    parallel registry GETs — the exact DoS-on-self pattern the
    negative cache was meant to rate-limit.
  - Lazy executor init — most test fixtures + short-lived CLI
    invocations never need it; only the long-running molecule-mcp
    path actually fires background work.
  - Daemon-style threads via thread_name_prefix; executor never
    blocks process exit.

Tests:
  - test_enrich_peer_metadata_nonblocking_cache_hit_returns_immediately
  - test_enrich_peer_metadata_nonblocking_cache_miss_schedules_fetch
  - test_enrich_peer_metadata_nonblocking_coalesces_duplicate_pushes
  - test_enrich_peer_metadata_nonblocking_invalid_peer_id_returns_none

Plus updates to the existing test_envelope_enrichment_* suite that
asserted synchronous behavior — they now drain the in-flight set via
_wait_for_enrichment_inflight_for_testing before checking cache state.

Existing synchronous enrich_peer_metadata is unchanged — Phase B (#2790)
schema↔dispatcher drift gate + the negative-cache contract from PR
#2471 still apply. The nonblocking variant is purely additive.

1739 passed, 0 failed locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 01:24:42 -07:00
Hongming Wang
700d44ec3d feat(mcp): multi-workspace routing for memory + chat_history + workspace_info
PR-3 of the multi-workspace MCP rollout. PR-1 made the MCP server itself
multi-workspace aware (one process, N workspace memberships). PR-2 added
source_workspace_id threading to delegate_task / list_peers. This change
closes the remaining workspace-scoped tools so a single agent registered
into multiple workspaces no longer leaks memories or chat history across
tenants.

Tools now accepting `source_workspace_id`:

- tool_commit_memory(content, scope, source_workspace_id=None) —
  routes POST to /workspaces/{src}/memories with the source workspace's
  Bearer token. Body still embeds source_workspace_id for the platform's
  audit + namespace-isolation enforcement.

- tool_recall_memory(query, scope, source_workspace_id=None) —
  GET /workspaces/{src}/memories with the source workspace's token and
  ?workspace_id={src} query so the platform scopes the read to the
  caller's tenant view (PR-1 / multi-workspace mode).

- tool_chat_history(peer_id, limit, before_ts, source_workspace_id=None)
  — auto-routes via the _peer_to_source cache populated by list_peers,
  with explicit override winning. Falls back to module-level WORKSPACE_ID
  if neither is available. URL: /workspaces/{src}/chat-history.

- tool_get_workspace_info(source_workspace_id=None) — GET /workspaces/{src}
  with the source workspace's token. Useful for introspecting any
  workspace the agent is registered into, not just the primary.

In every path, `src = source_workspace_id or WORKSPACE_ID`, so
single-workspace operators see no behavior change. Tokens are resolved
per-workspace via auth_headers(src) / _auth_headers_for_heartbeat(src),
which fall through to the legacy AUTH_TOKEN env when not in
multi-workspace mode.

Also updates input_schemas in platform_tools/registry.py so the new
optional parameter is advertised to LLM clients (claude-code,
hermes-agent, langchain wrappers).

Tests (4 new classes in test_a2a_multi_workspace.py, 21 new tests):
- TestCommitMemorySourceRouting — URL + Authorization header per source
- TestRecallMemorySourceRouting — URL + query param + Authorization
- TestChatHistorySourceRouting — peer-cache auto-route + explicit override
- TestGetWorkspaceInfoSourceRouting — URL + Authorization

Inbox tools (peek/pop/wait_for_message) already multi-workspace aware
since PR-1 — inbox.py spawns per-workspace pollers and tags every
InboxMessage with arrival_workspace_id. No further plumbing needed.

Suite: 1700 passed, 3 skipped, 2 xfailed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 14:17:58 -07:00
Hongming Wang
1161b97faf feat(mcp): cross-workspace delegation routing (multi-ws PR-2)
PR-2 of the multi-workspace external-agent stack. PR-1 (#2739)
landed per-workspace auth + heartbeat + inbox. This PR threads
``source_workspace_id`` through the A2A client + tool surface so an
agent registered against multiple workspaces can list peers across
all of them and delegate from a specific source.

Changes
-------

* ``a2a_client``: ``discover_peer``, ``send_a2a_message``,
  ``get_peers_with_diagnostic``, and ``enrich_peer_metadata`` now
  accept ``source_workspace_id``. Routing uses it for both the
  X-Workspace-ID header and (transitively, via ``auth_headers(src)``)
  the bearer token. Defaults to module-level WORKSPACE_ID for
  back-compat.
* ``a2a_client._peer_to_source``: a new lock-free cache mapping each
  discovered peer back to the source workspace whose registry
  surfaced it. ``tool_list_peers`` populates the cache on every call;
  ``tool_delegate_task`` consults it for auto-routing.
* ``a2a_tools.tool_list_peers(source_workspace_id=None)``: when
  multiple workspaces are registered (MOLECULE_WORKSPACES) and no
  explicit source is passed, aggregates peers across every
  registered workspace and tags each entry with ``via: <src[:8]>``.
  Single-workspace mode is unchanged — no ``via:`` annotation, same
  output shape.
* ``a2a_tools.tool_delegate_task`` and ``tool_delegate_task_async``
  resolve source via ``source_workspace_id arg → _peer_to_source[target]
  → WORKSPACE_ID``. Agents almost never need to specify ``source_*``
  explicitly — call ``list_peers`` first and the cache handles the
  rest.
* ``tool_delegate_task_async`` idempotency key now includes the
  source workspace, so the same task delegated from two registered
  workspaces produces two distinct delegations (the right behavior
  — one per tenant audit trail).
* ``platform_auth.list_registered_workspaces()``: new helper for the
  tool layer to enumerate the multi-ws registry. Lock-free reads
  matched by the existing single-writer-per-workspace contract from
  PR-1.
* ``platform_auth.self_source_headers``: now passes ``workspace_id``
  through to ``auth_headers`` — without this, a multi-workspace POST
  source-tagged with ``X-Workspace-ID=ws_b`` was authenticating
  with ws_a's token (or no token if MOLECULE_WORKSPACE_TOKEN unset).
  Latent PR-1 bug exposed by the new tool surface.
* ``a2a_mcp_server`` tool dispatch passes ``source_workspace_id``
  from the tool call arguments.
* ``platform_tools.registry``: add ``source_workspace_id`` to the
  delegate_task, delegate_task_async, check_task_status, list_peers
  input schemas with copy explaining when to use it (rarely — the
  cache handles it).

Tests (15 new, all passing)
---------------------------

``test_a2a_multi_workspace.py``:
* TestDiscoverPeerSourceRouting (3): src arg drives header+token,
  fallback to module ws when omitted, invalid target short-circuits
  before any HTTP attempt.
* TestSendA2AMessageSourceRouting (1): X-Workspace-ID source header
  + Authorization bearer both come from the source arg via the
  patched self_source_headers chain.
* TestGetPeersSourceRouting (1): URL path AND headers use the
  source workspace id.
* TestToolListPeersAggregation (4): aggregates across multiple
  registered workspaces, tags origin, leaves single-workspace path
  unchanged, explicit src arg overrides aggregation, diagnostic
  joining when every workspace returns empty.
* TestToolDelegateTaskAutoRouting (3): cache-driven auto-route,
  explicit override beats cache, single-workspace fallback to
  module WORKSPACE_ID.
* TestListRegisteredWorkspaces (3): registry enumeration helper.

Plus ``tests/snapshots/a2a_instructions_mcp.txt`` regenerated to
absorb the new ``source_workspace_id`` schema entries.

Back-compat
-----------

Every change defaults ``source_workspace_id=None``; legacy
single-workspace operators (no MOLECULE_WORKSPACES) see identical
behavior — same URLs, same headers, same tool output. The 24
PR-1 tests + 125 existing A2A tests all still pass.

Out of scope (PR-3)
-------------------

Memory namespacing per registered workspace lands after the new
memory system v2 PR (#2740) settles in production.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 08:32:24 -07:00
Hongming Wang
0b979aed78 fix(channel): validate peer_id at envelope build — close path-traversal foothold
Two trust-boundary leaks surfaced in code review of the channel-envelope
enrichment work:

1. _agent_card_url_for(peer_id) interpolated raw input into
   ${PLATFORM_URL}/registry/discover/<peer_id> with no UUID guard. An
   upstream row with peer_id=`../../foo` produced an agent-visible URL
   pointing at a sibling registry path. Same trust-boundary rationale
   discover_peer's docstring already calls out: "never interpolate
   path-traversal characters into the URL". Now gated by _validate_peer_id;
   returns "" on validation failure.

2. _build_channel_notification echoed raw peer_id back into
   meta["peer_id"], which on the push path renders inside the agent's
   <channel peer_id="..." kind="..."> XML-attribute context. Attacker
   bytes (control chars, embedded quotes) would land in agent-rendered
   text wired into the next conversation turn. Now canonicalised through
   _validate_peer_id before any meta write; on validation failure we
   set "" rather than reflecting the raw bytes.

Defense-in-depth — both layers gate independently. Mutation-verified by
stashing both prod-side files and confirming both regression tests fail.

Tests:
- test_envelope_enrichment_invalid_peer_id_skips_lookup: updated to
  pin the safe behavior (peer_id="" + agent_card_url absent), not the
  prior leak shape.
- test_envelope_enrichment_strips_path_traversal_peer_id: NEW. Hard
  regression for peer_id="../../foo" — pins both the URL-builder and
  the meta echo against this specific exploit shape.
- Two existing tests updated to use UUID-shape placeholders instead
  of "ws-peer-uuid" / "peer-ws-uuid" since those non-UUIDs now correctly
  get stripped by the validator.

Resolves the Required-grade finding from the multi-axis review on PR #2471.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 18:43:49 -07:00
Hongming Wang
e6eda38318 fix(a2a-client): negative-cache registry failures in enrich_peer_metadata
Self-review on PR #2471: failure outcomes (4xx/5xx/non-JSON/network
exception) weren't writing to _peer_metadata, so a peer with a flaky
or missing registry record re-fired the 2s-bounded GET on EVERY
push. The cache became a no-op for the exact failure scenarios it
most needs to defend against, and the poller thread stalled 2s per
push for that peer until the registry came back.

Cache the failure outcome as `(now, None)` so the TTL window
suppresses re-fetch. Two new tests pin the behaviour for both
HTTP failures (5xx) and transport exceptions (httpx.ConnectError).

Type signature widens to `dict | None` on the value tuple's second
slot to match the new sentinel; readers already handle `None` as
"no enrichment available" — that's the documented graceful-degrade
contract — so no caller change needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 18:16:35 -07:00
Hongming Wang
050aa33fc1 feat(a2a-mcp): enrich channel envelope with peer name/role/agent_card_url
The bare envelope only carried `peer_id` for peer_agent inbound, so a
receiving agent had to round-trip to /registry to find out who's
talking. Surface the sender's display name, role, and an agent-card
URL alongside the routing fields so the agent can render
"ops-agent (sre): ping" in one shot without an extra lookup.

a2a_client.py:
- Add _peer_metadata cache `dict[peer_id → (fetched_at, record)]`
- Add enrich_peer_metadata(peer_id) — sync, hits cache or registry
  with a tight 2s timeout, returns None on validation/network/non-200
  so callers can degrade gracefully
- TTL = 5 min so a busy multi-peer chat doesn't hit registry on every
  push, but role/name renames propagate within a session
- Add _agent_card_url_for(peer_id) — deterministic from peer_id alone

a2a_mcp_server.py:
- _build_channel_notification calls enrich_peer_metadata when peer_id
  is non-empty; meta carries peer_name + peer_role + agent_card_url
  alongside the existing routing fields
- agent_card_url surfaces unconditionally (constructable from peer_id);
  peer_name/role only when registry lookup succeeds — never blocks the
  push on a registry stall

Tests: 6 new branches (canvas_user no enrichment / cache hit no GET /
cache miss fetches once / registry-fail graceful degrade / TTL expiry
re-fetches / invalid peer_id skips lookup). Mutation-verified: 6/6
fail without prod code, 39/39 pass with.

Tracks the broader RFC at #2469 (workspace-server activity_type rename
to break the echo loop). Independent of PR #2470 — this is the
metadata-enrichment half of the same UX improvement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 17:40:09 -07:00
Hongming Wang
645c1862c4 feat(a2a-client): surface 410 Gone as 'removed' error so callers can re-onboard (#2429)
Follow-up A to PR #2449 — that PR taught the platform to return 410
Gone for status='removed' workspaces; this PR teaches get_workspace_info
to consume that signal.

Before: every non-200 collapsed into {"error": "not found"}, which
made the 2026-04-30 incident impossible to diagnose — the operator
KNEW the workspace_id existed (they'd just registered it), but the
runtime kept reporting "not found" for a deleted-but-not-purged row.

After: 410 produces a distinct {"error": "removed", "id", "removed_at",
"hint"} dict so callers (heartbeat-loop, channel bridge, dashboard
tools) can surface "your workspace was deleted, re-onboard" instead
of "not found". Falls back to a default hint if the platform body
isn't parseable so the actionable signal doesn't depend on body
shape parity.

Two new tests:
  - TestGetWorkspaceInfo.test_410_returns_removed_with_hint
  - TestGetWorkspaceInfo.test_410_with_unparseable_body_falls_back_to_default_hint

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:08:08 -07:00
Hongming Wang
993f8c494e refactor(workspace-runtime): send_a2a_message takes peer_id, validates UUID
Two cleanups stacked on PR #2418:

1. Refactor `send_a2a_message(target_url, msg)` →
   `send_a2a_message(peer_id, msg)`. After #2418 every caller passes
   `${PLATFORM_URL}/workspaces/{peer_id}/a2a` — the function's
   parameter pretended to accept arbitrary URLs but in practice only
   one shape is meaningful. Owning URL construction inside the
   function makes the contract honest and centralises the peer-id
   validation introduced below.

2. Add `_validate_peer_id` UUID-shape check at the trust boundary.
   `discover_peer` and `send_a2a_message` are the entry points where
   agent-controlled strings flow into URL paths; rejecting non-UUID
   input at this layer eliminates the URL-interpolation class of
   bug (`workspace_id="../admin"` etc.) regardless of how the rest
   of the codebase interpolates ids elsewhere. Auth was already
   gating malicious access — this is consistency + clear failure
   over silent platform 4xx.

In-container tests cover positive UUIDs, malformed input
(``"ws-abc"``, ``"../admin"``, empty), and the contract that
``tool_delegate_task`` hands the peer_id to ``send_a2a_message``
without building URLs itself.

Live-verified: external delegation 8dad3e29 → 97ac32e9 returned
"refactor verified" from Claude Code Agent through the refactored
code; ``_validate_peer_id`` rejects ``"ws-abc"`` and ``"../admin"``
and accepts canonical UUIDs.

Stacked on PR #2418 (proxy-routing fix). Will rebase onto staging
once #2418 merges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 17:43:01 -07:00
Hongming Wang
3b34dfefbc feat(workspace): surface peer-discovery failure reason instead of "may be isolated"
Closes #2397. Today, every empty-peer condition (true empty, 401/403, 404,
5xx, network) collapses to a single message: "No peers available (this
workspace may be isolated)". The user has no way to tell whether they need
to provision more workspaces (true isolation), restart the workspace
(auth), re-register (404), page on-call (5xx), or check network (timeout) —
five different operator actions, one ambiguous string.

Wire:
  - new helper get_peers_with_diagnostic() in a2a_client.py returns
    (peers, error_summary). error_summary is None on 200; a short
    actionable string on every other branch.
  - get_peers() now shims through it so non-tool callers (system-prompt
    formatters) keep the bare-list contract.
  - tool_list_peers() switches to the diagnostic helper and surfaces the
    actual reason. The "may be isolated" string is removed; true empty
    now reads "no peers in the platform registry."

Tests:
  - TestGetPeersWithDiagnostic: 200, 200-empty, 401, 403, 404, 5xx,
    network exception, 200-but-non-list-body, and the bare-list-shim
    regression guard.
  - TestToolListPeers: each diagnostic branch surfaces its reason +
    explicit assertion that "may be isolated" is gone.

Coverage 91.53% (floor 86%). 122 a2a tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 11:09:26 -07:00
Hongming Wang
e87a9c3858 fix(a2a): auto-retry transient transport errors in send_a2a_message
Three different intermittent failures observed during a single
manual-test session — RemoteProtocolError, ReadTimeout, ConnectError —
each surfaced as a "Failed to deliver to <peer>" error chip in the
canvas Agent Comms panel even though the next attempt would have
succeeded (verified by direct probes from the same source workspace
to the same peer). The error message even told the user "Usually a
transient network blip — retry once," but it left the retry to a
human reading the error message.

Auto-retry inside send_a2a_message itself: up to 5 attempts (1
initial + 4 retries) with exponential backoff (1s, 2s, 4s, 8s,
16s-capped), each backoff jittered ±25% to break sync across
siblings. Cumulative wall-clock capped at 600s by
_DELEGATE_TOTAL_BUDGET_S so a string of 5×300s ReadTimeouts can't
make the caller wait 25 minutes — once the deadline elapses, retries
stop even if attempts remain.

Retry only on transport-layer transients:
  - ConnectError / ConnectTimeout (peer's listening socket not ready)
  - RemoteProtocolError (peer closed TCP without writing — observed
    when a peer's prior in-flight Claude SDK session aborted)
  - ReadError / WriteError (network blip on Docker bridge)
  - ReadTimeout (peer wrote no response in 300s)

Application-level errors are NOT retried — they're deterministic and
retrying just wastes wall-clock:
  - HTTP 4xx (peer rejected the request format)
  - JSON parse failures (peer returned garbage)
  - JSON-RPC error in response body (peer's runtime errored cleanly)
  - Programmer-bug exceptions (ValueError, etc.)

8 new tests pin the contract:
  - retry succeeds after 2 RemoteProtocolErrors
  - retry succeeds after 1 ConnectError
  - all 5 attempts fail → returns formatted last-error
  - capped at exactly _DELEGATE_MAX_ATTEMPTS (regression cover for
    "did someone bump the constant accidentally?")
  - JSON-RPC error response NOT retried (1 attempt only)
  - non-httpx exception NOT retried (programmer bugs stay loud)
  - total budget caps the loop even if attempts remain
  - backoff schedule grows exponentially with ±25% jitter

Refactor: extracted _format_a2a_error() so the success and exhausted
paths share one error-formatting routine. _delegate_backoff_seconds()
is a pure function so the schedule is unit-testable without monkey-
patching asyncio.sleep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 13:52:01 -07:00
Hongming Wang
81c4c1321c fix(runtime): use lowercase wire role for v0.3 JSON-RPC compat layer
Manual-test failure surfaced what was hidden behind the MCP-path bug:
once delegate_task could actually fire, every cross-workspace call
came back as JSON-RPC -32600 "Invalid Request" with the underlying
pydantic ValidationError:

    params.message.role
      Input should be 'agent' or 'user' [type=enum,
      input_value='ROLE_USER', input_type=str]

PR #2184's a2a-sdk 1.x migration sweep over-corrected: it changed
every `"role": "user"` literal in JSON-RPC payload construction to
`"role": "ROLE_USER"` to match the protobuf enum names of the 1.x
native types (a2a.types.Role.ROLE_USER / ROLE_AGENT). That was
correct for in-process Message construction (which the SDK
serialises before wire transmission) but WRONG for the 8 sites that
hand-build JSON-RPC payloads. The workspace's own a2a-sdk runs
inbound requests through the v0.3 compat adapter
(/usr/local/lib/python3.11/site-packages/a2a/compat/v0_3/) because
main.py sets enable_v0_3_compat=True for backwards compatibility,
and that adapter validates against the v0.3 Pydantic Role enum
(`agent` | `user` lowercase). The protobuf-style names blow it up.

Reverted the 8 wire-payload sites to lowercase:
  - workspace/a2a_client.py:74
  - workspace/a2a_cli.py:74, 111
  - workspace/heartbeat.py:378
  - workspace/main.py:464, 563
  - workspace/builtin_tools/a2a_tools.py:60
  - workspace/builtin_tools/delegation.py:272

Native-type usage at workspace/a2a_executor.py:471 (`Role.ROLE_AGENT`)
stays — that's an in-process Message construction; the SDK handles
wire serialisation correctly.

Updated the misleading comment at main.py:255-257 (which said
"outbound payloads are now 1.x-shaped (ROLE_USER)") to spell out
the actual rule: outbound JSON-RPC wire payloads MUST use v0.3
shape, native types are only for in-process construction.

New regression test test_jsonrpc_wire_role_format.py greps the 6
wire-payload-emitting files for any "ROLE_USER" / "ROLE_AGENT"
string literal and fails loud — cheapest possible drift detector.

Why E2E missed it: the priority-runtimes harness sends a single
message canvas → workspace, but the canvas already used lowercase
"user" (it never went through the migration sweep). The bug only
surfaces on workspace → workspace delegation, which the harness
doesn't exercise. Same gap as #131 (extend smoke to call main()
against a stub).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 12:40:11 -07:00
Hongming Wang
dd57a840b6 fix: comprehensive a2a-sdk 1.x migration sweep across workspace/
Audited every a2a-sdk surface in workspace/ against the installed
1.0.2 wheel. Found and fixed:

main.py (the live workspace startup path):
  • create_jsonrpc_routes(rpc_url='/', enable_v0_3_compat=True) —
    rpc_url required in 1.x; v0.3 compat enables inbound legacy
    clients (`"role": "user"` lowercase) without forcing them to
    upgrade. Pairs with the outbound rename below.

a2a_executor.py:
  • TextPart/FilePart/FileWithUri removed in 1.x. Part is now a
    flat proto message: Part(text=…) / Part(url=…, filename=…,
    media_type=…). Updated the file-attachment branch (only
    reachable when an agent emits files; the harness's PONG path
    didn't exercise this, but it's a latent crash).
  • Message field names: messageId/taskId/contextId →
    message_id/task_id/context_id (proto3 snake_case).
  • Role enum: Role.agent → Role.ROLE_AGENT (proto enum).

Outbound JSON-RPC payloads (8 files):
  • "role": "user" → "role": "ROLE_USER" — proto3 JSON serialization
    is strict about enum values. Sites: a2a_client, a2a_cli, main
    (initial+idle prompts), heartbeat, builtin_tools/a2a_tools,
    builtin_tools/delegation. Wire JSON keys stay camelCase
    (proto3 default), only the role enum value changed.

google-adk/adapter.py:
  • new_agent_text_message → new_text_message (4 sites). This
    adapter's directory has a hyphen, so it can't be imported as a
    Python module — effectively dead code, but the wheel ships the
    file and a future fix should keep it correct against 1.x.

Why one PR instead of seven: every previous a2a-sdk migration find
landed as its own publish → cascade → harness → next-bug cycle.
Today's audit ran every a2a-sdk symbol/type/method in workspace/
against the installed 1.0.2 wheel in a single sweep + tested the
critical paths (Message construction, Part construction, Role enum
parsing) against the actual SDK. Should be the last migration PR.

Verified locally:
  python3 scripts/build_runtime_package.py --version 0.1.99 \
      --out /tmp/build-final
  pip install /tmp/build-final
  python -c "import molecule_runtime.main; \
             from molecule_runtime.a2a_executor import LangGraphA2AExecutor"
  → ✓ all imports clean against a2a-sdk 1.0.2

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 09:42:57 -07:00
Hongming Wang
c159d85eb5 fix(a2a): review-driven hardening — prefix-anchored type check, error_detail cap, shared hint module
Three required fixes from the bundle review of 391e1872:

1. workspace/a2a_client.py: substring `type_name in msg` could miss
   the diagnostic prefix when an exception's message embedded a
   different class name mid-string (e.g. `OSError("see ConnectionError
   below")` → printed as plain msg, type lost). Switched to a
   prefix-anchored check (`msg.startswith(f"{type_name}:")` etc.) so
   the type label is always added when not already at the start of
   the message.

2. workspace/a2a_tools.py: `activity_logs.error_detail` is unbounded
   TEXT on the platform (handlers/activity.go does not validate
   length). A buggy or hostile peer could stream arbitrarily large
   error messages into the caller's activity log. Cap at 4096 chars
   at the producer — comfortably above any real exception traceback,
   well below an obvious-DoS threshold.

3. New regression test for JSON-RPC `code=0` — pins the
   `code is not None` semantics so the code is preserved in the
   detail rather than collapsing into the no-code path. Code=0 is
   not valid per the spec, but a malformed peer can still emit it
   and we want it visible for diagnosis.

Plus one optional taken: extracted the A2A-error → hint mapping into
canvas/src/components/tabs/chat/a2aErrorHint.ts. The two prior copies
(AgentCommsPanel.inferCauseHint + ActivityTab.inferA2AErrorHint) had
already drifted — Activity tab gained `not found`/`offline` cases the
chat panel never picked up, AgentCommsPanel handled empty-input
explicitly while Activity didn't. The shared module is the merged
superset, with 10 unit tests pinning each named pattern + the
"most specific first" ordering (Claude SDK wedge wins over generic
timeout).

Skipped (per analysis):
- Unicode-naive 120-char slice — Python str[:N] slices on code
  points, not bytes. Safe.
- Nested [A2A_ERROR] confusion — non-issue per reviewer; outer
  prefix winning still produces a structured render.
- MessagePreview + JsonBlock dual render on errors — intentional
  drilldown; raw JSON is below the fold for operators who need it.
- console.warn dedup — refetches don't happen per-event so spam
  risk is low.
- str(data)[:200] materialization — A2A response bodies aren't
  typically MB-sized.

Verified: 1005 canvas tests pass (10 new hint tests); 10 Python
send_a2a_message tests pass (1 new for code=0); tsc clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 23:47:44 -07:00
Hongming Wang
391e187281 fix(a2a,canvas): make delivery failures comprehensive instead of "[A2A_ERROR] "
Symptom: Activity tab and Agent Comms surfaced bare "[A2A_ERROR] "
(prefix + nothing) for failed delegations. Operator had no signal
to act on — no exception type, no target, no hint about what went
wrong, no next step. Fix is in three layers.

1. workspace/a2a_client.py — every error path now produces an
   actionable detail string:

   - except branch: some httpx exceptions (RemoteProtocolError,
     ConnectionReset variants) stringify to "". Pre-fix the catch
     was `f"{_A2A_ERROR_PREFIX}{e}"` → bare prefix. Now falls back
     to `<TypeName> (no message — likely connection reset or silent
     timeout)` and always appends `[target=<url>]` for traceability
     in chained delegations.
   - JSON-RPC error branch: previously dropped error.code on the
     floor and printed "unknown" when message was missing. Now
     surfaces both, including the well-defined "JSON-RPC error
     with no message (code=N)" path.
   - "neither result nor error" branch: pre-fix returned
     str(payload) which the canvas rendered as a successful
     response block. Now tagged as A2A_ERROR with a payload
     snippet so downstream UI routes through the error path.

2. workspace/a2a_tools.py — tool_delegate_task now passes
   error_detail (the stripped error message) through to the
   activity-log POST. The platform's activity_logs.error_detail
   column is the canvas's red error chip source; populating it
   makes the failure visible in the row header without the user
   having to expand into raw response_body JSON. The summary line
   also gets a 120-char prefix of the cause so the collapsed row
   reads "React Engineer failed: ConnectionResetError: ... [target=...]"
   instead of "React Engineer failed".

3. canvas/src/components/tabs/ActivityTab.tsx — MessagePreview
   now detects [A2A_ERROR]-prefixed bodies and renders a
   structured error block (red chip, stripped detail, cause hint)
   instead of the previous gray text-block that showed the literal
   "[A2A_ERROR]" string. inferA2AErrorHint mirrors the patterns
   from AgentCommsPanel.inferCauseHint so the same symptom reads
   the same way in both surfaces (Claude SDK init wedge → restart
   workspace; timeout → busy/stuck; connection-reset → transient
   blip then check logs).

Tests: 9 send_a2a_message tests pass (including a new regression
test for the empty-stringifying-exception case that the user
reported); 995 canvas tests pass; tsc clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 23:40:05 -07:00
Hongming Wang
65b531acf6 fix(workspace): tag self-originated A2A POSTs with X-Workspace-ID
Workspace runtime fired four classes of A2A request to the platform
without the X-Workspace-ID header that identifies the source
workspace: heartbeat self-messages, initial_prompt, idle-loop fires,
and peer-to-peer A2A from runtime tools. The platform's a2a_receive
logger keys source_id off that header — without it, every such row
was written with source_id=NULL, which the canvas's My Chat tab
filters as ?source=canvas (i.e. "user typed this") and rendered the
internal triggers as if the human user had sent them. The
"Delegation results are ready..." heartbeat trigger was visible to
end users in the chat history; delegate_task A2A calls between agents
were misclassified the same way.

Centralise the header construction in a new platform_auth helper
self_source_headers(workspace_id) that returns auth_headers() PLUS
{X-Workspace-ID: <id>}. Apply it to:

  - heartbeat.py self-message (refactored from inline header dict)
  - main.py initial_prompt POST
  - main.py idle_prompt POST
  - a2a_client.py send_a2a_message (peer A2A from runtime)
  - builtin_tools/a2a_tools.py delegate_task (was missing ALL headers)

Tests:
  - test_heartbeat.py asserts the X-Workspace-ID header is set on
    the self-message POST.
  - test_a2a_tools_module.py asserts the same on delegate_task POSTs;
    FakeClient.post mocks updated to accept the headers kwarg.

Production effect lands the moment workspace containers are rebuilt
with this code; existing rows in activity_logs keep their NULL
source_id (legacy data). The canvas-side filter (#follow-up)
covers the historical-rows case until backfill.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 19:54:43 -07:00
molecule-ai[bot]
3bef6af241 fix: apply #1124 env-var defaults + scrub F1088 credentials from INCIDENT_LOG.md (#1347)
- PLATFORM_URL: replace unreachable http://platform:8080 mesh-only default
  with Docker-aware detection (host.docker.internal in containers,
  localhost for local dev) across all workspace Python modules and the
  git-token-helper shell script.
- WORKSPACE_ID: add fail-fast validation in main.py (SystemExit if empty)
  consistent with coordinator.py / a2a_cli.py patterns already in place.
- INCIDENT_LOG.md: replace all 3 F1088 credential types with
  ***REDACTED*** (sk-cp- 2x, github_pat_ 2x, ADMIN_TOKEN base64 3x).

Fixes #1124, #1333.

Co-authored-by: Molecule AI Dev Lead <dev-lead@agents.moleculesai.app>
2026-04-21 08:11:44 +00:00
molecule-ai[bot]
e07e22ad57 fix(orchestrator): fail-fast if WORKSPACE_ID env var is unset/empty (#1124) (#1336)
* fix(orchestrator): fail-fast if WORKSPACE_ID env var is unset/empty

Issue #1124: orchestrator GET /workspaces/{WORKSPACE_ID} returned 404
because 5 Python modules defaulted WORKSPACE_ID to "" instead of
validating the injected value. Empty string produced URLs like
/workspaces//heartbeat — route not found.

Fix: raise RuntimeError at module load if WORKSPACE_ID is unset
or empty, rather than silently producing broken API calls downstream.

Files changed (all same pattern):
- workspace/a2a_cli.py
- workspace/a2a_client.py
- workspace/coordinator.py
- workspace/consolidation.py
- workspace/molecule_ai_status.py

The platform (provisioner.go:375) correctly injects WORKSPACE_ID at
container provision time. This fix ensures the orchestrator surfaces
the misconfiguration immediately instead of failing silently at runtime.

Closes #1124.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(incidents): rebuild INCIDENT_LOG — linter reset, all sections restored

Rebuilt after linter reset. Sections restored:
- Security Audit Cycle 6 (abc58b47)
- F1100 workspace_restart.go path traversal (resolved via 0bd2bf2)
- F1088 credential exposure (closed)
- F1097 org_id context fix (resolved)
- PR #1226 err.Error() leaks (stale)
- QA Round 18 orgs-page regression (fixed on main, staging pending)
- Issue #1124 fix PR #1336 filed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Molecule AI Core Platform Lead <core-platform-lead@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 07:42:00 +00:00
Hongming Wang
d8026347e5 chore: open-source restructure — rename dirs, remove internal files, scrub secrets
Renames:
- platform/ → workspace-server/ (Go module path stays as "platform" for
  external dep compat — will update after plugin module republish)
- workspace-template/ → workspace/

Removed (moved to separate repos or deleted):
- PLAN.md — internal roadmap (move to private project board)
- HANDOFF.md, AGENTS.md — one-time internal session docs
- .claude/ — gitignored entirely (local agent config)
- infra/cloudflare-worker/ → Molecule-AI/molecule-tenant-proxy
- org-templates/molecule-dev/ → standalone template repo
- .mcp-eval/ → molecule-mcp-server repo
- test-results/ — ephemeral, gitignored

Security scrubbing:
- Cloudflare account/zone/KV IDs → placeholders
- Real EC2 IPs → <EC2_IP> in all docs
- CF token prefix, Neon project ID, Fly app names → redacted
- Langfuse dev credentials → parameterized
- Personal runner username/machine name → generic

Community files:
- CONTRIBUTING.md — build, test, branch conventions
- CODE_OF_CONDUCT.md — Contributor Covenant 2.1

All Dockerfiles, CI workflows, docker-compose, railway.toml, render.yaml,
README, CLAUDE.md updated for new directory names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 00:24:44 -07:00