Commit Graph

6 Commits

Author SHA1 Message Date
Hongming Wang
dd57a840b6 fix: comprehensive a2a-sdk 1.x migration sweep across workspace/
Audited every a2a-sdk surface in workspace/ against the installed
1.0.2 wheel. Found and fixed:

main.py (the live workspace startup path):
  • create_jsonrpc_routes(rpc_url='/', enable_v0_3_compat=True) —
    rpc_url required in 1.x; v0.3 compat enables inbound legacy
    clients (`"role": "user"` lowercase) without forcing them to
    upgrade. Pairs with the outbound rename below.

a2a_executor.py:
  • TextPart/FilePart/FileWithUri removed in 1.x. Part is now a
    flat proto message: Part(text=…) / Part(url=…, filename=…,
    media_type=…); see the sketch after this list. Updated the
    file-attachment branch (only reachable when an agent emits
    files; the harness's PONG path didn't exercise this, but it
    was a latent crash).
  • Message field names: messageId/taskId/contextId →
    message_id/task_id/context_id (proto3 snake_case).
  • Role enum: Role.agent → Role.ROLE_AGENT (proto enum).
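The three changes above, as a before/after sketch (constructor
shapes per this audit; the variable names and the parts field are
illustrative):

  # before (0.x style, removed in 1.x)
  part = TextPart(text=chunk)
  msg = Message(messageId=mid, taskId=tid, contextId=cid,
                role=Role.agent, parts=[part])

  # after (1.x: flat Part proto, snake_case fields, proto enum)
  text_part = Part(text=chunk)
  file_part = Part(url=uri, filename=name, media_type=mime)
  msg = Message(message_id=mid, task_id=tid, context_id=cid,
                role=Role.ROLE_AGENT, parts=[text_part, file_part])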

Outbound JSON-RPC payloads (8 files):
  • "role": "user" → "role": "ROLE_USER" — proto3 JSON serialization
    is strict about enum values. Sites: a2a_client, a2a_cli, main
    (initial+idle prompts), heartbeat, builtin_tools/a2a_tools,
    builtin_tools/delegation. Wire JSON keys stay camelCase
    (proto3 default); only the role enum value changed (example
    below).
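One payload site, sketched (the surrounding keys are illustrative;
only the role value is what this sweep changed):

  message = {
      "messageId": msg_id,        # wire keys stay camelCase
      "role": "ROLE_USER",        # was "user"; strict proto3 JSON rejects it
      "parts": [{"text": text}],
  }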

google-adk/adapter.py:
  • new_agent_text_message → new_text_message (4 sites; sketch
    below). This adapter's directory has a hyphen, so it can't be
    imported as a Python module and is effectively dead code, but
    the wheel ships the file, so it should stay correct against
    1.x for whenever that import path is fixed.
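The rename, roughly (argument shape assumed, not re-checked):

  # a2a-sdk 0.x
  msg = new_agent_text_message(text)
  # a2a-sdk 1.x
  msg = new_text_message(text)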

Why one PR instead of seven: each previous a2a-sdk migration
finding landed as its own publish → cascade → harness → next-bug
cycle. Today's audit checked every a2a-sdk symbol/type/method in
workspace/ against the installed 1.0.2 wheel in a single sweep and
tested the critical paths (Message construction, Part construction,
Role enum parsing) against the actual SDK. This should be the last
migration PR.

Verified locally:
  python3 scripts/build_runtime_package.py --version 0.1.99 \
      --out /tmp/build-final
  pip install /tmp/build-final
  python -c "import molecule_runtime.main; \
             from molecule_runtime.a2a_executor import LangGraphA2AExecutor"
  → ✓ all imports clean against a2a-sdk 1.0.2

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 09:42:57 -07:00
Hongming Wang
5071454074 fix(delegation): lazy-refresh QUEUED state from platform; live DELEGATION_* events
Critical follow-up to PR #2126's review. Two real bugs:

1. **Runtime QUEUED never resolved.** The platform's drain stitch
   updates the platform's delegate_result row when a queued
   delegation finally completes, but never pushes that update back
   to the runtime. An LLM polling check_delegation_status saw
   status="queued" forever; combined with the new docstring
   guidance ("queued → wait, peer will reply"), the model would
   wait indefinitely on a state that never resolves. Strictly worse
   than pre-PR behavior, where it would at least have bypassed.

2. **Live updates were dead code.** delegation.go writes activity
   rows via direct INSERT INTO activity_logs, bypassing the
   LogActivity helper that fires ACTIVITY_LOGGED. Adding
   "delegation" to the canvas's ACTIVITY_LOGGED filter (PR #2126's
   first cut) was therefore inert: the initial GET worked, but live
   updates never arrived.

Fix:

(1) Runtime side, workspace/builtin_tools/delegation.py:
  - New `_refresh_queued_from_platform(task_id)` async helper that
    pulls /workspaces/<self>/delegations and finds the platform-side
    delegate_result row for our task_id (sketched after this list).
  - check_delegation_status calls _refresh when local status is
    QUEUED, so the LLM's poll itself drives state convergence.
  - Best-effort: GET failure leaves local state untouched, next
    poll retries.
  - Docstring updated to reflect the actual behavior ("polls
    transparently — keep polling and you'll see the flip").
  - 4 new tests cover: QUEUED → completed via refresh; QUEUED →
    failed via refresh; refresh keeps QUEUED when platform hasn't
    resolved; refresh swallows network errors safely.
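A minimal sketch of the refresh path (`client`, `SELF_ID`, and
`_store` are hypothetical stand-ins for the real plumbing; the row
fields are assumed from the platform's delegate_result shape):

  async def _refresh_queued_from_platform(task_id: str) -> None:
      try:
          resp = await client.get(f"/workspaces/{SELF_ID}/delegations")
          resp.raise_for_status()
      except Exception:
          return  # best-effort: leave local state untouched; next poll retries
      for row in resp.json():
          if (row.get("task_id") == task_id
                  and row.get("status") in ("completed", "failed")):
              _store.update(task_id, status=row["status"], result=row)
              return

  # in check_delegation_status, before reporting local state:
  if _store.get(task_id).status == DelegationStatus.QUEUED:
      await _refresh_queued_from_platform(task_id)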

(2) Canvas side, AgentCommsPanel.tsx WS push handler:
  - Listens for DELEGATION_SENT / DELEGATION_STATUS / DELEGATION_COMPLETE
    / DELEGATION_FAILED in addition to ACTIVITY_LOGGED.
  - Each event's payload is synthesized into an ActivityEntry shape
    so toCommMessage's existing delegation branch maps it. Status
    derived: STATUS uses payload.status, COMPLETE → "completed",
    FAILED → "failed", SENT → "pending".
  - The ACTIVITY_LOGGED branch keeps the "delegation" type accepted
    as a no-op-today / future-proof path: if delegation handlers
    are ever refactored to call LogActivity, this lights up
    automatically without another canvas change.

Unchanged: the docstring guidance ("queued → wait, don't bypass").
It is now actually load-bearing, because the refresh path will
deliver the eventual outcome. Without the refresh, the guidance was
a trap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 16:05:04 -07:00
Hongming Wang
057876cb0c fix(delegation): runtime handles 202+queued; canvas surfaces delegation rows
Two bugs that compounded into the "Director does the work itself" UX:

1. workspace/builtin_tools/delegation.py: _execute_delegation only
   handled HTTP 200 in the response branch. When the peer's a2a-proxy
   returned HTTP 202 + {queued: true} (single-SDK-session bottleneck
   on the peer), the loop fell through. Two iterations later the
   `if "error" in result` check tried to access an unbound `result`,
   the coroutine ended quietly, and the delegation stayed at FAILED
   with error="None". The LLM checking status saw "failed" + the
   platform's "Delegation queued — target at capacity" log line in
   chat context, concluded the peer was permanently unavailable, and
   bypassed delegation to do the work itself.

   Fix: explicit 202+queued branch (sketched below). Adds
   DelegationStatus.QUEUED, marks the local delegation as QUEUED,
   mirrors to the platform, and returns cleanly without retrying.
   The retry loop is for transient transport errors; queueing is a
   real ack, not a failure to retry against (retrying would just
   re-queue the same task).

   check_delegation_status docstring extended with explicit per-status
   guidance: pending/in_progress → wait, queued → wait (peer busy on
   prior task, reply WILL arrive), completed → use result, failed →
   real error in error field; only fall back on failed, never queued.
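   The branch, sketched (the surrounding names are assumptions;
   DelegationStatus.QUEUED is what this PR adds):

     resp = await client.post(peer_url, json=payload, headers=headers)
     if resp.status_code == 202 and resp.json().get("queued"):
         # A real ack, not a transport failure: retrying would just re-queue.
         delegation.status = DelegationStatus.QUEUED
         await _mirror_to_platform(delegation)  # hypothetical mirror helper
         return
     if resp.status_code == 200:
         result = resp.json()
         # ... existing success path; `result` is now always bound before use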

2. canvas/src/components/tabs/chat/AgentCommsPanel.tsx: filter dropped
   every delegation row because it whitelisted only a2a_send /
   a2a_receive. activity_type='delegation' rows (written by the
   platform's /delegate handler with method='delegate' or
   'delegate_result') never reached toCommMessage. User saw "No
   agent-to-agent communications yet" while 6+ delegations existed
   in the DB.

   Fix: include "delegation" in both the initial filter and the
   WS push filter, plus a delegation branch in toCommMessage that
   maps the row as outbound (always — platform proxies on our behalf)
   and uses summary as the primary text source.

Tests:
  - 3 new Python tests cover the 202+queued path: status becomes
    QUEUED not FAILED; no retry on queued (counted by URL match
    against the A2A target since the mock is shared across all
    AsyncClient calls); bare 202 without {queued:true} still
    falls through to the existing retry-then-FAILED path.
  - 3 new TS tests cover the delegation mapper: 'delegate' row
    maps as outbound to target with summary text; queued
    'delegate_result' preserves status='queued' (load-bearing for
    the LLM's wait-vs-bypass decision); missing target_id returns
    null instead of rendering a ghost.

Does NOT solve: the underlying single-SDK-session bottleneck that
causes peers to queue in the first place. Tracked as task #102
(parallel SDK sessions per workspace) — real architectural work.
This PR makes the runtime handle the queueing correctly so the LLM
doesn't bail out, and makes the delegations visible in Agent Comms
so operators can see what's happening.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 15:01:50 -07:00
Hongming Wang
65b531acf6 fix(workspace): tag self-originated A2A POSTs with X-Workspace-ID
Workspace runtime fired four classes of A2A request to the platform
without the X-Workspace-ID header that identifies the source
workspace: heartbeat self-messages, initial_prompt, idle-loop fires,
and peer-to-peer A2A from runtime tools. The platform's a2a_receive
logger keys source_id off that header — without it, every such row
was written with source_id=NULL, which the canvas's My Chat tab
filters as ?source=canvas (i.e. "user typed this") and rendered the
internal triggers as if the human user had sent them. The
"Delegation results are ready..." heartbeat trigger was visible to
end users in the chat history; delegate_task A2A calls between agents
were misclassified the same way.

Centralise the header construction in a new platform_auth helper
self_source_headers(workspace_id) that returns auth_headers() PLUS
{X-Workspace-ID: <id>} (sketched after the list below). Apply it to:

  - heartbeat.py self-message (refactored from inline header dict)
  - main.py initial_prompt POST
  - main.py idle_prompt POST
  - a2a_client.py send_a2a_message (peer A2A from runtime)
  - builtin_tools/a2a_tools.py delegate_task (was missing ALL headers)
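A sketch of the helper and the call-site shape (auth_headers() is
the existing platform_auth helper; the copy-then-set detail is an
assumption):

  def self_source_headers(workspace_id: str) -> dict[str, str]:
      headers = dict(auth_headers())  # copy so the shared dict isn't mutated
      headers["X-Workspace-ID"] = workspace_id
      return headers

  # e.g. in heartbeat.py's self-message POST:
  await client.post(url, json=body, headers=self_source_headers(WORKSPACE_ID))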

Tests:
  - test_heartbeat.py asserts the X-Workspace-ID header is set on
    the self-message POST.
  - test_a2a_tools_module.py asserts the same on delegate_task POSTs;
    FakeClient.post mocks updated to accept the headers kwarg.

Production effect lands the moment workspace containers are rebuilt
with this code; existing rows in activity_logs keep their NULL
source_id (legacy data). The canvas-side filter (#follow-up)
covers the historical-rows case until backfill.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 19:54:43 -07:00
molecule-ai[bot]
3bef6af241 fix: apply #1124 env-var defaults + scrub F1088 credentials from INCIDENT_LOG.md (#1347)
- PLATFORM_URL: replace unreachable http://platform:8080 mesh-only default
  with Docker-aware detection (host.docker.internal in containers,
  localhost for local dev) across all workspace Python modules and the
  git-token-helper shell script (sketch after this list).
- WORKSPACE_ID: add fail-fast validation in main.py (SystemExit if empty)
  consistent with coordinator.py / a2a_cli.py patterns already in place.
- INCIDENT_LOG.md: replace all 3 F1088 credential types with
  ***REDACTED*** (sk-cp- 2x, github_pat_ 2x, ADMIN_TOKEN base64 3x).
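A sketch of the Python-side shape (the /.dockerenv probe and the
error string are assumptions; the commit doesn't pin the exact
detection mechanism):

  import os
  import pathlib

  def default_platform_url() -> str:
      # containers reach the host via host.docker.internal; local dev uses localhost
      in_container = pathlib.Path("/.dockerenv").exists()
      host = "host.docker.internal" if in_container else "localhost"
      return f"http://{host}:8080"  # port carried over from http://platform:8080

  PLATFORM_URL = os.environ.get("PLATFORM_URL") or default_platform_url()

  WORKSPACE_ID = os.environ.get("WORKSPACE_ID", "").strip()
  if not WORKSPACE_ID:
      raise SystemExit("WORKSPACE_ID must be set")  # fail fast, per coordinator.py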

Fixes #1124, #1333.

Co-authored-by: Molecule AI Dev Lead <dev-lead@agents.moleculesai.app>
2026-04-21 08:11:44 +00:00
Hongming Wang
d8026347e5 chore: open-source restructure — rename dirs, remove internal files, scrub secrets
Renames:
- platform/ → workspace-server/ (Go module path stays as "platform" for
  external dep compat — will update after plugin module republish)
- workspace-template/ → workspace/

Removed (moved to separate repos or deleted):
- PLAN.md — internal roadmap (move to private project board)
- HANDOFF.md, AGENTS.md — one-time internal session docs
- .claude/ — gitignored entirely (local agent config)
- infra/cloudflare-worker/ → Molecule-AI/molecule-tenant-proxy
- org-templates/molecule-dev/ → standalone template repo
- .mcp-eval/ → molecule-mcp-server repo
- test-results/ — ephemeral, gitignored

Security scrubbing:
- Cloudflare account/zone/KV IDs → placeholders
- Real EC2 IPs → <EC2_IP> in all docs
- CF token prefix, Neon project ID, Fly app names → redacted
- Langfuse dev credentials → parameterized
- Personal runner username/machine name → generic

Community files:
- CONTRIBUTING.md — build, test, branch conventions
- CODE_OF_CONDUCT.md — Contributor Covenant 2.1

All Dockerfiles, CI workflows, docker-compose, railway.toml, render.yaml,
README, CLAUDE.md updated for new directory names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 00:24:44 -07:00