Commit Graph

3 Commits

Author SHA1 Message Date
Hongming Wang
5071454074 fix(delegation): lazy-refresh QUEUED state from platform; live DELEGATION_* events
Critical follow-up to PR #2126's review. Two real bugs:

1. **Runtime QUEUED never resolved.** Platform's drain stitch updates
   the platform's delegate_result row when a queued delegation finally
   completes, but never pushes back to the runtime. The LLM polling
   check_delegation_status saw status="queued" forever — combined with
   the new docstring guidance ("queued → wait, peer will reply"), the
   model would wait indefinitely on a state that never resolves.
   Strictly worse than pre-PR behavior where it would have at least
   bypassed.

2. **Live updates dead code.** delegation.go writes activity rows by
   direct INSERT INTO activity_logs, bypassing the LogActivity helper
   that fires ACTIVITY_LOGGED. Adding "delegation" to the canvas's
   ACTIVITY_LOGGED filter (PR #2126 first cut) was inert — initial
   GET worked, live updates did not.

Fix:

(1) Runtime side, workspace/builtin_tools/delegation.py:
  - New `_refresh_queued_from_platform(task_id)` async helper that
    pulls /workspaces/<self>/delegations and finds the platform-side
    delegate_result row for our task_id.
  - check_delegation_status calls _refresh when local status is
    QUEUED, so the LLM's poll itself drives state convergence.
  - Best-effort: GET failure leaves local state untouched, next
    poll retries.
  - Docstring updated to reflect the actual behavior ("polls
    transparently — keep polling and you'll see the flip").
  - 4 new tests cover: QUEUED → completed via refresh; QUEUED →
    failed via refresh; refresh keeps QUEUED when platform hasn't
    resolved; refresh swallows network errors safely.

(2) Canvas side, AgentCommsPanel.tsx WS push handler:
  - Listens for DELEGATION_SENT / DELEGATION_STATUS / DELEGATION_COMPLETE
    / DELEGATION_FAILED in addition to ACTIVITY_LOGGED.
  - Each event's payload synthesized into an ActivityEntry shape
    so toCommMessage's existing delegation branch maps it. Status
    derived: STATUS uses payload.status, COMPLETE → "completed",
    FAILED → "failed", SENT → "pending".
  - The ACTIVITY_LOGGED branch keeps the "delegation" type accepted
    as a no-op-today / future-proof path: if delegation handlers
    are ever refactored to call LogActivity, this lights up
    automatically without another canvas change.

Doesn't change: the docstring guidance ("queued → wait, don't bypass")
is now actually load-bearing because the refresh path will deliver
the eventual outcome. Without the refresh, the guidance was a trap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 16:05:04 -07:00
Hongming Wang
057876cb0c fix(delegation): runtime handles 202+queued; canvas surfaces delegation rows
Two bugs that compounded into the "Director does the work itself" UX:

1. workspace/builtin_tools/delegation.py: _execute_delegation only
   handled HTTP 200 in the response branch. When the peer's a2a-proxy
   returned HTTP 202 + {queued: true} (single-SDK-session bottleneck
   on the peer), the loop fell through. Two iterations later the
   `if "error" in result` check tried to access an unbound `result`,
   the goroutine ended quietly, and the delegation stayed at FAILED
   with error="None". The LLM checking status saw "failed" + the
   platform's "Delegation queued — target at capacity" log line in
   chat context, concluded the peer was permanently unavailable, and
   bypassed delegation to do the work itself.

   Fix: explicit 202+queued branch. Adds DelegationStatus.QUEUED,
   marks the local delegation as QUEUED, mirrors to the platform,
   and returns cleanly without retrying. The retry loop is for
   transient transport errors — queueing is a real ack, not a failure
   to retry against (retrying would just re-queue the same task).

   check_delegation_status docstring extended with explicit per-status
   guidance: pending/in_progress → wait, queued → wait (peer busy on
   prior task, reply WILL arrive), completed → use result, failed →
   real error in error field; only fall back on failed, never queued.

2. canvas/src/components/tabs/chat/AgentCommsPanel.tsx: filter dropped
   every delegation row because it whitelisted only a2a_send /
   a2a_receive. activity_type='delegation' rows (written by the
   platform's /delegate handler with method='delegate' or
   'delegate_result') never reached toCommMessage. User saw "No
   agent-to-agent communications yet" while 6+ delegations existed
   in the DB.

   Fix: include "delegation" in the both the initial filter and the
   WS push filter, plus a delegation branch in toCommMessage that
   maps the row as outbound (always — platform proxies on our behalf)
   and uses summary as the primary text source.

Tests:
  - 3 new Python tests cover the 202+queued path: status becomes
    QUEUED not FAILED; no retry on queued (counted by URL match
    against the A2A target since the mock is shared across all
    AsyncClient calls); bare 202 without {queued:true} still
    falls through to the existing retry-then-FAILED path.
  - 3 new TS tests cover the delegation mapper: 'delegate' row
    maps as outbound to target with summary text; queued
    'delegate_result' preserves status='queued' (load-bearing for
    the LLM's wait-vs-bypass decision); missing target_id returns
    null instead of rendering a ghost.

Does NOT solve: the underlying single-SDK-session bottleneck that
causes peers to queue in the first place. Tracked as task #102
(parallel SDK sessions per workspace) — real architectural work.
This PR makes the runtime handle the queueing correctly so the LLM
doesn't bail out, and makes the delegations visible in Agent Comms
so operators can see what's happening.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 15:01:50 -07:00
Hongming Wang
479a027e4b chore: open-source restructure — rename dirs, remove internal files, scrub secrets
Renames:
- platform/ → workspace-server/ (Go module path stays as "platform" for
  external dep compat — will update after plugin module republish)
- workspace-template/ → workspace/

Removed (moved to separate repos or deleted):
- PLAN.md — internal roadmap (move to private project board)
- HANDOFF.md, AGENTS.md — one-time internal session docs
- .claude/ — gitignored entirely (local agent config)
- infra/cloudflare-worker/ → Molecule-AI/molecule-tenant-proxy
- org-templates/molecule-dev/ → standalone template repo
- .mcp-eval/ → molecule-mcp-server repo
- test-results/ — ephemeral, gitignored

Security scrubbing:
- Cloudflare account/zone/KV IDs → placeholders
- Real EC2 IPs → <EC2_IP> in all docs
- CF token prefix, Neon project ID, Fly app names → redacted
- Langfuse dev credentials → parameterized
- Personal runner username/machine name → generic

Community files:
- CONTRIBUTING.md — build, test, branch conventions
- CODE_OF_CONDUCT.md — Contributor Covenant 2.1

All Dockerfiles, CI workflows, docker-compose, railway.toml, render.yaml,
README, CLAUDE.md updated for new directory names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 00:24:44 -07:00