molecule-core/docs/edit-history/2026-04-13.md
Hongming Wang d8026347e5 chore: open-source restructure — rename dirs, remove internal files, scrub secrets
Renames:
- platform/ → workspace-server/ (Go module path stays as "platform" for
  external dep compat — will update after plugin module republish)
- workspace-template/ → workspace/

Removed (moved to separate repos or deleted):
- PLAN.md — internal roadmap (move to private project board)
- HANDOFF.md, AGENTS.md — one-time internal session docs
- .claude/ — gitignored entirely (local agent config)
- infra/cloudflare-worker/ → Molecule-AI/molecule-tenant-proxy
- org-templates/molecule-dev/ → standalone template repo
- .mcp-eval/ → molecule-mcp-server repo
- test-results/ — ephemeral, gitignored

Security scrubbing:
- Cloudflare account/zone/KV IDs → placeholders
- Real EC2 IPs → <EC2_IP> in all docs
- CF token prefix, Neon project ID, Fly app names → redacted
- Langfuse dev credentials → parameterized
- Personal runner username/machine name → generic

Community files:
- CONTRIBUTING.md — build, test, branch conventions
- CODE_OF_CONDUCT.md — Contributor Covenant 2.1

All Dockerfiles, CI workflows, docker-compose, railway.toml, render.yaml,
README, CLAUDE.md updated for new directory names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 00:24:44 -07:00


2026-04-13 — edit history

Summary — Quality + Infra Pass (PRs #1–#8, all merged)

Eight PRs landed today in a focused quality pass. No user-facing feature changes; the payoff is faster onboarding, lower merge friction, and stronger CI gates.

Brand + structural

  • PR #1 chore/branding-icons — replaced molecule-icon.png across canvas/public/, canvas/src/app/, docs/assets/branding/; added HANDOFF.md at the repo root; fixed a comment typo in .githooks/pre-commit.
  • PR #3 chore/structural-cleanup — deleted empty workspace-server/plugins/; moved examples/remote-agent/ → sdk/python/examples/remote-agent/ and docs/superpowers/plans/ → plugins/superpowers/plans/; added READMEs to tests/ and docs/; gitignored .agents/, workspace-server/workspace-configs-templates/, backups/, logs/, test-results/.
  • LICENSE: trailing brand-migration fix — "Agent Molecule" → "Molecule AI".

MCP server refactor (PRs #2, #4, #7)

  • mcp-server/src/index.ts shrank from 1697 → 89 lines. Tool handlers now live in per-domain modules under mcp-server/src/tools/: workspaces.ts, agents.ts, secrets.ts, files.ts, memory.ts, plugins.ts, channels.ts, delegation.ts, schedules.ts, approvals.ts, discovery.ts, remote_agents.ts.
  • New shared HTTP layer mcp-server/src/api.ts exports PLATFORM_URL, generic apiCall<T>, ApiError type, isApiError() guard, toMcpResult(), toMcpText().
  • Each tools/*.ts exports handlers + a registerXxxTools(srv) function. createServer() in index.ts wires them.
  • Fixed handleGetRemoteAgentSetupCommand — emits a valid python3 -c "from molecule_agent import RemoteAgentClient; …" one-liner (was an invalid python3 -m examples.remote-agent.run).
  • MCP now reports 87 tools on startup (older logs / docs said "61" — both updated).

Canvas (PRs, shipped across session)

  • Replaced native window.confirm / alert with ConfirmDialog in seven sites: ChannelsTab.tsx, ScheduleTab.tsx, ChatTab.tsx, TemplatePalette.tsx (×2), ErrorBoundary.tsx (×2 removed; buttons are self-evident).
  • New singleButton prop on ConfirmDialog for info-toast usage, plus 5 new vitest cases at canvas/src/components/__tests__/ConfirmDialog.test.tsx.
  • ErrorBoundary clipboard write now catches rejections and logs to console.warn.
  • Vitest count: 352 → 357.

Platform — handler decomposition (pure refactor)

Four oversize handler functions split into private helpers — behavior unchanged, but each extracted helper is now directly unit-tested.

  • a2a_proxy.go::proxyA2ARequest (257 → 56 lines). New helpers: resolveAgentURL, normalizeA2APayload, dispatchA2A, handleA2ADispatchError, maybeMarkContainerDead, logA2AFailure, logA2ASuccess; sentinel proxyDispatchBuildError.
  • delegation.go::Delegate (127 → 60 lines). New helpers: bindDelegateRequest, lookupIdempotentDelegation, insertDelegationRow; typed insertDelegationOutcome enum (zero value insertOutcomeUnknown) replaces a positional (bool, bool) return.
  • discovery.go::Discover (125 → 40 lines). New helpers: discoverWorkspacePeer, writeExternalWorkspaceURL, discoverHostPeer.
  • activity.go::SessionSearch (109 → 24 lines). New helpers: parseSessionSearchParams, buildSessionSearchQuery, scanSessionSearchRows.

+47 Go unit tests; workspace-server/internal/handlers coverage 56.1 % → 57.6 %.

Config / env documentation

  • .env.example gained 12 previously-undocumented env vars across 6 new sections: PLATFORM_URL, MOLECULE_URL, WORKSPACE_DIR, MOLECULE_ENV, CORS_ORIGINS, RATE_LIMIT, ACTIVITY_RETENTION_DAYS, ACTIVITY_CLEANUP_INTERVAL_HOURS, MOLECULE_IN_DOCKER, AWARENESS_URL, GITHUB_WEBHOOK_SECRET, MOLECLI_URL. All 21 distinct os.Getenv / envx.* keys (except HOME) are now documented.

E2E + CI (PRs #5, #7, #8)

  • New shared helpers tests/e2e/_lib.sh and tests/e2e/_extract_token.py.
  • tests/e2e/test_api.sh updated for Phase 30.1 bearer-token auth and Phase 30.6 X-Workspace-ID requirement on discover/peers; added a pre-test workspace cleanup. 62/62 pass.
  • tests/e2e/test_comprehensive_e2e.sh fixed the token race against the provisioner by registering each workspace immediately after creation. 67/67 pass.
  • tests/e2e/test_activity_e2e.sh re-registers a detected agent to capture its bearer token.
  • tests/e2e/test_claude_code_e2e.sh got shellcheck annotations only.
  • All five scripts are shellcheck-clean.
  • .github/workflows/ci.yml gained two new jobs:
    • e2e-api — Postgres + Redis service containers, migrations applied via docker exec, test_api.sh runs against a freshly-built platform binary.
    • shellcheck — marketplace action lints every tests/e2e/*.sh.
  • Existing Go job got cache: true on setup-go.
  • Bundle round-trip and "status online" assertions now tolerate the async provisioner flipping status, removing flaky false-negatives.

Test totals after today's sync

| Stack | Before | After |
|---|---|---|
| Go (platform) | 648 | 695 |
| Python (workspace) | 1140 | 1140 |
| Canvas (vitest) | 352 | 357 |
| SDK (pytest) | 132 | 132 |
| MCP server (Jest) | 96 | 97 |

Note: only Go (+47 direct tests for extracted handler helpers), canvas (+5 ConfirmDialog singleButton tests), and MCP (+1 createServer smoke test) gained tests today. Python workspace + SDK counts are the pre-session baseline — no pytest additions today. The earlier "1078 / 87" numbers in this session were stale CLAUDE.md baselines, not measurements.


Canvas — org template import (PLAN.md §20.3)

What: added OrgTemplatesSection to canvas/src/components/TemplatePalette.tsx. Lists org templates from GET /org/templates, each entry shows name + description + workspace count + an "Import org" button that POSTs { dir } to /org/import. Renders inside the existing template-palette sidebar, below the workspace template list.

Why: PLAN.md §20.3 had this checkbox unchecked. Platform already exposes the endpoints (handlers/org.go); only the canvas wiring was missing. Authors today have to curl to instantiate multi-workspace orgs — a poor UX given we already curate org-templates/molecule-dev, reno-stars, etc.

How tested: extracted fetchOrgTemplates() and importOrgTemplate() as standalone exports so they're unit-testable in the existing node-only vitest config (no jsdom). 7 new tests cover happy path, non-2xx response, network failure, POST body shape, error propagation, and module exports. Canvas vitest 345 → 352.

Branch: feat/canvas-org-template-import.

Platform — fix #106: plugin uninstall cleanup

Bug: DELETE /workspaces/:id/plugins/:name only removed /configs/plugins/<name>/. Skill dirs copied out to /configs/skills/<skill>/ and rule blocks appended to /configs/CLAUDE.md by AgentskillsAdaptor.install were left behind, so they reappeared after every container auto-restart.

Fix (workspace-server/internal/handlers/plugins.go::Uninstall): before the existing plugin-dir removal, the handler now:

  1. Reads /configs/plugins/<name>/plugin.yaml from the container to learn the plugin's declared skills: list.
  2. Strips every # Plugin: <name> / … block from /configs/CLAUDE.md via an awk script that mirrors AgentskillsAdaptor.uninstall's block layout (marker → blank → content → blank). Other plugins' markers and surrounding user content stay intact.
  3. rm -rf each declared skill dir under /configs/skills/ (with validatePluginName defense against malformed manifest skill names).
  4. Then proceeds with the existing rm -rf /configs/plugins/<name>.
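For illustration, the marker-stripping step can be mirrored in Python. This is a sketch only: the shipped code is Go driving an awk script, and the exact marker prefix (`# Plugin: <name>`) and the single contiguous content run per block are assumptions taken from the block layout described above.

```python
def strip_plugin_blocks(claude_md: str, plugin: str) -> str:
    """Drop every block belonging to `plugin` (marker line, blank line,
    content run, trailing blank) while leaving other plugins' markers
    and surrounding user content intact. Sketch of the awk behavior."""
    marker_prefix = f"# Plugin: {plugin}"  # assumed marker shape
    lines = claude_md.splitlines()
    out, i = [], 0
    while i < len(lines):
        if lines[i].startswith(marker_prefix):
            i += 1                                   # drop the marker line
            if i < len(lines) and not lines[i].strip():
                i += 1                               # drop blank after marker
            while i < len(lines) and lines[i].strip():
                i += 1                               # drop the content run
            if i < len(lines) and not lines[i].strip():
                i += 1                               # drop trailing blank
        else:
            out.append(lines[i])
            i += 1
    return "\n".join(out)
```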

Tests (workspace-server/internal/handlers/plugins_test.go):

  • TestRegexpEscapeForAwk — verifies /, ., [], *+?|, \\, empty string all escape correctly. Caught a real bug (forgot /, awk treated marker as broken regex delimiter).
  • TestStripPluginMarkers_AwkScript — runs the exact awk pipeline the production code uses against a fixture CLAUDE.md with two my-plugin blocks, a keep-me block, and surrounding user content. Asserts both my-plugin blocks (marker + content) gone, keep-me + user content intact, including trailing user content after the last my-plugin block.
  • TestStripPluginMarkers_MissingFileIsNoOp — missing CLAUDE.md must not crash uninstall.

Live E2E: ran fixed binary, installed test plugin (skill in /configs/skills/test-skill/, rules block in CLAUDE.md), called DELETE, confirmed all three artifacts gone, then triggered manual restart and confirmed they stayed gone (the original bug trigger). Other workspace state — review-loop skill, molecule-dev plugin, surrounding CLAUDE.md content — preserved.

Branch: fix/106-plugin-uninstall-cleanup. Closes #106.

Platform — fix #110: A2A busy-response classification

Bug: When an upstream workspace agent is mid-synthesis on a previous request (single-threaded main loop), subsequent A2A requests time out or see the connection reset. The proxy returned 502 failed to reach workspace agent, indistinguishable from a genuinely unreachable agent. 17 such failures recorded over 7h of self-evol loop traffic.

Fix (workspace-server/internal/handlers/a2a_proxy.go): proxyA2AError gains an optional Headers field so handlers can set real response headers. After a2aClient.Do(req) errors, we now classify via isUpstreamBusyError: context.DeadlineExceeded, io.EOF, io.ErrUnexpectedEOF, or stdlib wrap-strings containing "context deadline exceeded", "EOF", "connection reset". When the container is alive and the error matches, return 503 Service Unavailable with Retry-After: 30 and a JSON body {"busy": true, "retry_after": 30}. Fatal / unclassified errors still fall through to the prior 502. Issue #110 Option 3.
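The classification and the emitted busy shape can be sketched in Python (the production code is Go; the marker strings come straight from the list above, everything else is illustrative):

```python
BUSY_MARKERS = ("context deadline exceeded", "EOF", "connection reset")

def is_upstream_busy_error(err_text: str) -> bool:
    """Heuristic match for 'agent is mid-task' failures, mirroring the
    wrap-string checks described for isUpstreamBusyError."""
    return any(marker in err_text for marker in BUSY_MARKERS)

def busy_response():
    """The 503 + Retry-After contract returned when the container is
    alive but the upstream agent is busy."""
    status = 503
    headers = {"Retry-After": "30"}
    body = {"busy": True, "retry_after": 30}
    return status, headers, body
```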

Tests (workspace-server/internal/handlers/a2a_proxy_test.go):

  • TestIsUpstreamBusyError — 10 error shapes (stdlib typed and url.Error-wrapped strings for both deadline and EOF). Includes negative cases (DNS / refused / unrelated errors).
  • TestProxyA2AError_BusyShape — end-to-end emit contract: 503 status, Retry-After: 30 header, JSON body with busy=true and retry_after=30.

Live verification attempted but inconclusive: redirected a workspace URL in Postgres to a hang server, but the platform's Redis URL cache shadows the DB value so the fake upstream was never hit. Unit tests cover every link in the chain (error detection → typed error struct → handler emit), so I'm confident in the change; a real 503-busy will be observable the next time an agent actually stalls under load.

Branch: fix/110-a2a-busy-response. Closes #110 (Option 3 — clearer error + Retry-After; queueing and timeout-bump deferred).

Platform — fix #117: surface Docker image-not-found error on provision

Bug: Provisioning a workspace whose runtime image isn't built locally silently failed. GET /workspaces/:id returned {status: "failed", last_sample_error: ""} — no hint that the image was missing or which build command to run. Discovered during the MeDo hackathon smoke test; only diagnostic path was docker logs on the platform container.

Fix (two files):

  1. workspace-server/internal/provisioner/provisioner.go::Start — when ContainerCreate returns "No such image", wrap the error with the resolved image tag and the exact build-all.sh <runtime> command the operator should run. Uses %w so errors.Is/errors.As chains stay intact.
  2. workspace-server/internal/handlers/workspace_provision.go — on provisioner.Start failure, the UPDATE now sets last_sample_error = $2 alongside status='failed'. Previously the error was only logged + broadcast.
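A Python sketch of the wrapping step (the real code is Go and uses %w; the message shape copies the live-E2E output quoted later in this entry, while the tag-parsing heuristic is an assumption):

```python
def wrap_image_not_found(daemon_err: str, image: str) -> str:
    """If the Docker daemon reports a missing image, append the
    resolved tag and the exact build command the operator should run;
    pass other errors through unchanged."""
    if "No such image" not in daemon_err:
        return daemon_err
    runtime = image.rsplit(":", 1)[-1]  # assumed: runtime is the image tag
    return (
        f'docker image "{image}" not found — run '
        f"'bash workspace/build-all.sh {runtime}' to build it "
        f"(underlying error: {daemon_err})"
    )
```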

Tests (workspace-server/internal/provisioner/provisioner_test.go):

  • TestIsImageNotFoundErr — 7 error shapes (moby's exact message, variants, unrelated errors)
  • TestRuntimeTagFromImage — 6 image-reference shapes including fallback paths
  • TestImageNotFoundErrorIncludesBuildHint — asserts the wrapped error string includes the image, the build command, and the underlying daemon message

Live E2E: provisioned with runtime: autogen after docker rmi workspace-template:autogen. Before: last_sample_error: "". After: docker image "workspace-template:autogen" not found — run 'bash workspace/build-all.sh autogen' to build it (underlying error: Error response from daemon: No such image: workspace-template:autogen). Image rebuilt after test to restore baseline.

Branch: fix/117-provisioner-surface-image-error. Closes #117.

Phase 30.1 — Workspace auth tokens (SaaS foundation)

Scope: first step of Phase 30 (cross-network federation). Per-workspace bearer tokens so remote agents can authenticate themselves to the platform without being spoofable. Transparent to local containers during the transition — legacy workspaces are grandfathered on /registry/heartbeat until their next /registry/register issues them a token.

What landed:

  • workspace-server/migrations/020_workspace_auth_tokens.{up,down}.sql — new workspace_auth_tokens table storing sha256(plaintext) + 8-char prefix for display. Plaintext never persisted.
  • workspace-server/internal/wsauth/ — new package: IssueToken, ValidateToken, HasAnyLiveToken, RevokeAllForWorkspace, BearerTokenFromHeader. Opaque 256-bit tokens (base64url), no JWT.
  • workspace-server/internal/handlers/registry.go::Register — issues a token on first registration only (idempotent on re-register); returns it in the response body as auth_token.
  • registry.go::Heartbeat, ::UpdateCard — validate Authorization: Bearer <token> if the workspace has any live token on file. Legacy workspaces with no token → 200 (grandfather path).
  • workspace/platform_auth.py — new agent-side store: reads ${CONFIGS_DIR}/.auth_token, in-process cache, auth_headers() helper. File is 0600.
  • workspace/main.py — saves the token returned by register.
  • workspace/heartbeat.py, a2a_tools.py, molecule_ai_status.py, executor_helpers.py — all four heartbeat call sites now send auth_headers().
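The token scheme is easy to sketch in Python: opaque 256-bit base64url plaintext, with only its sha256 and an 8-char display prefix persisted. A sketch of the wsauth surface, not the shipped Go code:

```python
import base64, hashlib, secrets

def issue_token():
    """Mint an opaque 256-bit bearer token. Only sha256(plaintext) and
    an 8-char display prefix would be persisted; the plaintext is
    returned once to the registering workspace."""
    plaintext = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()
    return plaintext, hashlib.sha256(plaintext.encode()).hexdigest(), plaintext[:8]

def validate_token(presented: str, stored_hash: str) -> bool:
    """Constant-time compare of sha256(presented) against the stored hash."""
    digest = hashlib.sha256(presented.encode()).hexdigest()
    return secrets.compare_digest(digest, stored_hash)

def bearer_token_from_header(header: str):
    """Parse an 'Authorization: Bearer <token>' value; None if malformed."""
    parts = header.split(None, 1)
    if len(parts) == 2 and parts[0].lower() == "bearer" and parts[1].strip():
        return parts[1].strip()
    return None
```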

Tests:

  • workspace-server/internal/wsauth/tokens_test.go — 11 cases: issuance persists only hash, tokens unique per call, validate happy path, wrong-workspace rejected, unknown token rejected, empty inputs rejected, HasAnyLiveToken with 0/1/7 rows, revoke, bearer header parser with 7 inputs.
  • workspace/tests/test_platform_auth.py — 14 cases: get/save round-trip, 0600 mode, whitespace stripping, empty-token rejection, idempotent saves (no mtime churn), rotation, header format, caching semantics, empty-file handling, CONFIGS_DIR respect + fallback.
  • Fixed tests/test_molecule_ai_status.py::_FakePost + exploding_post to accept headers= kwarg (test fixture API drift from the production code change).

Live E2E verified against real Postgres + running platform:

  • Legacy workspace (no tokens) → heartbeat 200 (grandfathered)
  • Fresh register → token returned in response body
  • Heartbeat without token (token exists) → 401
  • Heartbeat with valid token → 200
  • Spoofing with guessed token → 401
  • Cross-workspace token reuse (A's token for B) → 401
  • Re-register after token issued → response has no auth_token (idempotent)

Test totals: Go 476 → 487, Python 1064 → 1078.

Docs:

  • docs/remote-workspaces-readiness.md — full code audit that scopes Phase 30 (five sections: local-only assumptions, existing seams, hard problems, minimum viable remote shape, ordered next steps).
  • PLAN.md — new Phase 30 section with eight bounded sub-steps (30.1 through 30.8), out-of-scope boundaries, success criteria.

Branch: feat/30.1-workspace-auth-tokens. First PR of Phase 30.

Fix #125 — commit_memory writes now surface in activity_logs

Bug: commit_memory MCP tool calls succeeded silently. Operators inspecting the Canvas "Agent Comms" tab couldn't see what an agent chose to remember during a task.

Fix (two files):

  1. workspace/builtin_tools/memory.py::commit_memory — on successful write, fire-and-forget a POST /workspaces/:id/activity call via new helper _record_memory_activity(scope, content, memory_id). Summary format [<SCOPE>] <80-char preview>… (id=<id>). The memory id is embedded in the summary (not target_id) because target_id is a UUID column scoped to workspace references; awareness memory ids are arbitrary strings.

  2. workspace-server/internal/handlers/activity.go::Report — added memory_write to the activity_type allowlist. Without this the handler returned 400 with the prior list {a2a_send, a2a_receive, task_update, agent_log, skill_promotion, error}.
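The summary construction in step 1 can be sketched as follows (scope casing, whitespace collapsing, and the exact 80-char boundary are assumptions read off the format string above; the real helper lives in workspace/builtin_tools/memory.py):

```python
def memory_activity_summary(scope: str, content: str, memory_id: str) -> str:
    """Build the '[<SCOPE>] <80-char preview>… (id=<id>)' summary, with
    the memory id embedded in the text because target_id is UUID-typed."""
    preview = " ".join(content.split())      # collapse newlines/whitespace
    if len(preview) > 80:
        preview = preview[:80] + "…"         # assumed truncation boundary
    return f"[{scope.upper()}] {preview} (id={memory_id})"
```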

Tests:

  • workspace/tests/test_memory.py — 6 new cases: posts to /activity endpoint with right shape; truncates content > 80 chars with ellipsis; strips newlines from summary; skips when WORKSPACE_ID or PLATFORM_URL is missing; swallows POST failures (must not poison tool path); embeds id in summary regardless.

  • workspace-server/internal/handlers/activity_test.go — 2 new cases: memory_write accepted (200), unknown type still 400 with the updated message including memory_write.

Live E2E against running platform + Postgres:

  • Direct curl POST with activity_type=memory_write → 200 + DB row
  • _record_memory_activity from Python → row visible via GET /workspaces/:id/activity?type=memory_write
  • Confirmed target_id UUID-typing rejection from prior attempt (caught the bug — fix lands the id in summary instead)

Test totals: Go 487 → 489, Python 1078 → 1084.

Branch: fix/125-commit-memory-activity-log. Closes #125.

Phase 30.2 + 30.5 — Remote secrets pull + A2A caller-token validation

Two bounded steps shipped together since they share the same wsauth validation shape.

30.2 — GET /workspaces/:id/secrets/values

  • New handler in workspace-server/internal/handlers/secrets.go::Values. Returns the merged decrypted global+workspace secrets as a flat {"KEY": "value"} JSON map. Same merge semantics as the provisioner's env-var injection, so a remote agent bootstrapping via pull sees exactly the same secrets a local container would receive via push.
  • Auth: Phase 30.1 bearer token required when the workspace has any live token on file. Legacy workspaces grandfathered through. Fail-closed on the token-existence check (different from heartbeat's fail-open) because this endpoint returns plaintext secrets.
  • Route wired in workspace-server/internal/router/router.go:170.
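The merge semantics sketched in Python. That workspace-level keys override global ones is an assumption (consistent with the parity claim above about the provisioner's env-var injection); the Go handler is authoritative:

```python
def merge_secrets(global_secrets: dict, workspace_secrets: dict) -> dict:
    """Flat {'KEY': 'value'} merge for the pull endpoint. Assumed
    precedence: workspace-scoped values win over global ones."""
    merged = dict(global_secrets)
    merged.update(workspace_secrets)
    return merged
```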

30.5 — A2A proxy caller-token validation

  • workspace-server/internal/handlers/a2a_proxy.go::ProxyA2A now calls validateCallerToken(ctx, c, callerID) before the existing CanCommunicate hierarchy check. Three bypass paths preserved: canvas (empty X-Workspace-ID), system callers (webhook:, system:, test: prefixes), self-calls (callerID==workspaceID).
  • Token binding is strict: compromised token from workspace A cannot authenticate a caller claiming to be workspace B. Tested.
  • Fail-open on DB hiccup — caller-token is defense-in-depth on top of hierarchy, not the sole gate.

Tests:

  • 5 new Go tests in secrets_test.go (legacy grandfather, missing token, wrong token, valid token with merge precedence, invalid workspace ID).
  • 5 new Go tests in a2a_proxy_test.go::TestValidateCallerToken (legacy grandfather, missing token, invalid token, valid token, wrong-workspace binding rejection).

Live E2E verified against real Postgres + platform:

  • 30.2: no-token → 401, bad-token → 401, valid-token → 200 with correct {"PHASE_30_DEMO":"hello-from-pull-endpoint"}.
  • 30.5: canvas bypass ✓, self-call bypass ✓, system-caller bypass ✓, cross-workspace no-token → 401 "missing caller auth token", cross-workspace wrong-token → 401 "invalid caller auth token", cross-workspace valid-token → 403 "access denied" (falls through to hierarchy check as designed).

Phase 30 status on main: 30.1 ✓, 30.2 (this PR), 30.5 (this PR). Remaining: 30.3 (plugin tarball), 30.4 (state polling), 30.6 (sibling URL cache), 30.7 (poll-liveness), 30.8 (SDK + GA).

Branch: feat/30.2-30.5-remote-auth. PLAN.md checkboxes flipped for 30.1, 30.2, 30.5.

Phase 30.4 + 30.8 — State polling + Remote-agent SDK (first working e2e)

Shipped together because 30.8 (the runnable example) is the proof-of-life for everything 30.1–30.5 built up to. 30.4 is the missing piece that lets a remote agent detect pause/delete without WebSocket reachability.

30.4 — GET /workspaces/:id/state

  • New handler workspace.State at workspace-server/internal/handlers/workspace.go. Returns {workspace_id, status, paused, deleted}. Token-gated with the same Phase 30.1 shape (legacy grandfather, fail-closed on DB error). Deliberately not merged with GET /workspaces/:id — that path is for the canvas (unauthenticated, rich config). This is the agent-machinery polling path: tight, token-gated, cache-friendly.
  • Returns 404 + {deleted: true} for hard-deleted rows so the SDK can distinguish from transient network issues.

30.8 — sdk/python/molecule_agent/

  • New RemoteAgentClient class (blocking, requests-only, no async) with methods mirroring the Phase 30 endpoints: register(), pull_secrets(), poll_state(), heartbeat(), run_heartbeat_loop().
  • Token cache at ~/.molecule/<workspace_id>/.auth_token with 0600 perms. Register is idempotent — re-registering an already-tokened workspace keeps using the on-disk copy.
  • Loop exits gracefully on pause/delete, returning the terminal status for the caller to log / exit on.
  • Tolerates transient heartbeat + state-poll failures without crashing the loop (log and continue).
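The loop semantics above can be sketched as follows (poll_state and heartbeat are injected callables purely for illustration; the real client uses its own HTTP methods and a real sleep interval):

```python
import time

def run_heartbeat_loop(poll_state, heartbeat, interval=5.0, max_iterations=None):
    """Heartbeat, poll state, exit with the terminal status on
    pause/delete, and log-and-continue on transient errors."""
    i = 0
    while max_iterations is None or i < max_iterations:
        i += 1
        try:
            heartbeat()
            state = poll_state()
            if state.get("deleted"):
                return "deleted"     # workspace hard-deleted: terminal
            if state.get("paused"):
                return "paused"      # platform paused us: exit gracefully
        except Exception as exc:
            # transient heartbeat / state-poll failure: keep looping
            print(f"heartbeat loop: transient error, continuing: {exc}")
        time.sleep(interval)
    return "max-iterations"
```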

examples/remote-agent/

  • Runnable 100-line demo: WORKSPACE_ID=x PLATFORM_URL=y python3 run.py. README walks through workspace creation via external: true, seeding a secret, running the agent.
  • Note found during live verification: POST /registry/register upserts status='online', so re-registering an already-paused workspace reverts it. Not a bug in 30.4, but it affects the order of operations in the demo (register once, then pause takes effect on the long-running loop). Filed as follow-up — see "Known follow-ups" below.

Tests:

  • 5 new Go tests for workspace.State (legacy grandfather, paused, hard-delete 404, missing token, valid token).
  • 22 new Python tests for RemoteAgentClient (token persistence with 0600 check, register issues/reuses, secrets pull, state poll, 404 = deleted, heartbeat body shape, loop exits on paused/deleted/max-iterations, transient-error continuation).

Live E2E with all of 30.1/30.2/30.4/30.5 running:

  • Agent register → token issued ✓
  • received 2 secret(s): keys=['API_KEY', 'REMOTE_DEMO_KEY']
  • Heartbeat loop runs, uptime advances to 10s ✓
  • POST /pause mid-loop → platform reports workspace paused (paused=True deleted=False) — exiting within ~5s ✓
  • Clean terminal status paused

Known follow-ups (not this PR):

  • Register's status='online' overwrite undoes platform-side pause if the agent happens to re-register. Should check current status and preserve paused / removed.
  • Loop currently can't receive inbound A2A — reported_url is remote://no-inbound as a placeholder. A future 30.8b will add an optional start_a2a_server() helper for agents behind a public URL or tunneled port.

PLAN.md: 30.4 ✓, 30.8 ✓. Phase 30 remaining: 30.3 (plugin tarball), 30.6 (sibling URL cache), 30.7 (poll-liveness monitor).

Branch: feat/30.4-state-polling (merged 30.2+30.5 PR #130 into it mid-session for the live E2E to have all endpoints available).

Phase 30.7 — Poll-liveness for external-runtime workspaces

Why this is the missing piece: without it, a dead remote agent stayed "online" on the canvas forever. The existing health sweep explicitly skipped runtime='external' rows because it only knew how to ask Docker "is the container alive?" — wrong question for a workspace the platform never started.

Fix (workspace-server/internal/registry/healthsweep.go):

  • New sweepStaleRemoteWorkspaces runs on the same ticker as the Docker sweep. Queries workspaces with runtime='external' whose last_heartbeat_at is older than REMOTE_LIVENESS_STALE_AFTER (default 90s, env-overridable). Marks them offline, clears Redis state, fires onOffline so the canvas sees WORKSPACE_OFFLINE.
  • StartHealthSweep no longer early-returns on nil Docker checker — a SaaS front-door deployment without local Docker still needs remote-liveness monitoring.
  • Newly-registered external workspaces that haven't heartbeated yet are compared against updated_at (set on register), so an agent that crashes before its first heartbeat is still swept after the grace window.
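The staleness decision and the env parsing sketched in Python (illustrative only; the shipped helper is Go):

```python
import os

def remote_stale_after_seconds(default=90):
    """Parse REMOTE_LIVENESS_STALE_AFTER, falling back to the default
    on unset, non-integer, zero, or negative values."""
    raw = os.environ.get("REMOTE_LIVENESS_STALE_AFTER", "")
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default

def is_stale(last_heartbeat_at, updated_at, now, stale_after):
    """Judge a workspace that has never heartbeated against updated_at
    (set on register), so a crash before the first heartbeat is still
    swept after the grace window. Timestamps are epoch seconds here."""
    reference = last_heartbeat_at if last_heartbeat_at is not None else updated_at
    return (now - reference) > stale_after
```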

Tests (workspace-server/internal/registry/healthsweep_test.go):

  • sweepStaleRemoteWorkspaces with 2 stale rows → UPDATE + onOffline called twice
  • No stale rows → onOffline never called
  • Nil callback → no panic
  • DB outage → logged, no panic, no false offlines
  • remoteStaleAfter: default when env unset; honors valid integer override; falls back on garbage values (abc, 0, -10, empty)
  • StartHealthSweep with nil checker: still ticks and runs remote sweep (previously would early-return)

Live E2E with REMOTE_LIVENESS_STALE_AFTER=10 for test speed:

  • Agent register → heartbeat → exit → status=online (heartbeat fresh)
  • Wait 30s → status=offline (platform swept at 15s tick, saw heartbeat >10s old). Log: Health sweep (remote): <id> heartbeat stale (>10s) — marking offline
  • Restart agent → heartbeat resumes → status=online again
  • Full cycle observable on canvas via WORKSPACE_OFFLINE / WORKSPACE_ONLINE broadcasts

All Phase 30 remote-agent capabilities now demonstrable end-to-end:

| Step | Live E2E status |
|---|---|
| 30.1 Token auth | register + heartbeat bearer-auth'd |
| 30.2 Secrets pull | keys=['API_KEY','REMOTE_DEMO_KEY'] |
| 30.4 State polling | pause detected in ~5s |
| 30.5 A2A caller auth | 401/403 separation confirmed |
| 30.7 Poll-liveness | stale→offline→restart→online cycle |
| 30.8 SDK + example | examples/remote-agent/run.py |

Phase 30 remaining: 30.3 (plugin tarball), 30.6 (sibling URL cache). Neither blocks the current SaaS loop; 30.3 matters when remote agents need to install plugins with heavy deps, 30.6 is a resilience optimization for agent-to-agent direct calls.

Branch: feat/30.7-poll-liveness. PLAN.md 30.7 ✓.

Phase 30.6 — Sibling discovery auth + URL caching

Two tied fixes:

Platform side — /registry/discover/:id and /registry/:id/peers were unauthenticated. For a SaaS front-door deployment, any internet host that knows a workspace ID could enumerate siblings and pull their URLs. Added validateDiscoveryCaller using the same lazy-bootstrap Phase 30.1 token pattern. Fail-open on DB hiccup (unlike secrets.Values which fails-closed) because discovery only exposes URLs already behind CanCommunicate — the hierarchy check downstream is the primary gate, auth is defense-in-depth.

SDK side — new methods on RemoteAgentClient:

  • get_peers() → list of PeerInfo, seeds URL cache
  • discover_peer(id) → cached lookup with 5-min TTL, refreshes on expiry, returns None on 404
  • invalidate_peer_url(id) → drop cache entry (call after a direct-call failure so next call re-discovers)
  • call_peer(id, message, prefer_direct=True) → sends A2A message/send. Direct path on cache hit; graceful fallback to platform proxy on connection error / 5xx with cache invalidation. prefer_direct=False forces proxy routing.
  • New PeerInfo dataclass exported alongside WorkspaceState.
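The TTL cache behind discover_peer / invalidate_peer_url, sketched as a standalone class (the PeerURLCache name and the injectable clock are illustrative, not the SDK's actual internals):

```python
import time

class PeerURLCache:
    """5-minute-TTL URL cache: seeded from get_peers(), skips non-http
    placeholder URLs, expires entries to force re-discovery, and
    supports idempotent invalidation after a direct-call failure."""
    def __init__(self, ttl=300.0, clock=time.monotonic):
        self.ttl, self.clock = ttl, clock
        self._entries = {}

    def put(self, peer_id, url):
        if url.startswith("http"):            # skip remote:// placeholders
            self._entries[peer_id] = (url, self.clock())

    def get(self, peer_id):
        entry = self._entries.get(peer_id)
        if entry is None:
            return None
        url, stamp = entry
        if self.clock() - stamp > self.ttl:   # expired: force re-discovery
            del self._entries[peer_id]
            return None
        return url

    def invalidate(self, peer_id):
        self._entries.pop(peer_id, None)      # idempotent
```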

Tests: 12 new SDK tests (cache seeding skips non-http URLs, cache hit short-circuits, expired cache refreshes, 404 returns None, invalidate_peer_url idempotent, direct-path vs proxy-fallback vs prefer-direct=False, fresh call with no cache does discover-then-direct).

Bug caught during verification: my first discovery auth shape fail-closed on DB errors, which broke existing TestDiscover_* and TestPeers_* tests that didn't set up the HasAnyLiveToken sqlmock expectation. Switched to fail-open — discovery is hierarchy-gated anyway, and a DB hiccup shouldn't take agent-to-agent discovery offline. 8 tests restored green.

Live E2E with a tiny Python echo server as sibling-B:

  • get_peers returns 2 peers (echo server + parent PM)
  • URL cache seeded ONLY with http:// entry (skips remote://pm)
  • call_peer routes directly to http://127.0.0.1:9876 — no proxy hop
  • Echo server responds, SDK returns "echoed: hello sibling over SDK"
  • Auth + hierarchy all verified: no-token→401, wrong-token→401, cross-workspace token→401, out-of-hierarchy discover→403

Phase 30 status after this: 30.1 ✓ 30.2 ✓ 30.4 ✓ 30.5 ✓ 30.6 ✓ 30.7 ✓ 30.8 ✓. Only 30.3 (plugin tarball download) remains, and I flagged that one as lower priority — the current SaaS loop doesn't need it until a real user has a heavy-deps plugin.

Branch: feat/30.6-sibling-cache.

Fix #123 — Telegram kicked/left now persists enabled=false

Bug: When the Molecule AI bot was removed from a Telegram chat, the handler at telegram.go:594-596 only logged the event — the matching workspace_channels row stayed enabled=true. Every subsequent outbound message hit Telegram 403 forever.

Fix:

  • New package-level callback disableChannelByChatID in telegram.go, default no-op (safe for early boot / tests).
  • manager.go::NewManager wires it to run UPDATE workspace_channels SET enabled=false WHERE channel_type='telegram' AND enabled=true AND config->>'chat_id'=$1, then call m.Reload(ctx) if any row flipped so the in-memory poller map drops the now-disabled row.
  • onMyChatMember::case "left", "kicked" now calls the callback immediately after the existing log line (removes the TODO).

Tests (workspace-server/internal/channels/channels_test.go):

  • default-is-no-op (var safe to call pre-Manager-init)
  • wired-callback fires UPDATE with exact WHERE shape + arg + triggers Reload via follow-up SELECT
  • no-rows-affected skips reload (avoids SELECT storm on unrelated kicked events from other bots)

Branch: fix/123-telegram-kicked-persist. Closes #123.

Phase 30 client adaptations — MCP / molecli / Canvas / SDK

Phase 30 itself shipped the platform-side endpoints. These adaptations make those endpoints visible and usable from every client surface without requiring callers to know the new URL paths by hand.

MCP — 4 new tools in mcp-server/src/index.ts:

  • list_remote_agents — filters workspace list to runtime='external'
  • get_remote_agent_state — projects {workspace_id, status, paused, deleted}
  • get_remote_agent_setup_command — emits the WORKSPACE_ID=... PLATFORM_URL=... python3 ... bash one-liner an operator can paste into a remote shell
  • check_remote_agent_freshness — compares last_heartbeat_at against configurable threshold (default 90s); returns {fresh, seconds_since_heartbeat}

8 new MCP tests (88 → 96).

molecli — WorkspaceInfo gains a Runtime field; printWorkspaceTable adds a RUNTIME column showing ★ external for remote agents so they pop in a long table; detail view labels them external (Phase 30 remote agent). Live: molecli ws list now shows the badge correctly.

Canvas — WorkspaceNode.tsx reads data.runtime (workspace row) in preference to data.agentCard.runtime (agent-reported). Remote agents get a distinct violet ★ REMOTE pill with a tooltip explaining the heartbeat-based lifecycle. 352/352 vitests still pass.

SDK — pyproject.toml rebranded to molecule-sdk@0.2.0 so a single pip install molecule-sdk ships both molecule_plugin (plugin authors) and molecule_agent (remote-agent authors). Added trove classifiers, keywords, requires-python pin. New sdk/python/molecule_agent/README.md quickstart.

Live verification:

  • MCP: spawned a real external workspace, ran all 4 tools via node smoke script — count=1, setup_command renders, freshness=null (no heartbeat yet, returns fresh=false correctly)
  • molecli: ws list shows ★ external badge on the remote workspace
  • Canvas tests green; visual change is small (one badge swap)
  • SDK: 121 SDK tests + 1078 workspace-template tests still pass

Branch: feat/phase30-client-adaptations.

Phase 30.3 — Plugin tarball download (external GitHub repo verified)

Platform: new GET /workspaces/:id/plugins/:name/download[?source=...] streams the named plugin as a gzipped tarball. Reuses resolveAndStage so all existing source schemes (local://, github://, future clawhub://) work — the endpoint is just the download surface for what Install was already doing internally.

Token-gated (fail-closed on DB error, since the tarball can include rule text and skill files referencing internals). Defaults source to local://<name> when the query param is omitted. Validates that the plugin name in the URL path matches the resolved plugin's manifest name, which prevents a github:// source that resolves to a different plugin from being shipped under the requested name.
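A minimal sketch of that name-match check (the real logic lives in the Go handler; the function name and return shape here are illustrative, though the 400 body matches what the live E2E observed):

```python
def validate_plugin_name(requested_name: str, resolved_manifest_name: str):
    """Reject a download whose resolved manifest name differs from the
    plugin name in the URL path. Returns (status, body): 400 with both
    names on mismatch, 200 with no body otherwise. (Sketch only.)
    """
    if requested_name != resolved_manifest_name:
        return 400, {
            "resolved_name": resolved_manifest_name,
            "requested_name": requested_name,
        }
    return 200, None
```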

SDK RemoteAgentClient.install_plugin(name, source=None):

  1. Stream the download
  2. Atomic extract via sibling-tempdir + rename (no half-installed states)
  3. Run setup.sh if present (best-effort)
  4. POST /workspaces/:id/plugins to register the install
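Step 2's no-half-installed-states guarantee can be sketched as follows (helper name hypothetical; the real implementation lives in the SDK's RemoteAgentClient). The key idea: extract into a temp dir that is a sibling of the destination (same filesystem, so the rename is atomic), then swap it into place only after extraction succeeds.

```python
import os
import shutil
import tarfile
import tempfile

def atomic_install(tarball_path: str, dest_dir: str):
    """Unpack a plugin tarball with no half-installed states (sketch).

    Extract into a sibling temp dir, then rename into place; on any
    failure the staging dir is removed and the old install survives.
    """
    parent = os.path.dirname(os.path.abspath(dest_dir))
    staging = tempfile.mkdtemp(dir=parent)
    try:
        with tarfile.open(tarball_path, "r:gz") as tar:
            tar.extractall(staging)  # real code routes through _safe_extract_tar
        if os.path.isdir(dest_dir):
            shutil.rmtree(dest_dir)  # overwrite an existing install
        os.rename(staging, dest_dir)
    except BaseException:
        # Corrupt tarball or failed extract: roll back, keep old install
        shutil.rmtree(staging, ignore_errors=True)
        raise
```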

_safe_extract_tar rejects path traversal (../ escapes, absolute paths) and silently skips symlinks/hardlinks, defending against tar-slip CVEs. Both defenses are exercised with adversarial inputs in the tests.

Tests:

  • 5 new Go (auth, tarball shape, name mismatch, tar streaming relative paths, tar symlink skip)
  • 11 new Python SDK (unpack location, source query param, atomic rollback on corrupt tarball, overwrite existing, setup.sh ran/skipped, platform-report skipped, 404 surfaces, _safe_extract path-traversal rejection, absolute-path rejection, symlink skip)

Live E2E with a real external GitHub repo created via gh repo create (HongmingWang-Rabbit/starfire-test-plugin):

  • local://molecule-dev → 4612-byte tarball, plugin.yaml + skills/ present
  • github://HongmingWang-Rabbit/starfire-test-plugin → 711-byte tarball pulled from real GitHub, unpacked locally, setup.sh ran on the agent's host machine producing /tmp/sf-plugin-test-setup-ran
  • Auth gates: 401/401/200 confirmed
  • Name-mismatch: requested wrong-name with source=...starfire-test-plugin returned 400 with {"resolved_name":"starfire-test-plugin","requested_name":"wrong-name"}

Phase 30 is now feature-complete: 30.1 30.2 30.3 30.4 30.5 30.6 30.7 30.8

Branch: feat/30.3-plugin-tarball. Test repo: https://github.com/HongmingWang-Rabbit/starfire-test-plugin

Bugfix #124 — Delegation idempotency

Promoted from docs/known-issues.md KI-002. When a workspace container restarted mid-delegation (Redis TTL → liveness restart), agents could re-issue POST /workspaces/:id/delegate and produce duplicate work (double commits, double Telegram messages, double API calls).

Migration 021_delegation_idempotency.up.sql:

  • activity_logs.idempotency_key TEXT NULL
  • Partial unique index on (workspace_id, idempotency_key) WHERE idempotency_key IS NOT NULL — fully backwards compatible

Handler (workspace-server/internal/handlers/delegation.go::Delegate):

  • Optional idempotency_key field on the request body
  • On receipt: lookup (workspace_id, key) → if found and not failed, return existing delegation_id with HTTP 200 + idempotent_hit: true
  • If the prior row is failed, the slot is released so the retry can produce a fresh delegation (still 202)
  • If two concurrent calls race past the lookup, the unique-constraint violation on insert is caught and the loser re-queries to surface the same idempotent response (HTTP 200) instead of a 500
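The lookup/insert/race flow above, sketched against an in-memory store (the real handler is Go + Postgres and the names here are illustrative only; UniqueViolation stands in for the unique-constraint error):

```python
class UniqueViolation(Exception):
    """Stand-in for the DB unique-constraint error (assumption)."""

class DelegationStore:
    """In-memory sketch of the idempotency flow described above."""

    def __init__(self):
        self._rows = {}  # (workspace_id, key) -> {"id": ..., "status": ...}

    def delegate(self, workspace_id, key, new_id):
        if key is None:
            return 202, {"delegation_id": new_id}  # no idempotency requested
        row = self._rows.get((workspace_id, key))
        if row and row["status"] != "failed":
            # Replay: return the existing delegation, not a duplicate
            return 200, {"delegation_id": row["id"], "idempotent_hit": True}
        # Missing or failed row: (re)claim the slot with a fresh delegation
        try:
            self._insert(workspace_id, key, new_id)
        except UniqueViolation:
            # Lost a concurrent race: re-query and surface the winner
            row = self._rows[(workspace_id, key)]
            return 200, {"delegation_id": row["id"], "idempotent_hit": True}
        return 202, {"delegation_id": new_id}

    def _insert(self, workspace_id, key, new_id):
        existing = self._rows.get((workspace_id, key))
        if existing and existing["status"] != "failed":
            raise UniqueViolation
        self._rows[(workspace_id, key)] = {"id": new_id, "status": "running"}
```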

Tests (3 new + 2 updated, all green under go test -race):

  • TestDelegate_IdempotentReplayReturnsExistingDelegation
  • TestDelegate_IdempotentFailedRowIsReleasedAndReplaced
  • TestDelegate_IdempotentRaceUniqueViolationReturnsExisting
  • Updated TestDelegate_Success and TestDelegate_DBInsertFails_Still202WithWarning to assert the new 6th INSERT arg (idempotency_key = NULL when omitted)

Branch: fix/auto-review-2026-04-13-delegation-idempotency. Closes #124.