molecule-core/docs/edit-history/2026-04-13.md
Hongming Wang d8026347e5 chore: open-source restructure — rename dirs, remove internal files, scrub secrets
Renames:
- platform/ → workspace-server/ (Go module path stays as "platform" for
  external dep compat — will update after plugin module republish)
- workspace-template/ → workspace/

Removed (moved to separate repos or deleted):
- PLAN.md — internal roadmap (move to private project board)
- HANDOFF.md, AGENTS.md — one-time internal session docs
- .claude/ — gitignored entirely (local agent config)
- infra/cloudflare-worker/ → Molecule-AI/molecule-tenant-proxy
- org-templates/molecule-dev/ → standalone template repo
- .mcp-eval/ → molecule-mcp-server repo
- test-results/ — ephemeral, gitignored

Security scrubbing:
- Cloudflare account/zone/KV IDs → placeholders
- Real EC2 IPs → <EC2_IP> in all docs
- CF token prefix, Neon project ID, Fly app names → redacted
- Langfuse dev credentials → parameterized
- Personal runner username/machine name → generic

Community files:
- CONTRIBUTING.md — build, test, branch conventions
- CODE_OF_CONDUCT.md — Contributor Covenant 2.1

All Dockerfiles, CI workflows, docker-compose, railway.toml, render.yaml,
README, CLAUDE.md updated for new directory names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 00:24:44 -07:00


2026-04-13 — edit history

Summary — Quality + Infra Pass (PRs #1–#8, all merged)

Eight PRs landed today in a focused quality pass. No user-facing feature changes; the payoff is faster onboarding, lower merge friction, and stronger CI gates.

Brand + structural

  • PR #1 chore/branding-icons — replaced molecule-icon.png across canvas/public/, canvas/src/app/, docs/assets/branding/; added HANDOFF.md at the repo root; fixed a comment typo in .githooks/pre-commit.
  • PR #3 chore/structural-cleanup — deleted empty workspace-server/plugins/; moved examples/remote-agent/ → sdk/python/examples/remote-agent/ and docs/superpowers/plans/ → plugins/superpowers/plans/; added READMEs to tests/ and docs/; gitignored .agents/, workspace-server/workspace-configs-templates/, backups/, logs/, test-results/.
  • LICENSE: trailing brand-migration fix — "Agent Molecule" → "Molecule AI".

MCP server refactor (PRs #2, #4, #7)

  • mcp-server/src/index.ts shrank from 1697 → 89 lines. Tool handlers now live in per-domain modules under mcp-server/src/tools/: workspaces.ts, agents.ts, secrets.ts, files.ts, memory.ts, plugins.ts, channels.ts, delegation.ts, schedules.ts, approvals.ts, discovery.ts, remote_agents.ts.
  • New shared HTTP layer mcp-server/src/api.ts exports PLATFORM_URL, generic apiCall<T>, ApiError type, isApiError() guard, toMcpResult(), toMcpText().
  • Each tools/*.ts exports handlers + a registerXxxTools(srv) function. createServer() in index.ts wires them.
  • Fixed handleGetRemoteAgentSetupCommand — emits a valid python3 -c "from molecule_agent import RemoteAgentClient; …" one-liner (was an invalid python3 -m examples.remote-agent.run).
  • MCP now reports 87 tools on startup (older logs / docs said "61" — both updated).

Canvas (PRs, shipped across session)

  • Replaced native window.confirm / alert with ConfirmDialog in seven sites: ChannelsTab.tsx, ScheduleTab.tsx, ChatTab.tsx, TemplatePalette.tsx (×2), ErrorBoundary.tsx (×2 removed; buttons are self-evident).
  • New singleButton prop on ConfirmDialog for info-toast usage, plus 5 new vitest cases at canvas/src/components/__tests__/ConfirmDialog.test.tsx.
  • ErrorBoundary clipboard write now catches rejections and logs to console.warn.
  • Vitest count: 352 → 357.

Platform — handler decomposition (pure refactor)

Four oversize handler functions split into private helpers — behavior unchanged, but each extracted helper is now directly unit-tested.

  • a2a_proxy.go::proxyA2ARequest (257 → 56 lines). New helpers: resolveAgentURL, normalizeA2APayload, dispatchA2A, handleA2ADispatchError, maybeMarkContainerDead, logA2AFailure, logA2ASuccess; sentinel proxyDispatchBuildError.
  • delegation.go::Delegate (127 → 60 lines). New helpers: bindDelegateRequest, lookupIdempotentDelegation, insertDelegationRow; typed insertDelegationOutcome enum (zero value insertOutcomeUnknown) replaces a positional (bool, bool) return.
  • discovery.go::Discover (125 → 40 lines). New helpers: discoverWorkspacePeer, writeExternalWorkspaceURL, discoverHostPeer.
  • activity.go::SessionSearch (109 → 24 lines). New helpers: parseSessionSearchParams, buildSessionSearchQuery, scanSessionSearchRows.

+47 Go unit tests; workspace-server/internal/handlers coverage 56.1 % → 57.6 %.

Config / env documentation

  • .env.example gained 12 previously-undocumented env vars across 6 new sections: PLATFORM_URL, MOLECULE_URL, WORKSPACE_DIR, MOLECULE_ENV, CORS_ORIGINS, RATE_LIMIT, ACTIVITY_RETENTION_DAYS, ACTIVITY_CLEANUP_INTERVAL_HOURS, MOLECULE_IN_DOCKER, AWARENESS_URL, GITHUB_WEBHOOK_SECRET, MOLECLI_URL. All 21 distinct os.Getenv / envx.* keys (except HOME) are now documented.

E2E + CI (PRs #5, #7, #8)

  • New shared helpers tests/e2e/_lib.sh and tests/e2e/_extract_token.py.
  • tests/e2e/test_api.sh updated for Phase 30.1 bearer-token auth and Phase 30.6 X-Workspace-ID requirement on discover/peers; added a pre-test workspace cleanup. 62/62 pass.
  • tests/e2e/test_comprehensive_e2e.sh fixed the token race against the provisioner by registering each workspace immediately after creation. 67/67 pass.
  • tests/e2e/test_activity_e2e.sh re-registers a detected agent to capture its bearer token.
  • tests/e2e/test_claude_code_e2e.sh got shellcheck annotations only.
  • All five scripts are shellcheck-clean.
  • .github/workflows/ci.yml gained two new jobs:
    • e2e-api — Postgres + Redis service containers, migrations applied via docker exec, test_api.sh runs against a freshly-built platform binary.
    • shellcheck — marketplace action lints every tests/e2e/*.sh.
  • Existing Go job got cache: true on setup-go.
  • Bundle round-trip and "status online" assertions now tolerate the async provisioner flipping status, removing flaky false-negatives.

Test totals after today's sync

| Stack | Before | After |
|---|---|---|
| Go (platform) | 648 | 695 |
| Python (workspace) | 1140 | 1140 |
| Canvas (vitest) | 352 | 357 |
| SDK (pytest) | 132 | 132 |
| MCP server (Jest) | 96 | 97 |

Note: only Go (+47 direct tests for extracted handler helpers), canvas (+5 ConfirmDialog singleButton tests), and MCP (+1 createServer smoke test) gained tests today. Python workspace + SDK counts are the pre-session baseline — no pytest additions today. The earlier "1078 / 87" numbers in this session were stale CLAUDE.md baselines, not measurements.


Canvas — org template import (PLAN.md §20.3)

What: added OrgTemplatesSection to canvas/src/components/TemplatePalette.tsx. Lists org templates from GET /org/templates, each entry shows name + description + workspace count + an "Import org" button that POSTs { dir } to /org/import. Renders inside the existing template-palette sidebar, below the workspace template list.

Why: PLAN.md §20.3 had this checkbox unchecked. Platform already exposes the endpoints (handlers/org.go); only the canvas wiring was missing. Authors today have to curl to instantiate multi-workspace orgs — a poor UX given we already curate org-templates/molecule-dev, reno-stars, etc.

How tested: extracted fetchOrgTemplates() and importOrgTemplate() as standalone exports so they're unit-testable in the existing node-only vitest config (no jsdom). 7 new tests cover happy path, non-2xx response, network failure, POST body shape, error propagation, and module exports. Canvas vitest 345 → 352.

Branch: feat/canvas-org-template-import.

Platform — fix #106: plugin uninstall cleanup

Bug: DELETE /workspaces/:id/plugins/:name only removed /configs/plugins/<name>/. Skill dirs copied out to /configs/skills/<skill>/ and rule blocks appended to /configs/CLAUDE.md by AgentskillsAdaptor.install were left behind, so they reappeared after every container auto-restart.

Fix (workspace-server/internal/handlers/plugins.go::Uninstall): before the existing plugin-dir removal, the handler now:

  1. Reads /configs/plugins/<name>/plugin.yaml from the container to learn the plugin's declared skills: list.
  2. Strips every # Plugin: <name> / … block from /configs/CLAUDE.md via an awk script that mirrors AgentskillsAdaptor.uninstall's block layout (marker → blank → content → blank). Other plugins' markers and surrounding user content stay intact.
  3. rm -rf each declared skill dir under /configs/skills/ (with validatePluginName defense against malformed manifest skill names).
  4. Then proceeds with the existing rm -rf /configs/plugins/<name>.
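For illustration, the marker-stripping step can be mirrored in Python. This is a sketch only: the shipped code is Go driving an awk script, and the exact marker prefix (`# Plugin: <name>`) and the single contiguous content run per block are assumptions taken from the block layout described above.

```python
def strip_plugin_blocks(claude_md: str, plugin: str) -> str:
    """Drop every block belonging to `plugin` (marker line, blank line,
    content run, trailing blank) while leaving other plugins' markers
    and surrounding user content intact. Sketch of the awk behavior."""
    marker_prefix = f"# Plugin: {plugin}"  # assumed marker shape
    lines = claude_md.splitlines()
    out, i = [], 0
    while i < len(lines):
        if lines[i].startswith(marker_prefix):
            i += 1                                   # drop the marker line
            if i < len(lines) and not lines[i].strip():
                i += 1                               # drop blank after marker
            while i < len(lines) and lines[i].strip():
                i += 1                               # drop the content run
            if i < len(lines) and not lines[i].strip():
                i += 1                               # drop trailing blank
        else:
            out.append(lines[i])
            i += 1
    return "\n".join(out)
```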

Tests (workspace-server/internal/handlers/plugins_test.go):

  • TestRegexpEscapeForAwk — verifies /, ., [], *+?|, \\, empty string all escape correctly. Caught a real bug (forgot /, awk treated marker as broken regex delimiter).
  • TestStripPluginMarkers_AwkScript — runs the exact awk pipeline the production code uses against a fixture CLAUDE.md with two my-plugin blocks, a keep-me block, and surrounding user content. Asserts both my-plugin blocks (marker + content) gone, keep-me + user content intact, including trailing user content after the last my-plugin block.
  • TestStripPluginMarkers_MissingFileIsNoOp — missing CLAUDE.md must not crash uninstall.

Live E2E: ran fixed binary, installed test plugin (skill in /configs/skills/test-skill/, rules block in CLAUDE.md), called DELETE, confirmed all three artifacts gone, then triggered manual restart and confirmed they stayed gone (the original bug trigger). Other workspace state — review-loop skill, molecule-dev plugin, surrounding CLAUDE.md content — preserved.

Branch: fix/106-plugin-uninstall-cleanup. Closes #106.

Platform — fix #110: A2A busy-response classification

Bug: When an upstream workspace agent is mid-synthesis on a previous request (single-threaded main loop), subsequent A2A requests time out or see the connection reset. The proxy returned 502 failed to reach workspace agent, indistinguishable from a genuinely unreachable agent. 17 such failures recorded over 7h of self-evol loop traffic.

Fix (workspace-server/internal/handlers/a2a_proxy.go): proxyA2AError gains an optional Headers field so handlers can set real response headers. After a2aClient.Do(req) errors, we now classify via isUpstreamBusyError: context.DeadlineExceeded, io.EOF, io.ErrUnexpectedEOF, or stdlib wrap-strings containing "context deadline exceeded", "EOF", "connection reset". When the container is alive and the error matches, return 503 Service Unavailable with Retry-After: 30 and a JSON body {"busy": true, "retry_after": 30}. Fatal / unclassified errors still fall through to the prior 502. Issue #110 Option 3.
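The classification and the emitted busy shape can be sketched in Python (the production code is Go; the marker strings come straight from the list above, everything else is illustrative):

```python
BUSY_MARKERS = ("context deadline exceeded", "EOF", "connection reset")

def is_upstream_busy_error(err_text: str) -> bool:
    """Heuristic match for 'agent is mid-task' failures, mirroring the
    wrap-string checks described for isUpstreamBusyError."""
    return any(marker in err_text for marker in BUSY_MARKERS)

def busy_response():
    """The 503 + Retry-After contract returned when the container is
    alive but the upstream agent is busy."""
    status = 503
    headers = {"Retry-After": "30"}
    body = {"busy": True, "retry_after": 30}
    return status, headers, body
```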

Tests (workspace-server/internal/handlers/a2a_proxy_test.go):

  • TestIsUpstreamBusyError — 10 error shapes (stdlib typed and url.Error-wrapped strings for both deadline and EOF). Includes negative cases (DNS / refused / unrelated errors).
  • TestProxyA2AError_BusyShape — end-to-end emit contract: 503 status, Retry-After: 30 header, JSON body with busy=true and retry_after=30.

Live verification attempted but inconclusive: redirected a workspace URL in Postgres to a hang server, but the platform's Redis URL cache shadows the DB value so the fake upstream was never hit. Unit tests cover every link in the chain (error detection → typed error struct → handler emit), so I'm confident in the change; a real 503-busy will be observable the next time an agent actually stalls under load.

Branch: fix/110-a2a-busy-response. Closes #110 (Option 3 — clearer error + Retry-After; queueing and timeout-bump deferred).

Platform — fix #117: surface Docker image-not-found error on provision

Bug: Provisioning a workspace whose runtime image isn't built locally silently failed. GET /workspaces/:id returned {status: "failed", last_sample_error: ""} — no hint that the image was missing or which build command to run. Discovered during the MeDo hackathon smoke test; only diagnostic path was docker logs on the platform container.

Fix (two files):

  1. workspace-server/internal/provisioner/provisioner.go::Start — when ContainerCreate returns "No such image", wrap the error with the resolved image tag and the exact build-all.sh <runtime> command the operator should run. Uses %w so errors.Is/errors.As chains stay intact.
  2. workspace-server/internal/handlers/workspace_provision.go — on provisioner.Start failure, the UPDATE now sets last_sample_error = $2 alongside status='failed'. Previously the error was only logged + broadcast.
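A Python sketch of the wrapping step (the real code is Go and uses %w; the message shape copies the live-E2E output quoted later in this entry, while the tag-parsing heuristic is an assumption):

```python
def wrap_image_not_found(daemon_err: str, image: str) -> str:
    """If the Docker daemon reports a missing image, append the
    resolved tag and the exact build command the operator should run;
    pass other errors through unchanged."""
    if "No such image" not in daemon_err:
        return daemon_err
    runtime = image.rsplit(":", 1)[-1]  # assumed: runtime is the image tag
    return (
        f'docker image "{image}" not found — run '
        f"'bash workspace/build-all.sh {runtime}' to build it "
        f"(underlying error: {daemon_err})"
    )
```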

Tests (workspace-server/internal/provisioner/provisioner_test.go):

  • TestIsImageNotFoundErr — 7 error shapes (moby's exact message, variants, unrelated errors)
  • TestRuntimeTagFromImage — 6 image-reference shapes including fallback paths
  • TestImageNotFoundErrorIncludesBuildHint — asserts the wrapped error string includes the image, the build command, and the underlying daemon message

Live E2E: provisioned with runtime: autogen after docker rmi workspace-template:autogen. Before: last_sample_error: "". After: docker image "workspace-template:autogen" not found — run 'bash workspace/build-all.sh autogen' to build it (underlying error: Error response from daemon: No such image: workspace-template:autogen). Image rebuilt after test to restore baseline.

Branch: fix/117-provisioner-surface-image-error. Closes #117.

Phase 30.1 — Workspace auth tokens (SaaS foundation)

Scope: first step of Phase 30 (cross-network federation). Per-workspace bearer tokens so remote agents can authenticate themselves to the platform without being spoofable. Transparent to local containers during the transition — legacy workspaces are grandfathered on /registry/heartbeat until their next /registry/register issues them a token.

What landed:

  • workspace-server/migrations/020_workspace_auth_tokens.{up,down}.sql — new workspace_auth_tokens table storing sha256(plaintext) + 8-char prefix for display. Plaintext never persisted.
  • workspace-server/internal/wsauth/ — new package: IssueToken, ValidateToken, HasAnyLiveToken, RevokeAllForWorkspace, BearerTokenFromHeader. Opaque 256-bit tokens (base64url), no JWT.
  • workspace-server/internal/handlers/registry.go::Register — issues a token on first registration only (idempotent on re-register); returns it in the response body as auth_token.
  • registry.go::Heartbeat, ::UpdateCard — validate Authorization: Bearer <token> if the workspace has any live token on file. Legacy workspaces with no token → 200 (grandfather path).
  • workspace/platform_auth.py — new agent-side store: reads ${CONFIGS_DIR}/.auth_token, in-process cache, auth_headers() helper. File is 0600.
  • workspace/main.py — saves the token returned by register.
  • workspace/heartbeat.py, a2a_tools.py, molecule_ai_status.py, executor_helpers.py — all four heartbeat call sites now send auth_headers().
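The token scheme is easy to sketch in Python: opaque 256-bit base64url plaintext, with only its sha256 and an 8-char display prefix persisted. A sketch of the wsauth surface, not the shipped Go code:

```python
import base64, hashlib, secrets

def issue_token():
    """Mint an opaque 256-bit bearer token. Only sha256(plaintext) and
    an 8-char display prefix would be persisted; the plaintext is
    returned once to the registering workspace."""
    plaintext = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()
    return plaintext, hashlib.sha256(plaintext.encode()).hexdigest(), plaintext[:8]

def validate_token(presented: str, stored_hash: str) -> bool:
    """Constant-time compare of sha256(presented) against the stored hash."""
    digest = hashlib.sha256(presented.encode()).hexdigest()
    return secrets.compare_digest(digest, stored_hash)

def bearer_token_from_header(header: str):
    """Parse an 'Authorization: Bearer <token>' value; None if malformed."""
    parts = header.split(None, 1)
    if len(parts) == 2 and parts[0].lower() == "bearer" and parts[1].strip():
        return parts[1].strip()
    return None
```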

Tests:

  • workspace-server/internal/wsauth/tokens_test.go — 11 cases: issuance persists only hash, tokens unique per call, validate happy path, wrong-workspace rejected, unknown token rejected, empty inputs rejected, HasAnyLiveToken with 0/1/7 rows, revoke, bearer header parser with 7 inputs.
  • workspace/tests/test_platform_auth.py — 14 cases: get/save round-trip, 0600 mode, whitespace stripping, empty-token rejection, idempotent saves (no mtime churn), rotation, header format, caching semantics, empty-file handling, CONFIGS_DIR respect + fallback.
  • Fixed tests/test_molecule_ai_status.py::_FakePost + exploding_post to accept headers= kwarg (test fixture API drift from the production code change).

Live E2E verified against real Postgres + running platform:

  • Legacy workspace (no tokens) → heartbeat 200 (grandfathered)
  • Fresh register → token returned in response body
  • Heartbeat without token (token exists) → 401
  • Heartbeat with valid token → 200
  • Spoofing with guessed token → 401
  • Cross-workspace token reuse (A's token for B) → 401
  • Re-register after token issued → response has no auth_token (idempotent)

Test totals: Go 476 → 487, Python 1064 → 1078.

Docs:

  • docs/remote-workspaces-readiness.md — full code audit that scopes Phase 30 (five sections: local-only assumptions, existing seams, hard problems, minimum viable remote shape, ordered next steps).
  • PLAN.md — new Phase 30 section with eight bounded sub-steps (30.1 through 30.8), out-of-scope boundaries, success criteria.

Branch: feat/30.1-workspace-auth-tokens. First PR of Phase 30.

Fix #125 — commit_memory writes now surface in activity_logs

Bug: commit_memory MCP tool calls succeeded silently. Operators inspecting the Canvas "Agent Comms" tab couldn't see what an agent chose to remember during a task.

Fix (two files):

  1. workspace/builtin_tools/memory.py::commit_memory — on successful write, fire-and-forget a POST /workspaces/:id/activity call via new helper _record_memory_activity(scope, content, memory_id). Summary format [<SCOPE>] <80-char preview>… (id=<id>). The memory id is embedded in the summary (not target_id) because target_id is a UUID column scoped to workspace references; awareness memory ids are arbitrary strings.

  2. workspace-server/internal/handlers/activity.go::Report — added memory_write to the activity_type allowlist. Without this the handler returned 400 with the prior list {a2a_send, a2a_receive, task_update, agent_log, skill_promotion, error}.
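The summary construction in step 1 can be sketched as follows (scope casing, whitespace collapsing, and the exact 80-char boundary are assumptions read off the format string above; the real helper lives in workspace/builtin_tools/memory.py):

```python
def memory_activity_summary(scope: str, content: str, memory_id: str) -> str:
    """Build the '[<SCOPE>] <80-char preview>… (id=<id>)' summary, with
    the memory id embedded in the text because target_id is UUID-typed."""
    preview = " ".join(content.split())      # collapse newlines/whitespace
    if len(preview) > 80:
        preview = preview[:80] + "…"         # assumed truncation boundary
    return f"[{scope.upper()}] {preview} (id={memory_id})"
```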

Tests:

  • workspace/tests/test_memory.py — 6 new cases: posts to /activity endpoint with right shape; truncates content > 80 chars with ellipsis; strips newlines from summary; skips when WORKSPACE_ID or PLATFORM_URL is missing; swallows POST failures (must not poison tool path); embeds id in summary regardless.

  • workspace-server/internal/handlers/activity_test.go — 2 new cases: memory_write accepted (200), unknown type still 400 with the updated message including memory_write.

Live E2E against running platform + Postgres:

  • Direct curl POST with activity_type=memory_write → 200 + DB row
  • _record_memory_activity from Python → row visible via GET /workspaces/:id/activity?type=memory_write
  • Confirmed target_id UUID-typing rejection from prior attempt (caught the bug — fix lands the id in summary instead)

Test totals: Go 487 → 489, Python 1078 → 1084.

Branch: fix/125-commit-memory-activity-log. Closes #125.

Phase 30.2 + 30.5 — Remote secrets pull + A2A caller-token validation

Two bounded steps shipped together since they share the same wsauth validation shape.

30.2 — GET /workspaces/:id/secrets/values

  • New handler in workspace-server/internal/handlers/secrets.go::Values. Returns the merged decrypted global+workspace secrets as a flat {"KEY": "value"} JSON map. Same merge semantics as the provisioner's env-var injection, so a remote agent bootstrapping via pull sees exactly the same secrets a local container would receive via push.
  • Auth: Phase 30.1 bearer token required when the workspace has any live token on file. Legacy workspaces grandfathered through. Fail-closed on the token-existence check (different from heartbeat's fail-open) because this endpoint returns plaintext secrets.
  • Route wired in workspace-server/internal/router/router.go:170.
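The merge semantics sketched in Python. That workspace-level keys override global ones is an assumption (consistent with the parity claim above about the provisioner's env-var injection); the Go handler is authoritative:

```python
def merge_secrets(global_secrets: dict, workspace_secrets: dict) -> dict:
    """Flat {'KEY': 'value'} merge for the pull endpoint. Assumed
    precedence: workspace-scoped values win over global ones."""
    merged = dict(global_secrets)
    merged.update(workspace_secrets)
    return merged
```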

30.5 — A2A proxy caller-token validation

  • workspace-server/internal/handlers/a2a_proxy.go::ProxyA2A now calls validateCallerToken(ctx, c, callerID) before the existing CanCommunicate hierarchy check. Three bypass paths preserved: canvas (empty X-Workspace-ID), system callers (webhook:, system:, test: prefixes), self-calls (callerID==workspaceID).
  • Token binding is strict: compromised token from workspace A cannot authenticate a caller claiming to be workspace B. Tested.
  • Fail-open on DB hiccup — caller-token is defense-in-depth on top of hierarchy, not the sole gate.

Tests:

  • 5 new Go tests in secrets_test.go (legacy grandfather, missing token, wrong token, valid token with merge precedence, invalid workspace ID).
  • 5 new Go tests in a2a_proxy_test.go::TestValidateCallerToken (legacy grandfather, missing token, invalid token, valid token, wrong-workspace binding rejection).

Live E2E verified against real Postgres + platform:

  • 30.2: no-token → 401, bad-token → 401, valid-token → 200 with correct {"PHASE_30_DEMO":"hello-from-pull-endpoint"}.
  • 30.5: canvas bypass ✓, self-call bypass ✓, system-caller bypass ✓, cross-workspace no-token → 401 "missing caller auth token", cross-workspace wrong-token → 401 "invalid caller auth token", cross-workspace valid-token → 403 "access denied" (falls through to hierarchy check as designed).

Phase 30 status on main: 30.1 ✓, 30.2 (this PR), 30.5 (this PR). Remaining: 30.3 (plugin tarball), 30.4 (state polling), 30.6 (sibling URL cache), 30.7 (poll-liveness), 30.8 (SDK + GA).

Branch: feat/30.2-30.5-remote-auth. PLAN.md checkboxes flipped for 30.1, 30.2, 30.5.

Phase 30.4 + 30.8 — State polling + Remote-agent SDK (first working e2e)

Shipped together because 30.8 (the runnable example) is the proof-of-life for everything 30.1–30.5 built up to. 30.4 is the missing piece that lets a remote agent detect pause/delete without WebSocket reachability.

30.4 — GET /workspaces/:id/state

  • New handler workspace.State at workspace-server/internal/handlers/workspace.go. Returns {workspace_id, status, paused, deleted}. Token-gated with the same Phase 30.1 shape (legacy grandfather, fail-closed on DB error). Deliberately not merged with GET /workspaces/:id — that path is for the canvas (unauthenticated, rich config). This is the agent-machinery polling path: tight, token-gated, cache-friendly.
  • Returns 404 + {deleted: true} for hard-deleted rows so the SDK can distinguish from transient network issues.

30.8 — sdk/python/molecule_agent/

  • New RemoteAgentClient class (blocking, requests-only, no async) with methods mirroring the Phase 30 endpoints: register(), pull_secrets(), poll_state(), heartbeat(), run_heartbeat_loop().
  • Token cache at ~/.molecule/<workspace_id>/.auth_token with 0600 perms. Register is idempotent — re-registering an already-tokened workspace keeps using the on-disk copy.
  • Loop exits gracefully on pause/delete, returning the terminal status for the caller to log / exit on.
  • Tolerates transient heartbeat + state-poll failures without crashing the loop (log and continue).
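The loop semantics above can be sketched as follows (poll_state and heartbeat are injected callables purely for illustration; the real client uses its own HTTP methods and a real sleep interval):

```python
import time

def run_heartbeat_loop(poll_state, heartbeat, interval=5.0, max_iterations=None):
    """Heartbeat, poll state, exit with the terminal status on
    pause/delete, and log-and-continue on transient errors."""
    i = 0
    while max_iterations is None or i < max_iterations:
        i += 1
        try:
            heartbeat()
            state = poll_state()
            if state.get("deleted"):
                return "deleted"     # workspace hard-deleted: terminal
            if state.get("paused"):
                return "paused"      # platform paused us: exit gracefully
        except Exception as exc:
            # transient heartbeat / state-poll failure: keep looping
            print(f"heartbeat loop: transient error, continuing: {exc}")
        time.sleep(interval)
    return "max-iterations"
```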

examples/remote-agent/

  • Runnable 100-line demo: WORKSPACE_ID=x PLATFORM_URL=y python3 run.py. README walks through workspace creation via external: true, seeding a secret, running the agent.
  • Note found during live verification: POST /registry/register upserts status='online', so re-registering an already-paused workspace reverts it. Not a bug in 30.4, but it affects the order of operations in the demo (register once, then pause takes effect on the long-running loop). Filed as follow-up — see "Known follow-ups" below.

Tests:

  • 5 new Go tests for workspace.State (legacy grandfather, paused, hard-delete 404, missing token, valid token).
  • 22 new Python tests for RemoteAgentClient (token persistence with 0600 check, register issues/reuses, secrets pull, state poll, 404 = deleted, heartbeat body shape, loop exits on paused/deleted/max-iterations, transient-error continuation).

Live E2E with all of 30.1/30.2/30.4/30.5 running:

  • Agent register → token issued ✓
  • received 2 secret(s): keys=['API_KEY', 'REMOTE_DEMO_KEY']
  • Heartbeat loop runs, uptime advances to 10s ✓
  • POST /pause mid-loop → platform reports workspace paused (paused=True deleted=False) — exiting within ~5s ✓
  • Clean terminal status paused

Known follow-ups (not this PR):

  • Register's status='online' overwrite undoes platform-side pause if the agent happens to re-register. Should check current status and preserve paused / removed.
  • Loop currently can't receive inbound A2A — reported_url is remote://no-inbound as a placeholder. A future 30.8b will add an optional start_a2a_server() helper for agents behind a public URL or tunneled port.

PLAN.md: 30.4 ✓, 30.8 ✓. Phase 30 remaining: 30.3 (plugin tarball), 30.6 (sibling URL cache), 30.7 (poll-liveness monitor).

Branch: feat/30.4-state-polling (merged 30.2+30.5 PR #130 into it mid-session for the live E2E to have all endpoints available).

Phase 30.7 — Poll-liveness for external-runtime workspaces

Why this is the missing piece: without it, a dead remote agent stayed "online" on the canvas forever. The existing health sweep explicitly skipped runtime='external' rows because it only knew how to ask Docker "is the container alive?" — wrong question for a workspace the platform never started.

Fix (workspace-server/internal/registry/healthsweep.go):

  • New sweepStaleRemoteWorkspaces runs on the same ticker as the Docker sweep. Queries workspaces with runtime='external' whose last_heartbeat_at is older than REMOTE_LIVENESS_STALE_AFTER (default 90s, env-overridable). Marks them offline, clears Redis state, fires onOffline so the canvas sees WORKSPACE_OFFLINE.
  • StartHealthSweep no longer early-returns on nil Docker checker — a SaaS front-door deployment without local Docker still needs remote-liveness monitoring.
  • Newly-registered external workspaces that haven't heartbeated yet are compared against updated_at (set on register), so an agent that crashes before its first heartbeat is still swept after the grace window.
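The staleness decision and the env parsing sketched in Python (illustrative only; the shipped helper is Go):

```python
import os

def remote_stale_after_seconds(default=90):
    """Parse REMOTE_LIVENESS_STALE_AFTER, falling back to the default
    on unset, non-integer, zero, or negative values."""
    raw = os.environ.get("REMOTE_LIVENESS_STALE_AFTER", "")
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default

def is_stale(last_heartbeat_at, updated_at, now, stale_after):
    """Judge a workspace that has never heartbeated against updated_at
    (set on register), so a crash before the first heartbeat is still
    swept after the grace window. Timestamps are epoch seconds here."""
    reference = last_heartbeat_at if last_heartbeat_at is not None else updated_at
    return (now - reference) > stale_after
```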

Tests (workspace-server/internal/registry/healthsweep_test.go):

  • sweepStaleRemoteWorkspaces with 2 stale rows → UPDATE + onOffline called twice
  • No stale rows → onOffline never called
  • Nil callback → no panic
  • DB outage → logged, no panic, no false offlines
  • remoteStaleAfter: default when env unset; honors valid integer override; falls back on garbage values (abc, 0, -10, empty)
  • StartHealthSweep with nil checker: still ticks and runs remote sweep (previously would early-return)

Live E2E with REMOTE_LIVENESS_STALE_AFTER=10 for test speed:

  • Agent register → heartbeat → exit → status=online (heartbeat fresh)
  • Wait 30s → status=offline (platform swept at 15s tick, saw heartbeat >10s old). Log: Health sweep (remote): <id> heartbeat stale (>10s) — marking offline
  • Restart agent → heartbeat resumes → status=online again
  • Full cycle observable on canvas via WORKSPACE_OFFLINE / WORKSPACE_ONLINE broadcasts

All Phase 30 remote-agent capabilities now demonstrable end-to-end:

| Step | Live E2E status |
|---|---|
| 30.1 Token auth | register + heartbeat bearer-auth'd |
| 30.2 Secrets pull | keys=['API_KEY','REMOTE_DEMO_KEY'] |
| 30.4 State polling | pause detected in ~5s |
| 30.5 A2A caller auth | 401/403 separation confirmed |
| 30.7 Poll-liveness | stale→offline→restart→online cycle |
| 30.8 SDK + example | examples/remote-agent/run.py |

Phase 30 remaining: 30.3 (plugin tarball), 30.6 (sibling URL cache). Neither blocks the current SaaS loop; 30.3 matters when remote agents need to install plugins with heavy deps, 30.6 is a resilience optimization for agent-to-agent direct calls.

Branch: feat/30.7-poll-liveness. PLAN.md 30.7 ✓.

Phase 30.6 — Sibling discovery auth + URL caching

Two tied fixes:

Platform side — /registry/discover/:id and /registry/:id/peers were unauthenticated. For a SaaS front-door deployment, any internet host that knows a workspace ID could enumerate siblings and pull their URLs. Added validateDiscoveryCaller using the same lazy-bootstrap Phase 30.1 token pattern. Fail-open on DB hiccup (unlike secrets.Values which fails-closed) because discovery only exposes URLs already behind CanCommunicate — the hierarchy check downstream is the primary gate, auth is defense-in-depth.

SDK side — new methods on RemoteAgentClient:

  • get_peers() → list of PeerInfo, seeds URL cache
  • discover_peer(id) → cached lookup with 5-min TTL, refreshes on expiry, returns None on 404
  • invalidate_peer_url(id) → drop cache entry (call after a direct-call failure so next call re-discovers)
  • call_peer(id, message, prefer_direct=True) → sends A2A message/send. Direct path on cache hit; graceful fallback to platform proxy on connection error / 5xx with cache invalidation. prefer_direct=False forces proxy routing.
  • New PeerInfo dataclass exported alongside WorkspaceState.
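The TTL cache behind discover_peer / invalidate_peer_url, sketched as a standalone class (the PeerURLCache name and the injectable clock are illustrative, not the SDK's actual internals):

```python
import time

class PeerURLCache:
    """5-minute-TTL URL cache: seeded from get_peers(), skips non-http
    placeholder URLs, expires entries to force re-discovery, and
    supports idempotent invalidation after a direct-call failure."""
    def __init__(self, ttl=300.0, clock=time.monotonic):
        self.ttl, self.clock = ttl, clock
        self._entries = {}

    def put(self, peer_id, url):
        if url.startswith("http"):            # skip remote:// placeholders
            self._entries[peer_id] = (url, self.clock())

    def get(self, peer_id):
        entry = self._entries.get(peer_id)
        if entry is None:
            return None
        url, stamp = entry
        if self.clock() - stamp > self.ttl:   # expired: force re-discovery
            del self._entries[peer_id]
            return None
        return url

    def invalidate(self, peer_id):
        self._entries.pop(peer_id, None)      # idempotent
```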

Tests: 12 new SDK tests (cache seeding skips non-http URLs, cache hit short-circuits, expired cache refreshes, 404 returns None, invalidate_peer_url idempotent, direct-path vs proxy-fallback vs prefer-direct=False, fresh call with no cache does discover-then-direct).

Bug caught during verification: my first discovery auth shape fail-closed on DB errors, which broke existing TestDiscover_* and TestPeers_* tests that didn't set up the HasAnyLiveToken sqlmock expectation. Switched to fail-open — discovery is hierarchy-gated anyway, and a DB hiccup shouldn't take agent-to-agent discovery offline. 8 tests restored green.

Live E2E with a tiny Python echo server as sibling-B:

  • get_peers returns 2 peers (echo server + parent PM)
  • URL cache seeded ONLY with http:// entry (skips remote://pm)
  • call_peer routes directly to http://127.0.0.1:9876 — no proxy hop
  • Echo server responds, SDK returns "echoed: hello sibling over SDK"
  • Auth + hierarchy all verified: no-token→401, wrong-token→401, cross-workspace token→401, out-of-hierarchy discover→403

Phase 30 status after this: 30.1 ✓ 30.2 ✓ 30.4 ✓ 30.5 ✓ 30.6 ✓ 30.7 ✓ 30.8 ✓. Only 30.3 (plugin tarball download) remains, and I flagged that one as lower priority — the current SaaS loop doesn't need it until a real user has a heavy-deps plugin.

Branch: feat/30.6-sibling-cache.

Fix #123 — Telegram kicked/left now persists enabled=false

Bug: When the Molecule AI bot was removed from a Telegram chat, the handler at telegram.go:594-596 only logged the event — the matching workspace_channels row stayed enabled=true. Every subsequent outbound message hit Telegram 403 forever.

Fix:

  • New package-level callback disableChannelByChatID in telegram.go, default no-op (safe for early boot / tests).
  • manager.go::NewManager wires it to run UPDATE workspace_channels SET enabled=false WHERE channel_type='telegram' AND enabled=true AND config->>'chat_id'=$1, then call m.Reload(ctx) if any row flipped so the in-memory poller map drops the now-disabled row.
  • onMyChatMember::case "left", "kicked" now calls the callback immediately after the existing log line (removes the TODO).

Tests (workspace-server/internal/channels/channels_test.go):

  • default-is-no-op (var safe to call pre-Manager-init)
  • wired-callback fires UPDATE with exact WHERE shape + arg + triggers Reload via follow-up SELECT
  • no-rows-affected skips reload (avoids SELECT storm on unrelated kicked events from other bots)

Branch: fix/123-telegram-kicked-persist. Closes #123.

Phase 30 client adaptations — MCP / molecli / Canvas / SDK

Phase 30 itself shipped the platform-side endpoints. These adaptations make those endpoints visible and usable from every client surface without requiring callers to know the new URL paths by hand.

MCP — 4 new tools in mcp-server/src/index.ts:

  • list_remote_agents — filters workspace list to runtime='external'
  • get_remote_agent_state — projects {workspace_id, status, paused, deleted}
  • get_remote_agent_setup_command — emits the WORKSPACE_ID=... PLATFORM_URL=... python3 ... bash one-liner an operator can paste into a remote shell
  • check_remote_agent_freshness — compares last_heartbeat_at against configurable threshold (default 90s); returns {fresh, seconds_since_heartbeat}

8 new MCP tests (88 → 96).

molecli — WorkspaceInfo gains a Runtime field; printWorkspaceTable adds a RUNTIME column showing ★ external for remote agents so they pop in a long table; detail view labels them external (Phase 30 remote agent). Live: molecli ws list now shows the badge correctly.

Canvas — WorkspaceNode.tsx reads data.runtime (workspace row) in preference to data.agentCard.runtime (agent-reported). Remote agents get a distinct violet ★ REMOTE pill with a tooltip explaining the heartbeat-based lifecycle. 352/352 vitests still pass.

SDK — pyproject.toml rebranded to molecule-sdk@0.2.0 so a single pip install molecule-sdk ships both molecule_plugin (plugin authors) and molecule_agent (remote-agent authors). Added trove classifiers, keywords, requires-python pin. New sdk/python/molecule_agent/README.md quickstart.

Live verification:

  • MCP: spawned a real external workspace, ran all 4 tools via node smoke script — count=1, setup_command renders, freshness=null (no heartbeat yet, returns fresh=false correctly)
  • molecli: ws list shows ★ external badge on the remote workspace
  • Canvas tests green; visual change is small (one badge swap)
  • SDK: 121 SDK tests + 1078 workspace-template tests still pass

Branch: feat/phase30-client-adaptations.

Phase 30.3 — Plugin tarball download (external GitHub repo verified)

Platform: new GET /workspaces/:id/plugins/:name/download[?source=...] streams the named plugin as a gzipped tarball. Reuses resolveAndStage so all existing source schemes (local://, github://, future clawhub://) work — the endpoint is just the download surface for what Install was already doing internally.

Token-gated (fail-closed on DB error, since the tarball can include rule text and skill files referencing internals). Defaults source to local://<name> when the query param is omitted. Validates that the plugin name in the URL path matches the resolved plugin's manifest name, which prevents a github:// source that resolves to a different plugin from being shipped under the requested name.
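A minimal sketch of that name-match check (the real logic lives in the Go handler; the function name and return shape here are illustrative, though the 400 body matches what the live E2E observed):

```python
def validate_plugin_name(requested_name: str, resolved_manifest_name: str):
    """Reject a download whose resolved manifest name differs from the
    plugin name in the URL path. Returns (status, body): 400 with both
    names on mismatch, 200 with no body otherwise. (Sketch only.)
    """
    if requested_name != resolved_manifest_name:
        return 400, {
            "resolved_name": resolved_manifest_name,
            "requested_name": requested_name,
        }
    return 200, None
```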

SDK RemoteAgentClient.install_plugin(name, source=None):

  1. Stream the download
  2. Atomic extract via sibling-tempdir + rename (no half-installed states)
  3. Run setup.sh if present (best-effort)
  4. POST /workspaces/:id/plugins to register the install
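Step 2's no-half-installed-states guarantee can be sketched as follows (helper name hypothetical; the real implementation lives in the SDK's RemoteAgentClient). The key idea: extract into a temp dir that is a sibling of the destination (same filesystem, so the rename is atomic), then swap it into place only after extraction succeeds.

```python
import os
import shutil
import tarfile
import tempfile

def atomic_install(tarball_path: str, dest_dir: str):
    """Unpack a plugin tarball with no half-installed states (sketch).

    Extract into a sibling temp dir, then rename into place; on any
    failure the staging dir is removed and the old install survives.
    """
    parent = os.path.dirname(os.path.abspath(dest_dir))
    staging = tempfile.mkdtemp(dir=parent)
    try:
        with tarfile.open(tarball_path, "r:gz") as tar:
            tar.extractall(staging)  # real code routes through _safe_extract_tar
        if os.path.isdir(dest_dir):
            shutil.rmtree(dest_dir)  # overwrite an existing install
        os.rename(staging, dest_dir)
    except BaseException:
        # Corrupt tarball or failed extract: roll back, keep old install
        shutil.rmtree(staging, ignore_errors=True)
        raise
```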

_safe_extract_tar rejects path traversal (../ escapes, absolute paths) and silently skips symlinks/hardlinks, defending against tar-slip CVEs. Both defenses are exercised with adversarial inputs in the tests.

Tests:

  • 5 new Go (auth, tarball shape, name mismatch, tar streaming relative paths, tar symlink skip)
  • 11 new Python SDK (unpack location, source query param, atomic rollback on corrupt tarball, overwrite existing, setup.sh ran/skipped, platform-report skipped, 404 surfaces, _safe_extract path-traversal rejection, absolute-path rejection, symlink skip)

Live E2E with a real external GitHub repo created via gh repo create (HongmingWang-Rabbit/starfire-test-plugin):

  • local://molecule-dev → 4612-byte tarball, plugin.yaml + skills/ present
  • github://HongmingWang-Rabbit/starfire-test-plugin → 711-byte tarball pulled from real GitHub, unpacked locally, setup.sh ran on the agent's host machine producing /tmp/sf-plugin-test-setup-ran
  • Auth gates: 401/401/200 confirmed
  • Name-mismatch: requested wrong-name with source=...starfire-test-plugin returned 400 with {"resolved_name":"starfire-test-plugin","requested_name":"wrong-name"}

Phase 30 is now feature-complete: 30.1 30.2 30.3 30.4 30.5 30.6 30.7 30.8

Branch: feat/30.3-plugin-tarball. Test repo: https://github.com/HongmingWang-Rabbit/starfire-test-plugin

Bugfix #124 — Delegation idempotency

Promoted from docs/known-issues.md KI-002. When a workspace container restarted mid-delegation (Redis TTL → liveness restart), agents could re-issue POST /workspaces/:id/delegate and produce duplicate work (double commits, double Telegram messages, double API calls).

Migration 021_delegation_idempotency.up.sql:

  • activity_logs.idempotency_key TEXT NULL
  • Partial unique index on (workspace_id, idempotency_key) WHERE idempotency_key IS NOT NULL — fully backwards compatible

Handler (workspace-server/internal/handlers/delegation.go::Delegate):

  • Optional idempotency_key field on the request body
  • On receipt: lookup (workspace_id, key) → if found and not failed, return existing delegation_id with HTTP 200 + idempotent_hit: true
  • If the prior row is failed, the slot is released so the retry can produce a fresh delegation (still 202)
  • If two concurrent calls race past the lookup, the unique-constraint violation on insert is caught and the loser re-queries to surface the same idempotent response (HTTP 200) instead of a 500
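The lookup/insert/race flow above, sketched against an in-memory store (the real handler is Go + Postgres and the names here are illustrative only; UniqueViolation stands in for the unique-constraint error):

```python
class UniqueViolation(Exception):
    """Stand-in for the DB unique-constraint error (assumption)."""

class DelegationStore:
    """In-memory sketch of the idempotency flow described above."""

    def __init__(self):
        self._rows = {}  # (workspace_id, key) -> {"id": ..., "status": ...}

    def delegate(self, workspace_id, key, new_id):
        if key is None:
            return 202, {"delegation_id": new_id}  # no idempotency requested
        row = self._rows.get((workspace_id, key))
        if row and row["status"] != "failed":
            # Replay: return the existing delegation, not a duplicate
            return 200, {"delegation_id": row["id"], "idempotent_hit": True}
        # Missing or failed row: (re)claim the slot with a fresh delegation
        try:
            self._insert(workspace_id, key, new_id)
        except UniqueViolation:
            # Lost a concurrent race: re-query and surface the winner
            row = self._rows[(workspace_id, key)]
            return 200, {"delegation_id": row["id"], "idempotent_hit": True}
        return 202, {"delegation_id": new_id}

    def _insert(self, workspace_id, key, new_id):
        existing = self._rows.get((workspace_id, key))
        if existing and existing["status"] != "failed":
            raise UniqueViolation
        self._rows[(workspace_id, key)] = {"id": new_id, "status": "running"}
```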

Tests (3 new + 2 updated, all green under go test -race):

  • TestDelegate_IdempotentReplayReturnsExistingDelegation
  • TestDelegate_IdempotentFailedRowIsReleasedAndReplaced
  • TestDelegate_IdempotentRaceUniqueViolationReturnsExisting
  • Updated TestDelegate_Success and TestDelegate_DBInsertFails_Still202WithWarning to assert the new 6th INSERT arg (idempotency_key = NULL when omitted)

Branch: fix/auto-review-2026-04-13-delegation-idempotency. Closes #124.