molecule-core

Author	SHA1	Message	Date
claude-ceo-assistant (Claude Opus 4.7 on Hongming's MacBook)	25fb696965	chore: reconcile main → staging post-suspension divergence Refs Task #165 (Class D AUTO_SYNC_TOKEN plumbing). main and staging diverged after the 2026-05-06 GitHub-org suspension because Class D / Class G / feature work landed on staging while unrelated CI fixes (#34-47, ECR auth-inline, buildx→docker, pre-clone manifest deps) landed straight on main. Both branches edited the same workflow files, so every push to main triggered an Auto-sync run that aborted at `git merge --no-ff origin/main` with 7 content conflicts: - .github/workflows/canary-verify.yml (URL: github.com → Gitea) - .github/workflows/ci.yml (3 URL refs) - .github/workflows/publish-runtime.yml (cascade: HTTP repo-dispatch → Gitea push) - .github/workflows/publish-workspace-server-image.yml (drop AWS-action steps; ECR auth is inline) - .github/workflows/retarget-main-to-staging.yml (URL) - manifest.json (lowercase org slug + add mock-bigorg from main) - scripts/clone-manifest.sh (keep main's MOLECULE_GITEA_TOKEN auth path + drop awk-tolower since manifest is now lowercase) Resolution: union — staging's post-suspension Gitea/ECR migrations win on URL/policy edits; main's additive work (mock-bigorg manifest entry, inline ECR auth, MOLECULE_GITEA_TOKEN basic-auth) is preserved on top. After this lands, staging is a strict superset of main, so the next auto-sync run on a push to main will be a clean fast-forward / no-op. The auto-sync workflow on main also picks up staging's AUTO_SYNC_TOKEN swap (Class D #26) for free, fixing the latent layer-2 push-auth issue. 
Verified locally: - bash -n scripts/clone-manifest.sh - python -c 'yaml.safe_load(...)' on each touched workflow - python -c 'json.load(open(manifest.json))' (21 plugins, 9 templates, 7 org_templates) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 14:24:37 -07:00
Hongming Wang	d64641904f	feat(workspace-server): mock runtime + mock-bigorg org template Adds a 'mock' runtime: virtual workspaces with no container, no EC2, no LLM. Every A2A reply is synthesised from a small canned-variant pool ('On it!', 'Got it, on it now.', etc.) deterministically seeded by (workspace_id, request_id). Built for funding-demo "200-workspace mock org" — renders an enterprise-scale org chart on the canvas (CEO/VPs/Managers/ICs) without burning real LLM credits or provisioning 200 EC2 instances. Surfaces: - workspace-server/internal/handlers/mock_runtime.go: A2A proxy short-circuit, canned-reply pool, deterministic variant pick. - workspace-server/internal/handlers/a2a_proxy.go: gate the short-circuit before resolveAgentURL (mock has no URL). - workspace-server/internal/handlers/org_import.go: skip Docker provisioning for mock workspaces, set status='online' directly, drop the per-sibling 2s pacing for mock children (collapses a 200-workspace import from ~7min → ~1s). - workspace-server/internal/handlers/runtime_registry.go: register 'mock' in the runtime allowlist (manifest + fallback set). - workspace-server/internal/registry/healthsweep.go + orphan_sweeper.go: skip mock workspaces in container-health and stale-token sweeps (no container by design). - workspace-server/internal/handlers/workspace_restart.go: mirror the 'external' Restart no-op for mock. - manifest.json: register the new Molecule-AI/molecule-ai-org-template-mock-bigorg repo. Tests: 5 new in mock_runtime_test.go covering happy-path, non-mock regression guard, determinism, IsMockRuntime trim/case, JSON-RPC id echo. All existing handler + registry tests still pass. Local-verified: imported the 200-workspace template against a fresh postgres+redis, confirmed all 200 land in 'online' and stay there through the 30s health-sweep window, exercised A2A on CEO + VPs + Managers + ICs and saw the variant pool rotate. 
Org template lives at Molecule-AI/molecule-ai-org-template-mock-bigorg (created today) and is imported via the existing /org/import flow on the canvas Template Palette. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 08:40:37 -07:00
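The deterministic variant pick described in this commit can be sketched roughly as follows. This is an illustration only, not the code in mock_runtime.go: the `pickVariant` name and the choice of FNV-1a hashing are assumptions, and the pool holds only the two replies quoted in the commit message.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// cannedReplies holds only the two variants quoted in the commit message;
// the real pool is described as slightly larger.
var cannedReplies = []string{"On it!", "Got it, on it now."}

// pickVariant is a hypothetical helper: it hashes (workspaceID, requestID)
// and uses the hash to index the pool, so the same pair always yields the
// same reply while different requests rotate through the pool.
func pickVariant(workspaceID, requestID string) string {
	h := fnv.New32a()
	h.Write([]byte(workspaceID))
	h.Write([]byte{0}) // separator so ("ab","c") and ("a","bc") hash differently
	h.Write([]byte(requestID))
	return cannedReplies[h.Sum32()%uint32(len(cannedReplies))]
}

func main() {
	// Repeated calls with the same pair print the same reply twice.
	fmt.Println(pickVariant("ws-1", "req-1"))
	fmt.Println(pickVariant("ws-1", "req-1"))
}
```

Seeding from the request identity rather than a random source keeps replies reproducible across retries and test runs, which is what a determinism test needs.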
Hongming Wang	3cdb67f27e	fix(workspace-server): CP orphan sweeper closes deprovision split-write race (#2989) The deprovision path marks `workspaces.status='removed'` BEFORE calling the controlplane DELETE. 
If that CP call fails (transient 5xx, network hiccup, AWS provider error), the DB row stays at 'removed' with `instance_id` populated and there's no retry — the EC2 lives forever. 9 prod orphans accumulated over 3 days under this bug. Adds a SaaS-mode counterpart to the existing Docker `orphan_sweeper`: - 60s tick (matches the Docker sweeper cadence) - LIMIT 100 per cycle so a sustained CP outage drains over multiple cycles without blowing the request timeout - Re-issues `cpProv.Stop` for any workspace at status='removed' with a non-NULL `instance_id`. Stop is idempotent (AWS terminate on already-terminated is a no-op; CP's Deprovision tolerates already- deleted DNS) so retries are safe. - On Stop success, NULLs `instance_id` so the next cycle skips the row. - On Stop failure, leaves `instance_id` populated for next cycle. The existing Docker sweeper is gated on `prov != nil`; the new sweeper is gated on `cpProv != nil`. SaaS tenants get exactly one of the two, self-hosted tenants get the Docker one — no overlap. Why this shape over option A (CP-first ordering) or B (durable outbox): the existing inline path already returns a loud 500 to the user when CP fails — the only missing piece is automatic retry, which a 60s sweeper provides without protocol changes, new tables, or new workers. ~30 LOC of production code vs. ~400 for an outbox. RFC discussion in #2989 comment chain. Tests: - 9 unit tests covering happy path, Stop failure, UPDATE failure, multiple orphans (one-fails-others-still-process), DB query error, nil-DB defense, nil-reaper short-circuit, and the boot-immediate-then- tick cadence contract. - Mutation-tested: status='running' substitution and removed-UPDATE- block both fail at least one test. Out of scope: - Backfilling the 9 named orphans — they'll heal automatically on the first sweep cycle after this lands; no manual cleanup needed. - Long-term durable-outbox architecture — separate RFC.	2026-05-06 16:43:33 -07:00
Hongming Wang	9ceda9d81f	refactor(events): migrate 18 files to typed EventType constants (RFC #2945 PR-B-1) Mechanical migration of bare event-name strings in BroadcastOnly / RecordAndBroadcast call sites to the typed constants from internal/events/types.go (RFC #2945 PR-B). Wire format unchanged (both shapes serialize to identical WSMessage.Event literals); pinned by TestAllEventTypes_IsSnapshot in #2965. Migrated (18 files, scope: handlers/, scheduler/, registry/, bundle/, channels/): - handlers/{approvals,a2a_proxy_helpers,a2a_queue,activity,agent, delegation,external_rotate,org_import,registry,workspace, workspace_bootstrap,workspace_crud,workspace_provision_shared, workspace_restart}.go - channels/manager.go (caught by hostile-reviewer pass — initial scope missed channels/, found via grep on the post-migration tree) - scheduler/scheduler.go - registry/provisiontimeout.go - bundle/importer.go Hostile self-review (3 weakest spots, addressed) ------------------------------------------------ 1. Missed call sites — initial scope omitted channels/. Post-migration `grep -rEn 'BroadcastOnly\([^,]+,[^,]"[A-Z_]+"\|RecordAndBroadcast\([^,]+,[^,]"[A-Z_]+"' internal/` found 2 stragglers in channels/manager.go. Migrated. Final grep on the same pattern returns only the docstring example in types.go (intentional). 2. gofmt drift — auto-import injection produced non-canonical import ordering. `gofmt -w` applied ONLY to the 18 modified files (NOT the whole tree, to avoid sweeping unrelated pre-existing drift into this PR's diff). Three pre-existing un-gofmt'd files in handlers/ (a2a_proxy.go, a2a_proxy_test.go, a2a_queue_test.go) left as-is — they're unchanged by this PR and their drift predates it. 3. Wire format — paranoia check: do the constants serialize to the exact strings consumers (canvas TS, hermes plugin, anything parsing WSMessage.Event) expect? Yes. Pinned by the snapshot test. The migration is name-only; not a single character of wire output changes. 
Verified - go build ./... clean - go vet ./internal/... clean - gofmt -l on the 5 migrated package dirs: only pre-existing files - Full tests: handlers/, channels/, scheduler/, registry/, events/, bundle/ all green (5 ok, 0 fail) PR-B-2 (canvas TS mirror + cross-language parity gate) remains as the final piece of RFC #2945 PR-B. Tracked separately so this PR stays mechanical + reviewable. Refs RFC #2945, PR #2965 (PR-B types). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 19:05:03 -07:00
Hongming Wang	be271aef8b	fix(orphan-sweeper): exclude runtime='external' from stale-token revoke The Docker-mode orphan sweeper was incorrectly targeting external runtime workspaces, revoking their auth tokens ~6 minutes after creation (one sweep cycle past the 5-min grace). External workspaces have NO local container by design — their agent runs off-host. The "no live container" predicate the sweep uses to detect wiped-volume orphans matches every external workspace unconditionally, which was killing the only auth credential the off-host agent has. Reproducer: create runtime=external workspace, paste the auth token into molecule-mcp / curl, wait 5 minutes. Next request returns `HTTP 401 — token may be revoked`. Platform log shows `Orphan sweeper: revoking stale tokens for workspace <id> (no live container; volume likely wiped)`. Fix: add `AND w.runtime != 'external'` to the sweep's SELECT. The existing test regexes (third-pass query expectations + the shared expectStaleTokenSweepNoOp helper) are tightened to require the new predicate, so a regression that drops it fails CI immediately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 00:49:37 -07:00
Hongming Wang	0064f02c00	test(sweeper): integration coverage for manifest-override + accessor consolidation Two follow-ups from PR #2494's review: 1. Two new sweep tests exercise the lookup path through sweepStuckProvisioning end-to-end: - ManifestOverrideSparesRow: claude-code 11min old, manifest=20min → no UPDATE, no broadcast (sparing works through the sweeper) - ManifestOverrideStillFlipsPastDeadline: claude-code 21min old, manifest=20min → flipped + payload.timeout_secs=1200 Closes the gap that the unit-test on provisioningTimeoutFor alone left open: a future refactor could drop the lookup arg from the sweeper's call and only the unit test caught it. Verified by regression-injecting `lookup→nil` in sweepStuckProvisioning — both new tests fail, the old ones still pass. 2. addProvisionTimeoutMs now goes through ProvisionTimeoutSecondsForRuntime instead of calling provisionTimeouts.get directly. Single accessor path for the same data — the canvas response and the sweeper now resolve identically by construction. No production behavior change; tests + accessor cleanup only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 22:00:36 -07:00
Hongming Wang	18edf88d59	fix(sweeper): honour template-manifest provision_timeout_seconds Real wiring gap discovered while investigating issue #2486 cluster of prod claude-code workspaces failed at exactly 10m. The runtimeProvisionTimeoutsCache (#2054 phase 2) reads runtime_config.provision_timeout_seconds from each template's config.yaml so the canvas spinner respects per-template timeouts — but the sweeper in registry/provisiontimeout.go hardcoded 10 min (claude-code) / 30 min (hermes) and never consulted the manifest. So a template that declared a longer window had a UI that waited correctly but a sweeper that killed the row at the hardcoded floor anyway. Resolution order pinned by new TestProvisioningTimeout_ManifestOverride: 1. PROVISION_TIMEOUT_SECONDS env (ops-debug global override) 2. Template manifest lookup (per-runtime, beats hermes default too) 3. Hermes default (30 min — CP bootstrap-watcher 25 min + 5 min slack) 4. DefaultProvisioningTimeout (10 min) Wiring: - registry: new RuntimeTimeoutLookup function type, threaded through StartProvisioningTimeoutSweep + sweepStuckProvisioning + the pre-existing provisioningTimeoutFor. - handlers: ProvisionTimeoutSecondsForRuntime exposes the cache's lookup as a method so main.go can pass it without breaking the handlers→registry import direction. - cmd/server/main.go: wire wh.ProvisionTimeoutSecondsForRuntime into the sweep boot. Verified: - go test -race ./... passes (every workspace-server package). - Regression-injected the lookup arm: 3 manifest-override subcases fail with the actual-vs-expected gap, confirming the new test is load-bearing. - The original two timeout tests (env-override, hermes default) keep passing — `lookup=nil` argument preserves their semantics. Operator action enabled: a template wanting a 15-min window can now just set `runtime_config.provision_timeout_seconds: 900` in its config.yaml and the sweeper honours it on the next workspace-server restart. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 21:44:42 -07:00
Hongming Wang	fdf1b5d76a	refactor(workspace-status): typed constants + AST-based drift gate Eliminate raw 'awaiting_agent'/'hibernating'/'failed'/etc string literals from production status writes. Adds models.WorkspaceStatus typed alias and models.AllWorkspaceStatuses canonical slice; every UPDATE workspaces SET status = ... now passes a parameterized $N typed value rather than a hard-coded SQL literal. Defense-in-depth follow-up to migration 046 (#2388): the Postgres enum type was missing 'awaiting_agent' + 'hibernating' for ~5 days because sqlmock regex matching cannot enforce live enum constraints. The drift gate is now a proper Go AST + SQL parser (no regex), asserting the codebase ⊆ migration enum and every const appears in the canonical slice. With status as a parameterized typed value, future enum mismatches fail at the SQL layer in tests, not silently in prod. Test coverage: full suite passes with -race; drift gate green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 10:41:41 -07:00
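The subset property the drift gate asserts (every status the code writes must exist in the migration enum) can be sketched as below. The real gate is described as a Go AST + SQL parser; this sketch shows only the final set comparison. The constant names and the `missingFromEnum` helper are hypothetical, and the slice lists only statuses named in this commit message.

```go
package main

import "fmt"

// WorkspaceStatus is the typed alias described in the commit.
type WorkspaceStatus string

// Hypothetical constant names; the values are quoted in the commit message.
const (
	StatusAwaitingAgent WorkspaceStatus = "awaiting_agent"
	StatusHibernating   WorkspaceStatus = "hibernating"
	StatusFailed        WorkspaceStatus = "failed"
)

// AllWorkspaceStatuses is the canonical slice (partial, for illustration).
var AllWorkspaceStatuses = []WorkspaceStatus{
	StatusAwaitingAgent,
	StatusHibernating,
	StatusFailed,
}

// missingFromEnum returns every canonical status that the database enum
// does not contain. A non-empty result is the drift the gate should catch.
func missingFromEnum(enumValues []string) []WorkspaceStatus {
	known := make(map[string]bool, len(enumValues))
	for _, v := range enumValues {
		known[v] = true
	}
	var missing []WorkspaceStatus
	for _, s := range AllWorkspaceStatuses {
		if !known[string(s)] {
			missing = append(missing, s)
		}
	}
	return missing
}

func main() {
	// An enum that lacks two values, mirroring the gap the commit describes.
	fmt.Println(missingFromEnum([]string{"failed"}))
}
```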
Hongming Wang	284511f02e	feat(external): default external runtime to poll-mode + awaiting_agent Paired molecule-core change for the molecule-cli `molecule connect` RFC (https://github.com/Molecule-AI/molecule-cli/issues/10). After this PR an `external`-runtime workspace's full lifecycle matches the operator-driven model: it boots in awaiting_agent, the CLI connects in poll mode without operator-side flag tuning, the heartbeat-loss path lands back on awaiting_agent (re-registrable) instead of the terminal-feeling 'offline'. Two changes in workspace-server: 1) `resolveDeliveryMode` (registry.go) now reads `runtime` alongside `delivery_mode`. Resolution order: a. payload.delivery_mode if non-empty (operator override) b. row's existing delivery_mode if non-empty (preserves prior registration) c. NEW: "poll" if row.runtime = "external" — external operators run on laptops without public HTTPS; push-mode would hard-fail at validateAgentURL anyway. (`molecule connect` registers without --mode and expects this default.) d. "push" otherwise (historical default for platform-managed runtimes — langgraph, hermes, claude-code, etc.) 2) Heartbeat-loss for external workspaces lands them in `awaiting_agent` instead of `offline`. Two code paths: - `liveness.go` — Redis TTL expiration. Uses a CASE expression so the conditional is one UPDATE (no extra round-trip for non-external runtimes, no TOCTOU between runtime read and status write). - `healthsweep.go::sweepStaleRemoteWorkspaces` — DB-side last_heartbeat_at age scan. This sweep is already external- only by query filter, so the UPDATE just hard-codes the new status. The Docker-side `sweepOnlineWorkspaces` keeps `offline` — recovery there is "restart the container", not "re-register from the operator's box". Why awaiting_agent over offline for external: - Matches the status the workspace was created in (workspace.go:333). - The CLI re-registers on every invocation; awaiting_agent → online is the natural transition. 
offline is a terminal-feeling status that implies operator intervention is needed. - An operator who closed their laptop overnight should see awaiting_agent in canvas, not 'offline (something is wrong)'. Test plan: - Existing: 9 `resolveDeliveryMode` test sites updated to the new query shape. Sqlmock now reads `delivery_mode, runtime` columns. - New: TestRegister_ExternalRuntime_DefaultsToPoll asserts the external→poll branch. TestRegister_NonExternalRuntime_StillDefaultsToPush guards against the new branch overshooting (langgraph keeps push). - Liveness: regex updated to match the CASE expression. - Healthsweep: `TestSweepStaleRemoteWorkspaces_MarksStaleAwaitingAgent` (renamed for grep-ability), Docker-side sweepOnlineWorkspaces test unchanged (verified to still match `'offline'`). - Full handlers + registry suite green under -race (12.873s + 2.264s). No migration needed — `status` is a free-form text column; both 'offline' and 'awaiting_agent' are existing values used elsewhere (workspace.go uses awaiting_agent on initial external creation). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 06:39:57 -07:00
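The a-to-d resolution order for `resolveDeliveryMode` can be sketched as a pure function. The actual function in registry.go reads `delivery_mode` and `runtime` from the database row; the three-string signature here is an assumption made to keep the example self-contained.

```go
package main

import "fmt"

// resolveDeliveryMode applies the order from the commit message:
// payload override, then the row's existing mode, then "poll" for
// external runtimes, then "push" for everything else.
func resolveDeliveryMode(payloadMode, rowMode, runtime string) string {
	switch {
	case payloadMode != "":
		return payloadMode // a. operator override
	case rowMode != "":
		return rowMode // b. preserve the prior registration
	case runtime == "external":
		return "poll" // c. new default for external runtimes
	default:
		return "push" // d. historical default for platform-managed runtimes
	}
}

func main() {
	fmt.Println(resolveDeliveryMode("", "", "external"))
	fmt.Println(resolveDeliveryMode("", "", "langgraph"))
	fmt.Println(resolveDeliveryMode("push", "", "external"))
}
```

The ordering matters: an external workspace that previously registered in push mode keeps it under rule b, so the new default only affects first-time registrations.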
Hongming Wang	317196463a	fix(orphan-sweeper): close TOCTOU race with issueAndInjectToken on restart Independent code review caught a real bug in the previous commit's stale-token revoke pass. The platform's restart endpoint (workspace_restart.go:104) Stops the workspace container synchronously then dispatches re-provisioning to a goroutine (line 173). For a workspace that's been idle past the 5-minute grace window — extremely common: user comes back to a long-idle workspace and clicks Restart — this opens a race window: 1. Container stopped → ListWorkspaceContainerIDPrefixes returns no entry → workspace becomes a stale-token candidate. 2. issueAndInjectToken runs in the goroutine: revokes old tokens, issues a fresh one, writes it to /configs/.auth_token. 3. If the sweeper's predicate-only UPDATE `WHERE workspace_id = $1 AND revoked_at IS NULL` runs AFTER IssueToken commits but is racing the SELECT-then-UPDATE window, it revokes the freshly-issued token alongside the old ones. 4. Container starts with a now-revoked token → 401 forever. The fix carries the SAME staleness predicate from the SELECT into the per-workspace UPDATE: a token created within the grace window can't match `< now() - grace` and is automatically excluded. The operation is now idempotent against fresh inserts. Also addresses other findings from the same review: - Add `status NOT IN ('removed', 'provisioning')` to the SELECT (R2 + first-line C1 defence). 'provisioning' is set synchronously in workspace_restart.go before the async re-provision begins, so it's a reliable in-flight signal that narrows the candidate set. - Stop calling wsauth.RevokeAllForWorkspace from the sweeper — that helper revokes EVERY live token unconditionally; the sweeper needs "every STALE live token" which is a different (safer) operation. Inline the UPDATE so we own the predicate end-to-end. Drop the wsauth import (no longer needed in this package). 
- Tighten expectStaleTokenSweepNoOp regex to anchor at start and require the status filter, so a future query whose first line coincidentally starts with "SELECT DISTINCT t.workspace_id" can't silently absorb the helper's expectation (R3). - Defensive `if reaper == nil { return }` at top of sweepStaleTokensWithoutContainer — even though StartOrphanSweeper already short-circuits on nil, a future refactor that wires this pass directly without checking would otherwise mass-revoke in CP/SaaS mode (F2). - Comment in the function explaining why empty likes is intentionally NOT a short-circuit (asymmetry with the first two passes is the whole point — "no containers running" is the load-bearing case). - Add TestSweepOnce_StaleTokenRevokeUsesStalenessPredicate that asserts the UPDATE shape (predicate present, grace bound). A real-Postgres integration test would prove the race resolution end-to-end; this catches the regression where someone simplifies the UPDATE back to predicate-only. - Add TestSweepStaleTokens_NilReaperEarlyExit pinning the F2 guard. Existing tests updated to match the new query/UPDATE shape with tight regexes that pin all the safety guards (status filter, staleness predicate in both SELECT and UPDATE). Full Go suite green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 17:28:50 -07:00
Hongming Wang	3332e6878b	fix(orphan-sweeper): revoke stale tokens for workspaces with no live container Heals the user-reported "auth token conflict after volume wipe" failure mode. When an operator nukes a workspace's /configs volume outside the platform's restart endpoint (common via `docker compose down -v` or manual cleanup scripts), the DB still holds live workspace_auth_tokens for that workspace while the recreated container has an empty /configs/.auth_token. Subsequent /registry/register calls 401 forever: requireWorkspaceToken sees live tokens, container has no token to present, and the workspace is permanently wedged until an operator manually revokes via SQL. The platform's restart endpoint already handles this correctly via wsauth.RevokeAllForWorkspace inside issueAndInjectToken. This change adds a third orphan-sweeper pass — sweepStaleTokensWithoutContainer — as the safety net for the equivalent action taken outside the API. Detection criterion: workspace has at least one live (non-revoked) token whose most-recent activity (COALESCE(last_used_at, created_at)) is older than staleTokenGrace (5 minutes), AND no live Docker container's name prefix matches the workspace ID. Safety filters that bound the revoke radius: 1. Only runs in single-tenant Docker mode. The orphan sweeper is wired only when prov != nil in cmd/server/main.go — CP/SaaS mode never gets here, so an empty container list cannot be confused with "no Docker at all" (which would otherwise revoke every workspace's tokens in production SaaS). 2. staleTokenGrace = 5min skips tokens issued/used in the last 5 minutes. Bounds the race with mid-provisioning (token issued moments before docker run completes) and brief restart windows — a healthy workspace touches last_used_at every 30s heartbeat, so 5min is 10× the heartbeat interval. 3. The query joins workspaces.status != 'removed' so deleted workspaces are not revoked here (handled at delete time by the explicit RevokeAllForWorkspace call). 4. 
make_interval(secs => $2) avoids a time.Duration.String() → "5m0s" mismatch with Postgres interval grammar that I caught during implementation. 5. Each revocation logs the workspace ID so operators can correlate "workspace just lost auth" with this sweeper, not blame a network blip. Failure mode: revoke fails (transient DB error). Loop bails to avoid log spam; next 60s cycle retries. Worst case a workspace stays 401-blocked an extra minute. Tests: 5 new tests covering the headline scenario, the safety gate (workspace with container is NOT revoked), revoke-failure-bails-loop, query-error-non-fatal, and Docker-list-failure-skips-cycle. All 11 existing sweepOnce tests updated to register the new third-pass query expectation via a small `expectStaleTokenSweepNoOp` helper that keeps their existing assertions readable. Full Go test suite green: registry, wsauth, handlers, and all other packages. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 17:20:08 -07:00
Hongming Wang	4915d1d59e	fix(orphan-sweeper): reap labeled containers with no DB row (wiped-DB) The existing sweeper only reaps ws-* containers whose workspace row has status='removed'. That misses the entire wiped-DB case: an operator does `docker compose down -v` (kills the postgres volume), the previous platform's ws-* containers keep running, the new platform boots into an empty workspaces table — first pass finds zero candidates and those containers leak forever. Symptom users hit today: 7 ws-* containers from 11h ago, no rows in DB, no visibility in Canvas, eating CPU + memory. Fix shape: 1. Provisioner stamps every ws-* container + volume with `molecule.platform.managed=true`. Without a label, the sweeper would have to assume any unlabeled ws-* container might belong to a sibling platform stack on a shared Docker daemon. 2. Provisioner exposes ListManagedContainerIDPrefixes — a label-filter counterpart to the existing name-filter. 3. Sweeper splits sweepOnce into two independent passes: - sweepRemovedRows (unchanged behavior; status='removed' only) - sweepLabeledOrphansWithoutRows (new; labeled containers whose workspace_id has no row in the table at all) Each pass has its own short-circuit so an empty result or transient error in one doesn't block the other — load-bearing because the wiped-DB pass exists precisely for cases where the removed-row pass finds nothing. Safe under multi-platform-on-shared-daemon: only containers carrying our label get reaped, sibling stacks' containers are invisible to this pass. (For now the label is a constant string; a future per-instance UUID layer can refine "ours" further if a real shared-daemon scenario emerges.) Migration: existing platforms running pre-PR builds have UNLABELED ws-* containers. After this lands they continue to NOT be reaped by the new path (no label = invisible). They'll only be cleaned via manual intervention or once the operator recreates them — same as today. No regression. 
Tests cover all five branches of the new pass: happy-path reap, no-reap when row exists, mixed reap-some-keep-some, Docker error short-circuits cleanly, non-UUID prefixes get filtered before the SQL query. Pairs with PR #2122 (script-level fix). Together they close the orphan-leak path for both `bash scripts/nuke-and-rebuild.sh` users (handled by the script) AND `docker compose down -v` users (handled by the runtime). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 14:33:41 -07:00
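The wiped-DB pass amounts to a set difference between labeled containers and workspace rows. A rough in-memory sketch follows. The `orphanPrefixes` helper, the prefix-matching rule, and the loose hex-and-dash pattern used to filter non-UUID prefixes are all assumptions; the real pass queries Docker by label and Postgres by ID.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// uuidPrefix is a loose, assumed filter: lowercase hex digits and dashes.
var uuidPrefix = regexp.MustCompile(`^[0-9a-f-]+$`)

// orphanPrefixes returns the labeled-container ID prefixes that match no
// workspace row at all. Prefixes that cannot be part of a UUID are dropped
// before the lookup, mirroring the filter-before-SQL step in the commit.
func orphanPrefixes(labeledPrefixes, workspaceIDs []string) []string {
	var orphans []string
	for _, p := range labeledPrefixes {
		if !uuidPrefix.MatchString(p) {
			continue
		}
		hasRow := false
		for _, id := range workspaceIDs {
			if strings.HasPrefix(id, p) {
				hasRow = true
				break
			}
		}
		if !hasRow {
			orphans = append(orphans, p)
		}
	}
	return orphans
}

func main() {
	labeled := []string{"aaaa1111", "bbbb2222", "not_uuid!"}
	rows := []string{"aaaa1111-0000-0000-0000-000000000000"}
	fmt.Println(orphanPrefixes(labeled, rows))
}
```

Only containers that carry the platform label ever reach this function, which is what keeps sibling stacks on a shared daemon out of the reap set.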
Hongming Wang	d0f198b24f	merge: resolve staging conflicts (a2a_proxy + workspace_crud) Three files conflicted with staging changes that landed while this PR sat open. Resolved each by combining both intents (not picking one side): - a2a_proxy.go: keep the branch's idle-timeout signature (workspaceID parameter + comment) AND apply staging's #1483 SSRF defense-in-depth check at the top of dispatchA2A. Type-assert h.broadcaster (now an EventEmitter interface per staging) back to Broadcaster for applyIdleTimeout's SubscribeSSE call; falls through to no-op when the assertion fails (test-mock case). - a2a_proxy_test.go: keep both new test suites — branch's TestApplyIdleTimeout_ (3 cases for the idle-timeout helper) AND staging's TestDispatchA2A_RejectsUnsafeURL (#1483 regression). Updated the staging test's dispatchA2A call to pass the workspaceID arg introduced by the branch's signature change. - workspace_crud.go: combine both Delete-cleanup intents: * Branch's cleanupCtx detachment (WithoutCancel + 30s) so canvas hang-up doesn't cancel mid-Docker-call (the container-leak fix) * Branch's stopAndRemove helper that skips RemoveVolume when Stop fails (orphan sweeper handles) * Staging's #1843 stopErrs aggregation so Stop failures bubble up as 500 to the client (the EC2 orphan-instance prevention) Both concerns satisfied: cleanup runs to completion past canvas hangup AND failed Stop calls surface to caller. Build clean, all platform tests pass. 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-04-26 10:43:22 -07:00
Hongming Wang	be1beff4a0	fix(registry): runtime-aware provision-timeout sweep — give hermes 30 min Pre-fix: workspace-server's provision-timeout sweep was hardcoded at 10 min for all runtimes. The CP-side bootstrap-watcher (cp#245) correctly gives hermes 25 min for cold-boot (hermes installs include apt + uv + Python venv + Node + hermes-agent — 13–25 min on slow apt mirrors is normal). The two timeout systems disagreed: the watcher would happily wait 25 min, but the workspace-server's 10-min sweep killed healthy hermes boots mid-install at 10 min and marked them failed. Today's example: #2061's E2E run on 2026-04-26 at 08:06:34Z created a hermes workspace, EC2 cloud-init was visibly making progress on apt-installs (libcjson1, libmbedcrypto7t64) when the sweep flipped status to 'failed' at 08:17:00Z (10:26 elapsed). The test threw "Workspace failed: " (empty error from sql.NullString serialization) and CI failed on a healthy boot. Fix: provisioningTimeoutFor(runtime) — same shape as the CP's bootstrapTimeoutFn: - hermes: 30 min (watcher's 25 min + 5 min slack) - others: 10 min (unchanged — claude-code/langgraph/etc. boot in <5 min, 10 min is plenty) PROVISION_TIMEOUT_SECONDS env override still works (applies to all runtimes — operators who care about the runtime distinction shouldn't use the override anyway). Sweep query change: pulls (id, runtime, age_sec) per row instead of pre-filtering by age in SQL. Per-row Go evaluation picks the correct timeout. Slightly more rows scanned but bounded by the status='provisioning' partial index — workspaces in flight, not historical. Tests: - TestProvisioningTimeout_RuntimeAware — locks in the per-runtime mapping - TestSweepStuckProvisioning_HermesGets30MinSlack — hermes at 11 min must NOT be flipped - TestSweepStuckProvisioning_HermesPastDeadline — hermes at 31 min IS flipped, payload includes runtime - Existing tests updated for the new query shape Verified: - go build ./... clean - go vet ./... clean - go test ./... 
all green Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 01:44:09 -07:00
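The per-runtime timeout from be1beff4a0 might look something like the sketch below. The function name `provisioningTimeoutFor` and the 30 min / 10 min values come from the commit message; the signature (passing the raw `PROVISION_TIMEOUT_SECONDS` value as a parameter) and the `isStuck` helper are assumptions made to keep the example self-contained.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// provisioningTimeoutFor returns how long a workspace may stay in
// status='provisioning' before the sweep marks it failed.
// override is the raw PROVISION_TIMEOUT_SECONDS value ("" when unset);
// taking it as a parameter is an assumption of this sketch.
func provisioningTimeoutFor(runtime, override string) time.Duration {
	if secs, err := strconv.Atoi(override); err == nil && secs > 0 {
		// The override applies to all runtimes, as the commit notes.
		return time.Duration(secs) * time.Second
	}
	if runtime == "hermes" {
		return 30 * time.Minute // watcher's 25 min + 5 min slack
	}
	return 10 * time.Minute
}

// isStuck is a hypothetical per-row check: the sweep pulls (id, runtime, age)
// and evaluates the deadline in Go instead of pre-filtering by age in SQL.
func isStuck(runtime string, age time.Duration, override string) bool {
	return age > provisioningTimeoutFor(runtime, override)
}

func main() {
	fmt.Println(isStuck("hermes", 11*time.Minute, ""))      // within hermes slack
	fmt.Println(isStuck("hermes", 31*time.Minute, ""))      // past hermes deadline
	fmt.Println(isStuck("claude-code", 11*time.Minute, "")) // past default deadline
}
```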
Hongming Wang	b47a1b87b0	chore: refresh stale orphan-sweeper Stop-failure comment Convergence-pass review noted the comment at orphan_sweeper.go:171 still describes the pre-cb126014 contract ("Stop returns nil even when container is gone, but a future change could surface real errors"). The future is now — Stop does surface real errors today. Tightened the comment to match the live contract: isContainerNotFound is treated as success, anything else returns the wrapped Docker error, sweeper retries on the next cycle. Pure comment change, no behavior diff. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 23:34:57 -07:00
Hongming Wang	cb12601414	fix(platform): make Provisioner.Stop return real errors so cleanup gates fire Review caught a critical issue with `12c49183`: the headline "skip RemoveVolume when Stop fails" guarantee was dead code. `Provisioner.Stop` unconditionally `return nil`'d after logging the underlying ContainerRemove error, so the new `if err := h.provisioner.Stop(...); err != nil { skip volume }` guard in workspace_crud.go AND the same guard in the orphan sweeper could never fire. RemoveVolume always ran, predictably failing with "volume in use" when Stop hadn't actually killed the container — which is the exact production bug the commit claimed to fix. Now Stop: - returns nil on successful remove (no change) - returns nil when the container is already gone (uses the existing isContainerNotFound helper — that's the cleanup post-condition, not a failure) - returns the wrapped Docker error otherwise (daemon timeout, ctx cancellation, socket EOF — anything that means the container might still be alive) Audited every Provisioner.Stop caller in the tree (team.go, workspace_restart.go ×4, workspace.go) — all of them already discard the return value, so the widened error surface is purely opt-in for the new cleanup paths and breaks no existing behaviour. Other review-driven fixes in this commit: - workspace_crud.go: detached `broadcaster.RecordAndBroadcast` from the request ctx too. RecordAndBroadcast does INSERT INTO structure_events + Redis Publish; if the canvas hangs up, a request-ctx-bound INSERT can be cancelled mid-write and the WORKSPACE_REMOVED event never lands, leaving other WS clients ignorant of the cascade. - orphan_sweeper.go: added isLikelyWorkspaceID guard before turning Docker container prefixes into SQL LIKE patterns. 
The Docker name filter is a SUBSTRING match (not prefix), so non-workspace containers like `my-ws-tool` slip through; the in-loop HasPrefix in provisioner trims most, but the in-sweeper alphabet check (hex + dashes only) is the second line of defence and also blocks SQL LIKE wildcards (`_`, `%`) from reaching the query. Two new tests pin this — TestSweepOnce_FiltersNonWorkspacePrefixes and TestIsLikelyWorkspaceID with 10 alphabet cases. - provisioner.go: comment added to ListWorkspaceContainerIDPrefixes flagging the substring/HasPrefix relationship as load-bearing. Verified: full Go test suite passes; all 8 sweeper tests pass (2 new for the LIKE-pattern guard); existing dispatch / delete / provisioner tests unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 23:32:48 -07:00
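The alphabet guard mentioned in cb12601414 could be sketched as follows. Only the name `isLikelyWorkspaceID` and the "hex + dashes only" rule come from the commit message; accepting lowercase hex only, rejecting the empty string, and the `filterPrefixes` helper are assumptions of this sketch.

```go
package main

import "fmt"

// isLikelyWorkspaceID reports whether s consists solely of hex digits and
// dashes. Because '_' and '%' are outside that alphabet, no SQL LIKE
// wildcard can reach the query through a container-name prefix.
// Assumption: lowercase hex only, and the empty string is rejected.
func isLikelyWorkspaceID(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		switch {
		case r >= '0' && r <= '9', r >= 'a' && r <= 'f', r == '-':
			// allowed character
		default:
			return false
		}
	}
	return true
}

// filterPrefixes is a hypothetical helper that keeps only the prefixes that
// pass the guard before they are turned into LIKE patterns.
func filterPrefixes(prefixes []string) []string {
	var kept []string
	for _, p := range prefixes {
		if isLikelyWorkspaceID(p) {
			kept = append(kept, p)
		}
	}
	return kept
}

func main() {
	fmt.Println(filterPrefixes([]string{"3f2a9c1e-7b4", "tool", "50%", "a_b"}))
}
```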
Hongming Wang	12c4918318	fix(platform): stop leaking workspace containers on delete Symptom: deleting workspaces from the canvas marked DB rows status='removed' but left Docker containers running indefinitely. After a session of org imports + cancellations, we counted 10 running ws-* containers all backed by 'removed' DB rows, eating ~1100% CPU on the Docker VM. Two compounding bugs in handlers/workspace_crud.go's delete cascade: 1. The cleanup loop used `c.Request.Context()` for the Docker stop/remove calls. When the canvas's `api.del` resolved on the platform's 200, gin cancelled the request ctx — and any in-flight Docker call cancelled with `context canceled`, leaving the container alive. Old logs: "Delete descendant <id> volume removal warning: ... context canceled" 2. `provisioner.Stop`'s error return was discarded and `RemoveVolume` ran unconditionally afterward. When Stop didn't actually kill the container (transient daemon error, ctx cancellation as in #1), the volume removal would predictably fail with "volume in use" and the container kept running with the volume mounted. Old logs: "Delete descendant <id> volume removal warning: Error response from daemon: remove ... volume is in use" Fix layered in two parts: - workspace_crud.go: detach cleanup with `context.WithoutCancel(ctx)` + a 30s bounded timeout. Stop's error is now checked and on failure we skip RemoveVolume entirely (the orphan sweeper below catches what we deferred). - New registry/orphan_sweeper.go: periodic reconcile pass (every 60s, initial run on boot). Lists running ws-* containers via Docker name filter, intersects with DB rows where status='removed', stops + removes volumes for the leaks. Defence in depth — even a brand-new Stop failure mode heals on the next sweep instead of leaking forever. 
Provisioner gains a tiny ListWorkspaceContainerIDPrefixes helper that wraps ContainerList with the `name=ws-` filter; the sweeper takes an OrphanReaper interface (matches the ContainerChecker pattern in healthsweep.go) so unit tests don't need a real Docker daemon. main.go wires the sweeper alongside the existing liveness + health-sweep + provisioning-timeout monitors, all under supervised.RunWithRecover so a panic restarts the goroutine. 6 new sweeper tests cover the reconcile path, the no-running-containers short-circuit, the daemon-error skip, the Stop-failure-leaves-volume invariant (the same trap that motivated this fix), the volume-remove-error-is-non-fatal continuation, and the nil-reaper no-op. Verified: full Go test suite passes; manually purged the 10 leaked containers + their orphan volumes from the dev host with `docker rm -f` + `docker volume rm` (one-off cleanup; the sweeper would have caught them on the next cycle once deployed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 12:36:22 -07:00
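The delete-path fix in 12c4918318 combines a detached, bounded context with a Stop-gated volume removal. A minimal sketch under stated assumptions: the `stopper` interface, `fakeProvisioner`, `demo`, and `errDaemonTimeout` are hypothetical stand-ins, and only `context.WithoutCancel` (Go 1.21+), the 30s bound, and the skip-RemoveVolume-on-Stop-failure rule are taken from the commit message.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// stopper is a hypothetical stand-in for the provisioner's cleanup surface.
type stopper interface {
	Stop(ctx context.Context, id string) error
	RemoveVolume(ctx context.Context, id string) error
}

var errDaemonTimeout = errors.New("daemon timeout") // hypothetical failure

// stopAndRemove ignores cancellation of the request context, bounds the
// cleanup at 30s, and skips RemoveVolume when Stop fails so the volume is
// left for a later sweep instead of failing with "volume in use".
func stopAndRemove(reqCtx context.Context, p stopper, id string) error {
	ctx, cancel := context.WithTimeout(context.WithoutCancel(reqCtx), 30*time.Second)
	defer cancel()
	if err := p.Stop(ctx, id); err != nil {
		return fmt.Errorf("stop %s: %w", id, err)
	}
	return p.RemoveVolume(ctx, id)
}

// fakeProvisioner records whether RemoveVolume was reached.
type fakeProvisioner struct {
	stopErr       error
	volumeRemoved bool
}

func (f *fakeProvisioner) Stop(ctx context.Context, id string) error {
	if err := ctx.Err(); err != nil {
		return err // would fire if the request cancellation leaked through
	}
	return f.stopErr
}

func (f *fakeProvisioner) RemoveVolume(ctx context.Context, id string) error {
	f.volumeRemoved = true
	return nil
}

// demo runs the cleanup with an already-cancelled request context,
// simulating the canvas hanging up before the Docker calls finish.
func demo(stopErr error) (volumeRemoved bool, err error) {
	reqCtx, cancel := context.WithCancel(context.Background())
	cancel()
	p := &fakeProvisioner{stopErr: stopErr}
	err = stopAndRemove(reqCtx, p, "ws-3f2a9c1e-7b4")
	return p.volumeRemoved, err
}

func main() {
	fmt.Println(demo(nil))
	fmt.Println(demo(errDaemonTimeout))
}
```

Wrapping the Stop error with `%w` keeps the underlying Docker error inspectable via `errors.Is`, which matters once Stop returns real errors as in cb12601414.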
Hongming Wang	ec52d155f4	fix(sweeper): emit WORKSPACE_PROVISION_FAILED so canvas updates UI The provision-timeout sweeper was emitting a new WORKSPACE_PROVISION_TIMEOUT event type, but the canvas event handler (canvas-events.ts:234) only has a case for WORKSPACE_PROVISION_FAILED — the sweep's event fell through silently. DB was being marked 'failed' but the UI stayed on 'starting' indefinitely until the user hard-refreshed. Reusing the existing event name keeps the UI reaction uniform across both fail paths (runtime-crash via bootstrap-watcher and boot-timeout via sweeper). Operators who need to distinguish can read the `source` payload field — "bootstrap_watcher" vs "provision_timeout_sweep". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:38:41 -07:00
molecule-ai[bot]	fcd3a6eaf0	fix(test): align ssrf_test.go localhost test cases with isSafeURL behaviour (#1192) * feat(canvas): rewrite MemoryInspectorPanel to match backend API Issue #909 (chunk 3 of #576). The existing MemoryInspectorPanel used the wrong API endpoint (/memory instead of /memories) and wrong field names (key/value/version instead of id/content/scope/namespace/created_at). It also lacked LOCAL/TEAM/GLOBAL scope tabs and a namespace filter. Changes: - Fix endpoint: GET /workspaces/:id/memories with ?scope= query param - Fix MemoryEntry type to match actual API: id, content, scope, namespace, created_at, similarity_score - Add LOCAL/TEAM/GLOBAL scope tabs - Add namespace filter input - Remove Edit functionality (no update endpoint in backend) - Delete uses DELETE /workspaces/:id/memories/:id (by id, not key) - Full rewrite of 27 tests to match new API and UI structure - Uses ConfirmDialog (not native dialogs) for delete confirmation - All dark zinc theme (no light colors) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: tighten types + improve provision-timeout message (#1135, #1136) #1135 — TypeScript: make BudgetData.budget_used and WorkspaceMetrics fields optional to match actual partial-response shapes from provisioning-stuck workspaces. Runtime already guarded with ?? 0. #1136 — provisiontimeout.go: replace misleading "check required env vars" hint (preflight catches that case upfront) with accurate message about container starting but failing to call /registry/register. 🤖 Generated with [Claude Code](https://claude.com/claude-code) * fix(test): align ssrf_test.go localhost test cases with isSafeURL behaviour isSafeURL blocks 127.0.0.1 via ip.IsLoopback() even in dev environments. The test cases `wantErr: false` for localhost were incorrect — the test would fail when go test runs. Fix by changing wantErr to true for both localhost test cases. Rationale: loopback blocking at this layer is intentional. 
Access control is enforced by WorkspaceAuth + CanCommunicate at the A2A routing layer, not by the URL validation. Opening this would widen the SSRF attack surface without adding real dev flexibility. Closes: ssrf_test.go inconsistency reported 2026-04-21 Co-Authored-By: Claude Sonnet 4.7 <noreply@anthropic.com> --------- Co-authored-by: Molecule AI Core-UIUX <core-uiux@agents.moleculesai.app> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-21 02:08:45 +00:00
Hongming Wang	c3f7447e86	fix: harden stuck-provisioning UX — details crash, preflight, sweeper Workspaces stuck in status='provisioning' previously surfaced in three bad ways: 1. Details tab crashed with `Cannot read properties of undefined (reading 'toLocaleString')`. `BudgetSection` + `WorkspaceUsage` assumed full response shapes but a provisioning-stuck workspace returns partial `{}`. Guard each deep field with `?? 0` and cover the partial-response case with regression tests. 2. Missing required env vars failed silently 15+ minutes later as a cosmetic "Provisioning Timeout" banner. The in-container preflight catches them but by then the container has already crashed without calling /registry/register, so the workspace sat in 'provisioning' forever. Mirror the preflight server-side: parse config.yaml's `runtime_config.required_env` before launch, fail fast with a WORKSPACE_PROVISION_FAILED event naming the missing vars. 3. No backend timeout ever flipped a stuck workspace to 'failed'. Add a registry sweeper (10m default, env-overridable) that detects workspaces stuck past the window, flips them to 'failed', and emits WORKSPACE_PROVISION_TIMEOUT. Race-safe: the UPDATE re-checks the status + age predicate so a concurrent register/restart wins. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 14:51:39 -07:00
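The server-side preflight from c3f7447e86 (item 2) amounts to checking the declared `required_env` names before launch. A rough sketch, assuming the names have already been parsed out of config.yaml: `missingRequiredEnv` and its lookup parameter are hypothetical, and treating an empty value as missing is an assumption rather than something the commit states.

```go
package main

import (
	"fmt"
	"os"
)

// missingRequiredEnv returns the names in required that the lookup function
// cannot resolve. Assumption: a variable set to "" counts as missing.
func missingRequiredEnv(required []string, lookup func(string) (string, bool)) []string {
	var missing []string
	for _, name := range required {
		if v, ok := lookup(name); !ok || v == "" {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// EXAMPLE_API_KEY is a placeholder name used only for illustration.
	required := []string{"EXAMPLE_API_KEY"}
	if missing := missingRequiredEnv(required, os.LookupEnv); len(missing) > 0 {
		// In the commit, this is where a WORKSPACE_PROVISION_FAILED event
		// naming the missing vars would be emitted instead of launching.
		fmt.Println("missing required env:", missing)
	}
}
```

Injecting the lookup function keeps the check testable without mutating the process environment.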
Hongming Wang	d8026347e5	chore: open-source restructure — rename dirs, remove internal files, scrub secrets Renames: - platform/ → workspace-server/ (Go module path stays as "platform" for external dep compat — will update after plugin module republish) - workspace-template/ → workspace/ Removed (moved to separate repos or deleted): - PLAN.md — internal roadmap (move to private project board) - HANDOFF.md, AGENTS.md — one-time internal session docs - .claude/ — gitignored entirely (local agent config) - infra/cloudflare-worker/ → Molecule-AI/molecule-tenant-proxy - org-templates/molecule-dev/ → standalone template repo - .mcp-eval/ → molecule-mcp-server repo - test-results/ — ephemeral, gitignored Security scrubbing: - Cloudflare account/zone/KV IDs → placeholders - Real EC2 IPs → <EC2_IP> in all docs - CF token prefix, Neon project ID, Fly app names → redacted - Langfuse dev credentials → parameterized - Personal runner username/machine name → generic Community files: - CONTRIBUTING.md — build, test, branch conventions - CODE_OF_CONDUCT.md — Contributor Covenant 2.1 All Dockerfiles, CI workflows, docker-compose, railway.toml, render.yaml, README, CLAUDE.md updated for new directory names. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 00:24:44 -07:00

21 Commits