Commit Graph

21 Commits

Author SHA1 Message Date
25fb696965 chore: reconcile main → staging post-suspension divergence
Some checks failed
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 7s
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 7s
cascade-list-drift-gate / check (pull_request) Successful in 9s
CI / Detect changes (pull_request) Successful in 10s
E2E API Smoke Test / detect-changes (pull_request) Successful in 10s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 11s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 11s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 12s
Harness Replays / detect-changes (pull_request) Successful in 13s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 12s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 15s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 9s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 16s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 43s
Harness Replays / Harness Replays (pull_request) Failing after 40s
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Failing after 1m32s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Failing after 1m34s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Failing after 1m36s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Failing after 2m53s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 3m44s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 3m57s
CI / Canvas (Next.js) (pull_request) Successful in 6m50s
CI / Python Lint & Test (pull_request) Successful in 7m37s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / Platform (Go) (pull_request) Failing after 8m31s
Refs Task #165 (Class D AUTO_SYNC_TOKEN plumbing).

main and staging diverged after the 2026-05-06 GitHub-org suspension
because Class D / Class G / feature work landed on staging while
unrelated CI fixes (#34-47, ECR auth-inline, buildx→docker, pre-clone
manifest deps) landed straight on main. Both branches edited the
same workflow files, so every push to main triggered an Auto-sync
run that aborted at `git merge --no-ff origin/main` with 7 content
conflicts:

  - .github/workflows/canary-verify.yml      (URL: github.com → Gitea)
  - .github/workflows/ci.yml                 (3 URL refs)
  - .github/workflows/publish-runtime.yml    (cascade: HTTP repo-dispatch
                                              → Gitea push)
  - .github/workflows/publish-workspace-server-image.yml
                                             (drop AWS-action steps;
                                              ECR auth is inline)
  - .github/workflows/retarget-main-to-staging.yml (URL)
  - manifest.json                            (lowercase org slug + add
                                              mock-bigorg from main)
  - scripts/clone-manifest.sh                (keep main's MOLECULE_GITEA_TOKEN
                                              auth path + drop awk-tolower
                                              since manifest is now lowercase)

Resolution: union — staging's post-suspension Gitea/ECR migrations win
on URL/policy edits; main's additive work (mock-bigorg manifest entry,
inline ECR auth, MOLECULE_GITEA_TOKEN basic-auth) is preserved on top.

After this lands, staging is a strict superset of main, so the next
auto-sync run on a push to main will be a clean fast-forward / no-op.
The auto-sync workflow on main also picks up staging's AUTO_SYNC_TOKEN
swap (Class D #26) for free, fixing the latent layer-2 push-auth issue.

Verified locally:
  - bash -n scripts/clone-manifest.sh
  - python -c 'yaml.safe_load(...)' on each touched workflow
  - python -c 'json.load(open(manifest.json))' (21 plugins, 9 templates,
    7 org_templates)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 14:24:37 -07:00
Hongming Wang
d64641904f feat(workspace-server): mock runtime + mock-bigorg org template
Some checks failed
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 9s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 8s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 8s
Harness Replays / detect-changes (pull_request) Successful in 9s
CI / Python Lint & Test (pull_request) Successful in 6s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 6s
CI / Canvas (Next.js) (pull_request) Successful in 8s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 10s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 12s
Harness Replays / Harness Replays (pull_request) Successful in 8s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 8s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Failing after 1m36s
cascade-list-drift-gate / check (pull_request) Successful in 5s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Failing after 1m30s
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Failing after 1m39s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Failing after 2m50s
Retarget main PRs to staging / Retarget to staging (pull_request) Has been skipped
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 5s
CI / Platform (Go) (pull_request) Successful in 4m29s
CI / Detect changes (pull_request) Successful in 6s
E2E API Smoke Test / detect-changes (pull_request) Successful in 8s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 7s
Adds a 'mock' runtime: virtual workspaces with no container, no EC2,
no LLM. Every A2A reply is synthesised from a small canned-variant
pool ('On it!', 'Got it, on it now.', etc.) deterministically seeded
by (workspace_id, request_id).

Built for funding-demo "200-workspace mock org" — renders an
enterprise-scale org chart on the canvas (CEO/VPs/Managers/ICs)
without burning real LLM credits or provisioning 200 EC2 instances.

Surfaces:
  - workspace-server/internal/handlers/mock_runtime.go: A2A proxy
    short-circuit, canned-reply pool, deterministic variant pick.
  - workspace-server/internal/handlers/a2a_proxy.go: gate the
    short-circuit before resolveAgentURL (mock has no URL).
  - workspace-server/internal/handlers/org_import.go: skip Docker
    provisioning for mock workspaces, set status='online' directly,
    drop the per-sibling 2s pacing for mock children (collapses
    a 200-workspace import from ~7min → ~1s).
  - workspace-server/internal/handlers/runtime_registry.go: register
    'mock' in the runtime allowlist (manifest + fallback set).
  - workspace-server/internal/registry/healthsweep.go +
    orphan_sweeper.go: skip mock workspaces in container-health and
    stale-token sweeps (no container by design).
  - workspace-server/internal/handlers/workspace_restart.go: mirror
    the 'external' Restart no-op for mock.
  - manifest.json: register the new
    Molecule-AI/molecule-ai-org-template-mock-bigorg repo.

Tests: 5 new in mock_runtime_test.go covering happy-path, non-mock
regression guard, determinism, IsMockRuntime trim/case, JSON-RPC
id echo. All existing handler + registry tests still pass.

Local-verified: imported the 200-workspace template against a fresh
postgres+redis, confirmed all 200 land in 'online' and stay there
through the 30s health-sweep window, exercised A2A on CEO + VPs +
Managers + ICs and saw the variant pool rotate.

Org template lives at
Molecule-AI/molecule-ai-org-template-mock-bigorg (created today)
and is imported via the existing /org/import flow on the canvas
Template Palette.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 08:40:37 -07:00
Hongming Wang
3cdb67f27e fix(workspace-server): CP orphan sweeper closes deprovision split-write race (#2989)
Some checks failed
CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s
Harness Replays / detect-changes (pull_request) Successful in 8s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 4s
CI / Detect changes (pull_request) Successful in 6s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 4s
E2E API Smoke Test / detect-changes (pull_request) Successful in 6s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 6s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 4s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 5s
CI / Python Lint & Test (pull_request) Successful in 3s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 5s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 6s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 4s
CI / Canvas (Next.js) (pull_request) Successful in 18s
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Failing after 43s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Failing after 1m19s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Failing after 1m22s
Harness Replays / Harness Replays (pull_request) Failing after 37s
CI / Platform (Go) (pull_request) Failing after 2m33s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Failing after 4m48s
The deprovision path marks `workspaces.status='removed'` BEFORE calling
the controlplane DELETE. If that CP call fails (transient 5xx, network
hiccup, AWS provider error), the DB row stays at 'removed' with
`instance_id` populated and there's no retry — the EC2 lives forever.
9 prod orphans accumulated over 3 days under this bug.

Adds a SaaS-mode counterpart to the existing Docker `orphan_sweeper`:
- 60s tick (matches the Docker sweeper cadence)
- LIMIT 100 per cycle so a sustained CP outage drains over multiple
  cycles without blowing the request timeout
- Re-issues `cpProv.Stop` for any workspace at status='removed' with a
  non-NULL `instance_id`. Stop is idempotent (AWS terminate on
  already-terminated is a no-op; CP's Deprovision tolerates already-
  deleted DNS) so retries are safe.
- On Stop success, NULLs `instance_id` so the next cycle skips the row.
- On Stop failure, leaves `instance_id` populated for next cycle.

The existing Docker sweeper is gated on `prov != nil`; the new sweeper
is gated on `cpProv != nil`. SaaS tenants get exactly one of the two,
self-hosted tenants get the Docker one — no overlap.

Why this shape over option A (CP-first ordering) or B (durable outbox):
the existing inline path already returns a loud 500 to the user when
CP fails — the only missing piece is automatic retry, which a 60s
sweeper provides without protocol changes, new tables, or new workers.
~30 LOC of production code vs. ~400 for an outbox. RFC discussion in
#2989 comment chain.

Tests:
- 9 unit tests covering happy path, Stop failure, UPDATE failure,
  multiple orphans (one-fails-others-still-process), DB query error,
  nil-DB defense, nil-reaper short-circuit, and the boot-immediate-then-
  tick cadence contract.
- Mutation-tested: status='running' substitution and removed-UPDATE-
  block both fail at least one test.

Out of scope:
- Backfilling the 9 named orphans — they'll heal automatically on the
  first sweep cycle after this lands; no manual cleanup needed.
- Long-term durable-outbox architecture — separate RFC.
2026-05-06 16:43:33 -07:00
Hongming Wang
9ceda9d81f refactor(events): migrate 18 files to typed EventType constants (RFC #2945 PR-B-1)
Mechanical migration of bare event-name strings in BroadcastOnly /
RecordAndBroadcast call sites to the typed constants from
internal/events/types.go (RFC #2945 PR-B). Wire format unchanged
(both shapes serialize to identical WSMessage.Event literals); pinned
by TestAllEventTypes_IsSnapshot in #2965.

Migrated (18 files, scope: handlers/, scheduler/, registry/, bundle/,
channels/):
- handlers/{approvals,a2a_proxy_helpers,a2a_queue,activity,agent,
  delegation,external_rotate,org_import,registry,workspace,
  workspace_bootstrap,workspace_crud,workspace_provision_shared,
  workspace_restart}.go
- channels/manager.go (caught by hostile-reviewer pass — initial
  scope missed channels/, found via grep on the post-migration tree)
- scheduler/scheduler.go
- registry/provisiontimeout.go
- bundle/importer.go

Hostile self-review (3 weakest spots, addressed)
------------------------------------------------

1. Missed call sites — initial scope omitted channels/. Post-migration
   `grep -rEn 'BroadcastOnly\([^,]+,[^,]*"[A-Z_]+"|RecordAndBroadcast\([^,]+,[^,]*"[A-Z_]+"' internal/`
   found 2 stragglers in channels/manager.go. Migrated. Final grep
   on the same pattern returns only the docstring example in
   types.go (intentional).

2. gofmt drift — auto-import injection produced non-canonical import
   ordering. `gofmt -w` applied ONLY to the 18 modified files (NOT
   the whole tree, to avoid sweeping unrelated pre-existing drift
   into this PR's diff). Three pre-existing un-gofmt'd files in
   handlers/ (a2a_proxy.go, a2a_proxy_test.go, a2a_queue_test.go)
   left as-is — they're unchanged by this PR and their drift
   predates it.

3. Wire format — paranoia check: do the constants serialize to the
   exact strings consumers (canvas TS, hermes plugin, anything
   parsing WSMessage.Event) expect? Yes. Pinned by the snapshot
   test. The migration is name-only; not a single character of
   wire output changes.

Verified
- go build ./... clean
- go vet ./internal/... clean
- gofmt -l on the 5 migrated package dirs: only pre-existing files
- Full tests: handlers/, channels/, scheduler/, registry/, events/,
  bundle/ all green (5 ok, 0 fail)

PR-B-2 (canvas TS mirror + cross-language parity gate) remains as
the final piece of RFC #2945 PR-B. Tracked separately so this PR
stays mechanical + reviewable.

Refs RFC #2945, PR #2965 (PR-B types).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 19:05:03 -07:00
Hongming Wang
be271aef8b fix(orphan-sweeper): exclude runtime='external' from stale-token revoke
The Docker-mode orphan sweeper was incorrectly targeting external runtime
workspaces, revoking their auth tokens ~6 minutes after creation (one
sweep cycle past the 5-min grace).

External workspaces have NO local container by design — their agent runs
off-host. The "no live container" predicate the sweep uses to detect
wiped-volume orphans matches every external workspace unconditionally,
which was killing the only auth credential the off-host agent has.

Reproducer: create runtime=external workspace, paste the auth token into
molecule-mcp / curl, wait 5 minutes. Next request returns
`HTTP 401 — token may be revoked`. Platform log shows
`Orphan sweeper: revoking stale tokens for workspace <id> (no live
container; volume likely wiped)`.

Fix: add `AND w.runtime != 'external'` to the sweep's SELECT. The
existing test regexes (third-pass query expectations + the shared
expectStaleTokenSweepNoOp helper) are tightened to require the new
predicate, so a regression that drops it fails CI immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:49:37 -07:00
Hongming Wang
0064f02c00 test(sweeper): integration coverage for manifest-override + accessor consolidation
Two follow-ups from PR #2494's review:

1. Two new sweep tests exercise the lookup path through
   sweepStuckProvisioning end-to-end:
     - ManifestOverrideSparesRow: claude-code 11min old, manifest=20min
       → no UPDATE, no broadcast (sparing works through the sweeper)
     - ManifestOverrideStillFlipsPastDeadline: claude-code 21min old,
       manifest=20min → flipped + payload.timeout_secs=1200
   Closes the gap that the unit-test on provisioningTimeoutFor alone
   left open: a future refactor could drop the lookup arg from the
   sweeper's call and only the unit test caught it. Verified by
   regression-injecting `lookup→nil` in sweepStuckProvisioning — both
   new tests fail, the old ones still pass.

2. addProvisionTimeoutMs now goes through ProvisionTimeoutSecondsForRuntime
   instead of calling provisionTimeouts.get directly. Single accessor
   path for the same data — the canvas response and the sweeper now
   resolve identically by construction.

No production behavior change; tests + accessor cleanup only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:00:36 -07:00
Hongming Wang
18edf88d59 fix(sweeper): honour template-manifest provision_timeout_seconds
Real wiring gap discovered while investigating issue #2486 cluster of
prod claude-code workspaces failed at exactly 10m. The
runtimeProvisionTimeoutsCache (#2054 phase 2) reads
runtime_config.provision_timeout_seconds from each template's
config.yaml so the **canvas** spinner respects per-template timeouts —
but the **sweeper** in registry/provisiontimeout.go hardcoded 10 min
(claude-code) / 30 min (hermes) and never consulted the manifest. So a
template that declared a longer window had a UI that waited correctly
but a sweeper that killed the row at the hardcoded floor anyway.

Resolution order pinned by new TestProvisioningTimeout_ManifestOverride:

  1. PROVISION_TIMEOUT_SECONDS env (ops-debug global override)
  2. Template manifest lookup (per-runtime, beats hermes default too)
  3. Hermes default (30 min — CP bootstrap-watcher 25 min + 5 min slack)
  4. DefaultProvisioningTimeout (10 min)

Wiring:
  - registry: new RuntimeTimeoutLookup function type, threaded through
    StartProvisioningTimeoutSweep + sweepStuckProvisioning + the
    pre-existing provisioningTimeoutFor.
  - handlers: ProvisionTimeoutSecondsForRuntime exposes the cache's
    lookup as a method so main.go can pass it without breaking the
    handlers→registry import direction.
  - cmd/server/main.go: wire wh.ProvisionTimeoutSecondsForRuntime into
    the sweep boot.

Verified:
  - go test -race ./... passes (every workspace-server package).
  - Regression-injected the lookup arm: 3 manifest-override subcases
    fail with the actual-vs-expected gap, confirming the new test is
    load-bearing.
  - The original two timeout tests (env-override, hermes default) keep
    passing — `lookup=nil` argument preserves their semantics.

Operator action enabled: a template wanting a 15-min window can now
just set `runtime_config.provision_timeout_seconds: 900` in its
config.yaml and the sweeper honours it on the next workspace-server
restart.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 21:44:42 -07:00
Hongming Wang
fdf1b5d76a refactor(workspace-status): typed constants + AST-based drift gate
Eliminate raw 'awaiting_agent'/'hibernating'/'failed'/etc string literals
from production status writes. Adds models.WorkspaceStatus typed alias and
models.AllWorkspaceStatuses canonical slice; every UPDATE workspaces SET
status = ... now passes a parameterized $N typed value rather than a
hard-coded SQL literal.

Defense-in-depth follow-up to migration 046 (#2388): the Postgres enum
type was missing 'awaiting_agent' + 'hibernating' for ~5 days because
sqlmock regex matching cannot enforce live enum constraints. The drift
gate is now a proper Go AST + SQL parser (no regex), asserting the
codebase ⊆ migration enum and every const appears in the canonical
slice. With status as a parameterized typed value, future enum mismatches
fail at the SQL layer in tests, not silently in prod.

Test coverage: full suite passes with -race; drift gate green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:41:41 -07:00
Hongming Wang
284511f02e feat(external): default external runtime to poll-mode + awaiting_agent
Paired molecule-core change for the molecule-cli `molecule connect`
RFC (https://github.com/Molecule-AI/molecule-cli/issues/10).

After this PR an `external`-runtime workspace's full lifecycle
matches the operator-driven model: it boots in awaiting_agent, the
CLI connects in poll mode without operator-side flag tuning, the
heartbeat-loss path lands back on awaiting_agent (re-registrable)
instead of the terminal-feeling 'offline'.

Two changes in workspace-server:

1) `resolveDeliveryMode` (registry.go) now reads `runtime` alongside
   `delivery_mode`. Resolution order:
     a. payload.delivery_mode if non-empty (operator override)
     b. row's existing delivery_mode if non-empty (preserves prior
        registration)
     c. **NEW:** "poll" if row.runtime = "external" — external
        operators run on laptops without public HTTPS; push-mode
        would hard-fail at validateAgentURL anyway. (`molecule connect`
        registers without --mode and expects this default.)
     d. "push" otherwise (historical default for platform-managed
        runtimes — langgraph, hermes, claude-code, etc.)

2) Heartbeat-loss for external workspaces lands them in
   `awaiting_agent` instead of `offline`. Two code paths:
   - `liveness.go` — Redis TTL expiration. Uses a CASE expression
     so the conditional is one UPDATE (no extra round-trip for
     non-external runtimes, no TOCTOU between runtime read and
     status write).
   - `healthsweep.go::sweepStaleRemoteWorkspaces` — DB-side
     last_heartbeat_at age scan. This sweep is already external-
     only by query filter, so the UPDATE just hard-codes the new
     status.

   The Docker-side `sweepOnlineWorkspaces` keeps `offline` —
   recovery there is "restart the container", not "re-register from
   the operator's box".

Why awaiting_agent over offline for external:
- Matches the status the workspace was created in (workspace.go:333).
- The CLI re-registers on every invocation; awaiting_agent → online
  is the natural transition. offline is a terminal-feeling status
  that implies operator intervention is needed.
- An operator who closed their laptop overnight should see
  awaiting_agent in canvas, not 'offline (something is wrong)'.

Test plan:
- Existing: 9 `resolveDeliveryMode` test sites updated to the new
  query shape. Sqlmock now reads `delivery_mode, runtime` columns.
- New: TestRegister_ExternalRuntime_DefaultsToPoll asserts the
  external→poll branch. TestRegister_NonExternalRuntime_StillDefaultsToPush
  guards against the new branch overshooting (langgraph keeps push).
- Liveness: regex updated to match the CASE expression.
- Healthsweep: `TestSweepStaleRemoteWorkspaces_MarksStaleAwaitingAgent`
  (renamed for grep-ability), Docker-side sweepOnlineWorkspaces test
  unchanged (verified to still match `'offline'`).
- Full handlers + registry suite green under -race (12.873s + 2.264s).

No migration needed — `status` is a free-form text column; both
'offline' and 'awaiting_agent' are existing values used elsewhere
(workspace.go uses awaiting_agent on initial external creation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 06:39:57 -07:00
Hongming Wang
317196463a fix(orphan-sweeper): close TOCTOU race with issueAndInjectToken on restart
Independent code review caught a real bug in the previous commit's
stale-token revoke pass. The platform's restart endpoint
(workspace_restart.go:104) Stops the workspace container synchronously
then dispatches re-provisioning to a goroutine (line 173). For a
workspace that's been idle past the 5-minute grace window — extremely
common: user comes back to a long-idle workspace and clicks Restart —
this opens a race window:

  1. Container stopped → ListWorkspaceContainerIDPrefixes returns no
     entry → workspace becomes a stale-token candidate.
  2. issueAndInjectToken runs in the goroutine: revokes old tokens,
     issues a fresh one, writes it to /configs/.auth_token.
  3. If the sweeper's predicate-only UPDATE
     `WHERE workspace_id = $1 AND revoked_at IS NULL` runs AFTER
     IssueToken commits but is racing the SELECT-then-UPDATE window,
     it revokes the freshly-issued token alongside the old ones.
  4. Container starts with a now-revoked token → 401 forever.

The fix carries the SAME staleness predicate from the SELECT into the
per-workspace UPDATE: a token created within the grace window can't
match `< now() - grace` and is automatically excluded. The operation
is now idempotent against fresh inserts.

Also addresses other findings from the same review:

  - Add `status NOT IN ('removed', 'provisioning')` to the SELECT
    (R2 + first-line C1 defence). 'provisioning' is set synchronously
    in workspace_restart.go before the async re-provision begins, so
    it's a reliable in-flight signal that narrows the candidate set.

  - Stop calling wsauth.RevokeAllForWorkspace from the sweeper —
    that helper revokes EVERY live token unconditionally; the sweeper
    needs "every STALE live token" which is a different (safer)
    operation. Inline the UPDATE so we own the predicate end-to-end.
    Drop the wsauth import (no longer needed in this package).

  - Tighten expectStaleTokenSweepNoOp regex to anchor at start and
    require the status filter, so a future query whose first line
    coincidentally starts with "SELECT DISTINCT t.workspace_id" can't
    silently absorb the helper's expectation (R3).

  - Defensive `if reaper == nil { return }` at top of
    sweepStaleTokensWithoutContainer — even though StartOrphanSweeper
    already short-circuits on nil, a future refactor that wires this
    pass directly without checking would otherwise mass-revoke in
    CP/SaaS mode (F2).

  - Comment in the function explaining why empty likes is intentionally
    NOT a short-circuit (asymmetry with the first two passes is the
    whole point — "no containers running" is the load-bearing case).

  - Add TestSweepOnce_StaleTokenRevokeUsesStalenessPredicate that
    asserts the UPDATE shape (predicate present, grace bound). A
    real-Postgres integration test would prove the race resolution
    end-to-end; this catches the regression where someone simplifies
    the UPDATE back to predicate-only.

  - Add TestSweepStaleTokens_NilReaperEarlyExit pinning the F2 guard.

Existing tests updated to match the new query/UPDATE shape with tight
regexes that pin all the safety guards (status filter, staleness
predicate in both SELECT and UPDATE).

Full Go suite green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 17:28:50 -07:00
Hongming Wang
3332e6878b fix(orphan-sweeper): revoke stale tokens for workspaces with no live container
Heals the user-reported "auth token conflict after volume wipe" failure
mode. When an operator nukes a workspace's /configs volume outside the
platform's restart endpoint (common via `docker compose down -v` or
manual cleanup scripts), the DB still holds live workspace_auth_tokens
for that workspace while the recreated container has an empty
/configs/.auth_token. Subsequent /registry/register calls 401 forever:
requireWorkspaceToken sees live tokens, container has no token to
present, and the workspace is permanently wedged until an operator
manually revokes via SQL.

The platform's restart endpoint already handles this correctly via
wsauth.RevokeAllForWorkspace inside issueAndInjectToken. This change
adds a third orphan-sweeper pass — sweepStaleTokensWithoutContainer —
as the safety net for the equivalent action taken outside the API.

Detection criterion: workspace has at least one live (non-revoked)
token whose most-recent activity (COALESCE(last_used_at, created_at))
is older than staleTokenGrace (5 minutes), AND no live Docker
container's name prefix matches the workspace ID.

Safety filters that bound the revoke radius:

  1. Only runs in single-tenant Docker mode. The orphan sweeper is
     wired only when prov != nil in cmd/server/main.go — CP/SaaS mode
     never gets here, so an empty container list cannot be confused
     with "no Docker at all" (which would otherwise revoke every
     workspace's tokens in production SaaS).

  2. staleTokenGrace = 5min skips tokens issued/used in the last
     5 minutes. Bounds the race with mid-provisioning (token issued
     moments before docker run completes) and brief restart windows
     — a healthy workspace touches last_used_at every 30s heartbeat,
     so 5min is 10× the heartbeat interval.

  3. The query joins workspaces.status != 'removed' so deleted
     workspaces are not revoked here (handled at delete time by the
     explicit RevokeAllForWorkspace call).

  4. make_interval(secs => $2) avoids a time.Duration.String() →
     "5m0s" mismatch with Postgres interval grammar that I caught
     during implementation.

  5. Each revocation logs the workspace ID so operators can correlate
     "workspace just lost auth" with this sweeper, not blame a
     network blip.

Failure mode: revoke fails (transient DB error). Loop bails to avoid
log spam; next 60s cycle retries. Worst case a workspace stays
401-blocked an extra minute.

Tests: 5 new tests covering the headline scenario, the safety gate
(workspace with container is NOT revoked), revoke-failure-bails-loop,
query-error-non-fatal, and Docker-list-failure-skips-cycle. All 11
existing sweepOnce tests updated to register the new third-pass query
expectation via a small `expectStaleTokenSweepNoOp` helper that keeps
their existing assertions readable.

Full Go test suite green: registry, wsauth, handlers, and all other
packages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 17:20:08 -07:00
Hongming Wang
4915d1d59e fix(orphan-sweeper): reap labeled containers with no DB row (wiped-DB)
The existing sweeper only reaps ws-* containers whose workspace row
has status='removed'. That misses the entire wiped-DB case: an
operator does `docker compose down -v` (kills the postgres volume),
the previous platform's ws-* containers keep running, the new
platform boots into an empty workspaces table — first pass finds
zero candidates and those containers leak forever. Symptom users
hit today: 7 ws-* containers from 11h ago, no rows in DB, no
visibility in Canvas, eating CPU + memory.

Fix shape:

1. Provisioner stamps every ws-* container + volume with
   `molecule.platform.managed=true`. Without a label, the sweeper
   would have to assume any unlabeled ws-* container might belong
   to a sibling platform stack on a shared Docker daemon.

2. Provisioner exposes ListManagedContainerIDPrefixes — a label-filter
   counterpart to the existing name-filter.

3. Sweeper splits sweepOnce into two independent passes:
     - sweepRemovedRows (unchanged behavior; status='removed' only)
     - sweepLabeledOrphansWithoutRows (new; labeled containers whose
       workspace_id has no row in the table at all)
   Each pass has its own short-circuit so an empty result or transient
   error in one doesn't block the other — load-bearing because the
   wiped-DB pass exists precisely for cases where the removed-row
   pass finds nothing.

Safe under multi-platform-on-shared-daemon: only containers carrying
our label get reaped, sibling stacks' containers are invisible to this
pass. (For now the label is a constant string; a future per-instance
UUID layer can refine "ours" further if a real shared-daemon scenario
emerges.)

Migration: existing platforms running pre-PR builds have UNLABELED
ws-* containers. After this lands they continue to NOT be reaped by
the new path (no label = invisible). They'll only be cleaned via
manual intervention or once the operator recreates them — same as
today. No regression.

Tests cover all five branches of the new pass: happy-path reap,
no-reap when row exists, mixed reap-some-keep-some, Docker error
short-circuits cleanly, non-UUID prefixes get filtered before the
SQL query.

Pairs with PR #2122 (script-level fix). Together they close the
orphan-leak path for both `bash scripts/nuke-and-rebuild.sh` users
(handled by the script) AND `docker compose down -v` users (handled
by the runtime).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 14:33:41 -07:00
Hongming Wang
d0f198b24f merge: resolve staging conflicts (a2a_proxy + workspace_crud)
Three files conflicted with staging changes that landed while this PR
sat open. Resolved each by combining both intents (not picking one side):

- a2a_proxy.go: keep the branch's idle-timeout signature
  (workspaceID parameter + comment) AND apply staging's #1483 SSRF
  defense-in-depth check at the top of dispatchA2A. Type-assert
  h.broadcaster (now an EventEmitter interface per staging) back to
  *Broadcaster for applyIdleTimeout's SubscribeSSE call; falls through
  to no-op when the assertion fails (test-mock case).

- a2a_proxy_test.go: keep both new test suites — branch's
  TestApplyIdleTimeout_* (3 cases for the idle-timeout helper) AND
  staging's TestDispatchA2A_RejectsUnsafeURL (#1483 regression). Updated
  the staging test's dispatchA2A call to pass the workspaceID arg
  introduced by the branch's signature change.

- workspace_crud.go: combine both Delete-cleanup intents:
  * Branch's cleanupCtx detachment (WithoutCancel + 30s) so canvas
    hang-up doesn't cancel mid-Docker-call (the container-leak fix)
  * Branch's stopAndRemove helper that skips RemoveVolume when Stop
    fails (orphan sweeper handles)
  * Staging's #1843 stopErrs aggregation so Stop failures bubble up
    as 500 to the client (the EC2 orphan-instance prevention)
  Both concerns satisfied: cleanup runs to completion past canvas
  hangup AND failed Stop calls surface to caller.

Build clean, all platform tests pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-26 10:43:22 -07:00
Hongming Wang
be1beff4a0 fix(registry): runtime-aware provision-timeout sweep — give hermes 30 min
Pre-fix: workspace-server's provision-timeout sweep was hardcoded
at 10 min for all runtimes. The CP-side bootstrap-watcher (cp#245)
correctly gives hermes 25 min for cold-boot (hermes installs
include apt + uv + Python venv + Node + hermes-agent — 13–25 min on
slow apt mirrors is normal). The two timeout systems disagreed:
the watcher would happily wait 25 min, but the workspace-server's
10-min sweep killed healthy hermes boots mid-install at 10 min and
marked them failed.

Today's example: #2061's E2E run on 2026-04-26 at 08:06:34Z
created a hermes workspace, EC2 cloud-init was visibly making
progress on apt-installs (libcjson1, libmbedcrypto7t64) when the
sweep flipped status to 'failed' at 08:17:00Z (10:26 elapsed). The
test threw "Workspace failed: " (empty error from sql.NullString
serialization) and CI failed on a healthy boot.

Fix: provisioningTimeoutFor(runtime) — same shape as the CP's
bootstrapTimeoutFn:
  - hermes:  30 min (watcher's 25 min + 5 min slack)
  - others:  10 min (unchanged — claude-code/langgraph/etc. boot
                     in <5 min, 10 min is plenty)

PROVISION_TIMEOUT_SECONDS env override still works (applies to all
runtimes — operators who care about the runtime distinction
shouldn't use the override anyway).

Sweep query change: pulls (id, runtime, age_sec) per row instead
of pre-filtering by age in SQL. Per-row Go evaluation picks the
correct timeout. Slightly more rows scanned but bounded by the
status='provisioning' partial index — workspaces in flight, not
historical.

Tests:
  - TestProvisioningTimeout_RuntimeAware — locks in the per-runtime
    mapping
  - TestSweepStuckProvisioning_HermesGets30MinSlack — hermes at
    11 min must NOT be flipped
  - TestSweepStuckProvisioning_HermesPastDeadline — hermes at
    31 min IS flipped, payload includes runtime
  - Existing tests updated for the new query shape

Verified:
  - go build ./... clean
  - go vet ./... clean
  - go test ./... all green

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 01:44:09 -07:00
Hongming Wang
b47a1b87b0 chore: refresh stale orphan-sweeper Stop-failure comment
Convergence-pass review noted the comment at orphan_sweeper.go:171
still describes the pre-cb126014 contract ("Stop returns nil even
when container is gone, but a future change could surface real
errors"). The future is now — Stop does surface real errors today.
Tightened the comment to match the live contract:
isContainerNotFound is treated as success, anything else returns
the wrapped Docker error, sweeper retries on the next cycle.

Pure comment change, no behavior diff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 23:34:57 -07:00
Hongming Wang
cb12601414 fix(platform): make Provisioner.Stop return real errors so cleanup gates fire
Review caught a critical issue with 12c49183: the headline "skip
RemoveVolume when Stop fails" guarantee was dead code. `Provisioner.Stop`
unconditionally `return nil`'d after logging the underlying
ContainerRemove error, so the new `if err := h.provisioner.Stop(...);
err != nil { skip volume }` guard in workspace_crud.go AND the same
guard in the orphan sweeper could never fire. RemoveVolume always
ran, predictably failing with "volume in use" when Stop hadn't
actually killed the container — which is the exact production bug
the commit claimed to fix.

Now Stop:
  - returns nil on successful remove (no change)
  - returns nil when the container is already gone (uses the existing
    isContainerNotFound helper — that's the cleanup post-condition,
    not a failure)
  - returns the wrapped Docker error otherwise (daemon timeout, ctx
    cancellation, socket EOF — anything that means the container
    might still be alive)

Audited every Provisioner.Stop caller in the tree (team.go,
workspace_restart.go ×4, workspace.go) — all of them already
discard the return value, so the widened error surface is purely
opt-in for the new cleanup paths and breaks no existing behaviour.

Other review-driven fixes in this commit:

- workspace_crud.go: detached `broadcaster.RecordAndBroadcast` from
  the request ctx too. RecordAndBroadcast does INSERT INTO
  structure_events + Redis Publish; if the canvas hangs up, a
  request-ctx-bound INSERT can be cancelled mid-write and the
  WORKSPACE_REMOVED event never lands, leaving other WS clients
  ignorant of the cascade.

- orphan_sweeper.go: added isLikelyWorkspaceID guard before turning
  Docker container prefixes into SQL LIKE patterns. The Docker
  name filter is a SUBSTRING match (not prefix), so non-workspace
  containers like `my-ws-tool` slip through; the in-loop HasPrefix
  in provisioner trims most, but the in-sweeper alphabet check
  (hex + dashes only) is the second line of defence and also
  blocks SQL LIKE wildcards (`_`, `%`) from reaching the query.
  Two new tests pin this — TestSweepOnce_FiltersNonWorkspacePrefixes
  and TestIsLikelyWorkspaceID with 10 alphabet cases.

- provisioner.go: comment added to ListWorkspaceContainerIDPrefixes
  flagging the substring/HasPrefix relationship as load-bearing.

Verified: full Go test suite passes; all 8 sweeper tests pass
(2 new for the LIKE-pattern guard); existing dispatch / delete /
provisioner tests unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 23:32:48 -07:00
Hongming Wang
12c4918318 fix(platform): stop leaking workspace containers on delete
Symptom: deleting workspaces from the canvas marked DB rows
status='removed' but left Docker containers running indefinitely.
After a session of org imports + cancellations, we counted 10
running ws-* containers all backed by 'removed' DB rows, eating
~1100% CPU on the Docker VM.

Two compounding bugs in handlers/workspace_crud.go's delete cascade:

1. The cleanup loop used `c.Request.Context()` for the Docker
   stop/remove calls. When the canvas's `api.del` resolved on the
   platform's 200, gin cancelled the request ctx — and any in-flight
   Docker call cancelled with `context canceled`, leaving the
   container alive. Old logs:
       "Delete descendant <id> volume removal warning:
        ... context canceled"

2. `provisioner.Stop`'s error return was discarded and `RemoveVolume`
   ran unconditionally afterward. When Stop didn't actually kill the
   container (transient daemon error, ctx cancellation as in #1), the
   volume removal would predictably fail with "volume in use" and
   the container kept running with the volume mounted. Old logs:
       "Delete descendant <id> volume removal warning:
        Error response from daemon: remove ... volume is in use"

Fix layered in two parts:

- workspace_crud.go: detach cleanup with `context.WithoutCancel(ctx)`
  + a 30s bounded timeout. Stop's error is now checked and on
  failure we skip RemoveVolume entirely (the orphan sweeper below
  catches what we deferred).

- New registry/orphan_sweeper.go: periodic reconcile pass (every 60s,
  initial run on boot). Lists running ws-* containers via Docker name
  filter, intersects with DB rows where status='removed', stops +
  removes volumes for the leaks. Defence in depth — even a brand-new
  Stop failure mode heals on the next sweep instead of leaking
  forever.

Provisioner gains a tiny ListWorkspaceContainerIDPrefixes helper
that wraps ContainerList with the `name=ws-` filter; the sweeper
takes an OrphanReaper interface (matches the ContainerChecker
pattern in healthsweep.go) so unit tests don't need a real Docker
daemon.

main.go wires the sweeper alongside the existing liveness +
health-sweep + provisioning-timeout monitors, all under
supervised.RunWithRecover so a panic restarts the goroutine.

6 new sweeper tests cover the reconcile path, the
no-running-containers short-circuit, the daemon-error skip, the
Stop-failure-leaves-volume invariant (the same trap that motivated
this fix), the volume-remove-error-is-non-fatal continuation,
and the nil-reaper no-op.

Verified: full Go test suite passes; manually purged the 10 leaked
containers + their orphan volumes from the dev host with `docker
rm -f` + `docker volume rm` (one-off cleanup; the sweeper would
have caught them on the next cycle once deployed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 12:36:22 -07:00
Hongming Wang
ec52d155f4 fix(sweeper): emit WORKSPACE_PROVISION_FAILED so canvas updates UI
The provision-timeout sweeper was emitting a new WORKSPACE_PROVISION_TIMEOUT
event type, but the canvas event handler (canvas-events.ts:234) only
has a case for WORKSPACE_PROVISION_FAILED — the sweep's event fell
through silently. DB was being marked 'failed' but the UI stayed on
'starting' indefinitely until the user hard-refreshed.

Reusing the existing event name keeps the UI reaction uniform across
both fail paths (runtime-crash via bootstrap-watcher and boot-timeout
via sweeper). Operators who need to distinguish can read the `source`
payload field — "bootstrap_watcher" vs "provision_timeout_sweep".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:38:41 -07:00
molecule-ai[bot]
fcd3a6eaf0 fix(test): align ssrf_test.go localhost test cases with isSafeURL behaviour (#1192)
* feat(canvas): rewrite MemoryInspectorPanel to match backend API

Issue #909 (chunk 3 of #576).

The existing MemoryInspectorPanel used the wrong API endpoint
(/memory instead of /memories) and wrong field names (key/value/version
instead of id/content/scope/namespace/created_at). It also lacked
LOCAL/TEAM/GLOBAL scope tabs and a namespace filter.

Changes:
- Fix endpoint: GET /workspaces/:id/memories with ?scope= query param
- Fix MemoryEntry type to match actual API: id, content, scope,
  namespace, created_at, similarity_score
- Add LOCAL/TEAM/GLOBAL scope tabs
- Add namespace filter input
- Remove Edit functionality (no update endpoint in backend)
- Delete uses DELETE /workspaces/:id/memories/:id (by id, not key)
- Full rewrite of 27 tests to match new API and UI structure
- Uses ConfirmDialog (not native dialogs) for delete confirmation
- All dark zinc theme (no light colors)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: tighten types + improve provision-timeout message (#1135, #1136)

#1135 — TypeScript: make BudgetData.budget_used and WorkspaceMetrics
fields optional to match actual partial-response shapes from provisioning-
stuck workspaces. Runtime already guarded with ?? 0.

#1136 — provisiontimeout.go: replace misleading "check required env vars"
hint (preflight catches that case upfront) with accurate message about
container starting but failing to call /registry/register.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

* fix(test): align ssrf_test.go localhost test cases with isSafeURL behaviour

isSafeURL blocks 127.0.0.1 via ip.IsLoopback() even in dev environments.
The test cases `wantErr: false` for localhost were incorrect — the
test would fail when go test runs. Fix by changing wantErr to true
for both localhost test cases.

Rationale: loopback blocking at this layer is intentional. Access
control is enforced by WorkspaceAuth + CanCommunicate at the A2A
routing layer, not by the URL validation. Opening this would widen
the SSRF attack surface without adding real dev flexibility.

Closes: ssrf_test.go inconsistency reported 2026-04-21

Co-Authored-By: Claude Sonnet 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Molecule AI Core-UIUX <core-uiux@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 02:08:45 +00:00
Hongming Wang
c3f7447e86 fix: harden stuck-provisioning UX — details crash, preflight, sweeper
Workspaces stuck in status='provisioning' previously surfaced in three
bad ways:

1. **Details tab crashed** with `Cannot read properties of undefined
   (reading 'toLocaleString')`. `BudgetSection` + `WorkspaceUsage`
   assumed full response shapes but a provisioning-stuck workspace
   returns partial `{}`. Guard each deep field with `?? 0` and cover
   the partial-response case with regression tests.

2. **Missing required env vars failed silently** 15+ minutes later as
   a cosmetic "Provisioning Timeout" banner. The in-container preflight
   catches them but by then the container has already crashed without
   calling /registry/register, so the workspace sat in 'provisioning'
   forever. Mirror the preflight server-side: parse config.yaml's
   `runtime_config.required_env` before launch, fail fast with a
   WORKSPACE_PROVISION_FAILED event naming the missing vars.

3. **No backend timeout** ever flipped a stuck workspace to 'failed'.
   Add a registry sweeper (10m default, env-overridable) that detects
   workspaces stuck past the window, flips them to 'failed', and emits
   WORKSPACE_PROVISION_TIMEOUT. Race-safe: the UPDATE re-checks the
   status + age predicate so a concurrent register/restart wins.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 14:51:39 -07:00
Hongming Wang
d8026347e5 chore: open-source restructure — rename dirs, remove internal files, scrub secrets
Renames:
- platform/ → workspace-server/ (Go module path stays as "platform" for
  external dep compat — will update after plugin module republish)
- workspace-template/ → workspace/

Removed (moved to separate repos or deleted):
- PLAN.md — internal roadmap (move to private project board)
- HANDOFF.md, AGENTS.md — one-time internal session docs
- .claude/ — gitignored entirely (local agent config)
- infra/cloudflare-worker/ → Molecule-AI/molecule-tenant-proxy
- org-templates/molecule-dev/ → standalone template repo
- .mcp-eval/ → molecule-mcp-server repo
- test-results/ — ephemeral, gitignored

Security scrubbing:
- Cloudflare account/zone/KV IDs → placeholders
- Real EC2 IPs → <EC2_IP> in all docs
- CF token prefix, Neon project ID, Fly app names → redacted
- Langfuse dev credentials → parameterized
- Personal runner username/machine name → generic

Community files:
- CONTRIBUTING.md — build, test, branch conventions
- CODE_OF_CONDUCT.md — Contributor Covenant 2.1

All Dockerfiles, CI workflows, docker-compose, railway.toml, render.yaml,
README, CLAUDE.md updated for new directory names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 00:24:44 -07:00