Commit Graph

605 Commits

Hongming Wang
26e2e97006 refactor(models): consolidate per-runtime model defaults to SSOT (RFC #2873 iter 1)
Two call sites — workspace_provision.go:537 and org_import.go:54 —
duplicated the same `if runtime == "claude-code"` branch deciding
the default model when the operator/agent didn't supply one. They
were copy-pasted; nothing prevented them from drifting silently.

Extract to `models.DefaultModel(runtime string) string`. Both call
sites now route through the helper. Adding a new runtime now takes one
entry in DefaultModel plus one assertion in TestDefaultModel; before
the fix it took two source edits plus an audit.

Foundation for the future `RuntimeConfig` interface (RFC #2873 +
task #231): once we add `ProvisioningTimeout()`, `CapabilitiesSupported()`
etc., the helper expands to per-runtime structs and `DefaultModel`
becomes one method on the interface.

## Coverage

15 unit tests pinning the exact contract:
  - claude-code → "sonnet"
  - 9 other known runtimes → universal default
  - empty + unknown → universal default (matches pre-refactor fallthrough)
  - case-sensitivity preserved (CLAUDE-CODE → universal default)

Plus an invariant test: `DefaultModel` never returns "" — protects
against a future "return early on unknown" regression that would
silently break workspace creation.

## Verification

  - go build ./... clean
  - 15 model unit tests pass
  - existing handler tests untouched (no behavior change at call sites)
  - identical output to pre-refactor for every input

First iteration of the OSS-shape refactor program. Each PR meets all
7 bars (plugin/abstract/modular/SSOT/coverage/cleanup/file-split).

Refs RFC #2873.
2026-05-05 04:12:37 -07:00
Hongming Wang
1e8d7ae17c
Merge pull request #2869 from Molecule-AI/test/rfc2829-tighten-sweeper-assertions
test(delegations): tighten integration-test assertions + integrationDB doc
2026-05-05 10:42:31 +00:00
Hongming Wang
fcdf79774d test(delegations): tighten integration-test assertions + integrationDB doc (#321)
Three small follow-ups from #2866 self-review:

1. TestIntegration_Sweeper_StaleHeartbeatIsMarkedStuck — assert
   strings.Contains(errDet, "no heartbeat for") instead of != "".
   The original "non-empty" check passes for any error_detail value;
   if a future regression swaps the message format, the test wouldn't
   catch it. Pin the production format string explicitly.

2. TestIntegration_Sweeper_DeadlineExceededIsMarkedFailed — drop the
   redundant `last_heartbeat = now()` write. The sweeper checks
   deadline FIRST (the stronger statement) and short-circuits before
   evaluating heartbeat staleness, so the heartbeat field is irrelevant
   for that test path.

3. integrationDB doc comment now warns explicitly that the helper is
   NOT t.Parallel()-safe — it hot-swaps the package-level mdb.DB and
   restores via t.Cleanup. If a future contributor adds t.Parallel()
   to one of these tests they race on the global. Comment makes the
   constraint discoverable instead of a debugging surprise.

All 7 integration tests still pass against real Postgres locally.
2026-05-05 03:39:22 -07:00
Hongming Wang
d6337a1ae9 feat(org-import): make createWorkspaceTree idempotent (Phase 3 of #2857)
OrgHandler.Import was non-idempotent — every call INSERTed a fresh row
for every workspace in the tree, regardless of whether matching
workspaces already existed. Calling /org/import twice with the same
template duplicated the entire tree.

This was a bigger leak source than TeamHandler.Expand (deleted in
PR #2856). tenant-hongming accumulated 72 distinct child workspaces
in 4 days entirely from repeated org-template spawns of the same
template — the (tier × runtime) matrix in the audit data was the
template's static shape, multiplied by spawn count.

Fix: route through a new lookupExistingChild helper before INSERT.
Skip-if-exists semantics by default:
- Match on (parent_id, name) using `IS NOT DISTINCT FROM` so NULL
  parents (root workspaces) are included.
- Ignore status='removed' rows so collapsed teams or deleted
  workspaces don't block re-import.
- Recursion still runs on the existing id so partial-match templates
  (parent exists, some children missing) backfill correctly instead
  of either no-op'ing the whole subtree or duplicating the existing
  children.
- Result entries for skipped nodes carry skipped:true so callers
  (canvas Import preflight modal) can surface "5 of 7 already
  existed, 2 created."

The recursion that walked ws.Children is extracted into
recurseChildrenForImport so both the create-path and the skip-path
share one implementation — no duplicated grid math, no two paths to
keep in sync.

Note: replace_if_exists semantics (re-roll: stop+delete old, create
new) are deferred. Skip-if-exists alone closes the leak; re-roll is
a later UX decision for the canvas Import preflight modal.

Tests:
- 4 sqlmock cases on lookupExistingChild: not-found, found,
  nil-parent (the IS NOT DISTINCT FROM NULL trick), DB-error
  propagates (must fail fast — silent fallback to INSERT is the
  failure mode the helper exists to prevent).
- 1 source-level AST gate (per memory feedback_behavior_based_ast_gates.md):
  pins that h.lookupExistingChild( appears BEFORE INSERT INTO workspaces
  in org_import.go. If a future refactor reintroduces the un-checked
  INSERT, the gate fails. Verified load-bearing by removing the call —
  build fails (helper symbol gone).

go vet ./... clean. go test ./internal/handlers/ -count 1 — all green
(4.2s, no regression on existing OrgImport / Provision / Team tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 03:37:49 -07:00
Hongming Wang
3d2a50e2a2 test(delegations): extend integration suite with sweeper coverage (3 tests)
Real-Postgres tests for the RFC #2829 PR-3 sweeper. Validates:

  - Deadline-exceeded rows are marked failed with the expected
    error_detail
  - Stale-heartbeat in-flight rows are marked stuck (uses
    DELEGATION_STUCK_THRESHOLD_S env override for deterministic
    timing)
  - Healthy rows (fresh heartbeat + future deadline) are not touched
    — no false-positive against well-behaved delegations

These extend the gate added in the previous commit so the workflow
catches sweeper regressions, not just ledger-write ones. All 7
integration tests now pass; CI workflow runs them all.
2026-05-05 03:20:19 -07:00
Hongming Wang
c661ea4cd3
Merge pull request #2861 from Molecule-AI/fix/rfc2829-result-preview-ordering-and-integration-gate
fix(delegations): preserve result_preview + add real-Postgres integration gate
2026-05-05 09:51:30 +00:00
Hongming Wang
4c9f12258d fix(delegations): preserve result_preview through completion + add real-Postgres integration gate
Two-part PR:

## Fix: result_preview was lost on completion

Self-review of #2854 caught a real bug. SetStatus has a same-status
replay no-op; the order of calls in `executeDelegation` completion
+ `UpdateStatus` completed branch clobbered the preview field:

  1. updateDelegationStatus(completed, "")    fires
  2. inner recordLedgerStatus(completed, "", "")
       → SetStatus transitions dispatched → completed with preview=""
  3. outer recordLedgerStatus(completed, "", responseText)
       → SetStatus reads current=completed, status=completed
       → SAME-STATUS NO-OP, never writes responseText → preview lost

Confirmed against real Postgres (see integration test). Strict-sqlmock
unit tests passed because they pin SQL shape, not row state.

Fix: call the WITH-PREVIEW recordLedgerStatus FIRST, then
updateDelegationStatus. The inner call becomes the no-op (correctly
preserves the row written by the outer call).

Same gap fixed in UpdateStatus handler — body.ResponsePreview was
never landing in the ledger because updateDelegationStatus's nested
SetStatus(completed, "", "") fired first.

## Gate: real-Postgres integration tests + CI workflow

The unit-test-only workflow that shipped #2854 was the root cause.
Adding two layers of defense:

1. workspace-server/internal/handlers/delegation_ledger_integration_test.go
   — `//go:build integration` tag, requires INTEGRATION_DB_URL env var.
   4 tests:
     * ResultPreviewPreservedThroughCompletion (regression gate for the
       bug above — fires the production call sequence in fixed order
       and asserts row.result_preview matches)
     * ResultPreviewBuggyOrderIsLost (DIAGNOSTIC: confirms the
       same-status no-op contract works as designed; if SetStatus's
       semantics ever change, this test fires)
     * FailedTransitionCapturesErrorDetail (failure-path symmetry)
     * FullLifecycle_QueuedToDispatchedToCompleted (forward-only +
       happy path)

2. .github/workflows/handlers-postgres-integration.yml
   — required check on staging branch protection. Spins up a
   postgres:15 service container, applies the delegations migration,
   and runs `go test -tags=integration` against the live DB. The job
   always runs, with per-step gating on a path filter
   (handlers/wsauth/migrations), so the required-check name is
   satisfied on PRs that don't touch relevant code.

Local dev workflow (file header documents this):

  docker run --rm -d --name pg -e POSTGRES_PASSWORD=test -p 55432:5432 postgres:15-alpine
  psql ... < workspace-server/migrations/049_delegations.up.sql
  INTEGRATION_DB_URL="postgres://postgres:test@localhost:55432/molecule?sslmode=disable" \
    go test -tags=integration ./internal/handlers/ -run "^TestIntegration_"

## Why this matters

Per memory `feedback_mandatory_local_e2e_before_ship`: backend PRs
MUST verify against real Postgres before claiming done. sqlmock pins
SQL shape; only a real DB can verify row state. The workflow makes
this gate mandatory rather than optional.
2026-05-05 02:47:52 -07:00
Hongming Wang
d890fd9a3f
Merge pull request #2856 from Molecule-AI/chore/remove-team-expand-handler
chore(workspace-server): remove TeamHandler.Expand bulk-create handler
2026-05-05 09:42:51 +00:00
Hongming Wang
ec1f21922c chore(workspace-server): remove TeamHandler.Expand bulk-create handler
Every workspace can have children via the regular CreateWorkspace flow
with parent_id set, so a separate handler that bulk-creates from
config.yaml's sub_workspaces (and was non-idempotent — calling it twice
duplicated the team) earned its way out. "Team" is just the state of
having children; expanding/collapsing is purely a canvas-side visual
action that toggles the `collapsed` column via PATCH.

The non-idempotency directly caused tenant-hongming's vCPU starvation:
72 distinct child workspaces accumulated in 4 days, ~14 leaked EC2s
(50 of 64 vCPU consumed by stale teams), every Canvas tabs E2E retry
flaking on RunInstances VcpuLimitExceeded.

What stays:
- TeamHandler.Collapse — still useful; stops + removes children via
  StopWorkspaceAuto. Reachable from the canvas Collapse Team button.
  (Note: that button currently calls PATCH /workspaces/:id, not the
  Collapse endpoint — that's a separate reachability question for
  later.)
- findTemplateDirByName helper — kept in team.go pending a relocate
  decision; no in-package consumers after Expand.
- The four other paths that create child workspaces continue to work
  unchanged: regular POST /workspaces with parent_id, OrgHandler.Import
  (recursive tree), Bundle import, scripts.

What goes:
- POST /workspaces/:id/expand route (router.go)
- TeamHandler.Expand method (team.go: ~130 lines)
- 4 TestTeamExpand_* sqlmock tests (team_test.go)
- TestTeamExpand_UsesAutoNotDirectDockerPath AST gate
  (workspace_provision_auto_test.go) — pinned a code path that no
  longer exists; the generic TestNoCallSiteCallsDirectProvisionerExceptAuto
  gate still covers the architectural intent for any future caller.

Follow-up PRs:
- canvas/ContextMenu.tsx: drop the "Expand to Team" right-click button
  and its handleExpand callback; users create children via the regular
  New Workspace dialog with the parent picker (already supported)
- OrgHandler.Import idempotency (skip-if-exists OR replace_if_exists)
  — same bug class as the deleted Expand, but on the bulk-tree path
- One-off cleanup script for tenant-hongming's 72 stale workspaces

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 02:39:13 -07:00
Hongming Wang
ca61213578
Merge pull request #2853 from Molecule-AI/refactor/split-workspace-dispatchers-1777970000
refactor(handlers): extract dispatchers from workspace.go (#2800 partial)
2026-05-05 09:30:55 +00:00
Hongming Wang
44bb35a926 feat(delegations): wire ledger Insert+SetStatus from production code paths (RFC #2829 #318)
PR-1 shipped the `delegations` table + `DelegationLedger` helper. PR-3
wired the sweeper. PR-4 wired the dashboard. But no PR ever wired
`ledger.Insert` from a production code path — the table stayed empty,
the sweeper had nothing to sweep, the dashboard had nothing to show.

This PR closes that gap. Behind feature flag `DELEGATION_LEDGER_WRITE=1`
(default off), the legacy activity_logs writes are mirrored to the
durable ledger:

  - insertDelegationRow → ledger.Insert (queued)
  - updateDelegationStatus → ledger.SetStatus on every status transition
  - executeDelegation completion path → ledger.SetStatus(completed,
    result_preview) for the result preview that activity_logs already
    stores in response_body
  - Record handler → ledger.Insert + ledger.SetStatus(dispatched) so
    agent-initiated delegations land in the same table

## Why a flag

The legacy flow has ~30 strict-sqlmock tests pinning exactly which SQL
statements fire per handler. Adding ledger writes always-on would
force adding ExpectExec stanzas to each. Flag-off keeps all 30 green
without churn; flag-on lets operators populate the table in staging
to feed the sweeper + dashboard once the agent-side cutover (RFC #2829
PR-5) has proven the round-trip end-to-end.

Default off → byte-identical to pre-#318 behavior.

## Status vocabulary mapping

activity_logs uses a freer status vocabulary than the ledger's CHECK
constraint allows. updateDelegationStatus is called with values like
"received" that the ledger doesn't accept; the wiring filters via a
switch to only forward known-good values, skipping anything else.

Record's first activity_logs row is `dispatched` but the ledger's
Insert path requires `queued` as initial state. Insert as queued first;
the very next SetStatus(..., dispatched) promotes it on the same row.

## Coverage

8 wiring tests (delegation_ledger_writes_test.go):

  - flag off → no SQL fired (rollout safety contract)
  - flag on → INSERT + UPDATE fire as expected
  - flag rejects loose truthy values (true/yes/0/on/TRUE) — only "1"
    is the on signal, matching PR-2 + PR-5 conventions
  - terminal-state replay swallows ErrInvalidTransition (legacy is
    authoritative; ledger replay error is not a delegation failure)

All 30 existing delegation_test.go tests still pass — flag default off
keeps the strict-sqlmock surface unchanged.

Refs RFC #2829.
2026-05-05 02:26:06 -07:00
Hongming Wang
024ef260db refactor(handlers): extract dispatchers from workspace.go (#2800 partial)
workspace.go was 950 lines after the dispatcher work in PRs #2811 +
#2824 + #2843 + #2846 + #2847 + #2848 + #2850. This extracts the 6
SoT dispatcher helpers into a new workspace_dispatchers.go so the
file is the architectural unit it deserves to be (one place for
"how do we route a workspace lifecycle verb to a backend?").

Moved (no body changes — pure cut + paste with imports):
  - HasProvisioner               (gate accessor)
  - provisionWorkspaceAuto       (async provision)
  - provisionWorkspaceAutoSync   (sync provision, runRestartCycle's path)
  - StopWorkspaceAuto            (stop dispatcher)
  - RestartWorkspaceAuto         (restart wrapper)
  - RestartWorkspaceAutoOpts     (restart with resetClaudeSession)

workspace.go shrinks from 950 → 735 lines and now holds:
  - WorkspaceHandler struct + constructor
  - SetCPProvisioner / SetEnvMutators
  - Create / List / Get / scanWorkspaceRow
  - HTTP handler glue

workspace_dispatchers.go is 255 lines and holds the dispatcher trio +
sync variant + gate accessor + a header docblock summarizing the
history (PRs that added each helper) and the source-level pin tests
that gate against drift.

Source-level pin tests updated:
  - TestNoCallSiteCallsDirectProvisionerExceptAuto: workspace_dispatchers.go
    added to allowlist (the dispatcher IS the place that calls per-backend
    bodies directly).
  - TestNoCallSiteCallsBareStop: same.
  - TestNoBareBothNilCheck / TestOrgImportGate_UsesHasProvisionerNotBareField:
    no change — they were source-pinning specific files, not all callers.

Build clean, vet clean, full test suite passes (1742 / 0 in workspace,
all Go test packages green).

Out of scope (#2800 has more):
  - workspace_provision.go (869 lines) split into Docker + CP halves —
    files would still be 400+ each, marginal value. Defer until a
    third backend lands and the symmetry breaks.
  - Splitting Create / List / Get into per-handler files — they're
    short and tightly coupled to the struct; keep co-located.

Closes #2800 partial. Filing a follow-up issue if/when workspace.go
or workspace_provision.go grows past 800 lines again.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 02:24:49 -07:00
Hongming Wang
c85783fbee docs(workspace): point recovery hint at /external/rotate (not the never-shipped /tokens)
Self-review of #2852: the inline comment on the IssueToken-failed branch
still referenced POST /workspaces/:id/tokens, which never shipped. The
recovery path that did ship in #2852 is POST /workspaces/:id/external/rotate.
Update the hint so the next operator who hits this failure mode finds
the right endpoint.
2026-05-05 01:58:43 -07:00
Hongming Wang
b375252dc8 feat(external): credential rotation + re-show instruction modal (#319)
External workspaces (runtime=external) lose their workspace_auth_token
the moment the create modal closes — the token is unrecoverable from
any later DB read. Operators who lost their copy or want to respond to
a suspected leak had no recovery path short of recreating the workspace
(which also breaks cross-workspace delegation links + memory namespace).

This PR adds two endpoints + a Config-tab section that surfaces them:

  POST /workspaces/:id/external/rotate
    Revokes any prior live tokens, mints a fresh one, returns the same
    ExternalConnectionInfo payload Create returns. Old credentials stop
    working immediately — the previously-paired agent will fail auth on
    its next heartbeat (~20s).

  GET /workspaces/:id/external/connection
    Returns the connect block with auth_token="". For the operator who
    just needs to re-find PLATFORM_URL / WORKSPACE_ID / one of the
    snippets without invalidating the live agent.

Both reject runtime ≠ external with 400 + a hint pointing at /restart
for non-external runtimes (which mints AND injects into the container).

## Why a flag isn't needed

The endpoints are purely additive — Create's behavior is unchanged.
Existing external workspaces don't see anything different until an
operator clicks the new buttons.

## DRY refactor

Extracted BuildExternalConnectionPayload() in external_connection.go
as the single source of truth for the connect payload shape. Create,
Rotate, and GetExternalConnection all call it. Adds a snippet once →
all three endpoints emit it. Trims trailing slash on platform_url so
no double-slash sneaks into registry_endpoint.

## Canvas

ExternalConnectionSection mounts in ConfigTab when runtime=external.
Two buttons:
  - "Show connection info" (cosmetic) — fetches GET /external/connection
  - "Rotate credentials" (destructive) — confirm dialog explains the
    impact, then POST /external/rotate

Both reuse the existing ExternalConnectModal so operators don't learn
a second snippet UX.

## Coverage

10 Go tests:
  - Rotate happy path (revoke + mint order, payload shape, broadcast event)
  - Rotate refuses non-external runtimes (400 with restart hint)
  - Rotate 404 on unknown workspace + 400 on empty id
  - GetExternalConnection happy path (auth_token="", same payload shape)
  - GetExternalConnection refuses non-external + 404 on unknown
  - BuildExternalConnectionPayload — placeholder substitution + trailing
    slash trimming + blank-token contract

6 canvas tests:
  - both action buttons render
  - "Show" calls GET /external/connection and opens modal
  - "Rotate" opens confirm dialog before firing POST
  - Cancel dismisses without rotating
  - Confirm POSTs and opens modal with returned token
  - API failures surface as visible error chips

Migration: existing external workspaces gain new abilities; no data
migration. The DRY refactor preserves byte-identical Create response
shape (8 ConfigTab tests + all existing handler tests still pass).

Closes #319.
2026-05-05 01:55:27 -07:00
Hongming Wang
61b7755c3c feat(handlers): migrate Pause loop to StopWorkspaceAuto — #2799 Phase 3
Last open #2799 site. Pause's per-workspace stop call now routes
through StopWorkspaceAuto, removing the final inline if-cpProv-else
(actually if-h.provisioner) dispatch from workspace_restart.go's
restart/pause/resume code paths.

Pre-2026-05-05 the Pause loop was:

  if h.provisioner != nil {
      h.provisioner.Stop(ctx, ws.id)
  }

Same drift class as #2813 (team-collapse leak) + #2814 (workspace
delete leak) — Docker-only stop silently no-ops on SaaS, leaving
the EC2 running while the workspace row gets marked paused. Orphan
sweeper would catch it eventually but the leak window is real.

Pause-specific bookkeeping (mark paused, clear workspace keys,
broadcast WORKSPACE_PAUSED) stays inline in the handler; only the
"stop the running workload" step delegates. StopWorkspaceAuto's
no-backend → no-op semantics match the pre-fix behavior on
misconfigured deployments (the bookkeeping still runs).

One new source-level pin:
  TestPauseHandler_UsesStopWorkspaceAuto — gates regression to the
  inline dispatch shape.

This closes #2799 Phase 3. After this PR + #2847 (Phase 2 PR-B) land,
workspace_restart.go has no remaining inline if-cpProv-else dispatch
in any user-facing code path. The remaining direct backend calls
inside the file are in stopForRestart and cpStopWithRetry — both
internal helpers that ARE the dispatcher's underlying primitives,
not new bypasses.

Note: scope was originally tagged "Phase 3 needs PauseWorkspaceAuto
verb" in the audit on PR #2843. On closer reading Pause's stop step
is identical to Stop — only the bookkeeping is Pause-specific. Reusing
StopWorkspaceAuto avoids unnecessary surface and keeps the dispatcher
trio (provision/stop/restart) tight.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 00:00:16 -07:00
Hongming Wang
9a772bf946 feat(handlers): provisionWorkspaceAutoSync + Site 4 migration — #2799 Phase 2 PR-B
runRestartCycle's auto-restart cycle (Site 4 from PR #2843's audit)
needs synchronous provision dispatch — the outer pending-flag loop
in RestartByID relies on returning when the new container is up so
the next restart cycle doesn't race the in-flight provision goroutine
on its Stop call.

Phase 1's provisionWorkspaceAuto wraps each per-backend body in
`go func() {...}()` — wrong shape for runRestartCycle's needs. This
PR introduces provisionWorkspaceAutoSync as a behavioral mirror that
runs in the current goroutine instead.

Two helpers, kept identical except for the wrapper:

  provisionWorkspaceAuto:     spawns goroutine, returns immediately
  provisionWorkspaceAutoSync: blocks until per-backend body returns

Same backend-selection (CP first, Docker second) + no-backend
mark-failed fallback. When one grows a new arm (third backend, retry
semantics), the other should too — pinned in the docstring.

Site 4 (runRestartCycle) was the only call site that needs sync today.
Migrating it removes the last bare if-cpProv-else dispatch in the
restart code path's provision half.

Three new tests:
  - TestProvisionWorkspaceAutoSync_RoutesToCPWhenSet
  - TestProvisionWorkspaceAutoSync_NoBackendMarksFailed
  - TestRunRestartCycle_UsesProvisionWorkspaceAutoSync (source-level pin)

Out of scope (last open #2799 site):
  Phase 3 — Site 5 (Pause loop). PAUSE doesn't reprovision; needs a
  new PauseWorkspaceAuto verb. After this PR lands, Pause is the only
  inline if-cpProv-else dispatch left in workspace_restart.go.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 23:44:54 -07:00
Hongming Wang
0a90d7ae1a
Merge pull request #2846 from Molecule-AI/feat/2799-phase2-restart-resume-1777958000
feat(handlers): migrate Restart + Resume handlers to dispatchers — #2799 Phase 2 PR-A
2026-05-05 05:15:29 +00:00
Hongming Wang
5b7f4d260b feat(handlers): migrate Restart + Resume handlers to dispatchers — #2799 Phase 2 PR-A
Sites 1+2 (Restart HTTP handler goroutine) and Site 3 (Resume HTTP
handler goroutine) now route through RestartWorkspaceAutoOpts /
provisionWorkspaceAuto instead of inlining the if-cpProv-else dispatch.

Three changes:

1. **RestartWorkspaceAutoOpts** — new variant of RestartWorkspaceAuto
   that carries the resetClaudeSession Docker-only flag (issue #12).
   The bare RestartWorkspaceAuto still exists as a wrapper that calls
   Opts with false. CP path silently ignores the flag (each EC2 boots
   fresh — no session state to clear). Mirrors the Provision pair
   (provisionWorkspace / provisionWorkspaceOpts).

2. **Restart handler (Site 1+2)** — the inline goroutine
   `if h.provisioner != nil { Stop } else if h.cpProv != nil { ... }`
   collapses to `RestartWorkspaceAutoOpts(...)`. Pre-fix the dispatch
   was Docker-FIRST ordering (a different drift class from the
   silent-drop bugs PRs #2811/#2824 closed); the dispatcher enforces
   CP-FIRST.

3. **Resume handler (Site 3)** — Resume is provision-only (workspace
   is paused, no live container), so it routes through
   provisionWorkspaceAuto, not RestartWorkspaceAuto. Inline
   if-cpProv-else dispatch removed.

Two new source-level pins:
  - TestRestartHandler_UsesRestartWorkspaceAuto
  - TestResumeHandler_UsesProvisionWorkspaceAuto

These prevent regression to the inline dispatch pattern.

Out of scope (tracked under #2799):
  - Site 4 (runRestartCycle) — synchronous coordination model needs
    a different shape than the fire-and-return dispatchers. PR-B.
  - Site 5 (Pause loop) — PAUSE doesn't reprovision, needs a new
    PauseWorkspaceAuto verb. Phase 3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 22:09:12 -07:00
Hongming Wang
7993693cf1 feat(delegations): wire RFC #2829 sweeper + admin routes into platform server
Activates the server-side foundation that PRs #2832, #2836, #2837
shipped without wiring (each PR landed dead code on purpose so the
review surface stayed tight).

## What this PR wires up

  1. router.go — registers the RFC #2829 PR-4 admin endpoints behind
     AdminAuth:
       GET /admin/delegations[?status=...&limit=N]
       GET /admin/delegations/stats

  2. cmd/server/main.go — starts the RFC #2829 PR-3 stuck-task
     sweeper as a supervised goroutine alongside the existing
     scheduler + hibernation-monitor + image-auto-refresh:
       go supervised.RunWithRecover(ctx, "delegation-sweeper",
                                    delegSweeper.Start)

## What this PR does NOT do

  - PR-2's DELEGATION_RESULT_INBOX_PUSH flag stays default off — flip
    happens via env config in a follow-up after staging burn-in.
  - PR-5's DELEGATION_SYNC_VIA_INBOX flag stays default off — same
    reason. The two flags are independent; either can be flipped in
    isolation.
  - Canvas operator panel UI: this PR exposes the JSON contract; the
    canvas panel consumes it in a separate canvas PR.

## Coverage

2 new router gate tests in admin_delegations_route_test.go:

  - List endpoint requires AdminAuth (unauthenticated → 401)
  - Stats endpoint requires AdminAuth (unauthenticated → 401)

Pattern mirrors admin_test_token_route_test.go (the IDOR-fix gate
for PR #112). Catches a future router refactor that silently drops
AdminAuth — operator dashboard data exposes caller_id, callee_id, and
task_preview, none of which should reach unauthenticated callers.

Sweeper boots as a no-op until at least one delegation row exists,
so this PR is safe to land before PR-5's agent-side cutover sees
production traffic.

Refs RFC #2829.
2026-05-04 22:00:59 -07:00
Hongming Wang
789d705866
Merge pull request #2843 from Molecule-AI/fix/restart-dispatcher-rework-1777956000
feat(handlers): RestartWorkspaceAuto dispatcher — #2799 Phase 1 (re-do of #2835)
2026-05-05 04:48:52 +00:00
Hongming Wang
cb820acbd6 fix(test): pre-register sqlmock for panic-recovered Docker test goroutine 2026-05-04 21:44:31 -07:00
Hongming Wang
82e7059e0e
Merge pull request #2842 from Molecule-AI/fix/codex-template-bump-cli-pin
fix(external-templates): unpin codex CLI from stale ^0.57
2026-05-05 04:34:14 +00:00
Hongming Wang
4f67fe59fb feat(handlers): RestartWorkspaceAuto dispatcher — #2799 Phase 1
Closes the third silent-drop-on-SaaS class for the restart verb. Two
of the three dispatchers were already in place (provisionWorkspaceAuto
PR #2811, StopWorkspaceAuto PR #2824); this completes the trio.

PR #2835 was an earlier attempt at this work (delivered by a peer
agent) that I had to send back for four critical bugs — stop-leg
dispatch order inverted, no-backend nil-deref, empty payload (dispatcher
unusable by callers), forcing-function tests red-from-day-1. This
re-do takes the audit + classification from that work but rebuilds
the implementation against the existing dispatcher convention.

Phase 1 scope:

  - RestartWorkspaceAuto in workspace.go — symmetric mirror of
    provisionWorkspaceAuto + StopWorkspaceAuto. CP-first dispatch
    order. cpStopWithRetry on the SaaS leg (Restart's "make it alive
    again" contract justifies the retry that StopWorkspaceAuto's
    delete-time contract does not). Three-arm shape including a
    no-backend mark-failed defense-in-depth.
  - Three new pin tests covering the routing surface:
    TestRestartWorkspaceAuto_RoutesToCPWhenSet,
    TestRestartWorkspaceAuto_RoutesToDockerWhenOnlyDocker,
    TestRestartWorkspaceAuto_NoBackendMarksFailed.

Phase 2/3 (deferred, file as follow-up issue):

  - workspace_restart.go's manual dispatch sites (Restart handler
    goroutine, Resume handler goroutine, runRestartCycle's inline
    Stop, Pause loop). Each site has async-context reasoning beyond
    a fire-and-return dispatcher and needs per-site review.
  - Pause specifically needs a different verb (PauseWorkspaceAuto)
    since Pause doesn't reprovision.

Why no callers migrated in this PR: the existing call sites in
workspace_restart.go all build their `payload` from a synchronous
DB read first; rewiring them needs care to preserve that ordering
plus the resetClaudeSession + template path resolution that lives
in the HTTP handler context. Splitting the dispatcher introduction
from the migration keeps each PR small and reviewable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 21:30:36 -07:00
Hongming Wang
410275e5af fix(external-templates): unpin codex CLI from stale ^0.57
`^0.57` only allows 0.57.x — codex CLI is now at 0.128 with breaking
API changes between (notably `exec --resume <sid>` → `exec resume <sid>`
subcommand). Operators following the snippet today either get a
6-month-old codex with the legacy resume flag, OR install latest manually
and discover the daemon previously couldn't drive it.

codex-channel-molecule 0.1.2 (just published) handles the new subcommand
shape, so operators are best served by always getting the latest codex
that the bridge daemon was last validated against. Bump to `@latest`.

If a future codex CLI breaks the daemon's invocation again, we ship a
new bridge-daemon release rather than asking operators to manage a pin
themselves.

Test: go test ./internal/handlers/ -run TestExternalTemplates -count=1 → green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 21:27:45 -07:00
Hongming Wang
1557743ef9
Merge branch 'staging' into feat/rfc2829-pr2-result-push-and-sync-cutover 2026-05-04 21:25:33 -07:00
Hongming Wang
b0bcd97781
Merge pull request #2839 from Molecule-AI/fix/status-failed-must-set-error-1777954000
fix(bundle): markFailed sets last_sample_error + AST drift gate (resolves #2632 root cause)
2026-05-05 04:12:38 +00:00
Hongming Wang
56149f8a24 fix(bundle): markFailed sets last_sample_error + AST gate
Closes the bug class surfaced by Canvas E2E #2632: a workspace ends up
status='failed' with last_sample_error=NULL, and operators (or the
E2E poll loop) see the useless "Workspace failed: (no last_sample_error)"
with no triage signal.

Two pieces:

1. **bundle/importer.go markFailed** — the UPDATE was setting only
   status, leaving last_sample_error NULL. Same incident class as the
   silent-drop bugs in PRs #2811 + #2824, different code path.
   markProvisionFailed in workspace_provision_shared.go has set the
   message column for a long time; this writer drifted the convention.
   Fix: include last_sample_error in the SET clause + the broadcast.

2. **AST drift gate** (db/workspace_status_failed_message_drift_test.go)
   — Go AST walk that finds every db.DB.{Exec,Query,QueryRow}Context
   call whose argument list binds models.StatusFailed and asserts the
   SQL literal contains last_sample_error. Catches the next caller
   that drifts the same convention. Verified to FAIL against the bug
   shape (reverted importer.go temporarily — gate flagged the exact
   line) and PASS against the fix.

Why an AST gate vs a regex: pre-fix attempt with a regex over UPDATE
statements flagged status='online' / status='hibernating' / status=
'removed' UPDATEs as false positives. Walking the AST and only
flagging calls that pass the StatusFailed constant eliminates that.
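
A minimal sketch of the walk, under the commit's described shape (the
real gate also covers QueryContext/QueryRowContext and resolves the
models.StatusFailed constant properly; identifier matching here is
simplified for illustration):

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strings"
)

// driftingCalls parses Go source and returns positions of ExecContext
// calls that bind a StatusFailed identifier but whose SQL string
// literal omits last_sample_error. Simplified: a real gate walks every
// file under db/ and handles all three *Context call names.
func driftingCalls(src string) []token.Pos {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "src.go", src, 0)
	if err != nil {
		panic(err)
	}
	var bad []token.Pos
	ast.Inspect(f, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		sel, ok := call.Fun.(*ast.SelectorExpr)
		if !ok || sel.Sel.Name != "ExecContext" {
			return true
		}
		usesStatusFailed := false
		sql := ""
		for _, arg := range call.Args {
			switch a := arg.(type) {
			case *ast.SelectorExpr:
				if a.Sel.Name == "StatusFailed" {
					usesStatusFailed = true
				}
			case *ast.BasicLit:
				if a.Kind == token.STRING {
					sql = a.Value
				}
			}
		}
		// Only calls binding StatusFailed are checked — status='online'
		// etc. UPDATEs never reach this branch, eliminating the regex
		// false positives described above.
		if usesStatusFailed && !strings.Contains(sql, "last_sample_error") {
			bad = append(bad, call.Pos())
		}
		return true
	})
	return bad
}

func main() {
	buggy := "package p\nfunc f() { db.ExecContext(ctx, \"UPDATE workspaces SET status=$1\", models.StatusFailed) }"
	fmt.Println(len(driftingCalls(buggy))) // 1: the drifted UPDATE is flagged
}
```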

Out of scope (filed separately if needed):
- The Canvas E2E that surfaced the missing message (#2632) is now a
  required check on staging via PR #2827. Once this fix lands the
  next staging push should re-run #2632's failing case and produce
  a meaningful last_sample_error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 21:08:08 -07:00
Hongming Wang
0134353a48
Merge pull request #2838 from Molecule-AI/feat/memories-update-endpoint
feat(memories): PATCH /workspaces/:id/memories/:id endpoint for edits
2026-05-05 04:06:01 +00:00
Hongming Wang
aca7d99152
Merge pull request #2837 from Molecule-AI/feat/rfc2829-pr4-operator-dashboard
feat(delegations): operator dashboard endpoint over the durable ledger (RFC #2829 PR-4)
2026-05-05 04:01:46 +00:00
Hongming Wang
aec0fb35d2 feat(memories): PATCH /workspaces/:id/memories/:id endpoint for edits
Pre-fix, the only writes to agent_memories were Commit (POST) and
Delete (DELETE). Editing an entry meant delete + recreate, losing the
original id and created_at, and (the user-visible reason for filing
this) leaving the canvas Memory tab without an Edit button at all.

Adds PATCH that accepts either content, namespace, or both — at
least one required (empty body 400s; silently no-op'ing would let a
buggy client think it succeeded). The full Commit security pipeline
is re-run on content edits:
  - redactSecrets on every scope (#1201 SAFE-T)
  - GLOBAL [MEMORY → [_MEMORY delimiter escape (#807 SAFE-T)
  - GLOBAL audit log row mirroring Commit's #767 forensic pattern
  - re-embed via the configured EmbeddingFunc (skipping would leave
    the row's vector pointing at the OLD content, silently breaking
    semantic search)

Cross-scope edits (LOCAL→GLOBAL) intentionally NOT supported — that's
delete + recreate so the GLOBAL access-control gate (only root
workspaces can write GLOBAL) gets re-evaluated cleanly.
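
The request-validation contract above can be sketched as (field and
type names illustrative, not the handler's actual identifiers):

```go
package main

import (
	"errors"
	"fmt"
)

// patchMemoryRequest: content, namespace, or both — at least one
// required. An empty body is a 400, never a silent no-op.
type patchMemoryRequest struct {
	Content   *string `json:"content,omitempty"`
	Namespace *string `json:"namespace,omitempty"`
}

var errEmptyPatch = errors.New("at least one of content or namespace is required")

func (r patchMemoryRequest) validate() error {
	if r.Content == nil && r.Namespace == nil {
		return errEmptyPatch // empty body → 400
	}
	if r.Content != nil && *r.Content == "" {
		return errors.New("content must not be empty") // empty-content → 400
	}
	return nil
}

func main() {
	fmt.Println(patchMemoryRequest{}.validate())
}
```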

7 new sqlmock tests pin: namespace-only, content-only LOCAL,
content-only GLOBAL with audit + escape, empty-body 400, empty-
content 400, 404 on missing/wrong-workspace memory, no-op 200 with
changed=false (and crucially: no UPDATE fires on no-op).

Build clean, full handlers test suite (./internal/handlers) passes
in 4s.

PR-2 (frontend): Add modal + Edit button in MemoryInspectorPanel.tsx
will land separately.
2026-05-04 21:00:47 -07:00
Hongming Wang
2ed4f4fb41 feat(delegations): operator dashboard endpoint over the durable ledger (RFC #2829 PR-4)
Two read endpoints over the `delegations` table (PR-1 schema):

  GET /admin/delegations[?status=in_flight|stuck|failed|completed&limit=N]
  GET /admin/delegations/stats

## What this gives operators

Without this, post-incident investigation requires direct DB access —
only the on-call SRE can answer "is workspace X delegating to a wedged
callee?". This moves that visibility into the same surface as
/admin/queue, /admin/schedules-health, /admin/memories.

## List endpoint

Status filter via tight allowlist:

  - in_flight (default) → status IN (queued, dispatched, in_progress)
  - stuck               → status='stuck' (rows the PR-3 sweeper marked)
  - failed              → status='failed'
  - completed           → status='completed'

Unknown status → 400 with the allowlist in the error body. Limit
1..1000, default 100.

The status allowlist drives a parameterized IN clause (no string-
concatenation of user-controlled values into SQL).

Result rows expose all the audit-grade fields the dashboard needs:
delegation_id, caller_id, callee_id, task_preview, status,
last_heartbeat, deadline, result_preview, error_detail, retry_count,
created_at, updated_at. Nullable fields use pointer types so JSON
omits them when NULL (no false-zero "" for missing values).

## Stats endpoint

Zero-fills every known status key (queued, dispatched, in_progress,
completed, failed, stuck) so the dashboard summary card doesn't have
to handle "missing key vs zero" branching.
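
The zero-fill contract, sketched (variable names illustrative; the six
keys match the documented lifecycle):

```go
package main

import "fmt"

var knownStatuses = []string{"queued", "dispatched", "in_progress", "completed", "failed", "stuck"}

// zeroFill returns counts with every known status present, so the
// dashboard summary card never has to distinguish "missing key" from
// zero.
func zeroFill(counts map[string]int) map[string]int {
	out := make(map[string]int, len(knownStatuses))
	for _, s := range knownStatuses {
		out[s] = counts[s] // missing keys read as 0 in Go
	}
	return out
}

func main() {
	fmt.Println(zeroFill(map[string]int{"completed": 7})["stuck"]) // 0
}
```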

## Out of scope (deferred)

  - "retry this stuck task" mutation: needs the agent-side cutover
    (RFC #2829 PR-5 plan) before re-fire is safe
  - p95 / p99 duration aggregates: separate metric exposure, not a
    row-level read endpoint
  - Canvas UI: this is the JSON contract; the canvas operator panel
    consumes it in a follow-up canvas PR

## Wiring

NOT wired into the router in this PR — ships separately to keep
PR-by-PR review surface tight. Wiring will land in the
`enable-rfc2829-server-side` follow-up PR alongside the sweeper Start
call and the result-push flag flip.

## Coverage

11 unit tests:

List (8):
  - default status=in_flight, IN(queued,dispatched,in_progress)
  - status=stuck → IN(stuck)
  - status=failed → IN(failed)
  - unknown status → 400 with allowlist
  - negative limit → 400
  - over-cap limit → 400
  - custom limit accepted + echoed in response
  - nullable fields populated correctly (pointer-omitempty)

Stats (2):
  - zero-fills missing status keys
  - empty table → all counts zero

Contract pin (1):
  - statusFilters table shape — every documented key + value pair
    pinned. Drift catches accidental edits (forward defense).

Refs RFC #2829.
2026-05-04 20:58:17 -07:00
Hongming Wang
02b325063b feat(delegations): stuck-task sweeper with deadline + heartbeat-staleness rules (RFC #2829 PR-3)
Periodically scans the `delegations` table (PR-1 schema) for in-flight
rows that need terminal action:

  1. Deadline-exceeded → marked `failed` with "deadline exceeded by sweeper"
  2. Heartbeat-stale (no beat for >10× heartbeat interval) → marked `stuck`

## Why both rules

Deadline catches forever-heartbeating wedged agents (the alive-but-not-
advancing class — agent loops on heartbeat call inside its main loop).
Heartbeat-staleness catches OOM-killed and crashed agents that stop cold
without graceful shutdown. Either rule alone misses one of these classes.

## Order matters

Deadline is checked first. A deadline-exceeded AND stale row is marked
`failed` (operator action: investigate + give up), not `stuck` (operator
action: investigate + retry). The semantic difference matters.

## NULL heartbeat is a free pass

A delegation that's just been inserted but hasn't emitted its first
heartbeat yet is NOT stuck-marked — gives the agent its first beat
window. Lets the deadline catch true never-started rows naturally.

## Concurrent-completion safety

Sweep races with UpdateStatus on a delegation that just completed: the
ledger's terminal forward-only protection (PR-1) returns ErrInvalidTransition,
sweeper logs + counts in Errors, the row stays correctly in completed.

## Configuration

  - DELEGATION_SWEEPER_INTERVAL_S — tick cadence (default 5min)
  - DELEGATION_STUCK_THRESHOLD_S  — heartbeat-staleness threshold (default 10min)

Both fall back gracefully on invalid input (typo'd env shouldn't crash
startup). Both read at construction time so a long-running process
picks up overrides via restart.

## Wiring

NOT wired into main.go in this PR — that ships separately so the
sweeper can be enabled/disabled independently of the binary upgrade.
The sweeper is a standalone Sweep(ctx) callable + Start(ctx) ticker
loop, both with panic recovery, both indexed-scan-cheap on the
partial idx_delegations_inflight_heartbeat from PR-1.

## Coverage

13 unit tests against sqlmock-backed *sql.DB:

Sweep semantics (8 tests):
  - empty in-flight set → clean no-op
  - deadline → failed
  - heartbeat-stale → stuck
  - NULL heartbeat is left alone (first-beat free pass)
  - healthy row → no-op
  - both-rule row → marked failed (deadline wins)
  - mixed set → both rules fire on the right rows
  - concurrent-completion race → forward-only protection holds

Env override parsing (5 tests):
  - default on missing env
  - parses positive seconds
  - falls back on garbage
  - falls back on negative
  - constructor picks up overrides; defaults when env unset

Refs RFC #2829.
2026-05-04 20:55:13 -07:00
Hongming Wang
ae79b9e9fe feat(delegations): result-push to caller inbox behind feature flag (RFC #2829 PR-2)
When a delegation completes (or fails), also write an
`activity_type='a2a_receive'` row to the caller's activity_logs so the
caller's inbox poller (workspace/inbox.py — `?type=a2a_receive`) surfaces
the result to the agent.

Why: today the only way the caller agent learns about a delegation result
is by holding open an HTTP `message/send` connection through the platform
proxy. That connection has a hard timeout (~600s) — a 90-iteration
external-runtime task on stream output routinely blows past it, and the
result emitted after the timeout lands in /dev/null. (Hongming's home
hermes hit this on 2026-05-05 — task was actively heartbeating "iteration
14/90" when the proxy timer fired.)

This PR adds the SERVER-SIDE result-push so the result is durably
delivered to the caller's inbox queue. The agent-side cutover (replace
sync httpx delegation with delegate_task_async + wait_for_message poll)
ships in the next PR — once both land, the proxy timeout class is gone.

## Feature flag

`DELEGATION_RESULT_INBOX_PUSH=1` enables the push. Default off — staging
canary first, flip after RFC #2829 PR-3 (agent-side) lands and proves
the round-trip end-to-end. With the flag off, behavior is byte-identical
to before this PR (verified by TestUpdateStatus_FlagOff_NoNewSQL).

## Two write sites

  1. UpdateStatus handler (POST /workspaces/:id/delegations/:id/update)
     — agent-initiated delegations report status here
  2. executeDelegation goroutine — canvas-initiated delegations
     (POST /workspaces/:id/delegate) report status from this background
     goroutine

Both paths call `pushDelegationResultToInbox` which is best-effort: an
INSERT failure logs but does NOT propagate up. The existing
`delegate_result` row in activity_logs (the dashboard view) remains
authoritative; the new `a2a_receive` row is purely additive for the
inbox-poller to surface.

## Coverage

6 new tests in delegation_inbox_push_test.go:

  - flag off → no SQL fired (the rollout-safety contract)
  - flag on, completed → a2a_receive row with status=ok
  - flag on, failed    → a2a_receive row with status=error + error_detail
  - UpdateStatus end-to-end (flag on, completed)
  - UpdateStatus end-to-end (flag on, failed)
  - UpdateStatus end-to-end (flag off, byte-identical to pre-PR behavior)

All 30 existing delegation_test.go tests still pass — flag default off
keeps the strict-sqlmock surface unchanged.

Refs RFC #2829.
2026-05-04 20:50:46 -07:00
Hongming Wang
ed6dfe01e5 feat(delegations): durable per-task ledger + audit-write helper (RFC #2829 PR-1)
Adds the `delegations` table and the DelegationLedger writer that PRs #2-#4
of RFC #2829 build on. Schema-only foundation — no behavior change in this
PR. PR-2 wires the ledger into the existing handlers and ships the result-
push-to-inbox cutover behind a feature flag.

Why a dedicated table when activity_logs already records every delegation
event:

  Today, "what is currently in flight for this workspace" is reconstructed
  by GROUPing activity_logs by delegation_id and ORDER BY created_at DESC.
  PR-3's stuck-task sweeper needs the join

      SELECT delegation_id FROM delegations
      WHERE status = 'in_progress'
        AND last_heartbeat < now() - interval '10 minutes'

  which is impossible to express against the event stream without a window
  over every (delegation_id, latest event) pair — a planner-killing query
  at scale. The dedicated table makes the sweeper an indexed scan.

  Same posture as tenant_resources (PR #2343, memory
  `reference_tenant_resources_audit`): activity_logs remains the audit-
  grade source of truth, delegations is the queryable view for dashboards
  + sweeper joins. Symmetric writes — both tables are written, neither
  blocks orchestration on the other's failure.

Schema highlights:
  - delegation_id PRIMARY KEY (caller-chosen, idempotent retry on
    restart is a no-op via ON CONFLICT DO NOTHING)
  - caller_id / callee_id NOT FK — workspace delete must NOT cascade-
    delete delegation history (audit retention)
  - status CHECK constraint enforces the lifecycle
    (queued|dispatched|in_progress|completed|failed|stuck)
  - last_heartbeat NULL-able; PR-3 sweeper compares to NOW()
  - deadline default now()+6h matches longest-observed legit delegation
    (memory-namespace migrations) — protects against forever-heartbeating
    wedged agents
  - Partial index `idx_delegations_inflight_heartbeat` keeps the sweeper
    hot path tiny (only non-terminal rows)
  - UNIQUE(caller_id, idempotency_key) WHERE NOT NULL — natural
    collision becomes ON CONFLICT no-op without colliding across callers

DelegationLedger.SetStatus enforces forward-only on terminal states
(completed/failed/stuck cannot be revised) as defense-in-depth on the
schema CHECK. Same-status replay is a no-op. Missing-row SetStatus is
a no-op (transient inconsistency the next agent retry will heal).

Heartbeat updates only in-flight rows — terminal-state delegations are
silently skipped.

Coverage:
  - 17 unit tests against sqlmock-backed *sql.DB (Insert happy path,
    missing-required guards, truncation, lifecycle transitions, terminal
    forward-only protection, replay no-op, missing-row no-op, empty-input
    rejection, heartbeat semantics, transition table shape)
  - Migration roundtrip verified on a real Postgres 15 instance:
    up creates the expected schema with all 4 indexes + CHECK, down
    drops everything cleanly.

Refs RFC #2829.
2026-05-04 20:43:06 -07:00
Hongming Wang
46d79a3e3b
Merge pull request #2824 from Molecule-AI/fix/stop-workspace-auto-saas-1777945000
fix(provision): StopWorkspaceAuto mirror — close SaaS EC2-leak class
2026-05-05 03:05:09 +00:00
Hongming Wang
2198f92dcb
Merge pull request #2823 from Molecule-AI/feat/codex-tab-pypi-install
feat(external-templates): codex tab uses plain pip install
2026-05-05 03:03:08 +00:00
Hongming Wang
11c9ed2a46 fix(provision): StopWorkspaceAuto mirror — close SaaS EC2-leak class
Closes #2813 (team-collapse) and #2814 (workspace delete).

Two leaks, one class. Both call sites had the same shape pre-fix:

  if h.provisioner != nil {
      h.provisioner.Stop(ctx, wsID)
  }

On SaaS where h.provisioner (Docker) is nil and h.cpProv is set, that
gate evaluates false and the EC2 keeps running. Workspace gets marked
removed in DB; EC2 lives on until the orphan sweeper catches it.

Same drift class as PR #2811's org-import provision bug — a Docker-
only check on what should be a both-backend operation. Confirmed in
production: PR #2811's verification step deleted a test workspace and
the EC2 stayed running until I terminated it manually.

Fix: WorkspaceHandler.StopWorkspaceAuto(ctx, wsID) — symmetric mirror
of provisionWorkspaceAuto. CP first, Docker second, no-op when neither
is wired (a workspace nobody is running can't be stopped — that's a
no-op, not a failure, distinct from provision's mark-failed contract).

Three call-site changes:
- team.go:208 (Collapse) → h.wh.StopWorkspaceAuto(ctx, childID)
- workspace_crud.go:432 (stopAndRemove) → h.StopWorkspaceAuto(...);
  RemoveVolume stays Docker-only behind an explicit gate since
  CP-managed workspaces have no host-bind volumes
- TeamHandler.provisioner field + NewTeamHandler's *Provisioner param
  removed as dead code (Stop was the only call site)

Volume cleanup separation is intentional: the abstraction is "stop
the running workload," not "tear down all state." Callers that need
volume cleanup keep their `if h.provisioner != nil { RemoveVolume }`
gate AFTER the Stop call.

Tests:
- TestStopWorkspaceAuto_RoutesToCPWhenSet — SaaS path
- TestStopWorkspaceAuto_RoutesToDockerWhenOnlyDocker — self-hosted
- TestStopWorkspaceAuto_NoBackendIsNoOp — pins the contract distinction
  from provisionWorkspaceAuto's mark-failed
- TestNoCallSiteCallsBareStop — source-level pin against
  `.provisioner.Stop(` / `.cpProv.Stop(` outside the dispatcher,
  per-backend bodies, restart helper, and the Docker-daemon-direct
  short-lived-container path. Strips Go comments before substring
  match so archaeology in code comments doesn't trip the gate.
- Verified: pin FAILS against the buggy shape (workspace_crud.go
  reversion); team.go reversion compile-fails because the field is
  gone — even stronger than the test.

Out of scope (tracked under #2799):
- workspace_restart.go's manual if-cpProv-else dispatch with retry
  semantics tuned for the restart hot path. Functionally equivalent
  + wraps cpStopWithRetry, so it's not the bug class this PR closes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 20:00:23 -07:00
Hongming Wang
c0bfd19b9e feat(external-templates): codex tab uses plain pip install for bridge daemon
`codex-channel-molecule` 0.1.0 is now on PyPI, so operators no longer need
the `git+https://...` URL workaround.

Verified: `pip install codex-channel-molecule` from a clean venv installs
the wheel and the `codex-channel-molecule --help` console script runs.

PyPI: https://pypi.org/project/codex-channel-molecule/0.1.0/

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 19:58:56 -07:00
Hongming Wang
e0f9434eaf fix(files-eic): silence ssh known-hosts warning that 500'd Hermes config load
GET /workspaces/:id/files/config.yaml on hongming.moleculesai.app's
Hermes workspace returned 500 with body:

  ssh cat: exit status 1 (Warning: Permanently added '[127.0.0.1]:37951'
   (ED25519) to the list of known hosts.)

Root cause: ssh emits the "Permanently added" notice on every fresh
tunnel connection, even with UserKnownHostsFile=/dev/null (that
prevents persistence, not the warning). It lands on stderr, fooling
readFileViaEIC's classifier:

  if len(out) == 0 && stderr.Len() == 0 {
      return nil, os.ErrNotExist
  }
  return nil, fmt.Errorf("ssh cat: %w (%s)", runErr, ...)

stderr was non-empty (the warning), so we returned the wrapped error
→ 500 from the HTTP layer instead of 404.

Fix: add `-o LogLevel=ERROR` to BOTH writeFileViaEIC and readFileViaEIC
ssh invocations. Silences info+warning while keeping real auth/tunnel
errors visible (those emit at ERROR level).
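
A sketch of the invocation shape (host, port, and key path are
placeholders; only the `-o` option flags follow the commit):

```go
package main

import "fmt"

// sshArgs builds the arg list for the tunnel cat. UserKnownHostsFile=
// /dev/null prevents persistence but not the warning; LogLevel=ERROR
// is the actual silencer, while real auth/tunnel errors still surface.
func sshArgs(port, keyPath, path string) []string {
	return []string{
		"-o", "StrictHostKeyChecking=no",
		"-o", "UserKnownHostsFile=/dev/null",
		"-o", "LogLevel=ERROR",
		"-i", keyPath,
		"-p", port,
		"root@127.0.0.1",
		"cat", path,
	}
}

func main() {
	fmt.Println(sshArgs("37951", "/tmp/key", "config.yaml"))
}
```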

Test: TestSSHArgs_LogLevelErrorBothSites pins the flag in both blocks.
Mutation-tested: stripping the flag from one site fails the gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 19:58:49 -07:00
Hongming Wang
7e1fdf5847 refactor(provision): use HasProvisioner() at all gate-y both-nil checks
SSOT pass — replace 4 bare `h.provisioner == nil && h.cpProv == nil`
checks with `!h.HasProvisioner()`. When a third backend lands (k8s,
containerd, whatever), HasProvisioner gets one new field; bare both-nil
checks would each need to be hunted and updated.

Sites:
- a2a_proxy_helpers.go:166 — maybeMarkContainerDead skip-no-backend
- workspace_restart.go:118 — Restart endpoint guard
- workspace_restart.go:363 — RestartByID coalescer guard
- workspace_restart.go:660 — Resume endpoint guard

Adds TestNoBareBothNilCheck (source-level) so the antipattern can't
slip back in.

Out of scope but discovered during the audit (filed separately):
- team.go:207 — team-collapse Stop is Docker-only, leaks EC2 on SaaS
- workspace_crud.go:423 — workspace delete cleanup is Docker-only,
  leaks EC2 on SaaS
Both need a StopWorkspaceAuto mirror of provisionWorkspaceAuto. Same
class of bug as today's org-import incident, different verb (stop vs
provision).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:51:53 -07:00
Hongming Wang
d084d7e61a fix(provision): consolidate org-import gate + Auto self-marks-failed
Two changes that close the silent-drop bug class:

1. Add WorkspaceHandler.HasProvisioner() and use it as the org-import
   gate. Pre-fix, org_import.go:178 read `h.provisioner != nil` (Docker-
   only) — on SaaS tenants where cpProv is wired but Docker is nil, the
   entire 220-line provisioning prep block was skipped. The Auto call
   PR #2798 added at line 395 was unreachable on SaaS.

   Repro: 2026-05-05 01:14 — hongming prod tenant, 7-workspace org
   import, every workspace sat in 'provisioning' for 10 min until the
   sweeper marked it failed with the misleading "container started but
   never called /registry/register".

2. provisionWorkspaceAuto self-marks-failed on the no-backend path.
   Defense in depth: even if a future caller bypasses HasProvisioner
   gating or ignores the bool return (TeamHandler pre-#2367 did exactly
   this), the workspace ends in a clean failed state with an actionable
   error message instead of lingering until the 10-min sweep.

   Auto becomes the single source of truth for "start a workspace" —
   routing AND the no-backend failure path. Create's redundant
   if-not-Auto-then-mark-failed block collapses (kept only the
   workspace_config UPSERT, which is a Create-specific UI concern for
   rendering runtime/model on the Config tab).

Tests:
- TestProvisionWorkspaceAuto_NoBackendMarksFailed pins the new contract
- TestHasProvisioner_TrueOnCPOnly catches the SaaS-only blind spot
- TestHasProvisioner_TrueOnDockerOnly preserves self-hosted shape
- TestHasProvisioner_FalseWhenNeitherWired pins the gate-out path
- TestOrgImportGate_UsesHasProvisionerNotBareField source-pins the gate
  (verified: FAILS against the buggy `h.provisioner != nil` shape, PASSES
  with `h.workspace.HasProvisioner()`)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:47:02 -07:00
Hongming Wang
dfd0bc528c fix(external-templates): codex-channel-molecule via git+ URL (not on PyPI yet)
Mirrors the pattern hermes-channel-molecule uses (line 256). Drops
the broken `pip install codex-channel-molecule` which would 404.
PyPI publish workflow is a separate piece of work — until then,
git+https install is the path operators get.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:29:23 -07:00
Hongming Wang
4ea6f437e9 feat(external-templates): codex tab now includes the bridge-daemon inbound path
The codex tab in the External Connect modal had an "outbound-tools-only
first cut" caveat — operators got the MCP wiring for codex calling
platform tools, but there was no documented inbound path. Canvas
messages couldn't wake an idle codex session.

That gap is now filled by codex-channel-molecule
(github.com/Molecule-AI/codex-channel-molecule), shipped today as the
codex counterpart to hermes-channel-molecule. The daemon long-polls
the platform inbox, runs `codex exec --resume <session>` per inbound
message, captures the assistant reply, routes it back via
send_message_to_user / delegate_task, and acks the inbox row.
Per-thread session continuity persisted to disk so daemon restarts
don't lose conversation context.

This commit:
- Updates externalCodexTemplate to include `pip install
  codex-channel-molecule` (step 1) and a foreground `nohup
  codex-channel-molecule` invocation (step 3) using the same env-var
  contract as the MCP server (WORKSPACE_ID + PLATFORM_URL +
  MOLECULE_WORKSPACE_TOKEN).
- Adds a "Canvas messages don't wake codex" common-issues entry to the
  TAB_HELP codex section pointing at the bridge daemon log.
- Updates the doc comment to record the upstream deprecation path:
  when openai/codex#17543 lands, the bridge becomes redundant and the
  wired MCP server delivers push natively.

Verified: TestExternalTemplates_NoMoleculeOrgIDPlaceholder still
passes (no MOLECULE_ORG_ID re-introduction); full handlers suite
green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:28:35 -07:00
Hongming Wang
0f389ba325
Merge pull request #2804 from Molecule-AI/fix/external-templates-drop-molecule-org-id
fix(external-templates): drop MOLECULE_ORG_ID from codex/openclaw/hermes snippets
2026-05-05 00:38:45 +00:00
Hongming Wang
472862bc50 fix(external-templates): drop MOLECULE_ORG_ID from operator-facing snippets
Codex / openclaw / hermes-channel snippets each instructed operators
to set `MOLECULE_ORG_ID = "<your org id>"`. The molecule_runtime MCP
subprocess these snippets spawn never reads MOLECULE_ORG_ID — that
env var is consumed only by workspace-server's TenantGuard
middleware, server-side, on the tenant box itself (set by the control
plane via user-data on provision).

External operator → tenant calls pass TenantGuard via the
isSameOriginCanvas path (Origin matches Host), with auth via Bearer
token + X-Workspace-ID. The universal_mcp snippet — which calls into
the same molecule_runtime — has always (correctly) omitted
MOLECULE_ORG_ID; this brings codex / openclaw / hermes-channel into
line.

Symptom that caught it: an external codex CLI session, after pasting
the codex-tab snippet, surfaced "MOLECULE_ORG_ID is still set to
'<your org id>'" as an unresolved blocker — agent reasonably treated
the placeholder as required setup. Operator has no value to fill.

Pinned with a structural test
(TestExternalTemplates_NoMoleculeOrgIDPlaceholder) so the placeholder
can't drift back across all six external-tab templates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:30:07 -07:00
Hongming Wang
b5435b4732 fix(memory v2): warn at boot when cutover env half-configured
MEMORY_V2_CUTOVER=true gates the admin export/import path on the v2
plugin, but the cutoverActive() check in admin_memories.go silently
returns false when the plugin isn't wired:

  func (h *AdminMemoriesHandler) cutoverActive() bool {
      if os.Getenv(envMemoryV2Cutover) != "true" {
          return false
      }
      return h.plugin != nil && h.resolver != nil
  }

Two operator misconfigs hit the silent-fallback path:

  1. MEMORY_V2_CUTOVER=true set, MEMORY_PLUGIN_URL unset
     → wiring.Build returns nil → handler stays on legacy SQL path
     → operator sees no error, assumes cutover is live, but every
        request still writes the legacy table.

  2. MEMORY_V2_CUTOVER=true set, MEMORY_PLUGIN_URL set, but plugin
     unreachable at boot
     → wiring.Build still returns the bundle (intentional — circuit
        breaker handles ongoing unavailability), but every cutover
        write quietly falls back via the breaker.
     → only signal: legacy table keeps growing.

Both are exactly the "structurally invisible until prod" failure
mode; the only real-world detection today is "notice the legacy
table is still being written to," which no operator will check.

Add loud, distinctive WARN log lines at Build() time for both
shapes. Boot logs are operator-visible, so a half-config is
immediately obvious without needing dashboards.
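
A sketch of the warn shape (parameters stand in for the env read and
the boot-time plugin probe; exact message wording is illustrative):

```go
package main

import "fmt"

// warnHalfCutover covers the two half-config shapes at Build() time.
func warnHalfCutover(cutover bool, pluginURL string, probeOK bool) []string {
	if !cutover {
		return nil // no cutover requested: stay quiet
	}
	var warns []string
	if pluginURL == "" {
		warns = append(warns,
			"WARN MEMORY_V2_CUTOVER=true but MEMORY_PLUGIN_URL unset; staying on legacy SQL path")
	} else if !probeOK {
		warns = append(warns,
			"WARN MEMORY_V2_CUTOVER=true but plugin unreachable at boot; cutover writes will fall back via circuit breaker")
	}
	return warns
}

func main() {
	for _, w := range warnHalfCutover(true, "", true) {
		fmt.Println(w)
	}
}
```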

Tests:
  * 4 new (cutover+no-URL → warn, neither set → silent, cutover+probe-
    fail → loud warn, probe-fail-without-cutover → quiet generic)
  * 6 existing (still pass; pin no-warning-on-happy-path)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:24:11 -07:00
Hongming Wang
f1b72af97e
Merge pull request #2798 from Molecule-AI/fix/org-import-saas-routing-1777938328
fix(org-import): route through provisionWorkspaceAuto so SaaS gets EC2 — closes #2486
2026-05-04 23:54:37 +00:00
Hongming Wang
19e7acdc22 fix(org-import): route through provisionWorkspaceAuto so SaaS gets EC2
Org-import called h.workspace.provisionWorkspace directly — same silent-
drop bug that bit TeamHandler.Expand on 2026-05-04 (see workspace.go
:121-125 comment + #2486). Symptom on SaaS: every claude-code workspace
sat in "provisioning" until the 600s sweeper marked it failed with
"container started but never called /registry/register" — because no
container ever existed; the goroutine returned silently when the Docker
provisioner field was nil.

User reproduced 2026-05-04 ~22:30Z importing a 7-workspace template on
the hongming prod tenant. Tenant CP logs (queried live via SSM) showed
ZERO "Provisioner: goroutine entered" or "CPProvisioner: goroutine
entered" lines for any of the 7 failed workspace UUIDs in the 60min
window — confirming the goroutine never ran past line 384 of
org_import.go because provisionWorkspace returned early in SaaS mode.

The fix is one line: replace h.workspace.provisionWorkspace with
h.workspace.provisionWorkspaceAuto. Auto is the single source of
truth for backend selection (workspace.go:130) — picks CP-mode when
h.cpProv is wired, Docker-mode when h.provisioner is wired, returns
false when neither.

ALSO adds a generic source-level gate
(TestNoCallSiteCallsDirectProvisionerExceptAuto) so the next future
caller can't repeat the pattern. Walks every non-test .go file in
handlers/ and fails if any direct call to provisionWorkspace( or
provisionWorkspaceCP( appears outside the dispatcher's own definition
file.
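
The per-file rule can be sketched as (patterns and the workspace.go
allowlist follow the commit text; the WalkDir wrapper is illustrative —
the real gate runs inside a Go test and t.Errors each offender):

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// isOffender: non-test handler files may not call provisionWorkspace(
// or provisionWorkspaceCP( directly. Note "provisionWorkspaceAuto("
// does NOT match "provisionWorkspace(" — the dispatcher stays allowed.
func isOffender(filename, src string) bool {
	if strings.HasSuffix(filename, "_test.go") ||
		filepath.Base(filename) == "workspace.go" { // dispatcher's own file
		return false
	}
	return strings.Contains(src, "provisionWorkspace(") ||
		strings.Contains(src, "provisionWorkspaceCP(")
}

// findDirectCalls walks dir applying the rule to every .go file.
func findDirectCalls(dir string) ([]string, error) {
	var offenders []string
	err := filepath.WalkDir(dir, func(path string, d fs.DirEntry, werr error) error {
		if werr != nil || d.IsDir() || !strings.HasSuffix(path, ".go") {
			return werr
		}
		src, rerr := os.ReadFile(path)
		if rerr != nil {
			return rerr
		}
		if isOffender(path, string(src)) {
			offenders = append(offenders, path)
		}
		return nil
	})
	return offenders, err
}

func main() {
	offenders, _ := findDirectCalls(".")
	fmt.Println(offenders)
}
```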

The gate currently allows workspace_restart.go, which has its own
manual if-h.cpProv-else dispatch (functionally equivalent to Auto, so
not the bug class — but it is architectural duplication; a follow-up
is filed for proper de-dup).

Test plan:
- TestOrgImport_UsesAutoNotDirectDockerPath: pin the org_import.go
  call site
- TestNoCallSiteCallsDirectProvisionerExceptAuto: generic gate against
  future drift
- TestTeamExpand_UsesAutoNotDirectDockerPath (existing): symmetric for
  team.go

All 3 + the rest of the handler suite pass.

Closes #2486
Pairs with: PR #2794 (configurable provision concurrency) which made
            it possible to bisect concurrency-vs-routing as the cause

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 16:49:07 -07:00
Hongming Wang
872b781f64
Merge pull request #2792 from Molecule-AI/feat/drop-shared-context
feat: drop shared_context — use memory v2 team namespace
2026-05-04 23:37:49 +00:00
Hongming Wang
3bc7749e84 feat(org-import): make provision concurrency configurable via env
Org-import was hard-capped at 3 concurrent workspace provisions (#1084),
calibrated for Docker-mode workspaces where each provision was a
docker-run. Now that workspaces are EC2 instances, AWS RunInstances
parallelises happily and the artificial cap of 3 makes a 7-workspace
org-import take 3-4× longer than necessary (3 batches × ~70s/provision
≈ 4 min wall time when AWS could absorb all 7 in parallel for ~70s).

This PR makes the cap configurable via MOLECULE_PROVISION_CONCURRENCY:
  unset    → 3 (Docker-mode default, unchanged)
  "0"      → effectively unlimited (SaaS / EC2 backend; AWS rate-limit
             + vCPU quota are the real backpressure)
  N>0      → exactly N
  N<0      → fall back to default 3 + warning log
  garbage  → fall back to default 3 + warning log

The "0 = unlimited" mapping is the user-facing convention requested for
SaaS deployments — operators don't have to pick an arbitrary large
number. The implementation substitutes 1<<20 internally so the channel-
based semaphore stays an effective no-op without infinite-buffer risk.
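
The resolution mapping, sketched (constant and function names are
illustrative; the env var name matches the commit):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

const defaultProvisionConcurrency = 3

// resolveConcurrency: unset → 3, "0" → effectively unlimited (1<<20
// keeps the channel semaphore a no-op without an infinite buffer),
// N>0 → N, negative or garbage → 3 (the real code also warns).
func resolveConcurrency(raw string) int {
	raw = strings.TrimSpace(raw) // " 7 " → "7"
	if raw == "" {
		return defaultProvisionConcurrency
	}
	n, err := strconv.Atoi(raw)
	switch {
	case err != nil || n < 0:
		return defaultProvisionConcurrency
	case n == 0:
		return 1 << 20 // "unlimited": AWS quotas are the real backpressure
	default:
		return n
	}
}

func main() {
	fmt.Println(resolveConcurrency(os.Getenv("MOLECULE_PROVISION_CONCURRENCY")))
}
```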

Test coverage (org_provision_concurrency_test.go, 6 cases / 15 subtests):
- unset → default
- "0" → large unlimited cap
- positive integer exact (1, 5, 10, 50)
- negative → default + warning
- non-numeric → default + warning
- whitespace-trimmed (" 7 " → 7)

Boot-time log line confirms the resolved cap so an operator can verify
their env is being honored without re-deploying.

Does NOT address the separate 600s "never registered" timeout the user
also reported during org-import — that's filed as molecule-core#2793
for proper investigation (parallel-provision contention, network
routing, register-retry budget, or container-start failure are all
candidates and need live SSM capture to bisect).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 16:33:49 -07:00