Commit Graph

630 Commits

Author SHA1 Message Date
Hongming Wang
60afcd43c9 test(handlers): generic Class 1 leak AST gate (#2867 PR-A)
Adds class1_ast_gate_test.go — a per-package AST walk that fails the
build if any handler function INSERTs INTO workspaces inside a range
loop body without one of three escape hatches:

  1. A call to a registered preflight helper (lookupExistingChild today;
     extend preflightCallNames as new helpers are introduced).
  2. An ON CONFLICT clause in the same SQL literal (idempotent UPSERT,
     like registry.go).
  3. An explicit `// class1-gate: idempotent-by-design` comment in the
     function body (deliberately awkward — forces a code-review beat).

Why this is broader than the existing
TestCreateWorkspaceTree_CallsLookupBeforeInsert gate in
org_import_idempotency_test.go: that one is hard-coded to one function
in one file. This one walks every non-test .go file in the handlers
package and applies a structural rule independent of file/function
names. A future handler written from scratch in a new file would not
have been covered before — now it is.

Detection mechanism (per AST):
  - Collect spans (Lbrace..Rbrace) of every RangeStmt body in each
    function. Position-based instead of stack-based — ast.Inspect's
    nil-callback ordering doesn't give per-node pop semantics, so a
    naive push/pop stack silently miscounts. Position spans are
    deterministic.
  - Walk every BasicLit, regex-match `^\s*INSERT INTO workspaces\(`
    (tightened from bytes.Index "INSERT INTO workspaces" so
    workspaces_audit literals don't false-positive — same regex used
    by the existing createWorkspaceTree gate).
  - For each match: record insertLine, hasONCONFLICT, and the
    innermost enclosing RangeStmt line (or 0 if not inside any range).
  - Fail the function if INSERT is inside a range AND no preflight
    AND no ON CONFLICT AND no allowlist annotation.
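The span-plus-literal walk can be sketched as follows; insertsInsideRange and the helper names are illustrative, and the real gate's three escape hatches (preflight call, ON CONFLICT, allowlist comment) are elided:

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"regexp"
)

// insertRE mirrors the commit's tightened pattern: "workspaces" must be
// followed immediately by an opening paren so workspaces_audit literals
// don't match.
var insertRE = regexp.MustCompile(`^\s*INSERT INTO workspaces\(`)

type span struct{ lo, hi token.Pos }

// rangeBodySpans records the Lbrace..Rbrace span of every range-loop
// body, position-based as the commit describes.
func rangeBodySpans(f *ast.File) []span {
	var spans []span
	ast.Inspect(f, func(n ast.Node) bool {
		if r, ok := n.(*ast.RangeStmt); ok {
			spans = append(spans, span{r.Body.Lbrace, r.Body.Rbrace})
		}
		return true
	})
	return spans
}

// insertsInsideRange reports whether any INSERT INTO workspaces( string
// literal sits inside a range body.
func insertsInsideRange(src string) (bool, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "x.go", src, 0)
	if err != nil {
		return false, err
	}
	spans := rangeBodySpans(f)
	found := false
	ast.Inspect(f, func(n ast.Node) bool {
		lit, ok := n.(*ast.BasicLit)
		if !ok || lit.Kind != token.STRING {
			return true
		}
		val := lit.Value
		if len(val) >= 2 {
			val = val[1 : len(val)-1] // strip quote or backtick delimiters
		}
		if !insertRE.MatchString(val) {
			return true
		}
		for _, s := range spans {
			if lit.Pos() > s.lo && lit.Pos() < s.hi {
				found = true
			}
		}
		return true
	})
	return found, nil
}

func main() {
	buggy := "package h\nfunc f(cs []int) {\n\tfor range cs {\n\t\t_ = `INSERT INTO workspaces(id) VALUES ($1)`\n\t}\n}\n"
	hit, err := insertsInsideRange(buggy)
	fmt.Println(hit, err)
}
```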

Self-tests (per `feedback_assert_exact_not_substring.md` —
verify gate fails on the bug shape before merging):
  - TestClass1_GateFiresOnSyntheticBuggySource: synthetic source
    where INSERT is inside `for _, child := range children` body
    must trigger the gate's three guards (enclosingRangeLine!=0,
    hasONCONFLICT=false, no preflight call).
  - TestClass1_GateAllowsONCONFLICT: synthetic INSERT...ON CONFLICT
    must NOT trigger the gate (idempotent UPSERT case).
  - TestClass1_GateAllowsAllowlistAnnotation: function with
    `// class1-gate: idempotent-by-design` must be skipped.
  - TestClass1_NoUnpreflightedInsertInsideRange: production sweep
    over every handler .go file. Currently passes because
    org_import.go preflights, registry.go ON-CONFLICTs, and
    workspace.go's Create has no INSERT inside a range body.

Verification:
  - go test ./internal/handlers/... -run TestClass1_ -count=1
    → 4/4 PASS
  - go test ./internal/handlers/... -count=1 → suite green
    (no pre-existing test broken by the new file)

Refs molecule-core#2867 (PR-A Class 1 generic AST gate)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:01:34 -07:00
Hongming Wang
ff75aeb43e
Merge pull request #2922 from Molecule-AI/fix/memory-plugin-gate-sidecar-on-cutover
fix(memory-plugin): gate sidecar spawn on cutover-active
2026-05-05 19:44:01 +00:00
Hongming Wang
412dec0d87 fix(memory-plugin): gate sidecar spawn on cutover-active
PR #2906 spawned the sidecar unconditionally on every tenant boot. The
plugin's first migration runs `CREATE EXTENSION vector` which fails
on tenant Postgres without pgvector preinstalled — every staging
tenant redeploy aborted at the 30s health gate. CP fail-fast kept
running tenants on the prior image (no outage), but the new image
was DOA.

Caught on staging redeploy 2026-05-05 19:23 with
`pq: extension "vector" is not available`.

Fix: only spawn the sidecar when the operator has flipped the cutover
flag — `MEMORY_V2_CUTOVER=true` OR `MEMORY_PLUGIN_URL` is set.

  * Aligns the entrypoint to the same opt-in posture wiring.go already
    uses (it skips building the client when MEMORY_PLUGIN_URL is empty).
  * Until cutover, the sidecar isn't even running — no migration, no
    health gate, no boot-time pgvector dependency.
  * Operators activating cutover already redeploy with the new env
    vars set; that's when the sidecar starts. By definition they've
    verified pgvector is available before flipping.
  * MEMORY_PLUGIN_DISABLE=1 escape hatch preserved; harness fix #2915
    becomes belt-and-suspenders (still respected).

Both Dockerfile and entrypoint-tenant.sh updated. Behavior change for
existing deployments: zero (cutover env vars still unset → sidecar
still inert, but now also not running).
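The entrypoint edit itself is shell, but the gating predicate it encodes can be sketched in Go. sidecarShouldSpawn is a hypothetical name, not a function in the repo:

```go
package main

import (
	"fmt"
	"os"
)

// sidecarShouldSpawn distills the entrypoint logic: spawn only when the
// operator has signalled cutover intent, and never when the explicit
// disable hatch is set. getenv is injected so the rule is testable.
func sidecarShouldSpawn(getenv func(string) string) bool {
	if getenv("MEMORY_PLUGIN_DISABLE") == "1" {
		return false // escape hatch preserved, belt-and-suspenders with #2915
	}
	return getenv("MEMORY_V2_CUTOVER") == "true" || getenv("MEMORY_PLUGIN_URL") != ""
}

func main() {
	fmt.Println(sidecarShouldSpawn(os.Getenv))
}
```

With both cutover env vars unset the predicate is false, matching the "zero behavior change for existing deployments" claim.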

Refs RFC #2728. Hotfix for #2906; supersedes the migration-path
fragility class (the sidecar isn't doing migrations on tenants that
won't use it).
2026-05-05 12:39:03 -07:00
Hongming Wang
83454e5efd feat(workspace-server): structured logging at provisioning boundaries
Adds internal/provlog with a single Event(name, fields) helper that
emits JSON-tagged single-line records to the standard logger. Five
boundary sites instrumented for #2867:

  provision.start         — workspace_dispatchers.go (sync + async)
  provision.skip_existing — org_import.go idempotency hit
  provision.ec2_started   — cp_provisioner.go after RunInstances
  provision.ec2_stopped   — cp_provisioner.go after TerminateInstances ack
  restart.pre_stop        — workspace_restart.go before Stop dispatch

These pair with the existing human-prose log.Printf lines (kept). The
new records are grep+jq friendly so a future log-aggregation pipeline
can reconstruct per-workspace provision timelines without parsing the
operator messages — this is the "and debug loggers so it dont happen
again" half of the leak-prevention work.
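A minimal sketch of the Event helper's shape, assuming "evt" as the name key (the real provlog field names may differ):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// format builds the single-line JSON record; the "evt" key and the
// fallback shape are assumptions, not provlog's pinned contract.
func format(name string, fields map[string]any) string {
	payload := map[string]any{"evt": name}
	for k, v := range fields { // nil fields tolerated: range over nil is a no-op
		payload[k] = v
	}
	b, err := json.Marshal(payload)
	if err != nil {
		// Marshal-error fallback still emits a valid single-line record,
		// preserving the event boundary.
		return fmt.Sprintf(`{"evt":%q,"marshal_error":%q}`, name, err.Error())
	}
	return string(b)
}

// Event emits the record through the standard logger, one line per event.
func Event(name string, fields map[string]any) {
	log.Println(format(name, fields))
}

func main() {
	Event("provision.start", map[string]any{"workspace_id": "ws_123", "sync": true})
	fmt.Println(format("provision.skip_existing", nil))
}
```

Because encoding/json sorts map keys, the record shape is deterministic and grep+jq friendly.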

Tests:
  - provlog: emits evt-prefixed JSON, nil-tolerant, marshal-error
    fallback preserves event boundary, single-line output pinned.
  - handlers: provlog_emit_test.go pins three call-site contracts:
    provisionWorkspaceAutoSync emits provision.start with sync=true,
    stopForRestart emits restart.pre_stop with backend=cp on SaaS,
    and backend=none when both backends are nil.

Field taxonomy is convenience for ops, not contract — payload can grow
additively without breaking callers. Behavior gate is the event name +
boundary location, per feedback_behavior_based_ast_gates.md.

Refs #2867 (PR-D structured logging at provisioning boundaries)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 12:30:11 -07:00
Hongming Wang
8254bedf30
Merge pull request #2917 from Molecule-AI/chore/delete-team-collapse-2864
chore: delete TeamHandler.Collapse + docs cleanup (closes #2864)
2026-05-05 19:04:30 +00:00
Hongming Wang
eec4ea2e7d chore: delete TeamHandler.Collapse + docs cleanup (closes #2864)
Multi-model retrospective review of #2856 (Phase 1 Expand removal)
flagged that TeamHandler.Collapse is unreachable from the canvas UI:
the "Collapse Team" button calls PATCH /workspaces/:id { collapsed }
(visual flag toggle on canvas_layouts), NOT POST /workspaces/:id/collapse.
The destructive POST route — which stops EC2s, marks children removed,
and deletes layouts — has zero UI callers (verified via grep across
canvas/, scripts/, and the MCP tool registry; only docs referenced it).

Two semantically different operations had been sharing the word
"Collapse":

- Visual collapse (canvas) → PATCH { collapsed: true }. Hides
  children visually. Reversible. UI-only.
- Destructive collapse (POST /collapse) → Stops + marks removed.
  Irreversible. No caller.

Deleting the destructive one + its supporting machinery:

- workspace-server/internal/handlers/team.go (entirely)
- workspace-server/internal/handlers/team_test.go (entirely)
- POST /collapse route + teamh init in router.go
- findTemplateDirByName helper (zero non-test callers after Expand
  was deleted in #2856; package-private so no out-of-package consumers)
- NewTeamHandler constructor (no callers after route removed)

Plus stale doc references (the most dangerous was the MCP wrapper
mapping in mcp-server-setup.md — anyone generating MCP tool wrappers
from that table was wiring a 404):

- docs/agent-runtime/team-expansion.md (deleted entirely — whole
  guide taught the deleted flow)
- docs/api-reference.md (dropped two team.go rows)
- docs/api-protocol/platform-api.md (dropped /expand + /collapse
  rows)
- docs/architecture/molecule-technical-doc.md (dropped /expand +
  /collapse rows)
- docs/guides/mcp-server-setup.md (dropped expand_team +
  collapse_team MCP wrapper mappings)
- docs/glossary.md (dropped "(org template expand_team)"
  parenthetical)
- docs/frontend/canvas.md (dropped broken link to deleted
  team-expansion.md)

Kept: docs/architecture/backends.md mention of "TeamHandler.Expand
(#2367) bypassed routing on Start" — correct historical context for
the AST gate's existence, no live route reference.

Visual-collapse path unaffected:
  canvas/src/components/ContextMenu.tsx:227 → api.patch — unchanged
  canvas/src/components/WorkspaceNode.tsx:128 → api.patch — unchanged

go vet ./... clean. go test ./internal/handlers/ -count 1 — all green
(4.3s, no regression).

Net: -388/+10 = ~378 lines removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:59:43 -07:00
Hongming Wang
6201d12533 fix(memory-plugin): embed migrations into binary via go:embed
PR #2906 shipped the binary at /memory-plugin without the migrations
directory. The plugin's runMigrations() resolved a relative path
`cmd/memory-plugin-postgres/migrations` that exists in the build
context but NOT in the runtime image. Every staging tenant boot
failed with:

  memory-plugin-postgres: migrate: read migrations dir
    "cmd/memory-plugin-postgres/migrations": open
    cmd/memory-plugin-postgres/migrations: no such file or directory

  memory-plugin:  /v1/health never returned 200 after 30s
    — aborting boot

Caught on the staging redeploy fleet job after #2906 merged. Tenants
stayed on the old image (CP redeploy correctly fail-fasted) but the
new image was broken.

Fix: `//go:embed migrations/*.up.sql` bundles the migrations into
the binary at build time. No filesystem path dependency at runtime.

  * `embed.FS` compiles the .up.sql files into the binary.
  * runMigrations() reads from migrationsFS by default;
    MEMORY_PLUGIN_MIGRATIONS_DIR override path preserved for operators
    shipping custom migrations.
  * Names sorted alphabetically — pinned by a test so a future
    `002_*.up.sql` is guaranteed to run after `001_*.up.sql`.

Tests:
  * TestMigrationsEmbedded_ContainsCreateTable — pins that the embed
    pattern matched files AND those files contain CREATE TABLE
    (catches both empty-pattern and wrong-files-embedded).
  * TestRunMigrationsFromEmbed_OrderingIsAlphabetic — pins sorted
    application order.

Verified locally: `go build` succeeds, binary 9.3MB,
`strings` shows the embedded SQL.

Refs RFC #2728. Hotfix for #2906.
2026-05-05 11:57:37 -07:00
Hongming Wang
fc1c45789e
Merge pull request #2912 from Molecule-AI/feat/saas-default-hardening-2910
feat(saas): close 4th default-tier site + lift org_import asymmetry + tests (#2910)
2026-05-05 18:42:19 +00:00
Hongming Wang
9f551319d2 feat(saas): close 4th default-tier site + lift org_import asymmetry + tests (#2910)
Multi-model retrospective review of #2901 found three Critical gaps:

1. (#2910 PR-B) template_import.go:79 wrote `tier: 3` hardcoded into
   generated config.yaml. On SaaS this defeated the T4 default at the
   create-handler layer — a config-less template import landed at T3
   regardless of POST /workspaces' computed default. The 4th
   default-tier site #2901 missed.

2. (#2910 PR-A) #2901 claimed `go test ... all green` but added zero
   new tests. Existing structural-pin tests caught dispatch-layer
   drift but said nothing about tier-default drift. A future refactor
   that flips DefaultTier() to always return 3 would ship green.

3. (#2910 PR-E) org_import.go fallback returned T2 on self-hosted
   while workspace.go returned T3. Internally consistent ("bulk vs
   interactive defaults") but undocumented same-name-different-value
   drift.

Fix:

- TemplatesHandler.NewTemplatesHandler now takes `wh *WorkspaceHandler`
  (nil-tolerant for read-only callers). Import + ReplaceFiles compute
  tier via h.wh.DefaultTier() and pass it to generateDefaultConfig.
  generateDefaultConfig gets a `tier int` parameter (bounds-checked,
  invalid input falls back to T3).

- org_import.go fallback lifts to h.workspace.DefaultTier() — single
  source of truth shared with Create + Templates so a future
  tier-default change sweeps every entry point at once.

- New saas_default_tier_test.go pinning:
    TestIsSaaS_TrueWhenCPProvWired
    TestIsSaaS_FalseWhenOnlyDocker
    TestDefaultTier_SaaS_IsT4
    TestDefaultTier_SelfHosted_IsT3
    TestGenerateDefaultConfig_RespectsTierParam
    TestGenerateDefaultConfig_SelfHostedTierT3
    TestGenerateDefaultConfig_OutOfRangeFallsBackToT3

- Existing template_import_test.go tests + chat_files_test.go +
  security_regression_test.go updated to thread the new tier param /
  wh constructor arg through their NewTemplatesHandler calls. Their
  pre-#2910 assertion of `tier: 3` is preserved (now passes because
  the test caller passes `3` explicitly), so no regression.

go vet ./... clean. go test ./internal/handlers/ -count 1 — all
green (4.2s).

Deferred to separate follow-ups (per #2910 plan):
- PR-C: MOLECULE_DEPLOYMENT_MODE explicit deployment-mode signal
  (closes the IsSaaS()=cpProv!=nil structural fragility)
- PR-D: Host iptables IMDS block + IMDSv2 hop-limit (paired with
  molecule-controlplane EC2-IAM-scope audit)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:38:22 -07:00
Hongming Wang
1052f8bdb0 fix(memory-plugin): bind to 127.0.0.1 by default
Self-review of PR #2906 flagged: defaultListenAddr was ":9100" — binds
on every container interface. Inside today's deployment that's moot
(no host port mapping, platform talks over loopback) but it's not
least-privilege. A future Dockerfile edit that publishes the port,
a misconfigured Fly machine, or a future cross-host plugin topology
would expose an unauth'd memory store.

Loopback is the right baseline. Operators with a multi-host topology
already override via MEMORY_PLUGIN_LISTEN_ADDR — that path is unchanged.

Tests:
  * TestLoadConfig_DefaultListenAddrIsLoopback pins the new default.
  * TestLoadConfig_ListenAddrEnvOverride pins the override path so
    operators relying on it don't break.
  * TestLoadConfig_MissingDatabaseURL covers the existing fail-fast.

No prior unit tests existed for loadConfig — boot_e2e_test.go always
sets MEMORY_PLUGIN_LISTEN_ADDR explicitly, so the default was never
exercised by tests. This PR adds that coverage.

Refs RFC #2728. Hardening follow-up to PR #2906.
2026-05-05 11:35:24 -07:00
Hongming Wang
5334d60de4
Merge pull request #2898 from Molecule-AI/2867-workspaces-insert-allowlist
test(handlers): allowlist INSERT INTO workspaces sites (#2867 class 1)
2026-05-05 18:18:19 +00:00
Hongming Wang
d6c0227e3f
Merge pull request #2906 from Molecule-AI/feat/memory-plugin-sidecar-bundle
feat(memory-v2): bundle memory-plugin-postgres as in-image sidecar
2026-05-05 18:16:57 +00:00
Hongming Wang
27db090d3d
Merge pull request #2907 from Molecule-AI/feat/poll-mode-chat-upload-phase5a
feat(poll-upload): phase 5a — atomic batch insert + acked-index + mime hardening
2026-05-05 11:16:56 -07:00
Hongming Wang
0f25f6de97 test(handlers): allowlist INSERT INTO workspaces sites — close bulk-create regression class (#2867 class 1)
Adds TestINSERTworkspacesAllowlist: walks every non-test .go in this
package, finds funcs containing an `INSERT INTO workspaces (` SQL
literal, and pins the result against an explicit allowlist with the
safety mechanism named per entry.

New entries fail the build until a reviewer adds them — forcing the
question "what makes this INSERT idempotent?" at PR-review time, not
after the next bulk-create leak (the shape that produced 72 stale
child workspaces in tenant-hongming over 4 days).
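The allowlist check reduces to a set difference; the entries below are copied from the commit, while diff and the found-site strings are illustrative:

```go
package main

import "fmt"

// allowlist maps file:function to the named safety mechanism, as the
// commit describes.
var allowlist = map[string]string{
	"org_import.go:createWorkspaceTree": "lookup-then-insert via lookupExistingChild",
	"registry.go:Register":              "ON CONFLICT (id) DO UPDATE",
	"workspace.go:Create":               "single-workspace POST, server-generated UUID, no iteration",
}

// diff returns INSERT sites found by the (elided) AST walk that have no
// allowlist entry; a non-empty result fails the build with a diagnostic
// naming the file and function.
func diff(found []string) []string {
	var unlisted []string
	for _, site := range found {
		if _, ok := allowlist[site]; !ok {
			unlisted = append(unlisted, site)
		}
	}
	return unlisted
}

func main() {
	fmt.Println(diff([]string{"registry.go:Register", "new_handler.go:bulkCreate"}))
}
```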

Pairs with TestCreateWorkspaceTree_CallsLookupBeforeInsert (the
behavior pin for the one bulk path today). Together:
- this test catches "did a new function start inserting?"
- that test catches "did the existing bulk path drop its idempotency check?"

Both fire immediately when drift happens.

Current allowlist (3 entries):
- org_import.go:createWorkspaceTree → lookup-then-insert via
  lookupExistingChild (#2868 phase 3, also pinned by the sibling AST
  gate from #2895)
- registry.go:Register → ON CONFLICT (id) DO UPDATE (idempotent by
  primary key — external workspace upsert)
- workspace.go:Create → single-workspace POST /workspaces, server-
  generated UUID, no iteration

Verified via mutation: dropping a synthetic tempBulkLeakTest with an
unsafe loop+INSERT into the package fails the gate with a clear
diagnostic pointing at the file + function. Restoring the tree
returns the gate to green.

Memory: feedback_assert_exact_not_substring.md (verify tightened test
FAILS on bug shape) — mutation proof done locally.

RFC #2867 class 1. Class 2 (Prometheus gauge for ec2_instance
duplicates) + class 3 (structured logging on workspace create) are
follow-up PRs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:15:16 -07:00
Hongming Wang
9991057ad1 feat(poll-upload): phase 5a — atomic batch insert + acked-index + mime hardening
Resolves four of six findings from the retrospective code review of Phases
1–4 (poll-mode chat upload). Bundled because every change is in the
platform's pending_uploads layer or the multi-file handler that reads it.

Findings resolved:

1. Important — Sweep query lacked an index for the acked-retention OR-arm.
   The Phase 1 partial indexes are both `WHERE acked_at IS NULL`, so the
   `(acked_at IS NOT NULL AND acked_at < retention)` half of the WHERE
   clause seq-scanned the table on every cycle. Add a complementary
   partial index on `acked_at WHERE acked_at IS NOT NULL` so both arms
   of the disjunction are index-covered. Disjoint from the existing two
   indexes (no row matches both predicates), so write amplification is
   bounded to ~one index entry per terminal-state row.

2. Important — uploadPollMode partial-failure left orphans. The previous
   per-file Put loop committed rows 1..K-1 and then errored on row K with
   no compensation, so a client retry would double-insert the survivors.
   Refactor the handler into three explicit phases (pre-validate +
   read-into-memory, single atomic PutBatch, per-file activity row) and
   add Storage.PutBatch with all-or-nothing transaction semantics.

3. FYI — pendinguploads.StartSweeperWithInterval was exported only for
   tests. Move it to lower-case startSweeperWithInterval and expose the
   test seam through pendinguploads/export_test.go (Go convention;
   _test.go files are never compiled into the production binary).

4. Nit — multipart Content-Type was passed verbatim into pending_uploads
   rows and re-served on /content. Add safeMimetype which strips
   parameters, rejects CR/LF/control bytes, and coerces malformed shapes
   to application/octet-stream. The eventual GET /content response can no
   longer be header-split via a crafted Content-Type on the multipart.

Comprehensive tests:

- 10 PutBatch unit tests (sqlmock): happy path, empty input, all four
  pre-validation rejection paths, BeginTx error, per-row error +
  Rollback (no Commit), first-row error, Commit error.
- 4 new PutBatch integration tests (real Postgres): all-rows-commit
  happy path with COUNT(*) verification, atomic-rollback no-leak via
  a NUL-byte filename that lib/pq rejects mid-batch, oversize
  short-circuit no-Tx, idx_pending_uploads_acked existence + partial
  predicate via pg_indexes (planner-shape-independent).
- 3 new chat_files_poll tests: atomic rollback on second-file oversize,
  atomic rollback on PutBatch error, mimetype CRLF/NUL/parameter
  sanitization (8 sub-cases).
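A sketch of the finding-4 sanitizer, assuming these validation rules (the real safeMimetype may differ in detail):

```go
package main

import (
	"fmt"
	"strings"
)

// safeMimetype strips parameters, rejects control bytes (the CR/LF
// header-splitting vectors), and coerces anything malformed to
// application/octet-stream.
func safeMimetype(ct string) string {
	const fallback = "application/octet-stream"
	// Strip parameters: keep only the type/subtype before any ';'.
	if i := strings.IndexByte(ct, ';'); i >= 0 {
		ct = ct[:i]
	}
	ct = strings.TrimSpace(strings.ToLower(ct))
	if ct == "" {
		return fallback
	}
	for i := 0; i < len(ct); i++ {
		if c := ct[i]; c < 0x20 || c == 0x7f { // CR, LF, NUL, other control bytes
			return fallback
		}
	}
	// Require exactly one type/subtype separator.
	if strings.Count(ct, "/") != 1 {
		return fallback
	}
	return ct
}

func main() {
	fmt.Println(safeMimetype("text/HTML; charset=utf-8"))
}
```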

The two remaining review findings (inbox_uploads.fetch_and_stage blocks
the poll loop synchronously; two httpx Clients per row) are Python-side
and ship in Phase 5b once this lands on staging.

Test-only export pattern via export_test.go, atomic pre-validation
discipline (validate before Tx), and behavior-based (not name-based)
test assertions follow the standing project conventions.
2026-05-05 11:10:13 -07:00
Hongming Wang
b89a49ec93 feat(memory-v2): bundle memory-plugin-postgres as in-image sidecar
Closes the gap between the merged Memory v2 code (PR #2757 wired the
client into main.go) and operator activation. Without this PR an
operator wanting to flip MEMORY_V2_CUTOVER=true had to provision a
separate memory-plugin service and point MEMORY_PLUGIN_URL at it —
extra ops surface for what the design intends to be a built-in.

What ships:
  * Both Dockerfile + Dockerfile.tenant build the
    cmd/memory-plugin-postgres binary into /memory-plugin.
  * Entrypoints spawn the plugin in the background on :9100 BEFORE
    starting the main server; wait up to 30s for /v1/health to return
    200; abort boot loudly if it doesn't (better to crash-loop than to
    silently route cutover traffic against a dead plugin).
  * Default env: MEMORY_PLUGIN_DATABASE_URL=$DATABASE_URL (share the
    existing tenant Postgres — plugin's `memory_namespaces` /
    `memory_records` tables coexist with platform schema, no
    conflicts), MEMORY_PLUGIN_LISTEN_ADDR=:9100.
  * MEMORY_PLUGIN_DISABLE=1 escape hatch for operators running the
    plugin externally on a separate host.
  * Platform image: plugin runs as the `platform` user (not root) via
    su-exec — matches the privilege boundary the main server already
    drops to. Tenant image already starts as `canvas` so the plugin
    inherits non-root automatically.

What stays operator-controlled:
  * MEMORY_V2_CUTOVER is NOT auto-set. Behavior change for existing
    deployments: zero. The wiring at workspace-server/internal/memory/
    wiring/wiring.go skips building the plugin client until the
    operator opts in, so the running sidecar is a no-op for traffic
    until then.
  * MEMORY_PLUGIN_URL is NOT auto-set either, for the same reason —
    setting it implies cutover-active intent. Operators set both on
    staging first, verify a live commit/recall round-trip (closes
    pending task #292), then promote to production.

Operator activation steps after this PR ships:
  1. Verify pgvector extension is available on the target Postgres
     (the plugin's first migration runs CREATE EXTENSION IF NOT
     EXISTS vector). Railway's managed Postgres ships pgvector
     available; some self-hosted operators may need to enable it.
  2. Redeploy the workspace-server with this image.
  3. Set MEMORY_PLUGIN_URL=http://localhost:9100 + MEMORY_V2_CUTOVER=true
     in the environment (staging first).
  4. Watch boot logs for "memory-plugin:  sidecar healthy" and the
     wiring.go cutover messages; do a live commit_memory + recall_memory
     round-trip via the canvas Memory tab to verify.
  5. Promote to production once staging holds for a sweep window.

Refs RFC #2728. Closes the dormant-plugin gap noted in task #294.
2026-05-05 11:10:11 -07:00
Hongming Wang
c79ba05ed5 test(pendinguploads): close cycleDone-vs-metric-record race in sweeper tests
TestStartSweeper_RecordsMetricsOnError flaked on every CI rerun under
race detection: `error counter delta = 0, want 1`. Root cause is a
race between two goroutines, not a bug in the production sweeper.

The fake `fakeSweepStorage.Sweep` signals `cycleDone` from inside its
deferred return — that happens BEFORE Sweep's return value is
received by `sweepOnce`, which is what triggers the metric increment.
On slow CI hosts the test goroutine wins the read after `waitForCycle`
unblocks and BEFORE StartSweeper's goroutine has called
`metrics.PendingUploadsSweepError`, so the asserted delta is 0 even
though the metric WILL be 1 a few ms later.

Adds a polling assert helper, `waitForMetricDelta`, that closes the
race deterministically without timing-based sleeps:

- TestStartSweeper_RecordsMetricsOnError uses waitForMetricDelta to
  wait for the error counter to settle at 1.
- TestStartSweeper_RecordsMetricsOnSuccess uses it on the success
  counters (acked, expired) so the error-stayed-zero assertion
  reads after StartSweeper has fully processed the cycle.
- waitForCycle keeps its current shape but documents the caveat in
  its comment so future tests don't repeat the assumption.

Verified: `go test ./internal/pendinguploads/ -race -count 5` passes
all 9 tests across 5 iterations cleanly.

Per memory feedback_question_test_when_unexpected.md: the
"delta=0, want=1" failure looked like a real production bug at first
glance, but instrumented inspection showed the metric DOES increment,
just AFTER the test's read. The fix is the test's wait shape, not
the sweeper.

Unblocks every PR currently broken by this flake (#2898 hit it on
two consecutive CI runs; staging-merged PRs from earlier today
(#2877/#2881/#2885/#2886) introduced the test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 10:46:17 -07:00
Hongming Wang
7644e82f2f feat(saas): default new workspaces to T4 on SaaS, T3 self-hosted
User reported every SaaS workspace defaults to T2 (Standard). Three
sites quietly disagreed on the default:

- canvas CreateWorkspaceDialog (line 126): isSaaS ? 4 : 3   ← only correct one
- canvas EmptyState "Create blank":      tier: 2            ← hardcoded
- workspace.go POST /workspaces:         tier = 3           ← not SaaS-aware
- org_import.go createWorkspaceTree:     tier = 2 (fallback)← not SaaS-aware

So a user clicking "+ New Workspace" via the dialog got T4 on SaaS,
but a user clicking "Create blank" on the empty canvas got T2, and an
agent POSTing /workspaces directly got T3. Same tenant, three different
tiers depending on entry point.

Fix:

1. WorkspaceHandler.IsSaaS() and DefaultTier() helpers (workspace_dispatchers.go).
   IsSaaS() := h.cpProv != nil — single source of truth for "are we
   SaaS" across the file. DefaultTier() returns 4 on SaaS, 3 on
   self-hosted. SaaS rationale: each workspace runs on its own sibling
   EC2 so the per-workspace tier boundary is a Docker resource limit
   on the only container present — no neighbour to protect from. T4
   matches the boundary.

2. workspace.go now defaults tier via h.DefaultTier() instead of
   hardcoded T3.

3. org_import.go fallback (when neither ws.tier nor defaults.tier set)
   becomes SaaS-aware: T4 on SaaS, T2 on self-hosted (preserve the
   existing safe-shared-Docker-daemon default for self-hosted org
   imports).

4. canvas EmptyState "Create blank" stops sending tier:2 in the body
   and lets the backend pick — single source of truth in the backend.
   Eliminates the third disagreement.
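The helper pair from step 1, with stand-in types for the handler and provisioner:

```go
package main

import "fmt"

// CPProvisioner and WorkspaceHandler are minimal stand-ins for the
// real types in workspace_dispatchers.go.
type CPProvisioner struct{}

type WorkspaceHandler struct {
	cpProv *CPProvisioner // non-nil only when the control-plane backend is wired
}

// IsSaaS is the single source of truth for "are we SaaS": the CP
// provisioner is only wired in the SaaS deployment.
func (h *WorkspaceHandler) IsSaaS() bool { return h.cpProv != nil }

// DefaultTier returns T4 on SaaS (dedicated sibling EC2 per workspace,
// no neighbour to protect) and T3 self-hosted.
func (h *WorkspaceHandler) DefaultTier() int {
	if h.IsSaaS() {
		return 4
	}
	return 3
}

func main() {
	saas := &WorkspaceHandler{cpProv: &CPProvisioner{}}
	selfHosted := &WorkspaceHandler{}
	fmt.Println(saas.DefaultTier(), selfHosted.DefaultTier())
}
```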

Test plan:
- go vet ./... clean
- go test ./internal/handlers/ -count 1 — all green (4.3s)
- npx tsc --noEmit on canvas — clean
- Staging E2E (after deploy): create a fresh workspace via canvas
  empty-state on hongming.moleculesai.app, confirm tier=4 on the
  workspace details panel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 10:30:22 -07:00
Hongming Wang
48d1945269 test(org-import): tighten AST gate to discriminate workspaces vs lookalikes (#2872 Imp-1)
The previous TestCreateWorkspaceTree_CallsLookupBeforeInsert used
bytes.Index("INSERT INTO workspaces"), which prefix-matches
INSERT INTO workspaces_audit, INSERT INTO workspace_secrets, and
INSERT INTO workspace_channels. RFC #2872 cited this as a silent
false-pass mode: a future refactor that adds an audit-table INSERT
literal earlier in source than the real workspaces INSERT would
make the gate point at the wrong target.

Replaces the byte-search with a go/ast walk + a regex that requires
`\s*\(` after `workspaces` — distinguishes the real target from
prefix lookalikes.
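The tightened discrimination can be reproduced with the regex alone:

```go
package main

import (
	"fmt"
	"regexp"
)

// workspacesInsertRE requires an opening paren (optionally after
// whitespace) right after the table name, so prefix lookalikes like
// workspaces_audit or workspace_secrets no longer match.
var workspacesInsertRE = regexp.MustCompile(`INSERT INTO workspaces\s*\(`)

func main() {
	for _, sql := range []string{
		"INSERT INTO workspaces (id, name) VALUES ($1, $2)",
		"INSERT INTO workspaces_audit (id) VALUES ($1)",
		"INSERT INTO workspace_secrets (id) VALUES ($1)",
	} {
		fmt.Println(workspacesInsertRE.MatchString(sql), sql)
	}
}
```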

Adds three discriminating tests:
- TestWorkspacesInsertRE_RejectsLookalikes — pins the regex against
  9 sql shapes (real, raw-string-literal, audit-shadow, workspace_*
  prefixes, canvas_layouts, UPDATE/SELECT, comments).
- TestGate_FailsWhenLookupAfterInsert — synthesizes Go source where
  the lookup is positioned AFTER the workspaces INSERT, asserts the
  helper returns lookupPos > insertPos (which the production gate
  flags via t.Errorf). Proves the gate isn't vestigial.
- TestGate_IgnoresAuditTableShadow — synthesizes source with an
  audit-table INSERT BEFORE the lookup + real INSERT, asserts the
  tightened regex correctly walks past the shadow and finds the
  real INSERT.

Also extracts findLookupAndWorkspacesInsertPos as a helper so the
gate logic can be exercised against synthetic source, not only
against the real org_import.go.

Memory: feedback_assert_exact_not_substring.md (verify tightened
test FAILS on old code) — TestGate_FailsWhenLookupAfterInsert is
the failing-on-bug-shape proof.

Closes the silent-false-pass mode of #2872 Important-1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 08:32:56 -07:00
Hongming Wang
e50799bc29 test(rfc): poll-mode chat upload — phase 4 real-Postgres integration
Phase 4 closes out the rollout — strict-sqlmock unit tests pin which
SQL fires, but they cannot detect bugs that depend on the actual row
state after the SQL runs. Real-Postgres integration tests catch:

  - the Sweep CTE depends on Postgres' make_interval function and
    the table's CHECK constraints; sqlmock would happily accept a
    hand-written SQL literal that Postgres rejects at runtime.
  - the partial idx_pending_uploads_unacked index only catches a
    wrong WHERE predicate at real-query-plan time.
  - subtle predicate drift (e.g. a WHERE clause that filters by
    acked_at IS NOT NULL but uses BETWEEN incorrectly).

Test cases:

  - PutGetAckRoundTrip: the full happy path — Put, Get, MarkFetched,
    Ack, idempotent re-Ack, Get-after-Ack returns ErrNotFound.
  - Sweep_DeletesAckedAfterRetention: row not eligible at retention=1h
    immediately after Ack; deleted at retention=0.
  - Sweep_DeletesExpiredUnacked: backdated expires_at exercises the
    unacked-and-expired branch of the WHERE clause.
  - Sweep_DeletesBothCategoriesInOneCycle: three rows (acked, expired,
    fresh); a single Sweep deletes the first two and leaves the third.
  - PutEnforcesSizeCap: ErrTooLarge above MaxFileBytes.
  - GetIgnoresExpiredAndAcked: Get's filter predicate skips expired and
    acked rows, matching the expected row state in the table.

Run path:
  - locally via the file-header docker incantation.
  - CI runs on every PR/push that touches handlers/** OR migrations/**
    (.github/workflows/handlers-postgres-integration.yml).
2026-05-05 05:04:41 -07:00
Hongming Wang
a327d207da feat(rfc): poll-mode chat upload — phase 3 GC sweep + observability
Phase 3 of the poll-mode chat upload rollout. Stacks atop Phase 2.

The platform's pending_uploads table grows once-per-uploaded-file with
no built-in cleanup. Phase 1's hard TTL (expires_at default 24h) makes
expired rows un-fetchable but doesn't actually delete them; Phase 1's
ack stamps acked_at but leaves the row indefinitely. Without a sweep
the table grows unbounded across normal traffic.

This PR adds:

- `Storage.Sweep(ctx, ackRetention)` — a single round-trip CTE that
  deletes acked rows past their retention window plus unacked rows
  past expires_at. Returns `(acked, expired)` deletion counts so
  Phase 3 dashboards can spot the stuck-fetch pattern (high expired,
  low acked) vs healthy churn.
- `pendinguploads.StartSweeper(ctx, storage, ackRetention)` —
  background goroutine that calls Sweep every 5 minutes (default).
  Runs once immediately on startup so a platform restart cleans up
  any rows that became eligible while we were down.
- Prometheus counters `molecule_pending_uploads_swept_total` with
  `outcome={acked,expired,error}` labels. Wired into the existing
  `/metrics` endpoint.
- Wired from cmd/server/main.go via supervised.RunWithRecover —
  one transient panic doesn't take the platform down with it.

Defaults:
  - SweepInterval = 5m (matches the dashboard refresh cadence)
  - DefaultAckRetention = 1h (gives the workspace at-least-once retry
    headroom in case it processed but failed to write the file before
    crashing)

Test coverage: 100% on storage.go (storage_test.go extended with a
sweepSQL pin + six Sweep test cases including the negative-retention
clamp, zero-retention immediate delete, and DB-error wrapping) and
sweeper_test.go (ticker-driven + ctx-cancel + nil-storage +
transient-error-doesn't-crash + metric counter assertions).

Closes the third of four phases tracked on the parent RFC; phase 4 is
the staging E2E test.
2026-05-05 05:00:13 -07:00
Hongming Wang
c778b62202 feat(metrics): add molecule_phantom_busy_resets_total counter (#2865)
Closes #2865 (split-B of the #2669 root-cause stack).

The phantom-busy sweep in workspace-server/internal/scheduler/scheduler.go
already logs each row reset, but no aggregate metric surfaces "how often
is this firing." A regression that causes high reset rates (e.g.
controlplane#481's missing env vars, or future drift in the workspace
runtime's task-lifecycle accounting) only surfaces when users complain.

Fix: counter exposed at /metrics as molecule_phantom_busy_resets_total,
incremented from sweepPhantomBusy after each row whose active_tasks
was reset. Same shape as existing molecule_websocket_connections_active.

Operator-side dashboard: alert when daily phantom-busy reset count
> 0.5% of active workspaces. Today's steady-state is near-zero; any
increase is a regression signal.

Tests:
  - TestTrackPhantomBusyReset_IncrementsCounter
  - TestTrackPhantomBusyReset_RaceFreeUnderConcurrentWrites (50×200
    concurrent writes; tests atomic invariant)
  - TestHandler_ExposesPhantomBusyResetsCounter (asserts HELP + TYPE
    + value lines in Prometheus text format)
  - TestHandler_PhantomBusyResetsZeroByDefault (fresh-process 0
    contract — prevents a future refactor from accidentally dropping
    the metric from /metrics)

Race-detector clean. Vet clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 04:45:24 -07:00
Hongming Wang
243f9bc2b1
Merge pull request #2877 from Molecule-AI/feat/poll-mode-chat-upload-phase1
feat(rfc): poll-mode chat upload — phase 1 platform staging layer
2026-05-05 11:32:10 +00:00
Hongming Wang
43bf94a07c fix(chat-uploads): align poll-mode activity rows with inbox poll filter
The workspace inbox poller filters
`GET /workspaces/:id/activity?type=a2a_receive` — writing rows with
`activity_type=chat_upload_receive` would be silently invisible to it.

Switch the poll-mode upload-staging handler to write
`activity_type=a2a_receive` with `method=chat_upload_receive` as the
discriminator. Same shape as A2A's `tasks/send` vs `message/send` method
split; the workspace-side handler (Phase 2) routes by `method`, not
activity_type.

Pinned with `TestPollUpload_ActivityRowDiscriminator` — sqlmock
WithArgs on positions 2 (activity_type) and 5 (method) so a refactor
that flips activity_type back to a custom value gets a red test
instead of a runtime "poller saw nothing" silent break.
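The workspace-side routing this implies — poller only sees activity_type=a2a_receive, handler fans out on method — can be sketched like this; the method names come from this message, but the dispatch shape is an assumption about the Phase 2 handler:

```go
package main

import "fmt"

type activityRow struct {
	ActivityType string
	Method       string
}

// routeInboxRow models the inbox poll filter + method discriminator:
// rows with any other activity_type never reach the handler at all.
func routeInboxRow(r activityRow) string {
	if r.ActivityType != "a2a_receive" {
		return "invisible-to-poller" // silently filtered by the poll query
	}
	switch r.Method {
	case "chat_upload_receive":
		return "fetch-pending-upload"
	case "tasks/send", "message/send":
		return "a2a-message"
	default:
		return "unknown-method"
	}
}

func main() {
	fmt.Println(routeInboxRow(activityRow{"a2a_receive", "chat_upload_receive"}))
	// the bug this commit fixes: a custom activity_type is never seen
	fmt.Println(routeInboxRow(activityRow{"chat_upload_receive", ""}))
}
```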
2026-05-05 04:29:07 -07:00
Hongming Wang
86fdaad111 feat(rfc): poll-mode chat upload — phase 1 platform staging layer
External-runtime workspaces (registered via molecule connect, behind
NAT, no public callback URL) currently see HTTP 422 "workspace has no
callback URL" on every chat file upload. The only escape is to wrap the
laptop in ngrok / Cloudflare tunnel + re-register push-mode — a tax
that shouldn't exist for a one-line use case.

This phase introduces the platform-side staging layer that lets
canvas → external workspace uploads ride the same poll loop the inbox
already uses for text messages.

Architecture (mirrors inbox poll, SSOT principle):
  Canvas POST /chat/uploads (multipart)
      ↓ delivery_mode=poll
  Platform: chat_files.uploadPollMode
      ↓ pendinguploads.Storage.Put + LogActivity(chat_upload_receive)
  Workspace's existing inbox poller picks up the activity row (Phase 2)
  Workspace fetches: GET /workspaces/:id/pending-uploads/:fid/content
  Workspace acks:    POST /workspaces/:id/pending-uploads/:fid/ack

Pieces in this PR:
  * Migration 20260505100000 — pending_uploads table; partial indexes
    on unacked + expires_at for the workspace fetch + Phase 3 sweep
    hot paths. No FK to workspaces (audit retention), 24h hard TTL.
  * internal/pendinguploads — Storage interface + Postgres impl. Bytes
    inline (bytea) today; the interface lets a future PR replace with
    S3 (RFC #2789) by swapping one constructor. 100% test coverage on
    the Postgres impl via sqlmock-pinned SQL.
  * handlers.PendingUploadsHandler — GET /content + POST /ack endpoints.
    wsAuth-gated; cross-workspace bleed protection via per-row
    workspace_id check (token leak from A can't read B's pending bytes).
    Handler tests pin happy path + every 4xx/5xx mapping including
    cross-workspace + race-with-sweep.
  * chat_files.go — Upload poll-mode branch behind WithPendingUploads
    builder. Push-mode unchanged (regression-tested). Multipart parse
    + per-file sanitize + storage.Put + activity_logs row per file.
  * SanitizeFilename — Go mirror of workspace/internal_chat_uploads.py
    sanitize_filename. Tests pin parity case-by-case so canvas-emitted
    URIs stay identical regardless of which path handles the upload.
  * Comprehensive logging — every state transition (staged, fetch,
    ack, error) emits a structured log line with workspace_id +
    file_id + size + sanitized name. Phase 3 metrics will hook these.
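For illustration only, a sanitize sketch in the general shape such a helper takes (strip directory parts, drop control characters, never return an empty name) — the real rules are defined by the Python sanitize_filename and pinned case-by-case by the parity tests, so treat every rule below as an assumption:

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// sanitizeFilename: illustrative sketch, NOT the pinned parity rules.
func sanitizeFilename(name string) string {
	// Strip any directory component (treat backslashes as separators too).
	name = path.Base(strings.ReplaceAll(name, "\\", "/"))
	var b strings.Builder
	for _, r := range name {
		if r < 0x20 || r == 0x7f {
			continue // drop control characters
		}
		b.WriteRune(r)
	}
	out := strings.TrimSpace(b.String())
	if out == "" || out == "." || out == ".." {
		return "unnamed" // never emit an empty or path-like name
	}
	return out
}

func main() {
	fmt.Println(sanitizeFilename("../../etc/passwd")) // passwd
	fmt.Println(sanitizeFilename(""))                 // unnamed
}
```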

The pendinguploads.Storage wiring is opt-in (WithPendingUploads on
ChatFilesHandler) so a binary deployed without the migration keeps the
pre-existing 422 behavior — no boot-order coupling between code roll
and schema roll.

Phase 2 (separate PR): workspace inbox extension — inbox_uploads.py
fetches via the GET endpoint, writes to
/workspace/.molecule/chat-uploads/, acks, and rewrites the URI from
platform-pending: → workspace: so the agent's existing
send-attachments path needs no changes.
Phase 3: GC sweep + dashboards. Phase 4: poll-mode E2E on staging.

Tests:
  * 100% coverage on pendinguploads (sqlmock-pinned SQL drift gate).
  * Functional 100% on new handler code (uncovered branches are
    documented defensive duplicates: uuid re-parse, multipart Open
    error, Writer.Write fail — none reproducible in unit tests).
  * Push-mode + NULL delivery_mode regression tests pin no behavior
    change for existing workspaces.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 04:22:24 -07:00
Hongming Wang
26e2e97006 refactor(models): consolidate per-runtime model defaults to SSOT (RFC #2873 iter 1)
Two call sites — workspace_provision.go:537 and org_import.go:54 —
duplicated the same `if runtime == "claude-code"` branch deciding
the default model when the operator/agent didn't supply one. They
were copy-pasted; nothing prevented them from drifting silently.

Extract to `models.DefaultModel(runtime string) string`. Both call
sites now route through the helper. New runtimes need one entry
in DefaultModel + one assertion in TestDefaultModel — pre-fix it
required two source edits + an audit.

Foundation for the future `RuntimeConfig` interface (RFC #2873 +
task #231): once we add `ProvisioningTimeout()`, `CapabilitiesSupported()`
etc., the helper expands to per-runtime structs and `DefaultModel`
becomes one method on the interface.

## Coverage

15 unit tests pinning the exact contract:
  - claude-code → "sonnet"
  - 9 other known runtimes → universal default
  - empty + unknown → universal default (matches pre-refactor fallthrough)
  - case-sensitivity preserved (CLAUDE-CODE → universal default)

Plus invariant test: `DefaultModel` never returns "" — protects
against a future "return early on unknown" regression that would
silently break workspace creation.
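The contract pinned above can be sketched in a few lines; "sonnet" for claude-code is stated in this message, but the universal default's actual value is not, so the constant below is a placeholder:

```go
package main

import "fmt"

// universalDefault is a placeholder — the real value lives in the
// models package and is not stated in this log.
const universalDefault = "default"

// DefaultModel is the SSOT for per-runtime model defaults.
func DefaultModel(runtime string) string {
	switch runtime {
	case "claude-code":
		return "sonnet"
	default:
		// Empty, unknown, and differently-cased runtimes all fall
		// through, matching the pre-refactor fallthrough behavior.
		return universalDefault
	}
}

func main() {
	fmt.Println(DefaultModel("claude-code")) // sonnet
	fmt.Println(DefaultModel("CLAUDE-CODE")) // default (case-sensitive)
}
```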

## Verification

  - go build ./... clean
  - 15 model unit tests pass
  - existing handler tests untouched (no behavior change at call sites)
  - identical output to pre-refactor for every input

First iteration of the OSS-shape refactor program. Each PR meets all
7 bars (plugin/abstract/modular/SSOT/coverage/cleanup/file-split).

Refs RFC #2873.
2026-05-05 04:12:37 -07:00
Hongming Wang
1e8d7ae17c
Merge pull request #2869 from Molecule-AI/test/rfc2829-tighten-sweeper-assertions
test(delegations): tighten integration-test assertions + integrationDB doc
2026-05-05 10:42:31 +00:00
Hongming Wang
fcdf79774d test(delegations): tighten integration-test assertions + integrationDB doc (#321)
Three small follow-ups from #2866 self-review:

1. TestIntegration_Sweeper_StaleHeartbeatIsMarkedStuck — assert
   strings.Contains(errDet, "no heartbeat for") instead of != "".
   The original "non-empty" check passes for any error_detail value;
   if a future regression swaps the message format, the test wouldn't
   catch it. Pin the production format string explicitly.

2. TestIntegration_Sweeper_DeadlineExceededIsMarkedFailed — drop the
   redundant `last_heartbeat = now()` write. The sweeper checks
   deadline FIRST (the stronger statement) and short-circuits before
   evaluating heartbeat staleness, so the heartbeat field is irrelevant
   for that test path.

3. integrationDB doc comment now warns explicitly that the helper is
   NOT t.Parallel()-safe — it hot-swaps the package-level mdb.DB and
   restores via t.Cleanup. If a future contributor adds t.Parallel()
   to one of these tests they race on the global. Comment makes the
   constraint discoverable instead of a debugging surprise.

All 7 integration tests still pass against real Postgres locally.
2026-05-05 03:39:22 -07:00
Hongming Wang
d6337a1ae9 feat(org-import): make createWorkspaceTree idempotent (Phase 3 of #2857)
OrgHandler.Import was non-idempotent — every call INSERTed a fresh row
for every workspace in the tree, regardless of whether matching
workspaces already existed. Calling /org/import twice with the same
template duplicated the entire tree.

This was the bigger leak source than TeamHandler.Expand (deleted in
PR #2856). tenant-hongming accumulated 72 distinct child workspaces
in 4 days entirely from repeated org-template spawns of the same
template — the (tier × runtime) matrix in the audit data was the
template's static shape, multiplied by spawn count.

Fix: route through a new lookupExistingChild helper before INSERT.
Skip-if-exists semantics by default:
- Match on (parent_id, name) using `IS NOT DISTINCT FROM` so NULL
  parents (root workspaces) are included.
- Ignore status='removed' rows so collapsed teams or deleted
  workspaces don't block re-import.
- Recursion still runs on the existing id so partial-match templates
  (parent exists, some children missing) backfill correctly instead
  of either no-op'ing the whole subtree or duplicating the existing
  children.
- Result entries for skipped nodes carry skipped:true so callers
  (canvas Import preflight modal) can surface "5 of 7 already
  existed, 2 created."
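The NULL-safe match is the interesting part: in SQL, `IS NOT DISTINCT FROM` treats NULL = NULL as true, which is what lets root workspaces (NULL parent) share the same lookup. A sketch — the query text below is illustrative, not the pinned SQL, and the Go helper just mirrors the operator's semantics:

```go
package main

import "fmt"

// Illustrative lookup shape; the real SQL is pinned by the sqlmock tests.
const lookupSQL = `
SELECT id FROM workspaces
 WHERE parent_id IS NOT DISTINCT FROM $1
   AND name = $2
   AND status <> 'removed'`

// notDistinctFrom mirrors IS NOT DISTINCT FROM for nullable strings:
// unlike `=`, two NULLs compare as a match.
func notDistinctFrom(a, b *string) bool {
	if a == nil || b == nil {
		return a == nil && b == nil // NULL IS NOT DISTINCT FROM NULL → true
	}
	return *a == *b
}

func main() {
	p := "ws-123"
	fmt.Println(notDistinctFrom(nil, nil)) // true — root workspaces match
	fmt.Println(notDistinctFrom(&p, nil))  // false
}
```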

The recursion that walked ws.Children is extracted into
recurseChildrenForImport so both the create-path and the skip-path
share one implementation — no duplicated grid math, no two paths to
keep in sync.

Note: replace_if_exists semantics (re-roll: stop+delete old, create
new) are deferred. Skip-if-exists alone closes the leak; re-roll is
a later UX decision for the canvas Import preflight modal.

Tests:
- 4 sqlmock cases on lookupExistingChild: not-found, found,
  nil-parent (the IS NOT DISTINCT FROM NULL trick), DB-error
  propagates (must fail fast — silent fallback to INSERT is the
  failure mode the helper exists to prevent).
- 1 source-level AST gate (per memory feedback_behavior_based_ast_gates.md):
  pins that h.lookupExistingChild( appears BEFORE INSERT INTO workspaces
  in org_import.go. If a future refactor reintroduces the un-checked
  INSERT, the gate fails. Verified load-bearing by removing the call —
  build fails (helper symbol gone).

go vet ./... clean. go test ./internal/handlers/ -count 1 — all green
(4.2s, no regression on existing OrgImport / Provision / Team tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 03:37:49 -07:00
Hongming Wang
3d2a50e2a2 test(delegations): extend integration suite with sweeper coverage (3 tests)
Real-Postgres tests for the RFC #2829 PR-3 sweeper. Validates:

  - Deadline-exceeded rows are marked failed with the expected
    error_detail
  - Stale-heartbeat in-flight rows are marked stuck (uses
    DELEGATION_STUCK_THRESHOLD_S env override for deterministic
    timing)
  - Healthy rows (fresh heartbeat + future deadline) are not touched
    — no false-positive against well-behaved delegations

These extend the gate added in the previous commit so the workflow
catches sweeper regressions, not just ledger-write ones. All 7
integration tests now pass; CI workflow runs them all.
2026-05-05 03:20:19 -07:00
Hongming Wang
c661ea4cd3
Merge pull request #2861 from Molecule-AI/fix/rfc2829-result-preview-ordering-and-integration-gate
fix(delegations): preserve result_preview + add real-Postgres integration gate
2026-05-05 09:51:30 +00:00
Hongming Wang
4c9f12258d fix(delegations): preserve result_preview through completion + add real-Postgres integration gate
Two-part PR:

## Fix: result_preview was lost on completion

Self-review of #2854 caught a real bug. SetStatus has a same-status
replay no-op; the order of calls in `executeDelegation` completion
+ `UpdateStatus` completed branch clobbered the preview field:

  1. updateDelegationStatus(completed, "")    fires
  2. inner recordLedgerStatus(completed, "", "")
       → SetStatus transitions dispatched → completed with preview=""
  3. outer recordLedgerStatus(completed, "", responseText)
       → SetStatus reads current=completed, status=completed
       → SAME-STATUS NO-OP, never writes responseText → preview lost

Confirmed against real Postgres (see integration test). Strict-sqlmock
unit tests passed because they pin SQL shape, not row state.

Fix: call the WITH-PREVIEW recordLedgerStatus FIRST, then
updateDelegationStatus. The inner call becomes the no-op (correctly
preserves the row written by the outer call).

Same gap fixed in UpdateStatus handler — body.ResponsePreview was
never landing in the ledger because updateDelegationStatus's nested
SetStatus(completed, "", "") fired first.
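The bug boils down to call order against a same-status no-op, which a minimal in-memory model makes concrete — field and method names below are simplified from this message:

```go
package main

import "fmt"

type ledgerRow struct {
	status  string
	preview string
}

// SetStatus no-ops when the row is already in the target status, so a
// later call can never overwrite fields written by an earlier one.
func (r *ledgerRow) SetStatus(status, preview string) {
	if r.status == status {
		return // same-status replay: no-op by design
	}
	r.status = status
	r.preview = preview
}

func main() {
	// Buggy order: the status-only call fires first, so the
	// preview-carrying call hits the same-status no-op.
	buggy := &ledgerRow{status: "dispatched"}
	buggy.SetStatus("completed", "")
	buggy.SetStatus("completed", "the response text")
	fmt.Printf("buggy preview: %q\n", buggy.preview) // ""

	// Fixed order: preview-carrying call first; the inner status-only
	// call becomes the harmless no-op.
	fixed := &ledgerRow{status: "dispatched"}
	fixed.SetStatus("completed", "the response text")
	fixed.SetStatus("completed", "")
	fmt.Printf("fixed preview: %q\n", fixed.preview) // "the response text"
}
```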

## Gate: real-Postgres integration tests + CI workflow

The unit-test-only workflow that shipped #2854 was the root cause.
Adding two layers of defense:

1. workspace-server/internal/handlers/delegation_ledger_integration_test.go
   — `//go:build integration` tag, requires INTEGRATION_DB_URL env var.
   4 tests:
     * ResultPreviewPreservedThroughCompletion (regression gate for the
       bug above — fires the production call sequence in fixed order
       and asserts row.result_preview matches)
     * ResultPreviewBuggyOrderIsLost (DIAGNOSTIC: confirms the
       same-status no-op contract works as designed; if SetStatus's
       semantics ever change, this test fires)
     * FailedTransitionCapturesErrorDetail (failure-path symmetry)
     * FullLifecycle_QueuedToDispatchedToCompleted (forward-only +
       happy path)

2. .github/workflows/handlers-postgres-integration.yml
   — required check on staging branch protection. Spins postgres:15
   service container, applies the delegations migration, runs
   `go test -tags=integration` against the live DB. Always-runs +
   per-step gating on path filter (handlers/wsauth/migrations) so
   the required-check name is satisfied on PRs that don't touch
   relevant code.

Local dev workflow (file header documents this):

  docker run --rm -d --name pg -e POSTGRES_PASSWORD=test -p 55432:5432 postgres:15-alpine
  psql ... < workspace-server/migrations/049_delegations.up.sql
  INTEGRATION_DB_URL="postgres://postgres:test@localhost:55432/molecule?sslmode=disable" \
    go test -tags=integration ./internal/handlers/ -run "^TestIntegration_"

## Why this matters

Per memory `feedback_mandatory_local_e2e_before_ship`: backend PRs
MUST verify against real Postgres before claiming done. sqlmock pins
SQL shape; only a real DB can verify row state. The workflow makes
this gate mandatory rather than optional.
2026-05-05 02:47:52 -07:00
Hongming Wang
d890fd9a3f
Merge pull request #2856 from Molecule-AI/chore/remove-team-expand-handler
chore(workspace-server): remove TeamHandler.Expand bulk-create handler
2026-05-05 09:42:51 +00:00
Hongming Wang
ec1f21922c chore(workspace-server): remove TeamHandler.Expand bulk-create handler
Every workspace can have children via the regular CreateWorkspace flow
with parent_id set, so a separate handler that bulk-creates from
config.yaml's sub_workspaces (and was non-idempotent — calling it twice
duplicated the team) earned its way out. "Team" is just the state of
having children; expanding/collapsing is purely a canvas-side visual
action that toggles the `collapsed` column via PATCH.

The non-idempotency directly caused tenant-hongming's vCPU starvation:
72 distinct child workspaces accumulated in 4 days, ~14 leaked EC2s
(50 of 64 vCPU consumed by stale teams), every Canvas tabs E2E retry
flaking on RunInstances VcpuLimitExceeded.

What stays:
- TeamHandler.Collapse — still useful; stops + removes children via
  StopWorkspaceAuto. Reachable from the canvas Collapse Team button.
  (Note: that button currently calls PATCH /workspaces/:id, not the
  Collapse endpoint — that's a separate reachability question for
  later.)
- findTemplateDirByName helper — kept in team.go pending a relocate
  decision; no in-package consumers after Expand.
- The four other paths that create child workspaces continue to work
  unchanged: regular POST /workspaces with parent_id, OrgHandler.Import
  (recursive tree), Bundle import, scripts.

What goes:
- POST /workspaces/:id/expand route (router.go)
- TeamHandler.Expand method (team.go: ~130 lines)
- 4 TestTeamExpand_* sqlmock tests (team_test.go)
- TestTeamExpand_UsesAutoNotDirectDockerPath AST gate
  (workspace_provision_auto_test.go) — pinned a code path that no
  longer exists; the generic TestNoCallSiteCallsDirectProvisionerExceptAuto
  gate still covers the architectural intent for any future caller.

Follow-up PRs:
- canvas/ContextMenu.tsx: drop the "Expand to Team" right-click button
  + handleExpand callback; users create children via the regular
  + New Workspace dialog with the parent picker (already supported)
- OrgHandler.Import idempotency (skip-if-exists OR replace_if_exists)
  — same bug class as the deleted Expand, but on the bulk-tree path
- One-off cleanup script for tenant-hongming's 72 stale workspaces

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 02:39:13 -07:00
Hongming Wang
ca61213578
Merge pull request #2853 from Molecule-AI/refactor/split-workspace-dispatchers-1777970000
refactor(handlers): extract dispatchers from workspace.go (#2800 partial)
2026-05-05 09:30:55 +00:00
Hongming Wang
44bb35a926 feat(delegations): wire ledger Insert+SetStatus from production code paths (RFC #2829 #318)
PR-1 shipped the `delegations` table + `DelegationLedger` helper. PR-3
wired the sweeper. PR-4 wired the dashboard. But no PR ever wired
`ledger.Insert` from a production code path — the table stayed empty,
the sweeper had nothing to sweep, the dashboard had nothing to show.

This PR closes that gap. Behind feature flag `DELEGATION_LEDGER_WRITE=1`
(default off), the legacy activity_logs writes are mirrored to the
durable ledger:

  - insertDelegationRow → ledger.Insert (queued)
  - updateDelegationStatus → ledger.SetStatus on every status transition
  - executeDelegation completion path → ledger.SetStatus(completed,
    result_preview) for the result preview that activity_logs already
    stores in response_body
  - Record handler → ledger.Insert + ledger.SetStatus(dispatched) so
    agent-initiated delegations land in the same table

## Why a flag

The legacy flow has ~30 strict-sqlmock tests pinning exactly which SQL
statements fire per handler. Adding ledger writes always-on would
force adding ExpectExec stanzas to each. Flag-off keeps all 30 green
without churn; flag-on lets operators populate the table in staging
to feed the sweeper + dashboard once the agent-side cutover (RFC #2829
PR-5) has proven the round-trip end-to-end.

Default off → byte-identical to pre-#318 behavior.

## Status vocabulary mapping

activity_logs uses a freer status vocabulary than the ledger's CHECK
constraint allows. updateDelegationStatus is called with values like
"received" that the ledger doesn't accept; the wiring filters via a
switch to only forward known-good values, skipping anything else.

Record's first activity_logs row is `dispatched` but the ledger's
Insert path requires `queued` as initial state. Insert as queued first;
the very next SetStatus(..., dispatched) promotes it on the same row.
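The vocabulary filter can be sketched as a whitelist switch; the accepted set below is inferred from statuses named in this log, not read from the actual CHECK constraint:

```go
package main

import "fmt"

// ledgerStatusFor forwards only statuses the ledger's CHECK constraint
// is assumed to accept, silently skipping legacy-only values like
// "received" so the ledger write never fails the delegation.
func ledgerStatusFor(legacy string) (string, bool) {
	switch legacy {
	case "queued", "dispatched", "completed", "failed", "stuck":
		return legacy, true
	default:
		return "", false // legacy-only vocabulary: skip the ledger write
	}
}

func main() {
	if s, ok := ledgerStatusFor("completed"); ok {
		fmt.Println("forward:", s)
	}
	if _, ok := ledgerStatusFor("received"); !ok {
		fmt.Println("skip: received")
	}
}
```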

## Coverage

8 wiring tests (delegation_ledger_writes_test.go):

  - flag off → no SQL fired (rollout safety contract)
  - flag on → INSERT + UPDATE fire as expected
  - flag rejects loose truthy values (true/yes/0/on/TRUE) — only "1"
    is the on signal, matching PR-2 + PR-5 conventions
  - terminal-state replay swallows ErrInvalidTransition (legacy is
    authoritative; ledger replay error is not a delegation failure)

All 30 existing delegation_test.go tests still pass — flag default off
keeps the strict-sqlmock surface unchanged.

Refs RFC #2829.
2026-05-05 02:26:06 -07:00
Hongming Wang
024ef260db refactor(handlers): extract dispatchers from workspace.go (#2800 partial)
workspace.go was 950 lines after the dispatcher work in PRs #2811 +
#2824 + #2843 + #2846 + #2847 + #2848 + #2850. This extracts the 6
SoT dispatcher helpers into a new workspace_dispatchers.go so the
file is the architectural unit it deserves to be (one place for
"how do we route a workspace lifecycle verb to a backend?").

Moved (no body changes — pure cut + paste with imports):
  - HasProvisioner               (gate accessor)
  - provisionWorkspaceAuto       (async provision)
  - provisionWorkspaceAutoSync   (sync provision, runRestartCycle's path)
  - StopWorkspaceAuto            (stop dispatcher)
  - RestartWorkspaceAuto         (restart wrapper)
  - RestartWorkspaceAutoOpts     (restart with resetClaudeSession)

workspace.go shrinks from 950 → 735 lines and now holds:
  - WorkspaceHandler struct + constructor
  - SetCPProvisioner / SetEnvMutators
  - Create / List / Get / scanWorkspaceRow
  - HTTP handler glue

workspace_dispatchers.go is 255 lines and holds the dispatcher trio +
sync variant + gate accessor + a header docblock summarizing the
history (PRs that added each helper) and the source-level pin tests
that gate against drift.

Source-level pin tests updated:
  - TestNoCallSiteCallsDirectProvisionerExceptAuto: workspace_dispatchers.go
    added to allowlist (the dispatcher IS the place that calls per-backend
    bodies directly).
  - TestNoCallSiteCallsBareStop: same.
  - TestNoBareBothNilCheck / TestOrgImportGate_UsesHasProvisionerNotBareField:
    no change — they were source-pinning specific files, not all callers.

Build clean, vet clean, full test suite passes (1742 / 0 in workspace,
all Go test packages green).

Out of scope (#2800 has more):
  - workspace_provision.go (869 lines) split into Docker + CP halves —
    files would still be 400+ each, marginal value. Defer until a
    third backend lands and the symmetry breaks.
  - Splitting Create / List / Get into per-handler files — they're
    short and tightly coupled to the struct; keep co-located.

Partially addresses #2800; filing a follow-up issue if/when
workspace.go or workspace_provision.go grows past 800 lines again.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 02:24:49 -07:00
Hongming Wang
c85783fbee docs(workspace): point recovery hint at /external/rotate (not the never-shipped /tokens)
Self-review of #2852: the inline comment on the IssueToken-failed branch
still referenced POST /workspaces/:id/tokens, which never shipped. The
recovery path that did ship in #2852 is POST /workspaces/:id/external/rotate.
Update the hint so the next operator who hits this failure mode finds
the right endpoint.
2026-05-05 01:58:43 -07:00
Hongming Wang
b375252dc8 feat(external): credential rotation + re-show instruction modal (#319)
External workspaces (runtime=external) lose their workspace_auth_token
the moment the create modal closes — the token is unrecoverable from
any later DB read. Operators who lost their copy or want to respond to
a suspected leak had no recovery path short of recreating the workspace
(which also breaks cross-workspace delegation links + memory namespace).

This PR adds two endpoints + a Config-tab section that surfaces them:

  POST /workspaces/:id/external/rotate
    Revokes any prior live tokens, mints a fresh one, returns the same
    ExternalConnectionInfo payload Create returns. Old credentials stop
    working immediately — the previously-paired agent will fail auth on
    its next heartbeat (~20s).

  GET /workspaces/:id/external/connection
    Returns the connect block with auth_token="". For the operator who
    just needs to re-find PLATFORM_URL / WORKSPACE_ID / one of the
    snippets without invalidating the live agent.

Both reject runtime ≠ external with 400 + a hint pointing at /restart
for non-external runtimes (which mints AND injects into the container).

## Why a flag isn't needed

The endpoints are purely additive — Create's behavior is unchanged.
Existing external workspaces don't see anything different until an
operator clicks the new buttons.

## DRY refactor

Extracted BuildExternalConnectionPayload() in external_connection.go
as the single source of truth for the connect payload shape. Create,
Rotate, and GetExternalConnection all call it. Adds a snippet once →
all three endpoints emit it. Trims trailing slash on platform_url so
no double-slash sneaks into registry_endpoint.

## Canvas

ExternalConnectionSection mounts in ConfigTab when runtime=external.
Two buttons:
  - "Show connection info" (cosmetic) — fetches GET /external/connection
  - "Rotate credentials" (destructive) — confirm dialog explains the
    impact, then POST /external/rotate

Both reuse the existing ExternalConnectModal so operators don't learn
a second snippet UX.

## Coverage

10 Go tests:
  - Rotate happy path (revoke + mint order, payload shape, broadcast event)
  - Rotate refuses non-external runtimes (400 with restart hint)
  - Rotate 404 on unknown workspace + 400 on empty id
  - GetExternalConnection happy path (auth_token="", same payload shape)
  - GetExternalConnection refuses non-external + 404 on unknown
  - BuildExternalConnectionPayload — placeholder substitution + trailing
    slash trimming + blank-token contract

6 canvas tests:
  - both action buttons render
  - "Show" calls GET /external/connection and opens modal
  - "Rotate" opens confirm dialog before firing POST
  - Cancel dismisses without rotating
  - Confirm POSTs and opens modal with returned token
  - API failures surface as visible error chips

Migration: existing external workspaces gain new abilities; no data
migration. The DRY refactor preserves byte-identical Create response
shape (8 ConfigTab tests + all existing handler tests still pass).

Closes #319.
2026-05-05 01:55:27 -07:00
Hongming Wang
61b7755c3c feat(handlers): migrate Pause loop to StopWorkspaceAuto — #2799 Phase 3
Last open #2799 site. Pause's per-workspace stop call now routes
through StopWorkspaceAuto, removing the final inline if-cpProv-else
(actually if-h.provisioner) dispatch from workspace_restart.go's
restart/pause/resume code paths.

Pre-2026-05-05 the Pause loop was:

  if h.provisioner != nil {
      h.provisioner.Stop(ctx, ws.id)
  }

Same drift class as #2813 (team-collapse leak) + #2814 (workspace
delete leak) — Docker-only stop silently no-ops on SaaS, leaving
the EC2 running while the workspace row gets marked paused. Orphan
sweeper would catch it eventually but the leak window is real.

Pause-specific bookkeeping (mark paused, clear workspace keys,
broadcast WORKSPACE_PAUSED) stays inline in the handler; only the
"stop the running workload" step delegates. StopWorkspaceAuto's
no-backend → no-op semantics match the pre-fix behavior on
misconfigured deployments (the bookkeeping still runs).

One new source-level pin:
  TestPauseHandler_UsesStopWorkspaceAuto — gates regression to the
  inline dispatch shape.

This closes #2799 Phase 3. After this PR + #2847 (Phase 2 PR-B) land,
workspace_restart.go has no remaining inline if-cpProv-else dispatch
in any user-facing code path. The remaining direct backend calls
inside the file are in stopForRestart and cpStopWithRetry — both
internal helpers that ARE the dispatcher's underlying primitives,
not new bypasses.

Note: scope was originally tagged "Phase 3 needs PauseWorkspaceAuto
verb" in the audit on PR #2843. On closer reading Pause's stop step
is identical to Stop — only the bookkeeping is Pause-specific. Reusing
StopWorkspaceAuto avoids unnecessary surface and keeps the dispatcher
trio (provision/stop/restart) tight.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 00:00:16 -07:00
Hongming Wang
9a772bf946 feat(handlers): provisionWorkspaceAutoSync + Site 4 migration — #2799 Phase 2 PR-B
runRestartCycle's auto-restart cycle (Site 4 from PR #2843's audit)
needs synchronous provision dispatch — the outer pending-flag loop
in RestartByID relies on returning when the new container is up so
the next restart cycle doesn't race the in-flight provision goroutine
on its Stop call.

Phase 1's provisionWorkspaceAuto wraps each per-backend body in
`go func() {...}()` — wrong shape for runRestartCycle's needs. This
PR introduces provisionWorkspaceAutoSync as a behavioral mirror that
runs in the current goroutine instead.

Two helpers, kept identical except for the wrapper:

  provisionWorkspaceAuto:     spawns goroutine, returns immediately
  provisionWorkspaceAutoSync: blocks until per-backend body returns

Same backend-selection (CP first, Docker second) + no-backend
mark-failed fallback. When one grows a new arm (third backend, retry
semantics), the other should too — pinned in the docstring.
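The "identical except for the wrapper" pairing can be sketched by having both variants call one shared body, so backend selection can't drift between them; provisionBody below stands in for the CP-first/Docker-second dispatch:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type handler struct {
	provisioned atomic.Bool
	done        chan struct{}
}

// provisionBody is the single shared backend-selection + provision
// logic; both wrappers below delegate to it.
func (h *handler) provisionBody(id string) {
	h.provisioned.Store(true)
	close(h.done)
}

// provisionWorkspaceAuto: fire-and-return (spawns a goroutine).
func (h *handler) provisionWorkspaceAuto(id string) {
	go h.provisionBody(id)
}

// provisionWorkspaceAutoSync: blocks until the per-backend body returns,
// the shape runRestartCycle needs.
func (h *handler) provisionWorkspaceAutoSync(id string) {
	h.provisionBody(id)
}

func main() {
	h := &handler{done: make(chan struct{})}
	h.provisionWorkspaceAutoSync("ws-1")
	fmt.Println("sync observed immediately:", h.provisioned.Load())

	h2 := &handler{done: make(chan struct{})}
	h2.provisionWorkspaceAuto("ws-2")
	<-h2.done // async caller must wait on its own
	fmt.Println("async observed after wait:", h2.provisioned.Load())
}
```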

Site 4 (runRestartCycle) was the only call site that needs sync today.
Migrating it removes the last bare if-cpProv-else dispatch in the
restart code path's provision half.

Three new tests:
  - TestProvisionWorkspaceAutoSync_RoutesToCPWhenSet
  - TestProvisionWorkspaceAutoSync_NoBackendMarksFailed
  - TestRunRestartCycle_UsesProvisionWorkspaceAutoSync (source-level pin)

Out of scope (last open #2799 site):
  Phase 3 — Site 5 (Pause loop). PAUSE doesn't reprovision; needs a
  new PauseWorkspaceAuto verb. After this PR lands, Pause is the only
  inline if-cpProv-else dispatch left in workspace_restart.go.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 23:44:54 -07:00
Hongming Wang
0a90d7ae1a
Merge pull request #2846 from Molecule-AI/feat/2799-phase2-restart-resume-1777958000
feat(handlers): migrate Restart + Resume handlers to dispatchers — #2799 Phase 2 PR-A
2026-05-05 05:15:29 +00:00
Hongming Wang
5b7f4d260b feat(handlers): migrate Restart + Resume handlers to dispatchers — #2799 Phase 2 PR-A
Sites 1+2 (Restart HTTP handler goroutine) and Site 3 (Resume HTTP
handler goroutine) now route through RestartWorkspaceAutoOpts /
provisionWorkspaceAuto instead of inlining the if-cpProv-else dispatch.

Three changes:

1. **RestartWorkspaceAutoOpts** — new variant of RestartWorkspaceAuto
   that carries the resetClaudeSession Docker-only flag (issue #12).
   The bare RestartWorkspaceAuto still exists as a wrapper that calls
   Opts with false. CP path silently ignores the flag (each EC2 boots
   fresh — no session state to clear). Mirrors the Provision pair
   (provisionWorkspace / provisionWorkspaceOpts).

2. **Restart handler (Site 1+2)** — the inline goroutine
   `if h.provisioner != nil { Stop } else if h.cpProv != nil { ... }`
   collapses to `RestartWorkspaceAutoOpts(...)`. Before this fix, the
   dispatch used Docker-FIRST ordering (a different drift class from
   the silent-drop bugs PRs #2811/#2824 closed); the dispatcher
   enforces CP-FIRST.

3. **Resume handler (Site 3)** — Resume is provision-only (workspace
   is paused, no live container), so it routes through
   provisionWorkspaceAuto, not RestartWorkspaceAuto. Inline
   if-cpProv-else dispatch removed.

Two new source-level pins:
  - TestRestartHandler_UsesRestartWorkspaceAuto
  - TestResumeHandler_UsesProvisionWorkspaceAuto

These prevent regression to the inline dispatch pattern.

Out of scope (tracked under #2799):
  - Site 4 (runRestartCycle) — synchronous coordination model needs
    a different shape than the fire-and-return dispatchers. PR-B.
  - Site 5 (Pause loop) — PAUSE doesn't reprovision, needs a new
    PauseWorkspaceAuto verb. Phase 3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 22:09:12 -07:00
Hongming Wang
7993693cf1 feat(delegations): wire RFC #2829 sweeper + admin routes into platform server
Activates the server-side foundation that PRs #2832, #2836, #2837
shipped without wiring (each PR landed dead code on purpose so the
review surface stayed tight).

## What this PR wires up

  1. router.go — registers the RFC #2829 PR-4 admin endpoints behind
     AdminAuth:
       GET /admin/delegations[?status=...&limit=N]
       GET /admin/delegations/stats

  2. cmd/server/main.go — starts the RFC #2829 PR-3 stuck-task
     sweeper as a supervised goroutine alongside the existing
     scheduler + hibernation-monitor + image-auto-refresh:
       go supervised.RunWithRecover(ctx, "delegation-sweeper",
                                    delegSweeper.Start)

## What this PR does NOT do

  - PR-2's DELEGATION_RESULT_INBOX_PUSH flag stays default off — flip
    happens via env config in a follow-up after staging burn-in.
  - PR-5's DELEGATION_SYNC_VIA_INBOX flag stays default off — same
    reason. The two flags are independent; either can be flipped in
    isolation.
  - Canvas operator panel UI: this PR exposes the JSON contract; the
    canvas panel consumes it in a separate canvas PR.

## Coverage

2 new router gate tests in admin_delegations_route_test.go:

  - List endpoint requires AdminAuth (unauthenticated → 401)
  - Stats endpoint requires AdminAuth (unauthenticated → 401)

Pattern mirrors admin_test_token_route_test.go (the IDOR-fix gate
for PR #112). Catches a future router refactor that silently drops
AdminAuth — operator dashboard data exposes caller_id, callee_id, and
task_preview, none of which should reach unauthenticated callers.

Sweeper boots as a no-op until at least one delegation row exists,
so this PR is safe to land before PR-5's agent-side cutover sees
production traffic.

Refs RFC #2829.
2026-05-04 22:00:59 -07:00
Hongming Wang
789d705866
Merge pull request #2843 from Molecule-AI/fix/restart-dispatcher-rework-1777956000
feat(handlers): RestartWorkspaceAuto dispatcher — #2799 Phase 1 (re-do of #2835)
2026-05-05 04:48:52 +00:00
Hongming Wang
cb820acbd6 fix(test): pre-register sqlmock for panic-recovered Docker test goroutine 2026-05-04 21:44:31 -07:00
Hongming Wang
82e7059e0e
Merge pull request #2842 from Molecule-AI/fix/codex-template-bump-cli-pin
fix(external-templates): unpin codex CLI from stale ^0.57
2026-05-05 04:34:14 +00:00
Hongming Wang
4f67fe59fb feat(handlers): RestartWorkspaceAuto dispatcher — #2799 Phase 1
Closes the third silent-drop-on-SaaS class for the restart verb. Two
of the three dispatchers were already in place (provisionWorkspaceAuto
PR #2811, StopWorkspaceAuto PR #2824); this completes the trio.

PR #2835 was an earlier attempt at this work (delivered by a peer
agent) that I had to send back for four critical bugs — stop-leg
dispatch order inverted, no-backend nil-deref, empty payload (dispatcher
unusable by callers), forcing-function tests red-from-day-1. This
re-do takes the audit + classification from that work but rebuilds
the implementation against the existing dispatcher convention.

Phase 1 scope:

  - RestartWorkspaceAuto in workspace.go — symmetric mirror of
    provisionWorkspaceAuto + StopWorkspaceAuto. CP-first dispatch
    order. cpStopWithRetry on the SaaS leg (Restart's "make it alive
    again" contract justifies the retry that StopWorkspaceAuto's
    delete-time contract does not). Three-arm shape including a
    no-backend mark-failed defense-in-depth.
  - Three new pin tests covering the routing surface:
    TestRestartWorkspaceAuto_RoutesToCPWhenSet,
    TestRestartWorkspaceAuto_RoutesToDockerWhenOnlyDocker,
    TestRestartWorkspaceAuto_NoBackendMarksFailed.

Phase 2/3 (deferred, file as follow-up issue):

  - workspace_restart.go's manual dispatch sites (Restart handler
    goroutine, Resume handler goroutine, runRestartCycle's inline
    Stop, Pause loop). Each site has async-context reasoning beyond
    a fire-and-return dispatcher and needs per-site review.
  - Pause specifically needs a different verb (PauseWorkspaceAuto)
    since Pause doesn't reprovision.

Why no callers migrated in this PR: the existing call sites in
workspace_restart.go all build their `payload` from a synchronous
DB read first; rewiring them needs care to preserve that ordering
plus the resetClaudeSession + template path resolution that lives
in the HTTP handler context. Splitting the dispatcher introduction
from the migration keeps each PR small and reviewable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 21:30:36 -07:00
Hongming Wang
410275e5af fix(external-templates): unpin codex CLI from stale ^0.57
`^0.57` only allows 0.57.x — codex CLI is now at 0.128, with breaking
API changes in between (notably `exec --resume <sid>` becoming the
`exec resume <sid>` subcommand). Operators following the snippet today either get a
subcommand). Operators following the snippet today either get a
6-month-old codex with the legacy resume flag, OR install latest manually
and discover the daemon previously couldn't drive it.

codex-channel-molecule 0.1.2 (just published) handles the new subcommand
shape, so operators are best served by always getting the latest codex
that the bridge daemon was last validated against. Bump to `@latest`.

If a future codex CLI breaks the daemon's invocation again, we ship a
new bridge-daemon release rather than asking operators to manage a pin
themselves.

Test: go test ./internal/handlers/ -run TestExternalTemplates -count=1 → green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 21:27:45 -07:00
Hongming Wang
1557743ef9
Merge branch 'staging' into feat/rfc2829-pr2-result-push-and-sync-cutover 2026-05-04 21:25:33 -07:00