Commit Graph

519 Commits

Hongming Wang
61d5908817 fix(workspace files API): write claude-code config to /configs, sudo for root-owned base
Root cause of the user-visible 500 ("install: cannot create directory
'/opt/configs': Permission denied") on PUT
/workspaces/<id>/files/config.yaml:

1. Path map fall-through. claude-code wasn't in workspaceFilePathPrefix,
   so resolveWorkspaceFilePath returned the default `/opt/configs/...`.
   That directory doesn't exist on the workspace EC2 — cloud-init in
   provisioner/userdata_containerized.go runs `mkdir -p /configs` only.
   Even if the SSH write had succeeded at /opt/configs, the docker
   container's bind-mount is host:/configs → container:/configs,
   so the file would have been invisible to the runtime.

2. /configs ownership. cloud-init runs as root, so /configs is
   root-owned. The SSH-as-ubuntu install command can't write into it
   without sudo. Hermes wasn't affected because its base path
   (/home/ubuntu/.hermes) is ubuntu-owned.

Two-line fix:

- Add `claude-code: /configs` to the runtime → base-path map and flip
  the default fall-through from `/opt/configs` to `/configs`. Leave the
  pre-existing langgraph/external entries pointing at /opt/configs
  pending a migration audit (no user report on those today, and
  flipping them would silently relocate any files those runtimes
  already wrote).
- Prefix the remote install command with `sudo -n` so the write
  succeeds under the standard EC2 ubuntu/passwordless-sudo posture.
  `-n` (non-interactive) ensures clean failure if that ever changes,
  rather than a hang waiting for a password prompt.
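
The two-line fix can be sketched as follows. This is a minimal,
self-contained illustration of the map + fall-through behavior described
above — the map contents and the helper's signature are assumptions, not
the actual source:

```go
package main

import "path"

// Hypothetical sketch of the runtime → base-path map. Only claude-code's
// entry and the default fall-through changed; langgraph/external stay on
// /opt/configs pending the migration audit.
var workspaceFilePathPrefix = map[string]string{
	"hermes":      "/home/ubuntu/.hermes",
	"claude-code": "/configs", // the added entry
	"langgraph":   "/opt/configs",
	"external":    "/opt/configs",
}

func resolveWorkspaceFilePath(runtime, file string) string {
	base, ok := workspaceFilePathPrefix[runtime]
	if !ok {
		base = "/configs" // default flipped from /opt/configs
	}
	return path.Join(base, file)
}

func main() {
	println(resolveWorkspaceFilePath("claude-code", "config.yaml"))
	println(resolveWorkspaceFilePath("unknown-runtime", "config.yaml"))
}
```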

Tests:
- TestResolveWorkspaceFilePath_KnownRuntimes adds claude-code +
  CLAUDE-CODE coverage and updates the empty/unknown default cases
  to expect /configs. The langgraph/external rows stay green
  (unchanged values), confirming the scope of the rename.

Verification:
- go build ./... clean
- go test ./internal/handlers/ green
- The user-reported bug
  (PUT /workspaces/57fb7043-79a0-4a53-ae4a-efb39deb457f/files/config.yaml
   → 500 EACCES on /opt/configs) is the failure mode this fix addresses
  on both axes (path + sudo).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 14:29:08 -07:00
Hongming Wang
707e4d7342 Memory v2 wiring: replace decorative tests with real integration
Self-review of #2755 found two tests that didn't actually exercise the
production code path:

- TestNamespaceCleanupFn_NamespaceFormat asserted
  "workspace:" + "abc-123" == "workspace:abc-123" — a compile-time
  invariant, not runtime behavior. Provided no protection if the closure
  in Bundle.NamespaceCleanupFn ever stopped using that prefix.

- TestNamespaceCleanupFn_FailureLogsButReturns built a *parallel*
  cleanup closure inline with errors.New, then invoked the parallel
  closure. The production closure was never exercised. A regression
  in NamespaceCleanupFn (e.g. forgetting the deferred recover, calling
  the plugin without nil-check) would still pass this test.

Replaced both with real integration:

- TestNamespaceCleanupFn_HitsPluginAtCorrectNamespace spins up
  httptest.Server, points MEMORY_PLUGIN_URL at it, calls Build(),
  invokes the production closure, and asserts the server actually
  saw DELETE /v1/namespaces/workspace:abc-123.

- TestNamespaceCleanupFn_PluginErrorDoesNotPanic exercises the
  failure path for real: server returns 500 on DELETE, closure must
  log and return without propagating. defer-recover is belt-and-
  suspenders since production calls this from a for-loop in
  workspace_crud.go that has no recover.
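
The httptest pattern the replacement tests use can be sketched like this —
a compressed, illustrative version, not the production NamespaceCleanupFn
or the real test:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// capturedDeletePath stands up an httptest.Server, fires a namespace
// DELETE at it, and returns the path the server actually saw — the
// assertion style the real integration test uses.
func capturedDeletePath(workspaceID string) string {
	var seen string
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method == http.MethodDelete {
			seen = r.URL.Path
		}
		w.WriteHeader(http.StatusNoContent)
	}))
	defer srv.Close()

	req, err := http.NewRequest(http.MethodDelete,
		srv.URL+"/v1/namespaces/workspace:"+workspaceID, nil)
	if err != nil {
		return ""
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return ""
	}
	resp.Body.Close()
	return seen
}

func main() {
	fmt.Println(capturedDeletePath("abc-123"))
}
```

Asserting on what the server received (rather than on string
concatenation) is what makes the test exercise runtime behavior.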

Couldn't ship with #2755 because the merge queue locks the branch
once enqueued. Following up now that #2755 is merged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 10:38:59 -07:00
Hongming Wang
46731729d4 Memory v2 fixup (Critical): wire plugin from main.go (was fully dormant)
Caught during continued review: the entire v2 plugin system shipped
in PRs #2729-#2742 + #2744-#2751 was never actually invoked because
main.go and router.go don't construct the plugin client/resolver or
attach the WithMemoryV2 / WithNamespaceCleanup hooks.

Operators setting MEMORY_PLUGIN_URL=... saw zero behavior change
because nothing read it. Every fixup we shipped (idempotency, verify
mode, expires_at validation, audit JSON, namespace cleanup, O(N)
export, boot E2E) was also dormant for the same reason.

Root cause: when a multi-handler feature lands across many PRs, none
of them are individually responsible for wiring main.go — and the
master-task-tracking issue didn't gate-check that the wiring landed.
Add main.go integration to every multi-handler RFC checklist.

What ships:

  * internal/memory/wiring/wiring.go: new package that constructs the
    plugin client + resolver from MEMORY_PLUGIN_URL once. Returns nil
    when unset (preserves zero-config legacy behavior). Probes
    /v1/health at boot but doesn't fail-closed — the MCP layer's
    circuit breaker handles ongoing unavailability.

  * internal/memory/wiring/wiring_test.go: 6 tests covering the
    nil/non-nil bundle paths + the namespace-cleanup closure
    contract (nil-safe, format-stable, failure-tolerant).

  * cmd/server/main.go: imports memwiring, calls Build(db.DB) once
    after WorkspaceHandler creation, attaches WithNamespaceCleanup,
    threads the bundle through router.Setup.

  * internal/router/router.go: Setup signature gains *memwiring.Bundle
    param. Inside, attaches WithMemoryV2 to AdminMemoriesHandler and
    MCPHandler when the bundle is non-nil.

After this, the v2 plugin is reachable end-to-end:

  Operator sets MEMORY_PLUGIN_URL → main.Build instantiates client +
  resolver → WorkspaceHandler gets cleanup hook → router wires
  AdminMemoriesHandler + MCPHandler with WithMemoryV2 → MCP tool
  calls (commit_memory_v2, search_memory, etc.) actually do
  something → admin export/import respects MEMORY_V2_CUTOVER.

Prerequisite for #292 (staging verification) — without this, the
operator runbook's step 2 (set MEMORY_PLUGIN_URL, observe behavior)
silently no-ops.

Verified: all 9 affected test packages still green
(memory/{client,contract,e2e,namespace,pgplugin,wiring}, handlers,
router, plus the build).
2026-05-04 10:22:30 -07:00
Hongming Wang
9f47ecf86e Merge branch 'staging' into fix/memory-v2-i3-export-on
2026-05-04 09:44:37 -07:00
Hongming Wang
ebc20794f3 fix(admin-memories): include each member's private namespace in export
ReadableNamespaces(rootID) returns {workspace:rootID, team:rootID,
org:rootID} — the only workspace: namespace it surfaces is the root's own.
The I3 batching change resolved namespaces once per root which silently
dropped every child workspace's private memories from admin export
(workspace:childID never reached the plugin search).

Keep the per-root batching win for team:/org:/custom: namespaces;
inject each member's workspace:<id> + owner mapping explicitly so
coverage matches the legacy per-workspace iteration.

Cost stays at 1 SQL + N_roots resolver + 1 plugin search.
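
The union-building step can be sketched as follows — names and shapes are
illustrative, not the handler's actual code:

```go
package main

import "fmt"

// exportNamespaces keeps the per-root batching (rootNamespaces comes from
// one resolver call per root) and then injects each member workspace's
// private workspace:<id> namespace explicitly — the fix described above.
func exportNamespaces(rootNamespaces, memberIDs []string) []string {
	set := map[string]bool{}
	out := []string{}
	add := func(ns string) {
		if !set[ns] {
			set[ns] = true
			out = append(out, ns)
		}
	}
	for _, ns := range rootNamespaces {
		add(ns)
	}
	for _, id := range memberIDs {
		add("workspace:" + id) // child namespaces no longer dropped
	}
	return out
}

func main() {
	// 3 workspaces under 1 root → 5 namespaces (3 workspace + team + org)
	fmt.Println(exportNamespaces(
		[]string{"workspace:root", "team:root", "org:root"},
		[]string{"root", "child-1", "child-2"},
	))
}
```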

Test changes:
- New TestExport_IncludesEveryMembersPrivateNamespace uses a
  per-workspace resolver stub (mirrors real behavior) and asserts
  every member's workspace:<id> reaches the plugin search AND that
  children's private memories appear in the response with correct
  owner attribution. Verified to FAIL on the pre-fix code.
- TestExport_BatchesPluginCallsByRoot updated to expect 5 namespaces
  (3 workspace + team + org) instead of 3 — it had pinned the buggy
  3-namespace behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 09:44:06 -07:00
Hongming Wang
6b445aae2d Memory v2 fixup I5: workspace purge cleans up plugin namespace
Self-review #291. When a workspace is hard-purged, its
`workspace:<id>` namespace stays in the plugin storage. Over time
deleted workspaces accumulate as orphan namespaces.

Fix: optional namespaceCleanupFn hook on WorkspaceHandler. The
purge path (workspace_crud.go ~line 520) iterates each purged id
and calls the hook best-effort. main.go wires the hook to
plugin.DeleteNamespace when MEMORY_PLUGIN_URL is set; operators
who haven't enabled the plugin keep the no-op default.

Why a hook (not direct plugin import):
  * Keeps WorkspaceHandler decoupled from the memory contract
    package (easier to test, smaller blast radius if the contract
    bumps)
  * Tests inject a captureCleanupHook stub without standing up a
    real plugin client
  * Production wiring stays a one-liner in main.go
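
The hook shape can be sketched as below — field and option names are
illustrative, not the handler's actual API:

```go
package main

import "fmt"

// WorkspaceHandler carries a nil-able cleanup hook; the no-op default is
// just the nil func.
type WorkspaceHandler struct {
	namespaceCleanupFn func(workspaceID string)
}

// WithNamespaceCleanup is the one-liner production wiring attaches in
// main.go (illustrative functional-option shape).
func WithNamespaceCleanup(fn func(string)) func(*WorkspaceHandler) {
	return func(h *WorkspaceHandler) { h.namespaceCleanupFn = fn }
}

// purge sketches the loop body: call the hook best-effort per purged id,
// guarded so operators without the plugin keep the no-op default.
func (h *WorkspaceHandler) purge(ids []string) {
	for _, id := range ids {
		if h.namespaceCleanupFn != nil {
			h.namespaceCleanupFn(id)
		}
	}
}

func main() {
	var cleaned []string
	h := &WorkspaceHandler{}
	WithNamespaceCleanup(func(id string) { cleaned = append(cleaned, id) })(h)
	h.purge([]string{"a", "b"})
	fmt.Println(cleaned)

	(&WorkspaceHandler{}).purge([]string{"c"}) // nil hook: skipped, no panic
}
```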

What gets cleaned up:
  * `workspace:<id>` for each purged workspace
  * NOT `team:<root>` / `org:<root>` — those may still be
    referenced by other workspaces under the same root, so dropping
    them on a single workspace's purge would orphan team/org data
    for the survivors. Operator can purge those manually after
    confirming the entire root is gone.

What stays untouched:
  * Soft-removed workspaces (status='removed', no ?purge=true). The
    grace window is by design — the data should still be there if
    the operator unremoves.

Tests:
  * TestWithNamespaceCleanup_DefaultIsNil pins the safe default
  * TestWithNamespaceCleanup_NilStaysNil pins the explicit-nil case
  * TestWithNamespaceCleanup_AttachesFn pins the wiring
  * TestPurge_CallsCleanupHookPerID exercises the per-id loop body
  * TestPurge_NilHookIsSkipped pins the nil guard

A full end-to-end Delete-handler test requires mocking broadcaster
+ provisioner + descendant SQL chain, which is out-of-scope for a
single fixup. Integration coverage for the wired path lives in
PR-11's E2E swap test (#293 follow-up).
2026-05-04 09:20:37 -07:00
Hongming Wang
9a64aeaa2c Memory v2 fixup I3: admin export O(workspaces) → O(N_roots+1)
Self-review #289. The previous exportViaPlugin ran one resolver CTE
walk + one plugin search PER WORKSPACE. For a 1000-workspace tenant
that's 1000× of each, mostly redundant — workspaces sharing a
team/org root see identical readable namespaces.

New strategy:
  1. Single SQL pass returns each workspace + its computed root_id
     via a recursive CTE (loadWorkspacesWithRoots).
  2. Group by root → unique tree count is typically << workspace
     count.
  3. Resolver runs ONCE per root (any member sees the same readable
     list).
  4. Build the union of all root namespaces; single plugin.Search
     call.
  5. Map each memory back to a workspace_name via pickOwnerForNamespace
     (workspace:<id> → matching member; team:* / org:* / custom:* →
     canonical first member of root group).

Net call cost: 1 SQL + N_roots resolver + 1 plugin call (vs
N_workspaces × resolver + N_workspaces × plugin in the old code).
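
Step 2 of the strategy can be sketched as a plain group-by — the row type
is an illustrative stand-in for the loadWorkspacesWithRoots output:

```go
package main

import "fmt"

// wsRow is a stand-in for a workspace row with its computed root_id.
type wsRow struct{ ID, RootID string }

// groupByRoot buckets workspaces by root so the resolver runs once per
// root rather than once per workspace.
func groupByRoot(rows []wsRow) map[string][]string {
	groups := map[string][]string{}
	for _, r := range rows {
		groups[r.RootID] = append(groups[r.RootID], r.ID)
	}
	return groups
}

func main() {
	rows := []wsRow{{"a", "r1"}, {"b", "r1"}, {"c", "r1"}, {"d", "r2"}}
	g := groupByRoot(rows)
	// 4 workspaces collapse to 2 roots → 2 resolver calls, not 4
	fmt.Println(len(g), len(g["r1"]))
}
```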

Tests:
  * TestExport_BatchesPluginCallsByRoot pins the new behavior
    explicitly: 3 workspaces under 1 root → exactly 1 plugin search
    (was 3 with the old code).
  * TestPickOwnerForNamespace covers all five attribution cases:
    workspace:<id> match, workspace:<id> no-match-fallback, team:*,
    org:*, custom:* → first-member-of-root-group; plus empty-members
    fallback.
  * All 9 existing TestExport_* / TestImport_* / TestPickOwner /
    TestNamespaceKindFromLegacyScope / TestSkipImport / etc. tests
    remain green (verified with -run "Export").

The legacy DB path (when MEMORY_V2_CUTOVER unset) is unchanged.
2026-05-04 09:17:30 -07:00
Hongming Wang
d297e75fc9 Merge pull request #2746 from Molecule-AI/fix/memory-v2-i1-i4-small
Memory v2 fixup I1+I4: expires_at validation + audit JSON marshal
2026-05-04 16:05:02 +00:00
Hongming Wang
d48693144b Memory v2 fixup I1+I4: expires_at validation + audit JSON marshal
Two small Important findings from self-review, bundled because both
are <20 line changes touching the same file.

I1: expires_at silent drop
  - mcp_tools_memory_v2.go:130 had `if t, err := ...; err == nil { ... }`
    which dropped malformed timestamps without telling the agent.
    Agent passes `expires_at: "tomorrow"`, gets a 200, and the memory
    has no TTL.
  - Now returns a clear error: "invalid expires_at: must be RFC3339"
  - Test renamed: TestCommitMemoryV2_BadExpiresIsIgnored (which
    codified the bug) → TestCommitMemoryV2_BadExpiresReturnsError
    (which pins the fix).
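
The shape of the I1 fix can be sketched like this; the helper name is
hypothetical, and only the error-instead-of-silent-drop behavior is what
the commit describes:

```go
package main

import (
	"fmt"
	"time"
)

// parseExpiresAt surfaces malformed timestamps as errors instead of
// silently dropping the TTL.
func parseExpiresAt(raw string) (time.Time, error) {
	t, err := time.Parse(time.RFC3339, raw)
	if err != nil {
		return time.Time{}, fmt.Errorf("invalid expires_at: must be RFC3339")
	}
	return t, nil
}

func main() {
	if _, err := parseExpiresAt("tomorrow"); err != nil {
		fmt.Println(err) // the agent now sees this instead of a silent 200
	}
	t, _ := parseExpiresAt("2026-06-01T00:00:00Z")
	fmt.Println(t.Year())
}
```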

I4: audit log JSON via Sprintf-%q
  - auditOrgWrite was building activity_logs.metadata via fmt.Sprintf
    with %q. Go-quoted strings happen to coincide with JSON-quoted
    for ASCII (and today's values are pure ASCII: UUID + hex digest)
    so the bug was latent.
  - Replaced with json.Marshal of map[string]string. Same wire shape
    today, but won't silently produce invalid JSON if metadata grows
    to include arbitrary content snippets.
  - New test TestAuditOrgWrite_MetadataIsValidJSON uses a custom
    sqlmock.Argument matcher (jsonValidMatcher) that fails the test
    if the metadata column isn't parseable JSON. The test runs
    auditOrgWrite with a content string containing quotes,
    backslashes, and a control byte — values where %q would diverge
    from JSON-quote.

Both pre-existing tests (TestCommitMemoryV2_AuditsOrgWrites etc.)
remain green.
2026-05-04 08:57:58 -07:00
Hongming Wang
1e97fb9a16 Memory v2 fixup C1: backfill idempotency via MemoryWrite.id
Self-review (post-merge) flagged that the backfill claimed to be
idempotent on re-run but actually duplicates every row because the
plugin's INSERT uses gen_random_uuid() and ignores any id passed in.

Fix is contract-level: extend MemoryWrite with an optional `id`
idempotency key. When supplied, the plugin MUST treat the write as
upsert keyed on this id; when omitted, the plugin generates a fresh
UUID (production agent commits keep working unchanged).

Changes:
  * docs/api-protocol/memory-plugin-v1.yaml: add id field with
    description that flags it as idempotency key
  * internal/memory/contract/contract.go: add ID to MemoryWrite struct,
    update memory_write_minimal golden vector
  * internal/memory/pgplugin/store.go: split CommitMemory into two
    paths — upsert when body.ID set (INSERT ... ON CONFLICT (id) DO
    UPDATE), plain INSERT otherwise
  * cmd/memory-backfill/main.go: pass agent_memories.id to MemoryWrite,
    fix the false comment about 409 deduplication
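
The two-path split can be sketched as SQL selection — the statements below
are illustrative, not the plugin's actual queries:

```go
package main

import "fmt"

// commitSQL picks the statement shape: upsert keyed on the supplied
// idempotency id, or plain insert with a plugin-generated UUID.
func commitSQL(id string) string {
	if id != "" {
		// idempotency key supplied → re-runs overwrite, never duplicate
		return `INSERT INTO memory_records (id, namespace, content)
VALUES ($1, $2, $3)
ON CONFLICT (id) DO UPDATE SET content = EXCLUDED.content`
	}
	// no key → hot path unchanged, fresh UUID per write
	return `INSERT INTO memory_records (id, namespace, content)
VALUES (gen_random_uuid(), $1, $2)`
}

func main() {
	fmt.Println(commitSQL("550e8400-e29b-41d4-a716-446655440000"))
	fmt.Println(commitSQL(""))
}
```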

New tests:
  * pgplugin: TestCommitMemory_WithIDUpserts pins the upsert SQL is
    used when id is set; TestCommitMemory_UpsertScanError covers the
    error branch
  * backfill: TestBackfill_PassesSourceUUIDAsIdempotencyKey pins the
    forwarding behavior; TestBackfill_RerunIsIdempotent simulates a
    retry and asserts both runs pass the same uuid (plugin upsert is
    what makes this safe)

Why this matters: operators retrying a failed backfill (which they
will — networks fail, transactions abort) would otherwise create N
duplicates per memory. The duplicates aren't visible until search
results show obvious dupes — debugging that under prod load is bad.

Production agent commits are unaffected: they leave id empty, the
plugin generates a fresh UUID via gen_random_uuid(), zero behavior
change for the hot path.
2026-05-04 08:54:13 -07:00
Hongming Wang
b07575c710 Merge branch 'staging' into feat/memory-v2-pr11-e2e-swap
2026-05-04 08:24:26 -07:00
Hongming Wang
b937415e1e Memory v2 PR-11: E2E test — flat-plugin swap proves contract works
Final implementation PR. Builds on PR-1..10 (all merged or queued).

Proves the central design property of the plugin contract: ANY
plugin satisfying the v1 OpenAPI spec works as a drop-in replacement
for the built-in postgres plugin. If this test fails after a refactor,
the contract has drifted in a way that breaks ecosystem plugins.

What ships:
  * internal/memory/e2e/swap_test.go — five E2E tests against a
    deliberately minimal "flat-memory" stub plugin (~50 LOC, single
    map, zero capabilities)
  * MCPHandler.Dispatch — small exported wrapper around dispatch so
    out-of-package E2E tests can drive tools by name without
    duplicating the whole MCP RPC stack

E2E coverage:
  * TestE2E_FlatPluginRoundTrip: full lifecycle
    - list_writable_namespaces returns 3 entries
    - commit_memory_v2 writes through plugin
    - search_memory finds it back
    - commit_summary writes a summary
    - forget_memory deletes
    - search after forget excludes the deleted memory

  * TestE2E_LegacyShimRoutesThroughFlatPlugin: PR-6 shim wired up
    - Legacy commit_memory(scope=LOCAL) ends up in plugin storage
    - Legacy recall_memory finds it back through plugin search
    - Response shapes preserved (scope:LOCAL stays scope:LOCAL)

  * TestE2E_OrgMemoriesDelimiterWrap: prompt-injection mitigation
    - Org-namespace memory committed
    - Audit INSERT into activity_logs verified
    - Search returns content with [MEMORY id=... scope=ORG ns=...]
      prefix applied

  * TestE2E_StubPluginCapabilitiesAreEmpty: capability negotiation
    - Stub plugin reports zero capabilities
    - Client.SupportsCapability returns false for FTS, embedding
    - Confirms graceful degradation when plugin doesn't support a
      feature

  * TestE2E_PluginUnreachable_AgentSeesClearError: failure surface
    - Plugin URL pointing at bogus port
    - commit_memory_v2 returns informative error
    - No nil-pointer dereference; error message is actionable

The flat plugin is intentionally minimal — it has no namespaces table
distinct from memory records, no FTS, no semantic search, no TTL. The
test proves operators can drop in a 50-line plugin and the agent
behavior is identical (modulo capability-gated features).
2026-05-04 08:20:35 -07:00
Hongming Wang
7b0bd32957 Memory v2 PR-8: cutover — admin export/import via plugin
Builds on merged PR-1..7. Adds the operator-controlled cutover flag
that flips admin export/import from the legacy direct-DB path to the
v2 plugin path.

Activation: MEMORY_V2_CUTOVER=true AND the v2 plugin is wired via
WithMemoryV2. Both must be true to take the new path; either being
false falls through to the existing legacy SQL code unchanged.

What ships:
  * AdminMemoriesHandler gains plugin + resolver fields, wired via
    WithMemoryV2 (production) / withMemoryV2APIs (tests)
  * Export: enumerates workspaces, asks resolver for each one's
    readable namespaces, searches each via plugin, deduplicates by
    memory id, applies SAFE-T1201 redaction on emitted content
    (F1084 parity). Returns the legacy memoryExportEntry shape so
    existing tooling keeps working.
  * Import: scope→namespace translation mirrors PR-6 shim. Uses
    UpsertNamespace + CommitMemory; runs SAFE-T1201 redaction
    BEFORE the plugin sees the content (F1085 parity).
  * Helpers: legacyScopeFromNamespace + namespaceKindFromLegacyScope
    (lifted out so admin_memories doesn't depend on MCP handler
    helpers). skipImport typed error.

Operational rollout (cutover sequencing):
  1. Today: MEMORY_V2_CUTOVER unset → legacy DB path.
  2. After PR-7 backfill applied + smoke verified: operator sets
     MEMORY_V2_CUTOVER=true.
  3. From that point, admin export/import operate on plugin
     storage; legacy agent_memories table is read-only for the
     ~60-day grace window before PR-9 drops it.

Coverage on new paths:
  * cutoverActive: 100%
  * WithMemoryV2 / withMemoryV2APIs: 100%
  * importViaPlugin: 100%
  * exportViaPlugin: 97.2% (one defensive scan-error branch in the
    workspace-list loop)
  * scopeToWritableNamespaceForImport: 76.9% (resolver-error and
    no-matching-kind branches exercised end-to-end via Import)
  * legacyScopeFromNamespace + namespaceKindFromLegacyScope: 100%

Edge cases pinned:
  * Cutover flag matrix (env unset/true/false × wired/unwired)
  * Export deduplicates memories shared across team (one row per id)
  * Export tolerates per-workspace failures (resolver / plugin) and
    keeps going on the rest
  * Export returns 500 only when the top-level workspace query fails
  * Empty readable namespaces → empty export (no panic)
  * Export redacts secrets in plugin path
  * Import: unknown workspace skipped, unknown scope skipped,
    plugin upsert/commit errors counted as errors
  * Import redacts secrets BEFORE plugin sees content
  * Legacy export/import path unchanged when cutover flag unset
2026-05-04 08:15:10 -07:00
Hongming Wang
290e6dfdc3 Memory v2 PR-6: backward-compat shim — legacy tools route to v2
Builds on merged PR-1..5. Adds the bridge that lets legacy
commit_memory / recall_memory tools route through the v2 plugin path
when MEMORY_PLUGIN_URL is wired, otherwise fall through to the
existing DB-backed code unchanged.

What ships:
  * handlers/mcp_tools_memory_legacy_shim.go — translation helpers:
      scopeToWritableNamespace, scopeToReadableNamespaces,
      commitMemoryLegacyShim, recallMemoryLegacyShim,
      namespaceKindToLegacyScope
  * handlers/mcp_tools.go — toolCommitMemory + toolRecallMemory now
    delegate to the shim when memv2 is wired

Translation:
  commit:  LOCAL  → workspace:<self>
           TEAM   → team:<root>     (resolver picks at runtime)
           empty  → defaults to LOCAL (preserves legacy default)
           GLOBAL → still rejected at MCP bridge (C3 preserved)
  recall:  LOCAL  → search restricted to workspace:<self>
           TEAM   → workspace:<self> + team:<root>
           empty  → all readable (matches v2 default behavior)
           GLOBAL → blocked at MCP bridge (C3 preserved)
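
The commit-side half of the table can be sketched as a switch — the helper
signature is an assumption, only the mapping is from the table above:

```go
package main

import (
	"errors"
	"fmt"
)

// scopeToWritableNamespace translates a legacy scope to a v2 namespace:
// empty defaults to LOCAL, GLOBAL stays rejected (C3 preserved).
func scopeToWritableNamespace(scope, selfID, rootID string) (string, error) {
	switch scope {
	case "", "LOCAL":
		return "workspace:" + selfID, nil
	case "TEAM":
		return "team:" + rootID, nil
	case "GLOBAL":
		return "", errors.New("GLOBAL writes are blocked at the MCP bridge")
	default:
		return "", fmt.Errorf("invalid scope %q", scope)
	}
}

func main() {
	ns, _ := scopeToWritableNamespace("", "self-1", "root-1")
	fmt.Println(ns)
	_, err := scopeToWritableNamespace("GLOBAL", "self-1", "root-1")
	fmt.Println(err)
}
```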

Response shapes are preserved exactly:
  commit: {"id":"...","scope":"LOCAL"|"TEAM"} — agents see no diff
  recall: [{"id":"...","content":"...","scope":"LOCAL"|...,"created_at":"..."}, ...]
  org-namespace memories get the same [MEMORY id=... scope=ORG ns=...]
  prefix as v2 search; legacy scope label comes back as "GLOBAL"

Operational rollout:
  * Today: MEMORY_PLUGIN_URL unset on most operators → legacy DB path
  * After PR-7 backfill: operators set MEMORY_PLUGIN_URL → all writes
    flow through plugin transparently
  * After PR-8 cutover: dual-write removed, plugin is the only path
  * After PR-9 (~60 days later): legacy tool entries dropped entirely

Coverage: 100% on every helper, 100% on recallMemoryLegacyShim,
94.7% on commitMemoryLegacyShim. The 1 uncovered line is a defensive
guard against a v2-response-parse error that's unreachable when the
v2 tool is operating correctly (it always returns valid JSON).

Edge cases pinned:
  * scope translation for every legacy value + invalid scope
  * resolver error propagation
  * plugin error propagation
  * GLOBAL still blocked
  * default-scope fallback (LOCAL)
  * empty content rejected
  * No-op when v2 unwired (legacy SQL path exercised via sqlmock)
  * org-namespace memory wrap on recall + GLOBAL scope label round-trip
  * No-results returns "No memories found." (legacy message preserved)
2026-05-04 08:01:41 -07:00
Hongming Wang
5bfa4b1d80 Memory v2 PR-5: 6 new MCP tools wired through the plugin
Builds on PR-1, PR-2, PR-3, PR-4 (all merged). Adds the agent-facing
v2 surface for the memory plugin contract.

What ships (all in handlers/mcp_tools_memory_v2.go, no edits to
the legacy commit_memory / recall_memory paths):

  commit_memory_v2   — write to a namespace; default workspace:self
  search_memory      — search across namespaces; default = all readable
  commit_summary     — kind=summary, 30-day default TTL, runtime-overridable
  list_writable_namespaces — discover what you can write to
  list_readable_namespaces — discover what you can read from
  forget_memory      — delete by id, only in namespaces you can write to

Workspace-server is the security perimeter — every layer the plugin
mustn't be trusted with runs here:

  * SAFE-T1201 redactSecrets BEFORE every plugin write
  * Server-side ACL re-validation: CanWrite + IntersectReadable run
    on EVERY request, never trusting client-supplied namespaces (a
    canvas re-parent between list_writable and commit would otherwise
    let a stale namespace slip through)
  * org:* writes audited to activity_logs (SHA256, not plaintext) —
    matches memories.go:201-221 so the schema stays uniform
  * Audit failure does NOT block the write (logged + continue) —
    failing closed would deny org-scope writes whenever activity_logs
    is unhappy
  * org:* memories get the [MEMORY id=... scope=ORG ns=...]: prefix
    on read — preserves the prompt-injection mitigation from
    memories.go:455-461

Coexistence design: legacy commit_memory + recall_memory still wired
to their old code paths in mcp_tools.go. PR-6 will alias them to
delegate to these v2 implementations. PR-9 (60 days post-cutover)
removes the legacy entries.

Wiring:
  * MCPHandler gains a memv2 field (nil-safe; tools return a clear
    error when MEMORY_PLUGIN_URL is unset rather than crashing)
  * WithMemoryV2(plugin, resolver) is the production wiring API
    main.go calls at boot
  * withMemoryV2APIs(plugin, resolver) is the test-injectable variant
    against the memoryPluginAPI / namespaceResolverAPI interfaces

Coverage: 100.0% on every new function in mcp_tools_memory_v2.go.

Edge cases pinned:
  * empty/whitespace content → reject before plugin
  * plugin unconfigured → clear error, no crash
  * ACL violation → clear error
  * resolver error → wrapped error
  * plugin error → wrapped error
  * malformed expires_at → silently ignored (no exception)
  * org write audit failure → logged, write proceeds
  * search namespace intersection drops foreign entries
  * search with all-foreign namespaces → empty result, plugin not called
  * search org memories get delimiter wrap, workspace memories do not
  * forget with explicit + default namespace
  * forget cross-scope rejected
  * pickStr / pickStringSlice handle missing keys, wrong types, mixed slices
  * wrapOrgDelimiter format is exact-match
  * dispatch wires all 6 tools (no "unknown tool" error)
2026-05-04 07:50:26 -07:00
Hongming Wang
f2397bf138 Merge pull request #2733 from Molecule-AI/feat/memory-v2-pr3-postgres-plugin
Memory v2 PR-3: built-in postgres plugin server + schema migrations
2026-05-04 14:37:24 +00:00
Hongming Wang
ff5f4cbf7c Memory v2 PR-3: built-in postgres plugin server + schema migrations
Builds on merged PR-1 (#2729), independent of PR-2/PR-4.

Implements every endpoint of the v1 plugin contract behind an HTTP
server (cmd/memory-plugin-postgres/) backed by postgres. Operators
run this binary next to workspace-server; it's the default
implementation MEMORY_PLUGIN_URL points at.

What ships:
  - cmd/memory-plugin-postgres/main.go: boot, signal-driven shutdown,
    boot-time migrations, configurable LISTEN/DATABASE/MIGRATION_DIR
  - cmd/memory-plugin-postgres/migrations/001_memory_v2.up.sql:
      memory_namespaces (PK on name, kind CHECK, expires_at, metadata)
      memory_records (FK to namespaces with CASCADE, kind+source CHECK,
                      pgvector embedding, FTS tsvector, ivfflat partial
                      index on embedding, partial index on expires_at)
  - internal/memory/pgplugin/store.go: storage layer using lib/pq
  - internal/memory/pgplugin/handlers.go: HTTP layer (no router dep —
    a switch on URL.Path keeps the binary's dep surface tiny)
  - 100% statement coverage on store.go + handlers.go

Schema notes:
  - These tables live next to the plugin binary, NOT in workspace-
    server/migrations/. When operators swap the plugin, these tables
    become orphaned (operator drops manually). Documented in PR-10.
  - Search supports semantic (pgvector cosine) → FTS (>=2 char query)
    → ILIKE (1-char query) → recent-listing (no query), with a TTL
    filter applied uniformly across all paths.
  - DELETE on namespace cascades to memory_records (FK ON DELETE
    CASCADE) — a deleted namespace immediately frees its memories.

Coverage corner cases pinned:
  - Health: ok, degraded (db ping fails), no-ping fn
  - Every CRUD endpoint: happy path, bad name, bad JSON, bad body,
    not-found, store errors, exec/scan/marshal errors
  - Search: FTS, semantic, short-query (ILIKE), no-query (recent),
    kinds filter, store errors, scan errors, mid-iteration row error
  - Routing edge cases: unknown path, empty namespace, unknown sub,
    method-not-allowed, GET on /v1/health (allowed), POST on /v1/health
    (404), GET on /v1/search (404)
  - Helper internals: marshalMetadata (nil/happy/unmarshalable),
    nullTime (nil/non-nil), vectorString (empty/format),
    nullVectorString (empty/non-empty), scanNamespace +
    scanMemory metadata-decode errors

No callers in workspace-server yet; integration starts in PR-5
(MCP handlers wire the plugin client through to MCP tools).
2026-05-04 07:31:56 -07:00
Hongming Wang
01b653d6b0 Memory v2 PR-4: namespace resolver + tests
Stacked on PR-1 (#2729). Computes the readable/writable namespace lists
for a workspace from the live workspaces tree at request time. No
precomputed columns, no migrations — re-parenting on canvas takes
effect immediately on the next memory call.

What ships:
  - workspace-server/internal/memory/namespace/resolver.go
    - walkChain: recursive CTE, walks parent_id chain to root, capped
      at depth 50 to defend against malformed/cyclic data
    - derive: maps a chain to (workspace, team, org) namespace strings
    - ReadableNamespaces / WritableNamespaces: the public API
    - CanWrite + IntersectReadable: server-side ACL helpers MCP
      handlers (PR-5) will call before talking to the plugin
  - resolver_test.go: 100% statement coverage

Design choices worth flagging:
  - Today's tree is depth-1 (root + children). The recursive CTE
    handles arbitrary depth so we don't have to revisit the resolver
    when the tree deepens.
  - GLOBAL→org write restriction (memories.go:167-174) is preserved
    by gating the org namespace's Writable flag on parent_id IS NULL.
  - Removed-status workspaces are NOT filtered from the chain walk —
    matches today's TEAM behavior (memories.go:367-372 filters on
    read, not on tree walk).
  - IntersectReadable with empty `requested` returns ALL readable
    namespaces (default-search-everything semantic from the discovery
    tools spec).
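
The empty-requested semantic can be sketched as below — an illustrative
stand-in, not the resolver's actual implementation:

```go
package main

import "fmt"

// intersectReadable drops any requested namespace the caller can't read;
// an empty requested list means "search everything readable".
func intersectReadable(readable, requested []string) []string {
	if len(requested) == 0 {
		return readable // default-search-everything
	}
	set := map[string]bool{}
	for _, ns := range readable {
		set[ns] = true
	}
	out := []string{}
	for _, ns := range requested {
		if set[ns] {
			out = append(out, ns) // foreign namespaces silently dropped
		}
	}
	return out
}

func main() {
	readable := []string{"workspace:a", "team:r", "org:r"}
	fmt.Println(intersectReadable(readable, nil))
	fmt.Println(intersectReadable(readable, []string{"team:r", "workspace:evil"}))
}
```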

This package has zero callers in this PR; integration starts in PR-5.
2026-05-04 07:25:33 -07:00
Hongming Wang
c1cff3169f Memory v2 PR-2: HTTP plugin client + breaker + capability negotiation
Builds on PR-1 (#2729). Implements every endpoint in the OpenAPI spec
plus two operational concerns the agent never sees:

  1. Capability negotiation. Boot/Refresh probes /v1/health and
     captures the plugin's capability list. MCP handlers (PR-5) ask
     SupportsCapability before exposing capability-gated features —
     e.g., agents can only request semantic search when "embedding"
     is reported.

  2. Circuit breaker. Three consecutive failures open the breaker for
     60 seconds; while open, calls fail fast with ErrBreakerOpen.
     Picked these constants because:
       - 3 failures: long enough to skip transient blips, short enough
         to react before all in-flight handlers stack on the timeout
       - 60s cooldown: long enough to back off a flapping plugin,
         short enough that recovery is felt within a single session
     4xx responses do NOT count toward the breaker (those are client
     bugs, not plugin health issues); 5xx + transport errors do.

What ships:
  - workspace-server/internal/memory/client/client.go
  - client_test.go: 100% statement coverage

Coverage corner cases pinned:
  - env-var success branches in New (parseDurationEnv applied)
  - json.Marshal error (via channel in Propagation)
  - http.NewRequestWithContext error (via unbalanced bracket in BaseURL)
  - 204 NoContent on endpoint that normally has a body
  - 4xx vs 5xx breaker behavior (4xx must NOT trip)
  - breaker cooldown elapsed → reset on next success
  - all 6 public endpoints fail-fast when breaker is open

This package has no callers in this PR; integration starts in PR-5.
2026-05-04 06:57:24 -07:00
Hongming Wang
53d823e719 Memory v2 PR-1: OpenAPI plugin contract + Go bindings
First of 11 PRs implementing the memory-system plugin refactor (RFC #2728).
This PR is pure additive scaffolding — no behavior change, no integration
yet. It defines the wire shape between workspace-server and a memory
plugin so PR-2 (HTTP client) and PR-3 (built-in postgres plugin) can be
built against a single source of truth.

What ships:
  - docs/api-protocol/memory-plugin-v1.yaml: OpenAPI 3.0.3 spec covering
    /v1/health, namespace upsert/patch/delete, memory commit, search,
    forget. Auth-free (private network only); workspace-server is the
    only sanctioned client and the security perimeter.
  - workspace-server/internal/memory/contract: typed Go bindings with
    Validate() methods on every wire object so both client (PR-2) and
    server (PR-3) self-check at the boundary.
  - Round-trip JSON tests for every type (catch asymmetric tag bugs).
  - 5 golden vector files under testdata/ pinning the exact wire shape;
    update via UPDATE_GOLDENS=1.

Coverage: 100% of statements in contract.go.

The validation rules encode design decisions worth flagging in review:
  - SearchRequest with empty Namespaces is REJECTED at plugin level —
    workspace-server is required to intersect the readable set
    server-side; an empty list reaching the plugin is a bug.
  - NamespacePatch with no fields is REJECTED — empty patches are
    pointless round-trips.
  - MemoryWrite with whitespace-only Content is REJECTED — zero-info
    memories pollute search results.

No code yet calls into this package; integration starts in PR-2.
2026-05-04 06:45:52 -07:00
Hongming Wang
be997883c9 Centralize backend selection in provisionWorkspaceAuto
User-reported 2026-05-04: deploying a team org-template ("Design
Director" + 6 sub-agents) on a SaaS tenant produced 7-of-7
WORKSPACE_PROVISION_FAILED with the misleading message
"container started but never called /registry/register". Diagnose
returned "docker client not configured on this workspace-server" and
the workspace rows had no instance_id.

Root cause: TeamHandler.Expand hardcoded h.wh.provisionWorkspace —
the Docker leg of WorkspaceHandler. WorkspaceHandler.Create branched
on h.cpProv to pick CP-managed EC2 (SaaS) vs local Docker
(self-hosted), but Expand never used that branch. On SaaS the docker
goroutine ran but had no socket, so children silently sat in
"provisioning" until the 600s sweeper marked them failed.

Architectural principle (user): templates own
runtime/config/prompts/files/plugins; the platform owns where it
runs. Backend selection belongs in one helper.

Fix:
- Extract WorkspaceHandler.provisionWorkspaceAuto: picks CP when
  cpProv is set, Docker when only provisioner is set, returns false
  when neither (caller marks failed).
- WorkspaceHandler.Create routes through Auto.
- TeamHandler.Expand routes through Auto.

Tests pin three invariants:
- TestProvisionWorkspaceAuto_NoBackendReturnsFalse — Auto signals
  fall-through correctly so the caller can persist + mark-failed.
- TestProvisionWorkspaceAuto_RoutesToCPWhenSet — when cpProv is
  wired, Start lands on CP (the user-visible regression target).
  Discipline-verified: removing the cpProv branch fails this.
- TestTeamExpand_UsesAutoNotDirectDockerPath — source-level guard
  against future refactors reintroducing the hardcoded Docker call.
  Discipline-verified: reverting team.go fails this with a clear
  message naming the bug class.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 03:43:41 -07:00
Hongming Wang
bcea8ac822 Broaden empty-URL 422 to cover NULL delivery_mode (production reality)
Live-probed user's tenant: three of three external-runtime workspaces
register with delivery_mode = NULL, not "poll". The earlier narrow
poll-only check fell through to the misleading 503 for the actually-
observed shape.

Invariant we want: URL empty + not-exactly-"push" → no dispatch path
will ever exist → 422. Only push-mode with empty URL is genuinely
transient (mid-boot, restart in progress) → 503.

Added TestChatUpload_NullModeEmptyURL using the user's actual workspace
ID. Existing TestChatUpload_NoURL switched to explicit "push" mode
(was relying on default — unsafe given the new branching).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 02:42:46 -07:00
Hongming Wang
87ae691e67 Distinguish poll-mode workspace from transient empty-URL on chat upload
External-runtime workspaces that register in poll mode have no callback
URL by design — the platform never dispatches to them, so chat upload
(HTTP-forward by design) can't proceed. Returning 503 + "workspace url
not registered yet" was misleading: the "yet" implied transient state,
but the URL would never arrive.

Caught externally on 2026-05-04: user uploading an image to an external
"mac laptop" runtime workspace saw the 503 and assumed they should
retry. The workspace's poll mode meant retrying would never help.

Fix: include delivery_mode in the workspace lookup. When URL is empty:
- poll mode → 422 + "re-register in push mode with a public URL"
  (Unprocessable Entity — this request can't succeed against this
  workspace's configuration; no retry will help)
- push mode → 503 + "not registered yet" (genuine transient state —
  retry after next heartbeat is correct)

Test: TestChatUpload_PollModeEmptyURL pins the new 422 path; existing
TestChatUpload_NoURL strengthened to assert the "not registered yet"
substring stays on the push branch (it would have silently passed if
the new 422 path had clobbered both branches).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 02:42:46 -07:00
Hongming Wang
d5eb58af56 feat(external-connect): comprehensive setup — fix Claude Code channel snippet + add per-tab Help section
User report: handing the modal's Claude Code channel snippet to an
agent fails immediately with two errors that the snippet doesn't tell
the operator how to resolve:

  plugin:molecule@Molecule-AI/molecule-mcp-claude-channel · plugin not installed
  plugin:molecule@Molecule-AI/molecule-mcp-claude-channel · not on the approved channels allowlist

Root cause: the snippet's `claude --channels plugin:...` line assumes
the plugin is pre-installed AND that the channel is on Anthropic's
default allowlist. Both assumptions are wrong for a custom Molecule
plugin in a public repo.

Two changes:

1. Rewrite externalChannelTemplate (Go) with full setup chain:
   - Bun prereq check (channel plugins are Bun scripts)
   - `/plugin marketplace add Molecule-AI/molecule-mcp-claude-channel`
     + `/plugin install molecule@molecule-mcp-claude-channel` BEFORE the
     launch — otherwise "plugin not installed"
   - `--dangerously-load-development-channels` flag on launch — required
     for non-Anthropic-allowlisted channels, otherwise "not on the
     approved channels allowlist"
   - Common-errors block at the bottom mapping each error string to
     which numbered step recovers it
   - Team/Enterprise managed-settings caveat (the dev-channels flag is
     blocked there; admin must use channelsEnabled + allowedChannelPlugins)

   Plugin install info verified by reading `Molecule-AI/molecule-mcp-claude-channel`
   plugin.json (`name: "molecule"`) and the Claude Code channels +
   plugin-discovery docs at code.claude.com/docs/en/{channels,discover-plugins}.

2. Add per-tab HelpBlock to the modal (canvas):
   - Collapsible <details> below each snippet, closed by default so the
     snippet stays the visual focus
   - "Where to install" link (PyPI for runtime, claude.com for Claude
     Code, github.com/openai/codex for Codex, NousResearch/hermes-agent
     for Hermes)
   - "Documentation" link (docs.molecule.ai/docs/guides/*; hostname
     confirmed by existing blog post canonical metadata; paths map
     1:1 to docs/guides/*.md files in this repo)
   - "Common errors" list with concrete recovery steps for each tab
     (e.g. Codex tab calls out the codex≥0.57 requirement and TOML
     duplicate-table parse error; OpenClaw calls out the :18789 port
     conflict check)

   URL discipline: every URL is either (a) verified against a file path
   in this repo's docs/, (b) the canonical repo of an existing snippet
   reference, or (c) a well-known third-party canonical URL. No guessed
   URLs — broken links would defeat the purpose of "more comprehensive
   instructions."

Verification:
- `go build ./...` clean in workspace-server
- `go test ./internal/handlers/...` passes (4.3s)
- Bash syntax check on test_staging_full_saas.sh (no edits there) clean
- TS brace/paren/bracket counts balanced; no full tsc run because the
  worktree's node_modules isn't installed — counterpart Canvas tabs E2E
  on the PR will exercise the full type-check + render path

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 17:46:55 -07:00
Hongming Wang
ff0d4dae77 fix(external-connect): address self-review criticals — config corruption + durability
Self-review of the modal-tab additions caught footguns in the new
hermes/codex/openclaw snippets. Ship the fixes before merge.

Critical 1 — Hermes `cat >> ~/.hermes/config.yaml` corrupts existing
configs. Most existing hermes installs have a top-level gateway:
block; appending creates a duplicate, which YAML rejects. Replaced
the auto-append with explicit instructions: 'under your existing
gateway: block, add a plugin_platforms entry'.

Critical 2 — Codex `cat >> ~/.codex/config.toml` corrupts on
re-run. TOML rejects duplicate [mcp_servers.molecule] tables; a
second run breaks codex's config parse. Replaced the auto-append with a commented
config block + explicit 'open ~/.codex/config.toml in your editor
and paste'. Canvas-side token stamping still hits the literal in
the comment so the operator's clipboard has the real token already
substituted.

Required 3 — OpenClaw `onboard --non-interactive` missing
provider/model defaults. Added explicit --provider + --model
placeholders in a commented form so operators see what's needed
without a stub default applying silently.

Required 4 — OpenClaw gateway started with bare '&' dies on
terminal close. Switched to nohup + log file + disown, with a note
that systemd is the right answer for production.

Optional 5 + 6 (env_vars cleanup, tests) deferred — env_vars stripped
to keep the in-tree-vs-external surface narrow; tests for the new
response fields can land separately when external_connection.go is
next touched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 13:12:54 -07:00
Hongming Wang
eba0c5e3f1 feat(canvas): add Hermes/Codex/OpenClaw tabs to ExternalConnectModal + default to Universal MCP
The External Connect modal had tabs for Python SDK / curl / Claude Code
channel / Universal MCP. Operators using hermes / codex / openclaw as
their external runtime had no copy-paste; they pieced together
WORKSPACE_ID + PLATFORM_URL + auth_token into config files by reading
docs.

Adds three runtime-specific snippets stamped server-side:

- **Hermes** — installs molecule-ai-workspace-runtime + the
  hermes-channel-molecule plugin, exports the 4 env vars, and writes
  the gateway.plugin_platforms.molecule block into ~/.hermes/config.yaml.
  Same long-poll-based push semantics the Claude Code channel tab
  delivers (push parity with the in-tree template-hermes adapter).

- **Codex** — wires the molecule_runtime A2A MCP server into
  ~/.codex/config.toml ([mcp_servers.molecule] block with env_vars
  passthrough + literal env values). Outbound tools only — codex's
  MCP client doesn't route arbitrary notifications/* (verified by
  reading codex-rs/codex-mcp/src/connection_manager.rs); push parity
  on external codex would need a separate bridge daemon, tracked
  as future work. Snippet calls this out so operators know to pair
  with Python SDK if they need inbound delivery.

- **OpenClaw** — installs openclaw + onboards, wires the molecule
  MCP server via openclaw mcp set, starts the gateway on loopback.
  Same outbound-tools-only caveat as codex; the in-tree template-
  openclaw adapter implements the full sessions.steer push path,
  but an external setup would need the same bridge daemon to translate
  platform inbox events into sessions.steer calls. Future work.

Default open tab changed from "Claude Code" to "Universal MCP".
Universal MCP is runtime-agnostic and works as a starting point for
any operator regardless of their downstream agent runtime; runtime-
specific tabs are still one click away. Pre-2026-05-03 the modal
defaulted to Claude Code, so operators using non-Claude runtimes
opened to a tab they had to skip past.

Tab order also reorganized:
  Universal MCP → Python SDK → Claude Code → Hermes → Codex → OpenClaw → curl → Fields

Each runtime-specific tab is gated on the platform supplying the
snippet (older platform builds without the field don't show empty
tabs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 13:07:19 -07:00
Hongming Wang
1bff419833 feat(provisioner): digest-pin workspace images via runtime_image_pins (#2272 layer 1)
Layer 1 of the runtime-rollout plan. Decouples publish from promotion by
giving operators a `runtime_image_pins` table the provisioner consults at
container-create time. No row = legacy `:latest` behavior; row present =
provisioner pulls `<base>@sha256:<digest>`. One bad publish no longer
breaks every workspace simultaneously.

Mechanics:

  - Migration 047: `runtime_image_pins` (template_name PK + sha256 digest +
    audit columns) and `workspaces.runtime_image_digest` (nullable, with
    partial index) for "show me workspaces still on the old digest" queries.
  - `resolveRuntimeImage` (handlers/runtime_image_pin.go): looks up the
    pin, returns `<base>@sha256:<digest>` on hit, "" on miss/error so the
    provisioner falls through to the legacy tag map. Availability over
    pinning — any DB error logs and returns "" rather than blocking the
    provision. `WORKSPACE_IMAGE_LOCAL_OVERRIDE=1` short-circuits the
    lookup so devs rebuilding template images locally see their fresh
    build.
  - `WorkspaceConfig.Image` carries the resolved value into the
    provisioner. `selectImage` honors it ahead of the runtime→tag map and
    falls back to DefaultImage on unknown runtime.
  - The existing `imageTagIsMoving` predicate (#215) already returns false
    on `@sha256:` form, so digest pins skip the force-pull path naturally.

Tests:

  - Handler-side (sqlmock): no-pin/db-error/with-pin/empty/unknown/local-
    override paths cover every branch of `resolveRuntimeImage`.
  - Provisioner-side: `selectImage` table covers explicit-image preference,
    runtime-map fallback, unknown-runtime → default, empty-config →
    default. Plus a struct-literal compile-time pin on `Image` so a future
    refactor can't silently drop the field.

Layer 2 (per-ring routing via `workspaces.runtime_image_digest`) and the
admin promote/rollback endpoint ride on top of this and ship separately.
2026-05-03 02:30:00 -07:00
Hongming Wang
be271aef8b fix(orphan-sweeper): exclude runtime='external' from stale-token revoke
The Docker-mode orphan sweeper was incorrectly targeting external-runtime
workspaces, revoking their auth tokens ~6 minutes after creation (one
sweep cycle past the 5-min grace).

External workspaces have NO local container by design — their agent runs
off-host. The "no live container" predicate the sweep uses to detect
wiped-volume orphans matches every external workspace unconditionally,
which was killing the only auth credential the off-host agent has.

Reproducer: create runtime=external workspace, paste the auth token into
molecule-mcp / curl, wait 5 minutes. Next request returns
`HTTP 401 — token may be revoked`. Platform log shows
`Orphan sweeper: revoking stale tokens for workspace <id> (no live
container; volume likely wiped)`.

Fix: add `AND w.runtime != 'external'` to the sweep's SELECT. The
existing test regexes (third-pass query expectations + the shared
expectStaleTokenSweepNoOp helper) are tightened to require the new
predicate, so a regression that drops it fails CI immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:49:37 -07:00
Hongming Wang
384edb4af0
Merge branch 'staging' into perf/cache-platform-inbound-secret 2026-05-03 00:08:43 -07:00
Hongming Wang
b040171fa1 perf(wsauth): in-process cache for platform_inbound_secret reads
Heartbeats fire every 60s per workspace and were the dominant caller
of ReadPlatformInboundSecret — one DB SELECT each, purely to redeliver
the same value. For an N-workspace fleet that's N SELECTs/minute of
pure overhead, growing linearly with the fleet (#189).

This adds a sync.Map cache keyed by workspaceID with a 5-minute TTL:

- **Read-through**: cache miss → DB SELECT → populate → return.
- **Write-through**: every IssuePlatformInboundSecret call refreshes
  the cache with the new value before returning, so the lazy-heal mint
  path (readOrLazyHealInboundSecret) doesn't see a stale read of the
  value it just wrote.
- **TTL eviction**: 5 minutes — generous enough that the heartbeat
  hot path hits cache for ~5 reads in a row before re-validating, short
  enough that an out-of-band rotation (operator running
  `UPDATE workspaces SET platform_inbound_secret=...` directly)
  propagates within minutes without requiring a redeploy.
- **Absence not cached**: ErrNoInboundSecret skips the cache write so
  the lazy-heal recovery contract for the column-NULL case
  (readOrLazyHealInboundSecret in workspace_provision_shared.go) keeps
  working.

Memory footprint is bounded by the active workspace fleet (~200 bytes
per entry); deleted workspaces leave dead entries until process restart,
acceptable given workspace-deletion is operator-rare.

Why in-process instead of Redis: workspace-server runs as a single
Railway service today (per memory project_controlplane_ownership);
adding Redis for this single column read would be over-engineering.
The cache is a self-contained, Redis-free upgrade that keeps the same
semantic surface (read returns the latest secret) while collapsing
the heartbeat read storm. If the deployment ever fans out across
replicas, an operator-side rotation propagates per-replica TTL-bounded
without needing a shared write log.

Tests: 5 new cases covering cache hit within TTL, refresh after TTL
(simulating an operator rotation via SQL), write-through on Issue,
absence-not-cached, and Reset clearing all entries. The setupMock
helper in wsauth and setupTestDB helper in handlers both call
ResetInboundSecretCacheForTesting() at start + cleanup so write-through
state from one test doesn't shadow SELECT expectations in the next.
SetInboundSecretCacheNowForTesting() exposes a deterministic clock
override so the TTL test doesn't sleep.

Task: #189.
2026-05-03 00:04:38 -07:00
Hongming Wang
c4f64a11a8
Merge pull request #2546 from Molecule-AI/fix/provisioner-repull-moving-tags
fix(provisioner): force re-pull of moving image tags on workspace start
2026-05-03 06:59:36 +00:00
Hongming Wang
552602e462 fix(provisioner): force re-pull of moving image tags on workspace start
Previously Start() only pulled when the image was missing locally
(imgErr != nil). Once a tenant's Docker daemon had `:latest` cached,
it stuck on that snapshot forever even after publish-runtime pushed
a newer image with the same tag — the same image-cache class that
sibling task #232 closed on the controlplane redeploy path.

Now Start() additionally re-pulls when the tag is "moving"
(`:latest`, no tag, `:staging`, `:main`, `:dev`, `:edge`, `:nightly`,
`:rolling`). Pinned tags (semver, sha-prefixed, date-stamped, build-id)
and digest-pinned references (`@sha256:...`) skip the pull because
their contents are by definition immutable.

The classifier (imageTagIsMoving) is deliberately conservative on the
"moving" side — only the well-known moving tags trip it. Misclassifying
a pinned tag as moving wastes bandwidth on every provision; misclassifying
moving as pinned silently bricks the fleet on stale snapshots, which
is exactly the bug class this fix closes.

Edge cases handled:
- Registry hostname with port (`localhost:5000/foo`) — the `:5000` is
  not mistaken for a tag.
- Digest pinning (`image@sha256:...`) — never re-pulled even if a
  moving-looking tag is also present.
- Legacy local-build tags (`workspace-template:hermes`) — treated as
  pinned (no registry to move from).

Test coverage: 22 cases across all classifier shapes. No changes to
the pull-failure path (still best-effort, ContainerCreate still
surfaces the actionable "image not found" error if the pull failed
and the cache is also empty).

Task: #215. Companion to #232.
2026-05-02 23:56:32 -07:00
Hongming Wang
dfeefb0acc fix(workspace-server): vendor upstream derive-provider.sh + close 12-prefix drift
The drift gate's monorepoRoot walk-up looked for workspace-configs-templates/
which is gitignored locally and doesn't exist in this repo at all (the
canonical script lives in molecule-ai-workspace-template-hermes). Test
failed on CI from day one with "could not find monorepo root".

Two layered fixes in one PR:

1. Vendor upstream derive-provider.sh as testdata/ + drop monorepoRoot.
   The vendored copy has a header pointing operators at the upstream
   source and a one-line cp command for refresh. Test now reads two
   files (vendored shell + workspace_provision.go) via package-relative
   paths — Go test sets cwd to the package dir, so this is hermetic
   without any walk-up gymnastics.

2. Update the case-statement regex to match upstream's renamed variable
   (${_HERMES_MODEL} since v0.12.0, the resolved value of
   HERMES_INFERENCE_MODEL with a HERMES_DEFAULT_MODEL legacy fallback).
   Regex now accepts either spelling so a future rename fails loudly
   on the parser-sanity check rather than silently returning empty.

Vendoring upstream surfaced real drift the gate was designed to catch:
upstream v0.12.0 added 12 provider prefixes that deriveProviderFromModelSlug
didn't handle (xai/grok, bedrock/aws, tencent/tencent-tokenhub, gmi,
qwen-oauth, lmstudio/lm-studio, minimax-oauth, alibaba-coding-plan,
google-gemini-cli, openai-codex, copilot-acp, copilot). Without these,
Save+Restart on a workspace using one of those prefixes would persist
LLM_PROVIDER="" and the next boot would fall back to derive-provider.sh's
runtime *=auto branch — losing the user's explicit choice on every restart.

Added all 12 case clauses + 16 new table-driven test cases (covering
both canonical and aliased forms). Drift gate now passes; future
upstream additions will fail loudly with a "DRIFT: ..." message
pointing the engineer at the missing case.

Task: #242
2026-05-02 23:51:23 -07:00
Hongming Wang
284012a768 test(workspace-server): AST drift gate for derive-provider.sh ↔ Go port
PR #2535 added a Go port of derive-provider.sh
(deriveProviderFromModelSlug) so workspace-server can persist
LLM_PROVIDER into workspace_secrets at provision time. This created
two sources of truth — if a future PR adds a provider prefix to one
without the other, the platform's persisted LLM_PROVIDER silently
disagrees with what the container's derive-provider.sh produces at
boot, with no test going red.

This adds a hermetic drift gate that:

  1. Parses workspace-configs-templates/hermes/scripts/derive-provider.sh
     with regex (handling both single-line `pat/*) PROVIDER="x" ;;`
     clauses and multi-line conditional clauses) to build a
     map[prefix]provider.
  2. Walks workspace_provision.go's AST with go/ast, finds
     deriveProviderFromModelSlug, and extracts every case-clause
     prefix → return-string-literal pair.
  3. Cross-checks both directions and accepts only the two documented
     divergences (nousresearch/* and openai/* both → "openrouter" at
     provision time because derive-provider.sh's runtime-env checks
     aren't loaded yet) via a hardcoded acceptedDivergences map.
  4. Fails with an actionable message that names both files and
     suggests the exact fix (add the case OR add to divergence list
     with a comment).

Pattern: behavior-based AST gate from PR #2367 / memory feedback —
pin the invariant by what the function maps, not by what it's named.
Stdlib-only (go/ast, go/parser, go/token, regexp); no network, no DB,
no docker — reads two monorepo files in-process.

A second sanity-check test pins anchor prefixes the regex must find,
so a future shell-syntax change can't silently produce an empty map
and trivially pass the main gate.

Closes task #242.
2026-05-02 23:51:23 -07:00
Hongming Wang
586d567a48 fix(workspace-server): log silent yaml.Unmarshal + coexistence test (#256, #257)
Two follow-ups from PR #2543's multi-model code review (audit #253).

1. **Log silent yaml.Unmarshal errors (#256).** When a malformed
   config.yaml made `yaml.Unmarshal(data, &raw)` fail, the affected
   template silently disappeared from /templates with no trace —
   operator could not distinguish "excluded due to parse error" from
   "never existed." That widened a real foot-gun once PR #2543 added
   structured top-level `providers:` (a string-shaped top-level
   `providers:` decoded into `[]providerRegistryEntry` would fail and
   drop the whole entry). Now logs `templates list: skip <id>:
   yaml.Unmarshal: <err>` and continues with the rest.

2. **Coexistence test (#257 part 1).** PR #2543 covered the structured
   registry and slug list in isolation. claude-code-default in
   production ships BOTH: top-level `providers:` (structured registry,
   2 entries) AND `runtime_config.providers:` (slug list, 3 entries).
   New `TestTemplatesList_BothProviderShapesCoexist` mirrors that
   layout, asserts both shapes surface independently with no
   cross-talk (e.g. a slug-only entry like `anthropic-api` does NOT
   synthesize a stub in the structured registry), and pins the JSON
   wire-shape for both fields side-by-side.

3. **`base_url: null` decoding assertion (#257 part 3).** Adds an
   explicit `got[0].BaseURL == ""` check in the existing
   `TestTemplatesList_SurfacesProviderRegistry` test, locking in the
   `string` (not `*string`) type. A future change to `*string` would
   surface as JSON `null` and break canvas's "no base_url = use
   provider defaults" branch — caught loudly by this assertion.

Tests: 11 TestTemplatesList_* now green, including the new
MalformedYAMLLogsAndSkips and BothProviderShapesCoexist.

The remaining piece of #257 — renaming `Providers []string` JSON tag
to `provider_slugs` — requires coordinated canvas updates across 4
files and is intentionally deferred to a separate PR (no canvas
churn while user is mid-test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 23:01:59 -07:00
Hongming Wang
992a0c6860 fix(workspace-server): surface structured provider registry on /templates (#235)
Closes the contract drift caught by audit #253. Task #235 ("Server:
enrich /templates payload with structured providers") was marked
completed, but `templates.go` only ever emitted the
`runtime_config.providers []string` slug list — the structured
ProviderEntry shape (auth_env, model_prefixes, model_aliases, base_url)
the description promised was never plumbed.

Templates ship the structured registry under a TOP-LEVEL `providers:`
block (claude-code carries 6+ entries today; hermes still uses the
slug list). Both shapes coexist and are independent — surface them as
two separate fields:

  - `providers`           → existing []string slug list (unchanged)
  - `provider_registry`   → new []providerRegistryEntry (structured)

The canvas's ProviderModelSelector comment block already anticipates
this ("Templates that ship explicit vendor metadata (future) should
override the heuristic."). With this field in place, the canvas can
optionally drop its prefix-inference fallback for templates that ship
an explicit registry — separate PR. Today's change is purely additive
on the server side; no canvas change required.

Tests:
- TestTemplatesList_SurfacesProviderRegistry: order preservation +
  field plumbing on a claude-code-shaped fixture (oauth + minimax)
  + JSON wire-shape gate to catch struct-tag renames.
- TestTemplatesList_OmitsProviderRegistryWhenAbsent: omitempty so
  legacy templates (hermes, langgraph) don't emit `null` and break
  Array.isArray on the canvas side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 22:42:42 -07:00
Hongming Wang
8a86b66159 fix(workspace-server): set universal MODEL env on every templated provision
Bug B fix, server-side complement to molecule-runtime PR #2538.
The runtime PR taught `workspace/config.py` to honour
`MODEL_PROVIDER` over `runtime_config.model` from the template's
verbatim YAML. This PR is the upstream half: workspace-server's
`applyRuntimeModelEnv` now sets `MODEL=<picked>` for **every**
runtime, not just hermes (which got `HERMES_DEFAULT_MODEL` already).

Pre-fix: applyRuntimeModelEnv's per-runtime switch only emitted
HERMES_DEFAULT_MODEL for hermes; every other runtime got nothing,
so the adapter read its template's default model from
/configs/config.yaml. Surfaced 2026-05-02 — picking MiniMax-M2 in
canvas → workspace booted with model=sonnet (claude-code template
default) and demanded CLAUDE_CODE_OAUTH_TOKEN.

Post-fix: MODEL is set unconditionally before the per-runtime switch.
HERMES_DEFAULT_MODEL stays for backwards compat. Adapters opt in by
reading os.environ["MODEL"] in their executor (claude-code adapter
already does this since the same Bug B fix; see
workspace-configs-templates/claude-code-default/adapter.py).

Tests
=====
- `TestApplyRuntimeModelEnv_SetsUniversalMODELForAllRuntimes`:
  table-driven across claude-code/hermes/langgraph/crewai + empty-model
  fallback + MODEL_PROVIDER-secret-fallback path. Adding a new
  runtime = adding a row, not writing a new test.
- All 6 sub-cases pass + existing
  `TestWorkspaceCreate_FirstDeploy_UnknownModel_OnlyMintModelProvider`
  pin still green.

Why now
=======
This was authored alongside the runtime PR but stashed (not committed)
during a session-handoff cleanup. The molecule-runtime side shipped at
SHA 16ac895a and is live on PyPI as molecule-ai-workspace-runtime
0.1.84, but until the workspace-server side ships, the canvas-picked
MODEL env never reaches non-hermes adapters.

Caught by the systematic stash audit triggered by the user's
discovery that ProviderModelSelector had been similarly stashed.

Closes the workspace-server side of #246. Builds on merged #2538.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 22:10:51 -07:00
Hongming Wang
d95877c88d
Merge pull request #2535 from Molecule-AI/fix/hermes-first-deploy-model-provider-persistence
Persist canvas-selected model+provider on first deploy
2026-05-03 02:25:03 +00:00
Hongming Wang
1b75fddb8e
Merge pull request #2536 from Molecule-AI/chore/prune-manifest-to-4-runtimes
chore(manifest): prune to 4 actively-supported runtimes
2026-05-03 02:24:50 +00:00
Hongming Wang
f33e59ba8c chore(manifest): prune to 4 actively-supported runtimes
Deletes the 5 unsupported workspace_templates from manifest.json
(langgraph, crewai, autogen, deepagents, gemini-cli). The runtime
matrix is now claude-code / hermes / openclaw / codex — the four
templates with shipping images, working A2A integration, and active
CI publish-image cascades.

Mirrors the prune in:
  - workspace-server/internal/handlers/runtime_registry.go
    (fallbackRuntimes for dev/test contexts that boot without the
    manifest mounted)
  - workspace-server/internal/handlers/workspace_provision.go
    (sanitizeRuntime: empty/unknown → "claude-code", was "langgraph";
    removes the langgraph/deepagents-specific runtime_config skip
    branch — they're no longer supported, so the block is dead)
  - tests for both: rename TestEnsureDefaultConfig_LangGraph →
    _Hermes, TestEnsureDefaultConfig_EmptyRuntimeDefaultsToLangGraph
    → _ClaudeCode, drop TestEnsureDefaultConfig_DeepAgents,
    update TestSanitizeRuntime_Allowlist + the two
    TestResolveRestartTemplate_* cases that pinned langgraph-default
    as the safe-default name

Why this is safe: production reads manifest.json at boot and uses it
as the authoritative allowlist; the 5 removed runtimes have not
shipped working images for ≥1 release cycle. Any provision request
naming one will now coerce to claude-code (with a log line) instead
of returning a runtime that has no functioning template repo.
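
A minimal sketch of the coercion path, under assumptions: the allowlist
shape and log wording here are illustrative, and the authoritative list
is the manifest read at boot, not a hardcoded map:

```go
package main

import (
	"fmt"
	"log"
)

// allowedRuntimes mirrors the pruned 4-runtime matrix. Illustrative only:
// production reads manifest.json at boot as the authoritative allowlist.
var allowedRuntimes = map[string]bool{
	"claude-code": true,
	"hermes":      true,
	"openclaw":    true,
	"codex":       true,
}

// sanitizeRuntime coerces empty or unknown runtimes to the safe default
// ("claude-code" after this change, previously "langgraph") and logs the
// coercion so stale provision requests are visible to operators.
func sanitizeRuntime(runtime string) string {
	if runtime == "" || !allowedRuntimes[runtime] {
		log.Printf("sanitizeRuntime: coercing %q to claude-code", runtime)
		return "claude-code"
	}
	return runtime
}

func main() {
	fmt.Println(sanitizeRuntime(""))          // claude-code
	fmt.Println(sanitizeRuntime("langgraph")) // claude-code (pruned)
	fmt.Println(sanitizeRuntime("hermes"))    // hermes
}
```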

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 19:21:47 -07:00
Hongming Wang
a1de71dd53 fix(workspace-server): persist canvas-selected model + provider on first deploy
When the canvas POSTs /workspaces with {model: "minimax/MiniMax-M2.7"},
the model slug was never written to workspace_secrets. The workspace
booted hermes once with HERMES_DEFAULT_MODEL set from payload.Model, but
on every subsequent restart applyRuntimeModelEnv's fallback chain found
nothing in envVars["MODEL_PROVIDER"] (because nothing wrote it) and
hermes silently fell through to the template default
(nousresearch/hermes-4-70b) — wrong provider keys → hermes gateway
401'd → /health poll failed → molecule-runtime never registered →
"container started but never called /registry/register".

Worse, LLM_PROVIDER was never written either (the canvas doesn't send
provider), so CP user-data wrote no provider: field to
/configs/config.yaml and derive-provider.sh fell through to PROVIDER=auto
on every custom-prefix slug.

Fix: after the workspace row commits, persist MODEL_PROVIDER (verbatim
slug) and LLM_PROVIDER (derived from slug prefix) to workspace_secrets.
LLM_PROVIDER is gating-only — derive-provider.sh remains the runtime
source of truth and can override at boot. Reuses extracted
setModelSecret / setProviderSecret helpers (refactored out of SetModel /
SetProvider gin handlers) so SQL stays in one place.
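
The prefix derivation can be sketched as below. The function name and
prefix handling are assumptions for illustration; derive-provider.sh
remains the runtime source of truth and can override at boot:

```go
package main

import (
	"fmt"
	"strings"
)

// deriveProviderFromSlug sketches the gating-only LLM_PROVIDER
// derivation: the lowercased prefix before the first "/" in the model
// slug. Slugs without a prefix fall through to "auto", matching the
// derive-provider.sh PROVIDER=auto fallback described above.
func deriveProviderFromSlug(modelSlug string) string {
	prefix, _, found := strings.Cut(modelSlug, "/")
	if !found || prefix == "" {
		return "auto"
	}
	return strings.ToLower(prefix)
}

func main() {
	fmt.Println(deriveProviderFromSlug("minimax/MiniMax-M2.7"))      // minimax
	fmt.Println(deriveProviderFromSlug("nousresearch/hermes-4-70b")) // nousresearch
	fmt.Println(deriveProviderFromSlug("gpt-4o"))                    // auto
}
```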

Symptom: failed-workspace 95ed3ff2 (2026-05-02).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 19:21:01 -07:00
Hongming Wang
f18ee8598a fix(restart): retry cpProv.Stop with backoff + flag exhaustion as LEAK-SUSPECT
Both restart paths (interactive Restart handler + auto-restart's
stopForRestart) used to log-and-continue on cpProv.Stop failure. After
PR #2500 made CPProvisioner.Stop surface CP non-2xx as an error, those
paths became the actual leak generator: every transient CP/AWS hiccup =
one orphan EC2 alongside the freshly provisioned one. The 13 zombie
workspace EC2s on demo-prep staging traced to this exact path.

Adds cpStopWithRetry helper with bounded exponential backoff (3 attempts,
1s/2s/4s). Different policy from workspace_crud.go's Delete handler:
Delete returns 500 to the client on Stop failure (loud-fail-and-block —
user asked to destroy, silent leak unacceptable), whereas Restart's
contract is "make the workspace alive again" — refusing to reprovision
strands the user with a dead workspace. So this helper retries to absorb
transient failures, then on exhaustion emits a structured `LEAK-SUSPECT`
log line for the (forthcoming) CP-side workspace orphan reconciler to
correlate. Caller proceeds to reprovision regardless.

ctx-cancel exits the retry early without sleeping the backoff (matters
during shutdown drain); the cancel path emits a distinct log line and
deliberately does NOT emit LEAK-SUSPECT — operator-cancel and
retry-exhaustion are different signals and conflating them would noise
up the orphan-reconciler queue with workspaces we never had a chance to
retry.

Tests: 5 behavior tests covering every branch (no-op, first-try success,
eventual success, exhaustion, ctx-cancel) + 1 AST gate that pins the
helper-only invariant (any future inline `h.cpProv.Stop(...)` in
workspace_restart.go fires the gate, mutation-tested).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 23:36:38 -07:00
Hongming Wang
5167e482d0 fix(cp-provisioner): surface CP non-2xx on Stop to plug EC2 leak
http.Client.Do only errors on transport failure — a CP 5xx (AWS
hiccup, missing IAM, transient outage) was silently treated as
success. Workspace row then flipped to status='removed' and the EC2
stayed alive forever with no DB pointer (the "orphan EC2 on a
0-customer account" scenario flagged in workspace_crud.go #1843).
Found while triaging 13 zombie workspace EC2s on demo-prep staging.

Adds a status-code check that returns an error tagged with the
workspace ID + status + bounded body excerpt, so the existing
loud-fail path in workspace_crud.go's Delete handler can populate
stop_failures and surface a 500. Body read is io.LimitReader-capped
at 512 bytes to keep error logs sane during a CP outage.

Tests: 4 new (5xx surfaces, 4xx surfaces, 2xx variants 200/202/204
all succeed, long body is truncated). Test-first verified — the
first three fail on the buggy code and all four pass on the fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:59:01 -07:00
Hongming Wang
0064f02c00 test(sweeper): integration coverage for manifest-override + accessor consolidation
Two follow-ups from PR #2494's review:

1. Two new sweep tests exercise the lookup path through
   sweepStuckProvisioning end-to-end:
     - ManifestOverrideSparesRow: claude-code 11min old, manifest=20min
       → no UPDATE, no broadcast (sparing works through the sweeper)
     - ManifestOverrideStillFlipsPastDeadline: claude-code 21min old,
       manifest=20min → flipped + payload.timeout_secs=1200
   Closes the gap that the unit-test on provisioningTimeoutFor alone
   left open: a future refactor could drop the lookup arg from the
   sweeper's call and only the unit test caught it. Verified by
   regression-injecting `lookup→nil` in sweepStuckProvisioning — both
   new tests fail, the old ones still pass.

2. addProvisionTimeoutMs now goes through ProvisionTimeoutSecondsForRuntime
   instead of calling provisionTimeouts.get directly. Single accessor
   path for the same data — the canvas response and the sweeper now
   resolve identically by construction.

No production behavior change; tests + accessor cleanup only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:00:36 -07:00
Hongming Wang
18edf88d59 fix(sweeper): honour template-manifest provision_timeout_seconds
Real wiring gap discovered while investigating the issue #2486 cluster of
prod claude-code workspaces that failed at exactly 10m. The
runtimeProvisionTimeoutsCache (#2054 phase 2) reads
runtime_config.provision_timeout_seconds from each template's
config.yaml so the **canvas** spinner respects per-template timeouts —
but the **sweeper** in registry/provisiontimeout.go hardcoded 10 min
(claude-code) / 30 min (hermes) and never consulted the manifest. So a
template that declared a longer window had a UI that waited correctly
but a sweeper that killed the row at the hardcoded floor anyway.

Resolution order pinned by new TestProvisioningTimeout_ManifestOverride:

  1. PROVISION_TIMEOUT_SECONDS env (ops-debug global override)
  2. Template manifest lookup (per-runtime, beats hermes default too)
  3. Hermes default (30 min — CP bootstrap-watcher 25 min + 5 min slack)
  4. DefaultProvisioningTimeout (10 min)

Wiring:
  - registry: new RuntimeTimeoutLookup function type, threaded through
    StartProvisioningTimeoutSweep + sweepStuckProvisioning + the
    pre-existing provisioningTimeoutFor.
  - handlers: ProvisionTimeoutSecondsForRuntime exposes the cache's
    lookup as a method so main.go can pass it without breaking the
    handlers→registry import direction.
  - cmd/server/main.go: wire wh.ProvisionTimeoutSecondsForRuntime into
    the sweep boot.

Verified:
  - go test -race ./... passes (every workspace-server package).
  - Regression-injected the lookup arm: 3 manifest-override subcases
    fail with the actual-vs-expected gap, confirming the new test is
    load-bearing.
  - The original two timeout tests (env-override, hermes default) keep
    passing — `lookup=nil` argument preserves their semantics.

Operator action enabled: a template wanting a 15-min window can now
just set `runtime_config.provision_timeout_seconds: 900` in its
config.yaml and the sweeper honours it on the next workspace-server
restart.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 21:44:42 -07:00
Hongming Wang
955755ce1e test(provision): tighten Assertion 4 message to name both failure modes
Per review nit on PR #2491: the previous message ("a goroutine reached
cpProv.Start but never broadcast its failure") could mislead an
operator if Assertion 2 and 4 both fire — Assertion 4 also catches
"goroutine exited via an earlier path before reaching Start." Spell
both modes out and cross-reference Assertion 2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 20:14:39 -07:00
Hongming Wang
82cc331517 test(provision): harden panic tests with re-raise guard + assert broadcast count
Post-merge follow-up to PR #2487 review feedback:

1. guardAgainstReraise(fn) helper around every panic-test exercise. The
   original RecoversAndMarksFailed had its own outer recover() to detect
   re-raise; NoOpWhenNoPanic and PersistFailureLogged didn't. If a future
   regression makes logProvisionPanic re-raise, those two would have
   crashed the test process (taking sibling tests down) instead of
   reporting a clean failure. Now all three use the shared guard.

2. Concurrent repro now asserts bcast.count == 7 — the new
   concurrentSafeBroadcaster's count field was added in the race fix
   but not actually consumed. Cross-checks the existing recorder-set
   assertion from a different angle: a goroutine could in principle
   reach cpProv.Start (recorder hits) but then lose its
   WORKSPACE_PROVISION_FAILED broadcast on the failure path. Pinning
   both rules out that silent-drop variant for the canvas-broadcast
   contract specifically.

3. Comment on captureLog noting log.SetOutput is process-global and
   incompatible with t.Parallel() — preempts a future footgun if
   someone parallelizes the panic suite.

Verified: all four tests pass under -race; full handlers + db packages
green under -race.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 20:11:11 -07:00
Hongming Wang
4f64c4366f test(provision): swap to concurrent-safe broadcaster in 7-burst harness
CI Platform (Go) ran with -race and the concurrent test tripped the
detector: captureBroadcaster (sequential-test stub) writes lastData
unguarded; 7 fan-out goroutines call markProvisionFailed → that stub
concurrently. Local non-race run had hidden it.

Introduce concurrentSafeBroadcaster (mutex-counted) for this single
fan-out test. Sequential tests keep using captureBroadcaster — the
fix is local to the test that creates the goroutines.

Verified ./internal/handlers passes with -race.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 20:03:11 -07:00
Hongming Wang
7a19724194 fix(provision): route panic recovery through markProvisionFailed + fix log capture
Three fixes addressing review of the issue #2486 observability PR:

1. CI failure: original inline UPDATE in logProvisionPanic used a hard-coded
   `status='failed'` literal, which trips workspace_status_enum_drift_test
   (the post-PR-#2396 gate that requires every status write to flow through
   models.Status* via parameterized $N). Refactor to call
   h.markProvisionFailed which uses StatusFailed parameterized.

2. Canvas-broadcast gap (review finding): inline UPDATE skipped
   RecordAndBroadcast, so panic recovery marked the row failed in DB but
   the canvas spinner stayed on "provisioning" until the next poll.
   markProvisionFailed fires WORKSPACE_PROVISION_FAILED, so canvas now
   flips to a failure card immediately.

3. Critical test bug (review finding): `defer log.SetOutput(log.Writer())`
   in three test sites evaluated log.Writer() at defer-fire time AFTER the
   SetOutput swap — restoring the buffer to itself, never restoring
   os.Stderr. Subsequent tests in the package were running with the panic
   tests' captured buffer as their writer. Extracted captureLog(t) helper
   that captures `prev` BEFORE the swap and uses t.Cleanup.

Plus: softened the "goroutine never started" comment in the concurrent
repro harness — the harness atomic-counts BEFORE the entry log fires, so
"never started" was misleading; the real failure mode is "entry log
renamed/removed or writer hijacked."

Verified: full handlers suite passes; drift gate passes (Platform Go CI
failure root-caused). Regression-injected the recover body again — both
panic tests still fail as expected, confirming the contract is gated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 19:56:34 -07:00
Hongming Wang
fe92194584 test(provision): concurrent 7-burst repro harness for #2486 silent-drop
Goal: a deterministic, in-process reproduction of the prod incident
where 7 simultaneous claude-code provisions on the hongming tenant
produced ZERO log lines from any of the four documented exit paths.

Approach: stub CPProvisioner that records every Start() call,
sqlmock for the prepare flow, fire 7 goroutines concurrently against
provisionWorkspaceCP, then assert:

  1. Entry log fired exactly 7 times (one per goroutine).
  2. Stub Start() recorded all 7 distinct workspace IDs.
  3. Each goroutine's entry log names its own workspace ID.

Result on staging head as of 2026-05-02: PASSES — meaning the
silent-drop class isn't reproducible against current head with stub
CP. Tenant hongming runs sha 76c604fb (725 commits behind staging),
so the bug is most likely already fixed upstream — hongming needs
a redeploy.

The test stays as a regression gate: any future refactor that
re-introduces silent goroutine swallow in the CP provision path
(rate-limit drop, channel-send-without-receiver, panic without
recover, etc.) trips it.

A safeWriter wraps the captured log buffer because raw
bytes.Buffer.Write isn't safe for concurrent goroutines — without
serialization the 7 entry-log lines interleave at byte boundaries
and the strings.Count assertion gets unreliable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 19:19:05 -07:00