Commit Graph

4384 Commits

Author SHA1 Message Date
Hongming Wang
cb70d3d437 docs: callout Python>=3.11 requirement on Universal MCP install snippet
User-reported friction: pip install molecule-ai-workspace-runtime on a
3.10 interpreter fails with "Could not find a version that satisfies the
requirement (from versions: none)" — pip's requires_python filter
silently drops the only available artifact before attempting install,
so the error doesn't mention Python at all. Operators see
"package missing", file a bug, and chase a phantom CDN/visibility
issue.

Two changes mirror the requirement at the two operator-touch surfaces:

1. workspace-server/internal/handlers/external_connection.go:
   the externalUniversalMcpTemplate snippet (rendered into the
   canvas Connect-External-Agent modal) now leads with a brief
   "Requires Python >= 3.11" block + diagnostic + upgrade paths.

2. docs/workspace-runtime-package.md: same callout at the top of
   the doc, before the Overview, so anyone landing here from search
   gets the answer immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:44:25 -07:00
Hongming Wang
a1d202723d
Merge pull request #2929 from Molecule-AI/auto-sync/main-ef67dc51
chore: sync main → staging (auto, ff to ef67dc51)
2026-05-05 13:42:55 -07:00
molecule-ai[bot]
ef67dc513e
Merge pull request #2928 from Molecule-AI/staging
staging → main: auto-promote 2bf6a70
2026-05-05 20:33:52 +00:00
Hongming Wang
23d3f057d3
Merge pull request #2890 from Molecule-AI/refactor/a2a-tools-memory-extract-rfc2873-iter4c
refactor(workspace): extract memory tools from a2a_tools.py (RFC #2873 iter 4c)
2026-05-05 20:31:45 +00:00
Hongming Wang
46a4ef83bb fix(tests): patch a2a_tools_memory.httpx, not a2a_tools.httpx
Iter 4c (#2890) moved tool_commit_memory + tool_recall_memory into
a2a_tools_memory.py, which has its own top-level `import httpx`.
test_mcp_memory.py + the secret-redact memory tests still patched
`a2a_tools.httpx.AsyncClient`, which after the move is the WRONG
module's reference — the real call inside the moved tool resolves to
`a2a_tools_memory.httpx.AsyncClient` and reaches the network. CI
catches this as 7 failures: JSONDecodeError on empty bodies and
"All connection attempts failed" on the recall side.

Update 7 patch sites to `a2a_tools_memory.httpx.AsyncClient`. The
existing tests in `test_a2a_tools_impl.py` were already updated by
the iter-4c PR; only these two files were missed.

Verified: pytest workspace/tests/test_mcp_memory.py +
test_secret_redact.py — 43/43 pass after the fix (both files were
red on the iter-4c branch CI).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:25:06 -07:00
Hongming Wang
a6afc18de5
Merge pull request #2927 from Molecule-AI/fix/org-import-polish-2872
fix(org-import): polish — wrap-safe ErrNoRows, bounded lookup, godoc (#2872)
2026-05-05 20:25:01 +00:00
Hongming Wang
423d58d42c fix(org-import): polish — wrap-safe ErrNoRows, bounded lookup, godoc
Three small hardening passes from #2872's optional/important findings,
batched into one polish PR:

1. errors.Is(err, sql.ErrNoRows) instead of err == sql.ErrNoRows.
   The bare equality breaks if any future caller wraps the error via
   fmt.Errorf("…: %w", err) — the no-rows happy path would fall
   through to the "real DB error" branch and abort the import.
   errors.Is unwraps. New test
   TestLookupExistingChild_WrappedNoRows_TreatedAsNotFound pins the
   fix; verified the test fails on the old `==` shape (build break
   on unused-import + assertion failure once import dropped).

2. Bounded 5s timeout on lookupExistingChild instead of
   context.Background().
   The createWorkspaceTree call site runs in goroutines spawned from
   the /org/import handler, so plumbing the request context here
   would cascade-cancel into provisionWorkspaceAuto and abort
   in-flight EC2 provisioning if the client disconnected mid-import
   — that's the wrong tradeoff. A short bounded timeout protects the
   per-row SELECT against a wedged DB without taking the
   drop-everything-on-disconnect behaviour. The lookup is a single
   ~10ms query; 5s leaves 500x headroom for transient slow paths.

3. Godoc clarifications on the skip-path block.
   - /org/import is ADDITIVE-ONLY, never destructive. Children
     present in the existing tree but absent from the new template
     are preserved (no DELETE on diff).
   - Skip-path does NOT propagate updates to existing nodes — a
     re-import that adds an initial_memory or schedule to an
     existing workspace is silently dropped. Document the limitation
     so future operators know to delete-and-re-import or reach for
     a future /org/sync route.

Verification:
  - go build ./... → clean
  - go test ./internal/handlers/... → all passing (TestLookup* +
    TestCreateWorkspaceTree* + TestClass1* + TestGate*)
  - 4 lookup tests + 1 new wrap-safety test → 5/5 PASS
  - Full handlers suite → green

Refs molecule-core#2872 (Optional findings — wrap-safety + ctx, godoc
clarifications for additive-only + skip-path-update-limitation)

Out of scope (deferred):
  - PR-D partial unique index migration + ON CONFLICT — sequenced
    after Phase 4 cleanup verified clean per #2872 plan
  - PR-E full createWorkspaceTree integration test for partial-match
    — needs heavier sqlmock scaffolding for downstream
    workspaces_audit/canvas_layouts/secrets/channels INSERTs;
    follow-up

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:20:54 -07:00
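A minimal Go sketch of findings 1 and 2 combined (the query shape, signature, and return convention are illustrative assumptions; only `errors.Is`, the 5s bound, and the `lookupExistingChild` name come from the commit above):

```go
package handlers

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
	"time"
)

// lookupExistingChild sketches the hardened preflight: a bounded 5s
// timeout instead of context.Background() (deliberately NOT the request
// context, so a client disconnect can't cascade-cancel provisioning),
// and errors.Is so a wrapped sql.ErrNoRows still reads as "not found".
func lookupExistingChild(db *sql.DB, parentID, name string) (id string, found bool, err error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	err = db.QueryRowContext(ctx,
		`SELECT id FROM workspaces WHERE parent_id = $1 AND name = $2`,
		parentID, name).Scan(&id)
	if errors.Is(err, sql.ErrNoRows) { // survives fmt.Errorf("…: %w", err)
		return "", false, nil // no existing child: caller proceeds to INSERT
	}
	if err != nil {
		return "", false, fmt.Errorf("lookup existing child: %w", err)
	}
	return id, true, nil // skip-path: import stays additive-only
}
```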
Hongming Wang
9386f1d399
Merge pull request #2926 from Molecule-AI/fix/agent-comms-display-parity
fix(canvas): AgentCommsPanel display + initial-state parity with my-chat
2026-05-05 20:15:29 +00:00
Hongming Wang
a766e5ce48
Merge pull request #2925 from Molecule-AI/e2e/poll-mode-chat-upload-tests
test(e2e): poll-mode chat upload E2E in standard suite
2026-05-05 20:13:27 +00:00
Hongming Wang
5ad2669f88 fix(canvas): AgentCommsPanel display + initial-state parity with my-chat
User-visible problem: agent-comms panel opens mid-conversation on long
histories (the same chat-opens-in-middle bug PR #2903 fixed for
my-chat) and silently renders empty state when the history fetch fails
(no retry button, no diagnostic).

Three changes mirror the my-chat patterns from ChatTab:

1. Initial-mount instant scroll.
   Adds hasInitialScrollRef + switches the scroll hook from useEffect
   to useLayoutEffect. First arrival of messages → scrollIntoView
   `instant`; subsequent appends → `smooth` as before. useLayoutEffect
   runs before paint so the user never sees the panel jump for one
   frame on every append.

2. Error UI with Retry button.
   Adds `loadError` state. The history-load .catch now sets the
   error message; a new render branch shows a red alert
   with the failure text and a Retry button that re-invokes
   `loadInitial`. Same shape as ChatTab MyChatPanel's `loadError`
   handling — both surfaces should fail loud, not silent.

3. Extracted `loadInitial` callback.
   The history-load body becomes a useCallback so the retry button
   has a stable reference to call. Mirrors ChatTab's loadInitial.

Tests (4 new in AgentCommsPanel.render.test.tsx):
- Loading state renders the loading copy.
- Error state with Retry button renders on rejection; clicking
  Retry fires a second api.get.
- Empty state renders when load succeeds with zero rows.
- scrollIntoView is called with behavior=instant on first message
  arrival (pins the chat-opens-in-middle prevention).

Verification:
- pnpm test → 1284/1284 pass (1280 prior + 4 new)
- tsc --noEmit → clean
- 92 → 93 test files, no existing test broken

Closes the parity gap raised in chat. The two surfaces now share:
loading copy / error UI / empty-state placeholder / scroll behaviour /
useLayoutEffect timing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:09:36 -07:00
Hongming Wang
0ca4e431c1 test(e2e): add poll-mode chat upload E2E and wire into e2e-api.yml
Covers the user-visible flow that Phase 1-5b shipped (RFC #2891):
register a poll-mode workspace, POST a multi-file /chat/uploads, verify
the activity feed shows one chat_upload_receive row per file, fetch the
bytes via /pending-uploads/:fid/content, ack each row, and confirm a
post-ack fetch returns 404. Also pins cross-workspace bleed protection
(workspace B's bearer on A's URL → 401, B's URL with A's file_id →
404) and the file_id-UUID-parse 400 path.

23 assertions, all green against a local platform (Postgres+Redis+
platform-server stack matches the e2e-api.yml CI recipe verbatim).

Why a new script instead of extending test_poll_mode_e2e.sh: that
script tests A2A short-circuit + since_id cursor semantics; this one
tests the chat-upload path. They share zero handler code on the
platform side and would dilute each other's failure messages if
combined.

Why not the bearerless-401 strict-mode assertion: the platform's
wsauth fail-opens for bearerless requests when MOLECULE_ENV=development
(see middleware/devmode.go). The CI workflow doesn't set that var, but
some local-dev .env files do — the assertion would flap by environment
without testing the poll-mode upload contract. The middleware's own
unit tests cover strict-mode 401.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:08:55 -07:00
molecule-ai[bot]
184ce7ae4e
Merge pull request #2919 from Molecule-AI/staging
staging → main: auto-promote ae22a55
2026-05-05 13:08:45 -07:00
Hongming Wang
2bf6a7005f
Merge pull request #2923 from Molecule-AI/feat/class1-ast-gate-2867
test(handlers): generic Class 1 leak AST gate (#2867 PR-A)
2026-05-05 20:04:59 +00:00
Hongming Wang
16ead69641
Merge pull request #2913 from Molecule-AI/fix-saas-logout-ui-missing
fix(canvas): wire SaaS Sign-out button — /cp/auth/signout was unreachable from UI
2026-05-05 20:03:53 +00:00
Hongming Wang
60afcd43c9 test(handlers): generic Class 1 leak AST gate (#2867 PR-A)
Adds class1_ast_gate_test.go — a per-package AST walk that fails the
build if any handler function INSERTs INTO workspaces inside a range
loop body without one of three escape hatches:

  1. A call to a registered preflight helper (lookupExistingChild today;
     extend preflightCallNames as new helpers are introduced).
  2. An ON CONFLICT clause in the same SQL literal (idempotent UPSERT,
     like registry.go).
  3. An explicit `// class1-gate: idempotent-by-design` comment in the
     function body (deliberately awkward — forces a code-review beat).

Why this is broader than the existing
TestCreateWorkspaceTree_CallsLookupBeforeInsert gate in
org_import_idempotency_test.go: that one is hard-coded to one function
in one file. This one walks every non-test .go file in the handlers
package and applies a structural rule independent of file/function
names. A future handler written from scratch in a new file would not
have been covered before — now it is.

Detection mechanism (per AST):
  - Collect spans (Lbrace..Rbrace) of every RangeStmt body in each
    function. Position-based instead of stack-based — ast.Inspect's
    nil-callback ordering doesn't give per-node pop semantics, so a
    naive push/pop stack silently miscounts. Position spans are
    deterministic.
  - Walk every BasicLit, regex-match `^\s*INSERT INTO workspaces\(`
    (tightened from bytes.Index "INSERT INTO workspaces" so
    workspaces_audit literals don't false-positive — same regex used
    by the existing createWorkspaceTree gate).
  - For each match: record insertLine, hasONCONFLICT, and the
    innermost enclosing RangeStmt line (or 0 if not inside any range).
  - Fail the function if INSERT is inside a range AND no preflight
    AND no ON CONFLICT AND no allowlist annotation.

Self-tests (per `feedback_assert_exact_not_substring.md` —
verify gate fails on the bug shape before merging):
  - TestClass1_GateFiresOnSyntheticBuggySource: synthetic source
    where INSERT is inside `for _, child := range children` body
    must trigger the gate's three guards (enclosingRangeLine!=0,
    hasONCONFLICT=false, no preflight call).
  - TestClass1_GateAllowsONCONFLICT: synthetic INSERT...ON CONFLICT
    must NOT trigger the gate (idempotent UPSERT case).
  - TestClass1_GateAllowsAllowlistAnnotation: function with
    `// class1-gate: idempotent-by-design` must be skipped.
  - TestClass1_NoUnpreflightedInsertInsideRange: production sweep
    over every handler .go file. Currently passes because
    org_import.go preflights, registry.go ON-CONFLICTs, and
    workspace.go's Create has no INSERT inside a range body.

Verification:
  - go test ./internal/handlers/... -run TestClass1_ -count=1
    → 4/4 PASS
  - go test ./internal/handlers/... -count=1 → suite green
    (no pre-existing test broken by the new file)

Refs molecule-core#2867 (PR-A Class 1 generic AST gate)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:01:34 -07:00
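A compressed sketch of the detection pass described above. The span collection and the regex are per the commit; the escape-hatch checks (preflight call, ON CONFLICT, allowlist comment) are elided and the driver signature is invented:

```go
package handlers

import (
	"go/ast"
	"go/token"
	"regexp"
	"strings"
	"testing"
)

var insertRe = regexp.MustCompile(`^\s*INSERT INTO workspaces\(`) // workspaces_audit won't match

// checkFuncForClass1 sketches one function's pass: collect position
// spans of every range body, then flag INSERT literals inside any span.
func checkFuncForClass1(t *testing.T, fset *token.FileSet, fn *ast.FuncDecl) {
	if fn.Body == nil {
		return
	}
	// Position spans (Lbrace..Rbrace) are deterministic; ast.Inspect's
	// nil callback doesn't say which node popped, so a stack is fragile.
	var spans [][2]token.Pos
	ast.Inspect(fn.Body, func(n ast.Node) bool {
		if r, ok := n.(*ast.RangeStmt); ok {
			spans = append(spans, [2]token.Pos{r.Body.Lbrace, r.Body.Rbrace})
		}
		return true
	})
	ast.Inspect(fn.Body, func(n ast.Node) bool {
		lit, ok := n.(*ast.BasicLit)
		if !ok || !insertRe.MatchString(strings.Trim(lit.Value, "`\"")) {
			return true
		}
		for _, s := range spans {
			if lit.Pos() > s[0] && lit.Pos() < s[1] {
				// The real gate only fails after also finding no preflight
				// call, no ON CONFLICT, and no allowlist annotation.
				t.Errorf("%s: INSERT INTO workspaces inside a range body in %s",
					fset.Position(lit.Pos()), fn.Name.Name)
			}
		}
		return true
	})
}
```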
Hongming Wang
ff75aeb43e
Merge pull request #2922 from Molecule-AI/fix/memory-plugin-gate-sidecar-on-cutover
fix(memory-plugin): gate sidecar spawn on cutover-active
2026-05-05 19:44:01 +00:00
Hongming Wang
81cf0cbf98
Merge pull request #2921 from Molecule-AI/fix/batch-fetcher-cancel-on-timeout
fix(inbox-uploads): cancel BatchFetcher futures on wait_all timeout
2026-05-05 12:41:48 -07:00
Hongming Wang
412dec0d87 fix(memory-plugin): gate sidecar spawn on cutover-active
PR #2906 spawned the sidecar unconditionally on every tenant boot. The
plugin's first migration runs `CREATE EXTENSION vector` which fails
on tenant Postgres without pgvector preinstalled — every staging
tenant redeploy aborted at the 30s health gate. CP fail-fast kept
running tenants on the prior image (no outage), but the new image
was DOA.

Caught on staging redeploy 2026-05-05 19:23 with
`pq: extension "vector" is not available`.

Fix: only spawn the sidecar when the operator has flipped the cutover
flag — `MEMORY_V2_CUTOVER=true` OR `MEMORY_PLUGIN_URL` is set.

  * Aligns the entrypoint to the same opt-in posture wiring.go already
    uses (it skips building the client when MEMORY_PLUGIN_URL is empty).
  * Until cutover, the sidecar isn't even running — no migration, no
    health gate, no boot-time pgvector dependency.
  * Operators activating cutover already redeploy with the new env
    vars set; that's when the sidecar starts. By definition they've
    verified pgvector is available before flipping.
  * MEMORY_PLUGIN_DISABLE=1 escape hatch preserved; harness fix #2915
    becomes belt-and-suspenders (still respected).

Both Dockerfile and entrypoint-tenant.sh updated. Behavior change for
existing deployments: zero (cutover env vars still unset → sidecar
still inert, but now also not running).

Refs RFC #2728. Hotfix for #2906; supersedes the migration-path
fragility class (the sidecar isn't doing migrations on tenants that
won't use it).
2026-05-05 12:39:03 -07:00
Hongming Wang
9a53529047 ci: retrigger after stuck Canvas tabs E2E (was running 17min vs typical <1min on staging) 2026-05-05 12:38:09 -07:00
Hongming Wang
39931acd9c fix(inbox-uploads): cancel BatchFetcher futures on wait_all timeout
The deadline contract was incomplete: wait_all logged the timeout but
close() then called executor.shutdown(wait=True), which blocked on
the leaked workers — undoing the user-facing timeout. The inbox poll
loop would stall indefinitely on a hung /content fetch instead of
returning to chat-message processing.

Fix: wait_all now flips self._timed_out and cancels queued (not-yet-
started) futures; close() reads that flag and switches to
shutdown(wait=False, cancel_futures=True) on the timeout path.
Currently-running workers can't be interrupted by Python's threading
model, but they're now detached daemons whose blocking httpx call
no longer gates the next poll.

Healthy path (no timeout) keeps the existing drain-and-wait so a
still-queued ack POST isn't dropped mid-write.

Two new tests pin both legs of the contract end-to-end:
- close-after-timeout-doesn't-block: hung worker, wait_all(0.05s)
  fires the timeout, close() returns in <1s instead of waiting ~5s
  for the worker to come back.
- close-without-timeout-still-drains: 2 slow workers, wait_all
  completes cleanly, close() drains both ack POSTs.

Resolves the BatchFetcher timeout-cancellation finding from the
post-merge five-axis review of Phase 5b.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 12:34:41 -07:00
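The fix itself is Python (`concurrent.futures`); purely as a cross-language analogy, here is the same two-legged contract sketched in Go, with every name invented: drain on the healthy path, refuse to re-block after an observed timeout.

```go
package fetcher

import (
	"sync"
	"time"
)

// BatchFetcher here is only a Go analogue of the Python class above:
// the point is the contract, not the implementation.
type BatchFetcher struct {
	wg       sync.WaitGroup
	timedOut bool
}

func (b *BatchFetcher) Submit(fetch func()) {
	b.wg.Add(1)
	go func() { defer b.wg.Done(); fetch() }()
}

// WaitAll returns false if the batch didn't drain within d.
func (b *BatchFetcher) WaitAll(d time.Duration) bool {
	done := make(chan struct{})
	go func() { b.wg.Wait(); close(done) }()
	select {
	case <-done:
		return true
	case <-time.After(d):
		b.timedOut = true // remembered so Close won't re-block
		return false
	}
}

// Close drains on the healthy path so a queued ack isn't dropped, but
// after a timeout it abandons stragglers instead of undoing the
// caller-visible deadline.
func (b *BatchFetcher) Close() {
	if b.timedOut {
		return // hung workers no longer gate the next poll
	}
	b.wg.Wait()
}
```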
6f19b88fa7
Merge pull request #2920 from Molecule-AI/feat/structured-provision-logging-2867
feat(workspace-server): structured logging at provisioning boundaries (#2867 PR-D)
2026-05-05 19:34:05 +00:00
Hongming Wang
83454e5efd feat(workspace-server): structured logging at provisioning boundaries
Adds internal/provlog with a single Event(name, fields) helper that
emits JSON-tagged single-line records to the standard logger. Five
boundary sites instrumented for #2867:

  provision.start         — workspace_dispatchers.go (sync + async)
  provision.skip_existing — org_import.go idempotency hit
  provision.ec2_started   — cp_provisioner.go after RunInstances
  provision.ec2_stopped   — cp_provisioner.go after TerminateInstances ack
  restart.pre_stop        — workspace_restart.go before Stop dispatch

These pair with the existing human-prose log.Printf lines (kept). The
new records are grep+jq friendly so a future log-aggregation pipeline
can reconstruct per-workspace provision timelines without parsing the
operator messages — this is the "and debug loggers so it dont happen
again" half of the leak-prevention work.

Tests:
  - provlog: emits evt-prefixed JSON, nil-tolerant, marshal-error
    fallback preserves event boundary, single-line output pinned.
  - handlers: provlog_emit_test.go pins three call-site contracts:
    provisionWorkspaceAutoSync emits provision.start with sync=true,
    stopForRestart emits restart.pre_stop with backend=cp on SaaS,
    and backend=none when both backends are nil.

Field taxonomy is a convenience for ops, not a contract — the payload
can grow
additively without breaking callers. Behavior gate is the event name +
boundary location, per feedback_behavior_based_ast_gates.md.

Refs #2867 (PR-D structured logging at provisioning boundaries)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 12:30:11 -07:00
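A rough sketch of what an `Event` helper with the pinned properties could look like (the `evt` prefix, nil tolerance, and marshal-error fallback are named above; the exact field handling is an assumption):

```go
package provlog

import (
	"encoding/json"
	"log"
)

// Event emits one single-line, grep+jq-friendly JSON record to the
// standard logger. The sketch mutates the caller's map for brevity.
func Event(name string, fields map[string]any) {
	if fields == nil {
		fields = map[string]any{} // nil-tolerant
	}
	fields["event"] = name
	b, err := json.Marshal(fields)
	if err != nil {
		// Marshal-error fallback still preserves the event boundary.
		log.Printf("evt {\"event\":%q,\"marshal_error\":%q}", name, err.Error())
		return
	}
	log.Printf("evt %s", b)
}
```

A call site would then read like `Event("provision.start", map[string]any{"workspace_id": id, "sync": true})`, recoverable downstream by grepping for `evt ` and piping the JSON tail through jq.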
Hongming Wang
575f893f4e fix(canvas): consume CP logout_url to break the SSO re-auth loop
Follow-up to molecule-controlplane#485. The first half of #2913 wired
a Sign-out button + signOut() helper that POSTed /cp/auth/signout, but
clicking still left the user signed in: WorkOS's browser cookie
preserved the SSO session, /cp/auth/login auto-re-authed via SSO, and
the user landed back on /orgs.

CP PR #485 returns the AuthKit hosted logout URL in the signout
response. This change has signOut() navigate the browser there
instead of /cp/auth/login. AuthKit clears its cookie + redirects to
return_to (configured server-side from APP_URL) → next /cp/auth/login
hits a fresh AuthKit, no SSO session, login form actually shows.

Defensive parsing: malformed JSON, missing logout_url, or wrong-type
logout_url all fall through to the legacy /cp/auth/login fallback,
which works locally (DisabledProvider, dev) where there's no SSO to
escape.

Forward-compat: when CP doesn't have #485 deployed yet, signOut()
sees logout_url="" or missing → fallback fires. Order of merge
between this and #485 doesn't matter, but the bug isn't actually
fixed end-to-end until both ship.

Tests added (3 new, 15 total auth.test.ts):
- Hosted logout: navigates to logout_url when response includes one.
- DisabledProvider path: falls back to /cp/auth/login when "".
- Defensive: malformed JSON body → fallback (no crash).
- Defensive: non-string logout_url → fallback (no open redirect).

Verified:
- npx vitest run src/lib/__tests__/auth.test.ts — 15/15 pass
- tsc --noEmit clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 12:21:49 -07:00
Hongming Wang
4cac4e7710 fix(canvas): wire SaaS Sign-out button — POST /cp/auth/signout was unreachable from the UI
Reported externally on 2026-05-05: "SaaS app logout does not work."

Root cause: the control plane has had POST /cp/auth/signout (clears the
WorkOS session cookie + revokes at the provider) since auth shipped,
but no canvas code ever called it. grep across canvas/ for
`logout|signOut|signout|sign-out` returned zero results — no helper,
no button, no menu entry. Users had no path to log out short of
clearing cookies in DevTools.

This is a UI gap, not a backend bug. Adding the missing pieces:

1. `signOut()` helper in `canvas/src/lib/auth.ts`:
   - POST /cp/auth/signout with credentials:include (cross-origin
     cookie required for tenant subdomain → app subdomain)
   - Best-effort: a 5xx, 401-stale-cookie, or network failure still
     redirects the browser to /cp/auth/login. Leaving the user on an
     authed-looking page after they clicked Sign out is the worst
     possible UX — that's the precise "logout doesn't work" symptom
     the report described.
   - Lands on /cp/auth/login (not the current URL) so the user
     doesn't loop back into the org they just left via AuthGate's
     return_to.

2. `AccountBar` component on /orgs page Shell — renders the signed-in
   email + Sign-out button at the top. Click → signOut() →
   `Signing out…` → bounces to login. Disabled-while-pending so a
   double-click can't fire two requests.

3. Tests in `auth.test.ts` (4 new, total 12 pass):
   - POSTs to the right endpoint with credentials:include
   - Redirects to /cp/auth/login after success
   - Redirects EVEN ON network failure (the critical UX invariant)
   - Redirects on 401 (stale cookie path)

The auth-origin resolution (`getAuthOrigin`) is reused so a tenant
subdomain (acme.moleculesai.app) correctly POSTs to
app.moleculesai.app/cp/auth/signout — same chain that fetchSession
+ redirectToLogin already use.

Test plan:
- [x] `npx vitest run src/lib/__tests__/auth.test.ts` — 12/12 green
- [x] `tsc --noEmit` — clean
- [ ] Manual: navigate to /orgs, click Sign out, observe redirect +
      that the next /orgs visit bounces to login (cookie cleared)
- [ ] CI green

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 12:20:18 -07:00
Hongming Wang
8254bedf30
Merge pull request #2917 from Molecule-AI/chore/delete-team-collapse-2864
chore: delete TeamHandler.Collapse + docs cleanup (closes #2864)
2026-05-05 19:04:30 +00:00
Hongming Wang
ec72f199e6
Merge pull request #2916 from Molecule-AI/fix/memory-plugin-embed-migrations
fix(memory-plugin): embed migrations into binary via go:embed (hotfix #2906)
2026-05-05 19:04:01 +00:00
Hongming Wang
ae22a55675
Merge pull request #2909 from Molecule-AI/feat/poll-mode-chat-upload-phase5b
feat(poll-upload): phase 5b — concurrent BatchFetcher + httpx client reuse
2026-05-05 12:05:58 -07:00
molecule-ai[bot]
08648bf4b1
Merge pull request #2914 from Molecule-AI/staging
staging → main: auto-promote 5334d60
2026-05-05 12:03:30 -07:00
Hongming Wang
eec4ea2e7d chore: delete TeamHandler.Collapse + docs cleanup (closes #2864)
Multi-model retrospective review of #2856 (Phase 1 Expand removal)
flagged that TeamHandler.Collapse is unreachable from the canvas UI:
the "Collapse Team" button calls PATCH /workspaces/:id { collapsed }
(visual flag toggle on canvas_layouts), NOT POST /workspaces/:id/collapse.
The destructive POST route — which stops EC2s, marks children removed,
and deletes layouts — has zero UI callers (verified via grep across
canvas/, scripts/, and the MCP tool registry; only docs referenced it).

Two semantically different operations had been sharing the word
"Collapse":

- Visual collapse (canvas) → PATCH { collapsed: true }. Hides
  children visually. Reversible. UI-only.
- Destructive collapse (POST /collapse) → Stops + marks removed.
  Irreversible. No caller.

Deleting the destructive one + its supporting machinery:

- workspace-server/internal/handlers/team.go (entirely)
- workspace-server/internal/handlers/team_test.go (entirely)
- POST /collapse route + teamh init in router.go
- findTemplateDirByName helper (zero non-test callers after Expand
  was deleted in #2856; package-private so no out-of-package consumers)
- NewTeamHandler constructor (no callers after route removed)

Plus stale doc references (the most dangerous was the MCP wrapper
mapping in mcp-server-setup.md — anyone generating MCP tool wrappers
from that table was wiring a 404):

- docs/agent-runtime/team-expansion.md (deleted entirely — whole
  guide taught the deleted flow)
- docs/api-reference.md (dropped two team.go rows)
- docs/api-protocol/platform-api.md (dropped /expand + /collapse
  rows)
- docs/architecture/molecule-technical-doc.md (dropped /expand +
  /collapse rows)
- docs/guides/mcp-server-setup.md (dropped expand_team +
  collapse_team MCP wrapper mappings)
- docs/glossary.md (dropped "(org template expand_team)"
  parenthetical)
- docs/frontend/canvas.md (dropped broken link to deleted
  team-expansion.md)

Kept: docs/architecture/backends.md mention of "TeamHandler.Expand
(#2367) bypassed routing on Start" — correct historical context for
the AST gate's existence, no live route reference.

Visual-collapse path unaffected:
  canvas/src/components/ContextMenu.tsx:227 → api.patch — unchanged
  canvas/src/components/WorkspaceNode.tsx:128 → api.patch — unchanged

go vet ./... clean. go test ./internal/handlers/ -count 1 — all green
(4.3s, no regression).

Net: -388/+10 = ~378 lines removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:59:43 -07:00
Hongming Wang
6201d12533 fix(memory-plugin): embed migrations into binary via go:embed
PR #2906 shipped the binary at /memory-plugin without the migrations
directory. The plugin's runMigrations() resolved a relative path
`cmd/memory-plugin-postgres/migrations` that exists in the build
context but NOT in the runtime image. Every staging tenant boot
failed with:

  memory-plugin-postgres: migrate: read migrations dir
    "cmd/memory-plugin-postgres/migrations": open
    cmd/memory-plugin-postgres/migrations: no such file or directory

  memory-plugin:  /v1/health never returned 200 after 30s
    — aborting boot

Caught on the staging redeploy fleet job after #2906 merged. Tenants
stayed on the old image (CP redeploy correctly fail-fasted) but the
new image was broken.

Fix: `//go:embed migrations/*.up.sql` bundles the migrations into
the binary at build time. No filesystem path dependency at runtime.

  * `embed.FS` embeds the .up.sql files into the binary.
  * runMigrations() reads from migrationsFS by default;
    MEMORY_PLUGIN_MIGRATIONS_DIR override path preserved for operators
    shipping custom migrations.
  * Names sorted alphabetically — pinned by a test so a future
    `002_*.up.sql` is guaranteed to run after `001_*.up.sql`.

Tests:
  * TestMigrationsEmbedded_ContainsCreateTable — pins that the embed
    pattern matched files AND those files contain CREATE TABLE
    (catches both empty-pattern and wrong-files-embedded).
  * TestRunMigrationsFromEmbed_OrderingIsAlphabetic — pins sorted
    application order.

Verified locally: `go build` succeeds, binary 9.3MB,
`strings` shows the embedded SQL.

Refs RFC #2728. Hotfix for #2906.
2026-05-05 11:57:37 -07:00
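A minimal sketch of the embed pattern, assuming a plain `database/sql` handle and sequential `Exec` of each file; `runMigrations`, `migrationsFS`, and the override env var are named in the commit, the rest is illustrative:

```go
package main

import (
	"database/sql"
	"embed"
	"fmt"
	"io/fs"
	"os"
	"sort"
)

//go:embed migrations/*.up.sql
var migrationsFS embed.FS

func runMigrations(db *sql.DB) error {
	var dir fs.FS
	if override := os.Getenv("MEMORY_PLUGIN_MIGRATIONS_DIR"); override != "" {
		dir = os.DirFS(override) // operator escape hatch, preserved
	} else {
		sub, err := fs.Sub(migrationsFS, "migrations") // no runtime path dependency
		if err != nil {
			return err
		}
		dir = sub
	}
	names, err := fs.Glob(dir, "*.up.sql")
	if err != nil {
		return err
	}
	sort.Strings(names) // 001_*.up.sql before 002_*.up.sql, pinned by test
	for _, name := range names {
		b, err := fs.ReadFile(dir, name)
		if err != nil {
			return err
		}
		if _, err := db.Exec(string(b)); err != nil {
			return fmt.Errorf("migrate %s: %w", name, err)
		}
	}
	return nil
}
```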
Hongming Wang
81e83c05b7 fix(inbox): drop unused batch_fetcher = None after end-of-batch drain
Lint nit from review bot — _drain_uploads() runs and the function
immediately advances to the cursor save + return, so the local
re-assign is dead code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:56:54 -07:00
Hongming Wang
5b5eacbb29 test(inbox): clean up daemon poller thread to prevent test cross-talk
test_start_poller_thread_is_daemon spawned a daemon thread with no stop
mechanism — the leaked thread polled every 10ms with the test's patched
httpx.Client mock STILL ACTIVE for ~50ms after the test scope. Later
tests that re-patched httpx.Client + asserted call counts on
fetch_and_stage / Client construction got their assertions inflated by
the leaked thread's iterations.

Symptoms: test_poll_once_skips_chat_upload_row_from_queue saw
fetch_and_stage called twice instead of once on Python 3.11 CI;
test_batch_fetcher_owns_client_when_not_supplied saw two Client
constructions instead of one in the full local suite. Both surfaced
only after Phase 5b's BatchFetcher refactor changed the timing window
that allowed the leaked thread to fire mid-test.

Fix: extend start_poller_thread with an optional stop_event kwarg
(backward compatible — production callers pass None and rely on the
daemon flag for process-exit cleanup). The test now signals + joins
on stop_event before exiting scope, so the thread is gone before any
later test patches httpx.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:47:14 -07:00
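The fix is Python (`threading.Event`); as a cross-language analogy only, the same stoppable-poller shape in Go uses a close-able stop channel plus a done channel the test can join on. All names here are invented:

```go
package inbox

import "time"

// startPoller polls every interval until stop is closed (or forever if
// stop is nil). It returns a done channel the caller can block on so the
// goroutine is provably gone before a test re-patches shared state.
func startPoller(interval time.Duration, pollOnce func(), stop <-chan struct{}) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		t := time.NewTicker(interval)
		defer t.Stop()
		for {
			select {
			case <-stop: // a nil channel blocks forever: production no-op
				return
			case <-t.C:
				pollOnce()
			}
		}
	}()
	return done
}
```

A test would `close(stop)` and then block on the returned channel before exiting scope; production callers pass nil and keep the fire-and-forget behavior.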
Hongming Wang
c8fca1467e
Merge pull request #2915 from Molecule-AI/fix/harness-disable-memory-plugin
fix(harness): disable memory-plugin sidecar in harness tenants
2026-05-05 18:45:43 +00:00
Hongming Wang
7c8b81c6eb fix(harness): disable memory-plugin sidecar in harness tenants
PR #2906 bundled memory-plugin-postgres as a startup-gated sidecar in
both tenant entrypoints. Plugin migrations include
`CREATE EXTENSION IF NOT EXISTS vector` which fails on the harness's
plain postgres:15-alpine (no pgvector preinstalled). The 30s health
gate then aborts container boot and Harness Replays fails.

Detected on auto-promote PR #2914 — Harness Replays job:
  Container harness-tenant-alpha-1  Error
  Container harness-tenant-beta-1   Error
  dependency failed to start: container harness-tenant-alpha-1 exited (1)

The harness doesn't exercise memory features, so the simplest fix is
to use the documented escape hatch the sidecar entrypoint already
ships (MEMORY_PLUGIN_DISABLE=1) — applied to both alpha and beta
tenants in compose.yml. Alternative would be switching the harness
postgres images to pgvector/pgvector:pg15, deferred until the
harness wants to verify memory paths.

Refs PR #2906. Unblocks #2914 (auto-promote staging→main).
2026-05-05 11:42:20 -07:00
Hongming Wang
fc1c45789e
Merge pull request #2912 from Molecule-AI/feat/saas-default-hardening-2910
feat(saas): close 4th default-tier site + lift org_import asymmetry + tests (#2910)
2026-05-05 18:42:19 +00:00
Hongming Wang
e3a18ed8e8
Merge pull request #2911 from Molecule-AI/fix/memory-plugin-bind-loopback
fix(memory-plugin): bind to 127.0.0.1 by default
2026-05-05 18:38:35 +00:00
Hongming Wang
9f551319d2 feat(saas): close 4th default-tier site + lift org_import asymmetry + tests (#2910)
Multi-model retrospective review of #2901 found three Critical gaps:

1. (#2910 PR-B) template_import.go:79 hardcoded `tier: 3` into the
   generated config.yaml. On SaaS this defeated the T4 default at the
   create-handler layer — a config-less template import landed at T3
   regardless of POST /workspaces' computed default. The 4th
   default-tier site #2901 missed.

2. (#2910 PR-A) #2901 claimed `go test ... all green` but added zero
   new tests. Existing structural-pin tests caught dispatch-layer
   drift but said nothing about tier-default drift. A future refactor
   that flips DefaultTier() to always return 3 would ship green.

3. (#2910 PR-E) org_import.go fallback returned T2 on self-hosted
   while workspace.go returned T3. Internally consistent ("bulk vs
   interactive defaults") but undocumented same-name-different-value
   drift.

Fix:

- TemplatesHandler.NewTemplatesHandler now takes `wh *WorkspaceHandler`
  (nil-tolerant for read-only callers). Import + ReplaceFiles compute
  tier via h.wh.DefaultTier() and pass it to generateDefaultConfig.
  generateDefaultConfig gets a `tier int` parameter (bounds-checked,
  invalid input falls back to T3).

- org_import.go fallback lifts to h.workspace.DefaultTier() — single
  source of truth shared with Create + Templates so a future
  tier-default change sweeps every entry point at once.

- New saas_default_tier_test.go pinning:
    TestIsSaaS_TrueWhenCPProvWired
    TestIsSaaS_FalseWhenOnlyDocker
    TestDefaultTier_SaaS_IsT4
    TestDefaultTier_SelfHosted_IsT3
    TestGenerateDefaultConfig_RespectsTierParam
    TestGenerateDefaultConfig_SelfHostedTierT3
    TestGenerateDefaultConfig_OutOfRangeFallsBackToT3

- Existing template_import_test.go tests + chat_files_test.go +
  security_regression_test.go updated to thread the new tier param /
  wh constructor arg through their NewTemplatesHandler calls. Their
  pre-#2910 assertion of `tier: 3` is preserved (now passes because
  the test caller passes `3` explicitly), so no regression.

go vet ./... clean. go test ./internal/handlers/ -count 1 — all
green (4.2s).

Deferred to separate follow-ups (per #2910 plan):
- PR-C: MOLECULE_DEPLOYMENT_MODE explicit deployment-mode signal
  (closes the IsSaaS()=cpProv!=nil structural fragility)
- PR-D: Host iptables IMDS block + IMDSv2 hop-limit (paired with
  molecule-controlplane EC2-IAM-scope audit)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:38:22 -07:00
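A sketch of the single-source-of-truth shape described above. The handler is reduced to the one field the commit calls structurally fragile (IsSaaS() == cpProv != nil), and the YAML body plus the 1..4 bounds are assumptions:

```go
package handlers

import "fmt"

// Reduced to the shape #2910 PR-C will replace: SaaS-ness inferred from
// whether a control-plane provisioner is wired.
type WorkspaceHandler struct {
	cpProv any // nil when self-hosted
}

func (h *WorkspaceHandler) IsSaaS() bool { return h.cpProv != nil }

// DefaultTier is the single source of truth Create, Templates, and the
// org_import fallback all share; one change sweeps every entry point.
func (h *WorkspaceHandler) DefaultTier() int {
	if h.IsSaaS() {
		return 4 // T4 default on SaaS
	}
	return 3 // T3 default self-hosted
}

// generateDefaultConfig takes the tier instead of hardcoding 3;
// out-of-range input falls back to T3 rather than erroring.
func generateDefaultConfig(name string, tier int) string {
	if tier < 1 || tier > 4 {
		tier = 3
	}
	return fmt.Sprintf("name: %s\ntier: %d\n", name, tier)
}
```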
Hongming Wang
1052f8bdb0 fix(memory-plugin): bind to 127.0.0.1 by default
Self-review of PR #2906 flagged: defaultListenAddr was ":9100" — binds
on every container interface. Inside today's deployment that's moot
(no host port mapping, platform talks over loopback) but it's not
least-privilege. A future Dockerfile edit that publishes the port,
a misconfigured Fly machine, or a future cross-host plugin topology
would expose an unauth'd memory store.

Loopback is the right baseline. Operators with a multi-host topology
already override via MEMORY_PLUGIN_LISTEN_ADDR — that path is unchanged.

Tests:
  * TestLoadConfig_DefaultListenAddrIsLoopback pins the new default.
  * TestLoadConfig_ListenAddrEnvOverride pins the override path so
    operators relying on it don't break.
  * TestLoadConfig_MissingDatabaseURL covers the existing fail-fast.

No prior unit tests existed for loadConfig — boot_e2e_test.go always
sets MEMORY_PLUGIN_LISTEN_ADDR explicitly, so the default was never
exercised by tests. This PR adds that coverage.

Refs RFC #2728. Hardening follow-up to PR #2906.
2026-05-05 11:35:24 -07:00
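The change itself is a one-line default; a sketch of the default/override pair the three tests pin, with the config reduced to the listen address:

```go
package main

import "os"

const defaultListenAddr = "127.0.0.1:9100" // was ":9100" (every interface)

// listenAddr returns loopback unless the operator overrides; the
// multi-host MEMORY_PLUGIN_LISTEN_ADDR path is unchanged.
func listenAddr() string {
	if v := os.Getenv("MEMORY_PLUGIN_LISTEN_ADDR"); v != "" {
		return v
	}
	return defaultListenAddr
}
```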
Hongming Wang
30fb507165 feat(poll-upload): phase 5b — concurrent BatchFetcher + httpx client reuse
Resolves the two remaining findings from the Phase 1-4 retrospective
review (the Python-side counterparts to phase 5a):

1. Important — inbox_uploads.fetch_and_stage blocked the inbox poll
   loop synchronously per row. A user dragging 4 files into chat at
   once would stall the poller for 4× per-fetch latency before the
   chat message reached the agent. Add BatchFetcher: a thread-pool
   wrapper (default 4 workers) that submits fetches concurrently and
   exposes wait_all() as the barrier the inbox loop calls before
   processing the chat-message row that references the uploads.

   The drain barrier is the correctness invariant: rewrite_request_body
   must observe a populated URI cache when it walks the chat-message
   row's parts. _poll_once now drains the BatchFetcher inline before
   the first non-upload row, AND at end-of-batch (covering batches that
   contain only upload rows: the chat message then arrives in a later
   poll, and the end-of-batch drain closes the race between that future
   poll and the still-running fetch).

2. Nit — fetch_and_stage created two httpx.Client instances per row
   (one for GET /content, one for POST /ack). Refactor so a single
   client serves both calls. When called from BatchFetcher, the
   batch-shared client serves every row's GET + ack — so the second
   fetch reuses the TCP+TLS handshake from the first.

Comprehensive tests:

- 13 new inbox_uploads tests:
  - fetch_and_stage with supplied client: zero httpx.Client
    constructions, GET+POST through the same client, caller's client
    not closed (lifecycle owned by caller).
  - fetch_and_stage without supplied client: exactly one
    httpx.Client constructed (was 2 pre-fix), closed on the way out.
  - BatchFetcher: 3 rows × 120ms = parallel completion < 250ms
    (vs. ~360ms serial), URI cache hot when wait_all returns,
    per-row failure isolation, single-client reuse across all
    submits, idempotent close, submit-after-close raises,
    owned-vs-supplied client lifecycle, no-op wait_all on empty
    batch, graceful httpx-missing degradation.

- 3 new inbox tests:
  - poll_once drains uploads before processing the chat-message row
    (in-place mutation of row['request_body'] proves the URI was
    rewritten BEFORE message_from_activity returned).
  - poll_once with only upload rows still drains at end-of-batch.
  - poll_once with no upload rows never constructs a BatchFetcher
    (zero overhead on the no-upload happy path).

133 total inbox + inbox_uploads tests pass; 0 regressions.

Closes the chat-upload poll-mode-perf gap end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:26:55 -07:00
molecule-ai[bot]
77e9a965ac
Merge pull request #2904 from Molecule-AI/staging
staging → main: auto-promote 6470e5f
2026-05-05 18:24:19 +00:00
Hongming Wang
5334d60de4
Merge pull request #2898 from Molecule-AI/2867-workspaces-insert-allowlist
test(handlers): allowlist INSERT INTO workspaces sites (#2867 class 1)
2026-05-05 18:18:19 +00:00
Hongming Wang
d6c0227e3f
Merge pull request #2906 from Molecule-AI/feat/memory-plugin-sidecar-bundle
feat(memory-v2): bundle memory-plugin-postgres as in-image sidecar
2026-05-05 18:16:57 +00:00
Hongming Wang
27db090d3d
Merge pull request #2907 from Molecule-AI/feat/poll-mode-chat-upload-phase5a
feat(poll-upload): phase 5a — atomic batch insert + acked-index + mime hardening
2026-05-05 11:16:56 -07:00
Hongming Wang
0f25f6de97 test(handlers): allowlist INSERT INTO workspaces sites — close bulk-create regression class (#2867 class 1)
Adds TestINSERTworkspacesAllowlist: walks every non-test .go in this
package, finds funcs containing an `INSERT INTO workspaces (` SQL
literal, and pins the result against an explicit allowlist with the
safety mechanism named per entry.

New entries fail the build until a reviewer adds them — forcing the
question "what makes this INSERT idempotent?" at PR-review time, not
after the next bulk-create leak (the shape that produced 72 stale
child workspaces in tenant-hongming over 4 days).

Pairs with TestCreateWorkspaceTree_CallsLookupBeforeInsert (the
behavior pin for the one bulk path today). Together:
- this test catches "did a new function start inserting?"
- that test catches "did the existing bulk path drop its idempotency check?"

Both fire immediately when drift happens.

Current allowlist (3 entries):
- org_import.go:createWorkspaceTree → lookup-then-insert via
  lookupExistingChild (#2868 phase 3, also pinned by the sibling AST
  gate from #2895)
- registry.go:Register → ON CONFLICT (id) DO UPDATE (idempotent by
  primary key — external workspace upsert)
- workspace.go:Create → single-workspace POST /workspaces, server-
  generated UUID, no iteration

Verified via mutation: dropping a synthetic tempBulkLeakTest with an
unsafe loop+INSERT into the package fails the gate with a clear
diagnostic pointing at the file + function. Restoring the tree
returns the gate to green.

Memory: feedback_assert_exact_not_substring.md (verify tightened test
FAILS on bug shape) — mutation proof done locally.

RFC #2867 class 1. Class 2 (Prometheus gauge for ec2_instance
duplicates) + class 3 (structured logging on workspace create) are
follow-up PRs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:15:16 -07:00
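A compressed sketch of the allowlist shape. The three entries mirror the list above; the AST-scanning helper is stubbed here (the real one walks every non-test .go file in the package):

```go
package handlers

import "testing"

// Keyed by file:function; the value names the mechanism that makes the
// INSERT idempotent, so review pressure lands on the right question.
var insertWorkspacesAllowlist = map[string]string{
	"org_import.go:createWorkspaceTree": "lookup-then-insert via lookupExistingChild",
	"registry.go:Register":              "ON CONFLICT (id) DO UPDATE",
	"workspace.go:Create":               "single-row POST, server-generated UUID, no iteration",
}

func TestINSERTworkspacesAllowlist(t *testing.T) {
	for _, site := range findInsertWorkspacesSites(t) {
		if _, ok := insertWorkspacesAllowlist[site]; !ok {
			t.Errorf("unlisted INSERT INTO workspaces site %s: add it to the "+
				"allowlist and name what makes it idempotent", site)
		}
	}
}

// Stub for the sketch; the real helper AST-scans the package for
// functions containing an `INSERT INTO workspaces (` SQL literal.
func findInsertWorkspacesSites(t *testing.T) []string {
	t.Helper()
	return []string{
		"org_import.go:createWorkspaceTree",
		"registry.go:Register",
		"workspace.go:Create",
	}
}
```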
Hongming Wang
9991057ad1 feat(poll-upload): phase 5a — atomic batch insert + acked-index + mime hardening
Resolves four of six findings from the retrospective code review of Phases
1–4 (poll-mode chat upload). Bundled because every change is in the
platform's pending_uploads layer or the multi-file handler that reads it.

Findings resolved:

1. Important — Sweep query lacked an index for the acked-retention OR-arm.
   The Phase 1 partial indexes are both `WHERE acked_at IS NULL`, so the
   `(acked_at IS NOT NULL AND acked_at < retention)` half of the WHERE
   clause seq-scanned the table on every cycle. Add a complementary
   partial index on `acked_at WHERE acked_at IS NOT NULL` so both arms
   of the disjunction are index-covered. Disjoint from the existing two
   indexes (no row matches both predicates), so write amplification is
   bounded to ~one index entry per terminal-state row.

2. Important — uploadPollMode partial-failure left orphans. The previous
   per-file Put loop committed rows 1..K-1 and then errored on row K with
   no compensation, so a client retry would double-insert the survivors.
   Refactor the handler into three explicit phases (pre-validate +
   read-into-memory, single atomic PutBatch, per-file activity row) and
   add Storage.PutBatch with all-or-nothing transaction semantics.

3. FYI — pendinguploads.StartSweeperWithInterval was exported only for
   tests. Move it to lower-case startSweeperWithInterval and expose the
   test seam through pendinguploads/export_test.go (Go convention; the
   shim file is stripped from the production binary at build time).

4. Nit — multipart Content-Type was passed verbatim into pending_uploads
   rows and re-served on /content. Add safeMimetype which strips
   parameters, rejects CR/LF/control bytes, and coerces malformed shapes
   to application/octet-stream. The eventual GET /content response can no
   longer be header-split via a crafted Content-Type on the multipart.

Comprehensive tests:

- 10 PutBatch unit tests (sqlmock): happy path, empty input, all four
  pre-validation rejection paths, BeginTx error, per-row error +
  Rollback (no Commit), first-row error, Commit error.
- 4 new PutBatch integration tests (real Postgres): all-rows-commit
  happy path with COUNT(*) verification, atomic-rollback no-leak via
  a NUL-byte filename that lib/pq rejects mid-batch, oversize
  short-circuit no-Tx, idx_pending_uploads_acked existence + partial
  predicate via pg_indexes (planner-shape-independent).
- 3 new chat_files_poll tests: atomic rollback on second-file oversize,
  atomic rollback on PutBatch error, mimetype CRLF/NUL/parameter
  sanitization (8 sub-cases).

The two remaining review findings (inbox_uploads.fetch_and_stage blocks
the poll loop synchronously; two httpx Clients per row) are Python-side
and ship in Phase 5b once this lands on staging.

Test-only export pattern via export_test.go, atomic pre-validation
discipline (validate before Tx), and behavior-based (not name-based)
test assertions follow the standing project conventions.
2026-05-05 11:10:13 -07:00
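A sketch of the all-or-nothing PutBatch contract from finding 2: validate before opening the Tx, one transaction for the whole batch, rollback on any row. The row type, SQL, and validation rules are invented for illustration:

```go
package pendinguploads

import (
	"context"
	"database/sql"
	"fmt"
)

// Row is an illustrative stand-in for the real pending-upload record.
type Row struct {
	FileID, Filename, Mimetype string
	Body                       []byte
}

// PutBatch commits every row or none, so a mid-batch failure leaves no
// orphans for a client retry to double-insert.
func PutBatch(ctx context.Context, db *sql.DB, rows []Row) error {
	for _, r := range rows { // validate before opening the Tx
		if r.FileID == "" || r.Filename == "" {
			return fmt.Errorf("putbatch: invalid row %q", r.Filename)
		}
	}
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	for _, r := range rows {
		if _, err := tx.ExecContext(ctx,
			`INSERT INTO pending_uploads (file_id, filename, mimetype, body)
			 VALUES ($1, $2, $3, $4)`,
			r.FileID, r.Filename, r.Mimetype, r.Body); err != nil {
			_ = tx.Rollback() // rows 1..K-1 vanish with row K's error
			return fmt.Errorf("putbatch insert %s: %w", r.FileID, err)
		}
	}
	return tx.Commit()
}
```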
Hongming Wang
b89a49ec93 feat(memory-v2): bundle memory-plugin-postgres as in-image sidecar
Closes the gap between the merged Memory v2 code (PR #2757 wired the
client into main.go) and operator activation. Without this PR an
operator wanting to flip MEMORY_V2_CUTOVER=true had to provision a
separate memory-plugin service and point MEMORY_PLUGIN_URL at it —
extra ops surface for what the design intends to be a built-in.

What ships:
  * Both Dockerfile + Dockerfile.tenant build the
    cmd/memory-plugin-postgres binary into /memory-plugin.
  * Entrypoints spawn the plugin in the background on :9100 BEFORE
    starting the main server; wait up to 30s for /v1/health to return
    200; abort boot loud if it doesn't (better to crash-loop than to
    silently route cutover traffic against a dead plugin).
  * Default env: MEMORY_PLUGIN_DATABASE_URL=$DATABASE_URL (share the
    existing tenant Postgres — plugin's `memory_namespaces` /
    `memory_records` tables coexist with platform schema, no
    conflicts), MEMORY_PLUGIN_LISTEN_ADDR=:9100.
  * MEMORY_PLUGIN_DISABLE=1 escape hatch for operators running the
    plugin externally on a separate host.
  * Platform image: plugin runs as the `platform` user (not root) via
    su-exec — matches the privilege boundary the main server already
    drops to. Tenant image already starts as `canvas` so the plugin
    inherits non-root automatically.

What stays operator-controlled:
  * MEMORY_V2_CUTOVER is NOT auto-set. Behavior change for existing
    deployments: zero. The wiring at workspace-server/internal/memory/
    wiring/wiring.go skips building the plugin client until the
    operator opts in, so the running sidecar is a no-op for traffic
    until then.
  * MEMORY_PLUGIN_URL is NOT auto-set either, for the same reason —
    setting it implies cutover-active intent. Operators set both on
    staging first, verify a live commit/recall round-trip (closes
    pending task #292), then promote to production.

Operator activation steps after this PR ships:
  1. Verify pgvector extension is available on the target Postgres
     (the plugin's first migration runs CREATE EXTENSION IF NOT
     EXISTS vector). Railway's managed Postgres ships pgvector
     available; some self-hosted operators may need to enable it.
  2. Redeploy the workspace-server with this image.
  3. Set MEMORY_PLUGIN_URL=http://localhost:9100 + MEMORY_V2_CUTOVER=true
     in the environment (staging first).
  4. Watch boot logs for "memory-plugin:  sidecar healthy" and the
     wiring.go cutover messages; do a live commit_memory + recall_memory
     round-trip via the canvas Memory tab to verify.
  5. Promote to production once staging holds for a sweep window.

Refs RFC #2728. Closes the dormant-plugin gap noted in task #294.
2026-05-05 11:10:11 -07:00
Hongming Wang
f5613bf099
Merge pull request #2902 from Molecule-AI/fix-pendinguploads-sweeper-test-race
test(pendinguploads): close cycleDone-vs-metric-record race in sweeper tests
2026-05-05 18:02:21 +00:00
Hongming Wang
9bd2a2c45f
Merge pull request #2903 from Molecule-AI/fix/chat-tab-initial-scroll-bottom
fix(canvas/chat): instant-scroll to bottom on first mount
2026-05-05 17:50:42 +00:00
Hongming Wang
a489ee1a7c fix(canvas/chat): instant-scroll to bottom on first mount
Reported: "right now when chat box opens it opens in the middle, but
it should be at the end of conversation."

Root cause: ChatTab.tsx:548 fires `bottomRef.scrollIntoView({ behavior:
"smooth" })` on every messages-update. On initial mount with N
messages already loaded, the smooth-scroll triggers a ~300ms animation
that any concurrent React re-render (agent push landing, theme
toggle, sidepanel resize) interrupts mid-flight, leaving the user
stuck somewhere in the middle of the conversation.

Fix: track first-mount via hasInitialScrollRef. Use behavior:"instant"
for the initial jump (deterministic, no animation interruption), then
smooth for subsequent appends (the new-message-landing visual stays).

Refs flipped on first messages.length > 0 transition, so:
- Initial open of chat tab: instant jump to bottom ✓
- New agent message arrives: smooth scroll into view ✓
- Workspace switch (ChatTab remounts): fresh hasInitialScrollRef, gets
  instant again ✓
- loadOlder prepend: anchor-restore path unchanged, still pins user's
  reading position ✓

Test plan:
- pnpm test --run ChatTab.lazyHistory.test.tsx → 8 pass (existing
  lazy-history tests untouched)
- npx tsc --noEmit clean
- Manual on hongming.moleculesai.app: open a busy chat (mac laptop,
  ~50 messages), confirm view lands at the latest bubble, not mid-
  scroll. Switch to another workspace + back → instant again.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 10:47:32 -07:00
Hongming Wang
c79ba05ed5 test(pendinguploads): close cycleDone-vs-metric-record race in sweeper tests
TestStartSweeper_RecordsMetricsOnError flaked on every CI rerun under
race detection: `error counter delta = 0, want 1`. Root cause is a
race between two goroutines, not a bug in the production sweeper.

The fake `fakeSweepStorage.Sweep` signals `cycleDone` from inside its
deferred return — that happens BEFORE Sweep's return value is
received by `sweepOnce`, which is what triggers the metric increment.
On slow CI hosts the test goroutine wins the read after `waitForCycle`
unblocks and BEFORE StartSweeper's goroutine has called
`metrics.PendingUploadsSweepError`, so the asserted delta is 0 even
though the metric WILL be 1 a few ms later.

Adds a polling assert helper, `waitForMetricDelta`, that closes the
race deterministically without timing-based sleeps:

- TestStartSweeper_RecordsMetricsOnError uses waitForMetricDelta to
  wait for the error counter to settle at 1.
- TestStartSweeper_RecordsMetricsOnSuccess uses it on the success
  counters (acked, expired) so the error-stayed-zero assertion
  reads after StartSweeper has fully processed the cycle.
- waitForCycle keeps its current shape but documents the caveat in
  its comment so future tests don't repeat the assumption.

Verified: `go test ./internal/pendinguploads/ -race -count 5` passes
all 9 tests across 5 iterations cleanly.

Per memory feedback_question_test_when_unexpected.md: the
"delta=0, want=1" failure looked like a real production bug at first
glance, but instrumented inspection showed the metric DOES increment,
just AFTER the test's read. The fix is the test's wait shape, not
the sweeper.

Unblocks every PR currently broken by this flake (#2898 hit it on
two consecutive CI runs; staging-merged PRs from earlier today
(#2877/#2881/#2885/#2886) introduced the test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 10:46:17 -07:00
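A sketch of the polling-assert shape (the helper name is from the commit; the reader func, deadline, and poll interval are assumptions):

```go
package pendinguploads

import (
	"testing"
	"time"
)

// waitForMetricDelta polls read() until it returns base+want or the
// deadline passes, closing the cycleDone-vs-record race without a fixed
// sleep that would only move the flake.
func waitForMetricDelta(t *testing.T, read func() float64, base, want float64) {
	t.Helper()
	deadline := time.Now().Add(2 * time.Second)
	for time.Now().Before(deadline) {
		if read()-base == want {
			return
		}
		time.Sleep(5 * time.Millisecond)
	}
	t.Fatalf("metric delta = %v, want %v (after 2s)", read()-base, want)
}
```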