Covers the user-visible flow that Phase 1-5b shipped (RFC #2891):
register a poll-mode workspace, POST a multi-file /chat/uploads request,
verify
the activity feed shows one chat_upload_receive row per file, fetch the
bytes via /pending-uploads/:fid/content, ack each row, and confirm a
post-ack fetch returns 404. Also pins cross-workspace bleed protection
(workspace B's bearer on A's URL → 401, B's URL with A's file_id →
404) and the file_id-UUID-parse 400 path.
23 assertions, all green against a local platform stack (Postgres + Redis +
platform-server) that matches the e2e-api.yml CI recipe verbatim.
Why a new script instead of extending test_poll_mode_e2e.sh: that
script tests A2A short-circuit + since_id cursor semantics; this one
tests the chat-upload path. They share zero handler code on the
platform side and would dilute each other's failure messages if
combined.
Why not the bearerless-401 strict-mode assertion: the platform's
wsauth fail-opens for bearerless requests when MOLECULE_ENV=development
(see middleware/devmode.go). The CI workflow doesn't set that var, but
some local-dev .env files do — the assertion would flap by environment
without testing the poll-mode upload contract. The middleware's own
unit tests cover strict-mode 401.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #2906 spawned the sidecar unconditionally on every tenant boot. The
plugin's first migration runs `CREATE EXTENSION vector` which fails
on tenant Postgres without pgvector preinstalled — every staging
tenant redeploy aborted at the 30s health gate. CP fail-fast kept
running tenants on the prior image (no outage), but the new image
was DOA.
Caught on staging redeploy 2026-05-05 19:23 with
`pq: extension "vector" is not available`.
Fix: only spawn the sidecar when the operator has flipped the cutover
flag — `MEMORY_V2_CUTOVER=true` OR `MEMORY_PLUGIN_URL` is set.
* Aligns the entrypoint to the same opt-in posture wiring.go already
uses (it skips building the client when MEMORY_PLUGIN_URL is empty).
* Until cutover, the sidecar isn't even running — no migration, no
health gate, no boot-time pgvector dependency.
* Operators activating cutover already redeploy with the new env
vars set; that's when the sidecar starts. By definition they've
verified pgvector is available before flipping.
* MEMORY_PLUGIN_DISABLE=1 escape hatch preserved; harness fix #2915
becomes belt-and-suspenders (still respected).
Both Dockerfile and entrypoint-tenant.sh updated. Behavior change for
existing deployments: zero (cutover env vars still unset → sidecar
still inert, but now also not running).
Refs RFC #2728. Hotfix for #2906; supersedes the migration-path
fragility class (the sidecar isn't doing migrations on tenants that
won't use it).
The deadline contract was incomplete: wait_all logged the timeout but
close() then called executor.shutdown(wait=True), which blocked on
the leaked workers — undoing the user-facing timeout. The inbox poll
loop would stall indefinitely on a hung /content fetch instead of
returning to chat-message processing.
Fix: wait_all now flips self._timed_out and cancels queued (not-yet-
started) futures; close() reads that flag and switches to
shutdown(wait=False, cancel_futures=True) on the timeout path.
Currently-running workers can't be interrupted by Python's threading
model, but they're now detached daemons whose blocking httpx call
no longer gates the next poll.
Healthy path (no timeout) keeps the existing drain-and-wait so a
still-queued ack POST isn't dropped mid-write.
Two new tests pin both legs of the contract end-to-end:
- close-after-timeout-doesn't-block: hung worker, wait_all(0.05s)
fires the timeout, close() returns in <1s instead of waiting ~5s
for the worker to come back.
- close-without-timeout-still-drains: 2 slow workers, wait_all
completes cleanly, close() drains both ack POSTs.
Resolves the BatchFetcher timeout-cancellation finding from the
post-merge five-axis review of Phase 5b.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds internal/provlog with a single Event(name, fields) helper that
emits JSON-tagged single-line records to the standard logger. Five
boundary sites instrumented for #2867:
provision.start — workspace_dispatchers.go (sync + async)
provision.skip_existing — org_import.go idempotency hit
provision.ec2_started — cp_provisioner.go after RunInstances
provision.ec2_stopped — cp_provisioner.go after TerminateInstances ack
restart.pre_stop — workspace_restart.go before Stop dispatch
These pair with the existing human-prose log.Printf lines (kept). The
new records are grep+jq friendly so a future log-aggregation pipeline
can reconstruct per-workspace provision timelines without parsing the
operator messages — this is the "and debug loggers so it dont happen
again" half of the leak-prevention work.
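A minimal sketch of the helper's shape, assuming the `evt` line prefix
and the exact fallback wording (the in-repo internal/provlog is
authoritative):

    package provlog

    import (
        "encoding/json"
        "log"
    )

    // Event emits one single-line, evt-prefixed JSON record on the
    // standard logger. A nil fields map is tolerated; a marshal failure
    // falls back to a hand-built record so the event boundary survives.
    func Event(name string, fields map[string]any) {
        rec := map[string]any{"event": name}
        for k, v := range fields { // ranging over a nil map is a no-op
            rec[k] = v
        }
        b, err := json.Marshal(rec)
        if err != nil {
            log.Printf(`evt {"event":%q,"marshal_error":%q}`, name, err.Error())
            return
        }
        log.Printf("evt %s", b)
    }

A call site then reads provlog.Event("provision.start",
map[string]any{"workspace_id": id, "sync": true}) (field names
illustrative).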
Tests:
- provlog: emits evt-prefixed JSON, nil-tolerant, marshal-error
fallback preserves event boundary, single-line output pinned.
- handlers: provlog_emit_test.go pins three call-site contracts:
provisionWorkspaceAutoSync emits provision.start with sync=true,
stopForRestart emits restart.pre_stop with backend=cp on SaaS,
and backend=none when both backends are nil.
The field taxonomy is a convenience for ops, not a contract — payload can grow
additively without breaking callers. Behavior gate is the event name +
boundary location, per feedback_behavior_based_ast_gates.md.
Refs #2867 (PR-D structured logging at provisioning boundaries)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Multi-model retrospective review of #2856 (Phase 1 Expand removal)
flagged that TeamHandler.Collapse is unreachable from the canvas UI:
the "Collapse Team" button calls PATCH /workspaces/:id { collapsed }
(visual flag toggle on canvas_layouts), NOT POST /workspaces/:id/collapse.
The destructive POST route — which stops EC2s, marks children removed,
and deletes layouts — has zero UI callers (verified via grep across
canvas/, scripts/, and the MCP tool registry; only docs referenced it).
Two semantically different operations had been sharing the word
"Collapse":
- Visual collapse (canvas) → PATCH { collapsed: true }. Hides
children visually. Reversible. UI-only.
- Destructive collapse (POST /collapse) → Stops + marks removed.
Irreversible. No caller.
Deleting the destructive one + its supporting machinery:
- workspace-server/internal/handlers/team.go (entirely)
- workspace-server/internal/handlers/team_test.go (entirely)
- POST /collapse route + teamh init in router.go
- findTemplateDirByName helper (zero non-test callers after Expand
was deleted in #2856; package-private so no out-of-package consumers)
- NewTeamHandler constructor (no callers after route removed)
Plus stale doc references (the most dangerous was the MCP wrapper
mapping in mcp-server-setup.md — anyone generating MCP tool wrappers
from that table was wiring a 404):
- docs/agent-runtime/team-expansion.md (deleted entirely — whole
guide taught the deleted flow)
- docs/api-reference.md (dropped two team.go rows)
- docs/api-protocol/platform-api.md (dropped /expand + /collapse
rows)
- docs/architecture/molecule-technical-doc.md (dropped /expand +
/collapse rows)
- docs/guides/mcp-server-setup.md (dropped expand_team +
collapse_team MCP wrapper mappings)
- docs/glossary.md (dropped "(org template expand_team)"
parenthetical)
- docs/frontend/canvas.md (dropped broken link to deleted
team-expansion.md)
Kept: docs/architecture/backends.md mention of "TeamHandler.Expand
(#2367) bypassed routing on Start" — correct historical context for
the AST gate's existence, no live route reference.
Visual-collapse path unaffected:
canvas/src/components/ContextMenu.tsx:227 → api.patch — unchanged
canvas/src/components/WorkspaceNode.tsx:128 → api.patch — unchanged
go vet ./... clean. go test ./internal/handlers/ -count 1 — all green
(4.3s, no regression).
Net: -388/+10 = ~378 lines removed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #2906 shipped the binary at /memory-plugin without the migrations
directory. The plugin's runMigrations() resolved a relative path
`cmd/memory-plugin-postgres/migrations` that exists in the build
context but NOT in the runtime image. Every staging tenant boot
failed with:
memory-plugin-postgres: migrate: read migrations dir
"cmd/memory-plugin-postgres/migrations": open
cmd/memory-plugin-postgres/migrations: no such file or directory
memory-plugin: ❌ /v1/health never returned 200 after 30s
— aborting boot
Caught on the staging redeploy fleet job after #2906 merged. Tenants
stayed on the old image (CP redeploy correctly fail-fasted) but the
new image was broken.
Fix: `//go:embed migrations/*.up.sql` bundles the migrations into
the binary at build time. No filesystem path dependency at runtime.
* `embed.FS` carries the .up.sql files inside the binary itself.
* runMigrations() reads from migrationsFS by default;
MEMORY_PLUGIN_MIGRATIONS_DIR override path preserved for operators
shipping custom migrations.
* Names sorted alphabetically — pinned by a test so a future
`002_*.up.sql` is guaranteed to run after `001_*.up.sql`.
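Roughly the resulting shape (identifier names per the text above; the
rest is an assumption):

    package main

    import (
        "embed"
        "fmt"
        "io/fs"
        "sort"
    )

    //go:embed migrations/*.up.sql
    var migrationsFS embed.FS

    // applyEmbeddedMigrations applies the embedded .up.sql files in
    // sorted name order, so 001_* always runs before a future 002_*.
    // The real runMigrations also honors MEMORY_PLUGIN_MIGRATIONS_DIR.
    func applyEmbeddedMigrations(apply func(name, sql string) error) error {
        names, err := fs.Glob(migrationsFS, "migrations/*.up.sql")
        if err != nil {
            return err
        }
        sort.Strings(names)
        for _, name := range names {
            b, err := migrationsFS.ReadFile(name)
            if err != nil {
                return fmt.Errorf("read embedded migration %s: %w", name, err)
            }
            if err := apply(name, string(b)); err != nil {
                return err
            }
        }
        return nil
    }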
Tests:
* TestMigrationsEmbedded_ContainsCreateTable — pins that the embed
pattern matched files AND those files contain CREATE TABLE
(catches both empty-pattern and wrong-files-embedded).
* TestRunMigrationsFromEmbed_OrderingIsAlphabetic — pins sorted
application order.
Verified locally: `go build` succeeds, binary 9.3MB,
`strings` shows the embedded SQL.
Refs RFC #2728. Hotfix for #2906.
Lint nit from the review bot — _drain_uploads() runs and the function
immediately advances to the cursor save + return, so the local
reassignment is dead code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
test_start_poller_thread_is_daemon spawned a daemon thread with no stop
mechanism — the leaked thread kept polling every 10ms for ~50ms past the
test's scope, while the test's patched httpx.Client mock was still
active. Later tests that re-patched httpx.Client and asserted call
counts on fetch_and_stage / Client construction had their assertions
inflated by the leaked thread's iterations.
Symptoms: test_poll_once_skips_chat_upload_row_from_queue saw
fetch_and_stage called twice instead of once on Python 3.11 CI;
test_batch_fetcher_owns_client_when_not_supplied saw two Client
constructions instead of one in the full local suite. Both surfaced
only after Phase 5b's BatchFetcher refactor changed the timing window
that allowed the leaked thread to fire mid-test.
Fix: extend start_poller_thread with an optional stop_event kwarg
(backward compatible — production callers pass None and rely on the
daemon flag for process-exit cleanup). The test now signals + joins
on stop_event before exiting scope, so the thread is gone before any
later test patches httpx.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #2906 bundled memory-plugin-postgres as a startup-gated sidecar in
both tenant entrypoints. Plugin migrations include
`CREATE EXTENSION IF NOT EXISTS vector` which fails on the harness's
plain postgres:15-alpine (no pgvector preinstalled). The 30s health
gate then aborts container boot and Harness Replays fails.
Detected on auto-promote PR #2914 — Harness Replays job:
Container harness-tenant-alpha-1 Error
Container harness-tenant-beta-1 Error
dependency failed to start: container harness-tenant-alpha-1 exited (1)
The harness doesn't exercise memory features, so the simplest fix is
to use the documented escape hatch the sidecar entrypoint already
ships (MEMORY_PLUGIN_DISABLE=1) — applied to both alpha and beta
tenants in compose.yml. The alternative would be switching the harness
postgres images to pgvector/pgvector:pg15; that's deferred until the
harness needs to verify memory paths.
Refs PR #2906. Unblocks #2914 (auto-promote staging→main).
Multi-model retrospective review of #2901 found three Critical gaps:
1. (#2910 PR-B) template_import.go:79 hardcoded `tier: 3` into the
generated config.yaml. On SaaS this defeated the T4 default at the
create-handler layer — a config-less template import landed at T3
regardless of POST /workspaces' computed default. This was the fourth
default-tier site, the one #2901 missed.
2. (#2910 PR-A) #2901 claimed `go test ... all green` but added zero
new tests. Existing structural-pin tests caught dispatch-layer
drift but said nothing about tier-default drift. A future refactor
that flips DefaultTier() to always return 3 would ship green.
3. (#2910 PR-E) org_import.go fallback returned T2 on self-hosted
while workspace.go returned T3. Internally consistent ("bulk vs
interactive defaults") but undocumented same-name-different-value
drift.
Fix:
- TemplatesHandler.NewTemplatesHandler now takes `wh *WorkspaceHandler`
(nil-tolerant for read-only callers). Import + ReplaceFiles compute
tier via h.wh.DefaultTier() and pass it to generateDefaultConfig.
generateDefaultConfig gets a `tier int` parameter (bounds-checked,
invalid input falls back to T3; sketched after this list).
- org_import.go fallback lifts to h.workspace.DefaultTier() — single
source of truth shared with Create + Templates so a future
tier-default change sweeps every entry point at once.
- New saas_default_tier_test.go pinning:
TestIsSaaS_TrueWhenCPProvWired
TestIsSaaS_FalseWhenOnlyDocker
TestDefaultTier_SaaS_IsT4
TestDefaultTier_SelfHosted_IsT3
TestGenerateDefaultConfig_RespectsTierParam
TestGenerateDefaultConfig_SelfHostedTierT3
TestGenerateDefaultConfig_OutOfRangeFallsBackToT3
- Existing template_import_test.go tests + chat_files_test.go +
security_regression_test.go updated to thread the new tier param /
wh constructor arg through their NewTemplatesHandler calls. Their
pre-#2910 assertion of `tier: 3` is preserved (now passes because
the test caller passes `3` explicitly), so no regression.
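The bounds-check contract from the first bullet, sketched (signature and
yaml shape are assumptions; tiers taken as 1..4):

    package handlers

    import "fmt"

    // generateDefaultConfig renders a template's default config.yaml with
    // a caller-computed tier (h.wh.DefaultTier() at the Import and
    // ReplaceFiles call sites). Out-of-range input falls back to T3.
    func generateDefaultConfig(name string, tier int) string {
        if tier < 1 || tier > 4 {
            tier = 3
        }
        return fmt.Sprintf("name: %s\ntier: %d\n", name, tier)
    }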
go vet ./... clean. go test ./internal/handlers/ -count 1 — all
green (4.2s).
Deferred to separate follow-ups (per #2910 plan):
- PR-C: MOLECULE_DEPLOYMENT_MODE explicit deployment-mode signal
(closes the IsSaaS()=cpProv!=nil structural fragility)
- PR-D: Host iptables IMDS block + IMDSv2 hop-limit (paired with
molecule-controlplane EC2-IAM-scope audit)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review of PR #2906 flagged: defaultListenAddr was ":9100" — binds
on every container interface. Inside today's deployment that's moot
(no host port mapping, platform talks over loopback) but it's not
least-privilege. A future Dockerfile edit that publishes the port,
a misconfigured Fly machine, or a future cross-host plugin topology
would expose an unauth'd memory store.
Loopback is the right baseline. Operators with a multi-host topology
already override via MEMORY_PLUGIN_LISTEN_ADDR — that path is unchanged.
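The shape of the change, sketched (the exact loopback string is an
assumption; the env var and override behavior are from the text):

    package main

    import "os"

    // defaultListenAddr binds loopback only; publishing the port or a
    // cross-host plugin topology requires an explicit operator override.
    const defaultListenAddr = "127.0.0.1:9100"

    func listenAddr() string {
        if v := os.Getenv("MEMORY_PLUGIN_LISTEN_ADDR"); v != "" {
            return v // operator override path, unchanged
        }
        return defaultListenAddr
    }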
Tests:
* TestLoadConfig_DefaultListenAddrIsLoopback pins the new default.
* TestLoadConfig_ListenAddrEnvOverride pins the override path so
operators relying on it don't break.
* TestLoadConfig_MissingDatabaseURL covers the existing fail-fast.
No prior unit tests existed for loadConfig — boot_e2e_test.go always
sets MEMORY_PLUGIN_LISTEN_ADDR explicitly, so the default was never
exercised by tests. This PR adds that coverage.
Refs RFC #2728. Hardening follow-up to PR #2906.
Resolves the two remaining findings from the Phase 1-4 retrospective
review (the Python-side counterparts to phase 5a):
1. Important — inbox_uploads.fetch_and_stage blocked the inbox poll
loop synchronously per row. A user dragging 4 files into chat at
once would stall the poller for 4× per-fetch latency before the
chat message reached the agent. Add BatchFetcher: a thread-pool
wrapper (default 4 workers) that submits fetches concurrently and
exposes wait_all() as the barrier the inbox loop calls before
processing the chat-message row that references the uploads.
The drain barrier is the correctness invariant: rewrite_request_body
must observe a populated URI cache when it walks the chat-message
row's parts. _poll_once now drains the BatchFetcher inline before
the first non-upload row, AND at end-of-batch (case: batch contains
only upload rows; the corresponding chat message arrives in a later
poll, but the future-poll-races-current-fetch race is closed).
2. Nit — fetch_and_stage created two httpx.Client instances per row
(one for GET /content, one for POST /ack). Refactor so a single
client serves both calls. When called from BatchFetcher, the
batch-shared client serves every row's GET + ack — so the second
fetch reuses the TCP+TLS connection the first one set up.
Comprehensive tests:
- 13 new inbox_uploads tests:
- fetch_and_stage with supplied client: zero httpx.Client
constructions, GET+POST through the same client, caller's client
not closed (lifecycle owned by caller).
- fetch_and_stage without supplied client: exactly one
httpx.Client constructed (was 2 pre-fix), closed on the way out.
- BatchFetcher: 3 rows × 120ms complete in parallel in < 250ms
(vs. ~360ms serial), URI cache hot when wait_all returns,
per-row failure isolation, single-client reuse across all
submits, idempotent close, submit-after-close raises,
owned-vs-supplied client lifecycle, no-op wait_all on empty
batch, graceful httpx-missing degradation.
- 3 new inbox tests:
- poll_once drains uploads before processing the chat-message row
(in-place mutation of row['request_body'] proves the URI was
rewritten BEFORE message_from_activity returned).
- poll_once with only upload rows still drains at end-of-batch.
- poll_once with no upload rows never constructs a BatchFetcher
(zero overhead on the no-upload happy path).
133 total inbox + inbox_uploads tests pass; 0 regressions.
Closes the chat-upload poll-mode-perf gap end-to-end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds TestINSERTworkspacesAllowlist: walks every non-test .go in this
package, finds funcs containing an `INSERT INTO workspaces (` SQL
literal, and pins the result against an explicit allowlist with the
safety mechanism named per entry.
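A compressed sketch of the gate (the real test also records the named
safety mechanism per entry and prints a richer diagnostic):

    package handlers

    import (
        "go/ast"
        "go/parser"
        "go/token"
        "path/filepath"
        "strings"
        "testing"
    )

    func TestINSERTworkspacesAllowlist(t *testing.T) {
        allowlist := map[string]bool{ // file:func reviewed as idempotent
            "org_import.go:createWorkspaceTree": true,
            "registry.go:Register":              true,
            "workspace.go:Create":               true,
        }
        files, err := filepath.Glob("*.go")
        if err != nil {
            t.Fatal(err)
        }
        fset := token.NewFileSet()
        for _, file := range files {
            if strings.HasSuffix(file, "_test.go") {
                continue
            }
            f, err := parser.ParseFile(fset, file, nil, 0)
            if err != nil {
                t.Fatalf("parse %s: %v", file, err)
            }
            for _, decl := range f.Decls {
                fn, ok := decl.(*ast.FuncDecl)
                if !ok || !hasWorkspacesInsert(fn) {
                    continue
                }
                if key := file + ":" + fn.Name.Name; !allowlist[key] {
                    t.Errorf("%s inserts into workspaces but is not "+
                        "allowlisted; what makes this INSERT idempotent?", key)
                }
            }
        }
    }

    // hasWorkspacesInsert reports whether any string literal in the
    // function contains the bulk-create-prone SQL prefix.
    func hasWorkspacesInsert(fn *ast.FuncDecl) bool {
        found := false
        ast.Inspect(fn, func(n ast.Node) bool {
            lit, ok := n.(*ast.BasicLit)
            if ok && lit.Kind == token.STRING &&
                strings.Contains(lit.Value, "INSERT INTO workspaces (") {
                found = true
            }
            return !found
        })
        return found
    }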
A new INSERT site fails the build until a reviewer adds it to the
allowlist — forcing the question "what makes this INSERT idempotent?"
at PR-review time, not after the next bulk-create leak (the shape that
produced 72 stale child workspaces in tenant-hongming over 4 days).
Pairs with TestCreateWorkspaceTree_CallsLookupBeforeInsert (the
behavior pin for the one bulk path today). Together:
- this test catches "did a new function start inserting?"
- that test catches "did the existing bulk path drop its idempotency check?"
Both fire immediately when drift happens.
Current allowlist (3 entries):
- org_import.go:createWorkspaceTree → lookup-then-insert via
lookupExistingChild (#2868 phase 3, also pinned by the sibling AST
gate from #2895)
- registry.go:Register → ON CONFLICT (id) DO UPDATE (idempotent by
primary key — external workspace upsert)
- workspace.go:Create → single-workspace POST /workspaces, server-
generated UUID, no iteration
Verified via mutation: dropping a synthetic tempBulkLeakTest with an
unsafe loop+INSERT into the package fails the gate with a clear
diagnostic pointing at the file + function. Restoring the tree
returns the gate to green.
Memory: feedback_assert_exact_not_substring.md (verify tightened test
FAILS on bug shape) — mutation proof done locally.
RFC #2867 class 1. Class 2 (Prometheus gauge for ec2_instance
duplicates) + class 3 (structured logging on workspace create) are
follow-up PRs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves four of six findings from the retrospective code review of Phases
1–4 (poll-mode chat upload). Bundled because every change is in the
platform's pending_uploads layer or the multi-file handler that reads it.
Findings resolved:
1. Important — Sweep query lacked an index for the acked-retention OR-arm.
The Phase 1 partial indexes are both `WHERE acked_at IS NULL`, so the
`(acked_at IS NOT NULL AND acked_at < retention)` half of the WHERE
clause seq-scanned the table on every cycle. Add a complementary
partial index on `acked_at WHERE acked_at IS NOT NULL` so both arms
of the disjunction are index-covered. Disjoint from the existing two
indexes (no row matches both predicates), so write amplification is
bounded to ~one index entry per terminal-state row.
2. Important — uploadPollMode partial-failure left orphans. The previous
per-file Put loop committed rows 1..K-1 and then errored on row K with
no compensation, so a client retry would double-insert the survivors.
Refactor the handler into three explicit phases (pre-validate +
read-into-memory, single atomic PutBatch, per-file activity row) and
add Storage.PutBatch with all-or-nothing transaction semantics.
3. FYI — pendinguploads.StartSweeperWithInterval was exported only for
tests. Move it to lower-case startSweeperWithInterval and expose the
test seam through pendinguploads/export_test.go (Go convention; the
shim file is stripped from the production binary at build time).
4. Nit — multipart Content-Type was passed verbatim into pending_uploads
rows and re-served on /content. Add safeMimetype which strips
parameters, rejects CR/LF/control bytes, and coerces malformed shapes
to application/octet-stream. The eventual GET /content response can no
longer be header-split via a crafted Content-Type on the multipart.
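safeMimetype, roughly (names from the list above; the exact rejection
rules are an assumption):

    package handlers

    import (
        "mime"
        "strings"
    )

    // safeMimetype normalizes a client-supplied multipart Content-Type
    // before it is stored and later re-served on GET /content: parameters
    // are dropped, any control byte (CR, LF, NUL, ...) forces the
    // fallback, and malformed input is coerced to application/octet-stream.
    func safeMimetype(ct string) string {
        const fallback = "application/octet-stream"
        for _, r := range ct {
            if r < 0x20 || r == 0x7f {
                return fallback
            }
        }
        mediaType, _, err := mime.ParseMediaType(ct)
        if err != nil || !strings.Contains(mediaType, "/") {
            return fallback
        }
        return mediaType // ParseMediaType already stripped the parameters
    }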
Comprehensive tests:
- 10 PutBatch unit tests (sqlmock): happy path, empty input, all four
pre-validation rejection paths, BeginTx error, per-row error +
Rollback (no Commit), first-row error, Commit error.
- 4 new PutBatch integration tests (real Postgres): all-rows-commit
happy path with COUNT(*) verification, atomic-rollback no-leak via
a NUL-byte filename that lib/pq rejects mid-batch, oversize
short-circuit no-Tx, idx_pending_uploads_acked existence + partial
predicate via pg_indexes (planner-shape-independent).
- 3 new chat_files_poll tests: atomic rollback on second-file oversize,
atomic rollback on PutBatch error, mimetype CRLF/NUL/parameter
sanitization (8 sub-cases).
The two remaining review findings (inbox_uploads.fetch_and_stage blocks
the poll loop synchronously; two httpx Clients per row) are Python-side
and ship in Phase 5b once this lands on staging.
Test-only export pattern via export_test.go, atomic pre-validation
discipline (validate before Tx), and behavior-based (not name-based)
test assertions follow the standing project conventions.
Closes the gap between the merged Memory v2 code (PR #2757 wired the
client into main.go) and operator activation. Without this PR an
operator wanting to flip MEMORY_V2_CUTOVER=true had to provision a
separate memory-plugin service and point MEMORY_PLUGIN_URL at it —
extra ops surface for what the design intends to be a built-in.
What ships:
* Both Dockerfile + Dockerfile.tenant build the
cmd/memory-plugin-postgres binary into /memory-plugin.
* Entrypoints spawn the plugin in the background on :9100 BEFORE
starting the main server; wait up to 30s for /v1/health to return
200; abort boot loudly if it doesn't (better to crash-loop than to
silently route cutover traffic against a dead plugin).
* Default env: MEMORY_PLUGIN_DATABASE_URL=$DATABASE_URL (share the
existing tenant Postgres — plugin's `memory_namespaces` /
`memory_records` tables coexist with platform schema, no
conflicts), MEMORY_PLUGIN_LISTEN_ADDR=:9100.
* MEMORY_PLUGIN_DISABLE=1 escape hatch for operators running the
plugin externally on a separate host.
* Platform image: plugin runs as the `platform` user (not root) via
su-exec — matches the privilege boundary the main server already
drops to. Tenant image already starts as `canvas` so the plugin
inherits non-root automatically.
What stays operator-controlled:
* MEMORY_V2_CUTOVER is NOT auto-set. Behavior change for existing
deployments: zero. The wiring at workspace-server/internal/memory/
wiring/wiring.go skips building the plugin client until the
operator opts in, so the running sidecar is a no-op for traffic
until then.
* MEMORY_PLUGIN_URL is NOT auto-set either, for the same reason —
setting it implies cutover-active intent. Operators set both on
staging first, verify a live commit/recall round-trip (closes
pending task #292), then promote to production.
Operator activation steps after this PR ships:
1. Verify pgvector extension is available on the target Postgres
(the plugin's first migration runs CREATE EXTENSION IF NOT
EXISTS vector). Railway's managed Postgres ships pgvector
available; some self-hosted operators may need to enable it.
2. Redeploy the workspace-server with this image.
3. Set MEMORY_PLUGIN_URL=http://localhost:9100 + MEMORY_V2_CUTOVER=true
in the environment (staging first).
4. Watch boot logs for "memory-plugin: ✅ sidecar healthy" and the
wiring.go cutover messages; do a live commit_memory + recall_memory
round-trip via the canvas Memory tab to verify.
5. Promote to production once staging holds for a sweep window.
Refs RFC #2728. Closes the dormant-plugin gap noted in task #294.
Reported: "right now when chat box opens it opens in the middle, but
it should be at the end of conversation."
Root cause: ChatTab.tsx:548 fires `bottomRef.scrollIntoView({ behavior:
"smooth" })` on every messages-update. On initial mount with N
messages already loaded, the smooth-scroll triggers a ~300ms animation
that any concurrent React re-render (agent push landing, theme
toggle, sidepanel resize) interrupts mid-flight, leaving the user
stuck somewhere in the middle of the conversation.
Fix: track first-mount via hasInitialScrollRef. Use behavior:"instant"
for the initial jump (deterministic, no animation interruption), then
smooth for subsequent appends (the new-message-landing visual stays).
The ref flips on the first messages.length > 0 transition, so:
- Initial open of chat tab: instant jump to bottom ✓
- New agent message arrives: smooth scroll into view ✓
- Workspace switch (ChatTab remounts): fresh hasInitialScrollRef, gets
instant again ✓
- loadOlder prepend: anchor-restore path unchanged, still pins user's
reading position ✓
Test plan:
- pnpm test --run ChatTab.lazyHistory.test.tsx → 8 pass (existing
lazy-history tests untouched)
- npx tsc --noEmit clean
- Manual on hongming.moleculesai.app: open a busy chat (mac laptop,
~50 messages), confirm view lands at the latest bubble, not mid-
scroll. Switch to another workspace + back → instant again.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TestStartSweeper_RecordsMetricsOnError flaked on every CI rerun under
race detection: `error counter delta = 0, want 1`. Root cause is a
race between two goroutines, not a bug in the production sweeper.
The fake `fakeSweepStorage.Sweep` signals `cycleDone` from inside its
deferred return — that happens BEFORE Sweep's return value is
received by `sweepOnce`, which is what triggers the metric increment.
On slow CI hosts the test goroutine wins the read after `waitForCycle`
unblocks and BEFORE StartSweeper's goroutine has called
`metrics.PendingUploadsSweepError`, so the asserted delta is 0 even
though the metric WILL be 1 a few ms later.
Adds a polling assert helper, `waitForMetricDelta`, that closes the
race deterministically without timing-based sleeps:
- TestStartSweeper_RecordsMetricsOnError uses waitForMetricDelta to
wait for the error counter to settle at 1.
- TestStartSweeper_RecordsMetricsOnSuccess uses it on the success
counters (acked, expired) so the error-stayed-zero assertion
reads after StartSweeper has fully processed the cycle.
- waitForCycle keeps its current shape but documents the caveat in
its comment so future tests don't repeat the assumption.
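The helper's shape, roughly (the signature and the 2s ceiling are
assumptions):

    package pendinguploads

    import (
        "testing"
        "time"
    )

    // waitForMetricDelta polls until the counter has moved by at least
    // `want` from `base`, failing after a hard ceiling. Counters only
    // move up, so "at least want" means the sweeper goroutine has
    // recorded the increment and the assertion can no longer race
    // cycleDone.
    func waitForMetricDelta(t *testing.T, read func() float64, base, want float64) {
        t.Helper()
        deadline := time.Now().Add(2 * time.Second)
        for time.Now().Before(deadline) {
            if read()-base >= want {
                return
            }
            time.Sleep(5 * time.Millisecond)
        }
        t.Fatalf("metric delta = %v, want %v", read()-base, want)
    }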
Verified: `go test ./internal/pendinguploads/ -race -count 5` passes
all 9 tests across 5 iterations cleanly.
Per memory feedback_question_test_when_unexpected.md: the
"delta=0, want=1" failure looked like a real production bug at first
glance, but instrumented inspection showed the metric DOES increment,
just AFTER the test's read. The fix is the test's wait shape, not
the sweeper.
Unblocks every PR currently broken by this flake (#2898 hit it on
two consecutive CI runs; staging-merged PRs from earlier today
(#2877/#2881/#2885/#2886) introduced the test).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User reported every SaaS workspace defaults to T2 (Standard). Four
call sites picked the default, and only one of them was right:
- canvas CreateWorkspaceDialog (line 126): isSaaS ? 4 : 3 ← only correct one
- canvas EmptyState "Create blank": tier: 2 ← hardcoded
- workspace.go POST /workspaces: tier = 3 ← not SaaS-aware
- org_import.go createWorkspaceTree: tier = 2 (fallback) ← not SaaS-aware
So a user clicking "+ New Workspace" via the dialog got T4 on SaaS,
but a user clicking "Create blank" on the empty canvas got T2, and an
agent POSTing /workspaces directly got T3. Same tenant, three different
tiers depending on entry point.
Fix:
1. WorkspaceHandler.IsSaaS() and DefaultTier() helpers (workspace_dispatchers.go).
IsSaaS() := h.cpProv != nil — single source of truth for "are we
SaaS" across the file. DefaultTier() returns 4 on SaaS, 3 on
self-hosted (sketched after this list). SaaS rationale: each workspace
runs on its own sibling
EC2 so the per-workspace tier boundary is a Docker resource limit
on the only container present — no neighbour to protect from. T4
matches the boundary.
2. workspace.go now defaults tier via h.DefaultTier() instead of
hardcoded T3.
3. org_import.go fallback (when neither ws.tier nor defaults.tier set)
becomes SaaS-aware: T4 on SaaS, T2 on self-hosted (preserving the
existing safe-shared-Docker-daemon default for self-hosted org
imports).
4. canvas EmptyState "Create blank" stops sending tier:2 in the body
and lets the backend pick — single source of truth in the backend.
Eliminates the third disagreement.
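The two helpers sketched, with the handler struct reduced to the single
field that matters here:

    package handlers

    type WorkspaceHandler struct {
        cpProv any // control-plane provisioner; nil on self-hosted
        // ... the real handler has many more fields
    }

    // IsSaaS is the single source of truth for "are we SaaS": the
    // control-plane provisioner is only wired in that deployment shape.
    func (h *WorkspaceHandler) IsSaaS() bool {
        return h.cpProv != nil
    }

    // DefaultTier is what a workspace gets when the caller doesn't pick:
    // T4 on SaaS (the workspace is alone on its sibling EC2, so the tier
    // boundary is just a Docker limit), T3 on self-hosted.
    func (h *WorkspaceHandler) DefaultTier() int {
        if h.IsSaaS() {
            return 4
        }
        return 3
    }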
Test plan:
- go vet ./... clean
- go test ./internal/handlers/ -count 1 — all green (4.3s)
- npx tsc --noEmit on canvas — clean
- Staging E2E (after deploy): create a fresh workspace via canvas
empty-state on hongming.moleculesai.app, confirm tier=4 on the
workspace details panel.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI caught two test files I missed in the original iter 4b retarget:
test_a2a_multi_workspace.py + test_delegation_sync_via_polling.py
patch a2a_tools.{discover_peer, send_a2a_message, _delegate_sync_via_polling,
httpx.AsyncClient} but those call sites moved to a2a_tools_delegation
in this PR. 17 patch sites retargeted; 30 tests now green.
Refs RFC #2873 iter 4b.
The previous TestCreateWorkspaceTree_CallsLookupBeforeInsert used
bytes.Index("INSERT INTO workspaces"), which also prefix-matches
INSERT INTO workspaces_audit and can't tell the real target apart from
the other workspace_* lookalikes. RFC #2872 cited this as a silent
false-pass mode: a future refactor that adds an audit-table INSERT
literal earlier in source than the real workspaces INSERT would
make the gate point at the wrong target.
Replaces the byte-search with a go/ast walk + a regex that requires
`\s*\(` after `workspaces` — distinguishes the real target from
prefix lookalikes.
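The tightened pattern, roughly (the in-repo version may differ in
anchoring or case handling):

    package handlers

    import "regexp"

    // workspacesInsertRE only matches the real target: `workspaces` must
    // be followed by optional whitespace and an opening paren, so the
    // audit and workspace_* lookalike tables fall out of consideration.
    var workspacesInsertRE = regexp.MustCompile(`INSERT\s+INTO\s+workspaces\s*\(`)

INSERT INTO workspaces_audit (...) fails the check because `_` follows
`workspaces` directly, so `\s*\(` cannot match.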
Adds three discriminating tests:
- TestWorkspacesInsertRE_RejectsLookalikes — pins the regex against
9 sql shapes (real, raw-string-literal, audit-shadow, workspace_*
prefixes, canvas_layouts, UPDATE/SELECT, comments).
- TestGate_FailsWhenLookupAfterInsert — synthesizes Go source where
the lookup is positioned AFTER the workspaces INSERT, asserts the
helper returns lookupPos > insertPos (which the production gate
flags via t.Errorf). Proves the gate isn't vestigial.
- TestGate_IgnoresAuditTableShadow — synthesizes source with an
audit-table INSERT BEFORE the lookup + real INSERT, asserts the
tightened regex correctly walks past the shadow and finds the
real INSERT.
Also extracts findLookupAndWorkspacesInsertPos as a helper so the
gate logic can be exercised against synthetic source, not only
against the real org_import.go.
Memory: feedback_assert_exact_not_substring.md (verify tightened
test FAILS on old code) — TestGate_FailsWhenLookupAfterInsert is
the failing-on-bug-shape proof.
Closes the silent-false-pass mode of #2872 Important-1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 4 closes out the rollout — strict-sqlmock unit tests pin which
SQL fires, but they cannot detect bugs that depend on the actual row
state after the SQL runs. Real-Postgres integration tests catch what
sqlmock cannot:
- the Sweep CTE depends on Postgres' make_interval function and the
table's CHECK constraints; sqlmock would happily accept a
hand-written SQL literal that Postgres rejects at runtime.
- a wrong WHERE predicate on the partial idx_pending_uploads_unacked
index only shows up at real-query-plan time.
- subtle predicate drift (e.g. a WHERE clause that filters by
acked_at IS NOT NULL but uses BETWEEN incorrectly).
Test cases:
- PutGetAckRoundTrip: the full happy path — Put, Get, MarkFetched,
Ack, idempotent re-Ack, Get-after-Ack returns ErrNotFound.
- Sweep_DeletesAckedAfterRetention: row not eligible at retention=1h
immediately after Ack; deleted at retention=0.
- Sweep_DeletesExpiredUnacked: backdated expires_at exercises the
unacked-and-expired branch of the WHERE clause.
- Sweep_DeletesBothCategoriesInOneCycle: three rows (acked, expired,
fresh); a single Sweep deletes the first two and leaves the third.
- PutEnforcesSizeCap: ErrTooLarge above MaxFileBytes.
- GetIgnoresExpiredAndAcked: Get filters predicate matches expected
row state in the table.
Run path:
- locally via the file-header docker incantation.
- CI runs on every PR/push that touches handlers/** OR migrations/**
(.github/workflows/handlers-postgres-integration.yml).
Second slice of the a2a_tools.py split (stacked on iter 4a). Owns the
three delegation MCP tools + the RFC #2829 PR-5 sync-via-polling
helper they share:
* tool_delegate_task — synchronous delegation
* tool_delegate_task_async — fire-and-forget
* tool_check_task_status — poll the platform's /delegations log
* _delegate_sync_via_polling — durable async + poll for terminal status
* _SYNC_POLL_INTERVAL_S / _SYNC_POLL_BUDGET_S constants
a2a_tools.py shrinks from 915 → 609 LOC (−306). Stacked on iter 4a's
RBAC extraction; uses `from a2a_tools_rbac import auth_headers_for_heartbeat`
as its auth-header source.
The lazy `from a2a_tools import report_activity` inside tool_delegate_task
breaks the circular-import cycle (a2a_tools imports the delegation
re-exports at module-load; delegation handler needs report_activity at
CALL time). A dedicated test pins this contract.
Tests:
* 77 existing test_a2a_tools_impl.py tests pass after retargeting
20 patch sites in TestToolDelegateTask + TestToolDelegateTaskAsync +
TestToolCheckTaskStatus from `a2a_tools.foo` to
`a2a_tools_delegation.foo` (foo ∈ {discover_peer, send_a2a_message,
httpx.AsyncClient}). The patches need to target the new module
because that's where the call sites live now.
* test_a2a_tools_delegation.py adds 8 new tests:
- 6 alias drift gates (`a2a_tools.tool_delegate_task is …`)
- 2 import-contract tests (no top-level circular dep + a2a_tools
surfaces every delegation symbol)
- 1 sync-poll budget invariant
113 tests total (77 impl + 28 rbac + 8 delegation), all green.
Refs RFC #2873.
Iter 4a's new module needs to be in the rewrite list so the wheel
ships its imports prefixed correctly. Caught by the 'PR-built wheel +
import smoke' check.
Refs RFC #2873 iter 4a.
The iter-3 split created mcp_heartbeat / mcp_inbox_pollers /
mcp_workspace_resolver but the wheel build's drift-gate check at
scripts/build_runtime_package.py:TOP_LEVEL_MODULES wasn't updated.
Without this fix the wheel ships those modules un-rewritten, so
their imports of platform_auth / configs_dir / etc. break at
runtime. Caught by the 'PR-built wheel + import smoke' check.
Refs RFC #2873 iter 3.
Phase 3 of the poll-mode chat upload rollout. Stacked atop Phase 2.
The platform's pending_uploads table grows by one row per uploaded file
with no built-in cleanup. Phase 1's hard TTL (expires_at default 24h)
makes expired rows un-fetchable but doesn't actually delete them;
Phase 1's ack stamps acked_at but leaves the row around indefinitely.
Without a sweep the table grows without bound under normal traffic.
This PR adds:
- `Storage.Sweep(ctx, ackRetention)` — a single round-trip CTE that
deletes acked rows past their retention window plus unacked rows
past expires_at. Returns `(acked, expired)` deletion counts so
Phase 3 dashboards can spot the stuck-fetch pattern (high expired,
low acked) vs healthy churn.
- `pendinguploads.StartSweeper(ctx, storage, ackRetention)` —
background goroutine that calls Sweep every 5 minutes (default).
Runs once immediately on startup so a platform restart cleans up
any rows that became eligible while we were down.
- Prometheus counters `molecule_pending_uploads_swept_total` with
`outcome={acked,expired,error}` labels. Wired into the existing
`/metrics` endpoint.
- Wired from cmd/server/main.go via supervised.RunWithRecover —
one transient panic doesn't take the platform down with it.
Defaults:
- SweepInterval = 5m (matches the dashboard refresh cadence)
- DefaultAckRetention = 1h (gives the workspace at-least-once retry
headroom in case it processed but failed to write the file before
crashing)
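The loop, sketched (Sweep's exact return signature is an assumption;
the real sweepOnce records the Prometheus counters where the log lines
sit here, and main.go wraps the whole thing in supervised.RunWithRecover):

    package pendinguploads

    import (
        "context"
        "log"
        "time"
    )

    const (
        SweepInterval       = 5 * time.Minute
        DefaultAckRetention = time.Hour
    )

    type sweepStorage interface {
        Sweep(ctx context.Context, ackRetention time.Duration) (acked, expired int64, err error)
    }

    // StartSweeper sweeps once immediately (rows that became eligible
    // while the platform was down), then on every tick until ctx is
    // cancelled. A failed cycle is recorded, never fatal.
    func StartSweeper(ctx context.Context, s sweepStorage, ackRetention time.Duration) {
        go func() {
            ticker := time.NewTicker(SweepInterval)
            defer ticker.Stop()
            sweepOnce(ctx, s, ackRetention)
            for {
                select {
                case <-ctx.Done():
                    return
                case <-ticker.C:
                    sweepOnce(ctx, s, ackRetention)
                }
            }
        }()
    }

    func sweepOnce(ctx context.Context, s sweepStorage, ackRetention time.Duration) {
        acked, expired, err := s.Sweep(ctx, ackRetention)
        if err != nil {
            log.Printf("pending_uploads sweep error: %v", err) // outcome="error"
            return
        }
        log.Printf("pending_uploads sweep: acked=%d expired=%d", acked, expired)
    }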
Test coverage: 100%, via storage_test.go (extended with a sweepSQL pin +
six Sweep test cases including negative-retention clamp + zero-retention
immediate-delete + DB error wrapping) and sweeper_test.go (ticker-driven
+ ctx-cancel + nil-storage + transient-error-doesn't-crash + metric
counter assertions).
Closes the third of four phases tracked on the parent RFC; phase 4 is
the staging E2E test.
Closes #2865 (split-B of the #2669 root-cause stack).
The phantom-busy sweep in workspace-server/internal/scheduler/scheduler.go
already logs each row reset, but no aggregate metric surfaces "how often
is this firing." A regression that causes high reset rates (e.g.
controlplane#481's missing env vars, or future drift in the workspace
runtime's task-lifecycle accounting) only surfaces when users complain.
Fix: counter exposed at /metrics as molecule_phantom_busy_resets_total,
incremented from sweepPhantomBusy after each row whose active_tasks
was reset. Same shape as existing molecule_websocket_connections_active.
Operator-side dashboard: alert when daily phantom-busy reset count
> 0.5% of active workspaces. Today's steady-state is near-zero; any
increase is a regression signal.
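A sketch using the stock client_golang registration; the in-repo
metrics package may wire the same counter by hand, but the /metrics
output is the same text format either way:

    package metrics

    import (
        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promauto"
    )

    // phantomBusyResets counts rows whose active_tasks the scheduler's
    // phantom-busy sweep had to reset. Steady state is near-zero, so any
    // sustained increase is a regression signal worth alerting on.
    var phantomBusyResets = promauto.NewCounter(prometheus.CounterOpts{
        Name: "molecule_phantom_busy_resets_total",
        Help: "Rows reset by the scheduler's phantom-busy sweep.",
    })

    // TrackPhantomBusyReset is called from sweepPhantomBusy after each
    // reset row; client_golang counters are safe for concurrent Inc calls.
    func TrackPhantomBusyReset() {
        phantomBusyResets.Inc()
    }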
Tests:
- TestTrackPhantomBusyReset_IncrementsCounter
- TestTrackPhantomBusyReset_RaceFreeUnderConcurrentWrites (50×200
concurrent writes; tests atomic invariant)
- TestHandler_ExposesPhantomBusyResetsCounter (asserts HELP + TYPE
+ value lines in Prometheus text format)
- TestHandler_PhantomBusyResetsZeroByDefault (fresh-process 0
contract — prevents a future refactor from accidentally dropping
the metric from /metrics)
Race-detector clean. Vet clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>