The previous byte-slice form `s[:previewCap]` could split a multi-byte
codepoint at byte 4096, producing invalid UTF-8. Postgres JSONB rejects
the row → ledger insert silently fails → audit gap on dashboards while
activity_logs continues to record the event.
Walk the string by rune index and stop at the last boundary that fits
inside the cap. ASCII-only strings still hit the cap exactly; CJK/emoji
strings stop slightly under, never over.
Mirrors the truncatePreviewRunes fix shipped for agent_message_writer
in #2959. Followup: deduplicate into a shared helper once both have
landed.
Tests: 2 regression tests using utf8.ValidString — one with an all-3-byte
rune string just over the cap, one with a single multi-byte rune sitting
exactly on the boundary. Verified on the previous byte-slice impl: both
new tests would fail (invalid UTF-8 + truncation past cap by 1 byte).
Pre-RFC-#2945, every BroadcastOnly / RecordAndBroadcast call site
passed a bare string literal:
h.broadcaster.BroadcastOnly(workspaceID, "AGENT_MESSAGE", payload)
29 producers (Go, ~30 call sites in handlers/, scheduler/, registry/,
bundle/) and ~30 canvas consumers (TS store + listeners) duplicated
the same string with no shared definition. A producer renaming an
event silently broke every consumer — same drift class that produced
the reno-stars data-loss regression on the persistence side. PR-A
fixed the persistence-side SSOT (AgentMessageWriter); PR-B fixes the
event-name SSOT.
What this PR ships
internal/events/types.go
- EventType typed string + 29 named constants covering the full
taxonomy (chat / lifecycle / agent assignment / delegation /
task / approval / auth).
- Grouped semantically; new constants must be added here AND
mirrored in canvas/src/lib/ws-events.ts (parity gate landing
in PR-B-2 follow-up).
- AllEventTypes slice — authoritative list for the snapshot
test + the cross-language parity gate.
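A sketch of the package shape. Only EventAgentMessage's wire value is confirmed above; the other constant names here are illustrative stand-ins for the real taxonomy in types.go:

```go
package main

import "fmt"

// EventType is the typed wire name for a broadcast event.
type EventType string

const (
	// Chat (confirmed wire value)
	EventAgentMessage EventType = "AGENT_MESSAGE"
	// Illustrative entries — the real file groups all 29 semantically.
	EventWorkspaceHeartbeat EventType = "WORKSPACE_HEARTBEAT"
	EventTaskUpdated        EventType = "TASK_UPDATED"
)

// AllEventTypes is the authoritative list consumed by the snapshot
// test and the cross-language parity gate.
var AllEventTypes = []EventType{
	EventAgentMessage,
	EventWorkspaceHeartbeat,
	EventTaskUpdated,
}

func main() {
	fmt.Println(len(AllEventTypes))
}
```

A snapshot test then just diffs AllEventTypes against a pinned literal list, so adding a constant without updating the slice (or vice versa) fails with a one-line diff.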
internal/events/types_test.go (3 tests)
- TestAllEventTypes_IsSnapshot: pins the canonical list. Adding
a new constant without updating AllEventTypes (or vice versa)
fails with a one-line diff.
- TestEventType_NoEmptyConstants: catches accidentally-empty
values (typo in types.go: const X EventType = ...).
- TestEventType_AllUppercaseSnakeCase: pins the wire format that
canvas TS switch statements assume (no kebab-case, no mixed
case, no leading/trailing/double underscores).
agent_message_writer.go (single migration)
- Demonstrates the constant-usage shape:
events.EventAgentMessage → "AGENT_MESSAGE"
- Other ~30 call sites stay on bare strings for now (keeping this
  PR narrow); the migration happens in the PR-B-1 follow-up. Both
  shapes (constant and bare string) coexist on the wire — the
  typed version is just the recommended path for new code.
Why ship this in stages
1. PR-B (this): types + tests + first migration → MERGEABLE NOW,
low risk.
2. PR-B-1 (follow-up): migrate the remaining ~30 call sites to
constants. Mechanical, low-risk.
3. PR-B-2 (follow-up): canvas/src/lib/ws-events.ts mirror + cross-
language parity gate. Touches both repos.
Per memory feedback_oss_design_philosophy.md (every refactor toward
OSS plugin shape) — this surface is now plugin-safe: external
implementations can import the events package and get the same
named taxonomy without copying strings.
Verified
- go vet ./internal/events/ clean
- go build ./... clean
- TestAllEventTypes_IsSnapshot + TestEventType_* all pass
- TestAgentMessageWriter_* (the only call site touched) still green
Refs RFC #2945, PR #2949 (PR-A SSOT), PR #2944 (reno-stars).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review of PR #2949 surfaced two pre-existing defects that the
SSOT consolidation inherited from the original /notify handler. Both
are addressable in a small follow-up; shipping them as a separate PR
keeps the consolidation and the bug-fix individually reviewable.
Critical: byte-slice preview truncation produces invalid UTF-8
-------------------------------------------------------------
Pre-fix:
    if len(preview) > 80 {
        preview = preview[:80] + "…"
    }
`len()` returns BYTES; `preview[:80]` slices on a byte boundary. For
agent-authored chat in CJK / emoji / accented characters, byte 80
lands mid-codepoint → invalid UTF-8 → Postgres JSONB rejects → INSERT
fails → activity_log row never written → message vanishes from chat
history on the next reload. The persistence-failure log fires but
operators have to grep to find it, and the user-visible regression
mode is identical to reno-stars.
Fix: extract `truncatePreviewRunes(s, maxRunes)`, which walks rune
start boundaries via `for i := range s` (Go's range over a string
yields rune start indices). Cap at 80 RUNES, not bytes — a
UI-friendly count, not a storage count.
Important: workspace-lookup error path swallows real DB errors
--------------------------------------------------------------
Pre-fix:
    if err := w.db.QueryRowContext(...).Scan(&wsName); err != nil {
        return ErrWorkspaceNotFound
    }
Conflates `sql.ErrNoRows` (legit not-found → caller 404) with real
DB errors (connection drop, query timeout, pool exhaustion → caller
should 503). During a Postgres outage every notify call surfaced as
"workspace not found" — masking the actual incident in alerting and
making the symptom indistinguishable from "you typed a bad workspace
ID".
Fix: distinguish via `errors.Is(err, sql.ErrNoRows)` and wrap
non-not-found errors with `fmt.Errorf("agent_message: workspace
lookup: %w", err)`. Callers' existing fallback path (return 500 /
return error wrapped) handles the new shape correctly without any
changes — verified by running existing TestNotify_* and
TestMCPHandler_SendMessage_* tests.
Tests added (3 new, 11 total writer tests)
------------------------------------------
- TestTruncatePreviewRunes_RuneBoundary: 8-case table — ASCII, CJK,
exactly-at-max, emoji prefix. Asserts both correct visible output
AND `utf8.ValidString` on every result so the bug shape (invalid
UTF-8) can't recur.
- TestAgentMessageWriter_Send_NonASCIIMessagePersists: end-to-end
with a 200-rune CJK message (exceeds the 80-rune cap, would have
  hit the byte-slice bug). Pins that the INSERTed summary is valid
  UTF-8 with an exactly-80-rune body plus ellipsis.
- TestAgentMessageWriter_Send_DBErrorOnLookupReturnsWrapped: pins the
DB-outage path returns a wrapped non-ErrWorkspaceNotFound error so
alerting can distinguish 404 from 503. Verified via mock
ExpectQuery returning a transient error.
Verified
--------
- `go vet ./internal/handlers/` clean
- `go build ./...` clean
- All 14 writer + caller tests pass (8 original + 3 new + AST gate +
TestNotify_* + TestMCPHandler_SendMessage_* sibling tests)
Per memory feedback_assert_exact_not_substring.md: every new test
asserts boundary behavior directly (UTF-8 validity, exact rune count,
errors.Is comparison) rather than substring-match in stringified
output.
Refs RFC #2945, PR #2949, PR #2944.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The External Connect modal's Codex and OpenClaw tabs were rendering
this MCP server config:
command = "python3"
args = ["-m", "molecule_runtime.a2a_mcp_server"]
That spawns the bare MCP dispatcher with no presence wiring. The
``molecule-mcp`` console-script wrapper (mcp_cli.main) is what calls
``POST /registry/register`` at startup and runs the 20s heartbeat
thread alongside the MCP stdio loop. Without the wrapper, the canvas
flips the workspace back to ``awaiting_agent`` (OFFLINE) within
60-90s — even while tools work — because nothing is heartbeating.
Operator-side this looks like: the workspace is registered and tools
work fine when invoked, but the canvas shows "offline" / "Restart"
CTA, peer agents see the workspace as awaiting_agent in list_peers
output, and inbound A2A delivery silently fails the readiness check.
A new external-Codex operator (#2957) hit this and spent debugging
time on what should have been a copy-paste install.
Fix: switch both Codex and OpenClaw templates to
``command = "molecule-mcp"`` / ``args = []``, matching the universal
MCP template that already handles this correctly. Inline comment in
each template explains the wrapper-vs-bare-module tradeoff so a
future template author doesn't regress to the shorter form.
Hermes-channel intentionally still spawns the bare module — the
hermes plugin owns the platform plugin path and runs its own
register_platform/heartbeat code in-process; double-heartbeating
would race. Universal/Codex/OpenClaw all need the wrapper.
Regression gate: TestExternalMcpTemplates_UseMoleculeMcpWrapper
asserts the three templates that must use the wrapper actually do,
and explicitly fails on the old ``-m molecule_runtime.a2a_mcp_server``
shape. Verified the test FAILS on pre-fix source by stashing only
external_connection.go and re-running.
Source: molecule-core#2957 issue 1 (item 4 of the report — the
``(codex returned empty output)`` / opaque-canvas-error / stale-
session items live in codex-channel-molecule and are tracked
separately).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the v1 LOCAL/TEAM/GLOBAL tab trio (mapped to the deprecated
shared_context model) with a v2 plugin-driven UI. Without this,
canvas Memory tab was reading the frozen agent_memories table while
all post-cutover agent writes went to the plugin's memory_records —
the tab silently displayed stale data.
## Backend (workspace-server)
New routes under wsAuth, all behind the existing per-tenant token:
GET /workspaces/:id/v2/namespaces → readable + writable lists
GET /workspaces/:id/v2/memories → plugin search proxy
DELETE /workspaces/:id/v2/memories/:mid → plugin forget proxy
memories_v2.go — slim handler:
- Server-side ACL: every search request is intersected with the
resolver's readable-namespaces set (canvas-supplied namespace
that the workspace can't read returns [] not 403, matches v1
existence-non-inferring shape).
- Returns 503 with "set MEMORY_PLUGIN_URL" hint when plugin
isn't wired (canvas surfaces a banner).
- Maps plugin not_found → 404, other plugin errors → 502.
- View shaping: NamespaceView.label rendered server-side
("Workspace (abc-1234)", "Team (t-99)", "Org (acme)", custom)
so canvas doesn't parse namespace names. MemoryView surfaces
pin/expires_at/score/source_workspace_id from Propagation.
memories_v2_test.go — 100% line + 100% function coverage:
- 503 path on every endpoint when unwired
- Namespaces success + readable/writable error paths
- Search: empty intersection, full-path query/kind/limit
propagation, namespace=/no-namespace branches, propagation
map missing/wrong-type, intersect error, plugin error
- Forget: success, plugin not_found→404, other plugin
errors→502, missing memoryId→400
- Helpers: namespaceLabel for all 4 kinds + truncation,
parseLimit edge cases (default/0/negative/over-cap/non-num),
memoryToView field round-trip, indexOfColon, shortID
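A sketch of the parseLimit edge cases the tests pin. The default (20) and cap (100) here are assumptions, not the shipped values:

```go
package main

import (
	"fmt"
	"strconv"
)

// parseLimit: non-numeric, empty, zero, and negative inputs fall back
// to a default; over-cap values clamp rather than erroring.
// (Hypothetical constants — the real handler's numbers may differ.)
func parseLimit(raw string) int {
	const def, cap = 20, 100
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return def
	}
	if n > cap {
		return cap
	}
	return n
}

func main() {
	fmt.Println(parseLimit(""), parseLimit("-3"), parseLimit("50"), parseLimit("9999"))
}
```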
## Frontend (canvas)
MemoryInspectorPanel rewritten for v2:
- Drop LOCAL/TEAM/GLOBAL trio. Namespace dropdown driven by
GET /v2/namespaces.readable, "All namespaces" default.
- New per-row badges: kind (F/S/C), source (agent/runtime/user),
pin (📌), TTL countdown (⌛12h / "expired"), score% on
semantic search, source-workspace ⇡ws-pee for propagated.
- Drop Edit button — v2 plugin contract has no PATCH; the
model is forget + recommit. Forget stays.
- Plugin-unavailable banner with operator hint when /v2/*
returns 503.
- Bug fix surfaced by test: rollback-on-failed-delete order
of operations (loadEntries() called setError(null) AFTER
we set the failure message, wiping it). Reload first, then
set the error.
MemoryEditorDialog deleted — Add was POST /memories which v2
doesn't support from canvas (writes go via MCP). The legacy
Edit-flow tests go with it.
## Test results
Backend: `go test ./internal/handlers/` — all pass
Backend coverage on memories_v2.go: 100% lines, 100% functions
Canvas: `vitest run` — 91 files, 1273 tests pass (26 new)
Canvas coverage on MemoryInspectorPanel.tsx: 100% lines,
100% functions, 96.7% statements, 84.7% branches
(uncovered branches are defensive `?? fallback` for
contract-impossible kind/source values)
## Migration note
The legacy v1 GET/POST/PATCH/DELETE on /workspaces/:id/memories
remains in place for the back-compat MCP shim (mcp_tools_memory_v2's
legacy routing) and admin export/import. PR-9 (#283) drops
agent_memories along with the v1 endpoints once the cutover
verification window closes.
Pre-RFC-#2945 the broadcast + activity_log INSERT for "agent → user
chat" was duplicated across two handlers — activity.go's Notify (HTTP
/notify) and mcp_tools.go's toolSendMessageToUser (MCP tools/call).
The duplication is exactly what produced the reno-stars production
data-loss regression (PR #2944): the persistence-half fix landed for
one handler and silently lagged for the other for months, dropping
every long-form external-agent message on reload.
PR #2944 added the missing INSERT to mcp_tools.go and a forward-
looking AST gate. This PR removes the duplication at the source.
What changes
------------
NEW: workspace-server/internal/handlers/agent_message_writer.go
- AgentMessageWriter struct + NewAgentMessageWriter ctor.
- Send(ctx, workspaceID, message, attachments) error: workspace
lookup → broadcast WS AGENT_MESSAGE → INSERT activity_logs.
- ErrWorkspaceNotFound for the lookup-miss path so callers can
return 404 / JSON-RPC error cleanly.
- Best-effort persistence: INSERT failure logs only, returns nil so
the broadcast success isn't undone (matches previous behavior in
both call sites — pinned by test).
- Takes events.EventEmitter (interface) so tests can substitute a
capturing fake without nil-panicking inside hub.Broadcast.
UPDATED: activity.go:Notify
- Replaced ~75 lines of inline broadcast+INSERT with a 12-line
call to AgentMessageWriter.Send.
- Attachment shape conversion (NotifyAttachment → AgentMessageAttachment)
is local to the HTTP handler; the writer's API doesn't import the
HTTP-binding-tagged type.
UPDATED: mcp_tools.go:toolSendMessageToUser
- Replaced ~40 lines (the post-#2944 broadcast+INSERT pair) with a
6-line call to the writer.
- Attachments is nil today because the MCP tool args don't expose
attachments yet. When the schema adds it, build the slice and
pass through; the writer half is ready.
Tests
-----
agent_message_writer_test.go (8 tests, comprehensive):
- TestAgentMessageWriter_Send_Success_NoAttachments — happy path,
pins JSON `{"result":"hi"}`.
- TestAgentMessageWriter_Send_Success_WithAttachments — pins file
parts shape (kind=file, file.{uri,name,mimeType,size}). Uses a
jsonMatcher that decodes + asserts via predicate (tolerant of
map key ordering, exact on shape).
- TestAgentMessageWriter_Send_WorkspaceNotFound — pins
ErrWorkspaceNotFound + asserts NO broadcast NO INSERT.
- TestAgentMessageWriter_Send_DBInsertFailureStillReturnsNil — pins
best-effort persistence contract.
- TestAgentMessageWriter_Send_PreviewTruncation — pins ≤80-char
preview + ellipsis (Ryan's onboarding-friction report would have
bloated activity_logs.summary by 2KB without this).
- TestAgentMessageWriter_Send_BroadcastsAgentMessageEvent — pins WS
event name + payload shape via capturingEmitter.
- TestAgentMessageWriter_Send_OmitsAttachmentsKeyWhenEmpty — pins
the "no key when nil" wire contract.
The existing AST gate from #2944
(TestAgentMessageBroadcastsArePersisted) still holds: any future
function emitting AGENT_MESSAGE without an INSERT fails the test.
With the writer in place that's now redundant — both producers go
through it — but the gate is cheap to keep as defense-in-depth.
Verified: go vet clean; all writer + caller tests pass; existing
TestNotify_* + TestMCPHandler_SendMessage_* + the AST gate all green.
Refs RFC #2945, PR #2944.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per user request: audit all similar tools + write comprehensive tests
including E2E for the persistence-of-AGENT_MESSAGE-broadcasts contract.
Audit (all BroadcastOnly call sites in workspace-server/internal/):
| Site | Event | Persisted? | Notes |
|---|---|:---:|---|
| a2a_proxy_helpers.go:275 | A2A_RESPONSE | ✓ | LogActivity above |
| activity.go:486 (Notify) | AGENT_MESSAGE | ✓ | INSERT line 535 |
| activity.go:701 (LogActivity) | ACTIVITY_LOGGED | ✓ | self-emits inside DB write |
| mcp_tools.go:341 (toolSendMessageToUser) | AGENT_MESSAGE | ✓ | NEW in this PR |
| registry.go:575 | TASK_UPDATED | N/A | transient progress, not chat |
| registry.go:596 | WORKSPACE_HEARTBEAT | N/A | infra ping, not chat |
Only one chat-bearing broadcast was missing persistence (the just-
fixed mcp bridge path). No other regressions found.
Tests added (4 new, total 5 send_message_to_user tests):
1. TestAgentMessageBroadcastsArePersisted — AST gate that walks every
non-test .go in the package, finds funcs that BroadcastOnly with
"AGENT_MESSAGE", asserts each ALSO contains an
"INSERT INTO activity_logs". Forward-looking regression block:
any future chat tool that broadcasts without persisting fails the
test with a clear file:func diagnostic. Mutation-tested locally:
removing the INSERT block from toolSendMessageToUser reliably
produces the expected failure.
2. TestMCPHandler_SendMessageToUser_DBErrorLogsAndStill200s — pins
the "best-effort persistence" contract. DB INSERT failures must
NOT abort the tool response (the WS broadcast already succeeded;
retrying would double-render in the live chat). Matches /notify.
3. TestMCPHandler_SendMessageToUser_ResponseBodyShape — pins the
exact `{"result": "<message>"}` JSON shape stored in
response_body. The canvas hydrater (extractResponseText in
historyHydration.ts) reads body.result; any drift here silently
breaks chat history without failing the INSERT. Per memory
feedback_assert_exact_not_substring.md, asserts the literal JSON
shape, not a substring.
4. TestMCPHandler_SendMessageToUser_PersistsToActivityLog (existing,
from previous commit) — pins INSERT shape with regex on
'a2a_receive' + 'notify' literals.
5. TestMCPHandler_SendMessageToUser_Blocked_WhenEnvNotSet (existing)
— env-gate aborts before DB.
Test fixture cleanup: newMCPHandler now uses newTestBroadcaster (real
ws.Hub) instead of events.NewBroadcaster(nil) — the latter nil-panics
inside hub.Broadcast on the AGENT_MESSAGE path. Same broadcaster
shape every other handler test uses.
E2E note: the AST gate is the strongest forward-looking guarantee.
A real-DB integration test would add value for CI but is largely
duplicative of the sqlmock contract tests above (sqlmock pins SQL
shape with much faster feedback). Left as a future enhancement when
the handlers Postgres-integration suite extends MCP coverage.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reported on production tenant reno-stars: an external claude-code agent
(CEO Ryan PC workspace) sent a long-form message via send_message_to_user;
the user saw it live in the chat panel but it vanished after a refresh.
Confirmed via direct production query — the message is NOT in
activity_logs at all (only short test pings around it are persisted).
Root cause: there are TWO server-side handlers for send_message_to_user:
1. HTTP `/workspaces/:id/notify` (activity.go:Notify) — broadcasts WS
AND inserts a row into activity_logs. This is the path the
in-container runtime's tool_send_message_to_user calls.
2. MCP-bridge `tools/call name=send_message_to_user`
(mcp_tools.go:toolSendMessageToUser) — broadcasts WS only,
**never persisted**. This is the path EXTERNAL agents using
molecule-mcp's send_message_to_user tool route through.
The persistence fix landed for path 1 months ago but was never mirrored
on path 2. External agents — exactly the case in reno-stars/CEO Ryan PC
— have been silently losing every long-form notification on reload.
Fix: mirror the activity.go INSERT shape inside toolSendMessageToUser:
    INSERT INTO activity_logs
      (workspace_id, activity_type, method, summary, response_body, status)
    VALUES ($1, 'a2a_receive', 'notify', $2, $3::jsonb, 'ok')
Same wire shape as /notify so the canvas's chat-history hydration
(`type=a2a_receive&source=canvas`) treats both writers identically.
Errors are log-only — broadcast already succeeded, persistence failure
shouldn't block the tool response (matches /notify behavior; downside
is the same data-loss-on-DB-error risk, surfaced via log.Printf).
Tests
-----
- `TestMCPHandler_SendMessageToUser_PersistsToActivityLog` — pins both
the workspace-name lookup AND the INSERT shape. Regex-matches
`'a2a_receive'` + `'notify'` literals so a future refactor that
changes activity_type or method breaks the test loud, not silently
re-introducing the data-loss bug.
- Updated newMCPHandler to use newTestBroadcaster() (real ws.Hub) —
events.NewBroadcaster(nil) crashes inside hub.Broadcast in the
send_message_to_user path. Same shape every other handler test uses.
Verified `go test ./internal/handlers/ -run TestMCPHandler_SendMessage`
green; full vet clean.
Refs reno-stars production incident 2026-05-05.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related fixes to the Connect-External-Agent flow that the user
flagged: the "Need help?" disclosure block in the modal is for the
operator's eyes only — but the agent reading the pasted snippet has
no access to that context. And the docs URL was pointing at a
hostname that doesn't resolve.
User-visible problems:
1. The agent doesn't see the install link, docs link, or the common-
error/check pairs that the human pasted. When the agent fails to
register or hits ConnectionRefused, it can't self-diagnose because
the troubleshooting context lives in a separate UI block.
2. https://docs.molecule.ai → DNS NXDOMAIN. Every "Documentation"
link in the modal was a dead link.
## Fixes
### Move help INTO the snippet (not a separate human-only UI block)
Each of the 7 server-rendered templates in
`workspace-server/internal/handlers/external_connection.go` now
appends a `# Need help?` section with: install link, correct docs
link, and the top common errors as `# • symptom — check` pairs.
Templates updated: curl / channel (Claude Code) / mcp (Universal MCP) /
python / hermes / codex / openclaw. Agents reading the paste now have
the same diagnostic context the human did.
### Drop the duplicated UI block in the canvas modal
`canvas/src/components/ExternalConnectModal.tsx`:
- Removed the `TAB_HELP` per-tab metadata constant (152 lines).
- Removed the `HelpBlock` component (62 lines).
- Removed the `<HelpBlock help={TAB_HELP[tab]} />` render call.
The snippet is now the single source of truth for tab-level help.
### Fix the wrong docs hostname
The actual docs site is `doc.moleculesai.app` (singular `doc`,
`.app` not `.ai`), confirmed by:
- `package.json` description in `Molecule-AI/docs` repo →
"Molecule AI documentation site — doc.moleculesai.app"
- HTTP HEAD on the new URL → 200 for both
`/docs/guides/mcp-server-setup` and
`/docs/guides/external-agent-registration`
- HTTP HEAD on old `docs.molecule.ai` → 000 (NXDOMAIN)
All template docs URLs now point at `doc.moleculesai.app`.
## Verification
- `go build ./...` clean
- `go test ./internal/handlers/... -count=1` green
- `pnpm test` → 1291/1291 pass (unchanged)
- `tsc --noEmit` clean
- 219 LOC removed (canvas duplicate UI), 69 LOC added (snippet help)
- Net `-150 LOC` while gaining the agent-readable help
## Out of scope (deferred, captured in followups)
- One blog post still has `canonical: "https://docs.molecule.ai/blog/..."`
in `src/app/blog/2026-04-20-chrome-devtools-mcp/page.mdx` — separate
blog-content fix.
- Comment in `theme-provider.tsx` references `docs.moleculesai.app`
(with `s`) — comment-only, not a runtime URL.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User-reported friction: pip install molecule-ai-workspace-runtime on a
3.10 interpreter fails with "Could not find a version that satisfies the
requirement (from versions: none)" — pip's requires_python filter
silently drops the only available artifact before attempting install,
so the error doesn't mention Python at all. Operators see
"package missing", file a bug, and chase a phantom CDN/visibility
issue.
Two changes mirror the requirement at the two operator-touch surfaces:
1. workspace-server/internal/handlers/external_connection.go:
the externalUniversalMcpTemplate snippet (rendered into the
canvas Connect-External-Agent modal) now leads with a brief
"Requires Python >= 3.11" block + diagnostic + upgrade paths.
2. docs/workspace-runtime-package.md: same callout at the top of
the doc, before the Overview, so anyone landing here from search
gets the answer immediately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three small hardening passes from #2872's optional/important findings,
batched into one polish PR:
1. errors.Is(err, sql.ErrNoRows) instead of err == sql.ErrNoRows.
The bare equality breaks if any future caller wraps the error via
fmt.Errorf("…: %w", err) — the no-rows happy path would fall
through to the "real DB error" branch and abort the import.
errors.Is unwraps. New test
TestLookupExistingChild_WrappedNoRows_TreatedAsNotFound pins the
fix; verified the test fails on the old `==` shape (build break
on unused-import + assertion failure once import dropped).
2. Bounded 5s timeout on lookupExistingChild instead of
context.Background().
The createWorkspaceTree call site runs in goroutines spawned from
the /org/import handler, so plumbing the request context here
would cascade-cancel into provisionWorkspaceAuto and abort
in-flight EC2 provisioning if the client disconnected mid-import
— that's the wrong tradeoff. A short bounded timeout protects the
per-row SELECT against a wedged DB without taking the
drop-everything-on-disconnect behaviour. The lookup is a single
~10ms query; 5s leaves 500x headroom for transient slow paths.
3. Godoc clarifications on the skip-path block.
- /org/import is ADDITIVE-ONLY, never destructive. Children
present in the existing tree but absent from the new template
are preserved (no DELETE on diff).
- Skip-path does NOT propagate updates to existing nodes — a
re-import that adds an initial_memory or schedule to an
existing workspace is silently dropped. Document the limitation
so future operators know to delete-and-re-import or reach for
a future /org/sync route.
Verification:
- go build ./... → clean
- go test ./internal/handlers/... → all passing (TestLookup* +
TestCreateWorkspaceTree* + TestClass1* + TestGate*)
- 4 lookup tests + 1 new wrap-safety test → 5/5 PASS
- Full handlers suite → green
Refs molecule-core#2872 (Optional findings — wrap-safety + ctx, godoc
clarifications for additive-only + skip-path-update-limitation)
Out of scope (deferred):
- PR-D partial unique index migration + ON CONFLICT — sequenced
after Phase 4 cleanup verified clean per #2872 plan
- PR-E full createWorkspaceTree integration test for partial-match
— needs heavier sqlmock scaffolding for downstream
workspaces_audit/canvas_layouts/secrets/channels INSERTs;
follow-up
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds class1_ast_gate_test.go — a per-package AST walk that fails the
build if any handler function INSERTs INTO workspaces inside a range
loop body without one of three escape hatches:
1. A call to a registered preflight helper (lookupExistingChild today;
extend preflightCallNames as new helpers are introduced).
2. An ON CONFLICT clause in the same SQL literal (idempotent UPSERT,
like registry.go).
3. An explicit `// class1-gate: idempotent-by-design` comment in the
function body (deliberately awkward — forces a code-review beat).
Why this is broader than the existing
TestCreateWorkspaceTree_CallsLookupBeforeInsert gate in
org_import_idempotency_test.go: that one is hard-coded to one function
in one file. This one walks every non-test .go file in the handlers
package and applies a structural rule independent of file/function
names. A future handler written from scratch in a new file would not
have been covered before — now it is.
Detection mechanism (per AST):
- Collect spans (Lbrace..Rbrace) of every RangeStmt body in each
function. Position-based instead of stack-based — ast.Inspect's
nil-callback ordering doesn't give per-node pop semantics, so a
naive push/pop stack silently miscounts. Position spans are
deterministic.
- Walk every BasicLit, regex-match `^\s*INSERT INTO workspaces\(`
(tightened from bytes.Index "INSERT INTO workspaces" so
workspaces_audit literals don't false-positive — same regex used
by the existing createWorkspaceTree gate).
- For each match: record insertLine, hasONCONFLICT, and the
innermost enclosing RangeStmt line (or 0 if not inside any range).
- Fail the function if INSERT is inside a range AND no preflight
AND no ON CONFLICT AND no allowlist annotation.
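The core span-containment check can be sketched like this (simplified: no preflight, ON CONFLICT, or allowlist handling — just the RangeStmt position spans and the literal match):

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"regexp"
	"strings"
)

var insertRe = regexp.MustCompile(`(?i)^\s*INSERT INTO workspaces\(`)

// insertsInsideRange returns the names of functions whose bodies contain
// an "INSERT INTO workspaces(" string literal positioned inside a range
// loop body — containment decided by Lbrace..Rbrace position spans, not
// a push/pop stack.
func insertsInsideRange(src string) ([]string, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "x.go", src, 0)
	if err != nil {
		return nil, err
	}
	var offenders []string
	for _, decl := range f.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Body == nil {
			continue
		}
		var spans [][2]token.Pos // body spans of every range loop
		ast.Inspect(fn.Body, func(n ast.Node) bool {
			if r, ok := n.(*ast.RangeStmt); ok {
				spans = append(spans, [2]token.Pos{r.Body.Lbrace, r.Body.Rbrace})
			}
			return true
		})
		ast.Inspect(fn.Body, func(n ast.Node) bool {
			lit, ok := n.(*ast.BasicLit)
			if !ok || lit.Kind != token.STRING {
				return true
			}
			if !insertRe.MatchString(strings.Trim(lit.Value, "`\"")) {
				return true
			}
			for _, s := range spans {
				if lit.Pos() > s[0] && lit.Pos() < s[1] {
					offenders = append(offenders, fn.Name.Name)
					break
				}
			}
			return true
		})
	}
	return offenders, nil
}

func main() {
	buggy := "package h\nfunc imp(children []string) {\n\tfor _, c := range children {\n\t\t_ = c\n\t\t_ = `INSERT INTO workspaces(name) VALUES ($1)`\n\t}\n}\n"
	offenders, err := insertsInsideRange(buggy)
	fmt.Println(offenders, err)
}
```

The anchored `workspaces\(` regex is what keeps workspaces_audit literals from false-positiving.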
Self-tests (per `feedback_assert_exact_not_substring.md` —
verify gate fails on the bug shape before merging):
- TestClass1_GateFiresOnSyntheticBuggySource: synthetic source
where INSERT is inside `for _, child := range children` body
must trigger the gate's three guards (enclosingRangeLine!=0,
hasONCONFLICT=false, no preflight call).
- TestClass1_GateAllowsONCONFLICT: synthetic INSERT...ON CONFLICT
must NOT trigger the gate (idempotent UPSERT case).
- TestClass1_GateAllowsAllowlistAnnotation: function with
`// class1-gate: idempotent-by-design` must be skipped.
- TestClass1_NoUnpreflightedInsertInsideRange: production sweep
over every handler .go file. Currently passes because
org_import.go preflights, registry.go ON-CONFLICTs, and
workspace.go's Create has no INSERT inside a range body.
Verification:
- go test ./internal/handlers/... -run TestClass1_ -count=1
→ 4/4 PASS
- go test ./internal/handlers/... -count=1 → suite green
(no pre-existing test broken by the new file)
Refs molecule-core#2867 (PR-A Class 1 generic AST gate)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #2906 spawned the sidecar unconditionally on every tenant boot. The
plugin's first migration runs `CREATE EXTENSION vector` which fails
on tenant Postgres without pgvector preinstalled — every staging
tenant redeploy aborted at the 30s health gate. CP fail-fast kept
running tenants on the prior image (no outage), but the new image
was DOA.
Caught on staging redeploy 2026-05-05 19:23 with
`pq: extension "vector" is not available`.
Fix: only spawn the sidecar when the operator has flipped the cutover
flag — `MEMORY_V2_CUTOVER=true` OR `MEMORY_PLUGIN_URL` is set.
* Aligns the entrypoint to the same opt-in posture wiring.go already
uses (it skips building the client when MEMORY_PLUGIN_URL is empty).
* Until cutover, the sidecar isn't even running — no migration, no
health gate, no boot-time pgvector dependency.
* Operators activating cutover already redeploy with the new env
vars set; that's when the sidecar starts. By definition they've
verified pgvector is available before flipping.
* MEMORY_PLUGIN_DISABLE=1 escape hatch preserved; the harness fix in
becomes belt-and-suspenders (still respected).
Both Dockerfile and entrypoint-tenant.sh updated. Behavior change for
existing deployments: zero (cutover env vars still unset → sidecar
still inert, but now also not running).
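The gate's shape, as a sketch (the function name and exact checks are assumptions about entrypoint-tenant.sh; the semantics are: DISABLE beats everything, then the cutover flag OR a plugin URL opts in):

```shell
#!/bin/sh
# Hypothetical excerpt — the real entrypoint execs the sidecar instead
# of echoing.
should_start_sidecar() {
  if [ "${MEMORY_PLUGIN_DISABLE:-}" = "1" ]; then return 1; fi  # escape hatch wins
  if [ "${MEMORY_V2_CUTOVER:-}" = "true" ]; then return 0; fi   # operator flipped cutover
  if [ -n "${MEMORY_PLUGIN_URL:-}" ]; then return 0; fi         # plugin explicitly wired
  return 1                                                      # default: inert, not running
}

if should_start_sidecar; then
  echo "sidecar: start"
else
  echo "sidecar: skip"
fi
```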
Refs RFC #2728. Hotfix for #2906; supersedes the migration-path
fragility class (the sidecar isn't doing migrations on tenants that
won't use it).
Adds internal/provlog with a single Event(name, fields) helper that
emits JSON-tagged single-line records to the standard logger. Five
boundary sites instrumented for #2867:
provision.start — workspace_dispatchers.go (sync + async)
provision.skip_existing — org_import.go idempotency hit
provision.ec2_started — cp_provisioner.go after RunInstances
provision.ec2_stopped — cp_provisioner.go after TerminateInstances ack
restart.pre_stop — workspace_restart.go before Stop dispatch
These pair with the existing human-prose log.Printf lines (kept). The
new records are grep+jq friendly so a future log-aggregation pipeline
can reconstruct per-workspace provision timelines without parsing the
operator messages — this is the "and debug loggers so it dont happen
again" half of the leak-prevention work.
Tests:
- provlog: emits evt-prefixed JSON, nil-tolerant, marshal-error
fallback preserves event boundary, single-line output pinned.
- handlers: provlog_emit_test.go pins three call-site contracts:
provisionWorkspaceAutoSync emits provision.start with sync=true,
stopForRestart emits restart.pre_stop with backend=cp on SaaS,
and backend=none when both backends are nil.
Field taxonomy is convenience for ops, not contract — payload can grow
additively without breaking callers. Behavior gate is the event name +
boundary location, per feedback_behavior_based_ast_gates.md.
Refs #2867 (PR-D structured logging at provisioning boundaries)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Multi-model retrospective review of #2856 (Phase 1 Expand removal)
flagged that TeamHandler.Collapse is unreachable from the canvas UI:
the "Collapse Team" button calls PATCH /workspaces/:id { collapsed }
(visual flag toggle on canvas_layouts), NOT POST /workspaces/:id/collapse.
The destructive POST route — which stops EC2s, marks children removed,
and deletes layouts — has zero UI callers (verified via grep across
canvas/, scripts/, and the MCP tool registry; only docs referenced it).
Two semantically different operations had been sharing the word
"Collapse":
- Visual collapse (canvas) → PATCH { collapsed: true }. Hides
children visually. Reversible. UI-only.
- Destructive collapse (POST /collapse) → Stops + marks removed.
Irreversible. No caller.
Deleting the destructive one + its supporting machinery:
- workspace-server/internal/handlers/team.go (entirely)
- workspace-server/internal/handlers/team_test.go (entirely)
- POST /collapse route + teamh init in router.go
- findTemplateDirByName helper (zero non-test callers after Expand
was deleted in #2856; package-private so no out-of-package consumers)
- NewTeamHandler constructor (no callers after route removed)
Plus stale doc references (the most dangerous was the MCP wrapper
mapping in mcp-server-setup.md — anyone generating MCP tool wrappers
from that table was wiring a 404):
- docs/agent-runtime/team-expansion.md (deleted entirely — whole
guide taught the deleted flow)
- docs/api-reference.md (dropped two team.go rows)
- docs/api-protocol/platform-api.md (dropped /expand + /collapse
rows)
- docs/architecture/molecule-technical-doc.md (dropped /expand +
/collapse rows)
- docs/guides/mcp-server-setup.md (dropped expand_team +
collapse_team MCP wrapper mappings)
- docs/glossary.md (dropped "(org template expand_team)"
parenthetical)
- docs/frontend/canvas.md (dropped broken link to deleted
team-expansion.md)
Kept: docs/architecture/backends.md mention of "TeamHandler.Expand
(#2367) bypassed routing on Start" — correct historical context for
the AST gate's existence, no live route reference.
Visual-collapse path unaffected:
canvas/src/components/ContextMenu.tsx:227 → api.patch — unchanged
canvas/src/components/WorkspaceNode.tsx:128 → api.patch — unchanged
go vet ./... clean. go test ./internal/handlers/ -count 1 — all green
(4.3s, no regression).
Net: -388/+10 = ~378 lines removed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #2906 shipped the binary at /memory-plugin without the migrations
directory. The plugin's runMigrations() resolved a relative path
`cmd/memory-plugin-postgres/migrations` that exists in the build
context but NOT in the runtime image. Every staging tenant boot
failed with:
memory-plugin-postgres: migrate: read migrations dir
"cmd/memory-plugin-postgres/migrations": open
cmd/memory-plugin-postgres/migrations: no such file or directory
memory-plugin: ❌ /v1/health never returned 200 after 30s
— aborting boot
Caught on the staging redeploy fleet job after #2906 merged. Tenants
stayed on the old image (CP redeploy correctly fail-fasted) but the
new image was broken.
Fix: `//go:embed migrations/*.up.sql` bundles the migrations into
the binary at build time. No filesystem path dependency at runtime.
* `embed.FS` embeds the .up.sql files alongside the binary.
* runMigrations() reads from migrationsFS by default;
MEMORY_PLUGIN_MIGRATIONS_DIR override path preserved for operators
shipping custom migrations.
* Names sorted alphabetically — pinned by a test so a future
`002_*.up.sql` is guaranteed to run after `001_*.up.sql`.
Tests:
* TestMigrationsEmbedded_ContainsCreateTable — pins that the embed
pattern matched files AND those files contain CREATE TABLE
(catches both empty-pattern and wrong-files-embedded).
* TestRunMigrationsFromEmbed_OrderingIsAlphabetic — pins sorted
application order.
Verified locally: `go build` succeeds, binary 9.3MB,
`strings` shows the embedded SQL.
Refs RFC #2728. Hotfix for #2906.
Multi-model retrospective review of #2901 found three Critical gaps:
1. (#2910 PR-B) template_import.go:79 wrote `tier: 3` hardcoded into
generated config.yaml. On SaaS this defeated the T4 default at the
create-handler layer — a config-less template import landed at T3
regardless of POST /workspaces' computed default. The 4th
default-tier site #2901 missed.
2. (#2910 PR-A) #2901 claimed `go test ... all green` but added zero
new tests. Existing structural-pin tests caught dispatch-layer
drift but said nothing about tier-default drift. A future refactor
that flips DefaultTier() to always return 3 would ship green.
3. (#2910 PR-E) org_import.go fallback returned T2 on self-hosted
while workspace.go returned T3. Internally consistent ("bulk vs
interactive defaults") but undocumented same-name-different-value
drift.
Fix:
- TemplatesHandler.NewTemplatesHandler now takes `wh *WorkspaceHandler`
(nil-tolerant for read-only callers). Import + ReplaceFiles compute
tier via h.wh.DefaultTier() and pass it to generateDefaultConfig.
generateDefaultConfig gets a `tier int` parameter (bounds-checked,
invalid input falls back to T3).
- org_import.go fallback lifts to h.workspace.DefaultTier() — single
source of truth shared with Create + Templates so a future
tier-default change sweeps every entry point at once.
- New saas_default_tier_test.go pinning:
TestIsSaaS_TrueWhenCPProvWired
TestIsSaaS_FalseWhenOnlyDocker
TestDefaultTier_SaaS_IsT4
TestDefaultTier_SelfHosted_IsT3
TestGenerateDefaultConfig_RespectsTierParam
TestGenerateDefaultConfig_SelfHostedTierT3
TestGenerateDefaultConfig_OutOfRangeFallsBackToT3
- Existing template_import_test.go tests + chat_files_test.go +
security_regression_test.go updated to thread the new tier param /
wh constructor arg through their NewTemplatesHandler calls. Their
pre-#2910 assertion of `tier: 3` is preserved (now passes because
the test caller passes `3` explicitly), so no regression.
go vet ./... clean. go test ./internal/handlers/ -count 1 — all
green (4.2s).
Deferred to separate follow-ups (per #2910 plan):
- PR-C: MOLECULE_DEPLOYMENT_MODE explicit deployment-mode signal
(closes the IsSaaS()=cpProv!=nil structural fragility)
- PR-D: Host iptables IMDS block + IMDSv2 hop-limit (paired with
molecule-controlplane EC2-IAM-scope audit)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review of PR #2906 flagged: defaultListenAddr was ":9100" — binds
on every container interface. Inside today's deployment that's moot
(no host port mapping, platform talks over loopback) but it's not
least-privilege. A future Dockerfile edit that publishes the port,
a misconfigured Fly machine, or a future cross-host plugin topology
would expose an unauth'd memory store.
Loopback is the right baseline. Operators with a multi-host topology
already override via MEMORY_PLUGIN_LISTEN_ADDR — that path is unchanged.
Tests:
* TestLoadConfig_DefaultListenAddrIsLoopback pins the new default.
* TestLoadConfig_ListenAddrEnvOverride pins the override path so
operators relying on it don't break.
* TestLoadConfig_MissingDatabaseURL covers the existing fail-fast.
No prior unit tests existed for loadConfig — boot_e2e_test.go always
sets MEMORY_PLUGIN_LISTEN_ADDR explicitly, so the default was never
exercised by tests. This PR adds that coverage.
Refs RFC #2728. Hardening follow-up to PR #2906.
Adds TestINSERTworkspacesAllowlist: walks every non-test .go in this
package, finds funcs containing an `INSERT INTO workspaces (` SQL
literal, and pins the result against an explicit allowlist with the
safety mechanism named per entry.
New entries fail the build until a reviewer adds them — forcing the
question "what makes this INSERT idempotent?" at PR-review time, not
after the next bulk-create leak (the shape that produced 72 stale
child workspaces in tenant-hongming over 4 days).
Pairs with TestCreateWorkspaceTree_CallsLookupBeforeInsert (the
behavior pin for the one bulk path today). Together:
- this test catches "did a new function start inserting?"
- that test catches "did the existing bulk path drop its idempotency check?"
Both fire immediately when drift happens.
Current allowlist (3 entries):
- org_import.go:createWorkspaceTree → lookup-then-insert via
lookupExistingChild (#2868 phase 3, also pinned by the sibling AST
gate from #2895)
- registry.go:Register → ON CONFLICT (id) DO UPDATE (idempotent by
primary key — external workspace upsert)
- workspace.go:Create → single-workspace POST /workspaces, server-
generated UUID, no iteration
Verified via mutation: dropping a synthetic tempBulkLeakTest with an
unsafe loop+INSERT into the package fails the gate with a clear
diagnostic pointing at the file + function. Restoring the tree
returns the gate to green.
Memory: feedback_assert_exact_not_substring.md (verify tightened test
FAILS on bug shape) — mutation proof done locally.
RFC #2867 class 1. Class 2 (Prometheus gauge for ec2_instance
duplicates) + class 3 (structured logging on workspace create) are
follow-up PRs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves four of six findings from the retrospective code review of Phases
1–4 (poll-mode chat upload). Bundled because every change is in the
platform's pending_uploads layer or the multi-file handler that reads it.
Findings resolved:
1. Important — Sweep query lacked an index for the acked-retention OR-arm.
The Phase 1 partial indexes are both `WHERE acked_at IS NULL`, so the
`(acked_at IS NOT NULL AND acked_at < retention)` half of the WHERE
clause seq-scanned the table on every cycle. Add a complementary
partial index on `acked_at WHERE acked_at IS NOT NULL` so both arms
of the disjunction are index-covered. Disjoint from the existing two
indexes (no row matches both predicates), so write amplification is
bounded to ~one index entry per terminal-state row.
2. Important — uploadPollMode partial-failure left orphans. The previous
per-file Put loop committed rows 1..K-1 and then errored on row K with
no compensation, so a client retry would double-insert the survivors.
Refactor the handler into three explicit phases (pre-validate +
read-into-memory, single atomic PutBatch, per-file activity row) and
add Storage.PutBatch with all-or-nothing transaction semantics.
3. FYI — pendinguploads.StartSweeperWithInterval was exported only for
tests. Move it to lower-case startSweeperWithInterval and expose the
test seam through pendinguploads/export_test.go (Go convention; the
shim file is stripped from the production binary at build time).
4. Nit — multipart Content-Type was passed verbatim into pending_uploads
rows and re-served on /content. Add safeMimetype which strips
parameters, rejects CR/LF/control bytes, and coerces malformed shapes
to application/octet-stream. The eventual GET /content response can no
longer be header-split via a crafted Content-Type on the multipart.
Comprehensive tests:
- 10 PutBatch unit tests (sqlmock): happy path, empty input, all four
pre-validation rejection paths, BeginTx error, per-row error +
Rollback (no Commit), first-row error, Commit error.
- 4 new PutBatch integration tests (real Postgres): all-rows-commit
happy path with COUNT(*) verification, atomic-rollback no-leak via
a NUL-byte filename that lib/pq rejects mid-batch, oversize
short-circuit no-Tx, idx_pending_uploads_acked existence + partial
predicate via pg_indexes (planner-shape-independent).
- 3 new chat_files_poll tests: atomic rollback on second-file oversize,
atomic rollback on PutBatch error, mimetype CRLF/NUL/parameter
sanitization (8 sub-cases).
The two remaining review findings (inbox_uploads.fetch_and_stage blocks
the poll loop synchronously; two httpx Clients per row) are Python-side
and ship in Phase 5b once this lands on staging.
Test-only export pattern via export_test.go, atomic pre-validation
discipline (validate before Tx), and behavior-based (not name-based)
test assertions follow the standing project conventions.
Closes the gap between the merged Memory v2 code (PR #2757 wired the
client into main.go) and operator activation. Without this PR an
operator wanting to flip MEMORY_V2_CUTOVER=true had to provision a
separate memory-plugin service and point MEMORY_PLUGIN_URL at it —
extra ops surface for what the design intends to be a built-in.
What ships:
* Both Dockerfile + Dockerfile.tenant build the
cmd/memory-plugin-postgres binary into /memory-plugin.
* Entrypoints spawn the plugin in the background on :9100 BEFORE
starting the main server; wait up to 30s for /v1/health to return
200; abort boot loud if it doesn't (better to crash-loop than to
silently route cutover traffic against a dead plugin).
* Default env: MEMORY_PLUGIN_DATABASE_URL=$DATABASE_URL (share the
existing tenant Postgres — plugin's `memory_namespaces` /
`memory_records` tables coexist with platform schema, no
conflicts), MEMORY_PLUGIN_LISTEN_ADDR=:9100.
* MEMORY_PLUGIN_DISABLE=1 escape hatch for operators running the
plugin externally on a separate host.
* Platform image: plugin runs as the `platform` user (not root) via
su-exec — matches the privilege boundary the main server already
drops to. Tenant image already starts as `canvas` so the plugin
inherits non-root automatically.
What stays operator-controlled:
* MEMORY_V2_CUTOVER is NOT auto-set. Behavior change for existing
deployments: zero. The wiring at workspace-server/internal/memory/
wiring/wiring.go skips building the plugin client until the
operator opts in, so the running sidecar is a no-op for traffic
until then.
* MEMORY_PLUGIN_URL is NOT auto-set either, for the same reason —
setting it implies cutover-active intent. Operators set both on
staging first, verify a live commit/recall round-trip (closes
pending task #292), then promote to production.
Operator activation steps after this PR ships:
1. Verify pgvector extension is available on the target Postgres
(the plugin's first migration runs CREATE EXTENSION IF NOT
EXISTS vector). Railway's managed Postgres ships with pgvector
available; some self-hosted operators may need to enable it.
2. Redeploy the workspace-server with this image.
3. Set MEMORY_PLUGIN_URL=http://localhost:9100 + MEMORY_V2_CUTOVER=true
in the environment (staging first).
4. Watch boot logs for "memory-plugin: ✅ sidecar healthy" and the
wiring.go cutover messages; do a live commit_memory + recall_memory
round-trip via the canvas Memory tab to verify.
5. Promote to production once staging holds for a sweep window.
Refs RFC #2728. Closes the dormant-plugin gap noted in task #294.
TestStartSweeper_RecordsMetricsOnError flaked on every CI rerun under
race detection: `error counter delta = 0, want 1`. Root cause is a
race between two goroutines, not a bug in the production sweeper.
The fake `fakeSweepStorage.Sweep` signals `cycleDone` from inside its
deferred return — that happens BEFORE Sweep's return value is
received by `sweepOnce`, which is what triggers the metric increment.
On slow CI hosts the test goroutine wins the read after `waitForCycle`
unblocks and BEFORE StartSweeper's goroutine has called
`metrics.PendingUploadsSweepError`, so the asserted delta is 0 even
though the metric WILL be 1 a few ms later.
Adds a polling assert helper, `waitForMetricDelta`, that closes the
race deterministically without timing-based sleeps:
- TestStartSweeper_RecordsMetricsOnError uses waitForMetricDelta to
wait for the error counter to settle at 1.
- TestStartSweeper_RecordsMetricsOnSuccess uses it on the success
counters (acked, expired) so the error-stayed-zero assertion
reads after StartSweeper has fully processed the cycle.
- waitForCycle keeps its current shape but documents the caveat in
its comment so future tests don't repeat the assumption.
Verified: `go test ./internal/pendinguploads/ -race -count 5` passes
all 9 tests across 5 iterations cleanly.
Per memory feedback_question_test_when_unexpected.md: the
"delta=0, want=1" failure looked like a real production bug at first
glance, but instrumented inspection showed the metric DOES increment,
just AFTER the test's read. The fix is the test's wait shape, not
the sweeper.
Unblocks every PR currently broken by this flake (#2898 hit it on
two consecutive CI runs; staging-merged PRs from earlier today
(#2877/#2881/#2885/#2886) introduced the test).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User reported every SaaS workspace defaults to T2 (Standard). Four
sites quietly disagreed on the default:
- canvas CreateWorkspaceDialog (line 126): isSaaS ? 4 : 3 ← only correct one
- canvas EmptyState "Create blank": tier: 2 ← hardcoded
- workspace.go POST /workspaces: tier = 3 ← not SaaS-aware
- org_import.go createWorkspaceTree: tier = 2 (fallback) ← not SaaS-aware
So a user clicking "+ New Workspace" via the dialog got T4 on SaaS,
but a user clicking "Create blank" on the empty canvas got T2, and an
agent POSTing /workspaces directly got T3. Same tenant, three different
tiers depending on entry point.
Fix:
1. WorkspaceHandler.IsSaaS() and DefaultTier() helpers (workspace_dispatchers.go).
IsSaaS() := h.cpProv != nil — single source of truth for "are we
SaaS" across the file. DefaultTier() returns 4 on SaaS, 3 on
self-hosted. SaaS rationale: each workspace runs on its own sibling
EC2 so the per-workspace tier boundary is a Docker resource limit
on the only container present — no neighbour to protect from. T4
matches the boundary.
2. workspace.go now defaults tier via h.DefaultTier() instead of
hardcoded T3.
3. org_import.go fallback (when neither ws.tier nor defaults.tier set)
becomes SaaS-aware: T4 on SaaS, T2 on self-hosted (preserve the
existing safe-shared-Docker-daemon default for self-hosted org
imports).
4. canvas EmptyState "Create blank" stops sending tier:2 in the body
and lets the backend pick — single source of truth in the backend.
Eliminates the third disagreement.
Test plan:
- go vet ./... clean
- go test ./internal/handlers/ -count 1 — all green (4.3s)
- npx tsc --noEmit on canvas — clean
- Staging E2E (after deploy): create a fresh workspace via canvas
empty-state on hongming.moleculesai.app, confirm tier=4 on the
workspace details panel.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous TestCreateWorkspaceTree_CallsLookupBeforeInsert used
bytes.Index("INSERT INTO workspaces"), which prefix-matches
INSERT INTO workspaces_audit, INSERT INTO workspace_secrets, and
INSERT INTO workspace_channels. RFC #2872 cited this as a silent
false-pass mode: a future refactor that adds an audit-table INSERT
literal earlier in source than the real workspaces INSERT would
make the gate point at the wrong target.
Replaces the byte-search with a go/ast walk + a regex that requires
`\s*\(` after `workspaces` — distinguishes the real target from
prefix lookalikes.
Adds three discriminating tests:
- TestWorkspacesInsertRE_RejectsLookalikes — pins the regex against
9 sql shapes (real, raw-string-literal, audit-shadow, workspace_*
prefixes, canvas_layouts, UPDATE/SELECT, comments).
- TestGate_FailsWhenLookupAfterInsert — synthesizes Go source where
the lookup is positioned AFTER the workspaces INSERT, asserts the
helper returns lookupPos > insertPos (which the production gate
flags via t.Errorf). Proves the gate isn't vestigial.
- TestGate_IgnoresAuditTableShadow — synthesizes source with an
audit-table INSERT BEFORE the lookup + real INSERT, asserts the
tightened regex correctly walks past the shadow and finds the
real INSERT.
Also extracts findLookupAndWorkspacesInsertPos as a helper so the
gate logic can be exercised against synthetic source, not only
against the real org_import.go.
Memory: feedback_assert_exact_not_substring.md (verify tightened
test FAILS on old code) — TestGate_FailsWhenLookupAfterInsert is
the failing-on-bug-shape proof.
Closes the silent-false-pass mode of #2872 Important-1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 4 closes out the rollout — strict-sqlmock unit tests pin which
SQL fires, but they cannot detect bugs that depend on the actual row
state after the SQL runs. Real-Postgres integration tests catch:
- the Sweep CTE depends on Postgres' make_interval function and
the table's CHECK constraints; sqlmock would happily accept a
hand-written SQL literal that Postgres rejects at runtime.
- the partial idx_pending_uploads_unacked index only catches a
wrong WHERE predicate at real-query-plan time.
- subtle predicate drift (e.g. a WHERE clause that filters by
acked_at IS NOT NULL but uses BETWEEN incorrectly).
Test cases:
- PutGetAckRoundTrip: the full happy path — Put, Get, MarkFetched,
Ack, idempotent re-Ack, Get-after-Ack returns ErrNotFound.
- Sweep_DeletesAckedAfterRetention: row not eligible at retention=1h
immediately after Ack; deleted at retention=0.
- Sweep_DeletesExpiredUnacked: backdated expires_at exercises the
unacked-and-expired branch of the WHERE clause.
- Sweep_DeletesBothCategoriesInOneCycle: three rows (acked, expired,
fresh); a single Sweep deletes the first two and leaves the third.
- PutEnforcesSizeCap: ErrTooLarge above MaxFileBytes.
- GetIgnoresExpiredAndAcked: Get filters predicate matches expected
row state in the table.
Run path:
- locally via the file-header docker incantation.
- CI runs on every PR/push that touches handlers/** OR migrations/**
(.github/workflows/handlers-postgres-integration.yml).
Phase 3 of the poll-mode chat upload rollout. Stack atop Phase 2.
The platform's pending_uploads table grows once-per-uploaded-file with
no built-in cleanup. Phase 1's hard TTL (expires_at default 24h) makes
expired rows un-fetchable but doesn't actually delete them; Phase 1's
ack stamps acked_at but leaves the row indefinitely. Without a sweep
the table grows unbounded across normal traffic.
This PR adds:
- `Storage.Sweep(ctx, ackRetention)` — a single round-trip CTE that
deletes acked rows past their retention window plus unacked rows
past expires_at. Returns `(acked, expired)` deletion counts so
Phase 3 dashboards can spot the stuck-fetch pattern (high expired,
low acked) vs healthy churn.
- `pendinguploads.StartSweeper(ctx, storage, ackRetention)` —
background goroutine that calls Sweep every 5 minutes (default).
Runs once immediately on startup so a platform restart cleans up
any rows that became eligible while we were down.
- Prometheus counters `molecule_pending_uploads_swept_total` with
`outcome={acked,expired,error}` labels. Wired into the existing
`/metrics` endpoint.
- Wired from cmd/server/main.go via supervised.RunWithRecover —
one transient panic doesn't take the platform down with it.
Defaults:
- SweepInterval = 5m (matches the dashboard refresh cadence)
- DefaultAckRetention = 1h (gives the workspace at-least-once retry
headroom in case it processed but failed to write the file before
crashing)
Test coverage: 100% on storage_test.go (extended with sweepSQL pin +
six Sweep test cases including negative-retention clamp + zero-retention
immediate-delete + DB error wrapping) and sweeper_test.go (ticker-driven
+ ctx-cancel + nil-storage + transient-error-doesn't-crash + metric
counter assertions).
Closes the third of four phases tracked on the parent RFC; phase 4 is
the staging E2E test.
Closes #2865 (split-B of the #2669 root-cause stack).
The phantom-busy sweep in workspace-server/internal/scheduler/scheduler.go
already logs each row reset, but no aggregate metric surfaces "how often
is this firing." A regression that causes high reset rates (e.g.
controlplane#481's missing env vars, or future drift in the workspace
runtime's task-lifecycle accounting) only surfaces when users complain.
Fix: counter exposed at /metrics as molecule_phantom_busy_resets_total,
incremented from sweepPhantomBusy after each row whose active_tasks
was reset. Same shape as existing molecule_websocket_connections_active.
Operator-side dashboard: alert when daily phantom-busy reset count
> 0.5% of active workspaces. Today's steady-state is near-zero; any
increase is a regression signal.
Tests:
- TestTrackPhantomBusyReset_IncrementsCounter
- TestTrackPhantomBusyReset_RaceFreeUnderConcurrentWrites (50×200
concurrent writes; tests atomic invariant)
- TestHandler_ExposesPhantomBusyResetsCounter (asserts HELP + TYPE
+ value lines in Prometheus text format)
- TestHandler_PhantomBusyResetsZeroByDefault (fresh-process 0
contract — prevents a future refactor from accidentally dropping
the metric from /metrics)
Race-detector clean. Vet clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The workspace inbox poller filters
`GET /workspaces/:id/activity?type=a2a_receive` — writing rows with
`activity_type=chat_upload_receive` would be silently invisible to it.
Switch the poll-mode upload-staging handler to write
`activity_type=a2a_receive` with `method=chat_upload_receive` as the
discriminator. Same shape as A2A's `tasks/send` vs `message/send` method
split; the workspace-side handler (Phase 2) routes by `method`, not
activity_type.
Pinned with `TestPollUpload_ActivityRowDiscriminator` — sqlmock
WithArgs on positions 2 (activity_type) and 5 (method) so a refactor
that flips activity_type back to a custom value gets a red test
instead of a runtime "poller saw nothing" silent break.
External-runtime workspaces (registered via molecule connect, behind
NAT, no public callback URL) currently see HTTP 422 "workspace has no
callback URL" on every chat file upload. The only escape is to wrap the
laptop in ngrok / Cloudflare tunnel + re-register push-mode — a tax
that shouldn't exist for a one-line use case.
This phase introduces the platform-side staging layer that lets
canvas → external workspace uploads ride the same poll loop the inbox
already uses for text messages.
Architecture (mirrors inbox poll, SSOT principle):
Canvas POST /chat/uploads (multipart)
↓ delivery_mode=poll
Platform: chat_files.uploadPollMode
↓ pendinguploads.Storage.Put + LogActivity(chat_upload_receive)
Workspace's existing inbox poller picks up the activity row (Phase 2)
Workspace fetches: GET /workspaces/:id/pending-uploads/:fid/content
Workspace acks: POST /workspaces/:id/pending-uploads/:fid/ack
Pieces in this PR:
* Migration 20260505100000 — pending_uploads table; partial indexes
on unacked + expires_at for the workspace fetch + Phase 3 sweep
hot paths. No FK to workspaces (audit retention), 24h hard TTL.
* internal/pendinguploads — Storage interface + Postgres impl. Bytes
inline (bytea) today; the interface lets a future PR replace with
S3 (RFC #2789) by swapping one constructor. 100% test coverage on
the Postgres impl via sqlmock-pinned SQL.
* handlers.PendingUploadsHandler — GET /content + POST /ack endpoints.
wsAuth-gated; cross-workspace bleed protection via per-row
workspace_id check (token leak from A can't read B's pending bytes).
Handler tests pin happy path + every 4xx/5xx mapping including
cross-workspace + race-with-sweep.
* chat_files.go — Upload poll-mode branch behind WithPendingUploads
builder. Push-mode unchanged (regression-tested). Multipart parse
+ per-file sanitize + storage.Put + activity_logs row per file.
* SanitizeFilename — Go mirror of workspace/internal_chat_uploads.py
sanitize_filename. Tests pin parity case-by-case so canvas-emitted
URIs stay identical regardless of which path handles the upload.
* Comprehensive logging — every state transition (staged, fetch,
ack, error) emits a structured log line with workspace_id +
file_id + size + sanitized name. Phase 3 metrics will hook these.
The pendinguploads.Storage wiring is opt-in (WithPendingUploads on
ChatFilesHandler) so a binary deployed without the migration keeps the
pre-existing 422 behavior — no boot-order coupling between code roll
and schema roll.
Phase 2 (separate PR): workspace inbox extension — inbox_uploads.py
fetches via the GET endpoint, writes to /workspace/.molecule/chat-
uploads/, acks, and rewrites the URI from platform-pending: → workspace:
so the agent's existing send-attachments path needs no changes.
Phase 3: GC sweep + dashboards. Phase 4: poll-mode E2E on staging.
Tests:
* 100% coverage on pendinguploads (sqlmock-pinned SQL drift gate).
* Functional 100% on new handler code (uncovered branches are
documented defensive duplicates: uuid re-parse, multipart Open
error, Writer.Write fail — none reproducible in unit tests).
* Push-mode + NULL delivery_mode regression tests pin no behavior
change for existing workspaces.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two call sites — workspace_provision.go:537 and org_import.go:54 —
duplicated the same `if runtime == "claude-code"` branch deciding
the default model when the operator/agent didn't supply one. They
were copy-pasted; nothing prevented them from drifting silently.
Extract to `models.DefaultModel(runtime string) string`. Both call
sites now route through the helper. New runtimes need one entry
in DefaultModel + one assertion in TestDefaultModel — pre-fix it
required two source edits + an audit.
Foundation for the future `RuntimeConfig` interface (RFC #2873 +
task #231): once we add `ProvisioningTimeout()`, `CapabilitiesSupported()`
etc., the helper expands to per-runtime structs and `DefaultModel`
becomes one method on the interface.
## Coverage
15 unit tests pinning the exact contract:
- claude-code → "sonnet"
- 9 other known runtimes → universal default
- empty + unknown → universal default (matches pre-refactor fallthrough)
- case-sensitivity preserved (CLAUDE-CODE → universal default)
Plus invariant test: `DefaultModel` never returns "" — protects
against a future "return early on unknown" regression that would
silently break workspace creation.
## Verification
- go build ./... clean
- 15 model unit tests pass
- existing handler tests untouched (no behavior change at call sites)
- identical output to pre-refactor for every input
First iteration of the OSS-shape refactor program. Each PR meets all
7 bars (plugin/abstract/modular/SSOT/coverage/cleanup/file-split).
Refs RFC #2873.
Three small follow-ups from #2866 self-review:
1. TestIntegration_Sweeper_StaleHeartbeatIsMarkedStuck — assert
strings.Contains(errDet, "no heartbeat for") instead of != "".
The original "non-empty" check passes for any error_detail value;
if a future regression swaps the message format, the test wouldn't
catch it. Pin the production format string explicitly.
2. TestIntegration_Sweeper_DeadlineExceededIsMarkedFailed — drop the
redundant `last_heartbeat = now()` write. The sweeper checks
deadline FIRST (the stronger statement) and short-circuits before
evaluating heartbeat staleness, so the heartbeat field is irrelevant
for that test path.
3. integrationDB doc comment now warns explicitly that the helper is
NOT t.Parallel()-safe — it hot-swaps the package-level mdb.DB and
restores via t.Cleanup. If a future contributor adds t.Parallel()
to one of these tests they race on the global. Comment makes the
constraint discoverable instead of a debugging surprise.
All 7 integration tests still pass against real Postgres locally.
OrgHandler.Import was non-idempotent — every call INSERTed a fresh row
for every workspace in the tree, regardless of whether matching
workspaces already existed. Calling /org/import twice with the same
template duplicated the entire tree.
This was the bigger leak source than TeamHandler.Expand (deleted in
PR #2856). tenant-hongming accumulated 72 distinct child workspaces
in 4 days entirely from repeated org-template spawns of the same
template — the (tier × runtime) matrix in the audit data was the
template's static shape, multiplied by spawn count.
Fix: route through a new lookupExistingChild helper before INSERT.
Skip-if-exists semantics by default:
- Match on (parent_id, name) using `IS NOT DISTINCT FROM` so NULL
parents (root workspaces) are included.
- Ignore status='removed' rows so collapsed teams or deleted
workspaces don't block re-import.
- Recursion still runs on the existing id so partial-match templates
(parent exists, some children missing) backfill correctly instead
of either no-op'ing the whole subtree or duplicating the existing
children.
- Result entries for skipped nodes carry skipped:true so callers
(canvas Import preflight modal) can surface "5 of 7 already
existed, 2 created."
The recursion that walked ws.Children is extracted into
recurseChildrenForImport so both the create-path and the skip-path
share one implementation — no duplicated grid math, no two paths to
keep in sync.
Note: replace_if_exists semantics (re-roll: stop+delete old, create
new) are deferred. Skip-if-exists alone closes the leak; re-roll is
a later UX decision for the canvas Import preflight modal.
Tests:
- 4 sqlmock cases on lookupExistingChild: not-found, found,
nil-parent (the IS NOT DISTINCT FROM NULL trick), DB-error
propagates (must fail fast — silent fallback to INSERT is the
failure mode the helper exists to prevent).
- 1 source-level AST gate (per memory feedback_behavior_based_ast_gates.md):
pins that h.lookupExistingChild( appears BEFORE INSERT INTO workspaces
in org_import.go. If a future refactor reintroduces the un-checked
INSERT, the gate fails. Verified load-bearing by removing the call —
build fails (helper symbol gone).
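The ordering gate reduces to an index comparison over the source text; a minimal sketch (the inline strings stand in for reading org_import.go):

```go
package main

import (
	"fmt"
	"strings"
)

// gateLookupBeforeInsert sketches the source-level gate: both markers
// must exist, and the lookup call must appear before the INSERT.
func gateLookupBeforeInsert(src string) error {
	iLookup := strings.Index(src, "h.lookupExistingChild(")
	iInsert := strings.Index(src, "INSERT INTO workspaces")
	if iLookup == -1 || iInsert == -1 || iLookup > iInsert {
		return fmt.Errorf("lookupExistingChild must run before INSERT INTO workspaces")
	}
	return nil
}

func main() {
	good := `existing, _ := h.lookupExistingChild(ctx, parentID, name)
_ = "INSERT INTO workspaces (...) VALUES (...)"`
	bad := `_ = "INSERT INTO workspaces (...)"`
	fmt.Println(gateLookupBeforeInsert(good) == nil) // true
	fmt.Println(gateLookupBeforeInsert(bad) == nil)  // false
}
```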
go vet ./... clean. go test ./internal/handlers/ -count 1 — all green
(4.2s, no regression on existing OrgImport / Provision / Team tests).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real-Postgres tests for the RFC #2829 PR-3 sweeper. Validates:
- Deadline-exceeded rows are marked failed with the expected
error_detail
- Stale-heartbeat in-flight rows are marked stuck (uses
DELEGATION_STUCK_THRESHOLD_S env override for deterministic
timing)
- Healthy rows (fresh heartbeat + future deadline) are not touched
— no false-positive against well-behaved delegations
These extend the gate added in the previous commit so the workflow
catches sweeper regressions, not just ledger-write ones. All 7
integration tests now pass; CI workflow runs them all.
Two-part PR:
## Fix: result_preview was lost on completion
Self-review of #2854 caught a real bug. SetStatus has a same-status
replay no-op; the order of calls in the `executeDelegation` completion
path and the `UpdateStatus` completed branch clobbered the preview field:
1. updateDelegationStatus(completed, "") fires
2. inner recordLedgerStatus(completed, "", "")
→ SetStatus transitions dispatched → completed with preview=""
3. outer recordLedgerStatus(completed, "", responseText)
→ SetStatus reads current=completed, status=completed
→ SAME-STATUS NO-OP, never writes responseText → preview lost
Confirmed against real Postgres (see integration test). Strict-sqlmock
unit tests passed because they pin SQL shape, not row state.
Fix: call the WITH-PREVIEW recordLedgerStatus FIRST, then
updateDelegationStatus. The inner call becomes the no-op (correctly
preserves the row written by the outer call).
Same gap fixed in UpdateStatus handler — body.ResponsePreview was
never landing in the ledger because updateDelegationStatus's nested
SetStatus(completed, "", "") fired first.
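A minimal model (hypothetical, heavily simplified types) of why the call order matters given the same-status replay no-op:

```go
package main

import "fmt"

// ledgerRow models the delegation ledger row. SetStatus mirrors the
// same-status replay no-op described above: the first write to a given
// status wins; a later call with the same status is ignored.
type ledgerRow struct {
	status, preview string
}

func (l *ledgerRow) SetStatus(status, preview string) {
	if l.status == status {
		return // same-status replay no-op
	}
	l.status = status
	l.preview = preview
}

func main() {
	// Buggy order: the preview-less inner call lands first; the
	// WITH-PREVIEW outer call hits the no-op, so the preview is lost.
	buggy := &ledgerRow{status: "dispatched"}
	buggy.SetStatus("completed", "")             // inner recordLedgerStatus
	buggy.SetStatus("completed", "responseText") // outer call, no-op'd
	fmt.Printf("buggy order: preview=%q\n", buggy.preview)

	// Fixed order: the WITH-PREVIEW call fires first; the later
	// preview-less call becomes the no-op and preserves the row.
	fixed := &ledgerRow{status: "dispatched"}
	fixed.SetStatus("completed", "responseText")
	fixed.SetStatus("completed", "")
	fmt.Printf("fixed order: preview=%q\n", fixed.preview)
}
```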
## Gate: real-Postgres integration tests + CI workflow
The unit-test-only workflow that shipped #2854 was the root cause.
Adding two layers of defense:
1. workspace-server/internal/handlers/delegation_ledger_integration_test.go
— `//go:build integration` tag, requires INTEGRATION_DB_URL env var.
4 tests:
* ResultPreviewPreservedThroughCompletion (regression gate for the
bug above — fires the production call sequence in fixed order
and asserts row.result_preview matches)
* ResultPreviewBuggyOrderIsLost (DIAGNOSTIC: confirms the
same-status no-op contract works as designed; if SetStatus's
semantics ever change, this test fires)
* FailedTransitionCapturesErrorDetail (failure-path symmetry)
* FullLifecycle_QueuedToDispatchedToCompleted (forward-only +
happy path)
2. .github/workflows/handlers-postgres-integration.yml
— required check on staging branch protection. Spins postgres:15
service container, applies the delegations migration, runs
`go test -tags=integration` against the live DB. The job always
runs, with per-step gating on a path filter (handlers/wsauth/
migrations), so the required-check name is satisfied even on PRs
that don't touch relevant code.
Local dev workflow (file header documents this):
docker run --rm -d --name pg -e POSTGRES_PASSWORD=test -p 55432:5432 postgres:15-alpine
psql ... < workspace-server/migrations/049_delegations.up.sql
INTEGRATION_DB_URL="postgres://postgres:test@localhost:55432/molecule?sslmode=disable" \
go test -tags=integration ./internal/handlers/ -run "^TestIntegration_"
## Why this matters
Per memory `feedback_mandatory_local_e2e_before_ship`: backend PRs
MUST verify against real Postgres before claiming done. sqlmock pins
SQL shape; only a real DB can verify row state. The workflow makes
this gate mandatory rather than optional.
Every workspace can have children via the regular CreateWorkspace flow
with parent_id set, so a separate handler that bulk-creates from
config.yaml's sub_workspaces (and was non-idempotent — calling it twice
duplicated the team) earned its way out. "Team" is just the state of
having children; expanding/collapsing is purely a canvas-side visual
action that toggles the `collapsed` column via PATCH.
The non-idempotency directly caused tenant-hongming's vCPU starvation:
72 distinct child workspaces accumulated in 4 days, ~14 leaked EC2s
(50 of 64 vCPU consumed by stale teams), every Canvas tabs E2E retry
flaking on RunInstances VcpuLimitExceeded.
What stays:
- TeamHandler.Collapse — still useful; stops + removes children via
StopWorkspaceAuto. Reachable from the canvas Collapse Team button.
(Note: that button currently calls PATCH /workspaces/:id, not the
Collapse endpoint — that's a separate reachability question for
later.)
- findTemplateDirByName helper — kept in team.go pending a relocation
decision; no in-package consumers remain after Expand's removal.
- The four other paths that create child workspaces continue to work
unchanged: regular POST /workspaces with parent_id, OrgHandler.Import
(recursive tree), Bundle import, scripts.
What goes:
- POST /workspaces/:id/expand route (router.go)
- TeamHandler.Expand method (team.go: ~130 lines)
- 4 TestTeamExpand_* sqlmock tests (team_test.go)
- TestTeamExpand_UsesAutoNotDirectDockerPath AST gate
(workspace_provision_auto_test.go) — pinned a code path that no
longer exists; the generic TestNoCallSiteCallsDirectProvisionerExceptAuto
gate still covers the architectural intent for any future caller.
Follow-up PRs:
- canvas/ContextMenu.tsx: drop the "Expand to Team" right-click button
  + handleExpand callback; users create children via the regular
  New Workspace dialog with the parent picker (already supported)
- OrgHandler.Import idempotency (skip-if-exists OR replace_if_exists)
— same bug class as the deleted Expand, but on the bulk-tree path
- One-off cleanup script for tenant-hongming's 72 stale workspaces
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>