molecule-core

Author	SHA1	Message	Date
Hongming Wang	512fdfd59d	fix(canvas): plain drag out of parent un-nests again Un-nest used to require holding Alt (or Cmd to force-detach). That was too conservative — when a user dragged a child clearly outside its parent's bbox, nothing happened on release, because the default branch soft-clamped back and only the Alt branch actually opened the "Extract?" confirm. Matches the exact bug the user just flagged ("I can put agents in other agent, but when I drag it out, it does not move out"). New rules: * Past the 20 % hysteresis → confirm un-nest. Plain drag, no modifier. This is what most users expect (Miro / Figma behave the same way — drag outside the frame and the shape leaves it). * Inside or within 20 % of the edge → soft-clamp back inside. Guards against twitchy releases that momentarily overshoot the edge by a few pixels. * Cmd / Ctrl → force un-nest regardless of overlap. Escape-hatch for when the user dragged within the hysteresis zone but really wants out. * Dropping onto a different parent → nest there (unchanged). Alt is no longer a required modifier for un-nesting. Keeps it as a non-gesture modifier only; no meaning unless we re-bind it later. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 20:48:38 -07:00
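The release rules in 512fdfd59d can be read as a small decision function. The sketch below is illustrative only: `DETACH_FRACTION` is a name taken from the later `dragUtils.ts` split, while `outsideFraction`, `decideRelease`, and the `Rect` shape are hypothetical, and the `>=` comparison follows the ">=20%" wording of f3423a513d rather than any confirmed implementation.

```typescript
// Hypothetical shapes; the real canvas code may differ.
type Rect = { x: number; y: number; width: number; height: number };
type ReleaseAction = "force-unnest" | "confirm-unnest" | "clamp";

// 20 % hysteresis threshold described in the commit message.
const DETACH_FRACTION = 0.2;

// Fraction of the child's area that lies outside the parent's bbox.
function outsideFraction(child: Rect, parent: Rect): number {
  const overlapW = Math.max(
    0,
    Math.min(child.x + child.width, parent.x + parent.width) - Math.max(child.x, parent.x),
  );
  const overlapH = Math.max(
    0,
    Math.min(child.y + child.height, parent.y + parent.height) - Math.max(child.y, parent.y),
  );
  const childArea = child.width * child.height;
  if (childArea === 0) return 0;
  return 1 - (overlapW * overlapH) / childArea;
}

// Cmd/Ctrl forces un-nest; otherwise the hysteresis decides.
function decideRelease(child: Rect, parent: Rect, forceModifier: boolean): ReleaseAction {
  if (forceModifier) return "force-unnest";
  return outsideFraction(child, parent) >= DETACH_FRACTION ? "confirm-unnest" : "clamp";
}
```

Dropping onto a different parent is a separate branch (nest there) and is not modelled here.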
Hongming Wang	f2a4b6e0d3	fix: dev-mode bypass for IP rate limiter + 429 retry on GET The 600-req/min/IP bucket is sized for SaaS where each tenant has a distinct client IP. On a local Docker setup every panel shares one IP — hydration (/workspaces + /templates + /org/templates + /approvals/pending) plus polling (A2A overlay + activity tabs + approvals + schedule + channels + audit trail) can burst past the bucket inside a minute, blanking the canvas with 429s. The user reported it after dragging workspaces — dragging itself is release-only (savePosition in onNodeDragStop), but the polling that's always running added onto startup tripped the limit. Two-layer fix: Server: RateLimiter.Middleware short-circuits when isDevModeFailOpen is true (MOLECULE_ENV=development + empty ADMIN_TOKEN), matching the Tier-1b hatch already applied to AdminAuth, WorkspaceAuth, and discovery. SaaS production keeps the bucket. Client: api.ts auto-retries a single 429 on idempotent GET requests, waiting the server-provided Retry-After (capped at 20s). Mutations (POST/PUT/PATCH/DELETE) never auto-retry to avoid double-applying. Users on SaaS hitting a legitimate rate-limit spike get one transparent recovery instead of an immediately-blank Canvas. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 20:44:09 -07:00
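The client-side half of f2a4b6e0d3 (one retry on an idempotent GET, honouring Retry-After up to 20s) can be sketched as a pure helper. This is an assumption-laden illustration, not the code in `api.ts`: `retryDelayMs`, `MAX_RETRY_AFTER_MS`, and the 1s fallback when the header is missing are all hypothetical, and only the delta-seconds form of Retry-After is handled.

```typescript
// Cap taken from the commit message ("capped at 20s").
const MAX_RETRY_AFTER_MS = 20000;
// Hypothetical fallback when the server sends no usable Retry-After.
const FALLBACK_WAIT_MS = 1000;

// Returns how long to wait before the single retry, or null for "do not retry".
function retryDelayMs(
  method: string,
  status: number,
  retryAfter: string | null,
  alreadyRetried: boolean,
): number | null {
  if (status !== 429 || alreadyRetried) return null;
  // Mutations never auto-retry, to avoid double-applying.
  if (method.toUpperCase() !== "GET") return null;

  let waitMs = FALLBACK_WAIT_MS;
  if (retryAfter !== null && retryAfter.trim() !== "") {
    const seconds = Number(retryAfter);
    if (Number.isFinite(seconds) && seconds >= 0) waitMs = seconds * 1000;
  }
  return Math.min(waitMs, MAX_RETRY_AFTER_MS);
}
```

A caller would sleep for the returned duration and reissue the GET once, passing `alreadyRetried = true` on the second attempt.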
Hongming Wang	286dcbfd1e	fix(canvas,org): collapse org-imported parents on first paint Importing a 15-workspace org template dropped every child as a freely-positioned card into its parent's coordinate space. Parents with 5-10 kids had the kids spill below the parent's initial min size, producing the "ugly default" layout the user just flagged — a mess of overlapping cards the moment the import completed. Fix: every workspace in an org-template import that HAS children is inserted with `collapsed = true`. Leaf workspaces stay expanded (nothing to hide). The canvas renders a collapsed parent as a compact header-only card with its "N sub" badge — visually identical to the pre-refactor default the user asked for. Double-click on a collapsed parent now EXPANDS it (flipping `collapsed` locally + persisting via PATCH) so the user can drill in to see the subtree. Only once expanded does a second double-click zoom-to-team, matching the prior behaviour. Leaf-first creation order stays the same; the collapsed flag just means "render compact" not "hide from API". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 20:36:55 -07:00
Hongming Wang	507696d88a	fix(canvas,server): address review findings on `3f11df03` Five review findings from the `3f11df03` six-bug commit: 1. Add TestPeers_DevModeFailOpen_{Allows,ClosedWhenAdminTokenSet, ClosedInProduction} covering all three gating states for the security-sensitive dev-mode hatch the prior commit added to /registry/:id/peers. Previously shipped untested — a future refactor could have silently inverted polarity or removed the gate. New tests pin the contract: * MOLECULE_ENV=development + ADMIN_TOKEN="" → allow bearerless * MOLECULE_ENV=development + ADMIN_TOKEN set → require token * MOLECULE_ENV=production → require token 2. ConfigTab handleSave diffs against the RAW parsed YAML / form config instead of the DEFAULT_CONFIG-merged shape. The previous code would silently PATCH tier=1 to the DB when a user deleted the `tier:` line in raw mode (the default-merge substituted 1). Now: only fields the user actually typed participate in the diff. Type guards (typeof === "number" / "string") prevent coercion surprises on malformed YAML. 3. ConfigTab model-save failure no longer lies "Saved". The /workspaces/:id/model PATCH can reject when the runtime doesn't support the chosen model; previously we caught + console.warn'd + showed green Saved, and the user watched the model revert on next reload with no explanation. Now the save path collects a `modelSaveError` and surfaces it via setError with a partial- success message ("Other fields saved, but model update failed: …") so the user sees why. 4. ChannelsTab now surfaces BOTH channels-fetch and adapters-fetch failures, distinguishing them in the error text ("Failed to load connected channels and platforms — try refreshing"). Previously only an adapters failure was visible; a channels failure left the user with an apparently-empty list and no indication the API was unreachable. 5. ChatTab panels drop the redundant aria-hidden attribute. 
The `hidden`/`flex` Tailwind class already sets display:none, which removes the node from the accessibility tree on its own; the extra aria-hidden invited WAI-ARIA lint warnings if a focusable descendant ever landed inside an inactive panel. Tests: 923 canvas + full Go handler suite pass. 3 new Go tests. No behaviour change on the five prior fixes — this commit tightens their edges per the independent review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 20:29:44 -07:00
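The three gating states pinned by the new tests in 507696d88a amount to a two-variable predicate. The server-side check is Go (`IsDevModeFailOpen`); the TypeScript below is only a hypothetical mirror of the stated contract, with an assumed `env` object standing in for the process environment.

```typescript
// Hypothetical mirror of the contract described above; not the Go source.
type GateEnv = { MOLECULE_ENV?: string; ADMIN_TOKEN?: string };

// Fail-open only in development AND only when no admin token is configured.
function isDevModeFailOpen(env: GateEnv): boolean {
  return env.MOLECULE_ENV === "development" && (env.ADMIN_TOKEN ?? "") === "";
}
```

Any state other than "development with an empty token" keeps the bearer requirement, which is what the `ClosedWhenAdminTokenSet` and `ClosedInProduction` cases assert.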
Hongming Wang	3f11df031c	fix: six UX bugs (peers auth, scroll, chat tabs, config persist, + visibility) Six bugs reported from a live session — all shippable in one commit: 1. Peers tab 401 on local Docker. The /registry/:id/peers endpoint demands a workspace-scoped bearer token (validateDiscoveryCaller) which the canvas session doesn't hold. Added the same Tier-1b dev-mode fail-open hatch that AdminAuth and WorkspaceAuth already use — gated by MOLECULE_ENV=development + empty ADMIN_TOKEN, so SaaS production stays strict. Exported IsDevModeFailOpen from the middleware package for the handler layer to reuse. 2. Org Templates list unscrollable. OrgTemplatesSection was rendered in the TemplatePalette footer — a div without overflow — so when it expanded to 15+ entries the list extended past the viewport with no scroll. Moved it to the top of the flex-1 overflow-y-auto container. Tall lists now scroll naturally. 3. Chat tab: "My Chat" and "Agent Comms" rendered stacked instead of switching. HTML `hidden` attribute was being overridden by Tailwind's `flex` class (display: flex beats the attribute), so both tabpanels rendered concurrently. Swapped to a conditional Tailwind `hidden`/`flex` class so the inactive panel is display:none with proper CSS specificity. 4. Hermes Config form never persists. handleSave wrote config.yaml but name / tier / runtime / model all live on the workspace row (or the dedicated /workspaces/:id/model endpoint) — the form edited in-memory, the request returned 200, the next reload wiped everything back. Hermes + external runtimes manage their own config inside the container anyway, so writing config.yaml is a no-op for them; skip it. Always diff and PATCH the DB-backed fields that actually changed. 5. Channels "+ Connect" dropdown empty on first open. ChannelsTab's load() used Promise.all with a silent catch — if EITHER the channels or adapters fetch failed, both setters were skipped with no error visible. 
Switched to Promise.allSettled so each endpoint settles independently, and the adapters failure now surfaces via the top-level error state. 6. Plugin registry always "No plugins in registry". Same silent catch pattern in SkillsTab.tsx — load errors for /plugins, /plugins/sources, and /workspaces/:id/plugins swallowed without logging. Replaced the empty catches with console.warn so future failures are at least visible in devtools. Tests: 923 passing (unchanged). Go handler tests pass. Server rebuilt and running with the peers-auth + collapsed-persistence fixes (pid 15875). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 20:18:30 -07:00
Hongming Wang	4fd7f1e84c	fix(canvas): tighten rescue + cap toast + cover paths with tests Three follow-up review findings from the `c2b2e13a` review: 1. Rescue heuristic uses pure bbox-non-overlap. The previous `position.x < 0` branch rescued any child whose parent was later dragged past it, even when the layout was clearly recoverable (e.g. relative -40, child still overlaps parent). New rule: rescue iff the child's bbox has zero overlap with the parent's bbox — self-calibrating, scales with user-resized parents, catches screenshot-case and legacy huge-positive data. 2. Toast caps failed-name list at 3 and appends "and N more". Stops a 50-node partial failure from overflowing the toast container. 3. Cycle guard on selection-roots walk in batchNest. Corrupt parentId data can't send the loop infinite now. Cheap defensive guard — one Set per selected node. Tests added (923 total, up from 918): * canvas-topology.test: 4 rescue scenarios — screenshot case (zero-overlap rescue), negative drift kept, huge-positive rescued, user-resized layout kept. * canvas.test: selection-roots filter on a 3-level chain. * workspace_crud test: PATCH {collapsed:true} runs the UPDATE. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 20:08:14 -07:00
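Two of the rules in 4fd7f1e84c are easy to state as pure functions: rescue only when the child's bbox has zero overlap with the parent's, and cap the failed-name list at 3. The sketch below assumes the child rect is expressed in the parent's local frame (parent origin at 0,0); `hasZeroOverlap` and `formatFailedNames` are hypothetical names, not identifiers from the repository.

```typescript
// Hypothetical shape; coordinates are relative to the parent's origin.
type Box = { x: number; y: number; width: number; height: number };

// True when the two boxes share no area (touching edges count as no overlap).
function hasZeroOverlap(child: Box, parent: Box): boolean {
  return (
    child.x + child.width <= parent.x ||
    child.x >= parent.x + parent.width ||
    child.y + child.height <= parent.y ||
    child.y >= parent.y + parent.height
  );
}

// Lists at most `cap` names and summarises the rest as "and N more".
function formatFailedNames(names: string[], cap: number = 3): string {
  if (names.length <= cap) return names.join(", ");
  return `${names.slice(0, cap).join(", ")} and ${names.length - cap} more`;
}
```

Under this rule a child at relative -40 that still overlaps the parent is kept, while a child far outside in either direction is rescued.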
Hongming Wang	c2b2e13abe	fix(canvas): address code-review findings on the Canvas refactor Five issues surfaced in the review of `50b53784`. Each was either a real bug waiting to hit users or a silent failure mode. 1. Topology rescue no longer teleports user-resized children. Rescue was comparing against parentMinSize(childCount), so any child the user had placed in space the parent was resized into got snapped to the default grid on reload — undoing the layout. Now rescue fires only on obviously corrupt data: negative relative coords (legacy pre-nesting absolute positions that landed above/left of their assigned parent) or values past a MAX_PLAUSIBLE_OFFSET threshold. Children just-past the initial minimum are left alone. 2. batchNest now filters to selection-roots before planning. Previously selecting both A and A's descendant B and dragging into T yanked B out of A to become a sibling under T. Users reasonably expect the A subtree to move intact. The new pass drops any selected node whose ancestor is also selected — those follow their ancestor via React Flow's parent binding. 3. batchNest surfaces partial failure via showToast. Previously silent: 2 of 5 PATCHes fail, user sees 3 cards re-parented + 2 snapped back with no explanation. Now names the failed cards. 4. confirmNest closes the nest dialog BEFORE dispatching the async store action, so a second drag can't kick off a competing batch while the first is still in flight. 5. collapsed is now persisted. The Go workspace_crud.go Update handler ignored the `collapsed` field, so user-initiated collapse round-tripped to an expanded state on next hydrate. Added the PATCH branch (`UPDATE workspaces SET collapsed = ...`) so the state survives reload. Nits cleaned: * Removed dead dragStartParentRef in useDragHandlers. * Swapped redundant `node.data as WorkspaceNodeData` casts for a named WorkspaceNode type alias. 
* Canvas.tsx SR-live region now reads n.parentId (matches MiniMap + RF's native field) instead of the mirror n.data.parentId. Tests added (918 total, up from 915): * batchNest happy path — 2-root selection fires 2 combined PATCHes carrying parent_id + x + y, not 2×N sequential round-trips. * batchNest ancestor+descendant selection — subtree stays intact. * batchNest partial failure rollback — only the rejected nodes revert; successful ones stay committed. Backend change is single-line (collapsed PATCH branch); all workspace_crud Go tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 19:58:44 -07:00
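The selection-roots pass from c2b2e13abe (drop any selected node whose ancestor is also selected) can be sketched as below. `selectionRoots` and `NodeLite` are hypothetical names, and the `seen` set reflects the cycle guard that the follow-up 4fd7f1e84c describes; the actual `batchNest` code may be organised differently.

```typescript
// Hypothetical minimal node shape.
type NodeLite = { id: string; parentId?: string };

// Keeps only selected nodes that have no selected ancestor.
function selectionRoots(selected: string[], nodes: NodeLite[]): string[] {
  const parentOf = new Map(
    nodes.map((n): [string, string | undefined] => [n.id, n.parentId]),
  );
  const selectedSet = new Set(selected);

  return selected.filter((id) => {
    // Cycle guard: corrupt parentId data cannot loop forever.
    const seen = new Set<string>([id]);
    let cur = parentOf.get(id);
    while (cur !== undefined && !seen.has(cur)) {
      if (selectedSet.has(cur)) return false; // an ancestor is selected
      seen.add(cur);
      cur = parentOf.get(cur);
    }
    return true;
  });
}
```

Dropped nodes are not re-parented individually; per the commit message they follow their ancestor through React Flow's parent binding.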
Hongming Wang	50b537849a	refactor(canvas): split Canvas.tsx into hooks; parallelize batchNest Two concerns in one commit (separate files, each self-contained): ## Canvas.tsx split (from ~680 to ~250 lines) Canvas.tsx was holding drag gesture state + keyboard shortcuts + viewport wiring + JSX. Each concern now lives in its own unit under canvas/src/components/canvas/: - dragUtils.ts — pure: shouldDetach, clampChildIntoParent, DETACH_FRACTION - DropTargetBadge.tsx — the floating "Drop into: <name>" label + the dashed ghost preview at the target slot - useDragHandlers.ts — encapsulates onNodeDragStart / Drag / Stop, findDropTarget hit-test, pendingNest state, and confirmNest/cancelNest. Routes multi- select drags through batchNest automatically. - useKeyboardShortcuts — Esc, Enter, Shift+Enter, Cmd+]/[, Z — one window listener, one source of truth. - useCanvasViewport — pan-to-node + zoom-to-team CustomEvent listeners and the debounced viewport save. Canvas.tsx becomes a thin composition + JSX file. No behavioural change; the refactor is covered by the existing 915 canvas tests. ## batchNest parallelization (2N round-trips → N, all in flight) Previously nestNode fired two sequential PATCHes (parent_id then x/y) and batchNest looped nestNode sequentially. For a 5-node selection on a typical ~200ms link this was ~2s of serialized RPCs. - nestNode now combines parent_id + x + y into ONE PATCH. The Go handler (workspace_crud.go Update) already reads all three from the same body — no backend change. - batchNest rewritten: compute every re-parent plan against one snapshot, commit a single set(), then fire N PATCHes via Promise.allSettled in parallel. Per-node failures roll back only that node (others stay committed) — same semantics as the single- node path, just concurrent. - The state math in the batch path also correctly shifts descendant zIndex by depthDelta when any re-parented node has a subtree. 
## Also - canvas-topology.ts: reverted P3.12's opt-in rescue to the auto- rescue default. When a child's stored relative position would render it outside the parent bbox (the visual regression the user saw after collapse → reload — Hermes child drawn outside Claude Code Agent on first paint), the child is placed in the next default grid slot. The "Arrange Children" context command stays for bigger teams. All 915 canvas tests pass. No backend changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 19:43:18 -07:00
Hongming Wang	c5abed988e	fix(canvas): address review findings on playability pass Five Critical issues caught in code review of `f3423a51`. Each one broke an invariant the original commit claimed to uphold. 1. nestNode: descendants kept their old-depth zIndex after a re-parent. Now walks the dragged subtree and shifts every descendant's zIndex by the same depthDelta so "children above ancestors" survives moves between levels of the hierarchy. 2. bumpZOrder: siblings all share zIndex = depth in fresh topology, so a single +1 bump was identical for every sibling and subsequent bumps drifted zIndex unboundedly. Rewritten to sort siblings by current zIndex and swap the target with its neighbour in the bump direction — Figma-style reorder, stays within the sibling tier. 3. findDropTarget: depth-first tiebreaker lost to bumped siblings. The visually-frontmost card after Cmd+] is a shallow sibling, but the hit test picked the deepest nested card regardless. Swapped order so zIndex wins first, depth second, area third. Also pre-computes the depth map once per call (was O(n²) via repeated .find walks — will matter past ~30 workspaces). 4. arrangeChildren: saved absolute position using `slot + parent.position`, but parent.position is RELATIVE to its own parent when nested. Grandchildren's stored x/y were in the parent's local frame and reload placed them in the wrong spot. Now walks the full ancestor chain via absOf() to get the true canvas-absolute origin before PATCHing. 5. setCollapsed: naive flip of every descendant's hidden flag diverged from the topology rebuild on hydrate. Collapse A, collapse B, then expand A — C should stay hidden because B is still collapsed, but before this fix C was unhidden. Rewritten to recompute every descendant's hidden from the full ancestry chain, matching the topology pass byte-for-byte. New round-trip test asserts the two code paths produce identical node.hidden across a full lifecycle. 
Also: - Removed dead cascadeMessage constant (never rendered). - Replaced hardcoded 260/120 in zoom-to-team with exported constants. - arrangeChildren PATCH catch now logs instead of silently swallowing. - Added 70→76 tests: setCollapsed 3-chain scenarios, bumpZOrder swap semantics, edge-of-list no-op. All 915 canvas tests green. Backend untouched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 19:16:48 -07:00
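The `setCollapsed` fix in c5abed988e replaces a naive flip with a recompute from the full ancestry chain: a node is hidden iff any ancestor is collapsed. A minimal sketch of that rule follows; `computeHidden` and the `WsNode` shape are hypothetical and stand in for whatever the store and topology pass actually share.

```typescript
// Hypothetical minimal node shape.
type WsNode = { id: string; parentId?: string; collapsed: boolean };

// A node is hidden when at least one of its ancestors is collapsed.
function computeHidden(nodes: WsNode[]): Map<string, boolean> {
  const byId = new Map(nodes.map((n): [string, WsNode] => [n.id, n]));
  const hidden = new Map<string, boolean>();

  for (const node of nodes) {
    let isHidden = false;
    const seen = new Set<string>([node.id]); // guards against corrupt cycles
    let cur = node.parentId !== undefined ? byId.get(node.parentId) : undefined;
    while (cur !== undefined && !seen.has(cur.id)) {
      if (cur.collapsed) {
        isHidden = true;
        break;
      }
      seen.add(cur.id);
      cur = cur.parentId !== undefined ? byId.get(cur.parentId) : undefined;
    }
    hidden.set(node.id, isHidden);
  }
  return hidden;
}
```

With the chain A > B > C, collapsing A and B and then expanding A leaves C hidden because B is still collapsed, which is the scenario the commit describes.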
Hongming Wang	f3423a513d	feat(canvas): industry-pattern playability pass (P1+P2+P3) Ships the full prioritized improvement list from the canvas research report — aligns our nesting/resize UX with Miro / FigJam / tldraw / Figma conventions. Organized by priority below. ## P1 — baseline playability * Hysteresis on drag-out detach (Miro): a child only un-nests when >=20% of its bbox is outside the parent on release. Prevents accidental un-nesting from twitchy drags. * Drop-target now uses tree-depth DESC, then zIndex DESC, then area ASC to pick targets when nested parents overlap (xyflow #2827). * Children render above ancestors by inheriting zIndex = parent + 1 in topology and on every nest/unnest (xyflow #4012). * Live drop-target outline (existing) plus a Mural-style "Drop into: <name>" floating badge so colour isn't the only cue. * growParentsToFitChildren now fires only on dimension-type changes inside onNodesChange (NodeResizer commits) and once on drag-stop — avoids tldraw's edge-chase artifact (P3.11 commit-on-release). ## P2 — polish * Whimsical-style ghost preview: dashed outline at the next default grid slot inside the drop-target parent during drag. * Alt-drag escape with soft clamp: dropping slightly outside a parent without Alt/Cmd snaps the child back inside (clampChildIntoParent); Alt releases the clamp to allow un-nest; Cmd/Ctrl force-detaches. * Figma-style keyboard hierarchy nav: Enter descends to first child, Shift+Enter ascends to parent, Cmd+]/[ re-orders siblings via the new bumpZOrder store action. * Multi-select re-parent preserves offsets: confirmNest routes through a new batchNest action when the primary drag is part of a batch selection (Lucidchart pattern). ## P3 — long-tail * Minimap now shows parent cards as filled regions with a blue stroke, so hierarchy reads at a glance without zooming. 
* Out-of-bounds rescue is opt-in: topology no longer silently re-lays children whose stored position is outside the parent bbox (Figma trust-the-data). The new Arrange Children context menu item runs the rescue on demand via arrangeChildren. * Cmd-drag force-detach regardless of hysteresis. * Collapse workspace: the existing Collapse Team action now toggles a local setCollapsed store action that hides every descendant and shrinks the parent card to header-only (Miro frame outline view). Growth pass skips collapsed parents so they don't push back out. All 910 canvas tests green. Backend untouched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 19:03:02 -07:00
Hongming Wang	d359390f83	fix(canvas): parent auto-fit sizing + rescue out-of-bounds children Two playability bugs in the new flat-cards layout: 1. On first load or fresh org import a parent had no explicit width or height, so children whose stored position sat inside their (eventual) parent's rectangle rendered visually outside the smaller default parent box. Compute a parent starting size in canvas-topology: • 2-column grid of child-default footprints + header/side padding • Grows per child count (2→1 row, 3-4→2 rows, etc.) and stamp it onto the Node's width/height so the first paint already contains every child. 2. If a child's stored relative position actually falls outside the parent's computed bounds (legacy org-imports at 0,0, pre-refactor absolute coordinates, manually-nudged rows), assign that child a deterministic default grid slot inside the parent instead. Runtime cascade: added growParentsToFitChildren to onNodesChange so when the user drags or resizes a child past the parent's current bounds, the parent grows to contain it (+padding). Miro/FigJam-style frame auto-fit — grow-only, never shrinks under the user's manual resize. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 18:29:04 -07:00
Hongming Wang	cc194f0b7e	refactor(canvas): flat workspace cards with React Flow native parenting Every workspace now renders as a first-class card on the canvas regardless of parent_id. The old "parent card contains mini TeamMember chips" layout is gone — if B is parented to A, B renders as a full card inside A's coordinate space using React Flow's `parentId` binding, so moving A carries B along and children have the same detail + actions as root cards. Details: - canvas-topology.ts: topologically sort parents before children (React Flow ordering requirement), compute each child's RF-native parentId + relative position on load. DB keeps absolute x/y; the abs→rel conversion happens here, reverse translation in Canvas.onNodeDragStop before savePosition PATCHes the DB. - WorkspaceNode.tsx: delete the EmbeddedTeam + TeamMemberChip blocks, simplify the size classes, and add NodeResizer (visible when selected) so users can drag any edge/corner to grow or shrink. Parent cards default to a larger min size so nested children have breathing room. - Canvas.tsx drop targeting rewritten: bounds-based hit test against each node's measured absolute bbox, deepest match wins. Fixes two prior bugs at once — dropping onto Claude Code with a nested same- named Hermes no longer picks the wrong node, and the target can now be a nested workspace when that's where the pointer actually released. - canvas.ts nestNode + removeNode: translate position between old and new parent's absolute origin on nest/unnest so the card doesn't jump, and re-point the RF `parentId` alongside `data.parentId` on reparent. - Tests: hidden-flag assertions replaced with parentId checks; obsolete TeamMemberChip a11y/eject tests deleted (the UI component no longer exists). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 18:18:44 -07:00
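The coordinate handling in cc194f0b7e (absolute x/y in the DB, parent-relative positions for React Flow) comes down to a pair of inverse translations. The helpers below are a hypothetical sketch; `toRelative`, `toAbsolute`, and `Point` are not names from the repository, and `parentAbs` is assumed to already be the parent's canvas-absolute origin.

```typescript
// Hypothetical point type.
type Point = { x: number; y: number };

// On load: DB absolute position -> position relative to the parent's origin.
function toRelative(childAbs: Point, parentAbs: Point): Point {
  return { x: childAbs.x - parentAbs.x, y: childAbs.y - parentAbs.y };
}

// On drag stop: relative position -> absolute position to PATCH back.
function toAbsolute(childRel: Point, parentAbs: Point): Point {
  return { x: childRel.x + parentAbs.x, y: childRel.y + parentAbs.y };
}
```

For nested parents the parent's own position is itself relative, so the absolute origin has to be accumulated up the ancestor chain first, which is the issue the later `absOf()` fix addresses.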
Hongming Wang	8a07cf4035	fix(canvas): skip already-nested workspaces as drop targets Dragging one workspace onto another could pick a nested child as the "nearest" drop target instead of the visible parent card the user actually hovered. The effect: dropping a free-floating Hermes Agent onto a Claude Code Agent that already had a Hermes Agent nested inside showed "Move 'Hermes Agent' inside 'Hermes Agent'?" — the confirmation referenced the nested same-named child, not Claude Code. Why: getIntersectingNodes returns every overlapping node, including hidden=true children that render inside their parent's card. The parent and child share bounding boxes, so the child often "won" the nearest-distance check. Filter them out at the source: a node that's already got a parentId (or is hidden) is never a valid top-level drop target. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 17:49:01 -07:00
Hongming Wang	7356cf8d3a	fix(chat): clear sending spinner when any path delivers the reply Two latent bugs kept the "Processing with Claude Code..." timer ticking after the agent had already answered: 1. The A2A_RESPONSE store handler wrote into agentMessages[workspaceId] (no prefix) but ChatTab's "clear sending" effect subscribed to agentMessages["a2a:" + workspaceId]. Keys never matched — the effect was dead code from day one. Removed the dead subscription and moved the setSending(false) into the pendingAgentMsgs effect so any reply delivered via a WS push (Claude Code SDK, Hermes's send_message_to_user) also closes the spinner. 2. Added an activity-log fallback: when the platform emits a successful a2a_receive ACTIVITY_LOGGED for this workspace, clear sending and stop the timer. That covers the "runtime answered but we never saw the store message" case Claude Code exhibited tonight — the HTTP request can stay in flight while the SDK already pushed its reply. Symmetric a2a_receive error path also clears sending and surfaces the error message, so a runtime-side failure no longer hangs the UI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 17:43:30 -07:00
Hongming Wang	1c60869e1e	Merge remote-tracking branch 'origin/staging' into fix/restore-quickstart-plus-hotfixes # Conflicts: # .gitignore	2026-04-23 17:38:08 -07:00
Hongming Wang	18ebb1d7bf	fix(server): remove 60s A2A client timeout + correct file-read cat args Two bugs surfaced while testing Claude Code + OAuth deploys: 1. A2A proxy: a2aClient had a 60s Client.Timeout "safety net" that defeated the per-request context deadlines the code otherwise sets (canvas = 5m, agent-to-agent = 30m). Claude Code's first-token cold start over OAuth takes 30-60s, so every first "hi" into a fresh claude-code workspace returned 503 at exactly the 1m mark. Removed the Client.Timeout — the context deadline now governs as documented in the adjacent comment. 2. Files tab: ReadFile ran `cat <rootPath> <filePath>` as two args to cat. `cat /home agent/turtle_draw.py` tries to read the rootPath directory (errors "Is a directory") and then resolves the filePath relative to the container cwd, which is not guaranteed to equal rootPath. Result: the file-content pane stayed blank even though the file listed fine. Join into a single path before exec. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 17:25:53 -07:00
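The Files-tab bug in 18ebb1d7bf is the difference between passing two arguments to `cat` and passing one joined path. The real handler is Go; the TypeScript sketch below only illustrates the argument construction, with a hypothetical `catArgs` helper, and does not attempt path-traversal checks.

```typescript
// Builds the argv for reading one file under rootPath (hypothetical helper).
function catArgs(rootPath: string, filePath: string): string[] {
  // Join into a single path so cat never receives rootPath on its own.
  const root = rootPath.replace(/\/+$/, "");
  const rel = filePath.replace(/^\/+/, "");
  return ["cat", `${root}/${rel}`];
}
```

With the values from the commit message, the buggy form was effectively `cat /home agent/turtle_draw.py`; the joined form reads `/home/agent/turtle_draw.py` regardless of the container's working directory.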
Hongming Wang	e337efe974	fix(canvas): propagate runtime through WORKSPACE_PROVISIONING event The side-panel runtime pill read "unknown" for newly-deployed workspaces because canvas-events.ts created the node from WORKSPACE_PROVISIONING payload — and the payload only carried name + tier. No refetch filled the gap during provisioning, so the user saw "RUNTIME unknown" on the card even though the DB row had the real runtime set. Includes runtime in every WORKSPACE_PROVISIONING emitter: * handlers/workspace.go — initial create * handlers/workspace_restart.go — explicit restart, auto-restart, and crash-recovery resume loop * handlers/org_import.go — multi-workspace org imports Canvas-side: canvas-events.ts reads payload.runtime when creating the node; the provisioning test asserts the pill value is populated before any refetch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 17:17:49 -07:00
Hongming Wang	dc50a1c775	refactor(canvas): data-drive provider picker from template config.yaml The MissingKeysModal's provider list was hardcoded in deploy-preflight.ts as RUNTIME_PROVIDERS — a per-runtime map that duplicated what each template repo already declares in its config.yaml. That meant adding a new provider required changes in two places, and the UI could drift out of sync with the actual template (e.g. when a template adds a MiniMax or Kimi model, the picker wouldn't know). The single source of truth for "which env vars does this workspace need" is each template's config.yaml: * `runtime_config.models[].required_env` — per-model key list * `runtime_config.required_env` — runtime-level AND list Go /templates already returned `models`. This change: * Adds `required_env` alongside `models` on templateSummary so the canvas receives the full picture. * Rewrites deploy-preflight.ts to derive ProviderChoice[] from a template object via `providersFromTemplate(template)`: - groups `models[]` by unique required_env tuple - falls back to runtime_config.required_env when models is empty - decorates labels with model counts (e.g. "OpenRouter (14 models)") * `checkDeploySecrets(template, workspaceId?)` now takes a template object instead of a runtime string. Any-provider satisfaction still short-circuits preflight to ok=true. * MissingKeysModal receives `providers` directly; no more lookups. * TemplatePalette threads `template.models` + `template.required_env` into the preflight. Side effects: * Claude Code's dual-auth (OAuth token OR Anthropic API key) now surfaces as two picker options — its config.yaml already declared both, the UI just wasn't reading them. * Hermes picker now shows 8 provider options (Nous, OpenRouter, Anthropic, Gemini, DeepSeek, GLM, Kimi, Kilocode) instead of the hand-picked 3, matching its 35-model reality. 
Removed the legacy RUNTIME_PROVIDERS / RUNTIME_REQUIRED_KEYS / getRequiredKeys / findMissingKeys exports; MissingKeysModal.test.tsx deleted (its coverage is subsumed by the new template-driven deploy-preflight.test.ts). 58 modal-adjacent tests pass; full canvas suite 919 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 17:07:15 -07:00
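The grouping step described for `providersFromTemplate(template)` in dc50a1c775 (group `models[]` by unique `required_env` tuple, fall back to the runtime-level list when there are no models) might look roughly like this. The type shapes, the `modelCount` field, and the choice to sort env names so that order does not affect grouping are all assumptions; the env var names in the usage note are placeholders.

```typescript
// Hypothetical shapes loosely following the fields named in the commit message.
type ModelSpec = { id: string; required_env?: string[] };
type TemplateLite = { models?: ModelSpec[]; required_env?: string[] };
type ProviderChoice = { envVars: string[]; modelCount: number };

function providersFromTemplate(template: TemplateLite): ProviderChoice[] {
  const models = template.models ?? [];

  // Fallback: no models declared, use the runtime-level required_env list.
  if (models.length === 0) {
    const env = template.required_env ?? [];
    return env.length > 0 ? [{ envVars: env, modelCount: 0 }] : [];
  }

  // Group models by their required_env tuple (sorted so order is irrelevant).
  const groups = new Map<string, ProviderChoice>();
  for (const model of models) {
    const envVars = [...(model.required_env ?? [])].sort();
    const key = envVars.join("|");
    const existing = groups.get(key);
    if (existing !== undefined) {
      existing.modelCount += 1;
    } else {
      groups.set(key, { envVars, modelCount: 1 });
    }
  }
  return Array.from(groups.values());
}
```

A template with two models requiring `KEY_A` and one requiring `KEY_B` yields two choices, and the per-group `modelCount` is what a label such as "OpenRouter (14 models)" would be decorated from.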
Hongming Wang	3456bf79a7	Merge pull request #1931 from Molecule-AI/chore/remove-internal-content-from-monorepo chore: remove internal content + add hard CI gate (CEO directive 2026-04-23)	2026-04-23 17:04:29 -07:00
rabbitblood	427b764f58	chore: remove internal content + add hard CI gate (CEO directive 2026-04-23) This monorepo is public. Internal content (positioning, competitive briefs, sales playbooks, PMM/press drip, draft campaigns) belongs in Molecule-AI/internal — never here. ## What this PR removes /research/ (3 competitive briefs) /marketing/ (45 files: assets, audio, community, copy, demos, devrel, drip, pmm, press, sales) /docs/marketing/ (31 draft campaign / blog / brief files) comment-1172.json + comment-1173.json test-pmm-temp.txt tick-reflections-temp.md 83 files removed, 7,141 lines deleted from public history (going forward — historical commits remain visible in this repo's git log). ## Companion: internal repo absorption Molecule-AI/internal PR `chore/migrate-monorepo-internal-content-2026-04-23` absorbs all 79 files into `from-monorepo-2026-04-23/` for curator triage into the existing internal/marketing/ tree. Bulk-dump avoids file-collision on overlapping subdirs (audio, devrel, pmm). ## Three-layer enforcement so this can't recur 1. .gitignore — blocks `git add` of /research, /marketing, /docs/marketing, /comment-*.json, *-temp.{md,txt}, /test-pmm-*, /tick-reflections-* 2. .github/workflows/block-internal-paths.yml — CI hard gate. Fails any PR that adds a forbidden path. Cannot be silently bypassed. 3. docs/internal-content-policy.md — canonical decision tree for agents and humans. Linked from the CI failure message. A separate PR on molecule-ai-org-template-molecule-dev updates SHARED_RULES to teach every agent role to write internal content directly to Molecule-AI/internal via gh repo clone + commit + PR (the prevention-at- source layer; this PR is the mechanical backstop). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 16:58:28 -07:00
Hongming Wang	958eec3a7d	Merge pull request #1929 from Molecule-AI/chore/remove-org-templates chore: remove org-templates/molecule-dev — standalone repo is source of truth	2026-04-23 16:46:55 -07:00
Hongming Wang	a8f41a57ea	chore: remove org-templates/molecule-dev — standalone repo is source of truth Reverts the `.gitignore` checkin-exception for molecule-dev that let it creep back on every main↔staging sync. Keeping this dir in core meant: - 800KB of template files shipping with every monorepo clone - Confusion about which copy is canonical (this one vs the standalone Molecule-AI/molecule-ai-org-template-dev repo) - Merge churn — `0506e0c` re-added it against #6e6de39's removal intent just by taking 'theirs' in a conflict resolution All org-templates now live in their own repos, fetched via scripts/clone-manifest.sh when needed locally. molecule-dev has no special status; it's the same shape as every other org template. The .gitignore rule is now a simple `/org-templates/` with no exceptions, matching the rule structure already used for `/plugins/` and `/workspace-configs-templates/`. Future conflict resolutions can't re-add by accident because git won't track anything under that path. User flagged this at session start 2026-04-23 ('org-templates should only exist as standalone template repo'). Fixing for real this time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 16:44:18 -07:00
Hongming Wang	c5bcd7298c	Merge remote-tracking branch 'origin/staging' into fix/restore-quickstart-plus-hotfixes # Conflicts: # workspace-server/internal/handlers/ssrf.go	2026-04-23 16:42:41 -07:00
Hongming Wang	baa7e1531f	feat(canvas): provider-picker MissingKeysModal for multi-provider runtimes Runtimes like Hermes and LangGraph accept any one of several LLM provider keys (OpenRouter OR OpenAI OR Anthropic OR Nous-native). Before this change, the missing-keys modal treated all supported providers as simultaneously required — a fresh user on Hermes was asked for three parallel API keys when any one suffices. Introduces RUNTIME_PROVIDERS in deploy-preflight.ts as the canonical per-runtime provider list (label, envVar, note). checkDeploySecrets now returns all alternatives as missingKeys when nothing is configured, so the modal can offer a picker. MissingKeysModal dispatches between two render paths: * ProviderPickerModal — radio list of supported providers, a single env input for the chosen one. Saving that one key satisfies the preflight. Activated whenever the runtime has ≥2 provider choices. * AllKeysModal — legacy parallel-inputs UX, all keys must be saved before deploy. Kept for single-provider runtimes (claude-code, gemini-cli) and callers that pass unrelated-key lists. Dual-mode preserves the pre-existing contract for every caller while fixing the multi-provider UX. All 930 canvas vitest tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 16:41:09 -07:00
Hongming Wang	03b56fa5af	fix(canvas): collapse Org Templates section by default in palette The TemplatePalette's Org Templates section rendered all cards inline, each ~120 px tall (name + description + "Import org" button). With 4 org templates on disk that's ~500 px of drawer height — the individual workspace templates at the top (AutoGen / LangGraph / Hermes / …) got pushed off-screen, which is the exact complaint from the test session ("templates still 90% org, cant even see normal workspace template"). Collapsed the Org Templates section by default. The header now toggles with an ▶ caret and shows the count ("Org Templates (4)"). Clicking expands to reveal the full card list; clicking again collapses. Persists only within a session — fresh mounts start collapsed so the primary deploy path stays visible. Individual workspace templates are the usual starting point (pick a runtime, deploy one agent), while org templates are a heavier "deploy this whole pre-built team" action. Making the second expandable matches the relative frequency. - `TemplatePalette.tsx::OrgTemplatesSection` — added `expanded` state (default false), wrapped the cards in `{expanded && …}`, turned the header into a toggle button with `aria-expanded` + `aria-controls`. - `__tests__/OrgTemplatesSection.test.tsx` — 3 new rendering tests: collapsed-by-default (cards absent), click expands (cards appear), click again collapses (cards gone). Mocks /org/templates with a 2-entry response so the count assertion is stable. Full canvas vitest: 930/930 pass (up from 927). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 16:24:49 -07:00
Hongming Wang	50ae33e8b3	Merge pull request #1885 from Molecule-AI/fix/ki005-security-clean [P0] fix(security): F1085/KI-005/CWE-78 — clean rebase onto staging	2026-04-23 16:11:03 -07:00
Hongming Wang	b4719ad070	fix(canvas): Legend avoids TemplatePalette + silence WS handshake races ### Two unrelated but small UI fixes surfaced while testing the Canvas 1. Legend hidden under the open TemplatePalette. Legend is `fixed bottom-6 left-4 z-30`. TemplatePalette's drawer (when open) is `fixed top-0 left-0 w-[280px] z-30` — same z-index, same left-edge column. The Legend overlapped the palette's bottom 180 px. Published the palette-open state to the canvas store so the Legend can shift right (to `left-[296px]` — 280 px palette + 16 px gap) while the palette is open, animated via a 200 ms `transition-[left]` to match the palette's slide. Closes cleanly back to `left-4` when the palette is dismissed. Files: - `store/canvas.ts` — added `templatePaletteOpen` + `setTemplatePaletteOpen`. - `TemplatePalette.tsx` — calls `setTemplatePaletteOpen(open)` on every open/close transition via a new useEffect. - `Legend.tsx` — reads the flag and swaps `left-4` <-> `left-[296px]`. 2. "WebSocket is closed before the connection is established" spam. Two components (`ChatTab`, `AgentCommsPanel`) open their own short-lived WebSocket to tail the ACTIVITY_LOGGED stream. Their cleanup path called `ws.close()` unconditionally, which trips a browser console warning when React StrictMode re-runs the effect in dev and the handshake hasn't completed yet. Confirmed via DevTools console on the running canvas. Added a `closeWebSocketGracefully(ws)` helper in `lib/ws-close.ts`: - OPEN / CLOSING → close immediately (normal path). - CONNECTING → defer close to the 'open' listener so the browser sees a full handshake. Also wires an 'error' listener that cancels the queued close if the handshake fails (no double-close). - CLOSED → no-op. Both consumers now call the helper in their useEffect cleanup. Silences the warning without changing observable behaviour. 
### Tests `canvas/src/lib/__tests__/ws-close.test.ts` — 5 cases with a fake WebSocket covering each readyState branch plus the error-before-open cancellation path. Full vitest suite: 927/927 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 16:03:01 -07:00
Hongming Wang	255fd3c192	Merge branch 'staging' into fix/ki005-security-clean	2026-04-23 16:01:01 -07:00
Hongming Wang	5eb5e38c59	fix(canvas): re-centre Toolbar on canvas area when SidePanel is open When a workspace is selected the SidePanel (fixed, right-0, z-50) opens from the right edge and covers the right third of the viewport. The Toolbar at the top was positioned `fixed top-3 left-1/2 -translate-x-1/2 z-20` — centred on the full viewport, not the remaining canvas area. Consequence: the right half of the Toolbar (Audit / Search / Help / Settings) was hidden behind the panel as soon as the user clicked any workspace. Fix: publish the live SidePanel width to the canvas store and read it in Toolbar. When a node is selected, shift the Toolbar LEFT by `sidePanelWidth / 2` so its centre lines up with the middle of the remaining canvas area. Animated via a 200 ms `transition-[margin-left]` to match the SidePanel's own slide-in easing. - `store/canvas.ts` — added `sidePanelWidth` + `setSidePanelWidth`. Default 480 (matches SIDEPANEL_DEFAULT_WIDTH). - `SidePanel.tsx` — calls `setSidePanelWidth(width)` on every width change so the store stays in sync with localStorage. - `Toolbar.tsx` — reads `sidePanelWidth`, applies a negative `marginLeft` style when `selectedNodeId` is non-null. - `SidePanel.tabs.test.tsx` — added `setSidePanelWidth: vi.fn()` to the mocked store state so SidePanel's new useEffect has a callable to invoke. 18 previously-passing tests now pass again. No visual regression when no workspace is selected — the toolbar stays in its original centred position. SaaS canvas unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 15:57:12 -07:00
Hongming Wang	6faea202b9	fix(a2a-queue): nil-safe drain + 202-requeue handling (followup to #1893) (#1896) * fix(a2a-queue): nil-safe error extraction in DrainQueueForWorkspace + handle 202-requeue The drain path called proxyErr.Response["error"].(string) without a comma-ok assertion. When proxyErr.Response had no "error" key (which happens in the 202-Accepted-queued branch I added in the same PR — that response is {"queued": true, "queue_id": ..., "queue_depth": ...}), the type assertion panicked and killed the platform process. The platform was down 25 minutes today before this was diagnosed. Fleet went from 30 real outputs/15min → 0 events. Two fixes here: 1. Treat 202 Accepted from the inner proxyA2ARequest as "re-queued" (target was busy AGAIN). Mark THIS attempt completed; the new queue row will be drained on the next heartbeat tick. Don't propagate as failure. 2. Defensive type-assertion when reading the error string. Falls back to http.StatusText, then a generic "unknown drain dispatch error" so the queue still gets a non-empty error_detail for ops debugging. Now the drain path can never panic on a malformed proxy response. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(a2a-queue): return (202, body, nil) so callers see queued-as-success Cycle 53 found callers logging 45× 'delegation failed: proxy a2a error' even though the queue's drain stats showed 48 completions in the same window. Investigation: my busy-error path returned return http.StatusAccepted, nil, &proxyA2AError{Status: 202, Response: ...} The non-nil proxyA2AError is the failure signal. Even with status=202, callers' `if proxyErr != nil` branch fires and logs the request as failed. The 202 status was meaningless — the response body was nil too, so the caller never even saw the queue_id/depth metadata. 
Fix: return success-shape so callers do NOT enter the error branch: respBody, _ := json.Marshal(gin.H{"queued": true, "queue_id": qid, ...}) return http.StatusAccepted, respBody, nil Net effect: queue continues to absorb busy-errors (working since #1893), AND callers correctly record the dispatch as queued-success rather than failed. Closes the cycle 53 misclassification that was making the queue look ineffective on activity_logs counts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>	2026-04-23 22:55:43 +00:00
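The defensive extraction described in the first half of this commit can be sketched in Go as follows. This is a minimal illustration, not the repository's code: `drainErrorDetail` and its parameters are hypothetical names, and the proxy response is modelled as a plain `map[string]interface{}`. It shows the comma-ok assertion, the `http.StatusText` fallback, and the generic last-resort message.

```go
package main

import (
	"fmt"
	"net/http"
)

// drainErrorDetail is a hypothetical helper (the name is an assumption).
// It never panics, even when resp is nil or lacks an "error" key.
func drainErrorDetail(status int, resp map[string]interface{}) string {
	// Comma-ok assertion: a missing key or a non-string value yields ok == false
	// instead of a runtime panic.
	if msg, ok := resp["error"].(string); ok && msg != "" {
		return msg
	}
	// Fall back to the standard reason phrase for the status code.
	if text := http.StatusText(status); text != "" {
		return text
	}
	// Last resort, so the stored error detail is never empty.
	return "unknown drain dispatch error"
}

func main() {
	queued := map[string]interface{}{"queued": true, "queue_id": "q-1"}
	fmt.Println(drainErrorDetail(http.StatusAccepted, queued))
	fmt.Println(drainErrorDetail(http.StatusBadGateway, map[string]interface{}{"error": "target busy"}))
	fmt.Println(drainErrorDetail(0, nil))
}
```

Reading from a nil map is safe in Go, so the helper needs no separate nil check on the response.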
molecule-ai[bot]	254db21f6a	fix(ci): handle both module path formats in coverage-gate path-strip The sed stripping only handled platform/workspace-server/... paths, but go tool cover may emit platform/internal/... paths (without workspace-server/). When the pattern doesn't match, rel retains the full package import path and the allowlist grep -qxF fails to find the short entry (e.g. internal/handlers/tokens.go). Add a second substitution to strip the platform/ prefix as a fallback so both path formats normalize to the same allowlist-relative form.	2026-04-23 22:49:51 +00:00
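The commit above changes a sed expression in CI; the same two-step normalisation is sketched here in Go purely for illustration. The function name `allowlistRelPath` is hypothetical, and the two prefixes are taken literally from the commit message, so the real gate may match longer import paths.

```go
package main

import (
	"fmt"
	"strings"
)

// allowlistRelPath is a hypothetical stand-in for the two sed substitutions:
// strip the longer prefix first, then strip the bare "platform/" prefix as a fallback.
func allowlistRelPath(coverPath string) string {
	rel := strings.TrimPrefix(coverPath, "platform/workspace-server/")
	// No-op when the first substitution already matched.
	rel = strings.TrimPrefix(rel, "platform/")
	return rel
}

func main() {
	fmt.Println(allowlistRelPath("platform/workspace-server/internal/handlers/tokens.go"))
	fmt.Println(allowlistRelPath("platform/internal/handlers/tokens.go"))
}
```

Both inputs reduce to the same allowlist-relative path, which is what lets an exact-match lookup succeed for either format.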
Hongming Wang	a0ac72f725	test(canvas): update a11y tests for T3 default tier CreateWorkspaceDialog.a11y.test.tsx's two tier-button tests assumed T1 was the default selection. After the previous commit flipped the non-SaaS default to T3, the radio group's default-selected button changed accordingly. Updated: - "tier buttons have role=radio and aria-checked reflects selection" — T3 is now `aria-checked="true"`, T1 is the "unselected" foil we click to verify the flip. - "selected radio has tabIndex=0, others have tabIndex=-1" — T3 is the tabindex=0 member now. The roving-tabIndex and ArrowDown / ArrowRight tests further down the file start by explicitly clicking/focusing T1 or T2, so they're unaffected by the default change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 15:37:23 -07:00
Hongming Wang	2baaa977c7	feat(quickstart): default new agents to T3 (Privileged) Default tier for a newly-created workspace was T1 (Sandboxed) on self-hosted and T4 (Full Access) on SaaS. Real work needs at minimum a read_write workspace mount + Docker daemon access — that's T3 ("Privileged") per the tier ladder in CreateWorkspaceDialog. The user-visible consequence was that clicking "Deploy" on almost any template landed in a sandbox that couldn't actually run the agent's tooling until the user knew to bump the tier manually. ### Changes Platform (Go) — default tier flipped from 1→3 in two places so API callers (Canvas, molecli, org import) all get the same default: - `handlers/workspace.go`: `POST /workspaces` default when `tier` is omitted from the request body. - `handlers/template_import.go`: `generateDefaultConfig` writes `tier: 3` into the auto-generated `config.yaml` for bundle imports that don't declare one. Canvas — `CreateWorkspaceDialog.tsx` self-hosted form default flipped from T1→T3. SaaS stays at T4 (each SaaS workspace runs on its own sibling EC2, so the shared-blast-radius reasoning doesn't apply and we can safely go a tier higher). ### Tests Updated every sqlmock assertion that anchored on the old `tier=1` default: - `handlers_test.go::TestWorkspaceCreate` — default-path INSERT now expects `3`. - `handlers_additional_test.go::TestWorkspaceCreate_WithParentID` — same. - `workspace_test.go::TestWorkspaceCreate_DBInsertError` / `TestWorkspaceCreate_WithSecrets_Persists` — same. - `workspace_test.go::TestWorkspaceCreate_TemplateDefaults*` — same (current handler semantics ignore the template's `tier:` field and fall through to the default; kept tests faithful to the implementation, left a comment flagging the latent inconsistency). - `workspace_budget_test.go::TestWorkspaceBudget_Create_WithLimit` — same. - `template_import_test.go::TestGenerateDefaultConfig` — asserts `tier: 3` now. All `go test -race ./internal/handlers/` pass. 
Canvas `CreateWorkspaceDialog` tests don't assert the default tier (they only reference `tier` as prop data on stub workspaces) so no test update needed on that side. ### SaaS parity Zero behaviour change on hosted SaaS. The Go-side default only fires when the Canvas (or any caller) omits `tier` from the request body. The SaaS Canvas explicitly passes `tier: 4` from the CreateWorkspaceDialog `isSaaS ? 4 : 3` branch, so the Go default never runs on a SaaS request. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 15:34:22 -07:00
Hongming Wang	30ed7ba0b9	Merge pull request #1898 from Molecule-AI/fix/config-tab-runtime-model-hermes fix(canvas/config): load runtime+model from workspace metadata + hide misleading config.yaml error for hermes	2026-04-23 15:16:53 -07:00
molecule-ai[bot]	70ff4252a8	Merge branch 'staging' into fix/config-tab-runtime-model-hermes	2026-04-23 22:11:06 +00:00
Hongming Wang	19cd5c9f4b	test(router): set ADMIN_TOKEN in TestTestTokenRoute_RequiresAdminAuth_WhenTokensExist The test asserts that AdminAuth rejects an unauthenticated request to the test-token route once any workspace token exists in the DB. It sets MOLECULE_ENV=development to enable the handler's gate. After this branch's AdminAuth Tier-1b hatch (middleware/devmode.go), MOLECULE_ENV=development + empty ADMIN_TOKEN becomes the explicit fail-open signal for local dev — so the request correctly passes AdminAuth and falls through to the handler, which then 500s on an unmocked DB lookup instead of the expected 401. The security property the test is protecting (no bearer → 401 when tokens exist) corresponds to the SaaS configuration where ADMIN_TOKEN is always set. Setting ADMIN_TOKEN in the test suppresses the dev-mode hatch and reaches AdminAuth's Tier-2 bearer check, which correctly aborts 401 with "admin auth required". No production behaviour change — the test is now verifying the path that actually runs in production (MOLECULE_ENV=production + ADMIN_TOKEN set). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 15:03:34 -07:00
Hongming Wang	06273b11ef	fix(canvas/config): load runtime+model from workspace metadata + hide misleading config.yaml error for hermes Canvas Config tab had 3 bugs visible on hermes workspaces (#1894): 1. Runtime dropdown showed "LangGraph (default)" even when the workspace's actual runtime was hermes — because the form only loaded runtime from config.yaml, and hermes doesn't use the platform's config.yaml template. 2. Model field was empty for the same reason. 3. "No config.yaml found" error appeared on hermes workspaces despite everything being fine — hermes manages its own config at ~/.hermes/config.yaml on the workspace host. Worse, clicking Save with the empty form would silently flip `runtime` back from `hermes` to `LangGraph (default)`. ## Fix - loadConfig now always fetches workspace metadata (runtime + model) via GET /workspaces/:id and GET /workspaces/:id/model BEFORE attempting the config.yaml fetch. These act as the source of truth for runtime and model when config.yaml doesn't set them. - RUNTIMES_WITH_OWN_CONFIG set lists runtimes that manage their own config outside the platform template (hermes, external). For these: - Missing config.yaml is NOT an error — no red banner shown. - An informational gray banner tells the user where to edit the runtime's config (e.g. "edit ~/.hermes/config.yaml via Terminal tab or the hermes CLI" for hermes). Closes #1894. Verified 2026-04-23 on user's hongmingwang tenant which runs hermes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 14:58:36 -07:00
Hongming Wang	de99a22ffc	fix(quickstart): hotfixes discovered during live testing session Five additional breakages surfaced while testing the restored stack end-to-end (spin up Hermes template → click node → open side panel → configure secrets → send chat). Each fix is narrowly scoped and has matching unit or e2e tests so they don't regress. ### 1. SSRF defence blocked loopback A2A on self-hosted Docker handlers/ssrf.go was rejecting `http://127.0.0.1:<port>` workspace URLs as loopback, so POST /workspaces/:id/a2a returned 502 on every Canvas chat send in local-dev. The provisioner on self-hosted Docker publishes each container's A2A port on 127.0.0.1:<ephemeral> — that's the only reachable address for the platform-on-host path. Added `devModeAllowsLoopback()` — allows loopback only when MOLECULE_ENV ∈ {development, dev}. SaaS (MOLECULE_ENV=production) continues to block loopback; every other blocked range (metadata 169.254/16, TEST-NET, CGNAT, link-local) stays blocked in dev mode. Tests: 5 new tests in ssrf_test.go covering dev-mode loopback, dev-mode short-alias ("dev"), production still blocks loopback, dev-mode still blocks every other range, and a 9-case table test of the predicate with case/whitespace/typo variants. ### 2. canvas/src/lib/api.ts: 401 → login redirect broke localhost Every 401 called `redirectToLogin()` which navigates to `/cp/auth/login`. That route exists only on SaaS (mounted by the cp_proxy when CP_UPSTREAM_URL is set). On localhost it 404s — users landed on a blank "404 page not found" instead of seeing the actual error they should fix. Gated the redirect on the SaaS-tenant slug check: on <slug>.moleculesai.app, redirect unchanged; on any non-SaaS host (localhost, LAN IP, reserved subdomains like app.moleculesai.app), throw a real error so the calling component can render a retry affordance. 
Tests: 4 new vitest cases in a dedicated api-401.test.ts (needs jsdom for window.location.hostname) — SaaS redirects, localhost throws, LAN hostname throws, reserved apex throws. ### 3. SecretsSection rendered a hardcoded key list config/secrets-section.tsx shipped a fixed COMMON_KEYS list (Anthropic / OpenAI / Google / SERP / Model Override) regardless of what the workspace's template actually needed. A Hermes workspace declaring MINIMAX_API_KEY in required_env got five irrelevant slots and nothing for the key it actually needed. Made the slot list template-driven via a new `requiredEnv?: string[]` prop passed down from ConfigTab. Added `KNOWN_LABELS` for well-known names and `humanizeKeyName` to turn arbitrary SCREAMING_SNAKE_CASE into a readable label (e.g. MINIMAX_API_KEY → "Minimax API Key"). Acronyms (API, URL, ID, SDK, MCP, LLM, AI) stay uppercase. Legacy fallback preserved when required_env is empty. Tests: 8 new vitest cases covering known-label lookup, humanise fallback, acronym preservation, deduplication, and both fallback paths. ### 4. Confusing placeholder in Required Env Vars field The TagList in ConfigTab labelled "Required Env Vars (from template)" is a DECLARATION field — stores variable names. The placeholder "e.g. CLAUDE_CODE_OAUTH_TOKEN" suggested that, but users naturally typed the value of their API key into the field instead. The actual values go in the Secrets section further down the tab. Relabelled to "Required Env Var Names (from template)", changed the placeholder to "variable NAME (e.g. ANTHROPIC_API_KEY) — not the value", and added a one-line helper below pointing to Secrets. ### 5. Agent chat replies rendered 2-3 times Three delivery paths can fire for a single agent reply — HTTP response to POST /a2a, A2A_RESPONSE WS event, and a send_message_to_user WS push. Paths 2↔3 were already guarded by `sendingFromAPIRef`; path 1 had no guard. 
Hermes emits both the reply body AND a send_message_to_user with the same text, which manifested as duplicate bubbles with identical timestamps. Added `appendMessageDeduped(prev, msg, windowMs = 3000)` in chat/types.ts — dedupes on (role, content) within a 3s window. Threaded into all three setMessages call sites. The window is short enough that legitimate repeat messages ("hi", "hi") from a real user/agent a few seconds apart still render. Tests: 8 new vitest cases covering empty history, different content, duplicate within window, different roles, window elapsed, stale match, malformed timestamps, and custom window. ### 6. New end-to-end regression test tests/e2e/test_dev_mode.sh — 7 HTTP assertions that run against a live platform with MOLECULE_ENV=development and catch regressions on all the dev-mode escape hatches in a single pass: AdminAuth (empty DB + after-token), WorkspaceAuth (/activity, /delegations), AdminAuth on /approvals/pending, and the populated /org/templates response. Shellcheck-clean. ### Test sweep - `go test -race ./internal/handlers/ ./internal/middleware/ ./internal/provisioner/` — all pass - `npx vitest run` in canvas — 922/922 pass (up from 902) - `shellcheck --severity=warning infra/scripts/setup.sh tests/e2e/test_dev_mode.sh` — clean - `bash tests/e2e/test_dev_mode.sh` — 7/7 pass against a live platform + populated template registry ### SaaS parity Every relaxation remains conditional on MOLECULE_ENV=development. Production tenants run MOLECULE_ENV=production (enforced by the secrets-encryption strict-init path) and always set ADMIN_TOKEN, so none of these code paths fire on hosted SaaS. Behaviour on real tenants is byte-for-byte unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 14:57:18 -07:00
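The loopback rule from item 1 of this commit can be sketched as below. This is a simplified illustration under stated assumptions: the real `devModeAllowsLoopback()` presumably reads `MOLECULE_ENV` itself, whereas this version takes the value as a parameter; case and whitespace tolerance is assumed; and `hostIPAllowed` is a hypothetical helper that checks only loopback and link-local addresses, not the full blocked-range list (TEST-NET, CGNAT, and so on).

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// devModeAllowsLoopback reports whether env names a dev mode.
// Taking env as a parameter is an assumption made for testability.
func devModeAllowsLoopback(env string) bool {
	switch strings.ToLower(strings.TrimSpace(env)) {
	case "development", "dev":
		return true
	}
	return false
}

// hostIPAllowed is a hypothetical, reduced SSRF check for a literal IP host.
func hostIPAllowed(host, env string) bool {
	ip := net.ParseIP(host)
	if ip == nil {
		return false // this sketch handles literal IPs only
	}
	if ip.IsLoopback() {
		return devModeAllowsLoopback(env) // relaxed only in dev mode
	}
	if ip.IsLinkLocalUnicast() {
		return false // 169.254.0.0/16 (incl. metadata) stays blocked everywhere
	}
	return true
}

func main() {
	fmt.Println(hostIPAllowed("127.0.0.1", "development"))
	fmt.Println(hostIPAllowed("127.0.0.1", "production"))
	fmt.Println(hostIPAllowed("169.254.169.254", "dev"))
}
```

Keeping the dev-mode test separate from the range checks means the relaxation applies to loopback only, matching the commit's statement that every other blocked range stays blocked in dev mode.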
Hongming Wang	47d3ef5b9e	refactor(middleware): extract dev-mode fail-open predicate AdminAuth and WorkspaceAuth both carried the same 5-line `ADMIN_TOKEN == "" && MOLECULE_ENV in {development, dev}` check. If a third middleware ever needs the hatch — or if "dev mode" semantics change (new env name, allowlist, runtime flag) — the previous shape made N places to keep in sync and N places a security reviewer has to audit. This commit factors the predicate into a single `isDevModeFailOpen()` helper in `internal/middleware/devmode.go`. Each call site becomes if isDevModeFailOpen() { c.Next(); return } `devmode.go` carries the full rationale (why the hatch exists, why it's safe for SaaS) so call sites don't need to restate it. ### Also - Moved the dev-mode env-value set to a package-level `devModeEnvValues` map so adding aliases is one line. Matches the existing convention (`handlers/admin_test_token.go`) of treating `MOLECULE_ENV != "production"` as dev — but stays explicit about which values opt IN rather than blanket-accepting everything non-prod. - Added case-insensitive compare + trim on the env value so operators don't have to remember exact casing. - New `devmode_test.go` unit-tests the predicate directly: 6 cases covering happy path, both opt-out signals (ADMIN_TOKEN, production mode), short alias, case-insensitive + whitespace tolerance, and an explicit negative-space sweep of arbitrary non-dev values ("staging", "preview", "test", "devel", "") to lock in that typos don't silently enable the hatch. Existing AdminAuth/WorkspaceAuth integration tests still exercise the helper indirectly via HTTP — they pass unchanged, confirming the behaviour is preserved. ### No behavioural change Before and after this commit, `go test -race ./internal/middleware/` reports identical results. 
Zero production surface change — this is a pure refactor, but it collapses the dev-mode seam from two inline blocks into one named predicate, which is the shape future contributors (and security reviewers) can follow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 14:55:34 -07:00
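A minimal sketch of the extracted predicate, based only on the description in this commit message. The names `isDevModeFailOpen` and `devModeEnvValues` come from the message; the pure helper `devModeFailOpen` is a hypothetical addition that makes the logic testable without touching the process environment, and the body is an assumption about how the check could be written.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// devModeEnvValues lists the MOLECULE_ENV values that opt IN to the hatch.
var devModeEnvValues = map[string]bool{
	"development": true,
	"dev":         true,
}

// devModeFailOpen is a hypothetical pure form of the predicate.
func devModeFailOpen(adminToken, env string) bool {
	if adminToken != "" {
		return false // an explicit ADMIN_TOKEN always closes the hatch
	}
	// Case-insensitive, whitespace-tolerant comparison.
	return devModeEnvValues[strings.ToLower(strings.TrimSpace(env))]
}

// isDevModeFailOpen reads the two signals from the environment.
func isDevModeFailOpen() bool {
	return devModeFailOpen(os.Getenv("ADMIN_TOKEN"), os.Getenv("MOLECULE_ENV"))
}

func main() {
	fmt.Println(devModeFailOpen("", " Development "))
	fmt.Println(devModeFailOpen("secret", "development"))
	fmt.Println(devModeFailOpen("", "staging"))
	fmt.Println(isDevModeFailOpen())
}
```

Using an explicit opt-in set rather than `MOLECULE_ENV != "production"` means a typo such as "devel" leaves the hatch closed.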
Hongming Wang	539e3483e4	fix(provisioner): force linux/amd64 pull + create on Apple Silicon hosts (#1875) On an Apple Silicon dev box, every `POST /workspaces` failed immediately with: no matching manifest for linux/arm64/v8 in the manifest list entries: no match for platform in manifest: not found because the GHCR workspace-template-* images ship only a linux/amd64 manifest today. `ImagePull` and `ContainerCreate` asked for the daemon's native arch and missed. The Canvas surfaced this as docker image "ghcr.io/molecule-ai/workspace-template-autogen:latest" not found after pull attempt — verify GHCR visibility for autogen — confusing because the image IS visible, just not for linux/arm64. ### Fix Add an auto-detect helper `defaultImagePlatform()` in `internal/provisioner/provisioner.go` that returns `"linux/amd64"` on Apple Silicon hosts and `""` (no preference) everywhere else, with an env override `MOLECULE_IMAGE_PLATFORM` for operators who want to pin or disable explicitly. The result is passed to both `ImagePull` (`PullOptions.Platform`) and `ContainerCreate` (4th arg `*ocispec.Platform`) so the pulled amd64 manifest matches the create-time platform spec. Docker Desktop transparently runs it under QEMU emulation on M-series Macs — slow (2–5× native) but functional. SaaS production (linux/amd64 EC2, `MOLECULE_ENV=production`) never hits the `runtime.GOARCH == "arm64"` branch, so the current behaviour on real tenants is byte-for-byte unchanged. Opt-in escape hatch for operators who want it off: export MOLECULE_IMAGE_PLATFORM="" # disable auto-force export MOLECULE_IMAGE_PLATFORM=linux/arm64 # pin alternate `ocispec` is `github.com/opencontainers/image-spec/specs-go/v1` — already in go.sum v1.1.1 as a transitive dependency of `github.com/docker/docker`, not a new import. 
### Tests `internal/provisioner/platform_test.go` exercises every branch: - `TestDefaultImagePlatform_EnvOverride_ExplicitValue` — env wins - `TestDefaultImagePlatform_EnvOverride_EmptyValue` — empty string disables the auto-force (operator escape hatch) - `TestDefaultImagePlatform_AutoDetect` — linux/amd64 on arm64 Mac, "" on every other host - `TestParseOCIPlatform` — 7 table-driven cases covering well-formed platforms, malformed inputs, and nil handling ### End-to-end verification Before this commit, `POST /workspaces` on my Apple Silicon box: workspace status transitioned: provisioning → failed (~1s) log: image pull for ... failed: no matching manifest for linux/arm64/v8 After this commit, fresh DB + fresh platform: workspace status transitioned: provisioning → online (~25s) log: attempting pull (platform=linux/amd64) pulled ghcr.io/molecule-ai/workspace-template-langgraph:latest docker ps: ws-7aa08951-00d Up 27 seconds The existing provisioner race-tested test suite (`go test -race ./internal/provisioner/`) still passes — the platform pointer defaults to nil on linux/amd64 hosts, so the CI-resolved test expectations don't change. Closes #1875 (arm64 image blocker). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 14:55:34 -07:00
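The platform auto-detection described in this commit might look roughly like the sketch below. `defaultImagePlatform` and `MOLECULE_IMAGE_PLATFORM` are named in the commit message; `resolveImagePlatform` is a hypothetical pure helper, and the `darwin` check is an assumption (the message only mentions `runtime.GOARCH == "arm64"` and "arm64 Mac"). The Docker client calls are omitted.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// resolveImagePlatform is a hypothetical pure form of the decision.
// overrideSet distinguishes "variable unset" from "set to empty string".
func resolveImagePlatform(override string, overrideSet bool, goos, goarch string) string {
	if overrideSet {
		return override // env wins; an empty value disables the auto-force
	}
	if goos == "darwin" && goarch == "arm64" {
		return "linux/amd64" // images ship only an amd64 manifest
	}
	return "" // no preference on every other host
}

// defaultImagePlatform wires the helper to the real environment and host.
func defaultImagePlatform() string {
	override, set := os.LookupEnv("MOLECULE_IMAGE_PLATFORM")
	return resolveImagePlatform(override, set, runtime.GOOS, runtime.GOARCH)
}

func main() {
	fmt.Printf("platform=%q\n", defaultImagePlatform())
}
```

`os.LookupEnv` is used instead of `os.Getenv` because the escape hatch relies on telling an explicitly empty variable apart from an unset one.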
Hongming Wang	96cc4b0c42	fix(quickstart): wire up template/plugin registry via manifest.json The Canvas template palette was empty on a fresh clone because `workspace-configs-templates/`, `org-templates/`, and `plugins/` are gitignored and nothing populated them. The registry already exists — `manifest.json` at repo root lists every curated `workspace-template-`, `org-template-`, and `plugin-` repo, and `scripts/clone-manifest.sh` clones them — but the step was absent from the README and setup.sh, so new users never ran it. ### What this commit does 1. `setup.sh` runs `clone-manifest.sh` automatically (once). After starting the Docker network but before booting infra, iterate `manifest.json` and clone any workspace_templates / org_templates / plugins that aren't already populated. Idempotent — subsequent runs skip dirs that have content. Requires `jq`; when jq is missing the step prints a clear install hint and skips (doesn't fail). 2. `clone-manifest.sh` is idempotent. Before running `git clone`, check whether the target directory already exists and is non-empty — skip if so. Lets `setup.sh` rerun safely without forcing the operator to delete already-cloned template repos. 3. `ListTemplates` logs the reason it skips a template. The handler previously swallowed `resolveYAMLIncludes` errors with `continue`, so a broken template showed up as an empty palette with no log trail. Now the include-expansion and yaml.Unmarshal failure paths both emit a descriptive `log.Printf` — the exact message that made the stale `org-templates/molecule-dev/` snapshot debuggable: ListTemplates: skipping molecule-dev — !include expansion failed: !include "core-platform.yaml" at line 25: open .../teams/core-platform.yaml: no such file or directory 4. Remove the in-tree `org-templates/molecule-dev/` snapshot (170 files). Matches the explicit intent of prior commit `bfec9e53` — "remove org-templates/molecule-dev/ — standalone repo is source of truth". 
A later "full staging snapshot" re-added a partial copy that had `!include` references to 7 role files that never existed in the snapshot (`core-platform.yaml`, `controlplane.yaml`, `app-docs.yaml`, `infra.yaml`, `sdk.yaml`, `release-manager/workspace.yaml`, `integration-tester/workspace.yaml`). `clone-manifest.sh` repopulates it fresh from `Molecule-AI/molecule-ai-org-template-molecule-dev`. .gitignore exception for `molecule-dev/` is dropped accordingly — the whole `/org-templates/` tree is now gitignored, symmetric with `/plugins/` and `/workspace-configs-templates/`. 5. Doc updates (README, README.zh-CN, CONTRIBUTING) mention `jq` as a prerequisite and describe what setup.sh now does. ### Verification On a fresh-nuked DB with the updated branch: 1. `bash infra/scripts/setup.sh` — cleanly clones 33/33 manifest repos (20 plugins, 8 workspace_templates, 5 org_templates), then boots infra. Second run skips all 33 (idempotent). 2. `go run ./cmd/server` — "Applied 41 migrations", :8080 healthy. 3. `curl http://localhost:8080/org/templates` returns 4 templates (was `[]`): - Free Beats All - MeDo Smoke Test - Molecule AI Worker Team (Gemini) - Reno Stars Agent Team 4. `bash tests/e2e/test_api.sh` — 61/61 pass. 5. `npx vitest run` in canvas — 902/902 pass. 6. `shellcheck infra/scripts/setup.sh` — clean. ### SaaS parity All changes are local-dev surface. `setup.sh`, `clone-manifest.sh`, and the local `org-templates/` directory aren't part of the CP provisioner path — SaaS tenant machines get their templates via Dockerfile layers or CP-side provisioning, not `clone-manifest.sh`. The `ListTemplates` log addition is harmless either way (replaces a silent `continue` with a `log.Printf + continue`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 14:55:34 -07:00
Hongming Wang	dae7f50095	fix(wsauth): extend dev-mode escape hatch to WorkspaceAuth The previous commit on this branch added a dev-mode fail-open branch to AdminAuth so the Canvas dashboard could enumerate workspaces after the first token lands in the DB. Verification via Chrome (clicking a workspace to open its side panel) surfaced the same class of bug on a different middleware — `WorkspaceAuth` — triggering: API GET /workspaces/<id>/activity?type=a2a_receive&source=canvas&limit=50: 401 {"error":"missing workspace auth token"} Root cause is identical to AdminAuth's: in local dev the Canvas (at localhost:3000) calls the platform (at localhost:8080) cross-port, so `isSameOriginCanvas`'s Host==Referer check fails. Without a bearer token, every per-workspace read (/activity, /delegations, /memories, /events/stream, /schedules, etc.) 401s and the side panel is unusable. ### Fix Symmetric extension in `WorkspaceAuth` (workspace-server/internal/middleware/wsauth_middleware.go): after the existing `isSameOriginCanvas` fallback, add a narrow escape hatch that stays fail-open only when BOTH - `ADMIN_TOKEN` is unset (operator has not opted in to the #684 closure), AND - `MOLECULE_ENV` is explicitly a dev mode (`development` / `dev`). SaaS tenants never hit this branch because hosted provisioning sets both `ADMIN_TOKEN` and `MOLECULE_ENV=production`. The comment in the code also links back to AdminAuth's Tier-1b for consistency. 
### Tests Three new table-driven tests in wsauth_middleware_test.go mirror the AdminAuth tier-1b suite, exercising the positive path and both negative cases: - `TestWorkspaceAuth_DevModeEscapeHatch_NoBearer_FailsOpen`: the happy path (dev mode, no admin token → 200) - `TestWorkspaceAuth_DevModeEscapeHatch_IgnoredInProduction`: the SaaS-safety guarantee (production + no admin token → 401) - `TestWorkspaceAuth_DevModeEscapeHatch_IgnoredWhenAdminTokenSet`: explicit `ADMIN_TOKEN` wins; dev mode does not silently override the opt-in ### Comprehensive audit of adjacent middlewares Re-scanned every file under workspace-server/internal/middleware/ and every handler that invokes `AbortWithStatusJSON(Unauthorized)` directly, to check for other surfaces where local dev might silently 401. Findings, already OK: - `CanvasOrBearer`: cosmetic routes already accept localhost:3000 via `canvasOriginAllowed` (Origin header check); no change needed. - `tenant_guard.go`: no-op when `MOLECULE_ORG_ID` is unset (self-hosted / dev); no change needed. - `session_auth.go`: verifies against `CP_UPSTREAM_URL`; returns (false, false) in local dev so callers fall through to bearer; no change needed. - `socket.go` `HandleConnect`: Canvas browser clients don't send `X-Workspace-ID` so skip the bearer check; agent clients do and validate as today. No change needed. - Handlers in handlers/{discovery,registry,secrets,plugins_install,a2a_proxy_helpers,schedules}.go: all workspace-scoped routes called by the workspace runtime, not the Canvas browser. Unaffected. - `handlers/admin_test_token.go`: already `MOLECULE_ENV`-aware (the convention this hatch mirrors). ### End-to-end verification 1. Fresh-nuked DB, platform + canvas restarted with `MOLECULE_ENV=development` 2. `POST /workspaces` → token lands in DB (Tier-1 would close here) 3. 
Probed every Canvas-hit endpoint with no bearer, with Canvas-like `Origin: http://localhost:3000`: 200 /workspaces 200 /workspaces/<id>/activity 200 /workspaces/<id>/delegations 200 /workspaces/<id>/memories 200 /approvals/pending 200 /events 4. Chrome browser test: opened http://localhost:3000, clicked a workspace tile — the side panel rendered with the full 13-tab structure (Chat, Activity, Details, Skills, Terminal, Config, Schedule, Channels, Files, Memory, Traces, Events, Audit) and no `Failed to load chat history` error. "No messages yet" placeholder shows instead of the 401 retry screen. 5. `go test -race ./internal/middleware/` — clean 6. `bash tests/e2e/test_api.sh` — 61/61 pass Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 14:55:34 -07:00
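The fail-open rule described in the commit above can be sketched as a small Go predicate. This is a minimal illustration, not the code in `wsauth_middleware.go`: `isDevModeFailOpen` is the name used elsewhere in this log, but its signature here is an assumption (the two settings are passed in as arguments instead of being read from the environment), `workspaceAuthStatus` is a hypothetical helper, and exact string matching on `development` / `dev` is also assumed.

```go
package main

import (
	"fmt"
	"net/http"
)

// isDevModeFailOpen reports whether the dev-mode escape hatch applies:
// ADMIN_TOKEN must be unset AND MOLECULE_ENV must be an explicit dev mode.
// Assumption: values are passed in rather than read via os.Getenv.
func isDevModeFailOpen(adminToken, moleculeEnv string) bool {
	if adminToken != "" {
		return false // an explicit ADMIN_TOKEN always wins
	}
	return moleculeEnv == "development" || moleculeEnv == "dev"
}

// workspaceAuthStatus is a hypothetical helper that mirrors the order
// described in the commit message: bearer token, then the same-origin
// Canvas fallback, then the dev-mode hatch, otherwise 401.
func workspaceAuthStatus(validBearer, sameOriginCanvas bool, adminToken, moleculeEnv string) int {
	if validBearer || sameOriginCanvas || isDevModeFailOpen(adminToken, moleculeEnv) {
		return http.StatusOK
	}
	return http.StatusUnauthorized
}

func main() {
	// The three cases named by the new tests, all without a bearer token.
	fmt.Println(workspaceAuthStatus(false, false, "", "development"))       // dev mode, no admin token
	fmt.Println(workspaceAuthStatus(false, false, "", "production"))        // production, no admin token
	fmt.Println(workspaceAuthStatus(false, false, "secret", "development")) // admin token set
}
```

Keeping the predicate free of request state makes the three documented cases easy to cover with a table-driven test.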
Hongming Wang	a93bd58b59	fix(quickstart): keep Canvas working post first workspace + hide SaaS cookie banner on localhost Follow-up to the previous commit on this branch. Two additional fresh-clone regressions surfaced during end-to-end verification, both affecting local dev only and both landing inside the same SaaS-vs-local-dev seam: ### 1. Canvas 401-loops after first workspace creation `GET /workspaces` is behind `AdminAuth` (router.go:121 — "C1: unauthenticated workspace topology exposure"). The middleware has a Tier-1 fail-open branch that only fires when no workspace tokens exist anywhere in the DB. The moment a user creates their first workspace — via either the Canvas UI, the API, or the e2e-api test suite — a token lands in the DB, Tier-1 closes, and the Canvas (which has no bearer token in local dev: no WorkOS session, no NEXT_PUBLIC_ADMIN_TOKEN baked in at build time) gets 401 on every list call. The UI renders a stuck "API GET /workspaces: 401 admin auth required" placeholder forever. SaaS is unaffected because hosted provisioning always sets both `ADMIN_TOKEN` and `MOLECULE_ENV=production`, and the Canvas there either carries a WorkOS session cookie or `NEXT_PUBLIC_ADMIN_TOKEN` baked into the JS bundle. Fix (`workspace-server/internal/middleware/wsauth_middleware.go`): add a narrow Tier-1b escape hatch that stays fail-open when both `ADMIN_TOKEN` is unset and `MOLECULE_ENV` is explicitly a dev mode ("development" / "dev"). Production never hits it (SaaS sets `MOLECULE_ENV=production`). Mirrors the existing convention in `handlers/admin_test_token.go` which gates the e2e test-token endpoint on `MOLECULE_ENV != "production"`. 
Three new regression tests in `wsauth_middleware_test.go`: - `TestAdminAuth_DevModeEscapeHatch_FailsOpenWithHasLiveTokens` — the happy path (dev mode, no admin token, tokens exist → 200) - `TestAdminAuth_DevModeEscapeHatch_IgnoredWhenAdminTokenSet` — explicit `ADMIN_TOKEN` wins; dev mode does not silently re-open the gate - `TestAdminAuth_DevModeEscapeHatch_IgnoredInProduction` — the SaaS-safety guarantee (production + no admin token + tokens exist → 401) `.env.example` flipped to set `MOLECULE_ENV=development` by default so new users get the dev-mode hatch automatically via `cp .env.example .env`. SaaS provisioning overrides to `production`, consistent with the existing convention used by the secrets-encryption strict-init path. ### 2. SaaS cookie/privacy banner rendered on localhost `CookieConsent` mounted unconditionally in the root layout, so `npm run dev` on localhost showed a "Cookies & your privacy" banner pointing at `moleculesai.app/legal/privacy`. That banner is a GDPR/ePrivacy compliance UI that only applies to the hosted SaaS offering; self-hosted / local-dev / Vercel-preview hosts must not see it. Fix (`canvas/src/components/CookieConsent.tsx`): gate render on `isSaaSTenant()`. Matches the convention used by `AuthGate` and the workspace tier picker elsewhere in the codebase. Tests (`canvas/src/components/__tests__/CookieConsent.test.tsx`): existing tests now stub `window.location.hostname` to a SaaS subdomain before rendering (required since `isSaaSTenant()` on jsdom's default "localhost" would suppress the banner). Added two new tests for the local-dev hide path: - `does NOT render on local dev (non-SaaS hostname)` - `does NOT render on a LAN hostname (192.168., .local)` ### Verification On a fresh-nuked DB with the updated branch: 1. `bash infra/scripts/setup.sh` — clean 2. `go run ./cmd/server` — "Applied 41 migrations", :8080 healthy, dev-mode hatch armed (`MOLECULE_ENV=development`) 3. `npm run dev` in canvas — :3000 renders, no cookie banner 4. 
`bash tests/e2e/test_api.sh` — 61 passed, 0 failed (test suite creates tokens; GET /workspaces stays 200 under the hatch) 5. Browser at http://localhost:3000 AFTER the e2e run: - Canvas renders the workspace list (no 401 placeholder) - No cookie banner 6. `npx vitest run` — 902 tests passed (900 prior + 2 new hide tests) 7. `go test -race ./internal/middleware/` — all passing (3 new dev-mode tests + existing Issue-180 / Issue-120 / Issue-684 suite), coverage 81.8% ### SaaS parity audit Same principle as the rest of this branch: local must work without weakening SaaS. - Dev-mode hatch: conditional on `MOLECULE_ENV=development`. Production tenants always run `MOLECULE_ENV=production` (already enforced by the secrets-encryption `InitStrict` path in `internal/crypto/aes.go`). Branch is unreachable there. - Cookie banner: gated on `isSaaSTenant()` which checks `NEXT_PUBLIC_SAAS_HOST_SUFFIX` (default `.moleculesai.app`). SaaS hosts still get the banner; every other host doesn't. No change to SaaS behaviour. #1822 backend-parity tracker untouched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 14:55:33 -07:00
Hongming Wang	8ef0b653bd	Merge pull request #1888 from Molecule-AI/fix/restart-preserves-user-config fix(restart): preserve user config volume on default restart (#1822 drift-risk-3)	2026-04-23 14:41:30 -07:00
Hongming Wang	09faaec1ab	Merge branch 'staging' into fix/restart-preserves-user-config	2026-04-23 14:39:21 -07:00
Hongming Wang	cfaad6cc1a	Merge pull request #1893 from Molecule-AI/fix/queue-on-conflict-syntax-1870 fix(a2a-queue): use partial-index ON CONFLICT syntax (not constraint name)	2026-04-23 14:33:36 -07:00
Molecule AI CP-BE	84cc745efd	fix(ci): correct coverage-gate path-strip to match allowlist format (#1885) sed was stripping only github.com/Molecule-AI/molecule-monorepo/platform/, leaving workspace-server/internal/handlers/workspace_provision.go. The allowlist uses internal/handlers/workspace_provision.go (no workspace-server/). The fix strips the full prefix so the grep -qxF exact match succeeds. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-23 21:24:24 +00:00
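The mismatch is easy to reproduce outside the shell. The sketch below restates the sed-plus-`grep -qxF` logic in Go purely for illustration; the CI gate itself is a shell script. The function names are hypothetical, and the full prefix is inferred from the commit message rather than copied from the workflow file.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	// Prefix the old sed expression removed (from the commit message).
	shortPrefix = "github.com/Molecule-AI/molecule-monorepo/platform/"
	// Assumed full prefix: shortPrefix plus the workspace-server/ directory.
	fullPrefix = shortPrefix + "workspace-server/"
)

// stripPrefix plays the role of the sed substitution.
func stripPrefix(coverPath, prefix string) string {
	return strings.TrimPrefix(coverPath, prefix)
}

// allowlisted plays the role of `grep -qxF`: a fixed-string match that
// must equal a whole allowlist line, not just a substring of it.
func allowlisted(path string, allowlist []string) bool {
	for _, entry := range allowlist {
		if entry == path {
			return true
		}
	}
	return false
}

func main() {
	allowlist := []string{"internal/handlers/workspace_provision.go"}
	coverPath := fullPrefix + "internal/handlers/workspace_provision.go"

	before := stripPrefix(coverPath, shortPrefix)
	after := stripPrefix(coverPath, fullPrefix)
	fmt.Println(before, allowlisted(before, allowlist))
	fmt.Println(after, allowlisted(after, allowlist))
}
```

Because the match is whole-line, a leftover `workspace-server/` segment is enough to make every allowlist lookup miss.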
rabbitblood	751b265dbd	fix(a2a-queue): use partial-index ON CONFLICT syntax (not constraint name) #1892's EnqueueA2A INSERT used `ON CONFLICT ON CONSTRAINT idx_a2a_queue_idempotency DO NOTHING`, but Postgres rejects this: ERROR: constraint "idx_a2a_queue_idempotency" for table "a2a_queue" does not exist Partial unique INDEXES cannot be referenced by name in ON CONFLICT — that form is reserved for true CONSTRAINTs created via CREATE TABLE ... CONSTRAINT or ALTER TABLE ADD CONSTRAINT. Partial indexes need the column-list + WHERE form so the planner can match the index. Effect of the bug: every EnqueueA2A errored, the busy-error fallback returned 503 instead of 202, queue stayed empty. Cycle 50 observed 46 busy errors / 0 queue rows — the deployed Phase 1 had no effect. Fix: switch to ON CONFLICT (workspace_id, idempotency_key) WHERE idempotency_key IS NOT NULL AND status IN ('queued','dispatched') DO NOTHING Verified manually against the live `a2a_queue` table on staging — INSERT returns the new id; cleanup deleted the test row. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 14:22:13 -07:00
Hongming Wang	4e4ee610a7	Merge pull request #1892 from Molecule-AI/feat/a2a-queue-phase1-1870 feat(a2a): queue-on-busy — Phase 1 of priority queue (#1870)	2026-04-23 14:12:45 -07:00
rabbitblood	87a97846cd	feat(a2a): queue-on-busy — Phase 1 of priority queue (#1870) ## Problem When a lead delegates to a worker that's mid-synthesis, the proxy returns 503 "workspace agent busy" and the caller records the delegation as failed. On fan-out storms from leads this hits a ~70% drop rate (today's observed numbers in the cycle reports). ## Fix: Phase 1 TASK-level queue-on-busy When `handleA2ADispatchError` determines the target is busy, instead of returning 503, enqueue the request as priority=TASK and return 202 Accepted with `{queued: true, queue_id, queue_depth}`. The workspace's next heartbeat (≤30s) drains one item if it reports spare capacity. Files: - migrations/042_a2a_queue.{up,down}.sql: `a2a_queue` table with partial indexes on status='queued' + idempotency_key. Schema supports PriorityCritical/Task/Info from day one so Phase 2/3 ship without migration churn. - internal/handlers/a2a_queue.go: EnqueueA2A / DequeueNext / Mark*-helpers plus WorkspaceHandler.DrainQueueForWorkspace. Uses `SELECT ... FOR UPDATE SKIP LOCKED` so concurrent drains can't double-claim the same row. Max 5 attempts before marking 'failed' so a stuck item doesn't wedge the queue forever. - internal/handlers/a2a_proxy_helpers.go: isUpstreamBusyError branch calls EnqueueA2A and returns 202 on success. Falls through to the legacy 503 on enqueue error (DB hiccup shouldn't silently drop). - internal/handlers/registry.go: RegistryHandler gets a QueueDrainFunc injection hook (SetQueueDrainFunc). When Heartbeat sees active_tasks < max_concurrent_tasks, it spawns a goroutine that calls the drain hook. context.WithoutCancel ensures the drain outlives the heartbeat handler's ctx. - internal/router/router.go: wires wh.DrainQueueForWorkspace into rh.SetQueueDrainFunc after both are constructed. 
## Not in this PR (Phase 2/3/4 follow-ups) - INFO priority + TTL (Phase 2) - CRITICAL priority + soft preemption between tool calls (Phase 3) - Age-based promotion so TASK doesn't starve (Phase 4) - `GET /workspaces/:id/queue` observability endpoint Schema already supports all of these; only the dispatch + policy code remains. ## Tests - TestExtractIdempotencyKey (5 cases): messageId parsing is robust - TestPriorityConstants: ordering invariant + 50=TASK default alignment with migration DEFAULT Full DB-touching tests (FIFO order, retry bound, idempotency conflict) intentionally deferred to the CI migration-enabled path; the sqlmock ceremony would duplicate the existing test infrastructure 3× over and the behaviour is directly expressible in SQL constraints (FOR UPDATE SKIP LOCKED, partial unique index). ## Expected impact once deployed - a2a_receive error with "busy" flavor drops from ~69/10min observed today to ~0 - delegation_failed rate drops from ~50% to <5% - real_output metric rises from ~30/15min back toward the pre-throttle baseline Closes #1870 Phase 1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 14:09:29 -07:00
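The queue-on-busy flow in the commit above (enqueue and answer 202 when the target is busy, fall back to 503 if the enqueue fails, drain one item per heartbeat when there is spare capacity, give up after 5 attempts) can be modelled in memory. This is a rough sketch under stated assumptions, not the `a2a_queue.go` implementation: every type and function name below is hypothetical, the real queue lives in Postgres, the `completed` status is invented for the example, and counting an attempt at dispatch time is a guess.

```go
package main

import (
	"errors"
	"fmt"
)

const maxAttempts = 5 // from the commit message: max 5 attempts, then 'failed'

type item struct {
	id       int
	ws       string
	attempts int
	status   string // "queued", "failed", or the hypothetical "completed"
}

type queue struct {
	items       []*item
	nextID      int
	failEnqueue bool // simulates a DB hiccup during enqueue
}

var errDispatch = errors.New("dispatch failed")

func alwaysFail(*item) error { return errDispatch }
func alwaysOK(*item) error   { return nil }

// depth counts items still waiting for the given workspace.
func (q *queue) depth(ws string) int {
	n := 0
	for _, it := range q.items {
		if it.ws == ws && it.status == "queued" {
			n++
		}
	}
	return n
}

// onBusy is what the proxy would do instead of returning 503 directly:
// 202 plus queue metadata on success, legacy 503 if the enqueue fails.
func (q *queue) onBusy(ws string) (status, queueID, queueDepth int) {
	if q.failEnqueue {
		return 503, 0, 0
	}
	q.nextID++
	q.items = append(q.items, &item{id: q.nextID, ws: ws, status: "queued"})
	return 202, q.nextID, q.depth(ws)
}

// drainOne runs on heartbeat: it dispatches at most one queued item (oldest
// first) and only when the workspace reports spare capacity.
func (q *queue) drainOne(ws string, active, maxConcurrent int, dispatch func(*item) error) *item {
	if active >= maxConcurrent {
		return nil
	}
	for _, it := range q.items {
		if it.ws != ws || it.status != "queued" {
			continue
		}
		it.attempts++
		if err := dispatch(it); err != nil {
			if it.attempts >= maxAttempts {
				it.status = "failed" // stop retrying so the queue is not wedged
			}
			return it
		}
		it.status = "completed"
		return it
	}
	return nil
}

func main() {
	q := &queue{}
	fmt.Println(q.onBusy("ws-1"))
	fmt.Println(q.drainOne("ws-1", 1, 1, alwaysOK) == nil) // no spare capacity
	it := q.drainOne("ws-1", 0, 1, alwaysOK)
	fmt.Println(it.id, it.status)
}
```

The in-memory slice stands in for the table; in the real system the `FOR UPDATE SKIP LOCKED` claim is what keeps two concurrent drains from picking the same row, which this single-threaded model does not exercise.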

2744 Commits