Commit Graph

292 Commits

Author SHA1 Message Date
Hongming Wang
9a223afba1 fix(dotenv,socket): review-driven hardening of .env loader + WS poll
Independent code review surfaced three required fixes and one cheap
optional one. All addressed here.

dotenv parser:
- `export FOO=bar` was parsed as key `"export FOO"` (with embedded
  space) and silently os.Setenv'd, so a developer pasting from a
  direnv `.envrc` would get junk vars. Now strips the prefix.
- Quoted values weren't unwrapped: `FOO="hello world"` produced value
  `"hello world"` with literal quotes. Now strips one matched pair of
  surrounding `"` or `'`. Inside a quoted value `#` is part of the
  value, not a comment marker (matches godotenv convention).
- UTF-8 BOM at file start (Windows editors) would have produced a
  first key like U+FEFF + "FOO". Now stripped via TrimPrefix.
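
The parser rules above can be sketched as follows. This is only an illustration: the real loader is Go, and every name here (`parseDotEnv`, `parseValue`) is hypothetical. Treating ` #` as a trailing comment in a bare value is an assumption inferred from the godotenv reference, not something the commit states.

```typescript
// Sketch of the parsing rules listed above. The real loader is Go; this
// TypeScript version and all names in it are illustrative assumptions.

/** Parses .env text into key/value pairs. */
function parseDotEnv(text: string): Map<string, string> {
  const out = new Map<string, string>();
  // Strip a UTF-8 BOM (U+FEFF) so the first key is not polluted by it.
  const body = text.startsWith("\uFEFF") ? text.slice(1) : text;
  for (const rawLine of body.split("\n")) {
    // Drop the trailing "\r" of CRLF files, then surrounding whitespace.
    let line = rawLine.replace(/\r$/, "").trim();
    if (line === "" || line.startsWith("#")) continue;
    // Accept direnv-style `export FOO=bar` by dropping the prefix.
    line = line.replace(/^export\s+/, "");
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // no "=" at all, or an empty key
    const key = line.slice(0, eq).trim();
    out.set(key, parseValue(line.slice(eq + 1).trim()));
  }
  return out;
}

function parseValue(raw: string): string {
  const quote = raw[0];
  if (quote === '"' || quote === "'") {
    const end = raw.indexOf(quote, 1);
    // Matched pair: everything inside is literal, including "#".
    if (end > 0) return raw.slice(1, end);
    // Unterminated quote: fall through and treat it as a bare value.
  }
  // Assumption: in a bare value, " #" starts a trailing comment.
  const hash = raw.indexOf(" #");
  return (hash >= 0 ? raw.slice(0, hash) : raw).trim();
}
```

An unterminated quote keeps its literal quote character, matching the "falls back to bare value" test case listed further down.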

dotenv loader:
- findDotEnv()'s upward walk would happily pick up `~/.env` or a
  sibling-repo `.env` if the binary was run from
  `~/Documents/other-project/`. Real foot-gun on shared dev boxes. Now
  gated on a monorepo sentinel: the candidate directory must contain
  `workspace-server/go.mod`. Falls through to "no .env found" (=
  pre-fix behavior) when the sentinel is absent.
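
A minimal sketch of the sentinel-gated walk, written in TypeScript for illustration (the real code is Go). The `exists` callback is a hypothetical stand-in for a filesystem check, and the POSIX-only path handling is a simplification. Whether the real walk keeps climbing past a rejected candidate is not stated; this sketch does.

```typescript
// Hypothetical sketch: `exists` stands in for a filesystem check, and the
// POSIX-style path handling simplifies whatever the Go code actually does.
function findDotEnv(startDir: string, exists: (path: string) => boolean): string | null {
  let dir = startDir.replace(/\/+$/, "") || "/";
  for (;;) {
    const prefix = dir === "/" ? "" : dir;
    const candidate = `${prefix}/.env`;
    const sentinel = `${prefix}/workspace-server/go.mod`;
    // Only accept a .env that sits next to the monorepo sentinel.
    if (exists(candidate) && exists(sentinel)) return candidate;
    if (dir === "/") return null; // reached the root: no .env found
    const cut = dir.lastIndexOf("/");
    dir = cut <= 0 ? "/" : dir.slice(0, cut);
  }
}
```

With this shape, a stray `~/.env` is skipped because `~/workspace-server/go.mod` does not exist next to it.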

socket fallback poll:
- startFallbackPoll() previously fired only on onclose, so the very
  first connect attempt — when onclose hasn't fired yet because we
  never had a successful onopen — left the canvas with no HTTP poll
  for the duration of the failing handshake (Chrome can hold a
  SYN-SENT WebSocket open ~75s before giving up). Now also called at
  the top of connect(); the timer-already-running guard makes it a
  no-op when one cycle later onclose calls it again.
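
The timer-already-running guard can be sketched like this. `FallbackPoller`, `IntervalScheduler`, and the injected scheduler are hypothetical names for this sketch, not the real socket.ts API; the 10s default is taken from the earlier fallback-poll commit.

```typescript
// Illustrative only: names and the injectable scheduler are assumptions.
interface IntervalScheduler {
  /** Runs fn every ms milliseconds; returns a function that cancels it. */
  every(ms: number, fn: () => void): () => void;
}

const realScheduler: IntervalScheduler = {
  every(ms: number, fn: () => void): () => void {
    const handle = setInterval(fn, ms);
    return () => clearInterval(handle);
  },
};

class FallbackPoller {
  private cancel: (() => void) | null = null;
  private readonly poll: () => void;
  private readonly scheduler: IntervalScheduler;
  private readonly intervalMs: number;

  constructor(poll: () => void, scheduler: IntervalScheduler = realScheduler, intervalMs: number = 10_000) {
    this.poll = poll;
    this.scheduler = scheduler;
    this.intervalMs = intervalMs;
  }

  /** Safe to call from both connect() and onclose: a running timer is kept. */
  start(): void {
    if (this.cancel !== null) return; // timer-already-running guard
    this.cancel = this.scheduler.every(this.intervalMs, this.poll);
  }

  /** Intended for onopen / explicit disconnect. */
  stop(): void {
    if (this.cancel === null) return;
    this.cancel();
    this.cancel = null;
  }

  get running(): boolean {
    return this.cancel !== null;
  }
}
```

Because `start()` is idempotent, calling it at the top of `connect()` and again from `onclose` schedules exactly one interval.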

Test coverage added: export prefix, single+double quoted values, hash
inside quotes preserved, unterminated quote falls back to bare value,
CRLF stripping locked in, BOM stripping, and a sentinel-rejection
regression test that creates a temp .env with no workspace-server
sibling and asserts findDotEnv refuses to load it.

Verified: 985 canvas tests + 30 dotenv subtests + 4 dotenv integration
tests all pass; tsc clean; rebuilt platform from monorepo root with
stripped env still loads .env (49 vars) and /workspaces returns 200.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 21:09:18 -07:00
Hongming Wang
21db85d691 fix(canvas): cascade delete locally so children disappear without WS
Deleting a parent on a wedged WS used to leave the child cards on
the canvas as orphaned roots until the user manually refreshed.

Why: Canvas.tsx and DetailsTab.tsx both called `removeNode(parentId)`
after `DELETE /workspaces/:id?confirm=true` returned 200. `removeNode`
deliberately re-parents children rather than cascading — it relies on
the per-descendant WORKSPACE_REMOVED WS events the platform emits as
part of the cascade to drop each child individually. When the WS is
unhealthy those events never arrive, so the local store keeps the
children alive (now re-parented to root since their actual parent is
gone).

Fix: new `removeSubtree(rootId)` action on the canvas store mirrors
the server-side cascade — drops the root + every descendant + every
incident edge in one atomic set(). Both delete call sites now use it.
The WS events still arrive when WS is healthy and become idempotent
no-ops because the nodes are already gone.
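
A rough sketch of what such a cascade could look like as a pure function. The `CanvasNode`, `CanvasEdge`, and `CanvasState` shapes are simplified assumptions; the real store types and the `set()` wiring are not shown in this commit.

```typescript
// Minimal shapes for the sketch; the real store's node and edge types differ.
interface CanvasNode { id: string; parentId: string | null }
interface CanvasEdge { id: string; source: string; target: string }
interface CanvasState { nodes: CanvasNode[]; edges: CanvasEdge[]; selectedId: string | null }

function removeSubtree(state: CanvasState, rootId: string): CanvasState {
  // Collect the root plus every descendant with a breadth-first walk.
  const doomed = new Set<string>([rootId]);
  const queue: string[] = [rootId];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const n of state.nodes) {
      if (n.parentId === current && !doomed.has(n.id)) {
        doomed.add(n.id);
        queue.push(n.id);
      }
    }
  }
  return {
    nodes: state.nodes.filter((n) => !doomed.has(n.id)),
    // Drop every edge that touches a removed node.
    edges: state.edges.filter((e) => !doomed.has(e.source) && !doomed.has(e.target)),
    // Clear the selection only if it pointed into the removed subtree.
    selectedId: state.selectedId !== null && doomed.has(state.selectedId) ? null : state.selectedId,
  };
}
```

Running it a second time with the same `rootId` changes nothing, which is the property that makes late WORKSPACE_REMOVED events harmless.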

Why a new action instead of changing removeNode: removeNode's
re-parenting behavior is correct for non-cascading flows (drag-out,
manual node detach in the future). Adding a sibling action keeps
both call shapes available rather than forcing every caller to opt
out of cascade.

6 new unit tests cover root cascade, mid-level cascade, leaf
no-op-cascade, selection clearing across the subtree, selection
preservation outside the subtree, and edge cleanup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 20:51:09 -07:00
Hongming Wang
0b4dfbd121 fix(canvas): suppress stale provisioning banners + add WS-down HTTP fallback poll
Two related fixes for the case where the canvas thinks workspaces are
stuck provisioning when they're actually online:

1. ProvisioningTimeout banners now gate on wsStatus === "connected".
   While the WS is in connecting/disconnected state, the local
   "provisioning" status reflects the last event received before the
   drop — workspaces may have transitioned to online minutes ago. The
   8m timeout was firing against frozen state and showing a wall of
   yellow warnings on already-online workspaces.

2. Socket layer now starts a 10s rehydrate poll when the WS goes
   unhealthy (onclose) and stops it on onopen/disconnect. The
   reconnect attempts continue in parallel; whichever recovers first
   wins. rehydrate()'s existing dedup gate prevents the open-time
   rehydrate from racing with a fallback poll. Without this the
   store could stay frozen for minutes while WS exponential backoff
   chewed through retries.
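
The banner gate from item 1 reduces to a small predicate. This is a sketch under assumptions: the function name and parameters are hypothetical, and only the three `wsStatus` values named above are modelled.

```typescript
// Hypothetical predicate; the real component reads these values from the store.
type WsStatus = "connecting" | "connected" | "disconnected";

function shouldShowProvisioningTimeout(
  wsStatus: WsStatus,
  workspaceStatus: string,
  elapsedMs: number,
  timeoutMs: number,
): boolean {
  // While the WS is not connected, the local status may be frozen at the
  // last event received, so no timeout verdict is trusted.
  if (wsStatus !== "connected") return false;
  return workspaceStatus === "provisioning" && elapsedMs >= timeoutMs;
}
```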

Plus the previously-uncommitted TemplatePalette flushSync change so
the import modal unmounts synchronously before doImport runs (otherwise
React batches the close with the import's setState prefix and the
modal backdrop hides the spawn animation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 20:22:15 -07:00
Hongming Wang
1d71b4e9e5 fix(canvas): bundle of UX hardening — modals, position stability, error UX, paste
Single-themed bundle of fixes accumulated while polishing the canvas
chat / agent-comms / plugins / position flows. Each piece is small;
the connective tissue is "things observable from the canvas right
panel and the org-deploy flow that surprised real users".

UI / composer
  - Legend: add close X + persisted-localStorage state + reopener
    pill; default open for first-time users.
  - SidePanel: rename "Skills" tab label → "Plugins" (single-line;
    internal panelTab enum value, component name, and store keys
    unchanged).
  - SkillsTab: registry tri-state UI (loading / error / empty) with
    actionable Retry button + 10s explicit fetch timeout. Handle
    AbortSignal.timeout's DOMException by name (TimeoutError /
    AbortError) — Chromium's "signal timed out" message wouldn't
    match the prior naive /timeout/ regex. Reset mountedRef on every
    mount: pre-existing StrictMode dev-mode bug where cleanup-only
    `current = false` was never re-set, permanently wedging every
    `if (mountedRef.current) setX(...)` guard and producing a
    "Loading…" panel that never resolved on hard refresh.
  - ChatTab: paste-image-from-clipboard via onPaste handler; unique
    monotonic-counter filenames so same-second pastes don't collide
    on name+size dedup. mime→ext map avoids `image/svg+xml`-style
    raw extensions on synthesised filenames. Bypasses the
    DataTransfer constructor so Safari < 14.1 / older Edge work.
  - ChatTab: drop stuck error toast when the WS path already
    delivered the agent reply but the HTTP path errored late
    (sendingFromAPIRef gate now covers the .catch() handler).
  - ChatTab: filter heartbeat-style internal self-messages from the
    My Chat tab so historical rows with source_id=NULL don't
    surface as user-typed input.
  - Modal portals: OrgImportPreflightModal + MissingKeysModal
    (ProviderPickerModal + AllKeysModal) now createPortal to
    document.body and clamp max-h to 80vh. Escapes the ancestor
    containing block (TemplatePalette's fixed+filtered sidebar
    re-anchored descendants' position:fixed to itself, hiding
    modals behind workspace cards). MissingKeysModal bumped to
    z-[60] for stack ordering when both modals are open.
  - OrgImportPreflightModal saveOne: ref-based microtask-safe
    in-flight gate replaces the brittle "set startValue inside a
    setState updater and read on the next line" pattern (React 18
    doesn't guarantee functional updaters run synchronously; that
    path strands `saving:true` and never calls createSecret). Same
    useRef pattern guards SkillsTab.loadRegistry against concurrent
    fires and Fast-Refresh-stranded promises; force=true parameter
    on retry click bypasses the gate.

Agent comms
  - AgentCommsPanel: derive UI-facing `flow` field instead of using
    activity_type-derived direction. Self-logged a2a_receive rows
    (source_id == workspace_id, what the agent runtime writes to log
    its own outbound delegation replies) now correctly render as
    OUTBOUND with → arrow + right-justified bubble. Previously they
    rendered "← From Self" with Restart pointing at THIS workspace.
  - AgentCommsPanel: error rows replace the unactionable
    "X failed [A2A_ERROR]" body with banner + underlying-error
    code-block + cause-hint (matched on Claude Code SDK init wedge,
    deadline-exceeded, agent-thrown exception, empty-error) +
    Restart [peer] / Open [peer] action buttons.
  - AgentCommsPanel: render text bodies through ReactMarkdown +
    remark-gfm so multi-part replies (tables, code) render properly.

Multi-part text extractor
  - extractReplyText (live A2A response in ChatTab) and
    extractResponseText (chat history loader in message-parser):
    now COLLECT from every source — top-level parts, parts.root.text,
    and artifacts — joined with "\n". Previous "first source wins"
    silently dropped multi-part replies (Hermes summary+detail,
    Claude Code long-form table). Tests cover joined-from-parts,
    joined-from-artifacts, joined-from-both.
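
A sketch of the collect-then-join behaviour. The `TextPart`, `Artifact`, and `A2AResult` shapes below are guesses at the response layout based on the three sources named above; the real extractors may read additional fields.

```typescript
// Assumed response shapes, for illustration only.
interface TextPart { text?: string; root?: { text?: string } }
interface Artifact { parts?: TextPart[] }
interface A2AResult { parts?: TextPart[]; artifacts?: Artifact[] }

function extractReplyText(result: A2AResult): string {
  const chunks: string[] = [];
  const collect = (parts: TextPart[] | undefined): void => {
    for (const p of parts ?? []) {
      // A part may carry its text directly or nested under `root`.
      const t = p.text ?? p.root?.text;
      if (typeof t === "string" && t.length > 0) chunks.push(t);
    }
  };
  collect(result.parts);
  for (const a of result.artifacts ?? []) collect(a.parts);
  // Every source contributes; nothing is dropped by "first source wins".
  return chunks.join("\n");
}
```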

Position stability
  - canvas-topology.buildNodesAndEdges: auto-rescue heuristic now
    accepts currentParentSizes map; uses max(initial min, currently
    grown) for the bbox check. Fixes "child jumps to weird location
    after 30s" — the periodic socket health-check rehydrate
    (silenceSec > 30) was rebuilding nodes from scratch, and the
    rescue's reliance on grid-derived initial size false-flagged
    children the user dragged into the user-grown area.
  - canvas.hydrate: pass live measured dimensions from the existing
    store into buildNodesAndEdges.
  - socket.RehydrateDedup: pure exported helper class that gates
    rehydrate calls. Two states — in-flight (in-flight Promise reused
    by concurrent callers) + post-completion window (1.5s, returns
    Promise.resolve()). Initialised with -Infinity so first call
    always passes the gate. Wired into ReconnectingSocket.rehydrate.
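
The two-state gate described for socket.RehydrateDedup might look roughly like this. The method name `run` and the injectable clock are assumptions made so the sketch is testable; only the 1.5s window and the -Infinity initial value come from the text above.

```typescript
// Sketch of a dedup gate; method names and the clock injection are assumed.
class RehydrateDedup {
  private inFlight: Promise<void> | null = null;
  private lastCompletedAt: number = -Infinity; // first call always passes
  private readonly windowMs: number;
  private readonly now: () => number;

  constructor(windowMs: number = 1500, now: () => number = () => Date.now()) {
    this.windowMs = windowMs;
    this.now = now;
  }

  run(fn: () => Promise<void>): Promise<void> {
    // State 1: a rehydrate is running; concurrent callers share its Promise.
    if (this.inFlight !== null) return this.inFlight;
    // State 2: one just finished; calls inside the window resolve at once.
    if (this.now() - this.lastCompletedAt < this.windowMs) return Promise.resolve();
    const p = fn().finally(() => {
      this.lastCompletedAt = this.now();
      this.inFlight = null;
    });
    this.inFlight = p;
    return p;
  }
}
```

In this sketch a rejected rehydrate also stamps the completion time, so a failing call is not retried until the window has passed; whether the real helper does the same is not stated.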

A2A edges
  - New A2AEdge custom React Flow edge component portals its label
    out of the SVG layer via EdgeLabelRenderer so labels (a) render
    above workspace cards instead of being hidden behind them and
    (b) accept clicks. Click selects source + switches panel to
    Activity, but only on a NEW selection (preserves current tab on
    re-click of an already-selected source).
  - buildA2AEdges output tagged type:"a2a"; edgeTypes wired in
    Canvas.tsx.

Tests
  - 14 new vitest cases across 4 files (964 → 978 passing):
    OrgImportPreflightModal saveOne single-fire / double-click,
    any-of rendering; AgentCommsPanel toCommMessage flow derivation
    in all four shapes; canvas-topology rescue respects-grown /
    rescues-genuine-drift / fallback-without-live-size; socket
    RehydrateDedup gate behaviour; message-parser multi-part
    response extraction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 19:54:43 -07:00
Hongming Wang
ad73a56db1 feat(env-preflight): support any_of OR groups (e.g. API_KEY OR OAUTH_TOKEN)
Extends the org-import env preflight so a template can declare an
alternative: satisfy ANY one member to pass. Motivated by the
Claude-family node case where either ANTHROPIC_API_KEY or
CLAUDE_CODE_OAUTH_TOKEN unlocks the agent — forcing both was wrong.

Server (workspace-server):
  - New EnvRequirement union type with custom YAML + JSON
    (un)marshaling. Accepts scalar (strict) or {any_of: [...]} in
    both on-disk org.yaml and inline POST /org/import bodies.
  - collectOrgEnv now returns []EnvRequirement. Dedups groups by
    sorted-member signature. "Strict wins" pruning drops any-of
    groups that mention a name already declared strictly (same
    tier and cross-tier).
  - Import preflight uses EnvRequirement.IsSatisfied — scalar =
    exact match, group = any member present.
  - Empty any_of: [] rejected at parse time (never-satisfiable).
  - 14 handler tests (6 updated for the union shape, 8 new
    covering any-of satisfaction, dedup, strict-dominates-group,
    cross-tier pruning, invalid-member filtering, YAML round-trip,
    and empty-any-of rejection).
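
The dedup and "strict wins" rules can be illustrated with a small sketch. It is written in TypeScript although the server is Go, it ignores the required/recommended tiers and invalid-member filtering, and `collectRequirements` is a hypothetical name, not the real `collectOrgEnv`.

```typescript
// Simplified illustration of signature dedup + strict-wins pruning.
type EnvRequirement = string | { any_of: string[] };

function collectRequirements(declared: EnvRequirement[]): EnvRequirement[] {
  const strict = new Set<string>();
  const groups = new Map<string, string[]>(); // signature -> sorted members
  for (const req of declared) {
    if (typeof req === "string") {
      strict.add(req);
      continue;
    }
    const members = Array.from(new Set(req.any_of)).sort();
    if (members.length === 0) throw new Error("any_of must not be empty");
    // Groups with the same members in any order share one signature.
    groups.set(members.join("|"), members);
  }
  const out: EnvRequirement[] = Array.from(strict).sort();
  Array.from(groups.values()).forEach((members) => {
    // "Strict wins": a group naming an already-strict key adds nothing.
    if (members.some((m) => strict.has(m))) return;
    out.push({ any_of: members });
  });
  return out;
}
```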

Canvas:
  - EnvRequirement = string | {any_of: string[]} with envReqMembers,
    envReqSatisfied, envReqKey helpers.
  - OrgImportPreflightModal renders strict rows and any-of groups
    via a new AnyOfEnvGroup sub-component: "Configure any one"
    banner, per-member input, ✓-satisfied indicator, and dimmed
    siblings once any member is configured so the user can still
    switch providers.
  - TemplatePalette.OrgTemplate.required_env / recommended_env
    retyped to EnvRequirement[]; passthrough to the modal
    unchanged.
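
For orientation, the three canvas helpers could be as small as the following. The helper names come from the commit; their signatures and the key format are assumptions.

```typescript
// Helper names are from the commit; signatures and key format are assumed.
type EnvRequirement = string | { any_of: string[] };

/** Every env var name that can satisfy this requirement. */
function envReqMembers(req: EnvRequirement): string[] {
  return typeof req === "string" ? [req] : req.any_of;
}

/** Strict: the one key must be configured. Group: any member suffices. */
function envReqSatisfied(req: EnvRequirement, configured: ReadonlySet<string>): boolean {
  return envReqMembers(req).some((m) => configured.has(m));
}

/** Stable identity for React keys and dedup, independent of member order. */
function envReqKey(req: EnvRequirement): string {
  return typeof req === "string" ? req : `any_of:${req.any_of.slice().sort().join("|")}`;
}
```

For the Claude-family case, a group of `ANTHROPIC_API_KEY` and `CLAUDE_CODE_OAUTH_TOKEN` is satisfied as soon as either one appears in the configured set.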

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 16:16:25 -07:00
Hongming Wang
f995b90a85 test(canvas-events): expect both pan-to-node AND fit-deploying-org on NEW root provision
Commit 5adc8a74 (part of this PR) intentionally made
molecule:fit-deploying-org fire for root-level workspaces too — it
used to only fire for children, which meant a standalone create
didn't center the viewport until the first child arrived ~2s later.

The existing regression test still expected ONLY the
molecule:pan-to-node event for a new root, so it started failing
with "expected length 1, got 2". The product behavior is correct
(centering on the root immediately is better UX); the test was
pinning the old single-dispatch shape.

Fix: assert BOTH events fire, each with the right detail payload,
so a future regression that drops either one (or duplicates) trips
the test. Single-test update, no production code change. 953/953
canvas tests pass locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 15:55:52 -07:00
Hongming Wang
5adc8a74d5 feat(canvas+org): env preflight, EmptyState parity, shared useTemplateDeploy hook
Builds on #2061. Three internally-cohesive sub-features plus a
viewport auto-fit fix; easiest to read in order.

## 1. Org-level env preflight

Server
- `OrgTemplate` + `OrgWorkspace` gain `required_env: string[]` and
  `recommended_env: string[]` YAML fields.
- `GET /org/templates` walks the tree and returns the tree-union
  (deduped, sorted) of both. `collectOrgEnv` dedup prefers required
  when the same key is declared at both tiers.
- `POST /org/import` preflights against `global_secrets` WHERE
  `octet_length(encrypted_value) > 0` (empty-value rows used to be
  counted as "configured" and the per-container preflight still
  failed at start time). 412 Precondition Failed + `missing_env`
  list when required keys are absent. `force=true` bypasses with
  an audit log line. DB lookup failure now returns 500 (was:
  silent fall-through that defeated the guard). Env-var NAMES
  validated against `^[A-Z][A-Z0-9_]{0,127}$` so a malicious
  template can't ship pathological names into the UI or DB.
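
The name check and the 412 decision can be sketched as below. The regex and the `missing_env` field are from the commit; everything else (function names, dropping invalid names instead of rejecting the template, 200 as the success status) is an assumption, and the real handler is Go.

```typescript
// Regex is from the commit; the surrounding functions are hypothetical.
const ENV_NAME_RE = /^[A-Z][A-Z0-9_]{0,127}$/;

function isValidEnvName(candidate: string): boolean {
  return ENV_NAME_RE.test(candidate);
}

interface PreflightResult { status: number; missing_env: string[] }

/**
 * `configured` should hold only keys whose stored value is non-empty,
 * mirroring the octet_length(encrypted_value) > 0 filter.
 */
function preflightImport(required: string[], configured: ReadonlySet<string>, force: boolean): PreflightResult {
  const missing = required
    .filter((k) => isValidEnvName(k)) // assumption: invalid names are dropped
    .filter((k) => !configured.has(k))
    .sort();
  // 412 Precondition Failed unless the caller explicitly forces the import.
  if (missing.length > 0 && !force) return { status: 412, missing_env: missing };
  return { status: 200, missing_env: missing };
}
```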

Canvas
- New `OrgImportPreflightModal`: red "Required" section (blocking)
  and yellow "Recommended" section (non-blocking, import stays
  enabled, shows live missing-count next to the Import button).
- Per-key password input → `PUT /settings/secrets` → strike-through
  on save. Functional `setDrafts` throughout (no stale-closure
  clobbers on rapid successive saves). `useEffect` seed keyed on a
  sorted-join string signature so a parent re-render with a new
  array identity doesn't clobber typed inputs.
- `TemplatePalette.handleImport` branches: zero env declarations →
  straight to import; any declarations → fetch configured global
  secret keys, open the modal.

Tests (Go): `TestCollectOrgEnv_*` (5) cover union-across-levels,
required-wins-over-recommended (including same-struct), dedup,
empty, invalid-name rejection.

## 2. EmptyState parity with TemplatePalette

The "Deploy your first agent" grid used to call `POST /workspaces`
with no preflight while the sidebar palette ran
`checkDeploySecrets` + `MissingKeysModal` first. Same template
deployed two different ways → first-run users saw containers boot
in `failed` state without guidance. Now both surfaces share one
preflight + modal handshake.

EmptyState's previous `interface Template` dropped `runtime`,
`models`, and `required_env` — silently discarding exactly the
fields the preflight needs. `Template` now lives in
`deploy-preflight.ts` and is imported from there by both surfaces.

## 3. useTemplateDeploy hook

With the preflight + modal wiring now duplicated across
EmptyState + TemplatePalette + (going forward) any third surface,
extracted the pattern into `canvas/src/hooks/useTemplateDeploy.tsx`:

  const { deploy, deploying, error, modal } = useTemplateDeploy({
    canvasCoords: ...,   // optional, default random
    onDeployed: (id) => ...,
  });

Closes three drift surfaces that the duplication had created:
- `resolveRuntime` id→runtime fallback table (moved to
  `deploy-preflight.ts`). EmptyState had a narrower fallback that
  would have silently disagreed with the palette on any future id
  needing a non-identity mapping.
- `checkDeploySecrets` call signature. One owner.
- `MissingKeysModal` JSX wiring. One owner.

Narrow try/catch around `checkDeploySecrets` so a preflight network
failure clears `deploying` and surfaces via `setError` instead of
stranding the button forever. `modal: ReactNode` (not a
`renderModal()` function) — the previous memoization bought
nothing since consumers called it inline every render. Named
`MissingKeysInfo` interface for the state shape.

## 4. Viewport auto-fit user-pan gate fix

During org deploy the canvas was meant to pan+zoom to follow each
arriving workspace (`molecule:fit-deploying-org` event → debounced
fitView). In practice the fit stayed stuck on wherever the first
fit landed.

Root cause: React Flow v12 fires `onMoveEnd` with a truthy `event`
at the END of a programmatic `fitView` animation. The original
"respect-user-pan" gate stamped `userPannedAtRef` in `onMoveEnd`,
so our own fit completing looked like a user pan, and every
subsequent auto-fit short-circuited for the rest of the deploy.

Fix: stop trusting `onMoveEnd` for user-intent detection. Register
explicit `wheel` + `pointerdown` listeners on `document` with
capture phase and `target.closest('.react-flow__pane')` filter.
Capture-phase immunity to `stopPropagation`; pane-filter rejects
toolbar / modal / side-panel clicks (the old `window` fallback
caught those). `onMoveEnd` simplified to only drive the debounced
viewport save.

Also: fit event dispatched on root arrivals (not just children),
so the canvas centers on the just-landed root immediately instead
of waiting ~2s for the first child. Animation 600ms → 400ms so
successive per-arrival fits don't pile up visually. End-state fit
stays at 1200ms — intentional asymmetry ("settling" vs
"tracking"), documented in code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 15:15:33 -07:00
Hongming Wang
425df5e5a9 merge(staging): resolve conflicts + fix 7 test regressions on top of #2061
- Merge origin/staging into fix/canvas-multilevel-layout-ux. 18 files
  auto-merged (mostly canvas/tabs/chat and workspace-server handlers;
  the earlier DIRTY marker was stale relative to current staging).

- Fix 7 test failures surfaced by the merge:

  1. Canvas.pan-to-node.test.tsx — mockGetIntersectingNodes was
     inferred as vi.fn(() => never[]); mockReturnValueOnce of a node
     object failed type check. Explicit return-type annotation.

  2. Canvas.pan-to-node.test.tsx + Canvas.a11y.test.tsx — Canvas.tsx
     reads deletingIds.size (new multilevel-layout state). Both mock
     stores lacked deletingIds; added new Set<string>() to each.

  3. canvas-batch-partial-failure.test.ts — makeWS() built a wire-
     format WorkspaceData (snake_case, with x/y/uptime_seconds). The
     store's node.data is now WorkspaceNodeData (camelCase, no wire-
     only fields). Rewrote makeWS to produce WorkspaceNodeData and
     updated 5 call-site casts. No assertions changed.

  4. ConfigTab.hermes.test.tsx — two tests pinned pre-#2061 behavior
     that the PR intentionally inverts:

       a. "shows hermes-specific info banner" — RUNTIMES_WITH_OWN_CONFIG
          now contains only {"external"}, so the banner is no longer
          shown for hermes. Inverted assertion: now pins ABSENCE of
          the banner, with a comment noting the inversion.

       b. "config.yaml runtime wins over DB" — priority reversed:
          DB is now authoritative so the tier-on-node badge matches
          the form. Inverted scenario: DB=hermes + yaml=crewai →
          form shows hermes. Switched test's DB runtime off langgraph
          because the dropdown collapses langgraph into an empty-
          valued "default" option that would hide the win signal.

- No production code changed — this commit is staging merge + test
  realignment only. 953/953 canvas tests pass. tsc --noEmit clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 13:50:39 -07:00
Hongming Wang
94d9331c76 feat(canvas+platform): chat attachments, model selection, deploy/delete UX
Session's accumulated UX work across frontend and platform. Reviewable
in four logical sections — diff is large but internally cohesive
(each section fixes a gap the next one depends on).

## Chat attachments — user ↔ agent file round trip

- New POST /workspaces/:id/chat/uploads (multipart, 50 MB total /
  25 MB per file, UUID-prefixed storage under
  /workspace/.molecule/chat-uploads/).
- New GET /workspaces/:id/chat/download with RFC 6266 filename
  escaping and binary-safe io.CopyN streaming.
- Canvas: drag-and-drop onto chat pane, pending-file pills,
  per-message attachment chips with fetch+blob download (anchor
  navigation can't carry auth headers).
- A2A flow carries FileParts end-to-end; hermes template executor
  now consumes attachments via platform helpers.

## Platform attachment helpers (workspace/executor_helpers.py)

Every runtime's executor routes through the same helpers so future
runtimes inherit attachment awareness for free:
- extract_attached_files — resolve workspace:/file:///bare URIs,
  reject traversal, skip non-existent.
- build_user_content_with_files — manifest for non-image files,
  multi-modal list (text + image_url) for images. Respects
  MOLECULE_DISABLE_IMAGE_INLINING for providers whose vision
  adapter hangs on base64 payloads (MiniMax M2.7).
- collect_outbound_files — scans agent reply for /workspace/...
  paths, stages each into chat-uploads/ (download endpoint
  whitelist), emits as FileParts in the A2A response.
- ensure_workspace_writable — called at molecule-runtime startup
  so non-root agents can write /workspace without each template
  having to chmod in its Dockerfile.

Hermes template executor + langgraph (a2a_executor.py) + claude-code
(claude_sdk_executor.py) all adopt the helpers.

## Model selection & related platform fixes

- PUT /workspaces/:id/model — was 404'ing, so canvas "Save"
  silently lost the model choice. Stores into workspace_secrets
  (MODEL_PROVIDER), auto-restarts via RestartByID.
- applyRuntimeModelEnv falls back to envVars["MODEL_PROVIDER"]
  so Restart propagates the stored model to HERMES_DEFAULT_MODEL
  without needing the caller to rehydrate payload.Model.
- ConfigTab Tier dropdown now reads from workspaces row, not the
  (stale) config.yaml — fixes "badge shows T3, form shows T2".

## ChatTab & WebSocket UX fixes

- Send button no longer locks after a dropped TASK_COMPLETE —
  `sending` no longer initializes from data.currentTask.
- A2A POST timeout 15 s → 120 s. LLM turns routinely exceed 15 s;
  the previous default aborted fetches while the server was still
  replying, producing "agent may be unreachable" on success.
- socket.ts: disposed flag + reconnectTimer cancellation + handler
  detachment fix zombie-WebSocket in React StrictMode.
- Hermes Config tab: RUNTIMES_WITH_OWN_CONFIG drops 'hermes' —
  the adaptor's purpose IS the form, banner was contradictory.
- workspace_provision.go auto-recovery: try <runtime>-default AND
  bare <runtime> for template path (hermes lives at the bare name).

## Org deploy/delete animation (theme-ready CSS)

- styles/theme-tokens.css — design tokens (durations, easings,
  colors). Light theme overrides by setting only the deltas.
- styles/org-deploy.css — animation classes + keyframes, every
  value references a token. prefers-reduced-motion respected.
- Canvas projects node.draggable=false onto locked workspaces
  (deploying children AND actively-deleting ids) — RF's
  authoritative drag lock; useDragHandlers retains a belt-and-
  braces check.
- Org cancel button (red pulse pill on root during deploy)
  cascades via existing DELETE /workspaces/:id?confirm=true.
- Auto fit-view after each arrival, debounced 500 ms so rapid
  sibling arrivals coalesce into one fit (previous per-event
  fit made the viewport lurch continuously).
- Auto-fit respects user-pan — onMoveEnd stamps a user-pan
  timestamp only when event !== null (ignores programmatic
  fitView) so auto-fits don't self-cancel.
- deletingIds store slice + useOrgDeployState merge gives the
  delete flow the same dim + non-draggable treatment as deploy.
- Platform-level classNames.ts shared by canvas-events +
  useCanvasViewport (DRY'd 3 copies of split/filter/join).

## Server payload change

- org_import.go WORKSPACE_PROVISIONING broadcast now includes
  parent_id + parent-RELATIVE x/y (slotX/slotY) so the canvas
  renders the child at the right parent-nested slot without doing
  any absolute-position walk. createWorkspaceTree signature gains
  relX, relY alongside absX, absY; both call sites updated.

## Tests

- workspace/tests/test_executor_helpers.py — 11 new cases
  covering URI resolution (including traversal rejection),
  attached-file extraction (both Part shapes), manifest-only
  vs multi-modal content, large-image skip, outbound staging,
  dedup, and ensure_workspace_writable (chmod 777 + non-root
  tolerance).
- workspace-server chat_files_test.go — upload validation,
  Content-Disposition escaping, filename sanitisation.
- workspace-server secrets_test.go — SetModel upsert, empty
  clears, invalid UUID rejection.
- tests/e2e/test_chat_attachments_e2e.sh — round-trip against
  a live hermes workspace.
- tests/e2e/test_chat_attachments_multiruntime_e2e.sh — static
  plumbing check + round-trip across hermes/langgraph/claude-code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 13:27:51 -07:00
Hongming Wang
2dbd06d52e
Merge pull request #2055 from Molecule-AI/feat/lark-channel-first-class-v2
feat(channels): first-class Lark/Feishu support via schema-driven config
2026-04-24 19:57:57 +00:00
rabbitblood
998cd03265 fix(tabs-a11y): mock config_schema on adapter response
Schema-driven ChannelsTab renders no inputs when config_schema is
absent — the test's bare {type, display_name} mock mismatched the
real API shape and every getByLabelText("Bot Token") failed.

Mock now mirrors GET /channels/adapters with the Telegram schema
(bot_token password + chat_id text) so the a11y assertions run
against the actual rendered form.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 12:04:51 -07:00
molecule-ai[bot]
92a0c0073d
Merge pull request #2058 from Molecule-AI/chore/canvas-node22-upgrade
chore(canvas): upgrade node:20-alpine → node:22-alpine
2026-04-24 19:04:25 +00:00
molecule-ai[bot]
17f29e874a
Merge pull request #2029 from Molecule-AI/fix/canvas-a11y-tabs-v2
fix(canvas/a11y): add type=button to tab toolbar and settings buttons
2026-04-24 19:01:24 +00:00
1e5fc48acb chore(canvas): upgrade node:20-alpine → node:22-alpine
Node.js 20 reaches EOL 2026-04-30 and actions/checkout@v4 emits
Node.js 20 deprecation warnings on GitHub Actions (Node 24 forced
2026-06-02). Next.js 15.1 is fully compatible with Node 22.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 18:54:30 +00:00
Hongming Wang
04e60e7303
Merge pull request #2052 from Molecule-AI/fix/canvas-provisioning-timeout-runtime-aware
fix(canvas): runtime-aware provisioning-timeout threshold (hermes 12min vs default 2min)
2026-04-24 18:51:46 +00:00
rabbitblood
00265d7028 feat(channels): first-class Lark/Feishu support via schema-driven config
Lark adapter was already implemented in Go (lark.go — outbound Custom Bot
webhook + inbound Event Subscriptions with constant-time token verify),
but the Canvas connect-form hardcoded a Telegram-shaped pair of inputs
(bot_token + chat_id). Selecting "Lark / Feishu" from the dropdown
silently sent the wrong field names — there was no way to enter a
webhook URL.

Fix: move form shape to the server.

- Add `ConfigField` struct + `ConfigSchema()` method to the
  `ChannelAdapter` interface. Each adapter declares its own fields with
  label/type/required/sensitive/placeholder/help.
- Implement per-adapter schemas:
  - Lark: webhook_url (required+sensitive) + verify_token (optional+sensitive)
  - Slack: bot_token/channel_id/webhook_url/username/icon_emoji
  - Discord: webhook_url + optional public_key
  - Telegram: bot_token + chat_id (unchanged UX, keeps Detect Chats)
- Change `ListAdapters()` to return `[]AdapterInfo` with config_schema
  inline. Sorted deterministically by display name so UI ordering is
  stable across Go's random map iteration.
- Update the 3 existing `ListAdapters` test sites to struct access.

Canvas (`ChannelsTab.tsx`):
- Replace the two hardcoded bot_token/chat_id inputs with a single
  schema-driven `SchemaField` component. Renders one input per field in
  the order the adapter returns them.
- Form state becomes `formValues: Record<string,string>` keyed by
  `ConfigField.key`. Values reset on platform-switch so stale
  Telegram credentials can't leak into a new Lark channel.
- "Detect Chats" stays but only renders for platforms in
  `SUPPORTS_DETECT_CHATS` (Telegram only — the only provider with
  getUpdates).
- Only schema-known keys are posted in `config`, scrubbing any stale
  values from previous platform selections.
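
A sketch of the "only schema-known keys are posted" rule. The `ConfigField` property names mirror the Go struct fields listed above, but the exact JSON keys, the helper names, and the choice to omit empty values are assumptions.

```typescript
// Field names follow the commit's description; JSON casing is assumed.
interface ConfigField {
  key: string;
  label: string;
  type: string;
  required: boolean;
  sensitive: boolean;
  placeholder?: string;
  help?: string;
}

/** Builds the `config` payload from form state, keeping only schema keys. */
function buildConfigPayload(schema: ConfigField[], formValues: Record<string, string>): Record<string, string> {
  const config: Record<string, string> = {};
  for (const field of schema) {
    const value = formValues[field.key];
    // Assumption: empty inputs are omitted rather than sent as "".
    if (value !== undefined && value !== "") config[field.key] = value;
  }
  return config;
}

/** Keys of required fields the user has not filled in yet. */
function missingRequiredFields(schema: ConfigField[], formValues: Record<string, string>): string[] {
  return schema
    .filter((f) => f.required && (formValues[f.key] ?? "") === "")
    .map((f) => f.key);
}
```

Iterating over the schema instead of the form state is what keeps a leftover `bot_token` out of a Lark payload.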

Regression tests:
- `TestLark_ConfigSchema` locks in the 2-field Lark contract with the
  required/sensitive flags correctly set.
- `TestListAdapters_IncludesLark` confirms registry wiring + schema
  survives round-trip through ListAdapters.

Known pre-existing `TestStripPluginMarkers_AwkScript` failure in
internal/handlers is unrelated to this change (verified via stash+test
on clean staging).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 11:51:15 -07:00
Hongming Wang
0b237ed9dd refactor(canvas): extract runtime profiles to @/lib/runtimeProfiles
Preparation for a "hundreds of runtimes" plugin ecosystem. Keeping the
runtime-specific UX knobs in-line inside ProvisioningTimeout scales badly
— every new runtime would require editing a component, not just adding a
table entry. Other components (create-workspace dialog, workspace card
tooltips, etc.) will want the same runtime metadata.

Changes:

- New file `canvas/src/lib/runtimeProfiles.ts` owns:
  * `RuntimeProfile` type — structural shape, every field optional so
    new runtimes can partially-fill without breaking consumers.
  * `DEFAULT_RUNTIME_PROFILE` — 2-min default floor (docker-fast).
  * `RUNTIME_PROFILES` — named overrides (currently: hermes 12 min).
  * `WorkspaceRuntimeOverrides` — interface for server-provided
    per-workspace overrides, so operators can tune via template
    manifest / workspace metadata without a canvas release.
  * `getRuntimeProfile()` — resolver with
    overrides → profile → default priority.
  * `provisionTimeoutForRuntime()` — convenience wrapper.

- `ProvisioningTimeout.tsx` now delegates to the profile module.
  `DEFAULT_PROVISION_TIMEOUT_MS` re-exported for legacy test importers.

- Tests: 16/16 (up from 9 before the first fix). Adds pinning for:
  * overrides > profile > default priority chain
  * "every entry in RUNTIME_PROFILES resolves to a number" contract
  * backward-compat export

Adding a new slow runtime is now one table entry in
`canvas/src/lib/runtimeProfiles.ts` with a mandatory `WHY` comment.
Moving to server-driven profiles later is a ~10-line change (the
resolver already threads WorkspaceRuntimeOverrides through).
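
For illustration, the overrides → profile → default chain could look roughly like the sketch below. The field name `provisionTimeoutMs` and the exact signatures are assumptions made for this example; they are not copied from `canvas/src/lib/runtimeProfiles.ts`.

```typescript
// Sketch only: shapes below are assumed, not the real module contents.
interface RuntimeProfile {
  provisionTimeoutMs?: number; // every field optional, per the description above
}

// Assumed to share the profile shape so overrides can be partial too.
type WorkspaceRuntimeOverrides = RuntimeProfile;

const DEFAULT_RUNTIME_PROFILE: Required<RuntimeProfile> = {
  provisionTimeoutMs: 120_000, // 2-min floor for fast docker runtimes
};

const RUNTIME_PROFILES: Record<string, RuntimeProfile> = {
  // WHY: hermes cold boots are much slower than docker runtimes.
  hermes: { provisionTimeoutMs: 720_000 }, // 12 min
};

function getRuntimeProfile(
  runtime?: string,
  overrides?: WorkspaceRuntimeOverrides,
): Required<RuntimeProfile> {
  const named: RuntimeProfile =
    runtime !== undefined && Object.prototype.hasOwnProperty.call(RUNTIME_PROFILES, runtime)
      ? RUNTIME_PROFILES[runtime]
      : {};
  return {
    // Priority: server-provided override, then named profile, then default.
    provisionTimeoutMs:
      overrides?.provisionTimeoutMs ??
      named.provisionTimeoutMs ??
      DEFAULT_RUNTIME_PROFILE.provisionTimeoutMs,
  };
}

function provisionTimeoutForRuntime(
  runtime?: string,
  overrides?: WorkspaceRuntimeOverrides,
): number {
  return getRuntimeProfile(runtime, overrides).provisionTimeoutMs;
}
```

Under these assumptions, an unknown runtime falls through to the 2-min default, and a per-workspace override wins over the hermes table entry.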

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 11:48:39 -07:00
Hongming Wang
9597d262ca fix(canvas): runtime-aware provisioning-timeout threshold
Hermes workspaces cold-boot in 8-13 min (ripgrep + ffmpeg + node22 +
hermes-agent source build + Playwright + Chromium ~300MB). The canvas's
2-min hardcoded "Provisioning Timeout" warning fired at ~2min and told
users their workspace was "stuck" while it was still mid-install. Users
hit Retry, triggering fresh cold boots and cancelling healthy workspaces.

User-facing symptom (reported 2026-04-24 18:35Z): hermes workspace showed
"has been provisioning for 3m 15s — it may have encountered an issue"
with Retry + Cancel buttons, while the EC2 was installing node_modules.

Fix:
- Keep DEFAULT_PROVISION_TIMEOUT_MS = 120_000 (2min) — correct for fast
  docker runtimes (claude-code, langgraph, crewai) where cold boot is
  30-90s.
- Add RUNTIME_TIMEOUT_OVERRIDES_MS = { hermes: 720_000 } (12min).
  Aligns with tests/e2e/test_staging_full_saas.sh's
  PROVISION_TIMEOUT_SECS=900 (15min) so UI warns shortly before the
  backend itself gives up.
- New timeoutForRuntime() resolves the base; per-node lookup in the
  check-timeouts interval so a mixed batch (1 hermes + 2 langgraph) uses
  the right threshold for each.
- timeoutMs prop is now optional. Undefined → per-runtime lookup; a
  number → forces a single threshold for every workspace (tests use this
  for deterministic behavior).

Tests: 4 new cases pinning the runtime-aware resolution, including a
guard that catches future regressions that would weaken hermes's budget.
Existing tests unchanged (they import DEFAULT_PROVISION_TIMEOUT_MS which
still exports 120_000).

13/13 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 11:46:09 -07:00
Molecule AI Core Platform Lead
49fc97e6e4 refactor(canvas): remove unused EmbeddedTeam component from WorkspaceNode
EmbeddedTeam was defined in WorkspaceNode.tsx but had no call site —
TeamMemberChip (which is called directly) covers the same rendering
responsibility. The function was stranded after a prior refactor and
was flagged by github-code-quality on PR #1989 (merged 2026-04-24T14:09Z
without this cleanup because the token died before push).

Removes 25 lines of dead code. MAX_NESTING_DEPTH is kept — it is used
by TeamMemberChip at line 498.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 18:30:36 +00:00
1126d7b66d fix(canvas/a11y): add type=button to tab toolbar and settings buttons
WCAG 4.1.2 / bug #1669 follow-up — fixing remaining buttons missing
type="button" across tab components and settings.

Files changed:
- FilesTab/FilesToolbar.tsx (5 buttons): +New, Upload, Export,
  Clear, ↻ (all had onClick, no type=button)
- config/secrets-section.tsx (7 buttons): Remove, Edit/Update/Cancel
  across 2 SecretRow variants + add-variable form
- config/form-inputs.tsx (2 buttons): tag remove ×, section collapse toggle
- ActivityTab.tsx (1 button): row expand toggle
- TracesTab.tsx (1 button): Refresh
- settings/UnsavedChangesGuard.tsx (2 buttons): Keep editing, Discard
  (Radix AlertDialog asChild wrappers — type=button prevents form submit)

Total: 18 buttons fixed across 6 files. 934/934 tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 14:41:35 +00:00
Hongming Wang
6b62391e5d
Merge pull request #1989 from Molecule-AI/fix/canvas-a11y-final
fix(canvas/a11y): type=button campaign + aria fixes (batch 1-3)
2026-04-24 14:05:27 +00:00
Molecule AI Core Platform Lead
4db7f6f024 fix(canvas): define MAX_NESTING_DEPTH constant in WorkspaceNode.tsx
TeamMemberChip used MAX_NESTING_DEPTH to cap recursive sub-agent
rendering at depth 3, but the constant was never declared — causing
a TypeScript build error ('Cannot find name MAX_NESTING_DEPTH') that
blocked Canvas CI on PR #1989.

Add the constant above EmbeddedTeam with a doc comment explaining its
purpose (guards against circular parentId cycles + readability cap).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 12:52:28 +00:00
9f52ee1777 fix(canvas/WorkspaceNode.tsx): add missing useMemo import
CI failure: "Cannot find name 'useMemo'" at line 363.
useMemo was called but not imported — likely dropped during refactor.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 12:40:52 +00:00
6a96641c37 fix(canvas/a11y): add type="button" to remaining canvas component buttons (batch 3)
WCAG 4.1.2 / bug #1669 follow-up — final batch completing the campaign.
Added type="button" to all buttons missing it across 14 canvas components.

Files changed (14, all additions):
- Toolbar.tsx: Stop All, Restart All, A2A toggle, Audit shortcut, Quick help, Search shortcut, Help close (7)
- MemoryInspectorPanel.tsx: scope tabs, refresh, search clear ×2, expand, delete (6)
- TemplatePalette.tsx: org refresh, toggle, Import Agent, org import, deploy template, palette refresh (6)
- ProvisioningTimeout.tsx: Retry, Cancel Request, View Logs, Keep, Remove Workspace (5)
- ConsoleModal.tsx: close, Copy output, Close (3)
- OnboardingWizard.tsx: Skip guide, action, Next (3)
- ConversationTraceModal.tsx: close ×2 (2)
- WorkspaceNode.tsx: Restart banner, Extract from team (2)
- CommunicationOverlay.tsx: toggle, close panel (2)
- Toaster.tsx: dismiss ×2 (2)
- SearchDialog.tsx: search result button (1)
- TermsGate.tsx: accept (1)
- ErrorBoundary.tsx: Reload (1)
- BundleDropZone.tsx: import trigger (1)

Total campaign (batches 1-3): 27 + 42 = 69 buttons fixed across 24 components.
All 477 canvas vitest tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 12:40:52 +00:00
32a3b84147 fix(canvas/a11y): add type="button" to MissingKeysModal, ContextMenu, CreateWorkspaceDialog tier radio
WCAG 4.1.2 / bug #1669 follow-up — modal + menu buttons need explicit type="button".

- MissingKeysModal.tsx: Save, Open Settings Panel, Cancel Deploy, Add Keys+Deploy (4)
- ContextMenu.tsx: all menuitem buttons (1 — inner menu items loop)
- CreateWorkspaceDialog.tsx: tier radio buttons in dialog (1)

56 vitest tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 12:40:52 +00:00
e14b6d2de4 fix(canvas/a11y): add type="button" to BatchActionBar, EmptyState, SidePanel, CreateWorkspaceDialog
WCAG 4.1.2 / bug #1669 follow-up — buttons without explicit type="button"
default to type="submit", risking accidental form submission.

Added type="button" to all action buttons in:
- BatchActionBar.tsx: Restart All, Pause All, Delete All, Clear Selection (4)
- EmptyState.tsx: template deploy buttons + Create blank (all)
- SidePanel.tsx: close panel, tab switches, Restart Now (3)
- CreateWorkspaceDialog.tsx: open trigger, Cancel, Create (3)

Total this commit: +12 insertions / 2 deletions across 4 files.
Prior commit (c5590c0c): ConfirmDialog + AuditTrailPanel + DeleteCascadeConfirmDialog (+7).
Combined batch: 19 buttons fixed across 7 components.

86 vitest tests pass across all touched test files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 12:40:52 +00:00
2ff15a38a8 fix(canvas/a11y): add type="button" to ConfirmDialog, AuditTrailPanel, DeleteCascadeConfirmDialog
WCAG 4.1.2 / bug #1669 follow-up — buttons without explicit type="button"
default to type="submit", which triggers accidental form submission when
the button is rendered inside a <form> element.

Added type="button" to all action buttons in:
- ConfirmDialog.tsx: Cancel + confirm buttons (lines 123, 130)
- DeleteCascadeConfirmDialog.tsx: Cancel + Delete All buttons (lines 145, 151)
- AuditTrailPanel.tsx: filter buttons, refresh, load-more (lines 140, 154, 194)

All 51 component tests pass (5 ConfirmDialog, 46 AuditTrailPanel+DeleteCascadeConfirmDialog).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 12:40:52 +00:00
e355f447bb fix(canvas/a11y): add aria-hidden to 6 decorative SVGs + aria-label to OrgTokensTab input
WCAG 1.3.1 — inputs without visible text labels need aria-label.
WCAG 4.1.2 — decorative SVGs inside interactive elements need
aria-hidden so screen readers ignore icon content.

Changes:
- ErrorBoundary: warning triangle SVG — aria-hidden=true
- Toolbar: 4 decorative SVGs — aria-hidden=true
  (Stop All square, Restart Pending arrow, Search magnifier, Help circle)
- SettingsButton: gear icon SVG — aria-hidden=true (parent has aria-label)
- RevealToggle: EyeIcon + EyeOffIcon SVGs — aria-hidden=true
- OrgTokensTab: name input — aria-label="Organization API key label"

Bonus fix: removed duplicate title/aria-label props on Restart All button.

Note: ConsoleModal and DeleteCascadeConfirmDialog do not exist in current
staging (aae0c81) — tab trapping fix inapplicable to this codebase.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 12:40:52 +00:00
59feb65252 fix(canvas/a11y): add type=button to 24 buttons across DetailsTab, ConfigTab, FilesTab, MemoryTab
WCAG 4.1.2 / bug #1669 follow-up — DetailsTab, ConfigTab, FilesTab, and
MemoryTab had buttons without explicit type="button", causing accidental
form submission in any surrounding <form> context.

Changes:
- DetailsTab (9 buttons): Save, Cancel (edit), Restart/Retry, Edit,
  View console output, peer select, Confirm Delete, Cancel (delete), Delete Workspace
- ConfigTab AgentCardSection (3): Save, Cancel, Edit Agent Card
- ConfigTab footer (3): Save & Restart, Save, Reload
- ConfigTab textareas (2): aria-label added to Agent Card JSON editor and Raw YAML editor
- FilesTab (4): Delete All, Cancel, Delete, Cancel
- MemoryTab (11): Expand/Collapse, Open, Expand (collapsed state), Advanced,
  Refresh, Add, Save, Cancel (add form), expand entry, Delete entry, Show

Total: 32 interactive elements corrected across 4 tab components.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 12:39:43 +00:00
Hongming Wang
46fbffb95b fix(canvas/e2e): raise staging-setup deadline 15 min → 20 min
Matches tests/e2e/test_staging_full_saas.sh's 20-min budget (#1930).
Canvas E2E was still stuck at 900s (15 min), which regularly flakes on
tenant cold boots in the 12-15 min range, especially on staging where
workspace-server image pulls + AMI bootstrapping add 3-5 min vs prod.

Concrete blocker: 2026-04-24 staging→main sync (#1981) kept failing on
"tenant provision: timed out after 900s" in canvas/e2e/staging-setup.ts
despite the actual sync E2E going green. Canvas-side timeout was
strictly tighter than the sync-side timeout.

Also raises WORKSPACE_ONLINE_TIMEOUT_MS to 20 min to cover the case
where the workspace EC2 is provisioned but hermes cold-install (apt +
uv + hermes-agent clone + gateway boot) takes longer than the original
10-min budget — matches the 20-min workspace deadline in SaaS E2E.

No behavior change when things are fast. Just covers the tail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 01:26:13 -07:00
molecule-ai[bot]
82d15f4d33
Merge pull request #1859 from Molecule-AI/content-marketer/phase34-launch-post-v2
docs(marketing): Phase 34 launch post v2 — governance-first + tool trace
2026-04-24 07:05:54 +00:00
Hongming Wang
0ef5dad1b1
Merge pull request #1993 from Molecule-AI/fix/auth-redirect-loop-regression-tests
test(auth): add regression tests for redirect loop guards
2026-04-24 06:57:12 +00:00
Hongming Wang
8c80175cd8 fix(canvas): subtree-aware layout + org-import reliability + UX polish
Five tightly-related fixes surfaced while stress-testing org-template
imports (Legal Team, Molecule Company, etc.) on a running control plane:

1) Org import was silently failing — INSERT wrote `collapsed` into the
   `workspaces` table but that column lives on `canvas_layouts`
   (005_canvas_layouts.sql). Every import returned 207 with 0 rows
   created, which `api.post` treated as success → green "Imported"
   toast + empty canvas. Moved the write to canvas_layouts; updated
   the workspace_crud PATCH path to UPSERT there too; refreshed the
   test mock. Added a client-side assertion that throws on
   2xx-with-`error`-body so future partial-failures surface a red
   toast rather than lying about success.

2) Multi-level nested layout was collision-prone: children that were
   themselves parents (CTO → Dev Lead → 6 engineers) got the same
   leaf-sized grid slot as leaf siblings and clipped into each other.
   Added post-order `sizeOfSubtree` + sibling-size-aware
   `childSlotInGrid` on both the Go server and the TS client (kept in
   sync). `buildNodesAndEdges` now uses subtree sizes for both parent
   dimensions and the rescue heuristic. `setCollapsed` on expand now
   reads each child's actual rendered width/height instead of the
   leaf-count formula — a regression test covers the CTO/Dev Lead
   scenario.

3) Provisioning-timeout banner was unusable during large imports: a
   30-workspace tree triggered 27 simultaneous "stuck" warnings 2
   minutes in (server-side pacing plus a provision concurrency of 3
   guarantee that tail items legitimately wait longer). The threshold
   now scales with the concurrent count (base + 45s per queue slot
   beyond concurrency), and each banner gets a Dismiss (×) button.

4) Auto pan-and-zoom on org ready: after the last workspace flips out
   of `provisioning`, canvas now fitView's with a 1.2s animation,
   0.25 padding, `maxZoom: 0.8` and `minZoom: 0.25`. Without the zoom
   caps fitView was hitting the component's maxZoom=2 on small trees
   and zooming in instead of out.

5) Toolbar was visually busy: `+ N sub` count wrapped onto a second
   row on narrow viewports; status dot and workspace total were in
   separate border-delimited cells. Merged into one segment with
   `whitespace-nowrap`; A2A / Audit / Search / Help collapsed to
   icon-only 28px buttons with tooltip + aria-label (Figma/Linear
   pattern). Stop All / Restart Pending keep text — they're urgent.
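
As a rough sketch of the scaled threshold from item 3, assuming "queue slot beyond concurrency" means the number of provisioning workspaces in excess of the concurrency limit. The constant and function names here are hypothetical, chosen for the example only.

```typescript
// Sketch only: names and the exact queue interpretation are assumptions.
const BASE_TIMEOUT_MS = 120_000; // base threshold (2 min)
const PER_QUEUED_SLOT_MS = 45_000; // extra budget per queued workspace
const PROVISION_CONCURRENCY = 3; // server provisions 3 at a time

function scaledProvisionTimeoutMs(
  provisioningCount: number,
  baseMs: number = BASE_TIMEOUT_MS,
): number {
  // Workspaces beyond the concurrency limit are waiting in the queue.
  const queued = Math.max(0, provisioningCount - PROVISION_CONCURRENCY);
  return baseMs + queued * PER_QUEUED_SLOT_MS;
}
```

With a 30-workspace import this reading gives 27 queued slots, so the warning would not fire until well after the tail items have had a chance to start.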

Also:
- `api.{get,post,...}` accept an optional `{ timeoutMs }` so callers
  that hit intentionally-slow endpoints (org import paces 2s between
  siblings) don't trip the 15s default and report false aborts.
- `WorkspaceNode` clamps role text to 2 lines so verbose descriptions
  don't unboundedly grow card height and break the grid.
- `PARENT_HEADER_PADDING` bumped 44→130 to clear name + runtime +
  2-line role + the currentTask banner that appears during the
  initial-prompt phase.

Tests: 930 canvas tests + full Go handler suite pass. Added
regressions for (i) 207 partial-success surfacing as throw, and
(ii) setCollapsed sizing with nested-parent children.
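
The client-side assertion from item 1 might be sketched as follows. `assertNoErrorBody` is a hypothetical helper name, and the rule that any non-empty `error` field counts as failure is an assumption about the response shape.

```typescript
// Sketch only: helper name and body shape are assumptions.
function assertNoErrorBody(status: number, body: unknown): void {
  const is2xx = status >= 200 && status < 300;
  // Non-2xx responses are assumed to be handled by the normal error path.
  if (!is2xx || typeof body !== "object" || body === null) return;

  const error = (body as { error?: unknown }).error;
  if (error !== undefined && error !== null && error !== "") {
    // A 207 that carries an `error` field is a partial failure, not success.
    throw new Error(`HTTP ${status} returned an error body: ${String(error)}`);
  }
}
```

Calling this before treating a response as successful would turn the silent 207 case into a thrown error that the caller can show as a red toast.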

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 23:48:29 -07:00
e9be12210f test(auth): add regression tests for redirect loop guards
AuthGate now skips session fetch for /cp/auth/* paths, and
redirectToLogin guards against re-setting window.location when
already on an auth path. Both guards had no test coverage —
a future refactor could silently reintroduce the redirect loop.

Added:
- AuthGate.test.tsx: 2 cases covering /cp/auth/login and
  /cp/auth/signup path skipping (no fetchSession call, no
  redirectToLogin call, children rendered)
- auth.test.ts: 2 cases covering redirectToLogin early return
  for /cp/auth/login and /cp/auth/signup paths

Fixes: Molecule-AI/molecule-core#1541

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 06:30:35 +00:00
Hongming Wang
d53583f9c6 Merge remote-tracking branch 'origin/staging' into fix/restore-quickstart-plus-hotfixes 2026-04-23 21:04:55 -07:00
Hongming Wang
2d6ff11c4e fix(canvas): re-sort parents-before-children after nest mutation
React Flow requires parent nodes to appear before their children in
the nodes array. When they don't, it logs "Parent node {id} not
found. Please make sure that parent nodes are in front of their
child nodes in the nodes array" and — more importantly — renders
the child at canvas-absolute coords instead of parent-relative,
flashing it far outside the parent.

topology's buildNodesAndEdges already enforced this at hydrate, but
nestNode + batchNest weren't re-sorting after mutating parentId.
After its first drag, a freshly nested child often ended up at the
wrong screen position because its new parent sat later in the
array than the child did.

Extract sortParentsBeforeChildren() into canvas-topology as a
reusable DFS visit; call it at the tail of both nestNode's set()
and batchNest's commit set(). 923 tests still green — no behaviour
change beyond eliminating the warning and the position flash.
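
A minimal sketch of a parents-before-children DFS, assuming nodes carry `id` and an optional `parentId`. This is not the canvas-topology implementation; the `NodeLike` type is introduced here for the example.

```typescript
// Sketch only: NodeLike is a stand-in for the real React Flow node type.
interface NodeLike {
  id: string;
  parentId?: string;
}

function sortParentsBeforeChildren<T extends NodeLike>(nodes: T[]): T[] {
  const byId = new Map<string, T>();
  nodes.forEach((n) => byId.set(n.id, n));

  const visited = new Set<string>();
  const sorted: T[] = [];

  const visit = (node: T): void => {
    // Skip nodes already emitted; this also stops parentId cycles.
    if (visited.has(node.id)) return;
    visited.add(node.id);
    const parent = node.parentId !== undefined ? byId.get(node.parentId) : undefined;
    if (parent) visit(parent); // emit every ancestor first
    sorted.push(node);
  };

  nodes.forEach(visit);
  return sorted;
}
```

Nodes that are already in a valid order keep their relative positions, so calling this after every nest mutation should not reshuffle an array that needs no change.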

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 21:00:40 -07:00
Hongming Wang
2a8977c946 fix(canvas): cancel-nest also shrinks the parent back
Canceling the nest/extract dialog restored the child's position but
left the parent card at its auto-grown size. growParentsToFitChildren
fires on drag-stop to fit a then-outside child; when the drag is
subsequently cancelled, the parent keeps that grown width/height
forever because the grow pass is grow-only.

Strip width/height from the ex-parent alongside the child position
restore in cancelNest — React Flow re-measures from CSS, parent
collapses back to its natural size. Same trick nestNode already
uses for the un-nest path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 20:56:08 -07:00
Hongming Wang
09053dfdeb fix(canvas): cancel-nest restores position; un-nest shrinks parent
Two follow-up polish items for drag-and-nest:

1. Cancelling the "Extract from team?" dialog now snaps the
   dragged card back to where the drag started. Before, a user
   who dragged a child out, saw the confirm dialog, then clicked
   Cancel ended up with the card stranded outside the parent at
   its drop-point position — which also got persisted via
   savePosition on drag-stop. Now onNodeDragStart captures the
   pre-drag position + parent, and cancelNest restores both the
   RF node position and fires savePosition with the absolute
   pre-drag coords so reload matches.

2. Un-nesting now clears the ex-parent's explicit width/height
   in the nodes array. growParentsToFitChildren is grow-only so
   it could never shrink the parent back down after a child
   left; the card stayed at its auto-grown size with empty
   space. Stripping width/height lets React Flow re-measure from
   the card's own min-width / min-height CSS, so the parent
   visually shrinks to fit whatever children remain.

923 canvas tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 20:52:28 -07:00
Hongming Wang
512fdfd59d fix(canvas): plain drag out of parent un-nests again
Un-nest used to require holding Alt (or Cmd to force-detach). That
was too conservative — when a user dragged a child clearly outside
its parent's bbox, nothing happened on release, because the default
branch soft-clamped back and only the Alt branch actually opened
the "Extract?" confirm. Matches the exact bug the user just flagged
("I can put agents in other agent, but when I drag it out, it does
not move out").

New rules:
 * Past the 20 % hysteresis → confirm un-nest. Plain drag, no
   modifier. This is what most users expect (Miro / Figma behave
   the same way — drag outside the frame and the shape leaves it).
 * Inside or within 20 % of the edge → soft-clamp back inside.
   Guards against twitchy releases that momentarily overshoot the
   edge by a few pixels.
 * Cmd / Ctrl → force un-nest regardless of overlap. Escape-hatch
   for when the user dragged within the hysteresis zone but really
   wants out.
 * Dropping onto a different parent → nest there (unchanged).

Alt is no longer a required modifier for un-nesting. It stays
unbound, with no gesture meaning unless we re-bind it later.
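
One possible reading of these rules, sketched as a pure decision function. How the 20 % hysteresis is measured is not stated here; this example assumes it is the child's overshoot past the parent's edge as a fraction of the child's own size. `resolveDrop`, `DropAction` and the option names are hypothetical.

```typescript
// Sketch only: the overshoot definition and all names are assumptions.
interface Box {
  x: number;
  y: number;
  width: number;
  height: number;
}

type DropAction = "nest-into-target" | "unnest" | "clamp-back";

const DETACH_FRACTION = 0.2;

// Largest overshoot past any parent edge, relative to the child's size.
function overshootFraction(child: Box, parent: Box): number {
  const left = parent.x - child.x;
  const right = child.x + child.width - (parent.x + parent.width);
  const top = parent.y - child.y;
  const bottom = child.y + child.height - (parent.y + parent.height);
  const horizontal = Math.max(0, left, right) / child.width;
  const vertical = Math.max(0, top, bottom) / child.height;
  return Math.max(horizontal, vertical);
}

function resolveDrop(
  child: Box,
  parent: Box,
  opts: { forceDetach: boolean; overOtherParent: boolean },
): DropAction {
  if (opts.overOtherParent) return "nest-into-target"; // unchanged behaviour
  if (opts.forceDetach) return "unnest"; // Cmd / Ctrl escape hatch
  return overshootFraction(child, parent) > DETACH_FRACTION ? "unnest" : "clamp-back";
}
```

Both boxes are assumed to be in the same coordinate space. The confirm dialog is left out of the sketch; it only decides which branch a release would take.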

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 20:48:38 -07:00
Hongming Wang
f2a4b6e0d3 fix: dev-mode bypass for IP rate limiter + 429 retry on GET
The 600-req/min/IP bucket is sized for SaaS where each tenant has
a distinct client IP. On a local Docker setup every panel shares
one IP — hydration (/workspaces + /templates + /org/templates +
/approvals/pending) plus polling (A2A overlay + activity tabs +
approvals + schedule + channels + audit trail) can burst past the
bucket inside a minute, blanking the canvas with 429s. The user
reported it after dragging workspaces. Dragging itself only saves on
release (savePosition in onNodeDragStop), but the always-on polling
stacked on top of startup hydration tripped the limit.

Two-layer fix:

Server: RateLimiter.Middleware short-circuits when isDevModeFailOpen
is true (MOLECULE_ENV=development + empty ADMIN_TOKEN), matching
the Tier-1b hatch already applied to AdminAuth, WorkspaceAuth, and
discovery. SaaS production keeps the bucket.

Client: api.ts auto-retries a single 429 on idempotent GET requests,
waiting the server-provided Retry-After (capped at 20s). Mutations
(POST/PUT/PATCH/DELETE) never auto-retry to avoid double-applying.
Users on SaaS hitting a legitimate rate-limit spike get one
transparent recovery instead of an immediately-blank Canvas.
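
The client-side retry rule could be sketched as a pure helper that decides whether and how long to wait. `retryDelayMs` is a hypothetical name, the 1s fallback for a missing or unparsable header is an assumption, and only the delta-seconds form of Retry-After is handled.

```typescript
// Sketch only: not the api.ts implementation.
const MAX_RETRY_AFTER_MS = 20_000; // cap from the description above
const FALLBACK_DELAY_MS = 1_000; // assumed default when the header is unusable

// Returns the delay before the single retry, or null when no retry applies.
function retryDelayMs(
  method: string,
  status: number,
  retryAfter: string | null,
  attempt: number,
): number | null {
  if (method.toUpperCase() !== "GET") return null; // never retry mutations
  if (status !== 429 || attempt > 0) return null; // one retry, 429 only

  const trimmed = retryAfter === null ? "" : retryAfter.trim();
  const seconds = Number(trimmed);
  const requested =
    trimmed !== "" && Number.isFinite(seconds) && seconds >= 0
      ? seconds * 1000
      : FALLBACK_DELAY_MS;
  return Math.min(requested, MAX_RETRY_AFTER_MS);
}
```

A caller would sleep for the returned number of milliseconds and re-issue the GET once; a `null` result means the original response is surfaced as-is.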

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 20:44:09 -07:00
Hongming Wang
286dcbfd1e fix(canvas,org): collapse org-imported parents on first paint
Importing a 15-workspace org template dropped every child as a
freely-positioned card into its parent's coordinate space. Parents
with 5-10 kids had the kids spill below the parent's initial min
size, producing the "ugly default" layout the user just flagged —
a mess of overlapping cards the moment the import completed.

Fix: every workspace in an org-template import that HAS children
is inserted with `collapsed = true`. Leaf workspaces stay
expanded (nothing to hide). The canvas renders a collapsed
parent as a compact header-only card with its "N sub" badge —
visually identical to the pre-refactor default the user asked for.

Double-click on a collapsed parent now EXPANDS it (flipping
`collapsed` locally + persisting via PATCH) so the user can drill
in to see the subtree. Only once expanded does a second
double-click zoom-to-team, matching the prior behaviour.

Leaf-first creation order stays the same; the collapsed flag
just means "render compact" not "hide from API".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 20:36:55 -07:00
Hongming Wang
507696d88a fix(canvas,server): address review findings on 3f11df03
Five review findings from the 3f11df03 six-bug commit:

1. Add TestPeers_DevModeFailOpen_{Allows,ClosedWhenAdminTokenSet,
   ClosedInProduction} covering all three gating states for the
   security-sensitive dev-mode hatch the prior commit added to
   /registry/:id/peers. Previously shipped untested — a future
   refactor could have silently inverted polarity or removed the
   gate. New tests pin the contract:
     * MOLECULE_ENV=development + ADMIN_TOKEN="" → allow bearerless
     * MOLECULE_ENV=development + ADMIN_TOKEN set → require token
     * MOLECULE_ENV=production                    → require token

2. ConfigTab handleSave diffs against the RAW parsed YAML / form
   config instead of the DEFAULT_CONFIG-merged shape. The previous
   code would silently PATCH tier=1 to the DB when a user deleted
   the `tier:` line in raw mode (the default-merge substituted 1).
   Now: only fields the user actually typed participate in the
   diff. Type guards (typeof === "number" / "string") prevent
   coercion surprises on malformed YAML.

3. ConfigTab model-save failure no longer lies "Saved". The
   /workspaces/:id/model PATCH can reject when the runtime doesn't
   support the chosen model; previously we caught + console.warn'd
   + showed green Saved, and the user watched the model revert on
   next reload with no explanation. Now the save path collects a
   `modelSaveError` and surfaces it via setError with a partial-
   success message ("Other fields saved, but model update failed:
   …") so the user sees why.

4. ChannelsTab now surfaces BOTH channels-fetch and adapters-fetch
   failures, distinguishing them in the error text ("Failed to
   load connected channels and platforms — try refreshing").
   Previously only an adapters failure was visible; a channels
   failure left the user with an apparently-empty list and no
   indication the API was unreachable.

5. ChatTab panels drop the redundant aria-hidden attribute. The
   `hidden`/`flex` Tailwind class already sets display:none, which
   removes the node from the accessibility tree on its own; the
   extra aria-hidden invited WAI-ARIA lint warnings if a focusable
   descendant ever landed inside an inactive panel.

Tests: 923 canvas + full Go handler suite pass. 3 new Go tests.
No behaviour change on the five prior fixes — this commit tightens
their edges per the independent review.
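
The three gating states pinned by the new Go tests can be mirrored in a short TypeScript sketch. The real predicate lives in the Go middleware package; this version takes the environment as a plain object purely for illustration.

```typescript
// Sketch only: a TypeScript mirror of the gate described in finding 1.
type Env = Record<string, string | undefined>;

function isDevModeFailOpen(env: Env): boolean {
  // Fail open only in development AND only when no admin token is configured.
  return env.MOLECULE_ENV === "development" && (env.ADMIN_TOKEN ?? "") === "";
}
```

Setting an admin token, or running with any other `MOLECULE_ENV`, closes the hatch under this reading.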

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 20:29:44 -07:00
Hongming Wang
3f11df031c fix: six UX bugs (peers auth, scroll, chat tabs, config persist, + visibility)
Six bugs reported from a live session — all shippable in one commit:

1. Peers tab 401 on local Docker. The /registry/:id/peers endpoint
   demands a workspace-scoped bearer token (validateDiscoveryCaller)
   which the canvas session doesn't hold. Added the same Tier-1b
   dev-mode fail-open hatch that AdminAuth and WorkspaceAuth already
   use — gated by MOLECULE_ENV=development + empty ADMIN_TOKEN, so
   SaaS production stays strict. Exported IsDevModeFailOpen from the
   middleware package for the handler layer to reuse.

2. Org Templates list unscrollable. OrgTemplatesSection was rendered
   in the TemplatePalette footer — a div without overflow — so when
   it expanded to 15+ entries the list extended past the viewport
   with no scroll. Moved it to the top of the flex-1 overflow-y-auto
   container. Tall lists now scroll naturally.

3. Chat tab: "My Chat" and "Agent Comms" rendered stacked instead
   of switching. HTML `hidden` attribute was being overridden by
   Tailwind's `flex` class (display: flex beats the attribute),
   so both tabpanels rendered concurrently. Swapped to a conditional
   Tailwind `hidden`/`flex` class so the inactive panel is
   display:none with proper CSS specificity.

4. Hermes Config form never persists. handleSave wrote config.yaml
   but name / tier / runtime / model all live on the workspace row
   (or the dedicated /workspaces/:id/model endpoint) — the form
   edited in-memory, the request returned 200, the next reload
   wiped everything back. Hermes + external runtimes manage their
   own config inside the container anyway, so writing config.yaml
   is a no-op for them; skip it. Always diff and PATCH the DB-backed
   fields that actually changed.

5. Channels "+ Connect" dropdown empty on first open. ChannelsTab's
   load() used Promise.all with a silent catch — if EITHER the
   channels or adapters fetch failed, both setters were skipped
   with no error visible. Switched to Promise.allSettled so each
   endpoint settles independently, and the adapters failure now
   surfaces via the top-level error state.

6. Plugin registry always "No plugins in registry". Same silent
   catch pattern in SkillsTab.tsx — load errors for /plugins,
   /plugins/sources, and /workspaces/:id/plugins swallowed without
   logging. Replaced the empty catches with console.warn so future
   failures are at least visible in devtools.

Tests: 923 passing (unchanged). Go handler tests pass. Server
rebuilt and running with the peers-auth + collapsed-persistence
fixes (pid 15875).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 20:18:30 -07:00
Molecule AI App & Docs Lead
3715c06e0b fix(canvas): remove stale firstInputRef useEffect from AllKeysModal
AllKeysModal already handles focus via autoFocus={index === 0} on the
first input and a separate title-focus effect. The orphaned useEffect
referencing firstInputRef (declared only in ProviderPickerModal) caused
a TypeScript build error: "Cannot find name 'firstInputRef'".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 03:11:36 +00:00
746cb22855 fix(canvas/tests): normalize useCanvasStore mock pattern in test files
Standardize the mock for useCanvasStore to always expose getState()
(used by production ContextMenu to filter parent nodes). Applies the
same Object.assign-wrapping pattern introduced in #1744 to:
- ClaudeSettings.test.tsx
- tabs.a11y.test.tsx
- ContextMenu.keyboard.test.tsx (mockStore shape alignment)
2026-04-24 03:10:18 +00:00
680f1f50f2 fix(canvas/a11y): restore aria-hidden on backdrop div after cherry-pick conflict
Cherry-pick from #1744 left the backdrop div without aria-hidden="true"
(the outer dialog div got it instead). Re-apply aria-hidden="true" to
the backdrop div so screen readers skip the clickable overlay layer.

Also revert test assertion from bg-black → bg-black/70 to match the
exact class applied to the backdrop div.
2026-04-24 03:10:18 +00:00
Hongming Wang
4fd7f1e84c fix(canvas): tighten rescue + cap toast + cover paths with tests
Three follow-up review findings from the c2b2e13a review:

1. Rescue heuristic uses pure bbox-non-overlap. The previous
   `position.x < 0` branch rescued any child whose parent was
   later dragged past it, even when the layout was clearly
   recoverable (e.g. relative -40, child still overlaps parent).
   New rule: rescue iff the child's bbox has zero overlap with
   the parent's bbox — self-calibrating, scales with user-resized
   parents, catches screenshot-case and legacy huge-positive data.

2. Toast caps failed-name list at 3 and appends "and N more".
   Stops a 50-node partial failure from overflowing the toast
   container.

3. Cycle guard on selection-roots walk in batchNest. Corrupt
   parentId data can no longer send the loop into an infinite
   walk. Cheap defensive guard: one Set per selected node.

Tests added (923 total, up from 918):
 * canvas-topology.test: 4 rescue scenarios — screenshot case
   (zero-overlap rescue), negative drift kept, huge-positive
   rescued, user-resized layout kept.
 * canvas.test: selection-roots filter on a 3-level chain.
 * workspace_crud test: PATCH {collapsed:true} runs the UPDATE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 20:08:14 -07:00
Hongming Wang
c2b2e13abe fix(canvas): address code-review findings on the Canvas refactor
Five issues surfaced in the review of 50b53784. Each was either a real
bug waiting to hit users or a silent failure mode.

1. Topology rescue no longer teleports user-resized children.
   Rescue was comparing against parentMinSize(childCount), so any
   child the user had placed in space the parent was resized into
   got snapped to the default grid on reload — undoing the layout.
   Now rescue fires only on obviously corrupt data: negative
   relative coords (legacy pre-nesting absolute positions that
   landed above/left of their assigned parent) or values past a
   MAX_PLAUSIBLE_OFFSET threshold. Children just-past the initial
   minimum are left alone.

2. batchNest now filters to selection-roots before planning.
   Previously selecting both A and A's descendant B and dragging
   into T yanked B out of A to become a sibling under T. Users
   reasonably expect the A subtree to move intact. The new pass
   drops any selected node whose ancestor is also selected —
   those follow their ancestor via React Flow's parent binding.

3. batchNest surfaces partial failure via showToast. Previously
   this was silent: if 2 of 5 PATCHes failed, the user saw 3 cards
   re-parented and 2 snapped back with no explanation. The toast
   now names the failed cards.

4. confirmNest closes the nest dialog BEFORE dispatching the async
   store action, so a second drag can't kick off a competing batch
   while the first is still in flight.

5. collapsed is now persisted. The Go workspace_crud.go Update
   handler ignored the `collapsed` field, so user-initiated
   collapse round-tripped to an expanded state on next hydrate.
   Added the PATCH branch (`UPDATE workspaces SET collapsed = ...`)
   so the state survives reload.
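
The rescue rule in item 1 can be sketched as a small predicate.
This is an illustration under assumptions: the message names
MAX_PLAUSIBLE_OFFSET but not its value, so the number below is a
placeholder, and RelPos and needsRescue are hypothetical names.

```typescript
// Placeholder value: the commit names the constant but not its magnitude.
const MAX_PLAUSIBLE_OFFSET = 10000;

// Child position relative to its parent's origin (hypothetical shape).
interface RelPos {
  x: number;
  y: number;
}

// Rescue only obviously corrupt data: negative relative coordinates or
// offsets past the plausibility threshold. Positions that merely exceed
// the parent's initial minimum size are left where the user put them.
function needsRescue(pos: RelPos): boolean {
  const negative = pos.x < 0 || pos.y < 0;
  const implausible =
    pos.x > MAX_PLAUSIBLE_OFFSET || pos.y > MAX_PLAUSIBLE_OFFSET;
  return negative || implausible;
}
```

The predicate ignores the parent's size altogether, so a child
placed in space the parent was resized into is never rescued.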

Nits cleaned:
 * Removed dead dragStartParentRef in useDragHandlers.
 * Swapped redundant `node.data as WorkspaceNodeData` casts for a
   named WorkspaceNode type alias.
 * Canvas.tsx SR-live region now reads n.parentId (matches
   MiniMap + RF's native field) instead of the mirror n.data.parentId.

Tests added (918 total, up from 915):
 * batchNest happy path — 2-root selection fires 2 combined PATCHes
   carrying parent_id + x + y, not 2×N sequential round-trips.
 * batchNest ancestor+descendant selection — subtree stays intact.
 * batchNest partial failure rollback — only the rejected nodes
   revert; successful ones stay committed.

Backend change is single-line (collapsed PATCH branch); all
workspace_crud Go tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 19:58:44 -07:00
dc9001835e fix(ConfigTab.hermes.test): remove unused fireEvent import 2026-04-24 02:55:51 +00:00
Hongming Wang
50b537849a refactor(canvas): split Canvas.tsx into hooks; parallelize batchNest
Two concerns in one commit (separate files, each self-contained):

## Canvas.tsx split (from ~680 to ~250 lines)

Canvas.tsx was holding drag gesture state + keyboard shortcuts +
viewport wiring + JSX. Each concern now lives in its own unit under
canvas/src/components/canvas/:

- dragUtils.ts          — pure: shouldDetach, clampChildIntoParent,
                          DETACH_FRACTION
- DropTargetBadge.tsx   — the floating "Drop into: <name>" label + the
                          dashed ghost preview at the target slot
- useDragHandlers.ts    — encapsulates onNodeDragStart / Drag / Stop,
                          findDropTarget hit-test, pendingNest state,
                          and confirmNest/cancelNest. Routes multi-
                          select drags through batchNest automatically.
- useKeyboardShortcuts  — Esc, Enter, Shift+Enter, Cmd+]/[, Z — one
                          window listener, one source of truth.
- useCanvasViewport     — pan-to-node + zoom-to-team CustomEvent
                          listeners and the debounced viewport save.

Canvas.tsx becomes a thin composition + JSX file. No behavioural
change; the refactor is covered by the existing 915 canvas tests.

## batchNest parallelization (2N round-trips → N, all in flight)

Previously nestNode fired two sequential PATCHes (parent_id then x/y)
and batchNest looped nestNode sequentially. For a 5-node selection on
a typical ~200ms link this was ~2s of serialized RPCs.

- nestNode now combines parent_id + x + y into ONE PATCH. The Go
  handler (workspace_crud.go Update) already reads all three from the
  same body — no backend change.
- batchNest rewritten: compute every re-parent plan against one
  snapshot, commit a single set(), then fire N PATCHes via
  Promise.allSettled in parallel. Per-node failures roll back only
  that node (others stay committed) — same semantics as the single-
  node path, just concurrent.
- The state math in the batch path also correctly shifts descendant
  zIndex by depthDelta when any re-parented node has a subtree.
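
A minimal sketch of the parallel commit step described above,
assuming a hypothetical PatchFn that sends one combined PATCH per
node. NestPlan and commitPlans are illustrative names, not the
store's actual API.

```typescript
// One re-parent plan per node: parent_id + x + y travel in a single PATCH.
interface NestPlan {
  id: string;
  parentId: string | null;
  x: number;
  y: number;
}

// Hypothetical transport: resolves on success, rejects on failure.
type PatchFn = (plan: NestPlan) => Promise<void>;

// Fire every PATCH concurrently and return the ids whose request was
// rejected, so the caller can roll back only those nodes.
async function commitPlans(
  plans: NestPlan[],
  patch: PatchFn,
): Promise<string[]> {
  // The async wrapper turns a synchronous throw into a rejection too.
  const results = await Promise.allSettled(plans.map(async (p) => patch(p)));
  return plans
    .filter((_, i) => results[i]?.status === "rejected")
    .map((p) => p.id);
}
```

Promise.allSettled never rejects, so one failed PATCH cannot hide
the outcome of the others; the returned ids can feed both the
per-node rollback and a failure toast.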

## Also

- canvas-topology.ts: reverted P3.12's opt-in rescue to the auto-
  rescue default. When a child's stored relative position would render
  it outside the parent bbox (the visual regression the user saw after
  collapse → reload — Hermes child drawn outside Claude Code Agent on
  first paint), the child is placed in the next default grid slot.
  The "Arrange Children" context command stays for bigger teams.

All 915 canvas tests pass. No backend changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 19:43:18 -07:00