Phase 1 of moving runtime UX knobs server-side. Builds the canvas
foundation: a workspace can carry its own provision_timeout_ms
(sourced server-side from a template manifest in a follow-up PR),
and ProvisioningTimeout's resolver respects it per-node.
Today the resolver had Props-level timeoutMs that applied to ALL
nodes — fine for tests but wrong for production where one batch
could mix runtimes (hermes 12-min cold boot alongside docker 2-min).
The runtime profile fallback already handles per-runtime defaults;
this PR adds the per-WORKSPACE override layer above that.
Resolution priority (most specific wins):
1. node.provisionTimeoutMs — server-declared per-workspace
override (this PR's new field)
2. timeoutMs prop — single-threshold test override
3. runtime profile in @/lib/runtimeProfiles
4. DEFAULT_RUNTIME_PROFILE
Changes:
- WorkspaceData (socket): add optional provision_timeout_ms
- WorkspaceNodeData: add optional provisionTimeoutMs
- canvas-topology hydrate: thread the field through to node.data
- ProvisioningTimeout: extend the serialized-string node iteration
to carry provisionTimeoutMs (4-field positional split); pass as
the second arg to provisionTimeoutForRuntime
- 3 new tests in ProvisioningTimeout.test.tsx covering hydrate
threading, null fall-through, and resolver priority
Phase 2 (separate PR, blocked on workspace-server template-config
loader): workspace-server reads provision_timeout_seconds from
template config.yaml at provision time, includes
provision_timeout_ms in the workspace API/socket response. Phase 3
(template-repo PR): template-hermes config.yaml declares
provision_timeout_seconds: 720; canvas RUNTIME_PROFILES.hermes
becomes redundant and can be removed.
19/19 tests pass (3 new + 16 existing).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User reported the canvas threw a generic "API GET /workspaces: 500
{auth check failed}" error when local Postgres + Redis were both
down. Two problems:
1. The error code (500) and message ("auth check failed") said
nothing useful. The actual condition was "platform can't reach
its datastore to validate your token" — a Service Unavailable
class, not Internal Server Error.
2. The canvas had no way to distinguish infra-down from a real
auth bug, so it rendered the raw API string in the same
generic-error overlay it uses for everything.
Fix in two layers:
Server (wsauth_middleware.go):
- New abortAuthLookupError helper centralises all three sites
that previously returned `500 {"error":"auth check failed"}`
when HasAnyLiveTokenGlobal or orgtoken.Validate hit a DB error.
- Now returns 503 + structured body
`{"error": "...", "code": "platform_unavailable"}`. 503 is
the correct semantic ("retry shortly, infra is unavailable")
and the code field is the contract the canvas reads.
- Body deliberately excludes the underlying DB error string —
production hostnames / connection-string fragments must not
leak into a user-visible error toast.
Canvas (api.ts):
- New PlatformUnavailableError class. api.ts inspects 503
responses for the platform_unavailable code and throws the
typed error instead of the generic "API GET /…: 503 …"
message. Generic 503s (upstream-busy, etc.) keep the legacy
path so existing busy-retry UX isn't disrupted.
Canvas (page.tsx):
- New PlatformDownDiagnostic component renders when the
initial hydration catches PlatformUnavailableError.
Surfaces the actual condition with operator-actionable
copy ("brew services start postgresql@14 / redis") +
pointer to the platform log + a Reload button.
Tests:
- Go: TestAdminAuth_DatastoreError_Returns503PlatformUnavailable
pins the response shape (status, code field, no DB-error leak)
- Canvas: 5 tests for PlatformUnavailableError classification —
typed throw on 503+code match, generic-Error fallback for
503-without-code (upstream busy), 500 stays generic, non-JSON
body falls back to generic.
1015 canvas tests + full Go middleware suite pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The actual cause-fix for the staging-tabs E2E saga (#2073/#2074/#2075).
Old behaviour: ANY 401 from any fetch on a SaaS tenant subdomain
called redirectToLogin → window.location.href = AuthKit. This is
wrong. Plenty of 401s don't mean "session is dead":
- workspace-scoped endpoints (/workspaces/:id/peers, /plugins)
require a workspace-scoped token, not the tenant admin bearer
- resource-permission mismatches (user has tenant access but not
this specific workspace)
- misconfigured proxies returning 401 spuriously
A single transient one of those yanked authenticated users back to
AuthKit. Same bug yanked the staging-tabs E2E off the tenant origin
mid-test for 6+ hours tonight, leading to the cascade of test-side
mocks (#2073/#2074/#2075) that worked around the symptom without
fixing the cause.
This PR fixes it at the source. The new logic:
- 401 on /cp/auth/* path → that IS the canonical session-dead
signal → redirect (unchanged)
- 401 on any other path with slug present → probe /cp/auth/me:
probe 401 → session genuinely dead → redirect
probe 200 → session fine, endpoint refused this token →
throw a real Error, caller renders error state
probe network err → assume session-fine (conservative) →
throw real Error
- slug empty (localhost / LAN / reserved subdomain) → throw
without redirect (unchanged)
The probe adds one extra fetch on a 401, only when slug is set
and the path isn't already auth-scoped. That's rare and
worthwhile — a transient probe round-trip is cheap; an unwanted
auth redirect is a UX disaster.
Tests:
- api-401.test.ts rewritten with the full matrix:
* /cp/auth/me 401 → redirect (no probe, that IS the signal)
* non-auth 401 + probe 401 → redirect
* non-auth 401 + probe 200 → throw, no redirect ← the fix
* non-auth 401 + probe network err → throw, no redirect
* empty slug paths (localhost/LAN/reserved) → throw, no probe
- 43 tests in canvas/src/lib/__tests__/api*.test.ts all pass
- tsc clean
The staging-tabs E2E spec's universal-401 route handler stays as
defense-in-depth (silences resource-load console noise + guards
against panels without try/catch), but the comment now describes
its role honestly: api.ts is the primary fix, the route is the
safety net.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After #2074, the staging-tabs spec stopped failing on the auth-redirect
locator timeout (good — the broadened 401-mock works) but started
failing on a different aggregate check:
Error: unexpected console errors:
Failed to load resource: the server responded with a status of 404
Failed to load resource: the server responded with a status of 404
Failed to load resource: the server responded with a status of 404
Browser console messages for resource-load failures omit the URL,
so the message is uninformative on its own — we can't filter
selectively (e.g. "is this a missing-CSS noise or a real broken
endpoint?"). The previous filter list (sentry/vercel/WebSocket/
favicon/molecule-icon) catches specific known-noisy strings but
this generic "Failed to load resource" doesn't contain any of them.
Two changes:
1. Add page.on('requestfailed') + page.on('response>=400') logging
to capture the URL of any failed request. Logs to test stdout
(visible in the workflow log) — leaves a breadcrumb so a real
bug isn't completely hidden when we filter the generic message.
2. Add "Failed to load resource" to the filter list. With (1) in
place we still see the URLs for diagnosis; the generic console
message is just noise.
Real JS exceptions (panel crash, undefined access, etc.) come with
a file path and stack trace and aren't matched by either filter,
so the gate still catches actual bugs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#2073 caught workspace-scoped 401s but missed non-workspace paths.
SkillsTab.tsx alone fetches /plugins and /plugins/sources, both
outside the /workspaces/<id>/* tree. Either of those 401s with the
tenant admin bearer in SaaS mode → canvas/src/lib/api.ts:62-74
redirects to AuthKit → page navigates away mid-test → next locator
times out.
Same failure signature observed at 16:03Z post-#2073 merge:
e2e/staging-tabs.spec.ts:45:7 › tab: skills
TimeoutError: locator.scrollIntoViewIfNeeded: Timeout 5000ms
- navigated to "https://scenic-pumpkin-83.authkit.app/?..."
Broaden the route to "**" with `request.resourceType() !== "fetch"`
short-circuit (preserves HTML/JS/CSS pass-through) and a
/cp/auth/me skip (the dedicated mock above wins). Same 401 →
empty-body conversion logic; just a wider net.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundle review of pieces 1/2/3 surfaced two critical issues plus a
handful of required + optional fixes. All addressed.
Critical:
1. Migration 043 was missing 'paused' and 'hibernated' from the
workspace_status enum. Both are real production statuses written
by workspace_restart.go (lines 283 and 406), introduced by
migration 029_workspace_hibernation. The original `USING
status::workspace_status` cast would have errored mid-transaction
on any production DB containing those values. Added both. Also
added `SET LOCAL lock_timeout = '5s'` so the migration aborts
instead of stalling the workspace fleet behind a slow SELECT.
2. The chat activity-feed window kept only 8 lines, and a single
multi-tool turn (Read 5 files + Grep + Bash + Edit + delegate)
easily flushed older context before the user could read it.
Extracted appendActivityLine to chat/activityLog.ts with a
20-line window AND consecutive-duplicate collapse (same tool
on the same target twice in a row is noise, not new progress).
5 unit tests pin the behavior.
Required:
3. The SDK wedge flag was sticky-only — a single transient
Control-request-timeout from a flaky network blip locked the
workspace into degraded for the whole process lifetime, even
when the next query() would have succeeded. Added
_clear_sdk_wedge_on_success(), called from _run_query's success
path. The next heartbeat after a working query reports
runtime_state empty and the platform recovers the workspace to
online without a manual restart. New regression test.
4. _report_tool_use now sets target_id = WORKSPACE_ID for self-
actions, matching the convention other self-logged activity
rows use. DB consumers joining on target_id see a well-defined
value instead of NULL.
Optional taken:
5. Tightened _WEDGE_ERROR_PATTERNS from "control request timeout"
to "control request timeout: initialize" — suffix-anchored so a
future SDK error on an in-flight tool-call control message
doesn't get misclassified as the unrecoverable post-init wedge.
6. Dropped the redundant "context canceled" substring fallback in
isUpstreamBusyError. errors.Is(err, context.Canceled) is the
typed check; the substring would also match healthy client-side
aborts, which we don't want classified as upstream-busy.
Verified: 1010 canvas tests + 64 Python tests + full Go suite pass;
migration applies cleanly on dev DB with all 8 enum values; reverse
migration restores TEXT.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two halves of the same UX win — the user wants to see what Claude is
doing while a chat reply is in flight instead of staring at "0s" for
minutes.
Workspace side (claude_sdk_executor.py):
- The executor's _run_query message loop already iterated the SDK
stream for AssistantMessage.TextBlock content. Now also detects
ToolUseBlock / ServerToolUseBlock entries (by class name, since
the conftest stub doesn't define them) and fires-and-forgets a
POST /workspaces/:id/activity row of type agent_log per tool use.
- _summarize_tool_use maps the common tools (Read, Write, Edit,
Bash, Glob, Grep, WebFetch, WebSearch, Task, TodoWrite) to a
one-line summary with the file path / pattern / command, falling
back to "🛠 <tool>(…)" for anything else. Truncated at 200 chars.
- Posts directly to /workspaces/:id/activity rather than going
through a2a_tools.report_activity, which would also push a
/registry/heartbeat current_task and double-log as a TASK_UPDATED
line in the same chat feed.
- All failures swallowed silently — telemetry must not break
the conversation.
Canvas side (ChatTab.tsx):
- The existing ACTIVITY_LOGGED handler streams a2a_send /
a2a_receive / task_update events into a sliding-window
activityLog state. Two issues fixed:
1. No `msg.workspace_id === workspaceId` filter — a sibling
workspace's a2a_send was leaking into the wrong chat
panel as "→ Delegating to X...". Added an early return.
2. No agent_log render branch. Added one that renders the
summary verbatim (the workspace already prefixed its
own emoji icon, so no double-icon).
- Existing 8-line sliding window keeps the UI scoped; older
progress lines naturally roll off as new ones arrive.
Result: when DD is delegating to Visual Designer + reading
config files + running Bash to lint, the spinner area shows:
📄 Read /configs/system-prompt.md
⚡ Bash: pnpm test
→ Delegating to Visual Designer...
← Visual Designer responded (47s)
instead of bare "0s · Processing with Claude Code..." for minutes.
63 Python tests + 58 canvas chat tests pass; tsc clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The staging-tabs E2E has been failing for 6+ hours on the same
locator timeout — diagnosed earlier today as the canvas's
lib/api.ts:62-74 redirect-on-401 path firing mid-test:
e2e/staging-tabs.spec.ts:45:7 › tab: skills
TimeoutError: locator.scrollIntoViewIfNeeded: Timeout 5000ms
- navigated to "https://scenic-pumpkin-83.authkit.app/?..."
Several side-panel tabs (Peers, Skills, Channels, Memory, Audit,
and anything workspace-scoped) hit endpoints under
`/workspaces/<id>/*` that require a workspace-scoped token, NOT
the tenant admin bearer the test uses. The endpoints respond 401
in SaaS mode. canvas/src/lib/api.ts:62-74 reacts to ANY 401 by
setting `window.location.href` to AuthKit — yanking the page off
the tenant origin mid-test.
The test comment at line 18 already acknowledged the 401 class
("Peers tab: 401 without workspace-scoped token") but assumed
those would surface as "errored content" rather than a hard
navigation. The redirect logic in api.ts was added later and
breaks the assumption.
Fix: add a Playwright route handler that catches any 401 from
`/workspaces/<id>/*` paths and replaces with `200 + empty body`.
Body shape is best-effort by URL — list endpoints (paths not
ending in a UUID-shaped segment) get `[]`, single-resource
endpoints get `{}`. Both are valid JSON and well-written panels
render an empty state for either rather than crashing.
The two route patterns (`/workspaces/...` and `/cp/auth/me`)
don't overlap — the existing `/cp/auth/me` mock continues to
gate AuthGate's session check independently.
Verification:
- Type-check passes (tsc clean for the spec; pre-existing errors
in unrelated test files unchanged)
- Can't run staging E2E locally without CP admin token; CI will
exercise the real path against the freshly-provisioned tenant
- E2E Staging SaaS (full lifecycle) is currently green at 08:07Z,
confirming the underlying staging infra works — the failures
have been narrowly in this Playwright-tabs spec
Targets staging per molecule-core convention.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three required fixes from the bundle review of 391e1872:
1. workspace/a2a_client.py: substring `type_name in msg` could miss
the diagnostic prefix when an exception's message embedded a
different class name mid-string (e.g. `OSError("see ConnectionError
below")` → printed as plain msg, type lost). Switched to a
prefix-anchored check (`msg.startswith(f"{type_name}:")` etc.) so
the type label is always added when not already at the start of
the message.
2. workspace/a2a_tools.py: `activity_logs.error_detail` is unbounded
TEXT on the platform (handlers/activity.go does not validate
length). A buggy or hostile peer could stream arbitrarily large
error messages into the caller's activity log. Cap at 4096 chars
at the producer — comfortably above any real exception traceback,
well below an obvious-DoS threshold.
3. New regression test for JSON-RPC `code=0` — pins the
`code is not None` semantics so the code is preserved in the
detail rather than collapsing into the no-code path. Code=0 is
not valid per the spec, but a malformed peer can still emit it
and we want it visible for diagnosis.
Plus one optional taken: extracted the A2A-error → hint mapping into
canvas/src/components/tabs/chat/a2aErrorHint.ts. The two prior copies
(AgentCommsPanel.inferCauseHint + ActivityTab.inferA2AErrorHint) had
already drifted — Activity tab gained `not found`/`offline` cases the
chat panel never picked up, AgentCommsPanel handled empty-input
explicitly while Activity didn't. The shared module is the merged
superset, with 10 unit tests pinning each named pattern + the
"most specific first" ordering (Claude SDK wedge wins over generic
timeout).
Skipped (per analysis):
- Unicode-naive 120-char slice — Python str[:N] slices on code
points, not bytes. Safe.
- Nested [A2A_ERROR] confusion — non-issue per reviewer; outer
prefix winning still produces a structured render.
- MessagePreview + JsonBlock dual render on errors — intentional
drilldown; raw JSON is below the fold for operators who need it.
- console.warn dedup — refetches don't happen per-event so spam
risk is low.
- str(data)[:200] materialization — A2A response bodies aren't
typically MB-sized.
Verified: 1005 canvas tests pass (10 new hint tests); 10 Python
send_a2a_message tests pass (1 new for code=0); tsc clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom: Activity tab and Agent Comms surfaced bare "[A2A_ERROR] "
(prefix + nothing) for failed delegations. Operator had no signal
to act on — no exception type, no target, no hint about what went
wrong, no next step. Fix is in three layers.
1. workspace/a2a_client.py — every error path now produces an
actionable detail string:
- except branch: some httpx exceptions (RemoteProtocolError,
ConnectionReset variants) stringify to "". Pre-fix the catch
was `f"{_A2A_ERROR_PREFIX}{e}"` → bare prefix. Now falls back
to `<TypeName> (no message — likely connection reset or silent
timeout)` and always appends `[target=<url>]` for traceability
in chained delegations.
- JSON-RPC error branch: previously dropped error.code on the
floor and printed "unknown" when message was missing. Now
surfaces both, including the well-defined "JSON-RPC error
with no message (code=N)" path.
- "neither result nor error" branch: pre-fix returned
str(payload) which the canvas rendered as a successful
response block. Now tagged as A2A_ERROR with a payload
snippet so downstream UI routes through the error path.
2. workspace/a2a_tools.py — tool_delegate_task now passes
error_detail (the stripped error message) through to the
activity-log POST. The platform's activity_logs.error_detail
column is the canvas's red error chip source; populating it
makes the failure visible in the row header without the user
having to expand into raw response_body JSON. The summary line
also gets a 120-char prefix of the cause so the collapsed row
reads "React Engineer failed: ConnectionResetError: ... [target=...]"
instead of "React Engineer failed".
3. canvas/src/components/tabs/ActivityTab.tsx — MessagePreview
now detects [A2A_ERROR]-prefixed bodies and renders a
structured error block (red chip, stripped detail, cause hint)
instead of the previous gray text-block that showed the literal
"[A2A_ERROR]" string. inferA2AErrorHint mirrors the patterns
from AgentCommsPanel.inferCauseHint so the same symptom reads
the same way in both surfaces (Claude SDK init wedge → restart
workspace; timeout → busy/stuck; connection-reset → transient
blip then check logs).
Tests: 9 send_a2a_message tests pass (including a new regression
test for the empty-stringifying-exception case that the user
reported); 995 canvas tests pass; tsc clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reported symptom: canvas edges show "1 call · just now" between two
agents, but the Agent Comms tab for the source workspace renders
"No agent-to-agent communications yet" — even though
GET /workspaces/<id>/activity?source=agent&limit=50 returns a2a_send
+ a2a_receive rows.
Confirmed via curl that the API does return the rows the panel
should map. The panel's load handler was the suspect, but it had:
.catch(() => setLoading(false))
which swallowed every failure path — network errors, JSON parse,
ANY throw inside the .then body — without leaving a single trace in
the console. The panel just sat on its empty state and gave the user
zero signal to act on. (And by extension, gave us nothing to debug
remotely either.)
Two changes:
1. Wrap the per-row `toCommMessage` call in a try/catch so one
malformed activity row (unexpected request_body shape, etc.)
doesn't throw out of the for-loop and skip the
setMessages(msgs) line. Previously the panel would silently
drop the entire batch when ANY row failed to parse.
2. Replace the bare `.catch(() => setLoading(false))` with a
logging variant. Now a future "panel stuck empty" report comes
with `AgentCommsPanel: load activity failed <err>` or
`AgentCommsPanel: failed to map activity row {...}` in the
console — diagnosable instead of opaque.
Behavior on the happy path is unchanged (5 existing tests still
pass; tsc clean). This is purely defensive: it makes the failure
path visible so the next stuck-empty report can be root-caused
instead of guessed at.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundle-level review caught an implicit coupling in useCanvasViewport
between two distinct fit effects:
- settle fit: 1200ms one-shot when provisioning transitions to zero
(deploy just finished — settle on the whole org once)
- tracking fit: 500ms debounced per molecule:fit-deploying-org event
(track the org's bounds as children land during the deploy)
Both effects shared a single autoFitTimerRef, so each one's
clearTimeout call could silently cancel the other's pending fit.
Today's behavior happened to land in the right order out of luck —
the tracking handler fires per-arrival during the deploy, then the
settle effect arms after the last child completes. But nothing in
the code enforces that ordering; a future refactor that, say,
fires the settle effect from the same event sequence as the
tracking timer (mid-deploy status flicker) would silently drop the
settle fit because the tracking timer's clearTimeout ran last.
Splitting into settleFitTimerRef + trackingFitTimerRef makes the
two effects fully independent. Cleanup clears both. Tests still pass
(995/995); the refactor is mechanical.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three review-driven fixes plus regression coverage for the bugs
landed in 176b703d / deedb5ef:
1. clearTimeout the prior reload handle before scheduling a new one in
both installFromSource and handleUninstall. Two installs within the
PLUGIN_RELOAD_DELAY_MS window (15s) used to queue two
loadInstalled() calls; the unmount cleanup only cleared the latest
handle, and the second reconciliation could overwrite a still-
correct optimistic state with a stale snapshot mid-restart.
2. Drop `setInstalledLoaded(true)` from the optimistic block. That
flag's contract is "the initial GET has succeeded at least once" —
it gates the auto-expand-registry effect. A user installing a
custom-source plugin BEFORE the initial fetch returned would flip
the gate prematurely, the auto-expand would never fire, and a
followup loadInstalled racing with the optimistic write could
overwrite our entry with [] mid-restart.
3. Don't force `supported_on_runtime: true` on the optimistic record.
The "inert on this runtime" badge in the row renders on the value
`=== false`. Forcing true would hide the badge for 15s if the user
installed a plugin that doesn't actually support the workspace's
runtime; the real value lands at refetch. Leaving the field
undefined keeps the badge neutral until reconciliation arrives.
Plus a behavioral test (SkillsTab.install.test.tsx) that asserts:
- the install POST URL contains the workspaceId (not "undefined")
- the row's "Install" button is replaced by the green "Installed"
tag synchronously after POST resolves, without advancing any
timer — locks in the optimistic-update contract so a future
refactor can't silently regress it.
995 canvas tests pass (2 new); tsc clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After clicking Install, the button reverted from "Installing..." → "Install"
the moment the POST returned, then sat there for ~15s before the green
"Installed" tag appeared. The 15s gap is PLUGIN_RELOAD_DELAY_MS — we
delay the GET /workspaces/:id/plugins refetch to wait for the workspace
to restart (the listing handler returns [] while the container is
restarting because findRunningContainer comes up empty).
Uninstall already does optimistic local-state mutation (line 244 prior
to this commit) so the green tag → install button transition is
instant. Install was the inconsistent half — push the registry entry
into `installed` immediately after POST returns 200 and let the
delayed refetch reconcile.
The optimistic record uses the registry entry's metadata (name,
version, description, tags, runtimes, skills) and sets
supported_on_runtime=true. If reconciliation later disagrees (server
filter, install actually failed at the runtime layer), the refetch
overwrites the local record. Worst case is a brief 15s window where
we show "Installed" for a plugin that won't load — same window the
user previously experienced as "stuck on Install button" — but flipped
to the correct expected state.
Custom-source installs (github://, etc.) don't have a registry entry
to use, so they keep the old behavior of waiting for the refetch. Most
users install from the registry list in the UI.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SkillsTab read \`data.id\` from its props and used the value to build
two API URLs:
POST /workspaces/\${data.id}/plugins
DELETE /workspaces/\${data.id}/plugins/\${pluginName}
But \`data\` is the React Flow node.data blob (WorkspaceNodeData) —
the workspace id lives on \`node.id\`, NOT on \`node.data\`. WorkspaceNodeData
extends \`Record<string, unknown>\`, which makes \`data.id\` type-check
silently as \`unknown\` instead of erroring. So every install/uninstall
hit \`/workspaces/undefined/plugins\`, the server's not-found path
returned 503 "workspace container not running" (misleading — the real
issue was the bogus URL), and the user got a confusing toast.
Every other tab in SidePanel takes \`workspaceId={selectedNodeId}\` as
an explicit prop. SkillsTab was the lone outlier, presumably because
"data has all the fields I need" is the obvious-looking shortcut that
TypeScript can't catch through the index-signature interface.
Fix: make \`workspaceId\` an explicit prop on SkillsTab, drop the
\`data.id\` reads, thread the prop from SidePanel like the other tabs.
Test fixture updated to pass it.
Verified: 993 canvas tests pass; tsc clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Independent code review surfaced two required documentation fixes and
one growth-correctness gap. All addressed here.
Auto-fit gate (useCanvasViewport):
The previous "subtree-grew-by-count" check missed the delete-then-add
case: subtree of 6 → delete one → 5 → a different child arrives → 6
again. A length-only comparison reads no growth and the fit is
skipped, leaving the new node off-screen. Switched to an id-set
membership snapshot so any brand-new id forces the fit even when the
count is unchanged.
The gate logic is now extracted as a pure exported function
`shouldFitGrowing(currentIds, prevIds, userPannedAt, lastAutoFitAt)`
so the regression-prone decision can be unit-tested in isolation
without standing up React Flow + DOM event refs. 8 cases cover:
first-fit, empty-prior, brand-new id, status-update with user pan,
no-pan-ever, pan-before-last-fit, delete-then-add same length, and
shrink-only with user pan.
Parser parity (dotenv.go + next.config.ts):
Existing-env semantics were undocumented in both parsers. Both now
explicitly note that an explicitly-set empty string (`KEY=` from the
parent shell) counts as "set" — the file value does NOT backfill —
matching the Go (os.LookupEnv) and Node (`process.env[k] !==
undefined`) primitives.
`export ` prefix uses a literal space; `export\tFOO=bar` is
intentionally rejected. Added the same comment in both parsers
to lock in this parity invariant since the commit message claims
"if one parser changes, the other has to."
Skipped (per analysis):
- Drag-pan respect for left-click drag-pan during deploy. The
growth-check safety net means any pan gets overridden on the
next arrival anyway, which is the desired behavior for the
"watch the org deploy" use case. After deploy completes, no
more fit-deploying-org events fire so drag-pan works freely.
- Map cleanup for lastFitSubtreeIdsRef. Per-tab session, UUID
keys, tiny entries — not worth the cleanup hook.
993 canvas tests pass (8 new); Go dotenv tests pass; tsc clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom: org import zoomed to fit the parent + first child, then froze
at that framing while the remaining children kept materialising
off-screen. The user had to manually pan/zoom to see the new arrivals.
Two stacked bugs in useCanvasViewport's deploy-time auto-fit:
1. The user-pan-respect gate stamps userPannedAtRef on EVERY
pointerdown that lands inside .react-flow__pane. That fires for
ordinary clicks (deselect, click-near-a-card, modal-close-bubble
from the import dialog) — not just for actual pan gestures. One
accidental pre-import click was enough to lock out every fit for
the rest of the deploy. Wheel is the canonical unambiguous
pan/zoom signal; drop pointerdown.
2. Even with a real pan during deploy, when more children land the
org's bounds grow and the user has lost context — the new
arrivals are off-screen and the deploy is the primary thing they
want to watch right now. The guard had no growth awareness, so
one pan cancelled all follow-up fits unconditionally. Now we
track the subtree size at the last fit (per root), and if the
current subtree is larger we force the fit through regardless of
the user-pan timestamp. When the subtree size hasn't changed
(status updates on already-positioned nodes), the user-pan
respect still applies — so post-deploy exploration isn't
yanked back.
The Map keyed by root id supports back-to-back imports of different
orgs without one's growth count blocking the other's first fit.
985 canvas tests pass; tsc clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom: spawn animation missing on org import. Workspaces appeared in
their final positions all at once instead of materialising one-by-one.
Root cause: the WS pill said "Reconnecting" forever because the canvas
was trying to connect to ws://localhost:3000/ws — its own port, where
Next.js dev doesn't serve a WebSocket — instead of the platform's
ws://localhost:8080/ws.
Why: deriveWsBaseUrl() falls back to window.location when
NEXT_PUBLIC_WS_URL is unset. Next.js auto-loads .env from the project
root only — and the canonical NEXT_PUBLIC_WS_URL /
NEXT_PUBLIC_PLATFORM_URL live in the monorepo root .env, alongside the
Go platform's MOLECULE_ENV / DATABASE_URL. Without an extra
canvas/.env.local copy (which would still be a per-developer manual
step), the canvas dev server starts blind to those vars.
Fix: next.config.ts now walks upward from __dirname looking for the
monorepo root (same workspace-server/go.mod sentinel the platform's
dotenv loader uses) and merges the root .env into process.env BEFORE
Next.js compiles. Existing env wins over file values, so docker
runs / CI / explicit exports still dominate.
The parser is a TypeScript mirror of workspace-server/cmd/server/
dotenv.go's parseDotEnvLine — same rules (export prefix, quotes,
inline comments, BOM) so a single .env line behaves identically across
both processes. If one parser changes, the other has to.
Production unaffected: `output: "standalone"` bakes resolved env into
the build, the workspace-server sentinel isn't shipped in deploy
artifacts, and the existing-env-wins rule means container env
dominates anywhere this file is consulted at runtime.
Verified: canvas dev startup log now shows
"[next.config] loaded 49 vars from /Users/.../molecule-core/.env";
served bundle has the correct ws://localhost:8080/ws URL; WS pill
flips to "Connected" after a hard refresh and per-workspace spawn
animations fire on the next org import as expected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Independent code review surfaced three required fixes and one cheap
optional one. All addressed here.
dotenv parser:
- `export FOO=bar` was parsed as key `"export FOO"` (with embedded
space) and silently os.Setenv'd, so a developer pasting from a
direnv `.envrc` would get junk vars. Now strips the prefix.
- Quoted values weren't unwrapped: `FOO="hello world"` produced value
`"hello world"` with literal quotes. Now strips one matched pair of
surrounding `"` or `'`. Inside a quoted value `#` is part of the
value, not a comment marker (matches godotenv convention).
- UTF-8 BOM at file start (Windows editors) would have produced a
first key like U+FEFF + "FOO". Now stripped via TrimPrefix.
dotenv loader:
- findDotEnv()'s upward walk would happily pick up `~/.env` or a
sibling-repo `.env` if the binary was run from `~/Documents/other-
project/`. Real foot-gun on shared dev boxes. Now gated on a
monorepo sentinel: the candidate directory must contain
`workspace-server/go.mod`. Falls through to "no .env found" (=
pre-fix behavior) when the sentinel is absent.
socket fallback poll:
- startFallbackPoll() previously fired only on onclose, so the very
first connect attempt — when onclose hasn't fired yet because we
never had a successful onopen — left the canvas with no HTTP poll
for the duration of the failing handshake (Chrome can hold a
SYN-SENT WebSocket open ~75s before giving up). Now also called at
the top of connect(); the timer-already-running guard makes it a
no-op when one cycle later onclose calls it again.
Test coverage added: export prefix, single+double quoted values, hash
inside quotes preserved, unterminated quote falls back to bare value,
CRLF stripping locked in, BOM stripping, and a sentinel-rejection
regression test that creates a temp .env with no workspace-server
sibling and asserts findDotEnv refuses to load it.
Verified: 985 canvas tests + 30 dotenv subtests + 4 dotenv integration
tests all pass; tsc clean; rebuilt platform from monorepo root with
stripped env still loads .env (49 vars) and /workspaces returns 200.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Deleting a parent on a wedged WS used to leave the child cards on
the canvas as orphaned roots until the user manually refreshed.
Why: Canvas.tsx and DetailsTab.tsx both called `removeNode(parentId)`
after `DELETE /workspaces/:id?confirm=true` returned 200. `removeNode`
deliberately re-parents children rather than cascading — it relies on
the per-descendant WORKSPACE_REMOVED WS events the platform emits as
part of the cascade to drop each child individually. When the WS is
unhealthy those events never arrive, so the local store keeps the
children alive (now re-parented to root since their actual parent is
gone).
Fix: new `removeSubtree(rootId)` action on the canvas store mirrors
the server-side cascade — drops the root + every descendant + every
incident edge in one atomic set(). Both delete call sites now use it.
The WS events still arrive when WS is healthy and become idempotent
no-ops because the nodes are already gone.
Why a new action instead of changing removeNode: removeNode's
re-parenting behavior is correct for non-cascading flows (drag-out,
manual node detach in the future). Adding a sibling action keeps
both call shapes available rather than forcing every caller to opt
out of cascade.
6 new unit tests cover root cascade, mid-level cascade, leaf
no-op-cascade, selection clearing across the subtree, selection
preservation outside the subtree, and edge cleanup.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Seventh E2E bug, surfaced after the AuthGate mock from the previous
commit finally let the harness reach the tab-iteration loop:
Error: tab-skills button missing — TABS list may have drifted
Locator: locator('#tab-skills')
The TABS bar in SidePanel is `overflow-x-auto` (intentional — there
are 13 tabs and they don't all fit on smaller viewports; the
right-edge fade gradient signals the overflow). Tabs after position
~3 are clipped, and Playwright's `toBeVisible()` returns false for
clipped elements (it checks getBoundingClientRect against viewport).
Fix: `scrollIntoViewIfNeeded()` before the visibility assertion,
mirroring what SidePanel's own keyboard handler does on arrow-key
navigation. The tab is then in view and `toBeVisible()` passes.
This was the test's 7th and (probably) final harness bug. The
chain mapping all the way from "staging E2E timed out at 1200s"
this morning:
1. instance_status field name (#2066)
2. staging.moleculesai.app DNS zone (#2066)
3. X-Molecule-Org-Id TenantGuard header (#2066)
4. Hydration selector waited pre-click (#2066)
5. networkidle never settles (this PR's parent commits)
6. AuthGate /cp/auth/me redirect
7. Tab buttons clipped by overflow-x-auto
If THIS run still fails, the failure surfaces in actual product
behavior (a tab's panel content), not test mechanics.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related fixes for the case where the canvas thinks workspaces are
stuck provisioning when they're actually online:
1. ProvisioningTimeout banners now gate on wsStatus === "connected".
While the WS is in connecting/disconnected state, the local
"provisioning" status reflects the last event received before the
drop — workspaces may have transitioned to online minutes ago. The
8m timeout was firing against frozen state and showing a wall of
yellow warnings on already-online workspaces.
2. Socket layer now starts a 10s rehydrate poll when the WS goes
unhealthy (onclose) and stops it on onopen/disconnect. The
reconnect attempts continue in parallel; whichever recovers first
wins. rehydrate()'s existing dedup gate prevents the open-time
rehydrate from racing with a fallback poll. Without this the
store could stay frozen for minutes while WS exponential backoff
chewed through retries.
Plus the previously-uncommitted TemplatePalette flushSync change so
the import modal unmounts synchronously before doImport runs (otherwise
React batches the close with the import's setState prefix and the
modal backdrop hides the spawn animation).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sixth E2E bug, surfaced after the page.goto-domcontentloaded fix
finally let the navigation complete. The harness now reaches the
canvas-root selector wait but still times out because the canvas
never renders:
TimeoutError: page.waitForSelector: Timeout 45000ms exceeded.
waiting for [aria-label="Molecule AI workspace canvas"]
Root cause: canvas/src/components/AuthGate.tsx wraps the page,
fetches /cp/auth/me on mount, and redirects to the login page when
the response is 401. The bearer header we set via
context.setExtraHTTPHeaders works for platform API calls but does
NOT satisfy /cp/auth/me — that endpoint is cookie-based (WorkOS
session). So:
1. AuthGate mounts
2. Calls fetchSession() → /cp/auth/me → 401 (no session cookie)
3. AuthGate transitions to anonymous → redirectToLogin()
4. Browser navigates away from tenant URL
5. The React Flow canvas root with the aria-label never mounts
6. waitForSelector times out at 45s
Fix: context.route() intercepts /cp/auth/me and returns a fake
Session JSON so AuthGate resolves to "authenticated" and renders
its children. The session contents are cosmetic — Session.org_id
and Session.user_id appear in a few canvas surfaces but never fail
on dummy values.
This is the cleanest fix path. Alternatives considered + rejected:
- Add a ?e2e=1 backdoor to AuthGate: production code shouldn't
have a "skip auth" flag, even gated.
- Real WorkOS login flow in Playwright: too much overhead per run.
- Skip the canvas UI test, test only API: defeats the point of
the staging E2E (which is to catch UI regressions before
promotion).
After this lands the harness should reach the workspace-node click
step and exercise tabs — only then can a real product bug (rather
than a test-harness bug) surface. The 6-bug chain mapped to:
1. instance_status field name (#2066)
2. staging.moleculesai.app DNS zone (#2066)
3. X-Molecule-Org-Id TenantGuard header (#2066)
4. Hydration selector waited pre-click (#2066)
5. networkidle never settles (this commit's parent)
6. AuthGate /cp/auth/me redirect (this commit)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single-themed bundle of fixes accumulated while polishing the canvas
chat / agent-comms / plugins / position flows. Each piece is small;
the connective tissue is "things observable from the canvas right
panel and the org-deploy flow that surprised real users".
UI / composer
- Legend: add close X + persisted-localStorage state + reopener
pill; default open for first-time users.
- SidePanel: rename "Skills" tab label → "Plugins" (single-line;
internal panelTab enum value, component name, and store keys
unchanged).
- SkillsTab: registry tri-state UI (loading / error / empty) with
actionable Retry button + 10s explicit fetch timeout. Handle
AbortSignal.timeout's DOMException by name (TimeoutError /
AbortError) — Chromium's "signal timed out" message wouldn't
match the prior naive /timeout/ regex. Reset mountedRef on every
mount: pre-existing StrictMode dev-mode bug where cleanup-only
`current = false` was never re-set, permanently wedging every
`if (mountedRef.current) setX(...)` guard and producing a
"Loading…" panel that never resolved on hard refresh.
- ChatTab: paste-image-from-clipboard via onPaste handler; unique
monotonic-counter filenames so same-second pastes don't collide
on name+size dedup. mime→ext map avoids `image/svg+xml`-style
raw extensions on synthesised filenames. Bypasses the
DataTransfer constructor so Safari < 14.1 / older Edge work.
- ChatTab: drop stuck error toast when the WS path already
delivered the agent reply but the HTTP path errored late
(sendingFromAPIRef gate now covers the .catch() handler).
- ChatTab: filter heartbeat-style internal self-messages from the
My Chat tab so historical rows with source_id=NULL don't
surface as user-typed input.
- Modal portals: OrgImportPreflightModal + MissingKeysModal
(ProviderPickerModal + AllKeysModal) now createPortal to
document.body and clamp max-h to 80vh. Escapes the ancestor
containing block (TemplatePalette's fixed+filtered sidebar
re-anchored descendants' position:fixed to itself, hiding
modals behind workspace cards). MissingKeysModal bumped to
z-[60] for stack ordering when both modals are open.
- OrgImportPreflightModal saveOne: ref-based microtask-safe
in-flight gate replaces the brittle "set startValue inside a
setState updater and read on the next line" pattern (React 18
doesn't guarantee functional updaters run synchronously; that
path strands `saving:true` and never calls createSecret). Same
useRef pattern guards SkillsTab.loadRegistry against concurrent
fires and Fast-Refresh-stranded promises; force=true parameter
on retry click bypasses the gate.
Agent comms
- AgentCommsPanel: derive UI-facing `flow` field instead of using
activity_type-derived direction. Self-logged a2a_receive rows
(source_id == workspace_id, what the agent runtime writes to log
its own outbound delegation replies) now correctly render as
OUTBOUND with → arrow + right-justified bubble. Previously they
rendered "← From Self" with Restart pointing at THIS workspace.
- AgentCommsPanel: error rows replace the unactionable
"X failed [A2A_ERROR]" body with banner + underlying-error
code-block + cause-hint (matched on Claude Code SDK init wedge,
deadline-exceeded, agent-thrown exception, empty-error) +
Restart [peer] / Open [peer] action buttons.
- AgentCommsPanel: render text bodies through ReactMarkdown +
remark-gfm so multi-part replies (tables, code) render properly.
Multi-part text extractor
- extractReplyText (live A2A response in ChatTab) and
extractResponseText (chat history loader in message-parser):
now COLLECT from every source — top-level parts, parts.root.text,
and artifacts — joined with "\n". Previous "first source wins"
silently dropped multi-part replies (Hermes summary+detail,
Claude Code long-form table). Tests cover joined-from-parts,
joined-from-artifacts, joined-from-both.
Position stability
- canvas-topology.buildNodesAndEdges: auto-rescue heuristic now
accepts currentParentSizes map; uses max(initial min, currently
grown) for the bbox check. Fixes "child jumps to weird location
after 30s" — the periodic socket health-check rehydrate
(silenceSec > 30) was rebuilding nodes from scratch, and the
rescue's reliance on grid-derived initial size false-flagged
children the user dragged into the user-grown area.
- canvas.hydrate: pass live measured dimensions from the existing
store into buildNodesAndEdges.
- socket.RehydrateDedup: pure exported helper class that gates
rehydrate calls. Two states — in-flight (in-flight Promise reused
by concurrent callers) + post-completion window (1.5s, returns
Promise.resolve()). Initialised with -Infinity so first call
always passes the gate. Wired into ReconnectingSocket.rehydrate.
A2A edges
- New A2AEdge custom React Flow edge component portals its label
out of the SVG layer via EdgeLabelRenderer so labels (a) render
above workspace cards instead of being hidden behind them and
(b) accept clicks. Click selects source + switches panel to
Activity, but only on a NEW selection (preserves current tab on
re-click of an already-selected source).
- buildA2AEdges output tagged type:"a2a"; edgeTypes wired in
Canvas.tsx.
Tests
- 14 new vitest cases across 4 files (964 → 978 passing):
OrgImportPreflightModal saveOne single-fire / double-click,
any-of rendering; AgentCommsPanel toCommMessage flow derivation
in all four shapes; canvas-topology rescue respects-grown /
rescues-genuine-drift / fallback-without-live-size; socket
RehydrateDedup gate behaviour; message-parser multi-part
response extraction.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fifth E2E bug surfaced by the previous run. After the four setup-
phase fixes (instance_status, DNS zone, X-Molecule-Org-Id, hydration
selector) plus CP#259 ending the pq cache class, the harness finally
reached the actual page navigation step — and timed out there:
TimeoutError: page.goto: Timeout 45000ms exceeded.
navigating to "https://...staging.moleculesai.app/", waiting until "networkidle"
`waitUntil: "networkidle"` waits for 500ms of network silence. The
canvas keeps a WebSocket connection open + polls /events and
/workspaces every few seconds for status updates, so the network
is never idle — page.goto sits on it until the default 45s timeout
and throws.
Fix: switch to `waitUntil: "domcontentloaded"`. Returns as soon as
the HTML is parsed. React hydration plus the existing
`waitForSelector` line below is what actually gates ready-for-
interaction; the goto's job is just to land on the page.
This is a generally-applicable lesson — networkidle is broken for
any SPA with a heartbeat. Notably, our existing canvas unit tests
that mock @xyflow/react and don't open WebSockets DON'T hit this,
which is why this only surfaces against staging.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fourth E2E bug in the staging→main chain. The previous three (#2066
setup-phase fixes) let the harness reach the actual Playwright spec.
This one is in staging-tabs.spec.ts itself.
The spec at L78 waits 45s for one of:
[role="tablist"], [data-testid="hydration-error"]
Both targets are wrong:
1. [role="tablist"] only appears AFTER the workspace node is
clicked (which happens 25 lines later at L100). Waiting for
it BEFORE the click can never resolve, so the wait always
times out at 45s regardless of whether the canvas actually
loaded.
2. [data-testid="hydration-error"] doesn't exist anywhere in
the canvas. The error banner at app/page.tsx:62 only had
role="alert" — which collides with toast notifications and
other alert-type elements, so a more-specific selector was
never wired.
Two-part fix:
- Test waits on `[aria-label="Molecule AI workspace canvas"]`
instead — that's the React Flow wrapper (Canvas.tsx:150),
always present once hydrated regardless of workspace count
or selection state. Hydration-error banner remains the
secondary OR target for the failure path.
- app/page.tsx hydration-error banner gets the missing
`data-testid="hydration-error"` attribute. role="alert"
stays for accessibility; the testid is for programmatic
detection without conflict.
After this lands, the staging-tabs spec should advance past the
initial wait, click the workspace node, and exercise each tab.
If a tab fails, we get a proper test failure rather than a 45s
timeout that obscures everything.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Third E2E bug in the staging→main chain, found while debugging the
\`Workspace create 404\` failure that surfaced after the previous two
E2E fixes (instance_status, staging.moleculesai.app DNS).
Root cause: workspace-server's \`middleware/TenantGuard\` middleware
returns 404 (not 401/403, intentionally — see comment in
\`tenant_guard.go\`: "must not be inferable by probing other orgs'
machines") when a request to the tenant origin lacks one of:
- X-Molecule-Org-Id header matching MOLECULE_ORG_ID env on the tenant
- Fly-Replay-Src state from the CP router (production browser path)
- Same-origin Canvas (Referer == Host)
The E2E was a direct GitHub-Actions curl with neither — every non-
allowlisted route 404'd with the platform's ratelimit headers but
none of the security headers, which made it look like a missing
route in the platform.
The org UUID is already on the admin-orgs row alongside instance_status,
so capture it during the readiness poll and add it to the tenantAuth
header bag. Both /workspaces (POST) and /workspaces/:id (GET) now
carry it.
Allowlist still contains /health, /metrics, /registry/register,
/registry/heartbeat — so the TLS readiness step (which hits /health)
keeps working without the header.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second related E2E bug, surfaced after #2066's instance_status fix
let the harness reach the TLS readiness step:
Error: tenant TLS: timed out after 180s
The CP provisioner writes staging tenant DNS as
<slug>.staging.moleculesai.app (with the staging. subdomain
prefix — visible in the EC2 provisioner DNS log line). The harness
was building https://<slug>.moleculesai.app (prod-zone shape),
so DNS literally didn't resolve, fetch threw NXDOMAIN inside the
silent catch, and waitFor saw null on every 5s poll until 180s
elapsed.
Fix: parameterize as STAGING_TENANT_DOMAIN env var, default
staging.moleculesai.app. Doc-comment example updated to match.
Override hatch is there only for ops running this harness against
a non-default zone.
Verified manually: a freshly-provisioned tenant
(e2e-canvas-20260425-sav9fe) was unreachable at the prod-shaped
URL (NXDOMAIN) but reached CF at the staging-shaped URL.
teardown.ts only hits CP, not the tenant URL — no fix needed there.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Staging Canvas Playwright E2E has been timing out at 1200s on every
recent run. Found via /code-review-and-quality on the staging→main
promotion chain.
The CP /cp/admin/orgs response shape is (handlers/admin.go:118):
type adminOrgSummary struct {
...
InstanceStatus string `json:"instance_status,omitempty"`
...
}
There is NO top-level `status` field. The waitFor predicate compared
`row.status === "running"` against undefined on every poll — the
predicate could never resolve truthy. The harness invariably wedged
on the 20-min timeout regardless of whether the tenant was actually
provisioned.
This bug has been double-edged:
- It MASKED the #242 pq-cache-collision class for hours: the
tenants WERE provisioning fine, but the test couldn't tell.
- It survived #255, #257 (real CP fixes) — the test still timed
out, making us suspect more CP bugs that didn't exist.
Fix: poll `row.instance_status` instead. One-line change. Identical
fix for the failed-state branch one line below.
No new tests for the harness itself; the fix's correctness is
verified by the next E2E run on the affected branch passing
end-to-end. If it doesn't pass after this, there's a separate
bug we can hunt cleanly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends the org-import env preflight so a template can declare an
alternative: satisfy ANY one member to pass. Motivated by the
Claude-family node case where either ANTHROPIC_API_KEY or
CLAUDE_CODE_OAUTH_TOKEN unlocks the agent — forcing both was wrong.
Server (workspace-server):
- New EnvRequirement union type with custom YAML + JSON
(un)marshaling. Accepts scalar (strict) or {any_of: [...]} in
both on-disk org.yaml and inline POST /org/import bodies.
- collectOrgEnv now returns []EnvRequirement. Dedups groups by
sorted-member signature. "Strict wins" pruning drops any-of
groups that mention a name already declared strictly (same
tier and cross-tier).
- Import preflight uses EnvRequirement.IsSatisfied — scalar =
exact match, group = any member present.
- Empty any_of: [] rejected at parse time (never-satisfiable).
- 14 handler tests (6 updated for the union shape, 8 new
covering any-of satisfaction, dedup, strict-dominates-group,
cross-tier pruning, invalid-member filtering, YAML round-trip,
and empty-any-of rejection).
Canvas:
- EnvRequirement = string | {any_of: string[]} with envReqMembers,
envReqSatisfied, envReqKey helpers.
- OrgImportPreflightModal renders strict rows and any-of groups
via a new AnyOfEnvGroup sub-component: "Configure any one"
banner, per-member input, ✓-satisfied indicator, and dimmed
siblings once any member is configured so the user can still
switch providers.
- TemplatePalette.OrgTemplate.required_env / recommended_env
retyped to EnvRequirement[]; passthrough to the modal
unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Commit 5adc8a74 (part of this PR) intentionally made
molecule:fit-deploying-org fire for root-level workspaces too — it
used to only fire for children, which meant a standalone create
didn't center the viewport until the first child arrived ~2s later.
The existing regression test still expected ONLY the
molecule:pan-to-node event for a new root, so it started failing
with "expected length 1, got 2". The product behavior is correct
(centering on the root immediately is better UX); the test was
pinning the old single-dispatch shape.
Fix: assert BOTH events fire, each with the right detail payload,
so a future regression that drops either one (or duplicates) trips
the test. Single-test update, no production code change. 953/953
canvas tests pass locally.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Problem
Two issues the external-workspace path was silently dropping:
1. `knownRuntimes` was a hardcoded Go map that drifted from
manifest.json — e.g. `gemini-cli` was in manifest but missing
from the Go allowlist, so any workspace provisioning with
runtime=gemini-cli got silently coerced to langgraph.
2. No end-to-end "bring your own compute" story. The canvas UI
had no way to pick runtime=external; the partial backend code
required the operator to already have a URL ready (chicken-and-
egg with the agent that doesn't exist yet), and no workspace_auth
_token was minted so the external agent couldn't authenticate its
register call.
## Change
### Runtime registry driven by manifest.json
- New `runtime_registry.go` reads `manifest.json` at service init.
Each `workspace_templates[].name` becomes a runtime identifier
(with the `-default` suffix stripped so `claude-code-default`
and `claude-code` resolve to the same runtime).
- `external` is always injected (no template repo exists for it).
- Falls back to a static map on manifest load failure so tests /
dev containers keep working.
- 5 new tests including a real-manifest sanity check.
### First-class external workspace flow
When `POST /workspaces` is called with `runtime: "external"` AND
no URL supplied:
1. Workspace row inserted with `status='awaiting_agent'`
(distinct from `provisioning` so canvas doesn't trip its
provisioning-timeout UX).
2. A workspace_auth_token is minted via `wsauth.IssueToken`.
3. Response body includes a `connection` object with:
- `workspace_id`, `platform_url`, `auth_token`
- `registry_endpoint`, `heartbeat_endpoint`
- `curl_register_template` — zero-dep one-shot register snippet
- `python_snippet` — full SDK setup w/ heartbeat loop,
paired with molecule-sdk-python PR #13's A2AServer
4. The platform URL is resolved from `EXTERNAL_PLATFORM_URL` env
(ops-configurable per tenant) or falls back to request headers.
The legacy `payload.External` + `payload.URL` path is preserved —
org-import and other callers that already have a URL still work.
### Canvas UI
- New "External agent (bring your own compute)" checkbox in
CreateWorkspaceDialog.
- When checked, template/model/hermes-provider fields are hidden
and the POST body includes `runtime: "external"`.
- New `ExternalConnectModal` component: shown once after create,
renders Python / curl / raw-fields tabs with copy-to-clipboard
buttons. Stays mounted as a sibling of the create dialog so the
token survives the create dialog unmount.
- `auth_token` is interpolated into the snippet client-side so the
copied block is truly ready to run — operator only has to fill
in their agent's public URL.
## Tests
- Go: 5 new runtime_registry tests (happy path, -default strip,
external always injected, missing file, malformed JSON, real
manifest sanity). All existing handler tests still pass.
- TypeScript: no type errors on my files; pre-existing
canvas-batch-partial-failure type drift is on main already and
tracked on the #2061 branch.
## Follow-ups (filed separately)
- Cut molecule-sdk-python v0.y to PyPI so the snippet can use
`pip install molecule-ai-sdk` instead of `git+main`.
- Add a `runtime: string` field per template in manifest.json so
one template can declare its runtime explicitly (instead of
deriving it from name conventions). Unblocks N-templates-per-
runtime (e.g. hermes-minimax, hermes-anthropic both runtime=hermes).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builds on #2061. Three internally-cohesive sub-features; easiest to
read in order.
## 1. Org-level env preflight
Server
- `OrgTemplate` + `OrgWorkspace` gain `required_env: string[]` and
`recommended_env: string[]` YAML fields.
- `GET /org/templates` walks the tree and returns the tree-union
(deduped, sorted) of both. `collectOrgEnv` dedup prefers required
when the same key is declared at both tiers.
- `POST /org/import` preflights against `global_secrets` WHERE
`octet_length(encrypted_value) > 0` (empty-value rows used to be
counted as "configured" and the per-container preflight still
failed at start time). 412 Precondition Failed + `missing_env`
list when required keys are absent. `force=true` bypasses with
an audit log line. DB lookup failure now returns 500 (was:
silent fall-through that defeated the guard). Env-var NAMES
validated against `^[A-Z][A-Z0-9_]{0,127}$` so a malicious
template can't ship pathological names into the UI or DB.
Canvas
- New `OrgImportPreflightModal`: red "Required" section (blocking)
and yellow "Recommended" section (non-blocking, import stays
enabled, shows live missing-count next to the Import button).
- Per-key password input → `PUT /settings/secrets` → strike-through
on save. Functional `setDrafts` throughout (no stale-closure
clobbers on rapid successive saves). `useEffect` seed keyed on a
sorted-join string signature so a parent re-render with a new
array identity doesn't clobber typed inputs.
- `TemplatePalette.handleImport` branches: zero env declarations →
straight to import; any declarations → fetch configured global
secret keys, open the modal.
Tests (Go): `TestCollectOrgEnv_*` (5) cover union-across-levels,
required-wins-over-recommended (including same-struct), dedup,
empty, invalid-name rejection.
## 2. EmptyState parity with TemplatePalette
The "Deploy your first agent" grid used to call `POST /workspaces`
with no preflight while the sidebar palette ran
`checkDeploySecrets` + `MissingKeysModal` first. Same template
deployed two different ways → first-run users saw containers boot
in `failed` state without guidance. Now both surfaces share one
preflight + modal handshake.
EmptyState's previous `interface Template` dropped `runtime`,
`models`, and `required_env` — silently discarding exactly the
fields the preflight needs. `Template` now lives in
`deploy-preflight.ts` and is imported from there by both surfaces.
## 3. useTemplateDeploy hook
With the preflight + modal wiring now duplicated across
EmptyState + TemplatePalette + (going forward) any third surface,
extracted the pattern into `canvas/src/hooks/useTemplateDeploy.tsx`:
const { deploy, deploying, error, modal } = useTemplateDeploy({
canvasCoords: ..., // optional, default random
onDeployed: (id) => ...,
});
Closes three drift surfaces that the duplication had created:
- `resolveRuntime` id→runtime fallback table (moved to
`deploy-preflight.ts`). EmptyState had a narrower fallback that
would have silently disagreed with the palette on any future id
needing a non-identity mapping.
- `checkDeploySecrets` call signature. One owner.
- `MissingKeysModal` JSX wiring. One owner.
Narrow try/catch around `checkDeploySecrets` so a preflight network
failure clears `deploying` and surfaces via `setError` instead of
stranding the button forever. `modal: ReactNode` (not a
`renderModal()` function) — the previous memoization bought
nothing since consumers called it inline every render. Named
`MissingKeysInfo` interface for the state shape.
## 4. Viewport auto-fit user-pan gate fix
During org deploy the canvas was meant to pan+zoom to follow each
arriving workspace (`molecule:fit-deploying-org` event → debounced
fitView). In practice the fit stayed stuck on wherever the first
fit landed.
Root cause: React Flow v12 fires `onMoveEnd` with a truthy `event`
at the END of a programmatic `fitView` animation. The original
"respect-user-pan" gate stamped `userPannedAtRef` in `onMoveEnd`,
so our own fit completing looked like a user pan, and every
subsequent auto-fit short-circuited for the rest of the deploy.
Fix: stop trusting `onMoveEnd` for user-intent detection. Register
explicit `wheel` + `pointerdown` listeners on `document` with
capture phase and `target.closest('.react-flow__pane')` filter.
Capture-phase immunity to `stopPropagation`; pane-filter rejects
toolbar / modal / side-panel clicks (the old `window` fallback
caught those). `onMoveEnd` simplified to only drive the debounced
viewport save.
Also: fit event dispatched on root arrivals (not just children),
so the canvas centers on the just-landed root immediately instead
of waiting ~2s for the first child. Animation 600ms → 400ms so
successive per-arrival fits don't pile up visually. End-state fit
stays at 1200ms — intentional asymmetry ("settling" vs
"tracking"), documented in code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Merge origin/staging into fix/canvas-multilevel-layout-ux. 18 files
auto-merged (mostly canvas/tabs/chat and workspace-server handlers
the earlier DIRTY marker was stale relative to current staging).
- Fix 7 test failures surfaced by the merge:
1. Canvas.pan-to-node.test.tsx — mockGetIntersectingNodes was
inferred as vi.fn(() => never[]); mockReturnValueOnce of a node
object failed type check. Explicit return-type annotation.
2. Canvas.pan-to-node.test.tsx + Canvas.a11y.test.tsx — Canvas.tsx
reads deletingIds.size (new multilevel-layout state). Both mock
stores lacked deletingIds; added new Set<string>() to each.
3. canvas-batch-partial-failure.test.ts — makeWS() built a wire-
format WorkspaceData (snake_case, with x/y/uptime_seconds). The
store's node.data is now WorkspaceNodeData (camelCase, no wire-
only fields). Rewrote makeWS to produce WorkspaceNodeData and
updated 5 call-site casts. No assertions changed.
4. ConfigTab.hermes.test.tsx — two tests pinned pre-#2061 behavior
that the PR intentionally inverts:
a. "shows hermes-specific info banner" — RUNTIMES_WITH_OWN_CONFIG
now contains only {"external"}, so the banner is no longer
shown for hermes. Inverted assertion: now pins ABSENCE of
the banner, with a comment noting the inversion.
b. "config.yaml runtime wins over DB" — priority reversed:
DB is now authoritative so the tier-on-node badge matches
the form. Inverted scenario: DB=hermes + yaml=crewai →
form shows hermes. Switched test's DB runtime off langgraph
because the dropdown collapses langgraph into an empty-
valued "default" option that would hide the win signal.
- No production code changed — this commit is staging merge + test
realignment only. 953/953 canvas tests pass. tsc --noEmit clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Session's accumulated UX work across frontend and platform. Reviewable
in four logical sections — diff is large but internally cohesive
(each section fixes a gap the next one depends on).
## Chat attachments — user ↔ agent file round trip
- New POST /workspaces/:id/chat/uploads (multipart, 50 MB total /
25 MB per file, UUID-prefixed storage under
/workspace/.molecule/chat-uploads/).
- New GET /workspaces/:id/chat/download with RFC 6266 filename
escaping and binary-safe io.CopyN streaming.
- Canvas: drag-and-drop onto chat pane, pending-file pills,
per-message attachment chips with fetch+blob download (anchor
navigation can't carry auth headers).
- A2A flow carries FileParts end-to-end; hermes template executor
now consumes attachments via platform helpers.
## Platform attachment helpers (workspace/executor_helpers.py)
Every runtime's executor routes through the same helpers so future
runtimes inherit attachment awareness for free:
- extract_attached_files — resolve workspace:/file:///bare URIs,
reject traversal, skip non-existent.
- build_user_content_with_files — manifest for non-image files,
multi-modal list (text + image_url) for images. Respects
MOLECULE_DISABLE_IMAGE_INLINING for providers whose vision
adapter hangs on base64 payloads (MiniMax M2.7).
- collect_outbound_files — scans agent reply for /workspace/...
paths, stages each into chat-uploads/ (download endpoint
whitelist), emits as FileParts in the A2A response.
- ensure_workspace_writable — called at molecule-runtime startup
so non-root agents can write /workspace without each template
having to chmod in its Dockerfile.
Hermes template executor + langgraph (a2a_executor.py) + claude-code
(claude_sdk_executor.py) all adopt the helpers.
## Model selection & related platform fixes
- PUT /workspaces/:id/model — was 404'ing, so canvas "Save"
silently lost the model choice. Stores into workspace_secrets
(MODEL_PROVIDER), auto-restarts via RestartByID.
- applyRuntimeModelEnv falls back to envVars["MODEL_PROVIDER"]
so Restart propagates the stored model to HERMES_DEFAULT_MODEL
without needing the caller to rehydrate payload.Model.
- ConfigTab Tier dropdown now reads from workspaces row, not the
(stale) config.yaml — fixes "badge shows T3, form shows T2".
## ChatTab & WebSocket UX fixes
- Send button no longer locks after a dropped TASK_COMPLETE —
`sending` no longer initializes from data.currentTask.
- A2A POST timeout 15 s → 120 s. LLM turns routinely exceed 15 s;
the previous default aborted fetches while the server was still
replying, producing "agent may be unreachable" on success.
- socket.ts: disposed flag + reconnectTimer cancellation + handler
detachment fix zombie-WebSocket in React StrictMode.
- Hermes Config tab: RUNTIMES_WITH_OWN_CONFIG drops 'hermes' —
the adaptor's purpose IS the form, banner was contradictory.
- workspace_provision.go auto-recovery: try <runtime>-default AND
bare <runtime> for template path (hermes lives at the bare name).
## Org deploy/delete animation (theme-ready CSS)
- styles/theme-tokens.css — design tokens (durations, easings,
colors). Light theme overrides by setting only the deltas.
- styles/org-deploy.css — animation classes + keyframes, every
value references a token. prefers-reduced-motion respected.
- Canvas projects node.draggable=false onto locked workspaces
(deploying children AND actively-deleting ids) — RF's
authoritative drag lock; useDragHandlers retains a belt-and-
braces check.
- Organ cancel button (red pulse pill on root during deploy)
cascades via existing DELETE /workspaces/:id?confirm=true.
- Auto fit-view after each arrival, debounced 500 ms so rapid
sibling arrivals coalesce into one fit (previous per-event
fit made the viewport lurch continuously).
- Auto-fit respects user-pan — onMoveEnd stamps a user-pan
timestamp only when event !== null (ignores programmatic
fitView) so auto-fits don't self-cancel.
- deletingIds store slice + useOrgDeployState merge gives the
delete flow the same dim + non-draggable treatment as deploy.
- Platform-level classNames.ts shared by canvas-events +
useCanvasViewport (DRY'd 3 copies of split/filter/join).
## Server payload change
- org_import.go WORKSPACE_PROVISIONING broadcast now includes
parent_id + parent-RELATIVE x/y (slotX/slotY) so the canvas
renders the child at the right parent-nested slot without doing
any absolute-position walk. createWorkspaceTree signature gains
relX, relY alongside absX, absY; both call sites updated.
## Tests
- workspace/tests/test_executor_helpers.py — 11 new cases
covering URI resolution (including traversal rejection),
attached-file extraction (both Part shapes), manifest-only
vs multi-modal content, large-image skip, outbound staging,
dedup, and ensure_workspace_writable (chmod 777 + non-root
tolerance).
- workspace-server chat_files_test.go — upload validation,
Content-Disposition escaping, filename sanitisation.
- workspace-server secrets_test.go — SetModel upsert, empty
clears, invalid UUID rejection.
- tests/e2e/test_chat_attachments_e2e.sh — round-trip against
a live hermes workspace.
- tests/e2e/test_chat_attachments_multiruntime_e2e.sh — static
plumbing check + round-trip across hermes/langgraph/claude-code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Marketing-lead agent's rename pass updated the "renders all three plans"
test (lines 56-57) but missed lines 77, 94, 114, 132, 143, 158 which still
referenced the pre-rename "Upgrade to Starter" / "Upgrade to Pro" button
names. Canvas (Next.js) build failed with getByRole timeout because the
component now says "Upgrade to Team" / "Upgrade to Growth".
Internal PlanId tuple ("free" | "starter" | "pro") and startCheckout(planId)
call are unchanged — only the user-facing button labels shifted, so
assertions like startCheckout("pro", "acme") still match the server-side API.
Verified locally: 9/9 PricingTable tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Schema-driven ChannelsTab renders no inputs when config_schema is
absent — the test's bare {type, display_name} mock mismatched the
real API shape and every getByLabelText("Bot Token") failed.
Mock now mirrors GET /channels/adapters with the Telegram schema
(bot_token password + chat_id text) so the a11y assertions run
against the actual rendered form.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lark adapter was already implemented in Go (lark.go — outbound Custom Bot
webhook + inbound Event Subscriptions with constant-time token verify),
but the Canvas connect-form hardcoded a Telegram-shaped pair of inputs
(bot_token + chat_id). Selecting "Lark / Feishu" from the dropdown
silently sent the wrong field names — there was no way to enter a
webhook URL.
Fix: move form shape to the server.
- Add `ConfigField` struct + `ConfigSchema()` method to the
`ChannelAdapter` interface. Each adapter declares its own fields with
label/type/required/sensitive/placeholder/help.
- Implement per-adapter schemas:
- Lark: webhook_url (required+sensitive) + verify_token (optional+sensitive)
- Slack: bot_token/channel_id/webhook_url/username/icon_emoji
- Discord: webhook_url + optional public_key
- Telegram: bot_token + chat_id (unchanged UX, keeps Detect Chats)
- Change `ListAdapters()` to return `[]AdapterInfo` with config_schema
inline. Sorted deterministically by display name so UI ordering is
stable across Go's random map iteration.
- Update the 3 existing `ListAdapters` test sites to struct access.
Canvas (`ChannelsTab.tsx`):
- Replace the two hardcoded bot_token/chat_id inputs with a single
schema-driven `SchemaField` component. Renders one input per field in
the order the adapter returns them.
- Form state becomes `formValues: Record<string,string>` keyed by
`ConfigField.key`. Values reset on platform-switch so stale
Telegram credentials can't leak into a new Lark channel.
- "Detect Chats" stays but only renders for platforms in
`SUPPORTS_DETECT_CHATS` (Telegram only — the only provider with
getUpdates).
- Only schema-known keys are posted in `config`, scrubbing any stale
values from previous platform selections.
Regression tests:
- `TestLark_ConfigSchema` locks in the 2-field Lark contract with the
required/sensitive flags correctly set.
- `TestListAdapters_IncludesLark` confirms registry wiring + schema
survives round-trip through ListAdapters.
Known pre-existing `TestStripPluginMarkers_AwkScript` failure in
internal/handlers is unrelated to this change (verified via stash+test
on clean staging).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Preparation for a "hundreds of runtimes" plugin ecosystem. Keeping the
runtime-specific UX knobs in-line inside ProvisioningTimeout scales badly
— every new runtime would require editing a component, not just adding a
table entry. Other components (create-workspace dialog, workspace card
tooltips, etc.) will want the same runtime metadata.
Changes:
- New file `canvas/src/lib/runtimeProfiles.ts` owns:
* `RuntimeProfile` type — structural shape, every field optional so
new runtimes can partially-fill without breaking consumers.
* `DEFAULT_RUNTIME_PROFILE` — 2-min default floor (docker-fast).
* `RUNTIME_PROFILES` — named overrides (currently: hermes 12 min).
* `WorkspaceRuntimeOverrides` — interface for server-provided
per-workspace overrides, so operators can tune via template
manifest / workspace metadata without a canvas release.
* `getRuntimeProfile()` — resolver with
overrides → profile → default priority.
* `provisionTimeoutForRuntime()` — convenience wrapper.
- `ProvisioningTimeout.tsx` now delegates to the profile module.
`DEFAULT_PROVISION_TIMEOUT_MS` re-exported for legacy test importers.
- Tests: 16/16 (up from 9 before the first fix). Adds pinning for:
* overrides > profile > default priority chain
* "every entry in RUNTIME_PROFILES resolves to a number" contract
* backward-compat export
Adding a new slow runtime is now one table entry in
`canvas/src/lib/runtimeProfiles.ts` with a mandatory `WHY` comment.
Moving to server-driven profiles later is a ~10-line change (the
resolver already threads WorkspaceRuntimeOverrides through).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hermes workspaces cold-boot in 8-13 min (ripgrep + ffmpeg + node22 +
hermes-agent source build + Playwright + Chromium ~300MB). The canvas's
2-min hardcoded "Provisioning Timeout" warning fired at ~2min and told
users their workspace was "stuck" while it was still mid-install. Users
hit Retry, triggering fresh cold boots and cancelling healthy workspaces.
User-facing symptom (reported 2026-04-24 18:35Z): hermes workspace showed
"has been provisioning for 3m 15s — it may have encountered an issue"
with Retry + Cancel buttons, while the EC2 was installing node_modules.
Fix:
- Keep DEFAULT_PROVISION_TIMEOUT_MS = 120_000 (2min) — correct for fast
docker runtimes (claude-code, langgraph, crewai) where cold boot is
30-90s.
- Add RUNTIME_TIMEOUT_OVERRIDES_MS = { hermes: 720_000 } (12min).
Aligns with tests/e2e/test_staging_full_saas.sh's
PROVISION_TIMEOUT_SECS=900 (15min) so UI warns shortly before the
backend itself gives up.
- New timeoutForRuntime() resolves the base; per-node lookup in the
check-timeouts interval so a mixed batch (1 hermes + 2 langgraph) uses
the right threshold for each.
- timeoutMs prop is now optional. Undefined → per-runtime lookup; a
number → forces a single threshold for every workspace (tests use this
for deterministic behavior).
Tests: 4 new cases pinning the runtime-aware resolution, including a
guard that catches future regressions that would weaken hermes's budget.
Existing tests unchanged (they import DEFAULT_PROVISION_TIMEOUT_MS which
still exports 120_000).
13/13 pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EmbeddedTeam was defined in WorkspaceNode.tsx but had no call site —
TeamMemberChip (which is called directly) covers the same rendering
responsibility. The function was stranded after a prior refactor and
was flagged by github-code-quality on PR #1989 (merged 2026-04-24T14:09Z
without this cleanup because the token died before push).
Removes 25 lines of dead code. MAX_NESTING_DEPTH is kept — it is used
by TeamMemberChip at line 498.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TeamMemberChip used MAX_NESTING_DEPTH to cap recursive sub-agent
rendering at depth 3, but the constant was never declared — causing
a TypeScript build error ('Cannot find name MAX_NESTING_DEPTH') that
blocked Canvas CI on PR #1989.
Add the constant above EmbeddedTeam with a doc comment explaining its
purpose (guards against circular parentId cycles + readability cap).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CI failure: "Cannot find name 'useMemo'" at line 363.
useMemo was called but not imported — likely dropped during refactor.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Matches tests/e2e/test_staging_full_saas.sh's 20-min budget (#1930).
Canvas E2E was still stuck at 900s (15 min) which regularly flakes on
tenant cold boots in 12-15 min range — especially on staging where
workspace-server image pulls + AMI bootstrapping add 3-5 min vs prod.
Concrete blocker: 2026-04-24 staging→main sync (#1981) kept failing on
"tenant provision: timed out after 900s" in canvas/e2e/staging-setup.ts
despite the actual sync E2E going green. Canvas-side timeout was
strictly tighter than the sync-side timeout.
Also raises WORKSPACE_ONLINE_TIMEOUT_MS to 20 min to cover the case
where the workspace EC2 is provisioned but hermes cold-install (apt +
uv + hermes-agent clone + gateway boot) takes longer than the original
10-min budget — matches the 20-min workspace deadline in SaaS E2E.
No behavior change when things are fast. Just covers the tail.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five tightly-related fixes surfaced while stress-testing org-template
imports (Legal Team, Molecule Company, etc.) on a running control plane:
1) Org import was silently failing — INSERT wrote `collapsed` into the
`workspaces` table but that column lives on `canvas_layouts`
(005_canvas_layouts.sql). Every import returned 207 with 0 rows
created, which `api.post` treated as success → green "Imported"
toast + empty canvas. Moved the write to canvas_layouts; updated
the workspace_crud PATCH path to UPSERT there too; refreshed the
test mock. Added a client-side assertion that throws on
2xx-with-`error`-body so future partial-failures surface a red
toast rather than lying about success.
2) Multi-level nested layout was collision-prone: children that were
themselves parents (CTO → Dev Lead → 6 engineers) got the same
leaf-sized grid slot as leaf siblings and clipped into each other.
Added post-order `sizeOfSubtree` + sibling-size-aware
`childSlotInGrid` on both the Go server and the TS client (kept in
sync). `buildNodesAndEdges` now uses subtree sizes for both parent
dimensions and the rescue heuristic. `setCollapsed` on expand now
reads each child's actual rendered width/height instead of the
leaf-count formula — a regression test covers the CTO/Dev Lead
scenario.
3) Provisioning-timeout banner was unusable during large imports: a
30-workspace tree triggered 27 simultaneous "stuck" warnings 2
minutes in (server paces + provision concurrency = 3 guarantee tail
items legitimately wait longer). Scaled threshold with concurrent
count (base + 45s per queue slot beyond concurrency) and added a
Dismiss (×) button per banner.
4) Auto pan-and-zoom on org ready: after the last workspace flips out
of `provisioning`, canvas now fitView's with a 1.2s animation,
0.25 padding, `maxZoom: 0.8` and `minZoom: 0.25`. Without the zoom
caps fitView was hitting the component's maxZoom=2 on small trees
and zooming in instead of out.
5) Toolbar was visually busy: `+ N sub` count wrapped onto a second
row on narrow viewports; status dot and workspace total were in
separate border-delimited cells. Merged into one segment with
`whitespace-nowrap`; A2A / Audit / Search / Help collapsed to
icon-only 28px buttons with tooltip + aria-label (Figma/Linear
pattern). Stop All / Restart Pending keep text — they're urgent.
Also:
- `api.{get,post,...}` accept an optional `{ timeoutMs }` so callers
that hit intentionally-slow endpoints (org import paces 2s between
siblings) don't trip the 15s default and report false aborts.
- `WorkspaceNode` clamps role text to 2 lines so verbose descriptions
don't unboundedly grow card height and break the grid.
- `PARENT_HEADER_PADDING` bumped 44→130 to clear name + runtime +
2-line role + the currentTask banner that appears during the
initial-prompt phase.
Tests: 930 canvas tests + full Go handler suite pass. Added
regressions for (i) 207 partial-success surfacing as throw, and
(ii) setCollapsed sizing with nested-parent children.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AuthGate now skips session fetch for /cp/auth/* paths, and
redirectToLogin guards against re-setting window.location when
already on an auth path. Both guards had no test coverage —
a future refactor could silently reintroduce the redirect loop.
Added:
- AuthGate.test.tsx: 2 cases covering /cp/auth/login and
/cp/auth/signup path skipping (no fetchSession call, no
redirectToLogin call, children rendered)
- auth.test.ts: 2 cases covering redirectToLogin early return
for /cp/auth/login and /cp/auth/signup paths
Fixes: Molecule-AI/molecule-core#1541
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
React Flow requires parent nodes to appear before their children in
the nodes array. When they don't, it logs "Parent node {id} not
found. Please make sure that parent nodes are in front of their
child nodes in the nodes array" and — more importantly — renders
the child at canvas-absolute coords instead of parent-relative,
flashing it far outside the parent.
topology's buildNodesAndEdges already enforced this at hydrate, but
nestNode + batchNest weren't re-sorting after mutating parentId.
A freshly-nested child often ended up after-first-drag at the
wrong screen position because its new parent sat later in the
array than itself.
Extract sortParentsBeforeChildren() into canvas-topology as a
reusable DFS visit; call it at the tail of both nestNode's set()
and batchNest's commit set(). 923 tests still green — no behaviour
change beyond eliminating the warning and the position flash.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Canceling the nest/extract dialog restored the child's position but
left the parent card at its auto-grown size. growParentsToFitChildren
fires on drag-stop to fit a then-outside child; when the drag is
subsequently cancelled, the parent keeps that grown width/height
forever because the grow pass is grow-only.
Strip width/height from the ex-parent alongside the child position
restore in cancelNest — React Flow re-measures from CSS, parent
collapses back to its natural size. Same trick nestNode already
uses for the un-nest path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-up polish items for drag-and-nest:
1. Cancelling the "Extract from team?" dialog now snaps the
dragged card back to where the drag started. Before, a user
who dragged a child out, saw the confirm dialog, then clicked
Cancel ended up with the card stranded outside the parent at
its drop-point position — which also got persisted via
savePosition on drag-stop. Now onNodeDragStart captures the
pre-drag position + parent, and cancelNest restores both the
RF node position and fires savePosition with the absolute
pre-drag coords so reload matches.
2. Un-nesting now clears the ex-parent's explicit width/height
in the nodes array. growParentsToFitChildren is grow-only so
it could never shrink the parent back down after a child
left; the card stayed at its auto-grown size with empty
space. Stripping width/height lets React Flow re-measure from
the card's own min-width / min-height CSS, so the parent
visually shrinks to fit whatever children remain.
923 canvas tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Un-nest used to require holding Alt (or Cmd to force-detach). That
was too conservative — when a user dragged a child clearly outside
its parent's bbox, nothing happened on release, because the default
branch soft-clamped back and only the Alt branch actually opened
the "Extract?" confirm. Matches the exact bug the user just flagged
("I can put agents in other agent, but when I drag it out, it does
not move out").
New rules:
* Past the 20 % hysteresis → confirm un-nest. Plain drag, no
modifier. This is what most users expect (Miro / Figma behave
the same way — drag outside the frame and the shape leaves it).
* Inside or within 20 % of the edge → soft-clamp back inside.
Guards against twitchy releases that momentarily overshoot the
edge by a few pixels.
* Cmd / Ctrl → force un-nest regardless of overlap. Escape-hatch
for when the user dragged within the hysteresis zone but really
wants out.
* Dropping onto a different parent → nest there (unchanged).
Alt is no longer a required modifier for un-nesting. Keeps it as
a non-gesture modifier only; no meaning unless we re-bind it later.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 600-req/min/IP bucket is sized for SaaS where each tenant has
a distinct client IP. On a local Docker setup every panel shares
one IP — hydration (/workspaces + /templates + /org/templates +
/approvals/pending) plus polling (A2A overlay + activity tabs +
approvals + schedule + channels + audit trail) can burst past the
bucket inside a minute, blanking the canvas with 429s. The user
reported it after dragging workspaces — dragging itself is
release-only (savePosition in onNodeDragStop), but the polling
that's always running added onto startup tripped the limit.
Two-layer fix:
Server: RateLimiter.Middleware short-circuits when isDevModeFailOpen
is true (MOLECULE_ENV=development + empty ADMIN_TOKEN), matching
the Tier-1b hatch already applied to AdminAuth, WorkspaceAuth, and
discovery. SaaS production keeps the bucket.
Client: api.ts auto-retries a single 429 on idempotent GET requests,
waiting the server-provided Retry-After (capped at 20s). Mutations
(POST/PUT/PATCH/DELETE) never auto-retry to avoid double-applying.
Users on SaaS hitting a legitimate rate-limit spike get one
transparent recovery instead of an immediately-blank Canvas.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Importing a 15-workspace org template dropped every child as a
freely-positioned card into its parent's coordinate space. Parents
with 5-10 kids had the kids spill below the parent's initial min
size, producing the "ugly default" layout the user just flagged —
a mess of overlapping cards the moment the import completed.
Fix: every workspace in an org-template import that HAS children
is inserted with `collapsed = true`. Leaf workspaces stay
expanded (nothing to hide). The canvas renders a collapsed
parent as a compact header-only card with its "N sub" badge —
visually identical to the pre-refactor default the user asked for.
Double-click on a collapsed parent now EXPANDS it (flipping
`collapsed` locally + persisting via PATCH) so the user can drill
in to see the subtree. Only once expanded does a second
double-click zoom-to-team, matching the prior behaviour.
Leaf-first creation order stays the same; the collapsed flag
just means "render compact" not "hide from API".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five review findings from the 3f11df03 six-bug commit:
1. Add TestPeers_DevModeFailOpen_{Allows,ClosedWhenAdminTokenSet,
ClosedInProduction} covering all three gating states for the
security-sensitive dev-mode hatch the prior commit added to
/registry/:id/peers. Previously shipped untested — a future
refactor could have silently inverted polarity or removed the
gate. New tests pin the contract:
* MOLECULE_ENV=development + ADMIN_TOKEN="" → allow bearerless
* MOLECULE_ENV=development + ADMIN_TOKEN set → require token
* MOLECULE_ENV=production → require token
2. ConfigTab handleSave diffs against the RAW parsed YAML / form
config instead of the DEFAULT_CONFIG-merged shape. The previous
code would silently PATCH tier=1 to the DB when a user deleted
the `tier:` line in raw mode (the default-merge substituted 1).
Now: only fields the user actually typed participate in the
diff. Type guards (typeof === "number" / "string") prevent
coercion surprises on malformed YAML.
3. ConfigTab model-save failure no longer lies "Saved". The
/workspaces/:id/model PATCH can reject when the runtime doesn't
support the chosen model; previously we caught + console.warn'd
+ showed green Saved, and the user watched the model revert on
next reload with no explanation. Now the save path collects a
`modelSaveError` and surfaces it via setError with a partial-
success message ("Other fields saved, but model update failed:
…") so the user sees why.
4. ChannelsTab now surfaces BOTH channels-fetch and adapters-fetch
failures, distinguishing them in the error text ("Failed to
load connected channels and platforms — try refreshing").
Previously only an adapters failure was visible; a channels
failure left the user with an apparently-empty list and no
indication the API was unreachable.
5. ChatTab panels drop the redundant aria-hidden attribute. The
`hidden`/`flex` Tailwind class already sets display:none, which
removes the node from the accessibility tree on its own; the
extra aria-hidden invited WAI-ARIA lint warnings if a focusable
descendant ever landed inside an inactive panel.
Tests: 923 canvas + full Go handler suite pass. 3 new Go tests.
No behaviour change on the five prior fixes — this commit tightens
their edges per the independent review.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Six bugs reported from a live session — all shippable in one commit:
1. Peers tab 401 on local Docker. The /registry/:id/peers endpoint
demands a workspace-scoped bearer token (validateDiscoveryCaller)
which the canvas session doesn't hold. Added the same Tier-1b
dev-mode fail-open hatch that AdminAuth and WorkspaceAuth already
use — gated by MOLECULE_ENV=development + empty ADMIN_TOKEN, so
SaaS production stays strict. Exported IsDevModeFailOpen from the
middleware package for the handler layer to reuse.
2. Org Templates list unscrollable. OrgTemplatesSection was rendered
in the TemplatePalette footer — a div without overflow — so when
it expanded to 15+ entries the list extended past the viewport
with no scroll. Moved it to the top of the flex-1 overflow-y-auto
container. Tall lists now scroll naturally.
3. Chat tab: "My Chat" and "Agent Comms" rendered stacked instead
of switching. HTML `hidden` attribute was being overridden by
Tailwind's `flex` class (display: flex beats the attribute),
so both tabpanels rendered concurrently. Swapped to a conditional
Tailwind `hidden`/`flex` class so the inactive panel is
display:none with proper CSS specificity.
4. Hermes Config form never persists. handleSave wrote config.yaml
but name / tier / runtime / model all live on the workspace row
(or the dedicated /workspaces/:id/model endpoint) — the form
edited in-memory, the request returned 200, the next reload
wiped everything back. Hermes + external runtimes manage their
own config inside the container anyway, so writing config.yaml
is a no-op for them; skip it. Always diff and PATCH the DB-backed
fields that actually changed.
5. Channels "+ Connect" dropdown empty on first open. ChannelsTab's
load() used Promise.all with a silent catch — if EITHER the
channels or adapters fetch failed, both setters were skipped
with no error visible. Switched to Promise.allSettled so each
endpoint settles independently, and the adapters failure now
surfaces via the top-level error state.
6. Plugin registry always "No plugins in registry". Same silent
catch pattern in SkillsTab.tsx — load errors for /plugins,
/plugins/sources, and /workspaces/:id/plugins swallowed without
logging. Replaced the empty catches with console.warn so future
failures are at least visible in devtools.
Tests: 923 passing (unchanged). Go handler tests pass. Server
rebuilt and running with the peers-auth + collapsed-persistence
fixes (pid 15875).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AllKeysModal already handles focus via autoFocus={index === 0} on the
first input and a separate title-focus effect. The orphaned useEffect
referencing firstInputRef (declared only in ProviderPickerModal) caused
a TypeScript build error: "Cannot find name 'firstInputRef'".
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Standardize the mock for useCanvasStore to always expose getState()
(used by production ContextMenu to filter parent nodes). Applies the
same Object.assign-wrapping pattern introduced in #1744 to:
- ClaudeSettings.test.tsx
- tabs.a11y.test.tsx
- ContextMenu.keyboard.test.tsx (mockStore shape alignment)
Cherry-pick from #1744 left the backdrop div without aria-hidden="true"
(the outer dialog div got it instead). Re-apply aria-hidden="true" to
the backdrop div so screen readers skip the clickable overlay layer.
Also revert test assertion from bg-black → bg-black/70 to match the
exact class applied to the backdrop div.
Three follow-up review findings from the c2b2e13a review:
1. Rescue heuristic uses pure bbox-non-overlap. The previous
`position.x < 0` branch rescued any child whose parent was
later dragged past it, even when the layout was clearly
recoverable (e.g. relative -40, child still overlaps parent).
New rule: rescue iff the child's bbox has zero overlap with
the parent's bbox — self-calibrating, scales with user-resized
parents, catches screenshot-case and legacy huge-positive data.
2. Toast caps failed-name list at 3 and appends "and N more".
Stops a 50-node partial failure from overflowing the toast
container.
3. Cycle guard on selection-roots walk in batchNest. Corrupt
parentId data can't send the loop infinite now. Cheap
defensive guard — one Set per selected node.
Tests added (923 total, up from 918):
* canvas-topology.test: 4 rescue scenarios — screenshot case
(zero-overlap rescue), negative drift kept, huge-positive
rescued, user-resized layout kept.
* canvas.test: selection-roots filter on a 3-level chain.
* workspace_crud test: PATCH {collapsed:true} runs the UPDATE.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five issues surfaced in the review of 50b53784. Each was either a real
bug waiting to hit users or a silent failure mode.
1. Topology rescue no longer teleports user-resized children.
Rescue was comparing against parentMinSize(childCount), so any
child the user had placed in space the parent was resized into
got snapped to the default grid on reload — undoing the layout.
Now rescue fires only on obviously corrupt data: negative
relative coords (legacy pre-nesting absolute positions that
landed above/left of their assigned parent) or values past an
MAX_PLAUSIBLE_OFFSET threshold. Children just-past the initial
minimum are left alone.
2. batchNest now filters to selection-roots before planning.
Previously selecting both A and A's descendant B and dragging
into T yanked B out of A to become a sibling under T. Users
reasonably expect the A subtree to move intact. The new pass
drops any selected node whose ancestor is also selected —
those follow their ancestor via React Flow's parent binding.
3. batchNest surfaces partial failure via showToast. Previously
silent: 2 of 5 PATCHes fail, user sees 3 cards re-parented + 2
snapped back with no explanation. Now names the failed cards.
4. confirmNest closes the nest dialog BEFORE dispatching the async
store action, so a second drag can't kick off a competing batch
while the first is still in flight.
5. collapsed is now persisted. The Go workspace_crud.go Update
handler ignored the `collapsed` field, so user-initiated
collapse round-tripped to an expanded state on next hydrate.
Added the PATCH branch (`UPDATE workspaces SET collapsed = ...`)
so the state survives reload.
Nits cleaned:
* Removed dead dragStartParentRef in useDragHandlers.
* Swapped redundant `node.data as WorkspaceNodeData` casts for a
named WorkspaceNode type alias.
* Canvas.tsx SR-live region now reads n.parentId (matches
MiniMap + RF's native field) instead of the mirror n.data.parentId.
Tests added (918 total, up from 915):
* batchNest happy path — 2-root selection fires 2 combined PATCHes
carrying parent_id + x + y, not 2×N sequential round-trips.
* batchNest ancestor+descendant selection — subtree stays intact.
* batchNest partial failure rollback — only the rejected nodes
revert; successful ones stay committed.
Backend change is single-line (collapsed PATCH branch); all
workspace_crud Go tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two concerns in one commit (separate files, each self-contained):
## Canvas.tsx split (from ~680 to ~250 lines)
Canvas.tsx was holding drag gesture state + keyboard shortcuts +
viewport wiring + JSX. Each concern now lives in its own unit under
canvas/src/components/canvas/:
- dragUtils.ts — pure: shouldDetach, clampChildIntoParent,
DETACH_FRACTION
- DropTargetBadge.tsx — the floating "Drop into: <name>" label + the
dashed ghost preview at the target slot
- useDragHandlers.ts — encapsulates onNodeDragStart / Drag / Stop,
findDropTarget hit-test, pendingNest state,
and confirmNest/cancelNest. Routes multi-
select drags through batchNest automatically.
- useKeyboardShortcuts — Esc, Enter, Shift+Enter, Cmd+]/[, Z — one
window listener, one source of truth.
- useCanvasViewport — pan-to-node + zoom-to-team CustomEvent
listeners and the debounced viewport save.
Canvas.tsx becomes a thin composition + JSX file. No behavioural
change; the refactor is covered by the existing 915 canvas tests.
## batchNest parallelization (2N round-trips → N, all in flight)
Previously nestNode fired two sequential PATCHes (parent_id then x/y)
and batchNest looped nestNode sequentially. For a 5-node selection on
a typical ~200ms link this was ~2s of serialized RPCs.
- nestNode now combines parent_id + x + y into ONE PATCH. The Go
handler (workspace_crud.go Update) already reads all three from the
same body — no backend change.
- batchNest rewritten: compute every re-parent plan against one
snapshot, commit a single set(), then fire N PATCHes via
Promise.allSettled in parallel. Per-node failures roll back only
that node (others stay committed) — same semantics as the single-
node path, just concurrent.
- The state math in the batch path also correctly shifts descendant
zIndex by depthDelta when any re-parented node has a subtree.
## Also
- canvas-topology.ts: reverted P3.12's opt-in rescue to the auto-
rescue default. When a child's stored relative position would render
it outside the parent bbox (the visual regression the user saw after
collapse → reload — Hermes child drawn outside Claude Code Agent on
first paint), the child is placed in the next default grid slot.
The "Arrange Children" context command stays for bigger teams.
All 915 canvas tests pass. No backend changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five Critical issues caught in code review of f3423a51. Each one broke
an invariant the original commit claimed to uphold.
1. nestNode: descendants kept their old-depth zIndex after a re-parent.
Now walks the dragged subtree and shifts every descendant's zIndex
by the same depthDelta so "children above ancestors" survives moves
between levels of the hierarchy.
2. bumpZOrder: siblings all share zIndex = depth in fresh topology, so
a single +1 bump was identical for every sibling and subsequent
bumps drifted zIndex unboundedly. Rewritten to sort siblings by
current zIndex and swap the target with its neighbour in the bump
direction — Figma-style reorder, stays within the sibling tier.
3. findDropTarget: depth-first tiebreaker lost to bumped siblings. The
visually-frontmost card after Cmd+] is a shallow sibling, but the
hit test picked the deepest nested card regardless. Swapped order
so zIndex wins first, depth second, area third. Also pre-computes
the depth map once per call (was O(n²) via repeated .find walks —
will matter past ~30 workspaces).
4. arrangeChildren: saved absolute position using `slot + parent.position`,
but parent.position is RELATIVE to its own parent when nested.
Grandchildren's stored x/y were in the parent's local frame and
reload placed them in the wrong spot. Now walks the full ancestor
chain via absOf() to get the true canvas-absolute origin before
PATCHing.
5. setCollapsed: naive flip of every descendant's hidden flag diverged
from the topology rebuild on hydrate. Collapse A, collapse B, then
expand A — C should stay hidden because B is still collapsed, but
before this fix C was unhidden. Rewritten to recompute every
descendant's hidden from the full ancestry chain, matching the
topology pass byte-for-byte. New round-trip test asserts the two
code paths produce identical node.hidden across a full lifecycle.
Also:
- Removed dead cascadeMessage constant (never rendered).
- Replaced hardcoded 260/120 in zoom-to-team with exported constants.
- arrangeChildren PATCH catch now logs instead of silently swallowing.
- Added 70→76 tests: setCollapsed 3-chain scenarios, bumpZOrder swap
semantics, edge-of-list no-op.
All 915 canvas tests green. Backend untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ships the full prioritized improvement list from the canvas research
report — aligns our nesting/resize UX with Miro / FigJam / tldraw / Figma
conventions. Organized by priority below.
## P1 — baseline playability
* Hysteresis on drag-out detach (Miro): a child only un-nests when >=20%
of its bbox is outside the parent on release. Prevents accidental
un-nesting from twitchy drags.
* Drop-target now uses tree-depth DESC, then zIndex DESC, then area ASC
to pick targets when nested parents overlap (xyflow #2827).
* Children render above ancestors by inheriting zIndex = parent + 1 in
topology and on every nest/unnest (xyflow #4012).
* Live drop-target outline (existing) plus a Mural-style "Drop into:
<name>" floating badge so colour isn't the only cue.
* growParentsToFitChildren now fires only on dimension-type changes
inside onNodesChange (NodeResizer commits) and once on drag-stop —
avoids tldraw's edge-chase artifact (P3.11 commit-on-release).
## P2 — polish
* Whimsical-style ghost preview: dashed outline at the next default
grid slot inside the drop-target parent during drag.
* Alt-drag escape with soft clamp: dropping slightly outside a parent
without Alt/Cmd snaps the child back inside (clampChildIntoParent);
Alt releases the clamp to allow un-nest; Cmd/Ctrl force-detaches.
* Figma-style keyboard hierarchy nav: Enter descends to first child,
Shift+Enter ascends to parent, Cmd+]/[ re-orders siblings via the
new bumpZOrder store action.
* Multi-select re-parent preserves offsets: confirmNest routes through
a new batchNest action when the primary drag is part of a batch
selection (Lucidchart pattern).
## P3 — long-tail
* Minimap now shows parent cards as filled regions with a blue stroke,
so hierarchy reads at a glance without zooming.
* Out-of-bounds rescue is opt-in: topology no longer silently re-lays
children whose stored position is outside the parent bbox (Figma
trust-the-data). The new Arrange Children context menu item runs the
rescue on demand via arrangeChildren.
* Cmd-drag force-detach regardless of hysteresis.
* Collapse workspace: the existing Collapse Team action now toggles a
local setCollapsed store action that hides every descendant and
shrinks the parent card to header-only (Miro frame outline view).
Growth pass skips collapsed parents so they don't push back out.
All 910 canvas tests green. Backend untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two playability bugs in the new flat-cards layout:
1. On first load or fresh org import a parent had no explicit width or
height, so children whose stored position sat inside their (eventual)
parent's rectangle rendered visually outside the smaller default
parent box. Compute a parent starting size in canvas-topology:
• 2-column grid of child-default footprints + header/side padding
• Grows per child count (2→1 row, 3-4→2 rows, etc.)
and stamp it onto the Node's width/height so the first paint already
contains every child.
2. If a child's stored relative position actually falls outside the
parent's computed bounds (legacy org-imports at 0,0, pre-refactor
absolute coordinates, manually-nudged rows), assign that child a
deterministic default grid slot inside the parent instead.
Runtime cascade: added growParentsToFitChildren to onNodesChange so when
the user drags or resizes a child past the parent's current bounds, the
parent grows to contain it (+padding). Miro/FigJam-style frame auto-fit
— grow-only, never shrinks under the user's manual resize.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every workspace now renders as a first-class card on the canvas
regardless of parent_id. The old "parent card contains mini TeamMember
chips" layout is gone — if B is parented to A, B renders as a full card
inside A's coordinate space using React Flow's `parentId` binding, so
moving A carries B along and children have the same detail + actions as
root cards.
Details:
- canvas-topology.ts: topologically sort parents before children
(React Flow ordering requirement), compute each child's RF-native
parentId + relative position on load. DB keeps absolute x/y; the
abs→rel conversion happens here, reverse translation in
Canvas.onNodeDragStop before savePosition PATCHes the DB.
- WorkspaceNode.tsx: delete the EmbeddedTeam + TeamMemberChip blocks,
simplify the size classes, and add NodeResizer (visible when selected)
so users can drag any edge/corner to grow or shrink. Parent cards
default to a larger min size so nested children have breathing room.
- Canvas.tsx drop targeting rewritten: bounds-based hit test against
each node's measured absolute bbox, deepest match wins. Fixes two
prior bugs at once — dropping onto Claude Code with a nested same-
named Hermes no longer picks the wrong node, and the target can now
be a nested workspace when that's where the pointer actually released.
- canvas.ts nestNode + removeNode: translate position between old and
new parent's absolute origin on nest/unnest so the card doesn't jump,
and re-point the RF `parentId` alongside `data.parentId` on reparent.
- Tests: hidden-flag assertions replaced with parentId checks; obsolete
TeamMemberChip a11y/eject tests deleted (the UI component no longer
exists).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Dragging one workspace onto another could pick a nested child as the
"nearest" drop target instead of the visible parent card the user
actually hovered. The effect: dropping a free-floating Hermes Agent
onto a Claude Code Agent that already had a Hermes Agent nested inside
showed "Move 'Hermes Agent' inside 'Hermes Agent'?" — the confirmation
referenced the nested same-named child, not Claude Code.
Why: getIntersectingNodes returns every overlapping node, including
hidden=true children that render inside their parent's card. The
parent and child share bounding boxes, so the child often "won" the
nearest-distance check. Filter them out at the source: a node that's
already got a parentId (or is hidden) is never a valid top-level drop
target.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three complementary regression tests for the chain of P0s fixed today.
Each targets a specific bug class that reached production, and will
fire loud if any of them regress.
## 1. E2E A2A assertion enhancements (tests/e2e/test_staging_full_saas.sh)
The existing A2A check looked for "error|exception" in the response text,
which was too broad and missed the actual error patterns we hit. Now
matches each known error class individually with a diagnostic fail
message pointing at the exact bug:
- "[hermes-agent error 401]" → hermes #12 (API_SERVER_KEY)
- "hermes-agent unreachable" → gateway process died
- "model_not_found" → hermes #13 (model prefix)
- "Encrypted content is not supported" → hermes #14 (api_mode)
- "Unknown provider" → bridge PROVIDER misconfig
Also asserts the response contains the PONG token the prompt asked for —
catches silent-truncation/echo regressions.
## 2. Hermes install.sh bridge shell harness (tools/test-hermes-bridge.sh)
4 scenarios × 16 assertions, all offline (no docker, no network):
- openai-bridge-happy: OPENAI_API_KEY + openai/gpt-4o →
provider=custom, model="gpt-4o" (prefix stripped),
api_mode=chat_completions
- operator-custom-wins: explicit HERMES_CUSTOM_* → bridge skipped
- openrouter-not-touched: OPENROUTER_API_KEY → provider=openrouter,
slug kept
- non-prefixed-model: bare "gpt-4o" → prefix-strip is a no-op
Runs in <1s, can be wired into template-hermes CI. Pins the exact
config.yaml shape — any drift in derive-provider.sh or the bridge
if-block breaks a test.
## 3. Canvas ConfigTab hermes tests (ConfigTab.hermes.test.tsx)
5 vitest cases covering the #1894 bugs:
- Runtime loads from workspace metadata when config.yaml missing
- "No config.yaml found" red error hidden for hermes
- Hermes info banner shown instead
- Langgraph workspace still sees the red error (regression-guard the
other way)
- config.yaml runtime wins over workspace metadata when present
## Running
bash tools/test-hermes-bridge.sh # 16 assertions
cd canvas && npx vitest run src/components/tabs/__tests__/ConfigTab.hermes.test.tsx # 5 cases
# E2E enhancements ride on the existing staging E2E workflow
## Not yet covered (tracked in #1900)
CP admin delete-tenant EC2 cascade, cp-provisioner instance_id
lookup (#1738), purge audit SQL mismatch (#241), and pq prepared-
statement cache collision (#242). These are in-controlplane-repo
concerns — separate PR with CP-side sqlmock + integration tests.
Closes items in #1900.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two latent bugs kept the "Processing with Claude Code..." timer ticking
after the agent had already answered:
1. The A2A_RESPONSE store handler wrote into agentMessages[workspaceId]
(no prefix) but ChatTab's "clear sending" effect subscribed to
agentMessages["a2a:" + workspaceId]. Keys never matched — the effect
was dead code from day one. Removed the dead subscription and moved
the setSending(false) into the pendingAgentMsgs effect so any reply
delivered via a WS push (Claude Code SDK, Hermes's
send_message_to_user) also closes the spinner.
2. Added an activity-log fallback: when the platform emits a successful
a2a_receive ACTIVITY_LOGGED for this workspace, clear sending and
stop the timer. That covers the "runtime answered but we never saw
the store message" case Claude Code exhibited tonight — the HTTP
request can stay in flight while the SDK already pushed its reply.
Symmetric a2a_receive error path also clears sending and surfaces the
error message, so a runtime-side failure no longer hangs the UI.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The side-panel runtime pill read "unknown" for newly-deployed workspaces
because canvas-events.ts created the node from WORKSPACE_PROVISIONING
payload — and the payload only carried name + tier. No refetch filled
the gap during provisioning, so the user saw "RUNTIME unknown" on the
card even though the DB row had the real runtime set.
Includes runtime in every WORKSPACE_PROVISIONING emitter:
* handlers/workspace.go — initial create
* handlers/workspace_restart.go — explicit restart, auto-restart, and
crash-recovery resume loop
* handlers/org_import.go — multi-workspace org imports
Canvas-side: canvas-events.ts reads payload.runtime when creating the
node; the provisioning test asserts the pill value is populated before
any refetch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The MissingKeysModal's provider list was hardcoded in deploy-preflight.ts
as RUNTIME_PROVIDERS — a per-runtime map that duplicated what each
template repo already declares in its config.yaml. That meant adding a
new provider required changes in two places, and the UI could drift out
of sync with the actual template (e.g. when a template adds a MiniMax or
Kimi model, the picker wouldn't know).
The single source of truth for "which env vars does this workspace need"
is each template's config.yaml:
* `runtime_config.models[].required_env` — per-model key list
* `runtime_config.required_env` — runtime-level AND list
Go /templates already returned `models`. This change:
* Adds `required_env` alongside `models` on templateSummary so the
canvas receives the full picture.
* Rewrites deploy-preflight.ts to derive ProviderChoice[] from a
template object via `providersFromTemplate(template)`:
- groups `models[]` by unique required_env tuple
- falls back to runtime_config.required_env when models is empty
- decorates labels with model counts (e.g. "OpenRouter (14 models)")
* `checkDeploySecrets(template, workspaceId?)` now takes a template
object instead of a runtime string. Any-provider satisfaction still
short-circuits preflight to ok=true.
* MissingKeysModal receives `providers` directly; no more lookups.
* TemplatePalette threads `template.models` + `template.required_env`
into the preflight.
Side effects:
* Claude Code's dual-auth (OAuth token OR Anthropic API key) now
surfaces as two picker options — its config.yaml already declared
both, the UI just wasn't reading them.
* Hermes picker now shows 8 provider options (Nous, OpenRouter,
Anthropic, Gemini, DeepSeek, GLM, Kimi, Kilocode) instead of the
hand-picked 3, matching its 35-model reality.
Removed the legacy RUNTIME_PROVIDERS / RUNTIME_REQUIRED_KEYS /
getRequiredKeys / findMissingKeys exports; MissingKeysModal.test.tsx
deleted (its coverage is subsumed by the new template-driven
deploy-preflight.test.ts). 58 modal-adjacent tests pass; full canvas
suite 919 pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Runtimes like Hermes and LangGraph accept any one of several LLM
provider keys (OpenRouter OR OpenAI OR Anthropic OR Nous-native).
Before this change, the missing-keys modal treated all supported
providers as simultaneously required — a fresh user on Hermes was
asked for three parallel API keys when any one suffices.
Introduces RUNTIME_PROVIDERS in deploy-preflight.ts as the canonical
per-runtime provider list (label, envVar, note). checkDeploySecrets
now returns all alternatives as missingKeys when nothing is
configured, so the modal can offer a picker.
MissingKeysModal dispatches between two render paths:
* ProviderPickerModal — radio list of supported providers, a single
env input for the chosen one. Saving that one key satisfies the
preflight. Activated whenever the runtime has ≥2 provider choices.
* AllKeysModal — legacy parallel-inputs UX, all keys must be saved
before deploy. Kept for single-provider runtimes (claude-code,
gemini-cli) and callers that pass unrelated-key lists.
Dual-mode preserves the pre-existing contract for every caller while
fixing the multi-provider UX. All 930 canvas vitest tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The TemplatePalette's Org Templates section rendered all cards
inline, each ~120 px tall (name + description + "Import org" button).
With 4 org templates on disk that's ~500 px of drawer height — the
individual workspace templates at the top (AutoGen / LangGraph /
Hermes / …) got pushed off-screen, which is the exact complaint from
the test session ("templates still 90% org, cant even see normal
workspace template").
Collapsed the Org Templates section by default. The header now
toggles with an ▶ caret and shows the count ("Org Templates (4)").
Clicking expands to reveal the full card list; clicking again
collapses. Persists only within a session — fresh mounts start
collapsed so the primary deploy path stays visible.
Individual workspace templates are the usual starting point (pick a
runtime, deploy one agent), while org templates are a heavier
"deploy this whole pre-built team" action. Making the second
expandable matches the relative frequency.
- `TemplatePalette.tsx::OrgTemplatesSection` — added `expanded`
state (default false), wrapped the cards in `{expanded && …}`,
turned the header into a toggle button with `aria-expanded` +
`aria-controls`.
- `__tests__/OrgTemplatesSection.test.tsx` — 3 new rendering tests:
collapsed-by-default (cards absent), click expands (cards appear),
click again collapses (cards gone). Mocks /org/templates with a
2-entry response so the count assertion is stable.
Full canvas vitest: 930/930 pass (up from 927).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
### Two unrelated but small UI fixes surfaced while testing the Canvas
**1. Legend hidden under the open TemplatePalette.**
Legend is `fixed bottom-6 left-4 z-30`. TemplatePalette's drawer (when
open) is `fixed top-0 left-0 w-[280px] z-30` — same z-index, same
left-edge column. The Legend overlapped the palette's bottom 180 px.
Published the palette-open state to the canvas store so the Legend
can shift right (to `left-[296px]` — 280 px palette + 16 px gap) while
the palette is open, animated via a 200 ms `transition-[left]` to
match the palette's slide. Closes cleanly back to `left-4` when the
palette is dismissed.
Files:
- `store/canvas.ts` — added `templatePaletteOpen` + `setTemplatePaletteOpen`.
- `TemplatePalette.tsx` — calls `setTemplatePaletteOpen(open)` on
every open/close transition via a new useEffect.
- `Legend.tsx` — reads the flag and swaps `left-4` <-> `left-[296px]`.
**2. "WebSocket is closed before the connection is established" spam.**
Two components (`ChatTab`, `AgentCommsPanel`) open their own short-
lived WebSocket to tail the ACTIVITY_LOGGED stream. Their cleanup
path called `ws.close()` unconditionally, which trips a browser
console warning when React StrictMode re-runs the effect in dev and
the handshake hasn't completed yet. Confirmed via DevTools console
on the running canvas.
Added a `closeWebSocketGracefully(ws)` helper in `lib/ws-close.ts`:
- OPEN / CLOSING → close immediately (normal path).
- CONNECTING → defer close to the 'open' listener so the
browser sees a full handshake. Also wires an
'error' listener that cancels the queued close
if the handshake fails (no double-close).
- CLOSED → no-op.
Both consumers now call the helper in their useEffect cleanup.
Silences the warning without changing observable behaviour.
### Tests
`canvas/src/lib/__tests__/ws-close.test.ts` — 5 cases with a fake
WebSocket covering each readyState branch plus the error-before-open
cancellation path. Full vitest suite: 927/927 pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a workspace is selected the SidePanel (fixed, right-0, z-50)
opens from the right edge and covers the right third of the
viewport. The Toolbar at the top was positioned
`fixed top-3 left-1/2 -translate-x-1/2 z-20` — centred on the full
viewport, not the remaining canvas area. Consequence: the right half
of the Toolbar (Audit / Search / Help / Settings) was hidden behind
the panel as soon as the user clicked any workspace.
Fix: publish the live SidePanel width to the canvas store and read
it in Toolbar. When a node is selected, shift the Toolbar LEFT by
`sidePanelWidth / 2` so its centre lines up with the middle of the
remaining canvas area. Animated via a 200 ms `transition-[margin-left]`
to match the SidePanel's own slide-in easing.
- `store/canvas.ts` — added `sidePanelWidth` + `setSidePanelWidth`.
Default 480 (matches SIDEPANEL_DEFAULT_WIDTH).
- `SidePanel.tsx` — calls `setSidePanelWidth(width)` on every width
change so the store stays in sync with localStorage.
- `Toolbar.tsx` — reads `sidePanelWidth`, applies a negative
`marginLeft` style when `selectedNodeId` is non-null.
- `SidePanel.tabs.test.tsx` — added `setSidePanelWidth: vi.fn()` to
the mocked store state so SidePanel's new useEffect has a callable
to invoke. 18 previously-passing tests now pass again.
No visual regression when no workspace is selected — the toolbar
stays in its original centred position. SaaS canvas unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CreateWorkspaceDialog.a11y.test.tsx's two tier-button tests assumed
T1 was the default selection. After the previous commit flipped the
non-SaaS default to T3, the radio group's default-selected button
changed accordingly.
Updated:
- "tier buttons have role=radio and aria-checked reflects selection"
— T3 is now `aria-checked="true"`, T1 is the "unselected" foil we
click to verify the flip.
- "selected radio has tabIndex=0, others have tabIndex=-1" — T3 is
the tabindex=0 member now.
The roving-tabIndex and ArrowDown / ArrowRight tests further down the
file start by explicitly clicking/focusing T1 or T2, so they're
unaffected by the default change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Default tier for a newly-created workspace was T1 (Sandboxed) on
self-hosted and T4 (Full Access) on SaaS. Real work needs at minimum
a read_write workspace mount + Docker daemon access — that's T3
("Privileged") per the tier ladder in CreateWorkspaceDialog. The
user-visible consequence was that clicking "Deploy" on almost any
template landed in a sandbox that couldn't actually run the agent's
tooling until the user knew to bump the tier manually.
### Changes
**Platform (Go)** — default tier flipped from 1→3 in two places so
API callers (Canvas, molecli, org import) all get the same default:
- `handlers/workspace.go`: `POST /workspaces` default when `tier` is
omitted from the request body.
- `handlers/template_import.go`: `generateDefaultConfig` writes
`tier: 3` into the auto-generated `config.yaml` for bundle imports
that don't declare one.
**Canvas** — `CreateWorkspaceDialog.tsx` self-hosted form default
flipped from T1→T3. SaaS stays at T4 (each SaaS workspace runs on
its own sibling EC2, so the shared-blast-radius reasoning doesn't
apply and we can safely go a tier higher).
### Tests
Updated every sqlmock assertion that anchored on the old `tier=1`
default:
- `handlers_test.go::TestWorkspaceCreate` — default-path INSERT now
expects `3`.
- `handlers_additional_test.go::TestWorkspaceCreate_WithParentID` —
same.
- `workspace_test.go::TestWorkspaceCreate_DBInsertError` /
`TestWorkspaceCreate_WithSecrets_Persists` — same.
- `workspace_test.go::TestWorkspaceCreate_TemplateDefaults*` — same
(current handler semantics ignore the template's `tier:` field and
fall through to the default; kept tests faithful to the
implementation, left a comment flagging the latent inconsistency).
- `workspace_budget_test.go::TestWorkspaceBudget_Create_WithLimit` —
same.
- `template_import_test.go::TestGenerateDefaultConfig` — asserts
`tier: 3` now.
All `go test -race ./internal/handlers/` pass.
Canvas `CreateWorkspaceDialog` tests don't assert the default tier
(they only reference `tier` as prop data on stub workspaces) so no
test update needed on that side.
### SaaS parity
Zero behaviour change on hosted SaaS. The Go-side default only fires
when the Canvas (or any caller) omits `tier` from the request body.
The SaaS Canvas explicitly passes `tier: 4` from the
CreateWorkspaceDialog `isSaaS ? 4 : 3` branch, so the Go default
never runs on a SaaS request.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Canvas Config tab had 3 bugs visible on hermes workspaces (#1894):
1. Runtime dropdown showed "LangGraph (default)" even when the workspace's
actual runtime was hermes — because the form only loaded runtime from
config.yaml, and hermes doesn't use the platform's config.yaml template.
2. Model field was empty for the same reason.
3. "No config.yaml found" error appeared on hermes workspaces despite
everything being fine — hermes manages its own config at
~/.hermes/config.yaml on the workspace host.
Worse, clicking Save with the empty form would silently flip `runtime`
back from `hermes` to `LangGraph (default)`.
## Fix
- loadConfig now always fetches workspace metadata (runtime + model)
via GET /workspaces/:id and GET /workspaces/:id/model BEFORE attempting
the config.yaml fetch. These act as the source of truth for runtime
and model when config.yaml doesn't set them.
- RUNTIMES_WITH_OWN_CONFIG set lists runtimes that manage their own
config outside the platform template (hermes, external). For these:
- Missing config.yaml is NOT an error — no red banner shown.
- An informational gray banner tells the user where to edit the
runtime's config (e.g. "edit ~/.hermes/config.yaml via Terminal tab
or the hermes CLI" for hermes).
Closes#1894.
Verified 2026-04-23 on user's hongmingwang tenant which runs hermes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five additional breakages surfaced while testing the restored stack
end-to-end (spin up Hermes template → click node → open side panel →
configure secrets → send chat). Each fix is narrowly scoped and has
matching unit or e2e tests so they don't regress.
### 1. SSRF defence blocked loopback A2A on self-hosted Docker
handlers/ssrf.go was rejecting `http://127.0.0.1:<port>` workspace
URLs as loopback, so POST /workspaces/:id/a2a returned 502 on every
Canvas chat send in local-dev. The provisioner on self-hosted Docker
publishes each container's A2A port on 127.0.0.1:<ephemeral> — that's
the only reachable address for the platform-on-host path.
Added `devModeAllowsLoopback()` — allows loopback only when
MOLECULE_ENV ∈ {development, dev}. SaaS (MOLECULE_ENV=production)
continues to block loopback; every other blocked range (metadata
169.254/16, TEST-NET, CGNAT, link-local) stays blocked in dev mode.
Tests: 5 new tests in ssrf_test.go covering dev-mode loopback,
dev-mode short-alias ("dev"), production still blocks loopback,
dev-mode still blocks every other range, and a 9-case table test of
the predicate with case/whitespace/typo variants.
### 2. canvas/src/lib/api.ts: 401 → login redirect broke localhost
Every 401 called `redirectToLogin()` which navigates to
`/cp/auth/login`. That route exists only on SaaS (mounted by the
cp_proxy when CP_UPSTREAM_URL is set). On localhost it 404s — users
landed on a blank "404 page not found" instead of seeing the actual
error they should fix.
Gated the redirect on the SaaS-tenant slug check: on
<slug>.moleculesai.app, redirect unchanged; on any non-SaaS host
(localhost, LAN IP, reserved subdomains like app.moleculesai.app),
throw a real error so the calling component can render a retry
affordance.
Tests: 4 new vitest cases in a dedicated api-401.test.ts (needs
jsdom for window.location.hostname) — SaaS redirects, localhost
throws, LAN hostname throws, reserved apex throws.
### 3. SecretsSection rendered a hardcoded key list
config/secrets-section.tsx shipped a fixed COMMON_KEYS list
(Anthropic / OpenAI / Google / SERP / Model Override) regardless of
what the workspace's template actually needed. A Hermes workspace
declaring MINIMAX_API_KEY in required_env got five irrelevant slots
and nothing for the key it actually needed.
Made the slot list template-driven via a new `requiredEnv?: string[]`
prop passed down from ConfigTab. Added `KNOWN_LABELS` for well-known
names and `humanizeKeyName` to turn arbitrary SCREAMING_SNAKE_CASE
into a readable label (e.g. MINIMAX_API_KEY → "Minimax API Key").
Acronyms (API, URL, ID, SDK, MCP, LLM, AI) stay uppercase. Legacy
fallback preserved when required_env is empty.
Tests: 8 new vitest cases covering known-label lookup, humanise
fallback, acronym preservation, deduplication, and both fallback
paths.
### 4. Confusing placeholder in Required Env Vars field
The TagList in ConfigTab labelled "Required Env Vars (from template)"
is a DECLARATION field — stores variable names. The placeholder
"e.g. CLAUDE_CODE_OAUTH_TOKEN" suggested that, but users naturally
typed the value of their API key into the field instead. The actual
values go in the Secrets section further down the tab.
Relabelled to "Required Env Var Names (from template)", changed the
placeholder to "variable NAME (e.g. ANTHROPIC_API_KEY) — not the
value", and added a one-line helper below pointing to Secrets.
### 5. Agent chat replies rendered 2-3 times
Three delivery paths can fire for a single agent reply — HTTP
response to POST /a2a, A2A_RESPONSE WS event, and a
send_message_to_user WS push. Paths 2↔3 were already guarded by
`sendingFromAPIRef`; path 1 had no guard. Hermes emits both the
reply body AND a send_message_to_user with the same text, which
manifested as duplicate bubbles with identical timestamps.
Added `appendMessageDeduped(prev, msg, windowMs = 3000)` in
chat/types.ts — dedupes on (role, content) within a 3s window.
Threaded into all three setMessages call sites. The window is short
enough that legitimate repeat messages ("hi", "hi") from a real
user/agent a few seconds apart still render.
Tests: 8 new vitest cases covering empty history, different content,
duplicate within window, different roles, window elapsed, stale
match, malformed timestamps, and custom window.
### 6. New end-to-end regression test
tests/e2e/test_dev_mode.sh — 7 HTTP assertions that run against a
live platform with MOLECULE_ENV=development and catch regressions
on all the dev-mode escape hatches in a single pass: AdminAuth
(empty DB + after-token), WorkspaceAuth (/activity, /delegations),
AdminAuth on /approvals/pending, and the populated
/org/templates response. Shellcheck-clean.
### Test sweep
- `go test -race ./internal/handlers/ ./internal/middleware/
./internal/provisioner/` — all pass
- `npx vitest run` in canvas — 922/922 pass (up from 902)
- `shellcheck --severity=warning infra/scripts/setup.sh
tests/e2e/test_dev_mode.sh` — clean
- `bash tests/e2e/test_dev_mode.sh` — 7/7 pass against a live
platform + populated template registry
### SaaS parity
Every relaxation remains conditional on MOLECULE_ENV=development.
Production tenants run MOLECULE_ENV=production (enforced by the
secrets-encryption strict-init path) and always set ADMIN_TOKEN, so
none of these code paths fire on hosted SaaS. Behaviour on real
tenants is byte-for-byte unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to the previous commit on this branch. Two additional fresh-clone
regressions surfaced during end-to-end verification, both affecting local
dev only and both landing inside the same SaaS-vs-local-dev seam:
### 1. Canvas 401-loops after first workspace creation
`GET /workspaces` is behind `AdminAuth` (router.go:121 — "C1: unauthenticated
workspace topology exposure"). The middleware has a Tier-1 fail-open branch
that only fires when *no* workspace tokens exist anywhere in the DB. The
moment a user creates their first workspace — via either the Canvas UI, the
API, or the e2e-api test suite — a token lands in the DB, Tier-1 closes, and
the Canvas (which has no bearer token in local dev: no WorkOS session, no
NEXT_PUBLIC_ADMIN_TOKEN baked in at build time) gets 401 on every list
call. The UI renders a stuck "API GET /workspaces: 401 admin auth required"
placeholder forever.
SaaS is unaffected because hosted provisioning always sets both
`ADMIN_TOKEN` and `MOLECULE_ENV=production`, and the Canvas there either
carries a WorkOS session cookie or `NEXT_PUBLIC_ADMIN_TOKEN` baked into
the JS bundle.
**Fix** (`workspace-server/internal/middleware/wsauth_middleware.go`): add
a narrow Tier-1b escape hatch that stays fail-open when *both*
`ADMIN_TOKEN` is unset *and* `MOLECULE_ENV` is explicitly a dev mode
("development" / "dev"). Production never hits it (SaaS sets
`MOLECULE_ENV=production`). Mirrors the existing convention in
`handlers/admin_test_token.go` which gates the e2e test-token endpoint on
`MOLECULE_ENV != "production"`.
Three new regression tests in `wsauth_middleware_test.go`:
- `TestAdminAuth_DevModeEscapeHatch_FailsOpenWithHasLiveTokens` — the
happy path (dev mode, no admin token, tokens exist → 200)
- `TestAdminAuth_DevModeEscapeHatch_IgnoredWhenAdminTokenSet` — explicit
`ADMIN_TOKEN` wins; dev mode does not silently re-open the gate
- `TestAdminAuth_DevModeEscapeHatch_IgnoredInProduction` — the
SaaS-safety guarantee (production + no admin token + tokens exist → 401)
`.env.example` flipped to set `MOLECULE_ENV=development` by default so
new users get the dev-mode hatch automatically via `cp .env.example .env`.
SaaS provisioning overrides to `production`, consistent with the existing
convention used by the secrets-encryption strict-init path.
### 2. SaaS cookie/privacy banner rendered on localhost
`CookieConsent` mounted unconditionally in the root layout, so
`npm run dev` on localhost showed a "Cookies & your privacy" banner
pointing at `moleculesai.app/legal/privacy`. That banner is a
GDPR/ePrivacy compliance UI that only applies to the hosted SaaS
offering; self-hosted / local-dev / Vercel-preview hosts must not
see it.
**Fix** (`canvas/src/components/CookieConsent.tsx`): gate render on
`isSaaSTenant()`. Matches the convention used by `AuthGate` and the
workspace tier picker elsewhere in the codebase.
Tests (`canvas/src/components/__tests__/CookieConsent.test.tsx`):
existing tests now stub `window.location.hostname` to a SaaS
subdomain before rendering (required since `isSaaSTenant()` on jsdom's
default "localhost" would suppress the banner). Added two new tests
for the local-dev hide path:
- `does NOT render on local dev (non-SaaS hostname)`
- `does NOT render on a LAN hostname (192.168.*, *.local)`
### Verification
On a fresh-nuked DB with the updated branch:
1. `bash infra/scripts/setup.sh` — clean
2. `go run ./cmd/server` — "Applied 41 migrations", :8080 healthy,
dev-mode hatch armed (`MOLECULE_ENV=development`)
3. `npm run dev` in canvas — :3000 renders, no cookie banner
4. `bash tests/e2e/test_api.sh` — **61 passed, 0 failed**
(test suite creates tokens; GET /workspaces stays 200 under the hatch)
5. Browser at http://localhost:3000 AFTER the e2e run:
- Canvas renders the workspace list (no 401 placeholder)
- No cookie banner
6. `npx vitest run` — **902 tests passed** (900 prior + 2 new hide tests)
7. `go test -race ./internal/middleware/` — all passing (3 new
dev-mode tests + existing Issue-180 / Issue-120 / Issue-684 suite),
coverage 81.8%
### SaaS parity audit
Same principle as the rest of this branch: local must work without
weakening SaaS.
- Dev-mode hatch: conditional on `MOLECULE_ENV=development`.
Production tenants always run `MOLECULE_ENV=production` (already
enforced by the secrets-encryption `InitStrict` path in
`internal/crypto/aes.go`). Branch is unreachable there.
- Cookie banner: gated on `isSaaSTenant()` which checks
`NEXT_PUBLIC_SAAS_HOST_SUFFIX` (default `.moleculesai.app`). SaaS
hosts still get the banner; every other host doesn't.
No change to SaaS behaviour. #1822 backend-parity tracker untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reproducing the README's quickstart on a clean clone surfaced seven
independent bugs between `git clone` and seeing the Canvas in a browser.
Each fix is minimal and local-dev-only — the SaaS/EC2 provisioner path
(issue #1822) is untouched.
Bugs fixed:
1. `infra/scripts/setup.sh` applied migrations via raw psql, bypassing
the platform's `schema_migrations` tracker. The platform then re-ran
every migration on first boot and crashed on non-idempotent ALTER
TABLE statements (e.g. `036_org_api_tokens_org_id.up.sql`). Dropped
the migration block — `workspace-server/internal/db/postgres.go:53`
already tracks and skips applied files.
2. `.env.example` shipped `DATABASE_URL=postgres://USER:PASS@postgres:...`
with literal `USER:PASS` placeholders and the Docker-internal hostname
`postgres`. A `cp .env.example .env` followed by `go run ./cmd/server`
on the host failed with `dial tcp: lookup postgres: no such host`.
Replaced with working `dev:dev@localhost:5432` defaults that match
`docker-compose.infra.yml`.
3. `docker-compose.infra.yml` and `docker-compose.yml` set
`CLICKHOUSE_URL: clickhouse://...:9000/...`. Langfuse v2 rejects
anything other than `http://` or `https://`, so the container
crash-looped and returned HTTP 500. Switched to
`http://...:8123` (HTTP interface) and added `CLICKHOUSE_MIGRATION_URL`
for the migration-time native-protocol connection. Also removed
`LANGFUSE_AUTO_CLICKHOUSE_MIGRATION_DISABLED` so migrations actually
run.
4. `canvas/package.json` dev script crashed with `EADDRINUSE :::8080`
when `.env` was sourced before `npm run dev` — Next.js reads `PORT`
from env and the platform owns 8080. Pinned `dev` to
`-p 3000` so sourced env can't hijack it. `start` left as-is because
production `node server.js` (Dockerfile CMD) must respect `PORT`
from the orchestrator.
5. README/CONTRIBUTING told users to clone `Molecule-AI/molecule-monorepo`
— that repo 404s; the actual name is `molecule-core`. The Railway
and Render deploy buttons had the same broken URL. Replaced in both
English and Chinese READMEs and in CONTRIBUTING. Internal identifiers
(Go module path, Docker network `molecule-monorepo-net`, Python helper
`molecule-monorepo-status`) deliberately left alone — renaming those
is an invasive refactor orthogonal to this fix.
6. README quickstart was missing `cp .env.example .env`. Users who went
straight from `git clone` to `./infra/scripts/setup.sh` got a script
that warned about an unset `ADMIN_TOKEN` (harmless) but then couldn't
run the platform without figuring out the env setup on their own.
Added the step in both READMEs and CONTRIBUTING. Deliberately NOT
generating `ADMIN_TOKEN`/`SECRETS_ENCRYPTION_KEY` here — the e2e-api
suite (`tests/e2e/test_api.sh`) assumes AdminAuth fallback mode
(no server-side `ADMIN_TOKEN`), which is how CI runs it.
7. CI shellcheck only covered `tests/e2e/*.sh` — `infra/scripts/setup.sh`
is in the critical path of every new-user onboarding but was never
linted. Extended the `shellcheck` job and the `changes` filter to
cover `infra/scripts/`. `scripts/` deliberately excluded until its
pre-existing SC3040/SC3043 warnings are cleaned up separately.
Verification (fresh nuke-and-rebuild following the updated README):
- `docker compose -f docker-compose.infra.yml down -v` + `rm .env`
- `cp .env.example .env` → defaults work as-is
- `bash infra/scripts/setup.sh` — clean, no migration errors, all 6
infra containers healthy
- `cd workspace-server && go run ./cmd/server` — "Applied 41 migrations
(0 already applied)", platform on :8080/health 200
- `cd canvas && npm install && npm run dev` — Canvas on :3000/ 200
even with `.env` sourced (PORT=8080 in env)
- `bash tests/e2e/test_api.sh` — **61 passed, 0 failed**
- `cd canvas && npx vitest run` — **900 tests passed**
- `cd canvas && npm run build` — production build clean
- `shellcheck --severity=warning infra/scripts/*.sh` — clean
- Langfuse `/api/public/health` 200 (was 500)
Scope notes:
- SaaS/EC2 parity (issue #1822): all files touched here are local-dev
surface. Canvas container uses `node server.js` with `ENV PORT=3000`
in `canvas/Dockerfile` — the `-p 3000` pin in `package.json` dev
script only affects `npm run dev`, not the production CMD.
- Test coverage (issue #1821): project policy is tiered coverage floors,
not a blanket 100% target. Files touched here are shell scripts,
YAML, Markdown, and one package.json script — not classes covered
by the coverage matrix.
- No overlap with open PRs — searched `setup.sh`, `quickstart`,
`langfuse`, `clickhouse`, `migration`, `README`; nothing conflicts.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>
The pathname.startsWith() loop-break added to redirectToLogin needs
pathname on the mock Location object; tests were supplying only href.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tenant subdomains (hongmingwang.moleculesai.app) proxy to EC2 platform
which has no /cp/auth/* routes. Auth UI lives on app.moleculesai.app.
Added getAuthOrigin() that detects SaaS tenant hosts and redirects to
the app subdomain for login/signup. Non-SaaS hosts (localhost, dev)
fall back to PLATFORM_URL as before.
[Molecule-Platform-Evolvement-Manager]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When session credentials expire mid-use, ALL API calls return 401.
Previously this threw a generic error that crashed the UI with no
recovery path. Now the API client intercepts 401 and redirects to
login once (via redirectToLogin which already guards against loops).
Combined with the AuthGate /cp/auth/* path guard, this gives the
correct behavior: credentials lost → redirect to login → user logs
in → return_to sends them back.
[Molecule-Platform-Evolvement-Manager]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AuthGate redirected anonymous users to /cp/auth/login?return_to=<url>,
but the login page itself triggered AuthGate, which redirected again
with double-encoded return_to. Each redirect added another encoding
layer until the URL exceeded 431 (Request Header Fields Too Large).
Two guards:
1. redirectToLogin() returns early if already on /cp/auth/* path
2. AuthGate skips redirect check entirely for /cp/auth/* paths
[Molecule-Platform-Evolvement-Manager]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR #1781 introduced useCanvasStore.getState() call in ContextMenu.tsx
(line 169) but the existing Vitest mock for useCanvasStore in the keyboard
test file lacked a getState method, causing:
TypeError: useCanvasStore.getState is not a function
Fix: attach getState: () => mockStore to the mock using Object.assign
so the static method is available alongside the selector fn.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Brings the staging branch up to date with main's feature-fix stream so
every staging-targeted PR stops tripping on pre-existing rot. Before
this merge, staging had 30+ compile + test failures from fix PRs that
landed on main but never reached staging — primarily #1755's panic-
cascade + schema-drift alignments.
After this merge the handlers package goes from 30+ fails → 2 pre-
existing nil-docker test panics (TestCopyFilesToContainer_CWE22_
RejectsTraversal + TestDeleteViaEphemeral_F1085_RejectsTraversal),
both authored on staging and broken before this promotion. Tracked
separately; not a merge regression.
## Conflicts resolved
1. docs/marketing/campaigns/discord-adapter-announcement/announcement.md
— deleted on main (9d0d213: "move sensitive strategy + research to
internal repo"), modified on staging. Deletion wins: marketing
content moved out of the public monorepo per that commit's intent.
The content lives in the internal repo.
2. workspace-server/internal/handlers/container_files.go — staging's
rmTarget version kept. Main's version had `Cmd: []string{"rm",
"-rf", "/configs/" + filePath}` which concatenates raw filePath
AFTER the prefix-check on rmTarget, defeating the path-traversal
guard (a "../etc/passwd" input passes validation but the rm cmd
then traverses). Staging's `Cmd: []string{"rm", "-rf", rmTarget}`
uses the validated path. Keeping staging's more-secure variant.
## Includes build unblockers from #1769 / #1782
- terminal.go: malformed handleLocalConnect repaired
- terminal_test.go: missing braces in TestHandleConnect_RoutesToLocal
- workspace_crud.go: unused imports + duplicate strField block
- container_files_test.go: duplicate contains() removed (uses the one
in workspace_provision_test.go, same package)
## Verification
- go build ./... ✅ clean
- go vet ./... ✅ clean
- go test -race ./... — 18/20 packages green; 2 test panics in
internal/handlers are pre-existing on staging (documented above)
ContextMenu.tsx reads parent-workspace children via
useCanvasStore.getState().nodes.filter(...) — a direct .getState()
call, not the selector-calling form. The existing vi.mock exposed
only the selector form, so rendering crashed with
"TypeError: useCanvasStore.getState is not a function".
Restructure the vi.mock factory to return Object.assign(fn, {
getState: () => mockStore }) so both call shapes resolve. Factory body
builds the function locally because vi.mock hoists above outer-scope
variable declarations and can't reference `mockStore` via closure.
Verified: all 15 tests in the file pass after the change.
Unblocks the Canvas (Next.js) CI check on PR #1743 (staging→main sync).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three small, non-overlapping fixes extracted from closed PR #1664:
1. canvas/src/components/ContextMenu.tsx — Replace the useMemo-over-nodes
pattern with a hashed-boolean selector (s.nodes.some(...)) so Zustand's
useSyncExternalStore snapshot comparison is stable. Resolves React
error #185 (infinite render loop). Moves the child-node list derivation
into the delete handler via getState() so the render path no longer
allocates a fresh array.
2. workspace-server/internal/handlers/a2a_proxy.go — Allow the
Docker-bridge hostname path (ws-<id>:8000) to skip the SSRF guard in
local-docker mode. Gated on !saasMode() so SaaS deployments keep the
full private-IP blocklist (a remote workspace registration can't claim
a ws-* hostname and reach a sensitive VPC IP).
3. workspace-server/Dockerfile — Add entrypoint.sh that discovers the
docker.sock GID at boot and adds the platform user to that group, then
exec's su-exec to drop privileges. Lets the platform container reach
the host docker socket without running as root.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of the hermes 401 "Invalid API key" on SaaS workspaces:
1. CreateWorkspaceDialog never sent `model` in the /workspaces POST
2. Tenant/CP plumbed through a valid (provider, API key) but empty MODEL
3. Workspace install.sh ran with HERMES_DEFAULT_MODEL unset
4. derive-provider.sh saw no slug → PROVIDER="auto"
5. Hermes fell back to its compiled-in default (Anthropic via
OpenAI-compat adapter)
6. User's MINIMAX_API_KEY was present but irrelevant — hermes tried
Anthropic with it → 401
Fix:
- Extend HERMES_PROVIDERS with `defaultModel` + `models` (suggestion
list). Each provider ships with a known-good default so the trap
is physically impossible to hit with the new form.
- Add a required Model input to the Hermes panel, auto-populated
from the provider's defaultModel when the provider changes (only
if the user hasn't typed their own slug yet).
- Datalist surfaces additional model suggestions per provider so
users can pick a different size (e.g. M2.7-highspeed) without
typing the whole slug.
- handleCreate validates hermesModel is non-empty, sends as `model`
in the POST body alongside the secrets block.
- useEffect guard avoids clobbering a user-typed custom slug when
they toggle providers back and forth.
Existing 19 a11y tests still pass (non-SaaS path unchanged, four-tier
picker still renders, arrow-key nav still wraps).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Following feedback that T4 — not T3 — is the full-access tier:
- Non-SaaS picker now shows all four tiers: T1 Sandboxed, T2 Standard,
T3 Privileged, T4 Full Access. Four-column grid.
- SaaS picker stays single-option but now locks to T4 (was T3). Every
SaaS workspace gets a dedicated EC2 VM, which is unambiguously the
"full host" case — T3 (privileged container) was a category mismatch.
- Default tier on SaaS is 4 (was 3). CP provisioner already supports
tier 4 (t3.large / 80 GB). TIER_CONFIG already has T4's amber color.
Tests updated for the four-tier picker: wrap tests now go T4 ↔ T1, and
the selection/tabIndex tests cover the fourth button.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Staging added hasChildren/children fields to workspace store shape.
Test assertion updated to use objectContaining to avoid false negatives.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
On SaaS every workspace gets its own EC2 VM — the Docker-sandbox
distinction between T1 (sandboxed), T2 (standard Docker), and T3
(full host access) doesn't apply. A SaaS workspace is always a
dedicated VM, which is "full access" by construction. Showing T1/T2
in that UI is a category error: users pick a sandbox level that has
no effect on the actual EC2 machine they get.
Changes:
- tenant.ts: export isSaaSTenant() — returns true when canvas is
served at <slug>.moleculesai.app (SSR-safe: false on server)
- CreateWorkspaceDialog: when isSaaSTenant(), render only the T3
option, default tier=3, grid collapses to a single column. Label
gets a " — dedicated VM" hint so the user knows what they're
getting. On self-hosted the full T1/T2/T3 picker is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Add-Key form used to open with a required Service dropdown
(GitHub / Anthropic / OpenRouter / Other) that gated everything
else. The dropdown did no persistent work — the secret store only
cares about (key_name, value); the Service label was never saved
anywhere. It also suffered registry drift: today we support ~22
hermes-dispatched providers (MiniMax, Gemini, DeepSeek, Kimi, Qwen,
NVIDIA, etc.); only 3 had entries. Everyone else landed in "Other"
with no downside beyond the mandatory click.
Replaces it with:
1. Key-name <datalist> autocomplete sourced from new
KEY_NAME_SUGGESTIONS in lib/services.ts — 26 entries covering
common infra keys + every hermes-supported provider.
2. inferGroup(keyName) derives classification at render time,
matching what the store already does in getGrouped(). No
behaviour change for list grouping.
3. Provider docs link renders inline only when inferGroup
recognises the name. For 'custom' keys we stay quiet — no
false-structure prompt.
4. Test-connection button still available when the inferred group
supports it AND the value is format-valid. Same providers as
before.
SERVICES registry preserved for LIST rendering + test routing.
Result: two fields instead of three. One fewer decision. Provider-
agnostic by design — new providers work the moment someone types
their canonical env var name; no UI code change per provider.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1526 shipped the /templates registry + canvas dynamic Runtime /
Model / Required-Env fields on 2026-04-22 — but merged into the
staging branch, not main. The staging→main promotion PR #1496 has
been open unmerged for a while with 1172 commits divergence, so
prod (which builds from main) still carries the old hardcoded
dropdown.
Symptom seen on hongmingwang.moleculesai.app today:
- New Hermes Agent workspace (template declares runtime: hermes) loads
Config tab → Runtime dropdown shows "LangGraph (default)" because
there's no <option value="hermes"> in the hardcoded list; it falls
back to empty-value silently.
- Model field is a plain TextInput with static placeholder
"e.g. anthropic:claude-sonnet-4-6" — should be a combobox populated
from the selected runtime's models[].
- Required Env Vars is a TagList with static placeholder
"e.g. CLAUDE_CODE_OAUTH_TOKEN" — should auto-populate from the
selected model's required_env.
- Net effect: "Save & Deploy" sends empty model + empty env to the
provisioner → workspace instant-fails.
This PR cherry-picks the exact three files from PR #1526 (#359dc61
on staging) forward to main, without pulling the other 1171
commits:
- canvas/src/components/tabs/ConfigTab.tsx
- RuntimeOption interface + FALLBACK_RUNTIME_OPTIONS (hermes,
gemini-cli included)
- useEffect fetches /templates and populates runtimeOptions
dynamically
- dropdown renders from runtimeOptions (no hardcoded list)
- Model becomes a combobox with datalist of available models
per selected runtime
- Required Env Vars auto-populates from the selected model's
required_env on model change
- workspace-server/internal/handlers/templates.go
- /templates endpoint returns [{id, name, runtime, models}] with
per-template models registry (id, name, required_env)
- workspace-server/internal/handlers/templates_test.go
- Tests for runtime+models parsing and legacy top-level model
fallback
The canvas Runtime dropdown now resolves "hermes" correctly;
Model dropdown shows the models[] from the hermes template; Env
auto-populates with HERMES_API_KEY (or whichever model selected).
Verified locally:
- workspace-server builds clean
- Template handler tests pass: TestTemplatesList_RuntimeAndModelsRegistry,
TestTemplatesList_LegacyTopLevelModel, TestTemplatesList_NonexistentDir
Follow-up: the staging→main promotion gap (#1496) is the
underlying process issue. Either merge that PR or adopt a policy
of landing fixes directly on main (as several PRs have today).
Files here were chosen minimally to avoid pulling unrelated staging
changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(canary-release): flag as aspirational; link to current state
The canary-release.md doc describes the pipeline as if the fleet is
running — referring to AWS account 004947743811 and a configured
MoleculeStagingProvisioner role. Reality as of 2026-04-22: no canary
tenants are provisioned, the 3 GH Actions secrets are empty, and
canary-verify.yml has failed 7/7 times in a row.
Added a top-of-doc ⚠️ state note that:
1. Clarifies this is intended design, not deployed reality.
2. Notes the AWS account ID is historical / unverified.
3. Explains that merges currently rely on manual promote-latest.
4. Cross-links to molecule-controlplane/docs/canary-tenants.md for
the Phase 1 work that's shipped, the Phase 2 stand-up plan, and
the "should we even do this now?" decision framework.
5. Asks whoever lands Phase 2 to reconcile the two docs.
No behaviour change — doc-only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(build): add missing fmt import in a2a_proxy.go, fix canvas Dockerfile GID
- a2a_proxy.go: missing "fmt" import caused build failure (8 undefined
references at lines 743-775). Likely dropped during a recent merge.
- canvas/Dockerfile: GID 1000 already in use in node base image.
Changed to dynamic group/user creation with fallback.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Hongming Wang <hongmingwang.rabbit@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Hongming Wang <hongmingwangrabbit@gmail.com>
publish-canvas-image has been failing on every main push since 2026-04-21
at `addgroup -g 1000 canvas` because node:20-alpine already ships a `node`
user/group at uid/gid 1000. Same collision workspace-server/Dockerfile.tenant
already fixes with `deluser --remove-home node` before `addgroup`.
Copying that pattern here so the workflow goes green again and canvas images
publish to ghcr. No runtime behaviour change — canvas still runs as non-root
uid 1000.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(canvas+templates): fetch runtime dropdown from /templates registry
Canvas hardcoded 6 runtime options, drifting from manifest.json which
already registers hermes + gemini-cli as first-class workspace templates.
A Hermes workspace had runtime=hermes in its DB row but Config showed
"LangGraph (default)" — the HTML select fell back to its first option
because "hermes" wasn't listed, and saving would clobber the runtime
back to empty.
Now:
- GET /templates returns the runtime field from each cloned template's
config.yaml (previously dropped on the floor)
- ConfigTab fetches /templates on mount, dedupes non-empty runtimes, and
renders them as <option>s. Falls back to the static list if the fetch
fails (offline, older backend), so the control never renders empty.
Adding a template to manifest.json now flows through automatically — no
canvas PR required.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(canvas+templates): model + required-env suggestions from template
Extends the dropdown fix so Model and Required Env also flow from
the template registry instead of being free-form fields the user
has to remember.
Template config.yaml now declares:
runtime_config:
model: <default>
models:
- id: nous-hermes-3-70b
name: Nous Hermes 3 70B (Nous Portal)
required_env: [HERMES_API_KEY]
- id: nousresearch/hermes-3-llama-3.1-70b
name: Hermes 3 70B (via OpenRouter)
required_env: [OPENROUTER_API_KEY]
Platform: GET /templates now returns runtime + model + models[] per
template (was previously dropping runtime + ignoring runtime_config).
Canvas:
- Runtime dropdown built from /templates (was hardcoded 6 options)
- Model input becomes a datalist combobox; free-form input still
allowed since model names rotate faster than templates
- Required Env Vars default to the selected model's required_env,
labelled "(suggested)" so the user knows it's template-driven
- Everything falls back to a static list when /templates is
unreachable, so offline editing still works
Follow-up: add models[] to the other 7 template repos (claude-code,
crewai, autogen, deepagents, openclaw, gemini-cli, langgraph). This
PR updates the platform + canvas; the Hermes template config update
goes in a separate PR against its own repo.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(canvas): commit required_env on model change; add backend tests
Review turned up that the \"Required Env Vars (suggested)\" display
was cosmetic-only — users picking a different model saw the new
env suggestion in the TagList, but the values never made it into
state, so Save serialized an empty (or stale) required_env and the
workspace ran with the wrong auth check.
Canvas fixes:
- Model input onChange now commits the matched modelSpec's required_env
to state — but only when the prior required_env was empty or matched
the previous modelSpec's list (i.e. user hadn't manually edited).
User-typed envs always win.
- Dropped the display-only fallback in TagList values; shows only what's
actually in state.
- New \"Template suggests X, Apply\" hint button covers the edge case
where state and template differ (existing workspace whose required_env
lags the template's current recommendation).
- datalist option key now includes index so template authors shipping
duplicate model ids don't trigger a silent React key collision.
- Small arraysEqual helper.
Backend tests:
- TestTemplatesList_RuntimeAndModelsRegistry — asserts /templates
response carries runtime + models[] with per-model required_env.
- TestTemplatesList_LegacyTopLevelModel — asserts older templates with
top-level model: still surface correctly, with empty Models[].
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Hongming Wang <hongmingwang.rabbit@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ContextMenu's children selector ran .filter() inside the Zustand
hook, returning a brand-new array reference on every render.
useSyncExternalStore under the hood compares snapshots with
Object.is — a new array always differs, so React kept scheduling
re-renders, hit the 50-update depth cap, and crashed with minified
error #185.
Observed as "Application error: a client-side exception" on every
SaaS tenant once a session cookie resolved. Caught in dev mode
where the build emits the clear warning:
The result of getSnapshot should be cached to avoid an infinite loop
at ContextMenu (src/components/ContextMenu.tsx:26:34)
Fix: select the stable nodes array once, derive children via
useMemo outside the store subscription. Same output, no new
reference per render.
Manually verified: dev bundle served through a cloudflared tunnel
to a live tenant, ContextMenu component mounts cleanly, remaining
console errors are all unrelated (localhost API 401s from the dev
server pointing at its own origin).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor: split 4 oversized handler files into focused sub-files
- org.go (1099 lines) → org.go + org_import.go + org_helpers.go
- mcp.go (1001 lines) → mcp.go + mcp_tools.go
- workspace.go (934 lines) → workspace.go + workspace_crud.go
- a2a_proxy.go (825 lines) → a2a_proxy.go + a2a_proxy_helpers.go
No functional changes — same package, same exports, same tests.
All files stay under 635 lines.
Note: isSafeURL and isPrivateOrMetadataIP are duplicated between
mcp_tools.go and a2a_proxy_helpers.go — this is a pre-existing issue
from the original mcp.go and a2a_proxy.go, not introduced by this split.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(runtime+scheduler): increment/decrement active_tasks counter (refs #1386)
* docs(tutorials): add Self-Hosted AI Agents guide — Docker, Fly Machines, bare metal
* docs: add Remote Agents feature + Phase 30 blog links to docs index
* docs(marketing): update Phase 30 brief — Action 5 complete, docs/index.md update noted
* docs(api-ref): add workspace file copy API reference (#1281)
Documents TemplatesHandler.copyFilesToContainer (container_files.go):
- Endpoint overview: PUT /workspaces/:id/files/*path
- Parameter descriptions for all four function parameters
- CWE-22 path traversal protection (PRs #1267/1270/1271)
- Defense-in-depth: validateRelPath at handler + archive boundary
- Full error code table (400/404/500)
- curl example with success and path-traversal rejection cases
Also covers: writeViaEphemeral routing, findContainer fallback,
allowed roots allow-list, and related links to platform-api.md.
Co-authored-by: Molecule AI Technical Writer <technical-writer@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(security): CWE-78/CWE-22 — block shell injection in deleteViaEphemeral (#1310)
## Summary
Issue #1273: deleteViaEphemeral interpolated filePath directly into
rm command, enabling both shell injection (CWE-78) and path traversal
(CWE-22) attacks.
## Changes
1. Added validateRelPath(filePath) guard before constructing the rm command.
validateRelPath blocks absolute paths and ".." traversal sequences.
2. Changed Cmd from "/configs/"+filePath (string interpolation) to
[]string{"rm", "-rf", "/configs", filePath} (exec form). This
eliminates shell injection entirely — filePath is a plain argument,
never interpreted as shell code.
## Security properties
- validateRelPath: blocks "../" and absolute paths before they reach Docker
- Exec form: filePath cannot inject shell metacharacters even if validation
is somehow bypassed
- "/configs" as separate arg: rm has exactly two arguments, no room for
injected args
Closes#1273.
Co-authored-by: Molecule AI Infra-Runtime-BE <infra-runtime-be@agents.moleculesai.app>
* fix(security): backport SSRF defence (CWE-918) to main — isSafeURL in a2a_proxy.go (#1292) (#1302)
* fix(security): backport SSRF defence (CWE-918) to main — isSafeURL in mcp.go and a2a_proxy.go
Issue #1042: 3 CodeQL SSRF findings across mcp.go and a2a_proxy.go.
staging already ships the fix (PRs #1147, #1154 → merged); main did not include it.
- mcp.go: add isSafeURL() + isPrivateOrMetadataIP() helpers; validate
agentURL before outbound calls in mcpCallTool (line ~529) and
toolDelegateTaskAsync (line ~607)
- a2a_proxy.go: add identical isSafeURL() + isPrivateOrMetadataIP()
helpers; call isSafeURL() before dispatchA2A in resolveAgentURL()
(blocks finding #1 at line 462)
- mcp_test.go: 19 new tests covering all blocked URL patterns:
file://, ftp://, 127.0.0.1, ::1, 169.254.169.254, 10.x.x.x,
172.16.x.x, 192.168.x.x, empty hostname, invalid URL,
isPrivateOrMetadataIP across all private/CGNAT/metadata ranges
1. URL scheme enforcement — http/https only
2. IP literal blocking — loopback, link-local, RFC-1918, CGNAT, doc/test ranges
3. DNS hostname resolution — blocks internal hostnames resolving to private IPs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(ci-blocker): remove duplicate isSafeURL/isPrivateOrMetadataIP from mcp.go
Issue #1292: PR #1274 duplicated isSafeURL + isPrivateOrMetadataIP in
mcp.go — both functions already exist on main at lines 829 and 876.
Kept the mcp.go definitions (the originals) and removed the 70-line
duplicate appended at end of file. a2a_proxy.go functions are
unchanged — they serve the same purpose via a separate code path.
* fix: remove orphaned commit-text lines from a2a_proxy.go
Three lines from the PR/commit title were accidentally baked into the
file during the rebase from #1274 to #1302, causing a Go syntax error
(a bare string literal at statement level followed by dangling braces).
Deletion restores:
}
return agentURL, nil
}
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Molecule AI Infra-Runtime-BE <infra-runtime-be@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Molecule AI Core-BE <core-be@agents.moleculesai.app>
Co-authored-by: Molecule AI SDK Lead <sdk-lead@agents.moleculesai.app>
* fix(canvas/test): patch test regressions from PR #1243 + proximity hitbox fix (#1313)
* fix(ci): revert cancel-in-progress to true — ubuntu-runner dispatch stalled
With cancel-in-progress: false, pending CI runs accumulate in the
ci-staging concurrency group. New pushes create queued runs, but
GitHub dispatches multiple runs for the same SHA instead of replacing
the pending one. All runs get stuck/cancelled before completing.
Reverting to cancel-in-progress: true restores CI operation — runs
that are superseded are cancelled, freeing the concurrency slot for
the new run to proceed.
Runner availability (ubuntu-latest dispatch stall) is a separate
infra issue tracked independently.
* fix(security): validate tar header names in copyFilesToContainer — CWE-22 path traversal (#1043)
Tar header names were built from raw map keys without validation. A malicious
server-side caller could embed "../" in a file name to escape the destPath
volume mount (/configs) and write files outside the intended directory.
Fix: validate each name with filepath.Clean + IsAbs + HasPrefix("..") checks
before using it in the tar header, then join with destPath for the archive
header. Also guard parent-directory creation against traversal.
Closes#1043.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(canvas/test): patch regressed tests from PR #1243 orgs-page flakiness fix
Two regressions introduced by PR #1243 (fix issue #1207):
1. **ContextMenu.keyboard.test.tsx** — `setPendingDelete` now receives
`{id, name, hasChildren}` (cascade-delete UX, PR #1252), but the test
expected only `{id, name}`. Added `hasChildren: false` to the assertion.
2. **orgs-page.test.tsx** — 10 tests awaited `vi.advanceTimersByTimeAsync(50)`
without `act()`. With fake timers, `setState` (synchronous) is flushed by
`advanceTimersByTimeAsync`, but the React state update it triggers is a
microtask — so the test saw stale render. Wrapping in `act(async () =>
{ await vi.advanceTimersByTimeAsync(50); })` ensures microtasks drain
before assertions run.
All 813 vitest tests pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(canvas): add 100px proximity threshold to drag-to-nest detection
Fixes#1052 — previously, getIntersectingNodes() returned any node whose
bounding box overlapped the dragged node, regardless of actual pixel
distance. On a sparse canvas this triggered the "Nest Workspace" dialog
even when the dragged node was nowhere near any target.
The fix adds an on-node-drag proximity filter: only nodes within 100px
(center-to-center) of the dragged node are eligible as nest targets.
Distance is computed as squared Euclidean to avoid the sqrt overhead in
the hot drag path.
Added two tests to Canvas.pan-to-node.test.tsx covering the mock wiring
and confirming the regression is addressed in Canvas.tsx.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>
Co-authored-by: Molecule AI Core-FE <core-fe@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(canvas): add ?? 0 guard for optional budget_used in progressPct (#1324) (#1327)
* fix(ci): revert cancel-in-progress to true — ubuntu-runner dispatch stalled
With cancel-in-progress: false, pending CI runs accumulate in the
ci-staging concurrency group. New pushes create queued runs, but
GitHub dispatches multiple runs for the same SHA instead of replacing
the pending one. All runs get stuck/cancelled before completing.
Reverting to cancel-in-progress: true restores CI operation — runs
that are superseded are cancelled, freeing the concurrency slot for
the new run to proceed.
Runner availability (ubuntu-latest dispatch stall) is a separate
infra issue tracked independently.
* fix(security): validate tar header names in copyFilesToContainer — CWE-22 path traversal (#1043)
Tar header names were built from raw map keys without validation. A malicious
server-side caller could embed "../" in a file name to escape the destPath
volume mount (/configs) and write files outside the intended directory.
Fix: validate each name with filepath.Clean + IsAbs + HasPrefix("..") checks
before using it in the tar header, then join with destPath for the archive
header. Also guard parent-directory creation against traversal.
Closes#1043.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(canvas/test): patch regressed tests from PR #1243 orgs-page flakiness fix
Two regressions introduced by PR #1243 (fix issue #1207):
1. **ContextMenu.keyboard.test.tsx** — `setPendingDelete` now receives
`{id, name, hasChildren}` (cascade-delete UX, PR #1252), but the test
expected only `{id, name}`. Added `hasChildren: false` to the assertion.
2. **orgs-page.test.tsx** — 10 tests awaited `vi.advanceTimersByTimeAsync(50)`
without `act()`. With fake timers, `setState` (synchronous) is flushed by
`advanceTimersByTimeAsync`, but the React state update it triggers is a
microtask — so the test saw stale render. Wrapping in `act(async () =>
{ await vi.advanceTimersByTimeAsync(50); })` ensures microtasks drain
before assertions run.
All 813 vitest tests pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(canvas): add 100px proximity threshold to drag-to-nest detection
Fixes#1052 — previously, getIntersectingNodes() returned any node whose
bounding box overlapped the dragged node, regardless of actual pixel
distance. On a sparse canvas this triggered the "Nest Workspace" dialog
even when the dragged node was nowhere near any target.
The fix adds an on-node-drag proximity filter: only nodes within 100px
(center-to-center) of the dragged node are eligible as nest targets.
Distance is computed as squared Euclidean to avoid the sqrt overhead in
the hot drag path.
Added two tests to Canvas.pan-to-node.test.tsx covering the mock wiring
and confirming the regression is addressed in Canvas.tsx.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(canvas): add ?? 0 guard for optional budget_used in progressPct
Fixes#1324 — TypeScript strict mode flags budget.budget_used as
possibly undefined in the progressPct ternary, even though the
outer condition checks budget_limit > 0.
Fix: use nullish coalescing (budget_used ?? 0) so progress shows 0%
when the backend returns a partial shape (provisioning-stuck
workspaces). Also adds a test covering the undefined-budget_used
case with the progress bar aria-valuenow and fill width both at 0%.
Closes#1324.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>
Co-authored-by: Molecule AI Core-FE <core-fe@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(canvas): add ?? 0 guard for optional budget_used in progressPct (issue #1324) (#1329)
* fix(ci): revert cancel-in-progress to true — ubuntu-runner dispatch stalled
With cancel-in-progress: false, pending CI runs accumulate in the
ci-staging concurrency group. New pushes create queued runs, but
GitHub dispatches multiple runs for the same SHA instead of replacing
the pending one. All runs get stuck/cancelled before completing.
Reverting to cancel-in-progress: true restores CI operation — runs
that are superseded are cancelled, freeing the concurrency slot for
the new run to proceed.
Runner availability (ubuntu-latest dispatch stall) is a separate
infra issue tracked independently.
* fix(security): validate tar header names in copyFilesToContainer — CWE-22 path traversal (#1043)
Tar header names were built from raw map keys without validation. A malicious
server-side caller could embed "../" in a file name to escape the destPath
volume mount (/configs) and write files outside the intended directory.
Fix: validate each name with filepath.Clean + IsAbs + HasPrefix("..") checks
before using it in the tar header, then join with destPath for the archive
header. Also guard parent-directory creation against traversal.
Closes#1043.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(canvas/test): patch regressed tests from PR #1243 orgs-page flakiness fix
Two regressions introduced by PR #1243 (fix issue #1207):
1. **ContextMenu.keyboard.test.tsx** — `setPendingDelete` now receives
`{id, name, hasChildren}` (cascade-delete UX, PR #1252), but the test
expected only `{id, name}`. Added `hasChildren: false` to the assertion.
2. **orgs-page.test.tsx** — 10 tests awaited `vi.advanceTimersByTimeAsync(50)`
without `act()`. With fake timers, `setState` (synchronous) is flushed by
`advanceTimersByTimeAsync`, but the React state update it triggers is a
microtask — so the test saw stale render. Wrapping in `act(async () =>
{ await vi.advanceTimersByTimeAsync(50); })` ensures microtasks drain
before assertions run.
All 813 vitest tests pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(canvas): add 100px proximity threshold to drag-to-nest detection
Fixes#1052 — previously, getIntersectingNodes() returned any node whose
bounding box overlapped the dragged node, regardless of actual pixel
distance. On a sparse canvas this triggered the "Nest Workspace" dialog
even when the dragged node was nowhere near any target.
The fix adds an on-node-drag proximity filter: only nodes within 100px
(center-to-center) of the dragged node are eligible as nest targets.
Distance is computed as squared Euclidean to avoid the sqrt overhead in
the hot drag path.
Added two tests to Canvas.pan-to-node.test.tsx covering the mock wiring
and confirming the regression is addressed in Canvas.tsx.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(canvas): add ?? 0 guard for optional budget_used in progressPct
Fixes#1324 — TypeScript strict mode flags budget.budget_used as
possibly undefined in the progressPct ternary, even though the
outer condition checks budget_limit > 0.
Fix: use nullish coalescing (budget_used ?? 0) so progress shows 0%
when the backend returns a partial shape (provisioning-stuck
workspaces). Also adds a test covering the undefined-budget_used
case with the progress bar aria-valuenow and fill width both at 0%.
Closes#1324.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>
Co-authored-by: Molecule AI Core-FE <core-fe@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(platform): unblock SaaS workspace registration end-to-end
Every workspace in the cross-EC2 SaaS provisioning shape was failing
registration, heartbeat, or A2A routing. Four distinct blockers sat
between "EC2 is up" and "agent responds"; three are platform-side and
fixed here (the fourth is in the CP user-data, separate PR).
1. SSRF validator blocked RFC-1918 (registry.go + mcp.go)
validateAgentURL and isPrivateOrMetadataIP rejected 172.16.0.0/12,
which contains the AWS default VPC range (172.31.x.x) that every
sibling workspace EC2 registers from. Registration returned 400 and
the 10-min provision sweep flipped status to failed. RFC-1918 +
IPv6 ULA are now gated behind saasMode(); link-local (169.254/16),
loopback, IPv6 metadata (fe80::/10, ::1), and TEST-NET stay blocked
unconditionally in both modes.
saasMode() resolution order:
1. MOLECULE_DEPLOY_MODE=saas|self-hosted (explicit operator flag)
2. MOLECULE_ORG_ID presence (legacy implicit signal, kept for
back-compat so existing deployments don't need a config change)
isPrivateOrMetadataIP now actually checks IPv6 — previously it
returned false on any non-IPv4 input, which would let a registered
[::1] or [fe80::...] URL bypass the SSRF check entirely.
2. Orphan auth-token minting (workspace_provision.go)
issueAndInjectToken mints a token and stuffs it into
cfg.ConfigFiles[".auth_token"]. The Docker provisioner writes that
file into the /configs volume — the CP provisioner ignores it
(only cfg.EnvVars crosses the wire). Result: live token in DB, no
plaintext on disk, RegistryHandler.requireWorkspaceToken 401s every
/registry/register attempt because the workspace is no longer in
the "no live token → bootstrap-allowed" state. Now no-ops in SaaS
mode; the register handler already mints on first successful
register and returns the plaintext in the response body for the
runtime to persist locally.
Also removes the redundant wsauth.IssueToken call at the bottom of
provisionWorkspaceCP, which created the same orphan-token pattern
a second time.
3. Compaction artefacts (bundle/importer.go, handlers/org_tokens.go,
scheduler.go, workspace_provision.go)
Four pre-existing compile errors on main from an earlier session's
code truncation: missing tuple destructuring on ExecContext /
redactSecrets / orgTokenActor, missing close-brace in
Scheduler.fireSchedule's panic recovery. All one-line mechanical
fixes; without them the binary would not build.
Tests
-----
ssrf_test.go adds:
* TestSaasMode — covers the env resolution ladder (explicit flag
wins over legacy signal, case-insensitive, whitespace tolerant)
* TestIsPrivateOrMetadataIP_SaaSMode — asserts RFC-1918 + IPv6 ULA
flip to allowed, metadata/loopback/TEST-NET still blocked
* TestIsPrivateOrMetadataIP_IPv6 — regression guard for the old
"returns false for all IPv6" behaviour
Follow-up issue for CP-sourced workspace_id attestation will be filed
separately — closes the residual intra-VPC SSRF + token-race windows
the SaaS-mode relaxation introduces.
Verified end-to-end today on workspace 6565a2e0 (hermes runtime, OpenAI
provider) — agent returned "PONG" in 1.4s after register → heartbeat →
A2A proxy → runtime.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(runtime+scheduler): increment/decrement active_tasks + max_concurrent (#1408)
Runtime (shared_runtime.py):
- set_current_task now increments active_tasks on task start, decrements
on completion (was binary 0/1)
- Counter never goes below 0 (max(0, n-1))
- Pushes heartbeat immediately on BOTH increment and decrement (#1372)
Scheduler (scheduler.go):
- Reads max_concurrent_tasks from DB (default 1, backward compatible)
- Skips cron only when active_tasks >= max_concurrent_tasks (was > 0)
- Leaders can be configured with max_concurrent_tasks > 1 to accept
A2A delegations while a cron runs
Platform:
- Added max_concurrent_tasks column to workspaces (migration 037)
- Workspace model + list/get queries include the new field
- API exposes max_concurrent_tasks in workspace JSON
Config.yaml support (future): runtime_config.max_concurrent_tasks
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(review): address 3 critical issues from code review
1. BLOCKER: executor_helpers.py now uses increment/decrement too
(was still binary 0/1, stomping the counter for CLI + SDK executors)
2. BUG: asymmetric getattr defaults fixed — both paths use default 0
(was 0 on increment, 1 on decrement)
3. UX: current_task preserved when active_tasks > 0 on decrement
(was clearing task description even when other tasks still running)
4. Scheduler polling loop re-reads max_concurrent_tasks on each poll
(was using stale value from initial query)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Hongming Wang <hongmingwangrabbit@gmail.com>
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>
Co-authored-by: Molecule AI Technical Writer <technical-writer@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Molecule AI Infra-Runtime-BE <infra-runtime-be@agents.moleculesai.app>
Co-authored-by: Molecule AI Core-BE <core-be@agents.moleculesai.app>
Co-authored-by: Molecule AI SDK Lead <sdk-lead@agents.moleculesai.app>
Co-authored-by: Molecule AI Core-FE <core-fe@agents.moleculesai.app>
Co-authored-by: Hongming Wang <hongmingwang.rabbit@users.noreply.github.com>
* docs: workspace files API reference, skill catalog, and links
* docs: fix secrets endpoint path across docs
The workspace secrets endpoint is `/workspaces/:id/secrets`, not
`/secrets/values`. This was wrong in quickstart.md (Path 2: Remote Agent)
and workspace-runtime.md (registration flow example and comparison table).
The external-agent-registration guide already had the correct path.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* docs: fix broken blog cross-link in skills-vs-bundled-tools post
Link path had an extra `/docs/` segment: `/docs/blog/...` instead of
`/blog/...`. Nextra resolves blog posts directly under `/blog/<slug>`,
not under `/docs/blog/`.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* docs: add skill-catalog.md guide
Linked from the skills-vs-bundled-tools blog post as a reference
for TTS/image-generation/web-search skills. The blog promises
"install directly via the CLI" with a skill catalog — this page
fills that promise by documenting available skill types, install
commands, version management, custom skill authoring, and removal.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* docs(marketing): update Phase 30 brief — Action 5 complete, docs/index.md update noted
* docs(api-ref): add workspace file copy API reference
Documents TemplatesHandler.copyFilesToContainer (container_files.go):
- Endpoint overview: PUT /workspaces/:id/files/*path
- Parameter descriptions for all four function parameters
- CWE-22 path traversal protection (PRs #1267/1270/1271)
- Defense-in-depth: validateRelPath at handler + archive boundary
- Full error code table (400/404/500)
- curl example with success and path-traversal rejection cases
Also covers: writeViaEphemeral routing, findContainer fallback,
allowed roots allow-list, and related links to platform-api.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Molecule AI Technical Writer <technical-writer@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>
* fix(handlers): add saasMode() gating to isPrivateOrMetadataIP in a2a_proxy_helpers.go
Issue #1421 / #1401: PR #1363 (handler split) moved isPrivateOrMetadataIP
into a2a_proxy_helpers.go but kept the OLD pre-SaaS version — it
unconditionally blocks RFC-1918 addresses, regressing the fix in
commits 1125a02 / cf10733.
The A2A proxy path now has the same SaaS-gated logic as registry.go:
- Cloud metadata (169.254/16, fe80::/10, ::1) always blocked in both modes
- RFC-1918 (10/8, 172.16/12, 192.168/16) + IPv6 ULA (fc00::/7) blocked in
self-hosted, allowed in SaaS cross-EC2 mode
- IPv6 addresses now properly checked (previous version returned false for all)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* docs(marketing): Discord adapter Day 2 Reddit + HN community copy
* fix(tests): supply *events.Broadcaster pointer to captureBroadcaster
Cannot use *captureBroadcaster as *events.Broadcaster when the struct
embeds events.Broadcaster as a value — must initialize as a named field.
Fixes go vet error in workspace_provision_test.go:
cannot use broadcaster (*captureBroadcaster) as *events.Broadcaster value
* Merge pull request #1429 from fix/canvas-tooltip-clear-timer
Without this, a 400ms setTimeout from onFocus/onMouseEnter that fires
after onBlur will re-show a tooltip the user just dismissed. The
setShow(false) in onBlur closes the tooltip immediately but leaves the
timer pending — Tab-blur followed by timer-fire would re-show it.
Fix: add clearTimeout(timerRef.current) at the top of onBlur, mirroring
the pattern already used in onMouseLeave and onFocus.
Refs: PR #1367 (a11y keyboard support — this was a pre-existing gap)
Co-authored-by: Molecule AI App-FE <app-fe@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(canvas/test): add missing children:[] to setPendingDelete expectation (#1426)
PR #1252 (cascade-delete UX) updated setPendingDelete to pass a
children array for cascade-warning rendering. The keyboard-a11y test
assertion was not updated to match.
Test: clicking 'Delete' hoists state to the store and closes the menu
Co-authored-by: Molecule AI Core-QA <core-qa@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(canvas/test): add children:[] to setPendingDelete + \' entity fix (closes#1380) (#1427)
* ci: retry — trigger fresh runner allocation
* fix(canvas/test): add children:[] to setPendingDelete assertion
setPendingDelete now includes children:[] (PR #1383 extended the
pendingDelete type). The keyboard accessibility test at line 225 used
exact object matching which omitted the new field, causing a failure
after staging merged #1383.
Issue: #1380
* fix(canvas): replace ' HTML entity with straight apostrophe
JSX does not entity-decode ' — it renders the literal text
"'" instead of "'". Found at line 157 (payment confirmed) and
line 321 (empty org list). Replaced with a straight apostrophe,
which JSX handles correctly.
Ref: issue #1375
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: DevOps Engineer <devops@molecule.ai>
Co-authored-by: Molecule AI Core-UIUX <core-uiux@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* Merge pull request #1430 from fix/1421-saas-ssrf-helpers
Issue #1421 / #1401: PR #1363 (handler split) moved isPrivateOrMetadataIP
into a2a_proxy_helpers.go but kept the OLD pre-SaaS version — it
unconditionally blocks RFC-1918 addresses, regressing the fix in
commits 1125a02 / cf10733.
The A2A proxy path now has the same SaaS-gated logic as registry.go:
- Cloud metadata (169.254/16, fe80::/10, ::1) always blocked in both modes
- RFC-1918 (10/8, 172.16/12, 192.168/16) + IPv6 ULA (fc00::/7) blocked in
self-hosted, allowed in SaaS cross-EC2 mode
- IPv6 addresses now properly checked (previous version returned false for all)
Co-authored-by: Molecule AI Core-BE <core-be@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(P0): CWE-22 path traversal in copyFilesToContainer + ContextMenu test
Issue #1434 — CWE-22 Path Traversal Regression:
PR #1280 (dc218212) correctly used cleaned path in tar header.
PR #1363 (e9615af) regressed to using uncleaned `name`.
Fix: use `clean` in filepath.Join AND add defence-in-depth escape check.
Issue #1422 — ContextMenu Test Regression:
PR #1340 expanded pendingDelete store type to include `children:[]`.
Test assertion missing the field — add `children:[]` to match.
Note: ssrf.go created (shared isSafeURL/isPrivateOrMetadataIP) to
prepare for the handler-split refactor fix — current branch has no
build error, but the shared file will prevent regression when PR #1363
is merged. isSafeURL/isPrivateOrMetadataIP retained in both files
for now to avoid breaking callers while the split is finalized.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: resolve 3 go vet failures + add idempotency_key to delegate_task_async
- workspace_provision_test.go: add missing mock := setupTestDB(t) to
TestSeedInitialMemories_Truncation — mock was referenced but never
declared, causing "undefined: mock" vet error
- orgtoken/tokens_test.go: discard unused orgID return value with _ in
Validate call — "declared and not used" vet error
- a2a_tools.py: delegate_task_async now sends idempotency_key (SHA-256
of workspace_id + task) to POST /workspaces/:id/delegate, fixing
duplicate task execution when an agent restarts mid-delegation (#1456)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: airenostars <airenostars@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>
Co-authored-by: Hongming Wang <hongmingwangrabbit@gmail.com>
Co-authored-by: Molecule AI Technical Writer <technical-writer@agents.moleculesai.app>
Co-authored-by: Molecule AI Infra-Runtime-BE <infra-runtime-be@agents.moleculesai.app>
Co-authored-by: Molecule AI Core-BE <core-be@agents.moleculesai.app>
Co-authored-by: Molecule AI SDK Lead <sdk-lead@agents.moleculesai.app>
Co-authored-by: Molecule AI Core-FE <core-fe@agents.moleculesai.app>
Co-authored-by: Hongming Wang <hongmingwang.rabbit@users.noreply.github.com>
Co-authored-by: Molecule AI Community Manager <community-manager@agents.moleculesai.app>
Co-authored-by: Molecule AI App-FE <app-fe@agents.moleculesai.app>
Co-authored-by: Molecule AI Core-QA <core-qa@agents.moleculesai.app>
Co-authored-by: DevOps Engineer <devops@molecule.ai>
Co-authored-by: Molecule AI Core-UIUX <core-uiux@agents.moleculesai.app>
Co-authored-by: Molecule AI Dev Lead <dev-lead@agents.moleculesai.app>
Issue #1434 — CWE-22 Path Traversal Regression:
PR #1280 (dc218212) correctly used cleaned path in tar header.
PR #1363 (e9615af) regressed to using uncleaned `name`.
Fix: use `clean` in filepath.Join AND add defence-in-depth escape check.
Issue #1422 — ContextMenu Test Regression:
PR #1340 expanded pendingDelete store type to include `children:[]`.
Test assertion missing the field — add `children:[]` to match.
Note: ssrf.go created (shared isSafeURL/isPrivateOrMetadataIP) to
prepare for the handler-split refactor fix — current branch has no
build error, but the shared file will prevent regression when PR #1363
is merged. isSafeURL/isPrivateOrMetadataIP retained in both files
for now to avoid breaking callers while the split is finalized.
Co-authored-by: Molecule AI Core-BE <core-be@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Verified against live staging: the admin endpoint returns 400 'confirm
field must equal the URL slug' when the body key is 'confirm_token'.
Every workflow's safety-net teardown step + the main harness + the
Playwright teardown all had the wrong key. Fixed all six call sites.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reduces required secret surface from 2 (session cookie + admin token)
to 1 (admin token). Pairs with molecule-controlplane#202 which adds:
- POST /cp/admin/orgs — server-to-server org creation
- GET /cp/admin/orgs/:slug/admin-token — per-tenant bearer fetch
With those endpoints live, CI doesn't need to scrape a browser WorkOS
session cookie. CP admin bearer (Railway CP_ADMIN_API_TOKEN) drives
provision + tenant-token retrieval + teardown through a single
credential.
Changes
-------
test_staging_full_saas.sh: admin bearer for provision/teardown,
fetched per-tenant token drives all tenant API calls. Added
E2E_INTENTIONAL_FAILURE=1 toggle that poisons the tenant token
after provisioning so the teardown path gets exercised when the
happy-path isn't.
canvas/e2e/staging-setup.ts: same pivot; exports STAGING_TENANT_TOKEN
instead of STAGING_SESSION_COOKIE.
canvas/e2e/staging-tabs.spec.ts: context.setExtraHTTPHeaders with
Authorization: Bearer on every page request, no cookie handling.
All three workflows (e2e-staging-saas, canary-staging,
e2e-staging-canvas): drop MOLECULE_STAGING_SESSION_COOKIE env +
verification step. One secret to set.
NEW e2e-staging-sanity.yml: weekly Mon 06:00 UTC. Runs the harness
with E2E_INTENTIONAL_FAILURE=1 and inverts the pass condition —
rc=1 is green, rc=0 (unexpected success) or rc=4 (leak) open a
priority-high issue labelled e2e-safety-net. This is the
answer to 'how do we know the teardown path still works when
nothing else has failed recently.'
STAGING_SAAS_E2E.md refreshed: single-secret setup, sanity workflow
documented, canvas workflow added to the coverage matrix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three additions on top of 187a9bf:
1. Canary (.github/workflows/canary-staging.yml)
30-min cron that runs the full-SaaS harness in E2E_MODE=canary: one
hermes workspace + one A2A PONG + teardown. ~8-min wall clock vs
~20-min for the full run.
Alerting is self-contained: opens a single 'Canary failing' issue on
first failure, comments on subsequent failures (no issue spam),
auto-closes the issue on the next green run. Labels: canary-staging,
bug. Safety-net teardown step sweeps e2e-YYYYMMDD-canary-* orgs
tagged today so a runner cancel can't leak EC2.
2. Canvas Playwright (canvas/e2e/staging-*.ts + playwright.staging.config.ts
+ .github/workflows/e2e-staging-canvas.yml)
staging-setup.ts provisions a fresh org + hermes workspace (same
lifecycle as the bash harness, just in TypeScript). staging-tabs.spec.ts
clicks through all 13 workspace-panel tabs (chat, activity, details,
skills, terminal, config, schedule, channels, files, memory, traces,
events, audit) and asserts each renders without crashing and without
'Failed to load' error toasts. Known SaaS gaps (Files empty, Terminal
disconnects, Peers 401) are documented in #1369 and whitelisted so
they don't fail the test — the gate is 'no hard crash', not 'no
issues'.
staging-teardown.ts deletes the org via DELETE /cp/admin/tenants/:slug.
playwright.staging.config.ts separates staging from local tests so
pnpm test in dev doesn't try to provision against staging. Retries=2
and timeouts are longer; workers=1 because the setup provisions one
shared workspace. Workflow uploads HTML report + screenshots on
failure for 14 days.
3. Delegation mechanics (tests/e2e/test_staging_full_saas.sh section 10)
Parent → child proxy test: POST /workspaces/CHILD/a2a with
X-Source-Workspace-Id=PARENT and verify the child responds + child
activity log captures PARENT as source. Intentionally LLM-free: the
mechanics regression is what matters; prompt-driven delegation
correctness belongs in canvas-driven tests.
Also reorders teardown step to 11/11 since delegation is 10/11.
Mode gating:
E2E_MODE=canary -> skips child workspace, HMA memory, peers,
activity, delegation (steps 6, 9, 10 no-op). Full-lifecycle still
runs every piece. Validated both paths via 'bash -n' syntax check
after each edit.
Secrets requirement unchanged (same two secrets as 187a9bf):
MOLECULE_STAGING_SESSION_COOKIE, MOLECULE_STAGING_ADMIN_TOKEN.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(canvas/a11y): mark StatusDot as aria-hidden — decorative element
StatusDot is purely decorative; the status is already conveyed via
aria-label on parent elements (WorkspaceNode, SidePanel header, etc.).
Marking it aria-hidden="true" prevents screen readers from announcing
the empty div as "img" with no alt text.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(canvas): guard budget_used optional field with ?? 0 in progress calc
TypeScript error in CI: 'budget.budget_used' is possibly 'undefined'
when used in the progress percentage calculation. The field is
optional per BudgetData interface, so ?? 0 is the correct guard.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(canvas/a11y): Tooltip keyboard focus support + ARIA role
- Add role="tooltip" + unique id so assistive tech can find tooltip content
- Add aria-describedby on trigger so screen readers announce tooltip text
- Add onFocus/onBlur handlers so keyboard users (Tab navigation) can see
tooltips that mouse users see on hover
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(canvas/test): restore advanceTimersByTime pattern in orgs-page error test
waitFor() + fake timers (vi.useFakeTimers in beforeEach) cause race
conditions: the 5s polling timeout fires before React state updates flush.
Restores the established pattern used by all other tests in this file:
advanceTimersByTimeAsync(50) + runAllTimersAsync().
Also removes the now-unused waitFor import.
Ref: PRs #1360, #1345
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Molecule AI Core-UIUX <core-uiux@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Issue #1138: Add Playwright E2E for context-menu → delete confirm flow.
The unit test (ContextMenu.keyboard.test.tsx) only exercises the store
setter — it can't catch the portal/race bug from PR #1133 where the
portal-rendered ConfirmDialog was closed by the menu's outside-click
handler before onConfirm fired.
This E2E test covers:
- Right-click workspace node → context menu opens
- Click Delete → ConfirmDialog appears (not swallowed)
- Click Confirm → dialog closes, node disappears, DELETE /workspaces/:id fires
- Click Cancel → dialog closes, node remains
Requires: platform on :8080, canvas on :3000.
Closes#1138.
Co-authored-by: Molecule AI Core-UIUX <core-uiux@agents.moleculesai.app>
Issue #1268: orgs-page error state test — replace vi.advanceTimersByTimeAsync(50)
with waitFor polling. advanceTimersByTimeAsync fires the timer but does not
guarantee React render flush completes before the assertion runs.
Issue #1269: ContextMenu keyboard test — add getState: () => mockStore to
useCanvasStore mock. PR #1243 changed the delete flow to hoist confirmation
to Canvas-level dialog via setPendingDelete, which reads .nodes via
useCanvasStore.getState() — the mock was missing getState.
Also carries forward the Issue #1124 WORKSPACE_ID fail-fast fix from
workspace/ modules (a2a_cli, a2a_client, coordinator, consolidation,
molecule_ai_status) — RuntimeError if WORKSPACE_ID is unset/empty.
Co-authored-by: Molecule AI Core Platform Lead <core-platform-lead@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(ci): revert cancel-in-progress to true — ubuntu-runner dispatch stalled
With cancel-in-progress: false, pending CI runs accumulate in the
ci-staging concurrency group. New pushes create queued runs, but
GitHub dispatches multiple runs for the same SHA instead of replacing
the pending one. All runs get stuck/cancelled before completing.
Reverting to cancel-in-progress: true restores CI operation — runs
that are superseded are cancelled, freeing the concurrency slot for
the new run to proceed.
Runner availability (ubuntu-latest dispatch stall) is a separate
infra issue tracked independently.
* fix(security): validate tar header names in copyFilesToContainer — CWE-22 path traversal (#1043)
Tar header names were built from raw map keys without validation. A malicious
server-side caller could embed "../" in a file name to escape the destPath
volume mount (/configs) and write files outside the intended directory.
Fix: validate each name with filepath.Clean + IsAbs + HasPrefix("..") checks
before using it in the tar header, then join with destPath for the archive
header. Also guard parent-directory creation against traversal.
Closes#1043.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(canvas/test): patch regressed tests from PR #1243 orgs-page flakiness fix
Two regressions introduced by PR #1243 (fix issue #1207):
1. **ContextMenu.keyboard.test.tsx** — `setPendingDelete` now receives
`{id, name, hasChildren}` (cascade-delete UX, PR #1252), but the test
expected only `{id, name}`. Added `hasChildren: false` to the assertion.
2. **orgs-page.test.tsx** — 10 tests awaited `vi.advanceTimersByTimeAsync(50)`
without `act()`. With fake timers, `setState` (synchronous) is flushed by
`advanceTimersByTimeAsync`, but the React state update it triggers is a
microtask — so the test saw stale render. Wrapping in `act(async () =>
{ await vi.advanceTimersByTimeAsync(50); })` ensures microtasks drain
before assertions run.
All 813 vitest tests pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(canvas): add 100px proximity threshold to drag-to-nest detection
Fixes#1052 — previously, getIntersectingNodes() returned any node whose
bounding box overlapped the dragged node, regardless of actual pixel
distance. On a sparse canvas this triggered the "Nest Workspace" dialog
even when the dragged node was nowhere near any target.
The fix adds an on-node-drag proximity filter: only nodes within 100px
(center-to-center) of the dragged node are eligible as nest targets.
Distance is computed as squared Euclidean to avoid the sqrt overhead in
the hot drag path.
Added two tests to Canvas.pan-to-node.test.tsx covering the mock wiring
and confirming the regression is addressed in Canvas.tsx.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(canvas): add ?? 0 guard for optional budget_used in progressPct
Fixes#1324 — TypeScript strict mode flags budget.budget_used as
possibly undefined in the progressPct ternary, even though the
outer condition checks budget_limit > 0.
Fix: use nullish coalescing (budget_used ?? 0) so progress shows 0%
when the backend returns a partial shape (provisioning-stuck
workspaces). Also adds a test covering the undefined-budget_used
case with the progress bar aria-valuenow and fill width both at 0%.
Closes#1324.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>
Co-authored-by: Molecule AI Core-FE <core-fe@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(ci): revert cancel-in-progress to true — ubuntu-runner dispatch stalled
With cancel-in-progress: false, pending CI runs accumulate in the
ci-staging concurrency group. New pushes create queued runs, but
GitHub dispatches multiple runs for the same SHA instead of replacing
the pending one. All runs get stuck/cancelled before completing.
Reverting to cancel-in-progress: true restores CI operation — runs
that are superseded are cancelled, freeing the concurrency slot for
the new run to proceed.
Runner availability (ubuntu-latest dispatch stall) is a separate
infra issue tracked independently.
* fix(security): validate tar header names in copyFilesToContainer — CWE-22 path traversal (#1043)
Tar header names were built from raw map keys without validation. A malicious
server-side caller could embed "../" in a file name to escape the destPath
volume mount (/configs) and write files outside the intended directory.
Fix: validate each name with filepath.Clean + IsAbs + HasPrefix("..") checks
before using it in the tar header, then join with destPath for the archive
header. Also guard parent-directory creation against traversal.
Closes#1043.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(canvas/test): patch regressed tests from PR #1243 orgs-page flakiness fix
Two regressions introduced by PR #1243 (fix issue #1207):
1. **ContextMenu.keyboard.test.tsx** — `setPendingDelete` now receives
`{id, name, hasChildren}` (cascade-delete UX, PR #1252), but the test
expected only `{id, name}`. Added `hasChildren: false` to the assertion.
2. **orgs-page.test.tsx** — 10 tests awaited `vi.advanceTimersByTimeAsync(50)`
without `act()`. With fake timers, `setState` (synchronous) is flushed by
`advanceTimersByTimeAsync`, but the React state update it triggers is a
microtask — so the test saw stale render. Wrapping in `act(async () =>
{ await vi.advanceTimersByTimeAsync(50); })` ensures microtasks drain
before assertions run.
All 813 vitest tests pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(canvas): add 100px proximity threshold to drag-to-nest detection
Fixes#1052 — previously, getIntersectingNodes() returned any node whose
bounding box overlapped the dragged node, regardless of actual pixel
distance. On a sparse canvas this triggered the "Nest Workspace" dialog
even when the dragged node was nowhere near any target.
The fix adds an on-node-drag proximity filter: only nodes within 100px
(center-to-center) of the dragged node are eligible as nest targets.
Distance is computed as squared Euclidean to avoid the sqrt overhead in
the hot drag path.
Added two tests to Canvas.pan-to-node.test.tsx covering the mock wiring
and confirming the regression is addressed in Canvas.tsx.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>
Co-authored-by: Molecule AI Core-FE <core-fe@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(canvas/test): restore test regressions from PR #1243
PR #1243 introduced two regressions in the canvas vitest suite:
1. ContextMenu.keyboard.test.tsx: the setPendingDelete call now
passes `{hasChildren, id, name}` (not just `{id, name}`). Updated
the keyboard-a11y test assertion to match the new store shape.
2. orgs-page.test.tsx: mockFetch.mockResolvedValueOnce() returned a
plain object that didn't match the two-argument (url, options)
call signature used by the component's fetch wrapper. Switched to
mockImplementationOnce returning a rejected Promise — matching
real fetch's rejection contract — and added runAllTimersAsync after
advanceTimersByTimeAsync(50) to flush React state updates.
54 test files · 813 tests · all passing
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(canvas): replace bounding-box intersection with distance threshold for nest detection
ReactFlow's getIntersectingNodes uses bounding-box overlap detection, which
fires the drag-over state whenever any part of two nodes' position rectangles
overlap — even when the dragged node is far from the target. This made the
"Nest Workspace" dialog appear from large distances.
Fix: scan all nodes on each drag tick and set dragOverNodeId to the closest
node within NEST_PROXIMITY_THRESHOLD (150 px, center-to-center). This matches
the intuitive behavior: nest only when the node is actually dropped near another.
Constants:
- NEST_PROXIMITY_THRESHOLD = 150px (~60% of a collapsed node's width)
- DEFAULT_NODE_WIDTH = 245px (mid-range of min/max node widths)
- DEFAULT_NODE_HEIGHT = 110px
Also removed the unused getIntersectingNodes import (was causing duplicate
identifier error when both onNodeDrag and the zoom handler called useReactFlow
in the same component scope).
Closes#1052.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(canvas): cascade-delete UX — show child count and require checkbox before Delete All
Issue #1137: with ?confirm=true always sent, a single confirmation silently
cascades — a team lead with 20 children gets nuked on one click.
Changes:
- store/canvas.ts: pendingDelete type now includes children: {id, name}[]
- ContextMenu.tsx: passes child list to setPendingDelete on Delete click
- DeleteCascadeConfirmDialog.tsx: new component — shows child names, a
cascade warning, and requires the operator to tick a checkbox before
Delete All activates. Disabled by default; only enables after checkbox.
- Canvas.tsx: conditionally renders DeleteCascadeConfirmDialog for
hasChildren workspaces, or plain ConfirmDialog for leaf workspaces.
confirmDelete requires cascadeConfirmChecked=true when hasChildren.
- ContextMenu.keyboard.test.tsx: updated setPendingDelete assertion to
include children:[] (no children in the test fixture).
813 tests pass.
Closes#1137.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Molecule AI Core-UIUX <core-uiux@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
- Store: pendingDelete now carries `hasChildren: boolean` (computed from
nodes.some(parentId === nodeId))
- ContextMenu: passes hasChildren into setPendingDelete
- Canvas: dialog title changes to "Delete Workspace and Children" with
⚠️ message when hasChildren; confirms with "Delete All"
Refs: #1137
Co-authored-by: Molecule AI Fullstack (floater) <fullstack-floater@agents.moleculesai.app>
- DetailsTab: use `(data.lastErrorRate ?? 0)` instead of bare multiply to
prevent NaN% when the field is absent on pre-provisioning workspaces.
- WorkspaceUsage: make formatPeriod accept optional start/end strings;
return "—" for undefined so the usage period shows blank rather than
"Invalid Date" for provisioning/partial workspaces.
Refs: #1139
Co-authored-by: Molecule AI Fullstack (floater) <fullstack-floater@agents.moleculesai.app>
Root cause: tests used try/finally { vi.useRealTimers() / vi.useFakeTimers() }
back-and-forth. When any test's finally-block called vi.useFakeTimers(),
subsequent tests inherited fake timer state causing 50ms real setTimeouts
to not fire and mockFetch to accumulate calls across test boundaries.
Fix: consolidate timer management to beforeEach/afterEach hooks.
- beforeEach: vi.useFakeTimers() — all tests start from known fake state
- afterEach: cleanup() + vi.useRealTimers() — restore real timers for next test
- Individual tests: use vi.advanceTimersByTimeAsync(50) instead of real setTimeout
Also removed duplicate afterEach(cleanup()) and unused waitFor import.
Closes#1207.
Co-authored-by: Molecule AI Core-FE <core-fe@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(auth): F1094 — requireCallerOwnsOrg reads org_id not created_by (#1200)
Root cause: requireCallerOwnsOrg (org_plugin_allowlist.go:116) was
reading org_api_tokens.created_by to determine caller's org workspace
ID. But created_by is a provenance label ("session", "admin-token",
"org-token:<prefix>") — never a UUID. The equality check
callerOrg != targetOrgID always failed → every org-token caller
got 403 on /orgs/:id/plugins/allowlist routes.
Fix:
- Migration 036: adds org_id UUID column (nullable) to org_api_tokens
with index. Existing pre-migration tokens get org_id=NULL → deny
by default (safer than cross-org access).
- orgtoken.Issue: takes new orgID param; stores in org_id column.
- orgtoken.OrgIDByTokenID: new helper reads org_id for a token ID.
Returns ("", nil) for NULL/unanchored tokens.
- requireCallerOwnsOrg: now calls OrgIDByTokenID instead of reading
created_by. Pre-migration tokens with org_id=NULL get callerOrg=""
→ denied (safer).
- orgTokenActor (org_tokens.go): returns (createdBy, orgID) pair.
Token minted via another org token gets its org_id set at mint time.
Session/ADMIN_TOKEN callers get orgID="".
- orgtoken.Token struct: adds OrgID field for list display.
- orgtoken.List: selects org_id alongside other columns.
- Updated existing tests for new Issue signature.
- Added 10 regression tests covering: happy path, unanchored denial,
cross-org denial, session bypass, DB error denial.
🤖 Generated with [Claude Code](https://claude.ai/claude-code)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(security): replace err.Error() leaks with prod-safe messages (#1206)
- workspace_provision.go: provisionWorkspace, provisionWorkspaceCP —
replaced 7 err.Error() calls with "provisioning failed" in both
Broadcast payloads and last_sample_error DB column. Full error
preserved in server-side log.Printf.
- plugins_install_pipeline.go: resolveAndStage — replaced 5 err.Error()
calls with generic messages:
"invalid plugin source"
"plugin source not supported"
"invalid plugin name"
"staged plugin exceeds size limit"
"plugin manifest integrity check failed"
Risk mitigated: DB errors (pq: connection refused, pq: deadlock),
OS errors, and internal paths no longer leak in HTTP JSON responses
or WebSocket broadcasts.
Added regression tests (workspace_provision_test.go):
- TestProvisionWorkspace_NoInternalErrorsInBroadcast
- TestProvisionWorkspaceCP_NoInternalErrorsInBroadcast
- TestResolveAndStage_NoInternalErrorsInHTTPErr
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(F1089): log panic-recovery UPDATE errors in scheduler
The panic defer blocks in tick() and fireSchedule() now capture
and log errors from the db.DB.ExecContext call that advances next_run_at
after a panic. Previously, a DB failure during panic recovery was
silent — the log line for the panic itself appeared but any subsequent
UPDATE failure was invisible, risking unnoticed scheduler drift.
context.Background() was already used (F1089 comment in place); this
commit adds the missing error capture + log.Printf on exec failure.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(issue-1207): eliminate orgs-page test flakiness
Three root causes addressed:
1. Duplicate afterEach blocks (lines 97-103) — two identical
afterEach(() => { cleanup(); }) blocks collapsed to one.
2. Fake-timer isolation gap — if a polling test failed before its
finally-block ran, vi.useFakeTimers() persisted globally. The next
non-polling test's setTimeout(50) then hung indefinitely (fake
timers don't advance without vi.advanceTimersByTime), causing
waitFor/async timeouts. Fixed by calling vi.useRealTimers()
unconditionally in beforeEach (guaranteed clean slate) and
afterEach (even when a test fails before its own finally).
3. mockFetch.callHistory now cleared via mockReset() in beforeEach,
preventing "expected 2 calls but got N" failures from carry-over
between polling tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Molecule AI Dev Lead <dev-lead@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Use explicit navigator.clipboard check instead of optional chaining so
the no-op case is handled explicitly. When clipboard API is unavailable
(non-HTTPS context) show a toast: "Copy requires HTTPS — please select
and copy manually". Production is always HTTPS so this only affects
local dev with http:// canvas.
Closes#1199.
Co-authored-by: Molecule AI Core-FE <core-fe@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(canvas): rewrite MemoryInspectorPanel to match backend API
Issue #909 (chunk 3 of #576).
The existing MemoryInspectorPanel used the wrong API endpoint
(/memory instead of /memories) and wrong field names (key/value/version
instead of id/content/scope/namespace/created_at). It also lacked
LOCAL/TEAM/GLOBAL scope tabs and a namespace filter.
Changes:
- Fix endpoint: GET /workspaces/:id/memories with ?scope= query param
- Fix MemoryEntry type to match actual API: id, content, scope,
namespace, created_at, similarity_score
- Add LOCAL/TEAM/GLOBAL scope tabs
- Add namespace filter input
- Remove Edit functionality (no update endpoint in backend)
- Delete uses DELETE /workspaces/:id/memories/:id (by id, not key)
- Full rewrite of 27 tests to match new API and UI structure
- Uses ConfirmDialog (not native dialogs) for delete confirmation
- All dark zinc theme (no light colors)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: tighten types + improve provision-timeout message (#1135, #1136)
#1135 — TypeScript: make BudgetData.budget_used and WorkspaceMetrics
fields optional to match actual partial-response shapes from provisioning-
stuck workspaces. Runtime already guarded with ?? 0.
#1136 — provisiontimeout.go: replace misleading "check required env vars"
hint (preflight catches that case upfront) with accurate message about
container starting but failing to call /registry/register.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
* fix(test): align ssrf_test.go localhost test cases with isSafeURL behaviour
isSafeURL blocks 127.0.0.1 via ip.IsLoopback() even in dev environments.
The test cases `wantErr: false` for localhost were incorrect — the
test would fail when go test runs. Fix by changing wantErr to true
for both localhost test cases.
Rationale: loopback blocking at this layer is intentional. Access
control is enforced by WorkspaceAuth + CanCommunicate at the A2A
routing layer, not by the URL validation. Opening this would widen
the SSRF attack surface without adding real dev flexibility.
Closes: ssrf_test.go inconsistency reported 2026-04-21
Co-Authored-By: Claude Sonnet 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Molecule AI Core-UIUX <core-uiux@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Post-review cleanup for the #1178 / #1189 bootstrap-watcher flow:
- ConsoleModal status-code matching uses \b regex anchors instead of
raw substrings. Before, any error message containing "501" inside
a longer digit run ("15012") would false-match into the self-hosted
branch. Unlikely in practice but cheap to tighten.
- Peers empty-state copy now explains WHY the list is empty on
offline / failed / provisioning workspaces instead of rendering the
same "No reachable peers" text used for healthy workspaces with
zero siblings. Online workspaces unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The peers endpoint requires a workspace-scoped bearer token (see
validateDiscoveryCaller in handlers/discovery.go — designed for
agent-to-agent calls). The canvas session doesn't hold that token, so
every Details-tab open for a provisioning / failed / offline workspace
fired a 401 that cluttered devtools and lit up the error banner even
though the real UX here is "no peers — the workspace hasn't booted."
Gate the fetch on status ∈ {online, degraded} and render an empty
Peers list for everything else.
Follow-up: give the canvas a way to see peers for any workspace (admin
session should be enough). Tracked separately — this fix just quiets
the noise on the common case.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Part 3 of 3 for the "fail fast + comprehensive logs" UX. Platform PR
#1168 and controlplane #181 ship the server-side; this PR surfaces the
data in the canvas.
Two changes:
1. DetailsTab renders `last_sample_error` in a dedicated Error section
when the workspace is failed (or degraded with an error). Before,
the only trace of why a workspace failed was a generic banner —
users had to click "View Logs", which opened the terminal tab (the
post-boot log, empty on a runtime crash). Now the actual Python
traceback is inline. A "View console output" button in the same
section opens the full serial console in a modal.
2. New ConsoleModal component. Fetches GET /workspaces/:id/console
(platform → CP → ec2:GetConsoleOutput). Portal-rendered above the
canvas with Copy / Close / Esc handlers. Renders a friendly message
on 501 (self-hosted deploys without CP) and 404 (instance
terminated).
3. ProvisioningTimeout's "View Logs" button now opens the console
modal instead of the (usually empty) terminal tab — when a
workspace is stuck in provisioning, the cloud-init + user-data
trace is what the user actually needs.
Tests cover the closed-state no-fetch, happy-path fetch, 501/404
messaging, and Close/Escape wiring.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes: #177 (CRITICAL — Dockerfile runs as root)
Dockerfiles changed:
- workspace-server/Dockerfile (platform-only): addgroup/adduser + USER platform
- workspace-server/Dockerfile.tenant (combined Go+Canvas): addgroup/adduser + USER canvas
+ chown canvas:canvas on canvas dir so non-root node process can read it
- canvas/Dockerfile (canvas standalone): addgroup/adduser + USER canvas
- workspace-server/entrypoint-tenant.sh: update header comment (no longer starts
as root; both processes now start non-root)
The entrypoint no longer needs a root→non-root handoff since both the Go
platform and Canvas node run as non-root by default. The 'canvas' user owns
/app and /platform, so volume mounts owned by the host's canvas user work
without needing a root init step.
Co-authored-by: Molecule AI CP-BE <cp-be@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
billing.ts (startCheckout, openBillingPortal): replace raw res.text()
in thrown Error with a safe status-only message. The response body from
/cp/billing/* routes can contain Stripe API error detail (invalid key,
card decline message, raw Stripe envelope) that should not reach clients.
orgs/page.tsx (createOrg): same fix — raw body → safe message.
Full body is logged server-side for debugging.
Closes: #91 (CWE-209 — Stripe key echoed in error)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root causes:
1. TermsGate (rendered inside OrgsPage Shell) fetches /cp/auth/terms-status
before OrgsPage fetches /cp/orgs, consuming the first mockResponseOnce
slot — leaving /cp/orgs with no mock and throwing TypeError.
Fix: mock TermsGate as a pass-through component in vi.mock.
2. Non-polling tests used mockFetchSession.mockResolvedValueOnce() which
exhausted after one call; React 18 concurrent re-renders call
fetchSession() multiple times, causing subsequent calls to return
undefined. Fix: use mockResolvedValue() (persistent) for fetchSession.
3. vi.clearAllMocks() in beforeEach kept mockResolvedValueOnce from
previous tests from leaking BUT the vi.fn() mock implementation was
already reset by mockFetchSession.mockReset() in beforeEach. Tests
were passing stale persistent mocks from previous tests. Fix:
mockFetchSession.mockReset() in beforeEach + mockResolvedValue in
each test.
4. Polling tests used vi.useFakeTimers() without shouldAdvanceTime,
which prevented React's useEffect from calling fetch() (0 calls).
Fix: use vi.useFakeTimers({ shouldAdvanceTime: true }) + await
vi.advanceTimersByTimeAsync() to advance time during await.
5. Unmount test unmounted before effects fired (with shouldAdvanceTime).
Fix: flush microtasks with await vi.advanceTimersByTimeAsync(0)
before unmount so the effect runs and schedules the poll timer.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three nits identified during post-merge review of #1119, #1133:
1. ContextMenu.tsx imported `removeNode` from the canvas store but
stopped using it when the delete-confirm flow moved to Canvas in
#1133. Also removed the now-unused mock entry in the keyboard
test so the test inventory matches the real call list.
2. Preflight's YAML parse failure was a silent pass — defensible since
the in-container preflight owns the schema, but invisible to ops if
a template ships malformed YAML. Log at WARN so the signal surfaces
without blocking the provision.
3. formatMissingEnvError rendered its slice via %q, producing
`["A" "B"]` which is Go-literal-looking and ugly in a user-facing
error. Join with ", " instead. Test updated to assert the new
format.
No behavioural changes beyond the log line; fixes are review nits, not
bug fixes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Clicking "Delete" in the workspace context menu did nothing for stuck
workspaces. The confirm dialog was rendered via portal as a child of
ContextMenu. ContextMenu's outside-click handler checks whether the
click target is inside its ref — but the portal puts the dialog in
document.body, outside the ref. So clicking the dialog's Confirm
counted as "outside", closed the menu, unmounted the dialog mid-click,
and the onConfirm handler never ran.
Hoist the pending-delete state to the canvas store and render the
confirm dialog at the Canvas level (same pattern as the existing
pendingNest dialog). The dialog now outlives ContextMenu, so the
outside-click close is harmless. Close the context menu on the Delete
click itself rather than waiting for the dialog to resolve.
Add a regression test covering the new flow and add the standard
?confirm=true query param so the backend's child-cascade guard is
consulted correctly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Workspaces stuck in status='provisioning' previously surfaced in three
bad ways:
1. **Details tab crashed** with `Cannot read properties of undefined
(reading 'toLocaleString')`. `BudgetSection` + `WorkspaceUsage`
assumed full response shapes but a provisioning-stuck workspace
returns partial `{}`. Guard each deep field with `?? 0` and cover
the partial-response case with regression tests.
2. **Missing required env vars failed silently** 15+ minutes later as
a cosmetic "Provisioning Timeout" banner. The in-container preflight
catches them but by then the container has already crashed without
calling /registry/register, so the workspace sat in 'provisioning'
forever. Mirror the preflight server-side: parse config.yaml's
`runtime_config.required_env` before launch, fail fast with a
WORKSPACE_PROVISION_FAILED event naming the missing vars.
3. **No backend timeout** ever flipped a stuck workspace to 'failed'.
Add a registry sweeper (10m default, env-overridable) that detects
workspaces stuck past the window, flips them to 'failed', and emits
WORKSPACE_PROVISION_TIMEOUT. Race-safe: the UPDATE re-checks the
status + age predicate so a concurrent register/restart wins.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds user-facing API keys with full-org admin scope. Replaces the
single ADMIN_TOKEN env var with named, revocable, audited tokens
that users can mint/rotate from the canvas UI without ops
intervention.
Designed for the beta growth phase — one token tier (full admin).
Future work will split into scoped roles (admin / workspace-write
/ read-only) and per-workspace bindings. See docs/architecture/
org-api-keys.md for the design + follow-up roadmap.
## Surface
POST /org/tokens mint (plaintext returned once)
GET /org/tokens list live keys (prefix-only)
DELETE /org/tokens/:id revoke (idempotent)
All AdminAuth-gated. Bootstrap path: mint the first token via
ADMIN_TOKEN or canvas session; tokens can mint more tokens after.
## Validation as a new AdminAuth tier (2a)
AdminAuth evaluation order:
Tier 0 lazy-bootstrap fail-open (only when no live tokens AND
no ADMIN_TOKEN env)
Tier 1 verified WorkOS session via /cp/auth/tenant-member
Tier 2a org_api_tokens SELECT — NEW
Tier 2b ADMIN_TOKEN env (bootstrap / CLI break-glass)
Tier 3 any live workspace token (deprecated, only when ADMIN_TOKEN
unset)
Tier 2a runs ONE indexed lookup (partial index on
token_hash WHERE revoked_at IS NULL) + an async last_used_at
bump. No measurable latency cost on the hot path.
## UI
New "Org API Keys" tab in the settings panel. Label field for
human-readable naming. Plaintext shown once + clipboard copy.
Revoke with confirm dialog. Mirrors the existing workspace-
TokensTab flow so users who've used one get the other for free.
## Security properties
- Plaintext never stored. sha256 hash + 8-char display prefix.
- Revocation is immediate: partial index on revoked_at IS NULL
means the next request validates or fails in microseconds.
- created_by audit field captures provenance: "org-token:<short>"
when a token mints another, "session" for browser-UI mints,
"admin-token" for the ADMIN_TOKEN bootstrap path.
- Validate() collapses all failure shapes into ErrInvalidToken
so response-shape can't distinguish "never existed" from
"revoked".
## Tests
- internal/orgtoken: 9 unit tests (hash storage, empty field
null-ing, validation happy path, empty plaintext, unknown hash,
revoked filtering, list ordering, revoke idempotency, has-any-
live short-circuit).
- AdminAuth tier-2a integration covered by existing middleware
tests unchanged (fail-open + bearer paths).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Canvas's browser bundle issues fetches to both CP endpoints
(/cp/auth/me, /cp/orgs, ...) AND tenant-platform endpoints
(/canvas/viewport, /approvals/pending, /org/templates). They
share ONE build-time base URL. Baking api.moleculesai.app
broke tenant calls with 404; baking the tenant subdomain broke
auth. Tried both today and saw exactly one failure mode per
attempt.
Real fix: same-origin fetches + tenant-side split. Adds:
internal/router/cp_proxy.go # /cp/* → CP_UPSTREAM_URL
mounted before NoRoute(canvasProxy). Now a tenant serves:
/cp/* → reverse-proxy to api.moleculesai.app
/canvas/viewport,
/approvals/pending,
/workspaces/:id/*,
/ws, /registry, → tenant platform (existing handlers)
/metrics
everything else → canvas UI (existing reverse-proxy)
Canvas middleware reverts to `connect-src 'self' wss:` for the
same-origin path (keeping explicit PLATFORM_URL whitelist as a
self-hosted escape hatch when the build-arg is non-empty).
CI build-arg flips to NEXT_PUBLIC_PLATFORM_URL="" so the bundle
issues relative fetches.
Security of cp_proxy:
- Cookie + Authorization PRESERVED across the hop (opposite of
canvas proxy) — they carry the WorkOS session, which is the
whole point.
- Host rewritten to upstream so CORS + cookie-domain on the CP
side see their own hostname.
- Upstream URL validated at construction: must parse, must be
http(s), must have a host — misconfig fails closed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tenant page loads were blocked by:
Refused to connect to 'https://api.moleculesai.app/cp/auth/me'
because it violates the document's Content Security Policy.
CSP had `connect-src 'self' wss:` — fine for same-origin + any wss,
but browser refuses cross-origin HTTPS fetches that aren't listed.
PLATFORM_URL (baked from NEXT_PUBLIC_PLATFORM_URL, which is the CP
origin on SaaS tenants) needs to be explicit.
Fix: middleware reads NEXT_PUBLIC_PLATFORM_URL at build/runtime
and adds both the https and wss siblings to connect-src. Self-
hosted deploys that override the build-arg automatically get a
matching CSP — no hardcoded hostname.
Test added: buildCsp includes NEXT_PUBLIC_PLATFORM_URL origin in
connect-src when set. Also loosens the dev `ws:` assertion since
dev uses `connect-src *` which subsumes ws (pre-existing behavior,
test was stale).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tenant page loads were failing with repeated CSP violations:
Executing inline script violates ... script-src 'self'
'nonce-M2M4YTVh...' 'strict-dynamic'. ...
because Next.js's bootstrap inline scripts were emitted without a
nonce attribute. The middleware was generating per-request nonces
correctly and sending them via `x-nonce` — but the layout was
fully static, so Next.js cached the HTML once and served that cached
bundle (no nonces baked in) for every request.
Fix: call `await headers()` in the root layout. That opts the tree
into dynamic rendering AND signals Next.js to propagate the
x-nonce value to its own generated <script> tags.
The `nonce` return value is intentionally unused — the framework
handles its bootstrap scripts automatically once the read happens.
Future code that adds third-party <Script> components (analytics,
etc.) should pass the returned nonce explicitly.
Verified against live tenant: before this change every /_next/
chunk script tag in the HTML had no nonce attribute; expected after
deploy is `<script nonce="..." src="/_next/...">` on each.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three changes that keep getting lost on nuke+rebuild:
1. middleware.ts: read CSP_DEV_MODE env to relax CSP in local Docker
2. api.ts: send NEXT_PUBLIC_ADMIN_TOKEN header (AdminAuth on /workspaces)
3. Dockerfile: accept NEXT_PUBLIC_ADMIN_TOKEN as build arg
All three are required for the canvas to work in local Docker where
canvas (port 3000) fetches from platform (port 8080) cross-origin.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
#1080 added /waitlist to canvas, but canvas isn't served at
app.moleculesai.app — it backs the tenant subdomains (acme.moleculesai.app
etc.). The real /waitlist lives in the separate molecule-app repo,
which is what the CP auth callback redirects to.
molecule-app#12 has the real page + contact form wiring to
/cp/waitlist/request. This canvas copy was never reachable and would
only diverge.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the user-facing half of the beta-gate: a page at /waitlist that
the CP auth callback redirects users to when their email isn't on
the allowlist. Collects email + optional name + use-case and POSTs
to /cp/waitlist/request (backend landed in controlplane #150).
## Behavior
- No auto-pre-fill of email from URL query (CP's #145 dropped the
?email= param for the privacy reason; this test guards against a
future regression on the client side).
- Client-side validates email shape for instant feedback; backend
re-validates.
- Three UI states after submit:
success → "your request is in" banner, form hidden
dedup → softer "already on file" banner when backend returns
dedup=true (same 200, no 409 to avoid enumeration)
error → inline banner with backend message or network fallback
## Tests
9 tests in __tests__/waitlist-page.test.tsx covering:
- default render + a11y (role=button, role=status, role=alert)
- URL-pre-fill privacy regression guard
- HTML5 + JS validation (empty, malformed)
- successful POST with trimmed body
- dedup branch
- non-2xx with + without error field
- network rejection
Follow-up to the beta-gate rollout on controlplane #145 / #150.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mechanical lint fix. github-code-quality[bot] flagged unused
import on line 18 — fireEvent is imported but never referenced in
the test file. Removing it clears the code quality gate without
changing any test behaviour.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The test docstring promised polling coverage but I'd only wired the
describe-block header, not the actual tests. Closing that gap — vitest
fake timers drive three cases:
- `provisioning` org → 2nd fetch fires after 5.1s advance
- all `running` → no 2nd fetch even after 10s advance
- `awaiting_payment` org, unmount before timer fires → no post-unmount
fetch (cleanup correctly clears the pollTimer)
The unmount case is the meaningful one: without it a fast nav-away
leaves the 5s interval chasing the CP forever. page.tsx L97-99 does
clear the timer; the test pins the contract.
Local baseline on origin/staging tip ede6597 + this branch:
canvas vitest: 50 files / 781 tests, all green (+3 vs prior commit)
canvas build: clean
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two independent test additions that harden the surface freshly landed on
staging via PRs #982 (canvas fetch timeout), #992 (/orgs landing), #994
(post-checkout redirect to /orgs).
canvas/src/lib/__tests__/api.test.ts (+74 lines, 7 new tests)
- GET/POST/PATCH/PUT/DELETE each pass an AbortSignal to fetch
- TimeoutError (DOMException name=TimeoutError) propagates to the caller
- Each request installs its own signal — no shared module-level controller
that would allow one slow request to cancel an unrelated fast one
This is the hardening nit I flagged in my APPROVE-w/-nit review of
fix/canvas-api-fetch-timeout. Landing as a follow-up now that #982 is in
staging.
canvas/src/app/__tests__/orgs-page.test.tsx (+251 lines, new file, 10 tests)
- Auth guard: signed-out → redirectToLogin and no /cp/orgs fetch
- Error state: failed /cp/orgs → Error message + Retry button
- Empty list: CreateOrgForm renders
- CTA by status:
running → "Open" link targets {slug}.moleculesai.app
awaiting_payment → "Complete payment" → /pricing?org=<slug>
failed → "Contact support" mailto
- Post-checkout: ?checkout=success renders CheckoutBanner AND
history.replaceState scrubs the query param
- Fetch contract: /cp/orgs called with credentials:include + AbortSignal
Local baseline on origin/staging tip ede6597:
canvas vitest: 50 files / 778 tests, all green
canvas build: clean, /orgs route present (2.83 kB / 105 kB first-load)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wraps /orgs in a TermsGate that polls /cp/auth/terms-status on mount
and overlays a blocking modal when the current terms version hasn't
been accepted yet. "I agree" POSTs /cp/auth/accept-terms and dismisses
the modal; the backend records IP + UA as GDPR Art. 7 proof-of-consent.
Also adds a short data residency notice under the page header:
workspaces run in AWS us-east-2 (Ohio, US). An EU region selector is
a future lift once the infra is provisioned there.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the UI surface for the credit system to /orgs:
- CreditsPill next to each org row. Tone shifts from zinc → amber at
10% of plan to red at zero.
- LowCreditsBanner appears under the pill for running orgs when the
balance crosses thresholds: overage_used > 0 → "overage active",
balance <= 0 → "out of credits, upgrade", trial tail → "trial almost
out".
- Pure helpers extracted to lib/credits.ts so formatCredits, pillTone,
and bannerKind are unit-tested without jsdom.
Backend List query now returns credits_balance / plan_monthly_credits
/ overage_used_credits / overage_cap_credits so no second round-trip
is needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two small polish items that together close the signup-to-running-tenant
flow for real users:
1. Stripe success_url now points at /orgs?checkout=success instead of
the current page (was pricing). The old behavior left people staring
at plan cards with no indication payment went through — the new
behavior drops them right onto their org list where they can watch
the status flip.
2. /orgs shows a green "Payment confirmed, workspace spinning up"
banner when it sees ?checkout=success, then clears the query
param via replaceState so a reload doesn't show it again.
3. /orgs now polls every 5s while any org is awaiting_payment or
provisioning. Users see the Stripe webhook's effect live — no
manual refresh needed — and once every org settles the polling
stops so idle tabs don't hammer /cp/orgs.
Paired with PR #992 (the /orgs page itself) this makes the end-to-end
flow on BILLING_REQUIRED=true deployments feel right:
/pricing → Stripe → /orgs?checkout=success → banner → live poll →
"Open" button when org.status transitions to running.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CP's Callback handler redirects every new WorkOS session to
APP_URL/orgs, but canvas had no such route — new users hit the canvas
Home component, which tries to call /workspaces on a tenant that
doesn't exist yet, and saw a confusing error. This PR plugs that gap
with a dedicated landing page that:
- Bounces anonymous visitors back to /cp/auth/login
- Zero-org users see a slug-picker (POST /cp/orgs, refresh)
- For each existing org, shows status + CTA:
* awaiting_payment → amber "Complete payment" → /pricing?org=…
* running → emerald "Open" → https://<slug>.moleculesai.app
* failed → "Contact support" → mailto
* provisioning → read-only "provisioning…"
- Surfaces errors inline with a Retry button
Deliberately server-light: one GET /cp/orgs, no WebSocket, no canvas
store hydration. Goal is to move the user from signup to either
Stripe Checkout or their tenant URL with one click each.
Closes the last UX gap between the BILLING_REQUIRED gate landing on
the CP and real users being able to complete a signup today.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-launch audit flagged api.ts as missing a timeout on every fetch.
A slow or hung CP response would leave the UI spinning indefinitely
with no way for the user to abort — effectively a client-side DoS.
15s is long enough for real CP queries (slowest observed is Stripe
portal redirect at ~3s) and short enough that a stalled backend
surfaces as a clear error with a retry affordance.
Uses AbortSignal.timeout (widely supported since 2023) so the
abort propagates through React Query / SWR consumers cleanly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
amber-400 on zinc-900 is 5.4:1 (AA pass). amber-300 is 6.9:1 (AA+AAA pass)
and matches the rest of the amber usage in WorkspaceNode (currentTask,
error detail, badge chip).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sets up shadcn/ui CLI so new components can be added with
`npx shadcn add <component>`. Uses new-york style, zinc base color,
no CSS variables (matches existing Tailwind-only approach).
Adds clsx + tailwind-merge for the cn() utility.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
next dev --turbopack for significantly faster dev server startup
and hot module replacement. Build script unchanged (Turbopack for
next build is still experimental).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Shift+click to toggle node selection (multi-select mode)
- BatchActionBar floating at bottom when >1 node selected
- Batch Restart All, Pause All, Delete All with ConfirmDialog
- Selected nodes get blue ring highlight
- Escape clears selection
- Pane click clears selection
- Dark theme, accessible (ARIA labels, focus rings)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds role="button", tabIndex, aria-label="Select <name>", and keyboard
handlers (Enter/Space) to TeamMemberChip. Fixes 5 failing a11y tests
from issue #831. Updates eject button test to match existing label format.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Conflicts arose because PR #892 base commits (MemoryInspectorPanel creation,
A2A overlay) had already landed on main via a different merge path, and
last-tick merges (#876, #888) had modified Toolbar, SidePanel, and test
fixtures.
Resolution strategy:
- Toolbar.tsx, SidePanel.tsx, Canvas.a11y.test.tsx, Canvas.pan-to-node.test.tsx,
MemoryInspectorPanel.test.tsx: take main (strictly newer, already contains
the branch's A2A overlay content plus subsequent a11y/UX fixes)
- MemoryInspectorPanel.tsx: take main (543 lines with semantic search) + apply
sanitizeId() helper from #904 + update bodyId prefix to mem-body-
- DetailsTab.tsx: take main (has #875 Field/useId + #878 deleteButtonRef/focus)
+ apply alertdialog structure from #905 while preserving focus management
Mechanical conflict resolution by triage-agent; no logic changes beyond the
four a11y fixes already in the branch (#902-#905).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PR #878 landed before this branch and added useRef + deleteButtonRef focus-
management to DetailsTab.tsx. This commit combines that import with the
useId/cloneElement import added here, and preserves the Field component
htmlFor/id wiring from this PR unchanged.
Mechanical conflict resolution by triage-agent; no logic changes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
role="alert" is for passive announcements. A delete confirmation with
Confirm/Cancel action buttons requires a user response, which is the
semantics of role="alertdialog" (interactive dialog requiring response).
- Replace role="alert" with role="alertdialog" + aria-modal="true"
- Add aria-labelledby="delete-confirm-title" for an accessible name
- Add <h3 id="delete-confirm-title"> as the labelling element
("Confirm deletion") so AT announces the dialog purpose on focus
Closes#905
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Memory keys can contain characters like [ ] / : . # and spaces that make
invalid HTML id values (breaks CSS selectors and ARIA id-ref lookups).
- Add sanitizeId() helper: replaces non-alphanumeric chars with hyphens,
collapses consecutive hyphens, strips leading/trailing hyphens
- Compute bodyId = "mem-body-{sanitizeId(entry.key)}" in MemoryEntryRow
- Set id={bodyId} on the expanded body container
- Set aria-controls={bodyId} on the toggle button so AT can navigate
directly between the button and its controlled panel
Closes#904
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add aria-pressed={filter === f.id} to every filter pill button so AT
announces which filter is currently active
- Add aria-pressed={autoRefresh} to the auto-refresh toggle so AT
announces the live/paused state when the button is activated
Closes#903
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add role="alert" to the global error banner and the inline add-form
error message so screen readers announce errors immediately on render
- Add aria-label to all three add-form inputs (key / value / TTL) so
every form control has an accessible name (was flagged as unlabelled)
- Add aria-expanded={expanded === entry.key} to each entry toggle button
so AT announces collapsed/expanded state on activation
Closes#902
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The error banner div introduced in the MemoryInspectorPanel (PR #892)
was missing role="alert", regressing the a11y standard established in
PR #877 / issue #830. Screen readers now announce the error immediately
on render.
Closes#901
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Per UIUX Cycle 5 spec, Dialog.Content should carry an explicit
aria-label="Conversation trace" in addition to the aria-labelledby
automatically wired by Radix Dialog via Dialog.Title. This provides
a fallback accessible name directly on the dialog container element.
All 732 tests pass, build clean.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Corrects the source-input aria-label wording to match the UIUX Cycle 4
spec exactly. Previous commit used "Install plugin from source URL";
spec says "Install from source URL" (matches the visible "Install from
source" section heading). Updates the corresponding test assertions.
No functional change. All 736 tests pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- WorkspaceNode.eject.test.tsx: add draggable/selectable/deletable to
NodeProps render call (TS2739); add `as WorkspaceNodeData` cast on
makeNodeData return to silence Partial<> spread widening (TS2322)
The cherry-picked fix/canvas-test-fixture-budgetlimit commit (9e0aa61)
also lands here — it resolves latent test-fixture drift in 7 test files
that the incremental tsc cache had masked on main but that became visible
once the new WorkspaceNode.eject.test.tsx file invalidated the cache.
tsc --noEmit: 0 errors | npm test: 726 passed | npm run build: clean
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The budget PR (#541) added budgetLimit: number | null as a required field
on WorkspaceNodeData and budget_limit: number | null on WorkspaceData.
Seven test fixture factories were not updated, causing tsc --noEmit to
produce 34 TS2322/TS2345 errors (runtime tests still passed because
Vitest transpiles via esbuild which strips types).
Fixes:
- canvas-events.test.ts: makeNode factory +budgetLimit: null
- canvas-events-pan.test.ts: makeNode factory +budgetLimit: null
- canvas-capabilities.test.ts: makeNodeData factory +budgetLimit: null
- canvas-topology.test.ts: makeWS factory +budget_limit: null
- canvas.test.ts: makeWS factory +budget_limit: null; two inline
summarizeWorkspaceCapabilities args +budgetLimit: null; context-menu
fixture +budgetLimit: null
- ProvisioningTimeout.test.tsx: makeWS factory +budget_limit: null
Also fixes 3 TS2348 errors in AuthGate.test.tsx: newer Vitest type defs
resolve ReturnType<typeof vi.fn> to Mock<Procedure|Constructable> which
TypeScript no longer considers directly callable in a vi.mock factory.
Fix: intersect the mock variables with a plain function type so both the
call expression and the mock API (mockReturnValue etc.) type-check.
tsc --noEmit: 0 errors. npm test: 722/722.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- EjectIcon now accepts React.SVGProps<SVGSVGElement> so aria-hidden can be passed
- Eject button: aria-label and title both use `Extract ${data.name} from team`
(previously title was static 'Extract from team'; aria-label was absent)
- <EjectIcon aria-hidden="true"> prevents assistive tech from double-announcing
the icon content inside the already-labelled button
- Added WorkspaceNode.eject.test.tsx (4 tests) covering aria-label, title,
label==title invariant, and aria-hidden on the SVG
Closes#854
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two WCAG violations in the Danger Zone delete flow:
1. WCAG 4.1.3 (Status Messages): the confirmation UI that appears when
the user clicks "Delete Workspace" had no ARIA live region, so screen
readers never announced the confirmation prompt. Adding role="alert"
to the confirmation container makes it an implicit assertive live
region that is announced immediately.
2. WCAG 2.4.3 (Focus Order): pressing Cancel left focus wherever the
browser placed it (often body). Keyboard users had to re-navigate to
find the Delete Workspace button. The Cancel handler now calls
deleteButtonRef.current?.focus() to return focus to the trigger
button, matching the expected modal/disclosure focus-management pattern.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
WCAG 1.3.1 / 4.1.3: the onboarding card had no landmark role and no
live region, so screen readers had no way to know the card exists or
that the step changed.
- Add role="complementary" aria-label="Onboarding guide" to the card
container so it appears as a named landmark in assistive technology.
- Add a role="status" aria-live="polite" aria-atomic="true" sr-only div
that holds the current step label. When the step state changes React
updates the div content, which the live region broadcasts to the AT
without pulling focus away from the user's current position.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
NVDA and other screen readers ignore the title attribute on interactive
elements and non-interactive divs. Add aria-label alongside title on:
- Stop All button (dynamic label reflects active task count)
- Restart All button (dynamic label reflects pending workspace count)
- StatusPill component (online/offline/failed/provisioning counts)
- WsStatusPill component (connected/connecting/disconnected variants)
Inner dot and text spans get aria-hidden="true" so the screen reader
reads the single aria-label rather than individual child nodes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
WCAG 1.3.1 / 4.1.3: the error div that appears after a failed workspace
deploy or blank-workspace create had no ARIA live region, so screen
readers never announced it. Adding role="alert" makes the message an
implicit aria-live="assertive" region so assistive technology surfaces
the error immediately without requiring the user to navigate to it.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wire WCAG 1.3.1 label associations: 6 bare <label>+control pairs in
ConfigTab (Description, Tier, Runtime, Effort, Task Budget, Backend) now
use stable useId() IDs with matching htmlFor/id. Field helper in
DetailsTab updated to generate its own fieldId via useId() and inject it
into the child element via cloneElement, so every Name/Role/Tier field in
edit mode is correctly associated without requiring call-site changes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add role="separator" + aria-valuenow/min/max/orientation + tabIndex={0}
to make the resize handle focusable and discoverable by screen readers
(WAI-ARIA slider pattern). Add onKeyDown handler: ArrowLeft/Right moves
by 16px, Home/End snaps to min/max. Persist width to localStorage on
keyboard resize, matching the existing mouse behaviour.
Focus ring uses focus-visible:ring-2 to avoid showing on mouse click.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously loadMessagesFromDB swallowed all errors and returned [] — a
network failure was indistinguishable from an empty history, so the user
had no way to know loading failed. Now the function returns
{ messages, error } and the MyChatPanel renders a role="alert" banner
with the error message and a Retry button when messages are empty and
a load error occurred.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace title attribute (not read by screen readers for truncated text)
with aria-label, add role="status" so live regions announce the error,
and raise text color from text-amber-300/60 (~2.1:1) to text-amber-400
(~10.6:1) to meet WCAG AA contrast (4.5:1 minimum).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add bodyId derived from entry.key, attach aria-controls={bodyId} to the
toggle button, and add id={bodyId} role="region" aria-label to the
collapsible body div. Screen readers can now announce the expand/collapse
relationship between the button and the region it controls (WCAG 4.1.2).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Badge was always text-zinc-500; apply blue-500 (>=0.8), zinc-400 (0.5–0.8),
zinc-600 (<0.5) per spec. Add 3 vitest tests for each color tier (725 total).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Resolves 4 merge conflicts: Toolbar.tsx (2), Canvas.a11y.test.tsx (1),
Canvas.pan-to-node.test.tsx (1). All conflicts were additive — PR adds
selectedNodeId/setPanelTab selectors and the Audit toolbar button; main
didn't have them. Took PR additions throughout.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- AuditTrailPanel SidePanel tab showing the workspace audit ledger from
GET /workspaces/:id/audit with cursor-based pagination (?cursor=, ?limit=50)
- Color-coded event-type badges: delegation=blue-500, decision=violet-500,
gate=yellow-500, hitl=orange-500
- chain_valid=false renders red tamper warning indicator
- Event-type filter bar (All / Delegation / Decision / Gate / HITL) resets
pagination and reloads with ?event_type= param
- Relative timestamps refreshed every 30 s without re-fetching
- Empty state with icon and descriptive copy
- Toolbar Audit button (ledger icon) switches panel to audit tab for
selected workspace, or shows toast if no workspace is selected
- 29 new unit tests across formatAuditRelativeTime, AuditEntryRow, and
AuditTrailPanel component integration suites
- Update SidePanel.tabs.test.tsx for 13-tab count and audit as last tab
- Add setPanelTab to Canvas test store mocks (Toolbar now reads it)
Closes#753
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- New A2ATopologyOverlay component polls /activity fan-out every 60s and
writes directed edges to a2aEdges store slice (separate from topology edges)
- buildA2AEdges aggregates delegate rows per source→target pair; violet-500
animated edge when last call <5 min ago, blue-500 static otherwise
- Toolbar toggle persists to localStorage (molecule:show-a2a-edges)
- Canvas.tsx merges a2aEdges into allEdges via useMemo; pointerEvents:none
on all edge elements keeps nodes draggable
- 24 new unit tests across pure function, helper, and component suites
- Fix Canvas.a11y and Canvas.pan-to-node store mocks (missing A2A fields)
Closes#744
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- New A2ATopologyOverlay component polls /activity fan-out every 60s and
writes directed edges to a2aEdges store slice (separate from topology edges)
- buildA2AEdges aggregates delegate rows per source→target pair; violet-500
animated edge when last call <5 min ago, blue-500 static otherwise
- Toolbar toggle persists to localStorage (molecule:show-a2a-edges)
- Canvas.tsx merges a2aEdges into allEdges via useMemo; pointerEvents:none
on all edge elements keeps nodes draggable
- 24 new unit tests across pure function, helper, and component suites
- Fix Canvas.a11y and Canvas.pan-to-node store mocks (missing A2A fields)
Closes#744
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a fifth option to the effort <select> in the Claude Settings section:
<option value="max">max — absolute ceiling</option>
The dropdown now offers: low / medium / high / xhigh / max.
effort is typed as string? so no interface update required.
Test updated: source-assertion count "four" → "five", new toYaml
serialization test for effort: max.
641/641 tests pass. Build clean.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DetailsTab renders WorkspaceUsage alongside BudgetSection. The test suite
sets api.get to return [] (a valid empty peers list) but WorkspaceUsage
calls api.get for metrics and crashes on undefined input_tokens when the
mock returns an array instead of a WorkspaceMetrics object.
Add a stub vi.mock following the same pattern already used for BudgetSection.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three add/add + content conflicts, all mechanical:
- WorkspaceUsage.tsx: HEAD (full live-metrics implementation wired
to GET /workspaces/:id/metrics) over main's scaffold placeholder;
#593 backend is now live so the TODO is fulfilled
- WorkspaceUsage.test.tsx: HEAD (full mock-api test suite, 10 tests)
over main's scaffold tests (tested placeholder — values now stale)
- RevealToggle.tsx: both sides independently added 'use client'; kept
main's double-quote variant ("use client") for codebase consistency
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The vi.mock("../../../store/canvas") call was nested inside an it()
block. Vitest hoists all vi.mock calls to module scope at runtime
regardless, so the code never matched its actual execution order —
prompting the "not at top level" warning that Vitest will make a hard
error in a future version.
Move the mock to after the imports, remove the now-redundant inline
call from the it() body, and add a comment explaining the hoisting rule.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>