molecule-core

Author	SHA1	Message	Date
Hongming Wang	06c85bd185	Merge pull request #2045 from Molecule-AI/feat/flat-rate-pricing-1833 feat(canvas): flat-rate pricing — rename Starter→Team, Pro→Growth (Issue #1833)	2026-04-25 05:54:06 +00:00
Hongming Wang	e0f338e8ae	fix(canvas): plug timer leak + optimistic-install semantics in SkillsTab Three review-driven fixes plus regression coverage for the bugs landed in `176b703d` / `deedb5ef`: 1. clearTimeout the prior reload handle before scheduling a new one in both installFromSource and handleUninstall. Two installs within the PLUGIN_RELOAD_DELAY_MS window (15s) used to queue two loadInstalled() calls; the unmount cleanup only cleared the latest handle, and the second reconciliation could overwrite a still-correct optimistic state with a stale snapshot mid-restart. 2. Drop `setInstalledLoaded(true)` from the optimistic block. That flag's contract is "the initial GET has succeeded at least once" — it gates the auto-expand-registry effect. A user installing a custom-source plugin BEFORE the initial fetch returned would flip the gate prematurely, the auto-expand would never fire, and a followup loadInstalled racing with the optimistic write could overwrite our entry with [] mid-restart. 3. Don't force `supported_on_runtime: true` on the optimistic record. The "inert on this runtime" badge in the row renders on the value `=== false`. Forcing true would hide the badge for 15s if the user installed a plugin that doesn't actually support the workspace's runtime; the real value lands at refetch. Leaving the field undefined keeps the badge neutral until reconciliation arrives. Plus a behavioral test (SkillsTab.install.test.tsx) that asserts: - the install POST URL contains the workspaceId (not "undefined") - the row's "Install" button is replaced by the green "Installed" tag synchronously after POST resolves, without advancing any timer — locks in the optimistic-update contract so a future refactor can't silently regress it. 995 canvas tests pass (2 new); tsc clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 22:47:46 -07:00
Hongming Wang	deedb5eff6	fix(canvas): optimistic plugin install so the UI flips to "Installed" instantly After clicking Install, the button reverted from "Installing..." → "Install" the moment the POST returned, then sat there for ~15s before the green "Installed" tag appeared. The 15s gap is PLUGIN_RELOAD_DELAY_MS — we delay the GET /workspaces/:id/plugins refetch to wait for the workspace to restart (the listing handler returns [] while the container is restarting because findRunningContainer comes up empty). Uninstall already does optimistic local-state mutation (line 244 prior to this commit) so the green tag → install button transition is instant. Install was the inconsistent half — push the registry entry into `installed` immediately after POST returns 200 and let the delayed refetch reconcile. The optimistic record uses the registry entry's metadata (name, version, description, tags, runtimes, skills) and sets supported_on_runtime=true. If reconciliation later disagrees (server filter, install actually failed at the runtime layer), the refetch overwrites the local record. Worst case is a brief 15s window where we show "Installed" for a plugin that won't load — same window the user previously experienced as "stuck on Install button" — but flipped to the correct expected state. Custom-source installs (github://, etc.) don't have a registry entry to use, so they keep the old behavior of waiting for the refetch. Most users install from the registry list in the UI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 22:41:51 -07:00
Hongming Wang	9a785e9c32	ci(canary): inject E2E_OPENAI_API_KEY so A2A turn doesn't 500 The canary workflow has been failing for ~30 consecutive runs (issue #1500, opened 2026-04-21) on the same line: [hermes-agent error 500] No LLM provider configured. Run `hermes model` to select a provider, or run `hermes setup` for first-time configuration. Root cause: the canary's env block was missing E2E_OPENAI_API_KEY. Without it, tests/e2e/test_staging_full_saas.sh provisions the workspace with empty secrets; template-hermes start.sh seeds ~/.hermes/.env with no provider keys; derive-provider.sh resolves the model slug `openai/gpt-4o` to PROVIDER=openrouter (hermes has no native openai provider in its registry); A2A request at step 8/11 fails with the "No LLM provider configured" error from hermes-agent. The full-lifecycle workflow (e2e-staging-saas.yml line 84) carries the same secret correctly. Mirror its pattern + add a fail-fast preflight so future regressions surface in <5s instead of after 8 min of provision-then-die. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 22:37:13 -07:00
Hongming Wang	176b703dbc	fix(canvas): plugin install POSTed to /workspaces/undefined/plugins SkillsTab read `data.id` from its props and used the value to build two API URLs: POST /workspaces/${data.id}/plugins DELETE /workspaces/${data.id}/plugins/${pluginName} But `data` is the React Flow node.data blob (WorkspaceNodeData) — the workspace id lives on `node.id`, NOT on `node.data`. WorkspaceNodeData extends `Record<string, unknown>`, which makes `data.id` type-check silently as `unknown` instead of erroring. So every install/uninstall hit `/workspaces/undefined/plugins`, the server's not-found path returned 503 "workspace container not running" (misleading — the real issue was the bogus URL), and the user got a confusing toast. Every other tab in SidePanel takes `workspaceId={selectedNodeId}` as an explicit prop. SkillsTab was the lone outlier, presumably because "data has all the fields I need" is the obvious-looking shortcut that TypeScript can't catch through the index-signature interface. Fix: make `workspaceId` an explicit prop on SkillsTab, drop the `data.id` reads, thread the prop from SidePanel like the other tabs. Test fixture updated to pass it. Verified: 993 canvas tests pass; tsc clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 22:36:35 -07:00
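The index-signature pitfall described in 176b703dbc can be reproduced in a few lines. This is a minimal sketch, not the SkillsTab code: the `WorkspaceNode` shape, the `name` field, and both helper functions are hypothetical names chosen for illustration. Only `WorkspaceNodeData extends Record<string, unknown>` and the URL shape come from the commit message.

```typescript
// Hypothetical shapes for illustration; the real WorkspaceNodeData has more fields.
interface WorkspaceNodeData extends Record<string, unknown> {
  name: string;
}

interface WorkspaceNode {
  id: string; // the workspace id lives on the node, not on node.data
  data: WorkspaceNodeData;
}

// Buggy shape: `data.id` type-checks (as `unknown`) because of the index
// signature, but no such key exists at runtime, so it interpolates "undefined".
function pluginsUrlFromData(data: WorkspaceNodeData): string {
  return `/workspaces/${data.id}/plugins`;
}

// Fixed shape: the id is an explicit, typed argument the caller must supply.
function pluginsUrl(workspaceId: string): string {
  return `/workspaces/${workspaceId}/plugins`;
}

const node: WorkspaceNode = { id: "ws-123", data: { name: "demo" } };

console.log(pluginsUrlFromData(node.data)); // "/workspaces/undefined/plugins"
console.log(pluginsUrl(node.id)); // "/workspaces/ws-123/plugins"
```

Making the id a required `string` parameter moves the failure to compile time: a caller that forgets to pass it no longer type-checks.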
Hongming Wang	ee429cfee7	fix(canvas,dotenv): review-driven hardening of fit gate + parser parity Independent code review surfaced two required documentation fixes and one growth-correctness gap. All addressed here. Auto-fit gate (useCanvasViewport): The previous "subtree-grew-by-count" check missed the delete-then-add case: subtree of 6 → delete one → 5 → a different child arrives → 6 again. A length-only comparison reads no growth and the fit is skipped, leaving the new node off-screen. Switched to an id-set membership snapshot so any brand-new id forces the fit even when the count is unchanged. The gate logic is now extracted as a pure exported function `shouldFitGrowing(currentIds, prevIds, userPannedAt, lastAutoFitAt)` so the regression-prone decision can be unit-tested in isolation without standing up React Flow + DOM event refs. 8 cases cover: first-fit, empty-prior, brand-new id, status-update with user pan, no-pan-ever, pan-before-last-fit, delete-then-add same length, and shrink-only with user pan. Parser parity (dotenv.go + next.config.ts): Existing-env semantics were undocumented in both parsers. Both now explicitly note that an explicitly-set empty string (`KEY=` from the parent shell) counts as "set" — the file value does NOT backfill — matching the Go (os.LookupEnv) and Node (`process.env[k] !== undefined`) primitives. `export ` prefix uses a literal space; `export\tFOO=bar` is intentionally rejected. Added the same comment in both parsers to lock in this parity invariant since the commit message claims "if one parser changes, the other has to." Skipped (per analysis): - Drag-pan respect for left-click drag-pan during deploy. The growth-check safety net means any pan gets overridden on the next arrival anyway, which is the desired behavior for the "watch the org deploy" use case. After deploy completes, no more fit-deploying-org events fire so drag-pan works freely. - Map cleanup for lastFitSubtreeIdsRef. 
Per-tab session, UUID keys, tiny entries — not worth the cleanup hook. 993 canvas tests pass (8 new); Go dotenv tests pass; tsc clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 22:23:51 -07:00
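The id-set gate from ee429cfee7 can be sketched as a pure function. The name and parameter order follow the signature quoted in the commit message; everything else is an assumption: the parameter types, the use of `undefined` for "no prior fit", `0` for "never panned", and the exact comparison used for the pan check.

```typescript
// Sketch under stated assumptions; not the useCanvasViewport implementation.
// Timestamps are assumed to be milliseconds, with 0 meaning "never happened".
function shouldFitGrowing(
  currentIds: readonly string[],
  prevIds: ReadonlySet<string> | undefined,
  userPannedAt: number,
  lastAutoFitAt: number,
): boolean {
  // First fit for this root: nothing to compare against, so always fit.
  if (prevIds === undefined) return true;

  // Any brand-new id forces the fit, even when the count is unchanged
  // (the delete-then-add case a length-only comparison misses).
  if (currentIds.some((id) => !prevIds.has(id))) return true;

  // No growth: respect a user pan that happened after the last auto-fit.
  return userPannedAt <= lastAutoFitAt;
}

// Delete-then-add: same length, one new id, user panned after the last fit.
console.log(shouldFitGrowing(["a", "b", "d"], new Set(["a", "b", "c"]), 100, 50)); // true
// Status update only, user panned after the last fit.
console.log(shouldFitGrowing(["a", "b"], new Set(["a", "b"]), 100, 50)); // false
```

Because the function takes plain values rather than refs, each of the cases listed in the commit message reduces to a one-line call.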
Hongming Wang	e900a773ac	fix(canvas): keep tracking org bounds during deploy after first fit Symptom: org import zoomed to fit the parent + first child, then froze at that framing while the remaining children kept materialising off-screen. The user had to manually pan/zoom to see the new arrivals. Two stacked bugs in useCanvasViewport's deploy-time auto-fit: 1. The user-pan-respect gate stamps userPannedAtRef on EVERY pointerdown that lands inside .react-flow__pane. That fires for ordinary clicks (deselect, click-near-a-card, modal-close-bubble from the import dialog) — not just for actual pan gestures. One accidental pre-import click was enough to lock out every fit for the rest of the deploy. Wheel is the canonical unambiguous pan/zoom signal; drop pointerdown. 2. Even with a real pan during deploy, when more children land the org's bounds grow and the user has lost context — the new arrivals are off-screen and the deploy is the primary thing they want to watch right now. The guard had no growth awareness, so one pan cancelled all follow-up fits unconditionally. Now we track the subtree size at the last fit (per root), and if the current subtree is larger we force the fit through regardless of the user-pan timestamp. When the subtree size hasn't changed (status updates on already-positioned nodes), the user-pan respect still applies — so post-deploy exploration isn't yanked back. The Map keyed by root id supports back-to-back imports of different orgs without one's growth count blocking the other's first fit. 985 canvas tests pass; tsc clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 21:37:54 -07:00
Hongming Wang	ec7ecd5461	fix(canvas): load monorepo .env in next.config so WS connects in dev Symptom: spawn animation missing on org import. Workspaces appeared in their final positions all at once instead of materialising one-by-one. Root cause: the WS pill said "Reconnecting" forever because the canvas was trying to connect to ws://localhost:3000/ws — its own port, where Next.js dev doesn't serve a WebSocket — instead of the platform's ws://localhost:8080/ws. Why: deriveWsBaseUrl() falls back to window.location when NEXT_PUBLIC_WS_URL is unset. Next.js auto-loads .env from the project root only — and the canonical NEXT_PUBLIC_WS_URL / NEXT_PUBLIC_PLATFORM_URL live in the monorepo root .env, alongside the Go platform's MOLECULE_ENV / DATABASE_URL. Without an extra canvas/.env.local copy (which would still be a per-developer manual step), the canvas dev server starts blind to those vars. Fix: next.config.ts now walks upward from __dirname looking for the monorepo root (same workspace-server/go.mod sentinel the platform's dotenv loader uses) and merges the root .env into process.env BEFORE Next.js compiles. Existing env wins over file values, so docker runs / CI / explicit exports still dominate. The parser is a TypeScript mirror of workspace-server/cmd/server/dotenv.go's parseDotEnvLine — same rules (export prefix, quotes, inline comments, BOM) so a single .env line behaves identically across both processes. If one parser changes, the other has to. Production unaffected: `output: "standalone"` bakes resolved env into the build, the workspace-server sentinel isn't shipped in deploy artifacts, and the existing-env-wins rule means container env dominates anywhere this file is consulted at runtime. 
Verified: canvas dev startup log now shows "[next.config] loaded 49 vars from /Users/.../molecule-core/.env"; served bundle has the correct ws://localhost:8080/ws URL; WS pill flips to "Connected" after a hard refresh and per-workspace spawn animations fire on the next org import as expected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 21:29:05 -07:00
Hongming Wang	4014513b94	fix(dotenv): empty value with inline comment was returning the comment The repo's own .env contains lines like CONFIGS_DIR= # Path to workspace-configs-templates/... where the value is empty + an inline comment. The pre-fix parser: 1. v = " # Path to ..." 2. TrimLeft → "# Path to ..." 3. Inline-comment loop looked for " #" or "\t#" — neither matches because the leading whitespace is gone. 4. Returned the comment text as the value. Result: os.Setenv("CONFIGS_DIR", "# Path to ...") clobbered the auto-discovery fallback. The TemplatesHandler then opened the comment as a directory, ReadDir errored silently, and GET /templates returned []. Canvas's Templates panel showed "No templates found in workspace-configs-templates/" even though 8 valid templates existed on disk. Fix: strip leading whitespace from the value FIRST, then run a position-aware comment scan that treats `#` as a comment marker iff it's at the start of the (trimmed) value or preceded by whitespace. A bare `#` mid-value (e.g. `KEY=token#fragment`) still survives. Quoted-value handling moved above the comment scan so `KEY="value # not"` keeps the `#` as part of the value — pulled the quote-detection into the same TrimLeft-then-check shape as the bare path. The unterminated-quote case still falls through to bare-value handling. Three regression tests added covering the exact .env line that broke (`CONFIGS_DIR= # ...`), spaces-only with comment, and tab-only with comment. Verified end-to-end: GET /templates now returns all 8 templates. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 21:17:21 -07:00
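The line-parsing rules stated in 4014513b94, together with the export-prefix, quote, and BOM rules from 9a223afba1, can be sketched in TypeScript as follows. This is an illustrative reconstruction from the commit messages only, not the contents of dotenv.go or next.config.ts: the `EnvPair` type, the null-for-skip convention, the whitespace-in-key rejection, and the choice to stop at the first closing quote are all assumptions.

```typescript
// Sketch only: a hypothetical rendering of the rules described in this log.
type EnvPair = { key: string; value: string };

function parseDotEnvLine(raw: string): EnvPair | null {
  // Strip a UTF-8 BOM and a trailing CR (CRLF files), then outer whitespace.
  let line = raw.replace(/^\uFEFF/, "").replace(/\r$/, "").trim();
  if (line === "" || line.charAt(0) === "#") return null;

  // `export ` must use a literal space; `export\tFOO=bar` is rejected below.
  if (/^export /.test(line)) line = line.slice("export ".length);

  const eq = line.indexOf("=");
  if (eq <= 0) return null;
  const key = line.slice(0, eq).trim();
  if (/\s/.test(key)) return null; // assumed: keys with embedded whitespace are junk

  // Strip leading whitespace from the value FIRST.
  const value = line.slice(eq + 1).replace(/^[ \t]+/, "");

  // Quoted value: unwrap one matched pair; `#` inside is part of the value.
  const quote = value.charAt(0);
  if (quote === '"' || quote === "'") {
    const close = value.indexOf(quote, 1);
    if (close > 0) return { key, value: value.slice(1, close) };
    // Unterminated quote: fall through to bare-value handling.
  }

  // Bare value: `#` starts a comment only at position 0 or after whitespace.
  for (let i = 0; i < value.length; i++) {
    const prev = i === 0 ? " " : value.charAt(i - 1);
    if (value.charAt(i) === "#" && (prev === " " || prev === "\t")) {
      return { key, value: value.slice(0, i).replace(/[ \t]+$/, "") };
    }
  }
  return { key, value: value.replace(/[ \t]+$/, "") };
}

console.log(parseDotEnvLine("CONFIGS_DIR= # Path to templates")); // value is ""
console.log(parseDotEnvLine("KEY=token#fragment")); // value is "token#fragment"
console.log(parseDotEnvLine('KEY="value # not"')); // value is "value # not"
```

Treating position 0 as "preceded by whitespace" is what makes the empty-value-plus-comment line resolve to an empty string instead of the comment text.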
Hongming Wang	9a223afba1	fix(dotenv,socket): review-driven hardening of .env loader + WS poll Independent code review surfaced three required fixes and one cheap optional one. All addressed here. dotenv parser: - `export FOO=bar` was parsed as key `"export FOO"` (with embedded space) and silently os.Setenv'd, so a developer pasting from a direnv `.envrc` would get junk vars. Now strips the prefix. - Quoted values weren't unwrapped: `FOO="hello world"` produced value `"hello world"` with literal quotes. Now strips one matched pair of surrounding `"` or `'`. Inside a quoted value `#` is part of the value, not a comment marker (matches godotenv convention). - UTF-8 BOM at file start (Windows editors) would have produced a first key like U+FEFF + "FOO". Now stripped via TrimPrefix. dotenv loader: - findDotEnv()'s upward walk would happily pick up `~/.env` or a sibling-repo `.env` if the binary was run from `~/Documents/other-project/`. Real foot-gun on shared dev boxes. Now gated on a monorepo sentinel: the candidate directory must contain `workspace-server/go.mod`. Falls through to "no .env found" (= pre-fix behavior) when the sentinel is absent. socket fallback poll: - startFallbackPoll() previously fired only on onclose, so the very first connect attempt — when onclose hasn't fired yet because we never had a successful onopen — left the canvas with no HTTP poll for the duration of the failing handshake (Chrome can hold a SYN-SENT WebSocket open ~75s before giving up). Now also called at the top of connect(); the timer-already-running guard makes it a no-op when one cycle later onclose calls it again. Test coverage added: export prefix, single+double quoted values, hash inside quotes preserved, unterminated quote falls back to bare value, CRLF stripping locked in, BOM stripping, and a sentinel-rejection regression test that creates a temp .env with no workspace-server sibling and asserts findDotEnv refuses to load it. 
Verified: 985 canvas tests + 30 dotenv subtests + 4 dotenv integration tests all pass; tsc clean; rebuilt platform from monorepo root with stripped env still loads .env (49 vars) and /workspaces returns 200. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 21:09:18 -07:00
Hongming Wang	21db85d691	fix(canvas): cascade delete locally so children disappear without WS Deleting a parent on a wedged WS used to leave the child cards on the canvas as orphaned roots until the user manually refreshed. Why: Canvas.tsx and DetailsTab.tsx both called `removeNode(parentId)` after `DELETE /workspaces/:id?confirm=true` returned 200. `removeNode` deliberately re-parents children rather than cascading — it relies on the per-descendant WORKSPACE_REMOVED WS events the platform emits as part of the cascade to drop each child individually. When the WS is unhealthy those events never arrive, so the local store keeps the children alive (now re-parented to root since their actual parent is gone). Fix: new `removeSubtree(rootId)` action on the canvas store mirrors the server-side cascade — drops the root + every descendant + every incident edge in one atomic set(). Both delete call sites now use it. The WS events still arrive when WS is healthy and become idempotent no-ops because the nodes are already gone. Why a new action instead of changing removeNode: removeNode's re-parenting behavior is correct for non-cascading flows (drag-out, manual node detach in the future). Adding a sibling action keeps both call shapes available rather than forcing every caller to opt out of cascade. 6 new unit tests cover root cascade, mid-level cascade, leaf no-op-cascade, selection clearing across the subtree, selection preservation outside the subtree, and edge cleanup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 20:51:09 -07:00
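The cascade described in 21db85d691 (drop the root, every descendant, and every incident edge in one update) can be sketched as a pure function over a simplified store state. The `CanvasNode`, `CanvasEdge`, and `CanvasState` shapes are hypothetical, as is passing the state in as an argument; the real action lives on the canvas store and its state shape is not shown in this log.

```typescript
// Hypothetical, simplified store shapes for illustration only.
interface CanvasNode { id: string; parentId?: string }
interface CanvasEdge { id: string; source: string; target: string }
interface CanvasState {
  nodes: CanvasNode[];
  edges: CanvasEdge[];
  selectedNodeId: string | null;
}

function removeSubtree(state: CanvasState, rootId: string): CanvasState {
  // Collect the root plus every descendant. The loop repeats until no new
  // ids are found, so it does not depend on nodes being ordered parent-first.
  const doomed = new Set<string>([rootId]);
  let grew = true;
  while (grew) {
    grew = false;
    for (const n of state.nodes) {
      if (n.parentId !== undefined && doomed.has(n.parentId) && !doomed.has(n.id)) {
        doomed.add(n.id);
        grew = true;
      }
    }
  }

  const selected = state.selectedNodeId;
  return {
    nodes: state.nodes.filter((n) => !doomed.has(n.id)),
    // Drop every edge that touches a removed node.
    edges: state.edges.filter((e) => !doomed.has(e.source) && !doomed.has(e.target)),
    // Clear the selection only if it pointed inside the removed subtree.
    selectedNodeId: selected !== null && doomed.has(selected) ? null : selected,
  };
}
```

Returning a whole new state object mirrors the "one atomic set()" described in the commit: nodes, edges, and selection change together, and a later WORKSPACE_REMOVED event for an already-removed id has nothing left to do.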
Hongming Wang	e58ecf2974	fix(e2e): scrollIntoView before toBeVisible — clipped tabs were "missing" Seventh E2E bug, surfaced after the AuthGate mock from the previous commit finally let the harness reach the tab-iteration loop: Error: tab-skills button missing — TABS list may have drifted Locator: locator('#tab-skills') The TABS bar in SidePanel is `overflow-x-auto` (intentional — there are 13 tabs and they don't all fit on smaller viewports; the right-edge fade gradient signals the overflow). Tabs after position ~3 are clipped, and Playwright's `toBeVisible()` returns false for clipped elements (it checks getBoundingClientRect against viewport). Fix: `scrollIntoViewIfNeeded()` before the visibility assertion, mirroring what SidePanel's own keyboard handler does on arrow-key navigation. The tab is then in view and `toBeVisible()` passes. This was the test's 7th and (probably) final harness bug. The chain mapping all the way from "staging E2E timed out at 1200s" this morning: 1. instance_status field name (#2066) 2. staging.moleculesai.app DNS zone (#2066) 3. X-Molecule-Org-Id TenantGuard header (#2066) 4. Hydration selector waited pre-click (#2066) 5. networkidle never settles (this PR's parent commits) 6. AuthGate /cp/auth/me redirect 7. Tab buttons clipped by overflow-x-auto If THIS run still fails, the failure surfaces in actual product behavior (a tab's panel content), not test mechanics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 20:37:36 -07:00
Hongming Wang	f8c900909e	fix(platform): auto-load .env from CWD on startup Local dev runs (`/tmp/molecule-server` after `go build`) used to 401 on /workspaces the moment the DB had any workspace token in it: the binary inherited a bare shell env with no MOLECULE_ENV, so AdminAuth's dev fail-open branch (gated on MOLECULE_ENV=development) didn't fire. The repo's .env already has MOLECULE_ENV=development plus DATABASE_URL, REDIS_URL, ADMIN_TOKEN=, etc. Until now you had to `set -a && source .env` in the launching shell — a paper cut, but worse, it's a paper cut in EVERY automated dev workflow (IDE run configs, integration test harnesses, the smoke-test loop in this branch's manual testing). Fix: cmd/server now walks upward from CWD looking for a .env (capped at 6 levels) and merges KEY=VALUE pairs into os.Environ before any other code reads env. Already-set vars win over file values, so docker run -e / CI exports / `KEY=val ./binary` still dominate — only unset keys get filled in. Why no godotenv dep: the format we use is plain KEY=VALUE with `#` comments, no interpolation, no quoting (verified against the live .env: 49 kv lines, zero references to ${...} or `export`). A 30-line parser is auditable and avoids supply-chain surface. Why it's safe in production: Dockerfile doesn't COPY .env into the image and .env is gitignored, so prod containers have no .env on disk to load — the function's findDotEnv() loop finds nothing and returns silently. If an operator deliberately drops one in, the existing-env-wins rule means container-injected env still dominates. Verified by booting `env -i HOME=$HOME PATH=$PATH /tmp/molecule-server` from the repo root with a stripped env: log shows ".env: /Users/.../molecule-core/.env — loaded 49, 0 already set" and /workspaces returns 200 instead of 401. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 20:33:28 -07:00
Hongming Wang	0b4dfbd121	fix(canvas): suppress stale provisioning banners + add WS-down HTTP fallback poll Two related fixes for the case where the canvas thinks workspaces are stuck provisioning when they're actually online: 1. ProvisioningTimeout banners now gate on wsStatus === "connected". While the WS is in connecting/disconnected state, the local "provisioning" status reflects the last event received before the drop — workspaces may have transitioned to online minutes ago. The 8m timeout was firing against frozen state and showing a wall of yellow warnings on already-online workspaces. 2. Socket layer now starts a 10s rehydrate poll when the WS goes unhealthy (onclose) and stops it on onopen/disconnect. The reconnect attempts continue in parallel; whichever recovers first wins. rehydrate()'s existing dedup gate prevents the open-time rehydrate from racing with a fallback poll. Without this the store could stay frozen for minutes while WS exponential backoff chewed through retries. Plus the previously-uncommitted TemplatePalette flushSync change so the import modal unmounts synchronously before doImport runs (otherwise React batches the close with the import's setState prefix and the modal backdrop hides the spawn animation). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 20:22:15 -07:00
Hongming Wang	6c70b413e0	fix(e2e): mock /cp/auth/me — AuthGate redirect was preventing canvas render Sixth E2E bug, surfaced after the page.goto-domcontentloaded fix finally let the navigation complete. The harness now reaches the canvas-root selector wait but still times out because the canvas never renders: TimeoutError: page.waitForSelector: Timeout 45000ms exceeded. waiting for [aria-label="Molecule AI workspace canvas"] Root cause: canvas/src/components/AuthGate.tsx wraps the page, fetches /cp/auth/me on mount, and redirects to the login page when the response is 401. The bearer header we set via context.setExtraHTTPHeaders works for platform API calls but does NOT satisfy /cp/auth/me — that endpoint is cookie-based (WorkOS session). So: 1. AuthGate mounts 2. Calls fetchSession() → /cp/auth/me → 401 (no session cookie) 3. AuthGate transitions to anonymous → redirectToLogin() 4. Browser navigates away from tenant URL 5. The React Flow canvas root with the aria-label never mounts 6. waitForSelector times out at 45s Fix: context.route() intercepts /cp/auth/me and returns a fake Session JSON so AuthGate resolves to "authenticated" and renders its children. The session contents are cosmetic — Session.org_id and Session.user_id appear in a few canvas surfaces but never fail on dummy values. This is the cleanest fix path. Alternatives considered + rejected: - Add a ?e2e=1 backdoor to AuthGate: production code shouldn't have a "skip auth" flag, even gated. - Real WorkOS login flow in Playwright: too much overhead per run. - Skip the canvas UI test, test only API: defeats the point of the staging E2E (which is to catch UI regressions before promotion). After this lands the harness should reach the workspace-node click step and exercise tabs — only then can a real product bug (rather than a test-harness bug) surface. The 6-bug chain mapped to: 1. instance_status field name (#2066) 2. staging.moleculesai.app DNS zone (#2066) 3. 
X-Molecule-Org-Id TenantGuard header (#2066) 4. Hydration selector waited pre-click (#2066) 5. networkidle never settles (this commit's parent) 6. AuthGate /cp/auth/me redirect (this commit) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 19:59:04 -07:00
Hongming Wang	1d71b4e9e5	fix(canvas): bundle of UX hardening — modals, position stability, error UX, paste Single-themed bundle of fixes accumulated while polishing the canvas chat / agent-comms / plugins / position flows. Each piece is small; the connective tissue is "things observable from the canvas right panel and the org-deploy flow that surprised real users". UI / composer - Legend: add close X + persisted-localStorage state + reopener pill; default open for first-time users. - SidePanel: rename "Skills" tab label → "Plugins" (single-line; internal panelTab enum value, component name, and store keys unchanged). - SkillsTab: registry tri-state UI (loading / error / empty) with actionable Retry button + 10s explicit fetch timeout. Handle AbortSignal.timeout's DOMException by name (TimeoutError / AbortError) — Chromium's "signal timed out" message wouldn't match the prior naive /timeout/ regex. Reset mountedRef on every mount: pre-existing StrictMode dev-mode bug where cleanup-only `current = false` was never re-set, permanently wedging every `if (mountedRef.current) setX(...)` guard and producing a "Loading…" panel that never resolved on hard refresh. - ChatTab: paste-image-from-clipboard via onPaste handler; unique monotonic-counter filenames so same-second pastes don't collide on name+size dedup. mime→ext map avoids `image/svg+xml`-style raw extensions on synthesised filenames. Bypasses the DataTransfer constructor so Safari < 14.1 / older Edge work. - ChatTab: drop stuck error toast when the WS path already delivered the agent reply but the HTTP path errored late (sendingFromAPIRef gate now covers the .catch() handler). - ChatTab: filter heartbeat-style internal self-messages from the My Chat tab so historical rows with source_id=NULL don't surface as user-typed input. - Modal portals: OrgImportPreflightModal + MissingKeysModal (ProviderPickerModal + AllKeysModal) now createPortal to document.body and clamp max-h to 80vh. 
Escapes the ancestor containing block (TemplatePalette's fixed+filtered sidebar re-anchored descendants' position:fixed to itself, hiding modals behind workspace cards). MissingKeysModal bumped to z-[60] for stack ordering when both modals are open. - OrgImportPreflightModal saveOne: ref-based microtask-safe in-flight gate replaces the brittle "set startValue inside a setState updater and read on the next line" pattern (React 18 doesn't guarantee functional updaters run synchronously; that path strands `saving:true` and never calls createSecret). Same useRef pattern guards SkillsTab.loadRegistry against concurrent fires and Fast-Refresh-stranded promises; force=true parameter on retry click bypasses the gate. Agent comms - AgentCommsPanel: derive UI-facing `flow` field instead of using activity_type-derived direction. Self-logged a2a_receive rows (source_id == workspace_id, what the agent runtime writes to log its own outbound delegation replies) now correctly render as OUTBOUND with → arrow + right-justified bubble. Previously they rendered "← From Self" with Restart pointing at THIS workspace. - AgentCommsPanel: error rows replace the unactionable "X failed [A2A_ERROR]" body with banner + underlying-error code-block + cause-hint (matched on Claude Code SDK init wedge, deadline-exceeded, agent-thrown exception, empty-error) + Restart [peer] / Open [peer] action buttons. - AgentCommsPanel: render text bodies through ReactMarkdown + remark-gfm so multi-part replies (tables, code) render properly. Multi-part text extractor - extractReplyText (live A2A response in ChatTab) and extractResponseText (chat history loader in message-parser): now COLLECT from every source — top-level parts, parts.root.text, and artifacts — joined with "\n". Previous "first source wins" silently dropped multi-part replies (Hermes summary+detail, Claude Code long-form table). Tests cover joined-from-parts, joined-from-artifacts, joined-from-both. 
Position stability - canvas-topology.buildNodesAndEdges: auto-rescue heuristic now accepts currentParentSizes map; uses max(initial min, currently grown) for the bbox check. Fixes "child jumps to weird location after 30s" — the periodic socket health-check rehydrate (silenceSec > 30) was rebuilding nodes from scratch, and the rescue's reliance on grid-derived initial size false-flagged children the user dragged into the user-grown area. - canvas.hydrate: pass live measured dimensions from the existing store into buildNodesAndEdges. - socket.RehydrateDedup: pure exported helper class that gates rehydrate calls. Two states — in-flight (in-flight Promise reused by concurrent callers) + post-completion window (1.5s, returns Promise.resolve()). Initialised with -Infinity so first call always passes the gate. Wired into ReconnectingSocket.rehydrate. A2A edges - New A2AEdge custom React Flow edge component portals its label out of the SVG layer via EdgeLabelRenderer so labels (a) render above workspace cards instead of being hidden behind them and (b) accept clicks. Click selects source + switches panel to Activity, but only on a NEW selection (preserves current tab on re-click of an already-selected source). - buildA2AEdges output tagged type:"a2a"; edgeTypes wired in Canvas.tsx. Tests - 14 new vitest cases across 4 files (964 → 978 passing): OrgImportPreflightModal saveOne single-fire / double-click, any-of rendering; AgentCommsPanel toCommMessage flow derivation in all four shapes; canvas-topology rescue respects-grown / rescues-genuine-drift / fallback-without-live-size; socket RehydrateDedup gate behaviour; message-parser multi-part response extraction. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 19:54:43 -07:00
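The two-state gate that 1d71b4e9e5 describes for socket.RehydrateDedup (an in-flight Promise reused by concurrent callers, then a 1.5s post-completion window, initialised with -Infinity) can be sketched as below. The class name comes from the commit message; the `run` method, the injectable clock, and the constructor parameters are assumptions made so the sketch can be exercised without real timers.

```typescript
// Sketch under stated assumptions; not the class shipped in the socket module.
class RehydrateDedup {
  private inFlight: Promise<void> | null = null;
  // -Infinity so the very first call always passes the window check.
  private lastCompletedAt = -Infinity;
  private windowMs: number;
  private now: () => number;

  constructor(windowMs: number = 1500, now: () => number = () => Date.now()) {
    this.windowMs = windowMs;
    this.now = now;
  }

  run(fn: () => Promise<void>): Promise<void> {
    // State 1: a call is in flight, so concurrent callers share its Promise.
    if (this.inFlight !== null) return this.inFlight;
    // State 2: a call completed recently, so skip and resolve immediately.
    if (this.now() - this.lastCompletedAt < this.windowMs) return Promise.resolve();

    const settle = (): void => {
      this.lastCompletedAt = this.now();
      this.inFlight = null;
    };
    const p: Promise<void> = fn().then(
      () => settle(),
      (err: unknown) => {
        settle();
        throw err; // keep the failure visible to every caller sharing `p`
      },
    );
    this.inFlight = p;
    return p;
  }
}
```

Injecting the clock keeps the gate a pure, unit-testable helper, in line with the commit's note that the gate behaviour is covered by its own vitest cases.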
Hongming Wang	65b531acf6	fix(workspace): tag self-originated A2A POSTs with X-Workspace-ID Workspace runtime fired four classes of A2A request to the platform without the X-Workspace-ID header that identifies the source workspace: heartbeat self-messages, initial_prompt, idle-loop fires, and peer-to-peer A2A from runtime tools. The platform's a2a_receive logger keys source_id off that header — without it, every such row was written with source_id=NULL, which the canvas's My Chat tab filters as ?source=canvas (i.e. "user typed this") and rendered the internal triggers as if the human user had sent them. The "Delegation results are ready..." heartbeat trigger was visible to end users in the chat history; delegate_task A2A calls between agents were misclassified the same way. Centralise the header construction in a new platform_auth helper self_source_headers(workspace_id) that returns auth_headers() PLUS {X-Workspace-ID: <id>}. Apply it to: - heartbeat.py self-message (refactored from inline header dict) - main.py initial_prompt POST - main.py idle_prompt POST - a2a_client.py send_a2a_message (peer A2A from runtime) - builtin_tools/a2a_tools.py delegate_task (was missing ALL headers) Tests: - test_heartbeat.py asserts the X-Workspace-ID header is set on the self-message POST. - test_a2a_tools_module.py asserts the same on delegate_task POSTs; FakeClient.post mocks updated to accept the headers kwarg. Production effect lands the moment workspace containers are rebuilt with this code; existing rows in activity_logs keep their NULL source_id (legacy data). The canvas-side filter (#follow-up) covers the historical-rows case until backfill. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 19:54:43 -07:00
Hongming Wang	c2504d9361	fix(e2e): page.goto waitUntil networkidle never settles — switch to domcontentloaded Fifth E2E bug surfaced by the previous run. After the four setup- phase fixes (instance_status, DNS zone, X-Molecule-Org-Id, hydration selector) plus CP#259 ending the pq cache class, the harness finally reached the actual page navigation step — and timed out there: TimeoutError: page.goto: Timeout 45000ms exceeded. navigating to "https://...staging.moleculesai.app/", waiting until "networkidle" `waitUntil: "networkidle"` waits for 500ms of network silence. The canvas keeps a WebSocket connection open + polls /events and /workspaces every few seconds for status updates, so the network is never idle — page.goto sits on it until the default 45s timeout and throws. Fix: switch to `waitUntil: "domcontentloaded"`. Returns as soon as the HTML is parsed. React hydration plus the existing `waitForSelector` line below is what actually gates ready-for- interaction; the goto's job is just to land on the page. This is a generally-applicable lesson — networkidle is broken for any SPA with a heartbeat. Notably, our existing canvas unit tests that mock @xyflow/react and don't open WebSockets DON'T hit this, which is why this only surfaces against staging. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 19:43:46 -07:00
Hongming Wang	59b5449a4e	chore: re-trigger CI — staging CP now has CP#259 SetMaxIdleConns(0) fix	2026-04-24 19:07:32 -07:00
Hongming Wang	01c417828d	chore: re-trigger CI — staging CP has SetMaxIdleConns(0) fix from CP#259	2026-04-24 19:06:18 -07:00
Hongming Wang	4e3bb3795a	fix(e2e): canvas-hydration wait used a selector that never appears pre-click Fourth E2E bug in the staging→main chain. The previous three (#2066 setup-phase fixes) let the harness reach the actual Playwright spec. This one is in staging-tabs.spec.ts itself. The spec at L78 waits 45s for one of: [role="tablist"], [data-testid="hydration-error"] Both targets are wrong: 1. [role="tablist"] only appears AFTER the workspace node is clicked (which happens 25 lines later at L100). Waiting for it BEFORE the click can never resolve, so the wait always times out at 45s regardless of whether the canvas actually loaded. 2. [data-testid="hydration-error"] doesn't exist anywhere in the canvas. The error banner at app/page.tsx:62 only had role="alert" — which collides with toast notifications and other alert-type elements, so a more-specific selector was never wired. Two-part fix: - Test waits on `[aria-label="Molecule AI workspace canvas"]` instead — that's the React Flow wrapper (Canvas.tsx:150), always present once hydrated regardless of workspace count or selection state. Hydration-error banner remains the secondary OR target for the failure path. - app/page.tsx hydration-error banner gets the missing `data-testid="hydration-error"` attribute. role="alert" stays for accessibility; the testid is for programmatic detection without conflict. After this lands, the staging-tabs spec should advance past the initial wait, click the workspace node, and exercise each tab. If a tab fails, we get a proper test failure rather than a 45s timeout that obscures everything. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 18:38:28 -07:00
Hongming Wang	4fdeabdbe0	fix(e2e): send X-Molecule-Org-Id header; TenantGuard 404s without it Third E2E bug in the staging→main chain, found while debugging the `Workspace create 404` failure that surfaced after the previous two E2E fixes (instance_status, staging.moleculesai.app DNS). Root cause: workspace-server's `middleware/TenantGuard` middleware returns 404 (not 401/403, intentionally; see comment in `tenant_guard.go`: "must not be inferable by probing other orgs' machines") when a request to the tenant origin carries none of: - X-Molecule-Org-Id header matching MOLECULE_ORG_ID env on the tenant - Fly-Replay-Src state from the CP router (production browser path) - Same-origin Canvas (Referer == Host) The E2E was a direct GitHub-Actions curl with none of these, so every non-allowlisted route 404'd with the platform's ratelimit headers but none of the security headers, which made it look like a missing route in the platform. The org UUID is already on the admin-orgs row alongside instance_status, so capture it during the readiness poll and add it to the tenantAuth header bag. Both /workspaces (POST) and /workspaces/:id (GET) now carry it. Allowlist still contains /health, /metrics, /registry/register, /registry/heartbeat, so the TLS readiness step (which hits /health) keeps working without the header. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 18:13:13 -07:00
Hongming Wang	edcac16b81	fix(e2e): use staging.moleculesai.app for tenant DNS — wrong zone hung TLS poll Second related E2E bug, surfaced after #2066's instance_status fix let the harness reach the TLS readiness step: Error: tenant TLS: timed out after 180s The CP provisioner writes staging tenant DNS as <slug>.staging.moleculesai.app (with the staging. subdomain prefix — visible in the EC2 provisioner DNS log line). The harness was building https://<slug>.moleculesai.app (prod-zone shape), so DNS literally didn't resolve, fetch threw NXDOMAIN inside the silent catch, and waitFor saw null on every 5s poll until 180s elapsed. Fix: parameterize as STAGING_TENANT_DOMAIN env var, default staging.moleculesai.app. Doc-comment example updated to match. Override hatch is there only for ops running this harness against a non-default zone. Verified manually: a freshly-provisioned tenant (e2e-canvas-20260425-sav9fe) was unreachable at the prod-shaped URL (NXDOMAIN) but reached CF at the staging-shaped URL. teardown.ts only hits CP, not the tenant URL — no fix needed there. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 17:45:48 -07:00
Hongming Wang	754f361c03	fix(e2e): poll instance_status not status — waitFor never matched, masked real bugs Staging Canvas Playwright E2E has been timing out at 1200s on every recent run. Found via /code-review-and-quality on the staging→main promotion chain. The CP /cp/admin/orgs response shape is (handlers/admin.go:118): type adminOrgSummary struct { ... InstanceStatus string `json:"instance_status,omitempty"` ... } There is NO top-level `status` field. The waitFor predicate compared `row.status === "running"` against undefined on every poll — the predicate could never resolve truthy. The harness invariably wedged on the 20-min timeout regardless of whether the tenant was actually provisioned. This bug has been double-edged: - It MASKED the #242 pq-cache-collision class for hours: the tenants WERE provisioning fine, but the test couldn't tell. - It survived #255, #257 (real CP fixes) — the test still timed out, making us suspect more CP bugs that didn't exist. Fix: poll `row.instance_status` instead. One-line change. Identical fix for the failed-state branch one line below. No new tests for the harness itself; the fix's correctness is verified by the next E2E run on the affected branch passing end-to-end. If it doesn't pass after this, there's a separate bug we can hunt cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 17:32:12 -07:00
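The predicate bug in commit 754f361c03 is easy to reproduce in miniature. The sketch below is a Python rendering of the idea, not the harness code itself; the function names and the sample row are illustrative assumptions.

```python
from typing import Any, Mapping


def old_predicate(row: Mapping[str, Any]) -> bool:
    # The admin-orgs row has no top-level "status" key, so this is never true.
    return row.get("status") == "running"


def new_predicate(row: Mapping[str, Any]) -> bool:
    # Poll the field the response actually carries.
    return row.get("instance_status") == "running"
```

With a row such as `{"slug": "e2e", "instance_status": "running"}`, the old predicate stays false forever while the new one resolves on the first poll.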
Hongming Wang	560172968f	chore: re-trigger CI — staging CP has CP#257 orgs UPDATE fix now	2026-04-24 16:45:16 -07:00
Hongming Wang	a7eb071e35	feat(org-templates): add ux-ab-lab + manifest entry + schema smoke test Introduces the UX A/B Lab org template — a 7-agent cell for rapid landing-page variant generation. The template is also the first consumer of the new any_of env schema (ANTHROPIC_API_KEY OR CLAUDE_CODE_OAUTH_TOKEN), so it doubles as an end-to-end fixture for that feature. Canvas tree (all claude-code / sonnet): Design Director ├── UX Researcher ├── Visual Designer ├── React Engineer ├── Deploy Engineer ├── A11y + SEO Auditor ← WCAG AA + canonical/noindex gate └── Perf Auditor ← Core Web Vitals gate Template files live in their own standalone repo (Molecule-AI/molecule-ai-org-template-ux-ab-lab, to be published); this change adds the manifest.json entry so fresh clones + CI populate the template via scripts/clone-manifest.sh. Tests: - TestOrgTemplate_ClaudeAnyOfAuthPreflight — parses the exact required_env / recommended_env shape the template ships with via inline YAML (not on-disk, since org-templates/ is gitignored in this monorepo) and verifies either member alternative satisfies the preflight. SEO safety built into the auditor's system prompt: - One canonical variant; all others canonicalise to it. - noindex, follow on non-canonical variants. - Sitemap contains only the canonical URL. - No robots.txt disallow (blocked pages can't emit canonical). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 16:22:14 -07:00
Hongming Wang	ad73a56db1	feat(env-preflight): support any_of OR groups (e.g. API_KEY OR OAUTH_TOKEN) Extends the org-import env preflight so a template can declare an alternative: satisfy ANY one member to pass. Motivated by the Claude-family node case where either ANTHROPIC_API_KEY or CLAUDE_CODE_OAUTH_TOKEN unlocks the agent; forcing both was wrong. Server (workspace-server): - New EnvRequirement union type with custom YAML + JSON (un)marshaling. Accepts scalar (strict) or {any_of: [...]} in both on-disk org.yaml and inline POST /org/import bodies. - collectOrgEnv now returns []EnvRequirement. Dedups groups by sorted-member signature. "Strict wins" pruning drops any-of groups that mention a name already declared strictly (same tier and cross-tier). - Import preflight uses EnvRequirement.IsSatisfied: scalar = exact match, group = any member present. - Empty any_of: [] rejected at parse time (never-satisfiable). - 14 handler tests (6 updated for the union shape, 8 new covering any-of satisfaction, dedup, strict-dominates-group, cross-tier pruning, invalid-member filtering, YAML round-trip, and empty-any-of rejection). Canvas: - EnvRequirement = string | {any_of: string[]} with envReqMembers, envReqSatisfied, envReqKey helpers. - OrgImportPreflightModal renders strict rows and any-of groups via a new AnyOfEnvGroup sub-component: "Configure any one" banner, per-member input, ✓-satisfied indicator, and dimmed siblings once any member is configured so the user can still switch providers. - TemplatePalette.OrgTemplate.required_env / recommended_env retyped to EnvRequirement[]; passthrough to the modal unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 16:16:25 -07:00
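The satisfaction and pruning rules listed in commit ad73a56db1 can be illustrated with a small Python model. This is a sketch under assumptions, not the Go implementation: a strict requirement is modelled as a `str`, an any_of group as a `frozenset`, and the function names `is_satisfied` and `collect` are hypothetical. Tiers (required vs recommended) are left out.

```python
from typing import FrozenSet, Iterable, List, Set, Union

# str = strict requirement, frozenset = any_of group (modelling assumption)
EnvRequirement = Union[str, FrozenSet[str]]


def is_satisfied(req: EnvRequirement, configured: Set[str]) -> bool:
    """Strict needs an exact match; a group needs any one member present."""
    if isinstance(req, str):
        return req in configured
    return any(name in configured for name in req)


def collect(reqs: Iterable[EnvRequirement]) -> List[EnvRequirement]:
    """Dedup requirements and drop groups that mention a strictly declared name."""
    reqs = list(reqs)
    strict = {r for r in reqs if isinstance(r, str)}
    out: List[EnvRequirement] = sorted(strict)
    seen = set()
    for r in reqs:
        if isinstance(r, str):
            continue
        if not r:
            raise ValueError("empty any_of can never be satisfied")
        # "Strict wins": a group is redundant once one member is required outright.
        if r & strict or r in seen:
            continue
        seen.add(r)
        out.append(r)
    return out
```

Using a frozenset as the group identity gives the same effect as deduplicating on a sorted-member signature, since member order no longer matters.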
Hongming Wang	f995b90a85	test(canvas-events): expect both pan-to-node AND fit-deploying-org on NEW root provision Commit `5adc8a74` (part of this PR) intentionally made molecule:fit-deploying-org fire for root-level workspaces too — it used to only fire for children, which meant a standalone create didn't center the viewport until the first child arrived ~2s later. The existing regression test still expected ONLY the molecule:pan-to-node event for a new root, so it started failing with "expected length 1, got 2". The product behavior is correct (centering on the root immediately is better UX); the test was pinning the old single-dispatch shape. Fix: assert BOTH events fire, each with the right detail payload, so a future regression that drops either one (or duplicates) trips the test. Single-test update, no production code change. 953/953 canvas tests pass locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 15:55:52 -07:00
Hongming Wang	1e8b5e0167	feat(external-runtime): first-class BYO-compute workspaces + manifest-driven registry ## Problem Two issues the external-workspace path was silently dropping: 1. `knownRuntimes` was a hardcoded Go map that drifted from manifest.json — e.g. `gemini-cli` was in manifest but missing from the Go allowlist, so any workspace provisioning with runtime=gemini-cli got silently coerced to langgraph. 2. No end-to-end "bring your own compute" story. The canvas UI had no way to pick runtime=external; the partial backend code required the operator to already have a URL ready (chicken-and- egg with the agent that doesn't exist yet), and no workspace_auth _token was minted so the external agent couldn't authenticate its register call. ## Change ### Runtime registry driven by manifest.json - New `runtime_registry.go` reads `manifest.json` at service init. Each `workspace_templates[].name` becomes a runtime identifier (with the `-default` suffix stripped so `claude-code-default` and `claude-code` resolve to the same runtime). - `external` is always injected (no template repo exists for it). - Falls back to a static map on manifest load failure so tests / dev containers keep working. - 5 new tests including a real-manifest sanity check. ### First-class external workspace flow When `POST /workspaces` is called with `runtime: "external"` AND no URL supplied: 1. Workspace row inserted with `status='awaiting_agent'` (distinct from `provisioning` so canvas doesn't trip its provisioning-timeout UX). 2. A workspace_auth_token is minted via `wsauth.IssueToken`. 3. Response body includes a `connection` object with: - `workspace_id`, `platform_url`, `auth_token` - `registry_endpoint`, `heartbeat_endpoint` - `curl_register_template` — zero-dep one-shot register snippet - `python_snippet` — full SDK setup w/ heartbeat loop, paired with molecule-sdk-python PR #13's A2AServer 4. 
The platform URL is resolved from `EXTERNAL_PLATFORM_URL` env (ops-configurable per tenant) or falls back to request headers. The legacy `payload.External` + `payload.URL` path is preserved — org-import and other callers that already have a URL still work. ### Canvas UI - New "External agent (bring your own compute)" checkbox in CreateWorkspaceDialog. - When checked, template/model/hermes-provider fields are hidden and the POST body includes `runtime: "external"`. - New `ExternalConnectModal` component: shown once after create, renders Python / curl / raw-fields tabs with copy-to-clipboard buttons. Stays mounted as a sibling of the create dialog so the token survives the create dialog unmount. - `auth_token` is interpolated into the snippet client-side so the copied block is truly ready to run — operator only has to fill in their agent's public URL. ## Tests - Go: 5 new runtime_registry tests (happy path, -default strip, external always injected, missing file, malformed JSON, real manifest sanity). All existing handler tests still pass. - TypeScript: no type errors on my files; pre-existing canvas-batch-partial-failure type drift is on main already and tracked on the #2061 branch. ## Follow-ups (filed separately) - Cut molecule-sdk-python v0.y to PyPI so the snippet can use `pip install molecule-ai-sdk` instead of `git+main`. - Add a `runtime: string` field per template in manifest.json so one template can declare its runtime explicitly (instead of deriving it from name conventions). Unblocks N-templates-per- runtime (e.g. hermes-minimax, hermes-anthropic both runtime=hermes). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 15:34:10 -07:00
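The manifest-driven registry from commit 1e8b5e0167 can be sketched as follows. This is a Python approximation of the behaviour described for `runtime_registry.go`, not a port of it: the function name `load_runtimes` and the contents of the fallback set are assumptions, while the `workspace_templates[].name` lookup, the `-default` stripping, the always-present `external` entry, and the fallback on load failure come from the message.

```python
import json
from typing import Set

# Assumed fallback contents; the commit only says that a static map exists.
FALLBACK_RUNTIMES = {"langgraph", "claude-code", "external"}
_SUFFIX = "-default"


def load_runtimes(manifest_path: str) -> Set[str]:
    """Derive runtime identifiers from manifest.json, with a static fallback."""
    try:
        with open(manifest_path, encoding="utf-8") as fh:
            manifest = json.load(fh)
        names = [t["name"] for t in manifest["workspace_templates"]]
    except (OSError, ValueError, KeyError, TypeError):
        # Missing file or malformed JSON: keep tests and dev containers working.
        return set(FALLBACK_RUNTIMES)
    runtimes = {n[: -len(_SUFFIX)] if n.endswith(_SUFFIX) else n for n in names}
    runtimes.add("external")  # no template repo exists for it, so inject it
    return runtimes
```

Deriving the set from the manifest removes the second source of truth that let `gemini-cli` drift out of the allowlist.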
Hongming Wang	5adc8a74d5	feat(canvas+org): env preflight, EmptyState parity, shared useTemplateDeploy hook Builds on #2061. Three internally-cohesive sub-features; easiest to read in order. ## 1. Org-level env preflight Server - `OrgTemplate` + `OrgWorkspace` gain `required_env: string[]` and `recommended_env: string[]` YAML fields. - `GET /org/templates` walks the tree and returns the tree-union (deduped, sorted) of both. `collectOrgEnv` dedup prefers required when the same key is declared at both tiers. - `POST /org/import` preflights against `global_secrets` WHERE `octet_length(encrypted_value) > 0` (empty-value rows used to be counted as "configured" and the per-container preflight still failed at start time). 412 Precondition Failed + `missing_env` list when required keys are absent. `force=true` bypasses with an audit log line. DB lookup failure now returns 500 (was: silent fall-through that defeated the guard). Env-var NAMES validated against `^[A-Z][A-Z0-9_]{0,127}$` so a malicious template can't ship pathological names into the UI or DB. Canvas - New `OrgImportPreflightModal`: red "Required" section (blocking) and yellow "Recommended" section (non-blocking, import stays enabled, shows live missing-count next to the Import button). - Per-key password input → `PUT /settings/secrets` → strike-through on save. Functional `setDrafts` throughout (no stale-closure clobbers on rapid successive saves). `useEffect` seed keyed on a sorted-join string signature so a parent re-render with a new array identity doesn't clobber typed inputs. - `TemplatePalette.handleImport` branches: zero env declarations → straight to import; any declarations → fetch configured global secret keys, open the modal. Tests (Go): `TestCollectOrgEnv_*` (5) cover union-across-levels, required-wins-over-recommended (including same-struct), dedup, empty, invalid-name rejection. ## 2. 
EmptyState parity with TemplatePalette The "Deploy your first agent" grid used to call `POST /workspaces` with no preflight while the sidebar palette ran `checkDeploySecrets` + `MissingKeysModal` first. Same template deployed two different ways → first-run users saw containers boot in `failed` state without guidance. Now both surfaces share one preflight + modal handshake. EmptyState's previous `interface Template` dropped `runtime`, `models`, and `required_env` — silently discarding exactly the fields the preflight needs. `Template` now lives in `deploy-preflight.ts` and is imported from there by both surfaces. ## 3. useTemplateDeploy hook With the preflight + modal wiring now duplicated across EmptyState + TemplatePalette + (going forward) any third surface, extracted the pattern into `canvas/src/hooks/useTemplateDeploy.tsx`: const { deploy, deploying, error, modal } = useTemplateDeploy({ canvasCoords: ..., // optional, default random onDeployed: (id) => ..., }); Closes three drift surfaces that the duplication had created: - `resolveRuntime` id→runtime fallback table (moved to `deploy-preflight.ts`). EmptyState had a narrower fallback that would have silently disagreed with the palette on any future id needing a non-identity mapping. - `checkDeploySecrets` call signature. One owner. - `MissingKeysModal` JSX wiring. One owner. Narrow try/catch around `checkDeploySecrets` so a preflight network failure clears `deploying` and surfaces via `setError` instead of stranding the button forever. `modal: ReactNode` (not a `renderModal()` function) — the previous memoization bought nothing since consumers called it inline every render. Named `MissingKeysInfo` interface for the state shape. ## 4. Viewport auto-fit user-pan gate fix During org deploy the canvas was meant to pan+zoom to follow each arriving workspace (`molecule:fit-deploying-org` event → debounced fitView). In practice the fit stayed stuck on wherever the first fit landed. 
Root cause: React Flow v12 fires `onMoveEnd` with a truthy `event` at the END of a programmatic `fitView` animation. The original "respect-user-pan" gate stamped `userPannedAtRef` in `onMoveEnd`, so our own fit completing looked like a user pan, and every subsequent auto-fit short-circuited for the rest of the deploy. Fix: stop trusting `onMoveEnd` for user-intent detection. Register explicit `wheel` + `pointerdown` listeners on `document` with capture phase and `target.closest('.react-flow__pane')` filter. Capture-phase immunity to `stopPropagation`; pane-filter rejects toolbar / modal / side-panel clicks (the old `window` fallback caught those). `onMoveEnd` simplified to only drive the debounced viewport save. Also: fit event dispatched on root arrivals (not just children), so the canvas centers on the just-landed root immediately instead of waiting ~2s for the first child. Animation 600ms → 400ms so successive per-arrival fits don't pile up visually. End-state fit stays at 1200ms — intentional asymmetry ("settling" vs "tracking"), documented in code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 15:15:33 -07:00
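The env-var name check mentioned in section 1 of commit 5adc8a74d5 uses the pattern `^[A-Z][A-Z0-9_]{0,127}$`. A minimal Python sketch of that filter is below; the function name `valid_env_names` is hypothetical, and the server's actual rejection behaviour is not shown in the message.

```python
import re
from typing import Iterable, List

# Pattern quoted in the commit message: one leading capital, then up to 127 more.
ENV_NAME_RE = re.compile(r"^[A-Z][A-Z0-9_]{0,127}$")


def valid_env_names(names: Iterable[str]) -> List[str]:
    """Keep only names that match the pattern, preserving input order."""
    # fullmatch avoids accepting a name with a trailing newline, which a bare
    # `$` would tolerate in Python's re module.
    return [n for n in names if ENV_NAME_RE.fullmatch(n)]
```

The upper bound of 128 characters in total keeps a malicious template from pushing oversized names into the UI or the database.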
Hongming Wang	184f8256cd	ci(redeploy): fire post-main tenant fleet redeploy via CP admin endpoint Closes the "main merged but prod tenants still on old image" gap. ## Trigger chain main merge └─> publish-workspace-server-image (builds + pushes :latest + :<sha>) └─> redeploy-tenants-on-main (this workflow) └─> POST https://api.moleculesai.app/cp/admin/tenants/redeploy-fleet └─> Canary hongmingwang + 60s soak, then batches of 3 with SSM Run Command redeploying each tenant EC2 ## Features - Auto-fires on every successful publish-workspace-server-image run. - Manual dispatch with optional target_tag (for rollback to an older SHA), canary_slug override, batch_size, dry_run. - 30s delay before calling CP so GHCR edge cache serves the new :latest consistently to every tenant's docker pull. - Skips when publish job failed (workflow_run fires on any completion). - Job summary renders per-tenant results as a markdown table so ops can see which tenant, if any, broke the chain. - Exits non-zero on HTTP != 200 or ok=false so a broken rollout marks the commit status red. ## Secrets + vars required - secret CP_ADMIN_API_TOKEN — Railway prod molecule-platform / CP_ADMIN_API_TOKEN Mirrored into this repo's secrets. - var CP_URL (optional) — defaults to https://api.moleculesai.app ## Paired with - Molecule-AI/molecule-controlplane branch feat/tenant-auto-redeploy which adds the /cp/admin/tenants/redeploy-fleet endpoint + the SSM orchestration. This workflow is a no-op until that lands on prod CP. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 14:34:28 -07:00
Hongming Wang	a34121d451	fix(a2a_executor): remove shadowing local `Part` import that broke streaming Python scoping rule: any name assigned anywhere in a function body is local for the entire body. The outbound-files block at ~L442 had `from a2a.types import ... Part ...`, which made `Part` a local name throughout the execute() function. The astream_events loop at L358 — which runs BEFORE that import — then raised: UnboundLocalError: cannot access local variable 'Part' where it is not associated with a value Every streaming A2A reply died with "Agent error: cannot access local variable 'Part' where it is not associated with a value" instead of the actual agent text. 5 tests caught it: - test_streaming_plain_string_content - test_streaming_anthropic_content_blocks - test_non_stream_events_ignored - test_core_execute_on_chat_model_end_captures_last_ai_message - test_core_execute_pii_redaction_when_pii_found Fix: drop `Part` from the function-scope import (it is already imported at module level on line 42) and leave a comment pinning the rationale so a future refactor doesn't re-introduce the shadow. All 43 test_a2a_executor tests pass locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 14:21:04 -07:00
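The scoping rule behind commit a34121d451 can be reproduced in a few lines. The sketch below substitutes `fractions.Fraction` for `a2a.types.Part` so that it runs with the standard library alone; the function names are illustrative.

```python
from fractions import Fraction  # module-level import, like `Part` in the commit


def broken() -> Fraction:
    # Raises UnboundLocalError: the import further down makes `Fraction`
    # a local name for the whole function body, including this line.
    value = Fraction(1, 2)
    from fractions import Fraction  # function-scope import shadows the module name
    return value


def fixed() -> Fraction:
    # No local import, so the module-level name is visible throughout.
    return Fraction(1, 2)
```

Calling `broken()` fails on its first line even though the shadowing import sits later in the body, which matches the failure mode the commit describes for the streaming loop.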
Hongming Wang	817b8b0307	fix(scripts): make MAX_DELETE_PCT actually honor env override The script's own help text documents `MAX_DELETE_PCT=62 ./sweep-cf-orphans.sh` as the way to relax the safety gate, but the in-script assignment on line 35 was unconditional and overwrote any env value, so the override never worked. During today's staging tenant-provision recovery (CP #255 context), we hit the 57%-delete threshold and needed the documented override to clear 64 orphan records. The one-char change to `${MAX_DELETE_PCT:-50}` honors the env while keeping the 50% default when no caller overrides. Ran with MAX_DELETE_PCT=62 after the fix: deleted 64 records, CF zone 111→47. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 14:14:55 -07:00
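The default-with-override behaviour from commit 817b8b0307 translates to Python as shown below. This is an analogue of the shell expansion, not the script itself: the function names are hypothetical, and the exact comparison used by the real safety gate (here "delete share must not exceed the limit") is an assumption.

```python
import os
from typing import Mapping, Optional


def max_delete_pct(env: Optional[Mapping[str, str]] = None) -> int:
    """Read MAX_DELETE_PCT from the environment, defaulting to 50."""
    env = os.environ if env is None else env
    # `or "50"` also covers an empty value, like the shell's `:-` form.
    return int(env.get("MAX_DELETE_PCT") or "50")


def within_gate(to_delete: int, total: int, limit_pct: int) -> bool:
    """Assumed gate: allow the sweep only if the delete share is within the limit."""
    return total > 0 and 100 * to_delete <= limit_pct * total
```

With the figures from the commit (64 of 111 records), the default limit of 50 blocks the sweep and an override of 62 lets it through.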
Hongming Wang	425df5e5a9	merge(staging): resolve conflicts + fix 7 test regressions on top of #2061 - Merge origin/staging into fix/canvas-multilevel-layout-ux. 18 files auto-merged (mostly canvas/tabs/chat and workspace-server handlers the earlier DIRTY marker was stale relative to current staging). - Fix 7 test failures surfaced by the merge: 1. Canvas.pan-to-node.test.tsx — mockGetIntersectingNodes was inferred as vi.fn(() => never[]); mockReturnValueOnce of a node object failed type check. Explicit return-type annotation. 2. Canvas.pan-to-node.test.tsx + Canvas.a11y.test.tsx — Canvas.tsx reads deletingIds.size (new multilevel-layout state). Both mock stores lacked deletingIds; added new Set<string>() to each. 3. canvas-batch-partial-failure.test.ts — makeWS() built a wire- format WorkspaceData (snake_case, with x/y/uptime_seconds). The store's node.data is now WorkspaceNodeData (camelCase, no wire- only fields). Rewrote makeWS to produce WorkspaceNodeData and updated 5 call-site casts. No assertions changed. 4. ConfigTab.hermes.test.tsx — two tests pinned pre-#2061 behavior that the PR intentionally inverts: a. "shows hermes-specific info banner" — RUNTIMES_WITH_OWN_CONFIG now contains only {"external"}, so the banner is no longer shown for hermes. Inverted assertion: now pins ABSENCE of the banner, with a comment noting the inversion. b. "config.yaml runtime wins over DB" — priority reversed: DB is now authoritative so the tier-on-node badge matches the form. Inverted scenario: DB=hermes + yaml=crewai → form shows hermes. Switched test's DB runtime off langgraph because the dropdown collapses langgraph into an empty- valued "default" option that would hide the win signal. - No production code changed — this commit is staging merge + test realignment only. 953/953 canvas tests pass. tsc --noEmit clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 13:50:39 -07:00
Hongming Wang	94d9331c76	feat(canvas+platform): chat attachments, model selection, deploy/delete UX Session's accumulated UX work across frontend and platform. Reviewable in four logical sections — diff is large but internally cohesive (each section fixes a gap the next one depends on). ## Chat attachments — user ↔ agent file round trip - New POST /workspaces/:id/chat/uploads (multipart, 50 MB total / 25 MB per file, UUID-prefixed storage under /workspace/.molecule/chat-uploads/). - New GET /workspaces/:id/chat/download with RFC 6266 filename escaping and binary-safe io.CopyN streaming. - Canvas: drag-and-drop onto chat pane, pending-file pills, per-message attachment chips with fetch+blob download (anchor navigation can't carry auth headers). - A2A flow carries FileParts end-to-end; hermes template executor now consumes attachments via platform helpers. ## Platform attachment helpers (workspace/executor_helpers.py) Every runtime's executor routes through the same helpers so future runtimes inherit attachment awareness for free: - extract_attached_files — resolve workspace:/file:///bare URIs, reject traversal, skip non-existent. - build_user_content_with_files — manifest for non-image files, multi-modal list (text + image_url) for images. Respects MOLECULE_DISABLE_IMAGE_INLINING for providers whose vision adapter hangs on base64 payloads (MiniMax M2.7). - collect_outbound_files — scans agent reply for /workspace/... paths, stages each into chat-uploads/ (download endpoint whitelist), emits as FileParts in the A2A response. - ensure_workspace_writable — called at molecule-runtime startup so non-root agents can write /workspace without each template having to chmod in its Dockerfile. Hermes template executor + langgraph (a2a_executor.py) + claude-code (claude_sdk_executor.py) all adopt the helpers. ## Model selection & related platform fixes - PUT /workspaces/:id/model — was 404'ing, so canvas "Save" silently lost the model choice. 
Stores into workspace_secrets (MODEL_PROVIDER), auto-restarts via RestartByID. - applyRuntimeModelEnv falls back to envVars["MODEL_PROVIDER"] so Restart propagates the stored model to HERMES_DEFAULT_MODEL without needing the caller to rehydrate payload.Model. - ConfigTab Tier dropdown now reads from workspaces row, not the (stale) config.yaml — fixes "badge shows T3, form shows T2". ## ChatTab & WebSocket UX fixes - Send button no longer locks after a dropped TASK_COMPLETE — `sending` no longer initializes from data.currentTask. - A2A POST timeout 15 s → 120 s. LLM turns routinely exceed 15 s; the previous default aborted fetches while the server was still replying, producing "agent may be unreachable" on success. - socket.ts: disposed flag + reconnectTimer cancellation + handler detachment fix zombie-WebSocket in React StrictMode. - Hermes Config tab: RUNTIMES_WITH_OWN_CONFIG drops 'hermes' — the adaptor's purpose IS the form, banner was contradictory. - workspace_provision.go auto-recovery: try <runtime>-default AND bare <runtime> for template path (hermes lives at the bare name). ## Org deploy/delete animation (theme-ready CSS) - styles/theme-tokens.css — design tokens (durations, easings, colors). Light theme overrides by setting only the deltas. - styles/org-deploy.css — animation classes + keyframes, every value references a token. prefers-reduced-motion respected. - Canvas projects node.draggable=false onto locked workspaces (deploying children AND actively-deleting ids) — RF's authoritative drag lock; useDragHandlers retains a belt-and- braces check. - Organ cancel button (red pulse pill on root during deploy) cascades via existing DELETE /workspaces/:id?confirm=true. - Auto fit-view after each arrival, debounced 500 ms so rapid sibling arrivals coalesce into one fit (previous per-event fit made the viewport lurch continuously). 
- Auto-fit respects user-pan — onMoveEnd stamps a user-pan timestamp only when event !== null (ignores programmatic fitView) so auto-fits don't self-cancel. - deletingIds store slice + useOrgDeployState merge gives the delete flow the same dim + non-draggable treatment as deploy. - Platform-level classNames.ts shared by canvas-events + useCanvasViewport (DRY'd 3 copies of split/filter/join). ## Server payload change - org_import.go WORKSPACE_PROVISIONING broadcast now includes parent_id + parent-RELATIVE x/y (slotX/slotY) so the canvas renders the child at the right parent-nested slot without doing any absolute-position walk. createWorkspaceTree signature gains relX, relY alongside absX, absY; both call sites updated. ## Tests - workspace/tests/test_executor_helpers.py — 11 new cases covering URI resolution (including traversal rejection), attached-file extraction (both Part shapes), manifest-only vs multi-modal content, large-image skip, outbound staging, dedup, and ensure_workspace_writable (chmod 777 + non-root tolerance). - workspace-server chat_files_test.go — upload validation, Content-Disposition escaping, filename sanitisation. - workspace-server secrets_test.go — SetModel upsert, empty clears, invalid UUID rejection. - tests/e2e/test_chat_attachments_e2e.sh — round-trip against a live hermes workspace. - tests/e2e/test_chat_attachments_multiruntime_e2e.sh — static plumbing check + round-trip across hermes/langgraph/claude-code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 13:27:51 -07:00
Hongming Wang	62217250ed	test(pricing): finish Starter→Team, Pro→Growth rename in 6 stale assertions Marketing-lead agent's rename pass updated the "renders all three plans" test (lines 56-57) but missed lines 77, 94, 114, 132, 143, 158 which still referenced the pre-rename "Upgrade to Starter" / "Upgrade to Pro" button names. Canvas (Next.js) build failed with getByRole timeout because the component now says "Upgrade to Team" / "Upgrade to Growth". Internal PlanId tuple ("free" | "starter" | "pro") and startCheckout(planId) call are unchanged; only the user-facing button labels shifted, so assertions like startCheckout("pro", "acme") still match the server-side API. Verified locally: 9/9 PricingTable tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 13:01:40 -07:00
Hongming Wang	2dbd06d52e	Merge pull request #2055 from Molecule-AI/feat/lark-channel-first-class-v2 feat(channels): first-class Lark/Feishu support via schema-driven config	2026-04-24 19:57:57 +00:00
rabbitblood	998cd03265	fix(tabs-a11y): mock config_schema on adapter response Schema-driven ChannelsTab renders no inputs when config_schema is absent — the test's bare {type, display_name} mock mismatched the real API shape and every getByLabelText("Bot Token") failed. Mock now mirrors GET /channels/adapters with the Telegram schema (bot_token password + chat_id text) so the a11y assertions run against the actual rendered form. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 12:04:51 -07:00
molecule-ai[bot]	92a0c0073d	Merge pull request #2058 from Molecule-AI/chore/canvas-node22-upgrade chore(canvas): upgrade node:20-alpine → node:22-alpine	2026-04-24 19:04:25 +00:00
molecule-ai[bot]	17f29e874a	Merge pull request #2029 from Molecule-AI/fix/canvas-a11y-tabs-v2 fix(canvas/a11y): add type=button to tab toolbar and settings buttons	2026-04-24 19:01:24 +00:00
molecule-ai[bot]	02406ea823	Merge pull request #2024 from Molecule-AI/fix/gh-identity-plugin-role-env-v2 feat(#1957): wire gh-identity plugin into workspace-server	2026-04-24 19:01:22 +00:00
Hongming Wang	fc2e6150d3	Merge pull request #2056 from Molecule-AI/fix/compliance-default-owasp-agentic fix(compliance): flip default mode to owasp_agentic (detect-only)	2026-04-24 18:56:00 +00:00
molecule-ai[bot]	58745145cb	Merge pull request #2038 from Molecule-AI/hotfix/audit34-to-main hotfix: Audit #34 fixes to main	2026-04-24 18:55:39 +00:00
Molecule AI Core-DevOps	1e5fc48acb	chore(canvas): upgrade node:20-alpine → node:22-alpine Node.js 20 reaches EOL 2026-09 and actions/checkout@v4 emits Node.js 20 deprecation warnings on GitHub Actions (Node 24 forced 2026-06-02). Next.js 15.1 is fully compatible with Node 22. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 18:54:30 +00:00
Hongming Wang	9af058b82d	fix(compliance): flip default mode to owasp_agentic (detect-only) Prior state: compliance.mode default was "" (fully off) and no template in the repo set it explicitly — so prompt-injection detection, PII redaction, and agency-limit checks were silently disabled on every live workspace, despite the machinery being present in workspace/builtin_tools/compliance.py. This was surfaced during a 2026-04-24 review of the A2A inbound path: a2a_executor.py gates three security checks on _compliance_cfg.mode == "owasp_agentic" and default config never matches, so every A2A message skipped all three. Fix: default is now owasp_agentic + prompt_injection=detect. Detect mode logs injection attempts as audit events without blocking — no UX cost, just visibility. Operators who want stricter enforcement set `prompt_injection: block` per workspace. Operators who genuinely want compliance fully off can set `mode: ""` (not recommended; documented). Changes: - ComplianceConfig.mode default: "" → "owasp_agentic" - Yaml parser fallback default: "" → "owasp_agentic" (must match dataclass) - Docstring updated with rationale + opt-out snippet Tests: 66/66 test_compliance.py + test_a2a_executor.py pass. 19/19 test_config.py pass. The one test asserting compliance_mode == "" is for the "config load failed" fallback path (different from the default config path) — correctly unchanged. Security posture improvement: prompt-injection detection is now always on for every workspace created after this ships, with zero behavior change for legitimate inputs. Block mode remains an opt-in when an operator wants to actively reject injection attempts rather than just log them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 11:52:09 -07:00
Hongming Wang	04e60e7303	Merge pull request #2052 from Molecule-AI/fix/canvas-provisioning-timeout-runtime-aware fix(canvas): runtime-aware provisioning-timeout threshold (hermes 12min vs default 2min)	2026-04-24 18:51:46 +00:00
rabbitblood	00265d7028	feat(channels): first-class Lark/Feishu support via schema-driven config Lark adapter was already implemented in Go (lark.go — outbound Custom Bot webhook + inbound Event Subscriptions with constant-time token verify), but the Canvas connect-form hardcoded a Telegram-shaped pair of inputs (bot_token + chat_id). Selecting "Lark / Feishu" from the dropdown silently sent the wrong field names — there was no way to enter a webhook URL. Fix: move form shape to the server. - Add `ConfigField` struct + `ConfigSchema()` method to the `ChannelAdapter` interface. Each adapter declares its own fields with label/type/required/sensitive/placeholder/help. - Implement per-adapter schemas: - Lark: webhook_url (required+sensitive) + verify_token (optional+sensitive) - Slack: bot_token/channel_id/webhook_url/username/icon_emoji - Discord: webhook_url + optional public_key - Telegram: bot_token + chat_id (unchanged UX, keeps Detect Chats) - Change `ListAdapters()` to return `[]AdapterInfo` with config_schema inline. Sorted deterministically by display name so UI ordering is stable across Go's random map iteration. - Update the 3 existing `ListAdapters` test sites to struct access. Canvas (`ChannelsTab.tsx`): - Replace the two hardcoded bot_token/chat_id inputs with a single schema-driven `SchemaField` component. Renders one input per field in the order the adapter returns them. - Form state becomes `formValues: Record<string,string>` keyed by `ConfigField.key`. Values reset on platform-switch so stale Telegram credentials can't leak into a new Lark channel. - "Detect Chats" stays but only renders for platforms in `SUPPORTS_DETECT_CHATS` (Telegram only — the only provider with getUpdates). - Only schema-known keys are posted in `config`, scrubbing any stale values from previous platform selections. Regression tests: - `TestLark_ConfigSchema` locks in the 2-field Lark contract with the required/sensitive flags correctly set. 
- `TestListAdapters_IncludesLark` confirms registry wiring + schema survives round-trip through ListAdapters. Known pre-existing `TestStripPluginMarkers_AwkScript` failure in internal/handlers is unrelated to this change (verified via stash+test on clean staging). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 11:51:15 -07:00
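The "only schema-known keys are posted" rule described in the commit above can be sketched as follows. This is a minimal illustration, not the actual `ChannelsTab.tsx` code: the `ConfigField` shape is inferred from the commit message, and `buildConfig`, `missingRequired`, the Lark field labels, and the example URL are hypothetical.

```typescript
// Hypothetical client-side mirror of the server's ConfigField struct.
// Only `key`, `label`, `type`, `required` and `sensitive` are modelled here.
interface ConfigField {
  key: string;
  label: string;
  type: "text" | "password";
  required: boolean;
  sensitive: boolean;
}

// Keep only the keys the selected adapter's schema declares, so values left
// over from a previously selected platform are never posted.
function buildConfig(
  schema: ConfigField[],
  formValues: Record<string, string>,
): Record<string, string> {
  const config: Record<string, string> = {};
  for (const field of schema) {
    const value = formValues[field.key];
    if (value !== undefined && value !== "") {
      config[field.key] = value;
    }
  }
  return config;
}

// Keys of required fields that are still empty; a form could block submit on these.
function missingRequired(
  schema: ConfigField[],
  formValues: Record<string, string>,
): string[] {
  return schema
    .filter((field) => {
      const value = formValues[field.key];
      return field.required && (value === undefined || value.trim() === "");
    })
    .map((field) => field.key);
}

// Assumed Lark schema: two fields, matching the required/sensitive flags
// named in the commit message. Labels and input types are guesses.
const larkSchema: ConfigField[] = [
  { key: "webhook_url", label: "Webhook URL", type: "password", required: true, sensitive: true },
  { key: "verify_token", label: "Verify Token", type: "password", required: false, sensitive: true },
];

// Form state that still carries Telegram values from an earlier selection.
const staleForm: Record<string, string> = {
  bot_token: "123:abc",
  chat_id: "42",
  webhook_url: "https://example.invalid/hook",
};

const larkConfig = buildConfig(larkSchema, staleForm);
console.log(Object.keys(larkConfig));
```

Iterating over the schema rather than over the form state means the filter cannot leak a key the adapter never declared, even if the reset on platform switch were skipped.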
Hongming Wang	0b237ed9dd	refactor(canvas): extract runtime profiles to @/lib/runtimeProfiles Preparation for a "hundreds of runtimes" plugin ecosystem. Keeping the runtime-specific UX knobs in-line inside ProvisioningTimeout scales badly — every new runtime would require editing a component, not just adding a table entry. Other components (create-workspace dialog, workspace card tooltips, etc.) will want the same runtime metadata. Changes: - New file `canvas/src/lib/runtimeProfiles.ts` owns: * `RuntimeProfile` type — structural shape, every field optional so new runtimes can partially-fill without breaking consumers. * `DEFAULT_RUNTIME_PROFILE` — 2-min default floor (docker-fast). * `RUNTIME_PROFILES` — named overrides (currently: hermes 12 min). * `WorkspaceRuntimeOverrides` — interface for server-provided per-workspace overrides, so operators can tune via template manifest / workspace metadata without a canvas release. * `getRuntimeProfile()` — resolver with overrides → profile → default priority. * `provisionTimeoutForRuntime()` — convenience wrapper. - `ProvisioningTimeout.tsx` now delegates to the profile module. `DEFAULT_PROVISION_TIMEOUT_MS` re-exported for legacy test importers. - Tests: 16/16 (up from 9 before the first fix). Adds pinning for: * overrides > profile > default priority chain * "every entry in RUNTIME_PROFILES resolves to a number" contract * backward-compat export Adding a new slow runtime is now one table entry in `canvas/src/lib/runtimeProfiles.ts` with a mandatory `WHY` comment. Moving to server-driven profiles later is a ~10-line change (the resolver already threads WorkspaceRuntimeOverrides through). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 11:48:39 -07:00
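The overrides → profile → default priority chain described in the commit above could look roughly like this. It is a sketch under assumptions: the exported names come from the commit message, but the single `provisionTimeoutMs` field and the function signatures are hypothetical, and the real `runtimeProfiles.ts` may carry more fields.

```typescript
// Every field is optional so a runtime can fill in only what it needs.
interface RuntimeProfile {
  provisionTimeoutMs?: number; // hypothetical field name
}

// Server-provided per-workspace overrides (shape assumed to match the profile).
interface WorkspaceRuntimeOverrides {
  provisionTimeoutMs?: number;
}

// 2-minute default floor for fast docker runtimes.
const DEFAULT_RUNTIME_PROFILE: Required<RuntimeProfile> = {
  provisionTimeoutMs: 120_000,
};

const RUNTIME_PROFILES: Record<string, RuntimeProfile> = {
  // WHY: hermes cold boots take 8-13 min, so it gets a 12-minute budget.
  hermes: { provisionTimeoutMs: 720_000 },
};

// Resolve each field with priority: overrides, then named profile, then default.
function getRuntimeProfile(
  runtime?: string,
  overrides?: WorkspaceRuntimeOverrides,
): Required<RuntimeProfile> {
  // Own-property check so names like "constructor" never match the prototype.
  const named =
    runtime !== undefined &&
    Object.prototype.hasOwnProperty.call(RUNTIME_PROFILES, runtime)
      ? RUNTIME_PROFILES[runtime]
      : undefined;
  return {
    provisionTimeoutMs:
      overrides?.provisionTimeoutMs ??
      named?.provisionTimeoutMs ??
      DEFAULT_RUNTIME_PROFILE.provisionTimeoutMs,
  };
}

// Convenience wrapper for callers that only need the timeout.
function provisionTimeoutForRuntime(
  runtime?: string,
  overrides?: WorkspaceRuntimeOverrides,
): number {
  return getRuntimeProfile(runtime, overrides).provisionTimeoutMs;
}

console.log(provisionTimeoutForRuntime("hermes"));
console.log(provisionTimeoutForRuntime("langgraph"));
```

Because the resolver always returns a fully populated profile, consumers never need their own fallback logic, which is what keeps a new runtime down to one table entry.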
molecule-ai[bot]	1a27370e7b	Merge pull request #2051 from Molecule-AI/fix/canvas-embeddedteam-removal-and-canvasorbearer-return refactor(canvas): remove unused EmbeddedTeam component from WorkspaceNode	2026-04-24 18:47:16 +00:00
Hongming Wang	9597d262ca	fix(canvas): runtime-aware provisioning-timeout threshold Hermes workspaces cold-boot in 8-13 min (ripgrep + ffmpeg + node22 + hermes-agent source build + Playwright + Chromium ~300MB). The canvas's 2-min hardcoded "Provisioning Timeout" warning fired at ~2min and told users their workspace was "stuck" while it was still mid-install. Users hit Retry, triggering fresh cold boots and cancelling healthy workspaces. User-facing symptom (reported 2026-04-24 18:35Z): hermes workspace showed "has been provisioning for 3m 15s — it may have encountered an issue" with Retry + Cancel buttons, while the EC2 was installing node_modules. Fix: - Keep DEFAULT_PROVISION_TIMEOUT_MS = 120_000 (2min) — correct for fast docker runtimes (claude-code, langgraph, crewai) where cold boot is 30-90s. - Add RUNTIME_TIMEOUT_OVERRIDES_MS = { hermes: 720_000 } (12min). Aligns with tests/e2e/test_staging_full_saas.sh's PROVISION_TIMEOUT_SECS=900 (15min) so UI warns shortly before the backend itself gives up. - New timeoutForRuntime() resolves the base; per-node lookup in the check-timeouts interval so a mixed batch (1 hermes + 2 langgraph) uses the right threshold for each. - timeoutMs prop is now optional. Undefined → per-runtime lookup; a number → forces a single threshold for every workspace (tests use this for deterministic behavior). Tests: 4 new cases pinning the runtime-aware resolution, including a guard that catches future regressions that would weaken hermes's budget. Existing tests unchanged (they import DEFAULT_PROVISION_TIMEOUT_MS which still exports 120_000). 13/13 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 11:46:09 -07:00
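The per-node lookup for a mixed batch, together with the optional `timeoutMs` prop, might be sketched like this. The two constants and `timeoutForRuntime` are named in the commit above; `ProvisioningNode`, `findTimedOut`, and the node fields are hypothetical and stand in for whatever the component actually tracks.

```typescript
const DEFAULT_PROVISION_TIMEOUT_MS = 120_000; // 2 min, fast docker runtimes
const RUNTIME_TIMEOUT_OVERRIDES_MS: Record<string, number> = {
  hermes: 720_000, // 12 min
};

function timeoutForRuntime(runtime?: string): number {
  if (
    runtime !== undefined &&
    Object.prototype.hasOwnProperty.call(RUNTIME_TIMEOUT_OVERRIDES_MS, runtime)
  ) {
    return RUNTIME_TIMEOUT_OVERRIDES_MS[runtime];
  }
  return DEFAULT_PROVISION_TIMEOUT_MS;
}

// Hypothetical minimal view of a workspace that is still provisioning.
interface ProvisioningNode {
  id: string;
  runtime?: string;
  startedAt: number; // epoch milliseconds
}

// Return the ids whose elapsed time has reached their threshold.
// timeoutMs undefined: per-runtime lookup for each node.
// timeoutMs a number: one forced threshold for every node (as tests use).
function findTimedOut(
  nodes: ProvisioningNode[],
  now: number,
  timeoutMs?: number,
): string[] {
  return nodes
    .filter((node) => {
      const threshold = timeoutMs ?? timeoutForRuntime(node.runtime);
      return now - node.startedAt >= threshold;
    })
    .map((node) => node.id);
}

const batch: ProvisioningNode[] = [
  { id: "ws-hermes", runtime: "hermes", startedAt: 0 },
  { id: "ws-langgraph", runtime: "langgraph", startedAt: 0 },
];

// At 3m 15s the langgraph workspace is overdue; the hermes one is not.
console.log(findTimedOut(batch, 195_000));
```

With the 3m 15s elapsed time from the reported symptom, only the fast-runtime workspace is flagged, so a hermes workspace that is still installing no longer gets a premature Retry prompt.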


3663 Commits