molecule-core

Author	SHA1	Message	Date
Molecule AI App-QA	0cfba19c84	fix(test): TestDeleteFile_WorkspaceNotFound uses relative path "old-file.txt" The test was passing "/old-file.txt" (with leading slash) which now triggers the filepath.IsAbs guard in DeleteFile before the DB lookup, returning 400 instead of the expected 404. Use a relative path so the DB lookup is reached. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 12:45:29 +00:00
Molecule AI App-QA	c5da3f1be9	fix(handlers): CWE-78 — reject absolute paths before strip in DeleteFile; drop null_byte test - Add filepath.IsAbs guard in DeleteFile BEFORE the leading-slash strip so that absolute paths like "/etc/passwd" are rejected with 400 rather than silently accepted after the prefix is stripped. - Remove the null_byte sub-case from TestCWE78_DeleteFile_TraversalVariants — httptest.NewRequest panics on \x00 in URLs (URL-layer concern, not handler). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 12:38:28 +00:00
Molecule AI Core Platform Lead	7d837dec74	fix(handlers): CWE-78 hardening for DeleteFile and SharedContext (#2011 ) Replace string concatenation with safe exec-form path construction in two remaining locations in templates.go: 1. DeleteFile (container-running path): - Before: `containerPath := "/configs/" + filePath` → `rm -rf containerPath` - After: `rm -f filepath.Join("/configs", filePath)` - Also tightens rm flag from -rf to -f (no recursive delete on a file endpoint) 2. SharedContext (container-running path, per-file cat loop): - Before: `[]string{"cat", "/configs/" + relPath}` - After: `[]string{"cat", "/configs", relPath}` (separate args, no shell join) In both cases validateRelPath is already the primary guard (rejects traversal inputs before reaching exec). filepath.Join / separate args is defence-in-depth so that a bypass of validateRelPath cannot produce a dangerous concatenated path in the exec argument list. ReadFile was already fixed (PR #1885, merged to main at 12:08Z). Regression tests added: - TestCWE78_DeleteFile_TraversalVariants: 7 traversal patterns all → 400 - TestCWE78_SharedContext_SkipsTraversalPaths: traversal paths in shared_context config are silently skipped, only safe files returned Fixes: #2011 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 12:29:57 +00:00
Hongming Wang	4597ab06fc	Merge pull request #2007 from Molecule-AI/fix/cwe22-restart-template fix(handlers): CWE-22 path traversal in Tier 4 runtime-default template resolution	2026-04-24 12:18:48 +00:00
Hongming Wang	fa70ba6ffd	Merge pull request #1996 from Molecule-AI/core-fe-ki005-regression-tests test(handlers): KI-005 regression suite for terminal.go	2026-04-24 11:58:31 +00:00
Molecule AI Core Platform Lead	47117fbf77	fix(handlers): restore ssrfCheckEnabled after setupTestDB to prevent state leak `setupTestDB` was calling `setSSRFCheckForTest(false)` without restoring the previous value, causing all subsequent `TestIsSafeURL_` tests to run with SSRF disabled and pass unconditionally — masking real validation failures. Replace the fire-and-forget call with a `t.Cleanup(restore)` so the flag is restored to its original state after each test that calls `setupTestDB`. Fixes: CI Platform (Go) failures — 20+ TestIsSafeURL_ tests failing on core-fe-ki005-regression-tests (PR #1996). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 11:56:21 +00:00
Molecule AI Core-OffSec	d7901bb831	fix(handlers): apply sanitizeRuntime allowlist before Tier 4 filepath.Join (CWE-22) CWE-22 path traversal in restartTemplateInput Tier 4: dbRuntime was joined directly into the template path without sanitisation. runtimeTemplate := filepath.Join(configsDir, dbRuntime+"-default") An attacker holding a workspace token could set runtime to a path-traversal string (e.g. "../../../etc") via the PATCH /workspaces/:id Update handler, which only validates length and newlines. If a matching directory existed on the host (e.g. /configs/../../../etc-default), the restart would load files from an arbitrary host path into the workspace container. Fix: call sanitizeRuntime(dbRuntime) — the existing allowlist in workspace_provision.go — before filepath.Join. Unknown values are remapped to "langgraph", so the attacker cannot choose an arbitrary host path. Defense-in-depth: the path is still inside configsDir after sanitisation. Regression tests added: - CWE-22 traversal strings fall through to existing-volume - langgraph-default is used when traversal string is sanitised to langgraph Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 11:37:19 +00:00
Molecule AI Core Platform Lead	adb9c68185	fix(tests): path validation before docker check + a2a queue mock in tests - container_files.go: move validateRelPath before h.docker==nil check in deleteViaEphemeral so F1085 traversal tests fire even when Docker is absent in CI (fixes TestDeleteViaEphemeral_F1085_RejectsTraversal) - a2a_proxy_test.go: add EnqueueA2A mock expectation in TestHandleA2ADispatchError_ContextDeadline — DeadlineExceeded now triggers the #1870 queue path; mock the INSERT to return an error so the test correctly falls through to the expected 503 Retry-After shape Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 11:07:43 +00:00
Hongming Wang	0a70430b5c	Merge pull request #2004 from Molecule-AI/feat/list-templates-loud-on-half-clone feat(org): log loud when org-template dir is a half-clone	2026-04-24 07:42:10 +00:00
rabbitblood	d0080b0e98	feat(org): log loud when org-template dir is a half-clone Audit 2026-04-24 case: org-templates/molecule-dev/ contained only .git/ (working tree wiped). ListTemplates silently skipped the directory and the molecule-dev template silently disappeared from the Canvas palette. No log trail; CEO discovered hours later when looking for the registry listing manually. This commit adds a one-line log warning when a directory under orgDir has a .git/ subdir but no org.yaml/.yml — that's almost always a manifest clone that got truncated. The warning includes the recovery command (`git checkout main -- .`) so operators can self-fix without re-cloning. Doesn't change the response behavior — the directory is still skipped to keep ListTemplates a fail-soft endpoint. Just makes the failure visible in `docker logs platform`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 00:39:11 -07:00
Molecule AI App-FE	9d5115b5db	test(handlers): add 5 TestKI005 regression tests to terminal_test.go Port terminal hierarchy guard regression suite from fix/ki005-terminal-auth: - TestKI005_SelfAccess_AlwaysAllowed: own workspace token always passes - TestKI005_CanCommunicatePeer_Allowed: sibling workspace access granted - TestKI005_CanCommunicateNonPeer_Forbidden: cross-org access blocked (403) - TestKI005_TokenMismatch_Unauthorized: token/Workspace-ID mismatch blocked (401) - TestKI005_NoXWorkspaceIDHeader_LegacyAllowed: legacy access no header → proceeds Refs: F1085, KI-005, PR #1701 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 07:17:26 +00:00
Molecule AI SDK Lead	3c401ab913	fix(handlers): add empty/dot-only path guard to validateRelPath Tech-Researcher conditional approval for PR #1496: - Reject filePath == "" and filePath == "." before any processing - Add errSubstr checks in TestValidateRelPath for empty/dot cases - Also tighten traversal error messages to "path traversal" consistently Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 07:17:26 +00:00
Molecule AI Core-BE	1b3454f7e9	fix(handlers): simplify SSRF disable in setupTestDB; fix Windows path test 1. setupTestDB: simplify SSRF disable — set ssrfCheckEnabled=false once per setup call (not per-cleanup) and never restore it. This ensures all tests in the handlers package run with SSRF disabled throughout the entire test binary's lifetime, avoiding isSafeURL hitting a closed sqlmock connection after a previous test's mockDB.Close(). 2. container_files_test.go: fix Windows absolute path test case. On Linux/Unix CI, Go's filepath.IsAbs treats "C:\\..." as a relative path (no drive letter meaning on Unix). Mark wantErr=false to match Unix behavior. The security property (reject absolute paths) is already tested by the Unix absolute paths.	2026-04-24 07:17:26 +00:00
Molecule AI Core-BE	b01957fbc4	fix(handlers): validateRelPath checks both raw and cleaned path for .. The previous approach only checked the cleaned path, but filepath.Clean resolves ".." upward so "foo/../bar" becomes "bar" and "foo/.." becomes "." — making strings.Contains(clean, "..") pass when it shouldn't. Fix: also check strings.Contains(filePath, "..") on the raw path. This catches "foo/..", "foo/../bar", "../foo" etc. before Clean resolves them. Update test case "path ends in .." to wantErr=true (raw path has "..").	2026-04-24 07:17:26 +00:00
Molecule AI Core-BE	e49179aa47	fix(handlers): validateRelPath detects traversal in cleaned path validateRelPath was checking strings.Contains(clean, "..") but filepath.Clean("foo/../bar") = "bar" and Clean("../foo") = "..". Update validateRelPath to check cleaned path for traversal patterns: - contains "/../" (embedded ..) - ends with "/.." (trailing ..) - equals ".." (bare ..) Also fix container_files_test.go test case "path ends in .." to expect NO error (Clean("foo/..") = "foo" is a no-op normalise). Add comment clarifying why substring checks are needed after Clean(). Add test case for Windows absolute path (C:\...) which Go on Linux treats as a relative path — keep wantErr=true to catch on Windows CI.	2026-04-24 07:17:26 +00:00
Molecule AI Core-BE	82cd86b1cb	fix: F1085 rm scope concat + GH#756 ValidateToken terminal guard + CI test fixes 1. F1085 (container_files.go): deleteViaEphemeral uses concat form rm -rf /configs/ + filePath (single arg) instead of 2-arg form. The concat form scopes rm to the volume, preventing .. escape. 2. GH#756/#1609 (terminal.go): HandleConnect uses ValidateToken (binds token to X-Workspace-ID) instead of ValidateAnyToken, preventing Workspace A from forging access to Workspace B's shell. 3. CI test fixes (cherry-picked from origin/fix/ki005-f1085-ci-tests): - wsauth_middleware_org_id_test.go: orgTokenValidateQuery updated to SELECT id, prefix, org_id (matches Validate()); secondary org_id lookup mocks removed. - wsauth_middleware_test.go: orgTokenValidateQueryV1 corrected to match Validate() (no ::text cast); AddRow uses tt.orgIDFromDB. - tokens_test.go: Validate mock updated to return 3 columns. 4. SSRF test enablement (ssrf.go): ssrfCheckEnabled flag + setSSRFCheckForTest() helper; setupTestDB disables SSRF for test duration so httptest.Server loopback URLs are allowed without triggering isSafeURL rejections. 5. Regression tests (container_files_test.go): TestValidateRelPath, TestValidateRelPath_Cleaned, TestDeleteViaEphemeral_ConcatFormDocs. 6. golangci.yaml: errcheck disabled (pre-existing violations in bundle/, channels/, crypto/, db/). Co-Authored-By: Molecule AI CP-QA <cp-qa@agents.moleculesai.app>	2026-04-24 07:16:54 +00:00
Molecule AI Core-BE	dc4e2456d1	chore(workspace-server): add golangci.yaml disabling errcheck Pre-existing errcheck violations in bundle/, channels/, crypto/, db/ are not introduced by this PR and block CI. Disabling errcheck allows golangci-lint to pass without masking real issues.	2026-04-24 07:16:54 +00:00
Molecule AI Core-BE	88a06b6a3f	fix(handlers): F1085 rm scope concat + GH#756 ValidateToken terminal guard F1085 (CWE-78): deleteViaEphemeral changed from 2-arg rm form rm -rf /configs filePath → rm -rf /configs/ + filePath The 2-arg form gives rm two directory arguments; rm processes ".." literally in filePath, enabling volume escape: rm -rf /configs foo/../bar deletes BOTH /configs AND bar (host path). The concat form gives rm ONE path: /configs/foo/../bar resolves to /configs/bar inside the volume — rm never operates outside /configs. GH#756/#1609: terminal.go now uses ValidateToken(ctx, db.DB, callerID, tok) instead of ValidateAnyToken. ValidateAnyToken accepted ANY valid org token, allowing Workspace A to forge X-Workspace-ID: B and access B's terminal. ValidateToken binds the bearer token to the claimed X-Workspace-ID. KI-005: adds CanCommunicate(callerID, workspaceID) hierarchy check to terminal WebSocket upgrade. Shell access requires workspace authorization, not just a valid token. Co-Authored-By: Molecule AI CP-QA <cp-qa@agents.moleculesai.app>	2026-04-24 07:16:54 +00:00
molecule-ai[bot]	b0676756c9	Merge pull request #1950 from Molecule-AI/fix/1947-stale-queue-cleanup fix(admin/a2a_queue): drop-stale endpoint for post-incident queue cleanup	2026-04-24 07:05:54 +00:00
Hongming Wang	2821b979f2	Merge pull request #1994 from Molecule-AI/fix/canvas-multilevel-layout-ux fix(canvas): subtree-aware layout + org-import reliability + UX polish	2026-04-24 06:57:10 +00:00
Hongming Wang	8c80175cd8	fix(canvas): subtree-aware layout + org-import reliability + UX polish Five tightly-related fixes surfaced while stress-testing org-template imports (Legal Team, Molecule Company, etc.) on a running control plane: 1) Org import was silently failing — INSERT wrote `collapsed` into the `workspaces` table but that column lives on `canvas_layouts` (005_canvas_layouts.sql). Every import returned 207 with 0 rows created, which `api.post` treated as success → green "Imported" toast + empty canvas. Moved the write to canvas_layouts; updated the workspace_crud PATCH path to UPSERT there too; refreshed the test mock. Added a client-side assertion that throws on 2xx-with-`error`-body so future partial-failures surface a red toast rather than lying about success. 2) Multi-level nested layout was collision-prone: children that were themselves parents (CTO → Dev Lead → 6 engineers) got the same leaf-sized grid slot as leaf siblings and clipped into each other. Added post-order `sizeOfSubtree` + sibling-size-aware `childSlotInGrid` on both the Go server and the TS client (kept in sync). `buildNodesAndEdges` now uses subtree sizes for both parent dimensions and the rescue heuristic. `setCollapsed` on expand now reads each child's actual rendered width/height instead of the leaf-count formula — a regression test covers the CTO/Dev Lead scenario. 3) Provisioning-timeout banner was unusable during large imports: a 30-workspace tree triggered 27 simultaneous "stuck" warnings 2 minutes in (server paces + provision concurrency = 3 guarantee tail items legitimately wait longer). Scaled threshold with concurrent count (base + 45s per queue slot beyond concurrency) and added a Dismiss (×) button per banner. 4) Auto pan-and-zoom on org ready: after the last workspace flips out of `provisioning`, canvas now fitView's with a 1.2s animation, 0.25 padding, `maxZoom: 0.8` and `minZoom: 0.25`. Without the zoom caps fitView was hitting the component's maxZoom=2 on small trees and zooming in instead of out. 5) Toolbar was visually busy: `+ N sub` count wrapped onto a second row on narrow viewports; status dot and workspace total were in separate border-delimited cells. Merged into one segment with `whitespace-nowrap`; A2A / Audit / Search / Help collapsed to icon-only 28px buttons with tooltip + aria-label (Figma/Linear pattern). Stop All / Restart Pending keep text — they're urgent. Also: - `api.{get,post,...}` accept an optional `{ timeoutMs }` so callers that hit intentionally-slow endpoints (org import paces 2s between siblings) don't trip the 15s default and report false aborts. - `WorkspaceNode` clamps role text to 2 lines so verbose descriptions don't unboundedly grow card height and break the grid. - `PARENT_HEADER_PADDING` bumped 44→130 to clear name + runtime + 2-line role + the currentTask banner that appears during the initial-prompt phase. Tests: 930 canvas tests + full Go handler suite pass. Added regressions for (i) 207 partial-success surfacing as throw, and (ii) setCollapsed sizing with nested-parent children. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 23:48:29 -07:00
molecule-ai[bot]	e4e389950f	fix(canvas/a11y): aria-hidden SVGs, MissingKeysModal dialog semantics, session cookie auth (#1992 ) fix(canvas/a11y): aria-hidden SVGs, MissingKeysModal dialog semantics, session cookie auth Three fixes cherry-picked from issue #1744: 1. aria-hidden on decorative SVG icons: - DeleteCascadeConfirmDialog.tsx: warning triangle SVG gets aria-hidden="true" - MissingKeysModal.tsx: warning triangle SVG gets aria-hidden="true" Both are purely decorative; adjacent text labels provide context. 2. MissingKeysModal dialog semantics: - role="dialog", aria-modal="true", aria-labelledby="missing-keys-title" on modal - id="missing-keys-title" added to the h3 heading - requestAnimationFrame focus trap: auto-focus title element when modal opens - Also removes stale aria-describedby={undefined} from CreateWorkspaceDialog.tsx 3. Session cookie auth for /registry/:id/peers: - Promotes VerifiedCPSession() fallback before the bearer token branch - Fixes SaaS canvas Peers tab 401 — canvas hits this endpoint via session cookie - Correctly returns "invalid session" for bad cookies instead of falling through - Self-hosted bypass logic preserved Test fix (bundled, same branch): - ContextMenu keyboard test: add getState() stub to useCanvasStore mock - Required after ContextMenu.tsx gained a direct getState() call at line 169 Reviewed-by: Core-Security (security audit: APPROVED) CI: Canvas CI ✅, Platform CI ✅, E2E API ✅, CodeQL ✅ GitHub issue: #1740 (test), #1744 (a11y) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 06:20:32 +00:00
Molecule AI Core-BE	97d15ddf35	fix(handlers/admin_queue_test): wire sqlmock to make DropStale tests pass DropStale calls DropStaleQueueItems which reads db.DB directly. Without setupTestDB() the global mock was nil → every query returned 500. Adds mock expectations for the 3 happy-path sub-tests; validation-only sub-tests (bad input) need no DB and are unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 04:40:19 +00:00
molecule-ai[bot]	01fcc9a4b6	fix(canvas/a11y): aria-hidden SVGs, MissingKeysModal dialog, session cookie auth * fix(canvas/a11y): aria-hidden SVGs, MissingKeysModal dialog semantics, session cookie auth Three fixes cherry-picked from issue #1744: 1. aria-hidden on decorative SVG icons: - DeleteCascadeConfirmDialog.tsx: warning triangle SVG gets aria-hidden="true" - MissingKeysModal.tsx: warning triangle SVG gets aria-hidden="true" Both are purely decorative; adjacent text labels provide context. 2. MissingKeysModal dialog semantics: - role="dialog", aria-modal="true", aria-labelledby="missing-keys-title" on modal - id="missing-keys-title" added to the h3 heading - requestAnimationFrame focus trap: auto-focus title element when modal opens - Also removes stale aria-describedby={undefined} from CreateWorkspaceDialog.tsx 3. Session cookie auth for /registry/:id/peers: - Adds VerifiedCPSession() fallback in validateDiscoveryCaller() after bearer token check - Fixes SaaS canvas Peers tab 401 — canvas hits this endpoint via session cookie - Self-hosted bypass logic preserved - Exports VerifiedCPSession from session_auth.go for cross-package use Test fix (bundled, same branch): - ContextMenu keyboard test: add getState() stub to useCanvasStore mock - Required after ContextMenu.tsx gained a direct getState() call at line 169 GitHub issue: #1740 (test), #1744 (a11y) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(workspace-server): remove duplicate VerifiedCPSession declaration The branch accidentally added a second func VerifiedCPSession declaration that shadows the real implementation, causing go build to fail with: internal/middleware/session_auth.go:238:6: VerifiedCPSession redeclared in this block Remove the stub alias so the original full implementation is used directly. The function already exports correctly for cross-package use via the VerifiedCPSession() call in discovery.go. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(workspace-server): correct VerifiedCPSession condition in discovery.go Fix Go build error — 'presented' was declared and not used. The cookie fallback check was using `if ok, presented := ...; ok` instead of `if ok, presented := ...; presented`, causing the build to fail in CI. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(workspace-server): fix declared and not used 'presented' in discovery.go Fixes Go build failure: discovery.go:355:10: declared and not used: presented discovery.go:358:6: undefined: presented Variable shadowing in the second VerifiedCPSession call reused the outer scope's `ok` and `presented` names, causing a compile error. Renamed to ok2/presented2 to avoid shadowing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Molecule AI Core-FE <core-fe@agents.moleculesai.app> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 04:30:26 +00:00
Molecule AI Infra-SRE	52504dd4a8	fix(handlers/admin_queue_test): remove unused bytes import CI failure: admin_queue_test.go imports "bytes" but never uses it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 04:29:50 +00:00
Hongming Wang	d53583f9c6	Merge remote-tracking branch 'origin/staging' into fix/restore-quickstart-plus-hotfixes	2026-04-23 21:04:55 -07:00
Hongming Wang	f2a4b6e0d3	fix: dev-mode bypass for IP rate limiter + 429 retry on GET The 600-req/min/IP bucket is sized for SaaS where each tenant has a distinct client IP. On a local Docker setup every panel shares one IP — hydration (/workspaces + /templates + /org/templates + /approvals/pending) plus polling (A2A overlay + activity tabs + approvals + schedule + channels + audit trail) can burst past the bucket inside a minute, blanking the canvas with 429s. The user reported it after dragging workspaces — dragging itself is release-only (savePosition in onNodeDragStop), but the polling that's always running added onto startup tripped the limit. Two-layer fix: Server: RateLimiter.Middleware short-circuits when isDevModeFailOpen is true (MOLECULE_ENV=development + empty ADMIN_TOKEN), matching the Tier-1b hatch already applied to AdminAuth, WorkspaceAuth, and discovery. SaaS production keeps the bucket. Client: api.ts auto-retries a single 429 on idempotent GET requests, waiting the server-provided Retry-After (capped at 20s). Mutations (POST/PUT/PATCH/DELETE) never auto-retry to avoid double-applying. Users on SaaS hitting a legitimate rate-limit spike get one transparent recovery instead of an immediately-blank Canvas. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 20:44:09 -07:00
Hongming Wang	286dcbfd1e	fix(canvas,org): collapse org-imported parents on first paint Importing a 15-workspace org template dropped every child as a freely-positioned card into its parent's coordinate space. Parents with 5-10 kids had the kids spill below the parent's initial min size, producing the "ugly default" layout the user just flagged — a mess of overlapping cards the moment the import completed. Fix: every workspace in an org-template import that HAS children is inserted with `collapsed = true`. Leaf workspaces stay expanded (nothing to hide). The canvas renders a collapsed parent as a compact header-only card with its "N sub" badge — visually identical to the pre-refactor default the user asked for. Double-click on a collapsed parent now EXPANDS it (flipping `collapsed` locally + persisting via PATCH) so the user can drill in to see the subtree. Only once expanded does a second double-click zoom-to-team, matching the prior behaviour. Leaf-first creation order stays the same; the collapsed flag just means "render compact" not "hide from API". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 20:36:55 -07:00
Hongming Wang	507696d88a	fix(canvas,server): address review findings on `3f11df03` Five review findings from the `3f11df03` six-bug commit: 1. Add TestPeers_DevModeFailOpen_{Allows,ClosedWhenAdminTokenSet, ClosedInProduction} covering all three gating states for the security-sensitive dev-mode hatch the prior commit added to /registry/:id/peers. Previously shipped untested — a future refactor could have silently inverted polarity or removed the gate. New tests pin the contract: * MOLECULE_ENV=development + ADMIN_TOKEN="" → allow bearerless * MOLECULE_ENV=development + ADMIN_TOKEN set → require token * MOLECULE_ENV=production → require token 2. ConfigTab handleSave diffs against the RAW parsed YAML / form config instead of the DEFAULT_CONFIG-merged shape. The previous code would silently PATCH tier=1 to the DB when a user deleted the `tier:` line in raw mode (the default-merge substituted 1). Now: only fields the user actually typed participate in the diff. Type guards (typeof === "number" / "string") prevent coercion surprises on malformed YAML. 3. ConfigTab model-save failure no longer lies "Saved". The /workspaces/:id/model PATCH can reject when the runtime doesn't support the chosen model; previously we caught + console.warn'd + showed green Saved, and the user watched the model revert on next reload with no explanation. Now the save path collects a `modelSaveError` and surfaces it via setError with a partial- success message ("Other fields saved, but model update failed: …") so the user sees why. 4. ChannelsTab now surfaces BOTH channels-fetch and adapters-fetch failures, distinguishing them in the error text ("Failed to load connected channels and platforms — try refreshing"). Previously only an adapters failure was visible; a channels failure left the user with an apparently-empty list and no indication the API was unreachable. 5. ChatTab panels drop the redundant aria-hidden attribute. The `hidden`/`flex` Tailwind class already sets display:none, which removes the node from the accessibility tree on its own; the extra aria-hidden invited WAI-ARIA lint warnings if a focusable descendant ever landed inside an inactive panel. Tests: 923 canvas + full Go handler suite pass. 3 new Go tests. No behaviour change on the five prior fixes — this commit tightens their edges per the independent review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 20:29:44 -07:00
Hongming Wang	3f11df031c	fix: six UX bugs (peers auth, scroll, chat tabs, config persist, + visibility) Six bugs reported from a live session — all shippable in one commit: 1. Peers tab 401 on local Docker. The /registry/:id/peers endpoint demands a workspace-scoped bearer token (validateDiscoveryCaller) which the canvas session doesn't hold. Added the same Tier-1b dev-mode fail-open hatch that AdminAuth and WorkspaceAuth already use — gated by MOLECULE_ENV=development + empty ADMIN_TOKEN, so SaaS production stays strict. Exported IsDevModeFailOpen from the middleware package for the handler layer to reuse. 2. Org Templates list unscrollable. OrgTemplatesSection was rendered in the TemplatePalette footer — a div without overflow — so when it expanded to 15+ entries the list extended past the viewport with no scroll. Moved it to the top of the flex-1 overflow-y-auto container. Tall lists now scroll naturally. 3. Chat tab: "My Chat" and "Agent Comms" rendered stacked instead of switching. HTML `hidden` attribute was being overridden by Tailwind's `flex` class (display: flex beats the attribute), so both tabpanels rendered concurrently. Swapped to a conditional Tailwind `hidden`/`flex` class so the inactive panel is display:none with proper CSS specificity. 4. Hermes Config form never persists. handleSave wrote config.yaml but name / tier / runtime / model all live on the workspace row (or the dedicated /workspaces/:id/model endpoint) — the form edited in-memory, the request returned 200, the next reload wiped everything back. Hermes + external runtimes manage their own config inside the container anyway, so writing config.yaml is a no-op for them; skip it. Always diff and PATCH the DB-backed fields that actually changed. 5. Channels "+ Connect" dropdown empty on first open. ChannelsTab's load() used Promise.all with a silent catch — if EITHER the channels or adapters fetch failed, both setters were skipped with no error visible. Switched to Promise.allSettled so each endpoint settles independently, and the adapters failure now surfaces via the top-level error state. 6. Plugin registry always "No plugins in registry". Same silent catch pattern in SkillsTab.tsx — load errors for /plugins, /plugins/sources, and /workspaces/:id/plugins swallowed without logging. Replaced the empty catches with console.warn so future failures are at least visible in devtools. Tests: 923 passing (unchanged). Go handler tests pass. Server rebuilt and running with the peers-auth + collapsed-persistence fixes (pid 15875). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 20:18:30 -07:00
Molecule AI Core-UIUX	8fb5ec0340	fix(handlers): fix Go scoping — presented must live in function scope The short-var declaration inside the if-initializer scoped `presented` only to that if statement, making it undefined on the following `if presented { ... }` block. Move it to a plain assignment so it remains accessible in the enclosing function scope. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 03:10:18 +00:00
Molecule AI Core-UIUX	a46797d466	fix(middleware): rename internal fn to verifiedCPSession, keep public alias The PR #1855 branch contains a newer version of session_auth.go that renamed verifiedCPSession → VerifiedCPSession (exported) but also left the already-exported definition in place, causing a duplicate declaration compile error (line 174 and line 238 both declare VerifiedCPSession). Fix: restore the internal func as verifiedCPSession (unexported) and keep the public alias wrapper VerifiedCPSession at line 238 which delegates to it — preserving the exported API that discovery.go and wsauth_middleware.go depend on. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 03:10:18 +00:00
Molecule AI Core-QA	680f1f50f2	fix(canvas/a11y): restore aria-hidden on backdrop div after cherry-pick conflict Cherry-pick from #1744 left the backdrop div without aria-hidden="true" (the outer dialog div got it instead). Re-apply aria-hidden="true" to the backdrop div so screen readers skip the clickable overlay layer. Also revert test assertion from bg-black → bg-black/70 to match the exact class applied to the backdrop div.	2026-04-24 03:10:18 +00:00
Hongming Wang	4fd7f1e84c	fix(canvas): tighten rescue + cap toast + cover paths with tests Three follow-up review findings from the `c2b2e13a` review: 1. Rescue heuristic uses pure bbox-non-overlap. The previous `position.x < 0` branch rescued any child whose parent was later dragged past it, even when the layout was clearly recoverable (e.g. relative -40, child still overlaps parent). New rule: rescue iff the child's bbox has zero overlap with the parent's bbox — self-calibrating, scales with user-resized parents, catches screenshot-case and legacy huge-positive data. 2. Toast caps failed-name list at 3 and appends "and N more". Stops a 50-node partial failure from overflowing the toast container. 3. Cycle guard on selection-roots walk in batchNest. Corrupt parentId data can't send the loop infinite now. Cheap defensive guard — one Set per selected node. Tests added (923 total, up from 918): * canvas-topology.test: 4 rescue scenarios — screenshot case (zero-overlap rescue), negative drift kept, huge-positive rescued, user-resized layout kept. * canvas.test: selection-roots filter on a 3-level chain. * workspace_crud test: PATCH {collapsed:true} runs the UPDATE. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 20:08:14 -07:00
Hongming Wang	c2b2e13abe	fix(canvas): address code-review findings on the Canvas refactor Five issues surfaced in the review of `50b53784`. Each was either a real bug waiting to hit users or a silent failure mode. 1. Topology rescue no longer teleports user-resized children. Rescue was comparing against parentMinSize(childCount), so any child the user had placed in space the parent was resized into got snapped to the default grid on reload — undoing the layout. Now rescue fires only on obviously corrupt data: negative relative coords (legacy pre-nesting absolute positions that landed above/left of their assigned parent) or values past an MAX_PLAUSIBLE_OFFSET threshold. Children just-past the initial minimum are left alone. 2. batchNest now filters to selection-roots before planning. Previously selecting both A and A's descendant B and dragging into T yanked B out of A to become a sibling under T. Users reasonably expect the A subtree to move intact. The new pass drops any selected node whose ancestor is also selected — those follow their ancestor via React Flow's parent binding. 3. batchNest surfaces partial failure via showToast. Previously silent: 2 of 5 PATCHes fail, user sees 3 cards re-parented + 2 snapped back with no explanation. Now names the failed cards. 4. confirmNest closes the nest dialog BEFORE dispatching the async store action, so a second drag can't kick off a competing batch while the first is still in flight. 5. collapsed is now persisted. The Go workspace_crud.go Update handler ignored the `collapsed` field, so user-initiated collapse round-tripped to an expanded state on next hydrate. Added the PATCH branch (`UPDATE workspaces SET collapsed = ...`) so the state survives reload. Nits cleaned: * Removed dead dragStartParentRef in useDragHandlers. * Swapped redundant `node.data as WorkspaceNodeData` casts for a named WorkspaceNode type alias. * Canvas.tsx SR-live region now reads n.parentId (matches MiniMap + RF's native field) instead of the mirror n.data.parentId. Tests added (918 total, up from 915): * batchNest happy path — 2-root selection fires 2 combined PATCHes carrying parent_id + x + y, not 2×N sequential round-trips. * batchNest ancestor+descendant selection — subtree stays intact. * batchNest partial failure rollback — only the rejected nodes revert; successful ones stay committed. Backend change is single-line (collapsed PATCH branch); all workspace_crud Go tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 19:58:44 -07:00
molecule-ai[bot]	8e46cc1676	Merge branch 'staging' into test/2026-04-23-regression-suite	2026-04-24 02:45:12 +00:00
Molecule AI Infra-SRE	bf3e453160	fix(handlers/admin_queue): remove unused db import Resolves CI build failure on PR #1950: internal/handlers/admin_queue.go:8:2: "github.com/Molecule-AI/molecule-monorepo/platform/internal/db" imported and not used Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 02:22:16 +00:00
Molecule AI Infra-Runtime-BE	a1b803ca7a	fix(admin/a2a_queue): add drop-stale endpoint for post-incident queue cleanup Issue #1947: after incidents, PM agents inherit hour-old TASK-priority queue items from ICs that were correctly reporting "X is broken" while X was actually broken. Once X is fixed those items are stale noise — PMs spend ~5 min each writing "thanks, the issue is resolved". Adds: - DropStaleQueueItems() in a2a_queue.go: UPDATE ... SET status='dropped' for queued items older than maxAgeMinutes. Uses FOR UPDATE SKIP LOCKED to stay concurrency-safe with concurrent drain calls. - AdminQueueHandler in admin_queue.go: POST /admin/a2a-queue/drop-stale (AdminAuth, ?max_age_minutes=N, &workspace_id=<id>). Returns {dropped: N}. - admin_queue_test.go: HTTP-level tests for param validation and response shape. - Router registration for the new endpoint. Usage during incident recovery: curl -X POST /admin/a2a-queue/drop-stale?max_age_minutes=120 # scoped to one workspace: curl -X POST /admin/a2a-queue/drop-stale?max_age_minutes=120&workspace_id=<uuid> Closes #1947. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 02:08:35 +00:00
molecule-ai[bot]	3e9b7f8ad6	Merge branch 'staging' into fix/1933-bump-github-app-auth-plugin	2026-04-24 02:04:47 +00:00
molecule-ai[bot]	10c4fcc7fe	Merge branch 'staging' into test/2026-04-23-regression-suite	2026-04-24 02:04:46 +00:00
molecule-ai[bot]	e8b5f409be	test(handlers): add 5 TestKI005 terminal guard regression tests (#1938 ) * chore: sync staging to main — 1188 commits, 5 conflicts resolved (#1743) * fix(docs): update architecture + API reference paths for workspace-server rename Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: update workspace script comments for workspace-template → workspace rename Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: ChatTab comment path for workspace-server rename Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: add BatchActionBar unit tests (7 tests) Covers: render threshold, count badge, action buttons, clear selection, ConfirmDialog trigger, ARIA toolbar role. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: update publish workflow name + document staging-first flow Default branch is now staging for both molecule-core and molecule-controlplane. PRs target staging, CEO merges staging → main to promote to production. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(ci): update working-directory for workspace-server/ and workspace/ renames - platform-build: working-directory platform → workspace-server - golangci-lint: working-directory platform → workspace-server - python-lint: working-directory workspace-template → workspace - e2e-api: working-directory platform → workspace-server - canvas-deploy-reminder: fix duplicate if: key (merged into single condition) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: add mol_pk_ and cfut_ to pre-commit secret scanner Partner API keys (mol_pk_) and Cloudflare tokens (cfut_) now caught by the pre-commit hook alongside sk-ant-, ghp_, AKIA. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore(canvas): enable Turbopack for dev server — faster HMR next dev --turbopack for significantly faster dev server startup and hot module replacement. Build script unchanged (Turbopack for next build is still experimental). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(db): schema_migrations tracking — migrations only run once Adds a schema_migrations table that records which migration files have been applied. On boot, only new migrations execute — previously applied ones are skipped. This eliminates: - Re-running all 33 migrations on every restart - Risk of non-idempotent DDL failing on restart - Unnecessary log noise from re-applying unchanged schema First boot auto-populates the tracking table with all existing migrations. Subsequent boots only apply new ones. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(scheduler): strip CRLF from cron prompts on insert/update (closes #958) Windows CRLF in org-template prompt text caused empty agent responses and phantom-producing detection. Strips \r at the handler level before DB persist, plus a one-time migration to clean existing rows. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(security): strip current_task from public GET /workspaces/:id (closes #955) current_task exposes live agent instructions to any caller with a valid workspace UUID. Also strips last_sample_error and workspace_dir from the public endpoint. These fields remain available through authenticated workspace-specific endpoints. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore(canvas): initialize shadcn/ui — components.json + cn utility Sets up shadcn/ui CLI so new components can be added with `npx shadcn add <component>`. Uses new-york style, zinc base color, no CSS variables (matches existing Tailwind-only approach). Adds clsx + tailwind-merge for the cn() utility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(security): GLOBAL memory delimiter spoofing + pin MCP npm version SAFE-T1201 (#807): Escape [MEMORY prefix in GLOBAL memory content on write to prevent delimiter-spoofing prompt injection. Content stored as "[_MEMORY " so it renders as text, not structure, when wrapped with the real delimiter on read. SAFE-T1102 (#805): Pin @molecule-ai/mcp-server@1.0.0 in .mcp.json.example. Prevents supply-chain attacks via unpinned npx -y. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: schema_migrations tracking — 4 cases (first boot, re-boot, mixed, down.sql filter) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: verify current_task + last_sample_error + workspace_dir stripped from public GET Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: GLOBAL memory delimiter spoofing escape + LOCAL scope untouched - TestCommitMemory_GlobalScope_DelimiterSpoofingEscaped: verifies [MEMORY prefix is escaped to [_MEMORY before DB insert (SAFE-T1201, #807) - TestCommitMemory_LocalScope_NoDelimiterEscape: LOCAL scope stored verbatim Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(security): Phase 35.1 — SG lockdown script for tenant EC2 instances Restricts tenant EC2 port 8080 ingress to Cloudflare IP ranges only, blocking direct-IP access. Supports two modes: 1. Lock to CF IPs (Worker deployment): 14 IPv4 CIDR rules 2. Close ingress entirely (Tunnel deployment): removes 0.0.0.0/0 only Usage: bash scripts/lockdown-tenant-sg.sh --sg-id sg-xxxxx bash scripts/lockdown-tenant-sg.sh --sg-id sg-xxxxx --close-ingress bash scripts/lockdown-tenant-sg.sh --sg-id sg-xxxxx --dry-run Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * ci: update GitHub Actions to current stable versions (closes #780) - golangci/golangci-lint-action@v4 → v9 - docker/setup-qemu-action@v3 → v4 - docker/setup-buildx-action@v3 → v4 - docker/build-push-action@v5 → v6 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(opencode): RFC 2119 — 'should not' → 'must not' for SAFE-T1201 warning (closes #861) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(canvas): degraded badge WCAG AA contrast — amber-400 → amber-300 (closes #885) amber-400 on zinc-900 is 5.4:1 (AA pass). amber-300 is 6.9:1 (AA+AAA pass) and matches the rest of the amber usage in WorkspaceNode (currentTask, error detail, badge chip). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(platform): 409 guard on /hibernate when active_tasks > 0 (closes #822) Phase 35.1 / #799 security condition C3 — prevents operator from accidentally killing a mid-task agent. Behavior: - active_tasks == 0 → proceed as before - active_tasks > 0 && ?force=true → log [WARN] + proceed - active_tasks > 0 && no force → 409 with {error, active_tasks} 2 new tests: TestHibernateHandler_ActiveTasks_Returns409, TestHibernateHandler_ActiveTasks_ForceTrue_Returns200. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(platform): track last_outbound_at for silent-workspace detection (closes #817) Sub of #795 (phantom-busy post-mortem). Adds last_outbound_at TIMESTAMPTZ column to workspaces. Bumped async on every successful outbound A2A call from a real workspace (skip canvas + system callers). Exposed in GET /workspaces/:id response as "last_outbound_at". PM/Dev Lead orchestrators can now detect workspaces that have gone silent despite being online (> 2h + active cron = phantom-busy warning). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(workspace): snapshot secret scrubber (closes #823) Sub-issue of #799, security condition C4. Standalone module in workspace/lib/snapshot_scrub.py with three public functions: - scrub_content(str) → str: regex-based redaction of secret patterns - is_sandbox_content(str) → bool: detect run_code tool output markers - scrub_snapshot(dict) → dict: walk memories, scrub each, drop sandbox entries Patterns covered: sk-ant-/sk-proj-, ghp_/ghs_/github_pat_, AKIA, cfut_, mol_pk_, ctx7_, Bearer, env-var assignments, base64 blobs ≥33 chars. 21 unit tests, 100% coverage on new code. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(security): cap webhook + config PATCH bodies (H3/H4) Two HIGH-severity DoS surfaces: both handlers read the entire HTTP body with io.ReadAll(r.Body) and no upper bound, so a caller streaming a multi-gigabyte request could exhaust memory on the tenant instance before we even validated the JSON. H3 (Discord webhook): wrap Body in io.LimitReader with a 1 MiB cap. Discord Interactions payloads are well under 10 KiB in practice. H4 (workspace config PATCH): wrap Body in http.MaxBytesReader with a 256 KiB cap. Real configs are <10 KiB; jsonb handles the cap comfortably. Returns 413 Request Entity Too Large on overflow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(security): C4 — close AdminAuth fail-open race on hosted-SaaS fresh install Pre-launch review blocker. AdminAuth's Tier-1 fail-open fired whenever the workspace_auth_tokens table was empty — including the window between a hosted tenant EC2 booting and the first workspace being created. In that window, every admin-gated route (POST /org/import, POST /workspaces, POST /bundles/import, etc.) was reachable without a bearer, letting an attacker pre-empt the first real user by importing a hostile workspace into a freshly provisioned instance. Fix: fail-open is now ONLY applied when ADMIN_TOKEN is unset (self- hosted dev with zero auth configured). Hosted SaaS always sets ADMIN_TOKEN at provision time, so the branch never fires in prod and requests with no bearer get 401 even before the first token is minted. Tier-2 / Tier-3 paths unchanged. The old TestAdminAuth_684_FailOpen_AdminTokenSet_NoGlobalTokens test was codifying exactly this bug (asserting 200 on fresh install with ADMIN_TOKEN set). Renamed and flipped to TestAdminAuth_C4_AdminTokenSet_FreshInstall_FailsClosed asserting 401. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(security): scrub workspace-server token + upstream error logs Two findings from the pre-launch log-scrub audit: 1. handlers/workspace_provision.go:548 logged `token[:8]` — the exact H1 pattern that panicked on short keys. Even with a length guard, leaking 8 chars of an auth token into centralized logs shortens the search space for anyone who gets log-read access. Now logs only `len(token)` as a liveness signal. 2. provisioner/cp_provisioner.go:101 fell back to logging the raw control-plane response body when the structured {"error":"..."} field was absent. If the CP ever echoed request headers (Authorization) or a portion of user-data back in an error path, the bearer token would end up in our tenant-instance logs. Now logs the byte count only; the structured error remains in place for the happy path. Also caps the read at 64 KiB via io.LimitReader to prevent log-flood DoS from a compromised upstream. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(security): tenant CPProvisioner attaches CP bearer on all calls Completes the C1 integration (PR #50 on molecule-controlplane). The CP now requires Authorization: Bearer <PROVISION_SHARED_SECRET> on all three /cp/workspaces/* endpoints; without this change the tenant-side Start/Stop/IsRunning calls would all 401 (or 404 when the CP's routes refused to mount) and every workspace provision from a SaaS tenant would silently fail. Reads MOLECULE_CP_SHARED_SECRET, falling back to PROVISION_SHARED_SECRET so operators can use one env-var name on both sides of the wire. Empty value is a no-op: self-hosted deployments with no CP or a CP that doesn't gate /cp/workspaces/* keep working as before. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(canvas): add 15s fetch timeout on API calls Pre-launch audit flagged api.ts as missing a timeout on every fetch. A slow or hung CP response would leave the UI spinning indefinitely with no way for the user to abort — effectively a client-side DoS. 15s is long enough for real CP queries (slowest observed is Stripe portal redirect at ~3s) and short enough that a stalled backend surfaces as a clear error with a retry affordance. Uses AbortSignal.timeout (widely supported since 2023) so the abort propagates through React Query / SWR consumers cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(e2e): stop asserting current_task on public workspace GET (#966) PR #966 intentionally stripped current_task, last_sample_error, and workspace_dir from the public GET /workspaces/:id response to avoid leaking task bodies to anyone with a workspace bearer. The E2E smoke test hadn't caught up — it was still asserting "current_task":"..." on the single-workspace GET, which made every post-#966 CI run fail with '60 passed, 2 failed'. Swap the per-workspace asserts to check active_tasks (still exposed, canonical busy signal) and keep the list-endpoint check that proves admin-auth'd callers still see current_task end-to-end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: 2026-04-19 SaaS prod migration notes Captures the 10-PR staging→main cutover: what shipped, the three new Railway prod env vars (PROVISION_SHARED_SECRET / EC2_VPC_ID / CP_BASE_URL), and the sharp edge for existing tenants — their containers pre-date PR #53 so they still need MOLECULE_CP_SHARED_SECRET added manually (or a re-provision) before the new CPProvisioner's outbound bearer works. Also includes a post-deploy verification checklist and rollback plan. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(ws-server): pull env from CP on startup Paired with molecule-controlplane PR #55 (GET /cp/tenants/config). Lets existing tenants heal themselves when we rotate or add a CP-side env var (e.g. MOLECULE_CP_SHARED_SECRET landing earlier today) without any ssh or re-provision. Flow: main() calls refreshEnvFromCP() before any other os.Getenv read. The helper reads MOLECULE_ORG_ID + ADMIN_TOKEN from the baked-in user-data env, GETs {MOLECULE_CP_URL}/cp/tenants/config with those credentials, and applies the returned string map via os.Setenv so downstream code (CPProvisioner, etc.) sees the fresh values. Best-effort semantics: - self-hosted / no MOLECULE_ORG_ID → no-op (return nil) - CP unreachable / non-200 → log + return error (main keeps booting) - oversized values (>4 KiB each) rejected to avoid env pollution - body read capped at 64 KiB Once this image hits GHCR, the 5-minute tenant auto-updater picks it up, the container restarts, refresh runs, and every tenant has MOLECULE_CP_SHARED_SECRET within ~5 minutes — no operator toil. Also fixes workspace-server/.gitignore so `server` no longer matches the cmd/server package dir — it only ignored the compiled binary but pattern was too broad. Anchored to `/server`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(canary): smoke harness + GHA verification workflow (Phase 2) Post-deploy verification for staging tenant images. Runs against the canary fleet after each publish-workspace-server-image build — catches auto-update breakage (a la today's E2E current_task drift) before it propagates to the prod tenant fleet that auto-pulls :latest every 5 min. scripts/canary-smoke.sh iterates a space-sep list of canary base URLs (paired with their ADMIN_TOKENs) and checks: - /admin/liveness reachable with admin bearer (tenant boot OK) - /workspaces list responds (wsAuth + DB path OK) - /memories/commit + /memories/search round-trip (encryption + scrubber) - /events admin read (AdminAuth C4 path) - /admin/liveness without bearer returns 401 (C4 fail-closed regression) .github/workflows/canary-verify.yml runs after publish succeeds: - 6-min sleep (tenant auto-updater pulls every 5 min) - bash scripts/canary-smoke.sh with secrets pulled from repo settings - on failure: writes a Step Summary flagging that :latest should be rolled back to prior known-good digest Phase 3 follow-up will split the publish workflow so only :staging-<sha> ships initially, and canary-verify's green gate is what promotes :staging-<sha> → :latest. This commit lays the test gate alone so we have something running against tenants immediately. Secrets to set in GitHub repo settings before this workflow can run: - CANARY_TENANT_URLS (space-sep list) - CANARY_ADMIN_TOKENS (same order as URLs) - CANARY_CP_SHARED_SECRET (matches staging CP PROVISION_SHARED_SECRET) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(canary): gate :latest tag promotion on canary verify green (Phase 3) Completes the canary release train. Before this, publish-workspace- server-image.yml pushed both :staging-<sha> and :latest on every main merge — meaning the prod tenant fleet auto-pulled every image immediately, before any post-deploy smoke test. A broken image (think: this morning's E2E current_task drift, but shipped at 3am instead of caught in CI) would have fanned out to every running tenant within 5 min. Now: - publish workflow pushes :staging-<sha> ONLY - canary tenants are configured to track :staging-<sha>; they pick up the new image on their next auto-update cycle - canary-verify.yml runs the smoke suite (Phase 2) after the sleep - on green: a new promote-to-latest job uses crane to remotely retag :staging-<sha> → :latest for both platform and tenant images - prod tenants auto-update to the newly-retagged :latest within their usual 5-min window - on red: :latest stays frozen on prior good digest; prod is untouched crane is pulled onto the runner (~4 MB, GitHub release) rather than docker-daemon retag so the workflow doesn't need a privileged runner. Rollback: if canary passed but something surfaces post-promotion, operator runs "crane tag ghcr.io/molecule-ai/platform:<prior-good-sha> latest" manually. A follow-up can wrap that in a Phase 4 admin endpoint / script. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(canary): rollback-latest script + release-pipeline doc (Phase 4) Closes the canary loop with the escape hatch and a single place to read about the whole flow. scripts/rollback-latest.sh <sha> uses crane to retag :latest ← :staging-<sha> for BOTH the platform and tenant images. Pre-checks the target tag exists and verifies the :latest digest after the move so a bad ops typo doesn't silently promote the wrong thing. Prod tenants auto-update to the rolled-back digest within their 5-min cycle. Exit codes: 0 = both retagged, 1 = registry/tag error, 2 = usage error. docs/architecture/canary-release.md The one-page map of the pipeline: how PR → main → staging-<sha> → canary smoke → :latest promotion works end-to-end, how to add a canary tenant, how to roll back, and what this gate explicitly does NOT catch (prod-only data, config drift, cross-tenant bugs). No code changes in the CP or workspace-server — this PR is shell + docs only, so it's safe to land independently of the other Phase {1,1.5,2,3} PRs still in review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(ws-server): cover CPProvisioner — auth, env fallback, error paths Post-merge audit flagged cp_provisioner.go as the only new file from the canary/C1 work without test coverage. Fills the gap: - NewCPProvisioner_RequiresOrgID — self-hosted without MOLECULE_ORG_ID refuses to construct (avoids silent phone-home to prod CP). - NewCPProvisioner_FallsBackToProvisionSharedSecret — the operator ergonomics of using one env-var name on both sides of the wire. - AuthHeader noop + happy path — bearer only set when secret is set. - Start_HappyPath — end-to-end POST to stubbed CP, bearer forwarded, instance_id parsed out of response. - Start_Non201ReturnsStructuredError — when CP returns structured {"error":"…"}, that message surfaces to the caller. - Start_NoStructuredErrorFallsBackToSize — regression gate for the anti-log-leak change from PR #980: raw upstream body must NOT appear in the error, only the byte count. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * perf(scheduler): collapse empty-run bump to single RETURNING query The phantom-producer detector (#795) was doing UPDATE + SELECT in two roundtrips — first incrementing consecutive_empty_runs, then re- reading to check the stale threshold. Switch to UPDATE ... RETURNING so the post-increment value comes back in one query. Called once per schedule per cron tick. At 100 tenants × dozens of schedules per tenant, the halved DB traffic on the empty-response path is measurable, not just cosmetic. Also now properly logs if the bump itself fails (previously it silent- swallowed the ExecContext error and still ran the SELECT, which would confuse debugging). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(canvas): /orgs landing page for post-signup users CP's Callback handler redirects every new WorkOS session to APP_URL/orgs, but canvas had no such route — new users hit the canvas Home component, which tries to call /workspaces on a tenant that doesn't exist yet, and saw a confusing error. This PR plugs that gap with a dedicated landing page that: - Bounces anonymous visitors back to /cp/auth/login - Zero-org users see a slug-picker (POST /cp/orgs, refresh) - For each existing org, shows status + CTA: * awaiting_payment → amber "Complete payment" → /pricing?org=… * running → emerald "Open" → https://<slug>.moleculesai.app * failed → "Contact support" → mailto * provisioning → read-only "provisioning…" - Surfaces errors inline with a Retry button Deliberately server-light: one GET /cp/orgs, no WebSocket, no canvas store hydration. Goal is to move the user from signup to either Stripe Checkout or their tenant URL with one click each. Closes the last UX gap between the BILLING_REQUIRED gate landing on the CP and real users being able to complete a signup today. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(canvas): post-checkout UX — Stripe success lands on /orgs with banner Two small polish items that together close the signup-to-running-tenant flow for real users: 1. Stripe success_url now points at /orgs?checkout=success instead of the current page (was pricing). The old behavior left people staring at plan cards with no indication payment went through — the new behavior drops them right onto their org list where they can watch the status flip. 2. /orgs shows a green "Payment confirmed, workspace spinning up" banner when it sees ?checkout=success, then clears the query param via replaceState so a reload doesn't show it again. 3. /orgs now polls every 5s while any org is awaiting_payment or provisioning. Users see the Stripe webhook's effect live — no manual refresh needed — and once every org settles the polling stops so idle tabs don't hammer /cp/orgs. Paired with PR #992 (the /orgs page itself) this makes the end-to-end flow on BILLING_REQUIRED=true deployments feel right: /pricing → Stripe → /orgs?checkout=success → banner → live poll → "Open" button when org.status transitions to running. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(canvas): bump billing test for /orgs success_url * fix(ci): clone sibling plugin repo so publish-workspace-server-image builds Publish has been failing since the 2026-04-18 open-source restructure (#964's merge) because workspace-server/Dockerfile still COPYs ./molecule-ai-plugin-github-app-auth/ but the restructure moved that code out to its own repo. Every main merge since has produced a "failed to compute cache key: /molecule-ai-plugin-github-app-auth: not found" error — prod images haven't moved. Fix: add an actions/checkout step that fetches the plugin repo into the build context before docker build runs. Private-repo safe: uses PLUGIN_REPO_PAT secret (fine-grained PAT with Contents:Read on Molecule-AI/molecule-ai-plugin-github-app-auth). Falls back to the default GITHUB_TOKEN if the plugin repo is public. Ops: set repo secret PLUGIN_REPO_PAT before the next main merge, or publish will fail with a 404 on the checkout step. Also gitignores the cloned dir so local dev builds don't accidentally commit it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci(promote-latest): workflow_dispatch to retag :staging-<sha> → :latest Escape hatch for the initial rollout window (canary fleet not yet provisioned, so canary-verify.yml's automatic promotion doesn't fire) AND for manual rollback scenarios. Uses the default GITHUB_TOKEN which carries write:packages on repo- owned GHCR images, so no new secrets are needed. crane handles the remote retag without pulling or pushing layers. Validates the src tag exists before retagging + verifies the :latest digest post-retag so a typo can't silently promote the wrong image. Trigger from Actions → promote-latest → Run workflow → enter the short sha (e.g. "4c1d56e"). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci(promote-latest): run on self-hosted mac mini (GH-hosted quota blocked) * ci(promote-latest): suppress brew cleanup that hits perm-denied on shared runner * feat(canvas): Phase 5 — credit balance pill + low-balance banner Adds the UI surface for the credit system to /orgs: - CreditsPill next to each org row. Tone shifts from zinc → amber at 10% of plan to red at zero. - LowCreditsBanner appears under the pill for running orgs when the balance crosses thresholds: overage_used > 0 → "overage active", balance <= 0 → "out of credits, upgrade", trial tail → "trial almost out". - Pure helpers extracted to lib/credits.ts so formatCredits, pillTone, and bannerKind are unit-tested without jsdom. Backend List query now returns credits_balance / plan_monthly_credits / overage_used_credits / overage_cap_credits so no second round-trip is needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(canvas): ToS gate modal + us-east-2 data residency notice Wraps /orgs in a TermsGate that polls /cp/auth/terms-status on mount and overlays a blocking modal when the current terms version hasn't been accepted yet. "I agree" POSTs /cp/auth/accept-terms and dismisses the modal; the backend records IP + UA as GDPR Art. 7 proof-of-consent. Also adds a short data residency notice under the page header: workspaces run in AWS us-east-2 (Ohio, US). An EU region selector is a future lift once the infra is provisioned there. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(scheduler): defer cron fires when workspace busy instead of skipping (#969) Previously, the scheduler skipped cron fires entirely when a workspace had active_tasks > 0 (#115). This caused permanent cron misses for workspaces kept perpetually busy by the 5-min Orchestrator pulse — work crons (pick-up-work, PR review) were skipped every fire because the agent was always processing a delegation. Measured impact on Dev Lead: 17 context-deadline-exceeded timeouts in 2 hours, ~30% of inter-agent messages silently dropped. Fix: when workspace is busy, poll every 10s for up to 2 minutes waiting for idle. If idle within the window, fire normally. If still busy after 2 min, fall back to the original skip behavior. This is a minimal, safe change: - No new goroutines or channels - Same fire path once idle - Bounded wait (2 min max, won't block the scheduler pool) - Falls back to skip if workspace never becomes idle Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(mcp): scrub secrets in commit_memory MCP tool path (#838 sibling) PR #881 closed SAFE-T1201 (#838) on the HTTP path by wiring redactSecrets() into MemoriesHandler.Commit — but the sibling code path on the MCP bridge (MCPHandler.toolCommitMemory) was left with only the TODO comment. Agents calling commit_memory via the MCP tool bridge are the PRIMARY attack vector for #838 (confused / prompt-injected agent pipes raw tool-response text containing plain-text credentials into agent_memories, leaking into shared TEAM scope). The HTTP path is only exercised by canvas UI posts, so the MCP gap was the hotter one. Change: workspace-server/internal/handlers/mcp.go:725 - TODO(#838): run _redactSecrets(content) before insert — plain-text - API keys from tool responses must not land in the memories table. + SAFE-T1201 (#838): scrub known credential patterns before persistence… + content, _ = redactSecrets(workspaceID, content) Reuses redactSecrets (same package) so there's no duplicated pattern list — a future-added pattern in memories.go automatically covers the MCP path too. Tests added in mcp_test.go: - TestMCPHandler_CommitMemory_SecretInContent_IsRedactedBeforeInsert Exercises three patterns (env-var assignment, Bearer token, sk-…) and uses sqlmock's WithArgs to bind the exact REDACTED form — so a regression (removing the redactSecrets call) fails with arg-mismatch rather than silently persisting the secret. - TestMCPHandler_CommitMemory_CleanContent_PassesThrough Regression guard — benign content must NOT be altered by the redactor. NOTE: unable to run `go test -race ./...` locally (this container has no Go toolchain). The change is mechanical reuse of an already-shipped function in the same package; CI must validate. The sqlmock patterns mirror the existing TestMCPHandler_CommitMemory_LocalScope_Success test exactly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(ci): move canary-verify to self-hosted runner GitHub-hosted ubuntu-latest runs on this repo hit "recent account payments have failed or your spending limit needs to be increased" — same root cause as the publish + CodeQL + molecule-app workflow moves earlier this quarter. canary-verify was the last one still on ubuntu-latest. Switches both jobs to [self-hosted, macos, arm64]. crane install switched from Linux tarball to brew (matches promote-latest.yml's install pattern + avoids /usr/local/bin write perms on the shared mac mini). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(canvas): pin AbortSignal timeout regression + cover /orgs landing page Two independent test additions that harden the surface freshly landed on staging via PRs #982 (canvas fetch timeout), #992 (/orgs landing), #994 (post-checkout redirect to /orgs). canvas/src/lib/__tests__/api.test.ts (+74 lines, 7 new tests) - GET/POST/PATCH/PUT/DELETE each pass an AbortSignal to fetch - TimeoutError (DOMException name=TimeoutError) propagates to the caller - Each request installs its own signal — no shared module-level controller that would allow one slow request to cancel an unrelated fast one This is the hardening nit I flagged in my APPROVE-w/-nit review of fix/canvas-api-fetch-timeout. Landing as a follow-up now that #982 is in staging. canvas/src/app/__tests__/orgs-page.test.tsx (+251 lines, new file, 10 tests) - Auth guard: signed-out → redirectToLogin and no /cp/orgs fetch - Error state: failed /cp/orgs → Error message + Retry button - Empty list: CreateOrgForm renders - CTA by status: running → "Open" link targets {slug}.moleculesai.app awaiting_payment → "Complete payment" → /pricing?org=<slug> failed → "Contact support" mailto - Post-checkout: ?checkout=success renders CheckoutBanner AND history.replaceState scrubs the query param - Fetch contract: /cp/orgs called with credentials:include + AbortSignal Local baseline on origin/staging tip `845ac47`: canvas vitest: 50 files / 778 tests, all green canvas build: clean, /orgs route present (2.83 kB / 105 kB first-load) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test(canvas): cover /orgs 5s polling on in-flight orgs The test docstring promised polling coverage but I'd only wired the describe-block header, not the actual tests. Closing that gap — vitest fake timers drive three cases: - `provisioning` org → 2nd fetch fires after 5.1s advance - all `running` → no 2nd fetch even after 10s advance - `awaiting_payment` org, unmount before timer fires → no post-unmount fetch (cleanup correctly clears the pollTimer) The unmount case is the meaningful one: without it a fast nav-away leaves the 5s interval chasing the CP forever. page.tsx L97-99 does clear the timer; the test pins the contract. Local baseline on origin/staging tip `845ac47` + this branch: canvas vitest: 50 files / 781 tests, all green (+3 vs prior commit) canvas build: clean Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * ci(codeql): cover main + staging via workflow GitHub's UI-configured "Code quality" scan only fires on the default branch (staging), which leaves every staging→main promotion PR unscanned. The "On push and pull requests to" field in the UI has no dropdown; multi-branch scanning on private repos without GHAS isn't available there. Workflow file gives us the control we can't get in the UI: triggers on push + pull_request for both branches. Runs on the same self-hosted mac mini via [self-hosted, macos, arm64]. upload: never — GHAS isn't enabled on this repo so the SARIF upload API 403s. Keep results locally, filter to error+warning severity, fail the PR check on findings, publish SARIF as a workflow artifact. Flipping upload: never → always after GHAS is enabled (if ever) is a one-line change. Picks up the review-flagged improvements from the earlier closed PR: - jq install step (brew, no assumption it's present) - severity filter (error+warning only, drops noisy note-level) - set -euo pipefail - SARIF glob (file name doesn't match matrix language id) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bundle/exporter): add rows.Err() after child workspace enumeration Silent data loss on mid-cursor DB errors — partial sub-workspace bundles returned instead of surfacing the iteration error. Adds rows.Err() check after the SELECT id FROM workspaces query in Export(), mirroring the pattern already used in scheduler.go and handlers with similar recursion patterns. Closes: R1 MISSING-ROWS-ERR findings (bundle/exporter.go) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(a11y): WorkspaceNode font floor, contrast, focus rings (Cycle 10) C1: skills badge spans text-[7px]→text-[10px]; "+N more" overflow text-[7px] text-zinc-500→text-[10px] text-zinc-400 C2: Team section label text-[7px] text-zinc-600→text-[10px] text-zinc-400 H4: status label text-[9px]→text-[10px]; active-tasks count text-[9px] text-amber-300/80→text-[10px] text-amber-300 (remove opacity modifier per design-system contrast rule); current-task text text-[9px] text-amber-300/70→text-[10px] text-amber-300 L1: add focus-visible:ring-2 focus-visible:ring-blue-500/70 to the Restart button (independently Tab-focusable inside role="button" wrapper) and to the Extract-from-team button in TeamMemberChip; TeamMemberChip role="button" div already has the focus ring (COVERED, no change) 762/762 tests pass · build clean Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(ci): replace sleep 360 with health-check poll in canary-verify (#1013) The canary-verify workflow blocked the self-hosted runner for a fixed 6 minutes regardless of whether canaries had already updated. This wastes the runner slot when canaries update in 2-3 minutes. Fix: poll each canary's /health endpoint every 30s for up to 7 min. Exit early when all canaries report the expected SHA. Falls back to proceeding after timeout — the smoke suite validates regardless. Typical time saving: ~3-4 minutes per canary verify run. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(gate-1): remove unused fireEvent import (#1011) Mechanical lint fix. github-code-quality[bot] flagged unused import on line 18 — fireEvent is imported but never referenced in the test file. Removing it clears the code quality gate without changing any test behaviour. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat: event-driven cron triggers + auto-push hook for agent productivity Three changes to boost agent throughput: 1. Event-driven cron triggers (webhooks.go): GitHub issues/opened events fire all "pick-up-work" schedules immediately. PR review/submitted events fire "PR review" and "security review" schedules. Uses next_run_at=now() so the scheduler picks them up on next tick. 2. Auto-push hook (executor_helpers.py): After every task completion, agents automatically push unpushed commits and open a PR targeting staging. Guards: only on non-protected branches with unpushed work. Uses /usr/local/bin/git and /usr/local/bin/gh wrappers with baked-in GH_TOKEN. Never crashes the agent — all errors logged and continued. 3. Integration (claude_sdk_executor.py): auto_push_hook() called in the _execute_locked finally block after commit_memory. Closes productivity gap where agents wrote code but never pushed, and where work crons only fired on timers instead of reacting to events. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: disable schedules when workspace is deleted (#1027) When a workspace is deleted (status set to 'removed'), its schedules remained enabled, causing the scheduler to keep firing cron jobs for non-existent containers. Add a cascade disable query alongside the existing token revocation and canvas layout cleanup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: stop hardcoding CLAUDE_CODE_OAUTH_TOKEN in required_env (#1028) The provisioner was unconditionally writing CLAUDE_CODE_OAUTH_TOKEN into config.yaml's required_env for all claude-code workspaces. When the baked token expired, preflight rejected every workspace — even those with a valid token injected via the secrets API at runtime. Changes: - workspace_provision.go: remove hardcoded required_env for claude-code and codex runtimes; tokens are injected at container start via secrets - workspace_provision_test.go: flip assertion to reject hardcoded token Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: add cascade schedule disable tests for #1027 - TestWorkspaceDelete_DisablesSchedules — leaf workspace delete disables its schedules - TestWorkspaceDelete_CascadeDisablesDescendantSchedules — parent+child+grandchild cascade - TestWorkspaceDelete_ScheduleDisableOnlyTargetsDeletedWorkspace — negative test Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: multiple platform handler bug fixes - secrets.go: Log RowsAffected errors instead of silently discarding them - a2a_proxy.go: Add 60s safety timeout to a2aClient HTTP client - terminal.go: Fix defer ordering - always close WebSocket conn on error, only defer resp.Close() after successful exec attach - webhooks.go: Add shortSHA() helper to safely handle empty HeadSHA Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(runtime): inject HMA memory instructions at platform level (#1047) Every agent now gets hierarchical memory instructions in their system prompt automatically — no template configuration needed. Instructions cover commit_memory (LOCAL/TEAM/GLOBAL scopes), recall_memory, and when to use each proactively. Follows the same pattern as A2A instructions: defined in executor_helpers.py, injected by _build_system_prompt() in the claude_sdk_executor. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: seed initial memories from org template and create payload (#1050) Add MemorySeed model and initial_memories support at three levels: - POST /workspaces payload: seed memories on workspace creation - org.yaml workspace config: per-workspace initial_memories with defaults fallback - org.yaml global_memories: org-wide GLOBAL scope memories seeded on the first root workspace during import Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(template): restructure molecule-dev org template to 39-agent hierarchy Comprehensive rewrite of the Molecule AI dev team org template: - Rename agents to {team}-{role} convention (e.g., core-be, cp-lead, app-qa) - Add 5 new team leads: Core Platform Lead, Controlplane Lead, App & Docs Lead, Infra Lead, SDK Lead - Add new roles: Release Manager, Integration Tester, Technical Writer, Infra-SRE, Infra-Runtime-BE, SDK-Dev, Plugin-Dev - Delete triage-operator and triage-operator-2 (leads own triage now) - Set default model to MiniMax-M2.7, tier 3, idle_interval_seconds 900 - Update org.yaml category_routing to new agent names - Add orchestrator-pulse schedules for all leads (/5 cron) - Add pick-up-work schedules for engineers (/15 cron) - Add qa-review schedules for QA agents (/15 cron) - Add security-scan schedules for security agents (/30 cron) - Add release-cycle and e2e-test schedules for Release Manager and Integration Tester - Update marketing agents with web search MCP and media generation capabilities - All schedule prompts reference Molecule-AI/internal for PLAN.md and known-issues.md - Un-ignore org-templates/molecule-dev/ in .gitignore for version tracking Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Fix test assertions to account for HMA instructions in system prompt Mock get_hma_instructions in exact-match tests so they don't break when HMA content is appended. Add a dedicated test for HMA inclusion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: gitignore org-templates/ and plugins/ entirely These directories are cloned from their standalone repos (molecule-ai-org-template-, molecule-ai-plugin-) and should never be committed to molecule-core directly. Removed the !/org-templates/molecule-dev/ exception that allowed PR #1056 to land template files in the wrong repo. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(workspace-server): send X-Molecule-Admin-Token on CP calls controlplane #118 + #130 made /cp/workspaces/* require a per-tenant admin_token header in addition to the platform-wide shared secret. Without it, every workspace provision / deprovision / status call now 401s. ADMIN_TOKEN is already injected into the tenant container by the controlplane's Secrets Manager bootstrap, so this is purely a header-plumbing change — no new config required on the tenant side. ## Change - CPProvisioner carries adminToken alongside sharedSecret - New authHeaders method sets BOTH auth headers on every outbound request (old authHeader deleted — single call site was misleading once the semantics changed) - Empty values on either header are no-ops so self-hosted / dev deployments without a real CP still work ## Tests Renamed + expanded cp_provisioner_test cases: - TestAuthHeaders_NoopWhenBothEmpty — self-hosted path - TestAuthHeaders_SetsBothWhenBothProvided — prod happy path - TestAuthHeaders_OnlyAdminTokenWhenSecretEmpty — transition window Full workspace-server suite green. ## Rollout Next tenant provision will ship an image with this commit merged. Existing tenants (none in prod right now — hongming was the only one and was purged earlier today) will auto-update via the 5-min image-pull cron. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: GitHub token refresh — add WorkspaceAuth path for credential helper (#1068) PR #729 tightened AdminAuth to require ADMIN_TOKEN, breaking the workspace credential helper which called /admin/github-installation-token with a workspace bearer token. Tokens expired after 60 min with no refresh. Fix: Add /workspaces/:id/github-installation-token under WorkspaceAuth so any authenticated workspace can refresh its GitHub token. Keep the admin path as backward-compatible alias. Update molecule-git-token-helper.sh to use the workspace-scoped path when WORKSPACE_ID is set. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test(workspace-server): cover Stop/IsRunning/Close + auth-header + transport errors Closes review gap: pre-PR coverage on CPProvisioner was 37%. After this commit every exported method is exercised: - NewCPProvisioner 100% - authHeaders 100% - Start 91.7% (remainder: json.Marshal error path, unreachable with fixed-type request struct) - Stop 100% (new — header + path + error) - IsRunning 100% (new — 4-state matrix + auth) - Close 100% (new — contract no-op) New cases assert both auth headers (shared secret + admin_token) land on every outbound request, transport failures surface clear errors on Start/Stop, and IsRunning doesn't misreport on transport failure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(workspace-server): IsRunning surfaces non-2xx + JSON errors Pre-existing silent-failure path: IsRunning decoded CP responses regardless of HTTP status, so a CP 500 → empty body → State="" → returned (false, nil). The sweeper couldn't distinguish "workspace stopped" from "CP broken" and would leave a dead row in place. ## Fix - Non-2xx → wrapped error, does NOT echo body (CP 5xx bodies may contain echoed headers; leaking into logs would expose bearer) - JSON decode error → wrapped error - Transport error → now wrapped with "cp provisioner: status:" prefix for easier log grepping ## Tests +7 cases (5-status table + malformed JSON + existing transport). IsRunning coverage 100%; overall cp_provisioner at 98%. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cp_provisioner): IsRunning returns (true, err) on transient failures My #1071 made IsRunning return (false, err) on all error paths, but that breaks a2a_proxy which depends on Docker provisioner's (true, err) contract. Without this fix, any brief CP outage causes a2a_proxy to mark workspaces offline and trigger restart cascades across every tenant. Contract now matches Docker.IsRunning: transport error → (true, err) — alive, degraded signal non-2xx response → (true, err) — alive, degraded signal JSON decode error → (true, err) — alive, degraded signal 2xx state!=running → (false, nil) 2xx state==running → (true, nil) healthsweep.go is also happy with this — it skips on err regardless. Adds TestIsRunning_ContractCompat_A2AProxy as regression guard that asserts each error path explicitly against the a2a_proxy expectations. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cp_provisioner): cap IsRunning body read at 64 KiB IsRunning used an unbounded json.NewDecoder(resp.Body).Decode on CP status responses. Start already caps its body read at 64 KiB (cp_provisioner.go:137) to defend against a misconfigured or compromised CP streaming a huge body and exhausting memory. IsRunning is called reactively per-request from a2a_proxy and periodically from healthsweep, so it's a hotter path than Start and arguably deserves the same defense more. Adds TestIsRunning_BoundedBodyRead that serves a body padded past the cap and asserts the decode still succeeds on the JSON prefix. Follow-up to code-review Nit-2 on #1073. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(canvas): /waitlist page with contact form Adds the user-facing half of the beta-gate: a page at /waitlist that the CP auth callback redirects users to when their email isn't on the allowlist. Collects email + optional name + use-case and POSTs to /cp/waitlist/request (backend landed in controlplane #150). ## Behavior - No auto-pre-fill of email from URL query (CP's #145 dropped the ?email= param for the privacy reason; this test guards against a future regression on the client side). - Client-side validates email shape for instant feedback; backend re-validates. - Three UI states after submit: success → "your request is in" banner, form hidden dedup → softer "already on file" banner when backend returns dedup=true (same 200, no 409 to avoid enumeration) error → inline banner with backend message or network fallback ## Tests 9 tests in __tests__/waitlist-page.test.tsx covering: - default render + a11y (role=button, role=status, role=alert) - URL-pre-fill privacy regression guard - HTML5 + JS validation (empty, malformed) - successful POST with trimmed body - dedup branch - non-2xx with + without error field - network rejection Follow-up to the beta-gate rollout on controlplane #145 / #150. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(canvas): remove dead /waitlist page (lives in molecule-app) #1080 added /waitlist to canvas, but canvas isn't served at app.moleculesai.app — it backs the tenant subdomains (acme.moleculesai.app etc.). The real /waitlist lives in the separate molecule-app repo, which is what the CP auth callback redirects to. molecule-app#12 has the real page + contact form wiring to /cp/waitlist/request. This canvas copy was never reachable and would only diverge. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(org-import): limit concurrent Docker provisioning to 3 (#1084) The org import fired all workspace provisioning goroutines concurrently, overwhelming Docker when creating 39+ containers. Containers timed out, leaving workspaces stuck in 'provisioning' with no schedules or hooks. Fix: - Add provisionConcurrency=3 semaphore limiting concurrent Docker ops - Increase workspaceCreatePacingMs from 50ms to 2000ms between siblings - Pass semaphore through createWorkspaceTree recursion With 39 workspaces at 3 concurrent + 2s pacing, import takes ~30s instead of timing out. Each workspace gets its full template: schedules, hooks, settings, hierarchy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: add ?purge=true hard-delete to DELETE /workspaces/:id (#1087) Soft-delete (status='removed') leaves orphan DB rows and FK data forever. When ?purge=true is passed, after container cleanup the handler cascade- deletes all leaf FK tables and hard-removes the workspace row. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: remove org-templates/molecule-dev from git tracking This directory belongs in the dedicated repo Molecule-AI/molecule-ai-org-template-molecule-dev. It should be cloned locally for platform mounting, never committed to molecule-core. The .gitignore already blocks it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(canvas): add NEXT_PUBLIC_ADMIN_TOKEN + CSP_DEV_MODE to docker-compose Canvas needs AdminAuth token to fetch /workspaces (gated since PR #729) and CSP_DEV_MODE to allow cross-port fetches in local Docker. These were added earlier but lost on nuke+rebuild because they weren't committed to staging. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(canvas): CSP_DEV_MODE + admin token for local Docker (#1052 follow-up) Three changes that keep getting lost on nuke+rebuild: 1. middleware.ts: read CSP_DEV_MODE env to relax CSP in local Docker 2. api.ts: send NEXT_PUBLIC_ADMIN_TOKEN header (AdminAuth on /workspaces) 3. Dockerfile: accept NEXT_PUBLIC_ADMIN_TOKEN as build arg All three are required for the canvas to work in local Docker where canvas (port 3000) fetches from platform (port 8080) cross-origin. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(canvas): make root layout dynamic so CSP nonce reaches Next scripts Tenant page loads were failing with repeated CSP violations: Executing inline script violates ... script-src 'self' 'nonce-M2M4YTVh...' 'strict-dynamic'. ... because Next.js's bootstrap inline scripts were emitted without a nonce attribute. The middleware was generating per-request nonces correctly and sending them via `x-nonce` — but the layout was fully static, so Next.js cached the HTML once and served that cached bundle (no nonces baked in) for every request. Fix: call `await headers()` in the root layout. That opts the tree into dynamic rendering AND signals Next.js to propagate the x-nonce value to its own generated <script> tags. The `nonce` return value is intentionally unused — the framework handles its bootstrap scripts automatically once the read happens. Future code that adds third-party <Script> components (analytics, etc.) should pass the returned nonce explicitly. Verified against live tenant: before this change every /_next/ chunk script tag in the HTML had no nonce attribute; expected after deploy is `<script nonce="..." src="/_next/...">` on each. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(auth): accept admin token in WorkspaceAuth for canvas dashboard The canvas sends NEXT_PUBLIC_ADMIN_TOKEN on all API calls but per-workspace routes (/activity, /delegations, /traces) use WorkspaceAuth which only accepts per-workspace bearer tokens. This made the canvas dashboard 401 on every workspace detail view. Fix: WorkspaceAuth now accepts the admin token as a fallback after workspace token validation fails. This lets the canvas read all workspace data with a single admin credential. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(auth): accept admin token in CanvasOrBearer for viewport PUT * fix(ci): bake api.moleculesai.app into tenant canvas bundle Canvas's browser-side code (auth.ts, api.ts, billing.ts) all call fetch(PLATFORM_URL + /cp/). PLATFORM_URL comes from NEXT_PUBLIC_PLATFORM_URL at build time; with the build arg unset, it falls back to http://localhost:8080 in the compiled bundle. That means on a tenant like hongmingwang.moleculesai.app, the user's browser actually tried to fetch http://localhost:8080/cp/ auth/me — which resolves to the USER'S OWN machine, not the tenant. Login redirect loops 404. Every tenant canvas has been unable to complete a fresh login on this path; existing sessions only worked because the cookie was already set domain-wide. Fix: pass NEXT_PUBLIC_PLATFORM_URL=https://api.moleculesai.app as a build arg in the tenant-image workflow. CP already allows CORS from .moleculesai.app + credentials, and the session cookie is scoped to .moleculesai.app so tenant subdomains inherit it. Verified in prod by rebuilding canvas locally with the flag and hot-patching the hongmingwang instance via SSM. Baked chunks now contain api.moleculesai.app; browser auth redirects resolve cleanly to the CP. Self-hosted users override by rebuilding with their own URL — same pattern molecule-app uses with NEXT_PUBLIC_CP_ORIGIN. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat: nuke-and-rebuild.sh — one-command fleet reset Two scripts: - nuke-and-rebuild.sh: docker down -v, clean orphans, rebuild, setup - post-rebuild-setup.sh: insert global secrets (MiniMax + GH PAT), import org template, wait for platform health Global secrets ensure every provisioned container gets MiniMax API config and GitHub PAT injected as env vars automatically — no manual settings.json deployment needed. Usage: bash scripts/nuke-and-rebuild.sh Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(canvas): include NEXT_PUBLIC_PLATFORM_URL in CSP connect-src Tenant page loads were blocked by: Refused to connect to 'https://api.moleculesai.app/cp/auth/me' because it violates the document's Content Security Policy. CSP had `connect-src 'self' wss:` — fine for same-origin + any wss, but browser refuses cross-origin HTTPS fetches that aren't listed. PLATFORM_URL (baked from NEXT_PUBLIC_PLATFORM_URL, which is the CP origin on SaaS tenants) needs to be explicit. Fix: middleware reads NEXT_PUBLIC_PLATFORM_URL at build/runtime and adds both the https and wss siblings to connect-src. Self- hosted deploys that override the build-arg automatically get a matching CSP — no hardcoded hostname. Test added: buildCsp includes NEXT_PUBLIC_PLATFORM_URL origin in connect-src when set. Also loosens the dev `ws:` assertion since dev uses `connect-src ` which subsumes ws (pre-existing behavior, test was stale). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> feat(router): /cp/* reverse-proxy to CP + same-origin canvas fetches Canvas's browser bundle issues fetches to both CP endpoints (/cp/auth/me, /cp/orgs, ...) AND tenant-platform endpoints (/canvas/viewport, /approvals/pending, /org/templates). They share ONE build-time base URL. Baking api.moleculesai.app broke tenant calls with 404; baking the tenant subdomain broke auth. Tried both today and saw exactly one failure mode per attempt. Real fix: same-origin fetches + tenant-side split. Adds: internal/router/cp_proxy.go # /cp/* → CP_UPSTREAM_URL mounted before NoRoute(canvasProxy). Now a tenant serves: /cp/* → reverse-proxy to api.moleculesai.app /canvas/viewport, /approvals/pending, /workspaces/:id/, /ws, /registry, → tenant platform (existing handlers) /metrics everything else → canvas UI (existing reverse-proxy) Canvas middleware reverts to `connect-src 'self' wss:` for the same-origin path (keeping explicit PLATFORM_URL whitelist as a self-hosted escape hatch when the build-arg is non-empty). CI build-arg flips to NEXT_PUBLIC_PLATFORM_URL="" so the bundle issues relative fetches. Security of cp_proxy: - Cookie + Authorization PRESERVED across the hop (opposite of canvas proxy) — they carry the WorkOS session, which is the whole point. - Host rewritten to upstream so CORS + cookie-domain on the CP side see their own hostname. - Upstream URL validated at construction: must parse, must be http(s), must have a host — misconfig fails closed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> security: remove hardcoded API keys from post-rebuild-setup.sh GitGuardian detected exposed MiniMax API key and GitHub PAT in the script's default values. Replaced with env var reads from .env file (which is gitignored). Script now validates required secrets exist before proceeding. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(middleware): TenantGuard passes through /cp/* to CP proxy Today's rollout of cp_proxy (PR #1095/1096) mounted /cp/* as a reverse-proxy to the control plane, but the TenantGuard middleware runs first in the global chain and 404s anything that isn't in its exact-path allowlist (/health + /metrics). Every /cp/auth/me fetch from canvas landed on a 40µs 404 before ever reaching the proxy. /cp/* is handled upstream (WorkOS session + admin bearer), so the tenant doesn't need to attach org identity for those paths. Passing them through is correct — matches the design where the tenant platform is a pure transit layer for /cp/. Verified: /cp/auth/me via tunnel now returns 401 (correct unauth from CP) instead of 404 from TenantGuard. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> feat(middleware): AdminAuth accepts CP-verified WorkOS session Canvas (SaaS tenant UI) runs in the browser and authenticates the user via a WorkOS session cookie scoped to .moleculesai.app. It has no bearer token — the token-based ADMIN_TOKEN scheme is for CLI + server-to-server callers, not end users. Adds a session-verification tier to AdminAuth that runs BEFORE the bearer check: 1. If Cookie header present AND CP_UPSTREAM_URL configured → GET /cp/auth/me upstream with the same cookie. 200 + valid user_id → grant admin access. Non-200 → fall through. 2. Else (no cookie, or no CP configured, or CP said no) → existing bearer-only path unchanged. Positive verifications are cached 30s keyed by the raw Cookie header, so a burst of canvas admin-page renders doesn't DDoS the CP. Revocations propagate within that window. Self-hosted / dev deploys without CP_UPSTREAM_URL: feature disabled, behavior unchanged. So this is strictly additive for the SaaS case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(docker): fix plugin go.mod replace for TokenProvider interface (#960) The github-app-auth plugin's go.mod had a relative replace directive (../molecule-monorepo/platform) that didn't resolve in Docker where the plugin is at /plugin/ and the platform at /app/. This caused the plugin's provisionhook.TokenProvider interface to come from a different package path than the platform's, so the type assertion in FirstTokenProvider() failed — "no token provider registered". Fix: sed the plugin's go.mod replace to point at /app during Docker build. Also added debug logging to GetInstallationToken for future diagnosis. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: close cross-tenant authz + cp_proxy admin-traversal gaps Addresses three Critical findings from today's code review of the SaaS-canvas routing stack. ## Critical-1: session verification scoped to the current tenant session_auth.go previously verified via GET /cp/auth/me, which only answers "is someone logged in" — NOT "is this user in the org they're targeting." Every WorkOS-authed user (including folks who only signed up via app.moleculesai.app with no tenant relationship) could call /workspaces, /approvals/pending, /bundles/import, /org/import etc. on ANY tenant they could reach. Cross-tenant read: user at acme.moleculesai.app could hit bob.moleculesai.app/workspaces with their cookie and get Bob's workspaces. Fix: - CP gains GET /cp/auth/tenant-member?slug=<slug> which joins org_members × organizations and only returns member:true when the authenticated user is actually in that org. - Tenant sets MOLECULE_ORG_SLUG at boot via user-data. - session_auth now calls tenant-member (not /me), passing its own slug. Cache key includes slug so one tenant's cached positive never satisfies another's check. ## Critical-2: cp_proxy path allowlist (lateral-movement fix) cp_proxy.go forwarded any /cp/* path upstream with the cookie and bearer attached. Since /cp/admin/* accepts sessions as one of its auth tiers, a tenant-authed user could curl /cp/admin/tenants/other-slug/diagnostics through their tenant and the CP would honor it — turning any tenant into a lateral hop into admin surface. Fix: explicit allowlist of paths the canvas browser bundle actually needs (/cp/auth, /cp/orgs, /cp/billing, /cp/templates, /cp/legal). Everything else 404s at the tenant before cookies leave. Fail-closed: future UI paths require explicit entries. ## Important-1,2: bounded session cache + split positive/negative TTL Previous sync.Map cache grew unbounded (one entry per unique Cookie header for process lifetime) and cached failures for 30s, meaning a 3s CP blip locked users out for the full window. Fix: - Bounded map with batch random eviction at cap (10k entries × ~100 bytes = 1 MB ceiling). Random eviction is O(1) expected; we don't need precise LRU. - Periodic sweeper goroutine (2 min) reclaims expired entries even when they're not re-hit. - Positive TTL 30s, negative TTL 5s — short negative so CP flakes self-heal fast. - Transport errors NOT cached (would otherwise trap every user during a multi-second upstream outage). - Cache key = sha256(slug + cookie) so raw session tokens don't sit in process memory, and cross-tenant isolation is structural not policy. ## Important-3: TenantGuard /cp/* bypass documented Added a security note to the bypass explaining why it's safe only under the current setup (cp_proxy allowlist + tunnel-only ingress), and what would require revisiting (SG opens :8080 inbound to the VPC). ## Tests - session_auth_test.go: 12 new tests — empty cookie, missing slug, no CP, member:true happy path with cache hit, member: false, 401 upstream, malformed JSON, transport error not cached, cross-tenant isolation (same cookie different tenants hit upstream separately), bounded eviction, expired entries, cache key collision resistance. - cp_proxy_test.go: new — isCPProxyAllowedPath covers 17 allow/block cases, forwarding preserves Cookie+Auth, Host rewritten, blocked paths 404 without calling upstream. All platform tests pass. CP provisioner tests pass after threading cfg.OrgSlug into the container env. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(auth): organization-scoped API keys for admin access Adds user-facing API keys with full-org admin scope. Replaces the single ADMIN_TOKEN env var with named, revocable, audited tokens that users can mint/rotate from the canvas UI without ops intervention. Designed for the beta growth phase — one token tier (full admin). Future work will split into scoped roles (admin / workspace-write / read-only) and per-workspace bindings. See docs… * test(handlers): add 5 TestKI005 regression tests to terminal_test.go Port terminal hierarchy guard regression suite: - TestKI005_SelfAccess_AlwaysAllowed: own workspace token always passes - TestKI005_CanCommunicatePeer_Allowed: sibling workspace access granted - TestKI005_CanCommunicateNonPeer_Forbidden: cross-org access blocked (403) - TestKI005_TokenMismatch_Unauthorized: token/Workspace-ID mismatch blocked (401) - TestKI005_NoXWorkspaceIDHeader_LegacyAllowed: legacy access no header → proceeds Refs: F1085, KI-005 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Hongming Wang <hongmingwangrabbit@gmail.com> Co-authored-by: Hongming Wang <hongmingwang.rabbit@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Molecule AI Backend Engineer <backend-engineer@agents.moleculesai.app> Co-authored-by: qa-agent <qa-agent@users.noreply.github.com> Co-authored-by: Molecule AI Frontend Engineer <frontend-engineer@agents.moleculesai.app> Co-authored-by: Molecule AI Triage Operator <triage-operator@agents.moleculesai.app> Co-authored-by: Molecule AI Platform Engineer <platform-engineer@agents.moleculesai.app> Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com> Co-authored-by: Molecule AI SDK-Dev <sdk-dev@agents.moleculesai.app> Co-authored-by: airenostars <airenostars@gmail.com> Co-authored-by: Molecule AI Core-BE <core-be@agents.moleculesai.app> Co-authored-by: Molecule AI Core-DevOps <core-devops@agents.moleculesai.app> Co-authored-by: Molecule AI Core-FE <core-fe@agents.moleculesai.app> Co-authored-by: Molecule AI Fullstack (floater) <fullstack-floater@agents.moleculesai.app> Co-authored-by: Molecule AI CP-QA <cp-qa@agents.moleculesai.app> Co-authored-by: Molecule AI Core-UIUX <core-uiux@agents.moleculesai.app> Co-authored-by: Molecule AI PMM <pmm@agents.moleculesai.app> Co-authored-by: Molecule AI Social Media Brand <social-media-brand@agents.moleculesai.app> Co-authored-by: Molecule AI DevRel Engineer <devrel-engineer@agents.moleculesai.app> Co-authored-by: Marketing Lead <marketing-lead@agents.moleculesai.app> Co-authored-by: Molecule AI Controlplane Lead <controlplane-lead@agents.moleculesai.app> Co-authored-by: Molecule AI CP-BE <cp-be@agents.moleculesai.app> Co-authored-by: Molecule AI Community Manager <community-manager@agents.moleculesai.app> Co-authored-by: Molecule AI Technical Writer <technical-writer@agents.moleculesai.app> Co-authored-by: Molecule AI App-FE <app-fe@agents.moleculesai.app>	2026-04-24 01:58:31 +00:00
molecule-ai[bot]	b1dce3405c	Merge branch 'staging' into test/2026-04-23-regression-suite	2026-04-24 01:55:06 +00:00
Hongming Wang	00e3e3f570	fix(#1933 ): bump molecule-ai-plugin-github-app-auth to current main (step 1) Ships step 1 of the #1933 fleet-wide GH_TOKEN refresh fix. The plugin's v0.0.0-20260416194734-2cd28737f845 predates the Mutator.Token() method added in plugin-repo PR #1 (merged 2026-04-17). Monorepo's workspace-server/pkg/provisionhook/mutator.go:218 has been emitting `provisionhook: no Token method on "github-app-auth"` on every boot and the reflection-fallback at mutator.go:216 is doing extra work every time a workspace requests a fresh GH token. This is the one-line pin bump: v0.0.0-20260416194734-2cd28737f845 → v0.0.0-20260421064811-7d98ae51e31d Effect: direct-interface path (not the reflection fallback) gets taken, log noise goes away. Does NOT fix the actual 60-min GH_TOKEN death — steps 2–5 of #1933 (credential helper install, git config wire-up, runtime auth context, periodic refresh) are separate, larger PRs. Verified: workspace-server/go build ./... passes with the new pin. Ref: #1933	2026-04-23 18:53:25 -07:00
Molecule AI Core-BE	88c929875e	fix(#1877 ): nil provisioner guard in issueAndInjectToken Fix panic in TestIssueAndInjectToken_HappyPath where h.provisioner is nil (the handler was created without a real provisioner in unit tests). Add nil guard so the pre-write step is skipped gracefully — token is still injected into ConfigFiles as before, and the runtime-side 401 retry handles any race. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-23 17:47:18 -07:00
Molecule AI Core-BE	b5e2142c46	fix(#1877 ): close token-rotation race on restart — Option A+Option B combined Platform side (Option B): - provisioner.go: add WriteAuthTokenToVolume() — writes .auth_token to the Docker named volume BEFORE ContainerStart using a throwaway alpine container, eliminating the race window where a restarted container could read a stale token before WriteFilesToContainer writes the new one. - workspace_provision.go: call WriteAuthTokenToVolume() in issueAndInjectToken as a best-effort pre-write before the container starts. Runtime side (Option A): - heartbeat.py: on HTTPStatusError 401 from /registry/heartbeat, call refresh_cache() to force re-read of /configs/.auth_token from disk, then retry the heartbeat once. Fall through to normal failure tracking if the retry also fails. - platform_auth.py: add refresh_cache() which discards the in-process _cached_token and calls get_token() to re-read from disk. Together these eliminate the >1 consecutive 401 window described in issue #1877. Pre-write (B) is the primary fix; runtime retry (A) is the self-healing fallback for any residual race. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-23 17:47:18 -07:00
Hongming Wang	9ce8d97448	test: regression guard for #1738 — cp-provisioner uses real instance_id Pins the fix-invariants from PR #1738 (merged 2026-04-23) against regression. Pre-fix, `CPProvisioner.Stop` and `IsRunning` both passed the workspace UUID as the `instance_id` query param: url := fmt.Sprintf("%s/cp/workspaces/%s?instance_id=%s", baseURL, workspaceID, workspaceID) ^ should be the real i-* ID AWS rejected downstream with InvalidInstanceID.Malformed, orphaned the EC2, and the next provision hit InvalidGroup.Duplicate on the leftover SG — full Save & Restart cascade failure. ## Tests added - TestStop_UsesRealInstanceIDNotWorkspaceUUID: stub resolveInstanceID to return an i-* ID, assert the CP request's instance_id query param carries that i-* value (not the workspace UUID). - TestStop_NoInstanceIDSkipsCPCall: empty DB lookup → no CP call at all (idempotent). Guards against re-introducing the "call CP with '' and let AWS reject" footgun. - TestIsRunning_UsesRealInstanceIDNotWorkspaceUUID: mirror for the /cp/workspaces/:id/status path — same bug shape. All 3 pass on current staging (which has the fix). Reverting either Stop or IsRunning to the pre-#1738 shape causes these to fail loud. Extends molecule-core#1902's regression suite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 17:45:13 -07:00
Hongming Wang	18ebb1d7bf	fix(server): remove 60s A2A client timeout + correct file-read cat args Two bugs surfaced while testing Claude Code + OAuth deploys: 1. A2A proxy: a2aClient had a 60s Client.Timeout "safety net" that defeated the per-request context deadlines the code otherwise sets (canvas = 5m, agent-to-agent = 30m). Claude Code's first-token cold start over OAuth takes 30-60s, so every first "hi" into a fresh claude-code workspace returned 503 at exactly the 1m mark. Removed the Client.Timeout — the context deadline now governs as documented in the adjacent comment. 2. Files tab: ReadFile ran `cat <rootPath> <filePath>` as two args to cat. `cat /home agent/turtle_draw.py` tries to read the rootPath directory (errors "Is a directory") and then resolves the filePath relative to the container cwd, which is not guaranteed to equal rootPath. Result: the file-content pane stayed blank even though the file listed fine. Join into a single path before exec. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 17:25:53 -07:00
Hongming Wang	d812c28431	Merge pull request #1932 from Molecule-AI/chore/sync-staging-to-main-followup chore: sync staging → main (follow-up: 9 commits since #1913)	2026-04-23 17:25:07 -07:00
Hongming Wang	e337efe974	fix(canvas): propagate runtime through WORKSPACE_PROVISIONING event The side-panel runtime pill read "unknown" for newly-deployed workspaces because canvas-events.ts created the node from WORKSPACE_PROVISIONING payload — and the payload only carried name + tier. No refetch filled the gap during provisioning, so the user saw "RUNTIME unknown" on the card even though the DB row had the real runtime set. Includes runtime in every WORKSPACE_PROVISIONING emitter: * handlers/workspace.go — initial create * handlers/workspace_restart.go — explicit restart, auto-restart, and crash-recovery resume loop * handlers/org_import.go — multi-workspace org imports Canvas-side: canvas-events.ts reads payload.runtime when creating the node; the provisioning test asserts the pill value is populated before any refetch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 17:17:49 -07:00
Hongming Wang	dc50a1c775	refactor(canvas): data-drive provider picker from template config.yaml The MissingKeysModal's provider list was hardcoded in deploy-preflight.ts as RUNTIME_PROVIDERS — a per-runtime map that duplicated what each template repo already declares in its config.yaml. That meant adding a new provider required changes in two places, and the UI could drift out of sync with the actual template (e.g. when a template adds a MiniMax or Kimi model, the picker wouldn't know). The single source of truth for "which env vars does this workspace need" is each template's config.yaml: * `runtime_config.models[].required_env` — per-model key list * `runtime_config.required_env` — runtime-level AND list Go /templates already returned `models`. This change: * Adds `required_env` alongside `models` on templateSummary so the canvas receives the full picture. * Rewrites deploy-preflight.ts to derive ProviderChoice[] from a template object via `providersFromTemplate(template)`: - groups `models[]` by unique required_env tuple - falls back to runtime_config.required_env when models is empty - decorates labels with model counts (e.g. "OpenRouter (14 models)") * `checkDeploySecrets(template, workspaceId?)` now takes a template object instead of a runtime string. Any-provider satisfaction still short-circuits preflight to ok=true. * MissingKeysModal receives `providers` directly; no more lookups. * TemplatePalette threads `template.models` + `template.required_env` into the preflight. Side effects: * Claude Code's dual-auth (OAuth token OR Anthropic API key) now surfaces as two picker options — its config.yaml already declared both, the UI just wasn't reading them. * Hermes picker now shows 8 provider options (Nous, OpenRouter, Anthropic, Gemini, DeepSeek, GLM, Kimi, Kilocode) instead of the hand-picked 3, matching its 35-model reality. Removed the legacy RUNTIME_PROVIDERS / RUNTIME_REQUIRED_KEYS / getRequiredKeys / findMissingKeys exports; MissingKeysModal.test.tsx deleted (its coverage is subsumed by the new template-driven deploy-preflight.test.ts). 58 modal-adjacent tests pass; full canvas suite 919 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 17:07:15 -07:00

1 2 3 4 5 ...

322 Commits