Commit Graph

2798 Commits

Author SHA1 Message Date
molecule-ai[bot]
e4e389950f
fix(canvas/a11y): aria-hidden SVGs, MissingKeysModal dialog semantics, session cookie auth (#1992)
fix(canvas/a11y): aria-hidden SVGs, MissingKeysModal dialog semantics, session cookie auth

Three fixes cherry-picked from issue #1744:

1. aria-hidden on decorative SVG icons:
   - DeleteCascadeConfirmDialog.tsx: warning triangle SVG gets aria-hidden="true"
   - MissingKeysModal.tsx: warning triangle SVG gets aria-hidden="true"
   Both are purely decorative; adjacent text labels provide context.

2. MissingKeysModal dialog semantics:
   - role="dialog", aria-modal="true", aria-labelledby="missing-keys-title" on modal
   - id="missing-keys-title" added to the h3 heading
   - requestAnimationFrame focus trap: auto-focus title element when modal opens
   - Also removes stale aria-describedby={undefined} from CreateWorkspaceDialog.tsx

3. Session cookie auth for /registry/:id/peers:
   - Promotes VerifiedCPSession() fallback before the bearer token branch
   - Fixes SaaS canvas Peers tab 401 — canvas hits this endpoint via session cookie
   - Correctly returns "invalid session" for bad cookies instead of falling through
   - Self-hosted bypass logic preserved

Test fix (bundled, same branch):
   - ContextMenu keyboard test: add getState() stub to useCanvasStore mock
   - Required after ContextMenu.tsx gained a direct getState() call at line 169

Reviewed-by: Core-Security (security audit: APPROVED)
CI: Canvas CI , Platform CI , E2E API , CodeQL 

GitHub issue: #1740 (test), #1744 (a11y)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 06:20:32 +00:00
Hongming Wang
a2f471feed
Merge pull request #1987 from Molecule-AI/fix/e2e-pin-hermes-custom-provider
fix(e2e): pin HERMES_* env so openai/* routes deterministically
2026-04-23 22:44:25 -07:00
Hongming Wang
884fff1145 fix(e2e): pin HERMES_* env vars so openai/* routes deterministically
Root cause of the sustained E2E step-8 A2A 401 failures (3+/3 runs
2026-04-24 03h–04h): the A2A returns 200 with a JSON-RPC result whose
text is OpenRouter's error format —
  {'message': 'Missing Authentication header', 'code': 401}
(integer code, not OpenAI's string 'invalid_api_key'). template-hermes's
derive-provider.sh was picking PROVIDER=openrouter for openai/* models
despite template-hermes#19 (the fix that flips openai/* → custom when
OPENAI_API_KEY is set) having been merged 01:30Z.

Verified via probe workspaces on the staging canary tenant:
  probe 1 (just OPENAI_API_KEY): → OpenRouter's 401 shape
  probe 2 (+ HERMES_INFERENCE_PROVIDER=custom + HERMES_CUSTOM_*):
           → OpenAI's 401 shape ('code': 'invalid_api_key')

So derive-provider.sh's updates apparently aren't reaching every
staging tenant on re-provision — possibly because tenant EC2s cache
/opt/adapter from an earlier boot, or the CP's user-data snapshot
bundles a pre-fix template-hermes. That's a separate follow-up (needs
forced re-clone of /opt/adapter on every workspace boot).

This PR is the test-side workaround. Pinning the HERMES_* bridge env
vars bypasses derive-provider.sh entirely, so the test works regardless
of which template-hermes commit any given tenant happens to have on
disk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 22:41:22 -07:00
Hongming Wang
faba17a84c
Merge pull request #1917 from Molecule-AI/fix/blog-ai-agents-org-scoped-keys-missing-endpoint
fix(blog): remove fake /org/tokens/:id/logs endpoint reference (molecule-core#1914)
2026-04-23 22:12:10 -07:00
1da9759d0d Merge remote-tracking branch 'origin/staging' into fix/blog-ai-agents-org-scoped-keys-missing-endpoint 2026-04-24 05:09:39 +00:00
Hongming Wang
f4b301b4da
Merge pull request #1982 from Molecule-AI/feat/merge-queue-trigger
ci: add merge_group trigger to ci + codeql
2026-04-23 21:51:50 -07:00
rabbitblood
0cc8733f09 Merge remote-tracking branch 'origin/staging' into feat/merge-queue-trigger 2026-04-23 21:48:59 -07:00
molecule-ai[bot]
35bcad9204
feat(workspace): migrate a2a-sdk from 0.3.x to 1.0.0 (KI-009) (#1974)
* feat(workspace): migrate a2a-sdk from 0.3.x to 1.0.0 (KI-009)

Migrates all workspace code from a2a-sdk v0.3.x to v1.0.0, following the
official migration guide from a2aproject/a2a-python.

Breaking changes applied:
- A2AStarletteApplication → Starlette route factory
  (create_agent_card_routes + create_jsonrpc_routes)
- AgentCard.url removed; url+protocol now in supported_protocols[].url
- AgentCapabilities fields renamed to snake_case
  (pushNotifications→push_notifications,
   stateTransitionHistory→state_transition_history)
- AgentCard.defaultInputModes/outputModes → default_input_modes/output_modes
- TaskState.canceled → TaskState.TASK_STATE_CANCELED
- a2a.utils → a2a.helpers
- Part(root=TextPart(text=t)) → Part(text=t) (TextPart removed)

Files changed:
- requirements.txt: pinned >=1.0.0,<2.0
- main.py: Starlette route factory + AgentCard restructure
- a2a_executor.py: Part() + TaskState + helpers import
- hermes_executor.py: TaskState + helpers import
- google-adk/adapter.py: TaskState + helpers import
- cli_executor.py: helpers import
- claude_sdk_executor.py: helpers import
- tests/conftest.py: a2a.helpers mock stub
- tests/test_a2a_executor.py: TaskState enum key
- adapters/google-adk/test_adapter.py: Part + helpers stub

Refs: KI-009
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(test): update _TaskState mock to a2a-sdk v1 enum name (TASK_STATE_CANCELED)

---------

Co-authored-by: Molecule AI Tech Researcher <tech-researcher@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>
2026-04-24 04:43:17 +00:00
rabbitblood
01de3ef6d2 Merge remote-tracking branch 'origin/staging' into feat/merge-queue-trigger 2026-04-23 21:34:16 -07:00
molecule-ai[bot]
01fcc9a4b6
fix(canvas/a11y): aria-hidden SVGs, MissingKeysModal dialog, session cookie auth
* fix(canvas/a11y): aria-hidden SVGs, MissingKeysModal dialog semantics, session cookie auth

Three fixes cherry-picked from issue #1744:

1. aria-hidden on decorative SVG icons:
   - DeleteCascadeConfirmDialog.tsx: warning triangle SVG gets aria-hidden="true"
   - MissingKeysModal.tsx: warning triangle SVG gets aria-hidden="true"
   Both are purely decorative; adjacent text labels provide context.

2. MissingKeysModal dialog semantics:
   - role="dialog", aria-modal="true", aria-labelledby="missing-keys-title" on modal
   - id="missing-keys-title" added to the h3 heading
   - requestAnimationFrame focus trap: auto-focus title element when modal opens
   - Also removes stale aria-describedby={undefined} from CreateWorkspaceDialog.tsx

3. Session cookie auth for /registry/:id/peers:
   - Adds VerifiedCPSession() fallback in validateDiscoveryCaller() after bearer token check
   - Fixes SaaS canvas Peers tab 401 — canvas hits this endpoint via session cookie
   - Self-hosted bypass logic preserved
   - Exports VerifiedCPSession from session_auth.go for cross-package use

Test fix (bundled, same branch):
   - ContextMenu keyboard test: add getState() stub to useCanvasStore mock
   - Required after ContextMenu.tsx gained a direct getState() call at line 169

GitHub issue: #1740 (test), #1744 (a11y)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(workspace-server): remove duplicate VerifiedCPSession declaration

The branch accidentally added a second func VerifiedCPSession declaration
that shadows the real implementation, causing go build to fail with:
  internal/middleware/session_auth.go:238:6: VerifiedCPSession redeclared in this block

Remove the stub alias so the original full implementation is used directly.
The function already exports correctly for cross-package use via the
VerifiedCPSession() call in discovery.go.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(workspace-server): correct VerifiedCPSession condition in discovery.go

Fix Go build error — 'presented' was declared and not used.
The cookie fallback check was using `if ok, presented := ...; ok` instead
of `if ok, presented := ...; presented`, causing the build to fail in CI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(workspace-server): fix declared and not used 'presented' in discovery.go

Fixes Go build failure:
  discovery.go:355:10: declared and not used: presented
  discovery.go:358:6: undefined: presented

Variable shadowing in the second VerifiedCPSession call reused the outer
scope's `ok` and `presented` names, causing a compile error. Renamed to
ok2/presented2 to avoid shadowing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Molecule AI Core-FE <core-fe@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 04:30:26 +00:00
rabbitblood
5f3508fef0 ci: add merge_group trigger to ci + codeql
Pre-work for enabling GitHub merge queue on the staging branch (#TBD
follow-up issue). Without these triggers, the queue's pre-merge CI run
on the speculative `gh-readonly-queue/...` ref would never fire, every
queued PR would show false-green for the required checks, and queue
would merge things that don't actually pass on the rebased commit.

Adding the trigger now is **a no-op** — the `merge_group` event only
fires once the queue is enabled on a branch, which is a separate UI/API
toggle. So this PR is safe to land in isolation; merge-queue enablement
is the next step and reversible at the branch-protection level.

Why these two workflows:
- `ci.yml` provides 5 of the 8 required staging checks (Detect changes,
  Platform Go, Canvas Next.js, Python Lint & Test, Shellcheck E2E)
- `codeql.yml` provides the other 3 (Analyze go / js-ts / python)

Other workflows (e2e-staging-*, canary-*, publish-*) are not required
status checks and don't need the trigger to keep the queue working.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 21:24:53 -07:00
Hongming Wang
0576e341b9
ops(#1976): add smart-sweep script for orphan Cloudflare DNS records (#1978)
Replaces the "panic-button at >65 records" manual sweep that nukes
every pattern-match unconditionally (would delete live workspaces
along with orphans).

This version:
- Queries CP prod + staging /admin/orgs for live tenant slugs
- Queries AWS EC2 describe-instances for live workspace Name tags
- Only deletes CF records whose slug/ws-id has no live counterpart
- Dry-run by default (--execute to actually delete)
- Safety gate refuses to delete >50% of records (configurable via
  MAX_DELETE_PCT env var) — catches the "API returned zero orgs, every
  tenant looks orphan" failure mode before it nukes production
- Per-category accounting: orphan-ws / orphan-e2e-tenant / etc.

Usage:
  CF_API_TOKEN=... CF_ZONE_ID=... \
    CP_PROD_ADMIN_TOKEN=... CP_STAGING_ADMIN_TOKEN=... \
    bash scripts/ops/sweep-cf-orphans.sh           # dry-run
  bash scripts/ops/sweep-cf-orphans.sh --execute   # actually delete

Ref: #1976 (root-cause: tenant.Delete + workspace.Delete don't clean
their CF records — until that's fixed, this script is the maintenance
path)

Co-authored-by: Hongming Wang <hongmingwang.rabbit@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>
2026-04-24 04:19:49 +00:00
Hongming Wang
6745a61ebf
Merge pull request #1970 from Molecule-AI/fix/restore-quickstart-plus-hotfixes
fix(canvas): playability pass + UX polish (post #1897)
2026-04-23 21:08:52 -07:00
Hongming Wang
d53583f9c6 Merge remote-tracking branch 'origin/staging' into fix/restore-quickstart-plus-hotfixes 2026-04-23 21:04:55 -07:00
Hongming Wang
2d6ff11c4e fix(canvas): re-sort parents-before-children after nest mutation
React Flow requires parent nodes to appear before their children in
the nodes array. When they don't, it logs "Parent node {id} not
found. Please make sure that parent nodes are in front of their
child nodes in the nodes array" and — more importantly — renders
the child at canvas-absolute coords instead of parent-relative,
flashing it far outside the parent.

topology's buildNodesAndEdges already enforced this at hydrate, but
nestNode + batchNest weren't re-sorting after mutating parentId.
A freshly-nested child often ended up after-first-drag at the
wrong screen position because its new parent sat later in the
array than itself.

Extract sortParentsBeforeChildren() into canvas-topology as a
reusable DFS visit; call it at the tail of both nestNode's set()
and batchNest's commit set(). 923 tests still green — no behaviour
change beyond eliminating the warning and the position flash.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 21:00:40 -07:00
Hongming Wang
2a8977c946 fix(canvas): cancel-nest also shrinks the parent back
Canceling the nest/extract dialog restored the child's position but
left the parent card at its auto-grown size. growParentsToFitChildren
fires on drag-stop to fit a then-outside child; when the drag is
subsequently cancelled, the parent keeps that grown width/height
forever because the grow pass is grow-only.

Strip width/height from the ex-parent alongside the child position
restore in cancelNest — React Flow re-measures from CSS, parent
collapses back to its natural size. Same trick nestNode already
uses for the un-nest path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 20:56:08 -07:00
Hongming Wang
09053dfdeb fix(canvas): cancel-nest restores position; un-nest shrinks parent
Two follow-up polish items for drag-and-nest:

1. Cancelling the "Extract from team?" dialog now snaps the
   dragged card back to where the drag started. Before, a user
   who dragged a child out, saw the confirm dialog, then clicked
   Cancel ended up with the card stranded outside the parent at
   its drop-point position — which also got persisted via
   savePosition on drag-stop. Now onNodeDragStart captures the
   pre-drag position + parent, and cancelNest restores both the
   RF node position and fires savePosition with the absolute
   pre-drag coords so reload matches.

2. Un-nesting now clears the ex-parent's explicit width/height
   in the nodes array. growParentsToFitChildren is grow-only so
   it could never shrink the parent back down after a child
   left; the card stayed at its auto-grown size with empty
   space. Stripping width/height lets React Flow re-measure from
   the card's own min-width / min-height CSS, so the parent
   visually shrinks to fit whatever children remain.

923 canvas tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 20:52:28 -07:00
Hongming Wang
512fdfd59d fix(canvas): plain drag out of parent un-nests again
Un-nest used to require holding Alt (or Cmd to force-detach). That
was too conservative — when a user dragged a child clearly outside
its parent's bbox, nothing happened on release, because the default
branch soft-clamped back and only the Alt branch actually opened
the "Extract?" confirm. Matches the exact bug the user just flagged
("I can put agents in other agent, but when I drag it out, it does
not move out").

New rules:
 * Past the 20 % hysteresis → confirm un-nest. Plain drag, no
   modifier. This is what most users expect (Miro / Figma behave
   the same way — drag outside the frame and the shape leaves it).
 * Inside or within 20 % of the edge → soft-clamp back inside.
   Guards against twitchy releases that momentarily overshoot the
   edge by a few pixels.
 * Cmd / Ctrl → force un-nest regardless of overlap. Escape-hatch
   for when the user dragged within the hysteresis zone but really
   wants out.
 * Dropping onto a different parent → nest there (unchanged).

Alt is no longer a required modifier for un-nesting. Keeps it as
a non-gesture modifier only; no meaning unless we re-bind it later.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 20:48:38 -07:00
Hongming Wang
f2a4b6e0d3 fix: dev-mode bypass for IP rate limiter + 429 retry on GET
The 600-req/min/IP bucket is sized for SaaS where each tenant has
a distinct client IP. On a local Docker setup every panel shares
one IP — hydration (/workspaces + /templates + /org/templates +
/approvals/pending) plus polling (A2A overlay + activity tabs +
approvals + schedule + channels + audit trail) can burst past the
bucket inside a minute, blanking the canvas with 429s. The user
reported it after dragging workspaces — dragging itself is
release-only (savePosition in onNodeDragStop), but the polling
that's always running added onto startup tripped the limit.

Two-layer fix:

Server: RateLimiter.Middleware short-circuits when isDevModeFailOpen
is true (MOLECULE_ENV=development + empty ADMIN_TOKEN), matching
the Tier-1b hatch already applied to AdminAuth, WorkspaceAuth, and
discovery. SaaS production keeps the bucket.

Client: api.ts auto-retries a single 429 on idempotent GET requests,
waiting the server-provided Retry-After (capped at 20s). Mutations
(POST/PUT/PATCH/DELETE) never auto-retry to avoid double-applying.
Users on SaaS hitting a legitimate rate-limit spike get one
transparent recovery instead of an immediately-blank Canvas.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 20:44:09 -07:00
Hongming Wang
286dcbfd1e fix(canvas,org): collapse org-imported parents on first paint
Importing a 15-workspace org template dropped every child as a
freely-positioned card into its parent's coordinate space. Parents
with 5-10 kids had the kids spill below the parent's initial min
size, producing the "ugly default" layout the user just flagged —
a mess of overlapping cards the moment the import completed.

Fix: every workspace in an org-template import that HAS children
is inserted with `collapsed = true`. Leaf workspaces stay
expanded (nothing to hide). The canvas renders a collapsed
parent as a compact header-only card with its "N sub" badge —
visually identical to the pre-refactor default the user asked for.

Double-click on a collapsed parent now EXPANDS it (flipping
`collapsed` locally + persisting via PATCH) so the user can drill
in to see the subtree. Only once expanded does a second
double-click zoom-to-team, matching the prior behaviour.

Leaf-first creation order stays the same; the collapsed flag
just means "render compact" not "hide from API".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 20:36:55 -07:00
Hongming Wang
507696d88a fix(canvas,server): address review findings on 3f11df03
Five review findings from the 3f11df03 six-bug commit:

1. Add TestPeers_DevModeFailOpen_{Allows,ClosedWhenAdminTokenSet,
   ClosedInProduction} covering all three gating states for the
   security-sensitive dev-mode hatch the prior commit added to
   /registry/:id/peers. Previously shipped untested — a future
   refactor could have silently inverted polarity or removed the
   gate. New tests pin the contract:
     * MOLECULE_ENV=development + ADMIN_TOKEN="" → allow bearerless
     * MOLECULE_ENV=development + ADMIN_TOKEN set → require token
     * MOLECULE_ENV=production                    → require token

2. ConfigTab handleSave diffs against the RAW parsed YAML / form
   config instead of the DEFAULT_CONFIG-merged shape. The previous
   code would silently PATCH tier=1 to the DB when a user deleted
   the `tier:` line in raw mode (the default-merge substituted 1).
   Now: only fields the user actually typed participate in the
   diff. Type guards (typeof === "number" / "string") prevent
   coercion surprises on malformed YAML.

3. ConfigTab model-save failure no longer lies "Saved". The
   /workspaces/:id/model PATCH can reject when the runtime doesn't
   support the chosen model; previously we caught + console.warn'd
   + showed green Saved, and the user watched the model revert on
   next reload with no explanation. Now the save path collects a
   `modelSaveError` and surfaces it via setError with a partial-
   success message ("Other fields saved, but model update failed:
   …") so the user sees why.

4. ChannelsTab now surfaces BOTH channels-fetch and adapters-fetch
   failures, distinguishing them in the error text ("Failed to
   load connected channels and platforms — try refreshing").
   Previously only an adapters failure was visible; a channels
   failure left the user with an apparently-empty list and no
   indication the API was unreachable.

5. ChatTab panels drop the redundant aria-hidden attribute. The
   `hidden`/`flex` Tailwind class already sets display:none, which
   removes the node from the accessibility tree on its own; the
   extra aria-hidden invited WAI-ARIA lint warnings if a focusable
   descendant ever landed inside an inactive panel.

Tests: 923 canvas + full Go handler suite pass. 3 new Go tests.
No behaviour change on the five prior fixes — this commit tightens
their edges per the independent review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 20:29:44 -07:00
Hongming Wang
3f11df031c fix: six UX bugs (peers auth, scroll, chat tabs, config persist, + visibility)
Six bugs reported from a live session — all shippable in one commit:

1. Peers tab 401 on local Docker. The /registry/:id/peers endpoint
   demands a workspace-scoped bearer token (validateDiscoveryCaller)
   which the canvas session doesn't hold. Added the same Tier-1b
   dev-mode fail-open hatch that AdminAuth and WorkspaceAuth already
   use — gated by MOLECULE_ENV=development + empty ADMIN_TOKEN, so
   SaaS production stays strict. Exported IsDevModeFailOpen from the
   middleware package for the handler layer to reuse.

2. Org Templates list unscrollable. OrgTemplatesSection was rendered
   in the TemplatePalette footer — a div without overflow — so when
   it expanded to 15+ entries the list extended past the viewport
   with no scroll. Moved it to the top of the flex-1 overflow-y-auto
   container. Tall lists now scroll naturally.

3. Chat tab: "My Chat" and "Agent Comms" rendered stacked instead
   of switching. HTML `hidden` attribute was being overridden by
   Tailwind's `flex` class (display: flex beats the attribute),
   so both tabpanels rendered concurrently. Swapped to a conditional
   Tailwind `hidden`/`flex` class so the inactive panel is
   display:none with proper CSS specificity.

4. Hermes Config form never persists. handleSave wrote config.yaml
   but name / tier / runtime / model all live on the workspace row
   (or the dedicated /workspaces/:id/model endpoint) — the form
   edited in-memory, the request returned 200, the next reload
   wiped everything back. Hermes + external runtimes manage their
   own config inside the container anyway, so writing config.yaml
   is a no-op for them; skip it. Always diff and PATCH the DB-backed
   fields that actually changed.

5. Channels "+ Connect" dropdown empty on first open. ChannelsTab's
   load() used Promise.all with a silent catch — if EITHER the
   channels or adapters fetch failed, both setters were skipped
   with no error visible. Switched to Promise.allSettled so each
   endpoint settles independently, and the adapters failure now
   surfaces via the top-level error state.

6. Plugin registry always "No plugins in registry". Same silent
   catch pattern in SkillsTab.tsx — load errors for /plugins,
   /plugins/sources, and /workspaces/:id/plugins swallowed without
   logging. Replaced the empty catches with console.warn so future
   failures are at least visible in devtools.

Tests: 923 passing (unchanged). Go handler tests pass. Server
rebuilt and running with the peers-auth + collapsed-persistence
fixes (pid 15875).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 20:18:30 -07:00
Hongming Wang
06a249bbb1
Merge pull request #1961 from Molecule-AI/feat/canvas-activitytab-missingkeys-tests
fix(canvas/a11y+tests): aria-hidden backdrop, verifiedCPSession guard, useCanvasStore mock normalization
2026-04-23 20:15:42 -07:00
Molecule AI App & Docs Lead
3715c06e0b fix(canvas): remove stale firstInputRef useEffect from AllKeysModal
AllKeysModal already handles focus via autoFocus={index === 0} on the
first input and a separate title-focus effect. The orphaned useEffect
referencing firstInputRef (declared only in ProviderPickerModal) caused
a TypeScript build error: "Cannot find name 'firstInputRef'".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 03:11:36 +00:00
8fb5ec0340 fix(handlers): fix Go scoping — presented must live in function scope
The short-var declaration inside the if-initializer scoped `presented`
only to that if statement, making it undefined on the following
`if presented { ... }` block. Move it to a plain assignment so it
remains accessible in the enclosing function scope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 03:10:18 +00:00
a46797d466 fix(middleware): rename internal fn to verifiedCPSession, keep public alias
The PR #1855 branch contains a newer version of session_auth.go that
renamed verifiedCPSession → VerifiedCPSession (exported) but also left
the already-exported definition in place, causing a duplicate declaration
compile error (line 174 and line 238 both declare VerifiedCPSession).

Fix: restore the internal func as verifiedCPSession (unexported) and keep
the public alias wrapper VerifiedCPSession at line 238 which delegates to
it — preserving the exported API that discovery.go and wsauth_middleware.go
depend on.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 03:10:18 +00:00
746cb22855 fix(canvas/tests): normalize useCanvasStore mock pattern in test files
Standardize the mock for useCanvasStore to always expose getState()
(used by production ContextMenu to filter parent nodes). Applies the
same Object.assign-wrapping pattern introduced in #1744 to:
- ClaudeSettings.test.tsx
- tabs.a11y.test.tsx
- ContextMenu.keyboard.test.tsx (mockStore shape alignment)
2026-04-24 03:10:18 +00:00
680f1f50f2 fix(canvas/a11y): restore aria-hidden on backdrop div after cherry-pick conflict
Cherry-pick from #1744 left the backdrop div without aria-hidden="true"
(the outer dialog div got it instead). Re-apply aria-hidden="true" to
the backdrop div so screen readers skip the clickable overlay layer.

Also revert test assertion from bg-black → bg-black/70 to match the
exact class applied to the backdrop div.
2026-04-24 03:10:18 +00:00
Hongming Wang
4fd7f1e84c fix(canvas): tighten rescue + cap toast + cover paths with tests
Three follow-up review findings from the c2b2e13a review:

1. Rescue heuristic uses pure bbox-non-overlap. The previous
   `position.x < 0` branch rescued any child whose parent was
   later dragged past it, even when the layout was clearly
   recoverable (e.g. relative -40, child still overlaps parent).
   New rule: rescue iff the child's bbox has zero overlap with
   the parent's bbox — self-calibrating, scales with user-resized
   parents, catches screenshot-case and legacy huge-positive data.

2. Toast caps failed-name list at 3 and appends "and N more".
   Stops a 50-node partial failure from overflowing the toast
   container.

3. Cycle guard on selection-roots walk in batchNest. Corrupt
   parentId data can't send the loop infinite now. Cheap
   defensive guard — one Set per selected node.

Tests added (923 total, up from 918):
 * canvas-topology.test: 4 rescue scenarios — screenshot case
   (zero-overlap rescue), negative drift kept, huge-positive
   rescued, user-resized layout kept.
 * canvas.test: selection-roots filter on a 3-level chain.
 * workspace_crud test: PATCH {collapsed:true} runs the UPDATE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 20:08:14 -07:00
Hongming Wang
c2b2e13abe fix(canvas): address code-review findings on the Canvas refactor
Five issues surfaced in the review of 50b53784. Each was either a real
bug waiting to hit users or a silent failure mode.

1. Topology rescue no longer teleports user-resized children.
   Rescue was comparing against parentMinSize(childCount), so any
   child the user had placed in space the parent was resized into
   got snapped to the default grid on reload — undoing the layout.
   Now rescue fires only on obviously corrupt data: negative
   relative coords (legacy pre-nesting absolute positions that
   landed above/left of their assigned parent) or values past an
   MAX_PLAUSIBLE_OFFSET threshold. Children just-past the initial
   minimum are left alone.

2. batchNest now filters to selection-roots before planning.
   Previously selecting both A and A's descendant B and dragging
   into T yanked B out of A to become a sibling under T. Users
   reasonably expect the A subtree to move intact. The new pass
   drops any selected node whose ancestor is also selected —
   those follow their ancestor via React Flow's parent binding.

3. batchNest surfaces partial failure via showToast. Previously
   silent: 2 of 5 PATCHes fail, user sees 3 cards re-parented + 2
   snapped back with no explanation. Now names the failed cards.

4. confirmNest closes the nest dialog BEFORE dispatching the async
   store action, so a second drag can't kick off a competing batch
   while the first is still in flight.

5. collapsed is now persisted. The Go workspace_crud.go Update
   handler ignored the `collapsed` field, so user-initiated
   collapse round-tripped to an expanded state on next hydrate.
   Added the PATCH branch (`UPDATE workspaces SET collapsed = ...`)
   so the state survives reload.

Nits cleaned:
 * Removed dead dragStartParentRef in useDragHandlers.
 * Swapped redundant `node.data as WorkspaceNodeData` casts for a
   named WorkspaceNode type alias.
 * Canvas.tsx SR-live region now reads n.parentId (matches
   MiniMap + RF's native field) instead of the mirror n.data.parentId.

Tests added (918 total, up from 915):
 * batchNest happy path — 2-root selection fires 2 combined PATCHes
   carrying parent_id + x + y, not 2×N sequential round-trips.
 * batchNest ancestor+descendant selection — subtree stays intact.
 * batchNest partial failure rollback — only the rejected nodes
   revert; successful ones stay committed.

Backend change is single-line (collapsed PATCH branch); all
workspace_crud Go tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 19:58:44 -07:00
Hongming Wang
b752c3c2c3
Merge pull request #1902 from Molecule-AI/test/2026-04-23-regression-suite
test: regression guards for 2026-04-23 hermes + CP bug wave
2026-04-23 19:58:09 -07:00
dc9001835e fix(ConfigTab.hermes.test): remove unused fireEvent import 2026-04-24 02:55:51 +00:00
molecule-ai[bot]
f509e5d11d
Merge pull request #1951 from Molecule-AI/sync/staging-to-main-2026-04-24
chore: sync staging → main (2026-04-24)
2026-04-24 02:48:08 +00:00
molecule-ai[bot]
b43e21aa39
Merge branch 'staging' into sync/staging-to-main-2026-04-24 2026-04-24 02:45:14 +00:00
molecule-ai[bot]
8e46cc1676
Merge branch 'staging' into test/2026-04-23-regression-suite 2026-04-24 02:45:12 +00:00
Hongming Wang
50b537849a refactor(canvas): split Canvas.tsx into hooks; parallelize batchNest
Two concerns in one commit (separate files, each self-contained):

## Canvas.tsx split (from ~680 to ~250 lines)

Canvas.tsx was holding drag gesture state + keyboard shortcuts +
viewport wiring + JSX. Each concern now lives in its own unit under
canvas/src/components/canvas/:

- dragUtils.ts          — pure: shouldDetach, clampChildIntoParent,
                          DETACH_FRACTION
- DropTargetBadge.tsx   — the floating "Drop into: <name>" label + the
                          dashed ghost preview at the target slot
- useDragHandlers.ts    — encapsulates onNodeDragStart / Drag / Stop,
                          findDropTarget hit-test, pendingNest state,
                          and confirmNest/cancelNest. Routes multi-
                          select drags through batchNest automatically.
- useKeyboardShortcuts  — Esc, Enter, Shift+Enter, Cmd+]/[, Z — one
                          window listener, one source of truth.
- useCanvasViewport     — pan-to-node + zoom-to-team CustomEvent
                          listeners and the debounced viewport save.

Canvas.tsx becomes a thin composition + JSX file. No behavioural
change; the refactor is covered by the existing 915 canvas tests.

## batchNest parallelization (2N round-trips → N, all in flight)

Previously nestNode fired two sequential PATCHes (parent_id then x/y)
and batchNest looped nestNode sequentially. For a 5-node selection on
a typical ~200ms link this was ~2s of serialized RPCs.

- nestNode now combines parent_id + x + y into ONE PATCH. The Go
  handler (workspace_crud.go Update) already reads all three from the
  same body — no backend change.
- batchNest rewritten: compute every re-parent plan against one
  snapshot, commit a single set(), then fire N PATCHes via
  Promise.allSettled in parallel. Per-node failures roll back only
  that node (others stay committed) — same semantics as the single-
  node path, just concurrent.
- The state math in the batch path also correctly shifts descendant
  zIndex by depthDelta when any re-parented node has a subtree.

## Also

- canvas-topology.ts: reverted P3.12's opt-in rescue to the auto-
  rescue default. When a child's stored relative position would render
  it outside the parent bbox (the visual regression the user saw after
  collapse → reload — Hermes child drawn outside Claude Code Agent on
  first paint), the child is placed in the next default grid slot.
  The "Arrange Children" context command stays for bigger teams.

All 915 canvas tests pass. No backend changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 19:43:18 -07:00
Hongming Wang
6b69dcdcaa
Merge pull request #1891 from Molecule-AI/fix/e2e-api-staging-trigger
feat(ci): run E2E API smoke test on staging branch
2026-04-23 19:42:07 -07:00
molecule-ai[bot]
c2fcb011f4
Merge branch 'staging' into fix/e2e-api-staging-trigger 2026-04-24 02:40:01 +00:00
Hongming Wang
c5abed988e fix(canvas): address review findings on playability pass
Five Critical issues caught in code review of f3423a51. Each one broke
an invariant the original commit claimed to uphold.

1. nestNode: descendants kept their old-depth zIndex after a re-parent.
   Now walks the dragged subtree and shifts every descendant's zIndex
   by the same depthDelta so "children above ancestors" survives moves
   between levels of the hierarchy.

2. bumpZOrder: siblings all share zIndex = depth in fresh topology, so
   a single +1 bump was identical for every sibling and subsequent
   bumps drifted zIndex unboundedly. Rewritten to sort siblings by
   current zIndex and swap the target with its neighbour in the bump
   direction — Figma-style reorder, stays within the sibling tier.

3. findDropTarget: depth-first tiebreaker lost to bumped siblings. The
   visually-frontmost card after Cmd+] is a shallow sibling, but the
   hit test picked the deepest nested card regardless. Swapped order
   so zIndex wins first, depth second, area third. Also pre-computes
   the depth map once per call (was O(n²) via repeated .find walks —
   will matter past ~30 workspaces).

4. arrangeChildren: saved absolute position using `slot + parent.position`,
   but parent.position is RELATIVE to its own parent when nested.
   Grandchildren's stored x/y were in the parent's local frame and
   reload placed them in the wrong spot. Now walks the full ancestor
   chain via absOf() to get the true canvas-absolute origin before
   PATCHing.

5. setCollapsed: naive flip of every descendant's hidden flag diverged
   from the topology rebuild on hydrate. Collapse A, collapse B, then
   expand A — C should stay hidden because B is still collapsed, but
   before this fix C was unhidden. Rewritten to recompute every
   descendant's hidden from the full ancestry chain, matching the
   topology pass byte-for-byte. New round-trip test asserts the two
   code paths produce identical node.hidden across a full lifecycle.

Also:
- Removed dead cascadeMessage constant (never rendered).
- Replaced hardcoded 260/120 in zoom-to-team with exported constants.
- arrangeChildren PATCH catch now logs instead of silently swallowing.
- Added 70→76 tests: setCollapsed 3-chain scenarios, bumpZOrder swap
  semantics, edge-of-list no-op.

All 915 canvas tests green. Backend untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 19:16:48 -07:00
Hongming Wang
3b9b3da237
Merge pull request #1939 from Molecule-AI/fix/1933-bump-github-app-auth-plugin
fix(#1933-step1): bump github-app-auth plugin pin to pick up Token() method
2026-04-23 19:08:12 -07:00
cb7e52779a Merge pull request #1938 from Molecule-AI/test/ki005-terminal-guard-regression-tests 2026-04-24 02:07:28 +00:00
molecule-ai[bot]
3e9b7f8ad6
Merge branch 'staging' into fix/1933-bump-github-app-auth-plugin 2026-04-24 02:04:47 +00:00
molecule-ai[bot]
10c4fcc7fe
Merge branch 'staging' into test/2026-04-23-regression-suite 2026-04-24 02:04:46 +00:00
Hongming Wang
f3423a513d feat(canvas): industry-pattern playability pass (P1+P2+P3)
Ships the full prioritized improvement list from the canvas research
report — aligns our nesting/resize UX with Miro / FigJam / tldraw / Figma
conventions. Organized by priority below.

## P1 — baseline playability

* Hysteresis on drag-out detach (Miro): a child only un-nests when >=20%
  of its bbox is outside the parent on release. Prevents accidental
  un-nesting from twitchy drags.
* Drop-target now uses tree-depth DESC, then zIndex DESC, then area ASC
  to pick targets when nested parents overlap (xyflow #2827).
* Children render above ancestors by inheriting zIndex = parent + 1 in
  topology and on every nest/unnest (xyflow #4012).
* Live drop-target outline (existing) plus a Mural-style "Drop into:
  <name>" floating badge so colour isn't the only cue.
* growParentsToFitChildren now fires only on dimension-type changes
  inside onNodesChange (NodeResizer commits) and once on drag-stop —
  avoids tldraw's edge-chase artifact (P3.11 commit-on-release).

## P2 — polish

* Whimsical-style ghost preview: dashed outline at the next default
  grid slot inside the drop-target parent during drag.
* Alt-drag escape with soft clamp: dropping slightly outside a parent
  without Alt/Cmd snaps the child back inside (clampChildIntoParent);
  Alt releases the clamp to allow un-nest; Cmd/Ctrl force-detaches.
* Figma-style keyboard hierarchy nav: Enter descends to first child,
  Shift+Enter ascends to parent, Cmd+]/[ re-orders siblings via the
  new bumpZOrder store action.
* Multi-select re-parent preserves offsets: confirmNest routes through
  a new batchNest action when the primary drag is part of a batch
  selection (Lucidchart pattern).

## P3 — long-tail

* Minimap now shows parent cards as filled regions with a blue stroke,
  so hierarchy reads at a glance without zooming.
* Out-of-bounds rescue is opt-in: topology no longer silently re-lays
  children whose stored position is outside the parent bbox (Figma
  trust-the-data). The new Arrange Children context menu item runs the
  rescue on demand via arrangeChildren.
* Cmd-drag force-detach regardless of hysteresis.
* Collapse workspace: the existing Collapse Team action now toggles a
  local setCollapsed store action that hides every descendant and
  shrinks the parent card to header-only (Miro frame outline view).
  Growth pass skips collapsed parents so they don't push back out.

All 910 canvas tests green. Backend untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 19:03:02 -07:00
molecule-ai[bot]
e8b5f409be
test(handlers): add 5 TestKI005 terminal guard regression tests (#1938)
* chore: sync staging to main — 1188 commits, 5 conflicts resolved (#1743)

* fix(docs): update architecture + API reference paths for workspace-server rename

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: update workspace script comments for workspace-template → workspace rename

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: ChatTab comment path for workspace-server rename

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add BatchActionBar unit tests (7 tests)

Covers: render threshold, count badge, action buttons, clear selection,
ConfirmDialog trigger, ARIA toolbar role.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update publish workflow name + document staging-first flow

Default branch is now staging for both molecule-core and
molecule-controlplane. PRs target staging, CEO merges staging → main
to promote to production.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(ci): update working-directory for workspace-server/ and workspace/ renames

- platform-build: working-directory platform → workspace-server
- golangci-lint: working-directory platform → workspace-server
- python-lint: working-directory workspace-template → workspace
- e2e-api: working-directory platform → workspace-server
- canvas-deploy-reminder: fix duplicate if: key (merged into single condition)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: add mol_pk_ and cfut_ to pre-commit secret scanner

Partner API keys (mol_pk_*) and Cloudflare tokens (cfut_*) now
caught by the pre-commit hook alongside sk-ant-, ghp_, AKIA.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore(canvas): enable Turbopack for dev server — faster HMR

next dev --turbopack for significantly faster dev server startup
and hot module replacement. Build script unchanged (Turbopack for
next build is still experimental).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(db): schema_migrations tracking — migrations only run once

Adds a schema_migrations table that records which migration files
have been applied. On boot, only new migrations execute — previously
applied ones are skipped. This eliminates:

- Re-running all 33 migrations on every restart
- Risk of non-idempotent DDL failing on restart
- Unnecessary log noise from re-applying unchanged schema

First boot auto-populates the tracking table with all existing
migrations. Subsequent boots only apply new ones.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(scheduler): strip CRLF from cron prompts on insert/update (closes #958)

Windows CRLF in org-template prompt text caused empty agent responses
and phantom-producing detection. Strips \r at the handler level before
DB persist, plus a one-time migration to clean existing rows.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(security): strip current_task from public GET /workspaces/:id (closes #955)

current_task exposes live agent instructions to any caller with a
valid workspace UUID. Also strips last_sample_error and workspace_dir
from the public endpoint. These fields remain available through
authenticated workspace-specific endpoints.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore(canvas): initialize shadcn/ui — components.json + cn utility

Sets up shadcn/ui CLI so new components can be added with
`npx shadcn add <component>`. Uses new-york style, zinc base color,
no CSS variables (matches existing Tailwind-only approach).

Adds clsx + tailwind-merge for the cn() utility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(security): GLOBAL memory delimiter spoofing + pin MCP npm version

SAFE-T1201 (#807): Escape [MEMORY prefix in GLOBAL memory content on
write to prevent delimiter-spoofing prompt injection. Content stored
as "[_MEMORY " so it renders as text, not structure, when wrapped with
the real delimiter on read.

SAFE-T1102 (#805): Pin @molecule-ai/mcp-server@1.0.0 in .mcp.json.example.
Prevents supply-chain attacks via unpinned npx -y.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: schema_migrations tracking — 4 cases (first boot, re-boot, mixed, down.sql filter)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: verify current_task + last_sample_error + workspace_dir stripped from public GET

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: GLOBAL memory delimiter spoofing escape + LOCAL scope untouched

- TestCommitMemory_GlobalScope_DelimiterSpoofingEscaped: verifies [MEMORY prefix
  is escaped to [_MEMORY before DB insert (SAFE-T1201, #807)
- TestCommitMemory_LocalScope_NoDelimiterEscape: LOCAL scope stored verbatim

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(security): Phase 35.1 — SG lockdown script for tenant EC2 instances

Restricts tenant EC2 port 8080 ingress to Cloudflare IP ranges only,
blocking direct-IP access. Supports two modes:

1. Lock to CF IPs (Worker deployment): 14 IPv4 CIDR rules
2. Close ingress entirely (Tunnel deployment): removes 0.0.0.0/0 only

Usage:
  bash scripts/lockdown-tenant-sg.sh --sg-id sg-xxxxx
  bash scripts/lockdown-tenant-sg.sh --sg-id sg-xxxxx --close-ingress
  bash scripts/lockdown-tenant-sg.sh --sg-id sg-xxxxx --dry-run

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* ci: update GitHub Actions to current stable versions (closes #780)

- golangci/golangci-lint-action@v4 → v9
- docker/setup-qemu-action@v3 → v4
- docker/setup-buildx-action@v3 → v4
- docker/build-push-action@v5 → v6

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(opencode): RFC 2119 — 'should not' → 'must not' for SAFE-T1201 warning (closes #861)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(canvas): degraded badge WCAG AA contrast — amber-400 → amber-300 (closes #885)

amber-400 on zinc-900 is 5.4:1 (AA pass). amber-300 is 6.9:1 (AA+AAA pass)
and matches the rest of the amber usage in WorkspaceNode (currentTask,
error detail, badge chip).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(platform): 409 guard on /hibernate when active_tasks > 0 (closes #822)

Phase 35.1 / #799 security condition C3 — prevents operator from
accidentally killing a mid-task agent.

Behavior:
- active_tasks == 0 → proceed as before
- active_tasks > 0 && ?force=true → log [WARN] + proceed
- active_tasks > 0 && no force → 409 with {error, active_tasks}

2 new tests: TestHibernateHandler_ActiveTasks_Returns409,
TestHibernateHandler_ActiveTasks_ForceTrue_Returns200.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(platform): track last_outbound_at for silent-workspace detection (closes #817)

Sub of #795 (phantom-busy post-mortem). Adds last_outbound_at TIMESTAMPTZ
column to workspaces. Bumped async on every successful outbound A2A call
from a real workspace (skip canvas + system callers). Exposed in
GET /workspaces/:id response as "last_outbound_at".

PM/Dev Lead orchestrators can now detect workspaces that have gone silent
despite being online (> 2h + active cron = phantom-busy warning).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(workspace): snapshot secret scrubber (closes #823)

Sub-issue of #799, security condition C4. Standalone module in
workspace/lib/snapshot_scrub.py with three public functions:

- scrub_content(str) → str: regex-based redaction of secret patterns
- is_sandbox_content(str) → bool: detect run_code tool output markers
- scrub_snapshot(dict) → dict: walk memories, scrub each, drop sandbox entries

Patterns covered: sk-ant-/sk-proj-, ghp_/ghs_/github_pat_, AKIA,
cfut_, mol_pk_, ctx7_, Bearer, env-var assignments, base64 blobs ≥33 chars.

21 unit tests, 100% coverage on new code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(security): cap webhook + config PATCH bodies (H3/H4)

Two HIGH-severity DoS surfaces: both handlers read the entire HTTP
body with io.ReadAll(r.Body) and no upper bound, so a caller streaming
a multi-gigabyte request could exhaust memory on the tenant instance
before we even validated the JSON.

H3 (Discord webhook): wrap Body in io.LimitReader with a 1 MiB cap.
Discord Interactions payloads are well under 10 KiB in practice.

H4 (workspace config PATCH): wrap Body in http.MaxBytesReader with a
256 KiB cap. Real configs are <10 KiB; jsonb handles the cap
comfortably. Returns 413 Request Entity Too Large on overflow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(security): C4 — close AdminAuth fail-open race on hosted-SaaS fresh install

Pre-launch review blocker. AdminAuth's Tier-1 fail-open fired whenever
the workspace_auth_tokens table was empty — including the window between
a hosted tenant EC2 booting and the first workspace being created. In
that window, every admin-gated route (POST /org/import, POST /workspaces,
POST /bundles/import, etc.) was reachable without a bearer, letting an
attacker pre-empt the first real user by importing a hostile workspace
into a freshly provisioned instance.

Fix: fail-open is now ONLY applied when ADMIN_TOKEN is unset (self-
hosted dev with zero auth configured). Hosted SaaS always sets
ADMIN_TOKEN at provision time, so the branch never fires in prod and
requests with no bearer get 401 even before the first token is minted.

Tier-2 / Tier-3 paths unchanged.

The old TestAdminAuth_684_FailOpen_AdminTokenSet_NoGlobalTokens test
was codifying exactly this bug (asserting 200 on fresh install with
ADMIN_TOKEN set). Renamed and flipped to
TestAdminAuth_C4_AdminTokenSet_FreshInstall_FailsClosed asserting 401.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(security): scrub workspace-server token + upstream error logs

Two findings from the pre-launch log-scrub audit:

1. handlers/workspace_provision.go:548 logged `token[:8]` — the exact
   H1 pattern that panicked on short keys. Even with a length guard,
   leaking 8 chars of an auth token into centralized logs shortens the
   search space for anyone who gets log-read access. Now logs only
   `len(token)` as a liveness signal.

2. provisioner/cp_provisioner.go:101 fell back to logging the raw
   control-plane response body when the structured {"error":"..."}
   field was absent. If the CP ever echoed request headers (Authorization)
   or a portion of user-data back in an error path, the bearer token
   would end up in our tenant-instance logs. Now logs the byte count
   only; the structured error remains in place for the happy path.
   Also caps the read at 64 KiB via io.LimitReader to prevent
   log-flood DoS from a compromised upstream.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(security): tenant CPProvisioner attaches CP bearer on all calls

Completes the C1 integration (PR #50 on molecule-controlplane). The CP
now requires Authorization: Bearer <PROVISION_SHARED_SECRET> on all
three /cp/workspaces/* endpoints; without this change the tenant-side
Start/Stop/IsRunning calls would all 401 (or 404 when the CP's routes
refused to mount) and every workspace provision from a SaaS tenant
would silently fail.

Reads MOLECULE_CP_SHARED_SECRET, falling back to PROVISION_SHARED_SECRET
so operators can use one env-var name on both sides of the wire. Empty
value is a no-op: self-hosted deployments with no CP or a CP that
doesn't gate /cp/workspaces/* keep working as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(canvas): add 15s fetch timeout on API calls

Pre-launch audit flagged api.ts as missing a timeout on every fetch.
A slow or hung CP response would leave the UI spinning indefinitely
with no way for the user to abort — effectively a client-side DoS.

15s is long enough for real CP queries (slowest observed is Stripe
portal redirect at ~3s) and short enough that a stalled backend
surfaces as a clear error with a retry affordance.

Uses AbortSignal.timeout (widely supported since 2023) so the
abort propagates through React Query / SWR consumers cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(e2e): stop asserting current_task on public workspace GET (#966)

PR #966 intentionally stripped current_task, last_sample_error, and
workspace_dir from the public GET /workspaces/:id response to avoid
leaking task bodies to anyone with a workspace bearer. The E2E smoke
test hadn't caught up — it was still asserting "current_task":"..."
on the single-workspace GET, which made every post-#966 CI run fail
with '60 passed, 2 failed'.

Swap the per-workspace asserts to check active_tasks (still exposed,
canonical busy signal) and keep the list-endpoint check that proves
admin-auth'd callers still see current_task end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: 2026-04-19 SaaS prod migration notes

Captures the 10-PR staging→main cutover: what shipped, the three new
Railway prod env vars (PROVISION_SHARED_SECRET / EC2_VPC_ID /
CP_BASE_URL), and the sharp edge for existing tenants — their
containers pre-date PR #53 so they still need MOLECULE_CP_SHARED_SECRET
added manually (or a re-provision) before the new CPProvisioner's
outbound bearer works.

Also includes a post-deploy verification checklist and rollback plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ws-server): pull env from CP on startup

Paired with molecule-controlplane PR #55 (GET /cp/tenants/config). Lets
existing tenants heal themselves when we rotate or add a CP-side env
var (e.g. MOLECULE_CP_SHARED_SECRET landing earlier today) without any
ssh or re-provision.

Flow: main() calls refreshEnvFromCP() before any other os.Getenv read.
The helper reads MOLECULE_ORG_ID + ADMIN_TOKEN from the baked-in
user-data env, GETs {MOLECULE_CP_URL}/cp/tenants/config with those
credentials, and applies the returned string map via os.Setenv so
downstream code (CPProvisioner, etc.) sees the fresh values.

Best-effort semantics:
- self-hosted / no MOLECULE_ORG_ID → no-op (return nil)
- CP unreachable / non-200 → log + return error (main keeps booting)
- oversized values (>4 KiB each) rejected to avoid env pollution
- body read capped at 64 KiB

Once this image hits GHCR, the 5-minute tenant auto-updater picks it
up, the container restarts, refresh runs, and every tenant has
MOLECULE_CP_SHARED_SECRET within ~5 minutes — no operator toil.

Also fixes workspace-server/.gitignore so `server` no longer matches
the cmd/server package dir — it only ignored the compiled binary but
pattern was too broad. Anchored to `/server`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(canary): smoke harness + GHA verification workflow (Phase 2)

Post-deploy verification for staging tenant images. Runs against the
canary fleet after each publish-workspace-server-image build — catches
auto-update breakage (a la today's E2E current_task drift) before it
propagates to the prod tenant fleet that auto-pulls :latest every 5 min.

scripts/canary-smoke.sh iterates a space-sep list of canary base URLs
(paired with their ADMIN_TOKENs) and checks:
- /admin/liveness reachable with admin bearer (tenant boot OK)
- /workspaces list responds (wsAuth + DB path OK)
- /memories/commit + /memories/search round-trip (encryption + scrubber)
- /events admin read (AdminAuth C4 path)
- /admin/liveness without bearer returns 401 (C4 fail-closed regression)

.github/workflows/canary-verify.yml runs after publish succeeds:
- 6-min sleep (tenant auto-updater pulls every 5 min)
- bash scripts/canary-smoke.sh with secrets pulled from repo settings
- on failure: writes a Step Summary flagging that :latest should be
  rolled back to prior known-good digest

Phase 3 follow-up will split the publish workflow so only
:staging-<sha> ships initially, and canary-verify's green gate is
what promotes :staging-<sha> → :latest. This commit lays the test
gate alone so we have something running against tenants immediately.

Secrets to set in GitHub repo settings before this workflow can run:
- CANARY_TENANT_URLS (space-sep list)
- CANARY_ADMIN_TOKENS (same order as URLs)
- CANARY_CP_SHARED_SECRET (matches staging CP PROVISION_SHARED_SECRET)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(canary): gate :latest tag promotion on canary verify green (Phase 3)

Completes the canary release train. Before this, publish-workspace-
server-image.yml pushed both :staging-<sha> and :latest on every
main merge — meaning the prod tenant fleet auto-pulled every image
immediately, before any post-deploy smoke test. A broken image
(think: this morning's E2E current_task drift, but shipped at 3am
instead of caught in CI) would have fanned out to every running
tenant within 5 min.

Now:
- publish workflow pushes :staging-<sha> ONLY
- canary tenants are configured to track :staging-<sha>; they pick
  up the new image on their next auto-update cycle
- canary-verify.yml runs the smoke suite (Phase 2) after the sleep
- on green: a new promote-to-latest job uses crane to remotely
  retag :staging-<sha> → :latest for both platform and tenant images
- prod tenants auto-update to the newly-retagged :latest within
  their usual 5-min window
- on red: :latest stays frozen on prior good digest; prod is untouched

crane is pulled onto the runner (~4 MB, GitHub release) rather than
docker-daemon retag so the workflow doesn't need a privileged runner.

Rollback: if canary passed but something surfaces post-promotion,
operator runs "crane tag ghcr.io/molecule-ai/platform:<prior-good-sha>
latest" manually. A follow-up can wrap that in a Phase 4 admin
endpoint / script.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(canary): rollback-latest script + release-pipeline doc (Phase 4)

Closes the canary loop with the escape hatch and a single place to
read about the whole flow.

scripts/rollback-latest.sh <sha>
  uses crane to retag :latest ← :staging-<sha> for BOTH the platform
  and tenant images. Pre-checks the target tag exists and verifies
  the :latest digest after the move so a bad ops typo doesn't
  silently promote the wrong thing. Prod tenants auto-update to the
  rolled-back digest within their 5-min cycle. Exit codes: 0 = both
  retagged, 1 = registry/tag error, 2 = usage error.

docs/architecture/canary-release.md
  The one-page map of the pipeline: how PR → main → staging-<sha> →
  canary smoke → :latest promotion works end-to-end, how to add a
  canary tenant, how to roll back, and what this gate explicitly does
  NOT catch (prod-only data, config drift, cross-tenant bugs).

No code changes in the CP or workspace-server — this PR is shell
+ docs only, so it's safe to land independently of the other Phase
{1,1.5,2,3} PRs still in review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(ws-server): cover CPProvisioner — auth, env fallback, error paths

Post-merge audit flagged cp_provisioner.go as the only new file from
the canary/C1 work without test coverage. Fills the gap:

- NewCPProvisioner_RequiresOrgID — self-hosted without MOLECULE_ORG_ID
  refuses to construct (avoids silent phone-home to prod CP).
- NewCPProvisioner_FallsBackToProvisionSharedSecret — the operator
  ergonomics of using one env-var name on both sides of the wire.
- AuthHeader noop + happy path — bearer only set when secret is set.
- Start_HappyPath — end-to-end POST to stubbed CP, bearer forwarded,
  instance_id parsed out of response.
- Start_Non201ReturnsStructuredError — when CP returns structured
  {"error":"…"}, that message surfaces to the caller.
- Start_NoStructuredErrorFallsBackToSize — regression gate for the
  anti-log-leak change from PR #980: raw upstream body must NOT
  appear in the error, only the byte count.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(scheduler): collapse empty-run bump to single RETURNING query

The phantom-producer detector (#795) was doing UPDATE + SELECT in two
roundtrips — first incrementing consecutive_empty_runs, then re-
reading to check the stale threshold. Switch to UPDATE ... RETURNING
so the post-increment value comes back in one query.

Called once per schedule per cron tick. At 100 tenants × dozens of
schedules per tenant, the halved DB traffic on the empty-response
path is measurable, not just cosmetic.

Also now properly logs if the bump itself fails (previously it silent-
swallowed the ExecContext error and still ran the SELECT, which would
confuse debugging).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(canvas): /orgs landing page for post-signup users

CP's Callback handler redirects every new WorkOS session to
APP_URL/orgs, but canvas had no such route — new users hit the canvas
Home component, which tries to call /workspaces on a tenant that
doesn't exist yet, and saw a confusing error. This PR plugs that gap
with a dedicated landing page that:

- Bounces anonymous visitors back to /cp/auth/login
- Zero-org users see a slug-picker (POST /cp/orgs, refresh)
- For each existing org, shows status + CTA:
  * awaiting_payment → amber "Complete payment" → /pricing?org=…
  * running          → emerald "Open" → https://<slug>.moleculesai.app
  * failed           → "Contact support" → mailto
  * provisioning     → read-only "provisioning…"
- Surfaces errors inline with a Retry button

Deliberately server-light: one GET /cp/orgs, no WebSocket, no canvas
store hydration. Goal is to move the user from signup to either
Stripe Checkout or their tenant URL with one click each.

Closes the last UX gap between the BILLING_REQUIRED gate landing on
the CP and real users being able to complete a signup today.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(canvas): post-checkout UX — Stripe success lands on /orgs with banner

Two small polish items that together close the signup-to-running-tenant
flow for real users:

1. Stripe success_url now points at /orgs?checkout=success instead of
   the current page (was pricing). The old behavior left people staring
   at plan cards with no indication payment went through — the new
   behavior drops them right onto their org list where they can watch
   the status flip.

2. /orgs shows a green "Payment confirmed, workspace spinning up"
   banner when it sees ?checkout=success, then clears the query
   param via replaceState so a reload doesn't show it again.

3. /orgs now polls every 5s while any org is awaiting_payment or
   provisioning. Users see the Stripe webhook's effect live — no
   manual refresh needed — and once every org settles the polling
   stops so idle tabs don't hammer /cp/orgs.

Paired with PR #992 (the /orgs page itself) this makes the end-to-end
flow on BILLING_REQUIRED=true deployments feel right:
  /pricing → Stripe → /orgs?checkout=success → banner → live poll →
  "Open" button when org.status transitions to running.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(canvas): bump billing test for /orgs success_url

* fix(ci): clone sibling plugin repo so publish-workspace-server-image builds

Publish has been failing since the 2026-04-18 open-source restructure
(#964's merge) because workspace-server/Dockerfile still COPYs
./molecule-ai-plugin-github-app-auth/ but the restructure moved that
code out to its own repo. Every main merge since has produced a
"failed to compute cache key: /molecule-ai-plugin-github-app-auth:
not found" error — prod images haven't moved.

Fix: add an actions/checkout step that fetches the plugin repo into
the build context before docker build runs.

Private-repo safe: uses PLUGIN_REPO_PAT secret (fine-grained PAT with
Contents:Read on Molecule-AI/molecule-ai-plugin-github-app-auth).
Falls back to the default GITHUB_TOKEN if the plugin repo is public.

Ops: set repo secret PLUGIN_REPO_PAT before the next main merge, or
publish will fail with a 404 on the checkout step.

Also gitignores the cloned dir so local dev builds don't accidentally
commit it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(promote-latest): workflow_dispatch to retag :staging-<sha> → :latest

Escape hatch for the initial rollout window (canary fleet not yet
provisioned, so canary-verify.yml's automatic promotion doesn't fire)
AND for manual rollback scenarios.

Uses the default GITHUB_TOKEN which carries write:packages on repo-
owned GHCR images, so no new secrets are needed. crane handles the
remote retag without pulling or pushing layers.

Validates the src tag exists before retagging + verifies the :latest
digest post-retag so a typo can't silently promote the wrong image.

Trigger from Actions → promote-latest → Run workflow → enter the
short sha (e.g. "4c1d56e").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(promote-latest): run on self-hosted mac mini (GH-hosted quota blocked)

* ci(promote-latest): suppress brew cleanup that hits perm-denied on shared runner

* feat(canvas): Phase 5 — credit balance pill + low-balance banner

Adds the UI surface for the credit system to /orgs:
- CreditsPill next to each org row. Tone shifts from zinc → amber at
  10% of plan to red at zero.
- LowCreditsBanner appears under the pill for running orgs when the
  balance crosses thresholds: overage_used > 0 → "overage active",
  balance <= 0 → "out of credits, upgrade", trial tail → "trial almost
  out".
- Pure helpers extracted to lib/credits.ts so formatCredits, pillTone,
  and bannerKind are unit-tested without jsdom.

Backend List query now returns credits_balance / plan_monthly_credits
/ overage_used_credits / overage_cap_credits so no second round-trip
is needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(canvas): ToS gate modal + us-east-2 data residency notice

Wraps /orgs in a TermsGate that polls /cp/auth/terms-status on mount
and overlays a blocking modal when the current terms version hasn't
been accepted yet. "I agree" POSTs /cp/auth/accept-terms and dismisses
the modal; the backend records IP + UA as GDPR Art. 7 proof-of-consent.

Also adds a short data residency notice under the page header:
workspaces run in AWS us-east-2 (Ohio, US). An EU region selector is
a future lift once the infra is provisioned there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(scheduler): defer cron fires when workspace busy instead of skipping (#969)

Previously, the scheduler skipped cron fires entirely when a workspace
had active_tasks > 0 (#115). This caused permanent cron misses for
workspaces kept perpetually busy by the 5-min Orchestrator pulse — work
crons (pick-up-work, PR review) were skipped every fire because the
agent was always processing a delegation.

Measured impact on Dev Lead: 17 context-deadline-exceeded timeouts in
2 hours, ~30% of inter-agent messages silently dropped.

Fix: when workspace is busy, poll every 10s for up to 2 minutes waiting
for idle. If idle within the window, fire normally. If still busy after
2 min, fall back to the original skip behavior.

This is a minimal, safe change:
- No new goroutines or channels
- Same fire path once idle
- Bounded wait (2 min max, won't block the scheduler pool)
- Falls back to skip if workspace never becomes idle

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(mcp): scrub secrets in commit_memory MCP tool path (#838 sibling)

PR #881 closed SAFE-T1201 (#838) on the HTTP path by wiring redactSecrets()
into MemoriesHandler.Commit — but the sibling code path on the MCP bridge
(MCPHandler.toolCommitMemory) was left with only the TODO comment. Agents
calling commit_memory via the MCP tool bridge are the PRIMARY attack vector
for #838 (confused / prompt-injected agent pipes raw tool-response text
containing plain-text credentials into agent_memories, leaking into shared
TEAM scope). The HTTP path is only exercised by canvas UI posts, so the MCP
gap was the hotter one.

Change:

  workspace-server/internal/handlers/mcp.go:725
    - TODO(#838): run _redactSecrets(content) before insert — plain-text
    - API keys from tool responses must not land in the memories table.
    + SAFE-T1201 (#838): scrub known credential patterns before persistence…
    + content, _ = redactSecrets(workspaceID, content)

Reuses redactSecrets (same package) so there's no duplicated pattern list —
a future-added pattern in memories.go automatically covers the MCP path too.

Tests added in mcp_test.go:

  - TestMCPHandler_CommitMemory_SecretInContent_IsRedactedBeforeInsert
      Exercises three patterns (env-var assignment, Bearer token, sk-…)
      and uses sqlmock's WithArgs to bind the exact REDACTED form — so a
      regression (removing the redactSecrets call) fails with arg-mismatch
      rather than silently persisting the secret.

  - TestMCPHandler_CommitMemory_CleanContent_PassesThrough
      Regression guard — benign content must NOT be altered by the redactor.

NOTE: unable to run `go test -race ./...` locally (this container has no Go
toolchain). The change is mechanical reuse of an already-shipped function in
the same package; CI must validate. The sqlmock patterns mirror the existing
TestMCPHandler_CommitMemory_LocalScope_Success test exactly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(ci): move canary-verify to self-hosted runner

GitHub-hosted ubuntu-latest runs on this repo hit "recent account
payments have failed or your spending limit needs to be increased"
— same root cause as the publish + CodeQL + molecule-app workflow
moves earlier this quarter. canary-verify was the last one still on
ubuntu-latest.

Switches both jobs to [self-hosted, macos, arm64]. crane install
switched from Linux tarball to brew (matches promote-latest.yml's
install pattern + avoids /usr/local/bin write perms on the shared
mac mini).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(canvas): pin AbortSignal timeout regression + cover /orgs landing page

Two independent test additions that harden the surface freshly landed on
staging via PRs #982 (canvas fetch timeout), #992 (/orgs landing), #994
(post-checkout redirect to /orgs).

canvas/src/lib/__tests__/api.test.ts (+74 lines, 7 new tests)
  - GET/POST/PATCH/PUT/DELETE each pass an AbortSignal to fetch
  - TimeoutError (DOMException name=TimeoutError) propagates to the caller
  - Each request installs its own signal — no shared module-level controller
    that would allow one slow request to cancel an unrelated fast one
  This is the hardening nit I flagged in my APPROVE-w/-nit review of
  fix/canvas-api-fetch-timeout. Landing as a follow-up now that #982 is in
  staging.

canvas/src/app/__tests__/orgs-page.test.tsx (+251 lines, new file, 10 tests)
  - Auth guard: signed-out → redirectToLogin and no /cp/orgs fetch
  - Error state: failed /cp/orgs → Error message + Retry button
  - Empty list: CreateOrgForm renders
  - CTA by status:
      running          → "Open" link targets {slug}.moleculesai.app
      awaiting_payment → "Complete payment" → /pricing?org=<slug>
      failed           → "Contact support" mailto
  - Post-checkout: ?checkout=success renders CheckoutBanner AND
    history.replaceState scrubs the query param
  - Fetch contract: /cp/orgs called with credentials:include + AbortSignal

Local baseline on origin/staging tip 845ac47:
  canvas vitest: 50 files / 778 tests, all green
  canvas build:  clean, /orgs route present (2.83 kB / 105 kB first-load)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(canvas): cover /orgs 5s polling on in-flight orgs

The test docstring promised polling coverage but I'd only wired the
describe-block header, not the actual tests. Closing that gap — vitest
fake timers drive three cases:

- `provisioning` org → 2nd fetch fires after 5.1s advance
- all `running` → no 2nd fetch even after 10s advance
- `awaiting_payment` org, unmount before timer fires → no post-unmount
  fetch (cleanup correctly clears the pollTimer)

The unmount case is the meaningful one: without it a fast nav-away
leaves the 5s interval chasing the CP forever. page.tsx L97-99 does
clear the timer; the test pins the contract.

Local baseline on origin/staging tip 845ac47 + this branch:
  canvas vitest: 50 files / 781 tests, all green (+3 vs prior commit)
  canvas build:  clean

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* ci(codeql): cover main + staging via workflow

GitHub's UI-configured "Code quality" scan only fires on the default
branch (staging), which leaves every staging→main promotion PR
unscanned. The "On push and pull requests to" field in the UI has no
dropdown; multi-branch scanning on private repos without GHAS isn't
available there.

Workflow file gives us the control we can't get in the UI: triggers
on push + pull_request for both branches. Runs on the same
self-hosted mac mini via [self-hosted, macos, arm64].

upload: never — GHAS isn't enabled on this repo so the SARIF upload
API 403s. Keep results locally, filter to error+warning severity,
fail the PR check on findings, publish SARIF as a workflow artifact.
Flipping upload: never → always after GHAS is enabled (if ever) is
a one-line change.

Picks up the review-flagged improvements from the earlier closed PR:
  - jq install step (brew, no assumption it's present)
  - severity filter (error+warning only, drops noisy note-level)
  - set -euo pipefail
  - SARIF glob (file name doesn't match matrix language id)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bundle/exporter): add rows.Err() after child workspace enumeration

Silent data loss on mid-cursor DB errors — partial sub-workspace
bundles returned instead of surfacing the iteration error. Adds
rows.Err() check after the SELECT id FROM workspaces query in
Export(), mirroring the pattern already used in scheduler.go
and handlers with similar recursion patterns.

Closes: R1 MISSING-ROWS-ERR findings (bundle/exporter.go)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(a11y): WorkspaceNode font floor, contrast, focus rings (Cycle 10)

C1: skills badge spans text-[7px]→text-[10px]; "+N more" overflow
    text-[7px] text-zinc-500→text-[10px] text-zinc-400
C2: Team section label text-[7px] text-zinc-600→text-[10px] text-zinc-400
H4: status label text-[9px]→text-[10px]; active-tasks count
    text-[9px] text-amber-300/80→text-[10px] text-amber-300 (remove opacity
    modifier per design-system contrast rule); current-task text
    text-[9px] text-amber-300/70→text-[10px] text-amber-300
L1: add focus-visible:ring-2 focus-visible:ring-blue-500/70 to the Restart
    button (independently Tab-focusable inside role="button" wrapper) and to
    the Extract-from-team button in TeamMemberChip; TeamMemberChip
    role="button" div already has the focus ring (COVERED, no change)

762/762 tests pass · build clean

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(ci): replace sleep 360 with health-check poll in canary-verify (#1013)

The canary-verify workflow blocked the self-hosted runner for a fixed
6 minutes regardless of whether canaries had already updated. This
wastes the runner slot when canaries update in 2-3 minutes.

Fix: poll each canary's /health endpoint every 30s for up to 7 min.
Exit early when all canaries report the expected SHA. Falls back to
proceeding after timeout — the smoke suite validates regardless.

Typical time saving: ~3-4 minutes per canary verify run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(gate-1): remove unused fireEvent import (#1011)

Mechanical lint fix. github-code-quality[bot] flagged unused
import on line 18 — fireEvent is imported but never referenced in
the test file. Removing it clears the code quality gate without
changing any test behaviour.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat: event-driven cron triggers + auto-push hook for agent productivity

Three changes to boost agent throughput:

1. Event-driven cron triggers (webhooks.go): GitHub issues/opened events
   fire all "pick-up-work" schedules immediately. PR review/submitted
   events fire "PR review" and "security review" schedules. Uses
   next_run_at=now() so the scheduler picks them up on next tick.

2. Auto-push hook (executor_helpers.py): After every task completion,
   agents automatically push unpushed commits and open a PR targeting
   staging. Guards: only on non-protected branches with unpushed work.
   Uses /usr/local/bin/git and /usr/local/bin/gh wrappers with baked-in
   GH_TOKEN. Never crashes the agent — all errors logged and continued.

3. Integration (claude_sdk_executor.py): auto_push_hook() called in the
   _execute_locked finally block after commit_memory.

Closes productivity gap where agents wrote code but never pushed,
and where work crons only fired on timers instead of reacting to events.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: disable schedules when workspace is deleted (#1027)

When a workspace is deleted (status set to 'removed'), its schedules
remained enabled, causing the scheduler to keep firing cron jobs for
non-existent containers. Add a cascade disable query alongside the
existing token revocation and canvas layout cleanup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: stop hardcoding CLAUDE_CODE_OAUTH_TOKEN in required_env (#1028)

The provisioner was unconditionally writing CLAUDE_CODE_OAUTH_TOKEN into
config.yaml's required_env for all claude-code workspaces.  When the
baked token expired, preflight rejected every workspace — even those
with a valid token injected via the secrets API at runtime.

Changes:
- workspace_provision.go: remove hardcoded required_env for claude-code
  and codex runtimes; tokens are injected at container start via secrets
- workspace_provision_test.go: flip assertion to reject hardcoded token

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add cascade schedule disable tests for #1027

- TestWorkspaceDelete_DisablesSchedules — leaf workspace delete disables its schedules
- TestWorkspaceDelete_CascadeDisablesDescendantSchedules — parent+child+grandchild cascade
- TestWorkspaceDelete_ScheduleDisableOnlyTargetsDeletedWorkspace — negative test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: multiple platform handler bug fixes

- secrets.go: Log RowsAffected errors instead of silently discarding them
- a2a_proxy.go: Add 60s safety timeout to a2aClient HTTP client
- terminal.go: Fix defer ordering - always close WebSocket conn on error,
  only defer resp.Close() after successful exec attach
- webhooks.go: Add shortSHA() helper to safely handle empty HeadSHA

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(runtime): inject HMA memory instructions at platform level (#1047)

Every agent now gets hierarchical memory instructions in their system
prompt automatically — no template configuration needed. Instructions
cover commit_memory (LOCAL/TEAM/GLOBAL scopes), recall_memory, and
when to use each proactively.

Follows the same pattern as A2A instructions: defined in
executor_helpers.py, injected by _build_system_prompt() in the
claude_sdk_executor.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: seed initial memories from org template and create payload (#1050)

Add MemorySeed model and initial_memories support at three levels:
- POST /workspaces payload: seed memories on workspace creation
- org.yaml workspace config: per-workspace initial_memories with
  defaults fallback
- org.yaml global_memories: org-wide GLOBAL scope memories seeded
  on the first root workspace during import

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(template): restructure molecule-dev org template to 39-agent hierarchy

Comprehensive rewrite of the Molecule AI dev team org template:

- Rename agents to {team}-{role} convention (e.g., core-be, cp-lead, app-qa)
- Add 5 new team leads: Core Platform Lead, Controlplane Lead, App & Docs Lead, Infra Lead, SDK Lead
- Add new roles: Release Manager, Integration Tester, Technical Writer, Infra-SRE, Infra-Runtime-BE, SDK-Dev, Plugin-Dev
- Delete triage-operator and triage-operator-2 (leads own triage now)
- Set default model to MiniMax-M2.7, tier 3, idle_interval_seconds 900
- Update org.yaml category_routing to new agent names
- Add orchestrator-pulse schedules for all leads (*/5 cron)
- Add pick-up-work schedules for engineers (*/15 cron)
- Add qa-review schedules for QA agents (*/15 cron)
- Add security-scan schedules for security agents (*/30 cron)
- Add release-cycle and e2e-test schedules for Release Manager and Integration Tester
- Update marketing agents with web search MCP and media generation capabilities
- All schedule prompts reference Molecule-AI/internal for PLAN.md and known-issues.md
- Un-ignore org-templates/molecule-dev/ in .gitignore for version tracking

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix test assertions to account for HMA instructions in system prompt

Mock get_hma_instructions in exact-match tests so they don't break
when HMA content is appended. Add a dedicated test for HMA inclusion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: gitignore org-templates/ and plugins/ entirely

These directories are cloned from their standalone repos
(molecule-ai-org-template-*, molecule-ai-plugin-*) and should
never be committed to molecule-core directly.

Removed the !/org-templates/molecule-dev/ exception that allowed
PR #1056 to land template files in the wrong repo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(workspace-server): send X-Molecule-Admin-Token on CP calls

controlplane #118 + #130 made /cp/workspaces/* require a per-tenant
admin_token header in addition to the platform-wide shared secret.
Without it, every workspace provision / deprovision / status call
now 401s.

ADMIN_TOKEN is already injected into the tenant container by the
controlplane's Secrets Manager bootstrap, so this is purely a
header-plumbing change — no new config required on the tenant side.

## Change

- CPProvisioner carries adminToken alongside sharedSecret
- New authHeaders method sets BOTH auth headers on every outbound
  request (old authHeader deleted — single call site was misleading
  once the semantics changed)
- Empty values on either header are no-ops so self-hosted / dev
  deployments without a real CP still work

## Tests

Renamed + expanded cp_provisioner_test cases:
- TestAuthHeaders_NoopWhenBothEmpty — self-hosted path
- TestAuthHeaders_SetsBothWhenBothProvided — prod happy path
- TestAuthHeaders_OnlyAdminTokenWhenSecretEmpty — transition window

Full workspace-server suite green.

## Rollout

Next tenant provision will ship an image with this commit merged.
Existing tenants (none in prod right now — hongming was the only
one and was purged earlier today) will auto-update via the 5-min
image-pull cron.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: GitHub token refresh — add WorkspaceAuth path for credential helper (#1068)

PR #729 tightened AdminAuth to require ADMIN_TOKEN, breaking the
workspace credential helper which called /admin/github-installation-token
with a workspace bearer token. Tokens expired after 60 min with no refresh.

Fix: Add /workspaces/:id/github-installation-token under WorkspaceAuth
so any authenticated workspace can refresh its GitHub token. Keep the
admin path as backward-compatible alias.

Update molecule-git-token-helper.sh to use the workspace-scoped path
when WORKSPACE_ID is set.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(workspace-server): cover Stop/IsRunning/Close + auth-header + transport errors

Closes review gap: pre-PR coverage on CPProvisioner was 37%.
After this commit every exported method is exercised:

  - NewCPProvisioner            100%
  - authHeaders                  100%
  - Start                         91.7% (remainder: json.Marshal error
                                   path, unreachable with fixed-type
                                   request struct)
  - Stop                         100% (new — header + path + error)
  - IsRunning                    100% (new — 4-state matrix + auth)
  - Close                        100% (new — contract no-op)

New cases assert both auth headers (shared secret + admin_token) land
on every outbound request, transport failures surface clear errors
on Start/Stop, and IsRunning doesn't misreport on transport failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(workspace-server): IsRunning surfaces non-2xx + JSON errors

Pre-existing silent-failure path: IsRunning decoded CP responses
regardless of HTTP status, so a CP 500 → empty body → State="" →
returned (false, nil). The sweeper couldn't distinguish "workspace
stopped" from "CP broken" and would leave a dead row in place.

## Fix

  - Non-2xx → wrapped error, does NOT echo body (CP 5xx bodies may
    contain echoed headers; leaking into logs would expose bearer)
  - JSON decode error → wrapped error
  - Transport error → now wrapped with "cp provisioner: status:"
    prefix for easier log grepping

## Tests

+7 cases (5-status table + malformed JSON + existing transport).
IsRunning coverage 100%; overall cp_provisioner at 98%.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cp_provisioner): IsRunning returns (true, err) on transient failures

My #1071 made IsRunning return (false, err) on all error paths, but that
breaks a2a_proxy which depends on Docker provisioner's (true, err) contract.
Without this fix, any brief CP outage causes a2a_proxy to mark workspaces
offline and trigger restart cascades across every tenant.

Contract now matches Docker.IsRunning:
  transport error    → (true, err)  — alive, degraded signal
  non-2xx response   → (true, err)  — alive, degraded signal
  JSON decode error  → (true, err)  — alive, degraded signal
  2xx state!=running → (false, nil)
  2xx state==running → (true, nil)

healthsweep.go is also happy with this — it skips on err regardless.

Adds TestIsRunning_ContractCompat_A2AProxy as regression guard that
asserts each error path explicitly against the a2a_proxy expectations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cp_provisioner): cap IsRunning body read at 64 KiB

IsRunning used an unbounded json.NewDecoder(resp.Body).Decode on
CP status responses. Start already caps its body read at 64 KiB
(cp_provisioner.go:137) to defend against a misconfigured or
compromised CP streaming a huge body and exhausting memory.

IsRunning is called reactively per-request from a2a_proxy and
periodically from healthsweep, so it's a hotter path than Start
and arguably deserves the same defense more.

Adds TestIsRunning_BoundedBodyRead that serves a body padded past
the cap and asserts the decode still succeeds on the JSON prefix.

Follow-up to code-review Nit-2 on #1073.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(canvas): /waitlist page with contact form

Adds the user-facing half of the beta-gate: a page at /waitlist that
the CP auth callback redirects users to when their email isn't on
the allowlist. Collects email + optional name + use-case and POSTs
to /cp/waitlist/request (backend landed in controlplane #150).

## Behavior

- No auto-pre-fill of email from URL query (CP's #145 dropped the
  ?email= param for the privacy reason; this test guards against a
  future regression on the client side).
- Client-side validates email shape for instant feedback; backend
  re-validates.
- Three UI states after submit:
    success → "your request is in" banner, form hidden
    dedup   → softer "already on file" banner when backend returns
              dedup=true (same 200, no 409 to avoid enumeration)
    error   → inline banner with backend message or network fallback

## Tests

9 tests in __tests__/waitlist-page.test.tsx covering:
- default render + a11y (role=button, role=status, role=alert)
- URL-pre-fill privacy regression guard
- HTML5 + JS validation (empty, malformed)
- successful POST with trimmed body
- dedup branch
- non-2xx with + without error field
- network rejection

Follow-up to the beta-gate rollout on controlplane #145 / #150.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(canvas): remove dead /waitlist page (lives in molecule-app)

#1080 added /waitlist to canvas, but canvas isn't served at
app.moleculesai.app — it backs the tenant subdomains (acme.moleculesai.app
etc.). The real /waitlist lives in the separate molecule-app repo,
which is what the CP auth callback redirects to.

molecule-app#12 has the real page + contact form wiring to
/cp/waitlist/request. This canvas copy was never reachable and would
only diverge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(org-import): limit concurrent Docker provisioning to 3 (#1084)

The org import fired all workspace provisioning goroutines concurrently,
overwhelming Docker when creating 39+ containers. Containers timed out,
leaving workspaces stuck in 'provisioning' with no schedules or hooks.

Fix:
- Add provisionConcurrency=3 semaphore limiting concurrent Docker ops
- Increase workspaceCreatePacingMs from 50ms to 2000ms between siblings
- Pass semaphore through createWorkspaceTree recursion

With 39 workspaces at 3 concurrent + 2s pacing, import takes ~30s instead
of timing out. Each workspace gets its full template: schedules, hooks,
settings, hierarchy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: add ?purge=true hard-delete to DELETE /workspaces/:id (#1087)

Soft-delete (status='removed') leaves orphan DB rows and FK data forever.
When ?purge=true is passed, after container cleanup the handler cascade-
deletes all leaf FK tables and hard-removes the workspace row.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: remove org-templates/molecule-dev from git tracking

This directory belongs in the dedicated repo
Molecule-AI/molecule-ai-org-template-molecule-dev.
It should be cloned locally for platform mounting, never
committed to molecule-core. The .gitignore already blocks it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(canvas): add NEXT_PUBLIC_ADMIN_TOKEN + CSP_DEV_MODE to docker-compose

Canvas needs AdminAuth token to fetch /workspaces (gated since PR #729)
and CSP_DEV_MODE to allow cross-port fetches in local Docker.

These were added earlier but lost on nuke+rebuild because they weren't
committed to staging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(canvas): CSP_DEV_MODE + admin token for local Docker (#1052 follow-up)

Three changes that keep getting lost on nuke+rebuild:
1. middleware.ts: read CSP_DEV_MODE env to relax CSP in local Docker
2. api.ts: send NEXT_PUBLIC_ADMIN_TOKEN header (AdminAuth on /workspaces)
3. Dockerfile: accept NEXT_PUBLIC_ADMIN_TOKEN as build arg

All three are required for the canvas to work in local Docker where
canvas (port 3000) fetches from platform (port 8080) cross-origin.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(canvas): make root layout dynamic so CSP nonce reaches Next scripts

Tenant page loads were failing with repeated CSP violations:

  Executing inline script violates ... script-src 'self'
  'nonce-M2M4YTVh...' 'strict-dynamic'. ...

because Next.js's bootstrap inline scripts were emitted without a
nonce attribute. The middleware was generating per-request nonces
correctly and sending them via `x-nonce` — but the layout was
fully static, so Next.js cached the HTML once and served that cached
bundle (no nonces baked in) for every request.

Fix: call `await headers()` in the root layout. That opts the tree
into dynamic rendering AND signals Next.js to propagate the
x-nonce value to its own generated <script> tags.

The `nonce` return value is intentionally unused — the framework
handles its bootstrap scripts automatically once the read happens.
Future code that adds third-party <Script> components (analytics,
etc.) should pass the returned nonce explicitly.

Verified against live tenant: before this change every /_next/
chunk script tag in the HTML had no nonce attribute; expected after
deploy is `<script nonce="..." src="/_next/...">` on each.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(auth): accept admin token in WorkspaceAuth for canvas dashboard

The canvas sends NEXT_PUBLIC_ADMIN_TOKEN on all API calls but per-workspace
routes (/activity, /delegations, /traces) use WorkspaceAuth which only
accepts per-workspace bearer tokens. This made the canvas dashboard 401
on every workspace detail view.

Fix: WorkspaceAuth now accepts the admin token as a fallback after
workspace token validation fails. This lets the canvas read all workspace
data with a single admin credential.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(auth): accept admin token in CanvasOrBearer for viewport PUT

* fix(ci): bake api.moleculesai.app into tenant canvas bundle

Canvas's browser-side code (auth.ts, api.ts, billing.ts) all call
fetch(PLATFORM_URL + /cp/*). PLATFORM_URL comes from
NEXT_PUBLIC_PLATFORM_URL at build time; with the build arg unset,
it falls back to http://localhost:8080 in the compiled bundle.

That means on a tenant like hongmingwang.moleculesai.app, the
user's browser actually tried to fetch http://localhost:8080/cp/
auth/me — which resolves to the USER'S OWN machine, not the tenant.
Login redirect loops 404. Every tenant canvas has been unable to
complete a fresh login on this path; existing sessions only worked
because the cookie was already set domain-wide.

Fix: pass NEXT_PUBLIC_PLATFORM_URL=https://api.moleculesai.app
as a build arg in the tenant-image workflow. CP already allows
CORS from *.moleculesai.app + credentials, and the session cookie
is scoped to .moleculesai.app so tenant subdomains inherit it.

Verified in prod by rebuilding canvas locally with the flag and
hot-patching the hongmingwang instance via SSM. Baked chunks now
contain api.moleculesai.app; browser auth redirects resolve
cleanly to the CP.

Self-hosted users override by rebuilding with their own URL —
same pattern molecule-app uses with NEXT_PUBLIC_CP_ORIGIN.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: nuke-and-rebuild.sh — one-command fleet reset

Two scripts:
- nuke-and-rebuild.sh: docker down -v, clean orphans, rebuild, setup
- post-rebuild-setup.sh: insert global secrets (MiniMax + GH PAT),
  import org template, wait for platform health

Global secrets ensure every provisioned container gets MiniMax API
config and GitHub PAT injected as env vars automatically — no manual
settings.json deployment needed.

Usage: bash scripts/nuke-and-rebuild.sh

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(canvas): include NEXT_PUBLIC_PLATFORM_URL in CSP connect-src

Tenant page loads were blocked by:

  Refused to connect to 'https://api.moleculesai.app/cp/auth/me'
  because it violates the document's Content Security Policy.

CSP had `connect-src 'self' wss:` — fine for same-origin + any wss,
but browser refuses cross-origin HTTPS fetches that aren't listed.
PLATFORM_URL (baked from NEXT_PUBLIC_PLATFORM_URL, which is the CP
origin on SaaS tenants) needs to be explicit.

Fix: middleware reads NEXT_PUBLIC_PLATFORM_URL at build/runtime
and adds both the https and wss siblings to connect-src. Self-
hosted deploys that override the build-arg automatically get a
matching CSP — no hardcoded hostname.

Test added: buildCsp includes NEXT_PUBLIC_PLATFORM_URL origin in
connect-src when set. Also loosens the dev `ws:` assertion since
dev uses `connect-src *` which subsumes ws (pre-existing behavior,
test was stale).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(router): /cp/* reverse-proxy to CP + same-origin canvas fetches

Canvas's browser bundle issues fetches to both CP endpoints
(/cp/auth/me, /cp/orgs, ...) AND tenant-platform endpoints
(/canvas/viewport, /approvals/pending, /org/templates). They
share ONE build-time base URL. Baking api.moleculesai.app
broke tenant calls with 404; baking the tenant subdomain broke
auth. Tried both today and saw exactly one failure mode per
attempt.

Real fix: same-origin fetches + tenant-side split. Adds:

  internal/router/cp_proxy.go      # /cp/* → CP_UPSTREAM_URL

mounted before NoRoute(canvasProxy). Now a tenant serves:

  /cp/*              → reverse-proxy to api.moleculesai.app
  /canvas/viewport,
  /approvals/pending,
  /workspaces/:id/*,
  /ws, /registry,    → tenant platform (existing handlers)
  /metrics
  everything else    → canvas UI (existing reverse-proxy)

Canvas middleware reverts to `connect-src 'self' wss:` for the
same-origin path (keeping explicit PLATFORM_URL whitelist as a
self-hosted escape hatch when the build-arg is non-empty).

CI build-arg flips to NEXT_PUBLIC_PLATFORM_URL="" so the bundle
issues relative fetches.

Security of cp_proxy:
  - Cookie + Authorization PRESERVED across the hop (opposite of
    canvas proxy) — they carry the WorkOS session, which is the
    whole point.
  - Host rewritten to upstream so CORS + cookie-domain on the CP
    side see their own hostname.
  - Upstream URL validated at construction: must parse, must be
    http(s), must have a host — misconfig fails closed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* security: remove hardcoded API keys from post-rebuild-setup.sh

GitGuardian detected exposed MiniMax API key and GitHub PAT in the
script's default values. Replaced with env var reads from .env file
(which is gitignored). Script now validates required secrets exist
before proceeding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(middleware): TenantGuard passes through /cp/* to CP proxy

Today's rollout of cp_proxy (PR #1095/1096) mounted /cp/* as a
reverse-proxy to the control plane, but the TenantGuard middleware
runs first in the global chain and 404s anything that isn't in its
exact-path allowlist (/health + /metrics). Every /cp/auth/me fetch
from canvas landed on a 40µs 404 before ever reaching the proxy.

/cp/* is handled upstream (WorkOS session + admin bearer), so the
tenant doesn't need to attach org identity for those paths. Passing
them through is correct — matches the design where the tenant
platform is a pure transit layer for /cp/*.

Verified: /cp/auth/me via tunnel now returns 401 (correct unauth
from CP) instead of 404 from TenantGuard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(middleware): AdminAuth accepts CP-verified WorkOS session

Canvas (SaaS tenant UI) runs in the browser and authenticates the
user via a WorkOS session cookie scoped to .moleculesai.app. It
has no bearer token — the token-based ADMIN_TOKEN scheme is for
CLI + server-to-server callers, not end users.

Adds a session-verification tier to AdminAuth that runs BEFORE the
bearer check:

 1. If Cookie header present AND CP_UPSTREAM_URL configured →
    GET /cp/auth/me upstream with the same cookie. 200 + valid
    user_id → grant admin access. Non-200 → fall through.
 2. Else (no cookie, or no CP configured, or CP said no) →
    existing bearer-only path unchanged.

Positive verifications are cached 30s keyed by the raw Cookie
header, so a burst of canvas admin-page renders doesn't DDoS
the CP. Revocations propagate within that window.

Self-hosted / dev deploys without CP_UPSTREAM_URL: feature
disabled, behavior unchanged. So this is strictly additive for
the SaaS case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(docker): fix plugin go.mod replace for TokenProvider interface (#960)

The github-app-auth plugin's go.mod had a relative replace directive
(../molecule-monorepo/platform) that didn't resolve in Docker where
the plugin is at /plugin/ and the platform at /app/. This caused the
plugin's provisionhook.TokenProvider interface to come from a different
package path than the platform's, so the type assertion in
FirstTokenProvider() failed — "no token provider registered".

Fix: sed the plugin's go.mod replace to point at /app during Docker build.
Also added debug logging to GetInstallationToken for future diagnosis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: close cross-tenant authz + cp_proxy admin-traversal gaps

Addresses three Critical findings from today's code review of the
SaaS-canvas routing stack.

## Critical-1: session verification scoped to the current tenant

session_auth.go previously verified via GET /cp/auth/me, which
only answers "is someone logged in" — NOT "is this user in the
org they're targeting." Every WorkOS-authed user (including folks
who only signed up via app.moleculesai.app with no tenant
relationship) could call /workspaces, /approvals/pending,
/bundles/import, /org/import etc. on ANY tenant they could reach.
Cross-tenant read: user at acme.moleculesai.app could hit
bob.moleculesai.app/workspaces with their cookie and get Bob's
workspaces.

Fix:
  - CP gains GET /cp/auth/tenant-member?slug=<slug> which joins
    org_members × organizations and only returns member:true when
    the authenticated user is actually in that org.
  - Tenant sets MOLECULE_ORG_SLUG at boot via user-data.
  - session_auth now calls tenant-member (not /me), passing its
    own slug. Cache key includes slug so one tenant's cached
    positive never satisfies another's check.

## Critical-2: cp_proxy path allowlist (lateral-movement fix)

cp_proxy.go forwarded any /cp/* path upstream with the cookie
and bearer attached. Since /cp/admin/* accepts sessions as one
of its auth tiers, a tenant-authed user could curl
/cp/admin/tenants/other-slug/diagnostics through their tenant
and the CP would honor it — turning any tenant into a lateral
hop into admin surface.

Fix: explicit allowlist of paths the canvas browser bundle
actually needs (/cp/auth, /cp/orgs, /cp/billing, /cp/templates,
/cp/legal). Everything else 404s at the tenant before cookies
leave. Fail-closed: future UI paths require explicit entries.

## Important-1,2: bounded session cache + split positive/negative TTL

Previous sync.Map cache grew unbounded (one entry per unique
Cookie header for process lifetime) and cached failures for 30s,
meaning a 3s CP blip locked users out for the full window.

Fix:
  - Bounded map with batch random eviction at cap (10k entries ×
    ~100 bytes = 1 MB ceiling). Random eviction is O(1)
    expected; we don't need precise LRU.
  - Periodic sweeper goroutine (2 min) reclaims expired entries
    even when they're not re-hit.
  - Positive TTL 30s, negative TTL 5s — short negative so CP
    flakes self-heal fast.
  - Transport errors NOT cached (would otherwise trap every
    user during a multi-second upstream outage).
  - Cache key = sha256(slug + cookie) so raw session tokens
    don't sit in process memory, and cross-tenant isolation is
    structural not policy.

## Important-3: TenantGuard /cp/* bypass documented

Added a security note to the bypass explaining why it's safe
only under the current setup (cp_proxy allowlist + tunnel-only
ingress), and what would require revisiting (SG opens :8080
inbound to the VPC).

## Tests

  - session_auth_test.go: 12 new tests — empty cookie, missing
    slug, no CP, member:true happy path with cache hit, member:
    false, 401 upstream, malformed JSON, transport error not
    cached, cross-tenant isolation (same cookie different
    tenants hit upstream separately), bounded eviction, expired
    entries, cache key collision resistance.
  - cp_proxy_test.go: new — isCPProxyAllowedPath covers 17
    allow/block cases, forwarding preserves Cookie+Auth, Host
    rewritten, blocked paths 404 without calling upstream.

All platform tests pass. CP provisioner tests pass after
threading cfg.OrgSlug into the container env.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(auth): organization-scoped API keys for admin access

Adds user-facing API keys with full-org admin scope. Replaces the
single ADMIN_TOKEN env var with named, revocable, audited tokens
that users can mint/rotate from the canvas UI without ops
intervention.

Designed for the beta growth phase — one token tier (full admin).
Future work will split into scoped roles (admin / workspace-write
/ read-only) and per-workspace bindings. See docs…

* test(handlers): add 5 TestKI005 regression tests to terminal_test.go

Port terminal hierarchy guard regression suite:
- TestKI005_SelfAccess_AlwaysAllowed: own workspace token always passes
- TestKI005_CanCommunicatePeer_Allowed: sibling workspace access granted
- TestKI005_CanCommunicateNonPeer_Forbidden: cross-org access blocked (403)
- TestKI005_TokenMismatch_Unauthorized: token/Workspace-ID mismatch blocked (401)
- TestKI005_NoXWorkspaceIDHeader_LegacyAllowed: legacy access no header → proceeds

Refs: F1085, KI-005

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Hongming Wang <hongmingwangrabbit@gmail.com>
Co-authored-by: Hongming Wang <hongmingwang.rabbit@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Molecule AI Backend Engineer <backend-engineer@agents.moleculesai.app>
Co-authored-by: qa-agent <qa-agent@users.noreply.github.com>
Co-authored-by: Molecule AI Frontend Engineer <frontend-engineer@agents.moleculesai.app>
Co-authored-by: Molecule AI Triage Operator <triage-operator@agents.moleculesai.app>
Co-authored-by: Molecule AI Platform Engineer <platform-engineer@agents.moleculesai.app>
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>
Co-authored-by: Molecule AI SDK-Dev <sdk-dev@agents.moleculesai.app>
Co-authored-by: airenostars <airenostars@gmail.com>
Co-authored-by: Molecule AI Core-BE <core-be@agents.moleculesai.app>
Co-authored-by: Molecule AI Core-DevOps <core-devops@agents.moleculesai.app>
Co-authored-by: Molecule AI Core-FE <core-fe@agents.moleculesai.app>
Co-authored-by: Molecule AI Fullstack (floater) <fullstack-floater@agents.moleculesai.app>
Co-authored-by: Molecule AI CP-QA <cp-qa@agents.moleculesai.app>
Co-authored-by: Molecule AI Core-UIUX <core-uiux@agents.moleculesai.app>
Co-authored-by: Molecule AI PMM <pmm@agents.moleculesai.app>
Co-authored-by: Molecule AI Social Media Brand <social-media-brand@agents.moleculesai.app>
Co-authored-by: Molecule AI DevRel Engineer <devrel-engineer@agents.moleculesai.app>
Co-authored-by: Marketing Lead <marketing-lead@agents.moleculesai.app>
Co-authored-by: Molecule AI Controlplane Lead <controlplane-lead@agents.moleculesai.app>
Co-authored-by: Molecule AI CP-BE <cp-be@agents.moleculesai.app>
Co-authored-by: Molecule AI Community Manager <community-manager@agents.moleculesai.app>
Co-authored-by: Molecule AI Technical Writer <technical-writer@agents.moleculesai.app>
Co-authored-by: Molecule AI App-FE <app-fe@agents.moleculesai.app>
2026-04-24 01:58:31 +00:00
7807bf8dc4 Merge remote-tracking branch 'refs/remotes/origin/staging' into sync/staging-to-main-2026-04-24 2026-04-24 01:56:21 +00:00
molecule-ai[bot]
b1dce3405c
Merge branch 'staging' into test/2026-04-23-regression-suite 2026-04-24 01:55:06 +00:00
Hongming Wang
00e3e3f570 fix(#1933): bump molecule-ai-plugin-github-app-auth to current main (step 1)
Ships step 1 of the #1933 fleet-wide GH_TOKEN refresh fix.

The plugin's v0.0.0-20260416194734-2cd28737f845 predates the Mutator.Token()
method added in plugin-repo PR #1 (merged 2026-04-17). Monorepo's
workspace-server/pkg/provisionhook/mutator.go:218 has been emitting
`provisionhook: no Token method on "github-app-auth"` on every boot and
the reflection-fallback at mutator.go:216 is doing extra work every
time a workspace requests a fresh GH token.

This is the one-line pin bump:
  v0.0.0-20260416194734-2cd28737f845 → v0.0.0-20260421064811-7d98ae51e31d

Effect: direct-interface path (not the reflection fallback) gets taken,
log noise goes away. Does NOT fix the actual 60-min GH_TOKEN death —
steps 2–5 of #1933 (credential helper install, git config wire-up,
runtime auth context, periodic refresh) are separate, larger PRs.

Verified: workspace-server/go build ./... passes with the new pin.

Ref: #1933
2026-04-23 18:53:25 -07:00
Hongming Wang
98887599d3
Merge pull request #1904 from Molecule-AI/plugin/mcp-server-adaptor
feat(plugin): implement MCPServerAdaptor (issue #847)
2026-04-23 18:44:28 -07:00
61c5f8ad9a feat(plugin): implement MCPServerAdaptor (issue #847)
Rule-of-three threshold met: 4 plugin proposals (molecule-firecrawl
#512, molecule-github-mcp #520, molecule-browser-use #553, mcp-connector
#573) all independently shipped the same mcpServers-adapter pattern.

Adds MCPServerAdaptor to builtins.py — plugins wrapping an MCP server
now declare `from plugins_registry.builtins import MCPServerAdaptor as
Adaptor` in their per-runtime adapter file. The adaptor:

- Merges mcpServers from settings-fragment.json into
  <configs>/.claude/settings.json (deep-merge so multiple plugins'
  servers coexist).
- Optionally ships skills/rules/setup.sh via AgentskillsAdaptor
  delegation.
- On uninstall: removes skills/rules but intentionally leaves
  mcpServers entries in settings.json (users may share configs with
  other tools or have manually curated entries).

Also fixes _deep_merge_hooks: non-hook top-level keys that are dicts
(e.g. mcpServers) are now deep-merged with existing values instead of
being skipped via setdefault.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 01:42:13 +00:00