Matches tests/e2e/test_staging_full_saas.sh's 20-min budget (#1930).
Canvas E2E was still stuck at 900s (15 min) which regularly flakes on
tenant cold boots in 12-15 min range — especially on staging where
workspace-server image pulls + AMI bootstrapping add 3-5 min vs prod.
Concrete blocker: 2026-04-24 staging→main sync (#1981) kept failing on
"tenant provision: timed out after 900s" in canvas/e2e/staging-setup.ts
despite the actual sync E2E going green. Canvas-side timeout was
strictly tighter than the sync-side timeout.
Also raises WORKSPACE_ONLINE_TIMEOUT_MS to 20 min to cover the case
where the workspace EC2 is provisioned but hermes cold-install (apt +
uv + hermes-agent clone + gateway boot) takes longer than the original
10-min budget — matches the 20-min workspace deadline in SaaS E2E.
No behavior change when things are fast. Just covers the tail.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
71 files across docs/marketing/ and marketing/ are blocked by the
Block-internal-flavored-paths CI gate (CEO directive 2026-04-23).
These paths must live in Molecule-AI/internal, not the public monorepo.
Unblocks PR #1981 (staging→main sync).
Public-facing blog/devrel content should be re-added via correct paths:
docs/blog/<slug>.md, docs/devrel/<slug>.md, docs/tutorials/<slug>.md
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Audit 2026-04-24 case: org-templates/molecule-dev/ contained only .git/
(working tree wiped). ListTemplates silently skipped the directory and
the molecule-dev template silently disappeared from the Canvas palette.
No log trail; CEO discovered hours later when looking for the registry
listing manually.
This commit adds a one-line log warning when a directory under orgDir
has a .git/ subdir but no org.yaml/.yml — that's almost always a manifest
clone that got truncated. The warning includes the recovery command
(`git checkout main -- .`) so operators can self-fix without re-cloning.
Doesn't change the response behavior — the directory is still skipped
to keep ListTemplates a fail-soft endpoint. Just makes the failure
visible in `docker logs platform`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes#1957. All agents share one PAT, so `gh issue create --assignee @me`
resolves to the CEO. Today's "6 issues @me for 7 cycles" defect signal
turned out to be CEO-load misclassified as team-stagnation.
Translation rules:
- `--assignee @me` → `--label team:<role-slug>`
- `--reviewer @me` → dropped (review-bot scans labels, not requests)
- `--assignee user` (real user) → unchanged
role-slug derived from GIT_AUTHOR_NAME ("Molecule AI Core-BE" → "core-be").
The wrapper already handled the title-prefix + body-footer transforms;
these are just two more cases in the existing arg-walk loop.
Backward compat: any agent prompt that doesn't use @me passes through
unchanged. Agents don't need prompt updates — the wrapper is transparent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-merge guard against the deadlock pattern that hit twice today:
adding a workflow's check to required_status_checks while the workflow
itself doesn't have a `merge_group:` trigger → merge queue stalls
forever in AWAITING_CHECKS because the required check can't fire on
gh-readonly-queue/* refs.
Each time today this happened it cost 30-60min of debug + a hot-fix PR
+ temporary removal of the required check. This workflow runs on every
PR touching .github/workflows/ and on push to staging/main, listing
required checks for staging and verifying each one's owning workflow
declares merge_group.
Self-listens on merge_group so the linter passes its own queue runs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of the 2026-04-24 all-day E2E failure chain: Railway staging
CP had TENANT_IMAGE pinned to :staging-a14cf86 — a static SHA that had
silently drifted 10+ days stale. Every new tenant (including every E2E
run's fresh tenant) was spawned with that stale image, which predated
applyRuntimeModelEnv. Without applyRuntimeModelEnv, HERMES_DEFAULT_MODEL
never reached the workspace EC2 user-data, so install.sh fell back to
nousresearch/hermes-4-70b → openrouter → 401 "Missing Authentication
header" in every A2A reply.
Four correct fixes shipped today all got shadowed by this single stale
pin:
• template-hermes#19 (provider priority for openai/*)
• template-hermes#20 (decouple prefix-strip from bridge guard)
• molecule-controlplane#247 (force fresh /opt/adapter clone)
• molecule-core#1987 (E2E pins HERMES_CUSTOM_* as workaround)
Fix: publish each main build under both :staging-<sha> AND :staging-latest.
Change Railway staging CP's TENANT_IMAGE env to :staging-latest (done via
`railway variables --set` as part of this incident). Future main builds
then auto-propagate to new tenant provisions without any human in the
loop.
Safety: :staging-latest is the "most recent main build" — NOT a
canary-verified promotion. That distinction is preserved:
• Prod tenants still pull :latest (canary-verified, retagged by
canary-verify.yml only after the canary fleet green-lights a digest)
• Staging tenants now pull :staging-latest (every main build, pre-canary)
So staging becomes the canary: if a :staging-latest build regresses,
the staging canary fleet catches it before it can be promoted to :latest
for prod. This is what the canary design intended; the missing
:staging-latest tag was the hole.
Zero impact on image size / build time: Docker tags point at the same
digest, no duplicate push.
Follow-up: filed an issue tracking the need for CP's TENANT_IMAGE to
NEVER be pinned to a SHA in any environment — it must always float on a
named tag (:staging-latest for staging, :latest for prod).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-do of the fix that was originally bundled into PR #1995 but never
landed — the second commit on that branch got rejected by GH006
(branch locked by merge queue) after the first commit was already
queued. Only the file-removal commit made it to staging.
Without this trigger, adding "Block forbidden paths" to
required_status_checks deadlocks the queue: every PR sits in
AWAITING_CHECKS forever waiting on a check that can't fire on
gh-readonly-queue/* refs.
Sequence to land safely:
1. (already done) Removed "Block forbidden paths" from required_status_checks
2. (this PR) Add merge_group trigger
3. (after merge) Re-add "Block forbidden paths" to required_status_checks
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1889 ("docs(blog): A2A Protocol deep-dive") landed two files under
the forbidden marketing/devrel/ path:
- marketing/devrel/phase34-platform-instructions-social-copy.md
- marketing/devrel/phase34-tool-trace-social-copy.md
The Block-forbidden-paths workflow correctly flagged both at PR-time
(run 24875689649 — failure at 06:28:20Z) but it was NOT in the required
status checks list on staging, so the PR merged anyway at 06:32:47Z.
The push-event run on staging then failed visibly (run 24875838257),
which is what surfaced this.
Two-part fix:
1. (this PR) Remove the leaked files. Authors can re-file the same
content in Molecule-AI/internal under marketing/ if it's still needed.
2. (already done outside this PR) "Block forbidden paths" added to
required_status_checks on staging branch protection so the next leak
attempt gets blocked at PR-merge time, not after the fact.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five tightly-related fixes surfaced while stress-testing org-template
imports (Legal Team, Molecule Company, etc.) on a running control plane:
1) Org import was silently failing — INSERT wrote `collapsed` into the
`workspaces` table but that column lives on `canvas_layouts`
(005_canvas_layouts.sql). Every import returned 207 with 0 rows
created, which `api.post` treated as success → green "Imported"
toast + empty canvas. Moved the write to canvas_layouts; updated
the workspace_crud PATCH path to UPSERT there too; refreshed the
test mock. Added a client-side assertion that throws on
2xx-with-`error`-body so future partial-failures surface a red
toast rather than lying about success.
2) Multi-level nested layout was collision-prone: children that were
themselves parents (CTO → Dev Lead → 6 engineers) got the same
leaf-sized grid slot as leaf siblings and clipped into each other.
Added post-order `sizeOfSubtree` + sibling-size-aware
`childSlotInGrid` on both the Go server and the TS client (kept in
sync). `buildNodesAndEdges` now uses subtree sizes for both parent
dimensions and the rescue heuristic. `setCollapsed` on expand now
reads each child's actual rendered width/height instead of the
leaf-count formula — a regression test covers the CTO/Dev Lead
scenario.
3) Provisioning-timeout banner was unusable during large imports: a
30-workspace tree triggered 27 simultaneous "stuck" warnings 2
minutes in (server paces + provision concurrency = 3 guarantee tail
items legitimately wait longer). Scaled threshold with concurrent
count (base + 45s per queue slot beyond concurrency) and added a
Dismiss (×) button per banner.
4) Auto pan-and-zoom on org ready: after the last workspace flips out
of `provisioning`, canvas now fitView's with a 1.2s animation,
0.25 padding, `maxZoom: 0.8` and `minZoom: 0.25`. Without the zoom
caps fitView was hitting the component's maxZoom=2 on small trees
and zooming in instead of out.
5) Toolbar was visually busy: `+ N sub` count wrapped onto a second
row on narrow viewports; status dot and workspace total were in
separate border-delimited cells. Merged into one segment with
`whitespace-nowrap`; A2A / Audit / Search / Help collapsed to
icon-only 28px buttons with tooltip + aria-label (Figma/Linear
pattern). Stop All / Restart Pending keep text — they're urgent.
Also:
- `api.{get,post,...}` accept an optional `{ timeoutMs }` so callers
that hit intentionally-slow endpoints (org import paces 2s between
siblings) don't trip the 15s default and report false aborts.
- `WorkspaceNode` clamps role text to 2 lines so verbose descriptions
don't unboundedly grow card height and break the grid.
- `PARENT_HEADER_PADDING` bumped 44→130 to clear name + runtime +
2-line role + the currentTask banner that appears during the
initial-prompt phase.
Tests: 930 canvas tests + full Go handler suite pass. Added
regressions for (i) 207 partial-success surfacing as throw, and
(ii) setCollapsed sizing with nested-parent children.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AuthGate now skips session fetch for /cp/auth/* paths, and
redirectToLogin guards against re-setting window.location when
already on an auth path. Both guards had no test coverage —
a future refactor could silently reintroduce the redirect loop.
Added:
- AuthGate.test.tsx: 2 cases covering /cp/auth/login and
/cp/auth/signup path skipping (no fetchSession call, no
redirectToLogin call, children rendered)
- auth.test.ts: 2 cases covering redirectToLogin early return
for /cp/auth/login and /cp/auth/signup paths
Fixes: Molecule-AI/molecule-core#1541
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix(canvas/a11y): aria-hidden SVGs, MissingKeysModal dialog semantics, session cookie auth
Three fixes cherry-picked from issue #1744:
1. aria-hidden on decorative SVG icons:
- DeleteCascadeConfirmDialog.tsx: warning triangle SVG gets aria-hidden="true"
- MissingKeysModal.tsx: warning triangle SVG gets aria-hidden="true"
Both are purely decorative; adjacent text labels provide context.
2. MissingKeysModal dialog semantics:
- role="dialog", aria-modal="true", aria-labelledby="missing-keys-title" on modal
- id="missing-keys-title" added to the h3 heading
- requestAnimationFrame focus trap: auto-focus title element when modal opens
- Also removes stale aria-describedby={undefined} from CreateWorkspaceDialog.tsx
3. Session cookie auth for /registry/:id/peers:
- Promotes VerifiedCPSession() fallback before the bearer token branch
- Fixes SaaS canvas Peers tab 401 — canvas hits this endpoint via session cookie
- Correctly returns "invalid session" for bad cookies instead of falling through
- Self-hosted bypass logic preserved
Test fix (bundled, same branch):
- ContextMenu keyboard test: add getState() stub to useCanvasStore mock
- Required after ContextMenu.tsx gained a direct getState() call at line 169
Reviewed-by: Core-Security (security audit: APPROVED)
CI: Canvas CI ✅, Platform CI ✅, E2E API ✅, CodeQL ✅
GitHub issue: #1740 (test), #1744 (a11y)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root cause of the sustained E2E step-8 A2A 401 failures (3+/3 runs
2026-04-24 03h–04h): the A2A returns 200 with a JSON-RPC result whose
text is OpenRouter's error format —
{'message': 'Missing Authentication header', 'code': 401}
(integer code, not OpenAI's string 'invalid_api_key'). template-hermes's
derive-provider.sh was picking PROVIDER=openrouter for openai/* models
despite template-hermes#19 (the fix that flips openai/* → custom when
OPENAI_API_KEY is set) having been merged 01:30Z.
Verified via probe workspaces on the staging canary tenant:
probe 1 (just OPENAI_API_KEY): → OpenRouter's 401 shape
probe 2 (+ HERMES_INFERENCE_PROVIDER=custom + HERMES_CUSTOM_*):
→ OpenAI's 401 shape ('code': 'invalid_api_key')
So derive-provider.sh's updates apparently aren't reaching every
staging tenant on re-provision — possibly because tenant EC2s cache
/opt/adapter from an earlier boot, or the CP's user-data snapshot
bundles a pre-fix template-hermes. That's a separate follow-up (needs
forced re-clone of /opt/adapter on every workspace boot).
This PR is the test-side workaround. Pinning the HERMES_* bridge env
vars bypasses derive-provider.sh entirely, so the test works regardless
of which template-hermes commit any given tenant happens to have on
disk.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DropStale calls DropStaleQueueItems which reads db.DB directly. Without
setupTestDB() the global mock was nil → every query returned 500.
Adds mock expectations for the 3 happy-path sub-tests; validation-only
sub-tests (bad input) need no DB and are unchanged.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(canvas/a11y): aria-hidden SVGs, MissingKeysModal dialog semantics, session cookie auth
Three fixes cherry-picked from issue #1744:
1. aria-hidden on decorative SVG icons:
- DeleteCascadeConfirmDialog.tsx: warning triangle SVG gets aria-hidden="true"
- MissingKeysModal.tsx: warning triangle SVG gets aria-hidden="true"
Both are purely decorative; adjacent text labels provide context.
2. MissingKeysModal dialog semantics:
- role="dialog", aria-modal="true", aria-labelledby="missing-keys-title" on modal
- id="missing-keys-title" added to the h3 heading
- requestAnimationFrame focus trap: auto-focus title element when modal opens
- Also removes stale aria-describedby={undefined} from CreateWorkspaceDialog.tsx
3. Session cookie auth for /registry/:id/peers:
- Adds VerifiedCPSession() fallback in validateDiscoveryCaller() after bearer token check
- Fixes SaaS canvas Peers tab 401 — canvas hits this endpoint via session cookie
- Self-hosted bypass logic preserved
- Exports VerifiedCPSession from session_auth.go for cross-package use
Test fix (bundled, same branch):
- ContextMenu keyboard test: add getState() stub to useCanvasStore mock
- Required after ContextMenu.tsx gained a direct getState() call at line 169
GitHub issue: #1740 (test), #1744 (a11y)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(workspace-server): remove duplicate VerifiedCPSession declaration
The branch accidentally added a second func VerifiedCPSession declaration
that shadows the real implementation, causing go build to fail with:
internal/middleware/session_auth.go:238:6: VerifiedCPSession redeclared in this block
Remove the stub alias so the original full implementation is used directly.
The function already exports correctly for cross-package use via the
VerifiedCPSession() call in discovery.go.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(workspace-server): correct VerifiedCPSession condition in discovery.go
Fix Go build error — 'presented' was declared and not used.
The cookie fallback check was using `if ok, presented := ...; ok` instead
of `if ok, presented := ...; presented`, causing the build to fail in CI.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(workspace-server): fix declared and not used 'presented' in discovery.go
Fixes Go build failure:
discovery.go:355:10: declared and not used: presented
discovery.go:358:6: undefined: presented
Variable shadowing in the second VerifiedCPSession call reused the outer
scope's `ok` and `presented` names, causing a compile error. Renamed to
ok2/presented2 to avoid shadowing.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Molecule AI Core-FE <core-fe@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Pre-work for enabling GitHub merge queue on the staging branch (#TBD
follow-up issue). Without these triggers, the queue's pre-merge CI run
on the speculative `gh-readonly-queue/...` ref would never fire, every
queued PR would show false-green for the required checks, and queue
would merge things that don't actually pass on the rebased commit.
Adding the trigger now is **a no-op** — the `merge_group` event only
fires once the queue is enabled on a branch, which is a separate UI/API
toggle. So this PR is safe to land in isolation; merge-queue enablement
is the next step and reversible at the branch-protection level.
Why these two workflows:
- `ci.yml` provides 5 of the 8 required staging checks (Detect changes,
Platform Go, Canvas Next.js, Python Lint & Test, Shellcheck E2E)
- `codeql.yml` provides the other 3 (Analyze go / js-ts / python)
Other workflows (e2e-staging-*, canary-*, publish-*) are not required
status checks and don't need the trigger to keep the queue working.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the "panic-button at >65 records" manual sweep that nukes
every pattern-match unconditionally (would delete live workspaces
along with orphans).
This version:
- Queries CP prod + staging /admin/orgs for live tenant slugs
- Queries AWS EC2 describe-instances for live workspace Name tags
- Only deletes CF records whose slug/ws-id has no live counterpart
- Dry-run by default (--execute to actually delete)
- Safety gate refuses to delete >50% of records (configurable via
MAX_DELETE_PCT env var) — catches the "API returned zero orgs, every
tenant looks orphan" failure mode before it nukes production
- Per-category accounting: orphan-ws / orphan-e2e-tenant / etc.
Usage:
CF_API_TOKEN=... CF_ZONE_ID=... \
CP_PROD_ADMIN_TOKEN=... CP_STAGING_ADMIN_TOKEN=... \
bash scripts/ops/sweep-cf-orphans.sh # dry-run
bash scripts/ops/sweep-cf-orphans.sh --execute # actually delete
Ref: #1976 (root-cause: tenant.Delete + workspace.Delete don't clean
their CF records — until that's fixed, this script is the maintenance
path)
Co-authored-by: Hongming Wang <hongmingwang.rabbit@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>
React Flow requires parent nodes to appear before their children in
the nodes array. When they don't, it logs "Parent node {id} not
found. Please make sure that parent nodes are in front of their
child nodes in the nodes array" and — more importantly — renders
the child at canvas-absolute coords instead of parent-relative,
flashing it far outside the parent.
topology's buildNodesAndEdges already enforced this at hydrate, but
nestNode + batchNest weren't re-sorting after mutating parentId.
A freshly-nested child often ended up after-first-drag at the
wrong screen position because its new parent sat later in the
array than itself.
Extract sortParentsBeforeChildren() into canvas-topology as a
reusable DFS visit; call it at the tail of both nestNode's set()
and batchNest's commit set(). 923 tests still green — no behaviour
change beyond eliminating the warning and the position flash.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Canceling the nest/extract dialog restored the child's position but
left the parent card at its auto-grown size. growParentsToFitChildren
fires on drag-stop to fit a then-outside child; when the drag is
subsequently cancelled, the parent keeps that grown width/height
forever because the grow pass is grow-only.
Strip width/height from the ex-parent alongside the child position
restore in cancelNest — React Flow re-measures from CSS, parent
collapses back to its natural size. Same trick nestNode already
uses for the un-nest path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-up polish items for drag-and-nest:
1. Cancelling the "Extract from team?" dialog now snaps the
dragged card back to where the drag started. Before, a user
who dragged a child out, saw the confirm dialog, then clicked
Cancel ended up with the card stranded outside the parent at
its drop-point position — which also got persisted via
savePosition on drag-stop. Now onNodeDragStart captures the
pre-drag position + parent, and cancelNest restores both the
RF node position and fires savePosition with the absolute
pre-drag coords so reload matches.
2. Un-nesting now clears the ex-parent's explicit width/height
in the nodes array. growParentsToFitChildren is grow-only so
it could never shrink the parent back down after a child
left; the card stayed at its auto-grown size with empty
space. Stripping width/height lets React Flow re-measure from
the card's own min-width / min-height CSS, so the parent
visually shrinks to fit whatever children remain.
923 canvas tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Un-nest used to require holding Alt (or Cmd to force-detach). That
was too conservative — when a user dragged a child clearly outside
its parent's bbox, nothing happened on release, because the default
branch soft-clamped back and only the Alt branch actually opened
the "Extract?" confirm. Matches the exact bug the user just flagged
("I can put agents in other agent, but when I drag it out, it does
not move out").
New rules:
* Past the 20 % hysteresis → confirm un-nest. Plain drag, no
modifier. This is what most users expect (Miro / Figma behave
the same way — drag outside the frame and the shape leaves it).
* Inside or within 20 % of the edge → soft-clamp back inside.
Guards against twitchy releases that momentarily overshoot the
edge by a few pixels.
* Cmd / Ctrl → force un-nest regardless of overlap. Escape-hatch
for when the user dragged within the hysteresis zone but really
wants out.
* Dropping onto a different parent → nest there (unchanged).
Alt is no longer a required modifier for un-nesting. Keeps it as
a non-gesture modifier only; no meaning unless we re-bind it later.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>