The harness needs E2E_OPENAI_API_KEY set for Hermes workspaces to
boot. Without it the runtime crashes with "No provider API key
found" and workspaces never reach online. A preflight step fails fast
with a clear error if the repo secret is missing, so CI doesn't
burn 10 minutes on a foregone conclusion.
Repo secret to add: Settings → Secrets → Actions →
MOLECULE_STAGING_OPENAI_KEY.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Section 10's delegation call is a raw curl (not tenant_call, because
it carries an additional X-Source-Workspace-Id). It was missing
X-Molecule-Org-Id, which TenantGuard requires — so the tenant 404'd
every delegation probe despite section 8's A2A call (via tenant_call)
working correctly.
Repro: staging run 2026-04-21T17:40Z had section 8 green (PONG)
and section 10 red (rc=22) on the same workspace. Only difference
was the missing header.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
workspace/config.py:258 reads MODEL_PROVIDER as the full model string
(format 'provider:model', e.g. 'anthropic:claude-opus-4-7'). My prior
'openai' alone got parsed as the model name → 404 model_not_found.
Use 'openai:gpt-4o' and also set OPENAI_BASE_URL to api.openai.com
(the default was openrouter.ai, which takes a different key format).
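A minimal Python sketch of the 'provider:model' convention described above. The helper name split_model_provider is hypothetical and is not the code in workspace/config.py; unlike the behaviour reported here (a bare 'openai' being treated as the model name), this sketch rejects a value with no provider prefix.

```python
def split_model_provider(value: str) -> tuple[str, str]:
    """Split a 'provider:model' string into its two parts.

    Hypothetical helper: raises instead of silently treating a bare
    provider name as a model name.
    """
    provider, sep, model = value.partition(":")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider:model', got {value!r}")
    return provider, model
```

With this shape, 'openai:gpt-4o' yields the provider 'openai' and the model 'gpt-4o', while a bare 'openai' fails loudly at config time rather than at the first model call.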
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hermes's provider resolver checks ANTHROPIC_API_KEY first (resolution
order puts anthropic before openai). Without MODEL_PROVIDER=openai
explicitly set, Hermes defaults to claude-sonnet-4-6 against the
OpenAI endpoint and 404s with model_not_found.
Staging E2E run 2026-04-21T17:24Z hit this after every earlier fix
landed (workspace online, A2A ready) — last remaining blocker for
the happy path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Workspace runtimes (hermes, langgraph, etc.) crash at boot with
'No provider API key found' when no ANTHROPIC_API_KEY / OPENAI_API_KEY /
etc. is set. Harness previously sent no secrets → workspace sat in
provisioning for 10 min → harness timed out.
Console log from staging run 2026-04-21T17:08Z showed the exact crash:
ValueError: No Hermes provider API key found. Set any one of:
ANTHROPIC_API_KEY, HERMES_API_KEY, NOUS_API_KEY, OPENROUTER_API_KEY,
OPENAI_API_KEY, ...
Read E2E_OPENAI_API_KEY from env and inject into both parent and
child workspace POST bodies via the secrets field (persists as
workspace_secret, materialises into container env). Empty key
falls through — dev can still run smoke tests, workspace just
won't reach online.
For CI, a new repo secret MOLECULE_STAGING_OPENAI_KEY needs to be
added and passed as E2E_OPENAI_API_KEY in the workflow env.
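A rough Python sketch of the injection described above. Only the secrets field name and the empty-key fall-through come from this change; the name and runtime body fields, and the assumption that secrets is a name-to-value mapping, are illustrative guesses.

```python
def workspace_body(name: str, runtime: str, openai_key: str) -> dict:
    """Build a workspace POST body, attaching the provider key if present."""
    # 'name' and 'runtime' are assumed field names for illustration.
    body = {"name": name, "runtime": runtime}
    if openai_key:
        # Assumed shape: a mapping of env var name to secret value.
        body["secrets"] = {"OPENAI_API_KEY": openai_key}
    # An empty key falls through: the body is sent without secrets.
    return body
```

The same builder would be called for both the parent and the child workspace, with the key read once from E2E_OPENAI_API_KEY.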
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously matched every e2e-YYYYMMDD-* slug, which stomped parallel
CI runs AND manual dev probes against staging. Incident 2026-04-21
15:02Z: this workflow's safety net deleted an unrelated manual tenant
1s after it hit 'running', timing out the dev run at 15min.
Scope to f'e2e-{today}-{GITHUB_RUN_ID}-' so each run only cleans its
own leftovers. Empty run_id (local invocation) keeps the old broader
behaviour so dev safety-nets still sweep.
Also fix: the previous filter used o.get('status') which doesn't exist
on the admin API response. Now reads instance_status (the real field).
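The scoping rule can be sketched as follows. This is an illustration rather than the workflow's actual filter: the slug key on each org record is an assumption, and today is expected in YYYYMMDD form.

```python
def cleanup_prefix(today: str, run_id: str) -> str:
    """Return the slug prefix a safety-net sweep may delete."""
    if run_id:
        # CI: only this run's own leftovers.
        return f"e2e-{today}-{run_id}-"
    # Local invocation with no run id: the old, broader sweep.
    return f"e2e-{today}-"


def slugs_to_sweep(orgs: list[dict], today: str, run_id: str) -> list[str]:
    """Pick the org slugs that fall under this run's prefix."""
    prefix = cleanup_prefix(today, run_id)
    # 'slug' is an assumed key name on the admin API records.
    return [o["slug"] for o in orgs if o.get("slug", "").startswith(prefix)]
```

Because the run id is part of the prefix, a parallel CI run or a manual probe created on the same day no longer matches.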
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TenantGuard middleware on the tenant platform returns 404 (not 403,
by design — avoid leaking tenant existence to org scanners) when
requests lack X-Molecule-Org-Id matching MOLECULE_ORG_ID. Harness
hit this on POST /workspaces (section 5) despite having a valid
Authorization bearer.
- Capture org_id from admin-create response
- Send X-Molecule-Org-Id on every tenant_call
Confirmed via manual repro 2026-04-21T14:56Z: curl with Bearer but
no org-id header → 404; with both headers → expected route reached.
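As an illustration of the header set a tenant request needs, here is a small Python sketch. The function name tenant_headers is hypothetical (the harness itself is bash and curl); the header names are the ones named in these commits.

```python
from typing import Optional


def tenant_headers(token: str, org_id: str,
                   source_workspace_id: Optional[str] = None) -> dict:
    """Build the headers TenantGuard expects on a tenant API call."""
    headers = {
        "Authorization": f"Bearer {token}",
        # Without this header the tenant answers 404, not 403.
        "X-Molecule-Org-Id": org_id,
    }
    if source_workspace_id:
        # Only delegation calls carry the source workspace.
        headers["X-Source-Workspace-Id"] = source_workspace_id
    return headers
```

Building the headers in one place means a raw call that adds X-Source-Workspace-Id cannot drop the org header by accident.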
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous hardcode `$SLUG.moleculesai.app` only matched prod. Staging
tenants live at `$SLUG.staging.moleculesai.app`, so the harness hit
DNS for a nonexistent host and timed out at section 4 even after
provisioning succeeded.
Derive from CP URL: api.X → X, staging-api.X → staging.X. Override
via MOLECULE_TENANT_DOMAIN for self-hosted setups.
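A sketch of the derivation rule in Python, for illustration only. The function name and the error on an unrecognised host are assumptions; the override argument stands in for MOLECULE_TENANT_DOMAIN.

```python
from urllib.parse import urlparse


def tenant_domain(cp_url: str, override: str = "") -> str:
    """Derive the tenant base domain from the control-plane URL."""
    if override:
        # Self-hosted setups set the domain explicitly.
        return override
    host = urlparse(cp_url).hostname or ""
    if host.startswith("staging-api."):
        return "staging." + host[len("staging-api."):]
    if host.startswith("api."):
        return host[len("api."):]
    # Assumption: an unknown host shape is an error, not a guess.
    raise ValueError(f"cannot derive tenant domain from {cp_url!r}")
```

A tenant URL is then the slug joined to the derived domain, for example slug + "." + tenant_domain(cp_url).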
Confirmed gap on manual run 2026-04-21T14:40Z: section 2 passed in
2min but section 4 timed out at 3min on the wrong hostname.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/cp/admin/orgs exposes `instance_status` (COALESCE'd from
org_instances.status), NOT a top-level `status` field. The harness
polled the wrong field and always read empty → timed out at 15min
on a tenant that had actually provisioned successfully (confirmed
2026-04-21T14:22Z: EC2 launched, canary ok, but harness never saw
status=running).
No code change to the admin API — the field has never been named
`status`. The harness just had a typo that happened to type-check
(the Go struct hasn't changed, only the sh/py polling was wrong).
Now the harness correctly reads `instance_status` and the main
provision poll loop terminates on the expected transition.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verified against live staging: the admin endpoint returns 400 'confirm
field must equal the URL slug' when the body key is 'confirm_token'.
Every workflow's safety-net teardown step + the main harness + the
Playwright teardown all had the wrong key. Fixed all six call sites.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reduces required secret surface from 2 (session cookie + admin token)
to 1 (admin token). Pairs with molecule-controlplane#202 which adds:
- POST /cp/admin/orgs — server-to-server org creation
- GET /cp/admin/orgs/:slug/admin-token — per-tenant bearer fetch
With those endpoints live, CI doesn't need to scrape a browser WorkOS
session cookie. CP admin bearer (Railway CP_ADMIN_API_TOKEN) drives
provision + tenant-token retrieval + teardown through a single
credential.
Changes
-------
test_staging_full_saas.sh: admin bearer for provision/teardown,
fetched per-tenant token drives all tenant API calls. Added
E2E_INTENTIONAL_FAILURE=1 toggle that poisons the tenant token
after provisioning so the teardown path gets exercised when the
happy-path isn't.
canvas/e2e/staging-setup.ts: same pivot; exports STAGING_TENANT_TOKEN
instead of STAGING_SESSION_COOKIE.
canvas/e2e/staging-tabs.spec.ts: context.setExtraHTTPHeaders with
Authorization: Bearer on every page request, no cookie handling.
All three workflows (e2e-staging-saas, canary-staging,
e2e-staging-canvas): drop MOLECULE_STAGING_SESSION_COOKIE env +
verification step. One secret to set.
NEW e2e-staging-sanity.yml: weekly Mon 06:00 UTC. Runs the harness
with E2E_INTENTIONAL_FAILURE=1 and inverts the pass condition —
rc=1 is green, rc=0 (unexpected success) or rc=4 (leak) open a
priority-high issue labelled e2e-safety-net. This is the
answer to 'how do we know the teardown path still works when
nothing else has failed recently.'
STAGING_SAAS_E2E.md refreshed: single-secret setup, sanity workflow
documented, canvas workflow added to the coverage matrix.
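The inverted pass condition of the sanity workflow can be summarised in a few lines of Python. This is a sketch, not the workflow's code; treating any other exit code as a plain failure is an assumption, since only rc=0, rc=1 and rc=4 are specified.

```python
def sanity_verdict(rc: int) -> str:
    """Map the harness exit code to the sanity workflow outcome."""
    if rc == 1:
        # The poisoned token made the run fail and teardown still worked.
        return "green"
    if rc == 0:
        return "issue: unexpected success"
    if rc == 4:
        return "issue: leaked org"
    # Assumption: any other code is treated as an ordinary failure.
    return "red"
```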
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three additions on top of 187a9bf:
1. Canary (.github/workflows/canary-staging.yml)
30-min cron that runs the full-SaaS harness in E2E_MODE=canary: one
hermes workspace + one A2A PONG + teardown. ~8-min wall clock vs
~20-min for the full run.
Alerting is self-contained: opens a single 'Canary failing' issue on
first failure, comments on subsequent failures (no issue spam),
auto-closes the issue on the next green run. Labels: canary-staging,
bug. Safety-net teardown step sweeps e2e-YYYYMMDD-canary-* orgs
tagged today so a runner cancel can't leak EC2.
2. Canvas Playwright (canvas/e2e/staging-*.ts + playwright.staging.config.ts
+ .github/workflows/e2e-staging-canvas.yml)
staging-setup.ts provisions a fresh org + hermes workspace (same
lifecycle as the bash harness, just in TypeScript). staging-tabs.spec.ts
clicks through all 13 workspace-panel tabs (chat, activity, details,
skills, terminal, config, schedule, channels, files, memory, traces,
events, audit) and asserts each renders without crashing and without
'Failed to load' error toasts. Known SaaS gaps (Files empty, Terminal
disconnects, Peers 401) are documented in #1369 and whitelisted so
they don't fail the test — the gate is 'no hard crash', not 'no
issues'.
staging-teardown.ts deletes the org via DELETE /cp/admin/tenants/:slug.
playwright.staging.config.ts separates staging from local tests so
pnpm test in dev doesn't try to provision against staging. Retries=2
and timeouts are longer; workers=1 because the setup provisions one
shared workspace. Workflow uploads HTML report + screenshots on
failure for 14 days.
3. Delegation mechanics (tests/e2e/test_staging_full_saas.sh section 10)
Parent → child proxy test: POST /workspaces/CHILD/a2a with
X-Source-Workspace-Id=PARENT and verify the child responds + child
activity log captures PARENT as source. Intentionally LLM-free: the
mechanics regression is what matters; prompt-driven delegation
correctness belongs in canvas-driven tests.
Also reorders teardown step to 11/11 since delegation is 10/11.
Mode gating:
E2E_MODE=canary -> skips child workspace, HMA memory, peers,
activity, delegation (steps 6, 9, 10 no-op). Full-lifecycle still
runs every piece. Validated both paths via 'bash -n' syntax check
after each edit.
Secrets requirement unchanged (same two secrets as 187a9bf):
MOLECULE_STAGING_SESSION_COOKIE, MOLECULE_STAGING_ADMIN_TOKEN.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Dedicated CI/CD lane that exercises the whole SaaS cross-EC2 shape end to
end, against live staging:
1. Accept terms / create org (POST /cp/orgs) — catches ToS gate, slug
validation, billing/quota, member insert regressions.
2. Wait for tenant EC2 + cloudflared tunnel + TLS propagation (up to
15 min cold).
3. Provision a parent + child workspace via the tenant URL.
4. Wait both online (exercises the SaaS register + token bootstrap
flow fixed in #1364).
5. A2A round-trip on parent — validates the full LLM loop (MCP tools,
provider auth, JSON-RPC response shape, proxy SSRF gate).
6. HMA memory write + read — validates awareness namespace + scope
routing.
7. Peers + activity smoke — route-registration regression guard.
8. Teardown via DELETE /cp/admin/tenants/:slug + leak assertion — a
leaked org at teardown fails CI with exit 4.
Why a dedicated workflow (not folded into ci.yml):
- ~20 min wall clock per run (EC2 boot is the long pole). Too slow
for every PR push.
- Needs its own concurrency group (staging has an org-create quota
and two overlapping runs would race on slug prefix).
- Distinct secret surface (session cookie + admin bearer) — keep it
off PR jobs that don't need them.
Triggers: push to main (provisioning-critical paths only), PRs on the
same paths, manual workflow_dispatch (with runtime + keep_org inputs),
and 07:00 UTC nightly cron for drift detection.
Belt-and-braces teardown: the script installs an EXIT trap, and the
workflow has an always()-step that greps e2e-YYYYMMDD-* orgs created
today and force-deletes them via the idempotent admin endpoint. Covers
the case where GH cancels the runner before the trap fires.
Docs: tests/e2e/STAGING_SAAS_E2E.md — what's covered, how to provision
the two required secrets, local-dev notes, cost (~$0.007/run), known
gaps (canvas UI + delegation + claude-code).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every workspace in the cross-EC2 SaaS provisioning shape was failing
registration, heartbeat, or A2A routing. Four distinct blockers sat
between "EC2 is up" and "agent responds"; three are platform-side and
fixed here (the fourth is in the CP user-data, separate PR).
1. SSRF validator blocked RFC-1918 (registry.go + mcp.go)
validateAgentURL and isPrivateOrMetadataIP rejected 172.16.0.0/12,
which contains the AWS default VPC range (172.31.x.x) that every
sibling workspace EC2 registers from. Registration returned 400 and
the 10-min provision sweep flipped status to failed. RFC-1918 +
IPv6 ULA are now gated behind saasMode(); link-local (169.254/16),
loopback, IPv6 metadata (fe80::/10, ::1), and TEST-NET stay blocked
unconditionally in both modes.
saasMode() resolution order:
1. MOLECULE_DEPLOY_MODE=saas|self-hosted (explicit operator flag)
2. MOLECULE_ORG_ID presence (legacy implicit signal, kept for
back-compat so existing deployments don't need a config change)
isPrivateOrMetadataIP now actually checks IPv6 — previously it
returned false on any non-IPv4 input, which would let a registered
[::1] or [fe80::...] URL bypass the SSRF check entirely.
2. Orphan auth-token minting (workspace_provision.go)
issueAndInjectToken mints a token and stuffs it into
cfg.ConfigFiles[".auth_token"]. The Docker provisioner writes that
file into the /configs volume — the CP provisioner ignores it
(only cfg.EnvVars crosses the wire). Result: live token in DB, no
plaintext on disk, RegistryHandler.requireWorkspaceToken 401s every
/registry/register attempt because the workspace is no longer in
the "no live token → bootstrap-allowed" state. Now no-ops in SaaS
mode; the register handler already mints on first successful
register and returns the plaintext in the response body for the
runtime to persist locally.
Also removes the redundant wsauth.IssueToken call at the bottom of
provisionWorkspaceCP, which created the same orphan-token pattern
a second time.
3. Compaction artefacts (bundle/importer.go, handlers/org_tokens.go,
scheduler.go, workspace_provision.go)
Four pre-existing compile errors on main from an earlier session's
code truncation: missing tuple destructuring on ExecContext /
redactSecrets / orgTokenActor, missing close-brace in
Scheduler.fireSchedule's panic recovery. All one-line mechanical
fixes; without them the binary would not build.
Tests
-----
ssrf_test.go adds:
* TestSaasMode — covers the env resolution ladder (explicit flag
wins over legacy signal, case-insensitive, whitespace tolerant)
* TestIsPrivateOrMetadataIP_SaaSMode — asserts RFC-1918 + IPv6 ULA
flip to allowed, metadata/loopback/TEST-NET still blocked
* TestIsPrivateOrMetadataIP_IPv6 — regression guard for the old
"returns false for all IPv6" behaviour
Follow-up issue for CP-sourced workspace_id attestation will be filed
separately — closes the residual intra-VPC SSRF + token-race windows
the SaaS-mode relaxation introduces.
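The behaviour pinned down by the tests above can be sketched in Python with the standard ipaddress module. This is not the Go implementation: the function names are illustrative, the range lists are a plausible reading of the description, and falling back to the legacy signal on an unrecognised MOLECULE_DEPLOY_MODE value is an assumption.

```python
import ipaddress

# Blocked in both modes: loopback, link-local/metadata, TEST-NET.
ALWAYS_BLOCKED = [ipaddress.ip_network(n) for n in (
    "127.0.0.0/8", "169.254.0.0/16", "::1/128", "fe80::/10",
    "192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24",
)]
# Blocked only outside SaaS mode: RFC-1918 and IPv6 ULA.
PRIVATE_RANGES = [ipaddress.ip_network(n) for n in (
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "fc00::/7",
)]


def saas_mode(env: dict) -> bool:
    """Explicit flag first, then the legacy MOLECULE_ORG_ID signal."""
    mode = env.get("MOLECULE_DEPLOY_MODE", "").strip().lower()
    if mode == "saas":
        return True
    if mode == "self-hosted":
        return False
    return bool(env.get("MOLECULE_ORG_ID", "").strip())


def is_blocked(addr: str, saas: bool) -> bool:
    """Return True if the address must be rejected by the SSRF check."""
    ip = ipaddress.ip_address(addr)  # parses both IPv4 and IPv6
    if any(ip in net for net in ALWAYS_BLOCKED):
        return True
    return (not saas) and any(ip in net for net in PRIVATE_RANGES)
```

In SaaS mode an address such as 172.31.5.9 from the default VPC range passes, while 169.254.169.254 and ::1 stay rejected in either mode.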
Verified end-to-end today on workspace 6565a2e0 (hermes runtime, OpenAI
provider) — agent returned "PONG" in 1.4s after register → heartbeat →
A2A proxy → runtime.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(ci): revert cancel-in-progress to true — ubuntu-runner dispatch stalled
With cancel-in-progress: false, pending CI runs accumulate in the
ci-staging concurrency group. New pushes create queued runs, but
GitHub dispatches multiple runs for the same SHA instead of replacing
the pending one. All runs get stuck/cancelled before completing.
Reverting to cancel-in-progress: true restores CI operation — runs
that are superseded are cancelled, freeing the concurrency slot for
the new run to proceed.
Runner availability (ubuntu-latest dispatch stall) is a separate
infra issue tracked independently.
* fix(security): validate tar header names in copyFilesToContainer — CWE-22 path traversal (#1043)
Tar header names were built from raw map keys without validation. A malicious
server-side caller could embed "../" in a file name to escape the destPath
volume mount (/configs) and write files outside the intended directory.
Fix: validate each name with filepath.Clean + IsAbs + HasPrefix("..") checks
before using it in the tar header, then join with destPath for the archive
header. Also guard parent-directory creation against traversal.
Closes #1043.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(canvas/test): patch regressed tests from PR #1243 orgs-page flakiness fix
Two regressions introduced by PR #1243 (fix issue #1207):
1. **ContextMenu.keyboard.test.tsx** — `setPendingDelete` now receives
`{id, name, hasChildren}` (cascade-delete UX, PR #1252), but the test
expected only `{id, name}`. Added `hasChildren: false` to the assertion.
2. **orgs-page.test.tsx** — 10 tests awaited `vi.advanceTimersByTimeAsync(50)`
without `act()`. With fake timers, `setState` (synchronous) is flushed by
`advanceTimersByTimeAsync`, but the React state update it triggers is a
microtask — so the test saw stale render. Wrapping in `act(async () =>
{ await vi.advanceTimersByTimeAsync(50); })` ensures microtasks drain
before assertions run.
All 813 vitest tests pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(canvas): add 100px proximity threshold to drag-to-nest detection
Fixes #1052. Previously, getIntersectingNodes() returned any node whose
bounding box overlapped the dragged node, regardless of actual pixel
distance. On a sparse canvas this triggered the "Nest Workspace" dialog
even when the dragged node was nowhere near any target.
The fix adds an on-node-drag proximity filter: only nodes within 100px
(center-to-center) of the dragged node are eligible as nest targets.
Distance is computed as squared Euclidean to avoid the sqrt overhead in
the hot drag path.
Added two tests to Canvas.pan-to-node.test.tsx covering the mock wiring
and confirming the regression is addressed in Canvas.tsx.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(canvas): add ?? 0 guard for optional budget_used in progressPct
Fixes #1324. TypeScript strict mode flags budget.budget_used as
possibly undefined in the progressPct ternary, even though the
outer condition checks budget_limit > 0.
Fix: use nullish coalescing (budget_used ?? 0) so progress shows 0%
when the backend returns a partial shape (provisioning-stuck
workspaces). Also adds a test covering the undefined-budget_used
case with the progress bar aria-valuenow and fill width both at 0%.
Closes #1324.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
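The squared-distance comparison used by the drag-to-nest proximity fix above can be illustrated in Python. This is a sketch, not the Canvas.tsx code: the function name is hypothetical, points are node centres as (x, y) pairs, and treating exactly 100px as within range is an assumption.

```python
def within_proximity(dragged: tuple[float, float],
                     target: tuple[float, float],
                     threshold_px: float = 100.0) -> bool:
    """Compare squared centre-to-centre distance against the threshold."""
    dx = target[0] - dragged[0]
    dy = target[1] - dragged[1]
    # Comparing squares gives the same answer without calling sqrt.
    return dx * dx + dy * dy <= threshold_px * threshold_px
```

Since both sides of the comparison are non-negative, squaring preserves the ordering, which is why the square root can be skipped in the drag handler.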
---------
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>
Co-authored-by: Molecule AI Core-FE <core-fe@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(security): backport SSRF defence (CWE-918) to main — isSafeURL in mcp.go and a2a_proxy.go
Issue #1042: 3 CodeQL SSRF findings across mcp.go and a2a_proxy.go.
staging already ships the fix (PRs #1147, #1154 → merged); main did not include it.
- mcp.go: add isSafeURL() + isPrivateOrMetadataIP() helpers; validate
agentURL before outbound calls in mcpCallTool (line ~529) and
toolDelegateTaskAsync (line ~607)
- a2a_proxy.go: add identical isSafeURL() + isPrivateOrMetadataIP()
helpers; call isSafeURL() before dispatchA2A in resolveAgentURL()
(blocks finding #1 at line 462)
- mcp_test.go: 19 new tests covering all blocked URL patterns:
file://, ftp://, 127.0.0.1, ::1, 169.254.169.254, 10.x.x.x,
172.16.x.x, 192.168.x.x, empty hostname, invalid URL,
isPrivateOrMetadataIP across all private/CGNAT/metadata ranges
1. URL scheme enforcement — http/https only
2. IP literal blocking — loopback, link-local, RFC-1918, CGNAT, doc/test ranges
3. DNS hostname resolution — blocks internal hostnames resolving to private IPs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(ci-blocker): remove duplicate isSafeURL/isPrivateOrMetadataIP from mcp.go
Issue #1292: PR #1274 duplicated isSafeURL + isPrivateOrMetadataIP in
mcp.go — both functions already exist on main at lines 829 and 876.
Kept the mcp.go definitions (the originals) and removed the 70-line
duplicate appended at end of file. a2a_proxy.go functions are
unchanged — they serve the same purpose via a separate code path.
* fix: remove orphaned commit-text lines from a2a_proxy.go
Three lines from the PR/commit title were accidentally baked into the
file during the rebase from #1274 to #1302, causing a Go syntax error
(a bare string literal at statement level followed by dangling braces).
Deletion restores:
}
return agentURL, nil
}
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Molecule AI Infra-Runtime-BE <infra-runtime-be@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Molecule AI Core-BE <core-be@agents.moleculesai.app>
Co-authored-by: Molecule AI SDK Lead <sdk-lead@agents.moleculesai.app>
## Summary
Issue #1273: deleteViaEphemeral interpolated filePath directly into
rm command, enabling both shell injection (CWE-78) and path traversal
(CWE-22) attacks.
## Changes
1. Added validateRelPath(filePath) guard before constructing the rm command.
validateRelPath blocks absolute paths and ".." traversal sequences.
2. Changed Cmd from "/configs/"+filePath (string interpolation) to
[]string{"rm", "-rf", "/configs", filePath} (exec form). This
eliminates shell injection entirely — filePath is a plain argument,
never interpreted as shell code.
## Security properties
- validateRelPath: blocks "../" and absolute paths before they reach Docker
- Exec form: filePath cannot inject shell metacharacters even if validation
is somehow bypassed
- "/configs" as separate arg: rm has exactly two arguments, no room for
injected args
Closes #1273.
Co-authored-by: Molecule AI Infra-Runtime-BE <infra-runtime-be@agents.moleculesai.app>
Documents TemplatesHandler.copyFilesToContainer (container_files.go):
- Endpoint overview: PUT /workspaces/:id/files/*path
- Parameter descriptions for all four function parameters
- CWE-22 path traversal protection (PRs #1267/1270/1271)
- Defense-in-depth: validateRelPath at handler + archive boundary
- Full error code table (400/404/500)
- curl example with success and path-traversal rejection cases
Also covers: writeViaEphemeral routing, findContainer fallback,
allowed roots allow-list, and related links to platform-api.md.
Co-authored-by: Molecule AI Technical Writer <technical-writer@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
The provision-timeout sweeper was emitting a new WORKSPACE_PROVISION_TIMEOUT
event type, but the canvas event handler (canvas-events.ts:234) only
has a case for WORKSPACE_PROVISION_FAILED — the sweep's event fell
through silently. DB was being marked 'failed' but the UI stayed on
'starting' indefinitely until the user hard-refreshed.
Reusing the existing event name keeps the UI reaction uniform across
both fail paths (runtime-crash via bootstrap-watcher and boot-timeout
via sweeper). Operators who need to distinguish can read the `source`
payload field — "bootstrap_watcher" vs "provision_timeout_sweep".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
workspace_restart.go:127-133 accepted body.Template (attacker-controlled)
via raw filepath.Join(h.configsDir, template), allowing path traversal
(e.g. "../../../etc") to escape configsDir.
Fix: replace raw filepath.Join with resolveInsideRoot, same pattern as
workspace.go:102 (already fixed) and workspace.go:249 (already fixed).
Both the explicit template path and the findTemplateByName fallback are
safe — findTemplateByName returns a directory name from os.ReadDir which
is inherently bounded and cannot contain "/".
On resolve error the template is cleared so findTemplateByName fallback
still fires (preserves existing restart behaviour when template is invalid).
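For readers unfamiliar with the pattern, a Python analogue of a resolve-inside-root check is sketched below. It is not the Go resolveInsideRoot helper; the function name and error type are illustrative, root is assumed to be an absolute, normalised POSIX path, and the check is lexical only (it does not resolve symlinks).

```python
import posixpath


def resolve_inside_root(root: str, rel: str) -> str:
    """Join rel onto root and refuse any result that escapes root."""
    if not rel or posixpath.isabs(rel):
        raise ValueError(f"path must be relative: {rel!r}")
    # normpath collapses '..' segments so the escape becomes visible.
    candidate = posixpath.normpath(posixpath.join(root, rel))
    if posixpath.commonpath([root, candidate]) != root:
        raise ValueError(f"path escapes root: {rel!r}")
    return candidate
```

A caller that wants the fallback behaviour described above would catch the error, clear the template, and let the lookup by name run.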
Closes: #1043
Co-authored-by: Molecule AI Core-BE <core-be@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
orgtoken.Validate now returns org_id (the org workspace UUID stored on
org_api_tokens rows, populated by #1212). Both call sites in
wsauth_middleware.go — WorkspaceAuth and AdminAuth — call
c.Set("org_id", orgID) after successful org-token validation.
This unbreaks orgCallerID(c) for org-token callers. Previously the
middleware populated org_token_id and org_token_prefix but never org_id,
so any handler reading c.Get("org_id") (e.g. requireCallerOwnsOrg) got
"" even for valid org tokens.
The change is additive: orgID may be empty for pre-migration tokens
minted before #1212. requireCallerOwnsOrg already handles empty org_id
by denying by default.
Co-authored-by: Molecule AI CP-BE <cp-be@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Root cause: commits e6d48e6 and e085621 stored ci.yml with JSON-escaped
content (literal \n sequences, leading double-quote) instead of proper
YAML with actual newlines. All CI runs failed with "workflow file issue"
before any job could start.
Fix: restore from pre-corruption base (2517164), apply intended changes:
- concurrency.cancel-in-progress: true → false (queue rather than cancel)
- changes job: runs-on ubuntu-latest (frees mac mini for real work)
PR #1242 intent preserved, corruption from API commit removed.
- Store: pendingDelete now carries `hasChildren: boolean` (computed from
nodes.some(parentId === nodeId))
- ContextMenu: passes hasChildren into setPendingDelete
- Canvas: dialog title changes to "Delete Workspace and Children" with
⚠️ message when hasChildren; confirms with "Delete All"
Refs: #1137
Co-authored-by: Molecule AI Fullstack (floater) <fullstack-floater@agents.moleculesai.app>
- DetailsTab: use `(data.lastErrorRate ?? 0)` instead of bare multiply to
prevent NaN% when the field is absent on pre-provisioning workspaces.
- WorkspaceUsage: make formatPeriod accept optional start/end strings;
return "—" for undefined so the usage period shows blank rather than
"Invalid Date" for provisioning/partial workspaces.
Refs: #1139
Co-authored-by: Molecule AI Fullstack (floater) <fullstack-floater@agents.moleculesai.app>
The sed command in PR #1229 used $1 in the replacement without defining
any capture groups, so it committed the literal string
"defer func() { _ = \$1 }()" instead of
"defer func() { _ = resp.Body.Close() }()". The affected Go files do not
compile because $1 is not a valid identifier.
Fixed with: sed -i 's/defer func() { _ = \$1 }()/defer func() { _ = resp.Body.Close() }()/g'
Affected (all on origin/staging):
workspace-server/cmd/server/cp_config.go
workspace-server/internal/handlers/a2a_proxy.go
workspace-server/internal/handlers/github_token.go
workspace-server/internal/handlers/traces.go
workspace-server/internal/handlers/transcript.go
workspace-server/internal/middleware/session_auth.go
workspace-server/internal/provisioner/cp_provisioner.go (3 occurrences)
Closes: #1245
Co-authored-by: Molecule AI Core-BE <core-be@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Line 9 of ci.yml accidentally contained a bare string with the commit
SHA instead of the intended concurrency: block, causing all CI runs
to fail with a YAML parse error.
Also restores the intended changes from PR #1242 (workflow-level
concurrency with cancel-in-progress: false).
Fixes: CI failure on staging after PR #1242 merge.
CPProvisioner env mutator error branch was left with unresolved conflict
markers after a prior rebase. Resolved to the HEAD-side generic message
"plugin env mutator chain failed" which is consistent with the same
message used in the Provisioner path (line 107/111).
No functional change.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
cancel-in-progress: false queues new runs so the single mac mini
runner doesn't fight itself when pushes stack during rebases or
cross-PR contention. Existing e2e-api.yml already has this pattern.
Fixes: 19 queued runs on single self-hosted runner (02:55 UTC snapshot)
Co-authored-by: Molecule AI Fullstack (floater) <fullstack-floater@agents.moleculesai.app>
Root cause: individual tests switched timers back and forth inside
try/finally blocks (vi.useRealTimers() in the body, vi.useFakeTimers() in
finally). Once any test's finally block had called vi.useFakeTimers(),
subsequent tests inherited fake timer state, so their 50ms real setTimeouts
never fired and mockFetch accumulated calls across test boundaries.
Fix: consolidate timer management to beforeEach/afterEach hooks.
- beforeEach: vi.useFakeTimers() — all tests start from known fake state
- afterEach: cleanup() + vi.useRealTimers() — restore real timers for next test
- Individual tests: use vi.advanceTimersByTimeAsync(50) instead of real setTimeout
Also removed duplicate afterEach(cleanup()) and unused waitFor import.
Closes: #1207
Co-authored-by: Molecule AI Core-FE <core-fe@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
The post-fire UPDATE after s.proxy.ProxyA2ARequest() was using fireCtx,
which derives from the outer ctx passed into fireSchedule(). If that ctx
is cancelled — HTTP timeout, graceful shutdown, or any upstream deadline —
ExecContext returns context.Canceled and the UPDATE is silently skipped,
leaving next_run_at stale and causing the schedule to re-fire on the
next tick.
Fix: create a dedicated updateCtx from context.Background() with a 5s
deadline, independent of the outer ctx hierarchy. Also improved the
error log to include schedule name for easier debugging.
Complements PR #1241 (fix/f1089-scheduler-ctx-fix-main) which fixes
the goroutine-panic path in tick() — this fix covers the wider case of
normal-return + ctx-cancelled after the proxy call.
F1089 | Severity: HIGH+security
Co-authored-by: Molecule AI Infra Lead <infra-lead@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(canvas): rewrite MemoryInspectorPanel to match backend API
Issue #909 (chunk 3 of #576).
The existing MemoryInspectorPanel used the wrong API endpoint
(/memory instead of /memories) and wrong field names (key/value/version
instead of id/content/scope/namespace/created_at). It also lacked
LOCAL/TEAM/GLOBAL scope tabs and a namespace filter.
Changes:
- Fix endpoint: GET /workspaces/:id/memories with ?scope= query param
- Fix MemoryEntry type to match actual API: id, content, scope,
namespace, created_at, similarity_score
- Add LOCAL/TEAM/GLOBAL scope tabs
- Add namespace filter input
- Remove Edit functionality (no update endpoint in backend)
- Delete uses DELETE /workspaces/:id/memories/:id (by id, not key)
- Full rewrite of 27 tests to match new API and UI structure
- Uses ConfirmDialog (not native dialogs) for delete confirmation
- All dark zinc theme (no light colors)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: tighten types + improve provision-timeout message (#1135, #1136)
#1135 — TypeScript: make BudgetData.budget_used and WorkspaceMetrics
fields optional to match actual partial-response shapes from provisioning-
stuck workspaces. Runtime already guarded with ?? 0.
#1136 — provisiontimeout.go: replace misleading "check required env vars"
hint (preflight catches that case upfront) with accurate message about
container starting but failing to call /registry/register.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
* fix(test): align ssrf_test.go localhost test cases with isSafeURL behaviour
isSafeURL blocks 127.0.0.1 via ip.IsLoopback() even in dev environments.
The localhost test cases expected `wantErr: false`, which was incorrect:
the test fails whenever go test runs. Fix by changing wantErr to true
for both localhost test cases.
Rationale: loopback blocking at this layer is intentional. Access
control is enforced by WorkspaceAuth + CanCommunicate at the A2A
routing layer, not by the URL validation. Opening this would widen
the SSRF attack surface without adding real dev flexibility.
Closes: ssrf_test.go inconsistency reported 2026-04-21
Co-Authored-By: Claude Sonnet 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Molecule AI Core-UIUX <core-uiux@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>