Commit Graph

115 Commits

Author SHA1 Message Date
Hongming Wang
1125a029b8 fix(platform): unblock SaaS workspace registration end-to-end
Every workspace in the cross-EC2 SaaS provisioning shape was failing
registration, heartbeat, or A2A routing. Four distinct blockers sat
between "EC2 is up" and "agent responds"; three are platform-side and
fixed here (the fourth is in the CP user-data, separate PR).

1. SSRF validator blocked RFC-1918 (registry.go + mcp.go)
   validateAgentURL and isPrivateOrMetadataIP rejected 172.16.0.0/12,
   which contains the AWS default VPC range (172.31.x.x) that every
   sibling workspace EC2 registers from. Registration returned 400 and
   the 10-min provision sweep flipped status to failed. RFC-1918 +
   IPv6 ULA are now gated behind saasMode(); IPv4 link-local (169.254/16)
   and loopback, their IPv6 counterparts (fe80::/10, ::1), and TEST-NET
   stay blocked unconditionally in both modes.

   saasMode() resolution order:
     1. MOLECULE_DEPLOY_MODE=saas|self-hosted (explicit operator flag)
     2. MOLECULE_ORG_ID presence (legacy implicit signal, kept for
        back-compat so existing deployments don't need a config change)

   isPrivateOrMetadataIP now actually checks IPv6 — previously it
   returned false on any non-IPv4 input, which would let a registered
   [::1] or [fe80::...] URL bypass the SSRF check entirely.

2. Orphan auth-token minting (workspace_provision.go)
   issueAndInjectToken mints a token and stuffs it into
   cfg.ConfigFiles[".auth_token"]. The Docker provisioner writes that
   file into the /configs volume — the CP provisioner ignores it
   (only cfg.EnvVars crosses the wire). Result: live token in DB, no
   plaintext on disk, RegistryHandler.requireWorkspaceToken 401s every
   /registry/register attempt because the workspace is no longer in
   the "no live token → bootstrap-allowed" state. Now no-ops in SaaS
   mode; the register handler already mints on first successful
   register and returns the plaintext in the response body for the
   runtime to persist locally.

   Also removes the redundant wsauth.IssueToken call at the bottom of
   provisionWorkspaceCP, which created the same orphan-token pattern
   a second time.

3. Compaction artefacts (bundle/importer.go, handlers/org_tokens.go,
   scheduler.go, workspace_provision.go)
   Four pre-existing compile errors on main from an earlier session's
   code truncation: missing tuple destructuring on ExecContext /
   redactSecrets / orgTokenActor, missing close-brace in
   Scheduler.fireSchedule's panic recovery. All one-line mechanical
   fixes; without them the binary would not build.
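
A minimal sketch of the saasMode() ladder and the mode-dependent IP check from item 1. The function names resolveSaasMode and isBlockedAddr, passing the environment values as arguments, treating unparseable input as blocked, and falling back to the legacy signal on an unrecognised flag value are all assumptions of this sketch, not the code in registry.go or mcp.go.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// resolveSaasMode applies the resolution order described above. Taking the
// MOLECULE_DEPLOY_MODE and MOLECULE_ORG_ID values as arguments (instead of
// calling os.Getenv) is an assumption made to keep the ladder testable.
func resolveSaasMode(deployMode, orgID string) bool {
	switch strings.ToLower(strings.TrimSpace(deployMode)) {
	case "saas":
		return true
	case "self-hosted":
		return false
	}
	// No recognised explicit flag: fall back to the legacy implicit signal.
	return strings.TrimSpace(orgID) != ""
}

// mustCIDRs parses CIDR literals and panics on a typo in the tables below.
func mustCIDRs(cidrs ...string) []*net.IPNet {
	nets := make([]*net.IPNet, 0, len(cidrs))
	for _, c := range cidrs {
		_, n, err := net.ParseCIDR(c)
		if err != nil {
			panic(err)
		}
		nets = append(nets, n)
	}
	return nets
}

var (
	// Loopback, link-local (including the metadata address) and TEST-NET.
	alwaysBlocked = mustCIDRs("127.0.0.0/8", "::1/128", "169.254.0.0/16",
		"fe80::/10", "192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24")
	// RFC-1918 and IPv6 ULA: blocked only outside SaaS mode.
	saasAllowed = mustCIDRs("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "fc00::/7")
)

func inAny(nets []*net.IPNet, ip net.IP) bool {
	for _, n := range nets {
		if n.Contains(ip) {
			return true
		}
	}
	return false
}

// isBlockedAddr reports whether addr must be rejected. IPv6 input is checked
// like IPv4; input that does not parse is rejected (an assumption).
func isBlockedAddr(addr string, saas bool) bool {
	ip := net.ParseIP(addr)
	if ip == nil || inAny(alwaysBlocked, ip) {
		return true
	}
	return !saas && inAny(saasAllowed, ip)
}

func main() {
	saas := resolveSaasMode("", "org-123")
	// prints: true false true
	fmt.Println(saas, isBlockedAddr("172.31.4.7", saas), isBlockedAddr("fe80::1", saas))
}
```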

Tests
-----
ssrf_test.go adds:
  * TestSaasMode — covers the env resolution ladder (explicit flag
    wins over legacy signal, case-insensitive, whitespace tolerant)
  * TestIsPrivateOrMetadataIP_SaaSMode — asserts RFC-1918 + IPv6 ULA
    flip to allowed, metadata/loopback/TEST-NET still blocked
  * TestIsPrivateOrMetadataIP_IPv6 — regression guard for the old
    "returns false for all IPv6" behaviour

Follow-up issue for CP-sourced workspace_id attestation will be filed
separately — closes the residual intra-VPC SSRF + token-race windows
the SaaS-mode relaxation introduces.

Verified end-to-end today on workspace 6565a2e0 (hermes runtime, OpenAI
provider) — agent returned "PONG" in 1.4s after register → heartbeat →
A2A proxy → runtime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 03:06:46 -07:00
molecule-ai[bot]
45715aa8a5 fix(canvas/test): patch test regressions from PR #1243 + proximity hitbox fix (#1313)
* fix(ci): revert cancel-in-progress to true — ubuntu-runner dispatch stalled

With cancel-in-progress: false, pending CI runs accumulate in the
ci-staging concurrency group. New pushes create queued runs, but
GitHub dispatches multiple runs for the same SHA instead of replacing
the pending one. All runs get stuck/cancelled before completing.

Reverting to cancel-in-progress: true restores CI operation — runs
that are superseded are cancelled, freeing the concurrency slot for
the new run to proceed.

Runner availability (ubuntu-latest dispatch stall) is a separate
infra issue tracked independently.

* fix(security): validate tar header names in copyFilesToContainer — CWE-22 path traversal (#1043)

Tar header names were built from raw map keys without validation. A malicious
server-side caller could embed "../" in a file name to escape the destPath
volume mount (/configs) and write files outside the intended directory.

Fix: validate each name with filepath.Clean + IsAbs + HasPrefix("..") checks
before using it in the tar header, then join with destPath for the archive
header. Also guard parent-directory creation against traversal.
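
A rough sketch of the name check described in the fix, applied while writing a tar archive. safeEntryName and buildArchive are hypothetical names, and the prefix test here matches a leading ".." path segment only, which is a choice of this sketch; copyFilesToContainer itself may differ.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"path/filepath"
	"strings"
)

// safeEntryName cleans a caller-supplied file name and rejects anything that
// is absolute or that still begins with a ".." segment after cleaning.
func safeEntryName(name string) (string, error) {
	clean := filepath.Clean(name)
	switch {
	case clean == ".", filepath.IsAbs(clean):
		return "", fmt.Errorf("unsafe file name %q", name)
	case clean == "..", strings.HasPrefix(clean, ".."+string(filepath.Separator)):
		return "", fmt.Errorf("file name %q escapes the destination", name)
	}
	return clean, nil
}

// buildArchive validates every map key before it becomes a tar header name,
// then joins the cleaned name with destPath for the header.
func buildArchive(destPath string, files map[string][]byte) ([]byte, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	for name, data := range files {
		clean, err := safeEntryName(name)
		if err != nil {
			return nil, err
		}
		hdr := &tar.Header{
			Name: filepath.Join(destPath, clean),
			Mode: 0o644,
			Size: int64(len(data)),
		}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write(data); err != nil {
			return nil, err
		}
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	_, err := buildArchive("/configs", map[string][]byte{"../../etc/cron.d/evil": []byte("x")})
	fmt.Println(err) // the traversal attempt is rejected before any header is written
}
```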

Closes #1043.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(canvas/test): patch regressed tests from PR #1243 orgs-page flakiness fix

Two regressions introduced by PR #1243 (fix issue #1207):

1. **ContextMenu.keyboard.test.tsx** — `setPendingDelete` now receives
   `{id, name, hasChildren}` (cascade-delete UX, PR #1252), but the test
   expected only `{id, name}`. Added `hasChildren: false` to the assertion.

2. **orgs-page.test.tsx** — 10 tests awaited `vi.advanceTimersByTimeAsync(50)`
   without `act()`. With fake timers, `setState` (synchronous) is flushed by
   `advanceTimersByTimeAsync`, but the React state update it triggers is a
   microtask, so the test saw a stale render. Wrapping in `act(async () =>
   { await vi.advanceTimersByTimeAsync(50); })` ensures microtasks drain
   before assertions run.

All 813 vitest tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(canvas): add 100px proximity threshold to drag-to-nest detection

Fixes #1052 — previously, getIntersectingNodes() returned any node whose
bounding box overlapped the dragged node, regardless of actual pixel
distance. On a sparse canvas this triggered the "Nest Workspace" dialog
even when the dragged node was nowhere near any target.

The fix adds an on-node-drag proximity filter: only nodes within 100px
(center-to-center) of the dragged node are eligible as nest targets.
Distance is computed as squared Euclidean to avoid the sqrt overhead in
the hot drag path.
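
The squared-distance comparison can be illustrated as follows. The actual filter lives in Canvas.tsx (TypeScript); this Go version only shows the arithmetic, and the point type, the function name and the inclusive boundary at exactly 100px are assumptions.

```go
package main

import "fmt"

// point is a node centre in canvas pixels (hypothetical type).
type point struct{ X, Y float64 }

const nestThresholdPx = 100.0

// withinNestRange compares squared distances, so no square root is needed.
func withinNestRange(dragged, target point) bool {
	dx, dy := dragged.X-target.X, dragged.Y-target.Y
	return dx*dx+dy*dy <= nestThresholdPx*nestThresholdPx
}

func main() {
	// 60-80-100 triangle: exactly on the threshold. prints: true
	fmt.Println(withinNestRange(point{0, 0}, point{60, 80}))
	// prints: false
	fmt.Println(withinNestRange(point{0, 0}, point{300, 0}))
}
```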

Added two tests to Canvas.pan-to-node.test.tsx covering the mock wiring
and confirming the regression is addressed in Canvas.tsx.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>
Co-authored-by: Molecule AI Core-FE <core-fe@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 07:06:57 +00:00
molecule-ai[bot]
8b24ac2174 fix(security): backport SSRF defence (CWE-918) to main — isSafeURL in a2a_proxy.go (#1292) (#1302)
* fix(security): backport SSRF defence (CWE-918) to main — isSafeURL in mcp.go and a2a_proxy.go

Issue #1042: 3 CodeQL SSRF findings across mcp.go and a2a_proxy.go.
staging already ships the fix (PRs #1147, #1154 → merged); main did not include it.

- mcp.go: add isSafeURL() + isPrivateOrMetadataIP() helpers; validate
  agentURL before outbound calls in mcpCallTool (line ~529) and
  toolDelegateTaskAsync (line ~607)
- a2a_proxy.go: add identical isSafeURL() + isPrivateOrMetadataIP()
  helpers; call isSafeURL() before dispatchA2A in resolveAgentURL()
  (blocks finding #1 at line 462)
- mcp_test.go: 19 new tests covering all blocked URL patterns:
  file://, ftp://, 127.0.0.1, ::1, 169.254.169.254, 10.x.x.x,
  172.16.x.x, 192.168.x.x, empty hostname, invalid URL,
  isPrivateOrMetadataIP across all private/CGNAT/metadata ranges

1. URL scheme enforcement — http/https only
2. IP literal blocking — loopback, link-local, RFC-1918, CGNAT, doc/test ranges
3. DNS hostname resolution — blocks internal hostnames resolving to private IPs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(ci-blocker): remove duplicate isSafeURL/isPrivateOrMetadataIP from mcp.go

Issue #1292: PR #1274 duplicated isSafeURL + isPrivateOrMetadataIP in
mcp.go — both functions already exist on main at lines 829 and 876.
Kept the mcp.go definitions (the originals) and removed the 70-line
duplicate appended at end of file. a2a_proxy.go functions are
unchanged — they serve the same purpose via a separate code path.

* fix: remove orphaned commit-text lines from a2a_proxy.go

Three lines from the PR/commit title were accidentally baked into the
file during the rebase from #1274 to #1302, causing a Go syntax error
(a bare string literal at statement level followed by dangling braces).

Deletion restores:
  }
  return agentURL, nil
}

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Molecule AI Infra-Runtime-BE <infra-runtime-be@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Molecule AI Core-BE <core-be@agents.moleculesai.app>
Co-authored-by: Molecule AI SDK Lead <sdk-lead@agents.moleculesai.app>
2026-04-21 07:06:42 +00:00
molecule-ai[bot]
49ab614f2f fix(security): CWE-78/CWE-22 — block shell injection in deleteViaEphemeral (#1310)
## Summary
Issue #1273: deleteViaEphemeral interpolated filePath directly into
rm command, enabling both shell injection (CWE-78) and path traversal
(CWE-22) attacks.

## Changes
1. Added validateRelPath(filePath) guard before constructing the rm command.
   validateRelPath blocks absolute paths and ".." traversal sequences.
2. Changed Cmd from "/configs/"+filePath (string interpolation) to
   []string{"rm", "-rf", "/configs", filePath} (exec form). This
   eliminates shell injection entirely — filePath is a plain argument,
   never interpreted as shell code.

## Security properties
- validateRelPath: blocks "../" and absolute paths before they reach Docker
- Exec form: filePath cannot inject shell metacharacters even if validation
  is somehow bypassed
- "/configs" as separate arg: rm has exactly two arguments, no room for
  injected args
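
To illustrate why exec form removes the injection vector, the sketch below contrasts a shell-interpreted command string with an argv slice. validateRelPath here is a hypothetical stand-in for the real guard, the "sh -c" wrapper is assumed for the vulnerable shape, and joining the validated path under /configs into a single argument is a choice of this sketch rather than the argv used in the commit.

```go
package main

import (
	"errors"
	"fmt"
	"path"
	"strings"
)

// validateRelPath rejects empty, absolute, and ".."-containing paths.
func validateRelPath(p string) error {
	if p == "" {
		return errors.New("empty path")
	}
	if path.IsAbs(p) {
		return errors.New("absolute path not allowed")
	}
	for _, seg := range strings.Split(p, "/") {
		if seg == ".." {
			return errors.New("path traversal not allowed")
		}
	}
	return nil
}

// shellForm is the vulnerable shape: a shell parses the whole string, so
// metacharacters in filePath become shell syntax.
func shellForm(filePath string) []string {
	return []string{"sh", "-c", "rm -rf /configs/" + filePath}
}

// execForm passes filePath as data inside one argv element; no shell ever
// parses it, so ";" or "$(...)" stay literal characters in a file name.
func execForm(filePath string) ([]string, error) {
	if err := validateRelPath(filePath); err != nil {
		return nil, err
	}
	return []string{"rm", "-rf", "--", path.Join("/configs", filePath)}, nil
}

func main() {
	hostile := "x; touch /tmp/pwned"
	fmt.Printf("%q\n", shellForm(hostile))
	argv, err := execForm(hostile)
	fmt.Printf("%q %v\n", argv, err)
}
```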

Closes #1273.

Co-authored-by: Molecule AI Infra-Runtime-BE <infra-runtime-be@agents.moleculesai.app>
2026-04-21 07:06:31 +00:00
Hongming Wang
1f35128ebb Merge pull request #1262 from Molecule-AI/fix/sweeper-emit-provision-failed
fix(sweeper): emit WORKSPACE_PROVISION_FAILED so canvas updates UI
2026-04-20 20:39:20 -07:00
Hongming Wang
ec52d155f4 fix(sweeper): emit WORKSPACE_PROVISION_FAILED so canvas updates UI
The provision-timeout sweeper was emitting a new WORKSPACE_PROVISION_TIMEOUT
event type, but the canvas event handler (canvas-events.ts:234) only
has a case for WORKSPACE_PROVISION_FAILED — the sweep's event fell
through silently. DB was being marked 'failed' but the UI stayed on
'starting' indefinitely until the user hard-refreshed.

Reusing the existing event name keeps the UI reaction uniform across
both fail paths (runtime-crash via bootstrap-watcher and boot-timeout
via sweeper). Operators who need to distinguish can read the `source`
payload field — "bootstrap_watcher" vs "provision_timeout_sweep".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:38:41 -07:00
molecule-ai[bot]
0bd2bf2b7f fix(security): CWE path-injection — resolveInsideRoot for Restart + ReadFile template paths (PR #1261)
workspace_restart.go:127-133 accepted body.Template (attacker-controlled)
via raw filepath.Join(h.configsDir, template), allowing path traversal
(e.g. "../../../etc") to escape configsDir.

Fix: replace raw filepath.Join with resolveInsideRoot, same pattern as
workspace.go:102 (already fixed) and workspace.go:249 (already fixed).
Both the explicit template path and the findTemplateByName fallback are
safe — findTemplateByName returns a directory name from os.ReadDir which
is inherently bounded and cannot contain "/".

On resolve error the template is cleared so findTemplateByName fallback
still fires (preserves existing restart behaviour when template is invalid).
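
One plausible shape for a resolveInsideRoot-style helper is sketched below, together with the "clear the template on error" behaviour. The signature and the templateOrEmpty wrapper are assumptions; the real helper may differ, and this version does not resolve symlinks.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// resolveInsideRoot joins rel onto root and returns an error if the result
// would land outside root.
func resolveInsideRoot(root, rel string) (string, error) {
	absRoot, err := filepath.Abs(root)
	if err != nil {
		return "", err
	}
	candidate := filepath.Join(absRoot, rel)
	relToRoot, err := filepath.Rel(absRoot, candidate)
	if err != nil {
		return "", err
	}
	if relToRoot == ".." || strings.HasPrefix(relToRoot, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("path %q escapes %s", rel, absRoot)
	}
	return candidate, nil
}

// templateOrEmpty returns "" when the requested template is invalid, so a
// caller can fall through to a name-based lookup instead of failing.
func templateOrEmpty(configsDir, template string) string {
	resolved, err := resolveInsideRoot(configsDir, template)
	if err != nil {
		return ""
	}
	return resolved
}

func main() {
	fmt.Println(resolveInsideRoot("/srv/configs", "python-agent"))
	fmt.Println(resolveInsideRoot("/srv/configs", "../../../etc"))
}
```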

Closes: #1043

Co-authored-by: Molecule AI Core-BE <core-be@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 03:38:39 +00:00
molecule-ai[bot]
bc9ce59b79 fix(F1097): set org_id in Gin context for org-token callers (#1218) (#1253)
orgtoken.Validate now returns org_id (the org workspace UUID stored on
org_api_tokens rows, populated by #1212). Both call sites in
wsauth_middleware.go — WorkspaceAuth and AdminAuth — call
c.Set("org_id", orgID) after successful org-token validation.

This unbreaks orgCallerID(c) for org-token callers. Previously the
middleware populated org_token_id and org_token_prefix but never org_id,
so any handler reading c.Get("org_id") (e.g. requireCallerOwnsOrg) got
"" even for valid org tokens.

The change is additive: orgID may be empty for pre-migration tokens
minted before #1212. requireCallerOwnsOrg already handles empty org_id
by denying by default.

Co-authored-by: Molecule AI CP-BE <cp-be@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 03:26:47 +00:00
molecule-ai[bot]
732f65e8e1 fix(go): replace $1 literal with resp.Body.Close() in 7 files (#1247)
The sed command in PR #1229 had no capture groups but used $1 in the
replacement, committing the literal string "defer func() { _ = \$1 }()"
instead of "defer func() { _ = resp.Body.Close() }()". The resulting Go
does not compile because $1 is not a valid identifier.

Fixed with: sed -i 's/defer func() { _ = \$1 }()/defer func() { _ = resp.Body.Close() }()/g'

Affected (all on origin/staging):
  workspace-server/cmd/server/cp_config.go
  workspace-server/internal/handlers/a2a_proxy.go
  workspace-server/internal/handlers/github_token.go
  workspace-server/internal/handlers/traces.go
  workspace-server/internal/handlers/transcript.go
  workspace-server/internal/middleware/session_auth.go
  workspace-server/internal/provisioner/cp_provisioner.go (3 occurrences)

Closes: #1245

Co-authored-by: Molecule AI Core-BE <core-be@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 03:18:21 +00:00
4555304850 fix(merge): resolve conflict markers in workspace_provision.go line 585
CPProvisioner env mutator error branch was left with unresolved conflict
markers after a prior rebase. Resolved to the HEAD-side generic message
"plugin env mutator chain failed" which is consistent with the same
message used in the Provisioner path (line 107/111).

No functional change.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 03:12:52 +00:00
molecule-ai[bot]
9be99059dd fix(scheduler): use context.Background() for post-fire UPDATE (F1089) (#1244)
The post-fire UPDATE after s.proxy.ProxyA2ARequest() was using fireCtx,
which derives from the outer ctx passed into fireSchedule(). If that ctx
is cancelled — HTTP timeout, graceful shutdown, or any upstream deadline —
ExecContext returns context.Canceled and the UPDATE is silently skipped,
leaving next_run_at stale and causing the schedule to re-fire on the
next tick.

Fix: create a dedicated updateCtx from context.Background() with a 5s
deadline, independent of the outer ctx hierarchy. Also improved the
error log to include schedule name for easier debugging.
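
The difference between the two contexts can be shown in isolation. In this sketch postFireUpdate stands in for the ExecContext call and only reports whether its context is still usable; the function names and the simulated cancellation are assumptions.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// postFireUpdate is a stand-in for the UPDATE that advances next_run_at.
// A real ExecContext would fail in a similar way on a cancelled context.
func postFireUpdate(ctx context.Context) error {
	return ctx.Err()
}

// demo simulates an upstream cancellation after the proxy call and runs the
// UPDATE once with the derived fireCtx and once with a detached context.
func demo() (oldErr, fixedErr error) {
	outer, cancelOuter := context.WithCancel(context.Background())
	fireCtx, cancelFire := context.WithCancel(outer)
	defer cancelFire()

	cancelOuter() // HTTP timeout, graceful shutdown, or an upstream deadline

	// Before the fix: the UPDATE inherits the cancellation and is skipped.
	oldErr = postFireUpdate(fireCtx)

	// After the fix: a fresh context with its own 5s deadline.
	updateCtx, cancelUpdate := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancelUpdate()
	fixedErr = postFireUpdate(updateCtx)
	return oldErr, fixedErr
}

func main() {
	oldErr, fixedErr := demo()
	fmt.Println("with fireCtx:  ", oldErr)
	fmt.Println("with updateCtx:", fixedErr)
}
```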

Complements PR #1241 (fix/f1089-scheduler-ctx-fix-main) which fixes
the goroutine-panic path in tick() — this fix covers the wider case of
normal-return + ctx-cancelled after the proxy call.

F1089 | Severity: HIGH+security

Co-authored-by: Molecule AI Infra Lead <infra-lead@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 03:07:26 +00:00
Hongming Wang
8059fee128 fix(tenant-guard): allowlist /registry/register + /registry/heartbeat (#1236)
* fix(security): call redactSecrets before seeding workspace memories (F1085)

seedInitialMemories() in workspace_provision.go was inserting template/config
memories directly into agent_memories without scrubbing credential patterns.
A workspace provisioned from a template containing API keys, tokens, or other
secrets would store them in plain text — the same class of issue as #838.

Fix: call redactSecrets(workspaceID, content) on the truncated memory content
before the INSERT. The truncation (maxMemoryContentLength = 100 KiB, CWE-400)
is preserved — redaction runs after truncation so the size limit still applies.
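
The ordering can be sketched as below. The secret pattern, the redaction marker, the two-value return of redactSecrets and the prepareMemory helper are all hypothetical; only the truncate-then-redact order comes from the description above. The limit is a parameter here so the sketch does not hard-code maxMemoryContentLength.

```go
package main

import (
	"fmt"
	"regexp"
)

// secretPattern is a made-up example pattern, not the project's real list.
var secretPattern = regexp.MustCompile(`sk-[A-Za-z0-9]{8,}`)

// redactedMarker is shorter than any match of secretPattern, so redaction
// never grows the content past the limit applied before it.
const redactedMarker = "[REDACTED]"

// redactSecrets returns the scrubbed content and whether anything changed.
// workspaceID is accepted to mirror the call shape in the commit message.
func redactSecrets(workspaceID, content string) (string, bool) {
	redacted := secretPattern.ReplaceAllString(content, redactedMarker)
	return redacted, redacted != content
}

// prepareMemory truncates to limit bytes first, then redacts.
func prepareMemory(workspaceID, content string, limit int) string {
	if len(content) > limit {
		content = content[:limit]
	}
	redacted, _ := redactSecrets(workspaceID, content)
	return redacted
}

func main() {
	// prints: key=[REDACTED] tail
	fmt.Println(prepareMemory("ws-1", "key=sk-abcdefgh12345678 tail", 1000))
}
```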

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(workspace_provision): add seedInitialMemories coverage for #1208

Cover the truncate-at-100k boundary (PR #1167, CWE-400) and the
redactSecrets call (F1085 / #1132), both identified as untested in #1208.

- TestSeedInitialMemories_TruncatesOversizedContent: boundary at exactly
  100k, 1 byte over, far over, and well under. Verifies INSERT receives
  exactly maxMemoryContentLength bytes.
- TestSeedInitialMemories_RedactsSecrets: verifies redactSecrets runs
  before INSERT, regression test for F1085.
- TestSeedInitialMemories_InvalidScopeSkipped: invalid scope is silently
  skipped, no INSERT called.
- TestSeedInitialMemories_EmptyMemoriesNil: nil slice is handled without
  DB calls.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(marketing): Discord adapter launch visual assets (#1209)

Squash-merge: Discord adapter launch visual assets (3 PNGs) + social copy. Acceptance: assets on staging.

* fix(ci): golangci-lint errcheck failures on staging

Suppress errcheck warnings for calls where the return value is safely
ignored:
  - resp.Body.Close() (artifacts/client.go): deferred cleanup — failure
    to close a response body is non-critical; the defer itself is what
    matters for connection reuse.
  - rows.Close() (bundle/exporter.go): deferred cleanup in a loop where
    rows.Err() already handles query errors.
  - filepath.Walk (bundle/exporter.go): top-level walk call; errors in
    sub-directory traversal are handled by the inner callback (which
    returns nil for err != nil).
  - broadcaster.RecordAndBroadcast (bundle/importer.go): fire-and-forget
    event broadcast; errors are logged internally by the broadcaster.
  - db.DB.ExecContext (bundle/importer.go): best-effort runtime column
    update; non-critical auxiliary data that the provisioner re-extracts
    if needed.

Fixes: #1143

* test(artifacts): suppress w.Write return values to satisfy errcheck

All httptest.ResponseWriter.Write calls in client_test.go now discard
the byte count and error return with _, _ = prefix. The Write method
is safe to discard in test handlers — httptest.ResponseWriter.Write
never returns an error for in-memory buffers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(CI): move changes job off self-hosted runner + add workflow concurrency

Cherry-pick from staging PR #1194 for main. Two changes to relieve
macOS arm64 runner saturation:

1. `changes` job: runs on ubuntu-latest instead of
   [self-hosted, macos, arm64]. This job does a plain `git diff`
   with zero macOS dependencies — moving it off the runner frees
   a slot immediately on every workflow trigger.

2. Add workflow-level concurrency:
   concurrency: group: ci-${{ github.ref }}; cancel-in-progress: true

   Prevents multiple stale in-flight CI runs from queuing on the
   same ref when new commits arrive.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(security): call redactSecrets before seeding workspace memories (F1085) (#1203)

seedInitialMemories() in workspace_provision.go was inserting template/config
memories directly into agent_memories without scrubbing credential patterns.
A workspace provisioned from a template containing API keys, tokens, or other
secrets would store them in plain text — the same class of issue as #838.

Fix: call redactSecrets(workspaceID, content) on the truncated memory content
before the INSERT. The truncation (maxMemoryContentLength = 100 KiB, CWE-400)
is preserved — redaction runs after truncation so the size limit still applies.

Co-authored-by: Molecule AI Core-BE <core-be@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* tick: 2026-04-21 ~03:40Z — CI stalled 59+ min, GH_TOKEN 4th rotation, PR reviews done

* fix(tenant-guard): allowlist /registry/register + /registry/heartbeat

Final layer of today's stuck-provisioning saga. With the private-IP
platform_url fix and the intra-VPC :8080 SG rule in place, workspace
EC2s finally reached the tenant on the right port — only to have every
POST bounced with a synthetic 404 by TenantGuard.

TenantGuard is the SaaS hook that rejects cross-tenant routing. It
demands X-Molecule-Org-Id on every request, but CP's workspace user-
data doesn't export MOLECULE_ORG_ID (only WORKSPACE_ID, PLATFORM_URL,
RUNTIME, PORT), so the runtime can't attach the header. Net effect:
every workspace's first heartbeat to /registry/heartbeat was a silent
404, and the workspace sat in 'provisioning' until the platform
sweeper timed it out.

Allowlist the two workspace-boot paths:
  - /registry/register  — one-shot at runtime startup
  - /registry/heartbeat — every 30s

Both are still gated by wsauth.HasAnyLiveToken (workspaces with a
token on file must present it; legacy tokenless workspaces are
grandfathered). And the tenant SG already scopes :8080 to the VPC
CIDR, so only intra-VPC callers can reach these paths in the first
place. The allowlist bypasses cross-org routing, not auth.
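
A simplified picture of the allowlist, written against net/http rather than whatever router the tenant actually uses. The tenantGuard signature, the header comparison and the status helper are assumptions of this sketch; token auth is expected to run in the wrapped handler and is not shown.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Workspace-boot paths that skip the org-header check.
var tenantGuardAllowlist = map[string]bool{
	"/registry/register":  true,
	"/registry/heartbeat": true,
}

// tenantGuard answers 404 unless the request carries the expected
// X-Molecule-Org-Id header or targets an allowlisted path.
func tenantGuard(expectedOrg string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if tenantGuardAllowlist[r.URL.Path] {
			next.ServeHTTP(w, r) // auth still happens downstream
			return
		}
		if r.Header.Get("X-Molecule-Org-Id") != expectedOrg {
			http.NotFound(w, r)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// status sends one request through the guard and returns the response code.
func status(method, path, orgHeader string) int {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	req := httptest.NewRequest(method, path, nil)
	if orgHeader != "" {
		req.Header.Set("X-Molecule-Org-Id", orgHeader)
	}
	rec := httptest.NewRecorder()
	tenantGuard("org-1", ok).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(status("POST", "/registry/heartbeat", "")) // 200
	fmt.Println(status("GET", "/workspaces", ""))          // 404
}
```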

Follow-up: passing MOLECULE_ORG_ID into the workspace env would let
the runtime attach the header and drop this allowlist entry. Tracked
separately; not urgent since the multi-layer auth above is already
adequate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Molecule AI Core-BE <core-be@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Molecule AI Infra-SRE <infra-sre@agents.moleculesai.app>
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>
Co-authored-by: Molecule AI Core-DevOps <core-devops@agents.moleculesai.app>
Co-authored-by: Molecule AI Core-UIUX <core-uiux@agents.moleculesai.app>
Co-authored-by: Hongming Wang <hongmingwang.rabbit@users.noreply.github.com>
2026-04-21 02:47:27 +00:00
molecule-ai[bot]
2575960805 fix(errcheck): suppress unchecked resp.Body.Close() across workspace-server (#1229)
Issue #1196: golangci-lint errcheck flags bare resp.Body.Close()
calls because Body.Close() can return a non-nil error (e.g. when the
server sent fewer bytes than Content-Length). All occurrences fixed:

  defer resp.Body.Close()  →  defer func() { _ = resp.Body.Close() }()
  resp.Body.Close()        →  _ = resp.Body.Close()

12 files affected across all Go packages — channels, handlers,
middleware, provisioner, artifacts, and cmd. The body is already fully
consumed at each call site, so the error is always safe to discard.

🤖 Generated with [Claude Code](https://claude.ai)

Co-authored-by: Molecule AI Core-BE <core-be@agents.moleculesai.app>
2026-04-21 02:45:34 +00:00
molecule-ai[bot]
5b5a634b5b fix(middleware): set org_id in context after orgtoken.Validate (F1097) (#1232)
PR #1210 added org_api_tokens.org_id but c.Set("org_id", ...) was never
called — so orgCallerID() always returns "" and all token callers are
denied org-scoped access even within their own org.

Fix: after orgtoken.Validate succeeds in AdminAuth, look up the token's
org_id column and set it in the gin context. Pre-fix tokens (org_id=NULL)
get no org_id in context, which is correct — requireCallerOwnsOrg already
denies access for nil org_id.

Test: TestAdminAuth_OrgToken_SetsOrgID covers both post-fix tokens
(org_id set) and pre-fix tokens (org_id=NULL, not set).

Co-authored-by: Molecule AI Infra-SRE <infra-sre@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 02:45:27 +00:00
molecule-ai[bot]
24daa05190 fix(F1089): log panic-recovery UPDATE errors in scheduler (#1233)
* fix(auth): F1094 — requireCallerOwnsOrg reads org_id not created_by (#1200)

Root cause: requireCallerOwnsOrg (org_plugin_allowlist.go:116) was
reading org_api_tokens.created_by to determine caller's org workspace
ID. But created_by is a provenance label ("session", "admin-token",
"org-token:<prefix>") — never a UUID. The equality check
callerOrg != targetOrgID always failed → every org-token caller
got 403 on /orgs/:id/plugins/allowlist routes.

Fix:
- Migration 036: adds org_id UUID column (nullable) to org_api_tokens
  with index. Existing pre-migration tokens get org_id=NULL → deny
  by default (safer than cross-org access).
- orgtoken.Issue: takes new orgID param; stores in org_id column.
- orgtoken.OrgIDByTokenID: new helper reads org_id for a token ID.
  Returns ("", nil) for NULL/unanchored tokens.
- requireCallerOwnsOrg: now calls OrgIDByTokenID instead of reading
  created_by. Pre-migration tokens with org_id=NULL get callerOrg=""
  → denied (safer).
- orgTokenActor (org_tokens.go): returns (createdBy, orgID) pair.
  Token minted via another org token gets its org_id set at mint time.
  Session/ADMIN_TOKEN callers get orgID="".
- orgtoken.Token struct: adds OrgID field for list display.
- orgtoken.List: selects org_id alongside other columns.
- Updated existing tests for new Issue signature.
- Added 10 regression tests covering: happy path, unanchored denial,
  cross-org denial, session bypass, DB error denial.
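
The deny-by-default rule can be sketched with an in-memory stand-in for the org_api_tokens lookup. tokenStore and callerOwnsOrg are hypothetical; only the OrgIDByTokenID contract (empty string and nil error for NULL org_id) and the denial cases come from the list above.

```go
package main

import (
	"errors"
	"fmt"
)

// tokenStore maps token ID to org_id; "" models a NULL org_id column.
type tokenStore map[string]string

// OrgIDByTokenID returns ("", nil) for unanchored tokens and an error for
// unknown ones (standing in for a DB failure).
func (s tokenStore) OrgIDByTokenID(tokenID string) (string, error) {
	orgID, ok := s[tokenID]
	if !ok {
		return "", errors.New("token lookup failed")
	}
	return orgID, nil
}

// callerOwnsOrg denies on lookup error, on an empty org_id, and on mismatch.
func callerOwnsOrg(s tokenStore, tokenID, targetOrgID string) bool {
	callerOrg, err := s.OrgIDByTokenID(tokenID)
	if err != nil || callerOrg == "" {
		return false
	}
	return callerOrg == targetOrgID
}

func main() {
	s := tokenStore{"tok-new": "org-a", "tok-old": ""}
	// prints: true false false
	fmt.Println(callerOwnsOrg(s, "tok-new", "org-a"),
		callerOwnsOrg(s, "tok-old", "org-a"),
		callerOwnsOrg(s, "tok-new", "org-b"))
}
```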

🤖 Generated with [Claude Code](https://claude.ai/claude-code)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(security): replace err.Error() leaks with prod-safe messages (#1206)

- workspace_provision.go: provisionWorkspace, provisionWorkspaceCP —
  replaced 7 err.Error() calls with "provisioning failed" in both
  Broadcast payloads and last_sample_error DB column. Full error
  preserved in server-side log.Printf.

- plugins_install_pipeline.go: resolveAndStage — replaced 5 err.Error()
  calls with generic messages:
    "invalid plugin source"
    "plugin source not supported"
    "invalid plugin name"
    "staged plugin exceeds size limit"
    "plugin manifest integrity check failed"

Risk mitigated: DB errors (pq: connection refused, pq: deadlock),
OS errors, and internal paths no longer leak in HTTP JSON responses
or WebSocket broadcasts.
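
The pattern amounts to logging the full error server-side and handing callers a fixed string. A small sketch, with publicError and demoPayload as hypothetical names:

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// publicError logs the detailed error for operators and returns only the
// generic message that is safe to put in a response or broadcast.
func publicError(op string, err error, public string) string {
	log.Printf("%s: %v", op, err)
	return public
}

// demoPayload builds a payload the way a handler might after a DB failure.
func demoPayload() map[string]string {
	internal := errors.New("pq: connection refused")
	return map[string]string{
		"error": publicError("provisionWorkspace", internal, "provisioning failed"),
	}
}

func main() {
	fmt.Println(demoPayload()["error"]) // provisioning failed
}
```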

Added regression tests (workspace_provision_test.go):
  - TestProvisionWorkspace_NoInternalErrorsInBroadcast
  - TestProvisionWorkspaceCP_NoInternalErrorsInBroadcast
  - TestResolveAndStage_NoInternalErrorsInHTTPErr

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(F1089): log panic-recovery UPDATE errors in scheduler

The panic defer blocks in tick() and fireSchedule() now capture
and log errors from the db.DB.ExecContext call that advances next_run_at
after a panic. Previously, a DB failure during panic recovery was
silent — the log line for the panic itself appeared but any subsequent
UPDATE failure was invisible, risking unnoticed scheduler drift.

context.Background() was already used (F1089 comment in place); this
commit adds the missing error capture + log.Printf on exec failure.
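
A reduced sketch of the recovery path described here: recover, log the panic, run the UPDATE on context.Background(), and log any error from it. The advance callback stands in for db.DB.ExecContext, and logf collects lines in memory only so the example can be checked; both are assumptions.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

var logLines []string

// logf stands in for log.Printf and records each line.
func logf(format string, args ...interface{}) {
	logLines = append(logLines, fmt.Sprintf(format, args...))
}

// fireSchedule runs the schedule body and, if it panics, still tries to
// advance next_run_at and reports a failure of that UPDATE.
func fireSchedule(advance func(context.Context) error, run func()) {
	defer func() {
		r := recover()
		if r == nil {
			return
		}
		logf("scheduler: panic in fireSchedule: %v", r)
		if err := advance(context.Background()); err != nil {
			logf("scheduler: next_run_at UPDATE after panic failed: %v", err)
		}
	}()
	run()
}

// demo triggers a panic while the database is unavailable.
func demo() []string {
	logLines = nil
	fireSchedule(
		func(context.Context) error { return errors.New("db unavailable") },
		func() { panic("boom") },
	)
	return logLines
}

func main() {
	for _, line := range demo() {
		fmt.Println(line)
	}
}
```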

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Molecule AI Dev Lead <dev-lead@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 02:45:25 +00:00
molecule-ai[bot]
5bdacc611e fix(security): sanitize error details in BootstrapFailed, provision, and plugin install (#1219)
Multiple security findings addressed:

F1095 (BootstrapFailed): Replace err.Error() in ShouldBindJSON failure
response with generic "invalid request body" — raw gin binding errors
can expose validation detail, field names, and type mismatch info.

F1096 (BootstrapFailed): Handle RowsAffected() error instead of ignoring
it — the DB call can fail in ways the current code silently ignores.

#1206 (provision/plugin install): Replace raw err.Error() in API responses,
broadcasts, and last_sample_error DB fields across workspace_provision.go
(7 occurrences) and plugins_install_pipeline.go (6 occurrences). Replaced
with context-appropriate generic messages that don't leak internal DB
file paths, decrypt error details, or resolver internals to callers.

#1208 (test-gap): Add 3 new seedInitialMemories truncate tests:
- Exactly-at-limit (100k bytes → unchanged, boundary case)
- Empty content (skipped, no DB call)
- Oversized with embedded secrets (truncation fires before any other content inspection)

Co-authored-by: Molecule AI Fullstack (floater) <fullstack-floater@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 02:11:38 +00:00
molecule-ai[bot]
f1accaf918 fix(auth): F1094 — requireCallerOwnsOrg reads org_id not created_by (#1200) (#1220)
Root cause: requireCallerOwnsOrg (org_plugin_allowlist.go:116) was
reading org_api_tokens.created_by to determine caller's org workspace
ID. But created_by is a provenance label ("session", "admin-token",
"org-token:<prefix>") — never a UUID. The equality check
callerOrg != targetOrgID always failed → every org-token caller
got 403 on /orgs/:id/plugins/allowlist routes.

Fix:
- Migration 036: adds org_id UUID column (nullable) to org_api_tokens
  with partial index for fast lookups. Existing pre-migration tokens
  get org_id=NULL → deny by default (safer than cross-org access).
- orgtoken.Issue: takes new orgID param; stores in org_id column.
- orgtoken.OrgIDByTokenID: new helper reads org_id for a token ID.
  Returns ("", nil) for NULL/unanchored tokens.
- requireCallerOwnsOrg: now calls OrgIDByTokenID instead of reading
  created_by. Pre-migration tokens with org_id=NULL get callerOrg=""
  → denied (safer).
- orgTokenActor (org_tokens.go): returns (createdBy, orgID) pair.
  Token minted via another org token gets its org_id set at mint time.
  Session/ADMIN_TOKEN callers get orgID="".
- orgtoken.Token struct: adds OrgID field for list display.
- orgtoken.List: selects org_id alongside other columns.
- Updated existing tests for new Issue signature.
- Added regression tests: happy path, unanchored denial, DB error denial.

Co-authored-by: Molecule AI Infra-Runtime-BE <infra-runtime-be@agents.moleculesai.app>
Co-authored-by: Molecule AI Dev Lead <dev-lead@agents.moleculesai.app>
2026-04-21 02:11:27 +00:00
molecule-ai[bot]
fcd3a6eaf0 fix(test): align ssrf_test.go localhost test cases with isSafeURL behaviour (#1192)
* feat(canvas): rewrite MemoryInspectorPanel to match backend API

Issue #909 (chunk 3 of #576).

The existing MemoryInspectorPanel used the wrong API endpoint
(/memory instead of /memories) and wrong field names (key/value/version
instead of id/content/scope/namespace/created_at). It also lacked
LOCAL/TEAM/GLOBAL scope tabs and a namespace filter.

Changes:
- Fix endpoint: GET /workspaces/:id/memories with ?scope= query param
- Fix MemoryEntry type to match actual API: id, content, scope,
  namespace, created_at, similarity_score
- Add LOCAL/TEAM/GLOBAL scope tabs
- Add namespace filter input
- Remove Edit functionality (no update endpoint in backend)
- Delete uses DELETE /workspaces/:id/memories/:id (by id, not key)
- Full rewrite of 27 tests to match new API and UI structure
- Uses ConfirmDialog (not native dialogs) for delete confirmation
- All dark zinc theme (no light colors)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: tighten types + improve provision-timeout message (#1135, #1136)

#1135 — TypeScript: make BudgetData.budget_used and WorkspaceMetrics
fields optional to match actual partial-response shapes from provisioning-
stuck workspaces. Runtime already guarded with ?? 0.

#1136 — provisiontimeout.go: replace misleading "check required env vars"
hint (preflight catches that case upfront) with accurate message about
container starting but failing to call /registry/register.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

* fix(test): align ssrf_test.go localhost test cases with isSafeURL behaviour

isSafeURL blocks 127.0.0.1 via ip.IsLoopback() even in dev environments.
The localhost test cases expected `wantErr: false`, which was incorrect, so
they would fail whenever `go test` ran. Fix by changing wantErr to true
for both localhost test cases.

Rationale: loopback blocking at this layer is intentional. Access
control is enforced by WorkspaceAuth + CanCommunicate at the A2A
routing layer, not by the URL validation. Opening this would widen
the SSRF attack surface without adding real dev flexibility.

Closes: ssrf_test.go inconsistency reported 2026-04-21

Co-Authored-By: Claude Sonnet 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Molecule AI Core-UIUX <core-uiux@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 02:08:45 +00:00
molecule-ai[bot]
09b5a444d3 fix(scheduler): use context.Background() in panic-recovery defer UPDATE (F1089) (#1211)
F1089: PR #1032's panic-recovery defers used the outer `ctx` passed into
fireSchedule/tick. If that ctx was cancelled during the panic window
(HTTP timeout, graceful shutdown), ExecContext returned early and the
next_run_at UPDATE was silently skipped — leaving the schedule stuck.

Fix: both panic defers now call ExecContext(context.Background()) so the
recovery UPDATE is independent of the outer ctx's lifecycle.

Refs: #1201 (F1089, security audit 2026-04-21)

Co-authored-by: Molecule AI CP-BE <cp-be@agents.moleculesai.app>
2026-04-21 02:08:00 +00:00
Molecule AI Fullstack (floater)
11f66b1837 fix(org-api-tokens): add org_id column, close requireCallerOwnsOrg regression
Fixes F1094 / #1200 / #1204 — org-token callers always getting 403 on
org-scoped routes because requireCallerOwnsOrg queried created_by
(provenance label string) instead of a proper org anchor UUID.

Changes:
- Migration 036 adds nullable org_id UUID column to org_api_tokens,
  references workspaces(id). Pre-fix tokens remain usable for
  non-org-scoped routes.
- requireCallerOwnsOrg now queries org_api_tokens.org_id directly.
  Tokens with org_id = NULL (pre-fix) are denied org-scoped access —
  correct security posture for Phase 32 multi-org isolation.
- orgtoken.Issue accepts and stores org_id via NULLIF($5,'')::uuid.
- OrgTokenHandler.Create passes org_id (from session context or
  request body) to Issue. Canvas UI should pass org_id in request
  body so new tokens carry their org anchor.
- admin_memories.go: remove dead-code duplicate redactSecrets call
  (shadowing declaration, lines 125+135 → single call at line 125).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 01:34:05 +00:00
molecule-ai[bot]
a5a495c804 Merge pull request #1032 from Molecule-AI/fix/scheduler-advance-next-run-1029
fix(scheduler): advance next_run_at on panic to prevent stuck schedules (#1029)
2026-04-21 00:59:32 +00:00
molecule-ai[bot]
7f2d71e392 test merge attempt
Co-authored-by: Molecule AI CP-BE <cp-be@agents.moleculesai.app>
2026-04-21 00:57:43 +00:00
molecule-ai[bot]
35ccda1091 fix(security): replace err.Error() with generic messages in handler responses (#1193)
Replace all c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
calls across 22 handler files with context-appropriate generic messages
to prevent internal error strings (DB details, validation messages,
file paths) leaking into API responses.

Pattern established:
- ShouldBindJSON failures → "invalid request body" (or "invalid delegation request")
- Validation failures → "invalid workspace ID", "invalid path", etc.
- Server-side errors still logged, only generic message returned to client

References: Security finding from Audit #125 (Stripe key leak via err.Error())

Co-authored-by: Molecule AI Fullstack (floater) <fullstack-floater@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 00:56:03 +00:00
rabbitblood
1c58bae7c5 test: trigger CI with file change
2026-04-21 00:48:52 +00:00
rabbitblood
74f36e6cec fix(test): align scheduler tests with #969 deferral loop and #795 empty-run tracking
- TestRecordSkipped_AdvancesNextRunAt: call recordSkipped directly instead
  of going through fireSchedule, which now has a 2-min deferral loop (#969)
  that makes sqlmock-based end-to-end testing impractical.
- TestFireSchedule_NormalSuccess_AdvancesNextRunAt: add missing expectation
  for the consecutive_empty_runs reset query (#795) that fires on non-empty
  successful responses.
- TestFireSchedule_ComputeNextRunError: same consecutive_empty_runs fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 00:48:52 +00:00
rabbitblood
ad0b870182 test: verify next_run_at advances on panic recovery (#1029)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 00:48:52 +00:00
rabbitblood
8ea04d62bb test: add cascade schedule disable tests for #1027
Add production fix and three new test cases verifying that workspace
deletion cascade-disables all workspace_schedules for the deleted
workspace and its descendants, preventing zombie schedule firings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 00:47:55 +00:00
rabbitblood
c0bc0df439 fix(scheduler): advance next_run_at on panic recovery to prevent stuck schedules (#1029)
When fireSchedule panics before reaching the next_run_at UPDATE,
the deferred recover catches the panic but never advances next_run_at,
leaving it stuck in the past forever. The schedule then fires every
tick (30s) in an infinite retry loop.

Add next_run_at advancement to both panic recovery defers (the
per-goroutine one in tick() and the inner one in fireSchedule()) so
the schedule always moves forward regardless of how the fire exits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 00:47:55 +00:00
molecule-ai[bot]
9842564b90 fix(security): truncate oversized memory content to prevent storage DoS (CWE-400) (#1167)
CP-QA approved. seedInitialMemories() now truncates mem.Content at 100,000 bytes before INSERT. Oversized content is logged with byte count before/after so operators can detect truncation. Fixes #1066 (CWE-400). NOTE: no unit tests in this commit — follow-up issue recommended.
2026-04-21 00:36:29 +00:00
molecule-ai[bot]
0b1fb56046 fix(scheduler): advance next_run_at on panic to prevent infinite DoS loop (#1029) (#1166)
CP-QA approved. Panic recovery in fireSchedule now advances next_run_at via ComputeNextRun + ExecContext, preventing a panicking cron from indefinitely starving all other schedules. 3 new tests: TestPanicRecovery_AdvancesNextRunAt, TestFireSchedule_NormalSuccess, TestRecordSkipped_AdvancesNextRunAt. Fixes #1029.
2026-04-21 00:34:13 +00:00
molecule-ai[bot]
4b1851a038 fix(security): redactSecrets on admin memories export/import (#1131, #1132) (#1153)
Security fixes for the memory backup/restore endpoints merged in PR #1051.

## F1084 / #1131: Memory export exposes all workspaces

GET /admin/memories/export now applies redactSecrets() to each content
field before including it in the JSON response. Pre-SAFE-T1201 memories
(stored before redactSecrets was mandatory on writes) no longer leak
credential patterns in the admin export.

## F1085 / #1132: Memory import does not call redactSecrets

POST /admin/memories/import now calls redactSecrets() on content before
BOTH the deduplication check and the INSERT. This ensures:

- Imported memories with embedded credentials cannot land unredacted in
  agent_memories (SAFE-T1201 / #838 parity with the commit_memory path).
- Dedup is performed against the redacted value so two backups with
  the same original secret both get [REDACTED:*] as their content and
  are correctly treated as duplicates.

## New tests

admin_memories_test.go: 6 tests covering redactSecrets parity on
both Export and Import endpoints.

Closes #1131.
Closes #1132.

Co-authored-by: Molecule AI Core-DevOps <core-devops@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Molecule AI Infra-Runtime-BE <infra-runtime-be@agents.moleculesai.app>
2026-04-21 00:32:00 +00:00
Hongming Wang
c1593dd328 Merge remote-tracking branch 'origin/staging' into feat/bootstrap-failed-and-console-proxy
# Conflicts:
#	workspace-server/internal/handlers/admin_memories_test.go
2026-04-20 17:31:16 -07:00
Hongming Wang
4641151b09 Merge remote-tracking branch 'origin/staging' into feat/bootstrap-failed-and-console-proxy
# Conflicts:
#	workspace-server/internal/router/router.go
2026-04-20 17:25:24 -07:00
70d47e2730 fix(security): SSRF URL validation (#1130) + redactSecrets on memory admin endpoints (#1131, #1132)
URLs returned from DB and Redis cache (db.GetCachedURL, workspaces.url column)
are now validated via validateAgentURL() before any HTTP request is made:

- mcpResolveURL (mcp.go): added validateAgentURL() calls on all three return
  paths (internal cache, Redis cache, DB fallback).
- resolveAgentURL (a2a_proxy.go): added validateAgentURL() call before
  returning agentURL to the A2A dispatcher.

validateAgentURL() was extended (registry.go) to resolve DNS hostnames and
check each returned IP against the blocklist (private ranges, loopback,
cloud-metadata 169.254.0.0/16). "localhost" is allowed by name for local dev.

GET /admin/memories/export now applies redactSecrets() to each content field
before including it in the JSON response. Pre-SAFE-T1201 memories (stored
before redactSecrets was mandatory on writes) no longer leak credentials.

POST /admin/memories/import now calls redactSecrets() on content before both
the deduplication check and the INSERT. Imported memories with embedded
credentials cannot bypass SAFE-T1201 (#838).

- admin_memories.go: GET /admin/memories/export + POST /admin/memories/import
  handler (from PR #1051, with security fixes applied).
- admin_memories_test.go: 6 tests covering redactSecrets parity on both endpoints.

- registry_test.go: added DNS-lookup test cases for validateAgentURL (F1083).
  "localhost" allowed by name (preserves existing test); nxdomain blocked.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 00:24:02 +00:00
c0a1113a6e fix(mcp): correct duplicate-line syntax and rebase redactSecrets to 2-arg
- Remove duplicate-line ExecContext call that caused syntax error at mcp.go:784
- Update redactSecrets signature from 1-arg to 2-arg (workspaceID, content)
  to match the canonical form established in PR #1017
- Update toolCommitMemory call site to use 2-arg form
- Add reserved workspaceID param note in docstring for future audit logging

Fixes PR #1036 compile-blocking issues (Platform Go job).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 00:23:40 +00:00
molecule-ai[bot]
b1433ee8e6 Merge pull request #1171 from Molecule-AI/staging
chore: fast-forward staging with main review-cleanup commits
2026-04-21 00:16:58 +00:00
molecule-ai[bot]
beb54ed61d fix: golangci-lint errors in bundle pkg + admin_memories test coverage (#1169)
CP-QA approved. golangci-lint fixes in bundle/exporter.go + bundle/importer.go, redactSecrets in admin_memories.go, plus 489-line admin_memories_test.go.
2026-04-21 00:12:30 +00:00
Hongming Wang
731a9aef6e feat(platform): bootstrap-failed + console endpoints for CP watcher
Workspaces stuck in provisioning used to sit in "starting" for 10min
until the sweeper flipped them. The real signal — a runtime crash at
EC2 boot — lands on the serial console within seconds but nothing
listened. These endpoints close the loop.

1. POST /admin/workspaces/:id/bootstrap-failed
   The control plane's bootstrap watcher posts here when it spots
   "RUNTIME CRASHED" in ec2:GetConsoleOutput. Handler:
   - UPDATEs workspaces SET status='failed' only when status was
     'provisioning' (idempotent — a raced online/failed stays put)
   - Stores the error + log_tail in last_sample_error so the canvas
     can render the real stack trace, not a generic "timeout" string
   - Broadcasts WORKSPACE_PROVISION_FAILED with source='bootstrap_watcher'

2. GET /workspaces/:id/console
   Proxies to CP's new /cp/admin/workspaces/:id/console endpoint so
   the tenant platform can surface EC2 serial console output without
   holding AWS credentials. CPProvisioner.GetConsoleOutput is the
   client; returns 501 in non-CP deployments (docker-compose dev).

Both gated by AdminAuth — CP holds the tenant ADMIN_TOKEN that the
middleware accepts on its tier 2b branch.

Tests cover: happy-path fail, already-transitioned no-op, empty id,
log_tail truncation, and the 501 fallback when no CP is wired.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 17:11:34 -07:00
molecule-ai[bot]
45f5b47487 fix(security): add USER directive before ENTRYPOINT in all tenant images (#1155)
Closes: #177 (CRITICAL — Dockerfile runs as root)

Dockerfiles changed:
- workspace-server/Dockerfile (platform-only): addgroup/adduser + USER platform
- workspace-server/Dockerfile.tenant (combined Go+Canvas): addgroup/adduser + USER canvas
  + chown canvas:canvas on canvas dir so non-root node process can read it
- canvas/Dockerfile (canvas standalone): addgroup/adduser + USER canvas
- workspace-server/entrypoint-tenant.sh: update header comment (no longer starts
  as root; both processes now start non-root)

The entrypoint no longer needs a root→non-root handoff since both the Go
platform and Canvas node run as non-root by default. The 'canvas' user owns
/app and /platform, so volume mounts owned by the host's canvas user work
without needing a root init step.

Co-authored-by: Molecule AI CP-BE <cp-be@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 23:51:33 +00:00
bf60cfd99d Merge branch 'fix/stripe-key-redaction' into staging
2026-04-20 23:46:57 +00:00
2ca403311f Merge branch 'fix/ssrf-url-validation' into staging
2026-04-20 23:46:49 +00:00
84ff572588 fix(security): close IDOR gaps on /admin/test-token and /orgs/:id/allowlist
Fixes audit #125 findings for CWE-639:

1. admin_test_token.go — CRITICAL IDOR (finding #112)
   When ADMIN_TOKEN is set in production, require it explicitly on
   GET /admin/workspaces/:id/test-token. The original gap: AdminAuth
   accepted any valid org-scoped token, letting an Org A token holder
   mint workspace bearer tokens for ANY workspace UUID they could enumerate.
   Now requires ADMIN_TOKEN when it's configured; MOLECULE_ENV!=production
   path still requires a valid bearer (any org token works for local dev).

2. org_plugin_allowlist.go — HIGH IDOR (finding #112)
   GET and PUT /orgs/:id/plugins/allowlist: add requireOrgOwnership()
   check after org existence verification. Org-token holders can only
   read/write their own org's allowlist. Session and ADMIN_TOKEN callers
   bypass the check (they have platform-wide access via the session
   cookie path, not org tokens).

Closes: #112 (CWE-639 IDOR — tenant config access)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 23:29:27 +00:00
molecule-ai[bot]
517c2f869c Merge pull request #1053 from Molecule-AI/fix/memory-backup-restore-1051
feat(platform): memory backup/restore for nuke-safe development (#1051)
2026-04-20 23:18:30 +00:00
beba599250 fix(security): SSRF defence — validate URLs before outbound A2A calls
Adds isSafeURL() + isPrivateOrMetadataIP() in mcp.go and wires the
check into:
- MCP delegate_task (sync path) — line 530
- MCP delegate_task_async (fire-and-forget) — line 602
- a2a_proxy resolveAgentURL() — line 391

Blocklist covers: RFC-1918 private (10/8, 172.16/12, 192.168/16),
cloud metadata link-local (169.254/16), carrier-grade NAT (100.64/10),
documentation ranges (192.0.2/24, 198.51.100/24, 203.0.113/24),
loopback, unspecified, and link-local multicast.

For hostnames, DNS is resolved and every returned IP is validated —
blocks internal hostnames that resolve to private ranges.

Closes: #1130 (F1083 — SSRF in A2A proxy and MCP bridge)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 23:09:11 +00:00
Hongming Wang
fc3ae5a63a chore: code-review cleanup on today's shipped PRs
Three nits identified during post-merge review of #1119, #1133:

1. ContextMenu.tsx imported `removeNode` from the canvas store but
   stopped using it when the delete-confirm flow moved to Canvas in
   #1133. Also removed the now-unused mock entry in the keyboard
   test so the test inventory matches the real call list.

2. Preflight's YAML parse failure was a silent pass — defensible since
   the in-container preflight owns the schema, but invisible to ops if
   a template ships malformed YAML. Log at WARN so the signal surfaces
   without blocking the provision.

3. formatMissingEnvError rendered its slice via %q, producing
   `["A" "B"]` which is Go-literal-looking and ugly in a user-facing
   error. Join with ", " instead. Test updated to assert the new
   format.
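
The difference in item 3 is easy to reproduce. The helper names and the message wording below are hypothetical; only the `%q` versus `strings.Join` contrast comes from the commit.

```go
package main

import (
	"fmt"
	"strings"
)

// formatQuoted renders the slice with %q, as the old code did.
func formatQuoted(missing []string) string {
	return fmt.Sprintf("missing required env vars: %q", missing)
}

// formatJoined renders the slice as a comma-separated list.
func formatJoined(missing []string) string {
	return "missing required env vars: " + strings.Join(missing, ", ")
}

func main() {
	missing := []string{"A", "B"}
	fmt.Println(formatQuoted(missing)) // missing required env vars: ["A" "B"]
	fmt.Println(formatJoined(missing)) // missing required env vars: A, B
}
```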

No behavioural changes beyond the log line; fixes are review nits, not
bug fixes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 16:04:57 -07:00
Hongming Wang
c3f7447e86 fix: harden stuck-provisioning UX — details crash, preflight, sweeper
Workspaces stuck in status='provisioning' previously surfaced in three
bad ways:

1. **Details tab crashed** with `Cannot read properties of undefined
   (reading 'toLocaleString')`. `BudgetSection` + `WorkspaceUsage`
   assumed full response shapes but a provisioning-stuck workspace
   returns partial `{}`. Guard each deep field with `?? 0` and cover
   the partial-response case with regression tests.

2. **Missing required env vars failed silently** 15+ minutes later as
   a cosmetic "Provisioning Timeout" banner. The in-container preflight
   catches them but by then the container has already crashed without
   calling /registry/register, so the workspace sat in 'provisioning'
   forever. Mirror the preflight server-side: parse config.yaml's
   `runtime_config.required_env` before launch, fail fast with a
   WORKSPACE_PROVISION_FAILED event naming the missing vars.

3. **No backend timeout** ever flipped a stuck workspace to 'failed'.
   Add a registry sweeper (10m default, env-overridable) that detects
   workspaces stuck past the window, flips them to 'failed', and emits
   WORKSPACE_PROVISION_TIMEOUT. Race-safe: the UPDATE re-checks the
   status + age predicate so a concurrent register/restart wins.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 14:51:39 -07:00
Hongming Wang
ad28e10bf4 fix(org-tokens): rate-limit mint, bound list, correct audit provenance
Addresses the Critical + Important findings from today's code
review of the org API keys feature (PRs #1105-1108).

## Critical-1: rate-limit mint endpoint

Previously POST /org/tokens had no mint-rate limit. A compromised
WorkOS session or leaked bearer could mint thousands of tokens in
seconds, forcing a painful manual cleanup of each one.

Fix: dedicated per-IP token bucket, 10 mints/hour/IP. Legitimate
bursts fit under the ceiling; abuse bounces. List + Delete stay
on the global limiter — they can't be used to generate new
secret material.

## Important-1: HTTP handler integration tests

internal/orgtoken had 9 unit tests; the HTTP layer (org_tokens.go)
had none. Adds org_tokens_test.go covering:
  - List happy path + DB error → 500
  - Create actor="admin-token" (bootstrap), actor="org-token:<prefix>"
    (chained mint), actor="session" (canvas browser path)
  - Create name>100 chars → 400
  - Create with empty body mints with no name
  - Revoke happy path 200, missing id 404, empty id 400
  - Plaintext returned in response body and prefix matches first 8 chars
  - Warning text present

A regression that breaks the tier-ordering, drops the createdBy
field, or accepts oversized names now fails at CI not prod.

## Important-2: bound List output

List() had no LIMIT, so a mint-storm bug or abuse could make the
admin UI slow to render and grow memory allocation in proportion to
the row count. Adds LIMIT 500 at the SQL layer: roughly 10x the
realistic ceiling, and a guardrail against pathological cases.

## Important-3: audit provenance uses plaintext prefix, not UUID

orgTokenActor() was logging "org-token:<first-8-of-uuid>" which
couldn't be cross-referenced with the UI (which shows first-8
of the plaintext). Users could not correlate "who minted this"
audit entries with the revoke button they're looking at.

Fix: Validate() now returns (id, prefix, error). Middleware
stashes both on the gin context. Handler reads prefix for the
actor string. Audit rows now match UI prefixes exactly.
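
As an illustration of the new return shape, here is a small in-memory sketch. The `tokenRow` type, the map-backed store, and the sample plaintext are hypothetical; the sha256 lookup, the 8-character plaintext prefix, the single ErrInvalidToken, and the "org-token:<prefix>" actor format follow what the org-token commits describe.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

var ErrInvalidToken = errors.New("invalid token")

// tokenRow is a hypothetical in-memory stand-in for an org_api_tokens row.
type tokenRow struct {
	ID      string
	Prefix  string // first 8 chars of the plaintext, as shown in the UI
	Revoked bool
}

func hashToken(plaintext string) string {
	sum := sha256.Sum256([]byte(plaintext))
	return hex.EncodeToString(sum[:])
}

// issue stores only the hash and display prefix, never the plaintext.
// This sketch assumes plaintext is at least 8 characters long.
func issue(store map[string]tokenRow, id, plaintext string) {
	store[hashToken(plaintext)] = tokenRow{ID: id, Prefix: plaintext[:8]}
}

// validate returns (id, prefix, error) and collapses every failure shape
// (empty, unknown, revoked) into ErrInvalidToken.
func validate(store map[string]tokenRow, plaintext string) (string, string, error) {
	row, ok := store[hashToken(plaintext)]
	if plaintext == "" || !ok || row.Revoked {
		return "", "", ErrInvalidToken
	}
	return row.ID, row.Prefix, nil
}

// orgTokenActor builds the audit label from the plaintext prefix.
func orgTokenActor(prefix string) string { return "org-token:" + prefix }

func main() {
	store := map[string]tokenRow{}
	issue(store, "tok-1", "abcd1234efgh5678")

	_, prefix, err := validate(store, "abcd1234efgh5678")
	fmt.Println(orgTokenActor(prefix), err)

	_, _, err = validate(store, "never-issued")
	fmt.Println(err)
}
```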

## Nit: named constants for audit labels

actorOrgTokenPrefix / actorSession / actorAdminToken replace
the hardcoded strings scattered across the handler. Greppable
across log pipelines + audit queries; one place to change if
the format evolves.

## Tests

  - internal/orgtoken: 9 existing + 0 new, all still green (updated
    signatures for Validate returning prefix).
  - internal/handlers/org_tokens_test.go: new — 9 HTTP-layer tests
    above. Full gin.Context + sqlmock harness.
  - Full `go test ./...` green except one pre-existing
    TestGitHubToken_NoTokenProvider flake unrelated to this change
    (expects 404, gets 500 — tracked separately).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 14:22:38 -07:00
Hongming Wang
3d7244ab94 feat(auth): org tokens reach /workspaces/:id/* subroutes + docs
Extends WorkspaceAuth to accept org API tokens as a valid
credential for any workspace sub-route in the org. Previously a
user minting an org token could hit admin-surface endpoints
(/workspaces, /org/import, etc.) but couldn't reach per-workspace
routes like /workspaces/:id/channels — those were gated by
WorkspaceAuth which only knew about workspace-scoped tokens.

Scope matches the explicit product spec: one org API key can
manipulate every workspace in the org. AI agents given a key can
read/write channels, tokens, schedules, secrets, tasks across all
workspaces.

## WorkspaceAuth tier order

  1. ADMIN_TOKEN exact match (break-glass / bootstrap)
  2. Org API token (Validate against org_api_tokens)           NEW
  3. Workspace-scoped token (ValidateToken with :id binding)
  4. Same-origin canvas referer

Org token tier sits above the per-workspace check so a presenter
of an org key doesn't hit the narrower ValidateToken failure path
first. Verified that the isSameOriginCanvas path is unchanged.

## End-to-end verified

Minted test token via ADMIN_TOKEN, then with that org token:
  - GET /workspaces             → 200 (list all)
  - GET /workspaces/<id>        → 200 (detail, admin-only route)
  - GET /workspaces/<id>/channels → 200 (workspace sub-route)
  - GET /workspaces/<id>/tokens   → 200 (workspace tokens list)
  - GET /workspaces/<bad-uuid>    → 404 workspace not found
                                    (routing still scoped correctly)

## Documentation

  - docs/architecture/org-api-keys.md — design, data model, threat
    model, security properties
  - docs/architecture/org-api-keys-followups.md — 10 tracked
    follow-ups prioritized (role scoping P1, per-workspace binding
    P1, expiry P2, usage metrics P2, WorkOS user_id capture P2,
    rotation webhooks P3, mint-rate limit P3, audit log P2, CLI
    P3, migrate ADMIN_TOKEN to the same table P4)
  - docs/guides/org-api-keys.md — end-user guide (mint via UI,
    use in curl/Python/TS/AI agents, session-vs-key comparison)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 14:11:45 -07:00
Hongming Wang
91187342b4 feat(auth): organization-scoped API keys for admin access
Adds user-facing API keys with full-org admin scope. Replaces the
single ADMIN_TOKEN env var with named, revocable, audited tokens
that users can mint/rotate from the canvas UI without ops
intervention.

Designed for the beta growth phase — one token tier (full admin).
Future work will split into scoped roles (admin / workspace-write
/ read-only) and per-workspace bindings. See docs/architecture/
org-api-keys.md for the design + follow-up roadmap.

## Surface

  POST   /org/tokens        mint (plaintext returned once)
  GET    /org/tokens        list live keys (prefix-only)
  DELETE /org/tokens/:id    revoke (idempotent)

All AdminAuth-gated. Bootstrap path: mint the first token via
ADMIN_TOKEN or canvas session; tokens can mint more tokens after.

## Validation as a new AdminAuth tier (2a)

AdminAuth evaluation order:
  Tier 0  lazy-bootstrap fail-open (only when no live tokens AND
          no ADMIN_TOKEN env)
  Tier 1  verified WorkOS session via /cp/auth/tenant-member
  Tier 2a org_api_tokens SELECT — NEW
  Tier 2b ADMIN_TOKEN env (bootstrap / CLI break-glass)
  Tier 3  any live workspace token (deprecated, only when ADMIN_TOKEN
          unset)

Tier 2a runs ONE indexed lookup (partial index on
token_hash WHERE revoked_at IS NULL) + an async last_used_at
bump. No measurable latency cost on the hot path.

## UI

New "Org API Keys" tab in the settings panel. Label field for
human-readable naming. Plaintext shown once + clipboard copy.
Revoke with confirm dialog. Mirrors the existing workspace-
TokensTab flow so users who've used one get the other for free.

## Security properties

  - Plaintext never stored. sha256 hash + 8-char display prefix.
  - Revocation is immediate: partial index on revoked_at IS NULL
    means the next request validates or fails in microseconds.
  - created_by audit field captures provenance: "org-token:<short>"
    when a token mints another, "session" for browser-UI mints,
    "admin-token" for the ADMIN_TOKEN bootstrap path.
  - Validate() collapses all failure shapes into ErrInvalidToken
    so response-shape can't distinguish "never existed" from
    "revoked".

## Tests

  - internal/orgtoken: 9 unit tests (hash storage, empty field
    null-ing, validation happy path, empty plaintext, unknown hash,
    revoked filtering, list ordering, revoke idempotency, has-any-
    live short-circuit).
  - AdminAuth tier-2a integration covered by existing middleware
    tests unchanged (fail-open + bearer paths).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 14:01:41 -07:00
Hongming Wang
e790153916 Merge pull request #1102 from Molecule-AI/fix/review-critical-authz-tenant-isolation
fix: close cross-tenant authz + cp_proxy admin-traversal gaps
2026-04-20 13:46:03 -07:00