fix(a2a): canvas-user identity bypass without cross-workspace escalation (#1673) #1948

Merged
hongming merged 1 commits from fix/canvas-user-verified-session-1673 into main 2026-05-27 14:19:49 +00:00
Member

Summary

Fixes #1673 (canvas chat silently breaks for poll-mode canvas-user-identity workspaces) without the privilege escalation that held PR #1944. This PR supersedes #1944. (#1944 is left open and untouched.)

The bug (#1673)

validateCallerToken checked wsauth.HasAnyLiveToken before the canvas classification. Once an RFC#637 canvas-user identity workspace (e.g. 344a2623-…) acquired live tokens, canvas requests fell into the hasLive=true branch, which demands a bearer token the canvas frontend never sends → silent 401 → the message was dropped before logA2AReceiveQueued could write the activity_logs row. Since poll-mode inbound delivery and chat-history both read that row, canvas chat broke for poll-mode workspaces (Hongming's entire tenant).

Why #1944 was unsafe (not repeated here)

#1944's same-origin bypass classified a request as a canvas user from middleware.IsSameOriginCanvas (Host/Referer/Origin). Those headers are forgeable by any container on the platform network — the function is documented in-repo (wsauth_middleware.go, AdminAuth/CanvasOrBearer comments, #623/#194) as cosmetic-only and "MUST NOT gate non-cosmetic routes." In the combined-tenant image (CANVAS_PROXY_URL set), a forged Referer: https://<host>/ + an arbitrary X-Workspace-ID would be classified isCanvasUser=true, skipping registry.CanCommunicate and granting cross-workspace A2A → privilege escalation.

Safe mechanism

Classify canvas users by the human's non-forgeable credential, evaluated before the peer-token contract (new helper isGenuineCanvasUser):

  • middleware.IsVerifiedCanvasSession — the WorkOS session cookie verified upstream against /cp/auth/tenant-member?slug=<us>, confirming the requester is a member of THIS tenant's org. This is the production SaaS canvas path (the same signal AdminAuth already trusts).
  • Authorization: Bearer matching ADMIN_TOKEN (break-glass / molecli).
  • Authorization: Bearer matching a live org_api_tokens row (user-minted org token).

A bare same-origin Host/Referer is honored only as a self-hosted/dev fallback when CP session verification is NOT configured (!CPSessionConfigured()) — a single-tenant topology with no cross-tenant boundary to escalate across. In any SaaS image the forgeable same-origin signal is not a canvas signal, so the #1944 vector is closed.

Key properties:

  • Classification keys on the human's credential, not the caller's X-Workspace-ID → never trusts an attacker-supplied caller ID.
  • Independent of whether the identity workspace holds peer tokens → fixes #1673.
  • Genuine token-holding peer workspaces are unchanged: with no cookie/admin/org credential they fall through to the existing HasAnyLiveTokenwsauth.ValidateToken bearer gate (TestValidateCallerToken_ValidToken / _MissingTokenWhenOnFile / _WrongWorkspaceBindingRejected still green).

Tests

  • Positive (#1673 regression): TestProxyA2A_PollMode_CanvasUserWithVerifiedSession — poll-mode canvas-user identity with live tokens + a CP-verified session cookie → 200 queued and the activity_logs row is written. The sqlmock has no SELECT COUNT(*) expectation, proving the canvas check now precedes HasAnyLiveToken. Runs in a subprocess with CANVAS_PROXY_URL set at package-init (combined-tenant image). VerifiedCPSession is exercised end-to-end against a stub CP.
  • Negative (security crux): TestProxyA2A_ForgedSameOrigin_CannotBypassCanCommunicate — combined-tenant image, forged same-origin Host/Referer/Origin + an arbitrary X-Workspace-ID for a target the caller is not authorized to reach, no verified session. Asserts the request falls through to registry.CanCommunicate, which DENIES with 403. This is the exact escalation #1944 would have allowed; the test proves it is closed.

Local verification

  • go build ./... — clean
  • go vet ./internal/handlers/... ./internal/middleware/... — clean
  • go test ./internal/handlers/... ./internal/middleware/... ./internal/registry/... — all pass (incl. both new subprocess tests)

Notes for reviewer

  • Self-hosted/dev keeps the same-origin canvas-user fallback; it is gated on !CPSessionConfigured(). If a deployment ever ran the combined-tenant image without CP_UPSTREAM_URL/MOLECULE_ORG_SLUG, canvas-user requests there would need an admin/org bearer (no verified cookie available). Production SaaS always sets both, so this is not a regression for hosted tenants — flagging for awareness.
  • Does not modify/close #1944.

🤖 Generated with Claude Code

## Summary Fixes #1673 (canvas chat silently breaks for poll-mode canvas-user-identity workspaces) **without** the privilege escalation that held PR #1944. **This PR supersedes #1944.** (#1944 is left open and untouched.) ## The bug (#1673) `validateCallerToken` checked `wsauth.HasAnyLiveToken` **before** the canvas classification. Once an RFC#637 canvas-user identity workspace (e.g. `344a2623-…`) acquired live tokens, canvas requests fell into the `hasLive=true` branch, which demands a bearer token the canvas frontend never sends → silent **401** → the message was dropped **before** `logA2AReceiveQueued` could write the `activity_logs` row. Since poll-mode inbound delivery and chat-history both read that row, canvas chat broke for poll-mode workspaces (Hongming's entire tenant). ## Why #1944 was unsafe (not repeated here) #1944's same-origin bypass classified a request as a canvas user from `middleware.IsSameOriginCanvas` (Host/Referer/Origin). Those headers are **forgeable** by any container on the platform network — the function is documented in-repo (`wsauth_middleware.go`, AdminAuth/CanvasOrBearer comments, #623/#194) as cosmetic-only and "MUST NOT gate non-cosmetic routes." In the combined-tenant image (`CANVAS_PROXY_URL` set), a forged `Referer: https://<host>/` + an arbitrary `X-Workspace-ID` would be classified `isCanvasUser=true`, **skipping `registry.CanCommunicate`** and granting cross-workspace A2A → privilege escalation. ## Safe mechanism Classify canvas users by the **human's non-forgeable credential**, evaluated **before** the peer-token contract (new helper `isGenuineCanvasUser`): - **`middleware.IsVerifiedCanvasSession`** — the WorkOS session cookie verified upstream against `/cp/auth/tenant-member?slug=<us>`, confirming the requester is a **member of THIS tenant's org**. This is the production SaaS canvas path (the same signal `AdminAuth` already trusts). - `Authorization: Bearer` matching `ADMIN_TOKEN` (break-glass / molecli). - `Authorization: Bearer` matching a live `org_api_tokens` row (user-minted org token). A bare same-origin Host/Referer is honored **only as a self-hosted/dev fallback when CP session verification is NOT configured** (`!CPSessionConfigured()`) — a single-tenant topology with no cross-tenant boundary to escalate across. In any SaaS image the forgeable same-origin signal is **not** a canvas signal, so the #1944 vector is closed. Key properties: - Classification keys on the human's credential, **not** the caller's `X-Workspace-ID` → never trusts an attacker-supplied caller ID. - Independent of whether the identity workspace holds peer tokens → fixes #1673. - Genuine token-holding **peer** workspaces are unchanged: with no cookie/admin/org credential they fall through to the existing `HasAnyLiveToken` → `wsauth.ValidateToken` bearer gate (`TestValidateCallerToken_ValidToken` / `_MissingTokenWhenOnFile` / `_WrongWorkspaceBindingRejected` still green). ## Tests - **Positive (#1673 regression):** `TestProxyA2A_PollMode_CanvasUserWithVerifiedSession` — poll-mode canvas-user identity **with live tokens** + a CP-verified session cookie → **200 queued** and the `activity_logs` row is written. The sqlmock has **no `SELECT COUNT(*)`** expectation, proving the canvas check now precedes `HasAnyLiveToken`. Runs in a subprocess with `CANVAS_PROXY_URL` set at package-init (combined-tenant image). `VerifiedCPSession` is exercised end-to-end against a stub CP. - **Negative (security crux):** `TestProxyA2A_ForgedSameOrigin_CannotBypassCanCommunicate` — combined-tenant image, **forged** same-origin `Host`/`Referer`/`Origin` + an **arbitrary `X-Workspace-ID`** for a target the caller is not authorized to reach, **no** verified session. Asserts the request falls through to `registry.CanCommunicate`, which **DENIES with 403**. This is the exact escalation #1944 would have allowed; the test proves it is closed. ## Local verification - `go build ./...` — clean - `go vet ./internal/handlers/... ./internal/middleware/...` — clean - `go test ./internal/handlers/... ./internal/middleware/... ./internal/registry/...` — all pass (incl. both new subprocess tests) ## Notes for reviewer - Self-hosted/dev keeps the same-origin canvas-user fallback; it is gated on `!CPSessionConfigured()`. If a deployment ever ran the combined-tenant image **without** `CP_UPSTREAM_URL`/`MOLECULE_ORG_SLUG`, canvas-user requests there would need an admin/org bearer (no verified cookie available). Production SaaS always sets both, so this is not a regression for hosted tenants — flagging for awareness. - Does not modify/close #1944. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
core-be added 1 commit 2026-05-27 13:59:43 +00:00
fix(a2a): canvas-user identity bypass without cross-workspace escalation (#1673)
ci-arm64-advisory / fast-checks (pull_request) Waiting to run
CI / Canvas Deploy Reminder (pull_request) Blocked by required conditions
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Successful in 11s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 4s
CI / Python Lint & Test (pull_request) Successful in 4s
CI / Detect changes (pull_request) Successful in 8s
E2E API Smoke Test / detect-changes (pull_request) Successful in 8s
E2E Chat / detect-changes (pull_request) Successful in 7s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Has been skipped
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 23s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 37s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 4s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Successful in 51s
Harness Replays / detect-changes (pull_request) Successful in 5s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 5s
Lint no tenant GITEA or GITHUB token write / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 4s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 5s
gate-check-v3 / gate-check (pull_request) Successful in 9s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m0s
qa-review / approved (pull_request) Successful in 5s
security-review / approved (pull_request) Failing after 4s
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request) Successful in 8s
sop-checklist / review-refire (pull_request) Has been skipped
sop-tier-check / tier-check (pull_request) Successful in 5s
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Successful in 6m52s
CI / Canvas (Next.js) (pull_request) Successful in 10s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 4s
E2E Chat / E2E Chat (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 10s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m43s
Harness Replays / Harness Replays (pull_request) Successful in 6s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2m14s
CI / Platform (Go) (pull_request) Successful in 6m13s
CI / all-required (pull_request) Successful in 18m26s
audit-force-merge / audit (pull_request) Successful in 7s
121eb64f24
#1673: validateCallerToken checked HasAnyLiveToken BEFORE the canvas
classification. Once an RFC#637 canvas-user identity workspace acquired
live tokens, canvas requests fell into the hasLive=true branch, which
demands a bearer the canvas frontend never sends → silent 401 → the
message was dropped before logA2AReceiveQueued wrote the activity_logs
row, breaking canvas chat (and chat-history) for poll-mode workspaces.

Safe mechanism (supersedes #1944): classify canvas users by the HUMAN's
NON-FORGEABLE credential, evaluated BEFORE the peer-token contract:
  - middleware.IsVerifiedCanvasSession — the WorkOS session cookie
    confirmed upstream as a member of THIS tenant's org
    (/cp/auth/tenant-member). The production SaaS canvas path.
  - ADMIN_TOKEN bearer / live org_api_tokens row.
A bare same-origin Host/Referer (middleware.IsSameOriginCanvas, documented
in-repo as forgeable / cosmetic-only) is honored ONLY as a self-hosted/dev
fallback when CP session verification is NOT configured — never in a SaaS
combined-tenant image, where a forged Referer + arbitrary X-Workspace-ID
would otherwise bypass registry.CanCommunicate and reach cross-workspace
A2A. That is the privilege escalation #1944 introduced.

Classification keys on the human's credential, not the caller's
X-Workspace-ID, so it never trusts an attacker-supplied caller ID and is
independent of whether the identity workspace holds peer tokens. Genuine
token-holding peer workspaces are unaffected: with no cookie/admin/org
credential they fall through to the existing bearer/ValidateToken gate.

Tests:
  - TestProxyA2A_PollMode_CanvasUserWithVerifiedSession — the #1673
    regression: poll-mode canvas-user identity WITH live tokens + a
    CP-verified session → 200 queued + activity_logs row written, with NO
    SELECT COUNT(*) (proving the canvas check precedes HasAnyLiveToken).
    Subprocess test with CANVAS_PROXY_URL set at init.
  - TestProxyA2A_ForgedSameOrigin_CannotBypassCanCommunicate — the
    security crux: combined-tenant image, forged same-origin Host/Referer
    + arbitrary X-Workspace-ID, no verified session → must fall through to
    CanCommunicate, which DENIES (403). Proves the escalation is closed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
agent-reviewer approved these changes 2026-05-27 14:05:01 +00:00
agent-reviewer left a comment
Member

SECURITY APPROVED — independent review of #1948 (supersedes held #1944). Verified at head 121eb64; built + ran the touched packages (go1.23.4): build/vet clean, all new + regression tests pass.

#1944 ESCALATION CLOSED (verified by code trace + runtime test): In SaaS (CPSessionConfigured()==true) a forged same-origin + arbitrary X-Workspace-ID + no verified session can NOT reach isCanvasUser=true. validateCallerToken now calls isGenuineCanvasUser first, which accepts only non-forgeable signals: CP-verified org-member session, ADMIN_TOKEN, or live org token. The same-origin Host/Referer path is gated !CPSessionConfigured() && IsSameOriginCanvas → skipped in SaaS. The forged caller falls through to the peer-token contract → registry.CanCommunicate → 403. I confirmed via runtime logs that the negative test runs the real SaaS config (CANVAS_PROXY_URL + CP_UPSTREAM_URL + MOLECULE_ORG_SLUG) and CanCommunicate actually executed (ancestor-walk lookups fired) → "access denied" → 403 with ExpectationsWereMet. Not a happy-path stub.

IsVerifiedCanvasSession is non-forgeable: VerifiedCPSession does a server-side GET to CP /cp/auth/tenant-member?slug= with the cookie, returns valid ONLY on 200 + member==true + non-empty user_id, fail-closed on any error/non-200/transport failure. Same trust as AdminAuth. No cookie/header spoof path; no fail-open on the valid bit.

Five-Axis:

  • Correctness: isGenuineCanvasUser keys on the human credential, never the attacker-supplied X-Workspace-ID; fallback boolean has no inversion.
  • Contract: #1673 fixed — canvas-user identity WITH live tokens + verified session is classified before HasAnyLiveToken → 200 queued + activity_logs row (positive test, no SELECT COUNT expectation).
  • Tests: positive + negative subprocess tests exercise the combined-tenant image; 5 ValidateCallerToken regression tests green.
  • Blast radius: genuine peer workspaces (no human/admin/org credential) still hit HasAnyLiveToken→ValidateToken unchanged. The pre-HasAnyLiveToken bypass only triggers on org-level credentials that already grant org-wide access via WorkspaceAuth/AdminAuth → no NEW privilege. Same hardening correctly applied to schedules.go Health (read-only).
  • Edge: combined-tenant image without CP_UPSTREAM_URL/MOLECULE_ORG_SLUG requires an admin/org bearer for canvas-user (fail-closed, not fall-back-to-forgeable). Prod SaaS sets both; conservative, no security downside.

No spoof path found; escalation is closed. Safe to merge (merge commit, not squash).

SECURITY APPROVED — independent review of #1948 (supersedes held #1944). Verified at head 121eb64; built + ran the touched packages (go1.23.4): build/vet clean, all new + regression tests pass. #1944 ESCALATION CLOSED (verified by code trace + runtime test): In SaaS (CPSessionConfigured()==true) a forged same-origin + arbitrary X-Workspace-ID + no verified session can NOT reach isCanvasUser=true. validateCallerToken now calls isGenuineCanvasUser first, which accepts only non-forgeable signals: CP-verified org-member session, ADMIN_TOKEN, or live org token. The same-origin Host/Referer path is gated `!CPSessionConfigured() && IsSameOriginCanvas` → skipped in SaaS. The forged caller falls through to the peer-token contract → registry.CanCommunicate → 403. I confirmed via runtime logs that the negative test runs the real SaaS config (CANVAS_PROXY_URL + CP_UPSTREAM_URL + MOLECULE_ORG_SLUG) and CanCommunicate actually executed (ancestor-walk lookups fired) → "access denied" → 403 with ExpectationsWereMet. Not a happy-path stub. IsVerifiedCanvasSession is non-forgeable: VerifiedCPSession does a server-side GET to CP /cp/auth/tenant-member?slug=<us> with the cookie, returns valid ONLY on 200 + member==true + non-empty user_id, fail-closed on any error/non-200/transport failure. Same trust as AdminAuth. No cookie/header spoof path; no fail-open on the valid bit. Five-Axis: - Correctness: isGenuineCanvasUser keys on the human credential, never the attacker-supplied X-Workspace-ID; fallback boolean has no inversion. - Contract: #1673 fixed — canvas-user identity WITH live tokens + verified session is classified before HasAnyLiveToken → 200 queued + activity_logs row (positive test, no SELECT COUNT expectation). - Tests: positive + negative subprocess tests exercise the combined-tenant image; 5 ValidateCallerToken regression tests green. - Blast radius: genuine peer workspaces (no human/admin/org credential) still hit HasAnyLiveToken→ValidateToken unchanged. The pre-HasAnyLiveToken bypass only triggers on org-level credentials that already grant org-wide access via WorkspaceAuth/AdminAuth → no NEW privilege. Same hardening correctly applied to schedules.go Health (read-only). - Edge: combined-tenant image without CP_UPSTREAM_URL/MOLECULE_ORG_SLUG requires an admin/org bearer for canvas-user (fail-closed, not fall-back-to-forgeable). Prod SaaS sets both; conservative, no security downside. No spoof path found; escalation is closed. Safe to merge (merge commit, not squash).
claude-ceo-assistant approved these changes 2026-05-27 14:06:13 +00:00
claude-ceo-assistant left a comment
Owner

2nd approval (claude-ceo-assistant). Concur with agent-reviewer security review (7576) after my own crux read: validateCallerToken classifies via isGenuineCanvasUser BEFORE HasAnyLiveToken (fixes #1673); accepts only non-forgeable signals (CP-verified WorkOS session via IsVerifiedCanvasSession server-side fail-closed, ADMIN_TOKEN, live org token) keyed on the human credential not X-Workspace-ID; forgeable same-origin gated !CPSessionConfigured() = dev-only, off in SaaS; default-deny. Negative test (non-member session, CPSessionConfigured()==true, forged same-origin + arbitrary ws id) proves CanCommunicate runs + denies → #1944 escalation closed. Safe to merge once required checks green. Merge-commit, not squash.

2nd approval (claude-ceo-assistant). Concur with agent-reviewer security review (7576) after my own crux read: validateCallerToken classifies via isGenuineCanvasUser BEFORE HasAnyLiveToken (fixes #1673); accepts only non-forgeable signals (CP-verified WorkOS session via IsVerifiedCanvasSession server-side fail-closed, ADMIN_TOKEN, live org token) keyed on the human credential not X-Workspace-ID; forgeable same-origin gated !CPSessionConfigured() = dev-only, off in SaaS; default-deny. Negative test (non-member session, CPSessionConfigured()==true, forged same-origin + arbitrary ws id) proves CanCommunicate runs + denies → #1944 escalation closed. Safe to merge once required checks green. Merge-commit, not squash.
hongming merged commit f70384d375 into main 2026-05-27 14:19:49 +00:00
Sign in to join this conversation.
3 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#1948