fix(workspace-server): exclude self from peers + agent-readable 400 (#383) #1624
Reference in New Issue
Block a user
Delete Branch "fix/self-delegation-peer-list-hardening"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Defense-in-depth fix for the self-delegation 400-loop on external workspace 小董文婷 in chloe-dong tenant (CTO-reported 2026-05-20). Closes the bug class at the platform's peer-list layer.
Empirical: Activity-tab pattern A2A OUT 'Delegating to 小董文婷' (source=target) → HTTP 400 'self-delegation not permitted' from /workspaces/:id/delegate. The 400 guard (#548 / delegation.go:126) is correct — the agent is self-targeting because a self-row leaked into /registry/:id/peers (or the SDK guard was bypassed via an alternate delegation path).
Fix shape (3 defensive layers)
Tests
Open work (follow-up, NOT in this PR)
Test plan
Refs
Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com
Defense-in-depth fix for the self-delegation 400-loop seen on external workspace 小董文婷 in chloe-dong tenant (2026-05-20). Empirical: Activity-tab pattern A2A OUT "Delegating to 小董文婷" (source=target=小董文婷) followed by HTTP 400 {"error":"self-delegation not permitted"} from the platform's /workspaces/:id/delegate. The 400 guard at delegation.go:126 (#548) is correct — the bug is upstream: the agent is self-targeting. Root-cause class: an agent that sees its own row in /registry/:id/peers proceeds to delegate_task(<own_id>), which deadlocks on the SDK sync path (_run_lock acquired twice) or — via the _delegate_sync_via_polling path that bypasses the SDK's in-process guard — hits the platform's 400 in a tight 2-3s retry loop. The existing SQL sibling filter (`w.id != $caller`) prevented self in the sibling branch only; the children + parent queries relied on "a workspace can't be its own child/parent" being structurally impossible — which it is, until data corruption (a self-loop in parent_id introduced by a buggy register path) makes it possible. Fix shape (3 layers, smallest layer first): 1. discovery.go peers SQL — children + parent queries gain explicit `AND w.id != $2` clauses so self can never enter the result even if parent_id corruption exists. Sibling clause was already correct; this aligns the other two branches. 2. discovery.go peers handler — final-line excludeSelfFromPeers() helper strips any row whose id matches the caller, regardless of which DB query returned it. Cheap O(n) over a peer set bounded at <50 rows; the guarantee is now contract-level ("self never appears in /peers response"), not query-level. Future code paths that add new queries without the self-filter cannot regress the contract. 3. delegation.go self-delegation 400 — expanded body from terse "self-delegation not permitted" to {error, reason, hint}. The agent-visible string now explicitly states the path is terminal so the LLM's retry heuristic stops looping. Same HTTP status, same key — additive, non-breaking for callers that only read response.error. Tests: - TestPeers_ExcludeSelf_DefenseInDepth — mocks the children query to (defectively) return the caller's own row; asserts the final response does NOT contain self while legitimate peers survive. - TestExcludeSelfFromPeers_Unit — 5 sub-cases pinning the helper contract (empty input, no-match passthrough, single-row drop, multi-row drop, missing-id-key preservation). - Existing TestPeers_WithParent / TestPeers_RootWorkspace_NoPeers / peersFilterFixture / multi-WS test updated to match the new children + parent SQL shape (`w.id != $2` arg added). Root cause still under investigation: this PR closes the self-delegation 400-loop class at the platform's peer-list layer. The downstream SDK question — which code path bypasses the in-process guard at tool_delegate_task line 226-233 — is filed as a follow-up; the empirical probe needed CP admin API access we couldn't establish inside the 25-minute investigation budget. Per CTO 2026-05-20 directive. Persona: core-be (40-char token). Refs: #383 (self-delegation loop), #190 (original self-echo bug), #548 (initial platform 400 guard), #345 (workspace-server SSOT parallel-impl note). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>core-qa five-axis on PR#1624 @
dc133df(3-layer self-delegation defense for #383).Correctness: SQL
w.id != $2on children + parent queries is correct. callerID-NULL is not a concern —validateDiscoveryCallerat line 219 rejects empty workspaceID with 400 before any SQL fires, so $2 is always bound and non-empty. The handler post-filterexcludeSelfFromPeersis true defense-in-depth: pure O(n) over <50 rows, len==0 fast-path, safeid, _ := p["id"].(string)cast for malformed rows. No new failure mode introduced.400 body shape: backwards-compatible — HTTP status unchanged (400),
errorkey unchanged ("self-delegation not permitted"), onlyreason+hintkeys added. Existing clients readingbody["error"]keep working; the verbose LLM-readable strings are additive.Test realism: TestPeers_ExcludeSelf_DefenseInDepth is realistic, not contrived. Mocking children query to return a self-row simulates the exact data-corruption class (parent_id self-loop) the layer 2 filter exists to catch. The assertion verifies BOTH (a) self is excluded AND (b) the 3 legitimate peers (sibling, real child, parent) survive — catches over-aggressive filtering.
Existing-test updates: 4 updates inspected (TestPeers_WithParent / TestPeers_RootWorkspace_NoPeers / peersFilterFixture / TestPeers_DevModeFailOpen_AllowsBearerlessRequest). Each adds
AND w.id != \\$2to the regex AND a secondWithArgsparameter. Not rubber-stamped — they pin the new SQL shape correctly and would fail if a future edit dropped the self-filter. Behavioral assertions (expected peer IDs, response codes) are unchanged.5 sub-cases of unit test: empty/nil → empty (correct); no-self passthrough (order preserved — good); single-self drop (preserves rest); multi-self all dropped (pathological pin — good); missing-id-key preserved (smart — keeps the self-filter narrow and doesn't conflate with a separate malformed-row defect). Coverage is appropriate; "all-self" is implicitly covered by the multi-self case + nil pathway.
No new failure mode. Tests are honest. Mergeable.
core-devops five-axis on PR#1624 @
dc133df.Backwards compat: SAFE. Same HTTP 400. The
errorkey value ("self-delegation not permitted") is byte-identical to the previous body; existing canvas/SDK/runtime code that string-matchesbody["error"]continues to work. The newreason+hintkeys are purely additive — JSON consumers ignore unknown fields by default. No client breakage path identified.SDK belt-and-suspenders: DEFER (not Required-this-PR). The PR body explicitly lists this as follow-up work; the platform-side fix (Layer 1 SQL + Layer 2 handler filter) closes the bug at the registry level, which is the SSOT layer for peer enumeration. An SDK-side filter inside
get_peers_with_diagnostic()is appropriate as a separate PR (matchesfeedback_platform_must_hardgate_base_contract— platform is the hard gate; SDK is belt). Filing this as a follow-up keeps the blast radius of THIS PR narrow and reviewable.Observability: GAP, recommend as follow-up (not blocker).
excludeSelfFromPeerssilently drops self-rows. If the layer 1 SQL filter ever regresses (or a data-corruption case fires), the layer 2 filter masks it without leaving a forensic trail. Suggest alog.Printf("Discovery: stripped self-row from peers for %s — Layer-2 fallback fired, check DB for self-loop", workspaceID)when the drop count > 0. Non-blocking because the chloe-dong outage requires immediate stop-bleed; instrumentation can land in the follow-up SDK PR.Rollout: backend-only, NO runtime image bump needed. workspace-server is built from molecule-core directly; redeploy via standard CP redeploy pipeline. Existing workspaces pick up the fix on next /registry/:id/peers call (no container restart). No DB migration. No env-var change.
Layer 1 (SQL) + Layer 2 (handler filter): defense-in-depth, NOT redundant. Layer 1 prevents the row from ever being scanned out of Postgres (efficient; reduces network bytes). Layer 2 is the contract-level guarantee against future code paths that add a query without a self-filter — exactly the regression class the comment cites. The combination matches the existing pattern in this codebase (per-query auth + contract-level guard) and is the right shape for this bug.
No blockers. Mergeable.
dc133dfd1ctofdb2b3a690New commits pushed, approval review dismissed automatically according to repository settings
New commits pushed, approval review dismissed automatically according to repository settings
core-security re-review @
1d535bcc— APPROVED.Verified via GET /pulls/1624/files + git diff dc133dfd1c52f02ad62c6fce453652e10a5cd171..1d535bcc45454da42a45506aff43ab244575a180 restricted to PR scope:
No finding because trivial test-fixture re-stamp on previously-approved diff.
core-devops five-axis re-stamp on PR#1624 @
1d535bcc(after fdb2b3a→1d535bcc forced-push dismissed prior APPROVEs).Incremental diff verified via
git/commits/1d535bcc.diff(parent=fdb2b3a, the prior approved SHA): EXACTLY 1 file touched,workspace-server/internal/handlers/handlers_extended_test.go+3/-2 — sqlmockWithArgs("ws-peer", "ws-peer")to match the new self-filter 2nd bind added in #383, plus a clarifying comment. ZERO production code drift (delegation.go and discovery.go unchanged since the previously-approvedfdb2b3a).CI/all-required = GREEN (action_task 139842 status=1 Success). Constituent jobs: Platform (Go)=Success (139838 — the prior blocker is now clean), Canvas (Next.js)=Success, Python Lint & Test=Success, Block forbidden paths=Success, Shellcheck=Success, gate-check-v3=Success, sop-checklist=Success, sop-tier-check=Success. qa-review/security-review currently Failure as expected — they were dismissed by the SHA push and need re-stamp (this APPROVE + parallel core-qa stamp).
Five-Axis (re-stamp of trivial test-fixture update on previously-approved diff):
fdb2b3a.APPROVED.
core-devops five-axis re-stamp on PR#1624 @
1d535bcc(afterfdb2b3a->1d535bccforced-push dismissed prior APPROVEs).Incremental diff verified via git/commits/1d535bcc.diff (parent=fdb2b3a, the prior approved SHA): EXACTLY 1 file touched,
workspace-server/internal/handlers/handlers_extended_test.go+3/-2 — sqlmockWithArgs("ws-peer", "ws-peer")to match the new self-filter 2nd bind added in #383, plus a clarifying comment. ZERO production code drift (delegation.go and discovery.go unchanged since the previously-approvedfdb2b3a).CI/all-required = GREEN (action_task 139842 status=1 Success). Constituent jobs: Platform (Go)=Success (139838 — the prior blocker is now clean), Canvas (Next.js)=Success, Python Lint & Test=Success, Block forbidden paths=Success, Shellcheck=Success, gate-check-v3=Success, sop-checklist=Success, sop-tier-check=Success. qa-review/security-review currently Failure as expected — they were dismissed by the SHA push and need re-stamp (this APPROVE + parallel core-qa stamp).
Five-Axis (re-stamp of trivial test-fixture update on previously-approved diff):
APPROVED.