fix(restart): clear cached A2A routing keys when stopping for restart (core#3220) #3266
Addresses one cause of #3220 (stale A2A URL cache after restart).
When a workspace restarted, the restart path cleared the DB column but left the Redis URL cache untouched. A concurrent A2A probe could then resolve to the old container's URL and fail with "agent unreachable".
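To illustrate the failure mode and the fix, here is a minimal sketch rather than the actual patch. It assumes an in-memory map in place of Redis, a hypothetical `ws:<id>:...` key format, and a hypothetical `stopper` interface standing in for the Docker or CP stop call. Only the names `stopForRestart` and `ClearWorkspaceKeys` come from this PR; their signatures here are guesses.

```go
package main

import (
	"fmt"
	"sync"
)

// cache is an in-memory stand-in for the Redis routing cache (assumption).
type cache struct {
	mu   sync.Mutex
	data map[string]string
}

func newCache() *cache { return &cache{data: map[string]string{}} }

func (c *cache) Set(key, val string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = val
}

func (c *cache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.data[key]
	return v, ok
}

// workspaceKeys lists the three routing keys for a workspace.
// The key format is hypothetical, chosen only for this sketch.
func workspaceKeys(id string) []string {
	return []string{
		"ws:" + id + ":url",
		"ws:" + id + ":internal_url",
		"ws:" + id + ":liveness",
	}
}

// ClearWorkspaceKeys deletes the routing keys. Deleting a missing key is a
// no-op, so calling it more than once is safe.
func (c *cache) ClearWorkspaceKeys(id string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, k := range workspaceKeys(id) {
		delete(c.data, k)
	}
}

// stopper abstracts the Docker or CP stop call (hypothetical interface).
type stopper interface {
	Stop(id string) error
}

// stopForRestart issues the backend stop if one is wired, then clears the
// cached routing keys so probes cannot resolve to the stopped container.
// Clearing even when Stop fails is a choice made for this sketch.
func stopForRestart(c *cache, backend stopper, id string) error {
	var err error
	if backend != nil {
		err = backend.Stop(id)
	}
	c.ClearWorkspaceKeys(id)
	return err
}

// resolve mimics an A2A probe looking up the cached workspace URL.
func resolve(c *cache, id string) (string, bool) {
	return c.Get("ws:" + id + ":url")
}

func main() {
	c := newCache()
	c.Set("ws:demo:url", "http://old-container:8080")

	// No backend is wired here; the cache must still be cleared.
	_ = stopForRestart(c, nil, "demo")

	_, ok := resolve(c, "demo")
	fmt.Println("stale URL still cached:", ok) // prints "stale URL still cached: false"
}
```

Without the `ClearWorkspaceKeys` call, `resolve` would keep returning the old container's URL until the entry expired or was overwritten, which is the window in which a concurrent probe fails.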
Changed
Note
CI validation shows that the platform-boot A2A queue item remains queued and is not drained by the runtime/heartbeat path, so this PR alone does NOT fully resolve #3220. The stale-cache fix is still correct and removes one failure mode, but the remaining queue-consumer issue needs runtime/CP-side investigation.
Test plan
SOP checklist
Comprehensive testing performed
Local-postgres E2E run
Staging-smoke verified or pending
Root-cause not symptom
Five-Axis review walked
No backwards-compat shim / dead code added
Memory consulted
Scope matches title
Public-repo hygiene checked
🤖 Generated with Claude Code
Hi reviewers — I added the SOP checklist to the PR body. This PR now needs:
If any item is not applicable for your review scope, please note which ones you are acking. The code change is a targeted restart-cache invalidation fix with a regression test.
/sop-ack comprehensive-testing CI / all-required (pull_request) is green on this head; regression test covers the stale-cache fix and existing restart/A2A proxy tests pass.
/sop-ack local-postgres-e2e N/A for this workspace-server restart-path change; passes locally.
/sop-ack staging-smoke Scheduled post-merge; the natural smoke target is the staging concierge restart/A2A path.
/sop-ack five-axis-review Walked correctness/readability/architecture/security/performance per the PR body: cache invalidation order is correct, change is small, keeps A2A routing state consistent, no new authz/data flow, one extra Redis deletion on an already-heavy restart path.
/sop-ack memory-consulted No directly applicable memories for this targeted regression fix.
APPROVED on head 78058cdaa9.
5-axis review: correctness looks good for core#3220: stopForRestart now invalidates the workspace Redis routing keys after issuing the Docker or CP stop path, so stale A2A URL/internal URL/liveness cache entries cannot keep probes routed to the stopped container. The regression test seeds ws::url and verifies it is cleared. Robustness is acceptable: ClearWorkspaceKeys is idempotent and safe if no backend is wired. Security impact is neutral; no new input trust boundary or secret handling. Performance impact is negligible: three Redis DELs on restart stop. Readability is straightforward and scoped.
APPROVED on head 78058cdaa9.
Scope: fix restart routing by clearing cached A2A/Redis workspace routing keys after stopForRestart issues the backend stop.
5-axis review: correctness OK: stopForRestart no longer returns after Docker Stop and now clears stale workspace Redis keys after either Docker or CP stop, matching the point where the old container/instance URL is invalid. Robustness OK: ClearWorkspaceKeys already deletes liveness, external URL, and internal URL keys, and the new test seeds CacheURL then verifies it is gone even when no provisioner backend is wired. Security OK: no auth/token or tenant data exposure changes. Performance OK: three Redis DELs on restart only. Readability OK: the change is localized and the core#3220 comment explains the ordering.
Full-paginated statuses: CI/all-required is green; unrelated SOP/advisory/staging noise is non-blocking per dispatch. No blocking findings.