fix(handlers): RFC#524 Layer 1 — convert bare-go sites to goAsync/globalGoAsync #1559
Reference in New Issue
Block a user
Delete Branch "fix/rfc524-layer1-bare-go-conversion"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Forward-ports the canonical
db.DBrace fix69d9b4e3(already on main via the0e13a801staging-promote) to the remaining ~25 sibling bare-gosites that69d9b4e3left untouched. Implements RFC internal#524 Layer 1 deliverable 2 (convert remaining bare-gosites to the goAsync primitive).Why
69d9b4e3fixed the WorkspaceHandler path (maybeMarkContainerDeadetc.), but the rest of the handlers package still spawns fire-and-forget goroutines that readdb.DB:SecretsHandler.Set->restartFunc(7 sites)TemplatesHandler.WriteFile/DeleteFile/Import/ReplaceFiles->h.wh.RestartByID(9 sites)a2a_proxy.go->extractAndUpsertTokenUsage(INSERT INTO llm_token_usage)DelegationHandler.ExecuteDelegation,MCPHandler.delegate_task_async,RegistryHandler.Heartbeatdrain,ChannelHandlerwebhook,OrgHandlerprovision fan-out,AdminPluginDriftHandlerapply,PluginsHandlerinstall/uninstall (both Docker + EIC paths)Any of these can race the
db.DBswap in a later test'st.Cleanupexactly like themaybeMarkContainerDeadpath did pre-69d9b4e3. The RFC predicts this; this PR closes it.Direct consequence of task #240 (no staging->main auto-promotion) — the canonical fix made it to main on its own, but Layer-1 deliverable 2 (convert the remaining sites) was never landed because nobody re-opened the RFC to do so.
Approach
workspace.go: add package-levelglobalAsyncsync.WaitGroup+globalGoAsync(fn)helper, mirroring the existingh.goAsync/asyncWGshape from69d9b4e3. For sibling handlers (SecretsHandler,PluginsHandler, etc.) that don't carry a*WorkspaceHandler, and for free-function callers (extractAndUpsertTokenUsage).handlers_test.go:drainTestAsyncnow drainsglobalAsyncalongside the per-handlerasyncWGs.go-> tracked goroutine conversions across 12 files. Per-handlergoAsyncwhere the handler has a*WorkspaceHandlerref (templates, delegation, org_import);globalGoAsynceverywhere else.goAsync-exemptannotations (Layer 2.2 contract) on connection/lifecycle-scoped goroutines: WS read/write pumps, PTY bridges, EIC pool janitor — these don't readdb.DBand intentionally outlive request scope, so wrapping them inglobalGoAsyncwould deadlock test cleanup.rfc524_layer1_async_drain_test.go(93 LOC) asserts bothasyncWGandglobalAsyncare drained bydrainTestAsync. Fails fast if either side regresses.Verification
-count=10matches the RFC Layer 5 nightly target. Zero races, zero failures across 10 iterations.Diff size
Out of scope (deferred to later RFC layers)
go vet/golangci-lintcustom rule to ban un-annotated bare-godb.DB()accessor migration (the swappable-global lockdown)goAsynclifecycle logs + panic recoveryci.yml:193continue-on-error: falseflip + bare-golint gateTest plan
go test -race -count=10 ./internal/handlers/green locallygo test -race -shuffle=on ./internal/handlers/green locallyrfc524_layer1_async_drain_test.go) added — asserts both drain sidesplatform-buildjob green on this PR (gated by reviewers)Refs
69d9b4e3— canonical fix (drain detached async goroutines before test db.DB swap)0e13a801— staging->main promote that carried69d9b4e3to maincontinue-on-errormasks that hid this class🤖 Generated with Claude Code
core-devops 5-axis review (head
32121207)1. Correctness — no finding. 27 bare-go → globalGoAsync conversions; each annotated with the db.DB read justification at the call site (extractAndUpsertTokenUsage writes llm_token_usage, HandleInbound traverses workspaces, executeDelegation writes A2A status rows, etc.). Captures of workspaceID/respBody into local vars before the closure are correct (avoids capture-by-reference races with mutation in subsequent iterations / async assignment).
2. Architecture — no finding. globalAsync is the right shape: package-level sync.WaitGroup, drained from drainTestAsync (already the test hook). Sibling handlers without a *WorkspaceHandler receiver now have a real drain target — closes the exact gap that motivated commit
69d9b4e3(maybeMarkContainerDead). 6 goAsync-exempt annotations (eic_tunnel_pool janitor, applyIdleTimeout, etc.) each carry an inline reason — exemptions are deliberate, not drive-by-skipped.3. CI / build effects — Required:
CI / all-required (pull_request)is currently FAILING on this head due toCI / Shellcheck (E2E scripts)going red on main. The root cause is the docker-host guardrail bug mc#1561 fixes (its body explicitly cites the same shellcheck failure). This PR is independently sound but cannot merge until mc#1561 lands on main and this branch is rebased. Not a blocker on review APPROVAL, but a blocker on actual merge.4. Race-test coverage — no finding. New
rfc524_layer1_async_drain_test.go(3 tests, 93 LoC) regression-tests the drain contract for sibling handlers; -race -count=10 PASS at 4m15s (per PR body). Pinpoints the exact race class.5. Backward compat — no finding. globalGoAsync is a new helper; existing call sites are either converted in this PR or left alone (and explicitly annotated when left alone). No public API change.
Net: solid forward-port of RFC#524 Layer 1. APPROVE on the change itself; merge blocked on mc#1561 landing + rebase.
core-security 5-axis review (head
32121207)1. Security (no secret-handling change) — no finding. No env vars, no token paths, no auth changes. The conversions touch goroutine scheduling, not data flow.
2. Privilege escalation in async paths — no finding. globalGoAsync drains into a single package-level WaitGroup; it does not grant the closure any privilege the bare-go form didn't already have. All exemptions (idle-timer, EIC pool janitor) are confined to non-DB lifecycle work.
3. Correctness (data-race / fd-race) — no finding. The whole point of this PR is to fix the db.DB race surface in tests; production behavior is unchanged (same goroutines run, just tracked by a WaitGroup). The race-test (
-race -count=10PASS 4m15s) bounds the change.4. Migration script safety — no finding because this PR contains no migration.
5. CI / merge gate — Required (CI-side, not author-side):
CI / all-requiredis failing at this head due to shellcheck E2E breakage on main (mc#1561 fixes). Approve cannot move merge until that upstream is fixed.Net: security surface unchanged. APPROVE on the change; merge needs mc#1561 upstream first.
321212075bto6597e2408fRe-approving on the rebased head after mc#1561 landed (docker-host guardrail follow-up).
Rebase was server-initiated
POST /pulls/1559/update?style=rebase— no semantic content change to this PR's diff, just fast-forward of new main onto the existing branch tip. Original five-axis review at commit_id=321212075b78d1c2178d2d4b29314ffc6fcc180b stands; this is a stale-dismiss ratification of the same content on the rebased head.LGTM (rebase ratification).
Re-approving on the rebased head after mc#1561 landed (docker-host guardrail follow-up).
Rebase was server-initiated
POST /pulls/1559/update?style=rebase— no semantic content change to this PR's diff, just fast-forward of new main onto the existing branch tip. Original five-axis security review at commit_id=321212075b78d1c2178d2d4b29314ffc6fcc180b stands; this is a stale-dismiss ratification of the same content on the rebased head.LGTM (rebase ratification).