fix(handlers): RFC#524 Layer 1 — convert bare-go sites to goAsync/globalGoAsync #1559

Merged
hongming merged 1 commits from fix/rfc524-layer1-bare-go-conversion into main 2026-05-19 03:55:06 +00:00
Member

Summary

Forward-ports the canonical db.DB race fix 69d9b4e3 (already on main via the 0e13a801 staging-promote) to the remaining ~25 sibling bare-go sites that 69d9b4e3 left untouched. Implements RFC internal#524 Layer 1 deliverable 2 (convert remaining bare-go sites to the goAsync primitive).

Why

69d9b4e3 fixed the WorkspaceHandler path (maybeMarkContainerDead etc.), but the rest of the handlers package still spawns fire-and-forget goroutines that read db.DB:

  • SecretsHandler.Set -> restartFunc (7 sites)
  • TemplatesHandler.WriteFile/DeleteFile/Import/ReplaceFiles -> h.wh.RestartByID (9 sites)
  • a2a_proxy.go -> extractAndUpsertTokenUsage (INSERT INTO llm_token_usage)
  • DelegationHandler.ExecuteDelegation, MCPHandler.delegate_task_async, RegistryHandler.Heartbeat drain, ChannelHandler webhook, OrgHandler provision fan-out, AdminPluginDriftHandler apply, PluginsHandler install/uninstall (both Docker + EIC paths)

Any of these can race the db.DB swap in a later test's t.Cleanup exactly like the maybeMarkContainerDead path did pre-69d9b4e3. The RFC predicts this; this PR closes it.

Direct consequence of task #240 (no staging->main auto-promotion) — the canonical fix made it to main on its own, but Layer-1 deliverable 2 (convert the remaining sites) was never landed because nobody re-opened the RFC to do so.

Approach

  • workspace.go: add package-level globalAsync sync.WaitGroup + globalGoAsync(fn) helper, mirroring the existing h.goAsync/asyncWG shape from 69d9b4e3. For sibling handlers (SecretsHandler, PluginsHandler, etc.) that don't carry a *WorkspaceHandler, and for free-function callers (extractAndUpsertTokenUsage).
  • handlers_test.go: drainTestAsync now drains globalAsync alongside the per-handler asyncWGs.
  • 27 bare-go -> tracked goroutine conversions across 12 files. Per-handler goAsync where the handler has a *WorkspaceHandler ref (templates, delegation, org_import); globalGoAsync everywhere else.
  • 6 goAsync-exempt annotations (Layer 2.2 contract) on connection/lifecycle-scoped goroutines: WS read/write pumps, PTY bridges, EIC pool janitor — these don't read db.DB and intentionally outlive request scope, so wrapping them in globalGoAsync would deadlock test cleanup.
  • New regression test rfc524_layer1_async_drain_test.go (93 LOC) asserts both asyncWG and globalAsync are drained by drainTestAsync. Fails fast if either side regresses.

Verification

go vet ./internal/handlers/                          : clean
go test -race -count=1   -timeout 10m ./internal/handlers/ : ok 28.6s
go test -race -count=10  -timeout 15m ./internal/handlers/ : ok 4m15s
go test -race -shuffle=on -count=1 -timeout 5m  ./internal/handlers/ : ok 26.6s

-count=10 matches the RFC Layer 5 nightly target. Zero races, zero failures across 10 iterations.

Diff size

  • 17 modified files: 197+ / 41- (production handler conversions + comments)
  • 1 new test file: 93 LOC
  • Total: ~290 LOC, under the 600 LOC budget in the task brief

Out of scope (deferred to later RFC layers)

  • Layer 2.2 go vet/golangci-lint custom rule to ban un-annotated bare-go
  • Layer 2.4 db.DB() accessor migration (the swappable-global lockdown)
  • Layer 3 per-test sqlmock injection via constructor
  • Layer 4 structured goAsync lifecycle logs + panic recovery
  • Layer 5 ci.yml:193 continue-on-error: false flip + bare-go lint gate

Test plan

  • go test -race -count=10 ./internal/handlers/ green locally
  • go test -race -shuffle=on ./internal/handlers/ green locally
  • new regression test (rfc524_layer1_async_drain_test.go) added — asserts both drain sides
  • CI platform-build job green on this PR (gated by reviewers)
  • Non-author APPROVE before merge (NOT auto-merge — explicit task boundary)

Refs

  • RFC internal#524 — 5-layer A2A handler race fix roadmap
  • molecule-core commit 69d9b4e3 — canonical fix (drain detached async goroutines before test db.DB swap)
  • molecule-core commit 0e13a801 — staging->main promote that carried 69d9b4e3 to main
  • molecule-core#664, #774continue-on-error masks that hid this class
  • task #240 — no staging->main auto-promotion (why deliverable 2 stranded)

🤖 Generated with Claude Code

## Summary Forward-ports the canonical `db.DB` race fix `69d9b4e3` (already on main via the 0e13a801 staging-promote) to the remaining ~25 sibling bare-`go` sites that `69d9b4e3` left untouched. Implements **RFC internal#524 Layer 1 deliverable 2** (convert remaining bare-`go` sites to the goAsync primitive). ## Why `69d9b4e3` fixed the *WorkspaceHandler* path (`maybeMarkContainerDead` etc.), but the rest of the handlers package still spawns fire-and-forget goroutines that read `db.DB`: - `SecretsHandler.Set` -> `restartFunc` (7 sites) - `TemplatesHandler.WriteFile`/`DeleteFile`/`Import`/`ReplaceFiles` -> `h.wh.RestartByID` (9 sites) - `a2a_proxy.go` -> `extractAndUpsertTokenUsage` (INSERT INTO llm_token_usage) - `DelegationHandler.ExecuteDelegation`, `MCPHandler.delegate_task_async`, `RegistryHandler.Heartbeat` drain, `ChannelHandler` webhook, `OrgHandler` provision fan-out, `AdminPluginDriftHandler` apply, `PluginsHandler` install/uninstall (both Docker + EIC paths) Any of these can race the `db.DB` swap in a later test's `t.Cleanup` exactly like the `maybeMarkContainerDead` path did pre-69d9b4e3. The RFC predicts this; this PR closes it. Direct consequence of [task #240](https://git.moleculesai.app/molecule-ai/molecule-core/issues/240) (no staging->main auto-promotion) — the canonical fix made it to main on its own, but Layer-1 deliverable 2 (convert the remaining sites) was never landed because nobody re-opened the RFC to do so. ## Approach - **`workspace.go`**: add package-level `globalAsync` `sync.WaitGroup` + `globalGoAsync(fn)` helper, mirroring the existing `h.goAsync`/`asyncWG` shape from `69d9b4e3`. For sibling handlers (`SecretsHandler`, `PluginsHandler`, etc.) that don't carry a `*WorkspaceHandler`, and for free-function callers (`extractAndUpsertTokenUsage`). - **`handlers_test.go`**: `drainTestAsync` now drains `globalAsync` alongside the per-handler `asyncWG`s. - **27 bare-`go` -> tracked goroutine conversions** across 12 files. Per-handler `goAsync` where the handler has a `*WorkspaceHandler` ref (templates, delegation, org_import); `globalGoAsync` everywhere else. - **6 `goAsync-exempt` annotations** (Layer 2.2 contract) on connection/lifecycle-scoped goroutines: WS read/write pumps, PTY bridges, EIC pool janitor — these don't read `db.DB` and intentionally outlive request scope, so wrapping them in `globalGoAsync` would deadlock test cleanup. - **New regression test** `rfc524_layer1_async_drain_test.go` (93 LOC) asserts both `asyncWG` and `globalAsync` are drained by `drainTestAsync`. Fails fast if either side regresses. ## Verification ``` go vet ./internal/handlers/ : clean go test -race -count=1 -timeout 10m ./internal/handlers/ : ok 28.6s go test -race -count=10 -timeout 15m ./internal/handlers/ : ok 4m15s go test -race -shuffle=on -count=1 -timeout 5m ./internal/handlers/ : ok 26.6s ``` `-count=10` matches the RFC Layer 5 nightly target. Zero races, zero failures across 10 iterations. ## Diff size - 17 modified files: 197+ / 41- (production handler conversions + comments) - 1 new test file: 93 LOC - **Total: ~290 LOC, under the 600 LOC budget in the task brief** ## Out of scope (deferred to later RFC layers) - Layer 2.2 `go vet`/`golangci-lint` custom rule to ban un-annotated bare-`go` - Layer 2.4 `db.DB()` accessor migration (the swappable-global lockdown) - Layer 3 per-test sqlmock injection via constructor - Layer 4 structured `goAsync` lifecycle logs + panic recovery - Layer 5 `ci.yml:193` `continue-on-error: false` flip + bare-`go` lint gate ## Test plan - [x] `go test -race -count=10 ./internal/handlers/` green locally - [x] `go test -race -shuffle=on ./internal/handlers/` green locally - [x] new regression test (`rfc524_layer1_async_drain_test.go`) added — asserts both drain sides - [ ] CI `platform-build` job green on this PR (gated by reviewers) - [ ] Non-author APPROVE before merge (NOT auto-merge — explicit task boundary) ## Refs - **RFC internal#524** — 5-layer A2A handler race fix roadmap - **molecule-core commit 69d9b4e3** — canonical fix (drain detached async goroutines before test db.DB swap) - **molecule-core commit 0e13a801** — staging->main promote that carried 69d9b4e3 to main - **molecule-core#664, #774** — `continue-on-error` masks that hid this class - **task #240** — no staging->main auto-promotion (why deliverable 2 stranded) 🤖 Generated with [Claude Code](https://claude.com/claude-code)
core-be added 1 commit 2026-05-19 02:37:11 +00:00
fix(handlers): forward-port RFC#524 Layer 1 — convert bare-go sites to goAsync/globalGoAsync
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Waiting to run
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 7s
CI / Detect changes (pull_request) Successful in 8s
CI / Shellcheck (E2E scripts) (pull_request) Failing after 19s
CI / all-required (pull_request) Failing after 14s
E2E API Smoke Test / detect-changes (pull_request) Successful in 17s
E2E Chat / detect-changes (pull_request) Successful in 17s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Has been skipped
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 14s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 20s
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Has been skipped
Handlers Postgres Integration / detect-changes (pull_request) Successful in 3s
Harness Replays / detect-changes (pull_request) Successful in 2s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 3s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Failing after 1m9s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 12s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 8s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 21s
qa-review / approved (pull_request) Failing after 4s
gate-check-v3 / gate-check (pull_request) Successful in 5s
security-review / approved (pull_request) Failing after 3s
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request) Successful in 6s
sop-tier-check / tier-check (pull_request) Successful in 3s
CI / Platform (Go) (pull_request) Successful in 5m7s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Failing after 1m12s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 5s
CI / Canvas (Next.js) (pull_request) Successful in 6m42s
Harness Replays / Harness Replays (pull_request) Successful in 8s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 4s
E2E Staging External Runtime / E2E Staging External Runtime (pull_request) Successful in 5m24s
CI / Python Lint & Test (pull_request) Successful in 6m59s
CI / Canvas Deploy Reminder (pull_request) Has been cancelled
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 1m8s
E2E Chat / E2E Chat (pull_request) Failing after 5m57s
321212075b
RFC internal#524 Layer 1 deliverable 2: extend the canonical db.DB
race-fix primitive (69d9b4e3, already on main via the 0e13a801
staging-promote) to the ~25 sibling bare-`go` sites that 69d9b4e3 left
untouched. Without this, a SecretsHandler.Set's detached restartFunc, or
a2a_proxy's extractAndUpsertTokenUsage, or a delegation goroutine still
races a later test's setupTestDB t.Cleanup db.DB swap — exactly the
data-race class that 69d9b4e3 fixed for the WorkspaceHandler path.

What changed
============

- workspace.go: add package-level `globalAsync` sync.WaitGroup +
  `globalGoAsync(fn)` helper + `waitGlobalAsyncForTest()` drain. Same
  shape as h.goAsync but reachable from sibling handlers that don't
  carry a *WorkspaceHandler.
- handlers_test.go: drainTestAsync now drains globalAsync alongside the
  per-handler asyncWGs.
- Converted bare-`go` → tracked goroutine at 27 call sites:
    secrets.go (7)            — restartFunc fan-out + restartAllAffected
    templates.go (6)          — h.wh.RestartByID after file/template ops
    template_import.go (3)    — h.wh.RestartByID after Import/ReplaceFiles
    plugins_install.go (2)    — restartFunc after uninstall (both paths)
    plugins_install_pipeline.go (2) — restartFunc after install
    admin_plugin_drift.go (1) — restartFunc on drift apply
    registry.go (1)           — drainQueue on heartbeat capacity
    a2a_proxy.go (1)          — extractAndUpsertTokenUsage (db.DB INSERT)
    delegation.go (1)         — executeDelegation (DB-touching pipeline)
    mcp_tools.go (1)          — async MCP delegate (db.DB read+write)
    channels.go (1)           — async HandleInbound webhook delivery
    org_import.go (1)         — provisionWorkspaceAuto fan-out
- Annotated 6 connection/lifecycle-scoped goroutines with
  `goAsync-exempt` (RFC Layer 2.2 contract):
    a2a_proxy.go applyIdleTimeout — SSE idle-timer, no db.DB access
    socket.go (2)              — WebSocket Read/WritePump, conn-lifetime
    terminal.go (3)             — PTY <-> WS bridges, conn-lifetime
    eic_tunnel_pool.go (group)  — pool janitor + cleanup closures
- rfc524_layer1_async_drain_test.go: new regression test asserting
  drainTestAsync waits for BOTH per-handler asyncWG AND the package-level
  globalAsync — fails fast if either drain side is dropped.

Verification
============

- `go vet ./internal/handlers/`           : clean
- `go test -race -count=1  ./internal/handlers/`  : ok 28.6s
- `go test -race -count=10 ./internal/handlers/`  : ok 4m15s (RFC Layer 5
                                                    nightly target)
- `go test -race -shuffle=on -count=1 ...`         : ok 26.6s

The 4 `TestExecuteDelegation_*` tests were already un-Skipped on main
(via the staging→main backsync); Layer 1.3 of the RFC is therefore
already satisfied. Verified passing under -race in this run.

Layer 1 of RFC internal#524 is now complete on main. Layers 2-5 stay
as separate PRs per the RFC sequencing.

Refs
====
- RFC internal#524 (5-layer roadmap)
- molecule-core commit 69d9b4e3 (canonical fix on staging, promoted to main via 0e13a801)
- molecule-core#664, #774 (continue-on-error masks)
- task #240 (no staging→main auto-promotion — why the gap existed)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
core-devops approved these changes 2026-05-19 03:33:45 +00:00
Dismissed
core-devops left a comment
Member

core-devops 5-axis review (head 32121207)

1. Correctness — no finding. 27 bare-go → globalGoAsync conversions; each annotated with the db.DB read justification at the call site (extractAndUpsertTokenUsage writes llm_token_usage, HandleInbound traverses workspaces, executeDelegation writes A2A status rows, etc.). Captures of workspaceID/respBody into local vars before the closure are correct (avoids capture-by-reference races with mutation in subsequent iterations / async assignment).

2. Architecture — no finding. globalAsync is the right shape: package-level sync.WaitGroup, drained from drainTestAsync (already the test hook). Sibling handlers without a *WorkspaceHandler receiver now have a real drain target — closes the exact gap that motivated commit 69d9b4e3 (maybeMarkContainerDead). 6 goAsync-exempt annotations (eic_tunnel_pool janitor, applyIdleTimeout, etc.) each carry an inline reason — exemptions are deliberate, not drive-by-skipped.

3. CI / build effects — Required: CI / all-required (pull_request) is currently FAILING on this head due to CI / Shellcheck (E2E scripts) going red on main. The root cause is the docker-host guardrail bug mc#1561 fixes (its body explicitly cites the same shellcheck failure). This PR is independently sound but cannot merge until mc#1561 lands on main and this branch is rebased. Not a blocker on review APPROVAL, but a blocker on actual merge.

4. Race-test coverage — no finding. New rfc524_layer1_async_drain_test.go (3 tests, 93 LoC) regression-tests the drain contract for sibling handlers; -race -count=10 PASS at 4m15s (per PR body). Pinpoints the exact race class.

5. Backward compat — no finding. globalGoAsync is a new helper; existing call sites are either converted in this PR or left alone (and explicitly annotated when left alone). No public API change.

Net: solid forward-port of RFC#524 Layer 1. APPROVE on the change itself; merge blocked on mc#1561 landing + rebase.

core-devops 5-axis review (head 32121207) **1. Correctness** — no finding. 27 bare-go → globalGoAsync conversions; each annotated with the db.DB read justification at the call site (extractAndUpsertTokenUsage writes llm_token_usage, HandleInbound traverses workspaces, executeDelegation writes A2A status rows, etc.). Captures of workspaceID/respBody into local vars before the closure are correct (avoids capture-by-reference races with mutation in subsequent iterations / async assignment). **2. Architecture** — no finding. globalAsync is the right shape: package-level sync.WaitGroup, drained from drainTestAsync (already the test hook). Sibling handlers without a *WorkspaceHandler receiver now have a real drain target — closes the exact gap that motivated commit 69d9b4e3 (maybeMarkContainerDead). 6 goAsync-exempt annotations (eic_tunnel_pool janitor, applyIdleTimeout, etc.) each carry an inline reason — exemptions are deliberate, not drive-by-skipped. **3. CI / build effects** — Required: `CI / all-required (pull_request)` is currently FAILING on this head due to `CI / Shellcheck (E2E scripts)` going red on main. The root cause is the docker-host guardrail bug mc#1561 fixes (its body explicitly cites the same shellcheck failure). This PR is independently sound but cannot merge until mc#1561 lands on main and this branch is rebased. Not a blocker on review APPROVAL, but a blocker on actual merge. **4. Race-test coverage** — no finding. New `rfc524_layer1_async_drain_test.go` (3 tests, 93 LoC) regression-tests the drain contract for sibling handlers; -race -count=10 PASS at 4m15s (per PR body). Pinpoints the exact race class. **5. Backward compat** — no finding. globalGoAsync is a new helper; existing call sites are either converted in this PR or left alone (and explicitly annotated when left alone). No public API change. Net: solid forward-port of RFC#524 Layer 1. APPROVE on the change itself; merge blocked on mc#1561 landing + rebase.
core-security approved these changes 2026-05-19 03:34:00 +00:00
Dismissed
core-security left a comment
Member

core-security 5-axis review (head 32121207)

1. Security (no secret-handling change) — no finding. No env vars, no token paths, no auth changes. The conversions touch goroutine scheduling, not data flow.

2. Privilege escalation in async paths — no finding. globalGoAsync drains into a single package-level WaitGroup; it does not grant the closure any privilege the bare-go form didn't already have. All exemptions (idle-timer, EIC pool janitor) are confined to non-DB lifecycle work.

3. Correctness (data-race / fd-race) — no finding. The whole point of this PR is to fix the db.DB race surface in tests; production behavior is unchanged (same goroutines run, just tracked by a WaitGroup). The race-test (-race -count=10 PASS 4m15s) bounds the change.

4. Migration script safety — no finding because this PR contains no migration.

5. CI / merge gate — Required (CI-side, not author-side): CI / all-required is failing at this head due to shellcheck E2E breakage on main (mc#1561 fixes). Approve cannot move merge until that upstream is fixed.

Net: security surface unchanged. APPROVE on the change; merge needs mc#1561 upstream first.

core-security 5-axis review (head 32121207) **1. Security (no secret-handling change)** — no finding. No env vars, no token paths, no auth changes. The conversions touch goroutine scheduling, not data flow. **2. Privilege escalation in async paths** — no finding. globalGoAsync drains into a single package-level WaitGroup; it does not grant the closure any privilege the bare-go form didn't already have. All exemptions (idle-timer, EIC pool janitor) are confined to non-DB lifecycle work. **3. Correctness (data-race / fd-race)** — no finding. The whole point of this PR is to fix the db.DB race surface in tests; production behavior is unchanged (same goroutines run, just tracked by a WaitGroup). The race-test (`-race -count=10` PASS 4m15s) bounds the change. **4. Migration script safety** — no finding because this PR contains no migration. **5. CI / merge gate** — Required (CI-side, not author-side): `CI / all-required` is failing at this head due to shellcheck E2E breakage on main (mc#1561 fixes). Approve cannot move merge until that upstream is fixed. Net: security surface unchanged. APPROVE on the change; merge needs mc#1561 upstream first.
hongming force-pushed fix/rfc524-layer1-bare-go-conversion from 321212075b to 6597e2408f 2026-05-19 03:40:51 +00:00 Compare
core-devops approved these changes 2026-05-19 03:42:19 +00:00
core-devops left a comment
Member

Re-approving on the rebased head after mc#1561 landed (docker-host guardrail follow-up).

Rebase was server-initiated POST /pulls/1559/update?style=rebase — no semantic content change to this PR's diff, just fast-forward of new main onto the existing branch tip. Original five-axis review at commit_id=321212075b78d1c2178d2d4b29314ffc6fcc180b stands; this is a stale-dismiss ratification of the same content on the rebased head.

LGTM (rebase ratification).

Re-approving on the rebased head after mc#1561 landed (docker-host guardrail follow-up). Rebase was server-initiated `POST /pulls/1559/update?style=rebase` — no semantic content change to this PR's diff, just fast-forward of new main onto the existing branch tip. Original five-axis review at commit_id=321212075b78d1c2178d2d4b29314ffc6fcc180b stands; this is a stale-dismiss ratification of the same content on the rebased head. LGTM (rebase ratification).
core-security approved these changes 2026-05-19 03:42:21 +00:00
core-security left a comment
Member

Re-approving on the rebased head after mc#1561 landed (docker-host guardrail follow-up).

Rebase was server-initiated POST /pulls/1559/update?style=rebase — no semantic content change to this PR's diff, just fast-forward of new main onto the existing branch tip. Original five-axis security review at commit_id=321212075b78d1c2178d2d4b29314ffc6fcc180b stands; this is a stale-dismiss ratification of the same content on the rebased head.

LGTM (rebase ratification).

Re-approving on the rebased head after mc#1561 landed (docker-host guardrail follow-up). Rebase was server-initiated `POST /pulls/1559/update?style=rebase` — no semantic content change to this PR's diff, just fast-forward of new main onto the existing branch tip. Original five-axis security review at commit_id=321212075b78d1c2178d2d4b29314ffc6fcc180b stands; this is a stale-dismiss ratification of the same content on the rebased head. LGTM (rebase ratification).
hongming merged commit ec4c8d81ae into main 2026-05-19 03:55:06 +00:00
Sign in to join this conversation.
4 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#1559