fix(channels): check rows.Err after iteration in List and Webhook #2034

Closed
core-be wants to merge 19 commits from fix/channels-rows-err-check into staging
Member
No description provided.
core-be added 21 commits 2026-06-01 03:36:01 +00:00
The deprovision path marks `workspaces.status='removed'` BEFORE calling
the controlplane DELETE. If that CP call fails (transient 5xx, network
hiccup, AWS provider error), the DB row stays at 'removed' with
`instance_id` populated and there's no retry — the EC2 lives forever.
9 prod orphans accumulated over 3 days under this bug.

Adds a SaaS-mode counterpart to the existing Docker `orphan_sweeper`:
- 60s tick (matches the Docker sweeper cadence)
- LIMIT 100 per cycle so a sustained CP outage drains over multiple
  cycles without blowing the request timeout
- Re-issues `cpProv.Stop` for any workspace at status='removed' with a
  non-NULL `instance_id`. Stop is idempotent (AWS terminate on
  already-terminated is a no-op; CP's Deprovision tolerates already-
  deleted DNS) so retries are safe.
- On Stop success, NULLs `instance_id` so the next cycle skips the row.
- On Stop failure, leaves `instance_id` populated for next cycle.

The existing Docker sweeper is gated on `prov != nil`; the new sweeper
is gated on `cpProv != nil`. SaaS tenants get exactly one of the two,
self-hosted tenants get the Docker one — no overlap.

Why this shape over option A (CP-first ordering) or B (durable outbox):
the existing inline path already returns a loud 500 to the user when
CP fails — the only missing piece is automatic retry, which a 60s
sweeper provides without protocol changes, new tables, or new workers.
~30 LOC of production code vs. ~400 for an outbox. RFC discussion in
#2989 comment chain.

Tests:
- 9 unit tests covering happy path, Stop failure, UPDATE failure,
  multiple orphans (one-fails-others-still-process), DB query error,
  nil-DB defense, nil-reaper short-circuit, and the boot-immediate-then-
  tick cadence contract.
- Mutation-tested: status='running' substitution and removed-UPDATE-
  block both fail at least one test.

Out of scope:
- Backfilling the 9 named orphans — they'll heal automatically on the
  first sweep cycle after this lands; no manual cleanup needed.
- Long-term durable-outbox architecture — separate RFC.
- PR #3029: approve with nit (CP orphan sweeper + registry prefix).
- PR #3033: approve with blocker — README undercounts runtimes by 1
  because #3029 adds codex but #3033 claims 8 runtimes and omits it.
- All tests pass locally.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replaces http.DefaultClient in MCP bridge (mcp_tools.go) and CP config
fetch (cp_config.go) with dedicated clients that have transport-level
timeouts. Eliminates a fragility class where global mutable client +
missing context timeout could hang forever on dead workspaces or slow CP.

- mcpHTTPClient: 5s dial, 30s response header, 5s TLS handshake
- cp_config fetch: 10s client timeout matching context deadline

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
updateDelegationStatus was using context.Background() internally,
discarding any caller deadline. This meant status updates could hang
indefinitely if the DB was slow, and cancellation signals from HTTP
request contexts were ignored.

Change signature to accept context.Context and update all call sites.
No behavioral change for happy path; fixes context-timeout hygiene.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two production code paths used context-less sql.DB methods:

1. buildProvisionerConfig (workspace_provision.go:263) called db.DB.QueryRow
   instead of QueryRowContext(ctx, ...) — losing request cancellation.

2. queryPeerMaps (discovery.go:307) called db.DB.Query instead of
   QueryContext(ctx, ...) and did not accept a context parameter. Updated
   signature and all 4 call sites in Peers handler to pass ctx.

No behavioral change for happy path; fixes context-timeout hygiene so
slow queries can be cancelled when the HTTP client disconnects.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
workspace.go BeginTx had explicit Rollback() on every error path but
no deferred safety net. A panic (or future refactor adding an early
return) would leak the tx hold. Add defer tx.Rollback() immediately
after BeginTx — harmless no-op after Commit, critical safety net on
unexpected exits.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two QueryContext calls in restart_context used explicit rows.Close()
after the loop. If a panic or early return occurred during iteration,
the underlying connection would leak back to the pool with an active
result set. Switch to defer rows.Close() inside each if-block for
 RAII-style cleanup.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The async goroutine that processes inbound webhooks had no recover().
A panic inside HandleInbound (e.g., from the A2A proxy or adapter layer)
would crash the entire platform process. Add defer recover() with
workspace-scoped logging so the goroutine exits cleanly and the process
stays alive.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Four fire-and-forget goroutines in a2a_proxy_helpers.go had no recover():
logA2AFailure, logA2ASuccess, logA2AReceiveQueued, and last_outbound_at
update. A panic inside LogActivity or db.DB.ExecContext would crash the
platform process. Add defer recover() with workspace-scoped logging to
each so the goroutine exits cleanly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add panic recovery to goroutines that are NOT wrapped by supervised.RunWithRecover:
- session_auth.go: init() cache sweeper (process-level background task)
- ratelimit.go + mcp_ratelimit.go: bucket cleanup tickers
- bundle/importer.go: fire-and-forget provision start during import
- plugins_install.go: delayed restart after plugin uninstall
- terminal.go: WebSocket bridge goroutines (stdout, PTY, stdin)
- a2a_proxy.go: SSE idle watcher

A panic in any of these would previously crash the process or terminate
a subsystem silently. Now they log and swallow the panic, preserving
process stability.
Prevents bounded timer accumulation during the 30s workspace-online wait.
Each time.After created a timer that couldn't be GC'd until it fired;
with ~60 iterations this was minor but unnecessary.
Prevents timer accumulation in long-running loops:
- restart_context.go: poll loop uses NewTicker instead of time.After
- supervised.go: backoff sleep uses NewTimer with proper Stop
- telegram.go: poll loop uses NewTimer for both retryAfter and poll interval

Each time.After created a timer that couldn't be GC'd until it fired.
In long-running goroutines this caused bounded but unnecessary memory
growth.
Add error checking + logging to previously ignored db.ExecContext and
broadcaster.RecordAndBroadcast calls in bundle importer. Improves
observability when provision failures or URL updates fail silently.
Apply unchecked-error fixes from errcheck audit (#1062):
- channels/manager.go: log db.ExecContext and broadcaster errors in HandleInbound/SendOutbound; return json.Unmarshal errors in loadChannel
- channels/telegram.go: log bot.Send errors for callback ack and edit message
- handlers/approvals.go: log broadcaster, db.QueryRowContext, and db.ExecContext errors

Improves observability when secondary operations fail silently.
The `|| true` suffix on `go vet` and `golangci-lint` commands caused
the CI job to pass even when lint errors were present, masking real
issues and defeating the purpose of static analysis.

Remove the suppression so lint failures correctly fail the build.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The List handler previously ignored rows.Err(), which meant a broken
connection or cursor error mid-stream would return a partial 200 OK
instead of a 500. Add the check so interrupted result sets surface as
errors rather than silent data loss.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
List and ListByWorkspace ignored rows.Err(), so a broken DB cursor
mid-stream would return a partial 200 OK instead of 500. Add the check
to both handlers so interrupted result sets surface as errors.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Create handler called EncryptSensitiveFields twice in a row,
causing the config to be double-encrypted. The second call was a
regression from a bad merge or copy-paste. Remove the duplicate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
fix(channels): check rows.Err after iteration in List and Webhook
E2E API Smoke Test / E2E API Smoke Test (pull_request) Blocked by required conditions
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Blocked by required conditions
Harness Replays / Harness Replays (pull_request) Blocked by required conditions
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Blocked by required conditions
branch-protection drift check / Branch protection drift (pull_request) Successful in 7s
cascade-list-drift-gate / check (pull_request) Successful in 6s
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 5s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 21s
Check migration collisions / Migration version collision check (pull_request) Successful in 9s
CI / Detect changes (pull_request) Successful in 8s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Failing after 4s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Failing after 6s
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Failing after 7s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 7s
E2E API Smoke Test / detect-changes (pull_request) Successful in 8s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 5s
Harness Replays / detect-changes (pull_request) Successful in 5s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 10s
qa-review / approved (pull_request) Successful in 6s
Secret scan / Scan diff for credential-shaped strings (pull_request) Failing after 31s
sop-checklist / na-declarations (pull_request) N/A: (none)
security-review / approved (pull_request) Successful in 6s
sop-checklist / review-refire (pull_request) Has been skipped
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 33s
Runtime Pin Compatibility / PyPI-latest install + import smoke (pull_request) Successful in 1m33s
gate-check-v3 / gate-check (pull_request) Successful in 9s
sop-checklist / all-items-acked (pull_request) Successful in 7s
sop-tier-check / tier-check (pull_request) Successful in 6s
CI / Platform (Go) (pull_request) Failing after 1m11s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 15s
CI / Canvas (Next.js) (pull_request) Failing after 1m55s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 2s
CI / Python Lint & Test (pull_request) Failing after 7m9s
1c8343a09d
List and Webhook handlers ignored rows.Err(), so a broken DB cursor
mid-stream would return a partial 200 OK instead of 500. Add the check
to both handlers so interrupted result sets surface as errors.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
core-be changed target branch from main to staging 2026-06-01 03:39:55 +00:00
core-be force-pushed fix/channels-rows-err-check from 1c8343a09d to e7ecf6203a 2026-06-01 04:23:22 +00:00 Compare
core-be added 1 commit 2026-06-01 12:41:05 +00:00
ci: remove unused canvasUserMessage type to fix lint on staging
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 4s
CI / Detect changes (pull_request) Successful in 13s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 16s
E2E API Smoke Test / detect-changes (pull_request) Successful in 9s
E2E Chat / detect-changes (pull_request) Successful in 14s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 4s
CI / Platform (Go) (pull_request) Successful in 5m14s
CI / Canvas (Next.js) (pull_request) Successful in 7m5s
CI / Python Lint & Test (pull_request) Successful in 8m31s
Harness Replays / detect-changes (pull_request) Successful in 4s
CI / all-required (pull_request) Successful in 7m41s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 12s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m12s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 8s
gate-check-v3 / gate-check (pull_request_target) Successful in 5s
qa-review / approved (pull_request_target) Successful in 8s
security-review / approved (pull_request_target) Successful in 5s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request_target) Successful in 6s
sop-tier-check / tier-check (pull_request_target) Successful in 5s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m25s
Harness Replays / Harness Replays (pull_request) Successful in 30s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 8s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2m53s
E2E Chat / E2E Chat (pull_request) Failing after 7m37s
audit-force-merge / audit (pull_request_target) Has been skipped
528a062354
internal/handlers/a2a_proxy_helpers.go:412 had an unused struct that
causes golangci-lint `unused` failure on every PR targeting staging.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Member

Closing — based on the dead staging fork; superseded/junk relative to main. this edits .github/workflows/ci.yml, a dead GitHub-Actions workflow that does not exist on main (CI is now .gitea/workflows/), and commits a stray review-pr3029-pr3033.md review artifact. The bundled Go error-check sweep is already present on main independently. Per the staging-deprecation cleanup; target main for future PRs.

Closing — based on the dead `staging` fork; superseded/junk relative to `main`. this edits `.github/workflows/ci.yml`, a dead GitHub-Actions workflow that does not exist on `main` (CI is now `.gitea/workflows/`), and commits a stray `review-pr3029-pr3033.md` review artifact. The bundled Go error-check sweep is already present on `main` independently. Per the staging-deprecation cleanup; target `main` for future PRs.
devops-engineer closed this pull request 2026-06-01 18:28:36 +00:00
Some required checks failed
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 4s
CI / Detect changes (pull_request) Successful in 13s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 16s
E2E API Smoke Test / detect-changes (pull_request) Successful in 9s
E2E Chat / detect-changes (pull_request) Successful in 14s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 4s
CI / Platform (Go) (pull_request) Successful in 5m14s
CI / Canvas (Next.js) (pull_request) Successful in 7m5s
CI / Python Lint & Test (pull_request) Successful in 8m31s
Harness Replays / detect-changes (pull_request) Successful in 4s
CI / all-required (pull_request) Successful in 7m41s
Required
Details
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 12s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m12s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 8s
gate-check-v3 / gate-check (pull_request_target) Successful in 5s
qa-review / approved (pull_request_target) Successful in 8s
security-review / approved (pull_request_target) Successful in 5s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
Required
Details
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request_target) Successful in 6s
sop-tier-check / tier-check (pull_request_target) Successful in 5s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m25s
Harness Replays / Harness Replays (pull_request) Successful in 30s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 8s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2m53s
E2E Chat / E2E Chat (pull_request) Failing after 7m37s
audit-force-merge / audit (pull_request_target) Has been skipped

Pull request closed

Sign in to join this conversation.
No Reviewers
2 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#2034