fix(ci): remove || true suppression from lint steps #2035

Closed
core-be wants to merge 15 commits from fix/ci-lint-suppression-1062 into staging
Member
No description provided.
core-be added 17 commits 2026-06-01 03:36:05 +00:00
The deprovision path marks `workspaces.status='removed'` BEFORE calling
the controlplane DELETE. If that CP call fails (transient 5xx, network
hiccup, AWS provider error), the DB row stays at 'removed' with
`instance_id` populated and there's no retry — the EC2 lives forever.
9 prod orphans accumulated over 3 days under this bug.

Adds a SaaS-mode counterpart to the existing Docker `orphan_sweeper`:
- 60s tick (matches the Docker sweeper cadence)
- LIMIT 100 per cycle so a sustained CP outage drains over multiple
  cycles without blowing the request timeout
- Re-issues `cpProv.Stop` for any workspace at status='removed' with a
  non-NULL `instance_id`. Stop is idempotent (AWS terminate on
  already-terminated is a no-op; CP's Deprovision tolerates already-
  deleted DNS) so retries are safe.
- On Stop success, NULLs `instance_id` so the next cycle skips the row.
- On Stop failure, leaves `instance_id` populated for next cycle.

The existing Docker sweeper is gated on `prov != nil`; the new sweeper
is gated on `cpProv != nil`. SaaS tenants get exactly one of the two,
self-hosted tenants get the Docker one — no overlap.

Why this shape over option A (CP-first ordering) or B (durable outbox):
the existing inline path already returns a loud 500 to the user when
CP fails — the only missing piece is automatic retry, which a 60s
sweeper provides without protocol changes, new tables, or new workers.
~30 LOC of production code vs. ~400 for an outbox. RFC discussion in
#2989 comment chain.

Tests:
- 9 unit tests covering happy path, Stop failure, UPDATE failure,
  multiple orphans (one-fails-others-still-process), DB query error,
  nil-DB defense, nil-reaper short-circuit, and the boot-immediate-then-
  tick cadence contract.
- Mutation-tested: status='running' substitution and removed-UPDATE-
  block both fail at least one test.

Out of scope:
- Backfilling the 9 named orphans — they'll heal automatically on the
  first sweep cycle after this lands; no manual cleanup needed.
- Long-term durable-outbox architecture — separate RFC.
- PR #3029: approve with nit (CP orphan sweeper + registry prefix).
- PR #3033: approve with blocker — README undercounts runtimes by 1
  because #3029 adds codex but #3033 claims 8 runtimes and omits it.
- All tests pass locally.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replaces http.DefaultClient in MCP bridge (mcp_tools.go) and CP config
fetch (cp_config.go) with dedicated clients that have transport-level
timeouts. Eliminates a fragility class where global mutable client +
missing context timeout could hang forever on dead workspaces or slow CP.

- mcpHTTPClient: 5s dial, 30s response header, 5s TLS handshake
- cp_config fetch: 10s client timeout matching context deadline

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
updateDelegationStatus was using context.Background() internally,
discarding any caller deadline. This meant status updates could hang
indefinitely if the DB was slow, and cancellation signals from HTTP
request contexts were ignored.

Change signature to accept context.Context and update all call sites.
No behavioral change for happy path; fixes context-timeout hygiene.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two production code paths used context-less sql.DB methods:

1. buildProvisionerConfig (workspace_provision.go:263) called db.DB.QueryRow
   instead of QueryRowContext(ctx, ...) — losing request cancellation.

2. queryPeerMaps (discovery.go:307) called db.DB.Query instead of
   QueryContext(ctx, ...) and did not accept a context parameter. Updated
   signature and all 4 call sites in Peers handler to pass ctx.

No behavioral change for happy path; fixes context-timeout hygiene so
slow queries can be cancelled when the HTTP client disconnects.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
workspace.go BeginTx had explicit Rollback() on every error path but
no deferred safety net. A panic (or future refactor adding an early
return) would leak the tx hold. Add defer tx.Rollback() immediately
after BeginTx — harmless no-op after Commit, critical safety net on
unexpected exits.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two QueryContext calls in restart_context used explicit rows.Close()
after the loop. If a panic or early return occurred during iteration,
the underlying connection would leak back to the pool with an active
result set. Switch to defer rows.Close() inside each if-block for
 RAII-style cleanup.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The async goroutine that processes inbound webhooks had no recover().
A panic inside HandleInbound (e.g., from the A2A proxy or adapter layer)
would crash the entire platform process. Add defer recover() with
workspace-scoped logging so the goroutine exits cleanly and the process
stays alive.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Four fire-and-forget goroutines in a2a_proxy_helpers.go had no recover():
logA2AFailure, logA2ASuccess, logA2AReceiveQueued, and last_outbound_at
update. A panic inside LogActivity or db.DB.ExecContext would crash the
platform process. Add defer recover() with workspace-scoped logging to
each so the goroutine exits cleanly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add panic recovery to goroutines that are NOT wrapped by supervised.RunWithRecover:
- session_auth.go: init() cache sweeper (process-level background task)
- ratelimit.go + mcp_ratelimit.go: bucket cleanup tickers
- bundle/importer.go: fire-and-forget provision start during import
- plugins_install.go: delayed restart after plugin uninstall
- terminal.go: WebSocket bridge goroutines (stdout, PTY, stdin)
- a2a_proxy.go: SSE idle watcher

A panic in any of these would previously crash the process or terminate
a subsystem silently. Now they log and swallow the panic, preserving
process stability.
Prevents bounded timer accumulation during the 30s workspace-online wait.
Each time.After created a timer that couldn't be GC'd until it fired;
with ~60 iterations this was minor but unnecessary.
Prevents timer accumulation in long-running loops:
- restart_context.go: poll loop uses NewTicker instead of time.After
- supervised.go: backoff sleep uses NewTimer with proper Stop
- telegram.go: poll loop uses NewTimer for both retryAfter and poll interval

Each time.After created a timer that couldn't be GC'd until it fired.
In long-running goroutines this caused bounded but unnecessary memory
growth.
Add error checking + logging to previously ignored db.ExecContext and
broadcaster.RecordAndBroadcast calls in bundle importer. Improves
observability when provision failures or URL updates fail silently.
Apply unchecked-error fixes from errcheck audit (#1062):
- channels/manager.go: log db.ExecContext and broadcaster errors in HandleInbound/SendOutbound; return json.Unmarshal errors in loadChannel
- channels/telegram.go: log bot.Send errors for callback ack and edit message
- handlers/approvals.go: log broadcaster, db.QueryRowContext, and db.ExecContext errors

Improves observability when secondary operations fail silently.
fix(ci): remove || true suppression from lint steps (#1062)
E2E API Smoke Test / E2E API Smoke Test (pull_request) Blocked by required conditions
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Blocked by required conditions
Harness Replays / Harness Replays (pull_request) Blocked by required conditions
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Blocked by required conditions
branch-protection drift check / Branch protection drift (pull_request) Successful in 4s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 15s
cascade-list-drift-gate / check (pull_request) Successful in 3s
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 4s
Check migration collisions / Migration version collision check (pull_request) Successful in 7s
CI / Detect changes (pull_request) Successful in 6s
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Failing after 3s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Failing after 3s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Failing after 3s
E2E API Smoke Test / detect-changes (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 5s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 5s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 9s
Harness Replays / detect-changes (pull_request) Successful in 10s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 14s
qa-review / approved (pull_request) Successful in 6s
security-review / approved (pull_request) Successful in 4s
Secret scan / Scan diff for credential-shaped strings (pull_request) Failing after 36s
sop-checklist / review-refire (pull_request) Has been skipped
sop-checklist / na-declarations (pull_request) N/A: (none)
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 35s
Runtime Pin Compatibility / PyPI-latest install + import smoke (pull_request) Successful in 1m38s
gate-check-v3 / gate-check (pull_request) Successful in 6s
sop-checklist / all-items-acked (pull_request) Successful in 6s
sop-tier-check / tier-check (pull_request) Successful in 5s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 24s
CI / Platform (Go) (pull_request) Failing after 1m29s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 5s
CI / Canvas (Next.js) (pull_request) Failing after 2m25s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / Python Lint & Test (pull_request) Failing after 7m5s
bd428717db
The `|| true` suffix on `go vet` and `golangci-lint` commands caused
the CI job to pass even when lint errors were present, masking real
issues and defeating the purpose of static analysis.

Remove the suppression so lint failures correctly fail the build.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
core-be changed target branch from main to staging 2026-06-01 03:39:56 +00:00
core-be force-pushed fix/ci-lint-suppression-1062 from bd428717db to e78659faba 2026-06-01 04:22:24 +00:00 Compare
core-be added 1 commit 2026-06-01 12:41:00 +00:00
ci: remove unused canvasUserMessage type to fix lint on staging
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 5s
CI / Detect changes (pull_request) Successful in 8s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 33s
E2E API Smoke Test / detect-changes (pull_request) Successful in 19s
E2E Chat / detect-changes (pull_request) Successful in 8s
CI / Platform (Go) (pull_request) Successful in 5m21s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 11s
CI / Canvas (Next.js) (pull_request) Successful in 6m50s
CI / Python Lint & Test (pull_request) Successful in 7m9s
CI / all-required (pull_request) Successful in 6m10s
Harness Replays / detect-changes (pull_request) Successful in 7s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 11s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 5s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m9s
gate-check-v3 / gate-check (pull_request_target) Successful in 4s
qa-review / approved (pull_request_target) Successful in 6s
security-review / approved (pull_request_target) Successful in 9s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request_target) Successful in 6s
sop-tier-check / tier-check (pull_request_target) Successful in 7s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m28s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Harness Replays / Harness Replays (pull_request) Successful in 17s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 13s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2m22s
E2E Chat / E2E Chat (pull_request) Failing after 8m20s
audit-force-merge / audit (pull_request_target) Has been skipped
34cecc2b64
internal/handlers/a2a_proxy_helpers.go:412 had an unused struct that
causes golangci-lint `unused` failure on every PR targeting staging.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Member

Closing — based on the dead staging fork; superseded/junk relative to main. this edits .github/workflows/ci.yml, a dead GitHub-Actions workflow that does not exist on main (CI is now .gitea/workflows/), and commits a stray review-pr3029-pr3033.md review artifact. The bundled Go error-check sweep is already present on main independently. Per the staging-deprecation cleanup; target main for future PRs.

Closing — based on the dead `staging` fork; superseded/junk relative to `main`. this edits `.github/workflows/ci.yml`, a dead GitHub-Actions workflow that does not exist on `main` (CI is now `.gitea/workflows/`), and commits a stray `review-pr3029-pr3033.md` review artifact. The bundled Go error-check sweep is already present on `main` independently. Per the staging-deprecation cleanup; target `main` for future PRs.
devops-engineer closed this pull request 2026-06-01 18:28:39 +00:00
Some required checks failed
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 5s
CI / Detect changes (pull_request) Successful in 8s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 33s
E2E API Smoke Test / detect-changes (pull_request) Successful in 19s
E2E Chat / detect-changes (pull_request) Successful in 8s
CI / Platform (Go) (pull_request) Successful in 5m21s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 11s
CI / Canvas (Next.js) (pull_request) Successful in 6m50s
CI / Python Lint & Test (pull_request) Successful in 7m9s
CI / all-required (pull_request) Successful in 6m10s
Required
Details
Harness Replays / detect-changes (pull_request) Successful in 7s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 11s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 5s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m9s
gate-check-v3 / gate-check (pull_request_target) Successful in 4s
qa-review / approved (pull_request_target) Successful in 6s
security-review / approved (pull_request_target) Successful in 9s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
Required
Details
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request_target) Successful in 6s
sop-tier-check / tier-check (pull_request_target) Successful in 7s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m28s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Harness Replays / Harness Replays (pull_request) Successful in 17s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 13s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2m22s
E2E Chat / E2E Chat (pull_request) Failing after 8m20s
audit-force-merge / audit (pull_request_target) Has been skipped

Pull request closed

Sign in to join this conversation.
No Reviewers
2 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#2035