Two-part PR:
## Fix: result_preview was lost on completion
Self-review of #2854 caught a real bug. SetStatus treats a same-status
replay as a no-op; the order of calls in `executeDelegation`'s completion
path and `UpdateStatus`'s completed branch clobbered the preview field:
1. updateDelegationStatus(completed, "") fires
2. inner recordLedgerStatus(completed, "", "")
→ SetStatus transitions dispatched → completed with preview=""
3. outer recordLedgerStatus(completed, "", responseText)
→ SetStatus reads current=completed, status=completed
→ SAME-STATUS NO-OP, never writes responseText → preview lost
Confirmed against real Postgres (see integration test). Strict-sqlmock
unit tests passed because they pin SQL shape, not row state.
Fix: call the WITH-PREVIEW recordLedgerStatus FIRST, then
updateDelegationStatus. The inner call becomes the no-op (correctly
preserves the row written by the outer call).
Same gap fixed in UpdateStatus handler — body.ResponsePreview was
never landing in the ledger because updateDelegationStatus's nested
SetStatus(completed, "", "") fired first.
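For reference, a minimal self-contained sketch of the ordering interaction
(the type and function below are illustrative stand-ins, not the production
helpers):

```go
package main

import "fmt"

// row is an illustrative stand-in for the ledger row; setStatus mimics the
// same-status replay no-op, which skips the write (result_preview included)
// once the row is already in the requested status.
type row struct {
	status  string
	preview string
}

func (r *row) setStatus(status, preview string) {
	if r.status == status {
		return // same-status replay no-op: later previews are dropped
	}
	r.status = status
	r.preview = preview
}

func main() {
	// Buggy order: the preview-less write lands first, so the preview-carrying
	// write becomes the no-op.
	buggy := &row{status: "dispatched"}
	buggy.setStatus("completed", "")            // inner updateDelegationStatus path
	buggy.setStatus("completed", "result text") // outer call: no-op, preview lost
	fmt.Printf("buggy: %q\n", buggy.preview)    // ""

	// Fixed order: the WITH-PREVIEW call fires first; the later preview-less
	// call is the harmless no-op.
	fixed := &row{status: "dispatched"}
	fixed.setStatus("completed", "result text")
	fixed.setStatus("completed", "")
	fmt.Printf("fixed: %q\n", fixed.preview) // "result text"
}
```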
## Gate: real-Postgres integration tests + CI workflow
The unit-test-only workflow that shipped #2854 was the root cause.
Adding two layers of defense:
1. workspace-server/internal/handlers/delegation_ledger_integration_test.go
— `//go:build integration` tag, requires INTEGRATION_DB_URL env var.
4 tests:
* ResultPreviewPreservedThroughCompletion (regression gate for the
bug above — fires the production call sequence in fixed order
and asserts row.result_preview matches)
* ResultPreviewBuggyOrderIsLost (DIAGNOSTIC: confirms the
same-status no-op contract works as designed; if SetStatus's
semantics ever change, this test fires)
* FailedTransitionCapturesErrorDetail (failure-path symmetry)
* FullLifecycle_QueuedToDispatchedToCompleted (forward-only +
happy path)
2. .github/workflows/handlers-postgres-integration.yml
— required check on staging branch protection. Spins up a postgres:15
service container, applies the delegations migration, and runs
`go test -tags=integration` against the live DB. The job always runs,
with per-step gating on a path filter (handlers/wsauth/migrations), so
the required-check name is still satisfied on PRs that don't touch the
relevant code.
Local dev workflow (file header documents this):
docker run --rm -d --name pg -e POSTGRES_PASSWORD=test -p 55432:5432 postgres:15-alpine
psql ... < workspace-server/migrations/049_delegations.up.sql
INTEGRATION_DB_URL="postgres://postgres:test@localhost:55432/molecule?sslmode=disable" \
go test -tags=integration ./internal/handlers/ -run "^TestIntegration_"
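For reference, a sketch of the build-tag + env-var gating pattern the
integration file uses; the test body, the queried columns, and the lib/pq
driver import are assumptions for illustration, not the shipped tests:

```go
//go:build integration

// Sketch of the gating pattern only; the real file's test bodies differ and
// the pq driver import is an assumption about the project's Postgres driver.
package handlers_test

import (
	"database/sql"
	"os"
	"testing"

	_ "github.com/lib/pq"
)

func TestIntegration_ResultPreviewExample(t *testing.T) {
	dsn := os.Getenv("INTEGRATION_DB_URL")
	if dsn == "" {
		t.Fatal("INTEGRATION_DB_URL is required for -tags=integration runs")
	}
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		t.Fatalf("open: %v", err)
	}
	defer db.Close()

	// ...fire the production call sequence against the live DB here, then
	// assert row state: the part strict sqlmock can never verify.
	var preview string
	err = db.QueryRow(
		`SELECT result_preview FROM delegations WHERE delegation_id = $1`,
		"example-delegation-id", // hypothetical id inserted by the test setup
	).Scan(&preview)
	if err != nil {
		t.Fatalf("query: %v", err)
	}
	if preview == "" {
		t.Fatal("result_preview was lost on completion")
	}
}
```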
## Why this matters
Per memory `feedback_mandatory_local_e2e_before_ship`: backend PRs
MUST verify against real Postgres before claiming done. sqlmock pins
SQL shape; only a real DB can verify row state. The workflow makes
this gate mandatory rather than optional.
Every workspace can have children via the regular CreateWorkspace flow
with parent_id set, so a separate handler that bulk-creates from
config.yaml's sub_workspaces (and was non-idempotent — calling it twice
duplicated the team) earned its way out. "Team" is just the state of
having children; expanding/collapsing is purely a canvas-side visual
action that toggles the `collapsed` column via PATCH.
The non-idempotency directly caused tenant-hongming's vCPU starvation:
72 distinct child workspaces accumulated in 4 days, ~14 leaked EC2s
(50 of 64 vCPU consumed by stale teams), every Canvas tabs E2E retry
flaking on RunInstances VcpuLimitExceeded.
What stays:
- TeamHandler.Collapse — still useful; stops + removes children via
StopWorkspaceAuto. Reachable from the canvas Collapse Team button.
(Note: that button currently calls PATCH /workspaces/:id, not the
Collapse endpoint — that's a separate reachability question for
later.)
- findTemplateDirByName helper — kept in team.go pending a relocation
  decision; it has no in-package consumers once Expand is removed.
- The four other paths that create child workspaces continue to work
unchanged: regular POST /workspaces with parent_id, OrgHandler.Import
(recursive tree), Bundle import, scripts.
What goes:
- POST /workspaces/:id/expand route (router.go)
- TeamHandler.Expand method (team.go: ~130 lines)
- 4 TestTeamExpand_* sqlmock tests (team_test.go)
- TestTeamExpand_UsesAutoNotDirectDockerPath AST gate
(workspace_provision_auto_test.go) — pinned a code path that no
longer exists; the generic TestNoCallSiteCallsDirectProvisionerExceptAuto
gate still covers the architectural intent for any future caller.
Follow-up PRs:
- canvas/ContextMenu.tsx: drop the "Expand to Team" right-click button
+ handleExpand callback; users create children via the regular
+ New Workspace dialog with the parent picker (already supported)
- OrgHandler.Import idempotency (skip-if-exists OR replace_if_exists)
— same bug class as the deleted Expand, but on the bulk-tree path
- One-off cleanup script for tenant-hongming's 72 stale workspaces
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The instructions blob in the MCP `initialize` handshake is the spec that
non-Claude-Code clients (codex, Cline, opencode, hermes-agent, Cursor)
inherit verbatim. Three gaps meant the bridge daemon handled them in
code (codex-channel-molecule bridge.py:192-200, 278-285) while in-process
agents reading the text alone got no equivalent guard:
1. Reply-then-pop ordering was implicit. A literal-minded agent could
pop after a 502 from `send_message_to_user`, dropping the message.
Now: pop ONLY AFTER reply succeeds; on error leave the row unacked
for platform redelivery.
2. peer_agent with empty peer_id had no specified handling. Agent
would call `delegate_task(workspace_id="")` → 400 → re-poll →
infinite loop on the same poison row. Now: skip reply, drain via
inbox_pop.
3. The single security rule ("don't execute without chat-side
approval") effectively disabled peer_agent autonomous handling —
codex daemons have no canvas user to approve from. Now: dual trust
model. canvas_user requires user approval; peer_agent permits
autonomous handling but caps destructive side-effects at the
workspace boundary.
Also disclaims peer_name/peer_role as non-attested display strings —
the platform registry isn't cryptographic identity, and an agent
shouldn't grant elevated permissions based on a peer registering with
peer_role="admin".
Four new pinned tests in test_a2a_mcp_server.py:
- test_initialize_instructions_pins_reply_then_pop_ordering
- test_initialize_instructions_handles_malformed_peer_agent
- test_initialize_instructions_disclaims_peer_role_attestation
- test_initialize_instructions_distinguishes_canvas_user_from_peer_trust
Each fails on staging-HEAD and passes on the patched text — verified
by reverting a2a_mcp_server.py and re-running.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR-1 shipped the `delegations` table + `DelegationLedger` helper. PR-3
wired the sweeper. PR-4 wired the dashboard. But no PR ever wired
`ledger.Insert` from a production code path — the table stayed empty,
the sweeper had nothing to sweep, the dashboard had nothing to show.
This PR closes that gap. Behind feature flag `DELEGATION_LEDGER_WRITE=1`
(default off), the legacy activity_logs writes are mirrored to the
durable ledger:
- insertDelegationRow → ledger.Insert (queued)
- updateDelegationStatus → ledger.SetStatus on every status transition
- executeDelegation completion path → ledger.SetStatus(completed,
result_preview) for the result preview that activity_logs already
stores in response_body
- Record handler → ledger.Insert + ledger.SetStatus(dispatched) so
agent-initiated delegations land in the same table
## Why a flag
The legacy flow has ~30 strict-sqlmock tests pinning exactly which SQL
statements fire per handler. Adding ledger writes always-on would
force adding ExpectExec stanzas to each. Flag-off keeps all 30 green
without churn; flag-on lets operators populate the table in staging
to feed the sweeper + dashboard once the agent-side cutover (RFC #2829
PR-5) has proven the round-trip end-to-end.
Default off → byte-identical to pre-#318 behavior.
## Status vocabulary mapping
activity_logs uses a freer status vocabulary than the ledger's CHECK
constraint allows. updateDelegationStatus is called with values like
"received" that the ledger doesn't accept; the wiring filters via a
switch to only forward known-good values, skipping anything else.
Record's first activity_logs row is `dispatched` but the ledger's
Insert path requires `queued` as initial state. Insert as queued first;
the very next SetStatus(..., dispatched) promotes it on the same row.
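The filter can be as small as a predicate like this (name and the exact
allowed set are illustrative; only the skip-unknown-values behavior mirrors
the wiring):

```go
package handlers

// ledgerStatusAllowed reports whether a legacy activity_logs status may be
// forwarded to ledger.SetStatus; anything else (for example "received") is
// skipped rather than tripping the ledger's CHECK constraint.
func ledgerStatusAllowed(legacy string) bool {
	switch legacy {
	case "queued", "dispatched", "in_progress", "completed", "failed":
		return true
	default:
		return false
	}
}
```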
## Coverage
8 wiring tests (delegation_ledger_writes_test.go):
- flag off → no SQL fired (rollout safety contract)
- flag on → INSERT + UPDATE fire as expected
- flag rejects loose truthy values (true/yes/0/on/TRUE) — only "1"
is the on signal, matching PR-2 + PR-5 conventions
- terminal-state replay swallows ErrInvalidTransition (legacy is
authoritative; ledger replay error is not a delegation failure)
All 30 existing delegation_test.go tests still pass — flag default off
keeps the strict-sqlmock surface unchanged.
Refs RFC #2829.
workspace.go was 950 lines after the dispatcher work in PRs #2811 +
#2824 + #2843 + #2846 + #2847 + #2848 + #2850. This extracts the 6
SoT dispatcher helpers into a new workspace_dispatchers.go so the
file is the architectural unit it deserves to be (one place for
"how do we route a workspace lifecycle verb to a backend?").
Moved (no body changes — pure cut + paste with imports):
- HasProvisioner (gate accessor)
- provisionWorkspaceAuto (async provision)
- provisionWorkspaceAutoSync (sync provision, runRestartCycle's path)
- StopWorkspaceAuto (stop dispatcher)
- RestartWorkspaceAuto (restart wrapper)
- RestartWorkspaceAutoOpts (restart with resetClaudeSession)
workspace.go shrinks from 950 → 735 lines and now holds:
- WorkspaceHandler struct + constructor
- SetCPProvisioner / SetEnvMutators
- Create / List / Get / scanWorkspaceRow
- HTTP handler glue
workspace_dispatchers.go is 255 lines and holds the dispatcher trio +
sync variant + gate accessor + a header docblock summarizing the
history (PRs that added each helper) and the source-level pin tests
that gate against drift.
Source-level pin tests updated:
- TestNoCallSiteCallsDirectProvisionerExceptAuto: workspace_dispatchers.go
added to allowlist (the dispatcher IS the place that calls per-backend
bodies directly).
- TestNoCallSiteCallsBareStop: same.
- TestNoBareBothNilCheck / TestOrgImportGate_UsesHasProvisionerNotBareField:
no change — they were source-pinning specific files, not all callers.
Build clean, vet clean, full test suite passes (1742 / 0 in workspace,
all Go test packages green).
Out of scope (#2800 has more):
- workspace_provision.go (869 lines) split into Docker + CP halves —
files would still be 400+ each, marginal value. Defer until a
third backend lands and the symmetry breaks.
- Splitting Create / List / Get into per-handler files — they're
short and tightly coupled to the struct; keep co-located.
Closes #2800 (partial). Filing a follow-up issue if/when workspace.go
or workspace_provision.go grows past 800 lines again.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review of #2852: the inline comment on the IssueToken-failed branch
still referenced POST /workspaces/:id/tokens, which never shipped. The
recovery path that did ship in #2852 is POST /workspaces/:id/external/rotate.
Update the hint so the next operator who hits this failure mode finds
the right endpoint.
External workspaces (runtime=external) lose their workspace_auth_token
the moment the create modal closes — the token is unrecoverable from
any later DB read. Operators who lost their copy or want to respond to
a suspected leak had no recovery path short of recreating the workspace
(which also breaks cross-workspace delegation links + memory namespace).
This PR adds two endpoints + a Config-tab section that surfaces them:
POST /workspaces/:id/external/rotate
Revokes any prior live tokens, mints a fresh one, returns the same
ExternalConnectionInfo payload Create returns. Old credentials stop
working immediately — the previously-paired agent will fail auth on
its next heartbeat (~20s).
GET /workspaces/:id/external/connection
Returns the connect block with auth_token="". For the operator who
just needs to re-find PLATFORM_URL / WORKSPACE_ID / one of the
snippets without invalidating the live agent.
Both reject runtime ≠ external with 400 + a hint pointing at /restart
for non-external runtimes (which mints AND injects into the container).
## Why a flag isn't needed
The endpoints are purely additive — Create's behavior is unchanged.
Existing external workspaces don't see anything different until an
operator clicks the new buttons.
## DRY refactor
Extracted BuildExternalConnectionPayload() in external_connection.go
as the single source of truth for the connect payload shape. Create,
Rotate, and GetExternalConnection all call it. Adds a snippet once →
all three endpoints emit it. Trims trailing slash on platform_url so
no double-slash sneaks into registry_endpoint.
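A sketch of the builder's shape, with illustrative field and path names
(only the platform_url trimming and the blank-token contract come from
this PR):

```go
package handlers

import "strings"

// ExternalConnectionInfo is the connect block; field names here are illustrative.
type ExternalConnectionInfo struct {
	PlatformURL      string `json:"platform_url"`
	WorkspaceID      string `json:"workspace_id"`
	AuthToken        string `json:"auth_token"`
	RegistryEndpoint string `json:"registry_endpoint"`
}

// BuildExternalConnectionPayload is the single source of truth for the connect
// payload. Create and Rotate pass a live token; GetExternalConnection passes "".
func BuildExternalConnectionPayload(platformURL, workspaceID, token string) ExternalConnectionInfo {
	base := strings.TrimRight(platformURL, "/") // keep double slashes out of registry_endpoint
	return ExternalConnectionInfo{
		PlatformURL:      base,
		WorkspaceID:      workspaceID,
		AuthToken:        token,                  // "" for the read-only connection endpoint
		RegistryEndpoint: base + "/api/registry", // assumed path, for illustration only
	}
}
```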
## Canvas
ExternalConnectionSection mounts in ConfigTab when runtime=external.
Two buttons:
- "Show connection info" (cosmetic) — fetches GET /external/connection
- "Rotate credentials" (destructive) — confirm dialog explains the
impact, then POST /external/rotate
Both reuse the existing ExternalConnectModal so operators don't learn
a second snippet UX.
## Coverage
10 Go tests:
- Rotate happy path (revoke + mint order, payload shape, broadcast event)
- Rotate refuses non-external runtimes (400 with restart hint)
- Rotate 404 on unknown workspace + 400 on empty id
- GetExternalConnection happy path (auth_token="", same payload shape)
- GetExternalConnection refuses non-external + 404 on unknown
- BuildExternalConnectionPayload — placeholder substitution + trailing
slash trimming + blank-token contract
6 canvas tests:
- both action buttons render
- "Show" calls GET /external/connection and opens modal
- "Rotate" opens confirm dialog before firing POST
- Cancel dismisses without rotating
- Confirm POSTs and opens modal with returned token
- API failures surface as visible error chips
Migration: existing external workspaces gain new abilities; no data
migration. The DRY refactor preserves byte-identical Create response
shape (8 ConfigTab tests + all existing handler tests still pass).
Closes #319.
Pre-fix _peer_metadata was an unbounded dict — a workspace receiving
from N distinct peers across its lifetime accumulated entries
indefinitely (~100 bytes × N). Not crash-class at typical scale (10K
peers ≈ 1 MB) but unbounded. The TTL-at-read pattern bounded
staleness but did nothing for memory.
Fix: hand-rolled LRU on top of OrderedDict. No new dependency.
- _PEER_METADATA_MAXSIZE = 1024 (issue's recommended bound)
- _peer_metadata_get(canon) — read + LRU touch (move to MRU)
- _peer_metadata_set(canon, value) — write + evict-if-over-maxsize
- All production reads/writes route through the helpers
- _peer_metadata_lock guards the OrderedDict ops so concurrent
background-enrichment workers (#2484) don't race the LRU
invariant
Why hand-rolled vs cachetools:
- No new dep. workspace/ has 0 cache libraries today; adding one
for ~30 lines is negative leverage.
- The TTL is enforced at the call site (existing pattern); only
the size cap + LRU is new. cachetools.TTLCache fuses the two,
which would force a refactor of every caller's TTL check.
- The size + lock are simple enough that a future swap-in of
cachetools is mechanical if needs evolve.
Why maxsize matters more than ttl (issue's framing):
A runaway poller that touches new peer_ids every push would still
grow within a single TTL window — TTL eviction only fires at
read time. The size cap fires immediately on insert, regardless
of read pattern.
Three new tests:
- test_peer_metadata_set_evicts_lru_when_at_maxsize
- test_peer_metadata_get_promotes_to_lru_head
- test_peer_metadata_set_replaces_existing_entry_in_place
1742 passed / 0 failed locally (78 new + 1664 existing).
Closes #2482.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The inbox poller's notification callback called the synchronous
enrich_peer_metadata on every push, blocking the poller for up to
2s × N uncached peers per poll batch. Push delivery latency was
gated on registry RTT — exactly what PR #2471's negative-cache patch
was trying to avoid amplifying.
Fix: cache-first nonblocking path with a tiny background worker pool.
enrich_peer_metadata_nonblocking(peer_id):
- Cache hit (fresh, within TTL): return cached record immediately
- Cache miss / stale: return None, schedule background
fetch via ThreadPoolExecutor
The first push from a new peer arrives metadata-light (bare peer_id);
the next push within the 5-min TTL hits the warm cache and gets full
name/role. Acceptable trade-off because the channel-envelope
enrichment is a UX nicety, not a correctness invariant — and the
cold-cache window per peer is bounded to one push.
Defenses:
- In-flight gate (_enrich_in_flight) — N concurrent pushes for the
same uncached peer schedule exactly ONE worker, not N. Without
this, a chatty peer's first burst of pushes would amplify into
parallel registry GETs — the exact DoS-on-self pattern the
negative cache was meant to rate-limit.
- Lazy executor init — most test fixtures + short-lived CLI
invocations never need it; only the long-running molecule-mcp
path actually fires background work.
- Daemon-style threads via thread_name_prefix; executor never
blocks process exit.
Tests:
- test_enrich_peer_metadata_nonblocking_cache_hit_returns_immediately
- test_enrich_peer_metadata_nonblocking_cache_miss_schedules_fetch
- test_enrich_peer_metadata_nonblocking_coalesces_duplicate_pushes
- test_enrich_peer_metadata_nonblocking_invalid_peer_id_returns_none
Plus updates to the existing test_envelope_enrichment_* suite that
asserted synchronous behavior — they now drain the in-flight set via
_wait_for_enrichment_inflight_for_testing before checking cache state.
Existing synchronous enrich_peer_metadata is unchanged — Phase B (#2790)
schema↔dispatcher drift gate + the negative-cache contract from PR
#2471 still apply. The nonblocking variant is purely additive.
1739 passed, 0 failed locally.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Last open #2799 site. Pause's per-workspace stop call now routes
through StopWorkspaceAuto, removing the final inline if-cpProv-else
(actually if-h.provisioner) dispatch from workspace_restart.go's
restart/pause/resume code paths.
Pre-2026-05-05 the Pause loop was:
if h.provisioner != nil {
h.provisioner.Stop(ctx, ws.id)
}
Same drift class as #2813 (team-collapse leak) + #2814 (workspace
delete leak) — Docker-only stop silently no-ops on SaaS, leaving
the EC2 running while the workspace row gets marked paused. Orphan
sweeper would catch it eventually but the leak window is real.
Pause-specific bookkeeping (mark paused, clear workspace keys,
broadcast WORKSPACE_PAUSED) stays inline in the handler; only the
"stop the running workload" step delegates. StopWorkspaceAuto's
no-backend → no-op semantics match the pre-fix behavior on
misconfigured deployments (the bookkeeping still runs).
One new source-level pin:
TestPauseHandler_UsesStopWorkspaceAuto — gates regression to the
inline dispatch shape.
This closes #2799 Phase 3. After this PR + #2847 (Phase 2 PR-B) land,
workspace_restart.go has no remaining inline if-cpProv-else dispatch
in any user-facing code path. The remaining direct backend calls
inside the file are in stopForRestart and cpStopWithRetry — both
internal helpers that ARE the dispatcher's underlying primitives,
not new bypasses.
Note: scope was originally tagged "Phase 3 needs PauseWorkspaceAuto
verb" in the audit on PR #2843. On closer reading Pause's stop step
is identical to Stop — only the bookkeeping is Pause-specific. Reusing
StopWorkspaceAuto avoids unnecessary surface and keeps the dispatcher
trio (provision/stop/restart) tight.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
runRestartCycle's auto-restart cycle (Site 4 from PR #2843's audit)
needs synchronous provision dispatch — the outer pending-flag loop
in RestartByID relies on returning when the new container is up so
the next restart cycle doesn't race the in-flight provision goroutine
on its Stop call.
Phase 1's provisionWorkspaceAuto wraps each per-backend body in
`go func() {...}()` — wrong shape for runRestartCycle's needs. This
PR introduces provisionWorkspaceAutoSync as a behavioral mirror that
runs in the current goroutine instead.
Two helpers, kept identical except for the wrapper:
provisionWorkspaceAuto: spawns goroutine, returns immediately
provisionWorkspaceAutoSync: blocks until per-backend body returns
Same backend-selection (CP first, Docker second) + no-backend
mark-failed fallback. When one grows a new arm (third backend, retry
semantics), the other should too — pinned in the docstring.
Site 4 (runRestartCycle) was the only call site that needs sync today.
Migrating it removes the last bare if-cpProv-else dispatch in the
restart code path's provision half.
Three new tests:
- TestProvisionWorkspaceAutoSync_RoutesToCPWhenSet
- TestProvisionWorkspaceAutoSync_NoBackendMarksFailed
- TestRunRestartCycle_UsesProvisionWorkspaceAutoSync (source-level pin)
Out of scope (last open #2799 site):
Phase 3 — Site 5 (Pause loop). PAUSE doesn't reprovision; needs a
new PauseWorkspaceAuto verb. After this PR lands, Pause is the only
inline if-cpProv-else dispatch left in workspace_restart.go.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sites 1+2 (Restart HTTP handler goroutine) and Site 3 (Resume HTTP
handler goroutine) now route through RestartWorkspaceAutoOpts /
provisionWorkspaceAuto instead of inlining the if-cpProv-else dispatch.
Three changes:
1. **RestartWorkspaceAutoOpts** — new variant of RestartWorkspaceAuto
that carries the resetClaudeSession Docker-only flag (issue #12).
The bare RestartWorkspaceAuto still exists as a wrapper that calls
Opts with false. CP path silently ignores the flag (each EC2 boots
fresh — no session state to clear). Mirrors the Provision pair
(provisionWorkspace / provisionWorkspaceOpts).
2. **Restart handler (Site 1+2)** — the inline goroutine
`if h.provisioner != nil { Stop } else if h.cpProv != nil { ... }`
collapses to `RestartWorkspaceAutoOpts(...)`. Pre-fix the dispatch
was Docker-FIRST ordering (a different drift class from the
silent-drop bugs PRs #2811/#2824 closed); the dispatcher enforces
CP-FIRST.
3. **Resume handler (Site 3)** — Resume is provision-only (workspace
is paused, no live container), so it routes through
provisionWorkspaceAuto, not RestartWorkspaceAuto. Inline
if-cpProv-else dispatch removed.
Two new source-level pins:
- TestRestartHandler_UsesRestartWorkspaceAuto
- TestResumeHandler_UsesProvisionWorkspaceAuto
These prevent regression to the inline dispatch pattern.
Out of scope (tracked under #2799):
- Site 4 (runRestartCycle) — synchronous coordination model needs
a different shape than the fire-and-return dispatchers. PR-B.
- Site 5 (Pause loop) — PAUSE doesn't reprovision, needs a new
PauseWorkspaceAuto verb. Phase 3.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Activates the server-side foundation that PRs #2832, #2836, #2837
shipped without wiring (each PR landed dead code on purpose so the
review surface stayed tight).
## What this PR wires up
1. router.go — registers the RFC #2829 PR-4 admin endpoints behind
AdminAuth:
GET /admin/delegations[?status=...&limit=N]
GET /admin/delegations/stats
2. cmd/server/main.go — starts the RFC #2829 PR-3 stuck-task
sweeper as a supervised goroutine alongside the existing
scheduler + hibernation-monitor + image-auto-refresh:
go supervised.RunWithRecover(ctx, "delegation-sweeper",
delegSweeper.Start)
## What this PR does NOT do
- PR-2's DELEGATION_RESULT_INBOX_PUSH flag stays default off — flip
happens via env config in a follow-up after staging burn-in.
- PR-5's DELEGATION_SYNC_VIA_INBOX flag stays default off — same
reason. The two flags are independent; either can be flipped in
isolation.
- Canvas operator panel UI: this PR exposes the JSON contract; the
canvas panel consumes it in a separate canvas PR.
## Coverage
2 new router gate tests in admin_delegations_route_test.go:
- List endpoint requires AdminAuth (unauthenticated → 401)
- Stats endpoint requires AdminAuth (unauthenticated → 401)
Pattern mirrors admin_test_token_route_test.go (the IDOR-fix gate
for PR #112). Catches a future router refactor that silently drops
AdminAuth — operator dashboard data exposes caller_id, callee_id, and
task_preview, none of which should reach unauthenticated callers.
Sweeper boots as a no-op until at least one delegation row exists,
so this PR is safe to land before PR-5's agent-side cutover sees
production traffic.
Refs RFC #2829.
Behind feature flag DELEGATION_SYNC_VIA_INBOX (default off). When set,
tool_delegate_task no longer holds an HTTP message/send connection
through the platform proxy waiting for the callee's reply. Instead:
1. POST /workspaces/<src>/delegate (returns 202 + delegation_id)
— platform's executeDelegation goroutine handles A2A dispatch
in the background. No client-side timeout dependency on the
platform holding a connection open.
2. Poll GET /workspaces/<src>/delegations every 3s for a row with
matching delegation_id reaching terminal status (completed/failed).
3. Return the response_preview text on completed; surface the
wrapped _A2A_ERROR_PREFIX error on failed (so caller error
detection stays unchanged).
This closes the bug class that broke Hongming's home hermes on
2026-05-05 ("message/send queued but result not available after 600s
timeout" while the callee was actively heartbeating "iteration 14/90").
## Compatibility
Default-off feature flag — flag-off path is byte-identical to the
legacy send_a2a_message behavior, pinned by
TestFlagOffLegacyPath::test_flag_off_uses_send_a2a_message_not_polling.
Idempotency-key derivation matches tool_delegate_task_async (SHA-256
of source:target:task) so a restart-mid-delegation gets the same key
and the platform returns the existing delegation_id.
## Recovery on timeout
If the polling budget (DELEGATION_TIMEOUT, default 300s) elapses
without a terminal status, the error message includes the
delegation_id + a "call check_task_status('<id>') to retrieve later"
hint. The platform's durable row is still live — work is NOT lost,
just the synchronous wait is over. Caller can poll for the result
later via the existing check_task_status tool.
## Stack with PR-2
PR-2 added the SERVER-SIDE result-push to the caller's a2a_receive
inbox row. PR-5 (this PR) adds the AGENT-SIDE cutover. Together they
remove the proxy-blocked sync path entirely. PR-2 default-off keeps
existing behavior; PR-5 default-off keeps existing behavior. Operators
flip both for full effect after staging burn-in.
## Coverage
9 unit tests:
- flag off → byte-identical to legacy (send_a2a_message called,
_delegate_sync_via_polling NOT called)
- dispatch HTTP exception → wrapped error
- dispatch non-2xx → wrapped error mentioning HTTP code
- dispatch missing delegation_id → wrapped error
- completed first poll → response_preview returned
- failed status → wrapped error with error_detail
- transient poll error → keeps polling, eventually succeeds
- deadline exceeded → wrapped timeout error mentions delegation_id +
check_task_status hint for recovery
- filters by delegation_id (other delegations' rows ignored)
All passing locally. CI will run the same suite on a clean env.
Refs RFC #2829.
Closes the third silent-drop-on-SaaS class for the restart verb. Two
of the three dispatchers were already in place (provisionWorkspaceAuto
PR #2811, StopWorkspaceAuto PR #2824); this completes the trio.
PR #2835 was an earlier attempt at this work (delivered by a peer
agent) that I had to send back for four critical bugs — stop-leg
dispatch order inverted, no-backend nil-deref, empty payload (dispatcher
unusable by callers), forcing-function tests red-from-day-1. This
re-do takes the audit + classification from that work but rebuilds
the implementation against the existing dispatcher convention.
Phase 1 scope:
- RestartWorkspaceAuto in workspace.go — symmetric mirror of
provisionWorkspaceAuto + StopWorkspaceAuto. CP-first dispatch
order. cpStopWithRetry on the SaaS leg (Restart's "make it alive
again" contract justifies the retry that StopWorkspaceAuto's
delete-time contract does not). Three-arm shape including a
no-backend mark-failed defense-in-depth (shape sketched below).
- Three new pin tests covering the routing surface:
TestRestartWorkspaceAuto_RoutesToCPWhenSet,
TestRestartWorkspaceAuto_RoutesToDockerWhenOnlyDocker,
TestRestartWorkspaceAuto_NoBackendMarksFailed.
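Stubbed sketch of the three-arm shape (placeholder types and names, not the
handler's real API):

```go
package main

import (
	"context"
	"fmt"
)

// backend stands in for the per-backend provisioner interfaces; the real
// handler's types and the cpStopWithRetry helper are not shown here.
type backend interface {
	Restart(ctx context.Context, workspaceID string) error
}

type handler struct {
	cpProv      backend // SaaS control-plane backend
	provisioner backend // local Docker backend
}

func (h *handler) restartWorkspaceAuto(ctx context.Context, workspaceID string) {
	switch {
	case h.cpProv != nil: // CP-first dispatch order
		_ = h.cpProv.Restart(ctx, workspaceID) // real code retries the stop leg first
	case h.provisioner != nil: // Docker second
		_ = h.provisioner.Restart(ctx, workspaceID)
	default: // third arm: no backend configured, mark the row failed
		fmt.Println("mark failed: no backend configured for restart of", workspaceID)
	}
}

func main() {
	// With neither backend set, the defense-in-depth arm fires.
	(&handler{}).restartWorkspaceAuto(context.Background(), "ws-1")
}
```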
Phase 2/3 (deferred, file as follow-up issue):
- workspace_restart.go's manual dispatch sites (Restart handler
goroutine, Resume handler goroutine, runRestartCycle's inline
Stop, Pause loop). Each site has async-context reasoning beyond
a fire-and-return dispatcher and needs per-site review.
- Pause specifically needs a different verb (PauseWorkspaceAuto)
since Pause doesn't reprovision.
Why no callers migrated in this PR: the existing call sites in
workspace_restart.go all build their `payload` from a synchronous
DB read first; rewiring them needs care to preserve that ordering
plus the resetClaudeSession + template path resolution that lives
in the HTTP handler context. Splitting the dispatcher introduction
from the migration keeps each PR small and reviewable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`^0.57` only allows 0.57.x — codex CLI is now at 0.128, with breaking CLI
changes in between (notably `exec --resume <sid>` became the `exec resume <sid>`
subcommand). Operators following the snippet today either get a
six-month-old codex with the legacy resume flag, or install the latest manually
and discover that, until now, the daemon couldn't drive it.
codex-channel-molecule 0.1.2 (just published) handles the new subcommand
shape, so operators are best served by always getting the latest codex
that the bridge daemon was last validated against. Bump to `@latest`.
If a future codex CLI breaks the daemon's invocation again, we ship a
new bridge-daemon release rather than asking operators to manage a pin
themselves.
Test: go test ./internal/handlers/ -run TestExternalTemplates -count=1 → green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#2834 added a hard-fail when GH_TOKEN_FOR_ADMIN_API is missing on
schedule + pull_request + workflow_dispatch. The PR-trigger hard-fail
is now blocking every PR in the repo because the secret hasn't been
provisioned yet — including the staging→main auto-promote PR (#2831),
which has no path to set repo secrets itself.
Per feedback_schedule_vs_dispatch_secrets_hardening.md the original
concern is automated/silent triggers losing the gate without a human
to notice. That concern applies to **schedule** specifically:
- schedule: cron, no human, silent soft-skip = invisible regression →
KEEP HARD-FAIL.
- pull_request: a human is reviewing the PR diff and will see workflow
warnings inline. A PR cannot retroactively drift live state — drift
happens *between* PRs (UI clicks, manual gh api PATCH), which the
schedule canary catches. The PR-time gate would only catch typos in
apply.sh, which the *_payload unit tests catch more directly.
→ SOFT-SKIP with a prominent warning.
- workflow_dispatch: operator override, may not have configured the
secret yet. → SOFT-SKIP with warning.
The skip is explicit (SKIP_DRIFT_CHECK=1 surfaced to env, then a step
`if:` guard) so it's auditable in the workflow run UI, not silently
swallowed.
Unblocks #2831 (auto-promote staging→main) + every PR currently behind
this check.
The Memory tab was read-only — users could see and Delete entries but
the only path to write was leaving canvas. Adds a + Add button (toolbar,
next to Refresh) and an Edit button (per-entry, next to Delete) that
share one MemoryEditorDialog.
Add: POST /workspaces/:id/memories with {content, scope, namespace}
Edit: PATCH /workspaces/:id/memories/:id (sibling endpoint #2838)
with only fields that changed; no-op edits short-circuit
client-side so we don't waste a redactSecrets + re-embed pass
Edit mode locks scope (cross-scope moves go through delete + recreate
to keep the GLOBAL audit-log + redact pipeline single-purpose).
Tests: 6 cases on the dialog covering POST shape, PATCH-only-diff,
no-op short-circuit, empty-content guard, save-error keeps modal open,
and namespace+content combined PATCH. Existing 27 MemoryInspectorPanel
tests still pass with the new prop wiring.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the bug class surfaced by Canvas E2E #2632: a workspace ends up
status='failed' with last_sample_error=NULL, and operators (or the
E2E poll loop) see the useless "Workspace failed: (no last_sample_error)"
with no triage signal.
Two pieces:
1. **bundle/importer.go markFailed** — the UPDATE was setting only
status, leaving last_sample_error NULL. Same incident class as the
silent-drop bugs in PRs #2811 + #2824, different code path.
markProvisionFailed in workspace_provision_shared.go has set the
message column for a long time; this writer drifted the convention.
Fix: include last_sample_error in the SET clause + the broadcast.
2. **AST drift gate** (db/workspace_status_failed_message_drift_test.go)
— Go AST walk that finds every db.DB.{Exec,Query,QueryRow}Context
call whose argument list binds models.StatusFailed and asserts the
SQL literal contains last_sample_error. Catches the next caller
that drifts the same convention. Verified to FAIL against the bug
shape (reverted importer.go temporarily — gate flagged the exact
line) and PASS against the fix.
Why an AST gate vs a regex: an earlier attempt with a regex over UPDATE
statements flagged status='online' / status='hibernating' / status=
'removed' UPDATEs as false positives. Walking the AST and flagging only
the calls that bind the StatusFailed constant eliminates that.
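For illustration, a self-contained toy version of the go/ast mechanics; the
real gate's package, matching rules, and constant paths differ:

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strings"
)

// src is a toy file containing the bug shape: a StatusFailed UPDATE whose SQL
// literal never mentions last_sample_error.
const src = `package db

func markFailed() {
	db.ExecContext(ctx,
		"UPDATE workspaces SET status=$1 WHERE id=$2",
		models.StatusFailed, id)
}`

func main() {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", src, 0)
	if err != nil {
		panic(err)
	}
	ast.Inspect(file, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		sel, ok := call.Fun.(*ast.SelectorExpr)
		if !ok || !strings.HasSuffix(sel.Sel.Name, "Context") {
			return true // only Exec/Query/QueryRow...Context calls are interesting
		}
		var sqlLit string
		var bindsStatusFailed bool
		for _, arg := range call.Args {
			switch a := arg.(type) {
			case *ast.BasicLit:
				if a.Kind == token.STRING && strings.Contains(strings.ToUpper(a.Value), "UPDATE") {
					sqlLit = a.Value
				}
			case *ast.SelectorExpr:
				if a.Sel.Name == "StatusFailed" {
					bindsStatusFailed = true
				}
			}
		}
		if bindsStatusFailed && sqlLit != "" && !strings.Contains(sqlLit, "last_sample_error") {
			fmt.Printf("%s: StatusFailed UPDATE without last_sample_error\n", fset.Position(call.Pos()))
		}
		return true
	})
}
```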
Out of scope (filed separately if needed):
- The Canvas E2E that surfaced the missing message (#2632) is now a
required check on staging via PR #2827. Once this fix lands the
next staging push should re-run #2632's failing case and produce
a meaningful last_sample_error.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-fix the only writes to agent_memories were Commit (POST) and
Delete (DELETE). Editing an entry meant delete + recreate, losing the
original id and created_at, and (the user-visible reason for filing
this) leaving the canvas Memory tab without an Edit button at all.
Adds PATCH that accepts either content, namespace, or both — at
least one required (empty body 400s; silently no-op'ing would let a
buggy client think it succeeded). The full Commit security pipeline
is re-run on content edits:
- redactSecrets on every scope (#1201 SAFE-T)
- GLOBAL [MEMORY → [_MEMORY delimiter escape (#807 SAFE-T)
- GLOBAL audit log row mirroring Commit's #767 forensic pattern
- re-embed via the configured EmbeddingFunc (skipping would leave
the row's vector pointing at the OLD content, silently breaking
semantic search)
Cross-scope edits (LOCAL→GLOBAL) intentionally NOT supported — that's
delete + recreate so the GLOBAL access-control gate (only root
workspaces can write GLOBAL) gets re-evaluated cleanly.
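Sketch of the accepts-either-or-both contract (struct and helper names are
illustrative; the reject-empty-body behavior is the piece taken from this PR):

```go
package handlers

import "errors"

// patchMemoryRequest mirrors the accepted PATCH body: content, namespace, or both.
type patchMemoryRequest struct {
	Content   *string `json:"content,omitempty"`
	Namespace *string `json:"namespace,omitempty"`
}

// validate rejects an empty body up front; silently no-op'ing would let a
// buggy client believe the edit succeeded.
func (r patchMemoryRequest) validate() error {
	if r.Content == nil && r.Namespace == nil {
		return errors.New("at least one of content or namespace is required")
	}
	return nil
}
```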
7 new sqlmock tests pin: namespace-only, content-only LOCAL,
content-only GLOBAL with audit + escape, empty-body 400, empty-
content 400, 404 on missing/wrong-workspace memory, no-op 200 with
changed=false (and crucially: no UPDATE fires on no-op).
Build clean, full handlers test suite (./internal/handlers) passes
in 4s.
PR-2 (frontend): Add modal + Edit button in MemoryInspectorPanel.tsx
will land separately.
Two read endpoints over the `delegations` table (PR-1 schema):
GET /admin/delegations[?status=in_flight|stuck|failed|completed&limit=N]
GET /admin/delegations/stats
## What this gives operators
Without this, post-incident investigation requires direct DB access —
only the on-call SRE can answer "is workspace X delegating to a wedged
callee?". This moves that visibility into the same surface as
/admin/queue, /admin/schedules-health, /admin/memories.
## List endpoint
Status filter via tight allowlist:
- in_flight (default) → status IN (queued, dispatched, in_progress)
- stuck → status='stuck' (rows the PR-3 sweeper marked)
- failed → status='failed'
- completed → status='completed'
Unknown status → 400 with the allowlist in the error body. Limit
1..1000, default 100.
The status allowlist drives a parameterized IN clause (no string-
concatenation of user-controlled values into SQL).
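Sketch of that allowlist-to-IN-clause step (query text and selected columns
are illustrative; the filter keys mirror the documented set):

```go
package handlers

import (
	"fmt"
	"strings"
)

// statusFilters maps the public filter keys onto the ledger statuses they select.
var statusFilters = map[string][]string{
	"in_flight": {"queued", "dispatched", "in_progress"},
	"stuck":     {"stuck"},
	"failed":    {"failed"},
	"completed": {"completed"},
}

// buildListQuery turns an allowlisted filter into a parameterized IN clause;
// unknown filters are rejected before any SQL is assembled (the handler 400s
// with the allowlist in the error body).
func buildListQuery(filter string, limit int) (string, []any, error) {
	statuses, ok := statusFilters[filter]
	if !ok {
		return "", nil, fmt.Errorf("unknown status %q (allowed: in_flight, stuck, failed, completed)", filter)
	}
	placeholders := make([]string, len(statuses))
	args := make([]any, 0, len(statuses)+1)
	for i, s := range statuses {
		placeholders[i] = fmt.Sprintf("$%d", i+1)
		args = append(args, s)
	}
	args = append(args, limit)
	query := fmt.Sprintf(
		"SELECT delegation_id, status, created_at FROM delegations WHERE status IN (%s) ORDER BY created_at DESC LIMIT $%d",
		strings.Join(placeholders, ", "), len(statuses)+1)
	return query, args, nil
}
```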
Result rows expose all the audit-grade fields the dashboard needs:
delegation_id, caller_id, callee_id, task_preview, status,
last_heartbeat, deadline, result_preview, error_detail, retry_count,
created_at, updated_at. Nullable fields use pointer types so JSON
omits them when NULL (no false-zero "" for missing values).
## Stats endpoint
Zero-fills every known status key (queued, dispatched, in_progress,
completed, failed, stuck) so the dashboard summary card doesn't have
to handle "missing key vs zero" branching.
## Out of scope (deferred)
- "retry this stuck task" mutation: needs the agent-side cutover
(RFC #2829 PR-5 plan) before re-fire is safe
- p95 / p99 duration aggregates: separate metric exposure, not a
row-level read endpoint
- Canvas UI: this is the JSON contract; the canvas operator panel
consumes it in a follow-up canvas PR
## Wiring
NOT wired into the router in this PR — ships separately to keep
PR-by-PR review surface tight. Wiring will land in the
`enable-rfc2829-server-side` follow-up PR alongside the sweeper Start
call and the result-push flag flip.
## Coverage
11 unit tests:
List (8):
- default status=in_flight, IN(queued,dispatched,in_progress)
- status=stuck → IN(stuck)
- status=failed → IN(failed)
- unknown status → 400 with allowlist
- negative limit → 400
- over-cap limit → 400
- custom limit accepted + echoed in response
- nullable fields populated correctly (pointer-omitempty)
Stats (2):
- zero-fills missing status keys
- empty table → all counts zero
Contract pin (1):
- statusFilters table shape — every documented key + value pair
pinned. Drift catches accidental edits (forward defense).
Refs RFC #2829.
Periodically scans the `delegations` table (PR-1 schema) for in-flight
rows that need terminal action:
1. Deadline-exceeded → marked `failed` with "deadline exceeded by sweeper"
2. Heartbeat-stale (no beat for >10× heartbeat interval) → marked `stuck`
## Why both rules
Deadline catches forever-heartbeating wedged agents (the alive-but-not-
advancing class — agent loops on heartbeat call inside its main loop).
Heartbeat-staleness catches OOM-killed and crashed agents that stop cold
without graceful shutdown. Either rule alone misses one of these classes.
## Order matters
Deadline is checked first. A deadline-exceeded AND stale row is marked
`failed` (operator action: investigate + give up), not `stuck` (operator
action: investigate + retry). The semantic difference matters.
## NULL heartbeat is a free pass
A delegation that's just been inserted but hasn't emitted its first
heartbeat yet is NOT stuck-marked — gives the agent its first beat
window. Lets the deadline catch true never-started rows naturally.
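A compact sketch of the per-row decision with an assumed row shape (the real
sweeper works over SQL rows and the ledger API; only the rule ordering and
the NULL free pass are from this PR):

```go
package sweeper

import "time"

// inFlightRow is a stand-in for the columns the sweep reads per in-flight delegation.
type inFlightRow struct {
	Deadline      time.Time
	LastHeartbeat *time.Time // nil until the agent's first beat
}

// classify returns the terminal status the sweeper should apply, or "" to
// leave the row alone.
func classify(row inFlightRow, now time.Time, stuckThreshold time.Duration) string {
	// Deadline first: a row that is both past deadline and heartbeat-stale is
	// marked failed (operator gives up), not stuck (operator retries).
	if now.After(row.Deadline) {
		return "failed"
	}
	// NULL heartbeat is a free pass; the agent gets its first-beat window and
	// a true never-started row is eventually caught by the deadline rule.
	if row.LastHeartbeat == nil {
		return ""
	}
	if now.Sub(*row.LastHeartbeat) > stuckThreshold {
		return "stuck"
	}
	return ""
}
```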
## Concurrent-completion safety
Sweep races with UpdateStatus on a delegation that just completed: the
ledger's terminal forward-only protection (PR-1) returns ErrInvalidTransition,
sweeper logs + counts in Errors, the row stays correctly in completed.
## Configuration
- DELEGATION_SWEEPER_INTERVAL_S — tick cadence (default 5min)
- DELEGATION_STUCK_THRESHOLD_S — heartbeat-staleness threshold (default 10min)
Both fall back gracefully on invalid input (typo'd env shouldn't crash
startup). Both read at construction time so a long-running process
picks up overrides via restart.
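The graceful fallback can be as small as this (helper name and the exact
defaults are illustrative):

```go
package sweeper

import (
	"os"
	"strconv"
	"time"
)

// secondsFromEnv reads an integer-seconds override, falling back to the default
// on missing, garbage, or non-positive input so a typo'd env var never crashes startup.
func secondsFromEnv(key string, def time.Duration) time.Duration {
	raw := os.Getenv(key)
	if raw == "" {
		return def
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return def
	}
	return time.Duration(n) * time.Second
}
```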
## Wiring
NOT wired into main.go in this PR — that ships separately so the
sweeper can be enabled/disabled independently of the binary upgrade.
The sweeper is a standalone Sweep(ctx) callable + Start(ctx) ticker
loop, both with panic recovery, both indexed-scan-cheap on the
partial idx_delegations_inflight_heartbeat from PR-1.
## Coverage
13 unit tests against sqlmock-backed *sql.DB:
Sweep semantics (8 tests):
- empty in-flight set → clean no-op
- deadline → failed
- heartbeat-stale → stuck
- NULL heartbeat is left alone (first-beat free pass)
- healthy row → no-op
- both-rule row → marked failed (deadline wins)
- mixed set → both rules fire on the right rows
- concurrent-completion race → forward-only protection holds
Env override parsing (5 tests):
- default on missing env
- parses positive seconds
- falls back on garbage
- falls back on negative
- constructor picks up overrides; defaults when env unset
Refs RFC #2829.
Multi-model review of #2827 caught: the script as-shipped would have
silently weakened branch protection on EVERY non-checks dimension
the moment anyone ran it. Live staging had
enforce_admins=true, dismiss_stale_reviews=false, strict=true,
allow_fork_syncing=false, bypass_pull_request_allowances={
HongmingWang-Rabbit + molecule-ai app
}
Script wrote the opposite for all five. Per memory
feedback_dismiss_stale_reviews_blocks_promote.md, the
dismiss_stale_reviews flip alone is the load-bearing one — would
silently re-block every auto-promote PR (cost user 2.5h once).
This PR:
1. apply.sh: per-branch payloads (build_staging_payload /
build_main_payload) that codify the deliberate per-branch policy
already on the repo, with the script's net contribution being
ONLY the new check names (Canvas tabs E2E + E2E API Smoke on
staging, Canvas tabs E2E on main).
2. apply.sh: R3 preflight that hits /commits/{sha}/check-runs and
asserts every desired check name has at least one historical run
on the branch tip. Catches typos like "Canvas Tabs E2E" vs
"Canvas tabs E2E" — pre-fix a typo would silently block every PR
forever waiting for a context that never emits. Skip via
--skip-preflight for genuinely-new workflows whose first run
hasn't fired.
3. drift_check.sh: compares the FULL normalised payload (admin,
review, lock, conversation, fork-syncing, deletion, force-push)
not just the checks list. Pre-fix the drift gate would have
missed a UI click that flipped enforce_admins or
dismiss_stale_reviews. Drops app_id from the comparison since
GH auto-resolves -1 to a specific app id post-write.
4. branch-protection-drift.yml: per memory
feedback_schedule_vs_dispatch_secrets_hardening.md — schedule +
pull_request triggers HARD-FAIL when GH_TOKEN_FOR_ADMIN_API is
missing (silent skip masks the gate disappearing).
workflow_dispatch keeps soft-skip for one-off operator runs.
Verified by running drift_check against live state: pre-fix would
have shown 5 destructive drifts on staging + 5 on main. Post-fix
shows ONLY the 2 intended additions on staging + 1 on main, which
go away after `apply.sh` runs.