Backward-compat replay gate for the A2A JSON-RPC protocol surface.
Every PR that touches normalizeA2APayload OR bumps the a2a-sdk
version pin runs every shape in testdata/a2a_corpus/ through the
current code and asserts:
valid/ — every shape MUST parse without error and produce a
canonical v0.3 payload (params.message.parts list).
invalid/ — every shape MUST be rejected with the documented
status code and error substring.
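The per-entry assertion for valid/ can be pictured as a small helper. This is a sketch; the helper name and map shapes are illustrative, not the real test code:

```go
package main

import "fmt"

// checkCanonicalV03 is a hypothetical helper mirroring what the valid/
// replay asserts per entry: parts list non-empty, messageId non-empty,
// and the v0.2 "content" field deleted.
func checkCanonicalV03(msg map[string]any) error {
	if _, ok := msg["content"]; ok {
		return fmt.Errorf("v0.2 content field not deleted")
	}
	parts, ok := msg["parts"].([]any)
	if !ok || len(parts) == 0 {
		return fmt.Errorf("parts list missing or empty")
	}
	if id, _ := msg["messageId"].(string); id == "" {
		return fmt.Errorf("messageId empty")
	}
	return nil
}

func main() {
	canonical := map[string]any{
		"parts":     []any{map[string]any{"kind": "text", "text": "hi"}},
		"messageId": "m-1",
	}
	fmt.Println(checkCanonicalV03(canonical)) // <nil>
	fmt.Println(checkCanonicalV03(map[string]any{"content": "hi"}) != nil) // true
}
```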
What this prevents
The 2026-04-29 v0.2 → v0.3 silent-drop bug (PR #2349) shipped
because the SDK bump PR didn't replay v0.2-shaped inputs against
the new code; the shape-mismatch surfaced only in production when
the receiver's Pydantic validator silently rejected inbound
messages.
This gate would have caught it pre-merge. Hand-verified: reverting
the v0.2 string→parts shim in normalizeA2APayload fails 3 of the
v0.2 corpus entries with the exact rejection class the production
bug exhibited.
Corpus contents (14 entries)
valid/ (11):
v0_2_string_content — basic v0.2 (the broken case)
v0_2_string_content_no_message_id — v0.2 + auto-fill messageId
v0_2_list_content — v0.2 with content as Part list
v0_3_parts_text_only — canonical v0.3
v0_3_parts_multi_text — multi-Part list
v0_3_parts_with_file — multimodal (text + file)
v0_3_parts_with_context — contextId for multi-turn
v0_3_streaming_method — message/stream variant
v0_3_unicode_text — emoji + multi-script
v0_3_long_text — 10KB text Part
no_jsonrpc_envelope — bare params/method without
outer envelope (legacy senders)
invalid/ (3):
no_content_or_parts — message has neither field
content_is_integer — wrong type for v0.2 content
content_is_bool — wrong type; kept separate from the integer
case so the failure message identifies
which type-class regressed
Plus 4 inline malformed-JSON cases (truncated, not-JSON, empty,
whitespace) that can't be expressed as JSON corpus entries.
Coverage tests
The gate has 5 test functions:
1. TestA2ACorpus_ValidShapesParse — replay valid/ corpus,
assert no error + canonical v0.3 output (parts list non-empty,
messageId non-empty, content field deleted).
2. TestA2ACorpus_InvalidShapesRejected — replay invalid/ corpus,
assert rejection matches recorded status + error substring.
3. TestA2ACorpus_MalformedJSONRejected — inline cases for
non-parseable bodies.
4. TestA2ACorpus_HasMinimumCoverage — at least one v0.2 and
one v0.3 entry exist (the corpus can never lose either
side of the v0.2/v0.3 bridge).
5. TestA2ACorpus_EveryEntryHasMetadata — _comment/_added/_source
on every entry per the README policy; _expect_error and
_expect_status on invalid entries.
Documentation
testdata/a2a_corpus/README.md describes the corpus contract:
- When to add entries (new SDK shape, new production-observed
shape).
- When NOT to add (test scaffolding, hypothetical futures).
- Removal policy (breaking change, deprecation window required).
Verification
- All 24 corpus subtests pass on current main.
- Hand-test: revert the v0.2 compat shim → 3 v0.2 entries fail
the gate with the exact rejection class the production bug
exhibited. Confirmed.
- Whole-module go test ./... green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses code-review C1 (test goroutine race) and I2 (CF 524) on PR #2362.
C1: TestRunRestartCycle_SaaSPath_DispatchesViaCPProv invoked runRestartCycle
end-to-end, which spawns `go h.sendRestartContext(...)`. That goroutine
outlived the test, then read db.DB while the next test's setupTestDB wrote
to it — DATA RACE under -race, cascading 30+ failures across the handlers
suite. Refactored: extracted `stopForRestart(ctx, id)` from runRestartCycle
as a pure dispatcher, and rewrote the SaaS-path test to call it directly
(no async goroutine spawned). Added a no-provisioner no-op guard test.
I2: Cloudflare 524 ("origin timed out") now triggers maybeMarkContainerDead
alongside 502/503/504. Same upstream signal — origin agent unresponsive.
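The I2 change amounts to widening the dead-origin status set. A minimal sketch, with a hypothetical helper name:

```go
package main

import "fmt"

// isDeadUpstreamStatus sketches the I2 change: Cloudflare 524
// ("origin timed out") now counts as a dead-origin signal alongside
// the classic gateway statuses. Helper name is illustrative.
func isDeadUpstreamStatus(code int) bool {
	switch code {
	case 502, 503, 504, 524: // 524 added by this fix
		return true
	}
	return false
}

func main() {
	fmt.Println(isDeadUpstreamStatus(524)) // true
	fmt.Println(isDeadUpstreamStatus(200)) // false
}
```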
Verified `go test -race -count=1 ./internal/handlers/...` green locally.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Independent review of #2362 caught a Critical gap: the previous commit
fixed the Stop dispatch in runRestartCycle but left the provisionWorkspace
dispatch unconditionally Docker-only. So on SaaS the auto-restart cycle
would Stop the EC2 successfully (good), then NPE inside provisionWorkspace's
`h.provisioner.VolumeHasFile` call. coalesceRestart's recover()-without-
re-raise (a deliberate platform-stability safeguard) silently swallowed
the panic, leaving the workspace permanently stuck in status='provisioning'
because the UPDATE on workspace_restart.go:450 had already run.
Net pre-amendment effect on SaaS: dead agent → structured 503 (good) →
workspace flipped to 'offline' (good) → cpProv.Stop succeeded (good) →
provisionWorkspace NPE swallowed (bad) → workspace permanently
'provisioning' until manual canvas restart. The headline claim of #2362
("SaaS auto-restart now works") was false on the path it shipped.
Fix: dispatch the reprovision call the same way every other call site
in the package does (workspace.go:431-433, workspace_restart.go:197+596) —
branch on `h.cpProv != nil` and call provisionWorkspaceCP for SaaS,
provisionWorkspace for Docker.
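The dispatch shape of the fix, sketched with stand-in types (all names here are illustrative):

```go
package main

import "fmt"

// Stand-in types for the two provisioner kinds (real names differ).
type dockerProvisioner struct{}
type cpProvisioner struct{}

type handler struct {
	provisioner *dockerProvisioner // local Docker
	cpProv      *cpProvisioner     // CP-backed EC2 (SaaS)
}

// reprovisionDispatch sketches the amended branch: SaaS goes through
// provisionWorkspaceCP, Docker through provisionWorkspace, matching
// the other call sites in the package.
func (h *handler) reprovisionDispatch() string {
	switch {
	case h.cpProv != nil:
		return "provisionWorkspaceCP"
	case h.provisioner != nil:
		return "provisionWorkspace"
	default:
		return "none" // both nil: RestartByID's gate keeps us out of here
	}
}

func main() {
	saas := &handler{cpProv: &cpProvisioner{}}
	local := &handler{provisioner: &dockerProvisioner{}}
	fmt.Println(saas.reprovisionDispatch())  // provisionWorkspaceCP
	fmt.Println(local.reprovisionDispatch()) // provisionWorkspace
}
```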
Tests:
- New TestRunRestartCycle_SaaSPath_DispatchesViaCPProv asserts cpProv.Stop
is called when the SaaS path runs (would have caught the NPE if
provisionWorkspace had been called instead).
- fakeCPProv updated: methods record calls and return nil/empty by
default rather than panicking. The previous "panic on unexpected call"
pattern was unsafe — the panic fires on the async restart goroutine
spawned by maybeMarkContainerDead AFTER the test assertions ran, so
the test passed by accident even though the production path was
broken (which is exactly how the Critical bug landed).
- Existing tests still pass (full handlers + provisioner suites green).
Branch-count audit refresh:
runRestartCycle dispatch decisions:
1. h.provisioner != nil → provisioner.Stop + provisionWorkspace ✓ (existing tests)
2. h.cpProv != nil → cpProv.Stop + provisionWorkspaceCP ✓ (NEW test)
3. both nil → coalesceRestart never called (RestartByID gate) ✓
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Class-of-bugs fix surfaced by hongmingwang.moleculesai.app's canvas chat
to a dead workspace returning a generic Cloudflare 502 page on
2026-04-30. Three independent gaps in the reactive-health path that
together leak dead-agent failures to canvas with no auto-recovery.
## Bug 1 — maybeMarkContainerDead is a no-op for SaaS tenants
`maybeMarkContainerDead` only consulted `h.provisioner` (local Docker
provisioner). SaaS tenants set `h.cpProv` (CP-backed EC2 provisioner)
and leave `h.provisioner` nil — so the function early-returned false
on every call and dead EC2 agents never triggered the offline-flip /
broadcast / restart cascade.
Fix: extend `CPProvisionerAPI` interface with `IsRunning(ctx, id)
(bool, error)` (already implemented on `*CPProvisioner`; just needs
to surface on the interface). `maybeMarkContainerDead` now branches:
local-Docker path uses `h.provisioner.IsRunning`; SaaS path uses
`h.cpProv.IsRunning` which calls the CP's `/cp/workspaces/:id/status`
endpoint to read the EC2 state.
## Bug 2 — RestartByID short-circuits on `h.provisioner == nil`
Same shape as Bug 1: the auto-restart cascade triggered by
`maybeMarkContainerDead` calls `RestartByID` which short-circuited
when the local Docker provisioner was missing. So even if Bug 1 were
fixed, the workspace-offline state would never recover.
Fix: change the gate to `h.provisioner == nil && h.cpProv == nil`
and update `runRestartCycle` to branch on which provisioner is
wired for the Stop call. (The HTTP `Restart` handler already does
this branching correctly — we're just bringing the auto-restart path
to parity.)
## Bug 3 — upstream 502/503/504 propagated as-is, masked by Cloudflare
When the agent's tunnel returns 5xx (the "tunnel up but no origin"
shape — agent process dead but cloudflared connection still healthy),
`dispatchA2A` returns successfully at the HTTP layer with a 5xx body.
`handleA2ADispatchError`'s reactive-health path doesn't run because
that path is only triggered on transport-level errors. The pre-fix
code propagated the 502 status to canvas; Cloudflare in front of the
platform then masked the 502 with its own opaque "error code: 502"
page, hiding any structured response and any Retry-After hint.
Fix: in `proxyA2ARequest`, when the upstream returns 502/503/504, run
`maybeMarkContainerDead` BEFORE propagating. If IsRunning confirms
the agent is dead → return a structured 503 with restarting=true +
Retry-After (CF doesn't mask 503s the same way). If running, propagate
the original status (don't recycle a healthy agent on a transient
hiccup — it might have legitimately returned 502).
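The decision in proxyA2ARequest can be sketched as a pure function. This is a hypothetical helper; the real code also emits the structured body and Retry-After header:

```go
package main

import "fmt"

// classifyUpstream5xx sketches the Bug 3 decision: gateway-class
// statuses from the tunnel trigger a liveness check first; only a
// confirmed-dead agent is converted to a structured 503.
func classifyUpstream5xx(status int, agentDead bool) (outStatus int, restarting bool) {
	switch status {
	case 502, 503, 504:
		if agentDead {
			return 503, true // structured body + Retry-After; CF won't mask it
		}
	}
	return status, false // healthy agent: propagate the original status
}

func main() {
	s, r := classifyUpstream5xx(502, true)
	fmt.Println(s, r) // 503 true
	s, r = classifyUpstream5xx(502, false)
	fmt.Println(s, r) // 502 false
}
```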
## Drive-by — a2aClient transport timeouts
a2aClient was `&http.Client{}` with no Transport timeouts. When a
workspace's EC2 black-holes TCP connects (instance terminated mid-flight,
SG flipped, NACL bug), the OS default is 75s on Linux / 21s on macOS —
long enough for Cloudflare's ~100s edge timeout to fire first and
surface a generic 502. Added DialContext (10s connect), TLSHandshake
(10s), and ResponseHeaderTimeout (60s). Client.Timeout DELIBERATELY
unset — that would pre-empt slow-cold-start flows (Claude Code OAuth
first-token, multi-minute agent synthesis). Long-tail body streaming
is still governed by per-request context deadline.
## Tests
- `TestMaybeMarkContainerDead_CPOnly_NotRunning` — IsRunning(false) →
marks workspace offline, returns true.
- `TestMaybeMarkContainerDead_CPOnly_Running` — IsRunning(true) →
no offline-flip, returns false (don't recycle a healthy agent).
- `TestProxyA2A_Upstream502_TriggersContainerDeadCheck` — agent server
  returns 502 + cpProv reports dead → caller gets 503 with
  restarting=true and Retry-After: 15.
- `TestProxyA2A_Upstream502_AliveAgent_PropagatesAsIs` — same upstream
502 but cpProv reports running → propagates 502 (existing behavior;
safety check that prevents over-eager recycling).
- Existing `TestMaybeMarkContainerDead_NilProvisioner` /
`TestMaybeMarkContainerDead_ExternalRuntime` still pass.
- Full handlers + provisioner test suites pass.
## Impact
Pre-fix: dead EC2 agent on a SaaS tenant → CF-masked 502 to canvas, no
auto-recovery, manual restart from canvas required.
Post-fix: dead EC2 agent on a SaaS tenant → structured 503 with
restarting=true + Retry-After to canvas, workspace flipped to offline,
auto-restart cycle triggered. Canvas can show a user-actionable
"agent is restarting, please wait" message instead of a generic 502.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Independent review of #2358 surfaced three gaps that the original
self-review missed. All three would manifest only on the FIRST real
staging→main promotion through the new tail step, so they'd silently
re-introduce the deploy-chain bug #2357 was supposed to fix.
1. **Missing `actions: write` permission.** `gh workflow run` POSTs to
`/repos/.../actions/workflows/.../dispatches`, which requires the
actions:write scope on GITHUB_TOKEN. The job had only contents:write
+ pull-requests:write, so the dispatch call would 403 on every run
and the publish chain would still not fire. Adding the scope.
2. **No workflow-level concurrency block.** When CI + E2E Staging
Canvas + E2E API Smoke + CodeQL all complete within seconds of each
other on a green staging push (the typical case), four separate
workflow_run events fire and four parallel auto-promote runs all
reach the dispatch tail. They poll the same PR, all observe the
same mergedAt, and all call `gh workflow run` — producing 2-4×
redundant publish builds racing for the same `:staging-latest`
retag and 2-4× canary-verify chains. Added
`concurrency.group: auto-promote-staging, cancel-in-progress: false`.
cancel-in-progress=false because killing a polling tail that's
about to dispatch would re-introduce the original bug.
3. **PR closed-without-merge ties up a runner for 30 min.** If the
merge queue rejects the PR (gates flip red post-approval), or an
operator closes it manually, mergedAt stays null forever and the
loop polls 60 × 30s burning a runner slot. Now also reads `state`
in the same `gh pr view` call and breaks early when STATE=CLOSED.
Verification on this PR is structural (workflow won't fire on a
staging→main promotion until this lands AND a subsequent staging
push triggers auto-promote). The actions:write fix in particular is
unverifiable until the next real run — the prior #2358 fix has
the same property, so we're stacking two unverifiable workflow
edits. That's intentional rather than risky: stage 1 (#2358) was
load-bearing for the deploy-chain restoration; stage 2 (this PR)
hardens it before it actually matters.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The auto-promote staging → main flow uses `gh pr merge --auto` with
GITHUB_TOKEN, which means GitHub suppresses downstream `push` events on
the resulting main commit. This is documented behavior — events created
by GITHUB_TOKEN do not trigger new workflow runs, with workflow_dispatch
and repository_dispatch as the only exceptions.
Effect: when the merge queue lands the auto-promote PR, the main push
DOES NOT fire publish-workspace-server-image. canary-verify + the
:staging-<sha> → :latest retag never run, so redeploy-tenants-on-main
also never fires. Tenants stay on stale code until someone manually
dispatches the chain (which is what just happened for issue #2339).
Fix here: after enqueuing auto-merge, poll for the PR to land, then
explicitly `gh workflow run publish-workspace-server-image.yml --ref
main`. workflow_dispatch is the documented exception, so the dispatch
event itself DOES create a new run. canary-verify and
redeploy-tenants-on-main chain via workflow_run as before.
Long-term (tracked in #2357): switch the auto-merge call above to a
GitHub App token (actions/create-github-app-token) so the merge event
itself can trigger the downstream chain naturally; the polling tail
becomes deletable.
Why a 30-min poll cap: merge queue typically lands a green promote PR
within 5-10 min. 30 min covers a slow CI run without hanging the
workflow indefinitely. If the merge times out, the step warns and
exits 0 — operator can manually dispatch as a fallback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI run on PR #2355 surfaced `pq: invalid input syntax for type uuid:
ws-poll-e2e-1777529293-3363` — workspaces.id is UUID-typed and the
hand-rolled "ws-<tag>" shape fails the cast. Phase 1 returned
generic 'registration failed' which cascaded into Phase 3 'lookup
failed' (resolveAgentURL on a non-existent row) and Phase 4 'missing
workspace auth token' (no token extracted because Phase 1 didn't run
the bootstrap path).
Generate v4 UUIDs via uuidgen (with a python3 fallback), one each
for the poll workspace, the caller workspace, and the Phase 2
invalid-mode probe.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end coverage for the canvas-chat unblocker. Exercises every
moving part of the #2339 stack against a real platform instance:
Phase 1 — register a workspace as delivery_mode=poll WITHOUT a URL;
verify the response carries delivery_mode=poll.
Phase 2 — invalid delivery_mode rejected with 400 (typo defense).
Phase 3 — POST A2A to the poll-mode workspace; verify proxyA2ARequest
short-circuits and returns 200 {status:queued, delivery_mode:poll,
method:message/send} without ever resolving an agent URL.
Phase 4 — verify the queued message appears in /activity?type=a2a_receive
with the right method + payload (the polling agent reads from here).
Phase 5 — since_id cursor returns ASC-ordered rows STRICTLY AFTER the
cursor; the cursor row itself must NOT be replayed. Sends two
follow-up messages and asserts ordering: rows[0] is the older new
event, rows[-1] is the newer.
Phase 6 — unknown / pruned cursor returns 410 Gone with an explanation.
Phase 7 — cross-workspace cursor isolation: a UUID belonging to one
workspace cannot be used to peek at another workspace's feed (returns
410, same as pruned, no info leak).
Idempotent: per-run unique workspace ids (date+pid). Trap-based cleanup
deletes the test rows on exit; no e2e_cleanup_all_workspaces call (see
feedback_never_run_cluster_cleanup_tests_on_live_platform.md).
Wired into .github/workflows/e2e-api.yml so it runs on every PR that
touches workspace-server/, tests/e2e/, or the workflow file itself —
same gate as the existing test_a2a_e2e + test_notify_attachments suites.
Stacked on #2354 (PR 3: since_id cursor).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Telegram getUpdates / Slack RTM shape: poll-mode workspaces pass the id
of the last activity_logs row they consumed, server returns rows
strictly after in chronological (ASC) order. Existing callers that don't
pass since_id keep DESC + most-recent-N — backwards-compatible.
Cursor lookup is scoped by workspace_id so a caller cannot enumerate or
peek at another workspace's events by passing a UUID belonging to a
different workspace. Cross-workspace and pruned cursors both return
410 Gone — no information leak (caller cannot distinguish "row never
existed" from "row exists but you can't see it").
since_id + since_secs both apply (AND). When since_id is set the order
flips to ASC because polling consumers need recorded-order; the
recent-feed shape (no since_id) keeps DESC.
Tests:
- TestActivityHandler_SinceID_ReturnsNewerASC — cursor lookup → main
query with cursorTime + ASC ordering.
- TestActivityHandler_SinceID_CursorNotFound_410 — pruned/unknown cursor.
- TestActivityHandler_SinceID_CrossWorkspaceCursor_410 — UUID belongs to
another workspace, scoped lookup hides it (same 410 path, no leak).
- TestActivityHandler_SinceID_CombinedWithSinceSecs — placeholder index
arithmetic with both filters.
Stacked on #2353 (PR 2: poll-mode short-circuit).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Skip SSRF/dispatch and queue to activity_logs for delivery_mode=poll
workspaces. The polling agent (e.g. molecule-mcp-claude-channel on an
operator's laptop) consumes via GET /activity?since_id= in PR 3 — no
public URL needed.
Order: budget -> normalize -> lookupDeliveryMode short-circuit ->
resolveAgentURL. Normalizing before the short-circuit keeps the
JSON-RPC method name on the activity_logs row so the polling agent
can dispatch correctly.
Fail-closed-to-push: any DB error reading delivery_mode defaults to
push (loud + recoverable) rather than poll (silent drop).
Tests:
- TestProxyA2A_PollMode_ShortCircuits_NoSSRF_NoDispatch — core invariant:
no resolveAgentURL, no Do(), records to activity_logs, returns 200
{status:"queued",delivery_mode:"poll",method:"message/send"}.
- TestProxyA2A_PushMode_NoShortCircuit — push path unaffected; the agent
server actually receives the request.
- TestProxyA2A_PollMode_FailsClosedToPush — DB error on mode lookup
must NOT silently queue; falls through to the push path.
Stacked on #2348 (PR 1: schema + register flow).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hard gate #4: codified module boundaries as Go tests, so a new
contributor (or AI agent) can't silently land an import that crosses
a layer.
Boundaries enforced (one architecture_test.go per package):
- wsauth has no internal/* deps — auth leaf, must be unit-testable in
isolation
- models has no internal/* deps — pure-types leaf, reverse dep would
create cycles since most packages depend on models
- db has no internal/* deps — DB layer below business logic, must be
testable with sqlmock without spinning up handlers/provisioner
- provisioner does not import handlers or router — unidirectional
layering: handlers wires provisioner into HTTP routes; the reverse
is a cycle
Each test parses .go files in its package via go/parser (no x/tools
dep needed) and asserts forbidden import paths don't appear. Failure
messages name the rule, the offending file, and explain WHY the
boundary exists so the diff reviewer learns the rule.
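The core of each boundary test, sketched as a runnable fragment; the module path is a made-up example:

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strings"
)

// forbiddenImports sketches each architecture_test.go's core check:
// parse a source file with parser.ImportsOnly and flag any import path
// under a forbidden prefix.
func forbiddenImports(src, forbiddenPrefix string) ([]string, error) {
	f, err := parser.ParseFile(token.NewFileSet(), "x.go", src, parser.ImportsOnly)
	if err != nil {
		return nil, err
	}
	var bad []string
	for _, imp := range f.Imports {
		path := strings.Trim(imp.Path.Value, `"`)
		if strings.HasPrefix(path, forbiddenPrefix) {
			bad = append(bad, path)
		}
	}
	return bad, nil
}

func main() {
	// The hand-tested violation shape: wsauth importing an internal package.
	src := "package wsauth\nimport \"example.com/app/internal/orgtoken\"\n"
	bad, _ := forbiddenImports(src, "example.com/app/internal/")
	fmt.Println(bad) // [example.com/app/internal/orgtoken]
}
```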
Note: the original issue's first two proposed boundaries
(provisioner-no-DB, handlers-no-docker) don't match the codebase
today — provisioner already imports db (PR #2276 runtime-image
lookup) and handlers hold *docker.Client directly (terminal,
plugins, bundle, templates). I picked the four boundaries that
actually hold; the first two are aspirational and would need a
refactor before they could be codified.
Hand-tested by injecting a deliberate wsauth -> orgtoken violation:
the gate fires red with the rule message before merge.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hard gate Tier 2 item 2 of 4. Cron-driven full-lifecycle E2E that
catches regressions visible only at runtime — schema drift,
deployment-pipeline gaps, vendor outages, env-var rotations,
DNS / CF / Railway side-effects.
Empirical motivation from today:
- #2345 (A2A v0.2 silent drop) — passed unit tests, broke at JSON-RPC
parse layer between sender + receiver. Visible only when a sender
exercises the full path. Now-fixed by PR #2349, but a continuous
E2E would have surfaced it within 20 min of the regression.
- RFC #2312 chat upload — landed staging-branch but never reached
staging tenants because publish-workspace-server-image was main-
only. Caught by manual dogfooding hours after deploy. Same pattern.
Both classes are invisible to PR-time CI. The continuous gate fires
every 20 min against a real staging tenant and surfaces regressions
within minutes.
Cadence: cron `0,20,40 * * * *` (3x/hour). Offsets the existing
sweep-cf-orphans (:15) and sweep-cf-tunnels (:45) so the three ops
don't burst CF/AWS APIs at the same minute. Concurrency group
prevents overlapping runs if one hangs.
Cost: ~$0.50-1/day GHA + pennies of staging tenant lifecycle.
Reuses existing tests/e2e/test_staging_full_saas.sh — no new harness
to maintain. Bounded at 10 min wall-clock (vs 15 min default) so
stuck runs fail fast rather than holding up the next firing.
Defaults to E2E_RUNTIME=langgraph (fastest cold start; the regression
classes this gate catches don't need hermes-specific paths). Operators
can dispatch with runtime=hermes when they want SDK-native coverage.
Schedule-vs-dispatch hardening: hard-fail on missing
CP_STAGING_ADMIN_API_TOKEN for cron firing (silent-skip would mask
real outages); soft-skip for operator dispatch.
Refs:
- #2342 hard-gates Tier 2 item 2
- #2345 (A2A v0.2 fix that this gate would have caught earlier)
- #2335 / #2337 (deployment-pipeline gaps that this gate also catches)
Closes #2345.
## Symptom
Design Director silently dropped A2A briefs whose sender used the
v0.2 message format (`params.message.content` string) instead of v0.3
(`params.message.parts` part-list). The downstream a2a-sdk's v0.3
Pydantic validator rejected with "params.message.parts — Field
required" but the rejection only landed in tenant-side logs; the
sender saw HTTP 200/202 and assumed delivery.
UX Researcher therefore never received the kickoff. Multi-agent
pipeline silently idle.
## Fix
Convert at the proxy edge in normalizeA2APayload. Two cases handled,
one explicitly rejected:
v0.2 string content → wrap as [{kind: text, text: <content>}]
(the canonical v0.2 case from the dogfooding
report)
v0.2 list content → preserve list as parts (some older clients
put a list under `content`; treat as "client
meant parts, used wrong field name")
v0.3 parts present → no-op (hot path for normal traffic)
Neither present → return HTTP 400 with structured JSON-RPC
error pointing at the missing field
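The four-way table above, sketched on a bare message map (simplified; the real normalizeA2APayload operates on the whole JSON-RPC envelope and emits a structured JSON-RPC error on rejection):

```go
package main

import "fmt"

// normalizeMessage sketches the conversion: v0.3 parts pass through,
// v0.2 string/list content is lifted into parts, anything else is
// rejected loudly instead of being silently dropped downstream.
func normalizeMessage(msg map[string]any) (map[string]any, error) {
	if _, ok := msg["parts"]; ok {
		return msg, nil // v0.3 hot path: no-op
	}
	switch c := msg["content"].(type) {
	case string: // v0.2 string → single text Part
		msg["parts"] = []any{map[string]any{"kind": "text", "text": c}}
	case []any: // v0.2 list → client meant parts, used the wrong field name
		msg["parts"] = c
	default: // neither present, or unsupported type → loud 400
		return nil, fmt.Errorf("message requires content or parts")
	}
	delete(msg, "content")
	return msg, nil
}

func main() {
	out, _ := normalizeMessage(map[string]any{"content": "kickoff brief"})
	fmt.Println(len(out["parts"].([]any))) // 1
	_, err := normalizeMessage(map[string]any{})
	fmt.Println(err != nil) // true
}
```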
Why at the proxy edge: every workspace gets the compat for free
without each one bumping a2a-sdk separately. The SDK's own compat
adapter is strict about `parts` and rejects v0.2 senders.
Why reject loud on missing-both: pre-fix the SDK's Pydantic
rejection was post-handler-dispatch and invisible to the original
sender. Now misshapen payloads return a structured 400 to the actual
caller — kills the entire silent-drop class for this payload-shape
category.
## Tests
7 new cases on normalizeA2APayload (#2345) + 1 fixture update on the
existing _MissingMethodReturnsEmpty test:
TestNormalizeA2APayload_ConvertsV02StringContentToParts
TestNormalizeA2APayload_ConvertsV02ListContentToParts
TestNormalizeA2APayload_PreservesV03Parts (hot path)
TestNormalizeA2APayload_RejectsMessageWithNeitherContentNorParts
TestNormalizeA2APayload_RejectsContentWithUnsupportedType
TestNormalizeA2APayload_NoMessageNoCheck (e.g. tasks/list bypasses)
All 11 normalizeA2APayload tests pass + full handler suite (no
regressions).
## Refs
Hard-gates discussion: this is exactly the class of failure
(silent-drop on schema mismatch) that #2342 (continuous synthetic
E2E) would catch automatically. Tier 2 RFC item from #2345 (caller
gets structured JSON-RPC error on parse failure) is delivered above
via the loud-reject path.
Adds workspaces.delivery_mode (push, default | poll) and lets the register
handler accept poll-mode workspaces with no URL. This is the foundation
for the unified poll/push delivery design in #2339 — Telegram-getUpdates
shape for external runtimes that have no public URL.
What this PR does:
- Migration 045: NOT NULL TEXT column, default 'push', CHECK constraint
on the two valid values.
- models.Workspace + RegisterPayload + CreateWorkspacePayload gain a
DeliveryMode field. RegisterPayload.URL drops the `binding:"required"`
tag — the handler now enforces it conditionally on the resolved mode.
- Register handler: validates explicit delivery_mode if set; resolves
effective mode (payload value, else stored row value, else push) AFTER
the C18 token check; validates URL only when effective mode is push;
persists delivery_mode in the upsert; returns it in the response;
skips URL caching when payload.URL is empty.
- CreateWorkspace handler: persists delivery_mode (defaults to push) in
the same INSERT, and validates it before any side effects.
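The mode-resolution order and conditional URL check can be sketched as follows (the helper name is illustrative):

```go
package main

import "fmt"

// validateRegister sketches the resolution order: payload value, else
// stored row value, else "push"; the URL requirement then applies only
// to the resolved push mode.
func validateRegister(payloadMode, storedMode, url string) (string, error) {
	mode := payloadMode
	if mode == "" {
		mode = storedMode
	}
	if mode == "" {
		mode = "push" // schema default
	}
	if mode != "push" && mode != "poll" {
		return "", fmt.Errorf("invalid delivery_mode %q", mode)
	}
	if mode == "push" && url == "" {
		return "", fmt.Errorf("url is required for push delivery")
	}
	return mode, nil
}

func main() {
	m, err := validateRegister("poll", "", "")
	fmt.Println(m, err) // poll <nil>
	_, err = validateRegister("", "", "")
	fmt.Println(err != nil) // true: resolved to push, empty URL rejected
}
```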
What this PR does NOT do (intentional, follow-up PRs):
- PR 2: short-circuit ProxyA2A for poll-mode workspaces (skip SSRF +
dispatch, log a2a_receive activity, return 200).
- PR 3: since_id cursor on GET /activity for lossless polling.
- Plugin v0.2 in molecule-mcp-claude-channel: cursor persistence + a
register helper that creates poll-mode workspaces.
Backwards compatibility: every existing workspace stays push-mode (schema
default) with identical behavior. New tests:
TestRegister_PollMode_AcceptsEmptyURL,
TestRegister_PushMode_RejectsEmptyURL,
TestRegister_InvalidDeliveryMode,
TestRegister_PollMode_PreservesExistingValue. All existing register +
create tests updated to expect the new delivery_mode column in the
INSERT args.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #2332 item 1 (workspace awareness — agents don't surface
platform-native tools up front).
The dogfooding session surfaced that agents weren't using A2A
delegation, persistent memory, or send_message_to_user. The tools
were registered AND documented in the system prompt — but only in
sections #8 (Inter-Agent Communication) and #9 (Hierarchical Memory),
which agents read AFTER they've already started reasoning about a
plan from earlier sections.
This adds a tight inventory at section #1.5 (immediately after
Platform Instructions, before role-specific prompt files) — every
tool name + its short description in a bulleted block. Detailed
when_to_use docs in sections #8/#9 stay; this preamble is the
elevator pitch ("you have these"), the later sections are the
manual ("here's when and how").
Generated from `platform_tools.registry` ToolSpecs — every tool's
`name` + `short` flow through automatically, no manual sync. A new
`get_capabilities_preamble(mcp: bool)` helper in executor_helpers
mirrors the existing get_a2a_instructions / get_hma_instructions
pattern.
CLI-runtime agents (mcp=False) get an empty preamble — they see
_A2A_INSTRUCTIONS_CLI's hand-written subcommand vocabulary further
down, and the registry's MCP tool names would conflict.
Tests:
- test_capabilities_preamble_appears_in_mcp_prompt: header present
- test_capabilities_preamble_lists_every_registry_tool: every
  a2a + memory tool from the registry shows up (drift is caught at
  test time: adding a new tool to the registry surfaces here
  automatically)
- test_capabilities_preamble_precedes_prompt_files: ordering
invariant (toolkit before role docs)
- test_capabilities_preamble_skipped_for_cli_runtime: empty when
mcp=False
All 40 prompt + platform_tools tests pass.
Parity with #2337's redeploy-tenants-on-staging.yml. Both prod and
staging redeploys now have explicit serialization:
group: redeploy-tenants-on-main (per-workflow, global)
group: redeploy-tenants-on-staging (per-workflow, global)
cancel-in-progress: false on both — aborting a half-rolled-out fleet
would leave tenants stuck on whatever image they happened to be on
when cancelled. Better to finish the in-flight rollout before starting
the next one.
Pre-fix this workflow relied on GitHub's implicit workflow_run queueing,
which is "probably fine" but not defensible — explicit > implicit for
load-bearing pipeline behavior. Picked up as a #2337 review nit
(architecture finding 1: concurrency asymmetry between the two
redeploy workflows).
No behavior change in the common case. The change matters only when
two main pushes land within seconds AND the first redeploy is still
mid-rollout — currently rare; will become more common once #2335
(staging-trigger publish) feeds main more frequently via auto-promote.
Two follow-ups from #2335 review (tracked in #2336):
1. Add `concurrency:` block to publish-workspace-server-image.yml so
two rapid staging pushes don't race the same :staging-latest retag.
Group is per-branch (`${{ github.ref }}`) so staging and main can
build in parallel — they produce different :staging-<sha> tags and
last-write-wins on :staging-latest is acceptable across branches.
`cancel-in-progress: false` keeps in-flight builds — partially-pushed
images would break canary-fleet pin consistency.
2. Add redeploy-tenants-on-staging.yml. After #2335, every staging push
produces a fresh :staging-latest, but existing tenants only pick it
up on next reprovision. This workflow mirrors
redeploy-tenants-on-main but for staging:
- workflow_run-gated to branches: [staging]
- target_tag default 'staging-latest' (vs 'latest' for prod)
- CP_URL default https://staging-api.moleculesai.app
- CP_STAGING_ADMIN_API_TOKEN repo secret (operator must set)
- canary_slug empty by default — staging is itself the canary; no
sub-canary needed inside it. Soak still applies if operator
specifies a tenant for blast-radius control.
Schedule-vs-dispatch hardening matches sweep-cf-orphans /
sweep-cf-tunnels: hard-fail on auto-trigger when secret missing so misconfig
doesn't silently leave staging tenants on stale code; soft-skip on
operator dispatch.
Operator action required after merge:
Add CP_STAGING_ADMIN_API_TOKEN repo secret. Pull value from staging-
CP's CP_ADMIN_API_TOKEN env in Railway controlplane / staging
environment. Until set, the auto-trigger will fail the workflow run
(visible as red CI), surfacing the misconfiguration. Workflow runs
only on staging publish-workspace-server-image success, so no extra
load while it sits unconfigured.
Verification:
- YAML lint clean on both workflows.
- Reviewed redeploy-tenants-on-main as template; differences are scoped
to staging-specific values (URL, tag, secret name) + harden-on-missing-
secret pattern.
Refs #2335, #2336.
Root cause: this workflow only triggered on `branches: [main]`, but
staging-CP pins TENANT_IMAGE=:staging-latest (verified via Railway).
:staging-latest was only retagged on main push, so:
staging-branch code → never built → never reaches staging tenants
staging-CP serves → "yesterday's main" indefinitely
When staging→main was wedged (path-filter parity bug, canvas teardown
race — both fixed earlier today), :staging-latest stopped updating
entirely. RFC #2312 (chat upload HTTP-forward) landed on staging but
freshly-provisioned staging tenants kept failing chat upload because
they pulled pre-RFC-#2312 image. Verified by tearing down a fresh
tenant and observing the legacy "workspace container not running"
error from the docker-exec code path that RFC #2312 deleted.
Pre-2026-04-24 there was a related-but-different incident: TENANT_IMAGE
was a static :staging-<sha> pin that drifted 10 days behind. This new
incident is "the dynamic pin still drifts when its update workflow
doesn't fire."
Fix: add `staging` to the branches trigger. Tag policy is unchanged
(:staging-<sha> + :staging-latest on every push). canary-verify.yml
still runs on main push (workflow_run-gated to `branches: [main]`),
preserving the canary-verified :latest promotion for prod tenants.
Steady state after this:
- staging push → :staging-latest = staging-branch code → staging-CP
- main push → :staging-<sha> for canary, :staging-latest retag
(post-promote main code), and after canary green
→ :latest for prod tenants
What this does NOT change:
- canary-verify.yml flow (still main-only)
- redeploy-tenants-on-main.yml (still rolls prod fleet on main push)
- publish-canvas-image.yml (self-hosted standalone canvas; orthogonal)
- The :latest tag (canary-verified main, unchanged)
What this does fix:
- RFC #2312-class fixes that land on staging now actually reach
staging tenants without waiting for staging→main promote.
- The dogfooding observation "staging tenants seem to be running
yesterday's code" disappears as a class.
Drive-by: also fixed the typo in the path-filter list (was
`publish-platform-image.yml`, the actual file is
`publish-workspace-server-image.yml`).
The header comment claimed:
"file upload (HTTP-forward) + download (Docker-exec)"
and:
"Download still uses the v1 docker-cp path; migrating it lives in
the next PR in this stack"
Both are now wrong: RFC #2312 PR-D landed the download HTTP-forward path:
chat_files.go:336 builds an http.NewRequestWithContext to
${wsURL}/internal/file/read?path=<abs>, with the response streamed
back to the caller. The workspace-side Starlette handler is at
workspace/internal_file_read.py, mounted at workspace/main.py:440.
Update the header to reflect actual code: both upload + download are
HTTP-forward, share the same per-workspace platform_inbound_secret
auth, and work uniformly on local Docker and SaaS EC2.
Pure docs change — no behavior, no build/test impact.
Closes the observability gap surfaced in #2329 item 5: callers received
queue_id in the 202 enqueue response but had no public lookup. The only
existing observability path was check_task_status (delegation-flavored
A2A only — joins via request_body->>'delegation_id'). Cross-workspace
peer-direct A2A had no observability after enqueue.
This PR ships RFC #2331's Tier 1: minimum viable observability + caller-
specified TTL. No schema migration — expires_at column already exists
(migration 042); only DequeueNext was honoring it, with no caller path
to populate it.
Two changes:
1. extractExpiresInSeconds(body) — new helper mirroring
extractIdempotencyKey/extractDelegationIDFromBody. Pulls
params.expires_in_seconds from the JSON-RPC body. Zero (the unset
default) preserves today's infinite-TTL semantics. EnqueueA2A grew
an expiresAt *time.Time parameter; the proxy callsite computes
*time.Time from the extracted seconds and threads it through to
the INSERT.
2. GET /workspaces/:id/a2a/queue/:queue_id — new public handler.
Auth: caller's workspace token must match queue.caller_id OR
queue.workspace_id, OR be an org-level token. 404 (not 403) on
auth failure to avoid leaking queue_id existence. Response
includes status/attempts/last_error/timestamps/expires_at; embeds
response_body via LEFT JOIN against activity_logs when status=
completed for delegation-flavored items.
What this does NOT change:
- Drain semantics (heartbeat-driven dispatch).
- Native-session bypass (claude-agent-sdk, hermes still skip queue).
- Schema (column already exists).
- MCP tools (delegate_task_async / check_task_status keep their
contract; this is a parallel queue-id surface).
Tests:
- 7 cases on extractExpiresInSeconds covering absent/positive/
zero/negative/invalid-JSON/wrong-type/empty-params.
- go vet + go build clean.
- Full handlers test suite passes (no regressions from the
EnqueueA2A signature change — only one production caller).
Tier 2 (cross-workspace stitch + webhook callback) and Tier 3
(controllerized lifecycle) deferred per RFC #2331.
Issue: scripts/dev-start.sh assumed `go` was on PATH; on a fresh dev
box without Go installed, line 111 (`go run ./cmd/server`) failed
with `go: not found` and the script bailed before printing the
readiness banner. The script's own prerequisite list (lines 13-21)
said "Go 1.25+" but there was no signpost between "open the doc" and
"command not found."
Fix: detect `go` via `command -v`. If present, keep the existing
`go run` path (fast iteration, attaches to local log). If not,
fall back to `docker compose up -d --build platform` which uses the
published platform container — slower first run but the script
still works without forcing the dev to install Go just to read logs.
Either path leaves /health on :8080 so the rest of the script's
wait loop is unchanged.
If both paths fail, the error message names the install URL
(https://go.dev/dl/) and the fallback diagnostic (`/tmp/molecule-platform.log`)
so the dev has a single, actionable next step.
Verified: `sh -n` syntax check passes.
Closes #2329 item 2.
CP's tenant-delete cascade removes the DNS record (with sweep-cf-orphans
as a backstop) but does NOT delete the underlying Cloudflare Tunnel.
Each E2E provision creates one Tunnel named `tenant-<slug>`; without
cleanup these accumulate indefinitely on the account, consuming the
tunnel quota and cluttering the dashboard.
Observed 2026-04-30: dozens of `tenant-e2e-canvas-*` tunnels in Down
state with zero replicas, weeks past their tenant's deletion. Same
class of bug as the DNS-records leak that drove sweep-cf-orphans
(controlplane#239).
Parallel-shape to sweep-cf-orphans:
- Same dry-run-by-default + --execute pattern
- Same MAX_DELETE_PCT safety gate (default 90% — higher than DNS
sweep's 50% because tenant-shaped tunnels are orphans by design)
- Same schedule/dispatch hardening (hard-fail on missing secrets
when scheduled, soft-skip when dispatched)
- Cron offset to :45 to avoid CF API bursts colliding with the DNS
sweep at :15
Decision rules (in order):
1. Name doesn't match `tenant-<slug>` → keep (unknown — never sweep
tunnels that might belong to platform infra).
2. Tunnel has active connections (status=healthy or non-empty
connections array) → keep (defense-in-depth: don't kill a live
tunnel even if CP forgot the org).
3. Slug ∈ {prod_slugs ∪ staging_slugs} → keep.
4. Otherwise → delete (orphan).
Verified by:
- shell syntax check (bash -n)
- YAML lint
- Decide-logic offline smoke (7 cases, all pass)
- End-to-end dry-run smoke with stubbed CP + CF APIs
Required secrets (added to existing org-secrets):
CF_API_TOKEN must include account:cloudflare_tunnel:edit
scope (separate from zone:dns:edit used by
sweep-cf-orphans — same token if scope is
broad, or a new token if narrowly scoped).
CF_ACCOUNT_ID account that owns the tunnels (visible in
dash.cloudflare.com URL path).
CP_PROD_ADMIN_TOKEN reused from sweep-cf-orphans.
CP_STAGING_ADMIN_TOKEN reused from sweep-cf-orphans.
Note: the CP-side root cause (tenant-delete should cascade to tunnel
delete) is in molecule-controlplane and worth fixing separately. This
janitor is the operational backstop in the meantime — the same pattern
applied to DNS records while that root cause went unaddressed.
Setup wrote .playwright-staging-state.json at the END (step 7), only
after org create + provision-wait + TLS + workspace create +
workspace-online all succeeded. If setup crashed at steps 1-6, the org existed in
CP but the state file did not, so Playwright's globalTeardown bailed
out ("nothing to tear down") and the workflow safety-net pattern-swept
every e2e-canvas-<today>-* org to compensate. That sweep deleted
concurrent runs' live tenants — including their CF DNS records —
causing victims' next fetch to die with `getaddrinfo ENOTFOUND`.
Race observed 2026-04-30 on PR #2264 staging→main: three real-test
runs killed each other mid-test, blocking 68 commits of staging→main
promotion.
Fix: write the state file as setup's first action, right after slug
generation, before any CP call. Now:
- Crash before slug gen → no state file, no orphan to clean
- Crash during steps 1-6 → state file has slug; teardown deletes
it (DELETE 404s if org never created)
- Setup completes → state file has full state; teardown
deletes the slug
The workflow safety-net no longer pattern-sweeps; it reads the state
file and deletes only the recorded slug. Concurrent canvas-E2E runs no
longer poison each other.
Verified by:
- tsc --noEmit on staging-setup.ts + staging-teardown.ts
- YAML lint on e2e-staging-canvas.yml
- Code review: state file write moved to line 113 (post-makeSlug,
pre-CP) with the original line-249 write retained as a "promote
to full state" overwrite at the end
Acceptance criterion 3 of #2001 ("CI check that fails if TENANT_IMAGE
contains a SHA-shaped suffix") was deferred from PR #2168 because
querying Railway from a GitHub Actions runner needs RAILWAY_TOKEN
plumbed as a repo secret. The detection script + regression test in
#2168 cover detection; this is the automation-cadence layer.
Daily 13:00 UTC schedule (06:00 PT) + workflow_dispatch. Daily is the
right cadence for variables-tier config — Railway env var changes are
deliberate operator actions, low-frequency. Hourly would risk Railway
API rate-limit surprises.
Issue-on-failure pattern mirrors e2e-staging-sanity.yml — drift opens
a `railway-drift` priority-high issue (or comments on the open one),
and a subsequent clean run auto-closes it with a "drift resolved"
comment. No human-in-the-loop needed for the close.
Schedule-vs-dispatch secret hardening per
feedback_schedule_vs_dispatch_secrets_hardening:
- Schedule trigger HARD-FAILS on missing RAILWAY_AUDIT_TOKEN
(silent-success was the failure mode that bit us before)
- workflow_dispatch SOFT-SKIPS so an operator can dry-run the
workflow shape during initial token provisioning
Operator action required before this gate is live:
- Provision a Railway API token, read-only `variables` scope on the
molecule-platform project (id 7ccc8c68-61f4-42ab-9be5-586eeee11768)
- Store as repo secret RAILWAY_AUDIT_TOKEN
- Rotate per the standard 90-day schedule
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Branch protection treats matching-name check runs as a SET — any SKIPPED
member fails the required-check eval, even with SUCCESS siblings. The
two-jobs-sharing-name pattern (no-op + real-job) emits one SKIPPED + one
SUCCESS check run per workflow run; with multiple runs at the same SHA
(detect-changes triggers + auto-promote re-runs) the SET fills with
SKIPPED entries that block branch protection.
Verified live on PR #2264 (staging→main auto-promote): mergeStateStatus
stayed BLOCKED for 18+ hours despite APPROVED + MERGEABLE + all gates
green at the workflow level. `gh pr merge` returned "base branch policy
prohibits the merge"; `enqueuePullRequest` returned "No merge queue
found for branch 'main'". The check-runs API showed `E2E API Smoke
Test` and `Canvas tabs E2E` each had 2 SKIPPED + 2 SUCCESS at head SHA
66142c1e.
Fix: collapse no-op + real-job into ONE job with no job-level `if:`,
gating real work via per-step `if: needs.detect-changes.outputs.X ==
'true'`. The job always runs and emits exactly one SUCCESS check run
under the required-check name regardless of paths-filter outcome —
branch-protection-clean.
Same pattern as ci.yml's earlier conversion of Canvas/Platform/Python/
Shellcheck (PR #2322). Completes the parity fix that should have been
applied to all four path-filtered required checks at once.