Commit Graph

207 Commits

Author SHA1 Message Date
Hongming Wang
3d8a0a58fa ci(auto-sync): App-token dispatch + ubuntu-latest + workflow_dispatch
auto-sync-main-to-staging.yml hasn't fired since 2026-04-29 despite
multiple staging→main promotes since. The promote PR #2442 (Phase 2)
has been wedged on `mergeStateStatus: BEHIND` for hours because
staging is missing the merge commit from PR #2437.

Three compounding bugs, all fixed here:

1. **GitHub no-recursion suppresses the `on: push` trigger.**
   When the merge queue lands a staging→main promote, the resulting
   push to main is "by GITHUB_TOKEN", and per
   https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow
   that push event does NOT fire any downstream workflows. Verified
   empirically against SHA 76c604fb (PR #2437): exactly ONE workflow
   fired on that push — `publish-workspace-server-image`, dispatched
   explicitly by auto-promote-staging.yml's polling tail with an App
   token (the documented #2357 workaround). Every other `on: push`
   workflow on main, including auto-sync, was silently suppressed.

   Same fix extended here: auto-promote-staging.yml's polling tail
   now ALSO dispatches `auto-sync-main-to-staging.yml --ref main`
   via the App token after the merge lands. App-initiated dispatch
   propagates `workflow_run` cascades, which is what the publish
   tail relies on too. Failure path: emits `::error::` with the
   recovery command — operator runs it once and the next promote
   self-heals.

   auto-sync.yml gains `workflow_dispatch:` so it can be invoked
   from the dispatch above + manually if a future promote also
   misses (defense in depth).
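
   The dispatch tail above can be sketched as follows. A minimal
   sketch, not the real workflow step: `gh` is stubbed to echo so the
   call shape is inspectable offline, and the error text is
   illustrative.

```shell
#!/bin/sh
# Polling-tail sketch: after the promote merge lands, dispatch
# auto-sync so its workflow_run cascade propagates. GH_TOKEN must be
# the App installation token, NOT GITHUB_TOKEN (bot-token dispatches
# suppress downstream workflow_run triggers). GH defaults to a stub
# that echoes the command instead of calling GitHub.
GH=${GH:-"echo gh"}

dispatch_auto_sync() {
  if ! $GH workflow run auto-sync-main-to-staging.yml --ref main; then
    # failure path: surface the one-shot recovery command
    echo "::error::auto-sync dispatch failed; run manually:" \
         "gh workflow run auto-sync-main-to-staging.yml --ref main"
    return 1
  fi
}

dispatch_auto_sync
```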

2. **`runs-on: [self-hosted, macos, arm64]` was wrong for this repo.**
   Comment claimed "matches the rest of this repo's workflows" — false:
   this is the ONLY workflow in molecule-core/.github/workflows/ with
   a non-ubuntu runs-on. Copy-paste artefact from molecule-controlplane
   (which IS private and has a Mac runner). molecule-core has no Mac
   runner registered, so even when the trigger DID fire (the 3 historic
   manual-UI merges), the job would have sat unassigned if the runner
   were offline. Switched to `ubuntu-latest` to match every other
   workflow in this repo.

3. **The `on: push` trigger remains** as a defense-in-depth path for
   the rare case of a manual UI merge by a real user (which uses
   their PAT and DOES fire downstream workflows — confirmed via the
   2026-04-29 d35a2420 run with `triggering_actor=HongmingWang-Rabbit`
   that fired 16 workflows including auto-sync). Belt-and-suspenders.

Long-term: switching auto-promote's `gh pr merge --auto` call to use
the App token (instead of GITHUB_TOKEN) would let `on: push` triggers
fire naturally and obviate the need for the explicit dispatches in
the polling tail. Tracked in #2357 — out of scope here.

Operator recovery for the current Phase 2 wedge: after this lands on
staging, dispatch auto-sync once via
`gh workflow run auto-sync-main-to-staging.yml --ref main` to
backfill the missed sync from 76c604fb. PR #2442 will go from
BEHIND → CLEAN and auto-merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:28:35 -07:00
Hongming Wang
c275716005 harness(phase-2): multi-tenant compose + cross-tenant isolation replays
Brings the local harness from "single tenant covering the request path"
to "two tenants covering both the request path AND the per-tenant
isolation boundary" — the same shape production runs (one EC2 + one
Postgres + one MOLECULE_ORG_ID per tenant).

Why this matters: the four prior replays exercise the SaaS request
path against one tenant. They cannot prove that TenantGuard rejects
a misrouted request (production CF tunnel + AWS LB are the failure
surface), nor that two tenants doing legitimate work in parallel
keep their `activity_logs` / `workspaces` / connection-pool state
partitioned. Both are real bug classes — TenantGuard allowlist drift
shipped #2398, lib/pq prepared-statement cache collision is documented
as an org-wide hazard.

What changed:

1. compose.yml — split into two tenants.
   tenant-alpha + postgres-alpha + tenant-beta + postgres-beta + the
   shared cp-stub, redis, cf-proxy. Each tenant gets a distinct
   ADMIN_TOKEN + MOLECULE_ORG_ID and its own Postgres database. cf-proxy
   depends on both tenants becoming healthy.

2. cf-proxy/nginx.conf — Host-header → tenant routing.
   `map $host $tenant_upstream` resolves the right backend per request.
   Required `resolver 127.0.0.11 valid=30s ipv6=off;` because nginx
   needs an explicit DNS resolver to use a variable in `proxy_pass`
   (literal hostnames resolve once at startup; variables resolve per
   request — without the resolver nginx fails closed with 502).
   `server_name` lists both tenants + the legacy alias so unknown Host
   headers don't silently route to a default and mask routing bugs.
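
   A minimal nginx.conf sketch of that shape (hostnames and ports are
   illustrative, not the real harness values):

```nginx
# A variable in proxy_pass forces per-request DNS resolution, which
# requires an explicit resolver; 127.0.0.11 is Docker's embedded DNS.
resolver 127.0.0.11 valid=30s ipv6=off;

map $host $tenant_upstream {
    alpha.harness.local  tenant-alpha:8080;
    beta.harness.local   tenant-beta:8080;
    # no default: an unknown Host yields an empty upstream and a hard
    # error instead of silently routing to one tenant
}

server {
    listen 80;
    server_name alpha.harness.local beta.harness.local legacy.harness.local;

    location / {
        proxy_set_header Host $host;
        # literal hostnames resolve once at startup; a variable
        # resolves per request via the resolver above
        proxy_pass http://$tenant_upstream;
    }
}
```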

3. _curl.sh — per-tenant + cross-tenant-negative helpers.
   `curl_alpha_admin` / `curl_beta_admin` set the right
   Host + Authorization + X-Molecule-Org-Id triple.
   `curl_alpha_creds_at_beta` / `curl_beta_creds_at_alpha` exist
   precisely to make WRONG requests (replays use them to assert
   TenantGuard rejects). `psql_exec_alpha` / `psql_exec_beta` shell out
   per-tenant Postgres exec. Legacy aliases (`curl_admin`, `psql_exec`)
   keep the four pre-Phase-2 replays working without edits.
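
   The helper shape, sketched with placeholder hosts/tokens/org ids
   (not the harness's real values); `curl` is stubbed to echo so the
   constructed header triple is visible offline:

```shell
#!/bin/sh
# _curl.sh-style helpers: each one pins a Host + Authorization +
# X-Molecule-Org-Id triple. CURL is stubbed to print the request
# instead of sending it.
CURL=${CURL:-"echo curl"}
ALPHA_HOST=alpha.harness.local ALPHA_TOKEN=alpha-token ALPHA_ORG=org-alpha
BETA_HOST=beta.harness.local   BETA_TOKEN=beta-token   BETA_ORG=org-beta

curl_alpha_admin() {
  $CURL -s -H "Host: $ALPHA_HOST" \
        -H "Authorization: Bearer $ALPHA_TOKEN" \
        -H "X-Molecule-Org-Id: $ALPHA_ORG" "$@"
}

# Deliberately WRONG pairing: alpha's credentials at beta's vhost,
# used by replays to assert TenantGuard rejects the request.
curl_alpha_creds_at_beta() {
  $CURL -s -H "Host: $BETA_HOST" \
        -H "Authorization: Bearer $ALPHA_TOKEN" \
        -H "X-Molecule-Org-Id: $ALPHA_ORG" "$@"
}

curl_alpha_creds_at_beta http://cf-proxy/workspaces
```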

4. seed.sh — registers parent+child workspaces in BOTH tenants.
   Captures server-generated IDs via `jq -r '.id'` (POST /workspaces
   ignores body.id, so the older client-side mint silently desynced
   from the workspaces table and broke FK-dependent replays). Stashes
   `ALPHA_PARENT_ID` / `ALPHA_CHILD_ID` / `BETA_PARENT_ID` /
   `BETA_CHILD_ID` to .seed.env, plus legacy `ALPHA_ID` / `BETA_ID`
   aliases for backwards compat with chat-history / channel-envelope.

5. New replays.

   tenant-isolation.sh (13 assertions) — TenantGuard 404s any request
   whose X-Molecule-Org-Id doesn't match the container's
   MOLECULE_ORG_ID. Asserts the 404 body has zero
   tenant/org/forbidden/denied keywords (the existence of a tenant
   must not be probeable from the outside). Covers cross-tenant
   routing misconfiguration + allowlist drift + missing-org-header.

   per-tenant-independence.sh (12 assertions) — both tenants seed
   activity_logs in parallel with distinct row counts (3 vs 5) and
   confirm each tenant's history endpoint returns exactly its own
   counts. Then a concurrent INSERT race (10 rows per tenant in
   parallel via `&` + wait) catches shared-pool corruption +
   prepared-statement cache poisoning + redis cross-keyspace bleed.
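
   The concurrency shape of that replay in miniature (files stand in
   for the two tenants' tables; the real replay inserts via psql):

```shell
#!/bin/sh
# Two writers race in parallel (& + wait); each side must end up with
# exactly its own rows, never the neighbor's.
alpha=$(mktemp); beta=$(mktemp)

insert_rows() {   # $1 = file, $2 = row count, $3 = tenant tag
  i=0
  while [ "$i" -lt "$2" ]; do
    echo "$3-row-$i" >> "$1"
    i=$((i + 1))
  done
}

insert_rows "$alpha" 10 alpha &
insert_rows "$beta"  10 beta  &
wait
echo "alpha=$(wc -l < "$alpha") beta=$(wc -l < "$beta")"
```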

6. Bug fix: down.sh + dump-logs SECRETS_ENCRYPTION_KEY validation.
   `docker compose down -v` validates the entire compose file even
   though it doesn't read the env. up.sh generates a per-run key into
   its own shell — down.sh runs in a fresh shell that wouldn't see it,
   so without a placeholder `compose down` exited non-zero before
   removing volumes. Workspaces silently leaked into the next
   ./up.sh + seed.sh boot. Caught when tenant-isolation.sh F1/F2 saw
   3× duplicate alpha-parent rows accumulated across three prior runs.
   Same fix applied to the workflow's dump-logs step.
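
   A hedged sketch of the down.sh fix (the placeholder value is
   illustrative; `docker compose` is stubbed to echo so this runs
   without Docker):

```shell
#!/bin/sh
# compose validates the whole file even for `down -v`, so a fresh
# shell must export *some* SECRETS_ENCRYPTION_KEY or teardown exits
# non-zero before removing volumes.
COMPOSE=${COMPOSE:-"echo docker compose"}
export SECRETS_ENCRYPTION_KEY="${SECRETS_ENCRYPTION_KEY:-placeholder-for-teardown}"
cmd=$($COMPOSE down -v)
echo "$cmd (key=$SECRETS_ENCRYPTION_KEY)"
```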

7. requirements.txt — pin molecule-ai-workspace-runtime>=0.1.78.
   channel-envelope-trust-boundary.sh imports from `molecule_runtime.*`
   (the wheel-rewritten path) so it catches the failure mode where
   the wheel build silently strips a fix that unit tests on local
   source still pass. CI was failing this replay because the wheel
   wasn't installed — caught in the staging push run from #2492.

8. .github/workflows/harness-replays.yml — Phase 2 plumbing.
   * Removed /etc/hosts step (Host-header path eliminated the need;
     scripts already source _curl.sh).
   * Updated dump-logs to reference the new service names
     (tenant-alpha + tenant-beta + postgres-alpha + postgres-beta).
   * Added SECRETS_ENCRYPTION_KEY placeholder env on the dump step.

Verified: ./run-all-replays.sh from a clean state — 6/6 passed
(buildinfo-stale-image, channel-envelope-trust-boundary, chat-history,
peer-discovery-404, per-tenant-independence, tenant-isolation).

Roadmap section updated: Phase 2 marked shipped. Phase 3 promoted to
"replace cp-stub with real molecule-controlplane Docker build + env
coherence lint."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 21:36:40 -07:00
Hongming Wang
e58e446444 docs(ci): correct test-ops-scripts.yml header — discover does NOT recurse
The previous header said `unittest discover from the scripts/ root
walks recursively`, contradicting the workflow body which runs two
passes precisely because discover does NOT recurse without
__init__.py. Fixed self-review feedback on PR #2440.
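
A synthetic demonstration of the non-recursion (layout mirrors
scripts/ vs scripts/ops/; requires python3): sub/ has no __init__.py,
so a discover pass rooted at the parent never sees its test, while a
second pass rooted at sub/ does.

```shell
#!/bin/sh
# Build a throwaway tree with one test in a non-package subdir.
root=$(mktemp -d)
mkdir "$root/sub"
cat > "$root/sub/test_demo.py" <<'EOF'
import unittest
class T(unittest.TestCase):
    def test_ok(self):
        self.assertTrue(True)
EOF
# Pass 1 from the parent: sub/ lacks __init__.py, so it is skipped.
pass1=$(cd "$root" && python3 -m unittest discover -s . -p 'test_*.py' 2>&1)
# Pass 2 rooted at sub/ itself: the test is found.
pass2=$(cd "$root/sub" && python3 -m unittest discover -s . -p 'test_*.py' 2>&1)
echo "parent pass: $(echo "$pass1" | grep -o 'Ran [0-9]* tests*')"
echo "sub pass:    $(echo "$pass2" | grep -o 'Ran [0-9]* tests*')"
```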

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 20:52:58 -07:00
Hongming Wang
f2545fcb57
Merge pull request #2440 from Molecule-AI/chore/wheel-rewriter-tests-and-noqa-cleanup
chore: rewriter unit tests + drop misleading noqa on import inbox
2026-05-01 03:48:33 +00:00
Hongming Wang
6e92fe0a08 chore: rewriter unit tests + drop misleading noqa on import inbox
Three small follow-ups to the PR #2433 → #2436 → #2439 incident chain.

1) `import inbox  # noqa: F401` in workspace/a2a_mcp_server.py was
   misleading — `inbox` IS used (at the bridge wiring inside main()).
   F401 means "imported but unused", which would mask a real future
   F401 if the usage is removed. Drop the noqa, keep the explanatory
   block comment about the rewriter's `import X` → `import mr.X as X`
   expansion (and the invalid `import X as Y` → `import mr.X as X as Y`
   expansion, the trap the comment exists to prevent re-introducing).

2) scripts/test_build_runtime_package.py — 17 unit tests covering
   `rewrite_imports()` and `build_import_rewriter()` in
   scripts/build_runtime_package.py. Until now the function had zero
   coverage despite the entire wheel build depending on it. Tests
   pin: bare-import aliasing, dotted-import preservation, indented
   imports, from-imports (simple + dotted + multi-symbol + block),
   the `import X as Y` rejection added in PR #2436 (with comment-
   stripping + indented + comma-not-alias edge cases), allowlist
   anchoring (`a2a` ≠ `a2a_tools`), and end-to-end reproduction
   of the PR #2433 failing pattern + the #2436 fix pattern.

3) Wire scripts/test_*.py into CI by adding a second discover pass
   to test-ops-scripts.yml. Top-level scripts/ tests live alongside
   their target file (parallels the scripts/ops/ test layout); the
   existing scripts/ops/ pass keeps running because scripts/ops/
   has no __init__.py so a single discover from scripts/ root
   doesn't recurse. Two passes is simpler than retrofitting
   namespace packages. Path filter widened from `scripts/ops/**`
   to `scripts/**` so PRs touching the build script trigger the
   new tests.
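
The expansion rule and the alias trap can be shown with a sed
stand-in (the real rewriter is Python in
scripts/build_runtime_package.py; the `molecule_runtime` rewrite
prefix is an assumption for illustration):

```shell
#!/bin/sh
# The anchored `$` is what rejects `import inbox as ib`: naively
# rewriting it would emit the invalid
# `import molecule_runtime.inbox as inbox as ib`.
rewrite() { sed -E 's/^import (inbox)$/import molecule_runtime.\1 as \1/'; }

echo 'import inbox'       | rewrite   # bare import: rewritten
echo 'import inbox as ib' | rewrite   # aliased import: left alone
```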

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 20:45:32 -07:00
Hongming Wang
3c16c27415 ci(wheel-smoke): always-run with per-step if-gates for required-check eligibility
The `PR-built wheel + import smoke` gate caught the broken wheel from
PR #2433 (`import inbox as _inbox_module` collision) but couldn't block
the merge because it isn't a required check on staging. Promoting it to
required is the right move per the runtime publish pipeline gates note
(2026-04-27 RuntimeCapabilities ImportError outage), but the existing
`paths: [workspace/**, scripts/...]` filter blocks PRs that don't touch
those paths from ever generating the check run — branch protection
would deadlock waiting on a check that never fires.

Refactor (same shape as e2e-api.yml's e2e-api job):
- Drop top-level `paths:` filter — workflow runs on every push/PR/
  merge_group event.
- Add `detect-changes` job using dorny/paths-filter to compute the
  `wheel=true|false` output.
- Collapse to ONE always-running `local-build-install` job named
  `PR-built wheel + import smoke`. Per-step `if:` gates on the
  detect output. PRs untouched by wheel-relevant paths emit a
  no-op SUCCESS step ("paths filter excluded this commit") so the
  check passes without rebuilding the wheel.
- merge_group + workflow_dispatch unconditionally `wheel=true` so
  the queue always validates the to-be-merged state, regardless of
  which PR composed it.

Why one-job-with-step-gates instead of two-jobs-sharing-name: SKIPPED
check runs block branch protection even when SUCCESS siblings exist
(verified PR #2264 incident, 2026-04-29). Single always-run job emits
exactly one SUCCESS check run regardless of paths filter.
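
A trimmed sketch of that shape (step names, filter globs, and the
build entry point are illustrative, not the real workflow contents):

```yaml
jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      wheel: ${{ steps.filter.outputs.wheel }}
    steps:
      - uses: actions/checkout@v4
      - id: filter
        uses: dorny/paths-filter@v2
        with:
          filters: |
            wheel:
              - 'workspace/**'

  local-build-install:
    name: PR-built wheel + import smoke   # exactly one check run, always
    needs: detect-changes
    runs-on: ubuntu-latest
    steps:
      - name: No-op pass
        if: needs.detect-changes.outputs.wheel != 'true' && github.event_name == 'pull_request'
        run: echo "paths filter excluded this commit"
      - name: Build wheel + import smoke
        if: needs.detect-changes.outputs.wheel == 'true' || github.event_name != 'pull_request'
        run: ./scripts/build_and_smoke.sh   # hypothetical entry point
```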

Follow-up: open a separate PR adding `PR-built wheel + import smoke`
to the staging branch protection's required_status_checks.contexts
once this lands. Doing both in one PR risks the protection update
firing before the workflow refactor merges, deadlocking unrelated PRs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 20:40:05 -07:00
Hongming Wang
c68ec23d3c
Merge pull request #2410 from Molecule-AI/auto/harness-replays-ci-gate
ci: gate PRs on tests/harness/run-all-replays.sh
2026-04-30 20:35:30 +00:00
Hongming Wang
0f0df576f5
Merge pull request #2392 from Molecule-AI/auto/e2e-staging-external-runtime
test(e2e): live staging regression for external-runtime awaiting_agent transitions
2026-04-30 20:32:23 +00:00
Hongming Wang
c8b17ea1ad fix(harness): install httpx for replay Python evals
peer-discovery-404 imports workspace/a2a_client.py which depends on
httpx; the runner's stock Python doesn't have it, so the replay's
PARSE assertion (b) fails with ModuleNotFoundError on every run. The
WIRE assertion (a) — pure curl — passes, so the partial green masked
the real cause, making the replay LOOK tenant-side broken when the
tenant side is in fact fine.

Adding tests/harness/requirements.txt with only httpx instead of
sourcing workspace/requirements.txt: that file pulls a2a-sdk,
langchain-core, opentelemetry, sqlalchemy, temporalio, etc. — ~30s
of install for one replay's PARSE step. The harness's deps surface
should grow when a new replay introduces a new import, not by
default.

Workflow gains one step (`pip install -r tests/harness/requirements.txt`)
between the /etc/hosts setup and run-all-replays. No other changes.
2026-04-30 13:32:00 -07:00
Hongming Wang
24cb2a286f ci(harness-replays): KEEP_UP=1 so dump-logs step has containers to read
First run on PR #2410 failed with 'container harness-tenant-1 is unhealthy'
but the dump-compose-logs step printed empty tenant logs because
run-all-replays.sh's trap-on-EXIT had already torn down the harness.

Setting KEEP_UP=1 leaves containers in place; the always-run Force
teardown step at the end owns cleanup explicitly. Now we'll actually
see why the tenant didn't become healthy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 13:15:46 -07:00
Hongming Wang
3105e87cf7 ci: gate PRs on tests/harness/run-all-replays.sh
Closes the gap between "the harness exists" and "the harness blocks bugs."
Phase 2 of the harness roadmap (per tests/harness/README.md): make
harness-based E2E a required CI check on every PR touching the tenant
binary or the harness itself.

Trigger: push + pull_request to staging+main, paths-filtered to
workspace-server/**, canvas/**, tests/harness/**, and this workflow.
merge_group support included so this becomes branch-protectable.

Single-job-with-conditional-steps pattern (matches e2e-api.yml). One
check run regardless of paths-filter outcome; satisfies branch
protection cleanly per the PR #2264 SKIPPED-in-set finding.

Why this exists: 2026-04-30 we shipped a TenantGuard allowlist gap
(/buildinfo added to router.go in #2398, never added to the allowlist)
that the existing buildinfo-stale-image.sh replay would have caught.
The harness was wired correctly; nobody ran it. Replays as a discipline
beat replays as a memory item.

The CI pipeline:
  detect-changes (paths filter)
    └ harness-replays (always)
        ├ no-op pass when paths-filter says no relevant change
        └ otherwise: checkout + sibling plugin checkout +
                     /etc/hosts entry + run-all-replays.sh +
                     compose-logs-on-failure + force-teardown

Compose logs from tenant/cp-stub/cf-proxy/postgres are dumped on
failure so a CI red is debuggable without re-reproducing locally.
The trap in run-all-replays.sh handles teardown; the always-run
down.sh step is a belt-and-suspenders against trap-bypass kills.

Follow-ups (not in this PR):
- Add this check to staging branch protection once it's been green
  for a few PRs (the new-workflow-instability hedge that other gates
  followed).
- Eventually wire the buildx GHA cache to speed up tenant image
  builds — currently every PR rebuilds the full Dockerfile.tenant
  (Go + Next.js + template clones) from scratch. Acceptable for now;
  optimize when the timeout-minutes:30 ceiling becomes painful.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 13:04:53 -07:00
Hongming Wang
ef206b5be6 refactor(ci): extract wheel smoke into shared script
publish-runtime.yml had a broad smoke (AgentCard call-shape, well-known
mount alignment, new_text_message) inline as a heredoc. runtime-prbuild-
compat.yml had a narrow inline smoke (just `from main import main_sync`).
Result: a PR could introduce SDK shape regressions that pass at PR time
and only fail at publish time, post-merge.

Extract the broad smoke into scripts/wheel_smoke.py and invoke it from
both workflows. PR-time gate now matches publish-time gate — same script,
same assertions. Eliminates the drift hazard of two heredocs that have
to be kept in lockstep manually.

Verified locally:
  * Built wheel from workspace/ source, installed in venv, ran smoke → pass
  * Simulated AgentCard kwarg-rename regression → smoke catches it as
    `ValueError: Protocol message AgentCard has no "supported_interfaces"
    field` (the exact failure mode of #2179 / supported_protocols incident)

Path filter for runtime-prbuild-compat extended to include
scripts/wheel_smoke.py so smoke-only edits get PR-validated.
publish-runtime path filter intentionally NOT extended — smoke-only
edits should
not auto-trigger a PyPI version bump.

Subset of #131 (the broader "invoke main() against stub config" goal
remains pending — main() needs a config dir + stub platform server).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 11:52:07 -07:00
Hongming Wang
9b909c4459 fix(ci): gate 50%-floor on TOTAL_VERIFIED >= 4
Self-review of #2403 caught a regression: with a 1-tenant fleet (the
exact case the original #2402 fix targeted), the new floor would
re-introduce the flake. Trace:

  TOTAL=1, UNREACHABLE=1, $((1/2))=0
  if 1 -gt 0 → TRUE → exit 1

The 50%-rule only meaningfully distinguishes "real outage" from
"teardown race" when the fleet is large enough that "half down" is
statistically meaningful. With 1-3 tenants, canary-verify is the
actual gate (it runs against the canary first and aborts the rollout
if the canary fails to come up).

Gate the floor on TOTAL_VERIFIED >= 4. Truth table:

  TOTAL  UNREACHABLE  RESULT
  1      1            soft-warn (original e2e flake case)
  4      2            soft-warn (exactly half)
  4      3            hard-fail (75% — real outage)
  10     6            hard-fail (60% — real outage)

Mirrored across staging.yml + main.yml.
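
The gated floor reduces to a few lines of shell (variable names
assumed to mirror the workflow); the truth table above falls out of
the integer division:

```shell
#!/bin/sh
# The 50% hard-fail only applies once the fleet is big enough for
# "half down" to be statistically meaningful (TOTAL_VERIFIED >= 4).
check_floor() {
  TOTAL_VERIFIED=$1; UNREACHABLE=$2
  if [ "$TOTAL_VERIFIED" -ge 4 ] && \
     [ "$UNREACHABLE" -gt $((TOTAL_VERIFIED / 2)) ]; then
    echo hard-fail
  else
    echo soft-warn
  fi
}

check_floor 1  1   # soft-warn: the original e2e flake case
check_floor 4  2   # soft-warn: exactly half
check_floor 4  3   # hard-fail: 75%
check_floor 10 6   # hard-fail: 60%
```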

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 11:40:31 -07:00
Hongming Wang
ec39fecda2 fix(ci): hard-fail when >50% of fleet unreachable post-redeploy
Belt-and-suspenders sanity floor on top of the unreachable-soft-warn
introduced earlier in this PR. Addresses the residual gap noted in
review: if a new image crashes on startup, every tenant ends up
unreachable, and the soft-warn alone would let that ship as a green
deploy. Canary-verify catches it on the canary tenant first, but this
guard is a fallback for canary-skip dispatches and same-batch races.

Threshold is 50% of healthz_ok-snapshotted tenants — comfortably above
the typical e2e-* teardown rate (5-10/hour, ~1 ephemeral tenant per
batch) but below any plausible real-outage scenario.

Mirrored across staging.yml + main.yml for shape parity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 11:35:56 -07:00
Hongming Wang
d45241cae7 fix(ci): distinguish unreachable from stale in /buildinfo verify step
The /buildinfo verify step (PR #2398) was treating "no /buildinfo response"
the same as "tenant returned wrong SHA" — both bumped MISMATCH_COUNT and
hard-failed the workflow. First post-merge run on staging caught a real
edge case: ephemeral E2E tenants (slug e2e-20260430-...) get torn down by
the E2E teardown trap between CP's healthz_ok snapshot and the verify step
running, so the verify step would dial into DNS that no longer resolves
and hard-fail on a benign condition.

The bug class we actually care about is STALE (tenant up + serving old
code, the #2395 root). UNREACHABLE post-redeploy is almost always a benign
teardown race; real "tenant up but unreachable" is caught by CP's own
healthz monitor + the alert pipeline, so double-counting it here was
making this workflow flaky on every staging push that overlapped E2E.

Wire:
  - Split MISMATCH_COUNT into STALE_COUNT + UNREACHABLE_COUNT.
  - STALE → hard-fail the workflow (the bug class we're guarding).
  - UNREACHABLE → warn (⚠️), don't fail. Reachable-mismatch still hard-fails.
  - Job summary surfaces both lists separately so on-call can tell at a
    glance which class fired.
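
The split in miniature (variable and function names assumed, not
lifted from the workflow): empty response counts as unreachable and
only warns, a wrong SHA counts as stale and hard-fails.

```shell
#!/bin/sh
STALE_COUNT=0; UNREACHABLE_COUNT=0

classify() {  # $1 = sha from /buildinfo ("" if curl failed), $2 = expected sha
  if [ -z "$1" ]; then
    UNREACHABLE_COUNT=$((UNREACHABLE_COUNT + 1))   # benign teardown race
  elif [ "$1" != "$2" ]; then
    STALE_COUNT=$((STALE_COUNT + 1))               # serving old code
  fi
}

classify ""       abc1234   # torn-down e2e tenant → unreachable
classify deadbee  abc1234   # wrong sha → stale
classify abc1234  abc1234   # fresh tenant → neither
echo "stale=$STALE_COUNT unreachable=$UNREACHABLE_COUNT"
[ "$STALE_COUNT" -eq 0 ] || echo "::error::stale tenants detected"
```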

Mirror in redeploy-tenants-on-main.yml for shape parity (prod has fewer
ephemeral tenants but identical asymmetry would be a gratuitous fork).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 11:25:46 -07:00
Hongming Wang
998e13c4bd feat(deploy): verify each tenant /buildinfo matches published SHA after redeploy
Closes the gap that let issue #2395 ship: redeploy-fleet workflows reported
ssm_status=Success based on SSM RPC return code alone, while EC2 tenants
silently kept serving the previous :latest digest because docker compose up
without an explicit pull is a no-op when the local tag already exists.

Wire:
  - new buildinfo package exposes GitSHA, set at link time via -ldflags from
    the GIT_SHA build-arg (default "dev" so test runs without ldflags fail
    closed against an unset deploy)
  - router exposes GET /buildinfo returning {git_sha} — public, no auth,
    cheap enough to curl from CI for every tenant
  - both Dockerfiles thread GIT_SHA into the Go build
  - publish-workspace-server-image.yml passes GIT_SHA=github.sha for both
    images
  - redeploy-tenants-on-main.yml + redeploy-tenants-on-staging.yml curl each
    tenant's /buildinfo after the redeploy SSM RPC and fail the workflow on
    digest mismatch; staging treats both :latest and :staging-latest as
    moving tags; verification is skipped only when an operator pinned a
    specific tag via workflow_dispatch

Tests:
  - TestGitSHA_DefaultDevSentinel pins the dev default
  - TestBuildInfoEndpoint_ReturnsGitSHA pins the wire shape that the
    workflow's jq lookup depends on

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:08 -07:00
Hongming Wang
8efb2dae8d fix(ci): handle empty E2E lookup in auto-promote-on-e2e gate
When gh run list returns [] (no E2E run on the main SHA — the common
case for canvas-only / cmd-only / sweep-only changes whose paths
don't trigger E2E), jq's `.[0]` is null and the interpolation
`"\(null)/\(null // "none")"` produces "null/none". The case
statement has no `null/none)` branch, so it falls into `*)` →
exit 1 → auto-promote-on-e2e fails → `:latest` doesn't get retagged
to the new SHA → tenants on `redeploy-tenants-on-main` end up
pulling the OLD `:latest` digest.

Surfaced 2026-04-30 17:00Z as the first observable consequence of
PR #2389 (App-token dispatch fix). Every prior auto-promote-on-e2e
run was triggered by E2E completion (the "Upstream is E2E itself"
short-circuit at line 151 fired before reaching the gate). #2389
made publish-image's completion event correctly fire workflow_run
listeners — auto-promote-on-e2e is one of those listeners — and
hit the latent jq bug on the first publish-upstream run.

Fix: change `.[0]` to `(.[0] // {})` in the jq filter so the empty-
array case becomes `none/none` (the documented "E2E paths-filtered
out for this SHA — proceed" branch) instead of the unhandled
`null/none`. Also default `.status` for the same defensive reason.

Verified the three input shapes locally:
  []                                          → "none/none"  ✓
  [{status:completed,conclusion:success}]     → "completed/success"  ✓
  [{status:in_progress,conclusion:null}]      → "in_progress/none"  ✓

Outer `|| echo "none/none"` fallback retained as defense-in-depth
for non-zero gh exits (network / auth failures).
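
The fixed filter against the three verified input shapes (requires
jq; shapes copied from the verification above):

```shell
#!/bin/sh
# `(.[0] // {})` turns the empty-array case into the handled
# "none/none" branch instead of the unhandled "null/none".
FILTER='(.[0] // {}) | "\(.status // "none")/\(.conclusion // "none")"'

a=$(echo '[]'                                              | jq -r "$FILTER")
b=$(echo '[{"status":"completed","conclusion":"success"}]' | jq -r "$FILTER")
c=$(echo '[{"status":"in_progress","conclusion":null}]'    | jq -r "$FILTER")
printf '%s\n%s\n%s\n' "$a" "$b" "$c"
```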

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:07:52 -07:00
Hongming Wang
79496dcffe test(e2e): live staging regression for external-runtime awaiting_agent transitions
Pins the four workspaces.status=awaiting_agent transitions on a real
staging tenant, end-to-end. Catches the class of silent enum failures
that migration 046 fix-forwarded — specifically:

  1. workspace.go:333 — POST /workspaces with runtime=external + no URL
     parks the row in 'awaiting_agent'. Pre-046 the UPDATE silently
     failed and the row stuck on 'provisioning'.

  2. registry.go:resolveDeliveryMode — registering an external workspace
     defaults delivery_mode='poll' (PR #2382). The harness asserts the
     poll default after register.

  3. registry/healthsweep.go:sweepStaleRemoteWorkspaces — after
     REMOTE_LIVENESS_STALE_AFTER (90s default) with no heartbeat, the
     workspace transitions back to 'awaiting_agent'. Pre-046 the sweep
     UPDATE silently failed and the workspace stuck on 'online' forever.

  4. Re-register from awaiting_agent → 'online' confirms the state is
     operator-recoverable, which is the whole reason for using
     awaiting_agent (vs. 'offline') as the external-runtime stale state.

The harness mirrors test_staging_full_saas.sh: tenant create →
DNS/TLS wait → tenant token retrieve → exercise → idempotent teardown
via EXIT/INT/TERM trap. Exit codes match the documented contract
{0,1,2,3,4}; raw bash exit codes are normalized so the safety-net
sweeper doesn't open false-positive incident issues.
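
The idempotent-teardown trap in the shape described above (teardown
body stubbed with an echo; the guard variable is illustrative):

```shell
#!/bin/sh
# cleanup runs at most once even though it is both called explicitly
# and re-fired by the EXIT trap.
out=$(sh -c '
  cleanup() {
    [ -n "${CLEANED:-}" ] && return 0
    CLEANED=1
    echo "teardown ran"
  }
  trap cleanup EXIT INT TERM
  cleanup        # explicit call on the happy path...
  exit 0         # ...and the EXIT trap fires again as a no-op
')
echo "$out"
```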

The companion workflow gates on the source files that touch this
lifecycle: workspace.go, registry.go, workspace_restart.go,
healthsweep.go, liveness.go, every migration, the static drift gate,
and the script + workflow themselves. Daily 07:30 UTC cron catches
infra drift on quiet days. cancel-in-progress=false because aborting
a half-rolled tenant leaves orphan resources for the safety-net to
clean.

Verification:
  - bash -n: ok
  - shellcheck: only the documented A && B || C pattern, identical to
    test_staging_full_saas.sh.
  - YAML parser: ok.
  - Workflow path filter matches every site that writes to the
    workspace_status enum (cross-checked against the drift gate's
    UPDATE workspaces / INSERT INTO workspaces enumeration).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 09:36:18 -07:00
Hongming Wang
e418d32582 ci(auto-promote): dispatch publish via molecule-ai App token to unblock workflow_run chain
Root cause (verified 2026-04-30): GITHUB_TOKEN-initiated
workflow_dispatch creates the dispatched run, but the resulting run's
completion event does NOT fire downstream `workflow_run` triggers.

This is the documented "no recursion" rule:
https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow

Evidence (publish-workspace-server-image runs on main):

  run_id      | head_sha  | triggering_actor      | canary | redeploy
  ------------+-----------+-----------------------+--------+----------
  25151545007 | 6ef562ee  | HongmingWang-Rabbit   |  YES   |  YES
  25171773918 | 21313dc   | github-actions[bot]   |  NO    |  NO
  25173801008 | 59dec57   | github-actions[bot]   |  NO    |  NO

The 06:52Z run that "worked" was an operator-fired dispatch from the
terminal — actor was the operator's PAT. The two runs that "dropped"
were dispatched by auto-promote-staging.yml's `gh workflow run` step
authenticated via `secrets.GITHUB_TOKEN`, so the actor became
`github-actions[bot]` and the workflow_run cascade was suppressed.

Same workflow file, same dispatch call, same successful publish run
— only the auth token differed.

Fix: mint a molecule-ai GitHub App installation token before the
dispatch step and use it as `GH_TOKEN`. App-initiated dispatches
DO propagate the workflow_run cascade (the App user is a real
identity, not the GITHUB_TOKEN bot pseudonym).

The molecule-ai App (app_id=3398844, installation 124443072) is
already installed on the org with `actions:write` — no new App
needed. Only secrets are missing.
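
One plausible wiring of the mint + dispatch steps, assuming
actions/create-github-app-token (step names illustrative):

```yaml
- name: Mint App token
  id: app-token
  uses: actions/create-github-app-token@v1
  with:
    app-id: ${{ secrets.MOLECULE_AI_APP_ID }}
    private-key: ${{ secrets.MOLECULE_AI_APP_PRIVATE_KEY }}

- name: Dispatch publish with App identity
  env:
    GH_TOKEN: ${{ steps.app-token.outputs.token }}   # NOT secrets.GITHUB_TOKEN
  run: gh workflow run publish-workspace-server-image.yml --ref main
```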

## Required setup before merge

The following repo secrets must be added at
https://github.com/Molecule-AI/molecule-core/settings/secrets/actions
or auto-promote will hard-fail at the new "Mint App token" step:

- `MOLECULE_AI_APP_ID`         = `3398844`
- `MOLECULE_AI_APP_PRIVATE_KEY` = contents of a .pem file generated at
  https://github.com/organizations/Molecule-AI/settings/installations/124443072

(Click "Generate a private key" if one doesn't exist yet.)

## Long-term cleanup

The polling tail step still exists because the auto-merge call
itself uses GITHUB_TOKEN, so the FF push to main doesn't fire
publish-workspace-server-image's `push` trigger naturally. Switching
the auto-merge call to use the SAME App token would eliminate the
polling tail entirely. Tracked in #2357.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:55:49 -07:00
Hongming Wang
a9391c5900 fix(ci): drop --depth=1 from migration collision check fetch
The check has been blocking the staging→main auto-promote PR (#2361)
since 2026-04-30T07:17Z with:

  fatal: origin/main...<head>: no merge base

Root cause: the workflow does `git fetch origin <base> --depth=1`
which overwrites checkout@v4's full-history clone with a shallow
tip — destroying the ancestry the subsequent
`git diff origin/main...HEAD` (three-dot, merge-base form) needs.

This deadlocks every staging→main promote PR until manually fixed.
The auto-promote runs were succeeding at the gate-check phase but
the subsequent PR-merge step waited 30 min for the failing check
and timed out, skipping the publish + redeploy dispatch tail.
Fleet recovery for any production-only fix went through staging
fine but never reached main.

Fix: drop --depth=1 so the explicit fetch preserves full history.
The leading comment is updated to call out this trap so a future
maintainer doesn't re-add the flag thinking it's a perf win.
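
The trap reproduces in a throwaway repo (requires git >= 2.28 for
`init -b`; all paths and branch names are synthetic): a shallow fetch
of the base ref severs the ancestry the three-dot diff needs.

```shell
#!/bin/sh
set -e
root=$(mktemp -d)
git init -q -b main "$root/origin"
git -C "$root/origin" -c user.email=a@b -c user.name=t \
    commit -q --allow-empty -m base
git -C "$root/origin" branch -q feature
git -C "$root/origin" -c user.email=a@b -c user.name=t \
    commit -q --allow-empty -m main-tip
# Full-history clone, like checkout's, with feature checked out.
git clone -q --no-local --branch feature "file://$root/origin" "$root/work"
cd "$root/work"
git fetch -q --depth=1 origin main   # the offending flag: records a
                                     # shallow graft for main's tip
if git merge-base FETCH_HEAD HEAD >/dev/null 2>&1; then
  result="merge base found"
else
  result="no merge base"             # the failure that wedged the promote PR
fi
echo "$result"
```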

No test added: this is a workflow-config one-liner that the
existing PR check itself exercises end-to-end (the real signal is
PR #2361 going green after this lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 05:28:03 -07:00
Hongming Wang
e06ebaefdf
Merge pull request #2346 from Molecule-AI/auto/issue-2341-migration-collision
ci: hard gate against migration version collisions (#2341)
2026-04-30 08:50:19 +00:00
Hongming Wang
26d5c5ba1f fix(ci): close gaps in auto-promote dispatch tail (#2358 follow-up)
Independent review of #2358 surfaced three gaps that the original
self-review missed. All three would manifest only on the FIRST real
staging→main promotion through the new tail step, so they'd silently
re-introduce the deploy-chain bug #2357 was supposed to fix.

1. **Missing `actions: write` permission.** `gh workflow run` POSTs to
   `/repos/.../actions/workflows/.../dispatches`, which requires the
   actions:write scope on GITHUB_TOKEN. The job had only contents:write
   + pull-requests:write, so the dispatch call would 403 on every run
   and the publish chain would still not fire. Adding the scope.

2. **No workflow-level concurrency block.** When CI + E2E Staging
   Canvas + E2E API Smoke + CodeQL all complete within seconds of each
   other on a green staging push (the typical case), four separate
   workflow_run events fire and four parallel auto-promote runs all
   reach the dispatch tail. They poll the same PR, all observe the
   same mergedAt, and all call `gh workflow run` — producing 2-4×
   redundant publish builds racing for the same `:staging-latest`
   retag and 2-4× canary-verify chains. Added
   `concurrency.group: auto-promote-staging, cancel-in-progress: false`.
   cancel-in-progress=false because killing a polling tail that's
   about to dispatch would re-introduce the original bug.


3. **PR closed-without-merge ties up a runner for 30 min.** If the
   merge queue rejects the PR (gates flip red post-approval), or an
   operator closes it manually, mergedAt stays null forever and the
   loop polls 60 × 30s burning a runner slot. Now also reads `state`
   in the same `gh pr view` call and breaks early when STATE=CLOSED.
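Fixes 1 and 2 land in the workflow header; a sketch of the resulting shape (a fragment, assuming the job previously declared only the first two scopes):

```
permissions:
  contents: write
  pull-requests: write
  actions: write        # gh workflow run POSTs to .../dispatches; needs this scope

concurrency:
  group: auto-promote-staging
  cancel-in-progress: false   # never kill a polling tail that is about to dispatch
```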

Verification on this PR is structural (workflow won't fire on a
staging→main promotion until this lands AND a subsequent staging
push triggers auto-promote). The actions:write fix in particular is
unverifiable until the next real run — the prior #2358 fix has
the same property, so we're stacking two unverifiable workflow
edits. That's intentional rather than risky: stage 1 (#2358) was
load-bearing for the deploy-chain restoration; stage 2 (this PR)
hardens it before it actually matters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 00:03:31 -07:00
Hongming Wang
d850ec7c8c
Merge pull request #2358 from Molecule-AI/auto/issue-2357-promote-dispatch-chain
fix(ci): dispatch publish chain after auto-promote merge (#2357)
2026-04-30 06:36:02 +00:00
Hongming Wang
9a7f61661b fix(ci): dispatch publish chain after auto-promote merge (#2357)
The auto-promote staging → main flow uses `gh pr merge --auto` with
GITHUB_TOKEN, which means GitHub suppresses downstream `push` events on
the resulting main commit. This is documented behavior — events created
by GITHUB_TOKEN do not trigger new workflow runs, with workflow_dispatch
and repository_dispatch as the only exceptions.

Effect: when the merge queue lands the auto-promote PR, the main push
DOES NOT fire publish-workspace-server-image. canary-verify + the
:staging-<sha> → :latest retag never run, so redeploy-tenants-on-main
also never fires. Tenants stay on stale code until someone manually
dispatches the chain (which is what just happened for issue #2339).

Fix here: after enqueuing auto-merge, poll for the PR to land, then
explicitly `gh workflow run publish-workspace-server-image.yml --ref
main`. workflow_dispatch is the documented exception, so the dispatch
event itself DOES create a new run. canary-verify and
redeploy-tenants-on-main chain via workflow_run as before.

Long-term (tracked in #2357): switch the auto-merge call above to a
GitHub App token (actions/create-github-app-token) so the merge event
itself can trigger the downstream chain naturally; the polling tail
becomes deletable.

Why a 30-min poll cap: merge queue typically lands a green promote PR
within 5-10 min. 30 min covers a slow CI run without hanging the
workflow indefinitely. If the merge times out, the step warns and
exits 0 — operator can manually dispatch as a fallback.
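The polling tail described above, sketched in shell (illustrative fragment of the workflow step; `$PR` is assumed to be set by an earlier step, and the workflow file name is the one named in this message):

```
# 60 iterations x 30s sleep = the 30-min cap described above.
for i in $(seq 1 60); do
  merged=$(gh pr view "$PR" --json mergedAt --jq .mergedAt)
  if [ -n "$merged" ] && [ "$merged" != "null" ]; then
    # workflow_dispatch is exempt from GITHUB_TOKEN event suppression,
    # so this dispatch DOES create a new run.
    gh workflow run publish-workspace-server-image.yml --ref main
    exit 0
  fi
  sleep 30
done
echo "::warning::promote PR did not merge within 30 min; dispatch the publish chain manually"
exit 0
```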

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:31:13 -07:00
Hongming Wang
a495b86a06 test(e2e): poll-mode + since_id cursor round-trip (#2339 PR 4)
End-to-end coverage for the canvas-chat unblocker. Exercises every
moving part of the #2339 stack against a real platform instance:

Phase 1 — register a workspace as delivery_mode=poll WITHOUT a URL;
verify the response carries delivery_mode=poll.
Phase 2 — invalid delivery_mode rejected with 400 (typo defense).
Phase 3 — POST A2A to the poll-mode workspace; verify proxyA2ARequest
short-circuits and returns 200 {status:queued, delivery_mode:poll,
method:message/send} without ever resolving an agent URL.
Phase 4 — verify the queued message appears in /activity?type=a2a_receive
with the right method + payload (the polling agent reads from here).
Phase 5 — since_id cursor returns ASC-ordered rows STRICTLY AFTER the
cursor; the cursor row itself must NOT be replayed. Sends two
follow-up messages and asserts ordering: rows[0] is the older new
event, rows[-1] is the newer.
Phase 6 — unknown / pruned cursor returns 410 Gone with an explanation.
Phase 7 — cross-workspace cursor isolation: a UUID belonging to one
workspace cannot be used to peek at another workspace's feed (returns
410, same as pruned, no info leak).
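The Phase 5 cursor contract (rows strictly after the cursor, ascending, cursor row never replayed) can be expressed as a tiny filter. A sketch over an in-memory id list; the real implementation is SQL against the activity table, and an unknown cursor yields 410 rather than an empty list:

```shell
# Return event ids strictly after the cursor, oldest first.
# $1 = cursor id; remaining args = event ids already in ASC order.
after_cursor() {
  local cursor=$1; shift
  local seen=0 id
  for id in "$@"; do
    # Emit only rows positioned after the cursor row itself.
    if [ "$seen" -eq 1 ]; then printf '%s\n' "$id"; fi
    [ "$id" = "$cursor" ] && seen=1
  done
}
```

An unknown cursor produces no output here; the API surfaces that case explicitly as 410 Gone instead of silently returning nothing.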

Idempotent: per-run unique workspace ids (date+pid). Trap-based cleanup
deletes the test rows on exit; no e2e_cleanup_all_workspaces call (see
feedback_never_run_cluster_cleanup_tests_on_live_platform.md).

Wired into .github/workflows/e2e-api.yml so it runs on every PR that
touches workspace-server/, tests/e2e/, or the workflow file itself —
same gate as the existing test_a2a_e2e + test_notify_attachments suites.

Stacked on #2354 (PR 3: since_id cursor).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:07:10 -07:00
Hongming Wang
db5d11ffca ci: continuous synthetic E2E against staging (#2342)
Hard gate Tier 2 item 2 of 4. Cron-driven full-lifecycle E2E that
catches regressions visible only at runtime — schema drift,
deployment-pipeline gaps, vendor outages, env-var rotations,
DNS / CF / Railway side-effects.

Empirical motivation from today:
  - #2345 (A2A v0.2 silent drop) — passed unit tests, broke at JSON-RPC
    parse layer between sender + receiver. Visible only when a sender
    exercises the full path. Now-fixed by PR #2349, but a continuous
    E2E would have surfaced it within 20 min of the regression.
  - RFC #2312 chat upload — landed staging-branch but never reached
    staging tenants because publish-workspace-server-image was main-
    only. Caught by manual dogfooding hours after deploy. Same pattern.

Both classes are invisible to PR-time CI. The continuous gate fires
every 20 min against a real staging tenant and surfaces regressions
within minutes.

Cadence: cron `0,20,40 * * * *` (3x/hour). Offsets the existing
sweep-cf-orphans (:15) and sweep-cf-tunnels (:45) so the three ops
don't burst CF/AWS APIs at the same minute. Concurrency group
prevents overlapping runs if one hangs.

Cost: ~$0.50-1/day GHA + pennies of staging tenant lifecycle.

Reuses existing tests/e2e/test_staging_full_saas.sh — no new harness
to maintain. Bounded at 10 min wall-clock (vs 15 min default) so
stuck runs fail fast rather than holding up the next firing.
Defaults to E2E_RUNTIME=langgraph (fastest cold start; the regression
classes this gate catches don't need hermes-specific paths). Operators
can dispatch with runtime=hermes when they want SDK-native coverage.

Schedule-vs-dispatch hardening: hard-fail on missing
CP_STAGING_ADMIN_API_TOKEN for cron firing (silent-skip would mask
real outages); soft-skip for operator dispatch.

Refs:
  - #2342 hard-gates Tier 2 item 2
  - #2345 (A2A v0.2 fix that this gate would have caught earlier)
  - #2335 / #2337 (deployment-pipeline gaps that this gate also catches)
2026-04-29 22:04:57 -07:00
Hongming Wang
ea8ff626a9 ci: hard gate against migration version collisions (#2341)
Two PRs targeting staging can each add a migration with the same
numeric prefix (e.g. 044_*.up.sql). Each passes CI independently.
They collide at merge time. Worst case: second migration silently
doesn't apply and prod schema drifts from what the code expects.

Caught manually 2026-04-30 during PR #2276 rebase: 044_runtime_image_pins
collided with 044_platform_inbound_secret from RFC #2312. This workflow
makes that detection automatic at PR-open time.

How it works:
  scripts/ops/check_migration_collisions.py runs on every PR that
  touches workspace-server/migrations/**. For each new/modified
  migration filename, extracts the numeric prefix and checks:

  1. Does the base branch already have a DIFFERENT migration file with
     the same prefix? (PR branched off an old base, base advanced and
     another PR landed the same number — needs rebase.)

  2. Is another OPEN PR (not this one) also adding a migration with
     the same prefix? (Race-window collision — both pass CI separately,
     would collide at merge time.)

Either case → exit 1 with a clear ::error:: message naming the
conflicting PR(s) so the author knows what to renumber.

Implementation notes:
  - Uses git ls-tree (not working-tree walk) so it works against any
    base ref without checkout.
  - Uses gh pr diff --name-only per open PR, bounded by `gh pr list
    --limit 100`. ~30s worst case for a busy repo, <5s normally.
  - --diff-filter=AM picks up Added or Modified — renaming a migration
    in place is also flagged (intentional; renaming migrations isn't
    safe).
  - Same filename in both PR and base = no collision (PR is editing
    in-place, fine).
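The load-bearing prefix classification can be approximated in a few lines of shell. A sketch only: the real check is Python, pairs .up/.down files, and also consults open PRs via gh:

```shell
# Given migration filenames, extract each numeric prefix and print any
# prefix claimed by two or more files (a version collision).
find_collisions() {
  printf '%s\n' "$@" \
    | sed -n 's/^\([0-9][0-9]*\)_.*/\1/p' \
    | sort | uniq -d
}
```

For example, the #2276 case: `find_collisions 044_runtime_image_pins.up.sql 044_platform_inbound_secret.up.sql` prints `044`.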

Tests:
  scripts/ops/test_check_migration_collisions.py — 9 cases on the
  regex classifier (the load-bearing piece). End-to-end git/gh path
  is exercised by running the workflow against real PRs.

Hard-gates Tier 1 item 1 (#2341). Cheapest, cleanest gate. Catches
one specific class of merge-time foot-gun automatically.

Refs hard-gates discussion 2026-04-30. Tier 1 of 4 (others tracked
in #2342, #2343, #2344).
2026-04-29 21:42:42 -07:00
Hongming Wang
856ff89973
Merge pull request #2338 from Molecule-AI/auto/redeploy-main-concurrency-parity
ci: add concurrency block to redeploy-tenants-on-main for parity
2026-04-30 04:16:53 +00:00
Hongming Wang
360361a0ce ci: add concurrency block to redeploy-tenants-on-main for parity
Parity with #2337's redeploy-tenants-on-staging.yml. Both prod and
staging redeploys now have explicit serialization:

  group: redeploy-tenants-on-main          (per-workflow, global)
  group: redeploy-tenants-on-staging       (per-workflow, global)

cancel-in-progress: false on both — aborting a half-rolled-out fleet
would leave tenants stuck on whatever image they happened to be on
when cancelled. Better to finish the in-flight rollout before starting
the next one.

Pre-fix this workflow relied on GitHub's implicit workflow_run queueing,

which is "probably fine" but not defensible — explicit > implicit for
load-bearing pipeline behavior. Picked up as a #2337 review nit
(architecture finding 1: concurrency asymmetry between the two
redeploy workflows).

No behavior change in the common case. The change matters only when
two main pushes land within seconds AND the first redeploy is still
mid-rollout — currently rare; will become more common once #2335
(staging-trigger publish) feeds main more frequently via auto-promote.
2026-04-29 21:14:41 -07:00
Hongming Wang
b7291e006b ci: serialize publish + auto-redeploy staging tenants
Two follow-ups from #2335 review (tracked in #2336):

1. Add `concurrency:` block to publish-workspace-server-image.yml so
   two rapid staging pushes don't race the same :staging-latest retag.
   Group is per-branch (`${{ github.ref }}`) so staging and main can
   build in parallel — they produce different :staging-<sha> tags and
   last-write-wins on :staging-latest is acceptable across branches.
   `cancel-in-progress: false` keeps in-flight builds — partially-pushed
   images would break canary-fleet pin consistency.

2. Add redeploy-tenants-on-staging.yml. After #2335, every staging push
   produces a fresh :staging-latest, but existing tenants only pick it
   up on next reprovision. This workflow mirrors redeploy-tenants-on-
   main but for staging:
   - workflow_run-gated to branches: [staging]
   - target_tag default 'staging-latest' (vs 'latest' for prod)
   - CP_URL default https://staging-api.moleculesai.app
   - CP_STAGING_ADMIN_API_TOKEN repo secret (operator must set)
   - canary_slug empty by default — staging is itself the canary; no
     sub-canary needed inside it. Soak still applies if operator
     specifies a tenant for blast-radius control.

   Schedule-vs-dispatch hardening matches sweep-cf-orphans/sweep-cf-
   tunnels: hard-fail on auto-trigger when secret missing so misconfig
   doesn't silently leave staging tenants on stale code; soft-skip on
   operator dispatch.

Operator action required after merge:
  Add CP_STAGING_ADMIN_API_TOKEN repo secret. Pull value from staging-
  CP's CP_ADMIN_API_TOKEN env in Railway controlplane / staging
  environment. Until set, the auto-trigger will fail the workflow run
  (visible as red CI), surfacing the misconfiguration. Workflow runs
  only on staging publish-workspace-server-image success, so no extra
  load while it sits unconfigured.

Verification:
- YAML lint clean on both workflows.
- Reviewed redeploy-tenants-on-main as template; differences are scoped
  to staging-specific values (URL, tag, secret name) + harden-on-missing-
  secret pattern.

Refs #2335, #2336.
2026-04-29 21:11:45 -07:00
Hongming Wang
2e1cef324b ci: trigger publish-workspace-server-image on staging push too
Root cause: this workflow only triggered on `branches: [main]`, but
staging-CP pins TENANT_IMAGE=:staging-latest (verified via Railway).
:staging-latest was only retagged on main push, so:

  staging-branch code → never built → never reaches staging tenants
  staging-CP serves   → "yesterday's main" indefinitely

When staging→main was wedged (path-filter parity bug, canvas teardown
race — both fixed earlier today), :staging-latest stopped updating
entirely. RFC #2312 (chat upload HTTP-forward) landed on staging but
freshly-provisioned staging tenants kept failing chat upload because
they pulled pre-RFC-#2312 image. Verified by tearing down a fresh
tenant and observing the legacy "workspace container not running"
error from the docker-exec code path that RFC #2312 deleted.

Pre-2026-04-24 there was a related-but-different incident: TENANT_IMAGE
was a static :staging-<sha> pin that drifted 10 days behind. This new
incident is "the dynamic pin still drifts when its update workflow
doesn't fire."

Fix: add `staging` to the branches trigger. Tag policy is unchanged
(:staging-<sha> + :staging-latest on every push). canary-verify.yml
still runs on main push (workflow_run-gated to `branches: [main]`),
preserving the canary-verified :latest promotion for prod tenants.

Steady state after this:
  - staging push → :staging-latest = staging-branch code → staging-CP
  - main push    → :staging-<sha> for canary, :staging-latest retag
                   (post-promote main code), and after canary green
                   → :latest for prod tenants

What this does NOT change:
  - canary-verify.yml flow (still main-only)
  - redeploy-tenants-on-main.yml (still rolls prod fleet on main push)
  - publish-canvas-image.yml (self-hosted standalone canvas; orthogonal)
  - The :latest tag (canary-verified main, unchanged)

What this does fix:
  - RFC #2312-class fixes that land on staging now actually reach
    staging tenants without waiting for staging→main promote.
  - The dogfooding observation "staging tenants seem to be running
    yesterday's code" disappears as a class.

Drive-by: also fixed the typo in the path-filter list (was
`publish-platform-image.yml`, the actual file is
`publish-workspace-server-image.yml`).
2026-04-29 21:00:56 -07:00
Hongming Wang
3a6d2f179d feat(ops): add sweep-cf-tunnels janitor — orphan Cloudflare Tunnels accumulate
CP's tenant-delete cascade removes the DNS record (with sweep-cf-orphans
as a backstop) but does NOT delete the underlying Cloudflare Tunnel.
Each E2E provision creates one Tunnel named `tenant-<slug>`; without
cleanup these accumulate indefinitely on the account, consuming the
tunnel quota and cluttering the dashboard.

Observed 2026-04-30: dozens of `tenant-e2e-canvas-*` tunnels in Down
state with zero replicas, weeks past their tenant's deletion. Same
class of bug as the DNS-records leak that drove sweep-cf-orphans
(controlplane#239).

Parallel-shape to sweep-cf-orphans:
  - Same dry-run-by-default + --execute pattern
  - Same MAX_DELETE_PCT safety gate (default 90% — higher than DNS
    sweep's 50% because tenant-shaped tunnels are orphans by design)
  - Same schedule/dispatch hardening (hard-fail on missing secrets
    when scheduled, soft-skip when dispatched)
  - Cron offset to :45 to avoid CF API bursts colliding with the DNS
    sweep at :15

Decision rules (in order):
  1. Name doesn't match `tenant-<slug>` → keep (unknown — never sweep
     tunnels that might belong to platform infra).
  2. Tunnel has active connections (status=healthy or non-empty
     connections array) → keep (defense-in-depth: don't kill a live
     tunnel even if CP forgot the org).
  3. Slug ∈ {prod_slugs ∪ staging_slugs} → keep.
  4. Otherwise → delete (orphan).
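The decision rules above, as a sketch of the classifier (function shape and inputs are illustrative; the real script also treats a non-empty connections array as live, not just status=healthy):

```shell
# Classify one tunnel as keep/delete per the rules above, in order.
# $1 = tunnel name  $2 = status  $3 = space-separated live slugs (prod + staging)
decide() {
  local name=$1 status=$2 live=$3 slug
  case "$name" in
    tenant-*) slug=${name#tenant-} ;;
    *) echo keep; return ;;                  # rule 1: unknown shape, never sweep
  esac
  [ "$status" = healthy ] && { echo keep; return; }  # rule 2: live tunnel
  case " $live " in
    *" $slug "*) echo keep ;;                # rule 3: slug still provisioned
    *) echo delete ;;                        # rule 4: orphan
  esac
}
```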

Verified by:
  - shell syntax check (bash -n)
  - YAML lint
  - Decide-logic offline smoke (7 cases, all pass)
  - End-to-end dry-run smoke with stubbed CP + CF APIs

Required secrets (added to existing org-secrets):
  CF_API_TOKEN          must include account:cloudflare_tunnel:edit
                        scope (separate from zone:dns:edit used by
                        sweep-cf-orphans — same token if scope is
                        broad, or a new token if narrowly scoped).
  CF_ACCOUNT_ID         account that owns the tunnels (visible in
                        dash.cloudflare.com URL path).
  CP_PROD_ADMIN_TOKEN   reused from sweep-cf-orphans.
  CP_STAGING_ADMIN_TOKEN reused from sweep-cf-orphans.

Note: CP-side root cause (tenant-delete should cascade to tunnel
delete) is in molecule-controlplane and worth fixing separately. This
janitor is the operational backstop in the meantime — same pattern
applied to DNS records when the same root cause was unaddressed.
2026-04-29 19:42:47 -07:00
Hongming Wang
15b98c4916 fix(e2e-canvas): kill teardown race that poisons concurrent runs
Setup wrote .playwright-staging-state.json at the END (step 7), only
after org create + provision-wait + TLS + workspace create + workspace-
online all succeeded. If setup crashed at steps 1-6, the org existed in
CP but the state file did not, so Playwright's globalTeardown bailed
out ("nothing to tear down") and the workflow safety-net pattern-swept
every e2e-canvas-<today>-* org to compensate. That sweep deleted
concurrent runs' live tenants — including their CF DNS records —
causing victims' next fetch to die with `getaddrinfo ENOTFOUND`.

Race observed 2026-04-30 on PR #2264 staging→main: three real-test
runs killed each other mid-test, blocking 68 commits of staging→main
promotion.

Fix: write the state file as setup's first action, right after slug
generation, before any CP call. Now:

  - Crash before slug gen        → no state file, no orphan to clean
  - Crash during steps 1-6       → state file has slug; teardown deletes
                                   it (DELETE 404s if org never created)
  - Setup completes              → state file has full state; teardown
                                   deletes the slug

The workflow safety-net no longer pattern-sweeps; it reads the state
file and deletes only the recorded slug. Concurrent canvas-E2E runs no
longer poison each other.

Verified by:
  - tsc --noEmit on staging-setup.ts + staging-teardown.ts
  - YAML lint on e2e-staging-canvas.yml
  - Code review: state file write moved to line 113 (post-makeSlug,
    pre-CP) with the original line-249 write retained as a "promote
    to full state" overwrite at the end
2026-04-29 19:23:56 -07:00
Hongming Wang
c8205b009a ci: daily Railway pin-audit cron + issue-on-failure (#2169)
Acceptance criterion 3 of #2001 ("CI check that fails if TENANT_IMAGE
contains a SHA-shaped suffix") was deferred from PR #2168 because
querying Railway from a GitHub Actions runner needs RAILWAY_TOKEN
plumbed as a repo secret. The detection script + regression test in
#2168 cover detection; this is the automation-cadence layer.

Daily 13:00 UTC schedule (06:00 PT) + workflow_dispatch. Daily is the
right cadence for variables-tier config — Railway env var changes are
deliberate operator actions, low-frequency. Hourly would risk Railway
API rate-limit surprises.

Issue-on-failure pattern mirrors e2e-staging-sanity.yml — drift opens
a `railway-drift` priority-high issue (or comments on the open one),
and a subsequent clean run auto-closes it with a "drift resolved"
comment. No human-in-the-loop needed for the close.

Schedule-vs-dispatch secret hardening per
feedback_schedule_vs_dispatch_secrets_hardening:
- Schedule trigger HARD-FAILS on missing RAILWAY_AUDIT_TOKEN
  (silent-success was the failure mode that bit us before)
- workflow_dispatch SOFT-SKIPS so an operator can dry-run the
  workflow shape during initial token provisioning

Operator action required before this gate is live:
- Provision a Railway API token, read-only `variables` scope on the
  molecule-platform project (id 7ccc8c68-61f4-42ab-9be5-586eeee11768)
- Store as repo secret RAILWAY_AUDIT_TOKEN
- Rotate per the standard 90-day schedule

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 17:43:01 -07:00
Hongming Wang
c79cf1cfa9 ci: collapse two-jobs-sharing-name path-filter pattern in e2e-api/e2e-staging-canvas
Branch protection treats matching-name check runs as a SET — any SKIPPED
member fails the required-check eval, even with SUCCESS siblings. The
two-jobs-sharing-name pattern (no-op + real-job) emits one SKIPPED + one
SUCCESS check run per workflow run; with multiple runs at the same SHA
(detect-changes triggers + auto-promote re-runs) the SET fills with
SKIPPED entries that block branch protection.

Verified live on PR #2264 (staging→main auto-promote): mergeStateStatus
stayed BLOCKED for 18+ hours despite APPROVED + MERGEABLE + all gates
green at the workflow level. `gh pr merge` returned "base branch policy
prohibits the merge"; `enqueuePullRequest` returned "No merge queue
found for branch 'main'". The check-runs API showed `E2E API Smoke
Test` and `Canvas tabs E2E` each had 2 SKIPPED + 2 SUCCESS at head SHA
66142c1e.

Fix: collapse no-op + real-job into ONE job with no job-level `if:`,
gating real work via per-step `if: needs.detect-changes.outputs.X ==
'true'`. The job always runs and emits exactly one SUCCESS check run
under the required-check name regardless of paths-filter outcome —
branch-protection-clean.

Same pattern as ci.yml's earlier conversion of Canvas/Platform/Python/
Shellcheck (PR #2322). Closes the parity-fix that should have been
applied to all four path-filtered required checks at once.
2026-04-29 17:29:44 -07:00
Hongming Wang
f7b9feb34f ci: ancestry-check on auto-promote :latest (#2244)
Two rapid main pushes whose E2Es complete out-of-order can promote
:latest backwards: SHA-A merges, SHA-B merges, SHA-B's E2E completes
first → :latest = staging-B → SHA-A's E2E completes → :latest = staging-A.
Now :latest is older than main's tip and stays wrong until the next
main push lands. The orphan-reconciler "next run corrects it" pattern
doesn't apply because there's no auto-corrective re-promote.

Detection: read the current :latest's `org.opencontainers.image.revision`
label (set by publish-workspace-server-image.yml at build time) and ask
the GitHub compare API how the candidate SHA relates to current. Branch
on `.status`:

  ahead     → retag (target newer)
  identical → retag is a no-op
  behind    → HARD FAIL (this is the race we're catching)
  diverged  → HARD FAIL (force-push or unusual history)
  error     → fail; manual dispatch can override
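The branch table above maps directly onto a case statement. A sketch; the real step also honors the manual-dispatch override described below:

```shell
# Map the compare-API .status for <current :latest>...<candidate> to an action.
promote_action() {
  case "$1" in
    ahead)     echo retag ;;   # candidate is newer: promote
    identical) echo noop ;;    # retag would be a no-op
    behind)    echo fail ;;    # the out-of-order race this gate catches
    diverged)  echo fail ;;    # force-push or unusual history
    *)         echo fail ;;    # API error: fail; operator can override
  esac
}
```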

Hard-fail rather than soft-skip per the approved design — silent-bypass
is the class we're moving away from per
feedback_schedule_vs_dispatch_secrets_hardening. Workflow goes red,
oncall sees it, operator decides whether to retry, force-promote, or
investigate. Manual dispatch skips the check (operator override),
matching the gate-step's existing semantics.

Backward-compat: when current :latest carries no revision label
(legacy image), skip-with-warning. All :latest images on main are
post-label as of 2026-04-29, so this branch becomes dead within 90 days
— TODO note in the step explains the cleanup.

No tests — the race is hypothetical at our scale (<1 occurrence/year
expected for a fleet of ≤20 paying tenants), and the only way to
exercise the new branches is to construct production-shape image
state. The hard-fail path sits behind the existing E2E gate-check, so
a regression in this step would surface as a failed promote (visible),
not a silent advance (invisible).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:18:42 -07:00
Hongming Wang
142b8e9d5b ci: collapse all 4 path-filtered required checks to single-job-with-conditional-steps
Supersedes #2321 + #2322. Applies the same shape uniformly across every
required check that uses a path filter: Canvas (Next.js), Platform (Go),
Python Lint & Test, Shellcheck (E2E scripts).

The bug + fix in one paragraph:

GitHub registers a check run for every job whose `name:` matches the
required-check context, regardless of whether the job actually executed.
A job-level `if:` that evaluates false produces a SKIPPED check run.
Branch protection's "required check" rule looks at the SET of check
runs with the matching context name on the latest commit and treats
any conclusion other than SUCCESS as not-passed — including SKIPPED.
Adding a sibling no-op job under the same `name:` (PR #2321 / #2322
attempt) doesn't help: branch protection still sees the SKIPPED
sibling and stays BLOCKED.

The shape that works: ONE job per required check name, no job-level
`if:`, all real work gated per-step. The job always runs and reports
SUCCESS regardless of which paths changed.
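Sketched as a workflow fragment (job and step names are illustrative, patterned on the Canvas job described below):

```
canvas-build:
  name: Canvas (Next.js)          # must match the required-check context exactly
  runs-on: ubuntu-latest
  needs: changes
  # no job-level `if:` -- the job must always run and report SUCCESS
  steps:
    - run: echo "no-op spin-up"
      working-directory: .
    - uses: actions/checkout@v4
      if: needs.changes.outputs.canvas == 'true'
    - run: npm ci && npm run build
      if: needs.changes.outputs.canvas == 'true'
      working-directory: canvas
```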

This patch:
  * Canvas (Next.js): drops the `canvas-build-noop` shadow added in
    #2321 (which didn't actually clear merge state — verified live on
    PR #2314). Refactors `canvas-build` to always run; gates checkout/
    setup-node/install/build/test on `if: needs.changes.outputs.canvas
    == 'true'`. Coverage upload step also gated.
  * Platform (Go): drops job-level `if:`. Gates checkout/setup-go/
    download/build/vet/lint/test/coverage-report/threshold-check on
    per-step `if:`.
  * Python Lint & Test: drops job-level `if:`. Gates checkout/setup-
    python/install/pytest on per-step `if:`.
  * Shellcheck (E2E scripts): drops job-level `if:`. Gates checkout/
    shellcheck-run on per-step `if:`.

Each refactored job adds a leading no-op echo step with a
`working-directory: .` override, so the always-running spin-up doesn't
fail when the job's default working-directory (workspace,
workspace-server, canvas) is absent because checkout was skipped.

Why all four in one PR: the bug shape is identical across all four,
and a future PR that only touches workspace-server (passing platform
filter, missing canvas/python/scripts) would hit the same BLOCKED state
on whichever filter it missed. PR-A and PR-2321 merged because their
diffs happened to trigger every filter; PR-B (#2314) only missed
canvas. Fixing one at a time means re-living this debugging cycle three
more times.

Cost: ~10s of always-on CI runtime per PR per job (the ubuntu-latest
spin-up + the no-op echo). 40s aggregate, negligible vs. the manual-
merge cost when BLOCKED catches us.

Memory `feedback_branch_protection_check_name_parity` already updated
(2026-04-29) to mark the original two-jobs-sharing-name pattern as
DO NOT FOLLOW and document the working shape this PR uses.

Refs PR #2321 (the misguided fix-attempt that this supersedes).
2026-04-29 16:09:22 -07:00
Hongming Wang
e22a56d351 ci: collapse Canvas (Next.js) to single job with conditional steps
Supersedes PR #2321's two-jobs-sharing-a-name approach, which didn't
actually clear branch-protection's required-check evaluation. Live
test on PR #2314: GraphQL `isRequired` confirmed BOTH check runs
under "Canvas (Next.js)" name (one SUCCESS via no-op, one SKIPPED via
real job) registered, and the SKIPPED one kept mergeStateStatus =
BLOCKED despite the SUCCESS sibling. Branch protection's "set of
matching contexts" semantic is stricter than what the durable feedback
memory documented — at least one passing check isn't enough; SKIPPED
counts as not-passed regardless.

Real fix: ONE job that always runs (no job-level `if:`), with all
real work gated on the path filter via per-step `if:`. Produces
exactly one "Canvas (Next.js)" check run per commit, always SUCCEEDS,
regardless of which paths changed. Costs ~10s of always-on CI runtime
per PR — negligible vs. the manual-merge cost when the BLOCKED state
catches us.

This same anti-pattern probably affects Platform (Go) (`platform`
filter), Python Lint & Test (`python` filter), and Shellcheck (E2E
scripts) (`scripts` filter) — all required, all path-gated. PR-A and
PR-2321 merged because they happened to trigger every filter; PR-B
only missed canvas. File a follow-up issue to apply the same
single-job-conditional-steps pattern across those required jobs to
remove the latent merge-blocker.

Updates feedback memory: branch_protection_check_name_parity is wrong
about "two jobs sharing name + at-least-one-success works." Need to
correct the note.
2026-04-29 16:01:38 -07:00
Hongming Wang
fcb2049f3f ci: add no-op shadow for Canvas (Next.js) required check
PRs that don't touch canvas/** paths skip the Canvas (Next.js) job via
its `if: needs.changes.outputs.canvas == 'true'` guard. GitHub reports
SKIPPED for that conclusion. Branch protection on staging requires
Canvas (Next.js) — and treats SKIPPED as not-passed, blocking merge
on every workspace-server-only or migration-only PR.

This is the design pattern documented in feedback memory
"branch_protection_check_name_parity": split into a real job + a
no-op shadow that share the same `name:`. Exactly one runs per PR;
both report the same check context, and at least one always reports
SUCCESS, satisfying the required check.

The no-op job runs in a few seconds (single `echo` step) and produces
the right check context for any PR that has changes outside canvas/**.

Concrete blocker that prompted this: PR #2314 (RFC #2312 PR-B) sat
APPROVED + CI-green + UP-TO-DATE for half an hour with mergeStateStatus
BLOCKED, traced via the GraphQL `isRequired` field to a single
SKIPPED Canvas (Next.js) check. PRs #2319 (PR-F) and the rest of the
RFC #2312 stack would have hit the same wall.
2026-04-29 15:44:07 -07:00
Hongming Wang
d8210514c1 ci(canvas): wire vitest --coverage into CI for baseline observability (#1815)
Step 2 of #1815. Step 1 (instrumentation in canvas/vitest.config.ts)
already shipped — the inline comment there explicitly defers wiring
into CI to a follow-up because turning on a 70% threshold blind would
either fail CI immediately or paper over a real gap with an ad-hoc
exclude list.

This PR ships the observability half:
- Replaces `npx vitest run` with `npx vitest run --coverage` in the
  canvas-build job. Coverage gets reported on every PR; no threshold
  gate yet (vitest.config.ts intentionally doesn't set thresholds).
- Adds an artifact upload step for canvas/coverage/ (HTML + json-summary)
  so reviewers can browse the coverage report from any PR. 7-day
  retention; if-no-files-found=warn so a step skip doesn't fail.

Step 3 (thresholds + hard gate) is the natural follow-up — track in a
new sub-issue once we've seen ~5-10 PRs of baseline data and know
where current coverage sits. The issue body proposed lines:70 /
functions:70 / branches:65 / statements:70; that may need adjustment
once the baseline lands.

Closes the Step-2 portion of #1815. Step 3 stays open or gets a fresh
issue depending on your preference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:51:34 -07:00
Hongming Wang
07a17c2e59 Merge remote-tracking branch 'origin/staging' into docs/auto-promote-staging-prereq-comment
# Conflicts:
#	.github/workflows/auto-promote-staging.yml
2026-04-28 20:46:42 -07:00
Hongming Wang
e373fa1a96 docs(ci): document auto-promote-staging GITHUB_TOKEN PR-create prereq
Add a comment block at the top of auto-promote-staging.yml naming the
load-bearing one-time repo setting that the workflow depends on:

  Settings → Actions → General → Workflow permissions
  →  Allow GitHub Actions to create and approve pull requests

Without this toggle, every workflow_run fails with
"GitHub Actions is not permitted to create or approve pull requests
(createPullRequest)". Observed 2026-04-29 01:43 UTC blocking the
fcd87b9 promotion (PRs #2248 + #2249); manually bridged via PR #2252.

The setting is invisible to anyone reading the workflow file, but the
workflow cannot do its job without it. Documenting here so the next
time it gets toggled off (org admin change, repo migration, audit
cleanup) the failure mode points at the cause rather than another
round of "why is auto-promote broken."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 18:49:07 -07:00
Hongming Wang
fcd87b9526
Merge pull request #2249 from Molecule-AI/fix/publish-runtime-cascade-hard-fail-on-push
fix(ci): hard-fail publish-runtime cascade on push when token missing
2026-04-29 01:33:10 +00:00
Hongming Wang
f1c6673e03 fix(ci): hard-fail publish-runtime cascade on push when token missing
Mirror the sweep-cf-orphans hardening (#2248) on publish-runtime's
TEMPLATE_DISPATCH_TOKEN gate. The previous behaviour was to print
"⚠️ skipping cascade — templates will pick up the new version
on their own next rebuild" and exit 0. That message is wrong: the 8
workspace-template repos only rebuild on this repository_dispatch
fanout. Without the dispatch they stay pinned to whatever runtime
version they last saw, and the gap is invisible until someone
notices a template several versions behind weeks later.

Behaviour after this PR:

  - push (auto-trigger on workspace/runtime/** changes) → exit 1
  - workflow_dispatch (manual operator)                  → exit 0
    with a warning (operator already accepted state; let them rerun
    after restoring the secret)

The token-missing path now also names the consequence concretely
("templates will NOT pick up the new version until this token is
restored") so future operators see the actionable line, not the
misleading "they'll catch up on their own" message.
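The push/workflow_dispatch split can be sketched as a small gate; the function name and exact wording below are illustrative, not the workflow's actual step:

```shell
# Illustrative sketch of the TEMPLATE_DISPATCH_TOKEN gate described
# above: hard-fail on push, soft-skip on workflow_dispatch.
gate_cascade() {
  token="$1"   # value of secrets.TEMPLATE_DISPATCH_TOKEN
  event="$2"   # value of github.event_name
  [ -n "$token" ] && return 0    # token present: dispatch proceeds
  msg="TEMPLATE_DISPATCH_TOKEN missing: templates will NOT pick up the new version until this token is restored"
  if [ "$event" = "workflow_dispatch" ]; then
    echo "::warning::$msg"       # operator already accepted the state
    return 0
  fi
  echo "::error::$msg"           # auto-trigger path: fail loudly
  return 1
}
```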

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 18:28:01 -07:00
Hongming Wang
667751919d
Merge pull request #2248 from Molecule-AI/fix/sweep-cf-orphans-hard-fail-on-schedule
fix(ci): hard-fail sweep-cf-orphans on schedule when secrets missing
2026-04-29 01:16:22 +00:00
Hongming Wang
9f39f3ef6c fix(ci): hard-fail sweep-cf-orphans on schedule when secrets missing
Replace the soft-skip-with-warning behaviour for scheduled runs of the
hourly Cloudflare orphan sweeper with an explicit failure when the six
required secrets aren't set. Manual workflow_dispatch keeps the
soft-skip path so an operator can short-circuit a deliberate rerun
without redoing the secrets dance — they accepted the state when they
clicked the button.

Why: from some-date to 2026-04-28, all six secrets were unset on the
repo. Every hourly tick printed a yellow ⚠️ warning and exited 0,
which GitHub registers as "completed/success" — the sweeper was
indistinguishable from a healthy janitor with nothing to do. Cloudflare
orphans accumulated unobserved to 152/200 (~76% of the zone quota),
and only surfaced via a manual audit. The mechanism to catch this kind
of regression is to make the workflow loud: red runs prompt
investigation, green runs are presumed healthy.

Schedule/workflow_run/push paths now print three ::error:: lines
naming the missing secrets, the fix, and a one-line reference to this
incident, then exit 1.
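A sketch of the event-keyed secrets gate; the secret names below are placeholders standing in for the six real Cloudflare secrets, and the exact messages are illustrative:

```shell
# Hypothetical sketch of the scheduled-run secrets gate: collect the
# missing names, then fail hard unless the run is operator-initiated.
check_secrets() {
  event="$1"; shift
  missing=""
  for name in "$@"; do
    eval "val=\${$name:-}"
    [ -n "$val" ] || missing="$missing $name"
  done
  [ -z "$missing" ] && return 0
  if [ "$event" = "workflow_dispatch" ]; then
    echo "::warning::missing secrets:$missing (operator-initiated, soft-skip)"
    return 0
  fi
  echo "::error::missing secrets:$missing"
  echo "::error::fix: restore them under Settings -> Secrets and variables -> Actions"
  echo "::error::why it matters: a silent skip hid 152 orphaned records (2026-04-28 incident)"
  return 1
}
```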

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 18:13:22 -07:00
Hongming Wang
5753021194
Merge pull request #2247 from Molecule-AI/fix/auto-promote-staging-pr-based
fix(ci): auto-promote-staging opens a PR + uses merge queue, not direct push
2026-04-29 00:57:33 +00:00
Hongming Wang
e45a5c98b0 fix(ci): auto-promote-staging opens a PR + uses merge queue, not direct push
Mirrors the fix #2234 applied to auto-sync-main-to-staging.yml in the
reverse direction. Both workflows now use the same merge-queue path
that humans use; no special-case bypass.

Why

Every tick of auto-promote-staging.yml since main's branch protection
went stricter has been failing with:

  remote: error: GH006: Protected branch update failed for refs/heads/main.
  remote: - Required status checks "Analyze (go)", "Analyze (javascript-typescript)",
    "Analyze (python)", "Canvas (Next.js)", "Detect changes",
    "E2E API Smoke Test", "Platform (Go)", "Python Lint & Test",
    and "Shellcheck (E2E scripts)" were not set by the expected
    GitHub apps.
  remote: - Changes must be made through a pull request.

The previous version did `git merge --ff-only origin/staging &&
git push origin main` directly. That works against a permissive
branch — it doesn't work against a ruleset that requires checks
satisfied by the expected GitHub apps. Only PR merges through the
queue produce check runs from the right apps.

Result was that today's 12+ merges to staging never propagated to
main; the auto-promote ran every tick and failed every tick, while
operators had to keep opening manual `staging → main` bridges.

Fix

  - Replace the direct git push step with a step that opens (or reuses)
    a PR base=main head=staging and enables auto-merge. The merge queue
    lands it once gates are green on the merge_group ref.
  - The PR's head IS the staging branch (no per-SHA promote branch
    needed) — the whole purpose is "advance main to staging's tip".
  - Add `pull-requests: write` permission so the workflow can call
    gh pr create + gh pr merge --auto.
  - Drop the `git merge-base --is-ancestor` divergence check — the
    merge queue itself enforces branch protection now, and rejects
    the PR if main has diverged from staging history.

Loop safety preserved: when this PR's merge lands on main, it
triggers auto-sync-main-to-staging.yml which opens a sync PR back
to staging. That sync PR's eventual merge is by GITHUB_TOKEN (the
merge queue) which doesn't trigger downstream workflow_run events
— so auto-promote-staging.yml does NOT re-fire from its own merge
landing.

Refs: #2234 (the parallel fix for auto-sync-main-to-staging.yml),
task #142, multiple failing runs visible in
https://github.com/Molecule-AI/molecule-core/actions/workflows/auto-promote-staging.yml
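The open-or-reuse step can be sketched as below. The flag set and PR title are assumptions, and a DRY_RUN guard stands in for real gh credentials so the shape is checkable anywhere:

```shell
# Sketch of the promote step: open (or reuse) a base=main head=staging
# PR and enable auto-merge so the merge queue lands it.
promote() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ gh pr create --base main --head staging --title 'promote: staging -> main'"
    echo "+ gh pr merge staging --auto --merge"
    return 0
  fi
  # Reuse an open staging->main PR if one exists, else create it.
  n=$(gh pr list --base main --head staging --state open \
        --json number --jq '.[0].number // empty')
  [ -n "$n" ] || gh pr create --base main --head staging \
      --title "promote: staging -> main" \
      --body "Automated promotion; lands via the merge queue."
  gh pr merge staging --auto --merge   # needs pull-requests: write
}
```

With a merge queue enabled on main, `gh pr merge --auto` queues the PR once its requirements are met rather than merging it directly.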
2026-04-28 17:54:15 -07:00
Hongming Wang
c68ea3a284
Merge pull request #2246 from Molecule-AI/chore/all-deps-batch-2026-04-28-pt2
chore(deps): batch dep bumps — 6 safe upgrades (4 actions majors + 2 npm dev deps)
2026-04-29 00:48:15 +00:00
Hongming Wang
fc59f939ac chore(deps): batch dep bumps — 6 safe upgrades (4 actions majors + 2 npm dev deps)
Consolidates the remaining safe-to-merge dependabot PRs from the
2026-04-28 wave into one consumable PR. Replaces three earlier
single-bump PRs (#2245, #2230, #2231) which were closed in favor of
this single batch — same pattern as #2235.

GitHub Actions majors (SHA-pinned per org convention):
  github/codeql-action       v3 → v4.35.2  (#2228)
  actions/setup-node         v4 → v6.4.0   (#2218)
  actions/upload-artifact    v4 → v7.0.1   (#2216)
  actions/setup-python       v5 → v6.2.0   (#2214)

npm dev deps (canvas/, lockfile regenerated in node:22-bookworm
container so @emnapi/* and other Linux-only optional deps are
properly resolved — Mac-native `npm install` strips them, which
caused the earlier #2235 batch to drop these two):
  @types/node                ^22 → ^25.6   (#2231)
  jsdom                      ^25 → ^29.1   (#2230)

Why each is safe

  setup-node v4 → v6 / setup-python v5 → v6:
    Every consumer call pins node-version / python-version
    explicitly. v5 / v6 changed defaults but pinned consumers
    are unaffected. Confirmed via grep across .github/workflows/
    — all setup-node call sites pin '20' or '22', all
    setup-python call sites pin '3.11'.

  codeql-action v3 → v4.35.2:
    Used as init/autobuild/analyze sub-actions in codeql.yml.
    v4 bundles a newer CodeQL CLI; ubuntu-latest auto-updates
    so functional behavior is unchanged. The deprecated
    CODEQL_ACTION_CLEANUP_TRAP_CACHES env var (per v4.35.2
    release notes) is undocumented and we don't set it.

  upload-artifact v4 → v7.0.1:
    v6 introduced Node.js 24 runtime requiring Actions Runner
    >= 2.327.1. All upload-artifact users (codeql.yml,
    e2e-staging-canvas.yml) run on `ubuntu-latest` (GitHub-
    hosted), which auto-updates the runner agent. Self-hosted
    runners are NOT used for these jobs.

  @types/node 22 → 25 / jsdom 25 → 29:
    Both are dev-only — @types/node is type definitions,
    jsdom backs vitest's DOM environment. Tests pass:
    79 files / 1154 tests in node:22-bookworm container.

Verified locally (Linux container so the lockfile reflects what
CI's `npm ci` will install):
  - cd canvas && npm install --include=optional → 169 packages
  - npm test → 1154/1154 pass
  - npm ci → clean install succeeds
  - npm run build → Next.js prerendering succeeds
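The pin audit cited under "Why each is safe" ("confirmed via grep across .github/workflows/") can be reproduced with a two-stage grep. Here it runs against a throwaway fixture so the command shape is checkable anywhere; run the same pipeline from the real repo root in practice:

```shell
# Recreate the pin audit ("all setup-node call sites pin '20' or '22'")
# against a toy fixture standing in for .github/workflows/.
dir=$(mktemp -d)
mkdir -p "$dir/.github/workflows"
cat > "$dir/.github/workflows/ci.yml" <<'EOF'
      - uses: actions/setup-node@v6
        with:
          node-version: '22'
      - uses: actions/setup-python@v6
        with:
          python-version: '3.11'
EOF
# Every setup-node / setup-python call site with its pinned version:
grep -rhA2 'uses: actions/setup-' "$dir/.github/workflows" \
  | grep -E 'node-version|python-version'
```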

Closes when this lands (the 3 individual auto-merge PRs from earlier
were closed):
  #2228 #2218 #2216 #2214 #2231 #2230

NOT included (CI failing on dependabot's own run — major framework
bumps that need code-side migration tasks, not safe auto-bumps):
  #2233 next 15 → 16
  #2232 tailwindcss 3 → 4
  #2226 typescript 5 → 6
2026-04-28 17:44:55 -07:00