molecule-core

Author	SHA1	Message	Date
Hongming Wang	e06ebaefdf	Merge pull request #2346 from Molecule-AI/auto/issue-2341-migration-collision ci: hard gate against migration version collisions (#2341)	2026-04-30 08:50:19 +00:00
Hongming Wang	26d5c5ba1f	fix(ci): close gaps in auto-promote dispatch tail (#2358 follow-up) Independent review of #2358 surfaced three gaps that the original self-review missed. All three would manifest only on the FIRST real staging→main promotion through the new tail step, so they'd silently re-introduce the deploy-chain bug #2357 was supposed to fix. 1. Missing `actions: write` permission. `gh workflow run` POSTs to `/repos/.../actions/workflows/.../dispatches`, which requires the actions:write scope on GITHUB_TOKEN. The job had only contents:write + pull-requests:write, so the dispatch call would 403 on every run and the publish chain would still not fire. Adding the scope. 2. No workflow-level concurrency block. When CI + E2E Staging Canvas + E2E API Smoke + CodeQL all complete within seconds of each other on a green staging push (the typical case), four separate workflow_run events fire and four parallel auto-promote runs all reach the dispatch tail. They poll the same PR, all observe the same mergedAt, and all call `gh workflow run` — producing 2-4× redundant publish builds racing for the same `:staging-latest` retag and 2-4× canary-verify chains. Added `concurrency.group: auto-promote-staging, cancel-in-progress: false`. cancel-in-progress=false because killing a polling tail that's about to dispatch would re-introduce the original bug. 3. PR closed-without-merge ties up a runner for 30 min. If the merge queue rejects the PR (gates flip red post-approval), or an operator closes it manually, mergedAt stays null forever and the loop polls 60 × 30s burning a runner slot. Now also reads `state` in the same `gh pr view` call and breaks early when STATE=CLOSED. Verification on this PR is structural (workflow won't fire on a staging→main promotion until this lands AND a subsequent staging push triggers auto-promote). 
The actions:write fix in particular is unverifiable until the next real run — the prior #2358 fix has the same property, so we're stacking two unverifiable workflow edits. That's intentional rather than risky: stage 1 (#2358) was load-bearing for the deploy-chain restoration; stage 2 (this PR) hardens it before it actually matters. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 00:03:31 -07:00
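The early-exit polling from item 3 of the commit above can be sketched as follows. This is a minimal Python illustration, not the workflow's actual shell step: `poll_for_merge` and `fetch_pr` are hypothetical names, and the fetcher is injected so the loop can be exercised without calling `gh pr view`. The 60 × 30s defaults come from the commit message.

```python
import time
from typing import Callable, Optional, Tuple

# (state, merged_at) as they would be read from a single `gh pr view` call.
PrSnapshot = Tuple[str, Optional[str]]


def poll_for_merge(
    fetch_pr: Callable[[], PrSnapshot],
    attempts: int = 60,
    interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return "merged", "closed", or "timeout" (hypothetical result labels)."""
    for attempt in range(attempts):
        state, merged_at = fetch_pr()
        if merged_at:
            # PR landed: the caller can now dispatch the publish chain.
            return "merged"
        if state == "CLOSED":
            # Closed without merge: mergedAt will never be set, so stop early
            # instead of holding a runner for the full poll window.
            return "closed"
        if attempt < attempts - 1:
            sleep(interval_s)
    return "timeout"
```

Only the `"merged"` outcome should lead to `gh workflow run`; the other two map to the warn-and-exit paths described in the commit.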
Hongming Wang	d850ec7c8c	Merge pull request #2358 from Molecule-AI/auto/issue-2357-promote-dispatch-chain fix(ci): dispatch publish chain after auto-promote merge (#2357)	2026-04-30 06:36:02 +00:00
Hongming Wang	9a7f61661b	fix(ci): dispatch publish chain after auto-promote merge (#2357) The auto-promote staging → main flow uses `gh pr merge --auto` with GITHUB_TOKEN, which means GitHub suppresses downstream `push` events on the resulting main commit. This is documented behavior — events created by GITHUB_TOKEN do not trigger new workflow runs, with workflow_dispatch and repository_dispatch as the only exceptions. Effect: when the merge queue lands the auto-promote PR, the main push DOES NOT fire publish-workspace-server-image. canary-verify + the :staging-<sha> → :latest retag never run, so redeploy-tenants-on-main also never fires. Tenants stay on stale code until someone manually dispatches the chain (which is what just happened for issue #2339). Fix here: after enqueuing auto-merge, poll for the PR to land, then explicitly `gh workflow run publish-workspace-server-image.yml --ref main`. workflow_dispatch is the documented exception, so the dispatch event itself DOES create a new run. canary-verify and redeploy-tenants-on-main chain via workflow_run as before. Long-term (tracked in #2357): switch the auto-merge call above to a GitHub App token (actions/create-github-app-token) so the merge event itself can trigger the downstream chain naturally; the polling tail becomes deletable. Why a 30-min poll cap: merge queue typically lands a green promote PR within 5-10 min. 30 min covers a slow CI run without hanging the workflow indefinitely. If the merge times out, the step warns and exits 0 — operator can manually dispatch as a fallback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 23:31:13 -07:00
Hongming Wang	a495b86a06	test(e2e): poll-mode + since_id cursor round-trip (#2339 PR 4) End-to-end coverage for the canvas-chat unblocker. Exercises every moving part of the #2339 stack against a real platform instance: Phase 1 — register a workspace as delivery_mode=poll WITHOUT a URL; verify the response carries delivery_mode=poll. Phase 2 — invalid delivery_mode rejected with 400 (typo defense). Phase 3 — POST A2A to the poll-mode workspace; verify proxyA2ARequest short-circuits and returns 200 {status:queued, delivery_mode:poll, method:message/send} without ever resolving an agent URL. Phase 4 — verify the queued message appears in /activity?type=a2a_receive with the right method + payload (the polling agent reads from here). Phase 5 — since_id cursor returns ASC-ordered rows STRICTLY AFTER the cursor; the cursor row itself must NOT be replayed. Sends two follow-up messages and asserts ordering: rows[0] is the older new event, rows[-1] is the newer. Phase 6 — unknown / pruned cursor returns 410 Gone with an explanation. Phase 7 — cross-workspace cursor isolation: a UUID belonging to one workspace cannot be used to peek at another workspace's feed (returns 410, same as pruned, no info leak). Idempotent: per-run unique workspace ids (date+pid). Trap-based cleanup deletes the test rows on exit; no e2e_cleanup_all_workspaces call (see feedback_never_run_cluster_cleanup_tests_on_live_platform.md). Wired into .github/workflows/e2e-api.yml so it runs on every PR that touches workspace-server/, tests/e2e/, or the workflow file itself — same gate as the existing test_a2a_e2e + test_notify_attachments suites. Stacked on #2354 (PR 3: since_id cursor). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 23:07:10 -07:00
Hongming Wang	db5d11ffca	ci: continuous synthetic E2E against staging (#2342 ) Hard gate Tier 2 item 2 of 4. Cron-driven full-lifecycle E2E that catches regressions visible only at runtime — schema drift, deployment-pipeline gaps, vendor outages, env-var rotations, DNS / CF / Railway side-effects. Empirical motivation from today: - #2345 (A2A v0.2 silent drop) — passed unit tests, broke at JSON-RPC parse layer between sender + receiver. Visible only when a sender exercises the full path. Now-fixed by PR #2349, but a continuous E2E would have surfaced it within 20 min of the regression. - RFC #2312 chat upload — landed staging-branch but never reached staging tenants because publish-workspace-server-image was main- only. Caught by manual dogfooding hours after deploy. Same pattern. Both classes are invisible to PR-time CI. The continuous gate fires every 20 min against a real staging tenant and surfaces regressions within minutes. Cadence: cron `0,20,40 * * * *` (3x/hour). Offsets the existing sweep-cf-orphans (:15) and sweep-cf-tunnels (:45) so the three ops don't burst CF/AWS APIs at the same minute. Concurrency group prevents overlapping runs if one hangs. Cost: ~$0.50-1/day GHA + pennies of staging tenant lifecycle. Reuses existing tests/e2e/test_staging_full_saas.sh — no new harness to maintain. Bounded at 10 min wall-clock (vs 15 min default) so stuck runs fail fast rather than holding up the next firing. Defaults to E2E_RUNTIME=langgraph (fastest cold start; the regression classes this gate catches don't need hermes-specific paths). Operators can dispatch with runtime=hermes when they want SDK-native coverage. Schedule-vs-dispatch hardening: hard-fail on missing CP_STAGING_ADMIN_API_TOKEN for cron firing (silent-skip would mask real outages); soft-skip for operator dispatch. 
Refs: - #2342 hard-gates Tier 2 item 2 - #2345 (A2A v0.2 fix that this gate would have caught earlier) - #2335 / #2337 (deployment-pipeline gaps that this gate also catches)	2026-04-29 22:04:57 -07:00
Hongming Wang	ea8ff626a9	ci: hard gate against migration version collisions (#2341 ) Two PRs targeting staging can each add a migration with the same numeric prefix (e.g. 044_.up.sql). Each passes CI independently. They collide at merge time. Worst case: second migration silently doesn't apply and prod schema drifts from what the code expects. Caught manually 2026-04-30 during PR #2276 rebase: 044_runtime_image_pins collided with 044_platform_inbound_secret from RFC #2312. This workflow makes that detection automatic at PR-open time. How it works: scripts/ops/check_migration_collisions.py runs on every PR that touches workspace-server/migrations/*. For each new/modified migration filename, extracts the numeric prefix and checks: 1. Does the base branch already have a DIFFERENT migration file with the same prefix? (PR branched off an old base, base advanced and another PR landed the same number — needs rebase.) 2. Is another OPEN PR (not this one) also adding a migration with the same prefix? (Race-window collision — both pass CI separately, would collide at merge time.) Either case → exit 1 with a clear ::error:: message naming the conflicting PR(s) so the author knows what to renumber. Implementation notes: - Uses git ls-tree (not working-tree walk) so it works against any base ref without checkout. - Uses gh pr diff --name-only per open PR, bounded by `gh pr list --limit 100`. ~30s worst case for a busy repo, <5s normally. - --diff-filter=AM picks up Added or Modified — renaming a migration in place is also flagged (intentional; renaming migrations isn't safe). - Same filename in both PR and base = no collision (PR is editing in-place, fine). Tests: scripts/ops/test_check_migration_collisions.py — 9 cases on the regex classifier (the load-bearing piece). End-to-end git/gh path is exercised by running the workflow against real PRs. Hard-gates Tier 1 item 1 (#2341). Cheapest, cleanest gate. Catches one specific class of merge-time foot-gun automatically. 
Refs hard-gates discussion 2026-04-30. Tier 1 of 4 (others tracked in #2342, #2343, #2344).	2026-04-29 21:42:42 -07:00
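The base-branch check (case 1 in the commit above) can be illustrated with a small sketch. This is not the contents of `scripts/ops/check_migration_collisions.py`; the regex and the names `parse_migration` and `find_collisions` are assumptions made for illustration, including the choice to treat `.up.sql` and `.down.sql` files of one migration as the same migration. The file lists would come from something like `git ls-tree` and `git diff --diff-filter=AM`; here they are passed in as plain lists.

```python
import re
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical classifier: numeric prefix, underscore, name, optional .up/.down, .sql
MIGRATION_RE = re.compile(r"^(?P<prefix>\d+)_(?P<name>.*?)(?:\.(?:up|down))?\.sql$")


def parse_migration(path: str) -> Optional[Tuple[str, str]]:
    """Return (prefix, name) for a migration file path, or None for other files."""
    m = MIGRATION_RE.match(path.rsplit("/", 1)[-1])
    return (m["prefix"], m["name"]) if m else None


def find_collisions(pr_files: Iterable[str], base_files: Iterable[str]) -> Dict[str, List[str]]:
    """Map each colliding prefix to the base-branch files that already use it."""
    base_parsed = [(f, parse_migration(f)) for f in base_files]
    collisions: Dict[str, set] = {}
    for pr_file in pr_files:
        key = parse_migration(pr_file)
        if key is None:
            continue  # not a migration file
        prefix, name = key
        for base_file, base_key in base_parsed:
            # Same prefix but a different migration is a collision;
            # the same migration on both sides is an in-place edit and is fine.
            if base_key and base_key[0] == prefix and base_key[1] != name:
                collisions.setdefault(prefix, set()).add(base_file)
    return {p: sorted(fs) for p, fs in collisions.items()}
```

A non-empty result would correspond to the `exit 1` path; the open-PR check (case 2) could reuse the same function with another PR's file list in place of `base_files`.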
Hongming Wang	856ff89973	Merge pull request #2338 from Molecule-AI/auto/redeploy-main-concurrency-parity ci: add concurrency block to redeploy-tenants-on-main for parity	2026-04-30 04:16:53 +00:00
Hongming Wang	360361a0ce	ci: add concurrency block to redeploy-tenants-on-main for parity Parity with #2337's redeploy-tenants-on-staging.yml. Both prod and staging redeploys now have explicit serialization: group: redeploy-tenants-on-main (per-workflow, global) group: redeploy-tenants-on-staging (per-workflow, global) cancel-in-progress: false on both — aborting a half-rolled-out fleet would leave tenants stuck on whatever image they happened to be on when cancelled. Better to finish the in-flight rollout before starting the next one. Pre-fix this workflow relied on GitHub's implicit workflow_run queueing, which is "probably fine" but not defensible — explicit > implicit for load-bearing pipeline behavior. Picked up as a #2337 review nit (architecture finding 1: concurrency asymmetry between the two redeploy workflows). No behavior change in the common case. The change matters only when two main pushes land within seconds AND the first redeploy is still mid-rollout — currently rare; will become more common once #2335 (staging-trigger publish) feeds main more frequently via auto-promote.	2026-04-29 21:14:41 -07:00
Hongming Wang	b7291e006b	ci: serialize publish + auto-redeploy staging tenants Two follow-ups from #2335 review (tracked in #2336): 1. Add `concurrency:` block to publish-workspace-server-image.yml so two rapid staging pushes don't race the same :staging-latest retag. Group is per-branch (`${{ github.ref }}`) so staging and main can build in parallel — they produce different :staging-<sha> tags and last-write-wins on :staging-latest is acceptable across branches. `cancel-in-progress: false` keeps in-flight builds — partially-pushed images would break canary-fleet pin consistency. 2. Add redeploy-tenants-on-staging.yml. After #2335, every staging push produces a fresh :staging-latest, but existing tenants only pick it up on next reprovision. This workflow mirrors redeploy-tenants-on- main but for staging: - workflow_run-gated to branches: [staging] - target_tag default 'staging-latest' (vs 'latest' for prod) - CP_URL default https://staging-api.moleculesai.app - CP_STAGING_ADMIN_API_TOKEN repo secret (operator must set) - canary_slug empty by default — staging is itself the canary; no sub-canary needed inside it. Soak still applies if operator specifies a tenant for blast-radius control. Schedule-vs-dispatch hardening matches sweep-cf-orphans/sweep-cf- tunnels: hard-fail on auto-trigger when secret missing so misconfig doesn't silently leave staging tenants on stale code; soft-skip on operator dispatch. Operator action required after merge: Add CP_STAGING_ADMIN_API_TOKEN repo secret. Pull value from staging- CP's CP_ADMIN_API_TOKEN env in Railway controlplane / staging environment. Until set, the auto-trigger will fail the workflow run (visible as red CI), surfacing the misconfiguration. Workflow runs only on staging publish-workspace-server-image success, so no extra load while it sits unconfigured. Verification: - YAML lint clean on both workflows. 
- Reviewed redeploy-tenants-on-main as template; differences are scoped to staging-specific values (URL, tag, secret name) + harden-on-missing- secret pattern. Refs #2335, #2336.	2026-04-29 21:11:45 -07:00
Hongming Wang	2e1cef324b	ci: trigger publish-workspace-server-image on staging push too Root cause: this workflow only triggered on `branches: [main]`, but staging-CP pins TENANT_IMAGE=:staging-latest (verified via Railway). :staging-latest was only retagged on main push, so: staging-branch code → never built → never reaches staging tenants staging-CP serves → "yesterday's main" indefinitely When staging→main was wedged (path-filter parity bug, canvas teardown race — both fixed earlier today), :staging-latest stopped updating entirely. RFC #2312 (chat upload HTTP-forward) landed on staging but freshly-provisioned staging tenants kept failing chat upload because they pulled pre-RFC-#2312 image. Verified by tearing down a fresh tenant and observing the legacy "workspace container not running" error from the docker-exec code path that RFC #2312 deleted. Pre-2026-04-24 there was a related-but-different incident: TENANT_IMAGE was a static :staging-<sha> pin that drifted 10 days behind. This new incident is "the dynamic pin still drifts when its update workflow doesn't fire." Fix: add `staging` to the branches trigger. Tag policy is unchanged (:staging-<sha> + :staging-latest on every push). canary-verify.yml still runs on main push (workflow_run-gated to `branches: [main]`), preserving the canary-verified :latest promotion for prod tenants. Steady state after this: - staging push → :staging-latest = staging-branch code → staging-CP - main push → :staging-<sha> for canary, :staging-latest retag (post-promote main code), and after canary green → :latest for prod tenants What this does NOT change: - canary-verify.yml flow (still main-only) - redeploy-tenants-on-main.yml (still rolls prod fleet on main push) - publish-canvas-image.yml (self-hosted standalone canvas; orthogonal) - The :latest tag (canary-verified main, unchanged) What this does fix: - RFC #2312-class fixes that land on staging now actually reach staging tenants without waiting for staging→main promote. 
- The dogfooding observation "staging tenants seem to be running yesterday's code" disappears as a class. Drive-by: also fixed the typo in the path-filter list (was `publish-platform-image.yml`, the actual file is `publish-workspace-server-image.yml`).	2026-04-29 21:00:56 -07:00
Hongming Wang	3a6d2f179d	feat(ops): add sweep-cf-tunnels janitor — orphan Cloudflare Tunnels accumulate CP's tenant-delete cascade removes the DNS record (with sweep-cf-orphans as a backstop) but does NOT delete the underlying Cloudflare Tunnel. Each E2E provision creates one Tunnel named `tenant-<slug>`; without cleanup these accumulate indefinitely on the account, consuming the tunnel quota and cluttering the dashboard. Observed 2026-04-30: dozens of `tenant-e2e-canvas-*` tunnels in Down state with zero replicas, weeks past their tenant's deletion. Same class of bug as the DNS-records leak that drove sweep-cf-orphans (controlplane#239). Parallel-shape to sweep-cf-orphans: - Same dry-run-by-default + --execute pattern - Same MAX_DELETE_PCT safety gate (default 90% — higher than DNS sweep's 50% because tenant-shaped tunnels are orphans by design) - Same schedule/dispatch hardening (hard-fail on missing secrets when scheduled, soft-skip when dispatched) - Cron offset to :45 to avoid CF API bursts colliding with the DNS sweep at :15 Decision rules (in order): 1. Name doesn't match `tenant-<slug>` → keep (unknown — never sweep tunnels that might belong to platform infra). 2. Tunnel has active connections (status=healthy or non-empty connections array) → keep (defense-in-depth: don't kill a live tunnel even if CP forgot the org). 3. Slug ∈ {prod_slugs ∪ staging_slugs} → keep. 4. Otherwise → delete (orphan). Verified by: - shell syntax check (bash -n) - YAML lint - Decide-logic offline smoke (7 cases, all pass) - End-to-end dry-run smoke with stubbed CP + CF APIs Required secrets (added to existing org-secrets): CF_API_TOKEN must include account:cloudflare_tunnel:edit scope (separate from zone:dns:edit used by sweep-cf-orphans — same token if scope is broad, or a new token if narrowly scoped). CF_ACCOUNT_ID account that owns the tunnels (visible in dash.cloudflare.com URL path). CP_PROD_ADMIN_TOKEN reused from sweep-cf-orphans. 
CP_STAGING_ADMIN_TOKEN reused from sweep-cf-orphans. Note: CP-side root cause (tenant-delete should cascade to tunnel delete) is in molecule-controlplane and worth fixing separately. This janitor is the operational backstop in the meantime — same pattern applied to DNS records when the same root cause was unaddressed.	2026-04-29 19:42:47 -07:00
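The four decision rules in the commit above can be written as one ordered function. This is a hedged Python sketch rather than the janitor's actual shell logic; `decide`, its parameters, and the `"keep"`/`"delete"` labels are hypothetical, and `live_slugs` stands for the union of prod and staging slugs reported by CP.

```python
import re
from typing import Sequence, Set

# Rule 1 pattern: only tunnels named `tenant-<slug>` are candidates.
TENANT_NAME_RE = re.compile(r"^tenant-(?P<slug>.+)$")


def decide(name: str, status: str, connections: Sequence[object], live_slugs: Set[str]) -> str:
    """Apply the rules in order and return "keep" or "delete"."""
    m = TENANT_NAME_RE.match(name)
    if m is None:
        return "keep"  # rule 1: unknown shape, may belong to platform infra
    if status == "healthy" or len(connections) > 0:
        return "keep"  # rule 2: never kill a tunnel with active connections
    if m["slug"] in live_slugs:
        return "keep"  # rule 3: slug still known to prod or staging CP
    return "delete"  # rule 4: orphan
```

Ordering matters here: the connection check runs before the slug lookup, so a live tunnel survives even when CP no longer lists its org.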
Hongming Wang	15b98c4916	fix(e2e-canvas): kill teardown race that poisons concurrent runs Setup wrote .playwright-staging-state.json at the END (step 7), only after org create + provision-wait + TLS + workspace create + workspace- online all succeeded. If setup crashed at steps 1-6, the org existed in CP but the state file did not, so Playwright's globalTeardown bailed out ("nothing to tear down") and the workflow safety-net pattern-swept every e2e-canvas-<today>-* org to compensate. That sweep deleted concurrent runs' live tenants — including their CF DNS records — causing victims' next fetch to die with `getaddrinfo ENOTFOUND`. Race observed 2026-04-30 on PR #2264 staging→main: three real-test runs killed each other mid-test, blocking 68 commits of staging→main promotion. Fix: write the state file as setup's first action, right after slug generation, before any CP call. Now: - Crash before slug gen → no state file, no orphan to clean - Crash during steps 1-6 → state file has slug; teardown deletes it (DELETE 404s if org never created) - Setup completes → state file has full state; teardown deletes the slug The workflow safety-net no longer pattern-sweeps; it reads the state file and deletes only the recorded slug. Concurrent canvas-E2E runs no longer poison each other. Verified by: - tsc --noEmit on staging-setup.ts + staging-teardown.ts - YAML lint on e2e-staging-canvas.yml - Code review: state file write moved to line 113 (post-makeSlug, pre-CP) with the original line-249 write retained as a "promote to full state" overwrite at the end	2026-04-29 19:23:56 -07:00
Hongming Wang	c8205b009a	ci: daily Railway pin-audit cron + issue-on-failure (#2169) Acceptance criterion 3 of #2001 ("CI check that fails if TENANT_IMAGE contains a SHA-shaped suffix") was deferred from PR #2168 because querying Railway from a GitHub Actions runner needs RAILWAY_TOKEN plumbed as a repo secret. The detection script + regression test in #2168 cover detection; this is the automation-cadence layer. Daily 13:00 UTC schedule (06:00 PT) + workflow_dispatch. Daily is the right cadence for variables-tier config — Railway env var changes are deliberate operator actions, low-frequency. Hourly would risk Railway API rate-limit surprises. Issue-on-failure pattern mirrors e2e-staging-sanity.yml — drift opens a `railway-drift` priority-high issue (or comments on the open one), and a subsequent clean run auto-closes it with a "drift resolved" comment. No human-in-the-loop needed for the close. Schedule-vs-dispatch secret hardening per feedback_schedule_vs_dispatch_secrets_hardening: - Schedule trigger HARD-FAILS on missing RAILWAY_AUDIT_TOKEN (silent-success was the failure mode that bit us before) - workflow_dispatch SOFT-SKIPS so an operator can dry-run the workflow shape during initial token provisioning Operator action required before this gate is live: - Provision a Railway API token, read-only `variables` scope on the molecule-platform project (id 7ccc8c68-61f4-42ab-9be5-586eeee11768) - Store as repo secret RAILWAY_AUDIT_TOKEN - Rotate per the standard 90-day schedule Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 17:43:01 -07:00
Hongming Wang	c79cf1cfa9	ci: collapse two-jobs-sharing-name path-filter pattern in e2e-api/e2e-staging-canvas Branch protection treats matching-name check runs as a SET — any SKIPPED member fails the required-check eval, even with SUCCESS siblings. The two-jobs-sharing-name pattern (no-op + real-job) emits one SKIPPED + one SUCCESS check run per workflow run; with multiple runs at the same SHA (detect-changes triggers + auto-promote re-runs) the SET fills with SKIPPED entries that block branch protection. Verified live on PR #2264 (staging→main auto-promote): mergeStateStatus stayed BLOCKED for 18+ hours despite APPROVED + MERGEABLE + all gates green at the workflow level. `gh pr merge` returned "base branch policy prohibits the merge"; `enqueuePullRequest` returned "No merge queue found for branch 'main'". The check-runs API showed `E2E API Smoke Test` and `Canvas tabs E2E` each had 2 SKIPPED + 2 SUCCESS at head SHA `66142c1e`. Fix: collapse no-op + real-job into ONE job with no job-level `if:`, gating real work via per-step `if: needs.detect-changes.outputs.X == 'true'`. The job always runs and emits exactly one SUCCESS check run under the required-check name regardless of paths-filter outcome — branch-protection-clean. Same pattern as ci.yml's earlier conversion of Canvas/Platform/Python/ Shellcheck (PR #2322). Closes the parity-fix that should have been applied to all four path-filtered required checks at once.	2026-04-29 17:29:44 -07:00
Hongming Wang	f7b9feb34f	ci: ancestry-check on auto-promote :latest (#2244 ) Two rapid main pushes whose E2Es complete out-of-order can promote :latest backwards: SHA-A merges, SHA-B merges, SHA-B's E2E completes first → :latest = staging-B → SHA-A's E2E completes → :latest = staging-A. Now :latest is older than main's tip and stays wrong until the next main push lands. The orphan-reconciler "next run corrects it" pattern doesn't apply because there's no auto-corrective re-promote. Detection: read the current :latest's `org.opencontainers.image.revision` label (set by publish-workspace-server-image.yml at build time) and ask the GitHub compare API how the candidate SHA relates to current. Branch on `.status`: ahead → retag (target newer) identical → retag is a no-op behind → HARD FAIL (this is the race we're catching) diverged → HARD FAIL (force-push or unusual history) error → fail; manual dispatch can override Hard-fail rather than soft-skip per the approved design — silent-bypass is the class we're moving away from per feedback_schedule_vs_dispatch_secrets_hardening. Workflow goes red, oncall sees it, operator decides whether to retry, force-promote, or investigate. Manual dispatch skips the check (operator override), matching the gate-step's existing semantics. Backward-compat: when current :latest carries no revision label (legacy image), skip-with-warning. All :latest images on main are post-label as of 2026-04-29, so this branch becomes dead within 90 days — TODO note in the step explains the cleanup. No tests — the race is hypothetical at our scale (<1 occurrence/year expected for a fleet of ≤20 paying tenants), and the only way to exercise the new branches is to construct production-shape image state. The dry-fall path lands behind the existing E2E gate-check, so a regression in this step would surface as a failed promote (visible), not a silent advance (invisible). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:18:42 -07:00
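The branching on the compare API's `.status` described in the commit above can be sketched as a pure decision function. This is an illustrative Python sketch, not the workflow step itself; `promote_decision` and its return labels are hypothetical, and the status string is assumed to have been fetched already from the GitHub compare endpoint.

```python
def promote_decision(
    compare_status: str,
    manual_dispatch: bool = False,
    has_revision_label: bool = True,
) -> str:
    """Return "retag", "retag-with-warning", or "fail" for a candidate SHA."""
    if manual_dispatch:
        return "retag"  # operator override: the ancestry check is skipped
    if not has_revision_label:
        # Legacy :latest image without org.opencontainers.image.revision:
        # the check cannot run, so proceed but warn.
        return "retag-with-warning"
    if compare_status in ("ahead", "identical"):
        return "retag"  # candidate is newer, or the retag is a no-op
    # "behind" is the out-of-order race, "diverged" is unusual history,
    # and anything else is treated as an error; all hard-fail.
    return "fail"
```

Keeping the decision separate from the API call makes each branch checkable without constructing production-shape image state.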
Hongming Wang	142b8e9d5b	ci: collapse all 4 path-filtered required checks to single-job-with-conditional-steps Supersedes #2321 + #2322. Applies the same shape uniformly across every required check that uses a path filter: Canvas (Next.js), Platform (Go), Python Lint & Test, Shellcheck (E2E scripts). The bug + fix in one paragraph: GitHub registers a check run for every job whose `name:` matches the required-check context, regardless of whether the job actually executed. A job-level `if:` that evaluates false produces a SKIPPED check run. Branch protection's "required check" rule looks at the SET of check runs with the matching context name on the latest commit and treats any conclusion other than SUCCESS as not-passed — including SKIPPED. Adding a sibling no-op job under the same `name:` (PR #2321 / #2322 attempt) doesn't help: branch protection still sees the SKIPPED sibling and stays BLOCKED. The shape that works: ONE job per required check name, no job-level `if:`, all real work gated per-step. The job always runs and reports SUCCESS regardless of which paths changed. This patch: * Canvas (Next.js): drops the `canvas-build-noop` shadow added in #2321 (which didn't actually clear merge state — verified live on PR #2314). Refactors `canvas-build` to always run; gates checkout/ setup-node/install/build/test on `if: needs.changes.outputs.canvas == 'true'`. Coverage upload step also gated. * Platform (Go): drops job-level `if:`. Gates checkout/setup-go/ download/build/vet/lint/test/coverage-report/threshold-check on per-step `if:`. * Python Lint & Test: drops job-level `if:`. Gates checkout/setup- python/install/pytest on per-step `if:`. * Shellcheck (E2E scripts): drops job-level `if:`. Gates checkout/ shellcheck-run on per-step `if:`. 
Each refactored job adds a leading no-op echo step with `working-directory: .` override so the always-running spin-up doesn't fail when the path- filter-true working-directory (workspace, workspace-server, canvas) doesn't exist after no-op checkout. Why all four in one PR: the bug shape is identical across all four, and a future PR that only touches workspace-server (passing platform filter, missing canvas/python/scripts) would hit the same BLOCKED state on whichever filter it missed. PR-A and PR-2321 merged because their diffs happened to trigger every filter; PR-B (#2314) only missed canvas. Fixing one at a time means re-living this debugging cycle three more times. Cost: ~10s of always-on CI runtime per PR per job (the ubuntu-latest spin-up + the no-op echo). 40s aggregate, negligible vs. the manual- merge cost when BLOCKED catches us. Memory `feedback_branch_protection_check_name_parity` already updated (2026-04-29) to mark the original two-jobs-sharing-name pattern as DO NOT FOLLOW and document the working shape this PR uses. Refs PR #2321 (the misguided fix-attempt that this supersedes).	2026-04-29 16:09:22 -07:00
Hongming Wang	e22a56d351	ci: collapse Canvas (Next.js) to single job with conditional steps Supersedes PR #2321's two-jobs-sharing-a-name approach, which didn't actually clear branch-protection's required-check evaluation. Live test on PR #2314: GraphQL `isRequired` confirmed BOTH check runs under "Canvas (Next.js)" name (one SUCCESS via no-op, one SKIPPED via real job) registered, and the SKIPPED one kept mergeStateStatus = BLOCKED despite the SUCCESS sibling. Branch protection's "set of matching contexts" semantic is stricter than the durable feedback memory documented — at least one passing isn't enough; SKIPPED counts as not-passed regardless. Real fix: ONE job that always runs (no job-level `if:`), with all real work gated on the path filter via per-step `if:`. Produces exactly one "Canvas (Next.js)" check run per commit, always SUCCEEDS, regardless of which paths changed. Costs ~10s of always-on CI runtime per PR — negligible vs. the manual-merge cost when the BLOCKED state catches us. This same anti-pattern probably affects Platform (Go) (`platform` filter), Python Lint & Test (`python` filter), and Shellcheck (E2E scripts) (`scripts` filter) — all required, all path-gated. PR-A and PR-2321 merged because they happened to trigger every filter; PR-B only missed canvas. File a follow-up issue to apply the same single-job-conditional-steps pattern across those required jobs to remove the latent merge-blocker. Updates feedback memory: branch_protection_check_name_parity is wrong about "two jobs sharing name + at-least-one-success works." Need to correct the note.	2026-04-29 16:01:38 -07:00
Hongming Wang	fcb2049f3f	ci: add no-op shadow for Canvas (Next.js) required check PRs that don't touch canvas/ paths skip the Canvas (Next.js) job via its `if: needs.changes.outputs.canvas == 'true'` guard. GitHub reports SKIPPED for that conclusion. Branch protection on staging requires Canvas (Next.js) — and treats SKIPPED as not-passed, blocking merge on every workspace-server-only or migration-only PR. This is the design pattern documented in feedback memory "branch_protection_check_name_parity": split into a real job + a no-op shadow that share the same `name:`. Exactly one runs per PR; both report the same check context, and at least one always reports SUCCESS, satisfying the required check. The no-op job runs in a few seconds (single `echo` step) and produces the right check context for any PR that has changes outside canvas/. Concrete blocker that prompted this: PR #2314 (RFC #2312 PR-B) sat APPROVED + CI-green + UP-TO-DATE for half an hour with mergeStateStatus BLOCKED, traced via the GraphQL `isRequired` field to a single SKIPPED Canvas (Next.js) check. PRs #2319 (PR-F) and the rest of the RFC #2312 stack would have hit the same wall.	2026-04-29 15:44:07 -07:00
Hongming Wang	d8210514c1	ci(canvas): wire vitest --coverage into CI for baseline observability (#1815) Step 2 of #1815. Step 1 (instrumentation in canvas/vitest.config.ts) already shipped — the inline comment there explicitly defers wiring into CI to a follow-up because turning on a 70% threshold blind would either fail CI immediately or paper over a real gap with an ad-hoc exclude list. This PR ships the observability half: - Replaces `npx vitest run` with `npx vitest run --coverage` in the canvas-build job. Coverage gets reported on every PR; no threshold gate yet (vitest.config.ts intentionally doesn't set thresholds). - Adds an artifact upload step for canvas/coverage/ (HTML + json-summary) so reviewers can browse the coverage report from any PR. 7-day retention; if-no-files-found=warn so a step skip doesn't fail. Step 3 (thresholds + hard gate) is the natural follow-up — track in a new sub-issue once we've seen ~5-10 PRs of baseline data and know where current coverage sits. The issue body proposed lines:70 / functions:70 / branches:65 / statements:70; that may need adjustment once the baseline lands. Closes the Step-2 portion of #1815. Step 3 stays open or gets a fresh issue depending on your preference. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 08:51:34 -07:00
Hongming Wang	07a17c2e59	Merge remote-tracking branch 'origin/staging' into docs/auto-promote-staging-prereq-comment # Conflicts: # .github/workflows/auto-promote-staging.yml	2026-04-28 20:46:42 -07:00
Hongming Wang	e373fa1a96	docs(ci): document auto-promote-staging GITHUB_TOKEN PR-create prereq Add a comment block at the top of auto-promote-staging.yml naming the load-bearing one-time repo setting that the workflow depends on: Settings → Actions → General → Workflow permissions → ✅ Allow GitHub Actions to create and approve pull requests Without this toggle, every workflow_run fails with "GitHub Actions is not permitted to create or approve pull requests (createPullRequest)". Observed 2026-04-29 01:43 UTC blocking the `fcd87b9` promotion (PRs #2248 + #2249); manually bridged via PR #2252. The setting is invisible to anyone reading the workflow file, but the workflow cannot do its job without it. Documenting here so the next time it gets toggled off (org admin change, repo migration, audit cleanup) the failure mode points at the cause rather than another round of "why is auto-promote broken." Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 18:49:07 -07:00
Hongming Wang	fcd87b9526	Merge pull request #2249 from Molecule-AI/fix/publish-runtime-cascade-hard-fail-on-push fix(ci): hard-fail publish-runtime cascade on push when token missing	2026-04-29 01:33:10 +00:00
Hongming Wang	f1c6673e03	fix(ci): hard-fail publish-runtime cascade on push when token missing Mirror the sweep-cf-orphans hardening (#2248) on publish-runtime's TEMPLATE_DISPATCH_TOKEN gate. The previous behaviour was to print ::warning::skipping cascade — templates will pick up the new version on their own next rebuild and exit 0. That message is wrong: the 8 workspace-template repos only rebuild on this repository_dispatch fanout. Without the dispatch they stay pinned to whatever runtime version they last saw, and the gap is invisible until someone notices a template several versions behind weeks later. Behaviour after this PR: - push (auto-trigger on workspace/runtime/** changes) → exit 1 - workflow_dispatch (manual operator) → exit 0 with a warning (operator already accepted state; let them rerun after restoring the secret) The token-missing path now also names the consequence concretely ("templates will NOT pick up the new version until this token is restored") so future operators see the actionable line, not the misleading "they'll catch up on their own" message. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 18:28:01 -07:00
hongmingwang-moleculeai	667751919d	Merge pull request #2248 from Molecule-AI/fix/sweep-cf-orphans-hard-fail-on-schedule fix(ci): hard-fail sweep-cf-orphans on schedule when secrets missing	2026-04-29 01:16:22 +00:00
Hongming Wang	9f39f3ef6c	fix(ci): hard-fail sweep-cf-orphans on schedule when secrets missing Replace the soft-skip-with-warning behaviour for scheduled runs of the hourly Cloudflare orphan sweeper with an explicit failure when the six required secrets aren't set. Manual workflow_dispatch keeps the soft-skip path so an operator can short-circuit a deliberate rerun without redoing the secrets dance — they accepted the state when they clicked the button. Why: from some-date to 2026-04-28, all six secrets were unset on the repo. Every hourly tick printed a yellow ::warning:: and exited 0, which GitHub registers as "completed/success" — the sweeper was indistinguishable from a healthy janitor with nothing to do. Cloudflare orphans accumulated unobserved to 152/200 (~76% of the zone quota), and only surfaced via a manual audit. The mechanism to catch this kind of regression is to make the workflow loud: red runs prompt investigation, green runs are presumed healthy. Schedule/workflow_run/push paths now print three ::error:: lines naming the missing secrets, the fix, and a one-line reference to this incident, then exit 1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 18:13:22 -07:00
Hongming Wang	5753021194	Merge pull request #2247 from Molecule-AI/fix/auto-promote-staging-pr-based fix(ci): auto-promote-staging opens a PR + uses merge queue, not direct push	2026-04-29 00:57:33 +00:00
Hongming Wang	e45a5c98b0	fix(ci): auto-promote-staging opens a PR + uses merge queue, not direct push Mirrors the fix #2234 applied to auto-sync-main-to-staging.yml in the reverse direction. Both workflows now use the same merge-queue path that humans use; no special-case bypass. Why Every tick of auto-promote-staging.yml since main's branch protection went stricter has been failing with: remote: error: GH006: Protected branch update failed for refs/heads/main. remote: - Required status checks "Analyze (go)", "Analyze (javascript-typescript)", "Analyze (python)", "Canvas (Next.js)", "Detect changes", "E2E API Smoke Test", "Platform (Go)", "Python Lint & Test", and "Shellcheck (E2E scripts)" were not set by the expected GitHub apps. remote: - Changes must be made through a pull request. The previous version did `git merge --ff-only origin/staging && git push origin main` directly. That works against a permissive branch — it doesn't work against a ruleset that requires checks satisfied by the expected GitHub apps. Only PR merges through the queue produce check runs from the right apps. Result was that today's 12+ merges to staging never propagated to main; the auto-promote ran every tick and failed every tick, while operators had to keep opening manual `staging → main` bridges. Fix - Replace the direct git push step with a step that opens (or reuses) a PR base=main head=staging and enables auto-merge. The merge queue lands it once gates are green on the merge_group ref. - The PR's head IS the staging branch (no per-SHA promote branch needed) — the whole purpose is "advance main to staging's tip". - Add `pull-requests: write` permission so the workflow can call gh pr create + gh pr merge --auto. - Drop the `git merge-base --is-ancestor` divergence check — the merge queue itself enforces branch protection now, and rejects the PR if main has diverged from staging history. 
Loop safety preserved: when this PR's merge lands on main, it triggers auto-sync-main-to-staging.yml which opens a sync PR back to staging. That sync PR's eventual merge is by GITHUB_TOKEN (the merge queue) which doesn't trigger downstream workflow_run events — so auto-promote-staging.yml does NOT re-fire from its own merge landing. Refs: #2234 (the parallel fix for auto-sync-main-to-staging.yml), task #142, multiple failing runs visible in https://github.com/Molecule-AI/molecule-core/actions/workflows/auto-promote-staging.yml	2026-04-28 17:54:15 -07:00
Hongming Wang	c68ea3a284	Merge pull request #2246 from Molecule-AI/chore/all-deps-batch-2026-04-28-pt2 chore(deps): batch dep bumps — 6 safe upgrades (4 actions majors + 2 npm dev deps)	2026-04-29 00:48:15 +00:00
Hongming Wang	fc59f939ac	chore(deps): batch dep bumps — 6 safe upgrades (4 actions majors + 2 npm dev deps) Consolidates the remaining safe-to-merge dependabot PRs from the 2026-04-28 wave into one consumable PR. Replaces three earlier single-bump PRs (#2245, #2230, #2231) which were closed in favor of this single batch — same pattern as #2235. GitHub Actions majors (SHA-pinned per org convention): github/codeql-action v3 → v4.35.2 (#2228) actions/setup-node v4 → v6.4.0 (#2218) actions/upload-artifact v4 → v7.0.1 (#2216) actions/setup-python v5 → v6.2.0 (#2214) npm dev deps (canvas/, lockfile regenerated in node:22-bookworm container so @emnapi/* and other Linux-only optional deps are properly resolved — Mac-native `npm install` strips them, which caused the earlier #2235 batch to drop these two): @types/node ^22 → ^25.6 (#2231) jsdom ^25 → ^29.1 (#2230) Why each is safe setup-node v4 → v6 / setup-python v5 → v6: Every consumer call pins node-version / python-version explicitly. v5 / v6 changed defaults but pinned consumers are unaffected. Confirmed via grep across .github/workflows/ — all setup-node call sites pin '20' or '22', all setup-python call sites pin '3.11'. codeql-action v3 → v4.35.2: Used as init/autobuild/analyze sub-actions in codeql.yml. v4 bundles a newer CodeQL CLI; ubuntu-latest auto-updates so functional behavior is unchanged. The deprecated CODEQL_ACTION_CLEANUP_TRAP_CACHES env var (per v4.35.2 release notes) is undocumented and we don't set it. upload-artifact v4 → v7.0.1: v6 introduced Node.js 24 runtime requiring Actions Runner >= 2.327.1. All upload-artifact users (codeql.yml, e2e-staging-canvas.yml) run on `ubuntu-latest` (GitHub-hosted), which auto-updates the runner agent. Self-hosted runners are NOT used for these jobs. @types/node 22 → 25 / jsdom 25 → 29: Both are dev-only — @types/node is type definitions, jsdom backs vitest's DOM environment. Tests pass: 79 files / 1154 tests in node:22-bookworm container.
Verified locally (Linux container so the lockfile reflects what CI's `npm ci` will install): - cd canvas && npm install --include=optional → 169 packages - npm test → 1154/1154 pass - npm ci → clean install succeeds - npm run build → Next.js prerendering succeeds Closes when this lands (the 3 individual auto-merge PRs from earlier were closed): #2228 #2218 #2216 #2214 #2231 #2230 NOT included (CI failing on dependabot's own run — major framework bumps that need code-side migration tasks, not safe auto-bumps): #2233 next 15 → 16 #2232 tailwindcss 3 → 4 #2226 typescript 5 → 6	2026-04-28 17:44:55 -07:00
Hongming Wang	a1bc771f87	Merge pull request #2243 from Molecule-AI/fix/branch-protection-required-check-naming fix(ci): no-op jobs emit same check-run name as their real counterparts	2026-04-29 00:43:31 +00:00
Hongming Wang	4f0dfbbf0b	Merge pull request #2242 from Molecule-AI/fix/dispatch-input-shell-injection-hardening fix(security): harden dispatch inputs against shell injection	2026-04-29 00:31:08 +00:00
github-actions[bot]	7b2d9e9bce	fix(ci): no-op job emits same check-run name as the real one Branch protection on `main` requires "E2E API Smoke Test" as a status check. With Design B's no-op + e2e-api job split, when paths-filter excludes a commit: - e2e-api job (name="E2E API Smoke Test"): SKIPPED - no-op job (name="no-op"): SUCCESS Branch protection counts the skipped check-run as not-satisfied → auto-promote-staging's `git push origin main` rejected with GH006. Observed 2026-04-28 00:22 UTC: every gate green at the workflow level, all_green=true in auto-promote-staging's gate-check, but the FF push itself rejected with: Required status checks "..., E2E API Smoke Test, ..." were not set by the expected GitHub apps. Fix: give the no-op job the same `name:` as the real one. Now both register as check-runs named "E2E API Smoke Test" — exactly one runs per workflow execution (mutex `if`), the other registers as skipped with the same name. Branch protection sees at least one success, requirement satisfied. Same fix applied to e2e-staging-canvas.yml's no-op (name → "Canvas tabs E2E") for symmetry, even though "Canvas tabs E2E" isn't currently in main's required check list — kept consistent so the next time a required-checks reshuffle pulls it in, it doesn't recreate this bug. Note: Design B's intent was always "emit a result auto-promote can read" — that intent was satisfied at the workflow-conclusion level (success), but missed the per-check-run-name level. This PR closes that second-order gap.	2026-04-28 17:25:31 -07:00
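The check-name parity fix in commit 7b2d9e9bce can be modelled roughly as below. This is a simplified sketch of the behaviour the commit message describes (a required check counts as satisfied when at least one check-run with that exact name reports success), not GitHub's actual branch-protection algorithm. The function name `required_checks_satisfied` and the tuple layout are assumptions made for illustration.

```python
from typing import Iterable, Tuple

CheckRun = Tuple[str, str]  # (check-run name, conclusion), hypothetical layout


def required_checks_satisfied(required: Iterable[str],
                              check_runs: Iterable[CheckRun]) -> bool:
    """Return True when every required name has at least one successful run.

    Simplified model of the rule described in the commit message: a skipped
    run does not satisfy the requirement, but a sibling run that shares the
    same name and reports success does.
    """
    succeeded = {name for name, conclusion in check_runs
                 if conclusion == "success"}
    return all(name in succeeded for name in required)


# Before the fix: the no-op job reports under its own name, so the
# required name only ever sees a skipped run.
before = [("E2E API Smoke Test", "skipped"), ("no-op", "success")]

# After the fix: both jobs share the required name; exactly one succeeds.
after = [("E2E API Smoke Test", "skipped"), ("E2E API Smoke Test", "success")]
```

Under this model, `before` leaves "E2E API Smoke Test" unsatisfied while `after` satisfies it, which matches the GH006 rejection and its resolution as reported in the commit.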
github-actions[bot]	b2a0703f1c	fix(ci): per-SHA concurrency on staging gate workflows e2e-staging-canvas had a single global concurrency group: concurrency: group: e2e-staging-canvas cancel-in-progress: false That meant the entire repo shared one running + one pending slot. When a staging push queued behind an in-flight run and a third entrant (a PR run, a follow-on push) entered the group, the staging push got cancelled. auto-promote-staging then saw `completed/cancelled` for a required gate and refused to advance main. Observed 2026-04-28 23:51-23:53: staging tip 3f99fede's e2e-staging-canvas push run was cancelled within 2:20 of starting because a PR run on a follow-on branch entered the group. Auto-promote-staging fired 8+ times after that, all skipped because canvas was still in the cancelled state. The chain stayed stuck until the cancelled run was manually re-dispatched. e2e-api had a softer version of the same bug — `group: e2e-api-${{ github.ref }}`. Per-ref isolates push events from PR events, so this specific scenario didn't hit it, but back-to-back pushes to staging at SHA-A and SHA-B share refs/heads/staging and would still cancel SHA-A's queued run when SHA-B enters. Both workflows now use per-SHA grouping. The single-global-group's original intent was to throttle parallel E2E provisions, but each E2E run already isolates its state via fresh-org-per-run, and parallel infrastructure cost at our scale (~$0.001/min × 10min × 2) is rounding error compared to a stuck pipeline. Per-SHA still dedupes accidental double-triggers for the SAME SHA. It does not cancel obsolete-PR-version runs on force-push — that wasted CI is acceptable given the alternative is losing staging-tip data that auto-promote-staging depends on. Other gate workflows: ci.yml uses `cancel-in-progress: true` which is correct for unit tests (intentional cancellation on supersede).
codeql.yml is per-ref like e2e-api was; same fix probably applies if the same deadlock pattern is observed there, but no incident yet so deferring.	2026-04-28 17:18:15 -07:00
github-actions[bot]	475a51adec	fix(ci): defer promote when E2E is racing with publish (review fix) Self-review caught a real correctness bug: scenario where publish-workspace-server-image completes BEFORE E2E Staging SaaS for a runtime-touching SHA. Publish typically takes ~5-10min; E2E ~10-15min, so this ordering is the common case for runtime-path PRs. Previous gate logic: - completed/success: proceed - completed/failure: abort - everything else (including in_progress): proceed ← BUG If publish-trigger fires while E2E is still running, the gate returned "in_progress/none" and fell through the catch-all "proceed" branch. Result: :latest retagged on the publish signal alone. Then E2E ends red — but :latest was already wrongly advanced; the E2E-completion trigger's job-level if=conclusion==success filter just skips, never rolls back. Fix: explicit case for in_progress|queued|requested|waiting|pending that DEFERS — sets gate.proceed=false, writes a "deferred" summary, exits 0 (workflow run shows success, retag steps skipped). The E2E completion trigger then fires later and either promotes (green) or aborts (red), giving us correct ordering regardless of who finishes first. Subsequent steps now guarded by `if: steps.gate.outputs.proceed == 'true'` instead of relying on `exit 1` for skip semantics. Also added an explicit catch-all `*)` branch that aborts on unknown states (forward-compat: GitHub adds a new status, we surface it instead of silently promoting through it).	2026-04-28 16:59:58 -07:00
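The revised gate in commit 475a51adec amounts to a three-way decision table. The sketch below restates it in Python for readability; the real gate is a shell `case` statement inside the workflow, and the names `Decision`, `PENDING_STATES`, and `gate_decision` are invented here. How the workflow encodes "E2E never fired for this SHA" is not stated in the commit message, so that case is deliberately left out.

```python
from enum import Enum


class Decision(Enum):
    PROCEED = "proceed"  # retag :latest
    DEFER = "defer"      # exit 0, skip retag, wait for the E2E trigger
    ABORT = "abort"      # refuse to promote


# Statuses the commit message lists as "E2E is still running".
PENDING_STATES = frozenset(
    {"in_progress", "queued", "requested", "waiting", "pending"}
)


def gate_decision(status: str, conclusion: str) -> Decision:
    """Map an E2E run's status/conclusion to a promote decision.

    Assumption: any state the commit does not name (including completed
    runs with a conclusion other than success or failure) falls through
    to the catch-all and aborts.
    """
    if status == "completed" and conclusion == "success":
        return Decision.PROCEED
    if status == "completed" and conclusion == "failure":
        return Decision.ABORT
    if status in PENDING_STATES:
        return Decision.DEFER
    # Catch-all: surface unknown states instead of promoting through them.
    return Decision.ABORT
```

The design point is the last line: the pre-fix logic had the same shape but returned "proceed" from the catch-all, which is what let an in-progress E2E run slip through.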
github-actions[bot]	f4f45f8561	fix(ci): auto-promote :latest also on publish-image, not just E2E Previously this workflow only triggered on E2E Staging SaaS completion, which is itself paths-filtered to runtime handlers (workspace-server/internal/handlers/{registry,workspace_provision,a2a_proxy}.go, middleware/, provisioner/). publish-workspace-server-image fires on a STRICTLY BROADER path set (workspace-server/, canvas/, manifest.json) — so canvas-only or cmd-only or sweep-only PRs rebuilt the platform image without ever advancing :latest. Result observed 2026-04-28: zero runs of this workflow since merge despite eight main pushes. :latest sat ~7 hours / 9 PRs behind main. Fix: add publish-workspace-server-image as a second trigger. Add an explicit gate inside the job that aborts when E2E Staging SaaS for the same SHA ended red. When E2E didn't fire (paths-filtered), proceed — auto-promote-staging's pre-merge gates (CI + E2E Canvas + E2E API + CodeQL on staging) already validated this SHA before main moved. Concurrency group serializes promotes per-SHA so the publish+E2E both-fired race lands cleanly. Idempotent crane tag makes it safe regardless.	2026-04-28 16:53:30 -07:00
hongmingwang-moleculeai	a45a026099	Merge pull request #2235 from Molecule-AI/chore/deps-batch-2026-04-28 chore(deps): batch dep bumps — 11 safe upgrades from 2026-04-28 wave	2026-04-28 23:40:51 +00:00
Hongming Wang	0cdbc2c4f6	chore(deps): batch dep bumps — 11 safe upgrades from 2026-04-28 dependabot wave Consolidates 11 of the 17 open Dependabot PRs (#2215, #2217, #2219-#2225, #2227, #2229) into one PR. Every entry is a patch / minor / floor bump where the impact surface is small and CI carries the proof. Same pattern as the 2026-04-15 batch. Go (workspace-server/go.mod + go.sum, regenerated via `go mod tidy`): - golang.org/x/crypto 0.49.0 → 0.50.0 (#2225) - github.com/golang-jwt/jwt/v5 5.2.2 → 5.3.1 (#2222) - github.com/gin-contrib/cors 1.7.2 → 1.7.7 (#2220) - github.com/docker/go-connections 0.6.0 → 0.7.0 (#2223) - github.com/redis/go-redis/v9 9.7.3 → 9.19.0 (#2217) Python floor bumps (workspace/requirements.txt; current pip-resolved versions don't change unless they happen to be below the new floor): - httpx >=0.27 → >=0.28.1 (#2221) - uvicorn >=0.30 → >=0.46 (#2229) - temporalio >=1.7 → >=1.26 (#2227) - websockets >=12 → >=16 (#2224) - opentelemetry-sdk >=1.24 → >=1.41.1 (#2219) GitHub Actions (SHA-pinned per existing convention): - dorny/paths-filter@d1c1ffe (v3) → @fbd0ab8 (v4.0.1) (#2215) REMOVED from this batch (lockfile platform mismatch): - #2231 @types/node ^22 → ^25.6 (npm install on macOS strips Linux-only @emnapi/* entries from package-lock.json that CI's `npm ci` then refuses; needs a Linux-side install to land cleanly) - #2230 jsdom ^25 → ^29.1 (same) NOT included in this batch (deferred to per-PR human review): - #2228 github/codeql-action v3 → v4 (CodeQL CLI alignment risk) - #2218 actions/setup-node v4 → v6 (default Node version drift) - #2216 actions/upload-artifact v4 → v7 (3 major versions) - #2214 actions/setup-python v5 → v6 (action major) NOT merged (CI failing on dependabot's own PR): - #2233 next 15 → 16 - #2232 tailwindcss 3 → 4 - #2226 typescript 5 → 6 Verified: - workspace-server: `go mod tidy && go build ./... && go test ./...` — green - workspace requirements.txt: floor bumps only	2026-04-28 16:25:46 -07:00
Hongming Wang	cf258b3355	fix(ci): auto-sync opens a PR + uses merge queue, not direct push The molecule-core/staging branch is protected by ruleset 15500102 (name: staging-merge-queue) which blocks ALL direct pushes — no bypass even for org admins or the GitHub Actions integration. The prior version of this workflow attempted `git push origin staging` and was rejected with GH013: ! [remote rejected] staging -> staging (push declined due to repository rule violations) - Changes must be made through a pull request. - Changes must be made through the merge queue This was a real architectural mismatch: auto-sync was bypassing the same gates everyone else goes through to land on staging, which is exactly what the ruleset is designed to prevent. The fix matches the org convention: the workflow now opens a PR (base=staging, head=auto-sync/main-<sha>) and enables auto-merge. The merge queue picks it up, runs required gates against the merged result, and lands it. Same path human PRs take through staging — no special-snowflake bypass. Trade-off acknowledged - Slight PR churn: every main push that needs sync opens a tracked PR. With concurrency: cancel-in-progress: false (existing) and the merge queue's serial processing, this is bounded — PRs land in order, no thundering herd. - The previous direct-push approach worked on molecule-controlplane (which has no merge_queue ruleset on staging). That version of the workflow was correct for that repo's protection model. Per-repo divergence is acceptable; the invariant ("staging ⊇ main") is what matters, not how it's enforced. Loop safety preserved GITHUB_TOKEN-authored merges (including the merge queue's land of this PR) do NOT trigger downstream workflow runs. So the merge to staging from this PR doesn't fire auto-promote-staging — same as the direct-push version. 
Idempotency The branch name is derived from main's short sha (`auto-sync/main-<sha>`) so workflow restarts on the same main push reuse the existing branch + PR rather than opening duplicates. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 15:59:26 -07:00
Hongming Wang	1867111d95	Merge pull request #2213 from Molecule-AI/chore/pin-actions-to-shas chore(security): pin Actions to SHAs + enable Dependabot auto-bumps	2026-04-28 22:49:25 +00:00
Hongming Wang	c77a88c247	chore(security): pin Actions to SHAs + enable Dependabot auto-bumps Supply-chain hardening for the CI pipeline. 23 workflow files modified, 59 mutable-tag refs replaced with commit SHAs. The risk Every `uses:` reference in .github/workflows/*.yml was pinned to a mutable tag (e.g., `actions/checkout@v4`). A maintainer of an action — or a compromised maintainer account — can repoint that tag to malicious code, and our pipelines silently pull it on the next run. The tj-actions/changed-files compromise of March 2025 is the canonical example: maintainer credential leak, attacker repointed several `@v<N>` tags to a payload that exfiltrated repository secrets. Repos that pinned to SHAs were unaffected. The fix Replace each `@v<N>` with `@<commit-sha> # v<N>`. The trailing comment preserves human readability ("ah, this is v4"); the SHA makes the reference immutable. Actions covered (10 distinct): actions/{checkout,setup-go,setup-python,setup-node,upload-artifact,github-script} docker/{login-action,setup-buildx-action,build-push-action} github/codeql-action/{init,autobuild,analyze} dorny/paths-filter imjasonh/setup-crane pnpm/action-setup (already pinned in molecule-app, listed here for completeness) Excluded: Molecule-AI/molecule-ci/.github/workflows/disable-auto-merge-on-push.yml@main — internal org reusable workflow; we control its repo, threat model is different from third-party actions. Conventional to pin to @main rather than SHA for internal reusables. The maintenance cost SHA pinning means upstream fixes require manual SHA bumps. Without automation, pinned SHAs go stale. So this PR also enables Dependabot across four ecosystems: - github-actions (workflows) - gomod (workspace-server) - npm (canvas) - pip (workspace runtime requirements) Weekly cadence — the supply-chain attack window is "minutes between repoint and pull"; weekly auto-bumps don't help with zero-days regardless. 
The point is to pull in non-zero-day fixes without operator effort. Aligns with user-stated principle: "long-term, robust, fully-automated, eliminate human error." Companion PR: Molecule-AI/molecule-controlplane#308 (same pattern, smaller surface). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 15:37:06 -07:00
Hongming Wang	6638d6e1d7	feat(ci): SECRET_PATTERNS drift lint across known consumers Adds a lint that diffs the canonical SECRET_PATTERNS array in .github/workflows/secret-scan.yml against every known public consumer mirror, failing on any divergence. Why: every side that scans for credentials carries its own copy of the pattern list. They drift — most recently the workspace-runtime pre-commit hook lagged the canonical by one pattern (sk-cp- / MiniMax F1088 vector), so a developer's local pre-commit would let a sk-cp- token through while the org-wide CI scan would refuse it. Useless friction; automated detection closes the gap. Implementation: .github/scripts/lint_secret_pattern_drift.py — pure stdlib, fetches each consumer's RAW file via urllib, extracts the SECRET_PATTERNS=( ... ) array via anchored regex (the closing `)` is anchored to the start of a line because pattern comments like `# GitHub PAT (classic)` contain their own paren mid-line), diffs against canonical, fails on missing or extra patterns. Fetch failures are warnings, not errors — a consumer whose branch was renamed shouldn't fail the lint until someone updates the URL list. .github/workflows/secret-pattern-drift.yml — daily 05:00 UTC cron + on-push gate (when canonical, the workflow, or the script changes) + workflow_dispatch. Read-only token, 5-minute timeout. Initial consumer set: workspace-runtime's bundled pre-commit hook (the one that drifted on sk-cp-). molecule-controlplane's inlined copy is private so this workflow can't read it; that's tracked separately and the controlplane's own self-monitor is the gap. Verified locally: lint detects drift correctly when the runtime hook is missing sk-cp-, returns clean when aligned. Refs: task #139.	2026-04-28 15:29:09 -07:00
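The extraction-and-diff step described in commit 6638d6e1d7 might look roughly like the sketch below. It is not the contents of lint_secret_pattern_drift.py: the regexes, the function names, and the assumption that each pattern sits on its own line in single quotes are all illustrative guesses, and the network fetch via urllib is omitted so the example stays self-contained.

```python
import re
from typing import Set, Tuple

# The closing ")" must sit at the start of a line (optionally indented),
# so a paren inside a trailing comment such as "# GitHub PAT (classic)"
# cannot end the match early.
ARRAY_RE = re.compile(
    r"^[ \t]*SECRET_PATTERNS=\([ \t]*\n(.*?)^[ \t]*\)[ \t]*$",
    re.MULTILINE | re.DOTALL,
)

# Assumption: one single-quoted pattern per line inside the array.
ENTRY_RE = re.compile(r"'([^']+)'")


def extract_patterns(text: str) -> Set[str]:
    """Return the set of patterns found in the SECRET_PATTERNS=( ... ) array."""
    match = ARRAY_RE.search(text)
    if match is None:
        raise ValueError("SECRET_PATTERNS array not found")
    patterns = set()
    for line in match.group(1).splitlines():
        entry = ENTRY_RE.search(line)
        if entry:
            patterns.add(entry.group(1))
    return patterns


def diff_patterns(canonical: str, mirror: str) -> Tuple[Set[str], Set[str]]:
    """Return (missing_from_mirror, extra_in_mirror) relative to canonical."""
    want = extract_patterns(canonical)
    have = extract_patterns(mirror)
    return want - have, have - want
```

A lint built on this would exit non-zero when either returned set is non-empty, and would treat a fetch failure as a warning, as the commit message specifies.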
Hongming Wang	97d5883e76	fix(ci): auto-sync concurrency + cleanup follow-ups Three small fixes from the self-review of #2209: 1. Required: concurrency group. Two pushes to main in quick succession (manual UI merge then auto-promote-staging's ff-push, or any back-to-back main pushes) would race two auto-sync runs against the same staging branch — second `git push origin staging` fails non-fast-forward, surfacing as a red CI alert for what should be a no-op. Add `concurrency: { group: auto-sync-main-to-staging, cancel-in-progress: false }` so the second run waits for the first and sees its result. 2. Hygiene: `git merge --abort` on conflict. The conflict-error path exits 1 with the work tree in a half-merged state. Doesn't affect future runs (each gets a fresh checkout) but is an unpleasant artifact for anyone who shells into the runner. Abort first, then exit. 3. Doc accuracy: "Loop safety" comment. The original said the chain terminates because "main is either a no-op or advances further." That's true but understates the actual safety: GitHub Actions explicitly does NOT trigger downstream workflow runs from `GITHUB_TOKEN`-authored pushes. So the loop is impossible by construction, not just by happy coincidence of ref state. Updated the comment to reflect the actual mechanism. Plus a step-name nit: "Fast-forward staging → main" reads as if main is the target. Renamed to "Fast-forward staging to main" for consistency with the workflow's name (main → staging). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 14:59:23 -07:00
Hongming Wang	c59715e143	feat(ci): auto-sync main → staging to keep staging-as-superset invariant Background `auto-promote-staging.yml` advances main via `git merge --ff-only` + `git push origin main` — clean fast-forward, no merge commit. But manual `staging → main` merges via the GitHub UI / API create a merge commit on main that staging doesn't have. The next `staging → main` PR then evaluates as "BEHIND" because staging is missing that merge commit, requiring a manual `gh pr update-branch` round-trip. This pattern bit twice on 2026-04-28 (PRs #2202 and #2205, both manual bridges to land pipeline fixes themselves). Each needed update-branch + re-CI before they could merge. Annoying and avoidable. What this workflow does Triggered on every push to main (regardless of source: auto-promote, UI merge, API merge, direct push): 1. Check whether main is already in staging's ancestry. If yes, no-op — auto-promote-staging keeps them aligned via ff push, and the no-op case is the steady state. 2. If not (manual merge commit on main, or direct main hotfix): try `git merge --ff-only origin/main` first. Works when staging hasn't diverged with its own commits. 3. If ff fails (staging has its own in-flight feature work): `git merge --no-ff origin/main -m "chore: sync main → staging"`. Absorbs main's tip while keeping staging's own history. 4. Push staging. Loop safety Pushing the synced staging triggers auto-promote-staging.yml, which checks gates on staging's new tip and, if green, ff-pushes staging to main. Since staging now ⊇ main, the resulting push to main is either a no-op (no ref change → no push event fires → auto-sync doesn't re-trigger) or advances main further. In the latter case auto-sync fires once more, sees main already in staging's ancestry, no-ops. Bounded. Conflict handling If the merge step hits conflicts (staging and main diverged with incompatible changes), the workflow fails with a clear summary pointing to manual resolution. 
This shouldn't happen in practice — staging is the integration branch; conflicts indicate a direct main hotfix touching the same code as in-flight staging work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 14:43:43 -07:00
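The four-step sync logic in commit c59715e143 reduces to two ancestry checks. The sketch below models it over a toy commit graph (a dict mapping each commit to its parents) instead of a real repository; `is_ancestor` and `sync_action` are hypothetical helpers, and in the workflow itself the equivalent checks are done with git commands. Note that this describes the original direct-push version, which commit cf258b3355 later replaced with a PR through the merge queue.

```python
from typing import Dict, List


def is_ancestor(parents: Dict[str, List[str]], ancestor: str,
                descendant: str) -> bool:
    """True when `ancestor` is reachable from `descendant` (or equal to it)."""
    stack, seen = [descendant], set()
    while stack:
        commit = stack.pop()
        if commit == ancestor:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return False


def sync_action(parents: Dict[str, List[str]], main_tip: str,
                staging_tip: str) -> str:
    """Pick the sync step for the current main and staging tips."""
    if is_ancestor(parents, main_tip, staging_tip):
        return "noop"          # step 1: staging already contains main
    if is_ancestor(parents, staging_tip, main_tip):
        return "fast-forward"  # step 2: staging has no commits of its own
    return "merge-commit"      # step 3: absorb main, keep staging history


# Toy history: a <- b, then "m" (a merge commit on main) and "s"
# (in-flight staging work) both branch from b.
graph = {"a": [], "b": ["a"], "m": ["b"], "s": ["b"]}
```

With this graph, main at `b` and staging at `s` is the steady-state no-op, main at `m` and staging at `b` fast-forwards, and main at `m` with staging at `s` needs the merge commit.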
hongmingwang-moleculeai	11a38a0ad4	Merge pull request #2207 from Molecule-AI/fix/secret-scan-printf-and-wordsplit fix(ci): printf format-string sink + filename word-split in secret-scan	2026-04-28 21:11:32 +00:00
Hongming Wang	2c8792d3e0	fix(ci): printf format-string sink + filename word-split in secret-scan Two latent bash bugs in the canonical secret-scan workflow caught during the post-merge review of molecule-controlplane #301 (a private consumer that inlined this workflow's logic and got both fixes there). Same bugs apply here; fixing in canonical means every public consumer (gh-identity, github-app-auth, the 8 workspace template repos) inherits the fix on their next workflow_call. Bug 1: `printf "$OFFENDING"` is a format-string sink. OFFENDING is built from filenames: `${f} (matched: ${pattern})\n`. When passed to printf as the first argument, `%` characters in a filename are interpreted as conversion specifiers — corrupting the error message or printing `%(missing)` artifacts. No filename in the current tree triggers it, but a future test fixture, build artifact, or contributor-supplied path could. Fix: `printf '%b' "$OFFENDING"` interprets the literal `\n` we appended without treating OFFENDING as a format string. Bug 2: `for f in $CHANGED` word-splits on whitespace. Filenames containing spaces would split into multiple tokens. The self-exclude check (`[ "$f" = "$SELF" ] && continue`) and the diff lookup would both operate on partial-path tokens. No filename in the current tree has whitespace, but the failure would be silent if one ever did. Fix: `while IFS= read -r f; do ... done <<< "$CHANGED"` reads whole lines as filenames. Added `[ -z "$f" ] && continue` to match the original `for` loop's implicit empty-input skip. Both fixes are mechanically straightforward (~16 lines net diff, mostly comments documenting the why). No behavior change for filenames in the current tree; strictly better for the edge cases. The same fixes already shipped in molecule-controlplane via #301 which inlined a copy of this workflow. 
The runtime's bundled pre-commit hook (molecule-ai-workspace-runtime: molecule_runtime/scripts/pre-commit-checks.sh) likely has the same bugs — flagged as a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 14:02:50 -07:00
Hongming Wang	9d4ab7b1a2	feat(ci): auto-promote-on-e2e — retag :latest on green E2E Staging SaaS Closes the final gap in the SaaS pipeline. After auto-promote-staging fast-forwards main, publish-workspace-server-image builds new `:staging-<sha>` images, but `:latest` (what prod tenants pull) only moves on either a manual `promote-latest.yml` dispatch or a canary-verify retag (gated on Phase 2 fleet that doesn't exist). This workflow closes that gap by retagging `platform:staging-<sha>` + `platform-tenant:staging-<sha>` → `:latest` whenever E2E Staging SaaS passes for a `main` push. Uses crane (no Docker daemon needed). Verifies both images exist before retagging either, so a half-published state is impossible. Why trigger only on `main` (not staging): - `:latest` is what prod tenants pull. Only SHAs that have reached `main` (via auto-promote-staging) should advance `:latest`. - Triggering on staging would let a staging-only revert advance `:latest` to a SHA that never reaches `main`, breaking the invariant "production runs what's on `main`". Why a separate workflow rather than folding into e2e-staging-saas.yml: - Test concerns and release concerns separate. - Disabling promote during an incident is one workflow toggle, not an edit to the long E2E file. - When Phase 2 canary work eventually lands, the canary path can replace this trigger without touching the E2E workflow. Doc-aligned: per molecule-controlplane/docs/canary-tenants.md, "green staging E2E → :latest" is the recommended approach for the current scale (≤20 paying tenants); canary fleet is deferred until blast radius grows. Pipeline after this lands is fully self-healing: staging push → 4 gates green → auto-promote fast-forwards main → publish-workspace-server-image → E2E Staging SaaS → THIS WORKFLOW retags :latest → tenant fleet auto-pulls in 5 min (or redeploy-tenants-on-main fans out faster) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 13:58:41 -07:00
Hongming Wang	17018745d0	fix(ci): auto-promote gate-check uses workflow file paths, not names Observed 2026-04-28: auto-promote ran for staging head `96955f7b` with all gates actually green (verified via /commits/<sha>/check-runs API) yet `check-all-gates-green` reported `CodeQL → missing/none` and aborted. Same SHA was promotable; auto-promote couldn't see it. Cause: `gh run list --workflow="CodeQL"` matched two workflows in this repo: - codeql.yml (explicit, scans both staging and main) - codeql (GitHub UI-configured Code-quality default setup, internal, scans default branch only) gh CLI rejects ambiguous `--workflow=<name>` lookups and returns no result → the gate fell through to `missing/none` and ALL_GREEN was set false. Every staging push since both names existed has been silently dead-locked. Fix: switch GATES from display-name strings to workflow file paths. File paths are the unique identifier for a workflow file in .github/workflows/; display names are decoration and can collide. The same `gh run list --workflow=<file.yml>` query that fails on "CodeQL" succeeds on "codeql.yml" because the file path resolves unambiguously. No behavior change for the other three gates (CI, E2E Canvas, E2E API Smoke) since their names didn't collide — they keep working, they just identify by ci.yml / e2e-staging-canvas.yml / e2e-api.yml now. The log line shape changes from `CI → completed/success` to `ci.yml → completed/success` which is fine for ops grep. When adding/removing a gate going forward: file paths only. Keep branch-protection required-checks (check-run display names) in sync as a separate manual step. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 13:15:13 -07:00
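The gate check described in commit 17018745d0 can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual shell step: it assumes each gate is queried with something like `gh run list --workflow=<file.yml> --json status,conclusion --limit 1` and that the JSON text is handed to the hypothetical helpers `gate_state` and `all_green`. Only the four file names and the `completed/success` / `missing/none` log shapes come from the commit message.

```python
import json

# Gates identified by workflow file path, as the commit recommends.
GATES = ("ci.yml", "e2e-staging-canvas.yml", "e2e-api.yml", "codeql.yml")


def gate_state(runs_json: str) -> str:
    """Map the JSON list printed for one gate to a 'status/conclusion' string.

    An empty list (no run found, e.g. an ambiguous name lookup) becomes
    'missing/none', matching the log shape quoted in the commit message.
    """
    runs = json.loads(runs_json)
    if not runs:
        return "missing/none"
    latest = runs[0]
    status = latest.get("status") or "missing"
    conclusion = latest.get("conclusion") or "none"
    return f"{status}/{conclusion}"


def all_green(states: dict[str, str]) -> bool:
    """True only when every gate reports completed/success."""
    return all(states.get(gate) == "completed/success" for gate in GATES)
```

Keying `states` by file path rather than display name means a second workflow with the same display name cannot make a lookup ambiguous.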
Hongming Wang	31d25b5a74	fix(ci): e2e gates always emit a result so auto-promote can read it The auto-promote-staging.yml gate-check (line 99) treats "workflow didn't run" as failure. Path-filtered triggers on E2E API Smoke Test and E2E Staging Canvas meant a platform-only or test-only push to staging — say, the prior PR #2201 which only touched tests/e2e/test_staging_full_saas.sh — never triggered the canvas workflow, and auto-promote saw `missing/none`, marked all_green=false, and aborted. Same class for any push that doesn't touch the gate's watched paths. Dead-lock by design, never noticed because the gate was new. Fix per Design B (always-run + fast-skip): - Drop `paths:` from the push/pull_request triggers on both gate workflows. The workflow now always fires on every staging+main push/PR. - Add a `detect-changes` job using `dorny/paths-filter@v3` that decides whether to do real work, scoped to the same paths the trigger filter used to watch. - Real work job (e2e-api / playwright) gates on `needs: detect-changes; if: needs.detect-changes.outputs.X == 'true'`. - Add a sibling `no-op` job that runs when the filter output is false, emitting `::notice::… no-op pass`. The workflow run's conclusion is `success` either way — auto-promote sees green and proceeds. Manual `workflow_dispatch` and the weekly canvas `schedule` short-circuit detect-changes to always-run — those triggers exist precisely to exercise the suite and shouldn't be silently no-op'd. Why this approach over making auto-promote-staging smarter: The alternative (Design A, considered + rejected) was to teach auto-promote-staging to read each gate's `paths:` filter and treat "no run because filter excluded the commit" as conditional pass. That couples auto-promote to other workflows' YAML schema and breaks silently if a gate is renamed or its filter changes. 
Design B keeps the auto-promote contract simple ("each gate emits success") and makes each gate self-describing — adding a new gate doesn't require touching auto-promote. Cost: ~10-30s of runner overhead per gate per push for the no-op when paths don't match. Negligible vs the alternative of dead-locked auto-promote chains. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 12:43:26 -07:00
Hongming Wang	e7eeeb4f59	Merge pull request #2199 from Molecule-AI/fix/pin-compat-narrow-pypi-job-trigger ci(pin-compat): split into two workflows so each gets a narrow paths filter	2026-04-28 18:20:48 +00:00
Hongming Wang	a089712cef	feat(cascade): verify wheel content sha256 against just-built dist Closes #132. Extends the cascade propagation probe (added in #2197 and clarified in #2198) with a content-integrity check. The previous probe verified pip can RESOLVE the version we just published (catches surface 1+2 propagation lag — metadata + simple index). It did NOT verify pip can DOWNLOAD bytes that match what we uploaded — leaving a window where a Fastly stale-content scenario (rare but PyPI has had it: e.g. 2026-04-01 incident where a CDN node served a previous version's wheel under the new version's URL for ~90s after upload) would pass the probe and ship corrupt builds to all 8 receiver templates. Two-stage check, both must pass before the cascade fans out: (a) `pip install --no-cache-dir PACKAGE==VERSION` succeeds — version is resolvable. (Existing, unchanged.) (b) `pip download` of the same wheel + `sha256sum` matches the hash captured pre-upload from `dist/*.whl`. (New.) Captured BEFORE upload via a new `wheel_hash` step that exposes `steps.wheel_hash.outputs.wheel_sha256`, bubbled up as `needs.publish.outputs.wheel_sha256`, and consumed by the cascade probe via the EXPECTED_SHA256 env var. `pip download` is the right primitive: it writes the actual .whl file (vs `pip install` which unpacks and discards), so we can sha256sum it directly. Combined with --no-cache-dir + a wiped /tmp/probe-dl per poll, every poll re-fetches from the live Fastly edge — no local-cache mask. Per-poll cost: ~3-5s pip install + ~3s pip download + 4s sleep. 30-poll budget = ~5-6 min wall on a slow runner (vs the previous ~4-5 min for resolve-only). Well within the cascade's tolerance for a known-rare CDN issue, and the overwhelmingly common case (Fastly serves matching bytes immediately) exits on the first poll. Verified locally: pip download of the current PyPI-latest (molecule-ai-workspace-runtime 0.1.29) produced sha256=7e782b2d50812257…, exactly matching PyPI's own metadata endpoint. 
The mismatch path is exercised inline (different builds of the same version produce different hashes by definition — the build_runtime_package.py output is timestamp-deterministic only within a single CI invocation). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 10:53:50 -07:00
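Stage (b) of the check in commit a089712cef amounts to hashing the downloaded wheel and comparing it with the digest captured before upload. A minimal sketch, assuming the wheel has already been fetched to a local path (for example by `pip download`) and that the expected digest arrives as a hex string such as the EXPECTED_SHA256 value; the helper names `sha256_of_file` and `wheel_matches` are hypothetical, and the real probe uses `sha256sum` in shell.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def wheel_matches(downloaded: Path, expected_sha256: str) -> bool:
    """True when the downloaded wheel hashes to the pre-upload digest."""
    return sha256_of_file(downloaded) == expected_sha256.strip().lower()
```

A mismatch would be treated like an unresolved version: the poll retries until the budget runs out instead of letting the cascade fan out.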
Hongming Wang	a8f59f5fc2	ci(pin-compat): split into two workflows so each gets a narrow paths filter Closes #134. The post-merge review of #2196 flagged that the combined workflow's `paths:` filter (the union of both jobs' needs: `workspace/**` + `scripts/build_runtime_package.py` + the workflow itself) caused the `pypi-latest-install` job to fire on every doc-only / adapter-only / unrelated workspace/ edit. The PyPI artifact that job tests against can't change based on our workspace/ source — only on actual PyPI publishes — so those runs add noise without information. Splits the previously-merged combined workflow: runtime-pin-compat.yml (kept): - PyPI-latest install + import smoke (was: pypi-latest-install) - Narrow `paths:` filter — only fires when workspace/requirements.txt or this workflow file changes - Cron-driven daily for upstream-yank detection (unchanged) runtime-prbuild-compat.yml (new): - PR-built wheel + import smoke (was: local-build-install) - Broad `paths:` filter — fires on any workspace/ source change, scripts/build_runtime_package.py, or this workflow file - No cron (workspace/ doesn't change between firings) Behavior identical to before for content; only the trigger surface is narrower per-job. Each workflow's name is its own status check, so branch protection (which currently lists neither as required) can gate them independently in future. The prior comment in the combined file explicitly acknowledged the asymmetry and proposed this split as a follow-up; this is that follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 10:50:09 -07:00
Hongming Wang	e6ce54006d	ci(publish-runtime): use pip-resolve probe to bound cascade fan-out The cascade's PyPI-propagation gate polled `/pypi/<pkg>/<ver>/json`, which is one of THREE surfaces pip touches when resolving an install: 1. /pypi/<pkg>/<ver>/json — metadata endpoint (the old check) 2. /simple/<pkg>/ — pip's primary download index 3. files.pythonhosted.org — CDN-fronted wheel binary Each has its own cache. Any one of them can lag behind the others, and the previous gate would let the cascade fire while (2) or (3) still served the previous version. Downstream `pip install` in the template repos then resolved to the OLD wheel, the docker layer cache locked that stale resolution in, and subsequent rebuilds kept shipping the old runtime — the "five times in one night" cache trap referenced in the prior comment. Replace the metadata-only poll with an actual `pip install --no-cache-dir --force-reinstall --no-deps PACKAGE==VERSION` from a fresh venv. If pip can resolve and install the exact version we just published, every receiver template will too — pip itself is the ground truth for what the receivers will see, no proxy guessing about which surface is lagging. - Venv created once outside the loop; only `pip install` runs in the poll body. - --no-cache-dir + --force-reinstall ensures every poll hits the live PyPI surfaces (no local-cache mask). - --no-deps keeps each poll fast — we only care about resolving THIS package, not its dep tree. - Loop budget: 30 attempts × 4s ≈ 2 min (vs prior 30 × 2s = 60s). Generous vs typical PyPI propagation, surfaces real upstream issues past the budget. Verified locally: - Probing a non-existent version (0.1.999999) → pip exits 1, loop retries. - Probing the current PyPI-latest → pip exits 0, `pip show` returns the version, loop succeeds. Closes #130. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 18:16:33 -07:00
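The bounded poll in commit e6ce54006d (30 attempts × 4s) follows a common retry shape, sketched below. This assumes the probe is any callable that returns True on success; in the workflow that role is played by the `pip install --no-cache-dir --force-reinstall --no-deps PACKAGE==VERSION` call. The name `poll_until` and the injectable `sleep` parameter are illustrative only.

```python
import time
from typing import Callable


def poll_until(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay_s: float = 4.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run `probe` up to `attempts` times, sleeping `delay_s` between tries.

    Returns True on the first successful probe, or False once the budget
    is exhausted. No sleep happens after the final attempt.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        if attempt < attempts:
            sleep(delay_s)
    return False
```

Passing `sleep` as a parameter lets the loop be exercised in tests without waiting for real time to pass.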
Hongming Wang	7484e6fbec	Merge pull request #2196 from Molecule-AI/fix/runtime-pin-compat-test-pr-artifact ci(runtime-pin-compat): test the PR-built wheel, not PyPI-latest	2026-04-28 00:42:02 +00:00
Hongming Wang	7065579967	ci(runtime-pin-compat): test the PR-built wheel, not the PyPI-latest one Closes #128's chicken-and-egg. The original gate installed the CURRENTLY-PUBLISHED molecule-ai-workspace-runtime from PyPI, then overlaid workspace/requirements.txt, then smoke-imported. That catches problems with the already-shipped artifact (the daily-cron upstream-yank case), but it cannot catch problems introduced by the PR itself: the imports it exercises are from the OLD wheel, not the PR's source. A PR that adds `from a2a.utils.foo import bar` (where `bar` is added in a2a-sdk 1.5 and the runtime currently pins 1.3) slips through: 1. Pip resolves the existing PyPI wheel + a2a-sdk 1.3. 2. Smoke imports the OLD main.py — no reference to `bar` → green. 3. Merge → publish-runtime.yml ships a wheel WITH the new import. 4. Tenant images redeploy → all crash on first boot with ImportError: cannot import name 'bar' from 'a2a.utils.foo'. Splits the workflow into two jobs: - pypi-latest-install (renamed from default-install): unchanged behavior. Runs on the daily cron and on requirements.txt / workflow edits. Catches upstream PyPI yanks + the already-shipped artifact going stale. - local-build-install (new): runs scripts/build_runtime_package.py on the PR's workspace/, builds the wheel with python -m build (mirroring publish-runtime.yml byte-for-byte), installs that wheel, then runs the same smoke import. Tests the artifact that WOULD be published if this PR merges. Path filter widened to workspace/** so any runtime-source change triggers the local-build job. The pypi-latest job's filter is the same union; its internal logic is unchanged so the daily-cron and upstream-detection use cases continue to work. Verified locally: built the wheel from current workspace/ source via the same script + python -m build invocation, installed into a fresh venv, imported from molecule_runtime.main import main_sync successfully. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 17:39:00 -07:00
Hongming Wang	1b0fab674b	ci(publish-runtime): smoke well-known mount alignment + message helper The existing wheel-smoke catches AgentCard kwarg-shape regressions (state_transition_history, supported_protocols) but doesn't catch the SDK-contract drift class that #2193 just fixed in production: the a2a-sdk 1.x rename of /.well-known/agent.json → /.well-known/agent-card.json, plus AGENT_CARD_WELL_KNOWN_PATH moving to a2a.utils.constants. main.py's readiness probe hardcoded the old literal and 404'd every attempt, silently dropping every workspace's initial_prompt for ~weeks before a user reported it. Two additions to the smoke block: 1. Mount alignment: build an AgentCard, call create_agent_card_routes(), and assert AGENT_CARD_WELL_KNOWN_PATH is among the mounted paths. Catches a future SDK release that decouples the constant value from the route factory's mount path. The source-tree test (workspace/tests/test_agent_card_well_known_path.py) catches the main.py side; this catches the SDK side BEFORE PyPI upload. 2. Message helper smoke: import a2a.helpers.new_text_message and instantiate one. The v0→v1 cheat sheet (memory: reference_a2a_sdk_v0_to_v1_migration.md) flagged this as a real migration find — main.py and a2a_executor.py call it in hot paths, so an import break errors every reply before the message even leaves the workspace. Verified by running the equivalent Python inside ghcr.io/molecule-ai/workspace-template-langgraph:latest: ✓ well-known mount alignment OK (/.well-known/agent-card.json) ✓ message helper import + call OK Closes the structural-fix half of the #2193 finding from the code-review-and-quality pass: "the wheel publish smoke didn't catch this. This is the 7th a2a-sdk migration find of this kind. Task #131 is the right root-cause fix." Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 17:34:12 -07:00
hongmingwang-moleculeai	dccec657d6	Merge branch 'staging' into ci/cicd-review-quick-wins	2026-04-27 13:27:16 -07:00
Hongming Wang	5920fc856d	Merge pull request #2182 from Molecule-AI/ci/agentcard-smoke-followup-2179 fix(workspace): rename supported_protocols → supported_interfaces (CRITICAL — every boot crashes)	2026-04-27 14:58:28 +00:00
Hongming Wang	851fd21fb1	fix(workspace): rename supported_protocols → supported_interfaces (a2a-sdk 1.0) CRITICAL: every workspace boot since the a2a-sdk 1.0 migration (#1974) has been crashing at AgentCard construction with: ValueError: Protocol message AgentCard has no "supported_protocols" field The protobuf field is `supported_interfaces` (plural, interfaces — see a2a-sdk types/a2a_pb2.pyi:189). The 0.3→1.0 migration left the kwarg as `supported_protocols`, which doesn't exist in the 1.0 schema, so the constructor raises before any subsequent line of main runs. Why this hid for so long: - publish-runtime.yml's smoke step only IMPORTED molecule_runtime.main; importing the module is fine, only CONSTRUCTING the AgentCard fails - The user-visible symptom is "Workspace failed: " with empty last_sample_error, indistinguishable from generic boot timeouts - The state_transition_history=True bug (fixed in #2179) was a sibling of this — same migration, same class, just caught first Fix is symmetric with #2179: 1. workspace/main.py: rename the kwarg + comment explaining why 2. .github/workflows/publish-runtime.yml: extend the smoke block to instantiate AgentCard with the exact production call shape, so the next field-rename of this class fails at publish time instead of breaking every workspace startup Verification: - Constructed AgentCard against fresh a2a-sdk 1.0.2 in a clean venv with the corrected kwarg → succeeds - Constructed it with the original `supported_protocols` kwarg → fails immediately with the exact error production sees - Smoke test pinned to mirror main.py's exact call shape; main.py + smoke must stay in lockstep going forward Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 07:54:23 -07:00
Hongming Wang	1a703f5687	fix(publish-runtime): wait for PyPI propagation + expand path filter Two structural fixes for the cascade race conditions that bit us five times today: 1. PyPI propagation wait (cascade job): poll PyPI for the just-published version with a 60s budget BEFORE firing repository_dispatch. PyPI accepts the upload but takes a few seconds to make it available via the package index. Cascade was firing too fast — downstream template builds ran `pip install` against a stale index, resolved to the previous version, and docker layer cache locked that in for subsequent rebuilds. Pairs with the build-arg cache invalidation in molecule-ci PR (separate change). Wait without invalidation = next build still pip-resolves correctly. Invalidation without wait = first cascade build may still race PyPI propagation. Together: no race, no stale cache. 2. Path filter expansion: scripts/build_runtime_package.py is the build script and changes to it (e.g. import-rewrite fixes, manifest emit, lib/ subpackage move) directly affect what ships in the wheel. Was missing from the path filter, so PRs touching only scripts/ (like #2174's lib/ fix) didn't auto-publish — the operator had to remember a manual dispatch. Add it to the closed list of files that trigger auto-publish. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 07:42:37 -07:00
Hongming Wang	82b366fce5	ci: add pr-guards caller that disables auto-merge on push Thin caller for molecule-ci's reusable disable-auto-merge-on-push workflow. Forces operator re-engagement when a commit is pushed to an open PR with auto-merge already enabled. Pairs with the org-wide "Automatically delete head branches" repo setting (also enabled today). Defense in depth: 1. Repo setting blocks pushes to a merged-and-deleted branch (post-merge orphan case — what bit #2174 today: my second commit landed on an already-merged-and-deleted branch). 2. This workflow catches in-queue races (push lands while the merge queue is processing) by disabling auto-merge so the operator must explicitly re-engage. Together they cover the full lifecycle of "auto-merge enabled → new commits arrive" without relying on operator discipline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 06:39:31 -07:00
Hongming Wang	3df5867b56	fix: restore main_sync entry point in workspace/main.py The wheel's pyproject.toml has declared `molecule-runtime = "molecule_runtime.main:main_sync"` since the publish pipeline was created on 2026-04-26, but the function itself was never present in workspace/main.py — it lived in the pre-monorepo molecule-ai-workspace-runtime repo and was lost during the consolidation that made workspace/ the source of truth. The 0.1.15 wheel still had main_sync from a leftover snapshot, so the regression went unnoticed until 0.1.16 (the first wheel built from the new source-of-truth) shipped. Symptom: every workspace container restart loops with ImportError: cannot import name 'main_sync' from 'molecule_runtime.main' — the molecule-runtime CLI script's first line tries to import the missing symbol. Workspaces stay in `provisioning` until the 10-min sweep marks them failed. Caught by .github/workflows/runtime-pin-compat.yml, which already imports the symbol by name as its smoke test. (That check kept failing red on every recent merge_group run; this PR fixes the underlying symbol-not-found instead of the smoke step.) Also strengthens publish-runtime.yml's wheel smoke from `import molecule_runtime.main` (loads the module — passes even when entry-point target is missing) to `from molecule_runtime.main import main_sync` (the actual contract the CLI script needs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 03:35:49 -07:00
Hongming Wang	c68dc1877f	fix(release): drift-gate TOP_LEVEL_MODULES + smoke-import main in publish Two compounding bugs surfaced when 0.1.16 hit production today: 1. scripts/build_runtime_package.py had a hand-curated TOP_LEVEL_MODULES set listing every workspace/.py that should get its bare imports rewritten to `molecule_runtime.X`. The set silently went stale: - Missing: transcript_auth (added since #87 phase 1c), runtime_wedge, watcher → unrewritten imports shipped, every workspace startup died with ModuleNotFoundError. - Stale: claude_sdk_executor, cli_executor (both removed in #87), hermes_executor (never existed) → harmless but misleading. 2. publish-runtime.yml's wheel-smoke step asserted on stable invariants (BaseAdapter, AdapterConfig, a2a_client error sentinel) but never imported main. So even though main.py held the broken bare `from transcript_auth import ...`, the smoke check passed. Fixes: - Build script now derives the on-disk module set from workspace/.py and asserts it matches TOP_LEVEL_MODULES exactly. Drift in either direction fails the build with a specific diff message instead of shipping a broken wheel. Closed-list typo guard preserved (we still edit the set explicitly when a module is added/removed) — the gate just makes drift impossible to ignore. - TOP_LEVEL_MODULES updated to current reality: drop the 3 stale, add the 3 missing. - publish-runtime.yml wheel-smoke now `import molecule_runtime.main` before the invariant asserts. main is the entry point and transitively imports every module — any bare-import bug surfaces as ModuleNotFoundError before PyPI accepts the upload. Tested locally: `python3 scripts/build_runtime_package.py --version 0.1.99 --out /tmp/build-test` succeeds, and /tmp/build-test/molecule_runtime/main.py contains the rewritten `from molecule_runtime.transcript_auth import ...`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 03:19:17 -07:00
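The drift gate from commit c68dc1877f compares a hand-maintained module set with what is on disk and fails on a difference in either direction. A minimal sketch under stated assumptions: the on-disk set is taken as the stems of top-level `*.py` files in the given directory (the real script's exact glob and exclusions are not shown in the commit message), and `module_drift` / `assert_no_drift` are hypothetical names.

```python
from pathlib import Path


def module_drift(workspace_dir: Path, declared: set[str]) -> tuple[set[str], set[str]]:
    """Return (unlisted, stale) module names.

    unlisted: present on disk but absent from the declared set.
    stale:    declared but no longer present on disk.
    """
    on_disk = {p.stem for p in workspace_dir.glob("*.py")}
    return on_disk - declared, declared - on_disk


def assert_no_drift(workspace_dir: Path, declared: set[str]) -> None:
    """Abort with a specific diff message when the two sets disagree."""
    unlisted, stale = module_drift(workspace_dir, declared)
    if unlisted or stale:
        raise SystemExit(
            f"TOP_LEVEL_MODULES drift: unlisted={sorted(unlisted)} stale={sorted(stale)}"
        )
```

Keeping the declared set explicit while asserting equality preserves the typo guard the commit mentions: a new module still has to be added by hand, but forgetting to do so now breaks the build.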
Hongming Wang	0a455b7d71	feat(publish-runtime): auto-publish to PyPI on staging pushes that touch workspace/ Adds a third trigger so any merge to staging that changes workspace/ auto-publishes a new molecule-ai-workspace-runtime patch release. Closes the human-in-loop gap that caused tonight's RuntimeCapabilities ImportError outage. Tonight: #117 added RuntimeCapabilities to molecule_runtime.adapters.base. The merge landed at 02:37 UTC. Templates rebuilt their images at 07:37 UTC (4 hours later) and started importing the new symbol. PyPI was still serving 0.1.15 (pre-#117) because nobody remembered to push a runtime-vX.Y.Z tag or workflow_dispatch the publish. Result: every template image shipped tonight runs `from molecule_runtime.adapters.base import RuntimeCapabilities` against an installed runtime that doesn't export it -> ImportError -> workspace never registers -> stuck in provisioning until 10-min sweep. Mechanism: - New trigger: push to staging filtered to paths: ['workspace/']. Path filter applies only to branch pushes; the existing tag trigger still fires unconditionally. - Version derivation for the auto case: query PyPI's JSON API for current latest, bump the patch component. PyPI is the source of truth so concurrent runs don't double-publish (HTTP 400 on collision). - concurrency: group serializes parallel staging merges so they don't race on the bump computation. cancel-in-progress: false because each workspace/** change deserves its own release. - publish job now exposes its derived version as a job-level output so the cascade reads it cleanly. Fixes a latent bug: cascade tried to read steps.version.outputs.version, which is from a different job's scope and silently resolved to empty -- then re-derived from GITHUB_REF_NAME, which would have been "staging" under the new trigger and produced an invalid version. Tag-driven and manual-dispatch paths are unchanged. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 02:11:45 -07:00
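The version derivation in commit 0a455b7d71 (read the current latest from PyPI, bump the patch component) can be sketched as below. This assumes a plain `MAJOR.MINOR.PATCH` version and a payload shaped like PyPI's JSON API response, where the latest release is at `info.version`. The helper names are hypothetical and the network fetch is deliberately left out.

```python
def latest_from_payload(payload: dict) -> str:
    """Extract the latest released version from a parsed PyPI JSON payload."""
    return payload["info"]["version"]


def bump_patch(version: str) -> str:
    """Return `version` with its patch component incremented by one.

    Only three-part numeric versions are accepted; anything else raises
    ValueError rather than guessing at a scheme.
    """
    parts = version.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected MAJOR.MINOR.PATCH, got {version!r}")
    major, minor, patch = (int(p) for p in parts)
    return f"{major}.{minor}.{patch + 1}"
```

For example, `bump_patch(latest_from_payload({"info": {"version": "0.1.15"}}))` yields `"0.1.16"`. Rejecting anything that is not a three-part numeric version also covers the failure the commit describes, where a branch name like "staging" was re-derived as a version.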
Hongming Wang	c1e9aa7461	Merge pull request #2153 from Molecule-AI/fix/block-internal-paths-shallow-clone-bug fix(ci): block-internal-paths handle merge_group + shallow-clone BASE	2026-04-27 06:58:32 +00:00
Hongming Wang	7ac7a010fa	fix(ci): block-internal-paths handle merge_group + shallow-clone BASE [Molecule-Platform-Evolvement-Manager] ## What was broken Same bug class as the secret-scan.yml fix in #2120 — block-internal-paths hit `fatal: bad object <sha>` exit 128 on the staging push at 2026-04-27 06:50:33Z. Two cases: 1. `merge_group` events: BASE/HEAD came from `github.event.before` / `.after` which are push-event-only properties. On merge_group both came back empty, the script fell through to "scan entire tree" mode which is correct but inefficient. Worse, when this workflow is required for the merge queue (line 21-22), an empty-BASE entire-tree scan would run on every queue check. 2. `push` events with shallow clones: `fetch-depth: 2` doesn't always cover BASE across true merge commits. When BASE is in the payload but absent from the local object DB, `git diff` errors out with `fatal: bad object <sha>` and the job exits 128. This is what broke today's staging push. ## Fix Same shape as the secret-scan.yml fix (#2120): - Add a dedicated `git fetch` step for `merge_group.base_sha`. - Move event-specific SHAs into a step `env:` block; script uses a `case` over `${{ github.event_name }}` covering pull_request / merge_group / push (rather than `if pull_request / else push` which left merge_group on the empty-BASE branch). - On-demand fetch + `git cat-file -e` guard for push BASE so a SHA that's payload-present-but-DB-absent triggers the fetch, and a fetch failure falls through cleanly to "scan entire tree" instead of exiting 128. ## Test plan - [x] YAML structure preserved (no schema changes) - [x] Bash logic mirrors the secret-scan recovery path tested in #2120 - [ ] CI green on this PR's pull_request scan + push to staging post-merge 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 23:54:00 -07:00
Hongming Wang	a4b3ebf951	test(e2e): claude-code + hermes priority-runtimes happy path Self-contained happy-path E2E for the two runtimes the project commits to first-class support for (task #116, completes the loop on the "both must work end-to-end with tests" requirement). What it proves per runtime: 1. POST /workspaces succeeds with the runtime + secrets 2. Workspace reaches status=online within its cold-boot window (claude-code: 240s, hermes: 900s on cold apt + uv + sidecar) 3. POST /a2a (message/send "Reply with PONG") returns a non-error, non-empty reply 4. activity_logs row written with method=message/send and ok|error status (a2a_proxy.LogActivity contract) Skip semantics: each phase independently checks for its required env key (CLAUDE_CODE_OAUTH_TOKEN / E2E_OPENAI_API_KEY) and skips cleanly if absent. The script always exit-0s if every phase either passed or skipped — so wiring it into a no-keys CI job validates the script itself stays clean without false-failing. Idempotent: pre-sweeps any prior "Priority E2E (claude-code)" / "Priority E2E (hermes)" workspaces so a run interrupted by SIGPIPE / kill -9 (which bypasses the EXIT trap) doesn't poison the next run. Same defensive pattern as test_notify_attachments_e2e.sh. CI wiring: - e2e-api.yml — runs on every PR with no LLM keys, both phases skip, catches script-level regressions (set -u bugs, syntax issues, etc.) - canary-staging.yml + e2e-staging-saas.yml already have the keys via secrets.MOLECULE_STAGING_OPENAI_KEY and exercise wire-real behavior — could be wired to opt-in if you want claude-code coverage there too. Local runs (from this branch, no keys): === Results: 0 passed, 0 failed, 2 skipped === Validates the capability primitives shipped in PRs #2137-2144: once template PRs #12 (claude-code) + #25 (hermes) merge with their declared provides_native_session=True + idle_timeout_override=900, a manual run with both keys validates the full native+pluggable chain. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 23:48:54 -07:00
rabbitblood	b81d8e9fc5	chore(secret-scan): add sk-cp- MiniMax pattern (F1088 retroactive fix)	2026-04-26 21:43:22 -07:00
Hongming Wang	62cfc21033	test(comms): comprehensive E2E coverage for agent → user attachments User asked to "keep optimizing and comprehensive e2e testings to prove all works as expected" for the communication path. Adds three layers of coverage for PR #2130 (agent → user file attachments via send_message_to_user) since that path has the most user-visible blast radius: 1. Shell E2E (tests/e2e/test_notify_attachments_e2e.sh) — pure platform test, no workspace container needed. 14 assertions covering: notify text-only round-trip, notify-with-attachments persists parts[].kind=file in the shape extractFilesFromTask reads, per-element validation rejects empty uri/name (regression for the missing gin `dive` bug), and a real /chat/uploads → /notify URI round-trip when a container is up. 2. Canvas AGENT_MESSAGE handler tests (canvas-events.test.ts +5) — pin the WebSocket-side filtering that drops malformed attachments, allows attachments-only bubbles, ignores non-array payloads, and no-ops on pure-empty events. 3. Persisted response_body shape test (message-parser.test.ts +1) — pins the {result, parts} contract the chat history loader hydrates on reload, so refreshing after an agent attachment restores both caption and download chips. Also wires the new shell E2E into e2e-api.yml so the contract regresses in CI rather than only in manual runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 20:41:56 -07:00
Hongming Wang	3a36d732e4	fix(ci): sweep prior UTC day in e2e safety nets (midnight-rollover) [Molecule-Platform-Evolvement-Manager] ## What was breaking All three staging e2e workflows' "Teardown safety net" steps filtered candidate slugs by `f'e2e-...-{today}-...'` where `today` was computed at safety-net-step time via `datetime.date.today()`. When a run crossed midnight UTC (start before 00:00, end after), `today` became the NEXT day, but the slug it created carried the PRIOR day's date. The filter never matched its own slug → leak. ## Today's incident E2E Staging Canvas run [24970092066]( https://github.com/Molecule-AI/molecule-core/actions/runs/24970092066): - started 2026-04-26 23:45:59Z - created slug `e2e-canvas-20260426-1u8nz3` at 23:59Z - ended 2026-04-27 00:12:47Z (failure) - safety-net step ran with `today=20260427` - filter `e2e-canvas-20260427-` did not match `...20260426-1u8nz3` - tenant + child workspace EC2 both stayed up Confirmed via CP staging logs: no DELETE for `1u8nz3` ever issued. The Playwright globalTeardown didn't fire (test crashed mid-run); the workflow safety-net was the last line and it missed. ## Fix All three workflows now sweep BOTH today AND yesterday's UTC dates, so a run that crosses midnight still matches its own slug: ```python today = datetime.date.today() yesterday = today - datetime.timedelta(days=1) dates = (today.strftime('%Y%m%d'), yesterday.strftime('%Y%m%d')) prefixes = tuple(f'e2e-canvas-{d}-' for d in dates) # (canvas variant) ``` Per-run-id scoping (saas + canary) is preserved — the prior-day prefix still includes the run_id, so cross-midnight runs only sweep their own slugs, not other in-flight runs from yesterday. ## Why two-day window vs. arbitrary lookback A run can't legitimately last more than 24h on GitHub-hosted runners (workflow `timeout-minutes` caps; canary=25, e2e-saas=45, canvas=30). Two-day window is enough to cover any cross-midnight run without widening the cross-run-cleanup blast radius further. 
The `sweep-stale-e2e-orgs.yml` cron (with its 120-min age threshold) remains the catch-all for anything older that drifts through. ## Test plan - [x] Manual logic simulation: post-midnight slug matches yesterday's prefix; same-day still matches; 2-days-ago does NOT match; production tenant never matches - [x] All three workflow YAMLs syntactically valid - [ ] Next cross-midnight run cleans up its own slug 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 19:23:36 -07:00
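The two-day sweep window from commit 3a36d732e4 can be checked against the incident slug with a small sketch. It takes the date as a parameter instead of calling `datetime.date.today()` so the cross-midnight case is reproducible; `sweep_prefixes` and `should_sweep` are hypothetical names, and only the canvas-style `e2e-canvas-<date>-` prefix is modelled (the per-run-id scoping used by saas and canary is omitted).

```python
import datetime


def sweep_prefixes(today: datetime.date, kind: str = "canvas") -> tuple[str, ...]:
    """Slug prefixes for today and the prior day."""
    yesterday = today - datetime.timedelta(days=1)
    return tuple(f"e2e-{kind}-{d.strftime('%Y%m%d')}-" for d in (today, yesterday))


def should_sweep(slug: str, today: datetime.date, kind: str = "canvas") -> bool:
    """True when the slug was created today or yesterday by this kind of run."""
    return slug.startswith(sweep_prefixes(today, kind))
```

With `today = datetime.date(2026, 4, 27)`, the leaked slug `e2e-canvas-20260426-1u8nz3` now matches, while a slug dated 20260425 does not.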
rabbitblood	6e0a8e8e1c	docs(ci): fix secret-scan reusable workflow self-doc — repo is molecule-core, ref is @staging	2026-04-26 15:44:31 -07:00
Hongming Wang	05ee0843fc	Merge pull request #2125 from Molecule-AI/fix/canary-teardown-slug-pattern fix(ci): canary teardown safety-net slug pattern (was reversed)	2026-04-26 22:04:46 +00:00
Hongming Wang	7425351321	fix(ci): canary teardown safety-net slug pattern (was reversed) [Molecule-Platform-Evolvement-Manager] ## What was broken `canary-staging.yml`'s teardown safety-net step filtered candidate slugs with `f'e2e-{today}-canary-'`. But `test_staging_full_saas.sh` emits canary slugs as `e2e-canary-${date}-${RUN_ID_SUFFIX}` — date SECOND, mode FIRST. Full-mode slugs are the other way around (`e2e-${date}-${RUN_ID_SUFFIX}`), and the canary workflow seems to have been copy-pasted from there without re-checking the slug generator. Net effect: the safety-net step ran on every cancelled / failed canary, hit the CP, got the org list, filtered to zero matches, and exited cleanly. Every cancelled canary EC2 leaked until the once-an-hour `sweep-stale-e2e-orgs.yml` cron eventually caught it (120-min default age threshold means ≥1h leak in the worst case). ## Today's incident Canary run 24966995140 cancelled at 21:03Z. EC2 `tenant-e2e-canary-20260426-canary-24966` still running 1h25m later, manually terminated by the CEO. Three earlier cancellations today (16:04Z, 19:26Z, 20:02Z) hit the same gap — visible as the hourly canary failure pattern in #2090. ## Fix - Filter prefix corrected to `e2e-canary-${today}-` (mode FIRST, date SECOND) to match the actual slug emitter. - Added per-run scoping (`-canary-${GITHUB_RUN_ID}-` suffix) when GITHUB_RUN_ID is set, mirroring the e2e-staging-saas.yml safety net's per-run scoping that was added after the 2026-04-21 cross-run cleanup incident — guards against a queued canary's safety-net step deleting an in-flight different canary's slug while the queue's `cancel-in-progress: false` lets two reach the teardown step concurrently. - Added a comment block tracing the bug + the prior incident so the next maintainer doesn't re-introduce the same mistake. 
## Test plan - [x] Manual trace: today's slug `e2e-canary-20260426-canary-24966...` now matches `e2e-canary-20260426-canary-24966` prefix - [x] YAML parses - [ ] Next canary cancellation cleans up automatically ## Companion PR The PRIMARY symptom (TLS-timeout failures, not the leaked EC2) traces to a separate bug in `molecule-controlplane`: tunnel/DNS creation errors are logged-and-continued rather than failing provision. PR coming separately. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 14:44:27 -07:00
rabbitblood	5478beef90	fix(canary): bump job timeout to 25m so bash fail + diagnostic can fire (#2090) PR #2107 bumped the bash-side TLS-readiness deadline in tests/e2e/test_staging_full_saas.sh from 600s to 900s (15 min) AND added a diagnostic burst on the fail path so the next failure would identify the broken layer (DNS / TLS / HTTP). What I missed: the canary workflow's own timeout-minutes was also 15. So GitHub Actions killed the job at the 15:00 wall-clock mark BEFORE the bash `fail` + diagnostic could fire — every cancellation silent, no failure comment on #2090, no diagnostic data attached. Visible in the 21:03 UTC canary run: cancelled at 14:03 step time (15:18 wall) without ever reaching the diagnostic block. Bump to 25 min — gives ~10 min headroom over the 15-min bash deadline for setup (org create + tenant provision + admin token fetch) plus the diagnostic dump plus teardown. Still tighter than the sibling staging E2E jobs (20/40/45 min) so a genuine wedge surfaces here first. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 14:36:02 -07:00
Hongming Wang	0ce537750c	fix(ci): handle merge_group + shallow-clone BASE in secret-scan [Molecule-Platform-Evolvement-Manager] ## What was breaking Two distinct failure modes in `.github/workflows/secret-scan.yml`, both visible after PR #2115 / #2117 hit the merge queue: 1. `merge_group` events: the script reads `github.event.before / after` to determine BASE/HEAD. Those properties only exist on `push` events. On `merge_group` events both came back empty, the script fell through to "no BASE → scan entire tree" mode, and false-positived on `canvas/src/lib/validation/__tests__/secret-formats.test.ts` which contains a `ghp_xxxx…` literal as a masking-function fixture. (Run 24966890424 — exit 1, "matched: ghp_[A-Za-z0-9]{36,}".) 2. `push` events with shallow clone: `fetch-depth: 2` doesn't always cover BASE across true merge commits. When BASE is in the payload but absent from the local object DB, `git diff` errors out with `fatal: bad object <sha>` and the job exits 128. (Run 24966796278 — push at 20:53Z merging #2115.) ## Fixes - Add a dedicated fetch step for `merge_group.base_sha` (mirrors the existing pull_request base fetch) so the diff base is in the object DB before `git diff` runs. - Move event-specific SHAs into a step `env:` block so the script uses a clean `case` over `${{ github.event_name }}` instead of a single `if pull_request / else push` that left merge_group on the empty branch. - Add an on-demand fetch for the push-event BASE when it isn't in the shallow clone, plus a `git cat-file -e` guard before the diff so we fall through cleanly to the "scan entire tree" path if the fetch fails (correct, just slower) instead of exiting 128. ## Defense-in-depth `secret-formats.test.ts` had two literal continuous-string fixtures (`'ghp_xxxx…'`, `'github_pat_xxxx…'`). The ghp_ one matched the secret-scan regex. 
Switched both to the `'prefix_' + 'x'.repeat(N)` pattern already used elsewhere in the same file — runtime value is the same, but the literal source text no longer matches the regex even if the BASE detection ever falls back to tree-scan mode again. ## Test plan - [x] No remaining regex matches in the secret-formats.test.ts source - [x] YAML structure preserved - [ ] CI passes on this PR's pull_request scan (was already passing) - [ ] CI passes on this PR's merge_group scan (the new path) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 14:08:19 -07:00
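The defense-in-depth change in 0ce537750c relies on the scanner matching source text, not runtime values. A small sketch of that distinction, using the regex quoted in the failing run; the two "source line" strings are illustrative stand-ins, not the actual contents of `secret-formats.test.ts`.

```python
import re

# Pattern as reported by the failing run: "matched: ghp_[A-Za-z0-9]{36,}".
GHP_PATTERN = re.compile(r"ghp_[A-Za-z0-9]{36,}")

# Illustrative source text with a continuous literal fixture. It is built by
# concatenation here so this example does not itself contain a matching literal.
literal_source = "const fixture = 'ghp_" + "x" * 36 + "';"

# Illustrative source text using the 'prefix_' + 'x'.repeat(N) pattern:
# the characters after "ghp_" are a quote and a plus sign, so the regex stops.
split_source = "const fixture = 'ghp_' + 'x'.repeat(36);"

print(bool(GHP_PATTERN.search(literal_source)))
print(bool(GHP_PATTERN.search(split_source)))
```

Both fixtures evaluate to the same string at test time; only the first form trips a scan of the file's text.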
Hongming Wang	a25ed57613	Merge pull request #2115 from Molecule-AI/chore/codeowners-personal-review-routing chore: add CODEOWNERS to auto-route agent PRs to your personal review account	2026-04-26 20:45:30 +00:00
Hongming Wang	dac55f3b42	chore: add CODEOWNERS to auto-route agent PRs to personal review account After landing the 1-required-review gate on staging in cycle 24, every agent-authored PR sits with `REVIEW_REQUIRED` until someone notices. CODEOWNERS solves the routing half: every changed path matches `*`, so GitHub auto-requests review from @hongmingwang-moleculeai (the personal account, separate from the HongmingWang-Rabbit agent identity). PRs land in the personal account's notification queue automatically. The `* @hongmingwang-moleculeai` line is informational (route the request) rather than enforced — branch protection's require_code_owner_reviews flag is off, so any approving review still satisfies the 1-review gate. Flip that on later if you want CODEOWNERS approval to be the required review type. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 13:40:13 -07:00
Hongming Wang	263012249c	Merge pull request #2109 from Molecule-AI/feat/org-wide-secret-scan-workflow feat(ci): add secret-scan workflow + reusable entry point for org-wide enrollment	2026-04-26 20:37:16 +00:00
Hongming Wang	f3a204347c	fix(publish-runtime): use PyPI Trusted Publisher (OIDC) instead of PYPI_TOKEN (#2113) Drops the static PYPI_TOKEN secret in favor of OIDC trusted publishing. PyPI now mints a short-lived upload credential after verifying the workflow's OIDC claim against the trusted-publisher config registered for molecule-ai-workspace-runtime (Molecule-AI/molecule-core, publish-runtime.yml, environment pypi-publish). Why: - A leaked PYPI_TOKEN would let any holder publish arbitrary versions of molecule-ai-workspace-runtime to PyPI from anywhere — bypassing the monorepo's review and CI gates entirely. The 8 template repos pull this package; a malicious publish poisons all of them. - Trusted Publisher (OIDC) makes that exfil path moot: no long-lived credential exists to leak. Only this exact workflow, on this repo, in the pypi-publish environment, can upload. After this lands and the first OIDC publish succeeds, the PYPI_TOKEN repo secret should be deleted (it becomes dead weight + a leak surface with no purpose). Belt-and-suspenders companion to PR #56 in molecule-ai-workspace-runtime (sibling repo lockdown). Without OIDC, the sibling lockdown alone doesn't prevent local `python -m build && twine upload` from a laptop with a personal PyPI maintainer credential. Co-authored-by: Hongming Wang <hongmingwangalt@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 13:14:47 -07:00
Hongming Wang	199630908d	fix(publish-runtime): smoke test asserts stable invariants, not feature flags (#2112) The original smoke step had `assert a2a_client._A2A_QUEUED_PREFIX` which is a feature-flag-style check — it fires false-positive every time staging is mid-release of that specific feature. Caught when the dry-run publish (run 24965411618) failed because _A2A_QUEUED_PREFIX hadn't landed on staging yet (it lives in PR #2061's series, separate from the PR #2103 chain that shipped this workflow). Replaced with checks for stable invariants of the package contract: - a2a_client._A2A_ERROR_PREFIX exists (always has, since the [A2A_ERROR] sentinel is the foundational error-tagging primitive) - adapters.get_adapter is callable - BaseAdapter has the .name() static method (interface anchor) - AdapterConfig has __init__ (dataclass present) These four cover the cases the smoke test actually needs to catch: import-path rewrites broken by build_runtime_package.py, missing modules, dataclass shape regressions. They don't fire when a specific feature is mid-merge. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Hongming Wang <hongmingwangalt@gmail.com>	2026-04-26 13:14:15 -07:00
rabbitblood	8edbd12980	feat(ci): add secret-scan workflow + reusable entry point for org-wide enrollment Defense-in-depth for the #2090-class incident (2026-04-24): GitHub's hosted Copilot Coding Agent leaked a ghs_* installation token into tenant-proxy/package.json via npm init slurping the URL from a token-embedded origin remote. We can't fix upstream's clone hygiene, so we gate at the PR layer. Single workflow, dual purpose: 1. PR / push / merge_group gate on this repo (molecule-monorepo). Refuses any change whose diff additions contain a credential-shaped string. Same shape as Block forbidden paths — error message tells the agent how to recover without echoing the secret value. 2. Reusable workflow entry point (workflow_call) for the rest of the org. Other Molecule-AI repos enroll with a 3-line workflow: jobs: secret-scan: uses: Molecule-AI/molecule-monorepo/.github/workflows/secret-scan.yml@main This makes molecule-monorepo the single source of truth for the regex set; consumer repos pick up new patterns without per-repo PRs. Pattern set covers GitHub family (ghp_, ghs_, gho_, ghu_, ghr_, github_pat_), Anthropic / OpenAI / Slack / AWS. Mirror of the runtime's bundled pre-commit hook (molecule-ai-workspace-runtime: molecule_runtime/scripts/pre-commit-checks.sh) — keep aligned when either side adds a pattern. Self-exclude on .github/workflows/secret-scan.yml so the file's own regex literals don't block its merge. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 12:05:18 -07:00
Hongming Wang	c01f057e6b	ci: shift e2e-staging-saas to staging + threshold canary auto-issue at 3 reds Two CICD-review quick wins consolidated into one PR: # 1. e2e-staging-saas now fires on staging, not just main The full-lifecycle SaaS E2E was main-only, so it caught regressions AFTER they shipped to staging (and into the auto-promote PR). Adding `staging` to the push + pull_request branch list catches them BEFORE the staging→main promotion opens, making canary's green into auto-promote-staging meaningfully more trustworthy. paths-filter is unchanged, so the blast radius stays the same — only provisioning-critical changes trigger the ~25-35 min run. # 2. Canary auto-issue thresholded at 3 consecutive failures The 30-min canary was opening "🔴 Canary failing" issues on every single failure and de-duping via title match. Transient flakes (CF DNS hiccup, AWS API blip) generated noise. Now: on first failure, look up the prior `THRESHOLD-1` runs of this same workflow. Only file an issue when ALL of those also failed (i.e. this is the 3rd consecutive red, ~90 min of sustained failure). If an issue is already open we still comment per-failure so the streak is visible. Threshold rationale: canary fires every 30 min, so 3 reds = ~90 min of sustained failure — past any single-run flake but well inside the deploy window so a real outage still surfaces fast. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 12:02:52 -07:00
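The consecutive-failure threshold in c01f057e6b reduces to a small predicate. A minimal sketch under stated assumptions: `should_open_issue` is a hypothetical name, and prior run conclusions are assumed to arrive newest-first as plain strings such as "failure" or "success"; the real workflow's lookup mechanism is not shown in the commit message.

```python
def should_open_issue(current_failed, prior_conclusions, threshold=3):
    """Return True only when this failure completes `threshold` reds in a row.

    prior_conclusions: conclusions of earlier runs of the same workflow,
    newest first (assumed ordering, for illustration).
    """
    if not current_failed:
        return False
    needed = threshold - 1  # the prior THRESHOLD-1 runs must also have failed
    recent = prior_conclusions[:needed]
    # Require a full window: too little history never files an issue.
    return len(recent) == needed and all(c == "failure" for c in recent)


# Third consecutive red files an issue; a single flake does not.
print(should_open_issue(True, ["failure", "failure", "success"]))
print(should_open_issue(True, ["success", "failure", "failure"]))
```

At a 30-minute cadence, a threshold of 3 corresponds to the roughly 90 minutes of sustained failure the commit cites.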
Hongming Wang	0de67cd379	feat(platform/admin): /admin/workspace-images/refresh + Docker SDK + GHCR auth The production-side end of the runtime CD chain. Operators (or the post- publish CI workflow) hit this after a runtime release to pull the latest workspace-template-* images from GHCR and recreate any running ws-* containers so they adopt the new image. Without this, freshly-published runtime sat in the registry but containers kept the old image until naturally cycled. Implementation notes: - Uses Docker SDK ImagePull rather than shelling out to docker CLI — the alpine platform container has no docker CLI installed. - ghcrAuthHeader() reads GHCR_USER + GHCR_TOKEN env, builds the base64- encoded JSON payload Docker engine expects in PullOptions.RegistryAuth. Both empty → public/cached images only; both set → private GHCR pulls. - Container matching uses ContainerInspect (NOT ContainerList) because ContainerList returns the resolved digest in .Image, not the human tag. Inspect surfaces .Config.Image which is what we need. - Provisioner.DefaultImagePlatform() exported so admin handler picks the same Apple-Silicon-needs-amd64 platform as the provisioner — single source of truth for the multi-arch override. Local-dev companion: scripts/refresh-workspace-images.sh runs on the host and inherits the host's docker keychain auth — alternate path for when GHCR_USER/TOKEN aren't set in the platform env. 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-04-26 10:17:21 -07:00
Hongming Wang	f1792e1f7a	fix(ci): stop sweep-cf-orphans noise — drop merge_group + soft-skip when secrets unset The sweep-cf-orphans workflow shipped in #2088 was noisier than intended in two ways. This PR fixes both — was filed under the Optional finding I left on the original review and now matters because the noise is observably hitting the merge queue. 1) `merge_group: types: [checks_requested]` was firing the entire sweep job on every PR through the merge queue. The original intent ("future required-check support without a workflow edit") never materialized, and meanwhile every recent merge-queue eval (#2091, #2092, #2093, #2094, #2095, #2097) generated a red `Sweep CF orphans (merge_group)` run. Drop the trigger. Comment in the workflow explains the re-add path if/when the workflow IS wired as a required check (re-add the trigger AND gate the actual sweep step with `if: github.event_name != 'merge_group'` so merge-queue evals are no-op success). 2) The `Verify required secrets present` step exits 2 when the 6 secrets aren't configured yet (the PR body's post-merge step, still pending). That turns the hourly schedule into an hourly red CI run for as long as the secrets stay unset. Convert to a soft skip: emit a `::warning::` listing the missing secrets and set a `skip=true` step output, then gate the sweep step with `if: steps.verify.outputs.skip != 'true'`. Workflow reports green and ops still sees the warning when they review recent runs. Net effect: - merge-queue evals stop generating spurious red runs - the schedule reports green-with-warning until secrets land - once secrets land, behavior is identical to today's (real sweep runs, hard-fails if a secret is later removed) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 08:05:53 -07:00
Hongming Wang	355355a80a	test(workspace): centralize pytest-cov config + 92% floor (closes #1817) The Python workspace already runs pytest-cov in CI but with no threshold and inline-flagged config. CI run 24956647701 (2026-04-26 staging) reports 97% coverage on the package — well above the issue's 75% target. The actionable gap is locking in a floor so a regression can't sneak past, and centralizing config so local `pytest` matches CI. Changes: - workspace/pytest.ini — coverage flags moved into addopts (-q, --cov=., --cov-report=term-missing, --cov-fail-under=92). 92% = current 97% measurement minus the 5pp safety margin the issue's Step 3 prescribes. - workspace/.coveragerc (new) — [run] omit list and [report] skip_covered. coverage.py doesn't read pytest.ini sections, so the omit config has to live here. - .github/workflows/ci.yml — removed the inline --cov flags from the Python Lint & Test step; now reads from pytest.ini. Workflow stays the same single-command shape, just simpler. Result: any PR that drops coverage below 92% fails CI loudly. Floor ratchets up by replacing 92 with current measurement on a future test-writing pass — same shape as Go coverage gates landed elsewhere. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 06:21:22 -07:00
rabbitblood	0ae6b201b4	refactor(ci): apply simplify findings on PR #2088 - Drop redundant 'aws --version' step. Script's own 'aws ec2 describe-instances' fails just as loud with a more actionable error; the pre-check added ~1s with no signal value. - timeout-minutes 10 → 3. Realistic worst case is ~2min (4 curls + 1 aws + N×CF-DELETE each individually capped at 10s by the script's curl -m flag). 3 surfaces hangs within one cron tick instead of burning the full interval. - Document the schedule-vs-dispatch dry-run asymmetry inline so the next reader doesn't need to trace input defaults. - Add merge_group: types: [checks_requested] for queue parity with runtime-pin-compat.yml — cheap insurance if this ever becomes a required check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 04:18:24 -07:00
rabbitblood	3c18b76aa7	ops(cf): hourly sweep workflow for orphan Cloudflare DNS records (#239) Closes Molecule-AI/molecule-controlplane#239. CF zone hit the 200-record quota 2026-04-23+ — every E2E and canary left a record on moleculesai.app, and no scheduled job pruned them. Provisions started failing with code 81045 ('Record quota exceeded'). The sweep-cf-orphans.sh script (PR #1978, with decision-function unit tests added in #2079) already exists but no workflow fires it. Adding it here as a parallel janitor to sweep-stale-e2e-orgs.yml: - hourly schedule at :15 (offset from the e2e-orgs sweep at :00 so the two converge cleanly without racing the same CP admin endpoint) - workflow_dispatch with dry_run input default true (ad-hoc verify without committing to deletes) - workflow_dispatch with max_delete_pct input for major cleanups (the script's own MAX_DELETE_PCT defaults to 50% as a safety gate) - concurrency group prevents schedule + manual-dispatch from racing the same zone Why a separate workflow vs sweep-stale-e2e-orgs.yml: - That workflow drives DELETE /cp/admin/tenants/:slug, assumes CP has the org row. Doesn't catch records left when CP itself never knew about the tenant (canary scratch, manual ops experiments) or when the CP-side cascade's CF-delete branch failed. - sweep-cf-orphans.sh enumerates the CF zone directly + matches against live CP slugs + AWS EC2 names. Catches what the CP-driven sweep can't. Required secrets (will need to be set on the repo): CF_API_TOKEN, CF_ZONE_ID, CP_PROD_ADMIN_TOKEN, CP_STAGING_ADMIN_TOKEN, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY. Pre-flight verify-secrets step fails loud if any are missing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 04:16:43 -07:00
Hongming Wang	1e7f8ebb1b	Merge pull request #2079 from Molecule-AI/feat/test-sweep-cf-decide-2027 test(ops): unit tests for sweep-cf-orphans decide() (#2027)	2026-04-26 09:21:45 +00:00
rabbitblood	5ce7af2d2c	fix(ci): set WORKSPACE_ID for the runtime-pin smoke import platform_auth.py validates WORKSPACE_ID at module load — EC2 user-data sets it from cloud-init, but the CI smoke-test was missing it and failed with 'WORKSPACE_ID is empty'. Set a placeholder UUID so the import gate exercises only the dep-resolution path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 01:59:56 -07:00
rabbitblood	b817251c85	refactor(ci): apply simplify findings on #2083 Review of the runtime-pin-compat workflow: - Add merge_group trigger so when this becomes a required check the queue green-checks it (mirrors ci.yml convention). - Cache pip on workspace/requirements.txt — actions/setup-python@v5 with cache: pip + cache-dependency-path. Saves ~30s per fire. - Document the load-bearing install order: runtime FIRST so pip honors the runtime's declared a2a-sdk constraint (the surface that broke 2026-04-24); workspace/requirements.txt SECOND so a2a-sdk is upgraded to the runtime image's pinned version. Import smoke validates the upgraded combination. Skipped: branch-protection wiring (separate ops decision, not in scope here); ci.yml integration (the standalone schedule trigger is the load-bearing reason to keep this workflow separate). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 01:32:56 -07:00
rabbitblood	9b42a5e311	test(ci): runtime + a2a-sdk pin compatibility gate (controlplane#253) Closes Molecule-AI/molecule-controlplane#253. Prevents recurrence of the 5-hour staging outage from 2026-04-24: molecule-ai-workspace-runtime 0.1.13 declared `a2a-sdk<1.0` in its metadata but actually imported `a2a.server.routes` (1.0+ only). pip resolved successfully; every tenant workspace crashed at import. The canary tenant ultimately caught it but only after 5 hours of degraded staging. PR #249 fixed the version pin manually; nothing automated catches the same class of bug for the next release. This workflow: - Installs molecule-ai-workspace-runtime fresh from PyPI in a Python 3.11 venv (mirrors EC2 user-data install pattern) - Layers in workspace/requirements.txt (the runtime image's actual dep set, including the a2a-sdk[http-server]>=1.0,<2.0 pin) - Runs `from molecule_runtime.main import main_sync` — same import the runtime entrypoint does - Fails CI if pip resolution silently produced a combo that the runtime can't actually import Triggers: - PR + push to main/staging touching workspace/requirements.txt or this workflow (catches local pin changes) - Daily 13:00 UTC schedule (catches upstream PyPI publishes that break the pin combo without any change in our repo) - workflow_dispatch (manual) Concurrency cancels in-progress runs on the same ref. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 01:30:36 -07:00
Hongming Wang	b5f9cbbc55	ci(retarget): handle 422 'duplicate PR' by closing redundant main-PR (closes #1884) When a bot opens a PR against main and there's already another PR on the same head branch targeting staging, GitHub's PATCH /pulls returns 422 with: "A pull request already exists for base branch 'staging' and head branch '<branch>'" Pre-fix: the retarget Action exited 1 with no further action. The target-main PR sat there as a duplicate, the workflow run showed red, and someone had to manually close the duplicate. Today's case (#1881 duplicate of #1820) had to be closed manually. Fix: catch that specific 422 message and close the main-PR as redundant instead of failing. Any OTHER 422 (or other error) still fails loud — the grep matches the specific duplicate-base text, not a blanket "any 422 means duplicate". Behaviour matrix: PATCH succeeds → retargeted, explainer comment posted PATCH 422 "already exists for staging" → close main-PR with explainer (NEW) PATCH any other failure → workflow fails (preserves loud-fail for real bugs) Tests: GitHub Actions don't have an inline unit-test framework here. The workflow YAML parses (validated locally) and the bash logic is straightforward. Real verification will be the next duplicate-PR scenario in production. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 00:53:55 -07:00
rabbitblood	6494e9192b	refactor(ops): apply simplify findings on #2027 PR Code-quality + efficiency review of PR #2079: - Hoist all_slugs = prod_slugs | staging_slugs out of decide() into the caller (was rebuilt on every record — 1k records × ~50-slug union per call). decide() signature now (r, all_slugs, ec2_names). - Compile regexes at module scope (_WS_RE, _E2E_RE, _TENANT_RE) + hoist platform-core literal set (_PLATFORM_CORE_NAMES). Same change mirrored in the bash heredoc. - Drop decorative # Rule N: comments (numbering was out of order, 3 before 2 — actively confusing). - Move the "edits must mirror" reminder OUTSIDE the CANONICAL DECIDE block in the .sh file, eliminating the .replace() comment-skip hack in TestParityWithBashScript. - Drop per-line .strip() in _slice_canonical (would mask a real indentation bug; both blocks already at column 0). - subTest() in TestPlatformCore loops so a single failure no longer short-circuits the rest of the items. - merge_group + concurrency on test-ops-scripts.yml (parity with ci.yml gate behaviour). - Fix don't apostrophe in inline comment that closed the python heredoc's single-quote and broke bash -n. All 25 tests still pass. bash -n clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 00:28:15 -07:00
rabbitblood	ba78a5c00d	test(ops): unit tests for sweep-cf-orphans decide() (#2027) Closes #2027. The CF orphan sweep deletes DNS records — a misclassification could nuke a live workspace's tunnel. The decision function had MAX_DELETE_PCT percentage gating but no automated test of category → action mapping. Approach: extract the decide() function to scripts/ops/sweep_cf_decide.py as a verbatim copy bracketed by `# CANONICAL DECIDE BEGIN/END` markers. The shell script keeps its inline heredoc (so the operational path is untouched) but bracketed by the same markers. A parity test (TestParityWithBashScript) reads both files and asserts the bracketed blocks match line-for-line — drift fails CI loudly. Coverage (25 tests, 1 file, stdlib unittest only): - Rule 1 platform-core: apex, _vercel, _domainkey, www/api/app/doc/send/status/staging-api - Rule 3 ws-: live (matches EC2 prefix) on prod + staging; orphan on prod + staging - Rule 4 e2e-: live + orphan on staging; orphan on prod - Rule 2 generic tenant: live prod + staging; unknown subdomain kept-for-safety - Rule 5 fallthrough: external domain + unrelated apex - Rule priority: api.moleculesai.app stays platform-core (not tenant); _vercel stays verification - Safety gate: under/at/over default 50% threshold; zero-total no-divide; custom threshold - Empty live-sets: documents that decide() alone classifies as orphan, gate is the defense CI: new .github/workflows/test-ops-scripts.yml runs `python -m unittest discover` against scripts/ops/ on every PR/push that touches the directory. Lightweight — no requirements file, stdlib only. Local: `cd scripts/ops && python -m unittest test_sweep_cf_decide -v` → 25 tests, all OK. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 00:22:30 -07:00
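The behaviour matrix in b5f9cbbc55 can be expressed as a three-way classifier. This is a sketch of the decision only, assuming the HTTP status and response body are already in hand; `retarget_outcome` and the outcome labels are hypothetical, and the real workflow implements this in bash with grep.

```python
# Substring taken from the 422 message quoted in the commit.
DUPLICATE_MARKER = "A pull request already exists for base branch 'staging'"


def retarget_outcome(status: int, body: str) -> str:
    """Map a PATCH /pulls response to one of the three matrix rows."""
    if 200 <= status < 300:
        return "retargeted"
    if status == 422 and DUPLICATE_MARKER in body:
        return "close-main-pr"  # redundant PR: close it with an explainer
    return "fail"  # any other 422 or error still fails loudly


body = ("A pull request already exists for base branch 'staging' "
        "and head branch 'feature-x'")
print(retarget_outcome(422, body))
```

Matching on the specific message text, rather than on the status code alone, is what keeps unrelated 422 responses on the failing path.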
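The percentage safety gate that ba78a5c00d puts under test might look like the sketch below. `within_delete_budget` is a hypothetical name, and two behaviours are assumptions made for illustration because the commit message does not state them: a planned deletion exactly at the threshold is allowed, and an empty zone passes the gate.

```python
def within_delete_budget(to_delete: int, total: int,
                         max_delete_pct: float = 50.0) -> bool:
    """Return True when the planned deletions stay within the percentage cap."""
    if total == 0:
        # Zero records: nothing can be deleted, and dividing would fail.
        return True
    # Compare by cross-multiplication so no division is needed.
    # Assumption: "at threshold" counts as allowed.
    return to_delete * 100 <= max_delete_pct * total


print(within_delete_budget(4, 10))   # under the default 50% cap
print(within_delete_budget(6, 10))   # over the default 50% cap
```

As the commit notes for empty live-sets, a gate of this kind is the defense when classification alone would mark everything as orphaned.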
Hongming Wang	194121c674	Merge pull request #2063 from Molecule-AI/feat/redeploy-tenants-on-main-merge ci(redeploy): auto-redeploy tenant EC2s after every main merge	2026-04-26 07:00:59 +00:00
Hongming Wang	fc54601999	Merge pull request #2067 from Molecule-AI/fix/canary-openai-key-staging ci(canary): inject E2E_OPENAI_API_KEY so A2A turn doesn't 500	2026-04-25 06:12:30 +00:00
Hongming Wang	fe075ee1ba	ci: hourly sweep of stale e2e-* orgs on staging Adds a janitor workflow that runs every hour and deletes any e2e-prefixed staging org older than MAX_AGE_MINUTES (default 120). Catches orgs left behind when per-test-run teardown didn't fire: CI cancellation, runner crash, transient AWS error mid-cascade, bash trap missed (signal 9), etc. Why it exists despite per-run teardown: - Per-run teardown is best-effort by definition. Any process death after the test starts but before the trap fires leaves debris. - GH Actions cancellation kills the runner with no grace period — the workflow's `if: always()` step usually catches this but can still fail on transient CP 5xx at the wrong moment. - The CP cascade itself has best-effort branches today (cascadeTerminateWorkspaces logs+continues on individual EC2 termination failures; DNS deletion same shape). Those need cleanup-correctness work in the CP, but a safety net belongs in CI either way — defense in depth. Behaviour: - Cron every hour. Manual workflow_dispatch with overrideable max_age_minutes + dry_run inputs for one-off cleanups. - Concurrency group prevents two sweeps fighting. - SAFETY_CAP=50 — refuses to delete more than 50 orgs in a single tick. If the CP admin endpoint goes weird and returns no created_at (or returns no orgs at all), every e2e-* would look stale; the cap catches the runaway-nuke case. - DELETE is idempotent CP-side via org_purges.last_step, so a half-deleted org from a prior sweep gets picked up cleanly on the next tick. - Per-org delete failures don't fail the workflow. Next hourly tick retries. The workflow only fails loud at the safety-cap gate. Tonight's specific motivation: ~10 canvas-tabs E2E retries in 2 hours with various failure modes; each provisioned a fresh tenant + EC2 + DNS + DB row. Some fraction leaked. Without this loop, ops has to periodically run the manual sweep-cf-orphans.sh script. With it, staging self-heals. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 23:07:57 -07:00
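The selection logic described in fe075ee1ba (age threshold plus SAFETY_CAP) can be sketched as follows. The function name, the dict shape of an org record, and the use of `RuntimeError` are assumptions for illustration; the commit message only specifies the 120-minute default, the cap of 50, and that a missing created_at makes an org look stale.

```python
import datetime


def select_stale(orgs, now, max_age_minutes=120, safety_cap=50):
    """Return e2e-* slugs older than the threshold, refusing runaway sweeps."""
    cutoff = now - datetime.timedelta(minutes=max_age_minutes)
    stale = []
    for org in orgs:
        if not org["slug"].startswith("e2e-"):
            continue  # never touch non-e2e orgs
        created = org.get("created_at")
        # A missing created_at looks stale; the cap below is the backstop.
        if created is None or created < cutoff:
            stale.append(org["slug"])
    if len(stale) > safety_cap:
        raise RuntimeError(
            f"refusing to delete {len(stale)} orgs (cap {safety_cap})")
    return stale


utc = datetime.timezone.utc
now = datetime.datetime(2026, 4, 25, 6, 0, tzinfo=utc)
orgs = [
    {"slug": "e2e-old", "created_at": now - datetime.timedelta(hours=3)},
    {"slug": "e2e-new", "created_at": now - datetime.timedelta(minutes=10)},
    {"slug": "acme", "created_at": now - datetime.timedelta(days=30)},
]
print(select_stale(orgs, now))
```

Failing at the cap instead of truncating the list matches the commit's statement that the workflow only fails loud at the safety-cap gate.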
Hongming Wang	9a785e9c32	ci(canary): inject E2E_OPENAI_API_KEY so A2A turn doesn't 500 The canary workflow has been failing for ~30 consecutive runs (issue #1500, opened 2026-04-21) on the same line: [hermes-agent error 500] No LLM provider configured. Run `hermes model` to select a provider, or run `hermes setup` for first-time configuration. Root cause: the canary's env block was missing E2E_OPENAI_API_KEY. Without it, tests/e2e/test_staging_full_saas.sh provisions the workspace with empty secrets; template-hermes start.sh seeds ~/.hermes/.env with no provider keys; derive-provider.sh resolves the model slug `openai/gpt-4o` to PROVIDER=openrouter (hermes has no native openai provider in its registry); A2A request at step 8/11 fails with the "No LLM provider configured" error from hermes-agent. The full-lifecycle workflow (e2e-staging-saas.yml line 84) carries the same secret correctly. Mirror its pattern + add a fail-fast preflight so future regressions surface in <5s instead of after 8 min of provision-then-die. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 22:37:13 -07:00
Hongming Wang	184f8256cd	ci(redeploy): fire post-main tenant fleet redeploy via CP admin endpoint Closes the "main merged but prod tenants still on old image" gap. ## Trigger chain main merge └─> publish-workspace-server-image (builds + pushes :latest + :<sha>) └─> redeploy-tenants-on-main (this workflow) └─> POST https://api.moleculesai.app/cp/admin/tenants/redeploy-fleet └─> Canary hongmingwang + 60s soak, then batches of 3 with SSM Run Command redeploying each tenant EC2 ## Features - Auto-fires on every successful publish-workspace-server-image run. - Manual dispatch with optional target_tag (for rollback to an older SHA), canary_slug override, batch_size, dry_run. - 30s delay before calling CP so GHCR edge cache serves the new :latest consistently to every tenant's docker pull. - Skips when publish job failed (workflow_run fires on any completion). - Job summary renders per-tenant results as a markdown table so ops can see which tenant, if any, broke the chain. - Exits non-zero on HTTP != 200 or ok=false so a broken rollout marks the commit status red. ## Secrets + vars required - secret CP_ADMIN_API_TOKEN — Railway prod molecule-platform / CP_ADMIN_API_TOKEN Mirrored into this repo's secrets. - var CP_URL (optional) — defaults to https://api.moleculesai.app ## Paired with - Molecule-AI/molecule-controlplane branch feat/tenant-auto-redeploy which adds the /cp/admin/tenants/redeploy-fleet endpoint + the SSM orchestration. This workflow is a no-op until that lands on prod CP. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 14:34:28 -07:00
Molecule AI CP-BE	ca7fa3b65e	fix(e2e): increase hermes workspace wait from 20 to 30 min Root cause of PR #1981 E2E failures (step 7 timeout): - hermes-agent install from NousResearch (Node 22 tarball + Python deps from source) + gateway health wait takes 15-25 min on staging	2026-04-24 17:11:37 +00:00


237 Commits