molecule-core

Author	SHA1	Message	Date
Hongming Wang	07a17c2e59	Merge remote-tracking branch 'origin/staging' into docs/auto-promote-staging-prereq-comment # Conflicts: # .github/workflows/auto-promote-staging.yml	2026-04-28 20:46:42 -07:00
Hongming Wang	e373fa1a96	docs(ci): document auto-promote-staging GITHUB_TOKEN PR-create prereq Add a comment block at the top of auto-promote-staging.yml naming the load-bearing one-time repo setting that the workflow depends on: Settings → Actions → General → Workflow permissions → ✅ Allow GitHub Actions to create and approve pull requests Without this toggle, every workflow_run fails with "GitHub Actions is not permitted to create or approve pull requests (createPullRequest)". Observed 2026-04-29 01:43 UTC blocking the `fcd87b9` promotion (PRs #2248 + #2249); manually bridged via PR #2252. The setting is invisible to anyone reading the workflow file, but the workflow cannot do its job without it. Documenting here so the next time it gets toggled off (org admin change, repo migration, audit cleanup) the failure mode points at the cause rather than another round of "why is auto-promote broken." Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 18:49:07 -07:00
Hongming Wang	fcd87b9526	Merge pull request #2249 from Molecule-AI/fix/publish-runtime-cascade-hard-fail-on-push fix(ci): hard-fail publish-runtime cascade on push when token missing	2026-04-29 01:33:10 +00:00
Hongming Wang	f1c6673e03	fix(ci): hard-fail publish-runtime cascade on push when token missing Mirror the sweep-cf-orphans hardening (#2248) on publish-runtime's TEMPLATE_DISPATCH_TOKEN gate. The previous behaviour was to print :⚠️:skipping cascade — templates will pick up the new version on their own next rebuild and exit 0. That message is wrong: the 8 workspace-template repos only rebuild on this repository_dispatch fanout. Without the dispatch they stay pinned to whatever runtime version they last saw, and the gap is invisible until someone notices a template several versions behind weeks later. Behaviour after this PR: - push (auto-trigger on workspace/runtime/** changes) → exit 1 - workflow_dispatch (manual operator) → exit 0 with a warning (operator already accepted state; let them rerun after restoring the secret) The token-missing path now also names the consequence concretely ("templates will NOT pick up the new version until this token is restored") so future operators see the actionable line, not the misleading "they'll catch up on their own" message. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 18:28:01 -07:00
hongmingwang-moleculeai	667751919d	Merge pull request #2248 from Molecule-AI/fix/sweep-cf-orphans-hard-fail-on-schedule fix(ci): hard-fail sweep-cf-orphans on schedule when secrets missing	2026-04-29 01:16:22 +00:00
Hongming Wang	9f39f3ef6c	fix(ci): hard-fail sweep-cf-orphans on schedule when secrets missing Replace the soft-skip-with-warning behaviour for scheduled runs of the hourly Cloudflare orphan sweeper with an explicit failure when the six required secrets aren't set. Manual workflow_dispatch keeps the soft-skip path so an operator can short-circuit a deliberate rerun without redoing the secrets dance — they accepted the state when they clicked the button. Why: from some-date to 2026-04-28, all six secrets were unset on the repo. Every hourly tick printed a yellow :⚠️: and exited 0, which GitHub registers as "completed/success" — the sweeper was indistinguishable from a healthy janitor with nothing to do. Cloudflare orphans accumulated unobserved to 152/200 (~76% of the zone quota), and only surfaced via a manual audit. The mechanism to catch this kind of regression is to make the workflow loud: red runs prompt investigation, green runs are presumed healthy. Schedule/workflow_run/push paths now print three ::error:: lines naming the missing secrets, the fix, and a one-line reference to this incident, then exit 1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 18:13:22 -07:00
Hongming Wang	5753021194	Merge pull request #2247 from Molecule-AI/fix/auto-promote-staging-pr-based fix(ci): auto-promote-staging opens a PR + uses merge queue, not direct push	2026-04-29 00:57:33 +00:00
Hongming Wang	e45a5c98b0	fix(ci): auto-promote-staging opens a PR + uses merge queue, not direct push Mirrors the fix #2234 applied to auto-sync-main-to-staging.yml in the reverse direction. Both workflows now use the same merge-queue path that humans use; no special-case bypass. Why Every tick of auto-promote-staging.yml since main's branch protection went stricter has been failing with: remote: error: GH006: Protected branch update failed for refs/heads/main. remote: - Required status checks "Analyze (go)", "Analyze (javascript-typescript)", "Analyze (python)", "Canvas (Next.js)", "Detect changes", "E2E API Smoke Test", "Platform (Go)", "Python Lint & Test", and "Shellcheck (E2E scripts)" were not set by the expected GitHub apps. remote: - Changes must be made through a pull request. The previous version did `git merge --ff-only origin/staging && git push origin main` directly. That works against a permissive branch — it doesn't work against a ruleset that requires checks satisfied by the expected GitHub apps. Only PR merges through the queue produce check runs from the right apps. Result was that today's 12+ merges to staging never propagated to main; the auto-promote ran every tick and failed every tick, while operators had to keep opening manual `staging → main` bridges. Fix - Replace the direct git push step with a step that opens (or reuses) a PR base=main head=staging and enables auto-merge. The merge queue lands it once gates are green on the merge_group ref. - The PR's head IS the staging branch (no per-SHA promote branch needed) — the whole purpose is "advance main to staging's tip". - Add `pull-requests: write` permission so the workflow can call gh pr create + gh pr merge --auto. - Drop the `git merge-base --is-ancestor` divergence check — the merge queue itself enforces branch protection now, and rejects the PR if main has diverged from staging history. Loop safety preserved: when this PR's merge lands on main, it triggers auto-sync-main-to-staging.yml which opens a sync PR back to staging. That sync PR's eventual merge is by GITHUB_TOKEN (the merge queue) which doesn't trigger downstream workflow_run events — so auto-promote-staging.yml does NOT re-fire from its own merge landing. Refs: #2234 (the parallel fix for auto-sync-main-to-staging.yml), task #142, multiple failing runs visible in https://github.com/Molecule-AI/molecule-core/actions/workflows/auto-promote-staging.yml	2026-04-28 17:54:15 -07:00
Hongming Wang	c68ea3a284	Merge pull request #2246 from Molecule-AI/chore/all-deps-batch-2026-04-28-pt2 chore(deps): batch dep bumps — 6 safe upgrades (4 actions majors + 2 npm dev deps)	2026-04-29 00:48:15 +00:00
Hongming Wang	fc59f939ac	chore(deps): batch dep bumps — 6 safe upgrades (4 actions majors + 2 npm dev deps) Consolidates the remaining safe-to-merge dependabot PRs from the 2026-04-28 wave into one consumable PR. Replaces three earlier single-bump PRs (#2245, #2230, #2231) which were closed in favor of this single batch — same pattern as #2235. GitHub Actions majors (SHA-pinned per org convention): github/codeql-action v3 → v4.35.2 (#2228) actions/setup-node v4 → v6.4.0 (#2218) actions/upload-artifact v4 → v7.0.1 (#2216) actions/setup-python v5 → v6.2.0 (#2214) npm dev deps (canvas/, lockfile regenerated in node:22-bookworm container so @emnapi/* and other Linux-only optional deps are properly resolved — Mac-native `npm install` strips them, which caused the earlier #2235 batch to drop these two): @types/node ^22 → ^25.6 (#2231) jsdom ^25 → ^29.1 (#2230) Why each is safe setup-node v4 → v6 / setup-python v5 → v6: Every consumer call pins node-version / python-version explicitly. v5 / v6 changed defaults but pinned consumers are unaffected. Confirmed via grep across .github/workflows/ — all setup-node call sites pin '20' or '22', all setup-python call sites pin '3.11'. codeql-action v3 → v4.35.2: Used as init/autobuild/analyze sub-actions in codeql.yml. v4 bundles a newer CodeQL CLI; ubuntu-latest auto-updates so functional behavior is unchanged. The deprecated CODEQL_ACTION_CLEANUP_TRAP_CACHES env var (per v4.35.2 release notes) is undocumented and we don't set it. upload-artifact v4 → v7.0.1: v6 introduced Node.js 24 runtime requiring Actions Runner >= 2.327.1. All upload-artifact users (codeql.yml, e2e-staging-canvas.yml) run on `ubuntu-latest` (GitHub- hosted), which auto-updates the runner agent. Self-hosted runners are NOT used for these jobs. @types/node 22 → 25 / jsdom 25 → 29: Both are dev-only — @types/node is type definitions, jsdom backs vitest's DOM environment. Tests pass: 79 files / 1154 tests in node:22-bookworm container. Verified locally (Linux container so the lockfile reflects what CI's `npm ci` will install): - cd canvas && npm install --include=optional → 169 packages - npm test → 1154/1154 pass - npm ci → clean install succeeds - npm run build → Next.js prerendering succeeds Closes when this lands (the 3 individual auto-merge PRs from earlier were closed): #2228 #2218 #2216 #2214 #2231 #2230 NOT included (CI failing on dependabot's own run — major framework bumps that need code-side migration tasks, not safe auto-bumps): #2233 next 15 → 16 #2232 tailwindcss 3 → 4 #2226 typescript 5 → 6	2026-04-28 17:44:55 -07:00
Hongming Wang	a1bc771f87	Merge pull request #2243 from Molecule-AI/fix/branch-protection-required-check-naming fix(ci): no-op jobs emit same check-run name as their real counterparts	2026-04-29 00:43:31 +00:00
Hongming Wang	4f0dfbbf0b	Merge pull request #2242 from Molecule-AI/fix/dispatch-input-shell-injection-hardening fix(security): harden dispatch inputs against shell injection	2026-04-29 00:31:08 +00:00
github-actions[bot]	7b2d9e9bce	fix(ci): no-op job emits same check-run name as the real one Branch protection on `main` requires "E2E API Smoke Test" as a status check. With Design B's no-op + e2e-api job split, when paths-filter excludes a commit: - e2e-api job (name="E2E API Smoke Test"): SKIPPED - no-op job (name="no-op"): SUCCESS Branch protection counts the skipped check-run as not-satisfied → auto-promote-staging's `git push origin main` rejected with GH006. Observed 2026-04-28 00:22 UTC: every gate green at the workflow level, all_green=true in auto-promote-staging's gate-check, but the FF push itself rejected with: Required status checks "..., E2E API Smoke Test, ..." were not set by the expected GitHub apps. Fix: give the no-op job the same `name:` as the real one. Now both register as check-runs named "E2E API Smoke Test" — exactly one runs per workflow execution (mutex `if`), the other registers as skipped with the same name. Branch protection sees at least one success, requirement satisfied. Same fix applied to e2e-staging-canvas.yml's no-op (name → "Canvas tabs E2E") for symmetry, even though "Canvas tabs E2E" isn't currently in main's required check list — kept consistent so the next time a required-checks reshuffle pulls it in, it doesn't recreate this bug. Note: Design B's intent was always "emit a result auto-promote can read" — that intent was satisfied at the workflow-conclusion level (success), but missed the per-check-run-name level. This PR closes that second-order gap.	2026-04-28 17:25:31 -07:00
github-actions[bot]	b2a0703f1c	fix(ci): per-SHA concurrency on staging gate workflows e2e-staging-canvas had a single global concurrency group: concurrency: group: e2e-staging-canvas cancel-in-progress: false That meant the entire repo shared one running + one pending slot. When a staging push queued behind an in-flight run and a third entrant (a PR run, a follow-on push) entered the group, the staging push got cancelled. auto-promote-staging then saw `completed/cancelled` for a required gate and refused to advance main. Observed 2026-04-28 23:51-23:53: staging tip 3f99fede's e2e-staging- canvas push run was cancelled within 2:20 of starting because a PR run on a follow-on branch entered the group. Auto-promote-staging fired 8+ times after that, all skipped because canvas was still in the cancelled state. The chain stayed stuck until the cancelled run was manually re-dispatched. e2e-api had a softer version of the same bug — `group: e2e-api-${{ github.ref }}`. Per-ref isolates push events from PR events, so this specific scenario didn't hit it, but back-to-back pushes to staging at SHA-A and SHA-B share refs/heads/staging and would still cancel SHA-A's queued run when SHA-B enters. Both workflows now use per-SHA grouping. The single-global-group's original intent was to throttle parallel E2E provisions, but each E2E run already isolates its state via fresh-org-per-run, and parallel infrastructure cost at our scale (~$0.001/min × 10min × 2) is rounding error compared to a stuck pipeline. Per-SHA still dedupes accidental double-triggers for the SAME SHA. It does not cancel obsolete-PR-version runs on force-push — that wasted CI is acceptable given the alternative is losing staging-tip data that auto-promote-staging depends on. Other gate workflows: ci.yml uses `cancel-in-progress: true` which is correct for unit tests (intentional cancellation on supersede). codeql.yml is per-ref like e2e-api was; same fix probably applies if the same deadlock pattern is observed there, but no incident yet so deferring.	2026-04-28 17:18:15 -07:00
github-actions[bot]	475a51adec	fix(ci): defer promote when E2E is racing with publish (review fix) Self-review caught a real correctness bug: scenario where publish- workspace-server-image completes BEFORE E2E Staging SaaS for a runtime- touching SHA. Publish typically takes ~5-10min; E2E ~10-15min, so this ordering is the common case for runtime-path PRs. Previous gate logic: - completed/success: proceed - completed/failure: abort - everything else (including in_progress): proceed ← BUG If publish-trigger fires while E2E is still running, the gate returned "in_progress/none" and fell through the catch-all "proceed" branch. Result: :latest retagged on the publish signal alone. Then E2E ends red — but :latest was already wrongly advanced; the E2E-completion trigger's job-level if=conclusion==success filter just skips, never rolls back. Fix: explicit case for in_progress\|queued\|requested\|waiting\|pending that DEFERS — sets gate.proceed=false, writes a "deferred" summary, exits 0 (workflow run shows success, retag steps skipped). The E2E completion trigger then fires later and either promotes (green) or aborts (red), giving us correct ordering regardless of who finishes first. Subsequent steps now guarded by `if: steps.gate.outputs.proceed == 'true'` instead of relying on `exit 1` for skip semantics. Also added an explicit catch-all `*)` branch that aborts on unknown states (forward-compat: GitHub adds a new status, we surface it instead of silently promoting through it).	2026-04-28 16:59:58 -07:00
github-actions[bot]	f4f45f8561	fix(ci): auto-promote :latest also on publish-image, not just E2E Previously this workflow only triggered on E2E Staging SaaS completion, which is itself paths-filtered to runtime handlers (workspace-server/internal/handlers/{registry,workspace_provision, a2a_proxy}.go, middleware/, provisioner/). publish-workspace-server -image fires on a STRICTLY BROADER path set (workspace-server/, canvas/, manifest.json) — so canvas-only or cmd-only or sweep-only PRs rebuilt the platform image without ever advancing :latest. Result observed 2026-04-28: zero runs of this workflow since merge despite eight main pushes. :latest sat ~7 hours / 9 PRs behind main. Fix: add publish-workspace-server-image as a second trigger. Add an explicit gate inside the job that aborts when E2E Staging SaaS for the same SHA ended red. When E2E didn't fire (paths-filtered), proceed — auto-promote-staging's pre-merge gates (CI + E2E Canvas + E2E API + CodeQL on staging) already validated this SHA before main moved. Concurrency group serializes promotes per-SHA so the publish+E2E both- fired race lands cleanly. Idempotent crane tag makes it safe regardless.	2026-04-28 16:53:30 -07:00
hongmingwang-moleculeai	a45a026099	Merge pull request #2235 from Molecule-AI/chore/deps-batch-2026-04-28 chore(deps): batch dep bumps — 11 safe upgrades from 2026-04-28 wave	2026-04-28 23:40:51 +00:00
Hongming Wang	0cdbc2c4f6	chore(deps): batch dep bumps — 11 safe upgrades from 2026-04-28 dependabot wave Consolidates 11 of the 17 open Dependabot PRs (#2215, #2217, #2219-#2225, #2227, #2229) into one PR. Every entry is a patch / minor / floor bump where the impact surface is small and CI carries the proof. Same pattern as the 2026-04-15 batch. Go (workspace-server/go.mod + go.sum, regenerated via `go mod tidy`): - golang.org/x/crypto 0.49.0 → 0.50.0 (#2225) - github.com/golang-jwt/jwt/v5 5.2.2 → 5.3.1 (#2222) - github.com/gin-contrib/cors 1.7.2 → 1.7.7 (#2220) - github.com/docker/go-connections 0.6.0 → 0.7.0 (#2223) - github.com/redis/go-redis/v9 9.7.3 → 9.19.0 (#2217) Python floor bumps (workspace/requirements.txt; current pip-resolved versions don't change unless they happen to be below the new floor): - httpx >=0.27 → >=0.28.1 (#2221) - uvicorn >=0.30 → >=0.46 (#2229) - temporalio >=1.7 → >=1.26 (#2227) - websockets >=12 → >=16 (#2224) - opentelemetry-sdk >=1.24 → >=1.41.1 (#2219) GitHub Actions (SHA-pinned per existing convention): - dorny/paths-filter@d1c1ffe (v3) → @fbd0ab8 (v4.0.1) (#2215) REMOVED from this batch (lockfile platform mismatch): - #2231 @types/node ^22 → ^25.6 (npm install on macOS strips Linux-only @emnapi/* entries from package-lock.json that CI's `npm ci` then refuses; needs a Linux-side install to land cleanly) - #2230 jsdom ^25 → ^29.1 (same) NOT included in this batch (deferred to per-PR human review): - #2228 github/codeql-action v3 → v4 (CodeQL CLI alignment risk) - #2218 actions/setup-node v4 → v6 (default Node version drift) - #2216 actions/upload-artifact v4 → v7 (3 major versions) - #2214 actions/setup-python v5 → v6 (action major) NOT merged (CI failing on dependabot's own PR): - #2233 next 15 → 16 - #2232 tailwindcss 3 → 4 - #2226 typescript 5 → 6 Verified: - workspace-server: `go mod tidy && go build ./... && go test ./...` — green - workspace requirements.txt: floor bumps only	2026-04-28 16:25:46 -07:00
Hongming Wang	cf258b3355	fix(ci): auto-sync opens a PR + uses merge queue, not direct push The molecule-core/staging branch is protected by ruleset 15500102 (name: staging-merge-queue) which blocks ALL direct pushes — no bypass even for org admins or the GitHub Actions integration. The prior version of this workflow attempted `git push origin staging` and was rejected with GH013: ! [remote rejected] staging -> staging (push declined due to repository rule violations) - Changes must be made through a pull request. - Changes must be made through the merge queue This was a real architectural mismatch: auto-sync was bypassing the same gates everyone else goes through to land on staging, which is exactly what the ruleset is designed to prevent. The fix matches the org convention: the workflow now opens a PR (base=staging, head=auto-sync/main-<sha>) and enables auto-merge. The merge queue picks it up, runs required gates against the merged result, and lands it. Same path human PRs take through staging — no special-snowflake bypass. Trade-off acknowledged - Slight PR churn: every main push that needs sync opens a tracked PR. With concurrency: cancel-in-progress: false (existing) and the merge queue's serial processing, this is bounded — PRs land in order, no thundering herd. - The previous direct-push approach worked on molecule-controlplane (which has no merge_queue ruleset on staging). That version of the workflow was correct for that repo's protection model. Per-repo divergence is acceptable; the invariant ("staging ⊇ main") is what matters, not how it's enforced. Loop safety preserved GITHUB_TOKEN-authored merges (including the merge queue's land of this PR) do NOT trigger downstream workflow runs. So the merge to staging from this PR doesn't fire auto-promote-staging — same as the direct-push version. Idempotency The branch name is derived from main's short sha (`auto-sync/main-<sha>`) so workflow restarts on the same main push reuse the existing branch + PR rather than opening duplicates. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 15:59:26 -07:00
Hongming Wang	1867111d95	Merge pull request #2213 from Molecule-AI/chore/pin-actions-to-shas chore(security): pin Actions to SHAs + enable Dependabot auto-bumps	2026-04-28 22:49:25 +00:00
Hongming Wang	c77a88c247	chore(security): pin Actions to SHAs + enable Dependabot auto-bumps Supply-chain hardening for the CI pipeline. 23 workflow files modified, 59 mutable-tag refs replaced with commit SHAs. The risk Every `uses:` reference in .github/workflows/*.yml was pinned to a mutable tag (e.g., `actions/checkout@v4`). A maintainer of an action — or a compromised maintainer account — can repoint that tag to malicious code, and our pipelines silently pull it on the next run. The tj-actions/changed-files compromise of March 2025 is the canonical example: maintainer credential leak, attacker repointed several `@v<N>` tags to a payload that exfiltrated repository secrets. Repos that pinned to SHAs were unaffected. The fix Replace each `@v<N>` with `@<commit-sha> # v<N>`. The trailing comment preserves human readability ("ah, this is v4"); the SHA makes the reference immutable. Actions covered (10 distinct): actions/{checkout,setup-go,setup-python,setup-node,upload-artifact,github-script} docker/{login-action,setup-buildx-action,build-push-action} github/codeql-action/{init,autobuild,analyze} dorny/paths-filter imjasonh/setup-crane pnpm/action-setup (already pinned in molecule-app, listed here for completeness) Excluded: Molecule-AI/molecule-ci/.github/workflows/disable-auto-merge-on-push.yml@main — internal org reusable workflow; we control its repo, threat model is different from third-party actions. Conventional to pin to @main rather than SHA for internal reusables. The maintenance cost SHA pinning means upstream fixes require manual SHA bumps. Without automation, pinned SHAs go stale. So this PR also enables Dependabot across four ecosystems: - github-actions (workflows) - gomod (workspace-server) - npm (canvas) - pip (workspace runtime requirements) Weekly cadence — the supply-chain attack window is "minutes between repoint and pull"; weekly auto-bumps don't help with zero-days regardless. The point is to pull in non-zero-day fixes without operator effort. Aligns with user-stated principle: "long-term, robust, fully- automated, eliminate human error." Companion PR: Molecule-AI/molecule-controlplane#308 (same pattern, smaller surface). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 15:37:06 -07:00
Hongming Wang	6638d6e1d7	feat(ci): SECRET_PATTERNS drift lint across known consumers Adds a lint that diffs the canonical SECRET_PATTERNS array in .github/workflows/secret-scan.yml against every known public consumer mirror, failing on any divergence. Why: every side that scans for credentials carries its own copy of the pattern list. They drift — most recently the workspace-runtime pre-commit hook lagged the canonical by one pattern (sk-cp- / MiniMax F1088 vector), so a developer's local pre-commit would let a sk-cp- token through while the org-wide CI scan would refuse it. Useless friction; automated detection closes the gap. Implementation: .github/scripts/lint_secret_pattern_drift.py — pure stdlib, fetches each consumer's RAW file via urllib, extracts the SECRET_PATTERNS=( ... ) array via anchored regex (the closing `)` is anchored to the start of a line because pattern comments like `# GitHub PAT (classic)` contain their own paren mid-line), diffs against canonical, fails on missing or extra patterns. Fetch failures are warnings, not errors — a consumer whose branch was renamed shouldn't fail the lint until someone updates the URL list. .github/workflows/secret-pattern-drift.yml — daily 05:00 UTC cron + on-push gate (when canonical, the workflow, or the script changes) + workflow_dispatch. Read-only token, 5-minute timeout. Initial consumer set: workspace-runtime's bundled pre-commit hook (the one that drifted on sk-cp-). molecule-controlplane's inlined copy is private so this workflow can't read it; that's tracked separately and the controlplane's own self-monitor is the gap. Verified locally: lint detects drift correctly when the runtime hook is missing sk-cp-, returns clean when aligned. Refs: task #139.	2026-04-28 15:29:09 -07:00
Hongming Wang	97d5883e76	fix(ci): auto-sync concurrency + cleanup follow-ups Three small fixes from the self-review of #2209: 1. Required: concurrency group. Two pushes to main in quick succession (manual UI merge then auto-promote-staging's ff-push, or any back-to-back main pushes) would race two auto-sync runs against the same staging branch — second `git push origin staging` fails non-fast-forward, surfacing as a red CI alert for what should be a no-op. Add `concurrency: { group: auto-sync-main-to-staging, cancel-in-progress: false }` so the second run waits for the first and sees its result. 2. Hygiene: `git merge --abort` on conflict. The conflict-error path exits 1 with the work tree in a half-merged state. Doesn't affect future runs (each gets a fresh checkout) but is an unpleasant artifact for anyone who shells into the runner. Abort first, then exit. 3. Doc accuracy: "Loop safety" comment. The original said the chain terminates because "main is either a no-op or advances further." That's true but understates the actual safety: GitHub Actions explicitly does NOT trigger downstream workflow runs from `GITHUB_TOKEN`-authored pushes. So the loop is impossible by construction, not just by happy coincidence of ref state. Updated the comment to reflect the actual mechanism. Plus a step-name nit: "Fast-forward staging → main" reads as if main is the target. Renamed to "Fast-forward staging to main" for consistency with the workflow's name (main → staging). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 14:59:23 -07:00
Hongming Wang	c59715e143	feat(ci): auto-sync main → staging to keep staging-as-superset invariant Background `auto-promote-staging.yml` advances main via `git merge --ff-only` + `git push origin main` — clean fast-forward, no merge commit. But manual `staging → main` merges via the GitHub UI / API create a merge commit on main that staging doesn't have. The next `staging → main` PR then evaluates as "BEHIND" because staging is missing that merge commit, requiring a manual `gh pr update-branch` round-trip. This pattern bit twice on 2026-04-28 (PRs #2202 and #2205, both manual bridges to land pipeline fixes themselves). Each needed update-branch + re-CI before they could merge. Annoying and avoidable. What this workflow does Triggered on every push to main (regardless of source: auto-promote, UI merge, API merge, direct push): 1. Check whether main is already in staging's ancestry. If yes, no-op — auto-promote-staging keeps them aligned via ff push, and the no-op case is the steady state. 2. If not (manual merge commit on main, or direct main hotfix): try `git merge --ff-only origin/main` first. Works when staging hasn't diverged with its own commits. 3. If ff fails (staging has its own in-flight feature work): `git merge --no-ff origin/main -m "chore: sync main → staging"`. Absorbs main's tip while keeping staging's own history. 4. Push staging. Loop safety Pushing the synced staging triggers auto-promote-staging.yml, which checks gates on staging's new tip and, if green, ff-pushes staging to main. Since staging now ⊇ main, the resulting push to main is either a no-op (no ref change → no push event fires → auto-sync doesn't re-trigger) or advances main further. In the latter case auto-sync fires once more, sees main already in staging's ancestry, no-ops. Bounded. Conflict handling If the merge step hits conflicts (staging and main diverged with incompatible changes), the workflow fails with a clear summary pointing to manual resolution. This shouldn't happen in practice — staging is the integration branch; conflicts indicate a direct main hotfix touching the same code as in-flight staging work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 14:43:43 -07:00
hongmingwang-moleculeai	11a38a0ad4	Merge pull request #2207 from Molecule-AI/fix/secret-scan-printf-and-wordsplit fix(ci): printf format-string sink + filename word-split in secret-scan	2026-04-28 21:11:32 +00:00
Hongming Wang	2c8792d3e0	fix(ci): printf format-string sink + filename word-split in secret-scan Two latent bash bugs in the canonical secret-scan workflow caught during the post-merge review of molecule-controlplane #301 (a private consumer that inlined this workflow's logic and got both fixes there). Same bugs apply here; fixing in canonical means every public consumer (gh-identity, github-app-auth, the 8 workspace template repos) inherits the fix on their next workflow_call. Bug 1: `printf "$OFFENDING"` is a format-string sink. OFFENDING is built from filenames: `${f} (matched: ${pattern})\n`. When passed to printf as the first argument, `%` characters in a filename are interpreted as conversion specifiers — corrupting the error message or printing `%(missing)` artifacts. No filename in the current tree triggers it, but a future test fixture, build artifact, or contributor-supplied path could. Fix: `printf '%b' "$OFFENDING"` interprets the literal `\n` we appended without treating OFFENDING as a format string. Bug 2: `for f in $CHANGED` word-splits on whitespace. Filenames containing spaces would split into multiple tokens. The self-exclude check (`[ "$f" = "$SELF" ] && continue`) and the diff lookup would both operate on partial-path tokens. No filename in the current tree has whitespace, but the failure would be silent if one ever did. Fix: `while IFS= read -r f; do ... done <<< "$CHANGED"` reads whole lines as filenames. Added `[ -z "$f" ] && continue` to match the original `for` loop's implicit empty-input skip. Both fixes are mechanically straightforward (~16 lines net diff, mostly comments documenting the why). No behavior change for filenames in the current tree; strictly better for the edge cases. The same fixes already shipped in molecule-controlplane via #301 which inlined a copy of this workflow. The runtime's bundled pre-commit hook (molecule-ai-workspace-runtime: molecule_runtime/scripts/pre-commit-checks.sh) likely has the same bugs — flagged as a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 14:02:50 -07:00
Hongming Wang	9d4ab7b1a2	feat(ci): auto-promote-on-e2e — retag :latest on green E2E Staging SaaS Closes the final gap in the SaaS pipeline. After auto-promote-staging fast-forwards main, publish-workspace-server-image builds new `:staging-<sha>` images, but `:latest` (what prod tenants pull) only moves on either a manual `promote-latest.yml` dispatch or a canary- verify retag (gated on Phase 2 fleet that doesn't exist). This workflow closes that gap by retagging `platform:staging-<sha>` + `platform-tenant:staging-<sha>` → `:latest` whenever E2E Staging SaaS passes for a `main` push. Uses crane (no Docker daemon needed). Verifies both images exist before retagging either, so a half-published state is impossible. Why trigger only on `main` (not staging): - `:latest` is what prod tenants pull. Only SHAs that have reached `main` (via auto-promote-staging) should advance `:latest`. - Triggering on staging would let a staging-only revert advance `:latest` to a SHA that never reaches `main`, breaking the invariant "production runs what's on `main`". Why a separate workflow rather than folding into e2e-staging-saas.yml: - Test concerns and release concerns separate. - Disabling promote during an incident is one workflow toggle, not an edit to the long E2E file. - When Phase 2 canary work eventually lands, the canary path can replace this trigger without touching the E2E workflow. Doc-aligned: per molecule-controlplane/docs/canary-tenants.md, "green staging E2E → :latest" is the recommended approach for the current scale (≤20 paying tenants); canary fleet is deferred until blast radius grows. Pipeline after this lands is fully self-healing: staging push → 4 gates green → auto-promote fast-forwards main → publish-workspace-server-image → E2E Staging SaaS → THIS WORKFLOW retags :latest → tenant fleet auto-pulls in 5 min (or redeploy-tenants-on-main fans out faster) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 13:58:41 -07:00
Hongming Wang	17018745d0	fix(ci): auto-promote gate-check uses workflow file paths, not names Observed 2026-04-28: auto-promote ran for staging head `96955f7b` with all gates actually green (verified via /commits/<sha>/check-runs API) yet `check-all-gates-green` reported `CodeQL → missing/none` and aborted. Same SHA was promotable; auto-promote couldn't see it. Cause: `gh run list --workflow="CodeQL"` matched two workflows in this repo: - codeql.yml (explicit, scans both staging and main) - codeql (GitHub UI-configured Code-quality default setup, internal, scans default branch only) gh CLI rejects ambiguous `--workflow=<name>` lookups and returns no result → the gate fell through to `missing/none` and ALL_GREEN was set false. Every staging push since both names existed has been silently dead-locked. Fix: switch GATES from display-name strings to workflow file paths. File paths are the unique identifier for a workflow file in .github/workflows/; display names are decoration and can collide. The same `gh run list --workflow=<file.yml>` query that fails on "CodeQL" succeeds on "codeql.yml" because the file path resolves unambiguously. No behavior change for the other three gates (CI, E2E Canvas, E2E API Smoke) since their names didn't collide — they keep working, they just identify by ci.yml / e2e-staging-canvas.yml / e2e-api.yml now. The log line shape changes from `CI → completed/success` to `ci.yml → completed/success` which is fine for ops grep. When adding/removing a gate going forward: file paths only. Keep branch-protection required-checks (check-run display names) in sync as a separate manual step. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 13:15:13 -07:00
Hongming Wang	31d25b5a74	fix(ci): e2e gates always emit a result so auto-promote can read it The auto-promote-staging.yml gate-check (line 99) treats "workflow didn't run" as failure. Path-filtered triggers on E2E API Smoke Test and E2E Staging Canvas meant a platform-only or test-only push to staging — say, the prior PR #2201 which only touched tests/e2e/test_staging_full_saas.sh — never triggered the canvas workflow, and auto-promote saw `missing/none`, marked all_green=false, and aborted. Same class for any push that doesn't touch the gate's watched paths. Dead-lock by design, never noticed because the gate was new. Fix per Design B (always-run + fast-skip): - Drop `paths:` from the push/pull_request triggers on both gate workflows. The workflow now always fires on every staging+main push/PR. - Add a `detect-changes` job using `dorny/paths-filter@v3` that decides whether to do real work, scoped to the same paths the trigger filter used to watch. - Real work job (e2e-api / playwright) gates on `needs: detect-changes; if: needs.detect-changes.outputs.X == 'true'`. - Add a sibling `no-op` job that runs when the filter output is false, emitting `::notice::… no-op pass`. The workflow run's conclusion is `success` either way — auto-promote sees green and proceeds. manual `workflow_dispatch` and the weekly canvas `schedule` short- circuit detect-changes to always-run — those triggers exist precisely to exercise the suite and shouldn't be silently no-op'd. Why this approach over making auto-promote-staging smarter: The alternative (Design A, considered + rejected) was to teach auto-promote-staging to read each gate's `paths:` filter and treat "no run because filter excluded the commit" as conditional pass. That couples auto-promote to other workflows' YAML schema and breaks silently if a gate is renamed or its filter changes. Design B keeps the auto-promote contract simple ("each gate emits success") and makes each gate self-describing — adding a new gate doesn't require touching auto-promote. Cost: ~10-30s of runner overhead per gate per push for the no-op when paths don't match. Negligible vs the alternative of dead-locked auto-promote chains. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 12:43:26 -07:00
Hongming Wang	e7eeeb4f59	Merge pull request #2199 from Molecule-AI/fix/pin-compat-narrow-pypi-job-trigger ci(pin-compat): split into two workflows so each gets a narrow paths filter	2026-04-28 18:20:48 +00:00
Hongming Wang	a089712cef	feat(cascade): verify wheel content sha256 against just-built dist Closes #132. Extends the cascade propagation probe (added in #2197 and clarified in #2198) with a content-integrity check. The previous probe verified pip can RESOLVE the version we just published (catches surface 1+2 propagation lag — metadata + simple index). It did NOT verify pip can DOWNLOAD bytes that match what we uploaded — leaving a window where a Fastly stale-content scenario (rare but PyPI has had it: e.g. 2026-04-01 incident where a CDN node served a previous version's wheel under the new version's URL for ~90s after upload) would pass the probe and ship corrupt builds to all 8 receiver templates. Two-stage check, both must pass before the cascade fans out: (a) `pip install --no-cache-dir PACKAGE==VERSION` succeeds — version is resolvable. (Existing, unchanged.) (b) `pip download` of the same wheel + `sha256sum` matches the hash captured pre-upload from `dist/*.whl`. (New.) Captured BEFORE upload via a new `wheel_hash` step that exposes `steps.wheel_hash.outputs.wheel_sha256`, bubbled up as `needs.publish.outputs.wheel_sha256`, and consumed by the cascade probe via the EXPECTED_SHA256 env var. `pip download` is the right primitive: it writes the actual .whl file (vs `pip install` which unpacks and discards), so we can sha256sum it directly. Combined with --no-cache-dir + a wiped /tmp/probe-dl per poll, every poll re-fetches from the live Fastly edge — no local-cache mask. Per-poll cost: ~3-5s pip install + ~3s pip download + 4s sleep. 30-poll budget = ~5-6 min wall on a slow runner (vs the previous ~4-5 min for resolve-only). Well within the cascade's tolerance for a known-rare CDN issue, and the overwhelming-common case (Fastly serves matching bytes immediately) exits on the first poll. Verified locally: pip download of the current PyPI-latest (molecule-ai-workspace-runtime 0.1.29) produced sha256=7e782b2d50812257…, exactly matching PyPI's own metadata endpoint. The mismatch path is exercised inline (different builds of the same version produce different hashes by definition — the build_runtime_package.py output is timestamp-deterministic only within a single CI invocation). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 10:53:50 -07:00
Hongming Wang	a8f59f5fc2	ci(pin-compat): split into two workflows so each gets a narrow paths filter Closes #134. The post-merge review of #2196 flagged that the combined workflow's `paths:` filter (the union of both jobs' needs: `workspace/**` + `scripts/build_runtime_package.py` + the workflow itself) caused the `pypi-latest-install` job to fire on every doc-only / adapter-only / unrelated workspace/ edit. The PyPI artifact that job tests against can't change based on our workspace/ source — only on actual PyPI publishes — so those runs add noise without information. Splits the previously-merged combined workflow: runtime-pin-compat.yml (kept): - PyPI-latest install + import smoke (was: pypi-latest-install) - Narrow `paths:` filter — only fires when workspace/requirements.txt or this workflow file changes - Cron-driven daily for upstream-yank detection (unchanged) runtime-prbuild-compat.yml (new): - PR-built wheel + import smoke (was: local-build-install) - Broad `paths:` filter — fires on any workspace/ source change, scripts/build_runtime_package.py, or this workflow file - No cron (workspace/ doesn't change between firings) Behavior identical to before for content; only the trigger surface is narrower per-job. Each workflow's name is its own status check, so branch protection (which currently lists neither as required) can gate them independently in future. The prior comment in the combined file explicitly acknowledged the asymmetry and proposed this split as a follow-up; this is that follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 10:50:09 -07:00
Hongming Wang	e6ce54006d	ci(publish-runtime): use pip-resolve probe to bound cascade fan-out The cascade's PyPI-propagation gate polled `/pypi/<pkg>/<ver>/json`, which is one of THREE surfaces pip touches when resolving an install: 1. /pypi/<pkg>/<ver>/json — metadata endpoint (the old check) 2. /simple/<pkg>/ — pip's primary download index 3. files.pythonhosted.org — CDN-fronted wheel binary Each has its own cache. Any one of them can lag behind the others, and the previous gate would let the cascade fire while (2) or (3) still served the previous version. Downstream `pip install` in the template repos then resolved to the OLD wheel, the docker layer cache locked that stale resolution in, and subsequent rebuilds kept shipping the old runtime — the "five times in one night" cache trap referenced in the prior comment. Replace the metadata-only poll with an actual `pip install --no-cache-dir --force-reinstall --no-deps PACKAGE==VERSION` from a fresh venv. If pip can resolve and install the exact version we just published, every receiver template will too — pip itself is the ground truth for what the receivers will see, no proxy guessing about which surface is lagging. - Venv created once outside the loop; only `pip install` runs in the poll body. - --no-cache-dir + --force-reinstall ensures every poll hits the live PyPI surfaces (no local-cache mask). - --no-deps keeps each poll fast — we only care about resolving THIS package, not its dep tree. - Loop budget: 30 attempts × 4s ≈ 2 min (vs prior 30 × 2s = 60s). Generous vs typical PyPI propagation, surfaces real upstream issues past the budget. Verified locally: - Probing a non-existent version (0.1.999999) → pip exits 1, loop retries. - Probing the current PyPI-latest → pip exits 0, `pip show` returns the version, loop succeeds. Closes #130. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 18:16:33 -07:00
Hongming Wang	7484e6fbec	Merge pull request #2196 from Molecule-AI/fix/runtime-pin-compat-test-pr-artifact ci(runtime-pin-compat): test the PR-built wheel, not PyPI-latest	2026-04-28 00:42:02 +00:00
Hongming Wang	7065579967	ci(runtime-pin-compat): test the PR-built wheel, not the PyPI-latest one Closes #128's chicken-and-egg. The original gate installed the CURRENTLY-PUBLISHED molecule-ai-workspace-runtime from PyPI, then overlaid workspace/requirements.txt, then smoke-imported. That catches problems with the already-shipped artifact (the daily-cron upstream-yank case), but it cannot catch problems introduced by the PR itself: the imports it exercises are from the OLD wheel, not the PR's source. A PR that adds `from a2a.utils.foo import bar` (where `bar` is added in a2a-sdk 1.5 and the runtime currently pins 1.3) slips through: 1. Pip resolves the existing PyPI wheel + a2a-sdk 1.3. 2. Smoke imports the OLD main.py — no reference to `bar` → green. 3. Merge → publish-runtime.yml ships a wheel WITH the new import. 4. Tenant images redeploy → all crash on first boot with ImportError: cannot import name 'bar' from 'a2a.utils.foo'. Splits the workflow into two jobs: - pypi-latest-install (renamed from default-install): unchanged behavior. Runs on the daily cron and on requirements.txt / workflow edits. Catches upstream PyPI yanks + the already-shipped artifact going stale. - local-build-install (new): runs scripts/build_runtime_package.py on the PR's workspace/, builds the wheel with python -m build (mirroring publish-runtime.yml byte-for-byte), installs that wheel, then runs the same smoke import. Tests the artifact that WOULD be published if this PR merges. Path filter widened to workspace/** so any runtime-source change triggers the local-build job. The pypi-latest job's filter is the same union; its internal logic is unchanged so the daily-cron and upstream-detection use cases continue to work. Verified locally: built the wheel from current workspace/ source via the same script + python -m build invocation, installed into a fresh venv, imported from molecule_runtime.main import main_sync successfully. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 17:39:00 -07:00
Hongming Wang	1b0fab674b	ci(publish-runtime): smoke well-known mount alignment + message helper The existing wheel-smoke catches AgentCard kwarg-shape regressions (state_transition_history, supported_protocols) but doesn't catch the SDK-contract drift class that #2193 just fixed in production: the a2a-sdk 1.x rename of /.well-known/agent.json → /.well-known/agent-card.json, plus AGENT_CARD_WELL_KNOWN_PATH moving to a2a.utils.constants. main.py's readiness probe hardcoded the old literal and 404'd every attempt, silently dropping every workspace's initial_prompt for ~weeks before a user reported it. Two additions to the smoke block: 1. Mount alignment: build an AgentCard, call create_agent_card_routes(), and assert AGENT_CARD_WELL_KNOWN_PATH is among the mounted paths. Catches a future SDK release that decouples the constant value from the route factory's mount path. The source-tree test (workspace/tests/test_agent_card_well_known_path.py) catches the main.py side; this catches the SDK side BEFORE PyPI upload. 2. Message helper smoke: import a2a.helpers.new_text_message and instantiate one. The v0→v1 cheat sheet (memory: reference_a2a_sdk_v0_to_v1_migration.md) flagged this as a real migration find — main.py and a2a_executor.py call it in hot paths, so an import break errors every reply before the message even leaves the workspace. Verified by running the equivalent Python inside ghcr.io/molecule-ai/workspace-template-langgraph:latest: ✓ well-known mount alignment OK (/.well-known/agent-card.json) ✓ message helper import + call OK Closes the structural-fix half of the #2193 finding from the code- review-and-quality pass: "the wheel publish smoke didn't catch this. This is the 7th a2a-sdk migration find of this kind. Task #131 is the right root-cause fix." Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 17:34:12 -07:00
hongmingwang-moleculeai	dccec657d6	Merge branch 'staging' into ci/cicd-review-quick-wins	2026-04-27 13:27:16 -07:00
Hongming Wang	5920fc856d	Merge pull request #2182 from Molecule-AI/ci/agentcard-smoke-followup-2179 fix(workspace): rename supported_protocols → supported_interfaces (CRITICAL — every boot crashes)	2026-04-27 14:58:28 +00:00
Hongming Wang	851fd21fb1	fix(workspace): rename supported_protocols → supported_interfaces (a2a-sdk 1.0) CRITICAL: every workspace boot since the a2a-sdk 1.0 migration (#1974) has been crashing at AgentCard construction with: ValueError: Protocol message AgentCard has no "supported_protocols" field The protobuf field is `supported_interfaces` (plural, interfaces — see a2a-sdk types/a2a_pb2.pyi:189). The 0.3→1.0 migration left the kwarg as `supported_protocols`, which doesn't exist in the 1.0 schema, so the constructor raises before any subsequent line of main runs. Why this hid for so long: - publish-runtime.yml's smoke step only IMPORTED molecule_runtime.main; importing the module is fine, only CONSTRUCTING the AgentCard fails - The user-visible symptom is "Workspace failed: " with empty last_sample_error, indistinguishable from generic boot timeouts - The state_transition_history=True bug (fixed in #2179) was a sibling of this — same migration, same class, just caught first Fix is symmetric with #2179: 1. workspace/main.py: rename the kwarg + comment explaining why 2. .github/workflows/publish-runtime.yml: extend the smoke block to instantiate AgentCard with the exact production call shape, so the next field-rename of this class fails at publish time instead of breaking every workspace startup Verification: - Constructed AgentCard against fresh a2a-sdk 1.0.2 in a clean venv with the corrected kwarg → succeeds - Constructed it with the original `supported_protocols` kwarg → fails immediately with the exact error production sees - Smoke test pinned to mirror main.py's exact call shape; main.py + smoke must stay in lockstep going forward Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 07:54:23 -07:00
Hongming Wang	1a703f5687	fix(publish-runtime): wait for PyPI propagation + expand path filter Two structural fixes for the cascade race conditions that bit us five times today: 1. PyPI propagation wait (cascade job): poll PyPI for the just-published version with a 60s budget BEFORE firing repository_dispatch. PyPI accepts the upload but takes a few seconds to make it available via the package index. Cascade was firing too fast — downstream template builds ran `pip install` against a stale index, resolved to the previous version, and docker layer cache locked that in for subsequent rebuilds. Pairs with the build-arg cache invalidation in molecule-ci PR (separate change). Wait without invalidation = next build still pip-resolves correctly. Invalidation without wait = first cascade build may still race PyPI propagation. Together: no race, no stale cache. 2. Path filter expansion: scripts/build_runtime_package.py is the build script and changes to it (e.g. import-rewrite fixes, manifest emit, lib/ subpackage move) directly affect what ships in the wheel. Was missing from the path filter, so PRs touching only scripts/ (like #2174's lib/ fix) didn't auto-publish — the operator had to remember a manual dispatch. Add it to the closed list of files that trigger auto-publish. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 07:42:37 -07:00
Hongming Wang	82b366fce5	ci: add pr-guards caller that disables auto-merge on push Thin caller for molecule-ci's reusable disable-auto-merge-on-push workflow. Forces operator re-engagement when a commit is pushed to an open PR with auto-merge already enabled. Pairs with the org-wide "Automatically delete head branches" repo setting (also enabled today). Defense in depth: 1. Repo setting blocks pushes to a merged-and-deleted branch (post-merge orphan case — what bit #2174 today: my second commit landed on an already-merged-and-deleted branch). 2. This workflow catches in-queue races (push lands while the merge queue is processing) by disabling auto-merge so the operator must explicitly re-engage. Together they cover the full lifecycle of "auto-merge enabled → new commits arrive" without relying on operator discipline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 06:39:31 -07:00
Hongming Wang	3df5867b56	fix: restore main_sync entry point in workspace/main.py The wheel's pyproject.toml has declared `molecule-runtime = "molecule_runtime.main:main_sync"` since the publish pipeline was created on 2026-04-26, but the function itself was never present in workspace/main.py — it lived in the pre-monorepo molecule-ai-workspace-runtime repo and was lost during the consolidation that made workspace/ the source of truth. The 0.1.15 wheel still had main_sync from a leftover snapshot, so the regression went unnoticed until 0.1.16 (the first wheel built from the new source-of-truth) shipped. Symptom: every workspace container restart loops with ImportError: cannot import name 'main_sync' from 'molecule_runtime.main' — the molecule-runtime CLI script's first line tries to import the missing symbol. Workspaces stay in `provisioning` until the 10-min sweep marks them failed. Caught by .github/workflows/runtime-pin-compat.yml, which already imports the symbol by name as its smoke test. (That check kept failing red on every recent merge_group run; this PR fixes the underlying symbol-not-found instead of the smoke step.) Also strengthens publish-runtime.yml's wheel smoke from `import molecule_runtime.main` (loads the module — passes even when entry-point target is missing) to `from molecule_runtime.main import main_sync` (the actual contract the CLI script needs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 03:35:49 -07:00
Hongming Wang	c68dc1877f	fix(release): drift-gate TOP_LEVEL_MODULES + smoke-import main in publish Two compounding bugs surfaced when 0.1.16 hit production today: 1. scripts/build_runtime_package.py had a hand-curated TOP_LEVEL_MODULES set listing every workspace/.py that should get its bare imports rewritten to `molecule_runtime.X`. The set silently went stale: - Missing: transcript_auth (added since #87 phase 1c), runtime_wedge, watcher → unrewritten imports shipped, every workspace startup died with ModuleNotFoundError. - Stale: claude_sdk_executor, cli_executor (both removed in #87), hermes_executor (never existed) → harmless but misleading. 2. publish-runtime.yml's wheel-smoke step asserted on stable invariants (BaseAdapter, AdapterConfig, a2a_client error sentinel) but never imported main. So even though main.py held the broken bare `from transcript_auth import ...`, the smoke check passed. Fixes: - Build script now derives the on-disk module set from workspace/.py and asserts it matches TOP_LEVEL_MODULES exactly. Drift in either direction fails the build with a specific diff message instead of shipping a broken wheel. Closed-list typo guard preserved (we still edit the set explicitly when a module is added/removed) — the gate just makes drift impossible to ignore. - TOP_LEVEL_MODULES updated to current reality: drop the 3 stale, add the 3 missing. - publish-runtime.yml wheel-smoke now `import molecule_runtime.main` before the invariant asserts. main is the entry point and transitively imports every module — any bare-import bug surfaces as ModuleNotFoundError before PyPI accepts the upload. Tested locally: `python3 scripts/build_runtime_package.py --version 0.1.99 --out /tmp/build-test` succeeds, and /tmp/build-test/molecule_runtime/main.py contains the rewritten `from molecule_runtime.transcript_auth import ...`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 03:19:17 -07:00
Hongming Wang	0a455b7d71	feat(publish-runtime): auto-publish to PyPI on staging pushes that touch workspace/ Adds a third trigger so any merge to staging that changes workspace/ auto-publishes a new molecule-ai-workspace-runtime patch release. Closes the human-in-loop gap that caused tonight's RuntimeCapabilities ImportError outage. Tonight: #117 added RuntimeCapabilities to molecule_runtime.adapters.base. The merge landed at 02:37 UTC. Templates rebuilt their images at 07:37 UTC (4 hours later) and started importing the new symbol. PyPI was still serving 0.1.15 (pre-#117) because nobody remembered to push a runtime-vX.Y.Z tag or workflow_dispatch the publish. Result: every template image shipped tonight runs `from molecule_runtime.adapters.base import RuntimeCapabilities` against an installed runtime that doesn't export it -> ImportError -> workspace never registers -> stuck in provisioning until 10-min sweep. Mechanism: - New trigger: push to staging filtered to paths: ['workspace/']. Path filter applies only to branch pushes; the existing tag trigger still fires unconditionally. - Version derivation for the auto case: query PyPI's JSON API for current latest, bump the patch component. PyPI is the source of truth so concurrent runs don't double-publish (HTTP 400 on collision). - concurrency: group serializes parallel staging merges so they don't race on the bump computation. cancel-in-progress: false because each workspace/** change deserves its own release. - publish job now exposes its derived version as a job-level output so the cascade reads it cleanly. Fixes a latent bug: cascade tried to read steps.version.outputs.version, which is from a different job's scope and silently resolved to empty -- then re-derived from GITHUB_REF_NAME, which would have been "staging" under the new trigger and produced an invalid version. Tag-driven and manual-dispatch paths are unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 02:11:45 -07:00
Hongming Wang	c1e9aa7461	Merge pull request #2153 from Molecule-AI/fix/block-internal-paths-shallow-clone-bug fix(ci): block-internal-paths handle merge_group + shallow-clone BASE	2026-04-27 06:58:32 +00:00
Hongming Wang	7ac7a010fa	fix(ci): block-internal-paths handle merge_group + shallow-clone BASE [Molecule-Platform-Evolvement-Manager] ## What was broken Same bug class as the secret-scan.yml fix in #2120 — block-internal-paths hit `fatal: bad object <sha>` exit 128 on the staging push at 2026-04-27 06:50:33Z. Two cases: 1. `merge_group` events: BASE/HEAD came from `github.event.before` / `.after` which are push-event-only properties. On merge_group both came back empty, the script fell through to "scan entire tree" mode which is correct but inefficient. Worse, when this workflow is required for the merge queue (line 21-22), an empty-BASE entire-tree scan would run on every queue check. 2. `push` events with shallow clones: `fetch-depth: 2` doesn't always cover BASE across true merge commits. When BASE is in the payload but absent from the local object DB, `git diff` errors out with `fatal: bad object <sha>` and the job exits 128. This is what broke today's staging push. ## Fix Same shape as the secret-scan.yml fix (#2120): - Add a dedicated `git fetch` step for `merge_group.base_sha`. - Move event-specific SHAs into a step `env:` block; script uses a `case` over `${{ github.event_name }}` covering pull_request / merge_group / push (rather than `if pull_request / else push` which left merge_group on the empty-BASE branch). - On-demand fetch + `git cat-file -e` guard for push BASE so a SHA that's payload-present-but-DB-absent triggers the fetch, and a fetch failure falls through cleanly to "scan entire tree" instead of exiting 128. ## Test plan - [x] YAML structure preserved (no schema changes) - [x] Bash logic mirrors the secret-scan recovery path tested in #2120 - [ ] CI green on this PR's pull_request scan + push to staging post-merge 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 23:54:00 -07:00
Hongming Wang	a4b3ebf951	test(e2e): claude-code + hermes priority-runtimes happy path Self-contained happy-path E2E for the two runtimes the project commits to first-class support for (task #116, completes the loop on the "both must work end-to-end with tests" requirement). What it proves per runtime: 1. POST /workspaces succeeds with the runtime + secrets 2. Workspace reaches status=online within its cold-boot window (claude-code: 240s, hermes: 900s on cold apt + uv + sidecar) 3. POST /a2a (message/send "Reply with PONG") returns a non-error, non-empty reply 4. activity_logs row written with method=message/send and ok\|error status (a2a_proxy.LogActivity contract) Skip semantics: each phase independently checks for its required env key (CLAUDE_CODE_OAUTH_TOKEN / E2E_OPENAI_API_KEY) and skips cleanly if absent. The script always exit-0s if every phase either passed or skipped — so wiring it into a no-keys CI job validates the script itself stays clean without false-failing. Idempotent: pre-sweeps any prior "Priority E2E (claude-code)" / "Priority E2E (hermes)" workspaces so a run interrupted by SIGPIPE / kill -9 (which bypasses the EXIT trap) doesn't poison the next run. Same defensive pattern as test_notify_attachments_e2e.sh. CI wiring: - e2e-api.yml — runs on every PR with no LLM keys, both phases skip, catches script-level regressions (set -u bugs, syntax issues, etc.) - canary-staging.yml + e2e-staging-saas.yml already have the keys via secrets.MOLECULE_STAGING_OPENAI_KEY and exercise wire-real behavior — could be wired to opt-in if you want claude-code coverage there too. Local runs (from this branch, no keys): === Results: 0 passed, 0 failed, 2 skipped === Validates the capability primitives shipped in PRs #2137-2144: once template PRs #12 (claude-code) + #25 (hermes) merge with their declared provides_native_session=True + idle_timeout_override=900, a manual run with both keys validates the full native+pluggable chain. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 23:48:54 -07:00
rabbitblood	b81d8e9fc5	chore(secret-scan): add sk-cp- MiniMax pattern (F1088 retroactive fix)	2026-04-26 21:43:22 -07:00
Hongming Wang	62cfc21033	test(comms): comprehensive E2E coverage for agent → user attachments User asked to "keep optimizing and comprehensive e2e testings to prove all works as expected" for the communication path. Adds three layers of coverage for PR #2130 (agent → user file attachments via send_message_to_user) since that path has the most user-visible blast radius: 1. Shell E2E (tests/e2e/test_notify_attachments_e2e.sh) — pure platform test, no workspace container needed. 14 assertions covering: notify text-only round-trip, notify-with-attachments persists parts[].kind=file in the shape extractFilesFromTask reads, per-element validation rejects empty uri/name (regression for the missing gin `dive` bug), and a real /chat/uploads → /notify URI round-trip when a container is up. 2. Canvas AGENT_MESSAGE handler tests (canvas-events.test.ts +5) — pin the WebSocket-side filtering that drops malformed attachments, allows attachments-only bubbles, ignores non-array payloads, and no-ops on pure-empty events. 3. Persisted response_body shape test (message-parser.test.ts +1) — pins the {result, parts} contract the chat history loader hydrates on reload, so refreshing after an agent attachment restores both caption and download chips. Also wires the new shell E2E into e2e-api.yml so the contract regresses in CI rather than only in manual runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 20:41:56 -07:00
Hongming Wang	3a36d732e4	fix(ci): sweep prior UTC day in e2e safety nets (midnight-rollover) [Molecule-Platform-Evolvement-Manager] ## What was breaking All three staging e2e workflows' "Teardown safety net" steps filtered candidate slugs by `f'e2e-...-{today}-...'` where `today` was computed at safety-net-step time via `datetime.date.today()`. When a run crossed midnight UTC (start before 00:00, end after), `today` became the NEXT day, but the slug it created carried the PRIOR day's date. The filter never matched its own slug → leak. ## Today's incident E2E Staging Canvas run [24970092066]( https://github.com/Molecule-AI/molecule-core/actions/runs/24970092066): - started 2026-04-26 23:45:59Z - created slug `e2e-canvas-20260426-1u8nz3` at 23:59Z - ended 2026-04-27 00:12:47Z (failure) - safety-net step ran with `today=20260427` - filter `e2e-canvas-20260427-` did not match `...20260426-1u8nz3` - tenant + child workspace EC2 both stayed up Confirmed via CP staging logs: no DELETE for `1u8nz3` ever issued. The Playwright globalTeardown didn't fire (test crashed mid-run); the workflow safety-net was the last line and it missed. ## Fix All three workflows now sweep BOTH today AND yesterday's UTC dates, so a run that crosses midnight still matches its own slug: ```python today = datetime.date.today() yesterday = today - datetime.timedelta(days=1) dates = (today.strftime('%Y%m%d'), yesterday.strftime('%Y%m%d')) prefixes = tuple(f'e2e-canvas-{d}-' for d in dates) # (canvas variant) ``` Per-run-id scoping (saas + canary) is preserved — the prior-day prefix still includes the run_id, so cross-midnight runs only sweep their own slugs, not other in-flight runs from yesterday. ## Why two-day window vs. arbitrary lookback A run can't legitimately last more than 24h on GitHub-hosted runners (workflow `timeout-minutes` caps; canary=25, e2e-saas=45, canvas=30). Two-day window is enough to cover any cross-midnight run without widening the cross-run-cleanup blast radius further. The `sweep-stale-e2e-orgs.yml` cron (with its 120-min age threshold) remains the catch-all for anything older that drifts through. ## Test plan - [x] Manual logic simulation: post-midnight slug matches yesterday's prefix; same-day still matches; 2-days-ago does NOT match; production tenant never matches - [x] All three workflow YAMLs syntactically valid - [ ] Next cross-midnight run cleans up its own slug 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 19:23:36 -07:00

1 2 3 4

165 Commits