Add a comment block at the top of auto-promote-staging.yml naming the
load-bearing one-time repo setting that the workflow depends on:
Settings → Actions → General → Workflow permissions
→ ✅ Allow GitHub Actions to create and approve pull requests
Without this toggle, every workflow_run fails with
"GitHub Actions is not permitted to create or approve pull requests
(createPullRequest)". Observed 2026-04-29 01:43 UTC blocking the
fcd87b9 promotion (PRs #2248 + #2249); manually bridged via PR #2252.
The setting is invisible to anyone reading the workflow file, but the
workflow cannot do its job without it. Documenting here so the next
time it gets toggled off (org admin change, repo migration, audit
cleanup) the failure mode points at the cause rather than another
round of "why is auto-promote broken."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror the sweep-cf-orphans hardening (#2248) on publish-runtime's
TEMPLATE_DISPATCH_TOKEN gate. The previous behaviour was to print
:⚠️:skipping cascade — templates will pick up the new version
on their own next rebuild and exit 0. That message is wrong: the 8
workspace-template repos only rebuild on this repository_dispatch
fanout. Without the dispatch they stay pinned to whatever runtime
version they last saw, and the gap is invisible until someone
notices a template several versions behind weeks later.
Behaviour after this PR:
- push (auto-trigger on workspace/runtime/** changes) → exit 1
- workflow_dispatch (manual operator) → exit 0
with a warning (operator already accepted state; let them rerun
after restoring the secret)
The token-missing path now also names the consequence concretely
("templates will NOT pick up the new version until this token is
restored") so future operators see the actionable line, not the
misleading "they'll catch up on their own" message.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the soft-skip-with-warning behaviour for scheduled runs of the
hourly Cloudflare orphan sweeper with an explicit failure when the six
required secrets aren't set. Manual workflow_dispatch keeps the
soft-skip path so an operator can short-circuit a deliberate rerun
without redoing the secrets dance — they accepted the state when they
clicked the button.
Why: from some-date to 2026-04-28, all six secrets were unset on the
repo. Every hourly tick printed a yellow :⚠️: and exited 0,
which GitHub registers as "completed/success" — the sweeper was
indistinguishable from a healthy janitor with nothing to do. Cloudflare
orphans accumulated unobserved to 152/200 (~76% of the zone quota),
and only surfaced via a manual audit. The mechanism to catch this kind
of regression is to make the workflow loud: red runs prompt
investigation, green runs are presumed healthy.
Schedule/workflow_run/push paths now print three ::error:: lines
naming the missing secrets, the fix, and a one-line reference to this
incident, then exit 1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the fix#2234 applied to auto-sync-main-to-staging.yml in the
reverse direction. Both workflows now use the same merge-queue path
that humans use; no special-case bypass.
Why
Every tick of auto-promote-staging.yml since main's branch protection
went stricter has been failing with:
remote: error: GH006: Protected branch update failed for refs/heads/main.
remote: - Required status checks "Analyze (go)", "Analyze (javascript-typescript)",
"Analyze (python)", "Canvas (Next.js)", "Detect changes",
"E2E API Smoke Test", "Platform (Go)", "Python Lint & Test",
and "Shellcheck (E2E scripts)" were not set by the expected
GitHub apps.
remote: - Changes must be made through a pull request.
The previous version did `git merge --ff-only origin/staging &&
git push origin main` directly. That works against a permissive
branch — it doesn't work against a ruleset that requires checks
satisfied by the expected GitHub apps. Only PR merges through the
queue produce check runs from the right apps.
Result was that today's 12+ merges to staging never propagated to
main; the auto-promote ran every tick and failed every tick, while
operators had to keep opening manual `staging → main` bridges.
Fix
- Replace the direct git push step with a step that opens (or reuses)
a PR base=main head=staging and enables auto-merge. The merge queue
lands it once gates are green on the merge_group ref.
- The PR's head IS the staging branch (no per-SHA promote branch
needed) — the whole purpose is "advance main to staging's tip".
- Add `pull-requests: write` permission so the workflow can call
gh pr create + gh pr merge --auto.
- Drop the `git merge-base --is-ancestor` divergence check — the
merge queue itself enforces branch protection now, and rejects
the PR if main has diverged from staging history.
Loop safety preserved: when this PR's merge lands on main, it
triggers auto-sync-main-to-staging.yml which opens a sync PR back
to staging. That sync PR's eventual merge is by GITHUB_TOKEN (the
merge queue) which doesn't trigger downstream workflow_run events
— so auto-promote-staging.yml does NOT re-fire from its own merge
landing.
Refs: #2234 (the parallel fix for auto-sync-main-to-staging.yml),
task #142, multiple failing runs visible in
https://github.com/Molecule-AI/molecule-core/actions/workflows/auto-promote-staging.yml
Consolidates the remaining safe-to-merge dependabot PRs from the
2026-04-28 wave into one consumable PR. Replaces three earlier
single-bump PRs (#2245, #2230, #2231) which were closed in favor of
this single batch — same pattern as #2235.
GitHub Actions majors (SHA-pinned per org convention):
github/codeql-action v3 → v4.35.2 (#2228)
actions/setup-node v4 → v6.4.0 (#2218)
actions/upload-artifact v4 → v7.0.1 (#2216)
actions/setup-python v5 → v6.2.0 (#2214)
npm dev deps (canvas/, lockfile regenerated in node:22-bookworm
container so @emnapi/* and other Linux-only optional deps are
properly resolved — Mac-native `npm install` strips them, which
caused the earlier #2235 batch to drop these two):
@types/node ^22 → ^25.6 (#2231)
jsdom ^25 → ^29.1 (#2230)
Why each is safe
setup-node v4 → v6 / setup-python v5 → v6:
Every consumer call pins node-version / python-version
explicitly. v5 / v6 changed defaults but pinned consumers
are unaffected. Confirmed via grep across .github/workflows/
— all setup-node call sites pin '20' or '22', all
setup-python call sites pin '3.11'.
codeql-action v3 → v4.35.2:
Used as init/autobuild/analyze sub-actions in codeql.yml.
v4 bundles a newer CodeQL CLI; ubuntu-latest auto-updates
so functional behavior is unchanged. The deprecated
CODEQL_ACTION_CLEANUP_TRAP_CACHES env var (per v4.35.2
release notes) is undocumented and we don't set it.
upload-artifact v4 → v7.0.1:
v6 introduced Node.js 24 runtime requiring Actions Runner
>= 2.327.1. All upload-artifact users (codeql.yml,
e2e-staging-canvas.yml) run on `ubuntu-latest` (GitHub-
hosted), which auto-updates the runner agent. Self-hosted
runners are NOT used for these jobs.
@types/node 22 → 25 / jsdom 25 → 29:
Both are dev-only — @types/node is type definitions,
jsdom backs vitest's DOM environment. Tests pass:
79 files / 1154 tests in node:22-bookworm container.
Verified locally (Linux container so the lockfile reflects what
CI's `npm ci` will install):
- cd canvas && npm install --include=optional → 169 packages
- npm test → 1154/1154 pass
- npm ci → clean install succeeds
- npm run build → Next.js prerendering succeeds
Closes when this lands (the 3 individual auto-merge PRs from earlier
were closed):
#2228#2218#2216#2214#2231#2230
NOT included (CI failing on dependabot's own run — major framework
bumps that need code-side migration tasks, not safe auto-bumps):
#2233 next 15 → 16
#2232 tailwindcss 3 → 4
#2226 typescript 5 → 6
Branch protection on `main` requires "E2E API Smoke Test" as a status
check. With Design B's no-op + e2e-api job split, when paths-filter
excludes a commit:
- e2e-api job (name="E2E API Smoke Test"): SKIPPED
- no-op job (name="no-op"): SUCCESS
Branch protection counts the skipped check-run as not-satisfied →
auto-promote-staging's `git push origin main` rejected with GH006.
Observed 2026-04-28 00:22 UTC: every gate green at the workflow level,
all_green=true in auto-promote-staging's gate-check, but the FF push
itself rejected with:
Required status checks "..., E2E API Smoke Test, ..." were not set
by the expected GitHub apps.
Fix: give the no-op job the same `name:` as the real one. Now both
register as check-runs named "E2E API Smoke Test" — exactly one runs
per workflow execution (mutex `if`), the other registers as skipped
with the same name. Branch protection sees at least one success,
requirement satisfied.
Same fix applied to e2e-staging-canvas.yml's no-op (name → "Canvas
tabs E2E") for symmetry, even though "Canvas tabs E2E" isn't currently
in main's required check list — kept consistent so the next time a
required-checks reshuffle pulls it in, it doesn't recreate this bug.
Note: Design B's intent was always "emit a result auto-promote can
read" — that intent was satisfied at the workflow-conclusion level
(success), but missed the per-check-run-name level. This PR closes
that second-order gap.
e2e-staging-canvas had a single global concurrency group:
concurrency:
group: e2e-staging-canvas
cancel-in-progress: false
That meant the entire repo shared one running + one pending slot. When a
staging push queued behind an in-flight run and a third entrant (a PR
run, a follow-on push) entered the group, the staging push got
cancelled. auto-promote-staging then saw `completed/cancelled` for a
required gate and refused to advance main.
Observed 2026-04-28 23:51-23:53: staging tip 3f99fede's e2e-staging-
canvas push run was cancelled within 2:20 of starting because a PR run
on a follow-on branch entered the group. Auto-promote-staging fired 8+
times after that, all skipped because canvas was still in the cancelled
state. The chain stayed stuck until the cancelled run was manually
re-dispatched.
e2e-api had a softer version of the same bug — `group: e2e-api-${{
github.ref }}`. Per-ref isolates push events from PR events, so this
specific scenario didn't hit it, but back-to-back pushes to staging at
SHA-A and SHA-B share refs/heads/staging and would still cancel SHA-A's
queued run when SHA-B enters.
Both workflows now use per-SHA grouping. The single-global-group's
original intent was to throttle parallel E2E provisions, but each E2E
run already isolates its state via fresh-org-per-run, and parallel
infrastructure cost at our scale (~$0.001/min × 10min × 2) is rounding
error compared to a stuck pipeline.
Per-SHA still dedupes accidental double-triggers for the SAME SHA.
It does not cancel obsolete-PR-version runs on force-push — that wasted
CI is acceptable given the alternative is losing staging-tip data that
auto-promote-staging depends on.
Other gate workflows: ci.yml uses `cancel-in-progress: true` which is
correct for unit tests (intentional cancellation on supersede). codeql.yml
is per-ref like e2e-api was; same fix probably applies if the same
deadlock pattern is observed there, but no incident yet so deferring.
Self-review caught a real correctness bug: scenario where publish-
workspace-server-image completes BEFORE E2E Staging SaaS for a runtime-
touching SHA. Publish typically takes ~5-10min; E2E ~10-15min, so this
ordering is the common case for runtime-path PRs.
Previous gate logic:
- completed/success: proceed
- completed/failure: abort
- everything else (including in_progress): proceed ← BUG
If publish-trigger fires while E2E is still running, the gate returned
"in_progress/none" and fell through the catch-all "proceed" branch.
Result: :latest retagged on the publish signal alone. Then E2E ends
red — but :latest was already wrongly advanced; the E2E-completion
trigger's job-level if=conclusion==success filter just skips, never
rolls back.
Fix: explicit case for in_progress|queued|requested|waiting|pending
that DEFERS — sets gate.proceed=false, writes a "deferred" summary,
exits 0 (workflow run shows success, retag steps skipped). The E2E
completion trigger then fires later and either promotes (green) or
aborts (red), giving us correct ordering regardless of who finishes
first.
Subsequent steps now guarded by `if: steps.gate.outputs.proceed ==
'true'` instead of relying on `exit 1` for skip semantics.
Also added an explicit catch-all `*)` branch that aborts on unknown
states (forward-compat: GitHub adds a new status, we surface it
instead of silently promoting through it).
Previously this workflow only triggered on E2E Staging SaaS completion,
which is itself paths-filtered to runtime handlers
(workspace-server/internal/handlers/{registry,workspace_provision,
a2a_proxy}.go, middleware/**, provisioner/**). publish-workspace-server
-image fires on a STRICTLY BROADER path set (workspace-server/**,
canvas/**, manifest.json) — so canvas-only or cmd-only or sweep-only
PRs rebuilt the platform image without ever advancing :latest.
Result observed 2026-04-28: zero runs of this workflow since merge
despite eight main pushes. :latest sat ~7 hours / 9 PRs behind main.
Fix: add publish-workspace-server-image as a second trigger. Add an
explicit gate inside the job that aborts when E2E Staging SaaS for the
same SHA ended red. When E2E didn't fire (paths-filtered), proceed —
auto-promote-staging's pre-merge gates (CI + E2E Canvas + E2E API +
CodeQL on staging) already validated this SHA before main moved.
Concurrency group serializes promotes per-SHA so the publish+E2E both-
fired race lands cleanly. Idempotent crane tag makes it safe regardless.
Consolidates 11 of the 17 open Dependabot PRs (#2215, #2217, #2219-#2225,
#2227, #2229) into one PR. Every entry is a patch / minor / floor bump
where the impact surface is small and CI carries the proof.
Same pattern as the 2026-04-15 batch.
Go (workspace-server/go.mod + go.sum, regenerated via `go mod tidy`):
- golang.org/x/crypto 0.49.0 → 0.50.0 (#2225)
- github.com/golang-jwt/jwt/v5 5.2.2 → 5.3.1 (#2222)
- github.com/gin-contrib/cors 1.7.2 → 1.7.7 (#2220)
- github.com/docker/go-connections 0.6.0 → 0.7.0 (#2223)
- github.com/redis/go-redis/v9 9.7.3 → 9.19.0 (#2217)
Python floor bumps (workspace/requirements.txt; current pip-resolved
versions don't change unless they happen to be below the new floor):
- httpx >=0.27 → >=0.28.1 (#2221)
- uvicorn >=0.30 → >=0.46 (#2229)
- temporalio >=1.7 → >=1.26 (#2227)
- websockets >=12 → >=16 (#2224)
- opentelemetry-sdk >=1.24 → >=1.41.1 (#2219)
GitHub Actions (SHA-pinned per existing convention):
- dorny/paths-filter@d1c1ffe (v3) → @fbd0ab8 (v4.0.1) (#2215)
REMOVED from this batch (lockfile platform mismatch):
- #2231 @types/node ^22 → ^25.6 (npm install on macOS strips
Linux-only @emnapi/* entries from package-lock.json that CI's
`npm ci` then refuses; needs a Linux-side install to land cleanly)
- #2230 jsdom ^25 → ^29.1 (same)
NOT included in this batch (deferred to per-PR human review):
- #2228 github/codeql-action v3 → v4 (CodeQL CLI alignment risk)
- #2218 actions/setup-node v4 → v6 (default Node version drift)
- #2216 actions/upload-artifact v4 → v7 (3 major versions)
- #2214 actions/setup-python v5 → v6 (action major)
NOT merged (CI failing on dependabot's own PR):
- #2233 next 15 → 16
- #2232 tailwindcss 3 → 4
- #2226 typescript 5 → 6
Verified:
- workspace-server: `go mod tidy && go build ./... && go test ./...` — green
- workspace requirements.txt: floor bumps only
The molecule-core/staging branch is protected by ruleset 15500102
(name: staging-merge-queue) which blocks ALL direct pushes — no
bypass even for org admins or the GitHub Actions integration. The
prior version of this workflow attempted `git push origin staging`
and was rejected with GH013:
! [remote rejected] staging -> staging
(push declined due to repository rule violations)
- Changes must be made through a pull request.
- Changes must be made through the merge queue
This was a real architectural mismatch: auto-sync was bypassing
the same gates everyone else goes through to land on staging,
which is exactly what the ruleset is designed to prevent.
The fix matches the org convention: the workflow now opens a PR
(base=staging, head=auto-sync/main-<sha>) and enables auto-merge.
The merge queue picks it up, runs required gates against the
merged result, and lands it. Same path human PRs take through
staging — no special-snowflake bypass.
Trade-off acknowledged
- Slight PR churn: every main push that needs sync opens a tracked
PR. With concurrency: cancel-in-progress: false (existing) and
the merge queue's serial processing, this is bounded — PRs land
in order, no thundering herd.
- The previous direct-push approach worked on
molecule-controlplane (which has no merge_queue ruleset on
staging). That version of the workflow was correct for that
repo's protection model. Per-repo divergence is acceptable; the
invariant ("staging ⊇ main") is what matters, not how it's
enforced.
Loop safety preserved
GITHUB_TOKEN-authored merges (including the merge queue's land
of this PR) do NOT trigger downstream workflow runs. So the merge
to staging from this PR doesn't fire auto-promote-staging — same
as the direct-push version.
Idempotency
The branch name is derived from main's short sha
(`auto-sync/main-<sha>`) so workflow restarts on the same main
push reuse the existing branch + PR rather than opening duplicates.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Supply-chain hardening for the CI pipeline. 23 workflow files
modified, 59 mutable-tag refs replaced with commit SHAs.
The risk
Every `uses:` reference in .github/workflows/*.yml was pinned to a
mutable tag (e.g., `actions/checkout@v4`). A maintainer of an
action — or a compromised maintainer account — can repoint that
tag to malicious code, and our pipelines silently pull it on the
next run. The tj-actions/changed-files compromise of March 2025 is
the canonical example: maintainer credential leak, attacker
repointed several `@v<N>` tags to a payload that exfiltrated
repository secrets. Repos that pinned to SHAs were unaffected.
The fix
Replace each `@v<N>` with `@<commit-sha> # v<N>`. The trailing
comment preserves human readability ("ah, this is v4"); the SHA
makes the reference immutable.
Actions covered (10 distinct):
actions/{checkout,setup-go,setup-python,setup-node,upload-artifact,github-script}
docker/{login-action,setup-buildx-action,build-push-action}
github/codeql-action/{init,autobuild,analyze}
dorny/paths-filter
imjasonh/setup-crane
pnpm/action-setup (already pinned in molecule-app, listed here for completeness)
Excluded:
Molecule-AI/molecule-ci/.github/workflows/disable-auto-merge-on-push.yml@main
— internal org reusable workflow; we control its repo, threat model
is different from third-party actions. Conventional to pin to @main
rather than SHA for internal reusables.
The maintenance cost
SHA pinning means upstream fixes require manual SHA bumps. Without
automation, pinned SHAs go stale. So this PR also enables Dependabot
across four ecosystems:
- github-actions (workflows)
- gomod (workspace-server)
- npm (canvas)
- pip (workspace runtime requirements)
Weekly cadence — the supply-chain attack window is "minutes between
repoint and pull"; weekly auto-bumps don't help with zero-days
regardless. The point is to pull in non-zero-day fixes without
operator effort.
Aligns with user-stated principle: "long-term, robust, fully-
automated, eliminate human error."
Companion PR: Molecule-AI/molecule-controlplane#308 (same pattern,
smaller surface).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a lint that diffs the canonical SECRET_PATTERNS array in
.github/workflows/secret-scan.yml against every known public
consumer mirror, failing on any divergence.
Why: every side that scans for credentials carries its own copy of
the pattern list. They drift — most recently the workspace-runtime
pre-commit hook lagged the canonical by one pattern (sk-cp- /
MiniMax F1088 vector), so a developer's local pre-commit would let
a sk-cp- token through while the org-wide CI scan would refuse it.
Useless friction; automated detection closes the gap.
Implementation:
.github/scripts/lint_secret_pattern_drift.py — pure stdlib, fetches
each consumer's RAW file via urllib, extracts the
SECRET_PATTERNS=( ... ) array via anchored regex (the closing
`)` is anchored to the start of a line because pattern comments
like `# GitHub PAT (classic)` contain their own paren mid-line),
diffs against canonical, fails on missing or extra patterns.
Fetch failures are warnings, not errors — a consumer whose
branch was renamed shouldn't fail the lint until someone updates
the URL list.
.github/workflows/secret-pattern-drift.yml — daily 05:00 UTC cron
+ on-push gate (when canonical, the workflow, or the script
changes) + workflow_dispatch. Read-only token, 5-minute timeout.
Initial consumer set: workspace-runtime's bundled pre-commit hook
(the one that drifted on sk-cp-). molecule-controlplane's inlined
copy is private so this workflow can't read it; that's tracked
separately and the controlplane's own self-monitor is the gap.
Verified locally: lint detects drift correctly when the runtime
hook is missing sk-cp-, returns clean when aligned.
Refs: task #139.
Three small fixes from the self-review of #2209:
1. **Required: concurrency group.** Two pushes to main in quick
succession (manual UI merge then auto-promote-staging's ff-push,
or any back-to-back main pushes) would race two auto-sync runs
against the same staging branch — second `git push origin staging`
fails non-fast-forward, surfacing as a red CI alert for what should
be a no-op. Add `concurrency: { group: auto-sync-main-to-staging,
cancel-in-progress: false }` so the second run waits for the first
and sees its result.
2. **Hygiene: `git merge --abort` on conflict.** The conflict-error
path exits 1 with the work tree in a half-merged state. Doesn't
affect future runs (each gets a fresh checkout) but is an
unpleasant artifact for anyone who shells into the runner. Abort
first, then exit.
3. **Doc accuracy: "Loop safety" comment.** The original said the
chain terminates because "main is either a no-op or advances
further." That's true but understates the actual safety: GitHub
Actions explicitly does NOT trigger downstream workflow runs from
`GITHUB_TOKEN`-authored pushes. So the loop is impossible by
construction, not just by happy coincidence of ref state. Updated
the comment to reflect the actual mechanism.
Plus a step-name nit: "Fast-forward staging → main" reads as if main
is the target. Renamed to "Fast-forward staging to main" for
consistency with the workflow's name (main → staging).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Background
`auto-promote-staging.yml` advances main via `git merge --ff-only`
+ `git push origin main` — clean fast-forward, no merge commit. But
manual `staging → main` merges via the GitHub UI / API create a merge
commit on main that staging doesn't have. The next `staging → main`
PR then evaluates as "BEHIND" because staging is missing that merge
commit, requiring a manual `gh pr update-branch` round-trip.
This pattern bit twice on 2026-04-28 (PRs #2202 and #2205, both
manual bridges to land pipeline fixes themselves). Each needed
update-branch + re-CI before they could merge. Annoying and
avoidable.
What this workflow does
Triggered on every push to main (regardless of source: auto-promote,
UI merge, API merge, direct push):
1. Check whether main is already in staging's ancestry. If yes,
no-op — auto-promote-staging keeps them aligned via ff push,
and the no-op case is the steady state.
2. If not (manual merge commit on main, or direct main hotfix):
try `git merge --ff-only origin/main` first. Works when staging
hasn't diverged with its own commits.
3. If ff fails (staging has its own in-flight feature work):
`git merge --no-ff origin/main -m "chore: sync main → staging"`.
Absorbs main's tip while keeping staging's own history.
4. Push staging.
Loop safety
Pushing the synced staging triggers auto-promote-staging.yml, which
checks gates on staging's new tip and, if green, ff-pushes staging
to main. Since staging now ⊇ main, the resulting push to main is
either a no-op (no ref change → no push event fires → auto-sync
doesn't re-trigger) or advances main further. In the latter case
auto-sync fires once more, sees main already in staging's ancestry,
no-ops. Bounded.
Conflict handling
If the merge step hits conflicts (staging and main diverged with
incompatible changes), the workflow fails with a clear summary
pointing to manual resolution. This shouldn't happen in practice —
staging is the integration branch; conflicts indicate a direct main
hotfix touching the same code as in-flight staging work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two latent bash bugs in the canonical secret-scan workflow caught
during the post-merge review of molecule-controlplane #301 (a
private consumer that inlined this workflow's logic and got both
fixes there). Same bugs apply here; fixing in canonical means every
public consumer (gh-identity, github-app-auth, the 8 workspace
template repos) inherits the fix on their next workflow_call.
Bug 1: `printf "$OFFENDING"` is a format-string sink.
OFFENDING is built from filenames: `${f} (matched: ${pattern})\n`.
When passed to printf as the first argument, `%` characters in a
filename are interpreted as conversion specifiers — corrupting the
error message or printing `%(missing)` artifacts. No filename in
the current tree triggers it, but a future test fixture, build
artifact, or contributor-supplied path could.
Fix: `printf '%b' "$OFFENDING"` interprets the literal `\n` we
appended without treating OFFENDING as a format string.
Bug 2: `for f in $CHANGED` word-splits on whitespace.
Filenames containing spaces would split into multiple tokens. The
self-exclude check (`[ "$f" = "$SELF" ] && continue`) and the diff
lookup would both operate on partial-path tokens. No filename in
the current tree has whitespace, but the failure would be silent
if one ever did.
Fix: `while IFS= read -r f; do ... done <<< "$CHANGED"` reads
whole lines as filenames. Added `[ -z "$f" ] && continue` to
match the original `for` loop's implicit empty-input skip.
Both fixes are mechanically straightforward (~16 lines net diff,
mostly comments documenting the why). No behavior change for
filenames in the current tree; strictly better for the edge cases.
The same fixes already shipped in molecule-controlplane via #301
which inlined a copy of this workflow. The runtime's bundled
pre-commit hook (molecule-ai-workspace-runtime:
molecule_runtime/scripts/pre-commit-checks.sh) likely has the same
bugs — flagged as a follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the final gap in the SaaS pipeline. After auto-promote-staging
fast-forwards main, publish-workspace-server-image builds new
`:staging-<sha>` images, but `:latest` (what prod tenants pull) only
moves on either a manual `promote-latest.yml` dispatch or a canary-
verify retag (gated on Phase 2 fleet that doesn't exist).
This workflow closes that gap by retagging
`platform:staging-<sha>` + `platform-tenant:staging-<sha>` → `:latest`
whenever E2E Staging SaaS passes for a `main` push. Uses crane
(no Docker daemon needed). Verifies both images exist before retagging
either, so a half-published state is impossible.
Why trigger only on `main` (not staging):
- `:latest` is what prod tenants pull. Only SHAs that have reached
`main` (via auto-promote-staging) should advance `:latest`.
- Triggering on staging would let a staging-only revert advance
`:latest` to a SHA that never reaches `main`, breaking the
invariant "production runs what's on `main`".
Why a separate workflow rather than folding into e2e-staging-saas.yml:
- Test concerns and release concerns separate.
- Disabling promote during an incident is one workflow toggle, not
an edit to the long E2E file.
- When Phase 2 canary work eventually lands, the canary path can
replace this trigger without touching the E2E workflow.
Doc-aligned: per molecule-controlplane/docs/canary-tenants.md,
"green staging E2E → :latest" is the recommended approach for the
current scale (≤20 paying tenants); canary fleet is deferred until
blast radius grows.
Pipeline after this lands is fully self-healing:
staging push → 4 gates green → auto-promote fast-forwards main
→ publish-workspace-server-image → E2E Staging SaaS
→ THIS WORKFLOW retags :latest → tenant fleet auto-pulls in 5 min
(or redeploy-tenants-on-main fans out faster)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Observed 2026-04-28: auto-promote ran for staging head 96955f7b with
all gates actually green (verified via /commits/<sha>/check-runs API)
yet `check-all-gates-green` reported `CodeQL → missing/none` and
aborted. Same SHA was promotable; auto-promote couldn't see it.
Cause: `gh run list --workflow="CodeQL"` matched two workflows in
this repo:
- codeql.yml (explicit, scans both staging and main)
- codeql (GitHub UI-configured Code-quality default setup,
internal, scans default branch only)
gh CLI rejects ambiguous `--workflow=<name>` lookups and returns no
result → the gate fell through to `missing/none` and ALL_GREEN was
set false. Every staging push since both names existed has been
silently dead-locked.
Fix: switch GATES from display-name strings to workflow file paths.
File paths are the unique identifier for a workflow file in
.github/workflows/; display names are decoration and can collide.
The same `gh run list --workflow=<file.yml>` query that fails on
"CodeQL" succeeds on "codeql.yml" because the file path resolves
unambiguously.
No behavior change for the other three gates (CI, E2E Canvas, E2E
API Smoke) since their names didn't collide — they keep working,
they just identify by ci.yml / e2e-staging-canvas.yml / e2e-api.yml
now. The log line shape changes from `CI → completed/success` to
`ci.yml → completed/success` which is fine for ops grep.
When adding/removing a gate going forward: file paths only. Keep
branch-protection required-checks (check-run display names) in
sync as a separate manual step.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The auto-promote-staging.yml gate-check (line 99) treats "workflow
didn't run" as failure. Path-filtered triggers on E2E API Smoke Test
and E2E Staging Canvas meant a platform-only or test-only push to
staging — say, the prior PR #2201 which only touched
tests/e2e/test_staging_full_saas.sh — never triggered the canvas
workflow, and auto-promote saw `missing/none`, marked all_green=false,
and aborted. Same class for any push that doesn't touch the gate's
watched paths. Dead-lock by design, never noticed because the gate
was new.
Fix per Design B (always-run + fast-skip):
- Drop `paths:` from the push/pull_request triggers on both gate
workflows. The workflow now always fires on every staging+main
push/PR.
- Add a `detect-changes` job using `dorny/paths-filter@v3` that
decides whether to do real work, scoped to the same paths the
trigger filter used to watch.
- Real work job (e2e-api / playwright) gates on
`needs: detect-changes; if: needs.detect-changes.outputs.X == 'true'`.
- Add a sibling `no-op` job that runs when the filter output is
false, emitting `::notice::… no-op pass`. The workflow run's
conclusion is `success` either way — auto-promote sees green and
proceeds.
manual `workflow_dispatch` and the weekly canvas `schedule` short-
circuit detect-changes to always-run — those triggers exist precisely
to exercise the suite and shouldn't be silently no-op'd.
Why this approach over making auto-promote-staging smarter:
The alternative (Design A, considered + rejected) was to teach
auto-promote-staging to read each gate's `paths:` filter and treat
"no run because filter excluded the commit" as conditional pass.
That couples auto-promote to other workflows' YAML schema and breaks
silently if a gate is renamed or its filter changes. Design B keeps
the auto-promote contract simple ("each gate emits success") and
makes each gate self-describing — adding a new gate doesn't require
touching auto-promote.
Cost: ~10-30s of runner overhead per gate per push for the no-op when
paths don't match. Negligible vs the alternative of dead-locked
auto-promote chains.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes#132. Extends the cascade propagation probe (added in #2197
and clarified in #2198) with a content-integrity check.
The previous probe verified pip can RESOLVE the version we just
published (catches surface 1+2 propagation lag — metadata + simple
index). It did NOT verify pip can DOWNLOAD bytes that match what we
uploaded — leaving a window where a Fastly stale-content scenario
(rare but PyPI has had it: e.g. 2026-04-01 incident where a CDN node
served a previous version's wheel under the new version's URL for
~90s after upload) would pass the probe and ship corrupt builds to
all 8 receiver templates.
Two-stage check, both must pass before the cascade fans out:
(a) `pip install --no-cache-dir PACKAGE==VERSION` succeeds —
version is resolvable. (Existing, unchanged.)
(b) `pip download` of the same wheel + `sha256sum` matches the
hash captured pre-upload from `dist/*.whl`. (New.)
Captured BEFORE upload via a new `wheel_hash` step that exposes
`steps.wheel_hash.outputs.wheel_sha256`, bubbled up as
`needs.publish.outputs.wheel_sha256`, and consumed by the cascade
probe via the EXPECTED_SHA256 env var.
`pip download` is the right primitive: it writes the actual .whl
file (vs `pip install` which unpacks and discards), so we can
sha256sum it directly. Combined with --no-cache-dir + a wiped
/tmp/probe-dl per poll, every poll re-fetches from the live Fastly
edge — no local-cache mask.
Per-poll cost: ~3-5s pip install + ~3s pip download + 4s sleep.
30-poll budget = ~5-6 min wall on a slow runner (vs the previous
~4-5 min for resolve-only). Well within the cascade's tolerance for
a known-rare CDN issue, and the overwhelming-common case (Fastly
serves matching bytes immediately) exits on the first poll.
Verified locally: pip download of the current PyPI-latest
(molecule-ai-workspace-runtime 0.1.29) produced
sha256=7e782b2d50812257…, exactly matching PyPI's own metadata
endpoint. The mismatch path is exercised inline (different builds
of the same version produce different hashes by definition — the
build_runtime_package.py output is timestamp-deterministic only
within a single CI invocation).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes#134. The post-merge review of #2196 flagged that the combined
workflow's `paths:` filter (the union of both jobs' needs:
`workspace/**` + `scripts/build_runtime_package.py` + the workflow
itself) caused the `pypi-latest-install` job to fire on every
doc-only / adapter-only / unrelated workspace/ edit. The PyPI artifact
that job tests against can't change based on our workspace/ source —
only on actual PyPI publishes — so those runs add noise without
information.
Splits the previously-merged combined workflow:
runtime-pin-compat.yml (kept):
- PyPI-latest install + import smoke (was: pypi-latest-install)
- Narrow `paths:` filter — only fires when workspace/requirements.txt
or this workflow file changes
- Cron-driven daily for upstream-yank detection (unchanged)
runtime-prbuild-compat.yml (new):
- PR-built wheel + import smoke (was: local-build-install)
- Broad `paths:` filter — fires on any workspace/ source change,
scripts/build_runtime_package.py, or this workflow file
- No cron (workspace/ doesn't change between firings)
Behavior identical to before for content; only the trigger surface is
narrower per-job. Each workflow's name is its own status check, so
branch protection (which currently lists neither as required) can
gate them independently in future.
The prior comment in the combined file explicitly acknowledged the
asymmetry and proposed this split as a follow-up; this is that
follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cascade's PyPI-propagation gate polled `/pypi/<pkg>/<ver>/json`,
which is one of THREE surfaces pip touches when resolving an install:
1. /pypi/<pkg>/<ver>/json — metadata endpoint (the old check)
2. /simple/<pkg>/ — pip's primary download index
3. files.pythonhosted.org — CDN-fronted wheel binary
Each has its own cache. Any one of them can lag behind the others,
and the previous gate would let the cascade fire while (2) or (3)
still served the previous version. Downstream `pip install` in the
template repos then resolved to the OLD wheel, the docker layer
cache locked that stale resolution in, and subsequent rebuilds kept
shipping the old runtime — the "five times in one night" cache trap
referenced in the prior comment.
Replace the metadata-only poll with an actual `pip install
--no-cache-dir --force-reinstall --no-deps PACKAGE==VERSION` from
a fresh venv. If pip can resolve and install the exact version we
just published, every receiver template will too — pip itself is
the ground truth for what the receivers will see, no proxy guessing
about which surface is lagging.
- Venv created once outside the loop; only `pip install` runs in
the poll body.
- --no-cache-dir + --force-reinstall ensures every poll hits the
live PyPI surfaces (no local-cache mask).
- --no-deps keeps each poll fast — we only care about resolving
THIS package, not its dep tree.
- Loop budget: 30 attempts × 4s ≈ 2 min (vs prior 30 × 2s = 60s).
Generous vs typical PyPI propagation, surfaces real upstream
issues past the budget.
Verified locally:
- Probing a non-existent version (0.1.999999) → pip exits 1, loop
retries.
- Probing the current PyPI-latest → pip exits 0, `pip show`
returns the version, loop succeeds.
Closes#130.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes#128's chicken-and-egg. The original gate installed the
CURRENTLY-PUBLISHED molecule-ai-workspace-runtime from PyPI, then
overlaid workspace/requirements.txt, then smoke-imported. That
catches problems with the already-shipped artifact (the daily-cron
upstream-yank case), but it cannot catch problems introduced by the
PR itself: the imports it exercises are from the OLD wheel, not the
PR's source. A PR that adds `from a2a.utils.foo import bar` (where
`bar` is added in a2a-sdk 1.5 and the runtime currently pins 1.3)
slips through:
1. Pip resolves the existing PyPI wheel + a2a-sdk 1.3.
2. Smoke imports the OLD main.py — no reference to `bar` → green.
3. Merge → publish-runtime.yml ships a wheel WITH the new import.
4. Tenant images redeploy → all crash on first boot with
ImportError: cannot import name 'bar' from 'a2a.utils.foo'.
Splits the workflow into two jobs:
- pypi-latest-install (renamed from default-install): unchanged
behavior. Runs on the daily cron and on requirements.txt /
workflow edits. Catches upstream PyPI yanks + the
already-shipped artifact going stale.
- local-build-install (new): runs scripts/build_runtime_package.py
on the PR's workspace/, builds the wheel with python -m build
(mirroring publish-runtime.yml byte-for-byte), installs that
wheel, then runs the same smoke import. Tests the artifact
that WOULD be published if this PR merges.
Path filter widened to workspace/** so any runtime-source change
triggers the local-build job. The pypi-latest job's filter is the
same union; its internal logic is unchanged so the daily-cron and
upstream-detection use cases continue to work.
Verified locally: built the wheel from current workspace/ source via
the same script + python -m build invocation, installed into a fresh
venv, imported from molecule_runtime.main import main_sync
successfully.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing wheel-smoke catches AgentCard kwarg-shape regressions
(state_transition_history, supported_protocols) but doesn't catch the
SDK-contract drift class that #2193 just fixed in production: the
a2a-sdk 1.x rename of /.well-known/agent.json →
/.well-known/agent-card.json, plus AGENT_CARD_WELL_KNOWN_PATH moving
to a2a.utils.constants. main.py's readiness probe hardcoded the old
literal and 404'd every attempt, silently dropping every workspace's
initial_prompt for ~weeks before a user reported it.
Two additions to the smoke block:
1. Mount alignment: build an AgentCard, call create_agent_card_routes(),
and assert AGENT_CARD_WELL_KNOWN_PATH is among the mounted paths.
Catches a future SDK release that decouples the constant value
from the route factory's mount path. The source-tree test
(workspace/tests/test_agent_card_well_known_path.py) catches the
main.py side; this catches the SDK side BEFORE PyPI upload.
2. Message helper smoke: import a2a.helpers.new_text_message and
instantiate one. The v0→v1 cheat sheet (memory:
reference_a2a_sdk_v0_to_v1_migration.md) flagged this as a real
migration find — main.py and a2a_executor.py call it in hot
paths, so an import break errors every reply before the message
even leaves the workspace.
Verified by running the equivalent Python inside
ghcr.io/molecule-ai/workspace-template-langgraph:latest:
✓ well-known mount alignment OK (/.well-known/agent-card.json)
✓ message helper import + call OK
Closes the structural-fix half of the #2193 finding from the code-
review-and-quality pass: "the wheel publish smoke didn't catch this.
This is the 7th a2a-sdk migration find of this kind. Task #131 is the
right root-cause fix."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CRITICAL: every workspace boot since the a2a-sdk 1.0 migration (#1974)
has been crashing at AgentCard construction with:
ValueError: Protocol message AgentCard has no "supported_protocols" field
The protobuf field is `supported_interfaces` (plural, interfaces — see
a2a-sdk types/a2a_pb2.pyi:189). The 0.3→1.0 migration left the kwarg
as `supported_protocols`, which doesn't exist in the 1.0 schema, so
the constructor raises before any subsequent line of main runs.
Why this hid for so long:
- publish-runtime.yml's smoke step only IMPORTED molecule_runtime.main;
importing the module is fine, only CONSTRUCTING the AgentCard fails
- The user-visible symptom is "Workspace failed: " with empty
last_sample_error, indistinguishable from generic boot timeouts
- The state_transition_history=True bug (fixed in #2179) was a
sibling of this — same migration, same class, just caught first
Fix is symmetric with #2179:
1. workspace/main.py: rename the kwarg + comment explaining why
2. .github/workflows/publish-runtime.yml: extend the smoke block to
instantiate AgentCard with the exact production call shape, so
the next field-rename of this class fails at publish time
instead of breaking every workspace startup
Verification:
- Constructed AgentCard against fresh a2a-sdk 1.0.2 in a clean
venv with the corrected kwarg → succeeds
- Constructed it with the original `supported_protocols` kwarg →
fails immediately with the exact error production sees
- Smoke test pinned to mirror main.py's exact call shape; main.py
+ smoke must stay in lockstep going forward
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two structural fixes for the cascade race conditions that bit us
five times today:
1. **PyPI propagation wait** (cascade job): poll PyPI for the
just-published version with a 60s budget BEFORE firing
repository_dispatch. PyPI accepts the upload but takes a few
seconds to make it available via the package index. Cascade was
firing too fast — downstream template builds ran `pip install`
against a stale index, resolved to the previous version, and
docker layer cache locked that in for subsequent rebuilds.
Pairs with the build-arg cache invalidation in molecule-ci PR
(separate change). Wait without invalidation = next build still
pip-resolves correctly. Invalidation without wait = first cascade
build may still race PyPI propagation. Together: no race, no
stale cache.
2. **Path filter expansion**: scripts/build_runtime_package.py is
the build script and changes to it (e.g. import-rewrite fixes,
manifest emit, lib/ subpackage move) directly affect what ships
in the wheel. Was missing from the path filter, so PRs touching
only scripts/ (like #2174's lib/ fix) didn't auto-publish — the
operator had to remember a manual dispatch. Add it to the closed
list of files that trigger auto-publish.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thin caller for molecule-ci's reusable disable-auto-merge-on-push
workflow. Forces operator re-engagement when a commit is pushed to
an open PR with auto-merge already enabled.
Pairs with the org-wide "Automatically delete head branches" repo
setting (also enabled today). Defense in depth:
1. Repo setting blocks pushes to a merged-and-deleted branch
(post-merge orphan case — what bit #2174 today: my second
commit landed on an already-merged-and-deleted branch).
2. This workflow catches in-queue races (push lands while the
merge queue is processing) by disabling auto-merge so the
operator must explicitly re-engage.
Together they cover the full lifecycle of "auto-merge enabled →
new commits arrive" without relying on operator discipline.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The wheel's pyproject.toml has declared
`molecule-runtime = "molecule_runtime.main:main_sync"` since the
publish pipeline was created on 2026-04-26, but the function
itself was never present in workspace/main.py — it lived in the
pre-monorepo molecule-ai-workspace-runtime repo and was lost
during the consolidation that made workspace/ the source of truth.
The 0.1.15 wheel still had main_sync from a leftover snapshot,
so the regression went unnoticed until 0.1.16 (the first wheel
built from the new source-of-truth) shipped. Symptom: every
workspace container restart loops with
ImportError: cannot import name 'main_sync' from 'molecule_runtime.main'
— the molecule-runtime CLI script's first line tries to import
the missing symbol. Workspaces stay in `provisioning` until the
10-min sweep marks them failed.
Caught by .github/workflows/runtime-pin-compat.yml, which already
imports the symbol by name as its smoke test. (That check kept
failing red on every recent merge_group run; this PR fixes the
underlying symbol-not-found instead of the smoke step.)
Also strengthens publish-runtime.yml's wheel smoke from
`import molecule_runtime.main` (loads the module — passes even
when entry-point target is missing) to `from molecule_runtime.main
import main_sync` (the actual contract the CLI script needs).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two compounding bugs surfaced when 0.1.16 hit production today:
1. scripts/build_runtime_package.py had a hand-curated TOP_LEVEL_MODULES
set listing every workspace/*.py that should get its bare imports
rewritten to `molecule_runtime.X`. The set silently went stale:
- Missing: transcript_auth (added since #87 phase 1c), runtime_wedge,
watcher → unrewritten imports shipped, every workspace startup
died with ModuleNotFoundError.
- Stale: claude_sdk_executor, cli_executor (both removed in #87),
hermes_executor (never existed) → harmless but misleading.
2. publish-runtime.yml's wheel-smoke step asserted on stable invariants
(BaseAdapter, AdapterConfig, a2a_client error sentinel) but never
imported main. So even though main.py held the broken bare
`from transcript_auth import ...`, the smoke check passed.
Fixes:
- Build script now derives the on-disk module set from workspace/*.py
and asserts it matches TOP_LEVEL_MODULES exactly. Drift in either
direction fails the build with a specific diff message instead of
shipping a broken wheel. Closed-list typo guard preserved (we still
edit the set explicitly when a module is added/removed) — the gate
just makes drift impossible to ignore.
- TOP_LEVEL_MODULES updated to current reality: drop the 3 stale,
add the 3 missing.
- publish-runtime.yml wheel-smoke now `import molecule_runtime.main`
before the invariant asserts. main is the entry point and
transitively imports every module — any bare-import bug surfaces
as ModuleNotFoundError before PyPI accepts the upload.
Tested locally: `python3 scripts/build_runtime_package.py
--version 0.1.99 --out /tmp/build-test` succeeds, and
/tmp/build-test/molecule_runtime/main.py contains the rewritten
`from molecule_runtime.transcript_auth import ...`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a third trigger so any merge to staging that changes workspace/**
auto-publishes a new molecule-ai-workspace-runtime patch release. Closes
the human-in-loop gap that caused tonight's RuntimeCapabilities
ImportError outage.
Tonight: #117 added RuntimeCapabilities to molecule_runtime.adapters.base.
The merge landed at 02:37 UTC. Templates rebuilt their images at 07:37
UTC (4 hours later) and started importing the new symbol. PyPI was
still serving 0.1.15 (pre-#117) because nobody remembered to push a
runtime-vX.Y.Z tag or workflow_dispatch the publish. Result: every
template image shipped tonight runs `from molecule_runtime.adapters.base
import RuntimeCapabilities` against an installed runtime that doesn't
export it -> ImportError -> workspace never registers -> stuck in
provisioning until 10-min sweep.
Mechanism:
- New trigger: push to staging filtered to paths: ['workspace/**'].
Path filter applies only to branch pushes; the existing tag trigger
still fires unconditionally.
- Version derivation for the auto case: query PyPI's JSON API for
current latest, bump the patch component. PyPI is the source of
truth so concurrent runs don't double-publish (HTTP 400 on collision).
- concurrency: group serializes parallel staging merges so they don't
race on the bump computation. cancel-in-progress: false because each
workspace/** change deserves its own release.
- publish job now exposes its derived version as a job-level output so
the cascade reads it cleanly. Fixes a latent bug: cascade tried to
read steps.version.outputs.version, which is from a different job's
scope and silently resolved to empty -- then re-derived from
GITHUB_REF_NAME, which would have been "staging" under the new
trigger and produced an invalid version.
Tag-driven and manual-dispatch paths are unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
[Molecule-Platform-Evolvement-Manager]
## What was broken
Same bug class as the secret-scan.yml fix in #2120 — block-internal-paths
hit `fatal: bad object <sha>` exit 128 on the staging push at
2026-04-27 06:50:33Z.
Two cases:
1. **`merge_group` events**: BASE/HEAD came from
`github.event.before` / `.after` which are push-event-only
properties. On merge_group both came back empty, the script fell
through to "scan entire tree" mode which is correct but
inefficient. Worse, when this workflow is required for the merge
queue (line 21-22), an empty-BASE entire-tree scan would run on
every queue check.
2. **`push` events with shallow clones**: `fetch-depth: 2` doesn't
always cover BASE across true merge commits. When BASE is in the
payload but absent from the local object DB, `git diff` errors out
with `fatal: bad object <sha>` and the job exits 128. This is what
broke today's staging push.
## Fix
Same shape as the secret-scan.yml fix (#2120):
- Add a dedicated `git fetch` step for `merge_group.base_sha`.
- Move event-specific SHAs into a step `env:` block; script uses a
`case` over `${{ github.event_name }}` covering pull_request /
merge_group / push (rather than `if pull_request / else push`
which left merge_group on the empty-BASE branch).
- On-demand fetch + `git cat-file -e` guard for push BASE so a SHA
that's payload-present-but-DB-absent triggers the fetch, and a
fetch failure falls through cleanly to "scan entire tree" instead
of exiting 128.
## Test plan
- [x] YAML structure preserved (no schema changes)
- [x] Bash logic mirrors the secret-scan recovery path tested in #2120
- [ ] CI green on this PR's pull_request scan + push to staging post-merge
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-contained happy-path E2E for the two runtimes the project commits
to first-class support for (task #116, completes the loop on the
"both must work end-to-end with tests" requirement).
What it proves per runtime:
1. POST /workspaces succeeds with the runtime + secrets
2. Workspace reaches status=online within its cold-boot window
(claude-code: 240s, hermes: 900s on cold apt + uv + sidecar)
3. POST /a2a (message/send "Reply with PONG") returns a non-error,
non-empty reply
4. activity_logs row written with method=message/send and ok|error
status (a2a_proxy.LogActivity contract)
Skip semantics: each phase independently checks for its required env
key (CLAUDE_CODE_OAUTH_TOKEN / E2E_OPENAI_API_KEY) and skips cleanly
if absent. The script always exit-0s if every phase either passed or
skipped — so wiring it into a no-keys CI job validates the script
itself stays clean without false-failing.
Idempotent: pre-sweeps any prior "Priority E2E (claude-code)" /
"Priority E2E (hermes)" workspaces so a run interrupted by SIGPIPE /
kill -9 (which bypasses the EXIT trap) doesn't poison the next run.
Same defensive pattern as test_notify_attachments_e2e.sh.
CI wiring:
- e2e-api.yml — runs on every PR with no LLM keys, both phases skip,
catches script-level regressions (set -u bugs, syntax issues, etc.)
- canary-staging.yml + e2e-staging-saas.yml already have the keys
via secrets.MOLECULE_STAGING_OPENAI_KEY and exercise wire-real
behavior — could be wired to opt-in if you want claude-code coverage
there too.
Local runs (from this branch, no keys):
=== Results: 0 passed, 0 failed, 2 skipped ===
Validates the capability primitives shipped in PRs #2137-2144: once
template PRs #12 (claude-code) + #25 (hermes) merge with their
declared provides_native_session=True + idle_timeout_override=900,
a manual run with both keys validates the full native+pluggable chain.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User asked to "keep optimizing and comprehensive e2e testings to prove all
works as expected" for the communication path. Adds three layers of coverage
for PR #2130 (agent → user file attachments via send_message_to_user) since
that path has the most user-visible blast radius:
1. Shell E2E (tests/e2e/test_notify_attachments_e2e.sh) — pure platform test,
no workspace container needed. 14 assertions covering: notify text-only
round-trip, notify-with-attachments persists parts[].kind=file in the
shape extractFilesFromTask reads, per-element validation rejects empty
uri/name (regression for the missing gin `dive` bug), and a real
/chat/uploads → /notify URI round-trip when a container is up.
2. Canvas AGENT_MESSAGE handler tests (canvas-events.test.ts +5) — pin the
WebSocket-side filtering that drops malformed attachments, allows
attachments-only bubbles, ignores non-array payloads, and no-ops on
pure-empty events.
3. Persisted response_body shape test (message-parser.test.ts +1) — pins
the {result, parts} contract the chat history loader hydrates on
reload, so refreshing after an agent attachment restores both caption
and download chips.
Also wires the new shell E2E into e2e-api.yml so the contract regresses
in CI rather than only in manual runs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
[Molecule-Platform-Evolvement-Manager]
## What was breaking
All three staging e2e workflows' "Teardown safety net" steps
filtered candidate slugs by `f'e2e-...-{today}-...'` where `today`
was computed at safety-net-step time via `datetime.date.today()`.
When a run crossed midnight UTC (start before 00:00, end after),
`today` became the NEXT day, but the slug it created carried the
PRIOR day's date. The filter never matched its own slug → leak.
## Today's incident
E2E Staging Canvas run [24970092066](
https://github.com/Molecule-AI/molecule-core/actions/runs/24970092066):
- started 2026-04-26 23:45:59Z
- created slug `e2e-canvas-20260426-1u8nz3` at 23:59Z
- ended 2026-04-27 00:12:47Z (failure)
- safety-net step ran with `today=20260427`
- filter `e2e-canvas-20260427-` did not match `...20260426-1u8nz3`
- tenant + child workspace EC2 both stayed up
Confirmed via CP staging logs: no DELETE for `1u8nz3` ever issued.
The Playwright globalTeardown didn't fire (test crashed mid-run);
the workflow safety-net was the last line and it missed.
## Fix
All three workflows now sweep BOTH today AND yesterday's UTC dates,
so a run that crosses midnight still matches its own slug:
```python
today = datetime.date.today()
yesterday = today - datetime.timedelta(days=1)
dates = (today.strftime('%Y%m%d'), yesterday.strftime('%Y%m%d'))
prefixes = tuple(f'e2e-canvas-{d}-' for d in dates) # (canvas variant)
```
Per-run-id scoping (saas + canary) is preserved — the prior-day
prefix still includes the run_id, so cross-midnight runs only sweep
their own slugs, not other in-flight runs from yesterday.
## Why two-day window vs. arbitrary lookback
A run can't legitimately last more than 24h on GitHub-hosted
runners (workflow `timeout-minutes` caps; canary=25, e2e-saas=45,
canvas=30). Two-day window is enough to cover any cross-midnight
run without widening the cross-run-cleanup blast radius further.
The `sweep-stale-e2e-orgs.yml` cron (with its 120-min age threshold)
remains the catch-all for anything older that drifts through.
## Test plan
- [x] Manual logic simulation: post-midnight slug matches yesterday's
prefix; same-day still matches; 2-days-ago does NOT match;
production tenant never matches
- [x] All three workflow YAMLs syntactically valid
- [ ] Next cross-midnight run cleans up its own slug
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>