Commit Graph

166 Commits

Author SHA1 Message Date
Hongming Wang
d8210514c1 ci(canvas): wire vitest --coverage into CI for baseline observability (#1815)
Step 2 of #1815. Step 1 (instrumentation in canvas/vitest.config.ts)
already shipped — the inline comment there explicitly defers wiring
into CI to a follow-up because turning on a 70% threshold blind would
either fail CI immediately or paper over a real gap with an ad-hoc
exclude list.

This PR ships the observability half:
- Replaces `npx vitest run` with `npx vitest run --coverage` in the
  canvas-build job. Coverage gets reported on every PR; no threshold
  gate yet (vitest.config.ts intentionally doesn't set thresholds).
- Adds an artifact upload step for canvas/coverage/ (HTML + json-summary)
  so reviewers can browse the coverage report from any PR. 7-day
  retention; if-no-files-found=warn so a step skip doesn't fail.

Step 3 (thresholds + hard gate) is the natural follow-up — track in a
new sub-issue once we've seen ~5-10 PRs of baseline data and know
where current coverage sits. The issue body proposed lines:70 /
functions:70 / branches:65 / statements:70; that may need adjustment
once the baseline lands.
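A Step-3 gate along those lines could be sketched in shell. The file name, JSON shape, and numbers below follow the proposed thresholds; treat them as assumptions until the baseline lands (a real gate would more likely use vitest's own `coverage.thresholds` config or jq):

```shell
# Hypothetical threshold gate over vitest's json-summary output.
# The sample file stands in for canvas/coverage/coverage-summary.json.
cat > coverage-summary.json <<'EOF'
{"total":{"lines":{"pct":74.2},"functions":{"pct":71.0},"branches":{"pct":66.5},"statements":{"pct":73.9}}}
EOF

fail=0
for m in lines:70 functions:70 branches:65 statements:70; do
  key=${m%%:*}; min=${m##*:}
  # crude single-line JSON scrape, good enough for a sketch
  pct=$(sed -n "s/.*\"$key\":{\"pct\":\([0-9.]*\)}.*/\1/p" coverage-summary.json)
  awk -v p="$pct" -v t="$min" 'BEGIN{exit (p>=t)?0:1}' \
    || { echo "::error::coverage $key ${pct}% < ${min}%"; fail=1; }
done
echo "fail=$fail"
```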

Closes the Step-2 portion of #1815. Step 3 stays open or gets a fresh
issue depending on your preference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:51:34 -07:00
Hongming Wang
07a17c2e59 Merge remote-tracking branch 'origin/staging' into docs/auto-promote-staging-prereq-comment
# Conflicts:
#	.github/workflows/auto-promote-staging.yml
2026-04-28 20:46:42 -07:00
Hongming Wang
e373fa1a96 docs(ci): document auto-promote-staging GITHUB_TOKEN PR-create prereq
Add a comment block at the top of auto-promote-staging.yml naming the
load-bearing one-time repo setting that the workflow depends on:

  Settings → Actions → General → Workflow permissions
  →  Allow GitHub Actions to create and approve pull requests

Without this toggle, every workflow_run fails with
"GitHub Actions is not permitted to create or approve pull requests
(createPullRequest)". Observed 2026-04-29 01:43 UTC blocking the
fcd87b9 promotion (PRs #2248 + #2249); manually bridged via PR #2252.

The setting is invisible to anyone reading the workflow file, but the
workflow cannot do its job without it. Documenting here so the next
time it gets toggled off (org admin change, repo migration, audit
cleanup) the failure mode points at the cause rather than another
round of "why is auto-promote broken."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 18:49:07 -07:00
Hongming Wang
fcd87b9526
Merge pull request #2249 from Molecule-AI/fix/publish-runtime-cascade-hard-fail-on-push
fix(ci): hard-fail publish-runtime cascade on push when token missing
2026-04-29 01:33:10 +00:00
Hongming Wang
f1c6673e03 fix(ci): hard-fail publish-runtime cascade on push when token missing
Mirror the sweep-cf-orphans hardening (#2248) on publish-runtime's
TEMPLATE_DISPATCH_TOKEN gate. The previous behaviour was to print
"⚠️ skipping cascade — templates will pick up the new version
on their own next rebuild" and exit 0. That message is wrong: the 8
workspace-template repos only rebuild on this repository_dispatch
fanout. Without the dispatch they stay pinned to whatever runtime
version they last saw, and the gap is invisible until someone
notices a template several versions behind weeks later.

Behaviour after this PR:

  - push (auto-trigger on workspace/runtime/** changes) → exit 1
  - workflow_dispatch (manual operator)                  → exit 0
    with a warning (operator already accepted state; let them rerun
    after restoring the secret)

The token-missing path now also names the consequence concretely
("templates will NOT pick up the new version until this token is
restored") so future operators see the actionable line, not the
misleading "they'll catch up on their own" message.
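The push-vs-dispatch split can be modelled as a small shell function. Names and messages below are illustrative, not the workflow's literal step:

```shell
# Hypothetical gate: $1 = event name, $2 = TEMPLATE_DISPATCH_TOKEN value.
cascade_gate() {
  if [ -n "$2" ]; then
    echo "dispatching to template repos"
    return 0
  fi
  msg="TEMPLATE_DISPATCH_TOKEN missing — templates will NOT pick up the new version until this token is restored"
  if [ "$1" = "workflow_dispatch" ]; then
    echo "::warning::$msg"   # manual operator already accepted the state
    return 0
  fi
  echo "::error::$msg"       # push path: fail loudly
  return 1
}
```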

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 18:28:01 -07:00
667751919d
Merge pull request #2248 from Molecule-AI/fix/sweep-cf-orphans-hard-fail-on-schedule
fix(ci): hard-fail sweep-cf-orphans on schedule when secrets missing
2026-04-29 01:16:22 +00:00
Hongming Wang
9f39f3ef6c fix(ci): hard-fail sweep-cf-orphans on schedule when secrets missing
Replace the soft-skip-with-warning behaviour for scheduled runs of the
hourly Cloudflare orphan sweeper with an explicit failure when the six
required secrets aren't set. Manual workflow_dispatch keeps the
soft-skip path so an operator can short-circuit a deliberate rerun
without redoing the secrets dance — they accepted the state when they
clicked the button.

Why: from some-date to 2026-04-28, all six secrets were unset on the
repo. Every hourly tick printed a yellow ⚠️ and exited 0,
which GitHub registers as "completed/success" — the sweeper was
indistinguishable from a healthy janitor with nothing to do. Cloudflare
orphans accumulated unobserved to 152/200 (~76% of the zone quota),
and only surfaced via a manual audit. The mechanism to catch this kind
of regression is to make the workflow loud: red runs prompt
investigation, green runs are presumed healthy.

Schedule/workflow_run/push paths now print three ::error:: lines
naming the missing secrets, the fix, and a one-line reference to this
incident, then exit 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 18:13:22 -07:00
Hongming Wang
5753021194
Merge pull request #2247 from Molecule-AI/fix/auto-promote-staging-pr-based
fix(ci): auto-promote-staging opens a PR + uses merge queue, not direct push
2026-04-29 00:57:33 +00:00
Hongming Wang
e45a5c98b0 fix(ci): auto-promote-staging opens a PR + uses merge queue, not direct push
Mirrors the fix #2234 applied to auto-sync-main-to-staging.yml in the
reverse direction. Both workflows now use the same merge-queue path
that humans use; no special-case bypass.

Why

Every tick of auto-promote-staging.yml since main's branch protection
went stricter has been failing with:

  remote: error: GH006: Protected branch update failed for refs/heads/main.
  remote: - Required status checks "Analyze (go)", "Analyze (javascript-typescript)",
    "Analyze (python)", "Canvas (Next.js)", "Detect changes",
    "E2E API Smoke Test", "Platform (Go)", "Python Lint & Test",
    and "Shellcheck (E2E scripts)" were not set by the expected
    GitHub apps.
  remote: - Changes must be made through a pull request.

The previous version did `git merge --ff-only origin/staging &&
git push origin main` directly. That works against a permissive
branch — it doesn't work against a ruleset that requires checks
satisfied by the expected GitHub apps. Only PR merges through the
queue produce check runs from the right apps.

Result was that today's 12+ merges to staging never propagated to
main; the auto-promote ran every tick and failed every tick, while
operators had to keep opening manual `staging → main` bridges.

Fix

  - Replace the direct git push step with a step that opens (or reuses)
    a PR base=main head=staging and enables auto-merge. The merge queue
    lands it once gates are green on the merge_group ref.
  - The PR's head IS the staging branch (no per-SHA promote branch
    needed) — the whole purpose is "advance main to staging's tip".
  - Add `pull-requests: write` permission so the workflow can call
    gh pr create + gh pr merge --auto.
  - Drop the `git merge-base --is-ancestor` divergence check — the
    merge queue itself enforces branch protection now, and rejects
    the PR if main has diverged from staging history.
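The open-or-reuse step can be sketched as below; `run` echoes instead of executing so the `gh` calls are visible without touching GitHub, and the exact flag spelling should be treated as an assumption, not the workflow's literal step:

```shell
run() { echo "+ $*"; }   # dry-run; replace the body with "$@" to execute

promote_staging() {
  existing="$1"   # pre-queried open base=main head=staging PR number, "" if none
  if [ -z "$existing" ]; then
    run gh pr create --base main --head staging \
      --title "chore: promote staging to main" --body "auto-promote"
  else
    run echo "reusing PR #$existing"
  fi
  # auto-merge: the merge queue lands it once gates are green
  run gh pr merge staging --auto
}

promote_staging ""
```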

Loop safety preserved: when this PR's merge lands on main, it
triggers auto-sync-main-to-staging.yml which opens a sync PR back
to staging. That sync PR's eventual merge is by GITHUB_TOKEN (the
merge queue) which doesn't trigger downstream workflow_run events
— so auto-promote-staging.yml does NOT re-fire from its own merge
landing.

Refs: #2234 (the parallel fix for auto-sync-main-to-staging.yml),
task #142, multiple failing runs visible in
https://github.com/Molecule-AI/molecule-core/actions/workflows/auto-promote-staging.yml
2026-04-28 17:54:15 -07:00
Hongming Wang
c68ea3a284
Merge pull request #2246 from Molecule-AI/chore/all-deps-batch-2026-04-28-pt2
chore(deps): batch dep bumps — 6 safe upgrades (4 actions majors + 2 npm dev deps)
2026-04-29 00:48:15 +00:00
Hongming Wang
fc59f939ac chore(deps): batch dep bumps — 6 safe upgrades (4 actions majors + 2 npm dev deps)
Consolidates the remaining safe-to-merge dependabot PRs from the
2026-04-28 wave into one consumable PR. Replaces three earlier
single-bump PRs (#2245, #2230, #2231) which were closed in favor of
this single batch — same pattern as #2235.

GitHub Actions majors (SHA-pinned per org convention):
  github/codeql-action       v3 → v4.35.2  (#2228)
  actions/setup-node         v4 → v6.4.0   (#2218)
  actions/upload-artifact    v4 → v7.0.1   (#2216)
  actions/setup-python       v5 → v6.2.0   (#2214)

npm dev deps (canvas/, lockfile regenerated in node:22-bookworm
container so @emnapi/* and other Linux-only optional deps are
properly resolved — Mac-native `npm install` strips them, which
caused the earlier #2235 batch to drop these two):
  @types/node                ^22 → ^25.6   (#2231)
  jsdom                      ^25 → ^29.1   (#2230)

Why each is safe

  setup-node v4 → v6 / setup-python v5 → v6:
    Every consumer call pins node-version / python-version
    explicitly. v5 / v6 changed defaults but pinned consumers
    are unaffected. Confirmed via grep across .github/workflows/
    — all setup-node call sites pin '20' or '22', all
    setup-python call sites pin '3.11'.

  codeql-action v3 → v4.35.2:
    Used as init/autobuild/analyze sub-actions in codeql.yml.
    v4 bundles a newer CodeQL CLI; ubuntu-latest auto-updates
    so functional behavior is unchanged. The deprecated
    CODEQL_ACTION_CLEANUP_TRAP_CACHES env var (per v4.35.2
    release notes) is undocumented and we don't set it.

  upload-artifact v4 → v7.0.1:
    v6 introduced Node.js 24 runtime requiring Actions Runner
    >= 2.327.1. All upload-artifact users (codeql.yml,
    e2e-staging-canvas.yml) run on `ubuntu-latest` (GitHub-
    hosted), which auto-updates the runner agent. Self-hosted
    runners are NOT used for these jobs.

  @types/node 22 → 25 / jsdom 25 → 29:
    Both are dev-only — @types/node is type definitions,
    jsdom backs vitest's DOM environment. Tests pass:
    79 files / 1154 tests in node:22-bookworm container.

Verified locally (Linux container so the lockfile reflects what
CI's `npm ci` will install):
  - cd canvas && npm install --include=optional → 169 packages
  - npm test → 1154/1154 pass
  - npm ci → clean install succeeds
  - npm run build → Next.js prerendering succeeds

Closes when this lands (the 3 individual auto-merge PRs from earlier
were closed):
  #2228 #2218 #2216 #2214 #2231 #2230

NOT included (CI failing on dependabot's own run — major framework
bumps that need code-side migration tasks, not safe auto-bumps):
  #2233 next 15 → 16
  #2232 tailwindcss 3 → 4
  #2226 typescript 5 → 6
2026-04-28 17:44:55 -07:00
Hongming Wang
a1bc771f87
Merge pull request #2243 from Molecule-AI/fix/branch-protection-required-check-naming
fix(ci): no-op jobs emit same check-run name as their real counterparts
2026-04-29 00:43:31 +00:00
Hongming Wang
4f0dfbbf0b
Merge pull request #2242 from Molecule-AI/fix/dispatch-input-shell-injection-hardening
fix(security): harden dispatch inputs against shell injection
2026-04-29 00:31:08 +00:00
github-actions[bot]
7b2d9e9bce fix(ci): no-op job emits same check-run name as the real one
Branch protection on `main` requires "E2E API Smoke Test" as a status
check. With Design B's no-op + e2e-api job split, when paths-filter
excludes a commit:

  - e2e-api job (name="E2E API Smoke Test"): SKIPPED
  - no-op job (name="no-op"): SUCCESS

Branch protection counts the skipped check-run as not-satisfied →
auto-promote-staging's `git push origin main` rejected with GH006.
Observed 2026-04-28 00:22 UTC: every gate green at the workflow level,
all_green=true in auto-promote-staging's gate-check, but the FF push
itself rejected with:

    Required status checks "..., E2E API Smoke Test, ..." were not set
    by the expected GitHub apps.

Fix: give the no-op job the same `name:` as the real one. Now both
register as check-runs named "E2E API Smoke Test" — exactly one runs
per workflow execution (mutex `if`), the other registers as skipped
with the same name. Branch protection sees at least one success,
requirement satisfied.

Same fix applied to e2e-staging-canvas.yml's no-op (name → "Canvas
tabs E2E") for symmetry, even though "Canvas tabs E2E" isn't currently
in main's required check list — kept consistent so the next time a
required-checks reshuffle pulls it in, it doesn't recreate this bug.

Note: Design B's intent was always "emit a result auto-promote can
read" — that intent was satisfied at the workflow-conclusion level
(success), but missed the per-check-run-name level. This PR closes
that second-order gap.
2026-04-28 17:25:31 -07:00
github-actions[bot]
b2a0703f1c fix(ci): per-SHA concurrency on staging gate workflows
e2e-staging-canvas had a single global concurrency group:

    concurrency:
      group: e2e-staging-canvas
      cancel-in-progress: false

That meant the entire repo shared one running + one pending slot. When a
staging push queued behind an in-flight run and a third entrant (a PR
run, a follow-on push) entered the group, the staging push got
cancelled. auto-promote-staging then saw `completed/cancelled` for a
required gate and refused to advance main.

Observed 2026-04-28 23:51-23:53: staging tip 3f99fede's
e2e-staging-canvas push run was cancelled within 2:20 of starting
because a PR run
on a follow-on branch entered the group. Auto-promote-staging fired 8+
times after that, all skipped because canvas was still in the cancelled
state. The chain stayed stuck until the cancelled run was manually
re-dispatched.

e2e-api had a softer version of the same bug — `group: e2e-api-${{
github.ref }}`. Per-ref isolates push events from PR events, so this
specific scenario didn't hit it, but back-to-back pushes to staging at
SHA-A and SHA-B share refs/heads/staging and would still cancel SHA-A's
queued run when SHA-B enters.

Both workflows now use per-SHA grouping. The single-global-group's
original intent was to throttle parallel E2E provisions, but each E2E
run already isolates its state via fresh-org-per-run, and parallel
infrastructure cost at our scale (~$0.001/min × 10min × 2) is rounding
error compared to a stuck pipeline.

Per-SHA still dedupes accidental double-triggers for the SAME SHA.
It does not cancel obsolete-PR-version runs on force-push — that wasted
CI is acceptable given the alternative is losing staging-tip data that
auto-promote-staging depends on.

Other gate workflows: ci.yml uses `cancel-in-progress: true` which is
correct for unit tests (intentional cancellation on supersede). codeql.yml
is per-ref like e2e-api was; same fix probably applies if the same
deadlock pattern is observed there, but no incident yet so deferring.
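Expressed as the group key the workflow computes (the exact group-name string is an assumption; in workflow YAML this is `group: <name>-${{ github.sha }}`):

```shell
# Per-SHA grouping: two pushes at different SHAs land in different groups,
# so neither can cancel the other; a double-trigger for the SAME SHA dedupes.
GITHUB_SHA=3f99fede0123456789abcdef0123456789abcdef   # example value
group="e2e-staging-canvas-${GITHUB_SHA}"
echo "$group"
```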
2026-04-28 17:18:15 -07:00
github-actions[bot]
475a51adec fix(ci): defer promote when E2E is racing with publish (review fix)
Self-review caught a real correctness bug: scenario where publish-
workspace-server-image completes BEFORE E2E Staging SaaS for a runtime-
touching SHA. Publish typically takes ~5-10min; E2E ~10-15min, so this
ordering is the common case for runtime-path PRs.

Previous gate logic:
  - completed/success: proceed
  - completed/failure: abort
  - everything else (including in_progress): proceed   ← BUG

If publish-trigger fires while E2E is still running, the gate returned
"in_progress/none" and fell through the catch-all "proceed" branch.
Result: :latest retagged on the publish signal alone. Then E2E ends
red — but :latest was already wrongly advanced; the E2E-completion
trigger's job-level if=conclusion==success filter just skips, never
rolls back.

Fix: explicit case for in_progress|queued|requested|waiting|pending
that DEFERS — sets gate.proceed=false, writes a "deferred" summary,
exits 0 (workflow run shows success, retag steps skipped). The E2E
completion trigger then fires later and either promotes (green) or
aborts (red), giving us correct ordering regardless of who finishes
first.

Subsequent steps now guarded by `if: steps.gate.outputs.proceed ==
'true'` instead of relying on `exit 1` for skip semantics.

Also added an explicit catch-all `*)` branch that aborts on unknown
states (forward-compat: GitHub adds a new status, we surface it
instead of silently promoting through it).
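The full status matrix reduces to one case statement. A runnable sketch, combining status and conclusion as `status/conclusion` (a simplification of the actual step, which reads these from the check-runs API):

```shell
gate() {
  # $1 = run status, $2 = conclusion ("" until the run completes)
  case "$1/$2" in
    completed/success) echo proceed ;;
    completed/*)       echo abort   ;;  # failure, cancelled, timed_out, ...
    in_progress/*|queued/*|requested/*|waiting/*|pending/*)
                       echo defer   ;;  # E2E still racing: let its completion trigger decide
    *)                 echo abort   ;;  # unknown future states: never promote through them
  esac
}
```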
2026-04-28 16:59:58 -07:00
github-actions[bot]
f4f45f8561 fix(ci): auto-promote :latest also on publish-image, not just E2E
Previously this workflow only triggered on E2E Staging SaaS completion,
which is itself paths-filtered to runtime handlers
(workspace-server/internal/handlers/{registry,workspace_provision,
a2a_proxy}.go, middleware/**, provisioner/**).
publish-workspace-server-image fires on a STRICTLY BROADER path set
(workspace-server/**,
canvas/**, manifest.json) — so canvas-only or cmd-only or sweep-only
PRs rebuilt the platform image without ever advancing :latest.

Result observed 2026-04-28: zero runs of this workflow since merge
despite eight main pushes. :latest sat ~7 hours / 9 PRs behind main.

Fix: add publish-workspace-server-image as a second trigger. Add an
explicit gate inside the job that aborts when E2E Staging SaaS for the
same SHA ended red. When E2E didn't fire (paths-filtered), proceed —
auto-promote-staging's pre-merge gates (CI + E2E Canvas + E2E API +
CodeQL on staging) already validated this SHA before main moved.

Concurrency group serializes promotes per-SHA so the publish+E2E both-
fired race lands cleanly. Idempotent crane tag makes it safe regardless.
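The verify-both-then-tag-both ordering, sketched with a dry-run wrapper (the registry path and image names are placeholders):

```shell
run() { echo "+ $*"; }   # dry-run; replace the body with "$@" to execute

promote_latest() {
  sha="$1"; reg="ghcr.io/example"   # placeholder registry path
  # Phase 1: both digests must resolve before ANY tag moves...
  for img in platform platform-tenant; do
    run crane digest "$reg/$img:staging-$sha" || return 1
  done
  # Phase 2: ...so a half-published :latest is impossible.
  for img in platform platform-tenant; do
    run crane tag "$reg/$img:staging-$sha" latest
  done
}

promote_latest abc1234
```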
2026-04-28 16:53:30 -07:00
a45a026099
Merge pull request #2235 from Molecule-AI/chore/deps-batch-2026-04-28
chore(deps): batch dep bumps — 11 safe upgrades from 2026-04-28 wave
2026-04-28 23:40:51 +00:00
Hongming Wang
0cdbc2c4f6 chore(deps): batch dep bumps — 11 safe upgrades from 2026-04-28 dependabot wave
Consolidates 11 of the 17 open Dependabot PRs (#2215, #2217, #2219-#2225,
#2227, #2229) into one PR. Every entry is a patch / minor / floor bump
where the impact surface is small and CI carries the proof.

Same pattern as the 2026-04-15 batch.

Go (workspace-server/go.mod + go.sum, regenerated via `go mod tidy`):
  - golang.org/x/crypto                    0.49.0  → 0.50.0   (#2225)
  - github.com/golang-jwt/jwt/v5           5.2.2   → 5.3.1    (#2222)
  - github.com/gin-contrib/cors            1.7.2   → 1.7.7    (#2220)
  - github.com/docker/go-connections       0.6.0   → 0.7.0    (#2223)
  - github.com/redis/go-redis/v9           9.7.3   → 9.19.0   (#2217)

Python floor bumps (workspace/requirements.txt; current pip-resolved
versions don't change unless they happen to be below the new floor):
  - httpx                                  >=0.27  → >=0.28.1 (#2221)
  - uvicorn                                >=0.30  → >=0.46   (#2229)
  - temporalio                             >=1.7   → >=1.26   (#2227)
  - websockets                             >=12    → >=16     (#2224)
  - opentelemetry-sdk                      >=1.24  → >=1.41.1 (#2219)

GitHub Actions (SHA-pinned per existing convention):
  - dorny/paths-filter@d1c1ffe (v3) → @fbd0ab8 (v4.0.1)        (#2215)

REMOVED from this batch (lockfile platform mismatch):
  - #2231 @types/node ^22 → ^25.6   (npm install on macOS strips
    Linux-only @emnapi/* entries from package-lock.json that CI's
    `npm ci` then refuses; needs a Linux-side install to land cleanly)
  - #2230 jsdom ^25 → ^29.1          (same)

NOT included in this batch (deferred to per-PR human review):
  - #2228 github/codeql-action     v3 → v4   (CodeQL CLI alignment risk)
  - #2218 actions/setup-node       v4 → v6   (default Node version drift)
  - #2216 actions/upload-artifact  v4 → v7   (3 major versions)
  - #2214 actions/setup-python     v5 → v6   (action major)

NOT merged (CI failing on dependabot's own PR):
  - #2233 next 15 → 16
  - #2232 tailwindcss 3 → 4
  - #2226 typescript 5 → 6

Verified:
  - workspace-server: `go mod tidy && go build ./... && go test ./...` — green
  - workspace requirements.txt: floor bumps only
2026-04-28 16:25:46 -07:00
Hongming Wang
cf258b3355 fix(ci): auto-sync opens a PR + uses merge queue, not direct push
The molecule-core/staging branch is protected by ruleset 15500102
(name: staging-merge-queue) which blocks ALL direct pushes — no
bypass even for org admins or the GitHub Actions integration. The
prior version of this workflow attempted `git push origin staging`
and was rejected with GH013:

    ! [remote rejected] staging -> staging
    (push declined due to repository rule violations)

    - Changes must be made through a pull request.
    - Changes must be made through the merge queue

This was a real architectural mismatch: auto-sync was bypassing
the same gates everyone else goes through to land on staging,
which is exactly what the ruleset is designed to prevent.

The fix matches the org convention: the workflow now opens a PR
(base=staging, head=auto-sync/main-<sha>) and enables auto-merge.
The merge queue picks it up, runs required gates against the
merged result, and lands it. Same path human PRs take through
staging — no special-snowflake bypass.

Trade-off acknowledged

- Slight PR churn: every main push that needs sync opens a tracked
  PR. With concurrency: cancel-in-progress: false (existing) and
  the merge queue's serial processing, this is bounded — PRs land
  in order, no thundering herd.
- The previous direct-push approach worked on
  molecule-controlplane (which has no merge_queue ruleset on
  staging). That version of the workflow was correct for that
  repo's protection model. Per-repo divergence is acceptable; the
  invariant ("staging ⊇ main") is what matters, not how it's
  enforced.

Loop safety preserved

GITHUB_TOKEN-authored merges (including the merge queue's land
of this PR) do NOT trigger downstream workflow runs. So the merge
to staging from this PR doesn't fire auto-promote-staging — same
as the direct-push version.

Idempotency

The branch name is derived from main's short sha
(`auto-sync/main-<sha>`) so workflow restarts on the same main
push reuse the existing branch + PR rather than opening duplicates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 15:59:26 -07:00
Hongming Wang
1867111d95
Merge pull request #2213 from Molecule-AI/chore/pin-actions-to-shas
chore(security): pin Actions to SHAs + enable Dependabot auto-bumps
2026-04-28 22:49:25 +00:00
Hongming Wang
c77a88c247 chore(security): pin Actions to SHAs + enable Dependabot auto-bumps
Supply-chain hardening for the CI pipeline. 23 workflow files
modified, 59 mutable-tag refs replaced with commit SHAs.

The risk

Every `uses:` reference in .github/workflows/*.yml was pinned to a
mutable tag (e.g., `actions/checkout@v4`). A maintainer of an
action — or a compromised maintainer account — can repoint that
tag to malicious code, and our pipelines silently pull it on the
next run. The tj-actions/changed-files compromise of March 2025 is
the canonical example: maintainer credential leak, attacker
repointed several `@v<N>` tags to a payload that exfiltrated
repository secrets. Repos that pinned to SHAs were unaffected.

The fix

Replace each `@v<N>` with `@<commit-sha> # v<N>`. The trailing
comment preserves human readability ("ah, this is v4"); the SHA
makes the reference immutable.

Actions covered (10 distinct):
  actions/{checkout,setup-go,setup-python,setup-node,upload-artifact,github-script}
  docker/{login-action,setup-buildx-action,build-push-action}
  github/codeql-action/{init,autobuild,analyze}
  dorny/paths-filter
  imjasonh/setup-crane
  pnpm/action-setup (already pinned in molecule-app, listed here for completeness)

Excluded:
  Molecule-AI/molecule-ci/.github/workflows/disable-auto-merge-on-push.yml@main
    — internal org reusable workflow; we control its repo, threat model
    is different from third-party actions. Conventional to pin to @main
    rather than SHA for internal reusables.

The maintenance cost

SHA pinning means upstream fixes require manual SHA bumps. Without
automation, pinned SHAs go stale. So this PR also enables Dependabot
across four ecosystems:

  - github-actions (workflows)
  - gomod (workspace-server)
  - npm (canvas)
  - pip (workspace runtime requirements)

Weekly cadence — the supply-chain attack window is "minutes between
repoint and pull"; weekly auto-bumps don't help with zero-days
regardless. The point is to pull in non-zero-day fixes without
operator effort.

Aligns with user-stated principle: "long-term, robust, fully-
automated, eliminate human error."

Companion PR: Molecule-AI/molecule-controlplane#308 (same pattern,
smaller surface).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 15:37:06 -07:00
Hongming Wang
6638d6e1d7 feat(ci): SECRET_PATTERNS drift lint across known consumers
Adds a lint that diffs the canonical SECRET_PATTERNS array in
.github/workflows/secret-scan.yml against every known public
consumer mirror, failing on any divergence.

Why: every side that scans for credentials carries its own copy of
the pattern list. They drift — most recently the workspace-runtime
pre-commit hook lagged the canonical by one pattern (sk-cp- /
MiniMax F1088 vector), so a developer's local pre-commit would let
a sk-cp- token through while the org-wide CI scan would refuse it.
Useless friction; automated detection closes the gap.

Implementation:
  .github/scripts/lint_secret_pattern_drift.py — pure stdlib, fetches
    each consumer's RAW file via urllib, extracts the
    SECRET_PATTERNS=( ... ) array via anchored regex (the closing
    `)` is anchored to the start of a line because pattern comments
    like `# GitHub PAT (classic)` contain their own paren mid-line),
    diffs against canonical, fails on missing or extra patterns.
    Fetch failures are warnings, not errors — a consumer whose
    branch was renamed shouldn't fail the lint until someone updates
    the URL list.

  .github/workflows/secret-pattern-drift.yml — daily 05:00 UTC cron
    + on-push gate (when canonical, the workflow, or the script
    changes) + workflow_dispatch. Read-only token, 5-minute timeout.
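The close-paren anchoring can be illustrated at the grep level (the real lint is Python over urllib; the array contents below are invented stand-ins). The point: a mid-line `)` inside a pattern comment must not terminate the extraction early.

```shell
# Stand-in for a consumer hook carrying its own SECRET_PATTERNS copy.
cat > hook.sh <<'EOF'
SECRET_PATTERNS=(
  'sk-cp-[A-Za-z0-9]{20,}'   # hypothetical control-plane token pattern
  'ghp_[A-Za-z0-9]{36}'      # GitHub PAT (classic)
)
EOF
# Second address anchors ')' to the start of a line, so the mid-line
# ')' in the "(classic)" comment does not end the range one line early.
sed -n '/^SECRET_PATTERNS=(/,/^)/p' hook.sh
```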

Initial consumer set: workspace-runtime's bundled pre-commit hook
(the one that drifted on sk-cp-). molecule-controlplane's inlined
copy is private so this workflow can't read it; that's tracked
separately, and the controlplane's own self-monitoring is the
remaining gap.

Verified locally: lint detects drift correctly when the runtime
hook is missing sk-cp-, returns clean when aligned.

Refs: task #139.
2026-04-28 15:29:09 -07:00
Hongming Wang
97d5883e76 fix(ci): auto-sync concurrency + cleanup follow-ups
Three small fixes from the self-review of #2209:

1. **Required: concurrency group.** Two pushes to main in quick
   succession (manual UI merge then auto-promote-staging's ff-push,
   or any back-to-back main pushes) would race two auto-sync runs
   against the same staging branch — second `git push origin staging`
   fails non-fast-forward, surfacing as a red CI alert for what should
   be a no-op. Add `concurrency: { group: auto-sync-main-to-staging,
   cancel-in-progress: false }` so the second run waits for the first
   and sees its result.

2. **Hygiene: `git merge --abort` on conflict.** The conflict-error
   path exits 1 with the work tree in a half-merged state. Doesn't
   affect future runs (each gets a fresh checkout) but is an
   unpleasant artifact for anyone who shells into the runner. Abort
   first, then exit.

3. **Doc accuracy: "Loop safety" comment.** The original said the
   chain terminates because "main is either a no-op or advances
   further." That's true but understates the actual safety: GitHub
   Actions explicitly does NOT trigger downstream workflow runs from
   `GITHUB_TOKEN`-authored pushes. So the loop is impossible by
   construction, not just by happy coincidence of ref state. Updated
   the comment to reflect the actual mechanism.

Plus a step-name nit: "Fast-forward staging → main" reads as if main
is the target. Renamed to "Fast-forward staging to main" for
consistency with the workflow's name (main → staging).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 14:59:23 -07:00
Hongming Wang
c59715e143 feat(ci): auto-sync main → staging to keep staging-as-superset invariant
Background

`auto-promote-staging.yml` advances main via `git merge --ff-only`
+ `git push origin main` — clean fast-forward, no merge commit. But
manual `staging → main` merges via the GitHub UI / API create a merge
commit on main that staging doesn't have. The next `staging → main`
PR then evaluates as "BEHIND" because staging is missing that merge
commit, requiring a manual `gh pr update-branch` round-trip.

This pattern bit twice on 2026-04-28 (PRs #2202 and #2205, both
manual bridges to land pipeline fixes themselves). Each needed
update-branch + re-CI before they could merge. Annoying and
avoidable.

What this workflow does

Triggered on every push to main (regardless of source: auto-promote,
UI merge, API merge, direct push):

  1. Check whether main is already in staging's ancestry. If yes,
     no-op — auto-promote-staging keeps them aligned via ff push,
     and the no-op case is the steady state.

  2. If not (manual merge commit on main, or direct main hotfix):
     try `git merge --ff-only origin/main` first. Works when staging
     hasn't diverged with its own commits.

  3. If ff fails (staging has its own in-flight feature work):
     `git merge --no-ff origin/main -m "chore: sync main → staging"`.
     Absorbs main's tip while keeping staging's own history.

  4. Push staging.
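The four steps reduce to a short decision chain. A dry-run sketch, with the two git checks passed in as flags so it runs standalone (the real workflow computes them from the checkout):

```shell
sync_staging() {
  # $1: is main already in staging's ancestry?
  #     (real check: git merge-base --is-ancestor origin/main origin/staging)
  # $2: would --ff-only succeed? (staging has no commits of its own)
  if [ "$1" = yes ]; then
    echo "no-op: staging already contains main"
    return 0
  fi
  if [ "$2" = yes ]; then
    echo "+ git merge --ff-only origin/main"
  else
    echo "+ git merge --no-ff origin/main -m 'chore: sync main -> staging'"
  fi
  echo "+ git push origin staging"
}
```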

Loop safety

Pushing the synced staging triggers auto-promote-staging.yml, which
checks gates on staging's new tip and, if green, ff-pushes staging
to main. Since staging now ⊇ main, the resulting push to main is
either a no-op (no ref change → no push event fires → auto-sync
doesn't re-trigger) or advances main further. In the latter case
auto-sync fires once more, sees main already in staging's ancestry,
no-ops. Bounded.

Conflict handling

If the merge step hits conflicts (staging and main diverged with
incompatible changes), the workflow fails with a clear summary
pointing to manual resolution. This shouldn't happen in practice —
staging is the integration branch; conflicts indicate a direct main
hotfix touching the same code as in-flight staging work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 14:43:43 -07:00
11a38a0ad4
Merge pull request #2207 from Molecule-AI/fix/secret-scan-printf-and-wordsplit
fix(ci): printf format-string sink + filename word-split in secret-scan
2026-04-28 21:11:32 +00:00
Hongming Wang
2c8792d3e0 fix(ci): printf format-string sink + filename word-split in secret-scan
Two latent bash bugs in the canonical secret-scan workflow caught
during the post-merge review of molecule-controlplane #301 (a
private consumer that inlined this workflow's logic and got both
fixes there). Same bugs apply here; fixing in canonical means every
public consumer (gh-identity, github-app-auth, the 8 workspace
template repos) inherits the fix on their next workflow_call.

Bug 1: `printf "$OFFENDING"` is a format-string sink.

  OFFENDING is built from filenames: `${f} (matched: ${pattern})\n`.
  When passed to printf as the first argument, `%` characters in a
  filename are interpreted as conversion specifiers — corrupting the
  error message or printing `%(missing)` artifacts. No filename in
  the current tree triggers it, but a future test fixture, build
  artifact, or contributor-supplied path could.

  Fix: `printf '%b' "$OFFENDING"` interprets the literal `\n` we
  appended without treating OFFENDING as a format string.
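The sink and the fix can be reproduced in isolation (the filename below is hypothetical — any `%` in a matched path triggers the bug):

```shell
# Hypothetical repro of the sink. OFFENDING mimics the workflow's
# "${f} (matched: ${pattern})\n" accumulation, with a '%' in the filename.
OFFENDING='report-50%s.html (matched: key-pattern)\n'

# Buggy: OFFENDING is the format string, so %s is consumed as a
# conversion specifier and the filename is silently corrupted.
bad=$(printf "$OFFENDING")

# Fixed: '%b' expands the appended \n but treats OFFENDING as data.
good=$(printf '%b' "$OFFENDING")

echo "bad:  $bad"
echo "good: $good"
```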

Bug 2: `for f in $CHANGED` word-splits on whitespace.

  Filenames containing spaces would split into multiple tokens. The
  self-exclude check (`[ "$f" = "$SELF" ] && continue`) and the diff
  lookup would both operate on partial-path tokens. No filename in
  the current tree has whitespace, but the failure would be silent
  if one ever did.

  Fix: `while IFS= read -r f; do ... done <<< "$CHANGED"` reads
  whole lines as filenames. Added `[ -z "$f" ] && continue` to
  match the original `for` loop's implicit empty-input skip.
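A minimal side-by-side of the two loop shapes (the file list is hypothetical):

```shell
# Two newline-separated filenames, the second containing a space.
CHANGED='docs/readme.md
tests/fix ture.txt'

# Buggy: unquoted expansion word-splits "tests/fix ture.txt" in two.
count_for=0
for f in $CHANGED; do count_for=$((count_for + 1)); done

# Fixed: read whole lines; skip empties like the original loop did.
count_read=0
while IFS= read -r f; do
  [ -z "$f" ] && continue
  count_read=$((count_read + 1))
done <<< "$CHANGED"

echo "for-loop tokens: $count_for   while-read lines: $count_read"
```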

Both fixes are mechanically straightforward (~16 lines net diff,
mostly comments documenting the why). No behavior change for
filenames in the current tree; strictly better for the edge cases.

The same fixes already shipped in molecule-controlplane via #301
which inlined a copy of this workflow. The runtime's bundled
pre-commit hook (molecule-ai-workspace-runtime:
molecule_runtime/scripts/pre-commit-checks.sh) likely has the same
bugs — flagged as a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 14:02:50 -07:00
Hongming Wang
9d4ab7b1a2 feat(ci): auto-promote-on-e2e — retag :latest on green E2E Staging SaaS
Closes the final gap in the SaaS pipeline. After auto-promote-staging
fast-forwards main, publish-workspace-server-image builds new
`:staging-<sha>` images, but `:latest` (what prod tenants pull) only
moves on either a manual `promote-latest.yml` dispatch or a canary-
verify retag (gated on Phase 2 fleet that doesn't exist).

This workflow closes that gap by retagging
`platform:staging-<sha>` + `platform-tenant:staging-<sha>` → `:latest`
whenever E2E Staging SaaS passes for a `main` push. Uses crane
(no Docker daemon needed). Verifies both images exist before retagging
either, so a half-published state is impossible.
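A sketch of that verify-then-retag ordering. Image names and the SHA are illustrative, and crane is stubbed here so the control flow runs anywhere; in CI the real crane binary takes its place:

```shell
# Stub: print instead of hitting GHCR. Delete this line to run for real.
crane() { echo "crane $*"; }

SHA=deadbeef
REPO=ghcr.io/molecule-ai

# Verify BOTH images exist before retagging EITHER — no tag moves
# until both staging digests resolve, so a half-published state is
# structurally impossible.
for img in platform platform-tenant; do
  crane digest "${REPO}/${img}:staging-${SHA}" >/dev/null || exit 1
done

retags=0
for img in platform platform-tenant; do
  crane tag "${REPO}/${img}:staging-${SHA}" latest
  retags=$((retags + 1))
done
echo "retagged ${retags} images to :latest"
```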

Why trigger only on `main` (not staging):
  - `:latest` is what prod tenants pull. Only SHAs that have reached
    `main` (via auto-promote-staging) should advance `:latest`.
  - Triggering on staging would let a staging-only revert advance
    `:latest` to a SHA that never reaches `main`, breaking the
    invariant "production runs what's on `main`".

Why a separate workflow rather than folding into e2e-staging-saas.yml:
  - Test concerns and release concerns separate.
  - Disabling promote during an incident is one workflow toggle, not
    an edit to the long E2E file.
  - When Phase 2 canary work eventually lands, the canary path can
    replace this trigger without touching the E2E workflow.

Doc-aligned: per molecule-controlplane/docs/canary-tenants.md,
"green staging E2E → :latest" is the recommended approach for the
current scale (≤20 paying tenants); canary fleet is deferred until
blast radius grows.

The pipeline after this lands is fully self-healing:
  staging push → 4 gates green → auto-promote fast-forwards main
   → publish-workspace-server-image → E2E Staging SaaS
   → THIS WORKFLOW retags :latest → tenant fleet auto-pulls in 5 min
                                    (or redeploy-tenants-on-main fans out faster)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 13:58:41 -07:00
Hongming Wang
17018745d0 fix(ci): auto-promote gate-check uses workflow file paths, not names
Observed 2026-04-28: auto-promote ran for staging head 96955f7b with
all gates actually green (verified via /commits/<sha>/check-runs API)
yet `check-all-gates-green` reported `CodeQL → missing/none` and
aborted. Same SHA was promotable; auto-promote couldn't see it.

Cause: `gh run list --workflow="CodeQL"` matched two workflows in
this repo:

  - codeql.yml (explicit, scans both staging and main)
  - codeql       (GitHub UI-configured Code-quality default setup,
                  internal, scans default branch only)

gh CLI rejects ambiguous `--workflow=<name>` lookups and returns no
result → the gate fell through to `missing/none` and ALL_GREEN was
set false. Every staging push since both names existed has been
silently dead-locked.

Fix: switch GATES from display-name strings to workflow file paths.
File paths are the unique identifier for a workflow file in
.github/workflows/; display names are decoration and can collide.
The same `gh run list --workflow=<file.yml>` query that fails on
"CodeQL" succeeds on "codeql.yml" because the file path resolves
unambiguously.
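The fixed lookup, sketched with real gh flags (the jq projection is illustrative, not the workflow's exact query):

```shell
# Query the gate by workflow FILE PATH — unique within .github/workflows/ —
# instead of display name, which a UI-configured default setup can shadow.
gh run list --workflow=codeql.yml --branch=staging --limit 1 \
  --json status,conclusion \
  --jq '.[0] | "\(.status)/\(.conclusion)"'
```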

No behavior change for the other three gates (CI, E2E Canvas, E2E
API Smoke) since their names didn't collide — they keep working,
they just identify by ci.yml / e2e-staging-canvas.yml / e2e-api.yml
now. The log line shape changes from `CI → completed/success` to
`ci.yml → completed/success` which is fine for ops grep.

When adding/removing a gate going forward: file paths only. Keep
branch-protection required-checks (check-run display names) in
sync as a separate manual step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 13:15:13 -07:00
Hongming Wang
31d25b5a74 fix(ci): e2e gates always emit a result so auto-promote can read it
The auto-promote-staging.yml gate-check (line 99) treats "workflow
didn't run" as failure. Path-filtered triggers on E2E API Smoke Test
and E2E Staging Canvas meant a platform-only or test-only push to
staging — say, the prior PR #2201 which only touched
tests/e2e/test_staging_full_saas.sh — never triggered the canvas
workflow, and auto-promote saw `missing/none`, marked all_green=false,
and aborted. Same class for any push that doesn't touch the gate's
watched paths. Dead-lock by design, never noticed because the gate
was new.

Fix per Design B (always-run + fast-skip):

- Drop `paths:` from the push/pull_request triggers on both gate
  workflows. The workflow now always fires on every staging+main
  push/PR.
- Add a `detect-changes` job using `dorny/paths-filter@v3` that
  decides whether to do real work, scoped to the same paths the
  trigger filter used to watch.
- Real work job (e2e-api / playwright) gates on
  `needs: detect-changes; if: needs.detect-changes.outputs.X == 'true'`.
- Add a sibling `no-op` job that runs when the filter output is
  false, emitting `::notice::… no-op pass`. The workflow run's
  conclusion is `success` either way — auto-promote sees green and
  proceeds.

Manual `workflow_dispatch` and the weekly canvas `schedule` short-
circuit detect-changes to always-run — those triggers exist precisely
to exercise the suite and shouldn't be silently no-op'd.

Why this approach over making auto-promote-staging smarter:

The alternative (Design A, considered + rejected) was to teach
auto-promote-staging to read each gate's `paths:` filter and treat
"no run because filter excluded the commit" as conditional pass.
That couples auto-promote to other workflows' YAML schema and breaks
silently if a gate is renamed or its filter changes. Design B keeps
the auto-promote contract simple ("each gate emits success") and
makes each gate self-describing — adding a new gate doesn't require
touching auto-promote.

Cost: ~10-30s of runner overhead per gate per push for the no-op when
paths don't match. Negligible vs the alternative of dead-locked
auto-promote chains.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 12:43:26 -07:00
Hongming Wang
e7eeeb4f59
Merge pull request #2199 from Molecule-AI/fix/pin-compat-narrow-pypi-job-trigger
ci(pin-compat): split into two workflows so each gets a narrow paths filter
2026-04-28 18:20:48 +00:00
Hongming Wang
a089712cef feat(cascade): verify wheel content sha256 against just-built dist
Closes #132. Extends the cascade propagation probe (added in #2197
and clarified in #2198) with a content-integrity check.

The previous probe verified pip can RESOLVE the version we just
published (catches surface 1+2 propagation lag — metadata + simple
index). It did NOT verify pip can DOWNLOAD bytes that match what we
uploaded — leaving a window where a Fastly stale-content scenario
(rare but PyPI has had it: e.g. 2026-04-01 incident where a CDN node
served a previous version's wheel under the new version's URL for
~90s after upload) would pass the probe and ship corrupt builds to
all 8 receiver templates.

Two-stage check, both must pass before the cascade fans out:

  (a) `pip install --no-cache-dir PACKAGE==VERSION` succeeds —
      version is resolvable. (Existing, unchanged.)

  (b) `pip download` of the same wheel + `sha256sum` matches the
      hash captured pre-upload from `dist/*.whl`. (New.)

Captured BEFORE upload via a new `wheel_hash` step that exposes
`steps.wheel_hash.outputs.wheel_sha256`, bubbled up as
`needs.publish.outputs.wheel_sha256`, and consumed by the cascade
probe via the EXPECTED_SHA256 env var.

`pip download` is the right primitive: it writes the actual .whl
file (vs `pip install` which unpacks and discards), so we can
sha256sum it directly. Combined with --no-cache-dir + a wiped
/tmp/probe-dl per poll, every poll re-fetches from the live Fastly
edge — no local-cache mask.
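Stage (b) can be sketched as follows, assuming PACKAGE, VERSION, and EXPECTED_SHA256 arrive via env from the publish job's outputs as described above:

```shell
# Wipe the download dir per poll so every fetch hits the live CDN edge.
rm -rf /tmp/probe-dl && mkdir -p /tmp/probe-dl

# pip download writes the actual .whl (pip install unpacks and discards),
# so we can hash the exact bytes Fastly served.
pip download --no-cache-dir --no-deps \
  --dest /tmp/probe-dl "${PACKAGE}==${VERSION}"

GOT=$(sha256sum /tmp/probe-dl/*.whl | awk '{print $1}')
if [ "$GOT" != "$EXPECTED_SHA256" ]; then
  echo "stale CDN bytes: got ${GOT}, expected ${EXPECTED_SHA256}" >&2
  exit 1
fi
```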

Per-poll cost: ~3-5s pip install + ~3s pip download + 4s sleep.
30-poll budget = ~5-6 min wall on a slow runner (vs the previous
~4-5 min for resolve-only). Well within the cascade's tolerance for
a known-rare CDN issue, and the overwhelming-common case (Fastly
serves matching bytes immediately) exits on the first poll.

Verified locally: pip download of the current PyPI-latest
(molecule-ai-workspace-runtime 0.1.29) produced
sha256=7e782b2d50812257…, exactly matching PyPI's own metadata
endpoint. The mismatch path is exercised inline (different builds
of the same version produce different hashes by definition — the
build_runtime_package.py output is timestamp-deterministic only
within a single CI invocation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 10:53:50 -07:00
Hongming Wang
a8f59f5fc2 ci(pin-compat): split into two workflows so each gets a narrow paths filter
Closes #134. The post-merge review of #2196 flagged that the combined
workflow's `paths:` filter (the union of both jobs' needs:
`workspace/**` + `scripts/build_runtime_package.py` + the workflow
itself) caused the `pypi-latest-install` job to fire on every
doc-only / adapter-only / unrelated workspace/ edit. The PyPI artifact
that job tests against can't change based on our workspace/ source —
only on actual PyPI publishes — so those runs add noise without
information.

Splits the previously-merged combined workflow:

  runtime-pin-compat.yml (kept):
    - PyPI-latest install + import smoke (was: pypi-latest-install)
    - Narrow `paths:` filter — only fires when workspace/requirements.txt
      or this workflow file changes
    - Cron-driven daily for upstream-yank detection (unchanged)

  runtime-prbuild-compat.yml (new):
    - PR-built wheel + import smoke (was: local-build-install)
    - Broad `paths:` filter — fires on any workspace/ source change,
      scripts/build_runtime_package.py, or this workflow file
    - No cron (workspace/ doesn't change between firings)

Content behavior is identical to before; only the trigger surface is
narrower per job. Each workflow's name is its own status check, so
branch protection (which currently lists neither as required) can
gate them independently in future.

The prior comment in the combined file explicitly acknowledged the
asymmetry and proposed this split as a follow-up; this is that
follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 10:50:09 -07:00
Hongming Wang
e6ce54006d ci(publish-runtime): use pip-resolve probe to bound cascade fan-out
The cascade's PyPI-propagation gate polled `/pypi/<pkg>/<ver>/json`,
which is one of THREE surfaces pip touches when resolving an install:

  1. /pypi/<pkg>/<ver>/json    — metadata endpoint (the old check)
  2. /simple/<pkg>/             — pip's primary download index
  3. files.pythonhosted.org     — CDN-fronted wheel binary

Each has its own cache. Any one of them can lag behind the others,
and the previous gate would let the cascade fire while (2) or (3)
still served the previous version. Downstream `pip install` in the
template repos then resolved to the OLD wheel, the docker layer
cache locked that stale resolution in, and subsequent rebuilds kept
shipping the old runtime — the "five times in one night" cache trap
referenced in the prior comment.

Replace the metadata-only poll with an actual `pip install
--no-cache-dir --force-reinstall --no-deps PACKAGE==VERSION` from
a fresh venv. If pip can resolve and install the exact version we
just published, every receiver template will too — pip itself is
the ground truth for what the receivers will see, no proxy guessing
about which surface is lagging.

  - Venv created once outside the loop; only `pip install` runs in
    the poll body.
  - --no-cache-dir + --force-reinstall ensures every poll hits the
    live PyPI surfaces (no local-cache mask).
  - --no-deps keeps each poll fast — we only care about resolving
    THIS package, not its dep tree.
  - Loop budget: 30 attempts × 4s ≈ 2 min (vs prior 30 × 2s = 60s).
    Generous vs typical PyPI propagation, surfaces real upstream
    issues past the budget.
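The probe loop, sketched from the bullets above (PACKAGE and VERSION are assumed to come from the publish job's outputs):

```shell
# Venv created ONCE, outside the loop; only pip install runs per poll.
python3 -m venv /tmp/probe-venv
PIP=/tmp/probe-venv/bin/pip

for attempt in $(seq 1 30); do
  # pip itself is the ground truth: if this resolves and installs,
  # every receiver template's pip install will too.
  if "$PIP" install --no-cache-dir --force-reinstall --no-deps \
       "${PACKAGE}==${VERSION}" >/dev/null 2>&1; then
    echo "resolvable after ${attempt} poll(s)"; exit 0
  fi
  sleep 4
done
echo "not resolvable within the 30x4s budget" >&2; exit 1
```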

Verified locally:
  - Probing a non-existent version (0.1.999999) → pip exits 1, loop
    retries.
  - Probing the current PyPI-latest → pip exits 0, `pip show`
    returns the version, loop succeeds.

Closes #130.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 18:16:33 -07:00
Hongming Wang
7484e6fbec
Merge pull request #2196 from Molecule-AI/fix/runtime-pin-compat-test-pr-artifact
ci(runtime-pin-compat): test the PR-built wheel, not PyPI-latest
2026-04-28 00:42:02 +00:00
Hongming Wang
7065579967 ci(runtime-pin-compat): test the PR-built wheel, not the PyPI-latest one
Closes #128's chicken-and-egg. The original gate installed the
CURRENTLY-PUBLISHED molecule-ai-workspace-runtime from PyPI, then
overlaid workspace/requirements.txt, then smoke-imported. That
catches problems with the already-shipped artifact (the daily-cron
upstream-yank case), but it cannot catch problems introduced by the
PR itself: the imports it exercises are from the OLD wheel, not the
PR's source. A PR that adds `from a2a.utils.foo import bar` (where
`bar` is added in a2a-sdk 1.5 and the runtime currently pins 1.3)
slips through:
  1. Pip resolves the existing PyPI wheel + a2a-sdk 1.3.
  2. Smoke imports the OLD main.py — no reference to `bar` → green.
  3. Merge → publish-runtime.yml ships a wheel WITH the new import.
  4. Tenant images redeploy → all crash on first boot with
     ImportError: cannot import name 'bar' from 'a2a.utils.foo'.

Splits the workflow into two jobs:

  - pypi-latest-install (renamed from default-install): unchanged
    behavior. Runs on the daily cron and on requirements.txt /
    workflow edits. Catches upstream PyPI yanks + the
    already-shipped artifact going stale.

  - local-build-install (new): runs scripts/build_runtime_package.py
    on the PR's workspace/, builds the wheel with python -m build
    (mirroring publish-runtime.yml byte-for-byte), installs that
    wheel, then runs the same smoke import. Tests the artifact
    that WOULD be published if this PR merges.

Path filter widened to workspace/** so any runtime-source change
triggers the local-build job. The pypi-latest job's filter is the
same union; its internal logic is unchanged so the daily-cron and
upstream-detection use cases continue to work.

Verified locally: built the wheel from current workspace/ source via
the same script + python -m build invocation, installed into a fresh
venv, imported from molecule_runtime.main import main_sync
successfully.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 17:39:00 -07:00
Hongming Wang
1b0fab674b ci(publish-runtime): smoke well-known mount alignment + message helper
The existing wheel-smoke catches AgentCard kwarg-shape regressions
(state_transition_history, supported_protocols) but doesn't catch the
SDK-contract drift class that #2193 just fixed in production: the
a2a-sdk 1.x rename of /.well-known/agent.json →
/.well-known/agent-card.json, plus AGENT_CARD_WELL_KNOWN_PATH moving
to a2a.utils.constants. main.py's readiness probe hardcoded the old
literal and 404'd every attempt, silently dropping every workspace's
initial_prompt for weeks before a user reported it.

Two additions to the smoke block:

  1. Mount alignment: build an AgentCard, call create_agent_card_routes(),
     and assert AGENT_CARD_WELL_KNOWN_PATH is among the mounted paths.
     Catches a future SDK release that decouples the constant value
     from the route factory's mount path. The source-tree test
     (workspace/tests/test_agent_card_well_known_path.py) catches the
     main.py side; this catches the SDK side BEFORE PyPI upload.

  2. Message helper smoke: import a2a.helpers.new_text_message and
     instantiate one. The v0→v1 cheat sheet (memory:
     reference_a2a_sdk_v0_to_v1_migration.md) flagged this as a real
     migration find — main.py and a2a_executor.py call it in hot
     paths, so an import break errors every reply before the message
     even leaves the workspace.

Verified by running the equivalent Python inside
ghcr.io/molecule-ai/workspace-template-langgraph:latest:
  ✓ well-known mount alignment OK (/.well-known/agent-card.json)
  ✓ message helper import + call OK

Closes the structural-fix half of the #2193 finding from the code-
review-and-quality pass: "the wheel publish smoke didn't catch this.
This is the 7th a2a-sdk migration find of this kind. Task #131 is the
right root-cause fix."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 17:34:12 -07:00
dccec657d6
Merge branch 'staging' into ci/cicd-review-quick-wins 2026-04-27 13:27:16 -07:00
Hongming Wang
5920fc856d
Merge pull request #2182 from Molecule-AI/ci/agentcard-smoke-followup-2179
fix(workspace): rename supported_protocols → supported_interfaces (CRITICAL — every boot crashes)
2026-04-27 14:58:28 +00:00
Hongming Wang
851fd21fb1 fix(workspace): rename supported_protocols → supported_interfaces (a2a-sdk 1.0)
CRITICAL: every workspace boot since the a2a-sdk 1.0 migration (#1974)
has been crashing at AgentCard construction with:
  ValueError: Protocol message AgentCard has no "supported_protocols" field

The protobuf field is `supported_interfaces` (plural, interfaces — see
a2a-sdk types/a2a_pb2.pyi:189). The 0.3→1.0 migration left the kwarg
as `supported_protocols`, which doesn't exist in the 1.0 schema, so
the constructor raises before any subsequent line of main runs.

Why this hid for so long:
  - publish-runtime.yml's smoke step only IMPORTED molecule_runtime.main;
    importing the module is fine, only CONSTRUCTING the AgentCard fails
  - The user-visible symptom is "Workspace failed: " with empty
    last_sample_error, indistinguishable from generic boot timeouts
  - The state_transition_history=True bug (fixed in #2179) was a
    sibling of this — same migration, same class, just caught first

Fix is symmetric with #2179:
  1. workspace/main.py: rename the kwarg + comment explaining why
  2. .github/workflows/publish-runtime.yml: extend the smoke block to
     instantiate AgentCard with the exact production call shape, so
     the next field-rename of this class fails at publish time
     instead of breaking every workspace startup

Verification:
  - Constructed AgentCard against fresh a2a-sdk 1.0.2 in a clean
    venv with the corrected kwarg → succeeds
  - Constructed it with the original `supported_protocols` kwarg →
    fails immediately with the exact error production sees
  - Smoke test pinned to mirror main.py's exact call shape; main.py
    + smoke must stay in lockstep going forward

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 07:54:23 -07:00
Hongming Wang
1a703f5687 fix(publish-runtime): wait for PyPI propagation + expand path filter
Two structural fixes for the cascade race conditions that bit us
five times today:

1. **PyPI propagation wait** (cascade job): poll PyPI for the
   just-published version with a 60s budget BEFORE firing
   repository_dispatch. PyPI accepts the upload but takes a few
   seconds to make it available via the package index. Cascade was
   firing too fast — downstream template builds ran `pip install`
   against a stale index, resolved to the previous version, and
   docker layer cache locked that in for subsequent rebuilds.
   Pairs with the build-arg cache invalidation in the molecule-ci PR
   (separate change). Wait without invalidation = the cached pip layer
   still serves the stale resolution on rebuilds. Invalidation without
   wait = first cascade build may still race PyPI propagation.
   Together: no race, no stale cache.

2. **Path filter expansion**: scripts/build_runtime_package.py is
   the build script and changes to it (e.g. import-rewrite fixes,
   manifest emit, lib/ subpackage move) directly affect what ships
   in the wheel. Was missing from the path filter, so PRs touching
   only scripts/ (like #2174's lib/ fix) didn't auto-publish — the
   operator had to remember a manual dispatch. Add it to the closed
   list of files that trigger auto-publish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 07:42:37 -07:00
Hongming Wang
82b366fce5 ci: add pr-guards caller that disables auto-merge on push
Thin caller for molecule-ci's reusable disable-auto-merge-on-push
workflow. Forces operator re-engagement when a commit is pushed to
an open PR with auto-merge already enabled.

Pairs with the org-wide "Automatically delete head branches" repo
setting (also enabled today). Defense in depth:

1. Repo setting blocks pushes to a merged-and-deleted branch
   (post-merge orphan case — what bit #2174 today: my second
   commit landed on an already-merged-and-deleted branch).
2. This workflow catches in-queue races (push lands while the
   merge queue is processing) by disabling auto-merge so the
   operator must explicitly re-engage.

Together they cover the full lifecycle of "auto-merge enabled →
new commits arrive" without relying on operator discipline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 06:39:31 -07:00
Hongming Wang
3df5867b56 fix: restore main_sync entry point in workspace/main.py
The wheel's pyproject.toml has declared
`molecule-runtime = "molecule_runtime.main:main_sync"` since the
publish pipeline was created on 2026-04-26, but the function
itself was never present in workspace/main.py — it lived in the
pre-monorepo molecule-ai-workspace-runtime repo and was lost
during the consolidation that made workspace/ the source of truth.

The 0.1.15 wheel still had main_sync from a leftover snapshot,
so the regression went unnoticed until 0.1.16 (the first wheel
built from the new source-of-truth) shipped. Symptom: every
workspace container restart loops with

  ImportError: cannot import name 'main_sync' from 'molecule_runtime.main'

— the molecule-runtime CLI script's first line tries to import
the missing symbol. Workspaces stay in `provisioning` until the
10-min sweep marks them failed.

Caught by .github/workflows/runtime-pin-compat.yml, which already
imports the symbol by name as its smoke test. (That check kept
failing red on every recent merge_group run; this PR fixes the
underlying symbol-not-found instead of the smoke step.)

Also strengthens publish-runtime.yml's wheel smoke from
`import molecule_runtime.main` (loads the module — passes even
when entry-point target is missing) to `from molecule_runtime.main
import main_sync` (the actual contract the CLI script needs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 03:35:49 -07:00
Hongming Wang
c68dc1877f fix(release): drift-gate TOP_LEVEL_MODULES + smoke-import main in publish
Two compounding bugs surfaced when 0.1.16 hit production today:

1. scripts/build_runtime_package.py had a hand-curated TOP_LEVEL_MODULES
   set listing every workspace/*.py that should get its bare imports
   rewritten to `molecule_runtime.X`. The set silently went stale:
   - Missing: transcript_auth (added since #87 phase 1c), runtime_wedge,
     watcher → unrewritten imports shipped, every workspace startup
     died with ModuleNotFoundError.
   - Stale: claude_sdk_executor, cli_executor (both removed in #87),
     hermes_executor (never existed) → harmless but misleading.

2. publish-runtime.yml's wheel-smoke step asserted on stable invariants
   (BaseAdapter, AdapterConfig, a2a_client error sentinel) but never
   imported main. So even though main.py held the broken bare
   `from transcript_auth import ...`, the smoke check passed.

Fixes:

- Build script now derives the on-disk module set from workspace/*.py
  and asserts it matches TOP_LEVEL_MODULES exactly. Drift in either
  direction fails the build with a specific diff message instead of
  shipping a broken wheel. Closed-list typo guard preserved (we still
  edit the set explicitly when a module is added/removed) — the gate
  just makes drift impossible to ignore.

- TOP_LEVEL_MODULES updated to current reality: drop the 3 stale,
  add the 3 missing.

- publish-runtime.yml wheel-smoke now `import molecule_runtime.main`
  before the invariant asserts. main is the entry point and
  transitively imports every module — any bare-import bug surfaces
  as ModuleNotFoundError before PyPI accepts the upload.
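The drift gate from the first fix, as a self-contained sketch — module names are illustrative, and shell stands in for the Python build script:

```shell
# Fabricate an on-disk workspace/ with three modules.
WS=$(mktemp -d)
touch "$WS"/main.py "$WS"/watcher.py "$WS"/transcript_auth.py

# The hand-curated closed list (sorted, newline-separated).
DECLARED=$'main\ntranscript_auth\nwatcher'

# Derive the on-disk module set from workspace/*.py.
ON_DISK=$(for f in "$WS"/*.py; do basename "${f%.py}"; done | sort)

# Drift in EITHER direction fails with a specific diff instead of
# shipping a broken wheel.
if [ "$ON_DISK" != "$DECLARED" ]; then
  echo "TOP_LEVEL_MODULES drift:"
  diff <(echo "$DECLARED") <(echo "$ON_DISK")
  exit 1
fi
echo "module set in sync"
```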

Tested locally: `python3 scripts/build_runtime_package.py
--version 0.1.99 --out /tmp/build-test` succeeds, and
/tmp/build-test/molecule_runtime/main.py contains the rewritten
`from molecule_runtime.transcript_auth import ...`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 03:19:17 -07:00
Hongming Wang
0a455b7d71 feat(publish-runtime): auto-publish to PyPI on staging pushes that touch workspace/
Adds a third trigger so any merge to staging that changes workspace/**
auto-publishes a new molecule-ai-workspace-runtime patch release. Closes
the human-in-loop gap that caused tonight's RuntimeCapabilities
ImportError outage.

Tonight: #117 added RuntimeCapabilities to molecule_runtime.adapters.base.
The merge landed at 02:37 UTC. Templates rebuilt their images at 07:37
UTC (4 hours later) and started importing the new symbol. PyPI was
still serving 0.1.15 (pre-#117) because nobody remembered to push a
runtime-vX.Y.Z tag or workflow_dispatch the publish. Result: every
template image shipped tonight runs `from molecule_runtime.adapters.base
import RuntimeCapabilities` against an installed runtime that doesn't
export it -> ImportError -> workspace never registers -> stuck in
provisioning until 10-min sweep.

Mechanism:
- New trigger: push to staging filtered to paths: ['workspace/**'].
  Path filter applies only to branch pushes; the existing tag trigger
  still fires unconditionally.
- Version derivation for the auto case: query PyPI's JSON API for
  current latest, bump the patch component. PyPI is the source of
  truth so concurrent runs don't double-publish (HTTP 400 on collision).
- concurrency: group serializes parallel staging merges so they don't
  race on the bump computation. cancel-in-progress: false because each
  workspace/** change deserves its own release.
- publish job now exposes its derived version as a job-level output so
  the cascade reads it cleanly. Fixes a latent bug: cascade tried to
  read steps.version.outputs.version, which is from a different job's
  scope and silently resolved to empty -- then re-derived from
  GITHUB_REF_NAME, which would have been "staging" under the new
  trigger and produced an invalid version.
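The patch-bump derivation can be sketched as below; in the workflow LATEST comes from PyPI's JSON API (the source of truth), hardcoded here so the arithmetic is visible:

```shell
PKG=molecule-ai-workspace-runtime
# In CI: LATEST=$(curl -fsSL "https://pypi.org/pypi/${PKG}/json" | jq -r '.info.version')
LATEST=0.1.29

# Strip the patch component, re-append it incremented: 0.1.29 -> 0.1.30.
NEXT="${LATEST%.*}.$(( ${LATEST##*.} + 1 ))"
echo "publishing ${PKG} ${LATEST} -> ${NEXT}"
```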

Tag-driven and manual-dispatch paths are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 02:11:45 -07:00
Hongming Wang
c1e9aa7461
Merge pull request #2153 from Molecule-AI/fix/block-internal-paths-shallow-clone-bug
fix(ci): block-internal-paths handle merge_group + shallow-clone BASE
2026-04-27 06:58:32 +00:00
Hongming Wang
7ac7a010fa fix(ci): block-internal-paths handle merge_group + shallow-clone BASE
[Molecule-Platform-Evolvement-Manager]

## What was broken

Same bug class as the secret-scan.yml fix in #2120 — block-internal-paths
hit `fatal: bad object <sha>` exit 128 on the staging push at
2026-04-27 06:50:33Z.

Two cases:

1. **`merge_group` events**: BASE/HEAD came from
   `github.event.before` / `.after` which are push-event-only
   properties. On merge_group both came back empty, the script fell
   through to "scan entire tree" mode which is correct but
   inefficient. Worse, when this workflow is required for the merge
   queue (line 21-22), an empty-BASE entire-tree scan would run on
   every queue check.

2. **`push` events with shallow clones**: `fetch-depth: 2` doesn't
   always cover BASE across true merge commits. When BASE is in the
   payload but absent from the local object DB, `git diff` errors out
   with `fatal: bad object <sha>` and the job exits 128. This is what
   broke today's staging push.

## Fix

Same shape as the secret-scan.yml fix (#2120):

- Add a dedicated `git fetch` step for `merge_group.base_sha`.
- Move event-specific SHAs into a step `env:` block; script uses a
  `case` over `${{ github.event_name }}` covering pull_request /
  merge_group / push (rather than `if pull_request / else push`
  which left merge_group on the empty-BASE branch).
- On-demand fetch + `git cat-file -e` guard for push BASE so a SHA
  that's payload-present-but-DB-absent triggers the fetch, and a
  fetch failure falls through cleanly to "scan entire tree" instead
  of exiting 128.
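A sketch of the per-event SHA selection plus the on-demand fetch guard; the `*_SHA` variable names are assumed stand-ins for what the step's `env:` block wires from the event payload:

```shell
case "${EVENT_NAME:-}" in
  pull_request) BASE="${PR_BASE_SHA:-}"; HEAD_SHA="${PR_HEAD_SHA:-}" ;;
  merge_group)  BASE="${MQ_BASE_SHA:-}"; HEAD_SHA="${MQ_HEAD_SHA:-}" ;;
  push)         BASE="${PUSH_BEFORE:-}"; HEAD_SHA="${PUSH_AFTER:-}" ;;
  *)            BASE="";                 HEAD_SHA="HEAD" ;;
esac

# Payload-present but object-DB-absent BASE: fetch on demand; if the
# fetch also fails, clear BASE so the script falls through to the
# entire-tree scan instead of exiting 128 on `git diff`.
if [ -n "$BASE" ] && ! git cat-file -e "${BASE}^{commit}" 2>/dev/null; then
  git fetch --no-tags --depth=1 origin "$BASE" 2>/dev/null || BASE=""
fi
```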

## Test plan

- [x] YAML structure preserved (no schema changes)
- [x] Bash logic mirrors the secret-scan recovery path tested in #2120
- [ ] CI green on this PR's pull_request scan + push to staging post-merge

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 23:54:00 -07:00
Hongming Wang
a4b3ebf951 test(e2e): claude-code + hermes priority-runtimes happy path
Self-contained happy-path E2E for the two runtimes the project commits
to first-class support for (task #116, completes the loop on the
"both must work end-to-end with tests" requirement).

What it proves per runtime:
  1. POST /workspaces succeeds with the runtime + secrets
  2. Workspace reaches status=online within its cold-boot window
     (claude-code: 240s, hermes: 900s on cold apt + uv + sidecar)
  3. POST /a2a (message/send "Reply with PONG") returns a non-error,
     non-empty reply
  4. activity_logs row written with method=message/send and ok|error
     status (a2a_proxy.LogActivity contract)

Skip semantics: each phase independently checks for its required env
key (CLAUDE_CODE_OAUTH_TOKEN / E2E_OPENAI_API_KEY) and skips cleanly
if absent. The script always exits 0 when every phase either passed or
skipped — so wiring it into a no-keys CI job validates that the script
itself stays clean without false-failing.

Idempotent: pre-sweeps any prior "Priority E2E (claude-code)" /
"Priority E2E (hermes)" workspaces so a run interrupted by SIGPIPE /
kill -9 (which bypasses the EXIT trap) doesn't poison the next run.
Same defensive pattern as test_notify_attachments_e2e.sh.

CI wiring:
  - e2e-api.yml — runs on every PR with no LLM keys, both phases skip,
    catches script-level regressions (set -u bugs, syntax issues, etc.)
  - canary-staging.yml + e2e-staging-saas.yml already have the keys
    via secrets.MOLECULE_STAGING_OPENAI_KEY and exercise wire-real
    behavior — could be wired to opt-in if you want claude-code coverage
    there too.

Local runs (from this branch, no keys):
  === Results: 0 passed, 0 failed, 2 skipped ===

Validates the capability primitives shipped in PRs #2137-2144: once
template PRs #12 (claude-code) + #25 (hermes) merge with their
declared provides_native_session=True + idle_timeout_override=900,
a manual run with both keys validates the full native+pluggable chain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 23:48:54 -07:00
rabbitblood
b81d8e9fc5 chore(secret-scan): add sk-cp- MiniMax pattern (F1088 retroactive fix) 2026-04-26 21:43:22 -07:00
Hongming Wang
62cfc21033 test(comms): comprehensive E2E coverage for agent → user attachments
User asked to "keep optimizing and comprehensive e2e testings to prove all
works as expected" for the communication path. Adds three layers of coverage
for PR #2130 (agent → user file attachments via send_message_to_user) since
that path has the most user-visible blast radius:

1. Shell E2E (tests/e2e/test_notify_attachments_e2e.sh) — pure platform test,
   no workspace container needed. 14 assertions covering: notify text-only
   round-trip, notify-with-attachments persists parts[].kind=file in the
   shape extractFilesFromTask reads, per-element validation rejects empty
   uri/name (regression for the missing gin `dive` bug), and a real
   /chat/uploads → /notify URI round-trip when a container is up.

2. Canvas AGENT_MESSAGE handler tests (canvas-events.test.ts +5) — pin the
   WebSocket-side filtering that drops malformed attachments, allows
   attachments-only bubbles, ignores non-array payloads, and no-ops on
   pure-empty events.

3. Persisted response_body shape test (message-parser.test.ts +1) — pins
   the {result, parts} contract the chat history loader hydrates on
   reload, so refreshing after an agent attachment restores both caption
   and download chips.

Also wires the new shell E2E into e2e-api.yml so the contract regresses
in CI rather than only in manual runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 20:41:56 -07:00