Commit Graph

143 Commits

Author SHA1 Message Date
Hongming Wang
97d5883e76 fix(ci): auto-sync concurrency + cleanup follow-ups
Three small fixes from the self-review of #2209:

1. **Required: concurrency group.** Two pushes to main in quick
   succession (manual UI merge then auto-promote-staging's ff-push,
   or any back-to-back main pushes) would race two auto-sync runs
   against the same staging branch — second `git push origin staging`
   fails non-fast-forward, surfacing as a red CI alert for what should
   be a no-op. Add `concurrency: { group: auto-sync-main-to-staging,
   cancel-in-progress: false }` so the second run waits for the first
   and sees its result.

2. **Hygiene: `git merge --abort` on conflict.** The conflict-error
   path exits 1 with the work tree in a half-merged state. Doesn't
   affect future runs (each gets a fresh checkout) but is an
   unpleasant artifact for anyone who shells into the runner. Abort
   first, then exit.

3. **Doc accuracy: "Loop safety" comment.** The original said the
   chain terminates because "main is either a no-op or advances
   further." That's true but understates the actual safety: GitHub
   Actions explicitly does NOT trigger downstream workflow runs from
   `GITHUB_TOKEN`-authored pushes. So the loop is impossible by
   construction, not just by happy coincidence of ref state. Updated
   the comment to reflect the actual mechanism.

Plus a step-name nit: "Fast-forward staging → main" reads as if main
is the target. Renamed to "Fast-forward staging to main" for
consistency with the workflow's name (main → staging).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 14:59:23 -07:00
Hongming Wang
c59715e143 feat(ci): auto-sync main → staging to keep staging-as-superset invariant
Background

`auto-promote-staging.yml` advances main via `git merge --ff-only`
+ `git push origin main` — clean fast-forward, no merge commit. But
manual `staging → main` merges via the GitHub UI / API create a merge
commit on main that staging doesn't have. The next `staging → main`
PR then evaluates as "BEHIND" because staging is missing that merge
commit, requiring a manual `gh pr update-branch` round-trip.

This pattern bit twice on 2026-04-28 (PRs #2202 and #2205, both
manual bridges to land pipeline fixes themselves). Each needed
update-branch + re-CI before they could merge. Annoying and
avoidable.

What this workflow does

Triggered on every push to main (regardless of source: auto-promote,
UI merge, API merge, direct push):

  1. Check whether main is already in staging's ancestry. If yes,
     no-op — auto-promote-staging keeps them aligned via ff push,
     and the no-op case is the steady state.

  2. If not (manual merge commit on main, or direct main hotfix):
     try `git merge --ff-only origin/main` first. Works when staging
     hasn't diverged with its own commits.

  3. If ff fails (staging has its own in-flight feature work):
     `git merge --no-ff origin/main -m "chore: sync main → staging"`.
     Absorbs main's tip while keeping staging's own history.

  4. Push staging.

Loop safety

Pushing the synced staging triggers auto-promote-staging.yml, which
checks gates on staging's new tip and, if green, ff-pushes staging
to main. Since staging now ⊇ main, the resulting push to main is
either a no-op (no ref change → no push event fires → auto-sync
doesn't re-trigger) or advances main further. In the latter case
auto-sync fires once more, sees main already in staging's ancestry,
no-ops. Bounded.

Conflict handling

If the merge step hits conflicts (staging and main diverged with
incompatible changes), the workflow fails with a clear summary
pointing to manual resolution. This shouldn't happen in practice —
staging is the integration branch; conflicts indicate a direct main
hotfix touching the same code as in-flight staging work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 14:43:43 -07:00
11a38a0ad4
Merge pull request #2207 from Molecule-AI/fix/secret-scan-printf-and-wordsplit
fix(ci): printf format-string sink + filename word-split in secret-scan
2026-04-28 21:11:32 +00:00
Hongming Wang
2c8792d3e0 fix(ci): printf format-string sink + filename word-split in secret-scan
Two latent bash bugs in the canonical secret-scan workflow caught
during the post-merge review of molecule-controlplane #301 (a
private consumer that inlined this workflow's logic and got both
fixes there). Same bugs apply here; fixing in canonical means every
public consumer (gh-identity, github-app-auth, the 8 workspace
template repos) inherits the fix on their next workflow_call.

Bug 1: `printf "$OFFENDING"` is a format-string sink.

  OFFENDING is built from filenames: `${f} (matched: ${pattern})\n`.
  When passed to printf as the first argument, `%` characters in a
  filename are interpreted as conversion specifiers — corrupting the
  error message or printing `%(missing)` artifacts. No filename in
  the current tree triggers it, but a future test fixture, build
  artifact, or contributor-supplied path could.

  Fix: `printf '%b' "$OFFENDING"` interprets the literal `\n` we
  appended without treating OFFENDING as a format string.

Bug 2: `for f in $CHANGED` word-splits on whitespace.

  Filenames containing spaces would split into multiple tokens. The
  self-exclude check (`[ "$f" = "$SELF" ] && continue`) and the diff
  lookup would both operate on partial-path tokens. No filename in
  the current tree has whitespace, but the failure would be silent
  if one ever did.

  Fix: `while IFS= read -r f; do ... done <<< "$CHANGED"` reads
  whole lines as filenames. Added `[ -z "$f" ] && continue` to
  match the original `for` loop's implicit empty-input skip.

Both fixes are mechanically straightforward (~16 lines net diff,
mostly comments documenting the why). No behavior change for
filenames in the current tree; strictly better for the edge cases.

The same fixes already shipped in molecule-controlplane via #301
which inlined a copy of this workflow. The runtime's bundled
pre-commit hook (molecule-ai-workspace-runtime:
molecule_runtime/scripts/pre-commit-checks.sh) likely has the same
bugs — flagged as a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 14:02:50 -07:00
Hongming Wang
9d4ab7b1a2 feat(ci): auto-promote-on-e2e — retag :latest on green E2E Staging SaaS
Closes the final gap in the SaaS pipeline. After auto-promote-staging
fast-forwards main, publish-workspace-server-image builds new
`:staging-<sha>` images, but `:latest` (what prod tenants pull) only
moves on either a manual `promote-latest.yml` dispatch or a canary-
verify retag (gated on Phase 2 fleet that doesn't exist).

This workflow closes that gap by retagging
`platform:staging-<sha>` + `platform-tenant:staging-<sha>` → `:latest`
whenever E2E Staging SaaS passes for a `main` push. Uses crane
(no Docker daemon needed). Verifies both images exist before retagging
either, so a half-published state is impossible.

Why trigger only on `main` (not staging):
  - `:latest` is what prod tenants pull. Only SHAs that have reached
    `main` (via auto-promote-staging) should advance `:latest`.
  - Triggering on staging would let a staging-only revert advance
    `:latest` to a SHA that never reaches `main`, breaking the
    invariant "production runs what's on `main`".

Why a separate workflow rather than folding into e2e-staging-saas.yml:
  - Test concerns and release concerns separate.
  - Disabling promote during an incident is one workflow toggle, not
    an edit to the long E2E file.
  - When Phase 2 canary work eventually lands, the canary path can
    replace this trigger without touching the E2E workflow.

Doc-aligned: per molecule-controlplane/docs/canary-tenants.md,
"green staging E2E → :latest" is the recommended approach for the
current scale (≤20 paying tenants); canary fleet is deferred until
blast radius grows.

Pipeline after this lands is fully self-healing:
  staging push → 4 gates green → auto-promote fast-forwards main
   → publish-workspace-server-image → E2E Staging SaaS
   → THIS WORKFLOW retags :latest → tenant fleet auto-pulls in 5 min
                                    (or redeploy-tenants-on-main fans out faster)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 13:58:41 -07:00
Hongming Wang
17018745d0 fix(ci): auto-promote gate-check uses workflow file paths, not names
Observed 2026-04-28: auto-promote ran for staging head 96955f7b with
all gates actually green (verified via /commits/<sha>/check-runs API)
yet `check-all-gates-green` reported `CodeQL → missing/none` and
aborted. Same SHA was promotable; auto-promote couldn't see it.

Cause: `gh run list --workflow="CodeQL"` matched two workflows in
this repo:

  - codeql.yml (explicit, scans both staging and main)
  - codeql       (GitHub UI-configured Code-quality default setup,
                  internal, scans default branch only)

gh CLI rejects ambiguous `--workflow=<name>` lookups and returns no
result → the gate fell through to `missing/none` and ALL_GREEN was
set false. Every staging push since both names existed has been
silently dead-locked.

Fix: switch GATES from display-name strings to workflow file paths.
File paths are the unique identifier for a workflow file in
.github/workflows/; display names are decoration and can collide.
The same `gh run list --workflow=<file.yml>` query that fails on
"CodeQL" succeeds on "codeql.yml" because the file path resolves
unambiguously.

No behavior change for the other three gates (CI, E2E Canvas, E2E
API Smoke) since their names didn't collide — they keep working,
they just identify by ci.yml / e2e-staging-canvas.yml / e2e-api.yml
now. The log line shape changes from `CI → completed/success` to
`ci.yml → completed/success` which is fine for ops grep.

When adding/removing a gate going forward: file paths only. Keep
branch-protection required-checks (check-run display names) in
sync as a separate manual step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 13:15:13 -07:00
Hongming Wang
31d25b5a74 fix(ci): e2e gates always emit a result so auto-promote can read it
The auto-promote-staging.yml gate-check (line 99) treats "workflow
didn't run" as failure. Path-filtered triggers on E2E API Smoke Test
and E2E Staging Canvas meant a platform-only or test-only push to
staging — say, the prior PR #2201 which only touched
tests/e2e/test_staging_full_saas.sh — never triggered the canvas
workflow, and auto-promote saw `missing/none`, marked all_green=false,
and aborted. Same class for any push that doesn't touch the gate's
watched paths. Dead-lock by design, never noticed because the gate
was new.

Fix per Design B (always-run + fast-skip):

- Drop `paths:` from the push/pull_request triggers on both gate
  workflows. The workflow now always fires on every staging+main
  push/PR.
- Add a `detect-changes` job using `dorny/paths-filter@v3` that
  decides whether to do real work, scoped to the same paths the
  trigger filter used to watch.
- Real work job (e2e-api / playwright) gates on
  `needs: detect-changes; if: needs.detect-changes.outputs.X == 'true'`.
- Add a sibling `no-op` job that runs when the filter output is
  false, emitting `::notice::… no-op pass`. The workflow run's
  conclusion is `success` either way — auto-promote sees green and
  proceeds.

manual `workflow_dispatch` and the weekly canvas `schedule` short-
circuit detect-changes to always-run — those triggers exist precisely
to exercise the suite and shouldn't be silently no-op'd.

Why this approach over making auto-promote-staging smarter:

The alternative (Design A, considered + rejected) was to teach
auto-promote-staging to read each gate's `paths:` filter and treat
"no run because filter excluded the commit" as conditional pass.
That couples auto-promote to other workflows' YAML schema and breaks
silently if a gate is renamed or its filter changes. Design B keeps
the auto-promote contract simple ("each gate emits success") and
makes each gate self-describing — adding a new gate doesn't require
touching auto-promote.

Cost: ~10-30s of runner overhead per gate per push for the no-op when
paths don't match. Negligible vs the alternative of dead-locked
auto-promote chains.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 12:43:26 -07:00
Hongming Wang
e7eeeb4f59
Merge pull request #2199 from Molecule-AI/fix/pin-compat-narrow-pypi-job-trigger
ci(pin-compat): split into two workflows so each gets a narrow paths filter
2026-04-28 18:20:48 +00:00
Hongming Wang
a089712cef feat(cascade): verify wheel content sha256 against just-built dist
Closes #132. Extends the cascade propagation probe (added in #2197
and clarified in #2198) with a content-integrity check.

The previous probe verified pip can RESOLVE the version we just
published (catches surface 1+2 propagation lag — metadata + simple
index). It did NOT verify pip can DOWNLOAD bytes that match what we
uploaded — leaving a window where a Fastly stale-content scenario
(rare but PyPI has had it: e.g. 2026-04-01 incident where a CDN node
served a previous version's wheel under the new version's URL for
~90s after upload) would pass the probe and ship corrupt builds to
all 8 receiver templates.

Two-stage check, both must pass before the cascade fans out:

  (a) `pip install --no-cache-dir PACKAGE==VERSION` succeeds —
      version is resolvable. (Existing, unchanged.)

  (b) `pip download` of the same wheel + `sha256sum` matches the
      hash captured pre-upload from `dist/*.whl`. (New.)

Captured BEFORE upload via a new `wheel_hash` step that exposes
`steps.wheel_hash.outputs.wheel_sha256`, bubbled up as
`needs.publish.outputs.wheel_sha256`, and consumed by the cascade
probe via the EXPECTED_SHA256 env var.

`pip download` is the right primitive: it writes the actual .whl
file (vs `pip install` which unpacks and discards), so we can
sha256sum it directly. Combined with --no-cache-dir + a wiped
/tmp/probe-dl per poll, every poll re-fetches from the live Fastly
edge — no local-cache mask.

Per-poll cost: ~3-5s pip install + ~3s pip download + 4s sleep.
30-poll budget = ~5-6 min wall on a slow runner (vs the previous
~4-5 min for resolve-only). Well within the cascade's tolerance for
a known-rare CDN issue, and the overwhelming-common case (Fastly
serves matching bytes immediately) exits on the first poll.

Verified locally: pip download of the current PyPI-latest
(molecule-ai-workspace-runtime 0.1.29) produced
sha256=7e782b2d50812257…, exactly matching PyPI's own metadata
endpoint. The mismatch path is exercised inline (different builds
of the same version produce different hashes by definition — the
build_runtime_package.py output is timestamp-deterministic only
within a single CI invocation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 10:53:50 -07:00
Hongming Wang
a8f59f5fc2 ci(pin-compat): split into two workflows so each gets a narrow paths filter
Closes #134. The post-merge review of #2196 flagged that the combined
workflow's `paths:` filter (the union of both jobs' needs:
`workspace/**` + `scripts/build_runtime_package.py` + the workflow
itself) caused the `pypi-latest-install` job to fire on every
doc-only / adapter-only / unrelated workspace/ edit. The PyPI artifact
that job tests against can't change based on our workspace/ source —
only on actual PyPI publishes — so those runs add noise without
information.

Splits the previously-merged combined workflow:

  runtime-pin-compat.yml (kept):
    - PyPI-latest install + import smoke (was: pypi-latest-install)
    - Narrow `paths:` filter — only fires when workspace/requirements.txt
      or this workflow file changes
    - Cron-driven daily for upstream-yank detection (unchanged)

  runtime-prbuild-compat.yml (new):
    - PR-built wheel + import smoke (was: local-build-install)
    - Broad `paths:` filter — fires on any workspace/ source change,
      scripts/build_runtime_package.py, or this workflow file
    - No cron (workspace/ doesn't change between firings)

Behavior identical to before for content; only the trigger surface is
narrower per-job. Each workflow's name is its own status check, so
branch protection (which currently lists neither as required) can
gate them independently in future.

The prior comment in the combined file explicitly acknowledged the
asymmetry and proposed this split as a follow-up; this is that
follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 10:50:09 -07:00
Hongming Wang
e6ce54006d ci(publish-runtime): use pip-resolve probe to bound cascade fan-out
The cascade's PyPI-propagation gate polled `/pypi/<pkg>/<ver>/json`,
which is one of THREE surfaces pip touches when resolving an install:

  1. /pypi/<pkg>/<ver>/json    — metadata endpoint (the old check)
  2. /simple/<pkg>/             — pip's primary download index
  3. files.pythonhosted.org     — CDN-fronted wheel binary

Each has its own cache. Any one of them can lag behind the others,
and the previous gate would let the cascade fire while (2) or (3)
still served the previous version. Downstream `pip install` in the
template repos then resolved to the OLD wheel, the docker layer
cache locked that stale resolution in, and subsequent rebuilds kept
shipping the old runtime — the "five times in one night" cache trap
referenced in the prior comment.

Replace the metadata-only poll with an actual `pip install
--no-cache-dir --force-reinstall --no-deps PACKAGE==VERSION` from
a fresh venv. If pip can resolve and install the exact version we
just published, every receiver template will too — pip itself is
the ground truth for what the receivers will see, no proxy guessing
about which surface is lagging.

  - Venv created once outside the loop; only `pip install` runs in
    the poll body.
  - --no-cache-dir + --force-reinstall ensures every poll hits the
    live PyPI surfaces (no local-cache mask).
  - --no-deps keeps each poll fast — we only care about resolving
    THIS package, not its dep tree.
  - Loop budget: 30 attempts × 4s ≈ 2 min (vs prior 30 × 2s = 60s).
    Generous vs typical PyPI propagation, surfaces real upstream
    issues past the budget.

Verified locally:
  - Probing a non-existent version (0.1.999999) → pip exits 1, loop
    retries.
  - Probing the current PyPI-latest → pip exits 0, `pip show`
    returns the version, loop succeeds.

Closes #130.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 18:16:33 -07:00
Hongming Wang
7484e6fbec
Merge pull request #2196 from Molecule-AI/fix/runtime-pin-compat-test-pr-artifact
ci(runtime-pin-compat): test the PR-built wheel, not PyPI-latest
2026-04-28 00:42:02 +00:00
Hongming Wang
7065579967 ci(runtime-pin-compat): test the PR-built wheel, not the PyPI-latest one
Closes #128's chicken-and-egg. The original gate installed the
CURRENTLY-PUBLISHED molecule-ai-workspace-runtime from PyPI, then
overlaid workspace/requirements.txt, then smoke-imported. That
catches problems with the already-shipped artifact (the daily-cron
upstream-yank case), but it cannot catch problems introduced by the
PR itself: the imports it exercises are from the OLD wheel, not the
PR's source. A PR that adds `from a2a.utils.foo import bar` (where
`bar` is added in a2a-sdk 1.5 and the runtime currently pins 1.3)
slips through:
  1. Pip resolves the existing PyPI wheel + a2a-sdk 1.3.
  2. Smoke imports the OLD main.py — no reference to `bar` → green.
  3. Merge → publish-runtime.yml ships a wheel WITH the new import.
  4. Tenant images redeploy → all crash on first boot with
     ImportError: cannot import name 'bar' from 'a2a.utils.foo'.

Splits the workflow into two jobs:

  - pypi-latest-install (renamed from default-install): unchanged
    behavior. Runs on the daily cron and on requirements.txt /
    workflow edits. Catches upstream PyPI yanks + the
    already-shipped artifact going stale.

  - local-build-install (new): runs scripts/build_runtime_package.py
    on the PR's workspace/, builds the wheel with python -m build
    (mirroring publish-runtime.yml byte-for-byte), installs that
    wheel, then runs the same smoke import. Tests the artifact
    that WOULD be published if this PR merges.

Path filter widened to workspace/** so any runtime-source change
triggers the local-build job. The pypi-latest job's filter is the
same union; its internal logic is unchanged so the daily-cron and
upstream-detection use cases continue to work.

Verified locally: built the wheel from current workspace/ source via
the same script + python -m build invocation, installed into a fresh
venv, imported from molecule_runtime.main import main_sync
successfully.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 17:39:00 -07:00
Hongming Wang
1b0fab674b ci(publish-runtime): smoke well-known mount alignment + message helper
The existing wheel-smoke catches AgentCard kwarg-shape regressions
(state_transition_history, supported_protocols) but doesn't catch the
SDK-contract drift class that #2193 just fixed in production: the
a2a-sdk 1.x rename of /.well-known/agent.json →
/.well-known/agent-card.json, plus AGENT_CARD_WELL_KNOWN_PATH moving
to a2a.utils.constants. main.py's readiness probe hardcoded the old
literal and 404'd every attempt, silently dropping every workspace's
initial_prompt for ~weeks before a user reported it.

Two additions to the smoke block:

  1. Mount alignment: build an AgentCard, call create_agent_card_routes(),
     and assert AGENT_CARD_WELL_KNOWN_PATH is among the mounted paths.
     Catches a future SDK release that decouples the constant value
     from the route factory's mount path. The source-tree test
     (workspace/tests/test_agent_card_well_known_path.py) catches the
     main.py side; this catches the SDK side BEFORE PyPI upload.

  2. Message helper smoke: import a2a.helpers.new_text_message and
     instantiate one. The v0→v1 cheat sheet (memory:
     reference_a2a_sdk_v0_to_v1_migration.md) flagged this as a real
     migration find — main.py and a2a_executor.py call it in hot
     paths, so an import break errors every reply before the message
     even leaves the workspace.

Verified by running the equivalent Python inside
ghcr.io/molecule-ai/workspace-template-langgraph:latest:
  ✓ well-known mount alignment OK (/.well-known/agent-card.json)
  ✓ message helper import + call OK

Closes the structural-fix half of the #2193 finding from the code-
review-and-quality pass: "the wheel publish smoke didn't catch this.
This is the 7th a2a-sdk migration find of this kind. Task #131 is the
right root-cause fix."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 17:34:12 -07:00
dccec657d6
Merge branch 'staging' into ci/cicd-review-quick-wins 2026-04-27 13:27:16 -07:00
Hongming Wang
5920fc856d
Merge pull request #2182 from Molecule-AI/ci/agentcard-smoke-followup-2179
fix(workspace): rename supported_protocols → supported_interfaces (CRITICAL — every boot crashes)
2026-04-27 14:58:28 +00:00
Hongming Wang
851fd21fb1 fix(workspace): rename supported_protocols → supported_interfaces (a2a-sdk 1.0)
CRITICAL: every workspace boot since the a2a-sdk 1.0 migration (#1974)
has been crashing at AgentCard construction with:
  ValueError: Protocol message AgentCard has no "supported_protocols" field

The protobuf field is `supported_interfaces` (plural, interfaces — see
a2a-sdk types/a2a_pb2.pyi:189). The 0.3→1.0 migration left the kwarg
as `supported_protocols`, which doesn't exist in the 1.0 schema, so
the constructor raises before any subsequent line of main runs.

Why this hid for so long:
  - publish-runtime.yml's smoke step only IMPORTED molecule_runtime.main;
    importing the module is fine, only CONSTRUCTING the AgentCard fails
  - The user-visible symptom is "Workspace failed: " with empty
    last_sample_error, indistinguishable from generic boot timeouts
  - The state_transition_history=True bug (fixed in #2179) was a
    sibling of this — same migration, same class, just caught first

Fix is symmetric with #2179:
  1. workspace/main.py: rename the kwarg + comment explaining why
  2. .github/workflows/publish-runtime.yml: extend the smoke block to
     instantiate AgentCard with the exact production call shape, so
     the next field-rename of this class fails at publish time
     instead of breaking every workspace startup

Verification:
  - Constructed AgentCard against fresh a2a-sdk 1.0.2 in a clean
    venv with the corrected kwarg → succeeds
  - Constructed it with the original `supported_protocols` kwarg →
    fails immediately with the exact error production sees
  - Smoke test pinned to mirror main.py's exact call shape; main.py
    + smoke must stay in lockstep going forward

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 07:54:23 -07:00
Hongming Wang
1a703f5687 fix(publish-runtime): wait for PyPI propagation + expand path filter
Two structural fixes for the cascade race conditions that bit us
five times today:

1. **PyPI propagation wait** (cascade job): poll PyPI for the
   just-published version with a 60s budget BEFORE firing
   repository_dispatch. PyPI accepts the upload but takes a few
   seconds to make it available via the package index. Cascade was
   firing too fast — downstream template builds ran `pip install`
   against a stale index, resolved to the previous version, and
   docker layer cache locked that in for subsequent rebuilds.
   Pairs with the build-arg cache invalidation in molecule-ci PR
   (separate change). Wait without invalidation = next build still
   pip-resolves correctly. Invalidation without wait = first cascade
   build may still race PyPI propagation. Together: no race, no
   stale cache.

2. **Path filter expansion**: scripts/build_runtime_package.py is
   the build script and changes to it (e.g. import-rewrite fixes,
   manifest emit, lib/ subpackage move) directly affect what ships
   in the wheel. Was missing from the path filter, so PRs touching
   only scripts/ (like #2174's lib/ fix) didn't auto-publish — the
   operator had to remember a manual dispatch. Add it to the closed
   list of files that trigger auto-publish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 07:42:37 -07:00
Hongming Wang
82b366fce5 ci: add pr-guards caller that disables auto-merge on push
Thin caller for molecule-ci's reusable disable-auto-merge-on-push
workflow. Forces operator re-engagement when a commit is pushed to
an open PR with auto-merge already enabled.

Pairs with the org-wide "Automatically delete head branches" repo
setting (also enabled today). Defense in depth:

1. Repo setting blocks pushes to a merged-and-deleted branch
   (post-merge orphan case — what bit #2174 today: my second
   commit landed on an already-merged-and-deleted branch).
2. This workflow catches in-queue races (push lands while the
   merge queue is processing) by disabling auto-merge so the
   operator must explicitly re-engage.

Together they cover the full lifecycle of "auto-merge enabled →
new commits arrive" without relying on operator discipline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 06:39:31 -07:00
Hongming Wang
3df5867b56 fix: restore main_sync entry point in workspace/main.py
The wheel's pyproject.toml has declared
`molecule-runtime = "molecule_runtime.main:main_sync"` since the
publish pipeline was created on 2026-04-26, but the function
itself was never present in workspace/main.py — it lived in the
pre-monorepo molecule-ai-workspace-runtime repo and was lost
during the consolidation that made workspace/ the source of truth.

The 0.1.15 wheel still had main_sync from a leftover snapshot,
so the regression went unnoticed until 0.1.16 (the first wheel
built from the new source-of-truth) shipped. Symptom: every
workspace container restart loops with

  ImportError: cannot import name 'main_sync' from 'molecule_runtime.main'

— the molecule-runtime CLI script's first line tries to import
the missing symbol. Workspaces stay in `provisioning` until the
10-min sweep marks them failed.

Caught by .github/workflows/runtime-pin-compat.yml, which already
imports the symbol by name as its smoke test. (That check kept
failing red on every recent merge_group run; this PR fixes the
underlying symbol-not-found instead of the smoke step.)

Also strengthens publish-runtime.yml's wheel smoke from
`import molecule_runtime.main` (loads the module — passes even
when entry-point target is missing) to `from molecule_runtime.main
import main_sync` (the actual contract the CLI script needs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 03:35:49 -07:00
Hongming Wang
c68dc1877f fix(release): drift-gate TOP_LEVEL_MODULES + smoke-import main in publish
Two compounding bugs surfaced when 0.1.16 hit production today:

1. scripts/build_runtime_package.py had a hand-curated TOP_LEVEL_MODULES
   set listing every workspace/*.py that should get its bare imports
   rewritten to `molecule_runtime.X`. The set silently went stale:
   - Missing: transcript_auth (added since #87 phase 1c), runtime_wedge,
     watcher → unrewritten imports shipped, every workspace startup
     died with ModuleNotFoundError.
   - Stale: claude_sdk_executor, cli_executor (both removed in #87),
     hermes_executor (never existed) → harmless but misleading.

2. publish-runtime.yml's wheel-smoke step asserted on stable invariants
   (BaseAdapter, AdapterConfig, a2a_client error sentinel) but never
   imported main. So even though main.py held the broken bare
   `from transcript_auth import ...`, the smoke check passed.

Fixes:

- Build script now derives the on-disk module set from workspace/*.py
  and asserts it matches TOP_LEVEL_MODULES exactly. Drift in either
  direction fails the build with a specific diff message instead of
  shipping a broken wheel. Closed-list typo guard preserved (we still
  edit the set explicitly when a module is added/removed) — the gate
  just makes drift impossible to ignore.

- TOP_LEVEL_MODULES updated to current reality: drop the 3 stale,
  add the 3 missing.

- publish-runtime.yml wheel-smoke now `import molecule_runtime.main`
  before the invariant asserts. main is the entry point and
  transitively imports every module — any bare-import bug surfaces
  as ModuleNotFoundError before PyPI accepts the upload.

Tested locally: `python3 scripts/build_runtime_package.py
--version 0.1.99 --out /tmp/build-test` succeeds, and
/tmp/build-test/molecule_runtime/main.py contains the rewritten
`from molecule_runtime.transcript_auth import ...`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 03:19:17 -07:00
Hongming Wang
0a455b7d71 feat(publish-runtime): auto-publish to PyPI on staging pushes that touch workspace/
Adds a third trigger so any merge to staging that changes workspace/**
auto-publishes a new molecule-ai-workspace-runtime patch release. Closes
the human-in-loop gap that caused tonight's RuntimeCapabilities
ImportError outage.

Tonight: #117 added RuntimeCapabilities to molecule_runtime.adapters.base.
The merge landed at 02:37 UTC. Templates rebuilt their images at 07:37
UTC (4 hours later) and started importing the new symbol. PyPI was
still serving 0.1.15 (pre-#117) because nobody remembered to push a
runtime-vX.Y.Z tag or workflow_dispatch the publish. Result: every
template image shipped tonight runs `from molecule_runtime.adapters.base
import RuntimeCapabilities` against an installed runtime that doesn't
export it -> ImportError -> workspace never registers -> stuck in
provisioning until 10-min sweep.

Mechanism:
- New trigger: push to staging filtered to paths: ['workspace/**'].
  Path filter applies only to branch pushes; the existing tag trigger
  still fires unconditionally.
- Version derivation for the auto case: query PyPI's JSON API for
  current latest, bump the patch component. PyPI is the source of
  truth so concurrent runs don't double-publish (HTTP 400 on collision).
- concurrency: group serializes parallel staging merges so they don't
  race on the bump computation. cancel-in-progress: false because each
  workspace/** change deserves its own release.
- publish job now exposes its derived version as a job-level output so
  the cascade reads it cleanly. Fixes a latent bug: cascade tried to
  read steps.version.outputs.version, which is from a different job's
  scope and silently resolved to empty -- then re-derived from
  GITHUB_REF_NAME, which would have been "staging" under the new
  trigger and produced an invalid version.

Tag-driven and manual-dispatch paths are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 02:11:45 -07:00
Hongming Wang
c1e9aa7461
Merge pull request #2153 from Molecule-AI/fix/block-internal-paths-shallow-clone-bug
fix(ci): block-internal-paths handle merge_group + shallow-clone BASE
2026-04-27 06:58:32 +00:00
Hongming Wang
7ac7a010fa fix(ci): block-internal-paths handle merge_group + shallow-clone BASE
[Molecule-Platform-Evolvement-Manager]

## What was broken

Same bug class as the secret-scan.yml fix in #2120 — block-internal-paths
hit `fatal: bad object <sha>` exit 128 on the staging push at
2026-04-27 06:50:33Z.

Two cases:

1. **`merge_group` events**: BASE/HEAD came from
   `github.event.before` / `.after` which are push-event-only
   properties. On merge_group both came back empty, the script fell
   through to "scan entire tree" mode which is correct but
   inefficient. Worse, when this workflow is required for the merge
   queue (line 21-22), an empty-BASE entire-tree scan would run on
   every queue check.

2. **`push` events with shallow clones**: `fetch-depth: 2` doesn't
   always cover BASE across true merge commits. When BASE is in the
   payload but absent from the local object DB, `git diff` errors out
   with `fatal: bad object <sha>` and the job exits 128. This is what
   broke today's staging push.

## Fix

Same shape as the secret-scan.yml fix (#2120):

- Add a dedicated `git fetch` step for `merge_group.base_sha`.
- Move event-specific SHAs into a step `env:` block; script uses a
  `case` over `${{ github.event_name }}` covering pull_request /
  merge_group / push (rather than `if pull_request / else push`
  which left merge_group on the empty-BASE branch).
- On-demand fetch + `git cat-file -e` guard for push BASE so a SHA
  that's payload-present-but-DB-absent triggers the fetch, and a
  fetch failure falls through cleanly to "scan entire tree" instead
  of exiting 128.

## Test plan

- [x] YAML structure preserved (no schema changes)
- [x] Bash logic mirrors the secret-scan recovery path tested in #2120
- [ ] CI green on this PR's pull_request scan + push to staging post-merge

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 23:54:00 -07:00
Hongming Wang
a4b3ebf951 test(e2e): claude-code + hermes priority-runtimes happy path
Self-contained happy-path E2E for the two runtimes the project commits
to first-class support for (task #116, completes the loop on the
"both must work end-to-end with tests" requirement).

What it proves per runtime:
  1. POST /workspaces succeeds with the runtime + secrets
  2. Workspace reaches status=online within its cold-boot window
     (claude-code: 240s, hermes: 900s on cold apt + uv + sidecar)
  3. POST /a2a (message/send "Reply with PONG") returns a non-error,
     non-empty reply
  4. activity_logs row written with method=message/send and ok|error
     status (a2a_proxy.LogActivity contract)

Skip semantics: each phase independently checks for its required env
key (CLAUDE_CODE_OAUTH_TOKEN / E2E_OPENAI_API_KEY) and skips cleanly
if absent. The script always exit-0s if every phase either passed or
skipped — so wiring it into a no-keys CI job validates the script
itself stays clean without false-failing.

Idempotent: pre-sweeps any prior "Priority E2E (claude-code)" /
"Priority E2E (hermes)" workspaces so a run interrupted by SIGPIPE /
kill -9 (which bypasses the EXIT trap) doesn't poison the next run.
Same defensive pattern as test_notify_attachments_e2e.sh.

CI wiring:
  - e2e-api.yml — runs on every PR with no LLM keys, both phases skip,
    catches script-level regressions (set -u bugs, syntax issues, etc.)
  - canary-staging.yml + e2e-staging-saas.yml already have the keys
    via secrets.MOLECULE_STAGING_OPENAI_KEY and exercise wire-real
    behavior — could be wired to opt-in if you want claude-code coverage
    there too.

Local runs (from this branch, no keys):
  === Results: 0 passed, 0 failed, 2 skipped ===

Validates the capability primitives shipped in PRs #2137-2144: once
template PRs #12 (claude-code) + #25 (hermes) merge with their
declared provides_native_session=True + idle_timeout_override=900,
a manual run with both keys validates the full native+pluggable chain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 23:48:54 -07:00
rabbitblood
b81d8e9fc5 chore(secret-scan): add sk-cp- MiniMax pattern (F1088 retroactive fix) 2026-04-26 21:43:22 -07:00
Hongming Wang
62cfc21033 test(comms): comprehensive E2E coverage for agent → user attachments
User asked to "keep optimizing and comprehensive e2e testings to prove all
works as expected" for the communication path. Adds three layers of coverage
for PR #2130 (agent → user file attachments via send_message_to_user) since
that path has the most user-visible blast radius:

1. Shell E2E (tests/e2e/test_notify_attachments_e2e.sh) — pure platform test,
   no workspace container needed. 14 assertions covering: notify text-only
   round-trip, notify-with-attachments persists parts[].kind=file in the
   shape extractFilesFromTask reads, per-element validation rejects empty
   uri/name (regression for the missing gin `dive` bug), and a real
   /chat/uploads → /notify URI round-trip when a container is up.

2. Canvas AGENT_MESSAGE handler tests (canvas-events.test.ts +5) — pin the
   WebSocket-side filtering that drops malformed attachments, allows
   attachments-only bubbles, ignores non-array payloads, and no-ops on
   pure-empty events.

3. Persisted response_body shape test (message-parser.test.ts +1) — pins
   the {result, parts} contract the chat history loader hydrates on
   reload, so refreshing after an agent attachment restores both caption
   and download chips.

Also wires the new shell E2E into e2e-api.yml so the contract regresses
in CI rather than only in manual runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 20:41:56 -07:00
Hongming Wang
3a36d732e4 fix(ci): sweep prior UTC day in e2e safety nets (midnight-rollover)
[Molecule-Platform-Evolvement-Manager]

## What was breaking

All three staging e2e workflows' "Teardown safety net" steps
filtered candidate slugs by `f'e2e-...-{today}-...'` where `today`
was computed at safety-net-step time via `datetime.date.today()`.

When a run crossed midnight UTC (start before 00:00, end after),
`today` became the NEXT day, but the slug it created carried the
PRIOR day's date. The filter never matched its own slug → leak.

## Today's incident

E2E Staging Canvas run [24970092066](
https://github.com/Molecule-AI/molecule-core/actions/runs/24970092066):
  - started 2026-04-26 23:45:59Z
  - created slug `e2e-canvas-20260426-1u8nz3` at 23:59Z
  - ended 2026-04-27 00:12:47Z (failure)
  - safety-net step ran with `today=20260427`
  - filter `e2e-canvas-20260427-` did not match `...20260426-1u8nz3`
  - tenant + child workspace EC2 both stayed up

Confirmed via CP staging logs: no DELETE for `1u8nz3` ever issued.
The Playwright globalTeardown didn't fire (test crashed mid-run);
the workflow safety-net was the last line and it missed.

## Fix

All three workflows now sweep BOTH today AND yesterday's UTC dates,
so a run that crosses midnight still matches its own slug:

```python
today = datetime.date.today()
yesterday = today - datetime.timedelta(days=1)
dates = (today.strftime('%Y%m%d'), yesterday.strftime('%Y%m%d'))
prefixes = tuple(f'e2e-canvas-{d}-' for d in dates)  # (canvas variant)
```

Per-run-id scoping (saas + canary) is preserved — the prior-day
prefix still includes the run_id, so cross-midnight runs only sweep
their own slugs, not other in-flight runs from yesterday.

## Why two-day window vs. arbitrary lookback

A run can't legitimately last more than 24h on GitHub-hosted
runners (workflow `timeout-minutes` caps; canary=25, e2e-saas=45,
canvas=30). Two-day window is enough to cover any cross-midnight
run without widening the cross-run-cleanup blast radius further.
The `sweep-stale-e2e-orgs.yml` cron (with its 120-min age threshold)
remains the catch-all for anything older that drifts through.

## Test plan

- [x] Manual logic simulation: post-midnight slug matches yesterday's
      prefix; same-day still matches; 2-days-ago does NOT match;
      production tenant never matches
- [x] All three workflow YAMLs syntactically valid
- [ ] Next cross-midnight run cleans up its own slug

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 19:23:36 -07:00
rabbitblood
6e0a8e8e1c docs(ci): fix secret-scan reusable workflow self-doc — repo is molecule-core, ref is @staging 2026-04-26 15:44:31 -07:00
Hongming Wang
05ee0843fc
Merge pull request #2125 from Molecule-AI/fix/canary-teardown-slug-pattern
fix(ci): canary teardown safety-net slug pattern (was reversed)
2026-04-26 22:04:46 +00:00
Hongming Wang
7425351321 fix(ci): canary teardown safety-net slug pattern (was reversed)
[Molecule-Platform-Evolvement-Manager]

## What was broken

`canary-staging.yml`'s teardown safety-net step filtered candidate
slugs with `f'e2e-{today}-canary-'`. But `test_staging_full_saas.sh`
emits canary slugs as `e2e-canary-${date}-${RUN_ID_SUFFIX}` — date
SECOND, mode FIRST. Full-mode slugs are the other way around
(`e2e-${date}-${RUN_ID_SUFFIX}`), and the canary workflow seems to
have been copy-pasted from there without re-checking the slug
generator.

Net effect: the safety-net step ran on every cancelled / failed
canary, hit the CP, got the org list, filtered to zero matches,
and exited cleanly. Every cancelled canary EC2 leaked until the
once-an-hour `sweep-stale-e2e-orgs.yml` cron eventually caught it
(120-min default age threshold means ≥1h leak in the worst case).

## Today's incident

Canary run 24966995140 cancelled at 21:03Z. EC2
`tenant-e2e-canary-20260426-canary-24966` still running 1h25m
later, manually terminated by the CEO. Three earlier cancellations
today (16:04Z, 19:26Z, 20:02Z) hit the same gap — visible as the
hourly canary failure pattern in #2090.

## Fix

- Filter prefix corrected to `e2e-canary-${today}-` (mode FIRST,
  date SECOND) to match the actual slug emitter.
- Added per-run scoping (`-canary-${GITHUB_RUN_ID}-` suffix) when
  GITHUB_RUN_ID is set, mirroring the e2e-staging-saas.yml safety
  net's per-run scoping that was added after the 2026-04-21
  cross-run cleanup incident — guards against a queued canary's
  safety-net step deleting an in-flight different canary's slug
  while the queue's `cancel-in-progress: false` lets two reach the
  teardown step concurrently.
- Added a comment block tracing the bug + the prior incident so
  the next maintainer doesn't re-introduce the same mistake.

## Test plan

- [x] Manual trace: today's slug `e2e-canary-20260426-canary-24966...`
      now matches `e2e-canary-20260426-canary-24966` prefix
- [x] YAML parses
- [ ] Next canary cancellation cleans up automatically

## Companion PR

The PRIMARY symptom (TLS-timeout failures, not the leaked EC2)
traces to a separate bug in `molecule-controlplane`: tunnel/DNS
creation errors are logged-and-continued rather than failing
provision. PR coming separately.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 14:44:27 -07:00
rabbitblood
5478beef90 fix(canary): bump job timeout to 25m so bash fail + diagnostic can fire (#2090)
PR #2107 bumped the bash-side TLS-readiness deadline in
tests/e2e/test_staging_full_saas.sh from 600s to 900s (15 min) AND
added a diagnostic burst on the fail path so the next failure would
identify the broken layer (DNS / TLS / HTTP). What I missed: the
canary workflow's own timeout-minutes was also 15. So GitHub Actions
killed the job at the 15:00 wall-clock mark BEFORE the bash `fail`
+ diagnostic could fire — every cancellation silent, no failure
comment on #2090, no diagnostic data attached.

Visible in the 21:03 UTC canary run: cancelled at 14:03 step time
(15:18 wall) without ever reaching the diagnostic block.

Bump to 25 min — gives ~10 min headroom over the 15-min bash deadline
for setup (org create + tenant provision + admin token fetch) plus
the diagnostic dump plus teardown. Still tighter than the sibling
staging E2E jobs (20/40/45 min) so a genuine wedge surfaces here
first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 14:36:02 -07:00
Hongming Wang
0ce537750c fix(ci): handle merge_group + shallow-clone BASE in secret-scan
[Molecule-Platform-Evolvement-Manager]

## What was breaking

Two distinct failure modes in `.github/workflows/secret-scan.yml`,
both visible after PR #2115 / #2117 hit the merge queue:

1. **`merge_group` events**: the script reads `github.event.before /
   after` to determine BASE/HEAD. Those properties only exist on
   `push` events. On `merge_group` events both came back empty, the
   script fell through to "no BASE → scan entire tree" mode, and
   false-positived on `canvas/src/lib/validation/__tests__/secret-formats.test.ts`
   which contains a `ghp_xxxx…` literal as a masking-function fixture.
   (Run 24966890424 — exit 1, "matched: ghp_[A-Za-z0-9]{36,}".)

2. **`push` events with shallow clone**: `fetch-depth: 2` doesn't
   always cover BASE across true merge commits. When BASE is in the
   payload but absent from the local object DB, `git diff` errors
   out with `fatal: bad object <sha>` and the job exits 128.
   (Run 24966796278 — push at 20:53Z merging #2115.)

## Fixes

- Add a dedicated fetch step for `merge_group.base_sha` (mirrors
  the existing pull_request base fetch) so the diff base is in the
  object DB before `git diff` runs.
- Move event-specific SHAs into a step `env:` block so the script
  uses a clean `case` over `${{ github.event_name }}` instead of
  a single `if pull_request / else push` that left merge_group on
  the empty branch.
- Add an on-demand fetch for the push-event BASE when it isn't in
  the shallow clone, plus a `git cat-file -e` guard before the
  diff so we fall through cleanly to the "scan entire tree" path
  if the fetch fails (correct, just slower) instead of exiting 128.

## Defense-in-depth

`secret-formats.test.ts` had two literal continuous-string fixtures
(`'ghp_xxxx…'`, `'github_pat_xxxx…'`). The ghp_ one matched the
secret-scan regex. Switched both to the `'prefix_' + 'x'.repeat(N)`
pattern already used elsewhere in the same file — runtime value is
the same, but the literal source text no longer matches the regex
even if the BASE detection ever falls back to tree-scan mode again.

## Test plan

- [x] No remaining regex matches in the secret-formats.test.ts source
- [x] YAML structure preserved
- [ ] CI passes on this PR's pull_request scan (was already passing)
- [ ] CI passes on this PR's merge_group scan (the new path)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 14:08:19 -07:00
Hongming Wang
263012249c
Merge pull request #2109 from Molecule-AI/feat/org-wide-secret-scan-workflow
feat(ci): add secret-scan workflow + reusable entry point for org-wide enrollment
2026-04-26 20:37:16 +00:00
Hongming Wang
f3a204347c
fix(publish-runtime): use PyPI Trusted Publisher (OIDC) instead of PYPI_TOKEN (#2113)
Drops the static PYPI_TOKEN secret in favor of OIDC trusted publishing.
PyPI now mints a short-lived upload credential after verifying the
workflow's OIDC claim against the trusted-publisher config registered
for molecule-ai-workspace-runtime (Molecule-AI/molecule-core,
publish-runtime.yml, environment pypi-publish).

Why:
- A leaked PYPI_TOKEN would let any holder publish arbitrary versions of
  molecule-ai-workspace-runtime to PyPI from anywhere — bypassing the
  monorepo's review and CI gates entirely. The 8 template repos pull
  this package; a malicious publish poisons all of them.
- Trusted Publisher (OIDC) makes that exfil path moot: no long-lived
  credential exists to leak. Only this exact workflow, on this repo,
  in the pypi-publish environment, can upload.

After this lands and the first OIDC publish succeeds, the PYPI_TOKEN
repo secret should be deleted (it becomes dead weight + a leak surface
with no purpose).

Belt-and-suspenders companion to PR #56 in molecule-ai-workspace-runtime
(sibling repo lockdown). Without OIDC, the sibling lockdown alone
doesn't prevent local `python -m build && twine upload` from a laptop
with a personal PyPI maintainer credential.

Co-authored-by: Hongming Wang <hongmingwangalt@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 13:14:47 -07:00
Hongming Wang
199630908d
fix(publish-runtime): smoke test asserts stable invariants, not feature flags (#2112)
The original smoke step had `assert a2a_client._A2A_QUEUED_PREFIX`
which is a feature-flag-style check — it fires false-positive every time
staging is mid-release of that specific feature. Caught when the dry-run
publish (run 24965411618) failed because _A2A_QUEUED_PREFIX hadn't
landed on staging yet (it lives in PR #2061's series, separate from the
PR #2103 chain that shipped this workflow).

Replaced with checks for stable invariants of the package contract:

  - a2a_client._A2A_ERROR_PREFIX exists (always has, since the
    [A2A_ERROR] sentinel is the foundational error-tagging primitive)
  - adapters.get_adapter is callable
  - BaseAdapter has the .name() static method (interface anchor)
  - AdapterConfig has __init__ (dataclass present)

These four cover the cases the smoke test actually needs to catch:
import-path rewrites broken by build_runtime_package.py, missing
modules, dataclass shape regressions. They don't fire when a specific
feature is mid-merge.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Hongming Wang <hongmingwangalt@gmail.com>
2026-04-26 13:14:15 -07:00
rabbitblood
8edbd12980 feat(ci): add secret-scan workflow + reusable entry point for org-wide enrollment
Defense-in-depth for the #2090-class incident (2026-04-24): GitHub's
hosted Copilot Coding Agent leaked a ghs_* installation token into
tenant-proxy/package.json via npm init slurping the URL from a
token-embedded origin remote. We can't fix upstream's clone hygiene,
so we gate at the PR layer.

Single workflow, dual purpose:

1. PR / push / merge_group gate on this repo (molecule-monorepo).
   Refuses any change whose diff additions contain a credential-shaped
   string. Same shape as Block forbidden paths — error message tells
   the agent how to recover without echoing the secret value.

2. Reusable workflow entry point (workflow_call) for the rest of the
   org. Other Molecule-AI repos enroll with a 3-line workflow:

     jobs:
       secret-scan:
         uses: Molecule-AI/molecule-monorepo/.github/workflows/secret-scan.yml@main

   This makes molecule-monorepo the single source of truth for the
   regex set; consumer repos pick up new patterns without per-repo PRs.

Pattern set covers GitHub family (ghp_, ghs_, gho_, ghu_, ghr_,
github_pat_), Anthropic / OpenAI / Slack / AWS. Mirror of the
runtime's bundled pre-commit hook (molecule-ai-workspace-runtime:
molecule_runtime/scripts/pre-commit-checks.sh) — keep aligned when
either side adds a pattern.

Self-exclude on .github/workflows/secret-scan.yml so the file's own
regex literals don't block its merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:05:18 -07:00
Hongming Wang
c01f057e6b ci: shift e2e-staging-saas to staging + threshold canary auto-issue at 3 reds
Two CICD-review quick wins consolidated into one PR:

# 1. e2e-staging-saas now fires on staging, not just main

The full-lifecycle SaaS E2E was main-only, so it caught regressions
AFTER they shipped to staging (and into the auto-promote PR). Adding
`staging` to the push + pull_request branch list catches them BEFORE
the staging→main promotion opens, making canary's green into
auto-promote-staging meaningfully more trustworthy.

paths-filter is unchanged, so the blast radius stays the same — only
provisioning-critical changes trigger the ~25-35 min run.

# 2. Canary auto-issue thresholded at 3 consecutive failures

The 30-min canary was opening "🔴 Canary failing" issues on every
single failure and de-duping via title match. Transient flakes (CF DNS
hiccup, AWS API blip) generated noise.

Now: on first failure, look up the prior `THRESHOLD-1` runs of this
same workflow. Only file an issue when ALL of those also failed (i.e.
this is the 3rd consecutive red, ~90 min of sustained failure). If an
issue is already open we still comment per-failure so the streak is
visible.

Threshold rationale: canary fires every 30 min, so 3 reds = ~90 min
of sustained failure — past any single-run flake but well inside the
deploy window so a real outage still surfaces fast.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:02:52 -07:00
Hongming Wang
0de67cd379 feat(platform/admin): /admin/workspace-images/refresh + Docker SDK + GHCR auth
The production-side end of the runtime CD chain. Operators (or the post-
publish CI workflow) hit this after a runtime release to pull the latest
workspace-template-* images from GHCR and recreate any running ws-* containers
so they adopt the new image. Without this, freshly-published runtime sat in
the registry but containers kept the old image until naturally cycled.

Implementation notes:
- Uses Docker SDK ImagePull rather than shelling out to docker CLI — the
  alpine platform container has no docker CLI installed.
- ghcrAuthHeader() reads GHCR_USER + GHCR_TOKEN env, builds the base64-
  encoded JSON payload Docker engine expects in PullOptions.RegistryAuth.
  Both empty → public/cached images only; both set → private GHCR pulls.
- Container matching uses ContainerInspect (NOT ContainerList) because
  ContainerList returns the resolved digest in .Image, not the human tag.
  Inspect surfaces .Config.Image which is what we need.
- Provisioner.DefaultImagePlatform() exported so admin handler picks the
  same Apple-Silicon-needs-amd64 platform as the provisioner — single
  source of truth for the multi-arch override.

Local-dev companion: scripts/refresh-workspace-images.sh runs on the
host and inherits the host's docker keychain auth — alternate path for
when GHCR_USER/TOKEN aren't set in the platform env.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-26 10:17:21 -07:00
Hongming Wang
f1792e1f7a fix(ci): stop sweep-cf-orphans noise — drop merge_group + soft-skip when secrets unset
The sweep-cf-orphans workflow shipped in #2088 was noisier than
intended in two ways. This PR fixes both — was filed under the
Optional finding I left on the original review and now matters because
the noise is observably hitting the merge queue.

1) `merge_group: types: [checks_requested]` was firing the entire
   sweep job on every PR through the merge queue. The original intent
   ("future required-check support without a workflow edit") never
   materialized, and meanwhile every recent merge-queue eval (#2091,
   #2092, #2093, #2094, #2095, #2097) generated a red `Sweep CF
   orphans (merge_group)` run.

   Drop the trigger. Comment in the workflow explains the re-add path
   if/when the workflow IS wired as a required check (re-add the
   trigger AND gate the actual sweep step with
   `if: github.event_name != 'merge_group'` so merge-queue evals are
   no-op success).

2) The `Verify required secrets present` step exits 2 when the 6
   secrets aren't configured yet (the PR body's post-merge step,
   still pending). That turns the hourly schedule into an hourly red
   CI run for as long as the secrets stay unset.

   Convert to a soft skip: emit a `:⚠️:` listing the missing
   secrets and set a `skip=true` step output, then gate the sweep
   step with `if: steps.verify.outputs.skip != 'true'`. Workflow
   reports green and ops still sees the warning when they review
   recent runs.

Net effect:
- merge-queue evals stop generating spurious red runs
- the schedule reports green-with-warning until secrets land
- once secrets land, behavior is identical to today's (real sweep
  runs, hard-fails if a secret is later removed)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 08:05:53 -07:00
Hongming Wang
355355a80a test(workspace): centralize pytest-cov config + 92% floor (closes #1817)
The Python workspace already runs pytest-cov in CI but with no
threshold and inline-flagged config. CI run 24956647701 (2026-04-26
staging) reports 97% coverage on the package — well above the issue's
75% target. The actionable gap is locking in a floor so a regression
can't sneak past, and centralizing config so local `pytest` matches CI.

Changes:

- workspace/pytest.ini — coverage flags moved into addopts (-q,
  --cov=., --cov-report=term-missing, --cov-fail-under=92).
  92% = current 97% measurement minus the 5pp safety margin
  the issue's Step 3 prescribes.

- workspace/.coveragerc (new) — [run] omit list and [report]
  skip_covered. coverage.py doesn't read pytest.ini sections, so
  the omit config has to live here.

- .github/workflows/ci.yml — removed the inline --cov flags from the
  Python Lint & Test step; now reads from pytest.ini. Workflow stays
  the same single-command shape, just simpler.

Result: any PR that drops coverage below 92% fails CI loudly. Floor
ratchets up by replacing 92 with current measurement on a future
test-writing pass — same shape as Go coverage gates landed elsewhere.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 06:21:22 -07:00
rabbitblood
0ae6b201b4 refactor(ci): apply simplify findings on PR #2088
- Drop redundant 'aws --version' step. Script's own 'aws ec2
  describe-instances' fails just as loud with a more actionable
  error; the pre-check added ~1s with no signal value.
- timeout-minutes 10 → 3. Realistic worst case is ~2min (4 curls +
  1 aws + N×CF-DELETE each individually capped at 10s by the
  script's curl -m flag). 3 surfaces hangs within one cron tick
  instead of burning the full interval.
- Document the schedule-vs-dispatch dry-run asymmetry inline so
  the next reader doesn't need to trace input defaults.
- Add merge_group: types: [checks_requested] for queue parity with
  runtime-pin-compat.yml — cheap insurance if this ever becomes a
  required check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 04:18:24 -07:00
rabbitblood
3c18b76aa7 ops(cf): hourly sweep workflow for orphan Cloudflare DNS records (#239)
Closes Molecule-AI/molecule-controlplane#239.

CF zone hit the 200-record quota 2026-04-23+ — every E2E and canary
left a record on moleculesai.app, and no scheduled job pruned them.
Provisions started failing with code 81045 ('Record quota exceeded').

The sweep-cf-orphans.sh script (PR #1978, with decision-function
unit tests added in #2079) already exists but no workflow fires it.
Adding it here as a parallel janitor to sweep-stale-e2e-orgs.yml:

- hourly schedule at :15 (offset from the e2e-orgs sweep at :00 so
  the two converge cleanly without racing the same CP admin endpoint)
- workflow_dispatch with dry_run input default true (ad-hoc verify
  without committing to deletes)
- workflow_dispatch with max_delete_pct input for major cleanups
  (the script's own MAX_DELETE_PCT defaults to 50% as a safety gate)
- concurrency group prevents schedule + manual-dispatch from racing
  the same zone

Why a separate workflow vs sweep-stale-e2e-orgs.yml:
- That workflow drives DELETE /cp/admin/tenants/:slug, assumes CP
  has the org row. Doesn't catch records left when CP itself never
  knew about the tenant (canary scratch, manual ops experiments)
  or when the CP-side cascade's CF-delete branch failed.
- sweep-cf-orphans.sh enumerates the CF zone directly + matches
  against live CP slugs + AWS EC2 names. Catches what the CP-driven
  sweep can't.

Required secrets (will need to be set on the repo): CF_API_TOKEN,
CF_ZONE_ID, CP_PROD_ADMIN_TOKEN, CP_STAGING_ADMIN_TOKEN,
AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY. Pre-flight verify-secrets
step fails loud if any are missing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 04:16:43 -07:00
Hongming Wang
1e7f8ebb1b
Merge pull request #2079 from Molecule-AI/feat/test-sweep-cf-decide-2027
test(ops): unit tests for sweep-cf-orphans decide() (#2027)
2026-04-26 09:21:45 +00:00
rabbitblood
5ce7af2d2c fix(ci): set WORKSPACE_ID for the runtime-pin smoke import
platform_auth.py validates WORKSPACE_ID at module load — EC2 user-data
sets it from cloud-init, but the CI smoke-test was missing it and
failed with 'WORKSPACE_ID is empty'. Set a placeholder UUID so the
import gate exercises only the dep-resolution path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 01:59:56 -07:00
rabbitblood
b817251c85 refactor(ci): apply simplify findings on #2083
Review of the runtime-pin-compat workflow:

- Add merge_group trigger so when this becomes a required check the
  queue green-checks it (mirrors ci.yml convention).
- Cache pip on workspace/requirements.txt — actions/setup-python@v5
  with cache: pip + cache-dependency-path. Saves ~30s per fire.
- Document the load-bearing install order: runtime FIRST so pip
  honors the runtime's declared a2a-sdk constraint (the surface that
  broke 2026-04-24); workspace/requirements.txt SECOND so a2a-sdk
  is upgraded to the runtime image's pinned version. Import smoke
  validates the upgraded combination.

Skipped: branch-protection wiring (separate ops decision, not in
scope here); ci.yml integration (the standalone schedule trigger
is the load-bearing reason to keep this workflow separate).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 01:32:56 -07:00
rabbitblood
9b42a5e311 test(ci): runtime + a2a-sdk pin compatibility gate (controlplane#253)
Closes Molecule-AI/molecule-controlplane#253.

Prevents recurrence of the 5-hour staging outage from 2026-04-24:
molecule-ai-workspace-runtime 0.1.13 declared `a2a-sdk<1.0` in its
metadata but actually imported `a2a.server.routes` (1.0+ only). pip
resolved successfully; every tenant workspace crashed at import. The
canary tenant ultimately caught it but only after 5 hours of degraded
staging. PR #249 fixed the version pin manually; nothing automated
catches the same class of bug for the next release.

This workflow:

- Installs molecule-ai-workspace-runtime fresh from PyPI in a Python
  3.11 venv (mirrors EC2 user-data install pattern)
- Layers in workspace/requirements.txt (the runtime image's actual
  dep set, including the a2a-sdk[http-server]>=1.0,<2.0 pin)
- Runs `from molecule_runtime.main import main_sync` — same import
  the runtime entrypoint does
- Fails CI if pip resolution silently produced a combo that the
  runtime can't actually import

Triggers:
- PR + push to main/staging touching workspace/requirements.txt or
  this workflow (catches local pin changes)
- Daily 13:00 UTC schedule (catches upstream PyPI publishes that
  break the pin combo without any change in our repo)
- workflow_dispatch (manual)

Concurrency cancels in-progress runs on the same ref.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 01:30:36 -07:00
Hongming Wang
b5f9cbbc55 ci(retarget): handle 422 'duplicate PR' by closing redundant main-PR (closes #1884)
When a bot opens a PR against main and there's already another PR on
the same head branch targeting staging, GitHub's PATCH /pulls returns
422 with:

  "A pull request already exists for base branch 'staging' and
   head branch '<branch>'"

Pre-fix: the retarget Action exited 1 with no further action. The
target-main PR sat there as a duplicate, the workflow run showed
red, and someone had to manually close the duplicate. Today's case
(#1881 duplicate of #1820) had to be closed manually.

Fix: catch that specific 422 message and close the main-PR as
redundant instead of failing. Any OTHER 422 (or other error) still
fails loud — the grep matches the specific duplicate-base text, not
a blanket "any 422 means duplicate".

Behaviour matrix:

  PATCH succeeds                           → retargeted, explainer
                                              comment posted
  PATCH 422 "already exists for staging"   → close main-PR with
                                              explainer (NEW)
  PATCH any other failure                  → workflow fails (preserves
                                              loud-fail for real bugs)

Tests: GitHub Actions don't have an inline unit-test framework here.
The workflow YAML parses (validated locally) and the bash logic is
straightforward. Real verification will be the next duplicate-PR
scenario in production.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 00:53:55 -07:00
rabbitblood
6494e9192b refactor(ops): apply simplify findings on #2027 PR
Code-quality + efficiency review of PR #2079:

- Hoist all_slugs = prod_slugs | staging_slugs out of decide() into the
  caller (was rebuilt on every record — 1k records × ~50-slug union per
  call). decide() signature now (r, all_slugs, ec2_names).
- Compile regexes at module scope (_WS_RE, _E2E_RE, _TENANT_RE) +
  hoist platform-core literal set (_PLATFORM_CORE_NAMES). Same change
  mirrored in the bash heredoc.
- Drop decorative # Rule N: comments (numbering was out of order, 3
  before 2 — actively confusing).
- Move the "edits must mirror" reminder OUTSIDE the CANONICAL DECIDE
  block in the .sh file, eliminating the .replace() comment-skip hack
  in TestParityWithBashScript.
- Drop per-line .strip() in _slice_canonical (would mask a real
  indentation bug; both blocks already at column 0).
- subTest() in TestPlatformCore loops so a single failure no longer
  short-circuits the rest of the items.
- merge_group + concurrency on test-ops-scripts.yml (parity with
  ci.yml gate behaviour).
- Fix don't apostrophe in inline comment that closed the python
  heredoc's single-quote and broke bash -n.

All 25 tests still pass. bash -n clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 00:28:15 -07:00
rabbitblood
ba78a5c00d test(ops): unit tests for sweep-cf-orphans decide() (#2027)
Closes #2027.

The CF orphan sweep deletes DNS records — a misclassification could nuke
a live workspace's tunnel. The decision function had MAX_DELETE_PCT
percentage gating but no automated test of category → action mapping.

Approach: extract the decide() function to scripts/ops/sweep_cf_decide.py
as a verbatim copy bracketed by `# CANONICAL DECIDE BEGIN/END` markers.
The shell script keeps its inline heredoc (so the operational path is
untouched) but bracketed by the same markers. A parity test
(TestParityWithBashScript) reads both files and asserts the bracketed
blocks match line-for-line — drift fails CI loudly.

Coverage (25 tests, 1 file, stdlib unittest only):
- Rule 1 platform-core: apex, _vercel, _domainkey, www/api/app/doc/send/status/staging-api
- Rule 3 ws-*: live (matches EC2 prefix) on prod + staging; orphan on prod + staging
- Rule 4 e2e-*: live + orphan on staging; orphan on prod
- Rule 2 generic tenant: live prod + staging; unknown subdomain kept-for-safety
- Rule 5 fallthrough: external domain + unrelated apex
- Rule priority: api.moleculesai.app stays platform-core (not tenant); _vercel stays verification
- Safety gate: under/at/over default 50% threshold; zero-total no-divide; custom threshold
- Empty live-sets: documents that decide() alone classifies as orphan, gate is the defense

CI: new .github/workflows/test-ops-scripts.yml runs `python -m unittest
discover` against scripts/ops/ on every PR/push that touches the
directory. Lightweight — no requirements file, stdlib only.

Local: `cd scripts/ops && python -m unittest test_sweep_cf_decide -v` →
25 tests, all OK.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 00:22:30 -07:00