GitHub Actions' cron scheduler de-prioritises :00 firings under load.
Observed empirically on 2026-05-03: the canary's cron was
'0,20,40 * * * *', but the actual firings landed at :08, :03, :01,
:03 — the :00 firings arrived late, and the :20 and :40 firings were
silently dropped. Detection latency degraded from the claimed 20 min
to ~60 min worst case.
Move to '10,30,50 * * * *':
- :10/:30/:50 sit 10 min off the top-of-hour load peak
- Still 5 min from :15 sweep-cf-orphans and :45 sweep-cf-tunnels
(the original constraint that kept us off :15/:45)
- Same 20-min cadence; only the phase changes
No code change beyond the cron expression + comment refresh.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to #2648 — same `>/dev/null || true` swallow-on-error
pattern existed in:
e2e-staging-canvas.yml (single-slug)
e2e-staging-saas.yml (loop)
e2e-staging-sanity.yml (loop)
e2e-staging-external.yml (loop, was `>/dev/null 2>&1` variant)
All four now capture the HTTP code, log a "[teardown] deleted $slug
(HTTP $code)" line on success, and emit a workflow warning naming
the slug + body excerpt on non-2xx. Loop bodies also tally + summarise
total leaks at the end.
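A minimal sketch of the loop-body shape (endpoint path, body file, and
variable names are illustrative, not lifted from the workflows):

```bash
LEAKS=0
for slug in "${SLUGS[@]}"; do          # SLUGS assumed collected earlier
  code=$(curl -s -o /tmp/del_body -w '%{http_code}' -X DELETE \
    -H "Authorization: Bearer $ADMIN_TOKEN" \
    "$CP_URL/admin/tenants/$slug" || echo 000)
  if [[ "$code" == 2* ]]; then
    echo "[teardown] deleted $slug (HTTP $code)"
  else
    echo "::warning::teardown leak: $slug (HTTP $code) $(head -c 200 /tmp/del_body)"
    LEAKS=$((LEAKS + 1))
  fi
done
echo "[teardown] total leaks this run: $LEAKS"
# deliberately no exit 1: sweep-stale-e2e-orgs remains the safety net
```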
Exit semantics unchanged: a single cleanup miss still doesn't fail-flag
the test (sweep-stale-e2e-orgs is the safety net within ~45 min). The
behavior change is purely surfacing — failures that were silent are
now visible on the workflow run page.
Pairs with #2648's tightened sweeper. Together: per-run cleanup
failures are visible AND the safety net catches them quickly.
Closes the per-workflow port noted as out-of-scope in #2648.
See molecule-controlplane#420.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes that close one of the leak classes from the
molecule-controlplane#420 vCPU audit:
1. sweep-stale-e2e-orgs.yml: cron */15 (was hourly), MAX_AGE_MINUTES
30 (was 120). E2E runs are 8-25 min wall clock; 30 min is safely
above the longest run while shrinking the worst-case leak window
from ~2h to ~45 min (15-min sweep cadence + 30-min threshold).
2. canary-staging.yml teardown: the per-slug DELETE used `>/dev/null
|| true`, which swallowed every failure. A 5xx or timeout from CP
looked identical to "successfully deleted" and the canary tenant
kept eating ~2 vCPU until the sweeper caught it. Now we capture
the response code and surface non-2xx as a workflow warning that
names the leaked slug.
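Roughly, the before/after shape (URL and variable names illustrative):

```bash
# Before: every failure collapses into silence
curl -X DELETE "$CP_URL/admin/tenants/$SLUG" >/dev/null || true

# After: capture the code, warn on non-2xx, still never fail the job
code=$(curl -s -o /tmp/del_body -w '%{http_code}' -X DELETE \
  "$CP_URL/admin/tenants/$SLUG" || echo 000)
[[ "$code" == 2* ]] || echo "::warning::canary teardown leaked $SLUG (HTTP $code)"
```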
The exit semantics stay unchanged — a single-canary cleanup miss
shouldn't fail-flag the canary itself when the actual smoke check
passed. The sweeper is the safety net for whatever slips past.
Caught during the molecule-controlplane#420 audit on 2026-05-03 —
3 e2e canary tenant orphans were running for 24-95 min, all under
the previous 120-min sweep threshold so they went unnoticed until
manual cleanup. Same `|| true` pattern exists in
e2e-staging-{canvas,external,saas,sanity}.yml; out of scope for
this PR (mechanical port; tracking separately) but the sweeper
tightening covers all of them by reducing the safety-net latency.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cuts the per-run LLM cost ~10x (MiniMax M2.7 vs gpt-4.1-mini) and
removes the recurring OpenAI-quota-exhaustion failure mode that took
the canary down on 2026-05-03 (#265 — staging quota burnt for ~16h).
Path:
E2E_RUNTIME=claude-code (default)
→ workspace-configs-templates/claude-code-default/config.yaml's
`minimax` provider (lines 64-69)
→ ANTHROPIC_BASE_URL auto-set to api.minimax.io/anthropic
→ reads MINIMAX_API_KEY (per-vendor env, no collision with
GLM/Z.ai etc.)
Workflow changes (continuous-synth-e2e.yml):
- Default runtime: langgraph → claude-code
- New env: E2E_MODEL_SLUG (defaults to MiniMax-M2.7-highspeed,
overridable via workflow_dispatch)
- New secret wire: E2E_MINIMAX_API_KEY ←
secrets.MOLECULE_STAGING_MINIMAX_API_KEY
- Per-runtime missing-secret guard: claude-code requires MINIMAX,
  langgraph/hermes require OPENAI. A cron firing hard-fails on a
  missing key for the active runtime; a dispatch soft-skips so
  operators can ad-hoc test without setting up the secret first
  (see the sketch after this list)
- Operators can still pick langgraph/hermes via workflow_dispatch;
the OpenAI fallback path stays wired
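The guard's rough shape (step plumbing and variable names here are
assumptions, not the workflow text):

```bash
case "$E2E_RUNTIME" in
  claude-code) KEY="${E2E_MINIMAX_API_KEY:-}"; NAME="MINIMAX" ;;
  *)           KEY="${E2E_OPENAI_API_KEY:-}";  NAME="OPENAI" ;;
esac
if [ -z "$KEY" ]; then
  if [ "$GITHUB_EVENT_NAME" = "schedule" ]; then
    echo "::error::missing $NAME key for runtime '$E2E_RUNTIME'"
    exit 1                                  # cron: hard-fail
  fi
  echo "skip=true" >> "$GITHUB_OUTPUT"      # dispatch: soft-skip
fi
```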
Script changes (tests/e2e/test_staging_full_saas.sh):
- SECRETS_JSON branches on which key is set:
E2E_MINIMAX_API_KEY → {MINIMAX_API_KEY: <key>} (claude-code path)
E2E_OPENAI_API_KEY → {OPENAI_API_KEY, HERMES_*, MODEL_PROVIDER} (legacy)
MiniMax wins when both are present — claude-code default canary
must not accidentally consume the OpenAI key
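A sketch of that branch (payload shape per the list above; the HERMES_*
keys are elided):

```bash
if [ -n "${E2E_MINIMAX_API_KEY:-}" ]; then
  # MiniMax wins when both keys are set
  SECRETS_JSON=$(jq -n --arg k "$E2E_MINIMAX_API_KEY" '{MINIMAX_API_KEY: $k}')
elif [ -n "${E2E_OPENAI_API_KEY:-}" ]; then
  SECRETS_JSON=$(jq -n --arg k "$E2E_OPENAI_API_KEY" \
    '{OPENAI_API_KEY: $k, MODEL_PROVIDER: "openai"}')  # + HERMES_* in the real script
fi
```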
Tests (new tests/e2e/test_secrets_dispatch.sh):
- 10 cases pinning the precedence + payload shape per branch
- Discipline check verified: 5 of 10 FAIL on a swapped if/elif
(precedence inversion), all 10 PASS on the fix
- Anchors on the section-comment header so a structural refactor
fails loudly rather than silently sourcing nothing
The model_slug dispatcher (lib/model_slug.sh) needs no change:
E2E_MODEL_SLUG override path is already wired (line 41), and
claude-code template's `minimax-` prefix matcher catches
"MiniMax-M2.7-highspeed" via lowercase-on-lookup.
Operator action required to land green:
- Set MOLECULE_STAGING_MINIMAX_API_KEY in repo secrets
(Settings → Secrets and Variables → Actions). Use
`gh secret set MOLECULE_STAGING_MINIMAX_API_KEY -R Molecule-AI/molecule-core`
to avoid leaking the value into shell history.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #2571 fixed synth-E2E by branching MODEL_SLUG per runtime, but only
the langgraph branch was verified at runtime — hermes / claude-code /
override / fallback had zero automated coverage. A future regression
(e.g. dropping the langgraph case) would silently revert and only
surface as "Could not resolve authentication method" mid-E2E.
This PR:
- Extracts the dispatch into tests/e2e/lib/model_slug.sh as a sourceable
pick_model_slug() function. No behavior change.
- Adds tests/e2e/test_model_slug.sh — 9 assertions across all 5 dispatch
branches plus the override path. Verified to FAIL when any branch is
flipped (manually regressed langgraph slash-form to confirm the test
catches it; restored before commit).
- Wires the unit test into ci.yml's existing shellcheck job (only runs
when tests/e2e/ or scripts/ change). Pure-bash, no live infra.
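For orientation, the dispatcher's rough shape (branch slugs other than
langgraph's default are illustrative; the real table lives in
tests/e2e/lib/model_slug.sh):

```bash
pick_model_slug() {
  local runtime="$1"
  if [ -n "${E2E_MODEL_SLUG:-}" ]; then    # explicit override wins
    echo "$E2E_MODEL_SLUG"
    return 0
  fi
  case "$runtime" in
    langgraph)   echo "openai:gpt-4.1-mini" ;;  # slash-form regression was here
    hermes)      echo "gpt-4.1-mini" ;;         # illustrative
    claude-code) echo "claude-sonnet" ;;        # illustrative
    *)           echo "openai:gpt-4.1-mini" ;;  # fallback branch
  esac
}
```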
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The auto-promote ↔ auto-sync chain has been generating empty PRs
indefinitely because the staging merge_queue ruleset uses the MERGE
strategy:
1. Auto-promote merges PR via queue → main = merge commit M2 not in staging
2. Auto-sync opens sync-back PR. Workflow's local `git merge --ff-only`
succeeds (PR title even says "ff to ..."), but the queue lands the
PR via MERGE → staging = merge commit S2 not in main
3. Auto-promote sees staging ahead by 1 → opens new promote PR. Tree
diff vs main = 0 (S2's tree == main's tree). But the gate logic
only checks "all required workflows green", not "actual code to
ship" → opens an empty promote PR
4. ... repeat indefinitely
Each round costs ~30-40 min wallclock, ~2 manual approvals (the queue
requires 1 review and the bot can't self-approve without admin
bypass), and one full CodeQL Go run (~15 min).
Observed today (2026-05-03) across PRs #2592 → #2594 → #2595 → #2596
→ #2597 — 5 PRs, ~3 hours, all empty content.
Fix: before opening the promote PR, check that staging's tree
actually differs from main's tree. If they're identical (the
empty-merge-commit cycle), skip cleanly and let the cycle terminate.
Implementation:
- New step `Skip if staging tree == main tree` runs before the
existing gate check.
- `git diff --quiet origin/main $HEAD_SHA` exits 0 iff trees match.
- On match: emits a step summary explaining the skip + sets
`skip=true`; subsequent gate-check + promote steps are gated on
`skip != 'true'` so they short-circuit.
- Fail-open: if `git fetch` errors, fall through to gate check
(preserve existing behavior). Only skip when diff is DEFINITIVELY
empty.
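Assembled, the step body looks roughly like (names assumed):

```bash
git fetch origin main || exit 0   # fail-open: let the gate check run
if git diff --quiet origin/main "$HEAD_SHA"; then
  echo "skip=true" >> "$GITHUB_OUTPUT"
  echo "staging tree == main tree: empty merge-commit cycle, skipping promote" \
    >> "$GITHUB_STEP_SUMMARY"
fi
```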
Long-term, the cleaner fix is to switch the merge_queue ruleset's
merge_method away from MERGE so FF-able PRs land cleanly without a
new commit — but that's a broader change affecting every staging
PR's commit shape. This guard is the surgical one-step break.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The retarget-main-to-staging workflow tries to PATCH base=staging on
every bot-authored PR opened against main. Auto-promote staging→main
PRs have head=staging, base=main — retargeting them sets head AND
base to staging, which GitHub rejects with HTTP 422 "no new commits
between base 'staging' and head 'staging'".
This started surfacing on PR #2588 (2026-05-03 14:30) once #2586
switched the auto-promote workflow to an App token. Before #2586
the auto-promote PR was authored by github-actions[bot], which the
retarget filter happened to skip; now it's molecule-ai[bot], which
passes the bot filter and triggers the broken retarget attempt.
Add a head-ref != 'staging' guard so auto-promote PRs short-circuit
before the PATCH. The existing 422 "duplicate base" detector is
left alone — it covers a different operational case.
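The guard amounts to (sketch; the real check is a step-level `if:`
expression):

```bash
if [ "$PR_HEAD_REF" = "staging" ]; then
  echo "auto-promote PR (head=staging): retarget would 422, skipping"
  exit 0
fi
```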
GITHUB_TOKEN-initiated merges suppress downstream `push`-triggered
workflows on main per GitHub's documented limitation:
https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow
Result before this fix: every staging→main promote landed silently —
publish-workspace-server-image, canary-verify, and redeploy-tenants-on-main
all stayed dark. The polling tail was the SOLE cascade trigger; if it
ever hit its 30-min timeout, the chain deadlocked invisibly.
Symptom (from the issue body, 2026-04-30):
| Time     | Event                              | Triggered? |
|----------|------------------------------------|------------|
| 05:48:04 | Promote PR #2352 merged (c140ad28) | not fired  |
| 06:07:29 | Promote PR #2356 merged (5973c9bd) | not fired  |
Fix: mint the molecule-ai App token BEFORE the promote-PR step and
hand it to the auto-merge call. App-token-initiated merges DO trigger
downstream workflow_run cascades.
The polling tail stays as defense-in-depth (with comments updated):
once we've observed >=10 successful natural cascades it can be dropped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The PR #2536 cascade prune ('deprecated, no shipping images') was
empirically wrong. Re-confirmed 2026-05-03:
- continuous-synth-e2e.yml defaults to langgraph as its primary canary
- All 5 'deprecated' templates have successful publish-image runs in
the past 24h: langgraph, crewai, autogen, deepagents, gemini-cli
Symptom this fixes — issue #2566 (priority-high, failing 36+h):
Synthetic E2E (staging): langgraph adapter A2A failure
'Received Message object in task mode' — failing for >36h
Today at 11:06 commit e1628c4 fixed the underlying a2a-sdk strict-mode
issue in workspace/a2a_executor.py. publish-runtime fired at 11:13 and
cascaded — but only to claude-code, hermes, openclaw, codex. langgraph
was excluded by the prune, so its image stayed on the broken runtime
and the synth E2E (which defaults to langgraph) kept failing despite
the fix being live on PyPI.
After this lands + the next runtime publish fires, langgraph image
re-bakes with the fix and synth-E2E goes green.
Test plan:
- [x] yaml-validate the workflow
- [ ] After merge, watch publish-runtime cascade to all 9 templates
- [ ] Confirm langgraph publish-image fires + succeeds
- [ ] Confirm next continuous-synth-e2e run goes green
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The workflow_dispatch input default and the workflow_run env fallback
both pointed at 'hongmingwang', which doesn't match any current prod
tenant (slugs are: hongming, chloe-dong, reno-stars). CP silently
skipped the missing canary and put every tenant in batch-1 in parallel,
defeating the canary-first soak gate that exists to catch image-boot
regressions before they hit the whole fleet.
Concrete example from today's c0838d6 redeploy at 11:53Z (run 25278434388):
the dispatched body was `{"target_tag":"staging-c0838d6","canary_slug":"hongmingwang",...}`
and the CP response showed all 3 tenants in `"phase":"batch-1"` — no
soak, no canary. The deploy happened to be safe, but a broken image
would have hit hongming + chloe-dong + reno-stars simultaneously.
Fixed in three places: the runtime ordering comment, the
workflow_dispatch default, and the env fallback used by the
workflow_run trigger. The comment documents the rationale so the next
slug rename doesn't silently regress this again.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The synth-E2E (#2342) provisions a langgraph tenant whose default
model `openai:gpt-4.1-mini` requires OPENAI_API_KEY for the first LLM
call. Sibling workflows already wire this:
- e2e-staging-saas.yml:89
- canary-staging.yml:63
continuous-synth-e2e.yml just forgot. Result: tenant boots, accepts
a2a messages, then returns:
Agent error: "Could not resolve authentication method. Expected
either api_key or auth_token to be set."
This had been masked since 2026-04-29 (workflow creation) by a2a-sdk
v0→v1 contract violations — PR #2558 (Task-enqueue) and #2563
(TaskUpdater.complete/failed terminal events) cleared those, exposing
the underlying auth gap on the synth-E2E firing at 11:39 UTC today.
The script tests/e2e/test_staging_full_saas.sh:325 already reads
E2E_OPENAI_API_KEY and persists it as a workspace_secret on tenant
create — only the workflow wiring was missing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The redeploy-tenants-on-staging soft-warn filter and the
sweep-stale-e2e-orgs janitor both hardcoded `^e2e-` to identify
ephemeral test tenants. Runtime-test harness fixtures (RFC #2251)
mint slugs prefixed with `rt-e2e-`, which neither matcher recognized.
Concrete impact observed today:
- Two `rt-e2e-v{5,6}-*` tenants sat orphaned for 8h on staging
  (sweep-stale-e2e-orgs ignored them).
- On the next staging redeploy their phantom EC2s returned
`InvalidInstanceId: Instances not in a valid state for account`
from SSM SendCommand → CP returned HTTP 500 + ok=false.
- The redeploy soft-warn missed them too, so the workflow went
red, which broke the auto-promote-staging chain feeding the
canvas warm-paper rollout to prod.
Fix: switch both matchers to recognize the alternation
`^(e2e-|rt-e2e-)`. Long-lived prefixes (demo-prep, dryrun-*, dryrun2-*)
remain non-ephemeral and continue to hard-fail. Comment documents
the source-of-truth list and the cross-file invariant.
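As a bash predicate, the new matcher reads:

```bash
is_ephemeral() { [[ "$1" =~ ^(e2e-|rt-e2e-) ]]; }

is_ephemeral "rt-e2e-v5-abc"   # true  — now swept / soft-warned
is_ephemeral "e2e-20260503-x"  # true  — unchanged
is_ephemeral "demo-prep"       # false — still hard-fails
```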
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the recurrence path of PR #2556. The data fix realigned 8→4
templates in publish-runtime.yml's TEMPLATES variable, but the
underlying drift hazard was unguarded — the next manifest change
could silently leave cascade out of sync again.
This gate fails any PR that changes manifest.json or
publish-runtime.yml in a way that makes the cascade list diverge
from manifest workspace_templates (suffix-stripped). Either
direction is caught:
missing-from-cascade templates that won't auto-rebuild on a new
wheel publish (the codex-stuck-on-stale-runtime
bug class — PR #2512 added codex to manifest,
cascade wasn't updated, codex stayed pinned to
its last-built runtime version for weeks).
extra-in-cascade cascade dispatches to deprecated templates
(the wasted-API-calls + dead-CI-noise class —
PR #2536 pruned 5 templates from manifest;
cascade kept dispatching to all 8 until
PR #2556).
Triggers narrowly: only on PRs that touch manifest.json,
publish-runtime.yml, or the script itself. Fast (single grep+sed+comm
pipeline, no Go build).
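The comparison is essentially (the extraction details here are an
assumption; the comm(1) shape is the point):

```bash
jq -r '.workspace_templates[]' manifest.json |
  sed 's/-default$//' | sort -u > /tmp/manifest.list
tr ' ' '\n' <<< "$TEMPLATES" | sort -u > /tmp/cascade.list  # cascade list from the yml
comm -23 /tmp/manifest.list /tmp/cascade.list | sed 's/^/MISSING: /'  # won't auto-rebuild
comm -13 /tmp/manifest.list /tmp/cascade.list | sed 's/^/EXTRA: /'    # dead dispatch target
# any output from either comm line fails the gate
```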
Surfaced during the RFC #388 prior-art audit; folded in as the
structural follow-up that the #2556 data fix promised.
Self-tested both failure modes locally before commit:
- Drop codex from cascade → script fails with "MISSING: codex"
- Add langgraph to cascade → script fails with "EXTRA: langgraph"
Refs: https://github.com/Molecule-AI/molecule-controlplane/issues/388
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CP's deprovision flow calls Secrets.DeleteSecret() (provisioner/ec2.go:806)
but only when the deprovision runs to completion. Crashed provisions and
incomplete teardowns leak the per-tenant `molecule/tenant/<org_id>/bootstrap`
secret. At ~$0.40/secret/month, ~45 leaked secrets surfaced as ~$19/month
on the AWS cost dashboard.
The tenant_resources audit table (mig 024) tracks four kinds today —
CloudflareTunnel, CloudflareDNS, EC2Instance, SecurityGroup — and the
existing reconciler doesn't catch Secrets Manager orphans. The proper fix
(KindSecretsManagerSecret + recorder hook + reconciler enumerator) is filed
as a follow-up controlplane issue. This sweeper is the immediate stopgap.
Parallel-shape to sweep-cf-tunnels.sh:
- Hourly schedule offset (:30, between sweep-cf-orphans :15 and
sweep-cf-tunnels :45) so the three janitors don't burst CP admin
at the same minute.
- 24h grace window — never deletes a secret younger than the
  provisioning roundtrip, so an in-flight provision can't be raced
  into deletion.
- MAX_DELETE_PCT=50 default (mirrors sweep-cf-orphans for durable
resources; tenant secrets should track 1:1 with live tenants).
- Same schedule-vs-dispatch hardening as the other janitors:
schedule → hard-fail on missing secrets, dispatch → soft-skip.
- 8-way xargs parallelism, dry-run by default, --execute to delete.
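Skeleton of the sweep (the AWS CLI calls are real; the name filter,
pagination, and tenant-liveness cross-check are elided, and GNU date is
assumed):

```bash
CUTOFF=$(date -u -d '24 hours ago' +%s)            # grace window
aws secretsmanager list-secrets --output text \
  --query "SecretList[?starts_with(Name, 'molecule/tenant/')].[Name,CreatedDate]" |
while read -r name created; do
  [ "$(date -d "$created" +%s)" -lt "$CUTOFF" ] || continue   # too young: skip
  echo "$name"
done > "$DELETE_PLAN"
# MAX_DELETE_PCT guard + dry-run gate sit here in the real script; then:
if [ "${EXECUTE:-0}" = 1 ]; then
  xargs -P 8 -I {} aws secretsmanager delete-secret --secret-id {} < "$DELETE_PLAN"
fi
```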
Requires a dedicated AWS_JANITOR_* IAM principal — the prod molecule-cp
principal lacks secretsmanager:ListSecrets (it only has scoped
Get/Create/Update/Delete). The workflow's verify-secrets step will hard-fail
on the first scheduled run until those secrets are configured, surfacing
the missing setup loudly rather than silently no-op'ing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cascade `TEMPLATES` list in publish-runtime.yml had drifted from
manifest.json:
Currently dispatches to: claude-code, langgraph, crewai, autogen,
deepagents, hermes, gemini-cli, openclaw
manifest.json supports: claude-code, hermes, openclaw, codex (after
PR #2536 pruned to 4 actively-supported)
Two consequences of the drift:
1. `codex` (added in PR #2512, supported in manifest) was never in the
cascade — fresh runtime publishes did NOT trigger a codex template
rebuild. Codex stayed pinned to whatever runtime version it last saw
at its own image-build time.
2. langgraph/crewai/autogen/deepagents/gemini-cli — deprecated, no
   shipping images, no working A2A — were still receiving cascade
   dispatches. That wasted API calls and, worse, kept CI green on dead
   repos, masking "this template is dead, stop maintaining it."
Now matches manifest.json workspace_templates exactly. Surfaced during
RFC #388 (fast workspace provision) prior-art audit.
Long-term fix is to derive TEMPLATES from manifest.json so this can't
drift again — captured as a Phase-1 invariant in RFC #388. This commit
is the data fix only; structural fix lands with the bake pipeline.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The hourly Sweep stale Cloudflare Tunnels job got cancelled mid-cleanup
on 2026-05-02 (run 25248788312, killed at 5min after deleting 424/672
stale tunnels). A second manual dispatch finished the remaining 254
fine, so the immediate backlog cleared, but two underlying bugs would
re-trip on the next big cleanup.
Bug 1: serial delete loop. The execute branch was a `while read; do
curl -X DELETE; done` pipeline at ~0.7s/tunnel — fine for the
steady-state cleanup of a handful, but a 600+ backlog needs ~7-8min.
This commit fans out to $SWEEP_CONCURRENCY (default 8) workers via
`xargs -P 8 -L 1 -I {} bash -c '...' _ {} < "$DELETE_PLAN"`. With 8x
parallelism the same 600+ list drains in ~60s. Notes:
- We use stdin (`<`) not GNU's `xargs -a FILE` so the script stays
portable to BSD xargs (matters for local-runner testing on macOS).
- We pass ONLY the tunnel id on argv. xargs tokenizes on whitespace
by default; tab-separating id+name on argv risks mangling. The
name is kept in a side-channel id->name map ($NAME_MAP) and looked
up by the worker only on failure, for FAIL_LOG readability.
- Workers print exactly `OK` or `FAIL` on stdout; tally with
`grep -c '^OK$' / '^FAIL$'`.
- On non-zero FAILED, log the first 20 lines of $FAIL_LOG as
"Failure detail (first 20):" — same diagnostic surface as before
but consolidated so we don't spam logs on a flaky CF API.
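Condensed, the fan-out looks like (the CF endpoint is the documented
tunnels API; the worker body is trimmed and the FAIL_LOG name lookup
omitted):

```bash
export CF_API_TOKEN CF_ACCOUNT_ID
xargs -P "${SWEEP_CONCURRENCY:-8}" -I {} bash -c '
  code=$(curl -s -o /dev/null -w "%{http_code}" -X DELETE \
    -H "Authorization: Bearer $CF_API_TOKEN" \
    "https://api.cloudflare.com/client/v4/accounts/$CF_ACCOUNT_ID/cfd_tunnel/$1")
  case "$code" in 2*) echo OK ;; *) echo FAIL ;; esac
' _ {} < "$DELETE_PLAN" > "$RESULT_LOG"
DELETED=$(grep -c '^OK$'  "$RESULT_LOG" || true)
FAILED=$(grep -c '^FAIL$' "$RESULT_LOG" || true)
```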
Bug 2: the workflow's 5-min cap was meant as a hang detector but
turned out to flag a genuinely slow real job. Raised to 30 min —
generous headroom for the ~60s steady-state run while still surfacing
genuine hangs (and in line with the sweep-cf-orphans companion job).
Bug 3 (drive-by): the existing trap was `trap 'rm -rf "$PAGES_DIR"'
EXIT`, which would have been silently overwritten by any later trap
registration. Replaced with a single `cleanup()` function that wipes
PAGES_DIR + all four new tempfiles (DELETE_PLAN, NAME_MAP, FAIL_LOG,
RESULT_LOG), called once via `trap cleanup EXIT`.
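The consolidated shape:

```bash
cleanup() {
  rm -rf "$PAGES_DIR"
  rm -f "$DELETE_PLAN" "$NAME_MAP" "$FAIL_LOG" "$RESULT_LOG"
}
trap cleanup EXIT   # single registration; a later `trap ... EXIT` would clobber it
```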
Verification:
- bash -n scripts/ops/sweep-cf-tunnels.sh: clean
- shellcheck -S warning scripts/ops/sweep-cf-tunnels.sh: clean
- python3 yaml.safe_load on the workflow: clean
- Synthetic 30-line delete plan with every 7th id sentinel'd to
return {"success":false}: TEST PASS, DELETED=26 FAILED=4, FAIL_LOG
side-channel name lookup verified.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Recurring failure pattern in redeploy-tenants-on-staging:
##[error]redeploy-fleet returned HTTP 500
##[error]Process completed with exit code 1.
with the per-tenant breakdown in the response body showing the failures
were on ephemeral e2e-* tenants (saas/canvas/ext) whose parent E2E run
tore them down mid-redeploy — SSM exit=2 because the EC2 was already
terminating, or healthz timeout because the CF tunnel was already gone.
The actual operator-facing tenants (dryrun-98407, demo-prep, etc) all
rolled fine in the same call.
This shape repeats every staging push that overlaps an active E2E run.
The downstream `Verify each staging tenant /buildinfo matches published
SHA` step ALREADY distinguishes STALE vs UNREACHABLE for exactly this
reason (per #2402); only the top-level `if HTTP_CODE != 200; exit 1`
gate misclassifies the race.
Filter: HTTP 500 + every failed slug matches `^e2e-` → soft-warn and
fall through to verify. Any non-e2e-* failure or non-500 HTTP remains
a hard fail, with the failed non-e2e slugs surfaced in the error so
the operator doesn't have to dig the response body out of CI.
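Gate sketch (the response shape, a per-slug ok flag in a results array,
and the resp.json filename are assumptions):

```bash
if [ "$HTTP_CODE" != "200" ]; then
  FAILED_SLUGS=$(jq -r '.results[] | select(.ok == false) | .slug' resp.json)
  NON_E2E=$(grep -v '^e2e-' <<< "$FAILED_SLUGS" || true)
  if [ "$HTTP_CODE" = "500" ] && [ -n "$FAILED_SLUGS" ] && [ -z "$NON_E2E" ]; then
    echo "::warning::e2e-* teardown race ($FAILED_SLUGS); falling through to verify"
  else
    echo "::error::redeploy-fleet failed (HTTP $HTTP_CODE); non-e2e: $NON_E2E"
    exit 1
  fi
fi
```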
Verified the gate logic with 6 synthetic CP responses (happy / e2e-only
race / mixed real+e2e fail / non-200 / 200+ok=false / all-real-fail) —
all behave correctly.
prod's redeploy-tenants-on-main is intentionally NOT touched: prod CP
serves no e2e-* tenants, so the race can't occur there and the strict
gate is the right behavior.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#1569 Phase 1 discovery (2026-05-02) found six historical credential
exposures in molecule-core git history. All confirmed dead — but the
reason they got committed in the first place was that the local
pre-commit hook had two gaps that the canonical CI gate (and the
runtime's hook) didn't:
1. **Pattern set was incomplete.** Local hook checked
`sk-ant-|sk-proj-|ghp_|gho_|AKIA|mol_pk_|cfut_` — missing
`ghs_*`, `ghu_*`, `ghr_*`, `github_pat_*`, `sk-svcacct-`,
`sk-cp-`, `xox[baprs]-`, `ASIA*`. The historical leaks were 5×
`ghs_*` (App installation tokens) + 1× `github_pat_*` — none of
which the local hook would have caught even if it ran.
2. **`*.md` and `docs/` were skip-listed.** The leaked tokens lived
in `tick-reflections-temp.md`, `qa-audit-2026-04-21.md`, and
`docs/incidents/INCIDENT_LOG.md` — exactly the file types the
skip-list excluded. The hook ran and silently passed.
This commit:
- Replaces the local hook's hard-coded inline regex with the canonical
13-pattern array (byte-aligned with `.github/workflows/secret-scan.yml`
and the workspace runtime's `pre-commit-checks.sh`).
- Removes the `\.md$|docs/` skip — keeps only binary, lockfile, and
hook-self exclusions.
- Adds the local hook to `lint_secret_pattern_drift.py` as an in-repo
consumer (read-from-disk, no network — the hook lives in the same
checkout the lint runs against). Drift now fails the lint when
canonical changes without the local hook updating in lockstep.
- Adds `.githooks/pre-commit` to the drift-lint workflow's path
filter so consumer-side edits also trigger the lint.
- Adopts the canonical's "don't echo the matched value" defense (the
prior version would have round-tripped a leaked credential into
scrollback / CI logs).
Verified: `python3 .github/scripts/lint_secret_pattern_drift.py`
reports both consumers aligned at 13 patterns. The hook's existing
six other gates (canvas 'use client', dark theme, SQL injection,
go-build, etc.) are untouched.
Companion change (already applied via API, no diff here):
`Scan diff for credential-shaped strings` is now in the required-checks
list on both `staging` and `main` branch protection — was previously a
soft gate (workflow ran, exited 1, but didn't block merge).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Auto-trigger from publish-workspace-server-image now resolves
target_tag to the just-published `staging-<short_head_sha>` tag
instead of `:latest`. Bypasses the dead retag path that was leaving
prod tenants on a 4-day-old image.
The chain pre-fix:
publish-image → pushes :staging-<sha> + :staging-latest (NOT :latest)
canary-verify → soft-skips (CANARY_TENANT_URLS unset, fleet not stood up)
promote-latest → manual workflow_dispatch only, last run 2026-04-28
redeploy-main → pulls :latest → 2026-04-28 digest → all 3 tenants STALE
Today's incident:
e7375348 (main) → publish-image green → redeploy fired → tenants
pulled :latest (76c604fb digest from prior canary-verified state) →
hongming /buildinfo returned 76c604fb instead of e7375348 → verify
step correctly flagged 3/3 STALE → workflow failed.
Today's PRs (#2473 smoke wedge, #2487 panic recovery, #2496 sweeper
followups) shipped to GHCR as :staging-<sha> but never reached prod.
Fix:
- workflow_dispatch input default '' (was 'latest'); empty input
triggers auto-compute path
- new "Compute target tag" step resolves:
1. operator-supplied input → verbatim (rollback / pin)
2. else → staging-<short_head_sha> (auto)
- verify step's operator-pin detection now allows
staging-<short_head_sha> as a non-pin (verification still runs)
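The compute step reduces to (sketch):

```bash
if [ -n "${INPUT_TARGET_TAG:-}" ]; then
  TAG="$INPUT_TARGET_TAG"           # operator pin / rollback: verbatim
else
  TAG="staging-${GITHUB_SHA:0:7}"   # auto path (short-sha length assumed 7)
fi
echo "target_tag=$TAG" >> "$GITHUB_OUTPUT"
```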
When canary fleet is real, this workflow should chain on
canary-verify completion (workflow_run from canary-verify, gated on
promote-to-latest success) instead of publish-image — separate,
smaller PR. Today's fix unblocks prod deploys without that
prerequisite.
Companion: promote-latest.yml dispatched 2026-05-02 against
e7375348 to unstick existing prod tenants. This PR prevents
recurrence.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
auto-sync-main-to-staging.yml hasn't fired since 2026-04-29 despite
multiple staging→main promotes since. The promote PR #2442 (Phase 2)
has been wedged on `mergeStateStatus: BEHIND` for hours because
staging is missing the merge commit from PR #2437.
Three compounding bugs, all fixed here:
1. **GitHub no-recursion suppresses the `on: push` trigger.**
When the merge queue lands a staging→main promote, the resulting
push to main is "by GITHUB_TOKEN", and per
https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow
that push event does NOT fire any downstream workflows. Verified
empirically against SHA 76c604fb (PR #2437): exactly ONE workflow
fired on that push — `publish-workspace-server-image`, dispatched
explicitly by auto-promote-staging.yml's polling tail with an App
token (the documented #2357 workaround). Every other `on: push`
workflow on main, including auto-sync, was silently suppressed.
Same fix extended here: auto-promote-staging.yml's polling tail
now ALSO dispatches `auto-sync-main-to-staging.yml --ref main`
via the App token after the merge lands. App-initiated dispatch
propagates `workflow_run` cascades, which is what the publish
tail relies on too. Failure path: emits `::error::` with the
recovery command — operator runs it once and the next promote
self-heals.
auto-sync.yml gains `workflow_dispatch:` so it can be invoked
from the dispatch above + manually if a future promote also
misses (defense in depth).
2. **`runs-on: [self-hosted, macos, arm64]` was wrong for this repo.**
Comment claimed "matches the rest of this repo's workflows" — false:
this is the ONLY workflow in molecule-core/.github/workflows/ with
a non-ubuntu runs-on. Copy-paste artefact from molecule-controlplane
(which IS private and has a Mac runner). molecule-core has no Mac
runner registered, so even when the trigger DID fire (the 3 historic
manual-UI merges), the job would have sat unassigned if the runner
were offline. Switched to `ubuntu-latest` to match every other
workflow in this repo.
3. **The `on: push` trigger remains** as a defense-in-depth path for
the rare case of a manual UI merge by a real user (which uses
their PAT and DOES fire downstream workflows — confirmed via the
2026-04-29 d35a2420 run with `triggering_actor=HongmingWang-Rabbit`
that fired 16 workflows including auto-sync). Belt-and-suspenders.
Long-term: switching auto-promote's `gh pr merge --auto` call to use
the App token (instead of GITHUB_TOKEN) would let `on: push` triggers
fire naturally and obviate the need for the explicit dispatches in
the polling tail. Tracked in #2357 — out of scope here.
Operator recovery for the current Phase 2 wedge: after this lands on
staging, dispatch auto-sync once via
`gh workflow run auto-sync-main-to-staging.yml --ref main` to
backfill the missed sync from 76c604fb. PR #2442 will go from
BEHIND → CLEAN and auto-merge.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings the local harness from "single tenant covering the request path"
to "two tenants covering both the request path AND the per-tenant
isolation boundary" — the same shape production runs (one EC2 + one
Postgres + one MOLECULE_ORG_ID per tenant).
Why this matters: the four prior replays exercise the SaaS request
path against one tenant. They cannot prove that TenantGuard rejects
a misrouted request (production CF tunnel + AWS LB are the failure
surface), nor that two tenants doing legitimate work in parallel
keep their `activity_logs` / `workspaces` / connection-pool state
partitioned. Both are real bug classes — TenantGuard allowlist drift
shipped in #2398, and the lib/pq prepared-statement cache collision is
documented as an org-wide hazard.
What changed:
1. compose.yml — split into two tenants.
tenant-alpha + postgres-alpha + tenant-beta + postgres-beta + the
shared cp-stub, redis, cf-proxy. Each tenant gets a distinct
ADMIN_TOKEN + MOLECULE_ORG_ID and its own Postgres database. cf-proxy
depends on both tenants becoming healthy.
2. cf-proxy/nginx.conf — Host-header → tenant routing.
`map $host $tenant_upstream` resolves the right backend per request.
Required `resolver 127.0.0.11 valid=30s ipv6=off;` because nginx
needs an explicit DNS resolver to use a variable in `proxy_pass`
(literal hostnames resolve once at startup; variables resolve per
request — without the resolver nginx fails closed with 502).
`server_name` lists both tenants + the legacy alias so unknown Host
headers don't silently route to a default and mask routing bugs.
3. _curl.sh — per-tenant + cross-tenant-negative helpers.
`curl_alpha_admin` / `curl_beta_admin` set the right
Host + Authorization + X-Molecule-Org-Id triple.
`curl_alpha_creds_at_beta` / `curl_beta_creds_at_alpha` exist
precisely to make WRONG requests (replays use them to assert
TenantGuard rejects). `psql_exec_alpha` / `psql_exec_beta` shell out
per-tenant Postgres exec. Legacy aliases (`curl_admin`, `psql_exec`)
keep the four pre-Phase-2 replays working without edits.
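Hedged sketch of the helper shape (hostnames, port, and env var names
are placeholders; the real values live in _curl.sh):

```bash
curl_alpha_admin() {
  curl -s -H "Host: tenant-alpha.local" \
       -H "Authorization: Bearer $ALPHA_ADMIN_TOKEN" \
       -H "X-Molecule-Org-Id: $ALPHA_ORG_ID" \
       "http://localhost:8080$1" "${@:2}"
}
# the deliberately-WRONG pairing the isolation replay asserts against:
curl_alpha_creds_at_beta() {
  curl -s -H "Host: tenant-beta.local" \
       -H "Authorization: Bearer $ALPHA_ADMIN_TOKEN" \
       -H "X-Molecule-Org-Id: $ALPHA_ORG_ID" \
       "http://localhost:8080$1" "${@:2}"
}
```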
4. seed.sh — registers parent+child workspaces in BOTH tenants.
Captures server-generated IDs via `jq -r '.id'` (POST /workspaces
ignores body.id, so the older client-side mint silently desynced
from the workspaces table and broke FK-dependent replays). Stashes
`ALPHA_PARENT_ID` / `ALPHA_CHILD_ID` / `BETA_PARENT_ID` /
`BETA_CHILD_ID` to .seed.env, plus legacy `ALPHA_ID` / `BETA_ID`
aliases for backwards compat with chat-history / channel-envelope.
5. New replays.
tenant-isolation.sh (13 assertions) — TenantGuard 404s any request
whose X-Molecule-Org-Id doesn't match the container's
MOLECULE_ORG_ID. Asserts the 404 body has zero
tenant/org/forbidden/denied keywords (existence of a tenant must
not be probable from the outside). Covers cross-tenant routing
misconfigure + allowlist drift + missing-org-header.
per-tenant-independence.sh (12 assertions) — both tenants seed
activity_logs in parallel with distinct row counts (3 vs 5) and
confirm each tenant's history endpoint returns exactly its own
counts. Then a concurrent INSERT race (10 rows per tenant in
parallel via `&` + wait) catches shared-pool corruption +
prepared-statement cache poisoning + redis cross-keyspace bleed.
6. Bug fix: down.sh + dump-logs SECRETS_ENCRYPTION_KEY validation.
`docker compose down -v` validates the entire compose file even
though it doesn't read the env. up.sh generates a per-run key into
its own shell — down.sh runs in a fresh shell that wouldn't see it,
so without a placeholder `compose down` exited non-zero before
removing volumes. Workspaces silently leaked into the next
./up.sh + seed.sh boot. Caught when tenant-isolation.sh F1/F2 saw
3× duplicate alpha-parent rows accumulated across three prior runs.
Same fix applied to the workflow's dump-logs step.
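The fix is a one-line placeholder ahead of the teardown (the value's
content is irrelevant; compose only checks presence):

```bash
: "${SECRETS_ENCRYPTION_KEY:=placeholder-for-compose-validation}"
export SECRETS_ENCRYPTION_KEY
docker compose down -v
```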
7. requirements.txt — pin molecule-ai-workspace-runtime>=0.1.78.
channel-envelope-trust-boundary.sh imports from `molecule_runtime.*`
(the wheel-rewritten path) so it catches the failure mode where
the wheel build silently strips a fix that unit tests on local
source still pass. CI was failing this replay because the wheel
wasn't installed — caught in the staging push run from #2492.
8. .github/workflows/harness-replays.yml — Phase 2 plumbing.
* Removed /etc/hosts step (Host-header path eliminated the need;
scripts already source _curl.sh).
* Updated dump-logs to reference the new service names
(tenant-alpha + tenant-beta + postgres-alpha + postgres-beta).
* Added SECRETS_ENCRYPTION_KEY placeholder env on the dump step.
Verified: ./run-all-replays.sh from a clean state — 6/6 passed
(buildinfo-stale-image, channel-envelope-trust-boundary, chat-history,
peer-discovery-404, per-tenant-independence, tenant-isolation).
Roadmap section updated: Phase 2 marked shipped. Phase 3 promoted to
"replace cp-stub with real molecule-controlplane Docker build + env
coherence lint."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous header said `unittest discover from the scripts/ root
walks recursively`, contradicting the workflow body which runs two
passes precisely because discover does NOT recurse without
__init__.py. Addresses self-review feedback on PR #2440.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three small follow-ups to the PR #2433 → #2436 → #2439 incident chain.
1) `import inbox # noqa: F401` in workspace/a2a_mcp_server.py was
misleading — `inbox` IS used (at the bridge wiring inside main()).
F401 means "imported but unused", which would mask a real future
F401 if the usage is removed. Drop the noqa, keep the explanatory
block comment about the rewriter's `import X` → `import mr.X as X`
expansion (and the `import X as Y` → `import mr.X as X as Y` trap
   that the comment exists to prevent re-introducing).
2) scripts/test_build_runtime_package.py — 17 unit tests covering
`rewrite_imports()` and `build_import_rewriter()` in
   scripts/build_runtime_package.py. Until now these functions had zero
   coverage despite the entire wheel build depending on them. Tests
pin: bare-import aliasing, dotted-import preservation, indented
imports, from-imports (simple + dotted + multi-symbol + block),
the `import X as Y` rejection added in PR #2436 (with comment-
stripping + indented + comma-not-alias edge cases), allowlist
anchoring (`a2a` ≠ `a2a_tools`), and end-to-end reproduction
of the PR #2433 failing pattern + the #2436 fix pattern.
3) Wire scripts/test_*.py into CI by adding a second discover pass
to test-ops-scripts.yml. Top-level scripts/ tests live alongside
their target file (parallels the scripts/ops/ test layout); the
existing scripts/ops/ pass keeps running because scripts/ops/
has no __init__.py so a single discover from scripts/ root
doesn't recurse. Two passes is simpler than retrofitting
namespace packages. Path filter widened from `scripts/ops/**`
to `scripts/**` so PRs touching the build script trigger the
new tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `PR-built wheel + import smoke` gate caught the broken wheel from
PR #2433 (`import inbox as _inbox_module` collision) but couldn't block
the merge because it isn't a required check on staging. Promoting it to
required is the right move per the runtime publish pipeline gates note
(2026-04-27 RuntimeCapabilities ImportError outage), but the existing
`paths: [workspace/**, scripts/...]` filter blocks PRs that don't touch
those paths from ever generating the check run — branch protection
would deadlock waiting on a check that never fires.
Refactor (same shape as e2e-api.yml's e2e-api job):
- Drop top-level `paths:` filter — workflow runs on every push/PR/
merge_group event.
- Add `detect-changes` job using dorny/paths-filter to compute the
`wheel=true|false` output.
- Collapse to ONE always-running `local-build-install` job named
`PR-built wheel + import smoke`. Per-step `if:` gates on the
detect output. PRs untouched by wheel-relevant paths emit a
no-op SUCCESS step ("paths filter excluded this commit") so the
check passes without rebuilding the wheel.
- merge_group + workflow_dispatch unconditionally `wheel=true` so
the queue always validates the to-be-merged state, regardless of
which PR composed it.
Why one-job-with-step-gates instead of two-jobs-sharing-name: SKIPPED
check runs block branch protection even when SUCCESS siblings exist
(verified PR #2264 incident, 2026-04-29). Single always-run job emits
exactly one SUCCESS check run regardless of paths filter.
Follow-up: open a separate PR adding `PR-built wheel + import smoke`
to the staging branch protection's required_status_checks.contexts
once this lands. Doing both in one PR risks the protection update
firing before the workflow refactor merges, deadlocking unrelated PRs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
peer-discovery-404 imports workspace/a2a_client.py which depends on
httpx; the runner's stock Python doesn't have it, so the replay's
PARSE assertion (b) fails with ModuleNotFoundError on every run. The
WIRE assertion (a) — pure curl — passes, so the replay merely LOOKED
partially broken when the tenant side was in fact fine.
Adding tests/harness/requirements.txt with only httpx instead of
sourcing workspace/requirements.txt: that file pulls a2a-sdk,
langchain-core, opentelemetry, sqlalchemy, temporalio, etc. — ~30s
of install for one replay's PARSE step. The harness's deps surface
should grow when a new replay introduces a new import, not by
default.
Workflow gains one step (`pip install -r tests/harness/requirements.txt`)
between the /etc/hosts setup and run-all-replays. No other changes.
First run on PR #2410 failed with 'container harness-tenant-1 is unhealthy'
but the dump-compose-logs step printed empty tenant logs because
run-all-replays.sh's trap-on-EXIT had already torn down the harness.
Setting KEEP_UP=1 leaves containers in place; the always-run Force
teardown step at the end owns cleanup explicitly. Now we'll actually
see why the tenant didn't become healthy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the gap between "the harness exists" and "the harness blocks bugs."
Phase 2 of the harness roadmap (per tests/harness/README.md): make
harness-based E2E a required CI check on every PR touching the tenant
binary or the harness itself.
Trigger: push + pull_request to staging+main, paths-filtered to
workspace-server/**, canvas/**, tests/harness/**, and this workflow.
merge_group support included so this becomes branch-protectable.
Single-job-with-conditional-steps pattern (matches e2e-api.yml). One
check run regardless of paths-filter outcome; satisfies branch
protection cleanly per the PR #2264 SKIPPED-in-set finding.
Why this exists: 2026-04-30 we shipped a TenantGuard allowlist gap
(/buildinfo added to router.go in #2398, never added to the allowlist)
that the existing buildinfo-stale-image.sh replay would have caught.
The harness was wired correctly; nobody ran it. Replays as a discipline
beat replays as a memory item.
The CI pipeline:
detect-changes (paths filter)
└ harness-replays (always)
├ no-op pass when paths-filter says no relevant change
└ otherwise: checkout + sibling plugin checkout +
/etc/hosts entry + run-all-replays.sh +
compose-logs-on-failure + force-teardown
Compose logs from tenant/cp-stub/cf-proxy/postgres are dumped on
failure so a CI red is debuggable without re-reproducing locally.
The trap in run-all-replays.sh handles teardown; the always-run
down.sh step is a belt-and-suspenders against trap-bypass kills.
Follow-ups (not in this PR):
- Add this check to staging branch protection once it's been green
for a few PRs (the new-workflow-instability hedge that other gates
followed).
- Eventually wire the buildx GHA cache to speed up tenant image
builds — currently every PR rebuilds the full Dockerfile.tenant
(Go + Next.js + template clones) from scratch. Acceptable for now;
optimize when the timeout-minutes:30 ceiling becomes painful.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
publish-runtime.yml had a broad smoke (AgentCard call-shape, well-known
mount alignment, new_text_message) inline as a heredoc. runtime-prbuild-
compat.yml had a narrow inline smoke (just `from main import main_sync`).
Result: a PR could introduce SDK shape regressions that pass at PR time
and only fail at publish time, post-merge.
Extract the broad smoke into scripts/wheel_smoke.py and invoke it from
both workflows. PR-time gate now matches publish-time gate — same script,
same assertions. Eliminates the drift hazard of two heredocs that have
to be kept in lockstep manually.
Verified locally:
* Built wheel from workspace/ source, installed in venv, ran smoke → pass
* Simulated AgentCard kwarg-rename regression → smoke catches it as
`ValueError: Protocol message AgentCard has no "supported_interfaces"
field` (the exact failure mode of #2179 / supported_protocols incident)
Path filter for runtime-prbuild-compat extended to include
scripts/wheel_smoke.py so smoke-only edits get PR-validated. publish-
runtime path filter intentionally NOT extended — smoke-only edits should
not auto-trigger a PyPI version bump.
Subset of #131 (the broader "invoke main() against stub config" goal
remains pending — main() needs a config dir + stub platform server).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review of #2403 caught a regression: with a 1-tenant fleet (the
exact case the original #2402 fix targeted), the new floor would
re-introduce the flake. Trace:
TOTAL=1, UNREACHABLE=1, $((1/2))=0
if 1 -gt 0 → TRUE → exit 1
The 50%-rule only meaningfully distinguishes "real outage" from
"teardown race" when the fleet is large enough that "half down" is
statistically meaningful. With 1-3 tenants, canary-verify is the
actual gate (it runs against the canary first and aborts the rollout
if the canary fails to come up).
Gate the floor on TOTAL_VERIFIED >= 4. Truth table:
TOTAL  UNREACHABLE  RESULT
1      1            soft-warn (original e2e flake case)
4      2            soft-warn (exactly half)
4      3            hard-fail (75% — real outage)
10     6            hard-fail (60% — real outage)
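Combined gate, as a sketch:

```bash
if [ "$TOTAL_VERIFIED" -ge 4 ] &&
   [ "$UNREACHABLE_COUNT" -gt $((TOTAL_VERIFIED / 2)) ]; then
  echo "::error::$UNREACHABLE_COUNT/$TOTAL_VERIFIED tenants unreachable (real outage)"
  exit 1
fi
# under the floor, or at/below half: soft-warn only
```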
Mirrored across staging.yml + main.yml.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Belt-and-suspenders sanity floor on top of the unreachable-soft-warn
introduced earlier in this PR. Addresses the residual gap noted in
review: if a new image crashes on startup, every tenant ends up
unreachable, and the soft-warn alone would let that ship as a green
deploy. Canary-verify catches it on the canary tenant first, but this
guard is a fallback for canary-skip dispatches and same-batch races.
Threshold is 50% of healthz_ok-snapshotted tenants — comfortably above
the typical e2e-* teardown rate (5-10/hour, ~1 ephemeral tenant per
batch) but below any plausible real-outage scenario.
Mirrored across staging.yml + main.yml for shape parity.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /buildinfo verify step (PR #2398) was treating "no /buildinfo response"
the same as "tenant returned wrong SHA" — both bumped MISMATCH_COUNT and
hard-failed the workflow. First post-merge run on staging caught a real
edge case: ephemeral E2E tenants (slug e2e-20260430-...) get torn down by
the E2E teardown trap between CP's healthz_ok snapshot and the verify step
running, so the verify step would dial into DNS that no longer resolves
and hard-fail on a benign condition.
The bug class we actually care about is STALE (tenant up + serving old
code, the #2395 root). UNREACHABLE post-redeploy is almost always a benign
teardown race; real "tenant up but unreachable" is caught by CP's own
healthz monitor + the alert pipeline, so double-counting it here was
making this workflow flaky on every staging push that overlapped E2E.
Wire:
- Split MISMATCH_COUNT into STALE_COUNT + UNREACHABLE_COUNT.
- STALE → hard-fail the workflow (the bug class we're guarding).
- UNREACHABLE → warn (⚠️), don't fail. A reachable mismatch still hard-fails.
- Job summary surfaces both lists separately so on-call can tell at a
glance which class fired.
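Per-tenant classification, roughly (the {git_sha} body shape is the
PR #2398 contract; the loop plumbing and variable names are assumed):

```bash
for host in "${TENANT_HOSTS[@]}"; do
  body=$(curl -fsS --max-time 10 "https://$host/buildinfo") || {
    UNREACHABLE_COUNT=$((UNREACHABLE_COUNT + 1)); continue; }  # benign teardown race
  if [ "$(jq -r '.git_sha' <<< "$body")" != "$EXPECTED_SHA" ]; then
    STALE_COUNT=$((STALE_COUNT + 1))                           # the #2395 bug class
  fi
done
[ "$STALE_COUNT" -eq 0 ] || exit 1   # STALE hard-fails; UNREACHABLE only warns
```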
Mirror in redeploy-tenants-on-main.yml for shape parity (prod has fewer
ephemeral tenants but identical asymmetry would be a gratuitous fork).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the gap that let issue #2395 ship: redeploy-fleet workflows reported
ssm_status=Success based on SSM RPC return code alone, while EC2 tenants
silently kept serving the previous :latest digest because docker compose up
without an explicit pull is a no-op when the local tag already exists.
Wire:
- new buildinfo package exposes GitSHA, set at link time via -ldflags from
the GIT_SHA build-arg (default "dev" so test runs without ldflags fail
closed against an unset deploy)
- router exposes GET /buildinfo returning {git_sha} — public, no auth,
cheap enough to curl from CI for every tenant
- both Dockerfiles thread GIT_SHA into the Go build
- publish-workspace-server-image.yml passes GIT_SHA=github.sha for both
images
- redeploy-tenants-on-main.yml + redeploy-tenants-on-staging.yml curl each
tenant's /buildinfo after the redeploy SSM RPC and fail the workflow on
digest mismatch; staging treats both :latest and :staging-latest as
moving tags; verification is skipped only when an operator pinned a
specific tag via workflow_dispatch
Tests:
- TestGitSHA_DefaultDevSentinel pins the dev default
- TestBuildInfoEndpoint_ReturnsGitSHA pins the wire shape that the
workflow's jq lookup depends on
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When gh run list returns [] (no E2E run on the main SHA — the common
case for canvas-only / cmd-only / sweep-only changes whose paths
don't trigger E2E), jq's `.[0]` is null and the interpolation
`"\(null)/\(null // "none")"` produces "null/none". The case
statement has no `null/none)` branch, so it falls into `*)` →
exit 1 → auto-promote-on-e2e fails → `:latest` doesn't get retagged
to the new SHA → tenants on `redeploy-tenants-on-main` end up
pulling the OLD `:latest` digest.
Surfaced 2026-04-30 17:00Z as the first observable consequence of
PR #2389 (App-token dispatch fix). Every prior auto-promote-on-e2e
run was triggered by E2E completion (the "Upstream is E2E itself"
short-circuit at line 151 fired before reaching the gate). #2389
made publish-image's completion event correctly fire workflow_run
listeners — auto-promote-on-e2e is one of those listeners — and
hit the latent jq bug on the first publish-upstream run.
Fix: change `.[0]` to `(.[0] // {})` in the jq filter so the empty-
array case becomes `none/none` (the documented "E2E paths-filtered
out for this SHA — proceed" branch) instead of the unhandled
`null/none`. Also default `.status` for the same defensive reason.
Verified the three input shapes locally:
[] → "none/none" ✓
[{status:completed,conclusion:success}] → "completed/success" ✓
[{status:in_progress,conclusion:null}] → "in_progress/none" ✓
Outer `|| echo "none/none"` fallback retained as defense-in-depth
for non-zero gh exits (network / auth failures).
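End-to-end, the fixed gate reads roughly like this (the gh flags are
real; the workflow name and the in-progress handling are trimmed):

```bash
STATE=$(gh run list --workflow e2e.yml --commit "$SHA" \
          --json status,conclusion \
          --jq '(.[0] // {}) | "\(.status // "none")/\(.conclusion // "none")"' \
        || echo "none/none")
case "$STATE" in
  completed/success) ;;   # E2E green: proceed
  none/none)         ;;   # E2E paths-filtered out for this SHA: proceed
  *) exit 1 ;;            # red / still running (real gate waits on in_progress)
esac
```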
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pins the four workspaces.status=awaiting_agent transitions on a real
staging tenant, end-to-end. Catches the class of silent enum failures
that migration 046 fix-forwarded — specifically:
1. workspace.go:333 — POST /workspaces with runtime=external + no URL
parks the row in 'awaiting_agent'. Pre-046 the UPDATE silently
failed and the row stuck on 'provisioning'.
2. registry.go:resolveDeliveryMode — registering an external workspace
defaults delivery_mode='poll' (PR #2382). The harness asserts the
poll default after register.
3. registry/healthsweep.go:sweepStaleRemoteWorkspaces — after
REMOTE_LIVENESS_STALE_AFTER (90s default) with no heartbeat, the
workspace transitions back to 'awaiting_agent'. Pre-046 the sweep
UPDATE silently failed and the workspace stuck on 'online' forever.
4. Re-register from awaiting_agent → 'online' confirms the state is
operator-recoverable, which is the whole reason for using
awaiting_agent (vs. 'offline') as the external-runtime stale state.
The harness mirrors test_staging_full_saas.sh: tenant create →
DNS/TLS wait → tenant token retrieve → exercise → idempotent teardown
via EXIT/INT/TERM trap. Exit codes match the documented contract
{0,1,2,3,4}; raw bash exit codes are normalized so the safety-net
sweeper doesn't open false-positive incident issues.
The companion workflow gates on the source files that touch this
lifecycle: workspace.go, registry.go, workspace_restart.go,
healthsweep.go, liveness.go, every migration, the static drift gate,
and the script + workflow themselves. Daily 07:30 UTC cron catches
infra drift on quiet days. cancel-in-progress=false because aborting
a half-rolled tenant leaves orphan resources for the safety-net to
clean.
Verification:
- bash -n: ok
- shellcheck: only the documented A && B || C pattern, identical to
test_staging_full_saas.sh.
- YAML parser: ok.
- Workflow path filter matches every site that writes to the
workspace_status enum (cross-checked against the drift gate's
UPDATE workspaces / INSERT INTO workspaces enumeration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause (verified 2026-04-30): GITHUB_TOKEN-initiated
workflow_dispatch creates the dispatched run, but the resulting run's
completion event does NOT fire downstream `workflow_run` triggers.
This is the documented "no recursion" rule:
https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow
Evidence (publish-workspace-server-image runs on main):
run_id      | head_sha | triggering_actor    | canary | redeploy
------------+----------+---------------------+--------+---------
25151545007 | 6ef562ee | HongmingWang-Rabbit | YES    | YES
25171773918 | 21313dc  | github-actions[bot] | NO     | NO
25173801008 | 59dec57  | github-actions[bot] | NO     | NO
The 06:52Z run that "worked" was an operator-fired dispatch from the
terminal — actor was the operator's PAT. The two runs that "dropped"
were dispatched by auto-promote-staging.yml's `gh workflow run` step
authenticated via `secrets.GITHUB_TOKEN`, so the actor became
`github-actions[bot]` and the workflow_run cascade was suppressed.
Same workflow file, same dispatch call, same successful publish run
— only the auth token differed.
Fix: mint a molecule-ai GitHub App installation token before the
dispatch step and use it as `GH_TOKEN`. App-initiated dispatches
DO propagate the workflow_run cascade (the App user is a real
identity, not the GITHUB_TOKEN bot pseudonym).
The molecule-ai App (app_id=3398844, installation 124443072) is
already installed on the org with `actions:write` — no new App
needed. Only secrets are missing.
## Required setup before merge
The following repo secrets must be added at
https://github.com/Molecule-AI/molecule-core/settings/secrets/actions
or auto-promote will hard-fail at the new "Mint App token" step:
- `MOLECULE_AI_APP_ID` = `3398844`
- `MOLECULE_AI_APP_PRIVATE_KEY` = contents of a .pem file generated at
https://github.com/organizations/Molecule-AI/settings/installations/124443072
(Click "Generate a private key" if one doesn't exist yet.)
## Long-term cleanup
The polling tail step still exists because the auto-merge call
itself uses GITHUB_TOKEN, so the FF push to main doesn't fire
publish-workspace-server-image's `push` trigger naturally. Switching
the auto-merge call to use the SAME App token would eliminate the
polling tail entirely. Tracked in #2357.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The check has been blocking the staging→main auto-promote PR (#2361)
since 2026-04-30T07:17Z with:
fatal: origin/main...<head>: no merge base
Root cause: the workflow does `git fetch origin <base> --depth=1`
which overwrites checkout@v4's full-history clone with a shallow
tip — destroying the ancestry the subsequent
`git diff origin/main...HEAD` (three-dot, merge-base form) needs.
This deadlocks every staging→main promote PR until manually fixed.
The auto-promote runs were succeeding at the gate-check phase but
the subsequent PR-merge step waited 30 min for the failing check
and timed out, skipping the publish + redeploy dispatch tail.
Fleet recovery for any production-only fix went through staging
fine but never reached main.
Fix: drop --depth=1 so the explicit fetch preserves full history.
The leading comment is updated to call out this trap so a future
maintainer doesn't re-add the flag thinking it's a perf win.
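For the record, the trap and the fix side by side:

```bash
# Before: converts the full clone to a shallow tip, so the three-dot
# (merge-base) form below has no ancestry to work with
git fetch origin "$BASE" --depth=1
git diff --name-only "origin/main...HEAD"   # fatal: no merge base

# After: a plain fetch keeps full history; merge-base resolution works
git fetch origin "$BASE"
git diff --name-only "origin/main...HEAD"
```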
No test added: this is a workflow-config one-liner that the
existing PR check itself exercises end-to-end (the real signal is
PR #2361 going green after this lands).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>