Part of the post-#66 sweep to remove `gh` CLI dependencies that fail
silently against Gitea (which exposes /api/v1 only — no GraphQL → 405,
no /api/v3 → 404). Class A covers `gh pr list / view / diff / comment`
shapes.
Affected:
- `.github/workflows/auto-tag-runtime.yml`
Replaced `gh pr list --search SHA --json number,labels` with a curl
to `/api/v1/repos/.../pulls?state=closed&sort=newest&limit=50` +
jq filter on `merge_commit_sha == github.sha`. Same end-to-end
behaviour: locate the merged PR for this push, read its labels,
pick the bump kind. Defensive `?.name // empty` jq guard handles
unlabelled PRs without erroring. The 50-PR window is comfortably
larger than the volume of staging→main promotes that close in any
reasonable detection window.
- `scripts/check-stale-promote-pr.sh`
Rewrote `fetch_prs` and `post_comment` to call Gitea's REST API
directly. Gitea doesn't expose GitHub's compound `mergeStateStatus`
/ `reviewDecision` fields, so the new fetcher pulls
`/pulls?state=open&base=main` then for each PR pulls
`/pulls/{n}/reviews` and synthesizes the GitHub-shape JSON the rest
of the script (and the existing fixture-based unit tests) consume:
    BLOCKED + REVIEW_REQUIRED ↔ mergeable=true AND 0 APPROVED reviews
    DIRTY                     ↔ mergeable=false (alarm doesn't fire)
    CLEAN + APPROVED          ↔ mergeable=true AND ≥1 APPROVED review
  (a sketch of this synthesis follows the file list)
Comment-posting moves to `POST /repos/.../issues/{n}/comments`
(Gitea treats PRs as issues for the comment surface, same as
GitHub's REST). All 23 fixture-driven unit tests still pass —
fixtures pass GitHub-shape JSON via PR_FIXTURE which short-circuits
the live fetch path.
- `scripts/ops/check_migration_collisions.py`
Replaced `gh pr list` + `gh pr diff` calls with stdlib `urllib`
against /api/v1. Helper `_gitea_get` centralizes auth + error
handling; uses GITEA_TOKEN env, falling back to GITHUB_TOKEN
(act_runner) and GH_TOKEN. Return shape from
`open_prs_with_migration_prefix` mimics the historical
`--json number,headRefName` so the call sites are unchanged. All 9
regex-classifier unit tests still pass; live integration test
against the production Gitea API returns 0 collisions for prefix=999
as expected.
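A minimal sketch of the check-stale-promote-pr.sh fetch-and-synthesize
shape described above; auth headers are elided, API_URL is illustrative,
and `mergeable` / `state == "APPROVED"` are Gitea's real field values:

  prs="$(curl --fail-with-body -sS "${API_URL}/pulls?state=open&base=main")"
  jq -r '.[].number' <<<"$prs" | while read -r n; do
    a="$(curl --fail-with-body -sS "${API_URL}/pulls/${n}/reviews" \
         | jq '[.[] | select(.state == "APPROVED")] | length')"
    jq --argjson n "$n" --argjson a "$a" '
      .[] | select(.number == $n)
          | {number, createdAt: .created_at,
             mergeStateStatus: (if .mergeable == false then "DIRTY"
                                elif $a == 0 then "BLOCKED" else "CLEAN" end),
             reviewDecision:   (if .mergeable and $a > 0 then "APPROVED"
                                elif .mergeable then "REVIEW_REQUIRED" else "" end)}
    ' <<<"$prs"
  done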
curl invocation pattern is `curl --fail-with-body -sS` (NOT `-fsS` —
`--fail` (`-f`) and `--fail-with-body` are mutually exclusive in
modern curl; caught by `curl: You must select either --fail or
--fail-with-body, not both` during local verification).
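For reference, the auto-tag-runtime lookup above reduces to roughly
this shape (env var names illustrative; endpoint and jq guard from the
text):

  curl --fail-with-body -sS -H "Authorization: token ${GITHUB_TOKEN}" \
    "${GITEA_URL}/api/v1/repos/${OWNER}/${REPO}/pulls?state=closed&sort=newest&limit=50" \
    | jq -r --arg sha "${GITHUB_SHA}" \
        '.[] | select(.merge_commit_sha == $sha) | .labels[]?.name // empty'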
Token model: workflows pass act_runner's GITHUB_TOKEN (per-run, repo
read scope) — same surface used by the auto-sync fix in PR #66 plus
the surrounding workflows. No new repo secrets required.
Verification: bash unit tests (23/23 pass), python unittest (9/9 pass),
live curl call against production Gitea returns 200 with the expected
shape, YAML / shell / Python syntax all validate.
Closes part of #75. Other classes (D — `gh api`; F — `gh run list`)
land in follow-up PRs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause: same as #65/PR-#66 — gh CLI calls Gitea GraphQL
(/api/graphql) which returns HTTP 405. Additionally, gh workflow
run calls /actions/workflows/{id}/dispatches which does not
exist on Gitea 1.22.6 (verified via swagger.v1.json).
Fix:
- Replace gh run list with Gitea REST combined-status endpoint
(GET /repos/{owner}/{repo}/commits/{ref}/status). Combined state
encodes the AND across every check context — simpler than the
per-workflow loop and immune to workflow-name collisions.
- Replace gh pr create / merge --auto with direct curl calls to
POST /pulls and POST /pulls/{N}/merge with merge_when_checks_succeed.
- Remove the post-merge polling tail entirely. The GitHub-era
GITHUB_TOKEN no-recursion rule does not apply on Gitea Actions
(verified empirically: PR #66 merge fired downstream pushes
naturally). Even if we wanted to dispatch, Gitea has no
workflow_dispatch REST endpoint.
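Sketch of the combined-status gate from the first bullet (variable
names illustrative; the endpoint and its single `state` field are
Gitea's):

  state="$(curl --fail-with-body -sS -H "Authorization: token ${AUTO_SYNC_TOKEN}" \
           "${GITEA_URL}/api/v1/repos/${OWNER}/${REPO}/commits/${SHA}/status" \
           | jq -r '.state')"
  if [ "$state" != "success" ]; then
    echo "gates not green on ${SHA}: ${state}" >&2
    exit 1
  fi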
Critical constraint: main has enable_push: false with no whitelist;
direct push is impossible for any persona. PR-mediated merge is the
only path. main has required_approvals: 1 — auto-merge waits for
Hongming's approval before landing, preserving the
feedback_prod_apply_needs_hongming_chat_go contract.
Identity: AUTO_SYNC_TOKEN (devops-engineer persona). Not founder PAT.
Per feedback_per_agent_gitea_identity_default. Same persona used by
auto-sync (PR #66) — keeps identity model coherent.
Header comment block fully rewritten with 4 failure-mode runbooks
(A: gates not green, B: PR-create non-201, C: merge schedule fails,
D: token rotated/scope wrong) per PR #66's pattern.
Refs: #65, #73, #195, PR #66 (canonical reference)
Closes #73
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of `Auto-sync main → staging / sync-staging (push)`
failing every push to main since the GitHub→Gitea migration:
The workflow assumed a GitHub `merge_queue` ruleset on staging
(blocking direct push) and used `gh pr create` + `gh pr merge
--auto` to land sync via the queue. On Gitea this fails at the
`gh pr create` step with `HTTP 405 Method Not Allowed
(https://git.moleculesai.app/api/graphql)` — Gitea exposes no
GraphQL endpoint, and the GitHub-CLI cannot ship PRs against
Gitea.
Verified failure mode in run 1117/job 0 (token logs at
/tmp/log2.txt, run target /molecule-ai/molecule-core/actions/
runs/1117/jobs/0). The merge step succeeded and pushed
auto-sync/main-1e1f4d63; the PR step failed with the 405. So
every main push left an orphan auto-sync/* branch and a red CI
status, with no PR to land it.
Fix: the staging branch protection on Gitea
(`enable_push: true`, `push_whitelist_usernames:
[devops-engineer]`) already permits direct push from the
devops-engineer persona. Drop the entire merge-queue PR
architecture and replace with:
1. Checkout staging with secrets.AUTO_SYNC_TOKEN
(devops-engineer persona token, NOT founder PAT —
`feedback_per_agent_gitea_identity_default`).
2. `git fetch origin main` + ff-merge or no-ff merge.
3. `git push origin staging` directly.
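Steps 2-3 reduce to roughly (merge-message text illustrative):

  git fetch origin main
  # prefer a fast-forward; fall back to a merge commit when staging drifted
  git merge --ff-only origin/main \
    || git merge --no-ff -m "auto-sync: main -> staging" origin/main
  git push origin staging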
The AUTO_SYNC_TOKEN repo secret already exists (created
2026-05-07 14:00 alongside the staging push_whitelist update).
Workflow name + job name unchanged → required-check name
`Auto-sync main → staging / sync-staging (push)` keeps the
same context, no branch-protection edits needed.
Rejected alternatives (documented in workflow header):
- Reuse PR architecture via Gitea REST: ~80 LOC of API
plumbing for no benefit; direct push works.
- GH_HOST=git.moleculesai.app: still calls /api/graphql,
same 405; doesn't fix the root issue.
- Custom JS action: external dep for a 5-line `git push`.
Header comment in the workflow now documents:
- What this workflow does (SSOT for staging advancing).
- Why direct push (GitHub merge_queue → Gitea push_whitelist).
- Identity and token (anti-bot-ring per saved memory).
- Failure modes A–D with operator runbook for each.
- Loop safety (push to staging doesn't fire push:main → no
recursion).
Verification plan: this fix-PR's merge to main is itself the
trigger; watch the workflow run on the merge commit and on
one follow-up trigger commit, expect both green.
Refs: failing run
https://git.moleculesai.app/molecule-ai/molecule-core/actions/runs/1117/jobs/0
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Audit finding: every workflow that emits a required-status-check name
on molecule-core's branch protection (apply.sh's STAGING_CHECKS +
MAIN_CHECKS) ALREADY uses the safe always-runs-with-conditional-steps
shape — Platform/Canvas/Python/Shellcheck in ci.yml, Canvas tabs E2E
in e2e-staging-canvas.yml, E2E API Smoke in e2e-api.yml, PR-built
wheel in runtime-prbuild-compat.yml, the codeql Analyze matrix, and
the always-on Secret scan + Detect changes. No production drift to
fix today.
Adds a regression-guard so the next path-filter / matrix refactor /
workflow rename can't silently re-introduce the bug shape called out
in saved memory feedback_branch_protection_check_name_parity:
"Path filters … silently break branch protection because no job
emits the protected sentinel status when path-filter returns false."
New tools:
- tools/branch-protection/check_name_parity.sh — extracts every
required check name from apply.sh's heredocs, then for each name
classifies the owning workflow as safe (no top-level paths:) /
safe (per-step if-gates without top-level paths:) / unsafe
(top-level paths: without per-step if-gates) / unsafe-mix
(top-level paths: WITH per-step if-gates — the workflow may still
skip entirely on path exclusion, leaving the gates dormant) /
missing (no emitter at all). Special-cases codeql.yml's matrix-
expanded `Analyze (${{ matrix.language }})`.
- tools/branch-protection/test_check_name_parity.sh — 6 unit tests
covering each classification: safe, unsafe-path-filter, missing,
safe-with-per-step-gates, unsafe-mix, matrix-expansion. Each test
builds a synthetic apply.sh + workflow file in a tmpdir, invokes
the script, and asserts on exit code + stderr substring. Per
feedback_assert_exact_not_substring the assertions pin specific
classifications, not just non-zero exit.
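Condensed sketch of the classification (the real script's identifiers
and edge-handling differ; the codeql matrix special-case is omitted):

  classify_workflow() {
    local wf="$1"
    [ -f "$wf" ] || { echo "missing"; return 1; }
    local paths gates
    # top-level paths: filter lives in the on: block, before jobs:
    paths="$(awk '/^on:/,/^jobs:/' "$wf" | grep -c 'paths:' || true)"
    # no '^' anchor: per-step gates appear as '      - if: ...'
    gates="$(grep -cE '(- )?if:[[:space:]]' "$wf" || true)"
    if   [ "$paths" -gt 0 ] && [ "$gates" -gt 0 ]; then echo "unsafe-mix"; return 1
    elif [ "$paths" -gt 0 ];                       then echo "unsafe";     return 1
    elif [ "$gates" -gt 0 ];                       then echo "safe (per-step if-gates)"
    else                                                echo "safe"
    fi
  }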
Wired into branch-protection-drift.yml so every PR touching
.github/workflows/** runs the parity check; the existing daily
schedule covers between-PR drift. The check is cheap (~1s) and runs
without the admin token — only reads files in the checkout. A
self-test step runs the unit tests on every invocation, so a
regression in the script can't false-pass on production.
Per BSD-vs-GNU portability hygiene: heredoc-marker extraction stays
in plain awk + sed (no gawk-only `match()` array form), grep regex
avoids `^` anchor for `if:` lines because real workflows use
` - if:` with the `-` step-marker between leading spaces and
`if:` (the original anchor missed every workflow's per-step gates).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why
---
PR #35 marked `continue-on-error: true` at the JOB level (correct YAML),
but Gitea Actions 1.22.6 does NOT propagate job-level continue-on-error
to the commit-status API — every matrix leg still posts `failure`. That
keeps OVERALL=failure on every push to main + staging and blocks the
auto-promote signal even when every other gate is green.
Worse: the underlying CodeQL run never actually worked on Gitea. The
github/codeql-action/init@v4 step calls api.github.com bundle endpoints
(CLI download + query packs + telemetry) that Gitea does NOT proxy.
Confirmed via live-tested run 1d/3101 on operator host:
2026-05-07T20:55:17 ::group::Run Initialize CodeQL
with: languages: ${{ matrix.language }}
queries: security-extended
2026-05-07T20:55:36 ::error::404 page not found
2026-05-07T20:55:50 Failure - Main Initialize CodeQL
2026-05-07T20:55:51 skipping Perform CodeQL Analysis (main skipped)
2026-05-07T20:55:51 :⚠️:No files were found at sarif-results/go/
The SARIF artifact upload was already a no-op (warning above) — the
analyze step never wrote anything because init failed. So nothing of
value is being lost by stubbing this out.
What
----
- Convert the workflow to a single-step stub that emits success per
matrix language (go, javascript-typescript, python).
- Keep workflow `name: CodeQL` exactly (auto-promote-staging.yml
line 67 keys on it as a workflow_run gate).
- Keep job name template `Analyze (${{ matrix.language }})` and the
3-leg matrix exactly (commit-status context names + branch
protection + #144 required-check-name parity).
- Keep all four triggers (push / pull_request / merge_group /
schedule) so merge_group required-checks parity holds.
- Drop the codeql-action steps, the Autobuild step, the SARIF parse
step, and the upload-artifact step — all four of those are now
dead code (init can never succeed against Gitea's API surface).
Policy
------
Per Hongming decision 2026-05-07 (#156): CodeQL is ADVISORY, not
blocking, until a Gitea-compatible SAST pipeline lands. The header
of the new workflow file documents this decision + lists the three
re-enable options (self-hosted Semgrep, Sonatype, GitHub mirror)
plus the compensating controls in place (secret-scan,
block-internal-paths, lint-curl-status-capture, branch-protection-drift).
Closes #156. Touches #142 (no capital-M Molecule-AI refs in this
file — already lowercase per e01077be).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
harness-replays.yml builds tenant-alpha + tenant-beta via tests/harness/
compose.yml using workspace-server/Dockerfile.tenant. Post-#173, that
Dockerfile expects .tenant-bundle-deps/{workspace-configs-templates,
org-templates,plugins} pre-cloned at the build context root. Sister
PR #38 added the pre-clone step to publish-workspace-server-image.yml
but missed harness-replays.yml.
Symptoms:
- main run #892 (2026-05-07T20:28:53Z): COPY
.tenant-bundle-deps/plugins -> failed to calculate checksum ...
not found.
- staging run #964 (2026-05-07T20:41:52Z): hits the OLD in-image
clone path (staging hasn't picked up the Dockerfile.tenant
refactor yet via auto-sync) and fails on
'fatal: could not read Username for https://git.moleculesai.app'
when cloning the first private workspace-template-* repo.
Fix: add the same Pre-clone step to harness-replays.yml,
mirroring publish-workspace-server-image.yml. Uses AUTO_SYNC_TOKEN
(devops-engineer persona PAT) per
feedback_per_agent_gitea_identity_default.
Once auto-sync main->staging unblocks (sister agent fixing the
7-file conflict in flight), staging will inherit both this workflow
fix AND the Dockerfile.tenant refactor atomically.
Refs: #168, #173
Run #1010 (post-#46) succeeded all the way to push but failed with
"repository molecule-ai/platform does not exist" — the platform image
ECR repo had never been created (only platform-tenant existed).
Created the repo via:
aws ecr create-repository --region us-east-2 \
--repository-name molecule-ai/platform \
--image-scanning-configuration scanOnPush=true
This is a one-line workflow comment to satisfy the path-filter and
re-run the publish workflow against the now-existing repo. Closes #173
properly this time — pre-clone + inline ECR auth + ECR repo all in
place.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI run #987 (post-#45) showed `docker push` from shell still hits
"no basic auth credentials" — `aws-actions/amazon-ecr-login@v2`
writes auth to a step-scoped DOCKER_CONFIG that doesn't carry across
to the next shell step on Gitea Actions.
Fix: drop both `aws-actions/configure-aws-credentials@v4` and
`aws-actions/amazon-ecr-login@v2`. Run `aws ecr get-login-password |
docker login` inline in the same shell step as `docker build` +
`docker push`. AWS creds come from secrets via env vars, ECR token
is fresh per-step (12h validity is plenty), config.json lives in the
same shell process — auth state is guaranteed.
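The inline sequence, roughly (registry host from this workflow; tag
shape illustrative):

  aws ecr get-login-password --region us-east-2 \
    | docker login --username AWS --password-stdin \
        153263036946.dkr.ecr.us-east-2.amazonaws.com
  docker build -t "${TENANT_IMAGE_NAME}:staging-${GITHUB_SHA}" \
    -f workspace-server/Dockerfile.tenant .
  docker push "${TENANT_IMAGE_NAME}:staging-${GITHUB_SHA}"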
This is the operator-host manual approach mapped 1:1 into CI.
runner-base image already has aws-cli + docker (verified locally).
Closes #173 (fifth piece, and final; this matches the manual flow
exactly).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI run #946 (post-#43) confirmed `driver: docker` doesn't fix the ECR
push 401 either: buildx CLI inside the runner container talks to the
operator-host docker daemon (mounted socket), but the daemon doesn't
see the runner's ECR auth state, and the runner's buildx CLI doesn't
attach the auth header in a way the daemon accepts.
Drop buildx + build-push-action entirely. Plain `docker build` +
`docker push` from the runner container works because both use the
SAME docker socket + the SAME runner-container config.json (populated
by `aws ecr get-login-password | docker login` from amazon-ecr-login).
Trade-off: lose multi-arch support. We only ship linux/amd64 tenant
images today, so this is fine. If multi-arch becomes a requirement
later, we can revisit (likely with `docker buildx create
--driver=remote` pointing at an external buildkit, but that's
substantial infra work; not worth it for a single-arch shop).
Closes #173 (fourth piece — and hopefully last; this matches the
operator-host manual approach exactly).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #38 + #41 fixed the Dockerfile-side clone issue. CI run #893 then
revealed two Gitea-Actions-specific issues with the unchanged buildx
config:
1. `failed to push: 401 Unauthorized` to ECR. Root cause: default
buildx driver `docker-container` spawns a buildkit container that
doesn't share the host's `~/.docker/config.json`, so the ECR auth
set up by amazon-ecr-login doesn't reach the push. Fix: pin
`driver: docker` so buildx delegates to the host daemon, which
already has the ECR creds.
2. `dial tcp ...:41939: i/o timeout` on `_apis/artifactcache/cache`.
Root cause: `cache-from/cache-to: type=gha` is GitHub-specific;
Gitea Actions has no compatible artifact-cache backend, so every
cache lookup fails after a 30s timeout. Fix: remove the cache-*
options. Cold-build cost is <10min for 37-repo clone + Go/Node
compile, acceptable. Could revisit with type=registry inline cache
later if rebuilds get painful.
With this + #38/#41, the workflow should run end-to-end on Gitea
Actions: pre-clone -> docker build (host daemon) -> ECR push.
Closes #173 (third and final piece).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
publish-workspace-server-image.yml could not run on Gitea Actions because
Dockerfile.tenant's stage 3 ran `git clone` against private Gitea repos
from inside the Docker build context, where no auth path exists. Every
workspace-server rebuild required a manual operator-host push.
Move cloning to the trusted CI context (where AUTO_SYNC_TOKEN — the
devops-engineer persona PAT — is naturally available). Dockerfile.tenant
now COPYs from .tenant-bundle-deps/, populated by the workflow's new
"Pre-clone manifest deps" step. The Gitea token never enters the image.
- scripts/clone-manifest.sh: optional MOLECULE_GITEA_TOKEN env embeds
basic-auth in the clone URL; redacted in log output. Anonymous fallback
preserved for future public-repo path.
- .github/workflows/publish-workspace-server-image.yml: new pre-clone
step before docker build; injects AUTO_SYNC_TOKEN. Fail-fast if the
secret is empty.
- workspace-server/Dockerfile.tenant: drop stage 3 (templates), COPY
from .tenant-bundle-deps/ instead. Header documents the prereq.
- .gitignore: ignore /.tenant-bundle-deps/ so a local build can't
accidentally commit cloned repos.
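Sketch of the clone-manifest.sh token handling from the first bullet
(URL shape and the oauth2 basic-auth user are illustrative, not the
script verbatim):

  url="https://git.moleculesai.app/molecule-ai/${repo}.git"
  [ -n "${MOLECULE_GITEA_TOKEN:-}" ] \
    && url="https://oauth2:${MOLECULE_GITEA_TOKEN}@git.moleculesai.app/molecule-ai/${repo}.git"
  echo "cloning ${url/${MOLECULE_GITEA_TOKEN:-__unset__}/***}"   # redact in logs
  git clone "$url" ".tenant-bundle-deps/${repo}"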
Verified locally: clone-manifest.sh with the devops-engineer persona
token cloned all 37 repos (9 ws + 7 org + 21 plugins, 4.9MB after
.git strip).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/clone-manifest.sh runs inside the platform Dockerfile build,
so a change to that script needs to retrigger publish. Without it,
the prior fix (clone via Gitea + lowercase org) didn't trigger this
workflow because scripts/ wasn't in the path filter.
Also serves as the file change to satisfy the path filter for THIS
push, retriggering publish-workspace-server-image now.
Two coupled cleanups for the post-2026-05-06 stack:
1. Drop the github-app-auth plugin
============================================
The plugin injected GITHUB_TOKEN/GH_TOKEN via the App's
installation-access flow (~hourly rotation). Per-agent Gitea
identities replaced this approach after the 2026-05-06 suspension —
workspaces now provision with a per-persona Gitea PAT from .env
instead of an App-rotated token. The plugin code itself lived on
github.com/Molecule-AI/molecule-ai-plugin-github-app-auth which is
also unreachable post-suspension; checking it out at CI build time
was already failing.
Removed:
- workspace-server/cmd/server/main.go: githubappauth import + the
`if os.Getenv("GITHUB_APP_ID") != ""` block that called
BuildRegistry. gh-identity remains as the active mutator.
- workspace-server/Dockerfile + Dockerfile.tenant: COPY of the
sibling repo + the
`replace github.com/Molecule-AI/molecule-ai-plugin-github-app-auth => /plugin`
directive injection.
- workspace-server/go.mod + go.sum: github-app-auth dep entry
(cleaned up by `go mod tidy`).
- 3 workflows: actions/checkout steps for the sibling plugin repo:
- .github/workflows/codeql.yml (Go matrix path)
- .github/workflows/harness-replays.yml
- .github/workflows/publish-workspace-server-image.yml
Verified `go build ./cmd/server` + `go vet ./...` pass post-removal.
2. Repoint image publishing from ghcr.io to ECR
=======================================================
Same workflow used to push to ghcr.io/molecule-ai/platform +
platform-tenant. ghcr.io/molecule-ai is gone post-suspension. The
operator's ECR org (153263036946.dkr.ecr.us-east-2.amazonaws.com/
molecule-ai/) already hosts platform-tenant + workspace-template-*
+ runner-base images and is the post-suspension SSOT for container
images. This PR aligns publish-workspace-server-image with that
stack.
- env.IMAGE_NAME + env.TENANT_IMAGE_NAME repointed to ECR URL.
- docker/login-action swapped for aws-actions/configure-aws-credentials@v4
  + aws-actions/amazon-ecr-login@v2 chain (the standard ECR auth
  pattern; uses AWS_ACCESS_KEY_ID/SECRET secrets bound to the
  molecule-cp IAM user).
The :staging-<sha> + :staging-latest tag policy is unchanged —
staging-CP's TENANT_IMAGE pin still points at :staging-latest, just
with the new registry prefix.
Refs molecule-core#157, #161; parallel to org-wide CI-green sweep.
Gitea is case-sensitive on owner slugs; canonical is lowercase
`molecule-ai/...`. Mixed-case `Molecule-AI/...` refs fail at 0s
when the runner tries to resolve the cross-repo workflow / checkout.
Same fix as molecule-controlplane#12. Mechanical case-correction;
no behavior change beyond making CI resolve again.
Refs: internal#46
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The migration-replay step globbed only *.up.sql, silently skipping
the older flat-naming migrations (001_workspaces.sql,
009_activity_logs.sql, etc.). Fine while no integration test
depended on those tables; broke when the #149 cross-table
atomicity test came in needing both workspaces (FK target for
activity_logs) and activity_logs themselves.
Switch to globbing *.sql + sorted lex-order, excluding *.down.sql
so up/down pairs don't undo themselves mid-run. Add a sanity check
for workspaces + activity_logs + pending_uploads alongside the
existing delegations gate so a future migration drift fails loud
instead of silently skipping the regressed test.
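Sketch of the new replay loop (paths illustrative; table list from the
text; bash glob expansion is already lexicographic):

  for f in workspace-server/migrations/*.sql; do
    case "$f" in *.down.sql) continue ;; esac   # never replay the down half
    psql "$DATABASE_URL" -v ON_ERROR_STOP=1 -f "$f"
  done
  for t in workspaces activity_logs pending_uploads delegations; do
    psql "$DATABASE_URL" -c "\\d $t" >/dev/null \
      || { echo "sanity check failed: $t missing" >&2; exit 1; }
  done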
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 15-min sweeper has been deleting stale e2e orgs but not the
orphan tunnels left behind when the org-delete cascade half-fails
(CP transient 5xx after the org row is gone but before the CF
tunnel delete completes). Result: tunnels accumulate in CF until
manual operator cleanup.
Add a final step that POSTs `/cp/admin/orphan-tunnels/cleanup`
every tick. Best-effort — failure doesn't fail the workflow; next
tick re-attempts. Output reports deleted_count + failed count for
ops visibility.
This is the catch-all for the orphan-tunnel class. The proper
upstream fix (transactional org delete) lives in CP and tracks as
issue #2989. Until that lands, the sweeper's bounded time-to-cleanup
keeps the leak from escalating.
Note: PR #492 (cf-tunnel silent-success fix) makes this step
actually effective — pre-fix DeleteTunnel silent-succeeded on
1022, so the cleanup endpoint reported success without deleting.
Post-fix the cleanup chains CleanupTunnelConnections + retry on
1022, which actually clears stuck-connector orphans.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Mirrors molecule-controlplane#494: the canonical EPHEMERAL_PREFIXES
list now lives in molecule-controlplane/internal/slugs/ephemeral.go,
where redeploy-fleet reads it to skip in-flight test tenants. The
sweep workflow keeps a Python copy because GHA Python can't import
Go, but a comment now points engineers updating the list to update
both files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the silent-block failure mode that left 25 commits — including
the Memory v2 redesign and the reno-stars data-loss fix — wedged on
staging for 12+ hours behind a single missing review. The auto-promote
workflow opened the PR + armed auto-merge, but main's branch protection
required a human review and nobody noticed until a user reported
"still seeing old memory tab".
## Detection logic — `scripts/check-stale-promote-pr.sh`
Reads open PRs `base=main head=staging` and alarms on:
- `mergeStateStatus == BLOCKED`
- `reviewDecision == REVIEW_REQUIRED`
- createdAt older than `STALE_HOURS` (default 4h)
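The alarm predicate, sketched over that JSON (GNU `date` shown; the
script's actual clock handling differs so its tests can freeze time):

  cutoff="$(date -u -d "${STALE_HOURS:-4} hours ago" '+%Y-%m-%dT%H:%M:%SZ')"
  jq --arg cutoff "$cutoff" \
     '[.[] | select(.mergeStateStatus == "BLOCKED"
                    and .reviewDecision == "REVIEW_REQUIRED"
                    and .createdAt < $cutoff)]'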
Other BLOCKED reasons (DIRTY, BEHIND, failed checks) are NOT alarmed —
those are the author's signal-to-fix. This script targets the specific
"no human reviewed yet" wedge.
Output:
- `::warning` per stale PR (visible in workflow summary + Actions UI)
- PR comment (idempotent via marker-string detection; one alarm
per PR, never re-spammed)
- Exit code = count of stale PRs (capped at 125)
Logic in a script (not inline workflow YAML) so it's:
- **Unit-testable** — tests/test-check-stale-promote-pr.sh exercises
every branch with stubbed fixture JSON + frozen clock. 23 tests
covering: empty list, single stale, just-under-threshold, wrong
reviewDecision, wrong mergeStateStatus, mixed list (only matching
PRs alarm), custom threshold via --stale-hours, exit-code-counts-
matching-PRs, --help, unknown arg → 64, missing repo → 2.
- **Operator-runnable ad-hoc** — `scripts/check-stale-promote-pr.sh`
works from any shell with `gh` + `jq`.
- **SSOT** — one detector, the workflow YAML is just schedule +
invocation surface. Future sibling workflows that need the same
check call the same script.
## Workflow — `.github/workflows/auto-promote-stale-alarm.yml`
Triggers:
- cron `27 * * * *` (hourly, off-the-hour to dodge cron herd)
- workflow_dispatch with `stale_hours` + `post_comment` overrides
Concurrency: `auto-promote-stale-alarm` group, cancel-in-progress=false
(idempotent script; no benefit to cancelling a running scan).
Permissions: `contents: read` + `pull-requests: write` (post comments).
Sparse checkout — only fetches `scripts/check-stale-promote-pr.sh`.
No node_modules, no go modules, no slow setup steps. Workflow runs
in <30s on a clean repo.
## Why "alarm + comment" not "auto-approve"
Considered options in issue #2975:
1. Slack/email alert — picked.
2. Bot-account auto-approve via molecule-ops — circumvents the
human-review gate that branch protection encodes.
3. Trusted-promote bypass via CODEOWNERS — needs Org Admin config
change; out of scope for a workflow PR.
The comment-on-PR pattern picks (1) without external dependencies
(no Slack token, no email config). Subscribers get notified via
GitHub's existing PR notification delivery; the warning shows up in
the Actions feed.
## Why this won't false-positive on legitimate slow reviews
Threshold is 4h. Most legitimate gates clear in <1h, so 4× headroom
is plenty for slow CI. The comment is idempotent (one alarm per PR,
never re-posted) — adding noise stops at 1 comment regardless of
how long the PR sits.
## Test plan
- [x] `bash scripts/test-check-stale-promote-pr.sh` — 23/23 pass
- [x] `python3 -c 'yaml.safe_load(...)'` clean
- [x] `bash -n` clean on both scripts
- [ ] Live verification: dispatch the workflow once main has caught up,
confirm it correctly reports zero stale PRs
Continues the OSS-shape refactor. After iters 4a-4d (rbac, delegation,
memory, messaging) the only behavior left in ``a2a_tools.py`` was
``report_activity`` plus three thin inbox-tool wrappers and the
``_enrich_inbound_for_agent`` helper. This iter extracts the inbox
slice to ``a2a_tools_inbox.py`` so the kitchen-sink module shrinks
from 280 LOC to ~165 LOC of imports + report_activity + back-compat
re-export blocks.
Extracted symbols:
- ``_INBOX_NOT_ENABLED_MSG`` (sentinel)
- ``_enrich_inbound_for_agent`` (poll-path peer enrichment helper)
- ``tool_inbox_peek``
- ``tool_inbox_pop``
- ``tool_wait_for_message``
Re-exports (`from a2a_tools_inbox import …`) preserve the public
``a2a_tools.tool_inbox_*`` surface so existing tests + call sites
continue to resolve unchanged.
New tests in test_a2a_tools_inbox_split.py:
1. **Drift gate (5)** — every previously-public symbol on a2a_tools
is the EXACT same object as a2a_tools_inbox.foo (`is`, not `==`),
catches a future "wrap with logging" refactor that silently loses
existing test coverage.
2. **Import contract (1)** — a2a_tools_inbox does NOT eagerly import
a2a_tools at module load. Pins the layered architecture: the
extracted slice depends on ``inbox`` + a lazy ``a2a_client``
import, never on the kitchen-sink that re-exports it.
3. **_enrich_inbound_for_agent branches (5)** — peer_id-empty
(canvas_user) returns dict unchanged; missing peer_id key same;
a2a_client unavailable (test harness, partial install) degrades
gracefully with a bare envelope; registry hit populates
peer_name + peer_role + agent_card_url; registry miss still
surfaces agent_card_url (constructable from peer_id alone).
The full timeout-clamp / validation / JSON-shape behavior matrix for
the three wrappers stays in test_a2a_tools_inbox_wrappers.py — those
tests pass identically against both the alias and the underlying impl.
Wiring updates:
- ``scripts/build_runtime_package.py``: add ``a2a_tools_inbox`` to
``TOP_LEVEL_MODULES`` so it ships in the runtime wheel and the
drift gate doesn't fail the next publish.
- ``.github/workflows/ci.yml``: add ``a2a_tools_inbox.py`` to
``CRITICAL_FILES`` so the 75% MCP/inbox/auth per-file floor
applies — this is now where the inbox-delivery code actually
lives.
Covers the user-visible flow that Phase 1-5b shipped (RFC #2891):
register a poll-mode workspace, POST a multi-file /chat/uploads, verify
the activity feed shows one chat_upload_receive row per file, fetch the
bytes via /pending-uploads/:fid/content, ack each row, and confirm a
post-ack fetch returns 404. Also pins cross-workspace bleed protection
(workspace B's bearer on A's URL → 401, B's URL with A's file_id →
404) and the file_id-UUID-parse 400 path.
23 assertions, all green against a local platform (Postgres+Redis+
platform-server stack matches the e2e-api.yml CI recipe verbatim).
Why a new script instead of extending test_poll_mode_e2e.sh: that
script tests A2A short-circuit + since_id cursor semantics; this one
tests the chat-upload path. They share zero handler code on the
platform side and would dilute each other's failure messages if
combined.
Why not the bearerless-401 strict-mode assertion: the platform's
wsauth fail-opens for bearerless requests when MOLECULE_ENV=development
(see middleware/devmode.go). The CI workflow doesn't set that var, but
some local-dev .env files do — the assertion would flap by environment
without testing the poll-mode upload contract. The middleware's own
unit tests cover strict-mode 401.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three shell E2E tests created scratch files via `mktemp` but never
deleted them on early exit (assertion failure, SIGINT, errexit). Each
CI run leaked ~10-100 KB of /tmp into the runner; over ~200 runs/week
that's 20+ MB of accumulated cruft.
## Files
- **test_chat_attachments_e2e.sh** — was missing both trap and rm;
added per-run TMPDIR_E2E with `trap 'rm -rf …' EXIT INT TERM`.
- **test_notify_attachments_e2e.sh** — had a `cleanup()` for the
workspace but didn't include the TMPF; only an unconditional
`rm -f` at the bottom (line 233) which doesn't fire on early exit.
Extended cleanup() to also rm the scratch + dropped the redundant
trailing rm.
- **test_chat_attachments_multiruntime_e2e.sh** — `round_trip()`
function had per-call `rm -f` only on the success path; failure
paths leaked. Switched to script-level TMPDIR_E2E + trap; per-call
rm dropped (the trap handles every return path including SIGINT).
Pattern: `mktemp -d -t prefix-XXX` for the dir, `mktemp <full-template>`
for files (portable across BSD/macOS + GNU coreutils — `-p` is
GNU-only and breaks Mac local-dev runs).
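The pattern, concretely (template names illustrative):

  TMPDIR_E2E="$(mktemp -d -t chat-e2e-XXX)"
  trap 'rm -rf "$TMPDIR_E2E"' EXIT INT TERM
  resp="$(mktemp "${TMPDIR_E2E}/resp-XXXXXX")"   # full-template form, no -p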
## Regression gate
New `tests/e2e/lint_cleanup_traps.sh` asserts every `*.sh` that calls
`mktemp` also has a `trap … EXIT` line in the file. Wired into the
existing Shellcheck (E2E scripts) CI step. Verified locally: passes
on the fixed state, fails-loud when one of the 3 fixes is reverted.
## Verification
- shellcheck --severity=warning clean on all 4 touched files
- lint_cleanup_traps.sh passes on the post-fix tree (6 mktemp users,
all have EXIT trap)
- Negative test: revert one fix → lint exits 1 with file:line +
suggested fix pattern in the error message (CI-grokkable
::error file=… annotation)
- Trap fires on SIGTERM mid-run (smoke-tested on macOS BSD mktemp)
- Trap fires on `exit 1` (smoke-tested)
## Bars met (7-axis)
- SSOT: trap pattern documented in lint message (one rule, one fix)
- Cleanup: this IS the cleanup hygiene fix
- 100% coverage: lint catches future regressions across all
`tests/e2e/*.sh` files, not just the 3 fixed today
- File-split: N/A (no files split)
- Plugin / abstract / modular: N/A (test infra, not product code)
Iteration 2 of RFC #2873.
Every staging push run for the last 4 SHAs was cancelled by the
matching pull_request run because both fired into the same
concurrency group:
group: ${{ github.workflow }}-${{ ...sha }}
Same SHA → same group → cancel-in-progress=true means the second
arrival cancels the first. Empirically the push run lost the race;
staging branch-protection then saw a CANCELLED required check and
the auto-promote chain stalled.
Fix: include github.event_name in the group key. push and
pull_request runs for the same SHA now hash to different groups,
both complete, both report SUCCESS to branch protection.
Pattern of the bug:
10:46 sha=1e8d7ae1 ev=pull_request conclusion=success
10:46 sha=1e8d7ae1 ev=push conclusion=cancelled
10:45 sha=ecf5f6fb ev=pull_request conclusion=success
10:45 sha=ecf5f6fb ev=push conclusion=cancelled
10:28 sha=471dff25 ev=pull_request conclusion=success
10:28 sha=471dff25 ev=push conclusion=cancelled
10:12 sha=9e678ccd ev=pull_request conclusion=success
10:12 sha=9e678ccd ev=push conclusion=cancelled
Same drift class as the 2026-04-28 auto-promote-staging incident
(memory: feedback_concurrency_group_per_sha.md) — globally-scoped
groups silently cancel runs in matched-SHA scenarios.
This is the only workflow in .github/workflows/ that uses the
narrow per-sha shape without event_name. Others either don't use
concurrency at all, or use ${{ github.ref }} which is
event-neutral.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous workflow applied only 049_delegations.up.sql — fragile to
future migrations that touch the delegations table or any other
handlers/-tested table. Operator would have to remember to update
the workflow's psql -f line per migration.
New behavior: loop every .up.sql in lexicographic order, apply each
with ON_ERROR_STOP=1 + per-migration result captured. Failed migrations
are SKIPPED rather than blocking the suite — handles the historical
migrations (017_memories_fts_namespace, 042_a2a_queue, etc.) that
depend on tables since renamed/dropped and can't replay from scratch.
Migrations that DO succeed land their tables, which is sufficient for
the integration tests in handlers/.
Sanity gate at the end: if the delegations table is missing after the
replay, hard-fail with a loud error. That catches a real regression
where 049 itself becomes broken (e.g., schema rename), separate from
the historical-broken-migration noise above.
Per-migration log line ("✓" or "⊘ skipped") makes it easy to spot
when a migration that SHOULD have replayed didn't.
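Sketch of the loop (paths illustrative):

  for f in workspace-server/migrations/*.up.sql; do
    if psql "$DATABASE_URL" -v ON_ERROR_STOP=1 -f "$f" >/dev/null 2>&1; then
      echo "✓ $(basename "$f")"
    else
      echo "⊘ skipped $(basename "$f")"   # historical, can't replay from scratch
    fi
  done
  psql "$DATABASE_URL" -c '\d delegations' >/dev/null \
    || { echo "FATAL: delegations missing after replay" >&2; exit 1; }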
Verified locally: full migration chain runs, 049 lands, all 7
integration tests pass against the chained-migration DB.
Closes #320.
Two-part PR:
## Fix: result_preview was lost on completion
Self-review of #2854 caught a real bug. SetStatus has a same-status
replay no-op; the order of calls in `executeDelegation` completion
+ `UpdateStatus` completed branch clobbered the preview field:
1. updateDelegationStatus(completed, "") fires
2. inner recordLedgerStatus(completed, "", "")
→ SetStatus transitions dispatched → completed with preview=""
3. outer recordLedgerStatus(completed, "", responseText)
→ SetStatus reads current=completed, status=completed
→ SAME-STATUS NO-OP, never writes responseText → preview lost
Confirmed against real Postgres (see integration test). Strict-sqlmock
unit tests passed because they pin SQL shape, not row state.
Fix: call the WITH-PREVIEW recordLedgerStatus FIRST, then
updateDelegationStatus. The inner call becomes the no-op (correctly
preserves the row written by the outer call).
Same gap fixed in UpdateStatus handler — body.ResponsePreview was
never landing in the ledger because updateDelegationStatus's nested
SetStatus(completed, "", "") fired first.
## Gate: real-Postgres integration tests + CI workflow
The unit-test-only workflow that shipped #2854 was the root cause.
Adding two layers of defense:
1. workspace-server/internal/handlers/delegation_ledger_integration_test.go
— `//go:build integration` tag, requires INTEGRATION_DB_URL env var.
4 tests:
* ResultPreviewPreservedThroughCompletion (regression gate for the
bug above — fires the production call sequence in fixed order
and asserts row.result_preview matches)
* ResultPreviewBuggyOrderIsLost (DIAGNOSTIC: confirms the
same-status no-op contract works as designed; if SetStatus's
semantics ever change, this test fires)
* FailedTransitionCapturesErrorDetail (failure-path symmetry)
* FullLifecycle_QueuedToDispatchedToCompleted (forward-only +
happy path)
2. .github/workflows/handlers-postgres-integration.yml
— required check on staging branch protection. Spins postgres:15
service container, applies the delegations migration, runs
`go test -tags=integration` against the live DB. Always-runs +
per-step gating on path filter (handlers/wsauth/migrations) so
the required-check name is satisfied on PRs that don't touch
relevant code.
Local dev workflow (file header documents this):
docker run --rm -d --name pg -e POSTGRES_PASSWORD=test -p 55432:5432 postgres:15-alpine
psql ... < workspace-server/migrations/049_delegations.up.sql
INTEGRATION_DB_URL="postgres://postgres:test@localhost:55432/molecule?sslmode=disable" \
go test -tags=integration ./internal/handlers/ -run "^TestIntegration_"
## Why this matters
Per memory `feedback_mandatory_local_e2e_before_ship`: backend PRs
MUST verify against real Postgres before claiming done. sqlmock pins
SQL shape; only a real DB can verify row state. The workflow makes
this gate mandatory rather than optional.
#2834 added a hard-fail when GH_TOKEN_FOR_ADMIN_API is missing on
schedule + pull_request + workflow_dispatch. The PR-trigger hard-fail
is now blocking every PR in the repo because the secret hasn't been
provisioned yet — including the staging→main auto-promote PR (#2831),
which has no path to set repo secrets itself.
Per feedback_schedule_vs_dispatch_secrets_hardening.md the original
concern is automated/silent triggers losing the gate without a human
to notice. That concern applies to **schedule** specifically:
- schedule: cron, no human, silent soft-skip = invisible regression →
KEEP HARD-FAIL.
- pull_request: a human is reviewing the PR diff and will see workflow
warnings inline. A PR cannot retroactively drift live state — drift
happens *between* PRs (UI clicks, manual gh api PATCH), which the
schedule canary catches. The PR-time gate would only catch typos in
apply.sh, which the *_payload unit tests catch more directly.
→ SOFT-SKIP with a prominent warning.
- workflow_dispatch: operator override, may not have configured the
secret yet. → SOFT-SKIP with warning.
The skip is explicit (SKIP_DRIFT_CHECK=1 surfaced to env, then a step
`if:` guard) so it's auditable in the workflow run UI, not silently
swallowed.
Unblocks #2831 (auto-promote staging→main) + every PR currently behind
this check.
Multi-model review of #2827 caught: the script as-shipped would have
silently weakened branch protection on EVERY non-checks dimension
the moment anyone ran it. Live staging had
enforce_admins=true, dismiss_stale_reviews=false, strict=true,
allow_fork_syncing=false, bypass_pull_request_allowances={
HongmingWang-Rabbit + molecule-ai app
}
Script wrote the opposite for all five. Per memory
feedback_dismiss_stale_reviews_blocks_promote.md, the
dismiss_stale_reviews flip alone is the load-bearing one — would
silently re-block every auto-promote PR (cost user 2.5h once).
This PR:
1. apply.sh: per-branch payloads (build_staging_payload /
build_main_payload) that codify the deliberate per-branch policy
already on the repo, with the script's net contribution being
ONLY the new check names (Canvas tabs E2E + E2E API Smoke on
staging, Canvas tabs E2E on main).
2. apply.sh: R3 preflight that hits /commits/{sha}/check-runs and
asserts every desired check name has at least one historical run
on the branch tip. Catches typos like "Canvas Tabs E2E" vs
"Canvas tabs E2E" — pre-fix a typo would silently block every PR
forever waiting for a context that never emits. Skip via
--skip-preflight for genuinely-new workflows whose first run
hasn't fired.
3. drift_check.sh: compares the FULL normalised payload (admin,
review, lock, conversation, fork-syncing, deletion, force-push)
not just the checks list. Pre-fix the drift gate would have
missed a UI click that flipped enforce_admins or
dismiss_stale_reviews. Drops app_id from the comparison since
GH auto-resolves -1 to a specific app id post-write.
4. branch-protection-drift.yml: per memory
feedback_schedule_vs_dispatch_secrets_hardening.md — schedule +
pull_request triggers HARD-FAIL when GH_TOKEN_FOR_ADMIN_API is
missing (silent skip masks the gate disappearing).
workflow_dispatch keeps soft-skip for one-off operator runs.
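A sketch of item 2's preflight (variable names illustrative; the
check-runs endpoint is GitHub's REST):

  mapfile -t have < <(curl -fsS \
    -H "Authorization: Bearer ${GH_TOKEN_FOR_ADMIN_API}" \
    -H "Accept: application/vnd.github+json" \
    "https://api.github.com/repos/${REPO}/commits/${TIP_SHA}/check-runs?per_page=100" \
    | jq -r '.check_runs[].name' | sort -u)
  for want in "${DESIRED_CHECKS[@]}"; do
    printf '%s\n' "${have[@]}" | grep -qxF "$want" \
      || { echo "preflight: no historical run emits '${want}'" >&2; exit 1; }
  done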
Verified by running drift_check against live state: pre-fix would
have shown 5 destructive drifts on staging + 5 on main. Post-fix
shows ONLY the 2 intended additions on staging + 1 on main, which
go away after `apply.sh` runs.
Closes #9.
Three pieces, all small:
1. **docs/e2e-coverage.md** — source of truth for which E2E suites
guard which surfaces. Today three were running but informational
only on staging; that's how the org-import silent-drop bug shipped
without a test catching it pre-merge. Now the matrix shows what's
required where + a follow-up note for the two suites that need an
always-emit refactor before they can be required.
2. **tools/branch-protection/apply.sh** — branch protection as code.
Lets `staging` and `main` required-checks live in a reviewable
shell script instead of UI clicks that get lost between admins.
This PR's net change: add `E2E API Smoke Test` and `Canvas tabs E2E`
as required on staging. Both already use the always-emit path-filter
pattern (no-op step emits SUCCESS when the workflow's paths weren't
touched), so making them required can't deadlock unrelated PRs.
3. **branch-protection-drift.yml** — daily cron + drift_check.sh
that compares live protection against apply.sh's desired state.
Catches out-of-band UI edits before they drift further. Fails the
workflow on mismatch; ops re-runs apply.sh or updates the script.
Out of scope (filed as follow-ups):
- e2e-staging-saas + e2e-staging-external use plain `paths:` filters
and never trigger when paths are unchanged. They need refactoring
to the always-emit shape (same as e2e-api / e2e-staging-canvas)
before they can be required.
- main branch protection mirrors staging here; if main wants the
E2E SaaS / External added later, do it in apply.sh and rerun.
Operator must apply once after merge:
bash tools/branch-protection/apply.sh
The drift check picks it up from there.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug: the case statement at line 189 grouped completed/failure |
completed/cancelled | completed/timed_out into the same "abort
+ exit 1" branch. cancelled ≠ failure — when per-SHA concurrency
(memory: feedback_concurrency_group_per_sha) cancels an older E2E
run because a newer push landed, the workflow blocked the whole
auto-promote chain on a non-failure.
Caught 2026-05-05 02:03 on sha 31f9a5e: E2E got cancelled by
concurrency, auto-promote :latest aborted with exit 1, the next
auto-promote-staging cycle had to manually clean up.
Split: failure/timed_out keep the abort path. cancelled gets its
own clean-defer branch (same shape as in_progress) — proceed=false
without exit 1, with a step-summary explaining likely concurrency
supersession and pointing operators at manual dispatch if they
need that specific SHA promoted.
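Sketch of the split (variable names illustrative):

  case "${STATUS}/${CONCLUSION}" in
    completed/success)  proceed=true ;;
    completed/failure|completed/timed_out)
      echo "E2E failed; aborting promote" >&2; exit 1 ;;
    completed/cancelled)
      proceed=false ;;   # likely superseded by per-SHA concurrency; defer
    *)  proceed=false ;; # in_progress etc.; defer
  esac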
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review of PR #2810 caught a regression: my mass-fix added
`2>/dev/null` to every curl invocation, suppressing stderr. The
original `|| echo "000"` shape only swallowed exit codes — stderr
(curl's `-sS`-shown dial errors, timeouts, DNS failures) still went
to the runner log so operators could see WHY a connection failed.
After PR #2810 the next deploy failure would log only the bare
HTTP code with no context. That's exactly the kind of diagnostic
loss that makes outages take longer to triage.
Drop `2>/dev/null` from each curl line — keep it on the `cat`
fallback (which legitimately suppresses "no such file" when curl
crashed before -w ran). The `>tempfile` redirect alone captures
curl's stdout (where -w writes) without touching stderr.
Same 8 files as #2810: redeploy-tenants-on-{main,staging},
sweep-stale-e2e-orgs, e2e-staging-{sanity,saas,external,canvas},
canary-staging.
Tests:
- All 8 files pass the lint
- YAML valid
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 2026-05-04 redeploy-tenants-on-main run for sha 2b862f6 emitted
"HTTP 000000" and failed the deploy. Root cause: when curl exits non-
zero (connection reset → 56, --fail-with-body 4xx/5xx → 22), the
`-w '%{http_code}'` already wrote a status to stdout; the inline
`|| echo "000"` then fires AND appends another "000" to the captured
substitution stdout. Result: HTTP_CODE="<actual><000>" — fails string
comparisons against "200" while looking superficially right.
Same class of bug the synth-E2E §7c gate hit twice (PRs #2779/#2783
+ #2797). Memory feedback_curl_status_capture_pollution.md.
Mass fix in 8 workflows: route -w into a tempfile so curl's exit
code can't pollute stdout. Wrap with set +e/-e so the non-zero
curl exit doesn't trip the outer pipeline.
redeploy-tenants-on-main.yml (production-critical, caught the bug)
redeploy-tenants-on-staging.yml (sibling)
sweep-stale-e2e-orgs.yml (cleanup loop)
e2e-staging-sanity.yml (E2E safety-net teardown)
e2e-staging-saas.yml
e2e-staging-external.yml
e2e-staging-canvas.yml
canary-staging.yml
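The safe capture shape, for reference (file names illustrative):

  code_file="$(mktemp)"
  set +e
  curl -sS --fail-with-body -o /tmp/body.json -w '%{http_code}' \
    "$URL" > "$code_file"
  set -e
  HTTP_CODE="$(cat "$code_file" 2>/dev/null || echo "000")"   # safe fallback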
Plus a new lint workflow `lint-curl-status-capture.yml` that runs on
every PR/push touching `.github/workflows/**`. Multi-line aware:
collapses bash `\` continuations, then matches the buggy
$(curl ... -w '%{http_code}' ... || echo "000") subshell shape.
Distinguishes from the SAFE $(cat tempfile || echo "000") shape
(cat with missing file emits empty stdout, no pollution).
Verified:
- All 8 workflows pass the lint locally
- A known-bad injection is caught
- A known-safe cat-fallback passes through
- yaml.safe_load clean on all changed files
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes part of #2790 (Phase A). The Python total floor at 86% (set in
workspace/pytest.ini, issue #1817) averages over ~6000 lines, so a
single MCP-critical file could regress to ~50% with no CI complaint as
long as other modules compensate. This is the same distribution gap
that #1823 closed Go-side: total floor passes while a critical handler
sits at 0%.
Added gates for these five files (per-file floor 75%):
- workspace/a2a_mcp_server.py — MCP dispatcher (PR #2766 / #2771)
- workspace/mcp_cli.py — molecule-mcp standalone CLI entry
- workspace/a2a_tools.py — workspace-scoped tool implementations
- workspace/inbox.py — multi-workspace inbox + per-workspace cursors
- workspace/platform_auth.py — per-workspace token resolver
These handle multi-tenant routing, auth tokens, and inbox dispatch.
Risk shape mirrors Go-side tokens*/secrets* — a 0%/50% file here is
exactly where the PR #2766 dispatcher bug class slips through without
a structural test.
Floor 75% is strictly additive — current actuals 80-96% (measured
2026-05-04). No existing PR fails. Ratchet plan in COVERAGE_FLOOR.md
target 90% by 2026-08-04.
Implementation: pytest already writes .coverage; new step emits a JSON
view scoped to the critical files via `coverage json --include="*name"`,
then jq extracts each file's percent_covered. Exact key match by
basename so workspace/builtin_tools/a2a_tools.py (a different 100%
file) doesn't shadow workspace/a2a_tools.py.
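Sketch of the gate step (include patterns abridged; the exact-basename
matching that avoids the builtin_tools shadow is elided here):

  coverage json -o critical.json \
    --include="*/a2a_mcp_server.py,*/mcp_cli.py,*/inbox.py,*/platform_auth.py"
  jq -r '.files | to_entries[]
         | "\(.key) \(.value.summary.percent_covered)"' critical.json \
    | awk -v floor=75 '$2 + 0 < floor { printf "FAIL %s at %.1f%%\n", $1, $2; bad = 1 }
                       END { exit bad }'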
Verified locally with the actual coverage data:
- floor=75 → 0 failures (matches current state)
- floor=81 → 1 failure (a2a_tools.py at 80%) — proves the gate trips
Pairs with PR #2791 (Phase B — schema↔dispatcher AST drift gate). Phase
C (molecule-mcp e2e harness) remains the largest piece in #2790.
YAML validated locally before commit per
feedback_validate_yaml_before_commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Today's 4 cancelled canaries (25319625186 / 25320942822 / 25321618230 /
25322499952) were all blown by the workflow timeout despite the
underlying tenant boot completing successfully (PR molecule-controlplane#455
fix verified — boot events all reach `boot_script_finished/ok`).
Why the budget was wrong:
The tenant user-data install phase runs apt-get update + install of
docker.io / jq / awscli / caddy / amazon-ssm-agent FROM RAW UBUNTU on
every tenant boot — none of it is pre-baked into the tenant AMI
(EC2_AMI=ami-0ea3c35c5c3284d82, raw Jammy 22.04). Empirical
fetch_secrets/ok timing across today's canaries:
51s debug-mm-1777888039 (09:47Z)
82s 25319625186 (12:42Z)
143s 25320942822 (13:11Z)
625s 25322499952 (13:43Z)
Same EC2_AMI, same instance type (t3.small), same user-data install
sequence — variance is entirely apt-mirror tail latency. A 12-min job
budget leaves only ~2 min for the workspace on slow-apt days; the
workspace itself needs ~3.5 min for claude-code cold boot, so the
budget is structurally too tight whenever apt is slow.
Bumping the job timeout-minutes to 20 absorbs even the 10+ min boot
worst-case and still leaves the workspace its full ~7 min budget.
The cap stays well under the runner's 6-hour ubuntu-latest job
ceiling.
Real fix: pre-bake caddy + ssm-agent into the tenant AMI so the boot
phase is no-ops on cached pkgs (will file controlplane#TBD as
follow-up — packer/install-base.sh today only bakes the WORKSPACE thin
AMI, not the tenant AMI; tenants always boot from raw Ubuntu).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Change cron from '10,30,50' (3 fires/hour) to '2,12,22,32,42,52'
(6 fires/hour). All new slots are 1-3 min away from any other
cron, avoiding both the cf-sweep collisions (:15, :45) and the
:30 heavy slot (canary-staging /30, sweep-aws-secrets,
sweep-stale-e2e-orgs every :15).
Why: empirically 2026-05-04 the canary fired only once per hour
on the 10,30,50 schedule (see #2726). Bumping fires-per-hour
gives more chances for a firing to survive GH's load-related
drop ratio, and keeping all slots in clean lanes minimizes
the per-fire drop probability.
At empirically-observed ~67% drop ratio, 6 attempts/hour yields
~2 effective fires = ~30 min cadence; closer to the 20-min
target than the current shape and provides a real degradation
alarm if drops get worse.
Cost: ~$0.50/day → ~$1/day. Negligible.
Closes #2726.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a third secrets-injection branch in test_staging_full_saas.sh
behind a new E2E_ANTHROPIC_API_KEY env var, wired into all three
auto-running E2E workflows (canary-staging, e2e-staging-saas,
continuous-synth-e2e) via a new MOLECULE_STAGING_ANTHROPIC_API_KEY
repo secret slot.
Operator motivation: after #2578 (the staging OpenAI key went over
quota and stayed dead 36+ hours) we shipped #2710 to migrate the
canary + full-lifecycle E2E to claude-code+MiniMax. Discovered post-
merge that MOLECULE_STAGING_MINIMAX_API_KEY had never been set after
the synth-E2E migration on 2026-05-03 either — synth has been red the
whole time, not just OpenAI quota.
Setting up a MiniMax billing account from scratch is non-trivial
(needs platform-specific signup, KYC, top-up). Operators who already
have an Anthropic API key for their own Claude Code session can now
just set MOLECULE_STAGING_ANTHROPIC_API_KEY and have all three
auto-running E2E gates green within one cron firing.
Priority chain in test_staging_full_saas.sh (first non-empty wins):
1. E2E_MINIMAX_API_KEY → MiniMax (cheapest)
2. E2E_ANTHROPIC_API_KEY → direct Anthropic (cheaper than gpt-4o,
lower setup friction than MiniMax)
3. E2E_OPENAI_API_KEY → langgraph/hermes paths
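Sketch of the chain (payload shapes abridged; the real OpenAI branch
also carries the HERMES_*/MODEL_PROVIDER keys):

  if [ -n "${E2E_MINIMAX_API_KEY:-}" ]; then
    SECRETS_JSON="$(jq -n --arg k "$E2E_MINIMAX_API_KEY" '{MINIMAX_API_KEY: $k}')"
  elif [ -n "${E2E_ANTHROPIC_API_KEY:-}" ]; then
    SECRETS_JSON="$(jq -n --arg k "$E2E_ANTHROPIC_API_KEY" '{ANTHROPIC_API_KEY: $k}')"
  elif [ -n "${E2E_OPENAI_API_KEY:-}" ]; then
    SECRETS_JSON="$(jq -n --arg k "$E2E_OPENAI_API_KEY" '{OPENAI_API_KEY: $k}')"
  else
    echo "no LLM key configured" >&2; exit 1
  fi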
Verify-key case-statement in all three workflows accepts EITHER
MiniMax OR Anthropic for runtime=claude-code; error message names
both options so operators know they don't have to register a MiniMax
account if they already have an Anthropic key.
Pinned to runtime=claude-code — hermes/langgraph use OpenAI-shaped
envs and won't honour ANTHROPIC_API_KEY without further wiring.
After this lands + secret is set, the dispatched canary verifies the
new path:
gh workflow run canary-staging.yml --repo Molecule-AI/molecule-core --ref staging
Bundles the same hermes+OpenAI → claude-code+MiniMax migration onto
the full-lifecycle E2E that's been red on every provisioning-critical
push since 2026-05-01. Same root cause as the canary fix in the prior
commit: MOLECULE_STAGING_OPENAI_KEY hit insufficient_quota and there's
no SLA on operator billing top-up.
Same shape as canary commit: claude-code as default runtime + MiniMax
as primary key + hermes/langgraph kept as workflow_dispatch options
with OpenAI fallback. Per-runtime verify-key case-statement matches
canary-staging.yml + continuous-synth-e2e.yml byte-for-byte.
Two extra wrinkles vs canary:
- Dispatch input `runtime` default flipped from "hermes" to "claude-code"
so operators dispatching from the UI get the safe path by default.
They can still pick hermes/langgraph from the dropdown when they
specifically want to exercise OpenAI.
- E2E_MODEL_SLUG is dispatch-aware: MiniMax-M2.7-highspeed for
claude-code, openai/gpt-4o for hermes (slash-form per
derive-provider.sh), openai:gpt-4o for langgraph (colon-form per
init_chat_model). The branch comment in lib/model_slug.sh covers
the rationale; pinning the slug here keeps the dispatch UX stable
even when operators don't override.
After this lands + the canary commit lands, the only OpenAI-dependent
E2E surface is the operator-dispatch fallback. The cron canary, the
synth E2E, AND the full-lifecycle gate are all on MiniMax — separate
billing account, no OpenAI quota dependency on auto-runs.
Mirror the migration continuous-synth-e2e.yml made on 2026-05-03 (#265).
Both workflows hit the same MOLECULE_STAGING_OPENAI_KEY which went over
quota on 2026-05-01 (#2578) and stayed dead — the canary has been red
for 36+ hours waiting on operator billing top-up.
This switch breaks the canary's dependency on OpenAI billing entirely:
claude-code template's `minimax` provider routes ANTHROPIC_BASE_URL to
api.minimax.io/anthropic and reads MINIMAX_API_KEY at boot. MiniMax is
~5-10x cheaper per token than gpt-4.1-mini AND on a separate billing
account, so a future OpenAI quota collapse no longer wedges the
canary's "is staging alive?" signal.
Changes:
- E2E_RUNTIME: hermes → claude-code
- Add E2E_MODEL_SLUG: MiniMax-M2.7-highspeed (pin to MiniMax — the
per-runtime claude-code default is "sonnet" which routes to direct
Anthropic and would defeat the cost saving)
- Add E2E_MINIMAX_API_KEY env wired to MOLECULE_STAGING_MINIMAX_API_KEY
- Keep E2E_OPENAI_API_KEY as fallback for operator-dispatched runs that
set E2E_RUNTIME=hermes via workflow_dispatch
- "Verify OpenAI key present" → per-runtime "Verify LLM key present"
case statement matching synth E2E's exact shape (claude-code requires
MiniMax, langgraph/hermes require OpenAI). Hard-fail on missing
required key per #2578's lesson — soft-skip silently fell through to
the wrong SECRETS_JSON branch and produced a confusing auth error
5 min later instead of the clean "secret missing" message at the top.
Verifies #2578 root cause won't recur on the canary path. The synth
E2E and the manual e2e-staging-saas dispatch can still hit OpenAI when
explicitly chosen — only the cron canary moves off it.
The previous soft-skip-on-dispatch path used `exit 0`, which only
ends the STEP — the rest of the workflow continued with empty
secrets. Caught 2026-05-04 by dispatched run 25296530706:
- E2E_MINIMAX_API_KEY: empty
- verify-secrets printed warning + exit 0
- Install required tools: ran
- Run synthetic E2E: ran with empty MiniMax key
- SECRETS_JSON branched to OpenAI shape (MINIMAX empty → fall through)
- But model slug stayed MiniMax-M2.7-highspeed (workflow env)
- Workspace booted with OpenAI keys + MiniMax model
- 5 min later: "Agent error (Exception)" — claude SDK 401'd
against api.minimax.io with the OpenAI key
The confusing failure mode silently masked the real problem (missing
secret) under a runtime-error label. Fix: drop both soft-skip paths
and exit 1 always. Operators who want to verify a YAML change without
setting up secrets can read the verify-secrets step's stderr — the
failure IS the verification signal.
Pure visibility fix; preserves the cron hard-fail path (now also the
dispatch hard-fail path). No mechanism change beyond the exit code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GitHub Actions scheduler de-prioritises :00 cron firings under load.
Empirical 2026-05-03: the canary's cron was '0,20,40 * * * *' but
actual firings landed at :08, :03, :01, :03 — :20 and :40 silently
dropped. Detection latency degraded from claimed 20 min to actual
~60 min worst case.
Move to '10,30,50 * * * *':
- :10/:30/:50 sit 10 min off the top-of-hour load peak
- Still 5 min from :15 sweep-cf-orphans and :45 sweep-cf-tunnels
(the original constraint that kept us off :15/:45)
- Same 20-min cadence; only the phase changes
No code change beyond the cron expression + comment refresh.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to #2648 — same `>/dev/null || true` swallow-on-error
pattern existed in:
e2e-staging-canvas.yml (single-slug)
e2e-staging-saas.yml (loop)
e2e-staging-sanity.yml (loop)
e2e-staging-external.yml (loop, was `>/dev/null 2>&1` variant)
All four now capture the HTTP code, log a "[teardown] deleted $slug
(HTTP $code)" line on success, and emit a workflow warning naming
the slug + body excerpt on non-2xx. Loop bodies also tally + summarise
total leaks at the end.
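Per-slug shape, sketched (endpoint and variable names illustrative):

  code="$(curl -sS -o /tmp/del-body.txt -w '%{http_code}' -X DELETE \
          -H "Authorization: Bearer ${CP_ADMIN_TOKEN}" \
          "${CP_URL}/orgs/${slug}")" || true   # on connect failure -w prints 000
  case "$code" in
    2*) echo "[teardown] deleted $slug (HTTP $code)" ;;
    *)  leaks=$((leaks + 1))
        echo "::warning::teardown leak: $slug (HTTP $code) $(head -c 120 /tmp/del-body.txt)" ;;
  esac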
Exit semantics unchanged: a single cleanup miss still doesn't fail-flag
the test (sweep-stale-e2e-orgs is the safety net within ~45 min). The
behavior change is purely surfacing — failures that were silent are
now visible on the workflow run page.
Pairs with #2648's tightened sweeper. Together: per-run cleanup
failures are visible AND the safety net catches them quickly.
Closes the per-workflow port noted as out-of-scope in #2648.
See molecule-controlplane#420.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes that close one of the leak classes from the
molecule-controlplane#420 vCPU audit:
1. sweep-stale-e2e-orgs.yml: cron */15 (was hourly), MAX_AGE_MINUTES
30 (was 120). E2E runs are 8-25 min wall clock; 30 min is safely
above the longest run while shrinking the worst-case leak window
from ~2h to ~45 min (15-min sweep cadence + 30-min threshold).
2. canary-staging.yml teardown: the per-slug DELETE used `>/dev/null
|| true`, which swallowed every failure. A 5xx or timeout from CP
looked identical to "successfully deleted" and the canary tenant
kept eating ~2 vCPU until the sweeper caught it. Now we capture
the response code and surface non-2xx as a workflow warning that
names the leaked slug.
The exit semantics stay unchanged — a single-canary cleanup miss
shouldn't fail-flag the canary itself when the actual smoke check
passed. The sweeper is the safety net for whatever slips past.
Caught during the molecule-controlplane#420 audit on 2026-05-03 —
3 e2e canary tenant orphans were running for 24-95 min, all under
the previous 120-min sweep threshold so they went unnoticed until
manual cleanup. Same `|| true` pattern exists in
e2e-staging-{canvas,external,saas,sanity}.yml; out of scope for
this PR (mechanical port; tracking separately) but the sweeper
tightening covers all of them by reducing the safety-net latency.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cuts the per-run LLM cost ~10x (MiniMax M2.7 vs gpt-4.1-mini) and
removes the recurring OpenAI-quota-exhaustion failure mode that took
the canary down on 2026-05-03 (#265 — staging quota burnt for ~16h).
Path:
E2E_RUNTIME=claude-code (default)
→ workspace-configs-templates/claude-code-default/config.yaml's
`minimax` provider (lines 64-69)
→ ANTHROPIC_BASE_URL auto-set to api.minimax.io/anthropic
→ reads MINIMAX_API_KEY (per-vendor env, no collision with
GLM/Z.ai etc.)
Workflow changes (continuous-synth-e2e.yml):
- Default runtime: langgraph → claude-code
- New env: E2E_MODEL_SLUG (defaults to MiniMax-M2.7-highspeed,
overridable via workflow_dispatch)
- New secret wire: E2E_MINIMAX_API_KEY ←
secrets.MOLECULE_STAGING_MINIMAX_API_KEY
- Per-runtime missing-secret guard: claude-code requires MINIMAX,
langgraph/hermes require OPENAI. Cron firing hard-fails on missing
key for the active runtime; dispatch soft-skips so operators can
ad-hoc test without setting up the secret first
- Operators can still pick langgraph/hermes via workflow_dispatch;
the OpenAI fallback path stays wired
Script changes (tests/e2e/test_staging_full_saas.sh):
- SECRETS_JSON branches on which key is set:
E2E_MINIMAX_API_KEY → {MINIMAX_API_KEY: <key>} (claude-code path)
E2E_OPENAI_API_KEY → {OPENAI_API_KEY, HERMES_*, MODEL_PROVIDER} (legacy)
MiniMax wins when both are present — claude-code default canary
must not accidentally consume the OpenAI key
Tests (new tests/e2e/test_secrets_dispatch.sh):
- 10 cases pinning the precedence + payload shape per branch
- Discipline check verified: 5 of 10 FAIL on a swapped if/elif
(precedence inversion), all 10 PASS on the fix
- Anchors on the section-comment header so a structural refactor
fails loudly rather than silently sourcing nothing
The model_slug dispatcher (lib/model_slug.sh) needs no change:
E2E_MODEL_SLUG override path is already wired (line 41), and
claude-code template's `minimax-` prefix matcher catches
"MiniMax-M2.7-highspeed" via lowercase-on-lookup.
Operator action required to land green:
- Set MOLECULE_STAGING_MINIMAX_API_KEY in repo secrets
(Settings → Secrets and Variables → Actions). Use
`gh secret set MOLECULE_STAGING_MINIMAX_API_KEY -R Molecule-AI/molecule-core`
to avoid leaking the value into shell history.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #2571 fixed synth-E2E by branching MODEL_SLUG per runtime, but only
the langgraph branch was verified at runtime — hermes / claude-code /
override / fallback had zero automated coverage. A future regression
(e.g. dropping the langgraph case) would silently revert and only
surface as "Could not resolve authentication method" mid-E2E.
This PR:
- Extracts the dispatch into tests/e2e/lib/model_slug.sh as a sourceable
pick_model_slug() function. No behavior change.
- Adds tests/e2e/test_model_slug.sh — 9 assertions across all 5 dispatch
branches plus the override path. Verified to FAIL when any branch is
flipped (manually regressed langgraph slash-form to confirm the test
catches it; restored before commit).
- Wires the unit test into ci.yml's existing shellcheck job (only runs
when tests/e2e/ or scripts/ change). Pure-bash, no live infra.
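Sketch of the extracted dispatch (slug forms as documented in this
series; the real function's branch set and fallback differ):

  pick_model_slug() {
    local runtime="$1"
    if [ -n "${E2E_MODEL_SLUG:-}" ]; then   # explicit override wins
      echo "$E2E_MODEL_SLUG"; return 0
    fi
    case "$runtime" in
      langgraph)   echo "openai:gpt-4o" ;;   # colon-form per init_chat_model
      hermes)      echo "openai/gpt-4o" ;;   # slash-form per derive-provider.sh
      claude-code) echo "sonnet" ;;          # per-runtime default
      *)           echo "unknown runtime: $runtime" >&2; return 64 ;;
    esac
  }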
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>