molecule-core

Author	SHA1	Message	Date
devops-engineer	fab65c78d6	fix(ci): rewrite retarget-main-to-staging for Gitea REST API
Root cause: same as #65/#73. The gh CLI calls Gitea GraphQL (/api/graphql), which returns HTTP 405. Specifically:
- gh api -X PATCH /pulls/{N} sometimes works but is flaky on Gitea (depends on gh's host-resolution layer)
- gh pr close / gh pr comment route through GraphQL → 405
Fix: replace all gh calls with direct curl REST calls to Gitea:
- PATCH /api/v1/repos/{owner}/{repo}/pulls/{index} with body {"base": "staging"}: retarget the PR base
- POST /api/v1/repos/{owner}/{repo}/issues/{index}/comments: post the explainer comment (PRs are issues in Gitea, so comments share the issue endpoint)
- PATCH /api/v1/repos/{owner}/{repo}/pulls/{index} with body {"state": "closed"}: close the redundant PR for the #1884 case
Identity: switch from secrets.GITHUB_TOKEN (per-job ephemeral, narrow scope on Gitea) to secrets.AUTO_SYNC_TOKEN (devops-engineer persona). Same persona used by auto-sync (#66) and auto-promote (#78). Per feedback_per_agent_gitea_identity_default. PR-edit and comment do not need branch-protection bypass.
Curl-status-capture pattern hardened per feedback_curl_status_capture_pollution: http_code via -w to its own scalar, body to a tempfile, set +e/-e bracket so curl's non-zero-on-4xx doesn't pollute the script's exit chain.
Header comment block fully rewritten with 4 failure-mode runbooks (A: 422 dup-base, B: token rotated, C: PR deleted, D: filter mis-fire) per PR #66/#78's pattern.
Refs: #65, #74, #196, PR #66 + #78 (canonical reference)
Closes #74
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 15:28:26 -07:00
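The three REST calls named in this commit can be sketched as one small curl wrapper. This is an illustration, not the workflow's actual code: `GITEA_URL`, `GITEA_TOKEN`, `OWNER`, `REPO` and all function names are hypothetical, and the `Authorization: token ...` header is the form Gitea's API documents for access tokens.

```shell
#!/usr/bin/env bash
# Sketch only. GITEA_URL, GITEA_TOKEN, OWNER and REPO are hypothetical
# variable names; the commit does not show the workflow's real ones.

# gitea_call METHOD PATH JSON_BODY BODY_FILE
# Prints the HTTP status code ("000" if curl produced none) and writes the
# response body to BODY_FILE.
gitea_call() {
  local method=$1 path=$2 json=$3 body_file=$4 code_file status
  code_file=$(mktemp)
  # The status code goes to its own file, so a failing curl cannot append
  # anything to it; "|| true" only swallows curl's exit code.
  curl -sS -X "$method" \
    -H "Authorization: token ${GITEA_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "$json" -o "$body_file" -w '%{http_code}' \
    "${GITEA_URL}/api/v1/repos/${OWNER}/${REPO}${path}" \
    >"$code_file" 2>/dev/null || true
  status=$(cat "$code_file")
  rm -f "$code_file"
  printf '%s\n' "${status:-000}"
}

# retarget_pr INDEX BODY_FILE: point the PR at staging.
retarget_pr() { gitea_call PATCH "/pulls/$1" '{"base": "staging"}' "$2"; }

# comment_pr INDEX JSON_BODY BODY_FILE: PR comments use the issue endpoint.
comment_pr() { gitea_call POST "/issues/$1/comments" "$2" "$3"; }

# close_pr INDEX BODY_FILE: close a redundant PR.
close_pr() { gitea_call PATCH "/pulls/$1" '{"state": "closed"}' "$2"; }
```

A caller would compare the printed status against the expected code, for example `[ "$(retarget_pr 12 /tmp/resp.json)" = "201" ]`, and read the body file when writing a runbook-style error message. The PR index 12 and the expected code are placeholders.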
claude-ceo-assistant	0cef033a6a	ci(canary): route curl -w to tempfile to satisfy status-capture lint
The two API probes used the unsafe shape rejected by lint-curl-status-capture.yml (per feedback_curl_status_capture_pollution):
    status=$(curl ... -w '%{http_code}' ... || echo "000")
When curl exits non-zero (transport error, --fail-with-body 4xx/5xx), the `-w` already wrote a code; the `|| echo "000"` then APPENDS another "000", yielding "000000" or "409000", which passes shape checks while looking right.
Switch to the canonical safe shape (set +e + tempfile + cat):
    set +e
    curl ... -w '%{http_code}' >code_file 2>/dev/null
    set -e
    status=$(cat code_file 2>/dev/null || true)
    [ -z "$status" ] && status="000"
Inline comment in both probe steps explains the lint constraint so the next editor doesn't re-introduce the bad pattern.
Refs: #72, lint failure on PR #77 (1/22 red → 22/22 expected green)	2026-05-07 15:26:22 -07:00
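The pollution described in this commit can be reproduced offline. The sketch below swaps curl for a hypothetical `fake_curl` function that behaves like `curl --fail-with-body -w '%{http_code}'` on a 409: it prints the code and still exits non-zero (22 is curl's exit code for HTTP errors under `--fail`).

```shell
#!/usr/bin/env bash
set -e

# Hypothetical stand-in for curl: writes the status code to stdout AND
# exits non-zero, as curl does with --fail-with-body on a 4xx response.
fake_curl() { printf '409'; return 22; }

# Unsafe shape: the fallback's output is appended to what -w already printed.
unsafe=$(fake_curl || echo "000")

# Safe shape: the code lands in a tempfile and the exit code is ignored.
code_file=$(mktemp)
set +e
fake_curl >"$code_file" 2>/dev/null
set -e
safe=$(cat "$code_file" 2>/dev/null || true)
[ -z "$safe" ] && safe="000"
rm -f "$code_file"

echo "unsafe=$unsafe safe=$safe"   # prints: unsafe=409000 safe=409
```

The unsafe value is six digits long, so a check such as "is it numeric" still passes, which is why the lint rejects the shape outright rather than validating the value.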
claude-ceo-assistant	b83b533381	Merge branch 'main' into fix/144-branch-protection-check-name-parity-audit	2026-05-07 22:24:45 +00:00
claude-ceo-assistant	a23cf6a6bb	Merge branch 'main' into fix/harness-replays-pre-clone-manifest	2026-05-07 22:24:42 +00:00
devops-engineer	6acd63fa5a	fix(ci): rewrite auto-promote staging→main for Gitea REST API
Root cause: same as #65/PR-#66. The gh CLI calls Gitea GraphQL (/api/graphql), which returns HTTP 405. Additionally, gh workflow run calls /actions/workflows/{id}/dispatches, which does not exist on Gitea 1.22.6 (verified via swagger.v1.json).
Fix:
- Replace gh run list with the Gitea REST combined-status endpoint (GET /repos/{owner}/{repo}/commits/{ref}/status). Combined state encodes the AND across every check context, which is simpler than the per-workflow loop and immune to workflow-name collisions.
- Replace gh pr create / merge --auto with direct curl calls to POST /pulls and POST /pulls/{N}/merge with merge_when_checks_succeed.
- Remove the post-merge polling tail entirely. The GitHub-era GITHUB_TOKEN no-recursion rule does not apply on Gitea Actions (verified empirically: PR #66 merge fired downstream pushes naturally). Even if we wanted to dispatch, Gitea has no workflow_dispatch REST endpoint.
Critical constraint: main has enable_push: false with no whitelist; direct push is impossible for any persona. PR-mediated merge is the only path. main has required_approvals: 1, so auto-merge waits for Hongming's approval before landing, preserving the feedback_prod_apply_needs_hongming_chat_go contract.
Identity: AUTO_SYNC_TOKEN (devops-engineer persona). Not founder PAT. Per feedback_per_agent_gitea_identity_default. Same persona used by auto-sync (PR #66), which keeps the identity model coherent.
Header comment block fully rewritten with 4 failure-mode runbooks (A: gates not green, B: PR-create non-201, C: merge schedule fails, D: token rotated/scope wrong) per PR #66's pattern.
Refs: #65, #73, #195, PR #66 (canonical reference)
Closes #73
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 15:24:28 -07:00
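The gating logic this commit describes, one combined state deciding whether to promote, might look roughly like the sketch below. The function names are hypothetical, the grep-based extraction is a stand-in for a real JSON parser such as jq, and the `"Do": "merge"` field is an assumption (the commit names only `merge_when_checks_succeed`).

```shell
#!/usr/bin/env bash
# Sketch only: all function names here are hypothetical.

# combined_state: read a combined-status JSON document on stdin and print
# the first "state" value. Assumes the top-level "state" key appears before
# any nested one; a JSON parser would be more robust.
combined_state() {
  grep -o '"state"[[:space:]]*:[[:space:]]*"[a-z]*"' \
    | head -n 1 \
    | sed 's/.*"\([a-z]*\)"$/\1/'
}

# promote_decision STATE: only an all-green combined state promotes.
promote_decision() {
  case "$1" in
    success) echo promote ;;
    pending) echo wait ;;
    *)       echo block ;;   # failure, error, or an unreadable response
  esac
}

# auto_merge_payload: body for POST /pulls/{N}/merge. The merge style
# ("Do": "merge") is an assumption, not taken from the commit.
auto_merge_payload() {
  printf '%s\n' '{"Do": "merge", "merge_when_checks_succeed": true}'
}

# Wiring, assuming the response body was saved to status.json:
#   decision=$(promote_decision "$(combined_state <status.json)")
```

Treating every state other than `success` and `pending` as `block` means an unreadable or empty response fails closed rather than promoting by accident.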
claude-ceo-assistant	bfc393c065	ci: add AUTO_SYNC_TOKEN rotation drift canary (#72)
Adds a 6h-cron synthetic check that fires the auth surface used by auto-sync-main-to-staging.yml (PR #66) and emits a red workflow status when AUTO_SYNC_TOKEN has drifted out of validity. Closes hostile-self-review weakest-spot #3 from PR #66 (token-rotation detection latency).
Read-only verification: no writes, no synthetic merge commits, no canary branch noise. Three probes:
1. GET /api/v1/user → token authenticates as devops-engineer
2. GET /api/v1/repos/molecule-ai/molecule-core → read:repository scope
3. git ls-remote refs/heads/staging → exact HTTPS auth path used by actions/checkout in the real auto-sync workflow
Hard-fail on missing AUTO_SYNC_TOKEN secret on both schedule and workflow_dispatch. Per feedback_schedule_vs_dispatch_secrets_hardening, a silent soft-skip would make the canary itself drift-invisible (the sweep-cf-orphans #2088 lesson). Operator runbook in workflow header.
Token reuse: same AUTO_SYNC_TOKEN as the workflow under monitor; no new credential introduced. Read-only paths only.
Refs: #72, hostile-self-review #66	2026-05-07 15:23:03 -07:00
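A minimal sketch of the first two probes and the hard-fail rule follows. It checks HTTP status only; the commit's first probe also confirms the token authenticates as devops-engineer, and the third probe (git ls-remote) is left out here. `GITEA_URL`, `probe_status` and `run_canary` are hypothetical names; the host default comes from the URLs quoted elsewhere in this log.

```shell
#!/usr/bin/env bash
# Sketch only. AUTO_SYNC_TOKEN is named in the commit; everything else
# (GITEA_URL, function names) is hypothetical.
GITEA_URL=${GITEA_URL:-https://git.moleculesai.app}

# probe_status URL: print the HTTP status of an authenticated GET,
# or "000" when curl produced no code at all.
probe_status() {
  local code_file status
  code_file=$(mktemp)
  curl -sS -o /dev/null -w '%{http_code}' \
    -H "Authorization: token ${AUTO_SYNC_TOKEN}" "$1" \
    >"$code_file" 2>/dev/null || true
  status=$(cat "$code_file")
  rm -f "$code_file"
  printf '%s\n' "${status:-000}"
}

run_canary() {
  # A missing secret is a failure, never a skip: a skipped canary would
  # stay green while the token it monitors is gone.
  if [ -z "${AUTO_SYNC_TOKEN:-}" ]; then
    echo "AUTO_SYNC_TOKEN is not set; failing instead of skipping" >&2
    return 1
  fi
  local path status
  for path in /api/v1/user /api/v1/repos/molecule-ai/molecule-core; do
    status=$(probe_status "${GITEA_URL}${path}")
    if [ "$status" != "200" ]; then
      echo "probe ${path} returned ${status}" >&2
      return 1
    fi
  done
  echo "canary ok"
}
```

Calling `run_canary` as the last command of a workflow step would turn any non-200 probe or a missing secret into a red run.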
devops-engineer	2679fdd01a	chore: sync main → staging (manual, resolve auto-sync workflow conflict, post-#66)
# Conflicts:
# .github/workflows/auto-sync-main-to-staging.yml	2026-05-07 15:08:20 -07:00
devops-engineer	6235ef7461	fix(ci): rewrite auto-sync main→staging for Gitea direct push
Root cause of `Auto-sync main → staging / sync-staging (push)` failing every push to main since the GitHub→Gitea migration:
The workflow assumed a GitHub `merge_queue` ruleset on staging (blocking direct push) and used `gh pr create` + `gh pr merge --auto` to land sync via the queue. On Gitea this fails at the `gh pr create` step with `HTTP 405 Method Not Allowed (https://git.moleculesai.app/api/graphql)`: Gitea exposes no GraphQL endpoint, and the GitHub CLI cannot ship PRs against Gitea.
Verified failure mode in run 1117/job 0 (token logs at /tmp/log2.txt, run target /molecule-ai/molecule-core/actions/runs/1117/jobs/0). The merge step succeeded and pushed auto-sync/main-1e1f4d63; the PR step failed with the 405. So every main push left an orphan auto-sync/* branch and a red CI status, with no PR to land it.
Fix: the staging branch protection on Gitea (`enable_push: true`, `push_whitelist_usernames: [devops-engineer]`) already permits direct push from the devops-engineer persona. Drop the entire merge-queue PR architecture and replace with:
1. Checkout staging with secrets.AUTO_SYNC_TOKEN (devops-engineer persona token, NOT founder PAT; see `feedback_per_agent_gitea_identity_default`).
2. `git fetch origin main` + ff-merge or no-ff merge.
3. `git push origin staging` directly.
The AUTO_SYNC_TOKEN repo secret already exists (created 2026-05-07 14:00 alongside the staging push_whitelist update).
Workflow name + job name unchanged → required-check name `Auto-sync main → staging / sync-staging (push)` keeps the same context, no branch-protection edits needed.
Rejected alternatives (documented in workflow header):
- Reuse PR architecture via Gitea REST: ~80 LOC of API plumbing for no benefit; direct push works.
- GH_HOST=git.moleculesai.app: still calls /api/graphql, same 405; doesn't fix the root issue.
- Custom JS action: external dep for a 5-line `git push`.
Header comment in the workflow now documents:
- What this workflow does (SSOT for staging advancing).
- Why direct push (GitHub merge_queue → Gitea push_whitelist).
- Identity and token (anti-bot-ring per saved memory).
- Failure modes A–D with operator runbook for each.
- Loop safety (push to staging doesn't fire push:main → no recursion).
Verification plan: this fix-PR's merge to main is itself the trigger; watch the workflow run on the merge commit and on one follow-up trigger commit, expect both green.
Refs: failing run https://git.moleculesai.app/molecule-ai/molecule-core/actions/runs/1117/jobs/0
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 15:04:12 -07:00
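Steps 2 and 3 of the fix can be sketched as one function. This assumes the current directory is already a checkout of staging whose `origin` remote accepts pushes from the configured identity (step 1); the function name and merge message are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: sync_main_into_staging is a hypothetical name.
# Run from a checkout of the staging branch.
sync_main_into_staging() {
  git fetch --quiet origin main || return 1
  # Fast-forward when staging has no commits of its own; otherwise fall
  # back to an explicit merge commit. A conflict makes the merge fail,
  # and the function returns non-zero without pushing.
  if ! git merge --quiet --ff-only FETCH_HEAD 2>/dev/null; then
    git merge --quiet --no-ff -m "chore: sync main into staging" FETCH_HEAD || return 1
  fi
  git push --quiet origin HEAD:staging
}
```

Because the push targets staging only, it cannot re-trigger a workflow that listens for pushes to main, which is the loop-safety property the commit's header comment documents.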
Hongming Wang	7c6acc18ae	ci(branch-protection): check-name parity gate (#144)
Audit finding: every workflow that emits a required-status-check name on molecule-core's branch protection (apply.sh's STAGING_CHECKS + MAIN_CHECKS) ALREADY uses the safe always-runs-with-conditional-steps shape: Platform/Canvas/Python/Shellcheck in ci.yml, Canvas tabs E2E in e2e-staging-canvas.yml, E2E API Smoke in e2e-api.yml, PR-built wheel in runtime-prbuild-compat.yml, the codeql Analyze matrix, and the always-on Secret scan + Detect changes. No production drift to fix today.
Adds a regression-guard so the next path-filter / matrix refactor / workflow rename can't silently re-introduce the bug shape called out in saved memory feedback_branch_protection_check_name_parity: "Path filters … silently break branch protection because no job emits the protected sentinel status when path-filter returns false."
New tools:
- tools/branch-protection/check_name_parity.sh: extracts every required check name from apply.sh's heredocs, then for each name classifies the owning workflow as safe (no top-level paths:) / safe (per-step if-gates without top-level paths:) / unsafe (top-level paths: without per-step if-gates) / unsafe-mix (top-level paths: WITH per-step if-gates; the workflow may still skip entirely on path exclusion, leaving the gates dormant) / missing (no emitter at all). Special-cases codeql.yml's matrix-expanded `Analyze (${{ matrix.language }})`.
- tools/branch-protection/test_check_name_parity.sh: 6 unit tests covering each classification: safe, unsafe-path-filter, missing, safe-with-per-step-gates, unsafe-mix, matrix-expansion. Each test builds a synthetic apply.sh + workflow file in a tmpdir, invokes the script, and asserts on exit code + stderr substring. Per feedback_assert_exact_not_substring the assertions pin specific classifications, not just non-zero exit.
Wired into branch-protection-drift.yml so every PR touching .github/workflows/** runs the parity check; the existing daily schedule covers between-PR drift. The check is cheap (~1s) and runs without the admin token; it only reads files in the checkout. Self-test step runs the unit tests on every invocation, so a regression in the script can't false-pass on production.
Per BSD-vs-GNU portability hygiene: heredoc-marker extraction stays in plain awk + sed (no gawk-only `match()` array form), grep regex avoids `^` anchor for `if:` lines because real workflows use ` - if:` with the `-` step-marker between leading spaces and `if:` (the original anchor missed every workflow's per-step gates).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 14:42:50 -07:00
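The classification rule can be illustrated with a much-reduced sketch. It is not the contents of check_name_parity.sh: `classify_workflow` is a hypothetical name, the mapping from check names to workflows and the matrix special case are omitted, and any indented `paths:` key is treated as the trigger-level filter.

```shell
#!/usr/bin/env bash
# Sketch only: classify_workflow is a hypothetical, simplified classifier.
# Prints one of: safe, unsafe, unsafe-mix, missing.
classify_workflow() {
  local file=$1 has_paths=no has_if=no
  if [ ! -f "$file" ]; then
    echo missing
    return 0
  fi
  # Trigger-level path filter: an indented "paths:" key.
  if grep -Eq '^[[:space:]]+paths:' "$file"; then has_paths=yes; fi
  # Deliberately not anchored at line start: real steps are written as
  # "      - if: ...", with the "-" marker between indentation and "if:".
  if grep -Eq '[[:space:]]if:' "$file"; then has_if=yes; fi
  case "$has_paths/$has_if" in
    no/*)    echo safe ;;        # always runs, with or without if-gates
    yes/no)  echo unsafe ;;      # can skip entirely and emit no status
    yes/yes) echo unsafe-mix ;;  # gates exist but the path filter can skip them
  esac
}
```

Only POSIX character classes and `grep -E` are used, in the spirit of the BSD-vs-GNU constraint the commit mentions.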
claude-ceo-assistant	3a00dd236f	fix(ci): convert CodeQL workflow to no-op stub on Gitea (#156)
All checks were successful

Why
---
PR #35 marked `continue-on-error: true` at the JOB level (correct YAML), but Gitea Actions 1.22.6 does NOT propagate job-level continue-on-error to the commit-status API: every matrix leg still posts `failure`. That keeps OVERALL=failure on every push to main + staging and blocks the auto-promote signal even when every other gate is green.

Worse: the underlying CodeQL run never actually worked on Gitea. The github/codeql-action/init@v4 step calls api.github.com bundle endpoints (CLI download + query packs + telemetry) that Gitea does NOT proxy. Confirmed via live-tested run 1d/3101 on operator host:

    2026-05-07T20:55:17 ::group::Run Initialize CodeQL
      with:
        languages: ${{ matrix.language }}
        queries: security-extended
    2026-05-07T20:55:36 ::error::404 page not found
    2026-05-07T20:55:50 Failure - Main Initialize CodeQL
    2026-05-07T20:55:51 skipping Perform CodeQL Analysis (main skipped)
    2026-05-07T20:55:51 ::warning::No files were found at sarif-results/go/

The SARIF artifact upload was already a no-op (warning above): the analyze step never wrote anything because init failed. So nothing of value is being lost by stubbing this out.

What
----
- Convert the workflow to a single-step stub that emits success per matrix language (go, javascript-typescript, python).
- Keep workflow `name: CodeQL` exactly (auto-promote-staging.yml line 67 keys on it as a workflow_run gate).
- Keep job name template `Analyze (${{ matrix.language }})` and the 3-leg matrix exactly (commit-status context names + branch protection + #144 required-check-name parity).
- Keep all four triggers (push / pull_request / merge_group / schedule) so merge_group required-checks parity holds.
- Drop the codeql-action steps, the Autobuild step, the SARIF parse step, and the upload-artifact step; all four of those are now dead code (init can never succeed against Gitea's API surface).

Policy
------
Per Hongming decision 2026-05-07 (#156): CodeQL is ADVISORY, not blocking, until a Gitea-compatible SAST pipeline lands. The header of the new workflow file documents this decision + lists the three re-enable options (self-hosted Semgrep, Sonatype, GitHub mirror) plus the compensating controls in place (secret-scan, block-internal-paths, lint-curl-status-capture, branch-protection-drift).

Closes #156. Touches #142 (no capital-M Molecule-AI refs in this file; already lowercase per `e01077be`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 14:26:57 -07:00
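The blocking behaviour described in this commit can be illustrated with a minimal Python sketch. It assumes the overall commit state is simply the worst state among all posted contexts, which is a simplification and not taken from Gitea's source; `overall_state` and `codeql_contexts` are hypothetical helper names. The context strings follow the pattern visible in this repository's check list.

```python
# Hypothetical model: the overall commit state is the worst state across contexts.
SEVERITY = {"success": 0, "skipped": 0, "pending": 1, "failure": 2}

LANGUAGES = ("go", "javascript-typescript", "python")


def codeql_contexts(event="pull_request"):
    """Build the three CodeQL context names as they appear in the check list."""
    job = "Analyze (${{ matrix.language }})"  # job name template, kept verbatim
    return [f"CodeQL / {job} ({lang}) ({event})" for lang in LANGUAGES]


def overall_state(statuses):
    """Return the worst state among all contexts (assumed aggregation rule)."""
    if not statuses:
        return "pending"
    return max(statuses.values(), key=lambda state: SEVERITY[state])


# Every other gate is green, but the three CodeQL legs still post `failure`.
before = {"CI / Platform (Go) (pull_request)": "success"}
before.update({ctx: "failure" for ctx in codeql_contexts()})

# The stub posts `success` under the same three context names.
after = dict(before)
after.update({ctx: "success" for ctx in codeql_contexts()})

print(overall_state(before))  # failure
print(overall_state(after))   # success
```

Because the stub keeps the context names unchanged, anything keyed on those names (branch protection, required-check parity) sees the same set of checks, only with a different state.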
devops-engineer	229b1a902a	fix(ci): pre-clone manifest deps in harness-replays workflow (#173 followup)
Some checks failed: CodeQL / Analyze (${{ matrix.language }}) (go, javascript-typescript, python); Harness Replays / Harness Replays

harness-replays.yml builds tenant-alpha + tenant-beta via tests/harness/compose.yml using workspace-server/Dockerfile.tenant. Post-#173, that Dockerfile expects .tenant-bundle-deps/{workspace-configs-templates,org-templates,plugins} pre-cloned at the build context root. Sister PR #38 added the pre-clone step to publish-workspace-server-image.yml but missed harness-replays.yml.

Symptoms:
- main run #892 (2026-05-07T20:28:53Z): COPY .tenant-bundle-deps/plugins -> failed to calculate checksum ... not found.
- staging run #964 (2026-05-07T20:41:52Z): hits the OLD in-image clone path (staging hasn't picked up the Dockerfile.tenant refactor yet via auto-sync) and fails on 'fatal: could not read Username for https://git.moleculesai.app' when cloning the first private workspace-template-* repo.

Fix: add the same Pre-clone step to harness-replays.yml, mirroring publish-workspace-server-image.yml. Uses AUTO_SYNC_TOKEN (devops-engineer persona PAT) per feedback_per_agent_gitea_identity_default.

Once auto-sync main->staging unblocks (sister agent fixing the 7-file conflict in flight), staging will inherit both this workflow fix AND the Dockerfile.tenant refactor atomically.

Refs: #168, #173	2026-05-07 14:26:52 -07:00
claude-ceo-assistant (Claude Opus 4.7 on Hongming's MacBook)	25fb696965	chore: reconcile main → staging post-suspension divergence
Some checks failed: Harness Replays / Harness Replays; CodeQL / Analyze (${{ matrix.language }}) (go, javascript-typescript, python); E2E Staging Canvas (Playwright) / Canvas tabs E2E; CI / Platform (Go)

Refs Task #165 (Class D AUTO_SYNC_TOKEN plumbing).

main and staging diverged after the 2026-05-06 GitHub-org suspension because Class D / Class G / feature work landed on staging while unrelated CI fixes (#34-47, ECR auth-inline, buildx→docker, pre-clone manifest deps) landed straight on main. Both branches edited the same workflow files, so every push to main triggered an Auto-sync run that aborted at `git merge --no-ff origin/main` with 7 content conflicts:
- .github/workflows/canary-verify.yml (URL: github.com → Gitea)
- .github/workflows/ci.yml (3 URL refs)
- .github/workflows/publish-runtime.yml (cascade: HTTP repo-dispatch → Gitea push)
- .github/workflows/publish-workspace-server-image.yml (drop AWS-action steps; ECR auth is inline)
- .github/workflows/retarget-main-to-staging.yml (URL)
- manifest.json (lowercase org slug + add mock-bigorg from main)
- scripts/clone-manifest.sh (keep main's MOLECULE_GITEA_TOKEN auth path + drop awk-tolower since manifest is now lowercase)

Resolution: union. Staging's post-suspension Gitea/ECR migrations win on URL/policy edits; main's additive work (mock-bigorg manifest entry, inline ECR auth, MOLECULE_GITEA_TOKEN basic-auth) is preserved on top.

After this lands, staging is a strict superset of main, so the next auto-sync run on a push to main will be a clean fast-forward / no-op. The auto-sync workflow on main also picks up staging's AUTO_SYNC_TOKEN swap (Class D #26) for free, fixing the latent layer-2 push-auth issue.

Verified locally:
- bash -n scripts/clone-manifest.sh
- python -c 'yaml.safe_load(...)' on each touched workflow
- python -c 'json.load(open(manifest.json))' (21 plugins, 9 templates, 7 org_templates)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 14:24:37 -07:00
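The manifest check listed under "Verified locally" can be sketched in Python as a small parse-and-count helper. This is only an illustration: the real layout of manifest.json is not shown in the commit, so the top-level key names below (`plugins`, `templates`, `org_templates`) are assumptions borrowed from the counts quoted in the message, and `count_manifest_entries` is a hypothetical name.

```python
import json


def count_manifest_entries(manifest_text):
    """Parse manifest JSON and count entries under each list-valued top-level key.

    json.loads raises json.JSONDecodeError on malformed input, which is the
    same failure a bare json.load(...) check would surface.
    """
    manifest = json.loads(manifest_text)
    return {
        key: len(value)
        for key, value in manifest.items()
        if isinstance(value, list)
    }


# Tiny stand-in manifest; key names and entries are assumptions for illustration.
sample = json.dumps({
    "plugins": [{"name": "plugin-a"}, {"name": "plugin-b"}],
    "templates": [{"name": "template-a"}],
    "org_templates": [],
})

counts = count_manifest_entries(sample)
print(counts)
```

Comparing the resulting counts with the expected totals after a merge is a cheap way to notice an entry dropped during conflict resolution.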
devops-engineer	194cdf012b	chore(ci): retrigger publish-workspace-server-image after ECR repo create (#173)
Some checks failed: CodeQL / Analyze (${{ matrix.language }}) (go, javascript-typescript, python)

Run #1010 (post-#46) succeeded all the way to push but failed with "repository molecule-ai/platform does not exist": the platform image ECR repo had never been created (only platform-tenant existed). Created the repo via:

    aws ecr create-repository --region us-east-2 \
      --repository-name molecule-ai/platform \
      --image-scanning-configuration scanOnPush=true

This is a one-line workflow comment to satisfy the path-filter and re-run the publish workflow against the now-existing repo.

Closes #173 properly this time: pre-clone + inline ECR auth + ECR repo all in place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 13:54:11 -07:00
devops-engineer	f0e8d9bb23	fix(ci): inline aws ecr get-login-password + docker login (followup #173)
Some checks failed: CodeQL / Analyze (${{ matrix.language }}) (go, javascript-typescript, python)

CI run #987 (post-#45) showed `docker push` from shell still hits "no basic auth credentials": `aws-actions/amazon-ecr-login@v2` writes auth to a step-scoped DOCKER_CONFIG that doesn't carry across to the next shell step on Gitea Actions.

Fix: drop both `aws-actions/configure-aws-credentials@v4` and `aws-actions/amazon-ecr-login@v2`. Run `aws ecr get-login-password | docker login` inline in the same shell step as `docker build` + `docker push`. AWS creds come from secrets via env vars, ECR token is fresh per-step (12h validity is plenty), config.json lives in the same shell process, so auth state is guaranteed.

This is the operator-host manual approach mapped 1:1 into CI. runner-base image already has aws-cli + docker (verified locally).

Closes #173 (fifth piece, and final; this matches the manual flow exactly).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 13:49:12 -07:00
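The "login, build and push in one step" idea can be sketched as follows. The commit does this in a single shell step of the workflow; the Python rendering below is only an illustration of the same sequence, and the function names, the account ID `123456789012`, and the image tag are hypothetical. The region and repository name are the ones mentioned in the neighbouring commit.

```python
import subprocess


def ecr_registry(account_id, region):
    """Registry hostname for a private ECR registry."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"


def login_commands(account_id, region):
    """Return the two argv lists behind `aws ecr get-login-password | docker login`."""
    get_password = ["aws", "ecr", "get-login-password", "--region", region]
    docker_login = [
        "docker", "login",
        "--username", "AWS",
        "--password-stdin",
        ecr_registry(account_id, region),
    ]
    return get_password, docker_login


def login_build_push(account_id, region, repository, tag, context="."):
    """Log in, build and push from one process so the auth state is shared."""
    get_password, docker_login = login_commands(account_id, region)
    # The token is piped straight into docker login and never written to a log.
    token = subprocess.run(get_password, check=True, capture_output=True).stdout
    subprocess.run(docker_login, input=token, check=True)
    image = f"{ecr_registry(account_id, region)}/{repository}:{tag}"
    subprocess.run(["docker", "build", "-t", image, context], check=True)
    subprocess.run(["docker", "push", image], check=True)
    return image


# Only print the commands here; running them needs aws-cli, docker and AWS creds.
for argv in login_commands("123456789012", "us-east-2"):
    print(" ".join(argv))
```

A call such as `login_build_push("123456789012", "us-east-2", "molecule-ai/platform", "latest")` would then run all four commands back to back, which is the property the commit relies on.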
devops-engineer	43e2d24c5b	fix(ci): replace buildx with plain docker build+push (followup #173)
Some checks failed: CodeQL / Analyze (${{ matrix.language }}) (go, javascript-typescript, python)

CI run #946 (post-#43) confirmed `driver: docker` doesn't fix the ECR push 401 either: buildx CLI inside the runner container talks to the operator-host docker daemon (mounted socket), but the daemon doesn't see the runner's ECR auth state, and the runner's buildx CLI doesn't attach the auth header in a way the daemon accepts.

Drop buildx + build-push-action entirely. Plain `docker build` + `docker push` from the runner container works because both use the SAME docker socket + the SAME runner-container config.json (populated by `aws ecr get-login-password | docker login` from amazon-ecr-login).

Trade-off: lose multi-arch support. We only ship linux/amd64 tenant images today, so this is fine. If multi-arch becomes a requirement later, we can revisit (likely with `docker buildx create --driver=remote` pointing at an external buildkit, but that's substantial infra work; not worth it for a single-arch shop).

Closes #173 (fourth piece, and hopefully last; this matches the operator-host manual approach exactly).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 13:43:50 -07:00
devops-engineer	bee4f9ea79	fix(ci): use docker driver for buildx + drop type=gha cache (followup #173)
Some checks failed: CodeQL / Analyze (${{ matrix.language }}) (go, javascript-typescript, python)

PR #38 + #41 fixed the Dockerfile-side clone issue. CI run #893 then revealed two Gitea-Actions-specific issues with the unchanged buildx config:

1. `failed to push: 401 Unauthorized` to ECR. Root cause: default buildx driver `docker-container` spawns a buildkit container that doesn't share the host's `~/.docker/config.json`, so the ECR auth set up by amazon-ecr-login doesn't reach the push. Fix: pin `driver: docker` so buildx delegates to the host daemon, which already has the ECR creds.
2. `dial tcp ...:41939: i/o timeout` on `_apis/artifactcache/cache`. Root cause: `cache-from/cache-to: type=gha` is GitHub-specific; Gitea Actions has no compatible artifact-cache backend, so every cache lookup fails after a 30s timeout. Fix: remove the cache-* options. Cold-build cost is <10min for 37-repo clone + Go/Node compile, acceptable. Could revisit with type=registry inline cache later if rebuilds get painful.

With this + #38/#41, the workflow should run end-to-end on Gitea Actions: pre-clone -> docker build (host daemon) -> ECR push.

Closes #173 (third and final piece).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 13:35:07 -07:00
devops-engineer	55689e0b10	fix(post-suspension): migrate github.com/Molecule-AI refs to git.moleculesai.app (Class G #168)
Some checks failed: pr-guards / disable-auto-merge-on-push; CodeQL / Analyze (${{ matrix.language }}) (go, javascript-typescript, python); Harness Replays / Harness Replays; E2E API Smoke Test / E2E API Smoke Test; CI / Platform (Go)

The GitHub org Molecule-AI was suspended on 2026-05-06; canonical SCM is now Gitea at https://git.moleculesai.app/molecule-ai/. Stale github.com/Molecule-AI/... URLs return 404 and break tooling that clones / pip-installs / curls them.

This bundles all non-Go-module URL fixes for this repo into a single PR. Go module path references (in *.go, go.mod, go.sum) are out of scope here -- tracked separately under Task #140.

Token-auth clone URLs also flip ${GITHUB_TOKEN} -> ${GITEA_TOKEN} since the GitHub token does not auth against Gitea.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 13:08:15 -07:00
devops-engineer	a6d67b4c68	fix(ci): pre-clone manifest deps in workflow, drop in-image clone (closes #173)
Some checks failed: Harness Replays / Harness Replays; CodeQL / Analyze (${{ matrix.language }}) (go, javascript-typescript, python); CI / Platform (Go)

publish-workspace-server-image.yml could not run on Gitea Actions because Dockerfile.tenant's stage 3 ran `git clone` against private Gitea repos from inside the Docker build context, where no auth path exists. Every workspace-server rebuild required a manual operator-host push.

Move cloning to the trusted CI context (where AUTO_SYNC_TOKEN, the devops-engineer persona PAT, is naturally available). Dockerfile.tenant now COPYs from .tenant-bundle-deps/, populated by the workflow's new "Pre-clone manifest deps" step. The Gitea token never enters the image.

- scripts/clone-manifest.sh: optional MOLECULE_GITEA_TOKEN env embeds basic-auth in the clone URL; redacted in log output. Anonymous fallback preserved for future public-repo path.
- .github/workflows/publish-workspace-server-image.yml: new pre-clone step before docker build; injects AUTO_SYNC_TOKEN. Fail-fast if the secret is empty.
- workspace-server/Dockerfile.tenant: drop stage 3 (templates), COPY from .tenant-bundle-deps/ instead. Header documents the prereq.
- .gitignore: ignore /.tenant-bundle-deps/ so a local build can't accidentally commit cloned repos.

Verified locally: clone-manifest.sh with the devops-engineer persona token cloned all 37 repos (9 ws + 7 org + 21 plugins, 4.9MB after .git strip).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 12:59:46 -07:00
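The token handling described for scripts/clone-manifest.sh (basic-auth embedded in the clone URL, redacted in logs, anonymous fallback when no token is set) might look roughly like this in Python. The actual script is shell and is not shown in the commit, so this is a sketch under assumptions: the function names, the `oauth2` username, and the repository name `example-repo` are all hypothetical.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def clone_url(base_url, token=None, user="oauth2"):
    """Embed basic-auth credentials in an HTTPS clone URL.

    With no token the URL is returned unchanged (anonymous fallback).
    The username is a placeholder; the real script may use a different one.
    """
    if not token:
        return base_url
    parts = urlsplit(base_url)
    netloc = f"{user}:{quote(token, safe='')}@{parts.netloc}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


def redact(text, token):
    """Replace the raw and URL-encoded token with a mask before logging."""
    if not token:
        return text
    for secret in {token, quote(token, safe="")}:
        text = text.replace(secret, "***")
    return text


base = "https://git.moleculesai.app/molecule-ai/example-repo.git"  # hypothetical repo
token = "s3cr3t-token"  # stand-in for the MOLECULE_GITEA_TOKEN value

url = clone_url(base, token)
print(redact(f"cloning {url}", token))
print(clone_url(base))  # anonymous path, unchanged
```

Keeping redaction in one helper means every log line that mentions the URL passes through the same mask, which matters because the token otherwise appears verbatim in any echoed `git clone` command.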
claude-ceo-assistant	b73d3bfff2	fix(ci): mark CodeQL continue-on-error (advisory only) - closes #156
Some checks failed: pr-guards / disable-auto-merge-on-push; CodeQL / Analyze (${{ matrix.language }}) (go, javascript-typescript, python)	2026-05-07 17:26:52 +00:00
devops-engineer	6de3c1ccd2	fix(ci): add scripts/** to publish-workspace-server-image path filter
Some checks failed: CodeQL / Analyze (${{ matrix.language }}) (go, javascript-typescript, python)

scripts/clone-manifest.sh runs inside the platform Dockerfile build, so a change to that script needs to retrigger publish. Without it, the prior fix (clone via Gitea + lowercase org) didn't trigger this workflow because scripts/ wasn't in the path filter.

Also serves as the file change to satisfy the path filter for THIS push, retriggering publish-workspace-server-image now.	2026-05-07 08:18:53 -07:00
devops-engineer	694a036a7f	chore(ci): trailing newline to retrigger publish-workspace-server-image (path-filter requires workflow file change)
Some checks failed: CodeQL / Analyze (${{ matrix.language }}) (go, javascript-typescript, python)	2026-05-07 08:12:10 -07:00
devops-engineer	10e510f50c	chore: drop github-app-auth + swap GHCR→ECR (closes #157, #161)	2026-05-07 07:48:51 -07:00

Two coupled cleanups for the post-2026-05-06 stack:

============================================

The plugin injected GITHUB_TOKEN/GH_TOKEN via the App's installation-access flow (~hourly rotation). Per-agent Gitea identities replaced this approach after the 2026-05-06 suspension: workspaces now provision with a per-persona Gitea PAT from .env instead of an App-rotated token.

The plugin code itself lived on github.com/Molecule-AI/molecule-ai-plugin-github-app-auth, which is also unreachable post-suspension; checking it out at CI build time was already failing.

Removed:
- workspace-server/cmd/server/main.go: githubappauth import + the `if os.Getenv("GITHUB_APP_ID") != ""` block that called BuildRegistry. gh-identity remains as the active mutator.
- workspace-server/Dockerfile + Dockerfile.tenant: COPY of the sibling repo + the `replace github.com/Molecule-AI/molecule-ai-plugin-github-app-auth => /plugin` directive injection.
- workspace-server/go.mod + go.sum: github-app-auth dep entry (cleaned up by `go mod tidy`).
- 3 workflows: actions/checkout steps for the sibling plugin repo:
  - .github/workflows/codeql.yml (Go matrix path)
  - .github/workflows/harness-replays.yml
  - .github/workflows/publish-workspace-server-image.yml

Verified `go build ./cmd/server` + `go vet ./...` pass post-removal.

=======================================================

Same workflow used to push to ghcr.io/molecule-ai/platform + platform-tenant. ghcr.io/molecule-ai is gone post-suspension. The operator's ECR org (153263036946.dkr.ecr.us-east-2.amazonaws.com/molecule-ai/) already hosts platform-tenant + workspace-template-* + runner-base images and is the post-suspension SSOT for container images. This PR aligns publish-workspace-server-image with that stack.

- env.IMAGE_NAME + env.TENANT_IMAGE_NAME repointed to ECR URL.
- docker/login-action swapped for aws-actions/configure-aws-credentials@v4 + aws-actions/amazon-ecr-login@v2 chain (the standard ECR auth pattern; uses AWS_ACCESS_KEY_ID/SECRET secrets bound to the molecule-cp IAM user).

The :staging-<sha> + :staging-latest tag policy is unchanged: staging-CP's TENANT_IMAGE pin still points at :staging-latest, just with the new registry prefix.

Refs molecule-core#157, #161; parallel to org-wide CI-green sweep.
devops-engineer	64a0bc1f7e	fix(ci): use AUTO_SYNC_TOKEN for auto-sync main->staging (Class D)	2026-05-07 07:01:46 -07:00

Same shape as molecule-controlplane#29: the per-job GITHUB_TOKEN doesn't have the Gitea API permissions to open PRs or push branches that the auto-sync flow needs. AUTO_SYNC_TOKEN is the devops-engineer persona PAT (per saved memory feedback_per_agent_gitea_identity_default).

Companion prod ops (already done):
- devops-engineer added as collaborator on molecule-core (write)
- devops-engineer added to staging branch protection push_whitelist
- AUTO_SYNC_TOKEN registered as Actions secret on molecule-core
devops-engineer	1d8c101c94	chore: drop github-app-auth + swap GHCR→ECR (closes #157, #161)	2026-05-07 05:12:06 -07:00

Two coupled cleanups for the post-2026-05-06 stack:

#157: drop molecule-ai-plugin-github-app-auth
============================================

The plugin injected GITHUB_TOKEN/GH_TOKEN via the App's installation-access flow (~hourly rotation). Per-agent Gitea identities replaced this approach after the 2026-05-06 suspension: workspaces now provision with a per-persona Gitea PAT from .env instead of an App-rotated token.

The plugin code itself lived on github.com/Molecule-AI/molecule-ai-plugin-github-app-auth, which is also unreachable post-suspension; checking it out at CI build time was already failing.

Removed:
- workspace-server/cmd/server/main.go: githubappauth import + the `if os.Getenv("GITHUB_APP_ID") != ""` block that called BuildRegistry. gh-identity remains as the active mutator.
- workspace-server/Dockerfile + Dockerfile.tenant: COPY of the sibling repo + the `replace github.com/Molecule-AI/molecule-ai-plugin-github-app-auth => /plugin` directive injection.
- workspace-server/go.mod + go.sum: github-app-auth dep entry (cleaned up by `go mod tidy`).
- 3 workflows: actions/checkout steps for the sibling plugin repo:
  - .github/workflows/codeql.yml (Go matrix path)
  - .github/workflows/harness-replays.yml
  - .github/workflows/publish-workspace-server-image.yml

Verified `go build ./cmd/server` + `go vet ./...` pass post-removal.

#161: swap GHCR→ECR for publish-workspace-server-image
=======================================================

Same workflow used to push to ghcr.io/molecule-ai/platform + platform-tenant. ghcr.io/molecule-ai is gone post-suspension. The operator's ECR org (153263036946.dkr.ecr.us-east-2.amazonaws.com/molecule-ai/) already hosts platform-tenant + workspace-template-* + runner-base images and is the post-suspension SSOT for container images. This PR aligns publish-workspace-server-image with that stack.

- env.IMAGE_NAME + env.TENANT_IMAGE_NAME repointed to ECR URL.
- docker/login-action swapped for aws-actions/configure-aws-credentials@v4 + aws-actions/amazon-ecr-login@v2 chain (the standard ECR auth pattern; uses AWS_ACCESS_KEY_ID/SECRET secrets bound to the molecule-cp IAM user).

The :staging-<sha> + :staging-latest tag policy is unchanged: staging-CP's TENANT_IMAGE pin still points at :staging-latest, just with the new registry prefix.

Refs molecule-core#157, #161; parallel to org-wide CI-green sweep.
claude-ceo-assistant	06d4bab29d	Merge pull request 'fix(ci): port publish-runtime cascade to Gitea repo-dispatch API (closes #14)' (#20) from fix/14-cascade-gitea-dispatch into staging	2026-05-07 10:36:32 +00:00
Hongming Wang	4279fecde5	fix(ci): keep codex in TEMPLATES + skip-if-no-publish-image.yml	2026-05-07 03:32:53 -07:00

v2 dropped codex from TEMPLATES on the basis of "no publish-image.yml = not part of cascade today." That was correct about the immediate behavior but tripped cascade-list-drift-gate.yml because manifest.json still declares codex (it IS a live runtime, referenced from workspace/config.py and cloned into dev envs by clone-manifest.sh; only the image-publish path is missing).

Restore codex to TEMPLATES (matching manifest) and add a runtime soft-skip: probe each repo for .github/workflows/publish-image.yml via the Gitea contents API and skip cleanly if 404. Final job log distinguishes "complete across all" vs "complete with soft-skips".

This preserves the drift gate's invariant (TEMPLATES == manifest) while honoring the empirical fact that codex has no publish-image workflow yet. If codex later gains the workflow, no change here is needed: the probe will see 200 and the cascade will fan out to it naturally.

Refs molecule-core#14, molecule-core#20.
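The soft-skip described in commit 4279fecde5 can be sketched as follows. This is a minimal illustration, not the workflow's actual code: the function names, the `probe_status` callback, and the repo names in the usage note are hypothetical, and the only detail taken from the commit message is the rule "200 means fan out, 404 means skip cleanly" applied to `.github/workflows/publish-image.yml` through the Gitea contents API.

```python
from typing import Callable, Iterable

WORKFLOW_PATH = ".github/workflows/publish-image.yml"


def contents_url(gitea_url: str, owner: str, repo: str) -> str:
    """Build the Gitea contents-API URL used to probe for the workflow file."""
    base = gitea_url.rstrip("/")
    return f"{base}/api/v1/repos/{owner}/{repo}/contents/{WORKFLOW_PATH}"


def partition_templates(
    templates: Iterable[str],
    probe_status: Callable[[str], int],
) -> tuple[list[str], list[str]]:
    """Split templates into (fan_out, soft_skipped) by probe HTTP status.

    probe_status is a hypothetical callback that returns the status code of a
    GET against contents_url(...) for the given repo.
    """
    fan_out: list[str] = []
    soft_skipped: list[str] = []
    for repo in templates:
        status = probe_status(repo)
        if status == 200:
            fan_out.append(repo)       # workflow exists: include in the cascade
        elif status == 404:
            soft_skipped.append(repo)  # no workflow yet: skip without failing
        else:
            # Anything else (401, 5xx, ...) is treated as a real error here.
            raise RuntimeError(f"unexpected status {status} probing {repo}")
    return fan_out, soft_skipped


def final_log_line(soft_skipped: list[str]) -> str:
    """Distinguish the two end states named in the commit message."""
    if soft_skipped:
        return "complete with soft-skips: " + ", ".join(soft_skipped)
    return "complete across all"
```

With a probe that returns 404 only for `codex`, `partition_templates` leaves codex in the TEMPLATES list (so the drift gate's invariant holds) while keeping it out of the fan-out until the workflow file appears.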
Hongming Wang	607444e71b	feat(ci): replace curl-dispatch with push-mode cascade (v2)	2026-05-07 03:17:38 -07:00

Empirical blocker on v1: Gitea 1.22.6 has no repository_dispatch / workflow_dispatch trigger API (verified across 6 candidate paths in issuecomment-913). v1's curl-POST loop would always exit-1.

v2 pivots to push-mode: each template repo got a small companion PR (merged 2026-05-07) adding a `.runtime-version` file at root + a `resolve-version` job in publish-image.yml that reads the file and forwards the value to the reusable build workflow. publish-runtime now updates that file via git-clone + commit + push, which trips each template's existing `on: push: branches: [main]` trigger.

Behaviour changes vs v1:
- Templates list dropped from 9 → 8 (codex has no publish-image.yml so was never part of the cascade in practice).
- 3-retry pull-rebase loop per template (handles concurrent-push races without force-push). Failures collected, job exits 1 with the failed-template list at the end.
- Idempotency: when re-run with the same version, templates already pinned to that version contribute zero commits, so the operator can safely re-run to retry partial failures.
- Author line: "publish-runtime cascade <publish-runtime@moleculesai.app>" trailer makes it clear the commit is workflow-driven, not human (per memory feedback_github_botring_fingerprint).

DISPATCH_TOKEN secret name unchanged (still consumed at secrets.DISPATCH_TOKEN per `569df259`).

Refs molecule-core#14, builds on molecule-core#20 issuecomment-923 (Phase 2 design).
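The retry and idempotency behaviour listed for the v2 cascade can be sketched like this. It is an illustration under stated assumptions rather than the workflow's real shell code: `PushRejected`, `cascade_one`, and the three callbacks (`read_pin`, `write_and_push`, `rebase`) are hypothetical stand-ins for the git-clone, commit, push, and pull-rebase steps operating on `.runtime-version`.

```python
from typing import Callable


class PushRejected(Exception):
    """Hypothetical signal that a push lost a race with a concurrent push."""


def cascade_one(
    read_pin: Callable[[], str],
    write_and_push: Callable[[str], None],
    rebase: Callable[[], None],
    version: str,
    max_attempts: int = 3,
) -> str:
    """Pin one template to `version`; return "unchanged" or "pushed".

    Idempotent: if the template is already pinned, no commit is made.
    On a rejected push, pull-rebase and retry (never force-push).
    """
    for attempt in range(1, max_attempts + 1):
        if read_pin().strip() == version:
            return "unchanged"  # already pinned: zero commits on re-run
        try:
            write_and_push(version)
            return "pushed"
        except PushRejected:
            if attempt == max_attempts:
                raise  # retries exhausted: let the caller record the failure
            rebase()
    raise AssertionError("unreachable")


def cascade(templates: dict[str, Callable[[], str]], version: str) -> list[str]:
    """Run a per-template thunk for each template and collect the failures.

    Each value is a hypothetical zero-argument callable wrapping cascade_one
    for that repo. A non-empty result means the job should exit 1.
    """
    failed: list[str] = []
    for name, run in templates.items():
        try:
            run()
        except PushRejected:
            failed.append(name)
    return failed
```

Collecting failures instead of aborting on the first one matches the commit's description: every template gets its attempt, and a re-run with the same version only touches the templates that are not yet pinned.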
Hongming Wang	569df259ba	fix(ci): align secret name to plumbed DISPATCH_TOKEN (closes #14)	2026-05-07 02:38:20 -07:00

The cascade workflow was reading from `secrets.TEMPLATE_DISPATCH_TOKEN` but the plumbed secret name is `DISPATCH_TOKEN` (verified just now via GET /repos/molecule-ai/molecule-core/actions/secrets; only DISPATCH_TOKEN is set). Without this rename the cascade would always evaluate "secret missing" and exit 1 on the next push to staging, defeating the entire point of grant-role-access.sh --apply that just landed.

Three references updated:
- env mapping (`secrets.X` → `secrets.DISPATCH_TOKEN`)
- workflow_dispatch warning text
- push-trigger error text

The bash-side variable name is unchanged (still `DISPATCH_TOKEN`) so the curl invocation at line 372 is unaffected. YAML round-trip parses clean.
claude-ceo-assistant	1d9d8c7809	Merge pull request 'fix(scripts): migrate ghcr.io→ECR + raw.githubusercontent.com→Gitea (#46)' (#16) from fix/script-ghcr-and-lint-paths into staging	2026-05-07 09:25:24 +00:00
claude-ceo-assistant	ce3f1f48a4	fix(ci): port publish-runtime cascade to Gitea repo-dispatch API (closes molecule-core#14)	2026-05-07 01:31:37 -07:00

## Symptom

`publish-runtime.yml::cascade` fired a `repository_dispatch` to 10 workspace-template repos via direct curl to `https://api.github.com/repos/...`. Post-2026-05-06 the org's GitHub presence is suspended; every invocation 404s. The job's `::warning::` posture meant the failure didn't propagate, leaving the runtime PyPI publish → template image rebuild pipeline silently broken.

## Why Option A (rewrite) and not Option B (delete)

Verified 2026-05-07 by devops-engineer (molecule-core#14 thread):

- The cron-poll mechanism (/etc/cron.d/molecule-deploy-poll) tracks ONLY the Vercel/Railway-deployed repos (landingpage/docs/molecule-app/molecules-market/molecule-controlplane). It does NOT track workspace-template-* repos.
- Each of the 9 template `publish-image.yml` workflows has `repository_dispatch: types: [runtime-published]` as a load-bearing trigger. Without the cascade, when the runtime ships a new PyPI version, templates don't auto-rebuild.

So Option B (delete) would silently break the runtime → template fan-out. Option A (rewrite to Gitea's API shape) is the right call. Security-auditor agreed after seeing the cron-poll TRACKED list.

## API surface change

| Concern | Pre-fix (GitHub) | Post-fix (Gitea) |
|---|---|---|
| URL | `https://api.github.com/repos/$REPO/dispatches` | `${GITEA_URL}/api/v1/repos/$REPO/dispatches` |
| Owner case | `Molecule-AI/...` | `molecule-ai/...` (lowercase, Gitea is case-sensitive) |
| Auth header | `Authorization: Bearer $DISPATCH_TOKEN` | `Authorization: token $DISPATCH_TOKEN` |
| Body shape | `{event_type, client_payload}` | UNCHANGED: Gitea is GitHub-compatible here |
| Success code | `204 No Content` | `204 No Content` (unchanged) |

`GITEA_URL` defaults to `https://git.moleculesai.app`; overridable via job env.

## Out-of-band: DISPATCH_TOKEN secret rotation

The DISPATCH_TOKEN secret was a GitHub PAT. It must be re-minted as a Gitea PAT for the new API to authenticate. Per saved memory `feedback_per_agent_gitea_identity_default`, this should be a dedicated `publish-runtime-bot` persona token with `write:repository` scope on the 9 target repos, NOT the founder PAT.

This PR ships the workflow change. Token rotation is the operator-host follow-up (security-auditor's lane); coordinate the merge so the token is in place before the next runtime release fires.

## Backwards compatibility

The workflow ran silently-broken since 2026-05-06 (every invocation 404 + `::warning::` but no failure). So there is no functional regression from "silently broken" to "actually working". Any in-progress operator-managed manual dispatch path is unaffected; the Gitea API parallel path doesn't require operator intervention.

## Test plan

- [x] YAML parse OK on the modified workflow file
- [ ] Smoke test: trigger a runtime publish (or simulate via dispatching to one template) post-merge; verify HTTP 204 + the template's publish-image workflow fires + the template's image gets re-pushed against the new runtime version. Phase 4 verification belongs to internal#46 follow-up.

## Hostile self-review (3 weakest spots)

1. The fan-out remains all-or-nothing: a single template failure surfaces as a `::warning::` but PyPI publish proceeds. With 9 templates this is a ~10% per-template chance of stale-image-on-runtime-bump if any one fails. Defense: the warning shows up in the workflow summary; operators retry. Future hardening: requeue-on-fail with bounded retry, or a separate reconcile cron that detects template/runtime version drift and re-dispatches.
2. `DISPATCH_TOKEN` validity is enforced by the Gitea API (401 on stale) but the workflow doesn't differentiate 401 from 404. Either way the warning fires. Future hardening: explicit token-shape check at the start of the cascade job (curl `/api/v1/user` once, fail-fast if 401).
3. Owner-case lowercase is right today but couples the workflow to the current Gitea org slug. If the org is ever renamed, this workflow breaks silently. Less fragile alternative: derive REPO from a canonical config (e.g. `gh repo list molecule-ai`) instead of string-concatenating. Acceptable today; filed as the same future hardening pass as item 1.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
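The "future hardening" in item 2 of the self-review for commit ce3f1f48a4 (a fail-fast token check, plus telling 401 apart from 404) might look like the sketch below. It is not part of the PR: `preflight_request` and `classify_status` are hypothetical names, and only the `/api/v1/user` path, the `Authorization: token ...` header form, and the 204/401/404 codes come from the commit message. No network call is made here; a caller would issue the GET with whatever HTTP client the job uses.

```python
def preflight_request(gitea_url: str, token: str) -> tuple[str, dict[str, str]]:
    """Return (url, headers) for a one-off token check against /api/v1/user."""
    url = f"{gitea_url.rstrip('/')}/api/v1/user"
    headers = {"Authorization": f"token {token}"}
    return url, headers


def classify_status(status: int) -> str:
    """Map an HTTP status from the cascade to a distinct outcome.

    The original workflow emitted the same warning for 401 and 404;
    separating them lets the job fail fast on a stale token.
    """
    if status == 204:
        return "ok"
    if status == 401:
        return "stale-token"  # token rejected: abort before fanning out
    if status == 404:
        return "not-found"    # wrong owner/repo slug or missing endpoint
    return "unexpected"
```

Running the preflight once at the start of the job means a bad token produces a single clear failure rather than one identical warning per template.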
claude-ceo-assistant	aa22183e52	chore(ci): pin artifact actions to @v3 for Gitea act_runner compatibility (internal#46)	2026-05-07 01:00:53 -07:00

Mechanical pin: 4 `actions/upload-artifact@v4.6.2/v7.0.1` uses → `@v3`. v4+/v7+ rely on a runtime API shape that Gitea's act_runner v0.6.x doesn't fully support. v3 uses the legacy server protocol act_runner ships end-to-end.

Files (4 uses):
- .github/workflows/ci.yml:238 (v4.6.2 → v3)
- .github/workflows/codeql.yml:124 (v7.0.1 → v3)
- .github/workflows/e2e-staging-canvas.yml:142 (v7.0.1 → v3)
- .github/workflows/e2e-staging-canvas.yml:150 (v7.0.1 → v3)

YAML parse green on all 3 files. Sister PRs land for `molecule-controlplane` and `codex-channel-molecule`. Per internal#46 Phase 2 audit; tracked under that umbrella.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
security-auditor	e01077be38	fix(ci): lowercase 'molecule-ai/' in cross-repo workflow refs Gitea is case-sensitive on owner slugs; canonical is lowercase `molecule-ai/...`. Mixed-case `Molecule-AI/...` refs fail-at-0s when the runner tries to resolve the cross-repo workflow / checkout. Same fix as molecule-controlplane#12. Mechanical case-correction; no behavior change beyond making CI resolve again. Refs: internal#46 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 01:00:10 -07:00
documentation-specialist	5d4184f4a3	fix(scripts): migrate ghcr.io→ECR + raw.githubusercontent.com→Gitea (#46) Per documentation-specialist's grep agent (2026-05-07T07:30, see internal#46): runtime-breaking ghcr.io references in shell scripts + docker-compose + the slip-past-workflow lint_secret_pattern_drift.py all
need migration. These were missed by security-auditor's workflow-only audit. Files (6): - .github/scripts/lint_secret_pattern_drift.py:40 — workspace-runtime pre-commit-checks.sh consumer URL: raw.githubusercontent.com → Gitea raw URL (https://git.moleculesai.app/molecule-ai/.../raw/branch/main/...). The lint job runs in CI and would 404 today. - scripts/refresh-workspace-images.sh:54 — workspace-template image pull URL: ghcr.io → ECR (153263036946.dkr.ecr.us-east-2.amazonaws.com). - scripts/rollback-latest.sh — full rewrite of header + auth flow: * ghcr.io/molecule-ai/{platform,platform-tenant} → ECR * GITHUB_TOKEN with write:packages → AWS ECR auth (aws ecr get-login-password). Per saved memory reference_post_suspension_pipeline, prod cutover is to ECR. * Updated header docs to match new auth flow + prereqs. - scripts/demo-freeze.sh:13,17 — comment-only ghcr → ECR (the script doesn't currently exec these URLs, but the comments describe the cascade and need to match reality). - docker-compose.yml:215-216 — canvas image: ghcr.io → ECR + updated the auth comment to describe `aws ecr get-login-password` flow. - tools/check-template-parity.sh:21 — inline curl install instructions: raw.githubusercontent.com → Gitea raw URL. Hostile self-review: 1. rollback-latest.sh's GITHUB_TOKEN→aws-cli auth swap is a behavior change. Operators using this script now need aws CLI authenticated for region us-east-2 with ECR pull/push perms. Documented in updated header. Operators who don't have aws CLI will get 'aws: command not installed' which is a clear failure mode (not silent). 2. The Gitea raw URL shape (/raw/branch/main/) differs from GitHub's raw.githubusercontent.com structure. Verified pattern by inspecting other Gitea raw URLs in the codebase. If Gitea's URL changes (1.23+), update via the same one-line edit. 3. 
Doesn't touch packer/scripts/install-base.sh which has a similar ghcr.io ref per the grep agent's findings — that's bigger-scope (packer build pipeline) and lives in molecule-controlplane-ish territory; filing as parked follow-up under #46 if not already. Refs: molecule-ai/internal#46, molecule-ai/internal#37, molecule-ai/internal#38, saved memory reference_post_suspension_pipeline	2026-05-07 00:56:23 -07:00
Hongming Wang	debe29c889	ci(handlers-postgres-integration): apply legacy .sql migrations too The migration-replay step globbed only .up.sql, silently skipping the older flat-naming migrations (001_workspaces.sql, 009_activity_logs.sql, etc.). Fine while no integration test depended on those tables; broke when the #149 cross-table atomicity test came in needing both workspaces (FK target for activity_logs) and activity_logs themselves. Switch to globbing .sql + sorted lex-order, excluding .down.sql so up/down pairs don't undo themselves mid-run. Add a sanity check for workspaces + activity_logs + pending_uploads alongside the existing delegations gate so a future migration drift fails loud instead of silently skipping the regressed test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 22:02:24 -07:00
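The file-selection rule described in debe29c889 (every `.sql` file, sorted lexicographically, with `.down.sql` excluded) can be illustrated with a minimal Python sketch. The helper name `select_migrations` and the sample file list are hypothetical; the real workflow step is shell and is not shown in the log.

```python
def select_migrations(filenames):
    """Pick migrations to replay: all .sql files except .down.sql, in lexicographic order."""
    return sorted(
        name
        for name in filenames
        if name.endswith(".sql") and not name.endswith(".down.sql")
    )


# Hypothetical directory listing mixing legacy flat names and up/down pairs.
files = [
    "049_delegations.up.sql",
    "049_delegations.down.sql",
    "001_workspaces.sql",
    "009_activity_logs.sql",
    "README.md",
]
selected = select_migrations(files)
print(selected)
```

Excluding the `.down.sql` half of each pair is what keeps an up/down pair from undoing itself mid-run.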
Hongming Wang	88ff0d770b	chore(sweep): add orphan-tunnel cleanup step (#2987 / #340 ) The 15-min sweeper has been deleting stale e2e orgs but not the orphan tunnels left behind when the org-delete cascade half-fails (CP transient 5xx after the org row is gone but before the CF tunnel delete completes). Result: tunnels accumulate in CF until manual operator cleanup. Add a final step that POSTs `/cp/admin/orphan-tunnels/cleanup` every tick. Best-effort — failure doesn't fail the workflow; next tick re-attempts. Output reports deleted_count + failed count for ops visibility. This is the catch-all for the orphan-tunnel class. The proper upstream fix (transactional org delete) lives in CP and tracks as issue #2989. Until that lands, the sweeper bounded-time-to-cleanup keeps the leak from escalating. Note: PR #492 (cf-tunnel silent-success fix) makes this step actually effective — pre-fix DeleteTunnel silent-succeeded on 1022, so the cleanup endpoint reported success without deleting. Post-fix the cleanup chains CleanupTunnelConnections + retry on 1022, which actually clears stuck-connector orphans. 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-05-05 19:36:20 -07:00
Hongming Wang	a19ee90556	chore(sweep): note SSOT for ephemeral prefixes lives in CP Mirrors molecule-controlplane#494: the canonical EPHEMERAL_PREFIXES list now lives in molecule-controlplane/internal/slugs/ephemeral.go, where redeploy-fleet reads it to skip in-flight test tenants. The sweep workflow keeps a Python copy because GHA Python can't import Go, but a comment now points engineers updating the list to update both files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 19:18:13 -07:00
Hongming Wang	caf19e8980	feat(ops): hourly alarm for auto-promote PR stuck on REVIEW_REQUIRED (#2975 ) Closes the silent-block failure mode that left 25 commits — including the Memory v2 redesign and the reno-stars data-loss fix — wedged on staging for 12+ hours behind a single missing review. The auto-promote workflow opened the PR + armed auto-merge, but main's branch protection required a human review and nobody noticed until a user reported "still seeing old memory tab". ## Detection logic — `scripts/check-stale-promote-pr.sh` Reads open PRs `base=main head=staging` and alarms on: - `mergeStateStatus == BLOCKED` - `reviewDecision == REVIEW_REQUIRED` - createdAt older than `STALE_HOURS` (default 4h) Other BLOCKED reasons (DIRTY, BEHIND, failed checks) are NOT alarmed — those are the author's signal-to-fix. This script targets the specific "no human reviewed yet" wedge. Output: - `::warning` per stale PR (visible in workflow summary + Actions UI) - PR comment (idempotent via marker-string detection; one alarm per PR, never re-spammed) - Exit code = count of stale PRs (capped at 125) Logic in a script (not inline workflow YAML) so it's: - Unit-testable — tests/test-check-stale-promote-pr.sh exercises every branch with stubbed fixture JSON + frozen clock. 23 tests covering: empty list, single stale, just-under-threshold, wrong reviewDecision, wrong mergeStateStatus, mixed list (only matching PRs alarm), custom threshold via --stale-hours, exit-code-counts- matching-PRs, --help, unknown arg → 64, missing repo → 2. - Operator-runnable ad-hoc — `scripts/check-stale-promote-pr.sh` works from any shell with `gh` + `jq`. - SSOT — one detector, the workflow YAML is just schedule + invocation surface. Future sibling workflows that need the same check call the same script. 
## Workflow — `.github/workflows/auto-promote-stale-alarm.yml` Triggers: - cron `27 * * * *` (hourly, off-the-hour to dodge cron herd) - workflow_dispatch with `stale_hours` + `post_comment` overrides Concurrency: `auto-promote-stale-alarm` group, cancel-in-progress=false (idempotent script; no benefit to cancelling a running scan). Permissions: `contents: read` + `pull-requests: write` (post comments). Sparse checkout — only fetches `scripts/check-stale-promote-pr.sh`. No node_modules, no go modules, no slow setup steps. Workflow runs in <30s on a clean repo. ## Why "alarm + comment" not "auto-approve" Considered options in issue #2975: 1. Slack/email alert — picked. 2. Bot-account auto-approve via molecule-ops — circumvents the human-review gate that branch protection encodes. 3. Trusted-promote bypass via CODEOWNERS — needs Org Admin config change; out of scope for a workflow PR. The comment-on-PR pattern picks (1) without external dependencies (no Slack token, no email config). Subscribers get notified via GitHub's existing PR notification delivery; the warning shows up in the Actions feed. ## Why this won't false-positive on legitimate slow reviews Threshold is 4h. Most legitimate gates clear in <1h, so 4× headroom is plenty for slow CI. The comment is idempotent (one alarm per PR, never re-posted) — adding noise stops at 1 comment regardless of how long the PR sits. ## Test plan - [x] `bash scripts/test-check-stale-promote-pr.sh` — 23/23 pass - [x] `python3 -c 'yaml.safe_load(...)'` clean - [x] `bash -n` clean on both scripts - [ ] Live verification: dispatch the workflow once main has caught up, confirm it correctly reports zero stale PRs	2026-05-05 17:55:27 -07:00
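The detection rule in caf19e8980 (BLOCKED, REVIEW_REQUIRED, older than the threshold, exit code capped at 125) can be sketched in Python as below. This is an illustration only: the real detector is a shell script using `gh` and `jq`, and the function names and fixture data here are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_promote_prs(prs, now, stale_hours=4):
    """Return PRs wedged on a missing review for longer than stale_hours."""
    cutoff = now - timedelta(hours=stale_hours)
    stale = []
    for pr in prs:
        # Accept a trailing "Z" so the parse also works on Python < 3.11.
        created = datetime.fromisoformat(pr["createdAt"].replace("Z", "+00:00"))
        if (
            pr["mergeStateStatus"] == "BLOCKED"
            and pr["reviewDecision"] == "REVIEW_REQUIRED"
            and created < cutoff
        ):
            stale.append(pr)
    return stale


def exit_code(stale_count):
    """Exit code equals the number of stale PRs, capped at 125."""
    return min(stale_count, 125)


now = datetime(2026, 5, 5, 12, 0, tzinfo=timezone.utc)
prs = [
    {"number": 1, "mergeStateStatus": "BLOCKED", "reviewDecision": "REVIEW_REQUIRED", "createdAt": "2026-05-05T06:00:00Z"},
    {"number": 2, "mergeStateStatus": "BLOCKED", "reviewDecision": "REVIEW_REQUIRED", "createdAt": "2026-05-05T09:00:00Z"},
    {"number": 3, "mergeStateStatus": "BLOCKED", "reviewDecision": "APPROVED", "createdAt": "2026-05-05T01:00:00Z"},
    {"number": 4, "mergeStateStatus": "DIRTY", "reviewDecision": "REVIEW_REQUIRED", "createdAt": "2026-05-05T01:00:00Z"},
]
stale_numbers = [pr["number"] for pr in stale_promote_prs(prs, now)]
print(stale_numbers, exit_code(len(stale_numbers)))
```

Only PR 1 matches all three conditions; the other rows mirror the "just-under-threshold", "wrong reviewDecision", and "wrong mergeStateStatus" cases listed in the commit message.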
Hongming Wang	475da5b64c	refactor(workspace): extract inbox tools from a2a_tools.py (RFC #2873 iter 4e) Continues the OSS-shape refactor. After iters 4a-4d (rbac, delegation, memory, messaging) the only behavior left in ``a2a_tools.py`` was ``report_activity`` plus three thin inbox-tool wrappers and the ``_enrich_inbound_for_agent`` helper. This iter extracts the inbox slice to ``a2a_tools_inbox.py`` so the kitchen-sink module shrinks from 280 LOC to ~165 LOC of imports + report_activity + back-compat re-export blocks. Extracted symbols: - ``_INBOX_NOT_ENABLED_MSG`` (sentinel) - ``_enrich_inbound_for_agent`` (poll-path peer enrichment helper) - ``tool_inbox_peek`` - ``tool_inbox_pop`` - ``tool_wait_for_message`` Re-exports (`from a2a_tools_inbox import …`) preserve the public ``a2a_tools.tool_inbox_*`` surface so existing tests + call sites continue to resolve unchanged. New tests in test_a2a_tools_inbox_split.py: 1. Drift gate (5) — every previously-public symbol on a2a_tools is the EXACT same object as a2a_tools_inbox.foo (`is`, not `==`), catches a future "wrap with logging" refactor that silently loses existing test coverage. 2. Import contract (1) — a2a_tools_inbox does NOT eagerly import a2a_tools at module load. Pins the layered architecture: the extracted slice depends on ``inbox`` + a lazy ``a2a_client`` import, never on the kitchen-sink that re-exports it. 3. _enrich_inbound_for_agent branches (5) — peer_id-empty (canvas_user) returns dict unchanged; missing peer_id key same; a2a_client unavailable (test harness, partial install) degrades gracefully with a bare envelope; registry hit populates peer_name + peer_role + agent_card_url; registry miss still surfaces agent_card_url (constructable from peer_id alone). The full timeout-clamp / validation / JSON-shape behavior matrix for the three wrappers stays in test_a2a_tools_inbox_wrappers.py — those tests pass identically against both the alias and the underlying impl. 
Wiring updates: - ``scripts/build_runtime_package.py``: add ``a2a_tools_inbox`` to ``TOP_LEVEL_MODULES`` so it ships in the runtime wheel and the drift gate doesn't fail the next publish. - ``.github/workflows/ci.yml``: add ``a2a_tools_inbox.py`` to ``CRITICAL_FILES`` so the 75% MCP/inbox/auth per-file floor applies — this is now where the inbox-delivery code actually lives.	2026-05-05 14:28:58 -07:00
Hongming Wang	0ca4e431c1	test(e2e): add poll-mode chat upload E2E and wire into e2e-api.yml Covers the user-visible flow that Phase 1-5b shipped (RFC #2891): register a poll-mode workspace, POST a multi-file /chat/uploads, verify the activity feed shows one chat_upload_receive row per file, fetch the bytes via /pending-uploads/:fid/content, ack each row, and confirm a post-ack fetch returns 404. Also pins cross-workspace bleed protection (workspace B's bearer on A's URL → 401, B's URL with A's file_id → 404) and the file_id-UUID-parse 400 path. 23 assertions, all green against a local platform (Postgres+Redis+ platform-server stack matches the e2e-api.yml CI recipe verbatim). Why a new script instead of extending test_poll_mode_e2e.sh: that script tests A2A short-circuit + since_id cursor semantics; this one tests the chat-upload path. They share zero handler code on the platform side and would dilute each other's failure messages if combined. Why not the bearerless-401 strict-mode assertion: the platform's wsauth fail-opens for bearerless requests when MOLECULE_ENV=development (see middleware/devmode.go). The CI workflow doesn't set that var, but some local-dev .env files do — the assertion would flap by environment without testing the poll-mode upload contract. The middleware's own unit tests cover strict-mode 401. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 13:08:55 -07:00
Hongming Wang	6125700c39	test(e2e): plug /tmp scratch leaks in 3 shell E2E tests + add CI lint gate (RFC #2873 iter 2) Three shell E2E tests created scratch files via `mktemp` but never deleted them on early exit (assertion failure, SIGINT, errexit). Each CI run leaked ~10-100 KB of /tmp into the runner; over ~200 runs/week that's 20+ MB of accumulated cruft. ## Files - test_chat_attachments_e2e.sh — was missing both trap and rm; added per-run TMPDIR_E2E with `trap rm -rf … EXIT INT TERM`. - test_notify_attachments_e2e.sh — had a `cleanup()` for the workspace but didn't include the TMPF; only an unconditional `rm -f` at the bottom (line 233) which doesn't fire on early exit. Extended cleanup() to also rm the scratch + dropped the redundant trailing rm. - test_chat_attachments_multiruntime_e2e.sh — `round_trip()` function had per-call `rm -f` only on the success path; failure paths leaked. Switched to script-level TMPDIR_E2E + trap; per-call rm dropped (the trap handles every return path including SIGINT). Pattern: `mktemp -d -t prefix-XXX` for the dir, `mktemp <full-template>` for files (portable across BSD/macOS + GNU coreutils — `-p` is GNU-only and breaks Mac local-dev runs). ## Regression gate New `tests/e2e/lint_cleanup_traps.sh` asserts every `.sh` that calls `mktemp` also has a `trap … EXIT` line in the file. Wired into the existing Shellcheck (E2E scripts) CI step. Verified locally: passes on the fixed state, fails-loud when one of the 3 fixes is reverted. 
## Verification - shellcheck --severity=warning clean on all 4 touched files - lint_cleanup_traps.sh passes on the post-fix tree (6 mktemp users, all have EXIT trap) - Negative test: revert one fix → lint exits 1 with file:line + suggested fix pattern in the error message (CI-grokkable ::error file=… annotation) - Trap fires on SIGTERM mid-run (smoke-tested on macOS BSD mktemp) - Trap fires on `exit 1` (smoke-tested) ## Bars met (7-axis) - SSOT: trap pattern documented in lint message (one rule, one fix) - Cleanup: this IS the cleanup hygiene fix - 100% coverage: lint catches future regressions across all `tests/e2e/*.sh` files, not just the 3 fixed today - File-split: N/A (no files split) - Plugin / abstract / modular: N/A (test infra, not product code) Iteration 2 of RFC #2873.	2026-05-05 04:21:26 -07:00
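The regression gate in 6125700c39 (any script that calls `mktemp` must also carry a `trap … EXIT` line) reduces to a simple text check. The Python sketch below is an assumption about the rule's shape, not the contents of `lint_cleanup_traps.sh`; `missing_exit_trap` and the sample scripts are hypothetical.

```python
import re


def missing_exit_trap(script_text):
    """True when a script calls mktemp but has no trap line that names EXIT."""
    uses_mktemp = re.search(r"\bmktemp\b", script_text) is not None
    has_exit_trap = (
        re.search(r"^\s*trap\b.*\bEXIT\b", script_text, re.MULTILINE) is not None
    )
    return uses_mktemp and not has_exit_trap


fixed_script = """#!/usr/bin/env bash
TMPDIR_E2E=$(mktemp -d -t e2e-XXXXXX)
trap 'rm -rf "$TMPDIR_E2E"' EXIT INT TERM
"""
leaky_script = """#!/usr/bin/env bash
TMPF=$(mktemp)
echo data > "$TMPF"
rm -f "$TMPF"
"""
print(missing_exit_trap(fixed_script), missing_exit_trap(leaky_script))
```

A check this coarse also flags a script that only mentions `mktemp` in a comment, so a real lint would likely need to skip comment lines.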
Hongming Wang	42f2ea3f4f	fix(ci): include event_name in runtime-prbuild-compat concurrency group Every staging push run for the last 4 SHAs was cancelled by the matching pull_request run because both fired into the same concurrency group: group: ${{ github.workflow }}-${{ ...sha }} Same SHA → same group → cancel-in-progress=true means the second arrival cancels the first. Empirically the push run lost the race; staging branch-protection then saw a CANCELLED required check and the auto-promote chain stalled. Fix: include github.event_name in the group key. push and pull_request runs for the same SHA now hash to different groups, both complete, both report SUCCESS to branch protection. Pattern of the bug: 10:46 sha=1e8d7ae1 ev=pull_request conclusion=success 10:46 sha=1e8d7ae1 ev=push conclusion=cancelled 10:45 sha=ecf5f6fb ev=pull_request conclusion=success 10:45 sha=ecf5f6fb ev=push conclusion=cancelled 10:28 sha=471dff25 ev=pull_request conclusion=success 10:28 sha=471dff25 ev=push conclusion=cancelled 10:12 sha=9e678ccd ev=pull_request conclusion=success 10:12 sha=9e678ccd ev=push conclusion=cancelled Same drift class as the 2026-04-28 auto-promote-staging incident (memory: feedback_concurrency_group_per_sha.md) — globally-scoped groups silently cancel runs in matched-SHA scenarios. This is the only workflow in .github/workflows/ that uses the narrow per-sha shape without event_name. Others either don't use concurrency at all, or use ${{ github.ref }} which is event- neutral. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 04:01:20 -07:00
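The collision described in 42f2ea3f4f can be reproduced with a small Python model of the concurrency group key. The exact key layout after the fix is an assumption (the commit only says `github.event_name` was added), and the workflow name string is hypothetical.

```python
def group_key(workflow, sha, event_name=None):
    """Build a concurrency group key, optionally scoped by event name."""
    parts = [workflow, sha] if event_name is None else [workflow, event_name, sha]
    return "-".join(parts)


# Two runs for the same SHA, as in the incident table in the commit message.
runs = [("pull_request", "1e8d7ae1"), ("push", "1e8d7ae1")]
workflow = "Runtime PR-Built Compatibility"  # hypothetical workflow name

old_keys = {group_key(workflow, sha) for _, sha in runs}
new_keys = {group_key(workflow, sha, event) for event, sha in runs}
print(len(old_keys), len(new_keys))
```

With one shared key and cancel-in-progress enabled, the second arrival cancels the first; with two keys, both runs complete and report independently.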
Hongming Wang	90d202c80a	ci(handlers-pg): apply all migrations with skip-on-error + sanity check (#320 ) Previous workflow applied only 049_delegations.up.sql — fragile to future migrations that touch the delegations table or any other handlers/-tested table. Operator would have to remember to update the workflow's psql -f line per migration. New behavior: loop every .up.sql in lexicographic order, apply each with ON_ERROR_STOP=1 + per-migration result captured. Failed migrations are SKIPPED rather than blocking the suite — handles the historical migrations (017_memories_fts_namespace, 042_a2a_queue, etc.) that depend on tables since renamed/dropped and can't replay from scratch. Migrations that DO succeed land their tables, which is sufficient for the integration tests in handlers/. Sanity gate at the end: if the delegations table is missing after the replay, hard-fail with a loud error. That catches a real regression where 049 itself becomes broken (e.g., schema rename), separate from the historical-broken-migration noise above. Per-migration log line ("✓" or "⊘ skipped") makes it easy to spot when a migration that SHOULD have replayed didn't. Verified locally: full migration chain runs, 049 lands, all 7 integration tests pass against the chained-migration DB. Closes #320.	2026-05-05 03:48:43 -07:00
Hongming Wang	4c9f12258d	fix(delegations): preserve result_preview through completion + add real-Postgres integration gate Two-part PR: ## Fix: result_preview was lost on completion Self-review of #2854 caught a real bug. SetStatus has a same-status replay no-op; the order of calls in `executeDelegation` completion + `UpdateStatus` completed branch clobbered the preview field: 1. updateDelegationStatus(completed, "") fires 2. inner recordLedgerStatus(completed, "", "") → SetStatus transitions dispatched → completed with preview="" 3. outer recordLedgerStatus(completed, "", responseText) → SetStatus reads current=completed, status=completed → SAME-STATUS NO-OP, never writes responseText → preview lost Confirmed against real Postgres (see integration test). Strict-sqlmock unit tests passed because they pin SQL shape, not row state. Fix: call the WITH-PREVIEW recordLedgerStatus FIRST, then updateDelegationStatus. The inner call becomes the no-op (correctly preserves the row written by the outer call). Same gap fixed in UpdateStatus handler — body.ResponsePreview was never landing in the ledger because updateDelegationStatus's nested SetStatus(completed, "", "") fired first. ## Gate: real-Postgres integration tests + CI workflow The unit-test-only workflow that shipped #2854 was the root cause. Adding two layers of defense: 1. workspace-server/internal/handlers/delegation_ledger_integration_test.go — `//go:build integration` tag, requires INTEGRATION_DB_URL env var. 4 tests: * ResultPreviewPreservedThroughCompletion (regression gate for the bug above — fires the production call sequence in fixed order and asserts row.result_preview matches) * ResultPreviewBuggyOrderIsLost (DIAGNOSTIC: confirms the same-status no-op contract works as designed; if SetStatus's semantics ever change, this test fires) * FailedTransitionCapturesErrorDetail (failure-path symmetry) * FullLifecycle_QueuedToDispatchedToCompleted (forward-only + happy path) 2. 
.github/workflows/handlers-postgres-integration.yml — required check on staging branch protection. Spins postgres:15 service container, applies the delegations migration, runs `go test -tags=integration` against the live DB. Always-runs + per-step gating on path filter (handlers/wsauth/migrations) so the required-check name is satisfied on PRs that don't touch relevant code. Local dev workflow (file header documents this): docker run --rm -d --name pg -e POSTGRES_PASSWORD=test -p 55432:5432 postgres:15-alpine psql ... < workspace-server/migrations/049_delegations.up.sql INTEGRATION_DB_URL="postgres://postgres:test@localhost:55432/molecule?sslmode=disable" \ go test -tags=integration ./internal/handlers/ -run "^TestIntegration_" ## Why this matters Per memory `feedback_mandatory_local_e2e_before_ship`: backend PRs MUST verify against real Postgres before claiming done. sqlmock pins SQL shape; only a real DB can verify row state. The workflow makes this gate mandatory rather than optional.	2026-05-05 02:47:52 -07:00
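The ordering bug fixed in 4c9f12258d follows directly from the same-status no-op. The toy model below assumes only what the commit message states (a replayed status writes nothing); the `Ledger` class is hypothetical and stands in for the real Go ledger and Postgres row.

```python
class Ledger:
    """Toy stand-in for one delegation ledger row."""

    def __init__(self):
        self.status = "dispatched"
        self.result_preview = ""

    def set_status(self, status, preview):
        # Same-status replay is a no-op: nothing is written, preview included.
        if status == self.status:
            return False
        self.status = status
        self.result_preview = preview
        return True


# Buggy order: the preview-less transition lands first, so the preview is dropped.
buggy = Ledger()
buggy.set_status("completed", "")
buggy.set_status("completed", "response text")

# Fixed order: the call carrying the preview goes first; the second is the no-op.
fixed = Ledger()
fixed.set_status("completed", "response text")
fixed.set_status("completed", "")

print(repr(buggy.result_preview), repr(fixed.result_preview))
```

A mock that only pins SQL shape accepts both orders, which is why only a check on row state distinguishes them.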
Hongming Wang	c89f17a2aa	fix(branch-protection-drift): hard-fail on schedule only, soft-skip + warn on PR #2834 added a hard-fail when GH_TOKEN_FOR_ADMIN_API is missing on schedule + pull_request + workflow_dispatch. The PR-trigger hard-fail is now blocking every PR in the repo because the secret hasn't been provisioned yet — including the staging→main auto-promote PR (#2831), which has no path to set repo secrets itself. Per feedback_schedule_vs_dispatch_secrets_hardening.md the original concern is automated/silent triggers losing the gate without a human to notice. That concern applies to schedule specifically: - schedule: cron, no human, silent soft-skip = invisible regression → KEEP HARD-FAIL. - pull_request: a human is reviewing the PR diff and will see workflow warnings inline. A PR cannot retroactively drift live state — drift happens between PRs (UI clicks, manual gh api PATCH), which the schedule canary catches. The PR-time gate would only catch typos in apply.sh, which the *_payload unit tests catch more directly. → SOFT-SKIP with a prominent warning. - workflow_dispatch: operator override, may not have configured the secret yet. → SOFT-SKIP with warning. The skip is explicit (SKIP_DRIFT_CHECK=1 surfaced to env, then a step `if:` guard) so it's auditable in the workflow run UI, not silently swallowed. Unblocks #2831 (auto-promote staging→main) + every PR currently behind this check.	2026-05-04 21:20:30 -07:00
Hongming Wang	2e505e7748	fix(branch-protection): apply.sh respects live state + full-payload drift Multi-model review of #2827 caught: the script as-shipped would have silently weakened branch protection on EVERY non-checks dimension the moment anyone ran it. Live staging had enforce_admins=true, dismiss_stale_reviews=false, strict=true, allow_fork_syncing=false, bypass_pull_request_allowances={ HongmingWang-Rabbit + molecule-ai app } Script wrote the opposite for all five. Per memory feedback_dismiss_stale_reviews_blocks_promote.md, the dismiss_stale_reviews flip alone is the load-bearing one — would silently re-block every auto-promote PR (cost user 2.5h once). This PR: 1. apply.sh: per-branch payloads (build_staging_payload / build_main_payload) that codify the deliberate per-branch policy already on the repo, with the script's net contribution being ONLY the new check names (Canvas tabs E2E + E2E API Smoke on staging, Canvas tabs E2E on main). 2. apply.sh: R3 preflight that hits /commits/{sha}/check-runs and asserts every desired check name has at least one historical run on the branch tip. Catches typos like "Canvas Tabs E2E" vs "Canvas tabs E2E" — pre-fix a typo would silently block every PR forever waiting for a context that never emits. Skip via --skip-preflight for genuinely-new workflows whose first run hasn't fired. 3. drift_check.sh: compares the FULL normalised payload (admin, review, lock, conversation, fork-syncing, deletion, force-push) not just the checks list. Pre-fix the drift gate would have missed a UI click that flipped enforce_admins or dismiss_stale_reviews. Drops app_id from the comparison since GH auto-resolves -1 to a specific app id post-write. 4. branch-protection-drift.yml: per memory feedback_schedule_vs_dispatch_secrets_hardening.md — schedule + pull_request triggers HARD-FAIL when GH_TOKEN_FOR_ADMIN_API is missing (silent skip masks the gate disappearing). workflow_dispatch keeps soft-skip for one-off operator runs. 
Verified by running drift_check against live state: pre-fix would have shown 5 destructive drifts on staging + 5 on main. Post-fix shows ONLY the 2 intended additions on staging + 1 on main, which go away after `apply.sh` runs.	2026-05-04 20:52:11 -07:00
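The full-payload comparison described in 2e505e7748 can be sketched as a normalise-then-diff step. This Python version is illustrative only: the payload layout, the `checks` key, and the helper names are assumptions, and the real `drift_check.sh` is shell. The one behaviour taken from the commit message is that `app_id` is dropped before comparing.

```python
def normalise(payload):
    """Recursively sort keys and drop app_id, which the server rewrites after a write."""
    if isinstance(payload, dict):
        return {k: normalise(v) for k, v in sorted(payload.items()) if k != "app_id"}
    if isinstance(payload, list):
        return [normalise(item) for item in payload]
    return payload


def drifted_keys(desired, live):
    """Top-level keys whose normalised values differ between desired and live state."""
    d, l = normalise(desired), normalise(live)
    return sorted(key for key in set(d) | set(l) if d.get(key) != l.get(key))


desired = {
    "enforce_admins": True,
    "dismiss_stale_reviews": False,
    "checks": [{"context": "Canvas tabs E2E", "app_id": -1}],
}
live = {
    "enforce_admins": False,  # flipped out of band, e.g. by a UI click
    "dismiss_stale_reviews": False,
    "checks": [{"context": "Canvas tabs E2E", "app_id": 15368}],
}
print(drifted_keys(desired, live))
```

Comparing only the checks list would report no drift here, which is the gap the commit closes.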
Hongming Wang	7cc1c39c49	ci: e2e coverage matrix + branch-protection-as-code Closes #9. Three pieces, all small: 1. docs/e2e-coverage.md — source of truth for which E2E suites guard which surfaces. Today three were running but informational only on staging; that's how the org-import silent-drop bug shipped without a test catching it pre-merge. Now the matrix shows what's required where + a follow-up note for the two suites that need an always-emit refactor before they can be required. 2. tools/branch-protection/apply.sh — branch protection as code. Lets `staging` and `main` required-checks live in a reviewable shell script instead of UI clicks that get lost between admins. This PR's net change: add `E2E API Smoke Test` and `Canvas tabs E2E` as required on staging. Both already use the always-emit path-filter pattern (no-op step emits SUCCESS when the workflow's paths weren't touched), so making them required can't deadlock unrelated PRs. 3. branch-protection-drift.yml — daily cron + drift_check.sh that compares live protection against apply.sh's desired state. Catches out-of-band UI edits before they drift further. Fails the workflow on mismatch; ops re-runs apply.sh or updates the script. Out of scope (filed as follow-ups): - e2e-staging-saas + e2e-staging-external use plain `paths:` filters and never trigger when paths are unchanged. They need refactoring to the always-emit shape (same as e2e-api / e2e-staging-canvas) before they can be required. - main branch protection mirrors staging here; if main wants the E2E SaaS / External added later, do it in apply.sh and rerun. Operator must apply once after merge: bash tools/branch-protection/apply.sh The drift check picks it up from there. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 20:21:59 -07:00
Hongming Wang	8df8487bbe	fix(auto-promote): treat E2E completed/cancelled as defer, not failure Bug: the case statement at line 189 grouped completed/failure \| completed/cancelled \| completed/timed_out into the same "abort + exit 1" branch. cancelled ≠ failure — when per-SHA concurrency (memory: feedback_concurrency_group_per_sha) cancels an older E2E run because a newer push landed, the workflow blocked the whole auto-promote chain on a non-failure. Caught 2026-05-05 02:03 on sha `31f9a5e`: E2E got cancelled by concurrency, auto-promote :latest aborted with exit 1, the next auto-promote-staging cycle had to manually clean up. Split: failure/timed_out keep the abort path. cancelled gets its own clean-defer branch (same shape as in_progress) — proceed=false without exit 1, with a step-summary explaining likely concurrency supersession and pointing operators at manual dispatch if they need that specific SHA promoted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 19:26:29 -07:00
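The split in 8df8487bbe can be summarised as a small decision table. The Python sketch below mirrors the branches named in the commit message; the `decide` function is hypothetical, and the `success` branch is an assumption since the message does not describe it.

```python
def decide(status, conclusion=None):
    """Map an E2E run state to (proceed, exit_code) for the auto-promote step."""
    if status == "completed" and conclusion in ("failure", "timed_out"):
        return {"proceed": False, "exit_code": 1}  # abort path
    if status == "in_progress" or (status == "completed" and conclusion == "cancelled"):
        return {"proceed": False, "exit_code": 0}  # clean defer, no failure
    if status == "completed" and conclusion == "success":
        return {"proceed": True, "exit_code": 0}  # assumed happy path
    raise ValueError(f"unhandled state: {status}/{conclusion}")


for state in [("completed", "failure"), ("completed", "cancelled"), ("in_progress", None)]:
    print(state, decide(*state))
```

The key change is that `cancelled` shares the defer result with `in_progress` instead of the abort result with `failure`.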
Hongming Wang	c5dd14d8db	fix(workflows): preserve curl stderr in 8 status-capture sites Self-review of PR #2810 caught a regression: my mass-fix added `2>/dev/null` to every curl invocation, suppressing stderr. The original `|| echo "000"` shape only swallowed exit codes; stderr (curl's `-sS`-shown dial errors, timeouts, DNS failures) still went to the runner log so operators could see WHY a connection failed. After PR #2810 the next deploy failure would log only the bare HTTP code with no context. That's exactly the kind of diagnostic loss that makes outages take longer to triage. Drop `2>/dev/null` from each curl line; keep it on the `cat` fallback (which legitimately suppresses "no such file" when curl crashed before -w ran). The `>tempfile` redirect alone captures curl's stdout (where -w writes) without touching stderr. Same 8 files as #2810: redeploy-tenants-on-{main,staging}, sweep-stale-e2e-orgs, e2e-staging-{sanity,saas,external,canvas}, canary-staging. Tests: - All 8 files pass the lint - YAML valid Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 18:54:50 -07:00
Hongming Wang	463316772b	fix(workflows): rewrite curl status-capture to prevent exit-code pollution The 2026-05-04 redeploy-tenants-on-main run for sha `2b862f6` emitted "HTTP 000000" and failed the deploy. Root cause: when curl exits non-zero (connection reset → 56, --fail-with-body 4xx/5xx → 22), the `-w '%{http_code}'` already wrote a status to stdout; the inline `|| echo "000"` then fires AND appends another "000" to the captured substitution stdout. Result: HTTP_CODE="<actual><000>", which fails string comparisons against "200" while looking superficially right. Same class of bug the synth-E2E §7c gate hit twice (PRs #2779/#2783 + #2797). Memory feedback_curl_status_capture_pollution.md. Mass fix in 8 workflows: route -w into a tempfile so curl's exit code can't pollute stdout. Wrap with set +e/-e so the non-zero curl exit doesn't trip the outer pipeline. Files: redeploy-tenants-on-main.yml (production-critical, caught the bug), redeploy-tenants-on-staging.yml (sibling), sweep-stale-e2e-orgs.yml (cleanup loop), e2e-staging-sanity.yml (E2E safety-net teardown), e2e-staging-saas.yml, e2e-staging-external.yml, e2e-staging-canvas.yml, canary-staging.yml. Plus a new lint workflow `lint-curl-status-capture.yml` that runs on every PR/push touching `.github/workflows/**`. Multi-line aware: collapses bash `\` continuations, then matches the buggy $(curl ... -w '%{http_code}' ... || echo "000") subshell shape. Distinguishes from the SAFE $(cat tempfile || echo "000") shape (cat with missing file emits empty stdout, no pollution). Verified: - All 8 workflows pass the lint locally - A known-bad injection is caught - A known-safe cat-fallback passes through - yaml.safe_load clean on all changed files Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 18:29:38 -07:00
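The pollution mechanism in the commit above can be reproduced without a network call. The sketch below is an illustration under stated assumptions: `fake_curl` is a hypothetical stub standing in for `curl -w '%{http_code}' --fail-with-body` against a 502, and `buggy_capture` / `safe_capture` are made-up names. The sketch uses `|| true` where the commit describes a `set +e` / `set -e` wrapper, to the same effect.

```shell
# Hypothetical stub: behaves like curl that already wrote the status via -w
# and then exited 22 because of --fail-with-body. No network involved.
fake_curl() {
  printf '502'
  return 22
}

# Buggy shape: the fallback's "000" is appended to what curl already wrote.
buggy_capture() {
  code=$(fake_curl || echo "000")
  printf '%s\n' "$code"
}

# Safer shape: route the status into a tempfile so the exit code cannot
# add anything to the captured value.
safe_capture() {
  tmp=$(mktemp)
  fake_curl >"$tmp" || true          # tolerate the non-zero exit
  # cat only falls back to "000" when the file is missing.
  code=$(cat "$tmp" 2>/dev/null || echo "000")
  rm -f "$tmp"
  printf '%s\n' "$code"
}

buggy_capture   # prints 502000
safe_capture    # prints 502
```

The buggy value still looks like a status at a glance, which is why a string comparison against "200" fails without an obvious cause.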
Hongming Wang	26fa220bef	ci(coverage): per-file 75% floor for MCP/inbox/auth Python critical paths Closes part of #2790 (Phase A). The Python total floor at 86% (set in workspace/pytest.ini, issue #1817) averages over ~6000 lines, so a single MCP-critical file could regress to ~50% with no CI complaint as long as other modules compensate. This is the same distribution gap that #1823 closed Go-side: total floor passes while a critical handler sits at 0%. Added gates for these five files (per-file floor 75%): - workspace/a2a_mcp_server.py — MCP dispatcher (PR #2766 / #2771) - workspace/mcp_cli.py — molecule-mcp standalone CLI entry - workspace/a2a_tools.py — workspace-scoped tool implementations - workspace/inbox.py — multi-workspace inbox + per-workspace cursors - workspace/platform_auth.py — per-workspace token resolver These handle multi-tenant routing, auth tokens, and inbox dispatch. Risk shape mirrors Go-side tokens/secrets — a 0%/50% file here is exactly where the PR #2766 dispatcher bug class slips through without a structural test. Floor 75% is strictly additive — current actuals 80-96% (measured 2026-05-04). No existing PR fails. Ratchet plan in COVERAGE_FLOOR.md target 90% by 2026-08-04. Implementation: pytest already writes .coverage; new step emits a JSON view scoped to the critical files via `coverage json --include="*name"`, then jq extracts each file's percent_covered. Exact key match by basename so workspace/builtin_tools/a2a_tools.py (a different 100% file) doesn't shadow workspace/a2a_tools.py. Verified locally with the actual coverage data: - floor=75 → 0 failures (matches current state) - floor=81 → 1 failure (a2a_tools.py at 80%) — proves the gate trips Pairs with PR #2791 (Phase B — schema↔dispatcher AST drift gate). Phase C (molecule-mcp e2e harness) remains the largest piece in #2790. YAML validated locally before commit per feedback_validate_yaml_before_commit. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 16:35:21 -07:00
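A per-file floor of the kind described in the commit above can be sketched as follows. This is not the workflow's step (which the commit says uses `coverage json` and jq): the sketch assumes the percentages have already been extracted into plain `path percent` lines, and the function name `check_floor` is hypothetical. Matching on the exact path is what keeps `workspace/builtin_tools/a2a_tools.py` from shadowing `workspace/a2a_tools.py`.

```shell
# Hypothetical helper. Reads "path percent" lines on stdin.
#   $1 = floor (percent), $2 = space-separated list of critical paths
# Prints every critical file under the floor; returns 1 if any was found.
check_floor() {
  awk -v floor="$1" -v critical="$2" '
    BEGIN {
      n = split(critical, c, " ")
      for (i = 1; i <= n; i++) want[c[i]] = 1
    }
    # Exact path match, so a same-named file elsewhere is ignored.
    ($1 in want) && ($2 + 0 < floor + 0) {
      printf "BELOW FLOOR: %s at %s%%\n", $1, $2
      bad = 1
    }
    END { exit bad ? 1 : 0 }
  '
}
```

With the figures quoted in the commit (`workspace/a2a_tools.py` at 80%), a floor of 75 passes and a floor of 81 trips the gate.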
Hongming Wang	ff1003e5f6	ci(canary): bump timeout-minutes 12 → 20 to absorb apt tail latency Today's 4 cancelled canaries (25319625186 / 25320942822 / 25321618230 / 25322499952) were all blown by the workflow timeout despite the underlying tenant boot completing successfully (PR molecule-controlplane#455 fix verified — boot events all reach `boot_script_finished/ok`). Why the budget was wrong: The tenant user-data install phase runs apt-get update + install of docker.io / jq / awscli / caddy / amazon-ssm-agent FROM RAW UBUNTU on every tenant boot — none of it is pre-baked into the tenant AMI (EC2_AMI=ami-0ea3c35c5c3284d82, raw Jammy 22.04). Empirical fetch_secrets/ok timing across today's canaries: 51s debug-mm-1777888039 (09:47Z) 82s 25319625186 (12:42Z) 143s 25320942822 (13:11Z) 625s 25322499952 (13:43Z) Same EC2_AMI, same instance type (t3.small), same user-data install sequence — variance is entirely apt-mirror tail latency. A 12-min job budget leaves only ~2 min for the workspace on slow-apt days; the workspace itself needs ~3.5 min for claude-code cold boot, so the budget is structurally too tight whenever apt is slow. 20 min absorbs even the 10+ min boot worst-case and still leaves the workspace its full ~7 min budget. Cap stays well under the runner's 6-hour ubuntu-latest job ceiling. Real fix: pre-bake caddy + ssm-agent into the tenant AMI so the boot phase is no-ops on cached pkgs (will file controlplane#TBD as follow-up — packer/install-base.sh today only bakes the WORKSPACE thin AMI, not the tenant AMI; tenants always boot from raw Ubuntu). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 07:02:12 -07:00
Hongming Wang	032c011b37	ci: bump continuous-synth-e2e cadence 3→6 fires/hour, all clean slots Change cron from '10,30,50' (3 fires/hour) to '2,12,22,32,42,52' (6 fires/hour). All new slots are 1-3 min away from any other cron, avoiding both the cf-sweep collisions (:15, :45) and the :30 heavy slot (canary-staging /30, sweep-aws-secrets, sweep-stale-e2e-orgs every :15). Why: empirically 2026-05-04 the canary fired only once per hour on the 10,30,50 schedule (see #2726). Bumping fires-per-hour gives more chances to land a survived fire under GH's load- related drop ratio, and keeping all slots in clean lanes minimizes the per-fire drop probability. At empirically-observed ~67% drop ratio, 6 attempts/hour yields ~2 effective fires = ~30 min cadence; closer to the 20-min target than the current shape and provides a real degradation alarm if drops get worse. Cost: ~$0.50/day → ~$1/day. Negligible. Closes #2726. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 05:10:48 -07:00
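The cadence estimate in the commit above is simple arithmetic: effective fires per hour equal scheduled fires times the surviving fraction. A rough sketch, with the hypothetical helper name `effective_cadence` and the ~67% drop ratio taken from the commit message:

```shell
# Hypothetical helper: $1 = scheduled fires per hour, $2 = drop ratio (0..1).
effective_cadence() {
  awk -v fires="$1" -v drop="$2" 'BEGIN {
    eff = fires * (1 - drop)
    if (eff <= 0) { print "no effective fires"; exit 1 }
    printf "%.1f effective fires/hour, ~%.0f min between fires\n", eff, 60 / eff
  }'
}

effective_cadence 3 0.67   # old '10,30,50' schedule
effective_cadence 6 0.67   # new '2,12,22,32,42,52' schedule
```

Under these assumptions the old schedule works out to roughly one surviving fire per hour, which matches the observation in #2726, and the new one to roughly two.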
Hongming Wang	98f883cb99	e2e: add direct-Anthropic LLM-key path alongside MiniMax + OpenAI Adds a third secrets-injection branch in test_staging_full_saas.sh behind a new E2E_ANTHROPIC_API_KEY env var, wired into all three auto-running E2E workflows (canary-staging, e2e-staging-saas, continuous-synth-e2e) via a new MOLECULE_STAGING_ANTHROPIC_API_KEY repo secret slot. Operator motivation: after #2578 (the staging OpenAI key went over quota and stayed dead 36+ hours) we shipped #2710 to migrate the canary + full-lifecycle E2E to claude-code+MiniMax. Discovered post- merge that MOLECULE_STAGING_MINIMAX_API_KEY had never been set after the synth-E2E migration on 2026-05-03 either — synth has been red the whole time, not just OpenAI quota. Setting up a MiniMax billing account from scratch is non-trivial (needs platform-specific signup, KYC, top-up). Operators who already have an Anthropic API key for their own Claude Code session can now just set MOLECULE_STAGING_ANTHROPIC_API_KEY and have all three auto-running E2E gates green within one cron firing. Priority chain in test_staging_full_saas.sh (first non-empty wins): 1. E2E_MINIMAX_API_KEY → MiniMax (cheapest) 2. E2E_ANTHROPIC_API_KEY → direct Anthropic (cheaper than gpt-4o, lower setup friction than MiniMax) 3. E2E_OPENAI_API_KEY → langgraph/hermes paths Verify-key case-statement in all three workflows accepts EITHER MiniMax OR Anthropic for runtime=claude-code; error message names both options so operators know they don't have to register a MiniMax account if they already have an Anthropic key. Pinned to runtime=claude-code — hermes/langgraph use OpenAI-shaped envs and won't honour ANTHROPIC_API_KEY without further wiring. After this lands + secret is set, the dispatched canary verifies the new path: gh workflow run canary-staging.yml --repo Molecule-AI/molecule-core --ref staging	2026-05-04 00:51:14 -07:00
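The "first non-empty wins" priority chain in the commit above might look like the sketch below. The environment variable names are the ones from the commit message; the function name `pick_llm_provider` and the printed labels are hypothetical, and the real script builds a SECRETS_JSON payload, not a label.

```shell
# Hypothetical helper: report which key the priority chain would select.
pick_llm_provider() {
  if [ -n "${E2E_MINIMAX_API_KEY:-}" ]; then
    echo "minimax"          # 1. cheapest
  elif [ -n "${E2E_ANTHROPIC_API_KEY:-}" ]; then
    echo "anthropic"        # 2. direct Anthropic
  elif [ -n "${E2E_OPENAI_API_KEY:-}" ]; then
    echo "openai"           # 3. langgraph/hermes paths
  else
    echo "none"
    return 1                # no key set: the caller should hard-fail
  fi
}
```

For example, with both `E2E_ANTHROPIC_API_KEY` and `E2E_OPENAI_API_KEY` set and the MiniMax key empty, the function prints `anthropic`.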
Hongming Wang	eaee113416	e2e-staging-saas: same migration off OpenAI default to claude-code+MiniMax Bundles the same hermes+OpenAI → claude-code+MiniMax migration onto the full-lifecycle E2E that's been red on every provisioning-critical push since 2026-05-01. Same root cause as the canary fix in the prior commit: MOLECULE_STAGING_OPENAI_KEY hit insufficient_quota and there's no SLA on operator billing top-up. Same shape as canary commit: claude-code as default runtime + MiniMax as primary key + hermes/langgraph kept as workflow_dispatch options with OpenAI fallback. Per-runtime verify-key case-statement matches canary-staging.yml + continuous-synth-e2e.yml byte-for-byte. Two extra wrinkles vs canary: - Dispatch input `runtime` default flipped from "hermes" to "claude-code" so operators dispatching from the UI get the safe path by default. They can still pick hermes/langgraph from the dropdown when they specifically want to exercise OpenAI. - E2E_MODEL_SLUG is dispatch-aware: MiniMax-M2.7-highspeed for claude-code, openai/gpt-4o for hermes (slash-form per derive-provider.sh), openai:gpt-4o for langgraph (colon-form per init_chat_model). The branch comment in lib/model_slug.sh covers the rationale; pinning the slug here keeps the dispatch UX stable even when operators don't override. After this lands + the canary commit lands, the only OpenAI-dependent E2E surface is the operator-dispatch fallback. The cron canary, the synth E2E, AND the full-lifecycle gate are all on MiniMax — separate billing account, no OpenAI quota dependency on auto-runs.	2026-05-04 00:20:36 -07:00
Hongming Wang	6f8f978975	canary-staging: migrate from hermes+OpenAI to claude-code+MiniMax Mirror the migration continuous-synth-e2e.yml made on 2026-05-03 (#265). Both workflows hit the same MOLECULE_STAGING_OPENAI_KEY which went over quota on 2026-05-01 (#2578) and stayed dead — the canary has been red for 36+ hours waiting on operator billing top-up. This switch breaks the canary's dependency on OpenAI billing entirely: claude-code template's `minimax` provider routes ANTHROPIC_BASE_URL to api.minimax.io/anthropic and reads MINIMAX_API_KEY at boot. MiniMax is ~5-10x cheaper per token than gpt-4.1-mini AND on a separate billing account, so a future OpenAI quota collapse no longer wedges the canary's "is staging alive?" signal. Changes: - E2E_RUNTIME: hermes → claude-code - Add E2E_MODEL_SLUG: MiniMax-M2.7-highspeed (pin to MiniMax — the per-runtime claude-code default is "sonnet" which routes to direct Anthropic and would defeat the cost saving) - Add E2E_MINIMAX_API_KEY env wired to MOLECULE_STAGING_MINIMAX_API_KEY - Keep E2E_OPENAI_API_KEY as fallback for operator-dispatched runs that set E2E_RUNTIME=hermes via workflow_dispatch - "Verify OpenAI key present" → per-runtime "Verify LLM key present" case statement matching synth E2E's exact shape (claude-code requires MiniMax, langgraph/hermes require OpenAI). Hard-fail on missing required key per #2578's lesson — soft-skip silently fell through to the wrong SECRETS_JSON branch and produced a confusing auth error 5 min later instead of the clean "secret missing" message at the top. Verifies #2578 root cause won't recur on the canary path. The synth E2E and the manual e2e-staging-saas dispatch can still hit OpenAI when explicitly chosen — only the cron canary moves off it.	2026-05-04 00:18:03 -07:00
Hongming Wang	9689c6f6d5	fix(synth-e2e): verify-secrets step must hard-fail (exit 0 only ends step) The previous soft-skip-on-dispatch path used `exit 0`, which only ends the STEP — the rest of the workflow continued with empty secrets. Caught 2026-05-04 by dispatched run 25296530706: - E2E_MINIMAX_API_KEY: empty - verify-secrets printed warning + exit 0 - Install required tools: ran - Run synthetic E2E: ran with empty MiniMax key - SECRETS_JSON branched to OpenAI shape (MINIMAX empty → fall through) - But model slug stayed MiniMax-M2.7-highspeed (workflow env) - Workspace booted with OpenAI keys + MiniMax model - 5 min later: "Agent error (Exception)" — claude SDK 401'd against api.minimax.io with the OpenAI key The confusing failure mode silently masked the real problem (missing secret) under a runtime-error label. Fix: drop both soft-skip paths and exit 1 always. Operators who want to verify a YAML change without setting up secrets can read the verify-secrets step's stderr — the failure IS the verification signal. Pure visibility fix; preserves the cron hard-fail path (now also the dispatch hard-fail path). No mechanism change beyond the exit code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 18:32:26 -07:00
Hongming Wang	a306a97dd3	ci(synth-e2e): move cron off :00 to dodge GH scheduler drops GitHub Actions scheduler de-prioritises :00 cron firings under load. Empirical 2026-05-03: the canary's cron was '0,20,40 * * * *' but actual firings landed at :08, :03, :01, :03; :20 and :40 were silently dropped. Detection latency degraded from claimed 20 min to actual ~60 min worst case. Move to '10,30,50 * * * *': - :10/:30/:50 sit 10 min off the top-of-hour load peak - Still 5 min from :15 sweep-cf-orphans and :45 sweep-cf-tunnels (the original constraint that kept us off :15/:45) - Same 20-min cadence; only the phase changes No code change beyond the cron expression + comment refresh. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 17:28:45 -07:00
Hongming Wang	8b9e7e6d59	ci: port DELETE-verify pattern to remaining staging e2e workflows Follow-up to #2648: the same `>/dev/null || true` swallow-on-error pattern existed in: e2e-staging-canvas.yml (single-slug), e2e-staging-saas.yml (loop), e2e-staging-sanity.yml (loop), e2e-staging-external.yml (loop, was `>/dev/null 2>&1` variant). All four now capture the HTTP code, log a "[teardown] deleted $slug (HTTP $code)" line on success, and emit a workflow warning naming the slug + body excerpt on non-2xx. Loop bodies also tally + summarise total leaks at the end. Exit semantics unchanged: a single cleanup miss still doesn't fail-flag the test (sweep-stale-e2e-orgs is the safety net within ~45 min). The behavior change is purely surfacing: failures that were silent are now visible on the workflow run page. Pairs with #2648's tightened sweeper. Together: per-run cleanup failures are visible AND the safety net catches them quickly. Closes the per-workflow port noted as out-of-scope in #2648. See molecule-controlplane#420. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 16:24:43 -07:00
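The surfacing behaviour described in the commit above can be sketched as a reporting helper. The success line format is quoted from the commit message; the function name `report_delete`, the warning wording, and the slugs in the usage loop are hypothetical. `::warning::` is the standard workflow-command prefix for emitting a warning annotation.

```shell
# Hypothetical helper: $1 = tenant slug, $2 = HTTP code from the DELETE call.
# Returns 1 on non-2xx so a loop can tally leaks without failing the job.
report_delete() {
  case "$2" in
    2[0-9][0-9])
      echo "[teardown] deleted $1 (HTTP $2)" ;;
    *)
      echo "::warning::[teardown] failed to delete $1 (HTTP $2)"
      return 1 ;;
  esac
}

# Usage sketch: tally leaks across slugs, but keep the exit status clean.
leaks=0
for pair in "e2e-aaa:204" "e2e-bbb:500"; do
  report_delete "${pair%%:*}" "${pair##*:}" || leaks=$((leaks + 1))
done
echo "[teardown] $leaks leaked tenant(s)"
```

Keeping the tally separate from the exit status mirrors the commit's point that a cleanup miss is reported but does not fail the test run.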
Hongming Wang	3cd8c53de0	ci: tighten e2e cleanup race window 120m -> ~45m worst case Two changes that close one of the leak classes from the molecule-controlplane#420 vCPU audit: 1. sweep-stale-e2e-orgs.yml: cron */15 (was hourly), MAX_AGE_MINUTES 30 (was 120). E2E runs are 8-25 min wall clock; 30 min is safely above the longest run while shrinking the worst-case leak window from ~2h to ~45 min (15-min sweep cadence + 30-min threshold). 2. canary-staging.yml teardown: the per-slug DELETE used `>/dev/null || true`, which swallowed every failure. A 5xx or timeout from CP looked identical to "successfully deleted" and the canary tenant kept eating ~2 vCPU until the sweeper caught it. Now we capture the response code and surface non-2xx as a workflow warning that names the leaked slug. The exit semantics stay unchanged: a single-canary cleanup miss shouldn't fail-flag the canary itself when the actual smoke check passed. The sweeper is the safety net for whatever slips past. Caught during the molecule-controlplane#420 audit on 2026-05-03: 3 e2e canary tenant orphans were running for 24-95 min, all under the previous 120-min sweep threshold so they went unnoticed until manual cleanup. Same `|| true` pattern exists in e2e-staging-{canvas,external,saas,sanity}.yml; out of scope for this PR (mechanical port; tracking separately) but the sweeper tightening covers all of them by reducing the safety-net latency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 16:08:40 -07:00
Hongming Wang	79a0203798	feat(synth-e2e): switch canary to claude-code + MiniMax-M2.7-highspeed Cuts the per-run LLM cost ~10x (MiniMax M2.7 vs gpt-4.1-mini) and removes the recurring OpenAI-quota-exhaustion failure mode that took the canary down on 2026-05-03 (#265 — staging quota burnt for ~16h). Path: E2E_RUNTIME=claude-code (default) → workspace-configs-templates/claude-code-default/config.yaml's `minimax` provider (lines 64-69) → ANTHROPIC_BASE_URL auto-set to api.minimax.io/anthropic → reads MINIMAX_API_KEY (per-vendor env, no collision with GLM/Z.ai etc.) Workflow changes (continuous-synth-e2e.yml): - Default runtime: langgraph → claude-code - New env: E2E_MODEL_SLUG (defaults to MiniMax-M2.7-highspeed, overridable via workflow_dispatch) - New secret wire: E2E_MINIMAX_API_KEY ← secrets.MOLECULE_STAGING_MINIMAX_API_KEY - Per-runtime missing-secret guard: claude-code requires MINIMAX, langgraph/hermes require OPENAI. Cron firing hard-fails on missing key for the active runtime; dispatch soft-skips so operators can ad-hoc test without setting up the secret first - Operators can still pick langgraph/hermes via workflow_dispatch; the OpenAI fallback path stays wired Script changes (tests/e2e/test_staging_full_saas.sh): - SECRETS_JSON branches on which key is set: E2E_MINIMAX_API_KEY → {MINIMAX_API_KEY: <key>} (claude-code path) E2E_OPENAI_API_KEY → {OPENAI_API_KEY, HERMES_*, MODEL_PROVIDER} (legacy) MiniMax wins when both are present — claude-code default canary must not accidentally consume the OpenAI key Tests (new tests/e2e/test_secrets_dispatch.sh): - 10 cases pinning the precedence + payload shape per branch - Discipline check verified: 5 of 10 FAIL on a swapped if/elif (precedence inversion), all 10 PASS on the fix - Anchors on the section-comment header so a structural refactor fails loudly rather than silently sourcing nothing The model_slug dispatcher (lib/model_slug.sh) needs no change: E2E_MODEL_SLUG override path is already wired (line 41), and 
claude-code template's `minimax-` prefix matcher catches "MiniMax-M2.7-highspeed" via lowercase-on-lookup. Operator action required to land green: - Set MOLECULE_STAGING_MINIMAX_API_KEY in repo secrets (Settings → Secrets and Variables → Actions). Use `gh secret set MOLECULE_STAGING_MINIMAX_API_KEY -R Molecule-AI/molecule-core` to avoid leaking the value into shell history. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 15:35:14 -07:00
Hongming Wang	ac6f65ab5e	test(e2e): pin pick_model_slug behavior with bash unit tests PR #2571 fixed synth-E2E by branching MODEL_SLUG per runtime, but only the langgraph branch was verified at runtime — hermes / claude-code / override / fallback had zero automated coverage. A future regression (e.g. dropping the langgraph case) would silently revert and only surface as "Could not resolve authentication method" mid-E2E. This PR: - Extracts the dispatch into tests/e2e/lib/model_slug.sh as a sourceable pick_model_slug() function. No behavior change. - Adds tests/e2e/test_model_slug.sh — 9 assertions across all 5 dispatch branches plus the override path. Verified to FAIL when any branch is flipped (manually regressed langgraph slash-form to confirm the test catches it; restored before commit). - Wires the unit test into ci.yml's existing shellcheck job (only runs when tests/e2e/ or scripts/ change). Pure-bash, no live infra. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 12:04:12 -07:00
Hongming Wang	80b38900de	fix(auto-promote): skip empty-tree promotes to break perpetual cycle The auto-promote ↔ auto-sync chain has been generating empty PRs indefinitely since the staging merge_queue ruleset uses MERGE strategy: 1. Auto-promote merges PR via queue → main = merge commit M2 not in staging 2. Auto-sync opens sync-back PR. Workflow's local `git merge --ff-only` succeeds (PR title even says "ff to ..."), but the queue lands the PR via MERGE → staging = merge commit S2 not in main 3. Auto-promote sees staging ahead by 1 → opens new promote PR. Tree diff vs main = 0 (S2's tree == main's tree). But the gate logic only checks "all required workflows green", not "actual code to ship" → opens an empty promote PR 4. ... repeat indefinitely Each round costs ~30-40 min wallclock, ~2 manual approvals (the queue requires 1 review and the bot can't self-approve without admin bypass), and one full CodeQL Go run (~15 min). Observed today (2026-05-03) across PRs #2592 → #2594 → #2595 → #2596 → #2597 — 5 PRs, ~3 hours, all empty content. Fix: before opening the promote PR, check that staging's tree actually differs from main's tree. If they're identical (the empty-merge-commit cycle), skip cleanly and let the cycle terminate. Implementation: - New step `Skip if staging tree == main tree` runs before the existing gate check. - `git diff --quiet origin/main $HEAD_SHA` exits 0 iff trees match. - On match: emits a step summary explaining the skip + sets `skip=true`; subsequent gate-check + promote steps are gated on `skip != 'true'` so they short-circuit. - Fail-open: if `git fetch` errors, fall through to gate check (preserve existing behavior). Only skip when diff is DEFINITIVELY empty. Long-term, the cleaner fix is to switch the merge_queue ruleset's merge_method away from MERGE so FF-able PRs land cleanly without a new commit — but that's a broader change affecting every staging PR's commit shape. This guard is the surgical one-step break. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 08:56:44 -07:00
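The empty-tree guard from the commit above rests on `git diff --quiet`, which exits 0 only when the two trees are identical. A minimal sketch, assuming the refs are already fetched; the function name `should_skip_promote` is hypothetical, while the `skip=true` output mirrors the step output named in the commit message.

```shell
# Hypothetical helper: $1 = base ref (e.g. origin/main), $2 = head SHA or ref.
# Prints skip=true only when the trees are definitively identical.
should_skip_promote() {
  if git diff --quiet "$1" "$2"; then
    echo "skip=true"
  else
    # Exit 1 (trees differ) and any git error both land here, so the guard
    # fails open and the existing gate check still runs.
    echo "skip=false"
  fi
}
```

In a workflow step the printed line would typically be appended to `$GITHUB_OUTPUT`, and later steps would be gated on `skip != 'true'`.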
Hongming Wang	2b3f44c3c8	fix(retarget): skip PRs whose head is staging (auto-promote PRs) The retarget-main-to-staging workflow tries to PATCH base=staging on every bot-authored PR opened against main. Auto-promote staging→main PRs have head=staging, base=main — retargeting them sets head AND base to staging, which GitHub rejects with HTTP 422 "no new commits between base 'staging' and head 'staging'". This started surfacing on PR #2588 (2026-05-03 14:30) once #2586 switched the auto-promote workflow to an App token. Before #2586 the auto-promote PR was authored by github-actions[bot], which the retarget filter happened to skip; now it's molecule-ai[bot], which passes the bot filter and triggers the broken retarget attempt. Add a head-ref != 'staging' guard so auto-promote PRs short-circuit before the PATCH. The existing 422 "duplicate base" detector is left alone — it covers a different operational case.	2026-05-03 07:34:24 -07:00
Hongming Wang	bc11ed8a2b	fix(auto-promote): use App token for auto-merge to fire downstream cascade (#2357) GITHUB_TOKEN-initiated merges suppress the downstream `push` event on main per GitHub's documented limitation: https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow Result before this fix: every staging→main promote landed silently: publish-workspace-server-image, canary-verify, and redeploy-tenants-on-main all stayed dark. The polling tail was the SOLE cascade trigger; if it ever 30-min-timed-out the chain dead-locked invisibly. Symptom (from the issue body, 2026-04-30), listed as Time / Event / Triggered?: 05:48:04 / Promote PR #2352 merged (`c140ad28`) / No fired; 06:07:29 / Promote PR #2356 merged (`5973c9bd`) / No fired. Fix: mint the molecule-ai App token BEFORE the promote-PR step and hand it to the auto-merge call. App-token-initiated merges DO trigger downstream workflow_run cascades. The polling tail stays as defense-in-depth (with comments updated): once we've observed >=10 successful natural cascades it can be dropped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 07:13:26 -07:00
Hongming Wang	8f48a38550	fix(publish-runtime): re-add 5 templates wrongly removed from cascade (#2566) The PR #2536 cascade prune ('deprecated, no shipping images') was empirically wrong. Re-confirmed 2026-05-03: - continuous-synth-e2e.yml defaults to langgraph as its primary canary - All 5 'deprecated' templates have successful publish-image runs in the past 24h: langgraph, crewai, autogen, deepagents, gemini-cli Symptom this fixes, issue #2566 (priority-high, failing 36+h): Synthetic E2E (staging): langgraph adapter A2A failure 'Received Message object in task mode', failing for >36h. Today at 11:06 commit `e1628c4` fixed the underlying a2a-sdk strict-mode issue in workspace/a2a_executor.py. publish-runtime fired at 11:13 and cascaded, but only to claude-code, hermes, openclaw, codex. langgraph was excluded by the prune, so its image stayed on the broken runtime and the synth E2E (which defaults to langgraph) kept failing despite the fix being live in PyPI. After this lands + the next runtime publish fires, langgraph image re-bakes with the fix and synth-E2E goes green. Test plan: - [x] yaml-validate the workflow - [ ] After merge, watch publish-runtime cascade to all 9 templates - [ ] Confirm langgraph publish-image fires + succeeds - [ ] Confirm next continuous-synth-e2e run goes green Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 05:41:53 -07:00
Hongming Wang	60a516bc8d	ci(redeploy): fix stale canary_slug default 'hongmingwang' → 'hongming' The workflow_dispatch input default and the workflow_run env fallback both pointed at 'hongmingwang', which doesn't match any current prod tenant (slugs are: hongming, chloe-dong, reno-stars). CP silently skipped the missing canary and put every tenant in batch-1 in parallel, defeating the canary-first soak gate that exists to catch image-boot regressions before they hit the whole fleet. Concrete example from today's `c0838d6` redeploy at 11:53Z (run 25278434388): the dispatched body was `{"target_tag":"staging-c0838d6","canary_slug":"hongmingwang",...}` and the CP response showed all 3 tenants in `"phase":"batch-1"` — no soak, no canary. The deploy happened to be safe, but a broken image would have hit hongming + chloe-dong + reno-stars simultaneously. Fixed in three places: the runtime ordering comment, the workflow_dispatch default, and the env fallback used by the workflow_run trigger. Comment documents the rationale so the next slug rename doesn't silently regress this again. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 05:06:01 -07:00
Hongming Wang	5e46ea70d6	ci(synth-e2e): wire MOLECULE_STAGING_OPENAI_KEY into provisioned tenant The synth-E2E (#2342) provisions a langgraph tenant whose default model `openai:gpt-4.1-mini` requires OPENAI_API_KEY for the first LLM call. Sibling workflows already wire this: - e2e-staging-saas.yml:89 - canary-staging.yml:63 continuous-synth-e2e.yml just forgot. Result: tenant boots, accepts a2a messages, then returns: Agent error: "Could not resolve authentication method. Expected either api_key or auth_token to be set." This was masked since 2026-04-29 (workflow creation) by a2a-sdk v0→v1 contract violations — PR #2558 (Task-enqueue) and #2563 (TaskUpdater.complete/failed terminal events) cleared those, exposing the underlying auth gap on the synth-E2E firing at 11:39 UTC today. The script tests/e2e/test_staging_full_saas.sh:325 already reads E2E_OPENAI_API_KEY and persists it as a workspace_secret on tenant create — only the workflow wiring was missing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 04:43:07 -07:00
Hongming Wang	596e797dca	ci(deploy): broaden ephemeral-prefix matchers to cover rt-e2e-* The redeploy-tenants-on-staging soft-warn filter and the sweep-stale-e2e-orgs janitor both hardcoded `^e2e-` to identify ephemeral test tenants. Runtime-test harness fixtures (RFC #2251) mint slugs prefixed with `rt-e2e-`, which neither matcher recognized. Concrete impact observed today: - Two `rt-e2e-v{5,6}-` tenants left orphaned 8h on staging (sweep-stale-e2e-orgs ignored them). - On the next staging redeploy their phantom EC2s returned `InvalidInstanceId: Instances not in a valid state for account` from SSM SendCommand → CP returned HTTP 500 + ok=false. - The redeploy soft-warn missed them too, so the workflow went red, which broke the auto-promote-staging chain feeding the canvas warm-paper rollout to prod. Fix: switch both matchers to recognize the alternation `^(e2e-|rt-e2e-)`. Long-lived prefixes (demo-prep, dryrun-, dryrun2-*) remain non-ephemeral and continue to hard-fail. Comment documents the source-of-truth list and the cross-file invariant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 04:28:29 -07:00
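The broadened matcher from the commit above is a single extended regular expression. A small sketch, where the function name `is_ephemeral_slug` and the example slugs are hypothetical and only the pattern `^(e2e-|rt-e2e-)` comes from the commit message:

```shell
# Hypothetical helper: succeeds when the slug looks like an ephemeral test tenant.
is_ephemeral_slug() {
  printf '%s\n' "$1" | grep -Eq '^(e2e-|rt-e2e-)'
}

for slug in e2e-abc123 rt-e2e-v5-xyz demo-prep dryrun-1; do
  if is_ephemeral_slug "$slug"; then
    echo "$slug: ephemeral (soft-warn / sweepable)"
  else
    echo "$slug: long-lived (hard-fail)"
  fi
done
```

Because the pattern is anchored with `^`, a slug that merely contains `e2e-` somewhere in the middle is still treated as long-lived.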
Hongming Wang	09010212a0	feat(ci): structural drift gate for cascade list vs manifest (RFC #388 PR-3) Closes the recurrence path of PR #2556. The data fix realigned 8→4 templates in publish-runtime.yml's TEMPLATES variable, but the underlying drift hazard was unguarded — the next manifest change could silently leave cascade out of sync again. This gate fails any PR that changes manifest.json or publish-runtime.yml in a way that makes the cascade list diverge from manifest workspace_templates (suffix-stripped). Either direction is caught: missing-from-cascade templates that won't auto-rebuild on a new wheel publish (the codex-stuck-on-stale-runtime bug class — PR #2512 added codex to manifest, cascade wasn't updated, codex stayed pinned to its last-built runtime version for weeks). extra-in-cascade cascade dispatches to deprecated templates (the wasted-API-calls + dead-CI-noise class — PR #2536 pruned 5 templates from manifest; cascade kept dispatching to all 8 until PR #2556). Triggers narrowly: only on PRs that touch manifest.json, publish-runtime.yml, or the script itself. Fast (single grep+sed+comm pipeline, no Go build). Surfaced during the RFC #388 prior-art audit; folded in as the structural follow-up to the data fix #2556 promised. Self-tested both failure modes locally before commit: - Drop codex from cascade → script fails with "MISSING: codex" - Add langgraph to cascade → script fails with "EXTRA: langgraph" Refs: https://github.com/Molecule-AI/molecule-controlplane/issues/388 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 03:52:39 -07:00
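The two-directional drift check described in the commit above can be sketched with `sort` and `comm`. This is an illustration only: the function name `drift_report` is hypothetical, and it takes both lists as space-separated arguments, whereas the commit says the real script extracts them from publish-runtime.yml and manifest.json with grep and sed. The `MISSING:` / `EXTRA:` prefixes follow the messages quoted in the commit.

```shell
# Hypothetical helper: $1 = cascade template names, $2 = manifest template names
# (both space-separated). Returns 1 when the two sets differ.
drift_report() {
  a=$(mktemp)
  b=$(mktemp)
  # Word-splitting is intentional: one name per line, sorted for comm.
  printf '%s\n' $1 | sort -u >"$a"
  printf '%s\n' $2 | sort -u >"$b"
  # Only in the manifest: these templates would not auto-rebuild.
  comm -13 "$a" "$b" | sed 's/^/MISSING: /'
  # Only in the cascade: these dispatches would go to unsupported templates.
  comm -23 "$a" "$b" | sed 's/^/EXTRA: /'
  if cmp -s "$a" "$b"; then rc=0; else rc=1; fi
  rm -f "$a" "$b"
  return "$rc"
}
```

For example, `drift_report "claude-code hermes openclaw" "claude-code hermes openclaw codex"` prints `MISSING: codex` and returns 1.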
Hongming Wang	e014d22ee9	Merge pull request #2557 from Molecule-AI/feat/sweep-aws-secrets-orphans feat(ops): sweep orphan AWS Secrets Manager secrets	2026-05-03 09:48:59 +00:00
Hongming Wang	6f8f7932d2	feat(ops): add sweep-aws-secrets janitor — orphan tenant bootstrap secrets CP's deprovision flow calls Secrets.DeleteSecret() (provisioner/ec2.go:806) but only when the deprovision runs to completion. Crashed provisions and incomplete teardowns leak the per-tenant `molecule/tenant/<org_id>/bootstrap` secret. At ~$0.40/secret/month, ~45 leaked secrets surfaced as ~$19/month on the AWS cost dashboard. The tenant_resources audit table (mig 024) tracks four kinds today — CloudflareTunnel, CloudflareDNS, EC2Instance, SecurityGroup — and the existing reconciler doesn't catch Secrets Manager orphans. The proper fix (KindSecretsManagerSecret + recorder hook + reconciler enumerator) is filed as a follow-up controlplane issue. This sweeper is the immediate stopgap. Parallel-shape to sweep-cf-tunnels.sh: - Hourly schedule offset (:30, between sweep-cf-orphans :15 and sweep-cf-tunnels :45) so the three janitors don't burst CP admin at the same minute. - 24h grace window — never deletes a secret younger than the provisioning roundtrip, so an in-flight provision can't be racemurdered. - MAX_DELETE_PCT=50 default (mirrors sweep-cf-orphans for durable resources; tenant secrets should track 1:1 with live tenants). - Same schedule-vs-dispatch hardening as the other janitors: schedule → hard-fail on missing secrets, dispatch → soft-skip. - 8-way xargs parallelism, dry-run by default, --execute to delete. Requires a dedicated AWS_JANITOR_* IAM principal — the prod molecule-cp principal lacks secretsmanager:ListSecrets (it only has scoped Get/Create/Update/Delete). The workflow's verify-secrets step will hard-fail on the first scheduled run until those secrets are configured, surfacing the missing setup loudly rather than silently no-op'ing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 02:38:08 -07:00
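The two safety rails named in 6f8f7932d2 (a 24h grace window and a MAX_DELETE_PCT ceiling) can be sketched as a pure planning function. This is an illustration only: `Secret`, `plan_deletes`, and the choice to measure the percentage against the total number of listed secrets are assumptions, not details confirmed by the commit.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Secret:
    name: str
    org_id: str
    created: datetime


def plan_deletes(secrets, live_org_ids, now,
                 grace=timedelta(hours=24), max_delete_pct=50):
    """Pick orphaned secrets that are old enough to delete safely.

    Assumption: an orphan is a secret whose org_id is not in live_org_ids.
    Assumption: the ceiling is a percentage of all listed secrets.
    """
    live = set(live_org_ids)
    orphans = [
        s for s in secrets
        if s.org_id not in live and now - s.created >= grace
    ]
    # Refuse to proceed when the plan would remove too large a share.
    if secrets and len(orphans) * 100 > max_delete_pct * len(secrets):
        raise RuntimeError(
            f"refusing to delete {len(orphans)}/{len(secrets)} secrets "
            f"(limit {max_delete_pct}%)"
        )
    return orphans


if __name__ == "__main__":
    now = datetime(2026, 5, 3, 12, 0, tzinfo=timezone.utc)
    listed = [
        Secret("molecule/tenant/a/bootstrap", "a", now - timedelta(days=3)),
        Secret("molecule/tenant/b/bootstrap", "b", now - timedelta(days=2)),
        Secret("molecule/tenant/c/bootstrap", "c", now - timedelta(hours=1)),
        Secret("molecule/tenant/d/bootstrap", "d", now - timedelta(days=9)),
    ]
    for s in plan_deletes(listed, live_org_ids={"a", "d"}, now=now):
        print("would delete", s.name)
```

Keeping the plan separate from the delete call makes a dry-run default straightforward: the caller prints the plan and only acts on it when an explicit execute flag is set.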
Hongming Wang	24276b9458	fix(publish-runtime): align cascade list to 4 supported runtimes The cascade `TEMPLATES` list in publish-runtime.yml had drifted from manifest.json: Currently dispatches to: claude-code, langgraph, crewai, autogen, deepagents, hermes, gemini-cli, openclaw manifest.json supports: claude-code, hermes, openclaw, codex (after PR #2536 pruned to 4 actively-supported) Two consequences of the drift: 1. `codex` (added in PR #2512, supported in manifest) was never in the cascade — fresh runtime publishes did NOT trigger a codex template rebuild. Codex stayed pinned to whatever runtime version it last saw at its own image-build time. 2. langgraph/crewai/autogen/deepagents/gemini-cli — deprecated, no shipping images, no working A2A — were still receiving cascade dispatches. Wasted API calls and (worse) green CI on dead repos masks "this template is dead, stop maintaining it." Now matches manifest.json workspace_templates exactly. Surfaced during RFC #388 (fast workspace provision) prior-art audit. Long-term fix is to derive TEMPLATES from manifest.json so this can't drift again — captured as a Phase-1 invariant in RFC #388. This commit is the data fix only; structural fix lands with the bake pipeline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 02:28:15 -07:00
Hongming Wang	fd5fe34f69	Merge pull request #2523 from Molecule-AI/dependabot/github_actions/actions/github-script-9.0.0 chore(deps)(deps): bump actions/github-script from 7.1.0 to 9.0.0	2026-05-03 01:37:00 +00:00
Hongming Wang	0d8b0c37a6	Merge pull request #2521 from Molecule-AI/dependabot/github_actions/actions/checkout-6 chore(deps)(deps): bump actions/checkout from 4 to 6	2026-05-03 01:36:57 +00:00
Hongming Wang	252e126207	Merge pull request #2522 from Molecule-AI/dependabot/github_actions/docker/setup-buildx-action-4.0.0 chore(deps)(deps): bump docker/setup-buildx-action from 3.12.0 to 4.0.0	2026-05-03 01:27:03 +00:00
Hongming Wang	e84df73e96	Merge pull request #2528 from Molecule-AI/dependabot/github_actions/docker/build-push-action-7.1.0 chore(deps)(deps): bump docker/build-push-action from 6.19.2 to 7.1.0	2026-05-03 01:27:00 +00:00
dependabot[bot]	c46db97ac6	chore(deps)(deps): bump docker/build-push-action from 6.19.2 to 7.1.0 Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 6.19.2 to 7.1.0. - [Release notes](https://github.com/docker/build-push-action/releases) - [Commits](`10e90e3645...bcafcacb16`) --- updated-dependencies: - dependency-name: docker/build-push-action dependency-version: 7.1.0 dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com>	2026-05-02 19:23:17 +00:00
dependabot[bot]	6c6c6eb1e8	chore(deps)(deps): bump imjasonh/setup-crane from 0.4 to 0.5 Bumps [imjasonh/setup-crane](https://github.com/imjasonh/setup-crane) from 0.4 to 0.5. - [Release notes](https://github.com/imjasonh/setup-crane/releases) - [Commits](`31b88efe9d...6da1ae0188`) --- updated-dependencies: - dependency-name: imjasonh/setup-crane dependency-version: '0.5' dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>	2026-05-02 19:23:13 +00:00
dependabot[bot]	e1f7d49575	chore(deps)(deps): bump actions/github-script from 7.1.0 to 9.0.0 Bumps [actions/github-script](https://github.com/actions/github-script) from 7.1.0 to 9.0.0. - [Release notes](https://github.com/actions/github-script/releases) - [Commits](`f28e40c7f3...3a2844b7e9`) --- updated-dependencies: - dependency-name: actions/github-script dependency-version: 9.0.0 dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com>	2026-05-02 19:23:09 +00:00
dependabot[bot]	ab7ac2e103	chore(deps)(deps): bump docker/setup-buildx-action from 3.12.0 to 4.0.0 Bumps [docker/setup-buildx-action](https://github.com/docker/setup-buildx-action) from 3.12.0 to 4.0.0. - [Release notes](https://github.com/docker/setup-buildx-action/releases) - [Commits](`8d2750c68a...4d04d5d948`) --- updated-dependencies: - dependency-name: docker/setup-buildx-action dependency-version: 4.0.0 dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com>	2026-05-02 19:23:05 +00:00
dependabot[bot]	3598eb41d1	chore(deps)(deps): bump actions/checkout from 4 to 6 Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6. - [Release notes](https://github.com/actions/checkout/releases) - [Commits](https://github.com/actions/checkout/compare/v4...v6) --- updated-dependencies: - dependency-name: actions/checkout dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com>	2026-05-02 19:23:01 +00:00
Hongming Wang	8bf29b7d0e	fix(sweep-cf-tunnels): parallelize deletes + raise workflow timeout The hourly Sweep stale Cloudflare Tunnels job got cancelled mid-cleanup on 2026-05-02 (run 25248788312, killed at 5min after deleting 424/672 stale tunnels). A second manual dispatch finished the remaining 254 fine, so the immediate backlog cleared, but two underlying bugs would re-trip on the next big cleanup. Bug 1: serial delete loop. The execute branch was a `while read; do curl -X DELETE; done` pipeline at ~0.7s/tunnel — fine for the steady-state cleanup of a handful, but a 600+ backlog needs ~7-8min. This commit fans out to $SWEEP_CONCURRENCY (default 8) workers via `xargs -P 8 -L 1 -I {} bash -c '...' _ {} < "$DELETE_PLAN"`. With 8x parallelism the same 600+ list drains in ~60s. Notes: - We use stdin (`<`) not GNU's `xargs -a FILE` so the script stays portable to BSD xargs (matters for local-runner testing on macOS). - We pass ONLY the tunnel id on argv. xargs tokenizes on whitespace by default; tab-separating id+name on argv risks mangling. The name is kept in a side-channel id->name map ($NAME_MAP) and looked up by the worker only on failure, for FAIL_LOG readability. - Workers print exactly `OK` or `FAIL` on stdout; tally with `grep -c '^OK$' / '^FAIL$'`. - On non-zero FAILED, log the first 20 lines of $FAIL_LOG as "Failure detail (first 20):" — same diagnostic surface as before but consolidated so we don't spam logs on a flaky CF API. Bug 2: workflow's 5-min cap was set as a hangs-detector but turned out to be a real-job-too-slow detector. Raised to 30 min — generous headroom for the ~60s steady-state run while still surfacing genuine hangs (and in line with the sweep-cf-orphans companion job). Bug 3 (drive-by): the existing trap was `trap 'rm -rf "$PAGES_DIR"' EXIT`, which would have been silently overwritten by any later trap registration. 
Replaced with a single `cleanup()` function that wipes PAGES_DIR + all four new tempfiles (DELETE_PLAN, NAME_MAP, FAIL_LOG, RESULT_LOG), called once via `trap cleanup EXIT`. Verification: - bash -n scripts/ops/sweep-cf-tunnels.sh: clean - shellcheck -S warning scripts/ops/sweep-cf-tunnels.sh: clean - python3 yaml.safe_load on the workflow: clean - Synthetic 30-line delete plan with every 7th id sentinel'd to return {"success":false}: TEST PASS, DELETED=26 FAILED=4, FAIL_LOG side-channel name lookup verified. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 02:35:46 -07:00
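The fan-out-and-tally shape from 8bf29b7d0e can be sketched in Python with a thread pool. This swaps in `concurrent.futures` for the script's `xargs -P` pipeline, so it is an analogue rather than the actual implementation. `sweep` and `delete_one` are hypothetical names, and `delete_one` stands in for the Cloudflare DELETE call.

```python
from concurrent.futures import ThreadPoolExecutor


def sweep(tunnel_ids, delete_one, name_map, concurrency=8):
    """Delete tunnels in parallel and tally the outcome.

    delete_one(tunnel_id) -> bool is supplied by the caller (hypothetical).
    name_map is the id -> name side channel, consulted only for failures.
    Returns (deleted_count, failed_count, fail_log_lines).
    """
    def worker(tunnel_id):
        try:
            return tunnel_id, bool(delete_one(tunnel_id))
        except Exception:
            # Treat any exception as a failed delete instead of aborting.
            return tunnel_id, False

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(worker, tunnel_ids))

    deleted = sum(1 for _, ok in results if ok)
    fail_log = [
        f"{tid}\t{name_map.get(tid, '<unknown>')}"
        for tid, ok in results if not ok
    ]
    return deleted, len(fail_log), fail_log


if __name__ == "__main__":
    ids = [f"id-{n}" for n in range(1, 31)]
    names = {tid: f"tunnel-{tid}" for tid in ids}
    # Synthetic backend: every 7th id fails, mirroring the commit's self-test.
    fake_delete = lambda tid: int(tid.split("-")[1]) % 7 != 0
    deleted, failed, log = sweep(ids, fake_delete, names)
    print(f"DELETED={deleted} FAILED={failed}")
    for line in log[:20]:
        print(line)
```

Only the id travels to the worker; the name lookup happens afterwards, which matches the commit's reasoning about keeping whitespace-bearing names off the argument list.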
Hongming Wang	6e0eb2ddc9	fix(redeploy-staging): tolerate e2e-* teardown race in fleet HTTP 500 Recurring failure pattern in redeploy-tenants-on-staging: ##[error]redeploy-fleet returned HTTP 500 ##[error]Process completed with exit code 1. with the per-tenant breakdown in the response body showing the failures were on ephemeral e2e-* tenants (saas/canvas/ext) whose parent E2E run tore them down mid-redeploy — SSM exit=2 because the EC2 was already terminating, or healthz timeout because the CF tunnel was already gone. The actual operator-facing tenants (dryrun-98407, demo-prep, etc) all rolled fine in the same call. This shape repeats every staging push that overlaps an active E2E run. The downstream `Verify each staging tenant /buildinfo matches published SHA` step ALREADY distinguishes STALE vs UNREACHABLE for exactly this reason (per #2402); only the top-level `if HTTP_CODE != 200; exit 1` gate misclassifies the race. Filter: HTTP 500 + every failed slug matches `^e2e-` → soft-warn and fall through to verify. Any non-e2e-* failure or non-500 HTTP remains a hard fail, with the failed non-e2e slugs surfaced in the error so the operator doesn't have to dig the response body out of CI. Verified the gate logic with 6 synthetic CP responses (happy / e2e-only race / mixed real+e2e fail / non-200 / 200+ok=false / all-real-fail) — all behave correctly. prod's redeploy-tenants-on-main is intentionally NOT touched: prod CP serves no e2e-* tenants, so the race can't occur there and the strict gate is the right behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 02:17:36 -07:00
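The filter in 6e0eb2ddc9 can be read as a three-way classification of the control-plane response. The sketch below is a hedged Python rendering, not the workflow's shell. `classify_redeploy` is a hypothetical name, the `(http_code, ok, failed_slugs)` inputs are an assumed decomposition of the response, and treating HTTP 200 with `ok=false` as a hard fail is an assumption the commit message does not spell out.

```python
import re


def classify_redeploy(http_code, ok, failed_slugs, ephemeral=r"^e2e-"):
    """Return (verdict, durable_failures) for a fleet redeploy response.

    verdict is "pass", "soft-warn", or "hard-fail".
    durable_failures lists failed slugs that do NOT match the ephemeral
    pattern, so an operator sees them without reading the response body.
    """
    if http_code == 200 and ok:
        return "pass", []
    durable = [s for s in failed_slugs if not re.match(ephemeral, s)]
    # Soft-warn only when the code is 500 and every failure is ephemeral.
    if http_code == 500 and failed_slugs and not durable:
        return "soft-warn", []
    return "hard-fail", durable


if __name__ == "__main__":
    cases = [
        (200, True, []),
        (500, False, ["e2e-saas-1", "e2e-canvas-2"]),
        (500, False, ["e2e-saas-1", "demo-prep"]),
        (502, False, ["e2e-saas-1"]),
        (200, False, []),
        (500, False, ["demo-prep", "dryrun-98407"]),
    ]
    for case in cases:
        print(case, "->", classify_redeploy(*case))
```

Passing `ephemeral=r"^(e2e-|rt-e2e-)"` gives the broader matcher that commit 596e797dca later adopted.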
Hongming Wang	43c234df35	secret-scan: align local pre-commit + extend drift lint (closes #1569 root) #1569 Phase 1 discovery (2026-05-02) found six historical credential exposures in molecule-core git history. All confirmed dead — but the reason they got committed in the first place was that the local pre-commit hook had two gaps that the canonical CI gate (and the runtime's hook) didn't: 1. Pattern set was incomplete. Local hook checked `sk-ant-\|sk-proj-\|ghp_\|gho_\|AKIA\|mol_pk_\|cfut_` — missing `ghs_`, `ghu_`, `ghr_`, `github_pat_`, `sk-svcacct-`, `sk-cp-`, `xox[baprs]-`, `ASIA`. The historical leaks were 5× `ghs_` (App installation tokens) + 1× `github_pat_` — none of which the local hook would have caught even if it ran. 2. `.md` and `docs/` were skip-listed. The leaked tokens lived in `tick-reflections-temp.md`, `qa-audit-2026-04-21.md`, and `docs/incidents/INCIDENT_LOG.md` — exactly the file types the skip-list excluded. The hook ran and silently passed. This commit: - Replaces the local hook's hard-coded inline regex with the canonical 13-pattern array (byte-aligned with `.github/workflows/secret-scan.yml` and the workspace runtime's `pre-commit-checks.sh`). - Removes the `\.md$\|docs/` skip — keeps only binary, lockfile, and hook-self exclusions. - Adds the local hook to `lint_secret_pattern_drift.py` as an in-repo consumer (read-from-disk, no network — the hook lives in the same checkout the lint runs against). Drift now fails the lint when canonical changes without the local hook updating in lockstep. - Adds `.githooks/pre-commit` to the drift-lint workflow's path filter so consumer-side edits also trigger the lint. - Adopts the canonical's "don't echo the matched value" defense (the prior version would have round-tripped a leaked credential into scrollback / CI logs). Verified: `python3 .github/scripts/lint_secret_pattern_drift.py` reports both consumers aligned at 13 patterns. 
The hook's existing six other gates (canvas 'use client', dark theme, SQL injection, go-build, etc.) are untouched. Companion change (already applied via API, no diff here): `Scan diff for credential-shaped strings` is now in the required-checks list on both `staging` and `main` branch protection — was previously a soft gate (workflow ran, exited 1, but didn't block merge). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 23:47:56 -07:00
Hongming Wang	115f1f5e64	fix(redeploy-main): pull staging-<head_sha> instead of stale :latest Auto-trigger from publish-workspace-server-image now resolves target_tag to the just-published `staging-<short_head_sha>` digest instead of `:latest`. Bypasses the dead retag path that was leaving prod tenants on a 4-day-old image. The chain pre-fix: publish-image → pushes :staging-<sha> + :staging-latest (NOT :latest) canary-verify → soft-skips (CANARY_TENANT_URLS unset, fleet not stood up) promote-latest → manual workflow_dispatch only, last run 2026-04-28 redeploy-main → pulls :latest → 2026-04-28 digest → all 3 tenants STALE Today's incident: `e7375348` (main) → publish-image green → redeploy fired → tenants pulled :latest (`76c604fb` digest from prior canary-verified state) → hongming /buildinfo returned `76c604fb` instead of `e7375348` → verify step correctly flagged 3/3 STALE → workflow failed. Today's PRs (#2473 smoke wedge, #2487 panic recovery, #2496 sweeper followups) shipped to GHCR as :staging-<sha> but never reached prod. Fix: - workflow_dispatch input default '' (was 'latest'); empty input triggers auto-compute path - new "Compute target tag" step resolves: 1. operator-supplied input → verbatim (rollback / pin) 2. else → staging-<short_head_sha> (auto) - verify step's operator-pin detection now allows staging-<short_head_sha> as a non-pin (verification still runs) When canary fleet is real, this workflow should chain on canary-verify completion (workflow_run from canary-verify, gated on promote-to-latest success) instead of publish-image — separate, smaller PR. Today's fix unblocks prod deploys without that prerequisite. Companion: promote-latest.yml dispatched 2026-05-02 against `e7375348` to unstick existing prod tenants. This PR prevents recurrence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 23:17:59 -07:00
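The tag resolution order in 115f1f5e64 (operator input verbatim, otherwise `staging-<short_head_sha>`) is small enough to sketch directly. `compute_target_tag` is a hypothetical name, and the default short-SHA length of 7 is an assumption, since the commit does not state how many characters the published tag uses.

```python
def compute_target_tag(operator_input, head_sha, short_len=7):
    """Resolve the image tag a redeploy should pull.

    1. A non-empty operator input is used verbatim (rollback or pin).
    2. Otherwise the tag is derived from the head commit.
    Assumption: short_len=7 is illustrative, not the workflow's real value.
    """
    tag = (operator_input or "").strip()
    if tag:
        return tag
    if len(head_sha) < short_len:
        raise ValueError("head_sha is shorter than the requested short length")
    return f"staging-{head_sha[:short_len]}"


if __name__ == "__main__":
    sha = "0123456789abcdef0123456789abcdef01234567"
    print(compute_target_tag("", sha))        # auto path
    print(compute_target_tag("latest", sha))  # operator pin
```

Defaulting the input to the empty string, as the commit describes, is what lets the same function serve both the automatic trigger and a manual rollback.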
Hongming Wang	3d8a0a58fa	ci(auto-sync): App-token dispatch + ubuntu-latest + workflow_dispatch auto-sync-main-to-staging.yml hasn't fired since 2026-04-29 despite multiple staging→main promotes since. The promote PR #2442 (Phase 2) has been wedged on `mergeStateStatus: BEHIND` for hours because staging is missing the merge commit from PR #2437. Three compounding bugs, all fixed here: 1. GitHub no-recursion suppresses the `on: push` trigger. When the merge queue lands a staging→main promote, the resulting push to main is "by GITHUB_TOKEN", and per https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow that push event does NOT fire any downstream workflows. Verified empirically against SHA `76c604fb` (PR #2437): exactly ONE workflow fired on that push — `publish-workspace-server-image`, dispatched explicitly by auto-promote-staging.yml's polling tail with an App token (the documented #2357 workaround). Every other `on: push` workflow on main, including auto-sync, was silently suppressed. Same fix extended here: auto-promote-staging.yml's polling tail now ALSO dispatches `auto-sync-main-to-staging.yml --ref main` via the App token after the merge lands. App-initiated dispatch propagates `workflow_run` cascades, which is what the publish tail relies on too. Failure path: emits `::error::` with the recovery command — operator runs it once and the next promote self-heals. auto-sync.yml gains `workflow_dispatch:` so it can be invoked from the dispatch above + manually if a future promote also misses (defense in depth). 2. `runs-on: [self-hosted, macos, arm64]` was wrong for this repo. Comment claimed "matches the rest of this repo's workflows" — false: this is the ONLY workflow in molecule-core/.github/workflows/ with a non-ubuntu runs-on. Copy-paste artefact from molecule-controlplane (which IS private and has a Mac runner). 
molecule-core has no Mac runner registered, so even when the trigger DID fire (the 3 historic manual-UI merges), the job would have sat unassigned if the runner were offline. Switched to `ubuntu-latest` to match every other workflow in this repo. 3. The `on: push` trigger remains as a defense-in-depth path for the rare case of a manual UI merge by a real user (which uses their PAT and DOES fire downstream workflows — confirmed via the 2026-04-29 `d35a2420` run with `triggering_actor=HongmingWang-Rabbit` that fired 16 workflows including auto-sync). Belt-and-suspenders. Long-term: switching auto-promote's `gh pr merge --auto` call to use the App token (instead of GITHUB_TOKEN) would let `on: push` triggers fire naturally and obviate the need for the explicit dispatches in the polling tail. Tracked in #2357 — out of scope here. Operator recovery for the current Phase 2 wedge: after this lands on staging, dispatch auto-sync once via `gh workflow run auto-sync-main-to-staging.yml --ref main` to backfill the missed sync from `76c604fb`. PR #2442 will go from BEHIND → CLEAN and auto-merge. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 22:28:35 -07:00
Hongming Wang	c275716005	harness(phase-2): multi-tenant compose + cross-tenant isolation replays Brings the local harness from "single tenant covering the request path" to "two tenants covering both the request path AND the per-tenant isolation boundary" — the same shape production runs (one EC2 + one Postgres + one MOLECULE_ORG_ID per tenant). Why this matters: the four prior replays exercise the SaaS request path against one tenant. They cannot prove that TenantGuard rejects a misrouted request (production CF tunnel + AWS LB are the failure surface), nor that two tenants doing legitimate work in parallel keep their `activity_logs` / `workspaces` / connection-pool state partitioned. Both are real bug classes — TenantGuard allowlist drift shipped #2398, lib/pq prepared-statement cache collision is documented as an org-wide hazard. What changed: 1. compose.yml — split into two tenants. tenant-alpha + postgres-alpha + tenant-beta + postgres-beta + the shared cp-stub, redis, cf-proxy. Each tenant gets a distinct ADMIN_TOKEN + MOLECULE_ORG_ID and its own Postgres database. cf-proxy depends on both tenants becoming healthy. 2. cf-proxy/nginx.conf — Host-header → tenant routing. `map $host $tenant_upstream` resolves the right backend per request. Required `resolver 127.0.0.11 valid=30s ipv6=off;` because nginx needs an explicit DNS resolver to use a variable in `proxy_pass` (literal hostnames resolve once at startup; variables resolve per request — without the resolver nginx fails closed with 502). `server_name` lists both tenants + the legacy alias so unknown Host headers don't silently route to a default and mask routing bugs. 3. _curl.sh — per-tenant + cross-tenant-negative helpers. `curl_alpha_admin` / `curl_beta_admin` set the right Host + Authorization + X-Molecule-Org-Id triple. `curl_alpha_creds_at_beta` / `curl_beta_creds_at_alpha` exist precisely to make WRONG requests (replays use them to assert TenantGuard rejects). 
`psql_exec_alpha` / `psql_exec_beta` shell out per-tenant Postgres exec. Legacy aliases (`curl_admin`, `psql_exec`) keep the four pre-Phase-2 replays working without edits. 4. seed.sh — registers parent+child workspaces in BOTH tenants. Captures server-generated IDs via `jq -r '.id'` (POST /workspaces ignores body.id, so the older client-side mint silently desynced from the workspaces table and broke FK-dependent replays). Stashes `ALPHA_PARENT_ID` / `ALPHA_CHILD_ID` / `BETA_PARENT_ID` / `BETA_CHILD_ID` to .seed.env, plus legacy `ALPHA_ID` / `BETA_ID` aliases for backwards compat with chat-history / channel-envelope. 5. New replays. tenant-isolation.sh (13 assertions) — TenantGuard 404s any request whose X-Molecule-Org-Id doesn't match the container's MOLECULE_ORG_ID. Asserts the 404 body has zero tenant/org/forbidden/denied keywords (existence of a tenant must not be probable from the outside). Covers cross-tenant routing misconfigure + allowlist drift + missing-org-header. per-tenant-independence.sh (12 assertions) — both tenants seed activity_logs in parallel with distinct row counts (3 vs 5) and confirm each tenant's history endpoint returns exactly its own counts. Then a concurrent INSERT race (10 rows per tenant in parallel via `&` + wait) catches shared-pool corruption + prepared-statement cache poisoning + redis cross-keyspace bleed. 6. Bug fix: down.sh + dump-logs SECRETS_ENCRYPTION_KEY validation. `docker compose down -v` validates the entire compose file even though it doesn't read the env. up.sh generates a per-run key into its own shell — down.sh runs in a fresh shell that wouldn't see it, so without a placeholder `compose down` exited non-zero before removing volumes. Workspaces silently leaked into the next ./up.sh + seed.sh boot. Caught when tenant-isolation.sh F1/F2 saw 3× duplicate alpha-parent rows accumulated across three prior runs. Same fix applied to the workflow's dump-logs step. 7. 
requirements.txt — pin molecule-ai-workspace-runtime>=0.1.78. channel-envelope-trust-boundary.sh imports from `molecule_runtime.` (the wheel-rewritten path) so it catches the failure mode where the wheel build silently strips a fix that unit tests on local source still pass. CI was failing this replay because the wheel wasn't installed — caught in the staging push run from #2492. 8. .github/workflows/harness-replays.yml — Phase 2 plumbing. * Removed /etc/hosts step (Host-header path eliminated the need; scripts already source _curl.sh). * Updated dump-logs to reference the new service names (tenant-alpha + tenant-beta + postgres-alpha + postgres-beta). * Added SECRETS_ENCRYPTION_KEY placeholder env on the dump step. Verified: ./run-all-replays.sh from a clean state — 6/6 passed (buildinfo-stale-image, channel-envelope-trust-boundary, chat-history, peer-discovery-404, per-tenant-independence, tenant-isolation). Roadmap section updated: Phase 2 marked shipped. Phase 3 promoted to "replace cp-stub with real molecule-controlplane Docker build + env coherence lint." Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 21:36:40 -07:00
Hongming Wang	e58e446444	docs(ci): correct test-ops-scripts.yml header — discover does NOT recurse The previous header said `unittest discover from the scripts/ root walks recursively`, contradicting the workflow body which runs two passes precisely because discover does NOT recurse without __init__.py. Fixed self-review feedback on PR #2440. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:52:58 -07:00
Hongming Wang	f2545fcb57	Merge pull request #2440 from Molecule-AI/chore/wheel-rewriter-tests-and-noqa-cleanup chore: rewriter unit tests + drop misleading noqa on import inbox	2026-05-01 03:48:33 +00:00
Hongming Wang	6e92fe0a08	chore: rewriter unit tests + drop misleading noqa on `import inbox` Two small follow-ups to the PR #2433 → #2436 → #2439 incident chain. 1) `import inbox # noqa: F401` in workspace/a2a_mcp_server.py was misleading — `inbox` IS used (at the bridge wiring inside main()). F401 means "imported but unused", which would mask a real future F401 if the usage is removed. Drop the noqa, keep the explanatory block comment about the rewriter's `import X` → `import mr.X as X` expansion (and the `import X as Y` → `import mr.X as X as Y` trap the comment exists to prevent re-introducing). 2) scripts/test_build_runtime_package.py — 17 unit tests covering `rewrite_imports()` and `build_import_rewriter()` in scripts/build_runtime_package.py. Until now the function had zero coverage despite the entire wheel build depending on it. Tests pin: bare-import aliasing, dotted-import preservation, indented imports, from-imports (simple + dotted + multi-symbol + block), the `import X as Y` rejection added in PR #2436 (with comment- stripping + indented + comma-not-alias edge cases), allowlist anchoring (`a2a` ≠ `a2a_tools`), and end-to-end reproduction of the PR #2433 failing pattern + the #2436 fix pattern. 3) Wire scripts/test_*.py into CI by adding a second discover pass to test-ops-scripts.yml. Top-level scripts/ tests live alongside their target file (parallels the scripts/ops/ test layout); the existing scripts/ops/ pass keeps running because scripts/ops/ has no __init__.py so a single discover from scripts/ root doesn't recurse. Two passes is simpler than retrofitting namespace packages. Path filter widened from `scripts/ops/` to `scripts/*` so PRs touching the build script trigger the new tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:45:32 -07:00
Hongming Wang	3c16c27415	ci(wheel-smoke): always-run with per-step if-gates for required-check eligibility The `PR-built wheel + import smoke` gate caught the broken wheel from PR #2433 (`import inbox as _inbox_module` collision) but couldn't block the merge because it isn't a required check on staging. Promoting it to required is the right move per the runtime publish pipeline gates note (2026-04-27 RuntimeCapabilities ImportError outage), but the existing `paths: [workspace/**, scripts/...]` filter blocks PRs that don't touch those paths from ever generating the check run — branch protection would deadlock waiting on a check that never fires. Refactor (same shape as e2e-api.yml's e2e-api job): - Drop top-level `paths:` filter — workflow runs on every push/PR/ merge_group event. - Add `detect-changes` job using dorny/paths-filter to compute the `wheel=true|false` output. - Collapse to ONE always-running `local-build-install` job named `PR-built wheel + import smoke`. Per-step `if:` gates on the detect output. PRs untouched by wheel-relevant paths emit a no-op SUCCESS step ("paths filter excluded this commit") so the check passes without rebuilding the wheel. - merge_group + workflow_dispatch unconditionally `wheel=true` so the queue always validates the to-be-merged state, regardless of which PR composed it. Why one-job-with-step-gates instead of two-jobs-sharing-name: SKIPPED check runs block branch protection even when SUCCESS siblings exist (verified PR #2264 incident, 2026-04-29). Single always-run job emits exactly one SUCCESS check run regardless of paths filter. Follow-up: open a separate PR adding `PR-built wheel + import smoke` to the staging branch protection's required_status_checks.contexts once this lands. Doing both in one PR risks the protection update firing before the workflow refactor merges, deadlocking unrelated PRs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:40:05 -07:00
Hongming Wang	c68ec23d3c	Merge pull request #2410 from Molecule-AI/auto/harness-replays-ci-gate ci: gate PRs on tests/harness/run-all-replays.sh	2026-04-30 20:35:30 +00:00
Hongming Wang	0f0df576f5	Merge pull request #2392 from Molecule-AI/auto/e2e-staging-external-runtime test(e2e): live staging regression for external-runtime awaiting_agent transitions	2026-04-30 20:32:23 +00:00
Hongming Wang	c8b17ea1ad	fix(harness): install httpx for replay Python evals peer-discovery-404 imports workspace/a2a_client.py which depends on httpx; the runner's stock Python doesn't have it, so the replay's PARSE assertion (b) fails with ModuleNotFoundError on every run. The WIRE assertion (a) — pure curl — passes, so the failure was masking just enough to make the replay LOOK partially-broken when the tenant side is fine. Adding tests/harness/requirements.txt with only httpx instead of sourcing workspace/requirements.txt: that file pulls a2a-sdk, langchain-core, opentelemetry, sqlalchemy, temporalio, etc. — ~30s of install for one replay's PARSE step. The harness's deps surface should grow when a new replay introduces a new import, not by default. Workflow gains one step (`pip install -r tests/harness/requirements.txt`) between the /etc/hosts setup and run-all-replays. No other changes.	2026-04-30 13:32:00 -07:00
Hongming Wang	24cb2a286f	ci(harness-replays): KEEP_UP=1 so dump-logs step has containers to read First run on PR #2410 failed with 'container harness-tenant-1 is unhealthy' but the dump-compose-logs step printed empty tenant logs because run-all-replays.sh's trap-on-EXIT had already torn down the harness. Setting KEEP_UP=1 leaves containers in place; the always-run Force teardown step at the end owns cleanup explicitly. Now we'll actually see why the tenant didn't become healthy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 13:15:46 -07:00
Hongming Wang	3105e87cf7	ci: gate PRs on tests/harness/run-all-replays.sh Closes the gap between "the harness exists" and "the harness blocks bugs." Phase 2 of the harness roadmap (per tests/harness/README.md): make harness-based E2E a required CI check on every PR touching the tenant binary or the harness itself. Trigger: push + pull_request to staging+main, paths-filtered to workspace-server/, canvas/, tests/harness/**, and this workflow. merge_group support included so this becomes branch-protectable. Single-job-with-conditional-steps pattern (matches e2e-api.yml). One check run regardless of paths-filter outcome; satisfies branch protection cleanly per the PR #2264 SKIPPED-in-set finding. Why this exists: 2026-04-30 we shipped a TenantGuard allowlist gap (/buildinfo added to router.go in #2398, never added to the allowlist) that the existing buildinfo-stale-image.sh replay would have caught. The harness was wired correctly; nobody ran it. Replays as a discipline beat replays as a memory item. The CI pipeline: detect-changes (paths filter) └ harness-replays (always) ├ no-op pass when paths-filter says no relevant change └ otherwise: checkout + sibling plugin checkout + /etc/hosts entry + run-all-replays.sh + compose-logs-on-failure + force-teardown Compose logs from tenant/cp-stub/cf-proxy/postgres are dumped on failure so a CI red is debuggable without re-reproducing locally. The trap in run-all-replays.sh handles teardown; the always-run down.sh step is a belt-and-suspenders against trap-bypass kills. Follow-ups (not in this PR): - Add this check to staging branch protection once it's been green for a few PRs (the new-workflow-instability hedge that other gates followed). - Eventually wire the buildx GHA cache to speed up tenant image builds — currently every PR rebuilds the full Dockerfile.tenant (Go + Next.js + template clones) from scratch. Acceptable for now; optimize when the timeout-minutes:30 ceiling becomes painful. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 13:04:53 -07:00
Hongming Wang	ef206b5be6	refactor(ci): extract wheel smoke into shared script publish-runtime.yml had a broad smoke (AgentCard call-shape, well-known mount alignment, new_text_message) inline as a heredoc. runtime-prbuild- compat.yml had a narrow inline smoke (just `from main import main_sync`). Result: a PR could introduce SDK shape regressions that pass at PR time and only fail at publish time, post-merge. Extract the broad smoke into scripts/wheel_smoke.py and invoke it from both workflows. PR-time gate now matches publish-time gate — same script, same assertions. Eliminates the drift hazard of two heredocs that have to be kept in lockstep manually. Verified locally: * Built wheel from workspace/ source, installed in venv, ran smoke → pass * Simulated AgentCard kwarg-rename regression → smoke catches it as `ValueError: Protocol message AgentCard has no "supported_interfaces" field` (the exact failure mode of #2179 / supported_protocols incident) Path filter for runtime-prbuild-compat extended to include scripts/wheel_smoke.py so smoke-only edits get PR-validated. publish- runtime path filter intentionally NOT extended — smoke-only edits should not auto-trigger a PyPI version bump. Subset of #131 (the broader "invoke main() against stub config" goal remains pending — main() needs a config dir + stub platform server). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 11:52:07 -07:00
Hongming Wang	9b909c4459	fix(ci): gate 50%-floor on TOTAL_VERIFIED >= 4
Self-review of #2403 caught a regression: with a 1-tenant fleet (the exact case the original #2402 fix targeted), the new floor would re-introduce the flake. Trace:
  TOTAL=1, UNREACHABLE=1, $((1/2))=0
  if 1 -gt 0 → TRUE → exit 1
The 50%-rule only meaningfully distinguishes "real outage" from "teardown race" when the fleet is large enough that "half down" is statistically meaningful. With 1-3 tenants, canary-verify is the actual gate (it runs against the canary first and aborts the rollout if the canary fails to come up).
Gate the floor on TOTAL_VERIFIED >= 4. Truth table:
  TOTAL  UNREACHABLE  RESULT
  1      1            soft-warn (original e2e flake case)
  4      2            soft-warn (exactly half)
  4      3            hard-fail (75%, real outage)
  10     6            hard-fail (60%, real outage)
Mirrored across staging.yml + main.yml.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 11:40:31 -07:00
Hongming Wang	ec39fecda2	fix(ci): hard-fail when >50% of fleet unreachable post-redeploy Belt-and-suspenders sanity floor on top of the unreachable-soft-warn introduced earlier in this PR. Addresses the residual gap noted in review: if a new image crashes on startup, every tenant ends up unreachable, and the soft-warn alone would let that ship as a green deploy. Canary-verify catches it on the canary tenant first, but this guard is a fallback for canary-skip dispatches and same-batch races. Threshold is 50% of healthz_ok-snapshotted tenants — comfortably above the typical e2e-* teardown rate (5-10/hour, ~1 ephemeral tenant per batch) but below any plausible real-outage scenario. Mirrored across staging.yml + main.yml for shape parity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 11:35:56 -07:00
Hongming Wang	d45241cae7	fix(ci): distinguish unreachable from stale in /buildinfo verify step
The /buildinfo verify step (PR #2398) was treating "no /buildinfo response" the same as "tenant returned wrong SHA"; both bumped MISMATCH_COUNT and hard-failed the workflow. First post-merge run on staging caught a real edge case: ephemeral E2E tenants (slug e2e-20260430-...) get torn down by the E2E teardown trap between CP's healthz_ok snapshot and the verify step running, so the verify step would dial into DNS that no longer resolves and hard-fail on a benign condition.
The bug class we actually care about is STALE (tenant up + serving old code, the #2395 root). UNREACHABLE post-redeploy is almost always a benign teardown race; real "tenant up but unreachable" is caught by CP's own healthz monitor + the alert pipeline, so double-counting it here was making this workflow flaky on every staging push that overlapped E2E.
Wire:
- Split MISMATCH_COUNT into STALE_COUNT + UNREACHABLE_COUNT.
- STALE → hard-fail the workflow (the bug class we're guarding).
- UNREACHABLE → ::warning::, don't fail. Reachable-mismatch still hard-fails.
- Job summary surfaces both lists separately so on-call can tell at a glance which class fired.
Mirror in redeploy-tenants-on-main.yml for shape parity (prod has fewer ephemeral tenants but identical asymmetry would be a gratuitous fork).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 11:25:46 -07:00
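
Note on the three /buildinfo verify commits above (d45241cae7, ec39fecda2, 9b909c4459): taken together they describe a small decision rule. The workflow steps themselves are not shown in this log, so the following is only a sketch of that rule as the commit messages state it. The function name `verify_outcome` and the Python form are assumptions for illustration; only the counter names (TOTAL_VERIFIED, STALE_COUNT, UNREACHABLE_COUNT) and the truth-table rows come from the messages.

```python
def verify_outcome(total_verified: int, stale_count: int, unreachable_count: int) -> str:
    """Hypothetical restatement of the post-redeploy verify gate.

    Returns "hard-fail", "soft-warn", or "ok".
    """
    # STALE (tenant up but serving old code) always fails the workflow.
    if stale_count > 0:
        return "hard-fail"
    # The 50% floor applies only when the fleet has at least 4 verified tenants.
    # Integer division mirrors the shell arithmetic $((TOTAL/2)) in the trace.
    if total_verified >= 4 and unreachable_count > total_verified // 2:
        return "hard-fail"
    # Any remaining unreachable tenants are treated as a benign teardown race.
    if unreachable_count > 0:
        return "soft-warn"
    return "ok"


# Rows from the truth table in 9b909c4459: (TOTAL, UNREACHABLE) -> RESULT
for total, unreachable in [(1, 1), (4, 2), (4, 3), (10, 6)]:
    print(total, unreachable, verify_outcome(total, 0, unreachable))
```

The ordering puts the STALE check first because the messages say a reachable mismatch still hard-fails regardless of fleet size; the floor is a fallback on top of that.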


342 Commits