molecule-core

Author	SHA1	Message	Date
claude-ceo-assistant (Claude Opus 4.7 on Hongming's MacBook)	25fb696965	chore: reconcile main → staging post-suspension divergence. Refs Task #165 (Class D AUTO_SYNC_TOKEN plumbing). main and staging diverged after the 2026-05-06 GitHub-org suspension because Class D / Class G / feature work landed on staging while unrelated CI fixes (#34-47, ECR auth-inline, buildx→docker, pre-clone manifest deps) landed straight on main. Both branches edited the same workflow files, so every push to main triggered an Auto-sync run that aborted at `git merge --no-ff origin/main` with 7 content conflicts: - .github/workflows/canary-verify.yml (URL: github.com → Gitea) - .github/workflows/ci.yml (3 URL refs) - .github/workflows/publish-runtime.yml (cascade: HTTP repo-dispatch → Gitea push) - .github/workflows/publish-workspace-server-image.yml (drop AWS-action steps; ECR auth is inline) - .github/workflows/retarget-main-to-staging.yml (URL) - manifest.json (lowercase org slug + add mock-bigorg from main) - scripts/clone-manifest.sh (keep main's MOLECULE_GITEA_TOKEN auth path + drop awk-tolower since manifest is now lowercase) Resolution: union. staging's post-suspension Gitea/ECR migrations win on URL/policy edits; main's additive work (mock-bigorg manifest entry, inline ECR auth, MOLECULE_GITEA_TOKEN basic-auth) is preserved on top. After this lands, staging is a strict superset of main, so the next auto-sync run on a push to main will be a clean fast-forward / no-op. The auto-sync workflow on main also picks up staging's AUTO_SYNC_TOKEN swap (Class D #26) for free, fixing the latent layer-2 push-auth issue. Verified locally: - bash -n scripts/clone-manifest.sh - python -c 'yaml.safe_load(...)' on each touched workflow - python -c 'json.load(open(manifest.json))' (21 plugins, 9 templates, 7 org_templates) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 14:24:37 -07:00
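The local manifest check mentioned in the commit above (parse the JSON, then confirm the entry counts) could be sketched roughly as follows. This is an illustration only: it assumes the manifest's top-level sections are JSON arrays, and the key names in the sample (`plugins`, `templates`, `org_templates`) are taken from the counts quoted in the message, not from the actual file.

```python
import json


def summarize_manifest(text: str) -> dict[str, int]:
    """Parse manifest JSON and count the entries of every top-level list.

    Assumption: sections such as plugins/templates/org_templates are
    top-level arrays. Non-list values are ignored.
    """
    manifest = json.loads(text)  # raises json.JSONDecodeError on malformed input
    return {
        key: len(value)
        for key, value in manifest.items()
        if isinstance(value, list)
    }


if __name__ == "__main__":
    # Hypothetical miniature manifest, not the real manifest.json.
    sample = json.dumps(
        {"plugins": ["a", "b"], "templates": ["t"], "org_templates": [], "version": 1}
    )
    print(summarize_manifest(sample))
```

Comparing the returned counts against the expected totals would catch an entry dropped during conflict resolution, which a bare syntax check does not.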
devops-engineer	194cdf012b	chore(ci): retrigger publish-workspace-server-image after ECR repo create (#173). Run #1010 (post-#46) succeeded all the way to push but failed with "repository molecule-ai/platform does not exist": the platform image ECR repo had never been created (only platform-tenant existed). Created the repo via: `aws ecr create-repository --region us-east-2 --repository-name molecule-ai/platform --image-scanning-configuration scanOnPush=true`. This is a one-line workflow comment to satisfy the path-filter and re-run the publish workflow against the now-existing repo. Closes #173 properly this time: pre-clone + inline ECR auth + ECR repo all in place. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 13:54:11 -07:00
devops-engineer	f0e8d9bb23	fix(ci): inline aws ecr get-login-password + docker login (followup #173). CI run #987 (post-#45) showed `docker push` from shell still hits "no basic auth credentials": `aws-actions/amazon-ecr-login@v2` writes auth to a step-scoped DOCKER_CONFIG that doesn't carry across to the next shell step on Gitea Actions. Fix: drop both `aws-actions/configure-aws-credentials@v4` and `aws-actions/amazon-ecr-login@v2`. Run `aws ecr get-login-password | docker login` inline in the same shell step as `docker build` + `docker push`. AWS creds come from secrets via env vars, ECR token is fresh per-step (12h validity is plenty), config.json lives in the same shell process, so auth state is guaranteed. This is the operator-host manual approach mapped 1:1 into CI. runner-base image already has aws-cli + docker (verified locally). Closes #173 (fifth piece, and final; this matches the manual flow exactly). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 13:49:12 -07:00
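The inline login flow described in the commit above can be sketched as a pair of Python helpers. This is not the workflow's actual step (which is shell); it only illustrates the order of operations. The `run` parameter is a hypothetical injection point added so the sketch can be exercised without AWS or Docker installed; the CLI flags themselves (`aws ecr get-login-password --region`, `docker login --username AWS --password-stdin`) are the standard ones.

```python
import subprocess
from typing import Callable

Runner = Callable[..., subprocess.CompletedProcess]


def ecr_login(registry: str, region: str, run: Runner = subprocess.run) -> None:
    """Fetch a fresh ECR token and pipe it to `docker login` on stdin."""
    token = run(
        ["aws", "ecr", "get-login-password", "--region", region],
        check=True, capture_output=True, text=True,
    ).stdout
    # The token goes in via stdin so it never appears in the argument list.
    run(
        ["docker", "login", "--username", "AWS", "--password-stdin", registry],
        input=token, check=True, text=True,
    )


def build_and_push(image: str, context: str = ".", run: Runner = subprocess.run) -> None:
    """Plain `docker build` followed by `docker push`, no buildx."""
    run(["docker", "build", "-t", image, context], check=True)
    run(["docker", "push", image], check=True)
```

Calling `ecr_login(...)` and then `build_and_push(...)` back to back from one process mirrors the point of the fix: every command inherits the same environment (including any `DOCKER_CONFIG`), so the credentials written by the login are visible to the push.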
devops-engineer	43e2d24c5b	fix(ci): replace buildx with plain docker build+push (followup #173). CI run #946 (post-#43) confirmed `driver: docker` doesn't fix the ECR push 401 either: buildx CLI inside the runner container talks to the operator-host docker daemon (mounted socket), but the daemon doesn't see the runner's ECR auth state, and the runner's buildx CLI doesn't attach the auth header in a way the daemon accepts. Drop buildx + build-push-action entirely. Plain `docker build` + `docker push` from the runner container works because both use the SAME docker socket + the SAME runner-container config.json (populated by `aws ecr get-login-password | docker login` from amazon-ecr-login). Trade-off: lose multi-arch support. We only ship linux/amd64 tenant images today, so this is fine. If multi-arch becomes a requirement later, we can revisit (likely with `docker buildx create --driver=remote` pointing at an external buildkit, but that's substantial infra work; not worth it for a single-arch shop). Closes #173 (fourth piece, and hopefully last; this matches the operator-host manual approach exactly). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 13:43:50 -07:00
devops-engineer	bee4f9ea79	fix(ci): use docker driver for buildx + drop type=gha cache (followup #173). PR #38 + #41 fixed the Dockerfile-side clone issue. CI run #893 then revealed two Gitea-Actions-specific issues with the unchanged buildx config: 1. `failed to push: 401 Unauthorized` to ECR. Root cause: default buildx driver `docker-container` spawns a buildkit container that doesn't share the host's `~/.docker/config.json`, so the ECR auth set up by amazon-ecr-login doesn't reach the push. Fix: pin `driver: docker` so buildx delegates to the host daemon, which already has the ECR creds. 2. `dial tcp ...:41939: i/o timeout` on `_apis/artifactcache/cache`. Root cause: `cache-from/cache-to: type=gha` is GitHub-specific; Gitea Actions has no compatible artifact-cache backend, so every cache lookup fails after a 30s timeout. Fix: remove the cache-* options. Cold-build cost is <10min for 37-repo clone + Go/Node compile, acceptable. Could revisit with type=registry inline cache later if rebuilds get painful. With this + #38/#41, the workflow should run end-to-end on Gitea Actions: pre-clone -> docker build (host daemon) -> ECR push. Closes #173 (third and final piece). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 13:35:07 -07:00
devops-engineer	55689e0b10	fix(post-suspension): migrate github.com/Molecule-AI refs to git.moleculesai.app (Class G #168). The GitHub org Molecule-AI was suspended on 2026-05-06; canonical SCM is now Gitea at https://git.moleculesai.app/molecule-ai/. Stale github.com/Molecule-AI/... URLs return 404 and break tooling that clones / pip-installs / curls them. This bundles all non-Go-module URL fixes for this repo into a single PR. Go module path references (in *.go, go.mod, go.sum) are out of scope here -- tracked separately under Task #140. Token-auth clone URLs also flip ${GITHUB_TOKEN} -> ${GITEA_TOKEN} since the GitHub token does not auth against Gitea. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 13:08:15 -07:00
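The scope rule in the commit above (rewrite stale URLs everywhere except Go module files, and flip the token variable only on clone URLs) might be sketched like this. It is a hypothetical helper, not the script actually used for the PR; the per-line token swap is one possible reading of "token-auth clone URLs", and the file path in the usage example is invented.

```python
from pathlib import PurePosixPath

OLD_PREFIX = "github.com/Molecule-AI/"
NEW_PREFIX = "git.moleculesai.app/molecule-ai/"


def is_go_module_file(path: str) -> bool:
    """Go sources and module files are out of scope (tracked separately)."""
    name = PurePosixPath(path).name
    return name.endswith(".go") or name in {"go.mod", "go.sum"}


def migrate(path: str, text: str) -> str:
    """Rewrite stale GitHub URLs to the Gitea host, line by line."""
    if is_go_module_file(path):
        return text
    out = []
    for line in text.split("\n"):
        if OLD_PREFIX in line:
            line = line.replace(OLD_PREFIX, NEW_PREFIX)
            # Only lines that carried a stale URL get the token swap, so
            # unrelated uses of GITHUB_TOKEN are left untouched.
            line = line.replace("${GITHUB_TOKEN}", "${GITEA_TOKEN}")
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    print(migrate("scripts/example.sh", "pip install git+https://github.com/Molecule-AI/demo.git"))
```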
devops-engineer	a6d67b4c68	fix(ci): pre-clone manifest deps in workflow, drop in-image clone (closes #173). publish-workspace-server-image.yml could not run on Gitea Actions because Dockerfile.tenant's stage 3 ran `git clone` against private Gitea repos from inside the Docker build context, where no auth path exists. Every workspace-server rebuild required a manual operator-host push. Move cloning to the trusted CI context (where AUTO_SYNC_TOKEN, the devops-engineer persona PAT, is naturally available). Dockerfile.tenant now COPYs from .tenant-bundle-deps/, populated by the workflow's new "Pre-clone manifest deps" step. The Gitea token never enters the image. - scripts/clone-manifest.sh: optional MOLECULE_GITEA_TOKEN env embeds basic-auth in the clone URL; redacted in log output. Anonymous fallback preserved for future public-repo path. - .github/workflows/publish-workspace-server-image.yml: new pre-clone step before docker build; injects AUTO_SYNC_TOKEN. Fail-fast if the secret is empty. - workspace-server/Dockerfile.tenant: drop stage 3 (templates), COPY from .tenant-bundle-deps/ instead. Header documents the prereq. - .gitignore: ignore /.tenant-bundle-deps/ so a local build can't accidentally commit cloned repos. Verified locally: clone-manifest.sh with the devops-engineer persona token cloned all 37 repos (9 ws + 7 org + 21 plugins, 4.9MB after .git strip). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 12:59:46 -07:00
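The token handling described for scripts/clone-manifest.sh (embed basic-auth in the clone URL when a token is set, fall back to anonymous otherwise, and redact credentials before logging) can be illustrated in Python. The real script is shell and its details are not shown here; the `user` value, the repo name `demo`, and both function names are assumptions made for the sketch.

```python
import re
from typing import Optional
from urllib.parse import quote


def clone_url(host: str, org: str, repo: str,
              token: Optional[str] = None, user: str = "oauth2") -> str:
    """Build an HTTPS clone URL, embedding basic-auth only when a token is given.

    The username is a placeholder assumption; the source does not say which
    one the script uses.
    """
    path = f"{host}/{org}/{repo}.git"
    if not token:
        return f"https://{path}"  # anonymous fallback
    return f"https://{quote(user, safe='')}:{quote(token, safe='')}@{path}"


# Matches the userinfo part of a URL ("user:secret@") so it can be masked.
_CREDENTIALS = re.compile(r"(https?://)[^/@\s]+@")


def redact(text: str) -> str:
    """Mask any embedded URL credentials before the text reaches a log."""
    return _CREDENTIALS.sub(r"\1***@", text)


if __name__ == "__main__":
    url = clone_url("git.moleculesai.app", "molecule-ai", "demo", token="s3cr3t")
    print(redact(f"cloning {url}"))
```

Keeping redaction as a separate step applied to log lines, rather than building a second "safe" URL, means the same filter also covers error output from `git` that happens to echo the URL.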
claude-ceo-assistant	b73d3bfff2	fix(ci): mark CodeQL continue-on-error (advisory only); closes #156	2026-05-07 17:26:52 +00:00
devops-engineer	6de3c1ccd2	fix(ci): add scripts/** to publish-workspace-server-image path filter. scripts/clone-manifest.sh runs inside the platform Dockerfile build, so a change to that script needs to retrigger publish. Without it, the prior fix (clone via Gitea + lowercase org) didn't trigger this workflow because scripts/ wasn't in the path filter. Also serves as the file change to satisfy the path filter for THIS push, retriggering publish-workspace-server-image now.	2026-05-07 08:18:53 -07:00
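Why the earlier script fix did not retrigger the publish workflow can be shown with a small path-filter model. This is only an approximation: `fnmatchcase` is not the glob engine the Actions runner uses, and the "before" pattern list is hypothetical (only `scripts/**` is named in the commit).

```python
from fnmatch import fnmatchcase
from typing import Iterable


def triggers(changed_files: Iterable[str], patterns: Iterable[str]) -> bool:
    """Return True if any changed file matches any path-filter pattern."""
    patterns = list(patterns)
    return any(fnmatchcase(f, p) for f in changed_files for p in patterns)


if __name__ == "__main__":
    # Hypothetical filter before the fix; the real list is not in the source.
    before = [
        "workspace-server/**",
        ".github/workflows/publish-workspace-server-image.yml",
    ]
    after = before + ["scripts/**"]
    changed = ["scripts/clone-manifest.sh"]
    print(triggers(changed, before), triggers(changed, after))
```

Under this model a push that touches only `scripts/clone-manifest.sh` is ignored by the old filter and picked up by the new one, which is the behaviour the commit describes.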
devops-engineer	694a036a7f	chore(ci): trailing newline to retrigger publish-workspace-server-image (path-filter requires workflow file change)	2026-05-07 08:12:10 -07:00
devops-engineer	10e510f50c	chore: drop github-app-auth + swap GHCR→ECR (closes #157, #161)

Two coupled cleanups for the post-2026-05-06 stack:

============================================
The plugin injected GITHUB_TOKEN/GH_TOKEN via the App's installation-access flow (~hourly rotation). Per-agent Gitea identities replaced this approach after the 2026-05-06 suspension: workspaces now provision with a per-persona Gitea PAT from .env instead of an App-rotated token. The plugin code itself lived on github.com/Molecule-AI/molecule-ai-plugin-github-app-auth, which is also unreachable post-suspension; checking it out at CI build time was already failing.

Removed:
- workspace-server/cmd/server/main.go: githubappauth import + the `if os.Getenv("GITHUB_APP_ID") != ""` block that called BuildRegistry. gh-identity remains as the active mutator.
- workspace-server/Dockerfile + Dockerfile.tenant: COPY of the sibling repo + the `replace github.com/Molecule-AI/molecule-ai-plugin-github-app-auth => /plugin` directive injection.
- workspace-server/go.mod + go.sum: github-app-auth dep entry (cleaned up by `go mod tidy`).
- 3 workflows: actions/checkout steps for the sibling plugin repo:
  - .github/workflows/codeql.yml (Go matrix path)
  - .github/workflows/harness-replays.yml
  - .github/workflows/publish-workspace-server-image.yml

Verified `go build ./cmd/server` + `go vet ./...` pass post-removal.

=======================================================
Same workflow used to push to ghcr.io/molecule-ai/platform + platform-tenant. ghcr.io/molecule-ai is gone post-suspension. The operator's ECR org (153263036946.dkr.ecr.us-east-2.amazonaws.com/molecule-ai/) already hosts platform-tenant + workspace-template-* + runner-base images and is the post-suspension SSOT for container images. This PR aligns publish-workspace-server-image with that stack.

- env.IMAGE_NAME + env.TENANT_IMAGE_NAME repointed to ECR URL.
- docker/login-action swapped for aws-actions/configure-aws-credentials@v4 + aws-actions/amazon-ecr-login@v2 chain (the standard ECR auth pattern; uses AWS_ACCESS_KEY_ID/SECRET secrets bound to the molecule-cp IAM user).

The :staging-<sha> + :staging-latest tag policy is unchanged: staging-CP's TENANT_IMAGE pin still points at :staging-latest, just with the new registry prefix.

Refs molecule-core#157, #161; parallel to org-wide CI-green sweep.	2026-05-07 07:48:51 -07:00
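
The tag policy in the commit above (same two tags, new registry prefix) can be illustrated with a small sketch. This is not the workflow's code, which this log does not show. It assumes that `platform` and `platform-tenant` sit directly under the ECR prefix quoted in the message; the helper name `staging_refs` is hypothetical.

```python
# Registry prefix as quoted in the commit message above.
ECR_PREFIX = "153263036946.dkr.ecr.us-east-2.amazonaws.com/molecule-ai"


def staging_refs(image: str, sha: str, prefix: str = ECR_PREFIX) -> list[str]:
    """Return the two staging image refs (:staging-<sha>, :staging-latest).

    `image` is the bare image name (assumed: "platform" or "platform-tenant");
    `sha` is whatever commit identifier the workflow tags with.
    """
    if not image or not sha:
        raise ValueError("image and sha are required")
    base = f"{prefix.rstrip('/')}/{image}"
    return [f"{base}:staging-{sha}", f"{base}:staging-latest"]


if __name__ == "__main__":
    for ref in staging_refs("platform-tenant", "10e510f50c"):
        print(ref)
```

Because only the prefix changes, a consumer pinned to `:staging-latest` needs nothing beyond the new registry host in its image reference.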
devops-engineer	64a0bc1f7e	fix(ci): use AUTO_SYNC_TOKEN for auto-sync main->staging (Class D)

Same shape as molecule-controlplane#29: the per-job GITHUB_TOKEN doesn't have the Gitea API permissions to open the PRs and push the branches that the auto-sync flow needs. AUTO_SYNC_TOKEN is the devops-engineer persona PAT (per saved memory feedback_per_agent_gitea_identity_default).

Companion prod ops (already done):
- devops-engineer added as collaborator on molecule-core (write)
- devops-engineer added to staging branch protection push_whitelist
- AUTO_SYNC_TOKEN registered as Actions secret on molecule-core	2026-05-07 07:01:46 -07:00
devops-engineer	1d8c101c94	chore: drop github-app-auth + swap GHCR→ECR (closes #157, #161)

Two coupled cleanups for the post-2026-05-06 stack:

#157: drop molecule-ai-plugin-github-app-auth
============================================
The plugin injected GITHUB_TOKEN/GH_TOKEN via the App's installation-access flow (~hourly rotation). Per-agent Gitea identities replaced this approach after the 2026-05-06 suspension: workspaces now provision with a per-persona Gitea PAT from .env instead of an App-rotated token. The plugin code itself lived on github.com/Molecule-AI/molecule-ai-plugin-github-app-auth, which is also unreachable post-suspension; checking it out at CI build time was already failing.

Removed:
- workspace-server/cmd/server/main.go: githubappauth import + the `if os.Getenv("GITHUB_APP_ID") != ""` block that called BuildRegistry. gh-identity remains as the active mutator.
- workspace-server/Dockerfile + Dockerfile.tenant: COPY of the sibling repo + the `replace github.com/Molecule-AI/molecule-ai-plugin-github-app-auth => /plugin` directive injection.
- workspace-server/go.mod + go.sum: github-app-auth dep entry (cleaned up by `go mod tidy`).
- 3 workflows: actions/checkout steps for the sibling plugin repo:
  - .github/workflows/codeql.yml (Go matrix path)
  - .github/workflows/harness-replays.yml
  - .github/workflows/publish-workspace-server-image.yml

Verified `go build ./cmd/server` + `go vet ./...` pass post-removal.

#161: swap GHCR→ECR for publish-workspace-server-image
=======================================================
Same workflow used to push to ghcr.io/molecule-ai/platform + platform-tenant. ghcr.io/molecule-ai is gone post-suspension. The operator's ECR org (153263036946.dkr.ecr.us-east-2.amazonaws.com/molecule-ai/) already hosts platform-tenant + workspace-template-* + runner-base images and is the post-suspension SSOT for container images. This PR aligns publish-workspace-server-image with that stack.

- env.IMAGE_NAME + env.TENANT_IMAGE_NAME repointed to ECR URL.
- docker/login-action swapped for aws-actions/configure-aws-credentials@v4 + aws-actions/amazon-ecr-login@v2 chain (the standard ECR auth pattern; uses AWS_ACCESS_KEY_ID/SECRET secrets bound to the molecule-cp IAM user).

The :staging-<sha> + :staging-latest tag policy is unchanged: staging-CP's TENANT_IMAGE pin still points at :staging-latest, just with the new registry prefix.

Refs molecule-core#157, #161; parallel to org-wide CI-green sweep.	2026-05-07 05:12:06 -07:00
claude-ceo-assistant	06d4bab29d	Merge pull request 'fix(ci): port publish-runtime cascade to Gitea repo-dispatch API (closes #14)' (#20) from fix/14-cascade-gitea-dispatch into staging	2026-05-07 10:36:32 +00:00
Hongming Wang	4279fecde5	fix(ci): keep codex in TEMPLATES + skip-if-no-publish-image.yml

The v2 dropped codex from TEMPLATES on the basis of "no publish-image.yml = not part of cascade today." That was correct about the immediate behavior but tripped cascade-list-drift-gate.yml because manifest.json still declares codex (it IS a live runtime, referenced from workspace/config.py and cloned into dev envs by clone-manifest.sh; only the image-publish path is missing).

Restore codex to TEMPLATES (matching manifest) and add a runtime soft-skip: probe each repo for .github/workflows/publish-image.yml via the Gitea contents API and skip cleanly if 404. Final job log distinguishes "complete across all" vs "complete with soft-skips".

This preserves the drift gate's invariant (TEMPLATES == manifest) while honoring the empirical fact that codex has no publish-image workflow yet. If codex later gains the workflow, no change here is needed: the probe will see 200 and the cascade will fan out to it naturally.

Refs molecule-core#14, molecule-core#20.	2026-05-07 03:32:53 -07:00
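
The soft-skip probe described in the commit above might look roughly like the sketch below. It is an illustration under stated assumptions, not the workflow's actual script. The HTTP call is injected as a hypothetical `get_status` callable so the logic can be exercised offline. The repo slugs in the usage section are placeholders. Raising on any status other than 200 or 404 is a choice made here and is not stated in the commit.

```python
from typing import Callable, Iterable, List, Tuple

WORKFLOW_PATH = ".github/workflows/publish-image.yml"


def contents_url(gitea_url: str, repo: str, path: str = WORKFLOW_PATH) -> str:
    """Build the Gitea contents-API URL for `path` in an `owner/name` repo."""
    return f"{gitea_url.rstrip('/')}/api/v1/repos/{repo}/contents/{path}"


def plan_cascade(
    repos: Iterable[str],
    gitea_url: str,
    get_status: Callable[[str], int],  # hypothetical: URL -> HTTP status code
) -> Tuple[List[str], List[str]]:
    """Split repos into (targets, soft_skipped) by probing for the workflow file."""
    targets: List[str] = []
    skipped: List[str] = []
    for repo in repos:
        status = get_status(contents_url(gitea_url, repo))
        if status == 200:
            targets.append(repo)   # workflow exists: include in the fan-out
        elif status == 404:
            skipped.append(repo)   # no publish-image.yml yet: skip cleanly
        else:
            # Assumption: anything else (401, 5xx, ...) is treated as a hard error.
            raise RuntimeError(f"unexpected HTTP {status} probing {repo}")
    return targets, skipped


def summary(skipped: List[str]) -> str:
    """Final log line, using the two phrases quoted in the commit message."""
    return "complete with soft-skips" if skipped else "complete across all"


if __name__ == "__main__":
    # Placeholder slugs and canned statuses, for illustration only.
    canned = {"example-org/template-a": 200, "example-org/template-codex": 404}
    lookup = {contents_url("https://git.example", r): s for r, s in canned.items()}
    targets, skipped = plan_cascade(canned, "https://git.example", lookup.__getitem__)
    print(targets, skipped, summary(skipped))
```

Keeping the probe separate from the template list is what lets TEMPLATES stay equal to the manifest: membership is declared statically and eligibility is decided at run time.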
Hongming Wang	607444e71b	feat(ci): replace curl-dispatch with push-mode cascade (v2)

Empirical blocker on v1: Gitea 1.22.6 has no repository_dispatch / workflow_dispatch trigger API (verified across 6 candidate paths in issuecomment-913). v1's curl-POST loop would always exit-1.

v2 pivots to push-mode: each template repo got a small companion PR (merged 2026-05-07) adding a `.runtime-version` file at root + a `resolve-version` job in publish-image.yml that reads the file and forwards the value to the reusable build workflow. publish-runtime now updates that file via git-clone + commit + push, which trips each template's existing `on: push: branches: [main]` trigger.

Behaviour changes vs v1:
- Templates list dropped from 9 → 8 (codex has no publish-image.yml so was never part of the cascade in practice).
- 3-retry pull-rebase loop per template (handles concurrent-push races without force-push). Failures collected, job exits 1 with the failed-template list at the end.
- Idempotency: when re-run with the same version, templates already pinned to that version contribute zero commits, so the operator can safely re-run to retry partial failures.
- Author line: "publish-runtime cascade <publish-runtime@moleculesai.app>" trailer makes it clear the commit is workflow-driven, not human (per memory feedback_github_botring_fingerprint).

DISPATCH_TOKEN secret name unchanged (still consumed at secrets.DISPATCH_TOKEN per `569df259`).

Refs molecule-core#14, builds on molecule-core#20 issuecomment-923 (Phase 2 design).	2026-05-07 03:17:38 -07:00
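
The control flow listed under "Behaviour changes vs v1" (bounded retries, collected failures, zero commits when already pinned) can be sketched as below. This models only the loop structure. The git operations are stood in for by two hypothetical callables: `read_pin` returns a template's current `.runtime-version`, and `try_push` is assumed to perform one pull-rebase, commit and push attempt and report whether the push was accepted.

```python
from typing import Callable, Iterable, List, Optional, Tuple

MAX_ATTEMPTS = 3  # "3-retry pull-rebase loop per template"


def cascade(
    templates: Iterable[str],
    version: str,
    read_pin: Callable[[str], Optional[str]],   # hypothetical: template -> pinned version
    try_push: Callable[[str, str], bool],       # hypothetical: one rebase+commit+push attempt
) -> Tuple[List[str], List[str], List[str]]:
    """Return (updated, unchanged, failed) template lists for one cascade run."""
    updated: List[str] = []
    unchanged: List[str] = []
    failed: List[str] = []
    for template in templates:
        if read_pin(template) == version:
            # Idempotency: already pinned, so no commit is produced.
            unchanged.append(template)
            continue
        for _ in range(MAX_ATTEMPTS):
            if try_push(template, version):
                updated.append(template)
                break
        else:
            # All attempts lost the race or errored: record and keep going.
            failed.append(template)
    return updated, unchanged, failed


def exit_code(failed: List[str]) -> int:
    """The job exits 1 only after every template has been attempted."""
    return 1 if failed else 0


if __name__ == "__main__":
    pins = {"a": "1.0", "b": "0.9"}
    result = cascade(["a", "b"], "1.0", pins.get, lambda t, v: True)
    print(result, exit_code(result[2]))
```

Collecting failures instead of aborting on the first one is what makes a re-run cheap: templates that already succeeded fall into the unchanged bucket on the second pass.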
Hongming Wang	569df259ba	fix(ci): align secret name to plumbed DISPATCH_TOKEN (closes #14)

The cascade workflow was reading from `secrets.TEMPLATE_DISPATCH_TOKEN` but the plumbed secret name is `DISPATCH_TOKEN` (verified just now via GET /repos/molecule-ai/molecule-core/actions/secrets; only DISPATCH_TOKEN is set). Without this rename the cascade would always evaluate "secret missing" and exit 1 on the next push to staging, defeating the entire point of grant-role-access.sh --apply that just landed.

Three references updated:
- env mapping (`secrets.X` → `secrets.DISPATCH_TOKEN`)
- workflow_dispatch warning text
- push-trigger error text

The bash-side variable name is unchanged (still `DISPATCH_TOKEN`) so the curl invocation at line 372 is unaffected. YAML round-trip parses clean.	2026-05-07 02:38:20 -07:00
claude-ceo-assistant	1d9d8c7809	Merge pull request 'fix(scripts): migrate ghcr.io→ECR + raw.githubusercontent.com→Gitea (#46)' (#16) from fix/script-ghcr-and-lint-paths into staging	2026-05-07 09:25:24 +00:00
claude-ceo-assistant	ce3f1f48a4	fix(ci): port publish-runtime cascade to Gitea repo-dispatch API (closes molecule-core#14)

## Symptom

`publish-runtime.yml::cascade` fired a `repository_dispatch` to 10 workspace-template repos via direct curl to `https://api.github.com/repos/...`. Post-2026-05-06 the org's GitHub presence is suspended; every invocation 404s. The job's `::warning::` posture meant the failure didn't propagate, leaving the runtime PyPI publish → template image rebuild pipeline silently broken.

## Why Option A (rewrite) and not Option B (delete)

Verified 2026-05-07 by devops-engineer (molecule-core#14 thread):

- The cron-poll mechanism (/etc/cron.d/molecule-deploy-poll) tracks ONLY the Vercel/Railway-deployed repos (landingpage/docs/molecule-app/molecules-market/molecule-controlplane). It does NOT track workspace-template-* repos.
- Each of the 9 template `publish-image.yml` workflows has `repository_dispatch: types: [runtime-published]` as a load-bearing trigger. Without the cascade, when the runtime ships a new PyPI version, templates don't auto-rebuild.

So Option B (delete) would silently break the runtime → template fan-out. Option A (rewrite to Gitea's API shape) is the right call. Security-auditor agreed after seeing the cron-poll TRACKED list.

## API surface change

| Concern | Pre-fix (GitHub) | Post-fix (Gitea) |
|---|---|---|
| URL | `https://api.github.com/repos/$REPO/dispatches` | `${GITEA_URL}/api/v1/repos/$REPO/dispatches` |
| Owner case | `Molecule-AI/...` | `molecule-ai/...` (lowercase, Gitea is case-sensitive) |
| Auth header | `Authorization: Bearer $DISPATCH_TOKEN` | `Authorization: token $DISPATCH_TOKEN` |
| Body shape | `{event_type, client_payload}` | UNCHANGED (Gitea is GitHub-compatible here) |
| Success code | `204 No Content` | `204 No Content` (unchanged) |

`GITEA_URL` defaults to `https://git.moleculesai.app`; overridable via job env.

## Out-of-band: DISPATCH_TOKEN secret rotation

The DISPATCH_TOKEN secret was a GitHub PAT. It must be re-minted as a Gitea PAT for the new API to authenticate. Per saved memory `feedback_per_agent_gitea_identity_default`, this should be a dedicated `publish-runtime-bot` persona token with `write:repository` scope on the 9 target repos, NOT the founder PAT.

This PR ships the workflow change. Token rotation is the operator-host follow-up (security-auditor's lane); coordinate the merge so the token is in place before the next runtime release fires.

## Backwards compatibility

The workflow ran silently-broken since 2026-05-06 (every invocation 404 + `::warning::` but no failure). So there is no functional regression from "silently broken" to "actually working". Any in-progress operator-managed manual dispatch path is unaffected; the Gitea API parallel path doesn't require operator intervention.

## Test plan

- [x] YAML parse OK on the modified workflow file
- [ ] Smoke test: trigger a runtime publish (or simulate via dispatching to one template) post-merge; verify HTTP 204 + the template's publish-image workflow fires + the template's image gets re-pushed against the new runtime version. Phase 4 verification belongs to internal#46 follow-up.

## Hostile self-review (3 weakest spots)

1. The fan-out remains all-or-nothing: a single template failure surfaces as a `::warning::` but PyPI publish proceeds. With 9 templates this is a ~10% per-template chance of stale-image-on-runtime-bump if any one fails. Defense: the warning shows up in the workflow summary; operators retry. Future hardening: requeue-on-fail with bounded retry, or a separate reconcile cron that detects template/runtime version drift and re-dispatches.
2. `DISPATCH_TOKEN` validity is enforced by the Gitea API (401 on stale) but the workflow doesn't differentiate 401 from 404. Either way the warning fires. Future hardening: explicit token-shape check at the start of the cascade job (curl `/api/v1/user` once, fail-fast if 401).
3. Owner-case lowercase is right today but couples the workflow to the current Gitea org slug. If the org is ever renamed, this workflow breaks silently. Less fragile alternative: derive REPO from a canonical config (e.g. `gh repo list molecule-ai`) instead of string-concatenating. Acceptable today; filed as the same future hardening pass as item 1.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 01:31:37 -07:00
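
The "API surface change" table and self-review item 2 in the commit above can be made concrete with a short sketch. It only builds the request this commit describes and classifies response codes; it sends nothing. The later commit 607444e71b in this log reports that Gitea 1.22.6 does not expose a dispatch endpoint, so this shows the intended shape and is not a working call. The `version` key inside `client_payload`, the repo name in the usage section, and both function names are hypothetical.

```python
import json
from typing import Dict, Tuple

DEFAULT_GITEA_URL = "https://git.moleculesai.app"  # default quoted in the commit


def build_dispatch(
    repo: str,
    token: str,
    version: str,
    gitea_url: str = DEFAULT_GITEA_URL,
) -> Tuple[str, Dict[str, str], str]:
    """Return (url, headers, body) for the post-fix column of the table."""
    owner, name = repo.split("/", 1)
    # Owner slug is lowercased because Gitea is case-sensitive on owners.
    url = f"{gitea_url.rstrip('/')}/api/v1/repos/{owner.lower()}/{name}/dispatches"
    headers = {
        "Authorization": f"token {token}",  # Gitea style; GitHub used "Bearer"
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "event_type": "runtime-published",
        "client_payload": {"version": version},  # payload key is an assumption
    })
    return url, headers, body


def classify(status: int) -> str:
    """Separate the cases that self-review item 2 says the workflow conflates."""
    if status == 204:
        return "ok"
    if status == 401:
        return "stale-or-invalid-token"
    if status == 404:
        return "repo-or-endpoint-not-found"
    return "other-error"


if __name__ == "__main__":
    url, headers, body = build_dispatch("Molecule-AI/example-template", "dummy", "1.2.3")
    print(url)
    print(classify(401), classify(404))
```

Telling 401 apart from 404 lets an operator see whether to rotate the token or fix the target path, instead of reading one generic warning.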
claude-ceo-assistant	aa22183e52	chore(ci): pin artifact actions to @v3 for Gitea act_runner compatibility (internal#46)

Mechanical pin: 4 `actions/upload-artifact@v4.6.2/v7.0.1` uses → `@v3`. v4+/v7+ rely on a runtime API shape that Gitea's act_runner v0.6.x doesn't fully support. v3 uses the legacy server protocol act_runner ships end-to-end.

Files (4 uses):
- .github/workflows/ci.yml:238 (v4.6.2 → v3)
- .github/workflows/codeql.yml:124 (v7.0.1 → v3)
- .github/workflows/e2e-staging-canvas.yml:142 (v7.0.1 → v3)
- .github/workflows/e2e-staging-canvas.yml:150 (v7.0.1 → v3)

YAML parse green on all 3 files.

Sister PRs land for `molecule-controlplane` and `codex-channel-molecule`. Per internal#46 Phase 2 audit; tracked under that umbrella.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 01:00:53 -07:00
security-auditor	e01077be38	fix(ci): lowercase 'molecule-ai/' in cross-repo workflow refs

Gitea is case-sensitive on owner slugs; canonical is lowercase `molecule-ai/...`. Mixed-case `Molecule-AI/...` refs fail-at-0s when the runner tries to resolve the cross-repo workflow / checkout. Same fix as molecule-controlplane#12.

Mechanical case-correction; no behavior change beyond making CI resolve again.

Refs: internal#46

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 01:00:10 -07:00
documentation-specialist	5d4184f4a3	fix(scripts): migrate ghcr.io→ECR + raw.githubusercontent.com→Gitea (#46) Per documentation-specialist's grep agent (2026-05-07T07:30, see internal#46): runtime-breaking ghcr.io references in shell scripts + docker-compose + the slip-past-workflow lint_secret_pattern_drift.py all need migration. These were missed by security-auditor's workflow-only audit. Files (6): - .github/scripts/lint_secret_pattern_drift.py:40 — workspace-runtime pre-commit-checks.sh consumer URL: raw.githubusercontent.com → Gitea raw URL (https://git.moleculesai.app/molecule-ai/.../raw/branch/main/...). The lint job runs in CI and would 404 today. - scripts/refresh-workspace-images.sh:54 — workspace-template image pull URL: ghcr.io → ECR (153263036946.dkr.ecr.us-east-2.amazonaws.com). - scripts/rollback-latest.sh — full rewrite of header + auth flow: * ghcr.io/molecule-ai/{platform,platform-tenant} → ECR * GITHUB_TOKEN with write:packages → AWS ECR auth (aws ecr get-login-password). Per saved memory reference_post_suspension_pipeline, prod cutover is to ECR. * Updated header docs to match new auth flow + prereqs. - scripts/demo-freeze.sh:13,17 — comment-only ghcr → ECR (the script doesn't currently exec these URLs, but the comments describe the cascade and need to match reality). - docker-compose.yml:215-216 — canvas image: ghcr.io → ECR + updated the auth comment to describe `aws ecr get-login-password` flow. - tools/check-template-parity.sh:21 — inline curl install instructions: raw.githubusercontent.com → Gitea raw URL. Hostile self-review: 1. rollback-latest.sh's GITHUB_TOKEN→aws-cli auth swap is a behavior change. Operators using this script now need aws CLI authenticated for region us-east-2 with ECR pull/push perms. Documented in updated header. Operators who don't have aws CLI will get 'aws: command not installed' which is a clear failure mode (not silent). 2. The Gitea raw URL shape (/raw/branch/main/) differs from GitHub's raw.githubusercontent.com structure. Verified pattern by inspecting other Gitea raw URLs in the codebase. If Gitea's URL changes (1.23+), update via the same one-line edit. 3. Doesn't touch packer/scripts/install-base.sh which has a similar ghcr.io ref per the grep agent's findings — that's bigger-scope (packer build pipeline) and lives in molecule-controlplane-ish territory; filing as parked follow-up under #46 if not already. Refs: molecule-ai/internal#46, molecule-ai/internal#37, molecule-ai/internal#38, saved memory reference_post_suspension_pipeline	2026-05-07 00:56:23 -07:00
Hongming Wang	debe29c889	ci(handlers-postgres-integration): apply legacy .sql migrations too The migration-replay step globbed only .up.sql, silently skipping the older flat-naming migrations (001_workspaces.sql, 009_activity_logs.sql, etc.). Fine while no integration test depended on those tables; broke when the #149 cross-table atomicity test came in needing both workspaces (FK target for activity_logs) and activity_logs themselves. Switch to globbing .sql + sorted lex-order, excluding .down.sql so up/down pairs don't undo themselves mid-run. Add a sanity check for workspaces + activity_logs + pending_uploads alongside the existing delegations gate so a future migration drift fails loud instead of silently skipping the regressed test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 22:02:24 -07:00
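The file-selection rule in this commit can be sketched in Python. This is a minimal illustration under stated assumptions, not the workflow's actual shell step: the function name `select_migrations` is hypothetical, and only the file names quoted in the message are taken from the source.

```python
def select_migrations(filenames):
    """Return the migration files to replay, in lexicographic order.

    Keeps every file ending in .sql (legacy flat names and .up.sql alike)
    and drops .down.sql files so an up/down pair cannot undo itself mid-run.
    """
    selected = [
        name
        for name in filenames
        if name.endswith(".sql") and not name.endswith(".down.sql")
    ]
    return sorted(selected)
```

Sorting plain strings works here because the migrations carry zero-padded numeric prefixes, so lexicographic order matches numeric order.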
Hongming Wang	88ff0d770b	chore(sweep): add orphan-tunnel cleanup step (#2987 / #340) The 15-min sweeper has been deleting stale e2e orgs but not the orphan tunnels left behind when the org-delete cascade half-fails (CP transient 5xx after the org row is gone but before the CF tunnel delete completes). Result: tunnels accumulate in CF until manual operator cleanup. Add a final step that POSTs `/cp/admin/orphan-tunnels/cleanup` every tick. Best-effort — failure doesn't fail the workflow; next tick re-attempts. Output reports deleted_count + failed count for ops visibility. This is the catch-all for the orphan-tunnel class. The proper upstream fix (transactional org delete) lives in CP and is tracked as issue #2989. Until that lands, the sweeper's bounded time-to-cleanup keeps the leak from escalating. Note: PR #492 (cf-tunnel silent-success fix) makes this step actually effective — pre-fix DeleteTunnel silent-succeeded on 1022, so the cleanup endpoint reported success without deleting. Post-fix the cleanup chains CleanupTunnelConnections + retry on 1022, which actually clears stuck-connector orphans. 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-05-05 19:36:20 -07:00
Hongming Wang	a19ee90556	chore(sweep): note SSOT for ephemeral prefixes lives in CP Mirrors molecule-controlplane#494: the canonical EPHEMERAL_PREFIXES list now lives in molecule-controlplane/internal/slugs/ephemeral.go, where redeploy-fleet reads it to skip in-flight test tenants. The sweep workflow keeps a Python copy because GHA Python can't import Go, but a comment now points engineers updating the list to update both files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 19:18:13 -07:00
Hongming Wang	caf19e8980	feat(ops): hourly alarm for auto-promote PR stuck on REVIEW_REQUIRED (#2975) Closes the silent-block failure mode that left 25 commits — including the Memory v2 redesign and the reno-stars data-loss fix — wedged on staging for 12+ hours behind a single missing review. The auto-promote workflow opened the PR + armed auto-merge, but main's branch protection required a human review and nobody noticed until a user reported "still seeing old memory tab". ## Detection logic — `scripts/check-stale-promote-pr.sh` Reads open PRs `base=main head=staging` and alarms on: - `mergeStateStatus == BLOCKED` - `reviewDecision == REVIEW_REQUIRED` - createdAt older than `STALE_HOURS` (default 4h) Other BLOCKED reasons (DIRTY, BEHIND, failed checks) are NOT alarmed — those are the author's signal-to-fix. This script targets the specific "no human reviewed yet" wedge. Output: - `::warning` per stale PR (visible in workflow summary + Actions UI) - PR comment (idempotent via marker-string detection; one alarm per PR, never re-spammed) - Exit code = count of stale PRs (capped at 125) Logic in a script (not inline workflow YAML) so it's: - Unit-testable — tests/test-check-stale-promote-pr.sh exercises every branch with stubbed fixture JSON + frozen clock. 23 tests covering: empty list, single stale, just-under-threshold, wrong reviewDecision, wrong mergeStateStatus, mixed list (only matching PRs alarm), custom threshold via --stale-hours, exit-code-counts-matching-PRs, --help, unknown arg → 64, missing repo → 2. - Operator-runnable ad-hoc — `scripts/check-stale-promote-pr.sh` works from any shell with `gh` + `jq`. - SSOT — one detector, the workflow YAML is just schedule + invocation surface. Future sibling workflows that need the same check call the same script. 
## Workflow — `.github/workflows/auto-promote-stale-alarm.yml` Triggers: - cron `27 * * * *` (hourly, off-the-hour to dodge cron herd) - workflow_dispatch with `stale_hours` + `post_comment` overrides Concurrency: `auto-promote-stale-alarm` group, cancel-in-progress=false (idempotent script; no benefit to cancelling a running scan). Permissions: `contents: read` + `pull-requests: write` (post comments). Sparse checkout — only fetches `scripts/check-stale-promote-pr.sh`. No node_modules, no go modules, no slow setup steps. Workflow runs in <30s on a clean repo. ## Why "alarm + comment" not "auto-approve" Considered options in issue #2975: 1. Slack/email alert — picked. 2. Bot-account auto-approve via molecule-ops — circumvents the human-review gate that branch protection encodes. 3. Trusted-promote bypass via CODEOWNERS — needs Org Admin config change; out of scope for a workflow PR. The comment-on-PR pattern picks (1) without external dependencies (no Slack token, no email config). Subscribers get notified via GitHub's existing PR notification delivery; the warning shows up in the Actions feed. ## Why this won't false-positive on legitimate slow reviews Threshold is 4h. Most legitimate gates clear in <1h, so 4× headroom is plenty for slow CI. The comment is idempotent (one alarm per PR, never re-posted) — adding noise stops at 1 comment regardless of how long the PR sits. ## Test plan - [x] `bash scripts/test-check-stale-promote-pr.sh` — 23/23 pass - [x] `python3 -c 'yaml.safe_load(...)'` clean - [x] `bash -n` clean on both scripts - [ ] Live verification: dispatch the workflow once main has caught up, confirm it correctly reports zero stale PRs	2026-05-05 17:55:27 -07:00
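The detection rule described above (BLOCKED, REVIEW_REQUIRED, older than the threshold, exit code capped at 125) can be sketched in Python. This is an illustration only, not the contents of `scripts/check-stale-promote-pr.sh`: the function names are hypothetical, the PR dictionaries mimic the JSON field names quoted in the message, and treating a PR exactly at the threshold as not yet stale is an assumption.

```python
from datetime import datetime, timedelta


def find_stale_promote_prs(prs, now, stale_hours=4):
    """Return the numbers of PRs wedged on a missing human review.

    `prs` is a list of dicts with `number`, `createdAt` (ISO 8601, UTC),
    `mergeStateStatus` and `reviewDecision`; `now` is a timezone-aware
    datetime, passed in so tests can freeze the clock.
    """
    cutoff = now - timedelta(hours=stale_hours)
    stale = []
    for pr in prs:
        created = datetime.fromisoformat(pr["createdAt"].replace("Z", "+00:00"))
        if (
            pr["mergeStateStatus"] == "BLOCKED"
            and pr["reviewDecision"] == "REVIEW_REQUIRED"
            and created < cutoff  # strictly older than the threshold (assumption)
        ):
            stale.append(pr["number"])
    return stale


def exit_code_for(stale_count):
    """Exit code equals the stale-PR count, capped at 125."""
    return min(stale_count, 125)
```

Passing `now` as a parameter mirrors the frozen-clock approach the test suite is described as using.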
Hongming Wang	475da5b64c	refactor(workspace): extract inbox tools from a2a_tools.py (RFC #2873 iter 4e) Continues the OSS-shape refactor. After iters 4a-4d (rbac, delegation, memory, messaging) the only behavior left in ``a2a_tools.py`` was ``report_activity`` plus three thin inbox-tool wrappers and the ``_enrich_inbound_for_agent`` helper. This iter extracts the inbox slice to ``a2a_tools_inbox.py`` so the kitchen-sink module shrinks from 280 LOC to ~165 LOC of imports + report_activity + back-compat re-export blocks. Extracted symbols: - ``_INBOX_NOT_ENABLED_MSG`` (sentinel) - ``_enrich_inbound_for_agent`` (poll-path peer enrichment helper) - ``tool_inbox_peek`` - ``tool_inbox_pop`` - ``tool_wait_for_message`` Re-exports (`from a2a_tools_inbox import …`) preserve the public ``a2a_tools.tool_inbox_*`` surface so existing tests + call sites continue to resolve unchanged. New tests in test_a2a_tools_inbox_split.py: 1. Drift gate (5) — every previously-public symbol on a2a_tools is the EXACT same object as a2a_tools_inbox.foo (`is`, not `==`), catches a future "wrap with logging" refactor that silently loses existing test coverage. 2. Import contract (1) — a2a_tools_inbox does NOT eagerly import a2a_tools at module load. Pins the layered architecture: the extracted slice depends on ``inbox`` + a lazy ``a2a_client`` import, never on the kitchen-sink that re-exports it. 3. _enrich_inbound_for_agent branches (5) — peer_id-empty (canvas_user) returns dict unchanged; missing peer_id key same; a2a_client unavailable (test harness, partial install) degrades gracefully with a bare envelope; registry hit populates peer_name + peer_role + agent_card_url; registry miss still surfaces agent_card_url (constructable from peer_id alone). The full timeout-clamp / validation / JSON-shape behavior matrix for the three wrappers stays in test_a2a_tools_inbox_wrappers.py — those tests pass identically against both the alias and the underlying impl. 
Wiring updates: - ``scripts/build_runtime_package.py``: add ``a2a_tools_inbox`` to ``TOP_LEVEL_MODULES`` so it ships in the runtime wheel and the drift gate doesn't fail the next publish. - ``.github/workflows/ci.yml``: add ``a2a_tools_inbox.py`` to ``CRITICAL_FILES`` so the 75% MCP/inbox/auth per-file floor applies — this is now where the inbox-delivery code actually lives.	2026-05-05 14:28:58 -07:00
Hongming Wang	0ca4e431c1	test(e2e): add poll-mode chat upload E2E and wire into e2e-api.yml Covers the user-visible flow that Phase 1-5b shipped (RFC #2891): register a poll-mode workspace, POST a multi-file /chat/uploads, verify the activity feed shows one chat_upload_receive row per file, fetch the bytes via /pending-uploads/:fid/content, ack each row, and confirm a post-ack fetch returns 404. Also pins cross-workspace bleed protection (workspace B's bearer on A's URL → 401, B's URL with A's file_id → 404) and the file_id-UUID-parse 400 path. 23 assertions, all green against a local platform (Postgres+Redis+platform-server stack matches the e2e-api.yml CI recipe verbatim). Why a new script instead of extending test_poll_mode_e2e.sh: that script tests A2A short-circuit + since_id cursor semantics; this one tests the chat-upload path. They share zero handler code on the platform side and would dilute each other's failure messages if combined. Why not the bearerless-401 strict-mode assertion: the platform's wsauth fail-opens for bearerless requests when MOLECULE_ENV=development (see middleware/devmode.go). The CI workflow doesn't set that var, but some local-dev .env files do — the assertion would flap by environment without testing the poll-mode upload contract. The middleware's own unit tests cover strict-mode 401. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 13:08:55 -07:00
Hongming Wang	6125700c39	test(e2e): plug /tmp scratch leaks in 3 shell E2E tests + add CI lint gate (RFC #2873 iter 2) Three shell E2E tests created scratch files via `mktemp` but never deleted them on early exit (assertion failure, SIGINT, errexit). Each CI run leaked ~10-100 KB of /tmp into the runner; over ~200 runs/week that's 20+ MB of accumulated cruft. ## Files - test_chat_attachments_e2e.sh — was missing both trap and rm; added per-run TMPDIR_E2E with `trap rm -rf … EXIT INT TERM`. - test_notify_attachments_e2e.sh — had a `cleanup()` for the workspace but didn't include the TMPF; only an unconditional `rm -f` at the bottom (line 233) which doesn't fire on early exit. Extended cleanup() to also rm the scratch + dropped the redundant trailing rm. - test_chat_attachments_multiruntime_e2e.sh — `round_trip()` function had per-call `rm -f` only on the success path; failure paths leaked. Switched to script-level TMPDIR_E2E + trap; per-call rm dropped (the trap handles every return path including SIGINT). Pattern: `mktemp -d -t prefix-XXX` for the dir, `mktemp <full-template>` for files (portable across BSD/macOS + GNU coreutils — `-p` is GNU-only and breaks Mac local-dev runs). ## Regression gate New `tests/e2e/lint_cleanup_traps.sh` asserts every `.sh` that calls `mktemp` also has a `trap … EXIT` line in the file. Wired into the existing Shellcheck (E2E scripts) CI step. Verified locally: passes on the fixed state, fails-loud when one of the 3 fixes is reverted. 
## Verification - shellcheck --severity=warning clean on all 4 touched files - lint_cleanup_traps.sh passes on the post-fix tree (6 mktemp users, all have EXIT trap) - Negative test: revert one fix → lint exits 1 with file:line + suggested fix pattern in the error message (CI-grokkable ::error file=… annotation) - Trap fires on SIGTERM mid-run (smoke-tested on macOS BSD mktemp) - Trap fires on `exit 1` (smoke-tested) ## Bars met (7-axis) - SSOT: trap pattern documented in lint message (one rule, one fix) - Cleanup: this IS the cleanup hygiene fix - 100% coverage: lint catches future regressions across all `tests/e2e/*.sh` files, not just the 3 fixed today - File-split: N/A (no files split) - Plugin / abstract / modular: N/A (test infra, not product code) Iteration 2 of RFC #2873.	2026-05-05 04:21:26 -07:00
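The regression gate's rule (every script that calls `mktemp` must also carry a `trap … EXIT` line) can be sketched in Python. This is a simplified model, not the contents of `tests/e2e/lint_cleanup_traps.sh`: the function name and the regular expressions are assumptions, and the sketch does not try to ignore `mktemp` mentions inside comments.

```python
import re

# A line that starts with `trap` and names the EXIT pseudo-signal.
TRAP_EXIT = re.compile(r"^\s*trap\b.*\bEXIT\b", re.MULTILINE)
MKTEMP = re.compile(r"\bmktemp\b")


def scripts_missing_exit_trap(scripts):
    """Return the sorted names of scripts that use mktemp without an EXIT trap.

    `scripts` maps a file name to the file's text.
    """
    return sorted(
        name
        for name, text in scripts.items()
        if MKTEMP.search(text) and not TRAP_EXIT.search(text)
    )
```

A non-empty result would correspond to the lint's failing case; an empty list corresponds to the post-fix tree.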
Hongming Wang	42f2ea3f4f	fix(ci): include event_name in runtime-prbuild-compat concurrency group Every staging push run for the last 4 SHAs was cancelled by the matching pull_request run because both fired into the same concurrency group: group: ${{ github.workflow }}-${{ ...sha }} Same SHA → same group → cancel-in-progress=true means the second arrival cancels the first. Empirically the push run lost the race; staging branch-protection then saw a CANCELLED required check and the auto-promote chain stalled. Fix: include github.event_name in the group key. push and pull_request runs for the same SHA now hash to different groups, both complete, both report SUCCESS to branch protection. Pattern of the bug: 10:46 sha=1e8d7ae1 ev=pull_request conclusion=success 10:46 sha=1e8d7ae1 ev=push conclusion=cancelled 10:45 sha=ecf5f6fb ev=pull_request conclusion=success 10:45 sha=ecf5f6fb ev=push conclusion=cancelled 10:28 sha=471dff25 ev=pull_request conclusion=success 10:28 sha=471dff25 ev=push conclusion=cancelled 10:12 sha=9e678ccd ev=pull_request conclusion=success 10:12 sha=9e678ccd ev=push conclusion=cancelled Same drift class as the 2026-04-28 auto-promote-staging incident (memory: feedback_concurrency_group_per_sha.md) — globally-scoped groups silently cancel runs in matched-SHA scenarios. This is the only workflow in .github/workflows/ that uses the narrow per-sha shape without event_name. Others either don't use concurrency at all, or use ${{ github.ref }} which is event-neutral. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 04:01:20 -07:00
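The cancellation race can be modelled in a few lines of Python. This is a toy simulation, not how the CI runner is implemented: it assumes every run is still in progress when the next one arrives, and the exact order of the components in the fixed group key is an assumption, since the message only says that `github.event_name` was added.

```python
def simulate_cancel_in_progress(runs, group_key):
    """Return one outcome per run under cancel-in-progress=true.

    A later arrival in the same concurrency group cancels the run that
    currently holds the group.
    """
    holder = {}  # group name -> index of the run currently holding it
    outcomes = []
    for index, run in enumerate(runs):
        group = group_key(run)
        if group in holder:
            outcomes[holder[group]] = "cancelled"
        holder[group] = index
        outcomes.append("success")
    return outcomes


def key_per_sha(run):
    """Pre-fix shape: workflow + SHA only."""
    return f"{run['workflow']}-{run['sha']}"


def key_per_sha_and_event(run):
    """Post-fix shape: the event name is part of the key."""
    return f"{run['workflow']}-{run['event']}-{run['sha']}"
```

With a push run followed by a pull_request run for the same SHA, the first key yields one cancelled run and the second lets both complete, which matches the pattern listed in the commit message.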
Hongming Wang	90d202c80a	ci(handlers-pg): apply all migrations with skip-on-error + sanity check (#320) Previous workflow applied only 049_delegations.up.sql — fragile to future migrations that touch the delegations table or any other handlers/-tested table. Operator would have to remember to update the workflow's psql -f line per migration. New behavior: loop every .up.sql in lexicographic order, apply each with ON_ERROR_STOP=1 + per-migration result captured. Failed migrations are SKIPPED rather than blocking the suite — handles the historical migrations (017_memories_fts_namespace, 042_a2a_queue, etc.) that depend on tables since renamed/dropped and can't replay from scratch. Migrations that DO succeed land their tables, which is sufficient for the integration tests in handlers/. Sanity gate at the end: if the delegations table is missing after the replay, hard-fail with a loud error. That catches a real regression where 049 itself becomes broken (e.g., schema rename), separate from the historical-broken-migration noise above. Per-migration log line ("✓" or "⊘ skipped") makes it easy to spot when a migration that SHOULD have replayed didn't. Verified locally: full migration chain runs, 049 lands, all 7 integration tests pass against the chained-migration DB. Closes #320.	2026-05-05 03:48:43 -07:00
Hongming Wang	4c9f12258d	fix(delegations): preserve result_preview through completion + add real-Postgres integration gate Two-part PR: ## Fix: result_preview was lost on completion Self-review of #2854 caught a real bug. SetStatus has a same-status replay no-op; the order of calls in `executeDelegation` completion + `UpdateStatus` completed branch clobbered the preview field: 1. updateDelegationStatus(completed, "") fires 2. inner recordLedgerStatus(completed, "", "") → SetStatus transitions dispatched → completed with preview="" 3. outer recordLedgerStatus(completed, "", responseText) → SetStatus reads current=completed, status=completed → SAME-STATUS NO-OP, never writes responseText → preview lost Confirmed against real Postgres (see integration test). Strict-sqlmock unit tests passed because they pin SQL shape, not row state. Fix: call the WITH-PREVIEW recordLedgerStatus FIRST, then updateDelegationStatus. The inner call becomes the no-op (correctly preserves the row written by the outer call). Same gap fixed in UpdateStatus handler — body.ResponsePreview was never landing in the ledger because updateDelegationStatus's nested SetStatus(completed, "", "") fired first. ## Gate: real-Postgres integration tests + CI workflow The unit-test-only workflow that shipped #2854 was the root cause. Adding two layers of defense: 1. workspace-server/internal/handlers/delegation_ledger_integration_test.go — `//go:build integration` tag, requires INTEGRATION_DB_URL env var. 4 tests: * ResultPreviewPreservedThroughCompletion (regression gate for the bug above — fires the production call sequence in fixed order and asserts row.result_preview matches) * ResultPreviewBuggyOrderIsLost (DIAGNOSTIC: confirms the same-status no-op contract works as designed; if SetStatus's semantics ever change, this test fires) * FailedTransitionCapturesErrorDetail (failure-path symmetry) * FullLifecycle_QueuedToDispatchedToCompleted (forward-only + happy path) 2. 
.github/workflows/handlers-postgres-integration.yml — required check on staging branch protection. Spins postgres:15 service container, applies the delegations migration, runs `go test -tags=integration` against the live DB. Always-runs + per-step gating on path filter (handlers/wsauth/migrations) so the required-check name is satisfied on PRs that don't touch relevant code. Local dev workflow (file header documents this): docker run --rm -d --name pg -e POSTGRES_PASSWORD=test -p 55432:5432 postgres:15-alpine psql ... < workspace-server/migrations/049_delegations.up.sql INTEGRATION_DB_URL="postgres://postgres:test@localhost:55432/molecule?sslmode=disable" \ go test -tags=integration ./internal/handlers/ -run "^TestIntegration_" ## Why this matters Per memory `feedback_mandatory_local_e2e_before_ship`: backend PRs MUST verify against real Postgres before claiming done. sqlmock pins SQL shape; only a real DB can verify row state. The workflow makes this gate mandatory rather than optional.	2026-05-05 02:47:52 -07:00
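The ordering bug behind the lost `result_preview` can be reproduced with a toy ledger in Python. This is a sketch of the same-status no-op contract described above, not the Go code in workspace-server: the `Ledger` class and both helper functions are hypothetical stand-ins for `SetStatus`, `updateDelegationStatus` and `recordLedgerStatus`.

```python
class Ledger:
    """Toy single-row ledger with a same-status replay no-op."""

    def __init__(self):
        self.status = "dispatched"
        self.result_preview = ""

    def set_status(self, status, preview):
        if status == self.status:
            return False  # same-status replay: nothing is written
        self.status = status
        self.result_preview = preview
        return True


def complete_buggy_order(ledger, response_text):
    # The preview-less call fires first and performs the transition...
    ledger.set_status("completed", "")
    # ...so the call carrying the preview is a no-op and the text is lost.
    ledger.set_status("completed", response_text)


def complete_fixed_order(ledger, response_text):
    # The call carrying the preview performs the transition.
    ledger.set_status("completed", response_text)
    # The preview-less call is now the harmless no-op.
    ledger.set_status("completed", "")
```

A mock that only checks which statements were issued would accept both orders; only inspecting the stored row distinguishes them, which is the point the commit makes about sqlmock versus a real database.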
Hongming Wang	c89f17a2aa	fix(branch-protection-drift): hard-fail on schedule only, soft-skip + warn on PR #2834 added a hard-fail when GH_TOKEN_FOR_ADMIN_API is missing on schedule + pull_request + workflow_dispatch. The PR-trigger hard-fail is now blocking every PR in the repo because the secret hasn't been provisioned yet — including the staging→main auto-promote PR (#2831), which has no path to set repo secrets itself. Per feedback_schedule_vs_dispatch_secrets_hardening.md the original concern is automated/silent triggers losing the gate without a human to notice. That concern applies to schedule specifically: - schedule: cron, no human, silent soft-skip = invisible regression → KEEP HARD-FAIL. - pull_request: a human is reviewing the PR diff and will see workflow warnings inline. A PR cannot retroactively drift live state — drift happens between PRs (UI clicks, manual gh api PATCH), which the schedule canary catches. The PR-time gate would only catch typos in apply.sh, which the *_payload unit tests catch more directly. → SOFT-SKIP with a prominent warning. - workflow_dispatch: operator override, may not have configured the secret yet. → SOFT-SKIP with warning. The skip is explicit (SKIP_DRIFT_CHECK=1 surfaced to env, then a step `if:` guard) so it's auditable in the workflow run UI, not silently swallowed. Unblocks #2831 (auto-promote staging→main) + every PR currently behind this check.	2026-05-04 21:20:30 -07:00
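The per-trigger policy for a missing `GH_TOKEN_FOR_ADMIN_API` reduces to a small decision function. The Python below is a sketch of that policy only; the function name and the returned labels are hypothetical, and the real workflow expresses this through `SKIP_DRIFT_CHECK=1` and a step `if:` guard rather than Python.

```python
def drift_check_action(event_name, token_present):
    """Decide what the drift workflow does for a given trigger.

    schedule has no human watching, so a missing secret must fail loudly;
    pull_request and workflow_dispatch skip with a visible warning.
    """
    if token_present:
        return "run"
    if event_name == "schedule":
        return "hard-fail"
    return "soft-skip-with-warning"
```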
Hongming Wang	2e505e7748	fix(branch-protection): apply.sh respects live state + full-payload drift Multi-model review of #2827 caught: the script as-shipped would have silently weakened branch protection on EVERY non-checks dimension the moment anyone ran it. Live staging had enforce_admins=true, dismiss_stale_reviews=false, strict=true, allow_fork_syncing=false, bypass_pull_request_allowances={ HongmingWang-Rabbit + molecule-ai app } Script wrote the opposite for all five. Per memory feedback_dismiss_stale_reviews_blocks_promote.md, the dismiss_stale_reviews flip alone is the load-bearing one — would silently re-block every auto-promote PR (cost user 2.5h once). This PR: 1. apply.sh: per-branch payloads (build_staging_payload / build_main_payload) that codify the deliberate per-branch policy already on the repo, with the script's net contribution being ONLY the new check names (Canvas tabs E2E + E2E API Smoke on staging, Canvas tabs E2E on main). 2. apply.sh: R3 preflight that hits /commits/{sha}/check-runs and asserts every desired check name has at least one historical run on the branch tip. Catches typos like "Canvas Tabs E2E" vs "Canvas tabs E2E" — pre-fix a typo would silently block every PR forever waiting for a context that never emits. Skip via --skip-preflight for genuinely-new workflows whose first run hasn't fired. 3. drift_check.sh: compares the FULL normalised payload (admin, review, lock, conversation, fork-syncing, deletion, force-push) not just the checks list. Pre-fix the drift gate would have missed a UI click that flipped enforce_admins or dismiss_stale_reviews. Drops app_id from the comparison since GH auto-resolves -1 to a specific app id post-write. 4. branch-protection-drift.yml: per memory feedback_schedule_vs_dispatch_secrets_hardening.md — schedule + pull_request triggers HARD-FAIL when GH_TOKEN_FOR_ADMIN_API is missing (silent skip masks the gate disappearing). workflow_dispatch keeps soft-skip for one-off operator runs. 
Verified by running drift_check against live state: pre-fix would have shown 5 destructive drifts on staging + 5 on main. Post-fix shows ONLY the 2 intended additions on staging + 1 on main, which go away after `apply.sh` runs.	2026-05-04 20:52:11 -07:00
Hongming Wang	7cc1c39c49	ci: e2e coverage matrix + branch-protection-as-code Closes #9. Three pieces, all small: 1. docs/e2e-coverage.md — source of truth for which E2E suites guard which surfaces. Today three were running but informational only on staging; that's how the org-import silent-drop bug shipped without a test catching it pre-merge. Now the matrix shows what's required where + a follow-up note for the two suites that need an always-emit refactor before they can be required. 2. tools/branch-protection/apply.sh — branch protection as code. Lets `staging` and `main` required-checks live in a reviewable shell script instead of UI clicks that get lost between admins. This PR's net change: add `E2E API Smoke Test` and `Canvas tabs E2E` as required on staging. Both already use the always-emit path-filter pattern (no-op step emits SUCCESS when the workflow's paths weren't touched), so making them required can't deadlock unrelated PRs. 3. branch-protection-drift.yml — daily cron + drift_check.sh that compares live protection against apply.sh's desired state. Catches out-of-band UI edits before they drift further. Fails the workflow on mismatch; ops re-runs apply.sh or updates the script. Out of scope (filed as follow-ups): - e2e-staging-saas + e2e-staging-external use plain `paths:` filters and never trigger when paths are unchanged. They need refactoring to the always-emit shape (same as e2e-api / e2e-staging-canvas) before they can be required. - main branch protection mirrors staging here; if main wants the E2E SaaS / External added later, do it in apply.sh and rerun. Operator must apply once after merge: bash tools/branch-protection/apply.sh The drift check picks it up from there. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 20:21:59 -07:00
Hongming Wang	8df8487bbe	fix(auto-promote): treat E2E completed/cancelled as defer, not failure Bug: the case statement at line 189 grouped completed/failure | completed/cancelled | completed/timed_out into the same "abort + exit 1" branch. cancelled ≠ failure — when per-SHA concurrency (memory: feedback_concurrency_group_per_sha) cancels an older E2E run because a newer push landed, the workflow blocked the whole auto-promote chain on a non-failure. Caught 2026-05-05 02:03 on sha `31f9a5e`: E2E got cancelled by concurrency, auto-promote :latest aborted with exit 1, the next auto-promote-staging cycle had to manually clean up. Split: failure/timed_out keep the abort path. cancelled gets its own clean-defer branch (same shape as in_progress) — proceed=false without exit 1, with a step-summary explaining likely concurrency supersession and pointing operators at manual dispatch if they need that specific SHA promoted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 19:26:29 -07:00
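The split of the case statement can be written as a lookup in Python. This is an illustrative model, not the workflow's shell: the function name and the returned `(proceed, exit_code)` pair are hypothetical, and aborting on an unrecognised conclusion is an assumption rather than something the commit states.

```python
def classify_e2e_run(status, conclusion):
    """Map an E2E run's status/conclusion to (proceed, exit_code).

    failure and timed_out abort; cancelled defers cleanly, in the same
    way as a run that has not completed yet.
    """
    if status != "completed":
        return (False, 0)  # still running: defer
    if conclusion == "success":
        return (True, 0)
    if conclusion == "cancelled":
        return (False, 0)  # likely superseded by a newer push: defer
    if conclusion in ("failure", "timed_out"):
        return (False, 1)  # real failure: abort
    return (False, 1)  # unknown conclusion: abort (assumption)
```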
Hongming Wang	c5dd14d8db	fix(workflows): preserve curl stderr in 8 status-capture sites Self-review of PR #2810 caught a regression: my mass-fix added `2>/dev/null` to every curl invocation, suppressing stderr. The original `|| echo "000"` shape only swallowed exit codes — stderr (curl's `-sS`-shown dial errors, timeouts, DNS failures) still went to the runner log so operators could see WHY a connection failed. After PR #2810 the next deploy failure would log only the bare HTTP code with no context. That's exactly the kind of diagnostic loss that makes outages take longer to triage. Drop `2>/dev/null` from each curl line — keep it on the `cat` fallback (which legitimately suppresses "no such file" when curl crashed before -w ran). The `>tempfile` redirect alone captures curl's stdout (where -w writes) without touching stderr. Same 8 files as #2810: redeploy-tenants-on-{main,staging}, sweep-stale-e2e-orgs, e2e-staging-{sanity,saas,external,canvas}, canary-staging. Tests: - All 8 files pass the lint - YAML valid Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 18:54:50 -07:00
Hongming Wang	463316772b	fix(workflows): rewrite curl status-capture to prevent exit-code pollution The 2026-05-04 redeploy-tenants-on-main run for sha `2b862f6` emitted "HTTP 000000" and failed the deploy. Root cause: when curl exits non-zero (connection reset → 56, --fail-with-body 4xx/5xx → 22), the `-w '%{http_code}'` already wrote a status to stdout; the inline `|| echo "000"` then fires AND appends another "000" to the captured substitution stdout. Result: HTTP_CODE="<actual><000>" — fails string comparisons against "200" while looking superficially right. Same class of bug the synth-E2E §7c gate hit twice (PRs #2779/#2783 + #2797). Memory feedback_curl_status_capture_pollution.md. Mass fix in 8 workflows: route -w into a tempfile so curl's exit code can't pollute stdout. Wrap with set +e/-e so the non-zero curl exit doesn't trip the outer pipeline. redeploy-tenants-on-main.yml (production-critical, caught the bug) redeploy-tenants-on-staging.yml (sibling) sweep-stale-e2e-orgs.yml (cleanup loop) e2e-staging-sanity.yml (E2E safety-net teardown) e2e-staging-saas.yml e2e-staging-external.yml e2e-staging-canvas.yml canary-staging.yml Plus a new lint workflow `lint-curl-status-capture.yml` that runs on every PR/push touching `.github/workflows/**`. Multi-line aware: collapses bash `\` continuations, then matches the buggy $(curl ... -w '%{http_code}' ... || echo "000") subshell shape. Distinguishes from the SAFE $(cat tempfile || echo "000") shape (cat with missing file emits empty stdout, no pollution). Verified: - All 8 workflows pass the lint locally - A known-bad injection is caught - A known-safe cat-fallback passes through - yaml.safe_load clean on all changed files Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 18:29:38 -07:00
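The pollution mechanism is easier to see when the two capture shapes are modelled as plain string operations. The Python below is a model of the shell behaviour described in the message, not a replacement for the workflows' bash: both function names are hypothetical, and the tempfile variant represents a missing file as `None`.

```python
def capture_inline(curl_stdout, curl_exit_code):
    """Model of HTTP_CODE=$(curl ... -w '%{http_code}' ... || echo "000").

    curl has already written the status via -w; on a non-zero exit the
    fallback echo writes into the same command substitution.
    """
    captured = curl_stdout
    if curl_exit_code != 0:
        captured += "000"
    return captured


def capture_via_tempfile(tempfile_contents):
    """Model of HTTP_CODE=$(cat tempfile || echo "000").

    curl's -w output went to the tempfile, so curl's exit code never
    reaches this substitution; the fallback fires only if the file is missing.
    """
    if tempfile_contents is None:
        return "000"
    return tempfile_contents
```

A connection reset (exit 56, status `000`) gives `000000` in the inline model, which is the string the failed deploy logged, while a 503 under `--fail-with-body` (exit 22) gives `503000`.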
Hongming Wang	26fa220bef	ci(coverage): per-file 75% floor for MCP/inbox/auth Python critical paths Closes part of #2790 (Phase A). The Python total floor at 86% (set in workspace/pytest.ini, issue #1817) averages over ~6000 lines, so a single MCP-critical file could regress to ~50% with no CI complaint as long as other modules compensate. This is the same distribution gap that #1823 closed Go-side: total floor passes while a critical handler sits at 0%. Added gates for these five files (per-file floor 75%): - workspace/a2a_mcp_server.py — MCP dispatcher (PR #2766 / #2771) - workspace/mcp_cli.py — molecule-mcp standalone CLI entry - workspace/a2a_tools.py — workspace-scoped tool implementations - workspace/inbox.py — multi-workspace inbox + per-workspace cursors - workspace/platform_auth.py — per-workspace token resolver These handle multi-tenant routing, auth tokens, and inbox dispatch. Risk shape mirrors Go-side tokens/secrets — a 0%/50% file here is exactly where the PR #2766 dispatcher bug class slips through without a structural test. Floor 75% is strictly additive — current actuals 80-96% (measured 2026-05-04). No existing PR fails. Ratchet plan in COVERAGE_FLOOR.md target 90% by 2026-08-04. Implementation: pytest already writes .coverage; new step emits a JSON view scoped to the critical files via `coverage json --include="*name"`, then jq extracts each file's percent_covered. Exact key match by basename so workspace/builtin_tools/a2a_tools.py (a different 100% file) doesn't shadow workspace/a2a_tools.py. Verified locally with the actual coverage data: - floor=75 → 0 failures (matches current state) - floor=81 → 1 failure (a2a_tools.py at 80%) — proves the gate trips Pairs with PR #2791 (Phase B — schema↔dispatcher AST drift gate). Phase C (molecule-mcp e2e harness) remains the largest piece in #2790. YAML validated locally before commit per feedback_validate_yaml_before_commit. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 16:35:21 -07:00
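The per-file gate in 26fa220bef is implemented with `coverage json` plus jq; the same check can be sketched in Python for illustration. The assumptions here are that the report keys are paths relative to the `workspace/` directory, that a missing file should count as a failure, and that the sample percentages other than the 80% for `a2a_tools.py` are made up for the example. The function name `files_below_floor` is hypothetical.

```python
def files_below_floor(report: dict, critical: list[str], floor: float):
    """Return (name, percent) for each critical file under the floor.

    Lookup is by exact key, so 'builtin_tools/a2a_tools.py' can never
    stand in for 'a2a_tools.py'. A file absent from the report is
    reported with percent None (assumed to be a failure).
    """
    failures = []
    files = report["files"]
    for name in critical:
        entry = files.get(name)
        if entry is None:
            failures.append((name, None))
            continue
        pct = entry["summary"]["percent_covered"]
        if pct < floor:
            failures.append((name, pct))
    return failures


# Illustrative report in the shape coverage.py's JSON output uses.
SAMPLE_REPORT = {
    "files": {
        "a2a_tools.py": {"summary": {"percent_covered": 80.0}},
        "builtin_tools/a2a_tools.py": {"summary": {"percent_covered": 100.0}},
        "inbox.py": {"summary": {"percent_covered": 90.0}},  # made-up value
    }
}

if __name__ == "__main__":
    critical = ["a2a_tools.py", "inbox.py"]
    print(files_below_floor(SAMPLE_REPORT, critical, 75.0))
    print(files_below_floor(SAMPLE_REPORT, critical, 81.0))
```

With the sample data, a floor of 75 reports nothing and a floor of 81 reports only `a2a_tools.py`, matching the local verification described in the commit message.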
Hongming Wang	ff1003e5f6	ci(canary): bump timeout-minutes 12 → 20 to absorb apt tail latency Today's 4 cancelled canaries (25319625186 / 25320942822 / 25321618230 / 25322499952) were all blown by the workflow timeout despite the underlying tenant boot completing successfully (PR molecule-controlplane#455 fix verified — boot events all reach `boot_script_finished/ok`). Why the budget was wrong: The tenant user-data install phase runs apt-get update + install of docker.io / jq / awscli / caddy / amazon-ssm-agent FROM RAW UBUNTU on every tenant boot — none of it is pre-baked into the tenant AMI (EC2_AMI=ami-0ea3c35c5c3284d82, raw Jammy 22.04). Empirical fetch_secrets/ok timing across today's canaries: 51s debug-mm-1777888039 (09:47Z) 82s 25319625186 (12:42Z) 143s 25320942822 (13:11Z) 625s 25322499952 (13:43Z) Same EC2_AMI, same instance type (t3.small), same user-data install sequence — variance is entirely apt-mirror tail latency. A 12-min job budget leaves only ~2 min for the workspace on slow-apt days; the workspace itself needs ~3.5 min for claude-code cold boot, so the budget is structurally too tight whenever apt is slow. 20 min absorbs even the 10+ min boot worst-case and still leaves the workspace its full ~7 min budget. Cap stays well under the runner's 6-hour ubuntu-latest job ceiling. Real fix: pre-bake caddy + ssm-agent into the tenant AMI so the boot phase is no-ops on cached pkgs (will file controlplane#TBD as follow-up — packer/install-base.sh today only bakes the WORKSPACE thin AMI, not the tenant AMI; tenants always boot from raw Ubuntu). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 07:02:12 -07:00
Hongming Wang	032c011b37	ci: bump continuous-synth-e2e cadence 3→6 fires/hour, all clean slots Change cron from '10,30,50' (3 fires/hour) to '2,12,22,32,42,52' (6 fires/hour). All new slots are 1-3 min away from any other cron, avoiding both the cf-sweep collisions (:15, :45) and the :30 heavy slot (canary-staging /30, sweep-aws-secrets, sweep-stale-e2e-orgs every :15). Why: empirically 2026-05-04 the canary fired only once per hour on the 10,30,50 schedule (see #2726). Bumping fires-per-hour gives more chances to land a survived fire under GH's load-related drop ratio, and keeping all slots in clean lanes minimizes the per-fire drop probability. At empirically-observed ~67% drop ratio, 6 attempts/hour yields ~2 effective fires = ~30 min cadence; closer to the 20-min target than the current shape and provides a real degradation alarm if drops get worse. Cost: ~$0.50/day → ~$1/day. Negligible. Closes #2726. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 05:10:48 -07:00
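The cadence estimate in 032c011b37 is simple arithmetic, worked here as a small sketch. The function name is hypothetical, and the model assumes every scheduled firing is dropped independently with the same probability, which the commit message does not claim.

```python
def effective_cadence_minutes(fires_per_hour: int, drop_ratio: float) -> float:
    """Average minutes between firings that survive scheduler drops."""
    surviving_per_hour = fires_per_hour * (1.0 - drop_ratio)
    return 60.0 / surviving_per_hour


if __name__ == "__main__":
    # Old schedule: 3 fires/hour at ~67% drops, about one survivor per hour.
    print(round(effective_cadence_minutes(3, 0.67), 1))
    # New schedule: 6 fires/hour at ~67% drops, about two survivors per hour.
    print(round(effective_cadence_minutes(6, 0.67), 1))
```

Under that assumption the old schedule works out to roughly one surviving fire per hour and the new one to roughly one every 30 minutes, in line with the figures quoted in the commit.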
Hongming Wang	98f883cb99	e2e: add direct-Anthropic LLM-key path alongside MiniMax + OpenAI Adds a third secrets-injection branch in test_staging_full_saas.sh behind a new E2E_ANTHROPIC_API_KEY env var, wired into all three auto-running E2E workflows (canary-staging, e2e-staging-saas, continuous-synth-e2e) via a new MOLECULE_STAGING_ANTHROPIC_API_KEY repo secret slot. Operator motivation: after #2578 (the staging OpenAI key went over quota and stayed dead 36+ hours) we shipped #2710 to migrate the canary + full-lifecycle E2E to claude-code+MiniMax. Discovered post-merge that MOLECULE_STAGING_MINIMAX_API_KEY had never been set after the synth-E2E migration on 2026-05-03 either — synth has been red the whole time, not just OpenAI quota. Setting up a MiniMax billing account from scratch is non-trivial (needs platform-specific signup, KYC, top-up). Operators who already have an Anthropic API key for their own Claude Code session can now just set MOLECULE_STAGING_ANTHROPIC_API_KEY and have all three auto-running E2E gates green within one cron firing. Priority chain in test_staging_full_saas.sh (first non-empty wins): 1. E2E_MINIMAX_API_KEY → MiniMax (cheapest) 2. E2E_ANTHROPIC_API_KEY → direct Anthropic (cheaper than gpt-4o, lower setup friction than MiniMax) 3. E2E_OPENAI_API_KEY → langgraph/hermes paths Verify-key case-statement in all three workflows accepts EITHER MiniMax OR Anthropic for runtime=claude-code; error message names both options so operators know they don't have to register a MiniMax account if they already have an Anthropic key. Pinned to runtime=claude-code — hermes/langgraph use OpenAI-shaped envs and won't honour ANTHROPIC_API_KEY without further wiring. After this lands + secret is set, the dispatched canary verifies the new path: gh workflow run canary-staging.yml --repo Molecule-AI/molecule-core --ref staging	2026-05-04 00:51:14 -07:00
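The "first non-empty wins" priority chain from 98f883cb99 lives in a bash script; a Python rendering of the same precedence is sketched below for illustration. The function name `pick_secrets` is hypothetical, the Anthropic payload key is an assumption based on the ANTHROPIC_API_KEY name mentioned in the commit, and the OpenAI payload is simplified to a single key (the real branch also sets HERMES_* and MODEL_PROVIDER values).

```python
def pick_secrets(env: dict[str, str]) -> dict[str, str]:
    """Return the secrets payload for the highest-priority non-empty key.

    Order follows the commit message: MiniMax, then direct Anthropic,
    then OpenAI. An empty string counts as unset.
    """
    if env.get("E2E_MINIMAX_API_KEY"):
        return {"MINIMAX_API_KEY": env["E2E_MINIMAX_API_KEY"]}
    if env.get("E2E_ANTHROPIC_API_KEY"):
        # Assumed payload shape for the direct-Anthropic branch.
        return {"ANTHROPIC_API_KEY": env["E2E_ANTHROPIC_API_KEY"]}
    if env.get("E2E_OPENAI_API_KEY"):
        # Simplified: the HERMES_* and MODEL_PROVIDER entries are omitted.
        return {"OPENAI_API_KEY": env["E2E_OPENAI_API_KEY"]}
    return {}


if __name__ == "__main__":
    env = {"E2E_MINIMAX_API_KEY": "", "E2E_ANTHROPIC_API_KEY": "ant",
           "E2E_OPENAI_API_KEY": "oai"}
    print(pick_secrets(env))  # Anthropic wins because MiniMax is empty
```

Returning an empty payload when nothing is set is a choice made for this sketch; per 9689c6f6d5 the workflows hard-fail in a verify step before the script would reach that case.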
Hongming Wang	eaee113416	e2e-staging-saas: same migration off OpenAI default to claude-code+MiniMax Bundles the same hermes+OpenAI → claude-code+MiniMax migration onto the full-lifecycle E2E that's been red on every provisioning-critical push since 2026-05-01. Same root cause as the canary fix in the prior commit: MOLECULE_STAGING_OPENAI_KEY hit insufficient_quota and there's no SLA on operator billing top-up. Same shape as canary commit: claude-code as default runtime + MiniMax as primary key + hermes/langgraph kept as workflow_dispatch options with OpenAI fallback. Per-runtime verify-key case-statement matches canary-staging.yml + continuous-synth-e2e.yml byte-for-byte. Two extra wrinkles vs canary: - Dispatch input `runtime` default flipped from "hermes" to "claude-code" so operators dispatching from the UI get the safe path by default. They can still pick hermes/langgraph from the dropdown when they specifically want to exercise OpenAI. - E2E_MODEL_SLUG is dispatch-aware: MiniMax-M2.7-highspeed for claude-code, openai/gpt-4o for hermes (slash-form per derive-provider.sh), openai:gpt-4o for langgraph (colon-form per init_chat_model). The branch comment in lib/model_slug.sh covers the rationale; pinning the slug here keeps the dispatch UX stable even when operators don't override. After this lands + the canary commit lands, the only OpenAI-dependent E2E surface is the operator-dispatch fallback. The cron canary, the synth E2E, AND the full-lifecycle gate are all on MiniMax — separate billing account, no OpenAI quota dependency on auto-runs.	2026-05-04 00:20:36 -07:00
Hongming Wang	6f8f978975	canary-staging: migrate from hermes+OpenAI to claude-code+MiniMax Mirror the migration continuous-synth-e2e.yml made on 2026-05-03 (#265). Both workflows hit the same MOLECULE_STAGING_OPENAI_KEY which went over quota on 2026-05-01 (#2578) and stayed dead — the canary has been red for 36+ hours waiting on operator billing top-up. This switch breaks the canary's dependency on OpenAI billing entirely: claude-code template's `minimax` provider routes ANTHROPIC_BASE_URL to api.minimax.io/anthropic and reads MINIMAX_API_KEY at boot. MiniMax is ~5-10x cheaper per token than gpt-4.1-mini AND on a separate billing account, so a future OpenAI quota collapse no longer wedges the canary's "is staging alive?" signal. Changes: - E2E_RUNTIME: hermes → claude-code - Add E2E_MODEL_SLUG: MiniMax-M2.7-highspeed (pin to MiniMax — the per-runtime claude-code default is "sonnet" which routes to direct Anthropic and would defeat the cost saving) - Add E2E_MINIMAX_API_KEY env wired to MOLECULE_STAGING_MINIMAX_API_KEY - Keep E2E_OPENAI_API_KEY as fallback for operator-dispatched runs that set E2E_RUNTIME=hermes via workflow_dispatch - "Verify OpenAI key present" → per-runtime "Verify LLM key present" case statement matching synth E2E's exact shape (claude-code requires MiniMax, langgraph/hermes require OpenAI). Hard-fail on missing required key per #2578's lesson — soft-skip silently fell through to the wrong SECRETS_JSON branch and produced a confusing auth error 5 min later instead of the clean "secret missing" message at the top. Verifies #2578 root cause won't recur on the canary path. The synth E2E and the manual e2e-staging-saas dispatch can still hit OpenAI when explicitly chosen — only the cron canary moves off it.	2026-05-04 00:18:03 -07:00
Hongming Wang	9689c6f6d5	fix(synth-e2e): verify-secrets step must hard-fail (exit 0 only ends step) The previous soft-skip-on-dispatch path used `exit 0`, which only ends the STEP — the rest of the workflow continued with empty secrets. Caught 2026-05-04 by dispatched run 25296530706: - E2E_MINIMAX_API_KEY: empty - verify-secrets printed warning + exit 0 - Install required tools: ran - Run synthetic E2E: ran with empty MiniMax key - SECRETS_JSON branched to OpenAI shape (MINIMAX empty → fall through) - But model slug stayed MiniMax-M2.7-highspeed (workflow env) - Workspace booted with OpenAI keys + MiniMax model - 5 min later: "Agent error (Exception)" — claude SDK 401'd against api.minimax.io with the OpenAI key The confusing failure mode silently masked the real problem (missing secret) under a runtime-error label. Fix: drop both soft-skip paths and exit 1 always. Operators who want to verify a YAML change without setting up secrets can read the verify-secrets step's stderr — the failure IS the verification signal. Pure visibility fix; preserves the cron hard-fail path (now also the dispatch hard-fail path). No mechanism change beyond the exit code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 18:32:26 -07:00
Hongming Wang	a306a97dd3	ci(synth-e2e): move cron off :00 to dodge GH scheduler drops GitHub Actions scheduler de-prioritises :00 cron firings under load. Empirical 2026-05-03: the canary's cron was '0,20,40 * * * *' but actual firings landed at :08, :03, :01, :03 — :20 and :40 silently dropped. Detection latency degraded from claimed 20 min to actual ~60 min worst case. Move to '10,30,50 * * * *': - :10/:30/:50 sit 10 min off the top-of-hour load peak - Still 5 min from :15 sweep-cf-orphans and :45 sweep-cf-tunnels (the original constraint that kept us off :15/:45) - Same 20-min cadence; only the phase changes No code change beyond the cron expression + comment refresh. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 17:28:45 -07:00
Hongming Wang	8b9e7e6d59	ci: port DELETE-verify pattern to remaining staging e2e workflows Follow-up to #2648 — same `>/dev/null || true` swallow-on-error pattern existed in: e2e-staging-canvas.yml (single-slug) e2e-staging-saas.yml (loop) e2e-staging-sanity.yml (loop) e2e-staging-external.yml (loop, was `>/dev/null 2>&1` variant) All four now capture the HTTP code, log a "[teardown] deleted $slug (HTTP $code)" line on success, and emit a workflow warning naming the slug + body excerpt on non-2xx. Loop bodies also tally + summarise total leaks at the end. Exit semantics unchanged: a single cleanup miss still doesn't fail-flag the test (sweep-stale-e2e-orgs is the safety net within ~45 min). The behavior change is purely surfacing — failures that were silent are now visible on the workflow run page. Pairs with #2648's tightened sweeper. Together: per-run cleanup failures are visible AND the safety net catches them quickly. Closes the per-workflow port noted as out-of-scope in #2648. See molecule-controlplane#420. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 16:24:43 -07:00
Hongming Wang	3cd8c53de0	ci: tighten e2e cleanup race window 120m -> ~45m worst case Two changes that close one of the leak classes from the molecule-controlplane#420 vCPU audit: 1. sweep-stale-e2e-orgs.yml: cron */15 (was hourly), MAX_AGE_MINUTES 30 (was 120). E2E runs are 8-25 min wall clock; 30 min is safely above the longest run while shrinking the worst-case leak window from ~2h to ~45 min (15-min sweep cadence + 30-min threshold). 2. canary-staging.yml teardown: the per-slug DELETE used `>/dev/null || true`, which swallowed every failure. A 5xx or timeout from CP looked identical to "successfully deleted" and the canary tenant kept eating ~2 vCPU until the sweeper caught it. Now we capture the response code and surface non-2xx as a workflow warning that names the leaked slug. The exit semantics stay unchanged — a single-canary cleanup miss shouldn't fail-flag the canary itself when the actual smoke check passed. The sweeper is the safety net for whatever slips past. Caught during the molecule-controlplane#420 audit on 2026-05-03 — 3 e2e canary tenant orphans were running for 24-95 min, all under the previous 120-min sweep threshold so they went unnoticed until manual cleanup. Same `|| true` pattern exists in e2e-staging-{canvas,external,saas,sanity}.yml; out of scope for this PR (mechanical port; tracking separately) but the sweeper tightening covers all of them by reducing the safety-net latency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 16:08:40 -07:00
Hongming Wang	79a0203798	feat(synth-e2e): switch canary to claude-code + MiniMax-M2.7-highspeed Cuts the per-run LLM cost ~10x (MiniMax M2.7 vs gpt-4.1-mini) and removes the recurring OpenAI-quota-exhaustion failure mode that took the canary down on 2026-05-03 (#265 — staging quota burnt for ~16h). Path: E2E_RUNTIME=claude-code (default) → workspace-configs-templates/claude-code-default/config.yaml's `minimax` provider (lines 64-69) → ANTHROPIC_BASE_URL auto-set to api.minimax.io/anthropic → reads MINIMAX_API_KEY (per-vendor env, no collision with GLM/Z.ai etc.) Workflow changes (continuous-synth-e2e.yml): - Default runtime: langgraph → claude-code - New env: E2E_MODEL_SLUG (defaults to MiniMax-M2.7-highspeed, overridable via workflow_dispatch) - New secret wire: E2E_MINIMAX_API_KEY ← secrets.MOLECULE_STAGING_MINIMAX_API_KEY - Per-runtime missing-secret guard: claude-code requires MINIMAX, langgraph/hermes require OPENAI. Cron firing hard-fails on missing key for the active runtime; dispatch soft-skips so operators can ad-hoc test without setting up the secret first - Operators can still pick langgraph/hermes via workflow_dispatch; the OpenAI fallback path stays wired Script changes (tests/e2e/test_staging_full_saas.sh): - SECRETS_JSON branches on which key is set: E2E_MINIMAX_API_KEY → {MINIMAX_API_KEY: <key>} (claude-code path) E2E_OPENAI_API_KEY → {OPENAI_API_KEY, HERMES_*, MODEL_PROVIDER} (legacy) MiniMax wins when both are present — claude-code default canary must not accidentally consume the OpenAI key Tests (new tests/e2e/test_secrets_dispatch.sh): - 10 cases pinning the precedence + payload shape per branch - Discipline check verified: 5 of 10 FAIL on a swapped if/elif (precedence inversion), all 10 PASS on the fix - Anchors on the section-comment header so a structural refactor fails loudly rather than silently sourcing nothing The model_slug dispatcher (lib/model_slug.sh) needs no change: E2E_MODEL_SLUG override path is already wired (line 41), and 
claude-code template's `minimax-` prefix matcher catches "MiniMax-M2.7-highspeed" via lowercase-on-lookup. Operator action required to land green: - Set MOLECULE_STAGING_MINIMAX_API_KEY in repo secrets (Settings → Secrets and Variables → Actions). Use `gh secret set MOLECULE_STAGING_MINIMAX_API_KEY -R Molecule-AI/molecule-core` to avoid leaking the value into shell history. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 15:35:14 -07:00
Hongming Wang	ac6f65ab5e	test(e2e): pin pick_model_slug behavior with bash unit tests PR #2571 fixed synth-E2E by branching MODEL_SLUG per runtime, but only the langgraph branch was verified at runtime — hermes / claude-code / override / fallback had zero automated coverage. A future regression (e.g. dropping the langgraph case) would silently revert and only surface as "Could not resolve authentication method" mid-E2E. This PR: - Extracts the dispatch into tests/e2e/lib/model_slug.sh as a sourceable pick_model_slug() function. No behavior change. - Adds tests/e2e/test_model_slug.sh — 9 assertions across all 5 dispatch branches plus the override path. Verified to FAIL when any branch is flipped (manually regressed langgraph slash-form to confirm the test catches it; restored before commit). - Wires the unit test into ci.yml's existing shellcheck job (only runs when tests/e2e/ or scripts/ change). Pure-bash, no live infra. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 12:04:12 -07:00

281 Commits