Commit Graph

322 Commits

Author SHA1 Message Date
694a036a7f chore(ci): trailing newline to retrigger publish-workspace-server-image (path-filter requires workflow file change)
Some checks failed
Retarget main PRs to staging / Retarget to staging (pull_request) Has been skipped
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 5s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 6s
E2E API Smoke Test / detect-changes (pull_request) Successful in 8s
CI / Detect changes (pull_request) Successful in 9s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 9s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 10s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 11s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 11s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 13s
CI / Platform (Go) (pull_request) Successful in 9s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 7s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 12s
CI / Python Lint & Test (pull_request) Successful in 14s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 11s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 10s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 11s
CI / Canvas (Next.js) (pull_request) Successful in 22s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Failing after 1m28s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Failing after 1m30s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Failing after 1m33s
2026-05-07 08:12:10 -07:00
devops-engineer
10e510f50c chore: drop github-app-auth + swap GHCR→ECR (closes #157, #161)
Some checks failed
Retarget main PRs to staging / Retarget to staging (pull_request) Has been skipped
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 5s
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 5s
CI / Detect changes (pull_request) Successful in 8s
E2E API Smoke Test / detect-changes (pull_request) Successful in 8s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 8s
Harness Replays / detect-changes (pull_request) Successful in 9s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 9s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 8s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 8s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 9s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s
CI / Python Lint & Test (pull_request) Successful in 4s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 5s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 5s
CI / Canvas (Next.js) (pull_request) Successful in 17s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 30s
Harness Replays / Harness Replays (pull_request) Failing after 32s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Failing after 1m26s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m21s
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Failing after 1m36s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Failing after 1m36s
CI / Platform (Go) (pull_request) Successful in 2m18s
Two coupled cleanups for the post-2026-05-06 stack:

#157 — drop molecule-ai-plugin-github-app-auth
============================================
The plugin injected GITHUB_TOKEN/GH_TOKEN via the App's
installation-access flow (~hourly rotation). Per-agent Gitea
identities replaced this approach after the 2026-05-06 suspension —
workspaces now provision with a per-persona Gitea PAT from .env
instead of an App-rotated token. The plugin code itself lived on
github.com/Molecule-AI/molecule-ai-plugin-github-app-auth which is
also unreachable post-suspension; checking it out at CI build time
was already failing.

Removed:
- workspace-server/cmd/server/main.go: githubappauth import + the
  `if os.Getenv("GITHUB_APP_ID") != ""` block that called
  BuildRegistry. gh-identity remains as the active mutator.
- workspace-server/Dockerfile + Dockerfile.tenant: COPY of the
  sibling repo + the `replace github.com/Molecule-AI/molecule-ai-
  plugin-github-app-auth => /plugin` directive injection.
- workspace-server/go.mod + go.sum: github-app-auth dep entry
  (cleaned up by `go mod tidy`).
- 3 workflows: actions/checkout steps for the sibling plugin repo:
    - .github/workflows/codeql.yml (Go matrix path)
    - .github/workflows/harness-replays.yml
    - .github/workflows/publish-workspace-server-image.yml

Verified `go build ./cmd/server` + `go vet ./...` pass post-removal.

#161 — swap GHCR→ECR for publish-workspace-server-image
=======================================================
Same workflow used to push to ghcr.io/molecule-ai/platform +
platform-tenant. ghcr.io/molecule-ai is gone post-suspension. The
operator's ECR org (153263036946.dkr.ecr.us-east-2.amazonaws.com/
molecule-ai/) already hosts platform-tenant + workspace-template-*
+ runner-base images and is the post-suspension SSOT for container
images. This PR aligns publish-workspace-server-image with that
stack.

- env.IMAGE_NAME + env.TENANT_IMAGE_NAME repointed to ECR URL.
- docker/login-action swapped for aws-actions/configure-aws-
  credentials@v4 + aws-actions/amazon-ecr-login@v2 chain (the
  standard ECR auth pattern; uses AWS_ACCESS_KEY_ID/SECRET secrets
  bound to the molecule-cp IAM user).

The :staging-<sha> + :staging-latest tag policy is unchanged —
staging-CP's TENANT_IMAGE pin still points at :staging-latest, just
with the new registry prefix.

Refs molecule-core#157, #161; parallel to org-wide CI-green sweep.
2026-05-07 07:48:51 -07:00
64a0bc1f7e fix(ci): use AUTO_SYNC_TOKEN for auto-sync main->staging (Class D)
Some checks failed
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 6s
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 5s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 8s
CI / Detect changes (pull_request) Successful in 9s
E2E API Smoke Test / detect-changes (pull_request) Successful in 9s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 9s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 9s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 9s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 10s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 3s
CI / Platform (Go) (pull_request) Successful in 4s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 5s
CI / Canvas (Next.js) (pull_request) Successful in 5s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 6s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 7s
CI / Python Lint & Test (pull_request) Successful in 32s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 31s
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Failing after 1m23s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Failing after 1m24s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Failing after 1m32s
Same shape as molecule-controlplane#29: per-job GITHUB_TOKEN
doesn't have the Gitea API permissions to open PRs / push branches
the auto-sync flow needs. AUTO_SYNC_TOKEN is the devops-engineer
persona PAT (per saved memory feedback_per_agent_gitea_identity_default).
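
For illustration only, a minimal sketch of the kind of Gitea API call the
auto-sync flow needs the persona PAT for (GITEA_URL, the payload fields,
and the exact endpoint usage are illustrative, not the workflow's code):

  # Open the main -> staging sync PR with the persona PAT; the per-job
  # GITHUB_TOKEN lacks the Gitea API permissions for this.
  curl -sS -X POST \
    -H "Authorization: token ${AUTO_SYNC_TOKEN}" \
    -H "Content-Type: application/json" \
    -d '{"title":"auto-sync: main -> staging","head":"main","base":"staging"}' \
    "${GITEA_URL}/api/v1/repos/molecule-ai/molecule-core/pulls"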

Companion prod ops (already done):
- devops-engineer added as collaborator on molecule-core (write)
- devops-engineer added to staging branch protection push_whitelist
- AUTO_SYNC_TOKEN registered as Actions secret on molecule-core
2026-05-07 07:01:46 -07:00
devops-engineer
1d8c101c94 chore: drop github-app-auth + swap GHCR→ECR (closes #157, #161)
Some checks failed
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 5s
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 5s
CI / Detect changes (pull_request) Successful in 8s
E2E API Smoke Test / detect-changes (pull_request) Successful in 8s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 8s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 8s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 9s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 8s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 9s
Harness Replays / detect-changes (pull_request) Successful in 9s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 3s
CI / Canvas (Next.js) (pull_request) Successful in 4s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 4s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 5s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 6s
Harness Replays / Harness Replays (pull_request) Failing after 27s
CI / Python Lint & Test (pull_request) Successful in 31s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Failing after 1m19s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Failing after 1m21s
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Failing after 1m25s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Failing after 15m34s
CI / Platform (Go) (pull_request) Failing after 15m35s
Two coupled cleanups for the post-2026-05-06 stack:

#157 — drop molecule-ai-plugin-github-app-auth
============================================
The plugin injected GITHUB_TOKEN/GH_TOKEN via the App's
installation-access flow (~hourly rotation). Per-agent Gitea
identities replaced this approach after the 2026-05-06 suspension —
workspaces now provision with a per-persona Gitea PAT from .env
instead of an App-rotated token. The plugin code itself lived on
github.com/Molecule-AI/molecule-ai-plugin-github-app-auth which is
also unreachable post-suspension; checking it out at CI build time
was already failing.

Removed:
- workspace-server/cmd/server/main.go: githubappauth import + the
  `if os.Getenv("GITHUB_APP_ID") != ""` block that called
  BuildRegistry. gh-identity remains as the active mutator.
- workspace-server/Dockerfile + Dockerfile.tenant: COPY of the
  sibling repo + the `replace github.com/Molecule-AI/molecule-ai-
  plugin-github-app-auth => /plugin` directive injection.
- workspace-server/go.mod + go.sum: github-app-auth dep entry
  (cleaned up by `go mod tidy`).
- 3 workflows: actions/checkout steps for the sibling plugin repo:
    - .github/workflows/codeql.yml (Go matrix path)
    - .github/workflows/harness-replays.yml
    - .github/workflows/publish-workspace-server-image.yml

Verified `go build ./cmd/server` + `go vet ./...` pass post-removal.

#161 — swap GHCR→ECR for publish-workspace-server-image
=======================================================
Same workflow used to push to ghcr.io/molecule-ai/platform +
platform-tenant. ghcr.io/molecule-ai is gone post-suspension. The
operator's ECR org (153263036946.dkr.ecr.us-east-2.amazonaws.com/
molecule-ai/) already hosts platform-tenant + workspace-template-*
+ runner-base images and is the post-suspension SSOT for container
images. This PR aligns publish-workspace-server-image with that
stack.

- env.IMAGE_NAME + env.TENANT_IMAGE_NAME repointed to ECR URL.
- docker/login-action swapped for aws-actions/configure-aws-
  credentials@v4 + aws-actions/amazon-ecr-login@v2 chain (the
  standard ECR auth pattern; uses AWS_ACCESS_KEY_ID/SECRET secrets
  bound to the molecule-cp IAM user).

The :staging-<sha> + :staging-latest tag policy is unchanged —
staging-CP's TENANT_IMAGE pin still points at :staging-latest, just
with the new registry prefix.

Refs molecule-core#157, #161; parallel to org-wide CI-green sweep.
2026-05-07 05:12:06 -07:00
06d4bab29d Merge pull request 'fix(ci): port publish-runtime cascade to Gitea repo-dispatch API (closes #14)' (#20) from fix/14-cascade-gitea-dispatch into staging
Some checks failed
Secret scan / Scan diff for credential-shaped strings (push) Successful in 9s
CI / Canvas (Next.js) (push) Successful in 7s
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (push) Successful in 10s
E2E API Smoke Test / detect-changes (push) Successful in 11s
CI / Platform (Go) (push) Successful in 29s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (push) Failing after 54s
Block internal-flavored paths / Block forbidden paths (push) Successful in 10s
CI / Detect changes (push) Successful in 11s
E2E API Smoke Test / E2E API Smoke Test (push) Successful in 28s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (push) Failing after 1m57s
Handlers Postgres Integration / Handlers Postgres Integration (push) Successful in 9s
CI / Canvas Deploy Reminder (push) Has been skipped
E2E Staging Canvas (Playwright) / detect-changes (push) Successful in 12s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (push) Successful in 9s
Handlers Postgres Integration / detect-changes (push) Successful in 13s
Runtime PR-Built Compatibility / detect-changes (push) Successful in 12s
CI / Shellcheck (E2E scripts) (push) Successful in 4s
CI / Python Lint & Test (push) Successful in 6s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (push) Successful in 10m34s
CodeQL / Analyze (${{ matrix.language }}) (go) (push) Failing after 19m45s
CodeQL / Analyze (${{ matrix.language }}) (python) (push) Failing after 20m19s
2026-05-07 10:36:32 +00:00
Hongming Wang
4279fecde5 fix(ci): keep codex in TEMPLATES + skip-if-no-publish-image.yml
Some checks failed
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 6s
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 6s
cascade-list-drift-gate / check (pull_request) Successful in 13s
CI / Detect changes (pull_request) Successful in 9s
E2E API Smoke Test / detect-changes (pull_request) Successful in 9s
pr-guards / disable-auto-merge-on-push (pull_request) Failing after 1s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 3s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 5s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 25s
CI / Platform (Go) (pull_request) Successful in 5m22s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 6s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 4s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 17s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 16s
CI / Canvas (Next.js) (pull_request) Failing after 5m16s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 6s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Failing after 1m39s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 6s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 5s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 51s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Failing after 20m54s
CI / Python Lint & Test (pull_request) Successful in 15m42s
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Failing after 19m46s
The v2 dropped codex from TEMPLATES on the basis of "no
publish-image.yml = not part of cascade today." That was correct
about the immediate behavior but tripped cascade-list-drift-gate.yml
because manifest.json still declares codex (it IS a live runtime —
referenced from workspace/config.py and cloned into dev envs by
clone-manifest.sh; only the image-publish path is missing).

Restore codex to TEMPLATES (matching manifest) and add a runtime
soft-skip: probe each repo for .github/workflows/publish-image.yml
via the Gitea contents API and skip cleanly if 404. Final job log
distinguishes "complete across all" vs "complete with soft-skips".
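
As a hedged sketch, the per-template probe shape (variable names and the
surrounding loop handling are illustrative, not the workflow's exact step):

  # 200 = repo has publish-image.yml and joins the cascade; 404 = soft-skip.
  status=$(curl -s -o /dev/null -w '%{http_code}' \
    -H "Authorization: token ${DISPATCH_TOKEN}" \
    "${GITEA_URL}/api/v1/repos/molecule-ai/${repo}/contents/.github/workflows/publish-image.yml")
  if [ "${status}" = "404" ]; then
    echo "soft-skip: ${repo} has no publish-image.yml"
    SOFT_SKIPPED="${SOFT_SKIPPED} ${repo}"
    continue
  fi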

This preserves the drift gate's invariant (TEMPLATES == manifest)
while honoring the empirical fact that codex has no publish-image
workflow yet. If codex later gains the workflow, no change here is
needed — the probe will see 200 and the cascade will fan out to it
naturally.

Refs molecule-core#14, molecule-core#20.
2026-05-07 03:32:53 -07:00
Hongming Wang
607444e71b feat(ci): replace curl-dispatch with push-mode cascade (v2)
Some checks failed
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 5s
CI / Detect changes (pull_request) Successful in 5s
E2E API Smoke Test / detect-changes (pull_request) Successful in 11s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 5s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 6s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 6s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 6s
pr-guards / disable-auto-merge-on-push (pull_request) Failing after 2s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 7s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Failing after 1m21s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 46s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Failing after 1m28s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 5s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 10s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 5s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 7s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 26s
CI / Platform (Go) (pull_request) Successful in 3m32s
CI / Canvas (Next.js) (pull_request) Failing after 3m34s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
cascade-list-drift-gate / check (pull_request) Failing after 9s
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 4s
CI / Python Lint & Test (pull_request) Successful in 16m16s
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Failing after 20m25s
Empirical blocker on v1: Gitea 1.22.6 exposes no repository_dispatch /
workflow_dispatch trigger API (verified across 6 candidate paths in
issuecomment-913), so v1's curl-POST loop would always exit 1.

v2 pivots to push-mode: each template repo got a small companion PR
(merged 2026-05-07) adding a `.runtime-version` file at root + a
`resolve-version` job in publish-image.yml that reads the file and
forwards the value to the reusable build workflow. publish-runtime
now updates that file via git-clone + commit + push, which trips
each template's existing `on: push: branches: [main]` trigger.

Behaviour changes vs v1:
- Templates list dropped from 9 → 8 (codex has no publish-image.yml
  so was never part of the cascade in practice).
- 3-retry pull-rebase loop per template (handles concurrent-push
  races without force-push). Failures collected, job exits 1 with
  the failed-template list at the end.
- Idempotency: when re-run with the same version, templates already
  pinned to that version contribute zero commits — operator can
  safely re-run to retry partial failures.
- Author line: a "publish-runtime cascade <publish-runtime@moleculesai.app>"
  trailer makes it clear the commit is workflow-driven, not human
  (per memory feedback_github_botring_fingerprint).

DISPATCH_TOKEN secret name unchanged (still consumed at
secrets.DISPATCH_TOKEN per 569df259).
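
A hedged sketch of the per-template update step described above (function
and variable names are illustrative; the real job iterates the 8-entry
TEMPLATES list and collects failures):

  update_template() {
    local repo="$1" version="$2" dir
    dir=$(mktemp -d)
    # Auth (token credential helper) elided; clone the template's main.
    git clone "https://git.moleculesai.app/molecule-ai/${repo}.git" "${dir}"
    cd "${dir}" || return 1
    echo "${version}" > .runtime-version
    git add .runtime-version
    # Idempotency: nothing staged means this template is already pinned.
    git diff --cached --quiet && { echo "${repo}: already at ${version}"; return 0; }
    git -c user.name="publish-runtime cascade" \
        -c user.email="publish-runtime@moleculesai.app" \
        commit -m "chore: bump .runtime-version to ${version}"
    for attempt in 1 2 3; do   # pull-rebase retry for concurrent-push races
      git pull --rebase origin main && git push origin main && return 0
    done
    return 1
  }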

Refs molecule-core#14, builds on molecule-core#20 issuecomment-923
(Phase 2 design).
2026-05-07 03:17:38 -07:00
Hongming Wang
569df259ba fix(ci): align secret name to plumbed DISPATCH_TOKEN (closes #14)
Some checks failed
pr-guards / disable-auto-merge-on-push (pull_request) Failing after 3s
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 7s
CI / Detect changes (pull_request) Successful in 7s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 6s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 8s
cascade-list-drift-gate / check (pull_request) Successful in 13s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 9s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 14s
E2E API Smoke Test / detect-changes (pull_request) Successful in 12s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 10s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 6s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 6s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 19s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 12s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Failing after 19s
CI / Python Lint & Test (pull_request) Failing after 20s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 7s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 34s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Failing after 1m31s
CI / Platform (Go) (pull_request) Successful in 3m6s
CI / Canvas (Next.js) (pull_request) Failing after 3m8s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Failing after 14m54s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Failing after 15m3s
The cascade workflow was reading from `secrets.TEMPLATE_DISPATCH_TOKEN`
but the plumbed secret name is `DISPATCH_TOKEN` (verified just now via
GET /repos/molecule-ai/molecule-core/actions/secrets — only DISPATCH_TOKEN
is set). Without this rename the cascade would always evaluate "secret
missing" and exit 1 on the next push to staging, defeating the entire
point of grant-role-access.sh --apply that just landed.

Three references updated:
  - env mapping (`secrets.X` → `secrets.DISPATCH_TOKEN`)
  - workflow_dispatch warning text
  - push-trigger error text

The bash-side variable name is unchanged (still `DISPATCH_TOKEN`) so
the curl invocation at line 372 is unaffected. YAML round-trip parses
clean.
2026-05-07 02:38:20 -07:00
1d9d8c7809 Merge pull request 'fix(scripts): migrate ghcr.io→ECR + raw.githubusercontent.com→Gitea (#46)' (#16) from fix/script-ghcr-and-lint-paths into staging
Some checks failed
CI / Platform (Go) (push) Blocked by required conditions
CI / Canvas (Next.js) (push) Blocked by required conditions
CI / Shellcheck (E2E scripts) (push) Blocked by required conditions
CI / Canvas Deploy Reminder (push) Blocked by required conditions
CI / Python Lint & Test (push) Blocked by required conditions
E2E API Smoke Test / detect-changes (push) Waiting to run
E2E API Smoke Test / E2E API Smoke Test (push) Blocked by required conditions
E2E Staging Canvas (Playwright) / detect-changes (push) Waiting to run
E2E Staging Canvas (Playwright) / Canvas tabs E2E (push) Blocked by required conditions
Handlers Postgres Integration / Handlers Postgres Integration (push) Blocked by required conditions
Runtime PR-Built Compatibility / detect-changes (push) Waiting to run
Runtime PR-Built Compatibility / PR-built wheel + import smoke (push) Blocked by required conditions
Secret scan / Scan diff for credential-shaped strings (push) Waiting to run
Ops Scripts Tests / Ops scripts (unittest) (push) Waiting to run
Block internal-flavored paths / Block forbidden paths (push) Has been cancelled
Handlers Postgres Integration / detect-changes (push) Has been cancelled
CI / Detect changes (push) Has been cancelled
CodeQL / Analyze (${{ matrix.language }}) (go) (push) Has been cancelled
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (push) Has been cancelled
CodeQL / Analyze (${{ matrix.language }}) (python) (push) Has been cancelled
SECRET_PATTERNS drift lint / Detect SECRET_PATTERNS drift (push) Failing after 12s
2026-05-07 09:25:24 +00:00
ce3f1f48a4 fix(ci): port publish-runtime cascade to Gitea repo-dispatch API (closes molecule-core#14)
Some checks failed
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 4s
cascade-list-drift-gate / check (pull_request) Successful in 4s
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 5s
CI / Detect changes (pull_request) Successful in 6s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 3s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 6s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 6s
CI / Python Lint & Test (pull_request) Failing after 14s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 4s
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Failing after 49s
CI / Canvas (Next.js) (pull_request) Failing after 1m55s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Handlers Postgres Integration / detect-changes (pull_request) Successful in 6s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 5s
E2E API Smoke Test / detect-changes (pull_request) Successful in 6s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 6s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 4s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 6s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 7s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Failing after 1m20s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Failing after 1m24s
CI / Platform (Go) (pull_request) Successful in 2m5s
## Symptom

`publish-runtime.yml::cascade` fired a `repository_dispatch` to 10 workspace-template
repos via direct curl to `https://api.github.com/repos/...`. Post-2026-05-06 the
org's GitHub presence is suspended; every invocation 404s. The job's
warning-only (⚠️) posture meant the failure didn't propagate, leaving the runtime
PyPI publish → template image rebuild pipeline silently broken.

## Why Option A (rewrite) and not Option B (delete)

Verified 2026-05-07 by devops-engineer (molecule-core#14 thread):

- The cron-poll mechanism (/etc/cron.d/molecule-deploy-poll) tracks ONLY the
  Vercel/Railway-deployed repos (landingpage/docs/molecule-app/molecules-market
  /molecule-controlplane). It does NOT track workspace-template-* repos.
- Each of the 9 template `publish-image.yml` workflows has
  `repository_dispatch: types: [runtime-published]` as a load-bearing trigger.
  Without the cascade, when the runtime ships a new PyPI version, templates
  don't auto-rebuild.

So Option B (delete) would silently break the runtime → template fan-out.
Option A (rewrite to Gitea's API shape) is the right call. Security-auditor
agreed after seeing the cron-poll TRACKED list.

## API surface change

| Concern | Pre-fix (GitHub) | Post-fix (Gitea) |
|---|---|---|
| URL | `https://api.github.com/repos/$REPO/dispatches` | `${GITEA_URL}/api/v1/repos/$REPO/dispatches` |
| Owner case | `Molecule-AI/...` | `molecule-ai/...` (lowercase, Gitea is case-sensitive) |
| Auth header | `Authorization: Bearer $DISPATCH_TOKEN` | `Authorization: token $DISPATCH_TOKEN` |
| Body shape | `{event_type, client_payload}` | UNCHANGED — Gitea is GitHub-compatible here |
| Success code | `204 No Content` | `204 No Content` (unchanged) |

`GITEA_URL` defaults to `https://git.moleculesai.app`; overridable via job env.
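
A hedged sketch of the post-fix dispatch call, following the table above
(the REPO value and the payload version are illustrative):

  REPO="molecule-ai/workspace-template-python"   # hypothetical example repo
  code=$(curl -s -o /dev/null -w '%{http_code}' -X POST \
    -H "Authorization: token ${DISPATCH_TOKEN}" \
    -H "Content-Type: application/json" \
    -d '{"event_type":"runtime-published","client_payload":{"version":"1.2.3"}}' \
    "${GITEA_URL}/api/v1/repos/${REPO}/dispatches")
  [ "${code}" = "204" ] || echo "::warning::dispatch to ${REPO} returned ${code}"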

## Out-of-band: DISPATCH_TOKEN secret rotation

The DISPATCH_TOKEN secret was a GitHub PAT. It must be re-minted as a Gitea
PAT for the new API to authenticate. Per saved memory
`feedback_per_agent_gitea_identity_default`, this should be a dedicated
`publish-runtime-bot` persona token with `write:repository` scope on the
9 target repos — NOT the founder PAT.

This PR ships the workflow change. Token rotation is the operator-host
follow-up (security-auditor's lane) — coordinate the merge so the token
is in place before the next runtime release fires.

## Backwards compatibility

The workflow ran silently-broken since 2026-05-06 (every invocation 404
+ a ⚠️ warning but no failure). So there is no functional regression from
"silently broken" to "actually working". Any in-progress operator-managed
manual dispatch path is unaffected; the Gitea API parallel path doesn't
require operator intervention.

## Test plan

- [x] YAML parse OK on the modified workflow file
- [ ] Smoke test: trigger a runtime publish (or simulate via dispatching to one
      template) post-merge; verify HTTP 204 + the template's publish-image
      workflow fires + the template's image gets re-pushed against the new
      runtime version. Phase 4 verification belongs to internal#46 follow-up.

## Hostile self-review (3 weakest spots)

1. The fan-out remains all-or-nothing: a single template failure surfaces as
   a ⚠️ warning but the PyPI publish proceeds. With 9 templates, each failed
   template leaves roughly 1 in 9 of the fleet on a stale image at the next
   runtime bump.
   Defense: the warning shows up in the workflow summary; operators retry.
   Future hardening: requeue-on-fail with bounded retry, or a separate
   reconcile cron that detects template/runtime version drift and re-dispatches.

2. `DISPATCH_TOKEN` validity is enforced by the Gitea API (401 on stale)
   but the workflow doesn't differentiate 401 from 404. Either way the
   warning fires. Future hardening: explicit token-shape check at the start
   of the cascade job (curl `/api/v1/user` once, fail-fast if 401).

3. Owner-case lowercase is right today but couples the workflow to the
   current Gitea org slug. If the org is ever renamed, this workflow
   breaks silently. Less fragile alternative: derive REPO from a
   canonical config (e.g. `gh repo list molecule-ai`) instead of
   string-concatenating. Acceptable today; filed as the same future
   hardening pass as item 1.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 01:31:37 -07:00
aa22183e52 chore(ci): pin artifact actions to @v3 for Gitea act_runner compatibility (internal#46)
Some checks failed
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 5s
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 7s
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Failing after 1m9s
E2E API Smoke Test / detect-changes (pull_request) Successful in 7s
CI / Detect changes (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 5s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 3s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 5s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 4s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 4s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Failing after 1m31s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Failing after 1m33s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 13s
CI / Python Lint & Test (pull_request) Failing after 19s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 5s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 6s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 4s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Failing after 27s
CI / Canvas (Next.js) (pull_request) Successful in 4m47s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / Platform (Go) (pull_request) Successful in 5m32s
Mechanical pin: 4 `actions/upload-artifact@v4.6.2/v7.0.1` uses → `@v3`. v4+/v7+
rely on a runtime API shape that Gitea's act_runner v0.6.x doesn't fully
support; v3 uses the legacy server protocol that act_runner supports end-to-end.

Files (4 uses):
  - .github/workflows/ci.yml:238 (v4.6.2 → v3)
  - .github/workflows/codeql.yml:124 (v7.0.1 → v3)
  - .github/workflows/e2e-staging-canvas.yml:142 (v7.0.1 → v3)
  - .github/workflows/e2e-staging-canvas.yml:150 (v7.0.1 → v3)

YAML parse green on all 3 files.

Sister PRs land for `molecule-controlplane` and `codex-channel-molecule`.
Per internal#46 Phase 2 audit; tracked under that umbrella.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 01:00:53 -07:00
security-auditor
e01077be38 fix(ci): lowercase 'molecule-ai/' in cross-repo workflow refs
Some checks failed
Retarget main PRs to staging / Retarget to staging (pull_request) Has been skipped
cascade-list-drift-gate / check (pull_request) Successful in 3s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 3s
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 5s
CI / Detect changes (pull_request) Successful in 4s
pr-guards / disable-auto-merge-on-push (pull_request) Failing after 0s
E2E API Smoke Test / detect-changes (pull_request) Successful in 5s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 4s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 5s
Harness Replays / detect-changes (pull_request) Successful in 4s
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Failing after 50s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 3s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 4s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 5s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Failing after 1m16s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Failing after 1m16s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 7s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 3s
CI / Python Lint & Test (pull_request) Failing after 16s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 4s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 6s
Harness Replays / Harness Replays (pull_request) Failing after 40s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 3s
CI / Canvas (Next.js) (pull_request) Failing after 4m47s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / Platform (Go) (pull_request) Successful in 5m25s
Gitea is case-sensitive on owner slugs; canonical is lowercase
`molecule-ai/...`. Mixed-case `Molecule-AI/...` refs fail-at-0s
when the runner tries to resolve the cross-repo workflow / checkout.

Same fix as molecule-controlplane#12. Mechanical case-correction;
no behavior change beyond making CI resolve again.

Refs: internal#46

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 01:00:10 -07:00
documentation-specialist
5d4184f4a3 fix(scripts): migrate ghcr.io→ECR + raw.githubusercontent.com→Gitea (#46)
Some checks failed
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 6s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 5s
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Failing after 54s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 5s
CI / Detect changes (pull_request) Successful in 5s
E2E API Smoke Test / detect-changes (pull_request) Successful in 6s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 6s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 6s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 6s
CI / Platform (Go) (pull_request) Successful in 3s
CI / Python Lint & Test (pull_request) Successful in 3s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 3s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 5s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 4s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Failing after 13s
CI / Canvas (Next.js) (pull_request) Successful in 42s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Failing after 1m18s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Failing after 1m20s
Per documentation-specialist's grep agent (2026-05-07T07:30, see
internal#46): runtime-breaking ghcr.io references in shell scripts,
docker-compose, and lint_secret_pattern_drift.py all need migration.
These were missed by security-auditor's workflow-only audit.

Files (6):

- .github/scripts/lint_secret_pattern_drift.py:40 — workspace-runtime
  pre-commit-checks.sh consumer URL: raw.githubusercontent.com →
  Gitea raw URL (https://git.moleculesai.app/molecule-ai/.../raw/
  branch/main/...). The lint job runs in CI and would 404 today.

- scripts/refresh-workspace-images.sh:54 — workspace-template image
  pull URL: ghcr.io → ECR (153263036946.dkr.ecr.us-east-2.amazonaws.com).

- scripts/rollback-latest.sh — full rewrite of header + auth flow:
  * ghcr.io/molecule-ai/{platform,platform-tenant} → ECR
  * GITHUB_TOKEN with write:packages → AWS ECR auth
    (aws ecr get-login-password). Per saved memory
    reference_post_suspension_pipeline, prod cutover is to ECR.
  * Updated header docs to match new auth flow + prereqs.

- scripts/demo-freeze.sh:13,17 — comment-only ghcr → ECR
  (the script doesn't currently exec these URLs, but the comments
  describe the cascade and need to match reality).

- docker-compose.yml:215-216 — canvas image: ghcr.io → ECR + updated
  the auth comment to describe `aws ecr get-login-password` flow.

- tools/check-template-parity.sh:21 — inline curl install instructions:
  raw.githubusercontent.com → Gitea raw URL.
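
A hedged before/after sketch of the two URL classes being migrated
(image tag, repo path, and file path are illustrative):

  # Container pulls: ghcr.io -> ECR, auth via the aws CLI rather than a
  # GITHUB_TOKEN docker login.
  aws ecr get-login-password --region us-east-2 \
    | docker login --username AWS --password-stdin 153263036946.dkr.ecr.us-east-2.amazonaws.com
  docker pull 153263036946.dkr.ecr.us-east-2.amazonaws.com/molecule-ai/platform-tenant:staging-latest

  # Raw file fetches: raw.githubusercontent.com -> Gitea's /raw/branch/ shape.
  curl -fsSL "https://git.moleculesai.app/molecule-ai/workspace-runtime/raw/branch/main/pre-commit-checks.sh"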

Hostile self-review:

1. rollback-latest.sh's GITHUB_TOKEN→aws-cli auth swap is a behavior
   change. Operators using this script now need the aws CLI
   authenticated for region us-east-2 with ECR pull/push perms.
   Documented in the updated header. Operators who don't have the aws
   CLI will get 'aws: command not found', which is a clear failure
   mode (not silent).
2. The Gitea raw URL shape (/raw/branch/main/) differs from GitHub's
   raw.githubusercontent.com structure. Verified pattern by
   inspecting other Gitea raw URLs in the codebase. If Gitea's URL
   changes (1.23+), update via the same one-line edit.
3. Doesn't touch packer/scripts/install-base.sh which has a similar
   ghcr.io ref per the grep agent's findings — that's bigger-scope
   (packer build pipeline) and lives in molecule-controlplane-ish
   territory; filing as parked follow-up under #46 if not already.

Refs: molecule-ai/internal#46, molecule-ai/internal#37,
molecule-ai/internal#38, saved memory reference_post_suspension_pipeline
2026-05-07 00:56:23 -07:00
Hongming Wang
debe29c889 ci(handlers-postgres-integration): apply legacy *.sql migrations too
The migration-replay step globbed only *.up.sql, silently skipping
the older flat-naming migrations (001_workspaces.sql,
009_activity_logs.sql, etc.). Fine while no integration test
depended on those tables; broke when the #149 cross-table
atomicity test came in needing both workspaces (FK target for
activity_logs) and activity_logs themselves.

Switch to globbing *.sql + sorted lex-order, excluding *.down.sql
so up/down pairs don't undo themselves mid-run. Add a sanity check
for workspaces + activity_logs + pending_uploads alongside the
existing delegations gate so a future migration drift fails loud
instead of silently skipping the regressed test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 22:02:24 -07:00
Hongming Wang
88ff0d770b chore(sweep): add orphan-tunnel cleanup step (#2987 / #340)
The 15-min sweeper has been deleting stale e2e orgs but not the
orphan tunnels left behind when the org-delete cascade half-fails
(CP transient 5xx after the org row is gone but before the CF
tunnel delete completes). Result: tunnels accumulate in CF until
manual operator cleanup.

Add a final step that POSTs `/cp/admin/orphan-tunnels/cleanup`
every tick. Best-effort — failure doesn't fail the workflow; next
tick re-attempts. Output reports deleted_count + failed count for
ops visibility.
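
A hedged sketch of the step (CP_URL and the admin-auth header are
illustrative; the real step lives in the sweep workflow):

  # Best-effort: `|| true` keeps a CP hiccup from failing the sweep run;
  # the next 15-min tick simply re-attempts.
  resp=$(curl -sS -X POST "${CP_URL}/cp/admin/orphan-tunnels/cleanup" \
    -H "Authorization: Bearer ${CP_ADMIN_TOKEN}" || true)
  echo "orphan-tunnel cleanup: ${resp:-request failed; next tick retries}"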

This is the catch-all for the orphan-tunnel class. The proper
upstream fix (transactional org delete) lives in CP and is tracked as
issue #2989. Until that lands, the sweeper's bounded time-to-cleanup
keeps the leak from escalating.

Note: PR #492 (the cf-tunnel silent-success fix) is what makes this
step actually effective — pre-fix, DeleteTunnel silently succeeded on
error 1022, so the cleanup endpoint reported success without deleting.
Post-fix, the cleanup chains CleanupTunnelConnections plus a retry on
1022, which actually clears stuck-connector orphans.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-05-05 19:36:20 -07:00
Hongming Wang
a19ee90556 chore(sweep): note SSOT for ephemeral prefixes lives in CP
Mirrors molecule-controlplane#494: the canonical EPHEMERAL_PREFIXES
list now lives in molecule-controlplane/internal/slugs/ephemeral.go,
where redeploy-fleet reads it to skip in-flight test tenants. The
sweep workflow keeps a Python copy because GHA Python can't import
Go, but a comment now points engineers updating the list to update
both files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 19:18:13 -07:00
Hongming Wang
caf19e8980 feat(ops): hourly alarm for auto-promote PR stuck on REVIEW_REQUIRED (#2975)
Closes the silent-block failure mode that left 25 commits — including
the Memory v2 redesign and the reno-stars data-loss fix — wedged on
staging for 12+ hours behind a single missing review. The auto-promote
workflow opened the PR + armed auto-merge, but main's branch protection
required a human review and nobody noticed until a user reported
"still seeing old memory tab".

## Detection logic — `scripts/check-stale-promote-pr.sh`

Reads open PRs `base=main head=staging` and alarms on:
  - `mergeStateStatus == BLOCKED`
  - `reviewDecision == REVIEW_REQUIRED`
  - createdAt older than `STALE_HOURS` (default 4h)

Other BLOCKED reasons (DIRTY, BEHIND, failed checks) are NOT alarmed —
those are the author's signal-to-fix. This script targets the specific
"no human reviewed yet" wedge.

Output:
  - `::warning` per stale PR (visible in workflow summary + Actions UI)
  - PR comment (idempotent via marker-string detection; one alarm
    per PR, never re-spammed)
  - Exit code = count of stale PRs (capped at 125)
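
A hedged sketch of the detection query (GNU date assumed; the real script
adds the idempotent PR comment, --stale-hours parsing, and exit-code
handling):

  STALE_HOURS="${STALE_HOURS:-4}"
  cutoff=$(date -u -d "-${STALE_HOURS} hours" +%Y-%m-%dT%H:%M:%SZ)
  gh pr list --base main --head staging --state open \
      --json number,createdAt,mergeStateStatus,reviewDecision \
    | jq -r --arg cutoff "${cutoff}" '
        .[]
        | select(.mergeStateStatus == "BLOCKED"
                 and .reviewDecision == "REVIEW_REQUIRED"
                 and .createdAt < $cutoff)
        | "::warning::auto-promote PR #\(.number) stuck on REVIEW_REQUIRED since \(.createdAt)"'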

Logic in a script (not inline workflow YAML) so it's:
  - **Unit-testable** — tests/test-check-stale-promote-pr.sh exercises
    every branch with stubbed fixture JSON + frozen clock. 23 tests
    covering: empty list, single stale, just-under-threshold, wrong
    reviewDecision, wrong mergeStateStatus, mixed list (only matching
    PRs alarm), custom threshold via --stale-hours, exit-code-counts-
    matching-PRs, --help, unknown arg → 64, missing repo → 2.
  - **Operator-runnable ad-hoc** — `scripts/check-stale-promote-pr.sh`
    works from any shell with `gh` + `jq`.
  - **SSOT** — one detector, the workflow YAML is just schedule +
    invocation surface. Future sibling workflows that need the same
    check call the same script.

## Workflow — `.github/workflows/auto-promote-stale-alarm.yml`

Triggers:
  - cron `27 * * * *` (hourly, off-the-hour to dodge cron herd)
  - workflow_dispatch with `stale_hours` + `post_comment` overrides

Concurrency: `auto-promote-stale-alarm` group, cancel-in-progress=false
(idempotent script; no benefit to cancelling a running scan).

Permissions: `contents: read` + `pull-requests: write` (post comments).

Sparse checkout — only fetches `scripts/check-stale-promote-pr.sh`.
No node_modules, no go modules, no slow setup steps. Workflow runs
in <30s on a clean repo.

## Why "alarm + comment" not "auto-approve"

Considered options in issue #2975:
  1. Slack/email alert — picked.
  2. Bot-account auto-approve via molecule-ops — circumvents the
     human-review gate that branch protection encodes.
  3. Trusted-promote bypass via CODEOWNERS — needs Org Admin config
     change; out of scope for a workflow PR.

The comment-on-PR pattern picks (1) without external dependencies
(no Slack token, no email config). Subscribers get notified via
GitHub's existing PR notification delivery; the warning shows up in
the Actions feed.

## Why this won't false-positive on legitimate slow reviews

Threshold is 4h. Most legitimate gates clear in <1h, so 4× headroom
is plenty for slow CI. The comment is idempotent (one alarm per PR,
never re-posted), so noise is capped at one comment regardless of
how long the PR sits.

## Test plan

- [x] `bash scripts/test-check-stale-promote-pr.sh` — 23/23 pass
- [x] `python3 -c 'yaml.safe_load(...)'` clean
- [x] `bash -n` clean on both scripts
- [ ] Live verification: dispatch the workflow once main has caught up,
      confirm it correctly reports zero stale PRs
2026-05-05 17:55:27 -07:00
Hongming Wang
475da5b64c refactor(workspace): extract inbox tools from a2a_tools.py (RFC #2873 iter 4e)
Continues the OSS-shape refactor. After iters 4a-4d (rbac, delegation,
memory, messaging) the only behavior left in ``a2a_tools.py`` was
``report_activity`` plus three thin inbox-tool wrappers and the
``_enrich_inbound_for_agent`` helper. This iter extracts the inbox
slice to ``a2a_tools_inbox.py`` so the kitchen-sink module shrinks
from 280 LOC to ~165 LOC of imports + report_activity + back-compat
re-export blocks.

Extracted symbols:
  - ``_INBOX_NOT_ENABLED_MSG`` (sentinel)
  - ``_enrich_inbound_for_agent`` (poll-path peer enrichment helper)
  - ``tool_inbox_peek``
  - ``tool_inbox_pop``
  - ``tool_wait_for_message``

Re-exports (`from a2a_tools_inbox import …`) preserve the public
``a2a_tools.tool_inbox_*`` surface so existing tests + call sites
continue to resolve unchanged.

New tests in test_a2a_tools_inbox_split.py:
  1. **Drift gate (5)** — every previously-public symbol on a2a_tools
     is the EXACT same object as a2a_tools_inbox.foo (`is`, not `==`),
     catches a future "wrap with logging" refactor that silently loses
     existing test coverage.
  2. **Import contract (1)** — a2a_tools_inbox does NOT eagerly import
     a2a_tools at module load. Pins the layered architecture: the
     extracted slice depends on ``inbox`` + a lazy ``a2a_client``
     import, never on the kitchen-sink that re-exports it.
  3. **_enrich_inbound_for_agent branches (5)** — peer_id-empty
     (canvas_user) returns dict unchanged; missing peer_id key same;
     a2a_client unavailable (test harness, partial install) degrades
     gracefully with a bare envelope; registry hit populates
     peer_name + peer_role + agent_card_url; registry miss still
     surfaces agent_card_url (constructable from peer_id alone).

The full timeout-clamp / validation / JSON-shape behavior matrix for
the three wrappers stays in test_a2a_tools_inbox_wrappers.py — those
tests pass identically against both the alias and the underlying impl.

Wiring updates:
  - ``scripts/build_runtime_package.py``: add ``a2a_tools_inbox`` to
    ``TOP_LEVEL_MODULES`` so it ships in the runtime wheel and the
    drift gate doesn't fail the next publish.
  - ``.github/workflows/ci.yml``: add ``a2a_tools_inbox.py`` to
    ``CRITICAL_FILES`` so the 75% MCP/inbox/auth per-file floor
    applies — this is now where the inbox-delivery code actually
    lives.
2026-05-05 14:28:58 -07:00
Hongming Wang
0ca4e431c1 test(e2e): add poll-mode chat upload E2E and wire into e2e-api.yml
Covers the user-visible flow that Phase 1-5b shipped (RFC #2891):
register a poll-mode workspace, POST a multi-file upload to /chat/uploads, verify
the activity feed shows one chat_upload_receive row per file, fetch the
bytes via /pending-uploads/:fid/content, ack each row, and confirm a
post-ack fetch returns 404. Also pins cross-workspace bleed protection
(workspace B's bearer on A's URL → 401, B's URL with A's file_id →
404) and the file_id-UUID-parse 400 path.
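
For orientation, a heavily hedged sketch of the core round-trip the new
script asserts — the bearer variables, multipart field name, and the ack
step are illustrative; the real request shapes live in the script itself:

  # Upload two files into workspace A's pending-upload queue...
  curl -sf -X POST "${PLATFORM_URL}/chat/uploads" \
    -H "Authorization: Bearer ${WS_A_TOKEN}" \
    -F "files=@one.txt" -F "files=@two.txt"
  # ...fetch the bytes back by file_id, then ack; a post-ack fetch must 404.
  curl -sf -H "Authorization: Bearer ${WS_A_TOKEN}" \
    "${PLATFORM_URL}/pending-uploads/${FILE_ID}/content" -o fetched.txt
  # Cross-workspace bleed: B's bearer on A's URL must 401; B's URL with
  # A's file_id must 404.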

23 assertions, all green against a local platform (Postgres+Redis+
platform-server stack matches the e2e-api.yml CI recipe verbatim).

Why a new script instead of extending test_poll_mode_e2e.sh: that
script tests A2A short-circuit + since_id cursor semantics; this one
tests the chat-upload path. They share zero handler code on the
platform side and would dilute each other's failure messages if
combined.

Why not the bearerless-401 strict-mode assertion: the platform's
wsauth fail-opens for bearerless requests when MOLECULE_ENV=development
(see middleware/devmode.go). The CI workflow doesn't set that var, but
some local-dev .env files do — the assertion would flap by environment
without testing the poll-mode upload contract. The middleware's own
unit tests cover strict-mode 401.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:08:55 -07:00
Hongming Wang
6125700c39 test(e2e): plug /tmp scratch leaks in 3 shell E2E tests + add CI lint gate (RFC #2873 iter 2)
Three shell E2E tests created scratch files via `mktemp` but never
deleted them on early exit (assertion failure, SIGINT, errexit). Each
CI run leaked ~10-100 KB of /tmp into the runner; over ~200 runs/week
that's 20+ MB of accumulated cruft.

## Files

- **test_chat_attachments_e2e.sh** — was missing both trap and rm;
  added per-run TMPDIR_E2E with `trap rm -rf … EXIT INT TERM`.
- **test_notify_attachments_e2e.sh** — had a `cleanup()` for the
  workspace but didn't include the TMPF; only an unconditional
  `rm -f` at the bottom (line 233) which doesn't fire on early exit.
  Extended cleanup() to also rm the scratch + dropped the redundant
  trailing rm.
- **test_chat_attachments_multiruntime_e2e.sh** — `round_trip()`
  function had per-call `rm -f` only on the success path; failure
  paths leaked. Switched to script-level TMPDIR_E2E + trap; per-call
  rm dropped (the trap handles every return path including SIGINT).

Pattern: `mktemp -d -t prefix-XXX` for the dir, `mktemp <full-template>`
for files (portable across BSD/macOS + GNU coreutils — `-p` is
GNU-only and breaks Mac local-dev runs).
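
The shared shape, as a sketch (prefix and file names illustrative):

  # One per-run scratch dir; the trap fires on assertion failure, SIGINT,
  # SIGTERM, and errexit — every early-exit path the old code leaked on.
  TMPDIR_E2E=$(mktemp -d -t chat-e2e-XXX)
  trap 'rm -rf "${TMPDIR_E2E}"' EXIT INT TERM
  SCRATCH=$(mktemp "${TMPDIR_E2E}/payload-XXXXXX")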

## Regression gate

New `tests/e2e/lint_cleanup_traps.sh` asserts every `*.sh` that calls
`mktemp` also has a `trap … EXIT` line in the file. Wired into the
existing Shellcheck (E2E scripts) CI step. Verified locally: passes
on the fixed state, fails-loud when one of the 3 fixes is reverted.

## Verification

  - shellcheck --severity=warning clean on all 4 touched files
  - lint_cleanup_traps.sh passes on the post-fix tree (6 mktemp users,
    all have EXIT trap)
  - Negative test: revert one fix → lint exits 1 with file:line +
    suggested fix pattern in the error message (CI-grokkable
    ::error file=… annotation)
  - Trap fires on SIGTERM mid-run (smoke-tested on macOS BSD mktemp)
  - Trap fires on `exit 1` (smoke-tested)

## Bars met (7-axis)

  - SSOT: trap pattern documented in lint message (one rule, one fix)
  - Cleanup: this IS the cleanup hygiene fix
  - 100% coverage: lint catches future regressions across all
    `tests/e2e/*.sh` files, not just the 3 fixed today
  - File-split: N/A (no files split)
  - Plugin / abstract / modular: N/A (test infra, not product code)

Iteration 2 of RFC #2873.
2026-05-05 04:21:26 -07:00
Hongming Wang
42f2ea3f4f fix(ci): include event_name in runtime-prbuild-compat concurrency group
Every staging push run for the last 4 SHAs was cancelled by the
matching pull_request run because both fired into the same
concurrency group:

  group: ${{ github.workflow }}-${{ ...sha }}

Same SHA → same group → cancel-in-progress=true means the second
arrival cancels the first. Empirically the push run lost the race;
staging branch-protection then saw a CANCELLED required check and
the auto-promote chain stalled.

Fix: include github.event_name in the group key. push and
pull_request runs for the same SHA now hash to different groups,
both complete, both report SUCCESS to branch protection.

Pattern of the bug:

  10:46 sha=1e8d7ae1 ev=pull_request conclusion=success
  10:46 sha=1e8d7ae1 ev=push        conclusion=cancelled
  10:45 sha=ecf5f6fb ev=pull_request conclusion=success
  10:45 sha=ecf5f6fb ev=push        conclusion=cancelled
  10:28 sha=471dff25 ev=pull_request conclusion=success
  10:28 sha=471dff25 ev=push        conclusion=cancelled
  10:12 sha=9e678ccd ev=pull_request conclusion=success
  10:12 sha=9e678ccd ev=push        conclusion=cancelled

Same drift class as the 2026-04-28 auto-promote-staging incident
(memory: feedback_concurrency_group_per_sha.md) — globally-scoped
groups silently cancel runs in matched-SHA scenarios.

This is the only workflow in .github/workflows/ that uses the
narrow per-sha shape without event_name. Others either don't use
concurrency at all, or use ${{ github.ref }} which is event-
neutral.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 04:01:20 -07:00
Hongming Wang
90d202c80a ci(handlers-pg): apply all migrations with skip-on-error + sanity check (#320)
Previous workflow applied only 049_delegations.up.sql — fragile to
future migrations that touch the delegations table or any other
handlers/-tested table. Operator would have to remember to update
the workflow's psql -f line per migration.

New behavior: loop every .up.sql in lexicographic order, apply each
with ON_ERROR_STOP=1 + per-migration result captured. Failed migrations
are SKIPPED rather than blocking the suite — handles the historical
migrations (017_memories_fts_namespace, 042_a2a_queue, etc.) that
depend on tables since renamed/dropped and can't replay from scratch.
Migrations that DO succeed land their tables, which is sufficient for
the integration tests in handlers/.

Sanity gate at the end: if the delegations table is missing after the
replay, hard-fail with a loud error. That catches a real regression
where 049 itself becomes broken (e.g., schema rename), separate from
the historical-broken-migration noise above.

Per-migration log line ("✓" or "⊘ skipped") makes it easy to spot
when a migration that SHOULD have replayed didn't.
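
A hedged sketch of the replay loop plus the sanity gate (connection
flags and paths illustrative):

  for f in workspace-server/migrations/*.up.sql; do   # glob expands in lex order
    if psql "${DATABASE_URL}" -v ON_ERROR_STOP=1 -f "${f}" >/dev/null 2>&1; then
      echo "✓ ${f}"
    else
      echo "⊘ skipped ${f}"   # historical migration that no longer replays from scratch
    fi
  done
  # Sanity gate: 049's table must exist regardless of the skips above.
  psql "${DATABASE_URL}" -tAc \
    "SELECT 1 FROM information_schema.tables WHERE table_name='delegations'" \
    | grep -q 1 || { echo "ERROR: delegations table missing after replay"; exit 1; }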

Verified locally: full migration chain runs, 049 lands, all 7
integration tests pass against the chained-migration DB.

Closes #320.
2026-05-05 03:48:43 -07:00
Hongming Wang
4c9f12258d fix(delegations): preserve result_preview through completion + add real-Postgres integration gate
Two-part PR:

## Fix: result_preview was lost on completion

Self-review of #2854 caught a real bug. SetStatus has a same-status
replay no-op; the order of calls in `executeDelegation` completion
+ `UpdateStatus` completed branch clobbered the preview field:

  1. updateDelegationStatus(completed, "")    fires
  2. inner recordLedgerStatus(completed, "", "")
       → SetStatus transitions dispatched → completed with preview=""
  3. outer recordLedgerStatus(completed, "", responseText)
       → SetStatus reads current=completed, status=completed
       → SAME-STATUS NO-OP, never writes responseText → preview lost

Confirmed against real Postgres (see integration test). Strict-sqlmock
unit tests passed because they pin SQL shape, not row state.

Fix: call the WITH-PREVIEW recordLedgerStatus FIRST, then
updateDelegationStatus. The inner call becomes the no-op (correctly
preserves the row written by the outer call).

Same gap fixed in UpdateStatus handler — body.ResponsePreview was
never landing in the ledger because updateDelegationStatus's nested
SetStatus(completed, "", "") fired first.

## Gate: real-Postgres integration tests + CI workflow

The unit-test-only workflow that shipped #2854 was the root cause.
Adding two layers of defense:

1. workspace-server/internal/handlers/delegation_ledger_integration_test.go
   — `//go:build integration` tag, requires INTEGRATION_DB_URL env var.
   4 tests:
     * ResultPreviewPreservedThroughCompletion (regression gate for the
       bug above — fires the production call sequence in fixed order
       and asserts row.result_preview matches)
     * ResultPreviewBuggyOrderIsLost (DIAGNOSTIC: confirms the
       same-status no-op contract works as designed; if SetStatus's
       semantics ever change, this test fires)
     * FailedTransitionCapturesErrorDetail (failure-path symmetry)
     * FullLifecycle_QueuedToDispatchedToCompleted (forward-only +
       happy path)

2. .github/workflows/handlers-postgres-integration.yml
   — required check on staging branch protection. Spins postgres:15
   service container, applies the delegations migration, runs
   `go test -tags=integration` against the live DB. Always-runs +
   per-step gating on path filter (handlers/wsauth/migrations) so
   the required-check name is satisfied on PRs that don't touch
   relevant code.

Local dev workflow (file header documents this):

  docker run --rm -d --name pg -e POSTGRES_PASSWORD=test -p 55432:5432 postgres:15-alpine
  psql ... < workspace-server/migrations/049_delegations.up.sql
  INTEGRATION_DB_URL="postgres://postgres:test@localhost:55432/molecule?sslmode=disable" \
    go test -tags=integration ./internal/handlers/ -run "^TestIntegration_"

## Why this matters

Per memory `feedback_mandatory_local_e2e_before_ship`: backend PRs
MUST verify against real Postgres before claiming done. sqlmock pins
SQL shape; only a real DB can verify row state. The workflow makes
this gate mandatory rather than optional.
2026-05-05 02:47:52 -07:00
Hongming Wang
c89f17a2aa fix(branch-protection-drift): hard-fail on schedule only, soft-skip + warn on PR
#2834 added a hard-fail when GH_TOKEN_FOR_ADMIN_API is missing on
schedule + pull_request + workflow_dispatch. The PR-trigger hard-fail
is now blocking every PR in the repo because the secret hasn't been
provisioned yet — including the staging→main auto-promote PR (#2831),
which has no path to set repo secrets itself.

Per feedback_schedule_vs_dispatch_secrets_hardening.md the original
concern is automated/silent triggers losing the gate without a human
to notice. That concern applies to **schedule** specifically:

- schedule: cron, no human, silent soft-skip = invisible regression →
  KEEP HARD-FAIL.
- pull_request: a human is reviewing the PR diff and will see workflow
  warnings inline. A PR cannot retroactively drift live state — drift
  happens *between* PRs (UI clicks, manual gh api PATCH), which the
  schedule canary catches. The PR-time gate would only catch typos in
  apply.sh, which the *_payload unit tests catch more directly.
  → SOFT-SKIP with a prominent warning.
- workflow_dispatch: operator override, may not have configured the
  secret yet. → SOFT-SKIP with warning.

The skip is explicit (SKIP_DRIFT_CHECK=1 surfaced to env, then a step
`if:` guard) so it's auditable in the workflow run UI, not silently
swallowed.
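
Rough shape of the guard step (GITHUB_EVENT_NAME / GITHUB_ENV are the
standard Actions variables; the exact step wiring is a sketch):

  if [ -z "${GH_TOKEN_FOR_ADMIN_API:-}" ]; then
    if [ "$GITHUB_EVENT_NAME" = "schedule" ]; then
      echo "::error::GH_TOKEN_FOR_ADMIN_API missing on a scheduled run"
      exit 1
    fi
    echo "::warning::GH_TOKEN_FOR_ADMIN_API not set; skipping drift check"
    echo "SKIP_DRIFT_CHECK=1" >> "$GITHUB_ENV"
  fi

Later steps then gate on `if: env.SKIP_DRIFT_CHECK != '1'`.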

Unblocks #2831 (auto-promote staging→main) + every PR currently behind
this check.
2026-05-04 21:20:30 -07:00
Hongming Wang
2e505e7748 fix(branch-protection): apply.sh respects live state + full-payload drift
Multi-model review of #2827 caught: the script as-shipped would have
silently weakened branch protection on EVERY non-checks dimension
the moment anyone ran it. Live staging had

  enforce_admins=true, dismiss_stale_reviews=false, strict=true,
  allow_fork_syncing=false, bypass_pull_request_allowances={
    HongmingWang-Rabbit + molecule-ai app
  }

Script wrote the opposite for all five. Per memory
feedback_dismiss_stale_reviews_blocks_promote.md, the
dismiss_stale_reviews flip alone is the load-bearing one — would
silently re-block every auto-promote PR (cost user 2.5h once).

This PR:

1. apply.sh: per-branch payloads (build_staging_payload /
   build_main_payload) that codify the deliberate per-branch policy
   already on the repo, with the script's net contribution being
   ONLY the new check names (Canvas tabs E2E + E2E API Smoke on
   staging, Canvas tabs E2E on main).

2. apply.sh: R3 preflight that hits /commits/{sha}/check-runs and
   asserts every desired check name has at least one historical run
   on the branch tip (sketched below). Catches typos like
   "Canvas Tabs E2E" vs "Canvas tabs E2E" — pre-fix, a typo would have
   silently blocked every PR forever, waiting for a context that never
   emits. Skip via --skip-preflight for genuinely-new workflows whose
   first run hasn't fired.

3. drift_check.sh: compares the FULL normalised payload (admin,
   review, lock, conversation, fork-syncing, deletion, force-push)
   not just the checks list. Pre-fix, the drift gate would have
   missed a UI click that flipped enforce_admins or
   dismiss_stale_reviews. Drops app_id from the comparison since
   GH auto-resolves -1 to a specific app id post-write.

4. branch-protection-drift.yml: per memory
   feedback_schedule_vs_dispatch_secrets_hardening.md — schedule +
   pull_request triggers HARD-FAIL when GH_TOKEN_FOR_ADMIN_API is
   missing (silent skip masks the gate disappearing).
   workflow_dispatch keeps soft-skip for one-off operator runs.

Verified by running drift_check against live state: pre-fix would
have shown 5 destructive drifts on staging + 5 on main. Post-fix
shows ONLY the 2 intended additions on staging + 1 on main, which
go away after `apply.sh` runs.
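
Rough shape of item 2's preflight (the check-runs endpoint is the
standard GitHub REST API; variable names here are illustrative):

  tip=$(gh api "repos/$REPO/commits/$BRANCH" --jq .sha)
  for check in "${DESIRED_CHECKS[@]}"; do
    gh api "repos/$REPO/commits/$tip/check-runs" --paginate \
        --jq '.check_runs[].name' | grep -qxF "$check" \
      || { echo "::error::no historical run for required check '$check'"; exit 1; }
  done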
2026-05-04 20:52:11 -07:00
Hongming Wang
7cc1c39c49 ci: e2e coverage matrix + branch-protection-as-code
Closes #9.

Three pieces, all small:

1. **docs/e2e-coverage.md** — source of truth for which E2E suites
   guard which surfaces. Today three were running but informational
   only on staging; that's how the org-import silent-drop bug shipped
   without a test catching it pre-merge. Now the matrix shows what's
   required where + a follow-up note for the two suites that need an
   always-emit refactor before they can be required.

2. **tools/branch-protection/apply.sh** — branch protection as code.
   Lets `staging` and `main` required-checks live in a reviewable
   shell script instead of UI clicks that get lost between admins.
   This PR's net change: add `E2E API Smoke Test` and `Canvas tabs E2E`
   as required on staging. Both already use the always-emit path-filter
   pattern (no-op step emits SUCCESS when the workflow's paths weren't
   touched), so making them required can't deadlock unrelated PRs.

3. **branch-protection-drift.yml** — daily cron + drift_check.sh
   that compares live protection against apply.sh's desired state.
   Catches out-of-band UI edits before they drift further. Fails the
   workflow on mismatch; ops re-runs apply.sh or updates the script.

Out of scope (filed as follow-ups):
- e2e-staging-saas + e2e-staging-external use plain `paths:` filters
  and never trigger when paths are unchanged. They need refactoring
  to the always-emit shape (same as e2e-api / e2e-staging-canvas)
  before they can be required.
- main branch protection mirrors staging here; if main wants the
  E2E SaaS / External added later, do it in apply.sh and rerun.

Operator must apply once after merge:
  bash tools/branch-protection/apply.sh
The drift check picks it up from there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 20:21:59 -07:00
Hongming Wang
8df8487bbe fix(auto-promote): treat E2E completed/cancelled as defer, not failure
Bug: the case statement at line 189 grouped completed/failure |
completed/cancelled | completed/timed_out into the same "abort
+ exit 1" branch. cancelled ≠ failure — when per-SHA concurrency
(memory: feedback_concurrency_group_per_sha) cancels an older E2E
run because a newer push landed, the workflow blocked the whole
auto-promote chain on a non-failure.

Caught 2026-05-05 02:03 on sha 31f9a5e: E2E got cancelled by
concurrency, auto-promote :latest aborted with exit 1, the next
auto-promote-staging cycle had to manually clean up.

Split: failure/timed_out keep the abort path. cancelled gets its
own clean-defer branch (same shape as in_progress) — proceed=false
without exit 1, with a step-summary explaining likely concurrency
supersession and pointing operators at manual dispatch if they
need that specific SHA promoted.
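
Sketch of the split (conclusion strings per the Checks API; variable
names and surrounding step plumbing are abbreviated):

  case "$STATUS/$CONCLUSION" in
    completed/success) proceed=true ;;
    completed/failure|completed/timed_out)
      echo "::error::E2E run for $SHA concluded $CONCLUSION"; exit 1 ;;
    completed/cancelled)
      echo "E2E run was cancelled (likely superseded by per-SHA concurrency);" \
           "deferring this promote cycle." >> "$GITHUB_STEP_SUMMARY"
      proceed=false ;;
    *) proceed=false ;;   # queued / in_progress → defer, same as before
  esac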

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 19:26:29 -07:00
Hongming Wang
c5dd14d8db fix(workflows): preserve curl stderr in 8 status-capture sites
Self-review of PR #2810 caught a regression: my mass-fix added
`2>/dev/null` to every curl invocation, suppressing stderr. The
original `|| echo "000"` shape only swallowed exit codes — stderr
(curl's `-sS`-shown dial errors, timeouts, DNS failures) still went
to the runner log so operators could see WHY a connection failed.

After PR #2810 the next deploy failure would log only the bare
HTTP code with no context. That's exactly the kind of diagnostic
loss that makes outages take longer to triage.

Drop `2>/dev/null` from each curl line — keep it on the `cat`
fallback (which legitimately suppresses "no such file" when curl
crashed before -w ran). The `>tempfile` redirect alone captures
curl's stdout (where -w writes) without touching stderr.
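
Post-fix shape per site (URL and tempfile names are placeholders):

  set +e
  curl -sS -o "$body_file" -w '%{http_code}' "$URL" > "$code_file"   # stderr stays on the runner log
  set -e
  HTTP_CODE=$(cat "$code_file" 2>/dev/null || echo "000")            # only the cat fallback is silenced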

Same 8 files as #2810: redeploy-tenants-on-{main,staging},
sweep-stale-e2e-orgs, e2e-staging-{sanity,saas,external,canvas},
canary-staging.

Tests:
- All 8 files pass the lint
- YAML valid

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:54:50 -07:00
Hongming Wang
463316772b fix(workflows): rewrite curl status-capture to prevent exit-code pollution
The 2026-05-04 redeploy-tenants-on-main run for sha 2b862f6 emitted
"HTTP 000000" and failed the deploy. Root cause: when curl exits non-
zero (connection reset → 56, --fail-with-body 4xx/5xx → 22), the
`-w '%{http_code}'` already wrote a status to stdout; the inline
`|| echo "000"` then fires AND appends another "000" to the captured
substitution stdout. Result: HTTP_CODE="<actual><000>" — fails string
comparisons against "200" while looking superficially right.

Same class of bug the synth-E2E §7c gate hit twice (PRs #2779/#2783
+ #2797). Memory feedback_curl_status_capture_pollution.md.

Mass fix in 8 workflows: route -w into a tempfile so curl's exit
code can't pollute stdout. Wrap with set +e/-e so the non-zero
curl exit doesn't trip the outer pipeline.

  redeploy-tenants-on-main.yml      (production-critical, caught the bug)
  redeploy-tenants-on-staging.yml   (sibling)
  sweep-stale-e2e-orgs.yml          (cleanup loop)
  e2e-staging-sanity.yml             (E2E safety-net teardown)
  e2e-staging-saas.yml
  e2e-staging-external.yml
  e2e-staging-canvas.yml
  canary-staging.yml

Plus a new lint workflow `lint-curl-status-capture.yml` that runs on
every PR/push touching `.github/workflows/**`. Multi-line aware:
collapses bash `\` continuations, then matches the buggy
$(curl ... -w '%{http_code}' ... || echo "000") subshell shape.
Distinguishes from the SAFE $(cat tempfile || echo "000") shape
(cat with missing file emits empty stdout, no pollution).

Verified:
- All 8 workflows pass the lint locally
- A known-bad injection is caught
- A known-safe cat-fallback passes through
- yaml.safe_load clean on all changed files

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:29:38 -07:00
Hongming Wang
26fa220bef ci(coverage): per-file 75% floor for MCP/inbox/auth Python critical paths
Closes part of #2790 (Phase A). The Python total floor at 86% (set in
workspace/pytest.ini, issue #1817) averages over ~6000 lines, so a
single MCP-critical file could regress to ~50% with no CI complaint as
long as other modules compensate. This is the same distribution gap
that #1823 closed Go-side: total floor passes while a critical handler
sits at 0%.

Added gates for these five files (per-file floor 75%):
- workspace/a2a_mcp_server.py — MCP dispatcher (PR #2766 / #2771)
- workspace/mcp_cli.py — molecule-mcp standalone CLI entry
- workspace/a2a_tools.py — workspace-scoped tool implementations
- workspace/inbox.py — multi-workspace inbox + per-workspace cursors
- workspace/platform_auth.py — per-workspace token resolver

These handle multi-tenant routing, auth tokens, and inbox dispatch.
Risk shape mirrors Go-side tokens*/secrets* — a 0%/50% file here is
exactly where the PR #2766 dispatcher bug class slips through without
a structural test.

Floor 75% is strictly additive — current actuals 80-96% (measured
2026-05-04). No existing PR fails. Ratchet plan in COVERAGE_FLOOR.md
targets 90% by 2026-08-04.

Implementation: pytest already writes .coverage; new step emits a JSON
view scoped to the critical files via `coverage json --include="*name"`,
then jq extracts each file's percent_covered. Exact key match by
basename so workspace/builtin_tools/a2a_tools.py (a different 100%
file) doesn't shadow workspace/a2a_tools.py.
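
Rough shape of the gate (coverage.py's JSON report exposes
.files.<path>.summary.percent_covered; the path keys below assume the
report is written from the repo root, so treat them as illustrative):

  critical="workspace/a2a_mcp_server.py,workspace/mcp_cli.py,workspace/a2a_tools.py"
  critical="$critical,workspace/inbox.py,workspace/platform_auth.py"
  coverage json -o critical.json --include="$critical"
  fail=0
  for f in a2a_mcp_server.py mcp_cli.py a2a_tools.py inbox.py platform_auth.py; do
    pct=$(jq -r --arg f "workspace/$f" \
          '.files[$f].summary.percent_covered // empty' critical.json)
    below=$(awk -v p="${pct:-0}" 'BEGIN { if (p + 0 < 75) print 1; else print 0 }')
    if [ "$below" = "1" ]; then
      echo "::error::workspace/$f is at ${pct:-0}%, below the 75% per-file floor"
      fail=1
    fi
  done
  exit "$fail"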

Verified locally with the actual coverage data:
- floor=75 → 0 failures (matches current state)
- floor=81 → 1 failure (a2a_tools.py at 80%) — proves the gate trips

Pairs with PR #2791 (Phase B — schema↔dispatcher AST drift gate). Phase
C (molecule-mcp e2e harness) remains the largest piece in #2790.

YAML validated locally before commit per
feedback_validate_yaml_before_commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 16:35:21 -07:00
Hongming Wang
ff1003e5f6 ci(canary): bump timeout-minutes 12 → 20 to absorb apt tail latency
Today's 4 cancelled canaries (25319625186 / 25320942822 / 25321618230 /
25322499952) were all blown by the workflow timeout despite the
underlying tenant boot completing successfully (PR molecule-controlplane#455
fix verified — boot events all reach `boot_script_finished/ok`).

Why the budget was wrong:

The tenant user-data install phase runs apt-get update + install of
docker.io / jq / awscli / caddy / amazon-ssm-agent FROM RAW UBUNTU on
every tenant boot — none of it is pre-baked into the tenant AMI
(EC2_AMI=ami-0ea3c35c5c3284d82, raw Jammy 22.04). Empirical
fetch_secrets/ok timing across today's canaries:

  51s   debug-mm-1777888039 (09:47Z)
  82s   25319625186          (12:42Z)
  143s  25320942822          (13:11Z)
  625s  25322499952          (13:43Z)

Same EC2_AMI, same instance type (t3.small), same user-data install
sequence — variance is entirely apt-mirror tail latency. A 12-min job
budget leaves only ~2 min for the workspace on slow-apt days; the
workspace itself needs ~3.5 min for claude-code cold boot, so the
budget is structurally too tight whenever apt is slow.

20 min absorbs even the 10+ min boot worst-case and still leaves the
workspace its full ~7 min budget. Cap stays well under the runner's
6-hour ubuntu-latest job ceiling.

Real fix: pre-bake caddy + ssm-agent into the tenant AMI so the boot
phase is no-ops on cached pkgs (will file controlplane#TBD as
follow-up — packer/install-base.sh today only bakes the WORKSPACE thin
AMI, not the tenant AMI; tenants always boot from raw Ubuntu).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 07:02:12 -07:00
Hongming Wang
032c011b37 ci: bump continuous-synth-e2e cadence 3→6 fires/hour, all clean slots
Change cron from '10,30,50' (3 fires/hour) to '2,12,22,32,42,52'
(6 fires/hour). All new slots are 1-3 min away from any other
cron, avoiding both the cf-sweep collisions (:15, :45) and the
:30 heavy slot (canary-staging /30, sweep-aws-secrets,
sweep-stale-e2e-orgs every :15).

Why: empirically 2026-05-04 the canary fired only once per hour
on the 10,30,50 schedule (see #2726). Bumping fires-per-hour
gives more chances to land a survived fire under GH's load-
related drop ratio, and keeping all slots in clean lanes
minimizes the per-fire drop probability.

At empirically-observed ~67% drop ratio, 6 attempts/hour yields
~2 effective fires = ~30 min cadence; closer to the 20-min
target than the current shape and provides a real degradation
alarm if drops get worse.

Cost: ~$0.50/day → ~$1/day. Negligible.

Closes #2726.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 05:10:48 -07:00
Hongming Wang
98f883cb99 e2e: add direct-Anthropic LLM-key path alongside MiniMax + OpenAI
Adds a third secrets-injection branch in test_staging_full_saas.sh
behind a new E2E_ANTHROPIC_API_KEY env var, wired into all three
auto-running E2E workflows (canary-staging, e2e-staging-saas,
continuous-synth-e2e) via a new MOLECULE_STAGING_ANTHROPIC_API_KEY
repo secret slot.

Operator motivation: after #2578 (the staging OpenAI key went over
quota and stayed dead 36+ hours) we shipped #2710 to migrate the
canary + full-lifecycle E2E to claude-code+MiniMax. Discovered post-
merge that MOLECULE_STAGING_MINIMAX_API_KEY had never been set after
the synth-E2E migration on 2026-05-03 either — synth has been red the
whole time, not just OpenAI quota.

Setting up a MiniMax billing account from scratch is non-trivial
(needs platform-specific signup, KYC, top-up). Operators who already
have an Anthropic API key for their own Claude Code session can now
just set MOLECULE_STAGING_ANTHROPIC_API_KEY and have all three
auto-running E2E gates green within one cron firing.

Priority chain in test_staging_full_saas.sh (first non-empty wins):
  1. E2E_MINIMAX_API_KEY      → MiniMax (cheapest)
  2. E2E_ANTHROPIC_API_KEY    → direct Anthropic (cheaper than gpt-4o,
                                lower setup friction than MiniMax)
  3. E2E_OPENAI_API_KEY       → langgraph/hermes paths
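
A minimal sketch of that branching (env var names from this commit; the
real script assembles a fuller SECRETS_JSON payload, and the Anthropic
key name inside the JSON is shown here as an assumption):

  if [ -n "${E2E_MINIMAX_API_KEY:-}" ]; then
    SECRETS_JSON=$(jq -n --arg k "$E2E_MINIMAX_API_KEY" '{MINIMAX_API_KEY: $k}')
  elif [ -n "${E2E_ANTHROPIC_API_KEY:-}" ]; then
    SECRETS_JSON=$(jq -n --arg k "$E2E_ANTHROPIC_API_KEY" '{ANTHROPIC_API_KEY: $k}')
  elif [ -n "${E2E_OPENAI_API_KEY:-}" ]; then
    SECRETS_JSON=$(jq -n --arg k "$E2E_OPENAI_API_KEY" '{OPENAI_API_KEY: $k}')
  else
    echo "::error::no LLM key set for runtime ${E2E_RUNTIME:-claude-code}"; exit 1
  fi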

Verify-key case-statement in all three workflows accepts EITHER
MiniMax OR Anthropic for runtime=claude-code; error message names
both options so operators know they don't have to register a MiniMax
account if they already have an Anthropic key.

Pinned to runtime=claude-code — hermes/langgraph use OpenAI-shaped
envs and won't honour ANTHROPIC_API_KEY without further wiring.

After this lands + secret is set, the dispatched canary verifies the
new path:
  gh workflow run canary-staging.yml --repo Molecule-AI/molecule-core --ref staging
2026-05-04 00:51:14 -07:00
Hongming Wang
eaee113416 e2e-staging-saas: same migration off OpenAI default to claude-code+MiniMax
Bundles the same hermes+OpenAI → claude-code+MiniMax migration onto
the full-lifecycle E2E that's been red on every provisioning-critical
push since 2026-05-01. Same root cause as the canary fix in the prior
commit: MOLECULE_STAGING_OPENAI_KEY hit insufficient_quota and there's
no SLA on operator billing top-up.

Same shape as canary commit: claude-code as default runtime + MiniMax
as primary key + hermes/langgraph kept as workflow_dispatch options
with OpenAI fallback. Per-runtime verify-key case-statement matches
canary-staging.yml + continuous-synth-e2e.yml byte-for-byte.

Two extra wrinkles vs canary:
- Dispatch input `runtime` default flipped from "hermes" to "claude-code"
  so operators dispatching from the UI get the safe path by default.
  They can still pick hermes/langgraph from the dropdown when they
  specifically want to exercise OpenAI.
- E2E_MODEL_SLUG is dispatch-aware: MiniMax-M2.7-highspeed for
  claude-code, openai/gpt-4o for hermes (slash-form per
  derive-provider.sh), openai:gpt-4o for langgraph (colon-form per
  init_chat_model). The branch comment in lib/model_slug.sh covers
  the rationale; pinning the slug here keeps the dispatch UX stable
  even when operators don't override.

After this lands + the canary commit lands, the only OpenAI-dependent
E2E surface is the operator-dispatch fallback. The cron canary, the
synth E2E, AND the full-lifecycle gate are all on MiniMax — separate
billing account, no OpenAI quota dependency on auto-runs.
2026-05-04 00:20:36 -07:00
Hongming Wang
6f8f978975 canary-staging: migrate from hermes+OpenAI to claude-code+MiniMax
Mirror the migration continuous-synth-e2e.yml made on 2026-05-03 (#265).
Both workflows hit the same MOLECULE_STAGING_OPENAI_KEY which went over
quota on 2026-05-01 (#2578) and stayed dead — the canary has been red
for 36+ hours waiting on operator billing top-up.

This switch breaks the canary's dependency on OpenAI billing entirely:
claude-code template's `minimax` provider routes ANTHROPIC_BASE_URL to
api.minimax.io/anthropic and reads MINIMAX_API_KEY at boot. MiniMax is
~5-10x cheaper per token than gpt-4.1-mini AND on a separate billing
account, so a future OpenAI quota collapse no longer wedges the
canary's "is staging alive?" signal.

Changes:
- E2E_RUNTIME: hermes → claude-code
- Add E2E_MODEL_SLUG: MiniMax-M2.7-highspeed (pin to MiniMax — the
  per-runtime claude-code default is "sonnet" which routes to direct
  Anthropic and would defeat the cost saving)
- Add E2E_MINIMAX_API_KEY env wired to MOLECULE_STAGING_MINIMAX_API_KEY
- Keep E2E_OPENAI_API_KEY as fallback for operator-dispatched runs that
  set E2E_RUNTIME=hermes via workflow_dispatch
- "Verify OpenAI key present" → per-runtime "Verify LLM key present"
  case statement matching synth E2E's exact shape (claude-code requires
  MiniMax, langgraph/hermes require OpenAI). Hard-fail on missing
  required key per #2578's lesson — soft-skip silently fell through to
  the wrong SECRETS_JSON branch and produced a confusing auth error
  5 min later instead of the clean "secret missing" message at the top.

Verifies #2578 root cause won't recur on the canary path. The synth
E2E and the manual e2e-staging-saas dispatch can still hit OpenAI when
explicitly chosen — only the cron canary moves off it.
2026-05-04 00:18:03 -07:00
Hongming Wang
9689c6f6d5 fix(synth-e2e): verify-secrets step must hard-fail (exit 0 only ends step)
The previous soft-skip-on-dispatch path used `exit 0`, which only
ends the STEP — the rest of the workflow continued with empty
secrets. Caught 2026-05-04 by dispatched run 25296530706:
  - E2E_MINIMAX_API_KEY: empty
  - verify-secrets printed warning + exit 0
  - Install required tools: ran
  - Run synthetic E2E: ran with empty MiniMax key
  - SECRETS_JSON branched to OpenAI shape (MINIMAX empty → fall through)
  - But model slug stayed MiniMax-M2.7-highspeed (workflow env)
  - Workspace booted with OpenAI keys + MiniMax model
  - 5 min later: "Agent error (Exception)" — claude SDK 401'd
    against api.minimax.io with the OpenAI key

The confusing failure mode silently masked the real problem (missing
secret) under a runtime-error label. Fix: drop both soft-skip paths
and exit 1 always. Operators who want to verify a YAML change without
setting up secrets can read the verify-secrets step's stderr — the
failure IS the verification signal.

Pure visibility fix; preserves the cron hard-fail path (now also the
dispatch hard-fail path). No mechanism change beyond the exit code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:32:26 -07:00
Hongming Wang
a306a97dd3 ci(synth-e2e): move cron off :00 to dodge GH scheduler drops
GitHub Actions scheduler de-prioritises :00 cron firings under load.
Empirical 2026-05-03: the canary's cron was '0,20,40 * * * *' but
actual firings landed at :08, :03, :01, :03 — :20 and :40 silently
dropped. Detection latency degraded from claimed 20 min to actual
~60 min worst case.

Move to '10,30,50 * * * *':
- :10/:30/:50 sit 10 min off the top-of-hour load peak
- Still 5 min from :15 sweep-cf-orphans and :45 sweep-cf-tunnels
  (the original constraint that kept us off :15/:45)
- Same 20-min cadence; only the phase changes

No code change beyond the cron expression + comment refresh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 17:28:45 -07:00
Hongming Wang
8b9e7e6d59 ci: port DELETE-verify pattern to remaining staging e2e workflows
Follow-up to #2648 — same `>/dev/null || true` swallow-on-error
pattern existed in:

  e2e-staging-canvas.yml   (single-slug)
  e2e-staging-saas.yml     (loop)
  e2e-staging-sanity.yml   (loop)
  e2e-staging-external.yml (loop, was `>/dev/null 2>&1` variant)

All four now capture the HTTP code, log a "[teardown] deleted $slug
(HTTP $code)" line on success, and emit a workflow warning naming
the slug + body excerpt on non-2xx. Loop bodies also tally + summarise
total leaks at the end.
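
Per-slug shape, roughly (the CP endpoint and body handling are
abbreviated placeholders):

  code=$(curl -sS -o "$body_file" -w '%{http_code}' -X DELETE "$CP_URL/orgs/$slug")
  if [ "${code:0:1}" = "2" ]; then
    echo "[teardown] deleted $slug (HTTP $code)"
  else
    leaks=$((leaks + 1))
    echo "::warning::teardown DELETE for $slug returned HTTP $code: $(head -c 200 "$body_file")"
  fi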

Exit semantics unchanged: a single cleanup miss still doesn't fail-flag
the test (sweep-stale-e2e-orgs is the safety net within ~45 min). The
behavior change is purely surfacing — failures that were silent are
now visible on the workflow run page.

Pairs with #2648's tightened sweeper. Together: per-run cleanup
failures are visible AND the safety net catches them quickly.

Closes the per-workflow port noted as out-of-scope in #2648.
See molecule-controlplane#420.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 16:24:43 -07:00
Hongming Wang
3cd8c53de0 ci: tighten e2e cleanup race window 120m -> ~45m worst case
Two changes that close one of the leak classes from the
molecule-controlplane#420 vCPU audit:

1. sweep-stale-e2e-orgs.yml: cron */15 (was hourly), MAX_AGE_MINUTES
   30 (was 120). E2E runs are 8-25 min wall clock; 30 min is safely
   above the longest run while shrinking the worst-case leak window
   from ~2h to ~45 min (15-min sweep cadence + 30-min threshold).

2. canary-staging.yml teardown: the per-slug DELETE used `>/dev/null
   || true`, which swallowed every failure. A 5xx or timeout from CP
   looked identical to "successfully deleted" and the canary tenant
   kept eating ~2 vCPU until the sweeper caught it. Now we capture
   the response code and surface non-2xx as a workflow warning that
   names the leaked slug.

The exit semantics stay unchanged — a single-canary cleanup miss
shouldn't fail-flag the canary itself when the actual smoke check
passed. The sweeper is the safety net for whatever slips past.

Caught during the molecule-controlplane#420 audit on 2026-05-03 —
3 e2e canary tenant orphans were running for 24-95 min, all under
the previous 120-min sweep threshold so they went unnoticed until
manual cleanup. Same `|| true` pattern exists in
e2e-staging-{canvas,external,saas,sanity}.yml; out of scope for
this PR (mechanical port; tracking separately) but the sweeper
tightening covers all of them by reducing the safety-net latency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 16:08:40 -07:00
Hongming Wang
79a0203798 feat(synth-e2e): switch canary to claude-code + MiniMax-M2.7-highspeed
Cuts the per-run LLM cost ~10x (MiniMax M2.7 vs gpt-4.1-mini) and
removes the recurring OpenAI-quota-exhaustion failure mode that took
the canary down on 2026-05-03 (#265 — staging quota burnt for ~16h).

Path:
  E2E_RUNTIME=claude-code (default)
  → workspace-configs-templates/claude-code-default/config.yaml's
    `minimax` provider (lines 64-69)
  → ANTHROPIC_BASE_URL auto-set to api.minimax.io/anthropic
  → reads MINIMAX_API_KEY (per-vendor env, no collision with
    GLM/Z.ai etc.)

Workflow changes (continuous-synth-e2e.yml):
- Default runtime: langgraph → claude-code
- New env: E2E_MODEL_SLUG (defaults to MiniMax-M2.7-highspeed,
  overridable via workflow_dispatch)
- New secret wire: E2E_MINIMAX_API_KEY ←
  secrets.MOLECULE_STAGING_MINIMAX_API_KEY
- Per-runtime missing-secret guard: claude-code requires MINIMAX,
  langgraph/hermes require OPENAI. Cron firing hard-fails on missing
  key for the active runtime; dispatch soft-skips so operators can
  ad-hoc test without setting up the secret first
- Operators can still pick langgraph/hermes via workflow_dispatch;
  the OpenAI fallback path stays wired

Script changes (tests/e2e/test_staging_full_saas.sh):
- SECRETS_JSON branches on which key is set:
    E2E_MINIMAX_API_KEY → {MINIMAX_API_KEY: <key>}  (claude-code path)
    E2E_OPENAI_API_KEY  → {OPENAI_API_KEY, HERMES_*, MODEL_PROVIDER}  (legacy)
  MiniMax wins when both are present — claude-code default canary
  must not accidentally consume the OpenAI key

Tests (new tests/e2e/test_secrets_dispatch.sh):
- 10 cases pinning the precedence + payload shape per branch
- Discipline check verified: 5 of 10 FAIL on a swapped if/elif
  (precedence inversion), all 10 PASS on the fix
- Anchors on the section-comment header so a structural refactor
  fails loudly rather than silently sourcing nothing

The model_slug dispatcher (lib/model_slug.sh) needs no change:
E2E_MODEL_SLUG override path is already wired (line 41), and
claude-code template's `minimax-` prefix matcher catches
"MiniMax-M2.7-highspeed" via lowercase-on-lookup.

Operator action required to land green:
- Set MOLECULE_STAGING_MINIMAX_API_KEY in repo secrets
  (Settings → Secrets and Variables → Actions). Use
  `gh secret set MOLECULE_STAGING_MINIMAX_API_KEY -R Molecule-AI/molecule-core`
  to avoid leaking the value into shell history.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 15:35:14 -07:00
Hongming Wang
ac6f65ab5e test(e2e): pin pick_model_slug behavior with bash unit tests
PR #2571 fixed synth-E2E by branching MODEL_SLUG per runtime, but only
the langgraph branch was verified at runtime — hermes / claude-code /
override / fallback had zero automated coverage. A future regression
(e.g. dropping the langgraph case) would silently revert and only
surface as "Could not resolve authentication method" mid-E2E.

This PR:
- Extracts the dispatch into tests/e2e/lib/model_slug.sh as a sourceable
  pick_model_slug() function. No behavior change.
- Adds tests/e2e/test_model_slug.sh — 9 assertions across all 5 dispatch
  branches plus the override path. Verified to FAIL when any branch is
  flipped (manually regressed langgraph slash-form to confirm the test
  catches it; restored before commit).
- Wires the unit test into ci.yml's existing shellcheck job (only runs
  when tests/e2e/ or scripts/ change). Pure-bash, no live infra.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 12:04:12 -07:00
Hongming Wang
80b38900de fix(auto-promote): skip empty-tree promotes to break perpetual cycle
The auto-promote ↔ auto-sync chain has been generating empty PRs
indefinitely since the staging merge_queue ruleset uses MERGE
strategy:

1. Auto-promote merges PR via queue → main = merge commit M2 not in staging
2. Auto-sync opens sync-back PR. Workflow's local `git merge --ff-only`
   succeeds (PR title even says "ff to ..."), but the queue lands the
   PR via MERGE → staging = merge commit S2 not in main
3. Auto-promote sees staging ahead by 1 → opens new promote PR. Tree
   diff vs main = 0 (S2's tree == main's tree). But the gate logic
   only checks "all required workflows green", not "actual code to
   ship" → opens an empty promote PR
4. ... repeat indefinitely

Each round costs ~30-40 min wallclock, ~2 manual approvals (the queue
requires 1 review and the bot can't self-approve without admin
bypass), and one full CodeQL Go run (~15 min).

Observed today (2026-05-03) across PRs #2592, #2594, #2595, #2596, and
#2597 — 5 PRs, ~3 hours, all empty content.

Fix: before opening the promote PR, check that staging's tree
actually differs from main's tree. If they're identical (the
empty-merge-commit cycle), skip cleanly and let the cycle terminate.

Implementation:
- New step `Skip if staging tree == main tree` runs before the
  existing gate check.
- `git diff --quiet origin/main $HEAD_SHA` exits 0 iff trees match.
- On match: emits a step summary explaining the skip + sets
  `skip=true`; subsequent gate-check + promote steps are gated on
  `skip != 'true'` so they short-circuit.
- Fail-open: if `git fetch` errors, fall through to gate check
  (preserve existing behavior). Only skip when diff is DEFINITIVELY
  empty.
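
Roughly (step-output plumbing abbreviated):

  git fetch origin main || exit 0        # fail-open: fall through to the gate check
  if git diff --quiet origin/main "$HEAD_SHA"; then
    echo "Staging tree is identical to main's; skipping empty promote." >> "$GITHUB_STEP_SUMMARY"
    echo "skip=true"  >> "$GITHUB_OUTPUT"
  else
    echo "skip=false" >> "$GITHUB_OUTPUT"
  fi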

Long-term, the cleaner fix is to switch the merge_queue ruleset's
merge_method away from MERGE so FF-able PRs land cleanly without a
new commit — but that's a broader change affecting every staging
PR's commit shape. This guard is the surgical one-step break.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 08:56:44 -07:00
Hongming Wang
2b3f44c3c8 fix(retarget): skip PRs whose head is staging (auto-promote PRs)
The retarget-main-to-staging workflow tries to PATCH base=staging on
every bot-authored PR opened against main. Auto-promote staging→main
PRs have head=staging, base=main — retargeting them sets head AND
base to staging, which GitHub rejects with HTTP 422 "no new commits
between base 'staging' and head 'staging'".

This started surfacing on PR #2588 (2026-05-03 14:30) once #2586
switched the auto-promote workflow to an App token. Before #2586
the auto-promote PR was authored by github-actions[bot], which the
retarget filter happened to skip; now it's molecule-ai[bot], which
passes the bot filter and triggers the broken retarget attempt.

Add a head-ref != 'staging' guard so auto-promote PRs short-circuit
before the PATCH. The existing 422 "duplicate base" detector is
left alone — it covers a different operational case.
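
Sketch of the guard (the head ref comes from the standard pull_request
event payload; the real workflow's wiring may differ):

  head_ref=$(jq -r '.pull_request.head.ref' "$GITHUB_EVENT_PATH")
  if [ "$head_ref" = "staging" ]; then
    echo "Auto-promote PR (head=staging); skipping retarget."
    exit 0
  fi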
2026-05-03 07:34:24 -07:00
Hongming Wang
bc11ed8a2b fix(auto-promote): use App token for auto-merge to fire downstream cascade (#2357)
GITHUB_TOKEN-initiated merges suppress the downstream `push` event on
main per GitHub's documented limitation:
  https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow

Result before this fix: every staging→main promote landed silently —
publish-workspace-server-image, canary-verify, and redeploy-tenants-on-main
all stayed dark. The polling tail was the SOLE cascade trigger; if it
ever hit its 30-min timeout, the chain deadlocked invisibly.

Symptom (from the issue body, 2026-04-30):

| Time     | Event                                            | Triggered? |
|----------|--------------------------------------------------|-----------|
| 05:48:04 | Promote PR #2352 merged (c140ad28)               | Not fired |
| 06:07:29 | Promote PR #2356 merged (5973c9bd)               | Not fired |

Fix: mint the molecule-ai App token BEFORE the promote-PR step and
hand it to the auto-merge call. App-token-initiated merges DO trigger
downstream workflow_run cascades.

The polling tail stays as defense-in-depth (with comments updated):
once we've observed >=10 successful natural cascades it can be dropped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 07:13:26 -07:00
Hongming Wang
8f48a38550 fix(publish-runtime): re-add 5 templates wrongly removed from cascade (#2566)
The PR #2536 cascade prune ('deprecated, no shipping images') was
empirically wrong. Re-confirmed 2026-05-03:

- continuous-synth-e2e.yml defaults to langgraph as its primary canary
- All 5 'deprecated' templates have successful publish-image runs in
  the past 24h: langgraph, crewai, autogen, deepagents, gemini-cli

Symptom this fixes — issue #2566 (priority-high, failing 36+h):

  Synthetic E2E (staging): langgraph adapter A2A failure
  'Received Message object in task mode' — failing for >36h

Today at 11:06 commit e1628c4 fixed the underlying a2a-sdk strict-mode
issue in workspace/a2a_executor.py. publish-runtime fired at 11:13 and
cascaded — but only to claude-code, hermes, openclaw, codex. langgraph
was excluded by the prune, so its image stayed on the broken runtime
and the synth E2E (which defaults to langgraph) kept failing despite
the fix being live in PyPI.

After this lands + the next runtime publish fires, langgraph image
re-bakes with the fix and synth-E2E goes green.

Test plan:

- [x] yaml-validate the workflow
- [ ] After merge, watch publish-runtime cascade to all 9 templates
- [ ] Confirm langgraph publish-image fires + succeeds
- [ ] Confirm next continuous-synth-e2e run goes green

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 05:41:53 -07:00
Hongming Wang
60a516bc8d ci(redeploy): fix stale canary_slug default 'hongmingwang' → 'hongming'
The workflow_dispatch input default and the workflow_run env fallback
both pointed at 'hongmingwang', which doesn't match any current prod
tenant (slugs are: hongming, chloe-dong, reno-stars). CP silently
skipped the missing canary and put every tenant in batch-1 in parallel,
defeating the canary-first soak gate that exists to catch image-boot
regressions before they hit the whole fleet.

Concrete example from today's c0838d6 redeploy at 11:53Z (run 25278434388):
the dispatched body was `{"target_tag":"staging-c0838d6","canary_slug":"hongmingwang",...}`
and the CP response showed all 3 tenants in `"phase":"batch-1"` — no
soak, no canary. The deploy happened to be safe, but a broken image
would have hit hongming + chloe-dong + reno-stars simultaneously.

Fixed in three places: the runtime ordering comment, the
workflow_dispatch default, and the env fallback used by the
workflow_run trigger. Comment documents the rationale so the next
slug rename doesn't silently regress this again.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 05:06:01 -07:00
Hongming Wang
5e46ea70d6 ci(synth-e2e): wire MOLECULE_STAGING_OPENAI_KEY into provisioned tenant
The synth-E2E (#2342) provisions a langgraph tenant whose default
model `openai:gpt-4.1-mini` requires OPENAI_API_KEY for the first LLM
call. Sibling workflows already wire this:
- e2e-staging-saas.yml:89
- canary-staging.yml:63

continuous-synth-e2e.yml just forgot. Result: tenant boots, accepts
a2a messages, then returns:

  Agent error: "Could not resolve authentication method. Expected
  either api_key or auth_token to be set."

This was masked since 2026-04-29 (workflow creation) by a2a-sdk v0→v1
contract violations — PR #2558 (Task-enqueue) and #2563
(TaskUpdater.complete/failed terminal events) cleared those, exposing
the underlying auth gap on the synth-E2E firing at 11:39 UTC today.

The script tests/e2e/test_staging_full_saas.sh:325 already reads
E2E_OPENAI_API_KEY and persists it as a workspace_secret on tenant
create — only the workflow wiring was missing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 04:43:07 -07:00
Hongming Wang
596e797dca ci(deploy): broaden ephemeral-prefix matchers to cover rt-e2e-*
The redeploy-tenants-on-staging soft-warn filter and the
sweep-stale-e2e-orgs janitor both hardcoded `^e2e-` to identify
ephemeral test tenants. Runtime-test harness fixtures (RFC #2251)
mint slugs prefixed with `rt-e2e-`, which neither matcher recognized.

Concrete impact observed today:
  - Two `rt-e2e-v{5,6}-*` tenants left orphaned 8h on staging
    (sweep-stale-e2e-orgs ignored them).
  - On the next staging redeploy their phantom EC2s returned
    `InvalidInstanceId: Instances not in a valid state for account`
    from SSM SendCommand → CP returned HTTP 500 + ok=false.
  - The redeploy soft-warn missed them too, so the workflow went
    red, which broke the auto-promote-staging chain feeding the
    canvas warm-paper rollout to prod.

Fix: switch both matchers to recognize the alternation
`^(e2e-|rt-e2e-)`. Long-lived prefixes (demo-prep, dryrun-*, dryrun2-*)
remain non-ephemeral and continue to hard-fail. Comment documents
the source-of-truth list and the cross-file invariant.
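
Shared matcher shape (variable names differ per workflow; the
redeploy-side soft-warn use is the one shown):

  is_ephemeral_slug() {
    printf '%s' "$1" | grep -Eq '^(e2e-|rt-e2e-)'
  }
  if is_ephemeral_slug "$slug"; then
    echo "::warning::ephemeral tenant $slug failed; not treating as fatal"
  else
    echo "::error::tenant $slug failed"; exit 1
  fi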

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 04:28:29 -07:00
Hongming Wang
09010212a0 feat(ci): structural drift gate for cascade list vs manifest (RFC #388 PR-3)
Closes the recurrence path of PR #2556. The data fix realigned 8→4
templates in publish-runtime.yml's TEMPLATES variable, but the
underlying drift hazard was unguarded — the next manifest change
could silently leave cascade out of sync again.

This gate fails any PR that changes manifest.json or
publish-runtime.yml in a way that makes the cascade list diverge
from manifest workspace_templates (suffix-stripped). Either
direction is caught:

  missing-from-cascade  templates that won't auto-rebuild on a new
                       wheel publish (the codex-stuck-on-stale-runtime
                       bug class — PR #2512 added codex to manifest,
                       cascade wasn't updated, codex stayed pinned to
                       its last-built runtime version for weeks).

  extra-in-cascade     cascade dispatches to deprecated templates
                       (the wasted-API-calls + dead-CI-noise class —
                       PR #2536 pruned 5 templates from manifest;
                       cascade kept dispatching to all 8 until
                       PR #2556).

Triggers narrowly: only on PRs that touch manifest.json,
publish-runtime.yml, or the script itself. Fast (single grep+sed+comm
pipeline, no Go build).
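
The comparison itself is a two-way set diff; roughly (how the two lists
are extracted from manifest.json and the workflow is elided here):

  missing=$(comm -23 <(sort manifest_templates.txt) <(sort cascade_templates.txt))
  extra=$(comm -13 <(sort manifest_templates.txt) <(sort cascade_templates.txt))
  if [ -n "$missing" ]; then printf 'MISSING: %s\n' $missing; fi   # unquoted: one per template
  if [ -n "$extra" ];   then printf 'EXTRA: %s\n'   $extra;   fi
  [ -z "$missing$extra" ] || { echo "cascade list and manifest have drifted"; exit 1; }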

Surfaced during the RFC #388 prior-art audit; folded in as the
structural follow-up to the data fix #2556 promised.

Self-tested both failure modes locally before commit:
  - Drop codex from cascade → script fails with "MISSING: codex"
  - Add langgraph to cascade → script fails with "EXTRA: langgraph"

Refs: https://github.com/Molecule-AI/molecule-controlplane/issues/388

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 03:52:39 -07:00
Hongming Wang
e014d22ee9
Merge pull request #2557 from Molecule-AI/feat/sweep-aws-secrets-orphans
feat(ops): sweep orphan AWS Secrets Manager secrets
2026-05-03 09:48:59 +00:00
Hongming Wang
6f8f7932d2 feat(ops): add sweep-aws-secrets janitor — orphan tenant bootstrap secrets
CP's deprovision flow calls Secrets.DeleteSecret() (provisioner/ec2.go:806)
but only when the deprovision runs to completion. Crashed provisions and
incomplete teardowns leak the per-tenant `molecule/tenant/<org_id>/bootstrap`
secret. At ~$0.40/secret/month, ~45 leaked secrets surfaced as ~$19/month
on the AWS cost dashboard.

The tenant_resources audit table (mig 024) tracks four kinds today —
CloudflareTunnel, CloudflareDNS, EC2Instance, SecurityGroup — and the
existing reconciler doesn't catch Secrets Manager orphans. The proper fix
(KindSecretsManagerSecret + recorder hook + reconciler enumerator) is filed
as a follow-up controlplane issue. This sweeper is the immediate stopgap.

Parallel-shape to sweep-cf-tunnels.sh:
  - Hourly schedule offset (:30, between sweep-cf-orphans :15 and
    sweep-cf-tunnels :45) so the three janitors don't burst CP admin
    at the same minute.
  - 24h grace window — never deletes a secret younger than the
    provisioning roundtrip, so an in-flight provision can't be deleted
    out from under it.
  - MAX_DELETE_PCT=50 default (mirrors sweep-cf-orphans for durable
    resources; tenant secrets should track 1:1 with live tenants).
  - Same schedule-vs-dispatch hardening as the other janitors:
    schedule → hard-fail on missing secrets, dispatch → soft-skip.
  - 8-way xargs parallelism, dry-run by default, --execute to delete.

Requires a dedicated AWS_JANITOR_* IAM principal — the prod molecule-cp
principal lacks secretsmanager:ListSecrets (it only has scoped
Get/Create/Update/Delete). The workflow's verify-secrets step will hard-fail
on the first scheduled run until those secrets are configured, surfacing
the missing setup loudly rather than silently no-op'ing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 02:38:08 -07:00
Hongming Wang
24276b9458 fix(publish-runtime): align cascade list to 4 supported runtimes
The cascade `TEMPLATES` list in publish-runtime.yml had drifted from
manifest.json:

  Currently dispatches to: claude-code, langgraph, crewai, autogen,
                           deepagents, hermes, gemini-cli, openclaw
  manifest.json supports:  claude-code, hermes, openclaw, codex (after
                           PR #2536 pruned to 4 actively-supported)

Two consequences of the drift:

1. `codex` (added in PR #2512, supported in manifest) was never in the
   cascade — fresh runtime publishes did NOT trigger a codex template
   rebuild. Codex stayed pinned to whatever runtime version it last saw
   at its own image-build time.

2. langgraph/crewai/autogen/deepagents/gemini-cli — deprecated, no
   shipping images, no working A2A — were still receiving cascade
   dispatches. Wasted API calls and (worse) green CI on dead repos
   masks "this template is dead, stop maintaining it."

Now matches manifest.json workspace_templates exactly. Surfaced during
RFC #388 (fast workspace provision) prior-art audit.

Long-term fix is to derive TEMPLATES from manifest.json so this can't
drift again — captured as a Phase-1 invariant in RFC #388. This commit
is the data fix only; structural fix lands with the bake pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 02:28:15 -07:00
Hongming Wang
fd5fe34f69
Merge pull request #2523 from Molecule-AI/dependabot/github_actions/actions/github-script-9.0.0
chore(deps)(deps): bump actions/github-script from 7.1.0 to 9.0.0
2026-05-03 01:37:00 +00:00
Hongming Wang
0d8b0c37a6
Merge pull request #2521 from Molecule-AI/dependabot/github_actions/actions/checkout-6
chore(deps)(deps): bump actions/checkout from 4 to 6
2026-05-03 01:36:57 +00:00
Hongming Wang
252e126207
Merge pull request #2522 from Molecule-AI/dependabot/github_actions/docker/setup-buildx-action-4.0.0
chore(deps)(deps): bump docker/setup-buildx-action from 3.12.0 to 4.0.0
2026-05-03 01:27:03 +00:00
Hongming Wang
e84df73e96
Merge pull request #2528 from Molecule-AI/dependabot/github_actions/docker/build-push-action-7.1.0
chore(deps)(deps): bump docker/build-push-action from 6.19.2 to 7.1.0
2026-05-03 01:27:00 +00:00
dependabot[bot]
c46db97ac6
chore(deps)(deps): bump docker/build-push-action from 6.19.2 to 7.1.0
Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 6.19.2 to 7.1.0.
- [Release notes](https://github.com/docker/build-push-action/releases)
- [Commits](10e90e3645...bcafcacb16)

---
updated-dependencies:
- dependency-name: docker/build-push-action
  dependency-version: 7.1.0
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-05-02 19:23:17 +00:00
dependabot[bot]
6c6c6eb1e8
chore(deps)(deps): bump imjasonh/setup-crane from 0.4 to 0.5
Bumps [imjasonh/setup-crane](https://github.com/imjasonh/setup-crane) from 0.4 to 0.5.
- [Release notes](https://github.com/imjasonh/setup-crane/releases)
- [Commits](31b88efe9d...6da1ae0188)

---
updated-dependencies:
- dependency-name: imjasonh/setup-crane
  dependency-version: '0.5'
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-05-02 19:23:13 +00:00
dependabot[bot]
e1f7d49575
chore(deps)(deps): bump actions/github-script from 7.1.0 to 9.0.0
Bumps [actions/github-script](https://github.com/actions/github-script) from 7.1.0 to 9.0.0.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](f28e40c7f3...3a2844b7e9)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: 9.0.0
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-05-02 19:23:09 +00:00
dependabot[bot]
ab7ac2e103
chore(deps)(deps): bump docker/setup-buildx-action from 3.12.0 to 4.0.0
Bumps [docker/setup-buildx-action](https://github.com/docker/setup-buildx-action) from 3.12.0 to 4.0.0.
- [Release notes](https://github.com/docker/setup-buildx-action/releases)
- [Commits](8d2750c68a...4d04d5d948)

---
updated-dependencies:
- dependency-name: docker/setup-buildx-action
  dependency-version: 4.0.0
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-05-02 19:23:05 +00:00
dependabot[bot]
3598eb41d1
chore(deps)(deps): bump actions/checkout from 4 to 6
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6.
- [Release notes](https://github.com/actions/checkout/releases)
- [Commits](https://github.com/actions/checkout/compare/v4...v6)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-05-02 19:23:01 +00:00
Hongming Wang
8bf29b7d0e fix(sweep-cf-tunnels): parallelize deletes + raise workflow timeout
The hourly Sweep stale Cloudflare Tunnels job got cancelled mid-cleanup
on 2026-05-02 (run 25248788312, killed at 5min after deleting 424/672
stale tunnels). A second manual dispatch finished the remaining 254
fine, so the immediate backlog cleared, but two underlying bugs would
re-trip on the next big cleanup.

Bug 1: serial delete loop. The execute branch was a `while read; do
curl -X DELETE; done` pipeline at ~0.7s/tunnel — fine for the
steady-state cleanup of a handful, but a 600+ backlog needs ~7-8min.
This commit fans out to $SWEEP_CONCURRENCY (default 8) workers via
`xargs -P 8 -L 1 -I {} bash -c '...' _ {} < "$DELETE_PLAN"`. With 8x
parallelism the same 600+ list drains in ~60s. Notes:

  - We use stdin (`<`) not GNU's `xargs -a FILE` so the script stays
    portable to BSD xargs (matters for local-runner testing on macOS).
  - We pass ONLY the tunnel id on argv. xargs tokenizes on whitespace
    by default; tab-separating id+name on argv risks mangling. The
    name is kept in a side-channel id->name map ($NAME_MAP) and looked
    up by the worker only on failure, for FAIL_LOG readability.
  - Workers print exactly `OK` or `FAIL` on stdout; tally with
    `grep -c '^OK$' / '^FAIL$'`.
  - On non-zero FAILED, log the first 20 lines of $FAIL_LOG as
    "Failure detail (first 20):" — same diagnostic surface as before
    but consolidated so we don't spam logs on a flaky CF API.
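
Condensed shape of the fan-out (assumes CF_API, CF_TOKEN and FAIL_LOG
are exported; CF_API stands in for the full tunnels endpoint base, and
the id→name side-channel lookup is omitted):

  xargs -P "${SWEEP_CONCURRENCY:-8}" -L 1 -I {} bash -c '
    id="$1"
    if curl -fsS -X DELETE "$CF_API/tunnels/$id" \
         -H "Authorization: Bearer $CF_TOKEN" >/dev/null; then
      echo OK
    else
      echo FAIL
      echo "$id" >> "$FAIL_LOG"
    fi
  ' _ {} < "$DELETE_PLAN" > "$RESULT_LOG"
  DELETED=$(grep -c '^OK$'  "$RESULT_LOG" || true)
  FAILED=$(grep -c '^FAIL$' "$RESULT_LOG" || true)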

Bug 2: workflow's 5-min cap was set as a hangs-detector but turned out
to be a real-job-too-slow detector. Raised to 30 min — generous
headroom for the ~60s steady-state run while still surfacing genuine
hangs (and in line with the sweep-cf-orphans companion job).

Bug 3 (drive-by): the existing trap was `trap 'rm -rf "$PAGES_DIR"'
EXIT`, which would have been silently overwritten by any later trap
registration. Replaced with a single `cleanup()` function that wipes
PAGES_DIR + all four new tempfiles (DELETE_PLAN, NAME_MAP, FAIL_LOG,
RESULT_LOG), called once via `trap cleanup EXIT`.
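
That is, a single:

  cleanup() {
    rm -rf "$PAGES_DIR" "$DELETE_PLAN" "$NAME_MAP" "$FAIL_LOG" "$RESULT_LOG"
  }
  trap cleanup EXIT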

Verification:
  - bash -n scripts/ops/sweep-cf-tunnels.sh: clean
  - shellcheck -S warning scripts/ops/sweep-cf-tunnels.sh: clean
  - python3 yaml.safe_load on the workflow: clean
  - Synthetic 30-line delete plan with every 7th id sentinel'd to
    return {"success":false}: TEST PASS, DELETED=26 FAILED=4, FAIL_LOG
    side-channel name lookup verified.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 02:35:46 -07:00
Hongming Wang
6e0eb2ddc9 fix(redeploy-staging): tolerate e2e-* teardown race in fleet HTTP 500
Recurring failure pattern in redeploy-tenants-on-staging:

  ##[error]redeploy-fleet returned HTTP 500
  ##[error]Process completed with exit code 1.

with the per-tenant breakdown in the response body showing the failures
were on ephemeral e2e-* tenants (saas/canvas/ext) whose parent E2E run
tore them down mid-redeploy — SSM exit=2 because the EC2 was already
terminating, or healthz timeout because the CF tunnel was already gone.
The actual operator-facing tenants (dryrun-98407, demo-prep, etc.) all
rolled fine in the same call.

This shape repeats every staging push that overlaps an active E2E run.
The downstream `Verify each staging tenant /buildinfo matches published
SHA` step ALREADY distinguishes STALE vs UNREACHABLE for exactly this
reason (per #2402); only the top-level `if HTTP_CODE != 200; exit 1`
gate misclassifies the race.

Filter: HTTP 500 + every failed slug matches `^e2e-` → soft-warn and
fall through to verify. Any non-e2e-* failure or non-500 HTTP remains
a hard fail, with the failed non-e2e slugs surfaced in the error so
the operator doesn't have to dig the response body out of CI.
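
Gate shape, roughly (the response-body field names are assumptions):

  failed_slugs=$(jq -r '.results[] | select(.ok == false) | .slug' "$body_file")
  non_e2e=$(printf '%s\n' $failed_slugs | grep -Ev '^e2e-' || true)
  if [ "$HTTP_CODE" = "500" ] && [ -n "$failed_slugs" ] && [ -z "$non_e2e" ]; then
    echo "::warning::HTTP 500 but every failed slug is e2e-*; treating as teardown race"
  elif [ "$HTTP_CODE" != "200" ]; then
    echo "::error::redeploy-fleet returned HTTP $HTTP_CODE; failed non-e2e slugs: ${non_e2e:-none}"
    exit 1
  fi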

Verified the gate logic with 6 synthetic CP responses (happy / e2e-only
race / mixed real+e2e fail / non-200 / 200+ok=false / all-real-fail) —
all behave correctly.

prod's redeploy-tenants-on-main is intentionally NOT touched: prod CP
serves no e2e-* tenants, so the race can't occur there and the strict
gate is the right behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 02:17:36 -07:00
Hongming Wang
43c234df35 secret-scan: align local pre-commit + extend drift lint (closes #1569 root)
#1569 Phase 1 discovery (2026-05-02) found six historical credential
exposures in molecule-core git history. All confirmed dead — but the
reason they got committed in the first place was that the local
pre-commit hook had two gaps that the canonical CI gate (and the
runtime's hook) didn't:

  1. **Pattern set was incomplete.** Local hook checked
     `sk-ant-|sk-proj-|ghp_|gho_|AKIA|mol_pk_|cfut_` — missing
     `ghs_*`, `ghu_*`, `ghr_*`, `github_pat_*`, `sk-svcacct-`,
     `sk-cp-`, `xox[baprs]-`, `ASIA*`. The historical leaks were 5×
     `ghs_*` (App installation tokens) + 1× `github_pat_*` — none of
     which the local hook would have caught even if it ran.
  2. **`*.md` and `docs/` were skip-listed.** The leaked tokens lived
     in `tick-reflections-temp.md`, `qa-audit-2026-04-21.md`, and
     `docs/incidents/INCIDENT_LOG.md` — exactly the file types the
     skip-list excluded. The hook ran and silently passed.

This commit:

- Replaces the local hook's hard-coded inline regex with the canonical
  13-pattern array (byte-aligned with `.github/workflows/secret-scan.yml`
  and the workspace runtime's `pre-commit-checks.sh`).
- Removes the `\.md$|docs/` skip — keeps only binary, lockfile, and
  hook-self exclusions.
- Adds the local hook to `lint_secret_pattern_drift.py` as an in-repo
  consumer (read-from-disk, no network — the hook lives in the same
  checkout the lint runs against). Drift now fails the lint when
  canonical changes without the local hook updating in lockstep.
- Adds `.githooks/pre-commit` to the drift-lint workflow's path
  filter so consumer-side edits also trigger the lint.
- Adopts the canonical's "don't echo the matched value" defense (the
  prior version would have round-tripped a leaked credential into
  scrollback / CI logs).
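
The value-suppressing scan is roughly (pattern list abridged here, not
the exact 13-entry canonical array):

  PATTERNS='sk-ant-|sk-proj-|sk-svcacct-|sk-cp-|ghp_|gho_|ghs_|ghu_|ghr_|github_pat_|AKIA|ASIA|xox[baprs]-|mol_pk_|cfut_'
  if git diff --cached -U0 | grep -qE "^\+.*($PATTERNS)"; then
    echo "pre-commit: credential-shaped string detected in staged changes." >&2
    echo "The matched value is deliberately not echoed; inspect 'git diff --cached'." >&2
    exit 1
  fi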

Verified: `python3 .github/scripts/lint_secret_pattern_drift.py`
reports both consumers aligned at 13 patterns. The hook's existing
six other gates (canvas 'use client', dark theme, SQL injection,
go-build, etc.) are untouched.

Companion change (already applied via API, no diff here):
`Scan diff for credential-shaped strings` is now in the required-checks
list on both `staging` and `main` branch protection — was previously a
soft gate (workflow ran, exited 1, but didn't block merge).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 23:47:56 -07:00
Hongming Wang
115f1f5e64 fix(redeploy-main): pull staging-<head_sha> instead of stale :latest
Auto-trigger from publish-workspace-server-image now resolves
target_tag to the just-published `staging-<short_head_sha>` digest
instead of `:latest`. Bypasses the dead retag path that was leaving
prod tenants on a 4-day-old image.

The chain pre-fix:
  publish-image  → pushes :staging-<sha> + :staging-latest (NOT :latest)
  canary-verify  → soft-skips (CANARY_TENANT_URLS unset, fleet not stood up)
  promote-latest → manual workflow_dispatch only, last run 2026-04-28
  redeploy-main  → pulls :latest → 2026-04-28 digest → all 3 tenants STALE

Today's incident:
  e7375348 (main) → publish-image green → redeploy fired → tenants
  pulled :latest (76c604fb digest from prior canary-verified state) →
  hongming /buildinfo returned 76c604fb instead of e7375348 → verify
  step correctly flagged 3/3 STALE → workflow failed.

Today's PRs (#2473 smoke wedge, #2487 panic recovery, #2496 sweeper
followups) shipped to GHCR as :staging-<sha> but never reached prod.

Fix (target-tag resolution sketched below):
  - workflow_dispatch input default '' (was 'latest'); empty input
    triggers auto-compute path
  - new "Compute target tag" step resolves:
    1. operator-supplied input → verbatim (rollback / pin)
    2. else → staging-<short_head_sha> (auto)
  - verify step's operator-pin detection now treats
    staging-<short_head_sha> as a non-pin (verification still runs)
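
A minimal sketch of that resolution order (the function name and the
short-SHA length are assumptions; the real logic is a shell step in the
workflow):

```python
# Illustrative only: operator-supplied input wins verbatim (rollback / pin);
# otherwise fall back to the just-published staging-<short_head_sha> tag.
def resolve_target_tag(operator_input: str, head_sha: str) -> str:
    if operator_input:                     # explicit pin: use it as-is
        return operator_input
    return f"staging-{head_sha[:8]}"       # auto path; 8-char short SHA assumed

assert resolve_target_tag("", "e7375348abcdef00") == "staging-e7375348"
assert resolve_target_tag("latest", "e7375348abcdef00") == "latest"
```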

When canary fleet is real, this workflow should chain on
canary-verify completion (workflow_run from canary-verify, gated on
promote-to-latest success) instead of publish-image — separate,
smaller PR. Today's fix unblocks prod deploys without that
prerequisite.

Companion: promote-latest.yml dispatched 2026-05-02 against
e7375348 to unstick existing prod tenants. This PR prevents
recurrence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 23:17:59 -07:00
Hongming Wang
3d8a0a58fa ci(auto-sync): App-token dispatch + ubuntu-latest + workflow_dispatch
auto-sync-main-to-staging.yml hasn't fired since 2026-04-29 despite
multiple staging→main promotes since. The promote PR #2442 (Phase 2)
has been wedged on `mergeStateStatus: BEHIND` for hours because
staging is missing the merge commit from PR #2437.

Three compounding bugs, all fixed here:

1. **GitHub no-recursion suppresses the `on: push` trigger.**
   When the merge queue lands a staging→main promote, the resulting
   push to main is "by GITHUB_TOKEN", and per
   https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow
   that push event does NOT fire any downstream workflows. Verified
   empirically against SHA 76c604fb (PR #2437): exactly ONE workflow
   fired on that push — `publish-workspace-server-image`, dispatched
   explicitly by auto-promote-staging.yml's polling tail with an App
   token (the documented #2357 workaround). Every other `on: push`
   workflow on main, including auto-sync, was silently suppressed.

   Same fix extended here: auto-promote-staging.yml's polling tail
   now ALSO dispatches `auto-sync-main-to-staging.yml --ref main`
   via the App token after the merge lands. App-initiated dispatch
   propagates `workflow_run` cascades, which is what the publish
   tail relies on too. Failure path: emits `::error::` with the
   recovery command — operator runs it once and the next promote
   self-heals.

   auto-sync.yml gains `workflow_dispatch:` so it can be invoked
   from the dispatch above + manually if a future promote also
   misses (defense in depth).

2. **`runs-on: [self-hosted, macos, arm64]` was wrong for this repo.**
   Comment claimed "matches the rest of this repo's workflows" — false:
   this is the ONLY workflow in molecule-core/.github/workflows/ with
   a non-ubuntu runs-on. Copy-paste artefact from molecule-controlplane
   (which IS private and has a Mac runner). molecule-core has no Mac
   runner registered, so even when the trigger DID fire (the 3 historic
   manual-UI merges), the job would have sat unassigned if the runner
   were offline. Switched to `ubuntu-latest` to match every other
   workflow in this repo.

3. **The `on: push` trigger remains** as a defense-in-depth path for
   the rare case of a manual UI merge by a real user (which uses
   their PAT and DOES fire downstream workflows — confirmed via the
   2026-04-29 d35a2420 run with `triggering_actor=HongmingWang-Rabbit`
   that fired 16 workflows including auto-sync). Belt-and-suspenders.

Long-term: switching auto-promote's `gh pr merge --auto` call to use
the App token (instead of GITHUB_TOKEN) would let `on: push` triggers
fire naturally and obviate the need for the explicit dispatches in
the polling tail. Tracked in #2357 — out of scope here.

Operator recovery for the current Phase 2 wedge: after this lands on
staging, dispatch auto-sync once via
`gh workflow run auto-sync-main-to-staging.yml --ref main` to
backfill the missed sync from 76c604fb. PR #2442 will go from
BEHIND → CLEAN and auto-merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:28:35 -07:00
Hongming Wang
c275716005 harness(phase-2): multi-tenant compose + cross-tenant isolation replays
Brings the local harness from "single tenant covering the request path"
to "two tenants covering both the request path AND the per-tenant
isolation boundary" — the same shape as production (one EC2 + one
Postgres + one MOLECULE_ORG_ID per tenant).

Why this matters: the four prior replays exercise the SaaS request
path against one tenant. They cannot prove that TenantGuard rejects
a misrouted request (production CF tunnel + AWS LB are the failure
surface), nor that two tenants doing legitimate work in parallel
keep their `activity_logs` / `workspaces` / connection-pool state
partitioned. Both are real bug classes — TenantGuard allowlist drift
shipped in #2398, and lib/pq prepared-statement cache collision is documented
as an org-wide hazard.

What changed:

1. compose.yml — split into two tenants.
   tenant-alpha + postgres-alpha + tenant-beta + postgres-beta + the
   shared cp-stub, redis, cf-proxy. Each tenant gets a distinct
   ADMIN_TOKEN + MOLECULE_ORG_ID and its own Postgres database. cf-proxy
   depends on both tenants becoming healthy.

2. cf-proxy/nginx.conf — Host-header → tenant routing.
   `map $host $tenant_upstream` resolves the right backend per request.
   Required `resolver 127.0.0.11 valid=30s ipv6=off;` because nginx
   needs an explicit DNS resolver to use a variable in `proxy_pass`
   (literal hostnames resolve once at startup; variables resolve per
   request — without the resolver nginx fails closed with 502).
   `server_name` lists both tenants + the legacy alias so unknown Host
   headers don't silently route to a default and mask routing bugs.

3. _curl.sh — per-tenant + cross-tenant-negative helpers.
   `curl_alpha_admin` / `curl_beta_admin` set the right
   Host + Authorization + X-Molecule-Org-Id triple.
   `curl_alpha_creds_at_beta` / `curl_beta_creds_at_alpha` exist
   precisely to make WRONG requests (replays use them to assert
   TenantGuard rejects). `psql_exec_alpha` / `psql_exec_beta` shell out
   per-tenant Postgres exec. Legacy aliases (`curl_admin`, `psql_exec`)
   keep the four pre-Phase-2 replays working without edits.

4. seed.sh — registers parent+child workspaces in BOTH tenants.
   Captures server-generated IDs via `jq -r '.id'` (POST /workspaces
   ignores body.id, so the older client-side mint silently desynced
   from the workspaces table and broke FK-dependent replays). Stashes
   `ALPHA_PARENT_ID` / `ALPHA_CHILD_ID` / `BETA_PARENT_ID` /
   `BETA_CHILD_ID` to .seed.env, plus legacy `ALPHA_ID` / `BETA_ID`
   aliases for backwards compat with chat-history / channel-envelope.

5. New replays.

   tenant-isolation.sh (13 assertions) — TenantGuard 404s any request
   whose X-Molecule-Org-Id doesn't match the container's
   MOLECULE_ORG_ID. Asserts the 404 body has zero
   tenant/org/forbidden/denied keywords (existence of a tenant must
   not be provable from the outside). Covers cross-tenant routing
   misconfiguration + allowlist drift + missing-org-header.

   per-tenant-independence.sh (12 assertions) — both tenants seed
   activity_logs in parallel with distinct row counts (3 vs 5) and
   confirm each tenant's history endpoint returns exactly its own
   counts. Then a concurrent INSERT race (10 rows per tenant in
   parallel via `&` + wait) catches shared-pool corruption +
   prepared-statement cache poisoning + redis cross-keyspace bleed.

6. Bug fix: down.sh + dump-logs SECRETS_ENCRYPTION_KEY validation.
   `docker compose down -v` validates the entire compose file even
   though it doesn't read the env. up.sh generates a per-run key into
   its own shell — down.sh runs in a fresh shell that wouldn't see it,
   so without a placeholder `compose down` exited non-zero before
   removing volumes. Workspaces silently leaked into the next
   ./up.sh + seed.sh boot. Caught when tenant-isolation.sh F1/F2 saw
   3× duplicate alpha-parent rows accumulated across three prior runs.
   Same fix applied to the workflow's dump-logs step.

7. requirements.txt — pin molecule-ai-workspace-runtime>=0.1.78.
   channel-envelope-trust-boundary.sh imports from `molecule_runtime.*`
   (the wheel-rewritten path) so it catches the failure mode where
   the wheel build silently strips a fix that unit tests on local
   source still pass. CI was failing this replay because the wheel
   wasn't installed — caught in the staging push run from #2492.

8. .github/workflows/harness-replays.yml — Phase 2 plumbing.
   * Removed /etc/hosts step (Host-header path eliminated the need;
     scripts already source _curl.sh).
   * Updated dump-logs to reference the new service names
     (tenant-alpha + tenant-beta + postgres-alpha + postgres-beta).
   * Added SECRETS_ENCRYPTION_KEY placeholder env on the dump step.

Verified: ./run-all-replays.sh from a clean state — 6/6 passed
(buildinfo-stale-image, channel-envelope-trust-boundary, chat-history,
peer-discovery-404, per-tenant-independence, tenant-isolation).

Roadmap section updated: Phase 2 marked shipped. Phase 3 promoted to
"replace cp-stub with real molecule-controlplane Docker build + env
coherence lint."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 21:36:40 -07:00
Hongming Wang
e58e446444 docs(ci): correct test-ops-scripts.yml header — discover does NOT recurse
The previous header said `unittest discover from the scripts/ root
walks recursively`, contradicting the workflow body which runs two
passes precisely because discover does NOT recurse without
__init__.py. Fixed self-review feedback on PR #2440.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 20:52:58 -07:00
Hongming Wang
f2545fcb57
Merge pull request #2440 from Molecule-AI/chore/wheel-rewriter-tests-and-noqa-cleanup
chore: rewriter unit tests + drop misleading noqa on import inbox
2026-05-01 03:48:33 +00:00
Hongming Wang
6e92fe0a08 chore: rewriter unit tests + drop misleading noqa on import inbox
Two small follow-ups to the PR #2433 → #2436 → #2439 incident chain.

1) `import inbox  # noqa: F401` in workspace/a2a_mcp_server.py was
   misleading — `inbox` IS used (at the bridge wiring inside main()).
   F401 means "imported but unused", which would mask a real future
   F401 if the usage is removed. Drop the noqa, keep the explanatory
   block comment about the rewriter's `import X` → `import mr.X as X`
   expansion (and the `import X as Y` → `import mr.X as X as Y` trap
   the comment exists to prevent re-introducing). The expansion and the
   trap are sketched after this list.

2) scripts/test_build_runtime_package.py — 17 unit tests covering
   `rewrite_imports()` and `build_import_rewriter()` in
   scripts/build_runtime_package.py. Until now the function had zero
   coverage despite the entire wheel build depending on it. Tests
   pin: bare-import aliasing, dotted-import preservation, indented
   imports, from-imports (simple + dotted + multi-symbol + block),
   the `import X as Y` rejection added in PR #2436 (with comment-
   stripping + indented + comma-not-alias edge cases), allowlist
   anchoring (`a2a` ≠ `a2a_tools`), and end-to-end reproduction
   of the PR #2433 failing pattern + the #2436 fix pattern.

3) Wire scripts/test_*.py into CI by adding a second discover pass
   to test-ops-scripts.yml. Top-level scripts/ tests live alongside
   their target file (parallels the scripts/ops/ test layout); the
   existing scripts/ops/ pass keeps running because scripts/ops/
   has no __init__.py so a single discover from scripts/ root
   doesn't recurse. Two passes is simpler than retrofitting
   namespace packages. Path filter widened from `scripts/ops/**`
   to `scripts/**` so PRs touching the build script trigger the
   new tests.
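
A minimal sketch of the expansion and the aliased-import trap from item 1.
The allowlist, package name, and regexes here are illustrative; the real
rewrite_imports() also handles the from-import and block cases item 2 lists:

```python
# Illustrative only: expand allowlisted bare imports into the wheel's
# namespaced form, and refuse `import X as Y` (the naive expansion would
# emit the nonsensical `import mr.X as X as Y`).
import re

ALLOWLIST = {"inbox", "a2a"}          # hypothetical allowlisted top-level modules
PREFIX = "molecule_runtime"           # assumed wheel package ("mr." shorthand above)

BARE = re.compile(r"^(?P<indent>\s*)import (?P<mod>[A-Za-z_][A-Za-z0-9_]*)\s*(#.*)?$")
ALIASED = re.compile(r"^\s*import ([A-Za-z_][A-Za-z0-9_]*) as \w+")

def rewrite_line(line: str) -> str:
    m = ALIASED.match(line)
    if m and m.group(1) in ALLOWLIST:
        raise ValueError(f"aliased import of {m.group(1)!r} is not rewritable")
    m = BARE.match(line)
    if m and m.group("mod") in ALLOWLIST:
        mod = m.group("mod")
        return f"{m.group('indent')}import {PREFIX}.{mod} as {mod}"
    return line                       # dotted imports, from-imports, etc. pass through

assert rewrite_line("import inbox") == "import molecule_runtime.inbox as inbox"
assert rewrite_line("import a2a_tools") == "import a2a_tools"   # allowlist anchoring
```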

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 20:45:32 -07:00
Hongming Wang
3c16c27415 ci(wheel-smoke): always-run with per-step if-gates for required-check eligibility
The `PR-built wheel + import smoke` gate caught the broken wheel from
PR #2433 (`import inbox as _inbox_module` collision) but couldn't block
the merge because it isn't a required check on staging. Promoting it to
required is the right move per the runtime publish pipeline gates note
(2026-04-27 RuntimeCapabilities ImportError outage), but the existing
`paths: [workspace/**, scripts/...]` filter blocks PRs that don't touch
those paths from ever generating the check run — branch protection
would deadlock waiting on a check that never fires.

Refactor (same shape as e2e-api.yml's e2e-api job):
- Drop top-level `paths:` filter — workflow runs on every push/PR/
  merge_group event.
- Add `detect-changes` job using dorny/paths-filter to compute the
  `wheel=true|false` output.
- Collapse to ONE always-running `local-build-install` job named
  `PR-built wheel + import smoke`. Per-step `if:` gates on the
  detect output. PRs untouched by wheel-relevant paths emit a
  no-op SUCCESS step ("paths filter excluded this commit") so the
  check passes without rebuilding the wheel.
- merge_group + workflow_dispatch unconditionally `wheel=true` so
  the queue always validates the to-be-merged state, regardless of
  which PR composed it.

Why one-job-with-step-gates instead of two-jobs-sharing-name: SKIPPED
check runs block branch protection even when SUCCESS siblings exist
(verified PR #2264 incident, 2026-04-29). Single always-run job emits
exactly one SUCCESS check run regardless of paths filter.

Follow-up: open a separate PR adding `PR-built wheel + import smoke`
to the staging branch protection's required_status_checks.contexts
once this lands. Doing both in one PR risks the protection update
firing before the workflow refactor merges, deadlocking unrelated PRs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 20:40:05 -07:00
Hongming Wang
c68ec23d3c
Merge pull request #2410 from Molecule-AI/auto/harness-replays-ci-gate
ci: gate PRs on tests/harness/run-all-replays.sh
2026-04-30 20:35:30 +00:00
Hongming Wang
0f0df576f5
Merge pull request #2392 from Molecule-AI/auto/e2e-staging-external-runtime
test(e2e): live staging regression for external-runtime awaiting_agent transitions
2026-04-30 20:32:23 +00:00
Hongming Wang
c8b17ea1ad fix(harness): install httpx for replay Python evals
peer-discovery-404 imports workspace/a2a_client.py which depends on
httpx; the runner's stock Python doesn't have it, so the replay's
PARSE assertion (b) fails with ModuleNotFoundError on every run. The
WIRE assertion (a) — pure curl — passes, so the mixed result made the
replay LOOK partially broken even though the tenant side is fine.

Adding tests/harness/requirements.txt with only httpx instead of
sourcing workspace/requirements.txt: that file pulls a2a-sdk,
langchain-core, opentelemetry, sqlalchemy, temporalio, etc. — ~30s
of install for one replay's PARSE step. The harness's deps surface
should grow when a new replay introduces a new import, not by
default.

Workflow gains one step (`pip install -r tests/harness/requirements.txt`)
between the /etc/hosts setup and run-all-replays. No other changes.
2026-04-30 13:32:00 -07:00
Hongming Wang
24cb2a286f ci(harness-replays): KEEP_UP=1 so dump-logs step has containers to read
First run on PR #2410 failed with 'container harness-tenant-1 is unhealthy'
but the dump-compose-logs step printed empty tenant logs because
run-all-replays.sh's trap-on-EXIT had already torn down the harness.

Setting KEEP_UP=1 leaves containers in place; the always-run Force
teardown step at the end owns cleanup explicitly. Now we'll actually
see why the tenant didn't become healthy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 13:15:46 -07:00
Hongming Wang
3105e87cf7 ci: gate PRs on tests/harness/run-all-replays.sh
Closes the gap between "the harness exists" and "the harness blocks bugs."
Phase 2 of the harness roadmap (per tests/harness/README.md): make
harness-based E2E a required CI check on every PR touching the tenant
binary or the harness itself.

Trigger: push + pull_request to staging+main, paths-filtered to
workspace-server/**, canvas/**, tests/harness/**, and this workflow.
merge_group support included so this becomes branch-protectable.

Single-job-with-conditional-steps pattern (matches e2e-api.yml). One
check run regardless of paths-filter outcome; satisfies branch
protection cleanly per the PR #2264 SKIPPED-in-set finding.

Why this exists: 2026-04-30 we shipped a TenantGuard allowlist gap
(/buildinfo added to router.go in #2398, never added to the allowlist)
that the existing buildinfo-stale-image.sh replay would have caught.
The harness was wired correctly; nobody ran it. Replays as a discipline
beat replays as a memory item.

The CI pipeline:
  detect-changes (paths filter)
    └ harness-replays (always)
        ├ no-op pass when paths-filter says no relevant change
        └ otherwise: checkout + sibling plugin checkout +
                     /etc/hosts entry + run-all-replays.sh +
                     compose-logs-on-failure + force-teardown

Compose logs from tenant/cp-stub/cf-proxy/postgres are dumped on
failure so a CI red is debuggable without re-reproducing locally.
The trap in run-all-replays.sh handles teardown; the always-run
down.sh step is a belt-and-suspenders against trap-bypass kills.

Follow-ups (not in this PR):
- Add this check to staging branch protection once it's been green
  for a few PRs (the new-workflow-instability hedge that other gates
  followed).
- Eventually wire the buildx GHA cache to speed up tenant image
  builds — currently every PR rebuilds the full Dockerfile.tenant
  (Go + Next.js + template clones) from scratch. Acceptable for now;
  optimize when the timeout-minutes:30 ceiling becomes painful.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 13:04:53 -07:00
Hongming Wang
ef206b5be6 refactor(ci): extract wheel smoke into shared script
publish-runtime.yml had a broad smoke (AgentCard call-shape, well-known
mount alignment, new_text_message) inline as a heredoc. runtime-prbuild-
compat.yml had a narrow inline smoke (just `from main import main_sync`).
Result: a PR could introduce SDK shape regressions that pass at PR time
and only fail at publish time, post-merge.

Extract the broad smoke into scripts/wheel_smoke.py and invoke it from
both workflows. PR-time gate now matches publish-time gate — same script,
same assertions. Eliminates the drift hazard of two heredocs that have
to be kept in lockstep manually.
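
A minimal sketch of the shape such a shared smoke can take. Only
`main.main_sync` is quoted above; the other module and attribute names are
placeholders, and the real assertions live in scripts/wheel_smoke.py:

```python
# Illustrative only: import the installed wheel (not the source tree) and fail
# loudly if a load-bearing symbol is missing or unimportable.
import importlib
import sys

CHECKS = [
    ("main", "main_sync"),            # the narrow PR-time check mentioned above
    ("a2a.types", "AgentCard"),       # placeholder for the broader call-shape checks
]

def main() -> int:
    failures = []
    for module_name, attr in CHECKS:
        try:
            getattr(importlib.import_module(module_name), attr)
        except Exception as exc:      # ImportError, AttributeError, ...
            failures.append(f"{module_name}.{attr}: {exc!r}")
    for failure in failures:
        print(f"wheel smoke failure: {failure}", file=sys.stderr)
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main())
```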

Verified locally:
  * Built wheel from workspace/ source, installed in venv, ran smoke → pass
  * Simulated AgentCard kwarg-rename regression → smoke catches it as
    `ValueError: Protocol message AgentCard has no "supported_interfaces"
    field` (the exact failure mode of #2179 / supported_protocols incident)

Path filter for runtime-prbuild-compat extended to include
scripts/wheel_smoke.py so smoke-only edits get PR-validated. publish-
runtime path filter intentionally NOT extended — smoke-only edits should
not auto-trigger a PyPI version bump.

Subset of #131 (the broader "invoke main() against stub config" goal
remains pending — main() needs a config dir + stub platform server).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 11:52:07 -07:00
Hongming Wang
9b909c4459 fix(ci): gate 50%-floor on TOTAL_VERIFIED >= 4
Self-review of #2403 caught a regression: with a 1-tenant fleet (the
exact case the original #2402 fix targeted), the new floor would
re-introduce the flake. Trace:

  TOTAL=1, UNREACHABLE=1, $((1/2))=0
  if 1 -gt 0 → TRUE → exit 1

The 50%-rule only meaningfully distinguishes "real outage" from
"teardown race" when the fleet is large enough that "half down" is
statistically meaningful. With 1-3 tenants, canary-verify is the
actual gate (it runs against the canary first and aborts the rollout
if the canary fails to come up).

Gate the floor on TOTAL_VERIFIED >= 4. Truth table:

  TOTAL  UNREACHABLE  RESULT
  1      1            soft-warn (original e2e flake case)
  4      2            soft-warn (exactly half)
  4      3            hard-fail (75% — real outage)
  10     6            hard-fail (60% — real outage)
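
A minimal sketch of the gated floor, matching the truth table above and
mirroring the shell's integer division (names are illustrative, not the
workflow step's actual variables):

```python
# Illustrative only: returns True when the run should hard-fail.
def fleet_floor_trips(total_verified: int, unreachable: int) -> bool:
    if total_verified < 4:                     # tiny fleet: canary-verify is the real gate
        return False                           # soft-warn only
    return unreachable > total_verified // 2   # strictly more than half unreachable

assert fleet_floor_trips(1, 1) is False    # original e2e flake case
assert fleet_floor_trips(4, 2) is False    # exactly half -> soft-warn
assert fleet_floor_trips(4, 3) is True     # 75% -> real outage
assert fleet_floor_trips(10, 6) is True    # 60% -> real outage
```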

Mirrored across staging.yml + main.yml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 11:40:31 -07:00
Hongming Wang
ec39fecda2 fix(ci): hard-fail when >50% of fleet unreachable post-redeploy
Belt-and-suspenders sanity floor on top of the unreachable-soft-warn
introduced earlier in this PR. Addresses the residual gap noted in
review: if a new image crashes on startup, every tenant ends up
unreachable, and the soft-warn alone would let that ship as a green
deploy. Canary-verify catches it on the canary tenant first, but this
guard is a fallback for canary-skip dispatches and same-batch races.

Threshold is 50% of healthz_ok-snapshotted tenants — comfortably above
the typical e2e-* teardown rate (5-10/hour, ~1 ephemeral tenant per
batch) but below any plausible real-outage scenario.

Mirrored across staging.yml + main.yml for shape parity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 11:35:56 -07:00
Hongming Wang
d45241cae7 fix(ci): distinguish unreachable from stale in /buildinfo verify step
The /buildinfo verify step (PR #2398) was treating "no /buildinfo response"
the same as "tenant returned wrong SHA" — both bumped MISMATCH_COUNT and
hard-failed the workflow. First post-merge run on staging caught a real
edge case: ephemeral E2E tenants (slug e2e-20260430-...) get torn down by
the E2E teardown trap between CP's healthz_ok snapshot and the verify step
running, so the verify step would dial into DNS that no longer resolves
and hard-fail on a benign condition.

The bug class we actually care about is STALE (tenant up + serving old
code, the #2395 root). UNREACHABLE post-redeploy is almost always a benign
teardown race; real "tenant up but unreachable" is caught by CP's own
healthz monitor + the alert pipeline, so double-counting it here was
making this workflow flaky on every staging push that overlapped E2E.

Wire:
  - Split MISMATCH_COUNT into STALE_COUNT + UNREACHABLE_COUNT.
  - STALE → hard-fail the workflow (the bug class we're guarding).
  - UNREACHABLE → soft-warn (⚠️), don't fail. Reachable-mismatch still hard-fails.
  - Job summary surfaces both lists separately so on-call can tell at a
    glance which class fired. (Classification sketched below.)
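
A minimal sketch of that classification, with a hypothetical Python helper
standing in for the workflow's shell; the `{git_sha}` response shape matches
the /buildinfo contract, but URL handling and short-vs-full SHA normalization
are simplified:

```python
# Illustrative only: classify one tenant as OK / STALE / UNREACHABLE after a
# redeploy, then fail the run only on STALE (tenant up, serving old code).
import json
import urllib.request

def classify(tenant_url: str, expected_sha: str) -> str:
    try:
        with urllib.request.urlopen(f"{tenant_url}/buildinfo", timeout=10) as resp:
            got = json.load(resp).get("git_sha", "")
    except Exception:
        return "UNREACHABLE"          # benign teardown race: warn, don't fail
    return "OK" if got == expected_sha else "STALE"   # normalization omitted

def verdict(results):
    return 1 if "STALE" in results else 0   # only STALE hard-fails the workflow
```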

Mirror in redeploy-tenants-on-main.yml for shape parity (prod has fewer
ephemeral tenants, but leaving the two workflows asymmetric would be a
gratuitous fork).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 11:25:46 -07:00
Hongming Wang
998e13c4bd feat(deploy): verify each tenant /buildinfo matches published SHA after redeploy
Closes the gap that let issue #2395 ship: redeploy-fleet workflows reported
ssm_status=Success based on SSM RPC return code alone, while EC2 tenants
silently kept serving the previous :latest digest because docker compose up
without an explicit pull is a no-op when the local tag already exists.

Wire:
  - new buildinfo package exposes GitSHA, set at link time via -ldflags from
    the GIT_SHA build-arg (default "dev" so test runs without ldflags fail
    closed against an unset deploy)
  - router exposes GET /buildinfo returning {git_sha} — public, no auth,
    cheap enough to curl from CI for every tenant
  - both Dockerfiles thread GIT_SHA into the Go build
  - publish-workspace-server-image.yml passes GIT_SHA=github.sha for both
    images
  - redeploy-tenants-on-main.yml + redeploy-tenants-on-staging.yml curl each
    tenant's /buildinfo after the redeploy SSM RPC and fail the workflow on
    digest mismatch; staging treats both :latest and :staging-latest as
    moving tags; verification is skipped only when an operator pinned a
    specific tag via workflow_dispatch

Tests:
  - TestGitSHA_DefaultDevSentinel pins the dev default
  - TestBuildInfoEndpoint_ReturnsGitSHA pins the wire shape that the
    workflow's jq lookup depends on

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:08 -07:00
Hongming Wang
8efb2dae8d fix(ci): handle empty E2E lookup in auto-promote-on-e2e gate
When gh run list returns [] (no E2E run on the main SHA — the common
case for canvas-only / cmd-only / sweep-only changes whose paths
don't trigger E2E), jq's `.[0]` is null and the interpolation
`"\(null)/\(null // "none")"` produces "null/none". The case
statement has no `null/none)` branch, so it falls into `*)` →
exit 1 → auto-promote-on-e2e fails → `:latest` doesn't get retagged
to the new SHA → tenants on `redeploy-tenants-on-main` end up
pulling the OLD `:latest` digest.

Surfaced 2026-04-30 17:00Z as the first observable consequence of
PR #2389 (App-token dispatch fix). Every prior auto-promote-on-e2e
run was triggered by E2E completion (the "Upstream is E2E itself"
short-circuit at line 151 fired before reaching the gate). #2389
made publish-image's completion event correctly fire workflow_run
listeners — auto-promote-on-e2e is one of those listeners — and
hit the latent jq bug on the first publish-upstream run.

Fix: change `.[0]` to `(.[0] // {})` in the jq filter so the empty-
array case becomes `none/none` (the documented "E2E paths-filtered
out for this SHA — proceed" branch) instead of the unhandled
`null/none`. Also default `.status` for the same defensive reason.

Verified the three input shapes locally:
  []                                          → "none/none"  ✓
  [{status:completed,conclusion:success}]     → "completed/success"  ✓
  [{status:in_progress,conclusion:null}]      → "in_progress/none"  ✓

Outer `|| echo "none/none"` fallback retained as defense-in-depth
for non-zero gh exits (network / auth failures).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:07:52 -07:00
Hongming Wang
79496dcffe test(e2e): live staging regression for external-runtime awaiting_agent transitions
Pins the four workspaces.status=awaiting_agent transitions on a real
staging tenant, end-to-end. Catches the class of silent enum failures
that migration 046 fix-forwarded — specifically:

  1. workspace.go:333 — POST /workspaces with runtime=external + no URL
     parks the row in 'awaiting_agent'. Pre-046 the UPDATE silently
     failed and the row stuck on 'provisioning'.

  2. registry.go:resolveDeliveryMode — registering an external workspace
     defaults delivery_mode='poll' (PR #2382). The harness asserts the
     poll default after register.

  3. registry/healthsweep.go:sweepStaleRemoteWorkspaces — after
     REMOTE_LIVENESS_STALE_AFTER (90s default) with no heartbeat, the
     workspace transitions back to 'awaiting_agent'. Pre-046 the sweep
     UPDATE silently failed and the workspace stuck on 'online' forever.

  4. Re-register from awaiting_agent → 'online' confirms the state is
     operator-recoverable, which is the whole reason for using
     awaiting_agent (vs. 'offline') as the external-runtime stale state.

The harness mirrors test_staging_full_saas.sh: tenant create →
DNS/TLS wait → tenant token retrieve → exercise → idempotent teardown
via EXIT/INT/TERM trap. Exit codes match the documented contract
{0,1,2,3,4}; raw bash exit codes are normalized so the safety-net
sweeper doesn't open false-positive incident issues.

The companion workflow gates on the source files that touch this
lifecycle: workspace.go, registry.go, workspace_restart.go,
healthsweep.go, liveness.go, every migration, the static drift gate,
and the script + workflow themselves. Daily 07:30 UTC cron catches
infra drift on quiet days. cancel-in-progress=false because aborting
a half-rolled tenant leaves orphan resources for the safety-net to
clean.

Verification:
  - bash -n: ok
  - shellcheck: only the documented A && B || C pattern, identical to
    test_staging_full_saas.sh.
  - YAML parser: ok.
  - Workflow path filter matches every site that writes to the
    workspace_status enum (cross-checked against the drift gate's
    UPDATE workspaces / INSERT INTO workspaces enumeration).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 09:36:18 -07:00
Hongming Wang
e418d32582 ci(auto-promote): dispatch publish via molecule-ai App token to unblock workflow_run chain
Root cause (verified 2026-04-30): GITHUB_TOKEN-initiated
workflow_dispatch creates the dispatched run, but the resulting run's
completion event does NOT fire downstream `workflow_run` triggers.

This is the documented "no recursion" rule:
https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow

Evidence (publish-workspace-server-image runs on main):

  run_id      | head_sha  | triggering_actor      | canary | redeploy
  ------------+-----------+-----------------------+--------+----------
  25151545007 | 6ef562ee  | HongmingWang-Rabbit   |  YES   |  YES
  25171773918 | 21313dc   | github-actions[bot]   |  NO    |  NO
  25173801008 | 59dec57   | github-actions[bot]   |  NO    |  NO

The 06:52Z run that "worked" was an operator-fired dispatch from the
terminal — actor was the operator's PAT. The two runs that "dropped"
were dispatched by auto-promote-staging.yml's `gh workflow run` step
authenticated via `secrets.GITHUB_TOKEN`, so the actor became
`github-actions[bot]` and the workflow_run cascade was suppressed.

Same workflow file, same dispatch call, same successful publish run
— only the auth token differed.

Fix: mint a molecule-ai GitHub App installation token before the
dispatch step and use it as `GH_TOKEN`. App-initiated dispatches
DO propagate the workflow_run cascade (the App user is a real
identity, not the GITHUB_TOKEN bot pseudonym).

The molecule-ai App (app_id=3398844, installation 124443072) is
already installed on the org with `actions:write` — no new App
needed. Only secrets are missing.

## Required setup before merge

The following repo secrets must be added at
https://github.com/Molecule-AI/molecule-core/settings/secrets/actions
or auto-promote will hard-fail at the new "Mint App token" step:

- `MOLECULE_AI_APP_ID`         = `3398844`
- `MOLECULE_AI_APP_PRIVATE_KEY` = contents of a .pem file generated at
  https://github.com/organizations/Molecule-AI/settings/installations/124443072

(Click "Generate a private key" if one doesn't exist yet.)

## Long-term cleanup

The polling tail step still exists because the auto-merge call
itself uses GITHUB_TOKEN, so the FF push to main doesn't fire
publish-workspace-server-image's `push` trigger naturally. Switching
the auto-merge call to use the SAME App token would eliminate the
polling tail entirely. Tracked in #2357.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:55:49 -07:00
Hongming Wang
a9391c5900 fix(ci): drop --depth=1 from migration collision check fetch
The check has been blocking the staging→main auto-promote PR (#2361)
since 2026-04-30T07:17Z with:

  fatal: origin/main...<head>: no merge base

Root cause: the workflow does `git fetch origin <base> --depth=1`
which overwrites checkout@v4's full-history clone with a shallow
tip — destroying the ancestry the subsequent
`git diff origin/main...HEAD` (three-dot, merge-base form) needs.

This deadlocks every staging→main promote PR until manually fixed.
The auto-promote runs were succeeding at the gate-check phase but
the subsequent PR-merge step waited 30 min for the failing check
and timed out, skipping the publish + redeploy dispatch tail.
Fleet recovery for any production-only fix went through staging
fine but never reached main.

Fix: drop --depth=1 so the explicit fetch preserves full history.
The leading comment is updated to call out this trap so a future
maintainer doesn't re-add the flag thinking it's a perf win.

No test added: this is a workflow-config one-liner that the
existing PR check itself exercises end-to-end (the real signal is
PR #2361 going green after this lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 05:28:03 -07:00
Hongming Wang
e06ebaefdf
Merge pull request #2346 from Molecule-AI/auto/issue-2341-migration-collision
ci: hard gate against migration version collisions (#2341)
2026-04-30 08:50:19 +00:00
Hongming Wang
26d5c5ba1f fix(ci): close gaps in auto-promote dispatch tail (#2358 follow-up)
Independent review of #2358 surfaced three gaps that the original
self-review missed. All three would manifest only on the FIRST real
staging→main promotion through the new tail step, so they'd silently
re-introduce the deploy-chain bug #2357 was supposed to fix.

1. **Missing `actions: write` permission.** `gh workflow run` POSTs to
   `/repos/.../actions/workflows/.../dispatches`, which requires the
   actions:write scope on GITHUB_TOKEN. The job had only contents:write
   + pull-requests:write, so the dispatch call would 403 on every run
   and the publish chain would still not fire. Adding the scope.

2. **No workflow-level concurrency block.** When CI + E2E Staging
   Canvas + E2E API Smoke + CodeQL all complete within seconds of each
   other on a green staging push (the typical case), four separate
   workflow_run events fire and four parallel auto-promote runs all
   reach the dispatch tail. They poll the same PR, all observe the
   same mergedAt, and all call `gh workflow run` — producing 2-4×
   redundant publish builds racing for the same `:staging-latest`
   retag and 2-4× canary-verify chains. Added
   `concurrency.group: auto-promote-staging, cancel-in-progress: false`.
   cancel-in-progress=false because killing a polling tail that's
   about to dispatch would re-introduce the original bug.

3. **PR closed-without-merge ties up a runner for 30 min.** If the
   merge queue rejects the PR (gates flip red post-approval), or an
   operator closes it manually, mergedAt stays null forever and the
   loop polls 60 × 30s burning a runner slot. Now also reads `state`
   in the same `gh pr view` call and breaks early when STATE=CLOSED.

Verification on this PR is structural (workflow won't fire on a
staging→main promotion until this lands AND a subsequent staging
push triggers auto-promote). The actions:write fix in particular is
unverifiable until the next real run — the prior #2358 fix has
the same property, so we're stacking two unverifiable workflow
edits. That's intentional rather than risky: stage 1 (#2358) was
load-bearing for the deploy-chain restoration; stage 2 (this PR)
hardens it before it actually matters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 00:03:31 -07:00
Hongming Wang
d850ec7c8c
Merge pull request #2358 from Molecule-AI/auto/issue-2357-promote-dispatch-chain
fix(ci): dispatch publish chain after auto-promote merge (#2357)
2026-04-30 06:36:02 +00:00
Hongming Wang
9a7f61661b fix(ci): dispatch publish chain after auto-promote merge (#2357)
The auto-promote staging → main flow uses `gh pr merge --auto` with
GITHUB_TOKEN, which means GitHub suppresses downstream `push` events on
the resulting main commit. This is documented behavior — events created
by GITHUB_TOKEN do not trigger new workflow runs, with workflow_dispatch
and repository_dispatch as the only exceptions.

Effect: when the merge queue lands the auto-promote PR, the main push
DOES NOT fire publish-workspace-server-image. canary-verify + the
:staging-<sha> → :latest retag never run, so redeploy-tenants-on-main
also never fires. Tenants stay on stale code until someone manually
dispatches the chain (which is what just happened for issue #2339).

Fix here: after enqueuing auto-merge, poll for the PR to land, then
explicitly `gh workflow run publish-workspace-server-image.yml --ref
main`. workflow_dispatch is the documented exception, so the dispatch
event itself DOES create a new run. canary-verify and
redeploy-tenants-on-main chain via workflow_run as before.

Long-term (tracked in #2357): switch the auto-merge call above to a
GitHub App token (actions/create-github-app-token) so the merge event
itself can trigger the downstream chain naturally; the polling tail
becomes deletable.

Why a 30-min poll cap: merge queue typically lands a green promote PR
within 5-10 min. 30 min covers a slow CI run without hanging the
workflow indefinitely. If the merge times out, the step warns and
exits 0 — operator can manually dispatch as a fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:31:13 -07:00
Hongming Wang
a495b86a06 test(e2e): poll-mode + since_id cursor round-trip (#2339 PR 4)
End-to-end coverage for the canvas-chat unblocker. Exercises every
moving part of the #2339 stack against a real platform instance:

Phase 1 — register a workspace as delivery_mode=poll WITHOUT a URL;
verify the response carries delivery_mode=poll.
Phase 2 — invalid delivery_mode rejected with 400 (typo defense).
Phase 3 — POST A2A to the poll-mode workspace; verify proxyA2ARequest
short-circuits and returns 200 {status:queued, delivery_mode:poll,
method:message/send} without ever resolving an agent URL.
Phase 4 — verify the queued message appears in /activity?type=a2a_receive
with the right method + payload (the polling agent reads from here).
Phase 5 — since_id cursor returns ASC-ordered rows STRICTLY AFTER the
cursor; the cursor row itself must NOT be replayed. Sends two
follow-up messages and asserts ordering: rows[0] is the older new
event, rows[-1] is the newer (ordering check sketched after the phase list).
Phase 6 — unknown / pruned cursor returns 410 Gone with an explanation.
Phase 7 — cross-workspace cursor isolation: a UUID belonging to one
workspace cannot be used to peek at another workspace's feed (returns
410, same as pruned, no info leak).
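
A minimal sketch of the Phase 5 ordering check; send_message() and
fetch_since() are hypothetical helpers standing in for the script's curl
calls against the tenant API:

```python
# Illustrative only: rows for a since_id cursor must exclude the cursor row
# and come back oldest-first.
def check_since_id(send_message, fetch_since, cursor_id: str) -> None:
    older = send_message("follow-up 1")      # returns the new activity row id
    newer = send_message("follow-up 2")
    rows = fetch_since(cursor_id)
    ids = [row["id"] for row in rows]
    assert cursor_id not in ids              # cursor row is never replayed
    assert ids.index(older) < ids.index(newer)   # ASC: older new event first
```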

Idempotent: per-run unique workspace ids (date+pid). Trap-based cleanup
deletes the test rows on exit; no e2e_cleanup_all_workspaces call (see
feedback_never_run_cluster_cleanup_tests_on_live_platform.md).

Wired into .github/workflows/e2e-api.yml so it runs on every PR that
touches workspace-server/, tests/e2e/, or the workflow file itself —
same gate as the existing test_a2a_e2e + test_notify_attachments suites.

Stacked on #2354 (PR 3: since_id cursor).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:07:10 -07:00
Hongming Wang
db5d11ffca ci: continuous synthetic E2E against staging (#2342)
Hard gate Tier 2 item 2 of 4. Cron-driven full-lifecycle E2E that
catches regressions visible only at runtime — schema drift,
deployment-pipeline gaps, vendor outages, env-var rotations,
DNS / CF / Railway side-effects.

Empirical motivation from today:
  - #2345 (A2A v0.2 silent drop) — passed unit tests, broke at JSON-RPC
    parse layer between sender + receiver. Visible only when a sender
    exercises the full path. Now-fixed by PR #2349, but a continuous
    E2E would have surfaced it within 20 min of the regression.
  - RFC #2312 chat upload — landed staging-branch but never reached
    staging tenants because publish-workspace-server-image was main-
    only. Caught by manual dogfooding hours after deploy. Same pattern.

Both classes are invisible to PR-time CI. The continuous gate fires
every 20 min against a real staging tenant and surfaces regressions
within minutes.

Cadence: cron `0,20,40 * * * *` (3x/hour). Offsets the existing
sweep-cf-orphans (:15) and sweep-cf-tunnels (:45) so the three ops
don't burst CF/AWS APIs at the same minute. Concurrency group
prevents overlapping runs if one hangs.

Cost: ~$0.50-1/day GHA + pennies of staging tenant lifecycle.

Reuses existing tests/e2e/test_staging_full_saas.sh — no new harness
to maintain. Bounded at 10 min wall-clock (vs 15 min default) so
stuck runs fail fast rather than holding up the next firing.
Defaults to E2E_RUNTIME=langgraph (fastest cold start; the regression
classes this gate catches don't need hermes-specific paths). Operators
can dispatch with runtime=hermes when they want SDK-native coverage.

Schedule-vs-dispatch hardening: hard-fail on missing
CP_STAGING_ADMIN_API_TOKEN for cron firing (silent-skip would mask
real outages); soft-skip for operator dispatch.

Refs:
  - #2342 hard-gates Tier 2 item 2
  - #2345 (A2A v0.2 fix that this gate would have caught earlier)
  - #2335 / #2337 (deployment-pipeline gaps that this gate also catches)
2026-04-29 22:04:57 -07:00
Hongming Wang
ea8ff626a9 ci: hard gate against migration version collisions (#2341)
Two PRs targeting staging can each add a migration with the same
numeric prefix (e.g. 044_*.up.sql). Each passes CI independently.
They collide at merge time. Worst case: second migration silently
doesn't apply and prod schema drifts from what the code expects.

Caught manually 2026-04-30 during PR #2276 rebase: 044_runtime_image_pins
collided with 044_platform_inbound_secret from RFC #2312. This workflow
makes that detection automatic at PR-open time.

How it works:
  scripts/ops/check_migration_collisions.py runs on every PR that
  touches workspace-server/migrations/**. For each new/modified
  migration filename, extracts the numeric prefix and checks:

  1. Does the base branch already have a DIFFERENT migration file with
     the same prefix? (PR branched off an old base, base advanced and
     another PR landed the same number — needs rebase.)

  2. Is another OPEN PR (not this one) also adding a migration with
     the same prefix? (Race-window collision — both pass CI separately,
     would collide at merge time.)

Either case → exit 1 with a clear ::error:: message naming the
conflicting PR(s) so the author knows what to renumber.
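
A minimal sketch of the prefix extraction and collision classification
(helper names are illustrative; the real classifier lives in
scripts/ops/check_migration_collisions.py and is what the unit tests below pin):

```python
# Illustrative only: same numeric prefix + different filename = collision;
# the same filename on both sides is an in-place edit, not a collision.
import re
from pathlib import PurePosixPath

PREFIX_RE = re.compile(r"^(\d+)_")

def migration_prefix(path):
    m = PREFIX_RE.match(PurePosixPath(path).name)
    return m.group(1) if m else None

def collisions(pr_files, other_files):
    by_prefix = {}
    for other in other_files:
        p = migration_prefix(other)
        if p:
            by_prefix.setdefault(p, set()).add(PurePosixPath(other).name)
    out = []
    for ours in pr_files:
        p = migration_prefix(ours)
        if not p:
            continue
        for theirs in by_prefix.get(p, ()):
            if theirs != PurePosixPath(ours).name:
                out.append((ours, theirs))
    return out

assert collisions({"migrations/044_runtime_image_pins.up.sql"},
                  {"migrations/044_platform_inbound_secret.up.sql"})
assert not collisions({"migrations/044_x.up.sql"}, {"migrations/044_x.up.sql"})
```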

Implementation notes:
  - Uses git ls-tree (not working-tree walk) so it works against any
    base ref without checkout.
  - Uses gh pr diff --name-only per open PR, bounded by `gh pr list
    --limit 100`. ~30s worst case for a busy repo, <5s normally.
  - --diff-filter=AM picks up Added or Modified — renaming a migration
    in place is also flagged (intentional; renaming migrations isn't
    safe).
  - Same filename in both PR and base = no collision (PR is editing
    in-place, fine).

Tests:
  scripts/ops/test_check_migration_collisions.py — 9 cases on the
  regex classifier (the load-bearing piece). End-to-end git/gh path
  is exercised by running the workflow against real PRs.

Hard-gates Tier 1 item 1 (#2341). Cheapest, cleanest gate. Catches
one specific class of merge-time foot-gun automatically.

Refs hard-gates discussion 2026-04-30. Tier 1 of 4 (others tracked
in #2342, #2343, #2344).
2026-04-29 21:42:42 -07:00
Hongming Wang
856ff89973
Merge pull request #2338 from Molecule-AI/auto/redeploy-main-concurrency-parity
ci: add concurrency block to redeploy-tenants-on-main for parity
2026-04-30 04:16:53 +00:00
Hongming Wang
360361a0ce ci: add concurrency block to redeploy-tenants-on-main for parity
Parity with #2337's redeploy-tenants-on-staging.yml. Both prod and
staging redeploys now have explicit serialization:

  group: redeploy-tenants-on-main          (per-workflow, global)
  group: redeploy-tenants-on-staging       (per-workflow, global)

cancel-in-progress: false on both — aborting a half-rolled-out fleet
would leave tenants stuck on whatever image they happened to be on
when cancelled. Better to finish the in-flight rollout before starting
the next one.

Pre-fix this workflow relied on GitHub's implicit workflow_run queueing,
which is "probably fine" but not defensible — explicit > implicit for
load-bearing pipeline behavior. Picked up as a #2337 review nit
(architecture finding 1: concurrency asymmetry between the two
redeploy workflows).

No behavior change in the common case. The change matters only when
two main pushes land within seconds AND the first redeploy is still
mid-rollout — currently rare; will become more common once #2335
(staging-trigger publish) feeds main more frequently via auto-promote.
2026-04-29 21:14:41 -07:00
Hongming Wang
b7291e006b ci: serialize publish + auto-redeploy staging tenants
Two follow-ups from #2335 review (tracked in #2336):

1. Add `concurrency:` block to publish-workspace-server-image.yml so
   two rapid staging pushes don't race the same :staging-latest retag.
   Group is per-branch (`${{ github.ref }}`) so staging and main can
   build in parallel — they produce different :staging-<sha> tags and
   last-write-wins on :staging-latest is acceptable across branches.
   `cancel-in-progress: false` keeps in-flight builds — partially-pushed
   images would break canary-fleet pin consistency.

2. Add redeploy-tenants-on-staging.yml. After #2335, every staging push
   produces a fresh :staging-latest, but existing tenants only pick it
   up on next reprovision. This workflow mirrors redeploy-tenants-on-
   main but for staging:
   - workflow_run-gated to branches: [staging]
   - target_tag default 'staging-latest' (vs 'latest' for prod)
   - CP_URL default https://staging-api.moleculesai.app
   - CP_STAGING_ADMIN_API_TOKEN repo secret (operator must set)
   - canary_slug empty by default — staging is itself the canary; no
     sub-canary needed inside it. Soak still applies if operator
     specifies a tenant for blast-radius control.

   Schedule-vs-dispatch hardening matches sweep-cf-orphans/sweep-cf-
   tunnels: hard-fail on auto-trigger when secret missing so misconfig
   doesn't silently leave staging tenants on stale code; soft-skip on
   operator dispatch.

Operator action required after merge:
  Add CP_STAGING_ADMIN_API_TOKEN repo secret. Pull value from staging-
  CP's CP_ADMIN_API_TOKEN env in Railway controlplane / staging
  environment. Until set, the auto-trigger will fail the workflow run
  (visible as red CI), surfacing the misconfiguration. Workflow runs
  only on staging publish-workspace-server-image success, so no extra
  load while it sits unconfigured.

Verification:
- YAML lint clean on both workflows.
- Reviewed redeploy-tenants-on-main as template; differences are scoped
  to staging-specific values (URL, tag, secret name) + harden-on-missing-
  secret pattern.

Refs #2335, #2336.
2026-04-29 21:11:45 -07:00
Hongming Wang
2e1cef324b ci: trigger publish-workspace-server-image on staging push too
Root cause: this workflow only triggered on `branches: [main]`, but
staging-CP pins TENANT_IMAGE=:staging-latest (verified via Railway).
:staging-latest was only retagged on main push, so:

  staging-branch code → never built → never reaches staging tenants
  staging-CP serves   → "yesterday's main" indefinitely

When staging→main was wedged (path-filter parity bug, canvas teardown
race — both fixed earlier today), :staging-latest stopped updating
entirely. RFC #2312 (chat upload HTTP-forward) landed on staging but
freshly-provisioned staging tenants kept failing chat upload because
they pulled a pre-RFC-#2312 image. Verified by tearing down a fresh
tenant and observing the legacy "workspace container not running"
error from the docker-exec code path that RFC #2312 deleted.

Pre-2026-04-24 there was a related-but-different incident: TENANT_IMAGE
was a static :staging-<sha> pin that drifted 10 days behind. This new
incident is "the dynamic pin still drifts when its update workflow
doesn't fire."

Fix: add `staging` to the branches trigger. Tag policy is unchanged
(:staging-<sha> + :staging-latest on every push). canary-verify.yml
still runs on main push (workflow_run-gated to `branches: [main]`),
preserving the canary-verified :latest promotion for prod tenants.

Steady state after this:
  - staging push → :staging-latest = staging-branch code → staging-CP
  - main push    → :staging-<sha> for canary, :staging-latest retag
                   (post-promote main code), and after canary green
                   → :latest for prod tenants

What this does NOT change:
  - canary-verify.yml flow (still main-only)
  - redeploy-tenants-on-main.yml (still rolls prod fleet on main push)
  - publish-canvas-image.yml (self-hosted standalone canvas; orthogonal)
  - The :latest tag (canary-verified main, unchanged)

What this does fix:
  - RFC #2312-class fixes that land on staging now actually reach
    staging tenants without waiting for staging→main promote.
  - The dogfooding observation "staging tenants seem to be running
    yesterday's code" disappears as a class.

Drive-by: also fixed the typo in the path-filter list (was
`publish-platform-image.yml`, the actual file is
`publish-workspace-server-image.yml`).
2026-04-29 21:00:56 -07:00
Hongming Wang
3a6d2f179d feat(ops): add sweep-cf-tunnels janitor — orphan Cloudflare Tunnels accumulate
CP's tenant-delete cascade removes the DNS record (with sweep-cf-orphans
as a backstop) but does NOT delete the underlying Cloudflare Tunnel.
Each E2E provision creates one Tunnel named `tenant-<slug>`; without
cleanup these accumulate indefinitely on the account, consuming the
tunnel quota and cluttering the dashboard.

Observed 2026-04-30: dozens of `tenant-e2e-canvas-*` tunnels in Down
state with zero replicas, weeks past their tenant's deletion. Same
class of bug as the DNS-records leak that drove sweep-cf-orphans
(controlplane#239).

Parallel-shape to sweep-cf-orphans:
  - Same dry-run-by-default + --execute pattern
  - Same MAX_DELETE_PCT safety gate (default 90% — higher than DNS
    sweep's 50% because tenant-shaped tunnels are orphans by design)
  - Same schedule/dispatch hardening (hard-fail on missing secrets
    when scheduled, soft-skip when dispatched)
  - Cron offset to :45 to avoid CF API bursts colliding with the DNS
    sweep at :15

Decision rules (in order):
  1. Name doesn't match `tenant-<slug>` → keep (unknown — never sweep
     tunnels that might belong to platform infra).
  2. Tunnel has active connections (status=healthy or non-empty
     connections array) → keep (defense-in-depth: don't kill a live
     tunnel even if CP forgot the org).
  3. Slug ∈ {prod_slugs ∪ staging_slugs} → keep.
  4. Otherwise → delete (orphan).
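
A minimal Python sketch of those rules in order (the real janitor is a shell
script; the tunnel record fields and helper names are assumptions about the
CF API response shape):

```python
# Illustrative only: keep/delete decision for one tunnel record.
import re

TENANT_NAME = re.compile(r"^tenant-(?P<slug>[a-z0-9-]+)$")

def decide(tunnel: dict, known_slugs: set) -> str:
    m = TENANT_NAME.match(tunnel.get("name", ""))
    if not m:
        return "keep"                       # 1. not tenant-shaped: never sweep
    if tunnel.get("status") == "healthy" or tunnel.get("connections"):
        return "keep"                       # 2. live connections: never kill
    if m.group("slug") in known_slugs:
        return "keep"                       # 3. slug still known to prod/staging CP
    return "delete"                         # 4. orphan

assert decide({"name": "infra-edge"}, set()) == "keep"
assert decide({"name": "tenant-e2e-x", "status": "down"}, set()) == "delete"
assert decide({"name": "tenant-hongming", "status": "down"}, {"hongming"}) == "keep"
```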

Verified by:
  - shell syntax check (bash -n)
  - YAML lint
  - Decide-logic offline smoke (7 cases, all pass)
  - End-to-end dry-run smoke with stubbed CP + CF APIs

Required secrets (added to existing org-secrets):
  CF_API_TOKEN          must include account:cloudflare_tunnel:edit
                        scope (separate from zone:dns:edit used by
                        sweep-cf-orphans — same token if scope is
                        broad, or a new token if narrowly scoped).
  CF_ACCOUNT_ID         account that owns the tunnels (visible in
                        dash.cloudflare.com URL path).
  CP_PROD_ADMIN_TOKEN   reused from sweep-cf-orphans.
  CP_STAGING_ADMIN_TOKEN reused from sweep-cf-orphans.

Note: CP-side root cause (tenant-delete should cascade to tunnel
delete) is in molecule-controlplane and worth fixing separately. This
janitor is the operational backstop in the meantime — same pattern
applied to DNS records when the same root cause was unaddressed.
2026-04-29 19:42:47 -07:00
Hongming Wang
15b98c4916 fix(e2e-canvas): kill teardown race that poisons concurrent runs
Setup wrote .playwright-staging-state.json at the END (step 7), only
after org create + provision-wait + TLS + workspace create + workspace-
online all succeeded. If setup crashed at steps 1-6, the org existed in
CP but the state file did not, so Playwright's globalTeardown bailed
out ("nothing to tear down") and the workflow safety-net pattern-swept
every e2e-canvas-<today>-* org to compensate. That sweep deleted
concurrent runs' live tenants — including their CF DNS records —
causing victims' next fetch to die with `getaddrinfo ENOTFOUND`.

Race observed 2026-04-30 on PR #2264 staging→main: three real-test
runs killed each other mid-test, blocking 68 commits of staging→main
promotion.

Fix: write the state file as setup's first action, right after slug
generation, before any CP call. Now:

  - Crash before slug gen        → no state file, no orphan to clean
  - Crash during steps 1-6       → state file has slug; teardown deletes
                                   it (DELETE 404s if org never created)
  - Setup completes              → state file has full state; teardown
                                   deletes the slug

The workflow safety-net no longer pattern-sweeps; it reads the state
file and deletes only the recorded slug. Concurrent canvas-E2E runs no
longer poison each other.

Verified by:
  - tsc --noEmit on staging-setup.ts + staging-teardown.ts
  - YAML lint on e2e-staging-canvas.yml
  - Code review: state file write moved to line 113 (post-makeSlug,
    pre-CP) with the original line-249 write retained as a "promote
    to full state" overwrite at the end
2026-04-29 19:23:56 -07:00
Hongming Wang
c8205b009a ci: daily Railway pin-audit cron + issue-on-failure (#2169)
Acceptance criterion 3 of #2001 ("CI check that fails if TENANT_IMAGE
contains a SHA-shaped suffix") was deferred from PR #2168 because
querying Railway from a GitHub Actions runner needs RAILWAY_TOKEN
plumbed as a repo secret. The detection script + regression test in
#2168 cover detection; this is the automation-cadence layer.
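
A minimal sketch of what "SHA-shaped suffix" detection can look like (the
regex, image name, and allowed tags are assumptions; the real rule lives in
the #2168 detection script):

```python
# Illustrative only: flag a TENANT_IMAGE whose tag ends in a git-SHA-shaped
# hex run (a static pin that will drift), while allowing moving tags.
import re

SHA_SUFFIX = re.compile(r"[0-9a-f]{7,40}$")

def is_sha_pinned(tenant_image: str) -> bool:
    tag = tenant_image.rsplit(":", 1)[-1]
    return bool(SHA_SUFFIX.search(tag))

assert is_sha_pinned("registry.example/workspace-server:staging-76c604fb")
assert not is_sha_pinned("registry.example/workspace-server:staging-latest")
```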

Daily 13:00 UTC schedule (06:00 PT) + workflow_dispatch. Daily is the
right cadence for variables-tier config — Railway env var changes are
deliberate operator actions, low-frequency. Hourly would risk Railway
API rate-limit surprises.

Issue-on-failure pattern mirrors e2e-staging-sanity.yml — drift opens
a `railway-drift` priority-high issue (or comments on the open one),
and a subsequent clean run auto-closes it with a "drift resolved"
comment. No human-in-the-loop needed for the close.

Schedule-vs-dispatch secret hardening per
feedback_schedule_vs_dispatch_secrets_hardening:
- Schedule trigger HARD-FAILS on missing RAILWAY_AUDIT_TOKEN
  (silent-success was the failure mode that bit us before)
- workflow_dispatch SOFT-SKIPS so an operator can dry-run the
  workflow shape during initial token provisioning

Operator action required before this gate is live:
- Provision a Railway API token, read-only `variables` scope on the
  molecule-platform project (id 7ccc8c68-61f4-42ab-9be5-586eeee11768)
- Store as repo secret RAILWAY_AUDIT_TOKEN
- Rotate per the standard 90-day schedule

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 17:43:01 -07:00
Hongming Wang
c79cf1cfa9 ci: collapse two-jobs-sharing-name path-filter pattern in e2e-api/e2e-staging-canvas
Branch protection treats matching-name check runs as a SET — any SKIPPED
member fails the required-check eval, even with SUCCESS siblings. The
two-jobs-sharing-name pattern (no-op + real-job) emits one SKIPPED + one
SUCCESS check run per workflow run; with multiple runs at the same SHA
(detect-changes triggers + auto-promote re-runs) the SET fills with
SKIPPED entries that block branch protection.

Verified live on PR #2264 (staging→main auto-promote): mergeStateStatus
stayed BLOCKED for 18+ hours despite APPROVED + MERGEABLE + all gates
green at the workflow level. `gh pr merge` returned "base branch policy
prohibits the merge"; `enqueuePullRequest` returned "No merge queue
found for branch 'main'". The check-runs API showed `E2E API Smoke
Test` and `Canvas tabs E2E` each had 2 SKIPPED + 2 SUCCESS at head SHA
66142c1e.

Fix: collapse no-op + real-job into ONE job with no job-level `if:`,
gating real work via per-step `if: needs.detect-changes.outputs.X ==
'true'`. The job always runs and emits exactly one SUCCESS check run
under the required-check name regardless of paths-filter outcome —
branch-protection-clean.

Same pattern as ci.yml's earlier conversion of Canvas/Platform/Python/
Shellcheck (PR #2322). Closes the parity-fix that should have been
applied to all four path-filtered required checks at once.
2026-04-29 17:29:44 -07:00