Commit Graph

289 Commits

Author SHA1 Message Date
4b82db72a7 Merge branch 'main' into fix/issue-72-auto-sync-token-canary-v2 2026-05-07 23:44:22 +00:00
ed0874504e Merge branch 'main' into fix/issue75-class-F-gh-run-list-to-statuses 2026-05-07 23:44:00 +00:00
1819ac21f4 Merge branch 'main' into fix/issue75-class-A-gh-pr-to-gitea-rest 2026-05-07 23:37:57 +00:00
d81fb98163 Merge branch 'main' into fix/issue-72-auto-sync-token-canary-v2 2026-05-07 22:53:32 +00:00
4d5c9a6646 Merge branch 'main' into fix/issue75-class-F-gh-run-list-to-statuses 2026-05-07 22:53:26 +00:00
9ecee78782 Merge branch 'main' into fix/issue75-class-A-gh-pr-to-gitea-rest 2026-05-07 22:53:11 +00:00
d21c09babe Merge branch 'main' into fix/195-auto-promote-staging-gitea-rest 2026-05-07 22:53:00 +00:00
85140f1c72 Merge branch 'main' into fix/issue-72-auto-sync-token-canary-v2 2026-05-07 22:40:56 +00:00
5b3ce5c818 fix(ci): replace gh run list with Gitea commit-status query (#75 class F)
Part of the post-#66 sweep to remove `gh` CLI dependencies that fail
silently against Gitea. Class F covers `gh run list --workflow=X
--commit=SHA` shapes — querying whether a specific workflow ran (and
how it finished) for a specific SHA.

Why this is the only call site in class F:

`gh run list` hits GitHub's `/repos/.../actions/runs` REST endpoint.
Gitea exposes ZERO endpoints under `/repos/.../actions/runs` —
verified 2026-05-07 via swagger inspection: only secrets, variables,
and runner-registration tokens live under /actions/. There's no way
to query workflow run state via the Gitea v1 API directly.

However, every Gitea Actions job DOES emit a commit status with
`context = "<Workflow Name> / <Job Name> (<event>)"` (verified
2026-05-07 by reading /repos/.../commits/{sha}/statuses on a recent
main SHA). That surface is exactly what we need: each workflow run
leg is one status row, the aggregate state encodes the run outcome,
and Gitea exposes it under `/api/v1/repos/.../commits/{sha}/statuses`
which IS available.

Affected:

`auto-promote-on-e2e.yml` (lines 172-180):
  Old: `gh run list --workflow e2e-staging-saas.yml --commit $SHA
       --json status,conclusion --jq ...` returning a 5-bucket string
       like `completed/success` | `in_progress/none` | `none/none` |
       `completed/failure` | `completed/cancelled`.
  New: `curl /api/v1/repos/.../commits/$SHA/statuses` + jq filter on
       contexts whose name starts with
       `"E2E Staging SaaS (full lifecycle) /"`. Mapping:
         0 matched contexts          → "none/none"         (E2E paths-filtered out — same as before)
         any context = pending       → "in_progress/none"  (defer)
         any context = error|failure → "completed/failure" (abort)
         all contexts = success      → "completed/success" (proceed)
  The `completed/cancelled` arm of the case statement becomes
  unreachable: Gitea status API doesn't expose a `cancelled` state
  (it has success/failure/error/pending/warning), so per-SHA
  concurrency cancellations now surface as `failure` and are handled
  by the failure branch. Documented in-place; the cancelled arm is
  kept as defense-in-depth for any future dual-host operation.
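The state→bucket mapping above can be sketched as a small pure function — a Python stand-in for the shipped jq filter, assuming `contexts` is the list of status states already filtered to the `"E2E Staging SaaS (full lifecycle) /"` prefix (function name and shape are illustrative, not the workflow's actual code):

```python
# Python stand-in for the jq mapping described above. `contexts` holds the
# Gitea status states (success/failure/error/pending) for matched contexts.
def bucket(contexts):
    if not contexts:
        return "none/none"            # E2E paths-filtered out
    if any(s == "pending" for s in contexts):
        return "in_progress/none"     # defer
    if any(s in ("error", "failure") for s in contexts):
        return "completed/failure"    # abort (includes cancelled-as-failure)
    if all(s == "success" for s in contexts):
        return "completed/success"    # proceed
    return "completed/failure"        # unknown states fail closed
```

The four synthetic-input buckets in the Verification section below exercise exactly these branches.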

Verification:

- Live curl against the current main SHA returns `none/none` (E2E
  was paths-filtered for that change set — expected).
- Synthetic-input jq tests verify all four mapping buckets:
    no contexts                 → "none/none"
    one context = pending       → "in_progress/none"
    success + success           → "completed/success"
    success + failure           → "completed/failure"
- YAML syntax validates.

Token: continues to use act_runner's GITHUB_TOKEN (per-run, repo
read scope). The `/commits/{sha}/statuses` endpoint is repo-scoped,
no extra perms needed.

Closes part of #75 (the master tracking issue); companion PRs:
#80 (class A — `gh pr ...`), #81 (class D — `gh api ...`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 15:38:57 -07:00
bcc72419ce Merge branch 'main' into fix/144-branch-protection-check-name-parity-audit 2026-05-07 22:35:33 +00:00
e4e1bf4080 ci(canary): annotate EXPECTED_PERSONA dual-update constraint
Hostile-self-review weakest-spot #2: if the devops-engineer persona
is ever renamed, the canary will go red even if everything else is
fine. Add an inline comment pointing the next editor at both files
that must update together (auto-sync-main-to-staging.yml's git
config + this canary's EXPECTED_PERSONA + the staging branch
protection's push_whitelist_usernames).

No behaviour change — comment-only.
2026-05-07 15:35:22 -07:00
62629eda4a ci(canary): rewrite Probe 3 to actually validate auth (NOP push --dry-run)
While verifying Phase 4, found a real flaw in Probe 3 (`git ls-remote
refs/heads/staging`). On a public repo (which molecule-core is), Gitea
falls back to anonymous read on bad auth, so `ls-remote` succeeds even
with a junk token. The probe was therefore green-lighting rotated
tokens — false-green, the worst possible canary failure mode.

Rewritten to use `git push --dry-run` of the current staging SHA back
to `refs/heads/staging`:

- Push always authenticates: auth happens during the smart-protocol
  handshake, before the dry-run ever computes the empty diff.
- NOP by construction: pushing the current tip back to itself is
  "Everything up-to-date" with exit 0.
- Bad token → "Authentication failed", exit 128.
- Doesn't reach pre-receive (where branch-protection authz runs), so
  scope is "auth only" — matches the design intent (failure mode B);
  authz already covered daily by branch-protection-drift.yml.

Implementation note: `git push` requires a local repo. Spinning up a
fresh `git init` in a tempdir (~1KB, ~50ms) instead of pulling the
full repo via actions/checkout — actions/checkout would clone
~hundreds of MB for what amounts to "a place to run git from."

Local mutation tests pass:
- Real token: "Everything up-to-date" exit 0
- Junk token: "Authentication failed" exit 128 with actionable
  ::error:: messages pointing at the runbook

Header comment + runbook step-mapping updated to reflect new probe
shape. Refs: #72
2026-05-07 15:34:34 -07:00
e075557b19 fix(ci): replace gh pr CLI with Gitea v1 REST in workflows + scripts (#75 class A)
Part of the post-#66 sweep to remove `gh` CLI dependencies that fail
silently against Gitea (which exposes /api/v1 only — no GraphQL → 405,
no /api/v3 → 404). Class A covers `gh pr list / view / diff / comment`
shapes.

Affected:

- `.github/workflows/auto-tag-runtime.yml`
  Replaced `gh pr list --search SHA --json number,labels` with a curl
  to `/api/v1/repos/.../pulls?state=closed&sort=newest&limit=50` +
  jq filter on `merge_commit_sha == github.sha`. Same end-to-end
  behaviour: locate the merged PR for this push, read its labels,
  pick the bump kind. Defensive `?.name // empty` jq guard handles
  unlabelled PRs without erroring. The 50-PR window is comfortably
  larger than the volume of staging→main promotes that close in any
  reasonable detection window.
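The replacement lookup can be sketched as follows — a hedged Python stand-in for the curl+jq pipeline, where the `bump:<kind>` label naming and the `"patch"` default are illustrative assumptions (the commit does not spell out the label scheme):

```python
def find_bump(prs, sha):
    """Locate the merged PR whose merge_commit_sha matches this push's SHA.

    `prs` mimics /api/v1/repos/.../pulls?state=closed&sort=newest&limit=50.
    The `?.name // empty` jq guard becomes a None-safe lookup here, so
    unlabelled PRs (labels == None) don't raise.
    """
    for pr in prs:
        if pr.get("merge_commit_sha") != sha:
            continue
        labels = pr.get("labels") or []           # unlabelled -> empty list
        names = {l.get("name") for l in labels if l.get("name")}
        for kind in ("major", "minor", "patch"):  # hypothetical label scheme
            if f"bump:{kind}" in names:
                return kind
        return "patch"                            # assumed default bump kind
    return None                                   # no merged PR for this SHA
```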

- `scripts/check-stale-promote-pr.sh`
  Rewrote `fetch_prs` and `post_comment` to call Gitea's REST API
  directly. Gitea doesn't expose GitHub's compound `mergeStateStatus`
  / `reviewDecision` fields, so the new fetcher pulls
  `/pulls?state=open&base=main` then for each PR pulls
  `/pulls/{n}/reviews` and synthesizes the GitHub-shape JSON the rest
  of the script (and the existing fixture-based unit tests) consume:
    BLOCKED + REVIEW_REQUIRED  ↔ mergeable=true AND 0 APPROVED reviews
    DIRTY                      ↔ mergeable=false (alarm doesn't fire)
    CLEAN + APPROVED           ↔ mergeable=true AND ≥1 APPROVED review
  Comment-posting moves to `POST /repos/.../issues/{n}/comments`
  (Gitea treats PRs as issues for the comment surface, same as
  GitHub's REST). All 23 fixture-driven unit tests still pass —
  fixtures pass GitHub-shape JSON via PR_FIXTURE which short-circuits
  the live fetch path.
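The three-row synthesis table above can be expressed as a small function — a sketch of the fetcher's mapping, assuming Gitea review objects carry `state == "APPROVED"` and pulls carry a boolean `mergeable` (names per the commit text; not the script's literal code):

```python
def synthesize(pr, reviews):
    """Map a Gitea pull + its reviews onto the GitHub-shape compound fields
    (mergeStateStatus / reviewDecision) that the rest of the script consumes."""
    approved = sum(1 for r in reviews if r.get("state") == "APPROVED")
    if not pr.get("mergeable", False):
        return {"mergeStateStatus": "DIRTY", "reviewDecision": None}
    if approved == 0:
        return {"mergeStateStatus": "BLOCKED",
                "reviewDecision": "REVIEW_REQUIRED"}
    return {"mergeStateStatus": "CLEAN", "reviewDecision": "APPROVED"}
```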

- `scripts/ops/check_migration_collisions.py`
  Replaced `gh pr list` + `gh pr diff` calls with stdlib `urllib`
  against /api/v1. Helper `_gitea_get` centralizes auth + error
  handling; uses GITEA_TOKEN env, falling back to GITHUB_TOKEN
  (act_runner) and GH_TOKEN. Return shape from
  `open_prs_with_migration_prefix` mimics the historical
  `--json number,headRefName` so the call sites are unchanged. All 9
  regex-classifier unit tests still pass; live integration test
  against the production Gitea API returns 0 collisions for prefix=999
  as expected.

curl invocation pattern is `curl --fail-with-body -sS` (NOT `-fsS` —
`-f`/`--fail` and `--fail-with-body` are mutually exclusive in modern
curl; caught by `curl: You must select either --fail or
--fail-with-body, not both` during local verification).

Token model: workflows pass act_runner's GITHUB_TOKEN (per-run, repo
read scope) — same surface used by the auto-sync fix in PR #66 plus
the surrounding workflows. No new repo secrets required.

Verification: bash unit tests (23/23 pass), python unittest (9/9 pass),
live curl call against production Gitea returns 200 with the expected
shape, YAML / shell / Python syntax all validate.

Closes part of #75. Other classes (D — `gh api`; F — `gh run list`)
land in follow-up PRs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 15:29:26 -07:00
0cef033a6a ci(canary): route curl -w to tempfile to satisfy status-capture lint
The two API probes used the unsafe shape rejected by
lint-curl-status-capture.yml (per feedback_curl_status_capture_pollution):

  status=$(curl ... -w '%{http_code}' ... || echo "000")

When curl exits non-zero (transport error, --fail-with-body 4xx/5xx),
the `-w` already wrote a code; the `|| echo "000"` then APPENDS another
"000", yielding "000000" or "409000" — passes shape checks while looking
right.

Switch to the canonical safe shape (set +e + tempfile + cat):

  set +e
  curl ... -w '%{http_code}' >code_file 2>/dev/null
  set -e
  status=$(cat code_file 2>/dev/null || true)
  [ -z "$status" ] && status="000"

Inline comment in both probe steps explains the lint constraint so
the next editor doesn't re-introduce the bad pattern.

Refs: #72, lint failure on PR #77 (1/22 red → 22/22 expected green)
2026-05-07 15:26:22 -07:00
b83b533381 Merge branch 'main' into fix/144-branch-protection-check-name-parity-audit 2026-05-07 22:24:45 +00:00
a23cf6a6bb Merge branch 'main' into fix/harness-replays-pre-clone-manifest 2026-05-07 22:24:42 +00:00
6acd63fa5a fix(ci): rewrite auto-promote staging→main for Gitea REST API
Root cause: same as #65/PR-#66 — gh CLI calls Gitea GraphQL
(/api/graphql) which returns HTTP 405. Additionally, gh workflow
run calls /actions/workflows/{id}/dispatches which does not
exist on Gitea 1.22.6 (verified via swagger.v1.json).

Fix:
- Replace gh run list with Gitea REST combined-status endpoint
  (GET /repos/{owner}/{repo}/commits/{ref}/status). Combined state
  encodes the AND across every check context — simpler than the
  per-workflow loop and immune to workflow-name collisions.
- Replace gh pr create / merge --auto with direct curl calls to
  POST /pulls and POST /pulls/{N}/merge with merge_when_checks_succeed.
- Remove the post-merge polling tail entirely. The GitHub-era
  GITHUB_TOKEN no-recursion rule does not apply on Gitea Actions
  (verified empirically: PR #66 merge fired downstream pushes
  naturally). Even if we wanted to dispatch, Gitea has no
  workflow_dispatch REST endpoint.
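The two REST payloads can be sketched as plain dicts — field names follow the commit text plus standard Gitea option shapes (`head`/`base`/`title` for pull creation, `Do` + `merge_when_checks_succeed` for the scheduled merge); treat the exact field set as an assumption, not the workflow's literal body:

```python
import json

def pull_create_payload(head, base="main", title=""):
    # Body for POST /repos/{owner}/{repo}/pulls
    return {"head": head, "base": base, "title": title}

def merge_schedule_payload():
    # Body for POST /repos/{owner}/{repo}/pulls/{N}/merge — schedules the
    # merge to fire once every required check context is green.
    return {"Do": "merge", "merge_when_checks_succeed": True}
```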

Critical constraint: main has enable_push: false with no whitelist;
direct push is impossible for any persona. PR-mediated merge is the
only path. main has required_approvals: 1 — auto-merge waits for
Hongming's approval before landing, preserving the
feedback_prod_apply_needs_hongming_chat_go contract.

Identity: AUTO_SYNC_TOKEN (devops-engineer persona). Not founder PAT.
Per feedback_per_agent_gitea_identity_default. Same persona used by
auto-sync (PR #66) — keeps identity model coherent.

Header comment block fully rewritten with 4 failure-mode runbooks
(A: gates not green, B: PR-create non-201, C: merge schedule fails,
D: token rotated/scope wrong) per PR #66's pattern.

Refs: #65, #73, #195, PR #66 (canonical reference)
Closes #73

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 15:24:28 -07:00
bfc393c065 ci: add AUTO_SYNC_TOKEN rotation drift canary (#72)
Adds a 6h-cron synthetic check that fires the auth surface used by
auto-sync-main-to-staging.yml (PR #66) and emits a red workflow
status when AUTO_SYNC_TOKEN has drifted out of validity. Closes
hostile-self-review weakest-spot #3 from PR #66 (token-rotation
detection latency).

Read-only verification — no writes, no synthetic merge commits, no
canary branch noise. Three probes:
  1. GET /api/v1/user → token authenticates as devops-engineer
  2. GET /api/v1/repos/molecule-ai/molecule-core → read:repository scope
  3. git ls-remote refs/heads/staging → exact HTTPS auth path used by
     actions/checkout in the real auto-sync workflow

Hard-fail on missing AUTO_SYNC_TOKEN secret on both schedule and
workflow_dispatch — per feedback_schedule_vs_dispatch_secrets_hardening,
a silent soft-skip would make the canary itself drift-invisible (the
sweep-cf-orphans #2088 lesson). Operator runbook in workflow header.

Token reuse: same AUTO_SYNC_TOKEN as the workflow under monitor; no
new credential introduced. Read-only paths only.

Refs: #72, hostile-self-review #66
2026-05-07 15:23:03 -07:00
6235ef7461 fix(ci): rewrite auto-sync main→staging for Gitea direct push
Root cause of `Auto-sync main → staging / sync-staging (push)`
failing every push to main since the GitHub→Gitea migration:

The workflow assumed a GitHub `merge_queue` ruleset on staging
(blocking direct push) and used `gh pr create` + `gh pr merge
--auto` to land sync via the queue. On Gitea this fails at the
`gh pr create` step with `HTTP 405 Method Not Allowed
(https://git.moleculesai.app/api/graphql)` — Gitea exposes no
GraphQL endpoint, and the GitHub-CLI cannot ship PRs against
Gitea.

Verified failure mode in run 1117/job 0 (token logs at
/tmp/log2.txt, run target
/molecule-ai/molecule-core/actions/runs/1117/jobs/0). The merge
step succeeded and pushed
auto-sync/main-1e1f4d63; the PR step failed with the 405. So
every main push left an orphan auto-sync/* branch and a red CI
status, with no PR to land it.

Fix: the staging branch protection on Gitea
(`enable_push: true`, `push_whitelist_usernames:
[devops-engineer]`) already permits direct push from the
devops-engineer persona. Drop the entire merge-queue PR
architecture and replace with:

  1. Checkout staging with secrets.AUTO_SYNC_TOKEN
     (devops-engineer persona token, NOT founder PAT —
     `feedback_per_agent_gitea_identity_default`).
  2. `git fetch origin main` + ff-merge or no-ff merge.
  3. `git push origin staging` directly.

The AUTO_SYNC_TOKEN repo secret already exists (created
2026-05-07 14:00 alongside the staging push_whitelist update).
Workflow name + job name unchanged → required-check name
`Auto-sync main → staging / sync-staging (push)` keeps the
same context, no branch-protection edits needed.

Rejected alternatives (documented in workflow header):
- Reuse PR architecture via Gitea REST: ~80 LOC of API
  plumbing for no benefit; direct push works.
- GH_HOST=git.moleculesai.app: still calls /api/graphql,
  same 405; doesn't fix the root issue.
- Custom JS action: external dep for a 5-line `git push`.

Header comment in the workflow now documents:
- What this workflow does (SSOT for staging advancing).
- Why direct push (GitHub merge_queue → Gitea push_whitelist).
- Identity and token (anti-bot-ring per saved memory).
- Failure modes A–D with operator runbook for each.
- Loop safety (push to staging doesn't fire push:main → no
  recursion).

Verification plan: this fix-PR's merge to main is itself the
trigger; watch the workflow run on the merge commit and on
one follow-up trigger commit, expect both green.

Refs: failing run
https://git.moleculesai.app/molecule-ai/molecule-core/actions/runs/1117/jobs/0

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 15:04:12 -07:00
Hongming Wang
7c6acc18ae ci(branch-protection): check-name parity gate (#144)
Audit finding: every workflow that emits a required-status-check name
on molecule-core's branch protection (apply.sh's STAGING_CHECKS +
MAIN_CHECKS) ALREADY uses the safe always-runs-with-conditional-steps
shape — Platform/Canvas/Python/Shellcheck in ci.yml, Canvas tabs E2E
in e2e-staging-canvas.yml, E2E API Smoke in e2e-api.yml, PR-built
wheel in runtime-prbuild-compat.yml, the codeql Analyze matrix, and
the always-on Secret scan + Detect changes. No production drift to
fix today.

Adds a regression-guard so the next path-filter / matrix refactor /
workflow rename can't silently re-introduce the bug shape called out
in saved memory feedback_branch_protection_check_name_parity:

  "Path filters … silently break branch protection because no job
   emits the protected sentinel status when path-filter returns false."

New tools:
  - tools/branch-protection/check_name_parity.sh — extracts every
    required check name from apply.sh's heredocs, then for each name
    classifies the owning workflow as safe (no top-level paths:) /
    safe (per-step if-gates without top-level paths:) / unsafe
    (top-level paths: without per-step if-gates) / unsafe-mix
    (top-level paths: WITH per-step if-gates — the workflow may still
    skip entirely on path exclusion, leaving the gates dormant) /
    missing (no emitter at all). Special-cases codeql.yml's
    matrix-expanded `Analyze (${{ matrix.language }})`.
  - tools/branch-protection/test_check_name_parity.sh — 6 unit tests
    covering each classification: safe, unsafe-path-filter, missing,
    safe-with-per-step-gates, unsafe-mix, matrix-expansion. Each test
    builds a synthetic apply.sh + workflow file in a tmpdir, invokes
    the script, and asserts on exit code + stderr substring. Per
    feedback_assert_exact_not_substring the assertions pin specific
    classifications, not just non-zero exit.
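The classification rules above reduce to a small decision function — a Python sketch of the script's logic (the real tool derives these booleans by parsing workflow YAML; names here are illustrative):

```python
def classify(has_top_level_paths, has_per_step_if_gates, has_emitter=True):
    """Classify a required-check-emitting workflow per the rules above."""
    if not has_emitter:
        return "missing"          # no workflow emits this check name at all
    if has_top_level_paths and has_per_step_if_gates:
        return "unsafe-mix"       # may skip entirely; gates left dormant
    if has_top_level_paths:
        return "unsafe"           # path filter can suppress the sentinel
    if has_per_step_if_gates:
        return "safe (per-step if-gates)"
    return "safe"
```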

Wired into branch-protection-drift.yml so every PR touching
.github/workflows/** runs the parity check; the existing daily
schedule covers between-PR drift. The check is cheap (~1s) and runs
without the admin token — only reads files in the checkout. A
self-test step runs the unit tests on every invocation, so a
regression in the script can't false-pass on production.

Per BSD-vs-GNU portability hygiene: heredoc-marker extraction stays
in plain awk + sed (no gawk-only `match()` array form), grep regex
avoids `^` anchor for `if:` lines because real workflows use
`      - if:` with the `-` step-marker between leading spaces and
`if:` (the original anchor missed every workflow's per-step gates).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 14:42:50 -07:00
3a00dd236f fix(ci): convert CodeQL workflow to no-op stub on Gitea (#156)
Why
---
PR #35 marked `continue-on-error: true` at the JOB level (correct YAML),
but Gitea Actions 1.22.6 does NOT propagate job-level continue-on-error
to the commit-status API — every matrix leg still posts `failure`. That
keeps OVERALL=failure on every push to main + staging and blocks the
auto-promote signal even when every other gate is green.

Worse: the underlying CodeQL run never actually worked on Gitea. The
github/codeql-action/init@v4 step calls api.github.com bundle endpoints
(CLI download + query packs + telemetry) that Gitea does NOT proxy.
Confirmed via live-tested run 1d/3101 on operator host:

    2026-05-07T20:55:17 ::group::Run Initialize CodeQL
      with: languages: ${{ matrix.language }}
            queries: security-extended
    2026-05-07T20:55:36 ::error::404 page not found
    2026-05-07T20:55:50 Failure - Main Initialize CodeQL
    2026-05-07T20:55:51 skipping Perform CodeQL Analysis (main skipped)
    2026-05-07T20:55:51 ::warning::No files were found at sarif-results/go/

The SARIF artifact upload was already a no-op (warning above) — the
analyze step never wrote anything because init failed. So nothing of
value is being lost by stubbing this out.

What
----
- Convert the workflow to a single-step stub that emits success per
  matrix language (go, javascript-typescript, python).
- Keep workflow `name: CodeQL` exactly (auto-promote-staging.yml
  line 67 keys on it as a workflow_run gate).
- Keep job name template `Analyze (${{ matrix.language }})` and the
  3-leg matrix exactly (commit-status context names + branch
  protection + #144 required-check-name parity).
- Keep all four triggers (push / pull_request / merge_group /
  schedule) so merge_group required-checks parity holds.
- Drop the codeql-action steps, the Autobuild step, the SARIF parse
  step, and the upload-artifact step — all four of those are now
  dead code (init can never succeed against Gitea's API surface).

Policy
------
Per Hongming decision 2026-05-07 (#156): CodeQL is ADVISORY, not
blocking, until a Gitea-compatible SAST pipeline lands. The header
of the new workflow file documents this decision + lists the three
re-enable options (self-hosted Semgrep, Sonatype, GitHub mirror)
plus the compensating controls in place (secret-scan,
block-internal-paths, lint-curl-status-capture,
branch-protection-drift).

Closes #156. Touches #142 (no capital-M Molecule-AI refs in this
file — already lowercase per e01077be).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 14:26:57 -07:00
devops-engineer
229b1a902a fix(ci): pre-clone manifest deps in harness-replays workflow (#173 followup)
harness-replays.yml builds tenant-alpha + tenant-beta via
tests/harness/compose.yml using workspace-server/Dockerfile.tenant.
Post-#173, that Dockerfile expects
.tenant-bundle-deps/{workspace-configs-templates,org-templates,plugins}
pre-cloned at the build context root. Sister
PR #38 added the pre-clone step to publish-workspace-server-image.yml
but missed harness-replays.yml.

Symptoms:
  - main run #892 (2026-05-07T20:28:53Z): COPY
    .tenant-bundle-deps/plugins -> failed to calculate checksum ...
    not found.
  - staging run #964 (2026-05-07T20:41:52Z): hits the OLD in-image
    clone path (staging hasn't picked up the Dockerfile.tenant
    refactor yet via auto-sync) and fails on
    'fatal: could not read Username for https://git.moleculesai.app'
    when cloning the first private workspace-template-* repo.

Fix: add the same Pre-clone step to harness-replays.yml,
mirroring publish-workspace-server-image.yml. Uses AUTO_SYNC_TOKEN
(devops-engineer persona PAT) per
feedback_per_agent_gitea_identity_default.

Once auto-sync main->staging unblocks (sister agent fixing the
7-file conflict in flight), staging will inherit both this workflow
fix AND the Dockerfile.tenant refactor atomically.

Refs: #168, #173
2026-05-07 14:26:52 -07:00
devops-engineer
194cdf012b chore(ci): retrigger publish-workspace-server-image after ECR repo create (#173)
Run #1010 (post-#46) succeeded all the way to push but failed with
"repository molecule-ai/platform does not exist" — the platform image
ECR repo had never been created (only platform-tenant existed).

Created the repo via:

    aws ecr create-repository --region us-east-2 \
      --repository-name molecule-ai/platform \
      --image-scanning-configuration scanOnPush=true

This is a one-line workflow comment to satisfy the path-filter and
re-run the publish workflow against the now-existing repo. Closes #173
properly this time — pre-clone + inline ECR auth + ECR repo all in
place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 13:54:11 -07:00
devops-engineer
f0e8d9bb23 fix(ci): inline aws ecr get-login-password + docker login (followup #173)
CI run #987 (post-#45) showed `docker push` from shell still hits
"no basic auth credentials" — `aws-actions/amazon-ecr-login@v2`
writes auth to a step-scoped DOCKER_CONFIG that doesn't carry across
to the next shell step on Gitea Actions.

Fix: drop both `aws-actions/configure-aws-credentials@v4` and
`aws-actions/amazon-ecr-login@v2`. Run `aws ecr get-login-password |
docker login` inline in the same shell step as `docker build` +
`docker push`. AWS creds come from secrets via env vars, ECR token
is fresh per-step (12h validity is plenty), config.json lives in the
same shell process — auth state is guaranteed.

This is the operator-host manual approach mapped 1:1 into CI.
runner-base image already has aws-cli + docker (verified locally).

Closes #173 (fifth piece — and final, this matches the manual flow
exactly).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 13:49:12 -07:00
devops-engineer
43e2d24c5b fix(ci): replace buildx with plain docker build+push (followup #173)
CI run #946 (post-#43) confirmed `driver: docker` doesn't fix the ECR
push 401 either: buildx CLI inside the runner container talks to the
operator-host docker daemon (mounted socket), but the daemon doesn't
see the runner's ECR auth state, and the runner's buildx CLI doesn't
attach the auth header in a way the daemon accepts.

Drop buildx + build-push-action entirely. Plain `docker build` +
`docker push` from the runner container works because both use the
SAME docker socket + the SAME runner-container config.json (populated
by `aws ecr get-login-password | docker login` from amazon-ecr-login).

Trade-off: lose multi-arch support. We only ship linux/amd64 tenant
images today, so this is fine. If multi-arch becomes a requirement
later, we can revisit (likely with `docker buildx create
--driver=remote` pointing at an external buildkit, but that's
substantial infra work; not worth it for a single-arch shop).

Closes #173 (fourth piece — and hopefully last; this matches the
operator-host manual approach exactly).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 13:43:50 -07:00
devops-engineer
bee4f9ea79 fix(ci): use docker driver for buildx + drop type=gha cache (followup #173)
PR #38 + #41 fixed the Dockerfile-side clone issue. CI run #893 then
revealed two Gitea-Actions-specific issues with the unchanged buildx
config:

1. `failed to push: 401 Unauthorized` to ECR. Root cause: default
   buildx driver `docker-container` spawns a buildkit container that
   doesn't share the host's `~/.docker/config.json`, so the ECR auth
   set up by amazon-ecr-login doesn't reach the push. Fix: pin
   `driver: docker` so buildx delegates to the host daemon, which
   already has the ECR creds.

2. `dial tcp ...:41939: i/o timeout` on `_apis/artifactcache/cache`.
   Root cause: `cache-from/cache-to: type=gha` is GitHub-specific;
   Gitea Actions has no compatible artifact-cache backend, so every
   cache lookup fails after a 30s timeout. Fix: remove the cache-*
   options. Cold-build cost is <10min for 37-repo clone + Go/Node
   compile, acceptable. Could revisit with type=registry inline cache
   later if rebuilds get painful.

With this + #38/#41, the workflow should run end-to-end on Gitea
Actions: pre-clone -> docker build (host daemon) -> ECR push.

Closes #173 (third and final piece).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 13:35:07 -07:00
devops-engineer
a6d67b4c68 fix(ci): pre-clone manifest deps in workflow, drop in-image clone (closes #173)
publish-workspace-server-image.yml could not run on Gitea Actions because
Dockerfile.tenant's stage 3 ran `git clone` against private Gitea repos
from inside the Docker build context, where no auth path exists. Every
workspace-server rebuild required a manual operator-host push.

Move cloning to the trusted CI context (where AUTO_SYNC_TOKEN — the
devops-engineer persona PAT — is naturally available). Dockerfile.tenant
now COPYs from .tenant-bundle-deps/, populated by the workflow's new
"Pre-clone manifest deps" step. The Gitea token never enters the image.

- scripts/clone-manifest.sh: optional MOLECULE_GITEA_TOKEN env embeds
  basic-auth in the clone URL; redacted in log output. Anonymous fallback
  preserved for future public-repo path.
- .github/workflows/publish-workspace-server-image.yml: new pre-clone
  step before docker build; injects AUTO_SYNC_TOKEN. Fail-fast if the
  secret is empty.
- workspace-server/Dockerfile.tenant: drop stage 3 (templates), COPY
  from .tenant-bundle-deps/ instead. Header documents the prereq.
- .gitignore: ignore /.tenant-bundle-deps/ so a local build can't
  accidentally commit cloned repos.
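The token-embedding + log-redaction behaviour of clone-manifest.sh can be sketched in Python — a hedged illustration of the idea (the shipped script is bash; function names and the default user here are assumptions):

```python
def clone_url(repo_url, token=None, user="devops-engineer"):
    """Embed basic-auth in the clone URL; anonymous fallback preserved
    when no token is supplied (future public-repo path)."""
    if not token:
        return repo_url
    scheme, rest = repo_url.split("://", 1)
    return f"{scheme}://{user}:{token}@{rest}"

def redact(url):
    """Strip credentials before the URL ever reaches log output."""
    scheme, rest = url.split("://", 1)
    if "@" in rest:
        rest = rest.split("@", 1)[1]
    return f"{scheme}://{rest}"
```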

Verified locally: clone-manifest.sh with the devops-engineer persona
token cloned all 37 repos (9 ws + 7 org + 21 plugins, 4.9MB after
.git strip).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 12:59:46 -07:00
b73d3bfff2 fix(ci): mark CodeQL continue-on-error (advisory only) — closes #156 2026-05-07 17:26:52 +00:00
6de3c1ccd2 fix(ci): add scripts/** to publish-workspace-server-image path filter
scripts/clone-manifest.sh runs inside the platform Dockerfile build,
so a change to that script needs to retrigger publish. Without it,
the prior fix (clone via Gitea + lowercase org) didn't trigger this
workflow because scripts/ wasn't in the path filter.

Also serves as the file change to satisfy the path filter for THIS
push, retriggering publish-workspace-server-image now.
2026-05-07 08:18:53 -07:00
694a036a7f chore(ci): trailing newline to retrigger publish-workspace-server-image (path-filter requires workflow file change) 2026-05-07 08:12:10 -07:00
devops-engineer
10e510f50c chore: drop github-app-auth + swap GHCR→ECR (closes #157, #161)
Two coupled cleanups for the post-2026-05-06 stack:

Drop github-app-auth plugin
============================================
The plugin injected GITHUB_TOKEN/GH_TOKEN via the App's
installation-access flow (~hourly rotation). Per-agent Gitea
identities replaced this approach after the 2026-05-06 suspension —
workspaces now provision with a per-persona Gitea PAT from .env
instead of an App-rotated token. The plugin code itself lived on
github.com/Molecule-AI/molecule-ai-plugin-github-app-auth which is
also unreachable post-suspension; checking it out at CI build time
was already failing.

Removed:
- workspace-server/cmd/server/main.go: githubappauth import + the
  `if os.Getenv("GITHUB_APP_ID") != ""` block that called
  BuildRegistry. gh-identity remains as the active mutator.
- workspace-server/Dockerfile + Dockerfile.tenant: COPY of the
  sibling repo + the `replace
  github.com/Molecule-AI/molecule-ai-plugin-github-app-auth => /plugin`
  directive injection.
- workspace-server/go.mod + go.sum: github-app-auth dep entry
  (cleaned up by `go mod tidy`).
- 3 workflows: actions/checkout steps for the sibling plugin repo:
    - .github/workflows/codeql.yml (Go matrix path)
    - .github/workflows/harness-replays.yml
    - .github/workflows/publish-workspace-server-image.yml

Verified `go build ./cmd/server` + `go vet ./...` pass post-removal.

Swap GHCR → ECR
=======================================================
Same workflow used to push to ghcr.io/molecule-ai/platform +
platform-tenant. ghcr.io/molecule-ai is gone post-suspension. The
operator's ECR org
(153263036946.dkr.ecr.us-east-2.amazonaws.com/molecule-ai/) already
hosts platform-tenant + workspace-template-*
+ runner-base images and is the post-suspension SSOT for container
images. This PR aligns publish-workspace-server-image with that
stack.

- env.IMAGE_NAME + env.TENANT_IMAGE_NAME repointed to ECR URL.
- docker/login-action swapped for
  aws-actions/configure-aws-credentials@v4 +
  aws-actions/amazon-ecr-login@v2 chain (the
  standard ECR auth pattern; uses AWS_ACCESS_KEY_ID/SECRET secrets
  bound to the molecule-cp IAM user).

The :staging-<sha> + :staging-latest tag policy is unchanged —
staging-CP's TENANT_IMAGE pin still points at :staging-latest, just
with the new registry prefix.

Refs molecule-core#157, #161; parallel to org-wide CI-green sweep.
2026-05-07 07:48:51 -07:00
security-auditor
e01077be38 fix(ci): lowercase 'molecule-ai/' in cross-repo workflow refs
Gitea is case-sensitive on owner slugs; canonical is lowercase
`molecule-ai/...`. Mixed-case `Molecule-AI/...` refs fail-at-0s
when the runner tries to resolve the cross-repo workflow / checkout.

Same fix as molecule-controlplane#12. Mechanical case-correction;
no behavior change beyond making CI resolve again.

Refs: internal#46

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 01:00:10 -07:00
Hongming Wang
debe29c889 ci(handlers-postgres-integration): apply legacy *.sql migrations too
The migration-replay step globbed only *.up.sql, silently skipping
the older flat-naming migrations (001_workspaces.sql,
009_activity_logs.sql, etc.). Fine while no integration test
depended on those tables; broke when the #149 cross-table
atomicity test came in needing both workspaces (FK target for
activity_logs) and activity_logs themselves.

Switch to globbing *.sql + sorted lex-order, excluding *.down.sql
so up/down pairs don't undo themselves mid-run. Add a sanity check
for workspaces + activity_logs + pending_uploads alongside the
existing delegations gate so a future migration drift fails loud
instead of silently skipping the regressed test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 22:02:24 -07:00
Hongming Wang
88ff0d770b chore(sweep): add orphan-tunnel cleanup step (#2987 / #340)
The 15-min sweeper has been deleting stale e2e orgs but not the
orphan tunnels left behind when the org-delete cascade half-fails
(CP transient 5xx after the org row is gone but before the CF
tunnel delete completes). Result: tunnels accumulate in CF until
manual operator cleanup.

Add a final step that POSTs `/cp/admin/orphan-tunnels/cleanup`
every tick. Best-effort — failure doesn't fail the workflow; next
tick re-attempts. Output reports deleted_count + failed count for
ops visibility.

This is the catch-all for the orphan-tunnel class. The proper
upstream fix (transactional org delete) lives in CP and is tracked
as issue #2989. Until that lands, the sweeper's bounded
time-to-cleanup keeps the leak from escalating.

Note: PR #492 (cf-tunnel silent-success fix) makes this step
actually effective — pre-fix DeleteTunnel silent-succeeded on
1022, so the cleanup endpoint reported success without deleting.
Post-fix the cleanup chains CleanupTunnelConnections + retry on
1022, which actually clears stuck-connector orphans.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-05-05 19:36:20 -07:00
Hongming Wang
a19ee90556 chore(sweep): note SSOT for ephemeral prefixes lives in CP
Mirrors molecule-controlplane#494: the canonical EPHEMERAL_PREFIXES
list now lives in molecule-controlplane/internal/slugs/ephemeral.go,
where redeploy-fleet reads it to skip in-flight test tenants. The
sweep workflow keeps a Python copy because GHA Python can't import
Go, but a comment now points engineers updating the list to update
both files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 19:18:13 -07:00
Hongming Wang
caf19e8980 feat(ops): hourly alarm for auto-promote PR stuck on REVIEW_REQUIRED (#2975)
Closes the silent-block failure mode that left 25 commits — including
the Memory v2 redesign and the reno-stars data-loss fix — wedged on
staging for 12+ hours behind a single missing review. The auto-promote
workflow opened the PR + armed auto-merge, but main's branch protection
required a human review and nobody noticed until a user reported
"still seeing old memory tab".

## Detection logic — `scripts/check-stale-promote-pr.sh`

Reads open PRs `base=main head=staging` and alarms on:
  - `mergeStateStatus == BLOCKED`
  - `reviewDecision == REVIEW_REQUIRED`
  - createdAt older than `STALE_HOURS` (default 4h)

Other BLOCKED reasons (DIRTY, BEHIND, failed checks) are NOT alarmed —
those are the author's signal-to-fix. This script targets the specific
"no human reviewed yet" wedge.

Output:
  - `::warning` per stale PR (visible in workflow summary + Actions UI)
  - PR comment (idempotent via marker-string detection; one alarm
    per PR, never re-spammed)
  - Exit code = count of stale PRs (capped at 125)

Logic in a script (not inline workflow YAML) so it's:
  - **Unit-testable** — tests/test-check-stale-promote-pr.sh exercises
    every branch with stubbed fixture JSON + frozen clock. 23 tests
    covering: empty list, single stale, just-under-threshold, wrong
    reviewDecision, wrong mergeStateStatus, mixed list (only matching
    PRs alarm), custom threshold via --stale-hours, exit-code-counts-
    matching-PRs, --help, unknown arg → 64, missing repo → 2.
  - **Operator-runnable ad-hoc** — `scripts/check-stale-promote-pr.sh`
    works from any shell with `gh` + `jq`.
  - **SSOT** — one detector, the workflow YAML is just schedule +
    invocation surface. Future sibling workflows that need the same
    check call the same script.
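The age gate from the detection logic, isolated the same way the script's tests isolate it (frozen clock injected as an argument). `is_stale` is an illustrative name, not the script's real function, and GNU `date` is assumed for ISO-8601 parsing; the real detector also checks `mergeStateStatus`/`reviewDecision` from gh's JSON:

```shell
#!/usr/bin/env bash
# Illustrative age-gate sketch (assumes GNU date -d).
is_stale() {
  local created_at="$1" now_epoch="$2" stale_hours="${3:-4}"
  local created_epoch
  created_epoch=$(date -u -d "$created_at" +%s) || return 2
  # stale when the PR has sat at least stale_hours since creation
  [ $(( (now_epoch - created_epoch) / 3600 )) -ge "$stale_hours" ]
}

frozen_now=$(date -u -d '2026-05-05T12:00:00Z' +%s)   # frozen clock, as the tests do
is_stale '2026-05-05T06:00:00Z' "$frozen_now" && echo "stale"   # 6h old, over the 4h default
is_stale '2026-05-05T09:30:00Z' "$frozen_now" || echo "fresh"   # 2.5h old, under threshold
```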

## Workflow — `.github/workflows/auto-promote-stale-alarm.yml`

Triggers:
  - cron `27 * * * *` (hourly, off-the-hour to dodge cron herd)
  - workflow_dispatch with `stale_hours` + `post_comment` overrides

Concurrency: `auto-promote-stale-alarm` group, cancel-in-progress=false
(idempotent script; no benefit to cancelling a running scan).

Permissions: `contents: read` + `pull-requests: write` (post comments).

Sparse checkout — only fetches `scripts/check-stale-promote-pr.sh`.
No node_modules, no go modules, no slow setup steps. Workflow runs
in <30s on a clean repo.

## Why "alarm + comment" not "auto-approve"

Considered options in issue #2975:
  1. Slack/email alert — picked.
  2. Bot-account auto-approve via molecule-ops — circumvents the
     human-review gate that branch protection encodes.
  3. Trusted-promote bypass via CODEOWNERS — needs Org Admin config
     change; out of scope for a workflow PR.

The comment-on-PR pattern picks (1) without external dependencies
(no Slack token, no email config). Subscribers get notified via
GitHub's existing PR notification delivery; the warning shows up in
the Actions feed.

## Why this won't false-positive on legitimate slow reviews

Threshold is 4h. Most legitimate gates clear in <1h, so 4× headroom
is plenty for slow CI. The comment is idempotent (one alarm per PR,
never re-posted) — adding noise stops at 1 comment regardless of
how long the PR sits.

## Test plan

- [x] `bash scripts/test-check-stale-promote-pr.sh` — 23/23 pass
- [x] `python3 -c 'yaml.safe_load(...)'` clean
- [x] `bash -n` clean on both scripts
- [ ] Live verification: dispatch the workflow once main has caught up,
      confirm it correctly reports zero stale PRs
2026-05-05 17:55:27 -07:00
Hongming Wang
475da5b64c refactor(workspace): extract inbox tools from a2a_tools.py (RFC #2873 iter 4e)
Continues the OSS-shape refactor. After iters 4a-4d (rbac, delegation,
memory, messaging) the only behavior left in ``a2a_tools.py`` was
``report_activity`` plus three thin inbox-tool wrappers and the
``_enrich_inbound_for_agent`` helper. This iter extracts the inbox
slice to ``a2a_tools_inbox.py`` so the kitchen-sink module shrinks
from 280 LOC to ~165 LOC of imports + report_activity + back-compat
re-export blocks.

Extracted symbols:
  - ``_INBOX_NOT_ENABLED_MSG`` (sentinel)
  - ``_enrich_inbound_for_agent`` (poll-path peer enrichment helper)
  - ``tool_inbox_peek``
  - ``tool_inbox_pop``
  - ``tool_wait_for_message``

Re-exports (`from a2a_tools_inbox import …`) preserve the public
``a2a_tools.tool_inbox_*`` surface so existing tests + call sites
continue to resolve unchanged.

New tests in test_a2a_tools_inbox_split.py:
  1. **Drift gate (5)** — every previously-public symbol on a2a_tools
     is the EXACT same object as a2a_tools_inbox.foo (`is`, not `==`),
     catches a future "wrap with logging" refactor that silently loses
     existing test coverage.
  2. **Import contract (1)** — a2a_tools_inbox does NOT eagerly import
     a2a_tools at module load. Pins the layered architecture: the
     extracted slice depends on ``inbox`` + a lazy ``a2a_client``
     import, never on the kitchen-sink that re-exports it.
  3. **_enrich_inbound_for_agent branches (5)** — peer_id-empty
     (canvas_user) returns dict unchanged; missing peer_id key same;
     a2a_client unavailable (test harness, partial install) degrades
     gracefully with a bare envelope; registry hit populates
     peer_name + peer_role + agent_card_url; registry miss still
     surfaces agent_card_url (constructable from peer_id alone).

The full timeout-clamp / validation / JSON-shape behavior matrix for
the three wrappers stays in test_a2a_tools_inbox_wrappers.py — those
tests pass identically against both the alias and the underlying impl.

Wiring updates:
  - ``scripts/build_runtime_package.py``: add ``a2a_tools_inbox`` to
    ``TOP_LEVEL_MODULES`` so it ships in the runtime wheel and the
    drift gate doesn't fail the next publish.
  - ``.github/workflows/ci.yml``: add ``a2a_tools_inbox.py`` to
    ``CRITICAL_FILES`` so the 75% MCP/inbox/auth per-file floor
    applies — this is now where the inbox-delivery code actually
    lives.
2026-05-05 14:28:58 -07:00
Hongming Wang
0ca4e431c1 test(e2e): add poll-mode chat upload E2E and wire into e2e-api.yml
Covers the user-visible flow that Phase 1-5b shipped (RFC #2891):
register a poll-mode workspace, POST a multi-file /chat/uploads, verify
the activity feed shows one chat_upload_receive row per file, fetch the
bytes via /pending-uploads/:fid/content, ack each row, and confirm a
post-ack fetch returns 404. Also pins cross-workspace bleed protection
(workspace B's bearer on A's URL → 401, B's URL with A's file_id →
404) and the file_id-UUID-parse 400 path.

23 assertions, all green against a local platform (Postgres+Redis+
platform-server stack matches the e2e-api.yml CI recipe verbatim).

Why a new script instead of extending test_poll_mode_e2e.sh: that
script tests A2A short-circuit + since_id cursor semantics; this one
tests the chat-upload path. They share zero handler code on the
platform side and would dilute each other's failure messages if
combined.

Why not the bearerless-401 strict-mode assertion: the platform's
wsauth fail-opens for bearerless requests when MOLECULE_ENV=development
(see middleware/devmode.go). The CI workflow doesn't set that var, but
some local-dev .env files do — the assertion would flap by environment
without testing the poll-mode upload contract. The middleware's own
unit tests cover strict-mode 401.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:08:55 -07:00
Hongming Wang
6125700c39 test(e2e): plug /tmp scratch leaks in 3 shell E2E tests + add CI lint gate (RFC #2873 iter 2)
Three shell E2E tests created scratch files via `mktemp` but never
deleted them on early exit (assertion failure, SIGINT, errexit). Each
CI run leaked ~10-100 KB of /tmp into the runner; over ~200 runs/week
that's 20+ MB of accumulated cruft.

## Files

- **test_chat_attachments_e2e.sh** — was missing both trap and rm;
  added per-run TMPDIR_E2E with `trap rm -rf … EXIT INT TERM`.
- **test_notify_attachments_e2e.sh** — had a `cleanup()` for the
  workspace but didn't include the TMPF; only an unconditional
  `rm -f` at the bottom (line 233) which doesn't fire on early exit.
  Extended cleanup() to also rm the scratch + dropped the redundant
  trailing rm.
- **test_chat_attachments_multiruntime_e2e.sh** — `round_trip()`
  function had per-call `rm -f` only on the success path; failure
  paths leaked. Switched to script-level TMPDIR_E2E + trap; per-call
  rm dropped (the trap handles every return path including SIGINT).

Pattern: `mktemp -d -t prefix-XXX` for the dir, `mktemp <full-template>`
for files (portable across BSD/macOS + GNU coreutils — `-p` is
GNU-only and breaks Mac local-dev runs).
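The fixed shape, sketched (variable and template names illustrative):

```shell
#!/usr/bin/env bash
set -euo pipefail

# One scratch dir per run; the trap fires on assertion failure (errexit),
# SIGINT/SIGTERM, and normal exit alike — no per-path rm needed.
TMPDIR_E2E=$(mktemp -d -t e2e-XXXXXX)
trap 'rm -rf "$TMPDIR_E2E"' EXIT INT TERM

# Files go inside the dir via a full template (portable; no GNU-only -p).
scratch=$(mktemp "$TMPDIR_E2E/payload-XXXXXX")
printf 'fixture bytes' > "$scratch"
```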

## Regression gate

New `tests/e2e/lint_cleanup_traps.sh` asserts every `*.sh` that calls
`mktemp` also has a `trap … EXIT` line in the file. Wired into the
existing Shellcheck (E2E scripts) CI step. Verified locally: passes
on the fixed state, fails-loud when one of the 3 fixes is reverted.
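A minimal sketch of the lint rule, run against throwaway fixtures (the real lint_cleanup_traps.sh may match more precisely and emit richer annotations): any script that calls `mktemp` must also install an EXIT trap.

```shell
#!/usr/bin/env bash
# Hypothetical reimplementation of the lint's core check.
lint_cleanup_traps() {
  local fail=0 f
  for f in "$1"/*.sh; do
    grep -q 'mktemp' "$f" || continue        # scripts without scratch files are exempt
    grep -Eq 'trap .*EXIT' "$f" ||
      { echo "::error file=${f}::mktemp without an EXIT trap"; fail=1; }
  done
  return "$fail"
}

fixtures=$(mktemp -d)
trap 'rm -rf "$fixtures"' EXIT
cat > "$fixtures/good.sh" <<'EOF'
T=$(mktemp)
trap 'rm -f "$T"' EXIT
EOF
echo 'T=$(mktemp)' > "$fixtures/bad.sh"

lint_cleanup_traps "$fixtures" || echo "violations found"
```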

## Verification

  - shellcheck --severity=warning clean on all 4 touched files
  - lint_cleanup_traps.sh passes on the post-fix tree (6 mktemp users,
    all have EXIT trap)
  - Negative test: revert one fix → lint exits 1 with file:line +
    suggested fix pattern in the error message (CI-grokkable
    ::error file=… annotation)
  - Trap fires on SIGTERM mid-run (smoke-tested on macOS BSD mktemp)
  - Trap fires on `exit 1` (smoke-tested)

## Bars met (7-axis)

  - SSOT: trap pattern documented in lint message (one rule, one fix)
  - Cleanup: this IS the cleanup hygiene fix
  - 100% coverage: lint catches future regressions across all
    `tests/e2e/*.sh` files, not just the 3 fixed today
  - File-split: N/A (no files split)
  - Plugin / abstract / modular: N/A (test infra, not product code)

Iteration 2 of RFC #2873.
2026-05-05 04:21:26 -07:00
Hongming Wang
42f2ea3f4f fix(ci): include event_name in runtime-prbuild-compat concurrency group
Every staging push run for the last 4 SHAs was cancelled by the
matching pull_request run because both fired into the same
concurrency group:

  group: ${{ github.workflow }}-${{ ...sha }}

Same SHA → same group → cancel-in-progress=true means the second
arrival cancels the first. Empirically the push run lost the race;
staging branch-protection then saw a CANCELLED required check and
the auto-promote chain stalled.

Fix: include github.event_name in the group key. push and
pull_request runs for the same SHA now hash to different groups,
both complete, both report SUCCESS to branch protection.
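The fixed group key, sketched as a workflow excerpt — the original's exact SHA expression is elided above, so `github.sha` here is an assumption:

```yaml
concurrency:
  group: ${{ github.workflow }}-${{ github.event_name }}-${{ github.sha }}
  cancel-in-progress: true
```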

Pattern of the bug:

  10:46 sha=1e8d7ae1 ev=pull_request conclusion=success
  10:46 sha=1e8d7ae1 ev=push        conclusion=cancelled
  10:45 sha=ecf5f6fb ev=pull_request conclusion=success
  10:45 sha=ecf5f6fb ev=push        conclusion=cancelled
  10:28 sha=471dff25 ev=pull_request conclusion=success
  10:28 sha=471dff25 ev=push        conclusion=cancelled
  10:12 sha=9e678ccd ev=pull_request conclusion=success
  10:12 sha=9e678ccd ev=push        conclusion=cancelled

Same drift class as the 2026-04-28 auto-promote-staging incident
(memory: feedback_concurrency_group_per_sha.md) — globally-scoped
groups silently cancel runs in matched-SHA scenarios.

This is the only workflow in .github/workflows/ that uses the
narrow per-sha shape without event_name. Others either don't use
concurrency at all, or use ${{ github.ref }} which is event-
neutral.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 04:01:20 -07:00
Hongming Wang
90d202c80a ci(handlers-pg): apply all migrations with skip-on-error + sanity check (#320)
Previous workflow applied only 049_delegations.up.sql — fragile to
future migrations that touch the delegations table or any other
handlers/-tested table. Operator would have to remember to update
the workflow's psql -f line per migration.

New behavior: loop every .up.sql in lexicographic order, apply each
with ON_ERROR_STOP=1 + per-migration result captured. Failed migrations
are SKIPPED rather than blocking the suite — handles the historical
migrations (017_memories_fts_namespace, 042_a2a_queue, etc.) that
depend on tables since renamed/dropped and can't replay from scratch.
Migrations that DO succeed land their tables, which is sufficient for
the integration tests in handlers/.

Sanity gate at the end: if the delegations table is missing after the
replay, hard-fail with a loud error. That catches a real regression
where 049 itself becomes broken (e.g., schema rename), separate from
the historical-broken-migration noise above.

Per-migration log line ("✓" or "⊘ skipped") makes it easy to spot
when a migration that SHOULD have replayed didn't.
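The replay loop, sketched with a stand-in for the psql call so the shape runs anywhere — `apply_migration` substitutes for `psql "$DB_URL" -v ON_ERROR_STOP=1 -f "$f"`, and the fixture files are illustrative:

```shell
#!/usr/bin/env bash
# Stand-in: "fails" when a migration references a since-dropped table.
apply_migration() { ! grep -q 'REFERENCES dropped_table' "$1"; }

dir=$(mktemp -d); trap 'rm -rf "$dir"' EXIT
echo 'CREATE TABLE workspaces (id uuid);'      > "$dir/001_workspaces.up.sql"
echo '-- REFERENCES dropped_table'             > "$dir/017_memories_fts_namespace.up.sql"
echo 'CREATE TABLE delegations (id uuid);'     > "$dir/049_delegations.up.sql"

applied=0 skipped=0
for f in "$dir"/*.up.sql; do                   # glob expands in lexicographic order
  if apply_migration "$f"; then echo "✓ ${f##*/}"; applied=$((applied+1))
  else echo "⊘ skipped ${f##*/}"; skipped=$((skipped+1)); fi
done

# Sanity gate: hard-fail if the load-bearing migration didn't land.
apply_migration "$dir/049_delegations.up.sql" ||
  { echo "::error::delegations table missing after replay"; exit 1; }
```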

Verified locally: full migration chain runs, 049 lands, all 7
integration tests pass against the chained-migration DB.

Closes #320.
2026-05-05 03:48:43 -07:00
Hongming Wang
4c9f12258d fix(delegations): preserve result_preview through completion + add real-Postgres integration gate
Two-part PR:

## Fix: result_preview was lost on completion

Self-review of #2854 caught a real bug. SetStatus has a same-status
replay no-op; the order of calls in `executeDelegation` completion
+ `UpdateStatus` completed branch clobbered the preview field:

  1. updateDelegationStatus(completed, "")    fires
  2. inner recordLedgerStatus(completed, "", "")
       → SetStatus transitions dispatched → completed with preview=""
  3. outer recordLedgerStatus(completed, "", responseText)
       → SetStatus reads current=completed, status=completed
       → SAME-STATUS NO-OP, never writes responseText → preview lost

Confirmed against real Postgres (see integration test). Strict-sqlmock
unit tests passed because they pin SQL shape, not row state.

Fix: call the WITH-PREVIEW recordLedgerStatus FIRST, then
updateDelegationStatus. The inner call becomes the no-op (correctly
preserves the row written by the outer call).

Same gap fixed in UpdateStatus handler — body.ResponsePreview was
never landing in the ledger because updateDelegationStatus's nested
SetStatus(completed, "", "") fired first.
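The ordering contract can be modelled in miniature — a toy shell version of the same-status no-op (the real code is Go; names here are illustrative, not the handler's identifiers). With the preview-carrying write first, the later empty write hits the no-op and cannot clobber it:

```shell
#!/usr/bin/env bash
STATUS="dispatched" PREVIEW=""
set_status() {                        # $1=new status  $2=result preview
  [ "$STATUS" = "$1" ] && return 0    # same-status replay: no-op, nothing written
  STATUS="$1"; PREVIEW="$2"
}

# Fixed order: the WITH-PREVIEW write goes first...
set_status completed "responseText goes here"
# ...so the later empty-preview call is the no-op.
set_status completed ""
echo "preview=$PREVIEW"
```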

## Gate: real-Postgres integration tests + CI workflow

The unit-test-only workflow that shipped #2854 was the root cause.
Adding two layers of defense:

1. workspace-server/internal/handlers/delegation_ledger_integration_test.go
   — `//go:build integration` tag, requires INTEGRATION_DB_URL env var.
   4 tests:
     * ResultPreviewPreservedThroughCompletion (regression gate for the
       bug above — fires the production call sequence in fixed order
       and asserts row.result_preview matches)
     * ResultPreviewBuggyOrderIsLost (DIAGNOSTIC: confirms the
       same-status no-op contract works as designed; if SetStatus's
       semantics ever change, this test fires)
     * FailedTransitionCapturesErrorDetail (failure-path symmetry)
     * FullLifecycle_QueuedToDispatchedToCompleted (forward-only +
       happy path)

2. .github/workflows/handlers-postgres-integration.yml
   — required check on staging branch protection. Spins postgres:15
   service container, applies the delegations migration, runs
   `go test -tags=integration` against the live DB. Always-runs +
   per-step gating on path filter (handlers/wsauth/migrations) so
   the required-check name is satisfied on PRs that don't touch
   relevant code.

Local dev workflow (file header documents this):

  docker run --rm -d --name pg -e POSTGRES_PASSWORD=test -p 55432:5432 postgres:15-alpine
  psql ... < workspace-server/migrations/049_delegations.up.sql
  INTEGRATION_DB_URL="postgres://postgres:test@localhost:55432/molecule?sslmode=disable" \
    go test -tags=integration ./internal/handlers/ -run "^TestIntegration_"

## Why this matters

Per memory `feedback_mandatory_local_e2e_before_ship`: backend PRs
MUST verify against real Postgres before claiming done. sqlmock pins
SQL shape; only a real DB can verify row state. The workflow makes
this gate mandatory rather than optional.
2026-05-05 02:47:52 -07:00
Hongming Wang
c89f17a2aa fix(branch-protection-drift): hard-fail on schedule only, soft-skip + warn on PR
#2834 added a hard-fail when GH_TOKEN_FOR_ADMIN_API is missing on
schedule + pull_request + workflow_dispatch. The PR-trigger hard-fail
is now blocking every PR in the repo because the secret hasn't been
provisioned yet — including the staging→main auto-promote PR (#2831),
which has no path to set repo secrets itself.

Per feedback_schedule_vs_dispatch_secrets_hardening.md the original
concern is automated/silent triggers losing the gate without a human
to notice. That concern applies to **schedule** specifically:

- schedule: cron, no human, silent soft-skip = invisible regression →
  KEEP HARD-FAIL.
- pull_request: a human is reviewing the PR diff and will see workflow
  warnings inline. A PR cannot retroactively drift live state — drift
  happens *between* PRs (UI clicks, manual gh api PATCH), which the
  schedule canary catches. The PR-time gate would only catch typos in
  apply.sh, which the *_payload unit tests catch more directly.
  → SOFT-SKIP with a prominent warning.
- workflow_dispatch: operator override, may not have configured the
  secret yet. → SOFT-SKIP with warning.

The skip is explicit (SKIP_DRIFT_CHECK=1 surfaced to env, then a step
`if:` guard) so it's auditable in the workflow run UI, not silently
swallowed.
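The trigger-sensitive gate, sketched as a testable function — `gate` is a stand-in name, and in the real step the `SKIP_DRIFT_CHECK=1` line would be appended to `$GITHUB_ENV` rather than echoed:

```shell
#!/usr/bin/env bash
gate() {   # $1=event name  $2=secret value ("" = unprovisioned)
  case "$1" in
    schedule)                            # silent trigger: missing secret must fail loud
      [ -n "$2" ] ||
        { echo "::error::GH_TOKEN_FOR_ADMIN_API missing on silent trigger"; return 1; } ;;
    pull_request|workflow_dispatch)      # human in the loop: warn + explicit skip flag
      [ -n "$2" ] ||
        { echo "::warning::GH_TOKEN_FOR_ADMIN_API missing; drift check skipped"
          echo "SKIP_DRIFT_CHECK=1"; } ;;
  esac
  return 0
}
```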

Unblocks #2831 (auto-promote staging→main) + every PR currently behind
this check.
2026-05-04 21:20:30 -07:00
Hongming Wang
2e505e7748 fix(branch-protection): apply.sh respects live state + full-payload drift
Multi-model review of #2827 caught: the script as-shipped would have
silently weakened branch protection on EVERY non-checks dimension
the moment anyone ran it. Live staging had

  enforce_admins=true, dismiss_stale_reviews=false, strict=true,
  allow_fork_syncing=false, bypass_pull_request_allowances={
    HongmingWang-Rabbit + molecule-ai app
  }

Script wrote the opposite for all five. Per memory
feedback_dismiss_stale_reviews_blocks_promote.md, the
dismiss_stale_reviews flip alone is the load-bearing one — would
silently re-block every auto-promote PR (cost user 2.5h once).

This PR:

1. apply.sh: per-branch payloads (build_staging_payload /
   build_main_payload) that codify the deliberate per-branch policy
   already on the repo, with the script's net contribution being
   ONLY the new check names (Canvas tabs E2E + E2E API Smoke on
   staging, Canvas tabs E2E on main).

2. apply.sh: R3 preflight that hits /commits/{sha}/check-runs and
   asserts every desired check name has at least one historical run
   on the branch tip. Catches typos like "Canvas Tabs E2E" vs
   "Canvas tabs E2E" — pre-fix a typo would silently block every PR
   forever waiting for a context that never emits. Skip via
   --skip-preflight for genuinely-new workflows whose first run
   hasn't fired.

3. drift_check.sh: compares the FULL normalised payload (admin,
   review, lock, conversation, fork-syncing, deletion, force-push)
   not just the checks list. Pre-fix the drift gate would have
   missed a UI click that flipped enforce_admins or
   dismiss_stale_reviews. Drops app_id from the comparison since
   GH auto-resolves -1 to a specific app id post-write.

4. branch-protection-drift.yml: per memory
   feedback_schedule_vs_dispatch_secrets_hardening.md — schedule +
   pull_request triggers HARD-FAIL when GH_TOKEN_FOR_ADMIN_API is
   missing (silent skip masks the gate disappearing).
   workflow_dispatch keeps soft-skip for one-off operator runs.

Verified by running drift_check against live state: pre-fix would
have shown 5 destructive drifts on staging + 5 on main. Post-fix
shows ONLY the 2 intended additions on staging + 1 on main, which
go away after `apply.sh` runs.
2026-05-04 20:52:11 -07:00
Hongming Wang
7cc1c39c49 ci: e2e coverage matrix + branch-protection-as-code
Closes #9.

Three pieces, all small:

1. **docs/e2e-coverage.md** — source of truth for which E2E suites
   guard which surfaces. Today three were running but informational
   only on staging; that's how the org-import silent-drop bug shipped
   without a test catching it pre-merge. Now the matrix shows what's
   required where + a follow-up note for the two suites that need an
   always-emit refactor before they can be required.

2. **tools/branch-protection/apply.sh** — branch protection as code.
   Lets `staging` and `main` required-checks live in a reviewable
   shell script instead of UI clicks that get lost between admins.
   This PR's net change: add `E2E API Smoke Test` and `Canvas tabs E2E`
   as required on staging. Both already use the always-emit path-filter
   pattern (no-op step emits SUCCESS when the workflow's paths weren't
   touched), so making them required can't deadlock unrelated PRs.

3. **branch-protection-drift.yml** — daily cron + drift_check.sh
   that compares live protection against apply.sh's desired state.
   Catches out-of-band UI edits before they drift further. Fails the
   workflow on mismatch; ops re-runs apply.sh or updates the script.

Out of scope (filed as follow-ups):
- e2e-staging-saas + e2e-staging-external use plain `paths:` filters
  and never trigger when paths are unchanged. They need refactoring
  to the always-emit shape (same as e2e-api / e2e-staging-canvas)
  before they can be required.
- main branch protection mirrors staging here; if main wants the
  E2E SaaS / External added later, do it in apply.sh and rerun.

Operator must apply once after merge:
  bash tools/branch-protection/apply.sh
The drift check picks it up from there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 20:21:59 -07:00
Hongming Wang
8df8487bbe fix(auto-promote): treat E2E completed/cancelled as defer, not failure
Bug: the case statement at line 189 grouped completed/failure |
completed/cancelled | completed/timed_out into the same "abort
+ exit 1" branch. cancelled ≠ failure — when per-SHA concurrency
(memory: feedback_concurrency_group_per_sha) cancels an older E2E
run because a newer push landed, the workflow blocked the whole
auto-promote chain on a non-failure.

Caught 2026-05-05 02:03 on sha 31f9a5e: E2E got cancelled by
concurrency, auto-promote :latest aborted with exit 1, the next
auto-promote-staging cycle had to manually clean up.

Split: failure/timed_out keep the abort path. cancelled gets its
own clean-defer branch (same shape as in_progress) — proceed=false
without exit 1, with a step-summary explaining likely concurrency
supersession and pointing operators at manual dispatch if they
need that specific SHA promoted.
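The split, sketched as a stand-in for the workflow's case statement (function name illustrative): cancelled now routes to the clean-defer branch alongside in_progress, while real failures keep the abort path.

```shell
#!/usr/bin/env bash
classify() {   # $1=status  $2=conclusion
  case "$1/$2" in
    completed/success)                     echo "proceed" ;;
    completed/failure|completed/timed_out) echo "abort"   ;;  # real failures: exit 1 path
    completed/cancelled)                   echo "defer"   ;;  # likely concurrency supersession
    *)                                     echo "defer"   ;;  # in_progress / queued
  esac
}

classify completed cancelled   # defers instead of failing the chain
classify completed failure     # still aborts
```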

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 19:26:29 -07:00
Hongming Wang
c5dd14d8db fix(workflows): preserve curl stderr in 8 status-capture sites
Self-review of PR #2810 caught a regression: my mass-fix added
`2>/dev/null` to every curl invocation, suppressing stderr. The
original `|| echo "000"` shape only swallowed exit codes — stderr
(the dial errors, timeouts, and DNS failures that `-sS` surfaces)
still went to the runner log, so operators could see WHY a
connection failed.

After PR #2810 the next deploy failure would log only the bare
HTTP code with no context. That's exactly the kind of diagnostic
loss that makes outages take longer to triage.

Drop `2>/dev/null` from each curl line — keep it on the `cat`
fallback (which legitimately suppresses "no such file" when curl
crashed before -w ran). The `>tempfile` redirect alone captures
curl's stdout (where -w writes) without touching stderr.

Same 8 files as #2810: redeploy-tenants-on-{main,staging},
sweep-stale-e2e-orgs, e2e-staging-{sanity,saas,external,canvas},
canary-staging.

Tests:
- All 8 files pass the lint
- YAML valid

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:54:50 -07:00
Hongming Wang
463316772b fix(workflows): rewrite curl status-capture to prevent exit-code pollution
The 2026-05-04 redeploy-tenants-on-main run for sha 2b862f6 emitted
"HTTP 000000" and failed the deploy. Root cause: when curl exits non-
zero (connection reset → 56, --fail-with-body 4xx/5xx → 22), the
`-w '%{http_code}'` already wrote a status to stdout; the inline
`|| echo "000"` then fires AND appends another "000" to the captured
substitution stdout. Result: HTTP_CODE="<actual><000>" — fails string
comparisons against "200" while looking superficially right.

Same class of bug the synth-E2E §7c gate hit twice (PRs #2779/#2783
+ #2797). Memory feedback_curl_status_capture_pollution.md.

Mass fix in 8 workflows: route -w into a tempfile so curl's exit
code can't pollute stdout. Wrap with set +e/-e so the non-zero
curl exit doesn't trip the outer pipeline.

  redeploy-tenants-on-main.yml      (production-critical, caught the bug)
  redeploy-tenants-on-staging.yml   (sibling)
  sweep-stale-e2e-orgs.yml          (cleanup loop)
  e2e-staging-sanity.yml             (E2E safety-net teardown)
  e2e-staging-saas.yml
  e2e-staging-external.yml
  e2e-staging-canvas.yml
  canary-staging.yml
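The fixed capture shape, sketched against a local port that refuses connections so the failure path is exercised (URL/path illustrative; curl's stderr stays untouched):

```shell
#!/usr/bin/env bash
code_file=$(mktemp) body=$(mktemp)
trap 'rm -f "$code_file" "$body"' EXIT

set +e   # a non-zero curl exit must not trip the outer pipeline
curl -sS --max-time 2 -o "$body" -w '%{http_code}' \
  "http://127.0.0.1:1/healthz" > "$code_file"
set -e

# SAFE fallback shape: cat's exit code can't append to the captured
# stdout the way the inline `curl ... || echo "000"` could.
HTTP_CODE=$(cat "$code_file" 2>/dev/null || echo "000")
[ -s "$code_file" ] || HTTP_CODE="000"   # belt-and-braces for an empty -w file
echo "HTTP ${HTTP_CODE}"
```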

Plus a new lint workflow `lint-curl-status-capture.yml` that runs on
every PR/push touching `.github/workflows/**`. Multi-line aware:
collapses bash `\` continuations, then matches the buggy
$(curl ... -w '%{http_code}' ... || echo "000") subshell shape.
Distinguishes from the SAFE $(cat tempfile || echo "000") shape
(cat with missing file emits empty stdout, no pollution).

Verified:
- All 8 workflows pass the lint locally
- A known-bad injection is caught
- A known-safe cat-fallback passes through
- yaml.safe_load clean on all changed files

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:29:38 -07:00
Hongming Wang
26fa220bef ci(coverage): per-file 75% floor for MCP/inbox/auth Python critical paths
Closes part of #2790 (Phase A). The Python total floor at 86% (set in
workspace/pytest.ini, issue #1817) averages over ~6000 lines, so a
single MCP-critical file could regress to ~50% with no CI complaint as
long as other modules compensate. This is the same distribution gap
that #1823 closed Go-side: total floor passes while a critical handler
sits at 0%.

Added gates for these five files (per-file floor 75%):
- workspace/a2a_mcp_server.py — MCP dispatcher (PR #2766 / #2771)
- workspace/mcp_cli.py — molecule-mcp standalone CLI entry
- workspace/a2a_tools.py — workspace-scoped tool implementations
- workspace/inbox.py — multi-workspace inbox + per-workspace cursors
- workspace/platform_auth.py — per-workspace token resolver

These handle multi-tenant routing, auth tokens, and inbox dispatch.
Risk shape mirrors Go-side tokens*/secrets* — a 0%/50% file here is
exactly where the PR #2766 dispatcher bug class slips through without
a structural test.

Floor 75% is strictly additive — current actuals 80-96% (measured
2026-05-04). No existing PR fails. Ratchet plan in COVERAGE_FLOOR.md
target 90% by 2026-08-04.

Implementation: pytest already writes .coverage; new step emits a JSON
view scoped to the critical files via `coverage json --include="*name"`,
then jq extracts each file's percent_covered. Exact key match by
basename so workspace/builtin_tools/a2a_tools.py (a different 100%
file) doesn't shadow workspace/a2a_tools.py.
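The floor check itself, sketched with the jq-extracted percentages inlined as a fixture — `check_floor` is a stand-in name, and apart from a2a_tools.py's stated 80% the values below are approximations within the measured 80-96% band:

```shell
#!/usr/bin/env bash
FLOOR=75
check_floor() {          # reads "file percent" pairs on stdin
  local fail=0 file pct
  while read -r file pct; do
    if [ "${pct%.*}" -lt "$FLOOR" ]; then
      echo "::error::${file} at ${pct}% is below the per-file floor of ${FLOOR}%"
      fail=1
    fi
  done
  return "$fail"
}

check_floor <<'EOF' && echo "floor ok"
workspace/a2a_mcp_server.py 92
workspace/mcp_cli.py 96
workspace/a2a_tools.py 80
workspace/inbox.py 88
workspace/platform_auth.py 85
EOF
```

Raising FLOOR to 81 trips exactly one failure (a2a_tools.py), matching the verification below.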

Verified locally with the actual coverage data:
- floor=75 → 0 failures (matches current state)
- floor=81 → 1 failure (a2a_tools.py at 80%) — proves the gate trips

Pairs with PR #2791 (Phase B — schema↔dispatcher AST drift gate). Phase
C (molecule-mcp e2e harness) remains the largest piece in #2790.

YAML validated locally before commit per
feedback_validate_yaml_before_commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 16:35:21 -07:00
Hongming Wang
ff1003e5f6 ci(canary): bump timeout-minutes 12 → 20 to absorb apt tail latency
Today's 4 cancelled canaries (25319625186 / 25320942822 / 25321618230 /
25322499952) were all blown by the workflow timeout despite the
underlying tenant boot completing successfully (PR molecule-controlplane#455
fix verified — boot events all reach `boot_script_finished/ok`).

Why the budget was wrong:

The tenant user-data install phase runs apt-get update + install of
docker.io / jq / awscli / caddy / amazon-ssm-agent FROM RAW UBUNTU on
every tenant boot — none of it is pre-baked into the tenant AMI
(EC2_AMI=ami-0ea3c35c5c3284d82, raw Jammy 22.04). Empirical
fetch_secrets/ok timing across today's canaries:

  51s   debug-mm-1777888039 (09:47Z)
  82s   25319625186          (12:42Z)
  143s  25320942822          (13:11Z)
  625s  25322499952          (13:43Z)

Same EC2_AMI, same instance type (t3.small), same user-data install
sequence — variance is entirely apt-mirror tail latency. A 12-min job
budget leaves only ~2 min for the workspace on slow-apt days; the
workspace itself needs ~3.5 min for claude-code cold boot, so the
budget is structurally too tight whenever apt is slow.

20 min absorbs even the 10+ min boot worst-case and still leaves the
workspace its full ~7 min budget. Cap stays well under the runner's
6-hour ubuntu-latest job ceiling.

Real fix: pre-bake caddy + ssm-agent into the tenant AMI so the boot
phase is no-ops on cached pkgs (will file controlplane#TBD as
follow-up — packer/install-base.sh today only bakes the WORKSPACE thin
AMI, not the tenant AMI; tenants always boot from raw Ubuntu).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 07:02:12 -07:00