molecule-core

Author	SHA1	Message	Date
Molecule AI Core-DevOps	1b6c28ebfa	fix(ci): add sqlalchemy>=2.0.0 to pip install step (closes #293) test_audit_ledger.py imports sqlalchemy directly (line 42). Without an explicit sqlalchemy install, pip dependency resolution can omit it when pytest/pytest-asyncio/pytest-cov are installed as a separate step after requirements.txt. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-11 02:26:53 +00:00
Molecule AI Infra-SRE	4b1ce228ea	ci: remove .github/workflows/publish-workspace-server-image.yml duplicate Gitea Actions reads .gitea/workflows/, not .github/workflows/. The .github/ copy of this workflow has been kept in lockstep with .gitea/ since the post-suspension migration (e.g. `6d94fd30`, `5216e781`, `67b2e488` all touch both files). The functional code is identical between the two; the only differences are comment verbosity and the path-filter self-reference (each version watches its own location). Removing the .github/ copy: - eliminates the dual-edit maintenance tax (two files touched per fix) - prevents accidental drift where one is updated and the other isn't - leaves a single source-of-truth at .gitea/workflows/ Cross-references confirmed safe: - canary-verify.yml + redeploy-tenants-on-{staging,main}.yml all use `workflows: ['publish-workspace-server-image']` (workflow name, not file path) — they trigger off the workflow_run event keyed on `name:`, which is identical in both files. - No other workflow path-watches .github/workflows/publish-workspace-server-image.yml. The other two triplicates from task #287 (publish-runtime.yml and secret-scan.yml) are NOT addressed in this PR — see PR description for the ambiguity report flagging them for human review. Refs: task #287 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 02:18:02 +00:00
Molecule AI Infra-SRE	6d94fd3077	fix(ci): scope trigger to main only — revert accidental staging push addition The Docker daemon health-check fix should not change which branches trigger the build. Revert accidental addition of 'staging' to branch filters. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 12:08:34 +00:00
Molecule AI Infra-SRE	8b6a11ccc7	fix(ci): restore SHA-pins that were accidentally reverted to mutable tags Reverts two accidental mutable-tag changes introduced in this branch: - pypa/gh-action-pypi-publish: release/v1 -> cef22109... (matches #276 intent) - actions/checkout: @v6 -> de0fac2e... (matches #276 intent) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 12:08:07 +00:00
Molecule AI Core-DevOps	8af1eb6774	ci: add Docker daemon health-check to canvas image workflow Cover the canvas image publish workflow with the same `docker info` guard added to publish-workspace-server-image.yml (commit `5216e781`). publish-canvas-image.yml was the only docker-build workflow still missing the step. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 12:00:47 +00:00
Molecule AI Infra-SRE	5216e781cd	ci: add Docker daemon health-check step before build Run `docker info` as the first CI step to catch runner Docker socket permission issues (docker.sock unreadable, daemon restarted, group membership drift) before the expensive `docker build` step. The error now surfaces immediately with a clear `::error::` message rather than silently continuing into `docker build` where the same failure would appear 60-90s later as a cryptic ECR auth error. Gitea Actions run 4350 (2026-05-10 05:58 UTC) is the trigger: the runner's docker.sock became inaccessible for ~6 minutes, `docker build` failed at step 2 with `permission denied...docker.sock`, and `go build` (step 3) was never reached — masking the compile errors that were already on main. The downstream code errors only surfaced once run 4407 succeeded at `docker build` and finally reached `go build`. Now: `docker info` → fail in ~1s with actionable error. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 10:01:01 +00:00
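A minimal Python restatement of the `docker info` guard from 5216e781cd, for illustration only: the real step is shell in the workflow YAML, and the function name, the 30s timeout, and the `::error::` wording below are hypothetical. The injectable `run` parameter exists only so the check can be exercised without a Docker daemon.

```python
import subprocess
from typing import Callable, Optional


def docker_preflight(
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> Optional[str]:
    """Return an ::error:: annotation if the Docker daemon is unusable, else None."""
    try:
        # `docker info` has to reach the daemon, so an unreadable docker.sock fails here fast
        result = run(["docker", "info"], capture_output=True, text=True, timeout=30)
    except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
        return f"::error::docker info did not complete: {exc}"
    if result.returncode != 0:
        return (
            "::error::Docker daemon unreachable before build "
            "(check docker.sock permissions / group membership): "
            + (result.stderr or "").strip()
        )
    return None  # daemon healthy: safe to start the expensive `docker build`
```

A workflow step would print the returned message and exit non-zero whenever it is not None, so the job stops before `docker build` instead of failing 60-90s later.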
Molecule AI Core-DevOps	af5406d29e	fix(ci): migrate canary-verify from GHCR to ECR + add POST route smoke tests Root cause of issue #213: canary-verify.yml still used GHCR (ghcr.io/molecule-ai/platform-tenant) while publish-workspace-server-image.yml migrated to ECR on 2026-05-07 (commit `10e510f5`). Canary smoke tests were silently testing a stale GHCR image while actual staging/prod tenants ran the ECR build. The POST /org/import and POST /workspaces routes were missing from the ECR binary (likely a Docker layer-caching artefact during the staging push window) but smoke tests passed because they never tested the ECR image at all. Changes: - canary-verify.yml: migrate promote-to-latest from GHCR crane tag ops to the CP redeploy-fleet endpoint (same mechanism as redeploy-tenants-on-main.yml). The wait-for-canaries step already read SHA from the running tenant /health (registry-agnostic), so no change needed there. Pre-fix promote step used `crane tag` against GHCR, which was never updated after the ECR migration. - redeploy-tenants-on-main.yml: update stale comments that reference GHCR to reflect ECR; replace the 30s GHCR CDN propagation wait with a no-op comment (ECR has no CDN cache to wait for). - scripts/canary-smoke.sh: add POST /org/import and POST /workspaces smoke tests (steps 6-8). These assert HTTP 401 unauthenticated (proves AdminAuth enforced AND the route is compiled in — 404 would mean route missing from binary). GET /workspaces was already covered; POST was the untested gap. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 02:10:12 +00:00
Molecule AI Core-DevOps	25d3b1a2f3	feat(ci): port publish-runtime.yml to .gitea/workflows/ (issue #206) publish-runtime.yml was dead on Gitea Actions because Gitea reads .gitea/workflows/, not .github/workflows/ (the GitHub Actions paths are ignored). Issue #206 identified this as one of three bugs blocking the runtime versioning pipeline. Changes: - Add .gitea/workflows/publish-runtime.yml (canonical Gitea version) - Drop environment: + id-token: write (Gitea has no OIDC/OAuth) - Replace pypa/gh-action-pypi-publish with twine upload using PYPI_TOKEN secret - Replace github.ref_name with ${GITHUB_REF#refs/tags/} (Gitea exposes github.ref) - Drop merge_group trigger (Gitea has no merge queue) - Drop staging branch trigger (staging branch does not exist) - Cascade step unchanged (DISPATCH_TOKEN + Gitea API already compatible) - Add DEPRECATED notice to .github/workflows/publish-runtime.yml Required secrets (repo Settings → Actions → Variables and Secrets): PYPI_TOKEN: PyPI API token for molecule-ai-workspace-runtime DISPATCH_TOKEN: Gitea PAT with write:repo on template repos (already used) Closes #206 (publish-runtime Gitea port).	2026-05-10 01:26:13 +00:00
Molecule AI Core-DevOps	796201e09f	fix(ci): replace dorny/paths-filter with shell-based git diff (Gitea Actions compatibility) dorny/paths-filter is GitHub-Actions-only and does not work correctly on Gitea Actions — it silently returns no file changes regardless of what files were modified, causing the harness-replays workflow to silently skip on Gitea even when workspace-server/ or canvas/ files change. Verified: zero harness-replays statuses on PR #188 and #168 (both changed workspace-server files) vs GitHub Actions where the same workflow correctly detects changes. Replace with a shell-based approach that uses: - github.event.pull_request.base.sha (Gitea + GitHub: merge-base for PRs) - github.event.before (Gitea + GitHub: previous tip for pushes) - git diff --name-only <BASE> github.sha (portable git, works on both platforms) Also adds detect-changes.debug output so future no-op passes show WHY the workflow decided to skip, and the first real run on Gitea will confirm the diff detection is working. Closes #141 (followup: root-cause fix still TBD — failure logs inaccessible via Gitea Actions API).	2026-05-10 01:11:45 +00:00
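A hypothetical sketch of the detect-changes decision from 796201e09f, restated in Python. It assumes the caller already ran `git diff --name-only <BASE> <github.sha>` and passes its stdout in; the handling of an all-zero `before` SHA (a brand-new branch push) and the watched prefixes used in the check are assumptions, not taken from the workflow.

```python
from typing import Iterable, Optional, Tuple

ZERO_SHA = "0" * 40  # assumed value of github.event.before on a first push of a branch


def pick_base(pr_base_sha: Optional[str], before_sha: Optional[str]) -> Optional[str]:
    """Prefer the PR base SHA; fall back to the push's previous tip."""
    if pr_base_sha:
        return pr_base_sha
    if before_sha and before_sha != ZERO_SHA:
        return before_sha
    return None  # no usable base: the caller should run everything rather than skip


def should_run(diff_output: str, watched: Iterable[str]) -> Tuple[bool, str]:
    """Return (run?, debug reason) from `git diff --name-only` output."""
    changed = [line for line in diff_output.splitlines() if line.strip()]
    prefixes = tuple(watched)
    hits = [path for path in changed if path.startswith(prefixes)]
    if hits:
        return True, f"{len(hits)} watched file(s) changed, e.g. {hits[0]}"
    # This reason string plays the role of the detect-changes.debug output
    return False, f"{len(changed)} file(s) changed, none under {', '.join(prefixes)}"
```

Returning a reason alongside the boolean mirrors the commit's point: a skipped run should say why it skipped.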
Molecule AI Core-BE	9368b20d49	[core-be-agent] fix(ci): replace gh api calls with Gitea-compatible alternatives Issue #75 PR-D: two remaining `gh` CLI calls in .github/workflows/. 1. ci.yml canvas-deploy-reminder: - Replaced `gh api POST repos/.../commits/.../comments` with writing to GITHUB_STEP_SUMMARY. Gitea has no commit-comments API (confirmed in issue #75), so the gh call always failed. GITHUB_STEP_SUMMARY works on both GitHub Actions and Gitea Actions as the workflow-run summary page, which is the natural place for post-deploy action items. - Removed now-unnecessary GH_TOKEN env var and contents:write permission. 2. check-merge-group-trigger.yml: - Converted to no-op stub. Gitea has no merge queue feature and no merge_group: event type, so this workflow's lint would find nothing to verify (all workflows vacuously pass). Keeping workflow+job name unchanged preserves commit-status context names for branch protection consumers. Dropped the merge_group: trigger since it would never fire on Gitea. Dropped the full bash linter + gh api call. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-09 23:10:07 +00:00
Molecule AI Core-DevOps	252f8d0c47	tech-debt: rename molecule-monorepo-net -> molecule-core-net Renames Docker network across all code, configs, scripts, and docs. Per issue #93: the network was named molecule-monorepo-net as a holdover from when the repo was called molecule-monorepo. The canonical repo name is now molecule-core, so the network should be molecule-core-net. Files changed: - docker-compose.yml, docker-compose.infra.yml: network definition - infra/scripts/setup.sh: docker network create - scripts/nuke-and-rebuild.sh: docker network rm - workspace-server/internal/provisioner/provisioner.go: DefaultNetwork - All comments/docs: updated wording Acceptance: grep -rn 'molecule-monorepo-net' returns zero matches. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-09 20:51:48 +00:00
claude-ceo-assistant	2fa79ea462	Merge pull request 'chore(ci): document #192 root cause — workspace-template repos public per OSS-first' (#133) from chore/192-retrigger-harness-replays-after-public-flip into main	2026-05-08 19:12:54 +00:00
claude-ceo-assistant	558e4fee48	chore(ci): document #192 root cause — workspace-template repos public per OSS-first 5 of 9 workspace-template repos (openclaw, codex, crewai, deepagents, gemini-cli) had been marked private with no team grant for AUTO_SYNC_TOKEN bearer (devops-engineer persona). Pre-clone manifest deps step 404'd on the first private repo encountered, failing every Harness Replays run. Resolution path taken: 1. Flipped the 5 to public per `feedback_oss_first_repo_visibility_default` — runtime/template/plugin repos default public; that's what makes them OSS surface. 2. Scoped existing `ci-readonly` org team to legitimately-internal repos only (compliance docs, RFCs-in-flight). Workspace templates removed from it. 3. Filed internal#102 RFC for Layer-3 (customer-owned + marketplace third-party private repos) — that's a different shape entirely; needs per-tenant credential-resolver, not org-team grants. This commit is a documentation-only touch on the workflow file to (a) record the root cause inline next to the existing pre-clone-fail narrative, (b) trigger a fresh Harness Replays run that should now pass the clone step. Closes #192. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 11:50:55 -07:00
dev-lead	5c0c15eb4f	chore(canary): workflow_dispatch input keep_on_failure for log capture Investigating molecule-core#129 failure mode #1 (claude-code "Agent error (Exception)") needs the workspace's docker logs to find the actual exception. The canary tears down the tenant on every failure, so the workspace container is destroyed before anyone can SSM in. Add a workflow_dispatch input `keep_on_failure: bool` (default false). When true, sets `E2E_KEEP_ORG=1` for the canary script — its existing debug path skips teardown, leaving the tenant + EC2 + CF tunnel + DNS alive. Operator can then SSM into the workspace EC2 (via the same flow as recover-tunnels.py) and capture `docker logs` from the claude-code container. Cron-triggered runs never set the input (it only exists on dispatch), so unattended scheduled canaries always tear down — no risk of unattended cost leak. Operator workflow: 1. Dispatch canary-staging.yml with keep_on_failure=true 2. Watch CI; on failure (likely, given the 38h chronic red), note the SLUG / TENANT_URL printed at step 1/11 3. SSM exec into the workspace EC2 (us-east-2) and run `docker logs <claude-code-container>` to find the actual exception traceback 4. Manually delete via DELETE /cp/admin/tenants/<slug> when done (the script logs this reminder on E2E_KEEP_ORG=1 path) Refs: molecule-core#129 (canary investigation) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 10:58:19 -07:00
dev-lead	42ff6be15c	fix(ci): canary alerting — drop Gitea-incompatible actions API call The "Open issue on failure" step was failing on every canary run because Gitea 1.22.6 doesn't expose /api/v1/actions endpoints (per memory reference_gitea_actions_log_fetch). The threshold check called github.rest.actions.listWorkflowRuns() to count consecutive prior failures and gate issue creation behind 3 reds — that call ALWAYS 404'd on Gitea, breaking the entire alerting step. Net effect: the canary's own self-alerting was broken, so the underlying staging regression went unflagged for 38h+ (2026-05-07 02:30 UTC → 2026-05-08 17:34 UTC, every cron tick red, zero issues filed). Fix: drop the consecutive-failures threshold entirely. File a sticky issue on the FIRST failure; comment-on-existing handles deduplication for subsequent failures. The auto-close-on-success step is unchanged. Why not a Gitea-compatible threshold (e.g., walk recent commit statuses): comment-on-existing already gives ops a single accumulating issue per regression streak. The threshold's purpose was to avoid spamming on transient flakes — but with sticky issue + auto-close-on-green, transient flakes get one issue + one quick close, which is fine signal. Filing on first failure is also better UX: catches the regression in 30 min instead of 90 min. Also: rewrote runURL from hardcoded https://github.com/... to context.serverUrl so the link actually points at Gitea (https://git.moleculesai.app) — was always broken on Gitea but nobody noticed because the issue-filing step itself was broken. Net: 21 insertions, 40 deletions. Removes WORKFLOW_PATH + CONSECUTIVE_THRESHOLD env vars (no longer needed). Tracked in: molecule-core#129 (failure mode 3 of 3) Verification: yaml syntax-valid; no remaining github.rest.actions.* calls; only github.rest.issues.* (all Gitea-supported per memory feedback_persona_token_v2_scope). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 10:52:09 -07:00
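A sketch of the alerting decision after 42ff6be15c (sticky issue on the first failure, comment on an already-open issue, auto-close on green), assuming the caller has already looked up any open canary issue through the issues API. `alert_action` and `run_url` are hypothetical names, and the `/actions/runs/<id>` path layout is an assumption about how the run link is built from `context.serverUrl`.

```python
from typing import Optional


def alert_action(run_succeeded: bool, open_issue: Optional[int]) -> str:
    """Decide what the canary's alerting step should do for this run."""
    if run_succeeded:
        # auto-close-on-success: one quick close for a transient flake
        return f"close #{open_issue}" if open_issue is not None else "noop"
    if open_issue is not None:
        # comment-on-existing handles deduplication across a red streak
        return f"comment #{open_issue}"
    return "create"  # no threshold: file on the FIRST failure


def run_url(server_url: str, repo: str, run_id: int) -> str:
    """Link built from the server URL instead of a hardcoded https://github.com/."""
    return f"{server_url}/{repo}/actions/runs/{run_id}"
```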
claude-ceo-assistant	08e8d325e2	chore(workflows): delete obsolete promote/sync workflows (Phase 3C of internal#81) Trunk-based migration final cleanup for molecule-core. The 6 workflows deleted here all existed to manage the staging↔main branch dance that trunk-based makes obsolete: - auto-promote-staging.yml fast-forward staging→main on green - auto-promote-on-e2e.yml alt promote path on E2E green - auto-promote-stale-alarm.yml alarm if staging promotion stalls - auto-sync-main-to-staging.yml sync main→staging after UI merges - auto-sync-canary.yml dry-run probe of the auto-sync token+push path - retarget-main-to-staging.yml rebase open PRs onto staging After Phase 3A (PR #108 promoted 5 staging-only feature PRs to main) and Phase 3B (PR #109 dropped staging-branch triggers from the 4 e2e workflows), main is the only branch the CI cares about. None of the above workflows have anything to do; they're 1977 lines of dead Go-time- no-Gitea-time-yes code. Rollback: `git revert` this commit to restore the workflows. They still work mechanically; trunk-based just doesn't need them. The `staging` branch on the remote is deleted in a follow-up step (`git push origin --delete staging`) after this PR merges, so reviewers can confirm CI runs cleanly on the new shape before the ref disappears. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 14:18:35 +00:00
claude-ceo-assistant	2fac4b61b4	chore(workflows): drop staging-branch triggers (Phase 3b of internal#81) Trunk-based migration: main is the only branch. Update 4 workflows that fired on staging-branch pushes to fire on main instead. - e2e-staging-canvas.yml: drop staging from push + pull_request - e2e-staging-external.yml: drop staging from push + pull_request - e2e-staging-saas.yml: drop staging from push + pull_request, update header comment that references the (now-obsolete) staging→main auto-promote flow - redeploy-tenants-on-staging.yml: workflow_run.branches changes from [staging] to [main] so the tenant redeploy fires when publish-workspace-server-image runs on main Workflows that target the staging tenant FLEET (canary-staging.yml, e2e-staging-sanity.yml) are not changed — they fire on cron, the word "staging" in their filenames refers to the deployment target environment, not the git branch. Lands as Phase 3b after #108 promotes the 5 staging-only feature PRs (Phase 3a). Phase 3c deletes the obsolete promote/sync workflows (auto-promote-staging, auto-sync-main-to-staging, etc.) plus the staging branch itself, after we no-op-verify both Phase 3a and 3b green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 13:08:51 +00:00
claude-ceo-assistant	a4ab623bbf	fix(ci): e2e-api — parallel-safe postgres/redis containers (#100) Closes #94. Mirrors PR #98 pattern. Approved by security-auditor.	2026-05-08 02:02:57 +00:00
devops-engineer	b9d2786f45	fix(ci): e2e-api — parallel-safe postgres/redis containers + provisioner setup Class B Hongming-owned CICD red sweep, e2e-api leg. Same substrate hazard as PR #98 (handlers-postgres-integration) — Gitea act_runner configures `container.network: host` operator-wide, so: * Two concurrent e2e-api runs both attempted to bind `-p 15432:5432` and `-p 16379:6379` on the operator host. Verified in run a7/2727 on 2026-05-07: `docker: Error response from daemon: Conflict. The container name "/molecule-ci-redis" is already in use by container af10f438...` — exit 125, job fails before any test runs. * Hardcoded container names `molecule-ci-postgres` / `-redis` plus the leading `docker rm -f` step meant a second job's startup also KILLED the first job's still-running services. Fix shape (mirrors PR #98 bridge-net pattern, adapted because the platform-server is a Go binary on the host, not a containerised step): 1. Per-run unique container names: `pg-e2e-api-${RUN_ID}-${RUN_ATTEMPT}`, `redis-e2e-api-${RUN_ID}-${RUN_ATTEMPT}`. Unique even across reruns of the same run_id. 2. Ephemeral host port per run via `-p 0:5432` / `-p 0:6379` and `docker port` lookup, exported as `DATABASE_URL` / `REDIS_URL` to `$GITHUB_ENV`. No fixed host-port → no collision. 3. `127.0.0.1` (NOT `localhost`) in URLs — IPv6 first-resolve flake fixed in #92 stays fixed. 4. `if: always()` cleanup so containers don't leak when test steps fail. Issue #94 items #2 + #3 also addressed: * Pre-pull `alpine:latest` (provisioner uses it for ephemeral token-write containers in `internal/handlers/container_files.go`). * Idempotent `docker network create molecule-monorepo-net` (the provisioner attaches workspace containers via that bridge — `internal/provisioner/provisioner.go::DefaultNetwork`). Issue #94 item #1 (timeouts) NOT bumped — recent log evidence shows postgres ready in 3s, redis in 1s, platform in 1s when they DO come up. Timeouts are not the bottleneck on the current substrate. NOT addressed here (out of scope, separate change required): * `Run E2E API tests` step has been failing on `Status back online` because the platform's langgraph workspace template image (`ghcr.io/molecule-ai/workspace-template-langgraph:latest`) returns 403 Forbidden post-2026-05-06 GitHub org suspension. That is a template-registry resolution issue (ADR-002 / local-build mode) and belongs in a workspace-server change, not this workflow file. This PR fixes the parallel-collision class and the workflow setup hygiene; the langgraph-403 failure will still surface on runs after this lands until template resolution is fixed separately. Verified manually on operator host 2026-05-08: docker now hands out ephemeral ports on `-p 0:5432`, two parallel runs land on different ports, both reach pg_isready GREEN. Closes #94 (items #2 and #3; item #1 documented as not-bottleneck; langgraph-template-403 referenced for follow-up). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 18:59:56 -07:00
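A sketch of the per-run naming and ephemeral-port plumbing from b9d2786f45, assuming the `docker port <container> <port>` output has been captured as text (Docker may print both an IPv4 and an `[::]` line). The credentials and database name in `env_lines` are placeholders; only the name shape and the `127.0.0.1` host come from the commit.

```python
from typing import List


def container_name(service: str, run_id: str, run_attempt: str) -> str:
    """Unique per run AND per rerun, e.g. pg-e2e-api-2727-2."""
    return f"{service}-e2e-api-{run_id}-{run_attempt}"


def host_port(docker_port_output: str) -> int:
    """Pick the IPv4 host port from `docker port` output such as '0.0.0.0:49153'."""
    for line in docker_port_output.splitlines():
        line = line.strip()
        if line and not line.startswith("["):  # skip IPv6 bindings like '[::]:49153'
            return int(line.rsplit(":", 1)[1])
    raise ValueError(f"no IPv4 binding in {docker_port_output!r}")


def env_lines(pg_port: int, redis_port: int) -> List[str]:
    """Lines to append to $GITHUB_ENV; 127.0.0.1 (not localhost) avoids a ::1 first resolve."""
    return [
        # placeholder user/password/db, not taken from the workflow
        f"DATABASE_URL=postgres://postgres:postgres@127.0.0.1:{pg_port}/molecule?sslmode=disable",
        f"REDIS_URL=redis://127.0.0.1:{redis_port}",
    ]
```

Because the host port is read back from Docker instead of being fixed, two concurrent runs can no longer race for 15432/16379.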
devops-engineer	a302d75129	chore(ci): retrigger Handlers Postgres Integration for second-green proof Class B verification — second consecutive green run to demonstrate the fix isn't flaky. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 18:23:05 -07:00
devops-engineer	241859b552	fix(ci): handlers-postgres — sidestep port collision under host-network runner Class B Hongming-owned CICD red sweep. The Handlers Postgres Integration workflow has been silently failing on staging push and PRs ever since #92 fixed the IPv6 flake — the IPv6 fix correctly pinned 127.0.0.1, but unmasked a deeper issue: with our act_runner global container.network=host config, multiple concurrent runs of this workflow each tried to bind 0.0.0.0:5432 on the operator host. The first wins; subsequent postgres service containers exit with `FATAL: could not create any TCP/IP sockets` + `Address in use`. Docker auto-removes them (act_runner sets AutoRemove:true), so by the time `Apply migrations` runs `psql`, the container is gone — Connection refused, then `failed to remove container: No such container` at cleanup time. Per-job container.network override is silently ignored by act_runner (`--network and --net in the options will be ignored.`), so we sidestep `services:` entirely. The job container still uses host-net (required for cache server discovery on the operator's bridge IP). We launch a sibling postgres on the existing molecule-monorepo-net bridge with a unique name per run (run_id+run_attempt) and connect via the bridge IP read from `docker inspect`. Verified manually on operator host 2026-05-08: 2× postgres on host-net collides, but on the bridge with unique names + different IPs, both succeed and each is reachable from a host-net job container. Adds: - always()-cleanup step so containers don't leak on test failure - Diagnostic dump now includes the postgres container's docker logs - Runbook at docs/runbooks/ documenting the substrate behavior + the pattern future workflows should adopt for any `services:`-shaped need. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 18:21:12 -07:00
devops-engineer	419c109f1d	chore: sync main → staging (auto-resolved workflow conflicts via main-wins) Conflicted files in .github/workflows/ taken from main: .github/workflows/ci.yml .github/workflows/e2e-staging-canvas.yml .github/workflows/retarget-main-to-staging.yml Conflicts arose from main advancing through PR #66/#79/#89 (CI workflow rewrites) while staging hadn't picked up the changes yet. Main is the source of truth for CI workflows; staging is downstream. Co-authored-by: Claude (orchestrator)	2026-05-08 01:00:48 +00:00
claude-ceo-assistant	6c823cf673	Merge branch 'main' into fix/196-retarget-main-to-staging-gitea-rest	2026-05-08 00:20:49 +00:00
claude-ceo-assistant (Claude Opus 4.7 on Hongming's MacBook)	87b971a292	fix(ci): close 3 chronic Gitea-Actions workflow flakes (closes #88) Three workflows have been failing on every push to this Gitea repo for GitHub-shaped reasons that don't translate to act_runner. Surfaced while landing #84; bundled per `feedback_gitea_actions_migration_audit_pattern` ("bundle per-repo, not per-finding") instead of three separate PRs. 1) handlers-postgres-integration: localhost → 127.0.0.1 - lib/pq tries to dial localhost → ::1 first; the postgres service container only listens on IPv4 → ECONNREFUSED → all TestIntegration_* fail. Pin IPv4 to make the job deterministic. 2) pr-guards / disable-auto-merge-on-push: Gitea no-op - The previous reusable-workflow caller invoked `gh pr merge --disable-auto`, which calls GitHub's GraphQL API. Gitea returns HTTP 405 on /api/graphql → step always fails. Inline the step so it can detect Gitea (GITEA_ACTIONS=true OR repo url under moleculesai.app) and no-op with a notice. Auto-merge gating is moot on Gitea anyway: there's no `--auto` primitive being touched. Job stays ALWAYS-RUN so branch protection's required check still lands SUCCESS (avoids the SKIPPED-in-set trap from `feedback_branch_protection_check_name_parity`). 3) Harness Replays: cf-proxy nginx.conf via docker `configs:` (not bind) - act_runner runs the workflow inside a runner container; runc in the docker daemon below resolves bind-mount source paths on the OUTER host, not inside the runner. The path `/workspace/.../cf-proxy/nginx.conf` is invisible there → "not a directory" runc error. Switching to compose `configs:` packages the file as content rather than a host bind, sidestepping the DinD path-translation gap. Local validation: - YAML parsed clean for all 3 files. - cf-proxy nginx.conf: standalone `docker compose run cf-proxy nginx -T` reproduced the configs: mount end-to-end and dumped the config correctly. The full harness compose still renders via `docker compose config`. Real-CI verification will land on this branch's first push. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 17:06:09 -07:00
devops-engineer	8885f7cd12	fix(ci): pin actions/upload-artifact + download-artifact to @v3 for Gitea compatibility actions/upload-artifact@v4+ and download-artifact@v4+ use the GHES 3.10+ artifact protocol that Gitea Actions (act_runner v0.6 / Gitea 1.22.x) does NOT implement. Failure cite from PR #54 run 1325 jobs/2: ::error::@actions/artifact v2.0.0+, upload-artifact@v4+ and download-artifact@v4+ are not currently supported on GHES. Pinned all 3 references to v3.2.2 (latest v3) at SHA-pinned form for supply-chain hygiene, matching the existing `uses:` style in this repo. Affected workflows: - ci.yml (Canvas Next.js coverage upload, blocks `CI / Canvas (Next.js)` required check on every PR — was the merge-queue blocker for #53, #54, #69, #71, #76, #81) - e2e-staging-canvas.yml (Playwright report + screenshots on failure) No download-artifact callers in the repo, so v3-pin doesn't compose-break anywhere. Drop these pins post-Gitea-1.23+ when the v4 artifact protocol ships, or migrate to a Gitea-native action. Closes #210. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 16:54:44 -07:00
devops-engineer	0c7f3c8909	chore: sync main → staging (auto, `cdbf28fd`)	2026-05-07 23:45:36 +00:00
devops-engineer	3f9ba90672	chore: sync main → staging (auto, `07bd91e4`)	2026-05-07 23:44:31 +00:00
claude-ceo-assistant	4b82db72a7	Merge branch 'main' into fix/issue-72-auto-sync-token-canary-v2	2026-05-07 23:44:22 +00:00
claude-ceo-assistant	ed0874504e	Merge branch 'main' into fix/issue75-class-F-gh-run-list-to-statuses	2026-05-07 23:44:00 +00:00
devops-engineer	6656862870	chore: sync main → staging (auto, `e39fc920`)	2026-05-07 23:39:46 +00:00
claude-ceo-assistant	1819ac21f4	Merge branch 'main' into fix/issue75-class-A-gh-pr-to-gitea-rest	2026-05-07 23:37:57 +00:00
devops-engineer	ae49b184f6	chore: sync main → staging (auto, `1f1ead18`)	2026-05-07 23:33:25 +00:00
claude-ceo-assistant	d81fb98163	Merge branch 'main' into fix/issue-72-auto-sync-token-canary-v2	2026-05-07 22:53:32 +00:00
claude-ceo-assistant	4d5c9a6646	Merge branch 'main' into fix/issue75-class-F-gh-run-list-to-statuses	2026-05-07 22:53:26 +00:00
claude-ceo-assistant	9ecee78782	Merge branch 'main' into fix/issue75-class-A-gh-pr-to-gitea-rest	2026-05-07 22:53:11 +00:00
claude-ceo-assistant	d21c09babe	Merge branch 'main' into fix/195-auto-promote-staging-gitea-rest	2026-05-07 22:53:00 +00:00
claude-ceo-assistant	2b3a8f2e4d	Merge branch 'main' into fix/196-retarget-main-to-staging-gitea-rest	2026-05-07 22:52:35 +00:00
devops-engineer	34e05c35b9	chore: sync main → staging (auto, `6946cd12`)	2026-05-07 22:45:14 +00:00
claude-ceo-assistant	85140f1c72	Merge branch 'main' into fix/issue-72-auto-sync-token-canary-v2	2026-05-07 22:40:56 +00:00
devops-engineer	5b3ce5c818	fix(ci): replace gh run list with Gitea commit-status query (#75 class F) Part of the post-#66 sweep to remove `gh` CLI dependencies that fail silently against Gitea. Class F covers `gh run list --workflow=X --commit=SHA` shapes — querying whether a specific workflow ran (and how it finished) for a specific SHA. Why this is the only call site in class F: `gh run list` hits GitHub's `/repos/.../actions/runs` REST endpoint. Gitea exposes ZERO endpoints under `/repos/.../actions/runs` — verified 2026-05-07 via swagger inspection: only secrets, variables, and runner-registration tokens live under /actions/. There's no way to query workflow run state via the Gitea v1 API directly. However, every Gitea Actions job DOES emit a commit status with `context = "<Workflow Name> / <Job Name> (<event>)"` (verified 2026-05-07 by reading /repos/.../commits/{sha}/statuses on a recent main SHA). That surface is exactly what we need: each workflow run leg is one status row, the aggregate state encodes the run outcome, and Gitea exposes it under `/api/v1/repos/.../commits/{sha}/statuses` which IS available. Affected: `auto-promote-on-e2e.yml` (lines 172-180): Old: `gh run list --workflow e2e-staging-saas.yml --commit $SHA --json status,conclusion --jq ...` returning a 5-bucket string like `completed/success` | `in_progress/none` | `none/none` | `completed/failure` | `completed/cancelled`. New: `curl /api/v1/repos/.../commits/$SHA/statuses` + jq filter on contexts whose name starts with `"E2E Staging SaaS (full lifecycle) /"`. Mapping: 0 matched contexts → "none/none" (E2E paths-filtered out — same as before) any context = pending → "in_progress/none" (defer) any context = error|failure → "completed/failure" (abort) all contexts = success → "completed/success" (proceed) The `completed/cancelled` arm of the case statement becomes unreachable: Gitea status API doesn't expose a `cancelled` state (it has success/failure/error/pending/warning), so per-SHA concurrency cancellations now surface as `failure` and are handled by the failure branch. Documented in-place; the cancelled arm is kept as defense-in-depth for any future dual-host operation. Verification: - Live curl against the current main SHA returns `none/none` (E2E was paths-filtered for that change set — expected). - Synthetic-input jq tests verify all four mapping buckets: no contexts → "none/none" one context = pending → "in_progress/none" success + success → "completed/success" success + failure → "completed/failure" - YAML syntax validates. Token: continues to use act_runner's GITHUB_TOKEN (per-run, repo read scope). The `/commits/{sha}/statuses` endpoint is repo-scoped, no extra perms needed. Closes part of #75. Master tracking issue at #75; companion PRs: #80 (class A — `gh pr ...`), #81 (class D — `gh api ...`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 15:38:57 -07:00
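The status-to-bucket mapping from 5b3ce5c818, restated in Python as a sketch of what the jq filter computes. It assumes the input is the JSON array from `GET /api/v1/repos/.../commits/{sha}/statuses`, whose objects carry `id`, `context` and `status`. Keeping only the highest-`id` row per context is an added assumption so that an early `pending` row cannot shadow a later `success`; treating unmapped states such as `warning` as "defer" is also an assumption, since the commit does not map them.

```python
from typing import Dict, List

PREFIX = "E2E Staging SaaS (full lifecycle) /"


def e2e_bucket(statuses: List[dict]) -> str:
    """Collapse commit statuses into the 'state/conclusion' string the case statement expects."""
    latest: Dict[str, dict] = {}
    for s in statuses:
        ctx = s.get("context", "")
        # assumption: the newest row per context (highest id) is the current state
        if ctx.startswith(PREFIX) and (ctx not in latest or s["id"] > latest[ctx]["id"]):
            latest[ctx] = s
    states = [s["status"] for s in latest.values()]
    if not states:
        return "none/none"  # E2E was paths-filtered out
    if any(st == "pending" for st in states):
        return "in_progress/none"  # defer
    if any(st in ("error", "failure") for st in states):
        return "completed/failure"  # abort (also covers cancelled runs)
    if all(st == "success" for st in states):
        return "completed/success"  # proceed
    return "in_progress/none"  # assumption: e.g. "warning" defers rather than proceeds
```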
claude-ceo-assistant	bcc72419ce	Merge branch 'main' into fix/144-branch-protection-check-name-parity-audit	2026-05-07 22:35:33 +00:00
claude-ceo-assistant	e4e1bf4080	ci(canary): annotate EXPECTED_PERSONA dual-update constraint Hostile-self-review weakest-spot #2: if the devops-engineer persona is ever renamed, the canary will go red even if everything else is fine. Add an inline comment pointing the next editor at both files that must update together (auto-sync-main-to-staging.yml's git config + this canary's EXPECTED_PERSONA + the staging branch protection's push_whitelist_usernames). No behaviour change — comment-only.	2026-05-07 15:35:22 -07:00
claude-ceo-assistant	62629eda4a	ci(canary): rewrite Probe 3 to actually validate auth (NOP push --dry-run) While verifying Phase 4, found a real flaw in Probe 3 (`git ls-remote refs/heads/staging`). On a public repo (which molecule-core is), Gitea falls back to anonymous read on bad auth, so `ls-remote` succeeds even with a junk token. The probe was therefore green-lighting rotated tokens — false-green, the worst possible canary failure mode. Rewritten to use `git push --dry-run` of the current staging SHA back to `refs/heads/staging`: - Push always authenticates (auth-gated on smart-protocol handshake, before the dry-run can compute the empty-diff). - NOP by construction: pushing the current tip back to itself is "Everything up-to-date" with exit 0. - Bad token → "Authentication failed", exit 128. - Doesn't reach pre-receive (where branch-protection authz runs), so scope is "auth only" — matches the design intent (failure mode B); authz already covered daily by branch-protection-drift.yml. Implementation note: `git push` requires a local repo. Spinning up a fresh `git init` in a tempdir (~1KB, ~50ms) instead of pulling the full repo via actions/checkout — actions/checkout would clone ~hundreds of MB for what amounts to "a place to run git from." Local mutation tests pass: - Real token: "Everything up-to-date" exit 0 - Junk token: "Authentication failed" exit 128 with actionable ::error:: messages pointing at the runbook Header comment + runbook step-mapping updated to reflect new probe shape. Refs: #72	2026-05-07 15:34:34 -07:00
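The rewritten Probe 3 can be sketched in Python. The sketch builds the NOP `git push --dry-run` command, runs it from a throwaway `git init` directory, and classifies the result using the exit codes quoted in the commit message. The helper names are hypothetical. The sketch also makes two assumptions. First, `remote_url` already carries the credential under test; how the token is embedded is not shown. Second, `staging_sha` is the current staging tip, obtained elsewhere. Any failure other than "Authentication failed" with exit 128 is bucketed separately, because the message does not describe how the workflow treats it.

```python
import subprocess
import tempfile


def dry_run_push_cmd(remote_url: str, staging_sha: str) -> list[str]:
    """NOP push: send the current staging tip back onto refs/heads/staging."""
    return [
        "git", "push", "--dry-run",
        remote_url,
        f"{staging_sha}:refs/heads/staging",
    ]


def classify_probe(returncode: int, stderr: str) -> str:
    """Map the push result onto the outcomes described in the commit."""
    if returncode == 0:
        # "Everything up-to-date": the token authenticated.
        return "ok"
    if returncode == 128 and "Authentication failed" in stderr:
        # Rotated or junk token (failure mode B).
        return "auth_failed"
    # Network/DNS/other errors: reported separately in this sketch.
    return "other_failure"


def run_probe(remote_url: str, staging_sha: str) -> str:
    """Run the probe from an empty repo instead of a full actions/checkout."""
    with tempfile.TemporaryDirectory() as workdir:
        subprocess.run(["git", "init", "-q"], cwd=workdir, check=True)
        result = subprocess.run(
            dry_run_push_cmd(remote_url, staging_sha),
            cwd=workdir,
            capture_output=True,
            text=True,
            check=False,
        )
    return classify_probe(result.returncode, result.stderr)
```

Keeping classification separate from execution lets the two outcomes from the local mutation tests (real token, junk token) be checked without network access.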
devops-engineer	224b65764d	chore: sync main → staging (auto, `050cb035`)	2026-05-07 22:34:17 +00:00
devops-engineer	e075557b19	fix(ci): replace gh pr CLI with Gitea v1 REST in workflows + scripts (#75 class A) Part of the post-#66 sweep to remove `gh` CLI dependencies that fail silently against Gitea (which exposes /api/v1 only — no GraphQL → 405, no /api/v3 → 404). Class A covers `gh pr list / view / diff / comment` shapes. Affected: - `.github/workflows/auto-tag-runtime.yml` Replaced `gh pr list --search SHA --json number,labels` with a curl to `/api/v1/repos/.../pulls?state=closed&sort=newest&limit=50` + jq filter on `merge_commit_sha == github.sha`. Same end-to-end behaviour: locate the merged PR for this push, read its labels, pick the bump kind. Defensive `?.name // empty` jq guard handles unlabelled PRs without erroring. The 50-PR window is comfortably larger than the volume of staging→main promotes that close in any reasonable detection window. - `scripts/check-stale-promote-pr.sh` Rewrote `fetch_prs` and `post_comment` to call Gitea's REST API directly. Gitea doesn't expose GitHub's compound `mergeStateStatus` / `reviewDecision` fields, so the new fetcher pulls `/pulls?state=open&base=main` then for each PR pulls `/pulls/{n}/reviews` and synthesizes the GitHub-shape JSON the rest of the script (and the existing fixture-based unit tests) consume: BLOCKED + REVIEW_REQUIRED ↔ mergeable=true AND 0 APPROVED reviews DIRTY ↔ mergeable=false (alarm doesn't fire) CLEAN + APPROVED ↔ mergeable=true AND ≥1 APPROVED review Comment-posting moves to `POST /repos/.../issues/{n}/comments` (Gitea treats PRs as issues for the comment surface, same as GitHub's REST). All 23 fixture-driven unit tests still pass — fixtures pass GitHub-shape JSON via PR_FIXTURE which short-circuits the live fetch path. - `scripts/ops/check_migration_collisions.py` Replaced `gh pr list` + `gh pr diff` calls with stdlib `urllib` against /api/v1. Helper `_gitea_get` centralizes auth + error handling; uses GITEA_TOKEN env, falling back to GITHUB_TOKEN (act_runner) and GH_TOKEN. 
Return shape from `open_prs_with_migration_prefix` mimics the historical `--json number,headRefName`, so the call sites are unchanged. All 9 regex-classifier unit tests still pass; a live integration test against the production Gitea API returns 0 collisions for prefix=999, as expected. The curl invocation pattern is `curl --fail-with-body -sS` (NOT `-fsS`: the `-f` in `-fsS` is `--fail`, which is mutually exclusive with `--fail-with-body` in modern curl; caught by `curl: You must select either --fail or --fail-with-body, not both` during local verification). Token model: workflows pass act_runner's GITHUB_TOKEN (per-run, repo read scope), the same surface used by the auto-sync fix in PR #66 and the surrounding workflows. No new repo secrets required. Verification: bash unit tests (23/23 pass), python unittest (9/9 pass), a live curl call against production Gitea returns 200 with the expected shape, and YAML / shell / Python syntax all validate. Closes part of #75. Other classes (D, `gh api`; F, `gh run list`) land in follow-up PRs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 15:29:26 -07:00
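The GitHub-shape synthesis that `check-stale-promote-pr.sh` performs can be sketched in Python. The real script is bash, so this is only an illustration of the mapping. The function name `synthesize_pr` is hypothetical. The input field names (`number`, `mergeable`, `head.ref`, review `state == "APPROVED"`) follow Gitea's PR and review JSON as we understand it. The commit message leaves `reviewDecision` unspecified for DIRTY PRs, so this sketch fills it from the approval count and labels that as an assumption.

```python
from typing import Any


def synthesize_pr(pr: dict[str, Any], reviews: list[dict[str, Any]]) -> dict[str, Any]:
    """Translate a Gitea PR plus its reviews into the GitHub-shape fields
    (mergeStateStatus / reviewDecision) the rest of the script consumes."""
    approved = sum(1 for r in reviews if r.get("state") == "APPROVED")
    # Assumption: reviewDecision is derived from approvals for every PR,
    # including DIRTY ones (the commit message does not pin this down).
    decision = "APPROVED" if approved >= 1 else "REVIEW_REQUIRED"
    if not pr.get("mergeable"):
        # DIRTY: the stale-promote alarm does not fire for these.
        merge_state = "DIRTY"
    elif approved == 0:
        # BLOCKED + REVIEW_REQUIRED: mergeable but nobody has approved yet.
        merge_state = "BLOCKED"
    else:
        # CLEAN + APPROVED: mergeable with at least one approval.
        merge_state = "CLEAN"
    return {
        "number": pr["number"],
        "headRefName": pr["head"]["ref"],
        "mergeStateStatus": merge_state,
        "reviewDecision": decision,
    }
```

Because the output matches the old `gh` JSON shape, fixture-driven tests that inject GitHub-shape JSON keep working unchanged.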
devops-engineer	fab65c78d6	fix(ci): rewrite retarget-main-to-staging for Gitea REST API Root cause: same as #65/#73 — gh CLI calls Gitea GraphQL (/api/graphql) which returns HTTP 405. Specifically: - gh api -X PATCH /pulls/{N} sometimes works but is flaky on Gitea (depends on gh's host-resolution layer) - gh pr close / gh pr comment route through GraphQL → 405 Fix: replace all gh calls with direct curl REST calls to Gitea: - PATCH /api/v1/repos/{owner}/{repo}/pulls/{index} body {"base": "staging"} — retarget the PR base - POST /api/v1/repos/{owner}/{repo}/issues/{index}/comments — post the explainer comment (PRs are issues in Gitea, comments share the issue endpoint) - PATCH /api/v1/repos/{owner}/{repo}/pulls/{index} body {"state": "closed"} — close redundant PR for #1884 case Identity: switch from secrets.GITHUB_TOKEN (per-job ephemeral, narrow scope on Gitea) to secrets.AUTO_SYNC_TOKEN (devops-engineer persona). Same persona used by auto-sync (#66) and auto-promote (#78). Per feedback_per_agent_gitea_identity_default. PR-edit and comment do not need branch-protection bypass. Curl-status-capture pattern hardened per feedback_curl_status_capture_pollution: http_code via -w to its own scalar, body to a tempfile, set +e/-e bracket so curl's non-zero-on-4xx doesn't pollute the script's exit chain. Header comment block fully rewritten with 4 failure-mode runbooks (A: 422 dup-base, B: token rotated, C: PR deleted, D: filter mis-fire) per PR #66/#78's pattern. Refs: #65, #74, #196, PR #66 + #78 (canonical reference) Closes #74 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 15:28:26 -07:00
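The three REST calls in the retarget fix can be sketched in Python with the standard library. The sketch uses the same endpoints and bodies as the commit message. It also mirrors the hardened status-capture pattern by returning the HTTP code and the body separately, even on 4xx/5xx. The helper names and the `Authorization: token ...` header form are assumptions for illustration. The real workflow uses curl with AUTO_SYNC_TOKEN.

```python
import json
import urllib.error
import urllib.request
from typing import Any


def gitea_request(base_url: str, token: str, method: str, path: str,
                  payload: dict[str, Any]) -> urllib.request.Request:
    """Build a JSON request against Gitea's /api/v1 surface."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/v1{path}",
        data=json.dumps(payload).encode("utf-8"),
        method=method,
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )


def retarget_to_staging(base_url: str, token: str, owner: str, repo: str,
                        index: int) -> urllib.request.Request:
    return gitea_request(base_url, token, "PATCH",
                         f"/repos/{owner}/{repo}/pulls/{index}",
                         {"base": "staging"})


def post_comment(base_url: str, token: str, owner: str, repo: str,
                 index: int, body: str) -> urllib.request.Request:
    # PRs are issues in Gitea, so comments go through the issue endpoint.
    return gitea_request(base_url, token, "POST",
                         f"/repos/{owner}/{repo}/issues/{index}/comments",
                         {"body": body})


def close_pr(base_url: str, token: str, owner: str, repo: str,
             index: int) -> urllib.request.Request:
    return gitea_request(base_url, token, "PATCH",
                         f"/repos/{owner}/{repo}/pulls/{index}",
                         {"state": "closed"})


def send(req: urllib.request.Request, timeout: float = 30) -> tuple[int, str]:
    """Return (http_code, body) separately, even for 4xx/5xx responses."""
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8")
```

For example, `send(retarget_to_staging(url, token, owner, repo, n))` yields a status code that can be compared against the expected value. A 422 (runbook A) is then just a different integer, not a crashed step.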
claude-ceo-assistant	0cef033a6a	ci(canary): route curl -w to tempfile to satisfy status-capture lint The two API probes used the unsafe shape rejected by lint-curl-status-capture.yml (per feedback_curl_status_capture_pollution): `status=$(curl ... -w '%{http_code}' ... || echo "000")` When curl exits non-zero (transport error, --fail-with-body 4xx/5xx), the `-w` has already written a code; the `|| echo "000"` then APPENDS another "000", yielding "000000" or "409000", which passes shape checks while looking right. Switch to the canonical safe shape (set +e + tempfile + cat): `set +e; curl ... -w '%{http_code}' >code_file 2>/dev/null; set -e; status=$(cat code_file 2>/dev/null || true); [ -z "$status" ] && status="000"` Inline comment in both probe steps explains the lint constraint so the next editor doesn't reintroduce the bad pattern. Refs: #72, lint failure on PR #77 (1/22 red → 22/22 expected green)	2026-05-07 15:26:22 -07:00
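The pollution bug and its fix can be reproduced in a few lines of Python. In this sketch, a tiny `sh -c` command stands in for curl: `printf 409` plays the `-w '%{http_code}'` output, and `exit 22` is curl's exit code when `--fail`/`--fail-with-body` sees an HTTP error. The function names are hypothetical. `polluted_status` mimics the rejected `|| echo "000"` shape, and `safe_status` mimics the set +e / tempfile / cat shape.

```python
import subprocess


def polluted_status(cmd: list[str]) -> str:
    """Mimic `status=$(cmd || echo "000")`: the -w code plus an appended
    "000" whenever the command exits non-zero."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    out = result.stdout
    if result.returncode != 0:
        out += "000"
    return out.strip()


def safe_status(cmd: list[str]) -> str:
    """Mimic the safe shape: keep whatever the command wrote, ignore its
    exit code, and default to "000" if nothing was written."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout.strip() or "000"


# Stand-in for curl --fail-with-body hitting a 409: -w has already
# printed the code before curl exits 22.
fake_curl_409 = ["sh", "-c", "printf 409; exit 22"]
print(polluted_status(fake_curl_409))  # -> 409000
print(safe_status(fake_curl_409))      # -> 409
```

The key point is that the exit code and the captured `-w` value are independent. Only the safe shape keeps them from being concatenated.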
claude-ceo-assistant	b83b533381	Merge branch 'main' into fix/144-branch-protection-check-name-parity-audit	2026-05-07 22:24:45 +00:00
claude-ceo-assistant	a23cf6a6bb	Merge branch 'main' into fix/harness-replays-pre-clone-manifest	2026-05-07 22:24:42 +00:00
devops-engineer	6acd63fa5a	fix(ci): rewrite auto-promote staging→main for Gitea REST API Root cause: same as #65/PR-#66 — gh CLI calls Gitea GraphQL (/api/graphql) which returns HTTP 405. Additionally, gh workflow run calls /actions/workflows/{id}/dispatches which does not exist on Gitea 1.22.6 (verified via swagger.v1.json). Fix: - Replace gh run list with Gitea REST combined-status endpoint (GET /repos/{owner}/{repo}/commits/{ref}/status). Combined state encodes the AND across every check context — simpler than the per-workflow loop and immune to workflow-name collisions. - Replace gh pr create / merge --auto with direct curl calls to POST /pulls and POST /pulls/{N}/merge with merge_when_checks_succeed. - Remove the post-merge polling tail entirely. The GitHub-era GITHUB_TOKEN no-recursion rule does not apply on Gitea Actions (verified empirically: PR #66 merge fired downstream pushes naturally). Even if we wanted to dispatch, Gitea has no workflow_dispatch REST endpoint. Critical constraint: main has enable_push: false with no whitelist; direct push is impossible for any persona. PR-mediated merge is the only path. main has required_approvals: 1 — auto-merge waits for Hongming's approval before landing, preserving the feedback_prod_apply_needs_hongming_chat_go contract. Identity: AUTO_SYNC_TOKEN (devops-engineer persona). Not founder PAT. Per feedback_per_agent_gitea_identity_default. Same persona used by auto-sync (PR #66) — keeps identity model coherent. Header comment block fully rewritten with 4 failure-mode runbooks (A: gates not green, B: PR-create non-201, C: merge schedule fails, D: token rotated/scope wrong) per PR #66's pattern. Refs: #65, #73, #195, PR #66 (canonical reference) Closes #73 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 15:24:28 -07:00
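The auto-promote flow can be sketched in Python as a gate plus two request descriptions. The gate checks that the combined status is green. The two requests open the staging→main PR and schedule its merge. The field names `Do` and `merge_when_checks_succeed` reflect our reading of Gitea's merge-options schema; verify them against `swagger.v1.json` for your Gitea version. The merge style `"merge"`, the PR title, and all function names are hypothetical.

```python
from typing import Any

Call = tuple[str, str, dict[str, Any]]  # (method, path, JSON body)


def should_promote(combined_status: dict[str, Any]) -> bool:
    """Gate on GET /repos/{owner}/{repo}/commits/{ref}/status.

    The combined "state" already ANDs every check context, so a single
    comparison replaces the old per-workflow `gh run list` loop."""
    return combined_status.get("state") == "success"


def create_pr_call(owner: str, repo: str, title: str) -> Call:
    """POST /pulls opening the staging -> main promote PR."""
    return ("POST", f"/repos/{owner}/{repo}/pulls",
            {"head": "staging", "base": "main", "title": title})


def schedule_merge_call(owner: str, repo: str, number: int,
                        merge_style: str = "merge") -> Call:
    """POST /pulls/{N}/merge with merge_when_checks_succeed, so the merge
    only lands once the PR is mergeable (per the commit, that includes
    main's required_approvals: 1)."""
    return ("POST", f"/repos/{owner}/{repo}/pulls/{number}/merge",
            {"Do": merge_style, "merge_when_checks_succeed": True})
```

Because `main` has `enable_push: false`, these two calls are the only promotion path. A direct push is not an option for any persona.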

333 Commits