Three workflows have been failing on every push to this Gitea repo for
GitHub-shaped reasons that don't translate to act_runner. Surfaced
while landing #84; bundled per `feedback_gitea_actions_migration_audit_pattern`
("bundle per-repo, not per-finding") instead of three separate PRs.
1) handlers-postgres-integration: localhost → 127.0.0.1
- lib/pq tries to dial localhost → ::1 first; the postgres service
container only listens on IPv4 → ECONNREFUSED → all
TestIntegration_* fail. Pin IPv4 to make the job deterministic.
2) pr-guards / disable-auto-merge-on-push: Gitea no-op
- The previous reusable-workflow caller invoked `gh pr merge
--disable-auto`, which calls GitHub's GraphQL API. Gitea returns
HTTP 405 on /api/graphql → step always fails. Inline the step so
it can detect Gitea (GITEA_ACTIONS=true OR repo url under
moleculesai.app) and no-op with a notice. Auto-merge gating is
moot on Gitea anyway: there's no `--auto` primitive being
touched. Job stays ALWAYS-RUN so branch protection's required
check still lands SUCCESS (avoids the SKIPPED-in-set trap from
`feedback_branch_protection_check_name_parity`).
3) Harness Replays: cf-proxy nginx.conf via docker `configs:` (not bind)
- act_runner runs the workflow inside a runner container; runc in
the docker daemon below resolves bind-mount source paths on the
OUTER host, not inside the runner. The path
`/workspace/.../cf-proxy/nginx.conf` is invisible there → "not a
directory" runc error. Switching to compose `configs:` packages
the file as content rather than a host bind, sidestepping the
DinD path-translation gap.
Local validation:
- YAML parsed clean for all 3 files.
- cf-proxy nginx.conf: standalone `docker compose run cf-proxy
nginx -T` reproduced the configs: mount end-to-end and dumped the
config correctly. The full harness compose still renders via
`docker compose config`.
Real-CI verification will land on this branch's first push.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Part of the post-#66 sweep to remove `gh` CLI dependencies that fail
silently against Gitea. Class F covers `gh run list --workflow=X
--commit=SHA` shapes — querying whether a specific workflow ran (and
how it finished) for a specific SHA.
Why this is the only call site in class F:
`gh run list` hits GitHub's `/repos/.../actions/runs` REST endpoint.
Gitea exposes ZERO endpoints under `/repos/.../actions/runs` —
verified 2026-05-07 via swagger inspection: only secrets, variables,
and runner-registration tokens live under /actions/. There's no way
to query workflow run state via the Gitea v1 API directly.
However, every Gitea Actions job DOES emit a commit status with
`context = "<Workflow Name> / <Job Name> (<event>)"` (verified
2026-05-07 by reading /repos/.../commits/{sha}/statuses on a recent
main SHA). That surface is exactly what we need: each workflow run
leg is one status row, the aggregate state encodes the run outcome,
and Gitea exposes it under `/api/v1/repos/.../commits/{sha}/statuses`
which IS available.
Affected:
`auto-promote-on-e2e.yml` (lines 172-180):
Old: `gh run list --workflow e2e-staging-saas.yml --commit $SHA
--json status,conclusion --jq ...` returning a 5-bucket string
like `completed/success` | `in_progress/none` | `none/none` |
`completed/failure` | `completed/cancelled`.
New: `curl /api/v1/repos/.../commits/$SHA/statuses` + jq filter on
contexts whose name starts with
`"E2E Staging SaaS (full lifecycle) /"`. Mapping:
0 matched contexts → "none/none" (E2E paths-
filtered out
— same as
before)
any context = pending → "in_progress/none" (defer)
any context = error|failure → "completed/failure" (abort)
all contexts = success → "completed/success" (proceed)
The `completed/cancelled` arm of the case statement becomes
unreachable: Gitea status API doesn't expose a `cancelled` state
(it has success/failure/error/pending/warning), so per-SHA
concurrency cancellations now surface as `failure` and are handled
by the failure branch. Documented in-place; the cancelled arm is
kept as defense-in-depth for any future dual-host operation.
Verification:
- Live curl against the current main SHA returns `none/none` (E2E
was paths-filtered for that change set — expected).
- Synthetic-input jq tests verify all four mapping buckets:
no contexts → "none/none"
one context = pending → "in_progress/none"
success + success → "completed/success"
success + failure → "completed/failure"
- YAML syntax validates.
Token: continues to use act_runner's GITHUB_TOKEN (per-run, repo
read scope). The `/commits/{sha}/statuses` endpoint is repo-scoped,
no extra perms needed.
Closes part of #75. Master tracking issue at #75; companion PRs:
#80 (class A — `gh pr ...`), #81 (class D — `gh api ...`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hostile-self-review weakest-spot #2: if the devops-engineer persona
is ever renamed, the canary will go red even if everything else is
fine. Add an inline comment pointing the next editor at both files
that must update together (auto-sync-main-to-staging.yml's git
config + this canary's EXPECTED_PERSONA + the staging branch
protection's push_whitelist_usernames).
No behaviour change — comment-only.
While verifying Phase 4, found a real flaw in Probe 3 (`git ls-remote
refs/heads/staging`). On a public repo (which molecule-core is), Gitea
falls back to anonymous read on bad auth, so `ls-remote` succeeds even
with a junk token. The probe was therefore green-lighting rotated
tokens — false-green, the worst possible canary failure mode.
Rewritten to use `git push --dry-run` of the current staging SHA back
to `refs/heads/staging`:
- Push always authenticates (auth-gated on smart-protocol handshake,
before the dry-run can compute the empty-diff).
- NOP by construction: pushing the current tip back to itself is
"Everything up-to-date" with exit 0.
- Bad token → "Authentication failed", exit 128.
- Doesn't reach pre-receive (where branch-protection authz runs), so
scope is "auth only" — matches the design intent (failure mode B);
authz already covered daily by branch-protection-drift.yml.
Implementation note: `git push` requires a local repo. Spinning up a
fresh `git init` in a tempdir (~1KB, ~50ms) instead of pulling the
full repo via actions/checkout — actions/checkout would clone
~hundreds of MB for what amounts to "a place to run git from."
Local mutation tests pass:
- Real token: "Everything up-to-date" exit 0
- Junk token: "Authentication failed" exit 128 with actionable
::error:: messages pointing at the runbook
Header comment + runbook step-mapping updated to reflect new probe
shape. Refs: #72
Part of the post-#66 sweep to remove `gh` CLI dependencies that fail
silently against Gitea (which exposes /api/v1 only — no GraphQL → 405,
no /api/v3 → 404). Class A covers `gh pr list / view / diff / comment`
shapes.
Affected:
- `.github/workflows/auto-tag-runtime.yml`
Replaced `gh pr list --search SHA --json number,labels` with a curl
to `/api/v1/repos/.../pulls?state=closed&sort=newest&limit=50` +
jq filter on `merge_commit_sha == github.sha`. Same end-to-end
behaviour: locate the merged PR for this push, read its labels,
pick the bump kind. Defensive `?.name // empty` jq guard handles
unlabelled PRs without erroring. The 50-PR window is comfortably
larger than the volume of staging→main promotes that close in any
reasonable detection window.
- `scripts/check-stale-promote-pr.sh`
Rewrote `fetch_prs` and `post_comment` to call Gitea's REST API
directly. Gitea doesn't expose GitHub's compound `mergeStateStatus`
/ `reviewDecision` fields, so the new fetcher pulls
`/pulls?state=open&base=main` then for each PR pulls
`/pulls/{n}/reviews` and synthesizes the GitHub-shape JSON the rest
of the script (and the existing fixture-based unit tests) consume:
BLOCKED + REVIEW_REQUIRED ↔ mergeable=true AND 0 APPROVED reviews
DIRTY ↔ mergeable=false (alarm doesn't fire)
CLEAN + APPROVED ↔ mergeable=true AND ≥1 APPROVED review
Comment-posting moves to `POST /repos/.../issues/{n}/comments`
(Gitea treats PRs as issues for the comment surface, same as
GitHub's REST). All 23 fixture-driven unit tests still pass —
fixtures pass GitHub-shape JSON via PR_FIXTURE which short-circuits
the live fetch path.
- `scripts/ops/check_migration_collisions.py`
Replaced `gh pr list` + `gh pr diff` calls with stdlib `urllib`
against /api/v1. Helper `_gitea_get` centralizes auth + error
handling; uses GITEA_TOKEN env, falling back to GITHUB_TOKEN
(act_runner) and GH_TOKEN. Return shape from
`open_prs_with_migration_prefix` mimics the historical
`--json number,headRefName` so the call sites are unchanged. All 9
regex-classifier unit tests still pass; live integration test
against the production Gitea API returns 0 collisions for prefix=999
as expected.
curl invocation pattern is `curl --fail-with-body -sS` (NOT `-fsS` —
the two short-fail flags are mutually exclusive in modern curl;
caught by `curl: You must select either --fail or --fail-with-body,
not both` during local verification).
Token model: workflows pass act_runner's GITHUB_TOKEN (per-run, repo
read scope) — same surface used by the auto-sync fix in PR #66 plus
the surrounding workflows. No new repo secrets required.
Verification: bash unit tests (23/23 pass), python unittest (9/9 pass),
live curl call against production Gitea returns 200 with the expected
shape, YAML / shell / Python syntax all validate.
Closes part of #75. Other classes (D — `gh api`; F — `gh run list`)
land in follow-up PRs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two API probes used the unsafe shape rejected by
lint-curl-status-capture.yml (per feedback_curl_status_capture_pollution):
status=$(curl ... -w '%{http_code}' ... || echo "000")
When curl exits non-zero (transport error, --fail-with-body 4xx/5xx),
the `-w` already wrote a code; the `|| echo "000"` then APPENDS another
"000", yielding "000000" or "409000" — passes shape checks while looking
right.
Switch to the canonical safe shape (set +e + tempfile + cat):
set +e
curl ... -w '%{http_code}' >code_file 2>/dev/null
set -e
status=$(cat code_file 2>/dev/null || true)
[ -z "$status" ] && status="000"
Inline comment in both probe steps explains the lint constraint so
the next editor doesn't re-introduce the bad pattern.
Refs: #72, lint failure on PR #77 (1/22 red → 22/22 expected green)
Root cause: same as #65/PR-#66 — gh CLI calls Gitea GraphQL
(/api/graphql) which returns HTTP 405. Additionally, gh workflow
run calls /actions/workflows/{id}/dispatches which does not
exist on Gitea 1.22.6 (verified via swagger.v1.json).
Fix:
- Replace gh run list with Gitea REST combined-status endpoint
(GET /repos/{owner}/{repo}/commits/{ref}/status). Combined state
encodes the AND across every check context — simpler than the
per-workflow loop and immune to workflow-name collisions.
- Replace gh pr create / merge --auto with direct curl calls to
POST /pulls and POST /pulls/{N}/merge with merge_when_checks_succeed.
- Remove the post-merge polling tail entirely. The GitHub-era
GITHUB_TOKEN no-recursion rule does not apply on Gitea Actions
(verified empirically: PR #66 merge fired downstream pushes
naturally). Even if we wanted to dispatch, Gitea has no
workflow_dispatch REST endpoint.
Critical constraint: main has enable_push: false with no whitelist;
direct push is impossible for any persona. PR-mediated merge is the
only path. main has required_approvals: 1 — auto-merge waits for
Hongming's approval before landing, preserving the
feedback_prod_apply_needs_hongming_chat_go contract.
Identity: AUTO_SYNC_TOKEN (devops-engineer persona). Not founder PAT.
Per feedback_per_agent_gitea_identity_default. Same persona used by
auto-sync (PR #66) — keeps identity model coherent.
Header comment block fully rewritten with 4 failure-mode runbooks
(A: gates not green, B: PR-create non-201, C: merge schedule fails,
D: token rotated/scope wrong) per PR #66's pattern.
Refs: #65, #73, #195, PR #66 (canonical reference)
Closes#73
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a 6h-cron synthetic check that fires the auth surface used by
auto-sync-main-to-staging.yml (PR #66) and emits a red workflow
status when AUTO_SYNC_TOKEN has drifted out of validity. Closes
hostile-self-review weakest-spot #3 from PR #66 (token-rotation
detection latency).
Read-only verification — no writes, no synthetic merge commits, no
canary branch noise. Three probes:
1. GET /api/v1/user → token authenticates as devops-engineer
2. GET /api/v1/repos/molecule-ai/molecule-core → read:repository scope
3. git ls-remote refs/heads/staging → exact HTTPS auth path used by
actions/checkout in the real auto-sync workflow
Hard-fail on missing AUTO_SYNC_TOKEN secret on both schedule and
workflow_dispatch — per feedback_schedule_vs_dispatch_secrets_hardening,
a silent soft-skip would make the canary itself drift-invisible (the
sweep-cf-orphans #2088 lesson). Operator runbook in workflow header.
Token reuse: same AUTO_SYNC_TOKEN as the workflow under monitor; no
new credential introduced. Read-only paths only.
Refs: #72, hostile-self-review #66
Root cause of `Auto-sync main → staging / sync-staging (push)`
failing every push to main since the GitHub→Gitea migration:
The workflow assumed a GitHub `merge_queue` ruleset on staging
(blocking direct push) and used `gh pr create` + `gh pr merge
--auto` to land sync via the queue. On Gitea this fails at the
`gh pr create` step with `HTTP 405 Method Not Allowed
(https://git.moleculesai.app/api/graphql)` — Gitea exposes no
GraphQL endpoint, and the GitHub-CLI cannot ship PRs against
Gitea.
Verified failure mode in run 1117/job 0 (token logs at
/tmp/log2.txt, run target /molecule-ai/molecule-core/actions/
runs/1117/jobs/0). The merge step succeeded and pushed
auto-sync/main-1e1f4d63; the PR step failed with the 405. So
every main push left an orphan auto-sync/* branch and a red CI
status, with no PR to land it.
Fix: the staging branch protection on Gitea
(`enable_push: true`, `push_whitelist_usernames:
[devops-engineer]`) already permits direct push from the
devops-engineer persona. Drop the entire merge-queue PR
architecture and replace with:
1. Checkout staging with secrets.AUTO_SYNC_TOKEN
(devops-engineer persona token, NOT founder PAT —
`feedback_per_agent_gitea_identity_default`).
2. `git fetch origin main` + ff-merge or no-ff merge.
3. `git push origin staging` directly.
The AUTO_SYNC_TOKEN repo secret already exists (created
2026-05-07 14:00 alongside the staging push_whitelist update).
Workflow name + job name unchanged → required-check name
`Auto-sync main → staging / sync-staging (push)` keeps the
same context, no branch-protection edits needed.
Rejected alternatives (documented in workflow header):
- Reuse PR architecture via Gitea REST: ~80 LOC of API
plumbing for no benefit; direct push works.
- GH_HOST=git.moleculesai.app: still calls /api/graphql,
same 405; doesn't fix the root issue.
- Custom JS action: external dep for a 5-line `git push`.
Header comment in the workflow now documents:
- What this workflow does (SSOT for staging advancing).
- Why direct push (GitHub merge_queue → Gitea push_whitelist).
- Identity and token (anti-bot-ring per saved memory).
- Failure modes A–D with operator runbook for each.
- Loop safety (push to staging doesn't fire push:main → no
recursion).
Verification plan: this fix-PR's merge to main is itself the
trigger; watch the workflow run on the merge commit and on
one follow-up trigger commit, expect both green.
Refs: failing run https://git.moleculesai.app/molecule-ai/
molecule-core/actions/runs/1117/jobs/0
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Audit finding: every workflow that emits a required-status-check name
on molecule-core's branch protection (apply.sh's STAGING_CHECKS +
MAIN_CHECKS) ALREADY uses the safe always-runs-with-conditional-steps
shape — Platform/Canvas/Python/Shellcheck in ci.yml, Canvas tabs E2E
in e2e-staging-canvas.yml, E2E API Smoke in e2e-api.yml, PR-built
wheel in runtime-prbuild-compat.yml, the codeql Analyze matrix, and
the always-on Secret scan + Detect changes. No production drift to
fix today.
Adds a regression-guard so the next path-filter / matrix refactor /
workflow rename can't silently re-introduce the bug shape called out
in saved memory feedback_branch_protection_check_name_parity:
"Path filters … silently break branch protection because no job
emits the protected sentinel status when path-filter returns false."
New tools:
- tools/branch-protection/check_name_parity.sh — extracts every
required check name from apply.sh's heredocs, then for each name
classifies the owning workflow as safe (no top-level paths:) /
safe (per-step if-gates without top-level paths:) / unsafe
(top-level paths: without per-step if-gates) / unsafe-mix
(top-level paths: WITH per-step if-gates — the workflow may still
skip entirely on path exclusion, leaving the gates dormant) /
missing (no emitter at all). Special-cases codeql.yml's matrix-
expanded `Analyze (${{ matrix.language }})`.
- tools/branch-protection/test_check_name_parity.sh — 6 unit tests
covering each classification: safe, unsafe-path-filter, missing,
safe-with-per-step-gates, unsafe-mix, matrix-expansion. Each test
builds a synthetic apply.sh + workflow file in a tmpdir, invokes
the script, and asserts on exit code + stderr substring. Per
feedback_assert_exact_not_substring the assertions pin specific
classifications, not just non-zero exit.
Wired into branch-protection-drift.yml so every PR touching
.github/workflows/** runs the parity check; the existing daily
schedule covers between-PR drift. The check is cheap (~1s) and runs
without the admin token — only reads files in the checkout. Self-
test step runs the unit tests on every invocation, so a regression
in the script can't false-pass on production.
Per BSD-vs-GNU portability hygiene: heredoc-marker extraction stays
in plain awk + sed (no gawk-only `match()` array form), grep regex
avoids `^` anchor for `if:` lines because real workflows use
` - if:` with the `-` step-marker between leading spaces and
`if:` (the original anchor missed every workflow's per-step gates).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why
---
PR #35 marked `continue-on-error: true` at the JOB level (correct YAML),
but Gitea Actions 1.22.6 does NOT propagate job-level continue-on-error
to the commit-status API — every matrix leg still posts `failure`. That
keeps OVERALL=failure on every push to main + staging and blocks the
auto-promote signal even when every other gate is green.
Worse: the underlying CodeQL run never actually worked on Gitea. The
github/codeql-action/init@v4 step calls api.github.com bundle endpoints
(CLI download + query packs + telemetry) that Gitea does NOT proxy.
Confirmed via live-tested run 1d/3101 on operator host:
2026-05-07T20:55:17 ::group::Run Initialize CodeQL
with: languages: ${{ matrix.language }}
queries: security-extended
2026-05-07T20:55:36 ::error::404 page not found
2026-05-07T20:55:50 Failure - Main Initialize CodeQL
2026-05-07T20:55:51 skipping Perform CodeQL Analysis (main skipped)
2026-05-07T20:55:51 :⚠️:No files were found at sarif-results/go/
The SARIF artifact upload was already a no-op (warning above) — the
analyze step never wrote anything because init failed. So nothing of
value is being lost by stubbing this out.
What
----
- Convert the workflow to a single-step stub that emits success per
matrix language (go, javascript-typescript, python).
- Keep workflow `name: CodeQL` exactly (auto-promote-staging.yml
line 67 keys on it as a workflow_run gate).
- Keep job name template `Analyze (${{ matrix.language }})` and the
3-leg matrix exactly (commit-status context names + branch
protection + #144 required-check-name parity).
- Keep all four triggers (push / pull_request / merge_group /
schedule) so merge_group required-checks parity holds.
- Drop the codeql-action steps, the Autobuild step, the SARIF parse
step, and the upload-artifact step — all four of those are now
dead code (init can never succeed against Gitea's API surface).
Policy
------
Per Hongming decision 2026-05-07 (#156): CodeQL is ADVISORY, not
blocking, until a Gitea-compatible SAST pipeline lands. The header
of the new workflow file documents this decision + lists the three
re-enable options (self-hosted Semgrep, Sonatype, GitHub mirror)
plus the compensating controls in place (secret-scan, block-internal-
paths, lint-curl-status-capture, branch-protection-drift).
Closes#156. Touches #142 (no capital-M Molecule-AI refs in this
file — already lowercase per e01077be).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
harness-replays.yml builds tenant-alpha + tenant-beta via tests/harness/
compose.yml using workspace-server/Dockerfile.tenant. Post-#173, that
Dockerfile expects .tenant-bundle-deps/{workspace-configs-templates,
org-templates,plugins} pre-cloned at the build context root. Sister
PR #38 added the pre-clone step to publish-workspace-server-image.yml
but missed harness-replays.yml.
Symptoms:
- main run #892 (2026-05-07T20:28:53Z): COPY
.tenant-bundle-deps/plugins -> failed to calculate checksum ...
not found.
- staging run #964 (2026-05-07T20:41:52Z): hits the OLD in-image
clone path (staging hasn't picked up the Dockerfile.tenant
refactor yet via auto-sync) and fails on
'fatal: could not read Username for https://git.moleculesai.app'
when cloning the first private workspace-template-* repo.
Fix: add the same Pre-clone step to harness-replays.yml,
mirroring publish-workspace-server-image.yml. Uses AUTO_SYNC_TOKEN
(devops-engineer persona PAT) per
feedback_per_agent_gitea_identity_default.
Once auto-sync main->staging unblocks (sister agent fixing the
7-file conflict in flight), staging will inherit both this workflow
fix AND the Dockerfile.tenant refactor atomically.
Refs: #168, #173
Refs Task #165 (Class D AUTO_SYNC_TOKEN plumbing).
main and staging diverged after the 2026-05-06 GitHub-org suspension
because Class D / Class G / feature work landed on staging while
unrelated CI fixes (#34-47, ECR auth-inline, buildx→docker, pre-clone
manifest deps) landed straight on main. Both branches edited the
same workflow files, so every push to main triggered an Auto-sync
run that aborted at `git merge --no-ff origin/main` with 7 content
conflicts:
- .github/workflows/canary-verify.yml (URL: github.com → Gitea)
- .github/workflows/ci.yml (3 URL refs)
- .github/workflows/publish-runtime.yml (cascade: HTTP repo-dispatch
→ Gitea push)
- .github/workflows/publish-workspace-server-image.yml
(drop AWS-action steps;
ECR auth is inline)
- .github/workflows/retarget-main-to-staging.yml (URL)
- manifest.json (lowercase org slug + add
mock-bigorg from main)
- scripts/clone-manifest.sh (keep main's MOLECULE_GITEA_TOKEN
auth path + drop awk-tolower
since manifest is now lowercase)
Resolution: union — staging's post-suspension Gitea/ECR migrations win
on URL/policy edits; main's additive work (mock-bigorg manifest entry,
inline ECR auth, MOLECULE_GITEA_TOKEN basic-auth) is preserved on top.
After this lands, staging is a strict superset of main, so the next
auto-sync run on a push to main will be a clean fast-forward / no-op.
The auto-sync workflow on main also picks up staging's AUTO_SYNC_TOKEN
swap (Class D #26) for free, fixing the latent layer-2 push-auth issue.
Verified locally:
- bash -n scripts/clone-manifest.sh
- python -c 'yaml.safe_load(...)' on each touched workflow
- python -c 'json.load(open(manifest.json))' (21 plugins, 9 templates,
7 org_templates)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run #1010 (post-#46) succeeded all the way to push but failed with
"repository molecule-ai/platform does not exist" — the platform image
ECR repo had never been created (only platform-tenant existed).
Created the repo via:
aws ecr create-repository --region us-east-2 \
--repository-name molecule-ai/platform \
--image-scanning-configuration scanOnPush=true
This is a one-line workflow comment to satisfy the path-filter and
re-run the publish workflow against the now-existing repo. Closes#173
properly this time — pre-clone + inline ECR auth + ECR repo all in
place.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI run #987 (post-#45) showed `docker push` from shell still hits
"no basic auth credentials" — `aws-actions/amazon-ecr-login@v2`
writes auth to a step-scoped DOCKER_CONFIG that doesn't carry across
to the next shell step on Gitea Actions.
Fix: drop both `aws-actions/configure-aws-credentials@v4` and
`aws-actions/amazon-ecr-login@v2`. Run `aws ecr get-login-password |
docker login` inline in the same shell step as `docker build` +
`docker push`. AWS creds come from secrets via env vars, ECR token
is fresh per-step (12h validity is plenty), config.json lives in the
same shell process — auth state is guaranteed.
This is the operator-host manual approach mapped 1:1 into CI.
runner-base image already has aws-cli + docker (verified locally).
Closes#173 (fifth piece — and final, this matches the manual flow
exactly).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI run #946 (post-#43) confirmed `driver: docker` doesn't fix the ECR
push 401 either: buildx CLI inside the runner container talks to the
operator-host docker daemon (mounted socket), but the daemon doesn't
see the runner's ECR auth state, and the runner's buildx CLI doesn't
attach the auth header in a way the daemon accepts.
Drop buildx + build-push-action entirely. Plain `docker build` +
`docker push` from the runner container works because both use the
SAME docker socket + the SAME runner-container config.json (populated
by `aws ecr get-login-password | docker login` from amazon-ecr-login).
Trade-off: lose multi-arch support. We only ship linux/amd64 tenant
images today, so this is fine. If multi-arch becomes a requirement
later, we can revisit (likely with `docker buildx create
--driver=remote` pointing at an external buildkit, but that's
substantial infra work; not worth it for a single-arch shop).
Closes#173 (fourth piece — and hopefully last; this matches the
operator-host manual approach exactly).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #38 + #41 fixed the Dockerfile-side clone issue. CI run #893 then
revealed two Gitea-Actions-specific issues with the unchanged buildx
config:
1. `failed to push: 401 Unauthorized` to ECR. Root cause: default
buildx driver `docker-container` spawns a buildkit container that
doesn't share the host's `~/.docker/config.json`, so the ECR auth
set up by amazon-ecr-login doesn't reach the push. Fix: pin
`driver: docker` so buildx delegates to the host daemon, which
already has the ECR creds.
2. `dial tcp ...:41939: i/o timeout` on `_apis/artifactcache/cache`.
Root cause: `cache-from/cache-to: type=gha` is GitHub-specific;
Gitea Actions has no compatible artifact-cache backend, so every
cache lookup fails after a 30s timeout. Fix: remove the cache-*
options. Cold-build cost is <10min for 37-repo clone + Go/Node
compile, acceptable. Could revisit with type=registry inline cache
later if rebuilds get painful.
With this + #38/#41, the workflow should run end-to-end on Gitea
Actions: pre-clone -> docker build (host daemon) -> ECR push.
Closes#173 (third and final piece).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The GitHub org Molecule-AI was suspended on 2026-05-06; canonical SCM
is now Gitea at https://git.moleculesai.app/molecule-ai/. Stale
github.com/Molecule-AI/... URLs return 404 and break tooling that
clones / pip-installs / curls them.
This bundles all non-Go-module URL fixes for this repo into a single PR.
Go module path references (in *.go, go.mod, go.sum) are out of scope
here -- tracked separately under Task #140.
Token-auth clone URLs also flip ${GITHUB_TOKEN} -> ${GITEA_TOKEN} since
the GitHub token does not auth against Gitea.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
publish-workspace-server-image.yml could not run on Gitea Actions because
Dockerfile.tenant's stage 3 ran `git clone` against private Gitea repos
from inside the Docker build context, where no auth path exists. Every
workspace-server rebuild required a manual operator-host push.
Move cloning to the trusted CI context (where AUTO_SYNC_TOKEN — the
devops-engineer persona PAT — is naturally available). Dockerfile.tenant
now COPYs from .tenant-bundle-deps/, populated by the workflow's new
"Pre-clone manifest deps" step. The Gitea token never enters the image.
- scripts/clone-manifest.sh: optional MOLECULE_GITEA_TOKEN env embeds
basic-auth in the clone URL; redacted in log output. Anonymous fallback
preserved for future public-repo path.
- .github/workflows/publish-workspace-server-image.yml: new pre-clone
step before docker build; injects AUTO_SYNC_TOKEN. Fail-fast if the
secret is empty.
- workspace-server/Dockerfile.tenant: drop stage 3 (templates), COPY
from .tenant-bundle-deps/ instead. Header documents the prereq.
- .gitignore: ignore /.tenant-bundle-deps/ so a local build can't
accidentally commit cloned repos.
Verified locally: clone-manifest.sh with the devops-engineer persona
token cloned all 37 repos (9 ws + 7 org + 21 plugins, 4.9MB after
.git strip).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/clone-manifest.sh runs inside the platform Dockerfile build,
so a change to that script needs to retrigger publish. Without it,
the prior fix (clone via Gitea + lowercase org) didn't trigger this
workflow because scripts/ wasn't in the path filter.
Also serves as the file change to satisfy the path filter for THIS
push, retriggering publish-workspace-server-image now.
Two coupled cleanups for the post-2026-05-06 stack:
============================================
The plugin injected GITHUB_TOKEN/GH_TOKEN via the App's
installation-access flow (~hourly rotation). Per-agent Gitea
identities replaced this approach after the 2026-05-06 suspension —
workspaces now provision with a per-persona Gitea PAT from .env
instead of an App-rotated token. The plugin code itself lived on
github.com/Molecule-AI/molecule-ai-plugin-github-app-auth which is
also unreachable post-suspension; checking it out at CI build time
was already failing.
Removed:
- workspace-server/cmd/server/main.go: githubappauth import + the
`if os.Getenv("GITHUB_APP_ID") != ""` block that called
BuildRegistry. gh-identity remains as the active mutator.
- workspace-server/Dockerfile + Dockerfile.tenant: COPY of the
sibling repo + the `replace github.com/Molecule-AI/molecule-ai-
plugin-github-app-auth => /plugin` directive injection.
- workspace-server/go.mod + go.sum: github-app-auth dep entry
(cleaned up by `go mod tidy`).
- 3 workflows: actions/checkout steps for the sibling plugin repo:
- .github/workflows/codeql.yml (Go matrix path)
- .github/workflows/harness-replays.yml
- .github/workflows/publish-workspace-server-image.yml
Verified `go build ./cmd/server` + `go vet ./...` pass post-removal.
=======================================================
Same workflow used to push to ghcr.io/molecule-ai/platform +
platform-tenant. ghcr.io/molecule-ai is gone post-suspension. The
operator's ECR org (153263036946.dkr.ecr.us-east-2.amazonaws.com/
molecule-ai/) already hosts platform-tenant + workspace-template-*
+ runner-base images and is the post-suspension SSOT for container
images. This PR aligns publish-workspace-server-image with that
stack.
- env.IMAGE_NAME + env.TENANT_IMAGE_NAME repointed to ECR URL.
- docker/login-action swapped for aws-actions/configure-aws-
credentials@v4 + aws-actions/amazon-ecr-login@v2 chain (the
standard ECR auth pattern; uses AWS_ACCESS_KEY_ID/SECRET secrets
bound to the molecule-cp IAM user).
The :staging-<sha> + :staging-latest tag policy is unchanged —
staging-CP's TENANT_IMAGE pin still points at :staging-latest, just
with the new registry prefix.
Refs molecule-core#157, #161; parallel to org-wide CI-green sweep.
Same shape as molecule-controlplane#29: per-job GITHUB_TOKEN
doesn't have the Gitea API permissions to open PRs / push branches
the auto-sync flow needs. AUTO_SYNC_TOKEN is the devops-engineer
persona PAT (per saved memory feedback_per_agent_gitea_identity_default).
Companion prod ops (already done):
- devops-engineer added as collaborator on molecule-core (write)
- devops-engineer added to staging branch protection push_whitelist
- AUTO_SYNC_TOKEN registered as Actions secret on molecule-core
Two coupled cleanups for the post-2026-05-06 stack:
#157 — drop molecule-ai-plugin-github-app-auth
============================================
The plugin injected GITHUB_TOKEN/GH_TOKEN via the App's
installation-access flow (~hourly rotation). Per-agent Gitea
identities replaced this approach after the 2026-05-06 suspension —
workspaces now provision with a per-persona Gitea PAT from .env
instead of an App-rotated token. The plugin code itself lived on
github.com/Molecule-AI/molecule-ai-plugin-github-app-auth which is
also unreachable post-suspension; checking it out at CI build time
was already failing.
Removed:
- workspace-server/cmd/server/main.go: githubappauth import + the
`if os.Getenv("GITHUB_APP_ID") != ""` block that called
BuildRegistry. gh-identity remains as the active mutator.
- workspace-server/Dockerfile + Dockerfile.tenant: COPY of the
sibling repo + the `replace github.com/Molecule-AI/molecule-ai-
plugin-github-app-auth => /plugin` directive injection.
- workspace-server/go.mod + go.sum: github-app-auth dep entry
(cleaned up by `go mod tidy`).
- 3 workflows: actions/checkout steps for the sibling plugin repo:
- .github/workflows/codeql.yml (Go matrix path)
- .github/workflows/harness-replays.yml
- .github/workflows/publish-workspace-server-image.yml
Verified `go build ./cmd/server` + `go vet ./...` pass post-removal.
#161 — swap GHCR→ECR for publish-workspace-server-image
=======================================================
Same workflow used to push to ghcr.io/molecule-ai/platform +
platform-tenant. ghcr.io/molecule-ai is gone post-suspension. The
operator's ECR org (153263036946.dkr.ecr.us-east-2.amazonaws.com/
molecule-ai/) already hosts platform-tenant + workspace-template-*
+ runner-base images and is the post-suspension SSOT for container
images. This PR aligns publish-workspace-server-image with that
stack.
- env.IMAGE_NAME + env.TENANT_IMAGE_NAME repointed to ECR URL.
- docker/login-action swapped for aws-actions/configure-aws-
credentials@v4 + aws-actions/amazon-ecr-login@v2 chain (the
standard ECR auth pattern; uses AWS_ACCESS_KEY_ID/SECRET secrets
bound to the molecule-cp IAM user).
The :staging-<sha> + :staging-latest tag policy is unchanged —
staging-CP's TENANT_IMAGE pin still points at :staging-latest, just
with the new registry prefix.
Refs molecule-core#157, #161; parallel to org-wide CI-green sweep.
The v2 dropped codex from TEMPLATES on the basis of "no
publish-image.yml = not part of cascade today." That was correct
about the immediate behavior but tripped cascade-list-drift-gate.yml
because manifest.json still declares codex (it IS a live runtime —
referenced from workspace/config.py and cloned into dev envs by
clone-manifest.sh; only the image-publish path is missing).
Restore codex to TEMPLATES (matching manifest) and add a runtime
soft-skip: probe each repo for .github/workflows/publish-image.yml
via the Gitea contents API and skip cleanly if 404. Final job log
distinguishes "complete across all" vs "complete with soft-skips".
This preserves the drift gate's invariant (TEMPLATES == manifest)
while honoring the empirical fact that codex has no publish-image
workflow yet. If codex later gains the workflow, no change here is
needed — the probe will see 200 and the cascade will fan out to it
naturally.
Refs molecule-core#14, molecule-core#20.
Empirical blocker on v1: Gitea 1.22.6 has no repository_dispatch /
workflow_dispatch trigger API (verified across 6 candidate paths in
issuecomment-913). v1's curl-POST loop would always exit-1.
v2 pivots to push-mode: each template repo got a small companion PR
(merged 2026-05-07) adding a `.runtime-version` file at root + a
`resolve-version` job in publish-image.yml that reads the file and
forwards the value to the reusable build workflow. publish-runtime
now updates that file via git-clone + commit + push, which trips
each template's existing `on: push: branches: [main]` trigger.
Behaviour changes vs v1:
- Templates list dropped from 9 → 8 (codex has no publish-image.yml
so was never part of the cascade in practice).
- 3-retry pull-rebase loop per template (handles concurrent-push
races without force-push). Failures collected, job exits 1 with
the failed-template list at the end.
- Idempotency: when re-run with the same version, templates already
pinned to that version contribute zero commits — operator can
safely re-run to retry partial failures.
- Author line: "publish-runtime cascade <publish-runtime@moleculesai
.app>" trailer makes it clear the commit is workflow-driven, not
human (per memory feedback_github_botring_fingerprint).
DISPATCH_TOKEN secret name unchanged (still consumed at
secrets.DISPATCH_TOKEN per 569df259).
Refs molecule-core#14, builds on molecule-core#20 issuecomment-923
(Phase 2 design).
The cascade workflow was reading from `secrets.TEMPLATE_DISPATCH_TOKEN`
but the plumbed secret name is `DISPATCH_TOKEN` (verified just now via
GET /repos/molecule-ai/molecule-core/actions/secrets — only DISPATCH_TOKEN
is set). Without this rename the cascade would always evaluate "secret
missing" and exit 1 on the next push to staging, defeating the entire
point of grant-role-access.sh --apply that just landed.
Three references updated:
- env mapping (`secrets.X` → `secrets.DISPATCH_TOKEN`)
- workflow_dispatch warning text
- push-trigger error text
The bash-side variable name is unchanged (still `DISPATCH_TOKEN`) so
the curl invocation at line 372 is unaffected. YAML round-trip parses
clean.
## Symptom
`publish-runtime.yml::cascade` fired a `repository_dispatch` to 10 workspace-template
repos via direct curl to `https://api.github.com/repos/...`. Post-2026-05-06 the
org's GitHub presence is suspended; every invocation 404s. The job's
`:⚠️:` posture meant the failure didn't propagate, leaving the runtime
PyPI publish → template image rebuild pipeline silently broken.
## Why Option A (rewrite) and not Option B (delete)
Verified 2026-05-07 by devops-engineer (molecule-core#14 thread):
- The cron-poll mechanism (/etc/cron.d/molecule-deploy-poll) tracks ONLY the
Vercel/Railway-deployed repos (landingpage/docs/molecule-app/molecules-market
/molecule-controlplane). It does NOT track workspace-template-* repos.
- Each of the 9 template `publish-image.yml` workflows has
`repository_dispatch: types: [runtime-published]` as a load-bearing trigger.
Without the cascade, when the runtime ships a new PyPI version, templates
don't auto-rebuild.
So Option B (delete) would silently break the runtime → template fan-out.
Option A (rewrite to Gitea's API shape) is the right call. Security-auditor
agreed after seeing the cron-poll TRACKED list.
## API surface change
| Concern | Pre-fix (GitHub) | Post-fix (Gitea) |
|---|---|---|
| URL | `https://api.github.com/repos/$REPO/dispatches` | `${GITEA_URL}/api/v1/repos/$REPO/dispatches` |
| Owner case | `Molecule-AI/...` | `molecule-ai/...` (lowercase, Gitea is case-sensitive) |
| Auth header | `Authorization: Bearer $DISPATCH_TOKEN` | `Authorization: token $DISPATCH_TOKEN` |
| Body shape | `{event_type, client_payload}` | UNCHANGED — Gitea is GitHub-compatible here |
| Success code | `204 No Content` | `204 No Content` (unchanged) |
`GITEA_URL` defaults to `https://git.moleculesai.app`; overridable via job env.
## Out-of-band: DISPATCH_TOKEN secret rotation
The DISPATCH_TOKEN secret was a GitHub PAT. It must be re-minted as a Gitea
PAT for the new API to authenticate. Per saved memory
`feedback_per_agent_gitea_identity_default`, this should be a dedicated
`publish-runtime-bot` persona token with `write:repository` scope on the
9 target repos — NOT the founder PAT.
This PR ships the workflow change. Token rotation is the operator-host
follow-up (security-auditor's lane) — coordinate the merge so the token
is in place before the next runtime release fires.
## Backwards compatibility
The workflow ran silently-broken since 2026-05-06 (every invocation 404
+ :⚠️: but no failure). So there is no functional regression from
"silently broken" to "actually working". Any in-progress operator-managed
manual dispatch path is unaffected; the Gitea API parallel path doesn't
require operator intervention.
## Test plan
- [x] YAML parse OK on the modified workflow file
- [ ] Smoke test: trigger a runtime publish (or simulate via dispatching to one
template) post-merge; verify HTTP 204 + the template's publish-image
workflow fires + the template's image gets re-pushed against the new
runtime version. Phase 4 verification belongs to internal#46 follow-up.
## Hostile self-review (3 weakest spots)
1. The fan-out remains all-or-nothing: a single template failure surfaces as
a `:⚠️:` but PyPI publish proceeds. With 9 templates this is a
~10% per-template chance of stale-image-on-runtime-bump if any one fails.
Defense: the warning shows up in the workflow summary; operators retry.
Future hardening: requeue-on-fail with bounded retry, or a separate
reconcile cron that detects template/runtime version drift and re-dispatches.
2. `DISPATCH_TOKEN` validity is enforced by the Gitea API (401 on stale)
but the workflow doesn't differentiate 401 from 404. Either way the
warning fires. Future hardening: explicit token-shape check at the start
of the cascade job (curl `/api/v1/user` once, fail-fast if 401).
3. Owner-case lowercase is right today but couples the workflow to the
current Gitea org slug. If the org is ever renamed, this workflow
breaks silently. Less fragile alternative: derive REPO from a
canonical config (e.g. `gh repo list molecule-ai`) instead of
string-concatenating. Acceptable today; filed as the same future
hardening pass as item 1.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mechanical pin: 4 `actions/upload-artifact@v4.6.2/v7.0.1` uses → `@v3`. v4+/v7+
rely on a runtime API shape that Gitea's act_runner v0.6.x doesn't fully
support. v3 uses the legacy server protocol act_runner ships end-to-end.
Files (4 uses):
- .github/workflows/ci.yml:238 (v4.6.2 → v3)
- .github/workflows/codeql.yml:124 (v7.0.1 → v3)
- .github/workflows/e2e-staging-canvas.yml:142 (v7.0.1 → v3)
- .github/workflows/e2e-staging-canvas.yml:150 (v7.0.1 → v3)
YAML parse green on all 3 files.
Sister PRs land for `molecule-controlplane` and `codex-channel-molecule`.
Per internal#46 Phase 2 audit; tracked under that umbrella.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>