[follow-up #2183] manifest-entry-existence CI gate (PR-time defense vs publish-time detection) #2185

Closed
opened 2026-06-04 01:21:57 +00:00 by fullstack-engineer · 0 comments
Member

Summary

Convert the manual GET /api/v1/repos/<name> audit (which caught 2 latent 404s in #2183's fix) into a permanent CI gate. Today, the publish-workspace-server-image workflow is the only place a bad manifest entry surfaces — and that runs on push to main, which is too late. A pre-merge check would catch the bug class at PR-review time.

Bug class

A bad manifest.json entry — a (name, repo, ref) triplet where the repo returns 404 from Gitea — turns every main push red. The failure mode is:

  1. PR is merged to main with a bad manifest entry (PR-CI does not check entry existence; it just runs the Go tests + Python lints, which never touch manifest.json)
  2. Watchdog (main-red-watchdog.yml) runs publish-workspace-server-image on the merge
  3. Step 2 (Pre-clone manifest deps) hits a 404 on the bad entry
  4. scripts/clone-manifest.sh retries 3x (3s, 6s backoff), then exits 1
  5. Workflow fails, watchdog auto-files a [main-red] issue
  6. Engineer investigates, removes the bad entry, opens a follow-up PR
  7. Cycle repeats as soon as the next push happens with another bad entry

This is preventable: manifest entries are immutable data, so their existence can be checked at PR-review time with a single API call per entry. Cost: roughly 30 lines of bash in one workflow YAML file.

Concrete incident: #2183

When investigating #2183, I audited all 32 manifest entries via GET /api/v1/repos/<name> and found 2 were 404:

| entry | status | how it landed |
|-------|--------|---------------|
| molecule-ai/molecule-ai-org-template-free-beats-all | 404 | Predates PR #2180 (the trigger); has been in main since at least 15935143c8d2 (2026-05-08). Latent until the next push ran the publish workflow. |
| molecule-ai/molecule-ai-org-template-medo-smoke | 404 | Latent: CI never got to it because free-beats-all failed first and short-circuited the script. Caught by my audit AFTER the first-line fix. |

PR #2184 (head 87431290, base 0b91c180) removes both. Without the audit, the human would have merged a 1-line fix, the next push would have failed on medo-smoke, and a second main-red issue would have fired.
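The audit described above can be sketched as a small script that checks every repo and keeps going after the first 404, which is what exposed medo-smoke. This is a minimal sketch, not the script that was actually run: the function names `fetch_status` and `audit_repos` and the `GITEA_API` variable are assumptions introduced here for illustration.

```shell
#!/usr/bin/env bash
# Sketch: audit every repo in a list without stopping at the first 404.
# GITEA_API, fetch_status and audit_repos are illustrative names (assumptions).
GITEA_API="${GITEA_API:-https://git.moleculesai.app/api/v1}"

# Print the HTTP status for one repo ("000" on a transport failure).
fetch_status() {
  curl -s -o /dev/null -w '%{http_code}' "${GITEA_API}/repos/$1" || true
}

# Read "owner/repo" lines on stdin, print "<status> <repo>" for each one,
# and return non-zero if any repo did not answer 200.
audit_repos() {
  local repo status bad=0
  while IFS= read -r repo; do
    [ -n "$repo" ] || continue
    status=$(fetch_status "$repo")
    printf '%s %s\n' "$status" "$repo"
    [ "$status" = "200" ] || bad=1
  done
  return "$bad"
}
```

One way to feed it, assuming the comment-stripped manifest is at /tmp/manifest.json as in the proposal below: `jq -r '(.plugins + .workspace_templates + .org_templates)[].repo' /tmp/manifest.json | audit_repos`.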

Proposal

New workflow: .gitea/workflows/manifest-entry-existence-check.yml

name: manifest-entry-existence-check

on:
  pull_request:
    paths:
      - manifest.json

jobs:
  check-entries:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Verify each manifest entry resolves on Gitea
        run: |
          set -euo pipefail
          # Strip JSON5 // comments first to match the publish workflow's
          # `Pre-clone manifest deps` parsing path.
          sed '/^[[:space:]]*\/\//d' manifest.json > /tmp/manifest.json
          # Anonymous API is enough: per the 2026-05-08 OSS-surface contract
          # (15935143c8d2 _comment), every entry is public on Gitea.
          count=$(jq -r '(.plugins + .workspace_templates + .org_templates) | length' /tmp/manifest.json)
          missing=()
          for i in $(seq 0 $((count-1))); do
            name=$(jq -r "(.plugins + .workspace_templates + .org_templates)[$i].name" /tmp/manifest.json)
            repo=$(jq -r "(.plugins + .workspace_templates + .org_templates)[$i].repo" /tmp/manifest.json)
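            # Up to 3 attempts with linear backoff, mirroring clone-manifest.sh's retry count
            resolved=false
            for attempt in 1 2 3; do
              # `|| true` keeps `set -e` from aborting on a transport error;
              # curl then reports HTTP code 000 and the loop retries.
              http_code=$(curl -s -o /dev/null -w "%{http_code}" "https://git.moleculesai.app/api/v1/repos/${repo}" || true)
              if [ "$http_code" = "200" ]; then
                echo "  OK: $name -> $repo"
                resolved=true
                break
              elif [ "$http_code" = "404" ]; then
                echo "::error::manifest entry $name points at $repo which does not exist on Gitea (404)"
                missing+=("$name:$repo")
                resolved=true
                break
              else
                echo "  attempt $attempt: $name -> $repo returned HTTP $http_code, retrying"
                sleep $((attempt * 2))
              fi
            done
            # Do not let an entry pass silently when every attempt was inconclusive.
            if [ "$resolved" = false ]; then
              echo "::error::could not verify $name -> $repo after 3 attempts (last HTTP $http_code)"
              missing+=("$name:$repo (unverified, HTTP $http_code)")
            fi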
          done
          if [ "${#missing[@]}" -gt 0 ]; then
            echo "::error::${#missing[@]} manifest entries are broken:"
            printf '  - %s\n' "${missing[@]}"
            exit 1
          fi
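
The comment-stripping step in the workflow above only deletes lines that begin (after optional whitespace) with `//`. A `//` elsewhere on a line, such as inside a URL, is left alone, and so is a trailing comment after a value. The sketch below wraps the same sed expression in a hypothetical helper, `strip_line_comments`, and runs it on made-up sample input. The sample field values are illustrative, not taken from the real manifest.

```shell
# Sketch: the same sed expression the workflow uses, wrapped in a
# hypothetical helper so it can be tried on sample input.
strip_line_comments() {
  sed '/^[[:space:]]*\/\//d'
}

# Hypothetical sample; names and URL are made up for illustration.
sample='{
  // whole-line comment: removed
  "plugins": [
    {"name": "demo", "repo": "example-org/demo", "url": "https://example.invalid/x"}
  ]
}'

# The comment line disappears; the line containing "https://" survives.
printf '%s\n' "$sample" | strip_line_comments
```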

Why this is safe

  • No new secrets: the Gitea API endpoint is anonymous for public repos, and the 2026-05-08 OSS-surface contract says every manifest entry is public.
  • Low cost: 32 GETs per PR; <5s on the Gitea Actions runner. Negligible compared to the existing CI contexts.
  • Narrow blast radius: the workflow only triggers on PRs that touch manifest.json (the paths: filter). It does not run on every PR.
  • Belt-and-suspenders: the publish-workspace-server-image workflow's Pre-clone manifest deps step remains the runtime defense-in-depth. The CI gate is the PR-time defense.
  • No false positives for non-main refs: the script just checks repo existence, not ref resolvability. (Ref check would be a separate concern — pulling refs/heads/<ref> via the API to verify the branch exists. Out of scope for this issue.)

Open design questions

  1. Should this be a required status check? That would need repo-admin config to add it to the branch-protection required-checks list. The watchdog + human review are already catching these today (just reactively, post-merge). Worth a CTO ruling.
  2. Should we also verify ref exists? A repo can exist but the named ref (main, a tag, a branch) can be deleted. The current proposal only checks repo existence. Adding a ref check would catch the parallel bug class but adds a git ls-remote per entry (heavier; would need auth because anonymous git ls-remote is not always enabled on private repos).
  3. Should it be a separate workflow or inlined into an existing lint-*.yml? Existing lint-*.yml workflows trigger on a wider paths set and have different ownership. A dedicated workflow keeps the ownership clear.
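
For question 2, a ref check could be sketched with `git ls-remote --exit-code`, which exits with status 2 when no remote ref matches. This is only an illustration of the idea, not a proposed implementation: `ref_exists` is a hypothetical helper, and the clone URL shape in the usage note is an assumption about how the Gitea host exposes repos.

```shell
# Sketch: succeed if a ref (branch or tag) exists on a remote.
# `ref_exists` is a hypothetical helper name.
# `git ls-remote --exit-code` exits with status 2 when no ref matches.
ref_exists() {
  local url="$1" ref="$2"
  git ls-remote --exit-code "$url" "$ref" >/dev/null 2>&1
}
```

Inside the per-entry loop this might be called as `ref_exists "https://git.moleculesai.app/${repo}.git" "$ref"`, where the `.git` URL form is assumed. Note that a bare name such as `main` matches any ref whose last path component is `main`; passing `refs/heads/main` restricts the match to the branch.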

Implementation plan

  1. Draft the workflow YAML above
  2. Open PR adding .gitea/workflows/manifest-entry-existence-check.yml
  3. Verify it correctly FAILS on a temporary bad entry (sanity test)
  4. Verify it correctly PASSES on a known-good manifest
  5. (Optional) Request CTO ruling on making it a required check
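
Steps 3 and 4 can also be rehearsed offline before opening the PR by shadowing `curl` with a shell function. The sketch below is one possible harness under stated assumptions: `check_repo`, `RETRY_SLEEP`, and the `example-org/*` repo names are hypothetical, and treating repeated non-200/non-404 responses as a third "unverified" outcome is a design choice of this sketch.

```shell
# Sketch: exercise the pass/fail logic offline by shadowing `curl`
# with a shell function. All names below are hypothetical.
RETRY_SLEEP="${RETRY_SLEEP:-2}"   # base backoff in seconds

# Print "ok", "missing", or "unverified" for one repo.
check_repo() {
  local repo="$1" attempt code
  for attempt in 1 2 3; do
    code=$(curl -s -o /dev/null -w '%{http_code}' \
      "https://git.moleculesai.app/api/v1/repos/${repo}" || true)
    case "$code" in
      200) echo ok; return 0 ;;
      404) echo missing; return 1 ;;
    esac
    # Any other code is treated as transient: back off and retry.
    sleep $((attempt * RETRY_SLEEP))
  done
  echo unverified
  return 2
}

# Stub: pretend example-org/good exists, example-org/flaky is down,
# and everything else is a 404. No network traffic happens.
curl() {
  case "$*" in
    *example-org/good)  printf 200 ;;
    *example-org/flaky) printf 503 ;;
    *)                  printf 404 ;;
  esac
}
```

Setting `RETRY_SLEEP=0` keeps the rehearsal fast. Removing the `curl` stub turns the same function into a live check against the real host.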

Acceptance criteria

  • Workflow exists at .gitea/workflows/manifest-entry-existence-check.yml
  • Triggers only on PRs that modify manifest.json
  • Fails loudly with a clear error message naming each broken entry
  • The existing PR-CI test matrix still passes with the new workflow in place
  • Documented in the workflow's top-of-file comment (mirroring publish-workspace-server-image.yml style)

Related

  • #2183 — main-red incident this fix is preventing
  • #2184 — the actual 2-line manifest.json fix (head 87431290, awaiting human GO)
  • Commit 15935143c8d2 (2026-05-08) — established the "every manifest entry is public" OSS-surface contract that makes the anonymous-API approach work
Reference: molecule-ai/molecule-core#2185