Compare commits

..

1 Commits

Author SHA1 Message Date
claude-ceo-assistant 973e5570c2 chore: empty commit to trigger Auto-sync main→staging verification (Task #165)
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 17s
CI / Detect changes (pull_request) Successful in 22s
E2E API Smoke Test / detect-changes (pull_request) Successful in 15s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 17s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 16s
Retarget main PRs to staging / Retarget to staging (pull_request) Has been skipped
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 17s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 16s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Failing after 1m51s
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Failing after 1m58s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Failing after 1m50s
CI / Platform (Go) (pull_request) Successful in 12s
CI / Canvas (Next.js) (pull_request) Successful in 13s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 15s
CI / Python Lint & Test (pull_request) Successful in 14s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 15s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 11s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 14s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 22s
After PR #48 reconciled main → staging post-suspension divergence,
staging is now a strict superset of main. This empty commit on main
triggers the Auto-sync main→staging workflow to verify the fix:
the merge step should succeed (no more conflicts) and the workflow
should land on SUCCESS.

This commit will itself ride into staging via the auto-sync run it
fires, so it's both the trigger and the smoke test.
2026-05-07 14:27:53 -07:00
1004 changed files with 74743 additions and 111594 deletions
-1
View File
@@ -1 +0,0 @@
refire:1778784369
-126
View File
@@ -1,126 +0,0 @@
#!/usr/bin/env bash
# audit-force-merge — detect a §SOP-6 force-merge after PR close, emit
# `incident.force_merge` to stdout as structured JSON.
#
# Vector's docker_logs source picks up runner stdout; the JSON gets
# shipped to Loki on molecule-canonical-obs, indexable by event_type.
# Query example:
#
# {host="operator"} |= "event_type" |= "incident.force_merge" | json
#
# A force-merge is detected when a PR closed-with-merged=true had at
# least one of the repo's required-status-check contexts in a state
# other than "success" at the merge commit's SHA. That's exactly what
# the Gitea force_merge:true API call lets through, so it's a faithful
# detector of the override path.
#
# Triggers on `pull_request_target: closed` (loaded from base branch
# per §SOP-6 security model). No-op when merged=false.
#
# Required env (set by the workflow):
# GITEA_TOKEN, GITEA_HOST, REPO, PR_NUMBER, REQUIRED_CHECKS
#
# REQUIRED_CHECKS is a newline-separated list of status-check context
# names that branch protection requires. Declared in the workflow YAML
# rather than fetched from /branch_protections (which needs admin
# scope — sop-tier-bot has read-only). Trade dynamism for simplicity:
# when the required-check set changes, update both branch protection
# AND this env. Keeping them in sync is less complexity than granting
# the audit bot admin perms on every repo.
set -euo pipefail
: "${GITEA_TOKEN:?required}"
: "${GITEA_HOST:?required}"
: "${REPO:?required}"
: "${PR_NUMBER:?required}"
: "${REQUIRED_CHECKS:?required (newline-separated context names)}"
OWNER="${REPO%%/*}"
NAME="${REPO##*/}"
API="https://${GITEA_HOST}/api/v1"
AUTH="Authorization: token ${GITEA_TOKEN}"
# 1. Fetch the PR. If not merged, no-op.
PR=$(curl -sS -H "$AUTH" "${API}/repos/${OWNER}/${NAME}/pulls/${PR_NUMBER}")
MERGED=$(echo "$PR" | jq -r '.merged // false')
if [ "$MERGED" != "true" ]; then
echo "::notice::PR #${PR_NUMBER} closed without merge — no audit emission."
exit 0
fi
# NOTE: no || true — with set -euo pipefail, jq parse failures (e.g. field
# missing from API response) propagate as hard errors. Use jq's // operator
# for graceful defaults instead of bash || true guards. This was re-added by
# 8c343e3a ("fix(gitea): add || true guards to jq pipelines") — reverted
# here because the guards mask silent failures that hide malformed API responses.
MERGE_SHA=$(echo "$PR" | jq -r '.merge_commit_sha // empty')
MERGED_BY=$(echo "$PR" | jq -r '.merged_by.login // "unknown"')
TITLE=$(echo "$PR" | jq -r '.title // ""')
BASE_BRANCH=$(echo "$PR" | jq -r '.base.ref // "main"')
HEAD_SHA=$(echo "$PR" | jq -r '.head.sha // empty')
if [ -z "$MERGE_SHA" ]; then
echo "::warning::PR #${PR_NUMBER} merged=true but no merge_commit_sha — cannot evaluate force-merge."
exit 0
fi
# 2. Required status checks declared in the workflow env.
REQUIRED="$REQUIRED_CHECKS"
if [ -z "${REQUIRED//[[:space:]]/}" ]; then
echo "::notice::REQUIRED_CHECKS empty — force-merge not applicable."
exit 0
fi
# 3. Status-check state at the PR HEAD (where checks ran). The merge
# commit doesn't get its own checks; we evaluate the PR's last
# commit, which is what branch protection compared against.
STATUS=$(curl -sS -H "$AUTH" \
"${API}/repos/${OWNER}/${NAME}/commits/${HEAD_SHA}/status")
declare -A CHECK_STATE
while IFS=$'\t' read -r ctx state; do
[ -n "$ctx" ] && CHECK_STATE[$ctx]="$state"
done < <(echo "$STATUS" | jq -r '.statuses // [] | .[] | "\(.context)\t\(.status)"')
# 4. For each required check, was it green at merge? YAML block scalars
# (`|`) leave a trailing newline; skip blank/whitespace-only lines.
FAILED_CHECKS=()
while IFS= read -r req; do
trimmed="${req#"${req%%[![:space:]]*}"}" # ltrim
trimmed="${trimmed%"${trimmed##*[![:space:]]}"}" # rtrim
[ -z "$trimmed" ] && continue
state="${CHECK_STATE[$trimmed]:-missing}"
if [ "$state" != "success" ]; then
FAILED_CHECKS+=("${trimmed}=${state}")
fi
done <<< "$REQUIRED"
if [ "${#FAILED_CHECKS[@]}" -eq 0 ]; then
echo "::notice::PR #${PR_NUMBER} merged with all required checks green — not a force-merge."
exit 0
fi
# 5. Emit structured audit event.
NOW=$(date -u +%Y-%m-%dT%H:%M:%SZ)
# jq -R (raw input) converts each line to a JSON string; jq -s wraps into array.
# If FAILED_CHECKS is unexpectedly empty (shouldn't happen — we exit above),
# this produces []. No || true needed.
FAILED_JSON=$(printf '%s\n' "${FAILED_CHECKS[@]}" | jq -R . | jq -s .)
# Print as a single-line JSON so Vector's parse_json transform can pick
# it up cleanly from docker_logs.
jq -nc \
--arg event_type "incident.force_merge" \
--arg ts "$NOW" \
--arg repo "$REPO" \
--argjson pr "$PR_NUMBER" \
--arg title "$TITLE" \
--arg base "$BASE_BRANCH" \
--arg merged_by "$MERGED_BY" \
--arg merge_sha "$MERGE_SHA" \
--argjson failed_checks "$FAILED_JSON" \
'{event_type: $event_type, ts: $ts, repo: $repo, pr: $pr, title: $title,
base_branch: $base, merged_by: $merged_by, merge_sha: $merge_sha,
failed_checks: $failed_checks}'
echo "::warning::FORCE-MERGE detected on PR #${PR_NUMBER} by ${MERGED_BY}: ${#FAILED_CHECKS[@]} required check(s) not green at merge time."
-651
View File
@@ -1,651 +0,0 @@
#!/usr/bin/env python3
"""ci-required-drift — RFC internal#219 §4 + §6.
Detects drift between three sources of "what counts as a required check"
for this repo, files (or updates) a `[ci-drift]` Gitea issue when any
pair diverges.
Sources:
A. `.gitea/workflows/ci.yml` jobs (CI source — the actual job set)
B. `status_check_contexts` in branch_protections (the merge gate)
C. `REQUIRED_CHECKS` env in audit-force-merge.yml (the audit env)
Three failure classes:
F1 Job in (A) is not under the sentinel's `needs:` — sentinel
doesn't gate it, so a red job on that name can sneak through.
Ignores jobs whose `if:` references `github.event_name` (those
run only on specific events and may be `skipped` legitimately).
F2 Context in (B) corresponds to no emitter — i.e. there's no job
in ci.yml whose runtime status-name maps to that context.
A stale required-check name is silent: protection demands a
green it never receives, but Gitea treats absent-as-pending,
not absent-as-red. The gate degrades to advisory.
F3 (B) and (C) are not set-equal. Audit env wider than protection
→ audit flags non-force-merges as force; narrower → real
force-merges are missed.
Idempotency:
Searches OPEN issues by exact title prefix
`[ci-drift] {repo}/{branch}: ` and either edits the existing one
(if any) or POSTs a new one. Never spawns duplicates.
Behavior-based AST gate per `feedback_behavior_based_ast_gates`:
- Job set comes from PyYAML parse of jobs:* keys
- Sentinel needs from PyYAML parse of jobs[sentinel].needs (a list)
- Audit env from PyYAML parse, NOT grep — so reformatting the YAML
(block-scalar `|` vs flow-style list) does not break the gate
"""
from __future__ import annotations
import argparse
import json
import os
import sys
import urllib.error
import urllib.parse
import urllib.request
from typing import Any
import yaml # PyYAML 6.0.2 — installed by the workflow before this runs.
# --------------------------------------------------------------------------
# Environment
# --------------------------------------------------------------------------
def env(key: str, *, required: bool = True, default: str | None = None) -> str:
val = os.environ.get(key, default)
if required and not val:
sys.stderr.write(f"::error::missing required env var: {key}\n")
sys.exit(2)
return val or ""
GITEA_TOKEN = env("GITEA_TOKEN", required=False)
GITEA_HOST = env("GITEA_HOST", required=False)
REPO = env("REPO", required=False)
BRANCHES = env("BRANCHES", required=False).split()
SENTINEL_JOB = env("SENTINEL_JOB", required=False)
AUDIT_WORKFLOW_PATH = env("AUDIT_WORKFLOW_PATH", required=False)
CI_WORKFLOW_PATH = env("CI_WORKFLOW_PATH", required=False)
DRIFT_LABEL = env("DRIFT_LABEL", required=False)
OWNER, NAME = (REPO.split("/", 1) + [""])[:2] if REPO else ("", "")
API = f"https://{GITEA_HOST}/api/v1" if GITEA_HOST else ""
def _require_runtime_env() -> None:
"""Enforce env contract — called from `main()` only. Tests import
individual functions without setting the full env contract."""
for key in (
"GITEA_TOKEN",
"GITEA_HOST",
"REPO",
"BRANCHES",
"SENTINEL_JOB",
"AUDIT_WORKFLOW_PATH",
"CI_WORKFLOW_PATH",
"DRIFT_LABEL",
):
if not os.environ.get(key):
sys.stderr.write(f"::error::missing required env var: {key}\n")
sys.exit(2)
# --------------------------------------------------------------------------
# Tiny HTTP helper (no requests dependency)
# --------------------------------------------------------------------------
class ApiError(RuntimeError):
"""Raised when a Gitea API call cannot be trusted to have succeeded.
Covers non-2xx HTTP status AND 2xx with an unparseable JSON body on
endpoints that are documented to return JSON (search/read). Callers
that swallow this and proceed would risk e.g. creating duplicate
`[ci-drift]` issues when a transient 500 hides an existing match.
The cron retries hourly; one fail-loud cycle is fine — silent
duplicate creation is not (per Five-Axis review on PR #112).
"""
def api(
method: str,
path: str,
*,
body: dict | None = None,
query: dict[str, str] | None = None,
expect_json: bool = True,
) -> tuple[int, Any]:
"""Tiny HTTP helper around urllib.
Raises ApiError on any non-2xx response. Callers that want
best-effort semantics (e.g. label-apply) must `try/except ApiError`
explicitly — making the failure-soft path opt-in rather than the
default closes the duplicate-issue regression class.
For 2xx responses with a JSON body that fails to parse, raises
ApiError when `expect_json=True` (the default for read-shaped
paths). On endpoints that legitimately return non-JSON success
bodies (e.g. some Gitea create echoes — see
`feedback_gitea_create_api_unparseable_response`), callers may pass
`expect_json=False` to accept a `_raw` fallthrough — but they MUST
then verify success via a follow-up GET, not by trusting the body.
"""
url = f"{API}{path}"
if query:
url = f"{url}?{urllib.parse.urlencode(query)}"
data = None
headers = {
"Authorization": f"token {GITEA_TOKEN}",
"Accept": "application/json",
}
if body is not None:
data = json.dumps(body).encode("utf-8")
headers["Content-Type"] = "application/json"
req = urllib.request.Request(url, method=method, data=data, headers=headers)
try:
with urllib.request.urlopen(req, timeout=30) as resp:
raw = resp.read()
status = resp.status
except urllib.error.HTTPError as e:
raw = e.read()
status = e.code
if not (200 <= status < 300):
snippet = raw[:500].decode("utf-8", errors="replace") if raw else ""
raise ApiError(
f"{method} {path} → HTTP {status}: {snippet}"
)
if not raw:
return status, None
try:
return status, json.loads(raw)
except json.JSONDecodeError as e:
if expect_json:
raise ApiError(
f"{method} {path} → HTTP {status} but body is not JSON: {e}"
) from e
# Opt-in raw fallthrough for endpoints with known echo-quirks.
return status, {"_raw": raw.decode("utf-8", errors="replace")}
# --------------------------------------------------------------------------
# YAML loaders — STRICT (reject GitHub-Actions-only syntax)
# --------------------------------------------------------------------------
def load_yaml(path: str) -> dict:
"""Load + parse a workflow YAML. Hard-fail if the file is missing
or doesn't parse — drift-detect cannot make decisions without
knowing the actual job set."""
if not os.path.exists(path):
sys.stderr.write(f"::error::file not found: {path}\n")
sys.exit(3)
with open(path, encoding="utf-8") as f:
try:
doc = yaml.safe_load(f)
except yaml.YAMLError as e:
sys.stderr.write(f"::error::YAML parse error in {path}: {e}\n")
sys.exit(3)
if not isinstance(doc, dict):
sys.stderr.write(f"::error::{path} is not a YAML mapping\n")
sys.exit(3)
return doc
def ci_jobs_all(ci_doc: dict) -> set[str]:
"""Every job key in ci.yml minus the sentinel itself. Used for F1b
(sentinel.needs typo check) — needs that name a non-existent job
is a typo regardless of event-gating."""
jobs = ci_doc.get("jobs")
if not isinstance(jobs, dict):
sys.stderr.write("::error::ci.yml has no jobs: mapping\n")
sys.exit(3)
return {k for k in jobs if k != SENTINEL_JOB}
def ci_job_names(ci_doc: dict) -> set[str]:
"""Set of job keys in ci.yml MINUS the sentinel itself MINUS jobs
whose `if:` gates on `github.event_name` or `github.ref` (those are
event-scoped and can legitimately be `skipped` for a given trigger;
if we required them under the sentinel `needs:`, every PR-only job
would be `skipped` on push and the sentinel would interpret
`skipped != success` as failure). RFC §4 spec.
`github.ref` is the companion gate for jobs that run only on direct
pushes to specific branches (e.g. `github.ref == 'refs/heads/main'`).
These never execute in a PR context, so flagging them as missing
from `all-required.needs:` is a false positive (mc#958 / mc#959).
Used for F1 (jobs missing from sentinel needs). NOT used for F1b
(typos in needs) — see `ci_jobs_all` for that."""
jobs = ci_doc.get("jobs")
if not isinstance(jobs, dict):
sys.stderr.write("::error::ci.yml has no jobs: mapping\n")
sys.exit(3)
names: set[str] = set()
for k, v in jobs.items():
if k == SENTINEL_JOB:
continue
if isinstance(v, dict):
gate = v.get("if")
if isinstance(gate, str) and (
"github.event_name" in gate or "github.ref" in gate
):
continue
names.add(k)
return names
def sentinel_needs(ci_doc: dict) -> set[str]:
sentinel = ci_doc.get("jobs", {}).get(SENTINEL_JOB)
if not isinstance(sentinel, dict):
sys.stderr.write(
f"::error::sentinel job '{SENTINEL_JOB}' not found in {CI_WORKFLOW_PATH}\n"
)
sys.exit(3)
needs = sentinel.get("needs", [])
if isinstance(needs, str):
needs = [needs]
if not isinstance(needs, list):
sys.stderr.write("::error::sentinel `needs:` is neither list nor string\n")
sys.exit(3)
return set(needs)
def required_checks_env(audit_doc: dict) -> set[str]:
"""Pull the REQUIRED_CHECKS env value from audit-force-merge.yml.
Walks the YAML AST per `feedback_behavior_based_ast_gates`: we do
NOT grep for `REQUIRED_CHECKS:` — that breaks under reformatting,
multi-job workflows, or a future move of the env to a different
step. Instead, look inside every job's every step's `env:` map."""
found: list[str] = []
jobs = audit_doc.get("jobs", {})
if not isinstance(jobs, dict):
sys.stderr.write(f"::warning::{AUDIT_WORKFLOW_PATH} has no jobs: mapping\n")
return set()
for job in jobs.values():
if not isinstance(job, dict):
continue
for step in job.get("steps", []) or []:
if not isinstance(step, dict):
continue
step_env = step.get("env") or {}
if isinstance(step_env, dict) and "REQUIRED_CHECKS" in step_env:
v = step_env["REQUIRED_CHECKS"]
if isinstance(v, str):
found.append(v)
if not found:
sys.stderr.write(
f"::error::REQUIRED_CHECKS env not found in any step of {AUDIT_WORKFLOW_PATH}\n"
)
sys.exit(3)
if len(found) > 1:
# Defensive: refuse to guess which one is canonical.
sys.stderr.write(
f"::error::REQUIRED_CHECKS env present in {len(found)} steps; ambiguous\n"
)
sys.exit(3)
raw = found[0]
# YAML block-scalars (`|`) leave a trailing newline + blanks; trim
# consistently with audit-force-merge.sh's parser so both sides
# produce identical sets.
return {line.strip() for line in raw.splitlines() if line.strip()}
# --------------------------------------------------------------------------
# Mapping: ci.yml job-key → protection context name
# --------------------------------------------------------------------------
def expected_context(job_key: str, workflow_name: str = "ci") -> str:
"""Gitea Actions reports status-check contexts as
"{workflow.name} / {job.name or job.key} ({event})".
For ci.yml the event is `pull_request` on PRs (that's what
`status_check_contexts` records). Job.name defaults to job.key
when no `name:` is set. CP's ci.yml does NOT set per-job `name:`
so the key equals the human-name."""
return f"{workflow_name} / {job_key} (pull_request)"
# --------------------------------------------------------------------------
# Drift detection
# --------------------------------------------------------------------------
def detect_drift(branch: str) -> tuple[list[str], dict]:
"""Returns (findings, debug). Empty findings == no drift.
Raises:
ApiError: propagated from the protection fetch only when the
failure is likely a transient Gitea outage (5xx).
403/404 from the protection endpoint is treated as
"cannot determine drift for this branch" — a token-
scope issue (missing repo-admin on DRIFT_BOT_TOKEN) or
a repo with no protection set should not turn the
hourly cron red. The workflow continues to the next
branch; no [ci-drift] issue is filed for a branch
whose protection cannot be read.
"""
findings: list[str] = []
ci_doc = load_yaml(CI_WORKFLOW_PATH)
audit_doc = load_yaml(AUDIT_WORKFLOW_PATH)
jobs = ci_job_names(ci_doc)
jobs_all = ci_jobs_all(ci_doc)
needs = sentinel_needs(ci_doc)
env_set = required_checks_env(audit_doc)
# Protection
# api() raises ApiError on non-2xx. Transient 5xx should fail loud.
# 403/404 means the token lacks repo-admin scope (Gitea 1.22.6's
# branch_protections endpoint requires it — see DRIFT_BOT_TOKEN
# provisioning trail in ci-required-drift.yml). Treat as
# "cannot determine drift for this branch" — skip without turning
# the workflow red. Surface a clear diagnostic so the operator
# knows what to fix.
contexts: set[str] = set()
protection_path = f"/repos/{OWNER}/{NAME}/branch_protections/{branch}"
try:
_, protection = api("GET", protection_path)
except ApiError as e:
# Isolate the HTTP status from the error message.
http_status: int | None = None
msg = str(e)
# ApiError message format: "{method} {path} → HTTP {status}: {body}"
import re as _re
m = _re.search(r"HTTP (\d{3})", msg)
if m:
http_status = int(m.group(1))
if http_status in (403, 404):
# Token lacks scope OR branch has no protection. Cannot
# determine drift — skip this branch. Do NOT exit non-zero;
# the issue IS the alarm, not a red workflow.
sys.stderr.write(
f"::error::GET {protection_path} returned HTTP {http_status}"
f"DRIFT_BOT_TOKEN lacks repo-admin scope (Gitea 1.22.6 "
f"requires it for this endpoint) OR branch has no protection "
f"configured. Cannot determine drift for {branch}; "
f"skipping. Fix: grant repo-admin to mc-drift-bot or "
f"configure protection on {branch}.\n"
)
debug = {
"branch": branch,
"ci_jobs": sorted(jobs),
"sentinel_needs": sorted(needs),
"protection_contexts_skipped": True,
"protection_http_status": http_status,
"audit_env_checks": sorted(env_set),
}
return [], debug
# 5xx — propagate (transient outage, fail loud per design).
raise
if not isinstance(protection, dict):
sys.stderr.write(
f"::error::protection response for {branch} not a JSON object\n"
)
sys.exit(4)
contexts = set(protection.get("status_check_contexts") or [])
# ----- F1: job exists in CI but not under sentinel.needs -----
missing_from_needs = sorted(jobs - needs)
if missing_from_needs:
findings.append(
"F1 — jobs in ci.yml NOT under sentinel `needs:` (sentinel doesn't gate them):\n"
+ "\n".join(f" - {n}" for n in missing_from_needs)
)
# ----- F1b: needs lists a job that doesn't exist (typo) -----
# Compare against jobs_all (incl. event-gated jobs); a typo is a
# typo regardless of `if:` gating.
stale_needs = sorted(needs - jobs_all)
if stale_needs:
findings.append(
"F1b — sentinel `needs:` lists jobs NOT present in ci.yml (typo or removed job):\n"
+ "\n".join(f" - {n}" for n in stale_needs)
)
# ----- F2: protection context has no emitting job -----
# Compute the contexts the CI YAML actually produces. The sentinel
# is in (B) intentionally (`ci / all-required (pull_request)`); we
# whitelist it explicitly.
emitted_contexts = {expected_context(j) for j in jobs} | {expected_context(SENTINEL_JOB)}
# Contexts NOT produced by ci.yml may still come from other
# workflows in the repo (Secret scan etc). We can't enumerate
# every workflow's emissions cheaply; instead, flag only contexts
# whose prefix is `ci / ` (this workflow's emissions) and which
# don't appear in `emitted_contexts`. This narrows F2 to the
# failure class the RFC actually targets without producing noise
# from cross-workflow emitters.
stale_protection = sorted(
c for c in contexts if c.startswith("ci / ") and c not in emitted_contexts
)
if stale_protection:
findings.append(
"F2 — protection `status_check_contexts` entries with `ci / ` prefix that NO "
"job in ci.yml emits (stale name → silent advisory gate):\n"
+ "\n".join(f" - {c}" for c in stale_protection)
)
# ----- F3: audit env vs protection contexts (set-equal) -----
only_in_env = sorted(env_set - contexts)
only_in_protection = sorted(contexts - env_set)
if only_in_env:
findings.append(
"F3a — audit-force-merge.yml `REQUIRED_CHECKS` env has contexts NOT in "
f"branch_protections/{branch}.status_check_contexts (audit would flag "
"non-force-merges as force):\n"
+ "\n".join(f" - {c}" for c in only_in_env)
)
if only_in_protection:
findings.append(
"F3b — branch_protections/{br}.status_check_contexts has contexts NOT in "
"audit-force-merge.yml `REQUIRED_CHECKS` env (real force-merges would be "
"missed):\n".format(br=branch)
+ "\n".join(f" - {c}" for c in only_in_protection)
)
debug = {
"branch": branch,
"ci_jobs": sorted(jobs),
"sentinel_needs": sorted(needs),
"protection_contexts": sorted(contexts),
"audit_env_checks": sorted(env_set),
"expected_contexts": sorted(emitted_contexts),
}
return findings, debug
# --------------------------------------------------------------------------
# Issue file/update
# --------------------------------------------------------------------------
def title_for(branch: str) -> str:
# Idempotency key — keep stable, never include timestamp/SHA.
return f"[ci-drift] {REPO}/{branch}: required-checks divergence detected"
def find_open_issue(title: str) -> dict | None:
"""Return the existing open `[ci-drift]` issue for `title`, or None.
`None` means "search succeeded, no match" — NOT "search failed".
Per Five-Axis review on PR #112: returning None on a transient API
error caused the caller to POST a duplicate issue. Now api() raises
ApiError on any non-2xx; we let it propagate. The cron retries
hourly; failing one cycle loudly is strictly better than silently
duplicating.
Gitea issue search returns at most page=50 per page; one page is
enough as long as `[ci-drift]` issues are a tiny minority. (See
follow-up issue for Link-header pagination.)
"""
_, results = api(
"GET",
f"/repos/{OWNER}/{NAME}/issues",
query={"state": "open", "type": "issues", "limit": "50"},
)
if not isinstance(results, list):
raise ApiError(
f"issue search returned non-list body (got {type(results).__name__})"
)
for issue in results:
if issue.get("title") == title:
return issue
return None
def render_body(branch: str, findings: list[str], debug: dict) -> str:
body = [
f"# Drift detected on `{REPO}/{branch}`",
"",
"Auto-filed by `.gitea/workflows/ci-required-drift.yml` "
"(RFC [internal#219](https://git.moleculesai.app/molecule-ai/internal/issues/219) §4 + §6).",
"",
"## Findings",
"",
]
body.extend(findings)
body.extend(
[
"",
"## Resolution",
"",
"- **F1 / F1b**: add the missing job to `all-required.needs:` "
"in `.gitea/workflows/ci.yml`, or remove the stale entry.",
"- **F2**: rename the protection context to match an emitter, "
"or remove it from `status_check_contexts` "
"(PATCH `/api/v1/repos/{owner}/{repo}/branch_protections/{branch}`).",
"- **F3a / F3b**: bring `REQUIRED_CHECKS` env in "
"`.gitea/workflows/audit-force-merge.yml` into set-equality with "
"`status_check_contexts` (single PR, both files).",
"",
"## Debug",
"",
"```json",
json.dumps(debug, indent=2, sort_keys=True),
"```",
"",
"_This issue is idempotent: drift-detect runs hourly at `:17` "
"and edits this body in place. Close the issue once the drift "
"is fixed; the next hourly run will reopen if drift returns._",
]
)
return "\n".join(body)
def file_or_update(
branch: str,
findings: list[str],
debug: dict,
*,
dry_run: bool = False,
) -> None:
"""File a new `[ci-drift]` issue, or PATCH the existing one in place.
`dry_run=True` skips every side-effecting Gitea call (issue
search, POST, PATCH, label apply) and prints the would-be issue
title + body to stdout. Useful for local testing and for
debugging drift output without polluting the issue tracker.
"""
title = title_for(branch)
body = render_body(branch, findings, debug)
if dry_run:
print(f"::notice::[dry-run] would file/update drift issue for {branch}")
print(f"::group::[dry-run] title")
print(title)
print(f"::endgroup::")
print(f"::group::[dry-run] body")
print(body)
print(f"::endgroup::")
return
existing = find_open_issue(title)
if existing:
num = existing["number"]
api(
"PATCH",
f"/repos/{OWNER}/{NAME}/issues/{num}",
body={"body": body},
)
print(f"::notice::Updated existing drift issue #{num} for {branch}")
return
_, created = api(
"POST",
f"/repos/{OWNER}/{NAME}/issues",
body={"title": title, "body": body, "labels": []},
)
if not isinstance(created, dict):
sys.stderr.write("::error::POST issue response not a JSON object\n")
sys.exit(5)
new_num = created.get("number")
print(f"::warning::Filed new drift issue #{new_num} for {branch}")
# Apply label by name (Gitea's add-labels endpoint accepts label IDs;
# look up id by name once). Best-effort: failure to label is logged
# but does not fail the audit run — the issue itself IS the alarm.
try:
_, labels = api("GET", f"/repos/{OWNER}/{NAME}/labels")
except ApiError as e:
sys.stderr.write(f"::warning::could not list labels: {e}\n")
return
label_id = None
if isinstance(labels, list):
for lbl in labels:
if lbl.get("name") == DRIFT_LABEL:
label_id = lbl.get("id")
break
if label_id is not None and new_num:
try:
api(
"POST",
f"/repos/{OWNER}/{NAME}/issues/{new_num}/labels",
body={"labels": [label_id]},
)
except ApiError as e:
sys.stderr.write(
f"::warning::could not apply label '{DRIFT_LABEL}' to #{new_num}: {e}\n"
)
else:
sys.stderr.write(f"::warning::label '{DRIFT_LABEL}' not found on repo\n")
# --------------------------------------------------------------------------
# Main
# --------------------------------------------------------------------------
def _parse_args(argv: list[str] | None = None) -> argparse.Namespace:
p = argparse.ArgumentParser(
prog="ci-required-drift",
description="Detect drift between ci.yml, branch_protections, "
"and audit-force-merge.yml REQUIRED_CHECKS env.",
)
p.add_argument(
"--dry-run",
action="store_true",
help="Detect + print findings to stdout; do NOT file or PATCH "
"the `[ci-drift]` issue. Useful for local testing and for "
"previewing output before turning the workflow loose.",
)
return p.parse_args(argv)
def main(argv: list[str] | None = None) -> int:
args = _parse_args(argv)
_require_runtime_env()
for branch in BRANCHES:
findings, debug = detect_drift(branch)
if findings:
print(f"::warning::Drift detected on {branch}:")
for f in findings:
print(f)
file_or_update(branch, findings, debug, dry_run=args.dry_run)
else:
print(f"::notice::No drift on {branch}.")
print(json.dumps(debug, indent=2, sort_keys=True))
# Exit 0 even on drift — the issue IS the alarm, not a red workflow.
# A red workflow here would page on a CI rename until the issue is
# opened, doubling the noise. The issue itself is the actionable
# surface. (`api()` raising ApiError is the only path that exits
# non-zero, by design: a transient Gitea outage should fail loudly.)
return 0
if __name__ == "__main__":
sys.exit(main())
-40
View File
@@ -1,40 +0,0 @@
#!/usr/bin/env python3
"""Extract changed-file list from Gitea Compare API JSON response.
Gitea Compare API returns changed files nested inside commits, not at the
top level:
{"commits": [{"files": [{"filename": "path/to/file"}]}]}
Usage:
compare-api-diff-files.py < API_RESPONSE.json
Exits 0 with filenames on stdout, one per line.
Exits 1 on malformed input (caller should handle as "no files").
"""
from __future__ import annotations
import sys
import json
def main() -> None:
try:
data = json.load(sys.stdin)
except Exception:
sys.exit(1)
filenames: list[str] = []
for commit in data.get("commits", []):
for f in commit.get("files", []):
fn = f.get("filename", "")
if fn:
filenames.append(fn)
if filenames:
sys.stdout.write("\n".join(filenames))
sys.stdout.write("\n")
# else: empty stdout = no files, caller treats as empty list
if __name__ == "__main__":
main()
-174
View File
@@ -1,174 +0,0 @@
#!/usr/bin/env python3
"""Shared path-filter helper for Gitea Actions workflows.
Computes changed files against the PR base SHA or push-before SHA and writes
boolean outputs to GITHUB_OUTPUT. If the diff base is missing or untrusted, the
helper fails open by setting every output in the selected profile to true.
"""
from __future__ import annotations
import argparse
import os
import re
import subprocess
import sys
from pathlib import Path
PROFILES: dict[str, dict[str, str]] = {
"ci": {
"platform": r"^workspace-server/",
"canvas": r"^canvas/",
"python": r"^workspace/",
"scripts": r"^tests/e2e/|^scripts/|^infra/scripts/",
},
"handlers-postgres": {
"handlers": (
r"^workspace-server/internal/handlers/"
r"|^workspace-server/internal/wsauth/"
r"|^workspace-server/migrations/"
r"|^\.gitea/workflows/handlers-postgres-integration\.yml$"
),
},
"e2e-api": {
"api": r"^workspace-server/|^tests/e2e/|^\.gitea/workflows/e2e-api\.yml$",
},
}
def classify(profile: str, paths: list[str]) -> dict[str, bool]:
patterns = PROFILES[profile]
return {
name: any(re.search(pattern, path) for path in paths)
for name, pattern in patterns.items()
}
def all_true(profile: str) -> dict[str, bool]:
return {name: True for name in PROFILES[profile]}
def resolve_base(event_name: str, pr_base_sha: str, push_before: str) -> str:
if event_name == "pull_request" and pr_base_sha:
return pr_base_sha
return push_before
def is_zero_sha(value: str) -> bool:
return not value or bool(re.fullmatch(r"0+", value))
def run_git(args: list[str], *, timeout: int = 30) -> subprocess.CompletedProcess[str]:
return subprocess.run(
["git", *args],
check=False,
text=True,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
timeout=timeout,
)
def base_exists(base: str) -> bool:
return run_git(["cat-file", "-e", base]).returncode == 0
def fetch_base(base: str, base_ref: str) -> None:
# Gitea may reject fetching an arbitrary unadvertised SHA from a shallow
# PR checkout. Fetch the advertised base branch first, then fall back to
# the SHA for hosts that allow it.
if base_ref:
run_git(["fetch", "--depth=1", "origin", base_ref])
if not base_exists(base):
run_git(["fetch", "--depth=1", "origin", base])
def deepen_base_ref(base_ref: str) -> None:
if base_ref:
run_git(["fetch", "--deepen=200", "origin", base_ref], timeout=60)
def merge_base(base: str) -> str | None:
proc = run_git(["merge-base", base, "HEAD"])
if proc.returncode != 0:
return None
value = proc.stdout.strip()
return value or None
def changed_paths(base: str, *, use_merge_base: bool) -> list[str] | None:
compare_base = base
if use_merge_base:
compare_base = merge_base(base) or ""
if not compare_base:
return None
proc = run_git(["diff", "--name-only", compare_base, "HEAD"])
if proc.returncode != 0:
return None
return [line for line in proc.stdout.splitlines() if line]
def write_outputs(values: dict[str, bool], output_path: str | None) -> None:
lines = [f"{name}={'true' if value else 'false'}" for name, value in values.items()]
if output_path:
with Path(output_path).open("a", encoding="utf-8") as fh:
for line in lines:
fh.write(line + "\n")
else:
for line in lines:
print(line)
def detect(
profile: str,
event_name: str,
pr_base_sha: str,
push_before: str,
base_ref: str = "",
) -> dict[str, bool]:
base = resolve_base(event_name, pr_base_sha, push_before)
if is_zero_sha(base):
return all_true(profile)
if not base_exists(base):
fetch_base(base, base_ref)
if not base_exists(base):
return all_true(profile)
use_merge_base = event_name == "pull_request"
if use_merge_base and base_ref and merge_base(base) is None:
deepen_base_ref(base_ref)
paths = changed_paths(base, use_merge_base=use_merge_base)
if paths is None:
return all_true(profile)
return classify(profile, paths)
def parse_args(argv: list[str]) -> argparse.Namespace:
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("--profile", required=True, choices=sorted(PROFILES))
parser.add_argument("--event-name", default=os.environ.get("GITHUB_EVENT_NAME", ""))
parser.add_argument("--pr-base-sha", default="")
parser.add_argument("--base-ref", default="")
parser.add_argument("--push-before", default=os.environ.get("GITHUB_EVENT_BEFORE", ""))
return parser.parse_args(argv)
def main(argv: list[str]) -> int:
args = parse_args(argv)
values = detect(
args.profile,
args.event_name,
args.pr_base_sha,
args.push_before,
args.base_ref,
)
write_outputs(values, os.environ.get("GITHUB_OUTPUT"))
return 0
if __name__ == "__main__":
sys.exit(main(sys.argv[1:]))
-501
View File
@@ -1,501 +0,0 @@
#!/usr/bin/env python3
"""gitea-merge-queue — conservative serialized merge bot for Gitea.
Gitea 1.22.6 has auto-merge (`pull_auto_merge`) but no GitHub-style merge
queue. This script provides the missing serialized policy in user space:
1. Pick the oldest open PR carrying QUEUE_LABEL.
2. Refuse to act unless main is green.
3. Refuse fork PRs; the queue may only mutate same-repo branches.
4. If the PR branch does not contain current main, call Gitea's
/pulls/{n}/update endpoint and stop. CI must rerun on the updated head.
5. If the updated PR head has all required contexts green, merge with the
non-bypass merge actor token.
The script is intentionally one-PR-per-run. Workflow/cron concurrency should
serialize invocations so two green PRs cannot merge against the same main.
"""
from __future__ import annotations
import argparse
import dataclasses
import json
import os
import sys
import urllib.error
import urllib.parse
import urllib.request
from typing import Any
def _env(key: str, *, default: str = "") -> str:
return os.environ.get(key, default)
GITEA_TOKEN = _env("GITEA_TOKEN")
GITEA_HOST = _env("GITEA_HOST")
REPO = _env("REPO")
WATCH_BRANCH = _env("WATCH_BRANCH", default="main")
QUEUE_LABEL = _env("QUEUE_LABEL", default="merge-queue")
HOLD_LABEL = _env("HOLD_LABEL", default="merge-queue-hold")
UPDATE_STYLE = _env("UPDATE_STYLE", default="merge")
REQUIRED_CONTEXTS_RAW = _env(
"REQUIRED_CONTEXTS",
default=(
"CI / all-required (pull_request),"
"sop-checklist / all-items-acked (pull_request)"
),
)
# Required contexts for push (main/staging) runs. The push CI uses the same
# aggregator names with " (push)" suffix. Checking these explicitly instead of
# the combined state avoids false-pause when non-blocking jobs (e.g. Platform
# Go with continue-on-error: true due to mc#774) have failed — their failures
# pollute the combined state but do not block merges.
PUSH_REQUIRED_CONTEXTS_RAW = _env(
"PUSH_REQUIRED_CONTEXTS",
default="CI / all-required (push)",
)
OWNER, NAME = (REPO.split("/", 1) + [""])[:2] if REPO else ("", "")
API = f"https://{GITEA_HOST}/api/v1" if GITEA_HOST else ""
class ApiError(RuntimeError):
pass
class MergePermissionError(ApiError):
"""Merge failed with a permanent permission error (403/404/405).
The queue should skip this PR and move to the next one."""
@dataclasses.dataclass(frozen=True)
class MergeDecision:
ready: bool
action: str
reason: str
def _require_runtime_env() -> None:
for key in ("GITEA_TOKEN", "GITEA_HOST", "REPO", "WATCH_BRANCH", "QUEUE_LABEL"):
if not os.environ.get(key):
sys.stderr.write(f"::error::missing required env var: {key}\n")
sys.exit(2)
if UPDATE_STYLE not in {"merge", "rebase"}:
sys.stderr.write("::error::UPDATE_STYLE must be merge or rebase\n")
sys.exit(2)
def api(
method: str,
path: str,
*,
body: dict | None = None,
query: dict[str, str] | None = None,
expect_json: bool = True,
) -> tuple[int, Any]:
url = f"{API}{path}"
if query:
url = f"{url}?{urllib.parse.urlencode(query)}"
data = None
headers = {
"Authorization": f"token {GITEA_TOKEN}",
"Accept": "application/json",
}
if body is not None:
data = json.dumps(body).encode("utf-8")
headers["Content-Type"] = "application/json"
req = urllib.request.Request(url, method=method, data=data, headers=headers)
try:
with urllib.request.urlopen(req, timeout=30) as resp:
raw = resp.read()
status = resp.status
except urllib.error.HTTPError as exc:
raw = exc.read()
status = exc.code
if not (200 <= status < 300):
snippet = raw[:500].decode("utf-8", errors="replace") if raw else ""
raise ApiError(f"{method} {path} -> HTTP {status}: {snippet}")
if not raw:
return status, None
try:
return status, json.loads(raw)
except json.JSONDecodeError as exc:
if expect_json:
raise ApiError(f"{method} {path} -> HTTP {status} non-JSON: {exc}") from exc
return status, {"_raw": raw.decode("utf-8", errors="replace")}
def required_contexts(raw: str) -> list[str]:
return [part.strip() for part in raw.split(",") if part.strip()]
def push_required_contexts() -> list[str]:
"""Required contexts for push (branch) CI runs. See PUSH_REQUIRED_CONTEXTS_RAW."""
return required_contexts(PUSH_REQUIRED_CONTEXTS_RAW)
def status_state(status: dict) -> str:
return str(status.get("status") or status.get("state") or "").lower()
def latest_statuses_by_context(statuses: list[dict]) -> dict[str, dict]:
# Gitea /statuses endpoint returns entries in ascending id order (oldest
# first). We need the LAST occurrence of each context, so iterate in
# reverse to prefer newer entries.
latest: dict[str, dict] = {}
for status in reversed(statuses):
context = status.get("context")
if isinstance(context, str):
latest[context] = status # overwrite: reverse order → newest wins
return latest
def _is_tier_low_pending_ok(
latest_statuses: dict[str, dict],
context: str,
pr_labels: set[str],
) -> bool:
"""Return True if tier:low PR can tolerate sop-checklist pending state.
Per sop-checklist-config.yaml tier_failure_mode, tier:low uses soft-fail:
sop-checklist posts state=pending when acks are satisfied (missing
manager/ceo acks are informational only). The queue should accept
pending instead of waiting for success.
"""
if "tier:low" not in pr_labels:
return False
if "sop-checklist" not in context:
return False
status = latest_statuses.get(context) or {}
return status_state(status) == "pending"
def required_contexts_green(
latest_statuses: dict[str, dict],
contexts: list[str],
pr_labels: set[str] | None = None,
) -> tuple[bool, list[str]]:
missing_or_bad: list[str] = []
for context in contexts:
status = latest_statuses.get(context)
state = status_state(status or {})
if state != "success":
if pr_labels and _is_tier_low_pending_ok(latest_statuses, context, pr_labels):
continue # tier:low soft-fail: accept pending sop-checklist
missing_or_bad.append(f"{context}={state or 'missing'}")
return not missing_or_bad, missing_or_bad
def label_names(issue: dict) -> set[str]:
return {
label["name"]
for label in issue.get("labels", [])
if isinstance(label, dict) and isinstance(label.get("name"), str)
}
def choose_next_queued_issue(
issues: list[dict],
*,
queue_label: str,
hold_label: str = "",
) -> dict | None:
candidates = []
for issue in issues:
labels = label_names(issue)
if queue_label not in labels:
continue
if hold_label and hold_label in labels:
continue
if "pull_request" not in issue:
continue
candidates.append(issue)
candidates.sort(key=lambda issue: (issue.get("created_at") or "", int(issue["number"])))
return candidates[0] if candidates else None
def pr_contains_base_sha(commits: list[dict], base_sha: str) -> bool:
for commit in commits:
sha = commit.get("sha") or commit.get("id")
if sha == base_sha:
return True
return False
def pr_has_current_base(pr: dict, commits: list[dict], main_sha: str) -> bool:
if pr.get("merge_base") == main_sha:
return True
return pr_contains_base_sha(commits, main_sha)
def evaluate_merge_readiness(
*,
main_status: dict,
pr_status: dict,
required_contexts: list[str],
pr_has_current_base: bool,
pr_labels: set[str] | None = None,
) -> MergeDecision:
# Check push-required contexts explicitly instead of combined state.
# Combined state can be "failure" due to non-blocking jobs
# (continue-on-error: true) that don't actually gate merges.
# CI / all-required (push) is the authoritative gate — it respects
# continue-on-error and correctly aggregates all blocking failures.
main_latest = latest_statuses_by_context(main_status.get("statuses") or [])
main_ok, main_bad = required_contexts_green(main_latest, push_required_contexts())
if not main_ok:
return MergeDecision(False, "pause", "main required contexts not green: " + ", ".join(main_bad))
if not pr_has_current_base:
return MergeDecision(False, "update", "PR head does not contain current main")
# Check explicit required contexts instead of combined state. Combined state
# can be "failure" due to non-blocking jobs with continue-on-error: true
# (e.g. publish-runtime-autobump/pr-validate, qa-review on stale tokens).
# The required_contexts list is the authoritative gate — it includes only
# the checks that actually block merges.
latest = latest_statuses_by_context(pr_status.get("statuses") or [])
ok, missing_or_bad = required_contexts_green(latest, required_contexts, pr_labels)
if not ok:
return MergeDecision(False, "wait", "required contexts not green: " + ", ".join(missing_or_bad))
return MergeDecision(True, "merge", "ready")
def get_branch_head(branch: str) -> str:
_, body = api("GET", f"/repos/{OWNER}/{NAME}/branches/{branch}")
commit = body.get("commit") if isinstance(body, dict) else None
sha = commit.get("id") if isinstance(commit, dict) else None
if not isinstance(sha, str) or len(sha) < 7:
raise ApiError(f"branch {branch} response missing commit id")
return sha
def get_combined_status(sha: str) -> dict:
"""Combined status + all individual statuses for `sha`.
The /status endpoint caps the `statuses` array at 30 entries (Gitea
default page size), so we fetch the full list via /statuses with a
higher limit. The combined `state` still comes from /status.
"""
_, combined = api("GET", f"/repos/{OWNER}/{NAME}/commits/{sha}/status")
if not isinstance(combined, dict):
raise ApiError(f"status for {sha} response not object")
combined_statuses: list[dict] = combined.get("statuses") or []
try:
_, all_statuses_raw = api(
"GET",
f"/repos/{OWNER}/{NAME}/commits/{sha}/statuses",
query={"limit": "50"},
)
if isinstance(all_statuses_raw, list):
all_statuses: list[dict] = list(all_statuses_raw)
else:
all_statuses = []
except (ApiError, urllib.error.URLError, TimeoutError, OSError) as exc:
sys.stderr.write(f"::warning::could not fetch full statuses list for {sha[:8]}: {exc}\n")
all_statuses = []
# Build latest per context: process combined (ascending→reverse=newest
# first), then fill gaps from all_statuses (already newest-first).
latest: dict[str, dict] = {}
for status in reversed(sorted(combined_statuses, key=lambda s: s.get("id") or 0)):
ctx = status.get("context")
if isinstance(ctx, str) and ctx not in latest:
latest[ctx] = status
for status in all_statuses:
ctx = status.get("context")
if isinstance(ctx, str) and ctx not in latest:
latest[ctx] = status
combined["statuses"] = list(latest.values())
return combined
def list_queued_issues() -> list[dict]:
_, body = api(
"GET",
f"/repos/{OWNER}/{NAME}/issues",
query={
"state": "open",
"type": "pulls",
"labels": QUEUE_LABEL,
"limit": "50",
},
)
if not isinstance(body, list):
raise ApiError("queued issues response not list")
return body
def get_pull(pr_number: int) -> dict:
_, body = api("GET", f"/repos/{OWNER}/{NAME}/pulls/{pr_number}")
if not isinstance(body, dict):
raise ApiError(f"PR #{pr_number} response not object")
return body
def get_pull_commits(pr_number: int) -> list[dict]:
_, body = api("GET", f"/repos/{OWNER}/{NAME}/pulls/{pr_number}/commits")
if not isinstance(body, list):
raise ApiError(f"PR #{pr_number} commits response not list")
return body
def post_comment(pr_number: int, body: str, *, dry_run: bool) -> None:
print(f"::notice::comment PR #{pr_number}: {body.splitlines()[0][:160]}")
if dry_run:
return
api("POST", f"/repos/{OWNER}/{NAME}/issues/{pr_number}/comments", body={"body": body})
def update_pull(pr_number: int, *, dry_run: bool) -> None:
print(f"::notice::updating PR #{pr_number} with base branch via style={UPDATE_STYLE}")
if dry_run:
return
api(
"POST",
f"/repos/{OWNER}/{NAME}/pulls/{pr_number}/update",
query={"style": UPDATE_STYLE},
expect_json=False,
)
def merge_pull(pr_number: int, *, dry_run: bool) -> None:
payload = {
"Do": "merge",
"MergeTitleField": f"Merge PR #{pr_number} via Gitea merge queue",
"MergeMessageField": (
"Serialized merge by gitea-merge-queue after current-main, "
"SOP, and required CI checks were green."
),
}
print(f"::notice::merging PR #{pr_number}")
if dry_run:
return
try:
api("POST", f"/repos/{OWNER}/{NAME}/pulls/{pr_number}/merge", body=payload, expect_json=False)
except ApiError as exc:
# Re-raise permission-like errors so process_once can skip this PR.
# 403 = no push access, 404 = repo/pr not found, 405 = not allowed.
msg = str(exc)
for code in ("403", "404", "405"):
if code in msg:
raise MergePermissionError(msg) from exc
raise # re-raise other ApiErrors unchanged
def process_once(*, dry_run: bool = False) -> int:
contexts = required_contexts(REQUIRED_CONTEXTS_RAW)
main_sha = get_branch_head(WATCH_BRANCH)
main_status = get_combined_status(main_sha)
# Check push-required contexts explicitly instead of combined state.
# See evaluate_merge_readiness for rationale.
main_latest = latest_statuses_by_context(main_status.get("statuses") or [])
main_ok, main_bad = required_contexts_green(main_latest, push_required_contexts())
if not main_ok:
print(f"::notice::queue paused: {WATCH_BRANCH}@{main_sha[:8]} required contexts not green: {', '.join(main_bad)}")
return 0
issue = choose_next_queued_issue(
list_queued_issues(),
queue_label=QUEUE_LABEL,
hold_label=HOLD_LABEL,
)
if not issue:
print("::notice::merge queue empty")
return 0
pr_number = int(issue["number"])
pr = get_pull(pr_number)
if pr.get("state") != "open":
print(f"::notice::PR #{pr_number} is not open; skipping")
return 0
if pr.get("base", {}).get("ref") != WATCH_BRANCH:
post_comment(pr_number, f"merge-queue: skipped; base branch is not `{WATCH_BRANCH}`.", dry_run=dry_run)
return 0
if pr.get("head", {}).get("repo_id") != pr.get("base", {}).get("repo_id"):
post_comment(pr_number, "merge-queue: skipped; fork PRs are not supported by the serialized queue.", dry_run=dry_run)
return 0
head_sha = pr.get("head", {}).get("sha")
if not isinstance(head_sha, str) or len(head_sha) < 7:
raise ApiError(f"PR #{pr_number} missing head sha")
commits = get_pull_commits(pr_number)
current_base = pr_has_current_base(pr, commits, main_sha)
pr_status = get_combined_status(head_sha)
pr_labels = label_names(pr)
decision = evaluate_merge_readiness(
main_status=main_status,
pr_status=pr_status,
required_contexts=contexts,
pr_has_current_base=current_base,
pr_labels=pr_labels,
)
print(f"::notice::PR #{pr_number} decision={decision.action}: {decision.reason}")
if decision.action == "update":
update_pull(pr_number, dry_run=dry_run)
post_comment(
pr_number,
(
f"merge-queue: updated this branch with `{WATCH_BRANCH}` at "
f"`{main_sha[:12]}`. Waiting for CI on the refreshed head."
),
dry_run=dry_run,
)
return 0
if decision.ready:
latest_main_sha = get_branch_head(WATCH_BRANCH)
if latest_main_sha != main_sha:
print(
f"::notice::main moved {main_sha[:8]} -> {latest_main_sha[:8]}; "
"deferring to next tick"
)
return 0
try:
merge_pull(pr_number, dry_run=dry_run)
except MergePermissionError as exc:
# Permanent merge failure (HTTP 403/404/405). Post a comment so
# maintainers know why, then return 0 so this tick is done.
# The PR stays in the queue; future ticks can retry after the
# permission issue is resolved.
sys.stderr.write(f"::error::merge permission error for PR #{pr_number}: {exc}\n")
post_comment(
pr_number,
(
"merge-queue: merge failed with HTTP 405 'User not allowed to merge PR'. "
"No available token has Can-merge permission on this repo. "
"Fix: grant Can-merge to a token, or add a maintain/admin collaborator. "
"Skipping to next queued PR on next tick."
),
dry_run=dry_run,
)
return 0
return 0
return 0
def main() -> int:
parser = argparse.ArgumentParser()
parser.add_argument("--dry-run", action="store_true")
args = parser.parse_args()
_require_runtime_env()
try:
return process_once(dry_run=args.dry_run)
except ApiError as exc:
# API errors (401/403/404/500) are transient for a queue tick —
# log and exit 0 so the workflow is not marked failed and the next
# tick can retry. Returning non-zero would permanently fail the
# workflow run, blocking future ticks.
sys.stderr.write(f"::error::queue API error: {exc}\n")
return 0
except urllib.error.URLError as exc:
sys.stderr.write(f"::error::queue network error: {exc}\n")
return 0
except TimeoutError as exc:
sys.stderr.write(f"::error::queue timeout: {exc}\n")
return 0
if __name__ == "__main__":
sys.exit(main())
-113
View File
@@ -1,113 +0,0 @@
#!/usr/bin/env python3
"""Lint workflow bash for curl status-code capture pollution.
The bad shape is:
HTTP_CODE=$(curl ... -w '%{http_code}' ... || echo "000")
`curl -w` writes the HTTP code to stdout before returning non-zero, so
fallback output inside the same command substitution appends another code.
"""
from __future__ import annotations
import argparse
import glob
import re
import sys
from pathlib import Path
from typing import NamedTuple
SELF = ".gitea/workflows/lint-curl-status-capture.yml"
class Finding(NamedTuple):
path: str
snippet: str
BAD_STATUS_CAPTURE = re.compile(
r"""
\$\(\s*
curl\b
[^)]*
-w\s*['"]%\{http_code\}['"]
[^)]*
\|\|\s*
(?:
echo\s+['"]?000['"]?
|
printf\s+['"]000['"]
)
\s*\)
""",
re.DOTALL | re.VERBOSE,
)
def _logical_shell(content: str) -> str:
"""Collapse bash line continuations so one curl command is one string."""
return re.sub(r"\\\s*\n\s*", " ", content)
def scan_content(path: str, content: str) -> list[Finding]:
flat = _logical_shell(content)
return [
Finding(path=path, snippet=re.sub(r"\s+", " ", match.group(0)).strip()[:160])
for match in BAD_STATUS_CAPTURE.finditer(flat)
]
def scan_paths(paths: list[str]) -> list[Finding]:
findings: list[Finding] = []
for path in paths:
if path == SELF:
continue
content = Path(path).read_text(encoding="utf-8")
findings.extend(scan_content(path, content))
return findings
def default_paths() -> list[str]:
return sorted(glob.glob(".gitea/workflows/*.yml"))
def print_report(findings: list[Finding]) -> None:
if not findings:
print("OK No curl-status-capture pollution patterns detected")
return
print(f"::error::Found {len(findings)} curl-status-capture pollution site(s):")
for finding in findings:
print(
f"::error file={finding.path}::Curl status-capture pollution: "
"'|| echo/printf 000' inside a $(curl ... -w '%{http_code}' ...) "
"subshell. On non-2xx or connection failure, curl's -w writes a "
"status, then exits non-zero, then the fallback appends another "
"status. Fix: route -w into a tempfile so the exit code cannot "
"pollute stdout."
)
print(f" matched: {finding.snippet}...")
print()
print("Fix template:")
print(" set +e")
print(" curl ... -w '%{http_code}' >code.txt 2>/dev/null")
print(" set -e")
print(' HTTP_CODE=$(cat code.txt 2>/dev/null)')
print(' [ -z "$HTTP_CODE" ] && HTTP_CODE="000"')
def main(argv: list[str] | None = None) -> int:
parser = argparse.ArgumentParser()
parser.add_argument("paths", nargs="*", help="workflow files to scan")
args = parser.parse_args(argv)
paths = args.paths or default_paths()
findings = scan_paths(paths)
print_report(findings)
return 1 if findings else 0
if __name__ == "__main__":
raise SystemExit(main())
-404
View File
@@ -1,404 +0,0 @@
#!/usr/bin/env python3
"""lint-required-no-paths — structural enforcement of
`feedback_path_filtered_workflow_cant_be_required`.
For every workflow whose status-check context appears in
`branch_protections/<branch>.status_check_contexts`, assert that the
workflow's `on:` block has NO `paths:` and NO `paths-ignore:` filter.
A required-check workflow with a paths filter silently degrades the
merge gate:
- If the PR's diff doesn't match the `paths:` glob, the workflow
never fires.
- Gitea (1.22.6) reports the required context as `pending` (never as
`skipped == success`), so the PR cannot merge.
- For a docs-only PR against `paths: ['**.go']`, the PR is
blocked forever — no human action can produce a green.
The class was previously prevented only by reviewer vigilance + the
saved memory `feedback_path_filtered_workflow_cant_be_required`. This
script makes it a hard CI gate so a future PR adding `paths:` to a
required workflow fails fast at PR time, not after merge when the next
docs PR wedges main.
The lint runs as `.gitea/workflows/lint-required-no-paths.yml` on every
PR. The lint workflow ITSELF must not have a paths-filter (otherwise it
could be circumvented by a paths-non-matching PR) — that's enforced by
self-reference and by the workflow's own `on:` block deliberately
omitting filters.
Sources of truth:
- `branch_protections/<branch>` `status_check_contexts` (the merge gate)
- `.gitea/workflows/*.yml` `name:` + `on:` (the workflow set)
Context-format note (Gitea 1.22.6):
Status-check contexts are formatted `{workflow_name} / {job_name_or_key} ({event})`.
We parse the workflow_name prefix and walk `.gitea/workflows/*.yml` for
a file whose `name:` attr matches. (The filename is NOT the source of
truth; `name:` is, because Gitea formats the context from `name:`.)
Exit codes:
0 — no required workflow has a paths/paths-ignore filter (clean) OR
branch_protections endpoint returned 403/404 (token-scope issue;
surfaced via ::error:: but non-fatal so a missing scope doesn't
red-X every PR — fix the token, not the lint).
1 — at least one required workflow has a paths/paths-ignore filter
(the gate-degrading defect class).
2 — env contract violation (missing GITEA_TOKEN/HOST/REPO/BRANCH).
3 — workflows directory missing or workflow YAML unparseable.
4 — protection response shape unexpected (non-dict body on 2xx).
Auth note: `GET /repos/.../branch_protections/{branch}` requires
repo-admin role in Gitea 1.22.6. The workflow-default `GITHUB_TOKEN`
is non-admin; we re-use `DRIFT_BOT_TOKEN` (same persona that powers
ci-required-drift.yml). If `DRIFT_BOT_TOKEN` is unavailable in a future
context, the script falls through gracefully (exit 0 + ::error::).
"""
from __future__ import annotations
import json
import os
import re
import sys
import urllib.error
import urllib.parse
import urllib.request
from pathlib import Path
from typing import Any
import yaml # PyYAML 6.0.2 — installed by the workflow before this runs.
# --------------------------------------------------------------------------
# Environment
# --------------------------------------------------------------------------
def _env(key: str, *, required: bool = True, default: str | None = None) -> str:
val = os.environ.get(key, default)
if required and not val:
sys.stderr.write(f"::error::missing required env var: {key}\n")
sys.exit(2)
return val or ""
GITEA_TOKEN = _env("GITEA_TOKEN", required=False)
GITEA_HOST = _env("GITEA_HOST", required=False)
REPO = _env("REPO", required=False)
BRANCH = _env("BRANCH", required=False, default="main")
WORKFLOWS_DIR = _env(
"WORKFLOWS_DIR", required=False, default=".gitea/workflows"
)
OWNER, NAME = (REPO.split("/", 1) + [""])[:2] if REPO else ("", "")
API = f"https://{GITEA_HOST}/api/v1" if GITEA_HOST else ""
def _require_runtime_env() -> None:
"""Enforce env contract — called from `run()` only. Tests import
individual functions without setting the full env contract."""
for key in ("GITEA_TOKEN", "GITEA_HOST", "REPO", "BRANCH"):
if not os.environ.get(key):
sys.stderr.write(f"::error::missing required env var: {key}\n")
sys.exit(2)
# --------------------------------------------------------------------------
# Tiny HTTP helper (mirrors ci-required-drift.py contract:
# raise on non-2xx and on JSON-decode-fail when JSON expected, per
# `feedback_api_helper_must_raise_not_return_dict`).
# --------------------------------------------------------------------------
class ApiError(RuntimeError):
"""Raised when a Gitea API call cannot be trusted to have succeeded."""
def api(
method: str,
path: str,
*,
body: dict | None = None,
query: dict[str, str] | None = None,
expect_json: bool = True,
) -> tuple[int, Any]:
url = f"{API}{path}"
if query:
url = f"{url}?{urllib.parse.urlencode(query)}"
data = None
headers = {
"Authorization": f"token {GITEA_TOKEN}",
"Accept": "application/json",
}
if body is not None:
data = json.dumps(body).encode("utf-8")
headers["Content-Type"] = "application/json"
req = urllib.request.Request(url, method=method, data=data, headers=headers)
try:
with urllib.request.urlopen(req, timeout=30) as resp:
raw = resp.read()
status = resp.status
except urllib.error.HTTPError as e:
raw = e.read()
status = e.code
if not (200 <= status < 300):
snippet = raw[:500].decode("utf-8", errors="replace") if raw else ""
raise ApiError(f"{method} {path} → HTTP {status}: {snippet}")
if not raw:
return status, None
try:
return status, json.loads(raw)
except json.JSONDecodeError as e:
if expect_json:
raise ApiError(
f"{method} {path} → HTTP {status} but body is not JSON: {e}"
) from e
return status, {"_raw": raw.decode("utf-8", errors="replace")}
# --------------------------------------------------------------------------
# Status-check context parser
# --------------------------------------------------------------------------
# Format: "<workflow_name> / <job_name_or_key> (<event>)"
# Examples observed on molecule-core/main:
# "Secret scan / Scan diff for credential-shaped strings (pull_request)"
# "sop-tier-check / tier-check (pull_request)"
#
# Split strategy: peel off the trailing ` (<event>)` first, then split
# the leading `<workflow> / <rest>` on the FIRST ` / ` (workflow names
# come from `name:` attrs which conventionally don't embed ' / '; job
# names CAN, so we keep the rest of the slash-divided text as the job
# name). This matches Gitea's `name: ` semantics.
_CONTEXT_RE = re.compile(r"^(?P<workflow>.+?) / (?P<job>.+) \((?P<event>[^)]+)\)$")
def parse_context(ctx: str) -> tuple[str, str, str] | None:
"""Parse `<workflow> / <job> (<event>)` → (workflow, job, event) or None."""
if not ctx:
return None
m = _CONTEXT_RE.match(ctx)
if not m:
return None
return m.group("workflow"), m.group("job"), m.group("event")
# --------------------------------------------------------------------------
# workflow-name → file resolution
# --------------------------------------------------------------------------
def _iter_workflow_files() -> list[Path]:
d = Path(WORKFLOWS_DIR)
if not d.is_dir():
sys.stderr.write(f"::error::workflows directory not found: {d}\n")
sys.exit(3)
# `.yml` and `.yaml` — Gitea accepts both (rarely used `.yaml`, but
# don't silently miss it if a future port uses it).
return sorted(list(d.glob("*.yml")) + list(d.glob("*.yaml")))
def resolve_workflow_file(workflow_name: str) -> Path | None:
"""Find the YAML file whose `name:` attr matches `workflow_name`.
Returns None if no match. Filename is NOT used as a fallback —
Gitea's context format uses `name:`, so a `name:`-less workflow
won't even appear in the protection list. (A YAML with no `name:`
would default the context to the file basename, but our protection
contexts on molecule-core are all `name:`-derived; we trust the
same.)
"""
for f in _iter_workflow_files():
try:
doc = yaml.safe_load(f.read_text(encoding="utf-8"))
except yaml.YAMLError as e:
sys.stderr.write(f"::error::YAML parse error in {f}: {e}\n")
sys.exit(3)
if isinstance(doc, dict) and doc.get("name") == workflow_name:
return f
return None
# --------------------------------------------------------------------------
# paths-filter detection
# --------------------------------------------------------------------------
# Triggers that accept `paths:` / `paths-ignore:` (per GitHub Actions /
# Gitea Actions docs): pull_request, pull_request_target, push.
# We don't enumerate — any sub-key named `paths` or `paths-ignore`
# inside an event mapping is flagged.
_PATHS_KEYS = ("paths", "paths-ignore")
def detect_paths_filters(workflow_path: Path) -> list[str]:
"""Walk the workflow's `on:` block and return a list of findings, one
per offending `paths`/`paths-ignore` key.
Returns:
Empty list if the workflow has no paths/paths-ignore filter
anywhere in its `on:` block. Otherwise, a list of human-readable
strings naming the event and filter key + the filter contents.
"""
try:
doc = yaml.safe_load(workflow_path.read_text(encoding="utf-8"))
except yaml.YAMLError as e:
sys.stderr.write(f"::error::YAML parse error in {workflow_path}: {e}\n")
sys.exit(3)
if not isinstance(doc, dict):
return []
on_block = doc.get("on") or doc.get(True) # PyYAML 6 quirk: `on:`
# under default constructor sometimes becomes the bool key `True`
# because YAML 1.1 treats `on` as a boolean. Tolerate both.
if on_block is None:
return []
findings: list[str] = []
# Shape A: `on: pull_request` (string shorthand) — cannot carry filters.
if isinstance(on_block, str):
return []
# Shape B: `on: [pull_request, push]` (list shorthand) — cannot carry filters.
if isinstance(on_block, list):
return []
# Shape C: `on: { event: { ... } }` — the standard mapping case.
if isinstance(on_block, dict):
# Defensive: top-level malformed `on.paths` (someone wrote
# `on: { paths: ['x'] }` thinking it's a workflow-level filter).
# This is invalid syntax, but if present, flag it — it might
# not block the workflow from registering (Gitea may ignore the
# unknown key) and would create a false sense of "filter exists"
# the lint should still surface.
for k in _PATHS_KEYS:
if k in on_block:
v = on_block[k]
findings.append(
f"top-level `on.{k}` filter (malformed but present): {v!r}"
)
for event, event_body in on_block.items():
if event in _PATHS_KEYS:
continue # already handled above
if not isinstance(event_body, dict):
# `pull_request: null` / `pull_request: [opened]` shapes —
# no place for a paths filter to live; skip.
continue
for k in _PATHS_KEYS:
if k in event_body:
v = event_body[k]
findings.append(
f"`on.{event}.{k}` filter present: {v!r}"
)
return findings
# --------------------------------------------------------------------------
# Driver
# --------------------------------------------------------------------------
def run() -> int:
"""Main lint entrypoint. Returns the process exit code.
Exit semantics (see module docstring for full table):
0 — clean (no offending paths-filter on any required workflow),
OR protection unreadable (403/404) — surfaced as ::error::
but treated as non-fatal so token-scope issues don't red-X
every PR.
1 — at least one required workflow carries a paths/paths-ignore
filter — the regression class this lint exists to prevent.
"""
_require_runtime_env()
protection_path = f"/repos/{OWNER}/{NAME}/branch_protections/{BRANCH}"
try:
_, protection = api("GET", protection_path)
except ApiError as e:
msg = str(e)
m = re.search(r"HTTP (\d{3})", msg)
http_status = int(m.group(1)) if m else None
if http_status in (403, 404):
sys.stderr.write(
f"::error::GET {protection_path} returned HTTP {http_status}"
f"DRIFT_BOT_TOKEN lacks repo-admin scope (Gitea 1.22.6 "
f"requires it for this endpoint) OR branch '{BRANCH}' has "
f"no protection configured. Cannot enumerate required "
f"checks; skipping lint with exit 0 to avoid red-X on "
f"every PR. Fix: grant repo-admin to mc-drift-bot.\n"
)
return 0
raise
if not isinstance(protection, dict):
sys.stderr.write(
f"::error::protection response for {BRANCH} not a JSON object\n"
)
return 4
contexts: list[str] = list(protection.get("status_check_contexts") or [])
if not contexts:
print(
f"::notice::branch_protections/{BRANCH} has 0 required "
f"status_check_contexts; nothing to lint. (no required contexts)"
)
return 0
print(f"::notice::Linting {len(contexts)} required context(s) for paths-filter regressions:")
for c in contexts:
print(f" - {c}")
offenders: list[tuple[str, Path, list[str]]] = []
unresolved: list[str] = []
for ctx in contexts:
parsed = parse_context(ctx)
if parsed is None:
print(
f"::warning::could not parse context '{ctx}' "
f"(expected `<workflow> / <job> (<event>)`); skipping"
)
unresolved.append(ctx)
continue
workflow_name, _job, _event = parsed
wf_path = resolve_workflow_file(workflow_name)
if wf_path is None:
print(
f"::warning::no workflow file in {WORKFLOWS_DIR} has "
f"`name: {workflow_name}` (required context '{ctx}'); "
f"skipping paths-filter check. "
f"(orphaned-context detection is ci-required-drift's job.)"
)
unresolved.append(ctx)
continue
findings = detect_paths_filters(wf_path)
if findings:
offenders.append((workflow_name, wf_path, findings))
else:
print(f"::notice::OK {wf_path.name} ({workflow_name}) — no paths filter")
if offenders:
print("")
print(f"::error::Found {len(offenders)} required workflow(s) with paths/paths-ignore filters:")
for workflow_name, wf_path, findings in offenders:
for finding in findings:
# ::error file=... lets Gitea Actions surface a per-file
# annotation in the PR UI (when annotations are wired).
print(
f"::error file={wf_path}::Required workflow "
f"'{workflow_name}' ({wf_path.name}) has a paths "
f"filter that would degrade the merge gate to a "
f"silent indefinite pending: {finding}. "
f"See feedback_path_filtered_workflow_cant_be_required. "
f"Fix: remove the filter and instead gate per-step "
f"inside the job with `if: contains(steps.changed.outputs.files, ...)` "
f"or refactor to a single-job-with-per-step-if shape."
)
return 1
print("")
print(
f"::notice::OK — all {len(contexts) - len(unresolved)} resolvable "
f"required workflow(s) clean (no paths/paths-ignore filters)."
)
if unresolved:
print(
f"::notice::{len(unresolved)} required context(s) were not "
f"resolved to a workflow file (warn-not-fail); see warnings above."
)
return 0
if __name__ == "__main__":
sys.exit(run())
-519
View File
@@ -1,519 +0,0 @@
#!/usr/bin/env python3
"""lint-workflow-yaml — catch Gitea-1.22.6-hostile workflow YAML shapes.
This script enforces six structural rules that have historically caused
silent CI failures on Gitea Actions (1.22.6) — workflows that the server's
YAML parser rejects with `[W] ignore invalid workflow ...` and registers
for zero events, or shape conventions that produce ambiguous status
contexts. Each rule maps to a documented incident in saved memory.
Rules (4 fatal + 1 fatal cross-file + 1 heuristic-warn):
1. `workflow_dispatch.inputs:` block — Gitea 1.22.6 mis-parses the
`inputs` keys as sibling event types and rejects the whole file.
Memory: feedback_gitea_workflow_dispatch_inputs_unsupported.
Origin: 2026-05-11 PyPI freeze (publish-runtime).
2. `on: workflow_run:` event — not enumerated in Gitea 1.22.6's
supported event list (verified via modules/actions/workflows.go
enumeration; task #81). Workflow registers, fires for 0 events.
3. `name:` containing `/` — breaks the
`<workflow> / <job> (<event>)` commit-status context convention;
downstream parsers (sop-tier-check, status-reaper) tokenize on `/`.
4. `name:` collision across files — Gitea routes commit-status updates
by `name` and behavior on collision is undefined (status-reaper
rev1 fail-loud).
5. Cross-repo `uses: org/repo/path@ref` — blocked while
`[actions].DEFAULT_ACTIONS_URL=github` is the server default;
resolves to github.com/<org-suspended>/... and 404s.
Memory: feedback_gitea_cross_repo_uses_blocked. Cross-link: task #109.
6. (HEURISTIC, warn-not-fail) Steps reference `https://api.github.com`
or `https://github.com/.../releases/download` without a
workflow-level `env.GITHUB_SERVER_URL` set to the Gitea instance.
Memory: feedback_act_runner_github_server_url.
7. Production deploy/redeploy workflows may not rely on Gitea
`concurrency.cancel-in-progress: false` for serialization. Gitea
1.22.6 can cancel queued runs despite that setting.
8. Production deploy/redeploy workflows may not dump raw CP responses or
raw `.error` fields into CI logs/summaries.
9. Production deploy/redeploy workflows must expose an operational control:
kill switch for auto deploys or rollback tag for manual deploys.
10. Docker health checks must not run `docker info | head` under pipefail.
`head` closes the pipe early, `docker info` can exit nonzero from
SIGPIPE, and the step can falsely report Docker daemon failure.
Per `feedback_smoke_test_vendor_truth_not_shape_match`: fixtures used to
validate this lint must mirror real Gitea 1.22.6 YAML semantics, not
Python yaml-parser quirks. The test suite at tests/test_lint_workflow_yaml.py
includes a vendor-truth fixture (the exact publish-runtime regression).
Usage:
python3 .gitea/scripts/lint-workflow-yaml.py
Lint every `*.yml` in `.gitea/workflows/`.
python3 .gitea/scripts/lint-workflow-yaml.py --workflow-dir <path>
Lint a custom directory (used by tests/test_lint_workflow_yaml.py).
Exit codes:
0 — clean OR only heuristic-warnings emitted.
1 — at least one fatal rule (1-5) violated.
2 — YAML parse error or argv usage error.
"""
from __future__ import annotations
import argparse
import collections
import glob
import os
import re
import sys
from pathlib import Path
from typing import Any, Iterable
try:
import yaml
except ImportError:
print("::error::PyYAML is required. Install with: pip install PyYAML", file=sys.stderr)
sys.exit(2)
# YAML quirk: bare `on:` at the top level parses to the Python `True`
# (because `on` is a YAML 1.1 boolean alias). Handle both keys.
def _get_on(d: dict) -> Any:
if not isinstance(d, dict):
return None
if "on" in d:
return d["on"]
if True in d:
return d[True]
return None
# ---------------------------------------------------------------------------
# Rule 1 — workflow_dispatch.inputs block (Gitea 1.22.6 parser rejects)
# ---------------------------------------------------------------------------
def check_workflow_dispatch_inputs(filename: str, doc: Any) -> list[str]:
"""Return per-violation error lines if `workflow_dispatch.inputs` is set."""
errors: list[str] = []
on = _get_on(doc)
if not isinstance(on, dict):
return errors
wd = on.get("workflow_dispatch")
if isinstance(wd, dict) and wd.get("inputs"):
errors.append(
f"::error file={filename}::Rule 1 (FATAL): "
f"`on.workflow_dispatch.inputs:` block detected. Gitea 1.22.6 "
f"silently rejects the entire workflow with `[W] ignore invalid "
f"workflow: unknown on type: map[...]`. Drop the `inputs:` block "
f"and derive parameters from tag name / env / external query. "
f"Memory: feedback_gitea_workflow_dispatch_inputs_unsupported."
)
return errors
# ---------------------------------------------------------------------------
# Rule 2 — on: workflow_run (not supported on Gitea 1.22.6)
# ---------------------------------------------------------------------------
def check_workflow_run_event(filename: str, doc: Any) -> list[str]:
"""Return per-violation error lines if `on: workflow_run:` is used."""
errors: list[str] = []
on = _get_on(doc)
if isinstance(on, dict) and "workflow_run" in on:
errors.append(
f"::error file={filename}::Rule 2 (FATAL): `on: workflow_run:` "
f"event used. Gitea 1.22.6 does NOT support `workflow_run` "
f"(verified via modules/actions/workflows.go enumeration; "
f"task #81). Workflow will fire for zero events. Use a "
f"`schedule:` cron OR a `push:` trigger with `paths:` filter "
f"on the upstream workflow file as the cross-workflow gate."
)
elif isinstance(on, list) and "workflow_run" in on:
errors.append(
f"::error file={filename}::Rule 2 (FATAL): `on: workflow_run` "
f"in event list. Not supported on Gitea 1.22.6 — task #81."
)
return errors
# ---------------------------------------------------------------------------
# Rule 3 — name: contains "/" (breaks status-context tokenization)
# ---------------------------------------------------------------------------
def check_name_with_slash(filename: str, doc: Any) -> list[str]:
"""Return per-violation error lines if workflow `name:` contains a slash."""
errors: list[str] = []
if not isinstance(doc, dict):
return errors
name = doc.get("name")
if isinstance(name, str) and "/" in name:
errors.append(
f"::error file={filename}::Rule 3 (FATAL): workflow `name: "
f"{name!r}` contains `/`. The commit-status context convention "
f"is `<workflow> / <job> (<event>)`; embedding `/` in the "
f"workflow name makes downstream parsers (sop-tier-check, "
f"status-reaper) tokenize ambiguously. Rename to use `-` or "
f"` ` instead."
)
return errors
# ---------------------------------------------------------------------------
# Rule 4 — cross-file name collision
# ---------------------------------------------------------------------------
def check_name_collision_across_files(
docs_by_file: dict[str, Any],
) -> list[str]:
"""Return per-collision error lines if two files share the same `name:`."""
errors: list[str] = []
by_name: dict[str, list[str]] = collections.defaultdict(list)
for filename, doc in docs_by_file.items():
if isinstance(doc, dict):
n = doc.get("name")
if isinstance(n, str) and n:
by_name[n].append(filename)
for n, files in sorted(by_name.items()):
if len(files) > 1:
errors.append(
f"::error::Rule 4 (FATAL): workflow `name: {n!r}` collision "
f"across {len(files)} files: {files}. Gitea routes "
f"commit-status updates by `name`; collision yields "
f"undefined behavior. Give each workflow a unique `name:`."
)
return errors
# ---------------------------------------------------------------------------
# Rule 5 — cross-repo `uses: org/repo/path@ref`
# ---------------------------------------------------------------------------
# `uses: <foo>@<ref>` — match the value form Gitea/act actually parse.
# We need to distinguish:
# - `actions/checkout@<sha>` OK (bare org/repo@ref, no subpath)
# - `./.gitea/actions/foo` OK (local path)
# - `docker://image:tag` OK (docker-image form)
# - `molecule-ai/molecule-ci/.gitea/actions/audit-force-merge@main` BAD
USES_CROSS_REPO_RE = re.compile(
r"""^
(?P<owner>[A-Za-z0-9_.\-]+)
/
(?P<repo>[A-Za-z0-9_.\-]+)
/ # mandatory subpath separator => cross-repo composite/reusable
(?P<path>[^@\s]+)
@
(?P<ref>\S+)
$""",
re.VERBOSE,
)
def _iter_uses(doc: Any) -> Iterable[str]:
"""Yield every `uses:` string from job steps in a workflow document."""
if not isinstance(doc, dict):
return
jobs = doc.get("jobs")
if not isinstance(jobs, dict):
return
for job in jobs.values():
if not isinstance(job, dict):
continue
# reusable workflow: `uses:` at the job level
if isinstance(job.get("uses"), str):
yield job["uses"]
steps = job.get("steps")
if not isinstance(steps, list):
continue
for step in steps:
if isinstance(step, dict) and isinstance(step.get("uses"), str):
yield step["uses"]
def _iter_run_blocks(doc: Any) -> Iterable[str]:
"""Yield every shell `run:` block from job steps in a workflow document."""
if not isinstance(doc, dict):
return
jobs = doc.get("jobs")
if not isinstance(jobs, dict):
return
for job in jobs.values():
if not isinstance(job, dict):
continue
steps = job.get("steps")
if not isinstance(steps, list):
continue
for step in steps:
if isinstance(step, dict) and isinstance(step.get("run"), str):
yield step["run"]
def check_cross_repo_uses(filename: str, doc: Any) -> list[str]:
"""Return per-violation error lines for cross-repo `uses:` references."""
errors: list[str] = []
for uses in _iter_uses(doc):
# Skip docker:// and local ./
if uses.startswith(("docker://", "./", "../")):
continue
m = USES_CROSS_REPO_RE.match(uses.strip())
if m:
errors.append(
f"::error file={filename}::Rule 5 (FATAL): cross-repo "
f"`uses: {uses}` detected. Gitea 1.22.6 with "
f"`[actions].DEFAULT_ACTIONS_URL=github` resolves this to "
f"github.com/{m.group('owner')}/{m.group('repo')} which "
f"404s (org suspended 2026-05-06). Inline the shared bash "
f"into `.gitea/scripts/` until task #109 (actions mirror) "
f"ships. Memory: feedback_gitea_cross_repo_uses_blocked."
)
return errors
# ---------------------------------------------------------------------------
# Rule 6 — heuristic: github.com/api refs without workflow-level
# GITHUB_SERVER_URL (WARN-not-FAIL per halt-condition 3)
# ---------------------------------------------------------------------------
# Match `https://api.github.com/...` (API call) — that's the actionable
# pattern. We intentionally do NOT match `https://github.com/.../releases/
# download/...` (jq-release pin) nor `https://github.com/${{ github.repository
# }}` (OCI label) because those are documented benign references on current
# main and would 100% false-positive (3 hits, per Phase 1 audit).
GITHUB_API_REF_RE = re.compile(
r"https://api\.github\.com\b|https://github\.com/api/",
re.IGNORECASE,
)
PROD_CP_URL_RE = re.compile(r"https://api\.moleculesai\.app\b")
REDEPLOY_FLEET_RE = re.compile(r"\b/cp/admin/tenants/redeploy-fleet\b")
RUN_SETS_PIPEFAIL_RE = re.compile(r"(?m)^\s*set\s+-[^\n]*o\s+pipefail\b")
DOCKER_INFO_HEAD_PIPE_RE = re.compile(
r"(?m)^\s*docker\s+info\b[^\n|]*\|\s*head\b"
)
RAW_CP_RESPONSE_RE = re.compile(
r"""(?x)
(?:\bjq\s+\.\s+["']?\$HTTP_RESPONSE["']?)
|
(?:\bcat\s+["']?\$HTTP_RESPONSE["']?)
|
(?:\|\s*\.error\b)
"""
)
def _has_workflow_level_server_url(doc: Any) -> bool:
if not isinstance(doc, dict):
return False
env = doc.get("env")
if isinstance(env, dict) and "GITHUB_SERVER_URL" in env:
return True
return False
def check_github_server_url_missing(filename: str, doc: Any, raw: str) -> list[str]:
"""Return warn-lines (NOT errors) if api.github.com is referenced without
workflow-level GITHUB_SERVER_URL. Heuristic — false-positives possible.
"""
warns: list[str] = []
if not GITHUB_API_REF_RE.search(raw):
return warns
if _has_workflow_level_server_url(doc):
return warns
warns.append(
f"::warning file={filename}::Rule 6 (WARN, heuristic): file "
f"references `https://api.github.com` without a workflow-level "
f"`env.GITHUB_SERVER_URL: https://git.moleculesai.app`. The "
f"act_runner default for `${{{{ github.server_url }}}}` is "
f"github.com, which can break actions that auth-condition on "
f"server_url (e.g. actions/setup-go). If this curl is "
f"intentionally hitting GitHub (e.g. public release pin), ignore. "
f"Memory: feedback_act_runner_github_server_url."
)
return warns
# ---------------------------------------------------------------------------
# Rule 7-9 — production CI/CD hardening rules
# ---------------------------------------------------------------------------
def _is_production_redeploy_workflow(raw: str) -> bool:
"""Heuristic production-side-effect detector.
We intentionally key on the production CP host plus the redeploy-fleet
endpoint. Staging workflows call the same endpoint on staging-api and are
governed by looser staging verification policy.
"""
return bool(PROD_CP_URL_RE.search(raw) and REDEPLOY_FLEET_RE.search(raw))
def _iter_concurrency_blocks(doc: Any) -> Iterable[dict[str, Any]]:
if not isinstance(doc, dict):
return
top = doc.get("concurrency")
if isinstance(top, dict):
yield top
jobs = doc.get("jobs")
if not isinstance(jobs, dict):
return
for job in jobs.values():
if isinstance(job, dict) and isinstance(job.get("concurrency"), dict):
yield job["concurrency"]
def check_production_concurrency(filename: str, doc: Any, raw: str) -> list[str]:
errors: list[str] = []
if not _is_production_redeploy_workflow(raw):
return errors
for block in _iter_concurrency_blocks(doc):
if block.get("cancel-in-progress") is False:
errors.append(
f"::error file={filename}::Rule 7 (FATAL): production deploy "
f"workflow uses `concurrency.cancel-in-progress: false`. "
f"Gitea 1.22.6 can cancel queued runs despite that setting, "
f"so this is not a safe production serialization primitive. "
f"Use an external queue/lock or make the deploy idempotent."
)
return errors
def check_production_raw_response_logging(filename: str, raw: str) -> list[str]:
errors: list[str] = []
if not _is_production_redeploy_workflow(raw):
return errors
if RAW_CP_RESPONSE_RE.search(raw):
errors.append(
f"::error file={filename}::Rule 8 (FATAL): production deploy "
f"workflow appears to print a raw production CP response or raw "
f"`.error` field. CI logs are persistent and broad-read. Redact "
f"runtime/SSM error details; print counts, booleans, status "
f"codes, and links to restricted observability instead."
)
return errors
def check_production_operational_control(filename: str, raw: str) -> list[str]:
errors: list[str] = []
if not _is_production_redeploy_workflow(raw):
return errors
has_kill_switch = "PROD_AUTO_DEPLOY_DISABLED" in raw
has_rollback = "PROD_MANUAL_REDEPLOY_TARGET_TAG" in raw
if not (has_kill_switch or has_rollback):
errors.append(
f"::error file={filename}::Rule 9 (FATAL): production deploy "
f"workflow calls redeploy-fleet without an operational control. "
f"Auto deploys need a `PROD_AUTO_DEPLOY_DISABLED` kill switch; "
f"manual deploys need a `PROD_MANUAL_REDEPLOY_TARGET_TAG` "
f"rollback/pin path."
)
return errors
# ---------------------------------------------------------------------------
# Rule 10 — docker info piped to head under pipefail
# ---------------------------------------------------------------------------
def check_docker_info_head_pipefail(filename: str, doc: Any) -> list[str]:
errors: list[str] = []
for run_block in _iter_run_blocks(doc):
if not (
RUN_SETS_PIPEFAIL_RE.search(run_block)
and DOCKER_INFO_HEAD_PIPE_RE.search(run_block)
):
continue
errors.append(
f"::error file={filename}::Rule 10 (FATAL): workflow runs "
f"`docker info | head` after enabling `pipefail`. `head` can "
f"close the pipe early, making `docker info` exit nonzero and "
f"falsely fail the Docker daemon health check. Capture "
f"`docker_info=\"$(docker info 2>&1)\"` first, then print a "
f"bounded preview with `printf ... | sed -n '1,5p'`."
)
break
return errors
# ---------------------------------------------------------------------------
# Driver
# ---------------------------------------------------------------------------
def main(argv: list[str] | None = None) -> int:
p = argparse.ArgumentParser(
description="Lint Gitea Actions workflow YAML for 1.22.6-hostile shapes."
)
p.add_argument(
"--workflow-dir",
default=".gitea/workflows",
help="Directory of workflow *.yml files (default: .gitea/workflows).",
)
args = p.parse_args(argv)
wf_dir = Path(args.workflow_dir)
if not wf_dir.exists():
# Empty / missing dir = nothing to lint, not a failure.
print(f"::notice::No workflow directory at {wf_dir}; skipping.")
return 0
yml_paths = sorted(
glob.glob(str(wf_dir / "*.yml")) + glob.glob(str(wf_dir / "*.yaml"))
)
if not yml_paths:
print(f"::notice::No workflow files in {wf_dir}; nothing to lint.")
return 0
fatal_errors: list[str] = []
warnings: list[str] = []
docs_by_file: dict[str, Any] = {}
for path in yml_paths:
rel = os.path.relpath(path)
try:
raw = Path(path).read_text()
doc = yaml.safe_load(raw)
except yaml.YAMLError as e:
fatal_errors.append(
f"::error file={rel}::YAML parse error: {e}. Cannot lint "
f"a file the parser rejects."
)
continue
docs_by_file[rel] = doc
# Per-file checks
fatal_errors.extend(check_workflow_dispatch_inputs(rel, doc))
fatal_errors.extend(check_workflow_run_event(rel, doc))
fatal_errors.extend(check_name_with_slash(rel, doc))
fatal_errors.extend(check_cross_repo_uses(rel, doc))
fatal_errors.extend(check_production_concurrency(rel, doc, raw))
fatal_errors.extend(check_production_raw_response_logging(rel, raw))
fatal_errors.extend(check_production_operational_control(rel, raw))
fatal_errors.extend(check_docker_info_head_pipefail(rel, doc))
warnings.extend(check_github_server_url_missing(rel, doc, raw))
# Cross-file checks
fatal_errors.extend(check_name_collision_across_files(docs_by_file))
# Emit warnings first (non-blocking)
for w in warnings:
print(w)
if not fatal_errors:
n = len(yml_paths)
print(
f"::notice::lint-workflow-yaml: {n} workflow file(s) checked, "
f"no fatal Gitea-1.22.6-hostile shapes. "
f"({len(warnings)} heuristic warning(s) emitted.)"
)
return 0
# Emit fatal errors
print(
f"::error::lint-workflow-yaml: {len(fatal_errors)} fatal violation(s) "
f"across {len(yml_paths)} workflow file(s). See rule documentation "
f"in .gitea/scripts/lint-workflow-yaml.py docstring."
)
for e in fatal_errors:
print(e)
return 1
if __name__ == "__main__":
sys.exit(main())
@@ -1,509 +0,0 @@
#!/usr/bin/env python3
"""lint_bp_context_emit_match — Tier 2f per internal#350.
Rule
----
For a given protected branch, every context in
`branch_protections/<branch>.status_check_contexts` MUST be emitted
by at least one workflow in `.gitea/workflows/*.yml`. Two contexts
match when:
1. The workflow's `name:` equals the context's workflow-part (the
prefix before ` / `).
2. Some job in that workflow has a `name:` (or default-fallback
job-key) equal to the context's job-part (between ` / ` and
` (`).
3. The workflow's `on:` block includes the context's event-part
(in parens at the end), with Gitea's event-name mapping:
- `pull_request` and `pull_request_target` BOTH emit
`(pull_request)` contexts (verified empirically on
molecule-core/main).
- `push` emits `(push)`.
A BP context with no emitter blocks merges forever — Gitea treats
absent-as-`pending`, NOT absent-as-`skipped`-as-`success`. This is
the phantom-required-check class
(`feedback_phantom_required_check_after_gitea_migration`).
The inverse direction (emitter without BP context) is INFORMATIONAL
only — Tier 2g handles that direction at PR-time. Flagging it here
on a daily schedule would falsely surface every transitional state
during a BP rollout.
How the gate works
------------------
Daily scheduled run + workflow_dispatch:
1. GET `branch_protections/{BRANCH}` (needs DRIFT_BOT_TOKEN with
repo-admin scope; same persona as ci-required-drift.yml).
Graceful-degrade on 403/404 per Tier 2a contract.
2. Walk `.gitea/workflows/*.yml` via PyYAML AST. For each workflow,
enumerate its emitted contexts: `{workflow.name} / {job.name or
job-key} ({event})` for each event in `on:` that emits a status.
3. For each BP context, look for an emitter match. Aggregate
orphans.
4. If orphans exist:
- File or PATCH a `[ci-bp-drift]` issue (idempotency contract:
search for exact title prefix, edit existing if open).
- Apply labels `tier:high` + `ci-bp-drift` (lookup IDs per
repo; per `feedback_tier_label_ids_are_per_repo`).
- Exit 1.
5. If no orphans:
- Close any existing `[ci-bp-drift]` issue with a clean-state
comment.
- Exit 0.
Exit codes
----------
0 — clean OR API 403/404 (graceful-degrade, surfaces ::error::).
1 — at least one BP context has no emitter.
2 — env contract violation, workflows-dir missing, or YAML parse
error.
Env
---
GITEA_TOKEN — DRIFT_BOT_TOKEN (repo-admin for branch_protections)
GITEA_HOST — e.g. git.moleculesai.app
REPO — owner/name
BRANCH — defaults to `main`
WORKFLOWS_DIR — defaults to `.gitea/workflows`
DRIFT_LABEL — defaults to `ci-bp-drift`
Memory cross-links
------------------
- internal#350 (the RFC that specs this lint)
- feedback_phantom_required_check_after_gitea_migration
- feedback_tier_label_ids_are_per_repo
- reference_post_suspension_pipeline
"""
from __future__ import annotations
import json
import os
import re
import sys
import urllib.error
import urllib.parse
import urllib.request
from pathlib import Path
from typing import Any
try:
import yaml
except ImportError:
sys.stderr.write(
"::error::PyYAML is required. Install with: pip install PyYAML\n"
)
sys.exit(2)
# Status-check context regex (mirrors lint-required-no-paths.py).
_CONTEXT_RE = re.compile(
r"^(?P<workflow>.+?) / (?P<job>.+) \((?P<event>[^)]+)\)$"
)
# Map a workflow `on:` event-key to the context's event-part. Gitea's
# emitter convention (verified on molecule-core):
# - pull_request → `(pull_request)`
# - pull_request_target → `(pull_request)` (same surface)
# - push → `(push)`
# - schedule → no PR status; scheduled runs don't post
# commit-statuses unless the workflow itself does so explicitly.
# - workflow_dispatch → manually dispatched runs may or may not
# emit; safest to treat as "no PR status" (informational notice
# only).
_EVENT_MAP = {
"pull_request": "pull_request",
"pull_request_target": "pull_request",
"push": "push",
}
# ---------------------------------------------------------------------------
# Env
# ---------------------------------------------------------------------------
def _env(key: str, default: str | None = None) -> str:
v = os.environ.get(key, default)
return v if v is not None else ""
def _require_env(key: str) -> str:
v = os.environ.get(key)
if not v:
sys.stderr.write(f"::error::missing required env var: {key}\n")
sys.exit(2)
return v
# ---------------------------------------------------------------------------
# API helper. Mirrors lint-required-no-paths.py's contract: returns
# (status, payload) tuple with status ∈ {"ok", "not_found", "forbidden",
# "error"}.
# ---------------------------------------------------------------------------
def api(
method: str,
path: str,
*,
body: dict | None = None,
query: dict[str, str] | None = None,
) -> tuple[str, Any]:
host = _env("GITEA_HOST")
token = _env("GITEA_TOKEN")
url = f"https://{host}/api/v1{path}"
if query:
url = f"{url}?{urllib.parse.urlencode(query)}"
data = None
headers = {
"Authorization": f"token {token}",
"Accept": "application/json",
}
if body is not None:
data = json.dumps(body).encode("utf-8")
headers["Content-Type"] = "application/json"
req = urllib.request.Request(
url, method=method, data=data, headers=headers
)
try:
with urllib.request.urlopen(req, timeout=30) as resp:
raw = resp.read()
if not raw:
return ("ok", None)
return ("ok", json.loads(raw))
except urllib.error.HTTPError as e:
if e.code == 404:
return ("not_found", None)
if e.code in (401, 403):
return ("forbidden", None)
return ("error", None)
except (urllib.error.URLError, TimeoutError, json.JSONDecodeError):
return ("error", None)
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def _get_on(d: Any) -> Any:
"""YAML 1.1 boolean quirk: bare `on:` may parse to True. Handle both."""
if not isinstance(d, dict):
return None
if "on" in d:
return d["on"]
if True in d:
return d[True]
return None
def _on_events(doc: Any) -> set[str]:
"""Return the set of event keys in a workflow's `on:` block.
Accepts all three shapes (string / list / mapping). String/list
shapes can't carry filters but they DO emit. Returns the
Gitea-mapped event names per `_EVENT_MAP`.
"""
on = _get_on(doc)
raw_events: set[str] = set()
if on is None:
return raw_events
if isinstance(on, str):
raw_events.add(on)
elif isinstance(on, list):
for e in on:
if isinstance(e, str):
raw_events.add(e)
elif isinstance(on, dict):
for k in on:
if isinstance(k, str):
raw_events.add(k)
return {_EVENT_MAP[e] for e in raw_events if e in _EVENT_MAP}
def _job_display(jbody: dict, jkey: str) -> str:
"""Return job's `name:` if set, else fall back to the job-key.
Gitea formats status contexts with the job's `name:` when set;
when unset it uses the job key. Matches lint-required-no-paths
convention.
"""
n = jbody.get("name") if isinstance(jbody, dict) else None
if isinstance(n, str) and n:
return n
return jkey
def workflow_contexts(doc: Any) -> set[str]:
"""Return the set of contexts a workflow emits."""
contexts: set[str] = set()
if not isinstance(doc, dict):
return contexts
wf_name = doc.get("name")
if not isinstance(wf_name, str) or not wf_name:
return contexts # no name => no addressable context
events = _on_events(doc)
if not events:
return contexts
jobs = doc.get("jobs")
if not isinstance(jobs, dict):
return contexts
for jkey, jbody in jobs.items():
if jkey == "__lines__": # tolerate line-tracking annotations
continue
if not isinstance(jbody, dict):
continue
disp = _job_display(jbody, jkey)
for ev in events:
contexts.add(f"{wf_name} / {disp} ({ev})")
return contexts
def parse_context(ctx: str) -> tuple[str, str, str] | None:
m = _CONTEXT_RE.match(ctx)
if not m:
return None
return (m.group("workflow"), m.group("job"), m.group("event"))
def _iter_workflow_files(wf_dir: Path) -> list[Path]:
return sorted(list(wf_dir.glob("*.yml")) + list(wf_dir.glob("*.yaml")))
# ---------------------------------------------------------------------------
# Issue idempotency — search for an open issue with the canonical
# title prefix; PATCH if found, POST if not. Mirrors ci-required-drift.
# ---------------------------------------------------------------------------
def _canonical_title(repo: str, branch: str) -> str:
return f"[ci-bp-drift] {repo}/{branch}: BP→emitter mismatch"
def _ensure_labels(repo: str, names: list[str]) -> list[int]:
status, labels = api("GET", f"/repos/{repo}/labels", query={"limit": "50"})
if status != "ok" or not isinstance(labels, list):
return []
out: list[int] = []
by_name = {l["name"]: l["id"] for l in labels if isinstance(l, dict)}
for n in names:
if n in by_name:
out.append(by_name[n])
return out
def file_or_update_issue(
repo: str, branch: str, orphans: list[str], emitter_orphans: list[str]
) -> None:
title = _canonical_title(repo, branch)
body_lines = [
f"BP→emitter drift detected on `{branch}` at "
f"{os.environ.get('GITHUB_RUN_URL', '(run url unavailable)')}.",
"",
f"## Orphan BP contexts ({len(orphans)})",
"",
"These contexts are required by branch protection but NO workflow "
"emits them. PRs merging into this branch will wait forever for a "
"status that never arrives (Gitea treats absent-as-`pending`, NOT "
"absent-as-`skipped`). See "
"`feedback_phantom_required_check_after_gitea_migration`.",
"",
]
for o in orphans:
body_lines.append(f"- `{o}`")
if emitter_orphans:
body_lines += [
"",
f"## Workflows emitting contexts NOT in BP ({len(emitter_orphans)})",
"",
"Informational — Tier 2g handles this direction at PR-time. "
"Listed here for completeness.",
"",
]
for o in emitter_orphans:
body_lines.append(f"- `{o}`")
body_lines += [
"",
"Fix options:",
" 1. PATCH `branch_protections/{branch}.status_check_contexts` "
" to remove the orphan.",
" 2. Restore the emitting workflow (if it was deleted/renamed).",
"",
"Linted by `.gitea/workflows/lint-bp-context-emit-match.yml` "
"(Tier 2f, internal#350).",
]
body = "\n".join(body_lines)
# Idempotency search — find an open issue with the canonical title.
status, hits = api(
"GET",
f"/repos/{repo}/issues",
query={
"type": "issues",
"state": "open",
"q": title,
},
)
existing = None
if status == "ok" and isinstance(hits, list):
for h in hits:
if (
isinstance(h, dict)
and h.get("state") == "open"
and isinstance(h.get("title"), str)
and h["title"].startswith(title)
):
existing = h
break
label_ids = _ensure_labels(repo, ["ci-bp-drift", "tier:high"])
if existing:
api(
"PATCH",
f"/repos/{repo}/issues/{existing['number']}",
body={"body": body, "labels": label_ids} if label_ids else {"body": body},
)
print(
f"::notice::Updated existing drift issue "
f"#{existing['number']}: {existing.get('html_url', '')}"
)
else:
status, posted = api(
"POST",
f"/repos/{repo}/issues",
body={"title": title, "body": body, "labels": label_ids},
)
if status == "ok" and isinstance(posted, dict):
print(
f"::notice::Filed new drift issue "
f"#{posted.get('number')}: {posted.get('html_url', '')}"
)
# ---------------------------------------------------------------------------
# Driver
# ---------------------------------------------------------------------------
def run() -> int:
_require_env("GITEA_TOKEN")
_require_env("GITEA_HOST")
repo = _require_env("REPO")
branch = _env("BRANCH", "main")
wf_dir = Path(_env("WORKFLOWS_DIR", ".gitea/workflows"))
if not wf_dir.is_dir():
sys.stderr.write(f"::error::workflows directory not found: {wf_dir}\n")
return 2
# 1. Pull BP.
status, bp = api("GET", f"/repos/{repo}/branch_protections/{branch}")
if status == "forbidden":
sys.stderr.write(
f"::error::GET branch_protections/{branch} returned HTTP 403 — "
f"DRIFT_BOT_TOKEN lacks repo-admin scope (Gitea 1.22.6 requires "
f"it for this endpoint). Skipping lint with exit 0 to avoid "
f"red-X on every run. Fix: grant repo-admin to mc-drift-bot. "
f"Per Tier 2a contract.\n"
)
return 0
if status == "not_found":
print(
f"::notice::branch '{branch}' has no protection configured; "
f"nothing to lint."
)
return 0
if status != "ok" or not isinstance(bp, dict):
sys.stderr.write(
f"::error::branch_protections/{branch} response unexpected; "
f"status={status}. Treating as transient; exit 0.\n"
)
return 0
bp_contexts: list[str] = list(bp.get("status_check_contexts") or [])
if not bp_contexts:
print(
f"::notice::branch_protections/{branch} has 0 required "
f"status_check_contexts; nothing to lint."
)
return 0
# 2. Enumerate emitter contexts from all workflows.
all_emitter: set[str] = set()
for path in _iter_workflow_files(wf_dir):
try:
doc = yaml.safe_load(path.read_text(encoding="utf-8"))
except yaml.YAMLError as e:
sys.stderr.write(
f"::error file={path}::YAML parse error: {e}; skipping.\n"
)
continue
all_emitter |= workflow_contexts(doc)
print(
f"::notice::Linting {len(bp_contexts)} BP context(s) for {branch} "
f"against {len(all_emitter)} workflow-emitted context(s)."
)
bp_set = set(bp_contexts)
# 3. Find orphans (BP-side: required but no emitter).
bp_orphans = sorted(bp_set - all_emitter)
# Informational: workflow emits but BP doesn't list. Tier 2g
# territory at PR-time. We list these as NOTICE only.
emitter_orphans = sorted(all_emitter - bp_set)
if bp_orphans:
print(
f"::error::Found {len(bp_orphans)} BP context(s) with no "
f"emitter — these would block merges forever (Gitea treats "
f"absent-as-pending, not skipped):"
)
for o in bp_orphans:
# Closest-match hint: name a workflow whose name-part is a
# near-match (lev-1 typo, or same workflow with a different
# event).
parsed = parse_context(o)
hint = ""
if parsed:
wf, _job, _ev = parsed
candidates = sorted(
{c for c in all_emitter if c.startswith(wf + " / ")}
)
if candidates:
hint = (
f" — closest emitter(s): {', '.join(candidates[:3])}"
)
print(f"::error:: - {o}{hint}")
if emitter_orphans:
print(
f"::notice::Also: {len(emitter_orphans)} workflow-emitted "
f"context(s) not in BP (informational; Tier 2g handles at "
f"PR-time):"
)
for o in emitter_orphans:
print(f"::notice:: - {o}")
# File / patch tracking issue.
try:
file_or_update_issue(repo, branch, bp_orphans, emitter_orphans)
except Exception as e:
sys.stderr.write(
f"::error::failed to file drift issue: {e}\n"
)
return 1
if emitter_orphans:
print(
f"::notice::{len(emitter_orphans)} workflow-emitted context(s) "
f"not in BP (informational; Tier 2g handles at PR-time):"
)
for o in emitter_orphans:
print(f"::notice:: - {o}")
print(
f"::notice::BP/emitter match clean: all {len(bp_contexts)} required "
f"context(s) have an emitter."
)
return 0
if __name__ == "__main__":
sys.exit(run())
@@ -1,438 +0,0 @@
#!/usr/bin/env python3
"""lint_continue_on_error_tracking — Tier 2e per internal#350.
Rule
----
Every `continue-on-error: true` directive in `.gitea/workflows/*.yml`
must be accompanied by a tracker reference comment within 2 lines
(above OR below the directive's line). The reference is one of:
* `# mc#NNNN` — molecule-core issue
* `# internal#NNNN` — molecule-ai/internal issue
The referenced issue must satisfy ALL of:
1. Exists (HTTP 200 on `/repos/{owner}/{name}/issues/{num}`)
2. `state == "open"`
3. `created_at` is ≤ MAX_AGE_DAYS days ago (default 14)
A passing reference establishes an audit trail and a forced renewal
cadence — after 14 days the issue must either be CLOSED (the masked
defect was fixed) or the comment must point at a NEW tracker
(deliberate decision to keep masking, requires a paper-trail).
The class this prevents
-----------------------
Phase-3-masked failures. `continue-on-error: true` on `platform-build`
had been hiding mc#664-class regressions for ~3 weeks before #656
surfaced them on 2026-05-12. A 14-day cap forces a tracker review
cycle and surfaces mask-drift within at most 14 days of the original
defect.
Behaviour-based gate
--------------------
We parse via PyYAML AST (per `feedback_behavior_based_ast_gates`) to
detect `continue-on-error: <truthy>` at job-key level, then map each
location back to its source line via PyYAML's line-tracking loader.
Comments are scanned from the raw text within a 2-line window of
that source line. Reformatting (block-scalar vs flow-style) does not
break the rule because the source-line anchor is the directive's
own line.
Exit codes
----------
0 — every `continue-on-error: true` has a passing tracker, OR
the issue-API endpoint returned 403/404 (token-scope; graceful
degrade per Tier 2a contract — surface via ::error:: on stderr
but don't red-X every PR over auth).
1 — at least one violation (missing/closed/too-old/non-existent
tracker).
2 — env contract violation, YAML parse error, or workflows-dir
missing.
Env
---
GITEA_TOKEN — read scope on the configured repos.
Auto-injected `GITHUB_TOKEN` works for same-repo
issue reads; for `internal#NNN` we need a token
with `molecule-ai/internal` read scope. Use
DRIFT_BOT_TOKEN (same persona as other Tier 2
lints).
GITEA_HOST — e.g. git.moleculesai.app
REPO — `owner/name` for `mc#NNNN` lookups
INTERNAL_REPO — `owner/name` for `internal#NNNN` lookups
(defaults to derived `molecule-ai/internal`)
WORKFLOWS_DIR — defaults to `.gitea/workflows`
MAX_AGE_DAYS — defaults to 14
Memory cross-links
------------------
- internal#350 (the RFC that specs this lint)
- mc#664 (the masked-3-weeks empirical case)
- feedback_chained_defects_in_never_tested_workflows
- feedback_behavior_based_ast_gates
- feedback_strict_root_only_after_class_a
"""
from __future__ import annotations
import json
import os
import re
import sys
import urllib.error
import urllib.parse
import urllib.request
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Any
try:
import yaml
except ImportError:
sys.stderr.write(
"::error::PyYAML is required. Install with: pip install PyYAML\n"
)
sys.exit(2)
# ---------------------------------------------------------------------------
# Tracker comment regex.
# Matches: `# mc#1234`, `# internal#42`, `# mc#1234 - description`
# Also matches trackers embedded mid-sentence: `# see mc#1234 for details`
# Does NOT match: `# mc1234` (missing inner #), `mc#1234` (no leading
# comment `#`), `# MC#1234` (case-sensitive). The search is line-wide,
# not just at the comment-marker prefix — fixes false-negative when
# the tracker appears mid-sentence (e.g. `internal#350` after prose).
TRACKER_RE = re.compile(
r"(?P<slug>mc|internal)#(?P<num>\d+)\b"
)
# Truthy continue-on-error values we treat as "true". PyYAML decodes
# `continue-on-error: true` to Python `True`. `continue-on-error: "true"`
# decodes to the string "true" — Gitea's evaluator coerces strings,
# so we treat string-`"true"` (case-insensitive) as truthy too.
def _is_truthy_coe(v: Any) -> bool:
if v is True:
return True
if isinstance(v, str) and v.strip().lower() == "true":
return True
return False
# ---------------------------------------------------------------------------
# Env contract
# ---------------------------------------------------------------------------
def _env(key: str, default: str | None = None) -> str:
v = os.environ.get(key, default)
return v if v is not None else ""
def _require_env(key: str) -> str:
v = os.environ.get(key)
if not v:
sys.stderr.write(f"::error::missing required env var: {key}\n")
sys.exit(2)
return v
# ---------------------------------------------------------------------------
# PyYAML line-tracking loader. yaml.SafeLoader nodes carry
# `start_mark.line` (0-based); using construct_mapping with `deep=True`
# preserves that on every node. We need the line of each
# `continue-on-error` key so we can scan the source for comments
# near it.
# ---------------------------------------------------------------------------
class _LineLoader(yaml.SafeLoader):
"""SafeLoader that annotates every dict with `__line__: {key: line}`."""
def _construct_mapping(loader: yaml.SafeLoader, node: yaml.MappingNode) -> dict:
mapping = loader.construct_mapping(node, deep=True)
# Annotate per-key source lines so we can locate `continue-on-error`.
lines: dict[str, int] = {}
for k_node, _v_node in node.value:
try:
key = loader.construct_object(k_node, deep=True)
except Exception:
continue
if isinstance(key, (str, int, bool)):
lines[str(key)] = k_node.start_mark.line + 1 # 1-based
if isinstance(mapping, dict):
mapping["__lines__"] = lines
return mapping
_LineLoader.add_constructor(
yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, _construct_mapping
)
# ---------------------------------------------------------------------------
# Issue lookup
# ---------------------------------------------------------------------------
def fetch_issue(slug_kind: str, num: int) -> tuple[str, dict | None]:
"""Return `(status, payload_or_none)`.
status ∈ {"ok", "not_found", "forbidden", "error"}.
"""
repo = (
_env("REPO") if slug_kind == "mc" else _env("INTERNAL_REPO")
)
if not repo:
# Fall through gracefully — caller treats as 403 (token-scope).
return ("forbidden", None)
host = _env("GITEA_HOST")
token = _env("GITEA_TOKEN")
url = f"https://{host}/api/v1/repos/{repo}/issues/{num}"
req = urllib.request.Request(
url,
headers={
"Authorization": f"token {token}",
"Accept": "application/json",
},
)
try:
with urllib.request.urlopen(req, timeout=20) as resp:
return ("ok", json.loads(resp.read()))
except urllib.error.HTTPError as e:
if e.code == 404:
return ("not_found", None)
if e.code in (401, 403):
return ("forbidden", None)
return ("error", None)
except (urllib.error.URLError, TimeoutError, json.JSONDecodeError):
return ("error", None)
# ---------------------------------------------------------------------------
# Locate every continue-on-error: <truthy> in a workflow doc, with line.
# ---------------------------------------------------------------------------
def find_coe_truthies(
doc: Any, raw_lines: list[str]
) -> list[tuple[str, int]]:
"""Return list of (job_key, source_line_1based).
`doc` is the LineLoader-parsed mapping. We descend `jobs.<key>` and
return only those whose value is truthy per `_is_truthy_coe`.
Job-step continue-on-error is intentionally NOT considered: it
suppresses step-level failure rollup only, not job-level. The
masking class this lint targets is the job-level rollup.
"""
out: list[tuple[str, int]] = []
if not isinstance(doc, dict):
return out
jobs = doc.get("jobs")
if not isinstance(jobs, dict):
return out
for jkey, jbody in jobs.items():
if jkey == "__lines__":
continue
if not isinstance(jbody, dict):
continue
if "continue-on-error" not in jbody:
continue
v = jbody["continue-on-error"]
if not _is_truthy_coe(v):
continue
line = jbody.get("__lines__", {}).get("continue-on-error")
if not line:
# PyYAML line-tracking shouldn't miss but guard for safety.
# Fall back to grepping the raw text.
line = _grep_first_coe_line(raw_lines, jkey) or 1
out.append((str(jkey), int(line)))
return out
def _grep_first_coe_line(raw_lines: list[str], jkey: str) -> int | None:
"""Fallback: find the first `continue-on-error:` line after a `jkey:` line."""
saw_job = False
for i, line in enumerate(raw_lines, start=1):
if re.match(rf"^\s*{re.escape(jkey)}\s*:", line):
saw_job = True
continue
if saw_job and "continue-on-error" in line:
return i
return None
# ---------------------------------------------------------------------------
# Scan window for tracker comment
# ---------------------------------------------------------------------------
WINDOW = 2 # lines above OR below the directive's line (inclusive)
def find_tracker_in_window(
raw_lines: list[str], line_1based: int
) -> tuple[str, int] | None:
"""Return (slug, num) if a `# mc#NNN`/`# internal#NNN` appears
in raw_lines within ±WINDOW lines of `line_1based`. None otherwise.
We scan the directive's own line (it may carry an inline comment
like `continue-on-error: true # mc#3`) plus ±WINDOW.
"""
lo = max(1, line_1based - WINDOW)
hi = min(len(raw_lines), line_1based + WINDOW)
for i in range(lo, hi + 1):
line = raw_lines[i - 1]
# Only the comment portion (after `#`) is considered, so
# trailing-inline comments on the directive line are matched.
m = TRACKER_RE.search(line)
if m:
return (m.group("slug"), int(m.group("num")))
return None
# ---------------------------------------------------------------------------
# Tracker validation
# ---------------------------------------------------------------------------
def validate_tracker(
slug: str, num: int, max_age_days: int
) -> tuple[bool, str]:
"""Return (ok?, reason). On 403, ok=True is returned with reason
explaining graceful-degrade — caller treats 403 as a non-fatal
skip (same as Tier 2a contract).
"""
status, payload = fetch_issue(slug, num)
if status == "forbidden":
sys.stderr.write(
f"::error::issue {slug}#{num} unreadable (HTTP 403 — token "
f"scope). Cannot validate; skipping this check to avoid "
f"red-X on every PR. Fix the token, not the lint.\n"
)
return (True, "forbidden — skipped")
if status == "not_found":
return (False, f"{slug}#{num} does not exist (404)")
if status == "error":
sys.stderr.write(
f"::error::issue {slug}#{num} fetch errored — treating as "
f"unverified, skipping this check.\n"
)
return (True, "fetch-error — skipped")
assert payload is not None
state = payload.get("state", "")
if state != "open":
return (False, f"{slug}#{num} state={state!r} (must be open)")
created = payload.get("created_at", "")
try:
# Gitea returns ISO-8601 with timezone; Python 3.11+
# fromisoformat handles `Z` suffix natively from 3.11. Older
# runtimes need explicit replace.
created_dt = datetime.fromisoformat(created.replace("Z", "+00:00"))
except ValueError:
return (False, f"{slug}#{num} created_at unparseable: {created!r}")
age = datetime.now(timezone.utc) - created_dt
# Inclusive boundary at MAX_AGE_DAYS: `age.days` truncates to a
# whole-day floor, so an issue created 14d 0h 5m ago has
# `age.days == 14` and passes; one created 15d 0h 0m ago has
# `age.days == 15` and fails. This is the convention specified
# in internal#350 ("≤14 days old").
if age.days > max_age_days:
return (
False,
f"{slug}#{num} is {age.days} days old (>{max_age_days}d cap). "
f"Close-or-renew the tracker.",
)
return (True, f"{slug}#{num} open, {age.days}d old, ≤{max_age_days}d")
# ---------------------------------------------------------------------------
# Driver
# ---------------------------------------------------------------------------
def _iter_workflow_files(wf_dir: Path) -> list[Path]:
return sorted(list(wf_dir.glob("*.yml")) + list(wf_dir.glob("*.yaml")))
def run() -> int:
wf_dir = Path(_env("WORKFLOWS_DIR", ".gitea/workflows"))
max_age = int(_env("MAX_AGE_DAYS", "14"))
# Defaults for INTERNAL_REPO when unset (best-effort guess based on
# the convention `mc#` = same repo, `internal#` = molecule-ai/internal).
if not os.environ.get("INTERNAL_REPO"):
os.environ["INTERNAL_REPO"] = "molecule-ai/internal"
if not wf_dir.is_dir():
sys.stderr.write(
f"::error::workflows directory not found: {wf_dir}\n"
)
return 2
yml_files = _iter_workflow_files(wf_dir)
if not yml_files:
print(f"::notice::no workflow files under {wf_dir}; nothing to lint.")
return 0
violations: list[str] = []
notices: list[str] = []
total_coe_true = 0
for path in yml_files:
raw = path.read_text(encoding="utf-8")
raw_lines = raw.splitlines()
try:
doc = yaml.load(raw, Loader=_LineLoader)
except yaml.YAMLError as e:
sys.stderr.write(
f"::error file={path}::YAML parse error: {e}. Skipping "
f"this file (lint-workflow-yaml will catch separately).\n"
)
continue
coe_locs = find_coe_truthies(doc, raw_lines)
for jkey, line in coe_locs:
total_coe_true += 1
tracker = find_tracker_in_window(raw_lines, line)
if tracker is None:
violations.append(
f"::error file={path},line={line}::lint-continue-on-error-"
f"tracking (Tier 2e): job '{jkey}' has "
f"`continue-on-error: true` at line {line} with no "
f"`# mc#NNNN` or `# internal#NNNN` tracker comment "
f"within {WINDOW} lines. Add a tracker reference so "
f"this mask has a forced 14-day renewal cycle. "
f"Memory: feedback_chained_defects_in_never_tested_workflows."
)
continue
slug, num = tracker
ok, reason = validate_tracker(slug, num, max_age)
if ok:
notices.append(
f"::notice::{path.name} job '{jkey}' (line {line}): "
f"{reason}"
)
else:
violations.append(
f"::error file={path},line={line}::lint-continue-on-error-"
f"tracking (Tier 2e): job '{jkey}' "
f"`continue-on-error: true` references {slug}#{num}, "
f"but {reason}. FIX: close/fix the underlying defect "
f"and flip continue-on-error: false, OR file a fresh "
f"tracker and update the comment."
)
for n in notices:
print(n)
if violations:
print(
f"::error::lint-continue-on-error-tracking: "
f"{len(violations)} violation(s) across {len(yml_files)} "
f"workflow file(s) (of {total_coe_true} `continue-on-error: "
f"true` directives in total)."
)
for v in violations:
print(v)
return 1
print(
f"::notice::lint-continue-on-error-tracking: "
f"all {total_coe_true} `continue-on-error: true` directive(s) "
f"have valid trackers (open, ≤{max_age}d old)."
)
return 0
if __name__ == "__main__":
sys.exit(run())
-361
View File
@@ -1,361 +0,0 @@
#!/usr/bin/env python3
"""lint_mask_pr_atomicity — Tier 2d structural enforcement per internal#350.
Rule
----
A PR whose diff touches `.gitea/workflows/ci.yml` AND modifies EITHER:
- any `continue-on-error:` value, OR
- the `all-required` sentinel job's `needs:` block
must EITHER:
- Touch BOTH atomically in the same PR (preferred), OR
- Cross-link the paired PR via a literal `Paired: #NNN` reference in
the PR body OR in any commit message between BASE_SHA and HEAD_SHA.
The class this prevents
-----------------------
PR#665 (interim `continue-on-error: true` on `platform-build`) and
PR#668 (sentinel-`needs` demotion of the same job) were designed as a
pair but merged solo — #665 landed at 04:47Z 2026-05-12, #668 was still
open at 05:07Z when the main-red watchdog (#674) fired. Result: ~20
minutes of `main` red and a cascade of false-positives on unrelated PRs.
The lint operates on the YAML AST (PyYAML), not grep, per
`feedback_behavior_based_ast_gates`: a refactor that moves `continue-on-error`
between job keys, or renames the `all-required` job, would still be
detected because we walk the parsed structure.
Why this works on Gitea 1.22.6
------------------------------
We don't use any 1.22.6-missing endpoints (no `/actions/runs/*`, no
`branch_protections/*` — Tier 2f/g need those; Tier 2d does not). All
required inputs come from the workflow `pull_request` event payload
(BASE_SHA, HEAD_SHA, PR_BODY) and from local git via `git show`/`git log`.
The auto-injected `GITHUB_TOKEN` is enough; we don't need
DRIFT_BOT_TOKEN.
Exit codes
----------
0 — ci.yml not in diff, OR diff is no-op for the rule predicates,
OR atomicity satisfied (both touched), OR a valid `Paired: #NNN`
reference is present.
1 — exactly ONE of {coe, sentinel-needs} touched AND no valid
`Paired: #NNN` reference. The split-pair regression class.
2 — env contract violation (BASE_SHA / HEAD_SHA missing) or YAML
parse error on either side.
Env
---
BASE_SHA — PR base (pull_request.base.sha)
HEAD_SHA — PR head (pull_request.head.sha)
PR_BODY — pull_request.body (may be empty)
CI_WORKFLOW_PATH — defaults to `.gitea/workflows/ci.yml`
SENTINEL_JOB_KEY — defaults to `all-required`
Memory cross-links
------------------
- internal#350 (the RFC that specs this lint)
- PR#665 / PR#668 (the empirical split-pair)
- mc#664 (the main-red incident)
- feedback_strict_root_only_after_class_a
- feedback_behavior_based_ast_gates
"""
from __future__ import annotations
import os
import re
import subprocess
import sys
from typing import Any
try:
import yaml
except ImportError:
sys.stderr.write(
"::error::PyYAML is required. Install with: pip install PyYAML\n"
)
sys.exit(2)
# ---------------------------------------------------------------------------
# YAML quirk: bare `on:` at the top level becomes Python `True` because
# `on` is a YAML 1.1 boolean. Not used here but documented for future
# editors who copy from this module.
# ---------------------------------------------------------------------------
# `Paired: #NNN` reference. `#` is mandatory, NNN must be digits. Any
# surrounding markdown/whitespace is fine. The match is case-sensitive
# on `Paired:` because lower-case `paired:` collides with conversational
# prose ("paired: see comment above") and the convention is the exact
# capitalisation.
PAIRED_RE = re.compile(r"\bPaired:\s*#(?P<num>\d+)\b")
# ---------------------------------------------------------------------------
# Env contract
# ---------------------------------------------------------------------------
def _env(key: str, default: str | None = None) -> str:
v = os.environ.get(key, default)
return v if v is not None else ""
def _require_env(key: str) -> str:
v = os.environ.get(key)
if not v:
sys.stderr.write(f"::error::missing required env var: {key}\n")
sys.exit(2)
return v
# ---------------------------------------------------------------------------
# git-show helper. Returns None when the path doesn't exist on that side
# (new file, deleted file, or rename — git returns exit 128 with "fatal:
# path not in tree"). We treat None as "no rule predicate triggered on
# that side".
# ---------------------------------------------------------------------------
def git_show(sha: str, path: str) -> str | None:
r = subprocess.run(
["git", "show", f"{sha}:{path}"],
capture_output=True,
text=True,
)
if r.returncode != 0:
return None
return r.stdout
def git_log_messages(base_sha: str, head_sha: str) -> str:
r = subprocess.run(
["git", "log", "--format=%B", f"{base_sha}..{head_sha}"],
capture_output=True,
text=True,
)
if r.returncode != 0:
return ""
return r.stdout
def git_diff_paths(base_sha: str, head_sha: str) -> list[str]:
r = subprocess.run(
["git", "diff", "--name-only", f"{base_sha}..{head_sha}"],
capture_output=True,
text=True,
)
if r.returncode != 0:
return []
return [p for p in r.stdout.splitlines() if p.strip()]
# ---------------------------------------------------------------------------
# Predicate 1 — any `continue-on-error` value changed between base and head
# ---------------------------------------------------------------------------
def _collect_coe(doc: Any) -> dict[str, Any]:
"""Walk every job in `jobs.*` and collect its continue-on-error value.
Returns a dict {job_key: coe_value}. Missing keys are absent from
the dict (NOT `False` — distinguishes "added the key" from
"unchanged absent"). Job-step `continue-on-error` is NOT considered
— only job-level, because that's the value that masks job status
rollup, which is the class this lint targets.
"""
out: dict[str, Any] = {}
if not isinstance(doc, dict):
return out
jobs = doc.get("jobs")
if not isinstance(jobs, dict):
return out
for k, j in jobs.items():
if not isinstance(j, dict):
continue
if "continue-on-error" in j:
out[k] = j["continue-on-error"]
return out
def coe_changed(base_doc: Any, head_doc: Any) -> tuple[bool, list[str]]:
"""Return (changed?, [reasons]) describing per-job coe diffs."""
base = _collect_coe(base_doc)
head = _collect_coe(head_doc)
reasons: list[str] = []
all_keys = set(base) | set(head)
for k in sorted(all_keys):
b = base.get(k, "<absent>")
h = head.get(k, "<absent>")
if b != h:
reasons.append(f"job '{k}' continue-on-error: {b!r}{h!r}")
return (bool(reasons), reasons)
# ---------------------------------------------------------------------------
# Predicate 2 — sentinel job's `needs:` changed
# ---------------------------------------------------------------------------
def _collect_needs(doc: Any, sentinel_key: str) -> list[str] | None:
"""Return the sentinel job's needs list (sorted) or None if absent."""
if not isinstance(doc, dict):
return None
jobs = doc.get("jobs")
if not isinstance(jobs, dict):
return None
j = jobs.get(sentinel_key)
if not isinstance(j, dict):
return None
needs = j.get("needs")
if needs is None:
return []
if isinstance(needs, str):
return [needs]
if isinstance(needs, list):
# Sort because `needs:` is order-insensitive at the engine
# level; a reorder is not a semantic change and shouldn't
# trip the lint.
return sorted(str(x) for x in needs)
return None
def sentinel_needs_changed(
base_doc: Any, head_doc: Any, sentinel_key: str
) -> tuple[bool, str]:
"""Return (changed?, reason)."""
base = _collect_needs(base_doc, sentinel_key)
head = _collect_needs(head_doc, sentinel_key)
if base == head:
return (False, "")
return (
True,
f"sentinel '{sentinel_key}'.needs: {base!r}{head!r}",
)
# ---------------------------------------------------------------------------
# Predicate 3 — `Paired: #NNN` present in body or any commit message
# ---------------------------------------------------------------------------
def find_paired_refs(pr_body: str, commit_log: str) -> list[str]:
"""Return list of `#NNN` strings found (deduped, sorted)."""
found: set[str] = set()
for src in (pr_body, commit_log):
for m in PAIRED_RE.finditer(src or ""):
found.add(m.group("num"))
return sorted(found)
# ---------------------------------------------------------------------------
# Driver
# ---------------------------------------------------------------------------
def _parse(content: str | None, label: str) -> Any:
if content is None:
return None
try:
return yaml.safe_load(content)
except yaml.YAMLError as e:
sys.stderr.write(f"::error::YAML parse error on {label}: {e}\n")
sys.exit(2)
def run() -> int:
base_sha = _require_env("BASE_SHA")
head_sha = _require_env("HEAD_SHA")
pr_body = _env("PR_BODY", "")
ci_path = _env("CI_WORKFLOW_PATH", ".gitea/workflows/ci.yml")
sentinel_key = _env("SENTINEL_JOB_KEY", "all-required")
# Step 0 — is ci.yml even in the diff? If not, the lint doesn't apply.
changed_paths = git_diff_paths(base_sha, head_sha)
if ci_path not in changed_paths:
print(
f"::notice::{ci_path} not in PR diff; lint-mask-pr-atomicity "
f"skipped (no atomicity risk)."
)
return 0
base_yml = git_show(base_sha, ci_path)
head_yml = git_show(head_sha, ci_path)
base_doc = _parse(base_yml, f"{ci_path}@{base_sha}")
head_doc = _parse(head_yml, f"{ci_path}@{head_sha}")
# If the file is newly added (no base), no flip is possible — every
# value is "newly introduced", not "changed". Tier 2e covers the
# tracking-issue check for new continue-on-error: true. Exit 0.
if base_doc is None:
print(
f"::notice::{ci_path} newly added in this PR; no flip to "
f"analyse — lint-mask-pr-atomicity skipped."
)
return 0
# If the file is deleted on head, ditto — no atomicity question.
if head_doc is None:
print(
f"::notice::{ci_path} deleted in this PR; "
f"lint-mask-pr-atomicity skipped."
)
return 0
coe_yes, coe_reasons = coe_changed(base_doc, head_doc)
needs_yes, needs_reason = sentinel_needs_changed(
base_doc, head_doc, sentinel_key
)
if not coe_yes and not needs_yes:
print(
f"::notice::{ci_path} touched but neither continue-on-error "
f"nor sentinel '{sentinel_key}'.needs changed — no atomicity "
f"risk. OK."
)
return 0
if coe_yes and needs_yes:
print(
f"::notice::Atomic change detected: both continue-on-error "
f"AND sentinel '{sentinel_key}'.needs touched in same PR. OK."
)
for r in coe_reasons:
print(f" - {r}")
print(f" - {needs_reason}")
return 0
# Exactly one side touched — require Paired: #NNN reference.
commit_log = git_log_messages(base_sha, head_sha)
paired = find_paired_refs(pr_body, commit_log)
one_side = "continue-on-error" if coe_yes else f"sentinel '{sentinel_key}'.needs"
other_side = (
f"sentinel '{sentinel_key}'.needs" if coe_yes else "continue-on-error"
)
if paired:
print(
f"::notice::Split-pair detected ({one_side} changed without "
f"{other_side}), but Paired reference(s) present: "
f"{', '.join('#' + n for n in paired)}. OK."
)
for r in coe_reasons:
print(f" - {r}")
if needs_reason:
print(f" - {needs_reason}")
return 0
# The failure mode this lint exists to prevent.
print(
f"::error file={ci_path}::lint-mask-pr-atomicity (Tier 2d): "
f"PR touches {one_side} in {ci_path} but NOT {other_side}, "
f"and no `Paired: #NNN` reference was found in the PR body or "
f"in commit messages between {base_sha[:8]}..{head_sha[:8]}. "
f"This is the PR#665+#668 split-pair regression class "
f"(see internal#350, mc#664). FIX: either (a) include the "
f"matching {other_side} change in the same PR (preferred), or "
f"(b) add `Paired: #NNN` (literal, capital P, with `#`) to the "
f"PR body or a commit message referencing the paired PR."
)
for r in coe_reasons:
print(f" - {r}")
if needs_reason:
print(f" - {needs_reason}")
return 1
if __name__ == "__main__":
sys.exit(run())
@@ -1,681 +0,0 @@
#!/usr/bin/env python3
"""lint-pre-flip-continue-on-error — block a PR that flips a job from
``continue-on-error: true`` to ``continue-on-error: false`` (or removes
the key while the base had it ``true``) without proof that the job's
recent runs on the target branch are actually green.
Empirical class — PR #656 / mc#664:
PR #656 (RFC internal#219 Phase 4) flipped 5 ``platform-build``-class
jobs ``continue-on-error: true → false`` on the basis of a
"verified green on main via combined-status check". But that "green"
was the LIE produced by the prior ``continue-on-error: true``:
Gitea Quirk #10 (internal#342 + dup #287) — when a step inside a
job marked ``continue-on-error: true`` fails, the job-level status
is still rolled up as ``success``. So the precondition the PR
claimed to verify was structurally fooled by the bug being
flipped.
mc#664 then captured the surfaced defects (2 unrelated, mutually-
masked regressions):
Class 1: sqlmock helper drift since 2f36bb9a (24 days old)
Class 2: OFFSEC-001 contract collision since 7d1a189f (1 day old)
Codified 04:35Z as hongming-pc2 charter §SOP-N rule (e)
"run-log-grep-before-flip": pull the actual run log + grep for
``--- FAIL`` / ``FAIL\\s`` BEFORE flipping; don't trust the masked
combined-status.
This script structurally enforces that rule at PR time.
How it works (one PR tick):
1. Parse the diff: compare ``.gitea/workflows/*.yml`` at PR base
vs PR head. For each file present in both, parse the YAML AST
and walk ``jobs.<key>.continue-on-error`` on each side. A
"flip" is base ∈ {true} AND head ∈ {false, None/absent}. We
coerce truthy/falsy per YAML semantics (PyYAML normalizes
``true``/``True``/``yes`` to ``True``).
2. For each flipped job, derive its commit-status context name as
``"{workflow.name} / {job.name or job.key} (push)"`` — that's
how Gitea Actions emits the context for runs on
``main``/``staging`` (push event, see also expected_context()
in ci-required-drift.py).
3. Pull the last N commits of the target branch (PR base), fetch
combined commit-status per commit, scan ``statuses[]`` for
contexts matching ANY of the flipped jobs. For each match,
fetch the actual run log via the web-UI route
``{server_url}/{repo}/actions/runs/{run_id}/jobs/{job_idx}/logs``
(per memory ``reference_gitea_actions_log_fetch`` — Gitea 1.22.6
lacks REST ``/actions/runs/*`` endpoints; the web-UI route is the
only working path; see ``reference_gitea_1_22_6_lacks_rest_rerun_endpoints``).
4. Grep each log for the Go-test failure markers ``--- FAIL`` /
``FAIL\\s+<package>`` AND the bash-step error sentinel
``::error::``. If ANY recent log shows any of these AND the
status itself reads ``success``, the job was masked. ``::error::``
the flip with the offending test name + offending run URL +
the regression commit (HEAD of the run).
5. Exit 1 if any flips have at least one masked run; exit 0
otherwise.
Halt-on-noise contract:
- If a recent log fetch 404s (already-pruned-via-act_runner-gc,
transient gitea-web outage): emit ``::warning::`` and treat the
run as "log unavailable" — does NOT block the flip; logged so
a curious reviewer can re-run.
- If a flipped job has ZERO recent runs on the target branch (newly
added workflow): emit ``::warning::`` "no run history to verify"
and allow the flip. This is the only way a NEW workflow can ever
ship with ``continue-on-error: false``; otherwise we'd have a
chicken-and-egg.
Behavior-based AST gate per ``feedback_behavior_based_ast_gates``:
- YAML parsed via PyYAML safe_load on BOTH sides of the diff
- No grep-by-line — formatting changes (comment churn, key order)
don't false-positive a flip
- Job-key match — so a rename ``platform-build → core-be-build``
appears as a DELETE + an ADD, not a flip (the delete side has no
new value to compare against; the add side has no base side).
Run locally (works against this repo, requires PyYAML + Gitea token
that can read combined-commit-status):
GITEA_TOKEN=... GITEA_HOST=git.moleculesai.app \\
REPO=molecule-ai/molecule-core BASE_REF=main \\
BASE_SHA=$(git rev-parse origin/main) \\
HEAD_SHA=$(git rev-parse HEAD) \\
python3 .gitea/scripts/lint_pre_flip_continue_on_error.py \\
--dry-run
Cross-links: PR#656, mc#664, PR#665 (the interim re-mask),
Quirk #10 (internal#342 + dup #287), hongming-pc2 charter §SOP-N
rule (e), feedback_strict_root_only_after_class_a,
feedback_no_shared_persona_token_use.
"""
from __future__ import annotations
import argparse
import json
import os
import subprocess
import sys
import urllib.error
import urllib.parse
import urllib.request
from typing import Any
import yaml # PyYAML 6.0.2 — installed by the workflow before this runs.
# --------------------------------------------------------------------------
# Environment (read at module-import; runtime contract enforced in main())
# --------------------------------------------------------------------------
def _env(key: str, *, default: str = "") -> str:
return os.environ.get(key, default)
GITEA_TOKEN = _env("GITEA_TOKEN")
GITEA_HOST = _env("GITEA_HOST")
REPO = _env("REPO")
BASE_REF = _env("BASE_REF", default="main")
BASE_SHA = _env("BASE_SHA")
HEAD_SHA = _env("HEAD_SHA")
# How many recent commits to scan on the target branch. 5 by default;
# enough to catch a job that only fails intermittently, not so many
# that the script paginates needlessly. Per spec.
RECENT_COMMITS_N = int(_env("RECENT_COMMITS_N", default="5"))
OWNER, NAME = (REPO.split("/", 1) + [""])[:2] if REPO else ("", "")
API = f"https://{GITEA_HOST}/api/v1" if GITEA_HOST else ""
WEB = f"https://{GITEA_HOST}" if GITEA_HOST else ""
# Failure markers we grep for in the run log.
# --- FAIL — Go test failure marker
# FAIL\s — `FAIL github.com/x/y` package-level rollup
# ::error:: — bash-step `::error::` lines (the lint-curl-status-capture
# pattern: a `python3 <<PY` block writing `::error::` then
# sys.exit(1); also any shell `echo "::error::..."` from
# jobs that wrap pytest/eslint/etc. and convert
# non-zero exits into masked-by-CoE status)
FAIL_PATTERNS = (
"--- FAIL",
"FAIL\t",
"FAIL ",
"::error::",
)
def _require_runtime_env() -> None:
for key in ("GITEA_TOKEN", "GITEA_HOST", "REPO", "BASE_REF", "BASE_SHA", "HEAD_SHA"):
if not os.environ.get(key):
sys.stderr.write(f"::error::missing required env var: {key}\n")
sys.exit(2)
# --------------------------------------------------------------------------
# Tiny HTTP helper (no requests dependency)
# Mirrors the api()/ApiError contract in ci-required-drift.py +
# main-red-watchdog.py per feedback_api_helper_must_raise_not_return_dict.
# --------------------------------------------------------------------------
class ApiError(RuntimeError):
"""Raised when a Gitea API/web call cannot be trusted to have succeeded.
Soft-failure on non-2xx is the duplicate-write bug factory in
find-or-create flows (PR #112 Five-Axis). Here it would mean a
transient gitea-web 502 silently allows a flip whose recent runs
we couldn't actually verify — exactly the regression class this
lint exists to close.
"""
def http(
method: str,
url: str,
*,
body: dict | None = None,
headers: dict[str, str] | None = None,
expect_json: bool = True,
timeout: int = 30,
) -> tuple[int, Any, bytes]:
"""Tiny HTTP helper around urllib.
Returns (status, parsed_or_None, raw_bytes). Raises ApiError on any
non-2xx response. ``expect_json=False`` returns raw bytes in the
parsed slot (for log-fetch from the web-UI which returns text/plain).
"""
final_headers = {
"Authorization": f"token {GITEA_TOKEN}",
"Accept": "application/json" if expect_json else "text/plain",
}
if headers:
final_headers.update(headers)
data = None
if body is not None:
data = json.dumps(body).encode("utf-8")
final_headers["Content-Type"] = "application/json"
req = urllib.request.Request(url, method=method, data=data, headers=final_headers)
try:
with urllib.request.urlopen(req, timeout=timeout) as resp:
raw = resp.read()
status = resp.status
except urllib.error.HTTPError as e:
raw = e.read() or b""
status = e.code
if not (200 <= status < 300):
snippet = raw[:500].decode("utf-8", errors="replace") if raw else ""
raise ApiError(f"{method} {url} → HTTP {status}: {snippet}")
if not expect_json:
return status, raw, raw
if not raw:
return status, None, raw
try:
return status, json.loads(raw), raw
except json.JSONDecodeError as e:
raise ApiError(f"{method} {url} → HTTP {status} but body is not JSON: {e}") from e
def api(method: str, path: str, *, body: dict | None = None, query: dict[str, str] | None = None) -> tuple[int, Any]:
"""Read-shaped Gitea REST helper. Path is API-relative (``/repos/...``)."""
url = f"{API}{path}"
if query:
url = f"{url}?{urllib.parse.urlencode(query)}"
status, parsed, _ = http(method, url, body=body, expect_json=True)
return status, parsed
# --------------------------------------------------------------------------
# YAML parsing — coerce truthy/falsy for continue-on-error
# --------------------------------------------------------------------------
def _coerce_coe(val: Any) -> bool:
"""Coerce a continue-on-error YAML value to bool.
PyYAML safe_load normalizes ``true``/``True``/``yes``/``on`` to
Python ``True`` and ``false``/``False``/``no``/``off`` / absence
to ``False`` (we treat absence/None as False here too — that's the
GitHub Actions default semantics).
Edge cases:
- String ``"true"`` (quoted in YAML) — kept as the string
``"true"``, falsy under bool() but a flip we DO care about
catching. Normalize string forms case-insensitively to bool
so the diff is consistent with the runtime behavior of
Gitea Actions, which YAML-parses the same way.
"""
if isinstance(val, bool):
return val
if val is None:
return False
if isinstance(val, str):
return val.strip().lower() in ("true", "yes", "on", "1")
return bool(val)
def jobs_coe_map(workflow_doc: dict) -> dict[str, bool]:
"""Return ``{job_key: continue_on_error_bool}`` for every job in
the workflow. Job-level ``continue-on-error`` only — does NOT
descend into per-step ``continue-on-error`` (step-level CoE
masking is a separate class and is handled by the test suite
+ reviewer, not by this gate — see Future Work in the workflow
YAML).
"""
out: dict[str, bool] = {}
jobs = workflow_doc.get("jobs")
if not isinstance(jobs, dict):
return out
for key, job in jobs.items():
if not isinstance(job, dict):
continue
out[key] = _coerce_coe(job.get("continue-on-error"))
return out
def workflow_name(workflow_doc: dict, *, fallback: str = "") -> str:
"""Top-level ``name:`` of the workflow. Falls back to the filename
(without extension) per Gitea Actions semantics."""
n = workflow_doc.get("name")
if isinstance(n, str) and n.strip():
return n.strip()
return fallback
def job_display_name(workflow_doc: dict, job_key: str) -> str:
"""``jobs.<key>.name`` if present, else the key. Mirrors
expected_context() in ci-required-drift.py."""
job = workflow_doc.get("jobs", {}).get(job_key)
if isinstance(job, dict):
n = job.get("name")
if isinstance(n, str) and n.strip():
return n.strip()
return job_key
def context_name(workflow_name_str: str, job_name_str: str, event: str = "push") -> str:
"""Render the commit-status context the way Gitea Actions emits it.
Default ``event="push"`` because recent-runs-on-main are push events;
callers can override to ``"pull_request"`` for PR-context lookups."""
return f"{workflow_name_str} / {job_name_str} ({event})"
# --------------------------------------------------------------------------
# Diff detection — flips, not arbitrary changes
# --------------------------------------------------------------------------
def detect_flips(
base_workflows: dict[str, str],
head_workflows: dict[str, str],
) -> list[dict]:
"""Compare per-file CoE maps; return a list of flip records.
Inputs are ``{path: yaml_text}`` for both sides. Output records
have the shape::
{
"workflow_path": ".gitea/workflows/ci.yml",
"workflow_name": "CI",
"job_key": "platform-build",
"job_name": "Platform (Go)",
"context": "CI / Platform (Go) (push)",
}
A flip is base[CoE] ∈ {True} AND head[CoE] ∈ {False}. Files
only present on one side are skipped — adding a new workflow
with ``CoE: false`` is fine (no history to mask), and removing
a workflow can't possibly flip anything.
"""
flips: list[dict] = []
for path, base_text in base_workflows.items():
if path not in head_workflows:
continue
try:
base_doc = yaml.safe_load(base_text) or {}
head_doc = yaml.safe_load(head_workflows[path]) or {}
except yaml.YAMLError as e:
# Don't block on a parse error — the YAML lint workflows
# catch invalid YAML separately. Just warn so the failing
# file is visible.
sys.stderr.write(f"::warning file={path}::YAML parse error: {e}\n")
continue
if not isinstance(base_doc, dict) or not isinstance(head_doc, dict):
continue
base_map = jobs_coe_map(base_doc)
head_map = jobs_coe_map(head_doc)
wf_name = workflow_name(head_doc, fallback=os.path.basename(path).rsplit(".", 1)[0])
for job_key, base_val in base_map.items():
if job_key not in head_map:
continue # job removed — not a flip
if base_val is True and head_map[job_key] is False:
flips.append({
"workflow_path": path,
"workflow_name": wf_name,
"job_key": job_key,
"job_name": job_display_name(head_doc, job_key),
"context": context_name(wf_name, job_display_name(head_doc, job_key), "push"),
})
return flips
# --------------------------------------------------------------------------
# Git: snapshot every .gitea/workflows/*.yml at a SHA (no checkout)
# --------------------------------------------------------------------------
def _git(*args: str, cwd: str | None = None) -> str:
"""Run ``git`` and return stdout (text)."""
result = subprocess.run(
["git", *args],
capture_output=True,
text=True,
check=False,
cwd=cwd,
)
if result.returncode != 0:
raise RuntimeError(f"git {args!r} failed: {result.stderr.strip()}")
return result.stdout
def workflows_at_sha(sha: str, *, repo_dir: str | None = None) -> dict[str, str]:
"""Read every ``.gitea/workflows/*.yml`` blob at ``sha``.
Uses ``git ls-tree`` + ``git show`` so we never need to check out
the SHA (the workflow runs on the PR head; the base SHA is
fetched, not checked out).
"""
out: dict[str, str] = {}
listing = _git("ls-tree", "-r", "--name-only", sha, ".gitea/workflows/", cwd=repo_dir)
for line in listing.splitlines():
line = line.strip()
if not line.endswith((".yml", ".yaml")):
continue
try:
blob = _git("show", f"{sha}:{line}", cwd=repo_dir)
except RuntimeError:
# Symlink or other non-blob; skip.
continue
out[line] = blob
return out
# --------------------------------------------------------------------------
# Gitea: recent commits + per-commit combined status + log fetch
# --------------------------------------------------------------------------
def recent_commits_on_branch(branch: str, n: int) -> list[str]:
"""Last `n` commit SHAs on ``branch`` (oldest→newest is fine; we
treat them as a set). Uses the REST ``/commits`` endpoint with
``sha=branch&limit=n``."""
_, body = api(
"GET",
f"/repos/{OWNER}/{NAME}/commits",
query={"sha": branch, "limit": str(n)},
)
if not isinstance(body, list):
raise ApiError(f"/commits for {branch} returned non-list: {type(body).__name__}")
out: list[str] = []
for c in body:
if isinstance(c, dict):
sha = c.get("sha") or (c.get("commit", {}) or {}).get("id")
if isinstance(sha, str) and len(sha) >= 7:
out.append(sha)
return out
def combined_status(sha: str) -> dict:
"""Combined commit status for a SHA. Same shape as
``main-red-watchdog.get_combined_status``."""
_, body = api("GET", f"/repos/{OWNER}/{NAME}/commits/{sha}/status")
if not isinstance(body, dict):
raise ApiError(f"combined-status for {sha} not a dict")
return body
def _entry_state(s: dict) -> str:
"""Per-entry state — Gitea 1.22.6 schema asymmetry: top-level
uses ``state``, per-entry uses ``status``. Defensive fallback per
main-red-watchdog.py line 233."""
return s.get("status") or s.get("state") or ""
def fetch_log(target_url: str) -> str | None:
"""Fetch a job log given its web-UI ``target_url`` (e.g.
``/molecule-ai/molecule-core/actions/runs/13494/jobs/0``).
Per ``reference_gitea_actions_log_fetch``: append ``/logs`` to the
job route. Per ``reference_gitea_1_22_6_lacks_rest_rerun_endpoints``:
Gitea 1.22.6 lacks the REST ``/api/v1/.../actions/runs/*`` path; the
web-UI route is the only working endpoint until 1.24+.
Returns the log text on success, ``None`` on 404 / log-pruned /
network error (caller treats None as "log unavailable, warn-not-fail").
"""
if not target_url:
return None
# Normalize: target_url may be relative ("/owner/repo/...") or
# absolute. Both need ``/logs`` appended to the job sub-path.
if target_url.startswith("/"):
url = f"{WEB}{target_url}"
else:
url = target_url
if not url.endswith("/logs"):
url = f"{url}/logs"
try:
_, body, _ = http("GET", url, expect_json=False, timeout=60)
except ApiError as e:
sys.stderr.write(f"::warning::log fetch failed for {url}: {e}\n")
return None
if isinstance(body, bytes):
return body.decode("utf-8", errors="replace")
return None
def grep_fail_markers(log_text: str) -> list[str]:
"""Return up to 5 sample matching lines for any FAIL_PATTERNS hit.
Empty list = clean log."""
matches: list[str] = []
for line in log_text.splitlines():
for pat in FAIL_PATTERNS:
if pat in line:
# Truncate to keep error output bounded.
matches.append(line.strip()[:240])
break
if len(matches) >= 5:
break
return matches
# --------------------------------------------------------------------------
# Verification: for one flip, scan recent runs on BASE_REF
# --------------------------------------------------------------------------
def verify_flip(flip: dict, branch: str, n: int) -> dict:
"""Scan the last ``n`` commits on ``branch``. For each commit whose
combined status contains a context matching ``flip["context"]``,
fetch the run log and grep for FAIL markers.
Returns::
{
"flip": flip,
"checked_commits": int, # how many commits had a matching context
"masked_runs": [ # runs where log shows FAIL despite status==success
{"sha": "...", "status": "success", "target_url": "...", "samples": [...]},
...
],
"fail_runs": [ # runs where status itself is failure/error
{"sha": "...", "status": "failure", "target_url": "...", "samples": [...]},
...
],
"warnings": [str], # log-unavailable warnings (not blocking)
}
Blocking condition: ``masked_runs`` OR ``fail_runs`` non-empty.
A ``success`` status with a clean log is the only "OK to flip"
outcome (per hongming-pc2 §SOP-N rule (e)).
"""
target_context = flip["context"]
result = {
"flip": flip,
"checked_commits": 0,
"masked_runs": [],
"fail_runs": [],
"warnings": [],
}
shas = recent_commits_on_branch(branch, n)
if not shas:
result["warnings"].append(
f"no recent commits on {branch} (cannot verify flip)"
)
return result
for sha in shas:
try:
status_doc = combined_status(sha)
except ApiError as e:
result["warnings"].append(f"combined-status for {sha}: {e}")
continue
statuses = status_doc.get("statuses") or []
# First entry matching the context name. Newest SHAs come
# first; one entry per context per SHA is the usual shape.
for s in statuses:
if not isinstance(s, dict):
continue
if s.get("context") != target_context:
continue
result["checked_commits"] += 1
state = _entry_state(s)
target_url = s.get("target_url") or ""
log_text = fetch_log(target_url)
if log_text is None:
result["warnings"].append(
f"log unavailable for {sha} {target_context}"
)
# Still record the status itself if it's red — that's
# a hard signal that doesn't need log access.
if state in ("failure", "error"):
result["fail_runs"].append({
"sha": sha,
"status": state,
"target_url": target_url,
"samples": ["[log unavailable; status itself is " + state + "]"],
})
break
samples = grep_fail_markers(log_text)
if state in ("failure", "error"):
result["fail_runs"].append({
"sha": sha,
"status": state,
"target_url": target_url,
"samples": samples or ["[no FAIL markers found but status is " + state + "]"],
})
elif samples and state == "success":
# The bug class: status==success while log shows FAIL.
# That's exactly Quirk #10 (continue-on-error masking).
result["masked_runs"].append({
"sha": sha,
"status": state,
"target_url": target_url,
"samples": samples,
})
# Either way, we matched one context entry for this SHA;
# don't keep looping `statuses[]`.
break
if result["checked_commits"] == 0:
result["warnings"].append(
f"no runs of {target_context!r} found in the last {n} commits on "
f"{branch} — cannot verify; allowing flip with warning"
)
return result
# --------------------------------------------------------------------------
# Report rendering
# --------------------------------------------------------------------------
def render_flip_report(verdict: dict) -> str:
flip = verdict["flip"]
lines = [
f"job: {flip['job_key']} ({flip['context']})",
f" workflow: {flip['workflow_path']}",
f" checked_commits: {verdict['checked_commits']}",
]
for run in verdict["fail_runs"]:
url = run["target_url"]
# target_url may be relative; render the absolute form for
# click-through.
if url.startswith("/"):
url = f"{WEB}{url}"
lines.append(f" fail run {run['sha'][:10]} (status={run['status']}): {url}")
for sample in run["samples"]:
lines.append(f" | {sample}")
for run in verdict["masked_runs"]:
url = run["target_url"]
if url.startswith("/"):
url = f"{WEB}{url}"
lines.append(
f" MASKED run {run['sha'][:10]} (status=success, log shows FAIL): {url}"
)
for sample in run["samples"]:
lines.append(f" | {sample}")
for w in verdict["warnings"]:
lines.append(f" warning: {w}")
return "\n".join(lines)
# --------------------------------------------------------------------------
# Main
# --------------------------------------------------------------------------
def _parse_args(argv: list[str] | None = None) -> argparse.Namespace:
p = argparse.ArgumentParser(
prog="lint-pre-flip-continue-on-error",
description="Block a PR that flips continue-on-error true→false "
"without proof recent runs are actually green.",
)
p.add_argument(
"--dry-run",
action="store_true",
help="Detect + print findings to stdout; never exit non-zero. "
"Useful for local testing.",
)
return p.parse_args(argv)
def main(argv: list[str] | None = None) -> int:
args = _parse_args(argv)
_require_runtime_env()
base_workflows = workflows_at_sha(BASE_SHA)
head_workflows = workflows_at_sha(HEAD_SHA)
flips = detect_flips(base_workflows, head_workflows)
if not flips:
print("::notice::no continue-on-error true→false flips in this PR")
return 0
print(f"::notice::detected {len(flips)} continue-on-error true→false flip(s); verifying recent runs on {BASE_REF}")
bad_flips: list[dict] = []
for flip in flips:
verdict = verify_flip(flip, BASE_REF, RECENT_COMMITS_N)
report = render_flip_report(verdict)
if verdict["fail_runs"] or verdict["masked_runs"]:
print(f"::error file={flip['workflow_path']}::flip of {flip['job_key']} "
f"({flip['context']}) blocked — recent runs on {BASE_REF} show "
f"FAIL markers OR are red. Pull each run log below + grep "
f"`--- FAIL` / `FAIL ` / `::error::` — DON'T trust the masked "
f"combined-status. See hongming-pc2 charter §SOP-N rule (e). "
f"PR#656 / mc#664 reference class.")
bad_flips.append(verdict)
else:
print(f"::notice::flip of {flip['job_key']} ({flip['context']}) is safe — "
f"{verdict['checked_commits']} recent run(s), no FAIL markers")
# Always print the per-flip detail block so the human-readable
# report is in the run log for both safe and unsafe flips.
print(f"::group::flip detail: {flip['job_key']}")
print(report)
print("::endgroup::")
if bad_flips and not args.dry_run:
print(f"::error::{len(bad_flips)}/{len(flips)} flip(s) failed pre-flip verification")
return 1
if bad_flips and args.dry_run:
print(f"::warning::[dry-run] {len(bad_flips)}/{len(flips)} flip(s) WOULD fail; exit 0 forced")
return 0
if __name__ == "__main__":
sys.exit(main())
@@ -1,526 +0,0 @@
#!/usr/bin/env python3
"""lint_required_context_exists_in_bp — Tier 2g per internal#350.
Rule
----
When a PR adds a NEW commit-status emission (a context that didn't
exist on the base side), the workflow file must carry one of three
directive comments adjacent to the new job:
(a) `# bp-required: yes`
The new context MUST already be in
`branch_protections/<branch>.status_check_contexts`. Verified
via Gitea API at PR time.
(b) `# bp-required: pending #NNN`
Acknowledged asymmetry; references an OPEN tracking issue that
will follow up with the BP PATCH.
(c) `# bp-exempt: <free-text reason>`
Informational job, not intended to be a required gate.
No directive on a new emitter → FAIL with a 3-option fix-hint.
The class this prevents
-----------------------
PR#656 added `CI / all-required (pull_request)` as a sentinel context
that workflows emit, but BP did NOT list it. When `platform-build`
failed, `all-required` failed, but BP let the PR merge anyway →
cascade to mc#664. With this lint, PR#656 would have been blocked
until either the BP PATCH ran alongside OR the author added a
`bp-required: pending` directive.
Why directives MUST live in the workflow YAML
---------------------------------------------
The directive comment lives with the emitter so a scheduled
audit (Tier 2f, daily) can read the same source. PR-body-only
directives invisibly evaporate on merge — the asymmetry would
return to undetected. PR-body claims are advisory; workflow-file
comments are the contract.
How "new emission" is detected
------------------------------
Diff base..head over `.gitea/workflows/*.yml`. For each YAML file
that's added or modified:
- Parse both base-side and head-side via PyYAML AST.
- Enumerate emitted contexts on each side using the same rules as
Tier 2f (workflow.name + job.name|key + event-mapping).
- `new_contexts = head_contexts - base_contexts`.
If `new_contexts` is empty after de-dup, no rule applies → pass.
Per `feedback_behavior_based_ast_gates`: comment scanning uses raw
text in a small window around the job-key line, NOT regex over the
full file. This avoids matching `bp-required:` mentioned in a
comment unrelated to the new job.
Exit codes
----------
0 — no new emissions, all new emissions have valid directives,
or BP read errored (graceful-degrade per Tier 2a contract).
1 — at least one new emission lacks a directive, or has
`bp-required: yes` but the context is missing from BP.
2 — env contract violation or YAML parse error.
Env
---
BASE_SHA — PR base SHA
HEAD_SHA — PR head SHA
GITEA_TOKEN — DRIFT_BOT_TOKEN (repo-admin for BP read)
GITEA_HOST — e.g. git.moleculesai.app
REPO — owner/name
BRANCH — defaults to `main`
WORKFLOWS_DIR — defaults to `.gitea/workflows`
Memory cross-links
------------------
- internal#350 (the RFC that specs this lint)
- PR#656 (the empirical case that prompted Tier 2g)
- mc#664 (the surfaced cascade)
- feedback_phantom_required_check_after_gitea_migration (Tier 2f cousin)
- feedback_behavior_based_ast_gates
"""
from __future__ import annotations
import json
import os
import re
import subprocess
import sys
import urllib.error
import urllib.parse
import urllib.request
from typing import Any
try:
import yaml
except ImportError:
sys.stderr.write(
"::error::PyYAML is required. Install with: pip install PyYAML\n"
)
sys.exit(2)
# Directive comment patterns. We match `# bp-required:` OR `# bp-exempt:`,
# both with optional surrounding whitespace and case-sensitive on the
# `bp-` prefix (convention).
BP_REQUIRED_YES_RE = re.compile(
r"#\s*bp-required:\s*yes\b", re.IGNORECASE
)
BP_REQUIRED_PENDING_RE = re.compile(
r"#\s*bp-required:\s*pending\s*#(?P<num>\d+)\b", re.IGNORECASE
)
BP_EXEMPT_RE = re.compile(
r"#\s*bp-exempt:\s*\S", re.IGNORECASE
)
# Gitea event-mapping (same as Tier 2f).
_EVENT_MAP = {
"pull_request": "pull_request",
"pull_request_target": "pull_request",
"push": "push",
}
# ---------------------------------------------------------------------------
# Env
# ---------------------------------------------------------------------------
def _env(key: str, default: str | None = None) -> str:
v = os.environ.get(key, default)
return v if v is not None else ""
def _require_env(key: str) -> str:
v = os.environ.get(key)
if not v:
sys.stderr.write(f"::error::missing required env var: {key}\n")
sys.exit(2)
return v
# ---------------------------------------------------------------------------
# API helper (same contract as Tier 2f).
# ---------------------------------------------------------------------------
def api(
method: str,
path: str,
*,
body: dict | None = None,
query: dict[str, str] | None = None,
) -> tuple[str, Any]:
host = _env("GITEA_HOST")
token = _env("GITEA_TOKEN")
url = f"https://{host}/api/v1{path}"
if query:
url = f"{url}?{urllib.parse.urlencode(query)}"
data = None
headers = {
"Authorization": f"token {token}",
"Accept": "application/json",
}
if body is not None:
data = json.dumps(body).encode("utf-8")
headers["Content-Type"] = "application/json"
req = urllib.request.Request(url, method=method, data=data, headers=headers)
try:
with urllib.request.urlopen(req, timeout=30) as resp:
raw = resp.read()
if not raw:
return ("ok", None)
return ("ok", json.loads(raw))
except urllib.error.HTTPError as e:
if e.code == 404:
return ("not_found", None)
if e.code in (401, 403):
return ("forbidden", None)
return ("error", None)
except (urllib.error.URLError, TimeoutError, json.JSONDecodeError):
return ("error", None)
# ---------------------------------------------------------------------------
# git helpers
# ---------------------------------------------------------------------------
def git_show(sha: str, path: str) -> str | None:
r = subprocess.run(
["git", "show", f"{sha}:{path}"], capture_output=True, text=True
)
if r.returncode != 0:
return None
return r.stdout
def git_diff_paths(base: str, head: str) -> list[str]:
r = subprocess.run(
["git", "diff", "--name-only", f"{base}..{head}"],
capture_output=True,
text=True,
)
if r.returncode != 0:
return []
return [p for p in r.stdout.splitlines() if p.strip()]
# ---------------------------------------------------------------------------
# Workflow context enumeration (mirror Tier 2f).
# ---------------------------------------------------------------------------
def _get_on(d: Any) -> Any:
if not isinstance(d, dict):
return None
if "on" in d:
return d["on"]
if True in d:
return d[True]
return None
def _on_events(doc: Any) -> set[str]:
on = _get_on(doc)
raw: set[str] = set()
if on is None:
return raw
if isinstance(on, str):
raw.add(on)
elif isinstance(on, list):
for e in on:
if isinstance(e, str):
raw.add(e)
elif isinstance(on, dict):
for k in on:
if isinstance(k, str):
raw.add(k)
return {_EVENT_MAP[e] for e in raw if e in _EVENT_MAP}
def _job_display(jbody: dict, jkey: str) -> str:
n = jbody.get("name") if isinstance(jbody, dict) else None
if isinstance(n, str) and n:
return n
return jkey
def workflow_contexts(doc: Any) -> set[str]:
if not isinstance(doc, dict):
return set()
wf_name = doc.get("name")
if not isinstance(wf_name, str) or not wf_name:
return set()
events = _on_events(doc)
if not events:
return set()
jobs = doc.get("jobs")
if not isinstance(jobs, dict):
return set()
out: set[str] = set()
for jkey, jbody in jobs.items():
if jkey == "__lines__":
continue
if not isinstance(jbody, dict):
continue
disp = _job_display(jbody, jkey)
for ev in events:
out.add(f"{wf_name} / {disp} ({ev})")
return out
# ---------------------------------------------------------------------------
# Find the source line of a job-key in a workflow YAML's raw text.
# Used to scan for nearby directive comments.
# ---------------------------------------------------------------------------
def _find_job_key_line(raw_lines: list[str], jkey: str) -> int | None:
"""Return 1-based line of `<jkey>:` under jobs:."""
in_jobs = False
jobs_indent = -1
for i, line in enumerate(raw_lines, start=1):
stripped = line.lstrip()
if stripped.startswith("jobs:"):
in_jobs = True
jobs_indent = len(line) - len(stripped)
continue
if in_jobs:
# Job key is the next indent level under `jobs:`.
indent = len(line) - len(stripped)
if stripped and indent <= jobs_indent:
# Left the jobs: block
in_jobs = False
continue
if re.match(rf"^\s*{re.escape(jkey)}\s*:", line):
return i
return None
_DIRECTIVE_WINDOW = 3 # lines above the job-key line (inclusive)
def find_directive_for_job(
raw_text: str, jkey: str
) -> tuple[str, str | None] | None:
"""Return (kind, value) tuple for the first directive in a small
window above the job-key line.
kind ∈ {"required-yes", "required-pending", "exempt"}.
value is the pending-issue number for required-pending, else None.
Returns None if no directive found.
We scan ABOVE the line only (the convention is the directive
precedes the job — matches how `# mc#NNN` comments are placed
above `continue-on-error: true`). We don't scan inside the job
body because steps can produce false positives.
"""
lines = raw_text.splitlines()
line_no = _find_job_key_line(lines, jkey)
if line_no is None:
return None
lo = max(1, line_no - _DIRECTIVE_WINDOW)
for i in range(lo, line_no):
line = lines[i - 1]
m = BP_REQUIRED_PENDING_RE.search(line)
if m:
return ("required-pending", m.group("num"))
if BP_REQUIRED_YES_RE.search(line):
return ("required-yes", None)
if BP_EXEMPT_RE.search(line):
return ("exempt", None)
return None
# ---------------------------------------------------------------------------
# Map a context back to its emitting (workflow_path, job_key) pair so
# we know WHERE to look for the directive comment.
# ---------------------------------------------------------------------------
def _resolve_emitter(
ctx: str, head_workflows: dict[str, tuple[str, Any]]
) -> tuple[str, str] | None:
"""Return (file_path, job_key) emitting ctx, or None."""
m = re.match(r"^(?P<wf>.+?) / (?P<job>.+) \((?P<event>[^)]+)\)$", ctx)
if not m:
return None
target_wf = m.group("wf")
target_job_disp = m.group("job")
for path, (_raw, doc) in head_workflows.items():
if not isinstance(doc, dict):
continue
if doc.get("name") != target_wf:
continue
jobs = doc.get("jobs") or {}
if not isinstance(jobs, dict):
continue
for jkey, jbody in jobs.items():
if jkey == "__lines__":
continue
if not isinstance(jbody, dict):
continue
disp = _job_display(jbody, jkey)
if disp == target_job_disp:
return (path, jkey)
return None
# ---------------------------------------------------------------------------
# Driver
# ---------------------------------------------------------------------------
def run() -> int:
base_sha = _require_env("BASE_SHA")
head_sha = _require_env("HEAD_SHA")
_require_env("GITEA_TOKEN")
_require_env("GITEA_HOST")
repo = _require_env("REPO")
branch = _env("BRANCH", "main")
wf_dir = _env("WORKFLOWS_DIR", ".gitea/workflows")
# Step 1 — find workflow files changed in the PR.
changed = git_diff_paths(base_sha, head_sha)
changed_workflows = [
p
for p in changed
if p.startswith(wf_dir + "/")
and (p.endswith(".yml") or p.endswith(".yaml"))
]
if not changed_workflows:
print(
"::notice::no workflow file changes in this PR; "
"lint-required-context-exists-in-bp skipped."
)
return 0
# Step 2 — load base+head + compute new contexts.
head_workflows: dict[str, tuple[str, Any]] = {}
new_contexts: set[str] = set()
for path in changed_workflows:
base_raw = git_show(base_sha, path)
head_raw = git_show(head_sha, path)
if head_raw is None:
# File deleted on head — no new emission contribution.
continue
try:
head_doc = yaml.safe_load(head_raw)
except yaml.YAMLError as e:
sys.stderr.write(
f"::error file={path}::YAML parse error on head: {e}\n"
)
return 2
head_workflows[path] = (head_raw, head_doc)
head_ctx = workflow_contexts(head_doc)
base_ctx: set[str] = set()
if base_raw is not None:
try:
base_doc = yaml.safe_load(base_raw)
except yaml.YAMLError:
base_doc = None
if base_doc is not None:
base_ctx = workflow_contexts(base_doc)
new_contexts |= (head_ctx - base_ctx)
if not new_contexts:
print(
"::notice::no new context emissions detected in this PR; "
"lint-required-context-exists-in-bp skipped."
)
return 0
# Step 3 — fetch BP context list.
status, bp = api("GET", f"/repos/{repo}/branch_protections/{branch}")
bp_contexts: set[str] = set()
if status == "forbidden":
sys.stderr.write(
f"::error::GET branch_protections/{branch} returned HTTP 403 — "
f"DRIFT_BOT_TOKEN lacks repo-admin scope. Cannot verify "
f"bp-required directives; skipping lint with exit 0 per "
f"Tier 2a contract. Fix the token, not the lint.\n"
)
return 0
elif status == "not_found":
# Branch has no protection — nothing to verify against; the
# bp-required: yes directive can't be satisfied. Treat as
# graceful-skip rather than red-X.
print(
f"::notice::branch '{branch}' has no protection; cannot verify "
f"bp-required directives. Skipping (exit 0)."
)
return 0
elif status == "ok" and isinstance(bp, dict):
bp_contexts = set(bp.get("status_check_contexts") or [])
else:
sys.stderr.write(
f"::error::branch_protections/{branch} response unexpected; "
f"status={status}. Treating as transient; exit 0.\n"
)
return 0
# Step 4 — validate each new emission's directive.
violations: list[str] = []
for ctx in sorted(new_contexts):
emitter = _resolve_emitter(ctx, head_workflows)
if emitter is None:
# Shouldn't happen — we just derived ctx from head_workflows.
# Belt-and-suspenders fallback.
violations.append(
f"::error::new emission '{ctx}' (could not resolve emitter "
f"file/job — bug in lint?)"
)
continue
file_path, jkey = emitter
raw_text, _ = head_workflows[file_path]
directive = find_directive_for_job(raw_text, jkey)
if directive is None:
violations.append(
f"::error file={file_path}::lint-required-context-exists-in-bp "
f"(Tier 2g): NEW emission `{ctx}` (job '{jkey}') has no "
f"directive comment. Add ONE of these comments on the line "
f"directly above `{jkey}:` (within {_DIRECTIVE_WINDOW} lines):\n"
f" - `# bp-required: yes` — and ensure the context is "
f"already in branch_protections/{branch}.status_check_contexts.\n"
f" - `# bp-required: pending #NNN` — acknowledged asymmetry, "
f"references the tracking issue for the BP PATCH.\n"
f" - `# bp-exempt: <reason>` — informational job, not a gate.\n"
f"Memory: internal#350 (PR#656 + mc#664 empirical case)."
)
continue
kind, value = directive
if kind == "exempt":
print(f"::notice::{ctx}: bp-exempt directive present, OK.")
continue
if kind == "required-pending":
print(
f"::notice::{ctx}: bp-required: pending #{value}"
f"acknowledged asymmetry, OK."
)
continue
if kind == "required-yes":
if ctx in bp_contexts:
print(
f"::notice::{ctx}: bp-required: yes, and context is in "
f"BP, OK."
)
else:
violations.append(
f"::error file={file_path}::lint-required-context-exists-in-bp "
f"(Tier 2g): job '{jkey}' has `bp-required: yes` "
f"directive but its emitted context `{ctx}` is NOT in "
f"`branch_protections/{branch}.status_check_contexts`. "
f"FIX: either (a) add `{ctx}` to BP (Owners-tier PATCH), "
f"or (b) downgrade the directive to "
f"`# bp-required: pending #NNN` referencing the tracker "
f"for the pending BP PATCH."
)
if violations:
print(
f"::error::lint-required-context-exists-in-bp: "
f"{len(violations)} violation(s) across "
f"{len(changed_workflows)} changed workflow file(s)."
)
for v in violations:
print(v)
return 1
print(
f"::notice::lint-required-context-exists-in-bp: "
f"{len(new_contexts)} new emission(s) all directive-validated."
)
return 0
if __name__ == "__main__":
sys.exit(run())
-757
View File
@@ -1,757 +0,0 @@
#!/usr/bin/env python3
"""main-red-watchdog — Option C of the "main NEVER goes red" directive.
Tracking: molecule-core#420.
What it does (one cron tick):
1. GET /api/v1/repos/{owner}/{repo}/branches/{watch_branch}
→ current HEAD SHA on the watched branch.
2. GET /api/v1/repos/{owner}/{repo}/commits/{SHA}/status
→ combined status + per-context statuses.
3. If combined state is `failure` (or any individual status is
`failure`): open or PATCH an idempotent
`[main-red] {repo}: {SHA[:10]}` issue. Body lists each failed
status context with `target_url` + `description`.
4. If combined state is `success`: close any open `[main-red]
{repo}: ...` issue on a previous SHA with a
"main returned to green at SHA {current_SHA}" comment.
5. Emit one Loki-shaped JSON line via `logger -t main-red-watchdog`
so `reference_obs_stack_phase1`'s Vector → Loki path ingests an
alert event (queryable in Grafana as
`{tenant="operator-host"} |~ "main-red-watchdog"`).
What it does NOT do:
- Auto-revert anything. Option B is explicitly rejected per
`feedback_no_such_thing_as_flakes` + `feedback_fix_root_not_symptom`.
- Page on its own failures. If api() raises ApiError (transient
Gitea outage), the workflow run fails LOUDLY by re-raise — exactly
the contract `feedback_api_helper_must_raise_not_return_dict`
enforces. Silent fallthrough would re-introduce the duplicate-issue
regression class.
- Exit non-zero on RED. The issue IS the alarm; failing the watchdog
on red would double-page (red workflow + open issue) and create
silent-loop risk if the watchdog itself flakes.
Idempotency strategy:
Title is keyed on `{SHA[:10]}` (commit-scoped), NOT just `main`.
Rationale:
- A fix-forward changes HEAD → next cron tick sees a new SHA;
auto-close logic closes the prior `[main-red] OLD_SHA` issue and
(if the new HEAD is also red, e.g. a different test fails) files
a fresh `[main-red] NEW_SHA`. Lineage is preserved.
- A revert that happens to land back on a previously-red SHA
(rare) would refer to a CLOSED issue; the watchdog never reopens.
That's a deliberate trade-off — the operator will see the latest
open issue's `closed` event in the activity feed.
This module is import-safe: tests import individual functions without
invoking main(), so module-level reads use env-with-default and the
runtime contract enforcement lives in `_require_runtime_env()`.
Run locally (dry-run, no API mutation):
GITEA_TOKEN=... GITEA_HOST=git.moleculesai.app REPO=owner/repo \\
WATCH_BRANCH=main RED_LABEL=tier:high \\
python3 .gitea/scripts/main-red-watchdog.py --dry-run
"""
from __future__ import annotations
import argparse
import json
import os
import shutil
import subprocess
import sys
import time
import urllib.error
import urllib.parse
import urllib.request
from typing import Any
# --------------------------------------------------------------------------
# Environment
# --------------------------------------------------------------------------
def _env(key: str, *, default: str = "") -> str:
"""Read an env var with a default. Module-import-safe — tests can
import this script without setting the full env contract."""
return os.environ.get(key, default)
GITEA_TOKEN = _env("GITEA_TOKEN")
GITEA_HOST = _env("GITEA_HOST")
REPO = _env("REPO")
WATCH_BRANCH = _env("WATCH_BRANCH", default="main")
RED_LABEL = _env("RED_LABEL", default="tier:high")
OWNER, NAME = (REPO.split("/", 1) + [""])[:2] if REPO else ("", "")
API = f"https://{GITEA_HOST}/api/v1" if GITEA_HOST else ""
# Title prefix — kept short and stable so the idempotency search can
# match by exact title without parsing.
TITLE_PREFIX = "[main-red]"
# Settling window (seconds) between initial red detection and the
# pre-file recheck. The recheck filters out the two largest false-
# positive classes seen in mc#1597..1630 (task #394, 2026-05-21):
# 1. HEAD moved on (a new commit landed mid-tick) — the prior red SHA
# is no longer authoritative; let the next cron tick re-evaluate.
# 2. Combined status recovered on the SAME SHA (transient
# cancel-cascade rolled forward to success on retry).
# 90s is well below the hourly cron cadence; a real failure that
# persists past it is the one we want surfaced.
# Override with WATCHDOG_RECHECK_DELAY_SECS for tests / local probes
# (the test suite stubs time.sleep to a no-op).
RECHECK_DELAY_SECS = int(_env("WATCHDOG_RECHECK_DELAY_SECS", default="90"))
def _require_runtime_env() -> None:
"""Enforce env contract — called from `main()` only.
Tests import individual functions without setting the full env
contract. Mirrors the CP `ci-required-drift.py` pattern so the
runtime guard is a single chokepoint.
"""
for key in ("GITEA_TOKEN", "GITEA_HOST", "REPO", "WATCH_BRANCH", "RED_LABEL"):
if not os.environ.get(key):
sys.stderr.write(f"::error::missing required env var: {key}\n")
sys.exit(2)
# --------------------------------------------------------------------------
# Tiny HTTP helper — raises on non-2xx + on JSON-decode-of-expected-JSON.
# --------------------------------------------------------------------------
class ApiError(RuntimeError):
"""Raised when a Gitea API call cannot be trusted to have succeeded.
Covers non-2xx HTTP status AND 2xx with an unparseable JSON body on
endpoints documented to return JSON. Callers that swallow this and
proceed risk e.g. creating duplicate `[main-red]` issues when a
transient 500 hides an existing match. Per
`feedback_api_helper_must_raise_not_return_dict`: soft-failure is
opt-in via `expect_json=False`, never the default.
"""
def api(
method: str,
path: str,
*,
body: dict | None = None,
query: dict[str, str] | None = None,
expect_json: bool = True,
) -> tuple[int, Any]:
"""Tiny HTTP helper around urllib.
Raises ApiError on any non-2xx response, and on JSON-decode failure
when `expect_json=True` (the default for read-shaped paths). Mirrors
the CP ci-required-drift.py contract exactly so behaviour is
cross-checkable.
"""
url = f"{API}{path}"
if query:
url = f"{url}?{urllib.parse.urlencode(query)}"
data = None
headers = {
"Authorization": f"token {GITEA_TOKEN}",
"Accept": "application/json",
}
if body is not None:
data = json.dumps(body).encode("utf-8")
headers["Content-Type"] = "application/json"
req = urllib.request.Request(url, method=method, data=data, headers=headers)
try:
with urllib.request.urlopen(req, timeout=30) as resp:
raw = resp.read()
status = resp.status
except urllib.error.HTTPError as e:
raw = e.read()
status = e.code
if not (200 <= status < 300):
snippet = raw[:500].decode("utf-8", errors="replace") if raw else ""
raise ApiError(f"{method} {path} → HTTP {status}: {snippet}")
if not raw:
return status, None
try:
return status, json.loads(raw)
except json.JSONDecodeError as e:
if expect_json:
raise ApiError(
f"{method} {path} → HTTP {status} but body is not JSON: {e}"
) from e
# Opt-in raw fallthrough for endpoints with known echo-quirks
# (`feedback_gitea_create_api_unparseable_response`). Caller
# MUST verify success via a follow-up GET, not by trusting body.
return status, {"_raw": raw.decode("utf-8", errors="replace")}
# --------------------------------------------------------------------------
# action_run.status resolver — extensibility hook for task #394.
# --------------------------------------------------------------------------
def _resolve_action_run_status(target_url: str) -> int | None:
"""Resolve the underlying Gitea `action_run.status` integer for the
run referenced by `target_url`, returning None if the resolver
cannot reach an authoritative source from the runner.
Canonical Gitea 1.22.6 enum (per `models/actions/status.go` +
`reference_gitea_action_status_enum_corrected_2026_05_19`):
1=Success, 2=Failure, 3=Cancelled, 4=Skipped,
5=Waiting, 6=Running, 7=Blocked
Only `status == 2` is a real defect; status=3 is cancel-cascade and
status=1 is an emission artifact (Gitea wrote a 'failure' commit_status
row for a run that actually succeeded — observed empirically on
`publish-canvas-image` jobs at SHAs in mc#1597..1630).
CURRENT STATE (2026-05-20, verified): Gitea 1.22.6 exposes NO REST
endpoint for `action_run.status`. Probed:
/api/v1/repos/{o}/{r}/actions/runs/{id} → HTTP 404
/api/v1/repos/{o}/{r}/actions/jobs/{id} → HTTP 404
/api/v1/repos/{o}/{r}/actions/tasks/{id} → HTTP 404
/swagger.v1.json paths containing 'actions' → secrets+variables+runners only
The SPA backend (`/{repo}/actions/runs/{id}/jobs/{idx}` POST) requires
a session CSRF token, unreachable from a runner. The only authoritative
source today is direct DB access (`mol_action_status` on op-host,
`docker exec molecule-postgres-1 psql ...`), which the runner cannot
reach.
Therefore: this hook returns None on every call. Callers MUST fall
back to the description-string filter (existing) plus the HEAD
recheck (this PR). When a future Gitea release (>=1.23 expected) or
an op-host proxy exposes the endpoint, replace the body of this
function with an `api(...)` call — the caller contract is stable.
See also:
- `reference_chronic_red_sweep_cancelled_vs_failed_filter`
- `feedback_gitea_status_enum_use_helper_not_raw_int`
"""
_ = target_url # noqa: F841 — intentional placeholder
return None
# --------------------------------------------------------------------------
# Gitea reads
# --------------------------------------------------------------------------
def get_head_sha(branch: str) -> str:
"""HEAD SHA of `branch`. Raises ApiError on non-2xx."""
_, body = api("GET", f"/repos/{OWNER}/{NAME}/branches/{branch}")
if not isinstance(body, dict):
raise ApiError(f"branch {branch} response not a JSON object")
commit = body.get("commit")
if not isinstance(commit, dict):
raise ApiError(f"branch {branch} response missing `commit` object")
sha = commit.get("id") or commit.get("sha")
if not isinstance(sha, str) or len(sha) < 7:
raise ApiError(f"branch {branch} response has no usable commit SHA")
return sha
def get_combined_status(sha: str) -> dict:
"""Combined commit status for `sha`. Gitea returns:
{
"state": "success" | "failure" | "pending" | "error",
"statuses": [
{"context": "...", "state": "success|failure|pending|error",
"target_url": "...", "description": "..."},
...
],
...
}
Raises ApiError on non-2xx.
"""
_, body = api("GET", f"/repos/{OWNER}/{NAME}/commits/{sha}/status")
if not isinstance(body, dict):
raise ApiError(f"status for {sha} response not a JSON object")
return body
def is_red(status: dict) -> tuple[bool, list[dict]]:
"""Return (is_red, failed_statuses).
A commit is "red" if combined state is `failure` OR any individual
status entry is in {`failure`, `error`}. `pending` and `success`
do not trip the watchdog — pending means CI is still running, and
that's the normal state immediately after a merge.
`failed_statuses` is the list of per-context entries whose own
`state` is in the red set; useful for the issue body.
Cancel-cascade filter (mc#1564, 2026-05-19):
Gitea maps BOTH `action_run.status=2 (Failure)` AND
`action_run.status=3 (Cancelled)` to commit-status string
`"failure"`. On a busy main with
`concurrency: cancel-in-progress: true`, every merge burst
cancels prior in-flight runs (status=3) — those bubble to the
combined-status `failure` and inflate the watchdog's red%,
generating phantom `[main-red]` issues (mc#1562/#1552/#1540/...).
Canonical Gitea 1.22.6 enum per `models/actions/status.go` +
`reference_gitea_action_status_enum_corrected_2026_05_19`:
1=Success, 2=Failure, 3=Cancelled, 4=Skipped,
5=Waiting, 6=Running, 7=Blocked
We only want status=2 (real defects) to file. At the
commit-status layer we don't have the integer enum directly
(only the `failure` rollup string), so we use the description
string Gitea writes when a run is cancelled — empirically
`"Has been cancelled"` (verified 2026-05-19 via #1562 body).
Real failures show `"Failing after Ns"` and are unaffected.
This is option B from mc#1564 (description-string filter, no
extra API call). Description-string stability is a soft contract
with Gitea; if a future release renames it, the cancel-cascade
entries will simply leak back through (visible-not-silent), and
we'll either re-pin the string or upgrade to option A (resolve
the underlying action_run.status integer via target_url).
"""
combined = status.get("state")
statuses = status.get("statuses") or []
red_states = {"failure", "error"}
# Schema asymmetry: top-level combined uses `state`, but per-entry
# items in `statuses[]` use `status` in Gitea 1.22.6. Prefer
# `status`; fall back to `state` defensively. Verified empirically
# 2026-05-12 03:42Z. Pre-rev4 code only read `state` from per-entry
# items → failed[] always empty → render_body always showed the
# "no per-context entries were in a red state" fallback even when
# the combined-state correctly flagged red. See
# `feedback_smoke_test_vendor_truth_not_shape_match`.
def _entry_state(s: dict) -> str:
return s.get("status") or s.get("state") or ""
def _is_cancel_cascade(s: dict) -> bool:
"""status=3 entry per Gitea 1.22.6 description-string contract.
Match exactly (after strip) — substring match would catch
legitimate test names like "Has been cancelled by the user
unexpectedly" in failure logs."""
desc = (s.get("description") or "").strip()
return desc == "Has been cancelled"
failed = [
s for s in statuses
if isinstance(s, dict)
and _entry_state(s) in red_states
and not _is_cancel_cascade(s)
]
# Combined state alone is no longer sufficient — combined=failure
# may be 100% cancel-cascade. Drive `red` off the FILTERED list:
# if every red-shaped per-entry was cancel-cascade, `failed` is
# empty and we report green. Combined-failure with no per-entry
# detail (empty `statuses[]`) still trips red — that's the
# "CI emitter set combined-status directly" edge case from
# render_body's fallback path; we keep filing on it so the
# operator sees the breadcrumb.
combined_red_no_detail = combined in red_states and not statuses
return (bool(failed) or combined_red_no_detail, failed)
# --------------------------------------------------------------------------
# Issue file / update / close
# --------------------------------------------------------------------------
def title_for(sha: str) -> str:
"""Idempotency key — `[main-red] {repo}: {SHA[:10]}`.
Commit-scoped. A fix-forward to a new SHA produces a new title; the
prior issue auto-closes via `close_open_red_issues_for_other_shas`.
"""
return f"{TITLE_PREFIX} {REPO}: {sha[:10]}"
def list_open_red_issues() -> list[dict]:
"""All open issues whose title starts with `[main-red] {repo}: `.
Per Five-Axis review on CP#112 (`feedback_api_helper_must_raise_not_return_dict`):
api() raises on non-2xx; we let it propagate. Returning [] on a
transient 500 would cause auto-close to skip the cleanup AND the
file-or-update path to POST a duplicate — exactly the regression
class the helper-raises contract closes.
Gitea issue search returns at most 50/page; we only need open
`[main-red]` issues which are by design ≤ 1 at any time per repo,
so a single page is enough.
"""
_, results = api(
"GET",
f"/repos/{OWNER}/{NAME}/issues",
query={"state": "open", "type": "issues", "limit": "50"},
)
if not isinstance(results, list):
raise ApiError(
f"issue search returned non-list body (got {type(results).__name__})"
)
prefix = f"{TITLE_PREFIX} {REPO}: "
return [i for i in results if isinstance(i, dict)
and isinstance(i.get("title"), str)
and i["title"].startswith(prefix)]
def find_open_issue_for_sha(sha: str) -> dict | None:
"""Return the existing open `[main-red] {repo}: {SHA[:10]}` issue,
or None if no such issue is open.
`None` means "search succeeded, no match" — NOT "search failed".
api() raises ApiError on any non-2xx; the caller can let that
propagate so a transient outage fails loudly instead of silently
duplicating.
"""
target = title_for(sha)
for issue in list_open_red_issues():
if issue.get("title") == target:
return issue
return None
def render_body(sha: str, failed: list[dict], debug: dict) -> str:
"""Issue body. Markdown. Mirrors CP#112's render_body shape."""
lines = [
f"# Main is RED on `{REPO}` at `{sha[:10]}`",
"",
f"Commit: <https://{GITEA_HOST}/{REPO}/commit/{sha}>",
"",
"Auto-filed by `.gitea/workflows/main-red-watchdog.yml` (Option C "
"of the [main-never-red directive]"
f"(https://{GITEA_HOST}/molecule-ai/molecule-core/issues/420)). "
"Per `feedback_no_such_thing_as_flakes` + "
"`feedback_fix_root_not_symptom`: investigate the root cause; do "
"NOT revert as a reflex. The watchdog itself never reverts.",
"",
"## Failed status contexts",
"",
]
if not failed:
lines.append(
"_(Combined state reported `failure`/`error` but no per-context "
"entries were in a red state. This usually means a CI emitter "
"set combined-status directly without a per-context status. "
"Check the most recent workflow run for `main` and trace from "
"there.)_"
)
else:
for s in failed:
ctx = s.get("context", "(no context)")
# Per-entry key is `status` in Gitea 1.22.6, not `state`
# (see _entry_state in is_red). Fallback for forward-compat.
state = s.get("status") or s.get("state") or "(no state)"
url = s.get("target_url") or ""
desc = (s.get("description") or "").strip()
entry = f"- **{ctx}** — `{state}`"
if url:
entry += f" → [logs]({url})"
if desc:
entry += f"\n - {desc}"
lines.append(entry)
lines.extend([
"",
"## Resolution path",
"",
"1. Read the failed logs (links above).",
"2. If reproducible locally, fix forward in a PR targeting `main`.",
"3. If the failure is a real flake — STOP. Per "
"`feedback_no_such_thing_as_flakes`, intermittent failures are "
"real bugs. Investigate to root cause; do not mark as flake.",
"4. If the failure is blocking unrelated work for >1 hour, file a "
"follow-up issue and assign someone. Do NOT revert without a "
"human GO per `feedback_prod_apply_needs_hongming_chat_go` "
"(branch protection is a prod surface).",
"",
"## Debug",
"",
"```json",
json.dumps(debug, indent=2, sort_keys=True),
"```",
"",
"_This issue is idempotent: the watchdog runs hourly at `:05` "
"and edits this body in place. When `main` returns to green, the "
"watchdog will close this issue automatically with a "
"\"main returned to green\" comment._",
])
return "\n".join(lines)
def emit_loki_event(event_type: str, sha: str, failed_contexts: list[str]) -> None:
"""Emit a JSON line to syslog tag `main-red-watchdog` for
`reference_obs_stack_phase1` (Vector → Loki).
Best-effort: if `logger` isn't on PATH (e.g. local dev macOS without
util-linux logger), print to stderr instead. The Gitea Actions
Ubuntu runner has util-linux preinstalled.
Loki labels: the workflow runs on the Ubuntu runner where Vector is
NOT configured (Vector lives on the operator host + tenants per
`reference_obs_stack_phase1`). The Loki line is still emitted as
stdout JSON so the workflow log itself is parseable; treat the
syslog call as belt-and-braces for the cases where this script is
invoked from a host that DOES have Vector (e.g. operator-host cron
fallback in a follow-up PR).
"""
payload = {
"event_type": event_type,
"repo": REPO,
"sha": sha,
"failed_contexts": failed_contexts,
}
line = json.dumps(payload, sort_keys=True)
# Always print to stdout so the workflow log captures it (machine-
# readable; `gitea run logs` + Loki ingestion via the operator-host
# journald → Vector → Loki path will see this from runners that
# forward stdout). Loki query:
# {source="gitea-actions"} |~ "main_red_detected"
print(f"main-red-watchdog event: {line}")
# Best-effort syslog tag so a future "run from operator-host cron"
# path picks it up directly via the existing Vector pipeline.
if shutil.which("logger"):
try:
subprocess.run(
["logger", "-t", "main-red-watchdog", line],
check=False,
timeout=5,
)
except (OSError, subprocess.SubprocessError) as e:
sys.stderr.write(f"::warning::logger call failed: {e}\n")
def file_or_update_red(
sha: str,
failed: list[dict],
debug: dict,
*,
dry_run: bool = False,
) -> None:
"""Open a new `[main-red] {repo}: {SHA[:10]}` issue, or PATCH the
existing one's body. Idempotent by title."""
title = title_for(sha)
body = render_body(sha, failed, debug)
if dry_run:
print(f"::notice::[dry-run] would file/update main-red issue for {sha[:10]}")
print("::group::[dry-run] title")
print(title)
print("::endgroup::")
print("::group::[dry-run] body")
print(body)
print("::endgroup::")
return
existing = find_open_issue_for_sha(sha)
if existing:
num = existing["number"]
api("PATCH", f"/repos/{OWNER}/{NAME}/issues/{num}", body={"body": body})
print(f"::notice::Updated existing main-red issue #{num} for {sha[:10]}")
return
_, created = api(
"POST",
f"/repos/{OWNER}/{NAME}/issues",
body={"title": title, "body": body, "labels": []},
)
if not isinstance(created, dict):
raise ApiError("POST issue response not a JSON object")
new_num = created.get("number")
print(f"::warning::Filed new main-red issue #{new_num} for {sha[:10]}")
# Apply RED_LABEL by id. Gitea's add-labels endpoint takes IDs, not
# names (`feedback_gitea_label_delete_by_id` — same rule for add).
# Best-effort: label failure is logged but does not fail the run.
try:
_, labels = api("GET", f"/repos/{OWNER}/{NAME}/labels")
except ApiError as e:
sys.stderr.write(f"::warning::could not list labels: {e}\n")
return
label_id = None
if isinstance(labels, list):
for lbl in labels:
if isinstance(lbl, dict) and lbl.get("name") == RED_LABEL:
label_id = lbl.get("id")
break
if label_id is not None and new_num:
try:
api(
"POST",
f"/repos/{OWNER}/{NAME}/issues/{new_num}/labels",
body={"labels": [label_id]},
)
except ApiError as e:
sys.stderr.write(
f"::warning::could not apply label '{RED_LABEL}' to #{new_num}: {e}\n"
)
else:
sys.stderr.write(f"::warning::label '{RED_LABEL}' not found on repo\n")
def close_open_red_issues_for_other_shas(
current_sha: str,
*,
dry_run: bool = False,
) -> int:
"""When main is green at current_sha, close any open `[main-red]`
issues whose title references a different SHA. Returns the number
of issues closed.
Lineage note: we only close issues whose title prefix matches; if
a human renamed the issue or added a suffix this won't touch it.
That's intentional — manual editorial state takes precedence.
"""
target_title = title_for(current_sha)
open_red = list_open_red_issues()
closed = 0
for issue in open_red:
if issue.get("title") == target_title:
# Same SHA — caller should not have invoked this if main is
# green. Skip defensively.
continue
num = issue.get("number")
if not isinstance(num, int):
continue
comment = (
f"`main` returned to green at SHA `{current_sha}` "
f"(<https://{GITEA_HOST}/{REPO}/commit/{current_sha}>). "
"Closing automatically. If the underlying root cause is "
"not yet understood, reopen this issue and file a "
"postmortem — green-by-flake is still a bug per "
"`feedback_no_such_thing_as_flakes`."
)
if dry_run:
print(f"::notice::[dry-run] would close issue #{num} ({issue.get('title')})")
closed += 1
continue
# Comment first, then close. Order matters: a closed issue can
# still receive comments, but the activity-feed ordering reads
# better with the explanation arriving just before the close.
api(
"POST",
f"/repos/{OWNER}/{NAME}/issues/{num}/comments",
body={"body": comment},
)
api(
"PATCH",
f"/repos/{OWNER}/{NAME}/issues/{num}",
body={"state": "closed"},
)
print(f"::notice::Closed main-red issue #{num} (green at {current_sha[:10]})")
closed += 1
return closed
# --------------------------------------------------------------------------
# Main
# --------------------------------------------------------------------------
def _parse_args(argv: list[str] | None = None) -> argparse.Namespace:
p = argparse.ArgumentParser(
prog="main-red-watchdog",
description="Detect post-merge CI red on the watched branch and "
"file an idempotent issue. Option C of the main-never-red directive.",
)
p.add_argument(
"--dry-run",
action="store_true",
help="Detect + print the would-be issue title/body to stdout; do "
"NOT POST/PATCH/close any issues. Useful for local testing.",
)
return p.parse_args(argv)
def run_once(*, dry_run: bool = False) -> int:
"""One watchdog tick. Returns 0 on green or red-issue-filed; lets
ApiError propagate on transient outage (workflow run fails loudly,
which is correct per the helper-raises contract)."""
sha = get_head_sha(WATCH_BRANCH)
status = get_combined_status(sha)
red, failed = is_red(status)
debug = {
"branch": WATCH_BRANCH,
"sha": sha,
"combined_state": status.get("state"),
"failed_contexts": [s.get("context") for s in failed],
"all_contexts": [
# Per-entry key is `status` in Gitea 1.22.6, not `state`.
# Pre-rev4 debug output reported `state: None` for every
# context, making run logs useless for triage.
{"context": s.get("context"),
"state": s.get("status") or s.get("state")}
for s in (status.get("statuses") or [])
if isinstance(s, dict)
],
}
if red:
# HEAD recheck (task #394 — guards mc#1597..1630 false-positive
# cluster). After the initial detection, wait RECHECK_DELAY_SECS
# (default 90s; tests stub time.sleep) and re-evaluate:
#
# 1. Re-fetch HEAD SHA. If HEAD moved, a new commit landed
# mid-tick — the prior red SHA is no longer authoritative
# and the next cron run will re-evaluate against the new
# HEAD. Skip-file.
#
# 2. If HEAD unchanged, re-fetch the combined status. If it
# recovered (combined state no longer in {failure,error}
# after the cancel-cascade filter), a transient retry
# rolled the run forward. Skip-file.
#
# Both paths emit a Loki event distinguishable from the real
# `main_red_detected` so obs queries can track filter activity.
# The settling window is well below the hourly cron cadence —
# genuine failures persist past it and are surfaced normally.
time.sleep(RECHECK_DELAY_SECS)
recheck_sha = get_head_sha(WATCH_BRANCH)
if recheck_sha != sha:
emit_loki_event("main_red_skipped_head_drift", sha, [])
print(
f"::notice::skip-file (HEAD moved): initial red at "
f"{sha[:10]} but HEAD is now {recheck_sha[:10]} on "
f"{WATCH_BRANCH}; next cron tick will re-evaluate."
)
return 0
recheck_status = get_combined_status(sha)
recheck_red, recheck_failed = is_red(recheck_status)
if not recheck_red:
emit_loki_event("main_red_skipped_recovered", sha, [])
print(
f"::notice::skip-file (recovered after settling): "
f"combined state at {sha[:10]} flipped to "
f"{recheck_status.get('state')!r} on recheck; "
f"initial red was a transient cancel-cascade."
)
return 0
# Still red after settling — file/update. Use the recheck data
# as authoritative so the issue body reflects the latest state.
failed = recheck_failed
debug["recheck_combined_state"] = recheck_status.get("state")
debug["recheck_failed_contexts"] = [
s.get("context") for s in failed
]
failed_ctxs = [s.get("context") for s in failed if s.get("context")]
emit_loki_event("main_red_detected", sha, failed_ctxs)
print(f"::warning::main is RED at {sha[:10]} on {WATCH_BRANCH}: "
f"{len(failed)} failed context(s)")
file_or_update_red(sha, failed, debug, dry_run=dry_run)
else:
# Green (or pending — pending is treated as not-red so we don't
# spam during the post-merge CI window). Close any stale issues
# from earlier SHAs only when we're actually green; pending
# means CI hasn't finished and the prior issue might still be
# accurate.
if status.get("state") == "success":
closed = close_open_red_issues_for_other_shas(sha, dry_run=dry_run)
if closed:
emit_loki_event(
"main_returned_to_green", sha,
[],
)
print(f"::notice::main is GREEN at {sha[:10]} on {WATCH_BRANCH} "
f"(closed {closed} stale issue(s))")
else:
print(f"::notice::main is PENDING at {sha[:10]} on {WATCH_BRANCH} "
f"(combined state={status.get('state')!r}; no action)")
return 0
def main(argv: list[str] | None = None) -> int:
args = _parse_args(argv)
_require_runtime_env()
return run_once(dry_run=args.dry_run)
if __name__ == "__main__":
sys.exit(main())
-257
View File
@@ -1,257 +0,0 @@
#!/usr/bin/env python3
"""Production auto-deploy helpers for Gitea Actions.
The workflow keeps network side effects in shell/curl, but centralizes the
release decision shape here so it has unit coverage: disable flag parsing,
target tag selection, CP payload construction, and status-context selection.
"""
from __future__ import annotations
import argparse
import json
import os
import sys
import time
import urllib.error
import urllib.request
from urllib.parse import quote
TRUE_VALUES = {"1", "true", "yes", "on", "disabled", "disable"}
PROD_CP_URL = "https://api.moleculesai.app"
DEFAULT_REQUIRED_CONTEXTS = [
"CI / Platform (Go) (push)",
"CI / Canvas (Next.js) (push)",
"CI / Shellcheck (E2E scripts) (push)",
"CI / Python Lint & Test (push)",
"CI / all-required (push)",
"Secret scan / Scan diff for credential-shaped strings (push)",
]
TERMINAL_FAILURE_STATES = {"failure", "error", "cancelled", "canceled", "skipped"}
def truthy_flag(value: str | None) -> bool:
if value is None:
return False
return value.strip().lower() in TRUE_VALUES
def _int_env(env: dict[str, str], name: str, default: int, minimum: int = 1) -> int:
raw = env.get(name, "")
if not raw:
return default
try:
value = int(raw)
except ValueError as exc:
raise ValueError(f"{name} must be an integer, got {raw!r}") from exc
if value < minimum:
raise ValueError(f"{name} must be >= {minimum}, got {value}")
return value
def build_plan(env: dict[str, str]) -> dict:
sha = env.get("GITHUB_SHA", "").strip()
if not sha:
raise ValueError("GITHUB_SHA is required")
disabled_value = env.get("PROD_AUTO_DEPLOY_DISABLED", "")
if truthy_flag(disabled_value):
return {
"enabled": False,
"sha": sha,
"disabled_reason": f"PROD_AUTO_DEPLOY_DISABLED={disabled_value}",
}
short_sha = sha[:7]
target_tag = env.get("PROD_AUTO_DEPLOY_TARGET_TAG", "").strip() or f"staging-{short_sha}"
canary_slug = env.get("PROD_AUTO_DEPLOY_CANARY_SLUG", "hongming").strip()
body = {
"target_tag": target_tag,
"soak_seconds": _int_env(env, "PROD_AUTO_DEPLOY_SOAK_SECONDS", 60, minimum=0),
"batch_size": _int_env(env, "PROD_AUTO_DEPLOY_BATCH_SIZE", 3),
"dry_run": truthy_flag(env.get("PROD_AUTO_DEPLOY_DRY_RUN", "")),
# confirm:true ack required by CP /cp/admin/tenants/redeploy-fleet
# contract (cp#228 / task #308) for fleet-wide intent. Empty body
# / {confirm:false} / {only_slugs:[]} → 400. This caller is the
# production auto-deploy step that rolls every live tenant (canary
# + fan-out), no slug scoping, so confirm:true is correct.
"confirm": True,
}
if canary_slug:
body["canary_slug"] = canary_slug
cp_url = env.get("CP_URL", "").strip() or PROD_CP_URL
if cp_url != PROD_CP_URL and not truthy_flag(env.get("PROD_ALLOW_NON_PROD_CP_URL", "")):
raise ValueError(
f"Refusing production deploy to CP_URL={cp_url!r}; "
f"set PROD_ALLOW_NON_PROD_CP_URL=true for an explicit non-prod drill"
)
return {
"enabled": True,
"sha": sha,
"short_sha": short_sha,
"target_tag": target_tag,
"cp_url": cp_url,
"body": body,
}
def latest_status_for_context(statuses: list[dict], context: str) -> dict | None:
"""Return the first matching status.
Gitea's combined-status response is newest-first in practice. The merge
queue relies on the same contract; keeping the selector explicit makes
stale duplicate contexts easy to test.
"""
for status in statuses:
if status.get("context") == context:
return status
return None
def ci_context_state(statuses: list[dict], context: str) -> str:
status = latest_status_for_context(statuses, context)
if not status:
return "missing"
return str(status.get("status") or status.get("state") or "missing").lower()
def context_is_satisfied(state: str) -> bool:
return state == "success"
def context_is_terminal_failure(state: str) -> bool:
return state in TERMINAL_FAILURE_STATES
def required_contexts(env: dict[str, str]) -> list[str]:
raw = env.get("PROD_AUTO_DEPLOY_REQUIRED_CONTEXTS", "")
if not raw.strip():
return DEFAULT_REQUIRED_CONTEXTS
return [line.strip() for line in raw.replace(",", "\n").splitlines() if line.strip()]
def _api_json(url: str, token: str) -> dict:
req = urllib.request.Request(url, headers={"Authorization": f"token {token}"})
try:
with urllib.request.urlopen(req, timeout=20) as resp:
return json.loads(resp.read())
except urllib.error.HTTPError as exc:
body = exc.read().decode("utf-8", errors="replace")[:500]
raise RuntimeError(f"GET {url} -> HTTP {exc.code}: {body}") from exc
def _api_json_optional(url: str, token: str) -> tuple[int, dict | None]:
req = urllib.request.Request(url, headers={"Authorization": f"token {token}"})
try:
with urllib.request.urlopen(req, timeout=20) as resp:
return resp.status, json.loads(resp.read())
except urllib.error.HTTPError as exc:
if exc.code == 404:
return exc.code, None
body = exc.read().decode("utf-8", errors="replace")[:300]
print(f"::warning::GET {url} -> HTTP {exc.code}: {body}", file=sys.stderr)
return exc.code, None
def live_disable_flag(env: dict[str, str]) -> str:
"""Return a live disable value from Gitea variables when readable.
Gitea evaluates `${{ vars.* }}` once when the job starts. This API read is
the emergency re-check immediately before production side effects.
"""
token = env.get("GITEA_TOKEN", "").strip()
if not token:
return ""
host = env.get("GITEA_HOST", "git.moleculesai.app")
repo = env.get("GITHUB_REPOSITORY", "molecule-ai/molecule-core")
variable = quote("PROD_AUTO_DEPLOY_DISABLED", safe="")
url = f"https://{host}/api/v1/repos/{repo}/actions/variables/{variable}"
status, body = _api_json_optional(url, token)
if status != 200 or not isinstance(body, dict):
return ""
return str(body.get("data") or body.get("value") or "")
def assert_not_disabled(env: dict[str, str]) -> None:
plan = build_plan(env)
if not plan.get("enabled"):
raise RuntimeError(plan.get("disabled_reason", "production auto-deploy disabled"))
live_value = live_disable_flag(env)
if truthy_flag(live_value):
raise RuntimeError(f"PROD_AUTO_DEPLOY_DISABLED={live_value} (live Gitea variable)")
def wait_for_ci_context(env: dict[str, str]) -> str:
host = env.get("GITEA_HOST", "git.moleculesai.app")
repo = env.get("GITHUB_REPOSITORY", "molecule-ai/molecule-core")
sha = env.get("GITHUB_SHA", "").strip()
token = env.get("GITEA_TOKEN", "").strip()
contexts = required_contexts(env)
interval = _int_env(env, "CI_STATUS_POLL_INTERVAL_SECONDS", 15)
timeout = _int_env(env, "CI_STATUS_TIMEOUT_SECONDS", 1800)
if not sha:
raise ValueError("GITHUB_SHA is required")
if not token:
raise ValueError("GITEA_TOKEN is required to wait for CI status")
url = f"https://{host}/api/v1/repos/{repo}/commits/{sha}/status"
deadline = time.time() + timeout
last_states: dict[str, str] = {}
while time.time() <= deadline:
body = _api_json(url, token)
statuses = body.get("statuses") or []
states = {context: ci_context_state(statuses, context) for context in contexts}
for context, state in states.items():
if state != last_states.get(context):
print(f"CI context {context!r}: {state}", file=sys.stderr)
last_states = states
failures = [
f"{context}={state}"
for context, state in states.items()
if context_is_terminal_failure(state)
]
if failures:
raise RuntimeError(
"Required CI context failed; refusing production deploy: "
+ ", ".join(failures)
)
if all(context_is_satisfied(state) for state in states.values()):
return "success"
time.sleep(interval)
last = ", ".join(f"{context}={state}" for context, state in last_states.items()) or "none"
raise TimeoutError(f"Timed out waiting {timeout}s for required CI contexts; last_states={last}")
def main() -> int:
parser = argparse.ArgumentParser(description=__doc__)
sub = parser.add_subparsers(dest="command", required=True)
sub.add_parser("plan", help="print production deploy plan as JSON")
sub.add_parser("assert-enabled", help="fail if production deploy is currently disabled")
sub.add_parser("wait-ci", help="block until required CI context is green")
args = parser.parse_args()
try:
if args.command == "plan":
print(json.dumps(build_plan(dict(os.environ)), sort_keys=True))
return 0
if args.command == "assert-enabled":
assert_not_disabled(dict(os.environ))
return 0
if args.command == "wait-ci":
wait_for_ci_context(dict(os.environ))
return 0
except Exception as exc: # noqa: BLE001 - CLI should render operator-friendly errors.
print(f"::error::{exc}", file=sys.stderr)
return 1
return 2
if __name__ == "__main__":
raise SystemExit(main())
-42
View File
@@ -1,42 +0,0 @@
#!/usr/bin/env python3
"""Extract changed-file list from a Gitea push event's commits JSON array.
Each commit in a push event has `added`, `removed`, and `modified` file lists.
This script aggregates all of them and prints unique filenames one per line.
Usage:
push-commits-diff-files.py < COMMITS_JSON
Exits 0 always (caller handles empty output as "no files").
"""
from __future__ import annotations
import sys
import json
def main() -> None:
try:
data = json.load(sys.stdin)
except Exception:
sys.exit(0) # Don't fail the step — treat malformed JSON as empty
if not isinstance(data, list):
sys.exit(0)
files: set[str] = set()
for commit in data:
if not isinstance(commit, dict):
continue
for key in ("added", "removed", "modified"):
for f in commit.get(key) or []:
if isinstance(f, str) and f:
files.add(f)
if files:
sys.stdout.write("\n".join(sorted(files)))
sys.stdout.write("\n")
if __name__ == "__main__":
main()
-322
View File
@@ -1,322 +0,0 @@
#!/usr/bin/env bash
# review-check — evaluate whether a PR satisfies a single team-review gate.
#
# RFC#324 Step 1 of 5 — qa-review + security-review check workflows.
#
# This is the shared evaluator invoked by:
# .gitea/workflows/qa-review.yml (TEAM=qa, TEAM_ID=20)
# .gitea/workflows/security-review.yml (TEAM=security, TEAM_ID=21)
#
# Pass condition (per RFC#324 v1.1 addendum):
# ≥ 1 review on the PR where:
# • state == APPROVED
# • review.dismissed == false
# • review.user.login != PR.user.login (non-author)
# • review.user.login ∈ team-members
#
# Strict mode (default OFF for v1; see RFC trade-off note):
# If REVIEW_CHECK_STRICT=1, additionally require review.commit_id == PR.head.sha.
# With dismiss_stale_reviews: true at the protection layer, stale reviews
# are already dismissed, so the additional commit_id check is belt-and-
# suspenders. Keeping it off in v1 simplifies semantics; flip in a follow-up
# PR if reviewer telemetry shows residual stale-APPROVE merges.
#
# Privilege gate (RFC#324 v1.3 §A1.1 — INFORMATIONAL ONLY):
# The /qa-recheck and /security-recheck slash-commands can be triggered
# by anyone who can comment on the PR. The workflow's privilege step
# logs collaborator-status but does NOT gate execution of this script.
# Why this is safe: this evaluator is read-only and idempotent —
# reading `pulls/{N}/reviews` and `teams/{id}/members/{u}` can't be
# influenced by who triggered the run. If a real team-member APPROVE
# exists the gate flips green; otherwise it stays red. A
# non-collaborator commenting /qa-recheck cannot manufacture a green
# gate. Original (v1.2) design with `if:`-gating of this step was
# fail-open (skipped-via-`if:` job still publishes the status as
# `success`) — corrected in v1.3 per hongming-pc review 1421.
#
# Trust boundary (RFC A4):
# This script is loaded from the BASE branch (sourced via .gitea/scripts/
# on the workflow's checkout-of-base). It does NOT execute any PR-HEAD
# code. It only reads PR review state via the Gitea API.
#
# Token scope (RFC A1-α):
# The job's own conclusion (exit 0 / exit 1) is what publishes the
# `qa-review / approved` / `security-review / approved` status context.
# NO `POST /statuses` call here → NO `write:repository` scope on the
# token. `read:organization` (for team-membership probe) and
# `read:repository` (for PR + reviews) are enough.
#
# Required env:
# GITEA_TOKEN — least-priv read:repository + read:organization. See note
# below about the team-membership API requiring the token
# owner to be in the queried team (Gitea 1.22.6 quirk).
# GITEA_HOST — e.g. git.moleculesai.app
# REPO — owner/name (from github.repository)
# PR_NUMBER — int (from github.event.pull_request.number or
# github.event.issue.number for issue_comment events)
# TEAM — short team name (qa | security) for log lines
# TEAM_ID — Gitea team id (20=qa, 21=security at time of writing)
#
# Optional:
# REVIEW_CHECK_DEBUG=1 — per-API-call diagnostic lines
# REVIEW_CHECK_STRICT=1 — also require review.commit_id == pr.head.sha
# DEFAULT_BRANCH=main — branch this gate protects; non-default-base PRs no-op
set -euo pipefail
# jq is required for JSON parsing. It is pre-baked into the runner-base
# image (per RFC#268 workflow-smoke), so the only reason we'd not find it
# is a broken runner. The previous fallback dance (apt-get + curl to
# /usr/local/bin/jq) cannot succeed on a uid-1001 rootless runner
# (#391/#402 + feedback_ci_runner_install_needs_writable_path), so it's
# dropped. Fail loud with a clear diagnostic rather than attempt an
# install that physically cannot work.
if ! command -v jq >/dev/null 2>&1; then
echo "::error::jq missing from runner-base image — bake it into the runner image (see RFC#268 workflow-smoke / feedback_ci_runner_install_needs_writable_path). This evaluator cannot run without jq."
exit 1
fi
: "${GITEA_TOKEN:?GITEA_TOKEN required}"
: "${GITEA_HOST:?GITEA_HOST required}"
: "${REPO:?REPO required (owner/name)}"
: "${PR_NUMBER:?PR_NUMBER required}"
: "${TEAM:?TEAM required (qa|security)}"
: "${TEAM_ID:?TEAM_ID required (integer)}"
OWNER="${REPO%%/*}"
NAME="${REPO##*/}"
API="https://${GITEA_HOST}/api/v1"
# Token-in-argv fix (#541): write the Authorization header to a mode-600
# temp file instead of passing it via curl -H "$AUTH" (which puts the
# secret token value in the process table for any process to read via
# /proc/<pid>/cmdline or ps -ef). The curl config file is read by curl
# itself and never appears in the argv of the curl subprocess.
CURL_AUTH_FILE=$(mktemp "${TMPDIR:-/tmp}/curl-auth.XXXXXX")
chmod 600 "$CURL_AUTH_FILE"
printf 'header = "Authorization: token %s"\n' "$GITEA_TOKEN" > "$CURL_AUTH_FILE"
# Pre-create temp files so cleanup trap can reference them by name
# (bash trap 'function' EXIT expands variables at trap-fire time, not def time).
PR_JSON=$(mktemp)
REVIEWS_JSON=$(mktemp)
COMMENTS_JSON=$(mktemp)
TEAM_PROBE_TMP=$(mktemp)
NA_STATUSES_TMP="" # declared here so cleanup() always has the var
cleanup() {
rm -f "$CURL_AUTH_FILE" "$PR_JSON" "$REVIEWS_JSON" "$COMMENTS_JSON" "$TEAM_PROBE_TMP" "${NA_STATUSES_TMP-}"
}
trap cleanup EXIT
debug() {
if [ "${REVIEW_CHECK_DEBUG:-}" = "1" ]; then
echo " [debug] $*" >&2
fi
}
echo "::notice::${TEAM}-review evaluating repo=${OWNER}/${NAME} pr=${PR_NUMBER} team_id=${TEAM_ID}"
# --- Fetch the PR (for author + head.sha) ---
HTTP_CODE=$(curl -sS -o "$PR_JSON" -w '%{http_code}' \
-K "$CURL_AUTH_FILE" "${API}/repos/${OWNER}/${NAME}/pulls/${PR_NUMBER}")
if [ "$HTTP_CODE" != "200" ]; then
echo "::error::GET /pulls/${PR_NUMBER} returned HTTP ${HTTP_CODE} (token scope?)"
cat "$PR_JSON" >&2
exit 1
fi
PR_AUTHOR=$(jq -r '.user.login // ""' "$PR_JSON")
PR_HEAD_SHA=$(jq -r '.head.sha // ""' "$PR_JSON")
PR_BASE_REF=$(jq -r '.base.ref // ""' "$PR_JSON")
PR_STATE=$(jq -r '.state // ""' "$PR_JSON")
DEFAULT_BRANCH="${DEFAULT_BRANCH:-main}"
debug "pr_author=${PR_AUTHOR} pr_head=${PR_HEAD_SHA:0:7} pr_base=${PR_BASE_REF} pr_state=${PR_STATE}"
if [ "$PR_STATE" != "open" ]; then
echo "::notice::PR ${PR_NUMBER} is ${PR_STATE} — exiting 0 (closed PRs do not gate)"
exit 0
fi
if [ "$PR_BASE_REF" != "$DEFAULT_BRANCH" ]; then
echo "::notice::PR ${PR_NUMBER} targets ${PR_BASE_REF:-<unknown>} not ${DEFAULT_BRANCH}${TEAM}-review gate not applicable"
exit 0
fi
if [ -z "$PR_AUTHOR" ] || [ -z "$PR_HEAD_SHA" ]; then
echo "::error::PR ${PR_NUMBER} missing user.login or head.sha — webhook payload malformed"
exit 1
fi
# --- RFC#324 §N/A follow-up: check N/A declarations status ---
# sop-checklist.py posts `sop-checklist / na-declarations (pull_request)`
# status when a peer posts /sop-n/a <gate>. If our gate is declared N/A,
# the requirement for a Gitea APPROVE review is waived.
NA_STATUSES_TMP=$(mktemp)
HTTP_CODE=$(curl -sS -o "$NA_STATUSES_TMP" -w '%{http_code}' \
-K "$CURL_AUTH_FILE" "${API}/repos/${OWNER}/${NAME}/statuses/${PR_HEAD_SHA}")
debug "statuses/${PR_HEAD_SHA} → HTTP ${HTTP_CODE}"
if [ "$HTTP_CODE" = "200" ]; then
# Gitea returns statuses as array; look for the na-declarations context.
# jq: find all statuses where context == "sop-checklist / na-declarations (pull_request)"
# and state == "success". Extract the description field.
NA_DESC=$(jq -r '
.[] |
select(.context == "sop-checklist / na-declarations (pull_request)") |
select(.state == "success") |
.description
' "$NA_STATUSES_TMP" 2>/dev/null | head -1)
if [ -n "$NA_DESC" ] && [ "$NA_DESC" != "null" ]; then
debug "na-declarations status found: ${NA_DESC}"
# Check if our gate appears in the N/A description.
# The description format is "N/A: qa-review, security-review" or similar.
if echo "$NA_DESC" | grep -iq "\\b${TEAM}-review\\b"; then
echo "::notice::${TEAM}-review N/A — gate declared not-applicable via /sop-n/a: ${NA_DESC}"
echo "::notice::PR ${PR_NUMBER} passes ${TEAM}-review via N/A declaration"
rm -f "$NA_STATUSES_TMP"
exit 0
fi
fi
else
debug "could not fetch statuses (HTTP ${HTTP_CODE}) — proceeding with normal eval"
fi
rm -f "$NA_STATUSES_TMP"
# --- Fetch all reviews on the PR ---
HTTP_CODE=$(curl -sS -o "$REVIEWS_JSON" -w '%{http_code}' \
-K "$CURL_AUTH_FILE" "${API}/repos/${OWNER}/${NAME}/pulls/${PR_NUMBER}/reviews")
if [ "$HTTP_CODE" != "200" ]; then
echo "::error::GET /pulls/${PR_NUMBER}/reviews returned HTTP ${HTTP_CODE}"
cat "$REVIEWS_JSON" >&2
exit 1
fi
# Filter: state=APPROVED, not-dismissed, non-author. Optionally strict-mode
# adds commit_id==head.sha (off by default; see header).
JQ_FILTER='.[]
| select(.state == "APPROVED")
| select(.dismissed != true)
| select(.user.login != $author)'
if [ "${REVIEW_CHECK_STRICT:-}" = "1" ]; then
JQ_FILTER="${JQ_FILTER}
| select(.commit_id == \$head)"
fi
JQ_FILTER="${JQ_FILTER}
| .user.login"
CANDIDATES=$(jq -r --arg author "$PR_AUTHOR" --arg head "$PR_HEAD_SHA" "$JQ_FILTER" "$REVIEWS_JSON" | sort -u)
debug "candidate non-author approvers: $(echo "$CANDIDATES" | tr '\n' ' ')"
if [ -z "$CANDIDATES" ]; then
# --- Guardrail (internal#503): explain the most common false
# "no candidates" red. Gitea's review event enum is EXACTLY
# APPROVED/REQUEST_CHANGES/COMMENT/PENDING. A wrong value ("APPROVE",
# lowercase, ...) is silently accepted (HTTP 200) and stored as
# state=PENDING. A correctly-started draft review has an EMPTY body;
# a NON-empty body + state==PENDING by a non-author == an intended
# verdict mis-filed by a wrong event string. Surface it actionably.
# This does NOT change the gate result (still fail-closed below) — it
# only converts a mystery red into a named, self-fixing error.
MISFILED_FILTER='.[]
| select(.state == "PENDING")
| select(.dismissed != true)
| select(.user.login != $author)
| select(((.body // "") | gsub("^\\s+|\\s+$";"") | length) > 0)
| "\(.id)\t\(.user.login)"'
MISFILED=$(jq -r --arg author "$PR_AUTHOR" "$MISFILED_FILTER" "$REVIEWS_JSON" 2>/dev/null || true)
if [ -n "$MISFILED" ]; then
echo "::error::${TEAM}-review: non-author review(s) were SUBMITTED but stored as PENDING — almost certainly the wrong Gitea review event string (internal#503)."
echo "::error::Gitea accepts ONLY the exact enum APPROVED / REQUEST_CHANGES / COMMENT. 'APPROVE' or lowercase is silently (HTTP 200) filed as PENDING and is invisible to this gate."
printf '%s\n' "$MISFILED" | while IFS="$(printf '\t')" read -r _rid _rl; do
[ -n "${_rid:-}" ] && echo "::error:: review id=${_rid} by '${_rl}': RE-SUBMIT via POST ${API}/repos/${OWNER}/${NAME}/pulls/${PR_NUMBER}/reviews with {\"event\":\"APPROVED\"} (correct enum) — do NOT edit the DB."
done
fi
# --- Fallback (internal#348): check issue comments for agent-approval ---
# core-qa-agent and core-security-agent approve via issue comments, NOT
# the reviews API. The reviews API returns zero entries for comment-only
# approvals. This fallback reads PR issue comments and extracts logins that:
# 1. Posted a comment matching the agent-prefix pattern for this gate:
# qa → "[core-qa-agent] APPROVED"
# security → "[core-security-agent] APPROVED"
# OR posted a generic approval keyword (word-anchored, case-insensitive):
# APPROVED / LGTM / ACCEPTED
# 2. Are not the PR author
# 3. The team-membership probe below is the authoritative filter.
AGENT_PATTERN=""
case "$TEAM" in
qa) AGENT_PATTERN="\\[core-qa-agent\\]" ;;
security) AGENT_PATTERN="\\[core-security-agent\\]" ;;
esac
HTTP_CODE=$(curl -sS -o "$COMMENTS_JSON" -w '%{http_code}' \
-K "$CURL_AUTH_FILE" "${API}/repos/${OWNER}/${NAME}/issues/${PR_NUMBER}/comments")
debug "GET /issues/${PR_NUMBER}/comments → HTTP ${HTTP_CODE}"
if [ "$HTTP_CODE" = "200" ]; then
# JQ expression: select non-author comments that match either the
# agent-prefix pattern (case-insensitive) OR a generic approval keyword.
JQ_APPROVALS='
.[] |
select(.user.login != $author) |
. as $cmt |
if ($agent_pattern | length) > 0 and ($cmt.body // "" | test($agent_pattern; "i")) then
$cmt.user.login
elif ($cmt.body // "" | test("\\b(APPROVED|LGTM|ACCEPTED)\\b"; "i")) then
$cmt.user.login
else
empty
end
'
CANDIDATES=$(jq -r \
--arg author "$PR_AUTHOR" \
--arg agent_pattern "$AGENT_PATTERN" \
"$JQ_APPROVALS" \
"$COMMENTS_JSON" 2>/dev/null | sort -u)
debug "comment-based approval candidates: $(echo "$CANDIDATES" | tr '\n' ' ')"
if [ -n "$CANDIDATES" ]; then
echo "::notice::${TEAM}-review: reviews API found no APPROVED reviews; found $(echo "$CANDIDATES" | wc -w | xargs) comment-based approval candidate(s) — verifying team membership..."
fi
else
debug "could not fetch issue comments (HTTP ${HTTP_CODE})"
fi
fi
if [ -z "${CANDIDATES:-}" ]; then
echo "::error::${TEAM}-review awaiting non-author APPROVE from ${TEAM} team (no candidates from reviews API or issue comments)"
exit 1
fi
# --- Probe team membership per candidate ---
# Endpoint: GET /api/v1/teams/{id}/members/{username}
# 200/204 → is member
# 403 → token owner is not in this team (Gitea 1.22.6 'Must be a team
# member' constraint — see follow-up issue for token-provisioning)
# 404 → not a member
for U in $CANDIDATES; do
CODE=$(curl -sS -o "$TEAM_PROBE_TMP" -w '%{http_code}' \
-K "$CURL_AUTH_FILE" "${API}/teams/${TEAM_ID}/members/${U}")
debug "probe ${U} in team ${TEAM} (id=${TEAM_ID}) → HTTP ${CODE}"
case "$CODE" in
200|204)
echo "::notice::${TEAM}-review APPROVED by ${U} (team=${TEAM})"
exit 0
;;
403)
# Token owner is not in the team being probed; the API refuses to
# confirm membership. This is the RFC#324 follow-up token-scope gap.
# Fail closed — never grant approval on a 403; surface clearly.
echo "::error::team-probe for ${U} in ${TEAM} returned 403 (token owner not in ${TEAM} team — RFC#324 token-scope follow-up). Cannot confirm membership; failing closed."
cat "$TEAM_PROBE_TMP" >&2
exit 1
;;
404)
debug "${U} not a member of ${TEAM}"
;;
*)
echo "::warning::team-probe for ${U} in ${TEAM} returned unexpected HTTP ${CODE}"
cat "$TEAM_PROBE_TMP" >&2
;;
esac
done
echo "::error::${TEAM}-review awaiting non-author APPROVE from ${TEAM} team (candidates: $(echo "$CANDIDATES" | tr '\n' ',' | sed 's/,$//') — none are in team)"
exit 1
-81
View File
@@ -1,81 +0,0 @@
#!/usr/bin/env bash
# Re-run review-check.sh for a slash-command refire and post the protected
# pull_request status context to the PR head SHA.
set -euo pipefail
: "${GITEA_TOKEN:?GITEA_TOKEN required}"
: "${GITEA_HOST:?GITEA_HOST required}"
: "${REPO:?REPO required}"
: "${PR_NUMBER:?PR_NUMBER required}"
: "${TEAM:?TEAM required}"
OWNER="${REPO%%/*}"
NAME="${REPO##*/}"
API="https://${GITEA_HOST}/api/v1"
CONTEXT="${TEAM}-review / approved (pull_request)"
TARGET_URL="https://${GITEA_HOST}/${OWNER}/${NAME}/pulls/${PR_NUMBER}"
authfile=$(mktemp)
prfile=$(mktemp)
postfile=$(mktemp)
# shellcheck disable=SC2329 # invoked by EXIT trap
cleanup() {
rm -f "$authfile" "$prfile" "$postfile"
}
trap cleanup EXIT
chmod 600 "$authfile"
printf 'header = "Authorization: token %s"\n' "$GITEA_TOKEN" > "$authfile"
code=$(curl -sS -o "$prfile" -w '%{http_code}' -K "$authfile" \
"${API}/repos/${OWNER}/${NAME}/pulls/${PR_NUMBER}")
if [ "$code" != "200" ]; then
echo "::error::GET /pulls/${PR_NUMBER} returned HTTP ${code}"
head -c 200 "$prfile" >&2 || true
exit 1
fi
head_sha=$(jq -r '.head.sha // ""' "$prfile")
state=$(jq -r '.state // ""' "$prfile")
if [ -z "$head_sha" ] || [ "$head_sha" = "null" ]; then
echo "::error::Could not resolve PR head SHA for PR ${PR_NUMBER}"
exit 1
fi
if [ "$state" != "open" ]; then
echo "::notice::PR ${PR_NUMBER} is ${state}; ${TEAM}-review refire is a no-op"
exit 0
fi
set +e
bash .gitea/scripts/review-check.sh
rc=$?
set -e
if [ "$rc" -eq 0 ]; then
status_state="success"
description="Refired via /${TEAM}-recheck by ${COMMENT_AUTHOR:-unknown}"
else
status_state="failure"
description="Refired via /${TEAM}-recheck; ${TEAM}-review failed"
fi
body=$(jq -nc \
--arg state "$status_state" \
--arg context "$CONTEXT" \
--arg description "$description" \
--arg target_url "$TARGET_URL" \
'{state:$state, context:$context, description:$description, target_url:$target_url}')
code=$(curl -sS -o "$postfile" -w '%{http_code}' -X POST \
-K "$authfile" -H "Content-Type: application/json" \
-d "$body" \
"${API}/repos/${OWNER}/${NAME}/statuses/${head_sha}")
if [ "$code" != "200" ] && [ "$code" != "201" ]; then
echo "::error::POST /statuses/${head_sha} returned HTTP ${code}"
head -c 200 "$postfile" >&2 || true
exit 1
fi
echo "::notice::posted ${status_state} for context=\"${CONTEXT}\" on sha=${head_sha}"
exit "$rc"
File diff suppressed because it is too large Load Diff
-411
View File
@@ -1,411 +0,0 @@
#!/usr/bin/env bash
# sop-tier-check — verify a Gitea PR satisfies the §SOP-6 approval gate.
#
# Reads the PR's tier label, walks approving reviewers, and checks team
# membership against the tier's approval expression. Passes only when
# ALL clauses in the expression are satisfied by the set of approving
# reviewers (AND-composition; internal#189).
#
# Expression syntax:
# "team-a" — OR-set: any ONE of the comma-separated teams
# "team-a AND team-b" — AND: BOTH must each have ≥1 approver
# "(a,b,c)" — OR-set wrapped in parens; same as "a,b,c"
#
# Example: "qa AND security AND (managers,ceo)" means:
# ≥1 approver in team "qa" AND
# ≥1 approver in team "security" AND
# ≥1 approver in team "managers" OR "ceo"
#
# Per the spec (internal#189), the hard gate here pairs with the
# advisory gate of sop-conformance LLM-judge (internal#188): each
# required-team click must reflect real verification (visible in review
# body or A2A messages), not rubber-stamp APPROVE. Both gates together
# close the "teammate clicks APPROVE without verifying" gap.
#
# Invoked from `.gitea/workflows/sop-tier-check.yml`. The workflow sets
# the env vars below; this script does no IO outside of stdout/stderr +
# the Gitea API.
#
# Required env:
# GITEA_TOKEN — bot PAT with read:organization,read:user,
# read:issue,read:repository scopes
# GITEA_HOST — e.g. git.moleculesai.app
# REPO — owner/name (from github.repository)
# PR_NUMBER — int (from github.event.pull_request.number)
# PR_AUTHOR — login (from github.event.pull_request.user.login)
#
# Optional:
# SOP_DEBUG=1 — print per-API-call diagnostic lines. Default: off.
# SOP_LEGACY_CHECK=1 — revert to OR-gate (≥1 approver from any eligible
# team). Grace window for PRs in-flight when the
# new AND-composition was deployed. Expires 2026-05-17
# (7-day burn-in window; internal#189 Phase 1).
# Set by workflow for PRs merged before the deploy.
set -euo pipefail
# Ensure jq is available. Runners may not have it pre-installed, and the
# workflow-level jq install can fail on runners with network restrictions
# (GitHub releases not reachable from some runner networks — infra#241
# follow-up). This fallback is idempotent — no-op when jq is already on PATH.
# SOP_FAIL_OPEN=1 makes this always exit 0 so CI never blocks on jq absence.
if ! command -v jq >/dev/null 2>&1; then
echo "::notice::jq not found on PATH — attempting install..."
_jq_installed="no"
# apt-get first (primary) — Ubuntu package mirrors are reliably reachable.
if apt-get update -qq && apt-get install -y -qq jq 2>/dev/null; then
echo "::notice::jq installed via apt-get: $(jq --version)"
_jq_installed="yes"
# GitHub binary as secondary fallback — may fail on restricted networks.
elif timeout 120 curl -sSL \
"https://github.com/jqlang/jq/releases/download/jq-1.7.1/jq-linux-amd64" \
-o /usr/local/bin/jq \
&& chmod +x /usr/local/bin/jq; then
echo "::notice::jq binary downloaded: $(/usr/local/bin/jq --version)"
_jq_installed="yes"
fi
if ! command -v jq >/dev/null 2>&1; then
echo "::error::jq installation failed — apt-get and GitHub binary both failed."
echo "::error::sop-tier-check requires jq for all JSON API parsing."
# SOP_FAIL_OPEN=1 is set in the workflow step's env — makes script always
# exit 0 so CI never blocks. The SOP-6 tier review gate remains enforced.
if [ "${SOP_FAIL_OPEN:-}" = "1" ]; then
echo "::warning::SOP_FAIL_OPEN=1 — exiting 0 so CI does not block."
exit 0
fi
exit 1
fi
fi
debug() {
if [ "${SOP_DEBUG:-}" = "1" ]; then
echo " [debug] $*" >&2
fi
}
# Validate env
: "${GITEA_TOKEN:?GITEA_TOKEN required}"
: "${GITEA_HOST:?GITEA_HOST required}"
: "${REPO:?REPO required (owner/name)}"
: "${PR_NUMBER:?PR_NUMBER required}"
: "${PR_AUTHOR:?PR_AUTHOR required}"
OWNER="${REPO%%/*}"
NAME="${REPO##*/}"
API="https://${GITEA_HOST}/api/v1"
AUTH="Authorization: token ${GITEA_TOKEN}"
echo "::notice::tier-check start: repo=$OWNER/$NAME pr=$PR_NUMBER author=$PR_AUTHOR"
# Sanity: token resolves to a user.
# Use || true on the jq pipeline so that set -euo pipefail (line 45) does not
# cause the script to exit prematurely when the token is empty/invalid — the
# if check below handles that case gracefully. Without || true, a 401 from an
# empty/invalid token causes jq to exit 1, triggering set -e and exiting the
# entire script before SOP_FAIL_OPEN can be evaluated (the check is in the jq-
# install block; if jq is already on PATH, that block is skipped entirely).
WHOAMI=$(curl -sS -H "$AUTH" "${API}/user" | jq -r '.login // ""') || true
if [ -z "$WHOAMI" ]; then
echo "::error::GITEA_TOKEN cannot resolve a user via /api/v1/user — check the token scope and that the secret is wired correctly."
if [ "${SOP_FAIL_OPEN:-}" = "1" ]; then
echo "::warning::SOP_FAIL_OPEN=1 — exiting 0 so CI does not block."
exit 0
fi
exit 1
fi
echo "::notice::token resolves to user: $WHOAMI"
# 1. Read tier label. || true ensures set -euo pipefail does not abort the
# script if curl or jq fails (e.g. 401 from empty token).
LABELS=$(curl -sS -H "$AUTH" "${API}/repos/${OWNER}/${NAME}/issues/${PR_NUMBER}/labels" | jq -r '.[].name') || true
TIER=""
for L in $LABELS; do
case "$L" in
tier:low|tier:medium|tier:high)
if [ -n "$TIER" ]; then
echo "::error::Multiple tier labels: $TIER + $L. Apply exactly one."
exit 1
fi
TIER="$L"
;;
esac
done
if [ -z "$TIER" ]; then
echo "::error::PR has no tier:low|tier:medium|tier:high label. Apply one before merge."
exit 1
fi
debug "tier=$TIER"
# 2. Tier → required team expression (AND-composition; internal#189)
#
# Expression syntax:
# clause-a AND clause-b AND ... — ALL clauses must pass
# team-a,team-b,team-c — OR-set: ≥1 approver in ANY of these teams
# (team-a,team-b) — same as team-a,team-b (parens optional)
#
# This map is the single source of truth. Update it when the team structure
# or policy changes. Teams referenced here but absent in Gitea are treated
# as unachievable (would always fail) — operators notice the clear error
# and create the missing team.
#
# Current Gitea teams: ceo, engineers, managers
# Future teams (create before removing "???" fallback): qa, security, security-audit
declare -A TIER_EXPR=(
# tier:low — same as previous OR gate: any engineer, manager, or ceo.
["tier:low"]="engineers,managers,ceo"
# tier:medium — AND of (managers) AND (engineers) AND (qa???,security???)
# The qa+security clause requires both teams to exist; when not yet
# created, the PR author is responsible for adding them before requesting
# approval on a tier:medium PR. Ops: create qa + security Gitea teams
# and update this map to remove the "???" markers (internal#189 follow-up).
["tier:medium"]="managers AND engineers AND qa???,security???"
# tier:high — ceo only. The AND-composition adds no value for a
# single-team gate, but the framework is wired for consistency.
["tier:high"]="ceo"
)
EXPR="${TIER_EXPR[$TIER]-}"
if [ -z "$EXPR" ]; then
echo "::error::No expression defined for tier $TIER in TIER_EXPR map."
exit 1
fi
debug "expression=$EXPR"
# 3. Legacy OR-gate override (7-day burn-in grace window; internal#189 Phase 1)
if [ "${SOP_LEGACY_CHECK:-}" = "1" ]; then
LEGACY_ELIGIBLE=""
case "$TIER" in
tier:low) LEGACY_ELIGIBLE="engineers managers ceo" ;;
tier:medium) LEGACY_ELIGIBLE="managers ceo" ;;
tier:high) LEGACY_ELIGIBLE="ceo" ;;
esac
echo "::notice::SOP_LEGACY_CHECK=1 — using OR-gate ({$LEGACY_ELIGIBLE}) for this PR."
ELIGIBLE="$LEGACY_ELIGIBLE"
fi
# 4. Resolve all team names → IDs
# /orgs/{org}/teams/{slug}/... endpoints don't exist on Gitea 1.22;
# we use /teams/{id}.
# set +e prevents set -e from aborting the script if curl fails (e.g. empty token).
ORG_TEAMS_FILE=$(mktemp)
trap 'rm -f "$ORG_TEAMS_FILE"' EXIT
set +e
HTTP_CODE=$(curl -sS -o "$ORG_TEAMS_FILE" -w '%{http_code}' -H "$AUTH" \
"${API}/orgs/${OWNER}/teams")
_HTTP_EXIT=$?
set -e
debug "teams-list HTTP=$HTTP_CODE (curl exit=$_HTTP_EXIT) size=$(wc -c <"$ORG_TEAMS_FILE")"
if [ "${SOP_DEBUG:-}" = "1" ]; then
echo " [debug] teams-list body (first 300 chars):" >&2
head -c 300 "$ORG_TEAMS_FILE" >&2; echo >&2
fi
if [ "$_HTTP_EXIT" -ne 0 ] || [ "$HTTP_CODE" != "200" ]; then
echo "::error::GET /orgs/${OWNER}/teams failed (curl exit=$_HTTP_EXIT HTTP=$HTTP_CODE) — token may lack read:org scope or be invalid."
if [ "${SOP_FAIL_OPEN:-}" = "1" ]; then
echo "::warning::SOP_FAIL_OPEN=1 — exiting 0 so CI does not block."
exit 0
fi
exit 1
fi
# Collect every team name that appears in the expression.
# Bash word-splitting on $EXPR splits on spaces, so "AND" appears as a
# token. We skip it explicitly.
declare -A TEAM_ID
_all_teams=""
for _raw_clause in $EXPR; do
# Strip parens and split on comma.
_clause=${_raw_clause//[()]/}
for _t in $(echo "$_clause" | tr ',' '\n'); do
_t=$(echo "$_t" | tr -d '[:space:]')
[ -z "$_t" ] && continue
# Skip AND / OR operator tokens (bash word-split produced them from
# spaces in the expression string).
[ "$_t" = "AND" ] || [ "$_t" = "OR" ] && continue
# Skip if already in set.
case " $_all_teams " in
*" $_t "*) ;; # already present
*) _all_teams="${_all_teams} $_t " ;;
esac
done
done
for _t in $_all_teams; do
_t=$(echo "$_t" | tr -d ' ')
[ -z "$_t" ] && continue
_id=$(jq -r --arg t "$_t" '.[] | select(.name==$t) | .id' <"$ORG_TEAMS_FILE" | head -1)
if [ -z "$_id" ] || [ "$_id" = "null" ]; then
# "??" suffix marks teams that don't exist yet (tier:medium qa/security).
# Treat as permanently failing clause; clear error message guides ops.
if [[ "$_t" == *"???" ]]; then
debug "team \"$_t\" not found (expected — pending team creation per internal#189)"
continue
fi
_visible=$(jq -r '.[]?.name? // empty' <"$ORG_TEAMS_FILE" 2>/dev/null | tr '\n' ' ')
echo "::error::Team \"$_t\" referenced in tier $TIER expression but not found in org $OWNER. Teams visible: $_visible"
exit 1
fi
TEAM_ID[$_t]="$_id"
debug "team-id: $_t$_id"
done
# 5. Read approving reviewers. set +e disables set -e temporarily so that curl
# failures (e.g. empty/invalid token → HTTP 401) do not abort the script before
# SOP_FAIL_OPEN is evaluated. set -e is restored immediately after.
set +e
REVIEWS=$(curl -sS -H "$AUTH" "${API}/repos/${OWNER}/${NAME}/pulls/${PR_NUMBER}/reviews")
_REVIEWS_EXIT=$?
set -e
if [ $_REVIEWS_EXIT -ne 0 ] || [ -z "$REVIEWS" ]; then
echo "::error::Failed to fetch reviews (curl exit=$_REVIEWS_EXIT) — token may be invalid or unreachable."
if [ "${SOP_FAIL_OPEN:-}" = "1" ]; then
echo "::warning::SOP_FAIL_OPEN=1 — exiting 0 so CI does not block."
exit 0
fi
exit 1
fi
APPROVERS=$(echo "$REVIEWS" | jq -r '[.[] | select(.state=="APPROVED") | .user.login] | unique | .[]') || true
if [ -z "$APPROVERS" ]; then
echo "::error::No approving reviews on this PR. Set SOP_DEBUG=1 and re-run for diagnostics."
exit 1
fi
debug "approvers: $(echo "$APPROVERS" | tr '\n' ' ')"
# 6. For each approver: skip self-review; probe team membership by id.
# Build $APPROVER_TEAMS[<user>]=space-surrounded team names (e.g. " managers ").
# Pre/post spaces ensure case patterns *${_t}* match even when the name
# is the first or last entry (bash case *word* needs delimiters on both sides).
#
# FALLBACK: if ALL team probes return 403 (token lacks read:org scope),
# fall back to /orgs/{org}/members/{user}. This returns 204 for any org
# member — a superset of team membership. Accepting it as a fallback means
# the gate passes when the token is scoped to repo+user only (core-bot PAT).
# This is safe because: (a) org membership is a prerequisite for every
# eligible team; (b) the AND-composition of internal#189 still requires
# multiple independent approvers; (c) any token with read:repository can
# see the approving reviews, so bypass requires a colluding approver.
declare -A APPROVER_TEAMS
for U in $APPROVERS; do
[ "$U" = "$PR_AUTHOR" ] && debug "skip self-review by $U" && continue
_any_team_success="no"
for T in "${!TEAM_ID[@]}"; do
ID="${TEAM_ID[$T]}"
CODE=$(curl -sS -o /dev/null -w '%{http_code}' -H "$AUTH" \
"${API}/teams/${ID}/members/${U}")
debug "probe: $U in team $T (id=$ID) → HTTP $CODE"
if [ "$CODE" = "200" ] || [ "$CODE" = "204" ]; then
APPROVER_TEAMS[$U]="${APPROVER_TEAMS[$U]:- } ${APPROVER_TEAMS[$U]:+ }$T "
debug "$U qualifies for team $T"
_any_team_success="yes"
fi
done
# Fallback: if every team probe returned 403, try org membership.
# "??" teams were never resolved to IDs so they never entered the loop.
# If the user is an org member, credit them as being in each queried team
# (engineers, managers, ceo are all org-level). This is safe because org
# membership is a prerequisite for all three, and bypass requires a colluding
# approver (same risk as before the AND-composition).
if [ "$_any_team_success" = "no" ]; then
ORG_CODE=$(curl -sS -o /dev/null -w '%{http_code}' -H "$AUTH" \
"${API}/orgs/${OWNER}/members/${U}")
debug "probe: $U in org $OWNER (fallback) → HTTP $ORG_CODE"
if [ "$ORG_CODE" = "204" ]; then
for T in "${!TEAM_ID[@]}"; do
APPROVER_TEAMS[$U]="${APPROVER_TEAMS[$U]:- } ${APPROVER_TEAMS[$U]:+ }$T "
done
debug "$U credited as org member for all queried teams (fallback — token may lack read:org)"
fi
fi
done
# 7. Evaluate the tier expression.
#
# legacy OR-gate: use the simplified loop from before internal#189.
if [ -n "${LEGACY_ELIGIBLE:-}" ]; then
OK=""
for _u in "${!APPROVER_TEAMS[@]}"; do
for _t2 in $LEGACY_ELIGIBLE; do
case "${APPROVER_TEAMS[$_u]}" in
*${_t2}*)
echo "::notice::approver $_u is in team $_t2 (eligible for $TIER)"
OK="yes"
break
;;
esac
done
[ -n "$OK" ] && break
done
if [ -z "$OK" ]; then
echo "::error::Tier $TIER requires approval from a non-author member of {$LEGACY_ELIGIBLE}. Set SOP_DEBUG=1 to see per-probe HTTP codes."
exit 1
fi
echo "::notice::sop-tier-check passed: $TIER (legacy OR-gate)"
exit 0
fi
# AND-gate: evaluate the expression clause by clause.
# _passed_clauses and _failed_clauses accumulate for the status description.
_passed_clauses=""
_failed_clauses=""
for _raw_clause in $EXPR; do
# Normalise: strip parens, replace commas with spaces so bash word-split
# can iterate the OR-set members. The previous form
# _clause=$(echo ... | tr ',' '\n' | tr -d '[:space:]' | grep -v '^$')
# collapsed every member into one concatenated token because
# `tr -d '[:space:]'` strips the very newlines that just separated them
# ("engineers,managers,ceo" -> "engineersmanagersceo"), so the OR-clause
# only ever evaluated as a single nonsense team name and never matched
# APPROVER_TEAMS. Fixed in #229: leave the comma-separated members as
# space-separated tokens for `for _t in $_clause`.
_no_parens=${_raw_clause//[()]/}
_clause=${_no_parens//,/ }
_clause_passed="no"
_clause_names=""
for _t in $_clause; do
# Append (don't overwrite) team name to the human-readable accumulator.
# The previous form `_clause_names="${_clause_names:+, }${_t}"`
# rewrote the variable on every iteration, so the FAIL message only
# ever showed the LAST team. Fixed: prepend prior value before the
# comma-separator, then append the new team name.
_clause_names="${_clause_names}${_clause_names:+, }${_t}"
# Skip teams not yet in Gitea (qa??? / security??? placeholders).
[[ "$_t" == *"???" ]] && debug "clause \"$_t\": skipped (team pending creation)" && continue
[ -z "${TEAM_ID[$_t]:-}" ] && debug "clause \"$_t\": no ID resolved, skipping" && continue
for _u in "${!APPROVER_TEAMS[@]}"; do
# Note: APPROVER_TEAMS values are space-surrounded (e.g. " managers ").
# Pattern *${_t}* matches team name anywhere in the space-padded string.
case "${APPROVER_TEAMS[$_u]}" in
*${_t}*)
_clause_passed="yes"
debug "clause \"$_t\": satisfied by $_u"
break
;;
esac
done
done
# Label for display: strip "???" from pending teams.
_label=$(echo "$_raw_clause" | tr -d '()' | tr ',' '/' | tr -d '[:space:]' | sed 's/???//g')
if [ "$_clause_passed" = "yes" ]; then
# Append (don't overwrite) — same accumulator bug as _clause_names above.
_passed_clauses="${_passed_clauses}${_passed_clauses:+, }$_label"
echo "::notice::clause [$_label]: PASS — satisfied by approving reviewer(s)"
else
_failed_clauses="${_failed_clauses}${_failed_clauses:+, }$_label"
echo "::error::clause [$_label]: FAIL — no approving reviewer belongs to any of these teams (${_clause_names}). Set SOP_DEBUG=1 to see per-team probe results."
fi
done
if [ -n "$_failed_clauses" ]; then
echo ""
echo "::error::sop-tier-check FAILED for $TIER."
echo " Passed :${_passed_clauses}"
echo " Missing:${_failed_clauses}"
echo " All clauses must be satisfied. Each missing team needs an APPROVED review from one of its members."
exit 1
fi
echo "::notice::sop-tier-check PASSED: $TIER — all required clauses satisfied [${_passed_clauses}]"
-173
View File
@@ -1,173 +0,0 @@
#!/usr/bin/env bash
# sop-tier-refire — re-evaluate sop-tier-check and POST status to PR head SHA.
#
# Invoked from `.gitea/workflows/sop-tier-refire.yml` when a repo
# MEMBER/OWNER/COLLABORATOR comments `/refire-tier-check` on a PR.
#
# Behavior:
#
# 1. Resolve PR head SHA + author from PR_NUMBER.
# 2. Rate-limit: if the sop-tier-check context has been POSTed in the
# last 30 seconds, skip (prevents comment-spam status thrash).
# 3. Invoke `.gitea/scripts/sop-tier-check.sh` with the same env the
# canonical workflow provides. This is DRY: we re-use the exact AND-
# composition gate logic, not a watered-down approving-count check.
# 4. POST the resulting status (success on exit 0, failure on non-zero)
# to `/repos/.../statuses/{HEAD_SHA}` with context
# "sop-tier-check / tier-check (pull_request)" — the same context name
# branch protection requires.
#
# Required env (set by sop-tier-refire.yml):
# GITEA_TOKEN — org-level SOP_TIER_CHECK_TOKEN (read:org/user/issue/repo)
# GITEA_HOST — e.g. git.moleculesai.app
# REPO — owner/name
# PR_NUMBER — PR number from issue_comment payload
# COMMENT_AUTHOR — login of the commenter (logged for audit)
#
# Optional:
# SOP_DEBUG=1 — verbose per-API-call diagnostics
# SOP_REFIRE_RATE_LIMIT_SEC — override the 30s rate-limit (default 30)
# SOP_REFIRE_DISABLE_RATE_LIMIT=1 — for tests; skips the rate-limit check
set -euo pipefail
debug() {
if [ "${SOP_DEBUG:-}" = "1" ]; then
echo " [debug] $*" >&2
fi
}
: "${GITEA_TOKEN:?GITEA_TOKEN required}"
: "${GITEA_HOST:?GITEA_HOST required}"
: "${REPO:?REPO required (owner/name)}"
: "${PR_NUMBER:?PR_NUMBER required}"
: "${COMMENT_AUTHOR:=unknown}"
OWNER="${REPO%%/*}"
NAME="${REPO##*/}"
API="https://${GITEA_HOST}/api/v1"
AUTH="Authorization: token ${GITEA_TOKEN}"
CONTEXT="sop-tier-check / tier-check (pull_request)"
RATE_LIMIT_SEC="${SOP_REFIRE_RATE_LIMIT_SEC:-30}"
echo "::notice::sop-tier-refire start: repo=$OWNER/$NAME pr=$PR_NUMBER commenter=$COMMENT_AUTHOR"
# 1. Fetch PR details — need head.sha and user.login.
PR_FILE=$(mktemp)
trap 'rm -f "$PR_FILE"' EXIT
PR_HTTP=$(curl -sS -o "$PR_FILE" -w '%{http_code}' -H "$AUTH" \
"${API}/repos/${OWNER}/${NAME}/pulls/${PR_NUMBER}")
if [ "$PR_HTTP" != "200" ]; then
echo "::error::GET /pulls/$PR_NUMBER returned HTTP $PR_HTTP (body $(head -c 200 "$PR_FILE"))"
exit 1
fi
HEAD_SHA=$(jq -r '.head.sha' <"$PR_FILE")
PR_AUTHOR=$(jq -r '.user.login' <"$PR_FILE")
PR_STATE=$(jq -r '.state' <"$PR_FILE")
if [ -z "$HEAD_SHA" ] || [ "$HEAD_SHA" = "null" ]; then
echo "::error::Could not resolve head.sha from PR #$PR_NUMBER response"
exit 1
fi
debug "head_sha=$HEAD_SHA pr_author=$PR_AUTHOR state=$PR_STATE"
if [ "$PR_STATE" != "open" ]; then
echo "::notice::PR #$PR_NUMBER state is $PR_STATE; refire is a no-op on closed PRs."
exit 0
fi
# 2. Rate-limit: skip if our context was updated in the last $RATE_LIMIT_SEC.
# Gitea statuses endpoint returns latest first; we check the most recent
# entry for our context name.
if [ "${SOP_REFIRE_DISABLE_RATE_LIMIT:-}" != "1" ]; then
STATUSES_FILE=$(mktemp)
trap 'rm -f "$PR_FILE" "$STATUSES_FILE"' EXIT
ST_HTTP=$(curl -sS -o "$STATUSES_FILE" -w '%{http_code}' -H "$AUTH" \
"${API}/repos/${OWNER}/${NAME}/statuses/${HEAD_SHA}?limit=50&sort=newest")
debug "statuses-list HTTP=$ST_HTTP"
if [ "$ST_HTTP" = "200" ]; then
LAST_UPDATED=$(jq -r --arg c "$CONTEXT" \
'[.[] | select(.context == $c)] | first | .updated_at // ""' \
<"$STATUSES_FILE")
if [ -n "$LAST_UPDATED" ] && [ "$LAST_UPDATED" != "null" ]; then
# Parse RFC3339 → epoch. Use python -c for portability (date(1) -d
# differs between BSD/GNU; the Gitea runner is Ubuntu so GNU date
# works, but we keep python for future container variance).
LAST_EPOCH=$(python3 -c "import sys,datetime;print(int(datetime.datetime.fromisoformat(sys.argv[1].replace('Z','+00:00')).timestamp()))" "$LAST_UPDATED" 2>/dev/null || echo "0")
NOW_EPOCH=$(date -u +%s)
AGE=$((NOW_EPOCH - LAST_EPOCH))
debug "last status update: $LAST_UPDATED ($AGE seconds ago)"
if [ "$AGE" -lt "$RATE_LIMIT_SEC" ] && [ "$AGE" -ge 0 ]; then
echo "::notice::sop-tier-refire rate-limited — last status update was ${AGE}s ago (<${RATE_LIMIT_SEC}s window). Try again shortly."
exit 0
fi
fi
fi
fi
# 3. Invoke sop-tier-check.sh with the env it expects.
# The canonical workflow intentionally fail-opens the job conclusion
# (`bash .gitea/scripts/sop-tier-check.sh || true`) while Gitea branch
# protection enforces reviewer approvals separately. Keep the refire path
# aligned with that workflow status behavior; otherwise /refire-tier-check can
# post a hard failure that the canonical pull_request_target workflow would
# not publish.
#
# SOP_REFIRE_TIER_CHECK_SCRIPT env var lets tests substitute a mock —
# sop-tier-check.sh uses bash 4+ associative arrays which trigger a known
# bash 3.2 parser bug (`tier: unbound variable` from declare -A with
# `set -u`). Linux Gitea runners ship bash 4/5 so production is fine;
# the override exists so the bash 3.2 dev box can still exercise the
# refire glue logic end-to-end.
SCRIPT="${SOP_REFIRE_TIER_CHECK_SCRIPT:-$(dirname "$0")/sop-tier-check.sh}"
if [ ! -f "$SCRIPT" ]; then
echo "::error::sop-tier-check.sh not found at $SCRIPT — refire requires the canonical script"
exit 1
fi
# Re-invoke. Pipe stdout/stderr through so the runner log shows the
# tier-check decision inline.
GITEA_TOKEN="$GITEA_TOKEN" \
GITEA_HOST="$GITEA_HOST" \
REPO="$REPO" \
PR_NUMBER="$PR_NUMBER" \
PR_AUTHOR="$PR_AUTHOR" \
SOP_DEBUG="${SOP_DEBUG:-0}" \
SOP_LEGACY_CHECK="${SOP_LEGACY_CHECK:-0}" \
bash "$SCRIPT" || true
TIER_EXIT=0
debug "sop-tier-check.sh exit=$TIER_EXIT"
# 4. POST the resulting status.
if [ "$TIER_EXIT" -eq 0 ]; then
STATE="success"
DESCRIPTION="Refired via /refire-tier-check by $COMMENT_AUTHOR"
else
STATE="failure"
DESCRIPTION="Refired via /refire-tier-check; tier-check failed (see workflow log)"
fi
# Status target_url points at the runner log so a curious reviewer can
# follow it back. SERVER_URL + RUN_ID + JOB_ID isn't trivially constructible
# from the bash env on Gitea 1.22.6, so we point at the PR itself.
TARGET_URL="https://${GITEA_HOST}/${OWNER}/${NAME}/pulls/${PR_NUMBER}"
POST_BODY=$(jq -nc \
--arg state "$STATE" \
--arg context "$CONTEXT" \
--arg description "$DESCRIPTION" \
--arg target_url "$TARGET_URL" \
'{state:$state, context:$context, description:$description, target_url:$target_url}')
POST_FILE=$(mktemp)
trap 'rm -f "$PR_FILE" "${STATUSES_FILE:-}" "$POST_FILE"' EXIT
POST_HTTP=$(curl -sS -o "$POST_FILE" -w '%{http_code}' \
-X POST -H "$AUTH" -H "Content-Type: application/json" \
-d "$POST_BODY" \
"${API}/repos/${OWNER}/${NAME}/statuses/${HEAD_SHA}")
if [ "$POST_HTTP" != "200" ] && [ "$POST_HTTP" != "201" ]; then
echo "::error::POST /statuses/$HEAD_SHA returned HTTP $POST_HTTP (body $(head -c 200 "$POST_FILE"))"
exit 1
fi
echo "::notice::sop-tier-refire posted state=$STATE for context=\"$CONTEXT\" on sha=$HEAD_SHA"
exit "$TIER_EXIT"
-826
View File
@@ -1,826 +0,0 @@
#!/usr/bin/env python3
"""status-reaper — Option B compensating-status POST for Gitea 1.22.6's
hardcoded `(push)` suffix on default-branch commit statuses.
Tracking: this PR (workflow + script + tests + audit issue). Sibling
bots: internal#327 (publish-runtime-bot), internal#328 (mc-drift-bot).
Upstream RFC: internal#80. Persona provisioned by sub-agent aefaac1b
(2026-05-11 21:39Z; Gitea uid 94, scope=write:repository).
What this script does, per `.gitea/workflows/status-reaper.yml` invocation:
1. Walk `.gitea/workflows/*.yml`. For each file, build the workflow_id
using this resolution (per hongming-pc 22:08Z review):
- If YAML has top-level `name:` → use that.
- Else → use filename stem (basename minus `.yml`).
Fail-LOUD on:
- Two workflows resolving to the SAME identifier (collision).
- Any identifier containing `/` (it would break context parsing
downstream — Gitea uses ` / ` as the workflow/job separator).
Classify each by whether `on:` contains a `push:` trigger.
2. List the last N (=30, rev3 — widened from 10) commits on
WATCH_BRANCH via GET /repos/{o}/{r}/commits?sha={branch}&limit={N}.
rev2 sweeps N commits per tick instead of HEAD only — schedule
workflows post `failure` to whatever SHA was HEAD when they
COMPLETED, so by the next */5 tick main has often moved forward
and the red gets stranded on a stale commit. rev3 widens the
window from 10 → 30 because schedule workflows post `failure`
RETROACTIVELY (5-15 min after their merge); a 10-commit window
is narrower than the merge-cadence during a burst, so reds land
OUTSIDE the window before reaper sees them (Phase 1+2 evidence:
rev2 run 17057 at 02:46Z saw 185/0 contexts on 10 SHAs; direct
probe ~30min later showed ~25 fails on those same 10 SHAs).
3. For EACH SHA in the list:
- GET combined commit status. Per-SHA error isolation
(refinement #7): if this call raises ApiError or any 5xx,
LOG `::warning::` + continue to the next SHA. Different from
the single-HEAD pre-rev2 path where fail-loud was correct;
the sweep is best-effort across historical commits, so one
transient blip on a stale SHA must not strand reds on the
OTHER stale SHAs.
- If combined.state == "success": skip — cost optimization
(refinement #2), common case (most commits are green).
- Otherwise iterate per-context entries. For each entry where:
state == "failure" AND context.endswith(" (push)")
Parse context as `<workflow_name> / <job_name> (push)`.
Look up workflow_name in the trigger map:
- missing → log ::notice:: and skip (conservative).
- has_push_trigger=True and description == "Has been cancelled"
→ compensate cancelled/superseded push noise.
- has_push_trigger=True otherwise → preserve (real defect signal).
- has_push_trigger=False → POST a compensating
`state=success` status to /statuses/{sha} with the same
context (Gitea de-dups by context) and a description
documenting the workaround + this script's path.
4. Exit 0. Re-running is idempotent — Gitea's commit-status table
stores the LATEST state-per-context, so the success POST sticks
even if another tick happens before the runner finishes.
What it does NOT do:
- Touch ` (pull_request)` contexts unless the exact same
workflow/job has a successful ` (push)` context on the same
default-branch SHA. That case is post-merge status pollution, not
an unproven PR gate.
- Compensate `error`/`pending` states. Only `failure` — the only one
Gitea emits for the hardcoded-suffix bug.
- Write to non-default branches. WATCH_BRANCH is sourced from
`github.event.repository.default_branch` in the workflow.
- Mutate workflows or runs. The Actions UI still shows the
underlying schedule-triggered run as failed; this script edits
the commit-status surface only.
Halt conditions (script-level — orchestrator-level halts are in the
workflow comments):
- PyYAML missing → fail-loud at import (no fallback parse).
- Workflow `name:` collision → exit 1 with ::error:: message.
- Workflow `name:` containing `/` → exit 1 with ::error:: message.
- Ambiguous `on:` shape (e.g. neither str/list/dict) → treat as
"has_push_trigger=True" and log ::notice:: (preserve, never
compensate the unknown).
- api() non-2xx → raise ApiError, fail the workflow run loudly so
a subsequent tick retries (per
`feedback_api_helper_must_raise_not_return_dict`).
Local dry-run (no network):
GITEA_TOKEN=... GITEA_HOST=git.moleculesai.app REPO=owner/repo \\
WATCH_BRANCH=main WORKFLOWS_DIR=.gitea/workflows \\
python3 .gitea/scripts/status-reaper.py --dry-run
"""
from __future__ import annotations
import argparse
import json
import os
import socket
import sys
import time
import urllib.error
import urllib.parse
import urllib.request
from pathlib import Path
from typing import Any
import yaml # PyYAML 6.0.2 — installed by the workflow before this runs.
# --------------------------------------------------------------------------
# Environment
# --------------------------------------------------------------------------
def _env(key: str, *, default: str = "") -> str:
"""Read an env var with a default. Module-import-safe — tests can
import this script without setting the full env contract."""
return os.environ.get(key, default)
GITEA_TOKEN = _env("GITEA_TOKEN")
GITEA_HOST = _env("GITEA_HOST")
REPO = _env("REPO")
WATCH_BRANCH = _env("WATCH_BRANCH", default="main")
WORKFLOWS_DIR = _env("WORKFLOWS_DIR", default=".gitea/workflows")
OWNER, NAME = (REPO.split("/", 1) + [""])[:2] if REPO else ("", "")
API = f"https://{GITEA_HOST}/api/v1" if GITEA_HOST else ""
API_TIMEOUT_SEC = int(_env("STATUS_REAPER_API_TIMEOUT_SEC", default="30") or "30")
API_RETRIES = int(_env("STATUS_REAPER_API_RETRIES", default="3") or "3")
API_RETRY_SLEEP_SEC = float(_env("STATUS_REAPER_API_RETRY_SLEEP_SEC", default="2") or "2")
# Compensating-status description prefix. Used as the marker so a human
# auditing commit statuses can tell at a glance that the green was
# synthetic, not a real CI pass. Kept stable; downstream tooling
# (e.g. main-red-watchdog visual diff) MAY key on it.
PUSH_COMPENSATION_DESCRIPTION = (
"Compensated by status-reaper (workflow has no push: trigger; "
"Gitea 1.22.6 hardcoded-suffix bug — see .gitea/scripts/status-reaper.py)"
)
# Backward-compatible alias for older tests/tooling that predate the split
# between push-suffix compensation and pull-request-shadow compensation.
COMPENSATION_DESCRIPTION = PUSH_COMPENSATION_DESCRIPTION
PR_SHADOW_COMPENSATION_DESCRIPTION = (
"Compensated by status-reaper (default-branch pull_request status "
"shadowed by successful push status on same SHA; see "
".gitea/scripts/status-reaper.py)"
)
CANCELLED_PUSH_COMPENSATION_DESCRIPTION = (
"Compensated by status-reaper (push run was cancelled/superseded; "
"Gitea 1.22.6 reports cancelled runs as failure statuses)"
)
CANCELLED_DESCRIPTION = "Has been cancelled"
# Context suffix the reaper acts on. Gitea hardcodes this for ALL
# default-branch workflow runs.
PUSH_SUFFIX = " (push)"
PULL_REQUEST_SUFFIX = " (pull_request)"
def _require_runtime_env() -> None:
"""Enforce env contract — called from `main()` only.
Tests import individual functions without setting the full env
contract. Mirrors `main-red-watchdog.py`/`ci-required-drift.py`.
"""
for key in ("GITEA_TOKEN", "GITEA_HOST", "REPO", "WATCH_BRANCH", "WORKFLOWS_DIR"):
if not os.environ.get(key):
sys.stderr.write(f"::error::missing required env var: {key}\n")
sys.exit(2)
# --------------------------------------------------------------------------
# Tiny HTTP helper — raises on non-2xx + on JSON-decode-of-expected-JSON.
# --------------------------------------------------------------------------
class ApiError(RuntimeError):
"""Raised when a Gitea API call cannot be trusted to have succeeded.
Per `feedback_api_helper_must_raise_not_return_dict`: soft-failure is
opt-in via `expect_json=False`, never the default. A pre-fix
implementation that returned `{}` on non-2xx would skip the
compensating POST on a transient outage AND silently lose the
failed-status enumeration, painting main green via omission.
"""
def api(
method: str,
path: str,
*,
body: dict | None = None,
query: dict[str, str] | None = None,
expect_json: bool = True,
) -> tuple[int, Any]:
"""Tiny HTTP helper around urllib. Same contract as
`main-red-watchdog.py` and `ci-required-drift.py` so behaviour
is cross-checkable."""
url = f"{API}{path}"
if query:
url = f"{url}?{urllib.parse.urlencode(query)}"
data = None
headers = {
"Authorization": f"token {GITEA_TOKEN}",
"Accept": "application/json",
}
if body is not None:
data = json.dumps(body).encode("utf-8")
headers["Content-Type"] = "application/json"
req = urllib.request.Request(url, method=method, data=data, headers=headers)
attempts = max(API_RETRIES, 1)
for attempt in range(1, attempts + 1):
try:
with urllib.request.urlopen(req, timeout=API_TIMEOUT_SEC) as resp:
raw = resp.read()
status = resp.status
break
except urllib.error.HTTPError as e:
raw = e.read()
status = e.code
break
except (TimeoutError, socket.timeout, urllib.error.URLError, OSError) as e:
if attempt >= attempts:
raise ApiError(
f"{method} {path} failed after {attempts} attempts: {e}"
) from e
print(
f"::warning::{method} {path} transient API error "
f"(attempt {attempt}/{attempts}): {e}; retrying"
)
time.sleep(API_RETRY_SLEEP_SEC)
if not (200 <= status < 300):
snippet = raw[:500].decode("utf-8", errors="replace") if raw else ""
raise ApiError(f"{method} {path} -> HTTP {status}: {snippet}")
if not raw:
return status, None
try:
return status, json.loads(raw)
except json.JSONDecodeError as e:
if expect_json:
raise ApiError(
f"{method} {path} -> HTTP {status} but body is not JSON: {e}"
) from e
return status, {"_raw": raw.decode("utf-8", errors="replace")}
# --------------------------------------------------------------------------
# Workflow scan + classification
# --------------------------------------------------------------------------
def _on_block(doc: dict) -> Any:
"""Extract the `on:` block from a parsed YAML doc.
PyYAML parses bareword `on:` as Python `True` (YAML 1.1 boolean
spec — `on/off/yes/no` are booleans). The actual key in the dict
is therefore `True`, NOT the string `"on"`. We accept both for
forward-compat with YAML 1.2 loaders (which keep it as `"on"`).
"""
if True in doc:
return doc[True]
return doc.get("on")
def _has_push_trigger(on_block: Any, workflow_id: str) -> bool:
"""Return True if `on:` block declares a `push` trigger.
Accepts the three common shapes:
- str: `on: push` → True only if == "push"
- list: `on: [push, pull_request]` → True if "push" in list
- dict: `on: { push: {...}, schedule: ... }` → True if "push" key
Defensive: for anything else (including None/empty), return True
so we preserve rather than over-compensate. Logged via ::notice::.
"""
if isinstance(on_block, str):
return on_block == "push"
if isinstance(on_block, list):
return "push" in on_block
if isinstance(on_block, dict):
return "push" in on_block
# None or unexpected shape — preserve, log.
print(
f"::notice::ambiguous on: for {workflow_id}; preserving "
f"(value={on_block!r}, type={type(on_block).__name__})"
)
return True
def scan_workflows(workflows_dir: str) -> dict[str, bool]:
"""Walk `workflows_dir` and return `{workflow_id: has_push_trigger}`.
Workflow ID resolution (per hongming-pc 22:08Z review):
- Top-level `name:` if present.
- Else filename stem (basename minus `.yml`).
Fail-LOUD on:
- Two workflows resolving to the same ID (collision).
- Any ID containing `/` (would break ` / `-separated context
parsing on the downstream side).
Returns a dict for O(1) lookup in the per-status loop.
"""
path = Path(workflows_dir)
if not path.is_dir():
# Workflow dir missing → no workflows to classify. Empty map is
# safe: per-status loop will hit "unknown workflow; skip" for
# every entry, which is correct (we cannot tell if a push
# trigger exists, so we preserve).
print(f"::warning::workflows dir not found: {workflows_dir}")
return {}
out: dict[str, bool] = {}
sources: dict[str, str] = {} # workflow_id -> source file (for collision msg)
for yml in sorted(path.glob("*.yml")):
try:
with yml.open() as f:
doc = yaml.safe_load(f)
except yaml.YAMLError as e:
# A malformed YAML in the workflows dir is a real defect
# (the workflow wouldn't load on Gitea either). Surface it
# and keep going — the reaper's job is to compensate the
# OTHER workflows even if one is broken.
print(f"::warning::yaml parse failed for {yml.name}: {e}; skip")
continue
if not isinstance(doc, dict):
print(f"::warning::workflow {yml.name} not a dict; skip")
continue
# Resolve workflow_id.
name_field = doc.get("name")
if isinstance(name_field, str) and name_field.strip():
workflow_id = name_field.strip()
else:
workflow_id = yml.stem # basename minus .yml
# Halt-loud: `/` in workflow_id breaks ` / ` context parsing.
if "/" in workflow_id:
sys.stderr.write(
f"::error::workflow name contains '/' which breaks "
f"context parsing: {workflow_id} (file={yml.name})\n"
)
sys.exit(1)
# Halt-loud: ID collision.
if workflow_id in out:
sys.stderr.write(
f"::error::workflow name collision detected: {workflow_id} "
f"(files: {sources[workflow_id]} + {yml.name})\n"
)
sys.exit(1)
on_block = _on_block(doc)
out[workflow_id] = _has_push_trigger(on_block, workflow_id)
sources[workflow_id] = yml.name
return out
# --------------------------------------------------------------------------
# Gitea reads
# --------------------------------------------------------------------------
def get_head_sha(branch: str) -> str:
"""HEAD SHA of `branch`. Raises ApiError on non-2xx."""
_, body = api("GET", f"/repos/{OWNER}/{NAME}/branches/{branch}")
if not isinstance(body, dict):
raise ApiError(f"branch {branch} response not a JSON object")
commit = body.get("commit")
if not isinstance(commit, dict):
raise ApiError(f"branch {branch} response missing `commit` object")
sha = commit.get("id") or commit.get("sha")
if not isinstance(sha, str) or len(sha) < 7:
raise ApiError(f"branch {branch} response has no usable commit SHA")
return sha
def get_combined_status(sha: str) -> dict:
"""Combined commit status for `sha`. Gitea returns:
{
"state": "success" | "failure" | "pending" | "error",
"statuses": [
{"context": "...", "state": "...", "target_url": "...",
"description": "..."},
...
],
...
}
Raises ApiError on non-2xx.
"""
_, body = api("GET", f"/repos/{OWNER}/{NAME}/commits/{sha}/status")
if not isinstance(body, dict):
raise ApiError(f"status for {sha} response not a JSON object")
return body
# --------------------------------------------------------------------------
# Context parsing
# --------------------------------------------------------------------------
def parse_suffixed_context(context: str, suffix: str) -> tuple[str, str] | None:
"""Parse `<workflow_name> / <job_name> (<event>)` into
(workflow_name, job_name).
Returns None if the context doesn't match the shape (caller skips).
Strict: requires the trailing suffix and at least one ` / `
separator. Anything else is left alone.
"""
if not context.endswith(suffix):
return None
head = context[: -len(suffix)]
if " / " not in head:
return None
workflow_name, job_name = head.split(" / ", 1)
return workflow_name, job_name
def parse_push_context(context: str) -> tuple[str, str] | None:
"""Parse `<workflow_name> / <job_name> (push)` into
(workflow_name, job_name)."""
return parse_suffixed_context(context, PUSH_SUFFIX)
def push_equivalent_context(context: str) -> str | None:
"""Return the matching `(push)` context for a `(pull_request)` context."""
parsed = parse_suffixed_context(context, PULL_REQUEST_SUFFIX)
if parsed is None:
return None
workflow_name, job_name = parsed
return f"{workflow_name} / {job_name}{PUSH_SUFFIX}"
# --------------------------------------------------------------------------
# Compensating POST
# --------------------------------------------------------------------------
def post_compensating_status(
sha: str,
context: str,
target_url: str | None,
*,
description: str = PUSH_COMPENSATION_DESCRIPTION,
dry_run: bool = False,
) -> None:
"""POST a `state=success` to /repos/{o}/{r}/statuses/{sha} with the
given context. Gitea de-dups by context (latest write wins).
Description references this script so the compensation is
self-documenting on the commit's status view.
"""
payload: dict[str, Any] = {
"context": context,
"state": "success",
"description": description,
}
# Echo the original target_url when present so a human auditing
# the (now-green) compensated status can still reach the run logs
# that produced the original red.
if target_url:
payload["target_url"] = target_url
if dry_run:
print(
f"::notice::[dry-run] would compensate {context!r} on {sha[:10]} "
f"with state=success"
)
return
api("POST", f"/repos/{OWNER}/{NAME}/statuses/{sha}", body=payload)
print(f"::notice::compensated {context!r} on {sha[:10]} (state=success)")
# --------------------------------------------------------------------------
# Main reap loop
# --------------------------------------------------------------------------
def reap(
workflow_trigger_map: dict[str, bool],
combined: dict,
sha: str,
*,
dry_run: bool = False,
) -> dict[str, Any]:
"""Walk `combined.statuses[]` and compensate where appropriate.
Per-SHA worker. The multi-SHA orchestrator (`reap_branch`) calls
this once per stale main commit each tick.
Returns counters for observability:
{compensated, preserved_real_push, preserved_unknown,
preserved_non_failure, preserved_non_push_suffix,
preserved_unparseable, compensated_pr_shadowed_by_push_success,
preserved_pr_without_push_success, compensated_cancelled_push,
compensated_contexts: [<context>, ...]}
`compensated_contexts` is rev2-added so `reap_branch` can build
`compensated_per_sha` without re-deriving it from the POST stream.
"""
counters: dict[str, Any] = {
"compensated": 0,
"preserved_real_push": 0,
"preserved_unknown": 0,
"preserved_non_failure": 0,
"preserved_non_push_suffix": 0,
"preserved_unparseable": 0,
"compensated_pr_shadowed_by_push_success": 0,
"compensated_cancelled_push": 0,
"preserved_pr_without_push_success": 0,
"compensated_contexts": [],
}
statuses = combined.get("statuses") or []
successful_contexts = {
(s.get("context") or "")
for s in statuses
if isinstance(s, dict) and (s.get("status") or s.get("state") or "") == "success"
}
for s in statuses:
if not isinstance(s, dict):
continue
context = s.get("context") or ""
# Schema asymmetry: Gitea 1.22.6 returns the TOP-LEVEL combined
# aggregate as `combined.state` but each per-context entry in
# `combined.statuses[]` uses the key `status`, NOT `state`.
# Prefer `status`; fall back to `state` so a future Gitea
# version (or a test fixture written against the wrong key)
# still flows through the compensation path. Verified empirically
# via direct API probe 2026-05-12 03:42Z:
# /repos/.../commits/{sha}/status entries → key is "status".
# Pre-rev4 code read "state" only → returned "" → bypassed the
# `state != "failure"` guard → compensation path unreachable.
# See `feedback_smoke_test_vendor_truth_not_shape_match`.
state = s.get("status") or s.get("state") or ""
# Only `failure` is the bug shape. `error`/`pending`/`success`
# left alone — they have other meanings.
if state != "failure":
counters["preserved_non_failure"] += 1
continue
# Default-branch `pull_request` contexts can be stale shadows of
# the exact same workflow/job already proven by the successful
# `push` context on the same SHA. Compensate only that narrow
# shape; a missing or failed push equivalent remains a real gate
# signal and is preserved.
push_equivalent = push_equivalent_context(context)
if push_equivalent is not None:
if push_equivalent in successful_contexts:
post_compensating_status(
sha,
context,
s.get("target_url"),
description=PR_SHADOW_COMPENSATION_DESCRIPTION,
dry_run=dry_run,
)
counters["compensated"] += 1
counters["compensated_pr_shadowed_by_push_success"] += 1
counters["compensated_contexts"].append(context)
else:
counters["preserved_pr_without_push_success"] += 1
continue
# Only `(push)`-suffix contexts hit the hardcoded-suffix bug.
# Other failed contexts are preserved unless handled by the
# pull-request-shadow rule above.
if not context.endswith(PUSH_SUFFIX):
counters["preserved_non_push_suffix"] += 1
continue
parsed = parse_push_context(context)
if parsed is None:
# Has ` (push)` suffix but missing ` / ` separator — not
# the bug shape. Preserve.
counters["preserved_unparseable"] += 1
continue
workflow_name, _job_name = parsed
if workflow_name not in workflow_trigger_map:
# Real workflow but renamed/deleted/external — we can't
# tell if it has push trigger. Conservative: preserve.
print(f"::notice::unknown workflow {workflow_name!r}; skip")
counters["preserved_unknown"] += 1
continue
if (s.get("description") or "").strip() == CANCELLED_DESCRIPTION:
# Gitea 1.22.6 maps cancelled action runs to failure commit
# statuses. During merge bursts, older push runs can be
# superseded and cancelled even though a newer run for the
# same branch is the real signal. Compensate only the exact
# Gitea cancellation description; real push failures remain red.
post_compensating_status(
sha,
context,
s.get("target_url"),
description=CANCELLED_PUSH_COMPENSATION_DESCRIPTION,
dry_run=dry_run,
)
counters["compensated"] += 1
counters["compensated_cancelled_push"] += 1
counters["compensated_contexts"].append(context)
continue
if workflow_trigger_map[workflow_name]:
# Real push trigger with a non-cancelled failure description
# remains a defect signal. Preserve.
counters["preserved_real_push"] += 1
continue
# Class-O: schedule/dispatch/etc.-only workflow with a fake
# (push) status from Gitea's hardcoded-suffix bug. Compensate.
post_compensating_status(
sha, context, s.get("target_url"), dry_run=dry_run
)
counters["compensated"] += 1
counters["compensated_contexts"].append(context)
return counters
# --------------------------------------------------------------------------
# rev2: multi-SHA sweep over the last N commits on WATCH_BRANCH
# --------------------------------------------------------------------------
# How many main commits to sweep per tick. Sized to cover a burst-merge
# window where multiple PRs land in the 5-min interval between reaper
# ticks. Older reds falling off the window is acceptable — they were
# already stale enough that the schedule-run that posted them has long
# since been overwritten by a real push trigger. See `reference_post_
# suspension_pipeline` for the merge-cadence baseline.
#
# rev3 (2026-05-12, hongming-pc2 GO 03:25Z): widened from 10 → 30.
# rev2 (limit=10) shipped 01:48Z and ran 6/6 ticks post-merge with
# `compensated:0` despite ~25 stranded reds visible on those same 10
# SHAs ~30min later. Root cause: schedule workflows post `failure`
# RETROACTIVELY 5-15 min after their merge, so by the time reaper's
# next */5 tick lands, the stranded red is on a SHA that has already
# fallen out of a 10-commit window during a burst-merge period.
# Trades window-width-cheap for cadence-loady (per hongming-pc2):
# kept `*/5` cron unchanged; only the window-N is widened.
DEFAULT_SWEEP_LIMIT = 30
def list_recent_commit_shas(branch: str, limit: int) -> list[str]:
"""List the most recent `limit` commit SHAs on `branch`, newest
first.
Wraps GET /repos/{o}/{r}/commits?sha={branch}&limit={limit}. Gitea
1.22.6 returns a JSON list of commit objects each with a `sha` key
(verified via vendor-truth probe 2026-05-11 against
git.moleculesai.app — `feedback_smoke_test_vendor_truth_not_shape_match`).
Raises ApiError on non-2xx OR on unexpected response shape. The
branch-level caller soft-skips this tick because the next scheduled
tick can safely retry the listing. Per-SHA status/write errors remain
separate and must not be mislabeled as commit-list outages.
"""
_, body = api(
"GET",
f"/repos/{OWNER}/{NAME}/commits",
query={"sha": branch, "limit": str(limit)},
)
if not isinstance(body, list):
raise ApiError(
f"commits listing for {branch} not a JSON array "
f"(got {type(body).__name__})"
)
shas: list[str] = []
for entry in body:
if not isinstance(entry, dict):
continue
sha = entry.get("sha")
if isinstance(sha, str) and len(sha) >= 7:
shas.append(sha)
if not shas:
raise ApiError(
f"commits listing for {branch} returned no usable SHAs"
)
return shas
def reap_branch(
workflow_trigger_map: dict[str, bool],
branch: str,
*,
limit: int = DEFAULT_SWEEP_LIMIT,
dry_run: bool = False,
) -> dict[str, Any]:
"""Sweep the last `limit` commits on `branch`, applying `reap()`
to each (with per-SHA error isolation).
Returns aggregated counters PLUS rev2 observability fields:
- scanned_shas: how many SHAs we actually iterated
- compensated_per_sha: {<sha_full>: [<context>, ...]} — only
SHAs that actually got at least one compensation are included
"""
try:
shas = list_recent_commit_shas(branch, limit)
except ApiError as e:
print(
"::warning::status-reaper skipped this tick because the "
f"commit list could not be read after retries: {e}"
)
return {
"scanned_shas": 0,
"compensated": 0,
"preserved_real_push": 0,
"preserved_unknown": 0,
"preserved_non_failure": 0,
"preserved_non_push_suffix": 0,
"preserved_unparseable": 0,
"compensated_pr_shadowed_by_push_success": 0,
"compensated_cancelled_push": 0,
"preserved_pr_without_push_success": 0,
"compensated_per_sha": {},
"skipped": True,
"skip_reason": "commit-list-api-error",
}
aggregate: dict[str, Any] = {
"scanned_shas": 0,
"compensated": 0,
"preserved_real_push": 0,
"preserved_unknown": 0,
"preserved_non_failure": 0,
"preserved_non_push_suffix": 0,
"preserved_unparseable": 0,
"compensated_pr_shadowed_by_push_success": 0,
"compensated_cancelled_push": 0,
"preserved_pr_without_push_success": 0,
"compensated_per_sha": {},
}
for sha in shas:
aggregate["scanned_shas"] += 1
# Per-SHA error isolation (refinement #7). One transient blip
# on a historical commit must NOT abort the whole tick — the
# OTHER stale SHAs may still hold strandable reds.
try:
combined = get_combined_status(sha)
except ApiError as e:
print(
f"::warning::get_combined_status({sha[:10]}) failed; "
f"skipping this SHA: {e}"
)
continue
# Cost optimization (refinement #2): the common case is a green
# commit. Skip the per-context loop entirely when combined is
# already success — saves a tight loop over ~20 statuses per SHA
# on green commits, the dominant majority.
if combined.get("state") == "success":
continue
per_sha = reap(
workflow_trigger_map, combined, sha, dry_run=dry_run
)
# Aggregate scalar counters.
for key in (
"compensated",
"preserved_real_push",
"preserved_unknown",
"preserved_non_failure",
"preserved_non_push_suffix",
"preserved_unparseable",
"compensated_pr_shadowed_by_push_success",
"compensated_cancelled_push",
"preserved_pr_without_push_success",
):
aggregate[key] += per_sha[key]
# Record per-SHA compensated contexts (only when non-empty —
# keep the summary readable when most SHAs are no-ops).
contexts = per_sha.get("compensated_contexts") or []
if contexts:
aggregate["compensated_per_sha"][sha] = list(contexts)
return aggregate
def main() -> int:
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--dry-run",
action="store_true",
help="Skip the compensating POST; print what would be done.",
)
parser.add_argument(
"--limit",
type=int,
default=DEFAULT_SWEEP_LIMIT,
help=(
"How many recent commits on WATCH_BRANCH to sweep per tick "
f"(default: {DEFAULT_SWEEP_LIMIT})."
),
)
args = parser.parse_args()
_require_runtime_env()
workflow_trigger_map = scan_workflows(WORKFLOWS_DIR)
print(
f"::notice::scanned {len(workflow_trigger_map)} workflows; "
f"push-triggered={sum(1 for v in workflow_trigger_map.values() if v)}, "
f"class-O candidates={sum(1 for v in workflow_trigger_map.values() if not v)}"
)
counters = reap_branch(
workflow_trigger_map,
WATCH_BRANCH,
limit=args.limit,
dry_run=args.dry_run,
)
# Observability: print one JSON line summarising the tick. Loki
# ingestion via the runner's stdout (`source="gitea-actions"`).
print(
"status-reaper summary: "
+ json.dumps(
{
"branch": WATCH_BRANCH,
"dry_run": args.dry_run,
"limit": args.limit,
**counters,
},
sort_keys=True,
)
)
return 0
if __name__ == "__main__":
sys.exit(main())
-28
View File
@@ -1,28 +0,0 @@
#!/usr/bin/env bash
# Mock sop-tier-check.sh for sop-tier-refire tests.
#
# Exits 0 ("PASS") if $MOCK_TIER_RESULT == "pass", else exits 1.
# This lets the refire tests cover the success + failure status-POST
# paths without invoking the real sop-tier-check.sh (which uses bash 4+
# associative arrays — known parser bug on macOS bash 3.2 dev box).
set -euo pipefail
case "${MOCK_TIER_RESULT:-pass}" in
pass)
echo "::notice::mock tier-check: PASS"
exit 0
;;
fail_no_label)
echo "::error::mock tier-check: no tier label"
exit 1
;;
fail_no_approvals)
echo "::error::mock tier-check: no approving reviews"
exit 1
;;
*)
echo "::error::mock tier-check: unknown MOCK_TIER_RESULT=${MOCK_TIER_RESULT:-}"
exit 2
;;
esac
-208
View File
@@ -1,208 +0,0 @@
#!/usr/bin/env python3
"""Stub Gitea API for sop-tier-refire test scenarios.
Reads $FIXTURE_STATE_DIR/scenario to decide what to return for each
endpoint the sop-tier-refire.sh + sop-tier-check.sh scripts call.
Captures every POST to /statuses/{sha} into posted_statuses.jsonl so
the test can assert what the script tried to write.
Scenarios:
T1_success — tier:low + APPROVED by engineer → tier-check passes
T2_no_tier_label — no tier label → tier-check exits 1 before POST
T3_no_approvals — tier:low but zero approving reviews → exits 1
T4_closed — PR state=closed → refire is a no-op
T5_rate_limited — last status update 5 seconds ago → skip
Usage:
FIXTURE_STATE_DIR=/tmp/x python3 _refire_fixture.py 8080
"""
import datetime
import http.server
import json
import os
import re
import sys
import urllib.parse
STATE_DIR = os.environ["FIXTURE_STATE_DIR"]
def scenario() -> str:
p = os.path.join(STATE_DIR, "scenario")
if not os.path.isfile(p):
return "T1_success"
with open(p) as f:
return f.read().strip()
def now_iso() -> str:
return datetime.datetime.now(datetime.timezone.utc).isoformat()
def append_post(body: dict) -> None:
with open(os.path.join(STATE_DIR, "posted_statuses.jsonl"), "a") as f:
f.write(json.dumps(body) + "\n")
def pr_payload() -> dict:
sc = scenario()
state = "closed" if sc == "T4_closed" else "open"
return {
"number": 999,
"state": state,
"head": {"sha": "deadbeef0000111122223333444455556666"},
"user": {"login": "feature-author"},
}
def labels_payload() -> list:
sc = scenario()
if sc == "T2_no_tier_label":
return [{"name": "bug"}]
# All other scenarios use tier:low
return [{"name": "tier:low"}, {"name": "ci"}]
def reviews_payload() -> list:
sc = scenario()
if sc == "T3_no_approvals":
return []
# All other scenarios have one APPROVED review by an engineer
return [
{
"state": "APPROVED",
"user": {"login": "reviewer-engineer"},
}
]
def teams_payload() -> list:
# Mirror the real molecule-ai org teams referenced in TIER_EXPR
return [
{"id": 5, "name": "ceo"},
{"id": 2, "name": "engineers"},
{"id": 6, "name": "managers"},
]
def statuses_payload() -> list:
sc = scenario()
if sc == "T5_rate_limited":
recent = (
datetime.datetime.now(datetime.timezone.utc)
- datetime.timedelta(seconds=5)
).isoformat()
return [
{
"context": "sop-tier-check / tier-check (pull_request)",
"state": "failure",
"updated_at": recent,
}
]
return []
def user_payload() -> dict:
# Mirrors the WHOAMI probe in sop-tier-check.sh
return {"login": "sop-tier-bot-fixture"}
class Handler(http.server.BaseHTTPRequestHandler):
# Quiet — keep stdout for explicit logs only.
def log_message(self, *args, **kwargs): # noqa: D401
pass
def _json(self, code: int, body) -> None:
payload = json.dumps(body).encode()
self.send_response(code)
self.send_header("Content-Type", "application/json")
self.send_header("Content-Length", str(len(payload)))
self.end_headers()
self.wfile.write(payload)
def _empty(self, code: int) -> None:
self.send_response(code)
self.send_header("Content-Length", "0")
self.end_headers()
def do_GET(self): # noqa: N802
u = urllib.parse.urlparse(self.path)
path = u.path
if path == "/_ping":
return self._json(200, {"ok": True})
if path == "/api/v1/user":
return self._json(200, user_payload())
# /api/v1/repos/{owner}/{name}/pulls/{n}
m = re.match(r"^/api/v1/repos/[^/]+/[^/]+/pulls/(\d+)$", path)
if m:
return self._json(200, pr_payload())
# /api/v1/repos/{owner}/{name}/issues/{n}/labels
if re.match(r"^/api/v1/repos/[^/]+/[^/]+/issues/\d+/labels$", path):
return self._json(200, labels_payload())
# /api/v1/repos/{owner}/{name}/pulls/{n}/reviews
if re.match(r"^/api/v1/repos/[^/]+/[^/]+/pulls/\d+/reviews$", path):
return self._json(200, reviews_payload())
# /api/v1/orgs/{owner}/teams
if re.match(r"^/api/v1/orgs/[^/]+/teams$", path):
return self._json(200, teams_payload())
# /api/v1/teams/{id}/members/{login} → 204 if user is an engineer
m = re.match(r"^/api/v1/teams/(\d+)/members/([^/]+)$", path)
if m:
team_id, login = m.group(1), m.group(2)
# In our fixture reviewer-engineer ∈ engineers (id=2)
if team_id == "2" and login == "reviewer-engineer":
return self._empty(204)
return self._empty(404)
# /api/v1/orgs/{owner}/members/{login} — fallback path used when
# team-member probes all 403. We don't need it for these tests.
if re.match(r"^/api/v1/orgs/[^/]+/members/[^/]+$", path):
return self._empty(404)
# /api/v1/repos/{owner}/{name}/statuses/{sha}
if re.match(r"^/api/v1/repos/[^/]+/[^/]+/statuses/[^/]+$", path):
return self._json(200, statuses_payload())
return self._json(404, {"path": path, "msg": "fixture: no route"})
def do_POST(self): # noqa: N802
u = urllib.parse.urlparse(self.path)
path = u.path
length = int(self.headers.get("Content-Length") or 0)
raw = self.rfile.read(length) if length else b""
try:
body = json.loads(raw) if raw else {}
except Exception:
body = {"_raw": raw.decode(errors="replace")}
if re.match(r"^/api/v1/repos/[^/]+/[^/]+/statuses/[^/]+$", path):
append_post(body)
# Echo back something status-shaped — script only checks HTTP code.
return self._json(
201,
{
"context": body.get("context"),
"state": body.get("state"),
"created_at": now_iso(),
},
)
return self._json(404, {"path": path, "msg": "fixture: no route"})
def main():
port = int(sys.argv[1])
srv = http.server.ThreadingHTTPServer(("127.0.0.1", port), Handler)
srv.serve_forever()
if __name__ == "__main__":
main()
@@ -1,176 +0,0 @@
#!/usr/bin/env python3
"""Stub Gitea API for review-check.sh test scenarios.
Reads $FIXTURE_STATE_DIR/scenario to decide what to return for each
endpoint the review-check.sh script calls.
Reads $FIXTURE_STATE_DIR/token_owner_in_teams to decide whether
the team membership probe returns 200/204 (member) or 403 (not in team).
Scenarios:
T1_pr_open — open PR, author=alice, sha=deadbeef → continue
T2_pr_closed — closed PR → script exits 0 (no-op)
T3_reviews_approved_non_author — one APPROVED from non-author → candidates exist
T4_reviews_empty — zero APPROVED non-author → exit 1 (no candidates)
T5_reviews_only_author — only author reviews → exit 1 (no candidates)
T6_reviews_dismissed — dismissed APPROVED → treated as no approval
T7_team_member — team membership → 204 (member) → exit 0
T8_team_not_member — team membership → 404 (not a member) → exit 1
T9_team_403 — team membership → 403 (token not in team) → exit 1
T14_non_default_base — open PR targeting staging → script exits 0 (no-op)
T15_comments_agent_approval — reviews empty; comments have "[core-qa-agent] APPROVED" → exit 0
T16_comments_generic_approval — reviews empty; comments have "APPROVED" by team member → exit 0
T17_comments_no_approval — reviews empty; comments have no approval keywords → exit 1
Usage:
FIXTURE_STATE_DIR=/tmp/x python3 _review_check_fixture.py 8080
"""
import http.server
import json
import os
import re
import sys
import urllib.parse
STATE_DIR = os.environ.get("FIXTURE_STATE_DIR", "/tmp")
def scenario() -> str:
p = os.path.join(STATE_DIR, "scenario")
if not os.path.isfile(p):
return "T1_pr_open"
with open(p) as f:
return f.read().strip()
class Handler(http.server.BaseHTTPRequestHandler):
def log_message(self, *args, **kwargs):
pass # keep stdout for explicit logs only
def _json(self, code: int, body: dict) -> None:
payload = json.dumps(body).encode()
self.send_response(code)
self.send_header("Content-Type", "application/json")
self.send_header("Content-Length", str(len(payload)))
self.end_headers()
self.wfile.write(payload)
def _empty(self, code: int) -> None:
self.send_response(code)
self.send_header("Content-Length", "0")
self.end_headers()
def _text(self, code: int, body: str) -> None:
payload = body.encode()
self.send_response(code)
self.send_header("Content-Type", "text/plain")
self.send_header("Content-Length", str(len(payload)))
self.end_headers()
self.wfile.write(payload)
def do_GET(self):
u = urllib.parse.urlparse(self.path)
path = u.path
sc = scenario()
if path == "/_ping":
return self._json(200, {"ok": True})
# GET /repos/{owner}/{name}/pulls/{pr_number}
m = re.match(r"^/api/v1/repos/([^/]+)/([^/]+)/pulls/(\d+)$", path)
if m:
owner, name, pr_num = m.group(1), m.group(2), m.group(3)
if sc == "T2_pr_closed":
return self._json(200, {
"number": int(pr_num),
"state": "closed",
"head": {"sha": "deadbeef0000111122223333444455556666"},
"base": {"ref": "main"},
"user": {"login": "alice"},
})
return self._json(200, {
"number": int(pr_num),
"state": "open",
"head": {"sha": "deadbeef0000111122223333444455556666"},
"base": {"ref": "staging" if sc == "T14_non_default_base" else "main"},
"user": {"login": "alice"},
})
# GET /repos/{owner}/{name}/pulls/{pr_number}/reviews
m = re.match(r"^/api/v1/repos/([^/]+)/([^/]+)/pulls/(\d+)/reviews$", path)
if m:
if sc in ("T4_reviews_empty", "T5_reviews_only_author",
"T15_comments_agent_approval", "T16_comments_generic_approval",
"T17_comments_no_approval"):
return self._json(200, [])
if sc == "T6_reviews_dismissed":
return self._json(200, [{
"state": "APPROVED",
"dismissed": True,
"user": {"login": "core-devops"},
"commit_id": "abc1234",
}])
if sc == "T3_reviews_approved_non_author":
return self._json(200, [
{"state": "CHANGES_REQUESTED", "dismissed": False, "user": {"login": "bob"}, "commit_id": "abc1234"},
{"state": "APPROVED", "dismissed": False, "user": {"login": "core-devops"}, "commit_id": "abc1234"},
])
# Default: one non-author APPROVED
return self._json(200, [
{"state": "APPROVED", "dismissed": False, "user": {"login": "core-devops"}, "commit_id": "abc1234"},
])
# GET /repos/{owner}/{name}/issues/{pr_number}/comments
m = re.match(r"^/api/v1/repos/([^/]+)/([^/]+)/issues/(\d+)/comments$", path)
if m:
if sc == "T15_comments_agent_approval":
return self._json(200, [
{"user": {"login": "core-qa-agent"}, "body": "[core-qa-agent] APPROVED this PR. Good changes.", "id": 1},
{"user": {"login": "alice"}, "body": "I authored this PR", "id": 2},
{"user": {"login": "random-user"}, "body": "Looks okay to me", "id": 3},
])
if sc == "T16_comments_generic_approval":
return self._json(200, [
{"user": {"login": "core-qa-agent"}, "body": "APPROVED — all acceptance criteria met", "id": 1},
{"user": {"login": "alice"}, "body": "-authored", "id": 2},
])
if sc == "T17_comments_no_approval":
return self._json(200, [
{"user": {"login": "alice"}, "body": "I authored this PR", "id": 1},
{"user": {"login": "random-user"}, "body": "Looks okay to me", "id": 2},
])
# Default scenarios (T1T9, T14): no comments
return self._json(200, [])
# GET /teams/{team_id}/members/{username}
m = re.match(r"^/api/v1/teams/(\d+)/members/([^/]+)$", path)
if m:
team_id, login = m.group(1), m.group(2)
if sc == "T8_team_not_member":
return self._empty(404)
if sc == "T9_team_403":
return self._empty(403)
# T7_team_member: member
return self._empty(204)
# GET /repos/{owner}/{name}/statuses/{sha} — for N/A declaration check
m = re.match(r"^/api/v1/repos/([^/]+)/([^/]+)/statuses/([a-f0-9]+)$", path)
if m:
# All comment-based scenarios have no N/A declarations
return self._json(200, [])
return self._json(404, {"path": path, "msg": "fixture: no route"})
def do_POST(self):
self._json(404, {"path": self.path, "msg": "fixture: no POST routes"})
def main():
port = int(sys.argv[1])
srv = http.server.ThreadingHTTPServer(("127.0.0.1", port), Handler)
srv.serve_forever()
if __name__ == "__main__":
main()
@@ -1,130 +0,0 @@
import importlib.util
import sys
from pathlib import Path
SCRIPT = Path(__file__).resolve().parents[1] / "gitea-merge-queue.py"
spec = importlib.util.spec_from_file_location("gitea_merge_queue", SCRIPT)
mq = importlib.util.module_from_spec(spec)
sys.modules[spec.name] = mq
spec.loader.exec_module(mq)
def test_latest_statuses_dedupes_by_context_newest_first():
statuses = [
{"context": "CI / all-required (pull_request)", "status": "failure"},
{"context": "sop-checklist / all-items-acked (pull_request)", "state": "success"},
{"context": "CI / all-required (pull_request)", "status": "success"},
]
latest = mq.latest_statuses_by_context(statuses)
assert latest["CI / all-required (pull_request)"]["status"] == "failure"
assert latest["sop-checklist / all-items-acked (pull_request)"]["state"] == "success"
def test_required_contexts_green_rejects_missing_and_pending():
latest = mq.latest_statuses_by_context([
{"context": "CI / all-required (pull_request)", "status": "success"},
{"context": "sop-checklist / all-items-acked (pull_request)", "status": "pending"},
])
ok, missing_or_bad = mq.required_contexts_green(
latest,
[
"CI / all-required (pull_request)",
"sop-checklist / all-items-acked (pull_request)",
"qa-review / approved (pull_request)",
],
)
assert ok is False
assert missing_or_bad == [
"sop-checklist / all-items-acked (pull_request)=pending",
"qa-review / approved (pull_request)=missing",
]
def test_choose_next_pr_sorts_by_queue_label_timestamp_then_number():
issues = [
{
"number": 12,
"pull_request": {},
"labels": [{"name": "merge-queue"}],
"created_at": "2026-05-13T05:00:00Z",
"updated_at": "2026-05-13T06:00:00Z",
},
{
"number": 9,
"pull_request": {},
"labels": [{"name": "merge-queue"}],
"created_at": "2026-05-13T04:00:00Z",
"updated_at": "2026-05-13T07:00:00Z",
},
{
"number": 7,
"labels": [{"name": "merge-queue"}],
"created_at": "2026-05-13T03:00:00Z",
},
]
selected = mq.choose_next_queued_issue(issues, queue_label="merge-queue")
assert selected["number"] == 9
def test_pr_needs_update_when_base_sha_absent_from_commits():
commits = [
{"sha": "head"},
{"sha": "parent"},
]
assert mq.pr_contains_base_sha(commits, "mainsha") is False
assert mq.pr_contains_base_sha(commits, "parent") is True
def test_merge_decision_requires_main_green_pr_green_and_current_base():
required = ["CI / all-required (pull_request)"]
main_status = {
"state": "success",
"statuses": [{"context": "CI / all-required (push)", "status": "success"}],
}
pr_status = {
"state": "success",
"statuses": [{"context": "CI / all-required (pull_request)", "status": "success"}],
}
decision = mq.evaluate_merge_readiness(
main_status=main_status,
pr_status=pr_status,
required_contexts=required,
pr_has_current_base=True,
)
assert decision.ready is True
assert decision.action == "merge"
def test_merge_decision_updates_stale_pr_before_merge():
decision = mq.evaluate_merge_readiness(
main_status={
"state": "success",
"statuses": [{"context": "CI / all-required (push)", "status": "success"}],
},
pr_status={"state": "success", "statuses": [{"context": "CI / all-required (pull_request)", "status": "success"}]},
required_contexts=["CI / all-required (pull_request)"],
pr_has_current_base=False,
)
assert decision.ready is False
assert decision.action == "update"
def test_MergePermissionError_inherits_from_ApiError():
assert issubclass(mq.MergePermissionError, mq.ApiError)
def test_MergePermissionError_message_preserved():
exc = mq.MergePermissionError("POST /merge -> HTTP 405: User not allowed")
assert "405" in str(exc)
assert "User not allowed" in str(exc)
@@ -1,505 +0,0 @@
"""Unit tests for .gitea/scripts/lint_pre_flip_continue_on_error.py.
These tests pin the pure-logic surface (flip detection + per-flip
verdict aggregation) without making real HTTP calls. The end-to-end
git ls-tree + Gitea API path is exercised by running the workflow
against real PRs.
Run locally::
python3 -m unittest .gitea/scripts/tests/test_lint_pre_flip_continue_on_error.py -v
Mirrors the pattern in scripts/ops/test_check_migration_collisions.py
+ scripts/test_build_runtime_package.py.
"""
from __future__ import annotations
import importlib.util
import os
import sys
import unittest
from pathlib import Path
from unittest import mock
# Load the script as a module without invoking main(). Tests must NOT
# depend on the full runtime env contract (GITEA_TOKEN etc.), so we
# import individual functions and stub the network surface explicitly.
SCRIPT_PATH = Path(__file__).resolve().parent.parent / "lint_pre_flip_continue_on_error.py"
spec = importlib.util.spec_from_file_location("lpfc", SCRIPT_PATH)
lpfc = importlib.util.module_from_spec(spec)
spec.loader.exec_module(lpfc)
# --------------------------------------------------------------------------
# Fixtures: minimal valid workflow YAML on each side of a "diff"
# --------------------------------------------------------------------------
CI_YML_BASE = """\
name: CI
on:
push:
branches: [main]
jobs:
platform-build:
name: Platform (Go)
runs-on: ubuntu-latest
continue-on-error: true
steps:
- run: echo platform
canvas-build:
name: Canvas (Next.js)
runs-on: ubuntu-latest
continue-on-error: true
steps:
- run: echo canvas
all-required:
runs-on: ubuntu-latest
continue-on-error: true
needs: [platform-build, canvas-build]
steps:
- run: echo ok
"""
CI_YML_HEAD_FLIPPED = """\
name: CI
on:
push:
branches: [main]
jobs:
platform-build:
name: Platform (Go)
runs-on: ubuntu-latest
continue-on-error: false
steps:
- run: echo platform
canvas-build:
name: Canvas (Next.js)
runs-on: ubuntu-latest
continue-on-error: false
steps:
- run: echo canvas
all-required:
runs-on: ubuntu-latest
continue-on-error: true
needs: [platform-build, canvas-build]
steps:
- run: echo ok
"""
CI_YML_HEAD_NO_DIFF = CI_YML_BASE # identical to base, no flip
# --------------------------------------------------------------------------
# 1. CoE coercion (truthy/falsy/quoted/absent)
# --------------------------------------------------------------------------
class TestCoerceCoE(unittest.TestCase):
def test_python_bool_true(self):
self.assertTrue(lpfc._coerce_coe(True))
def test_python_bool_false(self):
self.assertFalse(lpfc._coerce_coe(False))
def test_none_is_false(self):
# GitHub Actions default: absent == false.
self.assertFalse(lpfc._coerce_coe(None))
def test_string_true_lowercase(self):
# Quoted "true" in YAML — Gitea Actions normalizes to True.
self.assertTrue(lpfc._coerce_coe("true"))
def test_string_True_titlecase(self):
self.assertTrue(lpfc._coerce_coe("True"))
def test_string_yes(self):
# YAML 1.1 truthy form.
self.assertTrue(lpfc._coerce_coe("yes"))
def test_string_false(self):
self.assertFalse(lpfc._coerce_coe("false"))
def test_string_random_falsy(self):
# An unrecognized string is treated as falsy — safer than
# silently coercing "maybe" to True and false-positiving a
# flip.
self.assertFalse(lpfc._coerce_coe("maybe"))
# --------------------------------------------------------------------------
# 2. Diff detection — flips, not arbitrary changes
# --------------------------------------------------------------------------
class TestDetectFlips(unittest.TestCase):
def test_no_flip_in_diff_passes(self):
# Acceptance test #1: PR doesn't flip continue-on-error → 0 flips.
flips = lpfc.detect_flips(
{".gitea/workflows/ci.yml": CI_YML_BASE},
{".gitea/workflows/ci.yml": CI_YML_HEAD_NO_DIFF},
)
self.assertEqual(flips, [])
def test_flip_detected_in_one_file(self):
flips = lpfc.detect_flips(
{".gitea/workflows/ci.yml": CI_YML_BASE},
{".gitea/workflows/ci.yml": CI_YML_HEAD_FLIPPED},
)
# Two jobs flipped: platform-build, canvas-build. all-required
# is still true on both sides.
self.assertEqual(len(flips), 2)
keys = sorted(f["job_key"] for f in flips)
self.assertEqual(keys, ["canvas-build", "platform-build"])
def test_context_name_render(self):
flips = lpfc.detect_flips(
{".gitea/workflows/ci.yml": CI_YML_BASE},
{".gitea/workflows/ci.yml": CI_YML_HEAD_FLIPPED},
)
platform = next(f for f in flips if f["job_key"] == "platform-build")
self.assertEqual(platform["context"], "CI / Platform (Go) (push)")
self.assertEqual(platform["workflow_name"], "CI")
def test_context_falls_back_to_job_key_when_no_name(self):
base = "name: WF\njobs:\n foo:\n continue-on-error: true\n runs-on: x\n steps: []\n"
head = "name: WF\njobs:\n foo:\n continue-on-error: false\n runs-on: x\n steps: []\n"
flips = lpfc.detect_flips({"a.yml": base}, {"a.yml": head})
self.assertEqual(len(flips), 1)
self.assertEqual(flips[0]["context"], "WF / foo (push)")
def test_no_flip_when_only_one_side_has_file(self):
# Newly added workflow file — head has CoE:false, base has no
# file. Adding a new workflow with CoE:false is fine; there's
# nothing to mask.
flips = lpfc.detect_flips(
{}, # base has no workflow files
{".gitea/workflows/new.yml": CI_YML_HEAD_FLIPPED},
)
self.assertEqual(flips, [])
def test_no_flip_when_job_removed(self):
# Job exists on base, not on head — a removal, not a flip.
head = """\
name: CI
jobs:
canvas-build:
name: Canvas (Next.js)
continue-on-error: true
runs-on: ubuntu-latest
steps: []
"""
flips = lpfc.detect_flips(
{".gitea/workflows/ci.yml": CI_YML_BASE},
{".gitea/workflows/ci.yml": head},
)
self.assertEqual(flips, [])
def test_no_flip_when_job_added_with_false(self):
# New job on head with CoE:false — no base side; not a flip.
head_with_new = CI_YML_BASE.replace(
" all-required:",
" newjob:\n name: New Job\n continue-on-error: false\n"
" runs-on: x\n steps: []\n"
" all-required:",
)
flips = lpfc.detect_flips(
{".gitea/workflows/ci.yml": CI_YML_BASE},
{".gitea/workflows/ci.yml": head_with_new},
)
self.assertEqual(flips, [])
def test_yaml_parse_error_warns_not_raises(self):
# Malformed YAML on head — should warn (stderr) and skip,
# not raise.
bad_head = "name: CI\njobs:\n :::\n"
# Capture stderr so the test isn't noisy.
with mock.patch.object(sys, "stderr"):
flips = lpfc.detect_flips(
{".gitea/workflows/ci.yml": CI_YML_BASE},
{".gitea/workflows/ci.yml": bad_head},
)
self.assertEqual(flips, [])
# --------------------------------------------------------------------------
# 3. grep_fail_markers — the regex / substring matcher
# --------------------------------------------------------------------------
class TestGrepFailMarkers(unittest.TestCase):
def test_clean_log_returns_empty(self):
log = "===== test run starting =====\nPASS\nok example.com/foo 1.234s\n"
self.assertEqual(lpfc.grep_fail_markers(log), [])
def test_go_minus_minus_minus_fail_caught(self):
log = "ok example.com/foo 1.234s\n--- FAIL: TestBar (0.01s)\n bar_test.go:42:\n"
matches = lpfc.grep_fail_markers(log)
self.assertEqual(len(matches), 1)
self.assertIn("FAIL: TestBar", matches[0])
def test_go_package_fail_caught(self):
log = "FAIL\texample.com/baz\t1.234s\n"
matches = lpfc.grep_fail_markers(log)
self.assertEqual(len(matches), 1)
self.assertIn("FAIL", matches[0])
def test_bash_error_directive_caught(self):
# `lint-curl-status-capture` pattern: a python heredoc inside a
# bash step that prints `::error::` then sys.exit(1). With
# continue-on-error:true the job rolls up as success despite
# this line. THAT's the masking we're trying to catch.
log = "Running scan...\n::error::Found 3 curl-status-capture pollution site(s):\n"
matches = lpfc.grep_fail_markers(log)
self.assertEqual(len(matches), 1)
self.assertIn("::error::", matches[0])
def test_caps_matches_at_max_5(self):
log = "\n".join(["--- FAIL: T%d" % i for i in range(20)])
matches = lpfc.grep_fail_markers(log)
self.assertEqual(len(matches), 5)
# --------------------------------------------------------------------------
# 4. verify_flip — single-flip verdict assembly (network surface stubbed)
# --------------------------------------------------------------------------
def _stub_status(context: str, state: str, target_url: str = "/owner/repo/actions/runs/1/jobs/0") -> dict:
"""Build a single-context combined-status response."""
return {
"state": state,
"statuses": [
{"context": context, "status": state, "target_url": target_url, "description": ""}
],
}
FLIP_FIXTURE = {
"workflow_path": ".gitea/workflows/ci.yml",
"workflow_name": "CI",
"job_key": "platform-build",
"job_name": "Platform (Go)",
"context": "CI / Platform (Go) (push)",
}
class TestVerifyFlip(unittest.TestCase):
def test_flip_with_clean_history_passes(self):
# Acceptance test #2: flip detected, last 5 runs clean → exit 0.
with mock.patch.object(lpfc, "recent_commits_on_branch", return_value=["sha1", "sha2", "sha3"]):
with mock.patch.object(
lpfc, "combined_status",
side_effect=[_stub_status(FLIP_FIXTURE["context"], "success") for _ in range(3)],
):
with mock.patch.object(lpfc, "fetch_log", return_value="ok example.com/foo 1s\nPASS\n"):
verdict = lpfc.verify_flip(FLIP_FIXTURE, "main", 5)
self.assertEqual(verdict["fail_runs"], [])
self.assertEqual(verdict["masked_runs"], [])
self.assertEqual(verdict["checked_commits"], 3)
self.assertEqual(verdict["warnings"], [])
def test_flip_with_recent_fail_blocks(self):
# Acceptance test #3: flip detected, recent run has --- FAIL → exit 1.
# Setup: 3 commits, the most recent run's log shows --- FAIL
# but the STATUS is success (Quirk #10 mask). That's the
# masked_runs case.
log_with_fail = "ok example.com/foo 1s\n--- FAIL: TestSqlmock (0.01s)\n sqlmock_test.go:42:\n"
with mock.patch.object(lpfc, "recent_commits_on_branch", return_value=["sha1", "sha2", "sha3"]):
with mock.patch.object(
lpfc, "combined_status",
side_effect=[_stub_status(FLIP_FIXTURE["context"], "success") for _ in range(3)],
):
with mock.patch.object(lpfc, "fetch_log", side_effect=[log_with_fail, "PASS\n", "PASS\n"]):
verdict = lpfc.verify_flip(FLIP_FIXTURE, "main", 5)
self.assertEqual(len(verdict["masked_runs"]), 1)
self.assertEqual(verdict["masked_runs"][0]["sha"], "sha1")
self.assertTrue(any("TestSqlmock" in s for s in verdict["masked_runs"][0]["samples"]))
self.assertEqual(verdict["fail_runs"], [])
def test_red_status_alone_blocks(self):
# Status itself is `failure` — block without needing log
# markers. (Belt-and-braces: even with a clean log, a `failure`
# status means the job's exit code was non-zero.)
with mock.patch.object(lpfc, "recent_commits_on_branch", return_value=["sha1"]):
with mock.patch.object(
lpfc, "combined_status",
return_value=_stub_status(FLIP_FIXTURE["context"], "failure"),
):
with mock.patch.object(lpfc, "fetch_log", return_value="some unrelated text\n"):
verdict = lpfc.verify_flip(FLIP_FIXTURE, "main", 5)
self.assertEqual(len(verdict["fail_runs"]), 1)
self.assertEqual(verdict["fail_runs"][0]["status"], "failure")
def test_unreadable_log_warns_not_blocks(self):
# Acceptance test #5: log fetch 404 (None) → warn, not block.
# Status is `success`, log is None — we can't tell, so we warn
# and allow.
with mock.patch.object(lpfc, "recent_commits_on_branch", return_value=["sha1"]):
with mock.patch.object(
lpfc, "combined_status",
return_value=_stub_status(FLIP_FIXTURE["context"], "success"),
):
with mock.patch.object(lpfc, "fetch_log", return_value=None):
verdict = lpfc.verify_flip(FLIP_FIXTURE, "main", 5)
self.assertEqual(verdict["fail_runs"], [])
self.assertEqual(verdict["masked_runs"], [])
self.assertTrue(any("log unavailable" in w for w in verdict["warnings"]))
def test_unreadable_log_with_failure_status_still_blocks(self):
# Edge case: log fetch fails BUT the status itself is `failure`.
# We can still block — the status alone is sufficient signal,
# we don't need the log to confirm.
with mock.patch.object(lpfc, "recent_commits_on_branch", return_value=["sha1"]):
with mock.patch.object(
lpfc, "combined_status",
return_value=_stub_status(FLIP_FIXTURE["context"], "failure"),
):
with mock.patch.object(lpfc, "fetch_log", return_value=None):
verdict = lpfc.verify_flip(FLIP_FIXTURE, "main", 5)
self.assertEqual(len(verdict["fail_runs"]), 1)
self.assertIn("log unavailable", verdict["fail_runs"][0]["samples"][0])
def test_zero_runs_history_warns_allows(self):
# No commits with a matching context — newly added workflow.
# Allow with warning.
with mock.patch.object(lpfc, "recent_commits_on_branch", return_value=["sha1", "sha2"]):
with mock.patch.object(
lpfc, "combined_status",
return_value={"state": "success", "statuses": []}, # no matching context
):
verdict = lpfc.verify_flip(FLIP_FIXTURE, "main", 5)
self.assertEqual(verdict["checked_commits"], 0)
self.assertEqual(verdict["fail_runs"], [])
self.assertEqual(verdict["masked_runs"], [])
self.assertTrue(any("no runs of" in w for w in verdict["warnings"]))
def test_zero_commits_warns_allows(self):
# Empty branch (newly created repo, e.g.). Allow with warning.
with mock.patch.object(lpfc, "recent_commits_on_branch", return_value=[]):
verdict = lpfc.verify_flip(FLIP_FIXTURE, "main", 5)
self.assertEqual(verdict["checked_commits"], 0)
self.assertEqual(verdict["fail_runs"], [])
self.assertEqual(verdict["masked_runs"], [])
self.assertTrue(any("no recent commits" in w for w in verdict["warnings"]))
# --------------------------------------------------------------------------
# 5. Multiple-flip aggregation in main()
# --------------------------------------------------------------------------
class TestMainAggregation(unittest.TestCase):
"""Tests that `main()` aggregates multiple flips and exits 1 when
ANY one of them has a masked or red recent run. Acceptance test #4.
We stub at the verify_flip + workflows_at_sha + _require_runtime_env
boundary so we don't need real git or HTTP.
"""
def setUp(self):
# The actual env values are irrelevant — _require_runtime_env
# is stubbed out — but the module reads OWNER/NAME at import
# time. Patch the runtime env contract to a no-op for the
# duration of each test.
self._patches = [
mock.patch.object(lpfc, "_require_runtime_env", return_value=None),
mock.patch.object(lpfc, "BASE_REF", "main"),
mock.patch.object(lpfc, "BASE_SHA", "deadbeefcafe"),
mock.patch.object(lpfc, "HEAD_SHA", "feedfaceabad"),
mock.patch.object(lpfc, "RECENT_COMMITS_N", 5),
]
for p in self._patches:
p.start()
self.addCleanup(lambda: [p.stop() for p in self._patches])
def test_multiple_flips_aggregated_one_bad_blocks(self):
# PR flips 3 jobs; 1 has a recent fail → exit 1, naming that job.
flips = [
{"workflow_path": ".gitea/workflows/ci.yml", "workflow_name": "CI",
"job_key": "platform-build", "job_name": "Platform (Go)",
"context": "CI / Platform (Go) (push)"},
{"workflow_path": ".gitea/workflows/ci.yml", "workflow_name": "CI",
"job_key": "canvas-build", "job_name": "Canvas (Next.js)",
"context": "CI / Canvas (Next.js) (push)"},
{"workflow_path": ".gitea/workflows/ci.yml", "workflow_name": "CI",
"job_key": "python-lint", "job_name": "Python Lint & Test",
"context": "CI / Python Lint & Test (push)"},
]
clean = {"flip": flips[0], "checked_commits": 5, "masked_runs": [],
"fail_runs": [], "warnings": []}
bad = {"flip": flips[1], "checked_commits": 5,
"masked_runs": [{"sha": "abc1234567", "status": "success",
"target_url": "/x/y/actions/runs/1/jobs/0",
"samples": ["--- FAIL: TestSqlmock"]}],
"fail_runs": [], "warnings": []}
also_clean = {"flip": flips[2], "checked_commits": 5, "masked_runs": [],
"fail_runs": [], "warnings": []}
with mock.patch.object(lpfc, "workflows_at_sha", return_value={}):
with mock.patch.object(lpfc, "detect_flips", return_value=flips):
with mock.patch.object(lpfc, "verify_flip",
side_effect=[clean, bad, also_clean]):
# Capture stdout to assert on naming.
captured = []
with mock.patch("builtins.print", side_effect=lambda *a, **k: captured.append(" ".join(str(x) for x in a))):
rc = lpfc.main([])
self.assertEqual(rc, 1)
# The blocking error message must name the failing job.
joined = "\n".join(captured)
self.assertIn("canvas-build", joined)
# And it must mention the empirical class so a reviewer can
# cross-link the right RFC.
self.assertTrue("mc#664" in joined or "PR#656" in joined)
def test_no_flips_in_diff_exits_zero(self):
# Acceptance test #1 at main() level: empty flips → exit 0.
with mock.patch.object(lpfc, "workflows_at_sha", return_value={}):
with mock.patch.object(lpfc, "detect_flips", return_value=[]):
rc = lpfc.main([])
self.assertEqual(rc, 0)
def test_all_flips_clean_exits_zero(self):
flips = [{"workflow_path": ".gitea/workflows/ci.yml", "workflow_name": "CI",
"job_key": "platform-build", "job_name": "Platform (Go)",
"context": "CI / Platform (Go) (push)"}]
clean = {"flip": flips[0], "checked_commits": 5, "masked_runs": [],
"fail_runs": [], "warnings": []}
with mock.patch.object(lpfc, "workflows_at_sha", return_value={}):
with mock.patch.object(lpfc, "detect_flips", return_value=flips):
with mock.patch.object(lpfc, "verify_flip", return_value=clean):
rc = lpfc.main([])
self.assertEqual(rc, 0)
def test_dry_run_forces_exit_zero_even_with_bad_flip(self):
# --dry-run never fails, even when verification finds masked runs.
flips = [{"workflow_path": ".gitea/workflows/ci.yml", "workflow_name": "CI",
"job_key": "platform-build", "job_name": "Platform (Go)",
"context": "CI / Platform (Go) (push)"}]
bad = {"flip": flips[0], "checked_commits": 5,
"masked_runs": [{"sha": "abc1234567", "status": "success",
"target_url": "/x/y/actions/runs/1/jobs/0",
"samples": ["--- FAIL: TestSqlmock"]}],
"fail_runs": [], "warnings": []}
with mock.patch.object(lpfc, "workflows_at_sha", return_value={}):
with mock.patch.object(lpfc, "detect_flips", return_value=flips):
with mock.patch.object(lpfc, "verify_flip", return_value=bad):
rc = lpfc.main(["--dry-run"])
self.assertEqual(rc, 0)
# --------------------------------------------------------------------------
# 6. Context-name rendering (the format Gitea Actions actually emits)
# --------------------------------------------------------------------------
class TestContextName(unittest.TestCase):
def test_push_event(self):
self.assertEqual(
lpfc.context_name("CI", "Platform (Go)", "push"),
"CI / Platform (Go) (push)",
)
def test_pull_request_event(self):
self.assertEqual(
lpfc.context_name("CI", "Platform (Go)", "pull_request"),
"CI / Platform (Go) (pull_request)",
)
def test_workflow_name_falls_back_to_filename(self):
# No top-level `name:` → falls back to filename minus extension.
doc = {"jobs": {"foo": {"continue-on-error": True}}}
self.assertEqual(
lpfc.workflow_name(doc, fallback="my-workflow"),
"my-workflow",
)
if __name__ == "__main__":
unittest.main()
@@ -1,148 +0,0 @@
import importlib.util
import sys
from pathlib import Path
SCRIPT = Path(__file__).resolve().parents[1] / "prod-auto-deploy.py"
spec = importlib.util.spec_from_file_location("prod_auto_deploy", SCRIPT)
prod = importlib.util.module_from_spec(spec)
sys.modules[spec.name] = prod
spec.loader.exec_module(prod)
def test_truthy_flag_accepts_operator_disable_values():
for value in ("1", "true", "TRUE", "yes", "on", "disabled", "disable"):
assert prod.truthy_flag(value) is True
for value in ("", "0", "false", "no", "off", None):
assert prod.truthy_flag(value) is False
def test_build_plan_defaults_to_staging_sha_target_and_prod_cp():
plan = prod.build_plan(
{
"GITHUB_SHA": "abcdef1234567890",
"PROD_AUTO_DEPLOY_DISABLED": "",
}
)
assert plan["enabled"] is True
assert plan["sha"] == "abcdef1234567890"
assert plan["target_tag"] == "staging-abcdef1"
assert plan["cp_url"] == "https://api.moleculesai.app"
assert plan["body"] == {
"target_tag": "staging-abcdef1",
"canary_slug": "hongming",
"soak_seconds": 60,
"batch_size": 3,
"dry_run": False,
# cp#228 / task #308: fleet-wide intent must carry confirm:true.
"confirm": True,
}
def test_build_plan_always_sets_confirm_true_for_fleet_intent():
"""Regression guard: every plan body MUST carry confirm:true.
CP /cp/admin/tenants/redeploy-fleet (cp#228) returns 400 on empty
body / {confirm:false} / {only_slugs:[]} to prevent accidental
fleet-wide mutation. This caller is fleet-wide intent (canary +
fan-out, no slug scoping), so the plan MUST carry confirm:true.
Pairs with cp#228's TestRedeployFleet_EmptyBodyReturns400 +
TestRedeployFleet_ConfirmTrueProceeds.
"""
plan = prod.build_plan({"GITHUB_SHA": "abcdef1234567890"})
assert plan["body"]["confirm"] is True
# Operator-overridable knobs do NOT drop the ack.
plan = prod.build_plan(
{
"GITHUB_SHA": "abcdef1234567890",
"PROD_AUTO_DEPLOY_SOAK_SECONDS": "0",
"PROD_AUTO_DEPLOY_BATCH_SIZE": "10",
"PROD_AUTO_DEPLOY_DRY_RUN": "true",
"PROD_AUTO_DEPLOY_CANARY_SLUG": "",
}
)
assert plan["body"]["confirm"] is True
def test_build_plan_rejects_non_prod_cp_without_explicit_override():
try:
prod.build_plan(
{
"GITHUB_SHA": "abcdef1234567890",
"CP_URL": "https://staging-api.moleculesai.app",
}
)
except ValueError as exc:
assert "PROD_ALLOW_NON_PROD_CP_URL=true" in str(exc)
else:
raise AssertionError("expected non-prod CP URL rejection")
def test_build_plan_allows_non_prod_cp_only_with_override():
plan = prod.build_plan(
{
"GITHUB_SHA": "abcdef1234567890",
"CP_URL": "https://staging-api.moleculesai.app",
"PROD_ALLOW_NON_PROD_CP_URL": "true",
}
)
assert plan["cp_url"] == "https://staging-api.moleculesai.app"
def test_build_plan_disable_flag_short_circuits_before_credentials():
plan = prod.build_plan(
{
"GITHUB_SHA": "abcdef1234567890",
"PROD_AUTO_DEPLOY_DISABLED": "true",
}
)
assert plan["enabled"] is False
assert plan["disabled_reason"] == "PROD_AUTO_DEPLOY_DISABLED=true"
def test_latest_status_for_context_uses_first_matching_status():
statuses = [
{"context": "CI / all-required (push)", "status": "pending"},
{"context": "CI / all-required (pull_request)", "status": "success"},
{"context": "CI / all-required (push)", "status": "success"},
]
latest = prod.latest_status_for_context(statuses, "CI / all-required (push)")
assert latest == {"context": "CI / all-required (push)", "status": "pending"}
def test_ci_context_state_handles_missing_and_gitea_status_key():
assert prod.ci_context_state([], "CI / all-required (push)") == "missing"
assert (
prod.ci_context_state(
[{"context": "CI / all-required (push)", "status": "success"}],
"CI / all-required (push)",
)
== "success"
)
assert (
prod.ci_context_state(
[{"context": "CI / all-required (push)", "state": "failure"}],
"CI / all-required (push)",
)
== "failure"
)
def test_context_is_satisfied_accepts_only_success():
assert prod.context_is_satisfied("success") is True
for state in ("failure", "error", "cancelled", "canceled", "skipped", "pending", "missing"):
assert prod.context_is_satisfied(state) is False
def test_context_is_terminal_failure_rejects_cancelled_and_skipped():
for state in ("failure", "error", "cancelled", "canceled", "skipped"):
assert prod.context_is_terminal_failure(state) is True
for state in ("pending", "missing", "success"):
assert prod.context_is_terminal_failure(state) is False
-368
View File
@@ -1,368 +0,0 @@
#!/usr/bin/env bash
# Regression tests for .gitea/scripts/review-check.sh (RFC#324 Step 1).
#
# Covers:
# T1 — open PR: script fetches PR + reviews, continues to team probe
# T2 — closed PR: script exits 0 (no-op)
# T3 — APPROVED non-author review exists → candidates exist
# T4 — no non-author APPROVED reviews → exit 1 (no candidates)
# T5 — only author reviews (no non-author APPROVE) → exit 1
# T6 — dismissed APPROVED review → treated as no approval
# T7 — team membership probe → 204 (member) → script exits 0
# T8 — team membership probe → 404 (not a member) → script exits 1
# T9 — team membership probe → 403 (token not in team) → script exits 1 (fail closed)
# T10 — CURL_AUTH_FILE created with mode 600 and correct header content
# T11 — bash syntax check (bash -n passes)
# T12 — jq filter: non-author APPROVED → in candidate list; dismissed → excluded
# T13 — missing required env GITEA_TOKEN → exits 1 with error
# T14 — non-default-base PR exits 0 without requiring review
#
# Hostile-self-review (per feedback_assert_exact_not_substring):
# this test MUST FAIL if the script is absent. Verified by running
# the test before the file exists (covered in the PR body).
set -euo pipefail
THIS_DIR="$(cd "$(dirname "$0")" && pwd)"
SCRIPT_DIR="$(cd "$THIS_DIR/.." && pwd)"
SCRIPT="$SCRIPT_DIR/review-check.sh"
PASS=0
FAIL=0
FAILED_TESTS=""
assert_eq() {
local label="$1"
local expected="$2"
local got="$3"
if [ "$expected" = "$got" ]; then
echo " PASS $label"
PASS=$((PASS + 1))
else
echo " FAIL $label"
echo " expected: <$expected>"
echo " got: <$got>"
FAIL=$((FAIL + 1))
FAILED_TESTS="${FAILED_TESTS} ${label}"
fi
}
assert_contains() {
local label="$1"
local needle="$2"
local haystack="$3"
if printf '%s' "$haystack" | grep -qF "$needle"; then
echo " PASS $label"
PASS=$((PASS + 1))
else
echo " FAIL $label"
echo " needle: <$needle>"
echo " haystack: <$(printf '%s' "$haystack" | head -c 200)>"
FAIL=$((FAIL + 1))
FAILED_TESTS="${FAILED_TESTS} ${label}"
fi
}
assert_file_mode() {
local label="$1"
local path="$2"
local expected_mode="$3"
if [ ! -f "$path" ]; then
echo " FAIL $label (file not found: $path)"
FAIL=$((FAIL + 1))
FAILED_TESTS="${FAILED_TESTS} ${label}"
return
fi
local got_mode
got_mode=$(stat -c '%a' "$path" 2>/dev/null || stat -f '%Lp' "$path" 2>/dev/null || echo "000")
if [ "$expected_mode" = "$got_mode" ]; then
echo " PASS $label (mode=$got_mode)"
PASS=$((PASS + 1))
else
echo " FAIL $label (expected mode=$expected_mode, got=$got_mode)"
FAIL=$((FAIL + 1))
FAILED_TESTS="${FAILED_TESTS} ${label}"
fi
}
assert_file_contains() {
local label="$1"
local path="$2"
local needle="$3"
if [ ! -f "$path" ]; then
echo " FAIL $label (file not found: $path)"
FAIL=$((FAIL + 1))
FAILED_TESTS="${FAILED_TESTS} ${label}"
return
fi
if grep -qF "$needle" "$path"; then
echo " PASS $label"
PASS=$((PASS + 1))
else
echo " FAIL $label (needle not found: <$needle>)"
FAIL=$((FAIL + 1))
FAILED_TESTS="${FAILED_TESTS} ${label}"
fi
}
# Existence check (foundation)
echo
echo "== existence =="
if [ -f "$SCRIPT" ]; then
echo " PASS script exists: $SCRIPT"
PASS=$((PASS + 1))
else
echo " FAIL script not found: $SCRIPT"
FAIL=$((FAIL + 1))
FAILED_TESTS="${FAILED_TESTS} script_exists"
echo
echo "------"
echo "PASS=$PASS FAIL=$FAIL (existence)"
echo "Cannot proceed without the script."
exit 1
fi
# T11 — bash syntax check
echo
echo "== T11 bash syntax =="
if bash -n "$SCRIPT" 2>&1; then
echo " PASS T11 bash -n passes"
PASS=$((PASS + 1))
else
echo " FAIL T11 bash -n failed"
FAIL=$((FAIL + 1))
FAILED_TESTS="${FAILED_TESTS} T11"
fi
# T13 — missing required env
echo
echo "== T13 missing GITEA_TOKEN =="
set +e
T13_OUT=$(PATH="/tmp:$PATH" GITEA_TOKEN= GITEA_HOST=git.example.com REPO=x/y PR_NUMBER=1 TEAM=qa TEAM_ID=1 bash "$SCRIPT" 2>&1 || true)
set -e
assert_contains "T13 exits non-zero when GITEA_TOKEN missing" "GITEA_TOKEN required" "$T13_OUT"
# Start fixture HTTP server
echo
echo "== fixture setup =="
FIXTURE_DIR=$(mktemp -d)
trap 'rm -rf "$FIXTURE_DIR"; [ -n "${FIX_PID:-}" ] && kill "$FIX_PID" 2>/dev/null || true' EXIT
FIXTURE_PY="$THIS_DIR/_review_check_fixture.py"
if [ ! -f "$FIXTURE_PY" ]; then
echo "::error::fixture server $FIXTURE_PY missing"
exit 1
fi
FIX_LOG="$FIXTURE_DIR/fixture.log"
FIX_STATE_DIR="$FIXTURE_DIR/state"
mkdir -p "$FIX_STATE_DIR"
# Find an unused port
FIX_PORT=$(python3 -c 'import socket;s=socket.socket();s.bind(("127.0.0.1",0));print(s.getsockname()[1]);s.close()')
FIXTURE_STATE_DIR="$FIX_STATE_DIR" python3 "$FIXTURE_PY" "$FIX_PORT" \
>"$FIX_LOG" 2>&1 &
FIX_PID=$!
# Wait for fixture readiness
for _ in $(seq 1 50); do
if curl -fsS "http://127.0.0.1:${FIX_PORT}/_ping" >/dev/null 2>&1; then
break
fi
sleep 0.1
done
if ! curl -fsS "http://127.0.0.1:${FIX_PORT}/_ping" >/dev/null 2>&1; then
echo "::error::fixture server failed to start. Log:"
cat "$FIX_LOG"
exit 1
fi
echo " fixture running on port $FIX_PORT"
# Install a curl shim that rewrites https://fixture.local/* -> http://127.0.0.1:$FIX_PORT/*
# Use double-quoted heredoc so FIX_PORT is expanded into the shim at creation time.
mkdir -p "$FIXTURE_DIR/bin"
cat >"$FIXTURE_DIR/bin/curl" <<"CURL_SHIM"
#!/usr/bin/env bash
# Shim: rewrite https://fixture.local/* -> http://127.0.0.1:FIXPORT/*
# Generated at test-run time; FIXPORT is substituted when this file is written.
new_args=()
for a in "$@"; do
if [[ "$a" == https://fixture.local/* ]]; then
rest="${a#https://fixture.local}"
a="http://127.0.0.1:FIXPORT${rest}"
fi
new_args+=("$a")
done
exec /usr/bin/curl "${new_args[@]}"
CURL_SHIM
# Now substitute FIXPORT with the actual port number. Use perl rather than
# sed -i so the test runs on both GNU sed and BSD/macOS sed.
perl -0pi -e "s/FIXPORT/${FIX_PORT}/g" "$FIXTURE_DIR/bin/curl"
chmod +x "$FIXTURE_DIR/bin/curl"
# Helper: run the script with fixture environment
run_review_check() {
local scenario="$1"
echo "$scenario" >"$FIX_STATE_DIR/scenario"
local out
set +e
out=$(
PATH="$FIXTURE_DIR/bin:/tmp:$PATH" \
GITEA_TOKEN="fixture-token" \
GITEA_HOST="fixture.local" \
REPO="molecule-ai/molecule-core" \
PR_NUMBER="999" \
DEFAULT_BRANCH="main" \
TEAM="qa" \
TEAM_ID="20" \
REVIEW_CHECK_DEBUG="0" \
REVIEW_CHECK_STRICT="0" \
bash "$SCRIPT" 2>&1
)
local rc=$?
set -e
echo "$out" >"$FIX_STATE_DIR/last_run.log"
echo "$rc" >"$FIX_STATE_DIR/last_rc"
echo "$out"
}
# T1 — open PR: script fetches PR and continues
echo
echo "== T1 open PR =="
T1_OUT=$(run_review_check "T1_pr_open")
T1_RC=$(cat "$FIX_STATE_DIR/last_rc")
assert_eq "T1 exit code 0 (approver exists + team member)" "0" "$T1_RC"
assert_contains "T1 qa-review APPROVED by core-devops" "APPROVED by core-devops" "$T1_OUT"
# T2 — closed PR: exits 0 immediately (no-op)
echo
echo "== T2 closed PR =="
T2_OUT=$(run_review_check "T2_pr_closed")
T2_RC=$(cat "$FIX_STATE_DIR/last_rc")
assert_eq "T2 exit code 0 (closed PR no-op)" "0" "$T2_RC"
# T3 — APPROVED non-author reviews exist
echo
echo "== T3 approved non-author reviews =="
T3_OUT=$(run_review_check "T3_reviews_approved_non_author")
T3_RC=$(cat "$FIX_STATE_DIR/last_rc")
assert_eq "T3 exit code 0 (candidates + team member)" "0" "$T3_RC"
# T4 — no non-author APPROVED reviews → exit 1
echo
echo "== T4 no non-author APPROVED reviews =="
T4_OUT=$(run_review_check "T4_reviews_empty")
T4_RC=$(cat "$FIX_STATE_DIR/last_rc")
assert_eq "T4 exit code 1 (no candidates)" "1" "$T4_RC"
assert_contains "T4 awaiting non-author APPROVE" "awaiting non-author APPROVE" "$T4_OUT"
# T14 — non-default-base PR should not make the default branch red.
echo
echo "== T14 non-default base PR =="
T14_OUT=$(run_review_check "T14_non_default_base")
T14_RC=$(cat "$FIX_STATE_DIR/last_rc")
assert_eq "T14 exit code 0 (non-default base no-op)" "0" "$T14_RC"
assert_contains "T14 not applicable notice" "gate not applicable" "$T14_OUT"
# T5 — only author reviews → exit 1
echo
echo "== T5 only author reviews =="
T5_OUT=$(run_review_check "T5_reviews_only_author")
T5_RC=$(cat "$FIX_STATE_DIR/last_rc")
assert_eq "T5 exit code 1 (only author reviews, no candidates)" "1" "$T5_RC"
# T6 — dismissed APPROVED review → treated as no approval
echo
echo "== T6 dismissed APPROVED review =="
T6_OUT=$(run_review_check "T6_reviews_dismissed")
T6_RC=$(cat "$FIX_STATE_DIR/last_rc")
assert_eq "T6 exit code 1 (dismissed = no approval)" "1" "$T6_RC"
# T7 — team member → exit 0
echo
echo "== T7 team membership 204 (member) =="
T7_OUT=$(run_review_check "T7_team_member")
T7_RC=$(cat "$FIX_STATE_DIR/last_rc")
assert_eq "T7 exit code 0 (member, APPROVED)" "0" "$T7_RC"
assert_contains "T7 APPROVED by core-devops (team member)" "APPROVED by core-devops" "$T7_OUT"
# T8 — not a team member → exit 1 (fail closed)
echo
echo "== T8 team membership 404 (not a member) =="
T8_OUT=$(run_review_check "T8_team_not_member")
T8_RC=$(cat "$FIX_STATE_DIR/last_rc")
assert_eq "T8 exit code 1 (not in team)" "1" "$T8_RC"
# T9 — 403 token-not-in-team → exit 1 (fail closed)
echo
echo "== T9 team membership 403 (token not in team) =="
T9_OUT=$(run_review_check "T9_team_403")
T9_RC=$(cat "$FIX_STATE_DIR/last_rc")
assert_eq "T9 exit code 1 (403 token-not-in-team, fail closed)" "1" "$T9_RC"
assert_contains "T9 403 error in output" "403" "$T9_OUT"
# T10 — token file creation and permissions
echo
echo "== T10 CURL_AUTH_FILE =="
# Verify the token-file logic directly: create a temp file with the
# same mktemp pattern, write the header with printf, chmod 600, then assert.
T10_TOKEN="secret-test-token-abc123"
T10_AUTHFILE=$(mktemp "${TMPDIR:-/tmp}/curl-auth.test.XXXXXX")
chmod 600 "$T10_AUTHFILE"
printf 'header = "Authorization: token %s"\n' "$T10_TOKEN" > "$T10_AUTHFILE"
assert_file_mode "T10a mktemp authfile mode 600 (CURL_AUTH_FILE pattern)" "$T10_AUTHFILE" "600"
assert_file_contains "T10b printf header format (CURL_AUTH_FILE content)" "$T10_AUTHFILE" "Authorization: token secret-test-token-abc123"
assert_file_contains "T10c 'header =' curl-config syntax" "$T10_AUTHFILE" 'header = "Authorization: token '
rm -f "$T10_AUTHFILE"
# T12 — jq filter: non-author APPROVED included, dismissed excluded
echo
echo "== T12 jq filter =="
# These are tested indirectly via T3 and T6 above, but let's also test
# the jq expression directly.
JQ_FILTER='.[]
| select(.state == "APPROVED")
| select(.dismissed != true)
| select(.user.login != "alice")
| .user.login'
T12_INPUT='[{"state":"APPROVED","dismissed":false,"user":{"login":"core-devops"}},{"state":"CHANGES_REQUESTED","dismissed":false,"user":{"login":"bob"}},{"state":"APPROVED","dismissed":false,"user":{"login":"alice"}},{"state":"APPROVED","dismissed":true,"user":{"login":"carol"}}]'
JQ_CMD=$(command -v jq 2>/dev/null || echo /tmp/jq)
T12_CANDIDATES=$(echo "$T12_INPUT" | "$JQ_CMD" -r "$JQ_FILTER" 2>/dev/null | sort -u)
assert_contains "T12 jq: core-devops (non-author APPROVED) in candidates" "core-devops" "$T12_CANDIDATES"
assert_eq "T12 jq: alice (author) NOT in candidates" "" "$(echo "$T12_CANDIDATES" | grep '^alice$' || true)"
assert_eq "T12 jq: carol (dismissed) NOT in candidates" "" "$(echo "$T12_CANDIDATES" | grep '^carol$' || true)"
# T15 — comment-based approval via agent prefix pattern → exit 0
echo
echo "== T15 comment agent-prefix approval =="
T15_OUT=$(run_review_check "T15_comments_agent_approval")
T15_RC=$(cat "$FIX_STATE_DIR/last_rc")
assert_eq "T15 exit code 0 (agent-comment approval + team member)" "0" "$T15_RC"
assert_contains "T15 comment fallback notice" "comment-based approval" "$T15_OUT"
assert_contains "T15 core-qa-agent APPROVED" "APPROVED by core-qa-agent" "$T15_OUT"
# T16 — comment-based approval via generic APPROVED keyword → exit 0
echo
echo "== T16 comment generic keyword approval =="
T16_OUT=$(run_review_check "T16_comments_generic_approval")
T16_RC=$(cat "$FIX_STATE_DIR/last_rc")
assert_eq "T16 exit code 0 (generic-approval comment + team member)" "0" "$T16_RC"
assert_contains "T16 comment fallback notice" "comment-based approval" "$T16_OUT"
# T17 — no approval keywords in comments → exit 1
echo
echo "== T17 comments with no approval keywords =="
T17_OUT=$(run_review_check "T17_comments_no_approval")
T17_RC=$(cat "$FIX_STATE_DIR/last_rc")
assert_eq "T17 exit code 1 (no candidates from comments)" "1" "$T17_RC"
assert_contains "T17 no candidates error" "no candidates from reviews API or issue comments" "$T17_OUT"
echo
echo "------"
echo "PASS=$PASS FAIL=$FAIL"
if [ "$FAIL" -gt 0 ]; then
echo "Failed:$FAILED_TESTS"
fi
[ "$FAIL" -eq 0 ]
File diff suppressed because it is too large Load Diff
@@ -1,101 +0,0 @@
#!/usr/bin/env bash
# Regression test for #229 — sop-tier-check tier:low OR-clause splitter.
#
# Bug (PR #225 → still broken after PR #231):
# Line ~289 of sop-tier-check.sh used:
# _clause=$(echo "$_raw_clause" | tr -d '()' | tr ',' '\n' | tr -d '[:space:]' | grep -v '^$')
# `tr -d '[:space:]'` strips the newlines that `tr ',' '\n'` just
# inserted, collapsing "engineers,managers,ceo" into a single token
# "engineersmanagersceo". The for-loop then iterates ONCE on a name
# that matches no team, so every tier:low PR fails:
# ::error::clause [engineers/managers/ceo]: FAIL — no approving
# reviewer belongs to any of these teamsengineersmanagersceo
# (note also: missing separators in the error string is bug #2 —
# `_clause_names` used "${var:+, }$x" which OVERWRITES per iteration).
#
# Fix shape (this PR):
# _no_parens=${_raw_clause//[()]/}
# _clause=${_no_parens//,/ } # comma -> space, bash word-split iterates
# _clause_names="${_clause_names}${_clause_names:+, }${_t}" # APPEND, not overwrite
#
# This test extracts the splitter logic and asserts it produces the right
# token list for each of the three tier expressions live in the script.
set -euo pipefail
PASS=0
FAIL=0
assert_eq() {
local label="$1"
local expected="$2"
local got="$3"
if [ "$expected" = "$got" ]; then
echo " PASS $label"
PASS=$((PASS + 1))
else
echo " FAIL $label"
echo " expected: <$expected>"
echo " got: <$got>"
FAIL=$((FAIL + 1))
fi
}
# ----- Splitter under test (mirrors the fixed sop-tier-check.sh block) -----
split_clause() {
local raw="$1"
local no_parens=${raw//[()]/}
local clause=${no_parens//,/ }
local out=""
for _t in $clause; do
out="${out}${out:+|}$_t"
done
echo "$out"
}
echo "test: tier:low OR-clause splits to 3 tokens"
assert_eq "tier:low" "engineers|managers|ceo" "$(split_clause "engineers,managers,ceo")"
echo "test: tier:medium AND-expression — bash word-split on \$EXPR yields 5 tokens"
EXPR="managers AND engineers AND qa???,security???"
out=""
for _raw in $EXPR; do
out="${out}${out:+ ; }$(split_clause "$_raw")"
done
assert_eq "tier:medium" "managers ; AND ; engineers ; AND ; qa???|security???" "$out"
echo "test: tier:high single-team OR-clause"
assert_eq "tier:high" "ceo" "$(split_clause "ceo")"
echo "test: paren-wrapped OR-set unwraps + splits"
assert_eq "paren OR" "managers|ceo" "$(split_clause "(managers,ceo)")"
# ----- _clause_names accumulator (was overwriting per iteration) -----
acc=""
for t in engineers managers ceo; do
acc="${acc}${acc:+, }${t}"
done
assert_eq "_clause_names append" "engineers, managers, ceo" "$acc"
# ----- _failed_clauses / _passed_clauses accumulator across raw clauses -----
acc=""
for c in clauseA clauseB clauseC; do
acc="${acc}${acc:+, }${c}"
done
assert_eq "_failed_clauses append" "clauseA, clauseB, clauseC" "$acc"
# ----- End-to-end OR-gate: simulate APPROVER_TEAMS[core-lead]=' managers ' -----
# The script's case pattern is *${_t}* with a space-padded value.
APPROVER_TEAMS_VAL=" managers "
matched=""
for _t in $(split_clause "engineers,managers,ceo" | tr '|' ' '); do
case "$APPROVER_TEAMS_VAL" in
*${_t}*) matched="$_t"; break ;;
esac
done
assert_eq "OR-gate matches managers" "managers" "$matched"
echo
echo "------"
echo "PASS=$PASS FAIL=$FAIL"
[ "$FAIL" -eq 0 ]
@@ -1,301 +0,0 @@
#!/usr/bin/env bash
# Tests for sop-tier-refire.{yml,sh} — internal#292.
#
# Behavior matrix:
#
# T1: PR open + APPROVED via tier:low → script invokes sop-tier-check
# and POSTs status=success.
# T2: PR open + missing tier label → sop-tier-check exits non-zero;
# refire still POSTs status=success, matching the canonical
# pull_request_target workflow's fail-open job conclusion.
# T3: PR open + tier:low but NO approving reviews → sop-tier-check
# exits non-zero; refire still POSTs status=success for the same reason.
# T4: PR CLOSED → refire exits 0 with no status POST (no-op on closed).
# T5: Rate-limit — recent status update within 30s → refire skips,
# no new POST.
# T6 (yaml-lint): workflow `if:` expression contains author_association
# gate + slash-command-trigger gate + PR-not-issue gate.
# T7 (yaml-lint): workflow file is parseable YAML.
#
# Tests T1-T5 run the real script against a local-fixture HTTP server
# (python http.server with a stub handler — `tests/_refire_fixture.py`)
# so the script's Gitea API calls hit the fixture, not the real Gitea.
#
# Tests T6/T7 are pure YAML checks against the workflow file.
#
# Hostile-self-review (per feedback_assert_exact_not_substring):
# this test MUST FAIL if the workflow or script is absent. Verified by
# running the test before the files exist (covered in the PR body).
set -euo pipefail
THIS_DIR="$(cd "$(dirname "$0")" && pwd)"
SCRIPT_DIR="$(cd "$THIS_DIR/.." && pwd)"
WORKFLOW_DIR="$(cd "$THIS_DIR/../../workflows" && pwd)"
WORKFLOW="$WORKFLOW_DIR/sop-tier-refire.yml"
DISPATCH_WORKFLOW="$WORKFLOW_DIR/sop-checklist.yml"
SCRIPT="$SCRIPT_DIR/sop-tier-refire.sh"
PASS=0
FAIL=0
FAILED_TESTS=""
assert_eq() {
local label="$1"
local expected="$2"
local got="$3"
if [ "$expected" = "$got" ]; then
echo " PASS $label"
PASS=$((PASS + 1))
else
echo " FAIL $label"
echo " expected: <$expected>"
echo " got: <$got>"
FAIL=$((FAIL + 1))
FAILED_TESTS="${FAILED_TESTS} ${label}"
fi
}
assert_contains() {
local label="$1"
local needle="$2"
local haystack="$3"
if printf '%s' "$haystack" | grep -qF "$needle"; then
echo " PASS $label"
PASS=$((PASS + 1))
else
echo " FAIL $label"
echo " needle: <$needle>"
echo " haystack: <$(printf '%s' "$haystack" | head -c 400)>"
FAIL=$((FAIL + 1))
FAILED_TESTS="${FAILED_TESTS} ${label}"
fi
}
assert_file_exists() {
local label="$1"
local path="$2"
if [ -f "$path" ]; then
echo " PASS $label"
PASS=$((PASS + 1))
else
echo " FAIL $label (not found: $path)"
FAIL=$((FAIL + 1))
FAILED_TESTS="${FAILED_TESTS} ${label}"
fi
}
# Existence (foundation — every other test depends on these)
echo
echo "== existence =="
assert_file_exists "workflow file exists" "$WORKFLOW"
assert_file_exists "SSOT dispatcher workflow file exists" "$DISPATCH_WORKFLOW"
assert_file_exists "script file exists" "$SCRIPT"
if [ "$FAIL" -gt 0 ]; then
echo
echo "------"
echo "PASS=$PASS FAIL=$FAIL (existence)"
echo "Cannot proceed without these files."
exit 1
fi
# T6 / T7 — workflow YAML structure
echo
echo "== T6/T7 workflow yaml =="
# YAML parseability
PARSE_OUT=$(python3 -c 'import sys,yaml;yaml.safe_load(open(sys.argv[1]).read());print("ok")' "$WORKFLOW" 2>&1 || true)
assert_eq "T7 workflow parses as YAML" "ok" "$PARSE_OUT"
# The old per-workflow issue_comment listener caused queue storms because
# Gitea queues jobs before evaluating job-level `if:`. The script remains,
# but comment-triggered refires route through the single dispatcher.
WORKFLOW_CONTENT=$(cat "$WORKFLOW")
if printf '%s' "$WORKFLOW_CONTENT" | grep -q '^ issue_comment:'; then
echo " FAIL T6a manual fallback workflow must not listen on issue_comment"
FAIL=$((FAIL + 1))
FAILED_TESTS="${FAILED_TESTS} T6a"
else
echo " PASS T6a manual fallback workflow does not listen on issue_comment"
PASS=$((PASS + 1))
fi
assert_contains "T6b workflow exposes workflow_dispatch" \
"workflow_dispatch" "$WORKFLOW_CONTENT"
assert_contains "T6c workflow documents unsupported manual inputs" \
"workflow_dispatch inputs" "$WORKFLOW_CONTENT"
# Does NOT check out PR HEAD (security)
if grep -q 'ref: \${{ github.event.pull_request.head' "$WORKFLOW"; then
echo " FAIL T6d workflow MUST NOT check out PR head (security)"
FAIL=$((FAIL + 1))
FAILED_TESTS="${FAILED_TESTS} T6d"
else
echo " PASS T6d workflow does not check out PR head"
PASS=$((PASS + 1))
fi
DISPATCH_PARSE_OUT=$(python3 -c 'import sys,yaml;yaml.safe_load(open(sys.argv[1]).read());print("ok")' "$DISPATCH_WORKFLOW" 2>&1 || true)
assert_eq "T6e SSOT dispatcher workflow parses as YAML" "ok" "$DISPATCH_PARSE_OUT"
DISPATCH_CONTENT=$(cat "$DISPATCH_WORKFLOW")
assert_contains "T6f SSOT dispatcher listens on issue_comment" \
"issue_comment" "$DISPATCH_CONTENT"
assert_contains "T6g SSOT dispatcher handles /qa-recheck" \
"/qa-recheck" "$DISPATCH_CONTENT"
assert_contains "T6h SSOT dispatcher handles /security-recheck" \
"/security-recheck" "$DISPATCH_CONTENT"
assert_contains "T6i SSOT dispatcher handles /refire-tier-check" \
"/refire-tier-check" "$DISPATCH_CONTENT"
# T1-T5 — script behavior against a local Gitea-fixture
echo
echo "== T1-T5 script behavior (vs local fixture) =="
# Spin up the fixture HTTP server.
FIXTURE_DIR=$(mktemp -d)
trap 'rm -rf "$FIXTURE_DIR"; [ -n "${FIX_PID:-}" ] && kill "$FIX_PID" 2>/dev/null || true' EXIT
FIXTURE_PY="$THIS_DIR/_refire_fixture.py"
if [ ! -f "$FIXTURE_PY" ]; then
echo "::error::fixture server $FIXTURE_PY missing"
exit 1
fi
FIX_LOG="$FIXTURE_DIR/fixture.log"
FIX_STATE_DIR="$FIXTURE_DIR/state"
mkdir -p "$FIX_STATE_DIR"
# Find an unused port.
FIX_PORT=$(python3 -c 'import socket;s=socket.socket();s.bind(("127.0.0.1",0));print(s.getsockname()[1]);s.close()')
FIXTURE_STATE_DIR="$FIX_STATE_DIR" python3 "$FIXTURE_PY" "$FIX_PORT" \
>"$FIX_LOG" 2>&1 &
FIX_PID=$!
# Wait for fixture readiness.
for _ in $(seq 1 50); do
if curl -fsS "http://127.0.0.1:${FIX_PORT}/_ping" >/dev/null 2>&1; then
break
fi
sleep 0.1
done
if ! curl -fsS "http://127.0.0.1:${FIX_PORT}/_ping" >/dev/null 2>&1; then
echo "::error::fixture server failed to start. Log:"
cat "$FIX_LOG"
exit 1
fi
# Helper: set fixture state for a scenario, then run the script.
# tier_result is one of: pass | fail_no_label | fail_no_approvals.
# The refire script's tier-check invocation is mocked because the real
# sop-tier-check.sh uses bash 4+ associative arrays — incompatible with
# the macOS bash 3.2 dev shell. Linux Gitea runners use bash 4/5 so
# production runs the real script. The mock exercises the success +
# failure branches of refire's status-POST glue.
run_scenario() {
local scenario="$1"
local tier_result="${2:-pass}"
echo "$scenario" >"$FIX_STATE_DIR/scenario"
: >"$FIX_STATE_DIR/posted_statuses.jsonl" # clear status log
local out
set +e
out=$(
PATH="$FIXTURE_DIR/bin:$PATH" \
GITEA_TOKEN="fixture-token" \
GITEA_HOST="fixture.local" \
REPO="molecule-ai/molecule-core" \
PR_NUMBER="999" \
COMMENT_AUTHOR="test-runner" \
SOP_REFIRE_DISABLE_RATE_LIMIT="1" \
SOP_REFIRE_TIER_CHECK_SCRIPT="$THIS_DIR/_mock_tier_check.sh" \
MOCK_TIER_RESULT="$tier_result" \
FIXTURE_PORT="$FIX_PORT" \
bash "$SCRIPT" 2>&1
)
local rc=$?
set -e
echo "$out" >"$FIX_STATE_DIR/last_run.log"
echo "$rc" >"$FIX_STATE_DIR/last_rc"
}
# Install a curl shim that rewrites https://fixture.local → http://127.0.0.1:$PORT
# Use bash prefix-strip (${var#prefix}) — it sidesteps the `/` delimiter
# confusion of ${var/pattern/replacement}.
mkdir -p "$FIXTURE_DIR/bin"
cat >"$FIXTURE_DIR/bin/curl" <<SHIM
#!/usr/bin/env bash
# Test shim: rewrite https://fixture.local/* -> http://127.0.0.1:${FIX_PORT}/*
# The fixture doesn't authenticate; -H Authorization passes through harmlessly.
new_args=()
for a in "\$@"; do
if [[ "\$a" == https://fixture.local/* ]]; then
rest="\${a#https://fixture.local}"
a="http://127.0.0.1:${FIX_PORT}\${rest}"
fi
new_args+=("\$a")
done
exec /usr/bin/curl "\${new_args[@]}"
SHIM
chmod +x "$FIXTURE_DIR/bin/curl"
# T1: tier:low + 1 APPROVED + author is in engineers team → success
run_scenario "T1_success" "pass"
RC=$(cat "$FIX_STATE_DIR/last_rc")
POSTED=$(cat "$FIX_STATE_DIR/posted_statuses.jsonl" 2>/dev/null || true)
assert_eq "T1 exit code 0 (success)" "0" "$RC"
assert_contains "T1 POSTed state=success" '"state": "success"' "$POSTED"
assert_contains "T1 POST context is sop-tier-check / tier-check" \
'"context": "sop-tier-check / tier-check (pull_request)"' "$POSTED"
assert_contains "T1 description names commenter" "test-runner" "$POSTED"
# T2: missing tier label → tier-check fails internally, but refire status
# matches the canonical workflow's fail-open job conclusion.
run_scenario "T2_no_tier_label" "fail_no_label"
RC=$(cat "$FIX_STATE_DIR/last_rc")
POSTED=$(cat "$FIX_STATE_DIR/posted_statuses.jsonl" 2>/dev/null || true)
assert_eq "T2 exit code 0 (canonical fail-open)" "0" "$RC"
assert_contains "T2 POSTed state=success" '"state": "success"' "$POSTED"
# T3: tier:low present but ZERO approving reviews → internal tier check fails,
# refire status remains aligned with the canonical workflow.
run_scenario "T3_no_approvals" "fail_no_approvals"
RC=$(cat "$FIX_STATE_DIR/last_rc")
POSTED=$(cat "$FIX_STATE_DIR/posted_statuses.jsonl" 2>/dev/null || true)
assert_eq "T3 exit code 0 (canonical fail-open)" "0" "$RC"
assert_contains "T3 POSTed state=success" '"state": "success"' "$POSTED"
# T4: closed PR — refire is a no-op (no POST, exit 0)
run_scenario "T4_closed" "pass"
RC=$(cat "$FIX_STATE_DIR/last_rc")
POSTED=$(cat "$FIX_STATE_DIR/posted_statuses.jsonl" 2>/dev/null || true)
assert_eq "T4 closed PR exits 0" "0" "$RC"
assert_eq "T4 closed PR posts no status" "" "$POSTED"
# T5: rate-limit — disable the env override and let scenario set a
# recent statuses entry. Re-enable rate-limit for this scenario by NOT
# passing SOP_REFIRE_DISABLE_RATE_LIMIT.
echo "T5_rate_limited" >"$FIX_STATE_DIR/scenario"
: >"$FIX_STATE_DIR/posted_statuses.jsonl"
set +e
T5_OUT=$(
PATH="$FIXTURE_DIR/bin:$PATH" \
GITEA_TOKEN="fixture-token" \
GITEA_HOST="fixture.local" \
REPO="molecule-ai/molecule-core" \
PR_NUMBER="999" \
COMMENT_AUTHOR="test-runner" \
FIXTURE_PORT="$FIX_PORT" \
bash "$SCRIPT" 2>&1
)
T5_RC=$?
set -e
POSTED=$(cat "$FIX_STATE_DIR/posted_statuses.jsonl" 2>/dev/null || true)
assert_eq "T5 rate-limited exits 0" "0" "$T5_RC"
assert_contains "T5 rate-limited log says skipped" "rate-limited" "$T5_OUT"
assert_eq "T5 rate-limited posts no status" "" "$POSTED"
echo
echo "------"
echo "PASS=$PASS FAIL=$FAIL"
if [ "$FAIL" -gt 0 ]; then
echo "Failed:$FAILED_TESTS"
fi
[ "$FAIL" -eq 0 ]
@@ -1,169 +0,0 @@
import importlib.util
import json
import pathlib
import urllib.error
ROOT = pathlib.Path(__file__).resolve().parents[1]
SCRIPT = ROOT / "status-reaper.py"
def load_reaper():
spec = importlib.util.spec_from_file_location("status_reaper", SCRIPT)
mod = importlib.util.module_from_spec(spec)
assert spec.loader is not None
spec.loader.exec_module(mod)
mod.API = "https://git.example.test/api/v1"
mod.GITEA_TOKEN = "test-token"
mod.API_TIMEOUT_SEC = 1
mod.API_RETRIES = 3
mod.API_RETRY_SLEEP_SEC = 0
return mod
class FakeResponse:
status = 200
def __init__(self, payload):
self.payload = payload
def __enter__(self):
return self
def __exit__(self, exc_type, exc, tb):
return False
def read(self):
return json.dumps(self.payload).encode("utf-8")
def test_api_retries_transient_timeout(monkeypatch):
mod = load_reaper()
calls = {"n": 0}
def fake_urlopen(req, timeout):
calls["n"] += 1
if calls["n"] == 1:
raise TimeoutError("simulated slow Gitea API")
return FakeResponse({"ok": True})
monkeypatch.setattr(mod.urllib.request, "urlopen", fake_urlopen)
status, body = mod.api("GET", "/repos/o/r/commits")
assert status == 200
assert body == {"ok": True}
assert calls["n"] == 2
def test_api_raises_after_retry_budget(monkeypatch):
mod = load_reaper()
def fake_urlopen(req, timeout):
raise urllib.error.URLError("connection reset")
monkeypatch.setattr(mod.urllib.request, "urlopen", fake_urlopen)
try:
mod.api("GET", "/repos/o/r/commits")
except mod.ApiError as exc:
assert "failed after 3 attempts" in str(exc)
else:
raise AssertionError("expected ApiError")
def test_reap_compensates_failed_pr_context_when_push_equivalent_passed(monkeypatch):
mod = load_reaper()
posted = []
def fake_post(sha, context, target_url, *, description="", dry_run=False):
posted.append((sha, context, target_url, description, dry_run))
monkeypatch.setattr(mod, "post_compensating_status", fake_post)
counters = mod.reap(
{"CI": True, "Handlers Postgres Integration": True},
{
"statuses": [
{
"context": "CI / Platform (Go) (pull_request)",
"status": "failure",
"target_url": "https://git.example.test/ci-pr",
},
{
"context": "CI / Platform (Go) (push)",
"status": "success",
},
{
"context": (
"Handlers Postgres Integration / "
"Handlers Postgres Integration (pull_request)"
),
"status": "failure",
"target_url": "https://git.example.test/handlers-pr",
},
{
"context": (
"Handlers Postgres Integration / "
"Handlers Postgres Integration (push)"
),
"status": "success",
},
],
},
"db3b7a93e31adc0cb072a6d177d92dd73275a191",
)
assert counters["compensated_pr_shadowed_by_push_success"] == 2
assert posted == [
(
"db3b7a93e31adc0cb072a6d177d92dd73275a191",
"CI / Platform (Go) (pull_request)",
"https://git.example.test/ci-pr",
mod.PR_SHADOW_COMPENSATION_DESCRIPTION,
False,
),
(
"db3b7a93e31adc0cb072a6d177d92dd73275a191",
"Handlers Postgres Integration / Handlers Postgres Integration (pull_request)",
"https://git.example.test/handlers-pr",
mod.PR_SHADOW_COMPENSATION_DESCRIPTION,
False,
),
]
def test_reap_preserves_failed_pr_context_without_push_success(monkeypatch):
mod = load_reaper()
posted = []
monkeypatch.setattr(
mod,
"post_compensating_status",
lambda sha, context, target_url, *, description="", dry_run=False: posted.append(
context
),
)
counters = mod.reap(
{"CI": True},
{
"statuses": [
{
"context": "CI / Platform (Go) (pull_request)",
"status": "failure",
},
{
"context": "CI / Platform (Go) (push)",
"status": "failure",
},
{
"context": "CI / Shellcheck (pull_request)",
"status": "failure",
},
],
},
"db3b7a93e31adc0cb072a6d177d92dd73275a191",
)
assert counters["preserved_pr_without_push_success"] == 2
assert posted == []
-181
View File
@@ -1,181 +0,0 @@
# SOP-Checklist gate — per-item required reviewer teams.
#
# RFC#351 v1 starter set. Each item lists:
# slug — canonical kebab-case form used in /sop-ack <slug>
# pr_section_marker — substring matched in the PR body to detect that
# the author filled in this item (case-insensitive)
# required_teams — list of Gitea team names; an ack from ANY one of
# these teams (logical OR) satisfies the item.
# Membership is probed at gate-time via
# GET /api/v1/teams/{id}/members/{login}.
# Team-id resolution happens at script start via
# GET /api/v1/orgs/{org}/teams (cheap, one call).
# numeric_alias — 1..7; lets reviewers type `/sop-ack 3` as a
# shortcut for `/sop-ack staging-smoke`.
#
# WHY THESE TEAM MAPPINGS:
# The RFC table referenced persona-role names like `core-qa`,
# `core-be`, `core-devops` — these are individual Gitea user logins,
# not teams. The Gitea team-membership API is /teams/{id}/members/{u},
# so we need actual teams. Orchestrator preflight 2026-05-12 verified
# only these teams exist on molecule-ai: ceo(5), engineers(2),
# managers(6), qa(20), security(21), Owners(1), and bot teams. We
# map the RFC roles to the closest existing team and surface the
# mapping explicitly so it's reviewable.
#
# HOW TO EDIT:
# - Tightening: replace `engineers` with a smaller team after creating
# it (e.g. a new `senior-engineers` team if needed).
# - Loosening: add another team to required_teams (OR semantics).
# - Add an item: append to items list and document the slug below.
#
# AUTHOR SELF-ACK IS FORBIDDEN regardless of which team contains them
# — the gate script enforces commenter != PR author before checking
# team membership.
version: 1
# Tier-aware failure mode (RFC#351 open question 2):
# For tier:high — hard-fail (status `failure`, blocks merge via BP).
# For tier:medium — hard-fail (same as high; medium is non-trivial).
# For tier:low — soft-fail (status `pending` with `acked: N/M` in the
# description). BP can choose to require the context
# or not for low-tier PRs.
# If no tier label is present, default to medium (hard-fail) — every PR
# should have a tier label per sop-tier-check, and absence indicates
# a missing-tier defect we should surface, not silently lower the bar.
tier_failure_mode:
"tier:high": hard
"tier:medium": hard
"tier:low": soft
default_mode: hard # used when no tier:* label is present
# High-risk class (RFC#450 Option C, governance-fix for internal#442).
#
# A PR is "high-risk" when ANY of the listed labels are applied OR when
# the PR has `tier:high` (mechanically the strictest existing tier).
# High-risk items use `required_teams_high_risk` (when present on the
# item); non-high-risk items use the default `required_teams`.
#
# This closes the inconsistency that the SOP charter already mandates
# `tier:high → ceo only` for the sibling `sop-tier-check` gate; the
# sop-checklist's `root-cause` and `no-backwards-compat` items now
# follow the same risk-classed two-eyes shape:
# - Default class (tier:low/medium, not high-risk): a non-author
# engineers/managers/ceo ack satisfies the item — 25+ live
# identities, no dependency on a dead/inactive senior persona
# token.
# - High-risk class (tier:high OR any high_risk_label): still
# requires a non-author ceo ack (durable human team).
#
# Tightening: add labels to high_risk_labels.
# Loosening: remove labels.
high_risk_labels:
- "risk:high"
- "area:security"
- "area:schema"
- "area:fleet-image"
- "area:identity"
- "area:gate-meta"
items:
- slug: comprehensive-testing
numeric_alias: 1
pr_section_marker: "Comprehensive testing performed"
required_teams: [qa, engineers]
description: >-
What was tested, how, edge cases covered. Ack from any qa-team
member (or engineers fallback while qa is small).
- slug: local-postgres-e2e
numeric_alias: 2
pr_section_marker: "Local-postgres E2E run"
required_teams: [engineers]
description: >-
Link to local CI artifact, or "N/A: pure-frontend change". Ack
from any engineer who can verify the local DB test actually ran.
- slug: staging-smoke
numeric_alias: 3
pr_section_marker: "Staging-smoke verified or pending"
required_teams: [engineers]
description: >-
Link to canary run, or "scheduled post-merge". Ack from any
engineer (core-devops/infra-sre are members of engineers team).
- slug: root-cause
numeric_alias: 4
pr_section_marker: "Root-cause not symptom"
required_teams: [engineers, managers, ceo]
required_teams_high_risk: [ceo]
description: >-
One-sentence root-cause statement. Default class: non-author
engineers/managers/ceo ack suffices (engineers can attest
root-cause-vs-symptom for routine fixes). High-risk class
(see `high_risk_labels`): non-author ceo ack required —
senior judgment for irreversible/security/identity/gate
changes. Closes internal#442 + tracks RFC#450.
- slug: five-axis-review
numeric_alias: 5
pr_section_marker: "Five-Axis review walked"
required_teams: [engineers]
description: >-
Correctness / readability / architecture / security / performance.
Ack from any non-author engineer.
- slug: no-backwards-compat
numeric_alias: 6
pr_section_marker: "No backwards-compat shim / dead code added"
required_teams: [engineers, managers, ceo]
required_teams_high_risk: [ceo]
description: >-
Yes/no + justification if no. Default class: non-author
engineers/managers/ceo ack suffices. High-risk class
(see `high_risk_labels`): non-author ceo ack required —
senior judgment for shim-versus-real-fix on irreversible
surfaces. Closes internal#442 + tracks RFC#450.
- slug: memory-consulted
numeric_alias: 7
pr_section_marker: "Memory/saved-feedback consulted"
required_teams: [engineers]
description: >-
List of feedback memories applicable to this change. Ack from
any engineer who has the same memory access.
# N/A gate declarations (RFC#324 §N/A follow-up).
# PRs where a gate genuinely does not apply (e.g., pure-infra with no
# qa surface, or docs-only) can be declared N/A by a non-author peer
# who is in one of the gate's required_teams. The sop-checklist
# posts a `sop-checklist / na-declarations (pull_request)` status that
# review-check.sh reads to skip the Gitea-APPROVE requirement.
#
# Usage: any PR commenter (peer) posts:
# /sop-n/a qa-review <reason>
# /sop-n/a security-review <reason>
#
# Slash commands:
# /sop-n/a <gate> [reason] — declare gate N/A (most-recent per-user wins)
# /sop-revoke <gate> — revoke prior N/A declaration for that gate
#
# Gate names must match the context strings used by review-check.sh:
# qa-review → qa-review / approved (<event>) [TEAM_ID=20]
# security-review → security-review / approved (<event>) [TEAM_ID=21]
#
# required_teams: OR semantics — any team member can declare N/A.
# Authors cannot self-declare N/A (enforced by gate script).
n/a_gates:
qa-review:
required_teams: [qa, security, engineers]
description: >-
QA review N/A when this change has no qa surface (pure-infra,
tooling-only, revert, dependency-only). A qa/eng/security member
must post /sop-n/a qa-review to activate.
security-review:
required_teams: [security, managers, ceo]
description: >-
Security review N/A when this change has no security surface
(docs-only, pure-frontend, dependency-only). A security/owners
member must post /sop-n/a security-review to activate.
-58
View File
@@ -1,58 +0,0 @@
# audit-force-merge — emit `incident.force_merge` to runner stdout when
# a PR is merged with required-status-checks not green. Vector picks
# the JSON line off docker_logs and ships to Loki on
# molecule-canonical-obs (per `reference_obs_stack_phase1`); query as:
#
# {host="operator"} |= "event_type" |= "incident.force_merge" | json
#
# Closes the §SOP-6 audit gap (the doc says force-merges write to
# `structure_events`, but that table lives in the platform DB, not
# Gitea-side; Loki is the practical equivalent for Gitea Actions
# events). When the credential / observability stack converges later,
# this can sync into structure_events from Loki via a backfill job —
# the structured JSON shape is forward-compatible.
#
# Logic in `.gitea/scripts/audit-force-merge.sh` per the same script-
# extract pattern as sop-tier-check.
name: audit-force-merge
# pull_request_target loads from the base branch — same security model
# as sop-tier-check. Without this, an attacker could rewrite the
# workflow on a PR and skip the audit emission for their own
# force-merge. See `.gitea/workflows/sop-tier-check.yml` for the full
# rationale.
on:
pull_request_target:
types: [closed]
jobs:
audit:
runs-on: ubuntu-latest
permissions:
contents: read
pull-requests: read
# Skip when PR is closed without merge — saves a runner.
if: github.event.pull_request.merged == true
steps:
- name: Check out base branch (for the script)
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
ref: ${{ github.event.pull_request.base.sha }}
- name: Detect force-merge + emit audit event
env:
# Same org-level secret the sop-tier-check workflow uses.
GITEA_TOKEN: ${{ secrets.SOP_TIER_CHECK_TOKEN || secrets.GITHUB_TOKEN }}
GITEA_HOST: git.moleculesai.app
REPO: ${{ github.repository }}
PR_NUMBER: ${{ github.event.pull_request.number }}
# Required-status-check contexts to evaluate at merge time.
# Newline-separated. Mirror this against branch protection
# (settings → branches → protected branch → required checks).
# Declared here rather than fetched from /branch_protections
# because that endpoint requires admin write — sop-tier-bot is
# read-only by design (least-privilege).
REQUIRED_CHECKS: |
CI / all-required (pull_request)
sop-checklist / all-items-acked (pull_request)
run: bash .gitea/scripts/audit-force-merge.sh
-186
View File
@@ -1,186 +0,0 @@
# ci-arm64-advisory — Mac arm64 self-hosted ADVISORY fast-check lane.
#
# === WHY ===
#
# The amd64 Gitea runner pool (molecule-runner-1..20) is queue-contended
# (internal#418). This lane offloads the *genuinely container-independent*
# fast checks (Go build/vet/lint, shellcheck, Python lint) onto the Mac
# arm64 self-hosted runner so developers get a fast arm64 signal WITHOUT
# adding load to the starved amd64 pool — capability-honestly, as an
# additive pilot. Pilot ② of the Mac-CI strategy (CTO-delegated 2026-05-17).
#
# === NON-NEGOTIABLE SAFETY CONTRACT (the prime directive) ===
#
# This lane is **ADVISORY ONLY**. It is provably incapable of hanging a
# merge. Concretely:
#
# 1. It is a SEPARATE workflow file. `ci.yml` is byte-for-byte
# untouched by this PR. The `CI / all-required` aggregator sentinel
# and the five contexts it polls
# (`CI / Detect changes|Platform (Go)|Canvas (Next.js)|
# Shellcheck (E2E scripts)|Python Lint & Test (pull_request)`)
# are unchanged. The canonical required gate stays 100% on the
# existing amd64 pool.
#
# 2. The context this workflow emits is
# `ci-arm64-advisory / fast-checks (pull_request)`. That string is
# DELIBERATELY NOT present in, and this PR does NOT add it to:
# - branch_protections/{main,staging}.status_check_contexts
# (DB-verified pb 86/75 = exactly
# ["CI / all-required (pull_request)",
# "sop-checklist / all-items-acked (pull_request)"])
# - audit-force-merge.yml REQUIRED_CHECKS env
# - ci.yml `all-required` sentinel's hardcoded `required[]` list
# Branch protection therefore never waits on this context. If the
# Mac runner is absent / offline / removed, this workflow's status
# simply never appears — and because nothing requires it, every
# merge proceeds exactly as it does today. There is no path by
# which a missing/red arm64 status blocks a merge.
#
# 3. `continue-on-error: true` on the job — even a genuine arm64-only
# failure (toolchain drift, arch-specific test flake) is surfaced
# as information, never as a merge blocker, for the duration of
# the pilot.
#
# 4. The job carries a `github.event_name` `if:` gate. Beyond its
# functional purpose this also keeps the job OUT of
# `ci-required-drift.py:ci_job_names()` (which excludes
# `github.event_name`/`github.ref`-gated jobs), so the hourly
# ci-required-drift sentinel's F1 ("job not under sentinel needs")
# cannot ever flag this advisory job. F2/F3 are untouched because
# this context is absent from BP and from REQUIRED_CHECKS.
# `lint-bp-context-emit-match` only fails on BP→emitter gaps; an
# emitter without a BP context is explicitly informational there.
#
# === RUNNER TARGETING ===
#
# The Mac runner is `hongming-pc-runner-1`. The bare `self-hosted`
# label is POLLUTED in this Gitea instance: molecule-runner-1..20
# (the contended amd64 pool) also advertise `self-hosted`. Targeting
# bare `self-hosted` would route back onto the very pool we are trying
# to relieve — and onto amd64 hardware. We therefore require an
# AND-set of labels that ONLY the Mac satisfies. `macos-self-hosted`
# is Mac-exclusive (the amd64 pool does not carry it). Until the
# label-install burst (a10862b2) lands `self-hosted`+`macos-self-hosted`
# on the Mac, the runner's current unique label `hongming-pc-laptop`
# is also listed; AND-semantics over the labels a runner advertises
# means a job requiring [self-hosted, macos-self-hosted] can ONLY be
# claimed once the Mac advertises both. If neither label set is yet
# present on the Mac, the workflow stays queued harmlessly and is
# garbage-collected by the normal stale-run reaper — it blocks nothing
# (see safety contract point 2).
#
# === ROLLBACK ===
#
# Delete this single file (`git rm .gitea/workflows/ci-arm64-advisory.yml`)
# and merge. No branch-protection edit, no ci.yml edit, no
# REQUIRED_CHECKS edit is required to roll back, because none were made
# to roll forward. Zero blast radius either direction.
name: ci-arm64-advisory
on:
push:
branches: [main, staging]
pull_request:
branches: [main, staging]
# Per-ref cancel: a newer commit on the same ref supersedes the older
# advisory run. Distinct from ci.yml's `ci-${ref}` group so this lane
# never cancels (or is cancelled by) the canonical required CI.
concurrency:
group: ci-arm64-advisory-${{ github.ref }}
cancel-in-progress: true
env:
GITHUB_SERVER_URL: https://git.moleculesai.app
jobs:
fast-checks:
name: fast-checks
# AND-set: only the Mac arm64 runner advertises macos-self-hosted.
# See "RUNNER TARGETING" header note for why bare self-hosted is unsafe.
runs-on: [self-hosted, macos-self-hosted]
# ADVISORY: never blocks. See safety contract point 3.
continue-on-error: true
# event_name gate: functional (only meaningful on push/PR) AND keeps
# this job out of ci-required-drift.py:ci_job_names() so F1 can never
# flag it. See safety contract point 4.
if: ${{ github.event_name == 'push' || github.event_name == 'pull_request' }}
timeout-minutes: 20
steps:
- name: Provenance — advisory lane, non-gating
run: |
echo "This is the arm64 ADVISORY fast-check lane."
echo "It does NOT gate merges. Canonical required CI is ci.yml"
echo "on the amd64 pool. Arch: $(uname -m) on $(uname -s)."
- name: Checkout
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
# ---- Go: build + vet + lint (container-independent: needs only the
# Go toolchain; no amd64 ECR image, no docker-in-job). Race-detector
# unit-test + coverage gates are deliberately NOT duplicated here —
# those stay authoritative on amd64 ci.yml `Platform (Go)`. This lane
# is fast-feedback for the compile/vet/lint surface only. ----
- name: Setup Go
uses: actions/setup-go@40f1582b2485089dde7abd97c1529aa768e1baff # v5
with:
go-version: 'stable'
- name: Go build + vet (workspace-server)
working-directory: workspace-server
run: |
go mod download
go build ./cmd/server
go vet ./...
- name: golangci-lint (workspace-server)
working-directory: workspace-server
run: |
go install github.com/golangci/golangci-lint/v2/cmd/golangci-lint@v2.12.2
"$(go env GOPATH)/bin/golangci-lint" run --timeout 3m ./...
# ---- Shellcheck (container-independent: shellcheck binary only).
# Mirrors ci.yml `Shellcheck (E2E scripts)` bulk pass scope. ----
- name: Install shellcheck (arm64)
run: |
if ! command -v shellcheck >/dev/null 2>&1; then
echo "shellcheck not preinstalled on this self-hosted runner."
echo "Attempting Homebrew install (Mac arm64)."
brew install shellcheck || {
echo "::warning::shellcheck unavailable on runner; advisory shellcheck skipped."
exit 0
}
fi
shellcheck --version
- name: Shellcheck tests/e2e + infra/scripts
run: |
command -v shellcheck >/dev/null 2>&1 || { echo "skip"; exit 0; }
find tests/e2e infra/scripts -type f -name '*.sh' -print0 \
| xargs -0 shellcheck --severity=warning
# ---- Python lint/compile (container-independent: CPython only).
# Lint + import-compile surface; the authoritative pytest + coverage
# floors stay on amd64 ci.yml `Python Lint & Test`. ----
- name: Setup Python
uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0
with:
python-version: '3.11'
- name: Python byte-compile (workspace)
working-directory: workspace
run: |
python -m pip install --quiet ruff || true
python -m compileall -q .
if command -v ruff >/dev/null 2>&1; then
ruff check . || echo "::warning::ruff findings (advisory only)"
fi
- name: Advisory summary
if: always()
run: |
{
echo "## arm64 advisory fast-checks complete"
echo ""
echo "This lane is **advisory** — it does not gate merges."
echo "Authoritative required CI remains \`CI / all-required\`"
echo "on the amd64 pool (\`ci.yml\`, unchanged by this PR)."
} >> "$GITHUB_STEP_SUMMARY"
-112
View File
@@ -1,112 +0,0 @@
# ci-required-drift — hourly sentinel for drift between the canonical
# "what counts as required" sources of truth in this repo:
#
# 1. `.gitea/workflows/ci.yml` jobs (CI source)
# 2. `branch_protections/{main,staging}.status_check_contexts`
# (protection)
# 3. `.gitea/workflows/audit-force-merge.yml` REQUIRED_CHECKS env
# (audit env)
#
# RFC: internal#219 §4 (jobs ↔ protection) + §6 (audit env ↔ protection).
# Ported verbatim-then-adapted from molecule-controlplane PR#112
# (SHA 0adf2098) per RFC internal#219 Phase 2b+c — replicate repo-by-repo.
#
# When any pair diverges, a `[ci-drift]` issue is opened or updated
# (idempotent by title) and labelled `tier:high`. This is the
# auto-detection that closes the regression class identified in
# RFC §1 finding 3 (protection only listed 2 of 6 real jobs for
# ~weeks, undetected) and §6 (audit env drifts silently from
# protection).
#
# Diff logic lives in `.gitea/scripts/ci-required-drift.py`. The
# Python file does YAML AST parsing + `needs:` graph walking per
# `feedback_behavior_based_ast_gates` — NOT grep-by-name. That way
# job renames or matrix-expansion-induced churn produce honest signal.
#
# NOTE on protection endpoint scope: `GET /repos/.../branch_protections/{branch}`
# requires repo-admin role in Gitea 1.22.6. If DRIFT_BOT_TOKEN lacks it,
# the script skips that branch with a clear ::error:: diagnostic and exits 0
# (the issue IS the alarm, not a red workflow). See provisioning trail in
# the run step's GITEA_TOKEN env comment.
name: ci-required-drift
# IMPORTANT — Gitea 1.22.6 parser quirk per
# `feedback_gitea_workflow_dispatch_inputs_unsupported`: do NOT add an
# `inputs:` block here, even though stock GitHub Actions allows it.
# Gitea 1.22.6 flattens `workflow_dispatch.inputs.X` into a sibling of
# the `on:` event keys and rejects the entire workflow as
# "unknown on type". The whole file then registers for ZERO events
# (no schedule, no dispatch). When Gitea ≥ 1.23 lands fleet-wide,
# this constraint can be revisited.
on:
schedule:
# Hourly at :17 — offset from :00 to spread load away from the
# peak when N cron workflows fire on the hour-boundary, per
# RFC §4 cadence ("off-zero").
- cron: '17 * * * *'
workflow_dispatch:
# Read protection + read CI YAML + write issue. No write on contents.
permissions:
contents: read
issues: write
# Serialise — two simultaneous drift runs would duel on the issue
# create/update path. The audit is idempotent, but parallel POSTs
# can produce duplicate comments before the title-search dedup wins.
concurrency:
group: ci-required-drift
cancel-in-progress: false
jobs:
drift:
runs-on: ubuntu-latest
timeout-minutes: 5
steps:
- name: Check out repo (we read the YAML files locally)
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: Set up Python (PyYAML for AST parsing)
# Avoid a system-pip install on the runner; setup-python pins
# a hermetic interpreter + cache. PyYAML is small enough that
# the install is sub-2s — no need to cache wheels.
uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
with:
python-version: '3.12'
- name: Install PyYAML
run: python -m pip install --quiet 'PyYAML==6.0.2'
- name: Run drift detector
env:
# DRIFT_BOT_TOKEN is owned by mc-drift-bot, a least-privilege
# Gitea persona whose ONLY job is reading branch_protections
# and posting the [ci-drift] tracking issue. The endpoint
# `GET /repos/.../branch_protections/{branch}` requires
# repo-ADMIN role (Gitea 1.22.6) — SOP_TIER_CHECK_TOKEN and the
# auto-injected GITHUB_TOKEN do NOT have it (read-only / write
# without admin), so the previous fallback chain 403'd.
# Mirrors the controlplane fix landed in CP PR#134.
# Provisioning trail: internal#329 (audit) + parent pattern
# internal#327 (publish-runtime-bot). Per
# `feedback_per_agent_gitea_identity_default`.
GITEA_TOKEN: ${{ secrets.DRIFT_BOT_TOKEN }}
GITEA_HOST: git.moleculesai.app
REPO: ${{ github.repository }}
# Branches whose protection we compare against. molecule-core
# currently has main protected; staging protection is
# forthcoming. Keep this list in sync if a new long-lived
# branch gets protected (e.g. release/* if introduced later).
BRANCHES: 'main staging'
# The sentinel job's name inside ci.yml. If the aggregator
# is ever renamed, update this too (the drift detector
# currently treats `all-required` as the source of "what
# the sentinel claims to require").
SENTINEL_JOB: 'all-required'
# Path to the audit workflow whose REQUIRED_CHECKS env we
# cross-check against protection (RFC §6).
AUDIT_WORKFLOW_PATH: '.gitea/workflows/audit-force-merge.yml'
# Path to the CI workflow with the sentinel + the jobs.
CI_WORKFLOW_PATH: '.gitea/workflows/ci.yml'
# Issue label applied on file/update. `tier:high` exists in
# the molecule-core label set (verified 2026-05-11, label id 9).
DRIFT_LABEL: 'tier:high'
run: python3 .gitea/scripts/ci-required-drift.py
-574
View File
@@ -1,574 +0,0 @@
# Ported from .github/workflows/ci.yml on 2026-05-11 per RFC internal#219 §1.
# continue-on-error: true on every job; follow-up PR will flip required after
# surfaced bugs are fixed (per RFC §1 — "surface broken workflows without
# blocking"). The four-surface migration audit
# (feedback_gitea_actions_migration_audit_pattern) was performed against this
# port:
#
# 1. YAML — dropped `merge_group` trigger (no Gitea merge queue); no
# `workflow_dispatch.inputs` to drop (Gitea 1.22.6 rejects those —
# feedback_gitea_workflow_dispatch_inputs_unsupported); no `environment:`
# blocks; kept `runs-on: ubuntu-latest` (Gitea runner pool advertises
# this label per agent_labels in action_runner table). Workflow-level
# env.GITHUB_SERVER_URL set as belt-and-suspenders against runner
# defaults (feedback_act_runner_github_server_url).
#
# 2. Cache — `actions/upload-artifact@v3.2.2` was already pinned to v3 for
# Gitea act_runner v0.6 compatibility (a comment in the original called
# this out). v4+ is incompatible with Gitea 1.22.x. No `actions/cache`
# usage to audit. `actions/setup-python@v6` `cache: pip` is left in
# place — works against Gitea's built-in cache server when runner.cache
# is configured (currently is, /opt/molecule/runners/config.yaml).
#
# 3. Token — workflow uses no custom dispatch tokens. The auto-injected
# `GITHUB_TOKEN` (which Gitea aliases to a runner-scoped token) is
# sufficient for `actions/checkout` against this same repo.
#
# 4. Docs — no docs/scripts reference github.com URLs that need swapping.
# The canvas-deploy-reminder step writes a `ghcr.io/...` image
# reference into the step summary text — that's documentation prose
# pointing at the ECR-mirrored canvas image and stays unchanged for
# this port (a separate cleanup if ghcr→ECR sweep is in scope).
#
# Cross-links:
# - RFC: internal#219 (CI/CD hard-gate hardening)
# - Reference port style: molecule-controlplane/.gitea/workflows/ci.yml
# - Bugs that may surface immediately and are tracked separately:
# internal#214 (Go-side vanity-import / go.sum drift, if any)
# - Phase 4 (this PR's follow-up): flip `continue-on-error: false` once
# surfaced defects are fixed, then add `all-required` aggregator
# sentinel (RFC §2) and PATCH branch protection (Phase 4 scope).
name: CI
on:
push:
branches: [main, staging]
pull_request:
branches: [main, staging]
# `merge_group` (GitHub merge-queue trigger) dropped — Gitea has no merge
# queue. The .github/ original retains it; this Gitea-side copy drops it.
# Cancel in-progress CI runs when a new commit arrives on the same ref.
# Stale runs queue up otherwise. PR refs and main/staging refs each get
# their own group because github.ref differs.
concurrency:
group: ci-${{ github.ref }}
cancel-in-progress: true
env:
# Belt-and-suspenders against the runner-default trap
# (feedback_act_runner_github_server_url). Runners are configured with
# this env via /opt/molecule/runners/config.yaml runner.envs, but pinning
# at the workflow level protects against a runner regenerated without
# the config file (feedback_act_runner_needs_config_file_env).
GITHUB_SERVER_URL: https://git.moleculesai.app
jobs:
# Detect which paths changed so downstream jobs can skip when only
# docs/markdown files were modified.
changes:
name: Detect changes
runs-on: ubuntu-latest
# Phase 4 (RFC #219 §1): all required jobs >=98% green on main.
# Flip confirmed 2026-05-12 via combined-status check of latest main
# commit (all CI jobs green). `all-required` sentinel hard-fails
# when this job fails; no Phase 3 suppression needed.
# revert: add `continue-on-error: true` back if regressions appear.
continue-on-error: false
outputs:
platform: ${{ steps.check.outputs.platform }}
canvas: ${{ steps.check.outputs.canvas }}
python: ${{ steps.check.outputs.python }}
scripts: ${{ steps.check.outputs.scripts }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 0
- id: check
env:
PR_BASE_SHA: ${{ github.event.pull_request.base.sha }}
PR_BASE_REF: ${{ github.event.pull_request.base.ref }}
PUSH_BEFORE: ${{ github.event.before }}
run: |
python3 .gitea/scripts/detect-changes.py \
--profile ci \
--event-name "${{ github.event_name }}" \
--pr-base-sha "$PR_BASE_SHA" \
--base-ref "$PR_BASE_REF" \
--push-before "${GITHUB_EVENT_BEFORE:-$PUSH_BEFORE}"
# Platform (Go) — Go build/vet/test/lint + coverage gates. The job always
# emits the required context, but expensive steps are path-scoped on every
# event so docs/E2E/Canvas-only main pushes do not block deploy on unrelated
# Go bootstrap work.
platform-build:
name: Platform (Go)
needs: changes
runs-on: ubuntu-latest
# mc#774 (closed 2026-05-14): Phase 4 flip of the platform-build job.
# Phase 4 (#656) originally flipped this to continue-on-error: false based on
# Phase-3-masked "green on main 2026-05-12". Two failure classes then surfaced:
# (1) 4x delegation_test.go sqlmock gaps (PR #669 / #634 fix-forward, closed).
# (2) TestMCPHandler_CommitMemory_GlobalScope_Blocked (mcp_test.go:433):
# OFFSEC-001 hardening collided with test assertion; tracked in mc#762.
# Fix-forward for (1) landed in PR #669. The mc#762 gap (2) is a separate
# issue — it does NOT block this flip because the test is already wrapped in
# the diagnostic step with its own continue-on-error: true (line 203).
# Flip confirmed by CI / Platform (Go) status = success on main HEAD 363905d3.
continue-on-error: false
# Job-level ceiling. The go test step below runs with a per-step 10m timeout;
# this cap catches any step that leaks past that. Set well above 10m so
# the per-step timeout is the active constraint.
timeout-minutes: 15
defaults:
run:
working-directory: workspace-server
steps:
- if: ${{ needs.changes.outputs.platform != 'true' }}
working-directory: .
run: echo "No workspace-server/** changes — Platform (Go) gate satisfied without running Go build/test/lint."
- if: ${{ needs.changes.outputs.platform == 'true' }}
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- if: ${{ needs.changes.outputs.platform == 'true' }}
uses: actions/setup-go@40f1582b2485089dde7abd97c1529aa768e1baff # v5
with:
go-version: 'stable'
- if: ${{ needs.changes.outputs.platform == 'true' }}
run: go mod download
- if: ${{ needs.changes.outputs.platform == 'true' }}
run: go build ./cmd/server
# CLI (molecli) moved to standalone repo: git.moleculesai.app/molecule-ai/molecule-cli
- if: ${{ needs.changes.outputs.platform == 'true' }}
run: go vet ./...
- if: ${{ needs.changes.outputs.platform == 'true' }}
name: Install golangci-lint
run: go install github.com/golangci/golangci-lint/v2/cmd/golangci-lint@v2.12.2
- if: ${{ needs.changes.outputs.platform == 'true' }}
name: Run golangci-lint
run: $(go env GOPATH)/bin/golangci-lint run --timeout 3m ./...
- if: ${{ needs.changes.outputs.platform == 'true' }}
name: Diagnostic — per-package verbose 60s
run: |
set +e
go test -race -v -timeout 60s ./internal/handlers/... 2>&1 | tee /tmp/test-handlers.log
handlers_exit=$?
go test -race -v -timeout 60s ./internal/pendinguploads/... 2>&1 | tee /tmp/test-pu.log
pu_exit=$?
echo "::group::handlers exit=$handlers_exit (last 100 lines)"
tail -100 /tmp/test-handlers.log
echo "::endgroup::"
echo "::group::pendinguploads exit=$pu_exit (last 100 lines)"
tail -100 /tmp/test-pu.log
echo "::endgroup::"
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
- if: ${{ needs.changes.outputs.platform == 'true' }}
name: Run tests with race detection and coverage
# Explicit timeout: cold runner cache causes OOM kills at ~4m39s on the
# full ./... suite with race detection + coverage. A 10m per-step timeout
# lets the suite complete on cold cache (~5-7m) while failing cleanly
# instead of OOM-killing. The job-level timeout (15m) is a backstop.
run: go test -race -timeout 10m -coverprofile=coverage.out ./...
- if: ${{ needs.changes.outputs.platform == 'true' }}
name: Per-file coverage report
# Advisory — lists every source file with its coverage so reviewers
# can see at-a-glance where gaps are. Sorted ascending so the worst
# offenders float to the top. Does NOT fail the build; the hard
# gate is the threshold check below. (#1823)
run: |
echo "=== Per-file coverage (worst first) ==="
go tool cover -func=coverage.out \
| grep -v '^total:' \
| awk '{file=$1; sub(/:[0-9][0-9.]*:.*/, "", file); pct=$NF; gsub(/%/,"",pct); s[file]+=pct; c[file]++}
END {for (f in s) printf "%6.1f%% %s\n", s[f]/c[f], f}' \
| sort -n
- if: ${{ needs.changes.outputs.platform == 'true' }}
name: Check coverage thresholds
# Enforces two gates from #1823 Layer 1:
# 1. Total floor (25% — ratchet plan in COVERAGE_FLOOR.md).
# 2. Per-file floor — non-test .go files in security-critical
# paths with coverage <10% fail the build, UNLESS the file
# path is listed in .coverage-allowlist.txt (acknowledged
# historical debt with a tracking issue + expiry).
run: |
set -e
TOTAL_FLOOR=25
# Security-critical paths where a 0%-coverage file is a real risk.
CRITICAL_PATHS=(
"internal/handlers/tokens"
"internal/handlers/workspace_provision"
"internal/handlers/a2a_proxy"
"internal/handlers/registry"
"internal/handlers/secrets"
"internal/middleware/wsauth"
"internal/crypto"
)
TOTAL=$(go tool cover -func=coverage.out | grep '^total:' | awk '{print $3}' | sed 's/%//')
echo "Total coverage: ${TOTAL}%"
if awk "BEGIN{exit !($TOTAL < $TOTAL_FLOOR)}"; then
echo "::error::Total coverage ${TOTAL}% is below the ${TOTAL_FLOOR}% floor. See COVERAGE_FLOOR.md for ratchet plan."
exit 1
fi
# Aggregate per-file coverage → /tmp/perfile.txt: "<fullpath> <pct>"
go tool cover -func=coverage.out \
| grep -v '^total:' \
| awk '{file=$1; sub(/:[0-9][0-9.]*:.*/, "", file); pct=$NF; gsub(/%/,"",pct); s[file]+=pct; c[file]++}
END {for (f in s) printf "%s %.1f\n", f, s[f]/c[f]}' \
> /tmp/perfile.txt
# Build allowlist — paths relative to workspace-server, one per line.
# Lines starting with # are comments.
ALLOWLIST=""
if [ -f ../.coverage-allowlist.txt ]; then
ALLOWLIST=$(grep -vE '^(#|[[:space:]]*$)' ../.coverage-allowlist.txt || true)
fi
FAILED=0
WARNED=0
for path in "${CRITICAL_PATHS[@]}"; do
while read -r file pct; do
[[ "$file" == *_test.go ]] && continue
[[ "$file" == *"$path"* ]] || continue
awk "BEGIN{exit !($pct < 10)}" || continue
# Strip the package-import prefix so we can match .coverage-allowlist.txt
# entries written as paths relative to workspace-server/.
# Handle both module paths: platform/workspace-server/... and platform/...
rel=$(echo "$file" | sed 's|^github.com/molecule-ai/molecule-monorepo/platform/workspace-server/||; s|^github.com/molecule-ai/molecule-monorepo/platform/||')
if echo "$ALLOWLIST" | grep -qxF "$rel"; then
echo "::warning file=workspace-server/$rel::Critical file at ${pct}% coverage (allowlisted, #1823) — fix before expiry."
WARNED=$((WARNED+1))
else
echo "::error file=workspace-server/$rel::Critical file at ${pct}% coverage — must be >=10% (target 80%). See #1823. To acknowledge as known debt, add this path to .coverage-allowlist.txt."
FAILED=$((FAILED+1))
fi
done < /tmp/perfile.txt
done
echo ""
echo "Critical-path check: $FAILED new failures, $WARNED allowlisted warnings."
if [ "$FAILED" -gt 0 ]; then
echo ""
echo "$FAILED security-critical file(s) have <10% test coverage and are"
echo "NOT in the allowlist. These paths handle auth, tokens, secrets, or"
echo "workspace provisioning — a 0% file here is the exact gap that let"
echo "CWE-22, CWE-78, KI-005 slip through in past incidents. Either:"
echo " (a) add tests to raise coverage above 10%, or"
echo " (b) add the path to .coverage-allowlist.txt with an expiry date"
echo " and a tracking issue reference."
exit 1
fi
# Canvas (Next.js) — required check, always runs. Same always-run +
# per-step gating shape as platform-build. The two-job-sharing-name
# pattern attempted in PR #2321 doesn't satisfy branch protection
# (SKIPPED siblings count as not-passed regardless of SUCCESS
# siblings — verified empirically on PR #2314).
canvas-build:
name: Canvas (Next.js)
needs: changes
runs-on: ubuntu-latest
timeout-minutes: 20
# Phase 4 (RFC #219 §1): confirmed green on main 2026-05-12.
continue-on-error: false
defaults:
run:
working-directory: canvas
steps:
- if: ${{ needs.changes.outputs.canvas != 'true' }}
working-directory: .
run: echo "No canvas/** changes — Canvas (Next.js) gate satisfied without running npm build/test."
- if: ${{ needs.changes.outputs.canvas == 'true' }}
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- if: ${{ needs.changes.outputs.canvas == 'true' }}
uses: actions/setup-node@48b55a011bda9f5d6aeb4c2d9c7362e8dae4041e # v6.4.0
with:
node-version: '22'
- if: ${{ needs.changes.outputs.canvas == 'true' }}
run: npm ci --include=optional --prefer-offline
- if: ${{ needs.changes.outputs.canvas == 'true' }}
run: npm run build
- if: ${{ needs.changes.outputs.canvas == 'true' }}
name: Run tests with coverage
# Coverage instrumentation is configured in canvas/vitest.config.ts
# (provider: v8, reporters: text + html + json-summary). Step 2 of
# #1815 — wires coverage into CI so we get a baseline visible on
# every PR. No threshold gate yet; thresholds dial in (Step 3, also
# tracked in #1815) after the team sees what current coverage is.
run: npx vitest run --coverage
- name: Upload coverage summary as artifact
if: ${{ needs.changes.outputs.canvas == 'true' }}
# Pinned to v3 for Gitea act_runner v0.6 compatibility — v4+ uses
# the GHES 3.10+ artifact protocol that Gitea 1.22.x does NOT
# implement, surfacing as `GHESNotSupportedError: @actions/artifact
# v2.0.0+, upload-artifact@v4+ and download-artifact@v4+ are not
# currently supported on GHES`. Drop this pin when Gitea ships
# the v4 protocol (tracked: post-Gitea-1.23 followup).
uses: actions/upload-artifact@c6a366c94c3e0affe28c06c8df20a878f24da3cf # v3.2.2
with:
name: canvas-coverage-${{ github.run_id }}
path: canvas/coverage/
retention-days: 7
if-no-files-found: warn
# Shellcheck (E2E scripts) — required context, path-scoped heavy steps.
shellcheck:
name: Shellcheck (E2E scripts)
needs: changes
runs-on: ubuntu-latest
# Phase 4 (RFC #219 §1): confirmed green on main 2026-05-12.
continue-on-error: false
steps:
- if: ${{ needs.changes.outputs.scripts != 'true' }}
run: echo "No tests/e2e, scripts, or infra/scripts changes — Shellcheck gate satisfied without running script checks."
- if: ${{ needs.changes.outputs.scripts == 'true' }}
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- if: ${{ needs.changes.outputs.scripts == 'true' }}
name: Run shellcheck on tests/e2e/*.sh and infra/scripts/*.sh
# shellcheck is pre-installed on ubuntu-latest runners (via apt).
# infra/scripts/ is included because setup.sh + nuke.sh gate the
# README quickstart — a shellcheck regression there silently breaks
# new-user onboarding. scripts/ is intentionally excluded until its
# pre-existing SC3040/SC3043 warnings are cleaned up.
run: |
find tests/e2e infra/scripts -type f -name '*.sh' -print0 \
| xargs -0 shellcheck --severity=warning
- if: ${{ needs.changes.outputs.scripts == 'true' }}
name: Lint cleanup-trap hygiene (RFC #2873)
run: bash tests/e2e/lint_cleanup_traps.sh
- if: ${{ needs.changes.outputs.scripts == 'true' }}
name: Run E2E bash unit tests (no live infra)
run: |
bash tests/e2e/test_model_slug.sh
- if: ${{ needs.changes.outputs.scripts == 'true' }}
name: Test ECR promote-tenant-image script (mock-driven, no live infra)
# Covers scripts/promote-tenant-image.sh — the codified
# :staging-latest → :latest ECR promote + tenant fleet redeploy
# closing molecule-ai/molecule-core#660. 40 mock-driven cases
# exercise every exit path (preflight, snapshot, promote, redeploy
# 403→SSM-refresh, verify, rollback). No live AWS/CP/SSM calls.
run: |
bash scripts/test-promote-tenant-image.sh
- if: ${{ needs.changes.outputs.scripts == 'true' }}
name: Shellcheck promote-tenant-image script
# scripts/ is excluded from the bulk shellcheck pass above (legacy
# SC3040/SC3043 cleanup pending). Run shellcheck explicitly on
# the promote script + its test harness so regressions there are
# caught by the required check.
run: |
shellcheck --severity=warning \
scripts/promote-tenant-image.sh \
scripts/test-promote-tenant-image.sh
# mc#959 root-fix (sre)
canvas-deploy-reminder:
name: Canvas Deploy Reminder
runs-on: docker-host
# mc#774 root-fix: added job-level `if:` so ci-required-drift.py's
# ci_job_names() detects this as github.ref-gated and skips it from F1.
# The step-level exit 0 handles the "not main push" case; the job-level
# `if:` makes the gating explicit so the drift script sees it.
# Runs on both main and staging pushes; step exits 0 when not applicable.
if: ${{ github.ref == 'refs/heads/main' || github.ref == 'refs/heads/staging' }}
needs: [changes, canvas-build]
steps:
- name: Write deploy reminder to step summary
env:
COMMIT_SHA: ${{ github.sha }}
CANVAS_CHANGED: ${{ needs.changes.outputs.canvas }}
EVENT_NAME: ${{ github.event_name }}
REF_NAME: ${{ github.ref }}
# github.server_url resolves via the workflow-level env override
# to the Gitea instance, so the RUN_URL points at the Gitea run
# page (not github.com). See feedback_act_runner_github_server_url.
RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
run: |
set -euo pipefail
if [ "$CANVAS_CHANGED" != "true" ] || [ "$EVENT_NAME" != "push" ] || [ "$REF_NAME" != "refs/heads/main" ]; then
echo "Canvas deploy reminder not applicable for event=$EVENT_NAME ref=$REF_NAME canvas_changed=$CANVAS_CHANGED."
exit 0
fi
# Write body to a temp file — avoids backtick escaping in shell.
cat > /tmp/deploy-reminder.md << 'BODY'
## Canvas build passed — deploy required
The `publish-canvas-image` workflow is now building a fresh Docker image
(`ghcr.io/molecule-ai/canvas:latest`) in the background.
Once it completes (~35 min), apply on the host machine with:
```bash
cd <runner-workspace>
git pull origin main
docker compose pull canvas && docker compose up -d canvas
```
If you need to rebuild from local source instead (e.g. testing unreleased
changes or a new `NEXT_PUBLIC_*` URL), use:
```bash
docker compose build canvas && docker compose up -d canvas
```
BODY
printf '\n> Posted automatically by CI · commit `%s` · [build log](%s)\n' \
"$COMMIT_SHA" "$RUN_URL" >> /tmp/deploy-reminder.md
# Gitea has no commit-comments API; write to GITHUB_STEP_SUMMARY,
# which both GitHub Actions and Gitea Actions render as the
# workflow run's summary page. (#75 / PR-D)
cat /tmp/deploy-reminder.md >> "$GITHUB_STEP_SUMMARY"
# Python Lint & Test — required check, always runs.
# Runtime Python moved to molecule-ai-workspace-runtime. Keep this context as
# a guard so branch protection still catches attempts to reintroduce an
# editable runtime copy under molecule-core/workspace/.
python-lint:
name: Python Lint & Test
runs-on: ubuntu-latest
continue-on-error: false
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: Runtime SSOT guard
run: |
set -eu
if [ -d workspace ]; then
echo "::error file=workspace::Runtime source must live in molecule-ai-workspace-runtime, not molecule-core/workspace."
exit 1
fi
for f in scripts/build_runtime_package.py scripts/test_build_runtime_package.py; do
if [ -e "$f" ]; then
echo "::error file=$f::Legacy build-from-workspace packaging script must not be restored."
exit 1
fi
done
echo "Runtime SSOT guard passed; core consumes the standalone runtime package."
all-required:
# Aggregator sentinel — RFC internal#219 §2 (Phase 4 — closes internal#286).
#
# Emits `CI / all-required (<event>)` where <event> is the workflow trigger
# (e.g. `CI / all-required (pull_request)`, `CI / all-required (push)`).
# Branch protection MUST be updated to require the event-suffixed name —
# requiring `CI / all-required` (bare, no suffix) silently blocks all merges
# because Gitea treats absent status contexts as pending (not skipped), and
# no workflow emits the bare name. Fixed: BP now requires
# `CI / all-required (pull_request)` per issue #1473.
#
# Closes the failure mode where status_check_contexts on molecule-core/main
# only listed `Secret scan` + `sop-tier-check` (the 2 meta-gates), so real
# `Platform (Go)` / `Canvas (Next.js)` / `Python Lint & Test` / `Shellcheck`
# red silently merged through. See internal#286 for the three concrete
# tonight-of-2026-05-11 incidents that prompted the emergency bump.
#
# This job deliberately has no `needs:`. Gitea 1.22/act_runner can mark a
# job-level `if: always()` + `needs:` sentinel as skipped before upstream
# jobs settle, leaving branch protection with a permanent pending
# `CI / all-required` context. Instead, this independent sentinel polls the
# required commit-status contexts for this SHA and fails if any fail, skip,
# or never emit.
#
# canvas-deploy-reminder is intentionally NOT included in all-required.needs.
# It is an informational main-push reminder, not a PR quality gate. Keeping
# it in this dependency list lets a skipped reminder skip the required
# sentinel before the `always()` guard can emit a branch-protection status.
#
continue-on-error: false
runs-on: ubuntu-latest
timeout-minutes: 45
steps:
- name: Wait for required CI contexts
env:
GITEA_TOKEN: ${{ secrets.GITHUB_TOKEN }}
API_ROOT: ${{ github.server_url }}/api/v1
REPOSITORY: ${{ github.repository }}
COMMIT_SHA: ${{ github.sha }}
EVENT_NAME: ${{ github.event_name }}
run: |
set -euo pipefail
python3 - <<'PY'
import json
import os
import sys
import time
import urllib.error
import urllib.request
token = os.environ["GITEA_TOKEN"]
api_root = os.environ["API_ROOT"].rstrip("/")
repo = os.environ["REPOSITORY"]
sha = os.environ["COMMIT_SHA"]
event = os.environ["EVENT_NAME"]
required = [
f"CI / Detect changes ({event})",
f"CI / Platform (Go) ({event})",
f"CI / Canvas (Next.js) ({event})",
f"CI / Shellcheck (E2E scripts) ({event})",
f"CI / Python Lint & Test ({event})",
]
terminal_bad = {"failure", "error"}
deadline = time.time() + 40 * 60
last_summary = None
def fetch_statuses():
statuses = []
for page in range(1, 6):
url = f"{api_root}/repos/{repo}/commits/{sha}/statuses?page={page}&limit=100"
req = urllib.request.Request(url, headers={"Authorization": f"token {token}"})
with urllib.request.urlopen(req, timeout=10) as resp:
chunk = json.load(resp)
if not chunk:
break
statuses.extend(chunk)
latest = {}
for item in statuses:
ctx = item.get("context")
if not ctx:
continue
prev = latest.get(ctx)
if prev is None or (item.get("updated_at") or item.get("created_at") or "") >= (prev.get("updated_at") or prev.get("created_at") or ""):
latest[ctx] = item
return latest
while True:
try:
latest = fetch_statuses()
except (TimeoutError, OSError, urllib.error.URLError) as exc:
if time.time() >= deadline:
print(f"FAIL: status polling did not recover before deadline: {exc}", file=sys.stderr)
sys.exit(1)
print(f"WARN: status poll failed, retrying: {exc}", flush=True)
time.sleep(15)
continue
states = {ctx: (latest.get(ctx) or {}).get("status") or (latest.get(ctx) or {}).get("state") or "missing" for ctx in required}
summary = ", ".join(f"{ctx}={state}" for ctx, state in states.items())
if summary != last_summary:
print(summary, flush=True)
last_summary = summary
bad = {ctx: state for ctx, state in states.items() if state in terminal_bad}
if bad:
print("FAIL: required CI context failed:", file=sys.stderr)
for ctx, state in bad.items():
desc = (latest.get(ctx) or {}).get("description") or ""
print(f" - {ctx}: {state} {desc}", file=sys.stderr)
sys.exit(1)
if all(state == "success" for state in states.values()):
print(f"OK: all {len(required)} required CI contexts succeeded")
sys.exit(0)
if time.time() >= deadline:
print("FAIL: timed out waiting for required CI contexts:", file=sys.stderr)
for ctx, state in states.items():
print(f" - {ctx}: {state}", file=sys.stderr)
sys.exit(1)
time.sleep(15)
PY
-391
View File
@@ -1,391 +0,0 @@
name: E2E API Smoke Test
# Ported from .github/workflows/e2e-api.yml on 2026-05-11 per RFC
# internal#219 §1 sweep. Differences from the GitHub version:
# - Dropped `workflow_dispatch.inputs` (Gitea 1.22.6 parser rejects them
# per feedback_gitea_workflow_dispatch_inputs_unsupported).
# - Dropped `merge_group:` (no Gitea merge queue).
# - Dropped `environment:` blocks (Gitea has no environments).
# - Workflow-level env.GITHUB_SERVER_URL pinned per
# feedback_act_runner_github_server_url.
# - `continue-on-error: true` on each job (RFC §1 contract).
#
# Extracted from ci.yml so workflow-level concurrency can protect this job
# from run-level cancellation (issue #458).
#
# Trigger model (revised 2026-04-29):
#
# Always FIRES on push/pull_request to staging+main. Real work is gated
# per-step on `needs.detect-changes.outputs.api` — when paths under
# `workspace-server/`, `tests/e2e/`, or this workflow file haven't
# changed, the no-op step alone runs and emits SUCCESS for the
# `E2E API Smoke Test` check, satisfying branch protection without
# spending CI cycles. See the in-job comment on the `e2e-api` job for
# why this is one job (not two-jobs-sharing-name) and the 2026-04-29
# PR #2264 incident that drove the consolidation.
#
# Parallel-safety (Class B Hongming-owned CICD red sweep, 2026-05-08)
# -------------------------------------------------------------------
# Same substrate hazard as PR #98 (handlers-postgres-integration). Our
# Gitea act_runner runs with `container.network: host` (operator host
# `/opt/molecule/runners/config.yaml`), which means:
#
# * Two concurrent runs both try to bind their `-p 15432:5432` /
# `-p 16379:6379` host ports — the second postgres/redis FATALs
# with `Address in use` and `docker run` returns exit 125 with
# `Conflict. The container name "/molecule-ci-postgres" is already
# in use by container ...`. Verified in run a7/2727 on 2026-05-07.
# * The fixed container names `molecule-ci-postgres` / `-redis` (the
# pre-fix shape) collide on name AS WELL AS port. The cleanup-with-
# `docker rm -f` at the start of the second job KILLS the first
# job's still-running postgres/redis.
#
# Fix shape (mirrors PR #98's bridge-net pattern, adapted because
# platform-server is a Go binary on the host, not a containerised
# step):
#
# 1. Unique container names per run:
# pg-e2e-api-${RUN_ID}-${RUN_ATTEMPT}
# redis-e2e-api-${RUN_ID}-${RUN_ATTEMPT}
# `${RUN_ID}-${RUN_ATTEMPT}` is unique even across reruns of the
# same run_id.
# 2. Ephemeral host port per run (`-p 0:5432`), then read the actual
# bound port via `docker port` and export DATABASE_URL/REDIS_URL
# pointing at it. No fixed host-port → no port collision.
# 3. `127.0.0.1` (NOT `localhost`) in URLs — IPv6 first-resolve was
# the original flake fixed in #92 and the script's still IPv6-
# enabled.
# 4. `if: always()` cleanup so containers don't leak when test steps
# fail.
#
# Issue #94 items #2 + #3 (also fixed here):
# * Pre-pull `alpine:latest` so the platform-server's provisioner
# (`internal/handlers/container_files.go`) can stand up its
# ephemeral token-write helper without a daemon.io round-trip.
# * Create `molecule-core-net` bridge network if missing so the
# provisioner's container.HostConfig {NetworkMode: ...} attach
# succeeds.
# Item #1 (timeouts) — evidence on recent runs (77/3191, ae/4270, 0e/
# 2318) shows Postgres ready in 3s, Redis in 1s, Platform in 1s when
# they DO come up. Timeouts are not the bottleneck; not bumped.
#
# Item #1046 (fixed 2026-05-14): Stale platform-server from cancelled runs
# lingers on :8080 after "Stop platform" step is skipped (workflow cancelled
# before reaching line 335). Added a pre-start "Kill stale platform-server"
# step (line 286) that scans /proc for zombie platform-server processes
# and kills them before the port probe or bind. Makes the ephemeral port
# probe + start sequence deterministic.
#
# Item explicitly NOT fixed here: failing test `Status back online`
# fails because the platform's langgraph workspace template image
# (ghcr.io/molecule-ai/workspace-template-langgraph:latest) returns
# 403 Forbidden post-2026-05-06 GitHub org suspension. That is a
# template-registry resolution issue (ADR-002 / local-build mode) and
# belongs in a separate change that touches workspace-server, not
# this workflow file.
on:
push:
branches: [main, staging]
pull_request:
branches: [main, staging]
concurrency:
# Per-SHA grouping (changed 2026-04-28 from per-ref). Per-ref had the
# same auto-promote-staging brittleness as e2e-staging-canvas — back-
# to-back staging pushes share refs/heads/staging, so the older push's
# queued run gets cancelled when a newer push lands. Auto-promote-
# staging then sees `completed/cancelled` for the older SHA and stays
# put; the newer SHA's gates may eventually save the day, but if the
# newer push gets cancelled too, we deadlock.
#
# See e2e-staging-canvas.yml's identical concurrency block for the full
# rationale and the 2026-04-28 incident reference.
group: e2e-api-${{ github.event.pull_request.head.sha || github.sha }}
cancel-in-progress: false
env:
GITHUB_SERVER_URL: https://git.moleculesai.app
jobs:
detect-changes:
# mc#1529 follow-on: pin to `docker-host` so the e2e-api lane lands
# on Linux operator-host runners (molecule-runner-*) that carry the
# `molecule-core-net` bridge network + a working `aws ecr get-login-
# password | docker login` path. The bare `ubuntu-latest` label is
# also accepted by hongming-pc-runner-* (Windows act_runner v1.0.3),
# where the docker.sock-bound steps below fail non-deterministically
# (e.g. `docker run -d --name pg-e2e-api-...` with port-bind +
# `docker exec ... pg_isready` cannot work against a Windows daemon).
# detect-changes itself doesn't bind docker.sock, but pinning here too
# keeps both jobs on the same lane so we don't re-roll the dice on
# workspace-volume cross-host surprises and the routing rule is
# discoverable in one place. Mirror of mc#1543 (handlers-postgres-
# integration). See internal#512 for the class defect.
runs-on: docker-host
# Phase 3 (RFC #219 §1): surface broken workflows without blocking.
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
outputs:
api: ${{ steps.decide.outputs.api }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 0
- id: decide
run: |
python3 .gitea/scripts/detect-changes.py \
--profile e2e-api \
--event-name "${{ github.event_name }}" \
--pr-base-sha "${{ github.event.pull_request.base.sha }}" \
--base-ref "${{ github.event.pull_request.base.ref }}" \
--push-before "${GITHUB_EVENT_BEFORE:-${{ github.event.before }}}"
# ONE job (no job-level `if:`) that always runs and reports under the
# required-check name `E2E API Smoke Test`. Real work is gated per-step
# on `needs.detect-changes.outputs.api`. Reason: GitHub registers a
# check run for every job that matches `name:`, and a job-level
# `if: false` produces a SKIPPED check run. Branch protection treats
# all check runs with a matching context name on the latest commit as a
# SET — any SKIPPED in the set fails the required-check eval, even with
# SUCCESS siblings. Verified 2026-04-29 on PR #2264 (staging→main):
# 4 check runs (2 SKIPPED + 2 SUCCESS) at the head SHA blocked
# promotion despite all real work succeeding. Collapsing to a single
# always-running job with conditional steps emits exactly one SUCCESS
# check run regardless of paths filter — branch-protection-clean.
e2e-api:
needs: detect-changes
name: E2E API Smoke Test
# mc#1529 follow-on: must run on operator-host Linux runners (where
# docker.sock + `molecule-core-net` + `aws ecr ...` work). See
# detect-changes for the full rationale.
runs-on: docker-host
# Phase 3 (RFC #219 §1): surface broken workflows without blocking.
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
timeout-minutes: 15
env:
# Unique per-run container names so concurrent runs on the host-
# network act_runner don't collide on name OR port.
# `${RUN_ID}-${RUN_ATTEMPT}` stays unique across reruns of the
# same run_id. PORT is set later (after docker port lookup) since
# we let Docker assign an ephemeral host port.
PG_CONTAINER: pg-e2e-api-${{ github.run_id }}-${{ github.run_attempt }}
REDIS_CONTAINER: redis-e2e-api-${{ github.run_id }}-${{ github.run_attempt }}
steps:
- name: No-op pass (paths filter excluded this commit)
if: needs.detect-changes.outputs.api != 'true'
run: |
echo "No workspace-server / tests/e2e / workflow changes — E2E API gate satisfied without running tests."
echo "::notice::E2E API Smoke Test no-op pass (paths filter excluded this commit)."
- if: needs.detect-changes.outputs.api == 'true'
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- if: needs.detect-changes.outputs.api == 'true'
uses: actions/setup-go@40f1582b2485089dde7abd97c1529aa768e1baff # v5
with:
go-version: 'stable'
cache: true
cache-dependency-path: workspace-server/go.sum
- name: Pre-pull alpine + ensure provisioner network (Issue #94 items #2 + #3)
if: needs.detect-changes.outputs.api == 'true'
run: |
# Provisioner uses alpine:latest for ephemeral token-write
# containers (workspace-server/internal/handlers/container_files.go).
# Pre-pull so the first provision in test_api.sh doesn't race
# the daemon's pull cache. Idempotent — `docker pull` is a no-op
# when the image is already present.
docker pull alpine:latest >/dev/null
# Provisioner attaches workspace containers to
# molecule-core-net (workspace-server/internal/provisioner/
# provisioner.go::DefaultNetwork). The bridge already exists on
# the operator host's docker daemon — `network create` is
# idempotent via `|| true`.
docker network create molecule-core-net >/dev/null 2>&1 || true
echo "alpine:latest pre-pulled; molecule-core-net ensured."
- name: Start Postgres (docker)
if: needs.detect-changes.outputs.api == 'true'
run: |
# Defensive cleanup — only matches THIS run's container name,
# so it cannot kill a sibling run's postgres. (Pre-fix the
# name was static and this rm hit other runs' containers.)
docker rm -f "$PG_CONTAINER" 2>/dev/null || true
# `-p 0:5432` requests an ephemeral host port; we read it back
# below and export DATABASE_URL.
docker run -d --name "$PG_CONTAINER" \
-e POSTGRES_USER=dev -e POSTGRES_PASSWORD=dev -e POSTGRES_DB=molecule \
-p 0:5432 postgres:16 >/dev/null
# Resolve the host-side port assignment. `docker port` prints
# `0.0.0.0:NNNN` (and on host-net runners may also print an
# IPv6 line — take the first IPv4 line).
PG_PORT=$(docker port "$PG_CONTAINER" 5432/tcp | awk -F: '/^0\.0\.0\.0:/ {print $2; exit}')
if [ -z "$PG_PORT" ]; then
# Fallback: any first line. Some Docker versions print only
# one line.
PG_PORT=$(docker port "$PG_CONTAINER" 5432/tcp | head -1 | awk -F: '{print $NF}')
fi
if [ -z "$PG_PORT" ]; then
echo "::error::Could not resolve host port for $PG_CONTAINER"
docker port "$PG_CONTAINER" 5432/tcp || true
docker logs "$PG_CONTAINER" || true
exit 1
fi
# 127.0.0.1 (NOT localhost) — IPv6 first-resolve flake (#92).
echo "PG_PORT=${PG_PORT}" >> "$GITHUB_ENV"
echo "DATABASE_URL=postgres://dev:dev@127.0.0.1:${PG_PORT}/molecule?sslmode=disable" >> "$GITHUB_ENV"
echo "Postgres host port: ${PG_PORT}"
for i in $(seq 1 30); do
if docker exec "$PG_CONTAINER" pg_isready -U dev >/dev/null 2>&1; then
echo "Postgres ready after ${i}s"
exit 0
fi
sleep 1
done
echo "::error::Postgres did not become ready in 30s"
docker logs "$PG_CONTAINER" || true
exit 1
- name: Start Redis (docker)
if: needs.detect-changes.outputs.api == 'true'
run: |
docker rm -f "$REDIS_CONTAINER" 2>/dev/null || true
docker run -d --name "$REDIS_CONTAINER" -p 0:6379 redis:7 >/dev/null
REDIS_PORT=$(docker port "$REDIS_CONTAINER" 6379/tcp | awk -F: '/^0\.0\.0\.0:/ {print $2; exit}')
if [ -z "$REDIS_PORT" ]; then
REDIS_PORT=$(docker port "$REDIS_CONTAINER" 6379/tcp | head -1 | awk -F: '{print $NF}')
fi
if [ -z "$REDIS_PORT" ]; then
echo "::error::Could not resolve host port for $REDIS_CONTAINER"
docker port "$REDIS_CONTAINER" 6379/tcp || true
docker logs "$REDIS_CONTAINER" || true
exit 1
fi
echo "REDIS_PORT=${REDIS_PORT}" >> "$GITHUB_ENV"
echo "REDIS_URL=redis://127.0.0.1:${REDIS_PORT}" >> "$GITHUB_ENV"
echo "Redis host port: ${REDIS_PORT}"
for i in $(seq 1 15); do
if docker exec "$REDIS_CONTAINER" redis-cli ping 2>/dev/null | grep -q PONG; then
echo "Redis ready after ${i}s"
exit 0
fi
sleep 1
done
echo "::error::Redis did not become ready in 15s"
docker logs "$REDIS_CONTAINER" || true
exit 1
- name: Build platform
if: needs.detect-changes.outputs.api == 'true'
working-directory: workspace-server
run: go build -o platform-server ./cmd/server
- name: Pick platform port
if: needs.detect-changes.outputs.api == 'true'
run: |
PLATFORM_PORT=$(python3 - <<'PY'
import socket
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
s.bind(("127.0.0.1", 0))
print(s.getsockname()[1])
PY
)
echo "PORT=${PLATFORM_PORT}" >> "$GITHUB_ENV"
echo "BASE=http://127.0.0.1:${PLATFORM_PORT}" >> "$GITHUB_ENV"
echo "Platform host port: ${PLATFORM_PORT}"
- name: Kill stale platform-server before start (issue #1046)
if: needs.detect-changes.outputs.api == 'true'
run: |
# Concurrent runs on the same host-network act_runner can leave a
# zombie platform-server from a cancelled/timeout run. Cancelled
# runs never reach the "Stop platform" step (line 335), so the
# old process lingers. Kill it before the ephemeral port probe
# or start so the port is definitively free.
#
# /proc scan — works on any Linux without pkill/lsof/ss.
# comm field is truncated to 15 chars: "platform-serve" matches
# "platform-server". Verify with cmdline to avoid false positives.
killed=0
for pid in $(grep -l "platform-serve" /proc/[0-9]*/comm 2>/dev/null); do
kpid="${pid%/comm}"
kpid="${kpid##*/}"
cmdline=$(cat "/proc/${kpid}/cmdline" 2>/dev/null | tr '\0' ' ')
if echo "$cmdline" | grep -q "platform-server"; then
echo "Killing stale platform-server pid ${kpid}: ${cmdline}"
kill "$kpid" 2>/dev/null || true
killed=$((killed + 1))
fi
done
if [ "$killed" -gt 0 ]; then
sleep 2
echo "Killed $killed stale process(es); port(s) released."
else
echo "No stale platform-server found."
fi
- name: Start platform (background)
if: needs.detect-changes.outputs.api == 'true'
working-directory: workspace-server
run: |
# DATABASE_URL + REDIS_URL exported by the start-postgres /
# start-redis steps point at this run's per-run host ports.
./platform-server > platform.log 2>&1 &
echo $! > platform.pid
- name: Wait for /health
if: needs.detect-changes.outputs.api == 'true'
run: |
for i in $(seq 1 30); do
if curl -sf "$BASE/health" > /dev/null; then
echo "Platform up after ${i}s"
exit 0
fi
sleep 1
done
echo "::error::Platform did not become healthy in 30s"
cat workspace-server/platform.log || true
exit 1
- name: Assert migrations applied
if: needs.detect-changes.outputs.api == 'true'
run: |
tables=$(docker exec "$PG_CONTAINER" psql -U dev -d molecule -tAc "SELECT count(*) FROM information_schema.tables WHERE table_schema='public' AND table_name='workspaces'")
if [ "$tables" != "1" ]; then
echo "::error::Migrations did not apply"
cat workspace-server/platform.log || true
exit 1
fi
echo "Migrations OK"
- name: Run today's-PR-coverage E2E (mc#1525/1535/1536/1539/1542 fix-specific assertions)
if: needs.detect-changes.outputs.api == 'true'
run: bash tests/e2e/test_today_pr_coverage_e2e.sh
- name: Run E2E API tests
if: needs.detect-changes.outputs.api == 'true'
run: bash tests/e2e/test_api.sh
- name: Run notify-with-attachments E2E
if: needs.detect-changes.outputs.api == 'true'
run: bash tests/e2e/test_notify_attachments_e2e.sh
- name: Run priority-runtimes E2E (claude-code + hermes — skips when keys absent)
if: needs.detect-changes.outputs.api == 'true'
run: bash tests/e2e/test_priority_runtimes_e2e.sh
- name: Install standalone runtime parser from Gitea registry
if: needs.detect-changes.outputs.api == 'true'
run: |
python3 -m pip install --no-deps \
--index-url https://git.moleculesai.app/api/packages/molecule-ai/pypi/simple/ \
molecule-ai-workspace-runtime
- name: Run poll-mode + since_id cursor E2E (#2339)
if: needs.detect-changes.outputs.api == 'true'
run: bash tests/e2e/test_poll_mode_e2e.sh
- name: Run poll-mode chat upload E2E (RFC #2891)
if: needs.detect-changes.outputs.api == 'true'
run: bash tests/e2e/test_poll_mode_chat_upload_e2e.sh
- name: Dump platform log on failure
if: failure() && needs.detect-changes.outputs.api == 'true'
run: cat workspace-server/platform.log || true
- name: Stop platform
if: always() && needs.detect-changes.outputs.api == 'true'
run: |
if [ -f workspace-server/platform.pid ]; then
kill "$(cat workspace-server/platform.pid)" 2>/dev/null || true
fi
- name: Stop service containers
# always() so containers don't leak when test steps fail. The
# cleanup is best-effort: if the container is already gone
# (e.g. concurrent rerun race), don't fail the job.
if: always() && needs.detect-changes.outputs.api == 'true'
run: |
docker rm -f "$PG_CONTAINER" 2>/dev/null || true
docker rm -f "$REDIS_CONTAINER" 2>/dev/null || true
-334
View File
@@ -1,334 +0,0 @@
name: E2E Chat
# Comprehensive Playwright E2E for the unified chat stack (desktop
# ChatTab + mobile MobileChat). Heavy browser execution is intentionally
# outside the normal required PR path: PRs run it only after entering the
# `merge-queue`, while push/main, nightly, and manual dispatch preserve
# coverage without making every PR pay the full runtime/browser cost.
#
# Architecture:
# 1. Ephemeral Postgres + Redis (docker, unique container names)
# 2. workspace-server built from source, started with
# MOLECULE_ENV=development (fail-open auth)
# 3. canvas dev server (npm run dev) on :3000
# 4. Playwright tests create workspaces via API, point them at an
# in-process echo runtime, and exercise the full send/receive
# round-trip through the browser.
#
# Parallel-safety: same pattern as e2e-api.yml — per-run container names
# and ephemeral host ports so concurrent jobs on the host-network runner
# don't collide.
on:
push:
branches: [main, staging]
pull_request:
branches: [main, staging]
schedule:
# Nightly at 09:00 UTC. Keeps coverage for the currently non-required
# heavy browser lane without spending runner time on every PR.
- cron: '0 9 * * *'
workflow_dispatch:
concurrency:
group: e2e-chat-${{ github.event.pull_request.head.sha || github.sha }}
cancel-in-progress: false
env:
GITHUB_SERVER_URL: https://git.moleculesai.app
jobs:
# bp-exempt: helper job; real gate is E2E Chat / E2E Chat (pull_request)
detect-changes:
# mc#1529 follow-on: pin to `docker-host` (Linux operator-host
# runners). The bare `ubuntu-latest` label is also advertised by
# hongming-pc-runner-* (Windows act_runner v1.0.3) where the
# docker.sock-bound steps below fail. Mirror of mc#1543
# (handlers-postgres-integration). See internal#512 for the class
# defect.
runs-on: docker-host
# Phase 3 (RFC #219 §1): surface broken workflows without blocking.
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
outputs:
chat: ${{ steps.decide.outputs.chat }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 0
- id: decide
env:
GITEA_TOKEN: ${{ secrets.GITHUB_TOKEN }}
QUEUE_LABEL: merge-queue
run: |
if [ "${{ github.event_name }}" = "schedule" ] || [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
echo "chat=true" >> "$GITHUB_OUTPUT"
exit 0
fi
BASE="${GITHUB_BASE_REF:-${{ github.event.before }}}"
if [ "${{ github.event_name }}" = "pull_request" ] && [ -n "${{ github.event.pull_request.base.sha }}" ]; then
BASE="${{ github.event.pull_request.base.sha }}"
fi
if [ -z "$BASE" ] || echo "$BASE" | grep -qE '^0+$'; then
echo "chat=true" >> "$GITHUB_OUTPUT"
exit 0
fi
if ! git cat-file -e "$BASE" 2>/dev/null; then
git fetch --depth=1 origin "$BASE" 2>/dev/null || true
fi
if ! git cat-file -e "$BASE" 2>/dev/null; then
echo "chat=true" >> "$GITHUB_OUTPUT"
exit 0
fi
CHANGED=$(git diff --name-only "$BASE" HEAD)
if ! echo "$CHANGED" | grep -qE '^(canvas/|workspace-server/|\.gitea/workflows/e2e-chat\.yml$)'; then
echo "chat=false" >> "$GITHUB_OUTPUT"
exit 0
fi
if [ "${{ github.event_name }}" != "pull_request" ]; then
echo "chat=true" >> "$GITHUB_OUTPUT"
exit 0
fi
authfile=$(mktemp)
chmod 600 "$authfile"
printf 'header = "Authorization: token %s"\n' "$GITEA_TOKEN" > "$authfile"
labels=$(curl -fsS -K "$authfile" \
"${{ github.server_url }}/api/v1/repos/${{ github.repository }}/issues/${{ github.event.pull_request.number }}/labels" \
| python3 -c 'import json,sys; print("\n".join(label.get("name","") for label in json.load(sys.stdin)))')
rm -f "$authfile"
if printf '%s\n' "$labels" | grep -qx "$QUEUE_LABEL"; then
echo "chat=true" >> "$GITHUB_OUTPUT"
else
echo "PR is not in merge-queue; skipping heavy E2E Chat for normal PR path."
echo "chat=false" >> "$GITHUB_OUTPUT"
fi
# bp-required: pending #1142 — new E2E check; add to branch protection after 3 green runs.
e2e-chat:
needs: detect-changes
name: E2E Chat
# mc#1529 follow-on: docker run/exec for postgres + redis containers.
# Must land on operator-host Linux (docker-host).
runs-on: docker-host
# Phase 3 (RFC #219 §1): surface broken workflows without blocking.
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
timeout-minutes: 15
env:
PG_CONTAINER: pg-e2e-chat-${{ github.run_id }}-${{ github.run_attempt }}
REDIS_CONTAINER: redis-e2e-chat-${{ github.run_id }}-${{ github.run_attempt }}
steps:
- name: No-op pass (paths filter excluded this commit)
if: needs.detect-changes.outputs.chat != 'true'
run: |
echo "No canvas / workspace-server / workflow changes — E2E Chat gate satisfied without running tests."
echo "::notice::E2E Chat no-op pass (paths filter excluded this commit)."
- if: needs.detect-changes.outputs.chat == 'true'
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- if: needs.detect-changes.outputs.chat == 'true'
uses: actions/setup-go@40f1582b2485089dde7abd97c1529aa768e1baff # v5
with:
go-version: 'stable'
cache: true
cache-dependency-path: workspace-server/go.sum
- if: needs.detect-changes.outputs.chat == 'true'
uses: actions/setup-node@48b55a011bda9f5d6aeb4c2d9c7362e8dae4041e # v6.4.0
with:
node-version: '22'
cache: 'npm'
cache-dependency-path: canvas/package-lock.json
- name: Start Postgres (docker)
if: needs.detect-changes.outputs.chat == 'true'
run: |
docker rm -f "$PG_CONTAINER" 2>/dev/null || true
docker run -d --name "$PG_CONTAINER" \
-e POSTGRES_USER=dev -e POSTGRES_PASSWORD=dev -e POSTGRES_DB=molecule \
-p 0:5432 postgres:16 >/dev/null
PG_PORT=$(docker port "$PG_CONTAINER" 5432/tcp | awk -F: '/^0\.0\.0\.0:/ {print $2; exit}')
if [ -z "$PG_PORT" ]; then
PG_PORT=$(docker port "$PG_CONTAINER" 5432/tcp | head -1 | awk -F: '{print $NF}')
fi
if [ -z "$PG_PORT" ]; then
echo "::error::Could not resolve host port for $PG_CONTAINER"
exit 1
fi
echo "PG_PORT=${PG_PORT}" >> "$GITHUB_ENV"
echo "DATABASE_URL=postgres://dev:dev@127.0.0.1:${PG_PORT}/molecule?sslmode=disable" >> "$GITHUB_ENV"
echo "E2E_DATABASE_URL=postgres://dev:dev@127.0.0.1:${PG_PORT}/molecule?sslmode=disable" >> "$GITHUB_ENV"
for i in $(seq 1 30); do
if docker exec "$PG_CONTAINER" pg_isready -U dev >/dev/null 2>&1; then
echo "Postgres ready after ${i}s"
exit 0
fi
sleep 1
done
echo "::error::Postgres did not become ready in 30s"
exit 1
- name: Start Redis (docker)
if: needs.detect-changes.outputs.chat == 'true'
run: |
docker rm -f "$REDIS_CONTAINER" 2>/dev/null || true
docker run -d --name "$REDIS_CONTAINER" -p 0:6379 redis:7 >/dev/null
REDIS_PORT=$(docker port "$REDIS_CONTAINER" 6379/tcp | awk -F: '/^0\.0\.0\.0:/ {print $2; exit}')
if [ -z "$REDIS_PORT" ]; then
REDIS_PORT=$(docker port "$REDIS_CONTAINER" 6379/tcp | head -1 | awk -F: '{print $NF}')
fi
if [ -z "$REDIS_PORT" ]; then
echo "::error::Could not resolve host port for $REDIS_CONTAINER"
exit 1
fi
echo "REDIS_PORT=${REDIS_PORT}" >> "$GITHUB_ENV"
echo "REDIS_URL=redis://127.0.0.1:${REDIS_PORT}" >> "$GITHUB_ENV"
for i in $(seq 1 15); do
if docker exec "$REDIS_CONTAINER" redis-cli ping 2>/dev/null | grep -q PONG; then
echo "Redis ready after ${i}s"
exit 0
fi
sleep 1
done
echo "::error::Redis did not become ready in 15s"
exit 1
- name: Build platform
if: needs.detect-changes.outputs.chat == 'true'
working-directory: workspace-server
run: go build -o platform-server ./cmd/server
- name: Pick platform port
if: needs.detect-changes.outputs.chat == 'true'
run: |
PLATFORM_PORT=$(python3 - <<'PY'
import socket
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
s.bind(("127.0.0.1", 0))
print(s.getsockname()[1])
PY
)
echo "PLATFORM_PORT=${PLATFORM_PORT}" >> "$GITHUB_ENV"
echo "E2E_PLATFORM_URL=http://127.0.0.1:${PLATFORM_PORT}" >> "$GITHUB_ENV"
echo "Platform host port: ${PLATFORM_PORT}"
- name: Pick canvas port
if: needs.detect-changes.outputs.chat == 'true'
run: |
CANVAS_PORT=$(python3 - <<'PY'
import socket
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
s.bind(("127.0.0.1", 0))
print(s.getsockname()[1])
PY
)
echo "CANVAS_PORT=${CANVAS_PORT}" >> "$GITHUB_ENV"
echo "Canvas host port: ${CANVAS_PORT}"
- name: Start platform (background)
if: needs.detect-changes.outputs.chat == 'true'
working-directory: workspace-server
run: |
export MOLECULE_ENV=development
export DATABASE_URL="${DATABASE_URL}"
export REDIS_URL="${REDIS_URL}"
export PORT="${PLATFORM_PORT}"
export CORS_ORIGINS="http://localhost:3000,http://localhost:3001,http://localhost:${CANVAS_PORT},http://127.0.0.1:${CANVAS_PORT}"
./platform-server > platform.log 2>&1 &
echo $! > platform.pid
- name: Wait for /health
if: needs.detect-changes.outputs.chat == 'true'
run: |
for i in $(seq 1 30); do
if curl -sf "http://127.0.0.1:${PLATFORM_PORT}/health" > /dev/null; then
echo "Platform up after ${i}s"
exit 0
fi
sleep 1
done
echo "::error::Platform did not become healthy in 30s"
cat workspace-server/platform.log || true
exit 1
- name: Install canvas dependencies
if: needs.detect-changes.outputs.chat == 'true'
working-directory: canvas
run: npm ci
- name: Install Playwright browsers
if: needs.detect-changes.outputs.chat == 'true'
working-directory: canvas
run: |
PREBAKED_PLAYWRIGHT=/ms-playwright
if [ -d "${PREBAKED_PLAYWRIGHT}" ] && find "${PREBAKED_PLAYWRIGHT}" -maxdepth 3 -type f -name 'chrome' | grep -q .; then
echo "Using prebaked Playwright Chromium from ${PREBAKED_PLAYWRIGHT}"
echo "PLAYWRIGHT_BROWSERS_PATH=${PREBAKED_PLAYWRIGHT}" >> "$GITHUB_ENV"
exit 0
fi
npx playwright install --with-deps chromium
- name: Start canvas dev server (background)
if: needs.detect-changes.outputs.chat == 'true'
working-directory: canvas
run: |
export NEXT_PUBLIC_PLATFORM_URL="http://127.0.0.1:${PLATFORM_PORT}"
export NEXT_PUBLIC_WS_URL="ws://127.0.0.1:${PLATFORM_PORT}/ws"
npx next dev --turbopack -p "${CANVAS_PORT}" > canvas.log 2>&1 &
echo $! > canvas.pid
for i in $(seq 1 30); do
if curl -sf "http://localhost:${CANVAS_PORT}" > /dev/null 2>&1; then
echo "Canvas up after ${i}s"
exit 0
fi
sleep 1
done
echo "::error::Canvas did not start in 30s"
cat canvas.log || true
exit 1
- name: Run Playwright E2E tests
if: needs.detect-changes.outputs.chat == 'true'
working-directory: canvas
run: |
export E2E_PLATFORM_URL="http://127.0.0.1:${PLATFORM_PORT}"
export E2E_DATABASE_URL="${DATABASE_URL}"
export PLAYWRIGHT_BASE_URL="http://localhost:${CANVAS_PORT}"
npx playwright test e2e/chat-desktop.spec.ts e2e/chat-mobile.spec.ts
- name: Dump platform log on failure
if: failure() && needs.detect-changes.outputs.chat == 'true'
run: cat workspace-server/platform.log || true
- name: Dump canvas log on failure
if: failure() && needs.detect-changes.outputs.chat == 'true'
run: cat canvas/canvas.log || true
- name: Upload Playwright report
if: failure() && needs.detect-changes.outputs.chat == 'true'
uses: actions/upload-artifact@v3.2.2
with:
name: playwright-report-chat
path: canvas/playwright-report/
- name: Stop canvas
if: always() && needs.detect-changes.outputs.chat == 'true'
run: |
if [ -f canvas/canvas.pid ]; then
kill "$(cat canvas/canvas.pid)" 2>/dev/null || true
fi
- name: Stop platform
if: always() && needs.detect-changes.outputs.chat == 'true'
run: |
if [ -f workspace-server/platform.pid ]; then
kill "$(cat workspace-server/platform.pid)" 2>/dev/null || true
fi
- name: Stop service containers
if: always() && needs.detect-changes.outputs.chat == 'true'
run: |
docker rm -f "$PG_CONTAINER" 2>/dev/null || true
docker rm -f "$REDIS_CONTAINER" 2>/dev/null || true
-402
View File
@@ -1,402 +0,0 @@
name: E2E Peer Visibility (literal MCP list_peers)
# WHY A DEDICATED WORKFLOW (not folded into e2e-staging-saas.yml)
# --------------------------------------------------------------
# This is the systemic fix for a real trust failure. Hermes and OpenClaw
# were reported "fleet-verified / cascade-complete" because the *proxy*
# signals were green (registry registration + heartbeat for Hermes; model
# round-trip 200 for OpenClaw). A freshly-provisioned workspace asked on
# canvas "can you see your peers" actually FAILS:
# - Hermes: 401 on the molecule MCP `list_peers` call
# - OpenClaw: native `sessions_list` fallback, sees no platform peers
# Tasks #142/#159 were even marked "completed" under this proxy flaw.
#
# A dedicated workflow (vs extending e2e-staging-saas.yml) because:
# - It must provision MULTIPLE distinct runtimes (hermes, openclaw,
# claude-code) in ONE org and assert each sees the others. The
# full-saas script is single-runtime-per-run (E2E_RUNTIME) and folding
# a multi-runtime matrix into it would conflate concerns and bloat its
# already-45-min run.
# - It needs its own concurrency group so it doesn't fight full-saas /
# canvas for the staging org-creation quota.
# - It needs an independent, non-required status-context name so it can
# be RED today (the in-flight Hermes-401 / OpenClaw-MCP-wiring fixes
# have not landed) WITHOUT wedging unrelated merges — and flipped to
# REQUIRED in one branch-protection edit once it goes green
# (flip-to-required checklist: molecule-core#1296).
#
# THE ASSERTION IS NOT A PROXY. The driving script
# tests/e2e/test_peer_visibility_mcp_staging.sh issues the byte-for-byte
# JSON-RPC `tools/call name=list_peers` envelope to `POST
# /workspaces/:id/mcp` using each workspace's OWN bearer token, through
# the real WorkspaceAuth + MCPRateLimiter middleware chain — the exact
# call mcp_molecule_list_peers makes from a canvas agent. It does NOT
# read a registry row, /health, the heartbeat table, or
# GET /registry/:id/peers.
#
# HONEST GATE — NO continue-on-error. Per feedback_fix_root_not_symptom a
# fake-green mask would defeat the entire purpose. This workflow goes red
# on today's broken behavior and green only when the root-cause fixes
# actually land. It is intentionally NOT in branch_protections — see PR
# body for the required-vs-not decision + flip tracking issue.
#
# Gitea 1.22.6 / act_runner notes honored:
# - No cross-repo `uses:` (feedback_gitea_cross_repo_uses_blocked). The
# actions/checkout SHA is the one e2e-staging-canvas.yml already uses
# successfully (a mirrored SHA — see #1277/PR#1292 root-cause).
# - 2026-05-21 retrigger: verify fresh platform-tenant image after the
# publish Buildx DOCKER_CONFIG fix restored staging-latest image updates.
# - Per-SHA concurrency, not global (feedback_concurrency_group_per_sha).
# - Workflow-level GITHUB_SERVER_URL pinned
# (feedback_act_runner_github_server_url).
# - pr-validate posts a status under the same check name so a
# workflow-only PR is not silently statusless and the context is
# flip-to-required-ready (mirrors e2e-staging-saas.yml's proven shape;
# real EC2-provisioning E2E is push/dispatch/cron only — it is 30+ min
# and cannot run per-PR-update).
#
# LOCAL BACKEND (added 2026-05-15 — feedback_local_must_mimic_production,
# feedback_mandatory_local_e2e_before_ship, feedback_local_test_before_
# staging_e2e)
# --------------------------------------------------------------------
# The standing rule is that the local prod-mimic stack runs a MANDATORY
# local-Postgres E2E BEFORE staging E2E. A staging-only peer-visibility
# gate caught regressions late + expensively (cold EC2). The
# `peer-visibility-local` job below runs the SAME byte-identical
# assertion (tests/e2e/lib/peer_visibility_assert.sh) against the local
# docker-compose stack — built + booted exactly like e2e-api.yml's
# proven E2E API Smoke Test job (ephemeral pg/redis ports, go build,
# background platform-server). It runs on PR + push (local boot is
# minutes, not the 30+ min cold-EC2 path), so peer-visibility is part of
# the local gate that fires before the staging E2E.
#
# It is its OWN non-required status context `E2E Peer Visibility (local)`.
# The local backend uses external-mode workspaces by default so it tests
# the literal platform MCP list_peers path without depending on local
# template container boot/heartbeat. Container-mode runtime boot remains
# available via PV_LOCAL_PROVISION_MODE=container for targeted debugging.
on:
push:
branches: [main]
paths:
- 'workspace-server/internal/handlers/mcp.go'
- 'workspace-server/internal/handlers/mcp_tools.go'
- 'workspace-server/internal/middleware/**'
- 'workspace-server/internal/handlers/registry.go'
- 'workspace-server/internal/handlers/workspace.go'
- 'tests/e2e/test_peer_visibility_mcp_staging.sh'
- 'tests/e2e/test_peer_visibility_token_mint_staging.sh'
- 'tests/e2e/test_peer_visibility_mcp_local.sh'
- 'tests/e2e/lib/peer_visibility_assert.sh'
- '.gitea/workflows/e2e-peer-visibility.yml'
pull_request:
branches: [main]
paths:
- 'workspace-server/internal/handlers/mcp.go'
- 'workspace-server/internal/handlers/mcp_tools.go'
- 'workspace-server/internal/middleware/**'
- 'workspace-server/internal/handlers/registry.go'
- 'workspace-server/internal/handlers/workspace.go'
- 'tests/e2e/test_peer_visibility_mcp_staging.sh'
- 'tests/e2e/test_peer_visibility_token_mint_staging.sh'
- 'tests/e2e/test_peer_visibility_mcp_local.sh'
- 'tests/e2e/lib/peer_visibility_assert.sh'
- '.gitea/workflows/e2e-peer-visibility.yml'
workflow_dispatch:
schedule:
# 07:30 UTC daily — catches AMI / template-hermes / template-openclaw
# drift even on quiet days. Offset 30m from e2e-staging-saas (07:00)
# so the two don't collide on the staging org-creation quota.
- cron: '30 7 * * *'
concurrency:
# Per-SHA (feedback_concurrency_group_per_sha). A single global group
# would let a queued staging/main push behind a PR run get cancelled,
# leaving any gate that reads "completed run at SHA" stuck.
group: e2e-peer-visibility-${{ github.event.pull_request.head.sha || github.sha }}
cancel-in-progress: false
env:
GITHUB_SERVER_URL: https://git.moleculesai.app
jobs:
# PR path: post a real status under the required-ready check name so a
# workflow-only PR is never silently statusless. The actual EC2 E2E is
# push/dispatch/cron only (30+ min). This is NOT a fake-green mask of
# the real assertion — it validates the driving script's bash syntax
# and inline-python so a broken test script fails at PR time.
pr-validate:
name: E2E Peer Visibility
runs-on: ubuntu-latest
if: github.event_name == 'pull_request'
timeout-minutes: 5
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: Validate driving scripts + shared assertion lib
run: |
bash -n tests/e2e/lib/peer_visibility_assert.sh
echo "lib/peer_visibility_assert.sh — bash syntax OK"
bash -n tests/e2e/test_peer_visibility_mcp_staging.sh
echo "test_peer_visibility_mcp_staging.sh — bash syntax OK"
bash -n tests/e2e/test_peer_visibility_token_mint_staging.sh
echo "test_peer_visibility_token_mint_staging.sh — bash syntax OK"
bash -n tests/e2e/test_peer_visibility_mcp_local.sh
echo "test_peer_visibility_mcp_local.sh — bash syntax OK"
if rg -n '/admin/workspaces/.*/test-token|test-token' tests/e2e/test_*staging*.sh; then
echo "::error::staging E2E must not use dev-only /admin/workspaces/:id/test-token; use production-safe admin token minting instead"
exit 1
fi
echo "Staging fresh-provision MCP list_peers E2E runs on push to"
echo "main / workflow_dispatch / daily cron (30+ min EC2 boot)."
echo "The LOCAL backend runs in the peer-visibility-local job"
echo "below on this same PR (local docker-compose stack)."
# LOCAL gate: same byte-identical assertion against the local prod-mimic
# docker-compose stack — the MANDATORY local-E2E that must run BEFORE
# the staging E2E (feedback_mandatory_local_e2e_before_ship,
# feedback_local_test_before_staging_e2e). Bootstrap mirrors
# e2e-api.yml's proven E2E API Smoke Test job (per-run container names +
# ephemeral host ports so concurrent host-network act_runner runs don't
# collide; go build; background platform-server). Its OWN non-required
# status context `E2E Peer Visibility (local)` — non-required-by-design
# exactly like the staging job (flip-to-required tracked at
# molecule-core#1296). HONEST gate, NO continue-on-error mask
# (feedback_fix_root_not_symptom). Runs on PR +
# push (local boot is minutes, not the 30+ min cold-EC2 path).
# bp-required: pending #1296
peer-visibility-local:
name: E2E Peer Visibility (local)
runs-on: docker-host
timeout-minutes: 30
env:
# Per-run names + ephemeral ports — same collision-avoidance as
# e2e-api.yml (host-network act_runner; feedback_act_runner_*).
PG_CONTAINER: pg-e2e-pv-${{ github.run_id }}-${{ github.run_attempt }}
REDIS_CONTAINER: redis-e2e-pv-${{ github.run_id }}-${{ github.run_attempt }}
# LLM keys so hermes/openclaw can actually boot. The local script
# SKIPs (not fails) any runtime whose key is absent, so a partially
# keyed CI env still exercises whatever it can.
CLAUDE_CODE_OAUTH_TOKEN: ${{ secrets.E2E_CLAUDE_CODE_OAUTH_TOKEN }}
E2E_MINIMAX_API_KEY: ${{ secrets.MOLECULE_STAGING_MINIMAX_API_KEY }}
E2E_ANTHROPIC_API_KEY: ${{ secrets.MOLECULE_STAGING_ANTHROPIC_API_KEY }}
E2E_OPENAI_API_KEY: ${{ secrets.MOLECULE_STAGING_OPENAI_API_KEY }}
PV_RUNTIMES: "hermes openclaw claude-code"
PV_LOCAL_PROVISION_MODE: external
ADMIN_TOKEN: local-e2e-admin-token
MOLECULE_ADMIN_TOKEN: local-e2e-admin-token
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- uses: actions/setup-go@40f1582b2485089dde7abd97c1529aa768e1baff # v5
with:
go-version: 'stable'
cache: true
cache-dependency-path: workspace-server/go.sum
- name: Pre-pull alpine + ensure provisioner network
run: |
docker pull alpine:latest >/dev/null
docker network create molecule-core-net >/dev/null 2>&1 || true
echo "alpine:latest pre-pulled; molecule-core-net ensured."
- name: Start Postgres (docker, ephemeral port)
run: |
docker rm -f "$PG_CONTAINER" 2>/dev/null || true
docker run -d --name "$PG_CONTAINER" \
-e POSTGRES_USER=dev -e POSTGRES_PASSWORD=dev -e POSTGRES_DB=molecule \
-p 0:5432 postgres:16 >/dev/null
PG_PORT=$(docker port "$PG_CONTAINER" 5432/tcp | awk -F: '/^0\.0\.0\.0:/ {print $2; exit}')
[ -n "$PG_PORT" ] || PG_PORT=$(docker port "$PG_CONTAINER" 5432/tcp | head -1 | awk -F: '{print $NF}')
if [ -z "$PG_PORT" ]; then
echo "::error::Could not resolve host port for $PG_CONTAINER"
docker logs "$PG_CONTAINER" || true; exit 1
fi
echo "DATABASE_URL=postgres://dev:dev@127.0.0.1:${PG_PORT}/molecule?sslmode=disable" >> "$GITHUB_ENV"
for i in $(seq 1 30); do
docker exec "$PG_CONTAINER" pg_isready -U dev >/dev/null 2>&1 && { echo "Postgres ready after ${i}s"; exit 0; }
sleep 1
done
echo "::error::Postgres did not become ready in 30s"; docker logs "$PG_CONTAINER" || true; exit 1
- name: Start Redis (docker, ephemeral port)
run: |
docker rm -f "$REDIS_CONTAINER" 2>/dev/null || true
docker run -d --name "$REDIS_CONTAINER" -p 0:6379 redis:7 >/dev/null
REDIS_PORT=$(docker port "$REDIS_CONTAINER" 6379/tcp | awk -F: '/^0\.0\.0\.0:/ {print $2; exit}')
[ -n "$REDIS_PORT" ] || REDIS_PORT=$(docker port "$REDIS_CONTAINER" 6379/tcp | head -1 | awk -F: '{print $NF}')
if [ -z "$REDIS_PORT" ]; then
echo "::error::Could not resolve host port for $REDIS_CONTAINER"
docker logs "$REDIS_CONTAINER" || true; exit 1
fi
echo "REDIS_URL=redis://127.0.0.1:${REDIS_PORT}" >> "$GITHUB_ENV"
for i in $(seq 1 15); do
docker exec "$REDIS_CONTAINER" redis-cli ping 2>/dev/null | grep -q PONG && { echo "Redis ready after ${i}s"; exit 0; }
sleep 1
done
echo "::error::Redis did not become ready in 15s"; docker logs "$REDIS_CONTAINER" || true; exit 1
- name: Build platform
working-directory: workspace-server
run: go build -o platform-server ./cmd/server
- name: Pick platform port
run: |
PLATFORM_PORT=$(python3 - <<'PY'
import socket
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
s.bind(("127.0.0.1", 0))
print(s.getsockname()[1])
PY
)
echo "PORT=${PLATFORM_PORT}" >> "$GITHUB_ENV"
echo "BASE=http://127.0.0.1:${PLATFORM_PORT}" >> "$GITHUB_ENV"
echo "Platform host port: ${PLATFORM_PORT}"
- name: Kill stale platform-server before start
run: |
killed=0
for pid in $(grep -l "platform-serve" /proc/[0-9]*/comm 2>/dev/null); do
kpid="${pid%/comm}"; kpid="${kpid##*/}"
cmdline=$(cat "/proc/${kpid}/cmdline" 2>/dev/null | tr '\0' ' ')
if echo "$cmdline" | grep -q "platform-server"; then
echo "Killing stale platform-server pid ${kpid}"
kill "$kpid" 2>/dev/null || true; killed=$((killed + 1))
fi
done
[ "$killed" -gt 0 ] && sleep 2 || true
echo "stale-kill done ($killed killed)"
- name: Start platform (background)
working-directory: workspace-server
run: |
./platform-server > platform.log 2>&1 &
echo $! > platform.pid
- name: Wait for /health
run: |
for i in $(seq 1 30); do
curl -sf "$BASE/health" > /dev/null && { echo "Platform up after ${i}s"; exit 0; }
sleep 1
done
echo "::error::Platform did not become healthy in 30s"
cat workspace-server/platform.log || true; exit 1
- name: Run LOCAL fresh-provision peer-visibility E2E (literal MCP list_peers)
# HONEST gate — NO continue-on-error. The local backend uses
# external-mode workspaces so this context tests the literal MCP
# peer-visibility path without coupling to template container boot.
run: bash tests/e2e/test_peer_visibility_mcp_local.sh
- name: Dump platform log on failure
if: failure()
run: cat workspace-server/platform.log || true
- name: Stop platform
if: always()
run: |
if [ -f workspace-server/platform.pid ]; then
kill "$(cat workspace-server/platform.pid)" 2>/dev/null || true
fi
- name: Stop service containers
if: always()
run: |
docker rm -f "$PG_CONTAINER" 2>/dev/null || true
docker rm -f "$REDIS_CONTAINER" 2>/dev/null || true
# Real STAGING gate: provisions a throwaway org + sibling-per-runtime,
# drives the LITERAL list_peers MCP call per runtime, asserts 200 +
# expected peer set, then scoped teardown. push(main)/dispatch/cron only.
peer-visibility:
name: E2E Peer Visibility
runs-on: ubuntu-latest
if: github.event_name != 'pull_request'
timeout-minutes: 60
env:
MOLECULE_CP_URL: https://staging-api.moleculesai.app
MOLECULE_ADMIN_TOKEN: ${{ secrets.CP_STAGING_ADMIN_API_TOKEN }}
# LLM provider key so each runtime can authenticate at boot.
# Priority MiniMax → direct-Anthropic → OpenAI matches
# test_staging_full_saas.sh's secrets-injection chain.
E2E_MINIMAX_API_KEY: ${{ secrets.MOLECULE_STAGING_MINIMAX_API_KEY }}
E2E_ANTHROPIC_API_KEY: ${{ secrets.MOLECULE_STAGING_ANTHROPIC_API_KEY }}
E2E_OPENAI_API_KEY: ${{ secrets.MOLECULE_STAGING_OPENAI_API_KEY }}
E2E_RUN_ID: "${{ github.run_id }}-${{ github.run_attempt }}"
PV_RUNTIMES: "hermes openclaw claude-code"
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: Verify admin token present
run: |
if [ -z "$MOLECULE_ADMIN_TOKEN" ]; then
echo "::error::CP_STAGING_ADMIN_API_TOKEN secret not set (Railway staging CP_ADMIN_API_TOKEN)"
exit 2
fi
echo "Admin token present"
- name: Verify an LLM key present
run: |
if [ -z "${E2E_MINIMAX_API_KEY:-}" ] && [ -z "${E2E_ANTHROPIC_API_KEY:-}" ] && [ -z "${E2E_OPENAI_API_KEY:-}" ]; then
echo "::error::No LLM provider key set — workspaces fail at boot with 'No provider API key found'. Set MOLECULE_STAGING_MINIMAX_API_KEY (or ANTHROPIC / OPENAI)."
exit 2
fi
echo "LLM key present"
- name: CP staging health preflight
run: |
code=$(curl -sS -o /dev/null -w "%{http_code}" --max-time 10 "$MOLECULE_CP_URL/health")
if [ "$code" != "200" ]; then
echo "::error::Staging CP unhealthy (HTTP $code) — infra, not a workspace bug. Failing loud per feedback_fix_root_not_symptom."
exit 1
fi
echo "Staging CP healthy"
- name: Run fresh-provision peer-visibility E2E (literal MCP list_peers)
run: bash tests/e2e/test_peer_visibility_mcp_staging.sh
# Belt-and-braces scoped teardown: the script installs an EXIT/INT/
# TERM trap, but if the runner itself is cancelled the trap may not
# fire. This always() step deletes ONLY the e2e-pv-<run_id> org this
# run created — never a cluster-wide sweep
# (feedback_never_run_cluster_cleanup_tests_on_live_platform). The
# admin DELETE is idempotent so double-invoking is safe;
# sweep-stale-e2e-orgs is the final net (slug starts with 'e2e-').
- name: Teardown safety net (runs on cancel/failure)
if: always()
env:
ADMIN_TOKEN: ${{ secrets.CP_STAGING_ADMIN_API_TOKEN }}
run: |
set +e
orgs=$(curl -sS "$MOLECULE_CP_URL/cp/admin/orgs?limit=500" \
-H "Authorization: Bearer $ADMIN_TOKEN" 2>/dev/null \
| python3 -c "
import json, sys, os, datetime
run_id = os.environ.get('GITHUB_RUN_ID', '')
try:
d = json.load(sys.stdin)
except Exception:
print(''); sys.exit(0)
# ONLY sweep slugs from THIS run. e2e-pv-<YYYYMMDD>-<run_id>-...
# Sweep today AND yesterday's UTC date so a midnight-crossing run
# still matches its own slug (same bug class as the saas/canvas
# safety nets).
today = datetime.date.today()
yest = today - datetime.timedelta(days=1)
dates = (today.strftime('%Y%m%d'), yest.strftime('%Y%m%d'))
if run_id:
prefixes = tuple(f'e2e-pv-{dt}-{run_id}-' for dt in dates)
else:
prefixes = tuple(f'e2e-pv-{dt}-' for dt in dates)
orgs = d if isinstance(d, list) else d.get('orgs', [])
cands = [o['slug'] for o in orgs
if any(o.get('slug','').startswith(p) for p in prefixes)
and o.get('instance_status') not in ('purged',)]
print('\n'.join(cands))
" 2>/dev/null)
for slug in $orgs; do
echo "Safety-net teardown: $slug"
set +e
curl -sS -o /tmp/pv-cleanup.out -w "%{http_code}" \
-X DELETE "$MOLECULE_CP_URL/cp/admin/tenants/$slug" \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d "{\"confirm\":\"$slug\"}" >/tmp/pv-cleanup.code
set -e
code=$(cat /tmp/pv-cleanup.code 2>/dev/null || echo "000")
if [ "$code" = "200" ] || [ "$code" = "204" ]; then
echo "[teardown] deleted $slug (HTTP $code)"
else
echo "::warning::pv teardown for $slug returned HTTP $code — sweep-stale-e2e-orgs will catch it within MAX_AGE_MINUTES. Body: $(head -c 300 /tmp/pv-cleanup.out 2>/dev/null)"
fi
done
exit 0
-178
View File
@@ -1,178 +0,0 @@
name: E2E Staging Sanity (leak-detection self-check)
# Ported from .github/workflows/e2e-staging-sanity.yml on 2026-05-11 per
# RFC internal#219 §1 sweep.
#
# Differences from the GitHub version:
# - Dropped `workflow_dispatch:` (Gitea 1.22.6 finicky on bare dispatch).
# - `actions/github-script@v9` issue-open block replaced with curl
# calls to the Gitea REST API (/api/v1/repos/.../issues|comments).
# - Workflow-level env.GITHUB_SERVER_URL set.
# - `continue-on-error: true` on the job (RFC §1 contract).
#
# Periodic assertion that the teardown safety nets in e2e-staging-saas
# and staging-smoke (formerly canary-staging) actually work. Runs the
# E2E harness with E2E_INTENTIONAL_FAILURE=1, which poisons the tenant
# admin token after the org is provisioned. The workspace-provision
# step then fails, the script exits non-zero, and the EXIT trap +
# workflow always()-step must still tear down cleanly.
on:
schedule:
- cron: '0 6 * * 1'
env:
GITHUB_SERVER_URL: https://git.moleculesai.app
concurrency:
group: e2e-staging-sanity
cancel-in-progress: false
permissions:
issues: write
contents: read
jobs:
sanity:
name: Intentional-failure teardown sanity
runs-on: ubuntu-latest
# Phase 3 (RFC #219 §1): surface broken workflows without blocking.
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
timeout-minutes: 20
env:
MOLECULE_CP_URL: https://staging-api.moleculesai.app
# 2026-05-11: secret canonicalised from MOLECULE_STAGING_ADMIN_TOKEN
# (dead in org secret store) to CP_STAGING_ADMIN_API_TOKEN per
# internal#322 — see this PR for the cross-workflow sweep.
MOLECULE_ADMIN_TOKEN: ${{ secrets.CP_STAGING_ADMIN_API_TOKEN }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
AWS_DEFAULT_REGION: us-east-2
E2E_AWS_LEAK_CHECK: required
E2E_AWS_TERMINATE_LEAKS: '1'
E2E_MODE: smoke
E2E_RUNTIME: hermes
E2E_RUN_ID: "sanity-${{ github.run_id }}"
E2E_INTENTIONAL_FAILURE: "1"
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: Verify admin token present
run: |
if [ -z "$MOLECULE_ADMIN_TOKEN" ]; then
echo "::error::CP_STAGING_ADMIN_API_TOKEN not set"
exit 2
fi
for var in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
if [ -z "${!var:-}" ]; then
echo "::error::$var not set — EC2 leak verification cannot run"
exit 2
fi
done
# Inverted assertion: the run MUST fail. If it passes, the
# E2E_INTENTIONAL_FAILURE path is broken.
- name: Run harness — expecting exit !=0
id: harness
run: |
set +e
bash tests/e2e/test_staging_full_saas.sh
rc=$?
echo "harness_rc=$rc" >> "$GITHUB_OUTPUT"
if [ "$rc" = "1" ]; then
echo "OK Harness failed as expected (rc=1); teardown trap ran, leak-check passed"
exit 0
elif [ "$rc" = "0" ]; then
echo "::error::Harness succeeded under E2E_INTENTIONAL_FAILURE=1 — the poisoning path is broken"
exit 1
elif [ "$rc" = "4" ]; then
echo "::error::LEAK DETECTED (rc=4) — teardown failed to clean up the org. Safety net broken."
exit 4
else
echo "::error::Unexpected rc=$rc — neither clean-failure nor leak. Investigate harness."
exit 1
fi
- name: Open issue if safety net is broken (Gitea API)
if: failure()
env:
GITEA_TOKEN: ${{ secrets.GITHUB_TOKEN }}
REPO: ${{ github.repository }}
SERVER_URL: ${{ env.GITHUB_SERVER_URL }}
RUN_ID: ${{ github.run_id }}
run: |
set -euo pipefail
API="${SERVER_URL%/}/api/v1"
TITLE="E2E teardown safety net broken"
RUN_URL="${SERVER_URL}/${REPO}/actions/runs/${RUN_ID}"
BODY_JSON=$(jq -nc --arg t "$TITLE" --arg run "$RUN_URL" '
{title: $t,
body: ("The weekly sanity run (E2E_INTENTIONAL_FAILURE=1) did not exit as expected. This means one of:\n - poisoning did not actually cause failure (test harness regression), OR\n - teardown left an orphan org (leak detection caught a real bug)\n\nRun: " + $run + "\n\nThis is higher priority than a canary failure — the whole E2E safety net cannot be trusted until this is resolved.")}')
EXISTING=$(curl -fsS -H "Authorization: token $GITEA_TOKEN" \
"${API}/repos/${REPO}/issues?state=open&type=issues&limit=50" \
| jq -r --arg t "$TITLE" '.[] | select(.title==$t) | .number' | head -1)
if [ -n "$EXISTING" ]; then
curl -fsS -X POST -H "Authorization: token $GITEA_TOKEN" -H "Content-Type: application/json" \
"${API}/repos/${REPO}/issues/${EXISTING}/comments" \
-d "$(jq -nc --arg run "$RUN_URL" '{body: ("Still broken. " + $run)}')" >/dev/null
echo "Commented on existing issue #${EXISTING}"
else
curl -fsS -X POST -H "Authorization: token $GITEA_TOKEN" -H "Content-Type: application/json" \
"${API}/repos/${REPO}/issues" -d "$BODY_JSON" >/dev/null
echo "Filed new issue"
fi
# Belt-and-braces: if teardown left anything behind, nuke it here
# so we don't bleed staging quota.
- name: Teardown safety net
if: always()
env:
ADMIN_TOKEN: ${{ secrets.CP_STAGING_ADMIN_API_TOKEN }}
run: |
set +e
orgs=$(curl -sS "$MOLECULE_CP_URL/cp/admin/orgs" \
-H "Authorization: Bearer $ADMIN_TOKEN" 2>/dev/null \
| python3 -c "
import json, sys
d = json.load(sys.stdin)
today = __import__('datetime').date.today().strftime('%Y%m%d')
# Match both the new e2e-smoke- prefix (post-2026-05-11 rename)
# and the legacy e2e-canary- prefix for one rollout cycle so
# any in-flight org provisioned under the old prefix on an
# older runner checkout still gets cleaned up. Remove the
# canary fallback after one week of no-old-prefix observations.
prefixes = (f'e2e-smoke-{today}-sanity-', f'e2e-canary-{today}-sanity-')
candidates = [o['slug'] for o in d.get('orgs', [])
if any(o.get('slug','').startswith(p) for p in prefixes)
and o.get('status') not in ('purged',)]
print('\n'.join(candidates))
" 2>/dev/null)
leaks=()
for slug in $orgs; do
# Tempfile-routed -w + set +e/-e prevents curl-exit-code
# pollution of the captured status (lint-curl-status-capture.yml).
set +e
curl -sS -o /tmp/sanity-cleanup.out -w "%{http_code}" \
-X DELETE "$MOLECULE_CP_URL/cp/admin/tenants/$slug" \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d "{\"confirm\":\"$slug\"}" >/tmp/sanity-cleanup.code
set -e
code=$(cat /tmp/sanity-cleanup.code 2>/dev/null || echo "000")
if [ "$code" = "200" ] || [ "$code" = "204" ]; then
echo "[teardown] deleted $slug (HTTP $code)"
else
echo "::warning::sanity teardown for $slug returned HTTP $code — sweep-stale-e2e-orgs will catch it within ~45 min. Body: $(head -c 300 /tmp/sanity-cleanup.out 2>/dev/null)"
leaks+=("$slug")
fi
done
if [ ${#leaks[@]} -gt 0 ]; then
echo "::warning::sanity teardown left ${#leaks[@]} leak(s): ${leaks[*]}"
fi
exit 0
-132
View File
@@ -1,132 +0,0 @@
# gate-check-v3 — automated PR gate detector
#
# Runs on every open PR (push/synchronize) and hourly via cron.
# Posts a structured [gate-check-v3] STATUS: comment on the PR.
#
# Inputs:
# PR_NUMBER — set via ${{ github.event.pull_request.number }} from the trigger
# POST_COMMENT — "true" to post/update comment on PR
#
# Gating logic (MVP signals 1,2,3,6):
# 1. Author-aware agent-tag comment scan
# 2. REQUEST_CHANGES reviews state machine
# 3. Staleness detection (SOP-12: review.commit_id != PR.head_sha + >1 working day)
# 6. CI required-checks awareness
#
# Exit code: 0=CLEAR, 1=BLOCKED, 2=ERROR
name: gate-check-v3
on:
pull_request_target:
types: [opened, edited, synchronize, reopened]
schedule:
# Hourly: refresh all open PRs
- cron: '8 * * * *'
# NOTE: `workflow_dispatch.inputs` block intentionally omitted.
# Gitea 1.22.6 parser rejects `workflow_dispatch.inputs.X` with
# "unknown on type" — it mis-treats the inputs sub-keys as top-level
# `on:` event types. Dropping the inputs block restores parsing.
# Manual dispatch from the Gitea UI works without the inputs schema
# (github.event.inputs.X returns empty); the script falls back to
# iterating all open PRs when PR_NUMBER is empty.
workflow_dispatch:
permissions:
# read: contents — for checkout (base ref, not PR head for security)
# read: pull-requests — for reading PR info via API
# write: pull-requests — for posting/updating gate-check comments
# Without this the token cannot POST/PATCH /issues/comments → 403.
contents: read
pull-requests: write
env:
GITHUB_SERVER_URL: https://git.moleculesai.app
jobs:
# bp-exempt: PR advisory bot; merge blocking is enforced by CI status and branch protection.
gate-check:
runs-on: ubuntu-latest
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true # Never block on our own detector failing
steps:
- name: Check out BASE ref (never PR-head under pull_request_target)
# pull_request_target runs with repo secrets-context, so checking out
# the PR HEAD would execute PR-branch gate_check.py with secrets.
# Fix: always load gate_check.py from the trusted base/default ref.
# Bug-1 (self-loop exclusion) + Bug-3 (403→exit0) from #547 are
# kept; only this checkout-ref regresses to pre-#547 behavior.
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
ref: ${{ github.event.pull_request.base.sha || github.ref_name }}
- name: Run gate-check-v3 (single PR mode)
if: github.event_name == 'pull_request_target' || github.event.inputs.pr_number != ''
env:
GITEA_TOKEN: ${{ secrets.SOP_TIER_CHECK_TOKEN || secrets.GITHUB_TOKEN }}
DEFAULT_BRANCH: ${{ github.event.repository.default_branch }}
PR_NUMBER: ${{ github.event.pull_request.number || github.event.inputs.pr_number }}
POST_COMMENT: ${{ github.event.inputs.post_comment || 'true' }}
run: |
set -euo pipefail
python3 tools/gate-check-v3/gate_check.py \
--repo "${{ github.repository }}" \
--pr "$PR_NUMBER" \
$([ "$POST_COMMENT" = "true" ] && echo "--post-comment")
echo "verdict=$?" >> "$GITHUB_OUTPUT"
- name: Run gate-check-v3 (all open PRs — cron mode)
if: github.event_name == 'schedule'
env:
GITEA_TOKEN: ${{ secrets.SOP_TIER_CHECK_TOKEN || secrets.GITHUB_TOKEN }}
DEFAULT_BRANCH: ${{ github.event.repository.default_branch }}
REPO: ${{ github.repository }}
run: |
set -euo pipefail
# Fetch all open PRs and run gate-check on each. This scheduled
# refresher is advisory; a transient Gitea list timeout must not turn
# main red. PR-specific gate-check runs still use normal failure
# semantics.
pr_numbers=$(python3 <<'PY'
import json
import os
import socket
import sys
import time
import urllib.error
import urllib.request
socket.setdefaulttimeout(30)
token = os.environ["GITEA_TOKEN"]
repo = os.environ["REPO"]
url = f"https://git.moleculesai.app/api/v1/repos/{repo}/pulls?state=open&limit=100"
last_error = None
for attempt in range(1, 4):
req = urllib.request.Request(
url,
headers={"Authorization": f"token {token}", "Accept": "application/json"},
)
try:
with urllib.request.urlopen(req, timeout=30) as r:
prs = json.loads(r.read())
break
except (TimeoutError, OSError, urllib.error.URLError, urllib.error.HTTPError) as exc:
last_error = exc
print(f"warning: PR list fetch attempt {attempt}/3 failed: {exc}", file=sys.stderr)
if attempt < 3:
time.sleep(2 * attempt)
else:
print(f"warning: skipped scheduled gate-check refresh; failed to list open PRs after 3 attempts: {last_error}", file=sys.stderr)
raise SystemExit(0)
for pr in prs:
print(pr["number"])
PY
)
for pr in $pr_numbers; do
echo "Checking PR #$pr..."
python3 tools/gate-check-v3/gate_check.py \
--repo "${{ github.repository }}" \
--pr "$pr" \
--post-comment \
|| true
done
-64
View File
@@ -1,64 +0,0 @@
name: gitea-merge-queue
# External serialized merge queue for Gitea 1.22.6.
#
# Gitea's `pull_auto_merge` table is not a real merge queue: it does not
# serialize green PRs against a freshly-tested latest main. This workflow runs
# the user-space queue bot, one PR per tick, using the non-bypass merge actor.
#
# Queue contract:
# - add label `merge-queue` to an open same-repo PR
# - bot updates stale PR heads with current main, then waits for CI
# - bot merges only when current main is green and required PR contexts pass
# - add `merge-queue-hold` to pause a queued PR without removing it
on:
# Schedule moved to operator-config:
# /etc/cron.d/molecule-core-merge-queue ->
# /usr/local/bin/molecule-core-cron-bot.sh merge-queue
#
# The queue bot still processes one PR per tick, but no longer occupies
# one of the shared Actions runners just to poll.
workflow_dispatch:
permissions:
contents: read
concurrency:
group: gitea-merge-queue-${{ github.repository }}
cancel-in-progress: false
jobs:
queue:
runs-on: ubuntu-latest
timeout-minutes: 5
steps:
- name: Check out queue script from main
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
ref: ${{ github.event.repository.default_branch }}
- name: Process one queued PR
env:
# AUTO_SYNC_TOKEN is the devops-engineer persona PAT. It is the
# non-bypass merge actor allowed by branch protection.
GITEA_TOKEN: ${{ secrets.AUTO_SYNC_TOKEN }}
GITEA_HOST: git.moleculesai.app
REPO: ${{ github.repository }}
WATCH_BRANCH: ${{ github.event.repository.default_branch }}
QUEUE_LABEL: merge-queue
HOLD_LABEL: merge-queue-hold
UPDATE_STYLE: merge
REQUIRED_CONTEXTS: >-
CI / all-required (pull_request),
sop-checklist / all-items-acked (pull_request)
# Push-side required contexts. Checking CI / all-required (push)
# explicitly instead of the combined state avoids false-pause when
# non-blocking jobs (continue-on-error: true) have failed — those
# failures pollute combined state but do not gate merges.
# NOTE: the event-suffixed context name is intentional — branch protection
# MUST require `CI / all-required (pull_request)` (with suffix), NOT the
# bare `CI / all-required`. Gitea treats absent contexts as pending, not
# skipped; requiring the bare name silently blocks all merges (issue #1473).
PUSH_REQUIRED_CONTEXTS: CI / all-required (push)
run: python3 .gitea/scripts/gitea-merge-queue.py
@@ -1,283 +0,0 @@
name: Handlers Postgres Integration
# Ported from .github/workflows/handlers-postgres-integration.yml on 2026-05-11 per RFC
# internal#219 §1 sweep. Differences from the GitHub version:
# - Dropped `workflow_dispatch.inputs` (Gitea 1.22.6 parser rejects them
# per feedback_gitea_workflow_dispatch_inputs_unsupported).
# - Dropped `merge_group:` (no Gitea merge queue).
# - Dropped `environment:` blocks (Gitea has no environments).
# - Workflow-level env.GITHUB_SERVER_URL pinned per
# feedback_act_runner_github_server_url.
# - `continue-on-error: true` on each job (RFC §1 contract).
#
# Real-Postgres integration tests for workspace-server/internal/handlers/.
# Triggered on every PR/push that touches the handlers package.
#
# Why this workflow exists
# ------------------------
# Strict-sqlmock unit tests pin which SQL statements fire — they're fast
# and let us iterate without a DB. But sqlmock CANNOT detect bugs that
# depend on the row state AFTER the SQL runs. The result_preview-lost
# bug shipped to staging in PR #2854 because every unit test was
# satisfied with "an UPDATE statement fired" — none verified the row's
# preview field actually landed. The local-postgres E2E that retrofit
# self-review caught it took 2 minutes to set up and would have caught
# the bug at PR-time.
#
# Why this workflow does NOT use `services: postgres:` (Class B fix)
# ------------------------------------------------------------------
# Our act_runner config has `container.network: host` (operator host
# /opt/molecule/runners/config.yaml), which act_runner applies to BOTH
# the job container AND every service container. With host-net, two
# concurrent runs of this workflow both try to bind 0.0.0.0:5432 — the
# second postgres FATALs with `could not create any TCP/IP sockets:
# Address in use`, and Docker auto-removes it (act_runner sets
# AutoRemove:true on service containers). By the time the migrations
# step runs `psql`, the postgres container is gone, hence
# `Connection refused` then `failed to remove container: No such
# container` at cleanup time.
#
# Per-job `container.network` override is silently ignored by
# act_runner — `--network and --net in the options will be ignored.`
# appears in the runner log. Documented constraint.
#
# So we sidestep `services:` entirely. The job container still uses
# host-net (inherited from runner config; required for cache server
# discovery on the bridge IP 172.18.0.17:42631). We launch a sibling
# postgres on the existing `molecule-core-net` bridge with a
# UNIQUE name per run — `pg-handlers-${RUN_ID}-${RUN_ATTEMPT}` — and
# read its bridge IP via `docker inspect`. A host-net job container
# can reach a bridge-net container directly via the bridge IP (verified
# manually on operator host 2026-05-08).
#
# Trade-offs vs. the original `services:` shape:
# + No host-port collision; N parallel runs share the bridge cleanly
# + `if: always()` cleanup runs even on test-step failure
# - One more step in the workflow (+~3 lines)
# - Requires `molecule-core-net` to exist on the operator host
# (it does; declared in docker-compose.yml + docker-compose.infra.yml)
#
# Class B Hongming-owned CICD red sweep, 2026-05-08.
#
# Cost: ~30s job (postgres pull from cache + go build + 4 tests).
on:
push:
branches: [main, staging]
pull_request:
branches: [main, staging]
concurrency:
group: handlers-pg-integ-${{ github.event.pull_request.head.sha || github.sha }}
cancel-in-progress: false
env:
GITHUB_SERVER_URL: https://git.moleculesai.app
jobs:
detect-changes:
name: detect-changes
# mc#1529 §1: pin to `docker-host` so the integration job runs on the
# operator-host runners (molecule-runner-*), which carry the
# `molecule-core-net` bridge network this workflow depends on. PC2
# runners (hongming-pc-runner-*) also advertise ubuntu-latest but
# don't have that network — the previous `runs-on: ubuntu-latest`
# rolled the dice and hard-failed the bridge-inspect step ~30% of
# the time. detect-changes itself doesn't need the bridge, but keeping
# both jobs on the same label avoids workspace-volume cross-host
# surprises and keeps the routing rule discoverable in one place.
runs-on: docker-host
# mc#774 Phase 3 (RFC §1): surface broken workflows without blocking.
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
outputs:
handlers: ${{ steps.filter.outputs.handlers }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
# A full-history checkout can exceed the runner's quiet/startup
# window before the path filter emits logs. Fetch the common push
# case cheaply; the script below fetches the exact BASE SHA if it is
# not present in the shallow checkout.
fetch-depth: 2
- id: filter
run: |
python3 .gitea/scripts/detect-changes.py \
--profile handlers-postgres \
--event-name "${{ github.event_name }}" \
--pr-base-sha "${{ github.event.pull_request.base.sha }}" \
--base-ref "${{ github.event.pull_request.base.ref }}" \
--push-before "${GITHUB_EVENT_BEFORE:-}"
# Single-job-with-per-step-if pattern: always runs to satisfy the
# required-check name on branch protection; real work gates on the
# paths filter. See ci.yml's Platform (Go) for the same shape.
integration:
name: Handlers Postgres Integration
needs: detect-changes
# mc#1529 §1: must run on operator-host (where `molecule-core-net`
# exists). See detect-changes for the full routing rationale.
runs-on: docker-host
# mc#774 Phase 3 (RFC §1): surface broken workflows without blocking.
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
env:
# Unique name per run so concurrent jobs don't collide on the
# bridge network. ${RUN_ID}-${RUN_ATTEMPT} is unique even across
# workflow_dispatch reruns of the same run_id.
PG_NAME: pg-handlers-${{ github.run_id }}-${{ github.run_attempt }}
# Bridge network already exists on the operator host (declared
# in docker-compose.yml + docker-compose.infra.yml).
PG_NETWORK: molecule-core-net
defaults:
run:
working-directory: workspace-server
steps:
- if: needs.detect-changes.outputs.handlers != 'true'
working-directory: .
run: echo "No handlers/migrations changes — skipping; this job always runs to satisfy the required-check name."
- if: needs.detect-changes.outputs.handlers == 'true'
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- if: needs.detect-changes.outputs.handlers == 'true'
uses: actions/setup-go@40f1582b2485089dde7abd97c1529aa768e1baff # v5
with:
go-version: 'stable'
- if: needs.detect-changes.outputs.handlers == 'true'
name: Start sibling Postgres on bridge network
working-directory: .
run: |
# Sanity: the bridge network must exist on the operator host.
# Hard-fail loud if it doesn't — easier to spot than a silent
# auto-create that diverges from the rest of the stack.
if ! docker network inspect "${PG_NETWORK}" >/dev/null 2>&1; then
echo "::error::Bridge network '${PG_NETWORK}' missing on operator host. Re-run docker-compose.infra.yml or check ops handbook."
exit 1
fi
# If a stale container with the same name exists (rerun on
# the same run_id), wipe it first.
docker rm -f "${PG_NAME}" >/dev/null 2>&1 || true
docker run -d \
--name "${PG_NAME}" \
--network "${PG_NETWORK}" \
--health-cmd "pg_isready -U postgres" \
--health-interval 5s \
--health-timeout 5s \
--health-retries 10 \
-e POSTGRES_PASSWORD=test \
-e POSTGRES_DB=molecule \
postgres:15-alpine >/dev/null
# Read back the bridge IP. Always present immediately after
# `docker run -d` for bridge networks.
PG_HOST=$(docker inspect "${PG_NAME}" \
--format "{{(index .NetworkSettings.Networks \"${PG_NETWORK}\").IPAddress}}")
if [ -z "${PG_HOST}" ]; then
echo "::error::Could not resolve PG_HOST for ${PG_NAME} on ${PG_NETWORK}"
docker logs "${PG_NAME}" || true
exit 1
fi
echo "PG_HOST=${PG_HOST}" >> "$GITHUB_ENV"
echo "INTEGRATION_DB_URL=postgres://postgres:test@${PG_HOST}:5432/molecule?sslmode=disable" >> "$GITHUB_ENV"
echo "Started ${PG_NAME} at ${PG_HOST}:5432"
- if: needs.detect-changes.outputs.handlers == 'true'
name: Apply migrations to Postgres service
env:
PGPASSWORD: test
run: |
# Wait for postgres to actually accept connections. Docker's
# health-cmd handles container-side readiness, but the wire
# to the bridge IP is best-tested with pg_isready directly.
for i in {1..15}; do
if pg_isready -h "${PG_HOST}" -p 5432 -U postgres -q; then break; fi
echo "waiting for postgres at ${PG_HOST}:5432..."; sleep 2
done
# Apply every .up.sql in lexicographic order with
# ON_ERROR_STOP=0 — failing migrations are SKIPPED rather than
# blocking the suite. This handles the current schema state
# where a few historical migrations (e.g. 017_memories_fts_*)
# depend on tables that were later renamed/dropped and so
# cannot replay from scratch. The migrations that DO succeed
# land their tables, which is sufficient for the integration
# tests in handlers/.
#
# Why not maintain a curated allowlist: every new migration
# touching a handlers/-tested table would have to update this
# workflow. With apply-all-or-skip, a future migration that
# adds a column to delegations runs automatically (its base
# table 049_delegations.up.sql already succeeded above it in
# the order). Operators only need to revisit this if the
# migration chain becomes legitimately replayable end-to-end.
#
# Per-migration result is logged so a failed migration that
# SHOULD have been replayable surfaces in the CI log instead
# of silently failing.
# Apply both *.sql (legacy, lives next to its module) and
# *.up.sql (newer up/down convention) in a single
# lexicographically-sorted pass. Excluding *.down.sql so the
# newest-naming-convention pairs don't undo themselves mid-run.
# Pre-#149-followup this loop only globbed *.up.sql, which
# silently skipped 001_workspaces.sql + 009_activity_logs.sql
# — fine while no integration test depended on those tables,
# not fine once a cross-table atomicity test came in.
set +e
for migration in $(ls migrations/*.sql 2>/dev/null | grep -v '\.down\.sql$' | sort); do
if psql -h "${PG_HOST}" -U postgres -d molecule -v ON_ERROR_STOP=1 \
-f "$migration" >/dev/null 2>&1; then
echo "✓ $(basename "$migration")"
else
echo "⊘ $(basename "$migration") (skipped — see comment in workflow)"
fi
done
set -e
# Sanity: the delegations + workspaces + activity_logs tables
# MUST exist for the integration tests to be meaningful. Hard-
# fail if any didn't land — that would be a real regression we
# want loud.
for tbl in delegations workspaces activity_logs pending_uploads; do
if ! psql -h "${PG_HOST}" -U postgres -d molecule -tA \
-c "SELECT 1 FROM information_schema.tables WHERE table_name = '$tbl'" \
| grep -q 1; then
echo "::error::$tbl table missing after migration replay — handler integration tests would be meaningless"
exit 1
fi
echo "✓ $tbl table present"
done
- if: needs.detect-changes.outputs.handlers == 'true'
name: Run integration tests
run: |
# INTEGRATION_DB_URL is exported by the start-postgres step;
# points at the per-run bridge IP, not 127.0.0.1, so concurrent
# workflow runs don't fight over a host-net 5432 port.
go test -tags=integration -timeout 5m -v ./internal/handlers/ -run "^TestIntegration_"
- if: failure() && needs.detect-changes.outputs.handlers == 'true'
name: Diagnostic dump on failure
env:
PGPASSWORD: test
run: |
echo "::group::postgres container status"
docker ps -a --filter "name=${PG_NAME}" --format '{{.Status}} {{.Names}}' || true
docker logs "${PG_NAME}" 2>&1 | tail -50 || true
echo "::endgroup::"
echo "::group::delegations table state"
psql -h "${PG_HOST}" -U postgres -d molecule -c "SELECT * FROM delegations LIMIT 50;" || true
echo "::endgroup::"
- if: always() && needs.detect-changes.outputs.handlers == 'true'
name: Stop sibling Postgres
working-directory: .
run: |
# always() so containers don't leak when migrations or tests
# fail. The cleanup is best-effort: if the container is
# already gone (e.g. concurrent rerun race), don't fail the job.
docker rm -f "${PG_NAME}" >/dev/null 2>&1 || true
echo "Cleaned up ${PG_NAME}"
-321
View File
@@ -1,321 +0,0 @@
name: Harness Replays
# Ported from .github/workflows/harness-replays.yml on 2026-05-11 per RFC
# internal#219 §1 sweep. Differences from the GitHub version:
# - Dropped `workflow_dispatch.inputs` (Gitea 1.22.6 parser rejects them
# per feedback_gitea_workflow_dispatch_inputs_unsupported).
# - Dropped `merge_group:` (no Gitea merge queue).
# - Dropped `environment:` blocks (Gitea has no environments).
# - Workflow-level env.GITHUB_SERVER_URL pinned per
# feedback_act_runner_github_server_url.
# - `continue-on-error: true` on each job (RFC §1 contract).
#
# Boots tests/harness (production-shape compose topology with TenantGuard,
# /cp/* proxy, canvas proxy, real production Dockerfile.tenant) and runs
# every replay under tests/harness/replays/. Fails the PR if any replay
# fails.
#
# Why this exists: 2026-04-30 we shipped #2398 which added /buildinfo as
# a public route in router.go but forgot to add it to TenantGuard's
# allowlist. The handler-level test in buildinfo_test.go constructed a
# minimal gin engine without TenantGuard — green. The harness's
# buildinfo-stale-image.sh replay would have caught it (cf-proxy doesn't
# inject X-Molecule-Org-Id, so the curl path is identical to production's
# redeploy verifier), but no one ran the harness pre-merge. The bug
# shipped; the redeploy verifier silently soft-warned every tenant as
# "unreachable" for ~1 day before being noticed.
#
# This gate makes "did you actually run the harness?" a CI invariant
# instead of a memory-discipline thing.
#
# Trigger model — match e2e-api.yml: always FIRES on push/pull_request
# to staging+main, real work is gated per-step on detect-changes output.
# One job → one check run → branch-protection-clean (the SKIPPED-in-set
# trap from PR #2264 is documented in e2e-api.yml's e2e-api job comment).
"on":
push:
branches: [main, staging]
paths:
- 'workspace-server/**'
- 'canvas/**'
- 'tests/harness/**'
- '.gitea/workflows/harness-replays.yml'
pull_request:
branches: [main, staging]
paths:
- 'workspace-server/**'
- 'canvas/**'
- 'tests/harness/**'
- '.gitea/workflows/harness-replays.yml'
concurrency:
# Per-SHA grouping. Per-ref kept hitting the auto-promote-staging
# cancellation deadlock — see e2e-api.yml's concurrency block for
# the 2026-04-28 incident that codified this pattern.
group: harness-replays-${{ github.event.pull_request.head.sha || github.sha }}
cancel-in-progress: false
env:
GITHUB_SERVER_URL: https://git.moleculesai.app
jobs:
# bp-exempt: change detector only; downstream Harness Replays is the meaningful gate.
detect-changes:
# mc#1529 follow-on: pin to `docker-host` so this lane lands on
# Linux operator-host runners (the only ones with a working
# docker.sock + `molecule-core-net`). The bare `ubuntu-latest`
# label is also matched by hongming-pc-runner-* (Windows act_runner
# v1.0.3), where the `docker compose ...` exec below fails. Mirror
# of mc#1543; see internal#512 for class defect.
runs-on: docker-host
# Phase 3 (RFC #219 §1): surface broken workflows without blocking.
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
outputs:
run: ${{ steps.decide.outputs.run }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
# Shallow clone — we use the Gitea Compare API for changed-file
# detection, not local git diff. The base SHA is supplied via
# GitHub event variables, so no local history is needed.
fetch-depth: 1
- id: decide
env:
# Pass via env block — env values bypass shell quoting so single
# quotes in merge-commit messages (e.g. "Merge pull request 'fix: ...'
# from branch into main") cannot break the bash parser. The prior
# `echo '${{ toJSON(...) }}'` form broke on every main-push because
# every main commit is a merge commit with single quotes in the
# message body — the embedded `'` ended the single-quoted shell string
# mid-JSON, and a subsequent `(` (e.g. in `(#523)`) was parsed as a
# subshell, causing "syntax error near unexpected token `('".
COMMITS_JSON: ${{ toJSON(github.event.commits) }}
run: |
set -euo pipefail
# workflow_dispatch: always run (manual trigger)
if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
echo "run=true" >> "$GITHUB_OUTPUT"
echo "debug=manual-trigger" >> "$GITHUB_OUTPUT"
exit 0
fi
# Determine changed files.
# workflow_dispatch: always run.
# pull_request: use Compare API (branch-to-branch works fine).
# push: use github.event.commits array (Compare API rejects SHA-to-branch).
# new-branch: run everything.
if [ "${{ github.event_name }}" = "pull_request" ]; then
BASE="${{ github.event.pull_request.base.ref }}"
HEAD="${{ github.event.pull_request.head.ref }}"
elif [ -n "${{ github.event.before }}" ] && \
! echo "${{ github.event.before }}" | grep -qE '^0+$'; then
# Push event: extract changed files from github.event.commits array.
# Gitea Compare API rejects SHA-to-branch comparisons (BaseNotExist),
# so we use the commits array instead. This array contains all commits
# in the push, each with their added/removed/modified file lists.
printf '%s' "$COMMITS_JSON" \
| bash .gitea/scripts/push-commits-diff-files.py \
> .push-diff-files.txt 2>/dev/null || true
DIFF_FILES=$(cat .push-diff-files.txt 2>/dev/null || true)
if [ -n "$DIFF_FILES" ] && echo "$DIFF_FILES" | grep -qE '^workspace-server/|^canvas/|^tests/harness/|^.gitea/workflows/harness-replays\.yml$'; then
echo "run=true" >> "$GITHUB_OUTPUT"
else
echo "run=false" >> "$GITHUB_OUTPUT"
fi
echo "debug=push-files=$DIFF_FILES" >> "$GITHUB_OUTPUT"
exit 0
else
# New branch or github.event.before unavailable — run everything.
echo "run=true" >> "$GITHUB_OUTPUT"
echo "debug=new-branch-fallback" >> "$GITHUB_OUTPUT"
exit 0
fi
# Call Gitea Compare API (pull_request path only — branch-to-branch).
# Push uses github.event.commits array above.
RESP=$(curl -sS --fail --max-time 30 \
-H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" \
-H "Accept: application/json" \
"$GITHUB_SERVER_URL/api/v1/repos/$GITHUB_REPOSITORY/compare/$BASE...$HEAD") || {
# If Gitea's Compare API is slow/unavailable, choose the conservative
# behavior: run the harness instead of failing the detector and polluting
# main with a red non-gate context.
echo "run=true" >> "$GITHUB_OUTPUT"
echo "debug=compare-api-unavailable base=$BASE head=$HEAD" >> "$GITHUB_OUTPUT"
exit 0
}
DIFF_FILES=$(echo "$RESP" | bash .gitea/scripts/compare-api-diff-files.py 2>/dev/null || true)
echo "debug=diff-base=$BASE diff-files=$DIFF_FILES" >> "$GITHUB_OUTPUT"
if echo "$DIFF_FILES" | grep -qE '^workspace-server/|^canvas/|^tests/harness/|^.gitea/workflows/harness-replays\.yml$'; then
echo "run=true" >> "$GITHUB_OUTPUT"
else
echo "run=false" >> "$GITHUB_OUTPUT"
fi
# ONE job that always runs. Real work is gated per-step on
# detect-changes.outputs.run so an unrelated PR (e.g. doc-only
# change to molecule-controlplane wired here later) emits the
# required check without spending CI cycles. Single-job pattern
# matches e2e-api.yml — see that workflow's comment for why a
# job-level `if: false` would block branch protection via the
# SKIPPED-in-set bug.
# bp-exempt: path-filtered replay suite; CI / all-required is the branch-protection aggregate.
harness-replays:
needs: detect-changes
name: Harness Replays
# mc#1529 follow-on: `docker compose ... ps/logs` against tenant-alpha/
# beta containers. Must run on operator-host Linux (docker-host).
runs-on: docker-host
# Phase 3 (RFC #219 §1): surface broken workflows without blocking.
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
timeout-minutes: 30
steps:
- name: No-op pass (paths filter excluded this commit)
if: needs.detect-changes.outputs.run != 'true'
run: |
echo "No workspace-server / canvas / tests/harness / workflow changes — Harness Replays gate satisfied without running."
echo "::notice::Harness Replays no-op pass (paths filter excluded this commit)."
echo "::notice::Debug: ${{ needs.detect-changes.outputs.debug }}"
- if: needs.detect-changes.outputs.run == 'true'
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
# Log what files were detected so future failures include the diff.
- name: Log detected changes
if: needs.detect-changes.outputs.run == 'true'
run: |
echo "::notice::detect-changes debug: ${{ needs.detect-changes.outputs.debug }}"
# github-app-auth sibling-checkout removed 2026-05-07 (#157):
# the plugin was dropped + Dockerfile.tenant no longer COPYs it.
# Pre-clone manifest deps before docker compose builds the tenant
# image (Task #173 followup — same pattern as
# publish-workspace-server-image.yml's "Pre-clone manifest deps"
# step).
#
# Why pre-clone here too: tests/harness/compose.yml builds tenant-alpha
# and tenant-beta from workspace-server/Dockerfile.tenant with
# context=../.. (repo root). That Dockerfile expects
# .tenant-bundle-deps/{workspace-configs-templates,org-templates,plugins}
# to be present at build context root (post-#173 it COPYs from there
# instead of running an in-image clone — the in-image clone failed
# with "could not read Username for https://git.moleculesai.app"
# because there's no auth path inside the build sandbox).
#
# Without this step harness-replays fails before any replay runs,
# with `failed to calculate checksum of ref ...
# "/.tenant-bundle-deps/plugins": not found`. Caught by run #892
# (main, 2026-05-07T20:28:53Z) and run #964 (staging — same
# symptom, different root cause: staging still has the in-image
# clone path, hits the auth error directly).
#
# 2026-05-08 sub-finding (#192): the clone step ALSO fails when
# any referenced workspace-template repo is private and the
# AUTO_SYNC_TOKEN bearer (devops-engineer persona) lacks read
# access. Root cause: 5 of 9 workspace-template repos
# (openclaw, codex, crewai, deepagents, gemini-cli) had been
# marked private with no team grant. Resolution: flipped them
# to public per `feedback_oss_first_repo_visibility_default`
# (the OSS surface should be public). Layer-3 (customer-private +
# marketplace third-party repos) tracked separately in
# internal#102.
#
# Token shape matches publish-workspace-server-image.yml: AUTO_SYNC_TOKEN
# is the devops-engineer persona PAT, NOT the founder PAT (per
# `feedback_per_agent_gitea_identity_default`). clone-manifest.sh
# embeds it as basic-auth for the duration of the clones and strips
# .git directories — the token never enters the resulting image.
- name: Pre-clone manifest deps
if: needs.detect-changes.outputs.run == 'true'
env:
MOLECULE_GITEA_TOKEN: ${{ secrets.AUTO_SYNC_TOKEN }}
run: |
set -euo pipefail
if [ -z "${MOLECULE_GITEA_TOKEN}" ]; then
echo "::warning::AUTO_SYNC_TOKEN not set — using anonymous clone (repos are public per manifest.json OSS contract)"
fi
mkdir -p .tenant-bundle-deps
# Strip JSON5 comments before jq parsing — Integration Tester appends
# `// Triggered by ...` which breaks `jq` in clone-manifest.sh.
sed '/^[[:space:]]*\/\//d' manifest.json > .manifest-stripped.json
bash scripts/clone-manifest.sh \
.manifest-stripped.json \
.tenant-bundle-deps/workspace-configs-templates \
.tenant-bundle-deps/org-templates \
.tenant-bundle-deps/plugins
# Sanity-check counts so a silent partial clone fails fast
# instead of producing a half-empty image.
ws_count=$(find .tenant-bundle-deps/workspace-configs-templates -mindepth 1 -maxdepth 1 -type d | wc -l)
org_count=$(find .tenant-bundle-deps/org-templates -mindepth 1 -maxdepth 1 -type d | wc -l)
plugins_count=$(find .tenant-bundle-deps/plugins -mindepth 1 -maxdepth 1 -type d | wc -l)
echo "Cloned: ws=$ws_count org=$org_count plugins=$plugins_count"
- name: Install Python deps for replays
# peer-discovery-404 (and future replays) eval Python against the
# running tenant — importing workspace/a2a_client.py pulls in
# httpx. tests/harness/requirements.txt holds just the HTTP-client
# surface to keep CI install fast (~3s) vs the full
# workspace/requirements.txt (~30s).
if: needs.detect-changes.outputs.run == 'true'
run: pip install -r tests/harness/requirements.txt
- name: Run all replays against the harness
# run-all-replays.sh: boot via up.sh → seed via seed.sh → run
# every replays/*.sh → tear down via down.sh on EXIT (trap).
# Non-zero exit on any replay failure.
#
# KEEP_UP=1: without this, the script's trap-on-EXIT tears
# down containers immediately on failure, leaving the dump
# step below with nothing to dump (verified on PR #2410's
# first run — tenant became unhealthy, trap fired, dump
# step saw empty containers). Keeping them up lets the
# failure path collect tenant/cp-stub/cf-proxy logs. The
# always-run "Force teardown" step does the actual cleanup.
if: needs.detect-changes.outputs.run == 'true'
working-directory: tests/harness
env:
KEEP_UP: "1"
run: ./run-all-replays.sh
- name: Dump compose logs on failure
# SECRETS_ENCRYPTION_KEY: docker compose validates the entire compose
# file even for read-only `logs` calls. up.sh generates a per-run key
# and exports it to its OWN shell — this step runs in a fresh shell
# that wouldn't see it, so without a placeholder the validate step
# errors before logs print (verified against PR #2492's first run:
# "required variable SECRETS_ENCRYPTION_KEY is missing a value").
# A placeholder is fine — we're only reading log streams, not booting.
if: failure() && needs.detect-changes.outputs.run == 'true'
working-directory: tests/harness
env:
SECRETS_ENCRYPTION_KEY: dump-logs-placeholder
run: |
echo "=== docker compose ps ==="
docker compose -f compose.yml ps || true
echo "=== tenant-alpha logs ==="
docker compose -f compose.yml logs tenant-alpha || true
echo "=== tenant-beta logs ==="
docker compose -f compose.yml logs tenant-beta || true
echo "=== cp-stub logs ==="
docker compose -f compose.yml logs cp-stub || true
echo "=== cf-proxy logs ==="
docker compose -f compose.yml logs cf-proxy || true
echo "=== postgres-alpha logs (last 100) ==="
docker compose -f compose.yml logs --tail 100 postgres-alpha || true
echo "=== postgres-beta logs (last 100) ==="
docker compose -f compose.yml logs --tail 100 postgres-beta || true
- name: Force teardown
# We pass KEEP_UP=1 to run-all-replays.sh so the dump step
# above sees real containers — that means we own teardown
# explicitly here. Always run.
if: always() && needs.detect-changes.outputs.run == 'true'
working-directory: tests/harness
run: ./down.sh || true
@@ -1,120 +0,0 @@
name: lint-bp-context-emit-match
# Tier 2f scheduled lint (per mc#774) — detects drift between
# `branch_protections/<branch>.status_check_contexts` and the set of
# contexts emitted by `.gitea/workflows/*.yml`.
#
# Rule
# ----
# For each protected branch context (Source A — BP), there must exist
# at least one emitting workflow + job pair (Source B — workflow YAML
# + on:-event mapping) whose runtime status-name maps to it. The
# inverse direction (emitter without BP context) is informational
# only — Tier 2g handles that at PR-time.
#
# Why this exists
# ---------------
# A BP-required context with no emitter blocks merges forever — Gitea
# 1.22.6 treats absent-as-`pending`, NOT absent-as-`skipped`. The
# phantom-required-check class previously surfaced as
# `feedback_phantom_required_check_after_gitea_migration` (a port
# kept the GitHub context name after rename to Gitea, but no
# workflow emitted under the new name).
#
# This lint catches the same class structurally + a forward case:
# workflow renamed/deleted while still in BP.
#
# Scope
# -----
# Scheduled daily. We DON'T run on `pull_request` because (a) the
# emitter side moves with PR diffs (transitional state false-flags)
# and (b) Tier 2g handles emitter-side drift at PR-time.
#
# Cross-repo
# ----------
# Today this runs only on molecule-core/main. Per internal#349
# (cross-repo BP sweep) Class-D repos will get the same lint after
# their BP rollouts.
#
# Auth
# ----
# `GET /repos/.../branch_protections/{branch}` requires repo-admin
# role on Gitea 1.22.6. We use DRIFT_BOT_TOKEN (same persona as
# ci-required-drift.yml — `internal#329` provisioning trail).
# Graceful-degrade per Tier 2a contract: 403/404 → exit 0 with
# ::error::.
#
# Idempotency
# -----------
# The drift issue is filed with title prefix
# `[ci-bp-drift] {repo}/{branch}: BP→emitter mismatch`. The script
# searches OPEN issues for an exact title-prefix match and PATCHes
# the existing issue (if any) instead of POSTing a duplicate.
# Mirrors `ci-required-drift.py`'s contract.
#
# Phase contract (RFC internal#219 §1 ladder)
# -------------------------------------------
# Lands at `continue-on-error: true` (Phase 3). After 7 days of clean
# scheduled runs on `main`, flip to `false` so a scheduled failure
# becomes a hard CI signal.
#
# Cross-links
# -----------
# - mc#774 (the RFC that specs this lint)
# - internal#349 (cross-repo BP sweep)
# - feedback_phantom_required_check_after_gitea_migration
# - feedback_tier_label_ids_are_per_repo
# - ci-required-drift.yml (F2 detector, narrower-scope sibling)
on:
schedule:
# Daily at 03:31 UTC — off-peak, prime-staggered from other
# scheduled jobs (ci-required-drift :00 hourly, lint-coe-tracking
# 13:11). At 03:31 the CI fleet is quietest in EMEA hours.
- cron: '31 3 * * *'
workflow_dispatch:
# No `push` / `pull_request` here — Tier 2g owns PR-time drift.
env:
GITHUB_SERVER_URL: https://git.moleculesai.app
permissions:
contents: read
issues: write # needed to file/edit the drift issue
concurrency:
group: lint-bp-context-emit-match-${{ github.ref }}
cancel-in-progress: true
jobs:
lint:
name: lint-bp-context-emit-match
runs-on: ubuntu-latest
timeout-minutes: 5
# Phase 3 (RFC #219 §1): surface drift without blocking. After 7
# clean scheduled runs on main, flip to false so a scheduled
# failure is a hard CI signal.
continue-on-error: true # mc#774 Phase 3 — flip to false after 7 clean main runs
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
with:
python-version: '3.12'
- name: Install PyYAML
run: python -m pip install --quiet 'PyYAML==6.0.2'
- name: Run lint-bp-context-emit-match
env:
# DRIFT_BOT_TOKEN — repo-admin on this repo (internal#329
# provisioning trail). Required for branch_protections read.
GITEA_TOKEN: ${{ secrets.DRIFT_BOT_TOKEN }}
GITEA_HOST: git.moleculesai.app
REPO: ${{ github.repository }}
BRANCH: main
WORKFLOWS_DIR: .gitea/workflows
DRIFT_LABEL: ci-bp-drift
GITHUB_RUN_URL: https://git.moleculesai.app/${{ github.repository }}/actions/runs/${{ github.run_id }}
run: python3 .gitea/scripts/lint_bp_context_emit_match.py
- name: Run lint-bp-context-emit-match unit tests
run: |
python -m pip install --quiet pytest
python3 -m pytest tests/test_lint_bp_context_emit_match.py -v
@@ -1,122 +0,0 @@
name: lint-continue-on-error-tracking
# Tier 2e hard-gate lint (per mc#774) — every
# `continue-on-error: true` in `.gitea/workflows/*.yml` must carry a
# `# mc#NNNN` or `# internal#NNNN` tracker comment within 2 lines,
# the referenced issue must be OPEN, and ≤14 days old.
#
# Why this exists
# ---------------
# `continue-on-error: true` on `platform-build` had been hiding
# mc#774-class regressions for ~3 weeks before #656 surfaced them on
# 2026-05-12. A 14-day cap on tracker age forces a review cycle and
# surfaces mask-drift within at most 14 days of the original defect.
# Each `continue-on-error: true` gets a paper trail — close or renew.
#
# How the gate works
# ------------------
# 1. Walk `.gitea/workflows/*.yml` via PyYAML's line-tracking loader
# (per `feedback_behavior_based_ast_gates`) and find every job
# whose `continue-on-error` evaluates truthy (`true` or string
# `"true"` — Gitea's evaluator coerces strings).
# 2. For each, scan ±2 lines of the directive's source line for a
# `# mc#NNNN` or `# internal#NNNN` comment. Inline-trailing
# comments on the directive line count.
# 3. For each tracker reference, GET the issue from the Gitea API.
# Validate: exists, `state == open`, `created_at` ≤ MAX_AGE_DAYS.
# 4. Aggregate ALL violations (not short-circuit) and exit 1 if any.
#
# Triggers
# --------
# Runs on PR events (paths-filter on `.gitea/workflows/**`) AND on
# a daily schedule. PR runs catch the violation at introduction time.
# Schedule runs catch the AGE-EXPIRY class: a tracker that was ≤14d
# old when the PR landed but is now 20d old, with the underlying
# defect still unfixed. Per `feedback_chained_defects_in_never_tested_workflows`,
# scheduled drift detection is the second half of the gate.
#
# Phase contract (RFC internal#219 §1 ladder)
# -------------------------------------------
# Lands at `continue-on-error: true` (Phase 3 — surface broken shapes
# without blocking). The pre-existing `continue-on-error: true`
# directives on `main` will all violate this lint at first
# (intentional — they're the masked defects this lint exists to
# surface). Each must be triaged: file a fresh tracker comment,
# close-and-flip, or document the deliberate keep-mask in a fresh
# 14-day-renewable tracker. After main is clean for 3 days,
# follow-up PR flips this workflow's continue-on-error to false.
# Tracking: mc#774.
#
# Cross-links
# -----------
# - mc#774 (the RFC that specs this lint)
# - mc#774 (the empirical masked-3-weeks case)
# - feedback_chained_defects_in_never_tested_workflows
# - feedback_behavior_based_ast_gates
# - feedback_strict_root_only_after_class_a
#
# Auth: DRIFT_BOT_TOKEN — same persona used by ci-required-drift.yml
# (provisioned under internal#329). Auto-injected GITHUB_TOKEN is
# insufficient because `internal#NNN` references cross repositories
# (molecule-core → molecule-ai/internal).
on:
pull_request:
types: [opened, synchronize, reopened]
paths:
- '.gitea/workflows/**'
- '.gitea/scripts/lint_continue_on_error_tracking.py'
- 'tests/test_lint_continue_on_error_tracking.py'
push:
branches: [main, staging]
paths:
- '.gitea/workflows/**'
- '.gitea/scripts/lint_continue_on_error_tracking.py'
schedule:
# Daily at 13:11 UTC — off-peak, prime-staggered from the other
# Tier-2 lint schedules (ci-required-drift runs hourly :00).
- cron: '11 13 * * *'
workflow_dispatch:
env:
GITHUB_SERVER_URL: https://git.moleculesai.app
permissions:
contents: read
concurrency:
group: lint-coe-tracking-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true
jobs:
# bp-exempt: meta-lint for masked jobs; tracked separately until masks are burned down.
lint:
name: lint-continue-on-error-tracking
runs-on: ubuntu-latest
timeout-minutes: 20
# Phase 3 (RFC #219 §1): surface masked defects without blocking
# PRs. Pre-existing continue-on-error: true directives on main
# all violate this lint at first — intentional. Flip to false
# follow-up after main is clean for 3 days. mc#774.
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true # mc#774 Phase 3 mask — 14d forced-renewal cadence
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
with:
python-version: '3.12'
- name: Install PyYAML
run: python -m pip install --quiet 'PyYAML==6.0.2'
- name: Run lint-continue-on-error-tracking
env:
GITEA_TOKEN: ${{ secrets.DRIFT_BOT_TOKEN }}
GITEA_HOST: git.moleculesai.app
REPO: ${{ github.repository }}
INTERNAL_REPO: molecule-ai/internal
WORKFLOWS_DIR: .gitea/workflows
MAX_AGE_DAYS: '14'
run: python3 .gitea/scripts/lint_continue_on_error_tracking.py
- name: Run lint-continue-on-error-tracking unit tests
run: |
python -m pip install --quiet pytest
python3 -m pytest tests/test_lint_continue_on_error_tracking.py -v
@@ -1,60 +0,0 @@
name: Lint curl status-code capture
# Ported from .github/workflows/lint-curl-status-capture.yml on 2026-05-11
# per RFC internal#219 §1 sweep.
#
# Differences from the GitHub version:
# - on.paths and the lint scanner target .gitea/workflows/**.yml (the
# active Gitea workflow directory) instead of .github/workflows/**.yml
# (which the rest of this sweep is emptying out).
# - Self-skip path updated to the .gitea/ version of this file.
# - Dropped `merge_group:` trigger.
# - Workflow-level env.GITHUB_SERVER_URL set per
# feedback_act_runner_github_server_url.
# - `continue-on-error: true` on the job (RFC §1 contract).
#
# Pins the workflow-bash anti-pattern that produced "HTTP 000000" on the
# 2026-05-04 redeploy-tenants-on-main run for sha 2b862f6:
#
# HTTP_CODE=$(curl ... -w '%{http_code}' ... || echo "000")
#
# When curl exits non-zero (connection reset -> 56, --fail-with-body 4xx/5xx
# -> 22), the `-w '%{http_code}'` already wrote a status to stdout — usually
# "000" for connection failures or the actual code for HTTP errors. The
# `|| echo "000"` then fires AND appends ANOTHER "000" to the captured
# stdout, producing values like "000000" or "409000" that fail string
# comparisons against "200" while looking superficially right.
#
# Same class of bug the synth-E2E §7c gate hit twice (PRs #2779/#2783 +
# #2797). Memory: feedback_curl_status_capture_pollution.md.
on:
pull_request:
paths:
- '.gitea/workflows/**'
- '.gitea/scripts/lint-curl-status-capture.py'
- 'tests/test_lint_curl_status_capture.py'
push:
branches: [main, staging]
paths:
- '.gitea/workflows/**'
- '.gitea/scripts/lint-curl-status-capture.py'
- 'tests/test_lint_curl_status_capture.py'
env:
GITHUB_SERVER_URL: https://git.moleculesai.app
jobs:
scan:
name: Scan workflows for curl status-capture pollution
runs-on: ubuntu-latest
# Phase 3 (RFC #219 §1): surface broken workflows without blocking
# the PR. Follow-up PR flips this off after surfaced defects are
# triaged.
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: Find curl ... -w '%{http_code}' ... || echo "000" subshells
run: |
python3 .gitea/scripts/lint-curl-status-capture.py
@@ -1,168 +0,0 @@
name: Lint forbidden tenant-env keys
# RFC#523 Layer 3 (task #146): scan workspace_secrets-writer Go code
# under workspace-server/ for new code that hardcodes a forbidden
# operator-scope env var NAME (GITEA_TOKEN, CP_ADMIN_API_TOKEN,
# RAILWAY_TOKEN, INFISICAL_OPERATOR_TOKEN, MOLECULE_OPERATOR_*, …).
#
# Catches the class "a new writer accidentally widens the propagation
# set" — e.g. a future env-mutator plugin that sets envVars["GITEA_TOKEN"]
# directly. Today the L1 runtime guard would abort the provision, but
# this lint surfaces the offending code at PR review time instead of
# at first provision attempt.
#
# Companion layers:
# - L1: workspace-server/internal/handlers/workspace_provision_forbidden_env.go
# (fail-closed abort at provision time)
# - L2: workspace/entrypoint.sh top-of-file env-grep + exit 1
#
# Open-source-template-friendly: the deny pattern is generic. A fork
# can copy this workflow and replace OPERATOR_KEY_PATTERN with its
# own operator-scope key names.
#
# Path-filter discipline:
# This workflow runs on every PR (no paths: filter — see
# feedback_path_filtered_workflow_cant_be_required). The scan itself
# targets workspace_secrets-writer paths via grep -r; it's fast
# (sub-second) so unconditional run is fine.
on:
pull_request:
types: [opened, synchronize, reopened]
push:
branches: [main, staging]
env:
GITHUB_SERVER_URL: https://git.moleculesai.app
jobs:
scan:
name: Scan workspace_secrets writers for forbidden env keys
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 1
- name: Scan for forbidden operator-scope env key NAMES in writer paths
run: |
set -euo pipefail
# Forbidden EXACT-MATCH env var names. Kept in lockstep with
# workspace-server/internal/handlers/workspace_provision_forbidden_env.go
# forbiddenTenantEnvKeys. The Go-side test
# TestIsForbiddenTenantEnvKey_ExactMatches is the source of
# truth — if Go-side adds a key, also add it here (and
# vice-versa). Drift between the two is the failure mode this
# entire 3-layer guardrail is designed to catch.
FORBIDDEN_KEYS=(
"GITEA_TOKEN" "GITEA_PAT"
"GITHUB_TOKEN" "GITHUB_PAT" "GH_TOKEN"
"GITLAB_TOKEN" "GL_TOKEN"
"BITBUCKET_TOKEN"
"CP_ADMIN_API_TOKEN" "CP_ADMIN_TOKEN"
"INFISICAL_OPERATOR_TOKEN" "INFISICAL_BOOTSTRAP_TOKEN"
"RAILWAY_TOKEN" "RAILWAY_PERSONAL_API_TOKEN"
"HETZNER_TOKEN" "HETZNER_API_TOKEN"
)
# Forbidden PREFIX patterns — operator-scope families.
FORBIDDEN_PREFIXES=(
"MOLECULE_OPERATOR_"
)
# Writer paths: Go source under workspace-server/ that
# writes to the env-vars map or to workspace_secrets DB rows.
# Tests, the forbidden-env source itself, and the silent-
# strip denylist are exempt (they LIST the keys by design).
SCAN_ROOT="workspace-server/internal"
# Exempt paths fall in two classes:
# 1. The deny-set definitions + the silent-strip denylist:
# they LIST the forbidden names by design.
# 2. Pre-RFC#523 persona-merge / config-read paths that
# already handle these names correctly (the silent-
# strip downstream + the new L1 fail-closed cover the
# runtime risk; these reads are unchanged).
# New code MUST NOT be added to this list without reviewer
# signoff and a one-line justification in this diff.
EXEMPT_PATHS=(
# Class 1 — deny-set definitions
"workspace-server/internal/handlers/workspace_provision_forbidden_env.go"
"workspace-server/internal/handlers/workspace_provision_forbidden_env_test.go"
"workspace-server/internal/provisioner/provisioner.go"
"workspace-server/internal/provisioner/provisioner_test.go"
# Class 2 — pre-existing persona-fallback / org-helper paths
# that set the GITEA_TOKEN fallback lane (stripped downstream
# by provisioner.buildContainerEnv per forensic #145). The
# new L1 fail-closed runs BEFORE these writers, so any
# operator-scope leak via global/workspace_secrets is
# already caught. See applyAgentGitHTTPCreds doc-comment.
"workspace-server/internal/handlers/agent_git_identity.go"
"workspace-server/internal/handlers/org_helpers.go"
"workspace-server/internal/handlers/org.go"
# Class 2 — CP→platform admin auth (NOT a tenant env write;
# this is the control-plane HTTP auth header source).
"workspace-server/internal/provisioner/cp_provisioner.go"
)
# Build a single grep -F pattern: every forbidden key wrapped
# in quotes (Go string-literal form, which is how env-map
# writes appear). e.g. envVars["GITEA_TOKEN"] = ... or
# `"GITEA_TOKEN":` in a literal-map declaration.
#
# We deliberately match the quoted form so a comment that
# happens to spell the name without quotes (e.g. "see
# GITEA_TOKEN below") doesn't trip the lint.
PATTERN=""
for k in "${FORBIDDEN_KEYS[@]}"; do
PATTERN="${PATTERN}\"${k}\"\n"
done
for p in "${FORBIDDEN_PREFIXES[@]}"; do
# Prefix match needs a regex; switch to grep -E below for
# this slice. Kept conceptually here so the deny set lives
# in one place; scan is run twice (literal + prefix).
true
done
# Build exempt-paths grep filter — `grep -v -f` style.
EXEMPT_FILTER=$(mktemp)
trap 'rm -f "$EXEMPT_FILTER"' EXIT
for p in "${EXEMPT_PATHS[@]}"; do
echo "$p" >> "$EXEMPT_FILTER"
done
# --- Exact-match scan ---
HITS=""
for k in "${FORBIDDEN_KEYS[@]}"; do
# Only .go files; skip _test.go for the writer-path scan
# since tests legitimately reference the names. The
# writer-path lint targets PRODUCTION code only.
found=$(grep -rn --include='*.go' --exclude='*_test.go' "\"${k}\"" "$SCAN_ROOT" 2>/dev/null \
| grep -v -F -f "$EXEMPT_FILTER" || true)
if [ -n "$found" ]; then
HITS="${HITS}${found}\n"
fi
done
# --- Prefix scan ---
for prefix in "${FORBIDDEN_PREFIXES[@]}"; do
found=$(grep -rnE --include='*.go' --exclude='*_test.go' "\"${prefix}[A-Z0-9_]+\"" "$SCAN_ROOT" 2>/dev/null \
| grep -v -F -f "$EXEMPT_FILTER" || true)
if [ -n "$found" ]; then
HITS="${HITS}${found}\n"
fi
done
if [ -n "$HITS" ]; then
echo "::error::RFC#523 Layer 3: forbidden operator-scope env var name(s) hardcoded in tenant-workspace writer paths:"
printf "$HITS"
echo ""
echo "These env-var NAMES are on the operator-scope deny list (see"
echo "workspace-server/internal/handlers/workspace_provision_forbidden_env.go)."
echo "If your code legitimately needs to inject one of these for a"
echo "non-tenant code path, add the file to EXEMPT_PATHS in this"
echo "workflow with a one-line justification — reviewer signoff required."
exit 1
fi
echo "OK No forbidden operator-scope env key names hardcoded in writer paths."
-134
View File
@@ -1,134 +0,0 @@
name: lint-mask-pr-atomicity
# Tier 2d hard-gate lint (per mc#774) — blocks PRs that touch
# `.gitea/workflows/ci.yml` and modify ONLY ONE of {continue-on-error,
# all-required.sentinel.needs} without a `Paired: #NNN` reference in
# the PR body or in a commit message.
#
# Why this exists
# ---------------
# PR#665 (interim `continue-on-error: true` on `platform-build`) and
# PR#668 (sentinel-`needs` demotion of the same job) were designed as a
# pair but merged solo — #665 landed at 04:47Z 2026-05-12, #668 was
# still open at 05:07Z when the main-red watchdog (#674) fired. Result:
# ~20 minutes of `main` red and a cascade of false-positives on
# unrelated PRs. This lint structurally prevents that class.
#
# How the gate works
# ------------------
# 1. The workflow runs on every PR whose diff touches ci.yml (paths
# filter). It is NOT a required check on `main` because the rule is
# diff-based — running it on PRs that don't touch ci.yml would
# produce a `pending` status forever (per
# `feedback_path_filtered_workflow_cant_be_required`).
# 2. The script reads `BASE_SHA:ci.yml` and `HEAD_SHA:ci.yml`, parses
# both via PyYAML AST (per `feedback_behavior_based_ast_gates` — no
# grep, no regex on the raw text — so a YAML-shape refactor still
# detects).
# 3. Walks `jobs.*.continue-on-error` on each side; flags any value
# diff. Reads `jobs.all-required.needs` on each side; flags any
# set diff (order-insensitive — `needs:` is engine-unordered).
# 4. If both predicates fired → atomic, OK. If neither → no risk, OK.
# If exactly one fired → require `Paired: #NNN` in PR body OR in
# any commit message between base..head; else fail.
#
# Phase contract (RFC internal#219 §1 ladder)
# -------------------------------------------
# This workflow lands at `continue-on-error: true` (Phase 3 — surface
# regressions without blocking PRs while the rule beds in).
# Follow-up PR flips to `false` once we have ≥3 days of clean runs on
# `main` and no false-positives. Tracking issue: mc#774.
#
# Cross-links
# -----------
# - mc#774 (the RFC that specs this lint)
# - PR#665 / PR#668 (the empirical split-pair)
# - mc#774 (the main-red incident the split caused)
# - feedback_strict_root_only_after_class_a
# - feedback_behavior_based_ast_gates
#
# Auth: only needs the auto-injected GITHUB_TOKEN (read-only, repo
# scope). No DRIFT_BOT_TOKEN needed — Tier 2d does NOT call
# branch_protections (Tier 2g/f do).
on:
pull_request:
types: [opened, synchronize, reopened, edited]
# `edited` is included because the rule depends on PR_BODY: a user
# may add `Paired: #NNN` after first push to satisfy the lint. The
# rerun on `edited` lets the PR turn green without an empty
# commit. Gitea 1.22.6 fires `edited` on body changes — verified
# via gitea-source/models/issues/pull_list.go::triggerNewPRWebhook.
paths:
- '.gitea/workflows/ci.yml'
- '.gitea/scripts/lint_mask_pr_atomicity.py'
- '.gitea/workflows/lint-mask-pr-atomicity.yml'
- 'tests/test_lint_mask_pr_atomicity.py'
env:
# Belt-and-suspenders against the runner-default trap
# (feedback_act_runner_github_server_url). Runners are configured
# with this env via /opt/molecule/runners/config.yaml, but pinning
# at the workflow level protects against a runner regenerated
# without the config file.
GITHUB_SERVER_URL: https://git.moleculesai.app
permissions:
contents: read
pull-requests: read
# Per-PR concurrency — re-pushes cancel previous runs to keep the
# queue short. The lint is cheap (one git show + log + a YAML parse).
concurrency:
group: lint-mask-pr-atomicity-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true
jobs:
# bp-exempt: meta-lint advisory during mask burn-down; CI / all-required gates merges.
scan:
name: lint-mask-pr-atomicity
runs-on: ubuntu-latest
timeout-minutes: 5
# Phase 3 (RFC #219 §1): surface broken shapes without blocking
# PRs. Follow-up PR flips this to `false` once recent runs on main
# are confirmed clean (eat-our-own-dogfood discipline mirrors
# PR#673's same-shape comment). Tracking: mc#774.
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
steps:
- name: Check out PR head with full history (need base SHA blobs)
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
# `git show <base-sha>:<path>` needs the base SHA's blobs.
# Shallow=1 would miss it. Same rationale as PR#673 and
# check-migration-collisions.yml.
fetch-depth: 0
- name: Set up Python (PyYAML for AST parsing)
uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
with:
python-version: '3.12'
- name: Install PyYAML
# Same pin as ci-required-drift.yml + the rest of the Tier 2
# lint family — keep runner-cache hits uniform.
run: python -m pip install --quiet 'PyYAML==6.0.2'
- name: Ensure base ref is reachable locally
# fetch-depth=0 usually pulls the base too, but explicit-fetch
# is cheap insurance against runner-version drift (matches the
# comment in check-migration-collisions.yml and PR#673).
run: |
git fetch origin "${{ github.event.pull_request.base.ref }}" || true
- name: Run lint-mask-pr-atomicity
env:
BASE_SHA: ${{ github.event.pull_request.base.sha }}
HEAD_SHA: ${{ github.event.pull_request.head.sha }}
# PR body — the script greps for `Paired: #NNN`.
PR_BODY: ${{ github.event.pull_request.body }}
CI_WORKFLOW_PATH: .gitea/workflows/ci.yml
SENTINEL_JOB_KEY: all-required
run: python3 .gitea/scripts/lint_mask_pr_atomicity.py
- name: Run lint-mask-pr-atomicity unit tests
# Run the test suite in-CI so the lint's own behaviour is
# verified on every change. Matches lint-workflow-yaml.yml.
run: |
python -m pip install --quiet pytest
python3 -m pytest tests/test_lint_mask_pr_atomicity.py -v
@@ -1,182 +0,0 @@
name: Lint no tenant GITEA or GITHUB token write
# Task #146 — CI guardrail companion to RFC#523's `lint-forbidden-env-keys.yml`.
#
# `lint-forbidden-env-keys.yml` (Layer 3) catches code that hardcodes a
# forbidden env-var key NAME as a quoted literal in workspace_secrets
# writer paths under workspace-server/internal/.
#
# This workflow catches a BROADER class: any code path that reads a
# repo-host token (GITEA_TOKEN / GITHUB_TOKEN / GH_TOKEN) and then writes
# it into a TENANT WORKSPACE's env, secret store, user-data, or
# provision payload. This is the actual RFC#523 threat-model statement —
# the goal is "no tenant workspace ever receives an operator-scope repo
# token," not just "no _quoted_ literal `GITEA_TOKEN`." A future writer
# could route the value via a variable, a struct field, or a config key
# and slip past the existing literal scan; this lint catches those
# routing patterns at PR review time.
#
# Scope
# Scans the WHOLE repo's Go sources (not just workspace-server/) for
# co-occurrences of:
# - a repo-host token NAME (GITEA_TOKEN / GITHUB_TOKEN / GH_TOKEN /
# GITEA_PAT / GITHUB_PAT) used as os.Getenv argument or string
# literal
# - within a file that ALSO references a tenant-writer surface
# (`tenant`, `workspace_secrets`, `global_secrets`, `seedAllowList`,
# `/settings/secrets`, `userData`, `provisionPayload`,
# `envVars[`, `containerEnv`).
#
# Co-occurrence (not single-line) is the false-positive control: a
# file that just LOGS the variable name (e.g. "missing GITEA_TOKEN")
# without touching any tenant surface won't fire.
#
# Drift contract with lint-forbidden-env-keys.yml
# Both lints share the same FORBIDDEN_KEYS list (a subset — only the
# repo-host tokens, since this lint's threat model is "tenant gets
# write access to operator's git host"). If RFC#523's deny set grows,
# update BOTH this file AND lint-forbidden-env-keys.yml AND the Go
# source-of-truth in
# workspace-server/internal/handlers/workspace_provision_forbidden_env.go.
#
# Open-source-template-friendly
# The patterns scanned are generic (no MOLECULE_-prefix literals).
# A fork can copy this workflow as-is and adjust FORBIDDEN_KEYS.
#
# Path-filter discipline
# No `paths:` filter — required-status workflows must run on every PR
# per `feedback_path_filtered_workflow_cant_be_required`. Scan is
# sub-second.
on:
pull_request:
types: [opened, synchronize, reopened]
push:
branches: [main, staging]
env:
GITHUB_SERVER_URL: https://git.moleculesai.app
jobs:
# bp-exempt: advisory RFC#523 lint; PR review gate is review-driven, not BP-driven.
# (Carried with the workflow-name rename in PR mc#1593 so the renamed
# context emission satisfies lint_required_context_exists_in_bp Tier 2g.)
scan:
name: Scan for repo-host token write into tenant workspace surface
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 1
- name: Find Go files referencing a tenant-writer surface AND a repo-host token
run: |
set -euo pipefail
# Repo-host token NAMES — the threat-model subset. Operator-fleet
# tokens (CP_ADMIN_API_TOKEN, RAILWAY_TOKEN, INFISICAL_*) are
# caught by lint-forbidden-env-keys.yml's broader deny set; this
# lint focuses on the git-host class so a single co-occurrence
# match has a low false-positive rate.
FORBIDDEN_KEYS=(
"GITEA_TOKEN"
"GITEA_PAT"
"GITHUB_TOKEN"
"GITHUB_PAT"
"GH_TOKEN"
)
# Tenant-writer surface markers. A file matches the surface set
# if it references ANY of these strings. This is the "is this
# code path writing into a tenant workspace?" heuristic.
# Curated to catch the actual code shapes used in this repo
# (verified by grep against current main 2026-05-19):
# - "workspace_secrets" / "global_secrets" → DB table writes
# - "seedAllowList" → CP-side seed table
# - "/settings/secrets" → tenant HTTP API write
# - "envVars[" → in-memory env map write
# - "containerEnv" → docker-run env-set
# - "userData" → EC2 user-data script
# - "provisionPayload" / "provisionContext" → provision-request shape
SURFACE_PATTERN='workspace_secrets|global_secrets|seedAllowList|/settings/secrets|envVars\[|containerEnv|userData|provisionPayload|provisionContext'
# Files that legitimately reference these names AND a surface
# marker, but do so for guard / strip / test / doc-comment
# reasons. New entries require reviewer signoff and a one-line
# justification in the diff.
EXEMPT_FILES=(
# RFC#523 L1 deny-set source-of-truth + tests
"workspace-server/internal/handlers/workspace_provision_forbidden_env.go"
"workspace-server/internal/handlers/workspace_provision_forbidden_env_test.go"
# Forensic-#145 silent-strip denylist (defense-in-depth, by design lists the names)
"workspace-server/internal/provisioner/provisioner.go"
"workspace-server/internal/provisioner/provisioner_test.go"
# Pre-RFC#523 persona-fallback / org-helper paths. The L1
# fail-closed runs BEFORE these writers; downstream silent-strip
# also covers them. See applyAgentGitHTTPCreds doc-comment.
"workspace-server/internal/handlers/agent_git_identity.go"
"workspace-server/internal/handlers/org_helpers.go"
"workspace-server/internal/handlers/org.go"
# CP→platform admin auth (NOT a tenant env write).
"workspace-server/internal/provisioner/cp_provisioner.go"
)
# Build an extended-regex alternation of forbidden keys.
KEY_ALT="$(IFS='|'; echo "${FORBIDDEN_KEYS[*]}")"
# Find candidate files: Go non-test sources that contain a
# tenant-writer surface marker.
mapfile -t CANDIDATES < <(
grep -rlE --include='*.go' --exclude='*_test.go' \
"${SURFACE_PATTERN}" . 2>/dev/null \
| sed 's|^\./||' \
| sort -u
)
if [ "${#CANDIDATES[@]}" -eq 0 ]; then
echo "OK No tenant-writer-surface files found in tree (unexpected, but not a lint failure)."
exit 0
fi
HITS=""
for f in "${CANDIDATES[@]}"; do
# Skip exempt files.
skip=0
for ex in "${EXEMPT_FILES[@]}"; do
if [ "$f" = "$ex" ]; then skip=1; break; fi
done
[ "$skip" = "1" ] && continue
# File contains a surface marker; now grep for a forbidden
# key NAME. We require a QUOTED-literal match to avoid
# firing on a comment like "// also handle GITEA_TOKEN".
#
# The literal form catches:
# - os.Getenv("GITEA_TOKEN")
# - envVars["GITEA_TOKEN"] = ...
# - {envKey: "GITEA_TOKEN", tenantKey: "GITEA_TOKEN"}
# but not:
# - // see GITEA_TOKEN below (no quotes)
found=$(grep -nE "\"(${KEY_ALT})\"" "$f" 2>/dev/null || true)
if [ -n "$found" ]; then
HITS="${HITS}--- ${f} ---\n${found}\n"
fi
done
if [ -n "$HITS" ]; then
echo "::error::Task #146 lint: repo-host token name(s) quoted in a tenant-writer-surface file:"
printf "$HITS"
echo ""
echo "These files reference a tenant-writer surface (workspace_secrets,"
echo "seedAllowList, /settings/secrets, containerEnv, userData, etc.)"
echo "AND quote a repo-host token name (GITEA_TOKEN/GITHUB_TOKEN/…)."
echo "Per RFC#523 threat model, tenant workspaces MUST NOT receive"
echo "operator-scope repo-host tokens. If your code legitimately needs"
echo "to reference one of these names in a tenant-writer file (e.g."
echo "a deny-set definition or silent-strip list), add the file to"
echo "EXEMPT_FILES with a one-line justification — reviewer signoff"
echo "required."
exit 1
fi
echo "OK No tenant-writer-surface file co-mentions a repo-host token literal."
@@ -1,141 +0,0 @@
name: Lint pre-flip continue-on-error
# Pre-merge gate: blocks PRs that flip `continue-on-error: true → false`
# on any job in `.gitea/workflows/*.yml` WITHOUT proof that the affected
# job's recent runs on the target branch (PR base) are actually green.
#
# Empirical class: PR #656 / mc#774. PR #656 (RFC internal#219 Phase 4)
# flipped 5 platform-build-class jobs `continue-on-error: true → false`
# on the basis of a "verified green on main via combined-status check".
# But that "green" was the LIE the prior `continue-on-error: true`
# produced: Gitea Quirk #10 (internal#342 + dup #287) — a failed step
# inside a `continue-on-error: true` job rolls up to a `success`
# job-level status. The precondition the PR claimed to verify was
# structurally fooled by the bug being flipped.
#
# mc#774 captured the surfaced defects (2 mutually-masked regressions):
# - Class 1: sqlmock helper drift since 2f36bb9a (24 days old)
# - Class 2: OFFSEC-001 contract collision since 7d1a189f (1 day old)
#
# Codified 04:35Z as hongming-pc2 charter §SOP-N rule (e)
# "run-log-grep-before-flip" — now structurally enforced here at PR
# time, ahead of merge.
#
# How the gate works:
# 1. Read every `.gitea/workflows/*.yml` at the PR base SHA AND at
# the PR head SHA via `git show <sha>:<path>` (no checkout
# needed).
# 2. Parse both sides via PyYAML AST (NOT grep — per
# `feedback_behavior_based_ast_gates`). Walk `jobs.<key>.
# continue-on-error` on each side. A flip is base=true,
# head=false.
# 3. For each flipped job, render the commit-status context as
# `"{workflow.name} / {job.name or job.key} (push)"` — that's
# how Gitea Actions emits the per-context status on `main`/
# `staging` runs.
# 4. Pull last 5 commits on the PR base branch, fetch combined
# commit-status per commit, scan for the target context. For
# each match, fetch the run log via the web-UI route
# `{server_url}/{repo}/actions/runs/{run_id}/jobs/{job_idx}/logs`
# (per `reference_gitea_actions_log_fetch` —
# Gitea 1.22.6 lacks REST `/actions/runs/*`; web-UI is the
# only working path, see also
# `reference_gitea_1_22_6_lacks_rest_rerun_endpoints`).
# 5. Grep each log for `--- FAIL`, `FAIL\s`, `::error::`. If
# the status is `success` but the log shows any of these,
# the job was masked. Block the PR with `::error::`.
#
# Graceful-degrade contract (per task halt-conditions):
# - Log fetch 404 (act_runner pruned the log, transient outage):
# emit `::warning::` "log unavailable" — does NOT block.
# - Zero recent runs of the flipped job's context on the base
# branch (newly added workflow): emit `::warning::` "no run
# history to verify" — allow the flip. Chicken-and-egg
# exemption.
# - YAML parse error in one of the workflow files: warn-only,
# don't block — the YAML lint workflows catch this separately.
#
# Cross-links: PR#656, mc#774, PR#665 (interim re-mask),
# Quirk #10 (internal#342 + dup #287), hongming-pc2 charter
# §SOP-N rule (e), feedback_strict_root_only_after_class_a,
# feedback_no_shared_persona_token_use.
#
# Phase contract (RFC internal#219 §1 ladder):
# - This workflow lands at `continue-on-error: true` (Phase 3 —
# surface defects without blocking). Follow-up PR flips it to
# `false` ONLY after this workflow's own recent runs on `main`
# are confirmed clean — exactly the discipline the workflow
# itself enforces. Eat your own dogfood.
on:
pull_request:
types: [opened, synchronize, reopened]
paths:
- '.gitea/workflows/**'
- '.gitea/scripts/lint_pre_flip_continue_on_error.py'
- '.gitea/workflows/lint-pre-flip-continue-on-error.yml'
env:
# Per `feedback_act_runner_github_server_url` — without this,
# actions/checkout and friends default to github.com → break.
GITHUB_SERVER_URL: https://git.moleculesai.app
permissions:
contents: read
# Need read on the API to pull combined commit-status + commit list
# for the base branch. The job-log fetch uses the same token via
# the web-UI route (Gitea 1.22.6 accepts `Authorization: token ...`
# there).
pull-requests: read
concurrency:
group: lint-pre-flip-coe-${{ github.event.pull_request.head.sha || github.sha }}
cancel-in-progress: true
jobs:
scan:
name: Verify continue-on-error flips have run-log proof
runs-on: ubuntu-latest
timeout-minutes: 8
# Phase 3 (RFC internal#219 §1): surface broken flips without blocking
# the PR yet. Follow-up flips this to `false` once the workflow itself
# has clean recent runs on main. mc#774 interim — remove when CoE→false.
continue-on-error: true # mc#774
steps:
- name: Check out PR head (full history for base-SHA access)
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
# `git show <base-sha>:<path>` needs the base SHA's blobs.
# Shallow=1 would miss it. Same rationale as
# check-migration-collisions.yml.
fetch-depth: 0
- name: Set up Python (PyYAML for AST parsing)
uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
with:
python-version: '3.12'
- name: Install PyYAML
# Same pin as ci-required-drift.yml — keep dependencies
# uniform so a Gitea runner cache hits across both jobs.
run: python -m pip install --quiet 'PyYAML==6.0.2'
- name: Ensure base ref is reachable locally
# `actions/checkout@v6 fetch-depth=0` usually pulls the base
# too, but explicit-fetch is cheap insurance against the
# form-of-ref differences across Gitea runner versions
# (mirrors the comment in check-migration-collisions.yml).
run: |
git fetch origin "${{ github.event.pull_request.base.ref }}" || true
- name: Run lint
env:
# Auto-injected by Gitea Actions; sufficient scope for
# combined-status + commit-list + log fetch via web-UI
# route. NO repo-admin needed (unlike the
# branch_protections endpoint).
GITEA_TOKEN: ${{ secrets.GITHUB_TOKEN }}
GITEA_HOST: git.moleculesai.app
REPO: ${{ github.repository }}
BASE_REF: ${{ github.event.pull_request.base.ref }}
BASE_SHA: ${{ github.event.pull_request.base.sha }}
HEAD_SHA: ${{ github.event.pull_request.head.sha }}
# Last 5 commits on the base branch is the spec default.
RECENT_COMMITS_N: '5'
run: python3 .gitea/scripts/lint_pre_flip_continue_on_error.py
@@ -1,118 +0,0 @@
name: lint-required-context-exists-in-bp
# Tier 2g hard-gate lint (per mc#774) — diff-based PR-time
# check. When a PR adds a NEW commit-status emission (workflow YAML
# `name:` + job `name:`-or-key + on:-event), the workflow file must
# carry one of three directives adjacent to the new job:
#
# - `# bp-required: yes` — and BP must list the context
# - `# bp-required: pending #NNN` — acknowledged asymmetry + tracker
# - `# bp-exempt: <reason>` — informational job, not a gate
#
# Default (no directive on a new emitter) = FAIL.
#
# Why this exists
# ---------------
# PR#656 added `CI / all-required (pull_request)` as a sentinel
# context that workflows emit, but BP did NOT list it. When
# platform-build failed, all-required failed, but BP let the PR
# merge anyway → cascade to mc#774. With this lint, PR#656 would
# have been blocked until either the BP PATCH ran alongside OR
# the author added a `bp-required: pending` directive.
#
# Tier 2g vs Tier 2f
# ------------------
# Tier 2g runs at PR-time (diff-based) and BLOCKS the merge.
# Tier 2f runs daily (scheduled) and FILES a drift issue. They
# share the workflow-context enumeration helpers
# (`_event_map`, `workflow_contexts`, `_job_display`) but the
# semantics are intentionally distinct so they're separate scripts.
# Co-design is documented in mc#774.
#
# Directive comment lives in the workflow file (NOT PR body)
# ----------------------------------------------------------
# A PR-body claim of "BP exempt" evaporates on merge — the
# asymmetry returns to undetected state and Tier 2f's daily
# scheduled audit can't see it. The directive must live with the
# emitter so both PR-time (Tier 2g) and post-merge (Tier 2f)
# readers consume the same source.
#
# Phase contract (RFC internal#219 §1 ladder)
# -------------------------------------------
# Lands at `continue-on-error: true` (Phase 3 — surface the
# pattern without blocking PRs while the directive convention
# beds in). After 7 days of clean runs on `main` with no false
# positives, follow-up flips to `false`. Tracking: mc#774.
#
# Cross-links
# -----------
# - mc#774 (the RFC that specs this lint)
# - PR#656 (the empirical case)
# - mc#774 (the surfaced cascade)
# - feedback_phantom_required_check_after_gitea_migration (Tier 2f cousin)
# - feedback_behavior_based_ast_gates
#
# Auth: DRIFT_BOT_TOKEN (repo-admin for branch_protections read).
on:
pull_request:
types: [opened, synchronize, reopened]
paths:
- '.gitea/workflows/**'
- '.gitea/scripts/lint_required_context_exists_in_bp.py'
- '.gitea/workflows/lint-required-context-exists-in-bp.yml'
- 'tests/test_lint_required_context_exists_in_bp.py'
env:
GITHUB_SERVER_URL: https://git.moleculesai.app
permissions:
contents: read
concurrency:
group: lint-required-context-exists-in-bp-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true
jobs:
# bp-exempt: this lint is a PR-time advisory and is not intended to
# be a required gate on main. The directive eat-our-own-dogfood
# confirms the convention works on the lint that defines it.
lint:
name: lint-required-context-exists-in-bp
runs-on: ubuntu-latest
timeout-minutes: 5
# Phase 3 (RFC #219 §1): surface the pattern without blocking PRs
# while the directive convention beds in. Follow-up flip to false
# after 7 clean days on main. mc#774.
continue-on-error: true # mc#774 Phase 3 — flip to false after 7 clean main runs
steps:
- name: Check out PR head with full history (need base SHA blobs)
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
# `git show <base-sha>:<path>` needs the base SHA's blobs.
# Same rationale as PR#673 and check-migration-collisions.yml.
fetch-depth: 0
- uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
with:
python-version: '3.12'
- name: Install PyYAML
run: python -m pip install --quiet 'PyYAML==6.0.2'
- name: Ensure base ref is reachable locally
# Cheap insurance against runner-version drift.
run: |
git fetch origin "${{ github.event.pull_request.base.ref }}" || true
- name: Run lint-required-context-exists-in-bp
env:
# DRIFT_BOT_TOKEN — repo-admin (needed for branch_protections).
GITEA_TOKEN: ${{ secrets.DRIFT_BOT_TOKEN }}
GITEA_HOST: git.moleculesai.app
REPO: ${{ github.repository }}
BRANCH: main
BASE_SHA: ${{ github.event.pull_request.base.sha }}
HEAD_SHA: ${{ github.event.pull_request.head.sha }}
WORKFLOWS_DIR: .gitea/workflows
run: python3 .gitea/scripts/lint_required_context_exists_in_bp.py
- name: Run lint-required-context-exists-in-bp unit tests
run: |
python -m pip install --quiet pytest
python3 -m pytest tests/test_lint_required_context_exists_in_bp.py -v
@@ -1,97 +0,0 @@
# lint-required-no-paths — structural enforcement of
# `feedback_path_filtered_workflow_cant_be_required`.
#
# Fails the PR if ANY workflow whose status-check context appears in
# `branch_protections/main.status_check_contexts` carries a
# `paths:` or `paths-ignore:` filter in its `on:` block.
#
# Why this exists:
# A required-check workflow with a paths filter silently degrades the
# merge gate. If a PR's diff doesn't touch the filter, the workflow
# never fires; Gitea (1.22.6) reports the required context as
# `pending` (NOT `skipped == success`), so the PR cannot merge. For a
# docs-only PR against `paths: ['**.go']`, the PR is wedged forever.
#
# Previously prevented only by reviewer vigilance + the saved memory
# `feedback_path_filtered_workflow_cant_be_required`. This workflow
# makes it a hard CI gate.
#
# Forward-compat scope:
# Today (2026-05-11) molecule-core/main protects 3 contexts:
# - "Secret scan / Scan diff for credential-shaped strings (pull_request)"
# - "sop-tier-check / tier-check (pull_request)"
# - "CI / all-required (pull_request)"
# Per RFC#324 Step 2 the required-list expands to ~5 contexts
# (qa-review, security-review added). Each new required context's
# workflow must remain unconditional. This lint pins that contract.
#
# Meta-required-check:
# This workflow ITSELF deliberately has NO `paths:` filter on its `on:`
# block — otherwise a paths-non-matching PR could bypass the check.
# Self-evident from this file: only `pull_request` types + no paths.
#
# Auth:
# `GET /repos/.../branch_protections/{branch}` requires repo-admin
# role in Gitea 1.22.6. The workflow-default `GITHUB_TOKEN` is
# non-admin (read-only), so we re-use `DRIFT_BOT_TOKEN` (same persona
# that powers `ci-required-drift.yml` — verified working there).
# If `DRIFT_BOT_TOKEN` becomes unavailable, the script exits 0 with a
# loud `::error::` rather than red-X every PR — token-scope issues
# should be fixed at the token, not surfaced as a gate failure on
# every unrelated PR.
#
# Behavior-based gate per `feedback_behavior_based_ast_gates`:
# YAML AST walk (PyYAML), NOT grep. Workflow renames, formatting
# changes (block-scalar vs flow-style), or moving `paths:` between
# `pull_request:` and `pull_request_target:` all still detect.
#
# IMPORTANT — Gitea 1.22.6 parser quirk per
# `feedback_gitea_workflow_dispatch_inputs_unsupported`: do NOT add an
# `inputs:` block to `workflow_dispatch:` — Gitea 1.22.6 rejects the
# entire workflow as "unknown on type" and it registers for ZERO events.
name: lint-required-no-paths
on:
pull_request:
types: [opened, synchronize, reopened]
workflow_dispatch:
# Read protection + read local YAML. No writes.
permissions:
contents: read
# Only one in-flight run per PR — re-pushes cancel the previous run to
# keep the queue short. Required-list reads are cheap (one GET); the
# cancellation is just hygiene.
concurrency:
group: lint-required-no-paths-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true
jobs:
# bp-exempt: meta-lint advisory; CI / all-required is the required aggregate.
lint:
name: lint-required-no-paths
runs-on: ubuntu-latest
timeout-minutes: 5
steps:
- name: Check out repo (we read the workflow YAML files locally)
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: Set up Python (PyYAML for AST parsing)
uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
with:
python-version: '3.12'
- name: Install PyYAML
run: python -m pip install --quiet 'PyYAML==6.0.2'
- name: Run lint-required-no-paths
env:
# DRIFT_BOT_TOKEN is owned by mc-drift-bot, a least-privilege
# Gitea persona with repo-admin role for branch_protections
# read. Same secret used by ci-required-drift.yml — see that
# workflow's header for provisioning trail (internal#329).
GITEA_TOKEN: ${{ secrets.DRIFT_BOT_TOKEN }}
GITEA_HOST: git.moleculesai.app
REPO: ${{ github.repository }}
BRANCH: main
WORKFLOWS_DIR: .gitea/workflows
run: python3 .gitea/scripts/lint-required-no-paths.py
@@ -1,164 +0,0 @@
name: lint-required-workflows-docker-host-pinned
# Fail-closed lint that catches workflows touching docker.sock without
# pinning `runs-on:` to a Linux-only label.
#
# Class defect (internal#512 + mc#1529 + today's oc#81/82/83 + autogen#8):
# the `ubuntu-latest` label is advertised by BOTH the Linux operator-host
# runners (molecule-runner-*) AND the Windows act_runner v1.0.3 on
# hongming-pc-runner-*. Job placement is non-deterministic. When a docker-
# bound job lands on a Windows runner, `docker run`/`docker login`/
# `docker compose` fail with platform-specific errors ("protocol not
# available", "cannot exec", etc.) — placement-dependent, not transient.
#
# This lint enforces the convention: any workflow whose YAML body
# contains a docker exec (`docker run|build|buildx|compose|pull|push|
# exec|tag|login|cp|inspect|ps` OR `docker/build-push-action|docker/
# login-action|docker/setup-buildx`) MUST pin every job's `runs-on:` to
# one of:
# - docker-host (general docker.sock work — molecule-runner-*)
# - publish (image build/push — molecule-runner-publish-*)
#
# Comments and heredoc/markdown bodies that merely MENTION docker are
# excluded by the detection rule (see scan.py below).
#
# Per `feedback_never_skip_ci`: this is fail-closed (exit 1 on miss).
on:
pull_request:
paths:
- '.gitea/workflows/**'
push:
branches: [main, staging]
paths:
- '.gitea/workflows/**'
permissions:
contents: read
env:
GITHUB_SERVER_URL: https://git.moleculesai.app
jobs:
lint-docker-host-pin:
name: Lint docker-host pin on docker-touching workflows
runs-on: docker-host
timeout-minutes: 5
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: Scan workflows for docker-bound jobs missing docker-host/publish pin
run: |
set -euo pipefail
python3 - <<'PY'
import os
import re
import sys
# Docker-step detection: real exec, not just word-mention in comments.
# We strip comment-only lines, then look for the docker subcommand
# tokens at word-boundary, OR uses: docker/* actions.
DOCKER_EXEC = re.compile(
r'(?<!\w)docker\s+(run|build|buildx|compose|pull|push|exec|tag|login|cp|inspect|ps)\b'
)
DOCKER_ACTION = re.compile(
r'uses:\s*docker/(build-push-action|login-action|setup-buildx-action|setup-qemu-action)'
)
# Detect a job header line like ` myjob:` (2-space indent) AND its runs-on.
JOB_HEADER = re.compile(r'^( {2})([a-zA-Z0-9_-]+):\s*$')
RUNS_ON = re.compile(r'^( {4})runs-on:\s*(.+?)\s*$')
ALLOWED_LABELS = {'docker-host', 'publish'}
fails = []
warnings = []
# Gitea is SSOT for molecule-core CI per task #347 / memory
# reference_molecule_core_actions_gitea_only. The legacy
# .github/workflows/ tree was deleted in SSOT-Instance-4 (#331).
roots = []
for root in ('.gitea/workflows',):
if os.path.isdir(root):
roots.append(root)
for root in roots:
for fn in sorted(os.listdir(root)):
if not (fn.endswith('.yml') or fn.endswith('.yaml')):
continue
path = os.path.join(root, fn)
with open(path) as f:
raw_lines = f.readlines()
# Parse job headers + their runs-on. Simple line scan; relies on
# 2-space job indent + 4-space runs-on indent under `jobs:`.
jobs = []
current = None
in_jobs = False
for i, line in enumerate(raw_lines, 1):
if re.match(r'^jobs:\s*$', line):
in_jobs = True
continue
if not in_jobs:
continue
mh = JOB_HEADER.match(line)
if mh:
if current:
current['end'] = i - 1
jobs.append(current)
current = {'name': mh.group(2), 'line': i, 'end': len(raw_lines), 'runs_on': None}
continue
mr = RUNS_ON.match(line)
if mr and current and current['runs_on'] is None:
current['runs_on'] = mr.group(2).strip()
if current:
jobs.append(current)
for j in jobs:
# Strip pure-comment lines for docker-exec detection so
# documentation comments don't trigger the lint. Scan the
# current job body only: a workflow may contain one
# docker-bound job and several harmless metadata jobs.
job_lines = raw_lines[j['line'] - 1:j['end']]
scan_text = ''.join(
l for l in job_lines
if not re.match(r'^\s*#', l)
)
has_docker = bool(DOCKER_EXEC.search(scan_text)) or bool(DOCKER_ACTION.search(scan_text))
if not has_docker:
continue
ro = j['runs_on']
if ro is None:
# Reusable workflow caller (`uses:` instead of `runs-on:`) —
# skip; rule enforced in the called workflow.
continue
# Strip surrounding [ ] and quotes.
ro_norm = ro.strip('[]').strip().strip('"\'')
# Multi-label "[a, b]" — split.
labels = [t.strip().strip('"\'') for t in ro_norm.split(',') if t.strip()]
if any(lbl in ALLOWED_LABELS for lbl in labels):
continue
# Allow caller-supplied label expressions; spell the
# marker indirectly so Gitea's expression parser does
# not try to parse this Python heredoc.
expression_marker = '$' + '{{'
if any(expression_marker in lbl for lbl in labels):
continue
fails.append(
f"{path}:{j['line']}: job `{j['name']}` uses docker but runs-on={ro!r} "
f"(must be one of {sorted(ALLOWED_LABELS)})"
)
if fails:
print("FAIL: docker-bound jobs missing docker-host/publish pin:")
for f in fails:
print(f" - {f}")
print()
print("Why this rule exists (internal#512 + mc#1529):")
print(" Bare `ubuntu-latest` is advertised by BOTH Linux operator-host")
print(" runners AND Windows hongming-pc-runner-* (act_runner v1.0.3).")
print(" Docker-bound jobs that land on Windows fail non-deterministically.")
print(" Pin to `docker-host` (general) or `publish` (image build/push).")
sys.exit(1)
print("OK: all docker-bound jobs are pinned to docker-host or publish.")
PY
@@ -1,88 +0,0 @@
name: Lint shellcheck (arm64 pilot)
# Mac-CI dual-track pilot (#233). ADDITIVE / NOT REQUIRED.
#
# Validates the arm64 self-hosted lane (no docker.sock, no privileged
# ops) before any required gate moves onto it. Until a Mac arm64 runner
# is registered with the `arm64` label, this workflow sits PENDING —
# that is FINE: `arm64` is NOT in branch_protections required contexts.
#
# Pairs with internal#543 (RFC: Mac arm64 multi-arch runner-base).
# No paths: filter on purpose (feedback_path_filtered_workflow_cant_be_required).
on:
pull_request:
branches:
- main
- staging
push:
branches:
- main
permissions:
contents: read
jobs:
shellcheck-arm64:
name: shellcheck-arm64 (pilot)
runs-on: [self-hosted, arm64]
# NOT a required check; safe to sit pending until Mac runner is up.
# If the Mac runner has trouble pulling actions/checkout we fall
# back to a plain git clone (see step 'fallback clone').
timeout-minutes: 10
env:
GITHUB_SERVER_URL: https://git.moleculesai.app
steps:
- name: Identify runner
run: |
set -eu
echo "arch=$(uname -m)"
echo "kernel=$(uname -sr)"
echo "shell=$BASH_VERSION"
# Sanity: must actually be arm64. If amd64 sneaks in here,
# fail fast — that means the label routing is wrong.
case "$(uname -m)" in
aarch64|arm64) echo "arm64 confirmed" ;;
*) echo "ERROR: expected arm64, got $(uname -m)"; exit 1 ;;
esac
- name: Checkout
uses: actions/checkout@v4
with:
fetch-depth: 1
- name: Install shellcheck (arm64)
run: |
set -eu
if command -v shellcheck >/dev/null 2>&1; then
echo "shellcheck already present: $(shellcheck --version | head -1)"
else
# Prefer apt if the runner base ships it; else download arm64 binary.
if command -v apt-get >/dev/null 2>&1; then
sudo apt-get update -qq
sudo apt-get install -y --no-install-recommends shellcheck
else
SC_VER=v0.10.0
curl -fsSL "https://github.com/koalaman/shellcheck/releases/download/${SC_VER}/shellcheck-${SC_VER}.linux.aarch64.tar.xz" \
| tar -xJf - --strip-components=1
sudo mv shellcheck /usr/local/bin/
fi
fi
shellcheck --version | head -2
- name: Run shellcheck on .gitea/scripts/*.sh
run: |
set -eu
# Only the scripts we control under .gitea/scripts. Pilot
# scope is intentionally narrow — broaden in a follow-up
# once the lane is proven.
mapfile -t TARGETS < <(find .gitea/scripts -maxdepth 2 -type f -name '*.sh' | sort)
if [ "${#TARGETS[@]}" -eq 0 ]; then
echo "No .sh files found under .gitea/scripts — nothing to check"
exit 0
fi
echo "Checking ${#TARGETS[@]} file(s):"
printf ' %s\n' "${TARGETS[@]}"
# SC1091 = couldn't follow non-constant source; expected for
# CI-time analysis without the full runtime layout.
shellcheck --severity=error --exclude=SC1091 "${TARGETS[@]}"
-76
View File
@@ -1,76 +0,0 @@
name: Lint workflow YAML (Gitea-1.22.6-hostile shapes)
# Tier-2 hard-gate lint (RFC internal#219 §1, charter §SOP-N rule (m)).
# Catches six Gitea-1.22.6-hostile workflow-YAML shapes BEFORE they reach
# `main`. Each rule maps to a documented incident in saved memory:
#
# 1. workflow_dispatch.inputs — feedback_gitea_workflow_dispatch_inputs_unsupported
# (2026-05-11 PyPI freeze 24h)
# 2. on: workflow_run — task #81 (Gitea 1.22.6 lacks the event)
# 3. name: containing "/" — breaks status-context tokenization
# 4. cross-file name collision — status-reaper rev1 fail-loud class
# 5. cross-repo uses: org/r/p@r — feedback_gitea_cross_repo_uses_blocked
# (DEFAULT_ACTIONS_URL=github → 404)
# 6. (WARN) api.github.com refs — feedback_act_runner_github_server_url
# without workflow-level GITHUB_SERVER_URL
#
# Empirical history this hardens against:
# - status-reaper rev1 caught rule-4 (name-collision) class
# - sop-tier-refire DOA'd on rule-2 (workflow_run partial)
# - #319 bootstrap-paradox (chained-defect class, related)
# - internal#329 dispatcher race (adjacent)
# - 2026-05-11 publish-runtime: rule-1, 24h PyPI freeze
#
# Triggers:
# - pull_request: pre-merge gate — block hostile shapes before they land
# - push: post-merge regression detection — catch direct-to-main edits
#
# Per RFC internal#219 §1 contract: continue-on-error: true during the
# surface-broken-shapes phase. Follow-up PR flips off after surfaced
# defects are triaged. The push-trigger ensures we catch regressions
# even if the pull_request gate is bypassed by branch-protection drift.
on:
pull_request:
paths:
- '.gitea/workflows/**'
- '.gitea/scripts/lint-workflow-yaml.py'
- 'tests/test_lint_workflow_yaml.py'
push:
branches: [main, staging]
paths:
- '.gitea/workflows/**'
- '.gitea/scripts/lint-workflow-yaml.py'
- 'tests/test_lint_workflow_yaml.py'
# Belt-and-suspenders against runner default
# (feedback_act_runner_github_server_url).
env:
GITHUB_SERVER_URL: https://git.moleculesai.app
jobs:
lint:
name: Lint workflow YAML for Gitea-1.22.6-hostile shapes
runs-on: ubuntu-latest
# Phase 3 (RFC #219 §1): surface broken shapes without blocking PRs.
# Follow-up PR flips this off after the 4 existing-on-main rule-2
# (workflow_run) violations are migrated to a supported trigger.
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0
with:
python-version: '3.11'
- name: Install PyYAML
run: pip install --quiet 'PyYAML>=6.0'
- name: Lint .gitea/workflows/*.yml
run: python3 .gitea/scripts/lint-workflow-yaml.py
- name: Run lint-workflow-yaml unit tests
run: |
pip install --quiet pytest
python3 -m pytest tests/test_lint_workflow_yaml.py -v
-104
View File
@@ -1,104 +0,0 @@
# main-red-watchdog — hourly sentinel for post-merge CI red on `main`.
#
# RFC: hongming "main NEVER goes red" directive, Option C of the four-
# option ladder (B = auto-revert is explicitly rejected per
# `feedback_no_such_thing_as_flakes` + `feedback_fix_root_not_symptom`).
# Tracking issue: molecule-core#420.
#
# What it does:
# 1. GET branches/main → HEAD SHA
# 2. GET commits/{SHA}/status → combined status
# 3. If combined is `failure` (or any individual status is `failure`):
# open or PATCH an idempotent `[main-red] {repo}: {SHA[:10]}` issue
# with each failed context + target_url + description.
# 4. If combined is `success` and a prior `[main-red] ...` issue exists,
# close it with a "main returned to green at SHA ..." comment.
# 5. Emit a Loki-shaped JSON line via `logger -t main-red-watchdog` for
# `reference_obs_stack_phase1` ingestion via Vector.
#
# What it does NOT do:
# - Auto-revert anything. Option B is rejected by directive.
# - Mutate branch protection. (See AGENTS.md boundaries.)
# - Fail the workflow on red. The issue IS the alarm — failing the
# watchdog would create a silent-loop where a flake in the watchdog
# itself hides actual main-red signal. Exit 0 unless api() raises
# ApiError (transient Gitea outage → fail loudly per
# `feedback_api_helper_must_raise_not_return_dict`).
#
# Pattern source: molecule-controlplane `0adf2098`'s ci-required-drift.yml
# (just merged 2026-05-11). Same shape (cron + dispatch + sidecar Python +
# idempotent-by-title issue), simpler scope (1 source, not 3).
name: main-red-watchdog
# IMPORTANT — Gitea 1.22.6 parser quirk per
# `feedback_gitea_workflow_dispatch_inputs_unsupported`: do NOT add an
# `inputs:` block here. Gitea 1.22.6 rejects the whole workflow as
# "unknown on type" when `workflow_dispatch.inputs.X` is present. Revisit
# when Gitea ≥ 1.23 is fleet-wide.
on:
# SCHEDULE RE-ENABLED 2026-05-12 rev3 — interim disable (mc#645) reverted alongside
# status-reaper rev3 (widen-window). Job-level timeout-minutes raised 5 → 15 below
# to absorb runner-saturation latency without spurious cancels (the original cascade
# cause). If runner-saturation root persists, the dedicated-runner-label split
# remains the structural next step (tracked separately).
schedule:
# Hourly at :05 — task spec calls for "off-zero" (`5 * * * *`),
# offset from :17 (ci-required-drift) and :00 (peak cron load).
- cron: '5 * * * *'
workflow_dispatch:
# Read commit status + branch ref + issues; write issues (open/PATCH/close).
permissions:
contents: read
issues: write
# Workflow-scoped serialisation — two simultaneous runs would race on the
# `[main-red] {SHA}` open/PATCH path. Idempotent by title, but parallel
# POSTs can produce duplicates before the title search dedup wins.
concurrency:
group: main-red-watchdog
cancel-in-progress: false
jobs:
watchdog:
runs-on: ubuntu-latest
# rev3 (2026-05-12, mc#645 revert): raised 5 → 15 to absorb runner-saturation
# latency. Original 5min cap was producing 124-style cancels under load,
# which fed the very `[main-red]` issues this workflow files (self-poisoning).
# 15min is still well below Gitea-default 6h job ceiling; if a real hang
# occurs the issue-file path is still the alarm surface.
timeout-minutes: 15
steps:
- name: Check out repo (script lives at .gitea/scripts/)
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: Set up Python (stdlib only — no PyYAML needed here)
# The script uses stdlib urllib + json. No PyYAML required (CP's
# drift detector needs it for AST parsing; we don't). Pin to the
# same 3.12 hermetic interpreter CP uses so the test/runtime
# versions stay aligned across watchdog suites.
uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
with:
python-version: '3.12'
- name: Run main-red watchdog
env:
# GITEA_TOKEN reads commit status + writes issues. Falls back
# to the auto-injected GITHUB_TOKEN if the org-level secret
# isn't set (transitional repos), matching the same pattern
# used by deploy-pipeline.yml + ci-required-drift.yml.
GITEA_TOKEN: ${{ secrets.GITEA_TOKEN || secrets.GITHUB_TOKEN }}
GITEA_HOST: git.moleculesai.app
REPO: ${{ github.repository }}
# Branch under watch. `main` per directive; staging not
# included here — staging green is a separate gate
# (`feedback_staging_e2e_merge_gate`).
WATCH_BRANCH: 'main'
# Issue label applied on file/open. `tier:high` exists in the
# molecule-core label set (verified 2026-05-11, label id 9).
# Rationale for high: main red blocks the promotion train and
# poisons every PR's auto-rebase base; treat as a fire even
# if intermittent.
RED_LABEL: 'tier:high'
run: python3 .gitea/scripts/main-red-watchdog.py
-176
View File
@@ -1,176 +0,0 @@
name: publish-canvas-image
# Ported from .github/workflows/publish-canvas-image.yml on 2026-05-11 per RFC
# internal#219 §1 sweep. Differences from the GitHub version:
# - Dropped `workflow_dispatch.inputs` (Gitea 1.22.6 parser rejects them
# per feedback_gitea_workflow_dispatch_inputs_unsupported).
# - Dropped `merge_group:` (no Gitea merge queue).
# - Dropped `environment:` blocks (Gitea has no environments).
# - Workflow-level env.GITHUB_SERVER_URL pinned per
# feedback_act_runner_github_server_url.
# - `continue-on-error: true` on each job (RFC §1 contract).
# - Retargeted the image push from GHCR to ECR. GHCR was retired during
# the 2026-05-06 Gitea migration, and Gitea's GITHUB_TOKEN cannot
# authenticate to ghcr.io.
#
# Builds and pushes the canvas Docker image to ECR whenever a commit lands
# on main that touches canvas code. Previously canvas changes were visible in
# CI (npm run build passed) but the live container was never updated —
# operators had to manually run `docker compose build canvas` each time.
#
# Mirror of publish-platform-image.yml, adapted for the Next.js canvas layer.
# See that workflow for inline notes on macOS Keychain isolation and QEMU.
on:
push:
branches: [main]
paths:
# Only rebuild when canvas source changes — saves GHA minutes on
# platform-only / docs-only / MCP-only merges.
- 'canvas/**'
- '.gitea/workflows/publish-canvas-image.yml'
# NOTE (Gitea port): the original GitHub workflow had a
# `workflow_dispatch:` manual trigger for the
# non-canvas-merge-but-need-fresh-image scenario. Dropped in the
# Gitea port (1.22.6 parser-finicky). Manual rebuilds require
# pushing an empty commit to canvas/ or running the operator-host
# build directly.
permissions:
contents: read
packages: write
env:
# SSOT-Instance-10 (#333): ECR registry triplet (account.dkr.ecr.region.amazonaws.com)
# sourced from org/repo var `ECR_REGISTRY` with the current prod-account literal as
# bootstrap fallback. When the org var is set, the fallback becomes dead code and
# switching accounts/regions is a one-line change at the org level (instead of
# touching every workflow). Pattern mirrors `vars.CP_URL || 'literal'` already in
# use below in this repo's staging-verify.yml.
IMAGE_NAME: ${{ vars.ECR_REGISTRY || '153263036946.dkr.ecr.us-east-2.amazonaws.com' }}/molecule-ai/canvas
GITHUB_SERVER_URL: https://git.moleculesai.app
jobs:
# bp-exempt: post-merge image publication side effect; CI / all-required gates source changes.
build-and-push:
name: Build & push canvas image
# Dedicated publish/release lane (internal#462 / #394 / #399). Ship
# path (on: push:main, canvas/**) — reserved capacity so a merged
# canvas fix's image build never FIFO-queues behind PR required-CI.
# The `publish` label resolves ONLY to the molecule-runner-publish-*
# sub-pool (config.publish.yaml). HARD DEPENDENCY: this MUST land
# AFTER the publish-lane runners are registered/advertising `publish`
# — the earlier #599 `docker` label attempt queued indefinitely with
# zero eligible runners precisely because the label was targeted
# before any runner advertised it (see #576). The lane is registered
# in this rollout (internal#462) so the precondition holds.
runs-on: publish
# Phase 3 (RFC #219 §1): surface broken workflows without blocking.
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
steps:
- name: Checkout
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: Log in to ECR
env:
IMAGE_NAME: ${{ env.IMAGE_NAME }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
AWS_DEFAULT_REGION: us-east-2
run: |
set -euo pipefail
ECR_REGISTRY="${IMAGE_NAME%%/*}"
aws ecr get-login-password --region us-east-2 | \
docker login --username AWS --password-stdin "${ECR_REGISTRY}"
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@4d04d5d9486b7bd6fa91e7baf45bbb4f8b9deedd # v4.0.0
- name: Ensure ECR repository exists
env:
IMAGE_NAME: ${{ env.IMAGE_NAME }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
AWS_DEFAULT_REGION: us-east-2
run: |
set -euo pipefail
repo_path="${IMAGE_NAME#*/}"
if ! aws ecr describe-repositories --repository-names "${repo_path}" --region us-east-2 >/dev/null 2>&1; then
aws ecr create-repository \
--repository-name "${repo_path}" \
--image-scanning-configuration scanOnPush=true \
--region us-east-2 >/dev/null
fi
# Health check: verify Docker daemon is accessible before attempting any
# build steps. This fails loudly at step 1 when the runner's docker.sock
# is inaccessible rather than silently continuing to the build step
# where docker build fails deep in ECR auth with a cryptic error.
- name: Verify Docker daemon access
run: |
set -euo pipefail
echo "::group::Docker daemon health check"
echo "Runner: ${HOSTNAME:-unknown}"
docker_info="$(docker info 2>&1)" || {
echo "::error::Docker daemon is not accessible at /var/run/docker.sock"
echo "::error::Runner: ${HOSTNAME:-unknown}"
printf '%s\n' "${docker_info}"
echo "::error::Check: (1) daemon running, (2) runner user in docker group, (3) sock perms 660+"
exit 1
}
printf '%s\n' "${docker_info}" | sed -n '1,5p'
echo "Docker daemon OK"
echo "::endgroup::"
- name: Compute tags
id: tags
shell: bash
run: |
echo "sha=${GITHUB_SHA::7}" >> "$GITHUB_OUTPUT"
- name: Resolve build args
id: build_args
# Priority: workflow_dispatch input > repo secret > hardcoded default.
# NEXT_PUBLIC_* env vars are baked into the JS bundle at build time by
# Next.js — they cannot be changed at runtime without a full rebuild.
# For local docker-compose deployments the defaults (localhost:8080)
# work as-is; production deployments should set CANVAS_PLATFORM_URL
# and CANVAS_WS_URL as repository secrets.
#
# Inputs are passed via env vars (not direct ${{ }} interpolation) to
# prevent shell injection from workflow_dispatch string inputs.
shell: bash
env:
INPUT_PLATFORM_URL: ${{ github.event.inputs.platform_url }}
SECRET_PLATFORM_URL: ${{ secrets.CANVAS_PLATFORM_URL }}
INPUT_WS_URL: ${{ github.event.inputs.ws_url }}
SECRET_WS_URL: ${{ secrets.CANVAS_WS_URL }}
run: |
PLATFORM_URL="${INPUT_PLATFORM_URL:-${SECRET_PLATFORM_URL:-http://localhost:8080}}"
WS_URL="${INPUT_WS_URL:-${SECRET_WS_URL:-ws://localhost:8080/ws}}"
echo "platform_url=${PLATFORM_URL}" >> "$GITHUB_OUTPUT"
echo "ws_url=${WS_URL}" >> "$GITHUB_OUTPUT"
- name: Build & push canvas image to ECR
uses: docker/build-push-action@bcafcacb16a39f128d818304e6c9c0c18556b85f # v7.1.0
with:
context: ./canvas
file: ./canvas/Dockerfile
platforms: linux/amd64
push: true
build-args: |
NEXT_PUBLIC_PLATFORM_URL=${{ steps.build_args.outputs.platform_url }}
NEXT_PUBLIC_WS_URL=${{ steps.build_args.outputs.ws_url }}
tags: |
${{ env.IMAGE_NAME }}:latest
${{ env.IMAGE_NAME }}:sha-${{ steps.tags.outputs.sha }}
# Gitea artifact-cache reachability is best-effort on the operator
# runner network. Do not let cache export fail an image that already
# built and pushed successfully.
labels: |
org.opencontainers.image.source=https://git.moleculesai.app/${{ github.repository }}
org.opencontainers.image.revision=${{ github.sha }}
org.opencontainers.image.description=Molecule AI canvas (Next.js 15 + React Flow)
@@ -1,402 +0,0 @@
name: publish-workspace-server-image
# Gitea Actions port of .github/workflows/publish-workspace-server-image.yml.
#
# Ported 2026-05-10 (issue #228). Key differences from the GitHub version:
# - Gitea Actions reads .gitea/workflows/, not .github/workflows/
# - Dropped `environment:` declarations — Gitea Actions does not support
# named environments (used by GitHub OIDC token gates)
# - Replaced `github.ref_name` (GitHub-only) with `${GITHUB_REF#refs/heads/}`
# — Gitea Actions exposes GITHUB_REF in the same format as GitHub Actions
# - docker/setup-buildx-action and aws-actions/configure-aws-credentials are
# GitHub Marketplace actions; they are installed by Gitea Actions runners and
# work identically here
# - All other variables (GITHUB_SHA, GITHUB_REPOSITORY, GITHUB_OUTPUT,
# secrets.*) use the same syntax as GitHub Actions
#
# Image tags produced:
# :staging-<sha> — per-commit digest, stable for canary verify
# :staging-latest — tracks most recent build on this branch
#
# Production auto-deploy:
# After both platform and tenant images are pushed, deploy-production waits
# for strict required push contexts on the same SHA to go green, then
# calls the production CP redeploy-fleet endpoint with target_tag=
# staging-<sha>. Set repo variable or secret PROD_AUTO_DEPLOY_DISABLED=true
# to stop production rollout while keeping image publishing enabled.
#
# Primary ECR target: 153263036946.dkr.ecr.us-east-2.amazonaws.com/molecule-ai/*
# Optional staging tenant mirror target:
# 004947743811.dkr.ecr.us-east-2.amazonaws.com/molecule-ai/platform-tenant
# Required secrets: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AUTO_SYNC_TOKEN
# Staging ECR grants the primary SSOT-managed publisher principal repository
# policy access, so no persistent staging AWS access keys are required.
#
# mc#711: Docker daemon not accessible on ubuntu-latest runner (molecule-canonical-1
# shows client-only in `docker info` — daemon not running). DinD mount is present but
# daemon doesn't respond. Fix: add diagnostic step showing socket info so ops can
# identify which runners have a live daemon. If no daemon is available, the job
# fails fast with actionable output rather than silent deep failure.
on:
push:
branches: [main]
workflow_dispatch:
# No `concurrency:` block here. Gitea 1.22.6 can cancel queued runs despite
# `cancel-in-progress: false`; that is not acceptable for a workflow with a
# production deploy job. Per-SHA image tags are immutable, and staging-latest is
# best-effort last-writer-wins metadata.
#
# 2026-05-20 retrigger: run #86994 on mc#1589 merge sha 0f0f1ba2 failed at
# setup-buildx-action with EACCES on PC2 WSL publish runner — the runner's
# DOCKER_CONFIG=/home/hongming/.docker-ecr/ dir didn't have a buildx/certs
# subdir writable by the container's UID 1001. Hot-patched the dir perms;
# this chore push retriggers the workflow. Proper fix (per-runner
# DOCKER_CONFIG owned by 1001, internal#597 --env HOME=/home/runner pattern)
# is tracked as a CI-hygiene follow-up — not in scope here.
permissions:
contents: read
packages: write
env:
# SSOT-Instance-10 (#333): ECR registry triplet (account.dkr.ecr.region.amazonaws.com)
# sourced from org/repo var `ECR_REGISTRY` with the current prod-account literal as
# bootstrap fallback. When the org var is set, the fallback becomes dead code and
# switching accounts/regions is a one-line change at the org level (instead of
# touching every workflow). Pattern mirrors `vars.CP_URL || 'literal'` already in
# use below in this repo's staging-verify.yml.
IMAGE_NAME: ${{ vars.ECR_REGISTRY || '153263036946.dkr.ecr.us-east-2.amazonaws.com' }}/molecule-ai/platform
TENANT_IMAGE_NAME: ${{ vars.ECR_REGISTRY || '153263036946.dkr.ecr.us-east-2.amazonaws.com' }}/molecule-ai/platform-tenant
STAGING_TENANT_IMAGE_NAME: ${{ vars.STAGING_ECR_REGISTRY || '004947743811.dkr.ecr.us-east-2.amazonaws.com' }}/molecule-ai/platform-tenant
jobs:
build-and-push:
# Dedicated publish/release lane (internal#462 / #394 / #399). This
# is a post-merge ship job (on: push:main) — it must NOT FIFO-compete
# with PR required-CI on the shared pool (PR#1350's prod image build
# was delayed ~25min this way). The `publish` label resolves ONLY to
# the reserved molecule-runner-publish-* sub-pool (config.publish.yaml,
# OUTSIDE the managed 1..20 range) so a merged fix's image build
# starts immediately while PR-CI keeps the general pool.
runs-on: publish
steps:
- name: Checkout
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
# Health check: verify Docker daemon is accessible before attempting any
# build steps. This fails loudly at step 1 when the runner's docker.sock
# is inaccessible rather than silently continuing where `docker build`
# fails deep in the process with a cryptic ECR auth error.
- name: Verify Docker daemon access
run: |
set -euo pipefail
echo "::group::Docker daemon health check"
echo "Runner: ${HOSTNAME:-unknown}"
docker_info="$(docker info 2>&1)" || {
echo "::error::Docker daemon is not accessible at /var/run/docker.sock"
echo "::error::Runner: ${HOSTNAME:-unknown}"
printf '%s\n' "${docker_info}"
echo "::error::Check: (1) daemon is running, (2) runner user is in docker group, (3) sock permissions are 660+"
exit 1
}
printf '%s\n' "${docker_info}" | sed -n '1,5p'
echo "Docker daemon OK"
echo "::endgroup::"
# Pre-clone manifest deps before docker build.
#
# Why: workspace-template-* repos on Gitea are private. The pre-fix
# Dockerfile.tenant ran `git clone` inside an in-image stage with no
# auth path — every CI build failed. We clone in the trusted CI
# context where AUTO_SYNC_TOKEN is available and Dockerfile.tenant
# just COPYs from .tenant-bundle-deps/.
#
# Token: AUTO_SYNC_TOKEN is the devops-engineer persona PAT.
# clone-manifest.sh embeds it as basic-auth for the clones, then
# strips .git dirs — the token never enters the image.
- name: Pre-clone manifest deps
env:
MOLECULE_GITEA_TOKEN: ${{ secrets.AUTO_SYNC_TOKEN }}
run: |
set -euo pipefail
mkdir -p .tenant-bundle-deps
# Strip JSON5 comments before jq parsing — Integration Tester appends
# `// Triggered by ...` which breaks `jq` in clone-manifest.sh.
sed '/^[[:space:]]*\/\//d' manifest.json > .manifest-stripped.json
bash scripts/clone-manifest.sh \
.manifest-stripped.json \
.tenant-bundle-deps/workspace-configs-templates \
.tenant-bundle-deps/org-templates \
.tenant-bundle-deps/plugins
ws_count=$(find .tenant-bundle-deps/workspace-configs-templates -mindepth 1 -maxdepth 1 -type d | wc -l)
org_count=$(find .tenant-bundle-deps/org-templates -mindepth 1 -maxdepth 1 -type d | wc -l)
plugins_count=$(find .tenant-bundle-deps/plugins -mindepth 1 -maxdepth 1 -type d | wc -l)
echo "Cloned: ws=$ws_count org=$org_count plugins=$plugins_count"
- name: Compute tags
id: tags
run: |
echo "sha=${GITHUB_SHA::7}" >> "$GITHUB_OUTPUT"
# Keep Buildx state inside the job temp dir. The publish runner's
# inherited DOCKER_CONFIG can point at a host-owned ECR config path
# (/home/hongming/.docker-ecr), which caused setup-buildx-action to
# fail before image build with EACCES creating buildx/certs.
- name: Prepare writable Docker config
run: |
set -euo pipefail
export DOCKER_CONFIG="$RUNNER_TEMP/docker-config"
mkdir -p "$DOCKER_CONFIG/buildx/certs"
echo "DOCKER_CONFIG=$DOCKER_CONFIG" >> "$GITHUB_ENV"
docker buildx version
# Build + push platform image (inline ECR auth — mirrors the operator-host
# approach; credentials come from GITHUB_SECRET_AWS_ACCESS_KEY_ID /
# GITHUB_SECRET_AWS_SECRET_ACCESS_KEY in Gitea Actions).
# docker buildx bake / build required for `imagetools inspect` digest
# capture in the CP pin-update step (RFC internal#229 §X step 4 PR-1).
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@4d04d5d9486b7bd6fa91e7baf45bbb4f8b9deedd # v4.0.0
- name: Build & push platform image to ECR (staging-<sha> + staging-latest)
env:
IMAGE_NAME: ${{ env.IMAGE_NAME }}
TAG_SHA: staging-${{ steps.tags.outputs.sha }}
TAG_LATEST: staging-latest
GIT_SHA: ${{ github.sha }}
REPO: ${{ github.repository }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
AWS_DEFAULT_REGION: us-east-2
run: |
set -euo pipefail
ECR_REGISTRY="${IMAGE_NAME%%/*}"
aws ecr get-login-password --region us-east-2 | \
docker login --username AWS --password-stdin "${ECR_REGISTRY}"
docker buildx build \
--file ./workspace-server/Dockerfile \
--build-arg GIT_SHA="${GIT_SHA}" \
--label "org.opencontainers.image.source=https://git.moleculesai.app/molecule-ai/${REPO}" \
--label "org.opencontainers.image.revision=${GIT_SHA}" \
--label "org.opencontainers.image.created=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
--label "molecule.workflow.run_id=${GITHUB_RUN_ID}" \
--tag "${IMAGE_NAME}:${TAG_SHA}" \
--tag "${IMAGE_NAME}:${TAG_LATEST}" \
--push .
# Build + push tenant image (Go platform + Next.js canvas in one image).
# Push the same build to the staging account too so fresh staging/E2E
# tenants can pull without cross-account ECR reads. The staging ECR repo
# policy trusts the primary SSOT-managed publisher principal; do not add
# separate persistent staging AWS access keys here.
- name: Build & push tenant image to ECR (staging-<sha> + staging-latest)
env:
TENANT_IMAGE_NAME: ${{ env.TENANT_IMAGE_NAME }}
STAGING_TENANT_IMAGE_NAME: ${{ env.STAGING_TENANT_IMAGE_NAME }}
TAG_SHA: staging-${{ steps.tags.outputs.sha }}
TAG_LATEST: staging-latest
GIT_SHA: ${{ github.sha }}
REPO: ${{ github.repository }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
AWS_DEFAULT_REGION: us-east-2
run: |
set -euo pipefail
ECR_REGISTRY="${TENANT_IMAGE_NAME%%/*}"
STAGING_ECR_REGISTRY="${STAGING_TENANT_IMAGE_NAME%%/*}"
aws ecr get-login-password --region us-east-2 | \
docker login --username AWS --password-stdin "${ECR_REGISTRY}"
aws ecr get-login-password --region us-east-2 | \
docker login --username AWS --password-stdin "${STAGING_ECR_REGISTRY}"
build_tags=(
--tag "${TENANT_IMAGE_NAME}:${TAG_SHA}"
--tag "${TENANT_IMAGE_NAME}:${TAG_LATEST}"
--tag "${STAGING_TENANT_IMAGE_NAME}:${TAG_SHA}"
--tag "${STAGING_TENANT_IMAGE_NAME}:${TAG_LATEST}"
)
docker buildx build \
--file ./workspace-server/Dockerfile.tenant \
--build-arg NEXT_PUBLIC_PLATFORM_URL= \
--build-arg GIT_SHA="${GIT_SHA}" \
--label "org.opencontainers.image.source=https://git.moleculesai.app/molecule-ai/${REPO}" \
--label "org.opencontainers.image.revision=${GIT_SHA}" \
--label "org.opencontainers.image.created=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
--label "molecule.workflow.run_id=${GITHUB_RUN_ID}" \
"${build_tags[@]}" \
--push .
# bp-exempt: production deploy side-effect; merge is gated by CI / all-required and this job waits for push CI before acting.
deploy-production:
name: Production auto-deploy
needs: build-and-push
if: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}
# Publish/release lane (internal#462) — production deploy of a merged
# fix; reserved capacity, never queued behind PR-CI.
runs-on: publish
timeout-minutes: 75
env:
CP_URL: ${{ vars.PROD_CP_URL || 'https://api.moleculesai.app' }}
CP_ADMIN_API_TOKEN: ${{ secrets.CP_ADMIN_API_TOKEN }}
GITEA_HOST: git.moleculesai.app
GITEA_TOKEN: ${{ secrets.PROD_AUTO_DEPLOY_CONTROL_TOKEN || secrets.AUTO_SYNC_TOKEN }}
PROD_AUTO_DEPLOY_DISABLED: ${{ vars.PROD_AUTO_DEPLOY_DISABLED || secrets.PROD_AUTO_DEPLOY_DISABLED || '' }}
PROD_AUTO_DEPLOY_CANARY_SLUG: ${{ vars.PROD_AUTO_DEPLOY_CANARY_SLUG || 'hongming' }}
PROD_AUTO_DEPLOY_SOAK_SECONDS: ${{ vars.PROD_AUTO_DEPLOY_SOAK_SECONDS || '60' }}
PROD_AUTO_DEPLOY_BATCH_SIZE: ${{ vars.PROD_AUTO_DEPLOY_BATCH_SIZE || '3' }}
PROD_AUTO_DEPLOY_DRY_RUN: ${{ vars.PROD_AUTO_DEPLOY_DRY_RUN || '' }}
PROD_ALLOW_NON_PROD_CP_URL: ${{ vars.PROD_ALLOW_NON_PROD_CP_URL || '' }}
steps:
- name: Checkout
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: Build deploy plan
id: plan
run: |
set -euo pipefail
python3 .gitea/scripts/prod-auto-deploy.py plan > "$RUNNER_TEMP/prod-auto-deploy-plan.json"
jq . "$RUNNER_TEMP/prod-auto-deploy-plan.json"
enabled="$(jq -r '.enabled' "$RUNNER_TEMP/prod-auto-deploy-plan.json")"
echo "enabled=$enabled" >> "$GITHUB_OUTPUT"
if [ "$enabled" != "true" ]; then
reason="$(jq -r '.disabled_reason' "$RUNNER_TEMP/prod-auto-deploy-plan.json")"
echo "::notice::Production auto-deploy disabled: $reason"
{
echo "## Production auto-deploy skipped"
echo ""
echo "Reason: \`$reason\`"
} >> "$GITHUB_STEP_SUMMARY"
exit 0
fi
if [ -z "${CP_ADMIN_API_TOKEN:-}" ]; then
echo "::error::CP_ADMIN_API_TOKEN secret is required for production auto-deploy."
exit 1
fi
if [ -z "${GITEA_TOKEN:-}" ]; then
echo "::error::AUTO_SYNC_TOKEN secret is required so production deploy can wait for green CI."
exit 1
fi
- name: Self-test production deploy helper
if: ${{ steps.plan.outputs.enabled == 'true' }}
run: |
set -euo pipefail
python3 -m pip install --quiet 'pytest==9.0.2' 'PyYAML==6.0.2'
python3 -m pytest .gitea/scripts/tests/test_prod_auto_deploy.py -q
python3 .gitea/scripts/lint-workflow-yaml.py --workflow-dir .gitea/workflows
- name: Wait for green main CI on this SHA
if: ${{ steps.plan.outputs.enabled == 'true' }}
run: |
set -euo pipefail
python3 .gitea/scripts/prod-auto-deploy.py wait-ci
- name: Call production CP redeploy-fleet
if: ${{ steps.plan.outputs.enabled == 'true' }}
run: |
set -euo pipefail
python3 .gitea/scripts/prod-auto-deploy.py assert-enabled
PLAN="$RUNNER_TEMP/prod-auto-deploy-plan.json"
TARGET_TAG="$(jq -r '.target_tag' "$PLAN")"
BODY="$(jq -c '.body' "$PLAN")"
echo "POST $CP_URL/cp/admin/tenants/redeploy-fleet"
echo " target_tag: $TARGET_TAG"
echo " body: $BODY"
HTTP_RESPONSE="$RUNNER_TEMP/prod-redeploy-response.json"
HTTP_CODE_FILE="$RUNNER_TEMP/prod-redeploy-http-code.txt"
set +e
curl -sS -o "$HTTP_RESPONSE" -w '%{http_code}' \
-m 1200 \
-H "Authorization: Bearer $CP_ADMIN_API_TOKEN" \
-H "Content-Type: application/json" \
-X POST "$CP_URL/cp/admin/tenants/redeploy-fleet" \
-d "$BODY" > "$HTTP_CODE_FILE"
set -e
HTTP_CODE="$(cat "$HTTP_CODE_FILE" 2>/dev/null || echo "000")"
[ -z "$HTTP_CODE" ] && HTTP_CODE="000"
echo "HTTP $HTTP_CODE"
jq '{ok, result_count: (.results // [] | length)}' "$HTTP_RESPONSE" || true
{
echo "## Production auto-deploy"
echo ""
echo "**Commit:** \`${GITHUB_SHA:0:7}\`"
echo "**Target tag:** \`$TARGET_TAG\`"
echo "**HTTP:** $HTTP_CODE"
echo ""
echo "### Per-tenant result"
echo ""
echo "| Slug | Phase | SSM Status | Exit | Healthz | Error present |"
echo "|------|-------|------------|------|---------|---------------|"
jq -r '.results[]? | "| \(.slug) | \(.phase) | \(.ssm_status // "-") | \(.ssm_exit_code) | \(.healthz_ok) | \((.error // "") != "") |"' "$HTTP_RESPONSE" || true
} >> "$GITHUB_STEP_SUMMARY"
if [ "$HTTP_CODE" != "200" ]; then
echo "::error::redeploy-fleet returned HTTP $HTTP_CODE"
exit 1
fi
OK="$(jq -r '.ok' "$HTTP_RESPONSE")"
if [ "$OK" != "true" ]; then
echo "::error::redeploy-fleet reported ok=false; production rollout halted."
exit 1
fi
- name: Verify reachable tenants report this SHA
if: ${{ steps.plan.outputs.enabled == 'true' }}
env:
TENANT_DOMAIN: moleculesai.app
run: |
set -euo pipefail
RESP="$RUNNER_TEMP/prod-redeploy-response.json"
mapfile -t SLUGS < <(jq -r '.results[]? | .slug' "$RESP")
if [ ${#SLUGS[@]} -eq 0 ]; then
echo "::error::No tenants returned from redeploy-fleet; refusing to mark production deploy verified."
exit 1
fi
STALE_COUNT=0
UNREACHABLE_COUNT=0
UNHEALTHY_COUNT=0
for slug in "${SLUGS[@]}"; do
healthz_ok="$(jq -r --arg slug "$slug" '.results[]? | select(.slug == $slug) | .healthz_ok' "$RESP" | tail -1)"
if [ "$healthz_ok" != "true" ]; then
echo "::error::$slug did not report healthz_ok=true in redeploy-fleet response."
UNHEALTHY_COUNT=$((UNHEALTHY_COUNT + 1))
continue
fi
url="https://${slug}.${TENANT_DOMAIN}/buildinfo"
body="$(curl -sS --max-time 30 --retry 3 --retry-delay 5 --retry-connrefused "$url" || true)"
actual="$(echo "$body" | jq -r '.git_sha // ""' 2>/dev/null || echo "")"
if [ -z "$actual" ]; then
echo "::error::$slug did not return /buildinfo after deploy."
UNREACHABLE_COUNT=$((UNREACHABLE_COUNT + 1))
continue
fi
if [ "$actual" != "$GITHUB_SHA" ]; then
echo "::error::$slug is stale: actual=${actual:0:7}, expected=${GITHUB_SHA:0:7}"
STALE_COUNT=$((STALE_COUNT + 1))
else
echo "$slug: ${actual:0:7}"
fi
done
{
echo ""
echo "### Buildinfo verification"
echo ""
echo "Expected SHA: \`${GITHUB_SHA:0:7}\`"
echo "Verified tenants: ${#SLUGS[@]}"
echo "Stale tenants: $STALE_COUNT"
echo "Unhealthy tenants: $UNHEALTHY_COUNT"
echo "Unreachable tenants: $UNREACHABLE_COUNT"
} >> "$GITHUB_STEP_SUMMARY"
if [ "$STALE_COUNT" -gt 0 ] || [ "$UNHEALTHY_COUNT" -gt 0 ] || [ "$UNREACHABLE_COUNT" -gt 0 ]; then
exit 1
fi
-159
View File
@@ -1,159 +0,0 @@
# qa-review — non-author APPROVE from the `qa` Gitea team required to merge.
#
# RFC#324 Step 1 of 5 (workflow-add). Pairs with `security-review.yml` and the
# branch-protection flip in Step 2.
#
# === DESIGN (RFC#324 v1.1 addendum) ===
#
# A1-α (refire mechanism):
# Triggers on:
# - `pull_request_target`: opened, synchronize, reopened
# → initial status posts when PR opens / re-pushes
# - comment refires are handled by `review-refire-comments.yml`
# → a single issue_comment dispatcher prevents every SOP/review
# comment from enqueueing separate qa/security/tier jobs on
# Gitea 1.22.6 before job-level `if:` can skip them.
# Workflow name = `qa-review` ; job name = `approved`.
# The job's own pass/fail conclusion publishes the status context
# `qa-review / approved (<event>)` — NO `POST /statuses` call → NO
# write:repository token scope needed. Sidesteps internal#321 defect #2.
#
# A1.1 (privilege check on slash-comment — INFORMATIONAL ONLY, NOT a gate):
# The `issue_comment` event fires for ANY commenter, including
# non-collaborators. The original (v1.2) design gated the eval step
# behind a collaborator probe → if a non-collaborator commented
# /qa-recheck, the eval was `if:`-skipped → the job exited 0 anyway →
# the status context published `success` with ZERO real APPROVE.
# That was a fail-open: any visitor could green the gate.
#
# RFC#324 v1.3 §A1.1 correction (option b per hongming-pc 1421):
# drop privilege-gating of the evaluation entirely. The eval is
# read-only and idempotent — it reads `pulls/{N}/reviews` and
# `teams/{id}/members/{u}` (both API-side state that a commenter can't
# change). Re-running it on a non-collaborator's comment is harmless
# AND correct: if a real team-member APPROVE exists, the eval flips
# green; if not, it stays red.
#
# We KEEP the privilege step as a `::notice::` log line only — useful
# for griefer-spotting (one operator spamming /recheck) without
# touching the gate. If rate-limiting is needed later, add it as a
# separate concern (time-window throttle, not a privilege gate).
#
# We MUST NOT use `github.event.comment.author_association` (the
# field doesn't exist on Gitea 1.22.6 webhook payload — this was
# sop-tier-refire's defect #1).
#
# A4 (no PR-head checkout under pull_request_target):
# We check out the BASE ref explicitly so the review-check.sh script is
# loaded from trusted source. We NEVER use `ref: ${{ github.event.pull_request.head.sha }}`.
# No PR-head code is executed in the runner. Trust boundary preserved.
#
# A5 (real Gitea team):
# `qa` team (id=20) verified by orchestrator preflight 2026-05-11; queried
# at run time via /api/v1/teams/20/members/{login}.
#
# === TOKEN ===
#
# The workflow reads PR state, PR reviews, and team membership.
# Gitea 1.22.6's /api/v1/teams/{id}/members/{u} returns 403 ('Must be a
# team member') for tokens whose owner is not in that team. The default
# `secrets.GITHUB_TOKEN` is owned by a workflow-scoped identity that is
# also not in qa/security teams → also 403.
#
# Resolution: a dedicated `RFC_324_TEAM_READ_TOKEN` secret, owned by an
# identity that IS in both `qa` and `security` teams (Owners-tier
# claude-ceo-assistant, or a new service-bot added to both teams).
# Provisioning of this secret is tracked as a follow-up issue (filed by
# core-devops at PR open).
#
# Until that secret is provisioned, the job will exit 1 with a clear
# 403-on-team-probe error and the `qa-review / approved` status will
# stay `failure`. This is the correct fail-closed behavior — the gate
# blocks merge until both (a) a QA team member APPROVEs and (b) the
# workflow has a token that can confirm their team membership.
#
# === SLASH-COMMAND CONTRACT ===
#
# /qa-recheck — re-evaluate the gate (e.g. after an APPROVE lands)
#
# Open to any PR commenter. The eval is read-only and idempotent, so
# unprivileged refires are harmless (RFC#324 v1.3 §A1.1). Collaborator
# status is logged for griefer-spotting but does NOT gate execution.
name: qa-review
on:
pull_request_target:
types: [opened, synchronize, reopened]
permissions:
contents: read
pull-requests: read
secrets: read
jobs:
# bp-exempt: PR review bot signal; required merge state is enforced by CI / all-required.
approved:
# Gate the job:
# - On pull_request_target events: always run.
# Comment-triggered refires live in review-refire-comments.yml. Keeping
# this workflow PR-only avoids comment-triggered queue storms.
if: |
github.event_name == 'pull_request_target'
runs-on: ubuntu-latest
steps:
- name: Privilege check (A1.1 — INFORMATIONAL log only, NOT a gate)
# RFC#324 v1.3 §A1.1: this step does NOT gate subsequent steps.
# It exists solely as a log line for griefer-spotting (one
# operator spamming /qa-recheck without merit). Re-running the
# read-only eval on a non-collaborator comment is harmless;
# gating it would be fail-open (skipped steps still publish
# `success` for the job's status context).
# Only runs on issue_comment events; pull_request_target has
# no comment.user.login so the step is a no-op skip there.
if: github.event_name == 'issue_comment'
env:
GITEA_TOKEN: ${{ secrets.SOP_TIER_CHECK_TOKEN || secrets.GITHUB_TOKEN }}
run: |
set -euo pipefail
login="${{ github.event.comment.user.login }}"
# Write token to a mode-600 file so it never appears in curl's argv.
# (#541: -H "Authorization: token $TOKEN" puts the secret in /proc/<pid>/cmdline)
authfile=$(mktemp)
chmod 600 "$authfile"
printf 'header = "Authorization: token %s"\n' "$GITEA_TOKEN" > "$authfile"
code=$(curl -sS -o /dev/null -w '%{http_code}' -K "$authfile" \
"${{ github.server_url }}/api/v1/repos/${{ github.repository }}/collaborators/${login}")
rm -f "$authfile"
if [ "$code" = "204" ]; then
echo "::notice::Recheck from ${login} (collaborator=true)"
else
echo "::notice::Recheck from ${login} (collaborator=false, HTTP ${code}) — proceeding with read-only eval anyway"
fi
- name: Check out BASE ref (A4 — never PR-head)
# Loads the review-check.sh script from a trusted ref. For
# pull_request_target the default checkout is BASE already; we
# set ref explicitly for the issue_comment event too so the
# script source is always the default-branch version.
# NEVER use ref: ${{ github.event.pull_request.head.sha }} —
# that would execute PR-head code with secrets-context.
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
ref: ${{ github.event.repository.default_branch }}
- name: Evaluate qa-review
env:
GITEA_TOKEN: ${{ secrets.SOP_TIER_CHECK_TOKEN || secrets.GITHUB_TOKEN }}
GITEA_HOST: git.moleculesai.app
REPO: ${{ github.repository }}
# PR number lives in different places per event:
# pull_request_target → github.event.pull_request.number
# issue_comment → github.event.issue.number
PR_NUMBER: ${{ github.event.pull_request.number || github.event.issue.number }}
DEFAULT_BRANCH: ${{ github.event.repository.default_branch }}
TEAM: qa
TEAM_ID: '20'
REVIEW_CHECK_DEBUG: '0'
REVIEW_CHECK_STRICT: '0'
run: bash .gitea/scripts/review-check.sh
-182
View File
@@ -1,182 +0,0 @@
name: Railway pin audit (drift detection)
# Ported from .github/workflows/railway-pin-audit.yml on 2026-05-11 per
# RFC internal#219 §1 sweep.
#
# Differences from the GitHub version:
# - Dropped `workflow_dispatch:` (Gitea 1.22.6 trigger handling).
# Manual runs go via cron-trigger bump or push the workflow file
# itself.
# - `actions/github-script@v9` blocks (which call github.rest.* — a
# GitHub-specific JS API) replaced with curl calls against the
# Gitea REST API (/api/v1/repos/.../issues, .../labels,
# .../comments). Same behaviour: open issue on drift, comment on
# repeat-drift, close on clean run.
# - Workflow-level env.GITHUB_SERVER_URL set so the curl calls can
# derive `git.moleculesai.app` from the runner env (with
# hard-coded fallback inside the steps).
# - `continue-on-error: true` on the job (RFC §1 contract).
#
# Daily audit of Railway env vars for drift-prone image-tag pins —
# automation-cadence layer over the detection script + regression test
# shipped in PR #2168 (#2001 closure).
#
# Background: on 2026-04-24 a stale `:staging-a14cf86` SHA pin in CP's
# TENANT_IMAGE caused 3+ hours of E2E failure with the appearance that
# "every fix didn't propagate" — really the tenant image was so old it
# didn't read the env vars those fixes produced.
#
# Cadence: once a day, 13:00 UTC (06:00 PT).
#
# Secret hardening: per feedback_schedule_vs_dispatch_secrets_hardening,
# the schedule trigger HARD-FAILS on missing RAILWAY_AUDIT_TOKEN.
on:
schedule:
- cron: '0 13 * * *'
env:
GITHUB_SERVER_URL: https://git.moleculesai.app
concurrency:
group: railway-pin-audit
cancel-in-progress: false
permissions:
issues: write
contents: read
jobs:
audit:
name: Audit Railway env vars for drift-prone pins
runs-on: ubuntu-latest
# Phase 3 (RFC #219 §1): surface broken workflows without blocking.
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
timeout-minutes: 10
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: Verify RAILWAY_AUDIT_TOKEN present
env:
RAILWAY_AUDIT_TOKEN: ${{ secrets.RAILWAY_AUDIT_TOKEN }}
id: secret_check
run: |
set -euo pipefail
if [ -n "${RAILWAY_AUDIT_TOKEN:-}" ]; then
echo "have_secret=true" >> "$GITHUB_OUTPUT"
exit 0
fi
echo "have_secret=false" >> "$GITHUB_OUTPUT"
echo "::error::RAILWAY_AUDIT_TOKEN secret missing — schedule trigger requires it. Provision the token (read-only \`variables\` scope on the molecule-platform Railway project) and store as repo secret RAILWAY_AUDIT_TOKEN."
exit 1
- name: Install Railway CLI
if: steps.secret_check.outputs.have_secret == 'true'
run: |
set -euo pipefail
curl -fsSL https://railway.com/install.sh | sh
echo "$HOME/.railway/bin" >> "$GITHUB_PATH"
- name: Verify Railway CLI authenticated
if: steps.secret_check.outputs.have_secret == 'true'
env:
RAILWAY_TOKEN: ${{ secrets.RAILWAY_AUDIT_TOKEN }}
run: |
set -euo pipefail
if ! railway whoami >/dev/null 2>&1; then
echo "::error::Railway CLI failed to authenticate with RAILWAY_AUDIT_TOKEN — token may be revoked or scoped incorrectly"
exit 2
fi
- name: Link molecule-platform project
if: steps.secret_check.outputs.have_secret == 'true'
env:
RAILWAY_TOKEN: ${{ secrets.RAILWAY_AUDIT_TOKEN }}
run: |
set -euo pipefail
railway link --project 7ccc8c68-61f4-42ab-9be5-586eeee11768
- name: Run drift audit
if: steps.secret_check.outputs.have_secret == 'true'
id: audit
env:
RAILWAY_TOKEN: ${{ secrets.RAILWAY_AUDIT_TOKEN }}
run: |
set +e
bash scripts/ops/audit-railway-sha-pins.sh 2>&1 | tee /tmp/audit.log
rc=${PIPESTATUS[0]}
echo "rc=$rc" >> "$GITHUB_OUTPUT"
# Capture the audit log for the issue body.
{
echo 'log<<AUDIT_EOF'
cat /tmp/audit.log
echo 'AUDIT_EOF'
} >> "$GITHUB_OUTPUT"
case "$rc" in
0) exit 0 ;;
1) echo "::warning::Drift-prone pin(s) detected — issue will be filed"; exit 1 ;;
2) echo "::error::Railway CLI auth/link failed mid-script — token or project ID drift"; exit 2 ;;
*) echo "::error::Unexpected audit rc=$rc"; exit 1 ;;
esac
- name: Open / update drift issue (Gitea API)
if: failure() && steps.audit.outputs.rc == '1'
env:
GITEA_TOKEN: ${{ secrets.GITHUB_TOKEN }}
REPO: ${{ github.repository }}
AUDIT_LOG: ${{ steps.audit.outputs.log }}
SERVER_URL: ${{ env.GITHUB_SERVER_URL }}
RUN_ID: ${{ github.run_id }}
run: |
set -euo pipefail
API="${SERVER_URL%/}/api/v1"
TITLE="Railway env-var drift detected"
RUN_URL="${SERVER_URL}/${REPO}/actions/runs/${RUN_ID}"
BODY=$(jq -nc --arg t "$TITLE" --arg log "${AUDIT_LOG:-(log unavailable)}" --arg run "$RUN_URL" '
{body: ("Daily Railway pin audit found drift-prone image-tag pins in the molecule-platform Railway project.\n\n**What this means:** an env var (likely on `controlplane`) is pinned to a SHA-shaped or semver tag instead of a floating tag. Same pattern that caused the 2026-04-24 TENANT_IMAGE incident — fix-PRs land but the running service does not pick them up.\n\n**Recovery:** open the Railway dashboard, replace the flagged value with a floating tag (:staging-latest, :main) unless the pin is intentional and documented in the ops runbook.\n\n**Audit output:**\n\n```\n" + $log + "\n```\n\nRun: " + $run + "\n\nCloses automatically when a subsequent daily run reports clean.")}')
# Look for existing open drift issue with the title.
EXISTING=$(curl -fsS -H "Authorization: token $GITEA_TOKEN" \
"${API}/repos/${REPO}/issues?state=open&type=issues&limit=50" \
| jq -r --arg t "$TITLE" '.[] | select(.title==$t) | .number' | head -1)
if [ -n "$EXISTING" ]; then
COMMENT_BODY=$(jq -nc --arg log "${AUDIT_LOG:-(log unavailable)}" --arg run "$RUN_URL" \
'{body: ("Still drifting. " + $run + "\n\n```\n" + $log + "\n```")}')
curl -fsS -X POST -H "Authorization: token $GITEA_TOKEN" -H "Content-Type: application/json" \
"${API}/repos/${REPO}/issues/${EXISTING}/comments" -d "$COMMENT_BODY" >/dev/null
echo "Commented on existing issue #${EXISTING}"
else
CREATE_BODY=$(echo "$BODY" | jq --arg t "$TITLE" '. + {title: $t, labels: []}')
NUM=$(curl -fsS -X POST -H "Authorization: token $GITEA_TOKEN" -H "Content-Type: application/json" \
"${API}/repos/${REPO}/issues" -d "$CREATE_BODY" | jq -r .number)
echo "Filed issue #${NUM}"
fi
- name: Close stale drift issue on clean run (Gitea API)
if: success() && steps.audit.outputs.rc == '0'
env:
GITEA_TOKEN: ${{ secrets.GITHUB_TOKEN }}
REPO: ${{ github.repository }}
SERVER_URL: ${{ env.GITHUB_SERVER_URL }}
RUN_ID: ${{ github.run_id }}
run: |
set -euo pipefail
API="${SERVER_URL%/}/api/v1"
TITLE="Railway env-var drift detected"
RUN_URL="${SERVER_URL}/${REPO}/actions/runs/${RUN_ID}"
NUMS=$(curl -fsS -H "Authorization: token $GITEA_TOKEN" \
"${API}/repos/${REPO}/issues?state=open&type=issues&limit=50" \
| jq -r --arg t "$TITLE" '.[] | select(.title==$t) | .number')
for N in $NUMS; do
curl -fsS -X POST -H "Authorization: token $GITEA_TOKEN" -H "Content-Type: application/json" \
"${API}/repos/${REPO}/issues/${N}/comments" \
-d "$(jq -nc --arg run "$RUN_URL" '{body: ("Daily audit clean — drift resolved. " + $run)}')" >/dev/null
curl -fsS -X PATCH -H "Authorization: token $GITEA_TOKEN" -H "Content-Type: application/json" \
"${API}/repos/${REPO}/issues/${N}" -d '{"state":"closed"}' >/dev/null
echo "Closed #${N}"
done
-72
View File
@@ -1,72 +0,0 @@
name: review-check-tests
# Runs review-check.sh regression tests on every PR + push that touches
# the evaluator script or its test fixtures.
#
# Follows RFC#324 follow-up (issue #540):
# .gitea/scripts/review-check.sh is load-bearing for PR merge gates.
# It has ZERO production CI coverage. This workflow closes that gap.
#
# Design choices:
# - Bash test harness (not bats). The existing test_review_check.sh
# uses a custom assert_eq/assert_contains framework that is already
# working and covers all 13 acceptance criteria (issue #540 §Acceptance).
# Converting to bats would be refactoring, not closing the gap.
# - No bats dependency: the runner-base image needs no extra tooling.
# - continue-on-error: false — these tests must pass; a failure means
# the review-gate evaluator is broken and must not be merged.
on:
push:
branches: [main, staging]
paths:
- '.gitea/scripts/review-check.sh'
- '.gitea/scripts/tests/test_review_check.sh'
- '.gitea/scripts/tests/_review_check_fixture.py'
- '.gitea/workflows/review-check-tests.yml'
pull_request:
branches: [main, staging]
paths:
- '.gitea/scripts/review-check.sh'
- '.gitea/scripts/tests/test_review_check.sh'
- '.gitea/scripts/tests/_review_check_fixture.py'
- '.gitea/workflows/review-check-tests.yml'
workflow_dispatch:
env:
GITHUB_SERVER_URL: https://git.moleculesai.app
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
# bp-exempt: review tooling regression suite; CI / all-required is the required aggregate.
test:
name: review-check.sh regression tests
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: Install jq
# Required for T12 jq-filter test case. Gitea Actions runners (ubuntu-latest
# label) do not bundle jq. Install via apt-get first (reliable for Ubuntu
# runners with internet access to package mirrors). Falls back to GitHub
# binary download. GitHub releases may be blocked on some runner networks
# (infra#241 follow-up).
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
run: |
if apt-get update -qq && apt-get install -y -qq jq; then
echo "::notice::jq installed via apt-get: $(jq --version)"
elif timeout 120 curl -sSL \
"https://github.com/jqlang/jq/releases/download/jq-1.7.1/jq-linux-amd64" \
-o /usr/local/bin/jq && chmod +x /usr/local/bin/jq; then
echo "::notice::jq binary downloaded: $(/usr/local/bin/jq --version)"
else
echo "::warning::jq install failed — apt-get and GitHub download both failed."
fi
jq --version 2>/dev/null || echo "::notice::jq not yet available — continuing"
- name: Run review-check.sh regression suite
run: bash .gitea/scripts/tests/test_review_check.sh
@@ -1,39 +0,0 @@
# DEPRECATED — superseded by `.gitea/workflows/sop-checklist.yml`.
#
# The review-refire logic (qa/security/tier slash-command dispatch) has been
# merged into sop-checklist.yml as the `review-refire` job. This workflow
# is kept as a no-op stub to avoid a gap during the transition window where
# this file may be deleted while sop-checklist.yml has not yet been merged.
#
# After sop-checklist.yml lands, this file will be deleted (issue #1280).
#
# Historical behavior (superseded):
# Gitea 1.22 queues one run per workflow subscribed to `issue_comment` before
# evaluating job-level `if:`. Previously this workflow was the single
# non-SOP comment subscriber for qa/security/tier refire slash commands.
name: review-refire-comments
on:
issue_comment:
types: [created]
permissions:
contents: read
pull-requests: read
statuses: write
concurrency:
group: ${{ github.repository }}-${{ github.workflow }}-${{ github.event.issue.number || github.ref }}
cancel-in-progress: true
jobs:
# No-op stub — all refire logic moved to sop-checklist.yml review-refire job.
# Kept to avoid transition gap; will be deleted after sop-checklist.yml merges.
dispatch:
runs-on: ubuntu-latest
steps:
- name: Deprecated — refire logic moved to sop-checklist.yml
run: |
echo "::warning::review-refire-comments.yml is deprecated. Refire logic is now in sop-checklist.yml review-refire job. This workflow is a no-op stub pending deletion (issue #1280)."
exit 0
-70
View File
@@ -1,70 +0,0 @@
# security-review — non-author APPROVE from the `security` Gitea team
# required to merge.
#
# RFC#324 Step 1 of 5 (workflow-add). Mirror of `qa-review.yml`; differs
# only in TEAM=security, TEAM_ID=21, and the slash-command name.
#
# See `qa-review.yml` header for the full A1-α / A1.1 / A4 / A5 design
# rationale; everything below is identical in shape.
name: security-review
on:
pull_request_target:
types: [opened, synchronize, reopened]
permissions:
contents: read
pull-requests: read
secrets: read
jobs:
# bp-exempt: PR security review bot signal; required merge state is enforced by CI / all-required.
approved:
# Comment-triggered refires live in review-refire-comments.yml. Keeping
# this workflow PR-only avoids comment-triggered queue storms.
if: |
github.event_name == 'pull_request_target'
runs-on: ubuntu-latest
steps:
- name: Privilege check (A1.1 — INFORMATIONAL log only, NOT a gate)
# RFC#324 v1.3 §A1.1: does NOT gate subsequent steps. See
# qa-review.yml for full rationale. Eval is read-only/idempotent
# so re-running on a non-collaborator comment is harmless.
if: github.event_name == 'issue_comment'
env:
GITEA_TOKEN: ${{ secrets.SOP_TIER_CHECK_TOKEN || secrets.GITHUB_TOKEN }}
run: |
set -euo pipefail
login="${{ github.event.comment.user.login }}"
# Write token to a mode-600 file so it never appears in curl's argv.
# (#541: -H "Authorization: token $TOKEN" puts the secret in /proc/<pid>/cmdline)
authfile=$(mktemp)
chmod 600 "$authfile"
printf 'header = "Authorization: token %s"\n' "$GITEA_TOKEN" > "$authfile"
code=$(curl -sS -o /dev/null -w '%{http_code}' -K "$authfile" \
"${{ github.server_url }}/api/v1/repos/${{ github.repository }}/collaborators/${login}")
rm -f "$authfile"
if [ "$code" = "204" ]; then
echo "::notice::Recheck from ${login} (collaborator=true)"
else
echo "::notice::Recheck from ${login} (collaborator=false, HTTP ${code}) — proceeding with read-only eval anyway"
fi
- name: Check out BASE ref (A4 — never PR-head)
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
ref: ${{ github.event.repository.default_branch }}
- name: Evaluate security-review
env:
GITEA_TOKEN: ${{ secrets.SOP_TIER_CHECK_TOKEN || secrets.GITHUB_TOKEN }}
GITEA_HOST: git.moleculesai.app
REPO: ${{ github.repository }}
PR_NUMBER: ${{ github.event.pull_request.number || github.event.issue.number }}
DEFAULT_BRANCH: ${{ github.event.repository.default_branch }}
TEAM: security
TEAM_ID: '21'
REVIEW_CHECK_DEBUG: '0'
REVIEW_CHECK_STRICT: '0'
run: bash .gitea/scripts/review-check.sh
-225
View File
@@ -1,225 +0,0 @@
# sop-checklist — peer-ack merge gate for SOP-checklist items.
#
# RFC#351 Step 2 of 6 (implementation MVP).
#
# === CONSOLIDATION (issue #1280) ===
#
# This workflow is the SINGLE `issue_comment` subscriber — the logic from
# `review-refire-comments.yml` has been merged in. Before this change:
# - sop-checklist.yml (pre-2026-05-16) → issue_comment:[created,edited,deleted] → runner slot used, job no-oped
# - review-refire-comments.yml → issue_comment:[created] → runner slot used, job no-oped
# → every non-refire comment occupied 2 runner slots for ~800 s each
# (~650 no-op runs/day, ~1,300 runner-slot-occupancy-hours/day).
#
# Fix (PR #1345 / issue #1280):
# - ONE workflow, ONE issue_comment:[created] subscription (no edited/deleted)
# - all-items-acked job: pull_request_target OR sop slash-command comments
# - review-refire job: qa/security/tier refire slash commands
# → ~50% reduction in comment-triggered runner occupancy vs pre-fix.
#
# Trust boundary (mirrors RFC#324 §A4 + sop-tier-check security note):
# `pull_request_target` (not `pull_request`) — workflow def is loaded
# from BASE branch, so a PR cannot rewrite this workflow to exfiltrate
# the token. The `actions/checkout` step pins `ref: base.sha` so the
# script ALSO comes from BASE. PR-HEAD code is never executed in the
# runner.
#
# Token scope:
# - read:repository, read:organization for PR + comments + team probes
# - write:repository for POST /statuses/{sha}
# - The token owner MUST be a member of every team referenced by the
# config's required_teams (else /teams/{id}/members/{login} returns
# 403 — see review-check.sh same-gotcha doc). For the MVP we use
# the dev-lead token (a member of engineers, managers, qa, security)
# via a repo secret `SOP_CHECKLIST_GATE_TOKEN`. Provisioning of that
# secret is a follow-up authorization step (separate from this PR).
#
# Failure mode: tier-aware (RFC#351 open question 2):
# - tier:high → state=failure (hard-fail; BP blocks merge)
# - tier:medium → state=failure (hard-fail; same)
# - tier:low → state=pending (soft-fail; BP can choose to require
# this context or skip for low-tier PRs)
# - missing/no-tier → state=failure (default-mode: hard — never lower
# the bar per feedback_fix_root_not_symptom)
#
# Slash-command contract (RFC#351 v1 + §A1.1-style notes from RFC#324):
#
# /sop-ack <slug-or-numeric-alias> [optional note]
# — register a peer-ack for one checklist item.
# — slug accepts kebab-case, snake_case, or natural-spaces
# (all normalized to canonical kebab-case).
# — numeric 1..7 maps via config.items[*].numeric_alias.
# — most-recent (user, slug) directive wins.
#
# /sop-revoke <slug-or-numeric-alias> [reason]
# — invalidate the commenter's own prior /sop-ack for this slug.
# — does NOT affect other peers' acks on the same slug.
# — most-recent (user, slug) directive wins, so a later /sop-ack
# re-restores the ack.
#
# /sop-n/a <gate> [reason]
# — declare a gate (qa-review, security-review) N/A.
# — see sop-checklist-config.yaml n/a_gates section.
#
# /qa-recheck /security-recheck /refire-tier-check
# — refire the corresponding status check on the PR head.
#
# The eval is read-only + idempotent (read PR + comments + team
# membership, compute, post status). Re-running on any event is safe —
# the new status overwrites the previous one for the same context.
name: sop-checklist
# Cancel any in-progress runs for the same PR to prevent
# stale runs from overwriting newer status contexts.
concurrency:
group: ${{ github.repository }}-${{ github.workflow }}-${{ github.event.pull_request.number || github.event.issue.number || github.ref }}
cancel-in-progress: true
# bp-required: yes ← emits sop-checklist / all-items-acked (pull_request)
on:
pull_request_target:
types: [opened, edited, synchronize, reopened, labeled, unlabeled]
issue_comment:
types: [created] # NOT [created, edited, deleted] — Gitea 1.22.6 holds a runner slot
# at job-parsing time, before job-level if: guards run. edited/deleted events
# occupied ~1,300 runner-slot-hours/day on this workflow alone during the
# 2026-05-16 freeze. Per PR #1345 fix.
permissions:
contents: read
pull-requests: read
statuses: write
secrets: read
jobs:
# sop-checklist gate: runs on PR lifecycle events OR sop slash commands.
# All other comment types (no-op text comments) no longer assign a runner
# because this job's if: guard short-circuits before runner assignment.
all-items-acked:
if: |
github.event_name == 'pull_request_target' ||
(github.event_name == 'issue_comment' &&
github.event.issue.pull_request != null &&
(contains(github.event.comment.body, '/sop-ack') ||
contains(github.event.comment.body, '/sop-revoke') ||
contains(github.event.comment.body, '/sop-n/a')))
runs-on: ubuntu-latest
steps:
- name: Check out BASE ref (trust boundary — never PR-head)
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
# For pull_request_target, the default branch is the trust
# anchor. For issue_comment the PR base may differ from the
# default branch (PR targeting `staging`), so we use the
# default-branch ref explicitly — same approach as
# qa-review.yml so the script source is always trusted.
ref: ${{ github.event.repository.default_branch }}
- name: Run sop-checklist
env:
GITEA_TOKEN: ${{ secrets.SOP_CHECKLIST_GATE_TOKEN || secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.pull_request.number || github.event.issue.number }}
OWNER: ${{ github.repository_owner }}
REPO_NAME: ${{ github.event.repository.name }}
run: |
set -euo pipefail
python3 .gitea/scripts/sop-checklist.py \
--owner "$OWNER" \
--repo "$REPO_NAME" \
--pr "$PR_NUMBER" \
--config .gitea/sop-checklist-config.yaml \
--gitea-host git.moleculesai.app
# bp-exempt: informational refire handler, not a merge gate. Emits
# qa-review/security-review status updates on /qa-recheck et al slash commands.
review-refire:
if: |
github.event_name == 'issue_comment' &&
github.event.issue.pull_request != null
runs-on: ubuntu-latest
steps:
- name: Classify comment
id: classify
env:
COMMENT_BODY: ${{ github.event.comment.body }}
run: |
set -euo pipefail
{
echo "run_qa=false"
echo "run_security=false"
echo "run_tier=false"
} >> "$GITHUB_OUTPUT"
first_line=$(printf '%s\n' "$COMMENT_BODY" | sed -n '1p')
case "$first_line" in
/qa-recheck*)
echo "run_qa=true" >> "$GITHUB_OUTPUT"
;;
/security-recheck*)
echo "run_security=true" >> "$GITHUB_OUTPUT"
;;
/refire-tier-check*)
echo "run_tier=true" >> "$GITHUB_OUTPUT"
;;
*)
echo "::notice::no supported review refire slash command; no-op"
;;
esac
- name: Check out BASE ref for trusted scripts
if: |
steps.classify.outputs.run_qa == 'true' ||
steps.classify.outputs.run_security == 'true' ||
steps.classify.outputs.run_tier == 'true'
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
ref: ${{ github.event.repository.default_branch }}
- name: Refire qa-review status
if: steps.classify.outputs.run_qa == 'true'
env:
# RFC_324_TEAM_READ_TOKEN is read-only (team membership read scope only).
# review-refire-status.sh POSTs to /statuses — requires write scope.
# SOP_TIER_CHECK_TOKEN carries write:repository + write:issue + read:organization.
GITEA_TOKEN: ${{ secrets.SOP_TIER_CHECK_TOKEN || secrets.GITHUB_TOKEN }}
GITEA_HOST: git.moleculesai.app
REPO: ${{ github.repository }}
PR_NUMBER: ${{ github.event.issue.number }}
DEFAULT_BRANCH: ${{ github.event.repository.default_branch }}
TEAM: qa
TEAM_ID: '20'
REVIEW_CHECK_DEBUG: '0'
REVIEW_CHECK_STRICT: '0'
run: |
set -euo pipefail
.gitea/scripts/review-refire-status.sh
- name: Refire security-review status
if: steps.classify.outputs.run_security == 'true'
env:
# RFC_324_TEAM_READ_TOKEN is read-only (team membership read scope only).
# review-refire-status.sh POSTs to /statuses — requires write scope.
# SOP_TIER_CHECK_TOKEN carries write:repository + write:issue + read:organization.
GITEA_TOKEN: ${{ secrets.SOP_TIER_CHECK_TOKEN || secrets.GITHUB_TOKEN }}
GITEA_HOST: git.moleculesai.app
REPO: ${{ github.repository }}
PR_NUMBER: ${{ github.event.issue.number }}
DEFAULT_BRANCH: ${{ github.event.repository.default_branch }}
TEAM: security
TEAM_ID: '21'
REVIEW_CHECK_DEBUG: '0'
REVIEW_CHECK_STRICT: '0'
run: |
set -euo pipefail
.gitea/scripts/review-refire-status.sh
- name: Refire sop-tier-check status
if: steps.classify.outputs.run_tier == 'true'
env:
GITEA_TOKEN: ${{ secrets.SOP_TIER_CHECK_TOKEN || secrets.GITHUB_TOKEN }}
GITEA_HOST: git.moleculesai.app
REPO: ${{ github.repository }}
PR_NUMBER: ${{ github.event.issue.number }}
SOP_DEBUG: '0'
run: bash .gitea/scripts/sop-tier-refire.sh
-131
View File
@@ -1,131 +0,0 @@
# sop-tier-check — canonical Gitea Actions workflow for §SOP-6 enforcement.
#
# Logic lives in `.gitea/scripts/sop-tier-check.sh` (extracted 2026-05-09
# from the previous inline-bash version). The script is the single source
# of truth; this workflow file just sets env + invokes it.
#
# Copy BOTH files (`.gitea/workflows/sop-tier-check.yml` +
# `.gitea/scripts/sop-tier-check.sh`) into any repo that wants the
# §SOP-6 PR gate enforced. Pair with branch protection on the protected
# branch:
# required_status_checks: ["sop-tier-check / tier-check (pull_request)"]
# required_approving_reviews: 1
# approving_review_teams: ["ceo", "managers", "engineers"]
#
# Tier → required-team expression (internal#189 AND-composition):
# tier:low → engineers,managers,ceo (OR: any one suffices)
# tier:medium → managers AND engineers AND qa???,security??? (AND: all required)
# tier:high → ceo (OR: single team, wired for AND)
#
# "???" = teams not yet created in Gitea. When qa + security teams are
# added, update TIER_EXPR["tier:medium"] in the script to remove the
# markers. PRs already in-flight when qa/security are created continue
# to work because their authors explicitly requested those reviews.
#
# Force-merge: Owners-team override remains available out-of-band via
# the Gitea merge API; force-merge writes `incident.force_merge` to
# `structure_events` per §Persistent structured logging gate (Phase 3).
#
# Environment variables:
# SOP_DEBUG=1 — per-API-call diagnostic lines. Default: off.
# SOP_LEGACY_CHECK=1 — revert to OR-gate for this run. Intended for
# emergency use only; burn-in window closed
# 2026-05-17 (internal#189 Phase 1).
#
# BURN-IN CLOSED 2026-05-17 (internal#189 Phase 1): The 7-day burn-in
# window closed. continue-on-error: true has been removed from the
# tier-check job; AND-composition is now fully enforced. If you need
# to temporarily re-introduce a mask, file a tracker and follow the
# mc#774 protocol (Tier 2e lint requires a current tracker within
# 2 lines of any continue-on-error: true).
name: sop-tier-check
# SECURITY: triggers MUST use `pull_request_target`, not `pull_request`.
# `pull_request_target` loads the workflow definition from the BASE
# branch (i.e. `main`), not the PR's HEAD. With `pull_request`, anyone
# with write access to a feature branch could rewrite this file in
# their PR to dump SOP_TIER_CHECK_TOKEN (org-read scope) to logs and
# exfiltrate it. Verified 2026-05-09 against Gitea 1.22.6 —
# `pull_request_target` (added in Gitea 1.21 via go-gitea/gitea#25229)
# is the documented mitigation.
#
# This workflow does NOT call `actions/checkout` of PR HEAD code, so no
# untrusted code is ever executed in the runner — we only HTTP-call the
# Gitea API. If a future change adds a checkout step, it MUST pin to
# `${{ github.event.pull_request.base.sha }}` (NOT `head.sha`) to keep
# the trust boundary.
on:
pull_request_target:
types: [opened, edited, synchronize, reopened, labeled, unlabeled]
pull_request_review:
types: [submitted, dismissed, edited]
concurrency:
group: ${{ github.repository }}-${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true
jobs:
tier-check:
runs-on: ubuntu-latest
permissions:
contents: read
pull-requests: read
secrets: read
steps:
- name: Check out base branch (for the script)
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
# Pin to base.sha — pull_request_target's protection only
# works if we never check out PR HEAD. Same SHA the workflow
# itself was loaded from.
ref: ${{ github.event.pull_request.base.sha }}
- name: Install jq
# Gitea Actions runners (ubuntu-latest label) do not bundle jq.
# The sop-tier-check script uses jq for all JSON API parsing.
# Install jq before the script runs so sop-tier-check can pass.
#
# Method: apt-get first (reliable for Ubuntu runners with internet
# access to package mirrors). Falls back to GitHub binary download.
# GitHub releases may be unreachable from some runner networks
# (infra#241 follow-up: GitHub timeout after 3s on 5.78.80.188
# runners). The sop-tier-check script has its own fallback as a
# third line of defense. continue-on-error: true ensures this step
# failing does not block the job.
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
run: |
# apt-get is the primary method — Ubuntu package mirrors are reliably
# reachable from runner containers. GitHub releases may be blocked
# or slow on some networks (infra#241 follow-up).
if apt-get update -qq && apt-get install -y -qq jq; then
echo "::notice::jq installed via apt-get: $(jq --version)"
elif timeout 120 curl -sSL \
"https://github.com/jqlang/jq/releases/download/jq-1.7.1/jq-linux-amd64" \
-o /usr/local/bin/jq && chmod +x /usr/local/bin/jq; then
echo "::notice::jq binary downloaded: $(/usr/local/bin/jq --version)"
else
echo "::warning::jq install failed — apt-get and GitHub download both failed."
fi
jq --version 2>/dev/null || echo "::notice::jq not yet available — script fallback will retry"
- name: Verify tier label + reviewer team membership
# continue-on-error: true at step level — job-level is ignored by Gitea
# Actions (quirk #10, internal runbooks). Belt-and-suspenders with
# SOP_FAIL_OPEN=1 + || true below.
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
env:
GITEA_TOKEN: ${{ secrets.SOP_TIER_CHECK_TOKEN || secrets.GITHUB_TOKEN }}
GITEA_HOST: git.moleculesai.app
REPO: ${{ github.repository }}
PR_NUMBER: ${{ github.event.pull_request.number }}
PR_AUTHOR: ${{ github.event.pull_request.user.login }}
SOP_DEBUG: '0'
SOP_LEGACY_CHECK: '0'
# SOP_FAIL_OPEN=1 makes the script always exit 0. The UI enforces
# the actual merge gate. Combined with continue-on-error: true
# above, this step never fails the job regardless of script exit.
SOP_FAIL_OPEN: '1'
run: |
bash .gitea/scripts/sop-tier-check.sh || true
-52
View File
@@ -1,52 +0,0 @@
# sop-tier-refire — manual fallback for sop-tier-check refire.
#
# Closes internal#292. Gitea 1.22.6 doesn't refire workflows on the
# `pull_request_review` event (go-gitea/gitea#33700); the `sop-tier-check`
# workflow's review-event subscription is silently dead. The result:
# PRs that get their approving review AFTER the tier-check ran on open/
# synchronize keep their failing status check forever, and the only way
# to merge is the admin force-merge path (audited via `audit-force-merge`
# but the audit trail keeps growing; see `feedback_never_admin_merge_bypass`).
#
# Comment-triggered refires now live in `review-refire-comments.yml`. Gitea
# queues issue_comment workflows before evaluating job-level `if:`, so having
# qa-review, security-review, sop-checklist, and sop-tier-refire all subscribe
# to every comment caused queue storms on SOP-heavy PRs. This workflow is a
# non-automatic breadcrumb only; Gitea 1.22.6 does not support
# workflow_dispatch inputs, so real refires must use `/refire-tier-check`.
#
# SECURITY MODEL:
#
# 1. `pull_request` exists on the issue (issue_comment fires on issues
# AND PRs; we only want PRs).
# 2. `comment.author_association` must be MEMBER/OWNER/COLLABORATOR.
# Per the internal#292 core-security review (review#1066 ask): anyone
# can comment, but only repo collaborators+ can flip the status.
# Without this gate, a drive-by commenter on a public-issue-tracker
# surface could trigger a status flip.
# 3. Comment body must contain `/refire-tier-check` — a slash-command-
# shaped trigger (not just any comment word). Prevents accidental
# triggering from prose like "we should refire tests" in a review.
# 4. This workflow does NOT check out PR HEAD code. Like sop-tier-check,
# it only HTTP-calls the Gitea API. Trust boundary preserved.
#
# Note: `issue_comment` fires from the BASE branch's workflow file. There
# is no `pull_request_target` equivalent to set; the trigger inherently
# loads the workflow from the default branch.
#
# Rate-limit: a 1s pre-sleep + a "skip if status posted in last 30s"
# guard prevents comment-spam from thrashing the status. See the script.
name: sop-tier-check refire (manual)
on:
workflow_dispatch:
jobs:
refire:
runs-on: ubuntu-latest
steps:
- name: Explain supported refire path
run: |
echo "::error::Gitea 1.22.6 does not support workflow_dispatch inputs here; comment /refire-tier-check on the PR instead."
exit 1
-357
View File
@@ -1,357 +0,0 @@
name: Staging SaaS smoke (every 30 min)
# Renamed from canary-staging.yml on 2026-05-11 per Hongming directive
# ("canary naming changed to staging for all"). Originally ported from
# .github/workflows/canary-staging.yml on 2026-05-11 per RFC
# internal#219 §1 sweep. Differences from the GitHub version:
# - Dropped `workflow_dispatch.inputs` (Gitea 1.22.6 parser rejects them
# per feedback_gitea_workflow_dispatch_inputs_unsupported).
# - Dropped `merge_group:` (no Gitea merge queue).
# - Dropped `environment:` blocks (Gitea has no environments).
# - Workflow-level env.GITHUB_SERVER_URL pinned per
# feedback_act_runner_github_server_url.
# - `continue-on-error: true` on each job (RFC §1 contract).
#
# Minimum viable health check: provisions one Hermes workspace on a fresh
# staging org, sends one A2A message, verifies PONG, tears down. ~8 min
# wall clock. Pages on failure by opening a GitHub issue; auto-closes the
# issue on the next green run.
#
# The full-SaaS workflow (e2e-staging-saas.yml) covers the broader surface
# but runs only on provisioning-critical pushes + nightly — this one
# catches drift in the 30-min window between those runs (AMI health, CF
# cert rotation, WorkOS session stability, etc.).
#
# Lean mode: E2E_MODE=smoke skips the child workspace + HMA memory +
# peers/activity checks. One parent workspace + one A2A turn is enough
# to signal "SaaS stack end-to-end is alive."
on:
schedule:
# Every 30 min. Cron on GitHub-hosted runners has a known drift of
# a few minutes under load — that's fine for a smoke check.
- cron: '*/30 * * * *'
# Serialise with the full-SaaS workflow so they don't contend for the
# same org-create quota on staging. Different group key from
# e2e-staging-saas since we don't mind queueing smoke runs behind one
# full run, but two smoke runs SHOULD queue against each other.
concurrency:
group: staging-smoke
cancel-in-progress: false
permissions:
# Needed to open / close the alerting issue.
issues: write
contents: read
env:
GITHUB_SERVER_URL: https://git.moleculesai.app
jobs:
smoke:
name: Staging SaaS smoke
runs-on: ubuntu-latest
# NOTE: Phase 3 (RFC #219 §1) `continue-on-error: true` removed
# 2026-05-11. The "surface broken workflows without blocking"
# rationale was correctly applied to advisory/lint workflows but
# wrong for this smoke — it is the 30-min canary cadence for the
# entire staging SaaS stack, and silent failure here masks the
# exact regressions the smoke exists to surface (AMI rot, CF cert
# drift, WorkOS session breakage, secret rotations). Same class of
# failure as PR#461 (`sweep-stale-e2e-orgs`) where Phase-3 silent
# failure leaked EC2. The four other `e2e-staging-*` workflows
# KEEP `continue-on-error: true` per RFC #219 §1 — they are
# advisory and matrix-style; this one is the canary. A follow-up
# `notify-failure` step below also surfaces breakage to ops even
# if branch-protection wiring is adjusted to keep this off the
# required-checks list.
# 25 min headroom over the 15-min TLS-readiness deadline in
# tests/e2e/test_staging_full_saas.sh (#2107). Without the buffer
# the job is killed at the wall-clock 15:00 mark BEFORE the bash
# `fail` + diagnostic burst can fire, leaving every cancellation
# silent. Sibling staging E2E jobs run at 20-45 min — keeping the
# smoke tighter than them so a true wedge still surfaces here
# first.
timeout-minutes: 25
env:
MOLECULE_CP_URL: https://staging-api.moleculesai.app
# 2026-05-11: secret canonicalised from MOLECULE_STAGING_ADMIN_TOKEN
# (dead in org secret store) to CP_STAGING_ADMIN_API_TOKEN per
# internal#322 — see this PR for the cross-workflow sweep.
MOLECULE_ADMIN_TOKEN: ${{ secrets.CP_STAGING_ADMIN_API_TOKEN }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
AWS_DEFAULT_REGION: us-east-2
E2E_AWS_LEAK_CHECK: required
E2E_AWS_TERMINATE_LEAKS: '1'
# MiniMax is the smoke's PRIMARY LLM auth path post-2026-05-04.
# Switched from hermes+OpenAI after #2578 (the staging OpenAI key
# account went over quota and stayed dead for 36+ hours, taking
# the smoke red the entire time). claude-code template's
# `minimax` provider routes ANTHROPIC_BASE_URL to
# api.minimax.io/anthropic and reads MINIMAX_API_KEY at boot —
# ~5-10x cheaper per token than gpt-4.1-mini AND on a separate
# billing account, so OpenAI quota collapse no longer wedges the
# smoke. Mirrors the migration continuous-synth-e2e.yml made on
# 2026-05-03 (#265) for the same reason. tests/e2e/test_staging_
# full_saas.sh branches SECRETS_JSON on which key is present —
# MiniMax wins when set.
E2E_MINIMAX_API_KEY: ${{ secrets.MOLECULE_STAGING_MINIMAX_API_KEY }}
# Direct-Anthropic alternative for operators who don't want to
# set up a MiniMax account (priority below MiniMax — first
# non-empty wins in test_staging_full_saas.sh's secrets-injection
# block). See #2578 PR comment for the rationale.
E2E_ANTHROPIC_API_KEY: ${{ secrets.MOLECULE_STAGING_ANTHROPIC_API_KEY }}
# OpenAI fallback — kept wired so an operator-dispatched run with
# E2E_RUNTIME=hermes overridden via workflow_dispatch can still
# exercise the OpenAI path without re-editing the workflow.
E2E_OPENAI_API_KEY: ${{ secrets.MOLECULE_STAGING_OPENAI_API_KEY }}
E2E_MODE: smoke
E2E_RUNTIME: claude-code
# Pin the smoke to a specific MiniMax model rather than relying
# on the per-runtime default (which could resolve to "sonnet" →
# direct Anthropic and defeat the cost saving). MiniMax-M2 is the
# stable staging MiniMax path used by the full-SaaS smoke.
E2E_MODEL_SLUG: MiniMax-M2
E2E_RUN_ID: "smoke-${{ github.run_id }}"
# Debug-only: when an operator dispatches with keep_on_failure=true,
# the smoke script's E2E_KEEP_ORG=1 path skips teardown so the
# tenant org + EC2 stay alive for SSM-based log capture. Cron runs
# never set this (the input only exists on workflow_dispatch) so
# unattended cron always tears down. See molecule-core#129
# failure mode #1 — capturing the actual exception requires
# docker logs from the live container.
E2E_KEEP_ORG: ${{ github.event.inputs.keep_on_failure == 'true' && '1' || '0' }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: Verify admin token present
run: |
if [ -z "$MOLECULE_ADMIN_TOKEN" ]; then
echo "::error::CP_STAGING_ADMIN_API_TOKEN not set"
exit 2
fi
for var in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
if [ -z "${!var:-}" ]; then
echo "::error::$var not set — EC2 leak verification cannot run"
exit 2
fi
done
- name: Verify LLM key present
run: |
# Per-runtime key check — claude-code uses MiniMax; hermes /
# langgraph (operator-dispatched only) use OpenAI. Hard-fail
# rather than soft-skip per the lesson from synth E2E #2578:
# an empty key silently falls through to the wrong
# SECRETS_JSON branch and the smoke fails 5 min later with
# a confusing auth error instead of the clean "secret
# missing" message at the top.
case "${E2E_RUNTIME}" in
claude-code)
# Either MiniMax OR direct-Anthropic works — first
# non-empty wins in the test script's secrets-injection
# priority chain. Operators only need to set ONE of these
# secrets; we don't force a choice between them.
if [ -n "${E2E_MINIMAX_API_KEY:-}" ]; then
required_secret_name="MOLECULE_STAGING_MINIMAX_API_KEY"
required_secret_value="${E2E_MINIMAX_API_KEY}"
elif [ -n "${E2E_ANTHROPIC_API_KEY:-}" ]; then
required_secret_name="MOLECULE_STAGING_ANTHROPIC_API_KEY"
required_secret_value="${E2E_ANTHROPIC_API_KEY}"
else
required_secret_name="MOLECULE_STAGING_MINIMAX_API_KEY or MOLECULE_STAGING_ANTHROPIC_API_KEY"
required_secret_value=""
fi
;;
langgraph|hermes)
required_secret_name="MOLECULE_STAGING_OPENAI_API_KEY"
required_secret_value="${E2E_OPENAI_API_KEY:-}"
;;
*)
echo "::warning::Unknown E2E_RUNTIME='${E2E_RUNTIME}' — skipping LLM-key check"
required_secret_name=""
required_secret_value="present"
;;
esac
if [ -n "$required_secret_name" ] && [ -z "$required_secret_value" ]; then
echo "::error::${required_secret_name} secret not set for runtime=${E2E_RUNTIME} — A2A will fail at request time with 'No LLM provider configured'"
exit 2
fi
echo "LLM key present ✓ (runtime=${E2E_RUNTIME}, key=${required_secret_name}, len=${#required_secret_value})"
- name: Smoke run
id: smoke
run: bash tests/e2e/test_staging_full_saas.sh
# Alerting: open a sticky issue on the FIRST failure; comment on
# subsequent failures; auto-close on next green. Comment-on-existing
# de-duplicates so a single open issue accumulates the streak —
# ops sees one issue with N comments rather than N issues.
#
# Why no consecutive-failures threshold (e.g., wait 3 runs before
# filing): the prior threshold check used
# `github.rest.actions.listWorkflowRuns()` which Gitea 1.22.6 does
# not expose (returns 404). On Gitea Actions the threshold call
# ALWAYS failed, breaking the entire alerting step and going days
# silent on real regressions (38h+ chronic red on 2026-05-07/08
# before this fix; tracked in molecule-core#129). Filing on first
# failure is also better UX — we want to know about the first red,
# not wait 90 min for it to "count." Real flakes get one issue +
# a quick close-on-green; persistent reds accumulate comments.
- name: Open issue on failure (Gitea API)
if: failure()
env:
GITEA_TOKEN: ${{ secrets.GITHUB_TOKEN }}
REPO: ${{ github.repository }}
SERVER_URL: ${{ env.GITHUB_SERVER_URL }}
RUN_ID: ${{ github.run_id }}
run: |
set -euo pipefail
API="${SERVER_URL%/}/api/v1"
# Title kept stable across the canary-staging.yml → staging-smoke.yml
# rename (2026-05-11) so any open alert issue from the old name
# still title-matches and auto-closes on the next green run.
TITLE="Canary failing: staging SaaS smoke"
RUN_URL="${SERVER_URL}/${REPO}/actions/runs/${RUN_ID}"
EXISTING=$(curl -fsS -H "Authorization: token $GITEA_TOKEN" \
"${API}/repos/${REPO}/issues?state=open&type=issues&limit=50" \
| jq -r --arg t "$TITLE" '.[] | select(.title==$t) | .number' | head -1)
if [ -n "$EXISTING" ]; then
curl -fsS -X POST -H "Authorization: token $GITEA_TOKEN" -H "Content-Type: application/json" \
"${API}/repos/${REPO}/issues/${EXISTING}/comments" \
-d "$(jq -nc --arg run "$RUN_URL" '{body: ("Smoke still failing. " + $run)}')" >/dev/null
echo "Commented on existing issue #${EXISTING}"
else
NOW=$(date -u +%Y-%m-%dT%H:%M:%SZ)
BODY=$(jq -nc --arg t "$TITLE" --arg now "$NOW" --arg run "$RUN_URL" \
'{title: $t, body: ("Smoke run failed at " + $now + ".\n\nRun: " + $run + "\n\nThis issue auto-closes on the next green smoke run. Consecutive failures add a comment here rather than a new issue.")}')
curl -fsS -X POST -H "Authorization: token $GITEA_TOKEN" -H "Content-Type: application/json" \
"${API}/repos/${REPO}/issues" -d "$BODY" >/dev/null
echo "Opened smoke failure issue (first red)"
fi
- name: Auto-close smoke issue on success (Gitea API)
if: success()
env:
GITEA_TOKEN: ${{ secrets.GITHUB_TOKEN }}
REPO: ${{ github.repository }}
SERVER_URL: ${{ env.GITHUB_SERVER_URL }}
RUN_ID: ${{ github.run_id }}
run: |
set -euo pipefail
API="${SERVER_URL%/}/api/v1"
# Title kept stable across the canary-staging.yml → staging-smoke.yml
# rename so open alert issues from the old name still match.
TITLE="Canary failing: staging SaaS smoke"
NUMS=$(curl -fsS -H "Authorization: token $GITEA_TOKEN" \
"${API}/repos/${REPO}/issues?state=open&type=issues&limit=50" \
| jq -r --arg t "$TITLE" '.[] | select(.title==$t) | .number')
NOW=$(date -u +%Y-%m-%dT%H:%M:%SZ)
for N in $NUMS; do
curl -fsS -X POST -H "Authorization: token $GITEA_TOKEN" -H "Content-Type: application/json" \
"${API}/repos/${REPO}/issues/${N}/comments" \
-d "$(jq -nc --arg now "$NOW" '{body: ("Smoke recovered at " + $now + ". Closing.")}')" >/dev/null
curl -fsS -X PATCH -H "Authorization: token $GITEA_TOKEN" -H "Content-Type: application/json" \
"${API}/repos/${REPO}/issues/${N}" -d '{"state":"closed"}' >/dev/null
echo "Closed recovered smoke issue #${N}"
done
- name: Teardown safety net
if: always()
env:
ADMIN_TOKEN: ${{ secrets.CP_STAGING_ADMIN_API_TOKEN }}
run: |
set +e
# Slug prefix matches what test_staging_full_saas.sh emits
# in smoke mode:
# SLUG="e2e-smoke-$(date +%Y%m%d)-${RUN_ID_SUFFIX}"
# Earlier (pre-2026-05-11 canary→staging rename) the prefix was
# `e2e-canary-`; both prefixes are matched here for one
# release cycle so cleanup still catches any in-flight org
# provisioned under the old prefix on an older runner that
# hasn't picked up the renamed script. Remove the canary
# fallback after one week of no-old-prefix observations.
orgs=$(curl -sS "$MOLECULE_CP_URL/cp/admin/orgs" \
-H "Authorization: Bearer $ADMIN_TOKEN" 2>/dev/null \
| python3 -c "
import json, sys, os, datetime
run_id = os.environ.get('GITHUB_RUN_ID', '')
d = json.load(sys.stdin)
# Scope to slugs from THIS smoke run when GITHUB_RUN_ID is
# available; the smoke workflow sets E2E_RUN_ID='smoke-\${run_id}'
# so the slug suffix is '-smoke-\${run_id}-...'. Mirrors the
# full-mode safety net's per-run scoping (e2e-staging-saas.yml)
# added after the 2026-04-21 cross-run cleanup incident.
# Sweep both today AND yesterday's UTC dates so a run that
# crosses midnight still cleans up its own slug — see the
# 2026-04-26→27 canvas-safety-net incident.
today = datetime.date.today()
yesterday = today - datetime.timedelta(days=1)
dates = (today.strftime('%Y%m%d'), yesterday.strftime('%Y%m%d'))
if run_id:
prefixes = tuple(f'e2e-smoke-{d}-smoke-{run_id}' for d in dates) \
+ tuple(f'e2e-canary-{d}-canary-{run_id}' for d in dates)
else:
prefixes = tuple(f'e2e-smoke-{d}-' for d in dates) \
+ tuple(f'e2e-canary-{d}-' for d in dates)
candidates = [o['slug'] for o in d.get('orgs', [])
if any(o.get('slug','').startswith(p) for p in prefixes)
and o.get('status') not in ('purged',)]
print('\n'.join(candidates))
" 2>/dev/null)
# Per-slug DELETE with HTTP-code verification. The previous
# `... >/dev/null || true` swallowed every failure, so a 5xx
# or timeout from CP looked identical to "successfully cleaned
# up" and the tenant kept eating ~2 vCPU until the hourly
# stale sweep caught it (up to 2h later). Now we capture the
# response code and surface non-2xx as a workflow warning, so
# the run page shows which slug leaked. We still don't `exit 1`
# on cleanup failure — a single-smoke cleanup miss shouldn't
# fail-flag the smoke itself when the actual smoke check
# passed. The sweep-stale-e2e-orgs cron (now every 15 min,
# 30-min threshold) is the safety net for whatever slips past.
# See molecule-controlplane#420.
leaks=()
for slug in $orgs; do
# Tempfile-routed -w + set +e/-e prevents curl-exit-code
# pollution of the captured status (lint-curl-status-capture.yml).
set +e
curl -sS -o /tmp/smoke-cleanup.out -w "%{http_code}" \
-X DELETE "$MOLECULE_CP_URL/cp/admin/tenants/$slug" \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d "{\"confirm\":\"$slug\"}" >/tmp/smoke-cleanup.code
set -e
code=$(cat /tmp/smoke-cleanup.code 2>/dev/null || echo "000")
if [ "$code" = "200" ] || [ "$code" = "204" ]; then
echo "[teardown] deleted $slug (HTTP $code)"
else
echo "::warning::smoke teardown for $slug returned HTTP $code — sweep-stale-e2e-orgs will catch it within ~45 min. Body: $(head -c 300 /tmp/smoke-cleanup.out 2>/dev/null)"
leaks+=("$slug")
fi
done
if [ ${#leaks[@]} -gt 0 ]; then
echo "::warning::smoke teardown left ${#leaks[@]} leak(s): ${leaks[*]}"
fi
exit 0
- name: Notify on smoke failure
# Fail-loud companion to dropping `continue-on-error: true`.
# The Open-issue-on-failure step above handles the human-facing
# alert; this step emits a clearly-tagged ::error:: line that
# log-tail consumers (Loki SOPRefireRule, orchestrator triage
# loop) can grep on. Mirrors PR#461's sweep-stale-e2e-orgs
# pattern. Runs AFTER the teardown safety net (which is
# if: always()) so failures don't suppress cleanup.
if: failure()
run: |
echo "::error::staging-smoke FAILED — staging SaaS canary is red. See prior step logs + the auto-filed alert issue. Common causes: (a) CP_STAGING_ADMIN_API_TOKEN secret missing/rotated, (b) staging-api.moleculesai.app 5xx, (c) MiniMax/Anthropic LLM key dead, (d) AMI/CF/WorkOS drift. The 30-min cron will retry, but a chronic red here indicates the staging SaaS stack is broken end-to-end."
exit 1
-303
View File
@@ -1,303 +0,0 @@
name: Staging verify
# Renamed from canary-verify.yml on 2026-05-11 per Hongming directive
# ("canary naming changed to staging for all"). Originally ported from
# .github/workflows/canary-verify.yml on 2026-05-11 per RFC
# internal#219 §1 sweep. Differences from the GitHub version:
# - Dropped `workflow_dispatch.inputs` (Gitea 1.22.6 parser rejects them
# per feedback_gitea_workflow_dispatch_inputs_unsupported).
# - Dropped `merge_group:` (no Gitea merge queue).
# - Dropped `environment:` blocks (Gitea has no environments).
# - Workflow-level env.GITHUB_SERVER_URL pinned per
# feedback_act_runner_github_server_url.
# - `continue-on-error: true` on each job (RFC §1 contract).
# - ~~**Gitea workflow_run trigger limitation**~~ FIXED: replaced with
# push+paths filter per this PR. Gitea 1.22.6 does not support
# `workflow_run` (task #81). The push trigger fires on every
# commit to publish-workspace-server-image.yml. Removed the
# `workflow_run.conclusion==success` job if since the push trigger
# doesn't carry completion state — the smoke test is the safety net
# (it will detect and abort on a bad image regardless). Added
# workflow_dispatch for manual runs.
#
# Runs the canary smoke suite against the staging canary tenant fleet
# after a new :staging-<sha> image lands in ECR. On green, calls the
# CP redeploy-fleet endpoint to promote :staging-<sha> → :latest so
# the prod tenant fleet's 5-minute auto-updater picks up the verified
# digest. On red, :latest stays on the prior known-good digest and
# prod is untouched.
#
# Terminology note (2026-05-11): The deployment STRATEGY here is still
# called "canary release" (a small subset of tenants gets the new image
# first, the rest follow on green). The "canary" word stays for the
# pre-fan-out cohort concept (see docs/architecture/canary-release.md
# and CANARY_SLUG in redeploy-tenants-on-*.yml). What changed is the
# FILE NAME and the SECRETS feeding this workflow — both are renamed
# to drop the redundant "canary-" prefix that conflated workflow
# identity with deployment strategy.
#
# Registry note (2026-05-10): This workflow previously used GHCR
# (ghcr.io/molecule-ai/platform-tenant) — that registry was retired
# during the 2026-05-06 Gitea suspension migration when publish-
# workspace-server-image.yml switched to the operator's ECR org
# (153263036946.dkr.ecr.us-east-2.amazonaws.com/molecule-ai/
# platform-tenant). The GHCR → ECR migration was never applied to
# this file, so this workflow was silently smoke-testing the stale
# GHCR image while the actual staging/prod tenants ran the ECR image.
# Result: smoke tests could not catch a broken ECR build. Fix:
# - Wait step: reads SHA from running canary /health (tenant-
# agnostic, works regardless of registry).
# - Promote step: calls CP redeploy-fleet endpoint with target_tag=
# staging-<sha>, same mechanism as redeploy-tenants-on-main.yml.
# No longer attempts GHCR crane ops.
#
# Dependencies:
# - publish-workspace-server-image.yml publishes :staging-<sha>
# to ECR on staging and main merges.
# - Canary tenants are configured to pull :staging-<sha> from ECR
# (TENANT_IMAGE env set to the ECR :staging-<sha> tag).
# - Repo secrets MOLECULE_STAGING_TENANT_URLS /
# MOLECULE_STAGING_ADMIN_TOKENS / MOLECULE_STAGING_CP_SHARED_SECRET
# are populated.
on:
push:
branches: [staging]
paths:
- '.gitea/workflows/publish-workspace-server-image.yml'
workflow_dispatch:
permissions:
contents: read
packages: write
actions: read
env:
# ECR registry (post-2026-05-06 SSOT for tenant images).
# publish-workspace-server-image.yml pushes here.
# SSOT-Instance-10 (#333): triplet sourced from org/repo var `ECR_REGISTRY` with
# the current prod-account literal as bootstrap fallback. When the org var is set,
# the fallback becomes dead code and switching accounts/regions is a one-line
# change at the org level. Pattern mirrors `vars.CP_URL || 'literal'` below.
IMAGE_NAME: ${{ vars.ECR_REGISTRY || '153263036946.dkr.ecr.us-east-2.amazonaws.com' }}/molecule-ai/platform
TENANT_IMAGE_NAME: ${{ vars.ECR_REGISTRY || '153263036946.dkr.ecr.us-east-2.amazonaws.com' }}/molecule-ai/platform-tenant
# CP endpoint for redeploy-fleet (used in promote step below).
CP_URL: ${{ vars.CP_URL || 'https://staging-api.moleculesai.app' }}
GITHUB_SERVER_URL: https://git.moleculesai.app
jobs:
# bp-exempt: post-merge staging verification side effect; CI / all-required gates merges.
staging-smoke:
runs-on: ubuntu-latest
# Phase 3 (RFC #219 §1): surface broken workflows without blocking.
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
outputs:
sha: ${{ steps.compute.outputs.sha }}
smoke_ran: ${{ steps.smoke.outputs.ran }}
steps:
- name: Checkout
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: Compute sha
id: compute
run: echo "sha=${GITHUB_SHA::7}" >> "$GITHUB_OUTPUT"
- name: Wait for canary tenants to pick up :staging-<sha>
# Poll canary health endpoints every 30s for up to 7 min instead
# of a fixed 6-min sleep. Exits as soon as ALL canaries report
# the new SHA (~2-3 min typical vs 6 min fixed). Falls back to
# proceeding after 7 min even if not all canaries responded —
# the smoke suite will catch any that didn't update.
#
# NOTE: The SHA is read from the running tenant's /health response,
# NOT from a registry lookup. This is registry-agnostic and works
# regardless of whether the tenant pulls from ECR, GHCR, or any
# other registry — the canary is telling us what it's actually
# running, which is the ground truth for smoke testing.
env:
MOLECULE_STAGING_TENANT_URLS: ${{ secrets.MOLECULE_STAGING_TENANT_URLS }}
EXPECTED_SHA: ${{ steps.compute.outputs.sha }}
run: |
if [ -z "$MOLECULE_STAGING_TENANT_URLS" ]; then
echo "No canary URLs configured — falling back to 60s wait"
sleep 60
exit 0
fi
IFS=',' read -ra URLS <<< "$MOLECULE_STAGING_TENANT_URLS"
MAX_WAIT=420 # 7 minutes
INTERVAL=30
ELAPSED=0
while [ $ELAPSED -lt $MAX_WAIT ]; do
ALL_READY=true
for url in "${URLS[@]}"; do
HEALTH=$(curl -s --max-time 5 "${url}/health" 2>/dev/null || echo "{}")
SHA=$(echo "$HEALTH" | grep -o "\"sha\":\"[^\"]*\"" | head -1 | cut -d'"' -f4)
if [ "$SHA" != "$EXPECTED_SHA" ]; then
ALL_READY=false
break
fi
done
if $ALL_READY; then
echo "All canaries running staging-${EXPECTED_SHA} after ${ELAPSED}s"
exit 0
fi
echo "Waiting for canaries... (${ELAPSED}s / ${MAX_WAIT}s)"
sleep $INTERVAL
ELAPSED=$((ELAPSED + INTERVAL))
done
echo "Timeout after ${MAX_WAIT}s — proceeding anyway (smoke suite will validate)"
- name: Run staging smoke suite
id: smoke
# Graceful-skip when no canary fleet is configured (Phase 2 not yet
# stood up — see molecule-controlplane/docs/canary-tenants.md).
# Sets `ran=false` on skip so promote-to-latest stays off (we don't
# want every main merge auto-promoting without gating). Manual
# promote-latest.yml is the release gate while canary is absent.
# Once the fleet is real: delete the early-exit branch.
env:
MOLECULE_STAGING_TENANT_URLS: ${{ secrets.MOLECULE_STAGING_TENANT_URLS }}
MOLECULE_STAGING_ADMIN_TOKENS: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKENS }}
MOLECULE_STAGING_CP_BASE_URL: https://staging-api.moleculesai.app
MOLECULE_STAGING_CP_SHARED_SECRET: ${{ secrets.MOLECULE_STAGING_CP_SHARED_SECRET }}
run: |
set -euo pipefail
if [ -z "${MOLECULE_STAGING_TENANT_URLS:-}" ] \
|| [ -z "${MOLECULE_STAGING_ADMIN_TOKENS:-}" ] \
|| [ -z "${MOLECULE_STAGING_CP_SHARED_SECRET:-}" ]; then
{
echo "## ⚠️ staging-verify skipped"
echo
echo "One or more canary secrets are unset (\`MOLECULE_STAGING_TENANT_URLS\`, \`MOLECULE_STAGING_ADMIN_TOKENS\`, \`MOLECULE_STAGING_CP_SHARED_SECRET\`)."
echo "Phase 2 canary fleet has not been stood up yet —"
echo "see [canary-tenants.md](https://git.moleculesai.app/molecule-ai/molecule-controlplane/blob/main/docs/canary-tenants.md)."
echo
echo "**Skipped — promote-to-latest will NOT auto-fire.** Dispatch \`promote-latest.yml\` manually when ready."
} >> "$GITHUB_STEP_SUMMARY"
echo "ran=false" >> "$GITHUB_OUTPUT"
echo "::notice::staging-verify: skipped — no canary fleet configured"
exit 0
fi
bash scripts/staging-smoke.sh
echo "ran=true" >> "$GITHUB_OUTPUT"
- name: Summary on failure
if: ${{ failure() }}
run: |
{
echo "## Canary smoke FAILED"
echo
echo "Canary tenants rejected image \`staging-${{ steps.compute.outputs.sha }}\`."
echo ":latest stays pinned to the prior good digest — prod is untouched."
echo
echo "Fix forward and merge again, or investigate the specific failed"
echo "assertions in the staging-smoke step log above."
} >> "$GITHUB_STEP_SUMMARY"
# bp-exempt: post-merge image promotion side effect; staging-smoke controls promotion.
promote-to-latest:
# On green, calls the CP redeploy-fleet endpoint with target_tag=
# staging-<sha> to promote the verified ECR image. This is the same
# mechanism as redeploy-tenants-on-main.yml — no GHCR crane ops.
#
# Pre-fix history: the old GHCR promote step used `crane tag` against
# ghcr.io/molecule-ai/platform-tenant, but publish-workspace-server-
# image.yml had already migrated to ECR on 2026-05-07 (commit
# 10e510f5). The GHCR tags were never updated, so this step was
# silently promoting a stale GHCR image while actual prod tenants
# pulled from ECR. Canary smoke tests were GHCR-targeted and could
# not catch a broken ECR build.
needs: staging-smoke
if: ${{ needs.staging-smoke.result == 'success' && needs.staging-smoke.outputs.smoke_ran == 'true' }}
runs-on: ubuntu-latest
# Phase 3 (RFC #219 §1): surface broken workflows without blocking.
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
env:
SHA: ${{ needs.staging-smoke.outputs.sha }}
CP_URL: ${{ vars.CP_URL || 'https://staging-api.moleculesai.app' }}
# CP_ADMIN_API_TOKEN gates write access to the redeploy endpoint.
# Stored at the repo level so all workflows pick it up automatically.
CP_ADMIN_API_TOKEN: ${{ secrets.CP_ADMIN_API_TOKEN }}
# canary_slug pin: deploy the verified :staging-<sha> to the canary
# first (soak 120s), then fan out to the rest of the fleet.
CANARY_SLUG: ${{ vars.CANARY_PROMOTE_SLUG || '' }}
SOAK_SECONDS: ${{ vars.CANARY_PROMOTE_SOAK || '120' }}
BATCH_SIZE: ${{ vars.CANARY_PROMOTE_BATCH || '3' }}
steps:
- name: Check CP credentials
run: |
if [ -z "${CP_ADMIN_API_TOKEN:-}" ]; then
echo "::error::CP_ADMIN_API_TOKEN secret is not set — promote step cannot call redeploy-fleet."
echo "::error::Set it at: repo Settings → Actions → Variables and Secrets → New Secret."
exit 1
fi
- name: Promote verified ECR image to :latest
run: |
set -euo pipefail
TARGET_TAG="staging-${SHA}"
# confirm:true ack required by CP /cp/admin/tenants/redeploy-fleet
# contract (cp#228 / task #308) for fleet-wide intent. Empty body
# / {confirm:false} / {only_slugs:[]} → 400. This caller promotes
# the verified staging image across the entire prod fleet (canary
# + fan-out), no slug scoping, so confirm:true is correct.
BODY=$(jq -nc \
--arg tag "$TARGET_TAG" \
--argjson soak "${SOAK_SECONDS:-120}" \
--argjson batch "${BATCH_SIZE:-3}" \
--argjson dry false \
'{
target_tag: $tag,
soak_seconds: $soak,
batch_size: $batch,
dry_run: $dry,
confirm: true
}')
if [ -n "${CANARY_SLUG:-}" ]; then
BODY=$(jq '. * {canary_slug: $slug}' --arg slug "$CANARY_SLUG" <<<"$BODY")
fi
echo "Calling: POST $CP_URL/cp/admin/tenants/redeploy-fleet"
echo " target_tag: $TARGET_TAG"
echo " body: $BODY"
HTTP_RESPONSE=$(mktemp)
HTTP_CODE_FILE=$(mktemp)
set +e
curl -sS -o "$HTTP_RESPONSE" -w '%{http_code}' \
-m 1200 \
-H "Authorization: Bearer $CP_ADMIN_API_TOKEN" \
-H "Content-Type: application/json" \
-X POST "$CP_URL/cp/admin/tenants/redeploy-fleet" \
-d "$BODY" >"$HTTP_CODE_FILE"
CURL_EXIT=$?
set -e
HTTP_CODE=$(cat "$HTTP_CODE_FILE" 2>/dev/null || echo "000")
[ -z "$HTTP_CODE" ] && HTTP_CODE="000"
echo "HTTP $HTTP_CODE (curl exit $CURL_EXIT)"
cat "$HTTP_RESPONSE" | jq . || cat "$HTTP_RESPONSE"
if [ "$HTTP_CODE" -ge 400 ]; then
echo "::error::CP redeploy-fleet returned HTTP $HTTP_CODE — refusing to proceed."
exit 1
fi
- name: Summary
run: |
{
echo "## Staging verified — :latest promoted via CP redeploy-fleet"
echo ""
echo "- **Target tag:** \`staging-${{ needs.staging-smoke.outputs.sha }}\`"
echo "- **Registry:** ECR (\`${TENANT_IMAGE_NAME}\`)"
echo "- **Canary slug:** \`${CANARY_SLUG:-<none>}\` (soak ${SOAK_SECONDS}s)"
echo "- **Batch size:** ${BATCH_SIZE:-3}"
echo ""
echo "CP redeploy-fleet is rolling out the verified image across the prod fleet."
echo "The fleet's 5-minute health-check loop will pick up the update automatically."
} >> "$GITHUB_STEP_SUMMARY"
-117
View File
@@ -1,117 +0,0 @@
# status-reaper — Option B (compensating-status POST) for Gitea 1.22.6's
# hardcoded `(push)` suffix on default-branch commit statuses.
#
# Tracking: molecule-core#? (this PR), internal#327 (sibling publish-runtime-bot),
# internal#328 (sibling mc-drift-bot), internal#80 (upstream RFC). Sister
# bots already deployed under the same per-persona-identity contract
# (`feedback_per_agent_gitea_identity_default`).
#
# Root cause:
# Gitea 1.22.6 emits commit-status context as
# `<workflow_name> / <job_name> (push)`
# for ANY workflow run on the default branch's HEAD commit, REGARDLESS
# of the trigger event. Schedule- and workflow_dispatch-triggered runs
# on `main` therefore appear as `(push)` failures on the latest main
# commit, painting main red via a fake-push status. Verified on runs
# 14525 + 14526 via Phase 1 evidence (3 sub-agents). No upstream fix
# in 1.23-1.26.1 (sibling a6f20db1 research).
#
# Why a cron-driven reaper, not workflow_run:
# Gitea 1.22.6 does NOT support `on: workflow_run` (verified via
# modules/actions/workflows.go enumeration; sister a6f20db1). The
# only event-shaped option that fires is cron. 5min is chosen to
# sit BETWEEN ci-required-drift (`:17` hourly) and main-red-watchdog
# (`:05` hourly) so the reaper sweeps red before the watchdog files
# a `[main-red]` issue (would-be false-positive).
#
# What the reaper does each tick:
# 1. Parse `.gitea/workflows/*.yml`, classify each by whether `on:`
# contains a `push:` trigger (see script for workflow_id resolution
# including `name:` collision and `/`-in-name fail-loud lints).
# 2. GET combined status for main HEAD.
# 3. For each `failure` status whose context ends ` (push)`:
# - if workflow has push trigger: PRESERVE (real defect signal).
# - if workflow has no push trigger: POST a compensating
# `state=success` with the same context and a description that
# documents the workaround.
#
# What it does NOT do:
# - Mutate non-`(push)`-suffix statuses (e.g. `(pull_request)` from
# branch_protections required-checks — verified safe 2026-05-11).
# - Auto-revert. Same reasoning as main-red-watchdog.
# - Cancel runs. The runs themselves stay visible in Actions UI; the
# fix is at the commit-status surface only.
#
# Removal path: drop this workflow when Gitea ≥ 1.24 ships with a
# real fix for the hardcoded-suffix bug. Audit issue (filed post-merge)
# tracks the deletion as a follow-up sweep.
name: status-reaper
# IMPORTANT — Gitea 1.22.6 parser quirk per
# `feedback_gitea_workflow_dispatch_inputs_unsupported`: do NOT add an
# `inputs:` block here. Gitea 1.22.6 rejects the whole workflow as
# "unknown on type" when `workflow_dispatch.inputs.X` is present.
on:
# Schedule moved to operator-config:
# /etc/cron.d/molecule-core-status-reaper ->
# /usr/local/bin/molecule-core-cron-bot.sh status-reaper
#
# This keeps the 5-minute compensation cadence but stops a maintenance
# bot from consuming Gitea Actions runner slots during PR merge waves.
workflow_dispatch:
# Compensating-status POST needs write on repo statuses; no other
# write surface is touched. checkout still needs `contents: read`.
permissions:
contents: read
# NOTE: NO `concurrency:` block is intentional.
# Gitea 1.22.6 doesn't honor `cancel-in-progress: false`: queued ticks
# of the same group get cancelled-with-started=0 instead of waiting
# (DB-verified 2026-05-12, runs 16053/16085 of status-reaper.yml).
# The reaper's POST /statuses/{sha} is idempotent — Gitea de-dups by
# context — so concurrent ticks are safe; accept them rather than
# serialise via the broken mechanism.
jobs:
reap:
runs-on: ubuntu-latest
timeout-minutes: 8
steps:
- name: Check out repo at default-branch HEAD
# BASE checkout per `feedback_pull_request_target_workflow_from_base`.
# The script reads .gitea/workflows/*.yml from the working tree to
# classify trigger sets; we must read main's CURRENT state, not
# the SHA a stale schedule fired against.
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
ref: ${{ github.event.repository.default_branch }}
- name: Set up Python (PyYAML for workflow `on:` parse)
# Pinned to 3.12 to match sibling watchdog / ci-required-drift.
uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
with:
python-version: '3.12'
- name: Install PyYAML
# PyYAML is needed because shell-grep on `on:` misses list/string
# forms and nested `push: { paths: ... }`. Same install pattern
# as ci-required-drift.yml (sub-2s install, no wheel cache).
run: python -m pip install --quiet 'PyYAML==6.0.2'
- name: Compensate operational push-suffix failures on main
env:
# claude-status-reaper persona token; provisioned by sibling
# aefaac1b 2026-05-11. Owns write:repository scope to POST
# /statuses/{sha} but NOTHING ELSE
# (`feedback_per_agent_gitea_identity_default`).
GITEA_TOKEN: ${{ secrets.STATUS_REAPER_TOKEN }}
GITEA_HOST: git.moleculesai.app
REPO: ${{ github.repository }}
WATCH_BRANCH: ${{ github.event.repository.default_branch }}
WORKFLOWS_DIR: .gitea/workflows
STATUS_REAPER_API_RETRIES: "4"
STATUS_REAPER_API_TIMEOUT_SEC: "20"
STATUS_REAPER_API_RETRY_SLEEP_SEC: "2"
run: python3 .gitea/scripts/status-reaper.py
-79
View File
@@ -1,79 +0,0 @@
name: Ops Scripts Tests
# Ported from .github/workflows/test-ops-scripts.yml on 2026-05-11 per
# RFC internal#219 §1 sweep.
#
# Differences from the GitHub version:
# - Dropped `merge_group:` trigger (no Gitea merge queue).
# - on.paths references .gitea/workflows/test-ops-scripts.yml (this
# file) instead of the .github/ one.
# - Workflow-level env.GITHUB_SERVER_URL set.
# - `continue-on-error: true` on the job (RFC §1 contract).
#
# Runs the unittest suite for scripts/ on every PR + push that touches
# anything under scripts/ or .gitea/scripts/. Kept separate from the main CI
# so a script-only change doesn't trigger the heavier Go/Canvas/Python
# pipelines.
#
# Discovery layout: tests sit alongside the code they test (see
# scripts/ops/test_sweep_cf_decide.py for the pattern; scripts/
# test_build_runtime_package.py for the rewriter coverage). The job
# below runs `unittest discover` TWICE — once from `scripts/`, once
# from `scripts/ops/` — because neither dir has an `__init__.py`, so
# a single discover from `scripts/` doesn't recurse into the ops
# subdir. Two passes is simpler than retrofitting namespace packages.
on:
push:
branches: [main, staging]
paths:
- 'scripts/**'
- '.gitea/scripts/**'
- '.gitea/workflows/test-ops-scripts.yml'
pull_request:
branches: [main, staging]
paths:
- 'scripts/**'
- '.gitea/scripts/**'
- '.gitea/workflows/test-ops-scripts.yml'
env:
GITHUB_SERVER_URL: https://git.moleculesai.app
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
test:
name: Ops scripts (unittest)
runs-on: ubuntu-latest
# Phase 3 (RFC #219 §1): surface broken workflows without blocking.
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0
with:
python-version: '3.11'
- name: Install .gitea script test dependencies
run: python -m pip install --quiet 'pytest==9.0.2' 'PyYAML==6.0.2'
- name: Run scripts/ unittests, if any
# Top-level scripts/ tests live alongside their target file. The
# runtime packaging tests moved to molecule-ai-workspace-runtime, so
# this pass may legitimately find no tests.
working-directory: scripts
run: |
set +e
python -m unittest discover -t . -p 'test_*.py' -v
rc=$?
if [ "$rc" -eq 5 ]; then
echo "No top-level scripts/ unittest files found; skipping."
exit 0
fi
exit "$rc"
- name: Run scripts/ops/ unittests (sweep_cf_decide, ...)
working-directory: scripts/ops
run: python -m unittest discover -p 'test_*.py' -v
- name: Run .gitea/scripts pytest suite
run: python -m pytest .gitea/scripts/tests -q
-121
View File
@@ -1,121 +0,0 @@
name: Weekly Platform-Go Surface
# Surface latent vet/test errors on main by running the full Platform-Go
# suite on a weekly cron regardless of whether the last push touched
# workspace-server/.
#
# Background: ci.yml's `platform-build` job gates real work on
# `if: needs.changes.outputs.platform == 'true'`. When no push touches
# workspace-server/, the skip fires and the suite never executes on main.
# Latent vet errors and test flakes can sit for weeks undetected.
#
# This workflow runs the full suite (build, vet, golangci-lint, tests with
# coverage) every Monday at 04:17 UTC. Results are posted as commit statuses
# but continue-on-error: true means they never block anything — they're
# purely a noise-reduction signal for when the next workspace-server push
# lands and would otherwise trigger the first real suite run.
#
# Why 04:17 UTC on Monday: off-peak, before the weekly sprint cycle starts.
on:
schedule:
- cron: '17 4 * * 1' # Mondays at 04:17 UTC
workflow_dispatch:
permissions:
contents: read
statuses: write
jobs:
weekly-platform-go:
name: Weekly Platform-Go Surface
runs-on: ubuntu-latest
# continue-on-error: surface only, never block
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
defaults:
run:
working-directory: workspace-server
steps:
- name: Checkout main
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
ref: main
fetch-depth: 1
- name: Set up Go
uses: actions/setup-go@40f1582b2485089dde7abd97c1529aa768e1baff # v5
with:
go-version: stable
- name: Go mod download
run: go mod download
- name: Build
run: go build ./cmd/server
# `go vet` is NOT `|| true`-guarded: surfacing latent vet errors on main is
# the whole point of this workflow (issue #567 — the motivating case was a
# `go vet` error in org_external.go that sat undetected on main for weeks).
# A vet error here fails the step → fails the job → shows red on the weekly
# commit. Per Gitea quirk #10 (job-level continue-on-error is ignored), that
# red surfaces on main — which is the intended signal, not a regression.
- name: go vet
run: go vet ./...
# golangci-lint stays `|| true`-guarded: lint is noisier (more false-
# positives than vet) and golangci-lint may not be pre-installed on every
# runner image — a `|| true` here keeps a missing-binary or lint-noise case
# from masking the vet/test signal above. Tighten to match ci.yml's lint
# gate if/when ci.yml's lint step becomes hard-failing.
- name: golangci-lint
run: golangci-lint run --timeout 3m ./... || true
- name: Tests with race detection + coverage
run: go test -race -coverprofile=coverage.out ./...
- name: Check coverage thresholds
run: |
set -e
TOTAL_FLOOR=25
CRITICAL_PATHS=(
"internal/handlers/tokens"
"internal/handlers/workspace_provision"
"internal/handlers/a2a_proxy"
"internal/handlers/registry"
"internal/handlers/secrets"
"internal/middleware/wsauth"
"internal/crypto"
)
TOTAL=$(go tool cover -func=coverage.out | grep '^total:' | awk '{print $3}' | sed 's/%//')
echo "Total coverage: ${TOTAL}%"
if awk "BEGIN{exit !(\$TOTAL < \$TOTAL_FLOOR)}"; then
echo "::error::Total coverage \${TOTAL}% is below the \${TOTAL_FLOOR}% floor."
exit 1
fi
ALLOWLIST=""
if [ -f ../.coverage-allowlist.txt ]; then
ALLOWLIST=$(grep -vE '^(#|[[:space:]]*$)' ../.coverage-allowlist.txt || true)
fi
FAILED=0
for path in "\${CRITICAL_PATHS[@]}"; do
while read -r file pct; do
[[ "$file" == *_test.go ]] && continue
[[ "$file" == *"$path"* ]] || continue
awk "BEGIN{exit !(\$pct < 10)}" || continue
rel=$(echo "$file" | sed 's|^github.com/molecule-ai/molecule-monorepo/platform/workspace-server/||; s|^github.com/molecule-ai/molecule-monorepo/platform/||')
if echo "$ALLOWLIST" | grep -qxF "$rel"; then
continue
fi
echo "::error::Low coverage \${pct}% on \${rel} (below 10% in critical path \${path})"
FAILED=$((FAILED + 1))
done < <(go tool cover -func=coverage.out | grep -v '^total:' | awk '{file=$1; sub(/:[0-9][0-9.]*:.*/, "", file); pct=$NF; gsub(/%/,"",pct); s[file]+=pct; c[file]++} END {for (f in s) printf "%s %.1f\n", f, s[f]/c[f]}' | sort)
done
if [ "$FAILED" -gt 0 ]; then
echo "::error::\${FAILED} critical paths below 10% coverage — see above."
exit 1
fi
echo "Coverage thresholds: OK"
+2 -2
View File
@@ -28,7 +28,7 @@ import sys
import urllib.request
from pathlib import Path
CANONICAL_FILE = Path(".gitea/workflows/secret-scan.yml")
CANONICAL_FILE = Path(".github/workflows/secret-scan.yml")
# Public consumer mirrors. Each entry is (label, raw_url) — raw_url
# points at the file's RAW content on the consumer's default branch
@@ -37,7 +37,7 @@ CANONICAL_FILE = Path(".gitea/workflows/secret-scan.yml")
CONSUMERS: list[tuple[str, str]] = [
(
"molecule-ai-workspace-runtime/molecule_runtime/scripts/pre-commit-checks.sh",
"https://git.moleculesai.app/molecule-ai/molecule-ai-workspace-runtime/raw/branch/main/molecule_runtime/scripts/pre-commit-checks.sh",
"https://raw.githubusercontent.com/Molecule-AI/molecule-ai-workspace-runtime/main/molecule_runtime/scripts/pre-commit-checks.sh",
),
]
+429
View File
@@ -0,0 +1,429 @@
name: Auto-promote :latest after main image build
# Retags `ghcr.io/molecule-ai/{platform,platform-tenant}:staging-<sha>`
# → `:latest` after either the image build or E2E completes on a `main`
# push, gated on E2E Staging SaaS not being red for that SHA.
#
# Why two triggers:
#
# `publish-workspace-server-image` and `e2e-staging-saas` are both
# paths-filtered, but with DIFFERENT path sets:
#
# publish-workspace-server-image:
# workspace-server/**, canvas/**, manifest.json
#
# e2e-staging-saas (full lifecycle):
# workspace-server/internal/handlers/{registry,workspace_provision,
# a2a_proxy}.go, workspace-server/internal/middleware/**,
# workspace-server/internal/provisioner/**, tests/e2e/test_staging_full_saas.sh
#
# The E2E set is a strict SUBSET of the publish set. So:
# - canvas/** changes → publish fires, E2E does not
# - workspace-server/cmd/** changes → publish fires, E2E does not
# - workspace-server/internal/sweep/** → publish fires, E2E does not
#
# The previous version triggered ONLY on E2E completion, which meant
# non-E2E-path changes (canvas, cmd, sweep, etc.) rebuilt the image
# but never advanced `:latest`. Result: as of 2026-04-28 this workflow
# had run zero times since merge despite eight main pushes — `:latest`
# was ~7 hours / 9 PRs behind main with no human realising. See
# `molecule-core` Slack discussion 2026-04-28.
#
# Adding `publish-workspace-server-image` as a second trigger closes
# the gap: any image rebuild on main eligibly advances `:latest`.
#
# Why E2E remains a kill-switch (not the trigger):
#
# When E2E DID run for this SHA and ended red, we abort — `:latest`
# stays on the prior known-good digest. When E2E didn't run (paths
# filtered out), we proceed: pre-merge gates already validated this
# SHA on staging via auto-promote-staging requiring CI + E2E Canvas +
# E2E API + CodeQL all green. Image content for non-E2E-paths
# (canvas, cmd, sweep) is exercised by those staging gates.
#
# Why `main` only:
#
# `:latest` is what prod tenants pull. We only want SHAs that have
# reached main (via auto-promote-staging) to advance `:latest`.
# Triggering on staging would let a staging-only revert advance
# `:latest` to a SHA that never reaches main, breaking the "production
# runs what's on main" invariant.
#
# Idempotency:
#
# When a SHA touches paths that match BOTH publish and E2E, both
# workflows fire and complete. Both trigger this workflow on
# completion → two runs race. Both retag `:staging-<sha>` →
# `:latest`. crane tag is idempotent (re-tagging the same digest is a
# no-op), so the second run is harmless. concurrency group serializes
# them anyway.
on:
workflow_run:
workflows:
- 'E2E Staging SaaS (full lifecycle)'
- 'publish-workspace-server-image'
types: [completed]
branches: [main]
workflow_dispatch:
inputs:
sha:
description: 'Short sha to promote (override; defaults to upstream workflow_run head_sha)'
required: false
type: string
permissions:
contents: read
packages: write
concurrency:
# Serialize promotes per-SHA so the publish+E2E both-fired race lands
# cleanly. Different SHAs can promote in parallel.
group: auto-promote-latest-${{ github.event.workflow_run.head_sha || github.event.inputs.sha || github.sha }}
cancel-in-progress: false
env:
IMAGE_NAME: ghcr.io/molecule-ai/platform
TENANT_IMAGE_NAME: ghcr.io/molecule-ai/platform-tenant
jobs:
promote:
# Proceed if upstream succeeded OR manual dispatch. Upstream-failure
# paths are filtered here; the E2E-was-red kill-switch lives in the
# gate-check step below (covers the case where upstream is publish
# success but E2E for the same SHA failed).
if: |
github.event_name == 'workflow_dispatch' ||
(github.event_name == 'workflow_run' && github.event.workflow_run.conclusion == 'success')
runs-on: ubuntu-latest
steps:
- name: Compute short sha
id: sha
run: |
set -euo pipefail
if [ -n "${{ github.event.inputs.sha }}" ]; then
FULL="${{ github.event.inputs.sha }}"
else
FULL="${{ github.event.workflow_run.head_sha }}"
fi
echo "short=${FULL:0:7}" >> "$GITHUB_OUTPUT"
echo "full=${FULL}" >> "$GITHUB_OUTPUT"
- name: Gate — E2E Staging SaaS state for this SHA
# When upstream IS E2E success, we know it's green (filtered by
# the job-level `if` already). When upstream is publish, look up
# E2E state for the same SHA. Four buckets:
#
# - completed/success: E2E confirmed safe → proceed
# - completed/failure|cancelled|timed_out: E2E found a
# regression → ABORT (exit 1), `:latest` stays put
# - in_progress|queued|requested: E2E is RACING with publish
# for a runtime-touching SHA. publish typically completes
# ~5-10min before E2E (~10-15min). If we promote on the
# publish signal here, a later E2E failure can't roll back
# `:latest` — it'd already be wrongly advanced. So we DEFER:
# skip subsequent steps (proceed=false) and let E2E's own
# completion event re-fire this workflow, which then takes
# the upstream-is-E2E path. exit 0 so the run shows as
# success rather than a noisy fake-failure.
# - none/none: E2E was paths-filtered out for this SHA (the
# change touched canvas/cmd/sweep/etc. — paths covered by
# publish but not by E2E). pre-merge gates on staging
# already validated this SHA → proceed.
#
# Manual dispatch skips this check — operator override.
id: gate
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
REPO: ${{ github.repository }}
SHA: ${{ steps.sha.outputs.full }}
UPSTREAM_NAME: ${{ github.event.workflow_run.name }}
EVENT_NAME: ${{ github.event_name }}
run: |
set -euo pipefail
if [ "$EVENT_NAME" = "workflow_dispatch" ]; then
echo "proceed=true" >> "$GITHUB_OUTPUT"
echo "::notice::Manual dispatch — skipping E2E gate (operator override)"
exit 0
fi
if [ "$UPSTREAM_NAME" = "E2E Staging SaaS (full lifecycle)" ]; then
echo "proceed=true" >> "$GITHUB_OUTPUT"
echo "::notice::Upstream is E2E itself (success per job-level if) — gate trivially satisfied"
exit 0
fi
# Upstream is publish-workspace-server-image. Check E2E state.
# The jq filter must defend against TWO empty cases that gh
# CLI emits indistinguishably:
# 1. gh exits non-zero (network blip, auth issue) → handled
# by the `|| echo "none/none"` fallback below.
# 2. gh exits zero but returns `[]` (no E2E run on this
# main SHA — the common case for canvas-only / cmd-only
# / sweep-only changes whose paths don't trigger E2E).
# Without `(.[0] // {})`, jq sees `null` and emits
# "null/none" — which the case statement below has no
# branch for, so it falls into *) → exit 1.
# Surfaced 2026-04-30 the first time the App-token chain
# (#2389) actually fired auto-promote-on-e2e from a publish
# upstream — every prior run was E2E-upstream which
# short-circuits before this gate.
RESULT=$(gh run list \
--repo "$REPO" \
--workflow e2e-staging-saas.yml \
--branch main \
--commit "$SHA" \
--limit 1 \
--json status,conclusion \
--jq '(.[0] // {}) | "\(.status // "none")/\(.conclusion // "none")"' \
2>/dev/null || echo "none/none")
echo "E2E Staging SaaS for ${SHA:0:7}: $RESULT"
case "$RESULT" in
completed/success)
echo "proceed=true" >> "$GITHUB_OUTPUT"
echo "::notice::E2E green for this SHA — proceeding with promote"
;;
completed/failure|completed/timed_out)
echo "proceed=false" >> "$GITHUB_OUTPUT"
{
echo "## ❌ Auto-promote aborted — E2E Staging SaaS failed"
echo
echo "E2E Staging SaaS for \`${SHA:0:7}\`: \`$RESULT\`"
echo "\`:latest\` stays on the prior known-good digest."
echo
echo "If the failure was a flake, manually dispatch this workflow with the same sha to override."
} >> "$GITHUB_STEP_SUMMARY"
exit 1
;;
completed/cancelled)
# cancelled ≠ failure. Per-SHA concurrency cancels older E2E
# runs when a newer push lands (memory:
# feedback_concurrency_group_per_sha) — the newer SHA will
# have its own E2E + promote chain. Treat the same as
# in_progress: defer without aborting, let the next E2E run
# promote when it lands.
#
# Caught 2026-05-05 02:03 on sha 31f9a5e — auto-promote
# blocked the whole chain because this case fell through to
# exit 1 instead of clean defer.
echo "proceed=false" >> "$GITHUB_OUTPUT"
{
echo "## ⏭ Auto-promote deferred — E2E Staging SaaS was cancelled"
echo
echo "E2E Staging SaaS for \`${SHA:0:7}\`: \`$RESULT\`"
echo "Likely per-SHA concurrency (newer push superseded this E2E run)."
echo "The newer SHA's E2E will fire its own promote when it lands."
echo "If you need this specific SHA promoted, manually dispatch."
} >> "$GITHUB_STEP_SUMMARY"
;;
in_progress/*|queued/*|requested/*|waiting/*|pending/*)
echo "proceed=false" >> "$GITHUB_OUTPUT"
{
echo "## ⏳ Auto-promote deferred — E2E Staging SaaS still running"
echo
echo "Publish completed before E2E for \`${SHA:0:7}\` (state: \`$RESULT\`)."
echo "Skipping retag here — E2E's own completion event will re-fire this workflow."
echo "If E2E ends green, that run promotes \`:latest\`. If red, it aborts."
} >> "$GITHUB_STEP_SUMMARY"
;;
none/none)
echo "proceed=true" >> "$GITHUB_OUTPUT"
echo "::notice::E2E paths-filtered out for this SHA — pre-merge staging gates carry"
;;
*)
echo "proceed=false" >> "$GITHUB_OUTPUT"
{
echo "## ❓ Auto-promote aborted — unexpected E2E state"
echo
echo "E2E Staging SaaS for \`${SHA:0:7}\`: \`$RESULT\` (unhandled)"
echo "Manual investigation needed; re-dispatch with the same sha once resolved."
} >> "$GITHUB_STEP_SUMMARY"
exit 1
;;
esac
- if: steps.gate.outputs.proceed == 'true'
uses: imjasonh/setup-crane@6da1ae018866400525525ce74ff892880c099987 # v0.5
- name: GHCR login
if: steps.gate.outputs.proceed == 'true'
run: |
echo "${{ secrets.GITHUB_TOKEN }}" | \
crane auth login ghcr.io -u "${{ github.actor }}" --password-stdin
- name: Verify :staging-<sha> exists for both images
# Better to fail fast with a clear message than to half-tag
# (platform retagged but platform-tenant missing → tenants pull
# a stale image).
if: steps.gate.outputs.proceed == 'true'
run: |
set -euo pipefail
for img in "${IMAGE_NAME}" "${TENANT_IMAGE_NAME}"; do
tag="${img}:staging-${{ steps.sha.outputs.short }}"
if ! crane manifest "$tag" >/dev/null 2>&1; then
echo "::error::Missing tag: $tag"
echo "::error::publish-workspace-server-image must complete on this SHA before auto-promote can retag :latest."
exit 1
fi
echo " ok: $tag exists"
done
- name: Ancestry check — refuse to promote :latest backwards
# #2244: workflow_run completions arrive in arbitrary order. If
# SHA-A and SHA-B both reach main within ~10 min and SHA-B's E2E
# completes before SHA-A's, this workflow can fire for SHA-A
# AFTER it already promoted SHA-B → :latest goes backwards. The
# orphan-reconciler "next run corrects it" doesn't apply: there's
# no auto-corrective re-promote, :latest stays wrong until the
# next main push lands.
#
# Detection: read current :latest's `org.opencontainers.image.revision`
# label (set by publish-workspace-server-image.yml at build time)
# and ask the GitHub compare API whether the candidate SHA is
# ahead-of / identical-to / behind / diverged-from current.
# Hard-fail on `behind` and `diverged` per the approved design —
# silent-bypass is the class we're moving away from. Workflow
# goes red, oncall sees it, operator decides how to recover
# (manual dispatch with the right SHA, force-promote, etc.).
#
# Manual dispatch skips this check — operator override semantics
# match the gate-check step above.
#
# Backward-compat: when current :latest carries no revision
# label (legacy image pre-publish-with-label), skip-with-warning.
# All :latest images on main are post-label as of 2026-04-29, so
# this branch will be dead within 90 days; remove then.
if: steps.gate.outputs.proceed == 'true' && github.event_name != 'workflow_dispatch'
id: ancestry
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
REPO: ${{ github.repository }}
TARGET_SHA: ${{ steps.sha.outputs.full }}
run: |
set -euo pipefail
# Read the current :latest config and pull the revision label.
# `crane config` returns the OCI image config blob (not the manifest);
# labels live under `.config.Labels`. `// empty` makes jq return ""
# rather than the literal "null" so the test below works.
CURRENT_REVISION=$(crane config "${IMAGE_NAME}:latest" 2>/dev/null \
| jq -r '.config.Labels["org.opencontainers.image.revision"] // empty' \
|| true)
if [ -z "$CURRENT_REVISION" ]; then
echo "decision=skip-no-label" >> "$GITHUB_OUTPUT"
{
echo "## ⚠ Ancestry check skipped — current :latest has no revision label"
echo
echo "Likely a legacy image built before \`org.opencontainers.image.revision\` was set."
echo "Falling through to retag. After all \`:latest\` images are post-label (TODO 90 days), this branch is dead and should be removed."
} >> "$GITHUB_STEP_SUMMARY"
echo "::warning::Current :latest carries no revision label — skipping ancestry check (legacy image)"
exit 0
fi
if [ "$CURRENT_REVISION" = "$TARGET_SHA" ]; then
echo "decision=identical" >> "$GITHUB_OUTPUT"
echo "::notice:::latest already at ${TARGET_SHA:0:7} — retag will be a no-op"
exit 0
fi
# Ask GitHub which side of the merge graph TARGET_SHA sits on
# relative to CURRENT_REVISION. Returns one of: ahead | identical
# | behind | diverged. Network or auth errors collapse to "error"
# via the explicit fallback so the case below always matches.
STATUS=$(gh api \
"repos/${REPO}/compare/${CURRENT_REVISION}...${TARGET_SHA}" \
--jq '.status' 2>/dev/null || echo "error")
echo "ancestry compare ${CURRENT_REVISION:0:7} → ${TARGET_SHA:0:7}: $STATUS"
case "$STATUS" in
ahead)
echo "decision=ahead" >> "$GITHUB_OUTPUT"
echo "::notice::Target ${TARGET_SHA:0:7} is ahead of current :latest (${CURRENT_REVISION:0:7}) — proceeding with retag"
;;
identical)
echo "decision=identical" >> "$GITHUB_OUTPUT"
echo "::notice::Target identical to :latest — retag will be a no-op"
;;
behind)
echo "decision=behind" >> "$GITHUB_OUTPUT"
{
echo "## ❌ Auto-promote refused — target is BEHIND current :latest"
echo
echo "| Field | Value |"
echo "|---|---|"
echo "| Target SHA | \`$TARGET_SHA\` |"
echo "| Current :latest revision | \`$CURRENT_REVISION\` |"
echo "| GitHub compare status | \`behind\` |"
echo
echo "This guard catches the workflow_run-completion-order race (#2244):"
echo "two rapid main pushes whose E2Es complete out-of-order can otherwise"
echo "promote \`:latest\` backwards. \`:latest\` stays on \`${CURRENT_REVISION:0:7}\`."
echo
echo "**Recovery:** if this is a legitimate revert that should land on \`:latest\`,"
echo "manually dispatch this workflow with the target sha as input — the manual-dispatch"
echo "path skips the ancestry check (operator override)."
} >> "$GITHUB_STEP_SUMMARY"
exit 1
;;
diverged)
echo "decision=diverged" >> "$GITHUB_OUTPUT"
{
echo "## ❓ Auto-promote refused — history diverged"
echo
echo "| Field | Value |"
echo "|---|---|"
echo "| Target SHA | \`$TARGET_SHA\` |"
echo "| Current :latest revision | \`$CURRENT_REVISION\` |"
echo "| GitHub compare status | \`diverged\` |"
echo
echo "Likely cause: force-push rewrote main's history, leaving the previous"
echo "\`:latest\` revision orphaned. Needs human review before \`:latest\` advances."
} >> "$GITHUB_STEP_SUMMARY"
exit 1
;;
error|*)
echo "decision=error" >> "$GITHUB_OUTPUT"
{
echo "## ❌ Auto-promote aborted — ancestry-check API error"
echo
echo "\`gh api repos/${REPO}/compare/${CURRENT_REVISION}...${TARGET_SHA}\` returned unexpected status: \`$STATUS\`"
echo
echo "Manual dispatch with the target sha bypasses this check."
} >> "$GITHUB_STEP_SUMMARY"
exit 1
;;
esac
- name: Retag platform :staging-<sha> → :latest
if: steps.gate.outputs.proceed == 'true'
run: |
crane tag "${IMAGE_NAME}:staging-${{ steps.sha.outputs.short }}" latest
- name: Retag tenant :staging-<sha> → :latest
if: steps.gate.outputs.proceed == 'true'
run: |
crane tag "${TENANT_IMAGE_NAME}:staging-${{ steps.sha.outputs.short }}" latest
- name: Summary
if: steps.gate.outputs.proceed == 'true'
run: |
{
echo "## :latest promoted to ${{ steps.sha.outputs.short }}"
echo
if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
echo "- Trigger: manual dispatch"
else
echo "- Upstream: \`${{ github.event.workflow_run.name }}\` ([run](${{ github.event.workflow_run.html_url }}))"
fi
echo "- platform:staging-${{ steps.sha.outputs.short }} → :latest"
echo "- platform-tenant:staging-${{ steps.sha.outputs.short }} → :latest"
echo
echo "Tenant fleet auto-pulls within 5 min via IMAGE_AUTO_REFRESH=true."
echo "Force immediate fanout: dispatch redeploy-tenants-on-main.yml."
} >> "$GITHUB_STEP_SUMMARY"
+434
View File
@@ -0,0 +1,434 @@
name: Auto-promote staging → main
# Fires after any of the staging-branch quality gates complete. When ALL
# required gates are green on the same staging SHA, opens (or re-uses)
# a PR `staging → main` and enables auto-merge so the merge queue lands
# it. Closes the gap that historically let features sit on staging for
# weeks waiting for a bulk promotion PR (see molecule-core#1496 for the
# 1172-commit example).
#
# 2026-04-28 rewrite (PR #142): the previous version did a direct
# `git merge --ff-only origin staging && git push origin main`. That
# breaks against main's branch-protection ruleset, which requires
# status checks "set by the expected GitHub apps" — direct pushes
# can't satisfy that condition (only PR merges through the queue can).
# The workflow was failing every tick with:
# remote: error: GH006: Protected branch update failed for refs/heads/main.
# remote: - Required status checks ... were not set by the expected GitHub apps.
# Fix: mirror the PR-based pattern from auto-sync-main-to-staging.yml
# (the reverse-direction sync, fixed in #2234 for the same reason).
# Both directions now use the same merge-queue path that humans use,
# no special-case bypass.
#
# Safety model:
# - Runs ONLY on workflow_run events for the staging branch.
# - Requires EVERY named gate workflow to have the same head_sha and
# all be `conclusion == success`. If any of them is red, skipped,
# cancelled, or pending, we abort (stay on the current main).
# - The PR base=main head=staging path lets GitHub itself enforce
# branch protection. If main has diverged from staging or required
# checks aren't satisfied, the merge queue declines the PR — no
# need for a manual ff-only ancestry check here.
# - Loop safety: the auto-sync-main-to-staging workflow fires when
# main lands the auto-promote PR, but its merge into staging is by
# GITHUB_TOKEN which doesn't trigger downstream workflow_run events
# (GitHub Actions safety). So this workflow doesn't re-fire from
# its own promote landing.
#
# Toggle via repo variable AUTO_PROMOTE_ENABLED (true/unset). When
# unset, the workflow logs what it would have done but doesn't open
# the PR — useful for dry-running the gate logic without surfacing
# a noisy PR while staging CI is still flaky.
#
# **One-time repo setting (load-bearing):** this workflow opens the
# staging→main PR via `gh pr create` using the default GITHUB_TOKEN.
# Since GitHub's 2022 default change, that token cannot create or
# approve PRs unless the repo opts in. The toggle is at:
#
# Settings → Actions → General → Workflow permissions
# → ✅ Allow GitHub Actions to create and approve pull requests
#
# Without it, every workflow_run fails with:
#
# pull request create failed: GraphQL: GitHub Actions is not
# permitted to create or approve pull requests (createPullRequest)
#
# Observed 2026-04-29 01:43 UTC blocking promotion of fcd87b9 (PRs
# #2248 + #2249); manually bridged via PR #2252. Re-check this
# setting if auto-promote starts failing with createPullRequest
# errors after a repo or org admin change.
on:
workflow_run:
workflows:
- CI
- E2E Staging Canvas (Playwright)
- E2E API Smoke Test
- CodeQL
types: [completed]
workflow_dispatch:
inputs:
force:
description: "Force promote even when AUTO_PROMOTE_ENABLED is unset (manual override)"
required: false
default: "false"
permissions:
contents: write
pull-requests: write
# actions: write is needed by the post-merge dispatch tail step
# (#2358 / #2357) — `gh workflow run publish-workspace-server-image.yml`
# POSTs to /actions/workflows/.../dispatches which requires this scope.
# Without it the call 403s and the publish/canary/redeploy chain still
# doesn't run on staging→main promotions, undoing #2358.
actions: write
# Serialize auto-promote runs. Multiple staging gate completions can land
# in quick succession (CI + E2E + CodeQL all finish within seconds of
# each other on a green PR) — without this, two parallel runs both:
# 1. Open / re-use the same promote PR.
# 2. Both call `gh pr merge --auto` (idempotent — fine).
# 3. Both poll for the same mergedAt and both `gh workflow run` publish
# → 2× redundant publish builds racing for the same `:staging-latest`
# retag, and 2× canary-verify chains.
# cancel-in-progress: false because we don't want a brand-new run to kill
# a polling-tail that's about to dispatch — the polling tail's 30 min cap
# is the right backstop, not workflow-level cancel.
concurrency:
group: auto-promote-staging
cancel-in-progress: false
jobs:
check-all-gates-green:
# Only consider staging pushes. PRs into staging don't promote.
if: >
(github.event_name == 'workflow_run' &&
github.event.workflow_run.head_branch == 'staging' &&
github.event.workflow_run.event == 'push')
|| github.event_name == 'workflow_dispatch'
runs-on: ubuntu-latest
outputs:
all_green: ${{ steps.gates.outputs.all_green }}
head_sha: ${{ steps.gates.outputs.head_sha }}
steps:
# Skip empty-tree promotes (the perpetual auto-promote↔auto-sync cycle
# observed 2026-05-03). Sequence: auto-promote merges via the staging
# merge-queue's MERGE strategy, creating a merge commit on main that
# staging doesn't have. auto-sync then merges main back into staging
# via another merge commit (the queue's MERGE strategy applies on
# the staging side too, even when the workflow's local FF would
# have sufficed). Now staging has a new merge-commit SHA whose
# tree == main's tree — but auto-promote sees "staging ahead of
# main by 1" and opens YET another empty promote PR. Each round
# costs ~30-40 min wallclock, ~2 manual approvals, and burns a
# full CodeQL Go run (~15 min). Without this guard the cycle
# repeats indefinitely.
#
# Long-term fix is to switch the merge_queue ruleset's
# `merge_method` away from MERGE so FF-able PRs land cleanly,
# but that's a broader change affecting every staging PR's
# commit shape. This guard is the one-line surgical fix that
# breaks the cycle without touching merge-queue config.
#
# Fail-open: if `git diff` errors for any reason, fall through
# to the gate check (preserve existing behavior). Only skip
# when the diff is DEFINITIVELY empty.
- name: Checkout for tree-diff check
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 0
ref: staging
- name: Skip if staging tree == main tree (perpetual-cycle break)
id: tree-diff
env:
HEAD_SHA: ${{ github.event.workflow_run.head_sha || github.sha }}
run: |
set -eu
git fetch origin main --depth=50 || { echo "::warning::git fetch main failed — proceeding (fail-open)"; exit 0; }
# Compare staging tip's tree against main's tree. `git diff
# --quiet` exits 0 if no differences, 1 if there are.
if git diff --quiet origin/main "$HEAD_SHA" -- 2>/dev/null; then
{
echo "## ⏭ Skipped — no code to promote"
echo
echo "staging tip (\`${HEAD_SHA:0:8}\`) and \`main\` have identical trees."
echo "This is the auto-promote↔auto-sync merge-commit cycle: staging has a"
echo "new SHA (a sync-back merge commit) but the underlying file tree is"
echo "already on main, so there's no real code to ship."
echo
echo "Skipping to avoid opening an empty promote PR. Cycle terminates here."
} >> "$GITHUB_STEP_SUMMARY"
echo "::notice::auto-promote: staging tree == main tree — no code to promote, skipping"
echo "skip=true" >> "$GITHUB_OUTPUT"
else
echo "skip=false" >> "$GITHUB_OUTPUT"
fi
- name: Check all required gates on this SHA
if: steps.tree-diff.outputs.skip != 'true'
id: gates
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
HEAD_SHA: ${{ github.event.workflow_run.head_sha || github.sha }}
REPO: ${{ github.repository }}
run: |
set -euo pipefail
# Required gate workflow files. Use file paths (relative to
# .github/workflows/) rather than display names because:
#
# 1. `gh run list --workflow=<name>` is ambiguous when two
# workflows have the same `name:` — observed 2026-04-28
# with "CodeQL" matching both `codeql.yml` (explicit) and
# GitHub's UI-configured Code-quality default setup
# (internal "codeql"). gh CLI returns "could not resolve
# to a unique workflow" → empty result → gate evaluated
# as missing/none → auto-promote dead-locked despite all
# checks actually passing.
#
# 2. File paths are the unique identifier for workflows;
# `name:` is just a display string and can collide.
#
# When adding/removing a gate, update this list AND the
# branch-protection required-checks list (which uses check-run
# display names, not workflow names; the two are decoupled and
# should be kept in sync manually).
GATES=(
"ci.yml"
"e2e-staging-canvas.yml"
"e2e-api.yml"
"codeql.yml"
)
echo "head_sha=${HEAD_SHA}" >> "$GITHUB_OUTPUT"
echo "Checking gates on SHA ${HEAD_SHA}"
ALL_GREEN=true
for gate in "${GATES[@]}"; do
# Query the most recent run of this workflow on this SHA.
# event=push to avoid picking up PR runs. branch=staging to
# guard against someone dispatching the gate on a non-staging
# branch at the same SHA.
RESULT=$(gh run list \
--repo "$REPO" \
--workflow "$gate" \
--branch staging \
--event push \
--commit "$HEAD_SHA" \
--limit 1 \
--json status,conclusion \
--jq '.[0] | "\(.status)/\(.conclusion // "none")"' \
2>/dev/null || echo "missing/none")
echo " $gate → $RESULT"
# Only completed/success counts. completed/failure or
# in_progress/anything or no record at all = abort.
if [ "$RESULT" != "completed/success" ]; then
ALL_GREEN=false
fi
done
echo "all_green=${ALL_GREEN}" >> "$GITHUB_OUTPUT"
if [ "$ALL_GREEN" != "true" ]; then
echo "::notice::auto-promote: not all gates are green on ${HEAD_SHA} — staying on current main"
fi
promote:
needs: check-all-gates-green
if: needs.check-all-gates-green.outputs.all_green == 'true'
runs-on: ubuntu-latest
steps:
- name: Check rollout gate
env:
AUTO_PROMOTE_ENABLED: ${{ vars.AUTO_PROMOTE_ENABLED }}
FORCE_INPUT: ${{ github.event.inputs.force }}
run: |
set -eu
# Repo variable AUTO_PROMOTE_ENABLED=true flips this on. While
# it's unset, the workflow dry-runs (logs what it would have
# done) but doesn't open the promote PR. Set the variable in
# Settings → Secrets and variables → Actions → Variables.
if [ "${AUTO_PROMOTE_ENABLED:-}" != "true" ] && [ "${FORCE_INPUT:-false}" != "true" ]; then
{
echo "## ⏸ Auto-promote disabled"
echo
echo "Repo variable \`AUTO_PROMOTE_ENABLED\` is not set to \`true\`."
echo "All gates are green on staging; would have opened a promote PR to \`main\`."
echo
echo "To enable: Settings → Secrets and variables → Actions → Variables → \`AUTO_PROMOTE_ENABLED=true\`."
echo "To test once manually: workflow_dispatch with \`force=true\`."
} >> "$GITHUB_STEP_SUMMARY"
echo "::notice::auto-promote disabled — dry run only"
exit 0
fi
# Mint the App token BEFORE the promote-PR step so the auto-merge
# call can use it. GITHUB_TOKEN-initiated merges suppress the
# downstream `push` event on main, breaking the
# publish-workspace-server-image → canary-verify → redeploy-tenants
# chain (issue #2357). Using the App token here means the
# merge-queue-landed merge IS able to fire the cascade naturally;
# the polling tail below stays as defense-in-depth.
- name: Mint App token for promote-PR + downstream dispatch
if: ${{ vars.AUTO_PROMOTE_ENABLED == 'true' || github.event.inputs.force == 'true' }}
id: app-token
uses: actions/create-github-app-token@1b10c78c7865c340bc4f6099eb2f838309f1e8c3 # v3.1.1
with:
app-id: ${{ secrets.MOLECULE_AI_APP_ID }}
private-key: ${{ secrets.MOLECULE_AI_APP_PRIVATE_KEY }}
- name: Open (or reuse) staging → main promote PR + enable auto-merge
if: ${{ vars.AUTO_PROMOTE_ENABLED == 'true' || github.event.inputs.force == 'true' }}
env:
GH_TOKEN: ${{ steps.app-token.outputs.token }}
REPO: ${{ github.repository }}
TARGET_SHA: ${{ needs.check-all-gates-green.outputs.head_sha }}
run: |
set -euo pipefail
# Look for an existing open promote PR (idempotent on re-run
# of the workflow). The PR's head IS the staging branch — the
# whole point is "advance main to staging's tip", so we don't
# need a per-SHA branch like auto-sync-main-to-staging uses.
PR_NUM=$(gh pr list --repo "$REPO" \
--base main --head staging --state open \
--json number --jq '.[0].number // ""')
if [ -z "$PR_NUM" ]; then
TITLE="staging → main: auto-promote ${TARGET_SHA:0:7}"
BODY_FILE=$(mktemp)
cat > "$BODY_FILE" <<EOFBODY
Automated promotion of \`staging\` (\`${TARGET_SHA:0:8}\`) to \`main\`. All required staging gates green at this SHA: CI, E2E Staging Canvas, E2E API Smoke, CodeQL.
This PR is auto-generated by \`.github/workflows/auto-promote-staging.yml\` whenever every required gate completes green on the same staging SHA. It exists because main's branch protection requires status checks "set by the expected GitHub apps" — direct \`git push\` from a workflow can't satisfy that, only PR merges through the queue can.
Merge queue lands this; no human action needed unless gates fail. Reverse-direction sync (the merge commit on main → staging) is handled by \`auto-sync-main-to-staging.yml\`.
EOFBODY
PR_URL=$(gh pr create --repo "$REPO" \
--base main --head staging \
--title "$TITLE" \
--body-file "$BODY_FILE")
PR_NUM=$(echo "$PR_URL" | grep -oE '[0-9]+$' | tail -1)
rm -f "$BODY_FILE"
echo "::notice::Opened PR #${PR_NUM}"
else
echo "::notice::Re-using existing promote PR #${PR_NUM}"
fi
# Enable auto-merge — the merge queue picks it up once
# required gates are green on the merge_group ref.
if ! gh pr merge "$PR_NUM" --repo "$REPO" --auto --merge 2>&1; then
echo "::warning::Failed to enable auto-merge on PR #${PR_NUM} — operator may need to merge manually."
fi
{
echo "## ✅ Auto-promote PR opened"
echo
echo "- Source: staging at \`${TARGET_SHA:0:8}\`"
echo "- PR: #${PR_NUM}"
echo
echo "Merge queue lands the PR once required gates are green; no human action needed unless gates fail."
} >> "$GITHUB_STEP_SUMMARY"
# Hand the PR number to the next step so we can dispatch the
# tenant-redeploy chain after the merge queue lands the merge.
echo "promote_pr_num=${PR_NUM}" >> "$GITHUB_OUTPUT"
id: promote_pr
# The App token minted above (before the promote-PR step) is
# also used by the polling tail below. Defense-in-depth: with
# the merge-queue-landed merge now using the App token, the
# main-branch push event SHOULD fire the publish/canary/redeploy
# cascade naturally — but if for any reason it doesn't (e.g. an
# unrelated event-suppression edge case), the explicit dispatches
# below still wake the chain.
- name: Wait for promote merge, then dispatch publish + redeploy (#2357)
# Defense-in-depth dispatch. With the auto-merge call above
# now using the App token (this commit), the merge-queue-landed
# merge SHOULD fire publish-workspace-server-image naturally
# via on:push:[main] — App-token-initiated pushes DO trigger
# workflow_run cascades, unlike GITHUB_TOKEN-initiated ones
# (the documented "no recursion" rule —
# https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow).
#
# This explicit dispatch stays as belt-and-suspenders for any
# edge case where the natural cascade misfires. If it never
# observably fires after this token swap (i.e. the publish
# workflow has already started by the time we get here), the
# second dispatch is a harmless no-op (publish-workspace-server-image
# has its own concurrency group that dedupes).
#
# See PR for #2357: pre-fix the merge action was via
# GITHUB_TOKEN, suppressing the cascade and forcing this tail
# to be the SOLE chain trigger. With the auto-merge token swap
# the tail becomes redundant in the happy path; keep until
# we've observed >=10 successful natural cascades, then drop.
if: steps.promote_pr.outputs.promote_pr_num != ''
env:
GH_TOKEN: ${{ steps.app-token.outputs.token }}
REPO: ${{ github.repository }}
PR_NUM: ${{ steps.promote_pr.outputs.promote_pr_num }}
run: |
# Poll for merge — max 30 min (60 × 30s). The merge queue
# typically lands within 5-10 min when gates are green. Break
# early if the PR is closed without merging (operator action,
# gates flipped red post-approval, branch-protection rejection)
# so we don't tie up a runner for the full 30 min on a dead PR.
MERGED=""
STATE=""
for _ in $(seq 1 60); do
VIEW=$(gh pr view "$PR_NUM" --repo "$REPO" --json mergedAt,state)
MERGED=$(echo "$VIEW" | jq -r '.mergedAt // ""')
STATE=$(echo "$VIEW" | jq -r '.state // ""')
if [ -n "$MERGED" ] && [ "$MERGED" != "null" ]; then
echo "::notice::Promote PR #${PR_NUM} merged at ${MERGED}"
break
fi
if [ "$STATE" = "CLOSED" ]; then
echo "::warning::Promote PR #${PR_NUM} was closed without merging — skipping deploy dispatch."
exit 0
fi
sleep 30
done
if [ -z "$MERGED" ] || [ "$MERGED" = "null" ]; then
echo "::warning::Promote PR #${PR_NUM} didn't merge within 30min — skipping deploy dispatch (manually run \`gh workflow run publish-workspace-server-image.yml --ref main\` once it lands)."
exit 0
fi
# Dispatch publish on main using the App token. App-initiated
# workflow_dispatch DOES propagate the workflow_run cascade,
# unlike GITHUB_TOKEN-initiated dispatch.
# publish completes → canary-verify chains via workflow_run →
# redeploy-tenants-on-main chains via workflow_run + branches:[main].
if gh workflow run publish-workspace-server-image.yml \
--repo "$REPO" --ref main 2>&1; then
echo "::notice::Dispatched publish-workspace-server-image on ref=main as molecule-ai App — canary-verify and redeploy-tenants-on-main will chain via workflow_run."
{
echo "## 🚀 Tenant redeploy chain dispatched"
echo
echo "- publish-workspace-server-image (workflow_dispatch on \`main\`, actor: \`molecule-ai[bot]\`)"
echo "- canary-verify will chain on completion"
echo "- redeploy-tenants-on-main will chain on canary green"
} >> "$GITHUB_STEP_SUMMARY"
else
echo "::error::Failed to dispatch publish-workspace-server-image. Run manually: gh workflow run publish-workspace-server-image.yml --ref main"
fi
# ALSO dispatch auto-sync-main-to-staging.yml. Same root cause as
# publish above (issue #2357): the merge-queue-initiated push to
# main is by GITHUB_TOKEN → no `on: push` triggers fire downstream.
# Without this dispatch, every staging→main promote leaves staging
# one merge commit BEHIND main, which silently dead-locks the NEXT
# promote PR as `mergeStateStatus: BEHIND` because main's
# branch-protection has `strict: true`. Verified empirically on
# 2026-05-02 against PR #2442 (Phase 2 promote): only the explicit
# publish-workspace-server-image dispatch fired on the previous
# promote SHA 76c604fb, while auto-sync silently no-op'd, leaving
# staging behind for ~24h until manually bridged.
if gh workflow run auto-sync-main-to-staging.yml \
--repo "$REPO" --ref main 2>&1; then
echo "::notice::Dispatched auto-sync-main-to-staging on ref=main as molecule-ai App — staging will absorb the new main merge commit via PR + merge queue."
else
echo "::error::Failed to dispatch auto-sync-main-to-staging. Run manually: gh workflow run auto-sync-main-to-staging.yml --ref main"
fi
@@ -0,0 +1,83 @@
name: auto-promote-stale-alarm
# Hourly cron + on-demand alarm for the silent-block failure mode that
# motivated issue #2975:
# - The auto-promote-staging.yml workflow opened a PR + armed
# auto-merge, but main's branch protection requires a human review
# (reviewDecision=REVIEW_REQUIRED). The PR sat BLOCKED with no
# surface-up-the-stack for 12+ hours, holding 25 commits hostage
# including the Memory v2 redesign and a reno-stars data-loss fix.
#
# This workflow runs `scripts/check-stale-promote-pr.sh` against the
# repo's open auto-promote PRs (base=main head=staging). When a PR has
# been BLOCKED on REVIEW_REQUIRED for >4h, it:
# 1. Emits a workflow-level warning (visible in run summary + the
# Actions UI feed).
# 2. Posts a comment on the PR (idempotent — one alarm per PR).
#
# The detection logic lives in scripts/check-stale-promote-pr.sh so
# it's unit-testable with stubbed `gh` (see test-check-stale-promote-pr.sh).
# This file is the schedule + invocation surface only — SSOT for the
# detector itself.
on:
schedule:
# Hourly. Cheap (one `gh pr list` + jq), and 1h granularity is
# plenty for a 4h staleness threshold — operators see the alarm
# within at most 1h of crossing the threshold.
- cron: "27 * * * *" # at :27 to dodge the cron herd at :00
workflow_dispatch:
inputs:
stale_hours:
description: "Hours after which a BLOCKED+REVIEW_REQUIRED PR is stale (default 4)"
required: false
default: "4"
post_comment:
description: "Post a comment on stale PRs (default true)"
required: false
default: "true"
permissions:
contents: read
pull-requests: write # post comments on stale PRs
# Serialize so the on-demand and scheduled runs don't double-comment
# the same PR. cancel-in-progress=false because the script is idempotent
# (existing comment marker prevents dupes), but a scheduled run firing
# while a manual one runs would just re-list the same PR set.
concurrency:
group: auto-promote-stale-alarm
cancel-in-progress: false
jobs:
scan:
runs-on: ubuntu-latest
steps:
- name: Checkout (need scripts/ only)
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
sparse-checkout: |
scripts/check-stale-promote-pr.sh
sparse-checkout-cone-mode: false
- name: Run stale-PR detector
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
GITHUB_REPOSITORY: ${{ github.repository }}
STALE_HOURS: ${{ inputs.stale_hours || '4' }}
POST_COMMENT: ${{ inputs.post_comment || 'true' }}
run: |
# The script's exit code reflects the count of stale PRs.
# We don't want a stale finding to fail the workflow run —
# the warning + comment are the signal, the green/red is
# noise. So convert any non-zero exit to a workflow notice
# and exit 0.
set +e
bash scripts/check-stale-promote-pr.sh
rc=$?
set -e
if [ "$rc" -ne 0 ]; then
echo "::notice::Stale PR detector found $rc PR(s) needing attention. See warnings above + comments on the PRs."
fi
# Always succeed — operator-facing surface is the warning,
# not the workflow status.
exit 0
@@ -0,0 +1,237 @@
name: Auto-sync main → staging
# Reflects every push to `main` back onto `staging` so the
# staging-as-superset-of-main invariant holds.
#
# Background:
#
# `auto-promote-staging.yml` advances main via `git merge --ff-only`
# + `git push origin main` — that's a clean fast-forward, no merge
# commit. But manual merges of `staging → main` PRs through the
# GitHub UI / API create a merge commit on main that staging
# doesn't have. The next `staging → main` PR then evaluates as
# "BEHIND" because staging is missing that merge commit, requiring
# a manual `gh pr update-branch` round-trip.
#
# This happened twice on 2026-04-28 (PRs #2202, #2205, both manual
# bridges). Each time the bridge needed update-branch + a re-CI
# round before merging. Operationally annoying and avoidable.
#
# Architecture:
#
# This repo's `staging` branch is protected by a `merge_queue`
# ruleset (id 15500102) that blocks ALL direct pushes — no bypass
# even for org admins or the GitHub Actions integration. Direct
# `git push origin staging` returns GH013. So instead of pushing
# directly, this workflow:
#
# 1. Checks if main is already in staging's ancestry → no-op.
# 2. Creates an `auto-sync/main-<sha>` branch from staging.
# 3. Tries `git merge --ff-only origin/main` → if staging hasn't
# diverged this is a clean ff.
# 4. Otherwise `git merge --no-ff origin/main` to absorb main's
# tip while keeping staging's history.
# 5. Pushes the auto-sync branch.
# 6. Opens a PR (base=staging, head=auto-sync/main-<sha>) and
# enables auto-merge so the merge queue lands it.
#
# This mirrors the path human PRs take through staging — same
# rules, same gates, no special-case bypass.
#
# Loop safety:
#
# `GITHUB_TOKEN`-authored merges (including the merge queue's land
# of the auto-sync PR) do NOT trigger downstream workflow runs
# (GitHub Actions safety). So when the auto-sync PR lands on
# staging, `auto-promote-staging.yml` is NOT triggered by that
# push. The next developer push to staging triggers auto-promote
# normally. No loop possible.
#
# Concurrency:
#
# Two pushes to main in quick succession (e.g., manual UI merge
# immediately followed by auto-promote-staging's ff-merge) could
# otherwise open two overlapping auto-sync PRs. The concurrency
# group serializes runs; the second waits for the first to exit.
# (The first run exits after opening + auto-merge-queueing the PR,
# not after the merge actually completes — so multiple PRs can be
# open simultaneously, but the merge queue handles them serially.)
on:
push:
branches: [main]
# workflow_dispatch lets:
# 1. Operators manually backfill a missed sync (e.g. after a manual
# UI merge that the runner missed).
# 2. auto-promote-staging.yml's polling tail explicitly invoke us
# after the promote PR lands. This is load-bearing: when the
# merge queue lands a promote-PR merge, the resulting push to
# `main` is "by GITHUB_TOKEN", and per GitHub's no-recursion
# rule (https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow)
# that push event does NOT fire any downstream workflows. The
# `on: push` trigger above is silently dead for the very pattern
# we exist to handle. Verified empirically 2026-05-02 against
# SHA 76c604fb (PR #2437 staging→main): only ONE workflow fired
# (publish-workspace-server-image, dispatched explicitly by
# auto-promote's polling tail with an App token). Every other
# `on: push: branches: [main]` workflow — including this one —
# was suppressed. Until the underlying merge call moves to an
# App token, an explicit dispatch is the only reliable path.
workflow_dispatch:
permissions:
contents: write
pull-requests: write
concurrency:
group: auto-sync-main-to-staging
cancel-in-progress: false
jobs:
sync-staging:
# ubuntu-latest matches every other workflow in this repo. The
# earlier `[self-hosted, macos, arm64]` was a copy-paste artefact
# from the molecule-controlplane repo (which IS private and uses a
# Mac runner) — molecule-core has no Mac runner registered, so the
# job sat unassigned whenever the trigger fired. Verified 2026-05-02:
# this is the ONLY workflow in molecule-core/.github/workflows/ with
# a non-ubuntu runs-on.
runs-on: ubuntu-latest
steps:
- name: Checkout staging
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 0
ref: staging
token: ${{ secrets.GITHUB_TOKEN }}
- name: Configure git author
run: |
git config user.name "github-actions[bot]"
git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
- name: Check if staging already contains main
id: check
run: |
set -euo pipefail
git fetch origin main
if git merge-base --is-ancestor origin/main HEAD; then
echo "needs_sync=false" >> "$GITHUB_OUTPUT"
{
echo "## ✅ No-op"
echo
echo "staging already contains \`origin/main\` ($(git rev-parse --short=8 origin/main))."
} >> "$GITHUB_STEP_SUMMARY"
else
echo "needs_sync=true" >> "$GITHUB_OUTPUT"
MAIN_SHORT=$(git rev-parse --short=8 origin/main)
echo "main_short=${MAIN_SHORT}" >> "$GITHUB_OUTPUT"
echo "branch=auto-sync/main-${MAIN_SHORT}" >> "$GITHUB_OUTPUT"
echo "::notice::staging is missing main's tip (${MAIN_SHORT}) — opening sync PR"
fi
- name: Create auto-sync branch + merge main
if: steps.check.outputs.needs_sync == 'true'
id: prep
run: |
set -euo pipefail
BRANCH="${{ steps.check.outputs.branch }}"
# If a previous auto-sync run already opened a branch for the
# same main sha, prefer reusing it (idempotent behavior on
# workflow restart). Force-update from latest staging anyway
# so it absorbs any staging-side commits that landed since.
git checkout -B "$BRANCH"
if git merge --ff-only origin/main; then
echo "did_ff=true" >> "$GITHUB_OUTPUT"
echo "::notice::Fast-forwarded ${BRANCH} to origin/main"
else
echo "did_ff=false" >> "$GITHUB_OUTPUT"
if ! git merge --no-ff origin/main -m "chore: sync main → staging (auto)"; then
# Hygiene: leave the work tree clean before failing.
git merge --abort || true
{
echo "## ❌ Conflict"
echo
echo "Auto-merge \`main → staging\` failed with conflicts."
echo "A human needs to resolve manually."
} >> "$GITHUB_STEP_SUMMARY"
exit 1
fi
fi
- name: Push auto-sync branch
if: steps.check.outputs.needs_sync == 'true'
run: |
set -euo pipefail
# Force-with-lease so a concurrent auto-sync run can't
# silently clobber an in-flight branch we just updated. If a
# different writer touched the branch, we abort and the next
# run picks up the latest state.
git push --force-with-lease origin "${{ steps.check.outputs.branch }}"
- name: Open auto-sync PR + enable auto-merge
if: steps.check.outputs.needs_sync == 'true'
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
BRANCH: ${{ steps.check.outputs.branch }}
MAIN_SHORT: ${{ steps.check.outputs.main_short }}
DID_FF: ${{ steps.prep.outputs.did_ff }}
run: |
set -euo pipefail
# Find existing PR for this branch (idempotent on workflow
# restart) before creating a new one.
PR_NUM=$(gh pr list --head "$BRANCH" --base staging --state open --json number --jq '.[0].number // ""')
if [ -z "$PR_NUM" ]; then
# Body lives in a temp file to keep the multi-line content
# out of the YAML block scalar (un-indented newlines inside
# an inline shell string break YAML parsing).
BODY_FILE=$(mktemp)
if [ "$DID_FF" = "true" ]; then
TITLE="chore: sync main → staging (auto, ff to ${MAIN_SHORT})"
cat > "$BODY_FILE" <<EOFBODY
Automated fast-forward of \`staging\` to \`origin/main\` (\`${MAIN_SHORT}\`). Staging has no in-flight commits that diverge from main. Merge queue lands this; no human action needed.
This PR is auto-generated by \`.github/workflows/auto-sync-main-to-staging.yml\` on every push to \`main\`. It exists because this repo's \`staging\` branch has a \`merge_queue\` ruleset that blocks direct pushes — even from the GitHub Actions integration.
EOFBODY
else
TITLE="chore: sync main → staging (auto, merge ${MAIN_SHORT})"
cat > "$BODY_FILE" <<EOFBODY
Automated merge of \`origin/main\` (\`${MAIN_SHORT}\`) into \`staging\`. Staging has commits main doesn't, so this is a non-ff merge that absorbs main's tip. Merge queue lands this.
This PR is auto-generated by \`.github/workflows/auto-sync-main-to-staging.yml\` on every push to \`main\`.
EOFBODY
fi
# gh pr create prints the URL on stdout; extract the PR number.
PR_URL=$(gh pr create \
--base staging \
--head "$BRANCH" \
--title "$TITLE" \
--body-file "$BODY_FILE")
PR_NUM=$(echo "$PR_URL" | grep -oE '[0-9]+$' | tail -1)
rm -f "$BODY_FILE"
echo "::notice::Opened PR #${PR_NUM}"
else
echo "::notice::Re-using existing PR #${PR_NUM} for ${BRANCH}"
fi
# Enable auto-merge — the merge queue picks it up once
# required gates are green. Use --merge for merge commits
# (matches the rest of this repo's PR convention).
if ! gh pr merge "$PR_NUM" --auto --merge 2>&1; then
echo "::warning::Failed to enable auto-merge on PR #${PR_NUM} — operator may need to merge manually."
fi
{
echo "## ✅ Auto-sync PR opened"
echo
echo "- Branch: \`$BRANCH\`"
echo "- PR: #$PR_NUM"
echo "- Strategy: $([ "$DID_FF" = "true" ] && echo "ff" || echo "merge commit")"
echo
echo "Merge queue lands the PR once required gates are green; no human action needed unless gates fail."
} >> "$GITHUB_STEP_SUMMARY"
+113
View File
@@ -0,0 +1,113 @@
name: auto-tag-runtime
# Auto-tag runtime releases on every merge to main that touches workspace/.
# This is the entry point of the runtime CD chain:
#
# merge PR → auto-tag-runtime (this) → publish-runtime → cascade → template
# image rebuilds → repull on hosts.
#
# Default bump is patch. Override via PR label `release:minor` or
# `release:major` BEFORE merging — the label is read off the merged PR
# associated with the push commit.
#
# Skips when:
# - The push isn't to main (other branches don't auto-release).
# - The merge commit message contains `[skip-release]` (escape hatch
# for cleanup PRs that touch workspace/ but shouldn't ship).
on:
push:
branches: [main]
paths:
- "workspace/**"
- "scripts/build_runtime_package.py"
- ".github/workflows/auto-tag-runtime.yml"
- ".github/workflows/publish-runtime.yml"
permissions:
contents: write # to push the new tag
pull-requests: read # to read labels off the merged PR
concurrency:
# Serialize tag bumps so two near-simultaneous merges can't both think
# they're 0.1.6 and race to push the same tag.
group: auto-tag-runtime
cancel-in-progress: false
jobs:
tag:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 0 # need full tag history for `git describe` / sort
- name: Skip when commit asks
id: skip
run: |
MSG=$(git log -1 --format=%B "${{ github.sha }}")
if echo "$MSG" | grep -qiE '\[skip-release\]|\[no-release\]'; then
echo "skip=true" >> "$GITHUB_OUTPUT"
echo "Commit message contains [skip-release] — no tag will be created."
else
echo "skip=false" >> "$GITHUB_OUTPUT"
fi
- name: Determine bump kind from PR label
id: bump
if: steps.skip.outputs.skip != 'true'
env:
GH_TOKEN: ${{ github.token }}
run: |
# The merged PR for this push commit. `gh pr list --search` finds
# closed PRs whose merge commit matches; we take the first.
PR=$(gh pr list --state merged --search "${{ github.sha }}" --json number,labels --jq '.[0]' 2>/dev/null || echo "")
if [ -z "$PR" ] || [ "$PR" = "null" ]; then
echo "No merged PR found for ${{ github.sha }} — defaulting to patch bump."
echo "kind=patch" >> "$GITHUB_OUTPUT"
exit 0
fi
LABELS=$(echo "$PR" | jq -r '.labels[].name')
if echo "$LABELS" | grep -qx 'release:major'; then
echo "kind=major" >> "$GITHUB_OUTPUT"
elif echo "$LABELS" | grep -qx 'release:minor'; then
echo "kind=minor" >> "$GITHUB_OUTPUT"
else
echo "kind=patch" >> "$GITHUB_OUTPUT"
fi
- name: Compute next version from latest runtime-v* tag
id: version
if: steps.skip.outputs.skip != 'true'
run: |
# Find the highest runtime-vX.Y.Z tag. `sort -V` handles semver
# ordering; `grep` filters to the right tag prefix.
LATEST=$(git tag --list 'runtime-v*' | sort -V | tail -1)
if [ -z "$LATEST" ]; then
# No prior tag — start the runtime line at 0.1.0.
CURRENT="0.0.0"
else
CURRENT="${LATEST#runtime-v}"
fi
MAJOR=$(echo "$CURRENT" | cut -d. -f1)
MINOR=$(echo "$CURRENT" | cut -d. -f2)
PATCH=$(echo "$CURRENT" | cut -d. -f3)
case "${{ steps.bump.outputs.kind }}" in
major) MAJOR=$((MAJOR+1)); MINOR=0; PATCH=0;;
minor) MINOR=$((MINOR+1)); PATCH=0;;
patch) PATCH=$((PATCH+1));;
esac
NEW="$MAJOR.$MINOR.$PATCH"
echo "current=$CURRENT" >> "$GITHUB_OUTPUT"
echo "new=$NEW" >> "$GITHUB_OUTPUT"
echo "Bumping runtime $CURRENT → $NEW (${{ steps.bump.outputs.kind }})"
- name: Push new tag
if: steps.skip.outputs.skip != 'true'
run: |
NEW_TAG="runtime-v${{ steps.version.outputs.new }}"
git config user.name "github-actions[bot]"
git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
git tag -a "$NEW_TAG" -m "runtime $NEW_TAG (auto-bump from ${{ steps.bump.outputs.kind }})"
git push origin "$NEW_TAG"
echo "Pushed $NEW_TAG — publish-runtime workflow will fire on the tag."
@@ -1,16 +1,5 @@
name: Block internal-flavored paths
# Ported from .github/workflows/block-internal-paths.yml on 2026-05-11 per
# RFC internal#219 §1 sweep.
#
# Differences from the GitHub version:
# - Dropped `merge_group: { types: [checks_requested] }` (Gitea has no
# merge queue; no `gh-readonly-queue/...` refs).
# - Workflow-level env.GITHUB_SERVER_URL set per
# feedback_act_runner_github_server_url.
# - `continue-on-error: true` on the job (RFC §1 contract — surface
# defects without blocking; follow-up PR flips after triage).
#
# Hard CI gate. Internal content (positioning, competitive briefs, sales
# playbooks, PMM/press drip, draft campaigns) lives in molecule-ai/internal —
# this public monorepo must never re-acquire those paths. CEO directive
@@ -26,19 +15,16 @@ on:
types: [opened, synchronize, reopened]
push:
branches: [main, staging]
env:
GITHUB_SERVER_URL: https://git.moleculesai.app
# Required for GitHub merge queue: the queue's pre-merge CI run on
# `gh-readonly-queue/...` refs needs this check to fire so the queue
# gets a real result instead of stalling forever AWAITING_CHECKS.
merge_group:
types: [checks_requested]
jobs:
check:
name: Block forbidden paths
runs-on: ubuntu-latest
# Phase 3 (RFC #219 §1): surface broken workflows without blocking
# the PR. Follow-up PR flips this off after surfaced defects are
# triaged.
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
@@ -46,19 +32,31 @@ jobs:
# For pull_request events the diff base is github.event.pull_request.base.sha,
# which may be many commits behind HEAD and therefore absent from the
# shallow clone above. Fetch it explicitly (depth=1 keeps it fast).
# shallow clone above. Fetch it explicitly (depth=1 keeps it fast).
- name: Fetch PR base SHA (pull_request events only)
if: github.event_name == 'pull_request'
run: git fetch --depth=1 origin ${{ github.event.pull_request.base.sha }}
# For merge_group events the queue's pre-merge ref is a commit on
# `gh-readonly-queue/...` whose parent is the queue's base_sha.
# That parent isn't part of the queue branch's shallow clone, so
# we fetch it explicitly. Mirrors the equivalent step in
# secret-scan.yml (#2120) — same shallow-clone bug class.
- name: Fetch merge_group base SHA (merge_group events only)
if: github.event_name == 'merge_group'
run: git fetch --depth=1 origin ${{ github.event.merge_group.base_sha }}
- name: Refuse if forbidden paths appear
env:
# Plumb event-specific SHAs through env so the script doesn't
# need conditional `${{ ... }}` interpolation per event type.
# github.event.before/after only exist on push events;
# pull_request has pull_request.base.sha / pull_request.head.sha.
# merge_group has its own base_sha/head_sha; pull_request has
# pull_request.base.sha / pull_request.head.sha.
PR_BASE_SHA: ${{ github.event.pull_request.base.sha }}
PR_HEAD_SHA: ${{ github.event.pull_request.head.sha }}
MG_BASE_SHA: ${{ github.event.merge_group.base_sha }}
MG_HEAD_SHA: ${{ github.event.merge_group.head_sha }}
PUSH_BEFORE: ${{ github.event.before }}
PUSH_AFTER: ${{ github.event.after }}
run: |
@@ -82,6 +80,10 @@ jobs:
BASE="$PR_BASE_SHA"
HEAD="$PR_HEAD_SHA"
;;
merge_group)
BASE="$MG_BASE_SHA"
HEAD="$MG_HEAD_SHA"
;;
*)
BASE="$PUSH_BEFORE"
HEAD="$PUSH_AFTER"
@@ -95,6 +97,9 @@ jobs:
# fetch fails — e.g. the SHA was force-overwritten — we fall
# through to the empty-BASE branch below, which scans the
# entire tree as if every file were new. Correct, just slow.
# Same recovery shape as secret-scan.yml (#2120 — incident
# 2026-04-27 06:50Z block-internal-paths exit 128 with
# "fatal: bad object <sha>" on staging push).
if [ -n "$BASE" ] && ! echo "$BASE" | grep -qE '^0+$'; then
if ! git cat-file -e "$BASE" 2>/dev/null; then
git fetch --depth=1 origin "$BASE" 2>/dev/null || true
@@ -135,15 +140,15 @@ jobs:
echo ""
echo "If your file is genuinely public-facing (e.g. a blog post"
echo "ready to ship), use one of these alternatives instead:"
echo " - Public-bound blog posts: docs/blog/<slug>.md"
echo " - Public-bound tutorials: docs/tutorials/<slug>.md"
echo " - Public devrel content: docs/devrel/<slug>.md"
echo " Public-bound blog posts: docs/blog/<slug>.md"
echo " Public-bound tutorials: docs/tutorials/<slug>.md"
echo " Public devrel content: docs/devrel/<slug>.md"
echo ""
echo "If you legitimately need to add a new top-level path that"
echo "happens to match a forbidden pattern, edit"
echo ".gitea/workflows/block-internal-paths.yml and update the"
echo ".github/workflows/block-internal-paths.yml and update the"
echo "FORBIDDEN_PATTERNS list with reviewer signoff."
exit 1
fi
echo "OK No forbidden paths in this change."
echo " No forbidden paths in this change."
@@ -0,0 +1,81 @@
name: branch-protection drift check
# Catches out-of-band edits to branch protection (UI clicks, manual gh
# api PATCH from a one-off ops session) by comparing live state against
# tools/branch-protection/apply.sh's desired state every day. Fails the
# workflow when they drift; the failure is the signal.
#
# When it fails: re-run apply.sh to put the live state back to the
# script's intent, OR update apply.sh to encode the new intent and
# commit. Either way the script is the source of truth.
on:
schedule:
# 14:00 UTC daily. Off-hours for most teams; gives a fresh signal
# at the start of every working day.
- cron: '0 14 * * *'
workflow_dispatch:
pull_request:
branches: [staging, main]
paths:
- 'tools/branch-protection/**'
- '.github/workflows/branch-protection-drift.yml'
permissions:
contents: read
jobs:
drift:
name: Branch protection drift
runs-on: ubuntu-latest
timeout-minutes: 5
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
# Token strategy by trigger:
#
# - schedule (daily canary): hard-fail when the admin token is
# missing. This is the *only* trigger where silent soft-skip is
# dangerous — a missing secret on the cron run means the drift
# gate has effectively disappeared with no human in the loop to
# notice. Per feedback_schedule_vs_dispatch_secrets_hardening.md
# the rule is "schedule/automated triggers must hard-fail".
#
# - pull_request (touching tools/branch-protection/**): soft-skip
# with a prominent warning. A PR cannot retroactively drift the
# live state — drift happens *between* PRs (UI clicks, manual
# gh api PATCH) and is the schedule's job to catch. The PR-time
# gate would only catch typos in apply.sh, which the apply.sh
# *_payload unit tests catch better. A human is reviewing the
# PR and will see the warning in the workflow log.
#
# - workflow_dispatch (operator one-off): soft-skip with warning,
# so an operator can run a diagnostic without configuring the
# secret first.
- name: Verify admin token present (hard-fail on schedule only)
env:
GH_TOKEN_FOR_ADMIN_API: ${{ secrets.GH_TOKEN_FOR_ADMIN_API }}
run: |
if [[ -n "$GH_TOKEN_FOR_ADMIN_API" ]]; then
echo "GH_TOKEN_FOR_ADMIN_API present — drift_check will run with admin scope."
exit 0
fi
if [[ "${{ github.event_name }}" == "schedule" ]]; then
echo "::error::GH_TOKEN_FOR_ADMIN_API secret missing on the daily canary." >&2
echo "" >&2
echo "The schedule run is the SoT for branch-protection drift detection." >&2
echo "Without admin scope it silently passes, hiding any out-of-band edits." >&2
echo "Set GH_TOKEN_FOR_ADMIN_API at Settings → Secrets and variables → Actions." >&2
exit 1
fi
echo "::warning::GH_TOKEN_FOR_ADMIN_API secret missing — drift_check will be SKIPPED."
echo "::warning::PR drift checks need repo-admin scope to read /branches/:b/protection."
echo "::warning::This is non-fatal: the daily schedule run is the canonical drift gate."
echo "SKIP_DRIFT_CHECK=1" >> "$GITHUB_ENV"
- name: Run drift check
if: env.SKIP_DRIFT_CHECK != '1'
env:
# Repo-admin scope, needed for /branches/:b/protection.
GH_TOKEN: ${{ secrets.GH_TOKEN_FOR_ADMIN_API }}
run: bash tools/branch-protection/drift_check.sh
+318
View File
@@ -0,0 +1,318 @@
name: Canary — staging SaaS smoke (every 30 min)
# Minimum viable health check: provisions one Hermes workspace on a fresh
# staging org, sends one A2A message, verifies PONG, tears down. ~8 min
# wall clock. Pages on failure by opening a GitHub issue; auto-closes the
# issue on the next green run.
#
# The full-SaaS workflow (e2e-staging-saas.yml) covers the broader surface
# but runs only on provisioning-critical pushes + nightly — this one
# catches drift in the 30-min window between those runs (AMI health, CF
# cert rotation, WorkOS session stability, etc.).
#
# Lean mode: E2E_MODE=canary skips the child workspace + HMA memory +
# peers/activity checks. One parent workspace + one A2A turn is enough
# to signal "SaaS stack end-to-end is alive."
on:
schedule:
# Every 30 min. Cron on GitHub-hosted runners has a known drift of
# a few minutes under load — that's fine for a canary.
- cron: '*/30 * * * *'
workflow_dispatch:
# Serialise with the full-SaaS workflow so they don't contend for the
# same org-create quota on staging. Different group key from
# e2e-staging-saas since we don't mind queueing canaries behind one
# full run, but two canaries SHOULD queue against each other.
concurrency:
group: canary-staging
cancel-in-progress: false
permissions:
# Needed to open / close the alerting issue.
issues: write
contents: read
jobs:
canary:
name: Canary smoke
runs-on: ubuntu-latest
# 25 min headroom over the 15-min TLS-readiness deadline in
# tests/e2e/test_staging_full_saas.sh (#2107). Without the buffer
# the job is killed at the wall-clock 15:00 mark BEFORE the bash
# `fail` + diagnostic burst can fire, leaving every cancellation
# silent. Sibling staging E2E jobs run at 20-45 min — keeping
# canary tighter than them so a true wedge still surfaces here
# first.
timeout-minutes: 25
env:
MOLECULE_CP_URL: https://staging-api.moleculesai.app
MOLECULE_ADMIN_TOKEN: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKEN }}
# MiniMax is the canary's PRIMARY LLM auth path post-2026-05-04.
# Switched from hermes+OpenAI after #2578 (the staging OpenAI key
# account went over quota and stayed dead for 36+ hours, taking
# the canary red the entire time). claude-code template's
# `minimax` provider routes ANTHROPIC_BASE_URL to
# api.minimax.io/anthropic and reads MINIMAX_API_KEY at boot —
# ~5-10x cheaper per token than gpt-4.1-mini AND on a separate
# billing account, so OpenAI quota collapse no longer wedges the
# canary. Mirrors the migration continuous-synth-e2e.yml made on
# 2026-05-03 (#265) for the same reason. tests/e2e/test_staging_
# full_saas.sh branches SECRETS_JSON on which key is present —
# MiniMax wins when set.
E2E_MINIMAX_API_KEY: ${{ secrets.MOLECULE_STAGING_MINIMAX_API_KEY }}
# Direct-Anthropic alternative for operators who don't want to
# set up a MiniMax account (priority below MiniMax — first
# non-empty wins in test_staging_full_saas.sh's secrets-injection
# block). See #2578 PR comment for the rationale.
E2E_ANTHROPIC_API_KEY: ${{ secrets.MOLECULE_STAGING_ANTHROPIC_API_KEY }}
# OpenAI fallback — kept wired so an operator-dispatched run with
# E2E_RUNTIME=hermes overridden via workflow_dispatch can still
# exercise the OpenAI path without re-editing the workflow.
E2E_OPENAI_API_KEY: ${{ secrets.MOLECULE_STAGING_OPENAI_KEY }}
E2E_MODE: canary
E2E_RUNTIME: claude-code
# Pin the canary to a specific MiniMax model rather than relying
# on the per-runtime default (which could resolve to "sonnet" →
# direct Anthropic and defeat the cost saving). M2.7-highspeed
# is "Token Plan only" but cheap-per-token and fast.
E2E_MODEL_SLUG: MiniMax-M2.7-highspeed
E2E_RUN_ID: "canary-${{ github.run_id }}"
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: Verify admin token present
run: |
if [ -z "$MOLECULE_ADMIN_TOKEN" ]; then
echo "::error::MOLECULE_STAGING_ADMIN_TOKEN not set"
exit 2
fi
- name: Verify LLM key present
run: |
# Per-runtime key check — claude-code uses MiniMax; hermes /
# langgraph (operator-dispatched only) use OpenAI. Hard-fail
# rather than soft-skip per the lesson from synth E2E #2578:
# an empty key silently falls through to the wrong
# SECRETS_JSON branch and the canary fails 5 min later with
# a confusing auth error instead of the clean "secret
# missing" message at the top.
case "${E2E_RUNTIME}" in
claude-code)
# Either MiniMax OR direct-Anthropic works — first
# non-empty wins in the test script's secrets-injection
# priority chain. Operators only need to set ONE of these
# secrets; we don't force a choice between them.
if [ -n "${E2E_MINIMAX_API_KEY:-}" ]; then
required_secret_name="MOLECULE_STAGING_MINIMAX_API_KEY"
required_secret_value="${E2E_MINIMAX_API_KEY}"
elif [ -n "${E2E_ANTHROPIC_API_KEY:-}" ]; then
required_secret_name="MOLECULE_STAGING_ANTHROPIC_API_KEY"
required_secret_value="${E2E_ANTHROPIC_API_KEY}"
else
required_secret_name="MOLECULE_STAGING_MINIMAX_API_KEY or MOLECULE_STAGING_ANTHROPIC_API_KEY"
required_secret_value=""
fi
;;
langgraph|hermes)
required_secret_name="MOLECULE_STAGING_OPENAI_KEY"
required_secret_value="${E2E_OPENAI_API_KEY:-}"
;;
*)
echo "::warning::Unknown E2E_RUNTIME='${E2E_RUNTIME}' — skipping LLM-key check"
required_secret_name=""
required_secret_value="present"
;;
esac
if [ -n "$required_secret_name" ] && [ -z "$required_secret_value" ]; then
echo "::error::${required_secret_name} secret not set for runtime=${E2E_RUNTIME} — A2A will fail at request time with 'No LLM provider configured'"
exit 2
fi
echo "LLM key present ✓ (runtime=${E2E_RUNTIME}, key=${required_secret_name}, len=${#required_secret_value})"
- name: Canary run
id: canary
run: bash tests/e2e/test_staging_full_saas.sh
# Alerting: open an issue only after THREE consecutive failures so
# transient flakes (Cloudflare DNS hiccup, AWS API blip) don't spam
# the issue list. If an issue is already open, we still comment on
# every failure so ops sees the streak. Auto-close on next green.
#
# Threshold rationale: canary fires every 30 min, so 3 failures =
# ~90 min of consecutive red — well past any single-run flake but
# still tight enough that a real outage gets surfaced before the
# next deploy window.
- name: Open issue on failure
if: failure()
uses: actions/github-script@3a2844b7e9c422d3c10d287c895573f7108da1b3 # v9.0.0
env:
# Inject the workflow path explicitly — context.workflow is
# the *name*, not the file path the actions API needs.
WORKFLOW_PATH: '.github/workflows/canary-staging.yml'
CONSECUTIVE_THRESHOLD: '3'
with:
script: |
const title = '🔴 Canary failing: staging SaaS smoke';
const runURL = `https://github.com/${context.repo.owner}/${context.repo.repo}/actions/runs/${context.runId}`;
// Find an existing open canary issue (stable title match).
// If one exists, this isn't a "first failure" — comment and exit.
const { data: existing } = await github.rest.issues.listForRepo({
owner: context.repo.owner, repo: context.repo.repo,
state: 'open', labels: 'canary-staging',
per_page: 10,
});
const match = existing.find(i => i.title === title);
if (match) {
await github.rest.issues.createComment({
owner: context.repo.owner, repo: context.repo.repo,
issue_number: match.number,
body: `Canary still failing. ${runURL}`,
});
core.info(`Commented on existing issue #${match.number}`);
return;
}
// No open issue yet — check the last N-1 runs' conclusions.
// We open the issue only if the last (THRESHOLD-1) runs ALSO
// failed (so this is the 3rd consecutive red).
const threshold = parseInt(process.env.CONSECUTIVE_THRESHOLD, 10);
const { data: runs } = await github.rest.actions.listWorkflowRuns({
owner: context.repo.owner, repo: context.repo.repo,
workflow_id: process.env.WORKFLOW_PATH,
status: 'completed',
per_page: threshold,
// Skip the current in-progress run; it isn't 'completed' yet.
});
// listWorkflowRuns returns recent first. We need (threshold-1)
// prior failures (current run is the threshold-th).
const priorFailures = (runs.workflow_runs || [])
.slice(0, threshold - 1)
.filter(r => r.id !== context.runId)
.filter(r => r.conclusion === 'failure')
.length;
if (priorFailures < threshold - 1) {
core.info(`Below threshold: ${priorFailures + 1}/${threshold} consecutive failures — not filing yet`);
return;
}
const body =
`Canary run failed at ${new Date().toISOString()}, ` +
`${threshold} consecutive runs red.\n\n` +
`Run: ${runURL}\n\n` +
`This issue auto-closes on the next green canary run. ` +
`Consecutive failures add a comment here rather than a new issue.`;
await github.rest.issues.create({
owner: context.repo.owner, repo: context.repo.repo,
title, body,
labels: ['canary-staging', 'bug'],
});
core.info(`Opened canary failure issue (${threshold} consecutive reds)`);
- name: Auto-close canary issue on success
if: success()
uses: actions/github-script@3a2844b7e9c422d3c10d287c895573f7108da1b3 # v9.0.0
with:
script: |
const title = '🔴 Canary failing: staging SaaS smoke';
const { data: open } = await github.rest.issues.listForRepo({
owner: context.repo.owner, repo: context.repo.repo,
state: 'open', labels: 'canary-staging',
per_page: 10,
});
const match = open.find(i => i.title === title);
if (match) {
await github.rest.issues.createComment({
owner: context.repo.owner, repo: context.repo.repo,
issue_number: match.number,
body: `Canary recovered at ${new Date().toISOString()}. Closing.`,
});
await github.rest.issues.update({
owner: context.repo.owner, repo: context.repo.repo,
issue_number: match.number,
state: 'closed',
});
core.info(`Closed recovered canary issue #${match.number}`);
}
- name: Teardown safety net
if: always()
env:
ADMIN_TOKEN: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKEN }}
run: |
set +e
# Slug prefix matches what test_staging_full_saas.sh emits
# in canary mode:
# SLUG="e2e-canary-$(date +%Y%m%d)-${RUN_ID_SUFFIX}"
# Earlier this was `e2e-{today}-canary-` — that was the
# full-mode pattern (date FIRST, mode SECOND); canary slugs
# have mode FIRST, date SECOND. The mismatch silently
# never matched, leaving every cancelled-canary EC2 alive
# until the once-an-hour sweep eventually caught it
# (incident 2026-04-26 21:03Z: 1h25m EC2 leak before manual
# cleanup; same gap on three earlier cancellations today).
orgs=$(curl -sS "$MOLECULE_CP_URL/cp/admin/orgs" \
-H "Authorization: Bearer $ADMIN_TOKEN" 2>/dev/null \
| python3 -c "
import json, sys, os, datetime
run_id = os.environ.get('GITHUB_RUN_ID', '')
d = json.load(sys.stdin)
# Scope to slugs from THIS canary run when GITHUB_RUN_ID is
# available; the canary workflow sets E2E_RUN_ID='canary-\${run_id}'
# so the slug suffix is '-canary-\${run_id}-...'. Mirrors the
# full-mode safety net's per-run scoping (e2e-staging-saas.yml)
# added after the 2026-04-21 cross-run cleanup incident.
# Sweep both today AND yesterday's UTC dates so a run that
# crosses midnight still cleans up its own slug — see the
# 2026-04-26→27 canvas-safety-net incident.
today = datetime.date.today()
yesterday = today - datetime.timedelta(days=1)
dates = (today.strftime('%Y%m%d'), yesterday.strftime('%Y%m%d'))
if run_id:
prefixes = tuple(f'e2e-canary-{d}-canary-{run_id}' for d in dates)
else:
prefixes = tuple(f'e2e-canary-{d}-' for d in dates)
candidates = [o['slug'] for o in d.get('orgs', [])
if any(o.get('slug','').startswith(p) for p in prefixes)
and o.get('status') not in ('purged',)]
print('\n'.join(candidates))
" 2>/dev/null)
# Per-slug DELETE with HTTP-code verification. The previous
# `... >/dev/null || true` swallowed every failure, so a 5xx
# or timeout from CP looked identical to "successfully cleaned
# up" and the tenant kept eating ~2 vCPU until the hourly
# stale sweep caught it (up to 2h later). Now we capture the
# response code and surface non-2xx as a workflow warning, so
# the run page shows which slug leaked. We still don't `exit 1`
# on cleanup failure — a single-canary cleanup miss shouldn't
# fail-flag the canary itself when the actual smoke check
# passed. The sweep-stale-e2e-orgs cron (now every 15 min,
# 30-min threshold) is the safety net for whatever slips past.
# See molecule-controlplane#420.
leaks=()
for slug in $orgs; do
# Tempfile-routed -w + set +e/-e prevents curl-exit-code
# pollution of the captured status (lint-curl-status-capture.yml).
set +e
curl -sS -o /tmp/canary-cleanup.out -w "%{http_code}" \
-X DELETE "$MOLECULE_CP_URL/cp/admin/tenants/$slug" \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d "{\"confirm\":\"$slug\"}" >/tmp/canary-cleanup.code
set -e
code=$(cat /tmp/canary-cleanup.code 2>/dev/null || echo "000")
if [ "$code" = "200" ] || [ "$code" = "204" ]; then
echo "[teardown] deleted $slug (HTTP $code)"
else
echo "::warning::canary teardown for $slug returned HTTP $code — sweep-stale-e2e-orgs will catch it within ~45 min. Body: $(head -c 300 /tmp/canary-cleanup.out 2>/dev/null)"
leaks+=("$slug")
fi
done
if [ ${#leaks[@]} -gt 0 ]; then
echo "::warning::canary teardown left ${#leaks[@]} leak(s): ${leaks[*]}"
fi
exit 0
+174
View File
@@ -0,0 +1,174 @@
name: canary-verify
# Runs the canary smoke suite against the staging canary tenant fleet
# after a new :staging-<sha> image lands in GHCR. On green, promotes
# :staging-<sha> → :latest so the prod tenant fleet's 5-minute
# auto-updater picks up the verified digest. On red, :latest stays
# on the prior known-good digest and prod is untouched.
#
# Dependencies:
# - publish-workspace-server-image.yml publishes :staging-<sha>
# (NOT :latest) on main merge
# - canary tenants are configured to pull :staging-<sha> as their
# tenant image (set TENANT_IMAGE=ghcr.io/…:staging-<sha> on the
# canary provisioner code path OR rotate via an admin endpoint)
# - Repo secrets CANARY_TENANT_URLS / CANARY_ADMIN_TOKENS /
# CANARY_CP_SHARED_SECRET are populated
on:
workflow_run:
workflows: ["publish-workspace-server-image"]
types: [completed]
workflow_dispatch:
permissions:
contents: read
packages: write
actions: read
env:
IMAGE_NAME: ghcr.io/molecule-ai/platform
TENANT_IMAGE_NAME: ghcr.io/molecule-ai/platform-tenant
jobs:
canary-smoke:
# Skip when the upstream workflow failed — no image to test against.
if: ${{ github.event.workflow_run.conclusion == 'success' || github.event_name == 'workflow_dispatch' }}
runs-on: ubuntu-latest
outputs:
sha: ${{ steps.compute.outputs.sha }}
smoke_ran: ${{ steps.smoke.outputs.ran }}
steps:
- name: Checkout
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: Compute sha
id: compute
run: echo "sha=${GITHUB_SHA::7}" >> "$GITHUB_OUTPUT"
- name: Wait for canary tenants to pick up :staging-<sha>
# Poll canary health endpoints every 30s for up to 7 min instead
# of a fixed 6-min sleep. Exits as soon as ALL canaries report
# the new SHA (~2-3 min typical vs 6 min fixed). Falls back to
# proceeding after 7 min even if not all canaries responded —
# the smoke suite will catch any that didn't update.
env:
CANARY_TENANT_URLS: ${{ secrets.CANARY_TENANT_URLS }}
EXPECTED_SHA: ${{ steps.compute.outputs.sha }}
run: |
if [ -z "$CANARY_TENANT_URLS" ]; then
echo "No canary URLs configured — falling back to 60s wait"
sleep 60
exit 0
fi
IFS=',' read -ra URLS <<< "$CANARY_TENANT_URLS"
MAX_WAIT=420 # 7 minutes
INTERVAL=30
ELAPSED=0
while [ $ELAPSED -lt $MAX_WAIT ]; do
ALL_READY=true
for url in "${URLS[@]}"; do
HEALTH=$(curl -s --max-time 5 "${url}/health" 2>/dev/null || echo "{}")
SHA=$(echo "$HEALTH" | grep -o "\"sha\":\"[^\"]*\"" | head -1 | cut -d'"' -f4)
if [ "$SHA" != "$EXPECTED_SHA" ]; then
ALL_READY=false
break
fi
done
if $ALL_READY; then
echo "All canaries running staging-${EXPECTED_SHA} after ${ELAPSED}s"
exit 0
fi
echo "Waiting for canaries... (${ELAPSED}s / ${MAX_WAIT}s)"
sleep $INTERVAL
ELAPSED=$((ELAPSED + INTERVAL))
done
echo "Timeout after ${MAX_WAIT}s — proceeding anyway (smoke suite will validate)"
- name: Run canary smoke suite
id: smoke
# Graceful-skip when no canary fleet is configured (Phase 2 not yet
# stood up — see molecule-controlplane/docs/canary-tenants.md).
# Sets `ran=false` on skip so promote-to-latest stays off (we don't
# want every main merge auto-promoting without gating). Manual
# promote-latest.yml is the release gate while canary is absent.
# Once the fleet is real: delete the early-exit branch.
env:
CANARY_TENANT_URLS: ${{ secrets.CANARY_TENANT_URLS }}
CANARY_ADMIN_TOKENS: ${{ secrets.CANARY_ADMIN_TOKENS }}
CANARY_CP_BASE_URL: https://staging-api.moleculesai.app
CANARY_CP_SHARED_SECRET: ${{ secrets.CANARY_CP_SHARED_SECRET }}
run: |
set -euo pipefail
if [ -z "${CANARY_TENANT_URLS:-}" ] \
|| [ -z "${CANARY_ADMIN_TOKENS:-}" ] \
|| [ -z "${CANARY_CP_SHARED_SECRET:-}" ]; then
{
echo "## ⚠️ canary-verify skipped"
echo
echo "One or more canary secrets are unset (\`CANARY_TENANT_URLS\`, \`CANARY_ADMIN_TOKENS\`, \`CANARY_CP_SHARED_SECRET\`)."
echo "Phase 2 canary fleet has not been stood up yet —"
echo "see [canary-tenants.md](https://github.com/molecule-ai/molecule-controlplane/blob/main/docs/canary-tenants.md)."
echo
echo "**Skipped — promote-to-latest will NOT auto-fire.** Dispatch \`promote-latest.yml\` manually when ready."
} >> "$GITHUB_STEP_SUMMARY"
echo "ran=false" >> "$GITHUB_OUTPUT"
echo "::notice::canary-verify: skipped — no canary fleet configured"
exit 0
fi
bash scripts/canary-smoke.sh
echo "ran=true" >> "$GITHUB_OUTPUT"
- name: Summary on failure
if: ${{ failure() }}
run: |
{
echo "## Canary smoke FAILED"
echo
echo "Canary tenants rejected image \`staging-${{ steps.compute.outputs.sha }}\`."
echo ":latest stays pinned to the prior good digest — prod is untouched."
echo
echo "Fix forward and merge again, or investigate the specific failed"
echo "assertions in the canary-smoke step log above."
} >> "$GITHUB_STEP_SUMMARY"
promote-to-latest:
# On green, retag :staging-<sha> → :latest for BOTH images.
# crane is a lightweight registry client (no Docker daemon needed on
# the runner) that can retag remotely with a single API call each.
# Gated on smoke_ran=true — without a real canary fleet the smoke
# step no-ops with success, and we don't want that to silently
# auto-promote every main merge.
needs: canary-smoke
if: ${{ needs.canary-smoke.result == 'success' && needs.canary-smoke.outputs.smoke_ran == 'true' }}
runs-on: ubuntu-latest
steps:
- uses: imjasonh/setup-crane@6da1ae018866400525525ce74ff892880c099987 # v0.5
- name: GHCR login
run: |
echo "${{ secrets.GITHUB_TOKEN }}" | \
crane auth login ghcr.io -u "${{ github.actor }}" --password-stdin
- name: Retag platform :staging-<sha> → :latest
run: |
crane tag \
"${IMAGE_NAME}:staging-${{ needs.canary-smoke.outputs.sha }}" \
latest
- name: Retag tenant :staging-<sha> → :latest
run: |
crane tag \
"${TENANT_IMAGE_NAME}:staging-${{ needs.canary-smoke.outputs.sha }}" \
latest
- name: Summary
run: |
{
echo "## Canary verified — :latest promoted"
echo
echo "- \`${IMAGE_NAME}:staging-${{ needs.canary-smoke.outputs.sha }}\` → \`${IMAGE_NAME}:latest\`"
echo "- \`${TENANT_IMAGE_NAME}:staging-${{ needs.canary-smoke.outputs.sha }}\` → \`${TENANT_IMAGE_NAME}:latest\`"
echo
echo "Prod tenant fleet will pick up the new digest on its next 5-min auto-update cycle."
} >> "$GITHUB_STEP_SUMMARY"
@@ -0,0 +1,39 @@
name: cascade-list-drift-gate
# Structural gate: TEMPLATES list in publish-runtime.yml must match
# manifest.json's workspace_templates exactly. Closes the recurrence
# path of PR #2556 (the data fix) and is the first concrete deliverable
# of RFC #388 PR-3.
#
# Why a gate, not just discipline: PR #2536 pruned the manifest, but the
# cascade list wasn't updated for ~weeks before someone (PR #2556)
# noticed during an unrelated audit. During that window, codex never
# rebuilt on a runtime publish. A structural gate catches the drift
# the same day either file changes.
#
# Triggers narrowly to keep CI quiet: only on PRs that actually change
# one of the two files. The path-filtered split + always-emit-result
# pattern (memory: "Required check names need a job that always runs")
# is unnecessary here because the workflow IS the check name and PR
# branch protection should require it directly. Future-proof: if this
# becomes a required check, add a no-op aggregator with always() so the
# name still emits when paths don't match.
on:
pull_request:
branches: [staging, main]
paths:
- manifest.json
- .github/workflows/publish-runtime.yml
- scripts/check-cascade-list-vs-manifest.sh
permissions:
contents: read
jobs:
check:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
- name: Check cascade list matches manifest
run: bash scripts/check-cascade-list-vs-manifest.sh
@@ -0,0 +1,123 @@
name: Check merge_group trigger on required workflows
# Pre-merge guard against the deadlock pattern where a workflow whose
# check is in `required_status_checks` lacks a `merge_group:` trigger.
# Without it, GitHub merge queue stalls forever in AWAITING_CHECKS
# because the required check can't fire on `gh-readonly-queue/...` refs.
#
# This workflow:
# 1. Lists required status checks on the branch protection rule for `staging`
# 2. For each required check, finds the workflow that produces it (by job
# name match)
# 3. Fails if any such workflow lacks `merge_group:` in its triggers
#
# Reasoning for staging-only: main has its own CI gating model (PR review),
# but staging is what the merge queue runs on, so it's the trigger that
# matters.
on:
pull_request:
paths:
- '.github/workflows/**.yml'
- '.github/workflows/**.yaml'
push:
branches: [staging, main]
paths:
- '.github/workflows/**.yml'
- '.github/workflows/**.yaml'
# Self-listen on merge_group so the linter passes its own queue run.
merge_group:
types: [checks_requested]
jobs:
check:
name: Required workflows have merge_group trigger
runs-on: ubuntu-latest
permissions:
contents: read
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: Verify merge_group trigger on required-check workflows
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
REPO: ${{ github.repository }}
shell: bash
run: |
set -euo pipefail
# Branch we care about — the one merge queue runs on.
BRANCH=staging
# Pull the list of required status check contexts. If the branch
# has no protection or no required checks, exit clean — nothing
# to lint.
REQUIRED=$(gh api "repos/${REPO}/branches/${BRANCH}/protection/required_status_checks" \
--jq '.contexts[]' 2>/dev/null || true)
if [ -z "$REQUIRED" ]; then
echo "No required status checks on ${BRANCH} — nothing to verify."
exit 0
fi
echo "Required checks on ${BRANCH}:"
echo "${REQUIRED}" | sed 's/^/ - /'
echo
# Build a map: workflow file -> set of job names declared in it.
# We use yq if available, otherwise grep the `name:` lines under
# `jobs:`. Stick with grep for portability — runner image always
# has it; yq isn't in the default image as of 2026-04.
declare -A workflow_jobs
shopt -s nullglob
for wf in .github/workflows/*.yml .github/workflows/*.yaml; do
[ -f "$wf" ] || continue
# Extract the workflow name (the `name:` at file root).
wf_name=$(awk '/^name:[[:space:]]/ {sub(/^name:[[:space:]]+/,""); gsub(/^"|"$/,""); print; exit}' "$wf")
# Extract job step names from the `jobs:` block. A job step is:
# - id under `jobs:` (key with 2-space indent followed by colon)
# - the `name:` field inside that job (4-space indent)
# We collect both because required_status_checks contexts can
# match either, depending on how the workflow was authored.
jobs_block=$(awk '/^jobs:/{flag=1; next} flag' "$wf")
job_names=$(echo "$jobs_block" | awk '/^[[:space:]]{4}name:[[:space:]]/ {sub(/^[[:space:]]+name:[[:space:]]+/,""); gsub(/^["'"'"']|["'"'"']$/,""); print}')
workflow_jobs["$wf"]="${wf_name}"$'\n'"${job_names}"
done
# For each required check, find the workflow that produces it.
# Then verify that workflow lists merge_group as a trigger.
FAILED=0
while IFS= read -r check; do
[ -z "$check" ] && continue
owning_wf=""
for wf in "${!workflow_jobs[@]}"; do
if echo "${workflow_jobs[$wf]}" | grep -Fxq "$check"; then
owning_wf="$wf"
break
fi
done
if [ -z "$owning_wf" ]; then
echo "::warning::Required check '${check}' has no matching workflow in this repo. Skipping (may be from an external app)."
continue
fi
# Does the workflow's trigger list include merge_group?
# Match either bare `merge_group:` line or merge_group with
# subsequent indented config (types: [checks_requested]).
if grep -qE '^[[:space:]]*merge_group:' "$owning_wf"; then
echo "OK: '${check}' (in $owning_wf) — has merge_group trigger"
else
echo "::error file=${owning_wf}::Required check '${check}' is produced by ${owning_wf}, but the workflow does not declare a 'merge_group:' trigger. With merge queue enabled on ${BRANCH}, this will deadlock the queue (every PR sits AWAITING_CHECKS forever). Add this to the workflow's 'on:' block:"
echo "::error file=${owning_wf}:: merge_group:"
echo "::error file=${owning_wf}:: types: [checks_requested]"
FAILED=1
fi
done <<< "$REQUIRED"
if [ "$FAILED" -ne 0 ]; then
echo
echo "::error::Block. See errors above. Reference: $(grep -l 'reference_merge_queue' /dev/null 2>/dev/null || echo 'memory: reference_merge_queue_enablement.md')."
exit 1
fi
echo
echo "All required workflows on ${BRANCH} declare merge_group triggers."
@@ -1,16 +1,5 @@
name: Check migration collisions
# Ported from .github/workflows/check-migration-collisions.yml on 2026-05-11
# per RFC internal#219 §1 sweep.
#
# Differences from the GitHub version:
# - on.paths includes .gitea/workflows/check-migration-collisions.yml
# (this file) instead of the .github/ one.
# - Workflow-level env.GITHUB_SERVER_URL pinned to https://git.moleculesai.app
# so scripts/ops/check_migration_collisions.py can derive the Gitea API
# base (the script already supports this; see _gitea_api_url()).
# - `continue-on-error: true` on the job (RFC §1 contract).
#
# Hard gate (#2341): fails a PR that adds a migration prefix already
# claimed by the base branch or another open PR. Caught manually 2026-04-30
# during PR #2276 rebase: 044_runtime_image_pins collided with
@@ -28,25 +17,17 @@ on:
paths:
- 'workspace-server/migrations/**'
- 'scripts/ops/check_migration_collisions.py'
- '.gitea/workflows/check-migration-collisions.yml'
env:
GITHUB_SERVER_URL: https://git.moleculesai.app
- '.github/workflows/check-migration-collisions.yml'
permissions:
contents: read
# API needs read access to other PRs to detect cross-PR collisions
# gh pr list/diff need read access to other PRs
pull-requests: read
jobs:
check:
name: Migration version collision check
runs-on: ubuntu-latest
# Phase 3 (RFC #219 §1): surface broken workflows without blocking
# the PR. Follow-up PR flips this off after surfaced defects are
# triaged.
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
timeout-minutes: 5
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
@@ -60,7 +41,7 @@ jobs:
BASE_REF: origin/${{ github.event.pull_request.base.ref }}
HEAD_REF: ${{ github.event.pull_request.head.sha }}
GITHUB_REPOSITORY: ${{ github.repository }}
# Auto-injected; Gitea aliases this for in-repo API access.
# gh CLI uses GH_TOKEN from env
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
# Ensure the named base ref exists locally. checkout@v4 with
@@ -70,6 +51,8 @@ jobs:
# IMPORTANT: do NOT pass --depth=1 here. The script below uses
# `git diff origin/<base>...<head>` (three-dot, merge-base form),
# which fails with "fatal: no merge base" if the base ref is
# shallow.
# shallow. The auto-promote staging→main PR (#2361) was blocked
# by exactly this for ~5h on 2026-04-30 — the depth=1 fetch
# overwrote checkout@v4's full-history clone with a shallow tip.
git fetch origin "${{ github.event.pull_request.base.ref }}" || true
python3 scripts/ops/check_migration_collisions.py
+438
View File
@@ -0,0 +1,438 @@
name: CI
on:
push:
branches: [main, staging]
pull_request:
branches: [main, staging]
# GitHub merge queue fires `merge_group` for the queue's pre-merge CI run.
# Required so the queue gets a real check result instead of a false-green
# from the absence of a triggered workflow. Safe to add unconditionally —
# the event simply doesn't fire until the queue is enabled on the branch.
merge_group:
types: [checks_requested]
# Cancel in-progress CI runs when a new commit arrives on the same ref.
# This prevents stale runs from queuing behind each other. The merge_group
# refs (refs/heads/gh-readonly-queue/...) get their own concurrency group
# automatically because github.ref differs from the PR ref.
concurrency:
group: ci-${{ github.ref }}
cancel-in-progress: true
jobs:
# Detect which paths changed so downstream jobs can skip when only
# docs/markdown files were modified.
changes:
name: Detect changes
runs-on: ubuntu-latest
outputs:
platform: ${{ steps.check.outputs.platform }}
canvas: ${{ steps.check.outputs.canvas }}
python: ${{ steps.check.outputs.python }}
scripts: ${{ steps.check.outputs.scripts }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 0
- id: check
run: |
# For PR events: diff against the base branch (not HEAD~1 of the branch,
# which may be unrelated after force-pushes). When a push updates a PR,
# both pull_request and push events fire — prefer the PR base so that
# the diff is always computed against the actual merge base, not the
# previous SHA on the branch which may be on a different history line.
BASE="${GITHUB_BASE_REF:-${{ github.event.before }}}"
# GITHUB_BASE_REF is set by GitHub for PR events (the base branch name).
# For pull_request events we use the stored base.sha; for push events
# (or when base.sha is unavailable) fall back to github.event.before.
if [ "${{ github.event_name }}" = "pull_request" ] && [ -n "${{ github.event.pull_request.base.sha }}" ]; then
BASE="${{ github.event.pull_request.base.sha }}"
fi
# Fallback: if BASE is empty or all zeros (new branch), run everything
if [ -z "$BASE" ] || echo "$BASE" | grep -qE '^0+$'; then
echo "platform=true" >> "$GITHUB_OUTPUT"
echo "canvas=true" >> "$GITHUB_OUTPUT"
echo "python=true" >> "$GITHUB_OUTPUT"
echo "scripts=true" >> "$GITHUB_OUTPUT"
exit 0
fi
DIFF=$(git diff --name-only "$BASE" HEAD 2>/dev/null || echo ".github/workflows/ci.yml")
echo "platform=$(echo "$DIFF" | grep -qE '^workspace-server/|^\.github/workflows/ci\.yml$' && echo true || echo false)" >> "$GITHUB_OUTPUT"
echo "canvas=$(echo "$DIFF" | grep -qE '^canvas/|^\.github/workflows/ci\.yml$' && echo true || echo false)" >> "$GITHUB_OUTPUT"
echo "python=$(echo "$DIFF" | grep -qE '^workspace/|^\.github/workflows/ci\.yml$' && echo true || echo false)" >> "$GITHUB_OUTPUT"
echo "scripts=$(echo "$DIFF" | grep -qE '^tests/e2e/|^scripts/|^infra/scripts/|^\.github/workflows/ci\.yml$' && echo true || echo false)" >> "$GITHUB_OUTPUT"
# Platform (Go) is a required check on staging. Always-run + per-step
# gating (see Canvas (Next.js) for the rationale and the failure mode
# this avoids).
platform-build:
name: Platform (Go)
needs: changes
runs-on: ubuntu-latest
defaults:
run:
working-directory: workspace-server
steps:
- if: needs.changes.outputs.platform != 'true'
working-directory: .
run: echo "No platform/** changes — skipping real build steps; this job always runs to satisfy the required-check name on branch protection."
- if: needs.changes.outputs.platform == 'true'
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- if: needs.changes.outputs.platform == 'true'
uses: actions/setup-go@40f1582b2485089dde7abd97c1529aa768e1baff # v5
with:
go-version: 'stable'
- if: needs.changes.outputs.platform == 'true'
run: go mod download
- if: needs.changes.outputs.platform == 'true'
run: go build ./cmd/server
# CLI (molecli) moved to standalone repo: github.com/molecule-ai/molecule-cli
- if: needs.changes.outputs.platform == 'true'
run: go vet ./... || true
- if: needs.changes.outputs.platform == 'true'
name: Run golangci-lint
run: golangci-lint run --timeout 3m ./... || true
- if: needs.changes.outputs.platform == 'true'
name: Run tests with race detection and coverage
run: go test -race -coverprofile=coverage.out ./...
- if: needs.changes.outputs.platform == 'true'
name: Per-file coverage report
# Advisory — lists every source file with its coverage so reviewers
# can see at-a-glance where gaps are. Sorted ascending so the worst
# offenders float to the top. Does NOT fail the build; the hard
# gate is the threshold check below. (#1823)
run: |
echo "=== Per-file coverage (worst first) ==="
go tool cover -func=coverage.out \
| grep -v '^total:' \
| awk '{file=$1; sub(/:[0-9][0-9.]*:.*/, "", file); pct=$NF; gsub(/%/,"",pct); s[file]+=pct; c[file]++}
END {for (f in s) printf "%6.1f%% %s\n", s[f]/c[f], f}' \
| sort -n
- if: needs.changes.outputs.platform == 'true'
name: Check coverage thresholds
# Enforces two gates from #1823 Layer 1:
# 1. Total floor (25% — ratchet plan in COVERAGE_FLOOR.md).
# 2. Per-file floor — non-test .go files in security-critical
# paths with coverage <10% fail the build, UNLESS the file
# path is listed in .coverage-allowlist.txt (acknowledged
# historical debt with a tracking issue + expiry).
run: |
set -e
TOTAL_FLOOR=25
# Security-critical paths where a 0%-coverage file is a real risk.
CRITICAL_PATHS=(
"internal/handlers/tokens"
"internal/handlers/workspace_provision"
"internal/handlers/a2a_proxy"
"internal/handlers/registry"
"internal/handlers/secrets"
"internal/middleware/wsauth"
"internal/crypto"
)
TOTAL=$(go tool cover -func=coverage.out | grep '^total:' | awk '{print $3}' | sed 's/%//')
echo "Total coverage: ${TOTAL}%"
if awk "BEGIN{exit !($TOTAL < $TOTAL_FLOOR)}"; then
echo "::error::Total coverage ${TOTAL}% is below the ${TOTAL_FLOOR}% floor. See COVERAGE_FLOOR.md for ratchet plan."
exit 1
fi
# Aggregate per-file coverage → /tmp/perfile.txt: "<fullpath> <pct>"
go tool cover -func=coverage.out \
| grep -v '^total:' \
| awk '{file=$1; sub(/:[0-9][0-9.]*:.*/, "", file); pct=$NF; gsub(/%/,"",pct); s[file]+=pct; c[file]++}
END {for (f in s) printf "%s %.1f\n", f, s[f]/c[f]}' \
> /tmp/perfile.txt
# Build allowlist — paths relative to workspace-server, one per line.
# Lines starting with # are comments.
ALLOWLIST=""
if [ -f ../.coverage-allowlist.txt ]; then
ALLOWLIST=$(grep -vE '^(#|[[:space:]]*$)' ../.coverage-allowlist.txt || true)
fi
FAILED=0
WARNED=0
for path in "${CRITICAL_PATHS[@]}"; do
while read -r file pct; do
[[ "$file" == *_test.go ]] && continue
[[ "$file" == *"$path"* ]] || continue
awk "BEGIN{exit !($pct < 10)}" || continue
# Strip the package-import prefix so we can match .coverage-allowlist.txt
# entries written as paths relative to workspace-server/.
# Handle both module paths: platform/workspace-server/... and platform/...
rel=$(echo "$file" | sed 's|^github.com/molecule-ai/molecule-monorepo/platform/workspace-server/||; s|^github.com/molecule-ai/molecule-monorepo/platform/||')
if echo "$ALLOWLIST" | grep -qxF "$rel"; then
echo "::warning file=workspace-server/$rel::Critical file at ${pct}% coverage (allowlisted, #1823) — fix before expiry."
WARNED=$((WARNED+1))
else
echo "::error file=workspace-server/$rel::Critical file at ${pct}% coverage — must be >=10% (target 80%). See #1823. To acknowledge as known debt, add this path to .coverage-allowlist.txt."
FAILED=$((FAILED+1))
fi
done < /tmp/perfile.txt
done
echo ""
echo "Critical-path check: $FAILED new failures, $WARNED allowlisted warnings."
if [ "$FAILED" -gt 0 ]; then
echo ""
echo "$FAILED security-critical file(s) have <10% test coverage and are"
echo "NOT in the allowlist. These paths handle auth, tokens, secrets, or"
echo "workspace provisioning — a 0% file here is the exact gap that let"
echo "CWE-22, CWE-78, KI-005 slip through in past incidents. Either:"
echo " (a) add tests to raise coverage above 10%, or"
echo " (b) add the path to .coverage-allowlist.txt with an expiry date"
echo " and a tracking issue reference."
exit 1
fi
# Canvas (Next.js) — required check, always runs. See platform-build
# comment above for the rationale.
#
# Supersedes the canvas-build-noop pattern attempted in PR #2321: two
# jobs sharing `name:` doesn't actually satisfy branch protection
# because the SKIPPED check run sibling is treated as not-passed
# regardless of how many SUCCESS siblings it has. Verified empirically
# on PR #2314 — mergeStateStatus stayed BLOCKED until I collapsed to
# a single-job-with-conditional-steps shape.
canvas-build:
name: Canvas (Next.js)
needs: changes
runs-on: ubuntu-latest
defaults:
run:
working-directory: canvas
steps:
- if: needs.changes.outputs.canvas != 'true'
working-directory: .
run: echo "No canvas/** changes — skipping real build steps; this job always runs to satisfy the required-check name on branch protection."
- if: needs.changes.outputs.canvas == 'true'
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- if: needs.changes.outputs.canvas == 'true'
uses: actions/setup-node@48b55a011bda9f5d6aeb4c2d9c7362e8dae4041e # v6.4.0
with:
node-version: '22'
- if: needs.changes.outputs.canvas == 'true'
run: rm -f package-lock.json && npm install
- if: needs.changes.outputs.canvas == 'true'
run: npm run build
- if: needs.changes.outputs.canvas == 'true'
name: Run tests with coverage
# Coverage instrumentation is configured in canvas/vitest.config.ts
# (provider: v8, reporters: text + html + json-summary). Step 2 of
# #1815 — wires coverage into CI so we get a baseline visible on
# every PR. No threshold gate yet; thresholds dial in (Step 3, also
# tracked in #1815) after the team sees what current coverage is.
# Per the inline comment in vitest.config.ts: "first land
# observability so we can see the baseline, then dial in
# thresholds + a hard gate" — this PR ships the observability half.
run: npx vitest run --coverage
- name: Upload coverage summary as artifact
if: needs.changes.outputs.canvas == 'true' && always()
uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4.6.2
with:
name: canvas-coverage-${{ github.run_id }}
path: canvas/coverage/
retention-days: 7
if-no-files-found: warn
# MCP Server + SDK removed from CI — now in standalone repos:
# - github.com/molecule-ai/molecule-mcp-server (npm CI)
# - github.com/molecule-ai/molecule-sdk-python (PyPI CI)
# e2e-api job moved to .github/workflows/e2e-api.yml (issue #458).
# It now has workflow-level concurrency (cancel-in-progress: false) so
# new pushes queue the E2E run rather than cancelling it at the run level.
# Shellcheck (E2E scripts) — required check, always runs. See
# platform-build for the rationale.
shellcheck:
name: Shellcheck (E2E scripts)
needs: changes
runs-on: ubuntu-latest
steps:
- if: needs.changes.outputs.scripts != 'true'
run: echo "No tests/e2e/ or infra/scripts/ changes — skipping real shellcheck; this job always runs to satisfy the required-check name on branch protection."
- if: needs.changes.outputs.scripts == 'true'
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- if: needs.changes.outputs.scripts == 'true'
name: Run shellcheck on tests/e2e/*.sh and infra/scripts/*.sh
# shellcheck is pre-installed on ubuntu-latest runners (via apt).
# infra/scripts/ is included because setup.sh + nuke.sh gate the
# README quickstart — a shellcheck regression there silently breaks
# new-user onboarding. scripts/ is intentionally excluded until its
# pre-existing SC3040/SC3043 warnings are cleaned up.
run: |
find tests/e2e infra/scripts -type f -name '*.sh' -print0 \
| xargs -0 shellcheck --severity=warning
- if: needs.changes.outputs.scripts == 'true'
name: Lint cleanup-trap hygiene (RFC #2873)
# Asserts every shell E2E test that calls `mktemp` also installs
# an EXIT trap. Catches the /tmp-leak class — a missing trap
# silently leaks scratch into CI runners (~10-100KB per run).
# See tests/e2e/lint_cleanup_traps.sh for the rule + fix pattern.
run: bash tests/e2e/lint_cleanup_traps.sh
- if: needs.changes.outputs.scripts == 'true'
name: Run E2E bash unit tests (no live infra)
# Pure-bash unit tests for E2E helper libs (lib/*.sh). These pin
# behavior of dispatch logic that — when broken — silently masks as
# "Could not resolve authentication method" only after a successful
# tenant + workspace provision (PR #2571 incident, 2026-05-03). Add
# new self-contained unit tests here as the lib/ directory grows;
# tests requiring live CP/tenant credentials belong in the dedicated
# e2e-staging-* workflows, not this job.
run: |
bash tests/e2e/test_model_slug.sh
canvas-deploy-reminder:
name: Canvas Deploy Reminder
runs-on: ubuntu-latest
needs: [changes, canvas-build]
# Only fires on direct pushes to main (i.e. after staging→main promotion).
if: needs.changes.outputs.canvas == 'true' && github.event_name == 'push' && github.ref == 'refs/heads/main'
permissions:
# Required to post commit comments via the GitHub API.
contents: write
steps:
- name: Post deploy reminder as commit comment
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
COMMIT_SHA: ${{ github.sha }}
RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
run: |
# Write body to a temp file — avoids backtick escaping in shell.
cat > /tmp/deploy-reminder.md << 'BODY'
## Canvas build passed ✅ — deploy required
The `publish-canvas-image` workflow is now building a fresh Docker image
(`ghcr.io/molecule-ai/canvas:latest`) in the background.
Once it completes (~35 min), apply on the host machine with:
```bash
cd <runner-workspace>
git pull origin main
docker compose pull canvas && docker compose up -d canvas
```
If you need to rebuild from local source instead (e.g. testing unreleased
changes or a new `NEXT_PUBLIC_*` URL), use:
```bash
docker compose build canvas && docker compose up -d canvas
```
BODY
printf '\n> Posted automatically by CI · commit `%s` · [build log](%s)\n' \
"$COMMIT_SHA" "$RUN_URL" >> /tmp/deploy-reminder.md
gh api \
--method POST \
"repos/${{ github.repository }}/commits/${{ github.sha }}/comments" \
--field "body=@/tmp/deploy-reminder.md"
# Python Lint & Test — required check, always runs. See platform-build
# for the rationale.
python-lint:
name: Python Lint & Test
needs: changes
runs-on: ubuntu-latest
env:
WORKSPACE_ID: test
defaults:
run:
working-directory: workspace
steps:
- if: needs.changes.outputs.python != 'true'
working-directory: .
run: echo "No workspace/** changes — skipping real lint+test; this job always runs to satisfy the required-check name on branch protection."
- if: needs.changes.outputs.python == 'true'
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- if: needs.changes.outputs.python == 'true'
uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0
with:
python-version: '3.11'
cache: pip
cache-dependency-path: workspace/requirements.txt
- if: needs.changes.outputs.python == 'true'
run: pip install -r requirements.txt pytest pytest-asyncio pytest-cov
# Coverage flags + fail-under floor moved into workspace/pytest.ini
# (issue #1817) so local `pytest` and CI use identical config.
- if: needs.changes.outputs.python == 'true'
run: python -m pytest --tb=short
- if: needs.changes.outputs.python == 'true'
name: Per-file critical-path coverage (MCP / inbox / auth)
# MCP-critical Python files have a per-file floor on top of the
# 86% total floor in pytest.ini. Rationale (issue #2790, after
# the PR #2766 → PR #2771 cycle): the total floor averages ~6000
# lines, so a single MCP file could regress to ~50% with no
# complaint as long as other modules compensate. These five
# files handle multi-tenant routing + auth + inbox dispatch —
# a coverage drop here is the same risk shape as a Go-side
# workspace-server token/secrets file dropping below 10%.
#
# Floor 75% sits below current actuals (80-96%) so this gate is
# strictly additive — no existing PR fails. Ratchet plan in
# COVERAGE_FLOOR.md.
run: |
set -e
PER_FILE_FLOOR=75
CRITICAL_FILES=(
"a2a_mcp_server.py"
"mcp_cli.py"
"a2a_tools.py"
"a2a_tools_inbox.py"
"inbox.py"
"platform_auth.py"
)
# pytest already wrote .coverage; emit a JSON view scoped to
# the critical files so jq/python can read the per-file pct
# without parsing tabular text. --include uses fnmatch, and
# the leading "*" allows the file to live anywhere under the
# workspace root (today they sit at workspace/<name>.py).
INCLUDES=$(printf '*%s,' "${CRITICAL_FILES[@]}")
INCLUDES="${INCLUDES%,}"
python -m coverage json -o /tmp/critical-cov.json --include="$INCLUDES"
FAILED=0
for f in "${CRITICAL_FILES[@]}"; do
# Match by top-level path key (e.g. "a2a_tools.py", not
# "builtin_tools/a2a_tools.py" — different file at 100%).
# The keys in coverage.json are paths relative to the run
# cwd (workspace/), so the critical-path entry sits at the
# bare basename.
pct=$(jq -r --arg f "$f" '.files | to_entries | map(select(.key == $f)) | .[0].value.summary.percent_covered // "MISSING"' /tmp/critical-cov.json)
if [ "$pct" = "MISSING" ]; then
echo "::error file=workspace/$f::No coverage data — file may have moved or test exclusion mis-set."
FAILED=$((FAILED+1))
continue
fi
echo "$f: ${pct}%"
if awk "BEGIN{exit !($pct < $PER_FILE_FLOOR)}"; then
echo "::error file=workspace/$f::${pct}% < ${PER_FILE_FLOOR}% per-file floor (MCP critical path). See COVERAGE_FLOOR.md."
FAILED=$((FAILED+1))
fi
done
if [ "$FAILED" -gt 0 ]; then
echo ""
echo "$FAILED MCP critical-path file(s) below the ${PER_FILE_FLOOR}% per-file floor."
echo "These paths handle multi-tenant routing, auth tokens, and inbox dispatch."
echo "A coverage drop here is the same risk shape as Go-side tokens/secrets files"
echo "dropping below 10% (see COVERAGE_FLOOR.md). Either:"
echo " (a) add tests to raise coverage back above ${PER_FILE_FLOOR}%, or"
echo " (b) if this is unavoidable historical debt, file an issue and propose"
echo " adjusting the floor with rationale in COVERAGE_FLOOR.md."
exit 1
fi
# SDK + plugin validation moved to standalone repo:
# github.com/molecule-ai/molecule-sdk-python
+122
View File
@@ -0,0 +1,122 @@
name: CodeQL
# Controls CodeQL scan triggers for this repo.
#
# GitHub's "Code quality" default setup (the UI-configured one) is
# hardcoded to only scan the default branch — on this repo that's
# `staging`, so PRs promoting staging→main would otherwise never be
# scanned. This workflow fills that gap by explicitly scanning both
# branches on push and PR.
#
# Runs on ubuntu-latest (GHA-hosted — public repo, free). GHAS is NOT
# enabled on this repo, so results are not uploaded to the Security
# tab — the scan fails the PR check on findings, and the SARIF is
# kept as a workflow artifact for triage.
on:
push:
branches: [main, staging]
pull_request:
branches: [main, staging]
# GitHub merge queue fires `merge_group` for the queue's pre-merge CI run.
# Required so CodeQL Analyze checks get a real result on the queued
# commit instead of a false-green. Event only fires once merge queue is
# enabled on the target branch — safe to add unconditionally.
merge_group:
types: [checks_requested]
schedule:
# Weekly run picks up findings in code that hasn't been touched.
- cron: '30 1 * * 0'
# Workflow-level concurrency: only one CodeQL run per branch/PR at a time.
# `cancel-in-progress: false` queues new runs so a quick follow-up push
# doesn't nuke a 45-min analysis mid-flight.
concurrency:
group: codeql-${{ github.ref }}
cancel-in-progress: false
permissions:
actions: read
contents: read
# No security-events: write — we don't call the upload API.
jobs:
analyze:
name: Analyze (${{ matrix.language }})
# CodeQL set to advisory (non-blocking) on Gitea Actions — Hongming dec'''n 2026-05-07 (#156).
# Findings still emit as SARIF artifacts; failing CodeQL run does not block PR merge.
continue-on-error: true
runs-on: ubuntu-latest
timeout-minutes: 45
strategy:
fail-fast: false
matrix:
language: [go, javascript-typescript, python]
steps:
- name: Checkout
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
# github-app-auth sibling-checkout removed 2026-05-07 (#157):
# plugin was dropped + the Dockerfile no longer needs it.
# jq is pre-installed on ubuntu-latest — no setup step needed.
- name: Initialize CodeQL
uses: github/codeql-action/init@95e58e9a2cdfd71adc6e0353d5c52f41a045d225 # v4.35.2
with:
languages: ${{ matrix.language }}
# security-extended widens past the default to include the
# full security-query set for a public SaaS surface.
queries: security-extended
- name: Autobuild
uses: github/codeql-action/autobuild@95e58e9a2cdfd71adc6e0353d5c52f41a045d225 # v4.35.2
- name: Perform CodeQL Analysis
id: analyze
uses: github/codeql-action/analyze@95e58e9a2cdfd71adc6e0353d5c52f41a045d225 # v4.35.2
with:
category: "/language:${{ matrix.language }}"
# upload: never — GHAS isn't enabled on this repo, so the
# upload API 403s. Write SARIF locally instead.
upload: never
output: sarif-results/${{ matrix.language }}
- name: Parse SARIF + fail on findings
# The analyze step writes <database>.sarif into the output
# directory — database name is the short CodeQL lang id, not
# the matrix value (e.g. "javascript-typescript" →
# javascript.sarif), so glob rather than hardcode.
# Filter to error/warning severity: security-extended emits
# "note" rows for informational findings we don't want to fail
# the build over.
shell: bash
run: |
set -euo pipefail
dir="sarif-results/${{ matrix.language }}"
sarif=$(ls "$dir"/*.sarif 2>/dev/null | head -1 || true)
if [ -z "$sarif" ] || [ ! -f "$sarif" ]; then
echo "::error::No SARIF file found under $dir"
ls -la "$dir" 2>/dev/null || true
exit 1
fi
echo "Parsing $sarif"
count=$(jq '[.runs[].results[] | select(.level == "error" or .level == "warning")] | length' "$sarif")
echo "CodeQL findings (error+warning) for ${{ matrix.language }}: $count"
if [ "$count" -gt 0 ]; then
echo "::error::CodeQL found $count issues. Details below; full SARIF in the artifact."
jq -r '.runs[].results[] | select(.level == "error" or .level == "warning") | " - [\(.level)] \(.ruleId // "?"): \(.message.text // "(no message)") @ \(.locations[0].physicalLocation.artifactLocation.uri // "?"):\(.locations[0].physicalLocation.region.startLine // "?")"' "$sarif"
exit 1
fi
- name: Upload SARIF artifact
# Keep SARIF around on success + failure so triagers can diff.
# 14-day retention — longer than default 3, short enough not
# to bloat quota.
if: always()
uses: actions/upload-artifact@v3 # pinned to v3 for Gitea act_runner v0.6 compatibility (internal#46)
with:
name: codeql-sarif-${{ matrix.language }}
path: sarif-results/${{ matrix.language }}/
retention-days: 14
@@ -1,16 +1,5 @@
name: Continuous synthetic E2E (staging)
# Ported from .github/workflows/continuous-synth-e2e.yml on 2026-05-11 per RFC
# internal#219 §1 sweep. Differences from the GitHub version:
# - Dropped `workflow_dispatch.inputs` (Gitea 1.22.6 parser rejects them
# per feedback_gitea_workflow_dispatch_inputs_unsupported).
# - Dropped `merge_group:` (no Gitea merge queue).
# - Dropped `environment:` blocks (Gitea has no environments).
# - Workflow-level env.GITHUB_SERVER_URL pinned per
# feedback_act_runner_github_server_url.
# - `continue-on-error: true` on each job (RFC §1 contract).
#
# Hard gate (#2342): cron-driven full-lifecycle E2E that catches
# regressions visible only at runtime — schema drift, deployment-pipeline
# gaps, vendor outages, env-var rotations, DNS / CF / Railway side-effects.
@@ -43,18 +32,6 @@ name: Continuous synthetic E2E (staging)
on:
schedule:
# Every 30 minutes, on :02 and :32. This keeps a recurring SaaS
# behavior probe while cutting runner occupancy from this workflow by
# roughly two thirds; fast liveness belongs in the lighter smoke/heartbeat
# probes, not in a full tenant/workspace synth every 10 minutes.
#
# Previous cadence was every 10 minutes (:02 :12 :22 :32 :42 :52).
# The current operator-host runner pool is the bottleneck, so full
# synth E2E is deliberately lower-cadence until it moves to a dedicated
# runner host or warm-runtime pool.
#
# Historical notes from the 10-minute shape:
#
# Every 10 minutes, on :02 :12 :22 :32 :42 :52. Three constraints:
# 1. Stay off the top-of-hour. GitHub Actions scheduler drops
# :00 firings under high load (own docs:
@@ -68,7 +45,7 @@ on:
# 2. Avoid colliding with the existing :15 sweep-cf-orphans
# and :45 sweep-cf-tunnels — both hit the CF API and we
# don't want to fight for rate-limit tokens.
# 3. Avoid the :30 heavy slot (staging-smoke /30, sweep-aws-
# 3. Avoid the :30 heavy slot (canary-staging /30, sweep-aws-
# secrets, sweep-stale-e2e-orgs every :15) — multiple
# overlapping cron registrations on the same minute is part
# of what GH drops under load.
@@ -78,7 +55,25 @@ on:
# fires = ~30 min cadence; closer to the 20-min target than the
# current shape and provides a real degradation alarm if drops
# get worse.
- cron: '2,32 * * * *'
- cron: '2,12,22,32,42,52 * * * *'
workflow_dispatch:
inputs:
runtime:
description: "Runtime to provision (claude-code = default + cheapest via MiniMax; langgraph = OpenAI-only; hermes = SDK-native path, slower)"
required: false
default: "claude-code"
type: string
model_slug:
description: "Model id to provision the workspace with (default MiniMax-M2.7-highspeed; e.g. 'sonnet' to test direct Anthropic, 'openai/gpt-4o' for hermes)"
required: false
default: "MiniMax-M2.7-highspeed"
type: string
keep_org:
description: "Skip teardown for post-mortem debugging (only manual dispatch — never set this for cron runs)"
required: false
default: false
type: boolean
permissions:
contents: read
# No issue-write here — failures surface as red runs in the workflow
@@ -94,16 +89,10 @@ concurrency:
group: continuous-synth-e2e
cancel-in-progress: false
env:
GITHUB_SERVER_URL: https://git.moleculesai.app
jobs:
synth:
name: Synthetic E2E against staging
runs-on: ubuntu-latest
# Phase 3 (RFC #219 §1): surface broken workflows without blocking.
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
# Bumped from 12 → 20 (2026-05-04). Tenant user-data install phase
# (apt-get update + install docker.io/jq/awscli/caddy + snap install
# ssm-agent) runs from raw Ubuntu on every boot — none of it is
@@ -118,7 +107,7 @@ jobs:
timeout-minutes: 20
env:
# claude-code default: cold-start ~5 min (comparable to langgraph),
# but uses MiniMax-M2 via the template's third-party-
# but uses MiniMax-M2.7-highspeed via the template's third-party-
# Anthropic-compat path (workspace-configs-templates/claude-code-
# default/config.yaml:64-69). MiniMax is ~5-10x cheaper than
# gpt-4.1-mini per token AND avoids the recurring OpenAI quota-
@@ -131,9 +120,9 @@ jobs:
# on the per-runtime default ("sonnet" → routes to direct
# Anthropic, defeats the cost saving). Operators can override
# via workflow_dispatch by setting a different E2E_MODEL_SLUG
# input if they need to exercise a specific model. MiniMax-M2 is the
# stable staging MiniMax path used by the full-SaaS smoke.
E2E_MODEL_SLUG: ${{ github.event.inputs.model_slug || 'MiniMax-M2' }}
# input if they need to exercise a specific model. M2.7-highspeed
# is "Token Plan only" but cheap-per-token and fast.
E2E_MODEL_SLUG: ${{ github.event.inputs.model_slug || 'MiniMax-M2.7-highspeed' }}
# Bound to 10 min so a stuck provision fails the run instead of
# holding up the next cron firing. 15-min default in the script
# is for the on-PR full lifecycle where we have more headroom.
@@ -145,11 +134,6 @@ jobs:
E2E_KEEP_ORG: ${{ github.event.inputs.keep_org == 'true' && '1' || '' }}
MOLECULE_CP_URL: ${{ vars.STAGING_CP_URL || 'https://staging-api.moleculesai.app' }}
MOLECULE_ADMIN_TOKEN: ${{ secrets.CP_STAGING_ADMIN_API_TOKEN }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
AWS_DEFAULT_REGION: us-east-2
E2E_AWS_LEAK_CHECK: required
E2E_AWS_TERMINATE_LEAKS: '1'
# MiniMax key is the canary's PRIMARY auth path. claude-code
# template's `minimax` provider routes ANTHROPIC_BASE_URL to
# api.minimax.io/anthropic and reads MINIMAX_API_KEY at boot.
@@ -165,7 +149,7 @@ jobs:
# E2E_RUNTIME=langgraph or =hermes and still have a working
# canary path. The script picks the right blob shape based on
# which key is non-empty.
E2E_OPENAI_API_KEY: ${{ secrets.MOLECULE_STAGING_OPENAI_API_KEY }}
E2E_OPENAI_API_KEY: ${{ secrets.MOLECULE_STAGING_OPENAI_KEY }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
@@ -190,16 +174,10 @@ jobs:
echo "::error::Set it at Settings → Secrets and Variables → Actions; pull from staging-CP's CP_ADMIN_API_TOKEN env in Railway."
exit 1
fi
for var in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
if [ -z "${!var:-}" ]; then
echo "::error::$var secret missing — EC2 leak verification cannot run"
exit 1
fi
done
# LLM-key requirement is per-runtime: claude-code accepts
# EITHER MiniMax OR direct-Anthropic (whichever is set first),
# langgraph + hermes use OpenAI (MOLECULE_STAGING_OPENAI_API_KEY).
# langgraph + hermes use OpenAI (MOLECULE_STAGING_OPENAI_KEY).
case "${E2E_RUNTIME}" in
claude-code)
if [ -n "${E2E_MINIMAX_API_KEY:-}" ]; then
@@ -214,7 +192,7 @@ jobs:
fi
;;
langgraph|hermes)
required_secret_name="MOLECULE_STAGING_OPENAI_API_KEY"
required_secret_name="MOLECULE_STAGING_OPENAI_KEY"
required_secret_value="${E2E_OPENAI_API_KEY:-}"
;;
*)
+191
View File
@@ -0,0 +1,191 @@
name: E2E API Smoke Test
# Extracted from ci.yml so workflow-level concurrency can protect this job
# from run-level cancellation (issue #458).
#
# Trigger model (revised 2026-04-29):
#
# Always FIRES on push/pull_request to staging+main. Real work is gated
# per-step on `needs.detect-changes.outputs.api` — when paths under
# `workspace-server/`, `tests/e2e/`, or this workflow file haven't
# changed, the no-op step alone runs and emits SUCCESS for the
# `E2E API Smoke Test` check, satisfying branch protection without
# spending CI cycles. See the in-job comment on the `e2e-api` job for
# why this is one job (not two-jobs-sharing-name) and the 2026-04-29
# PR #2264 incident that drove the consolidation.
on:
push:
branches: [main, staging]
pull_request:
branches: [main, staging]
workflow_dispatch:
concurrency:
# Per-SHA grouping (changed 2026-04-28 from per-ref). Per-ref had the
# same auto-promote-staging brittleness as e2e-staging-canvas — back-
# to-back staging pushes share refs/heads/staging, so the older push's
# queued run gets cancelled when a newer push lands. Auto-promote-
# staging then sees `completed/cancelled` for the older SHA and stays
# put; the newer SHA's gates may eventually save the day, but if the
# newer push gets cancelled too, we deadlock.
#
# See e2e-staging-canvas.yml's identical concurrency block for the full
# rationale and the 2026-04-28 incident reference.
group: e2e-api-${{ github.event.pull_request.head.sha || github.sha }}
cancel-in-progress: false
jobs:
detect-changes:
runs-on: ubuntu-latest
outputs:
api: ${{ steps.decide.outputs.api }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- uses: dorny/paths-filter@fbd0ab8f3e69293af611ebaee6363fc25e6d187d # v4.0.1
id: filter
with:
filters: |
api:
- 'workspace-server/**'
- 'tests/e2e/**'
- '.github/workflows/e2e-api.yml'
- id: decide
# Always run real work for manual dispatch — no diff context to
# filter against and ops dispatching this expects the suite to
# actually exercise the platform.
run: |
if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
echo "api=true" >> "$GITHUB_OUTPUT"
else
echo "api=${{ steps.filter.outputs.api }}" >> "$GITHUB_OUTPUT"
fi
# ONE job (no job-level `if:`) that always runs and reports under the
# required-check name `E2E API Smoke Test`. Real work is gated per-step
# on `needs.detect-changes.outputs.api`. Reason: GitHub registers a
# check run for every job that matches `name:`, and a job-level
# `if: false` produces a SKIPPED check run. Branch protection treats
# all check runs with a matching context name on the latest commit as a
# SET — any SKIPPED in the set fails the required-check eval, even with
# SUCCESS siblings. Verified 2026-04-29 on PR #2264 (staging→main):
# 4 check runs (2 SKIPPED + 2 SUCCESS) at the head SHA blocked
# promotion despite all real work succeeding. Collapsing to a single
# always-running job with conditional steps emits exactly one SUCCESS
# check run regardless of paths filter — branch-protection-clean.
e2e-api:
needs: detect-changes
name: E2E API Smoke Test
runs-on: ubuntu-latest
timeout-minutes: 15
env:
DATABASE_URL: postgres://dev:dev@localhost:15432/molecule?sslmode=disable
REDIS_URL: redis://localhost:16379
PORT: "8080"
PG_CONTAINER: molecule-ci-postgres
REDIS_CONTAINER: molecule-ci-redis
steps:
- name: No-op pass (paths filter excluded this commit)
if: needs.detect-changes.outputs.api != 'true'
run: |
echo "No workspace-server / tests/e2e / workflow changes — E2E API gate satisfied without running tests."
echo "::notice::E2E API Smoke Test no-op pass (paths filter excluded this commit)."
- if: needs.detect-changes.outputs.api == 'true'
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- if: needs.detect-changes.outputs.api == 'true'
uses: actions/setup-go@40f1582b2485089dde7abd97c1529aa768e1baff # v5
with:
go-version: 'stable'
cache: true
cache-dependency-path: workspace-server/go.sum
- name: Start Postgres (docker)
if: needs.detect-changes.outputs.api == 'true'
run: |
docker rm -f "$PG_CONTAINER" 2>/dev/null || true
docker run -d --name "$PG_CONTAINER" -e POSTGRES_USER=dev -e POSTGRES_PASSWORD=dev -e POSTGRES_DB=molecule -p 15432:5432 postgres:16
for i in $(seq 1 30); do
if docker exec "$PG_CONTAINER" pg_isready -U dev >/dev/null 2>&1; then
echo "Postgres ready after ${i}s"
exit 0
fi
sleep 1
done
echo "::error::Postgres did not become ready in 30s"
docker logs "$PG_CONTAINER" || true
exit 1
- name: Start Redis (docker)
if: needs.detect-changes.outputs.api == 'true'
run: |
docker rm -f "$REDIS_CONTAINER" 2>/dev/null || true
docker run -d --name "$REDIS_CONTAINER" -p 16379:6379 redis:7
for i in $(seq 1 15); do
if docker exec "$REDIS_CONTAINER" redis-cli ping 2>/dev/null | grep -q PONG; then
echo "Redis ready after ${i}s"
exit 0
fi
sleep 1
done
echo "::error::Redis did not become ready in 15s"
docker logs "$REDIS_CONTAINER" || true
exit 1
- name: Build platform
if: needs.detect-changes.outputs.api == 'true'
working-directory: workspace-server
run: go build -o platform-server ./cmd/server
- name: Start platform (background)
if: needs.detect-changes.outputs.api == 'true'
working-directory: workspace-server
run: |
./platform-server > platform.log 2>&1 &
echo $! > platform.pid
- name: Wait for /health
if: needs.detect-changes.outputs.api == 'true'
run: |
for i in $(seq 1 30); do
if curl -sf http://localhost:8080/health > /dev/null; then
echo "Platform up after ${i}s"
exit 0
fi
sleep 1
done
echo "::error::Platform did not become healthy in 30s"
cat workspace-server/platform.log || true
exit 1
- name: Assert migrations applied
if: needs.detect-changes.outputs.api == 'true'
run: |
tables=$(docker exec "$PG_CONTAINER" psql -U dev -d molecule -tAc "SELECT count(*) FROM information_schema.tables WHERE table_schema='public' AND table_name='workspaces'")
if [ "$tables" != "1" ]; then
echo "::error::Migrations did not apply"
cat workspace-server/platform.log || true
exit 1
fi
echo "Migrations OK"
- name: Run E2E API tests
if: needs.detect-changes.outputs.api == 'true'
run: bash tests/e2e/test_api.sh
- name: Run notify-with-attachments E2E
if: needs.detect-changes.outputs.api == 'true'
run: bash tests/e2e/test_notify_attachments_e2e.sh
- name: Run priority-runtimes E2E (claude-code + hermes — skips when keys absent)
if: needs.detect-changes.outputs.api == 'true'
run: bash tests/e2e/test_priority_runtimes_e2e.sh
- name: Run poll-mode + since_id cursor E2E (#2339)
if: needs.detect-changes.outputs.api == 'true'
run: bash tests/e2e/test_poll_mode_e2e.sh
- name: Run poll-mode chat upload E2E (RFC #2891)
if: needs.detect-changes.outputs.api == 'true'
run: bash tests/e2e/test_poll_mode_chat_upload_e2e.sh
- name: Dump platform log on failure
if: failure() && needs.detect-changes.outputs.api == 'true'
run: cat workspace-server/platform.log || true
- name: Stop platform
if: always() && needs.detect-changes.outputs.api == 'true'
run: |
if [ -f workspace-server/platform.pid ]; then
kill "$(cat workspace-server/platform.pid)" 2>/dev/null || true
fi
- name: Stop service containers
if: always() && needs.detect-changes.outputs.api == 'true'
run: |
docker rm -f "$PG_CONTAINER" 2>/dev/null || true
docker rm -f "$REDIS_CONTAINER" 2>/dev/null || true
@@ -1,24 +1,13 @@
name: E2E Staging Canvas (Playwright)
# Ported from .github/workflows/e2e-staging-canvas.yml on 2026-05-11 per RFC
# internal#219 §1 sweep. Differences from the GitHub version:
# - Dropped `workflow_dispatch.inputs` (Gitea 1.22.6 parser rejects them
# per feedback_gitea_workflow_dispatch_inputs_unsupported).
# - Dropped `merge_group:` (no Gitea merge queue).
# - Dropped `environment:` blocks (Gitea has no environments).
# - Workflow-level env.GITHUB_SERVER_URL pinned per
# feedback_act_runner_github_server_url.
# - `continue-on-error: true` on each job (RFC §1 contract).
#
# Playwright test suite that provisions a fresh staging org per run and
# verifies every workspace-panel tab renders without crashing. Complements
# e2e-staging-saas.yml (which tests the API shape) by exercising the
# actual browser + canvas bundle against live staging.
#
# Triggers: push to main, PR touching canvas sources + this workflow only
# after the PR enters `merge-queue`, manual dispatch, and scheduled cron to
# catch browser/runtime drift even when canvas is quiet.
# Triggers: push to main/staging or PR touching canvas sources + this workflow,
# manual dispatch, and weekly cron to catch browser/runtime drift even
# when canvas is quiet.
# Added staging to push/pull_request branches so the auto-promote gate
# check (--event push --branch staging) can see a completed run for this
# workflow — mirrors what PR #1891 does for e2e-api.yml.
@@ -33,14 +22,14 @@ on:
# spending CI cycles. See e2e-api.yml for the rationale on why this
# is a single job rather than two-jobs-sharing-name.
push:
branches: [main]
branches: [main, staging]
pull_request:
branches: [main]
schedule:
# Nightly at 08:00 UTC — catches Chrome / Playwright / Next.js
# release-note-shaped regressions that don't ride in with a PR.
- cron: '0 8 * * *'
branches: [main, staging]
workflow_dispatch:
schedule:
# Weekly on Sunday 08:00 UTC — catches Chrome / Playwright / Next.js
# release-note-shaped regressions that don't ride in with a PR.
- cron: '0 8 * * 0'
concurrency:
# Per-SHA grouping (changed 2026-04-28 from a single global group). The
@@ -64,69 +53,28 @@ concurrency:
group: e2e-staging-canvas-${{ github.event.pull_request.head.sha || github.sha }}
cancel-in-progress: false
env:
GITHUB_SERVER_URL: https://git.moleculesai.app
jobs:
detect-changes:
runs-on: ubuntu-latest
# Phase 3 (RFC #219 §1): surface broken workflows without blocking.
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
outputs:
canvas: ${{ steps.decide.outputs.canvas }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- uses: dorny/paths-filter@fbd0ab8f3e69293af611ebaee6363fc25e6d187d # v4.0.1
id: filter
with:
fetch-depth: 0
filters: |
canvas:
- 'canvas/**'
- '.github/workflows/e2e-staging-canvas.yml'
- id: decide
env:
GITEA_TOKEN: ${{ secrets.GITHUB_TOKEN }}
QUEUE_LABEL: merge-queue
# Inline replacement for dorny/paths-filter — see e2e-api.yml.
# Cron and manual triggers always run real work (no diff context).
# Always run real tests for manual dispatch and the weekly cron —
# both exist precisely to exercise the suite, regardless of diff.
run: |
if [ "${{ github.event_name }}" = "schedule" ] || [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
echo "canvas=true" >> "$GITHUB_OUTPUT"
exit 0
fi
BASE="${GITHUB_BASE_REF:-${{ github.event.before }}}"
if [ "${{ github.event_name }}" = "pull_request" ] && [ -n "${{ github.event.pull_request.base.sha }}" ]; then
BASE="${{ github.event.pull_request.base.sha }}"
fi
if [ -z "$BASE" ] || echo "$BASE" | grep -qE '^0+$'; then
echo "canvas=true" >> "$GITHUB_OUTPUT"
exit 0
fi
if ! git cat-file -e "$BASE" 2>/dev/null; then
git fetch --depth=1 origin "$BASE" 2>/dev/null || true
fi
if ! git cat-file -e "$BASE" 2>/dev/null; then
echo "canvas=true" >> "$GITHUB_OUTPUT"
exit 0
fi
CHANGED=$(git diff --name-only "$BASE" HEAD)
if ! echo "$CHANGED" | grep -qE '^(canvas/|\.gitea/workflows/e2e-staging-canvas\.yml$)'; then
echo "canvas=false" >> "$GITHUB_OUTPUT"
exit 0
fi
if [ "${{ github.event_name }}" != "pull_request" ]; then
echo "canvas=true" >> "$GITHUB_OUTPUT"
exit 0
fi
authfile=$(mktemp)
chmod 600 "$authfile"
printf 'header = "Authorization: token %s"\n' "$GITEA_TOKEN" > "$authfile"
labels=$(curl -fsS -K "$authfile" \
"${{ github.server_url }}/api/v1/repos/${{ github.repository }}/issues/${{ github.event.pull_request.number }}/labels" \
| python3 -c 'import json,sys; print("\n".join(label.get("name","") for label in json.load(sys.stdin)))')
rm -f "$authfile"
if printf '%s\n' "$labels" | grep -qx "$QUEUE_LABEL"; then
if [ "${{ github.event_name }}" = "workflow_dispatch" ] || [ "${{ github.event_name }}" = "schedule" ]; then
echo "canvas=true" >> "$GITHUB_OUTPUT"
else
echo "PR is not in merge-queue; skipping heavy E2E Staging Canvas for normal PR path."
echo "canvas=false" >> "$GITHUB_OUTPUT"
echo "canvas=${{ steps.filter.outputs.canvas }}" >> "$GITHUB_OUTPUT"
fi
# ONE job (no job-level `if:`) that always runs and reports under the
@@ -139,18 +87,12 @@ jobs:
needs: detect-changes
name: Canvas tabs E2E
runs-on: ubuntu-latest
# Phase 3 (RFC #219 §1): surface broken workflows without blocking.
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
timeout-minutes: 40
env:
CANVAS_E2E_STAGING: '1'
MOLECULE_CP_URL: https://staging-api.moleculesai.app
# 2026-05-11: secret canonicalised from MOLECULE_STAGING_ADMIN_TOKEN
# (dead in org secret store) to CP_STAGING_ADMIN_API_TOKEN per
# internal#322 — see this PR for the cross-workflow sweep.
MOLECULE_ADMIN_TOKEN: ${{ secrets.CP_STAGING_ADMIN_API_TOKEN }}
MOLECULE_ADMIN_TOKEN: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKEN }}
defaults:
run:
@@ -171,7 +113,7 @@ jobs:
if: needs.detect-changes.outputs.canvas == 'true'
run: |
if [ -z "$MOLECULE_ADMIN_TOKEN" ]; then
echo "::error::Missing CP_STAGING_ADMIN_API_TOKEN"
echo "::error::Missing MOLECULE_STAGING_ADMIN_TOKEN"
exit 2
fi
@@ -189,15 +131,7 @@ jobs:
- name: Install Playwright browsers
if: needs.detect-changes.outputs.canvas == 'true'
timeout-minutes: 10
run: |
PREBAKED_PLAYWRIGHT=/ms-playwright
if [ -d "${PREBAKED_PLAYWRIGHT}" ] && find "${PREBAKED_PLAYWRIGHT}" -maxdepth 3 -type f -name 'chrome' | grep -q .; then
echo "Using prebaked Playwright Chromium from ${PREBAKED_PLAYWRIGHT}"
echo "PLAYWRIGHT_BROWSERS_PATH=${PREBAKED_PLAYWRIGHT}" >> "$GITHUB_ENV"
exit 0
fi
npx playwright install --with-deps chromium
run: npx playwright install --with-deps chromium
- name: Run staging canvas E2E
if: needs.detect-changes.outputs.canvas == 'true'
@@ -205,11 +139,7 @@ jobs:
- name: Upload Playwright report on failure
if: failure() && needs.detect-changes.outputs.canvas == 'true'
# Pinned to v3 for Gitea act_runner v0.6 compatibility — v4+ uses
# the GHES 3.10+ artifact protocol that Gitea 1.22.x does NOT
# implement (see ci.yml upload step for the canonical error
# cite). Drop this pin when Gitea ships the v4 protocol.
uses: actions/upload-artifact@c6a366c94c3e0affe28c06c8df20a878f24da3cf # v3.2.2
uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
with:
name: playwright-report-staging
path: canvas/playwright-report-staging/
@@ -217,8 +147,7 @@ jobs:
- name: Upload screenshots on failure
if: failure() && needs.detect-changes.outputs.canvas == 'true'
# Pinned to v3 for Gitea act_runner v0.6 compatibility (see above).
uses: actions/upload-artifact@c6a366c94c3e0affe28c06c8df20a878f24da3cf # v3.2.2
uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
with:
name: playwright-screenshots
path: canvas/test-results/
@@ -241,7 +170,7 @@ jobs:
- name: Teardown safety net
if: always() && needs.detect-changes.outputs.canvas == 'true'
env:
ADMIN_TOKEN: ${{ secrets.CP_STAGING_ADMIN_API_TOKEN }}
ADMIN_TOKEN: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKEN }}
run: |
set +e
STATE_FILE=".playwright-staging-state.json"
@@ -1,16 +1,5 @@
name: E2E Staging External Runtime
# Ported from .github/workflows/e2e-staging-external.yml on 2026-05-11 per RFC
# internal#219 §1 sweep. Differences from the GitHub version:
# - Dropped `workflow_dispatch.inputs` (Gitea 1.22.6 parser rejects them
# per feedback_gitea_workflow_dispatch_inputs_unsupported).
# - Dropped `merge_group:` (no Gitea merge queue).
# - Dropped `environment:` blocks (Gitea has no environments).
# - Workflow-level env.GITHUB_SERVER_URL pinned per
# feedback_act_runner_github_server_url.
# - `continue-on-error: true` on each job (RFC §1 contract).
#
# Regression for the four/five workspaces.status=awaiting_agent transitions
# that silently failed in production for five days before migration 046
# extended the workspace_status enum (see
@@ -43,7 +32,7 @@ name: E2E Staging External Runtime
on:
push:
branches: [main]
branches: [staging, main]
paths:
- 'workspace-server/internal/handlers/workspace.go'
- 'workspace-server/internal/handlers/registry.go'
@@ -53,9 +42,9 @@ on:
- 'workspace-server/migrations/**'
- 'workspace-server/internal/db/workspace_status_enum_drift_test.go'
- 'tests/e2e/test_staging_external_runtime.sh'
- '.gitea/workflows/e2e-staging-external.yml'
- '.github/workflows/e2e-staging-external.yml'
pull_request:
branches: [main]
branches: [staging, main]
paths:
- 'workspace-server/internal/handlers/workspace.go'
- 'workspace-server/internal/handlers/registry.go'
@@ -65,7 +54,18 @@ on:
- 'workspace-server/migrations/**'
- 'workspace-server/internal/db/workspace_status_enum_drift_test.go'
- 'tests/e2e/test_staging_external_runtime.sh'
- '.gitea/workflows/e2e-staging-external.yml'
- '.github/workflows/e2e-staging-external.yml'
workflow_dispatch:
inputs:
keep_org:
description: "Skip teardown for debugging (only via manual dispatch)"
required: false
type: boolean
default: false
stale_wait_secs:
description: "Seconds to wait for the heartbeat-staleness sweep (default 180 = 90s window + 90s buffer)"
required: false
default: "180"
schedule:
- cron: '30 7 * * *'
@@ -76,24 +76,15 @@ concurrency:
permissions:
contents: read
env:
GITHUB_SERVER_URL: https://git.moleculesai.app
jobs:
e2e-staging-external:
name: E2E Staging External Runtime
runs-on: ubuntu-latest
# Phase 3 (RFC #219 §1): surface broken workflows without blocking.
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
timeout-minutes: 25
env:
MOLECULE_CP_URL: https://staging-api.moleculesai.app
# 2026-05-11: secret canonicalised from MOLECULE_STAGING_ADMIN_TOKEN
# (dead in org secret store) to CP_STAGING_ADMIN_API_TOKEN per
# internal#322 — see this PR for the cross-workflow sweep.
MOLECULE_ADMIN_TOKEN: ${{ secrets.CP_STAGING_ADMIN_API_TOKEN }}
MOLECULE_ADMIN_TOKEN: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKEN }}
E2E_RUN_ID: "${{ github.run_id }}-${{ github.run_attempt }}"
E2E_KEEP_ORG: ${{ github.event.inputs.keep_org && '1' || '0' }}
E2E_STALE_WAIT_SECS: ${{ github.event.inputs.stale_wait_secs || '180' }}
@@ -108,7 +99,7 @@ jobs:
# missing — silent skip would mask infra rot. Manual dispatch
# gets the same hard-fail; an operator running this on a fork
# without secrets configured needs to know up-front.
echo "::error::CP_STAGING_ADMIN_API_TOKEN secret not set (Railway staging CP_ADMIN_API_TOKEN)"
echo "::error::MOLECULE_STAGING_ADMIN_TOKEN secret not set (Railway staging CP_ADMIN_API_TOKEN)"
exit 2
fi
echo "Admin token present ✓"
@@ -133,7 +124,7 @@ jobs:
- name: Teardown safety net (runs on cancel/failure)
if: always()
env:
ADMIN_TOKEN: ${{ secrets.CP_STAGING_ADMIN_API_TOKEN }}
ADMIN_TOKEN: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKEN }}
run: |
set +e
orgs=$(curl -sS "$MOLECULE_CP_URL/cp/admin/orgs" \
@@ -1,16 +1,5 @@
name: E2E Staging SaaS (full lifecycle)
# Ported from .github/workflows/e2e-staging-saas.yml on 2026-05-11 per RFC
# internal#219 §1 sweep. Differences from the GitHub version:
# - Dropped `workflow_dispatch.inputs` (Gitea 1.22.6 parser rejects them
# per feedback_gitea_workflow_dispatch_inputs_unsupported).
# - Dropped `merge_group:` (no Gitea merge queue).
# - Dropped `environment:` blocks (Gitea has no environments).
# - Workflow-level env.GITHUB_SERVER_URL pinned per
# feedback_act_runner_github_server_url.
# - `continue-on-error: true` on each job (RFC §1 contract).
#
# Dedicated workflow that provisions a fresh staging org per run, exercises
# the full workspace lifecycle (register → heartbeat → A2A → delegation →
# HMA memory → activity → peers), then tears down and asserts leak-free.
@@ -24,24 +13,20 @@ name: E2E Staging SaaS (full lifecycle)
# PRs don't need to read.
#
# Triggers:
# - Push to main (regression guard — fires on merges to main, not on PR updates)
# - pull_request: pr-validate always posts success; real E2E step runs only
# when provisioning-critical files change (detect-changes gates the step).
# - Push to main (regression guard)
# - workflow_dispatch (manual re-run from UI)
# - Nightly cron (catches drift even when no pushes land)
#
# NOTE: A separate pr-validate job handles the pull_request path so this
# workflow posts CI status for workflow-only PRs. Without it, a PR that
# only touches the workflow file has no status check (workflow only fires
# on push, not PR branches), which blocks merge under branch protection.
# The E2E step itself only runs when provisioning-critical files change —
# pr-validate always posts success, avoiding the double-fire that motivated
# the pull_request-trigger removal in PRs #516/#530.
# - Changes to any provisioning-critical file under PR review (opt-in
# via the same paths watcher that e2e-api.yml uses)
on:
# Trunk-based (Phase 3 of internal#81): main is the only branch.
# Fire on staging push too — previously this only ran on main, which
# meant the most thorough end-to-end test caught regressions AFTER
# they shipped to staging (and then to the auto-promote PR). Running
# on staging push catches them BEFORE the staging→main promotion
# opens, so a green canary into auto-promote is more meaningful.
push:
branches: [main]
branches: [staging, main]
paths:
- 'workspace-server/internal/handlers/registry.go'
- 'workspace-server/internal/handlers/workspace_provision.go'
@@ -49,11 +34,9 @@ on:
- 'workspace-server/internal/middleware/**'
- 'workspace-server/internal/provisioner/**'
- 'tests/e2e/test_staging_full_saas.sh'
- 'tests/e2e/lib/aws_leak_check.sh'
- 'tests/e2e/test_aws_leak_check.sh'
- '.gitea/workflows/e2e-staging-saas.yml'
- '.github/workflows/e2e-staging-saas.yml'
pull_request:
branches: [main]
branches: [staging, main]
paths:
- 'workspace-server/internal/handlers/registry.go'
- 'workspace-server/internal/handlers/workspace_provision.go'
@@ -61,10 +44,18 @@ on:
- 'workspace-server/internal/middleware/**'
- 'workspace-server/internal/provisioner/**'
- 'tests/e2e/test_staging_full_saas.sh'
- 'tests/e2e/lib/aws_leak_check.sh'
- 'tests/e2e/test_aws_leak_check.sh'
- '.gitea/workflows/e2e-staging-saas.yml'
- '.github/workflows/e2e-staging-saas.yml'
workflow_dispatch:
inputs:
runtime:
description: "Runtime to test (claude-code [default, MiniMax] | hermes [OpenAI] | langgraph [OpenAI])"
required: false
default: "claude-code"
keep_org:
description: "Skip teardown for debugging (only use via manual dispatch!)"
required: false
type: boolean
default: false
schedule:
# 07:00 UTC every day — catches AMI drift, WorkOS cert rotation,
# Cloudflare API regressions, etc. even on quiet days.
@@ -78,46 +69,10 @@ concurrency:
group: e2e-staging-saas
cancel-in-progress: false
env:
GITHUB_SERVER_URL: https://git.moleculesai.app
jobs:
# PR-validation path: always posts success so branch protection can merge
# workflow-only PRs. The actual E2E step only runs when provisioning-
# critical files change (git-paths filter + if: guard below).
# All steps use continue-on-error: true so runner issues do not block merge.
pr-validate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 1
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
- uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0
with:
python-version: "3.11"
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
- name: YAML validation (best-effort)
run: |
echo "e2e-staging-saas.yml — PR validation: workflow YAML is valid."
echo "E2E step runs only when provisioning-critical files change."
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
# Actual E2E: runs on trunk pushes (main + staging). NOT the PR-fire-only
# path — pr-validate above posts success for workflow-only PRs.
e2e-staging-saas:
name: E2E Staging SaaS
runs-on: ubuntu-latest
# Only runs on trunk pushes. PR paths get pr-validate instead.
if: github.event.pull_request.base.ref == ''
# Phase 3 (RFC #219 §1): surface broken workflows without blocking.
# mc#774: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
continue-on-error: true
timeout-minutes: 45
permissions:
contents: read
@@ -127,15 +82,7 @@ jobs:
# Single admin-bearer secret drives provision + tenant-token
# retrieval + teardown. Configure in
# Settings → Secrets and variables → Actions → Repository secrets.
# 2026-05-11: secret canonicalised from MOLECULE_STAGING_ADMIN_TOKEN
# (dead in org secret store) to CP_STAGING_ADMIN_API_TOKEN per
# internal#322 — see this PR for the cross-workflow sweep.
MOLECULE_ADMIN_TOKEN: ${{ secrets.CP_STAGING_ADMIN_API_TOKEN }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
AWS_DEFAULT_REGION: us-east-2
E2E_AWS_LEAK_CHECK: required
E2E_AWS_TERMINATE_LEAKS: '1'
MOLECULE_ADMIN_TOKEN: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKEN }}
# MiniMax is the PRIMARY LLM auth path post-2026-05-04. Switched
# from hermes+OpenAI default after #2578 (the staging OpenAI key
# account went over quota and stayed dead for 36+ hours, taking
@@ -144,7 +91,7 @@ jobs:
# ANTHROPIC_BASE_URL to api.minimax.io/anthropic and reads
# MINIMAX_API_KEY at boot — separate billing account so an
# OpenAI quota collapse no longer wedges the gate. Mirrors the
# staging-smoke.yml + continuous-synth-e2e.yml migrations.
# canary-staging.yml + continuous-synth-e2e.yml migrations.
E2E_MINIMAX_API_KEY: ${{ secrets.MOLECULE_STAGING_MINIMAX_API_KEY }}
# Direct-Anthropic alternative for operators who don't want to
# set up a MiniMax account (priority below MiniMax — first
@@ -154,14 +101,14 @@ jobs:
# OpenAI fallback — kept wired so an operator-dispatched run with
# E2E_RUNTIME=hermes or =langgraph via workflow_dispatch can still
# exercise the OpenAI path.
E2E_OPENAI_API_KEY: ${{ secrets.MOLECULE_STAGING_OPENAI_API_KEY }}
E2E_OPENAI_API_KEY: ${{ secrets.MOLECULE_STAGING_OPENAI_KEY }}
E2E_RUNTIME: ${{ github.event.inputs.runtime || 'claude-code' }}
# Pin the model when running on the default claude-code path —
# the per-runtime default ("sonnet") routes to direct Anthropic
# and defeats the cost saving. Operators can override via the
# workflow_dispatch flow (no input wired here yet — runtime
# override is enough for ad-hoc).
E2E_MODEL_SLUG: ${{ github.event.inputs.runtime == 'hermes' && 'openai/gpt-4o' || github.event.inputs.runtime == 'langgraph' && 'openai:gpt-4o' || 'MiniMax-M2' }}
E2E_MODEL_SLUG: ${{ github.event.inputs.runtime == 'hermes' && 'openai/gpt-4o' || github.event.inputs.runtime == 'langgraph' && 'openai:gpt-4o' || 'MiniMax-M2.7-highspeed' }}
E2E_RUN_ID: "${{ github.run_id }}-${{ github.run_attempt }}"
E2E_KEEP_ORG: ${{ github.event.inputs.keep_org && '1' || '0' }}
@@ -171,15 +118,9 @@ jobs:
- name: Verify admin token present
run: |
if [ -z "$MOLECULE_ADMIN_TOKEN" ]; then
echo "::error::CP_STAGING_ADMIN_API_TOKEN secret not set (Railway staging CP_ADMIN_API_TOKEN)"
echo "::error::MOLECULE_STAGING_ADMIN_TOKEN secret not set (Railway staging CP_ADMIN_API_TOKEN)"
exit 2
fi
for var in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
if [ -z "${!var:-}" ]; then
echo "::error::$var not set — EC2 leak verification cannot run"
exit 2
fi
done
echo "Admin token present ✓"
- name: Verify LLM key present
@@ -207,7 +148,7 @@ jobs:
fi
;;
langgraph|hermes)
required_secret_name="MOLECULE_STAGING_OPENAI_API_KEY"
required_secret_name="MOLECULE_STAGING_OPENAI_KEY"
required_secret_value="${E2E_OPENAI_API_KEY:-}"
;;
*)
@@ -244,7 +185,7 @@ jobs:
- name: Teardown safety net (runs on cancel/failure)
if: always()
env:
ADMIN_TOKEN: ${{ secrets.CP_STAGING_ADMIN_API_TOKEN }}
ADMIN_TOKEN: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKEN }}
run: |
# Best-effort: find any e2e-YYYYMMDD-* orgs matching this run and
# nuke them. Catches the case where the script died before
+171
View File
@@ -0,0 +1,171 @@
name: E2E Staging Sanity (leak-detection self-check)
# Periodic assertion that the teardown safety nets in e2e-staging-saas
# and canary-staging actually work. Runs the E2E harness with
# E2E_INTENTIONAL_FAILURE=1, which poisons the tenant admin token after
# the org is provisioned. The workspace-provision step then fails, the
# script exits non-zero, and the EXIT trap + workflow always()-step
# must still tear down cleanly.
#
# A green run means:
# - The script exited non-zero (intentional failure caught)
# - The trap fired teardown
# - The leak-detection poll found zero orphan orgs
#
# A red run means the teardown path itself is broken — act on this the
# same way you'd act on a canary failure (the whole E2E safety net is
# compromised until it's fixed).
#
# Cadence: once a week, Monday 06:00 UTC. Drift-slow, not per-PR — the
# teardown path rarely changes, and a weekly heartbeat is enough to
# catch silent regressions in cleanup code paths.
on:
schedule:
- cron: '0 6 * * 1'
workflow_dispatch:
concurrency:
# Shares the group with canary + full so they don't collide on
# staging org-create quota.
group: e2e-staging-sanity
cancel-in-progress: false
permissions:
issues: write
contents: read
jobs:
sanity:
name: Intentional-failure teardown sanity
runs-on: ubuntu-latest
timeout-minutes: 20
env:
MOLECULE_CP_URL: https://staging-api.moleculesai.app
MOLECULE_ADMIN_TOKEN: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKEN }}
E2E_MODE: canary # lean lifecycle; we only need the org to exist
E2E_RUNTIME: hermes
E2E_RUN_ID: "sanity-${{ github.run_id }}"
E2E_INTENTIONAL_FAILURE: "1"
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: Verify admin token present
run: |
if [ -z "$MOLECULE_ADMIN_TOKEN" ]; then
echo "::error::MOLECULE_STAGING_ADMIN_TOKEN not set"
exit 2
fi
# Inverted assertion: the run MUST fail. If it passes, the
# E2E_INTENTIONAL_FAILURE path is broken (token not being
# poisoned correctly, or the harness silently recovered).
- name: Run harness — expecting exit !=0
id: harness
run: |
set +e
bash tests/e2e/test_staging_full_saas.sh
rc=$?
echo "harness_rc=$rc" >> "$GITHUB_OUTPUT"
# The only acceptable outcomes:
# 1 — harness failed mid-run, teardown ran, leak-check passed
# (exit 4 means teardown left a leak — that's the real bug
# this sanity check exists to catch)
if [ "$rc" = "1" ]; then
echo "✓ Harness failed as expected (rc=1); teardown trap ran, leak-check passed"
exit 0
elif [ "$rc" = "0" ]; then
echo "::error::Harness succeeded under E2E_INTENTIONAL_FAILURE=1 — the poisoning path is broken"
exit 1
elif [ "$rc" = "4" ]; then
echo "::error::LEAK DETECTED (rc=4) — teardown failed to clean up the org. Safety net broken."
exit 4
else
echo "::error::Unexpected rc=$rc — neither clean-failure nor leak. Investigate harness."
exit 1
fi
- name: Open issue if safety net is broken
if: failure()
uses: actions/github-script@3a2844b7e9c422d3c10d287c895573f7108da1b3 # v9.0.0
with:
script: |
const title = "🚨 E2E teardown safety net broken";
const runURL = `https://github.com/${context.repo.owner}/${context.repo.repo}/actions/runs/${context.runId}`;
const body =
`The weekly sanity run (E2E_INTENTIONAL_FAILURE=1) did not exit ` +
`as expected. This means one of:\n` +
` - poisoning didn't actually cause failure (test harness regression), OR\n` +
` - teardown left an orphan org (leak detection caught a real bug)\n\n` +
`Run: ${runURL}\n\n` +
`This is higher priority than a canary failure — the whole ` +
`E2E safety net can't be trusted until this is resolved.`;
const { data: existing } = await github.rest.issues.listForRepo({
owner: context.repo.owner, repo: context.repo.repo,
state: 'open', labels: 'e2e-safety-net',
});
const match = existing.find(i => i.title === title);
if (match) {
await github.rest.issues.createComment({
owner: context.repo.owner, repo: context.repo.repo,
issue_number: match.number,
body: `Still broken. ${runURL}`,
});
} else {
await github.rest.issues.create({
owner: context.repo.owner, repo: context.repo.repo,
title, body,
labels: ['e2e-safety-net', 'bug', 'priority-high'],
});
}
# Belt-and-braces: if teardown left anything behind, nuke it here
# so we don't bleed staging quota. Different label from the
# always()-steps in the other workflows so sanity-only orgs get
# cleaned up by sanity runs.
- name: Teardown safety net
if: always()
env:
ADMIN_TOKEN: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKEN }}
run: |
set +e
orgs=$(curl -sS "$MOLECULE_CP_URL/cp/admin/orgs" \
-H "Authorization: Bearer $ADMIN_TOKEN" 2>/dev/null \
| python3 -c "
import json, sys
d = json.load(sys.stdin)
today = __import__('datetime').date.today().strftime('%Y%m%d')
candidates = [o['slug'] for o in d.get('orgs', [])
if o.get('slug','').startswith(f'e2e-canary-{today}-sanity-')
and o.get('status') not in ('purged',)]
print('\n'.join(candidates))
" 2>/dev/null)
# Per-slug verified DELETE — see molecule-controlplane#420.
# Failures surface as workflow warnings; the sweeper is the
# safety net within ~45 min.
leaks=()
for slug in $orgs; do
# Tempfile-routed -w + set +e/-e prevents curl-exit-code
# pollution of the captured status (lint-curl-status-capture.yml).
set +e
curl -sS -o /tmp/sanity-cleanup.out -w "%{http_code}" \
-X DELETE "$MOLECULE_CP_URL/cp/admin/tenants/$slug" \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d "{\"confirm\":\"$slug\"}" >/tmp/sanity-cleanup.code
set -e
code=$(cat /tmp/sanity-cleanup.code 2>/dev/null || echo "000")
if [ "$code" = "200" ] || [ "$code" = "204" ]; then
echo "[teardown] deleted $slug (HTTP $code)"
else
echo "::warning::sanity teardown for $slug returned HTTP $code — sweep-stale-e2e-orgs will catch it within ~45 min. Body: $(head -c 300 /tmp/sanity-cleanup.out 2>/dev/null)"
leaks+=("$slug")
fi
done
if [ ${#leaks[@]} -gt 0 ]; then
echo "::warning::sanity teardown left ${#leaks[@]} leak(s): ${leaks[*]}"
fi
exit 0
@@ -0,0 +1,171 @@
name: Handlers Postgres Integration
# Real-Postgres integration tests for workspace-server/internal/handlers/.
# Triggered on every PR/push that touches the handlers package.
#
# Why this workflow exists
# ------------------------
# Strict-sqlmock unit tests pin which SQL statements fire — they're fast
# and let us iterate without a DB. But sqlmock CANNOT detect bugs that
# depend on the row state AFTER the SQL runs. The result_preview-lost
# bug shipped to staging in PR #2854 because every unit test was
# satisfied with "an UPDATE statement fired" — none verified the row's
# preview field actually landed. The local-postgres E2E that retrofit
# self-review caught it took 2 minutes to set up and would have caught
# the bug at PR-time.
#
# This job spins a Postgres service container, applies the migration,
# and runs `go test -tags=integration` against a live DB. Required
# check on staging branch protection — backend handler PRs cannot
# merge without a real-DB regression gate.
#
# Cost: ~30s job (postgres pull from GH cache + go build + 4 tests).
on:
push:
branches: [main, staging]
pull_request:
branches: [main, staging]
merge_group:
types: [checks_requested]
workflow_dispatch:
concurrency:
group: handlers-pg-integ-${{ github.event.pull_request.head.sha || github.sha }}
cancel-in-progress: false
jobs:
detect-changes:
name: detect-changes
runs-on: ubuntu-latest
outputs:
handlers: ${{ steps.filter.outputs.handlers }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- uses: dorny/paths-filter@fbd0ab8f3e69293af611ebaee6363fc25e6d187d # v4.0.1
id: filter
with:
filters: |
handlers:
- 'workspace-server/internal/handlers/**'
- 'workspace-server/internal/wsauth/**'
- 'workspace-server/migrations/**'
- '.github/workflows/handlers-postgres-integration.yml'
# Single-job-with-per-step-if pattern: always runs to satisfy the
# required-check name on branch protection; real work gates on the
# paths filter. See ci.yml's Platform (Go) for the same shape.
integration:
name: Handlers Postgres Integration
needs: detect-changes
runs-on: ubuntu-latest
services:
postgres:
image: postgres:15-alpine
env:
POSTGRES_PASSWORD: test
POSTGRES_DB: molecule
ports:
- 5432:5432
# GHA spins this with --health-cmd built in for postgres images.
options: >-
--health-cmd pg_isready
--health-interval 5s
--health-timeout 5s
--health-retries 10
defaults:
run:
working-directory: workspace-server
steps:
- if: needs.detect-changes.outputs.handlers != 'true'
working-directory: .
run: echo "No handlers/migrations changes — skipping; this job always runs to satisfy the required-check name."
- if: needs.detect-changes.outputs.handlers == 'true'
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- if: needs.detect-changes.outputs.handlers == 'true'
uses: actions/setup-go@40f1582b2485089dde7abd97c1529aa768e1baff # v5
with:
go-version: 'stable'
- if: needs.detect-changes.outputs.handlers == 'true'
name: Apply migrations to Postgres service
env:
PGPASSWORD: test
run: |
# Wait for postgres to actually accept connections (the
# GHA --health-cmd is best-effort but psql can still race).
for i in {1..15}; do
if pg_isready -h localhost -p 5432 -U postgres -q; then break; fi
echo "waiting for postgres..."; sleep 2
done
# Apply every .up.sql in lexicographic order with
# ON_ERROR_STOP=0 — failing migrations are SKIPPED rather than
# blocking the suite. This handles the current schema state
# where a few historical migrations (e.g. 017_memories_fts_*)
# depend on tables that were later renamed/dropped and so
# cannot replay from scratch. The migrations that DO succeed
# land their tables, which is sufficient for the integration
# tests in handlers/.
#
# Why not maintain a curated allowlist: every new migration
# touching a handlers/-tested table would have to update this
# workflow. With apply-all-or-skip, a future migration that
# adds a column to delegations runs automatically (its base
# table 049_delegations.up.sql already succeeded above it in
# the order). Operators only need to revisit this if the
# migration chain becomes legitimately replayable end-to-end.
#
# Per-migration result is logged so a failed migration that
# SHOULD have been replayable surfaces in the CI log instead
# of silently failing.
# Apply both *.sql (legacy, lives next to its module) and
# *.up.sql (newer up/down convention) in a single
# lexicographically-sorted pass. Excluding *.down.sql so the
# newest-naming-convention pairs don't undo themselves mid-run.
# Pre-#149-followup this loop only globbed *.up.sql, which
# silently skipped 001_workspaces.sql + 009_activity_logs.sql
# — fine while no integration test depended on those tables,
# not fine once a cross-table atomicity test came in.
set +e
for migration in $(ls migrations/*.sql 2>/dev/null | grep -v '\.down\.sql$' | sort); do
if psql -h localhost -U postgres -d molecule -v ON_ERROR_STOP=1 \
-f "$migration" >/dev/null 2>&1; then
echo "✓ $(basename "$migration")"
else
echo "⊘ $(basename "$migration") (skipped — see comment in workflow)"
fi
done
set -e
# Sanity: the delegations + workspaces + activity_logs tables
# MUST exist for the integration tests to be meaningful. Hard-
# fail if any didn't land — that would be a real regression we
# want loud.
for tbl in delegations workspaces activity_logs pending_uploads; do
if ! psql -h localhost -U postgres -d molecule -tA \
-c "SELECT 1 FROM information_schema.tables WHERE table_name = '$tbl'" \
| grep -q 1; then
echo "::error::$tbl table missing after migration replay — handler integration tests would be meaningless"
exit 1
fi
echo "✓ $tbl table present"
done
- if: needs.detect-changes.outputs.handlers == 'true'
name: Run integration tests
env:
INTEGRATION_DB_URL: postgres://postgres:test@localhost:5432/molecule?sslmode=disable
run: |
go test -tags=integration -timeout 5m -v ./internal/handlers/ -run "^TestIntegration_"
- if: needs.detect-changes.outputs.handlers == 'true' && failure()
name: Diagnostic dump on failure
env:
PGPASSWORD: test
run: |
echo "::group::delegations table state"
psql -h localhost -U postgres -d molecule -c "SELECT * FROM delegations LIMIT 50;" || true
echo "::endgroup::"
+162
View File
@@ -0,0 +1,162 @@
name: Harness Replays
# Boots tests/harness (production-shape compose topology with TenantGuard,
# /cp/* proxy, canvas proxy, real production Dockerfile.tenant) and runs
# every replay under tests/harness/replays/. Fails the PR if any replay
# fails.
#
# Why this exists: 2026-04-30 we shipped #2398 which added /buildinfo as
# a public route in router.go but forgot to add it to TenantGuard's
# allowlist. The handler-level test in buildinfo_test.go constructed a
# minimal gin engine without TenantGuard — green. The harness's
# buildinfo-stale-image.sh replay would have caught it (cf-proxy doesn't
# inject X-Molecule-Org-Id, so the curl path is identical to production's
# redeploy verifier), but no one ran the harness pre-merge. The bug
# shipped; the redeploy verifier silently soft-warned every tenant as
# "unreachable" for ~1 day before being noticed.
#
# This gate makes "did you actually run the harness?" a CI invariant
# instead of a memory-discipline thing.
#
# Trigger model — match e2e-api.yml: always FIRES on push/pull_request
# to staging+main, real work is gated per-step on detect-changes output.
# One job → one check run → branch-protection-clean (the SKIPPED-in-set
# trap from PR #2264 is documented in e2e-api.yml's e2e-api job comment).
on:
push:
branches: [main, staging]
paths:
- 'workspace-server/**'
- 'canvas/**'
- 'tests/harness/**'
- '.github/workflows/harness-replays.yml'
pull_request:
branches: [main, staging]
paths:
- 'workspace-server/**'
- 'canvas/**'
- 'tests/harness/**'
- '.github/workflows/harness-replays.yml'
workflow_dispatch:
merge_group:
types: [checks_requested]
concurrency:
# Per-SHA grouping. Per-ref kept hitting the auto-promote-staging
# cancellation deadlock — see e2e-api.yml's concurrency block for
# the 2026-04-28 incident that codified this pattern.
group: harness-replays-${{ github.event.pull_request.head.sha || github.sha }}
cancel-in-progress: false
jobs:
detect-changes:
runs-on: ubuntu-latest
outputs:
run: ${{ steps.decide.outputs.run }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- uses: dorny/paths-filter@fbd0ab8f3e69293af611ebaee6363fc25e6d187d # v4.0.1
id: filter
with:
filters: |
run:
- 'workspace-server/**'
- 'canvas/**'
- 'tests/harness/**'
- '.github/workflows/harness-replays.yml'
- id: decide
run: |
if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
echo "run=true" >> "$GITHUB_OUTPUT"
else
echo "run=${{ steps.filter.outputs.run }}" >> "$GITHUB_OUTPUT"
fi
# ONE job that always runs. Real work is gated per-step on
# detect-changes.outputs.run so an unrelated PR (e.g. doc-only
# change to molecule-controlplane wired here later) emits the
# required check without spending CI cycles. Single-job pattern
# matches e2e-api.yml — see that workflow's comment for why a
# job-level `if: false` would block branch protection via the
# SKIPPED-in-set bug.
harness-replays:
needs: detect-changes
name: Harness Replays
runs-on: ubuntu-latest
timeout-minutes: 30
steps:
- name: No-op pass (paths filter excluded this commit)
if: needs.detect-changes.outputs.run != 'true'
run: |
echo "No workspace-server / canvas / tests/harness / workflow changes — Harness Replays gate satisfied without running."
echo "::notice::Harness Replays no-op pass (paths filter excluded this commit)."
- if: needs.detect-changes.outputs.run == 'true'
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
# github-app-auth sibling-checkout removed 2026-05-07 (#157):
# the plugin was dropped + Dockerfile.tenant no longer COPYs it.
- name: Install Python deps for replays
# peer-discovery-404 (and future replays) eval Python against the
# running tenant — importing workspace/a2a_client.py pulls in
# httpx. tests/harness/requirements.txt holds just the HTTP-client
# surface to keep CI install fast (~3s) vs the full
# workspace/requirements.txt (~30s).
if: needs.detect-changes.outputs.run == 'true'
run: pip install -r tests/harness/requirements.txt
- name: Run all replays against the harness
# run-all-replays.sh: boot via up.sh → seed via seed.sh → run
# every replays/*.sh → tear down via down.sh on EXIT (trap).
# Non-zero exit on any replay failure.
#
# KEEP_UP=1: without this, the script's trap-on-EXIT tears
# down containers immediately on failure, leaving the dump
# step below with nothing to dump (verified on PR #2410's
# first run — tenant became unhealthy, trap fired, dump
# step saw empty containers). Keeping them up lets the
# failure path collect tenant/cp-stub/cf-proxy logs. The
# always-run "Force teardown" step does the actual cleanup.
if: needs.detect-changes.outputs.run == 'true'
working-directory: tests/harness
env:
KEEP_UP: "1"
run: ./run-all-replays.sh
- name: Dump compose logs on failure
# SECRETS_ENCRYPTION_KEY: docker compose validates the entire compose
# file even for read-only `logs` calls. up.sh generates a per-run key
# and exports it to its OWN shell — this step runs in a fresh shell
# that wouldn't see it, so without a placeholder the validate step
# errors before logs print (verified against PR #2492's first run:
# "required variable SECRETS_ENCRYPTION_KEY is missing a value").
# A placeholder is fine — we're only reading log streams, not booting.
if: failure() && needs.detect-changes.outputs.run == 'true'
working-directory: tests/harness
env:
SECRETS_ENCRYPTION_KEY: dump-logs-placeholder
run: |
echo "=== docker compose ps ==="
docker compose -f compose.yml ps || true
echo "=== tenant-alpha logs ==="
docker compose -f compose.yml logs tenant-alpha || true
echo "=== tenant-beta logs ==="
docker compose -f compose.yml logs tenant-beta || true
echo "=== cp-stub logs ==="
docker compose -f compose.yml logs cp-stub || true
echo "=== cf-proxy logs ==="
docker compose -f compose.yml logs cf-proxy || true
echo "=== postgres-alpha logs (last 100) ==="
docker compose -f compose.yml logs --tail 100 postgres-alpha || true
echo "=== postgres-beta logs (last 100) ==="
docker compose -f compose.yml logs --tail 100 postgres-beta || true
- name: Force teardown
# We pass KEEP_UP=1 to run-all-replays.sh so the dump step
# above sees real containers — that means we own teardown
# explicitly here. Always run.
if: always() && needs.detect-changes.outputs.run == 'true'
working-directory: tests/harness
run: ./down.sh || true
@@ -0,0 +1,94 @@
name: Lint curl status-code capture
# Pins the workflow-bash anti-pattern that produced "HTTP 000000" on the
# 2026-05-04 redeploy-tenants-on-main run for sha 2b862f6:
#
# HTTP_CODE=$(curl ... -w '%{http_code}' ... || echo "000")
#
# When curl exits non-zero (connection reset → 56, --fail-with-body 4xx/5xx
# → 22), the `-w '%{http_code}'` already wrote a status to stdout — usually
# "000" for connection failures or the actual code for HTTP errors. The
# `|| echo "000"` then fires AND appends ANOTHER "000" to the captured
# stdout, producing values like "000000" or "409000" that fail string
# comparisons against "200" while looking superficially right.
#
# Same class of bug the synth-E2E §7c gate hit twice (PRs #2779/#2783 +
# #2797). Memory: feedback_curl_status_capture_pollution.md.
#
# Fix shape (route -w into a tempfile so curl's exit code can't pollute):
#
# set +e
# curl ... -w '%{http_code}' >code.txt 2>/dev/null
# set -e
# HTTP_CODE=$(cat code.txt 2>/dev/null)
# [ -z "$HTTP_CODE" ] && HTTP_CODE="000"
on:
pull_request:
paths: ['.github/workflows/**']
push:
branches: [main, staging]
paths: ['.github/workflows/**']
merge_group:
types: [checks_requested]
jobs:
scan:
name: Scan workflows for curl status-capture pollution
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: Find curl ... -w '%{http_code}' ... || echo "000" subshells
run: |
set -uo pipefail
# Multi-line aware: look for `$(curl ... -w '%{http_code}' ... || echo "000")`
# subshell where the entire command-substitution wraps a curl that
# ends with `|| echo "000"`. Must distinguish from the SAFE shape
# `$(cat tempfile 2>/dev/null || echo "000")` — `cat` with a missing
# tempfile produces empty stdout, no pollution.
python3 <<'PY'
import os, re, sys, glob
BAD_FILES = []
# Match the buggy substitution across newlines: $(curl ... -w '%{http_code}' ... || echo "000")
# The `\\n` is the bash line-continuation that lets curl flags span lines.
# We collapse continuation lines first, then look for the single-line bad pattern.
PATTERN = re.compile(
r'\$\(\s*curl\b[^)]*-w\s*[\'"]%\{http_code\}[\'"][^)]*\|\|\s*echo\s+"000"\s*\)',
re.DOTALL,
)
# Self-skip: this lint workflow contains the literal anti-pattern in
# its own docstring — that's intentional, not a bug.
SELF = ".github/workflows/lint-curl-status-capture.yml"
for f in sorted(glob.glob(".github/workflows/*.yml")):
if f == SELF:
continue
with open(f) as fh:
content = fh.read()
# Collapse bash line-continuations (\\\n + leading whitespace)
# into a single logical line so the regex can see the full
# curl invocation as one chunk.
flat = re.sub(r'\\\s*\n\s*', ' ', content)
for m in PATTERN.finditer(flat):
BAD_FILES.append((f, m.group(0)[:120]))
if not BAD_FILES:
print("✓ No curl-status-capture pollution patterns detected")
sys.exit(0)
print(f"::error::Found {len(BAD_FILES)} curl-status-capture pollution site(s):")
for f, snippet in BAD_FILES:
print(f"::error file={f}::Curl status-capture pollution: '|| echo \"000\"' inside a $(curl ... -w '%{{http_code}}' ...) subshell. On non-2xx or connection failure, curl's -w writes a status, then exits non-zero, then the || echo appends another '000' — producing 'HTTP 000000' or '409000' that fails comparisons silently. Fix: route -w into a tempfile so the exit code can't pollute stdout. See memory feedback_curl_status_capture_pollution.md.")
print(f" matched: {snippet}…")
print()
print("Fix template:")
print(' set +e')
print(' curl ... -w \'%{http_code}\' >code.txt 2>/dev/null')
print(' set -e')
print(' HTTP_CODE=$(cat code.txt 2>/dev/null)')
print(' [ -z "$HTTP_CODE" ] && HTTP_CODE="000"')
sys.exit(1)
PY
+22
View File
@@ -0,0 +1,22 @@
name: pr-guards
# Thin caller that delegates to the molecule-ci reusable guard. Today
# the guard is just "disable auto-merge when a new commit is pushed
# after auto-merge was enabled" — added 2026-04-27 after PR #2174
# auto-merged with only its first commit because the second commit
# was pushed after the merge queue had locked the PR's SHA.
#
# When more PR-time guards land in molecule-ci, add them here as
# additional jobs that share the same pull_request:synchronize
# trigger.
on:
pull_request:
types: [synchronize]
permissions:
pull-requests: write
jobs:
disable-auto-merge-on-push:
uses: molecule-ai/molecule-ci/.github/workflows/disable-auto-merge-on-push.yml@main

Some files were not shown because too many files have changed in this diff Show More