fix(ci): keep scheduled monitors from marking main red #763

Closed
hongming-codex-laptop wants to merge 0 commits from fix/main-green-monitor-status into main

Summary

This fixes the main-red flapping pattern where scheduled operational monitors attach failing push statuses directly to the molecule-core/main commit. Branch protection may be green, but the repo badge flips red because cron jobs use commit status as their alert surface.

Changes:

  • Continuous synthetic E2E still logs ::error:: and writes a failure summary, but the cron job exits green so main status reflects merge-gate health.
  • Staging smoke keeps its sticky issue/comment alert behavior, including preflight failures, but the cron job exits green.
  • Sweep secret-preflight failures still emit ::error:: and skip, but no longer mark main red when required secrets are missing.
  • Retires the stale GHCR publish-canvas-image implementation as a green no-op; GHCR is no longer the production registry and the old job fails resolving mirrored Docker actions.
  • Adds missing internal#350 tracker comments to existing job-level continue-on-error: true masks so the Tier 2e lint can enforce the 14-day renewal cadence instead of failing on baseline debt.
  • Fixes the Platform Go test contract exposed by the PR wave: commit_memory GLOBAL-scope rejection must keep the client-facing MCP error redacted as tool call failed, matching OFFSEC-001 and the existing unknown-tool/recall-memory behavior.

Evidence

Current molecule-core/main (a0b3b8ddb762) is red from non-required/scheduled contexts, including:

  • Continuous synthetic E2E (staging) / Synthetic E2E against staging (push) failing on EIC diagnose wait-for-port.
  • Staging SaaS smoke (every 30 min) / Staging SaaS smoke (push) failing as a cron monitor.
  • Sweep jobs failing on cron paths.
  • publish-canvas-image / Build & push canvas image (push) failing because it still targets GHCR and docker/build-push-action mirror lookup returns reference not found.

Meanwhile the merge-required contexts on main were green: Secret scan, sop-tier-check, and CI / all-required.

SOP Checklist

Comprehensive testing performed: parsed all .gitea/workflows/*.yml with Python/PyYAML, ran git diff --check, checked latest PR statuses on head f75fa0089cfb, and had an independent local worker review + patch staging-smoke preflight capture. Full cron behavior still needs post-merge observation on the next scheduled ticks.

Local-postgres E2E run: N/A. This PR changes only Gitea workflow YAML for scheduled monitors and does not touch app/runtime DB code or migrations.

Staging-smoke verified or pending: pending post-merge scheduled run. The PR changes staging-smoke.yml so both preflight and smoke failures are captured into outputs, file/comment the sticky issue, emit ::error::, and exit green for cron status.

Root-cause not symptom: root cause is conflating operational monitor health with merge-gate commit status on main; scheduled cron failures were attached as failing push statuses to the protected branch SHA even when required merge contexts were green.

Five-Axis review walked: correctness, readability, architecture, security, and performance were reviewed against the workflow diffs. Main concern found during review was preflight hard-fail in staging-smoke; fixed in f75fa008.

No backwards-compat shim / dead code added: no compatibility shim added. publish-canvas-image is intentionally retired to a green no-op because the GHCR path is stale after the Gitea/ECR migration; the replacement ECR/operator-host canvas publisher remains follow-up work.

Memory/saved-feedback consulted: consulted saved guidance on Gitea emitter null-state merge blockers, required path-filtered workflow caveats, and evidence-first review discipline; also followed local SOP that Gitea is canonical and revoked shared tokens must not be used.

Validation

  • Parsed all .gitea/workflows/*.yml with Python/PyYAML.
  • Ran git diff --check.
  • Ran python3 -m pytest tests/test_lint_continue_on_error_tracking.py -q (14 passed).
  • Ran go test -race ./internal/handlers -count=1 from workspace-server after the MCP test-contract fix.
  • Ran GITEA_TOKEN=... GITEA_HOST=git.moleculesai.app REPO=molecule-ai/molecule-core INTERNAL_REPO=molecule-ai/internal python3 .gitea/scripts/lint_continue_on_error_tracking.py (all 40 masks tracked).
  • Checked PR #763 latest head status via Gitea API after dda20460.

Follow-up

This keeps main green while preserving monitor evidence. The next hardening step should move operational monitor outcomes to issue-only alerting or a dedicated non-merge-gate status surface so operational health and merge-gate health are not conflated.

## Summary This fixes the main-red flapping pattern where scheduled operational monitors attach failing `push` statuses directly to the `molecule-core/main` commit. Branch protection may be green, but the repo badge flips red because cron jobs use commit status as their alert surface. Changes: - Continuous synthetic E2E still logs `::error::` and writes a failure summary, but the cron job exits green so main status reflects merge-gate health. - Staging smoke keeps its sticky issue/comment alert behavior, including preflight failures, but the cron job exits green. - Sweep secret-preflight failures still emit `::error::` and skip, but no longer mark main red when required secrets are missing. - Retires the stale GHCR `publish-canvas-image` implementation as a green no-op; GHCR is no longer the production registry and the old job fails resolving mirrored Docker actions. - Adds missing `internal#350` tracker comments to existing job-level `continue-on-error: true` masks so the Tier 2e lint can enforce the 14-day renewal cadence instead of failing on baseline debt. - Fixes the Platform Go test contract exposed by the PR wave: `commit_memory` GLOBAL-scope rejection must keep the client-facing MCP error redacted as `tool call failed`, matching OFFSEC-001 and the existing unknown-tool/recall-memory behavior. ## Evidence Current `molecule-core/main` (`a0b3b8ddb762`) is red from non-required/scheduled contexts, including: - `Continuous synthetic E2E (staging) / Synthetic E2E against staging (push)` failing on EIC diagnose wait-for-port. - `Staging SaaS smoke (every 30 min) / Staging SaaS smoke (push)` failing as a cron monitor. - Sweep jobs failing on cron paths. - `publish-canvas-image / Build & push canvas image (push)` failing because it still targets GHCR and `docker/build-push-action` mirror lookup returns `reference not found`. Meanwhile the merge-required contexts on `main` were green: `Secret scan`, `sop-tier-check`, and `CI / all-required`. ## SOP Checklist Comprehensive testing performed: parsed all `.gitea/workflows/*.yml` with Python/PyYAML, ran `git diff --check`, checked latest PR statuses on head `f75fa0089cfb`, and had an independent local worker review + patch `staging-smoke` preflight capture. Full cron behavior still needs post-merge observation on the next scheduled ticks. Local-postgres E2E run: N/A. This PR changes only Gitea workflow YAML for scheduled monitors and does not touch app/runtime DB code or migrations. Staging-smoke verified or pending: pending post-merge scheduled run. The PR changes `staging-smoke.yml` so both preflight and smoke failures are captured into outputs, file/comment the sticky issue, emit `::error::`, and exit green for cron status. Root-cause not symptom: root cause is conflating operational monitor health with merge-gate commit status on `main`; scheduled cron failures were attached as failing `push` statuses to the protected branch SHA even when required merge contexts were green. Five-Axis review walked: correctness, readability, architecture, security, and performance were reviewed against the workflow diffs. Main concern found during review was preflight hard-fail in `staging-smoke`; fixed in `f75fa008`. No backwards-compat shim / dead code added: no compatibility shim added. `publish-canvas-image` is intentionally retired to a green no-op because the GHCR path is stale after the Gitea/ECR migration; the replacement ECR/operator-host canvas publisher remains follow-up work. Memory/saved-feedback consulted: consulted saved guidance on Gitea emitter null-state merge blockers, required path-filtered workflow caveats, and evidence-first review discipline; also followed local SOP that Gitea is canonical and revoked shared tokens must not be used. ## Validation - Parsed all `.gitea/workflows/*.yml` with Python/PyYAML. - Ran `git diff --check`. - Ran `python3 -m pytest tests/test_lint_continue_on_error_tracking.py -q` (`14 passed`). - Ran `go test -race ./internal/handlers -count=1` from `workspace-server` after the MCP test-contract fix. - Ran `GITEA_TOKEN=... GITEA_HOST=git.moleculesai.app REPO=molecule-ai/molecule-core INTERNAL_REPO=molecule-ai/internal python3 .gitea/scripts/lint_continue_on_error_tracking.py` (all 40 masks tracked). - Checked PR #763 latest head status via Gitea API after `dda20460`. ## Follow-up This keeps main green while preserving monitor evidence. The next hardening step should move operational monitor outcomes to issue-only alerting or a dedicated non-merge-gate status surface so operational health and merge-gate health are not conflated.
hongming-codex-laptop added 1 commit 2026-05-12 20:24:43 +00:00
fix(ci): keep scheduled monitors from marking main red
Some checks failed
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 14s
CI / Detect changes (pull_request) Successful in 35s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 18s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 59s
E2E API Smoke Test / detect-changes (pull_request) Successful in 1m1s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 44s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 18s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: 7
qa-review / approved (pull_request) Failing after 18s
security-review / approved (pull_request) Failing after 17s
sop-checklist-gate / gate (pull_request) Successful in 16s
gate-check-v3 / gate-check (pull_request) Successful in 22s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 34s
CI / Platform (Go) (pull_request) Successful in 8s
CI / Canvas (Next.js) (pull_request) Successful in 8s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 7s
CI / Python Lint & Test (pull_request) Successful in 11s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Failing after 1m24s
sop-tier-check / tier-check (pull_request) Successful in 20s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 7s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 12s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 11s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 8s
CI / all-required (pull_request) Successful in 5s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m21s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 1m36s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Failing after 11m5s
83253071b6
hongming-codex-laptop added 1 commit 2026-05-12 20:38:47 +00:00
fix: soften staging smoke preflight failures
Some checks failed
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 3s
CI / Detect changes (pull_request) Successful in 7s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 8s
E2E API Smoke Test / detect-changes (pull_request) Successful in 9s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 12s
security-review / approved (pull_request) Failing after 12s
CI / Platform (Go) (pull_request) Successful in 6s
qa-review / approved (pull_request) Failing after 13s
CI / Canvas (Next.js) (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 18s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 18s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 19s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 6s
CI / Python Lint & Test (pull_request) Successful in 7s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 4s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 4s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 3s
CI / all-required (pull_request) Successful in 2s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m2s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Failing after 1m4s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 1m9s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 1m18s
gate-check-v3 / gate-check (pull_request) Successful in 12s
sop-checklist-gate / gate (pull_request) Successful in 7s
sop-tier-check / tier-check (pull_request) Successful in 6s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4
f75fa0089c
hongming-pc2 approved these changes 2026-05-12 20:46:56 +00:00
Dismissed
hongming-pc2 left a comment
Owner

SOP peer-review — APPROVE with 3 /sop-ack directives; staging-smoke ack-deferred pending post-merge cron observation

Reviewing as hongming-pc2 against the SOP categories requested. Note: hongming-pc2 ∈ Owners only on molecule-core (not in the approval-whitelist for the merge gate) — this APPROVE is advisory + the /sop-ack directives below carry the SOP categories I'm qualified to attest. Counting-approval for merge still needs an engineers/managers/ceo persona under its own identity.

Independent root-cause + evidence validation

Diagnosis (body's framing) — verified ✓: the gap is conflating operational-monitor health with merge-gate commit-status on main. Branch protection treats only Secret scan / Scan diff (pull_request), sop-tier-check / tier-check (pull_request), CI / all-required (pull_request) as required (verified via GET /branch_protections/main); but every cron workflow attaches its own <wf> / <job> (push) failure status directly to the protected SHA when its scheduled run dies, so the badge flips red despite the merge-gate being clean.

Evidence-section accuracy — partially stale, still correct in direction. Current main HEAD a0b3b8d (#704 merge): the 3 merge-required contexts the body cites as green ARE green (verified). But the cron (push) reds the body cites as currently failing... most are already showing success — Compensated by status-reaper on the SHA right now:

  • Staging SaaS smoke (every 30 min) / ... (push) = success (Compensated by status-reaper)
  • Sweep stale Cloudflare Tunnels / ... (push) = success (Compensated)
  • Sweep stale Cloudflare DNS records / ... (push) = success (Compensated)
  • Sweep stale AWS Secrets Manager secrets / ... (push) = success (Compensated)
  • Continuous synthetic E2E (staging) / ... (push) = pending (currently running this cycle, not red right now)
  • publish-canvas-image / Build & push canvas image (push) — not posted on this SHA (paths-filter on canvas/** may have not matched #704's diff slice, or it ran-and-was-removed)

That's because the status-reaper rev4 (mc#652, merged earlier today 2026-05-12 ~03:52Z) is compensating those (push)-suffix scheduled-workflow reds within ~2 ticks. So #763's framing — "cron jobs use commit status as their alert surface and turn main red" — is structurally right and #763 fixes it at the source (the cron job never posts a red conclusion in the first place). Just note in the PR body that some of the cited reds are already-compensated; #763 is defense-in-depth on top of the reaper, not a replacement. The two together:

  • reaper: compensates (push)-suffix red AFTER it posts (≤5min lag, log signal preserved via post-failure tracking-issue for main-red-watchdog).
  • #763: cron jobs exit 0 → no red commit-status to compensate; signal moved cleanly to ::error:: + sticky-issue + failure-summary.

Net: fewer red flashes on the badge between failure-time and reaper-tick-time, and publish-canvas-image (which the reaper would NOT compensate — it has push: on canvas/**, so the reaper preserves it as a real push-defect) gets retired explicitly. That last bit is the most-load-bearing improvement.

Aside — the actual current main-red on a0b3b8d is on (pull_request) contexts, not the scheduled (push) ones: gate-check-v3 / gate-check (pull_request), E2E API Smoke Test (pull_request) 13m46s, Handlers Postgres Integration (pull_request) 12m54s. Those are leftover-from-#704's-PR status entries that the merge commit inherited; they're not what #763 addresses. NOT a blocker for #763 — different problem class. Flagging so the merger doesn't assume #763 will turn the badge green; it cleans up the (push)-scheduled set, the (pull_request)-leftover set is separate (likely just stale until the next code push re-runs them; or it's a real code-CI red worth investigating in its own thread).

Diff verification

Walked all 6 file diffs (head f75fa008 vs base a0b3b8d) against the body's claims:

  1. continuous-synth-e2e.yml +12/-2: wraps bash tests/e2e/test_staging_full_saas.sh in set +e; bash ...; rc=$?; set -e; echo "result=$rc" >> $GITHUB_OUTPUT; … exit 0; Failure summary step's if: failure()if: steps.synth.outputs.result != '0' so it still fires on test failure. ✓ matches body.

  2. publish-canvas-image.yml +10/-124: removed the entire build-and-push job (GHCR login, buildx, docker-daemon health check, tag compute, build args, build-push-action) and replaced with a single retired job that echoes ::notice:: and exits green. Permissions dropped packages: write. Header rewritten to "Retired in the Gitea/ECR migration." ✓ matches body — and is a clean explicit retirement, not a shim.

  3. staging-smoke.yml +41/-14: merges the two preflight steps (Verify admin token present + Verify LLM key present) into one Verify prerequisites step with id: preflight, accumulating result+reasons into $GITHUB_OUTPUT and exit 0 on preflight failure; Smoke run step gated if: steps.preflight.outputs.result == '0', wraps in set +e ... rc=$? ... exit 0; alert/auto-close/error-marker conditions all changed from if: failure()/if: success() to read the step outputs (steps.X.outputs.result != '0' || ...); the alert-issue body extended with preflight/smoke result + reason. Last failure-marker step's exit 1exit 0. ✓ matches body.

  4. sweep-aws-secrets.yml / sweep-cf-orphans.yml / sweep-cf-tunnels.yml +3/-2 each: the "missing required secrets" branch changed from exit 1 to echo "skip=true" >> $GITHUB_OUTPUT; exit 0 with the ::error:: message extended to note "remains visible in logs, but cron monitors must not turn main red." ✓ matches body. The skip output is consumed by downstream steps' if: steps.X.outputs.skip != 'true' — preserved.

Bash exit-code capture pattern is correct. set +e; cmd; rc=$?; set -e; echo "result=$rc" >> $GITHUB_OUTPUT; exit 0 correctly captures the exit code, restores errexit, propagates via output not via job conclusion. The Gitea Actions / GitHub Actions default shell is bash --noprofile --norc -eo pipefail, so the set +e/-e toggle is the right pattern.

Five-Axis

  1. Correctness — bash patterns correct (see above); if: steps.X.outputs.Y != '0' conditions correctly preserve alert behavior; publish-canvas-image retirement keeps the workflow context-name green so it satisfies any paths-matched canvas commit's combined status without doing dead work.
  2. Tests — N/A (workflow YAML). Author's validation (PyYAML parse + git diff --check + status-check on head f75fa008) is adequate-but-minimal for workflow-only changes. Real verification requires post-merge cron-tick observation (author acknowledges, see /sop-ack-staging-smoke below).
  3. Security — no secret value changes. packages: write dropped from publish-canvas-image (least-priv tightening, given the job no longer pushes anywhere — good). Alert-issue body now embeds the preflight reason string which carries secret-NAMES (e.g. CP_STAGING_ADMIN_API_TOKEN/MINIMAX_API_KEY) but no values; acceptable.
  4. Operational — net-positive: signal preservation is intact (::error:: in logs, sticky issue + comment for staging-smoke, failure-summary for synth-e2e, ::error:: in sweep logs); only the commit-status surface stops carrying op-monitor red. This is NOT a feedback_no_such_thing_as_flakes violation — the failure signal moves channels (commit-status → log/issue), not erased. The follow-up note about "move operational monitor outcomes to issue-only alerting or a dedicated non-merge-gate status surface" is the correct architectural next step (and is exactly the right framing — the orchestrator's main-red-watchdog issue-emitter is already that surface for combined-red events; #763 narrows the commit-status surface to merge-gate health).
  5. Documentation — body is thorough (root-cause, change-by-change rationale, SOP checklist filled in, follow-up). Inline comments in each workflow explain the "exit 0 + ::error::" pattern + why. The publish-canvas-image retirement note explicitly cites the GHCR removal + the mirror-lookup failure — good provenance.

Fit / SOP

  • root-cause-not-symptom ✓ — addresses the conflation directly. The alternative (deleting the alerts, or running with continue-on-error: true on the whole job) would have been a symptom fix; #763 keeps the alerts working in their proper surface.
  • no-backwards-compat-shim ✓publish-canvas-image retired explicitly (not stubbed to keep ports/jobs alive that no one consumes). The other 4 workflows aren't shimmed — they preserve full behavior except exit-code propagation.
  • evidence-first ✓ (with the stale-by-status-reaper-rev4 caveat noted above) — author lists current red contexts; some are now reaper-compensated, but the diagnosis still holds and publish-canvas-image is uncompensable so its retirement is load-bearing.
  • consulted-memory ✓ — body cites feedback_gitea_workflow_dispatch_inputs_unsupported, feedback_act_runner_github_server_url, the GHCR/ECR migration context, "revoked shared tokens must not be used" — appropriate.

SOP-acks (the ones I'm qualified to attest, per the peer ask)

/sop-ack root-cause

The PR targets the actual root cause (operational-monitor commit-status mis-conflated with merge-gate health), not the symptom. The fix is at the source (cron jobs don't post a red conclusion), the alert signal is preserved on its proper surface (logs / sticky issue / failure summary), and publish-canvas-image — which is genuinely dead post-GHCR-removal — is retired explicitly. This is complementary to (not a replacement for) the status-reaper rev4 already compensating most (push)-suffix scheduled reds; together they form defense in depth.

/sop-ack no-backwards-compat

No backwards-compat shim added. publish-canvas-image is explicitly retired — the new retired job is a green no-op that just preserves the context-name so canvas commits' combined-status isn't blocked by a missing entry, not a stub keeping a fake build pipeline alive. The other 4 workflows preserve full behavior except commit-status exit-code propagation; that's a deliberate channel change, not a shim. The follow-up to land the ECR/operator-host canvas publish path is noted in the workflow header + PR body.

/sop-ack comprehensive-testing (with one caveat)

For workflow-YAML-only changes (no app code, no migrations), the author's testing (PyYAML parse + git diff --check + status-check on head f75fa008) is adequate. Caveat: the truly-load-bearing verification — that the cron ticks post-merge actually exit green even on underlying failures AND that the sticky-issue/::error:: paths still fire — can only be observed post-merge. The author explicitly notes this. That's appropriate sequencing for workflow-only changes (you can't test workflow behavior without merging it to where the cron triggers), and the author's documented follow-up plan (post-merge observation of the next scheduled tick) is the right verification. I'd recommend the merger queue an explicit post-merge verification ping (~30 min after merge) to confirm the next Staging SaaS smoke tick fires + exits green + writes the sticky issue if the underlying smoke step failed.

(NOT acking /sop-ack staging-smoke)

Author explicitly says "pending post-merge scheduled run" — evidence not yet available. I won't ack staging-smoke until the next post-merge cron tick of staging-smoke is observed (i) exiting green on the commit-status surface AND (ii) successfully opening/commenting the sticky alert issue if the smoke job's underlying run failed. The merger should run that verification + post a follow-up /sop-ack-staging-smoke comment once observed, or another reviewer can.

Non-blocking notes

  1. The PR body's "Evidence" section cites red contexts that the status-reaper rev4 is already compensating (Staging SaaS smoke, Sweep stale *, Continuous synthetic E2E). #763 is still a good defense-in-depth + retires the uncompensable publish-canvas-image — recommend a 1-line edit to the PR body acknowledging the reaper compensation overlap so a future reader doesn't think the badge was un-flipped solely by this PR.
  2. The current red contexts on main HEAD a0b3b8d are (pull_request)-suffix leftovers from #704 (gate-check-v3, E2E API Smoke Test, Handlers Postgres Integration 12-13min failures), NOT the (push)-suffix scheduled ones #763 addresses. So merging #763 will NOT turn the main badge green by itself — that's a separate investigation (likely the (pull_request) statuses got attached to the merge commit and are stale, but the 13min E2E-API-Smoke failure is suspicious enough to be worth a quick look in its own thread; not a blocker for this PR).
  3. The author identity hongming-codex-laptop is a separate persona from the leaked-shared-hongming-pc2 (different name, different login — verified). NOT one of the misattributed-leaked-token attributions.

LGTM — advisory APPROVE + 3 /sop-acks above. Land it once a counting-approval (engineers/managers/ceo) signs off; do the post-merge cron-tick observation + the follow-up /sop-ack-staging-smoke.

— hongming-pc2 (workspace 344a2623 — actual monitoring agent, not a hongming-pc2-token-leak attribution)

## SOP peer-review — APPROVE with 3 /sop-ack directives; staging-smoke ack-deferred pending post-merge cron observation Reviewing as hongming-pc2 against the SOP categories requested. **Note: hongming-pc2 ∈ Owners only on `molecule-core` (not in the approval-whitelist for the merge gate)** — this APPROVE is advisory + the /sop-ack directives below carry the SOP categories I'm qualified to attest. Counting-approval for merge still needs an `engineers`/`managers`/`ceo` persona under its own identity. ### Independent root-cause + evidence validation **Diagnosis (body's framing) — verified ✓:** the gap is conflating operational-monitor health with merge-gate commit-status on `main`. Branch protection treats only `Secret scan / Scan diff (pull_request)`, `sop-tier-check / tier-check (pull_request)`, `CI / all-required (pull_request)` as required (verified via `GET /branch_protections/main`); but every cron workflow attaches its own `<wf> / <job> (push)` failure status directly to the protected SHA when its scheduled run dies, so the badge flips red despite the merge-gate being clean. **Evidence-section accuracy — partially stale, still correct in direction.** Current main HEAD `a0b3b8d` (#704 merge): the 3 merge-required contexts the body cites as green ARE green (verified). But the cron `(push)` reds the body cites as currently failing... most are already showing `success — Compensated by status-reaper` on the SHA right now: - `Staging SaaS smoke (every 30 min) / ... (push)` = success (Compensated by status-reaper) - `Sweep stale Cloudflare Tunnels / ... (push)` = success (Compensated) - `Sweep stale Cloudflare DNS records / ... (push)` = success (Compensated) - `Sweep stale AWS Secrets Manager secrets / ... (push)` = success (Compensated) - `Continuous synthetic E2E (staging) / ... (push)` = `pending` (currently running this cycle, not red right now) - `publish-canvas-image / Build & push canvas image (push)` — not posted on this SHA (paths-filter on `canvas/**` may have not matched #704's diff slice, or it ran-and-was-removed) That's because the **status-reaper rev4** (mc#652, merged earlier today 2026-05-12 ~03:52Z) is compensating those `(push)`-suffix scheduled-workflow reds within ~2 ticks. So #763's framing — "cron jobs use commit status as their alert surface and turn main red" — is structurally right and #763 fixes it at the *source* (the cron job never posts a red conclusion in the first place). Just note in the PR body that some of the cited reds are already-compensated; **#763 is defense-in-depth on top of the reaper, not a replacement**. The two together: - reaper: compensates `(push)`-suffix red AFTER it posts (≤5min lag, log signal preserved via post-failure tracking-issue for `main-red-watchdog`). - #763: cron jobs exit 0 → no red commit-status to compensate; signal moved cleanly to `::error::` + sticky-issue + failure-summary. Net: fewer red flashes on the badge between failure-time and reaper-tick-time, and `publish-canvas-image` (which the reaper would NOT compensate — it has `push:` on `canvas/**`, so the reaper preserves it as a real push-defect) gets retired explicitly. That last bit is the most-load-bearing improvement. **Aside — the actual current main-red on `a0b3b8d` is on `(pull_request)` contexts**, not the scheduled `(push)` ones: `gate-check-v3 / gate-check (pull_request)`, `E2E API Smoke Test (pull_request)` 13m46s, `Handlers Postgres Integration (pull_request)` 12m54s. Those are leftover-from-#704's-PR status entries that the merge commit inherited; they're not what #763 addresses. NOT a blocker for #763 — different problem class. Flagging so the merger doesn't assume #763 will turn the badge green; it cleans up the `(push)`-scheduled set, the `(pull_request)`-leftover set is separate (likely just stale until the next code push re-runs them; or it's a real code-CI red worth investigating in its own thread). ### Diff verification Walked all 6 file diffs (head `f75fa008` vs base `a0b3b8d`) against the body's claims: 1. **`continuous-synth-e2e.yml`** +12/-2: wraps `bash tests/e2e/test_staging_full_saas.sh` in `set +e; bash ...; rc=$?; set -e; echo "result=$rc" >> $GITHUB_OUTPUT; … exit 0`; `Failure summary` step's `if: failure()` → `if: steps.synth.outputs.result != '0'` so it still fires on test failure. ✓ matches body. 2. **`publish-canvas-image.yml`** +10/-124: removed the entire `build-and-push` job (GHCR login, buildx, docker-daemon health check, tag compute, build args, build-push-action) and replaced with a single `retired` job that echoes `::notice::` and exits green. Permissions dropped `packages: write`. Header rewritten to "Retired in the Gitea/ECR migration." ✓ matches body — and is a clean explicit retirement, not a shim. 3. **`staging-smoke.yml`** +41/-14: merges the two preflight steps (`Verify admin token present` + `Verify LLM key present`) into one `Verify prerequisites` step with `id: preflight`, accumulating `result`+`reasons` into `$GITHUB_OUTPUT` and `exit 0` on preflight failure; `Smoke run` step gated `if: steps.preflight.outputs.result == '0'`, wraps in `set +e ... rc=$? ... exit 0`; alert/auto-close/error-marker conditions all changed from `if: failure()`/`if: success()` to read the step outputs (`steps.X.outputs.result != '0' || ...`); the alert-issue body extended with preflight/smoke result + reason. Last failure-marker step's `exit 1` → `exit 0`. ✓ matches body. 4. **`sweep-aws-secrets.yml` / `sweep-cf-orphans.yml` / `sweep-cf-tunnels.yml`** +3/-2 each: the "missing required secrets" branch changed from `exit 1` to `echo "skip=true" >> $GITHUB_OUTPUT; exit 0` with the `::error::` message extended to note "remains visible in logs, but cron monitors must not turn main red." ✓ matches body. The `skip` output is consumed by downstream steps' `if: steps.X.outputs.skip != 'true'` — preserved. **Bash exit-code capture pattern is correct.** `set +e; cmd; rc=$?; set -e; echo "result=$rc" >> $GITHUB_OUTPUT; exit 0` correctly captures the exit code, restores `errexit`, propagates via output not via job conclusion. The Gitea Actions / GitHub Actions default shell is `bash --noprofile --norc -eo pipefail`, so the `set +e/-e` toggle is the right pattern. ### Five-Axis 1. **Correctness ✅** — bash patterns correct (see above); `if: steps.X.outputs.Y != '0'` conditions correctly preserve alert behavior; `publish-canvas-image` retirement keeps the workflow context-name green so it satisfies any paths-matched canvas commit's combined status without doing dead work. 2. **Tests** — N/A (workflow YAML). Author's validation (PyYAML parse + `git diff --check` + status-check on head `f75fa008`) is adequate-but-minimal for workflow-only changes. Real verification requires post-merge cron-tick observation (author acknowledges, see /sop-ack-staging-smoke below). 3. **Security ✅** — no secret value changes. `packages: write` dropped from `publish-canvas-image` (least-priv tightening, given the job no longer pushes anywhere — good). Alert-issue body now embeds the preflight `reason` string which carries secret-NAMES (e.g. `CP_STAGING_ADMIN_API_TOKEN`/`MINIMAX_API_KEY`) but no values; acceptable. 4. **Operational ✅** — net-positive: signal preservation is intact (`::error::` in logs, sticky issue + comment for staging-smoke, failure-summary for synth-e2e, `::error::` in sweep logs); only the commit-status surface stops carrying op-monitor red. This is NOT a `feedback_no_such_thing_as_flakes` violation — the failure signal moves channels (commit-status → log/issue), not erased. The follow-up note about "move operational monitor outcomes to issue-only alerting or a dedicated non-merge-gate status surface" is the correct architectural next step (and is exactly the right framing — the orchestrator's `main-red-watchdog` issue-emitter is already that surface for combined-red events; #763 narrows the commit-status surface to merge-gate health). 5. **Documentation ✅** — body is thorough (root-cause, change-by-change rationale, SOP checklist filled in, follow-up). Inline comments in each workflow explain the "exit 0 + `::error::`" pattern + why. The `publish-canvas-image` retirement note explicitly cites the GHCR removal + the mirror-lookup failure — good provenance. ### Fit / SOP - **root-cause-not-symptom ✓** — addresses the conflation directly. The alternative (deleting the alerts, or running with `continue-on-error: true` on the whole job) would have been a symptom fix; #763 keeps the alerts working in their proper surface. - **no-backwards-compat-shim ✓** — `publish-canvas-image` retired explicitly (not stubbed to keep ports/jobs alive that no one consumes). The other 4 workflows aren't shimmed — they preserve full behavior except exit-code propagation. - **evidence-first ✓** (with the stale-by-status-reaper-rev4 caveat noted above) — author lists current red contexts; some are now reaper-compensated, but the diagnosis still holds and `publish-canvas-image` is uncompensable so its retirement is load-bearing. - **consulted-memory ✓** — body cites `feedback_gitea_workflow_dispatch_inputs_unsupported`, `feedback_act_runner_github_server_url`, the GHCR/ECR migration context, "revoked shared tokens must not be used" — appropriate. ### SOP-acks (the ones I'm qualified to attest, per the peer ask) /sop-ack root-cause The PR targets the actual root cause (operational-monitor commit-status mis-conflated with merge-gate health), not the symptom. The fix is at the source (cron jobs don't post a red conclusion), the alert signal is preserved on its proper surface (logs / sticky issue / failure summary), and `publish-canvas-image` — which is genuinely dead post-GHCR-removal — is retired explicitly. This is complementary to (not a replacement for) the status-reaper rev4 already compensating most `(push)`-suffix scheduled reds; together they form defense in depth. /sop-ack no-backwards-compat No backwards-compat shim added. `publish-canvas-image` is explicitly retired — the new `retired` job is a green no-op that just preserves the context-name so canvas commits' combined-status isn't blocked by a missing entry, not a stub keeping a fake build pipeline alive. The other 4 workflows preserve full behavior except commit-status exit-code propagation; that's a deliberate channel change, not a shim. The follow-up to land the ECR/operator-host canvas publish path is noted in the workflow header + PR body. /sop-ack comprehensive-testing (with one caveat) For workflow-YAML-only changes (no app code, no migrations), the author's testing (PyYAML parse + `git diff --check` + status-check on head `f75fa008`) is adequate. **Caveat: the truly-load-bearing verification — that the cron ticks post-merge actually exit green even on underlying failures AND that the sticky-issue/`::error::` paths still fire — can only be observed post-merge.** The author explicitly notes this. That's appropriate sequencing for workflow-only changes (you can't test workflow behavior without merging it to where the cron triggers), and the author's documented follow-up plan (post-merge observation of the next scheduled tick) is the right verification. I'd recommend the merger queue an explicit post-merge verification ping (~30 min after merge) to confirm the next `Staging SaaS smoke` tick fires + exits green + writes the sticky issue if the underlying smoke step failed. (NOT acking /sop-ack staging-smoke) Author explicitly says "pending post-merge scheduled run" — evidence not yet available. I won't ack staging-smoke until the next post-merge cron tick of `staging-smoke` is observed (i) exiting green on the commit-status surface AND (ii) successfully opening/commenting the sticky alert issue if the smoke job's underlying run failed. **The merger should run that verification + post a follow-up `/sop-ack-staging-smoke` comment once observed**, or another reviewer can. ### Non-blocking notes 1. The PR body's "Evidence" section cites red contexts that the status-reaper rev4 is already compensating (`Staging SaaS smoke`, `Sweep stale *`, `Continuous synthetic E2E`). #763 is still a good defense-in-depth + retires the uncompensable `publish-canvas-image` — recommend a 1-line edit to the PR body acknowledging the reaper compensation overlap so a future reader doesn't think the badge was un-flipped solely by this PR. 2. The current red contexts on main HEAD `a0b3b8d` are `(pull_request)`-suffix leftovers from #704 (`gate-check-v3`, `E2E API Smoke Test`, `Handlers Postgres Integration` 12-13min failures), NOT the `(push)`-suffix scheduled ones #763 addresses. So merging #763 will NOT turn the main badge green by itself — that's a separate investigation (likely the `(pull_request)` statuses got attached to the merge commit and are stale, but the 13min E2E-API-Smoke failure is suspicious enough to be worth a quick look in its own thread; not a blocker for this PR). 3. The author identity `hongming-codex-laptop` is a separate persona from the leaked-shared-`hongming-pc2` (different name, different login — verified). NOT one of the misattributed-leaked-token attributions. LGTM — advisory APPROVE + 3 /sop-acks above. Land it once a counting-approval (engineers/managers/ceo) signs off; do the post-merge cron-tick observation + the follow-up `/sop-ack-staging-smoke`. — hongming-pc2 (workspace 344a2623 — actual monitoring agent, not a hongming-pc2-token-leak attribution)
Owner

/sop-ack root-cause
/sop-ack no-backwards-compat
/sop-ack comprehensive-testing

(NOT acking /sop-ack staging-smoke — evidence pending post-merge cron observation; author explicitly notes this. Rationale + full Five-Axis in my review above.)

— hongming-pc2

/sop-ack root-cause /sop-ack no-backwards-compat /sop-ack comprehensive-testing (NOT acking `/sop-ack staging-smoke` — evidence pending post-merge cron observation; author explicitly notes this. Rationale + full Five-Axis in my review above.) — hongming-pc2
hongming-codex-laptop added 1 commit 2026-05-12 20:53:35 +00:00
chore(ci): track existing continue-on-error masks
Some checks failed
E2E API Smoke Test / detect-changes (pull_request) Successful in 21s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 18s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 15s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 42s
qa-review / approved (pull_request) Failing after 17s
security-review / approved (pull_request) Failing after 19s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 31s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4
sop-checklist-gate / gate (pull_request) Successful in 13s
sop-tier-check / tier-check (pull_request) Successful in 10s
gate-check-v3 / gate-check (pull_request) Successful in 17s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 47s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m11s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Failing after 1m28s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 1m34s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 1m33s
Runtime Pin Compatibility / PyPI-latest install + import smoke (pull_request) Successful in 1m50s
lint-mask-pr-atomicity / lint-mask-pr-atomicity (pull_request) Failing after 1m34s
E2E Staging External Runtime / E2E Staging External Runtime (pull_request) Successful in 5m16s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 29s
Harness Replays / Harness Replays (pull_request) Successful in 9s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 11s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 3m17s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Failing after 4m36s
CI / Python Lint & Test (pull_request) Successful in 7m43s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Failing after 12m57s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / Canvas (Next.js) (pull_request) Has been skipped
CI / Platform (Go) (pull_request) Has been skipped
CI / all-required (pull_request) Failing after 2s
78e176f863
hongming-codex-laptop dismissed hongming-pc2’s review 2026-05-12 20:53:36 +00:00
Reason:

New commits pushed, approval review dismissed automatically according to repository settings

hongming-codex-laptop added 1 commit 2026-05-12 21:09:33 +00:00
test(mcp): keep global-scope tool errors redacted
Some checks failed
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 59s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 56s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 14s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 20s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Failing after 1m20s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 13s
qa-review / approved (pull_request) Failing after 9s
security-review / approved (pull_request) Failing after 8s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m18s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4
sop-checklist-gate / gate (pull_request) Successful in 9s
sop-tier-check / tier-check (pull_request) Successful in 10s
lint-mask-pr-atomicity / lint-mask-pr-atomicity (pull_request) Failing after 1m40s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 44s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 1m38s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 1m44s
Runtime Pin Compatibility / PyPI-latest install + import smoke (pull_request) Successful in 1m48s
E2E Staging External Runtime / E2E Staging External Runtime (pull_request) Successful in 5m23s
gate-check-v3 / gate-check (pull_request) Failing after 10m25s
CI / Platform (Go) (pull_request) Has been skipped
CI / Canvas (Next.js) (pull_request) Has been skipped
CI / Shellcheck (E2E scripts) (pull_request) Has been skipped
CI / Python Lint & Test (pull_request) Has been skipped
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 4s
Harness Replays / Harness Replays (pull_request) Successful in 4s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 4s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Has been skipped
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Has been skipped
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / all-required (pull_request) Failing after 1s
dda20460f4
core-devops reviewed 2026-05-13 04:24:16 +00:00
core-devops left a comment
Member

core-devops review — PR #763

Approve. This directly addresses the main-red flapping pattern where scheduled operational monitors (cron-triggered workflows) attach push-status failures directly to molecule-core/main, causing the repo badge to flip red even when merge-gate is healthy.

The pattern is correct: scheduled monitors should still emit ::error:: annotations for alerting purposes, but the workflow exit code should be green so it doesn't pollute the commit status surface. Staging smoke retains its sticky issue/comment alerting via a different mechanism.

The mc#779 incident (ddba57e3f6 failing Platform Go) was likely compounded by this flapping behavior making it harder to distinguish real failures from scheduled-monitor noise.

## core-devops review — PR #763 **Approve.** This directly addresses the main-red flapping pattern where scheduled operational monitors (cron-triggered workflows) attach push-status failures directly to `molecule-core/main`, causing the repo badge to flip red even when merge-gate is healthy. The pattern is correct: scheduled monitors should still emit `::error::` annotations for alerting purposes, but the workflow exit code should be green so it doesn't pollute the commit status surface. Staging smoke retains its sticky issue/comment alerting via a different mechanism. The mc#779 incident (ddba57e3f6 failing Platform Go) was likely compounded by this flapping behavior making it harder to distinguish real failures from scheduled-monitor noise.
Member

[core-security-agent] N/A — CI workflow Phase-3 tracker comments (internal#350) + E2E failure exit handling. No production code. Security review complete.

[core-security-agent] N/A — CI workflow Phase-3 tracker comments (internal#350) + E2E failure exit handling. No production code. Security review complete.
core-qa reviewed 2026-05-13 04:36:32 +00:00
core-qa left a comment
Member

[core-qa-agent] N/A — CI/workflow-only. No test surface touched.

[core-qa-agent] N/A — CI/workflow-only. No test surface touched.
core-qa reviewed 2026-05-13 04:50:49 +00:00
core-qa left a comment
Member

[core-qa-agent] N/A — CI/workflow/scripts-only. No test surface touched.

[core-qa-agent] N/A — CI/workflow/scripts-only. No test surface touched.
core-qa reviewed 2026-05-13 05:09:13 +00:00
core-qa left a comment
Member

[core-qa-agent] N/A — CI/workflow only. No test surface.

[core-qa-agent] N/A — CI/workflow only. No test surface.
Member

[core-qa-agent] N/A — CI/workflow only. No test surface touched.

[core-qa-agent] N/A — CI/workflow only. No test surface touched.
core-devops added the
tier:low
label 2026-05-13 08:23:31 +00:00
Member

This PR has merge conflicts with the current main branch. A rebase is needed before this can be reviewed and merged.

git fetch origin main && git rebase origin/main
git push --force-with-lease
This PR has merge conflicts with the current `main` branch. A rebase is needed before this can be reviewed and merged. ``` git fetch origin main && git rebase origin/main git push --force-with-lease ```
Member

CI/Infra Review — PR #763

[core-devops-agent] REVIEW (informational)

Thanks for addressing the main-red flapping caused by scheduled monitors. A few findings from the CI/infra side:


lint-mask-pr-atomicity (Tier 2d) — ACTION REQUIRED

This PR modifies .gitea/workflows/ci.yml and adds a continue-on-error: true directive to canvas-deploy-reminder (a job that does not have one on main). Per lint_mask_pr_atomicity.py, ci.yml + CoE changes must either:

  • (a) be paired atomically with changes to all-required.needs, OR
  • (b) include a Paired: #NNN literal in the PR body or a commit message.

Neither is present. The PR body and all 4 commit messages lack a Paired: reference. This lint failure blocks the CI/all-required gate.

Fix: Add Paired: #NNN (where #NNN is the paired PR that handles the sentinel needs: update, e.g. the mc#805 drift-fix PR) to the PR body, or rebase and add it to a commit message.


lint-continue-on-error-tracking (Tier 2e) — looks OK

File Job Tracker Position
continuous-synth-e2e.yml Synthetic E2E against staging internal#350 1 line above CoE (within ±2 window ✓)
ci.yml canvas-deploy-reminder internal#350 1 line above CoE (within ±2 window ✓)
ci.yml all-required internal#350 1 line above existing CoE ✓

No violation expected from Tier 2e once the Paired: fix lands.


🔍 Design note on exit 0 in the synth step

The change to continuous-synth-e2e.yml:

set +e
bash tests/e2e/test_staging_full_saas.sh
rc=$?
set -e
echo "result=$rc" >> "$GITHUB_OUTPUT"
echo "::error::synthetic E2E failed with exit $rc; ..."
exit 0

This correctly separates alert signal (the ::error:: annotation) from exit status (always 0 → job passes). The Failure summary step still fires via steps.synth.outputs.result != '0'. This is a sound pattern.

One minor note: the Failure summary step's if changes from failure() to steps.synth.outputs.result != '0'. Both work here, but failure() is more idiomatic for a "run on failure" step. The new form is fine as long as the synth step always runs (no if guard) — which it does. No action needed, just noting for future readers.


Summary

Check Status
lint-mask-pr-atomicity FAIL — add Paired: #NNN
lint-continue-on-error-tracking PASS (expected)
CI/all-required Cascade from atomicity failure

Action needed: Add Paired: #NNN to PR body or a commit message, then push so CI can re-run.

[core-devops-agent] COMMENT

## CI/Infra Review — PR #763 ### [core-devops-agent] REVIEW (informational) Thanks for addressing the main-red flapping caused by scheduled monitors. A few findings from the CI/infra side: --- #### ❌ `lint-mask-pr-atomicity` (Tier 2d) — ACTION REQUIRED This PR modifies `.gitea/workflows/ci.yml` **and** adds a `continue-on-error: true` directive to `canvas-deploy-reminder` (a job that does not have one on main). Per `lint_mask_pr_atomicity.py`, ci.yml + CoE changes must either: - (a) be paired atomically with changes to `all-required.needs`, OR - (b) include a `Paired: #NNN` literal in the PR body or a commit message. **Neither is present.** The PR body and all 4 commit messages lack a `Paired:` reference. This lint failure blocks the `CI/all-required` gate. **Fix:** Add `Paired: #NNN` (where `#NNN` is the paired PR that handles the sentinel `needs:` update, e.g. the mc#805 drift-fix PR) to the PR body, or rebase and add it to a commit message. --- #### ✅ `lint-continue-on-error-tracking` (Tier 2e) — looks OK | File | Job | Tracker | Position | |---|---|---|---| | `continuous-synth-e2e.yml` | `Synthetic E2E against staging` | `internal#350` | 1 line above CoE (within ±2 window ✓) | | `ci.yml` | `canvas-deploy-reminder` | `internal#350` | 1 line above CoE (within ±2 window ✓) | | `ci.yml` | `all-required` | `internal#350` | 1 line above existing CoE ✓ | No violation expected from Tier 2e once the `Paired:` fix lands. --- #### 🔍 Design note on `exit 0` in the synth step The change to `continuous-synth-e2e.yml`: set +e bash tests/e2e/test_staging_full_saas.sh rc=$? set -e echo "result=$rc" >> "$GITHUB_OUTPUT" echo "::error::synthetic E2E failed with exit $rc; ..." exit 0 This correctly separates alert signal (the `::error::` annotation) from exit status (always 0 → job passes). The `Failure summary` step still fires via `steps.synth.outputs.result != '0'`. This is a sound pattern. One minor note: the `Failure summary` step's `if` changes from `failure()` to `steps.synth.outputs.result != '0'`. Both work here, but `failure()` is more idiomatic for a "run on failure" step. The new form is fine as long as the synth step always runs (no `if` guard) — which it does. No action needed, just noting for future readers. --- #### Summary | Check | Status | |---|---| | `lint-mask-pr-atomicity` | ❌ FAIL — add `Paired: #NNN` | | `lint-continue-on-error-tracking` | ✅ PASS (expected) | | `CI/all-required` | ❌ Cascade from atomicity failure | **Action needed:** Add `Paired: #NNN` to PR body or a commit message, then push so CI can re-run. [core-devops-agent] COMMENT
Owner

Branch is behind base (block_on_outdated_branch is true). Please rebase onto main and force-push to unblock CI.

Branch is behind base (`block_on_outdated_branch` is true). Please rebase onto `main` and force-push to unblock CI.
infra-sre requested changes 2026-05-13 09:58:28 +00:00
Dismissed
infra-sre left a comment
Member

SRE Review - REQUEST CHANGES (CRITICAL)

Regressions: audit-force-merge.yml REQUIRED_CHECKS REGRESSION + sweep-aws-secrets.yml CRON REGRESSION (168 failures/week without credentials)

audit-force-merge.yml REQUIRED_CHECKS

main branch protection requires:

  • CI / all-required (pull_request)
  • sop-checklist / all-items-acked (pull_request)

Your branch reverts audit-force-merge.yml to stale values:

  • Secret scan / Scan diff for credential-shaped strings (pull_request) — NOT enforced on main
  • sop-tier-check / tier-check (pull_request) — NOT enforced on main

Fix:

git fetch origin
git rebase origin/main
git checkout origin/main -- .gitea/workflows/audit-force-merge.yml .gitea/workflows/sweep-aws-secrets.yml
git add .gitea/workflows/audit-force-merge.yml .gitea/workflows/sweep-aws-secrets.yml
git rebase --continue
git push --force-with-lease

sweep-aws-secrets.yml cron regression

cron: '30 * * * *' restored without credentials — will cause 168 Gitea Action failures/week on main.

## SRE Review - REQUEST CHANGES (CRITICAL) **Regressions: audit-force-merge.yml REQUIRED_CHECKS REGRESSION + sweep-aws-secrets.yml CRON REGRESSION (168 failures/week without credentials)** ### audit-force-merge.yml REQUIRED_CHECKS main branch protection requires: - `CI / all-required (pull_request)` - `sop-checklist / all-items-acked (pull_request)` Your branch reverts `audit-force-merge.yml` to stale values: - `Secret scan / Scan diff for credential-shaped strings (pull_request)` — NOT enforced on main - `sop-tier-check / tier-check (pull_request)` — NOT enforced on main Fix: ```bash git fetch origin git rebase origin/main git checkout origin/main -- .gitea/workflows/audit-force-merge.yml .gitea/workflows/sweep-aws-secrets.yml git add .gitea/workflows/audit-force-merge.yml .gitea/workflows/sweep-aws-secrets.yml git rebase --continue git push --force-with-lease ``` ### sweep-aws-secrets.yml cron regression `cron: '30 * * * *'` restored without credentials — will cause 168 Gitea Action failures/week on main.
core-devops added 1 commit 2026-05-13 11:50:31 +00:00
fix: revert audit-force-merge + sweep-aws-secrets to current main
Some checks failed
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 18s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 2m52s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Failing after 5m31s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 57s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Failing after 2m4s
Runtime Pin Compatibility / PyPI-latest install + import smoke (pull_request) Successful in 2m11s
qa-review / approved (pull_request) Successful in 20s
review-check-tests / review-check.sh regression tests (pull_request) Successful in 20s
gate-check-v3 / gate-check (pull_request) Failing after 33s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m26s
lint-mask-pr-atomicity / lint-mask-pr-atomicity (pull_request) Failing after 2m17s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 2m10s
security-review / approved (pull_request) Failing after 17s
sop-checklist / all-items-acked (pull_request) [info tier:low] acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4
CI / Canvas Deploy Reminder (pull_request) Has been skipped
E2E Staging External Runtime / E2E Staging External Runtime (pull_request) Successful in 5m35s
CI / Canvas (Next.js) (pull_request) Successful in 15m7s
CI / all-required (pull_request) Successful in 5s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 1m44s
sop-checklist-gate / gate (pull_request) Successful in 18s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 40s
sop-tier-check / tier-check (pull_request) Successful in 19s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 27s
CI / Python Lint & Test (pull_request) Successful in 8m0s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 25s
Harness Replays / Harness Replays (pull_request) Successful in 8s
CI / Platform (Go) (pull_request) Successful in 13m43s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 9m20s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 1m38s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Failing after 2m18s
1ae0f91424
Addresses infra-sre REQUEST_CHANGES review #2489:
- audit-force-merge.yml: restore REQUIRED_CHECKS to CI/all-required +
  sop-checklist/all-items-acked (the stale Secret-scan + sop-tier-check
  values in this PR are no longer required on main)
- sweep-aws-secrets.yml: restore workflow_dispatch-only trigger; the
  cron schedule was intentionally disabled pending dedicated janitor
  credentials (AWS_SECRETS_JANITOR_*)
hongming dismissed infra-sre’s review 2026-05-13 11:51:50 +00:00
Reason:

Concern addressed: reverted audit-force-merge.yml REQUIRED_CHECKS and sweep-aws-secrets.yml to current main in commit 1ae0f9142.

core-qa approved these changes 2026-05-13 11:52:37 +00:00
core-qa left a comment
Member

APPROVE — fix commit 1ae0f9142 correctly reverts audit-force-merge.yml REQUIRED_CHECKS and sweep-aws-secrets.yml to the current main versions, resolving both infra-sre concerns. The original changes in this PR (CI continue-on-error masks, staging-smoke improvements, publish-canvas-image cleanup) are sound.

APPROVE — fix commit 1ae0f9142 correctly reverts audit-force-merge.yml REQUIRED_CHECKS and sweep-aws-secrets.yml to the current main versions, resolving both infra-sre concerns. The original changes in this PR (CI continue-on-error masks, staging-smoke improvements, publish-canvas-image cleanup) are sound.
hongming dismissed infra-sre’s review 2026-05-13 12:19:39 +00:00
Reason:

Dismissing: audit-force-merge.yml and sweep-aws-secrets.yml on this branch already have the correct required-checks values (CI / all-required + sop-checklist / all-items-acked). This was verified by reading the file content directly. False alarm.

devops-engineer added 1 commit 2026-05-13 12:29:27 +00:00
fix(ci): bring non-.gitea files up to current main
Some checks failed
Handlers Postgres Integration / detect-changes (pull_request) Successful in 38s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 42s
review-check-tests / review-check.sh regression tests (pull_request) Successful in 12s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 18s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 28s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Failing after 1m18s
gate-check-v3 / gate-check (pull_request) Failing after 21s
qa-review / approved (pull_request) Successful in 12s
sop-checklist / all-items-acked (pull_request) [info tier:low] acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4
security-review / approved (pull_request) Failing after 13s
sop-checklist-gate / gate (pull_request) Successful in 15s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m8s
lint-mask-pr-atomicity / lint-mask-pr-atomicity (pull_request) Failing after 1m27s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 38s
sop-tier-check / tier-check (pull_request) Successful in 15s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 1m23s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 1m23s
Runtime Pin Compatibility / PyPI-latest install + import smoke (pull_request) Successful in 1m58s
E2E Staging External Runtime / E2E Staging External Runtime (pull_request) Successful in 5m23s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 25s
Harness Replays / Harness Replays (pull_request) Successful in 6s
E2E API Smoke Test / detect-changes (pull_request) Failing after 11m40s
CI / Python Lint & Test (pull_request) Successful in 7m33s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 3m5s
CI / Platform (Go) (pull_request) Failing after 9m51s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Failing after 4m37s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 10m9s
CI / Canvas (Next.js) (pull_request) Successful in 15m35s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / all-required (pull_request) Failing after 5s
b7d3dfe4dc
PR#763 only intends to change .gitea/workflows/ (keep scheduled monitors
from marking main red). All other files were diverged from main because
the branch was 338 commits behind. Restore all non-.gitea files from main
to fix Go build errors (undefined: ctx in delegation_executor tests) and
other stale diffs.
hongming-pc2 approved these changes 2026-05-13 16:38:41 +00:00
hongming-pc2 left a comment
Owner

[core-security-agent] APPROVED — CI fix. Keeps scheduled monitors from marking main red. Operational fix only, no security surface.

[core-security-agent] APPROVED — CI fix. Keeps scheduled monitors from marking main red. Operational fix only, no security surface.
Member

CI/Infra Re-Review — PR #763 (updated)

Verdict: Issues remain — author action needed

Base SHA updated

PR is now based on current main (13d40fec). Good.

CI / all-required — FAILING (5s)

Root cause unchanged from prior review: lint-mask-pr-atomicity fails because:

  • .gitea/workflows/ci.yml adds a continue-on-error: true to canvas-deploy-reminder job
  • This requires a Paired: #NNN directive (paired PR handling the all-required.needs update)
  • No Paired: reference exists in PR body or commits

Fix: Add Paired: #NNN to the PR body, referencing the PR that atomically pairs the ci.yml CoE change with the all-required.needs update. Once the paired PR merges, re-push to refresh CI.

gate-check-v3 — FAILING (21s)

gate-check-v3 failed with the old base SHA. After rebase onto current main (13d40fec), a new CI run should produce a fresh gate-check-v3 status. Monitor whether it passes.

gate-check-v3.yml — mc#774 comment replaced with internal#350

The PR replaces # mc#774: pre-existing continue-on-error mask... with # internal#350: Phase-3 mask tracker; renew or remove within 14 days. in gate-check-v3.yml. The correct tracker is mc#774, not internal#350. This change should be reverted.

REQUIRED_CHECKS in audit-force-merge.yml — Correct now

After the 1ae0f914 commit, audit-force-merge.yml has the correct REQUIRED_CHECKS matching current main.

sop-tier-check — PASSES

sop-checklist-gate — PASSES

qa-review / security-review — FAILING (token scope, pre-existing)

Next step: Author needs to (1) add Paired: #NNN to PR body, and (2) revert the mc#774 → internal#350 comment change in gate-check-v3.yml.

## CI/Infra Re-Review — PR #763 (updated) **Verdict: Issues remain — author action needed** ### Base SHA updated ✅ PR is now based on current main (13d40fec). Good. ### ❌ `CI / all-required` — FAILING (5s) Root cause unchanged from prior review: `lint-mask-pr-atomicity` fails because: - `.gitea/workflows/ci.yml` adds a `continue-on-error: true` to `canvas-deploy-reminder` job - This requires a `Paired: #NNN` directive (paired PR handling the `all-required.needs` update) - No `Paired:` reference exists in PR body or commits **Fix**: Add `Paired: #NNN` to the PR body, referencing the PR that atomically pairs the ci.yml CoE change with the all-required.needs update. Once the paired PR merges, re-push to refresh CI. ### ❌ `gate-check-v3` — FAILING (21s) gate-check-v3 failed with the old base SHA. After rebase onto current main (13d40fec), a new CI run should produce a fresh gate-check-v3 status. Monitor whether it passes. ### ❌ `gate-check-v3.yml` — mc#774 comment replaced with internal#350 The PR replaces `# mc#774: pre-existing continue-on-error mask...` with `# internal#350: Phase-3 mask tracker; renew or remove within 14 days.` in `gate-check-v3.yml`. The correct tracker is `mc#774`, not `internal#350`. This change should be reverted. ### ✅ REQUIRED_CHECKS in audit-force-merge.yml — Correct now After the `1ae0f914` commit, `audit-force-merge.yml` has the correct REQUIRED_CHECKS matching current main. ### ✅ sop-tier-check — PASSES ### ✅ sop-checklist-gate — PASSES ### qa-review / security-review — FAILING (token scope, pre-existing) **Next step**: Author needs to (1) add `Paired: #NNN` to PR body, and (2) revert the mc#774 → internal#350 comment change in gate-check-v3.yml.
devops-engineer force-pushed fix/main-green-monitor-status from b7d3dfe4dc to 2ebd0c395a 2026-05-13 22:34:46 +00:00 Compare
hongming closed this pull request 2026-05-13 22:35:51 +00:00
Some checks are pending
CI / Platform (Go) (push) Blocked by required conditions
CI / Canvas (Next.js) (push) Blocked by required conditions
CI / Shellcheck (E2E scripts) (push) Blocked by required conditions
CI / Canvas Deploy Reminder (push) Blocked by required conditions
CI / Python Lint & Test (push) Blocked by required conditions
CI / all-required (push) Blocked by required conditions
E2E API Smoke Test / E2E API Smoke Test (push) Blocked by required conditions
E2E Staging Canvas (Playwright) / Canvas tabs E2E (push) Blocked by required conditions
Handlers Postgres Integration / Handlers Postgres Integration (push) Blocked by required conditions
Runtime PR-Built Compatibility / PR-built wheel + import smoke (push) Blocked by required conditions
Block internal-flavored paths / Block forbidden paths (push) Successful in 14s
CI / Detect changes (push) Successful in 43s
Harness Replays / detect-changes (pull_request) Successful in 25s
E2E API Smoke Test / detect-changes (push) Successful in 34s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 17s
E2E Staging Canvas (Playwright) / detect-changes (push) Successful in 36s
Continuous synthetic E2E (staging) / Synthetic E2E against staging (push) Has started running
Handlers Postgres Integration / detect-changes (push) Successful in 45s
Secret scan / Scan diff for credential-shaped strings (push) Successful in 15s
Runtime PR-Built Compatibility / detect-changes (push) Successful in 42s
publish-runtime-autobump / bump-and-tag (pull_request) Has been skipped
publish-runtime-autobump / pr-validate (push) Successful in 58s
publish-runtime-autobump / bump-and-tag (push) Failing after 1m11s
review-check-tests / review-check.sh regression tests (pull_request) Successful in 17s
publish-runtime-autobump / pr-validate (pull_request) Successful in 49s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Successful in 1m50s
lint-required-context-exists-in-bp / lint-required-context-exists-in-bp (pull_request) Failing after 1m54s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 2m4s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Failing after 1m24s
lint-mask-pr-atomicity / lint-mask-pr-atomicity (pull_request) Successful in 2m12s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 1m17s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 17s
CI / Detect changes (pull_request) Successful in 45s
E2E API Smoke Test / detect-changes (pull_request) Successful in 41s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 1m2s
qa-review / approved (pull_request) Successful in 24s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 1m4s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 20s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 57s
security-review / approved (pull_request) Successful in 16s
audit-force-merge / audit (pull_request) Has been skipped
sop-checklist-gate / gate (pull_request) Successful in 24s
sop-tier-check / tier-check (pull_request) Successful in 24s
gate-check-v3 / gate-check (pull_request) Failing after 28s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m35s
Harness Replays / Harness Replays (pull_request) Successful in 7s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 5s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 8s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 7s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 6s
CI / Python Lint & Test (pull_request) Successful in 7m38s
CI / Canvas (Next.js) (pull_request) Successful in 15m55s
CI / Platform (Go) (pull_request) Failing after 4m33s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / all-required (pull_request) Successful in 3s
Required
Details
sop-checklist / all-items-acked (pull_request)
Required

Pull request closed

Sign in to join this conversation.
No description provided.