From 7932bc4c482edf63763b20b0317bc8efa022f0f7 Mon Sep 17 00:00:00 2001
From: core-devops
Date: Wed, 20 May 2026 09:29:33 -0700
Subject: [PATCH] =?UTF-8?q?chore(ssot):=20delete=20dead=20.github/workflow?=
 =?UTF-8?q?s/=20=E2=80=94=20Gitea=20is=20SSOT=20(#331=20SSOT-Instance-4)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Per CTO directive 2026-05-20 and task #347 (disabled GitHub-mirror push
fleet-wide), .github/workflows/ on molecule-core is dead: Gitea Actions
reads .gitea/workflows/ exclusively (memory:
reference_molecule_core_actions_gitea_only), and GitHub Actions has had
no real push activity since 2026-05-06 (the only post-2026-05-06 runs
are dynamic CodeQL re-runs on frozen pre-suspension PRs).

Empirical validation:
- 24 files total in .github/workflows/.
- 23 have same-name siblings in .gitea/workflows/ (port carries
  "Ported from .github/workflows/X on 2026-05-11 per RFC internal#219"
  header on most files).
- 1 .github-only file: canary-staging.yml, already ported to
  .gitea/workflows/staging-smoke.yml on 2026-05-11 per the same RFC,
  Hongming directive renamed canary→smoke. Verified via header comment
  in staging-smoke.yml.
- Last GitHub-side push event: 2026-05-06T07:06:12Z (pre-suspension).
- All 24 .github/workflows/* files removed.

Tooling updates needed (load-bearing):
- tools/branch-protection/check_name_parity.sh: hard-coded
  $REPO_ROOT/.github/workflows path → switched to .gitea/workflows.
  Pre-existing parity findings (3x Analyze CodeQL names absent from any
  workflow file) are unchanged; that drift exists pre-PR and is
  out-of-scope (file as follow-up).
- tools/branch-protection/test_check_name_parity.sh: synthetic test
  fixtures now create .gitea/workflows/ instead of .github/workflows/.
  All 6 unit tests pass after change.
- .gitea/workflows/lint-required-workflows-docker-host-pinned.yml:
  dropped '.github/workflows/**' from path-filter triggers + dropped
  '.github/workflows' from the python directory-walk loop (the
  isdir-check would have made this a no-op cleanly, but pruning reflects
  current truth).

Out-of-scope (NOT touched in this PR):
- .github/CODEOWNERS, .github/dependabot.yml, .github/scripts/ remain
  (task is scoped to .github/workflows/).
- COVERAGE_FLOOR.md, workspace/smoke_mode.py, workspace/main.py contain
  comment references to .github/workflows/*; stale docs
  string-references only, not behavioral. Separate follow-up.
- Provenance comments inside .gitea/workflows/* of the form
  "Ported from .github/workflows/X on 2026-05-11" are intentionally
  preserved as useful history.

Refs: task #331 (SSOT-Instance-4), task #347 (mirror push disabled),
memory reference_molecule_core_actions_gitea_only, memory
reference_per_repo_gitea_vs_github_actions_dir, RFC internal#219 §1
(the original 2026-05-11 port sweep).
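
The same-name sibling check described under "Empirical validation" could be reproduced with a short script. This is a minimal sketch, not the tooling used for this PR: the helper name `unported_workflows` and its `renames` argument are hypothetical, and the script assumes a checkout that predates this deletion, so that both trees still exist.

```python
import os


def unported_workflows(github_dir, gitea_dir, renames=None):
    """Return workflow files in github_dir with no counterpart in gitea_dir.

    `renames` (hypothetical argument) maps a legacy file name to the name
    it was ported under, e.g. canary-staging.yml -> staging-smoke.yml.
    """
    renames = renames or {}
    # A missing directory is treated as empty, in the same spirit as the
    # os.path.isdir guard in the lint workflow's directory walk.
    legacy = set(os.listdir(github_dir)) if os.path.isdir(github_dir) else set()
    ported = set(os.listdir(gitea_dir)) if os.path.isdir(gitea_dir) else set()
    return sorted(
        name for name in legacy if renames.get(name, name) not in ported
    )


if __name__ == "__main__":
    missing = unported_workflows(
        ".github/workflows",
        ".gitea/workflows",
        renames={"canary-staging.yml": "staging-smoke.yml"},
    )
    print("unported:", missing)
```

An empty result would indicate that every legacy workflow has a same-name or renamed sibling. Without the `renames` entry, canary-staging.yml would be expected to show up as the single .github-only file.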
---
 ...-required-workflows-docker-host-pinned.yml |   7 +-
 .github/workflows/block-internal-paths.yml    | 154 ------
 .github/workflows/canary-staging.yml          | 320 -------------
 .github/workflows/cascade-list-drift-gate.yml |  39 --
 .../workflows/check-migration-collisions.yml  |  58 ---
 .github/workflows/ci.yml                      | 442 ------------------
 .github/workflows/continuous-synth-e2e.yml    | 257 ----------
 .github/workflows/e2e-api.yml                 | 307 ------------
 .github/workflows/e2e-staging-canvas.yml      | 216 ---------
 .github/workflows/e2e-staging-external.yml    | 184 --------
 .github/workflows/e2e-staging-saas.yml        | 246 ----------
 .github/workflows/e2e-staging-sanity.yml      | 171 -------
 .../handlers-postgres-integration.yml         | 251 ----------
 .github/workflows/harness-replays.yml         | 248 ----------
 .../workflows/lint-curl-status-capture.yml    |  94 ----
 .github/workflows/publish-canvas-image.yml    | 121 -----
 .github/workflows/railway-pin-audit.yml       | 207 --------
 .github/workflows/runtime-pin-compat.yml      |  91 ----
 .github/workflows/runtime-prbuild-compat.yml  | 152 ------
 .github/workflows/secret-pattern-drift.yml    |  58 ---
 .github/workflows/sweep-aws-secrets.yml       | 129 -----
 .github/workflows/sweep-cf-orphans.yml        | 146 ------
 .github/workflows/sweep-cf-tunnels.yml        | 124 -----
 .github/workflows/sweep-stale-e2e-orgs.yml    | 239 ----------
 .github/workflows/test-ops-scripts.yml        |  52 ---
 tools/branch-protection/check_name_parity.sh  |   8 +-
 .../test_check_name_parity.sh                 |  10 +-
 27 files changed, 16 insertions(+), 4315 deletions(-)
 delete mode 100644 .github/workflows/block-internal-paths.yml
 delete mode 100644 .github/workflows/canary-staging.yml
 delete mode 100644 .github/workflows/cascade-list-drift-gate.yml
 delete mode 100644 .github/workflows/check-migration-collisions.yml
 delete mode 100644 .github/workflows/ci.yml
 delete mode 100644 .github/workflows/continuous-synth-e2e.yml
 delete mode 100644 .github/workflows/e2e-api.yml
 delete mode 100644 .github/workflows/e2e-staging-canvas.yml
 delete mode 100644 .github/workflows/e2e-staging-external.yml
 delete mode 100644 .github/workflows/e2e-staging-saas.yml
 delete mode 100644 .github/workflows/e2e-staging-sanity.yml
 delete mode 100644 .github/workflows/handlers-postgres-integration.yml
 delete mode 100644 .github/workflows/harness-replays.yml
 delete mode 100644 .github/workflows/lint-curl-status-capture.yml
 delete mode 100644 .github/workflows/publish-canvas-image.yml
 delete mode 100644 .github/workflows/railway-pin-audit.yml
 delete mode 100644 .github/workflows/runtime-pin-compat.yml
 delete mode 100644 .github/workflows/runtime-prbuild-compat.yml
 delete mode 100644 .github/workflows/secret-pattern-drift.yml
 delete mode 100644 .github/workflows/sweep-aws-secrets.yml
 delete mode 100644 .github/workflows/sweep-cf-orphans.yml
 delete mode 100644 .github/workflows/sweep-cf-tunnels.yml
 delete mode 100644 .github/workflows/sweep-stale-e2e-orgs.yml
 delete mode 100644 .github/workflows/test-ops-scripts.yml

diff --git a/.gitea/workflows/lint-required-workflows-docker-host-pinned.yml b/.gitea/workflows/lint-required-workflows-docker-host-pinned.yml
index 957740f1..d1898dad 100644
--- a/.gitea/workflows/lint-required-workflows-docker-host-pinned.yml
+++ b/.gitea/workflows/lint-required-workflows-docker-host-pinned.yml
@@ -28,12 +28,10 @@ on:
   pull_request:
     paths:
       - '.gitea/workflows/**'
-      - '.github/workflows/**'
   push:
     branches: [main, staging]
     paths:
       - '.gitea/workflows/**'
-      - '.github/workflows/**'
 
 permissions:
   contents: read
@@ -75,8 +73,11 @@ jobs:
           fails = []
           warnings = []
 
+          # Gitea is SSOT for molecule-core CI per task #347 / memory
+          # reference_molecule_core_actions_gitea_only. The legacy
+          # .github/workflows/ tree was deleted in SSOT-Instance-4 (#331).
roots = [] - for root in ('.gitea/workflows', '.github/workflows'): + for root in ('.gitea/workflows',): if os.path.isdir(root): roots.append(root) diff --git a/.github/workflows/block-internal-paths.yml b/.github/workflows/block-internal-paths.yml deleted file mode 100644 index 7629a669..00000000 --- a/.github/workflows/block-internal-paths.yml +++ /dev/null @@ -1,154 +0,0 @@ -name: Block internal-flavored paths - -# Hard CI gate. Internal content (positioning, competitive briefs, sales -# playbooks, PMM/press drip, draft campaigns) lives in molecule-ai/internal — -# this public monorepo must never re-acquire those paths. CEO directive -# 2026-04-23 after a fleet-wide audit found 79 internal files leaked here. -# -# Failure mode without this gate: agents (PMM, Research, DevRel, Sales) drop -# briefs into the easiest path their cwd resolves to (root /research, -# /marketing, /docs/marketing) and gitignore alone won't catch a `git add -f` -# or a stale gitignore line. This workflow is the mechanical backstop. - -on: - pull_request: - types: [opened, synchronize, reopened] - push: - branches: [main, staging] - # Required for GitHub merge queue: the queue's pre-merge CI run on - # `gh-readonly-queue/...` refs needs this check to fire so the queue - # gets a real result instead of stalling forever AWAITING_CHECKS. - merge_group: - types: [checks_requested] - -jobs: - check: - name: Block forbidden paths - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - with: - fetch-depth: 2 # need previous commit to diff against on push events - - # For pull_request events the diff base is github.event.pull_request.base.sha, - # which may be many commits behind HEAD and therefore absent from the - # shallow clone above. Fetch it explicitly (depth=1 keeps it fast). 
- - name: Fetch PR base SHA (pull_request events only) - if: github.event_name == 'pull_request' - run: git fetch --depth=1 origin ${{ github.event.pull_request.base.sha }} - - # For merge_group events the queue's pre-merge ref is a commit on - # `gh-readonly-queue/...` whose parent is the queue's base_sha. - # That parent isn't part of the queue branch's shallow clone, so - # we fetch it explicitly. Mirrors the equivalent step in - # secret-scan.yml (#2120) — same shallow-clone bug class. - - name: Fetch merge_group base SHA (merge_group events only) - if: github.event_name == 'merge_group' - run: git fetch --depth=1 origin ${{ github.event.merge_group.base_sha }} - - - name: Refuse if forbidden paths appear - env: - # Plumb event-specific SHAs through env so the script doesn't - # need conditional `${{ ... }}` interpolation per event type. - # github.event.before/after only exist on push events; - # merge_group has its own base_sha/head_sha; pull_request has - # pull_request.base.sha / pull_request.head.sha. - PR_BASE_SHA: ${{ github.event.pull_request.base.sha }} - PR_HEAD_SHA: ${{ github.event.pull_request.head.sha }} - MG_BASE_SHA: ${{ github.event.merge_group.base_sha }} - MG_HEAD_SHA: ${{ github.event.merge_group.head_sha }} - PUSH_BEFORE: ${{ github.event.before }} - PUSH_AFTER: ${{ github.event.after }} - run: | - # Paths that must NEVER live in the public monorepo. Add to this - # list narrowly — broader patterns belong in .gitignore so day-to-day - # docs work isn't accidentally blocked. - FORBIDDEN_PATTERNS=( - "^research/" - "^marketing/" - "^docs/marketing/" - "^comment-[0-9]+\.json$" - "^test-pmm.*\.(txt|md)$" - "^tick-reflections.*\.(txt|md)$" - ".*-temp\.(md|txt)$" - ) - - # Determine the diff base. Each event type stores its SHAs in - # a different place — see the env block above. 
- case "${{ github.event_name }}" in - pull_request) - BASE="$PR_BASE_SHA" - HEAD="$PR_HEAD_SHA" - ;; - merge_group) - BASE="$MG_BASE_SHA" - HEAD="$MG_HEAD_SHA" - ;; - *) - BASE="$PUSH_BEFORE" - HEAD="$PUSH_AFTER" - ;; - esac - - # On push events with shallow clones, BASE may be present in - # the event payload but absent from the local object DB - # (fetch-depth=2 doesn't always reach the previous commit - # across true merges). Try fetching it on demand. If the - # fetch fails — e.g. the SHA was force-overwritten — we fall - # through to the empty-BASE branch below, which scans the - # entire tree as if every file were new. Correct, just slow. - # Same recovery shape as secret-scan.yml (#2120 — incident - # 2026-04-27 06:50Z block-internal-paths exit 128 with - # "fatal: bad object " on staging push). - if [ -n "$BASE" ] && ! echo "$BASE" | grep -qE '^0+$'; then - if ! git cat-file -e "$BASE" 2>/dev/null; then - git fetch --depth=1 origin "$BASE" 2>/dev/null || true - fi - fi - - # Files added or modified in this change. - if [ -z "$BASE" ] || echo "$BASE" | grep -qE '^0+$' || ! git cat-file -e "$BASE" 2>/dev/null; then - # New branch / no previous SHA / BASE unreachable — check - # the entire tree as if every file were new. Slower but - # correct on first push or post-fetch-failure recovery. - CHANGED=$(git ls-tree -r --name-only HEAD) - else - CHANGED=$(git diff --name-only --diff-filter=AM "$BASE" "$HEAD") - fi - - if [ -z "$CHANGED" ]; then - echo "No changed files to inspect." - exit 0 - fi - - OFFENDING="" - for path in $CHANGED; do - for pattern in "${FORBIDDEN_PATTERNS[@]}"; do - if echo "$path" | grep -qE "$pattern"; then - OFFENDING="${OFFENDING}${path} (matched: ${pattern})\n" - break - fi - done - done - - if [ -n "$OFFENDING" ]; then - echo "::error::Forbidden internal-flavored paths detected:" - printf "$OFFENDING" - echo "" - echo "These paths belong in molecule-ai/internal, not this public repo." 
- echo "See docs/internal-content-policy.md for canonical locations." - echo "" - echo "If your file is genuinely public-facing (e.g. a blog post" - echo "ready to ship), use one of these alternatives instead:" - echo " • Public-bound blog posts: docs/blog/.md" - echo " • Public-bound tutorials: docs/tutorials/.md" - echo " • Public devrel content: docs/devrel/.md" - echo "" - echo "If you legitimately need to add a new top-level path that" - echo "happens to match a forbidden pattern, edit" - echo ".github/workflows/block-internal-paths.yml and update the" - echo "FORBIDDEN_PATTERNS list with reviewer signoff." - exit 1 - fi - - echo "✓ No forbidden paths in this change." diff --git a/.github/workflows/canary-staging.yml b/.github/workflows/canary-staging.yml deleted file mode 100644 index bf75c57f..00000000 --- a/.github/workflows/canary-staging.yml +++ /dev/null @@ -1,320 +0,0 @@ -name: Canary — staging SaaS smoke (every 30 min) - -# Minimum viable health check: provisions one Hermes workspace on a fresh -# staging org, sends one A2A message, verifies PONG, tears down. ~8 min -# wall clock. Pages on failure by opening a GitHub issue; auto-closes the -# issue on the next green run. -# -# The full-SaaS workflow (e2e-staging-saas.yml) covers the broader surface -# but runs only on provisioning-critical pushes + nightly — this one -# catches drift in the 30-min window between those runs (AMI health, CF -# cert rotation, WorkOS session stability, etc.). -# -# Lean mode: E2E_MODE=canary skips the child workspace + HMA memory + -# peers/activity checks. One parent workspace + one A2A turn is enough -# to signal "SaaS stack end-to-end is alive." - -on: - schedule: - # Every 30 min. Cron on GitHub-hosted runners has a known drift of - # a few minutes under load — that's fine for a canary. - - cron: '*/30 * * * *' - workflow_dispatch: - inputs: - keep_on_failure: - description: >- - Skip teardown when the canary fails (debugging only). 
The - tenant org + EC2 + CF tunnel + DNS stay alive so an operator - can SSM into the workspace EC2 and capture docker logs of the - failing claude-code container. REMEMBER to manually delete - via DELETE /cp/admin/tenants/ when done so the org - doesn't accumulate cost. Only honored on workflow_dispatch; - cron runs always tear down (we don't want unattended cron - to leak resources). - type: boolean - default: false - -# Serialise with the full-SaaS workflow so they don't contend for the -# same org-create quota on staging. Different group key from -# e2e-staging-saas since we don't mind queueing canaries behind one -# full run, but two canaries SHOULD queue against each other. -concurrency: - group: canary-staging - cancel-in-progress: false - -permissions: - # Needed to open / close the alerting issue. - issues: write - contents: read - -jobs: - canary: - name: Canary smoke - runs-on: ubuntu-latest - # 25 min headroom over the 15-min TLS-readiness deadline in - # tests/e2e/test_staging_full_saas.sh (#2107). Without the buffer - # the job is killed at the wall-clock 15:00 mark BEFORE the bash - # `fail` + diagnostic burst can fire, leaving every cancellation - # silent. Sibling staging E2E jobs run at 20-45 min — keeping - # canary tighter than them so a true wedge still surfaces here - # first. - timeout-minutes: 25 - - env: - MOLECULE_CP_URL: https://staging-api.moleculesai.app - MOLECULE_ADMIN_TOKEN: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKEN }} - # MiniMax is the canary's PRIMARY LLM auth path post-2026-05-04. - # Switched from hermes+OpenAI after #2578 (the staging OpenAI key - # account went over quota and stayed dead for 36+ hours, taking - # the canary red the entire time). claude-code template's - # `minimax` provider routes ANTHROPIC_BASE_URL to - # api.minimax.io/anthropic and reads MINIMAX_API_KEY at boot — - # ~5-10x cheaper per token than gpt-4.1-mini AND on a separate - # billing account, so OpenAI quota collapse no longer wedges the - # canary. 
Mirrors the migration continuous-synth-e2e.yml made on - # 2026-05-03 (#265) for the same reason. tests/e2e/test_staging_ - # full_saas.sh branches SECRETS_JSON on which key is present — - # MiniMax wins when set. - E2E_MINIMAX_API_KEY: ${{ secrets.MOLECULE_STAGING_MINIMAX_API_KEY }} - # Direct-Anthropic alternative for operators who don't want to - # set up a MiniMax account (priority below MiniMax — first - # non-empty wins in test_staging_full_saas.sh's secrets-injection - # block). See #2578 PR comment for the rationale. - E2E_ANTHROPIC_API_KEY: ${{ secrets.MOLECULE_STAGING_ANTHROPIC_API_KEY }} - # OpenAI fallback — kept wired so an operator-dispatched run with - # E2E_RUNTIME=hermes overridden via workflow_dispatch can still - # exercise the OpenAI path without re-editing the workflow. - E2E_OPENAI_API_KEY: ${{ secrets.MOLECULE_STAGING_OPENAI_KEY }} - E2E_MODE: canary - E2E_RUNTIME: claude-code - # Pin the canary to a specific MiniMax model rather than relying - # on the per-runtime default (which could resolve to "sonnet" → - # direct Anthropic and defeat the cost saving). M2.7-highspeed - # is "Token Plan only" but cheap-per-token and fast. - E2E_MODEL_SLUG: MiniMax-M2.7-highspeed - E2E_RUN_ID: "canary-${{ github.run_id }}" - # Debug-only: when an operator dispatches with keep_on_failure=true, - # the canary script's E2E_KEEP_ORG=1 path skips teardown so the - # tenant org + EC2 stay alive for SSM-based log capture. Cron runs - # never set this (the input only exists on workflow_dispatch) so - # unattended cron always tears down. See molecule-core#129 - # failure mode #1 — capturing the actual exception requires - # docker logs from the live container. 
- E2E_KEEP_ORG: ${{ github.event.inputs.keep_on_failure == 'true' && '1' || '0' }} - - steps: - - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - - - name: Verify admin token present - run: | - if [ -z "$MOLECULE_ADMIN_TOKEN" ]; then - echo "::error::MOLECULE_STAGING_ADMIN_TOKEN not set" - exit 2 - fi - - - name: Verify LLM key present - run: | - # Per-runtime key check — claude-code uses MiniMax; hermes / - # langgraph (operator-dispatched only) use OpenAI. Hard-fail - # rather than soft-skip per the lesson from synth E2E #2578: - # an empty key silently falls through to the wrong - # SECRETS_JSON branch and the canary fails 5 min later with - # a confusing auth error instead of the clean "secret - # missing" message at the top. - case "${E2E_RUNTIME}" in - claude-code) - # Either MiniMax OR direct-Anthropic works — first - # non-empty wins in the test script's secrets-injection - # priority chain. Operators only need to set ONE of these - # secrets; we don't force a choice between them. 
- if [ -n "${E2E_MINIMAX_API_KEY:-}" ]; then - required_secret_name="MOLECULE_STAGING_MINIMAX_API_KEY" - required_secret_value="${E2E_MINIMAX_API_KEY}" - elif [ -n "${E2E_ANTHROPIC_API_KEY:-}" ]; then - required_secret_name="MOLECULE_STAGING_ANTHROPIC_API_KEY" - required_secret_value="${E2E_ANTHROPIC_API_KEY}" - else - required_secret_name="MOLECULE_STAGING_MINIMAX_API_KEY or MOLECULE_STAGING_ANTHROPIC_API_KEY" - required_secret_value="" - fi - ;; - langgraph|hermes) - required_secret_name="MOLECULE_STAGING_OPENAI_KEY" - required_secret_value="${E2E_OPENAI_API_KEY:-}" - ;; - *) - echo "::warning::Unknown E2E_RUNTIME='${E2E_RUNTIME}' — skipping LLM-key check" - required_secret_name="" - required_secret_value="present" - ;; - esac - if [ -n "$required_secret_name" ] && [ -z "$required_secret_value" ]; then - echo "::error::${required_secret_name} secret not set for runtime=${E2E_RUNTIME} — A2A will fail at request time with 'No LLM provider configured'" - exit 2 - fi - echo "LLM key present ✓ (runtime=${E2E_RUNTIME}, key=${required_secret_name}, len=${#required_secret_value})" - - - name: Canary run - id: canary - run: bash tests/e2e/test_staging_full_saas.sh - - # Alerting: open a sticky issue on the FIRST failure; comment on - # subsequent failures; auto-close on next green. Comment-on-existing - # de-duplicates so a single open issue accumulates the streak — - # ops sees one issue with N comments rather than N issues. - # - # Why no consecutive-failures threshold (e.g., wait 3 runs before - # filing): the prior threshold check used - # `github.rest.actions.listWorkflowRuns()` which Gitea 1.22.6 does - # not expose (returns 404). On Gitea Actions the threshold call - # ALWAYS failed, breaking the entire alerting step and going days - # silent on real regressions (38h+ chronic red on 2026-05-07/08 - # before this fix; tracked in molecule-core#129). 
Filing on first - # failure is also better UX — we want to know about the first red, - # not wait 90 min for it to "count." Real flakes get one issue + - # a quick close-on-green; persistent reds accumulate comments. - - name: Open issue on failure - if: failure() - uses: actions/github-script@3a2844b7e9c422d3c10d287c895573f7108da1b3 # v9.0.0 - with: - script: | - const title = '🔴 Canary failing: staging SaaS smoke'; - const runURL = `${context.serverUrl}/${context.repo.owner}/${context.repo.repo}/actions/runs/${context.runId}`; - - // Find an existing open canary issue (stable title match). - // If one exists, this isn't a "first failure" — comment and exit. - const { data: existing } = await github.rest.issues.listForRepo({ - owner: context.repo.owner, repo: context.repo.repo, - state: 'open', labels: 'canary-staging', - per_page: 10, - }); - const match = existing.find(i => i.title === title); - if (match) { - await github.rest.issues.createComment({ - owner: context.repo.owner, repo: context.repo.repo, - issue_number: match.number, - body: `Canary still failing. ${runURL}`, - }); - core.info(`Commented on existing issue #${match.number}`); - return; - } - - // No open issue yet — file one on this first failure. The - // comment-on-existing branch above means subsequent failures - // accumulate as comments on this same issue, so we don't - // spam new issues per run. - const body = - `Canary run failed at ${new Date().toISOString()}.\n\n` + - `Run: ${runURL}\n\n` + - `This issue auto-closes on the next green canary run. 
` + - `Consecutive failures add a comment here rather than a new issue.`; - await github.rest.issues.create({ - owner: context.repo.owner, repo: context.repo.repo, - title, body, - labels: ['canary-staging', 'bug'], - }); - core.info('Opened canary failure issue (first red)'); - - - name: Auto-close canary issue on success - if: success() - uses: actions/github-script@3a2844b7e9c422d3c10d287c895573f7108da1b3 # v9.0.0 - with: - script: | - const title = '🔴 Canary failing: staging SaaS smoke'; - const { data: open } = await github.rest.issues.listForRepo({ - owner: context.repo.owner, repo: context.repo.repo, - state: 'open', labels: 'canary-staging', - per_page: 10, - }); - const match = open.find(i => i.title === title); - if (match) { - await github.rest.issues.createComment({ - owner: context.repo.owner, repo: context.repo.repo, - issue_number: match.number, - body: `Canary recovered at ${new Date().toISOString()}. Closing.`, - }); - await github.rest.issues.update({ - owner: context.repo.owner, repo: context.repo.repo, - issue_number: match.number, - state: 'closed', - }); - core.info(`Closed recovered canary issue #${match.number}`); - } - - - name: Teardown safety net - if: always() - env: - ADMIN_TOKEN: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKEN }} - run: | - set +e - # Slug prefix matches what test_staging_full_saas.sh emits - # in canary mode: - # SLUG="e2e-canary-$(date +%Y%m%d)-${RUN_ID_SUFFIX}" - # Earlier this was `e2e-{today}-canary-` — that was the - # full-mode pattern (date FIRST, mode SECOND); canary slugs - # have mode FIRST, date SECOND. The mismatch silently - # never matched, leaving every cancelled-canary EC2 alive - # until the once-an-hour sweep eventually caught it - # (incident 2026-04-26 21:03Z: 1h25m EC2 leak before manual - # cleanup; same gap on three earlier cancellations today). 
- orgs=$(curl -sS "$MOLECULE_CP_URL/cp/admin/orgs" \ - -H "Authorization: Bearer $ADMIN_TOKEN" 2>/dev/null \ - | python3 -c " - import json, sys, os, datetime - run_id = os.environ.get('GITHUB_RUN_ID', '') - d = json.load(sys.stdin) - # Scope to slugs from THIS canary run when GITHUB_RUN_ID is - # available; the canary workflow sets E2E_RUN_ID='canary-\${run_id}' - # so the slug suffix is '-canary-\${run_id}-...'. Mirrors the - # full-mode safety net's per-run scoping (e2e-staging-saas.yml) - # added after the 2026-04-21 cross-run cleanup incident. - # Sweep both today AND yesterday's UTC dates so a run that - # crosses midnight still cleans up its own slug — see the - # 2026-04-26→27 canvas-safety-net incident. - today = datetime.date.today() - yesterday = today - datetime.timedelta(days=1) - dates = (today.strftime('%Y%m%d'), yesterday.strftime('%Y%m%d')) - if run_id: - prefixes = tuple(f'e2e-canary-{d}-canary-{run_id}' for d in dates) - else: - prefixes = tuple(f'e2e-canary-{d}-' for d in dates) - candidates = [o['slug'] for o in d.get('orgs', []) - if any(o.get('slug','').startswith(p) for p in prefixes) - and o.get('status') not in ('purged',)] - print('\n'.join(candidates)) - " 2>/dev/null) - # Per-slug DELETE with HTTP-code verification. The previous - # `... >/dev/null || true` swallowed every failure, so a 5xx - # or timeout from CP looked identical to "successfully cleaned - # up" and the tenant kept eating ~2 vCPU until the hourly - # stale sweep caught it (up to 2h later). Now we capture the - # response code and surface non-2xx as a workflow warning, so - # the run page shows which slug leaked. We still don't `exit 1` - # on cleanup failure — a single-canary cleanup miss shouldn't - # fail-flag the canary itself when the actual smoke check - # passed. The sweep-stale-e2e-orgs cron (now every 15 min, - # 30-min threshold) is the safety net for whatever slips past. - # See molecule-controlplane#420. 
- leaks=() - for slug in $orgs; do - # Tempfile-routed -w + set +e/-e prevents curl-exit-code - # pollution of the captured status (lint-curl-status-capture.yml). - set +e - curl -sS -o /tmp/canary-cleanup.out -w "%{http_code}" \ - -X DELETE "$MOLECULE_CP_URL/cp/admin/tenants/$slug" \ - -H "Authorization: Bearer $ADMIN_TOKEN" \ - -H "Content-Type: application/json" \ - -d "{\"confirm\":\"$slug\"}" >/tmp/canary-cleanup.code - set -e - code=$(cat /tmp/canary-cleanup.code 2>/dev/null || echo "000") - if [ "$code" = "200" ] || [ "$code" = "204" ]; then - echo "[teardown] deleted $slug (HTTP $code)" - else - echo "::warning::canary teardown for $slug returned HTTP $code — sweep-stale-e2e-orgs will catch it within ~45 min. Body: $(head -c 300 /tmp/canary-cleanup.out 2>/dev/null)" - leaks+=("$slug") - fi - done - if [ ${#leaks[@]} -gt 0 ]; then - echo "::warning::canary teardown left ${#leaks[@]} leak(s): ${leaks[*]}" - fi - exit 0 diff --git a/.github/workflows/cascade-list-drift-gate.yml b/.github/workflows/cascade-list-drift-gate.yml deleted file mode 100644 index 284a68d8..00000000 --- a/.github/workflows/cascade-list-drift-gate.yml +++ /dev/null @@ -1,39 +0,0 @@ -name: cascade-list-drift-gate - -# Structural gate: TEMPLATES list in publish-runtime.yml must match -# manifest.json's workspace_templates exactly. Closes the recurrence -# path of PR #2556 (the data fix) and is the first concrete deliverable -# of RFC #388 PR-3. -# -# Why a gate, not just discipline: PR #2536 pruned the manifest, but the -# cascade list wasn't updated for ~weeks before someone (PR #2556) -# noticed during an unrelated audit. During that window, codex never -# rebuilt on a runtime publish. A structural gate catches the drift -# the same day either file changes. -# -# Triggers narrowly to keep CI quiet: only on PRs that actually change -# one of the two files. 
The path-filtered split + always-emit-result -# pattern (memory: "Required check names need a job that always runs") -# is unnecessary here because the workflow IS the check name and PR -# branch protection should require it directly. Future-proof: if this -# becomes a required check, add a no-op aggregator with always() so the -# name still emits when paths don't match. - -on: - pull_request: - branches: [staging, main] - paths: - - manifest.json - - .github/workflows/publish-runtime.yml - - scripts/check-cascade-list-vs-manifest.sh - -permissions: - contents: read - -jobs: - check: - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4 - - name: Check cascade list matches manifest - run: bash scripts/check-cascade-list-vs-manifest.sh diff --git a/.github/workflows/check-migration-collisions.yml b/.github/workflows/check-migration-collisions.yml deleted file mode 100644 index eaa79cbf..00000000 --- a/.github/workflows/check-migration-collisions.yml +++ /dev/null @@ -1,58 +0,0 @@ -name: Check migration collisions - -# Hard gate (#2341): fails a PR that adds a migration prefix already -# claimed by the base branch or another open PR. Caught manually 2026-04-30 -# during PR #2276 rebase: 044_runtime_image_pins collided with -# 044_platform_inbound_secret from RFC #2312. This workflow makes that -# check automatic. -# -# Trigger model: pull_request only — there's no value running this on -# pushes to staging or main (those are post-merge; the gate must fire -# pre-merge to be useful). Path filter scopes to PRs that actually touch -# migrations. 
- -on: - pull_request: - types: [opened, synchronize, reopened] - paths: - - 'workspace-server/migrations/**' - - 'scripts/ops/check_migration_collisions.py' - - '.github/workflows/check-migration-collisions.yml' - -permissions: - contents: read - # gh pr list/diff need read access to other PRs - pull-requests: read - -jobs: - check: - name: Migration version collision check - runs-on: ubuntu-latest - timeout-minutes: 5 - steps: - - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - with: - # Need history to diff against base ref - fetch-depth: 0 - - - name: Detect collisions - env: - PR_NUMBER: ${{ github.event.pull_request.number }} - BASE_REF: origin/${{ github.event.pull_request.base.ref }} - HEAD_REF: ${{ github.event.pull_request.head.sha }} - GITHUB_REPOSITORY: ${{ github.repository }} - # gh CLI uses GH_TOKEN from env - GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} - run: | - # Ensure the named base ref exists locally. checkout@v4 with - # fetch-depth=0 pulls full history, but the explicit fetch is - # cheap insurance against form-of-ref differences across runs. - # - # IMPORTANT: do NOT pass --depth=1 here. The script below uses - # `git diff origin/...` (three-dot, merge-base form), - # which fails with "fatal: no merge base" if the base ref is - # shallow. The auto-promote staging→main PR (#2361) was blocked - # by exactly this for ~5h on 2026-04-30 — the depth=1 fetch - # overwrote checkout@v4's full-history clone with a shallow tip. - git fetch origin "${{ github.event.pull_request.base.ref }}" || true - python3 scripts/ops/check_migration_collisions.py diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml deleted file mode 100644 index b295ff38..00000000 --- a/.github/workflows/ci.yml +++ /dev/null @@ -1,442 +0,0 @@ -name: CI - -on: - push: - branches: [main, staging] - pull_request: - branches: [main, staging] - # GitHub merge queue fires `merge_group` for the queue's pre-merge CI run. 
- # Required so the queue gets a real check result instead of a false-green - # from the absence of a triggered workflow. Safe to add unconditionally — - # the event simply doesn't fire until the queue is enabled on the branch. - merge_group: - types: [checks_requested] - -# Cancel in-progress CI runs when a new commit arrives on the same ref. -# This prevents stale runs from queuing behind each other. The merge_group -# refs (refs/heads/gh-readonly-queue/...) get their own concurrency group -# automatically because github.ref differs from the PR ref. -concurrency: - group: ci-${{ github.ref }} - cancel-in-progress: true - -jobs: - # Detect which paths changed so downstream jobs can skip when only - # docs/markdown files were modified. - changes: - name: Detect changes - runs-on: ubuntu-latest - outputs: - platform: ${{ steps.check.outputs.platform }} - canvas: ${{ steps.check.outputs.canvas }} - python: ${{ steps.check.outputs.python }} - scripts: ${{ steps.check.outputs.scripts }} - steps: - - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - with: - fetch-depth: 0 - - id: check - run: | - # For PR events: diff against the base branch (not HEAD~1 of the branch, - # which may be unrelated after force-pushes). When a push updates a PR, - # both pull_request and push events fire — prefer the PR base so that - # the diff is always computed against the actual merge base, not the - # previous SHA on the branch which may be on a different history line. - BASE="${GITHUB_BASE_REF:-${{ github.event.before }}}" - # GITHUB_BASE_REF is set by GitHub for PR events (the base branch name). - # For pull_request events we use the stored base.sha; for push events - # (or when base.sha is unavailable) fall back to github.event.before. 
- if [ "${{ github.event_name }}" = "pull_request" ] && [ -n "${{ github.event.pull_request.base.sha }}" ]; then - BASE="${{ github.event.pull_request.base.sha }}" - fi - # Fallback: if BASE is empty or all zeros (new branch), run everything - if [ -z "$BASE" ] || echo "$BASE" | grep -qE '^0+$'; then - echo "platform=true" >> "$GITHUB_OUTPUT" - echo "canvas=true" >> "$GITHUB_OUTPUT" - echo "python=true" >> "$GITHUB_OUTPUT" - echo "scripts=true" >> "$GITHUB_OUTPUT" - exit 0 - fi - DIFF=$(git diff --name-only "$BASE" HEAD 2>/dev/null || echo ".github/workflows/ci.yml") - echo "platform=$(echo "$DIFF" | grep -qE '^workspace-server/|^\.github/workflows/ci\.yml$' && echo true || echo false)" >> "$GITHUB_OUTPUT" - echo "canvas=$(echo "$DIFF" | grep -qE '^canvas/|^\.github/workflows/ci\.yml$' && echo true || echo false)" >> "$GITHUB_OUTPUT" - echo "python=$(echo "$DIFF" | grep -qE '^workspace/|^\.github/workflows/ci\.yml$' && echo true || echo false)" >> "$GITHUB_OUTPUT" - echo "scripts=$(echo "$DIFF" | grep -qE '^tests/e2e/|^scripts/|^infra/scripts/|^\.github/workflows/ci\.yml$' && echo true || echo false)" >> "$GITHUB_OUTPUT" - - # Platform (Go) is a required check on staging. Always-run + per-step - # gating (see Canvas (Next.js) for the rationale and the failure mode - # this avoids). - platform-build: - name: Platform (Go) - needs: changes - runs-on: ubuntu-latest - defaults: - run: - working-directory: workspace-server - steps: - - if: needs.changes.outputs.platform != 'true' - working-directory: . - run: echo "No platform/** changes — skipping real build steps; this job always runs to satisfy the required-check name on branch protection." 
- - if: needs.changes.outputs.platform == 'true' - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - - if: needs.changes.outputs.platform == 'true' - uses: actions/setup-go@40f1582b2485089dde7abd97c1529aa768e1baff # v5 - with: - go-version: 'stable' - - if: needs.changes.outputs.platform == 'true' - run: go mod download - - if: needs.changes.outputs.platform == 'true' - run: go build ./cmd/server - # CLI (molecli) moved to standalone repo: github.com/molecule-ai/molecule-cli - - if: needs.changes.outputs.platform == 'true' - run: go vet ./... || true - - if: needs.changes.outputs.platform == 'true' - name: Run golangci-lint - run: golangci-lint run --timeout 3m ./... || true - - if: needs.changes.outputs.platform == 'true' - name: Run tests with race detection and coverage - run: go test -race -coverprofile=coverage.out ./... - - - if: needs.changes.outputs.platform == 'true' - name: Per-file coverage report - # Advisory — lists every source file with its coverage so reviewers - # can see at-a-glance where gaps are. Sorted ascending so the worst - # offenders float to the top. Does NOT fail the build; the hard - # gate is the threshold check below. (#1823) - run: | - echo "=== Per-file coverage (worst first) ===" - go tool cover -func=coverage.out \ - | grep -v '^total:' \ - | awk '{file=$1; sub(/:[0-9][0-9.]*:.*/, "", file); pct=$NF; gsub(/%/,"",pct); s[file]+=pct; c[file]++} - END {for (f in s) printf "%6.1f%% %s\n", s[f]/c[f], f}' \ - | sort -n - - - if: needs.changes.outputs.platform == 'true' - name: Check coverage thresholds - # Enforces two gates from #1823 Layer 1: - # 1. Total floor (25% — ratchet plan in COVERAGE_FLOOR.md). - # 2. Per-file floor — non-test .go files in security-critical - # paths with coverage <10% fail the build, UNLESS the file - # path is listed in .coverage-allowlist.txt (acknowledged - # historical debt with a tracking issue + expiry). 
- run: | - set -e - TOTAL_FLOOR=25 - # Security-critical paths where a 0%-coverage file is a real risk. - CRITICAL_PATHS=( - "internal/handlers/tokens" - "internal/handlers/workspace_provision" - "internal/handlers/a2a_proxy" - "internal/handlers/registry" - "internal/handlers/secrets" - "internal/middleware/wsauth" - "internal/crypto" - ) - - TOTAL=$(go tool cover -func=coverage.out | grep '^total:' | awk '{print $3}' | sed 's/%//') - echo "Total coverage: ${TOTAL}%" - if awk "BEGIN{exit !($TOTAL < $TOTAL_FLOOR)}"; then - echo "::error::Total coverage ${TOTAL}% is below the ${TOTAL_FLOOR}% floor. See COVERAGE_FLOOR.md for ratchet plan." - exit 1 - fi - - # Aggregate per-file coverage → /tmp/perfile.txt: " " - go tool cover -func=coverage.out \ - | grep -v '^total:' \ - | awk '{file=$1; sub(/:[0-9][0-9.]*:.*/, "", file); pct=$NF; gsub(/%/,"",pct); s[file]+=pct; c[file]++} - END {for (f in s) printf "%s %.1f\n", f, s[f]/c[f]}' \ - > /tmp/perfile.txt - - # Build allowlist — paths relative to workspace-server, one per line. - # Lines starting with # are comments. - ALLOWLIST="" - if [ -f ../.coverage-allowlist.txt ]; then - ALLOWLIST=$(grep -vE '^(#|[[:space:]]*$)' ../.coverage-allowlist.txt || true) - fi - - FAILED=0 - WARNED=0 - for path in "${CRITICAL_PATHS[@]}"; do - while read -r file pct; do - [[ "$file" == *_test.go ]] && continue - [[ "$file" == *"$path"* ]] || continue - awk "BEGIN{exit !($pct < 10)}" || continue - - # Strip the package-import prefix so we can match .coverage-allowlist.txt - # entries written as paths relative to workspace-server/. - # Handle both module paths: platform/workspace-server/... and platform/... - rel=$(echo "$file" | sed 's|^github.com/molecule-ai/molecule-monorepo/platform/workspace-server/||; s|^github.com/molecule-ai/molecule-monorepo/platform/||') - - if echo "$ALLOWLIST" | grep -qxF "$rel"; then - echo "::warning file=workspace-server/$rel::Critical file at ${pct}% coverage (allowlisted, #1823) — fix before expiry." 
- WARNED=$((WARNED+1)) - else - echo "::error file=workspace-server/$rel::Critical file at ${pct}% coverage — must be >=10% (target 80%). See #1823. To acknowledge as known debt, add this path to .coverage-allowlist.txt." - FAILED=$((FAILED+1)) - fi - done < /tmp/perfile.txt - done - - echo "" - echo "Critical-path check: $FAILED new failures, $WARNED allowlisted warnings." - - if [ "$FAILED" -gt 0 ]; then - echo "" - echo "$FAILED security-critical file(s) have <10% test coverage and are" - echo "NOT in the allowlist. These paths handle auth, tokens, secrets, or" - echo "workspace provisioning — a 0% file here is the exact gap that let" - echo "CWE-22, CWE-78, KI-005 slip through in past incidents. Either:" - echo " (a) add tests to raise coverage above 10%, or" - echo " (b) add the path to .coverage-allowlist.txt with an expiry date" - echo " and a tracking issue reference." - exit 1 - fi - - # Canvas (Next.js) — required check, always runs. See platform-build - # comment above for the rationale. - # - # Supersedes the canvas-build-noop pattern attempted in PR #2321: two - # jobs sharing `name:` doesn't actually satisfy branch protection - # because the SKIPPED check run sibling is treated as not-passed - # regardless of how many SUCCESS siblings it has. Verified empirically - # on PR #2314 — mergeStateStatus stayed BLOCKED until I collapsed to - # a single-job-with-conditional-steps shape. - canvas-build: - name: Canvas (Next.js) - needs: changes - runs-on: ubuntu-latest - defaults: - run: - working-directory: canvas - steps: - - if: needs.changes.outputs.canvas != 'true' - working-directory: . - run: echo "No canvas/** changes — skipping real build steps; this job always runs to satisfy the required-check name on branch protection." 
- - if: needs.changes.outputs.canvas == 'true' - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - - if: needs.changes.outputs.canvas == 'true' - uses: actions/setup-node@48b55a011bda9f5d6aeb4c2d9c7362e8dae4041e # v6.4.0 - with: - node-version: '22' - - if: needs.changes.outputs.canvas == 'true' - run: rm -f package-lock.json && npm install - - if: needs.changes.outputs.canvas == 'true' - run: npm run build - - if: needs.changes.outputs.canvas == 'true' - name: Run tests with coverage - # Coverage instrumentation is configured in canvas/vitest.config.ts - # (provider: v8, reporters: text + html + json-summary). Step 2 of - # #1815 — wires coverage into CI so we get a baseline visible on - # every PR. No threshold gate yet; thresholds dial in (Step 3, also - # tracked in #1815) after the team sees what current coverage is. - # Per the inline comment in vitest.config.ts: "first land - # observability so we can see the baseline, then dial in - # thresholds + a hard gate" — this PR ships the observability half. - run: npx vitest run --coverage - - name: Upload coverage summary as artifact - if: needs.changes.outputs.canvas == 'true' && always() - # Pinned to v3 for Gitea act_runner v0.6 compatibility — v4+ uses - # the GHES 3.10+ artifact protocol that Gitea 1.22.x does NOT - # implement, surfacing as `GHESNotSupportedError: @actions/artifact - # v2.0.0+, upload-artifact@v4+ and download-artifact@v4+ are not - # currently supported on GHES`. Drop this pin when Gitea ships - # the v4 protocol (tracked: post-Gitea-1.23 followup). 
- uses: actions/upload-artifact@c6a366c94c3e0affe28c06c8df20a878f24da3cf # v3.2.2 - with: - name: canvas-coverage-${{ github.run_id }} - path: canvas/coverage/ - retention-days: 7 - if-no-files-found: warn - - # MCP Server + SDK removed from CI — now in standalone repos: - # - github.com/molecule-ai/molecule-mcp-server (npm CI) - # - github.com/molecule-ai/molecule-sdk-python (PyPI CI) - - # e2e-api job moved to .github/workflows/e2e-api.yml (issue #458). - # It now has workflow-level concurrency (cancel-in-progress: false) so - # new pushes queue the E2E run rather than cancelling it at the run level. - - # Shellcheck (E2E scripts) — required check, always runs. See - # platform-build for the rationale. - shellcheck: - name: Shellcheck (E2E scripts) - needs: changes - runs-on: ubuntu-latest - steps: - - if: needs.changes.outputs.scripts != 'true' - run: echo "No tests/e2e/ or infra/scripts/ changes — skipping real shellcheck; this job always runs to satisfy the required-check name on branch protection." - - if: needs.changes.outputs.scripts == 'true' - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - - if: needs.changes.outputs.scripts == 'true' - name: Run shellcheck on tests/e2e/*.sh and infra/scripts/*.sh - # shellcheck is pre-installed on ubuntu-latest runners (via apt). - # infra/scripts/ is included because setup.sh + nuke.sh gate the - # README quickstart — a shellcheck regression there silently breaks - # new-user onboarding. scripts/ is intentionally excluded until its - # pre-existing SC3040/SC3043 warnings are cleaned up. - run: | - find tests/e2e infra/scripts -type f -name '*.sh' -print0 \ - | xargs -0 shellcheck --severity=warning - - - if: needs.changes.outputs.scripts == 'true' - name: Lint cleanup-trap hygiene (RFC #2873) - # Asserts every shell E2E test that calls `mktemp` also installs - # an EXIT trap. Catches the /tmp-leak class — a missing trap - # silently leaks scratch into CI runners (~10-100KB per run). 
- # See tests/e2e/lint_cleanup_traps.sh for the rule + fix pattern. - run: bash tests/e2e/lint_cleanup_traps.sh - - - if: needs.changes.outputs.scripts == 'true' - name: Run E2E bash unit tests (no live infra) - # Pure-bash unit tests for E2E helper libs (lib/*.sh). These pin - # behavior of dispatch logic that — when broken — silently masks as - # "Could not resolve authentication method" only after a successful - # tenant + workspace provision (PR #2571 incident, 2026-05-03). Add - # new self-contained unit tests here as the lib/ directory grows; - # tests requiring live CP/tenant credentials belong in the dedicated - # e2e-staging-* workflows, not this job. - run: | - bash tests/e2e/test_model_slug.sh - - canvas-deploy-reminder: - name: Canvas Deploy Reminder - runs-on: docker-host - needs: [changes, canvas-build] - # Only fires on direct pushes to main (i.e. after staging→main promotion). - if: needs.changes.outputs.canvas == 'true' && github.event_name == 'push' && github.ref == 'refs/heads/main' - steps: - - name: Write deploy reminder to step summary - env: - COMMIT_SHA: ${{ github.sha }} - RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }} - run: | - # Write body to a temp file — avoids backtick escaping in shell. - cat > /tmp/deploy-reminder.md << 'BODY' - ## Canvas build passed ✅ — deploy required - - The `publish-canvas-image` workflow is now building a fresh Docker image - (`ghcr.io/molecule-ai/canvas:latest`) in the background. - - Once it completes (~3–5 min), apply on the host machine with: - ```bash - cd - git pull origin main - docker compose pull canvas && docker compose up -d canvas - ``` - - If you need to rebuild from local source instead (e.g. 
testing unreleased - changes or a new `NEXT_PUBLIC_*` URL), use: - ```bash - docker compose build canvas && docker compose up -d canvas - ``` - BODY - printf '\n> Posted automatically by CI · commit `%s` · [build log](%s)\n' \ - "$COMMIT_SHA" "$RUN_URL" >> /tmp/deploy-reminder.md - - # Gitea has no commit-comments API (no equivalent of - # POST /repos/{owner}/{repo}/commits/{commit_sha}/comments). - # Write to GITHUB_STEP_SUMMARY instead — both GitHub Actions and - # Gitea Actions render this as the workflow run's summary page, - # which is where operators look for post-deploy action items. - # (#75 / PR-D) - cat /tmp/deploy-reminder.md >> "$GITHUB_STEP_SUMMARY" - - # Python Lint & Test — required check, always runs. See platform-build - # for the rationale. - python-lint: - name: Python Lint & Test - needs: changes - runs-on: ubuntu-latest - env: - WORKSPACE_ID: test - defaults: - run: - working-directory: workspace - steps: - - if: needs.changes.outputs.python != 'true' - working-directory: . - run: echo "No workspace/** changes — skipping real lint+test; this job always runs to satisfy the required-check name on branch protection." - - if: needs.changes.outputs.python == 'true' - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - - if: needs.changes.outputs.python == 'true' - uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0 - with: - python-version: '3.11' - cache: pip - cache-dependency-path: workspace/requirements.txt - - if: needs.changes.outputs.python == 'true' - run: pip install -r requirements.txt pytest pytest-asyncio pytest-cov sqlalchemy>=2.0.0 - # Coverage flags + fail-under floor moved into workspace/pytest.ini - # (issue #1817) so local `pytest` and CI use identical config. 
- - if: needs.changes.outputs.python == 'true' - run: python -m pytest --tb=short - - - if: needs.changes.outputs.python == 'true' - name: Per-file critical-path coverage (MCP / inbox / auth) - # MCP-critical Python files have a per-file floor on top of the - # 86% total floor in pytest.ini. Rationale (issue #2790, after - # the PR #2766 → PR #2771 cycle): the total floor averages ~6000 - # lines, so a single MCP file could regress to ~50% with no - # complaint as long as other modules compensate. These five - # files handle multi-tenant routing + auth + inbox dispatch — - # a coverage drop here is the same risk shape as a Go-side - # workspace-server token/secrets file dropping below 10%. - # - # Floor 75% sits below current actuals (80-96%) so this gate is - # strictly additive — no existing PR fails. Ratchet plan in - # COVERAGE_FLOOR.md. - run: | - set -e - PER_FILE_FLOOR=75 - CRITICAL_FILES=( - "a2a_mcp_server.py" - "mcp_cli.py" - "a2a_tools.py" - "a2a_tools_inbox.py" - "inbox.py" - "platform_auth.py" - ) - - # pytest already wrote .coverage; emit a JSON view scoped to - # the critical files so jq/python can read the per-file pct - # without parsing tabular text. --include uses fnmatch, and - # the leading "*" allows the file to live anywhere under the - # workspace root (today they sit at workspace/.py). - INCLUDES=$(printf '*%s,' "${CRITICAL_FILES[@]}") - INCLUDES="${INCLUDES%,}" - python -m coverage json -o /tmp/critical-cov.json --include="$INCLUDES" - - FAILED=0 - for f in "${CRITICAL_FILES[@]}"; do - # Match by top-level path key (e.g. "a2a_tools.py", not - # "builtin_tools/a2a_tools.py" — different file at 100%). - # The keys in coverage.json are paths relative to the run - # cwd (workspace/), so the critical-path entry sits at the - # bare basename. 
- pct=$(jq -r --arg f "$f" '.files | to_entries | map(select(.key == $f)) | .[0].value.summary.percent_covered // "MISSING"' /tmp/critical-cov.json) - if [ "$pct" = "MISSING" ]; then - echo "::error file=workspace/$f::No coverage data — file may have moved or test exclusion mis-set." - FAILED=$((FAILED+1)) - continue - fi - echo "$f: ${pct}%" - if awk "BEGIN{exit !($pct < $PER_FILE_FLOOR)}"; then - echo "::error file=workspace/$f::${pct}% < ${PER_FILE_FLOOR}% per-file floor (MCP critical path). See COVERAGE_FLOOR.md." - FAILED=$((FAILED+1)) - fi - done - - if [ "$FAILED" -gt 0 ]; then - echo "" - echo "$FAILED MCP critical-path file(s) below the ${PER_FILE_FLOOR}% per-file floor." - echo "These paths handle multi-tenant routing, auth tokens, and inbox dispatch." - echo "A coverage drop here is the same risk shape as Go-side tokens/secrets files" - echo "dropping below 10% (see COVERAGE_FLOOR.md). Either:" - echo " (a) add tests to raise coverage back above ${PER_FILE_FLOOR}%, or" - echo " (b) if this is unavoidable historical debt, file an issue and propose" - echo " adjusting the floor with rationale in COVERAGE_FLOOR.md." - exit 1 - fi - - # SDK + plugin validation moved to standalone repo: - # github.com/molecule-ai/molecule-sdk-python diff --git a/.github/workflows/continuous-synth-e2e.yml b/.github/workflows/continuous-synth-e2e.yml deleted file mode 100644 index 0fc4a20c..00000000 --- a/.github/workflows/continuous-synth-e2e.yml +++ /dev/null @@ -1,257 +0,0 @@ -name: Continuous synthetic E2E (staging) - -# Hard gate (#2342): cron-driven full-lifecycle E2E that catches -# regressions visible only at runtime — schema drift, deployment-pipeline -# gaps, vendor outages, env-var rotations, DNS / CF / Railway side-effects. -# -# Why this gate exists: -# PR-time CI catches code-level regressions but not deployment-time or -# integration-time ones. 
Today's empirical data: -# • #2345 (A2A v0.2 silent drop) — passed all unit tests, broke at -# JSON-RPC parse layer between sender and receiver. Visible only -# to a sender exercising the full path. -# • RFC #2312 chat upload — landed on staging-branch but never -# reached staging tenants because publish-workspace-server-image -# was main-only. Caught by manual dogfooding hours after deploy. -# Both would have surfaced within 15-20 min of regression if a -# continuous synth-E2E was running. -# -# Cadence: every 20 min (3x/hour). The script is conservatively -# bounded at 10 min wall-clock; even on degraded staging it should -# finish before the next firing. cron-overlap is guarded by the -# concurrency group below. -# -# Cost: ~3 runs/hour × 5-10 min × $0.008/min GHA = ~$0.50-$1/day. -# Plus a fresh tenant provisioned + torn down each run (Railway + -# AWS pennies). Negligible. -# -# Failure handling: when the run fails, the workflow exits non-zero -# and GitHub's standard email/notification path fires. Operators -# can subscribe to this workflow's failure channel for paging-grade -# alerting. - -on: - schedule: - # Every 10 minutes, on :02 :12 :22 :32 :42 :52. Three constraints: - # 1. Stay off the top-of-hour. GitHub Actions scheduler drops - # :00 firings under high load (own docs: - # https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#schedule). - # Prior history: cron was '0,20,40' (2026-05-02) — only :00 - # ever survived. Bumped to '10,30,50' (2026-05-03) on the - # theory that further-from-:00 wins. Empirically 2026-05-04 - # that ALSO dropped to ~60 min effective cadence (only ~1 - # schedule fire per hour — see molecule-core#2726). Detection - # latency was claimed 20 min, actual 60 min. - # 2. Avoid colliding with the existing :15 sweep-cf-orphans - # and :45 sweep-cf-tunnels — both hit the CF API and we - # don't want to fight for rate-limit tokens. - # 3. 
Avoid the :30 heavy slot (canary-staging /30, sweep-aws- - # secrets, sweep-stale-e2e-orgs every :15) — multiple - # overlapping cron registrations on the same minute is part - # of what GH drops under load. - # Solution: bump fires-per-hour 3 → 6 AND keep all slots in clean - # lanes (1-3 min away from any other cron). Even with empirically- - # observed ~67% GH drop ratio, 6 attempts/hour yields ~2 effective - # fires = ~30 min cadence; closer to the 20-min target than the - # current shape and provides a real degradation alarm if drops - # get worse. - - cron: '2,12,22,32,42,52 * * * *' - workflow_dispatch: - inputs: - runtime: - description: "Runtime to provision (claude-code = default + cheapest via MiniMax; langgraph = OpenAI-only; hermes = SDK-native path, slower)" - required: false - default: "claude-code" - type: string - model_slug: - description: "Model id to provision the workspace with (default MiniMax-M2.7-highspeed; e.g. 'sonnet' to test direct Anthropic, 'openai/gpt-4o' for hermes)" - required: false - default: "MiniMax-M2.7-highspeed" - type: string - keep_org: - description: "Skip teardown for post-mortem debugging (only manual dispatch — never set this for cron runs)" - required: false - default: false - type: boolean - -permissions: - contents: read - # No issue-write here — failures surface as red runs in the workflow - # history. If you want auto-issue-on-fail, add a follow-up step that - # uses gh issue create gated on `if: failure()`. Keeping the surface - # minimal until that's actually wanted. - -# Serialize so two firings can never overlap. Cron firing every 20 min -# but scripts conservatively bounded at 10 min — overlap shouldn't -# happen in steady state, but if a run hangs we don't want N more -# stacking up. -concurrency: - group: continuous-synth-e2e - cancel-in-progress: false - -jobs: - synth: - name: Synthetic E2E against staging - runs-on: ubuntu-latest - # Bumped from 12 → 20 (2026-05-04). 
Tenant user-data install phase - # (apt-get update + install docker.io/jq/awscli/caddy + snap install - # ssm-agent) runs from raw Ubuntu on every boot — none of it is - # pre-baked into the tenant AMI. Empirical fetch_secrets/ok timing - # across today's canaries: 51s → 82s → 143s → 625s. apt-mirror tail - # latency drives the boot-to-fetch_secrets phase from ~1min to >10min. - # A 12min budget leaves only ~2min for the workspace (which needs - # ~3.5min for claude-code cold boot) on slow-apt days, blowing the - # budget. 20min absorbs the worst tenant tail so the workspace probe - # gets the full ~7min it needs even on a slow apt day. Real fix: - # pre-bake caddy + ssm-agent into the tenant AMI (controlplane#TBD). - timeout-minutes: 20 - env: - # claude-code default: cold-start ~5 min (comparable to langgraph), - # but uses MiniMax-M2.7-highspeed via the template's third-party- - # Anthropic-compat path (workspace-configs-templates/claude-code- - # default/config.yaml:64-69). MiniMax is ~5-10x cheaper than - # gpt-4.1-mini per token AND avoids the recurring OpenAI quota- - # exhaustion class that took the canary down 2026-05-03 (#265). - # Operators can pick langgraph / hermes via workflow_dispatch - # when they specifically need to exercise the OpenAI or SDK- - # native paths. - E2E_RUNTIME: ${{ github.event.inputs.runtime || 'claude-code' }} - # Pin the canary to a specific MiniMax model rather than relying - # on the per-runtime default ("sonnet" → routes to direct - # Anthropic, defeats the cost saving). Operators can override - # via workflow_dispatch by setting a different E2E_MODEL_SLUG - # input if they need to exercise a specific model. M2.7-highspeed - # is "Token Plan only" but cheap-per-token and fast. - E2E_MODEL_SLUG: ${{ github.event.inputs.model_slug || 'MiniMax-M2.7-highspeed' }} - # Bound to 10 min so a stuck provision fails the run instead of - # holding up the next cron firing. 
15-min default in the script - # is for the on-PR full lifecycle where we have more headroom. - E2E_PROVISION_TIMEOUT_SECS: '600' - # Slug suffix — namespaced "synth-" so these runs are - # distinguishable from PR-driven runs in CP admin. - E2E_RUN_ID: synth-${{ github.run_id }} - # Forced false for cron; respected for manual dispatch - E2E_KEEP_ORG: ${{ github.event.inputs.keep_org == 'true' && '1' || '' }} - MOLECULE_CP_URL: ${{ vars.STAGING_CP_URL || 'https://staging-api.moleculesai.app' }} - MOLECULE_ADMIN_TOKEN: ${{ secrets.CP_STAGING_ADMIN_API_TOKEN }} - # MiniMax key is the canary's PRIMARY auth path. claude-code - # template's `minimax` provider routes ANTHROPIC_BASE_URL to - # api.minimax.io/anthropic and reads MINIMAX_API_KEY at boot. - # tests/e2e/test_staging_full_saas.sh branches SECRETS_JSON on - # which key is present — MiniMax wins when set. - E2E_MINIMAX_API_KEY: ${{ secrets.MOLECULE_STAGING_MINIMAX_API_KEY }} - # Direct-Anthropic alternative for operators who don't want to - # set up a MiniMax account (priority below MiniMax — first - # non-empty wins in test_staging_full_saas.sh's secrets-injection - # block). See #2578 PR comment for the rationale. - E2E_ANTHROPIC_API_KEY: ${{ secrets.MOLECULE_STAGING_ANTHROPIC_API_KEY }} - # OpenAI fallback — kept wired so operators can dispatch with - # E2E_RUNTIME=langgraph or =hermes and still have a working - # canary path. The script picks the right blob shape based on - # which key is non-empty. - E2E_OPENAI_API_KEY: ${{ secrets.MOLECULE_STAGING_OPENAI_KEY }} - steps: - - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - - - name: Verify required secrets present - run: | - # Hard-fail on missing secret REGARDLESS of trigger. 
Previously - # this step soft-skipped on workflow_dispatch via `exit 0`, but - # `exit 0` only ends the STEP — subsequent steps still ran with - # the empty secret, the synth script fell through to the wrong - # SECRETS_JSON branch, and the canary failed 5 min later with a - # confusing "Agent error (Exception)" instead of the clean - # "secret missing" message at the top. Caught 2026-05-04 by - # dispatched run 25296530706: claude-code + missing MINIMAX - # silently used OpenAI keys but kept model=MiniMax-M2.7, then - # the workspace 401'd against MiniMax once it tried to call. - # Fix: exit 1 in both cron and dispatch paths. Operators who - # want to verify a YAML change without setting up the secret - # can read the verify-secrets step's stderr — the failure is - # itself the verification signal. - if [ -z "${MOLECULE_ADMIN_TOKEN:-}" ]; then - echo "::error::CP_STAGING_ADMIN_API_TOKEN secret missing — synth E2E cannot run" - echo "::error::Set it at Settings → Secrets and Variables → Actions; pull from staging-CP's CP_ADMIN_API_TOKEN env in Railway." - exit 1 - fi - - # LLM-key requirement is per-runtime: claude-code accepts - # EITHER MiniMax OR direct-Anthropic (whichever is set first), - # langgraph + hermes use OpenAI (MOLECULE_STAGING_OPENAI_KEY). 
- case "${E2E_RUNTIME}" in - claude-code) - if [ -n "${E2E_MINIMAX_API_KEY:-}" ]; then - required_secret_name="MOLECULE_STAGING_MINIMAX_API_KEY" - required_secret_value="${E2E_MINIMAX_API_KEY}" - elif [ -n "${E2E_ANTHROPIC_API_KEY:-}" ]; then - required_secret_name="MOLECULE_STAGING_ANTHROPIC_API_KEY" - required_secret_value="${E2E_ANTHROPIC_API_KEY}" - else - required_secret_name="MOLECULE_STAGING_MINIMAX_API_KEY or MOLECULE_STAGING_ANTHROPIC_API_KEY" - required_secret_value="" - fi - ;; - langgraph|hermes) - required_secret_name="MOLECULE_STAGING_OPENAI_KEY" - required_secret_value="${E2E_OPENAI_API_KEY:-}" - ;; - *) - echo "::warning::Unknown E2E_RUNTIME='${E2E_RUNTIME}' — skipping LLM-key check" - required_secret_name="" - required_secret_value="present" - ;; - esac - if [ -n "$required_secret_name" ] && [ -z "$required_secret_value" ]; then - echo "::error::${required_secret_name} secret missing — runtime=${E2E_RUNTIME} cannot authenticate against its LLM provider" - echo "::error::Set it at Settings → Secrets and Variables → Actions, OR dispatch with a different runtime" - exit 1 - fi - - - name: Install required tools - run: | - # The script depends on jq + curl (already on ubuntu-latest) - # and python3 (likewise). Verify they're all present so we - # fail fast on a runner image regression rather than mid-script. - for cmd in jq curl python3; do - command -v "$cmd" >/dev/null 2>&1 || { - echo "::error::required tool '$cmd' not on PATH — runner image regression?" - exit 1 - } - done - - - name: Run synthetic E2E - # The script handles its own teardown via EXIT trap; even on - # failure (timeout, assertion), the org is deprovisioned and - # leaks are reported. Exit code propagates from the script. - run: | - bash tests/e2e/test_staging_full_saas.sh - - - name: Failure summary - # Runs only on failure. Adds a job summary so the workflow run - # page shows a quick "what happened" instead of forcing readers - # to scroll through script output. 
- if: failure() - run: | - { - echo "## Continuous synth E2E failed" - echo "" - echo "**Run ID:** ${{ github.run_id }}" - echo "**Trigger:** ${{ github.event_name }}" - echo "**Runtime:** ${E2E_RUNTIME}" - echo "**Slug:** synth-${{ github.run_id }}" - echo "" - echo "### What this means" - echo "" - echo "Staging just regressed on a path that previously worked. Likely classes:" - echo "- Schema mismatch between sender and receiver (#2345 class)" - echo "- Deployment-pipeline gap (RFC #2312 / staging-tenant-image-stale class)" - echo "- Vendor outage (Cloudflare, Railway, AWS, GHCR)" - echo "- Staging-CP env var rotation" - echo "" - echo "### Next steps" - echo "" - echo "1. Check the script output above for the assertion that failed" - echo "2. If it's a vendor outage, no action needed — next firing in ~20 min" - echo "3. If it's a code regression, find the causing PR via \`git log\` against last green run and revert/fix" - echo "4. Keep an eye on the next 1-2 firings — flake vs persistent fail differs in priority" - } >> "$GITHUB_STEP_SUMMARY" diff --git a/.github/workflows/e2e-api.yml b/.github/workflows/e2e-api.yml deleted file mode 100644 index fe855d2d..00000000 --- a/.github/workflows/e2e-api.yml +++ /dev/null @@ -1,307 +0,0 @@ -name: E2E API Smoke Test -# Extracted from ci.yml so workflow-level concurrency can protect this job -# from run-level cancellation (issue #458). -# -# Trigger model (revised 2026-04-29): -# -# Always FIRES on push/pull_request to staging+main. Real work is gated -# per-step on `needs.detect-changes.outputs.api` — when paths under -# `workspace-server/`, `tests/e2e/`, or this workflow file haven't -# changed, the no-op step alone runs and emits SUCCESS for the -# `E2E API Smoke Test` check, satisfying branch protection without -# spending CI cycles. See the in-job comment on the `e2e-api` job for -# why this is one job (not two-jobs-sharing-name) and the 2026-04-29 -# PR #2264 incident that drove the consolidation. 
-# -# Parallel-safety (Class B Hongming-owned CICD red sweep, 2026-05-08) -# ------------------------------------------------------------------- -# Same substrate hazard as PR #98 (handlers-postgres-integration). Our -# Gitea act_runner runs with `container.network: host` (operator host -# `/opt/molecule/runners/config.yaml`), which means: -# -# * Two concurrent runs both try to bind their `-p 15432:5432` / -# `-p 16379:6379` host ports — the second postgres/redis FATALs -# with `Address in use` and `docker run` returns exit 125 with -# `Conflict. The container name "/molecule-ci-postgres" is already -# in use by container ...`. Verified in run a7/2727 on 2026-05-07. -# * The fixed container names `molecule-ci-postgres` / `-redis` (the -# pre-fix shape) collide on name AS WELL AS port. The cleanup-with- -# `docker rm -f` at the start of the second job KILLS the first -# job's still-running postgres/redis. -# -# Fix shape (mirrors PR #98's bridge-net pattern, adapted because -# platform-server is a Go binary on the host, not a containerised -# step): -# -# 1. Unique container names per run: -# pg-e2e-api-${RUN_ID}-${RUN_ATTEMPT} -# redis-e2e-api-${RUN_ID}-${RUN_ATTEMPT} -# `${RUN_ID}-${RUN_ATTEMPT}` is unique even across reruns of the -# same run_id. -# 2. Ephemeral host port per run (`-p 0:5432`), then read the actual -# bound port via `docker port` and export DATABASE_URL/REDIS_URL -# pointing at it. No fixed host-port → no port collision. -# 3. `127.0.0.1` (NOT `localhost`) in URLs — IPv6 first-resolve was -# the original flake fixed in #92 and the script's still IPv6- -# enabled. -# 4. `if: always()` cleanup so containers don't leak when test steps -# fail. -# -# Issue #94 items #2 + #3 (also fixed here): -# * Pre-pull `alpine:latest` so the platform-server's provisioner -# (`internal/handlers/container_files.go`) can stand up its -# ephemeral token-write helper without a daemon.io round-trip. 
-# * Create `molecule-core-net` bridge network if missing so the -# provisioner's container.HostConfig {NetworkMode: ...} attach -# succeeds. -# Item #1 (timeouts) — evidence on recent runs (77/3191, ae/4270, 0e/ -# 2318) shows Postgres ready in 3s, Redis in 1s, Platform in 1s when -# they DO come up. Timeouts are not the bottleneck; not bumped. -# -# Item explicitly NOT fixed here: failing test `Status back online` -# fails because the platform's langgraph workspace template image -# (ghcr.io/molecule-ai/workspace-template-langgraph:latest) returns -# 403 Forbidden post-2026-05-06 GitHub org suspension. That is a -# template-registry resolution issue (ADR-002 / local-build mode) and -# belongs in a separate change that touches workspace-server, not -# this workflow file. - -on: - push: - branches: [main, staging] - pull_request: - branches: [main, staging] - workflow_dispatch: - -concurrency: - # Per-SHA grouping (changed 2026-04-28 from per-ref). Per-ref had the - # same auto-promote-staging brittleness as e2e-staging-canvas — back- - # to-back staging pushes share refs/heads/staging, so the older push's - # queued run gets cancelled when a newer push lands. Auto-promote- - # staging then sees `completed/cancelled` for the older SHA and stays - # put; the newer SHA's gates may eventually save the day, but if the - # newer push gets cancelled too, we deadlock. - # - # See e2e-staging-canvas.yml's identical concurrency block for the full - # rationale and the 2026-04-28 incident reference. 
- group: e2e-api-${{ github.event.pull_request.head.sha || github.sha }} - cancel-in-progress: false - -jobs: - detect-changes: - runs-on: ubuntu-latest - outputs: - api: ${{ steps.decide.outputs.api }} - steps: - - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - - uses: dorny/paths-filter@fbd0ab8f3e69293af611ebaee6363fc25e6d187d # v4.0.1 - id: filter - with: - filters: | - api: - - 'workspace-server/**' - - 'tests/e2e/**' - - '.github/workflows/e2e-api.yml' - - id: decide - # Always run real work for manual dispatch — no diff context to - # filter against and ops dispatching this expects the suite to - # actually exercise the platform. - run: | - if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then - echo "api=true" >> "$GITHUB_OUTPUT" - else - echo "api=${{ steps.filter.outputs.api }}" >> "$GITHUB_OUTPUT" - fi - - # ONE job (no job-level `if:`) that always runs and reports under the - # required-check name `E2E API Smoke Test`. Real work is gated per-step - # on `needs.detect-changes.outputs.api`. Reason: GitHub registers a - # check run for every job that matches `name:`, and a job-level - # `if: false` produces a SKIPPED check run. Branch protection treats - # all check runs with a matching context name on the latest commit as a - # SET — any SKIPPED in the set fails the required-check eval, even with - # SUCCESS siblings. Verified 2026-04-29 on PR #2264 (staging→main): - # 4 check runs (2 SKIPPED + 2 SUCCESS) at the head SHA blocked - # promotion despite all real work succeeding. Collapsing to a single - # always-running job with conditional steps emits exactly one SUCCESS - # check run regardless of paths filter — branch-protection-clean. - e2e-api: - needs: detect-changes - name: E2E API Smoke Test - runs-on: docker-host - timeout-minutes: 15 - env: - # Unique per-run container names so concurrent runs on the host- - # network act_runner don't collide on name OR port. 
- # `${RUN_ID}-${RUN_ATTEMPT}` stays unique across reruns of the - # same run_id. PORT is set later (after docker port lookup) since - # we let Docker assign an ephemeral host port. - PG_CONTAINER: pg-e2e-api-${{ github.run_id }}-${{ github.run_attempt }} - REDIS_CONTAINER: redis-e2e-api-${{ github.run_id }}-${{ github.run_attempt }} - PORT: "8080" - steps: - - name: No-op pass (paths filter excluded this commit) - if: needs.detect-changes.outputs.api != 'true' - run: | - echo "No workspace-server / tests/e2e / workflow changes — E2E API gate satisfied without running tests." - echo "::notice::E2E API Smoke Test no-op pass (paths filter excluded this commit)." - - if: needs.detect-changes.outputs.api == 'true' - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - - if: needs.detect-changes.outputs.api == 'true' - uses: actions/setup-go@40f1582b2485089dde7abd97c1529aa768e1baff # v5 - with: - go-version: 'stable' - cache: true - cache-dependency-path: workspace-server/go.sum - - name: Pre-pull alpine + ensure provisioner network (Issue #94 items #2 + #3) - if: needs.detect-changes.outputs.api == 'true' - run: | - # Provisioner uses alpine:latest for ephemeral token-write - # containers (workspace-server/internal/handlers/container_files.go). - # Pre-pull so the first provision in test_api.sh doesn't race - # the daemon's pull cache. Idempotent — `docker pull` is a no-op - # when the image is already present. - docker pull alpine:latest >/dev/null - # Provisioner attaches workspace containers to - # molecule-core-net (workspace-server/internal/provisioner/ - # provisioner.go::DefaultNetwork). The bridge already exists on - # the operator host's docker daemon — `network create` is - # idempotent via `|| true`. - docker network create molecule-core-net >/dev/null 2>&1 || true - echo "alpine:latest pre-pulled; molecule-core-net ensured." 
- - name: Start Postgres (docker) - if: needs.detect-changes.outputs.api == 'true' - run: | - # Defensive cleanup — only matches THIS run's container name, - # so it cannot kill a sibling run's postgres. (Pre-fix the - # name was static and this rm hit other runs' containers.) - docker rm -f "$PG_CONTAINER" 2>/dev/null || true - # `-p 0:5432` requests an ephemeral host port; we read it back - # below and export DATABASE_URL. - docker run -d --name "$PG_CONTAINER" \ - -e POSTGRES_USER=dev -e POSTGRES_PASSWORD=dev -e POSTGRES_DB=molecule \ - -p 0:5432 postgres:16 >/dev/null - # Resolve the host-side port assignment. `docker port` prints - # `0.0.0.0:NNNN` (and on host-net runners may also print an - # IPv6 line — take the first IPv4 line). - PG_PORT=$(docker port "$PG_CONTAINER" 5432/tcp | awk -F: '/^0\.0\.0\.0:/ {print $2; exit}') - if [ -z "$PG_PORT" ]; then - # Fallback: any first line. Some Docker versions print only - # one line. - PG_PORT=$(docker port "$PG_CONTAINER" 5432/tcp | head -1 | awk -F: '{print $NF}') - fi - if [ -z "$PG_PORT" ]; then - echo "::error::Could not resolve host port for $PG_CONTAINER" - docker port "$PG_CONTAINER" 5432/tcp || true - docker logs "$PG_CONTAINER" || true - exit 1 - fi - # 127.0.0.1 (NOT localhost) — IPv6 first-resolve flake (#92). 
- echo "PG_PORT=${PG_PORT}" >> "$GITHUB_ENV" - echo "DATABASE_URL=postgres://dev:dev@127.0.0.1:${PG_PORT}/molecule?sslmode=disable" >> "$GITHUB_ENV" - echo "Postgres host port: ${PG_PORT}" - for i in $(seq 1 30); do - if docker exec "$PG_CONTAINER" pg_isready -U dev >/dev/null 2>&1; then - echo "Postgres ready after ${i}s" - exit 0 - fi - sleep 1 - done - echo "::error::Postgres did not become ready in 30s" - docker logs "$PG_CONTAINER" || true - exit 1 - - name: Start Redis (docker) - if: needs.detect-changes.outputs.api == 'true' - run: | - docker rm -f "$REDIS_CONTAINER" 2>/dev/null || true - docker run -d --name "$REDIS_CONTAINER" -p 0:6379 redis:7 >/dev/null - REDIS_PORT=$(docker port "$REDIS_CONTAINER" 6379/tcp | awk -F: '/^0\.0\.0\.0:/ {print $2; exit}') - if [ -z "$REDIS_PORT" ]; then - REDIS_PORT=$(docker port "$REDIS_CONTAINER" 6379/tcp | head -1 | awk -F: '{print $NF}') - fi - if [ -z "$REDIS_PORT" ]; then - echo "::error::Could not resolve host port for $REDIS_CONTAINER" - docker port "$REDIS_CONTAINER" 6379/tcp || true - docker logs "$REDIS_CONTAINER" || true - exit 1 - fi - echo "REDIS_PORT=${REDIS_PORT}" >> "$GITHUB_ENV" - echo "REDIS_URL=redis://127.0.0.1:${REDIS_PORT}" >> "$GITHUB_ENV" - echo "Redis host port: ${REDIS_PORT}" - for i in $(seq 1 15); do - if docker exec "$REDIS_CONTAINER" redis-cli ping 2>/dev/null | grep -q PONG; then - echo "Redis ready after ${i}s" - exit 0 - fi - sleep 1 - done - echo "::error::Redis did not become ready in 15s" - docker logs "$REDIS_CONTAINER" || true - exit 1 - - name: Build platform - if: needs.detect-changes.outputs.api == 'true' - working-directory: workspace-server - run: go build -o platform-server ./cmd/server - - name: Start platform (background) - if: needs.detect-changes.outputs.api == 'true' - working-directory: workspace-server - run: | - # DATABASE_URL + REDIS_URL exported by the start-postgres / - # start-redis steps point at this run's per-run host ports. 
- ./platform-server > platform.log 2>&1 & - echo $! > platform.pid - - name: Wait for /health - if: needs.detect-changes.outputs.api == 'true' - run: | - for i in $(seq 1 30); do - if curl -sf http://127.0.0.1:8080/health > /dev/null; then - echo "Platform up after ${i}s" - exit 0 - fi - sleep 1 - done - echo "::error::Platform did not become healthy in 30s" - cat workspace-server/platform.log || true - exit 1 - - name: Assert migrations applied - if: needs.detect-changes.outputs.api == 'true' - run: | - tables=$(docker exec "$PG_CONTAINER" psql -U dev -d molecule -tAc "SELECT count(*) FROM information_schema.tables WHERE table_schema='public' AND table_name='workspaces'") - if [ "$tables" != "1" ]; then - echo "::error::Migrations did not apply" - cat workspace-server/platform.log || true - exit 1 - fi - echo "Migrations OK" - - name: Run E2E API tests - if: needs.detect-changes.outputs.api == 'true' - run: bash tests/e2e/test_api.sh - - name: Run notify-with-attachments E2E - if: needs.detect-changes.outputs.api == 'true' - run: bash tests/e2e/test_notify_attachments_e2e.sh - - name: Run priority-runtimes E2E (claude-code + hermes — skips when keys absent) - if: needs.detect-changes.outputs.api == 'true' - run: bash tests/e2e/test_priority_runtimes_e2e.sh - - name: Run poll-mode + since_id cursor E2E (#2339) - if: needs.detect-changes.outputs.api == 'true' - run: bash tests/e2e/test_poll_mode_e2e.sh - - name: Run poll-mode chat upload E2E (RFC #2891) - if: needs.detect-changes.outputs.api == 'true' - run: bash tests/e2e/test_poll_mode_chat_upload_e2e.sh - - name: Dump platform log on failure - if: failure() && needs.detect-changes.outputs.api == 'true' - run: cat workspace-server/platform.log || true - - name: Stop platform - if: always() && needs.detect-changes.outputs.api == 'true' - run: | - if [ -f workspace-server/platform.pid ]; then - kill "$(cat workspace-server/platform.pid)" 2>/dev/null || true - fi - - name: Stop service containers - # always() so 
containers don't leak when test steps fail. The - # cleanup is best-effort: if the container is already gone - # (e.g. concurrent rerun race), don't fail the job. - if: always() && needs.detect-changes.outputs.api == 'true' - run: | - docker rm -f "$PG_CONTAINER" 2>/dev/null || true - docker rm -f "$REDIS_CONTAINER" 2>/dev/null || true diff --git a/.github/workflows/e2e-staging-canvas.yml b/.github/workflows/e2e-staging-canvas.yml deleted file mode 100644 index 924278e9..00000000 --- a/.github/workflows/e2e-staging-canvas.yml +++ /dev/null @@ -1,216 +0,0 @@ -name: E2E Staging Canvas (Playwright) - -# Playwright test suite that provisions a fresh staging org per run and -# verifies every workspace-panel tab renders without crashing. Complements -# e2e-staging-saas.yml (which tests the API shape) by exercising the -# actual browser + canvas bundle against live staging. -# -# Triggers: push to main/staging or PR touching canvas sources + this workflow, -# manual dispatch, and weekly cron to catch browser/runtime drift even -# when canvas is quiet. -# Added staging to push/pull_request branches so the auto-promote gate -# check (--event push --branch staging) can see a completed run for this -# workflow — mirrors what PR #1891 does for e2e-api.yml. - -on: - # Trigger model (revised 2026-04-29): - # - # Always fires on push/pull_request; real work is gated per-step on - # `needs.detect-changes.outputs.canvas`. When canvas/ paths haven't - # changed, the no-op step alone runs and emits SUCCESS for the - # `Canvas tabs E2E` check, satisfying branch protection without - # spending CI cycles. See e2e-api.yml for the rationale on why this - # is a single job rather than two-jobs-sharing-name. - push: - branches: [main] - pull_request: - branches: [main] - workflow_dispatch: - schedule: - # Weekly on Sunday 08:00 UTC — catches Chrome / Playwright / Next.js - # release-note-shaped regressions that don't ride in with a PR. 
- - cron: '0 8 * * 0' - -concurrency: - # Per-SHA grouping (changed 2026-04-28 from a single global group). The - # global group made auto-promote-staging brittle: when a staging push - # queued behind an in-flight run and a third entrant (a PR run, a - # follow-on push) entered the group, the staging push got cancelled — - # leaving auto-promote-staging looking at `completed/cancelled` for a - # required gate and refusing to advance main. Observed 2026-04-28 - # 23:51-23:53 on staging tip 3f99fede. - # - # The original intent of the global group was to throttle parallel - # E2E provisions (each spins a fresh EC2). At our scale that throttle - # isn't worth the correctness cost — fresh-org-per-run isolates the - # state, and the cost of two parallel runs (~$0.001/min × 10min × 2) - # is rounding error vs. the cost of a stuck pipeline. - # - # Per-SHA still dedupes accidental double-triggers for the SAME SHA. - # It does NOT cancel obsolete-PR-version runs on force-push; that - # wasted CI is acceptable given the alternative is losing staging-tip - # data that auto-promote-staging needs. - group: e2e-staging-canvas-${{ github.event.pull_request.head.sha || github.sha }} - cancel-in-progress: false - -jobs: - detect-changes: - runs-on: ubuntu-latest - outputs: - canvas: ${{ steps.decide.outputs.canvas }} - steps: - - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - - uses: dorny/paths-filter@fbd0ab8f3e69293af611ebaee6363fc25e6d187d # v4.0.1 - id: filter - with: - filters: | - canvas: - - 'canvas/**' - - '.github/workflows/e2e-staging-canvas.yml' - - id: decide - # Always run real tests for manual dispatch and the weekly cron — - # both exist precisely to exercise the suite, regardless of diff. 
- run: | - if [ "${{ github.event_name }}" = "workflow_dispatch" ] || [ "${{ github.event_name }}" = "schedule" ]; then - echo "canvas=true" >> "$GITHUB_OUTPUT" - else - echo "canvas=${{ steps.filter.outputs.canvas }}" >> "$GITHUB_OUTPUT" - fi - - # ONE job (no job-level `if:`) that always runs and reports under the - # required-check name `Canvas tabs E2E`. Real work is gated per-step on - # `needs.detect-changes.outputs.canvas`. See e2e-api.yml for the full - # rationale — same path-filter check-name parity issue blocked PR #2264 - # (staging→main) on 2026-04-29 because branch protection treats matching- - # name check runs as a SET, and any SKIPPED member fails the eval. - playwright: - needs: detect-changes - name: Canvas tabs E2E - runs-on: ubuntu-latest - timeout-minutes: 40 - - env: - CANVAS_E2E_STAGING: '1' - MOLECULE_CP_URL: https://staging-api.moleculesai.app - MOLECULE_ADMIN_TOKEN: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKEN }} - - defaults: - run: - working-directory: canvas - - steps: - - name: No-op pass (paths filter excluded this commit) - if: needs.detect-changes.outputs.canvas != 'true' - working-directory: . - run: | - echo "No canvas / workflow changes — E2E Staging Canvas gate satisfied without running tests." - echo "::notice::E2E Staging Canvas no-op pass (paths filter excluded this commit)." 
- - - if: needs.detect-changes.outputs.canvas == 'true' - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - - - name: Verify admin token present - if: needs.detect-changes.outputs.canvas == 'true' - run: | - if [ -z "$MOLECULE_ADMIN_TOKEN" ]; then - echo "::error::Missing MOLECULE_STAGING_ADMIN_TOKEN" - exit 2 - fi - - - name: Set up Node - if: needs.detect-changes.outputs.canvas == 'true' - uses: actions/setup-node@48b55a011bda9f5d6aeb4c2d9c7362e8dae4041e # v6.4.0 - with: - node-version: '20' - cache: 'npm' - cache-dependency-path: canvas/package-lock.json - - - name: Install canvas deps - if: needs.detect-changes.outputs.canvas == 'true' - run: npm ci - - - name: Install Playwright browsers - if: needs.detect-changes.outputs.canvas == 'true' - timeout-minutes: 10 - run: npx playwright install --with-deps chromium - - - name: Run staging canvas E2E - if: needs.detect-changes.outputs.canvas == 'true' - run: npx playwright test --config=playwright.staging.config.ts - - - name: Upload Playwright report on failure - if: failure() && needs.detect-changes.outputs.canvas == 'true' - # Pinned to v3 for Gitea act_runner v0.6 compatibility — v4+ uses - # the GHES 3.10+ artifact protocol that Gitea 1.22.x does NOT - # implement (see ci.yml upload step for the canonical error - # cite). Drop this pin when Gitea ships the v4 protocol. - uses: actions/upload-artifact@c6a366c94c3e0affe28c06c8df20a878f24da3cf # v3.2.2 - with: - name: playwright-report-staging - path: canvas/playwright-report-staging/ - retention-days: 14 - - - name: Upload screenshots on failure - if: failure() && needs.detect-changes.outputs.canvas == 'true' - # Pinned to v3 for Gitea act_runner v0.6 compatibility (see above). 
- uses: actions/upload-artifact@c6a366c94c3e0affe28c06c8df20a878f24da3cf # v3.2.2 - with: - name: playwright-screenshots - path: canvas/test-results/ - retention-days: 14 - - # Safety-net teardown — fires only when Playwright's globalTeardown - # didn't (worker crash, runner cancel). Reads the slug from - # canvas/.playwright-staging-state.json (written by staging-setup - # as its first action, before any CP call) and deletes only that - # slug. - # - # Earlier versions of this step pattern-swept `e2e-canvas--*` - # orgs to compensate for setup-crash-before-state-file-write. That - # over-aggressive cleanup raced concurrent canvas-E2E runs and - # poisoned each other's tenants — observed 2026-04-30 when three - # real-test runs killed each other mid-test, surfacing as - # `getaddrinfo ENOTFOUND` once CP had cleaned up the just-deleted - # DNS record. Pattern-sweep removed; setup now writes the state - # file before any CP work, so the slug is always recoverable. - - name: Teardown safety net - if: always() && needs.detect-changes.outputs.canvas == 'true' - env: - ADMIN_TOKEN: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKEN }} - run: | - set +e - STATE_FILE=".playwright-staging-state.json" - if [ ! -f "$STATE_FILE" ]; then - echo "::notice::No state file at canvas/$STATE_FILE — Playwright globalTeardown handled it (or setup never ran)." - exit 0 - fi - slug=$(python3 -c "import json; print(json.load(open('$STATE_FILE')).get('slug',''))") - if [ -z "$slug" ]; then - echo "::warning::State file present but slug missing; nothing to clean up." - exit 0 - fi - echo "Deleting orphan tenant: $slug" - # Verify HTTP 2xx instead of `>/dev/null || true` swallowing - # failures. A 5xx or timeout previously looked identical to - # success, leaving the tenant alive for up to ~45 min until - # sweep-stale-e2e-orgs caught it. Surface failures as - # workflow warnings naming the slug. 
Don't `exit 1` — a single - # cleanup miss shouldn't fail-flag the canvas test when the - # actual smoke check passed; the sweeper is the safety net. - # See molecule-controlplane#420. - # Tempfile-routed -w + set +e/-e prevents curl-exit-code - # pollution of the captured status (lint-curl-status-capture.yml). - set +e - curl -sS -o /tmp/canvas-cleanup.out -w "%{http_code}" \ - -X DELETE "$MOLECULE_CP_URL/cp/admin/tenants/$slug" \ - -H "Authorization: Bearer $ADMIN_TOKEN" \ - -H "Content-Type: application/json" \ - -d "{\"confirm\":\"$slug\"}" >/tmp/canvas-cleanup.code - set -e - code=$(cat /tmp/canvas-cleanup.code 2>/dev/null || echo "000") - if [ "$code" = "200" ] || [ "$code" = "204" ]; then - echo "[teardown] deleted $slug (HTTP $code)" - else - echo "::warning::canvas teardown for $slug returned HTTP $code — sweep-stale-e2e-orgs will catch it within ~45 min. Body: $(head -c 300 /tmp/canvas-cleanup.out 2>/dev/null)" - fi - exit 0 diff --git a/.github/workflows/e2e-staging-external.yml b/.github/workflows/e2e-staging-external.yml deleted file mode 100644 index 5b8d4a9c..00000000 --- a/.github/workflows/e2e-staging-external.yml +++ /dev/null @@ -1,184 +0,0 @@ -name: E2E Staging External Runtime - -# Regression for the four/five workspaces.status=awaiting_agent transitions -# that silently failed in production for five days before migration 046 -# extended the workspace_status enum (see -# workspace-server/migrations/046_workspace_status_awaiting_agent.up.sql). -# -# Why this is its own workflow (not folded into e2e-staging-saas.yml): -# - The full-saas harness defaults to runtime=hermes, never exercises -# external-runtime. Adding an `external` parameter to that script -# would force every push to staging through both lifecycles in -# series, doubling the EC2 cold-start budget. -# - The external lifecycle has unique timing (REMOTE_LIVENESS_STALE_AFTER -# window, 90s default + sweep interval), which we wait through -# deliberately. 
Folding it into hermes would make the long path -# even longer. -# - It can run in parallel with the hermes E2E since both create -# fresh tenant orgs with distinct slug prefixes (`e2e-ext-...` vs -# `e2e-...`). -# -# Triggers: -# - Push to staging when any source affecting external runtime, -# hibernation, or the migration set changes. -# - PR review for the same set. -# - Manual workflow_dispatch. -# - Daily cron at 07:30 UTC (catches drift on quiet days; staggered -# 30 min after e2e-staging-saas.yml's 07:00 UTC cron). -# -# Concurrency: serialized so two staging pushes don't fight for the -# same EC2 quota window. cancel-in-progress=false so a half-rolled -# tenant always finishes its teardown. - -on: - push: - branches: [main] - paths: - - 'workspace-server/internal/handlers/workspace.go' - - 'workspace-server/internal/handlers/registry.go' - - 'workspace-server/internal/handlers/workspace_restart.go' - - 'workspace-server/internal/registry/healthsweep.go' - - 'workspace-server/internal/registry/liveness.go' - - 'workspace-server/migrations/**' - - 'workspace-server/internal/db/workspace_status_enum_drift_test.go' - - 'tests/e2e/test_staging_external_runtime.sh' - - '.github/workflows/e2e-staging-external.yml' - pull_request: - branches: [main] - paths: - - 'workspace-server/internal/handlers/workspace.go' - - 'workspace-server/internal/handlers/registry.go' - - 'workspace-server/internal/handlers/workspace_restart.go' - - 'workspace-server/internal/registry/healthsweep.go' - - 'workspace-server/internal/registry/liveness.go' - - 'workspace-server/migrations/**' - - 'workspace-server/internal/db/workspace_status_enum_drift_test.go' - - 'tests/e2e/test_staging_external_runtime.sh' - - '.github/workflows/e2e-staging-external.yml' - workflow_dispatch: - inputs: - keep_org: - description: "Skip teardown for debugging (only via manual dispatch)" - required: false - type: boolean - default: false - stale_wait_secs: - description: "Seconds to wait for the 
heartbeat-staleness sweep (default 180 = 90s window + 90s buffer)" - required: false - default: "180" - schedule: - - cron: '30 7 * * *' - -concurrency: - group: e2e-staging-external - cancel-in-progress: false - -permissions: - contents: read - -jobs: - e2e-staging-external: - name: E2E Staging External Runtime - runs-on: ubuntu-latest - timeout-minutes: 25 - - env: - MOLECULE_CP_URL: https://staging-api.moleculesai.app - MOLECULE_ADMIN_TOKEN: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKEN }} - E2E_RUN_ID: "${{ github.run_id }}-${{ github.run_attempt }}" - E2E_KEEP_ORG: ${{ github.event.inputs.keep_org && '1' || '0' }} - E2E_STALE_WAIT_SECS: ${{ github.event.inputs.stale_wait_secs || '180' }} - - steps: - - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - - - name: Verify admin token present - run: | - if [ -z "$MOLECULE_ADMIN_TOKEN" ]; then - # Schedule + push triggers must hard-fail when the token is - # missing — silent skip would mask infra rot. Manual dispatch - # gets the same hard-fail; an operator running this on a fork - # without secrets configured needs to know up-front. - echo "::error::MOLECULE_STAGING_ADMIN_TOKEN secret not set (Railway staging CP_ADMIN_API_TOKEN)" - exit 2 - fi - echo "Admin token present ✓" - - - name: CP staging health preflight - run: | - code=$(curl -sS -o /dev/null -w "%{http_code}" --max-time 10 "$MOLECULE_CP_URL/health") - if [ "$code" != "200" ]; then - echo "::error::Staging CP unhealthy (got HTTP $code). Skipping — not a workspace bug." - exit 1 - fi - echo "Staging CP healthy ✓" - - - name: Run external-runtime E2E - id: e2e - run: bash tests/e2e/test_staging_external_runtime.sh - - # Mirror the e2e-staging-saas.yml safety net: if the runner is - # cancelled (e.g. concurrent staging push), the test script's - # EXIT trap may not fire, so we sweep e2e-ext-* slugs scoped to - # *this* run id. 
- - name: Teardown safety net (runs on cancel/failure) - if: always() - env: - ADMIN_TOKEN: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKEN }} - run: | - set +e - orgs=$(curl -sS "$MOLECULE_CP_URL/cp/admin/orgs" \ - -H "Authorization: Bearer $ADMIN_TOKEN" 2>/dev/null \ - | python3 -c " - import json, sys, os, datetime - run_id = os.environ.get('GITHUB_RUN_ID', '') - d = json.load(sys.stdin) - # Scope STRICTLY to this run id (e2e-ext-YYYYMMDD--...) - # so concurrent runs and unrelated dev probes are not touched. - # Sweep today AND yesterday so a midnight-crossing run still - # cleans up its own slug. - today = datetime.date.today() - yesterday = today - datetime.timedelta(days=1) - dates = (today.strftime('%Y%m%d'), yesterday.strftime('%Y%m%d')) - if not run_id: - # Without a run id we cannot scope safely; bail rather - # than risk deleting unrelated tenants. - sys.exit(0) - prefixes = tuple(f'e2e-ext-{d}-{run_id}-' for d in dates) - for o in d.get('orgs', []): - s = o.get('slug', '') - if s.startswith(prefixes) and o.get('status') != 'purged': - print(s) - " 2>/dev/null) - if [ -n "$orgs" ]; then - echo "Safety-net sweep: deleting leftover orgs:" - echo "$orgs" - # Per-slug verified DELETE — see molecule-controlplane#420. - # `>/dev/null 2>&1` previously hid every failure; surface - # non-2xx as workflow warnings so the run page names what - # leaked. Sweeper catches the rest within ~45 min. - leaks=() - for slug in $orgs; do - # Tempfile-routed -w + set +e/-e prevents curl-exit-code - # pollution of the captured status (lint-curl-status-capture.yml). 
- set +e - curl -sS -o /tmp/external-cleanup.out -w "%{http_code}" \ - -X DELETE "$MOLECULE_CP_URL/cp/admin/tenants/$slug" \ - -H "Authorization: Bearer $ADMIN_TOKEN" \ - -H "Content-Type: application/json" \ - -d "{\"confirm\":\"$slug\"}" >/tmp/external-cleanup.code - set -e - code=$(cat /tmp/external-cleanup.code 2>/dev/null || echo "000") - if [ "$code" = "200" ] || [ "$code" = "204" ]; then - echo "[teardown] deleted $slug (HTTP $code)" - else - echo "::warning::external teardown for $slug returned HTTP $code — sweep-stale-e2e-orgs will catch it within ~45 min. Body: $(head -c 300 /tmp/external-cleanup.out 2>/dev/null)" - leaks+=("$slug") - fi - done - if [ ${#leaks[@]} -gt 0 ]; then - echo "::warning::external teardown left ${#leaks[@]} leak(s): ${leaks[*]}" - fi - else - echo "Safety-net sweep: no leftover orgs to clean." - fi diff --git a/.github/workflows/e2e-staging-saas.yml b/.github/workflows/e2e-staging-saas.yml deleted file mode 100644 index 43e81aba..00000000 --- a/.github/workflows/e2e-staging-saas.yml +++ /dev/null @@ -1,246 +0,0 @@ -name: E2E Staging SaaS (full lifecycle) - -# Dedicated workflow that provisions a fresh staging org per run, exercises -# the full workspace lifecycle (register → heartbeat → A2A → delegation → -# HMA memory → activity → peers), then tears down and asserts leak-free. -# -# Why a separate workflow (not folded into ci.yml): -# - The run takes ~25-35 min (EC2 boot + cloudflared DNS + provision sweeps + -# agent bootstrap), way too slow for every PR. -# - Needs its own concurrency group so two pushes don't fight over the -# same staging org slug prefix. -# - Has its own required secrets (session cookie, admin token) that most -# PRs don't need to read. 
-# -# Triggers: -# - Push to main (regression guard) -# - workflow_dispatch (manual re-run from UI) -# - Nightly cron (catches drift even when no pushes land) -# - Changes to any provisioning-critical file under PR review (opt-in -# via the same paths watcher that e2e-api.yml uses) - -on: - # Trunk-based (Phase 3 of internal#81): main is the only branch. - # Previously this fired on staging push too because staging was a - # superset of main and ran the gate ahead of auto-promote; with no - # staging branch, main is where E2E gates the deploy. - push: - branches: [main] - paths: - - 'workspace-server/internal/handlers/registry.go' - - 'workspace-server/internal/handlers/workspace_provision.go' - - 'workspace-server/internal/handlers/a2a_proxy.go' - - 'workspace-server/internal/middleware/**' - - 'workspace-server/internal/provisioner/**' - - 'tests/e2e/test_staging_full_saas.sh' - - '.github/workflows/e2e-staging-saas.yml' - pull_request: - branches: [main] - paths: - - 'workspace-server/internal/handlers/registry.go' - - 'workspace-server/internal/handlers/workspace_provision.go' - - 'workspace-server/internal/handlers/a2a_proxy.go' - - 'workspace-server/internal/middleware/**' - - 'workspace-server/internal/provisioner/**' - - 'tests/e2e/test_staging_full_saas.sh' - - '.github/workflows/e2e-staging-saas.yml' - workflow_dispatch: - inputs: - runtime: - description: "Runtime to test (claude-code [default, MiniMax] | hermes [OpenAI] | langgraph [OpenAI])" - required: false - default: "claude-code" - keep_org: - description: "Skip teardown for debugging (only use via manual dispatch!)" - required: false - type: boolean - default: false - schedule: - # 07:00 UTC every day — catches AMI drift, WorkOS cert rotation, - # Cloudflare API regressions, etc. even on quiet days. - - cron: '0 7 * * *' - -# Serialize: staging has a finite per-hour org creation quota. Two pushes -# landing in quick succession should queue, not race. 
`cancel-in-progress: -# false` mirrors e2e-api.yml — GitHub would otherwise cancel the running -# teardown step and leave orphan EC2s. -concurrency: - group: e2e-staging-saas - cancel-in-progress: false - -jobs: - e2e-staging-saas: - name: E2E Staging SaaS - runs-on: ubuntu-latest - timeout-minutes: 45 - permissions: - contents: read - - env: - MOLECULE_CP_URL: https://staging-api.moleculesai.app - # Single admin-bearer secret drives provision + tenant-token - # retrieval + teardown. Configure in - # Settings → Secrets and variables → Actions → Repository secrets. - MOLECULE_ADMIN_TOKEN: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKEN }} - # MiniMax is the PRIMARY LLM auth path post-2026-05-04. Switched - # from hermes+OpenAI default after #2578 (the staging OpenAI key - # account went over quota and stayed dead for 36+ hours, taking - # the full-lifecycle E2E red on every provisioning-critical push). - # claude-code template's `minimax` provider routes - # ANTHROPIC_BASE_URL to api.minimax.io/anthropic and reads - # MINIMAX_API_KEY at boot — separate billing account so an - # OpenAI quota collapse no longer wedges the gate. Mirrors the - # canary-staging.yml + continuous-synth-e2e.yml migrations. - E2E_MINIMAX_API_KEY: ${{ secrets.MOLECULE_STAGING_MINIMAX_API_KEY }} - # Direct-Anthropic alternative for operators who don't want to - # set up a MiniMax account (priority below MiniMax — first - # non-empty wins in test_staging_full_saas.sh's secrets-injection - # block). See #2578 PR comment for the rationale. - E2E_ANTHROPIC_API_KEY: ${{ secrets.MOLECULE_STAGING_ANTHROPIC_API_KEY }} - # OpenAI fallback — kept wired so an operator-dispatched run with - # E2E_RUNTIME=hermes or =langgraph via workflow_dispatch can still - # exercise the OpenAI path. 
- E2E_OPENAI_API_KEY: ${{ secrets.MOLECULE_STAGING_OPENAI_KEY }} - E2E_RUNTIME: ${{ github.event.inputs.runtime || 'claude-code' }} - # Pin the model when running on the default claude-code path — - # the per-runtime default ("sonnet") routes to direct Anthropic - # and defeats the cost saving. Operators can override via the - # workflow_dispatch flow (no input wired here yet — runtime - # override is enough for ad-hoc). - E2E_MODEL_SLUG: ${{ github.event.inputs.runtime == 'hermes' && 'openai/gpt-4o' || github.event.inputs.runtime == 'langgraph' && 'openai:gpt-4o' || 'MiniMax-M2.7-highspeed' }} - E2E_RUN_ID: "${{ github.run_id }}-${{ github.run_attempt }}" - E2E_KEEP_ORG: ${{ github.event.inputs.keep_org && '1' || '0' }} - - steps: - - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - - - name: Verify admin token present - run: | - if [ -z "$MOLECULE_ADMIN_TOKEN" ]; then - echo "::error::MOLECULE_STAGING_ADMIN_TOKEN secret not set (Railway staging CP_ADMIN_API_TOKEN)" - exit 2 - fi - echo "Admin token present ✓" - - - name: Verify LLM key present - run: | - # Per-runtime key check — claude-code uses MiniMax; hermes / - # langgraph (operator-dispatched only) use OpenAI. Hard-fail - # rather than soft-skip per #2578's lesson — empty key - # silently falls through to the wrong SECRETS_JSON branch and - # produces a confusing auth error 5 min later instead of the - # clean "secret missing" message at the top. - case "${E2E_RUNTIME}" in - claude-code) - # Either MiniMax OR direct-Anthropic works — first - # non-empty wins in the test script's secrets-injection - # priority chain. 
- if [ -n "${E2E_MINIMAX_API_KEY:-}" ]; then - required_secret_name="MOLECULE_STAGING_MINIMAX_API_KEY" - required_secret_value="${E2E_MINIMAX_API_KEY}" - elif [ -n "${E2E_ANTHROPIC_API_KEY:-}" ]; then - required_secret_name="MOLECULE_STAGING_ANTHROPIC_API_KEY" - required_secret_value="${E2E_ANTHROPIC_API_KEY}" - else - required_secret_name="MOLECULE_STAGING_MINIMAX_API_KEY or MOLECULE_STAGING_ANTHROPIC_API_KEY" - required_secret_value="" - fi - ;; - langgraph|hermes) - required_secret_name="MOLECULE_STAGING_OPENAI_KEY" - required_secret_value="${E2E_OPENAI_API_KEY:-}" - ;; - *) - echo "::warning::Unknown E2E_RUNTIME='${E2E_RUNTIME}' — skipping LLM-key check" - required_secret_name="" - required_secret_value="present" - ;; - esac - if [ -n "$required_secret_name" ] && [ -z "$required_secret_value" ]; then - echo "::error::${required_secret_name} secret not set for runtime=${E2E_RUNTIME} — workspaces will fail at boot with 'No provider API key found'" - exit 2 - fi - echo "LLM key present ✓ (runtime=${E2E_RUNTIME}, key=${required_secret_name}, len=${#required_secret_value})" - - - name: CP staging health preflight - run: | - code=$(curl -sS -o /dev/null -w "%{http_code}" --max-time 10 "$MOLECULE_CP_URL/health") - if [ "$code" != "200" ]; then - echo "::error::Staging CP unhealthy (got HTTP $code). Skipping — not a workspace bug." - exit 1 - fi - echo "Staging CP healthy ✓" - - - name: Run full-lifecycle E2E - id: e2e - run: bash tests/e2e/test_staging_full_saas.sh - - # Belt-and-braces teardown: the test script itself installs a trap - # for EXIT/INT/TERM, but if the GH runner itself is cancelled (e.g. - # someone pushes a new commit and workflow concurrency is set to - # cancel), the trap may not fire. This `always()` step runs even on - # cancellation and attempts the delete a second time. The admin - # DELETE endpoint is idempotent so double-invoking is safe. 
- - name: Teardown safety net (runs on cancel/failure) - if: always() - env: - ADMIN_TOKEN: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKEN }} - run: | - # Best-effort: find any e2e-YYYYMMDD-* orgs matching this run and - # nuke them. Catches the case where the script died before - # exporting its slug. - set +e - orgs=$(curl -sS "$MOLECULE_CP_URL/cp/admin/orgs" \ - -H "Authorization: Bearer $ADMIN_TOKEN" 2>/dev/null \ - | python3 -c " - import json, sys, os, datetime - run_id = os.environ.get('GITHUB_RUN_ID', '') - d = json.load(sys.stdin) - # ONLY sweep slugs from *this* CI run. Previously the filter was - # f'e2e-{today}-' which stomped on parallel CI runs AND any manual - # E2E probes a dev was running against staging (incident 2026-04-21 - # 15:02Z: this workflow's safety net deleted an unrelated manual - # run's tenant 1s after it hit 'running'). - # Sweep both today AND yesterday's UTC dates so a run that crosses - # midnight still matches its own slug — see the 2026-04-26→27 - # canvas-safety-net incident for the same bug class. - today = datetime.date.today() - yesterday = today - datetime.timedelta(days=1) - dates = (today.strftime('%Y%m%d'), yesterday.strftime('%Y%m%d')) - if run_id: - prefixes = tuple(f'e2e-{d}-{run_id}-' for d in dates) - else: - prefixes = tuple(f'e2e-{d}-' for d in dates) - candidates = [o['slug'] for o in d.get('orgs', []) - if any(o.get('slug','').startswith(p) for p in prefixes) - and o.get('instance_status') not in ('purged',)] - print('\n'.join(candidates)) - " 2>/dev/null) - # Per-slug verified DELETE (was `>/dev/null || true` — see - # molecule-controlplane#420). Surface non-2xx as a workflow - # warning naming the leaked slug; don't exit 1 (sweeper is - # the safety net within ~45 min). - leaks=() - for slug in $orgs; do - echo "Safety-net teardown: $slug" - # Tempfile-routed -w + set +e/-e prevents curl-exit-code - # pollution of the captured status (lint-curl-status-capture.yml). 
- set +e - curl -sS -o /tmp/saas-cleanup.out -w "%{http_code}" \ - -X DELETE "$MOLECULE_CP_URL/cp/admin/tenants/$slug" \ - -H "Authorization: Bearer $ADMIN_TOKEN" \ - -H "Content-Type: application/json" \ - -d "{\"confirm\":\"$slug\"}" >/tmp/saas-cleanup.code - set -e - code=$(cat /tmp/saas-cleanup.code 2>/dev/null || echo "000") - if [ "$code" = "200" ] || [ "$code" = "204" ]; then - echo "[teardown] deleted $slug (HTTP $code)" - else - echo "::warning::saas teardown for $slug returned HTTP $code — sweep-stale-e2e-orgs will catch it within ~45 min. Body: $(head -c 300 /tmp/saas-cleanup.out 2>/dev/null)" - leaks+=("$slug") - fi - done - if [ ${#leaks[@]} -gt 0 ]; then - echo "::warning::saas teardown left ${#leaks[@]} leak(s): ${leaks[*]}" - fi - exit 0 diff --git a/.github/workflows/e2e-staging-sanity.yml b/.github/workflows/e2e-staging-sanity.yml deleted file mode 100644 index bedf4ed5..00000000 --- a/.github/workflows/e2e-staging-sanity.yml +++ /dev/null @@ -1,171 +0,0 @@ -name: E2E Staging Sanity (leak-detection self-check) - -# Periodic assertion that the teardown safety nets in e2e-staging-saas -# and canary-staging actually work. Runs the E2E harness with -# E2E_INTENTIONAL_FAILURE=1, which poisons the tenant admin token after -# the org is provisioned. The workspace-provision step then fails, the -# script exits non-zero, and the EXIT trap + workflow always()-step -# must still tear down cleanly. -# -# A green run means: -# - The script exited non-zero (intentional failure caught) -# - The trap fired teardown -# - The leak-detection poll found zero orphan orgs -# -# A red run means the teardown path itself is broken — act on this the -# same way you'd act on a canary failure (the whole E2E safety net is -# compromised until it's fixed). -# -# Cadence: once a week, Monday 06:00 UTC. Drift-slow, not per-PR — the -# teardown path rarely changes, and a weekly heartbeat is enough to -# catch silent regressions in cleanup code paths. 
- -on: - schedule: - - cron: '0 6 * * 1' - workflow_dispatch: - -concurrency: - # Shares the group with canary + full so they don't collide on - # staging org-create quota. - group: e2e-staging-sanity - cancel-in-progress: false - -permissions: - issues: write - contents: read - -jobs: - sanity: - name: Intentional-failure teardown sanity - runs-on: ubuntu-latest - timeout-minutes: 20 - - env: - MOLECULE_CP_URL: https://staging-api.moleculesai.app - MOLECULE_ADMIN_TOKEN: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKEN }} - E2E_MODE: canary # lean lifecycle; we only need the org to exist - E2E_RUNTIME: hermes - E2E_RUN_ID: "sanity-${{ github.run_id }}" - E2E_INTENTIONAL_FAILURE: "1" - - steps: - - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - - - name: Verify admin token present - run: | - if [ -z "$MOLECULE_ADMIN_TOKEN" ]; then - echo "::error::MOLECULE_STAGING_ADMIN_TOKEN not set" - exit 2 - fi - - # Inverted assertion: the run MUST fail. If it passes, the - # E2E_INTENTIONAL_FAILURE path is broken (token not being - # poisoned correctly, or the harness silently recovered). - - name: Run harness — expecting exit !=0 - id: harness - run: | - set +e - bash tests/e2e/test_staging_full_saas.sh - rc=$? - echo "harness_rc=$rc" >> "$GITHUB_OUTPUT" - # The only acceptable outcomes: - # 1 — harness failed mid-run, teardown ran, leak-check passed - # (exit 4 means teardown left a leak — that's the real bug - # this sanity check exists to catch) - if [ "$rc" = "1" ]; then - echo "✓ Harness failed as expected (rc=1); teardown trap ran, leak-check passed" - exit 0 - elif [ "$rc" = "0" ]; then - echo "::error::Harness succeeded under E2E_INTENTIONAL_FAILURE=1 — the poisoning path is broken" - exit 1 - elif [ "$rc" = "4" ]; then - echo "::error::LEAK DETECTED (rc=4) — teardown failed to clean up the org. Safety net broken." - exit 4 - else - echo "::error::Unexpected rc=$rc — neither clean-failure nor leak. Investigate harness." 
- exit 1 - fi - - - name: Open issue if safety net is broken - if: failure() - uses: actions/github-script@3a2844b7e9c422d3c10d287c895573f7108da1b3 # v9.0.0 - with: - script: | - const title = "🚨 E2E teardown safety net broken"; - const runURL = `https://github.com/${context.repo.owner}/${context.repo.repo}/actions/runs/${context.runId}`; - const body = - `The weekly sanity run (E2E_INTENTIONAL_FAILURE=1) did not exit ` + - `as expected. This means one of:\n` + - ` - poisoning didn't actually cause failure (test harness regression), OR\n` + - ` - teardown left an orphan org (leak detection caught a real bug)\n\n` + - `Run: ${runURL}\n\n` + - `This is higher priority than a canary failure — the whole ` + - `E2E safety net can't be trusted until this is resolved.`; - - const { data: existing } = await github.rest.issues.listForRepo({ - owner: context.repo.owner, repo: context.repo.repo, - state: 'open', labels: 'e2e-safety-net', - }); - const match = existing.find(i => i.title === title); - if (match) { - await github.rest.issues.createComment({ - owner: context.repo.owner, repo: context.repo.repo, - issue_number: match.number, - body: `Still broken. ${runURL}`, - }); - } else { - await github.rest.issues.create({ - owner: context.repo.owner, repo: context.repo.repo, - title, body, - labels: ['e2e-safety-net', 'bug', 'priority-high'], - }); - } - - # Belt-and-braces: if teardown left anything behind, nuke it here - # so we don't bleed staging quota. Different label from the - # always()-steps in the other workflows so sanity-only orgs get - # cleaned up by sanity runs. 
- - name: Teardown safety net - if: always() - env: - ADMIN_TOKEN: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKEN }} - run: | - set +e - orgs=$(curl -sS "$MOLECULE_CP_URL/cp/admin/orgs" \ - -H "Authorization: Bearer $ADMIN_TOKEN" 2>/dev/null \ - | python3 -c " - import json, sys - d = json.load(sys.stdin) - today = __import__('datetime').date.today().strftime('%Y%m%d') - candidates = [o['slug'] for o in d.get('orgs', []) - if o.get('slug','').startswith(f'e2e-canary-{today}-sanity-') - and o.get('status') not in ('purged',)] - print('\n'.join(candidates)) - " 2>/dev/null) - # Per-slug verified DELETE — see molecule-controlplane#420. - # Failures surface as workflow warnings; the sweeper is the - # safety net within ~45 min. - leaks=() - for slug in $orgs; do - # Tempfile-routed -w + set +e/-e prevents curl-exit-code - # pollution of the captured status (lint-curl-status-capture.yml). - set +e - curl -sS -o /tmp/sanity-cleanup.out -w "%{http_code}" \ - -X DELETE "$MOLECULE_CP_URL/cp/admin/tenants/$slug" \ - -H "Authorization: Bearer $ADMIN_TOKEN" \ - -H "Content-Type: application/json" \ - -d "{\"confirm\":\"$slug\"}" >/tmp/sanity-cleanup.code - set -e - code=$(cat /tmp/sanity-cleanup.code 2>/dev/null || echo "000") - if [ "$code" = "200" ] || [ "$code" = "204" ]; then - echo "[teardown] deleted $slug (HTTP $code)" - else - echo "::warning::sanity teardown for $slug returned HTTP $code — sweep-stale-e2e-orgs will catch it within ~45 min. 
Body: $(head -c 300 /tmp/sanity-cleanup.out 2>/dev/null)" - leaks+=("$slug") - fi - done - if [ ${#leaks[@]} -gt 0 ]; then - echo "::warning::sanity teardown left ${#leaks[@]} leak(s): ${leaks[*]}" - fi - exit 0 diff --git a/.github/workflows/handlers-postgres-integration.yml b/.github/workflows/handlers-postgres-integration.yml deleted file mode 100644 index b0879908..00000000 --- a/.github/workflows/handlers-postgres-integration.yml +++ /dev/null @@ -1,251 +0,0 @@ -name: Handlers Postgres Integration - -# Real-Postgres integration tests for workspace-server/internal/handlers/. -# Triggered on every PR/push that touches the handlers package. -# -# Why this workflow exists -# ------------------------ -# Strict-sqlmock unit tests pin which SQL statements fire — they're fast -# and let us iterate without a DB. But sqlmock CANNOT detect bugs that -# depend on the row state AFTER the SQL runs. The result_preview-lost -# bug shipped to staging in PR #2854 because every unit test was -# satisfied with "an UPDATE statement fired" — none verified the row's -# preview field actually landed. The local-postgres E2E that retrofit -# self-review caught it took 2 minutes to set up and would have caught -# the bug at PR-time. -# -# Why this workflow does NOT use `services: postgres:` (Class B fix) -# ------------------------------------------------------------------ -# Our act_runner config has `container.network: host` (operator host -# /opt/molecule/runners/config.yaml), which act_runner applies to BOTH -# the job container AND every service container. With host-net, two -# concurrent runs of this workflow both try to bind 0.0.0.0:5432 — the -# second postgres FATALs with `could not create any TCP/IP sockets: -# Address in use`, and Docker auto-removes it (act_runner sets -# AutoRemove:true on service containers). 
By the time the migrations -# step runs `psql`, the postgres container is gone, hence -# `Connection refused` then `failed to remove container: No such -# container` at cleanup time. -# -# Per-job `container.network` override is silently ignored by -# act_runner — `--network and --net in the options will be ignored.` -# appears in the runner log. Documented constraint. -# -# So we sidestep `services:` entirely. The job container still uses -# host-net (inherited from runner config; required for cache server -# discovery on the bridge IP 172.18.0.17:42631). We launch a sibling -# postgres on the existing `molecule-core-net` bridge with a -# UNIQUE name per run — `pg-handlers-${RUN_ID}-${RUN_ATTEMPT}` — and -# read its bridge IP via `docker inspect`. A host-net job container -# can reach a bridge-net container directly via the bridge IP (verified -# manually on operator host 2026-05-08). -# -# Trade-offs vs. the original `services:` shape: -# + No host-port collision; N parallel runs share the bridge cleanly -# + `if: always()` cleanup runs even on test-step failure -# - One more step in the workflow (+~3 lines) -# - Requires `molecule-core-net` to exist on the operator host -# (it does; declared in docker-compose.yml + docker-compose.infra.yml) -# -# Class B Hongming-owned CICD red sweep, 2026-05-08. -# -# Cost: ~30s job (postgres pull from cache + go build + 4 tests). 
- -on: - push: - branches: [main, staging] - pull_request: - branches: [main, staging] - merge_group: - types: [checks_requested] - workflow_dispatch: - -concurrency: - group: handlers-pg-integ-${{ github.event.pull_request.head.sha || github.sha }} - cancel-in-progress: false - -jobs: - detect-changes: - name: detect-changes - runs-on: ubuntu-latest - outputs: - handlers: ${{ steps.filter.outputs.handlers }} - steps: - - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - - uses: dorny/paths-filter@fbd0ab8f3e69293af611ebaee6363fc25e6d187d # v4.0.1 - id: filter - with: - filters: | - handlers: - - 'workspace-server/internal/handlers/**' - - 'workspace-server/internal/wsauth/**' - - 'workspace-server/migrations/**' - - '.github/workflows/handlers-postgres-integration.yml' - - # Single-job-with-per-step-if pattern: always runs to satisfy the - # required-check name on branch protection; real work gates on the - # paths filter. See ci.yml's Platform (Go) for the same shape. - integration: - name: Handlers Postgres Integration - needs: detect-changes - runs-on: docker-host - env: - # Unique name per run so concurrent jobs don't collide on the - # bridge network. ${RUN_ID}-${RUN_ATTEMPT} is unique even across - # workflow_dispatch reruns of the same run_id. - PG_NAME: pg-handlers-${{ github.run_id }}-${{ github.run_attempt }} - # Bridge network already exists on the operator host (declared - # in docker-compose.yml + docker-compose.infra.yml). - PG_NETWORK: molecule-core-net - defaults: - run: - working-directory: workspace-server - steps: - - if: needs.detect-changes.outputs.handlers != 'true' - working-directory: . - run: echo "No handlers/migrations changes — skipping; this job always runs to satisfy the required-check name." 
- - - if: needs.detect-changes.outputs.handlers == 'true' - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - - - if: needs.detect-changes.outputs.handlers == 'true' - uses: actions/setup-go@40f1582b2485089dde7abd97c1529aa768e1baff # v5 - with: - go-version: 'stable' - - - if: needs.detect-changes.outputs.handlers == 'true' - name: Start sibling Postgres on bridge network - working-directory: . - run: | - # Sanity: the bridge network must exist on the operator host. - # Hard-fail loud if it doesn't — easier to spot than a silent - # auto-create that diverges from the rest of the stack. - if ! docker network inspect "${PG_NETWORK}" >/dev/null 2>&1; then - echo "::error::Bridge network '${PG_NETWORK}' missing on operator host. Re-run docker-compose.infra.yml or check ops handbook." - exit 1 - fi - - # If a stale container with the same name exists (rerun on - # the same run_id), wipe it first. - docker rm -f "${PG_NAME}" >/dev/null 2>&1 || true - - docker run -d \ - --name "${PG_NAME}" \ - --network "${PG_NETWORK}" \ - --health-cmd "pg_isready -U postgres" \ - --health-interval 5s \ - --health-timeout 5s \ - --health-retries 10 \ - -e POSTGRES_PASSWORD=test \ - -e POSTGRES_DB=molecule \ - postgres:15-alpine >/dev/null - - # Read back the bridge IP. Always present immediately after - # `docker run -d` for bridge networks. 
- PG_HOST=$(docker inspect "${PG_NAME}" \ - --format "{{(index .NetworkSettings.Networks \"${PG_NETWORK}\").IPAddress}}") - if [ -z "${PG_HOST}" ]; then - echo "::error::Could not resolve PG_HOST for ${PG_NAME} on ${PG_NETWORK}" - docker logs "${PG_NAME}" || true - exit 1 - fi - echo "PG_HOST=${PG_HOST}" >> "$GITHUB_ENV" - echo "INTEGRATION_DB_URL=postgres://postgres:test@${PG_HOST}:5432/molecule?sslmode=disable" >> "$GITHUB_ENV" - echo "Started ${PG_NAME} at ${PG_HOST}:5432" - - - if: needs.detect-changes.outputs.handlers == 'true' - name: Apply migrations to Postgres service - env: - PGPASSWORD: test - run: | - # Wait for postgres to actually accept connections. Docker's - # health-cmd handles container-side readiness, but the wire - # to the bridge IP is best-tested with pg_isready directly. - for i in {1..15}; do - if pg_isready -h "${PG_HOST}" -p 5432 -U postgres -q; then break; fi - echo "waiting for postgres at ${PG_HOST}:5432..."; sleep 2 - done - - # Apply every .up.sql in lexicographic order with - # ON_ERROR_STOP=0 — failing migrations are SKIPPED rather than - # blocking the suite. This handles the current schema state - # where a few historical migrations (e.g. 017_memories_fts_*) - # depend on tables that were later renamed/dropped and so - # cannot replay from scratch. The migrations that DO succeed - # land their tables, which is sufficient for the integration - # tests in handlers/. - # - # Why not maintain a curated allowlist: every new migration - # touching a handlers/-tested table would have to update this - # workflow. With apply-all-or-skip, a future migration that - # adds a column to delegations runs automatically (its base - # table 049_delegations.up.sql already succeeded above it in - # the order). Operators only need to revisit this if the - # migration chain becomes legitimately replayable end-to-end. 
- # - # Per-migration result is logged so a failed migration that - # SHOULD have been replayable surfaces in the CI log instead - # of silently failing. - # Apply both *.sql (legacy, lives next to its module) and - # *.up.sql (newer up/down convention) in a single - # lexicographically-sorted pass. Excluding *.down.sql so the - # newest-naming-convention pairs don't undo themselves mid-run. - # Pre-#149-followup this loop only globbed *.up.sql, which - # silently skipped 001_workspaces.sql + 009_activity_logs.sql - # — fine while no integration test depended on those tables, - # not fine once a cross-table atomicity test came in. - set +e - for migration in $(ls migrations/*.sql 2>/dev/null | grep -v '\.down\.sql$' | sort); do - if psql -h "${PG_HOST}" -U postgres -d molecule -v ON_ERROR_STOP=1 \ - -f "$migration" >/dev/null 2>&1; then - echo "✓ $(basename "$migration")" - else - echo "⊘ $(basename "$migration") (skipped — see comment in workflow)" - fi - done - set -e - - # Sanity: the delegations + workspaces + activity_logs tables - # MUST exist for the integration tests to be meaningful. Hard- - # fail if any didn't land — that would be a real regression we - # want loud. - for tbl in delegations workspaces activity_logs pending_uploads; do - if ! psql -h "${PG_HOST}" -U postgres -d molecule -tA \ - -c "SELECT 1 FROM information_schema.tables WHERE table_name = '$tbl'" \ - | grep -q 1; then - echo "::error::$tbl table missing after migration replay — handler integration tests would be meaningless" - exit 1 - fi - echo "✓ $tbl table present" - done - - - if: needs.detect-changes.outputs.handlers == 'true' - name: Run integration tests - run: | - # INTEGRATION_DB_URL is exported by the start-postgres step; - # points at the per-run bridge IP, not 127.0.0.1, so concurrent - # workflow runs don't fight over a host-net 5432 port. 
- go test -tags=integration -timeout 5m -v ./internal/handlers/ -run "^TestIntegration_" - - - if: failure() && needs.detect-changes.outputs.handlers == 'true' - name: Diagnostic dump on failure - env: - PGPASSWORD: test - run: | - echo "::group::postgres container status" - docker ps -a --filter "name=${PG_NAME}" --format '{{.Status}} {{.Names}}' || true - docker logs "${PG_NAME}" 2>&1 | tail -50 || true - echo "::endgroup::" - echo "::group::delegations table state" - psql -h "${PG_HOST}" -U postgres -d molecule -c "SELECT * FROM delegations LIMIT 50;" || true - echo "::endgroup::" - - - if: always() && needs.detect-changes.outputs.handlers == 'true' - name: Stop sibling Postgres - working-directory: . - run: | - # always() so containers don't leak when migrations or tests - # fail. The cleanup is best-effort: if the container is - # already gone (e.g. concurrent rerun race), don't fail the job. - docker rm -f "${PG_NAME}" >/dev/null 2>&1 || true - echo "Cleaned up ${PG_NAME}" diff --git a/.github/workflows/harness-replays.yml b/.github/workflows/harness-replays.yml deleted file mode 100644 index 3bb342ec..00000000 --- a/.github/workflows/harness-replays.yml +++ /dev/null @@ -1,248 +0,0 @@ -name: Harness Replays - -# Boots tests/harness (production-shape compose topology with TenantGuard, -# /cp/* proxy, canvas proxy, real production Dockerfile.tenant) and runs -# every replay under tests/harness/replays/. Fails the PR if any replay -# fails. -# -# Why this exists: 2026-04-30 we shipped #2398 which added /buildinfo as -# a public route in router.go but forgot to add it to TenantGuard's -# allowlist. The handler-level test in buildinfo_test.go constructed a -# minimal gin engine without TenantGuard — green. The harness's -# buildinfo-stale-image.sh replay would have caught it (cf-proxy doesn't -# inject X-Molecule-Org-Id, so the curl path is identical to production's -# redeploy verifier), but no one ran the harness pre-merge. 
The bug -# shipped; the redeploy verifier silently soft-warned every tenant as -# "unreachable" for ~1 day before being noticed. -# -# This gate makes "did you actually run the harness?" a CI invariant -# instead of a memory-discipline thing. -# -# Trigger model — match e2e-api.yml: always FIRES on push/pull_request -# to staging+main, real work is gated per-step on detect-changes output. -# One job → one check run → branch-protection-clean (the SKIPPED-in-set -# trap from PR #2264 is documented in e2e-api.yml's e2e-api job comment). - -on: - push: - branches: [main, staging] - paths: - - 'workspace-server/**' - - 'canvas/**' - - 'tests/harness/**' - - '.github/workflows/harness-replays.yml' - pull_request: - branches: [main, staging] - paths: - - 'workspace-server/**' - - 'canvas/**' - - 'tests/harness/**' - - '.github/workflows/harness-replays.yml' - workflow_dispatch: - merge_group: - types: [checks_requested] - -concurrency: - # Per-SHA grouping. Per-ref kept hitting the auto-promote-staging - # cancellation deadlock — see e2e-api.yml's concurrency block for - # the 2026-04-28 incident that codified this pattern. - group: harness-replays-${{ github.event.pull_request.head.sha || github.sha }} - cancel-in-progress: false - -jobs: - detect-changes: - runs-on: ubuntu-latest - outputs: - run: ${{ steps.decide.outputs.run }} - steps: - - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - - id: decide - run: | - # workflow_dispatch: always run (manual trigger) - if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then - echo "run=true" >> "$GITHUB_OUTPUT" - echo "debug=manual-trigger" >> "$GITHUB_OUTPUT" - exit 0 - fi - - # Determine the base commit to diff against. - # For pull_request: use base.sha (the merge-base with main/staging). - # For push: use github.event.before (the previous tip of the branch). - # Fallback for new branches (all-zeros SHA): run everything. 
- if [ "${{ github.event_name }}" = "pull_request" ] && \ - [ -n "${{ github.event.pull_request.base.sha }}" ]; then - BASE="${{ github.event.pull_request.base.sha }}" - elif [ -n "${{ github.event.before }}" ] && \ - ! echo "${{ github.event.before }}" | grep -qE '^0+$'; then - BASE="${{ github.event.before }}" - else - # New branch or github.event.before unavailable — run everything. - echo "run=true" >> "$GITHUB_OUTPUT" - echo "debug=new-branch-fallback" >> "$GITHUB_OUTPUT" - exit 0 - fi - - # GitHub Actions and Gitea Actions both expose github.sha for HEAD. - DIFF=$(git diff --name-only "$BASE" "${{ github.sha }}" 2>/dev/null) - echo "debug=diff-base=$BASE diff-files=$DIFF" >> "$GITHUB_OUTPUT" - - if echo "$DIFF" | grep -qE '^workspace-server/|^canvas/|^tests/harness/|^.github/workflows/harness-replays\.yml$'; then - echo "run=true" >> "$GITHUB_OUTPUT" - else - echo "run=false" >> "$GITHUB_OUTPUT" - fi - - # ONE job that always runs. Real work is gated per-step on - # detect-changes.outputs.run so an unrelated PR (e.g. doc-only - # change to molecule-controlplane wired here later) emits the - # required check without spending CI cycles. Single-job pattern - # matches e2e-api.yml — see that workflow's comment for why a - # job-level `if: false` would block branch protection via the - # SKIPPED-in-set bug. - harness-replays: - needs: detect-changes - name: Harness Replays - runs-on: docker-host - timeout-minutes: 30 - steps: - - name: No-op pass (paths filter excluded this commit) - if: needs.detect-changes.outputs.run != 'true' - run: | - echo "No workspace-server / canvas / tests/harness / workflow changes — Harness Replays gate satisfied without running." - echo "::notice::Harness Replays no-op pass (paths filter excluded this commit)." 
- echo "::notice::Debug: ${{ needs.detect-changes.outputs.debug }}" - - - if: needs.detect-changes.outputs.run == 'true' - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - - # Log what files were detected so future failures include the diff. - - name: Log detected changes - if: needs.detect-changes.outputs.run == 'true' - run: | - echo "::notice::detect-changes debug: ${{ needs.detect-changes.outputs.debug }}" - - # github-app-auth sibling-checkout removed 2026-05-07 (#157): - # the plugin was dropped + Dockerfile.tenant no longer COPYs it. - - # Pre-clone manifest deps before docker compose builds the tenant - # image (Task #173 followup — same pattern as - # publish-workspace-server-image.yml's "Pre-clone manifest deps" - # step). - # - # Why pre-clone here too: tests/harness/compose.yml builds tenant-alpha - # and tenant-beta from workspace-server/Dockerfile.tenant with - # context=../.. (repo root). That Dockerfile expects - # .tenant-bundle-deps/{workspace-configs-templates,org-templates,plugins} - # to be present at build context root (post-#173 it COPYs from there - # instead of running an in-image clone — the in-image clone failed - # with "could not read Username for https://git.moleculesai.app" - # because there's no auth path inside the build sandbox). - # - # Without this step harness-replays fails before any replay runs, - # with `failed to calculate checksum of ref ... - # "/.tenant-bundle-deps/plugins": not found`. Caught by run #892 - # (main, 2026-05-07T20:28:53Z) and run #964 (staging — same - # symptom, different root cause: staging still has the in-image - # clone path, hits the auth error directly). - # - # 2026-05-08 sub-finding (#192): the clone step ALSO fails when - # any referenced workspace-template repo is private and the - # AUTO_SYNC_TOKEN bearer (devops-engineer persona) lacks read - # access. 
Root cause: 5 of 9 workspace-template repos - # (openclaw, codex, crewai, deepagents, gemini-cli) had been - # marked private with no team grant. Resolution: flipped them - # to public per `feedback_oss_first_repo_visibility_default` - # (the OSS surface should be public). Layer-3 (customer-private + - # marketplace third-party repos) tracked separately in - # internal#102. - # - # Token shape matches publish-workspace-server-image.yml: AUTO_SYNC_TOKEN - # is the devops-engineer persona PAT, NOT the founder PAT (per - # `feedback_per_agent_gitea_identity_default`). clone-manifest.sh - # embeds it as basic-auth for the duration of the clones and strips - # .git directories — the token never enters the resulting image. - - name: Pre-clone manifest deps - if: needs.detect-changes.outputs.run == 'true' - env: - MOLECULE_GITEA_TOKEN: ${{ secrets.AUTO_SYNC_TOKEN }} - run: | - set -euo pipefail - if [ -z "${MOLECULE_GITEA_TOKEN}" ]; then - echo "::error::AUTO_SYNC_TOKEN secret is empty — register the devops-engineer persona PAT in repo Actions secrets" - exit 1 - fi - mkdir -p .tenant-bundle-deps - bash scripts/clone-manifest.sh \ - manifest.json \ - .tenant-bundle-deps/workspace-configs-templates \ - .tenant-bundle-deps/org-templates \ - .tenant-bundle-deps/plugins - # Sanity-check counts so a silent partial clone fails fast - # instead of producing a half-empty image. - ws_count=$(find .tenant-bundle-deps/workspace-configs-templates -mindepth 1 -maxdepth 1 -type d | wc -l) - org_count=$(find .tenant-bundle-deps/org-templates -mindepth 1 -maxdepth 1 -type d | wc -l) - plugins_count=$(find .tenant-bundle-deps/plugins -mindepth 1 -maxdepth 1 -type d | wc -l) - echo "Cloned: ws=$ws_count org=$org_count plugins=$plugins_count" - - - name: Install Python deps for replays - # peer-discovery-404 (and future replays) eval Python against the - # running tenant — importing workspace/a2a_client.py pulls in - # httpx. 
tests/harness/requirements.txt holds just the HTTP-client - # surface to keep CI install fast (~3s) vs the full - # workspace/requirements.txt (~30s). - if: needs.detect-changes.outputs.run == 'true' - run: pip install -r tests/harness/requirements.txt - - - name: Run all replays against the harness - # run-all-replays.sh: boot via up.sh → seed via seed.sh → run - # every replays/*.sh → tear down via down.sh on EXIT (trap). - # Non-zero exit on any replay failure. - # - # KEEP_UP=1: without this, the script's trap-on-EXIT tears - # down containers immediately on failure, leaving the dump - # step below with nothing to dump (verified on PR #2410's - # first run — tenant became unhealthy, trap fired, dump - # step saw empty containers). Keeping them up lets the - # failure path collect tenant/cp-stub/cf-proxy logs. The - # always-run "Force teardown" step does the actual cleanup. - if: needs.detect-changes.outputs.run == 'true' - working-directory: tests/harness - env: - KEEP_UP: "1" - run: ./run-all-replays.sh - - - name: Dump compose logs on failure - # SECRETS_ENCRYPTION_KEY: docker compose validates the entire compose - # file even for read-only `logs` calls. up.sh generates a per-run key - # and exports it to its OWN shell — this step runs in a fresh shell - # that wouldn't see it, so without a placeholder the validate step - # errors before logs print (verified against PR #2492's first run: - # "required variable SECRETS_ENCRYPTION_KEY is missing a value"). - # A placeholder is fine — we're only reading log streams, not booting. 
- if: failure() && needs.detect-changes.outputs.run == 'true' - working-directory: tests/harness - env: - SECRETS_ENCRYPTION_KEY: dump-logs-placeholder - run: | - echo "=== docker compose ps ===" - docker compose -f compose.yml ps || true - echo "=== tenant-alpha logs ===" - docker compose -f compose.yml logs tenant-alpha || true - echo "=== tenant-beta logs ===" - docker compose -f compose.yml logs tenant-beta || true - echo "=== cp-stub logs ===" - docker compose -f compose.yml logs cp-stub || true - echo "=== cf-proxy logs ===" - docker compose -f compose.yml logs cf-proxy || true - echo "=== postgres-alpha logs (last 100) ===" - docker compose -f compose.yml logs --tail 100 postgres-alpha || true - echo "=== postgres-beta logs (last 100) ===" - docker compose -f compose.yml logs --tail 100 postgres-beta || true - - - name: Force teardown - # We pass KEEP_UP=1 to run-all-replays.sh so the dump step - # above sees real containers — that means we own teardown - # explicitly here. Always run. - if: always() && needs.detect-changes.outputs.run == 'true' - working-directory: tests/harness - run: ./down.sh || true diff --git a/.github/workflows/lint-curl-status-capture.yml b/.github/workflows/lint-curl-status-capture.yml deleted file mode 100644 index 487b2eb4..00000000 --- a/.github/workflows/lint-curl-status-capture.yml +++ /dev/null @@ -1,94 +0,0 @@ -name: Lint curl status-code capture - -# Pins the workflow-bash anti-pattern that produced "HTTP 000000" on the -# 2026-05-04 redeploy-tenants-on-main run for sha 2b862f6: -# -# HTTP_CODE=$(curl ... -w '%{http_code}' ... || echo "000") -# -# When curl exits non-zero (connection reset → 56, --fail-with-body 4xx/5xx -# → 22), the `-w '%{http_code}'` already wrote a status to stdout — usually -# "000" for connection failures or the actual code for HTTP errors. 
The -# `|| echo "000"` then fires AND appends ANOTHER "000" to the captured -# stdout, producing values like "000000" or "409000" that fail string -# comparisons against "200" while looking superficially right. -# -# Same class of bug the synth-E2E §7c gate hit twice (PRs #2779/#2783 + -# #2797). Memory: feedback_curl_status_capture_pollution.md. -# -# Fix shape (route -w into a tempfile so curl's exit code can't pollute): -# -# set +e -# curl ... -w '%{http_code}' >code.txt 2>/dev/null -# set -e -# HTTP_CODE=$(cat code.txt 2>/dev/null) -# [ -z "$HTTP_CODE" ] && HTTP_CODE="000" - -on: - pull_request: - paths: ['.github/workflows/**'] - push: - branches: [main, staging] - paths: ['.github/workflows/**'] - merge_group: - types: [checks_requested] - -jobs: - scan: - name: Scan workflows for curl status-capture pollution - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - - name: Find curl ... -w '%{http_code}' ... || echo "000" subshells - run: | - set -uo pipefail - # Multi-line aware: look for `$(curl ... -w '%{http_code}' ... || echo "000")` - # subshell where the entire command-substitution wraps a curl that - # ends with `|| echo "000"`. Must distinguish from the SAFE shape - # `$(cat tempfile 2>/dev/null || echo "000")` — `cat` with a missing - # tempfile produces empty stdout, no pollution. - python3 <<'PY' - import os, re, sys, glob - - BAD_FILES = [] - - # Match the buggy substitution across newlines: $(curl ... -w '%{http_code}' ... || echo "000") - # The `\\n` is the bash line-continuation that lets curl flags span lines. - # We collapse continuation lines first, then look for the single-line bad pattern. - PATTERN = re.compile( - r'\$\(\s*curl\b[^)]*-w\s*[\'"]%\{http_code\}[\'"][^)]*\|\|\s*echo\s+"000"\s*\)', - re.DOTALL, - ) - - # Self-skip: this lint workflow contains the literal anti-pattern in - # its own docstring — that's intentional, not a bug. 
- SELF = ".github/workflows/lint-curl-status-capture.yml" - - for f in sorted(glob.glob(".github/workflows/*.yml")): - if f == SELF: - continue - with open(f) as fh: - content = fh.read() - # Collapse bash line-continuations (\\\n + leading whitespace) - # into a single logical line so the regex can see the full - # curl invocation as one chunk. - flat = re.sub(r'\\\s*\n\s*', ' ', content) - for m in PATTERN.finditer(flat): - BAD_FILES.append((f, m.group(0)[:120])) - - if not BAD_FILES: - print("✓ No curl-status-capture pollution patterns detected") - sys.exit(0) - - print(f"::error::Found {len(BAD_FILES)} curl-status-capture pollution site(s):") - for f, snippet in BAD_FILES: - print(f"::error file={f}::Curl status-capture pollution: '|| echo \"000\"' inside a $(curl ... -w '%{{http_code}}' ...) subshell. On non-2xx or connection failure, curl's -w writes a status, then exits non-zero, then the || echo appends another '000' — producing 'HTTP 000000' or '409000' that fails comparisons silently. Fix: route -w into a tempfile so the exit code can't pollute stdout. See memory feedback_curl_status_capture_pollution.md.") - print(f" matched: {snippet}…") - print() - print("Fix template:") - print(' set +e') - print(' curl ... -w \'%{http_code}\' >code.txt 2>/dev/null') - print(' set -e') - print(' HTTP_CODE=$(cat code.txt 2>/dev/null)') - print(' [ -z "$HTTP_CODE" ] && HTTP_CODE="000"') - sys.exit(1) - PY diff --git a/.github/workflows/publish-canvas-image.yml b/.github/workflows/publish-canvas-image.yml deleted file mode 100644 index 5d085ff0..00000000 --- a/.github/workflows/publish-canvas-image.yml +++ /dev/null @@ -1,121 +0,0 @@ -name: publish-canvas-image - -# Builds and pushes the canvas Docker image to GHCR whenever a commit lands -# on main that touches canvas code. Previously canvas changes were visible in -# CI (npm run build passed) but the live container was never updated — -# operators had to manually run `docker compose build canvas` each time. 
-# -# Mirror of publish-platform-image.yml, adapted for the Next.js canvas layer. -# See that workflow for inline notes on macOS Keychain isolation and QEMU. - -on: - push: - branches: [main] - paths: - # Only rebuild when canvas source changes — saves GHA minutes on - # platform-only / docs-only / MCP-only merges. - - 'canvas/**' - - '.github/workflows/publish-canvas-image.yml' - # Manual trigger: use after a non-canvas merge that still needs a fresh - # image (e.g. a Dockerfile change lives outside the canvas/ tree). - workflow_dispatch: - inputs: - platform_url: - description: 'NEXT_PUBLIC_PLATFORM_URL baked into the bundle (default: http://localhost:8080)' - required: false - default: '' - ws_url: - description: 'NEXT_PUBLIC_WS_URL baked into the bundle (default: ws://localhost:8080/ws)' - required: false - default: '' - -permissions: - contents: read - packages: write # required to push to ghcr.io/${{ github.repository_owner }}/* - -env: - IMAGE_NAME: ghcr.io/molecule-ai/canvas - -jobs: - build-and-push: - name: Build & push canvas image - runs-on: publish - steps: - - name: Checkout - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - - - name: Log in to GHCR - uses: docker/login-action@c94ce9fb468520275223c153574b00df6fe4bcc9 # v3 - with: - registry: ghcr.io - username: ${{ github.actor }} - password: ${{ secrets.GITHUB_TOKEN }} - - - name: Set up Docker Buildx - uses: docker/setup-buildx-action@4d04d5d9486b7bd6fa91e7baf45bbb4f8b9deedd # v4.0.0 - - # Health check: verify Docker daemon is accessible before attempting any - # build steps. This fails loudly at step 1 when the runner's docker.sock - # is inaccessible rather than silently continuing to the build step - # where docker build fails deep in ECR auth with a cryptic error. 
- - name: Verify Docker daemon access - run: | - set -euo pipefail - echo "::group::Docker daemon health check" - docker info 2>&1 | head -5 || { - echo "::error::Docker daemon is not accessible at /var/run/docker.sock" - echo "::error::Check: (1) daemon running, (2) runner user in docker group, (3) sock perms 660+" - exit 1 - } - echo "Docker daemon OK" - echo "::endgroup::" - - - name: Compute tags - id: tags - shell: bash - run: | - echo "sha=${GITHUB_SHA::7}" >> "$GITHUB_OUTPUT" - - - name: Resolve build args - id: build_args - # Priority: workflow_dispatch input > repo secret > hardcoded default. - # NEXT_PUBLIC_* env vars are baked into the JS bundle at build time by - # Next.js — they cannot be changed at runtime without a full rebuild. - # For local docker-compose deployments the defaults (localhost:8080) - # work as-is; production deployments should set CANVAS_PLATFORM_URL - # and CANVAS_WS_URL as repository secrets. - # - # Inputs are passed via env vars (not direct ${{ }} interpolation) to - # prevent shell injection from workflow_dispatch string inputs. 
- shell: bash - env: - INPUT_PLATFORM_URL: ${{ github.event.inputs.platform_url }} - SECRET_PLATFORM_URL: ${{ secrets.CANVAS_PLATFORM_URL }} - INPUT_WS_URL: ${{ github.event.inputs.ws_url }} - SECRET_WS_URL: ${{ secrets.CANVAS_WS_URL }} - run: | - PLATFORM_URL="${INPUT_PLATFORM_URL:-${SECRET_PLATFORM_URL:-http://localhost:8080}}" - WS_URL="${INPUT_WS_URL:-${SECRET_WS_URL:-ws://localhost:8080/ws}}" - - echo "platform_url=${PLATFORM_URL}" >> "$GITHUB_OUTPUT" - echo "ws_url=${WS_URL}" >> "$GITHUB_OUTPUT" - - - name: Build & push canvas image to GHCR - uses: docker/build-push-action@bcafcacb16a39f128d818304e6c9c0c18556b85f # v7.1.0 - with: - context: ./canvas - file: ./canvas/Dockerfile - platforms: linux/amd64 - push: true - build-args: | - NEXT_PUBLIC_PLATFORM_URL=${{ steps.build_args.outputs.platform_url }} - NEXT_PUBLIC_WS_URL=${{ steps.build_args.outputs.ws_url }} - tags: | - ${{ env.IMAGE_NAME }}:latest - ${{ env.IMAGE_NAME }}:sha-${{ steps.tags.outputs.sha }} - cache-from: type=gha - cache-to: type=gha,mode=max - labels: | - org.opencontainers.image.source=https://github.com/${{ github.repository }} - org.opencontainers.image.revision=${{ github.sha }} - org.opencontainers.image.description=Molecule AI canvas (Next.js 15 + React Flow) diff --git a/.github/workflows/railway-pin-audit.yml b/.github/workflows/railway-pin-audit.yml deleted file mode 100644 index ff238946..00000000 --- a/.github/workflows/railway-pin-audit.yml +++ /dev/null @@ -1,207 +0,0 @@ -name: Railway pin audit (drift detection) - -# Daily audit of Railway env vars for drift-prone image-tag pins — -# automation-cadence layer over the detection script + regression test -# shipped in PR #2168 (#2001 closure). -# -# Background: on 2026-04-24 a stale `:staging-a14cf86` SHA pin in CP's -# TENANT_IMAGE caused 3+ hours of E2E failure with the appearance that -# "every fix didn't propagate" — really the tenant image was so old it -# didn't read the env vars those fixes produced. 
The audit script -# (scripts/ops/audit-railway-sha-pins.sh) flags drift; this workflow -# runs the same check unattended on a daily cron. -# -# Cadence: once a day, 13:00 UTC (06:00 PT). Daily is the right -# cadence for variables-tier config — Railway env var changes are -# deliberate operator actions, low-frequency. Hourly would risk -# Railway API rate-limit surprises and is overkill for the change rate. -# -# Issue-on-failure: drift triggers a priority-high issue, mirroring -# .github/workflows/e2e-staging-sanity.yml's pattern. Drift is -# medium-priority "config slipped, fix at next ops window," not -# active-outage paging. -# -# Secret hardening: per feedback_schedule_vs_dispatch_secrets_hardening, -# the schedule trigger HARD-FAILS on missing RAILWAY_AUDIT_TOKEN -# (silent-success on schedule was the failure-mode class that bit the -# team before; cron firing without checking anything is worse than no -# cron). The workflow_dispatch trigger SOFT-SKIPS on missing secret so -# an operator can dry-run the workflow shape during initial provisioning -# without tripping a fake red. - -on: - schedule: - - cron: '0 13 * * *' - workflow_dispatch: - -concurrency: - group: railway-pin-audit - cancel-in-progress: false - -permissions: - issues: write - contents: read - -jobs: - audit: - name: Audit Railway env vars for drift-prone pins - runs-on: ubuntu-latest - timeout-minutes: 10 - - steps: - - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - - - name: Verify RAILWAY_AUDIT_TOKEN present - # Schedule trigger: hard-fail when the secret is missing — - # otherwise the cron silently runs against the wrong scope (or - # exits 2 from the script and we issue-spam) without anyone - # noticing the token rot. - # Dispatch trigger: soft-skip — operator may be dry-running the - # workflow shape before provisioning the secret. Logged as a - # workflow notice, not a failure. 
- env: - RAILWAY_AUDIT_TOKEN: ${{ secrets.RAILWAY_AUDIT_TOKEN }} - EVENT_NAME: ${{ github.event_name }} - id: secret_check - run: | - set -euo pipefail - if [ -n "${RAILWAY_AUDIT_TOKEN:-}" ]; then - echo "have_secret=true" >> "$GITHUB_OUTPUT" - exit 0 - fi - echo "have_secret=false" >> "$GITHUB_OUTPUT" - if [ "$EVENT_NAME" = "workflow_dispatch" ]; then - echo "::notice::RAILWAY_AUDIT_TOKEN not configured — soft-skipping (manual dispatch)" - exit 0 - fi - echo "::error::RAILWAY_AUDIT_TOKEN secret missing — schedule trigger requires it. Provision the token (read-only \`variables\` scope on the molecule-platform Railway project) and store as repo secret RAILWAY_AUDIT_TOKEN." - exit 1 - - - name: Install Railway CLI - if: steps.secret_check.outputs.have_secret == 'true' - # Pinned hash matching the public install instructions; bump in - # tandem with the audit-script's documented Railway CLI version. - run: | - set -euo pipefail - curl -fsSL https://railway.com/install.sh | sh - # The installer drops the binary in ~/.railway/bin - echo "$HOME/.railway/bin" >> "$GITHUB_PATH" - - - name: Verify Railway CLI authenticated - if: steps.secret_check.outputs.have_secret == 'true' - env: - RAILWAY_TOKEN: ${{ secrets.RAILWAY_AUDIT_TOKEN }} - run: | - set -euo pipefail - # `railway whoami` exits non-zero when the token is - # unauthenticated or doesn't have any project access. - if ! railway whoami >/dev/null 2>&1; then - echo "::error::Railway CLI failed to authenticate with RAILWAY_AUDIT_TOKEN — token may be revoked or scoped incorrectly" - exit 2 - fi - - - name: Link molecule-platform project - if: steps.secret_check.outputs.have_secret == 'true' - env: - RAILWAY_TOKEN: ${{ secrets.RAILWAY_AUDIT_TOKEN }} - # Project ID from reference_production_stack: molecule-platform - # / 7ccc8c68-61f4-42ab-9be5-586eeee11768. 
Linking is per-process, - # so we re-link in this CI shell (the audit script comment says - # it deliberately doesn't chdir for you because the linked - # project's identity matters). - run: | - set -euo pipefail - railway link --project 7ccc8c68-61f4-42ab-9be5-586eeee11768 - - - name: Run drift audit - if: steps.secret_check.outputs.have_secret == 'true' - id: audit - env: - RAILWAY_TOKEN: ${{ secrets.RAILWAY_AUDIT_TOKEN }} - run: | - set +e - bash scripts/ops/audit-railway-sha-pins.sh 2>&1 | tee /tmp/audit.log - rc=${PIPESTATUS[0]} - echo "rc=$rc" >> "$GITHUB_OUTPUT" - # Capture the audit log for the issue body. - { - echo 'log<> "$GITHUB_OUTPUT" - # Exit codes from the script: - # 0 — no drift; workflow goes green - # 1 — drift detected; we'll file an issue and fail the run - # 2 — railway CLI unauthenticated / project unlinked; fail - # Anything else: also fail. - case "$rc" in - 0) exit 0 ;; - 1) echo "::warning::Drift-prone pin(s) detected — issue will be filed"; exit 1 ;; - 2) echo "::error::Railway CLI auth/link failed mid-script — token or project ID drift"; exit 2 ;; - *) echo "::error::Unexpected audit rc=$rc"; exit 1 ;; - esac - - - name: Open / update drift issue - if: failure() && steps.audit.outputs.rc == '1' - uses: actions/github-script@3a2844b7e9c422d3c10d287c895573f7108da1b3 # v9.0.0 - env: - AUDIT_LOG: ${{ steps.audit.outputs.log }} - with: - script: | - const title = "🚨 Railway env-var drift detected"; - const runURL = `https://github.com/${context.repo.owner}/${context.repo.repo}/actions/runs/${context.runId}`; - const body = - `Daily Railway pin audit found drift-prone image-tag pins in the molecule-platform Railway project.\n\n` + - `**What this means:** an env var (likely on \`controlplane\`) is pinned to a SHA-shaped or semver tag instead of a floating tag. 
` + - `Same pattern that caused the 2026-04-24 TENANT_IMAGE incident — fix-PRs land but the running service doesn't pick them up.\n\n` + - `**Recovery:** open the Railway dashboard, replace the flagged value with a floating tag (\`:staging-latest\`, \`:main\`) unless the pin is intentional and documented in the ops runbook.\n\n` + - `**Audit output:**\n\n\`\`\`\n${process.env.AUDIT_LOG || '(log unavailable)'}\n\`\`\`\n\n` + - `Run: ${runURL}\n\n` + - `Closes automatically when a subsequent daily run reports clean.`; - - const { data: existing } = await github.rest.issues.listForRepo({ - owner: context.repo.owner, repo: context.repo.repo, - state: 'open', labels: 'railway-drift', - }); - const match = existing.find(i => i.title === title); - if (match) { - await github.rest.issues.createComment({ - owner: context.repo.owner, repo: context.repo.repo, - issue_number: match.number, - body: `Still drifting. ${runURL}\n\n\`\`\`\n${process.env.AUDIT_LOG || '(log unavailable)'}\n\`\`\``, - }); - } else { - await github.rest.issues.create({ - owner: context.repo.owner, repo: context.repo.repo, - title, body, - labels: ['railway-drift', 'bug', 'priority-high'], - }); - } - - - name: Close stale drift issue on clean run - # When a previously-flagged drift gets fixed by an operator, - # the next daily run goes green. Close any open `railway-drift` - # issue with a confirmation comment so the queue doesn't carry - # stale ones. 
- if: success() && steps.audit.outputs.rc == '0' - uses: actions/github-script@3a2844b7e9c422d3c10d287c895573f7108da1b3 # v9.0.0 - with: - script: | - const runURL = `https://github.com/${context.repo.owner}/${context.repo.repo}/actions/runs/${context.runId}`; - const { data: existing } = await github.rest.issues.listForRepo({ - owner: context.repo.owner, repo: context.repo.repo, - state: 'open', labels: 'railway-drift', - }); - for (const issue of existing) { - await github.rest.issues.createComment({ - owner: context.repo.owner, repo: context.repo.repo, - issue_number: issue.number, - body: `Daily audit clean — drift resolved. ${runURL}`, - }); - await github.rest.issues.update({ - owner: context.repo.owner, repo: context.repo.repo, - issue_number: issue.number, - state: 'closed', - state_reason: 'completed', - }); - } diff --git a/.github/workflows/runtime-pin-compat.yml b/.github/workflows/runtime-pin-compat.yml deleted file mode 100644 index 7292ed61..00000000 --- a/.github/workflows/runtime-pin-compat.yml +++ /dev/null @@ -1,91 +0,0 @@ -name: Runtime Pin Compatibility - -# CI gate that prevents the 5-hour staging outage from 2026-04-24 from -# recurring (controlplane#253). The original failure mode: -# 1. molecule-ai-workspace-runtime 0.1.13 declared `a2a-sdk<1.0` in its -# requires_dist metadata (incorrect — it actually imports -# a2a.server.routes which only exists in a2a-sdk 1.0+) -# 2. `pip install molecule-ai-workspace-runtime` resolved cleanly -# 3. `from molecule_runtime.main import main_sync` raised ImportError -# 4. Every tenant workspace crashed; the canary tenant caught it but -# only after 5 hours of degraded staging -# -# This workflow installs the CURRENTLY PUBLISHED runtime from PyPI on -# top of `workspace/requirements.txt` and smoke-imports. 
Catches: -# - Upstream PyPI yanks -# - Bad re-releases of molecule-ai-workspace-runtime -# - Already-shipped wheels that stop importing because a transitive -# dep moved underneath -# -# This is the "PyPI artifact health" half of pin compatibility. The -# companion workflow `runtime-prbuild-compat.yml` covers the -# "PR-introduced breakage" half by building the wheel from THIS PR's -# workspace/ source. Splitting the two means each gets a narrow -# `paths:` filter — the pypi-latest job no longer fires on doc-only -# workspace/ edits whose content can't change what's currently on PyPI. - -on: - push: - branches: [main, staging] - paths: - # Narrow filter: pypi-latest is sensitive only to changes that - # affect what we're INSTALLING (requirements.txt) or WHAT THE - # CHECK ITSELF DOES (this workflow file). Edits to workspace/ - # source code don't change what's on PyPI right now, so they - # don't change this gate's verdict. - - 'workspace/requirements.txt' - - '.github/workflows/runtime-pin-compat.yml' - pull_request: - branches: [main, staging] - paths: - - 'workspace/requirements.txt' - - '.github/workflows/runtime-pin-compat.yml' - # Daily catch for upstream PyPI publishes that break the pin combo - # without any change in our repo (e.g. someone re-yanks an a2a-sdk - # release or molecule-ai-workspace-runtime publishes a bad bump). - schedule: - - cron: '0 13 * * *' # 06:00 PT - workflow_dispatch: - # Required-check support: when this becomes a branch-protection gate, - # merge_group runs let the queue green-check this in addition to PRs. 
- merge_group: - types: [checks_requested] - -concurrency: - group: ${{ github.workflow }}-${{ github.ref }} - cancel-in-progress: true - -jobs: - pypi-latest-install: - name: PyPI-latest install + import smoke - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - - uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0 - with: - python-version: '3.11' - cache: pip - cache-dependency-path: workspace/requirements.txt - - name: Install runtime + workspace requirements - # Install order is load-bearing: install the runtime FIRST so pip - # honors whatever a2a-sdk constraint the runtime metadata declares - # (this is the surface that broke in 2026-04-24 — runtime declared - # `a2a-sdk<1.0` but actually needed >=1.0). The follow-up install - # of workspace/requirements.txt then upgrades a2a-sdk to the - # constraint our runtime image actually pins. The import smoke - # below verifies the upgraded combination is consistent. - run: | - python -m venv /tmp/venv - /tmp/venv/bin/pip install --upgrade pip - /tmp/venv/bin/pip install molecule-ai-workspace-runtime - /tmp/venv/bin/pip install -r workspace/requirements.txt - /tmp/venv/bin/pip show molecule-ai-workspace-runtime a2a-sdk \ - | grep -E '^(Name|Version):' - - name: Smoke import — fail if metadata declares deps that don't satisfy real imports - # WORKSPACE_ID is validated at import time by platform_auth.py — EC2 - # user-data sets it from the cloud-init template; set a placeholder - # here so the import smoke doesn't trip on the env-var guard. 
- env: - WORKSPACE_ID: 00000000-0000-0000-0000-000000000001 - run: | - /tmp/venv/bin/python -c "from molecule_runtime.main import main_sync; print('runtime imports OK')" diff --git a/.github/workflows/runtime-prbuild-compat.yml b/.github/workflows/runtime-prbuild-compat.yml deleted file mode 100644 index 05b1d37c..00000000 --- a/.github/workflows/runtime-prbuild-compat.yml +++ /dev/null @@ -1,152 +0,0 @@ -name: Runtime PR-Built Compatibility - -# Companion to `runtime-pin-compat.yml`. That workflow tests what's -# CURRENTLY PUBLISHED on PyPI; this workflow tests what WOULD BE -# PUBLISHED if THIS PR merges. -# -# Why two workflows: the chicken-and-egg #128 fix added a "PR-built -# wheel" job to the original runtime-pin-compat.yml, but both jobs -# shared a `paths:` filter that was the union of their needs -# (`workspace/**`). That meant the PyPI-latest job ran on every doc -# edit even though the upstream PyPI artifact can't change with our -# workspace/ source. Splitting the two means each gets a narrow -# `paths:` filter that matches the inputs it actually depends on. -# -# Catches the failure mode where a PR adds an import requiring a newer -# SDK than `workspace/requirements.txt` pins: -# 1. Pip resolves the existing PyPI wheel + the old SDK pin → smoke -# passes (it imports the OLD main.py from the wheel, not the PR's -# new main.py). -# 2. Merge → publish-runtime.yml ships a wheel WITH the new import. -# 3. Tenant images redeploy → all crash on first boot with -# ImportError. -# -# By building from the PR's source and smoke-importing THAT wheel, we -# fail at PR-time instead of after publish. -# -# Required-check shape (2026-05-01): the workflow runs on EVERY push + -# PR + merge_group event with no top-level `paths:` filter, then uses a -# detect-changes job + per-step `if:` gates inside ONE always-running -# job named `PR-built wheel + import smoke`. 
PRs that don't touch -# wheel-relevant paths get a no-op SUCCESS check run, satisfying branch -# protection without re-running the heavy build. Same pattern as -# e2e-api.yml — see its comment for the full rationale + the 2026-04-29 -# PR #2264 incident that motivated the always-run-with-if-gates shape. - -on: - push: - branches: [main, staging] - pull_request: - branches: [main, staging] - workflow_dispatch: - merge_group: - types: [checks_requested] - -concurrency: - # Include event_name so a PR sync (event=pull_request) and the - # subsequent staging push (event=push) on the SAME merge SHA don't - # collide in one group. Without event_name, both runs hashed to - # the same key and cancel-in-progress=true cancelled whichever - # arrived second — usually the push run, which staging branch- - # protection then sees as a CANCELLED required check and refuses - # to mark merged. Caught 2026-05-05 across PR #2869's runs (run - # ids 25371863455 / 25371811486 / 25371078157 / 25370403142 — every - # staging push run cancelled, every matching PR run green). - # - # Per memory `feedback_concurrency_group_per_sha.md` — same drift - # class that broke auto-promote-staging on 2026-04-28. Pin invariant: - # event_name + sha is the minimum unique key for these workflows. 
- group: ${{ github.workflow }}-${{ github.event_name }}-${{ github.event.pull_request.head.sha || github.sha }} - cancel-in-progress: true - -jobs: - detect-changes: - runs-on: ubuntu-latest - outputs: - wheel: ${{ steps.decide.outputs.wheel }} - steps: - - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - - uses: dorny/paths-filter@fbd0ab8f3e69293af611ebaee6363fc25e6d187d # v4.0.1 - id: filter - with: - filters: | - wheel: - - 'workspace/**' - - 'scripts/build_runtime_package.py' - - 'scripts/wheel_smoke.py' - - '.github/workflows/runtime-prbuild-compat.yml' - - id: decide - # Always run real work for manual dispatch + merge_group — no - # diff-against-base in those contexts, and the gate exists to - # validate the to-be-merged state regardless of which paths it - # touched (paths-filter would default to "no changes" which is - # the wrong answer when the queue is composing many PRs). - run: | - if [ "${{ github.event_name }}" = "workflow_dispatch" ] || [ "${{ github.event_name }}" = "merge_group" ]; then - echo "wheel=true" >> "$GITHUB_OUTPUT" - else - echo "wheel=${{ steps.filter.outputs.wheel }}" >> "$GITHUB_OUTPUT" - fi - - # ONE job (no job-level `if:`) that always runs and reports under the - # required-check name `PR-built wheel + import smoke`. Real work is - # gated per-step on `needs.detect-changes.outputs.wheel`. Same shape - # as e2e-api.yml's e2e-api job — see its comment block for the full - # rationale (SKIPPED check runs block branch protection even with - # SUCCESS siblings; collapsing to one always-run job emits exactly - # one SUCCESS check run). - local-build-install: - needs: detect-changes - name: PR-built wheel + import smoke - runs-on: ubuntu-latest - steps: - - name: No-op pass (paths filter excluded this commit) - if: needs.detect-changes.outputs.wheel != 'true' - run: | - echo "No workspace/ / scripts/{build_runtime_package,wheel_smoke}.py / workflow changes — wheel gate satisfied without rebuilding." 
- echo "::notice::PR-built wheel + import smoke no-op pass (paths filter excluded this commit)." - - if: needs.detect-changes.outputs.wheel == 'true' - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - - if: needs.detect-changes.outputs.wheel == 'true' - uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0 - with: - python-version: '3.11' - cache: pip - cache-dependency-path: workspace/requirements.txt - - name: Install build tooling - if: needs.detect-changes.outputs.wheel == 'true' - run: pip install build - - name: Build wheel from PR source (mirrors publish-runtime.yml) - if: needs.detect-changes.outputs.wheel == 'true' - # Use a fixed test version so the wheel filename is predictable. - # Doesn't reach PyPI — this build is local-only for the smoke. - # Use the SAME build script with the SAME args as - # publish-runtime.yml's build step. The temp dir path differs - # (`/tmp/runtime-build` here vs `${{ runner.temp }}/runtime-build` - # in publish-runtime.yml — they coincide on ubuntu-latest but - # the call sites are not byte-identical). The smoke import is - # also intentionally narrower than publish's: this gate exists - # to catch SDK-version-import drift specifically; full invariant - # coverage lives in publish-runtime.yml's own pre-PyPI smoke. 
- run: | - python scripts/build_runtime_package.py \ - --version "0.0.0.dev0+pin-compat" \ - --out /tmp/runtime-build - cd /tmp/runtime-build && python -m build - - name: Install built wheel + workspace requirements - if: needs.detect-changes.outputs.wheel == 'true' - run: | - python -m venv /tmp/venv-built - /tmp/venv-built/bin/pip install --upgrade pip - /tmp/venv-built/bin/pip install /tmp/runtime-build/dist/*.whl - /tmp/venv-built/bin/pip install -r workspace/requirements.txt - /tmp/venv-built/bin/pip show molecule-ai-workspace-runtime a2a-sdk \ - | grep -E '^(Name|Version):' - - name: Smoke import the PR-built wheel - if: needs.detect-changes.outputs.wheel == 'true' - # Same script publish-runtime.yml runs against the to-be-PyPI wheel. - # Closes the PR-time vs publish-time gap: a PR adding a new SDK - # call-shape no longer passes here (narrow `import main_sync`) only - # to fail post-merge in publish-runtime's broader smoke. - run: | - /tmp/venv-built/bin/python "$GITHUB_WORKSPACE/scripts/wheel_smoke.py" diff --git a/.github/workflows/secret-pattern-drift.yml b/.github/workflows/secret-pattern-drift.yml deleted file mode 100644 index 2517fea9..00000000 --- a/.github/workflows/secret-pattern-drift.yml +++ /dev/null @@ -1,58 +0,0 @@ -name: SECRET_PATTERNS drift lint - -# Detects when the canonical SECRET_PATTERNS array in -# .github/workflows/secret-scan.yml diverges from known consumer -# mirrors (workspace-runtime's bundled pre-commit hook today; more -# can be added as the consumer set grows). -# -# Why this exists: every side that scans for credentials has its own -# copy of the pattern list. They drift — most recently the runtime -# hook lagged the canonical by one pattern (sk-cp- / MiniMax F1088), -# so a developer's local pre-commit would let a sk-cp- token through -# while the org-wide CI scan would refuse it. The cost of that drift -# is dev confusion + delayed feedback; the fix is automated detection. -# -# Triggers: -# - schedule: daily 05:00 UTC. 
Catches drift introduced by edits -# to a consumer copy that didn't update canonical here. -# - push to main/staging where the canonical or this lint changed: -# catches the inverse — canonical updated but consumers not yet -# bumped. The lint will fail the push; that's intentional, the -# person editing canonical is the right person to also update -# the consumer. -# - workflow_dispatch: ad-hoc operator runs. - -on: - schedule: - # 05:00 UTC = 22:00 PT / 01:00 ET. Quiet hours so a failure - # email lands when humans are starting their day, not - # interrupting it. - - cron: "0 5 * * *" - push: - branches: [main, staging] - paths: - - ".github/workflows/secret-scan.yml" - - ".github/workflows/secret-pattern-drift.yml" - - ".github/scripts/lint_secret_pattern_drift.py" - - ".githooks/pre-commit" - workflow_dispatch: - -# GITHUB_TOKEN scoped to read-only. The lint only does git checkout -# + HTTPS GETs to public consumer files; no writes to anything. -permissions: - contents: read - -jobs: - lint: - name: Detect SECRET_PATTERNS drift - runs-on: ubuntu-latest - timeout-minutes: 5 - steps: - - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - - - uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0 - with: - python-version: "3.11" - - - name: Run drift lint - run: python3 .github/scripts/lint_secret_pattern_drift.py diff --git a/.github/workflows/sweep-aws-secrets.yml b/.github/workflows/sweep-aws-secrets.yml deleted file mode 100644 index 39e57978..00000000 --- a/.github/workflows/sweep-aws-secrets.yml +++ /dev/null @@ -1,129 +0,0 @@ -name: Sweep stale AWS Secrets Manager secrets - -# Janitor for per-tenant AWS Secrets Manager secrets -# (`molecule/tenant//bootstrap`) whose backing tenant no -# longer exists. Parallel-shape to sweep-cf-tunnels.yml and -# sweep-cf-orphans.yml — different cloud, same justification. 
-# -# Why this exists separately from a long-term reconciler integration: -# - molecule-controlplane's tenant_resources audit table (mig 024) -# currently tracks four resource kinds: CloudflareTunnel, -# CloudflareDNS, EC2Instance, SecurityGroup. SecretsManager is -# not in the list, so the existing reconciler doesn't catch -# orphan secrets. -# - At ~$0.40/secret/month the cost grew to ~$19/month before this -# sweeper was written, indicating ~45+ orphan secrets from -# crashed provisions and incomplete deprovision flows. -# - The proper fix (KindSecretsManagerSecret + recorder hook + -# reconciler enumerator) is filed as a separate controlplane -# issue. This sweeper is the immediate cost-relief stopgap. -# -# IAM principal: AWS_JANITOR_ACCESS_KEY_ID / AWS_JANITOR_SECRET_ACCESS_KEY. -# This is a DEDICATED principal — the production `molecule-cp` IAM -# user lacks `secretsmanager:ListSecrets` (it only has -# Get/Create/Update/Delete on specific resources, scoped to its -# operational needs). The janitor needs ListSecrets across the -# `molecule/tenant/*` prefix, which warrants a separate principal so -# we don't broaden the prod-CP policy. -# -# Safety: the script's MAX_DELETE_PCT gate (default 50%, mirroring -# sweep-cf-orphans.yml — tenant secrets are durable by design, unlike -# the mostly-orphan tunnels) refuses to nuke past the threshold. - -on: - schedule: - # Hourly at :30 — offsets from sweep-cf-orphans (:15) and - # sweep-cf-tunnels (:45) so the three janitors don't burst the - # CP admin endpoints at the same minute. 
- - cron: '30 * * * *' - workflow_dispatch: - inputs: - dry_run: - description: "Dry run only — list what would be deleted, no deletion" - required: false - type: boolean - default: true - max_delete_pct: - description: "Override safety gate (default 50, set higher only for major cleanup)" - required: false - default: "50" - grace_hours: - description: "Skip secrets created within this many hours (default 24)" - required: false - default: "24" - -# Don't let two sweeps race the same AWS account. -concurrency: - group: sweep-aws-secrets - cancel-in-progress: false - -permissions: - contents: read - -jobs: - sweep: - name: Sweep AWS Secrets Manager - runs-on: ubuntu-latest - # 30 min cap, mirroring the other janitors. AWS DeleteSecret is - # fast (~0.3s/call) so even a 100+ backlog drains in seconds - # under the 8-way xargs parallelism, but the cap is set generously - # to leave headroom for any actual API hang. - timeout-minutes: 30 - env: - AWS_REGION: ${{ secrets.AWS_REGION || 'us-east-1' }} - AWS_ACCESS_KEY_ID: ${{ secrets.AWS_JANITOR_ACCESS_KEY_ID }} - AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_JANITOR_SECRET_ACCESS_KEY }} - CP_PROD_ADMIN_TOKEN: ${{ secrets.CP_PROD_ADMIN_TOKEN }} - CP_STAGING_ADMIN_TOKEN: ${{ secrets.CP_STAGING_ADMIN_TOKEN }} - MAX_DELETE_PCT: ${{ github.event.inputs.max_delete_pct || '50' }} - GRACE_HOURS: ${{ github.event.inputs.grace_hours || '24' }} - - steps: - - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - - - name: Verify required secrets present - id: verify - # Schedule-vs-dispatch behaviour split mirrors sweep-cf-orphans - # and sweep-cf-tunnels (hardened 2026-04-28). 
Same principle: - # - schedule → exit 1 on missing secrets (red CI surfaces it) - # - workflow_dispatch → exit 0 with warning (operator-driven, - # they already accepted the repo state) - run: | - missing=() - for var in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY CP_PROD_ADMIN_TOKEN CP_STAGING_ADMIN_TOKEN; do - if [ -z "${!var:-}" ]; then - missing+=("$var") - fi - done - if [ ${#missing[@]} -gt 0 ]; then - if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then - echo "::warning::skipping sweep — secrets not configured: ${missing[*]}" - echo "::warning::set them at Settings → Secrets and Variables → Actions, then rerun." - echo "::warning::AWS_JANITOR_* must belong to a principal with secretsmanager:ListSecrets and secretsmanager:DeleteSecret on molecule/tenant/* (the prod molecule-cp principal lacks ListSecrets)." - echo "skip=true" >> "$GITHUB_OUTPUT" - exit 0 - fi - echo "::error::sweep cannot run — required secrets missing: ${missing[*]}" - echo "::error::set them at Settings → Secrets and Variables → Actions, or disable this workflow." - echo "::error::AWS_JANITOR_* must belong to a principal with secretsmanager:ListSecrets and secretsmanager:DeleteSecret on molecule/tenant/*." - exit 1 - fi - echo "All required secrets present ✓" - echo "skip=false" >> "$GITHUB_OUTPUT" - - - name: Run sweep - if: steps.verify.outputs.skip != 'true' - # Schedule-vs-dispatch dry-run asymmetry mirrors sweep-cf-tunnels: - # - Scheduled: input empty → "false" → --execute (the whole - # point of an hourly janitor). - # - Manual workflow_dispatch: input default true → dry-run; - # operator must flip it to actually delete. 
- run: | - set -euo pipefail - if [ "${{ github.event.inputs.dry_run || 'false' }}" = "true" ]; then - echo "Running in dry-run mode — no deletions" - bash scripts/ops/sweep-aws-secrets.sh - else - echo "Running with --execute — will delete identified orphans" - bash scripts/ops/sweep-aws-secrets.sh --execute - fi diff --git a/.github/workflows/sweep-cf-orphans.yml b/.github/workflows/sweep-cf-orphans.yml deleted file mode 100644 index f55c806b..00000000 --- a/.github/workflows/sweep-cf-orphans.yml +++ /dev/null @@ -1,146 +0,0 @@ -name: Sweep stale Cloudflare DNS records - -# Janitor for Cloudflare DNS records whose backing tenant/workspace no -# longer exists. Without this loop, every short-lived E2E or canary -# leaves a CF record on the moleculesai.app zone — the zone has a -# 200-record quota (controlplane#239 hit it 2026-04-23+) and provisions -# start failing with code 81045 once exhausted. -# -# Why a separate workflow vs sweep-stale-e2e-orgs.yml: -# - That workflow operates at the CP layer (DELETE /cp/admin/tenants/:slug -# drives the cascade). It assumes CP has the org row to drive the -# deprovision from. It doesn't catch records left behind when CP -# itself never knew about the tenant (canary scratch, manual ops -# experiments) or when the cascade's CF-delete branch failed. -# - sweep-cf-orphans.sh enumerates the CF zone directly and matches -# each record against live CP slugs + AWS EC2 names. It catches -# leaks the CP-driven sweep can't. -# -# Safety: the script's own MAX_DELETE_PCT gate refuses to nuke more -# than 50% of records in a single run. If something has gone weird -# (CP admin endpoint returns no orgs → every tenant looks orphan) the -# gate halts before damage. Decision-function unit tests in -# scripts/ops/test_sweep_cf_decide.py (#2027) cover the rule -# classifier. - -on: - schedule: - # Hourly. Mirrors sweep-stale-e2e-orgs cadence so the two janitors - # converge on the same tick. 
CF API rate budget is generous (1200 - # req/5min); a single sweep makes ~1 list + N deletes (N<=quota/2). - - cron: '15 * * * *' # offset from sweep-stale-e2e-orgs (top of hour) - workflow_dispatch: - inputs: - dry_run: - description: "Dry run only — list what would be deleted, no deletion" - required: false - type: boolean - default: true - max_delete_pct: - description: "Override safety gate (default 50, set higher only for major cleanup)" - required: false - default: "50" - # No `merge_group:` trigger on purpose. This is a janitor — it doesn't - # need to gate merges, and including it as written before #2088 fired - # the full sweep job (or its secret-check) on every PR going through - # the merge queue, generating one red CI run per merge-queue eval. If - # this workflow is ever wired up as a required check, re-add - # merge_group: { types: [checks_requested] } - # AND gate the sweep step with `if: github.event_name != 'merge_group'` - # so merge-queue evals report success without actually running. - -# Don't let two sweeps race the same zone. workflow_dispatch during a -# scheduled run would otherwise issue duplicate DELETE calls. -concurrency: - group: sweep-cf-orphans - cancel-in-progress: false - -permissions: - contents: read - -jobs: - sweep: - name: Sweep CF orphans - runs-on: ubuntu-latest - # 3 min surfaces hangs (CF API stall, AWS describe-instances stuck) - # within one cron interval instead of burning a full tick. Realistic - # worst case is ~2 min: 4 sequential curls + 1 aws + N×CF-DELETE - # each individually capped at 10s by the script's curl -m flag. 
- timeout-minutes: 3 - env: - CF_API_TOKEN: ${{ secrets.CF_API_TOKEN }} - CF_ZONE_ID: ${{ secrets.CF_ZONE_ID }} - CP_PROD_ADMIN_TOKEN: ${{ secrets.CP_PROD_ADMIN_TOKEN }} - CP_STAGING_ADMIN_TOKEN: ${{ secrets.CP_STAGING_ADMIN_TOKEN }} - AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }} - AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }} - AWS_DEFAULT_REGION: us-east-2 - MAX_DELETE_PCT: ${{ github.event.inputs.max_delete_pct || '50' }} - - steps: - - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - - - name: Verify required secrets present - id: verify - # Schedule-vs-dispatch behaviour split (hardened 2026-04-28 - # after the silent-no-op incident below): - # - # The earlier soft-skip-on-schedule policy hid a real leak. All - # six secrets were unset on this repo for an unknown duration; - # every hourly run printed a yellow ::warning:: and exited 0, - # so the workflow registered as "passing" while doing nothing. - # CF orphans accumulated to 152/200 (~76% of the zone quota - # gone) before a manual `dig`-driven audit caught it. Anything - # that runs as a janitor and reports green while idle is - # indistinguishable from "the janitor is healthy" — so we now - # treat schedule (and any future workflow_run/push triggers) - # as a hard-fail when secrets are missing. 
- # - # - schedule / workflow_run / push → exit 1 (red CI run - # surfaces the misconfiguration the next tick) - # - workflow_dispatch → exit 0 with a warning - # (an operator ran this ad-hoc; they already accepted the - # state of the repo and want the workflow to short-circuit - # so they can rerun after fixing the secret) - run: | - missing=() - for var in CF_API_TOKEN CF_ZONE_ID CP_PROD_ADMIN_TOKEN CP_STAGING_ADMIN_TOKEN AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do - if [ -z "${!var:-}" ]; then - missing+=("$var") - fi - done - if [ ${#missing[@]} -gt 0 ]; then - if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then - echo "::warning::skipping sweep — secrets not configured: ${missing[*]}" - echo "::warning::set them at Settings → Secrets and Variables → Actions, then rerun." - echo "skip=true" >> "$GITHUB_OUTPUT" - exit 0 - fi - echo "::error::sweep cannot run — required secrets missing: ${missing[*]}" - echo "::error::set them at Settings → Secrets and Variables → Actions, or disable this workflow." - echo "::error::a silent skip masked an active CF DNS leak (152/200 zone records) caught only by a manual audit on 2026-04-28; this gate exists to make the gap visible." - exit 1 - fi - echo "All required secrets present ✓" - echo "skip=false" >> "$GITHUB_OUTPUT" - - - name: Run sweep - if: steps.verify.outputs.skip != 'true' - # Schedule-vs-dispatch dry-run asymmetry (intentional): - # - Scheduled runs: github.event.inputs.dry_run is empty → - # defaults to "false" below → script runs with --execute - # (the whole point of an hourly janitor). - # - Manual workflow_dispatch: input default is true (line 38) - # so an ad-hoc operator-triggered run is dry-run by default; - # they have to flip the toggle to actually delete. - # The script's MAX_DELETE_PCT gate (default 50%) is the second - # line of defense regardless of mode. 
- run: | - set -euo pipefail - if [ "${{ github.event.inputs.dry_run || 'false' }}" = "true" ]; then - echo "Running in dry-run mode — no deletions" - bash scripts/ops/sweep-cf-orphans.sh - else - echo "Running with --execute — will delete identified orphans" - bash scripts/ops/sweep-cf-orphans.sh --execute - fi diff --git a/.github/workflows/sweep-cf-tunnels.yml b/.github/workflows/sweep-cf-tunnels.yml deleted file mode 100644 index 12d5c47e..00000000 --- a/.github/workflows/sweep-cf-tunnels.yml +++ /dev/null @@ -1,124 +0,0 @@ -name: Sweep stale Cloudflare Tunnels - -# Janitor for Cloudflare Tunnels whose backing tenant no longer -# exists. Parallel-shape to sweep-cf-orphans.yml (which sweeps DNS -# records); same justification, different CF resource. -# -# Why this exists separately from sweep-cf-orphans: -# - DNS records live on the zone (`/zones//dns_records`). -# - Tunnels live on the account (`/accounts//cfd_tunnel`). -# - Different CF API surface, different scopes; the existing CF -# token might not have `account:cloudflare_tunnel:edit`. Splitting -# the workflows keeps each one's secret-presence gate independent -# so neither silent-skips when the other's secret is missing. -# - Cleaner blast radius — operators can disable one without the -# other if a regression surfaces. -# -# Safety: the script's MAX_DELETE_PCT gate (default 90% — higher than -# the DNS sweep's 50% because tenant-shaped tunnels are mostly -# orphans by design) refuses to nuke past the threshold. - -on: - schedule: - # Hourly at :45 — offset from sweep-cf-orphans (:15) so the two - # janitors don't issue parallel CF API bursts at the same minute. 
- - cron: '45 * * * *' - workflow_dispatch: - inputs: - dry_run: - description: "Dry run only — list what would be deleted, no deletion" - required: false - type: boolean - default: true - max_delete_pct: - description: "Override safety gate (default 90, set higher only for major cleanup)" - required: false - default: "90" - -# Don't let two sweeps race the same account. -concurrency: - group: sweep-cf-tunnels - cancel-in-progress: false - -permissions: - contents: read - -jobs: - sweep: - name: Sweep CF tunnels - runs-on: ubuntu-latest - # 30 min cap. Was 5 min on the theory that the only thing that - # could take >5min is a CF-API hang — but on 2026-05-02 a backlog - # of 672 stale tunnels accumulated (large staging E2E run + delayed - # sweep) and the serial `curl -X DELETE` loop (~0.7s/tunnel) needed - # ~7-8min to drain. The 5-min cap killed the run mid-sweep - # (cancelled at 424/672, see run 25248788312); a manual rerun - # finished the remainder fine. - # - # The fix is two-part: parallelize the delete loop (8-way xargs in - # the script — see scripts/ops/sweep-cf-tunnels.sh), AND raise the - # cap so a one-off backlog doesn't trip a hangs-detector that - # turned out to be a real-job-too-slow detector. With 8-way - # parallelism, 600+ tunnels drains in ~60s; 30 min is generous - # headroom for actual hangs to still surface (and is in line with - # the sweep-cf-orphans companion job). 
- timeout-minutes: 30 - env: - CF_API_TOKEN: ${{ secrets.CF_API_TOKEN }} - CF_ACCOUNT_ID: ${{ secrets.CF_ACCOUNT_ID }} - CP_PROD_ADMIN_TOKEN: ${{ secrets.CP_PROD_ADMIN_TOKEN }} - CP_STAGING_ADMIN_TOKEN: ${{ secrets.CP_STAGING_ADMIN_TOKEN }} - MAX_DELETE_PCT: ${{ github.event.inputs.max_delete_pct || '90' }} - - steps: - - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - - - name: Verify required secrets present - id: verify - # Schedule-vs-dispatch behaviour split mirrors sweep-cf-orphans - # (hardened 2026-04-28 after the silent-no-op incident: the - # janitor reported green while doing nothing because secrets - # were unset, masking a 152/200 zone-record leak). Same - # principle applies here: - # - schedule → exit 1 on missing secrets (red CI surfaces it) - # - workflow_dispatch → exit 0 with warning (operator-driven, - # they already accepted the repo state) - run: | - missing=() - for var in CF_API_TOKEN CF_ACCOUNT_ID CP_PROD_ADMIN_TOKEN CP_STAGING_ADMIN_TOKEN; do - if [ -z "${!var:-}" ]; then - missing+=("$var") - fi - done - if [ ${#missing[@]} -gt 0 ]; then - if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then - echo "::warning::skipping sweep — secrets not configured: ${missing[*]}" - echo "::warning::set them at Settings → Secrets and Variables → Actions, then rerun." - echo "::warning::CF_API_TOKEN must include account:cloudflare_tunnel:edit scope (separate from the zone:dns:edit scope used by sweep-cf-orphans)." - echo "skip=true" >> "$GITHUB_OUTPUT" - exit 0 - fi - echo "::error::sweep cannot run — required secrets missing: ${missing[*]}" - echo "::error::set them at Settings → Secrets and Variables → Actions, or disable this workflow." - echo "::error::CF_API_TOKEN must include account:cloudflare_tunnel:edit scope." 
- exit 1 - fi - echo "All required secrets present ✓" - echo "skip=false" >> "$GITHUB_OUTPUT" - - - name: Run sweep - if: steps.verify.outputs.skip != 'true' - # Schedule-vs-dispatch dry-run asymmetry mirrors sweep-cf-orphans: - # - Scheduled: input empty → "false" → --execute (the whole - # point of an hourly janitor). - # - Manual workflow_dispatch: input default true → dry-run; - # operator must flip it to actually delete. - run: | - set -euo pipefail - if [ "${{ github.event.inputs.dry_run || 'false' }}" = "true" ]; then - echo "Running in dry-run mode — no deletions" - bash scripts/ops/sweep-cf-tunnels.sh - else - echo "Running with --execute — will delete identified orphans" - bash scripts/ops/sweep-cf-tunnels.sh --execute - fi diff --git a/.github/workflows/sweep-stale-e2e-orgs.yml b/.github/workflows/sweep-stale-e2e-orgs.yml deleted file mode 100644 index 18bec191..00000000 --- a/.github/workflows/sweep-stale-e2e-orgs.yml +++ /dev/null @@ -1,239 +0,0 @@ -name: Sweep stale e2e-* orgs (staging) - -# Janitor for staging tenants left behind when E2E cleanup didn't run: -# CI cancellations, runner crashes, transient AWS errors mid-cascade, -# bash trap missed (signal 9), etc. Without this loop, every failed -# teardown leaks an EC2 + DNS + DB row until manual ops cleanup — -# 2026-04-23 staging hit the 64 vCPU AWS quota from ~27 such orphans. -# -# Why not rely on per-test-run teardown: -# - Per-run teardown is best-effort by definition. Any process death -# after the test starts but before the trap fires leaves debris. -# - GH Actions cancellation kills the runner without grace period. -# The workflow's `if: always()` step usually catches this, but it -# too can fail (CP transient 5xx, runner network issue at the -# wrong moment). -# - Even when teardown runs, the CP cascade is best-effort in places -# (cascadeTerminateWorkspaces logs+continues; DNS deletion same). 
-# - This sweep is the catch-all that converges staging back to clean -# regardless of which specific path leaked. -# -# The PROPER fix is making CP cleanup transactional + verify-after- -# terminate (filed separately as cleanup-correctness work). This -# workflow is the safety net that catches everything else AND any -# future leak source we haven't yet identified. - -on: - schedule: - # Every 15 min. E2E orgs are short-lived (~8-25 min wall clock from - # create to teardown — canary is ~8 min, full SaaS ~25 min). The - # previous hourly + 120-min stale threshold meant a leaked tenant - # could keep an EC2 alive for up to 2 hours, eating ~2 vCPU per - # leak. Tightening the cadence + threshold reduces the worst-case - # leak window from 120 min to ~45 min (15-min sweep cadence + 30-min - # threshold) without risk of catching in-progress runs (the longest - # e2e run is the 25-min canary, well under the 30-min threshold). - # See molecule-controlplane#420 for the leak-class accounting that - # motivated this tightening. - - cron: '*/15 * * * *' - workflow_dispatch: - inputs: - max_age_minutes: - description: "Delete e2e-* orgs older than N minutes (default 30)" - required: false - default: "30" - dry_run: - description: "Dry run only — list what would be deleted" - required: false - type: boolean - default: false - -# Don't let two sweeps fight. Cron + workflow_dispatch could overlap -# on a manual trigger; queue rather than parallel-delete. -concurrency: - group: sweep-stale-e2e-orgs - cancel-in-progress: false - -permissions: - contents: read - -jobs: - sweep: - name: Sweep e2e orgs - runs-on: ubuntu-latest - timeout-minutes: 15 - env: - MOLECULE_CP_URL: https://staging-api.moleculesai.app - ADMIN_TOKEN: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKEN }} - MAX_AGE_MINUTES: ${{ github.event.inputs.max_age_minutes || '30' }} - DRY_RUN: ${{ github.event.inputs.dry_run || 'false' }} - # Refuse to delete more than this many orgs in one tick. 
If the - # CP DB is briefly empty (or the admin endpoint goes weird and - # returns no created_at), every e2e- org would look stale. - # Bailing protects against runaway nukes. - SAFETY_CAP: 50 - - steps: - - name: Verify admin token present - run: | - if [ -z "$ADMIN_TOKEN" ]; then - echo "::error::MOLECULE_STAGING_ADMIN_TOKEN not set" - exit 2 - fi - echo "Admin token present ✓" - - - name: Identify stale e2e orgs - id: identify - run: | - set -euo pipefail - # Fetch into a file so the python step reads it via stdin — - # cleaner than embedding $(curl ...) into a heredoc. - curl -sS --fail-with-body --max-time 30 \ - "$MOLECULE_CP_URL/cp/admin/orgs?limit=500" \ - -H "Authorization: Bearer $ADMIN_TOKEN" \ - > orgs.json - - # Filter: - # 1. slug starts with one of the ephemeral test prefixes: - # - 'e2e-' — covers e2e-canary-, e2e-canvas-*, etc. - # - 'rt-e2e-' — runtime-test harness fixtures (RFC #2251); - # missing this prefix left two such tenants - # orphaned 8h on staging (2026-05-03), then - # hard-failed redeploy-tenants-on-staging - # and broke the staging→main auto-promote - # chain. Kept in sync with the EPHEMERAL_PREFIX_RE - # regex in redeploy-tenants-on-staging.yml. - # 2. created_at is older than MAX_AGE_MINUTES ago - # Output one slug per line to a file the next step reads. - python3 > stale_slugs.txt <<'PY' - import json, os - from datetime import datetime, timezone, timedelta - # SSOT for this list lives in the controlplane Go code: - # molecule-controlplane/internal/slugs/ephemeral.go - # (var EphemeralPrefixes). The redeploy-fleet auto-rollout - # also reads from there to SKIP these slugs — without that - # filter, fleet redeploy SSM-failed in-flight E2E tenants - # whose containers were still booting, breaking the test - # that just spun them up (molecule-controlplane#493). - # Update both files together. 
- EPHEMERAL_PREFIXES = ("e2e-", "rt-e2e-") - with open("orgs.json") as f: - data = json.load(f) - max_age = int(os.environ["MAX_AGE_MINUTES"]) - cutoff = datetime.now(timezone.utc) - timedelta(minutes=max_age) - for o in data.get("orgs", []): - slug = o.get("slug", "") - if not slug.startswith(EPHEMERAL_PREFIXES): - continue - created = o.get("created_at") - if not created: - # Defensively skip rows without created_at — better - # to leave one orphan than nuke a brand-new row - # whose timestamp didn't render. - continue - # Python 3.11+ handles RFC3339 with Z directly via - # fromisoformat; older runners need the trailing Z swap. - created_dt = datetime.fromisoformat(created.replace("Z", "+00:00")) - if created_dt < cutoff: - print(slug) - PY - - count=$(wc -l < stale_slugs.txt | tr -d ' ') - echo "Found $count stale e2e org(s) older than ${MAX_AGE_MINUTES}m" - if [ "$count" -gt 0 ]; then - echo "First 20:" - head -20 stale_slugs.txt | sed 's/^/ /' - fi - echo "count=$count" >> "$GITHUB_OUTPUT" - - - name: Safety gate - if: steps.identify.outputs.count != '0' - run: | - count="${{ steps.identify.outputs.count }}" - if [ "$count" -gt "$SAFETY_CAP" ]; then - echo "::error::Refusing to delete $count orgs in one sweep (cap=$SAFETY_CAP). Investigate manually — this usually means the CP admin API returned no created_at or returned a degraded result. Re-run with workflow_dispatch + max_age_minutes if intentional." - exit 1 - fi - echo "Within safety cap ($count ≤ $SAFETY_CAP) ✓" - - - name: Delete stale orgs - if: steps.identify.outputs.count != '0' && env.DRY_RUN != 'true' - run: | - set -uo pipefail - deleted=0 - failed=0 - while IFS= read -r slug; do - [ -z "$slug" ] && continue - # The DELETE handler requires {"confirm": ""} matching - # the URL slug — fat-finger guard. Idempotent: re-issuing - # picks up via org_purges.last_step. - # Tempfile-routed -w + set +e/-e prevents curl-exit-code - # pollution of the captured status (lint-curl-status-capture.yml). 
- set +e - curl -sS -o /tmp/del_resp -w "%{http_code}" \ - --max-time 60 \ - -X DELETE "$MOLECULE_CP_URL/cp/admin/tenants/$slug" \ - -H "Authorization: Bearer $ADMIN_TOKEN" \ - -H "Content-Type: application/json" \ - -d "{\"confirm\":\"$slug\"}" >/tmp/del_code - set -e - # Stderr from curl (-sS shows dial errors etc.) goes to runner log. - http_code=$(cat /tmp/del_code 2>/dev/null || echo "000") - if [ "$http_code" = "200" ] || [ "$http_code" = "204" ]; then - deleted=$((deleted+1)) - echo " deleted: $slug" - else - failed=$((failed+1)) - echo " FAILED ($http_code): $slug — $(cat /tmp/del_resp 2>/dev/null | head -c 200)" - fi - done < stale_slugs.txt - echo "" - echo "Sweep summary: deleted=$deleted failed=$failed" - # Don't fail the workflow on per-org delete errors — the - # sweeper is best-effort. Next hourly tick re-attempts. We - # only fail loud at the safety-cap gate above. - - - name: Sweep orphan tunnels - # Stale-org cleanup deletes the org (which cascades to tunnel - # delete inside the CP). But when that cascade fails partway — - # CP transient 5xx after the org row is deleted but before the - # CF tunnel delete completes — the tunnel persists with no - # matching org row. The reconciler in internal/sweep flags this - # as `cf_tunnel kind=orphan`, but nothing automatically reaps it. - # - # `/cp/admin/orphan-tunnels/cleanup` is the operator-triggered - # reaper. Calling it here at the end of every sweep tick - # converges the staging CF account to clean even when CP - # cascades half-fail. - # - # PR #492 made the underlying DeleteTunnel actually check - # status — pre-fix it silent-succeeded on CF code 1022 - # ("active connections"), so this step would have been a no-op - # against stuck connectors. Post-fix the cleanup invokes - # CleanupTunnelConnections + retry, which actually clears the - # 1022 case. (#2987) - # - # Best-effort. Failure here doesn't fail the workflow — next - # tick re-attempts. Errors flow to step output for ops review. 
- if: env.DRY_RUN != 'true' - run: | - set +e - curl -sS -o /tmp/cleanup_resp -w "%{http_code}" \ - --max-time 60 \ - -X POST "$MOLECULE_CP_URL/cp/admin/orphan-tunnels/cleanup" \ - -H "Authorization: Bearer $ADMIN_TOKEN" >/tmp/cleanup_code - set -e - http_code=$(cat /tmp/cleanup_code 2>/dev/null || echo "000") - body=$(cat /tmp/cleanup_resp 2>/dev/null | head -c 500) - if [ "$http_code" = "200" ]; then - count=$(echo "$body" | python3 -c "import sys,json; d=json.loads(sys.stdin.read() or '{}'); print(d.get('deleted_count', 0))" 2>/dev/null || echo "0") - failed_n=$(echo "$body" | python3 -c "import sys,json; d=json.loads(sys.stdin.read() or '{}'); print(len(d.get('failed') or {}))" 2>/dev/null || echo "0") - echo "Orphan-tunnel sweep: deleted=$count failed=$failed_n" - else - echo "::warning::orphan-tunnels cleanup returned HTTP $http_code — body: $body" - fi - - - name: Dry-run summary - if: env.DRY_RUN == 'true' - run: | - echo "DRY RUN — would have deleted ${{ steps.identify.outputs.count }} org(s) AND triggered orphan-tunnels cleanup. Re-run with dry_run=false to actually delete." diff --git a/.github/workflows/test-ops-scripts.yml b/.github/workflows/test-ops-scripts.yml deleted file mode 100644 index 6b25387c..00000000 --- a/.github/workflows/test-ops-scripts.yml +++ /dev/null @@ -1,52 +0,0 @@ -name: Ops Scripts Tests - -# Runs the unittest suite for scripts/ on every PR + push that touches -# anything under scripts/. Kept separate from the main CI so a script-only -# change doesn't trigger the heavier Go/Canvas/Python pipelines. -# -# Discovery layout: tests sit alongside the code they test (see -# scripts/ops/test_sweep_cf_decide.py for the pattern; scripts/ -# test_build_runtime_package.py for the rewriter coverage). The job -# below runs `unittest discover` TWICE — once from `scripts/`, once -# from `scripts/ops/` — because neither dir has an `__init__.py`, so -# a single discover from `scripts/` doesn't recurse into the ops -# subdir. 
Two passes is simpler than retrofitting namespace packages. - -on: - push: - branches: [main, staging] - paths: - - 'scripts/**' - - '.github/workflows/test-ops-scripts.yml' - pull_request: - branches: [main, staging] - paths: - - 'scripts/**' - - '.github/workflows/test-ops-scripts.yml' - merge_group: - types: [checks_requested] - -concurrency: - group: ${{ github.workflow }}-${{ github.ref }} - cancel-in-progress: true - -jobs: - test: - name: Ops scripts (unittest) - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 - - uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0 - with: - python-version: '3.11' - - name: Run scripts/ unittests (build_runtime_package, …) - # Top-level scripts/ tests live alongside their target file - # (e.g. scripts/test_build_runtime_package.py exercises - # scripts/build_runtime_package.py). discover from scripts/ - # picks up only top-level test_*.py because scripts/ops/ has - # no __init__.py — that's intentional, so we run two passes. - working-directory: scripts - run: python -m unittest discover -t . -p 'test_*.py' -v - - name: Run scripts/ops/ unittests (sweep_cf_decide, …) - working-directory: scripts/ops - run: python -m unittest discover -p 'test_*.py' -v diff --git a/tools/branch-protection/check_name_parity.sh b/tools/branch-protection/check_name_parity.sh index bb73f823..c1337434 100755 --- a/tools/branch-protection/check_name_parity.sh +++ b/tools/branch-protection/check_name_parity.sh @@ -38,7 +38,11 @@ set -euo pipefail SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)" -WORKFLOWS_DIR="$REPO_ROOT/.github/workflows" +# Gitea is the SSOT for CI on molecule-core per task #347 / memory +# reference_molecule_core_actions_gitea_only — workflows live in +# .gitea/workflows/ exclusively. The legacy .github/workflows/ tree was +# deleted in SSOT-Instance-4 (task #331). 
+WORKFLOWS_DIR="$REPO_ROOT/.gitea/workflows" APPLY_SH="$SCRIPT_DIR/apply.sh" if [[ ! -f "$APPLY_SH" ]]; then @@ -46,7 +50,7 @@ if [[ ! -f "$APPLY_SH" ]]; then exit 2 fi if [[ ! -d "$WORKFLOWS_DIR" ]]; then - echo "check_name_parity: missing .github/workflows at $WORKFLOWS_DIR" >&2 + echo "check_name_parity: missing .gitea/workflows at $WORKFLOWS_DIR" >&2 exit 2 fi diff --git a/tools/branch-protection/test_check_name_parity.sh b/tools/branch-protection/test_check_name_parity.sh index 98c9baef..4dccbcde 100755 --- a/tools/branch-protection/test_check_name_parity.sh +++ b/tools/branch-protection/test_check_name_parity.sh @@ -33,12 +33,14 @@ trap '[[ -n "$TMPDIR_FOR_CASE" && -d "$TMPDIR_FOR_CASE" ]] && rm -rf "$TMPDIR_FO # Build a synthetic repo at $1 with apply.sh listing $2 (one name per # line) as the staging required set + zero main required, then write -# whatever .github/workflows/* files the test case adds. +# whatever .gitea/workflows/* files the test case adds. (Pre-SSOT-4 +# this was .github/workflows; molecule-core switched to Gitea-SSOT in +# task #331 and the script now reads from .gitea/workflows/.) make_fake_repo() { local root="$1" local checks="$2" mkdir -p "$root/tools/branch-protection" - mkdir -p "$root/.github/workflows" + mkdir -p "$root/.gitea/workflows" cat > "$root/tools/branch-protection/apply.sh" < "$TMPDIR_FOR_CASE/.github/workflows/$workflow_filename" + printf '%s' "$workflow_yaml" > "$TMPDIR_FOR_CASE/.gitea/workflows/$workflow_filename" local stderr_file stderr_file=$(mktemp) local actual_exit=0