The .github→.gitea migration left 3 secret-name drifts: the ported workflows reference secret-store names that don't match the canonical names. This PR renames the workflow refs so the upcoming secret-store PUT (#425 class-A) lands under the names the workflows actually look up:

- `CP_STAGING_ADMIN_TOKEN` → `CP_STAGING_ADMIN_API_TOKEN` (sweep-aws-secrets, sweep-cf-orphans, sweep-cf-tunnels — peers in redeploy-tenants-on-staging + continuous-synth-e2e already use the `_API_TOKEN` form; semantic precision wins, 3v2 caller split)
- `CP_PROD_ADMIN_TOKEN` → `CP_ADMIN_API_TOKEN` (same 3 sweep workflows — `CP_ADMIN_API_TOKEN` is already the canonical name for the prod variant on molecule-controlplane, and matches ops.sh's `mol_tenants` reading `CP_ADMIN_API_TOKEN` from Railway)
- `MOLECULE_STAGING_OPENAI_KEY` → `MOLECULE_STAGING_OPENAI_API_KEY` (canary-staging, continuous-synth-e2e, e2e-staging-saas — the `_KEY` vs `_API_KEY` drift; peers are `MOLECULE_STAGING_ANTHROPIC_API_KEY` / `MOLECULE_STAGING_MINIMAX_API_KEY`. Confirmed CONSUMED — langgraph + hermes runtime tests use openai/gpt-4o and check for the env var's presence — so renamed, not deleted.)

KEPT as-is (no rename): `CF_ACCOUNT_ID` / `CF_API_TOKEN` / `CF_ZONE_ID` — these are the documented CI-scoped duplicates of the operator-host CLOUDFLARE_* admin names; renaming would touch 3 sweep workflows for zero functional gain. Documented as CI-scoped-dup in the secrets-map follow-up.

Also updated the inline `for var in ...` presence-check loops and the `required_secret_name="..."` error strings so the workflows' diagnostics match the renamed names (see the sketch after this description).

Sequence: this PR merges → #425 class-A PUT populates the secret store under the canonical names → the 3 schedule-only reds (canary-staging, sweep-aws-secrets, continuous-synth-e2e) go green within ~30 min → watchdog #423 auto-closes their [main-red] issues.

Refs: molecule-core#425 (secret-store audit, Section D), internal#297.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
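For concreteness, a before/after sketch of the env-block rename in the three sweep workflows (the surrounding lines are illustrative, not the exact diff):

```yaml
# before (drifted port)
env:
  CP_PROD_ADMIN_TOKEN: ${{ secrets.CP_PROD_ADMIN_TOKEN }}
  CP_STAGING_ADMIN_TOKEN: ${{ secrets.CP_STAGING_ADMIN_TOKEN }}

# after (canonical; the names the #425 class-A PUT will populate)
env:
  CP_ADMIN_API_TOKEN: ${{ secrets.CP_ADMIN_API_TOKEN }}
  CP_STAGING_ADMIN_API_TOKEN: ${{ secrets.CP_STAGING_ADMIN_API_TOKEN }}
```

The updated sweep-aws-secrets.yml below shows the post-rename state, including the matching presence-check loop.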
name: Sweep stale AWS Secrets Manager secrets
# Ported from .github/workflows/sweep-aws-secrets.yml on 2026-05-11 per RFC
# internal#219 §1 sweep. Differences from the GitHub version:
# - Dropped `workflow_dispatch.inputs` (Gitea 1.22.6 parser rejects them
#   per feedback_gitea_workflow_dispatch_inputs_unsupported).
# - Dropped `merge_group:` (no Gitea merge queue).
# - Dropped `environment:` blocks (Gitea has no environments).
# - Workflow-level env.GITHUB_SERVER_URL pinned per
#   feedback_act_runner_github_server_url.
# - `continue-on-error: true` on each job (RFC §1 contract).
#
# Janitor for per-tenant AWS Secrets Manager secrets
# (`molecule/tenant/<org_id>/bootstrap`) whose backing tenant no
# longer exists. Parallel-shape to sweep-cf-tunnels.yml and
# sweep-cf-orphans.yml — different cloud, same justification.
#
# Why this exists separately from a long-term reconciler integration:
# - molecule-controlplane's tenant_resources audit table (mig 024)
#   currently tracks four resource kinds: CloudflareTunnel,
#   CloudflareDNS, EC2Instance, SecurityGroup. SecretsManager is
#   not in the list, so the existing reconciler doesn't catch
#   orphan secrets.
# - At ~$0.40/secret/month the cost grew to ~$19/month before this
#   sweeper was written, indicating ~45+ orphan secrets from
#   crashed provisions and incomplete deprovision flows.
# - The proper fix (KindSecretsManagerSecret + recorder hook +
#   reconciler enumerator) is filed as a separate controlplane
#   issue. This sweeper is the immediate cost-relief stopgap.
#
# IAM principal: AWS_JANITOR_ACCESS_KEY_ID / AWS_JANITOR_SECRET_ACCESS_KEY.
# This is a DEDICATED principal — the production `molecule-cp` IAM
# user lacks `secretsmanager:ListSecrets` (it only has
# Get/Create/Update/Delete on specific resources, scoped to its
# operational needs). The janitor needs ListSecrets across the
# `molecule/tenant/*` prefix, which warrants a separate principal so
# we don't broaden the prod-CP policy.
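#
# Illustrative shape of the janitor policy (an assumption for
# readability, not the deployed policy document):
#   - Action: secretsmanager:ListSecrets
#     Resource: "*"   # ListSecrets has no resource-level scoping
#   - Action: [secretsmanager:DescribeSecret, secretsmanager:DeleteSecret]
#     Resource: arn:aws:secretsmanager:*:*:secret:molecule/tenant/*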
#
# Safety: the script's MAX_DELETE_PCT gate (default 50%, mirroring
# sweep-cf-orphans.yml — tenant secrets are durable by design, unlike
# the mostly-orphan tunnels) refuses to nuke past the threshold.
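#
# A minimal sketch of that gate, with illustrative variable names (the
# real logic lives in scripts/ops/sweep-aws-secrets.sh):
#   if (( orphan_count * 100 > total_count * MAX_DELETE_PCT )); then
#     echo "refusing: ${orphan_count}/${total_count} exceeds ${MAX_DELETE_PCT}%" >&2
#     exit 1
#   fi
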
on:
  schedule:
    # Hourly at :30 — offsets from sweep-cf-orphans (:15) and
    # sweep-cf-tunnels (:45) so the three janitors don't burst the
    # CP admin endpoints at the same minute.
    - cron: '30 * * * *'
  # Bare dispatch retained for operator runs; only its `inputs` were
  # dropped in this port (see header). The event_name checks below
  # depend on this trigger existing.
  workflow_dispatch:

# Don't let two sweeps race the same AWS account.
concurrency:
  group: sweep-aws-secrets
  cancel-in-progress: false

permissions:
  contents: read

env:
  GITHUB_SERVER_URL: https://git.moleculesai.app

jobs:
  sweep:
    name: Sweep AWS Secrets Manager
    runs-on: ubuntu-latest
    # Phase 3 (RFC #219 §1): surface broken workflows without blocking.
    continue-on-error: true
    # 30 min cap, mirroring the other janitors. AWS DeleteSecret is
    # fast (~0.3s/call) so even a 100+ backlog drains in seconds
    # under the 8-way xargs parallelism, but the cap is set generously
    # to leave headroom for any actual API hang.
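    # Illustrative shape of that fan-out (names are assumptions; the
    # real loop lives in scripts/ops/sweep-aws-secrets.sh):
    #   printf '%s\n' "${orphan_arns[@]}" |
    #     xargs -P 8 -n 1 -I{} aws secretsmanager delete-secret \
    #       --secret-id {} --recovery-window-in-days 7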
    timeout-minutes: 30
    env:
      AWS_REGION: ${{ secrets.AWS_REGION || 'us-east-1' }}
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_JANITOR_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_JANITOR_SECRET_ACCESS_KEY }}
      CP_ADMIN_API_TOKEN: ${{ secrets.CP_ADMIN_API_TOKEN }}
      CP_STAGING_ADMIN_API_TOKEN: ${{ secrets.CP_STAGING_ADMIN_API_TOKEN }}
      MAX_DELETE_PCT: ${{ github.event.inputs.max_delete_pct || '50' }}
      GRACE_HOURS: ${{ github.event.inputs.grace_hours || '24' }}

    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2

      - name: Verify required secrets present
        id: verify
        # Schedule-vs-dispatch behaviour split mirrors sweep-cf-orphans
        # and sweep-cf-tunnels (hardened 2026-04-28). Same principle:
        # - schedule → exit 1 on missing secrets (red CI surfaces it)
        # - workflow_dispatch → exit 0 with warning (operator-driven,
        #   they already accepted the repo state)
        run: |
          missing=()
          for var in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY CP_ADMIN_API_TOKEN CP_STAGING_ADMIN_API_TOKEN; do
            if [ -z "${!var:-}" ]; then
              missing+=("$var")
            fi
          done
          if [ ${#missing[@]} -gt 0 ]; then
            if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
              echo "::warning::skipping sweep — secrets not configured: ${missing[*]}"
              echo "::warning::set them at Settings → Secrets and Variables → Actions, then rerun."
              echo "::warning::AWS_JANITOR_* must belong to a principal with secretsmanager:ListSecrets and secretsmanager:DeleteSecret on molecule/tenant/* (the prod molecule-cp principal lacks ListSecrets)."
              echo "skip=true" >> "$GITHUB_OUTPUT"
              exit 0
            fi
            echo "::error::sweep cannot run — required secrets missing: ${missing[*]}"
            echo "::error::set them at Settings → Secrets and Variables → Actions, or disable this workflow."
            echo "::error::AWS_JANITOR_* must belong to a principal with secretsmanager:ListSecrets and secretsmanager:DeleteSecret on molecule/tenant/*."
            exit 1
          fi
          echo "All required secrets present ✓"
          echo "skip=false" >> "$GITHUB_OUTPUT"

      - name: Run sweep
        if: steps.verify.outputs.skip != 'true'
        # Schedule-vs-dispatch dry-run asymmetry mirrors sweep-cf-tunnels:
        # - Scheduled: input empty → "false" → --execute (the whole
        #   point of an hourly janitor).
        # - Manual workflow_dispatch: with `inputs` dropped in this port
        #   the dry_run expression is also empty → --execute here too.
        #   (The GitHub version defaulted dry_run=true so the operator
        #   had to flip it to actually delete.)
        run: |
          set -euo pipefail
          if [ "${{ github.event.inputs.dry_run || 'false' }}" = "true" ]; then
            echo "Running in dry-run mode — no deletions"
            bash scripts/ops/sweep-aws-secrets.sh
          else
            echo "Running with --execute — will delete identified orphans"
            bash scripts/ops/sweep-aws-secrets.sh --execute
          fi