molecule-core/.github/workflows/sweep-aws-secrets.yml
Hongming Wang 6f8f7932d2 feat(ops): add sweep-aws-secrets janitor — orphan tenant bootstrap secrets
CP's deprovision flow calls Secrets.DeleteSecret() (provisioner/ec2.go:806)
but only when the deprovision runs to completion. Crashed provisions and
incomplete teardowns leak the per-tenant `molecule/tenant/<org_id>/bootstrap`
secret. At ~$0.40/secret/month, ~45 leaked secrets surfaced as ~$19/month
on the AWS cost dashboard.

The tenant_resources audit table (mig 024) tracks four kinds today —
CloudflareTunnel, CloudflareDNS, EC2Instance, SecurityGroup — and the
existing reconciler doesn't catch Secrets Manager orphans. The proper fix
(KindSecretsManagerSecret + recorder hook + reconciler enumerator) is filed
as a follow-up controlplane issue. This sweeper is the immediate stopgap.

Parallel-shape to sweep-cf-tunnels.sh:
  - Hourly schedule offset (:30, between sweep-cf-orphans :15 and
    sweep-cf-tunnels :45) so the three janitors don't burst CP admin
    at the same minute.
  - 24h grace window — never deletes a secret younger than the
    provisioning roundtrip, so an in-flight provision can't be racemurdered.
  - MAX_DELETE_PCT=50 default (mirrors sweep-cf-orphans for durable
    resources; tenant secrets should track 1:1 with live tenants).
  - Same schedule-vs-dispatch hardening as the other janitors:
    schedule → hard-fail on missing secrets, dispatch → soft-skip.
  - 8-way xargs parallelism, dry-run by default, --execute to delete.

Requires a dedicated AWS_JANITOR_* IAM principal — the prod molecule-cp
principal lacks secretsmanager:ListSecrets (it only has scoped
Get/Create/Update/Delete). The workflow's verify-secrets step will hard-fail
on the first scheduled run until those secrets are configured, surfacing
the missing setup loudly rather than silently no-op'ing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 02:38:08 -07:00

130 lines
5.9 KiB
YAML

name: Sweep stale AWS Secrets Manager secrets
# Janitor for per-tenant AWS Secrets Manager secrets
# (`molecule/tenant/<org_id>/bootstrap`) whose backing tenant no
# longer exists. Parallel-shape to sweep-cf-tunnels.yml and
# sweep-cf-orphans.yml — different cloud, same justification.
#
# Why this exists separately from a long-term reconciler integration:
# - molecule-controlplane's tenant_resources audit table (mig 024)
# currently tracks four resource kinds: CloudflareTunnel,
# CloudflareDNS, EC2Instance, SecurityGroup. SecretsManager is
# not in the list, so the existing reconciler doesn't catch
# orphan secrets.
# - At ~$0.40/secret/month the cost grew to ~$19/month before this
# sweeper was written, indicating ~45+ orphan secrets from
# crashed provisions and incomplete deprovision flows.
# - The proper fix (KindSecretsManagerSecret + recorder hook +
# reconciler enumerator) is filed as a separate controlplane
# issue. This sweeper is the immediate cost-relief stopgap.
#
# IAM principal: AWS_JANITOR_ACCESS_KEY_ID / AWS_JANITOR_SECRET_ACCESS_KEY.
# This is a DEDICATED principal — the production `molecule-cp` IAM
# user lacks `secretsmanager:ListSecrets` (it only has
# Get/Create/Update/Delete on specific resources, scoped to its
# operational needs). The janitor needs ListSecrets across the
# `molecule/tenant/*` prefix, which warrants a separate principal so
# we don't broaden the prod-CP policy.
#
# Safety: the script's MAX_DELETE_PCT gate (default 50%, mirroring
# sweep-cf-orphans.yml — tenant secrets are durable by design, unlike
# the mostly-orphan tunnels) refuses to nuke past the threshold.
on:
schedule:
# Hourly at :30 — offsets from sweep-cf-orphans (:15) and
# sweep-cf-tunnels (:45) so the three janitors don't burst the
# CP admin endpoints at the same minute.
- cron: '30 * * * *'
workflow_dispatch:
inputs:
dry_run:
description: "Dry run only — list what would be deleted, no deletion"
required: false
type: boolean
default: true
max_delete_pct:
description: "Override safety gate (default 50, set higher only for major cleanup)"
required: false
default: "50"
grace_hours:
description: "Skip secrets created within this many hours (default 24)"
required: false
default: "24"
# Don't let two sweeps race the same AWS account.
concurrency:
group: sweep-aws-secrets
cancel-in-progress: false
permissions:
contents: read
jobs:
sweep:
name: Sweep AWS Secrets Manager
runs-on: ubuntu-latest
# 30 min cap, mirroring the other janitors. AWS DeleteSecret is
# fast (~0.3s/call) so even a 100+ backlog drains in seconds
# under the 8-way xargs parallelism, but the cap is set generously
# to leave headroom for any actual API hang.
timeout-minutes: 30
env:
AWS_REGION: ${{ secrets.AWS_REGION || 'us-east-1' }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_JANITOR_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_JANITOR_SECRET_ACCESS_KEY }}
CP_PROD_ADMIN_TOKEN: ${{ secrets.CP_PROD_ADMIN_TOKEN }}
CP_STAGING_ADMIN_TOKEN: ${{ secrets.CP_STAGING_ADMIN_TOKEN }}
MAX_DELETE_PCT: ${{ github.event.inputs.max_delete_pct || '50' }}
GRACE_HOURS: ${{ github.event.inputs.grace_hours || '24' }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: Verify required secrets present
id: verify
# Schedule-vs-dispatch behaviour split mirrors sweep-cf-orphans
# and sweep-cf-tunnels (hardened 2026-04-28). Same principle:
# - schedule → exit 1 on missing secrets (red CI surfaces it)
# - workflow_dispatch → exit 0 with warning (operator-driven,
# they already accepted the repo state)
run: |
missing=()
for var in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY CP_PROD_ADMIN_TOKEN CP_STAGING_ADMIN_TOKEN; do
if [ -z "${!var:-}" ]; then
missing+=("$var")
fi
done
if [ ${#missing[@]} -gt 0 ]; then
if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
echo "::warning::skipping sweep — secrets not configured: ${missing[*]}"
echo "::warning::set them at Settings → Secrets and Variables → Actions, then rerun."
echo "::warning::AWS_JANITOR_* must belong to a principal with secretsmanager:ListSecrets and secretsmanager:DeleteSecret on molecule/tenant/* (the prod molecule-cp principal lacks ListSecrets)."
echo "skip=true" >> "$GITHUB_OUTPUT"
exit 0
fi
echo "::error::sweep cannot run — required secrets missing: ${missing[*]}"
echo "::error::set them at Settings → Secrets and Variables → Actions, or disable this workflow."
echo "::error::AWS_JANITOR_* must belong to a principal with secretsmanager:ListSecrets and secretsmanager:DeleteSecret on molecule/tenant/*."
exit 1
fi
echo "All required secrets present ✓"
echo "skip=false" >> "$GITHUB_OUTPUT"
- name: Run sweep
if: steps.verify.outputs.skip != 'true'
# Schedule-vs-dispatch dry-run asymmetry mirrors sweep-cf-tunnels:
# - Scheduled: input empty → "false" → --execute (the whole
# point of an hourly janitor).
# - Manual workflow_dispatch: input default true → dry-run;
# operator must flip it to actually delete.
run: |
set -euo pipefail
if [ "${{ github.event.inputs.dry_run || 'false' }}" = "true" ]; then
echo "Running in dry-run mode — no deletions"
bash scripts/ops/sweep-aws-secrets.sh
else
echo "Running with --execute — will delete identified orphans"
bash scripts/ops/sweep-aws-secrets.sh --execute
fi