molecule-core/scripts/ops
Hongming Wang 6f8f7932d2 feat(ops): add sweep-aws-secrets janitor — orphan tenant bootstrap secrets
CP's deprovision flow calls Secrets.DeleteSecret() (provisioner/ec2.go:806)
but only when the deprovision runs to completion. Crashed provisions and
incomplete teardowns leak the per-tenant `molecule/tenant/<org_id>/bootstrap`
secret. At ~$0.40/secret/month, ~45 leaked secrets surfaced as ~$19/month
on the AWS cost dashboard.

The tenant_resources audit table (mig 024) tracks four kinds today —
CloudflareTunnel, CloudflareDNS, EC2Instance, SecurityGroup — and the
existing reconciler doesn't catch Secrets Manager orphans. The proper fix
(KindSecretsManagerSecret + recorder hook + reconciler enumerator) is filed
as a follow-up controlplane issue. This sweeper is the immediate stopgap.

Same shape as sweep-cf-tunnels.sh:
  - Hourly schedule offset (:30, between sweep-cf-orphans :15 and
    sweep-cf-tunnels :45) so the three janitors don't burst CP admin
    at the same minute.
  - 24h grace window: never deletes a secret younger than 24h, which
    covers the provisioning roundtrip, so an in-flight provision can't
    lose its secret to a race with the sweeper.
  - MAX_DELETE_PCT=50 default (mirrors sweep-cf-orphans for durable
    resources; tenant secrets should track 1:1 with live tenants).
  - Same schedule-vs-dispatch hardening as the other janitors:
    schedule → hard-fail on missing secrets, dispatch → soft-skip.
  - 8-way xargs parallelism, dry-run by default, --execute to delete.
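
The safety rules above (grace window plus delete-percentage cap) can be sketched roughly as follows. This is a minimal illustration, not the logic in sweep-aws-secrets.sh: the names `Secret`, `decide`, and `SECRET_NAME_RE` are hypothetical, the set of live org IDs is assumed to come from elsewhere, and measuring the cap against the total number of tenant bootstrap secrets is an assumption.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical names throughout; the real script may be structured differently.
SECRET_NAME_RE = re.compile(r"^molecule/tenant/(?P<org_id>[^/]+)/bootstrap$")
GRACE = timedelta(hours=24)
MAX_DELETE_PCT = 50


@dataclass(frozen=True)
class Secret:
    name: str
    created: datetime  # timezone-aware creation time


def decide(secrets, live_org_ids, now, grace=GRACE, max_delete_pct=MAX_DELETE_PCT):
    """Return names of secrets that look orphaned and are safe to delete."""
    # Only per-tenant bootstrap secrets are ever candidates.
    tenant_secrets = [s for s in secrets if SECRET_NAME_RE.match(s.name)]
    orphans = []
    for s in tenant_secrets:
        org_id = SECRET_NAME_RE.match(s.name).group("org_id")
        if org_id in live_org_ids:
            continue  # tenant still exists, keep its secret
        if now - s.created < grace:
            continue  # too young, may belong to an in-flight provision
        orphans.append(s.name)
    # Refuse to act if the sweep would remove an implausibly large share
    # (assumption: the cap is a percentage of all tenant bootstrap secrets).
    if tenant_secrets and len(orphans) * 100 > max_delete_pct * len(tenant_secrets):
        raise RuntimeError(
            f"refusing to delete {len(orphans)}/{len(tenant_secrets)} secrets "
            f"(cap is {max_delete_pct}%)"
        )
    return orphans
```

In a dry run the returned names would only be printed; under this sketch, deletion would happen only when the caller passes --execute.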

Requires a dedicated AWS_JANITOR_* IAM principal — the prod molecule-cp
principal lacks secretsmanager:ListSecrets (it only has scoped
Get/Create/Update/Delete). The workflow's verify-secrets step will hard-fail
on the first scheduled run until those secrets are configured, surfacing
the missing setup loudly rather than silently no-op'ing.
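
The schedule-vs-dispatch behaviour of the verify-secrets step could look roughly like this. It is a sketch under assumptions: `verify_secrets` is a hypothetical helper, the event name "schedule" is assumed to be how the workflow identifies a scheduled run, and the concrete variable names behind the AWS_JANITOR_* prefix are placeholders rather than the real ones.

```python
def verify_secrets(event_name, env, required):
    """Decide whether the janitor should run, skip, or fail.

    event_name: how the workflow was triggered (assumed: "schedule" for
                cron runs, anything else for manual dispatch).
    env:        mapping of environment variable names to values.
    required:   names of the secrets the sweeper needs.
    """
    # A secret counts as missing if it is unset or empty.
    missing = [name for name in required if not env.get(name)]
    if not missing:
        return "run"
    if event_name == "schedule":
        # Scheduled runs fail hard so missing setup is surfaced loudly.
        raise SystemExit(f"missing required secrets: {', '.join(missing)}")
    # Manual dispatch soft-skips instead of failing.
    return "skip"


# Placeholder names for illustration only; the real AWS_JANITOR_* names may differ.
REQUIRED = ["AWS_JANITOR_EXAMPLE_KEY_ID", "AWS_JANITOR_EXAMPLE_SECRET"]
```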

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 02:38:08 -07:00
audit-railway-sha-pins.sh ops: add Railway SHA-pin drift audit script + regression test (#2001) 2026-04-27 05:01:23 -07:00
check_migration_collisions.py ci: hard gate against migration version collisions (#2341) 2026-04-29 21:42:42 -07:00
check-prod-versions.sh ops: scripts/ops/check-prod-versions.sh — one-line "is each tenant on latest?" 2026-04-30 13:13:47 -07:00
sweep_cf_decide.py refactor(ops): apply simplify findings on #2027 PR 2026-04-26 00:28:15 -07:00
sweep-aws-secrets.sh feat(ops): add sweep-aws-secrets janitor — orphan tenant bootstrap secrets 2026-05-03 02:38:08 -07:00
sweep-cf-orphans.sh refactor(ops): apply simplify findings on #2027 PR 2026-04-26 00:28:15 -07:00
sweep-cf-tunnels.sh fix(sweep-cf-tunnels): parallelize deletes + raise workflow timeout 2026-05-02 02:35:46 -07:00
test_check_migration_collisions.py fix(test): convert migration-collision tests from pytest to unittest (#2341) 2026-04-30 01:47:27 -07:00
test_sweep_cf_decide.py refactor(ops): apply simplify findings on #2027 PR 2026-04-26 00:28:15 -07:00