f9214391fb
2 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
5373b5e7f6 |
fix(ci): extend class-E rename to scripts/ops/sweep-*.sh (chained-defect from #430 review)
Some checks failed
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 18s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 13s
CI / Detect changes (pull_request) Successful in 50s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 20s
sop-tier-check / tier-check (pull_request) Successful in 19s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 55s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 50s
E2E API Smoke Test / detect-changes (pull_request) Successful in 59s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 41s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 55s
CI / Platform (Go) (pull_request) Successful in 9s
CI / Canvas (Next.js) (pull_request) Successful in 10s
CI / Python Lint & Test (pull_request) Successful in 9s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 23s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 13s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 11s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 12s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
audit-force-merge / audit (pull_request) Successful in 23s
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Failing after 4m53s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Failing after 5m15s
core-devops lens review (review 1075) caught the chained defect: the 3
sweep workflows shell out to `bash scripts/ops/sweep-{aws-secrets,cf-orphans,cf-tunnels}.sh`,
and those scripts still consume the OLD env-var names — `need CP_PROD_ADMIN_TOKEN`,
`need CP_STAGING_ADMIN_TOKEN`, and `Bearer $CP_PROD_ADMIN_TOKEN` /
`Bearer $CP_STAGING_ADMIN_TOKEN` in the CP-admin curl calls. The workflow-
level presence-check loop (renamed in the first commit) would pass, then
the shell script would `exit 1` at the `need CP_PROD_ADMIN_TOKEN` line.
Classic `feedback_chained_defects_in_never_tested_workflows` — the YAML-
surface rename looked complete; the actual consumer is one layer deeper.
This commit completes the rename in the scripts:
- `CP_PROD_ADMIN_TOKEN` -> `CP_ADMIN_API_TOKEN`
- `CP_STAGING_ADMIN_TOKEN` -> `CP_STAGING_ADMIN_API_TOKEN`
(6 occurrences total per script — comments, `need` checks, `Bearer $...`
curl headers — across all 3). The .gitea/workflows/sweep-*.yml files (first
commit) export `CP_ADMIN_API_TOKEN: ${{ secrets.CP_ADMIN_API_TOKEN }}` etc.,
so the scripts now read `$CP_ADMIN_API_TOKEN` — consistent end-to-end.
Per core-devops's other (non-blocking) note: `workflow_dispatch` each
sweep in dry-run after this lands + after the #425 class-A PUT, to confirm
the path beyond the presence-check actually works (the `MINIMAX_TOKEN`-grade
shape-match isn't enough — exercise the real CP-admin call).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
6f8f7932d2 |
feat(ops): add sweep-aws-secrets janitor — orphan tenant bootstrap secrets
CP's deprovision flow calls Secrets.DeleteSecret() (provisioner/ec2.go:806)
but only when the deprovision runs to completion. Crashed provisions and
incomplete teardowns leak the per-tenant `molecule/tenant/<org_id>/bootstrap`
secret. At ~$0.40/secret/month, ~45 leaked secrets surfaced as ~$19/month
on the AWS cost dashboard.
The tenant_resources audit table (mig 024) tracks four kinds today —
CloudflareTunnel, CloudflareDNS, EC2Instance, SecurityGroup — and the
existing reconciler doesn't catch Secrets Manager orphans. The proper fix
(KindSecretsManagerSecret + recorder hook + reconciler enumerator) is filed
as a follow-up controlplane issue. This sweeper is the immediate stopgap.
Parallel-shape to sweep-cf-tunnels.sh:
- Hourly schedule offset (:30, between sweep-cf-orphans :15 and
sweep-cf-tunnels :45) so the three janitors don't burst CP admin
at the same minute.
- 24h grace window — never deletes a secret younger than the
provisioning roundtrip, so an in-flight provision can't be racemurdered.
- MAX_DELETE_PCT=50 default (mirrors sweep-cf-orphans for durable
resources; tenant secrets should track 1:1 with live tenants).
- Same schedule-vs-dispatch hardening as the other janitors:
schedule → hard-fail on missing secrets, dispatch → soft-skip.
- 8-way xargs parallelism, dry-run by default, --execute to delete.
Requires a dedicated AWS_JANITOR_* IAM principal — the prod molecule-cp
principal lacks secretsmanager:ListSecrets (it only has scoped
Get/Create/Update/Delete). The workflow's verify-secrets step will hard-fail
on the first scheduled run until those secrets are configured, surfacing
the missing setup loudly rather than silently no-op'ing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|