fix(e2e): fail teardown on leaked EC2 #1660

Merged
hongming merged 1 commit from fix/e2e-aws-leak-verification into main 2026-05-22 00:36:10 +00:00
Owner

Phase 1 evidence

Brief claim: E2E teardown can leak EC2 after CP reports clean.
Evidence confirmed: tests/e2e/test_staging_full_saas.sh deleted the CP tenant, then only polled /cp/admin/orgs before printing clean. The observed leak class was an EC2 whose Name tag still contained the E2E slug after org/secrets were gone, so CP-org-based sweepers could not find it.

Affected surfaces:

  • tests/e2e/test_staging_full_saas.sh shared SaaS/smoke/synth harness
  • staging-smoke.yml, e2e-staging-saas.yml, e2e-staging-sanity.yml, continuous-synth-e2e.yml
  • Long-term platform fix tracked in internal#639

Phase 2 design

Add a focused AWS EC2 verifier after CP org teardown. In CI, the verifier is required and looks up EC2 instances by the E2E slug in their Name tag. If matching instances remain after the poll budget, the verifier optionally terminates them and exits with the existing leak rc=4. Local runs stay usable via auto/off modes.
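
The flow above could be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of tests/e2e/lib/aws_leak_check.sh: the function names, the E2E_AWS_LEAK_CHECK mode variable, the two timing variables, the set of instance states queried, and the rc=1 used for a missing CLI are all hypothetical. Only E2E_AWS_TERMINATE_LEAKS, the 90s budget, and the leak rc=4 come from this PR.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; function and variable names are illustrative only.

# Print IDs of non-terminated instances whose Name tag contains the slug.
find_slug_instances() {
  local slug="$1"
  aws ec2 describe-instances \
    --filters "Name=tag:Name,Values=*${slug}*" \
              "Name=instance-state-name,Values=pending,running,stopping,stopped" \
    --query 'Reservations[].Instances[].InstanceId' \
    --output text
}

# Returns 0 when clean or skipped, 4 when instances remain after the budget.
verify_no_leaked_ec2() {
  local slug="$1"
  local mode="${E2E_AWS_LEAK_CHECK:-auto}"        # assumed values: auto | off | required
  local budget="${E2E_AWS_LEAK_BUDGET_SECS:-90}"  # poll budget in seconds
  local interval="${E2E_AWS_LEAK_POLL_SECS:-10}"  # assumed poll interval
  local waited=0 ids=""

  if [ "$mode" = "off" ]; then
    return 0
  fi
  if ! command -v aws >/dev/null 2>&1; then
    if [ "$mode" = "required" ]; then
      echo "aws CLI missing but leak check is required" >&2
      return 1  # assumed rc for the required-missing case
    fi
    return 0    # auto mode: skip when AWS is unavailable
  fi

  # Poll until no matching instance is left or the budget is spent.
  while :; do
    ids="$(find_slug_instances "$slug")" || return 1
    if [ -z "$ids" ]; then
      return 0
    fi
    if [ "$waited" -ge "$budget" ]; then
      break
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done

  echo "leaked EC2 for slug ${slug}: ${ids}" >&2
  if [ "${E2E_AWS_TERMINATE_LEAKS:-0}" = "1" ]; then
    # shellcheck disable=SC2086  # word splitting of the ID list is intended
    aws ec2 terminate-instances --instance-ids $ids >/dev/null || true
  fi
  return 4
}
```

In this sketch the leak still fails the run with rc=4 even when termination succeeds, so a cleaned-up leak stays visible in CI rather than being silently repaired.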

Rollback: revert this PR; it only changes E2E cleanup verification and workflow env wiring.

Changes

  • Add tests/e2e/lib/aws_leak_check.sh.
  • Add tests/e2e/test_aws_leak_check.sh with fake-aws coverage for skip, required-missing, clean, leak, and terminate paths.
  • Make test_staging_full_saas.sh require EC2-clean before printing teardown clean.
  • Wire AWS env and required leak-check mode into the staging E2E workflows that invoke the shared harness.
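
The fake-aws coverage listed above can be pictured with a sketch like the one below, which puts a stub executable named aws first on PATH so the code under test never reaches real AWS. Everything here is hypothetical: the stub, the FAKE_AWS_IDS and FAKE_AWS_LOG variables, and check_once (a stand-in for the real helper) are assumptions for illustration and are not taken from tests/e2e/test_aws_leak_check.sh.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a fake-aws test; not the PR's actual test file.

# Create a stub `aws` executable in a temp dir and print that dir.
make_fake_aws() {
  local dir
  dir="$(mktemp -d)"
  cat > "${dir}/aws" <<'EOF'
#!/usr/bin/env bash
# Record every invocation, then answer describe-instances from FAKE_AWS_IDS.
echo "$*" >> "${FAKE_AWS_LOG}"
if [ "$2" = "describe-instances" ]; then
  printf '%s\n' "${FAKE_AWS_IDS:-}"
fi
EOF
  chmod +x "${dir}/aws"
  echo "$dir"
}

# Stand-in for the code under test: returns 4 if any instance ID comes back.
check_once() {
  local ids
  ids="$(aws ec2 describe-instances --output text)" || return 1
  [ -z "$ids" ] || return 4
}

# Runs in a subshell (note the parentheses) so PATH changes do not leak out.
run_fake_aws_cases() (
  dir="$(make_fake_aws)"
  export FAKE_AWS_LOG="${dir}/calls.log"
  PATH="${dir}:${PATH}"
  status=0

  # Clean path: the stub reports no instances.
  export FAKE_AWS_IDS=""
  rc=0; check_once || rc=$?
  [ "$rc" -eq 0 ] || { echo "FAIL clean path: rc=${rc}"; status=1; }

  # Leak path: the stub reports one leftover instance.
  export FAKE_AWS_IDS="i-0123456789abcdef0"
  rc=0; check_once || rc=$?
  [ "$rc" -eq 4 ] || { echo "FAIL leak path: rc=${rc}"; status=1; }

  rm -rf "$dir"
  [ "$status" -eq 0 ] && echo "PASS"
  exit "$status"
)
```

Calling run_fake_aws_cases prints PASS when both cases behave as expected. Because the stub logs its arguments, a terminate-path case could additionally assert on the recorded terminate-instances call.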

Verification

  • bash -n tests/e2e/lib/aws_leak_check.sh tests/e2e/test_aws_leak_check.sh tests/e2e/test_staging_full_saas.sh
  • bash tests/e2e/test_aws_leak_check.sh
  • bash tests/e2e/test_harness_rc_normalization.sh
  • bash tests/e2e/lint_cleanup_traps.sh
  • shellcheck -x tests/e2e/lib/aws_leak_check.sh tests/e2e/test_aws_leak_check.sh tests/e2e/test_staging_full_saas.sh
  • python3 .gitea/scripts/lint-workflow-yaml.py --workflow-dir .gitea/workflows
  • python3 -m pytest tests/test_lint_workflow_yaml.py -q
  • git diff --check
  • Live AWS scan: no current E2E/synth/smoke EC2 instances found before opening PR.

## SOP Checklist

### Comprehensive testing performed
Local verification completed: bash -n for changed shell files; tests/e2e/test_aws_leak_check.sh; tests/e2e/test_harness_rc_normalization.sh; tests/e2e/lint_cleanup_traps.sh; shellcheck -x for changed E2E shell files; workflow YAML lint; tests/test_lint_workflow_yaml.py; git diff --check.

### Local-postgres E2E run
N/A: this change is shell/workflow E2E teardown verification only and does not modify workspace-server database code or migrations. Existing workflow pytest and shell harness tests were run locally.

### Staging-smoke verified or pending
Pending post-merge: PR path runs pr-validate only; staging-smoke/continuous synth will exercise required AWS leak verification on scheduled/post-merge runs. Live pre-PR AWS scan showed zero E2E/synth/smoke EC2 instances.

### Root-cause not symptom
Root cause is the harness trusting CP org state as authoritative after delete; the fix adds direct slug-scoped EC2 verification so CP-row deletion cannot mask cloud leftovers. Long-term CP teardown fencing/reconcile is tracked in internal#639.

### Five-Axis review walked
Correctness: slug-scoped EC2 query and rc contract reviewed. Readability: isolated helper. Architecture: CI verifier only, long-term CP fix tracked. Security: no secret values logged and termination is slug-scoped/explicit. Performance: bounded 90s poll only during teardown.

### No backwards-compat shim / dead code added
No compatibility shim or dead code added. The helper is used by the shared SaaS harness and covered by a focused fake-aws test.

### Memory/saved-feedback consulted
Consulted current repo SOP and existing project guidance around Gitea status enums, E2E cleanup safety, and avoiding false clean teardown reports.
hongming added 1 commit 2026-05-22 00:14:08 +00:00
Fail E2E teardown on leaked EC2
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Waiting to run
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 7s
CI / Detect changes (pull_request) Successful in 7s
CI / Python Lint & Test (pull_request) Successful in 5s
E2E API Smoke Test / detect-changes (pull_request) Successful in 6s
E2E Chat / detect-changes (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 6s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 32s
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Has been skipped
Handlers Postgres Integration / detect-changes (pull_request) Successful in 3s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Successful in 1m28s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 4s
Lint no tenant GITEA or GITHUB token write / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 5s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 5s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 1m10s
lint-required-workflows-docker-host-pinned / Lint docker-host pin on docker-touching workflows (pull_request) Successful in 7s
lint-required-context-exists-in-bp / lint-required-context-exists-in-bp (pull_request) Successful in 1m28s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 8s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m1s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 1m19s
CI / Platform (Go) (pull_request) Successful in 2s
CI / Canvas (Next.js) (pull_request) Successful in 2s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 11s
E2E Chat / E2E Chat (pull_request) Successful in 3s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 7s
CI / all-required (pull_request) Successful in 7m25s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m3s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2s
security-review / approved (pull_request) Refired via /security-recheck by manual-refire
qa-review / approved (pull_request) Refired via /qa-recheck by manual-refire
CI / Canvas Deploy Reminder (pull_request) Has been skipped
sop-checklist / review-refire (pull_request) Has been skipped
gate-check-v3 / gate-check (pull_request) Successful in 6s
sop-tier-check / tier-check (pull_request) Successful in 5s
sop-checklist / all-items-acked (pull_request) acked: 7/7
sop-checklist / na-declarations (pull_request) N/A: (none)
audit-force-merge / audit (pull_request) Successful in 7s
3e28bf5943
core-qa approved these changes 2026-05-22 00:26:27 +00:00
core-qa left a comment
Member

QA review: approved. Covered focused shell regression tests for auto/required/missing-AWS, clean EC2, persistent leak, and terminate-on-leak paths. Also verified bash syntax, cleanup-trap lint, workflow YAML lint, and existing workflow YAML pytest locally before PR.

core-security approved these changes 2026-05-22 00:26:28 +00:00
core-security left a comment
Member

Security review: approved. The AWS query is scoped to the per-run E2E slug in EC2 Name tags, and termination is gated behind E2E_AWS_TERMINATE_LEAKS=1. No credential values are logged; workflow changes only reference existing secret names.

core-security approved these changes 2026-05-22 00:26:47 +00:00
core-security left a comment
Member

core-security 5273 Security review: approved. AWS lookup is slug-scoped, termination is opt-in via CI env, and secret values are not logged.

core-security approved these changes 2026-05-22 00:27:27 +00:00
core-security left a comment
Member

Security review: APPROVED.

core-qa approved these changes 2026-05-22 00:27:27 +00:00
core-qa left a comment
Member

QA review: APPROVED.

Author
Owner

/qa-recheck

Author
Owner

/security-recheck

Member

/sop-ack 1 local tests listed in PR body verified
/sop-ack 2 N/A rationale accepted: no DB or migration path touched
/sop-ack 3 staging smoke pending post-merge with live EC2 pre-scan clean
/sop-ack 4 root cause is CP-org-only teardown verification blind spot
/sop-ack 5 five-axis review evidence present
/sop-ack 6 no shim/dead-code added
/sop-ack 7 memory/SOP context consulted

hongming merged commit be8424c350 into main 2026-05-22 00:36:10 +00:00
hongming deleted branch fix/e2e-aws-leak-verification 2026-05-22 00:36:11 +00:00
Reference: molecule-ai/molecule-core#1660