test(e2e): live staging e2e — reconciler heals a terminated EC2 (core#2261) #2270

Merged
hongming merged 1 commit from feat/core2261-reconciler-live-e2e into main 2026-06-05 01:11:52 +00:00
Owner

What

A live staging E2E that proves the core#2261 instance-state reconciler
(workspace-server/internal/registry/cp_instance_reconciler.go) actually heals
a terminated EC2 against real infra — the real-infra complement to the
deterministic unit tests, which only pin the reconcile logic against fakes.

tests/e2e/test_reconciler_heals_terminated_instance.sh:

  1. Provisions a fresh staging org + ONE workspace (same default
    runtime/model + provisioning/token conventions as
    test_staging_full_saas.sh), polls the tenant API until status=online,
    and captures its instance_id.
  2. Kills it: aws ec2 terminate-instances on that exact captured
    instance_id (falls back to a slug-tag describe via
    lib/aws_leak_check.sh if the id wasn't surfaced).
  3. Asserts the reconciler heals it:
    • PRIMARY (gate, ~180s): the workspace status leaves online — the
      reconciler detected the dead instance via IsRunning and flipped it.
      This is the core#2247 regression guard: a dead instance must NOT keep
      reading online. PRIMARY failing exits 1.
    • SECONDARY (best-effort, ~600s): it auto-reprovisions — status
      returns to online on an instance_id that differs from the
      terminated one (the onOffline → RestartByID existing-volume heal). If
      the reprovision doesn't finish within the bound, the miss is logged
      clearly but does not fail the test; PRIMARY stands as the gate. A future
      tightening to a hard fail is deliberately one edit away (noted inline).
  4. Teardown always — an up-front EXIT/INT/TERM trap deletes the tenant
    and leak-sweeps slug-tagged EC2, so a mid-test failure never orphans a box.
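
The PRIMARY gate in step 3 can be pictured as a bounded poll. The following is a minimal sketch, not the code in the actual script: `get_status`, `STATUS_FILE`, and `wait_until_not_online` are hypothetical names, and the status is read from a local file so the sketch runs without touching the tenant API.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the PRIMARY gate; none of these names come from the PR.

# Stand-in for the real tenant-API lookup: reads the status from a local file.
STATUS_FILE="${STATUS_FILE:-/tmp/workspace_status}"
get_status() { cat "$STATUS_FILE"; }

# Poll until the status is something other than "online", or the bound expires.
# A failed or empty lookup is treated as "no answer yet", not as a pass.
# Returns 0 if the status left online, 1 on timeout.
wait_until_not_online() {
  local timeout_s="$1" interval_s="${2:-5}"
  local deadline=$(( SECONDS + timeout_s ))
  local status
  while (( SECONDS < deadline )); do
    if status="$(get_status 2>/dev/null)" \
       && [[ -n "$status" && "$status" != "online" ]]; then
      echo "PRIMARY ok: status left online (now: $status)"
      return 0
    fi
    sleep "$interval_s"
  done
  echo "PRIMARY FAIL: status still online after ${timeout_s}s" >&2
  return 1
}
```

Under these assumptions the gate would be invoked as `wait_until_not_online 180 || exit 1`, which matches the "PRIMARY failing exits 1" behavior described above.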

Workflow

.gitea/workflows/e2e-staging-reconciler.yml, modeled on
e2e-staging-saas.yml (same CP_STAGING_ADMIN_API_TOKEN + AWS secrets,
E2E_AWS_TERMINATE_LEAKS=1, "Verify required secrets present" preflight,
belt-and-braces teardown). Triggers: workflow_dispatch + a paths filter on
the reconciler source, the new script, and the libs (so it runs when the
reconciler changes) + a daily schedule.
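
A "Verify required secrets present" preflight could look roughly like the sketch below. Only `CP_STAGING_ADMIN_API_TOKEN` is named in this PR; the helper name `require_env` and the AWS variable names in the usage comment are assumptions (they are the standard AWS CLI environment variables, but the workflow may wire its secrets differently).

```shell
#!/usr/bin/env bash
# Hypothetical preflight helper; the real workflow step may differ.

# Report every missing variable, then return non-zero if any was missing.
require_env() {
  local missing=0 name
  for name in "$@"; do
    # ${!name:-} is bash indirect expansion: the value of the variable whose
    # name is stored in $name, or empty if it is unset.
    if [[ -z "${!name:-}" ]]; then
      echo "missing required secret: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call (variable list is an assumption):
#   require_env CP_STAGING_ADMIN_API_TOKEN AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY || exit 1
```

Checking all names before failing means one run reports every missing secret instead of stopping at the first.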

NON-required initially (continue-on-error: true) — a brand-new live E2E
that provisions/terminates real EC2 should not hard-gate every merge until it
has a green track record. A header note documents the promotion to
branch-required.

Validation

  • shellcheck --severity=warning (CI-exact) clean; default-severity clean.
  • bash -n parse-clean.
  • Bulk shellcheck across all tests/e2e/*.sh clean (no sibling broken).
  • lint_cleanup_traps.sh clean; workflow-YAML linter + continue-on-error
    tracker linter clean (job-level continue-on-error references mc#1982).
  • The script was NOT executed against staging — it provisions/terminates
    real EC2 and costs money. It runs against staging only in CI.
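
The two static checks listed first can be reproduced locally along these lines. This is a sketch: the wrapper name `static_check` is hypothetical, and skipping shellcheck when it is not installed is a convenience of the sketch, not something CI does.

```shell
#!/usr/bin/env bash
# Hypothetical local wrapper around the checks named in the Validation list.

static_check() {
  local script="$1"
  # Parse-only check: bash reads the script without executing it.
  bash -n "$script" || return 1
  if command -v shellcheck >/dev/null 2>&1; then
    # Same severity threshold the PR describes as CI-exact.
    shellcheck --severity=warning "$script" || return 1
  else
    echo "shellcheck not found; skipped" >&2
  fi
}

# Example call:
#   static_check tests/e2e/test_reconciler_heals_terminated_instance.sh
```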

Refs core#2261, core#2247.

🤖 Generated with Claude Code

hongming added 1 commit 2026-06-05 01:09:45 +00:00
test(e2e): live staging e2e — reconciler heals a terminated EC2 (core#2261)
ci-arm64-advisory / fast-checks (pull_request) Waiting to run
CI / Python Lint & Test (pull_request) Successful in 3s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 7s
E2E API Smoke Test / detect-changes (pull_request) Successful in 7s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 6s
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Successful in 2s
CI / Detect changes (pull_request) Successful in 9s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 4s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 11s
lint-required-workflows-docker-host-pinned / Lint docker-host pin on docker-touching workflows (pull_request) Successful in 5s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 13s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 5s
E2E Chat / detect-changes (pull_request) Successful in 19s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 19s
security-review / approved (pull_request_target) Failing after 7s
qa-review / approved (pull_request_target) Failing after 10s
CI / Platform (Go) (pull_request) Successful in 1s
gate-check-v3 / gate-check (pull_request_target) Successful in 11s
CI / Canvas (Next.js) (pull_request) Successful in 1s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 12s
E2E Chat / E2E Chat (pull_request) Successful in 2s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 1s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 30s
CI / Canvas Deploy Status (pull_request) Has been skipped
CI / all-required (pull_request) Successful in 3s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 56s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 1m0s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Successful in 1m16s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 1m11s
lint-required-context-exists-in-bp / lint-required-context-exists-in-bp (pull_request) Failing after 1m15s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m18s
sop-checklist / review-refire (pull_request_target) Has been skipped
sop-tier-check / tier-check (pull_request_target) Has been cancelled
sop-checklist / all-items-acked (pull_request) [info tier:low] acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, l
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request_target) Successful in 3s
qa-review / approved (pull_request_review) Has been skipped
security-review / approved (pull_request_review) Has been skipped
sop-tier-check / tier-check (pull_request_review) Successful in 4s
audit-force-merge / audit (pull_request_target) Successful in 5s
E2E Staging Reconciler (heals terminated EC2) / pr-validate (pull_request) Waiting to run
E2E Staging Reconciler (heals terminated EC2) / E2E Staging Reconciler (pull_request) Waiting to run
53ec08cbdb
Provisions a real staging workspace, terminates its EC2 out-of-band, and
asserts the core#2261 instance-state reconciler heals it against real infra.

PRIMARY assertion (gate): within ~180s the workspace status leaves 'online'
— the reconciler detected the dead instance via CPProvisioner.IsRunning and
flipped it. A terminated EC2 masquerading as 'online' is exactly the
core#2247 regression this guards.

SECONDARY assertion (best-effort, ~600s): the onOffline -> RestartByID
existing-volume heal brings it back to 'online' on a NEW instance_id. Logged
but non-fatal — PRIMARY is the gate; a future tightening to a hard fail is
one edit away (noted in the script).
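
As a rough illustration of the best-effort SECONDARY check, the sketch below waits for `online` on a different instance_id and never fails the run. All names (`get_workspace`, `WS_FILE`, `check_reprovision`, `HEALED`) are hypothetical, and the workspace state is read from a local file in the form `<status> <instance_id>` rather than from the tenant API.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the SECONDARY check; not the code in the real script.

# Stand-in for the tenant-API lookup: prints "<status> <instance_id>".
WS_FILE="${WS_FILE:-/tmp/workspace_state}"
get_workspace() { cat "$WS_FILE"; }

# Set to 1 when the workspace came back online on a new instance.
HEALED=0

# Best-effort: always returns 0, so a slow reprovision cannot fail the run.
check_reprovision() {
  local old_id="$1" timeout_s="$2" interval_s="${3:-10}"
  local deadline=$(( SECONDS + timeout_s ))
  local state status new_id
  HEALED=0
  while (( SECONDS < deadline )); do
    state="$(get_workspace 2>/dev/null)" || state=""
    read -r status new_id <<<"$state"
    if [[ "$status" == "online" && -n "$new_id" && "$new_id" != "$old_id" ]]; then
      echo "SECONDARY ok: back online on new instance $new_id"
      HEALED=1
      return 0
    fi
    sleep "$interval_s"
  done
  echo "SECONDARY (non-fatal): no reprovision within ${timeout_s}s" >&2
  return 0
}
```

In this shape, tightening to a hard fail would amount to changing the final `return 0` to `return 1`.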

Kill primitive: aws ec2 terminate-instances on the captured instance_id
(falls back to slug-tag describe). Teardown is guaranteed by an up-front
EXIT/INT/TERM trap that deletes the tenant + leak-sweeps slug-tagged EC2
(reuses lib/aws_leak_check.sh), so a mid-test failure never orphans a box.
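
The up-front trap pattern can be sketched as follows. This is an illustration under stated assumptions: `delete_tenant`, `sweep_leaked_ec2`, and `run_with_teardown` are hypothetical placeholders that only print what they would do; the real script calls the tenant API and lib/aws_leak_check.sh.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of guaranteed teardown; placeholder functions only echo.

delete_tenant()    { echo "deleting tenant $1"; }
sweep_leaked_ec2() { echo "sweeping EC2 tagged $1"; }

cleanup() {
  # Each step is allowed to fail so that the next one still runs.
  delete_tenant "$SLUG" || true
  sweep_leaked_ec2 "$SLUG" || true
}

# Runs a command in a subshell with teardown installed before anything else.
# INT/TERM are converted into a normal exit so the EXIT trap fires exactly once.
run_with_teardown() (
  SLUG="$1"; shift
  trap cleanup EXIT
  trap 'exit 130' INT
  trap 'exit 143' TERM
  "$@"
)

# Example call (slug and test body are illustrative):
#   run_with_teardown e2e-reconciler-demo some_test_function
```

Because the EXIT trap does not call `exit` itself, the exit status of the wrapped command is preserved after cleanup runs.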

Real-infra complement to the deterministic unit tests
(cp_instance_reconciler.go). New workflow e2e-staging-reconciler.yml fires on
reconciler/script/lib changes + a daily schedule. NON-required initially
(continue-on-error: true) — promote to branch-required once green on main for
a de-flake window.

Refs core#2261, core#2247.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
hongming added the tier:low label 2026-06-05 01:11:28 +00:00
core-security approved these changes 2026-06-05 01:11:30 +00:00
core-security left a comment
Member

Security (core#2261). Real-infra e2e; guaranteed teardown prevents EC2 leaks; AWS-creds preflight; slug-tagged for the orphan sweeper. No prod-runtime change. Approve.

core-qa approved these changes 2026-06-05 01:11:51 +00:00
core-qa left a comment
Member

QA approve (core#2261 live reconciler e2e).

hongming merged commit 71010e618a into main 2026-06-05 01:11:52 +00:00

Reference: molecule-ai/molecule-core#2270