E2E Staging SaaS has been failing on every cron + push run since
2026-04-27 with `LEAK: org … still present post-teardown (count=1)`,
exit 4. Root cause: the curl timeout on the teardown DELETE was 30s
and the post-DELETE leak check was a single 10s sleep — but the
DELETE handler runs the full GDPR Art. 17 cascade synchronously,
including EC2 termination which AWS reports in 30–60s. Real-world
wall time on a prod-shaped run was 57s on 2026-04-27 (hongmingwang
DELETE); the 30s curl timeout aborted the request mid-cascade, and the
check after the single 10s sleep found the row still present (status
not yet 'purged').
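For contrast, a minimal sketch of the old shape of the check (the names and
stubs here are hypothetical, not the real script contents): one capped DELETE,
one fixed sleep, one probe, so a cascade still running at probe time reads as a
leak.

```shell
#!/usr/bin/env sh
# Hypothetical reconstruction of the old teardown check; the curl/DB calls are
# replaced with stubs so the timing logic is visible and the sketch runs alone.

# Stub: at probe time (~40s into a ~57s cascade) the row is still mid-purge.
row_status_at_probe() { echo deleting; }

sleep 1   # stands in for the single rigid 10s post-DELETE sleep

if [ "$(row_status_at_probe)" != "purged" ]; then
  echo "LEAK: org ... still present post-teardown (count=1)"
fi
```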
Two-part fix to match real cascade timing:
1. DELETE curl gets its own --max-time 120 (was 30) so the
synchronous cascade has room to complete in-band.
2. The leak check polls up to 60s for status='purged' instead of a
   single rigid 10s sleep. This covers two cases:
- DELETE returns 5xx mid-cascade but the cascade finishes anyway
(we still observe a clean state).
   - DELETE legitimately exceeds 120s: the polling window still catches
     the eventual purge instead of false-flagging a leak.
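The polling half of the fix can be sketched as below. This is a minimal,
self-contained illustration, assuming a 5s poll interval; `poll_for_purged` and
`fetch_status` are illustrative names, not the actual helpers in `_lib.sh`, and
the stub stands in for the real DB/API status lookup.

```shell
#!/usr/bin/env sh
# Sketch of the new leak check: poll up to N seconds for status='purged'
# instead of sleeping once and probing once.

# Run the status-fetch command every 5s, up to $1 seconds total;
# succeed as soon as it reports 'purged'.
poll_for_purged() {
  deadline=$(( $(date +%s) + $1 ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    [ "$("$2")" = "purged" ] && return 0
    sleep 5
  done
  return 1
}

# Stub standing in for the real status lookup so the sketch runs on its own.
fetch_status() { echo purged; }

if poll_for_purged 60 fetch_status; then
  echo "org purged"
else
  echo "LEAK: org still present post-teardown" >&2
fi
```

Polling also tolerates the 5xx-mid-cascade case above for free: the check never
looks at the DELETE response, only at the observed end state.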
The 5–15s estimate in `molecule-controlplane/internal/handlers/
purge.go`'s comment is the API-call cost only, not the AWS-side
time-to-termination it waits on. The async-purge refactor noted in
that comment would let us drop these timeouts back to ~15s — file
that under future work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>