fix(e2e): #1646 — raise staging SaaS provisioning timeout (flaky tenant-provisioning latency, not a code regression) #1683

Merged
hongming merged 1 commit from fix/1646-staging-saas-timeout into main 2026-05-22 18:52:38 +00:00
Member

Summary

MITIGATION for flaky tenant-provisioning latency (ref #1646 comment 43710). NOT a code-regression fix.

Evidence

The staging SaaS smoke canary alternates pass/fail on identical SHAs:

  • run 92819 → success
  • run 92706 → fail
  • run 92667 → success

Root cause: variable EC2 and cold-boot latency in step 7/11 (wait_workspaces_online_routable), where workspace provisioning occasionally exceeds the hardcoded 30-minute deadline.

Change

  • Makes the workspace-online timeout env-configurable (E2E_WORKSPACE_ONLINE_TIMEOUT_SECS).
  • Raises default from 1800 (30 min) → 3600 (60 min).
  • Updates wait_workspaces_online_routable() and the step-7/11 call-site label to reference the configurable timeout instead of a hardcoded value.

This gives flaky-but-eventually-successful provisioning room to complete without causing false canary failures, while preserving the ability to tune the timeout via CI env if needed.
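
For readers unfamiliar with the pattern, the change described above can be sketched roughly as follows. This is a minimal illustration, not the actual diff: only E2E_WORKSPACE_ONLINE_TIMEOUT_SECS and its 3600 default come from this PR, while wait_until_ready, check_workspace_ready, POLL_INTERVAL_SECS, and READY_MARKER are hypothetical names standing in for the real polling logic in wait_workspaces_online_routable().

```shell
#!/usr/bin/env bash
# Sketch only: the real wait_workspaces_online_routable() is not shown on this page.

# Default to 3600s (60 min); CI can override by exporting the variable.
E2E_WORKSPACE_ONLINE_TIMEOUT_SECS="${E2E_WORKSPACE_ONLINE_TIMEOUT_SECS:-3600}"

# Hypothetical poll interval, not taken from the PR.
POLL_INTERVAL_SECS="${POLL_INTERVAL_SECS:-15}"

# Hypothetical probe standing in for the real "online and routable" check.
# Succeeds once the file named by READY_MARKER exists.
check_workspace_ready() {
  [ -e "${READY_MARKER:-/nonexistent}" ]
}

# Poll until the probe succeeds or the configured deadline passes.
# Returns 0 on success, 1 on timeout, 2 on a non-numeric timeout value.
wait_until_ready() {
  local timeout_secs="$E2E_WORKSPACE_ONLINE_TIMEOUT_SECS"

  # Reject empty or non-numeric overrides before starting the wait.
  case "$timeout_secs" in
    ''|*[!0-9]*)
      echo "invalid E2E_WORKSPACE_ONLINE_TIMEOUT_SECS: '$timeout_secs'" >&2
      return 2
      ;;
  esac

  # SECONDS is bash's built-in counter of seconds since shell start.
  local deadline=$(( SECONDS + timeout_secs ))
  while (( SECONDS < deadline )); do
    if check_workspace_ready; then
      return 0
    fi
    sleep "$POLL_INTERVAL_SECS"
  done

  # Report the timeout actually in effect, not a hardcoded value.
  echo "workspaces not online after ${timeout_secs}s" >&2
  return 1
}
```

Under these assumptions, a CI job could tune the deadline without a code change by exporting the variable before invoking the harness, for example E2E_WORKSPACE_ONLINE_TIMEOUT_SECS=5400 for a 90-minute window.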

Reviewers

  • @core-be
  • @core-qa
  • @core-devops
agent-dev-a added 1 commit 2026-05-22 17:17:44 +00:00
fix(e2e): #1646 — raise staging SaaS provisioning timeout (flaky tenant-provisioning latency, not a code regression)
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Waiting to run
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 5s
CI / Detect changes (pull_request) Successful in 6s
CI / Python Lint & Test (pull_request) Successful in 3s
E2E API Smoke Test / detect-changes (pull_request) Successful in 9s
E2E Chat / detect-changes (pull_request) Successful in 9s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 7s
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Has been skipped
Handlers Postgres Integration / detect-changes (pull_request) Successful in 3s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 4s
Lint no tenant GITEA or GITHUB token write / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 3s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 5s
gate-check-v3 / gate-check (pull_request) Successful in 12s
sop-checklist / review-refire (pull_request) Has been skipped
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request) Successful in 4s
sop-tier-check / tier-check (pull_request) Successful in 5s
CI / Canvas (Next.js) (pull_request) Successful in 2s
CI / Platform (Go) (pull_request) Successful in 9s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 39s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 8s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 15s
E2E Chat / E2E Chat (pull_request) Successful in 9s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2s
CI / all-required (pull_request) Successful in 1m7s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m8s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m4s
qa-review / approved (pull_request) Refired via /qa-recheck by unknown
security-review / approved (pull_request) Refired via /security-recheck by unknown
audit-force-merge / audit (pull_request) Successful in 4s
231fb5ddab
- Make workspace-online timeout env-configurable
  (E2E_WORKSPACE_ONLINE_TIMEOUT_SECS) and raise default from 1800s
  (30 min) to 3600s (60 min).

- Update wait_workspaces_online_routable() to consume the variable
  instead of a hardcoded 1800s, and report the actual timeout in the
  failure message.

- Update step-7/11 call-site label and inline comment to reference the
  configurable timeout.

This is a MITIGATION for flaky tenant-provisioning latency observed in
#1646 comment 43710: the staging SaaS smoke canary alternates pass/fail
on identical SHAs (e.g. run 92819 success / 92706 fail / 92667 success).
The real cause is variable EC2+cold-boot latency, not a code regression.
Raising the deadline gives flaky-but-eventually-successful provisioning
room to complete without causing false canary failures.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Member

[core-qa-agent] APPROVED — QA review for PR #1683. Evidence: single harness-only change in tests/e2e/test_staging_full_saas.sh; preserves existing polling behavior, makes the online deadline configurable, updates failure text, and PR CI shows shellcheck plus e2e-staging-saas pr-validate green. No QA blocker found.

Member

[core-security-agent] APPROVED — Security review for PR #1683. Evidence: no new inputs beyond an optional timeout environment variable, no auth/secret/DB/network trust-boundary change, no dependencies, and secret-scan is green. No security blocker found.

Owner

/qa-recheck

Owner

/security-recheck

core-qa approved these changes 2026-05-22 18:52:11 +00:00
core-qa left a comment
Member

APPROVED — QA review matches comment #43796. Single e2e harness timeout configurability change; shell syntax, shellcheck, and PR validation are green.

core-security approved these changes 2026-05-22 18:52:12 +00:00
core-security left a comment
Member

APPROVED — Security review matches comment #43797. No new auth, secret, dependency, network, DB, or untrusted-input surface; secret scan is green.

hongming merged commit cace2eb7d3 into main 2026-05-22 18:52:38 +00:00

Reference: molecule-ai/molecule-core#1683