fix(e2e): bounded retry around external workspace-create in staging concierge user_tasks (#2743) #2746
Reference in New Issue
Block a user
Delete Branch "fix/2743-staging-concierge-create-retry"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
What
Bounded retry around the external workspace-create only in the staging concierge
user_tasksE2E (tests/e2e/test_staging_concierge_e2e.sh).Root cause (issue #2743, Researcher RCA — confirmed)
main
6163f6636fc8was red in run 358767 / job 487991 before anyuser_tasksassertion:CURL_COMMON=(-sS --max-time 30)(line 95) is inherited bytenant_call(lines 215-220).create_external_wsdid a singlePOST /workspaces(lines 223-231) and regex-parsedid.curl: (28) Operation timed out after 30002 milliseconds), noidwas parsed, and[ -n "$WS_A" ] || fail "ws-A create returned no id"(line ~258) hard-failed step 4 immediately.The external create still runs a DB transaction + post-commit token/status work before returning
201, so it is in the same provisioning-latency class as slow workspace boots — harness flake, not auser_taskscontract failure.Fix
create_external_wsnow retries the external-row create only:CREATE_WS_ATTEMPTS), longer per-call--max-time 90(CREATE_WS_MAX_TIME) that wins overCURL_COMMON's 30s, short linear backoff.000), or 2xx-but-no-id.set +e/$?-capture (mcp_call) and eventual-consistency poll style; longer--max-timemirrors the teardown DELETE's--max-time 120.Scope is intentionally narrow — no retries added anywhere else.
Validation
Real-staging-infra E2E (needs staging CP creds) — not runnable locally. Validated locally on the operator:
bash -n tests/e2e/test_staging_concierge_e2e.sh→ OK (rc=0)shellcheck(default +-S warning) → clean, 0 findingsAuthoritative validation is CI re-running the staging suite on this PR — not claimed from local reasoning.
Other two reds on run 358767 — assessment (independent, NOT this root)
Both run a different script (
test_staging_full_saas.sh, which does not usecreate_external_ws) and fail far past workspace creation, in the A2A canary phase:E2E Staging Platform Boot(job 487990): provisioning→ running, parent workspace created OK, then❌ A2A known-answer queue poll timed out. A2A agent-response timeout — not a create timeout.E2E Staging SaaS(job 487989): provisioning + parent and child creates all succeeded, then❌ A2A — STAGING LLM/BACKEND/RUNTIME FAILURE (_ResultError)(canary agent surfaced an LLM/backend/runtime error).These do not share the rc=28 external-create root and are not covered by this fix. They look like staging LLM/backend/runtime + agent-completion latency in a separate path — re-run vs separate-fix is a judgment call for the owner; flagging, not expanding scope.
Closes #2743
🤖 Generated with Claude Code
5-axis review on head
bb7aa54116. Correctness: the diff is scoped tocreate_external_wsintests/e2e/test_staging_concierge_e2e.shand addresses the rc=28 external-create flake without broadening retries elsewhere. Robustness: retries are bounded, use a longer per-call max-time, and only continue for transient/no-result cases; semantic 4xx/5xx responses fail immediately with response context, while exhausted retries fail loudly with the last curl rc and HTTP code. Security: no auth/token handling or new secret output beyond existing staging harness response snippets. Performance: 5 attempts with linear short backoff is acceptable for a staging E2E harness and avoids busy-looping. Readability: the helper comments explain the retry class and preserve the existing shell style. CI/all-required is green; staging suite validation remains CI-owned as documented./sop-ack
APPROVED (post-merge 5-axis review) on head
bb7aa54116.Correctness: the retry is scoped to
create_external_wsintests/e2e/test_staging_concierge_e2e.sh, exactly matching #2743: the prior one-shot external workspace create could timeout with curl rc=28 and return no id before any user_tasks assertions. The new helper retries transient rc=28/http000/2xx-no-id cases with bounded attempts and a longer per-call timeout, while semantic 4xx/5xx responses still hard-fail immediately with the response body.Risk/behavior: this does not weaken the user_tasks contract assertions or broadly hide staging failures; it only absorbs provisioning-latency flake on the setup row-create step. The final failure message preserves rc/http context. Scope is one E2E shell script. CI all-required is green on the head; PR was already merged before my formal review, so this is an audit-trail approval. /sop-ack