RCA: staging concierge user_tasks main-red on one-shot external workspace create timeout #2743

Closed
opened 2026-06-13 10:05:31 +00:00 by agent-researcher · 0 comments
Member

MECHANISM: Current main 6163f6636fc8 red is in run 358767/job 487991 before any user_tasks assertion. tests/e2e/test_staging_concierge_e2e.sh:95 sets CURL_COMMON=(-sS --max-time 30), tenant_call inherits that at tests/e2e/test_staging_concierge_e2e.sh:215-220, and create_external_ws performs a single POST /workspaces at tests/e2e/test_staging_concierge_e2e.sh:223-231. If the staging tenant create endpoint stalls for 30s, curl exits 28 and the helper parses no id; lines 255-260 hard-fail immediately instead of classifying this as staging control-plane latency and retrying boundedly. The workspace create handler path itself is the normal external-row path in workspace-server/internal/handlers/workspace.go:253-873; external creates still run the DB transaction + post-commit token/status work before returning 201.

EVIDENCE: Run 358767/job 487991 reached tenant running and TLS healthy, then failed at step 4. Log excerpt: curl: (28) Operation timed out after 30002 milliseconds; next line is ws-A create returned no id. The script source on the same head has no retry around create_external_ws; it calls tenant_call POST /workspaces once and immediately regex-extracts "id".

RECOMMENDED FIX SHAPE: In tests/e2e/test_staging_concierge_e2e.sh, wrap create_external_ws in a bounded retry helper for curl rc=28/http000/no-id on the external-row create only, with short backoff and logging of HTTP code/body on terminal failure. Keep semantic 4xx/5xx failures hard-red with the response body. Optionally use a longer per-call timeout just for external workspace creation, mirroring the rest of staging E2E’s eventual-consistency posture, because this is test harness flake rather than a user_tasks contract failure.

MECHANISM: Current main `6163f6636fc8` red is in run 358767/job 487991 before any user_tasks assertion. `tests/e2e/test_staging_concierge_e2e.sh:95` sets `CURL_COMMON=(-sS --max-time 30)`, `tenant_call` inherits that at `tests/e2e/test_staging_concierge_e2e.sh:215-220`, and `create_external_ws` performs a single POST `/workspaces` at `tests/e2e/test_staging_concierge_e2e.sh:223-231`. If the staging tenant create endpoint stalls for 30s, curl exits 28 and the helper parses no `id`; lines `255-260` hard-fail immediately instead of classifying this as staging control-plane latency and retrying boundedly. The workspace create handler path itself is the normal external-row path in `workspace-server/internal/handlers/workspace.go:253-873`; external creates still run the DB transaction + post-commit token/status work before returning `201`. EVIDENCE: Run 358767/job 487991 reached tenant running and TLS healthy, then failed at step 4. Log excerpt: `curl: (28) Operation timed out after 30002 milliseconds`; next line is `ws-A create returned no id`. The script source on the same head has no retry around `create_external_ws`; it calls `tenant_call POST /workspaces` once and immediately regex-extracts `"id"`. RECOMMENDED FIX SHAPE: In `tests/e2e/test_staging_concierge_e2e.sh`, wrap `create_external_ws` in a bounded retry helper for curl rc=28/http000/no-id on the external-row create only, with short backoff and logging of HTTP code/body on terminal failure. Keep semantic 4xx/5xx failures hard-red with the response body. Optionally use a longer per-call timeout just for external workspace creation, mirroring the rest of staging E2E’s eventual-consistency posture, because this is test harness flake rather than a user_tasks contract failure.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#2743