fix(workspace-server): actionable error when EIC config.yaml write is deadline-killed #1426
Summary
`PUT /workspaces/:id/files/config.yaml` returned an opaque `500 {"error":"ssh install: signal: killed ()"}` whenever the EIC ssh subprocess was SIGKILLed by the handler's 30s `eicFileOpTimeout` deadline (the workspace was mid-provision with a slow/unready EIC tunnel). The operator had no way to know what happened or what to do.

`writeFileViaEIC` now detects context abortion (`ctx.Err()`) and returns an actionable message naming the cause and pointing at the Settings → Secrets encrypted-write path (which does NOT use the EIC file-write path) as the unblock for applying provider credentials.

Root cause (read-only RCA, 2026-05-17)
Workspace `3b81321b-...` (claude-code, STATUS=provisioning) Save returned the 500. `docker logs molecule-tenant` on EC2 `i-04e5197e96adb888f`: latency = exactly 30.00s = `eicFileOpTimeout`. Refuted OOM/disk/cgroup (no dmesg OOM on either host, container OOMKilled=false, host idle). It is the handler's own deadline killing the ssh subprocess, the same EIC-slowness RC as internal#423 on the same instance. PR #1237 only mitigated openclaw; claude-code still does the write and was exposed.
Scope
Single-workspace, single-occurrence transient (no other workspace hit `signal: killed` fleet-wide). The underlying EIC slowness is infra (tracked + commented on internal#423, GO-gated remediation surfaced to CTO). This PR is the code-side UX fix only.

Test plan
- `go build ./internal/handlers/`
- `go test ./internal/handlers/ -run TestWriteFileViaEIC_DeadlineExceeded_ActionableError -v` → PASS
- `go test ./internal/handlers/ -run 'EIC|WriteFile|TemplateFiles'` → no regression

Refs internal#423. Same Settings-area opaque-500 theme as #1420 / #1421.
🤖 Generated with Claude Code
When the per-op context deadline (eicFileOpTimeout=30s) fires, exec.CommandContext SIGKILLs the ssh subprocess and Run() returns the bare "signal: killed" with empty stderr. That surfaced to the canvas Settings/Config tab as an opaque `500 {"error":"ssh install: signal: killed ()"}`, giving the operator no signal that the workspace was simply mid-provision with a slow/unready EIC tunnel (internal#423; recurred 2026-05-17 on claude-code ws 3b81321b, blocking config save).

Detect context abortion explicitly and return a message that names the cause and points at the Settings -> Secrets encrypted-write path (which does NOT use this EIC file-write path) as the unblock for applying provider credentials. The EIC mechanism, timeout value, and success path are unchanged; this only improves the error a stuck write emits.

Refs internal#423. Same Settings-area opaque-500 theme as #1420.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

core-be review
Reviewed the diff and test — both look correct.
Fix logic (L358-377): The context-error detection before the generic `err` wrap is the right shape. `ctx.Err()` returning non-nil after `sshCmd.Run()` fails is the exact deadline-expiry signal: `exec.CommandContext` SIGKILLs the subprocess and returns an OS-specific "killed" error with empty stderr. Checking `ctx.Err()` first strips the OS-specific string from the error path entirely.

Condition: `cerr := ctx.Err(); cerr != nil` handles both `DeadlineExceeded` and `Canceled`, though in practice only `DeadlineExceeded` fires here since `writeFileViaEIC` derives a fresh child context with timeout. The guard for the `Canceled` case is defensive and harmless.

Test: The expired-parent approach is the right way to exercise the deadline path without needing `time.Sleep` or a fake port that accepts connections. Passing an already-expired parent ensures the inner `context.WithTimeout(ctx, eicFileOpTimeout)` inherits the expired deadline, so `ctx.Err()` is `DeadlineExceeded` before the ssh subprocess even starts. Deterministic and correct.

No issues. LGTM
[core-security-agent] APPROVED — errors.Is(ctx.Err(), context.DeadlineExceeded) actionable EIC tunnel error; no new injection/auth surface
[core-qa-agent] APPROVED — Go 14/14 pass. Fix: actionable error when EIC config.yaml write is deadline-killed (template_files_eic.go). e2e: N/A — platform not running locally (see CI).
infra-runtime-be review: APPROVED ✅
This is a workspace-server Go change — outside the molecule-runtime Python layer I own — but the code change is correct and well-tested.
Code review
`template_files_eic.go` (`writeFileViaEIC`):
- `ctx.Err()` check inside `sshCmd.Run()` error path is the right place: after the subprocess returns, before wrapping stderr ✅
- `errors.Is(cerr, context.Canceled) && !errors.Is(cerr, context.DeadlineExceeded)` distinguishes genuine cancellation from timeout, which is the correct classification ✅
- `%s` is present with `reason` argument; verified the string composition ✅

`template_files_eic_write_timeout_test.go`:
- Uses `context.WithDeadline(..., time.Now().Add(-time.Second))` to deterministically expire the parent context before entering `writeFileViaEIC`: clean test isolation ✅
- Stubs `withEICTunnel`, letting the real inner closure run, so the test exercises actual production code ✅

Cross-ref
As requested in the PR: the error wording "the workspace may still be provisioning (slow/unready SSH); retry once it is online, or apply provider credentials via Settings → Secrets (encrypted, does not use this file-write path)" is appropriate for canvas surfacing. This pairs well with the `sanitize_agent_error`/`error_detail` work in PR #1420 ✅

Five-axis (runtime): deadline branch detects ctx.Err(), distinguishes Canceled vs DeadlineExceeded; EIC mechanism/timeout/success path unchanged; non-deadline stderr path untouched; no secret/path leak. Deterministic test drives real writeFileViaEIC. Clean.
Five-axis (QA): new test stubs withEICTunnel, expired-parent-deadline drives real code path, asserts opaque signal:killed gone + actionable tokens present. Behavior-preserving. Clean.