perf(workspace-server,canvas): EIC tunnel pool + canvas Promise.all (closes core#11) #13
Loading…
Reference in New Issue
Block a user
No description provided.
Delete Branch "feat/eic-tunnel-pool-core-11"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Symptom
Canvas detail-panel "config + filesystem load" took ~20s. Reported on production hongming tenant, workspace
c7c28c0b-4ea5-4e75-9728-3ba860081708(Claude Code Agent T2). Phase 1 root-cause writeup: #11 (comment)Two stacked latency sources
1. Server-side: per-call EIC tunnel setup (~80% of the win)
workspace-server/internal/handlers/template_files_eic.go::realWithEICTunnelperformedssh-keygen+SendSSHPublicKey+open-tunnel+waitForPortPER call. 4 callers (readFileViaEIC,writeFileViaEIC,listFilesViaEIC,deleteFileViaEIC) each paid the full ~3-5s setup cost even when fired back-to-back on the same workspace EC2.Fix: refcounted pool keyed on
instanceIDwith TTL ≤ 50s (under the 60sSendSSHPublicKeygrant). One tunnel serves N file ops; concurrent acquires for the same instance share the slot via apendingSetupsgate; LRU eviction caps simultaneous tracked instances at 32. Poisons entries on tunnel-fatal errors (connection refused, broken pipe, auth failed) so the next acquire builds fresh. Cleanup on panic viadefer-releasepattern (added after self-review caught a refcount-leak hazard).Public API unchanged —
var withEICTunnelrebinds topooledWithEICTunnelat package init, so all 4 callers inherit pooling for free.10 unit tests pin:
4-ops-amortise(1 setup),different-instances-do-not-share,TTL eviction,poison invalidates,concurrent-acquire-single-setup,TTL=0 escape hatch,LRU eviction at cap,error classification,refcount blocks expired eviction,panic poisons entry. All green.2. Canvas-side: serial fan-out + duplicate fetch (~20% of the win)
canvas/src/components/tabs/ConfigTab.tsx::loadConfigawaited 3 independent metadata GETs (/workspaces/{id},/model,/provider) serially.AgentCardSectionfired a SECOND/workspaces/{id}from its ownuseEffect.Fix:
Promise.allover the 3 metadata GETs (each leg keeps its existing.catchfallback semantics).AgentCardSectionnow readsagentCardfrom the canvas store (useCanvasStore) instead of refetching — the canvas already hydratesnode.data.agentCardfrom the platform event stream. Defensive selector handles test mocks without anodesarray.Verification
go test ./internal/handlers/5.07s green (full handlers package, including 10 new pool tests)go vet ./internal/handlers/cleannpx vitest run— 1380/1380 canvas unit tests pass (2 test FILES fail on a pre-existing xyflow CSS-load issue in vitest config, unrelated to this change)npx tsc --noEmitcleanLive wall-time verification deferred to Phase 4 / E2E (canvas browser session required; external probe blocked by 403 since the canvas auth chain is session-cookie + Origin header, not a bearer token I can fabricate).
Backwards compatibility / versioning
API surface unchanged. All 4 EIC handler callers use the rebound var; no caller migration. Pool defaults to enabled (TTL=50s); tests can disable by setting
poolTTL=0or by overwritingwithEICTunneldirectly (existing stub pattern intemplate_files_eic_dispatch_test.gopreserved). No new endpoint, no schema change, no env-var change required.Security
os.MkdirTempfor the TTL window — same surface as current code, slightly longer lifetime (50s vs ~30s per-call). Keys still scrubbed on cleanup; mode 0600 enforced byssh-keygen.SendSSHPublicKey's 60s grant is the underlying TTL ceiling — pool TTL stays ≤ 50s with 10s buffer.singleflight).Hostile self-review (3 weakest spots)
fnErrIndicatesTunnelFaultis a substring grep onerr.Error()— the marker list is hand-curated and ssh client error formats vary across OpenSSH versions. A future ssh that reports a tunnel failure via a phrasing not in the list would NOT poison the entry → next callers reuse a dead tunnel until TTL evicts. Acceptable: TTL bounds the impact (≤50s of bad reuse), and the heuristic covers every tunnel-error shape that appears in the existing test fixtures and known incidents. Future-friendly upgrade path: have callers signal poison explicitly via a separate API.acquire's for-loop has unbounded retry potential under pathological churn (signal closed → new acquirer → setup fails → repeat). No bounded retry counter. Today there is no test exercise for "flaky setup that succeeds-then-fails-then-succeeds"; if observability ever shows this shape, add a max-retry guard. Filed as a known limitation, not blocking.Substring classification could false-positive on app-level error messages that happen to contain "permission denied" or "broken pipe" verbatim. The classification test covers the discriminator but only against the error shapes we know today. Acceptable: poisoning errs on the side of building fresh, which is correct-but-slightly-slow rather than incorrect.
Rollout / rollback
git revert→ workspace-server publish → tenants redeploy. The pool is purely additive code; revert is safe at any time. Operators wanting a kill-switch without a revert can set env-drivenpoolTTL=0in a follow-up (not wired in this PR; see "Open follow-ups").Open follow-ups (parked, not blocking this PR)
EIC_TUNNEL_POOL_TTLenv var so operators have a kill switch without code change.ssh -o ControlMaster=autowould multiplex the per-op ssh handshake too (~100-200ms saved per op AFTER the tunnel pool gets us out of the per-op tunnel cost). Premature today; benchmark first under realistic load.GET /workspaces/:id/config-bundlefor a cleaner one-RTT design — separate RFC since it adds versioned API surface.Phase 4 / E2E plan
Live timing of the canvas detail-panel open against a real workspace (browser session, not external probe). Target: perceived latency under 2s on warm pool. Cold open still pays one tunnel setup (~3-5s) — the pool buys you the SECOND through Nth panel-open within the TTL window. Memory
feedback_chase_verification_to_stagingapplies — will not declare done at PR-merge; will follow through to user-visible behavior on staging.Closes #11.
## Symptom Canvas detail-panel "config + filesystem load" took ~20s. Reported on production hongming tenant, workspace c7c28c0b-... (Claude Code Agent T2). ## Two stacked latency sources ### 1. Server-side: per-call EIC tunnel setup (~80% of the win) `workspace-server/internal/handlers/template_files_eic.go::realWithEICTunnel` performed ssh-keygen + SendSSHPublicKey + open-tunnel + waitForPort PER call. 4 callers (read/write/list/delete) each paid the full ~3-5s setup cost even when fired back-to-back on the same workspace EC2. Fix: refcounted pool keyed on instanceID with TTL ≤ 50s (under the 60s SendSSHPublicKey grant). One tunnel serves N file ops; concurrent acquires for the same instance share the slot via a pendingSetups gate; LRU eviction caps simultaneous tracked instances at 32. Poisons entries on tunnel-fatal errors (connection refused, broken pipe, auth failed) so the next acquire builds fresh. Cleanup on panic via defer-release pattern (added after self-review caught a refcount-leak hazard). Public API unchanged — `var withEICTunnel` rebinds to `pooledWithEICTunnel` at package init, so all 4 callers inherit pooling for free. 10 unit tests pin: 4-ops-amortise (1 setup), different-instances-do-not-share, TTL eviction, poison invalidates, concurrent-acquire-single-setup, TTL=0 escape hatch, LRU eviction at cap, error classification heuristic, refcount blocks expired eviction, panic poisons entry. All green. ### 2. Canvas-side: serial fan-out + duplicate fetch (~20% of the win) `canvas/src/components/tabs/ConfigTab.tsx::loadConfig` awaited 3 independent metadata GETs (`/workspaces/{id}`, `/model`, `/provider`) serially. `AgentCardSection` fired a SECOND `/workspaces/{id}` from its own useEffect. Fix: Promise.all over the 3 metadata GETs (each leg keeps its existing .catch fallback semantics). AgentCardSection now reads `agentCard` from the canvas store (`useCanvasStore`) instead of refetching — the canvas already hydrates `node.data.agentCard` from the platform event stream. Defensive selector handles test mocks without a `nodes` array. ## Verification - `go test ./internal/handlers/` 5.07s green (full handlers package, including 10 new pool tests) - `go vet ./internal/handlers/` clean - `npx vitest run` — 1380/1380 canvas unit tests pass (2 test FILES fail on a pre-existing xyflow CSS-load issue in vitest config, unrelated to this change) - `npx tsc --noEmit` clean Live wall-time verification deferred to Phase 4 / E2E (canvas browser session required; external probe blocked by 403 since the canvas auth chain is session-cookie + Origin header, not a bearer token I can fabricate). ## Backwards compatibility API surface unchanged. All 4 EIC handler callers use the rebound var; no caller migration. Pool defaults to enabled (TTL=50s); tests can disable by setting poolTTL=0 or by overwriting withEICTunnel directly (existing stub pattern in template_files_eic_dispatch_test.go preserved). ## Hostile self-review (3 weakest spots) 1. `fnErrIndicatesTunnelFault` is a substring grep on err.Error() — the marker list is hand-curated and ssh client error formats vary across OpenSSH versions. A future ssh that reports a tunnel failure via a phrasing not in the list would NOT poison the entry → next callers reuse a dead tunnel until TTL evicts. Acceptable: TTL bounds the impact (≤50s of bad reuse), and the heuristic covers every tunnel-error shape that appears in the existing test fixtures and known incidents. 2. `acquire`'s for-loop has unbounded retry potential under pathological churn (signal closed → new acquirer → setup fails → repeat). No bounded retry counter. Today there is no test exercise for "flaky setup that succeeds-then-fails-then-succeeds"; if observability ever shows this shape, add a max-retry guard. Filed as a known limitation, not blocking. 3. The substring assertion `strings.Contains` style I used for tunnel-fault classification could false-positive on app-level error messages that happen to contain "permission denied" or "broken pipe" verbatim. The classification test covers the discriminator but only against the error shapes we know today. Acceptable: poisoning errs on the side of building fresh, which is correct-but-slightly-slow rather than incorrect. ## Phase 4 / E2E plan - Live timing of the canvas detail-panel open against a real workspace (browser session, not external probe). - Target: perceived latency under 2s on warm pool. Cold open still pays one tunnel setup (~3-5s) — the pool buys you the SECOND through Nth panel-open within the TTL window. - Memory `feedback_chase_verification_to_staging` applies — will not declare done at PR-merge; will follow through to user-visible behavior on staging. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>Approving as security-auditor peer per Hongming green-light. EIC tunnel pool + canvas Promise.all — closes #11 (20s perf).
Hongming-approved (chat 2026-05-07). EIC tunnel pool + canvas Promise.all. Closes core#11 (the 20s perf bug). Hostile self-review accepted (3 weakest spots bounded by TTL ≤50s). Merge.