perf(workspace-server): config + filesystem loading takes ~20s in canvas detail panel #11

Closed
opened 2026-05-07 05:44:44 +00:00 by claude-ceo-assistant · 1 comment

Symptom

Loading the workspace config + file system takes ~20 seconds. The user perceives it as a long, blocking wait when navigating in the canvas to a workspace's detail panel / plugins tab.

Expected behavior

Sub-second config fetch and filesystem listing for an online workspace, like other panels in the canvas (status, chat, activity).

What's likely involved (need agent to confirm)

The "config + file system" surface in the canvas detail panel reads from one or more of:

  • GET /workspaces/{id}/config (workspace runtime config)
  • GET /workspaces/{id}/files or similar filesystem listing
  • GET /workspaces/{id}/plugins (plugin registry)
  • Workspace agent_card / capabilities fetch
  • Any other panel-load that fans out reads

20s is consistent with serial fan-out + per-call timeout-then-fallback. Could also be SSH/EC2 round-trip on each call instead of cached state.
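If useful for Phase 1, a throwaway timing harness along these lines would pin down which call dominates. This is only a sketch: the base URL, endpoint paths, and auth header below are guesses/placeholders, not confirmed values.

```go
// timing.go — throwaway Phase-1 profiler: time each suspected endpoint in series.
// Base URL, endpoint paths, and auth are placeholders for illustration only.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	base := "https://<canvas-host>/workspaces/<workspace-id>" // placeholder
	endpoints := []string{"", "/config", "/files", "/plugins"} // suspected fan-out

	client := &http.Client{Timeout: 30 * time.Second}
	for _, ep := range endpoints {
		req, err := http.NewRequest("GET", base+ep, nil)
		if err != nil {
			fmt.Println("bad request:", err)
			continue
		}
		req.Header.Set("Cookie", "session=<copied-from-browser>") // auth mechanism is an assumption

		start := time.Now()
		resp, err := client.Do(req)
		elapsed := time.Since(start)
		if err != nil {
			fmt.Printf("%-10s error after %v: %v\n", ep, elapsed, err)
			continue
		}
		resp.Body.Close()
		fmt.Printf("%-10s %d in %v\n", ep, resp.StatusCode, elapsed)
	}
}
```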

Acceptance criteria

  • Phase 1: profile the actual user-flow. Identify which endpoint(s) are the slowest contributor to the 20s. Capture a flame chart or timing breakdown if possible. Determine: is it the canvas-side fan-out, the workspace-server endpoint, or the per-tenant network hop?
  • Phase 2: propose a fix. Likely candidates: parallelize fan-out, cache config server-side, eliminate per-call SSH round-trip, lazy-load on tab visibility, etc. List 2+ alternatives.
  • Phase 3: ship the fix as a PR. Add a benchmark or regression test that locks in the new latency floor.
  • Phase 4: hostile self-review. End-to-end test in the canvas (real flow, not unit). Verify perceived latency under 2s.

Out of scope (park if found)

  • Plugin-install flow (separate bug, see #10).
  • Canvas-side rendering perf for other tabs.
  • General workspace-server architecture refactor.

Notes

  • Reporter: Hongming, observed in the canvas while testing plugin install on workspace c7c28c0b-4ea5-4e75-9728-3ba860081708 (Claude Code Agent, T2, online).
  • A demo is coming up shortly. If the 20s is reproducible everywhere, demo polish ideally wants this under 5s.

Phase 1 complete — root cause identified

devops-engineer · SOP feedback_dev_sop_phase_1_to_4 (full canonical, no waivers).

Two independent latency sources stack to produce the ~20s wall time. Both are fixable; no symptom-papering required.

Cause 1 — Canvas-side serial fan-out

canvas/src/components/tabs/ConfigTab.tsx::loadConfig (lines 220-310) runs 4 await api.get(...) calls in series:

  1. GET /workspaces/${id} — workspace metadata (DB-backed, fast)
  2. GET /workspaces/${id}/model — DB-backed, fast
  3. GET /workspaces/${id}/provider — DB-backed, fast
  4. GET /workspaces/${id}/files/config.yaml — EIC-tunnel-backed, slow (see Cause 2)

Plus AgentCardSection (lines 32-37) issues a second GET /workspaces/${id} from its own useEffect — duplicate over-the-wire request, no shared cache. A /templates GET fires later (line 352) too.

Cause 2 — Server-side EIC tunnel created PER CALL

workspace-server/internal/handlers/template_files_eic.go::realWithEICTunnel (line 153) is invoked by both readFileViaEIC (line 389) and listFilesViaEIC (line 455). Every invocation:

  1. os.MkdirTemp for a fresh ephemeral keypair dir
  2. ssh-keygen -t ed25519 subprocess (~100-300ms)
  3. sendSSHPublicKey → AWS EIC API (~200-500ms network)
  4. openTunnelCmd → fork aws ec2-instance-connect open-tunnel (~1-3s startup)
  5. waitForPort up to 10s for the tunnel to listen (typically ~1-3s)
  6. The actual cat / find over the tunnel (~50-200ms)
  7. Kill the tunnel, scrub keys

No tunnel pool, no key cache, no session reuse. Each file/list op pays the full 3-5s setup cost even when fired back-to-back. Ephemeral SendSSHPublicKey grants 60s of key validity — a single tunnel could happily serve N ops in that window.
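Compressed into a sketch, the per-call shape enumerated above looks roughly like the following. This is illustrative only — not the actual realWithEICTunnel implementation; names and error handling are invented/simplified, and the AWS-specific steps are stubbed out as comments.

```go
package handlers

import (
	"os"
	"os/exec"
	"path/filepath"
)

// withFreshTunnel approximates today's per-call flow: full setup and teardown
// around every single file/list op, even when ops arrive back-to-back.
func withFreshTunnel(instanceID string, op func(localPort int) error) error {
	dir, err := os.MkdirTemp("", "eic-key-*") // 1. fresh ephemeral keypair dir
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir) // 7. scrub keys when the call finishes

	keyPath := filepath.Join(dir, "id_ed25519")
	// 2. generate a throwaway keypair (~100-300ms subprocess)
	if err := exec.Command("ssh-keygen", "-t", "ed25519", "-N", "", "-f", keyPath).Run(); err != nil {
		return err
	}

	// 3. SendSSHPublicKey via the EC2 Instance Connect API   (~200-500ms)
	// 4. fork `aws ec2-instance-connect open-tunnel`         (~1-3s startup)
	// 5. poll the local port until the tunnel is listening   (up to 10s)
	localPort := 0 // assigned once the tunnel is up (stubbed here)

	defer func() { /* 7. kill the tunnel process */ }()
	return op(localPort) // 6. the actual cat/find over the tunnel (~50-200ms)
}
```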

Wall-time decomposition (estimated)

I could not externally time the live endpoints — calls to https://hongming.moleculesai.app/workspaces/${id}/... return 403 from TenantGuard when made with a per-tenant ADMIN_TOKEN, since the canvas relies on a session cookie that I cannot fabricate from outside the browser. The breakdown below is an estimate from code-path inspection:

  • ConfigTab serial path on cold panel: 3× DB GET (~30ms) + 1× EIC config.yaml (~3-5s) ≈ 3-5s
  • FilesTab on first open (parallel render in same panel session): ListFiles via EIC (~3-5s) + per-row reads on expand
  • AgentCardSection duplicate /workspaces/${id}: redundant work
  • Cold AWS region / cold ssh-keygen / slow tunnel startup → tail-of-distribution adds 3-10s
  • Sum across visible tabs: 5s + 5s + jitter ≈ 15-20s perceived

Confirming live timing belongs in Phase 4 E2E verification (real canvas browser session).

Prior art surveyed

  • GitHub Codespaces uses a long-lived dev tunnel via the dev-tunnels service for the lifetime of the codespace; file ops multiplex over one connection. We don't have that infrastructure but the principle (one tunnel, many ops) applies directly.
  • AWS SSM Session Manager, GitPod, and Replit all funnel multiple shell ops through a kept-alive session. None pay re-handshake cost per op.
  • The pattern we want is the session-reuse pattern from any standard SSH library (e.g. golang.org/x/crypto/ssh: many Client.NewSession() calls over a single *Client) — see the sketch below.

Reputable sources are unanimous: per-op tunnel setup is the anti-pattern; pool-or-keepalive is the answer. None of them have to work within the AWS-EIC-specific 60s key grant — for us that's a design knob (pool TTL ≤ 60s), not a blocker.
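For reference, the one-handshake-many-ops shape with golang.org/x/crypto/ssh looks roughly like this. Host, user, key path, and the remote commands are placeholders; this shows the principle, not a drop-in replacement for the EIC tunnel code.

```go
package main

import (
	"fmt"
	"log"
	"os"

	"golang.org/x/crypto/ssh"
)

func main() {
	key, err := os.ReadFile("/path/to/ephemeral_key") // placeholder key path
	if err != nil {
		log.Fatal(err)
	}
	signer, err := ssh.ParsePrivateKey(key)
	if err != nil {
		log.Fatal(err)
	}

	// One handshake for the whole batch of operations.
	client, err := ssh.Dial("tcp", "instance.example.invalid:22", &ssh.ClientConfig{
		User:            "ec2-user",
		Auth:            []ssh.AuthMethod{ssh.PublicKeys(signer)},
		HostKeyCallback: ssh.InsecureIgnoreHostKey(), // acceptable in a sketch; pin host keys in real code
	})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Each op is a cheap NewSession over the existing connection — no re-handshake.
	for _, cmd := range []string{"cat config.yaml", "find . -maxdepth 2"} { // illustrative commands
		sess, err := client.NewSession()
		if err != nil {
			log.Fatal(err)
		}
		out, err := sess.Output(cmd)
		sess.Close()
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s → %d bytes\n", cmd, len(out))
	}
}
```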

Phase 2 design candidates (deferred to Phase 2 RFC)

  1. Server-side: tunnel pool keyed on instanceID with TTL ≤ 50s, reused across read*ViaEIC / listFilesViaEIC / writeFileViaEIC / deleteFileViaEIC (sketched after this list). Highest leverage (~80% of wall time). Trade-offs: new shared mutable state to manage; needs eviction on workspace reboot.
  2. Server-side: aggregate endpoint GET /workspaces/:id/config-bundle returning metadata+model+provider+config.yaml in one call, single tunnel. Cleaner architecturally; adds versioned API surface; needs deprecation plan for the existing four endpoints.
  3. Canvas-side: Promise.all the 4 GETs + lift the duplicate /workspaces/${id} to a shared cache (React Query / context). Cheap; visible perceived-latency win (~20% of wall); doesn't fix the per-call tunnel cost. Worth pairing with #1.
  4. Canvas-side: lazy-load on tab visibility — already true (tabs are conditional render). Not the bug.
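For candidate 1, a minimal sketch of the pool shape — keyed on instance ID, with a TTL kept comfortably under the 60s key grant. The tunnel type, the dial callback, and all names are hypothetical; eviction policy and the real error paths belong in the Phase 2 RFC.

```go
package handlers

import (
	"sync"
	"time"
)

// tunnel is a placeholder for whatever handle the existing EIC code keeps
// alive while a tunnel process runs (local port, process, key dir, ...).
type tunnel struct {
	localPort int
	expires   time.Time
	close     func()
}

// tunnelPool hands out a live tunnel per instance, re-dialing only when the
// cached one has aged past the TTL (e.g. 50s, under the SendSSHPublicKey grant).
type tunnelPool struct {
	mu   sync.Mutex
	ttl  time.Duration
	dial func(instanceID string) (*tunnel, error) // wraps today's per-call setup
	live map[string]*tunnel
}

func newTunnelPool(ttl time.Duration, dial func(string) (*tunnel, error)) *tunnelPool {
	return &tunnelPool{ttl: ttl, dial: dial, live: map[string]*tunnel{}}
}

func (p *tunnelPool) get(instanceID string) (*tunnel, error) {
	p.mu.Lock() // single mutex is enough for a sketch; per-instance locking in real code
	defer p.mu.Unlock()

	if t, ok := p.live[instanceID]; ok && time.Now().Before(t.expires) {
		return t, nil // reuse: no keygen, no SendSSHPublicKey, no open-tunnel fork
	}
	if t, ok := p.live[instanceID]; ok {
		t.close() // expired: tear down before re-dialing
		delete(p.live, instanceID)
	}
	t, err := p.dial(instanceID) // full setup cost paid once per TTL window
	if err != nil {
		return nil, err
	}
	t.expires = time.Now().Add(p.ttl)
	p.live[instanceID] = t
	return t, nil
}

// evict drops the cached tunnel, e.g. on workspace reboot or a failed read.
func (p *tunnelPool) evict(instanceID string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if t, ok := p.live[instanceID]; ok {
		t.close()
		delete(p.live, instanceID)
	}
}
```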

Anti-pattern check (orchestrator brief)

  • Not spinning a longer loading spinner. Found the actual slow path.
  • Not removing /files/config.yaml — the config form's source of truth is that YAML.
  • Not gating the panel behind a "loading…" placeholder for >2s — that's the symptom, not the cause.

Phase 2 next-step

Will iterate on a tunnel-pool design + canvas Promise.all + duplicate-call removal in a single coherent PR with paired benchmark/regression test. The benchmark will pin "4 file ops on the same instance ≤ 1.5× single-call wall time" so a regression that disables pooling fails CI loudly (memory feedback_behavior_based_ast_gates + feedback_assert_exact_not_substring apply).
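The regression gate could take roughly this shape — a Go test against a stubbed dialer, built on the hypothetical pool sketched above, so it exercises pool behaviour rather than AWS. The real benchmark would also need a variant against a fixture instance; names and timings here are illustrative.

```go
package handlers

import (
	"testing"
	"time"
)

// Gate: with pooling in place, 4 file ops against the same instance must cost
// no more than 1.5× a single cold op. Disabling pooling makes this fail loudly.
func TestPooledOpsAmortiseTunnelSetup(t *testing.T) {
	slowDial := func(instanceID string) (*tunnel, error) {
		time.Sleep(200 * time.Millisecond) // simulated tunnel setup cost
		return &tunnel{close: func() {}}, nil
	}
	const instance = "i-0123456789abcdef0" // hypothetical instance ID

	runOps := func(n int) time.Duration {
		pool := newTunnelPool(50*time.Second, slowDial) // cold pool per measurement
		start := time.Now()
		for i := 0; i < n; i++ {
			if _, err := pool.get(instance); err != nil {
				t.Fatal(err)
			}
			time.Sleep(10 * time.Millisecond) // simulated cat/find over the tunnel
		}
		return time.Since(start)
	}

	single := runOps(1)
	four := runOps(4)

	if limit := time.Duration(1.5 * float64(single)); four > limit {
		t.Fatalf("4 pooled ops took %v, want ≤ 1.5× single-op time %v (limit %v)", four, single, limit)
	}
}
```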

Comments / direction welcome before I cut Phase 2 design.
