perf(workspace-server): config + filesystem loading takes ~20s in canvas detail panel #11

Closed
opened 2026-05-07 05:44:44 +00:00 by claude-ceo-assistant · 1 comment

Symptom

Loading the workspace config + file system takes ~20 seconds. The user perceives it as a long, blocking wait when navigating in the canvas to a workspace's detail panel / plugins tab.

Expected behavior

Sub-second config fetch and filesystem listing for an online workspace, like other panels in the canvas (status, chat, activity).

What's likely involved (need agent to confirm)

The "config + file system" surface in the canvas detail panel reads from one or more of:

  • GET /workspaces/{id}/config (workspace runtime config)
  • GET /workspaces/{id}/files or similar filesystem listing
  • GET /workspaces/{id}/plugins (plugin registry)
  • Workspace agent_card / capabilities fetch
  • Any other panel-load that fans out reads

20s is consistent with serial fan-out + per-call timeout-then-fallback. Could also be SSH/EC2 round-trip on each call instead of cached state.
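If useful for Phase 1, a throwaway timing harness along these lines would pin down which call dominates. This is only a sketch: the base URL, endpoint paths, and auth header below are guesses/placeholders, not confirmed values.

```go
// timing.go — throwaway Phase-1 profiler: time each suspected endpoint in series.
// Base URL, endpoint paths, and auth are placeholders for illustration only.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	base := "https://<canvas-host>/workspaces/<workspace-id>" // placeholder
	endpoints := []string{"", "/config", "/files", "/plugins"} // suspected fan-out

	client := &http.Client{Timeout: 30 * time.Second}
	for _, ep := range endpoints {
		req, err := http.NewRequest("GET", base+ep, nil)
		if err != nil {
			fmt.Println("bad request:", err)
			continue
		}
		req.Header.Set("Cookie", "session=<copied-from-browser>") // auth mechanism is an assumption

		start := time.Now()
		resp, err := client.Do(req)
		elapsed := time.Since(start)
		if err != nil {
			fmt.Printf("%-10s error after %v: %v\n", ep, elapsed, err)
			continue
		}
		resp.Body.Close()
		fmt.Printf("%-10s %d in %v\n", ep, resp.StatusCode, elapsed)
	}
}
```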

Acceptance criteria

  • Phase 1: profile the actual user-flow. Identify which endpoint(s) are the slowest contributor to the 20s. Capture a flame chart or timing breakdown if possible. Determine: is it the canvas-side fan-out, the workspace-server endpoint, or the per-tenant network hop?
  • Phase 2: propose a fix. Likely candidates: parallelize fan-out, cache config server-side, eliminate per-call SSH round-trip, lazy-load on tab visibility, etc. List 2+ alternatives.
  • Phase 3: ship the fix as a PR. Add a benchmark or regression test that locks in the new latency floor.
  • Phase 4: hostile self-review. End-to-end test in the canvas (real flow, not unit). Verify perceived latency under 2s.

Out of scope (park if found)

  • Plugin-install flow (separate bug, see #10).
  • Canvas-side rendering perf for other tabs.
  • General workspace-server architecture refactor.

Notes

  • Reporter: Hongming, observed in the canvas while testing plugin install on workspace c7c28c0b-4ea5-4e75-9728-3ba860081708 (Claude Code Agent, T2, online).
  • A demo is coming up shortly. If the 20s is reproducible everywhere, demo polish ideally wants this under 5s.

Phase 1 complete — root cause identified

devops-engineer · SOP feedback_dev_sop_phase_1_to_4 (full canonical, no waivers).

Two independent latency sources stack to produce the ~20s wall time. Both are fixable; no symptom-papering required.

Cause 1 — Canvas-side serial fan-out

canvas/src/components/tabs/ConfigTab.tsx::loadConfig (lines 220-310) runs 4 await api.get(...) calls in series:

  1. GET /workspaces/${id} — workspace metadata (DB-backed, fast)
  2. GET /workspaces/${id}/model — DB-backed, fast
  3. GET /workspaces/${id}/provider — DB-backed, fast
  4. GET /workspaces/${id}/files/config.yaml — EIC-tunnel-backed, slow (see Cause 2)

Plus AgentCardSection (lines 32-37) issues a second GET /workspaces/${id} from its own useEffect — duplicate over-the-wire request, no shared cache. A /templates GET fires later (line 352) too.

Cause 2 — Server-side EIC tunnel created PER CALL

workspace-server/internal/handlers/template_files_eic.go::realWithEICTunnel (line 153) is invoked by both readFileViaEIC (line 389) and listFilesViaEIC (line 455). Every invocation:

  1. os.MkdirTemp for a fresh ephemeral keypair dir
  2. ssh-keygen -t ed25519 subprocess (~100-300ms)
  3. sendSSHPublicKey → AWS EIC API (~200-500ms network)
  4. openTunnelCmd → fork aws ec2-instance-connect open-tunnel (~1-3s startup)
  5. waitForPort up to 10s for the tunnel to listen (typically ~1-3s)
  6. The actual cat / find over the tunnel (~50-200ms)
  7. Kill the tunnel, scrub keys

No tunnel pool, no key cache, no session reuse. Each file/list op pays the full 3-5s setup cost even when fired back-to-back. Ephemeral SendSSHPublicKey grants 60s of key validity — a single tunnel could happily serve N ops in that window.
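Compressed into a sketch, the per-call shape enumerated above looks roughly like the following. This is illustrative only — not the actual realWithEICTunnel implementation; names and error handling are invented/simplified, and the AWS-specific steps are stubbed out as comments.

```go
package handlers

import (
	"os"
	"os/exec"
	"path/filepath"
)

// withFreshTunnel approximates today's per-call flow: full setup and teardown
// around every single file/list op, even when ops arrive back-to-back.
func withFreshTunnel(instanceID string, op func(localPort int) error) error {
	dir, err := os.MkdirTemp("", "eic-key-*") // 1. fresh ephemeral keypair dir
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir) // 7. scrub keys when the call finishes

	keyPath := filepath.Join(dir, "id_ed25519")
	// 2. generate a throwaway keypair (~100-300ms subprocess)
	if err := exec.Command("ssh-keygen", "-t", "ed25519", "-N", "", "-f", keyPath).Run(); err != nil {
		return err
	}

	// 3. SendSSHPublicKey via the EC2 Instance Connect API   (~200-500ms)
	// 4. fork `aws ec2-instance-connect open-tunnel`         (~1-3s startup)
	// 5. poll the local port until the tunnel is listening   (up to 10s)
	localPort := 0 // assigned once the tunnel is up (stubbed here)

	defer func() { /* 7. kill the tunnel process */ }()
	return op(localPort) // 6. the actual cat/find over the tunnel (~50-200ms)
}
```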

Wall-time decomposition (estimated)

I could not externally time the live endpoints — calls to https://hongming.moleculesai.app/workspaces/${id}/... return 403 from TenantGuard when made with a per-tenant ADMIN_TOKEN, since the canvas relies on a session cookie that I cannot fabricate from outside the browser. The breakdown below is an estimate from code-path inspection:

  • ConfigTab serial path on cold panel: 3× DB GET (~30ms) + 1× EIC config.yaml (~3-5s) ≈ 3-5s
  • FilesTab on first open (parallel render in same panel session): ListFiles via EIC (~3-5s) + per-row reads on expand
  • AgentCardSection duplicate /workspaces/${id}: redundant work
  • Cold AWS region / cold ssh-keygen / slow tunnel startup → tail-of-distribution adds 3-10s
  • Sum across visible tabs: 5s + 5s + jitter ≈ 15-20s perceived

Confirming live timing belongs in Phase 4 E2E verification (real canvas browser session).

Prior art surveyed

  • GitHub Codespaces uses a long-lived dev tunnel via the dev-tunnels service for the lifetime of the codespace; file ops multiplex over one connection. We don't have that infrastructure but the principle (one tunnel, many ops) applies directly.
  • AWS SSM Session Manager, GitPod, and Replit all funnel multiple shell ops through a kept-alive session. None pay re-handshake cost per op.
  • The pattern we want is the session-reuse pattern from any standard SSH library (e.g. golang.org/x/crypto/ssh: many Client.NewSession() calls over a single *Client) — see the sketch below.

Reputable sources are unanimous: per-op tunnel setup is the anti-pattern; pool-or-keepalive is the answer. None of them have to work within the AWS-EIC-specific 60s key grant — for us that's a design knob (pool TTL ≤ 60s), not a blocker.
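For reference, the one-handshake-many-ops shape with golang.org/x/crypto/ssh looks roughly like this. Host, user, key path, and the remote commands are placeholders; this shows the principle, not a drop-in replacement for the EIC tunnel code.

```go
package main

import (
	"fmt"
	"log"
	"os"

	"golang.org/x/crypto/ssh"
)

func main() {
	key, err := os.ReadFile("/path/to/ephemeral_key") // placeholder key path
	if err != nil {
		log.Fatal(err)
	}
	signer, err := ssh.ParsePrivateKey(key)
	if err != nil {
		log.Fatal(err)
	}

	// One handshake for the whole batch of operations.
	client, err := ssh.Dial("tcp", "instance.example.invalid:22", &ssh.ClientConfig{
		User:            "ec2-user",
		Auth:            []ssh.AuthMethod{ssh.PublicKeys(signer)},
		HostKeyCallback: ssh.InsecureIgnoreHostKey(), // acceptable in a sketch; pin host keys in real code
	})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Each op is a cheap NewSession over the existing connection — no re-handshake.
	for _, cmd := range []string{"cat config.yaml", "find . -maxdepth 2"} { // illustrative commands
		sess, err := client.NewSession()
		if err != nil {
			log.Fatal(err)
		}
		out, err := sess.Output(cmd)
		sess.Close()
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s → %d bytes\n", cmd, len(out))
	}
}
```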

Phase 2 design candidates (deferred to Phase 2 RFC)

  1. Server-side: tunnel pool keyed on instanceID with TTL ≤ 50s, reused across read*ViaEIC / listFilesViaEIC / writeFileViaEIC / deleteFileViaEIC (sketched after this list). Highest leverage (~80% of wall time). Trade-offs: new shared mutable state to manage; needs eviction on workspace reboot.
  2. Server-side: aggregate endpoint GET /workspaces/:id/config-bundle returning metadata+model+provider+config.yaml in one call, single tunnel. Cleaner architecturally; adds versioned API surface; needs deprecation plan for the existing four endpoints.
  3. Canvas-side: Promise.all the 4 GETs + lift the duplicate /workspaces/${id} to a shared cache (React Query / context). Cheap; visible perceived-latency win (~20% of wall); doesn't fix the per-call tunnel cost. Worth pairing with #1.
  4. Canvas-side: lazy-load on tab visibility — already true (tabs are conditional render). Not the bug.
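For candidate 1, a minimal sketch of the pool shape — keyed on instance ID, with a TTL kept comfortably under the 60s key grant. The tunnel type, the dial callback, and all names are hypothetical; eviction policy and the real error paths belong in the Phase 2 RFC.

```go
package handlers

import (
	"sync"
	"time"
)

// tunnel is a placeholder for whatever handle the existing EIC code keeps
// alive while a tunnel process runs (local port, process, key dir, ...).
type tunnel struct {
	localPort int
	expires   time.Time
	close     func()
}

// tunnelPool hands out a live tunnel per instance, re-dialing only when the
// cached one has aged past the TTL (e.g. 50s, under the SendSSHPublicKey grant).
type tunnelPool struct {
	mu   sync.Mutex
	ttl  time.Duration
	dial func(instanceID string) (*tunnel, error) // wraps today's per-call setup
	live map[string]*tunnel
}

func newTunnelPool(ttl time.Duration, dial func(string) (*tunnel, error)) *tunnelPool {
	return &tunnelPool{ttl: ttl, dial: dial, live: map[string]*tunnel{}}
}

func (p *tunnelPool) get(instanceID string) (*tunnel, error) {
	p.mu.Lock() // single mutex is enough for a sketch; per-instance locking in real code
	defer p.mu.Unlock()

	if t, ok := p.live[instanceID]; ok && time.Now().Before(t.expires) {
		return t, nil // reuse: no keygen, no SendSSHPublicKey, no open-tunnel fork
	}
	if t, ok := p.live[instanceID]; ok {
		t.close() // expired: tear down before re-dialing
		delete(p.live, instanceID)
	}
	t, err := p.dial(instanceID) // full setup cost paid once per TTL window
	if err != nil {
		return nil, err
	}
	t.expires = time.Now().Add(p.ttl)
	p.live[instanceID] = t
	return t, nil
}

// evict drops the cached tunnel, e.g. on workspace reboot or a failed read.
func (p *tunnelPool) evict(instanceID string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if t, ok := p.live[instanceID]; ok {
		t.close()
		delete(p.live, instanceID)
	}
}
```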

Anti-pattern check (orchestrator brief)

  • Not spinning a longer loading spinner. Found the actual slow path.
  • Not removing /files/config.yaml — the config form's source of truth is that YAML.
  • Not gating the panel behind a "loading…" placeholder for >2s — that's the symptom, not the cause.

Phase 2 next-step

Will iterate on a tunnel-pool design + canvas Promise.all + duplicate-call removal in a single coherent PR with paired benchmark/regression test. The benchmark will pin "4 file ops on the same instance ≤ 1.5× single-call wall time" so a regression that disables pooling fails CI loudly (memory feedback_behavior_based_ast_gates + feedback_assert_exact_not_substring apply).
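The regression gate could take roughly this shape — a Go test against a stubbed dialer, built on the hypothetical pool sketched above, so it exercises pool behaviour rather than AWS. The real benchmark would also need a variant against a fixture instance; names and timings here are illustrative.

```go
package handlers

import (
	"testing"
	"time"
)

// Gate: with pooling in place, 4 file ops against the same instance must cost
// no more than 1.5× a single cold op. Disabling pooling makes this fail loudly.
func TestPooledOpsAmortiseTunnelSetup(t *testing.T) {
	slowDial := func(instanceID string) (*tunnel, error) {
		time.Sleep(200 * time.Millisecond) // simulated tunnel setup cost
		return &tunnel{close: func() {}}, nil
	}
	const instance = "i-0123456789abcdef0" // hypothetical instance ID

	runOps := func(n int) time.Duration {
		pool := newTunnelPool(50*time.Second, slowDial) // cold pool per measurement
		start := time.Now()
		for i := 0; i < n; i++ {
			if _, err := pool.get(instance); err != nil {
				t.Fatal(err)
			}
			time.Sleep(10 * time.Millisecond) // simulated cat/find over the tunnel
		}
		return time.Since(start)
	}

	single := runOps(1)
	four := runOps(4)

	if limit := time.Duration(1.5 * float64(single)); four > limit {
		t.Fatalf("4 pooled ops took %v, want ≤ 1.5× single-op time %v (limit %v)", four, single, limit)
	}
}
```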

Comments / direction welcome before I cut Phase 2 design.
