perf(workspace-server): config + filesystem loading takes ~20s in canvas detail panel #11
Symptom
Fetching the workspace config and listing the file system takes ~20 seconds. The user perceives this as a long, blocking wait when navigating in the canvas to a workspace's detail panel / plugins tab.
Expected behavior
Sub-second config fetch and filesystem listing for an online workspace, like other panels in the canvas (status, chat, activity).
What's likely involved (need agent to confirm)
The "config + file system" surface in the canvas detail panel reads from one or more of:
- `GET /workspaces/{id}/config` (workspace runtime config)
- `GET /workspaces/{id}/files` or a similar filesystem listing
- `GET /workspaces/{id}/plugins` (plugin registry)

20s is consistent with a serial fan-out plus per-call timeout-then-fallback. It could also be an SSH/EC2 round-trip on each call instead of cached state.
Acceptance criteria
Out of scope (park if found)
Notes
Phase 1 complete — root cause identified
devops-engineer · SOP
Executed under `feedback_dev_sop_phase_1_to_4` (full canonical, no waivers).

Two independent latency sources stack to produce the ~20s wall time. Both are fixable; no symptom-papering required.
Cause 1 — Canvas-side serial fan-out
`canvas/src/components/tabs/ConfigTab.tsx::loadConfig` (lines 220-310) runs 4 `await api.get(...)` calls in series:

1. `GET /workspaces/${id}` — workspace metadata (DB-backed, fast)
2. `GET /workspaces/${id}/model` — DB-backed, fast
3. `GET /workspaces/${id}/provider` — DB-backed, fast
4. `GET /workspaces/${id}/files/config.yaml` — EIC-tunnel-backed, slow (see Cause 2)

Plus `AgentCardSection` (lines 32-37) issues a second `GET /workspaces/${id}` from its own `useEffect` — a duplicate over-the-wire request with no shared cache. A `/templates` GET fires later (line 352) too.

Cause 2 — Server-side EIC tunnel created PER CALL
`workspace-server/internal/handlers/template_files_eic.go::realWithEICTunnel` (line 153) is invoked by both `readFileViaEIC` (line 389) and `listFilesViaEIC` (line 455). Every invocation:

1. `os.MkdirTemp` for a fresh ephemeral keypair dir
2. `ssh-keygen -t ed25519` subprocess (~100-300ms)
3. `sendSSHPublicKey` → AWS EIC API (~200-500ms network)
4. `openTunnelCmd` → fork `aws ec2-instance-connect open-tunnel` (~1-3s startup)
5. `waitForPort` up to 10s for the tunnel to listen (typically ~1-3s)
6. `cat`/`find` over the tunnel (~50-200ms)

No tunnel pool, no key cache, no session reuse. Each file/list op pays the full 3-5s setup cost even when fired back-to-back. An ephemeral `SendSSHPublicKey` grants 60s of key validity — a single tunnel could happily serve N ops in that window.

Wall-time decomposition (estimated)
I could not externally time the live endpoints — calls to `https://hongming.moleculesai.app/workspaces/${id}/...` return `403` from `TenantGuard` without a per-tenant `ADMIN_TOKEN`, since the canvas relies on a session cookie that I cannot fabricate from outside the browser. Estimate from code-path inspection:

- `/workspaces/${id}` (duplicate call): redundant work

Confirming live timing belongs in Phase 4 E2E verification (real canvas browser session).
Prior art surveyed
- GitHub Codespaces keeps one `dev-tunnels` service alive for the lifetime of the codespace; file ops multiplex over one connection. We don't have that infrastructure, but the principle (one tunnel, many ops) applies directly.
- `golang.org/x/crypto/ssh` is built for session multiplexing (repeated `Client.NewSession()` over a single `*Client`).

Reputable sources are unanimous: per-op tunnel setup is the anti-pattern; pool-or-keepalive is the answer. None of them constrain on the AWS-EIC-specific 60s grant — that's a design knob (TTL ≤ 60s), not a blocker.
Phase 2 design candidates (deferred to Phase 2 RFC)
1. Tunnel pool keyed by `instanceID` with TTL ≤ 50s, reused across `read*ViaEIC`/`listFilesViaEIC`/`writeFileViaEIC`/`deleteFileViaEIC`. Highest leverage (~80% of wall time). Adds shared mutable state to manage; needs eviction on workspace reboot.
2. Aggregated `GET /workspaces/:id/config-bundle` returning metadata+model+provider+config.yaml in one call, over a single tunnel. Cleaner architecturally; adds versioned API surface; needs a deprecation plan for the existing four endpoints.
3. Canvas-side: `Promise.all` the 4 GETs + lift the duplicate `/workspaces/${id}` into a shared cache (React Query / context). Cheap; visible perceived-latency win (~20% of wall); doesn't fix the per-call tunnel cost. Worth pairing with #1.

Anti-pattern check (orchestrator brief)
Phase 2 next-step
Will iterate on a tunnel-pool design + canvas `Promise.all` + duplicate-call removal in a single coherent PR with a paired benchmark/regression test. The benchmark will pin "4 file ops on the same instance ≤ 1.5× single-call wall time" so a regression that disables pooling fails CI loudly (memory `feedback_behavior_based_ast_gates` + `feedback_assert_exact_not_substring` apply).

Comments / direction welcome before I cut the Phase 2 design.