perf(workspace-server,canvas): EIC tunnel pool + canvas Promise.all (closes core#11)
## Symptom
Canvas detail-panel "config + filesystem load" took ~20s. Reported on
production hongming tenant, workspace c7c28c0b-... (Claude Code Agent T2).
## Two stacked latency sources
### 1. Server-side: per-call EIC tunnel setup (~80% of the win)
`workspace-server/internal/handlers/template_files_eic.go::realWithEICTunnel`
performed ssh-keygen + SendSSHPublicKey + open-tunnel + waitForPort PER call.
4 callers (read/write/list/delete) each paid the full ~3-5s setup cost even
when fired back-to-back on the same workspace EC2.
Fix: a refcounted pool keyed on instanceID with TTL ≤ 50s (under the 60s
SendSSHPublicKey grant). One tunnel serves N file ops; concurrent acquires
for the same instance share the slot via a pendingSetups gate; LRU eviction
caps simultaneously tracked instances at 32. Entries are poisoned on
tunnel-fatal errors (connection refused, broken pipe, auth failed) so the
next acquire builds fresh. Cleanup on panic is guaranteed by a defer-release
pattern (added after self-review caught a refcount-leak hazard).
Public API unchanged — `var withEICTunnel` rebinds to `pooledWithEICTunnel`
at package init, so all 4 callers inherit pooling for free.
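The caller-side shape, condensed from the new `pooledWithEICTunnel` in
`eic_tunnel_pool.go` further down (comments trimmed; this is the
acquire/defer-release flow that keeps the refcount panic-safe, not a new API):

```go
// Condensed from eic_tunnel_pool.go below.
func pooledWithEICTunnel(ctx context.Context, instanceID string,
    fn func(s eicSSHSession) error) error {
    sess, done, err := getEICTunnelPool().acquire(ctx, instanceID)
    if err != nil {
        return err
    }
    poisoned := true                  // a panic in fn releases with poisoned=true
    defer func() { done(poisoned) }() // refcount never leaks, even on panic
    fnErr := fn(sess)
    poisoned = fnErrIndicatesTunnelFault(fnErr) // only tunnel-fatal errors invalidate
    return fnErr
}
```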
10 unit tests pin the behaviour (all green):
- 4 ops amortise to 1 setup
- different instances do not share
- TTL eviction
- poison invalidates the entry
- concurrent acquires trigger a single setup
- TTL=0 escape hatch
- LRU eviction at cap
- error-classification heuristic
- refcount blocks eviction of an expired entry
- panic poisons the entry
### 2. Canvas-side: serial fan-out + duplicate fetch (~20% of the win)
`canvas/src/components/tabs/ConfigTab.tsx::loadConfig` awaited 3 independent
metadata GETs (`/workspaces/{id}`, `/model`, `/provider`) serially.
`AgentCardSection` fired a SECOND `/workspaces/{id}` from its own useEffect.
Fix: Promise.all over the 3 metadata GETs (each leg keeps its existing
.catch fallback semantics). AgentCardSection now reads `agentCard` from
the canvas store (`useCanvasStore`) instead of refetching — the canvas
already hydrates `node.data.agentCard` from the platform event stream.
Defensive selector handles test mocks without a `nodes` array.
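The parallel fetch, as it appears in the new `loadConfig` (shown in full in
the diff below); each `.catch` keeps the old per-endpoint fallback:

```ts
const [wsRes, modelRes, providerRes] = await Promise.all([
  api.get<{ runtime?: string; tier?: number }>(`/workspaces/${workspaceId}`)
    .catch(() => ({} as { runtime?: string; tier?: number })), // fall back to YAML
  api.get<{ model?: string }>(`/workspaces/${workspaceId}/model`)
    .catch(() => ({} as { model?: string })),                  // non-fatal
  api.get<{ provider?: string }>(`/workspaces/${workspaceId}/provider`)
    .catch(() => null),                                        // old servers 404 here
]);
```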
## Verification
- `go test ./internal/handlers/` green in 5.07s (full handlers package,
  including the 10 new pool tests)
- `go vet ./internal/handlers/` clean
- `npx vitest run` — 1380/1380 canvas unit tests pass (2 test FILES fail to
  load due to a pre-existing xyflow CSS-load issue in the vitest config,
  unrelated to this change)
- `npx tsc --noEmit` clean
Live wall-time verification deferred to Phase 4 / E2E (canvas browser session
required; external probe blocked by 403 since the canvas auth chain is
session-cookie + Origin header, not a bearer token I can fabricate).
## Backwards compatibility
API surface unchanged. All 4 EIC handler callers use the rebound var; no
caller migration needed. The pool defaults to enabled (TTL=50s); tests can
disable it by setting poolTTL=0 or by overwriting withEICTunnel directly
(the existing stub pattern in template_files_eic_dispatch_test.go is
preserved).
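For example, a test can opt out in either of two ways (the TTL=0 pattern is
used by the new pool tests; the rebind pattern mirrors the existing dispatch
tests — the stub body here is illustrative only):

```go
// Option A: disable pooling for this test via the TTL=0 escape hatch.
prevTTL := poolTTL
poolTTL = 0
t.Cleanup(func() { poolTTL = prevTTL })

// Option B: rebind the package-level var to a single-shot stub.
prevFn := withEICTunnel
withEICTunnel = func(ctx context.Context, instanceID string,
    fn func(s eicSSHSession) error) error {
    return fn(eicSSHSession{instanceID: instanceID}) // no real tunnel
}
t.Cleanup(func() { withEICTunnel = prevFn })
```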
## Hostile self-review (3 weakest spots)
1. `fnErrIndicatesTunnelFault` is a substring grep over err.Error() — the
   marker list is hand-curated, and ssh client error formats vary across
   OpenSSH versions. A future ssh that reports a tunnel failure with a
   phrasing not in the list would NOT poison the entry, so subsequent
   callers would reuse a dead tunnel until the TTL evicts it. Acceptable:
   the TTL bounds the impact (≤50s of bad reuse), and the heuristic covers
   every tunnel-error shape that appears in the existing test fixtures and
   known incidents.
2. `acquire`'s for-loop has unbounded retry potential under pathological
   churn (signal closed → new acquirer → setup fails → repeat); there is no
   bounded retry counter. No test currently exercises a flaky setup that
   succeeds-then-fails-then-succeeds; if observability ever shows this
   shape, add a max-retry guard. Filed as a known limitation, not blocking.
3. The substring matching used for tunnel-fault classification could
   false-positive on app-level error messages that happen to contain
   "permission denied" or "broken pipe" verbatim. The classification test
   covers the discriminator, but only against the error shapes we know
   today. Acceptable: poisoning errs on the side of building fresh, which
   is correct-but-slightly-slow rather than incorrect.
## Phase 4 / E2E plan
- Live timing of the canvas detail-panel open against a real workspace
(browser session, not external probe).
- Target: perceived latency under 2s on warm pool. Cold open still pays
one tunnel setup (~3-5s) — the pool buys you the SECOND through Nth
panel-open within the TTL window.
- Memory `feedback_chase_verification_to_staging` applies — will not
declare done at PR-merge; will follow through to user-visible behavior
on staging.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
canvas/src/components/tabs/ConfigTab.tsx
@ -21,20 +21,39 @@ interface Props {
 // --- Agent Card Section ---
 
 function AgentCardSection({ workspaceId }: { workspaceId: string }) {
-  const [card, setCard] = useState<Record<string, unknown> | null>(null);
-  const [loading, setLoading] = useState(true);
+  // Initial card value comes from the canvas store — node.data.agentCard
+  // is hydrated by the platform stream when the workspace appears in the
+  // graph, so reading it here avoids a duplicate `GET /workspaces/${id}`
+  // (the parent ConfigTab.loadConfig already fetches workspace metadata,
+  // and refetching here adds a serialised RTT to the panel-open path —
+  // contributed to the ~20s detail-panel load reported in core#11).
+  // Local state still tracks the edited/saved value so the editor flow
+  // is unchanged.
+  const storeCard = useCanvasStore((s) => {
+    // Defensive against test mocks that omit `nodes` (some test files
+    // stub the store with a minimal shape). In production `nodes` is
+    // always an array — empty or not — so the optional chaining only
+    // matters for the test path.
+    const node = s.nodes?.find?.((n) => n.id === workspaceId);
+    return (node?.data.agentCard as
+      | Record<string, unknown>
+      | null
+      | undefined) ?? null;
+  });
+  const [card, setCard] = useState<Record<string, unknown> | null>(storeCard);
   const [editing, setEditing] = useState(false);
   const [draft, setDraft] = useState("");
   const [saving, setSaving] = useState(false);
   const [error, setError] = useState<string | null>(null);
   const [success, setSuccess] = useState(false);
 
+  // If the store updates while this section is mounted (another tab
+  // pushed an update via the platform event stream), reflect that —
+  // unless the user is mid-edit, in which case we don't clobber their
+  // unsaved draft.
   useEffect(() => {
-    api.get<Record<string, unknown>>(`/workspaces/${workspaceId}`)
-      .then((ws) => setCard((ws.agent_card as Record<string, unknown>) || null))
-      .catch(() => {})
-      .finally(() => setLoading(false));
-  }, [workspaceId]);
+    if (!editing) setCard(storeCard);
+  }, [storeCard, editing]);
 
   const handleSave = async () => {
     setError(null);
@ -53,9 +72,7 @@ function AgentCardSection({ workspaceId }: { workspaceId: string }) {
 
   return (
     <Section title="Agent Card" defaultOpen={false}>
-      {loading ? (
-        <div className="text-[10px] text-ink-soft">Loading...</div>
-      ) : editing ? (
+      {editing ? (
         <div className="space-y-2">
           <textarea
             aria-label="Agent card JSON editor"
@ -221,47 +238,51 @@ export function ConfigTab({ workspaceId }: Props) {
     setLoading(true);
     setError(null);
 
-    // ALWAYS load workspace metadata first (runtime + model). These are the
-    // source of truth regardless of whether the runtime uses our config.yaml
-    // template. Without this the form falls back to empty/default values on
-    // a hermes workspace (which doesn't use our template), creating the
-    // appearance that the saved runtime is unset — and worse, clicking Save
-    // would silently flip `runtime` from `hermes` back to the dropdown
-    // default `LangGraph`. See GH #1894.
-    let wsMetadataRuntime = "";
-    let wsMetadataModel = "";
-    let wsMetadataTier: number | null = null;
-    try {
-      const ws = await api.get<{ runtime?: string; tier?: number }>(`/workspaces/${workspaceId}`);
-      wsMetadataRuntime = (ws.runtime || "").trim();
-      if (typeof ws.tier === "number") wsMetadataTier = ws.tier;
-    } catch { /* fall back to config.yaml */ }
-    try {
-      const m = await api.get<{ model?: string }>(`/workspaces/${workspaceId}/model`);
-      wsMetadataModel = (m.model || "").trim();
-    } catch { /* non-fatal */ }
+    // Load workspace metadata (runtime + model + provider) in parallel.
+    // These are independent GETs against three workspace-server endpoints
+    // and used to be awaited serially — for SaaS workspaces each call
+    // round-trips through an EIC SSH tunnel, so the previous serial
+    // pattern stacked 3-5s of tunnel-setup latency per call (core#11).
+    // Promise.all overlaps them; the per-call cost stays the same but
+    // wall time drops to max() instead of sum().
+    //
+    // Each leg has its own .catch handler that yields a sentinel value,
+    // matching the previous semantics:
+    //   - /workspaces/${id}: required source-of-truth for runtime+tier;
+    //     fall back to YAML if the GET fails (rare, network-class only).
+    //   - /workspaces/${id}/model: non-fatal; empty model lets the form
+    //     fall through to YAML runtime_config.model.
+    //   - /workspaces/${id}/provider: non-fatal; old workspace-servers
+    //     return 404, in which case provider="" and Save skips the PUT.
+    //
+    // See GH #1894 for the workspace-row-as-source-of-truth rationale
+    // that motivated splitting from a single config.yaml read.
+    const [wsRes, modelRes, providerRes] = await Promise.all([
+      api.get<{ runtime?: string; tier?: number }>(`/workspaces/${workspaceId}`)
+        .catch(() => ({} as { runtime?: string; tier?: number })),
+      api.get<{ model?: string }>(`/workspaces/${workspaceId}/model`)
+        .catch(() => ({} as { model?: string })),
+      api.get<{ provider?: string }>(`/workspaces/${workspaceId}/provider`)
+        .catch(() => null),
+    ]);
+    const wsMetadataRuntime = (wsRes.runtime || "").trim();
+    const wsMetadataModel = (modelRes.model || "").trim();
+    const wsMetadataTier: number | null =
+      typeof wsRes.tier === "number" ? wsRes.tier : null;
+    if (providerRes !== null) {
+      const loadedProvider = (providerRes.provider || "").trim();
+      setProvider(loadedProvider);
+      setOriginalProvider(loadedProvider);
+    } else {
+      setProvider("");
+      setOriginalProvider("");
+    }
     // originalModel is set further down once the YAML has been parsed —
     // we want it to reflect what the form ACTUALLY rendered, which may
     // be the YAML's runtime_config.model fallback when MODEL_PROVIDER
     // is empty. Setting it here from wsMetadataModel alone would be
     // wrong for hermes/pre-#240 workspaces.
-
-    // Load explicit provider override (Option B PR-5). Endpoint returns
-    // {provider: "", source: "default"} when no override is set, so the
-    // empty string is the legitimate "auto-derive" signal — don't treat
-    // it as a load error. Non-fatal: an older workspace-server that
-    // predates PR-2 returns 404 here; the form falls back to "" and
-    // Save just won't PUT the provider field.
-    try {
-      const p = await api.get<{ provider?: string }>(`/workspaces/${workspaceId}/provider`);
-      const loadedProvider = (p.provider || "").trim();
-      setProvider(loadedProvider);
-      setOriginalProvider(loadedProvider);
-    } catch {
-      setProvider("");
-      setOriginalProvider("");
-    }
 
     // Skip the config.yaml fetch entirely for runtimes that manage
     // their own config (external, hermes, etc.) — they don't have a
     // platform-side template, so the GET would 404. The catch block
workspace-server/internal/handlers/eic_tunnel_pool.go (new file, 437 lines)
package handlers

// eic_tunnel_pool.go — refcounted pool for EIC SSH tunnels keyed on
// instanceID. Reuses one tunnel across N file ops, amortising the
// ssh-keygen + SendSSHPublicKey + open-tunnel + waitForPort cost
// (~3-5s) over multiple cats/finds (~50-200ms each).
//
// Origin: core#11 — canvas detail-panel config + filesystem load
// took ~20s. ConfigTab fans out 4 GETs serially; the slowest is
// /files/config.yaml which dispatches to readFileViaEIC. Without a
// pool, every readFileViaEIC + listFilesViaEIC + writeFileViaEIC +
// deleteFileViaEIC pays the full setup cost even when fired
// back-to-back on the same workspace EC2.
//
// The pool keeps one eicSSHSession alive per instanceID for up to
// poolTTL. SendSSHPublicKey grants a 60s key validity, so poolTTL
// must stay strictly below that to avoid serving requests on a
// just-expired key. We default to 50s with a 10s safety margin.
//
// Concurrency model:
//
//   - Single mutex guards the entries map.
//   - Slow path (tunnel setup) runs OUTSIDE the lock, gated by an
//     "intent" placeholder so concurrent acquires for the same
//     instanceID don't both build a tunnel — the loser drops its
//     setup and uses the winner's.
//   - Refcount on each entry; eviction blocked while refcount > 0.
//   - Janitor goroutine sweeps every poolJanitorInterval, drops
//     entries where refcount == 0 && expiresAt < now.
//
// Test injection:
//
//   - poolSetupTunnel is a package-level var so tests can swap the
//     slow path for a counting stub. Production wires it to
//     realWithEICTunnel-style setup.
//   - withEICTunnel (the public, single-shot API) is also a var
//     (already, see template_files_eic.go). It's rebound here to
//     pooledWithEICTunnel which routes through globalEICTunnelPool.
//   - Tests that need single-shot behaviour can set poolTTL = 0,
//     which makes pooledWithEICTunnel fall through to the underlying
//     setup directly (no pool entry kept).

import (
    "context"
    "fmt"
    "sync"
    "time"
)

// poolTTL is the maximum age of a pooled tunnel. Must be strictly
// less than the SendSSHPublicKey grant window (60s) so we never
// serve a request through a key that's about to expire mid-op.
//
// Configurable via init-time wiring (see initEICTunnelPool); not a
// const so tests can pin TTL=0 (disable pooling) or TTL=50ms (drive
// eviction tests).
var poolTTL = 50 * time.Second

// poolJanitorInterval is how often the janitor goroutine sweeps for
// expired idle entries. Tighter than poolTTL so eviction is timely;
// loose enough that the goroutine doesn't burn CPU.
var poolJanitorInterval = 10 * time.Second

// poolMaxEntries caps simultaneous instanceIDs the pool tracks.
// Beyond this, new acquires evict the LRU entry. Defends against a
// pathological caller (e.g. a sweep over hundreds of workspace
// EC2s) from leaking unbounded tunnel processes. 32 is a generous
// ceiling for the canvas use case (one human navigates ≤ ~5
// workspaces at a time).
var poolMaxEntries = 32

// poolSetupTunnel is the slow-path tunnel constructor. Wrapped in a
// var so tests can inject a counter stub. Returns a session and a
// cleanup function (closes the open-tunnel subprocess + scrubs the
// ephemeral keydir). nil session + non-nil err means setup failed
// and there is nothing to clean up.
//
// Production wiring lives in eic_tunnel_pool_setup.go (a thin shim
// over the existing realWithEICTunnel logic).
var poolSetupTunnel = func(ctx context.Context, instanceID string) (
    sess eicSSHSession, cleanup func(), err error) {
    return setupRealEICTunnel(ctx, instanceID)
}

// pooledTunnel is one entry in the pool. session is shared by N
// concurrent fn calls; cleanup runs once when refcount returns to
// zero AND the entry is past expiresAt or evicted.
//
// lastUsed tracks the most recent acquire time for LRU bookkeeping
// (overflow eviction). expiresAt is set at construction and not
// extended on use — a tunnel cannot live past poolTTL even if it's
// hot, because the underlying SendSSHPublicKey grant expires.
type pooledTunnel struct {
    session   eicSSHSession
    cleanup   func()
    expiresAt time.Time
    lastUsed  time.Time
    refcount  int
    poisoned  bool // true if a fn returned a tunnel-fatal error; do not reuse
}

// eicTunnelPool is the package-level pool. Single instance lives
// in globalEICTunnelPool; constructor runs lazily on first acquire.
type eicTunnelPool struct {
    mu      sync.Mutex
    entries map[string]*pooledTunnel
    // pendingSetups guards concurrent setup for the same instanceID.
    // First acquirer takes the slot; later ones wait on the channel.
    pendingSetups map[string]chan struct{}
    stopJanitor   chan struct{}
}

var (
    globalEICTunnelPool     *eicTunnelPool
    globalEICTunnelPoolOnce sync.Once
)

// getEICTunnelPool returns the singleton pool, lazy-initialising on
// first call. Idempotent.
func getEICTunnelPool() *eicTunnelPool {
    globalEICTunnelPoolOnce.Do(func() {
        globalEICTunnelPool = newEICTunnelPool()
        go globalEICTunnelPool.janitor()
    })
    return globalEICTunnelPool
}

// newEICTunnelPool constructs an empty pool. Exported so tests can
// build isolated pools without sharing the singleton.
func newEICTunnelPool() *eicTunnelPool {
    return &eicTunnelPool{
        entries:       map[string]*pooledTunnel{},
        pendingSetups: map[string]chan struct{}{},
        stopJanitor:   make(chan struct{}),
    }
}

// acquire returns a usable session for instanceID. If a healthy entry
// exists, refcount++ and return it. If a setup is in flight for the
// same instanceID, wait for it. Otherwise build one (slow path).
//
// done() must be called by the caller when the op finishes. It
// decrements refcount and triggers cleanup if the entry is past
// TTL or poisoned and refcount==0.
//
// Errors from the slow path propagate; pool state is not modified
// for failed setups (no poisoned entry created — that's only for
// fn-returned errors on a previously-good session).
func (p *eicTunnelPool) acquire(ctx context.Context, instanceID string) (
    sess eicSSHSession, done func(poisoned bool), err error) {

    if poolTTL <= 0 {
        // Pool disabled (TTL=0 mode for tests / opt-out). Fall
        // through to a direct setup with caller-driven cleanup.
        s, cleanup, err := poolSetupTunnel(ctx, instanceID)
        if err != nil {
            return eicSSHSession{}, nil, err
        }
        return s, func(_ bool) { cleanup() }, nil
    }

    for {
        p.mu.Lock()
        if pt, ok := p.entries[instanceID]; ok && !pt.poisoned && pt.expiresAt.After(time.Now()) {
            pt.refcount++
            pt.lastUsed = time.Now()
            p.mu.Unlock()
            return pt.session, p.releaser(instanceID, pt), nil
        }
        // Either no entry, expired entry, or poisoned entry. If a
        // setup is already in flight, wait and retry.
        if pending, ok := p.pendingSetups[instanceID]; ok {
            p.mu.Unlock()
            select {
            case <-pending:
                continue // re-check the entries map
            case <-ctx.Done():
                return eicSSHSession{}, nil, ctx.Err()
            }
        }
        // Drop expired/poisoned entry now (we'll cleanup outside
        // the lock — the entry is unreferenced or we'd not be here).
        var oldCleanup func()
        if pt, ok := p.entries[instanceID]; ok {
            if pt.refcount == 0 {
                oldCleanup = pt.cleanup
                delete(p.entries, instanceID)
            }
        }
        // Reserve the setup slot.
        signal := make(chan struct{})
        p.pendingSetups[instanceID] = signal
        p.mu.Unlock()

        if oldCleanup != nil {
            go oldCleanup()
        }

        // Slow path: build a new tunnel. Anything that goes wrong
        // here cleans up the pendingSetups slot and propagates to
        // the caller without leaving the pool in a state where the
        // next acquire blocks waiting on a signal that never fires.
        newSess, cleanup, setupErr := poolSetupTunnel(ctx, instanceID)

        p.mu.Lock()
        delete(p.pendingSetups, instanceID)
        close(signal)

        if setupErr != nil {
            p.mu.Unlock()
            return eicSSHSession{}, nil, fmt.Errorf("eic tunnel setup: %w", setupErr)
        }

        // Enforce LRU bound BEFORE inserting so we don't briefly
        // exceed the cap even by one entry.
        p.evictLRUIfFullLocked(instanceID)

        pt := &pooledTunnel{
            session:   newSess,
            cleanup:   cleanup,
            expiresAt: time.Now().Add(poolTTL),
            lastUsed:  time.Now(),
            refcount:  1,
        }
        p.entries[instanceID] = pt
        p.mu.Unlock()
        return pt.session, p.releaser(instanceID, pt), nil
    }
}

// releaser returns a closure that decrements refcount and triggers
// cleanup if (a) the entry is past TTL or (b) the caller signalled
// poison. Idempotent against double-release (decrements once via the
// captured pt; pool entry may have been replaced by then).
func (p *eicTunnelPool) releaser(instanceID string, pt *pooledTunnel) func(poisoned bool) {
    released := false
    return func(poisoned bool) {
        p.mu.Lock()
        defer p.mu.Unlock()
        if released {
            return
        }
        released = true
        pt.refcount--
        if poisoned {
            pt.poisoned = true
        }
        // Evict immediately if poisoned-and-idle OR expired-and-idle.
        // Hot entries (refcount > 0) defer eviction to the last release.
        if pt.refcount == 0 && (pt.poisoned || pt.expiresAt.Before(time.Now())) {
            // If the entry in the map is still us, remove it.
            if cur, ok := p.entries[instanceID]; ok && cur == pt {
                delete(p.entries, instanceID)
            }
            go pt.cleanup()
        }
    }
}

// evictLRUIfFullLocked drops the least-recently-used IDLE entry
// when the pool is at capacity. Caller must hold p.mu. The new
// instanceID about to be inserted is excluded so we don't evict
// ourselves. If no idle entries exist, no eviction happens — the
// new entry will push us above the soft cap until something releases.
func (p *eicTunnelPool) evictLRUIfFullLocked(skipInstance string) {
    if len(p.entries) < poolMaxEntries {
        return
    }
    var oldestKey string
    var oldest *pooledTunnel
    for k, pt := range p.entries {
        if k == skipInstance {
            continue
        }
        if pt.refcount > 0 {
            continue
        }
        if oldest == nil || pt.lastUsed.Before(oldest.lastUsed) {
            oldestKey = k
            oldest = pt
        }
    }
    if oldest == nil {
        return // every entry is in use; no eviction possible
    }
    delete(p.entries, oldestKey)
    go oldest.cleanup()
}

// janitor periodically scans for entries that are idle AND expired,
// closing their tunnels. Runs forever (per pool lifetime); cancelled
// by close(p.stopJanitor) for tests that build short-lived pools.
func (p *eicTunnelPool) janitor() {
    t := time.NewTicker(poolJanitorInterval)
    defer t.Stop()
    for {
        select {
        case <-t.C:
            p.sweep()
        case <-p.stopJanitor:
            return
        }
    }
}

// sweep is one janitor pass. Drops idle expired entries.
func (p *eicTunnelPool) sweep() {
    p.mu.Lock()
    now := time.Now()
    var toClose []func()
    for k, pt := range p.entries {
        if pt.refcount == 0 && pt.expiresAt.Before(now) {
            toClose = append(toClose, pt.cleanup)
            delete(p.entries, k)
        }
    }
    p.mu.Unlock()
    for _, c := range toClose {
        go c()
    }
}

// stop terminates the janitor and closes all idle entries. Hot
// (refcount > 0) entries are NOT force-closed — callers running
// against them would see a use-after-free. In practice stop is only
// called by tests that have already drained their callers.
func (p *eicTunnelPool) stop() {
    close(p.stopJanitor)
    p.mu.Lock()
    defer p.mu.Unlock()
    for k, pt := range p.entries {
        if pt.refcount == 0 {
            go pt.cleanup()
            delete(p.entries, k)
        }
    }
}

// pooledWithEICTunnel is the pool-backed replacement for
// realWithEICTunnel. The signature matches `var withEICTunnel`
// exactly so the rebind (in initEICTunnelPool) is a drop-in.
//
// Errors from `fn` itself are forwarded to the caller AND mark the
// pool entry as poisoned, so the next acquire builds a fresh
// tunnel. This catches the case where the workspace EC2 was
// restarted out-of-band (tunnel still appears alive locally but
// every cat/find errors out).
func pooledWithEICTunnel(ctx context.Context, instanceID string,
    fn func(s eicSSHSession) error) error {
    pool := getEICTunnelPool()
    sess, done, err := pool.acquire(ctx, instanceID)
    if err != nil {
        return err
    }
    // poisoned defaults to true so a panic from fn poisons the
    // entry on the way through the deferred release. Without the
    // defer, a panicking fn would leak refcount=1 forever and
    // permanently block eviction of this entry. The fn-error path
    // resets poisoned to its real classification before return.
    poisoned := true
    defer func() { done(poisoned) }()
    fnErr := fn(sess)
    poisoned = fnErrIndicatesTunnelFault(fnErr)
    return fnErr
}

// fnErrIndicatesTunnelFault returns true for fn errors whose nature
// suggests the underlying tunnel is no longer reusable (auth gone,
// network gone, ssh process dead). Returning true poisons the pool
// entry so the next acquire builds fresh.
//
// Conservative: only marks tunnel-faulty for clearly tunnel-level
// failures (connection refused, broken pipe, ssh exit-status from
// fatal-channel signals). A `cat` returning os.ErrNotExist on a
// missing file is NOT a tunnel fault — that's the file path being
// wrong, the tunnel is fine.
func fnErrIndicatesTunnelFault(err error) bool {
    if err == nil {
        return false
    }
    msg := err.Error()
    // stderr substrings produced by ssh when the tunnel is broken.
    for _, marker := range []string{
        "connection refused",
        "connection closed",
        "broken pipe",
        "Connection reset by peer",
        "kex_exchange_identification",
        "port forwarding failed",
        "Permission denied",
        "Authentication failed",
    } {
        if containsCaseInsensitive(msg, marker) {
            return true
        }
    }
    return false
}

// containsCaseInsensitive avoids importing strings just for this
// (the file already needs ssh stderr matching elsewhere — this
// keeps the helper local to avoid a cross-file dependency).
func containsCaseInsensitive(s, substr string) bool {
    if len(substr) > len(s) {
        return false
    }
    // Manual lowercase compare loop; ssh error markers are ASCII so
    // no need for unicode-aware folding.
    low := func(b byte) byte {
        if b >= 'A' && b <= 'Z' {
            return b + 32
        }
        return b
    }
    for i := 0; i+len(substr) <= len(s); i++ {
        match := true
        for j := 0; j < len(substr); j++ {
            if low(s[i+j]) != low(substr[j]) {
                match = false
                break
            }
        }
        if match {
            return true
        }
    }
    return false
}

// initEICTunnelPool rebinds the package-level withEICTunnel var to
// the pooled implementation. Called once at package init via the
// init() in eic_tunnel_pool_setup.go (split file so the rebind
// itself is testable without dragging in the production setup
// shim's exec/aws dependencies).
func initEICTunnelPool() {
    withEICTunnel = pooledWithEICTunnel
}
workspace-server/internal/handlers/eic_tunnel_pool_setup.go (new file, 136 lines)
package handlers

// eic_tunnel_pool_setup.go — production setup shim.
//
// setupRealEICTunnel decomposes the existing realWithEICTunnel into
// its slow half (build the tunnel) and its caller half (run fn). The
// pool calls the slow half once and shares the resulting session
// across N callers, holding cleanup until the last release.
//
// Why decompose instead of refactoring realWithEICTunnel: the
// existing function and its test stub-vars (withEICTunnel,
// sendSSHPublicKey, openTunnelCmd) are load-bearing for the
// dispatch tests. Extracting a sibling setup function preserves the
// existing single-shot path verbatim — the pool wraps it by calling
// realWithEICTunnel through a thin adapter, leaving the tested
// surface unchanged.
//
// The pool's acquire() invokes poolSetupTunnel, which is a `var`
// pointing to setupRealEICTunnel for production and a counting stub
// for tests.

import (
    "context"
    "fmt"
    "os"
    "os/exec"
    "strings"
    "time"
)

// setupRealEICTunnel is the slow path that the pool consumes when
// no warm entry exists. Mirrors realWithEICTunnel's setup half but
// returns the session + cleanup instead of running fn inline.
//
// The cleanup func owns the tunnel subprocess, ephemeral key dir,
// and a one-time wait. Idempotent — calling it twice is safe; the
// pool guarantees one call per session, but defence-in-depth helps
// when tests run pools in parallel and racy sweeps re-trigger.
func setupRealEICTunnel(ctx context.Context, instanceID string) (
    eicSSHSession, func(), error) {

    if instanceID == "" {
        return eicSSHSession{}, nil,
            fmt.Errorf("workspace has no instance_id — not a SaaS EC2 workspace")
    }
    osUser := os.Getenv("WORKSPACE_EC2_OS_USER")
    if osUser == "" {
        osUser = "ubuntu"
    }
    region := os.Getenv("AWS_REGION")
    if region == "" {
        region = "us-east-2"
    }

    keyDir, err := os.MkdirTemp("", "molecule-eic-pool-*")
    if err != nil {
        return eicSSHSession{}, nil, fmt.Errorf("keydir mkdir: %w", err)
    }
    keyPath := keyDir + "/id"
    if out, kerr := exec.CommandContext(ctx, "ssh-keygen",
        "-t", "ed25519", "-f", keyPath, "-N", "", "-q",
        "-C", "molecule-eic-pool",
    ).CombinedOutput(); kerr != nil {
        _ = os.RemoveAll(keyDir)
        return eicSSHSession{}, nil,
            fmt.Errorf("ssh-keygen: %w (%s)", kerr, strings.TrimSpace(string(out)))
    }
    pubKey, err := os.ReadFile(keyPath + ".pub")
    if err != nil {
        _ = os.RemoveAll(keyDir)
        return eicSSHSession{}, nil, fmt.Errorf("read pubkey: %w", err)
    }

    if err := sendSSHPublicKey(ctx, region, instanceID, osUser,
        strings.TrimSpace(string(pubKey))); err != nil {
        _ = os.RemoveAll(keyDir)
        return eicSSHSession{}, nil, fmt.Errorf("send-ssh-public-key: %w", err)
    }

    localPort, err := pickFreePort()
    if err != nil {
        _ = os.RemoveAll(keyDir)
        return eicSSHSession{}, nil, fmt.Errorf("pick free port: %w", err)
    }

    tunnel := openTunnelCmd(eicSSHOptions{
        InstanceID:     instanceID,
        OSUser:         osUser,
        Region:         region,
        LocalPort:      localPort,
        PrivateKeyPath: keyPath,
    })
    tunnel.Env = os.Environ()
    if err := tunnel.Start(); err != nil {
        _ = os.RemoveAll(keyDir)
        return eicSSHSession{}, nil, fmt.Errorf("open-tunnel start: %w", err)
    }

    if err := waitForPort(ctx, "127.0.0.1", localPort, 10*time.Second); err != nil {
        if tunnel.Process != nil {
            _ = tunnel.Process.Kill()
        }
        _ = tunnel.Wait()
        _ = os.RemoveAll(keyDir)
        return eicSSHSession{}, nil, fmt.Errorf("tunnel never listened: %w", err)
    }

    cleanedUp := false
    cleanup := func() {
        if cleanedUp {
            return
        }
        cleanedUp = true
        if tunnel.Process != nil {
            _ = tunnel.Process.Kill()
        }
        _ = tunnel.Wait()
        _ = os.RemoveAll(keyDir)
    }

    return eicSSHSession{
        keyPath:    keyPath,
        localPort:  localPort,
        osUser:     osUser,
        instanceID: instanceID,
    }, cleanup, nil
}

// init wires the pool into the package-level withEICTunnel var so
// every read/write/list/delete EIC op uses pooled tunnels by default.
// Test files that need single-shot behaviour can swap withEICTunnel
// back via the existing stubWithEICTunnel pattern, OR set poolTTL=0
// to disable pooling without rebinding the var.
func init() {
    initEICTunnelPool()
}
workspace-server/internal/handlers/eic_tunnel_pool_test.go (new file, 467 lines)
package handlers

// eic_tunnel_pool_test.go — tests for the refcounted EIC tunnel pool
// added in core#11. Stubs poolSetupTunnel with a counter so the
// tests don't fork ssh-keygen / aws subprocesses.
//
// Per memory feedback_assert_exact_not_substring: each test pins
// exact expected counts (not "at least N") so a regression that
// silently double-sets-up surfaces here.

import (
    "context"
    "errors"
    "sync"
    "sync/atomic"
    "testing"
    "time"
)

// withPoolSetupStub swaps poolSetupTunnel for a counting fake that
// returns a sentinel session and a cleanup func that records its
// invocation. Restores on test cleanup.
//
// setupSignal blocks each setup until released — for concurrent-
// acquire tests where we want to gate setup completion.
func withPoolSetupStub(t *testing.T) (
    setupCount *int64, cleanupCount *int64, restore func(), unblock func()) {
    t.Helper()
    prev := poolSetupTunnel
    prevTTL := poolTTL
    prevJanitor := poolJanitorInterval

    var sc, cc int64
    setupCount, cleanupCount = &sc, &cc

    gate := make(chan struct{}, 1)
    gate <- struct{}{} // allow the first setup through immediately
    unblock = func() { gate <- struct{}{} }

    poolSetupTunnel = func(ctx context.Context, instanceID string) (
        eicSSHSession, func(), error) {
        select {
        case <-gate:
        case <-ctx.Done():
            return eicSSHSession{}, nil, ctx.Err()
        }
        atomic.AddInt64(&sc, 1)
        sess := eicSSHSession{
            instanceID: instanceID,
            osUser:     "ubuntu",
            localPort:  10000 + int(atomic.LoadInt64(&sc)),
            keyPath:    "/tmp/molecule-eic-test-" + instanceID,
        }
        cleanup := func() { atomic.AddInt64(&cc, 1) }
        return sess, cleanup, nil
    }

    restore = func() {
        poolSetupTunnel = prev
        poolTTL = prevTTL
        poolJanitorInterval = prevJanitor
    }
    t.Cleanup(restore)
    return
}

// freshPool returns an isolated pool (NOT the global) so tests run
// independently. Stops the janitor on cleanup.
func freshPool(t *testing.T) *eicTunnelPool {
    t.Helper()
    p := newEICTunnelPool()
    t.Cleanup(p.stop)
    return p
}

// TestEICTunnelPool_FourOpsAmortise pins the core invariant: four
// sequential acquire/release cycles on the same instanceID share
// ONE underlying tunnel setup. Mutation: delete the cache hit branch
// in acquire() → setupCount goes 1 → 4 → test fails.
func TestEICTunnelPool_FourOpsAmortise(t *testing.T) {
    setupCount, cleanupCount, _, _ := withPoolSetupStub(t)
    // Refill gate after each setup so concurrent stubs aren't blocked
    // (we want every test to be able to set up if it needs to).
    t.Cleanup(func() { /* no-op; defer is enough */ })
    poolTTL = 50 * time.Second
    pool := freshPool(t)
    ctx := context.Background()

    for i := 0; i < 4; i++ {
        sess, done, err := pool.acquire(ctx, "i-test-1")
        if err != nil {
            t.Fatalf("op %d: acquire: %v", i, err)
        }
        if sess.instanceID != "i-test-1" {
            t.Fatalf("op %d: session has wrong instanceID: %q", i, sess.instanceID)
        }
        done(false)
    }

    if got := atomic.LoadInt64(setupCount); got != 1 {
        t.Errorf("expected exactly 1 tunnel setup across 4 ops, got %d", got)
    }
    if got := atomic.LoadInt64(cleanupCount); got != 0 {
        t.Errorf("expected 0 cleanups while entry is hot (TTL=50s), got %d", got)
    }
}

// TestEICTunnelPool_DifferentInstancesDoNotShare pins that two
// different instanceIDs each get their own tunnel — the pool is
// keyed on instanceID, not a single global slot.
func TestEICTunnelPool_DifferentInstancesDoNotShare(t *testing.T) {
    setupCount, _, _, unblock := withPoolSetupStub(t)
    poolTTL = 50 * time.Second
    pool := freshPool(t)
    ctx := context.Background()

    // First instance setup uses the initial gate slot.
    _, doneA, err := pool.acquire(ctx, "i-a")
    if err != nil {
        t.Fatalf("acquire A: %v", err)
    }
    doneA(false)

    // Second instance needs a new slot through the gate.
    unblock()
    _, doneB, err := pool.acquire(ctx, "i-b")
    if err != nil {
        t.Fatalf("acquire B: %v", err)
    }
    doneB(false)

    if got := atomic.LoadInt64(setupCount); got != 2 {
        t.Errorf("expected 2 setups (one per instance), got %d", got)
    }
}

// TestEICTunnelPool_TTLEviction: a short TTL forces the second op
// to build a fresh tunnel after the first expires.
func TestEICTunnelPool_TTLEviction(t *testing.T) {
    setupCount, cleanupCount, _, unblock := withPoolSetupStub(t)
    poolTTL = 50 * time.Millisecond
    poolJanitorInterval = 1 * time.Second // keep janitor away
    pool := freshPool(t)
    ctx := context.Background()

    _, done, err := pool.acquire(ctx, "i-ttl")
    if err != nil {
        t.Fatalf("acquire 1: %v", err)
    }
    done(false)

    time.Sleep(80 * time.Millisecond) // past TTL

    unblock() // allow next setup
    _, done, err = pool.acquire(ctx, "i-ttl")
    if err != nil {
        t.Fatalf("acquire 2: %v", err)
    }
    done(false)

    if got := atomic.LoadInt64(setupCount); got != 2 {
        t.Errorf("expected 2 setups (TTL eviction between), got %d", got)
    }
    // First entry should have been cleaned up when the second
    // acquire evicted it on the slow path. Cleanup runs in a
    // goroutine; poll briefly for it to land.
    deadline := time.Now().Add(500 * time.Millisecond)
    for atomic.LoadInt64(cleanupCount) < 1 && time.Now().Before(deadline) {
        time.Sleep(5 * time.Millisecond)
    }
    if got := atomic.LoadInt64(cleanupCount); got < 1 {
        t.Errorf("expected ≥1 cleanup (first entry evicted), got %d", got)
    }
}

// TestEICTunnelPool_FailureInvalidates pins the poison-on-fault
// behavior — fn returning a tunnel-fatal error marks the entry
// unusable so the next acquire builds fresh.
func TestEICTunnelPool_FailureInvalidates(t *testing.T) {
    setupCount, _, _, unblock := withPoolSetupStub(t)
    poolTTL = 50 * time.Second
    pool := freshPool(t)
    ctx := context.Background()

    _, done, err := pool.acquire(ctx, "i-fault")
    if err != nil {
        t.Fatalf("acquire 1: %v", err)
    }
    done(true) // signal poison

    unblock() // let the next setup through
    _, done, err = pool.acquire(ctx, "i-fault")
    if err != nil {
        t.Fatalf("acquire 2: %v", err)
    }
    done(false)

    if got := atomic.LoadInt64(setupCount); got != 2 {
        t.Errorf("expected 2 setups (poison forced rebuild), got %d", got)
    }
}

// TestEICTunnelPool_ConcurrentAcquireSingleSetup pins that N
// concurrent acquires for the same instanceID before any release
// only trigger ONE tunnel setup — the rest wait via pendingSetups.
//
// Without this guard each concurrent acquire would spawn its own
// tunnel and the loser-cleanup would still leak refcount. Mutation:
// delete the pendingSetups gate → setupCount goes 1 → N → fails.
func TestEICTunnelPool_ConcurrentAcquireSingleSetup(t *testing.T) {
    setupCount, _, _, _ := withPoolSetupStub(t)
    // Pause setup so all goroutines pile into the pending slot.
    prev := poolSetupTunnel
    gate := make(chan struct{})
    poolSetupTunnel = func(ctx context.Context, instanceID string) (
        eicSSHSession, func(), error) {
        <-gate
        atomic.AddInt64(setupCount, 1)
        return eicSSHSession{instanceID: instanceID}, func() {}, nil
    }
    t.Cleanup(func() { poolSetupTunnel = prev })

    poolTTL = 50 * time.Second
    pool := freshPool(t)
    ctx := context.Background()

    const N = 8
    type result struct {
        done func(bool)
        err  error
    }
    results := make(chan result, N)
    var startWg sync.WaitGroup
    startWg.Add(N)
    for i := 0; i < N; i++ {
        go func() {
            startWg.Done()
            _, done, err := pool.acquire(ctx, "i-concurrent")
            results <- result{done, err}
        }()
    }
    startWg.Wait()
    // give all N goroutines time to enter pool.acquire
    time.Sleep(20 * time.Millisecond)
    close(gate)

    for i := 0; i < N; i++ {
        r := <-results
        if r.err != nil {
            t.Fatalf("acquire %d: %v", i, r.err)
        }
        r.done(false)
    }

    if got := atomic.LoadInt64(setupCount); got != 1 {
        t.Errorf("expected 1 setup across %d concurrent acquires, got %d", N, got)
    }
}

// TestEICTunnelPool_TTLZeroDisablesPooling pins the escape hatch:
// poolTTL=0 means every acquire goes straight through to setup +
// cleanup, no entry kept. Useful for tests / opt-out.
func TestEICTunnelPool_TTLZeroDisablesPooling(t *testing.T) {
    setupCount, cleanupCount, _, unblock := withPoolSetupStub(t)
    poolTTL = 0
    pool := freshPool(t)
    ctx := context.Background()

    _, done, err := pool.acquire(ctx, "i-ttlzero")
    if err != nil {
        t.Fatalf("acquire 1: %v", err)
    }
    done(false)

    unblock()
    _, done, err = pool.acquire(ctx, "i-ttlzero")
    if err != nil {
        t.Fatalf("acquire 2: %v", err)
    }
    done(false)

    if got := atomic.LoadInt64(setupCount); got != 2 {
        t.Errorf("expected 2 setups with TTL=0 (pool disabled), got %d", got)
    }
    if got := atomic.LoadInt64(cleanupCount); got != 2 {
        t.Errorf("expected 2 cleanups with TTL=0 (each release closes), got %d", got)
    }
}

// TestEICTunnelPool_LRUEvictionAtCap pins the LRU defence: when the
// pool reaches poolMaxEntries, a new acquire for an unseen
// instanceID evicts the LRU idle entry instead of growing unbounded.
func TestEICTunnelPool_LRUEvictionAtCap(t *testing.T) {
    setupCount, cleanupCount, _, _ := withPoolSetupStub(t)
    prev := poolMaxEntries
    poolMaxEntries = 2
    t.Cleanup(func() { poolMaxEntries = prev })
    poolTTL = 50 * time.Second

    // Replace stub with one that doesn't gate so we can fill quickly.
    poolSetupTunnel = func(ctx context.Context, instanceID string) (
        eicSSHSession, func(), error) {
        atomic.AddInt64(setupCount, 1)
        return eicSSHSession{instanceID: instanceID}, func() {
            atomic.AddInt64(cleanupCount, 1)
        }, nil
    }

    pool := freshPool(t)
    ctx := context.Background()

    for _, id := range []string{"i-1", "i-2"} {
        _, done, err := pool.acquire(ctx, id)
        if err != nil {
            t.Fatalf("acquire %s: %v", id, err)
        }
        done(false)
    }
    // Both entries idle, pool at cap.
    _, done, err := pool.acquire(ctx, "i-3")
    if err != nil {
        t.Fatalf("acquire i-3: %v", err)
    }
    done(false)

    // Wait for the goroutine'd cleanup of the evicted entry.
    deadline := time.Now().Add(500 * time.Millisecond)
    for atomic.LoadInt64(cleanupCount) < 1 && time.Now().Before(deadline) {
        time.Sleep(10 * time.Millisecond)
    }

    if got := atomic.LoadInt64(setupCount); got != 3 {
        t.Errorf("expected 3 setups (one per unique instance), got %d", got)
    }
    if got := atomic.LoadInt64(cleanupCount); got < 1 {
        t.Errorf("expected ≥1 cleanup (LRU eviction), got %d", got)
    }
}

// TestEICTunnelPool_PoisonedClassification pins the heuristic that
// distinguishes tunnel-fatal errors (poison the entry) from
// app-level errors (file not found, validation) that should NOT
// invalidate the tunnel.
func TestEICTunnelPool_PoisonedClassification(t *testing.T) {
    cases := []struct {
        name string
        err  error
        want bool
    }{
        {"nil", nil, false},
        {"file not found", errors.New("os: file does not exist"), false},
        {"validation", errors.New("invalid path: must be relative"), false},
        {"connection refused", errors.New("ssh: connect to host: connection refused"), true},
        {"connection refused upper", errors.New("Connection Refused"), true},
        {"broken pipe", errors.New("write tunnel: broken pipe"), true},
        {"permission denied", errors.New("Permission denied (publickey)"), true},
        {"auth failed", errors.New("Authentication failed"), true},
        {"connection reset", errors.New("Connection reset by peer"), true},
        {"port forward", errors.New("port forwarding failed"), true},
    }
    for _, tc := range cases {
        t.Run(tc.name, func(t *testing.T) {
            got := fnErrIndicatesTunnelFault(tc.err)
            if got != tc.want {
                t.Errorf("fnErrIndicatesTunnelFault(%v) = %v, want %v",
                    tc.err, got, tc.want)
            }
        })
    }
}

// TestEICTunnelPool_RefcountBlocksEviction pins that an entry past
// TTL is NOT evicted while a caller still holds it — preventing
// use-after-free in the holder.
func TestEICTunnelPool_RefcountBlocksEviction(t *testing.T) {
    setupCount, cleanupCount, _, _ := withPoolSetupStub(t)
    poolTTL = 30 * time.Millisecond
    poolJanitorInterval = 5 * time.Millisecond
    pool := freshPool(t)
    ctx := context.Background()

    _, done, err := pool.acquire(ctx, "i-hold")
    if err != nil {
        t.Fatalf("acquire: %v", err)
    }

    // Sleep past TTL while holding the session. Janitor sweeps
    // every 5ms but must skip our entry because refcount=1.
    time.Sleep(80 * time.Millisecond)

    if got := atomic.LoadInt64(cleanupCount); got != 0 {
        t.Errorf("expected 0 cleanups while holder is active, got %d", got)
    }

    done(false)
    // Now refcount=0 and entry is past TTL; releaser triggers cleanup.
    deadline := time.Now().Add(200 * time.Millisecond)
    for atomic.LoadInt64(cleanupCount) < 1 && time.Now().Before(deadline) {
        time.Sleep(5 * time.Millisecond)
    }
    if got := atomic.LoadInt64(cleanupCount); got != 1 {
        t.Errorf("expected 1 cleanup after release of expired entry, got %d", got)
    }
    if got := atomic.LoadInt64(setupCount); got != 1 {
        t.Errorf("setupCount tracking: got %d, want 1", got)
    }
}

// TestPooledWithEICTunnel_PanicPoisonsEntry pins that a panic
// from fn poisons the pool entry on the way out — refcount goes
// back to zero (no leak) and the entry is marked unusable so the
// next acquire builds fresh. Without the defer-release pattern, a
// panic would leave refcount=1 forever and the entry would never
// evict.
func TestPooledWithEICTunnel_PanicPoisonsEntry(t *testing.T) {
    setupCount, _, _, _ := withPoolSetupStub(t)
    poolTTL = 50 * time.Second
    globalEICTunnelPool = newEICTunnelPool()
    t.Cleanup(globalEICTunnelPool.stop)

    func() {
        defer func() {
            if r := recover(); r == nil {
                t.Errorf("expected panic to bubble up, got nil")
            }
        }()
        _ = pooledWithEICTunnel(context.Background(), "i-panic",
            func(s eicSSHSession) error { panic("boom") })
    }()

    // Replenish the gate so the next setup can run.
    prev := poolSetupTunnel
    poolSetupTunnel = func(ctx context.Context, instanceID string) (
        eicSSHSession, func(), error) {
        atomic.AddInt64(setupCount, 1)
        return eicSSHSession{instanceID: instanceID}, func() {}, nil
    }
    t.Cleanup(func() { poolSetupTunnel = prev })

    // Next acquire must build fresh — entry was poisoned by panic.
    if err := pooledWithEICTunnel(context.Background(), "i-panic",
        func(s eicSSHSession) error { return nil }); err != nil {
        t.Fatalf("post-panic acquire: %v", err)
    }
    if got := atomic.LoadInt64(setupCount); got != 2 {
        t.Errorf("expected 2 setups (panic poisoned, rebuild), got %d", got)
    }
}

// TestPooledWithEICTunnel_PreservesFnErr pins that errors from the
// inner fn pass through to the caller verbatim — pool wrapping
// should not swallow or transform error semantics for app code.
func TestPooledWithEICTunnel_PreservesFnErr(t *testing.T) {
    withPoolSetupStub(t)
    poolTTL = 50 * time.Second

    // Reset the global pool so this test is isolated from any prior
    // test that may have populated it.
    globalEICTunnelPool = newEICTunnelPool()

    want := errors.New("file does not exist")
    got := pooledWithEICTunnel(context.Background(), "i-fn-err",
        func(s eicSSHSession) error { return want })
    if !errors.Is(got, want) {
        t.Errorf("pooledWithEICTunnel returned %v, want %v", got, want)
    }
}