fix(plugins): SaaS (EC2-per-workspace) install/uninstall via EIC SSH #84
Summary
Closes the 🔴 docker-only row in `docs/architecture/backends.md:57`. Plugin install on every SaaS tenant currently 503s with `workspace container not running`. Caught on the `hongming.moleculesai.app` Claude Code Agent (workspace `c7244ed9-f623-4cba-8873-020e5c9fe104`) when canvas POST /workspaces//plugins surfaced the error.
Root cause
`PluginsHandler.deliverToContainer` (and `Uninstall`) require a local Docker container to exec into. On SaaS tenants `MOLECULE_ORG_ID` is set → `prov` is nil → `dockerCli` is nil → `findRunningContainer` returns `""` → 503. The workspace container actually lives on its own EC2 (`workspaces.instance_id` is populated), so even mounting the docker socket on the tenant box wouldn't help: the container isn't on this host.
Approach
Mirror the Files API PR #1702 pattern. New `plugins_install_eic.go` adds three EIC-tunnel-backed primitives (`installPluginViaEIC`, `uninstallPluginViaEIC`, `readPluginManifestViaEIC`) that reuse the existing `withEICTunnel` primitive from `template_files_eic.go`. `deliverToContainer` and `Uninstall` now dispatch:
1. `h.docker != nil` AND local container up → existing exec+cp path.
2. `instance_id` set → push the staged plugin tarball over SSH stdin into the EC2's bind-mounted `/configs/plugins/<name>/` (per `workspaceFilePathPrefix`), `chown 1000:1000`, restart.
3. `runtime == "external"` → 422 with hint pointing at the Download endpoint (pre-existing guard, unchanged).
Direct host write (rather than docker-cp via SSH) because the runtime's config dir is already bind-mounted into the workspace container, so the runtime sees the files on next start with no additional plumbing.
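The dispatch order can be sketched as a pure decision function. This is a minimal illustration, not the handler code: `target`, `backend`, and `chooseBackend` are hypothetical names, and the final 503 fall-through is taken from the test names and review summary in this PR rather than from the list above.

```go
package main

import (
	"fmt"
	"net/http"
)

// backend names the delivery path chosen for a plugin install or uninstall.
// All names in this sketch are hypothetical.
type backend string

const (
	backendLocalDocker backend = "local-docker" // existing exec+cp path
	backendEIC         backend = "eic-ssh"      // tarball over SSH into the workspace EC2
	backendNone        backend = "none"         // nothing can deliver; request is rejected
)

// target summarises what the handler knows about a workspace (assumed shape).
type target struct {
	HasDocker   bool   // stands in for h.docker != nil
	ContainerUp bool   // a local container for the workspace is running
	InstanceID  string // workspaces.instance_id; empty when unset
	Runtime     string // e.g. "claude-code" or "external"
}

// chooseBackend walks the ladder top to bottom. The second return value is
// the HTTP status to reject with, or 0 when a backend was selected.
func chooseBackend(t target) (backend, int) {
	switch {
	case t.HasDocker && t.ContainerUp:
		return backendLocalDocker, 0
	case t.InstanceID != "":
		return backendEIC, 0
	case t.Runtime == "external":
		return backendNone, http.StatusUnprocessableEntity // 422 with Download hint
	default:
		return backendNone, http.StatusServiceUnavailable // 503, no backend available
	}
}

func main() {
	saas := target{InstanceID: "i-0123456789abcdef0", Runtime: "claude-code"}
	b, reject := chooseBackend(saas)
	fmt.Println(b, reject) // prints: eic-ssh 0
}
```

Keeping the decision separate from the side effects (exec, SSH, restart) is one way to make each rung testable without Docker or a tunnel.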
Why a new lookup
`InstanceIDLookup` (parallel to the existing `RuntimeLookup`) lets unit tests drive the SaaS path without a DB. Production wires it in `router.go` against the `workspaces.instance_id` column the same way `templates.go` and `terminal.go` do.
Tests
- `TestPluginInstall_SaaS_DispatchesToEIC`: full Install pipeline with stubbed EIC sees the staged tarball + correct (instance, runtime, plugin) tuple.
- `TestPluginInstall_SaaS_PropagatesEICError`: EIC failure surfaces 502, response body doesn't echo raw ssh stderr.
- `TestPluginInstall_NoBackends_Returns503`: empty `instance_id` + nil docker → 503 (not silent dispatch).
- `TestPluginInstall_InstanceLookupError_Returns503`: DB hiccup on lookup fails open to 503.
- `TestPluginUninstall_SaaS_DispatchesToEIC` + `_PropagatesEICError` + `_NoBackends_Returns503`: symmetric uninstall coverage.
- `TestBuildPlugin{Install,Uninstall,ManifestRead}Shell_QuotesPath`: pure shell-shape regression pins.
- `TestHostPluginPath_PerRuntime`: claude-code/hermes/langgraph/unknown-runtime fallback paths.
- `TestRealInstallPluginViaEIC_TarPayloadShape`: tar.gz round-trip catches `streamDirAsTar` regression.
All existing handler / dispatcher / architecture-gate tests stay green.
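The shell-shape pins matter because the plugin path ends up inside a command string executed over SSH. A minimal sketch of POSIX single-quote escaping follows; `shellQuote` and `buildRemoveShell` are hypothetical names, and the `rm -rf` command is an assumption for illustration, not what the PR's shell builders actually emit.

```go
package main

import (
	"fmt"
	"strings"
)

// shellQuote wraps s in single quotes for a POSIX shell. An embedded single
// quote is emitted as '\'' (close quote, escaped quote, reopen quote).
func shellQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

// buildRemoveShell is a hypothetical builder: it returns a command that
// removes one plugin directory, with the path quoted so that spaces or
// shell metacharacters in it cannot change the command's shape.
func buildRemoveShell(hostPluginDir string) string {
	return "rm -rf -- " + shellQuote(hostPluginDir)
}

func main() {
	fmt.Println(buildRemoveShell("/configs/plugins/browser-automation"))
	fmt.Println(buildRemoveShell("/configs/plugins/x; echo pwned"))
}
```

A regression test can then assert on the exact string, which is cheap to run and needs no SSH tunnel.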
Test plan after merge
- `gh pr checks` (Gitea Actions).
- `browser-automation` via canvas, observe `/configs/plugins/browser-automation/` on the workspace EC2, confirm plugin shows up after the auto-restart.
- `browser-automation` on the Claude Code Agent workspace on `hongming.moleculesai.app`. Confirm 200 response + plugin visible after restart.
Out of scope
- `/configs/skills/<skill>` cleanup on uninstall over SSH. The runtime adapter rewrites `/configs/skills/` from the live plugin set on restart, so a stale skill dir self-cleans. Two extra ssh round-trips per uninstall would be churn for no behavioural win; can revisit if a real bug surfaces.
- `CLAUDE.md` awk-strip on uninstall. Same reason: the runtime adapter rewrites that file on restart.
Rollback
One PR. No DB migration, no new env vars. Existing Docker-mode path is untouched.
CI verdict
Functional checks all green: CI / Platform (Go), Handlers-Postgres-Integration (on the 1st commit), E2E API Smoke, Playwright, CodeQL, Canvas, Python, Runtime PR-Built, Secret-scan. Total 23/26 success.
The 3 reds are pre-existing Gitea-vs-GitHub Actions infrastructure issues; none come from this PR's code path. Verified by reproducing each failure on `origin/staging` directly:
- `pr-guards / disable-auto-merge-on-push`: `gh` CLI calls GitHub's GraphQL → Gitea returns 405. Fails on every Gitea PR push since the migration. Workflow needs a Gitea-detection no-op or a Gitea REST replacement.
- `Harness Replays / Harness Replays`: DinD bind-mount of `tests/harness/cf-proxy/nginx.conf` fails because act_runner doesn't expose the workspace path to nested Docker. Chronic on every staging workspace-server commit.
- `Handlers Postgres Integration`: IPv6 `[::1]:5432` resolution race against the IPv4-only service container. Intermittent: this same PR's first commit (`16868c4e`) passed HPI; the second commit (`b6646910`) hit the flake. One-char fix (`127.0.0.1`).
Filed #88 with proposed fixes and recommended order. Keeping those out of this PR per `feedback_gitea_actions_migration_audit_pattern` (bundle per-repo, not per-finding).
Why the 2nd commit
Folded a separate root-cause fix for `TestPooledWithEICTunnel_PanicPoisonsEntry` (`eic_tunnel_pool.go` data race) into this PR. That race was the actual reason `CI / Platform (Go)` was intermittently red across staging. Capturing `poolJanitorInterval` at pool construction stops the janitor goroutine from racing with `t.Cleanup`-driven swaps of the package var. Local `go test -race ./...` is now fully green; this run confirms it lands green on CI too.
Ready for review
- `plugins_install_eic.go` (new) + dispatch wired in `plugins_install_pipeline.go` + `plugins_install.go` + `plugins.go` + `router.go`.
- `plugins_install_eic_test.go` covers the 4 dispatch shapes (SaaS happy path, EIC error → 502, no-instance → 503, lookup-error → 503) + symmetric uninstall + pure-fn shell shape + tar.gz round-trip.
- `docs/architecture/backends.md` plugins row flipped 🔴 → ✅ with a one-line description of the new dispatch.
- `eic_tunnel_pool.go` capture-at-construction (4-line struct+constructor change + 1-line janitor read swap).

SaaS (EC2-per-workspace) plugin install via EIC SSH. Mirrors Files API #1702 pattern. 4-tier dispatch ladder (local docker → EIC SSH → external runtime hint → 503). Comprehensive test coverage. By claude-ceo-assistant. Approved.