c8fca1467e
7 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
ec1f21922c |
chore(workspace-server): remove TeamHandler.Expand bulk-create handler
Every workspace can have children via the regular CreateWorkspace flow with parent_id set, so a separate handler that bulk-creates from config.yaml's sub_workspaces (and was non-idempotent — calling it twice duplicated the team) earned its way out. "Team" is just the state of having children; expanding/collapsing is purely a canvas-side visual action that toggles the `collapsed` column via PATCH. The non-idempotency directly caused tenant-hongming's vCPU starvation: 72 distinct child workspaces accumulated in 4 days, ~14 leaked EC2s (50 of 64 vCPU consumed by stale teams), every Canvas tabs E2E retry flaking on RunInstances VcpuLimitExceeded. What stays: - TeamHandler.Collapse — still useful; stops + removes children via StopWorkspaceAuto. Reachable from the canvas Collapse Team button. (Note: that button currently calls PATCH /workspaces/:id, not the Collapse endpoint — that's a separate reachability question for later.) - findTemplateDirByName helper — kept in team.go pending a relocate decision; no in-package consumers after Expand. - The four other paths that create child workspaces continue to work unchanged: regular POST /workspaces with parent_id, OrgHandler.Import (recursive tree), Bundle import, scripts. What goes: - POST /workspaces/:id/expand route (router.go) - TeamHandler.Expand method (team.go: ~130 lines) - 4 TestTeamExpand_* sqlmock tests (team_test.go) - TestTeamExpand_UsesAutoNotDirectDockerPath AST gate (workspace_provision_auto_test.go) — pinned a code path that no longer exists; the generic TestNoCallSiteCallsDirectProvisionerExceptAuto gate still covers the architectural intent for any future caller. Follow-up PRs: - canvas/ContextMenu.tsx: drop the "Expand to Team" right-click button + handleExpand callback; users create children via the regular + New Workspace dialog with the parent picker (already supported) - OrgHandler.Import idempotency (skip-if-exists OR replace_if_exists) — same bug class as the deleted Expand, but on the bulk-tree path - One-off cleanup script for tenant-hongming's 72 stale workspaces Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
11c9ed2a46 |
fix(provision): StopWorkspaceAuto mirror — close SaaS EC2-leak class
Closes #2813 (team-collapse) and #2814 (workspace delete). Two leaks, one class. Both call sites had the same shape pre-fix: if h.provisioner != nil { h.provisioner.Stop(ctx, wsID) } On SaaS where h.provisioner (Docker) is nil and h.cpProv is set, that gate evaluates false and the EC2 keeps running. Workspace gets marked removed in DB; EC2 lives on until the orphan sweeper catches it. Same drift class as PR #2811's org-import provision bug — a Docker- only check on what should be a both-backend operation. Confirmed in production: PR #2811's verification step deleted a test workspace and the EC2 stayed running until I terminated it manually. Fix: WorkspaceHandler.StopWorkspaceAuto(ctx, wsID) — symmetric mirror of provisionWorkspaceAuto. CP first, Docker second, no-op when neither is wired (a workspace nobody is running can't be stopped — that's a no-op, not a failure, distinct from provision's mark-failed contract). Three call-site changes: - team.go:208 (Collapse) → h.wh.StopWorkspaceAuto(ctx, childID) - workspace_crud.go:432 (stopAndRemove) → h.StopWorkspaceAuto(...); RemoveVolume stays Docker-only behind an explicit gate since CP-managed workspaces have no host-bind volumes - TeamHandler.provisioner field + NewTeamHandler's *Provisioner param removed as dead code (Stop was the only call site) Volume cleanup separation is intentional: the abstraction is "stop the running workload," not "tear down all state." Callers that need volume cleanup keep their `if h.provisioner != nil { RemoveVolume }` gate AFTER the Stop call. Tests: - TestStopWorkspaceAuto_RoutesToCPWhenSet — SaaS path - TestStopWorkspaceAuto_RoutesToDockerWhenOnlyDocker — self-hosted - TestStopWorkspaceAuto_NoBackendIsNoOp — pins the contract distinction from provisionWorkspaceAuto's mark-failed - TestNoCallSiteCallsBareStop — source-level pin against `.provisioner.Stop(` / `.cpProv.Stop(` outside the dispatcher, per-backend bodies, restart helper, and the Docker-daemon-direct short-lived-container path. Strips Go comments before substring match so archaeology in code comments doesn't trip the gate. - Verified: pin FAILS against the buggy shape (workspace_crud.go reversion); team.go reversion compile-fails because the field is gone — even stronger than the test. Out of scope (tracked under #2799): - workspace_restart.go's manual if-cpProv-else dispatch with retry semantics tuned for the restart hot path. Functionally equivalent + wraps cpStopWithRetry, so it's not the bug class this PR closes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
be997883c9 |
Centralize backend selection in provisionWorkspaceAuto
User-reported 2026-05-04: deploying a team org-template ("Design
Director" + 6 sub-agents) on a SaaS tenant produced 7-of-7
WORKSPACE_PROVISION_FAILED with the misleading message
"container started but never called /registry/register". Diagnose
returned "docker client not configured on this workspace-server" and
the workspace rows had no instance_id.
Root cause: TeamHandler.Expand hardcoded h.wh.provisionWorkspace —
the Docker leg of WorkspaceHandler. WorkspaceHandler.Create branched
on h.cpProv to pick CP-managed EC2 (SaaS) vs local Docker
(self-hosted), but Expand never used that branch. On SaaS the docker
goroutine ran but had no socket, so children silently sat in
"provisioning" until the 600s sweeper marked them failed.
Architectural principle (user): templates own
runtime/config/prompts/files/plugins; the platform owns where it
runs. Backend selection belongs in one helper.
Fix:
- Extract WorkspaceHandler.provisionWorkspaceAuto: picks CP when
cpProv is set, Docker when only provisioner is set, returns false
when neither (caller marks failed).
- WorkspaceHandler.Create routes through Auto.
- TeamHandler.Expand routes through Auto.
Tests pin three invariants:
- TestProvisionWorkspaceAuto_NoBackendReturnsFalse — Auto signals
fall-through correctly so the caller can persist + mark-failed.
- TestProvisionWorkspaceAuto_RoutesToCPWhenSet — when cpProv is
wired, Start lands on CP (the user-visible regression target).
Discipline-verified: removing the cpProv branch fails this.
- TestTeamExpand_UsesAutoNotDirectDockerPath — source-level guard
against future refactors reintroducing the hardcoded Docker call.
Discipline-verified: reverting team.go fails this with a clear
message naming the bug class.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
fdf1b5d76a |
refactor(workspace-status): typed constants + AST-based drift gate
Eliminate raw 'awaiting_agent'/'hibernating'/'failed'/etc string literals from production status writes. Adds models.WorkspaceStatus typed alias and models.AllWorkspaceStatuses canonical slice; every UPDATE workspaces SET status = ... now passes a parameterized $N typed value rather than a hard-coded SQL literal. Defense-in-depth follow-up to migration 046 (#2388): the Postgres enum type was missing 'awaiting_agent' + 'hibernating' for ~5 days because sqlmock regex matching cannot enforce live enum constraints. The drift gate is now a proper Go AST + SQL parser (no regex), asserting the codebase ⊆ migration enum and every const appears in the canonical slice. With status as a parameterized typed value, future enum mismatches fail at the SQL layer in tests, not silently in prod. Test coverage: full suite passes with -race; drift gate green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
e081c8335f |
refactor(handlers): widen WorkspaceHandler.provisioner to LocalProvisionerAPI interface (#2369)
Symmetric with the existing CPProvisionerAPI interface. Closes the asymmetry where the SaaS provisioner field was an interface (mockable in tests) but the Docker provisioner field was a concrete pointer (not). ## Changes - New ``provisioner.LocalProvisionerAPI`` interface — the 7 methods WorkspaceHandler / TeamHandler call on h.provisioner today: Start, Stop, IsRunning, ExecRead, RemoveVolume, VolumeHasFile, WriteAuthTokenToVolume. Compile-time assertion confirms *Provisioner satisfies it. Mirror of cp_provisioner.go's CPProvisionerAPI block. - ``WorkspaceHandler.provisioner`` and ``TeamHandler.provisioner`` re-typed from ``*provisioner.Provisioner`` to ``provisioner.LocalProvisionerAPI``. Constructor parameter type is unchanged — the assignment widens to the interface, so the 200+ callers of ``NewWorkspaceHandler`` / ``NewTeamHandler`` are unaffected. - Constructors gain a ``if p != nil`` guard before assigning to the interface field. Without this, ``NewWorkspaceHandler(..., nil, ...)`` (the test fixture pattern across 200+ tests) yields a typed-nil interface value where ``h.provisioner != nil`` evaluates *true*, and the SaaS-vs-Docker fork incorrectly routes nil-fixture tests into the Docker code path. Documented inline with reference to the Go FAQ. - Hardened the 5 Provisioner methods that lacked nil-receiver guards (Start, ExecRead, WriteAuthTokenToVolume, RemoveVolume, VolumeHasFile) — return ErrNoBackend on nil receiver instead of panicking on p.cli dereference. Symmetric with Stop/IsRunning (already hardened in #1813). Defensive cleanup so a future caller that bypasses the constructor's nil-elision still degrades cleanly. - Extended TestZeroValuedBackends_NoPanic with 5 new sub-tests covering the newly-hardened nil-receiver paths. Defense-in-depth: a future refactor that drops one of the nil-checks fails red here before reaching production. ## Why now - Provisioner orchestration has been touched in #2366 / #2368 — the interface symmetry is the natural follow-up captured in #2369. - Future work (CP fleet redeploy endpoint, multi-backend provisioners) wants this in place. Memory note ``project_provisioner_abstraction.md`` calls out pluggable backends as a north-star. - Memory note ``feedback_long_term_robust_automated.md`` — compile-time gates + ErrNoBackend symmetry > runtime panics. ## Verification - ``go build ./...`` clean. - ``go test ./...`` clean — 1300+ tests pass, including the previously-flaky Create-with-nil-provisioner paths that now exercise the constructor's nil-elision correctly. - ``go test ./internal/provisioner/ -run TestZeroValuedBackends_NoPanic -v`` — all 11 nil-receiver subtests green (was 6, +5 for the newly-hardened methods). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
bb52a1a365 |
fix(team): delegate Expand child-provisioning to shared mint pipeline (#2367)
Closes #2367. TeamHandler.Expand provisioned child workspaces by directly calling h.provisioner.Start, skipping mintWorkspaceSecrets and every other preflight (secrets load, env mutators, identity injection, missing-env, empty-config-volume auto-recover). Children shipped with NULL platform_inbound_secret + never-issued auth_token — same drift class as the SaaS bug just fixed in PR #2366, found while exercising a stronger gate against this package. Fix: - TeamHandler now holds *WorkspaceHandler. Expand delegates each child provision to wh.provisionWorkspace, picking up the shared prepare/mint/preflight pipeline automatically. Future provision-time steps go in ONE place and team-expand inherits them. - prepareProvisionContext gains PARENT_ID env injection sourced from payload.ParentID (which Expand now populates). This preserves the signal workspace/coordinator.py reads on startup, without threading env through provisioner.WorkspaceConfig manually. - NewTeamHandler signature gains *WorkspaceHandler; router passes it. Gate upgrade: - TestProvisionFunctions_AllCallMintWorkspaceSecrets is now behavior-based: it walks every FuncDecl in the package and flags any function that calls h.provisioner.Start or h.cpProv.Start without also calling mintWorkspaceSecrets. Drift-resistant by construction — a future provision function with any name still trips the gate. - Replaces the name-list version from PR #2366. The name list missed Expand precisely because Expand wasn't named provision*; the behavior-based detector caught it spontaneously when prototyped. Tests: full workspace-server module green; gate previously verified to fire red on Expand pre-fix and on deliberate mintWorkspaceSecrets removal. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
479a027e4b |
chore: open-source restructure — rename dirs, remove internal files, scrub secrets
Renames: - platform/ → workspace-server/ (Go module path stays as "platform" for external dep compat — will update after plugin module republish) - workspace-template/ → workspace/ Removed (moved to separate repos or deleted): - PLAN.md — internal roadmap (move to private project board) - HANDOFF.md, AGENTS.md — one-time internal session docs - .claude/ — gitignored entirely (local agent config) - infra/cloudflare-worker/ → Molecule-AI/molecule-tenant-proxy - org-templates/molecule-dev/ → standalone template repo - .mcp-eval/ → molecule-mcp-server repo - test-results/ — ephemeral, gitignored Security scrubbing: - Cloudflare account/zone/KV IDs → placeholders - Real EC2 IPs → <EC2_IP> in all docs - CF token prefix, Neon project ID, Fly app names → redacted - Langfuse dev credentials → parameterized - Personal runner username/machine name → generic Community files: - CONTRIBUTING.md — build, test, branch conventions - CODE_OF_CONDUCT.md — Contributor Covenant 2.1 All Dockerfiles, CI workflows, docker-compose, railway.toml, render.yaml, README, CLAUDE.md updated for new directory names. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |