Commit Graph

29 Commits

Author SHA1 Message Date
Hongming Wang
e15d1182cd test(provisioner): unblock TestProvisionWorkspaceCP_NoInternalErrorsInBroadcast (#1814)
The skipped test exists to assert that provisionWorkspaceCP never
leaks err.Error() in WORKSPACE_PROVISION_FAILED broadcasts (regression
guard for #1206). Writing the test body required substituting a
failing CPProvisioner — but the handler's `cpProv` field was the
concrete *CPProvisioner type, so a mock had nowhere to plug in.

Refactor:
  - Add provisioner.CPProvisionerAPI interface with the 3 methods
    handlers actually call (Start, Stop, GetConsoleOutput)
  - Compile-time assertion `var _ CPProvisionerAPI = (*CPProvisioner)(nil)`
    catches future method-signature drift at build time
  - WorkspaceHandler.cpProv narrowed to the interface; SetCPProvisioner
    accepts the interface (production caller passes *CPProvisioner
    from NewCPProvisioner unchanged)
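
Sketch of the seam — the method signatures here are assumptions, not
the repo's exact ones:

    // The 3-method surface the handlers depend on.
    type CPProvisionerAPI interface {
        Start(ctx context.Context, workspaceID string) error
        Stop(ctx context.Context, workspaceID string) error
        GetConsoleOutput(ctx context.Context, workspaceID string) (string, error)
    }

    // Build breaks the moment *CPProvisioner drifts from the interface.
    var _ CPProvisionerAPI = (*CPProvisioner)(nil)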

Test:
  - stubFailingCPProv whose Start returns a deliberately leaky error
    (machine_type=t3.large, ami=…, vpc=…, raw HTTP body fragment)
  - Drive provisionWorkspaceCP via the cpProv.Start failure path
  - Assert broadcast["error"] == "provisioning failed" (canned)
  - Assert no leak markers (machine type, AMI, VPC, subnet, HTTP
    body, raw error head) in any broadcast string value
  - Stop/GetConsoleOutput on the stub panic — flagging any future
    regression that reaches into them on this path

Verification:
  - Full workspace-server test suite passes (interface refactor
    is non-breaking; production caller path unchanged)
  - go build ./... clean
  - The other skipped test in this file (TestResolveAndStage_…)
    is a separate plugins.Registry refactor and remains skipped

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 03:28:25 -07:00
Hongming Wang
4915d1d59e fix(orphan-sweeper): reap labeled containers with no DB row (wiped-DB)
The existing sweeper only reaps ws-* containers whose workspace row
has status='removed'. That misses the entire wiped-DB case: an
operator does `docker compose down -v` (kills the postgres volume),
the previous platform's ws-* containers keep running, the new
platform boots into an empty workspaces table — first pass finds
zero candidates and those containers leak forever. Symptom users
hit today: 7 ws-* containers from 11h ago, no rows in DB, no
visibility in Canvas, eating CPU + memory.

Fix shape:

1. Provisioner stamps every ws-* container + volume with
   `molecule.platform.managed=true`. Without a label, the sweeper
   would have to assume any unlabeled ws-* container might belong
   to a sibling platform stack on a shared Docker daemon.

2. Provisioner exposes ListManagedContainerIDPrefixes — a label-filter
   counterpart to the existing name-filter.

3. Sweeper splits sweepOnce into two independent passes:
     - sweepRemovedRows (unchanged behavior; status='removed' only)
     - sweepLabeledOrphansWithoutRows (new; labeled containers whose
       workspace_id has no row in the table at all)
   Each pass has its own short-circuit so an empty result or transient
   error in one doesn't block the other — load-bearing because the
   wiped-DB pass exists precisely for cases where the removed-row
   pass finds nothing.
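
Sketch of the label-filter counterpart (Docker SDK option-type names
vary by version; helper internals here are assumptions):

    const managedLabel = "molecule.platform.managed=true"

    func (p *Provisioner) ListManagedContainerIDPrefixes(ctx context.Context) ([]string, error) {
        containers, err := p.cli.ContainerList(ctx, container.ListOptions{
            Filters: filters.NewArgs(filters.Arg("label", managedLabel)),
        })
        if err != nil {
            return nil, err // this pass short-circuits; the other pass still runs
        }
        var prefixes []string
        for _, c := range containers {
            for _, name := range c.Names {
                if id, ok := strings.CutPrefix(strings.TrimPrefix(name, "/"), "ws-"); ok {
                    prefixes = append(prefixes, id)
                }
            }
        }
        return prefixes, nil
    }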

Safe under multi-platform-on-shared-daemon: only containers carrying
our label get reaped, sibling stacks' containers are invisible to this
pass. (For now the label is a constant string; a future per-instance
UUID layer can refine "ours" further if a real shared-daemon scenario
emerges.)

Migration: existing platforms running pre-PR builds have UNLABELED
ws-* containers. After this lands they continue to NOT be reaped by
the new path (no label = invisible). They'll only be cleaned via
manual intervention or once the operator recreates them — same as
today. No regression.

Tests cover all five branches of the new pass: happy-path reap,
no-reap when row exists, mixed reap-some-keep-some, Docker error
short-circuits cleanly, non-UUID prefixes get filtered before the
SQL query.

Pairs with PR #2122 (script-level fix). Together they close the
orphan-leak path for both `bash scripts/nuke-and-rebuild.sh` users
(handled by the script) AND `docker compose down -v` users (handled
by the runtime).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 14:33:41 -07:00
Hongming Wang
d0f198b24f merge: resolve staging conflicts (a2a_proxy + workspace_crud)
Three files conflicted with staging changes that landed while this PR
sat open. Resolved each by combining both intents (not picking one side):

- a2a_proxy.go: keep the branch's idle-timeout signature
  (workspaceID parameter + comment) AND apply staging's #1483 SSRF
  defense-in-depth check at the top of dispatchA2A. Type-assert
  h.broadcaster (now an EventEmitter interface per staging) back to
  *Broadcaster for applyIdleTimeout's SubscribeSSE call; falls through
  to no-op when the assertion fails (test-mock case); see the sketch
  after this list.

- a2a_proxy_test.go: keep both new test suites — branch's
  TestApplyIdleTimeout_* (3 cases for the idle-timeout helper) AND
  staging's TestDispatchA2A_RejectsUnsafeURL (#1483 regression). Updated
  the staging test's dispatchA2A call to pass the workspaceID arg
  introduced by the branch's signature change.

- workspace_crud.go: combine both Delete-cleanup intents:
  * Branch's cleanupCtx detachment (WithoutCancel + 30s) so canvas
    hang-up doesn't cancel mid-Docker-call (the container-leak fix)
  * Branch's stopAndRemove helper that skips RemoveVolume when Stop
    fails (orphan sweeper handles)
  * Staging's #1843 stopErrs aggregation so Stop failures bubble up
    as 500 to the client (the EC2 orphan-instance prevention)
  Both concerns satisfied: cleanup runs to completion past canvas
  hangup AND failed Stop calls surface to caller.
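
The fall-through from the first bullet, sketched (exact signatures are
assumptions):

    // Staging narrowed h.broadcaster to an EventEmitter interface, but
    // applyIdleTimeout needs the concrete type's SubscribeSSE.
    if b, ok := h.broadcaster.(*Broadcaster); ok {
        h.applyIdleTimeout(ctx, workspaceID, b)
    }
    // ok == false (test mocks): idle-timeout wiring becomes a no-op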

Build clean, all platform tests pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-26 10:43:22 -07:00
Hongming Wang
0de67cd379 feat(platform/admin): /admin/workspace-images/refresh + Docker SDK + GHCR auth
The production-side end of the runtime CD chain. Operators (or the
post-publish CI workflow) hit this after a runtime release to pull the latest
workspace-template-* images from GHCR and recreate any running ws-* containers
so they adopt the new image. Without this, freshly-published runtime sat in
the registry but containers kept the old image until naturally cycled.

Implementation notes:
- Uses Docker SDK ImagePull rather than shelling out to docker CLI — the
  alpine platform container has no docker CLI installed.
- ghcrAuthHeader() reads GHCR_USER + GHCR_TOKEN env and builds the
  base64-encoded JSON payload the Docker engine expects in
  PullOptions.RegistryAuth (sketched after this list). Both empty →
  public/cached images only; both set → private GHCR pulls.
- Container matching uses ContainerInspect (NOT ContainerList) because
  ContainerList returns the resolved digest in .Image, not the human tag.
  Inspect surfaces .Config.Image which is what we need.
- Provisioner.DefaultImagePlatform() exported so admin handler picks the
  same Apple-Silicon-needs-amd64 platform as the provisioner — single
  source of truth for the multi-arch override.
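
Sketch of the header construction (assuming registry.AuthConfig from
the Docker SDK; exact code may differ):

    func ghcrAuthHeader() string {
        user, token := os.Getenv("GHCR_USER"), os.Getenv("GHCR_TOKEN")
        if user == "" || token == "" {
            return "" // public/cached images only
        }
        payload, _ := json.Marshal(registry.AuthConfig{
            Username:      user,
            Password:      token,
            ServerAddress: "ghcr.io",
        })
        // Docker engine expects base64url-encoded JSON in RegistryAuth.
        return base64.URLEncoding.EncodeToString(payload)
    }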

Local-dev companion: scripts/refresh-workspace-images.sh runs on the
host and inherits the host's docker keychain auth — alternate path for
when GHCR_USER/TOKEN aren't set in the platform env.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-26 10:17:21 -07:00
Hongming Wang
48b494def3 fix(provisioner): nil guards on Stop/IsRunning, unblock contract tests (closes #1813)
Both backends panicked when called on a zero-valued or nil receiver:
Provisioner.{Stop,IsRunning} dereferenced p.cli; CPProvisioner.{Stop,
IsRunning} dereferenced p.httpClient. The orphan sweeper and shutdown
paths can call these speculatively where the receiver isn't fully
wired — the panic crashed the goroutine instead of the caller seeing
a clean error.

Three changes:

1. Add ErrNoBackend (typed sentinel) and nil-guard the four methods
   (sketched after this list).
   - Provisioner.{Stop,IsRunning}: guard p == nil || p.cli == nil at
     the top.
   - CPProvisioner.Stop: guard p == nil up top, then httpClient nil
     AFTER resolveInstanceID + empty-instance check (the empty
     instance_id path doesn't need HTTP and stays a no-op success
     even on zero-valued receivers — preserved historical contract
     from TestIsRunning_EmptyInstanceIDReturnsFalse).
   - CPProvisioner.IsRunning: same shape — empty instance_id stays
     (false, nil); httpClient-nil with non-empty instance_id returns
     ErrNoBackend.

2. Flip the t.Skip on TestDockerBackend_Contract +
   TestCPProvisionerBackend_Contract — both contract tests run now
   that the panics are gone. Skipped scenarios were the regression
   guard for this fix.

3. Add TestZeroValuedBackends_NoPanic — explicit assertion that
   zero-valued and nil receivers return cleanly (no panic). Docker
   backend always returns ErrNoBackend on zero-valued; CPProvisioner
   may return (false, nil) when the DB-lookup layer absorbs the case
   (no instance to query → no HTTP needed). Both are acceptable per
   the issue's contract — the gate is no-panic.
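
The guard shape from change 1, sketched — receiver internals and the
downstream helpers are assumptions:

    var ErrNoBackend = errors.New("provisioner: no backend configured")

    func (p *Provisioner) Stop(ctx context.Context, workspaceID string) error {
        if p == nil || p.cli == nil {
            return ErrNoBackend // caller sees a clean error, not a dead goroutine
        }
        return p.stopContainer(ctx, workspaceID) // pre-existing path, unchanged
    }

    func (p *CPProvisioner) IsRunning(ctx context.Context, workspaceID string) (bool, error) {
        if p == nil {
            return false, ErrNoBackend
        }
        instanceID, err := resolveInstanceID(ctx, p.db, workspaceID)
        if err != nil {
            return false, err
        }
        if instanceID == "" {
            return false, nil // historical contract: no instance → (false, nil)
        }
        if p.httpClient == nil {
            return false, ErrNoBackend // HTTP only needed past this point
        }
        return p.queryStatus(ctx, instanceID) // pre-existing CP call
    }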

Tests:
  - 6 sub-cases across the new TestZeroValuedBackends_NoPanic
  - TestDockerBackend_Contract + TestCPProvisionerBackend_Contract
    now run their 2 scenarios (4 sub-cases each)
  - All existing provisioner tests still green
  - go build ./... + go vet ./... + go test ./... clean

Closes drift-risk #6 in docs/architecture/backends.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 02:17:51 -07:00
Hongming Wang
cb12601414 fix(platform): make Provisioner.Stop return real errors so cleanup gates fire
Review caught a critical issue with 12c49183: the headline "skip
RemoveVolume when Stop fails" guarantee was dead code. `Provisioner.Stop`
unconditionally `return nil`'d after logging the underlying
ContainerRemove error, so the new `if err := h.provisioner.Stop(...);
err != nil { skip volume }` guard in workspace_crud.go AND the same
guard in the orphan sweeper could never fire. RemoveVolume always
ran, predictably failing with "volume in use" when Stop hadn't
actually killed the container — which is the exact production bug
the commit claimed to fix.

Now Stop:
  - returns nil on successful remove (no change)
  - returns nil when the container is already gone (uses the existing
    isContainerNotFound helper — that's the cleanup post-condition,
    not a failure)
  - returns the wrapped Docker error otherwise (daemon timeout, ctx
    cancellation, socket EOF — anything that means the container
    might still be alive)
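
Sketched (container naming and option-struct names vary; the helper is
the existing isContainerNotFound):

    func (p *Provisioner) Stop(ctx context.Context, workspaceID string) error {
        err := p.cli.ContainerRemove(ctx, "ws-"+workspaceID,
            container.RemoveOptions{Force: true})
        if err == nil || isContainerNotFound(err) {
            return nil // removed, or already gone — cleanup post-condition met
        }
        // daemon timeout, ctx cancellation, socket EOF — may still be alive
        return fmt.Errorf("provisioner: stop %s: %w", workspaceID, err)
    }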

Audited every Provisioner.Stop caller in the tree (team.go,
workspace_restart.go ×4, workspace.go) — all of them already
discard the return value, so the widened error surface is purely
opt-in for the new cleanup paths and breaks no existing behaviour.

Other review-driven fixes in this commit:

- workspace_crud.go: detached `broadcaster.RecordAndBroadcast` from
  the request ctx too. RecordAndBroadcast does INSERT INTO
  structure_events + Redis Publish; if the canvas hangs up, a
  request-ctx-bound INSERT can be cancelled mid-write and the
  WORKSPACE_REMOVED event never lands, leaving other WS clients
  ignorant of the cascade.

- orphan_sweeper.go: added isLikelyWorkspaceID guard (sketched after
  this list) before turning Docker container prefixes into SQL LIKE
  patterns. The Docker
  name filter is a SUBSTRING match (not prefix), so non-workspace
  containers like `my-ws-tool` slip through; the in-loop HasPrefix
  in provisioner trims most, but the in-sweeper alphabet check
  (hex + dashes only) is the second line of defence and also
  blocks SQL LIKE wildcards (`_`, `%`) from reaching the query.
  Two new tests pin this — TestSweepOnce_FiltersNonWorkspacePrefixes
  and TestIsLikelyWorkspaceID with 10 alphabet cases.

- provisioner.go: comment added to ListWorkspaceContainerIDPrefixes
  flagging the substring/HasPrefix relationship as load-bearing.
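
The alphabet check, sketched:

    // isLikelyWorkspaceID admits lowercase hex + dashes only — the UUID
    // alphabet — which also excludes the SQL LIKE wildcards '_' and '%'.
    func isLikelyWorkspaceID(s string) bool {
        if s == "" {
            return false
        }
        for _, r := range s {
            hex := (r >= '0' && r <= '9') || (r >= 'a' && r <= 'f')
            if !hex && r != '-' {
                return false
            }
        }
        return true
    }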

Verified: full Go test suite passes; all 8 sweeper tests pass
(2 new for the LIKE-pattern guard); existing dispatch / delete /
provisioner tests unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 23:32:48 -07:00
Hongming Wang
12c4918318 fix(platform): stop leaking workspace containers on delete
Symptom: deleting workspaces from the canvas marked DB rows
status='removed' but left Docker containers running indefinitely.
After a session of org imports + cancellations, we counted 10
running ws-* containers all backed by 'removed' DB rows, eating
~1100% CPU on the Docker VM.

Two compounding bugs in handlers/workspace_crud.go's delete cascade:

1. The cleanup loop used `c.Request.Context()` for the Docker
   stop/remove calls. When the canvas's `api.del` resolved on the
   platform's 200, gin cancelled the request ctx — and any in-flight
   Docker call cancelled with `context canceled`, leaving the
   container alive. Old logs:
       "Delete descendant <id> volume removal warning:
        ... context canceled"

2. `provisioner.Stop`'s error return was discarded and `RemoveVolume`
   ran unconditionally afterward. When Stop didn't actually kill the
   container (transient daemon error, ctx cancellation as in #1), the
   volume removal would predictably fail with "volume in use" and
   the container kept running with the volume mounted. Old logs:
       "Delete descendant <id> volume removal warning:
        Error response from daemon: remove ... volume is in use"

Fix layered in two parts:

- workspace_crud.go: detach cleanup with `context.WithoutCancel(ctx)`
  + a 30s bounded timeout. Stop's error is now checked and on
  failure we skip RemoveVolume entirely (the orphan sweeper below
  catches what we deferred). See the sketch after this list.

- New registry/orphan_sweeper.go: periodic reconcile pass (every 60s,
  initial run on boot). Lists running ws-* containers via Docker name
  filter, intersects with DB rows where status='removed', stops +
  removes volumes for the leaks. Defence in depth — even a brand-new
  Stop failure mode heals on the next sweep instead of leaking
  forever.
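
The detached-cleanup shape from the first bullet, sketched (needs Go
1.21+ for context.WithoutCancel; handler/field names are assumptions):

    // Survives the canvas hanging up: WithoutCancel drops the request
    // ctx's cancellation; the timeout re-bounds the Docker calls.
    cleanupCtx, cancel := context.WithTimeout(
        context.WithoutCancel(c.Request.Context()), 30*time.Second)
    defer cancel()

    if err := h.provisioner.Stop(cleanupCtx, id); err != nil {
        // container may still be alive — leave the volume for the sweeper
        log.Printf("delete %s: stop failed, skipping volume removal: %v", id, err)
    } else if err := h.provisioner.RemoveVolume(cleanupCtx, id); err != nil {
        log.Printf("delete %s: volume removal warning: %v", id, err)
    }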

Provisioner gains a tiny ListWorkspaceContainerIDPrefixes helper
that wraps ContainerList with the `name=ws-` filter; the sweeper
takes an OrphanReaper interface (matches the ContainerChecker
pattern in healthsweep.go) so unit tests don't need a real Docker
daemon.

main.go wires the sweeper alongside the existing liveness +
health-sweep + provisioning-timeout monitors, all under
supervised.RunWithRecover so a panic restarts the goroutine.

6 new sweeper tests cover the reconcile path, the
no-running-containers short-circuit, the daemon-error skip, the
Stop-failure-leaves-volume invariant (the same trap that motivated
this fix), the volume-remove-error-is-non-fatal continuation,
and the nil-reaper no-op.

Verified: full Go test suite passes; manually purged the 10 leaked
containers + their orphan volumes from the dev host with `docker
rm -f` + `docker volume rm` (one-off cleanup; the sweeper would
have caught them on the next cycle once deployed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 12:36:22 -07:00
molecule-ai[bot]
b1dce3405c Merge branch 'staging' into test/2026-04-23-regression-suite
2026-04-24 01:55:06 +00:00
b5e2142c46 fix(#1877): close token-rotation race on restart — Option A+Option B combined
Platform side (Option B):
- provisioner.go: add WriteAuthTokenToVolume() — writes .auth_token to
  the Docker named volume BEFORE ContainerStart using a throwaway alpine
  container, eliminating the race window where a restarted container could
  read a stale token before WriteFilesToContainer writes the new one
  (sketched below).
- workspace_provision.go: call WriteAuthTokenToVolume() in issueAndInjectToken
  as a best-effort pre-write before the container starts.
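
A compressed sketch of the pre-write — a production version would also
ContainerWait for the copy to finish; names are assumptions:

    func (p *Provisioner) WriteAuthTokenToVolume(ctx context.Context, volume, token string) error {
        resp, err := p.cli.ContainerCreate(ctx,
            &container.Config{
                Image: "alpine:latest",
                Cmd:   []string{"sh", "-c", `printf '%s' "$WS_TOKEN" > /configs/.auth_token`},
                Env:   []string{"WS_TOKEN=" + token},
            },
            &container.HostConfig{Binds: []string{volume + ":/configs"}},
            nil, nil, "")
        if err != nil {
            return err
        }
        defer func() {
            _ = p.cli.ContainerRemove(ctx, resp.ID, container.RemoveOptions{Force: true})
        }()
        return p.cli.ContainerStart(ctx, resp.ID, container.StartOptions{})
    }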

Runtime side (Option A):
- heartbeat.py: on HTTPStatusError 401 from /registry/heartbeat, call
  refresh_cache() to force re-read of /configs/.auth_token from disk,
  then retry the heartbeat once. Fall through to normal failure tracking
  if the retry also fails.
- platform_auth.py: add refresh_cache() which discards the in-process
  _cached_token and calls get_token() to re-read from disk.

Together these eliminate the >1 consecutive 401 window described in
issue #1877. Pre-write (B) is the primary fix; runtime retry (A) is the
self-healing fallback for any residual race.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 17:47:18 -07:00
Hongming Wang
9ce8d97448 test: regression guard for #1738 — cp-provisioner uses real instance_id
Pins the fix-invariants from PR #1738 (merged 2026-04-23) against
regression. Pre-fix, `CPProvisioner.Stop` and `IsRunning` both passed
the workspace UUID as the `instance_id` query param:

    url := fmt.Sprintf("%s/cp/workspaces/%s?instance_id=%s",
                        baseURL, workspaceID, workspaceID)
                                              ^ should be the real i-* ID

AWS rejected downstream with InvalidInstanceID.Malformed, orphaned the
EC2, and the next provision hit InvalidGroup.Duplicate on the leftover
SG — full Save & Restart cascade failure.

## Tests added

- **TestStop_UsesRealInstanceIDNotWorkspaceUUID**: stub resolveInstanceID
  to return an i-* ID, assert the CP request's instance_id query param
  carries that i-* value (not the workspace UUID). Sketched after this list.
- **TestStop_NoInstanceIDSkipsCPCall**: empty DB lookup → no CP call at
  all (idempotent). Guards against re-introducing the "call CP with ''
  and let AWS reject" footgun.
- **TestIsRunning_UsesRealInstanceIDNotWorkspaceUUID**: mirror for the
  /cp/workspaces/:id/status path — same bug shape.
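
Shape of the first guard, sketched (newTestCPProvisioner is a
hypothetical test constructor):

    func TestStop_UsesRealInstanceIDNotWorkspaceUUID(t *testing.T) {
        restore := resolveInstanceID
        resolveInstanceID = func(ctx context.Context, db *sql.DB, wsID string) (string, error) {
            return "i-0abc123def", nil // the real EC2 id
        }
        defer func() { resolveInstanceID = restore }()

        var got string
        cp := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            got = r.URL.Query().Get("instance_id")
        }))
        defer cp.Close()

        p := newTestCPProvisioner(cp.URL)
        _ = p.Stop(context.Background(), "a8af9d79-0000-0000-0000-000000000000")
        if got != "i-0abc123def" {
            t.Fatalf("instance_id query param = %q, want the i-* id", got)
        }
    }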

All 3 pass on current staging (which has the fix). Reverting either
Stop or IsRunning to the pre-#1738 shape causes these to fail loud.

Extends molecule-core#1902's regression suite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 17:45:13 -07:00
Hongming Wang
539e3483e4 fix(provisioner): force linux/amd64 pull + create on Apple Silicon hosts (#1875)
On an Apple Silicon dev box, every `POST /workspaces` failed immediately
with:

  no matching manifest for linux/arm64/v8 in the manifest list entries:
  no match for platform in manifest: not found

because the GHCR workspace-template-* images ship only a linux/amd64
manifest today. `ImagePull` and `ContainerCreate` asked for the daemon's
native arch and missed. The Canvas surfaced this as

  docker image "ghcr.io/molecule-ai/workspace-template-autogen:latest"
  not found after pull attempt — verify GHCR visibility for autogen

— confusing because the image IS visible, just not for linux/arm64.

### Fix

Add an auto-detect helper `defaultImagePlatform()` in
`internal/provisioner/provisioner.go` that returns `"linux/amd64"` on
Apple Silicon hosts and `""` (no preference) everywhere else, with an
env override `MOLECULE_IMAGE_PLATFORM` for operators who want to pin
or disable explicitly. The result is passed to both `ImagePull`
(`PullOptions.Platform`) and `ContainerCreate` (4th arg
`*ocispec.Platform`) so the pulled amd64 manifest matches the
create-time platform spec. Docker Desktop transparently runs it
under QEMU emulation on M-series Macs — slow (2–5× native) but
functional.
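
The helper, sketched (matches the description above; exact code may
differ):

    // Env override wins — empty string disables the auto-force entirely.
    func defaultImagePlatform() string {
        if v, ok := os.LookupEnv("MOLECULE_IMAGE_PLATFORM"); ok {
            return v
        }
        if runtime.GOARCH == "arm64" {
            return "linux/amd64" // GHCR templates ship amd64-only manifests today
        }
        return "" // no preference: daemon pulls its native arch
    }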

SaaS production (linux/amd64 EC2, `MOLECULE_ENV=production`) never
hits the `runtime.GOARCH == "arm64"` branch, so the current behaviour
on real tenants is byte-for-byte unchanged. Opt-in escape hatch for
operators who want it off:

  export MOLECULE_IMAGE_PLATFORM=""     # disable auto-force
  export MOLECULE_IMAGE_PLATFORM=linux/arm64   # pin alternate

`ocispec` is `github.com/opencontainers/image-spec/specs-go/v1` —
already in go.sum v1.1.1 as a transitive dependency of
`github.com/docker/docker`, not a new import.

### Tests

`internal/provisioner/platform_test.go` exercises every branch:

  - `TestDefaultImagePlatform_EnvOverride_ExplicitValue` — env wins
  - `TestDefaultImagePlatform_EnvOverride_EmptyValue` — empty string
    disables the auto-force (operator escape hatch)
  - `TestDefaultImagePlatform_AutoDetect` — linux/amd64 on arm64 Mac,
    "" on every other host
  - `TestParseOCIPlatform` — 7 table-driven cases covering well-formed
    platforms, malformed inputs, and nil handling

### End-to-end verification

Before this commit, `POST /workspaces` on my Apple Silicon box:

  workspace status transitioned: provisioning → failed (~1s)
  log: image pull for ... failed: no matching manifest for linux/arm64/v8

After this commit, fresh DB + fresh platform:

  workspace status transitioned: provisioning → online (~25s)
  log: attempting pull (platform=linux/amd64)
       pulled ghcr.io/molecule-ai/workspace-template-langgraph:latest
  docker ps: ws-7aa08951-00d  Up 27 seconds

The existing provisioner test suite still passes under the race
detector (`go test -race ./internal/provisioner/`) — the platform
pointer defaults to nil on linux/amd64 hosts, so the CI-resolved
test expectations
don't change.

Closes #1875 (arm64 image blocker).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 14:55:34 -07:00
Hongming Wang
a56b765b2d docs: testing strategy + PR hygiene + backend parity matrix + boot-event postmortem (#1824)
Bundles the documentation and lightweight tooling landed during the
2026-04-23 ops/triage session. Pure additions — no behavior changes.

## Added

### docs/architecture/backends.md
Parity matrix for Docker vs EC2 (SaaS) workspace backends. 18 features
tabulated with current status; 6 ranked drift risks; enforcement
hooks (parity-lint + contract tests). Living document — owners are
workspace-server + controlplane teams.

### docs/engineering/testing-strategy.md
Tiered test-coverage floors instead of a blanket 100% target. Seven
tiers by code class (auth/crypto → generated DTOs). Per-package
current-state snapshot + targets. Tracks the 3 biggest coverage gaps
(tokens.go 0%, workspace_provision.go 0%, wsauth ~48%) against their
tier-1/2 floors.

### docs/engineering/pr-hygiene.md
Captures the patterns that keep diffs reviewable. Motivated by the
2026-04-23 backlog audit, where 8 of 23 open PRs had ballooned to
70–380 changed files from stale branch drift. Covers: small-PR
sizing, rebase-not-merge,
cherry-pick-onto-fresh-base for recovery, targeting staging first,
describing why-not-what.

### docs/engineering/postmortem-2026-04-23-boot-event-401.md
Postmortem for the /cp/tenants/boot-event 401 race. Root cause (DB
INSERT ordered AFTER readiness check), detection path (E2E + manual
log inspection), lessons (write-before-read pattern, integration
tests needed, E2E alerting gap, invariants-as-comments).

### tools/check-template-parity.sh
CI lint for template repos — diffs the `${VAR:+VAR=${VAR}}`
provider-key forwarders between install.sh (bare-host / EC2 path) and start.sh
(Docker path). Catches the #5 drift risk from backends.md before it
ships.

### workspace-server/internal/provisioner/backend_contract_test.go
Shared behavioral contract scaffold for Provisioner + CPProvisioner.
Compile-time assertions catch method-signature drift today;
scenario-level runs are t.Skip'd pending backend nil-hardening (drift risk #6,
see backends.md).

## Updated

### README.md
Links the new engineering docs + backends parity matrix into the
Documentation Map so agents and humans can actually find them.

## Related issues

- #1814 — unblock workspace_provision_test.go (broadcaster interface)
- #1813 — nil-client panic hardening (drift risk #6)
- #1815 — Canvas vitest coverage instrumentation
- #1816 — tokens.go 0% → 85%
- #1817 — 5 sqlmock column-drift failures
- #1818 — Python pytest-cov setup
- #1819 — wsauth middleware coverage gap
- #1821 — tiered coverage policy (meta)
- #1822 — backend parity drift tracker

Co-authored-by: Hongming Wang <hongmingwang.rabbit@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>
2026-04-23 19:59:38 +00:00
Hongming Wang
c23ff848aa fix(cp-provisioner): look up real EC2 instance_id for Stop + IsRunning (#1738)
Resolves a "Save & Restart cascade" failure on SaaS tenants. Observed
2026-04-22 on hongmingwang workspace a8af9d79 after a Config-tab save:

  03:13:20 workspace deprovision: TerminateInstances
           InvalidInstanceID.Malformed: a8af9d79-... is malformed
  03:13:21 workspace provision: CreateSecurityGroup
           InvalidGroup.Duplicate: workspace-a8af9d79-394 already
           exists for VPC vpc-09f85513b85d7acee

Root cause: CPProvisioner.Stop and IsRunning passed the workspace UUID
as the `instance_id` query param to CP. CP forwarded it to EC2
TerminateInstances, which rejected it (EC2 ids are i-…, not UUIDs).
The failed terminate left the workspace's SG attached → the immediate
re-provision hit InvalidGroup.Duplicate → user saw `provisioning
failed`.

Fix: both methods now call a new `resolveInstanceID` that reads
`workspaces.instance_id` from the tenant DB and passes the real EC2
id downstream. When no row / no instance_id exists, Stop is a no-op
and IsRunning returns (false, nil) so restart cascades can freshly
re-provision.

resolveInstanceID is exposed as a `var` package-level func so tests
can swap it for a pairs-map stub without standing up sqlmock — the
per-table DB scaffolding was a heavier price than the surface
warranted given these tests are about the CP HTTP flow downstream
of the lookup, not the lookup SQL itself.
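
The swap-able seam, sketched (query and signature are assumptions):

    // Package-level var: real SQL in production, a pairs-map stub in tests.
    var resolveInstanceID = func(ctx context.Context, db *sql.DB, workspaceID string) (string, error) {
        var id sql.NullString
        err := db.QueryRowContext(ctx,
            `SELECT instance_id FROM workspaces WHERE id = $1`, workspaceID).Scan(&id)
        if errors.Is(err, sql.ErrNoRows) {
            return "", nil // no row → Stop no-ops, IsRunning → (false, nil)
        }
        if err != nil {
            return "", err
        }
        return id.String, nil // empty when the column is NULL
    }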

Adds regression tests:
  - TestStop_EmptyInstanceIDIsNoop: no DB row → no CP call
  - TestIsRunning_UsesDBInstanceID: DB id round-trips to CP
  - TestIsRunning_EmptyInstanceIDReturnsFalse: no instance → false/nil
Updates existing tests to assert the resolved instance_id (i-abc123
variants) instead of the previous buggy workspaceID.

After this lands, user's existing workspaces with stale instance_id
bindings still need a manual cleanup of the orphaned EC2 + SG (done
for a8af9d79 today). Future restarts use the correct id.

Co-authored-by: Hongming Wang <hongmingwang.rabbit@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 18:25:29 +00:00
Hongming Wang
4c0cb487c1 fix(cp-provisioner): use CP_ADMIN_API_TOKEN bearer for /cp/admin/* routes
Symptom (prod tenant hongmingwang, 2026-04-22):
  cp provisioner: console: unexpected 401
  GET /workspaces/:id/console → 502 (View Logs broken)

Root cause: the tenant's CPProvisioner.authHeaders sent the
provision-gate shared secret as the Authorization bearer for every outbound CP
call, including /cp/admin/workspaces/:id/console. But CP gates
/cp/admin/* with CP_ADMIN_API_TOKEN — a distinct secret so a
compromised tenant's provision credentials can't read other tenants'
serial console output. Bearer mismatch → 401.

Fix: split authHeaders into two methods (sketched below) —
  - provisionAuthHeaders(): Authorization: Bearer <MOLECULE_CP_SHARED_SECRET>
    for /cp/workspaces/* (Start, Stop, IsRunning)
  - adminAuthHeaders():     Authorization: Bearer <CP_ADMIN_API_TOKEN>
    for /cp/admin/* (GetConsoleOutput and future admin reads)

Both still send X-Molecule-Admin-Token for per-tenant identity. When
CP_ADMIN_API_TOKEN is unset (dev / self-hosted single-secret setups),
cpAdminAPIKey falls back to sharedSecret so nothing regresses.
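
The split, sketched (field names are assumptions):

    func (p *CPProvisioner) provisionAuthHeaders(req *http.Request) {
        if p.sharedSecret != "" {
            req.Header.Set("Authorization", "Bearer "+p.sharedSecret)
        }
        req.Header.Set("X-Molecule-Admin-Token", p.adminToken)
    }

    func (p *CPProvisioner) adminAuthHeaders(req *http.Request) {
        key := p.cpAdminAPIKey
        if key == "" {
            key = p.sharedSecret // dev / self-hosted single-secret fallback
        }
        req.Header.Set("Authorization", "Bearer "+key)
        req.Header.Set("X-Molecule-Admin-Token", p.adminToken)
    }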

Rollout requirement: the tenant EC2 needs CP_ADMIN_API_TOKEN in its
env — this PR wires up the code, but CP's tenant-provision path must
inject the value. Filed as follow-up; until then, operators can set
it manually on existing tenants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 17:13:38 -07:00
Hongming Wang
9df3159c59 feat(provisioner): pull workspace-template images from GHCR
Every standalone workspace-template repo now publishes to
ghcr.io/molecule-ai/workspace-template-<runtime>:latest via the
reusable publish-template-image workflow in molecule-ci (landed
today — one caller per template repo). This PR makes the
provisioner actually use those images:

- RuntimeImages map + DefaultImage switched from bare local tags
  (workspace-template:<runtime>) to their GHCR equivalents.
- New ensureImageLocal step before ContainerCreate: if the image
  isn't present locally, attempt `docker pull` and drain the
  progress stream to completion. Best-effort — if the pull fails
  (network, auth, rate limit) the subsequent ContainerCreate still
  surfaces the actionable "No such image" error, now with a
  GHCR-appropriate hint instead of the defunct
  `bash workspace/build-all.sh <runtime>` advice.
- runtimeTagFromImage now handles both forms: legacy
  `workspace-template:<runtime>` (local dev via build-all.sh /
  rebuild-runtime-images.sh) and the current GHCR shape. Keeps
  error hints sensible in both worlds.
- Tests cover the GHCR path for tag extraction and the new error
  message shape. Legacy local tags still recognised.
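
Best-effort sketch of ensureImageLocal (SDK option types vary by
version; the inspect-first check is what keeps local dev tags winning):

    func (p *Provisioner) ensureImageLocal(ctx context.Context, ref string) {
        if _, _, err := p.cli.ImageInspectWithRaw(ctx, ref); err == nil {
            return // image already present — local dev tag or prior pull
        }
        rc, err := p.cli.ImagePull(ctx, ref, image.PullOptions{})
        if err != nil {
            return // ContainerCreate will surface "No such image" + GHCR hint
        }
        defer func() { _ = rc.Close() }()
        _, _ = io.Copy(io.Discard, rc) // drain the progress stream to completion
    }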

Local dev path unchanged — scripts/build-images.sh and
workspace/rebuild-runtime-images.sh still produce locally-tagged
`workspace-template:<runtime>` images, and Docker's image
resolver matches them before any pull is attempted. So
contributors can keep iterating on a template repo without
round-tripping through GHCR.

Follow-on impact:
- hongmingwang.moleculesai.app (and any other tenant EC2) will
  auto-pull `ghcr.io/molecule-ai/workspace-template-hermes:latest`
  on the next hermes workspace provision — picking up the real
  Nous hermes-agent behind the A2A bridge (template-hermes v2.1.0)
  without any tenant-side rebuild step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 12:39:56 -07:00
molecule-ai[bot]
732f65e8e1 fix(go): replace $1 literal with resp.Body.Close() in 7 files (#1247)
PR #1229 sed command had no capture groups but used $1 in the
replacement, committing the literal string "defer func() { _ = \$1 }()"
instead of "defer func() { _ = resp.Body.Close() }()". The result does
not compile — $1 is not a valid Go identifier.

Fixed with: sed -i 's/defer func() { _ = \$1 }()/defer func() { _ = resp.Body.Close() }()/g'

Affected (all on origin/staging):
  workspace-server/cmd/server/cp_config.go
  workspace-server/internal/handlers/a2a_proxy.go
  workspace-server/internal/handlers/github_token.go
  workspace-server/internal/handlers/traces.go
  workspace-server/internal/handlers/transcript.go
  workspace-server/internal/middleware/session_auth.go
  workspace-server/internal/provisioner/cp_provisioner.go (3 occurrences)

Closes: #1245

Co-authored-by: Molecule AI Core-BE <core-be@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 03:18:21 +00:00
molecule-ai[bot]
2575960805 fix(errcheck): suppress unchecked resp.Body.Close() across workspace-server (#1229)
Issue #1196: golangci-lint errcheck flags bare resp.Body.Close()
calls because Body.Close() can return a non-nil error (e.g. when the
server sent fewer bytes than Content-Length). All occurrences fixed:

  defer resp.Body.Close()  →  defer func() { _ = resp.Body.Close() }()
  resp.Body.Close()        →  _ = resp.Body.Close()

12 files affected across all Go packages — channels, handlers,
middleware, provisioner, artifacts, and cmd. The body is already fully
consumed at each call site, so the error is always safe to discard.

🤖 Generated with [Claude Code](https://claude.ai)

Co-authored-by: Molecule AI Core-BE <core-be@agents.moleculesai.app>
2026-04-21 02:45:34 +00:00
Hongming Wang
731a9aef6e feat(platform): bootstrap-failed + console endpoints for CP watcher
Workspaces stuck in provisioning used to sit in "starting" for 10min
until the sweeper flipped them. The real signal — a runtime crash at
EC2 boot — lands on the serial console within seconds but nothing
listened. These endpoints close the loop.

1. POST /admin/workspaces/:id/bootstrap-failed
   The control plane's bootstrap watcher posts here when it spots
   "RUNTIME CRASHED" in ec2:GetConsoleOutput. Handler:
   - UPDATEs workspaces SET status='failed' only when status was
     'provisioning' (idempotent — a raced online/failed stays put)
   - Stores the error + log_tail in last_sample_error so the canvas
     can render the real stack trace, not a generic "timeout" string
   - Broadcasts WORKSPACE_PROVISION_FAILED with source='bootstrap_watcher'

2. GET /workspaces/:id/console
   Proxies to CP's new /cp/admin/workspaces/:id/console endpoint so
   the tenant platform can surface EC2 serial console output without
   holding AWS credentials. CPProvisioner.GetConsoleOutput is the
   client; returns 501 in non-CP deployments (docker-compose dev).

Both gated by AdminAuth — CP holds the tenant ADMIN_TOKEN that the
middleware accepts on its tier 2b branch.

Tests cover: happy-path fail, already-transitioned no-op, empty id,
log_tail truncation, and the 501 fallback when no CP is wired.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 17:11:34 -07:00
Hongming Wang
0a06cb4fc9 fix(cp_provisioner): cap IsRunning body read at 64 KiB
IsRunning used an unbounded json.NewDecoder(resp.Body).Decode on
CP status responses. Start already caps its body read at 64 KiB
(cp_provisioner.go:137) to defend against a misconfigured or
compromised CP streaming a huge body and exhausting memory.

IsRunning is called reactively per-request from a2a_proxy and
periodically from healthsweep, so it's a hotter path than Start
and arguably needs the same defense even more.
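
The bounded decode, sketched:

    // Mirror Start's cap: decode at most 64 KiB of the status body.
    var st struct {
        State string `json:"state"`
    }
    if err := json.NewDecoder(io.LimitReader(resp.Body, 64<<10)).Decode(&st); err != nil {
        return true, fmt.Errorf("cp provisioner: status: decode: %w", err)
    }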

Adds TestIsRunning_BoundedBodyRead that serves a body padded past
the cap and asserts the decode still succeeds on the JSON prefix.

Follow-up to code-review Nit-2 on #1073.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 09:06:20 -07:00
Hongming Wang
cfa901b89a fix(cp_provisioner): IsRunning returns (true, err) on transient failures
My #1071 made IsRunning return (false, err) on all error paths, but that
breaks a2a_proxy, which depends on the Docker provisioner's (true, err) contract.
Without this fix, any brief CP outage causes a2a_proxy to mark workspaces
offline and trigger restart cascades across every tenant.

Contract now matches Docker.IsRunning:
  transport error    → (true, err)  — alive, degraded signal
  non-2xx response   → (true, err)  — alive, degraded signal
  JSON decode error  → (true, err)  — alive, degraded signal
  2xx state!=running → (false, nil)
  2xx state==running → (true, nil)

healthsweep.go is also happy with this — it skips on err regardless.

Adds TestIsRunning_ContractCompat_A2AProxy as regression guard that
asserts each error path explicitly against the a2a_proxy expectations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 08:58:18 -07:00
Hongming Wang
e502003c74 fix(workspace-server): IsRunning surfaces non-2xx + JSON errors
Pre-existing silent-failure path: IsRunning decoded CP responses
regardless of HTTP status, so a CP 500 → empty body → State="" →
returned (false, nil). The sweeper couldn't distinguish "workspace
stopped" from "CP broken" and would leave a dead row in place.

## Fix

  - Non-2xx → wrapped error, does NOT echo body (CP 5xx bodies may
    contain echoed headers; leaking into logs would expose bearer)
  - JSON decode error → wrapped error
  - Transport error → now wrapped with "cp provisioner: status:"
    prefix for easier log grepping

## Tests

+7 cases (5-status table + malformed JSON + existing transport).
IsRunning coverage 100%; overall cp_provisioner at 98%.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 08:47:55 -07:00
Hongming Wang
6c4d1ae4db test(workspace-server): cover Stop/IsRunning/Close + auth-header + transport errors
Closes review gap: pre-PR coverage on CPProvisioner was 37%.
After this commit every exported method is exercised:

  - NewCPProvisioner   100%
  - authHeaders        100%
  - Start               91.7% (remainder: json.Marshal error path,
                        unreachable with fixed-type request struct)
  - Stop               100% (new — header + path + error)
  - IsRunning          100% (new — 4-state matrix + auth)
  - Close              100% (new — contract no-op)

New cases assert both auth headers (shared secret + admin_token) land
on every outbound request, transport failures surface clear errors
on Start/Stop, and IsRunning doesn't misreport on transport failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 08:37:39 -07:00
Hongming Wang
d3386ad620 fix(workspace-server): send X-Molecule-Admin-Token on CP calls
controlplane #118 + #130 made /cp/workspaces/* require a per-tenant
admin_token header in addition to the platform-wide shared secret.
Without it, every workspace provision / deprovision / status call
now 401s.

ADMIN_TOKEN is already injected into the tenant container by the
controlplane's Secrets Manager bootstrap, so this is purely a
header-plumbing change — no new config required on the tenant side.

## Change

- CPProvisioner carries adminToken alongside sharedSecret
- New authHeaders method sets BOTH auth headers on every outbound
  request (old authHeader deleted — single call site was misleading
  once the semantics changed)
- Empty values on either header are no-ops so self-hosted / dev
  deployments without a real CP still work

## Tests

Renamed + expanded cp_provisioner_test cases:
- TestAuthHeaders_NoopWhenBothEmpty — self-hosted path
- TestAuthHeaders_SetsBothWhenBothProvided — prod happy path
- TestAuthHeaders_OnlyAdminTokenWhenSecretEmpty — transition window

Full workspace-server suite green.

## Rollout

Next tenant provision will ship an image with this commit merged.
Existing tenants (none in prod right now — hongming was the only
one and was purged earlier today) will auto-update via the 5-min
image-pull cron.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 08:17:50 -07:00
Hongming Wang
296c52cb25 test(ws-server): cover CPProvisioner — auth, env fallback, error paths
Post-merge audit flagged cp_provisioner.go as the only new file from
the canary/C1 work without test coverage. Fills the gap:

- NewCPProvisioner_RequiresOrgID — self-hosted without MOLECULE_ORG_ID
  refuses to construct (avoids silent phone-home to prod CP).
- NewCPProvisioner_FallsBackToProvisionSharedSecret — the operator
  ergonomics of using one env-var name on both sides of the wire.
- AuthHeader noop + happy path — bearer only set when secret is set.
- Start_HappyPath — end-to-end POST to stubbed CP, bearer forwarded,
  instance_id parsed out of response.
- Start_Non201ReturnsStructuredError — when CP returns structured
  {"error":"…"}, that message surfaces to the caller.
- Start_NoStructuredErrorFallsBackToSize — regression gate for the
  anti-log-leak change from PR #980: raw upstream body must NOT
  appear in the error, only the byte count.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 03:41:16 -07:00
Hongming Wang
896a34429a Merge pull request #981 from Molecule-AI/fix/security-tenant-cpprovisioner-bearer
fix(security): tenant CPProvisioner sends CP bearer on provision / stop / status
2026-04-19 01:55:20 -07:00
Hongming Wang
a79366a04a fix(security): tenant CPProvisioner attaches CP bearer on all calls
Completes the C1 integration (PR #50 on molecule-controlplane). The CP
now requires Authorization: Bearer <PROVISION_SHARED_SECRET> on all
three /cp/workspaces/* endpoints; without this change the tenant-side
Start/Stop/IsRunning calls would all 401 (or 404 when the CP's routes
refused to mount) and every workspace provision from a SaaS tenant
would silently fail.

Reads MOLECULE_CP_SHARED_SECRET, falling back to PROVISION_SHARED_SECRET
so operators can use one env-var name on both sides of the wire. Empty
value is a no-op: self-hosted deployments with no CP or a CP that
doesn't gate /cp/workspaces/* keep working as before.
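
The lookup, sketched:

    secret := os.Getenv("MOLECULE_CP_SHARED_SECRET")
    if secret == "" {
        secret = os.Getenv("PROVISION_SHARED_SECRET") // one name, both sides of the wire
    }
    // secret == "" → no bearer attached; self-hosted / ungated CPs keep working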

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 01:53:12 -07:00
Hongming Wang
365f13199e fix(security): scrub workspace-server token + upstream error logs
Two findings from the pre-launch log-scrub audit:

1. handlers/workspace_provision.go:548 logged `token[:8]` — the exact
   H1 pattern that panicked on short keys. Even with a length guard,
   leaking 8 chars of an auth token into centralized logs shortens the
   search space for anyone who gets log-read access. Now logs only
   `len(token)` as a liveness signal.

2. provisioner/cp_provisioner.go:101 fell back to logging the raw
   control-plane response body when the structured {"error":"..."}
   field was absent. If the CP ever echoed request headers (Authorization)
   or a portion of user-data back in an error path, the bearer token
   would end up in our tenant-instance logs. Now logs the byte count
   only; the structured error remains in place for the happy path.
   Also caps the read at 64 KiB via io.LimitReader to prevent
   log-flood DoS from a compromised upstream.
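
The scrubbed fallback, sketched:

    body, _ := io.ReadAll(io.LimitReader(resp.Body, 64<<10)) // cap: log-flood guard
    var cpErr struct {
        Error string `json:"error"`
    }
    if json.Unmarshal(body, &cpErr) == nil && cpErr.Error != "" {
        return fmt.Errorf("cp provision failed: %s", cpErr.Error) // structured path
    }
    // No structured error: byte count only — never the raw body, which
    // could echo Authorization headers or user-data back into our logs.
    return fmt.Errorf("cp provision failed: status %d (%d-byte body)",
        resp.StatusCode, len(body))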

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 01:33:47 -07:00
Hongming Wang
39074cc4ae chore: final open-source cleanup — binary, stale paths, private refs
- Remove compiled workspace-server/server binary from git
- Fix .gitignore, .gitattributes, .githooks/pre-commit for renamed dirs
- Fix CI workflow path filters (workspace-template → workspace)
- Replace real EC2 IP and personal slug in test_saas_tenant.sh
- Scrub molecule-controlplane references in docs
- Fix stale workspace-template/ paths in provisioner, handlers, tests
- Clean tracked Python cache files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 00:38:55 -07:00
Hongming Wang
d8026347e5 chore: open-source restructure — rename dirs, remove internal files, scrub secrets
Renames:
- platform/ → workspace-server/ (Go module path stays as "platform" for
  external dep compat — will update after plugin module republish)
- workspace-template/ → workspace/

Removed (moved to separate repos or deleted):
- PLAN.md — internal roadmap (move to private project board)
- HANDOFF.md, AGENTS.md — one-time internal session docs
- .claude/ — gitignored entirely (local agent config)
- infra/cloudflare-worker/ → Molecule-AI/molecule-tenant-proxy
- org-templates/molecule-dev/ → standalone template repo
- .mcp-eval/ → molecule-mcp-server repo
- test-results/ — ephemeral, gitignored

Security scrubbing:
- Cloudflare account/zone/KV IDs → placeholders
- Real EC2 IPs → <EC2_IP> in all docs
- CF token prefix, Neon project ID, Fly app names → redacted
- Langfuse dev credentials → parameterized
- Personal runner username/machine name → generic

Community files:
- CONTRIBUTING.md — build, test, branch conventions
- CODE_OF_CONDUCT.md — Contributor Covenant 2.1

All Dockerfiles, CI workflows, docker-compose, railway.toml, render.yaml,
README, CLAUDE.md updated for new directory names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 00:24:44 -07:00