Commit Graph

9 Commits

Author SHA1 Message Date
Hongming Wang
235aca9908 fix(boot): always start health-sweep goroutine — SaaS tenants need it for external-runtime liveness
Pre-fix, cmd/server/main.go gated the entire health-sweep goroutine on
`prov != nil`. On SaaS tenants (`MOLECULE_ORG_ID` set) the local Docker
provisioner is never initialized — only `cpProv`. So the goroutine
never started, and `sweepStaleRemoteWorkspaces` (which transitions
runtime='external' workspaces from 'online' to 'awaiting_agent' when
their last_heartbeat_at goes stale) never ran.

Net effect on production: every external-runtime workspace on SaaS
that lost its agent stayed 'online' indefinitely instead of falling
back to 'awaiting_agent' (re-registrable). The drift gate (#2388)
caught the migration side and #2382 fixed the SQL writes, but this
orchestration-side gate slipped through both because there was no
SaaS-mode E2E coverage on the heartbeat-loss → awaiting_agent
transition.

Caught by #2392 (live staging external-runtime regression E2E)
failing at step 6 — 180s with no heartbeat, expected
status=awaiting_agent, got online.

Fix: drop the `if prov != nil` gate. `StartHealthSweep` already
handles a nil checker correctly (healthsweep.go:50-71): the Docker
sweep is gated inside the loop, and the remote sweep always runs. Test
coverage already exists in TestStartHealthSweep_NilCheckerRunsRemoteSweep.
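
For reference, a minimal sketch of the gate placement. Only
sweepStaleRemoteWorkspaces and the ContainerChecker pattern are named by this
change; the package name, method set, and the other helper are illustrative:

```go
// Sketch only: sweepLocalContainers and ContainerChecker's method here are
// placeholders; sweepStaleRemoteWorkspaces is the function named above.
package healthsweep

import (
    "context"
    "time"
)

type ContainerChecker interface {
    Ping(ctx context.Context) error
}

func sweepLocalContainers(ctx context.Context, c ContainerChecker) { /* Docker-side sweep */ }
func sweepStaleRemoteWorkspaces(ctx context.Context)               { /* online -> awaiting_agent on stale heartbeat */ }

// A nil checker disables only the local Docker sweep; the remote sweep runs
// regardless, which is why gating the whole goroutine on `prov != nil` at the
// call site was the bug.
func StartHealthSweep(ctx context.Context, checker ContainerChecker, interval time.Duration) {
    go func() {
        t := time.NewTicker(interval)
        defer t.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-t.C:
                if checker != nil {
                    sweepLocalContainers(ctx, checker)
                }
                sweepStaleRemoteWorkspaces(ctx)
            }
        }
    }()
}
```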

After this lands and tenants redeploy, #2392 step 6 passes and the
regression coverage closes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 12:05:40 -07:00
Hongming Wang
c0a5d842b4 feat(runtime): native_scheduler skip — primitive #3 of 6
When an adapter declares provides_native_scheduler=True (because its
SDK has built-in cron / Temporal-style workflows), the platform's
polling loop must skip firing schedules for that workspace — otherwise
the schedule fires twice (once natively, once via platform). The
native skip preserves observability (next_run_at still advances, the
schedule row stays in the DB, last_run_at would still update) while
moving the FIRE responsibility to the SDK.
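
A minimal sketch of that skip semantic; Schedule, advanceNextRun, and fire are
placeholders here, and the real wiring is in the component list below:

```go
// Sketch of the intended tick-time behavior; the nil-means-always-fire
// contract is the real one, the helper names are illustrative.
package scheduler

import "context"

type Schedule struct {
    ID          string
    WorkspaceID string
}

type Scheduler struct {
    // Reports whether the workspace's adapter owns firing. Nil preserves
    // today's behavior: the platform always fires.
    NativeSchedulerCheck func(workspaceID string) bool
}

func (s *Scheduler) tick(ctx context.Context, due []Schedule) {
    for _, sch := range due {
        advanceNextRun(ctx, sch) // observability: next_run_at moves either way
        if s.NativeSchedulerCheck != nil && s.NativeSchedulerCheck(sch.WorkspaceID) {
            continue // SDK fires natively; the platform must not double-fire
        }
        go fire(ctx, sch) // existing goroutine fan-out
    }
}

func advanceNextRun(ctx context.Context, sch Schedule) {}
func fire(ctx context.Context, sch Schedule)           {}
```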

Stacked on PR #2139 (idle_timeout_override end-to-end). The
RuntimeMetadata heartbeat block already carries the capability map;
this PR teaches the platform how to read and act on the scheduler bit.

Components:

  - handlers/runtime_overrides.go: extended the cache to store
    capability flags alongside the idle timeout. The two heartbeat
    fields are independent: SetIdleTimeout / SetCapabilities each
    update one without stomping the other. SetCapabilities takes a
    defensive copy so a caller mutating its map after the call can't
    retroactively change cached declarations, and empty entries are
    dropped to avoid stale husks (see the cache sketch after this
    list).

  - handlers/runtime_overrides.go: new HasCapability(workspaceID, name)
    + ProvidesNativeScheduler(workspaceID) — the latter is the
    package-level adapter the scheduler imports (avoids a
    handlers/scheduler import cycle).

  - handlers/registry.go: heartbeat handler now calls SetCapabilities
    in addition to SetIdleTimeout.

  - scheduler/scheduler.go: NativeSchedulerCheck function-pointer DI
    (mirrors the existing QueueDrainFunc pattern). New() leaves the
    field nil so existing callers preserve today's "always fire"
    behavior. SetNativeSchedulerCheck wires production. tick() drops
    workspaces declaring native ownership before goroutine fan-out;
    advances next_run_at so we don't tight-loop on the same row.

  - cmd/server/main.go: wires handlers.ProvidesNativeScheduler into
    the cron scheduler at server boot.
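
A sketch of the capability-cache behavior from the first bullet; the struct
and field names are made up, while SetCapabilities, HasCapability, and the
empty-entry rule are the behaviors described above:

```go
// Cache sketch only, not the real runtime_overrides.go shapes.
package handlers

import (
    "sync"
    "time"
)

type runtimeOverride struct {
    idleTimeout  time.Duration
    capabilities map[string]bool
}

type overrideCache struct {
    mu      sync.Mutex
    entries map[string]*runtimeOverride
}

// SetCapabilities stores a defensive copy so a caller mutating its map after
// the call can't retroactively flip the cached declaration. The idle-timeout
// field is untouched, keeping the two heartbeat fields independent.
func (c *overrideCache) SetCapabilities(workspaceID string, caps map[string]bool) {
    c.mu.Lock()
    defer c.mu.Unlock()
    if c.entries == nil {
        c.entries = make(map[string]*runtimeOverride)
    }
    e, ok := c.entries[workspaceID]
    if !ok {
        e = &runtimeOverride{}
        c.entries[workspaceID] = e
    }
    if len(caps) == 0 {
        e.capabilities = nil // adapter dropped its flags: clear, don't keep a stale husk
        return
    }
    cp := make(map[string]bool, len(caps))
    for k, v := range caps {
        cp[k] = v
    }
    e.capabilities = cp
}

func (c *overrideCache) HasCapability(workspaceID, name string) bool {
    c.mu.Lock()
    defer c.mu.Unlock()
    e, ok := c.entries[workspaceID]
    return ok && e.capabilities[name]
}
```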

Tests:
  Go (7 new):
    - SetCapabilitiesAndHas (round-trip)
    - per-workspace isolation (ws-a's declaration doesn't leak to ws-b)
    - nil/empty map clears (adapter dropping the flag restores fallback)
    - SetCapabilities is a defensive copy (caller mutation can't
      retroactively flip cached value)
    - SetIdleTimeout preserves capabilities and vice-versa (two-field
      independence)
    - empty entry deleted (no stale husks)
    - ProvidesNativeScheduler reads the same singleton heartbeat writes
    - SetNativeSchedulerCheck wires the function (scheduler-side)
    - nil-check safety contract for tick

  Python: no change needed — the heartbeat already serializes the
  full capability map via _runtime_metadata_payload (PR #2139). An
  adapter setting RuntimeCapabilities(provides_native_scheduler=True)
  automatically flows through.

Verification:
  - 1308 / 1308 Python pytest pass (unchanged)
  - All Go handlers + scheduler tests pass
  - go build + go vet clean

See project memory `project_runtime_native_pluggable.md`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 22:47:00 -07:00
Hongming Wang
9375e3d4ee feat(workspace-server): GHCR digest watcher closes runtime CD chain (#2114)
Adds an opt-in goroutine that polls GHCR every 5 minutes for digest
changes on each workspace-template-*:latest tag and invokes the same
refresh logic /admin/workspace-images/refresh exposes. With this, the
chain from "merge runtime PR" to "containers running new code" is fully
hands-off — no operator step between auto-tag → publish-runtime →
cascade → template image rebuild → host pull + recreate.

Opt-in via IMAGE_AUTO_REFRESH=true. SaaS deploys whose pipeline already
pulls every release should leave it off (would be redundant work);
self-hosters get true zero-touch.

Why a refactor of admin_workspace_images.go is in this PR:
the HTTP handler held all the refresh logic inline. To share it with
the new watcher without an HTTP loopback, the logic is extracted into a
WorkspaceImageService with a Refresh(ctx, runtimes, recreate)
(RefreshResult, error) shape. The HTTP handler is now a thin wrapper;
behavior is preserved (same JSON response, same 500-on-list-failure,
same per-runtime soft-fail).

Watcher design notes:
- Last-observed digest tracked in memory (not persisted). On boot the
  first observation per runtime is seed-only — no spurious refresh
  fires on every restart.
- On Refresh error, the seen digest rolls back so the next tick retries.
  Without this rollback a transient Docker glitch would convince the
  watcher the work was done.
- Per-runtime fetch errors don't block other runtimes (one template's
  brief 500 doesn't pause the others).
- digestFetcher injection seam in tick() lets unit tests cover all
  bookkeeping branches without standing up an httptest GHCR server.
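
A minimal sketch of that bookkeeping; only the digestFetcher seam is named by
this PR, the other shapes are illustrative:

```go
// Sketch of the seed-only / rollback bookkeeping, not the real watcher code.
package watcher

import "context"

type digestFetcher func(ctx context.Context, runtime string) (string, error)

type imageWatcher struct {
    fetch   digestFetcher
    refresh func(ctx context.Context, runtime string) error // calls the shared refresh logic
    seen    map[string]string                                // runtime -> last observed digest, in memory only
}

func (w *imageWatcher) tick(ctx context.Context, runtimes []string) {
    if w.seen == nil {
        w.seen = make(map[string]string)
    }
    for _, rt := range runtimes {
        digest, err := w.fetch(ctx, rt)
        if err != nil {
            continue // one template's fetch error must not block the others
        }
        prev, known := w.seen[rt]
        w.seen[rt] = digest
        if !known {
            continue // first observation after boot is seed-only: no spurious refresh
        }
        if digest == prev {
            continue // upstream image unchanged
        }
        if err := w.refresh(ctx, rt); err != nil {
            w.seen[rt] = prev // roll back so the next tick retries instead of assuming success
        }
    }
}
```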

Verified live: probed GHCR's /token + manifest HEAD against
workspace-template-claude-code; got HTTP 200 + a real
Docker-Content-Digest. Same calls the watcher makes.
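
Roughly what those two calls look like (the standard registry flow: anonymous
pull token, then a manifest HEAD whose Docker-Content-Digest header carries the
digest); the image path passed in is a placeholder, not the watcher's code:

```go
// Sketch of the GHCR digest probe described above.
package watcher

import (
    "encoding/json"
    "fmt"
    "net/http"
)

func fetchGHCRDigest(image, tag string) (string, error) {
    resp, err := http.Get(fmt.Sprintf(
        "https://ghcr.io/token?service=ghcr.io&scope=repository:%s:pull", image))
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return "", fmt.Errorf("token: %s", resp.Status)
    }
    var tok struct {
        Token string `json:"token"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&tok); err != nil {
        return "", err
    }

    req, err := http.NewRequest(http.MethodHead,
        fmt.Sprintf("https://ghcr.io/v2/%s/manifests/%s", image, tag), nil)
    if err != nil {
        return "", err
    }
    req.Header.Set("Authorization", "Bearer "+tok.Token)
    req.Header.Set("Accept",
        "application/vnd.oci.image.index.v1+json, "+
            "application/vnd.docker.distribution.manifest.list.v2+json, "+
            "application/vnd.docker.distribution.manifest.v2+json")
    res, err := http.DefaultClient.Do(req)
    if err != nil {
        return "", err
    }
    defer res.Body.Close()
    if res.StatusCode != http.StatusOK {
        return "", fmt.Errorf("manifest HEAD: %s", res.Status)
    }
    return res.Header.Get("Docker-Content-Digest"), nil
}
```

HEAD keeps the 5-minute poll cheap: headers only, no manifest body download.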

Co-authored-by: Hongming Wang <hongmingwangalt@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 13:36:26 -07:00
Hongming Wang
12c4918318 fix(platform): stop leaking workspace containers on delete
Symptom: deleting workspaces from the canvas marked DB rows
status='removed' but left Docker containers running indefinitely.
After a session of org imports + cancellations, we counted 10
running ws-* containers all backed by 'removed' DB rows, eating
~1100% CPU on the Docker VM.

Two compounding bugs in handlers/workspace_crud.go's delete cascade:

1. The cleanup loop used `c.Request.Context()` for the Docker
   stop/remove calls. When the canvas's `api.del` resolved on the
   platform's 200, gin cancelled the request ctx — and any in-flight
   Docker call cancelled with `context canceled`, leaving the
   container alive. Old logs:
       "Delete descendant <id> volume removal warning:
        ... context canceled"

2. `provisioner.Stop`'s error return was discarded and `RemoveVolume`
   ran unconditionally afterward. When Stop didn't actually kill the
   container (transient daemon error, ctx cancellation as in #1), the
   volume removal would predictably fail with "volume in use" and
   the container kept running with the volume mounted. Old logs:
       "Delete descendant <id> volume removal warning:
        Error response from daemon: remove ... volume is in use"

Fix layered in two parts:

- workspace_crud.go: detach cleanup with `context.WithoutCancel(ctx)`
  + a 30s bounded timeout. Stop's error is now checked, and on
  failure we skip RemoveVolume entirely (the orphan sweeper below
  catches what we deferred); see the sketch after this list.

- New registry/orphan_sweeper.go: periodic reconcile pass (every 60s,
  initial run on boot). Lists running ws-* containers via Docker name
  filter, intersects with DB rows where status='removed', stops +
  removes volumes for the leaks. Defence in depth — even a brand-new
  Stop failure mode heals on the next sweep instead of leaking
  forever.
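
A sketch of the detach-and-check pattern from the first bullet; the helper
name is made up and the Provisioner interface is trimmed to the two calls
involved:

```go
// Sketch only, not the real workspace_crud.go code.
package handlers

import (
    "context"
    "log"
    "time"
)

type Provisioner interface {
    Stop(ctx context.Context, containerID string) error
    RemoveVolume(ctx context.Context, workspaceID string) error
}

func cleanupWorkspaceContainer(reqCtx context.Context, prov Provisioner, workspaceID, containerID string) {
    // Detach from the request context so the canvas resolving on the 200 (and
    // gin cancelling reqCtx) can't abort an in-flight Docker call, but keep a
    // 30s bound so cleanup can't hang forever.
    ctx, cancel := context.WithTimeout(context.WithoutCancel(reqCtx), 30*time.Second)
    defer cancel()

    if err := prov.Stop(ctx, containerID); err != nil {
        // The container may still be running; removing the volume would just
        // fail with "volume is in use". Leave both for the orphan sweeper.
        log.Printf("delete %s: stop failed, deferring to sweeper: %v", workspaceID, err)
        return
    }
    if err := prov.RemoveVolume(ctx, workspaceID); err != nil {
        log.Printf("delete %s: volume removal warning: %v", workspaceID, err)
    }
}
```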

Provisioner gains a tiny ListWorkspaceContainerIDPrefixes helper
that wraps ContainerList with the `name=ws-` filter; the sweeper
takes an OrphanReaper interface (matches the ContainerChecker
pattern in healthsweep.go) so unit tests don't need a real Docker
daemon.
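
Roughly the seam, with an assumed method set on OrphanReaper (only the
ListWorkspaceContainerIDPrefixes name comes from this change):

```go
// Sketch of the sweeper's interface seam and reconcile pass.
package registry

import "context"

type OrphanReaper interface {
    ListWorkspaceContainerIDPrefixes(ctx context.Context) ([]string, error) // ContainerList with the name=ws- filter
    Stop(ctx context.Context, id string) error
    RemoveVolume(ctx context.Context, id string) error
}

// sweepOnce stops and removes any running ws-* container whose DB row is
// already status='removed'. One container's error never aborts the pass.
func sweepOnce(ctx context.Context, reaper OrphanReaper, removed map[string]bool) {
    if reaper == nil {
        return // nil-reaper no-op
    }
    running, err := reaper.ListWorkspaceContainerIDPrefixes(ctx)
    if err != nil {
        return // daemon error: skip this cycle, retry on the next one
    }
    for _, id := range running {
        if !removed[id] {
            continue // container still backed by a live DB row
        }
        if err := reaper.Stop(ctx, id); err != nil {
            continue // still running, so don't touch its volume
        }
        _ = reaper.RemoveVolume(ctx, id) // non-fatal: keep reconciling the rest
    }
}
```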

main.go wires the sweeper alongside the existing liveness +
health-sweep + provisioning-timeout monitors, all under
supervised.RunWithRecover so a panic restarts the goroutine.

6 new sweeper tests cover the reconcile path, the
no-running-containers short-circuit, the daemon-error skip, the
Stop-failure-leaves-volume invariant (the same trap that motivated
this fix), the volume-remove-error-is-non-fatal continuation,
and the nil-reaper no-op.

Verified: full Go test suite passes; manually purged the 10 leaked
containers + their orphan volumes from the dev host with `docker
rm -f` + `docker volume rm` (one-off cleanup; the sweeper would
have caught them on the next cycle once deployed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 12:36:22 -07:00
Hongming Wang
f8c900909e fix(platform): auto-load .env from CWD on startup
Local dev runs (`/tmp/molecule-server` after `go build`) used to 401 on
/workspaces the moment the DB had any workspace token in it: the binary
inherited a bare shell env with no MOLECULE_ENV, so AdminAuth's dev
fail-open branch (gated on MOLECULE_ENV=development) didn't fire.

The repo's .env already has MOLECULE_ENV=development plus DATABASE_URL,
REDIS_URL, ADMIN_TOKEN=, etc. Until now you had to `set -a && source
.env` in the launching shell — a paper cut, but worse, it's a paper
cut in EVERY automated dev workflow (IDE run configs, integration
test harnesses, the smoke-test loop in this branch's manual testing).

Fix: cmd/server now walks upward from CWD looking for a .env (capped
at 6 levels) and merges KEY=VALUE pairs into os.Environ before any
other code reads env. Already-set vars win over file values, so
docker run -e / CI exports / `KEY=val ./binary` still dominate — only
unset keys get filled in.

Why no godotenv dep: the format we use is plain KEY=VALUE with `#`
comments, no interpolation, no quoting (verified against the live
.env: 49 kv lines, zero references to ${...} or `export`). A 30-line
parser is auditable and avoids supply-chain surface.
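
A sketch of the walk-and-merge under those constraints; findDotEnv is the name
mentioned below, loadDotEnv and the exact parsing details are illustrative:

```go
// Sketch only, not the shipped cmd/server code.
package main

import (
    "bufio"
    "os"
    "path/filepath"
    "strings"
)

// findDotEnv walks upward from the working directory, at most 6 levels.
func findDotEnv() string {
    dir, err := os.Getwd()
    if err != nil {
        return ""
    }
    for i := 0; i < 6; i++ {
        candidate := filepath.Join(dir, ".env")
        if _, err := os.Stat(candidate); err == nil {
            return candidate
        }
        parent := filepath.Dir(dir)
        if parent == dir {
            break // hit the filesystem root
        }
        dir = parent
    }
    return ""
}

// loadDotEnv merges KEY=VALUE pairs into the environment; already-set vars win.
func loadDotEnv(path string) {
    f, err := os.Open(path)
    if err != nil {
        return
    }
    defer f.Close()
    sc := bufio.NewScanner(f)
    for sc.Scan() {
        line := strings.TrimSpace(sc.Text())
        if line == "" || strings.HasPrefix(line, "#") {
            continue
        }
        key, val, ok := strings.Cut(line, "=")
        if !ok {
            continue
        }
        key = strings.TrimSpace(key)
        if _, exists := os.LookupEnv(key); exists {
            continue // docker run -e, CI exports, KEY=val ./binary all dominate
        }
        os.Setenv(key, strings.TrimSpace(val))
    }
}

func main() {
    if p := findDotEnv(); p != "" {
        loadDotEnv(p)
    }
    // ...everything else reads env only after this point
}
```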

Why it's safe in production: Dockerfile doesn't COPY .env into the
image and .env is gitignored, so prod containers have no .env on
disk to load — the function's findDotEnv() loop finds nothing and
returns silently. If an operator deliberately drops one in, the
existing-env-wins rule means container-injected env still dominates.

Verified by booting `env -i HOME=$HOME PATH=$PATH /tmp/molecule-server`
from the repo root with a stripped env: log shows
".env: /Users/.../molecule-core/.env — loaded 49, 0 already set" and
/workspaces returns 200 instead of 401.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 20:33:28 -07:00
Hongming Wang
03e913db75 feat(#1957): wire gh-identity plugin into workspace-server
Ships the monorepo side of molecule-core#1957 (agent identity collapse).
Companion to molecule-ai-plugin-gh-identity (new repo, merged-and-tagged
separately).

Changes:
- manifest.json: add gh-identity plugin to Tier 1 registry
- workspace-server/go.mod: require github.com/Molecule-AI/molecule-ai-plugin-gh-identity
- cmd/server/main.go: build a shared provisionhook.Registry, register
  gh-identity first (always), then github-app-auth (gated on GITHUB_APP_ID)
- workspace_provision.go: propagate workspace.Role into
  env["MOLECULE_AGENT_ROLE"] before calling the mutator chain, so the
  gh-identity plugin can see which agent is booting
- provisionhook/mutator.go: add Registry.Mutators() accessor so
  individual-plugin registries can be merged onto a shared one at boot
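
The mutator-chain idea, sketched with an assumed Mutator shape; only
Registry.Mutators() and the registration order above come from this change:

```go
// Sketch of the env-mutator chain, not the real provisionhook package.
package provisionhook

type Mutator interface {
    Name() string
    MutateEnv(env map[string]string) error
}

type Registry struct {
    mutators []Mutator
}

func (r *Registry) Register(m Mutator)  { r.mutators = append(r.mutators, m) }
func (r *Registry) Mutators() []Mutator { return r.mutators }

// applyChain runs every registered mutator in order; provisioning code would
// call it after seeding env["MOLECULE_AGENT_ROLE"] from workspace.Role.
func applyChain(r *Registry, env map[string]string) error {
    for _, m := range r.Mutators() {
        if err := m.MutateEnv(env); err != nil {
            return err
        }
    }
    return nil
}
```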

Boot log gains a line like:
  env-mutator chain: [gh-identity github-app-auth]

Effect per workspace:
- env contains MOLECULE_AGENT_ROLE, MOLECULE_OWNER, MOLECULE_ATTRIBUTION_BADGE,
  MOLECULE_GH_WRAPPER_B64, MOLECULE_GH_WRAPPER_SHA
- Each workspace template's install.sh can decode + install the wrapper at
  /usr/local/bin/gh, intercepting @me assignment and prepending agent
  attribution on PR/issue creates

Does not break existing workspaces — absent workspace.role, the plugin is
a no-op. Absent install.sh updates in each template, the env vars are
simply unused.

Follow-up template PRs (hermes, claude-code, langgraph, etc.) each add
~15 lines to install.sh to decode + install the wrapper.

Ref: #1957

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 15:01:41 +00:00
Hongming Wang
ff338e0489 fix: harden stuck-provisioning UX — details crash, preflight, sweeper
Workspaces stuck in status='provisioning' previously surfaced in three
bad ways:

1. **Details tab crashed** with `Cannot read properties of undefined
   (reading 'toLocaleString')`. `BudgetSection` + `WorkspaceUsage`
   assumed full response shapes but a provisioning-stuck workspace
   returns partial `{}`. Guard each deep field with `?? 0` and cover
   the partial-response case with regression tests.

2. **Missing required env vars failed silently**, surfacing 15+ minutes
   later as only a cosmetic "Provisioning Timeout" banner. The
   in-container preflight catches them, but by then the container has
   already crashed without calling /registry/register, so the workspace
   sits in 'provisioning' forever. Mirror the preflight server-side:
   parse config.yaml's `runtime_config.required_env` before launch and
   fail fast with a WORKSPACE_PROVISION_FAILED event naming the missing
   vars.

3. **No backend timeout** ever flipped a stuck workspace to 'failed'.
   Add a registry sweeper (10m default, env-overridable) that detects
   workspaces stuck past the window, flips them to 'failed', and emits
   WORKSPACE_PROVISION_TIMEOUT. Race-safe: the UPDATE re-checks the
   status + age predicate so a concurrent register/restart wins.
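
The sweeper's guarded UPDATE, sketched with assumed table/column names:

```go
// Sketch only; the point is that the UPDATE itself re-checks status and age.
package registry

import (
    "context"
    "database/sql"
    "time"
)

func sweepStuckProvisioning(ctx context.Context, db *sql.DB, window time.Duration) (int64, error) {
    cutoff := time.Now().Add(-window)
    res, err := db.ExecContext(ctx, `
        UPDATE workspaces
           SET status = 'failed'
         WHERE status = 'provisioning'
           AND created_at < $1`, cutoff)
    if err != nil {
        return 0, err
    }
    return res.RowsAffected() // caller emits WORKSPACE_PROVISION_TIMEOUT for the flipped rows
}
```

Because the predicate is evaluated at write time, a workspace that registered
or restarted concurrently no longer matches and is left alone.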

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 14:51:39 -07:00
Hongming Wang
48ec5b2dc8 feat(ws-server): pull env from CP on startup
Paired with molecule-controlplane PR #55 (GET /cp/tenants/config). Lets
existing tenants heal themselves when we rotate or add a CP-side env
var (e.g. MOLECULE_CP_SHARED_SECRET landing earlier today) without any
ssh or re-provision.

Flow: main() calls refreshEnvFromCP() before any other os.Getenv read.
The helper reads MOLECULE_ORG_ID + ADMIN_TOKEN from the baked-in
user-data env, GETs {MOLECULE_CP_URL}/cp/tenants/config with those
credentials, and applies the returned string map via os.Setenv so
downstream code (CPProvisioner, etc.) sees the fresh values.

Best-effort semantics:
- self-hosted / no MOLECULE_ORG_ID → no-op (return nil)
- CP unreachable / non-200 → log + return error (main keeps booting)
- oversized values (>4 KiB each) rejected to avoid env pollution
- body read capped at 64 KiB
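
A sketch of refreshEnvFromCP under those semantics; the endpoint, env var
names, and size caps come from this message, while the auth headers and JSON
shape are assumptions:

```go
// Sketch only, not the shipped refreshEnvFromCP.
package main

import (
    "encoding/json"
    "fmt"
    "io"
    "log"
    "net/http"
    "os"
)

func refreshEnvFromCP() error {
    orgID := os.Getenv("MOLECULE_ORG_ID")
    if orgID == "" {
        return nil // self-hosted: nothing to refresh
    }
    req, err := http.NewRequest(http.MethodGet, os.Getenv("MOLECULE_CP_URL")+"/cp/tenants/config", nil)
    if err != nil {
        return err
    }
    req.Header.Set("Authorization", "Bearer "+os.Getenv("ADMIN_TOKEN")) // assumed scheme
    req.Header.Set("X-Molecule-Org-ID", orgID)                          // assumed header
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("cp tenants/config: %s", resp.Status)
    }
    var cfg map[string]string
    if err := json.NewDecoder(io.LimitReader(resp.Body, 64<<10)).Decode(&cfg); err != nil {
        return err // body read capped at 64 KiB
    }
    for k, v := range cfg {
        if len(v) > 4<<10 {
            continue // oversized value: reject to avoid env pollution
        }
        os.Setenv(k, v)
    }
    return nil
}

func main() {
    if err := refreshEnvFromCP(); err != nil {
        log.Printf("env refresh from CP failed (continuing): %v", err)
    }
    // ...rest of boot; CPProvisioner etc. now see the fresh values
}
```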

Once this image hits GHCR, the 5-minute tenant auto-updater picks it
up, the container restarts, refresh runs, and every tenant has
MOLECULE_CP_SHARED_SECRET within ~5 minutes — no operator toil.

Also fixes workspace-server/.gitignore: the `server` entry was meant to
ignore only the compiled binary, but the bare pattern also matched the
cmd/server package dir. Anchored to `/server`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 02:41:15 -07:00
Hongming Wang
479a027e4b chore: open-source restructure — rename dirs, remove internal files, scrub secrets
Renames:
- platform/ → workspace-server/ (Go module path stays as "platform" for
  external dep compat — will update after plugin module republish)
- workspace-template/ → workspace/

Removed (moved to separate repos or deleted):
- PLAN.md — internal roadmap (move to private project board)
- HANDOFF.md, AGENTS.md — one-time internal session docs
- .claude/ — gitignored entirely (local agent config)
- infra/cloudflare-worker/ → Molecule-AI/molecule-tenant-proxy
- org-templates/molecule-dev/ → standalone template repo
- .mcp-eval/ → molecule-mcp-server repo
- test-results/ — ephemeral, gitignored

Security scrubbing:
- Cloudflare account/zone/KV IDs → placeholders
- Real EC2 IPs → <EC2_IP> in all docs
- CF token prefix, Neon project ID, Fly app names → redacted
- Langfuse dev credentials → parameterized
- Personal runner username/machine name → generic

Community files:
- CONTRIBUTING.md — build, test, branch conventions
- CODE_OF_CONDUCT.md — Contributor Covenant 2.1

All Dockerfiles, CI workflows, docker-compose, railway.toml, render.yaml,
README, CLAUDE.md updated for new directory names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 00:24:44 -07:00