molecule-core

Author	SHA1	Message	Date
Hongming Wang	9375e3d4ee	feat(workspace-server): GHCR digest watcher closes runtime CD chain (#2114) Adds an opt-in goroutine that polls GHCR every 5 minutes for digest changes on each workspace-template-*:latest tag and invokes the same refresh logic /admin/workspace-images/refresh exposes. With this, the chain from "merge runtime PR" to "containers running new code" is fully hands-off: no operator step between auto-tag → publish-runtime → cascade → template image rebuild → host pull + recreate. Opt-in via IMAGE_AUTO_REFRESH=true. SaaS deploys whose pipeline already pulls every release should leave it off (would be redundant work); self-hosters get true zero-touch. Why a refactor of admin_workspace_images.go is in this PR: The HTTP handler held all the refresh logic inline. To share it with the new watcher without HTTP loopback, extracted WorkspaceImageService with a Refresh(ctx, runtimes, recreate) (RefreshResult, error) shape. HTTP handler is now a thin wrapper; behavior is preserved (same JSON response, same 500-on-list-failure, same per-runtime soft-fail). Watcher design notes: - Last-observed digest tracked in memory (not persisted). On boot the first observation per runtime is seed-only, so no spurious refresh fires on every restart. - On Refresh error, the seen digest rolls back so the next tick retries. Without this rollback a transient Docker glitch would convince the watcher the work was done. - Per-runtime fetch errors don't block other runtimes (one template's brief 500 doesn't pause the others). - digestFetcher injection seam in tick() lets unit tests cover all bookkeeping branches without standing up an httptest GHCR server. Verified live: probed GHCR's /token + manifest HEAD against workspace-template-claude-code; got HTTP 200 + a real Docker-Content-Digest. Same calls the watcher makes. Co-authored-by: Hongming Wang <hongmingwangalt@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 13:36:26 -07:00
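The watcher bookkeeping described in 9375e3d4ee (seed-only first observation, rollback on Refresh error, per-runtime fetch errors that do not block other runtimes) might look roughly like the sketch below. It is an illustration only, not the code from the commit: the `watcher` type, the `tick` signature, and the `fetch`/`refresh` function parameters are hypothetical stand-ins for the digestFetcher seam and the Refresh call.

```go
package main

import (
	"errors"
	"fmt"
)

// errDemo is a hypothetical failure used by the demo below.
var errDemo = errors.New("demo failure")

// watcher keeps the last-observed digest per runtime in memory only.
type watcher struct {
	seen map[string]string
}

// tick runs one polling pass. fetch and refresh are injected so the
// bookkeeping can be exercised without a registry or a Docker daemon.
// It returns the runtimes whose refresh succeeded.
func (w *watcher) tick(runtimes []string, fetch func(string) (string, error), refresh func(string) error) []string {
	var refreshed []string
	for _, rt := range runtimes {
		digest, err := fetch(rt)
		if err != nil {
			continue // one runtime's fetch error must not block the others
		}
		prev, known := w.seen[rt]
		w.seen[rt] = digest
		if !known || prev == digest {
			continue // first observation is seed-only; unchanged digest is a no-op
		}
		if err := refresh(rt); err != nil {
			w.seen[rt] = prev // roll back so the next tick retries
			continue
		}
		refreshed = append(refreshed, rt)
	}
	return refreshed
}

func main() {
	w := &watcher{seen: map[string]string{}}
	digest := "sha256:aaa"
	fetch := func(rt string) (string, error) {
		if rt == "missing" {
			return "", errDemo
		}
		return digest, nil
	}
	ok := func(string) error { return nil }
	failing := func(string) error { return errDemo }
	runtimes := []string{"missing", "claude-code"}

	fmt.Println(w.tick(runtimes, fetch, ok)) // seed pass: nothing refreshed
	digest = "sha256:bbb"
	fmt.Println(w.tick(runtimes, fetch, failing)) // refresh fails: digest rolled back
	fmt.Println(w.tick(runtimes, fetch, ok))      // retry succeeds
}
```

Injecting `fetch` and `refresh` as plain functions is what allows every branch to be covered with in-memory fakes, which is the point the commit message makes about the injection seam.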
Hongming Wang	12c4918318	fix(platform): stop leaking workspace containers on delete Symptom: deleting workspaces from the canvas marked DB rows status='removed' but left Docker containers running indefinitely. After a session of org imports + cancellations, we counted 10 running ws-* containers all backed by 'removed' DB rows, eating ~1100% CPU on the Docker VM. Two compounding bugs in handlers/workspace_crud.go's delete cascade: 1. The cleanup loop used `c.Request.Context()` for the Docker stop/remove calls. When the canvas's `api.del` resolved on the platform's 200, gin cancelled the request ctx — and any in-flight Docker call cancelled with `context canceled`, leaving the container alive. Old logs: "Delete descendant <id> volume removal warning: ... context canceled" 2. `provisioner.Stop`'s error return was discarded and `RemoveVolume` ran unconditionally afterward. When Stop didn't actually kill the container (transient daemon error, ctx cancellation as in #1), the volume removal would predictably fail with "volume in use" and the container kept running with the volume mounted. Old logs: "Delete descendant <id> volume removal warning: Error response from daemon: remove ... volume is in use" Fix layered in two parts: - workspace_crud.go: detach cleanup with `context.WithoutCancel(ctx)` + a 30s bounded timeout. Stop's error is now checked and on failure we skip RemoveVolume entirely (the orphan sweeper below catches what we deferred). - New registry/orphan_sweeper.go: periodic reconcile pass (every 60s, initial run on boot). Lists running ws-* containers via Docker name filter, intersects with DB rows where status='removed', stops + removes volumes for the leaks. Defence in depth — even a brand-new Stop failure mode heals on the next sweep instead of leaking forever. 
Provisioner gains a tiny ListWorkspaceContainerIDPrefixes helper that wraps ContainerList with the `name=ws-` filter; the sweeper takes an OrphanReaper interface (matches the ContainerChecker pattern in healthsweep.go) so unit tests don't need a real Docker daemon. main.go wires the sweeper alongside the existing liveness + health-sweep + provisioning-timeout monitors, all under supervised.RunWithRecover so a panic restarts the goroutine. 6 new sweeper tests cover the reconcile path, the no-running-containers short-circuit, the daemon-error skip, the Stop-failure-leaves-volume invariant (the same trap that motivated this fix), the volume-remove-error-is-non-fatal continuation, and the nil-reaper no-op. Verified: full Go test suite passes; manually purged the 10 leaked containers + their orphan volumes from the dev host with `docker rm -f` + `docker volume rm` (one-off cleanup; the sweeper would have caught them on the next cycle once deployed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 12:36:22 -07:00
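The first half of the fix in 12c4918318 (detach cleanup from the request context, bound it with a timeout, and skip volume removal when Stop fails) can be sketched as below. The `reaper` interface, `cleanup` function, and `fakeReaper` are hypothetical names for illustration; only `context.WithoutCancel` (Go 1.21+) and the 30s timeout come from the commit message.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errDemo is a hypothetical daemon failure used by the demo below.
var errDemo = errors.New("daemon error")

// reaper is a hypothetical stand-in for the provisioner's Docker calls.
type reaper interface {
	Stop(ctx context.Context, id string) error
	RemoveVolume(ctx context.Context, id string) error
}

// cleanup ignores cancellation of the request context but still enforces
// a bounded timeout. If Stop fails, the volume is left in place so a later
// sweep can retry instead of hitting "volume is in use".
func cleanup(reqCtx context.Context, r reaper, id string) error {
	ctx, cancel := context.WithTimeout(context.WithoutCancel(reqCtx), 30*time.Second)
	defer cancel()
	if err := r.Stop(ctx, id); err != nil {
		return fmt.Errorf("stop %s: %w (volume left in place)", id, err)
	}
	return r.RemoveVolume(ctx, id)
}

// fakeReaper records whether the volume was removed.
type fakeReaper struct {
	stopErr       error
	volumeRemoved bool
}

func (f *fakeReaper) Stop(ctx context.Context, id string) error {
	if err := ctx.Err(); err != nil {
		return err // would be "context canceled" without WithoutCancel
	}
	return f.stopErr
}

func (f *fakeReaper) RemoveVolume(ctx context.Context, id string) error {
	f.volumeRemoved = true
	return nil
}

// cancelledRequestCtx simulates a request context that was cancelled
// once the HTTP response had been written.
func cancelledRequestCtx() context.Context {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return ctx
}

func main() {
	ok := &fakeReaper{}
	fmt.Println(cleanup(cancelledRequestCtx(), ok, "ws-1"), ok.volumeRemoved)

	bad := &fakeReaper{stopErr: errDemo}
	fmt.Println(cleanup(cancelledRequestCtx(), bad, "ws-2") != nil, bad.volumeRemoved)
}
```

In the first call the cleanup completes even though the request context is already cancelled; in the second, the Stop failure leaves the volume untouched, which is the invariant the sweeper tests in the commit also check.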
Hongming Wang	f8c900909e	fix(platform): auto-load .env from CWD on startup Local dev runs (`/tmp/molecule-server` after `go build`) used to 401 on /workspaces the moment the DB had any workspace token in it: the binary inherited a bare shell env with no MOLECULE_ENV, so AdminAuth's dev fail-open branch (gated on MOLECULE_ENV=development) didn't fire. The repo's .env already has MOLECULE_ENV=development plus DATABASE_URL, REDIS_URL, ADMIN_TOKEN=, etc. Until now you had to `set -a && source .env` in the launching shell — a paper cut, but worse, it's a paper cut in EVERY automated dev workflow (IDE run configs, integration test harnesses, the smoke-test loop in this branch's manual testing). Fix: cmd/server now walks upward from CWD looking for a .env (capped at 6 levels) and merges KEY=VALUE pairs into os.Environ before any other code reads env. Already-set vars win over file values, so docker run -e / CI exports / `KEY=val ./binary` still dominate — only unset keys get filled in. Why no godotenv dep: the format we use is plain KEY=VALUE with `#` comments, no interpolation, no quoting (verified against the live .env: 49 kv lines, zero references to ${...} or `export`). A 30-line parser is auditable and avoids supply-chain surface. Why it's safe in production: Dockerfile doesn't COPY .env into the image and .env is gitignored, so prod containers have no .env on disk to load — the function's findDotEnv() loop finds nothing and returns silently. If an operator deliberately drops one in, the existing-env-wins rule means container-injected env still dominates. Verified by booting `env -i HOME=$HOME PATH=$PATH /tmp/molecule-server` from the repo root with a stripped env: log shows ".env: /Users/.../molecule-core/.env — loaded 49, 0 already set" and /workspaces returns 200 instead of 401. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 20:33:28 -07:00
Hongming Wang	03e913db75	feat(#1957): wire gh-identity plugin into workspace-server Ships the monorepo side of molecule-core#1957 (agent identity collapse). Companion to molecule-ai-plugin-gh-identity (new repo, merged-and-tagged separately). Changes: - manifest.json: add gh-identity plugin to Tier 1 registry - workspace-server/go.mod: require github.com/Molecule-AI/molecule-ai-plugin-gh-identity - cmd/server/main.go: build a shared provisionhook.Registry, register gh-identity first (always), then github-app-auth (gated on GITHUB_APP_ID) - workspace_provision.go: propagate workspace.Role into env["MOLECULE_AGENT_ROLE"] before calling the mutator chain, so the gh-identity plugin can see which agent is booting - provisionhook/mutator.go: add Registry.Mutators() accessor so individual-plugin registries can be merged onto a shared one at boot Boot log gains a line like: env-mutator chain: [gh-identity github-app-auth] Effect per workspace: - env contains MOLECULE_AGENT_ROLE, MOLECULE_OWNER, MOLECULE_ATTRIBUTION_BADGE, MOLECULE_GH_WRAPPER_B64, MOLECULE_GH_WRAPPER_SHA - Each workspace template's install.sh can decode + install the wrapper at /usr/local/bin/gh, intercepting @me assignment and prepending agent attribution on PR/issue creates. Does not break existing workspaces: absent workspace.role, the plugin is a no-op. Absent install.sh updates in each template, the env vars are simply unused. Follow-up template PRs (hermes, claude-code, langgraph, etc.) each add ~15 lines to install.sh to decode + install the wrapper. Ref: #1957 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 15:01:41 +00:00
Hongming Wang	ff338e0489	fix: harden stuck-provisioning UX — details crash, preflight, sweeper Workspaces stuck in status='provisioning' previously surfaced in three bad ways: 1. Details tab crashed with `Cannot read properties of undefined (reading 'toLocaleString')`. `BudgetSection` + `WorkspaceUsage` assumed full response shapes but a provisioning-stuck workspace returns partial `{}`. Guard each deep field with `?? 0` and cover the partial-response case with regression tests. 2. Missing required env vars failed silently 15+ minutes later as a cosmetic "Provisioning Timeout" banner. The in-container preflight catches them but by then the container has already crashed without calling /registry/register, so the workspace sat in 'provisioning' forever. Mirror the preflight server-side: parse config.yaml's `runtime_config.required_env` before launch, fail fast with a WORKSPACE_PROVISION_FAILED event naming the missing vars. 3. No backend timeout ever flipped a stuck workspace to 'failed'. Add a registry sweeper (10m default, env-overridable) that detects workspaces stuck past the window, flips them to 'failed', and emits WORKSPACE_PROVISION_TIMEOUT. Race-safe: the UPDATE re-checks the status + age predicate so a concurrent register/restart wins. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 14:51:39 -07:00
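The server-side preflight from ff338e0489 (fail fast, naming the missing vars) reduces to a check like the one below once `runtime_config.required_env` has been parsed out of config.yaml. YAML parsing is left out, and the function names, the example variable names, and the choice to treat an empty value as missing are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// missingEnv returns the required keys that are absent or empty in env,
// in the order they appear in required.
func missingEnv(required []string, env map[string]string) []string {
	var missing []string
	for _, key := range required {
		if v, ok := env[key]; !ok || v == "" {
			missing = append(missing, key)
		}
	}
	return missing
}

// preflight returns an error naming every missing variable, so the caller
// can fail the provision before any container is launched.
func preflight(required []string, env map[string]string) error {
	if missing := missingEnv(required, env); len(missing) > 0 {
		return fmt.Errorf("missing required env vars: %s", strings.Join(missing, ", "))
	}
	return nil
}

func main() {
	// Hypothetical variable names; real templates declare their own.
	required := []string{"DEMO_API_KEY", "DEMO_MODEL"}
	env := map[string]string{"DEMO_MODEL": "small"}

	fmt.Println(preflight(required, env))
	env["DEMO_API_KEY"] = "secret"
	fmt.Println(preflight(required, env))
}
```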
Hongming Wang	48ec5b2dc8	feat(ws-server): pull env from CP on startup Paired with molecule-controlplane PR #55 (GET /cp/tenants/config). Lets existing tenants heal themselves when we rotate or add a CP-side env var (e.g. MOLECULE_CP_SHARED_SECRET landing earlier today) without any ssh or re-provision. Flow: main() calls refreshEnvFromCP() before any other os.Getenv read. The helper reads MOLECULE_ORG_ID + ADMIN_TOKEN from the baked-in user-data env, GETs {MOLECULE_CP_URL}/cp/tenants/config with those credentials, and applies the returned string map via os.Setenv so downstream code (CPProvisioner, etc.) sees the fresh values. Best-effort semantics: - self-hosted / no MOLECULE_ORG_ID → no-op (return nil) - CP unreachable / non-200 → log + return error (main keeps booting) - oversized values (>4 KiB each) rejected to avoid env pollution - body read capped at 64 KiB Once this image hits GHCR, the 5-minute tenant auto-updater picks it up, the container restarts, refresh runs, and every tenant has MOLECULE_CP_SHARED_SECRET within ~5 minutes, with no operator toil. Also fixes workspace-server/.gitignore so `server` no longer matches the cmd/server package dir: the pattern was meant to ignore only the compiled binary but was too broad. Anchored to `/server`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 02:41:15 -07:00
Hongming Wang	479a027e4b	chore: open-source restructure — rename dirs, remove internal files, scrub secrets Renames: - platform/ → workspace-server/ (Go module path stays as "platform" for external dep compat — will update after plugin module republish) - workspace-template/ → workspace/ Removed (moved to separate repos or deleted): - PLAN.md — internal roadmap (move to private project board) - HANDOFF.md, AGENTS.md — one-time internal session docs - .claude/ — gitignored entirely (local agent config) - infra/cloudflare-worker/ → Molecule-AI/molecule-tenant-proxy - org-templates/molecule-dev/ → standalone template repo - .mcp-eval/ → molecule-mcp-server repo - test-results/ — ephemeral, gitignored Security scrubbing: - Cloudflare account/zone/KV IDs → placeholders - Real EC2 IPs → <EC2_IP> in all docs - CF token prefix, Neon project ID, Fly app names → redacted - Langfuse dev credentials → parameterized - Personal runner username/machine name → generic Community files: - CONTRIBUTING.md — build, test, branch conventions - CODE_OF_CONDUCT.md — Contributor Covenant 2.1 All Dockerfiles, CI workflows, docker-compose, railway.toml, render.yaml, README, CLAUDE.md updated for new directory names. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 00:24:44 -07:00

7 Commits