molecule-core

Author	SHA1	Message	Date
claude-ceo-assistant	3dcc7230f9	fix(provisioner)+test: EvalSymlinks templatePath; stage-2 e2e for files_dir consumption Two changes that fall out of one root cause discovered while preparing the local platform spin-up for the dev-department extraction 
(internal#77): PROBLEM CopyTemplateToContainer's filepath.Walk is called with templatePath set to the workspace's resolved files_dir. With the cross-repo symlink composition shipped in PR #5 (parent template's dev-lead → ../molecule-dev-department/dev-lead/), the Dev Lead workspace's files_dir is literally 'dev-lead' — i.e. the symlink itself, not a path THROUGH the symlink. filepath.Walk does not descend into a symlink leaf — it Lstats the root, sees a symlink (mode bit set, not a directory), emits exactly one entry, and returns. Result: the workspace's /configs/ tar would ship empty. Other 38 workspaces are fine because their files_dir paths just TRAVERSE the symlink (path resolution handles intermediate symlinks via Lstat traversal); only the leaf-is-symlink case breaks. FIX workspace-server/internal/provisioner/provisioner.go: Call filepath.EvalSymlinks on templatePath before filepath.Walk. Resolves the leaf-symlink case for ALL templates, not just dev-dept. Security: templatePath has already passed resolveInsideRoot's path-string check at the call site; the trust boundary is the operator-side /org-templates/ filesystem layout, not this resolution step. TEST workspace-server/internal/handlers/local_e2e_dev_dept_test.go: New TestLocalE2E_FilesDirConsumption — stage-2 of the local e2e. For every workspace in the resolved OrgTemplate, asserts: 1. resolveInsideRoot(orgBaseDir, ws.FilesDir) succeeds. 2. os.Stat on the result returns a directory. 3. filepath.Walk after EvalSymlinks (mirroring the platform fix) emits at least one file. 4. At least one workspace marker exists (workspace.yaml, system-prompt.md, or initial-prompt.md). Exercises the SECOND half of POST /org/import that TestLocalE2E_DevDepartmentExtraction (PR #103) didn't cover. VERIFIED LOCALLY (2026-05-08, against post-extraction Gitea state): --- PASS: TestLocalE2E_FilesDirConsumption (0.05s) checked 39 workspaces with files_dir All 39 walk paths emit non-empty file sets with valid workspace markers. 
REGRESSION GUARD Without the EvalSymlinks fix, this test fails on Dev Lead with: files_dir 'dev-lead' at '/.../molecule-dev/dev-lead' is empty — CopyTemplateToContainer would produce empty /configs/ Refs: internal#77 — extraction RFC molecule-core#102 (resolver symlink contract test) molecule-core#103 (stage-1 e2e: include resolution) Hongming GO 2026-05-08 ('go' on the 3 pre-spin-up optimizations)	2026-05-08 04:46:33 -07:00
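The leaf-symlink behaviour described in commit 3dcc7230f9 can be reproduced in isolation. The following is a minimal sketch, not the code from provisioner.go: `countFiles`, `countFilesResolved`, and `demo` are hypothetical helpers, and the temporary directory layout is invented for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// countFiles walks root and counts regular files. filepath.Walk Lstats the
// root, so a root that is itself a symlink is visited once and never entered.
func countFiles(root string) (int, error) {
	n := 0
	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if info.Mode().IsRegular() {
			n++
		}
		return nil
	})
	return n, err
}

// countFilesResolved resolves the root through any symlinks before walking.
func countFilesResolved(root string) (int, error) {
	resolved, err := filepath.EvalSymlinks(root)
	if err != nil {
		return 0, err
	}
	return countFiles(resolved)
}

// demo builds real/workspace.yaml plus a symlink dev-lead -> real, then
// counts files through the symlink with and without resolution.
func demo() (plain, resolved int, err error) {
	tmp, err := os.MkdirTemp("", "walkdemo")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(tmp)

	realDir := filepath.Join(tmp, "real")
	if err = os.Mkdir(realDir, 0o755); err != nil {
		return 0, 0, err
	}
	marker := filepath.Join(realDir, "workspace.yaml")
	if err = os.WriteFile(marker, []byte("name: demo\n"), 0o644); err != nil {
		return 0, 0, err
	}
	link := filepath.Join(tmp, "dev-lead")
	if err = os.Symlink(realDir, link); err != nil {
		return 0, 0, err
	}
	if plain, err = countFiles(link); err != nil {
		return 0, 0, err
	}
	resolved, err = countFilesResolved(link)
	return plain, resolved, err
}

func main() {
	plain, resolved, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("walk on symlink root: %d files; after EvalSymlinks: %d files\n", plain, resolved)
}
```

On a filesystem that supports symlinks, the unresolved walk finds no regular files while the resolved walk finds the marker file, which matches the empty /configs/ symptom the commit describes.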
claude-ceo-assistant	d9e380c5bc	feat(workspace-server): local-dev provisioner builds from Gitea source when MOLECULE_IMAGE_REGISTRY is unset (#63, Task #194) OSS 
contributors who clone molecule-core and `go run ./workspace-server/cmd/server` now get a working end-to-end provision without authenticating to GHCR or AWS ECR. Pre-fix: with MOLECULE_IMAGE_REGISTRY unset, the provisioner attempted to pull ghcr.io/molecule-ai/workspace-template-<runtime>:latest, which has been returning 403 since the 2026-05-06 GitHub-org suspension. Post-fix: when MOLECULE_IMAGE_REGISTRY is unset, the provisioner switches to local-build mode — looks up the workspace-template-<runtime> repo's HEAD sha on Gitea via a single API call, shallow-clones into ~/.cache/molecule/, and runs `docker build --platform=linux/amd64`. SHA-pinned cache key skips the clone+build entirely on subsequent provisions. Production tenants are unaffected: every prod tenant sets the var to its private ECR mirror, so the SaaS pull path is byte-for-byte identical. SSOT for mode detection lives in Resolve() (registry_mode.go) returning a discriminated RegistrySource{Mode, Prefix} so call sites that branch on mode get a compile-time push instead of a string-equality footgun. Coverage: * registry_mode.go — new SSOT (Resolve, RegistryMode, IsKnownRuntime) * registry_mode_test.go — 8 tests pinning mode-decision contract * localbuild.go — clone+build pipeline (570 LOC, fully unit-tested) * localbuild_test.go — 22 tests covering happy/sad paths, fail-closed * provisioner.go — Start() inserts ensureLocalImageHook in local mode * docs/adr/ADR-002 — design rationale + alternatives + security review * docs/development/local-development.md — local-build flow + env overrides Security: * Allowlist-only runtime names (knownRuntimes) gate the clone path. * Repo prefix hardcoded to git.moleculesai.app/molecule-ai/molecule-ai-workspace-template-; forks via opt-in MOLECULE_LOCAL_TEMPLATE_REPO_PREFIX. * MOLECULE_GITEA_TOKEN masked in every log line via maskTokenInURL/maskTokenInString. * Fail-closed: Gitea unreachable / runtime not mirrored → clear error, never silently fall back to GHCR/ECR. 
* docker build invocation passes no --build-arg from external input. * HTTP body cap 64KB on Gitea API responses (defence vs malicious upstream). Closes #63 / Task #194. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 15:16:51 -07:00
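The discriminated `RegistrySource{Mode, Prefix}` idea from commit d9e380c5bc might look roughly like the sketch below. This is an illustration under assumptions, not the contents of registry_mode.go: the constant names `ModeLocalBuild` and `ModeRegistry` and the helper `resolveFrom` are hypothetical, and the real `Resolve()` may normalise the variable differently.

```go
package main

import (
	"fmt"
	"os"
)

// RegistryMode discriminates how workspace images are obtained.
type RegistryMode int

const (
	ModeLocalBuild RegistryMode = iota // hypothetical name: build from source
	ModeRegistry                       // hypothetical name: pull from Prefix
)

// RegistrySource pairs the mode with the registry prefix, so call sites
// switch on Mode instead of comparing strings.
type RegistrySource struct {
	Mode   RegistryMode
	Prefix string
}

// resolveFrom holds the decision logic; taking the value as a parameter
// keeps it testable without touching the process environment.
func resolveFrom(registry string) RegistrySource {
	if registry == "" {
		return RegistrySource{Mode: ModeLocalBuild}
	}
	return RegistrySource{Mode: ModeRegistry, Prefix: registry}
}

// Resolve reads MOLECULE_IMAGE_REGISTRY, the variable named in the commit.
func Resolve() RegistrySource {
	return resolveFrom(os.Getenv("MOLECULE_IMAGE_REGISTRY"))
}

func main() {
	switch src := Resolve(); src.Mode {
	case ModeLocalBuild:
		fmt.Println("local-build mode: clone and docker build")
	case ModeRegistry:
		fmt.Println("registry mode: pull from", src.Prefix)
	}
}
```

Keeping the decision in one function means a new mode only has to be added in one place, and every `switch` on `Mode` is easy to audit.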
security-auditor	c1de2287fd	fix(workspace-server): SSOT-route container check + 422 on external runtimes Two coupled fixes for molecule-core#10 (plugin 
install 503 vs status=online split-state): 1. SSOT for "is this workspace's container running" — `findRunningContainer` in plugins.go used to carry its own copy of `cli.ContainerInspect`, which collapsed transient daemon errors into the same `""` return as a genuinely-stopped container. Healthsweep's `Provisioner.IsRunning` handled the same input correctly (defensive). Promote the inspect logic to `provisioner.RunningContainerName`, route both consumers through it. Transient errors get a distinct log line on the plugins side so triage doesn't confuse a flaky daemon with a stopped container. 2. Runtime-aware Install/Uninstall — `runtime='external'` workspaces have no local container; push-install via docker exec is meaningless. They pull plugins via the download endpoint instead (Phase 30.3). Without a guard they fell through to `findRunningContainer` and 503'd with a misleading "container not running." Add an early 422 with a hint pointing at the download endpoint. The two fixes are independent: (1) preserves correctness when the SSOT helper is later modified; (2) eliminates the persistent split-state on the 5 external persona-agent workspaces in this DB (and on tenant deployments hitting the same shape). * `internal/provisioner/provisioner.go` — new `RunningContainerName(ctx, cli, id) (string, error)` with three documented outcomes (running / stopped / transient). `Provisioner.IsRunning` now wraps it; behavior preserved. * `internal/handlers/plugins.go` — `findRunningContainer` shimmed onto `RunningContainerName`; new `isExternalRuntime(id)` predicate. * `internal/handlers/plugins_install.go` — Install + Uninstall reject external runtimes with 422 + hint, before the source-fetch step. * `internal/handlers/plugins_install_external_test.go` — 5 cases: external→422, uninstall-external→422, container-backed-falls-through, no-runtime-lookup-fails-open, lookup-error-fails-open. 
* `internal/handlers/plugins_findrunning_ssot_test.go` — two AST gates pin the SSOT routing so future PRs can't silently re-introduce the parallel impl. Mutation-tested: reverting either consumer to a direct `ContainerInspect` makes the gate fail. Refs: molecule-core#10 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 22:58:20 -07:00
claude-ceo-assistant (Claude Opus 4.7 on Hongming's MacBook)	4b074f631b	feat(provisioner): env-driven RegistryPrefix() for workspace template images (#6)
Add MOLECULE_IMAGE_REGISTRY env var to override the registry prefix used by all workspace-template image references. Defaults to ghcr.io/molecule-ai (unchanged for OSS users); set to an ECR URI in production tenants when mirroring to AWS. Why this matters: GitHub suspended the Molecule-AI org on 2026-05-06 with no warning. Production tenants kept running because they had images cached locally, but any tenant restart (AWS health event, redeploy, OS reboot) would have failed at `docker pull ghcr.io/molecule-ai/...` because GHCR returned 401. This change introduces the seam needed to point new pulls at a registry we control (AWS ECR) by flipping a single env var on Railway. Design (RFC: molecule-ai/internal#6): - New `RegistryPrefix()` function in `provisioner/registry.go` reads MOLECULE_IMAGE_REGISTRY, falls back to "ghcr.io/molecule-ai". - New `RuntimeImage(runtime)` returns the canonical ref using the prefix. - `RuntimeImages` map computed at init via `computeRuntimeImages()` so existing callers that range over it still work. - `DefaultImage` likewise computed via `RuntimeImage(defaultRuntime)`. - `handlers.TemplateImageRef()` switched from hardcoded format string to `provisioner.RegistryPrefix()`. - `runtime_image_pin.go::resolveRuntimeImage()` automatically inherits the prefix change because it reads from `provisioner.RuntimeImages[]` and only re-formats the tag suffix to a digest pin. Alternatives rejected (see RFC): - Multi-registry fallback chain (try ECR, fall back to GHCR): GHCR is locked from outbound for our org, so the fallback never works for us. Adds code complexity for no benefit. - Hardcoded ECR-only switch: couples production code to a specific deployment environment. OSS users self-hosting Molecule would need the upstream GHCR. - Self-hosted Harbor / registry-on-Hetzner: adds a component to operate. Not justified at 3-tenant scale; AWS ECR is mature and IAM-integrated. 
Auth — deliberately NOT changed in this commit: - For GHCR, the existing `ghcrAuthHeader()` reads GHCR_USER/GHCR_TOKEN. - For ECR, EC2 user-data installs `amazon-ecr-credential-helper` and adds a `credHelpers` entry in `~/.docker/config.json` so the daemon resolves ECR credentials via the EC2 instance role on every pull. The Go code needs no auth change. This keeps the diff minimal. Backwards compatibility: - Additive: env unset → identical behavior to today (GHCR). - Existing tests reference literal `ghcr.io/molecule-ai/...` strings; they continue to pass under the default prefix. - `RuntimeImages` map preserved for callers that iterate it. - No interface, schema, API, or migration version bump needed. Security review: - No untrusted input: MOLECULE_IMAGE_REGISTRY is set at deploy time (Railway env, EC2 user-data), not by users. - No expanded data collection or logging changes. - No new permissions: ECR pull permission is a future user-data + IAM role change, separate from this code change. - Worst-case: an attacker who already compromises Railway can swap the registry prefix to a malicious URI — same blast radius as compromising Railway today, no expansion. Tests: - 9 new unit tests in `registry_test.go` covering: default fallback, env override, empty env, all 9 known runtimes, unknown runtime, override-applies-to-all, computeRuntimeImages map population, env reflection, alphabetical ordering pin. - All existing provisioner + handlers tests continue to pass. - Mutation-tested mentally: deleting `if v := os.Getenv(...)` makes TestRegistryPrefix_RespectsEnv fail. Deleting `for _, r := range knownRuntimes` makes TestRuntimeImage_AllKnownRuntimes fail. The test suite would catch a regression of the original failure mode. Rollout plan: this PR is safe to merge with no env change. Production cutover happens by setting MOLECULE_IMAGE_REGISTRY on Railway after the AWS ECR mirror is populated (separate ops change, tracked in issue #6 phases 3b–3f). 
Tracking: - RFC: molecule-ai/internal#6 - Tasks: #97 (ECR setup), #98 (CP fallback) - Tech debt: runbooks/hetzner-rollout-tech-debt-2026-05-06.md item 7 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 14:23:01 -07:00
Hongming Wang	83454e5efd	feat(workspace-server): structured logging at provisioning boundaries Adds internal/provlog with a single Event(name, fields) helper that emits JSON-tagged single-line records to the standard logger. Five boundary sites instrumented for #2867: provision.start — workspace_dispatchers.go (sync + async) provision.skip_existing — org_import.go idempotency hit provision.ec2_started — cp_provisioner.go after RunInstances provision.ec2_stopped — cp_provisioner.go after TerminateInstances ack restart.pre_stop — workspace_restart.go before Stop dispatch These pair with the existing human-prose log.Printf lines (kept). The new records are grep+jq friendly so a future log-aggregation pipeline can reconstruct per-workspace provision timelines without parsing the operator messages — this is the "and debug loggers so it dont happen again" half of the leak-prevention work. Tests: - provlog: emits evt-prefixed JSON, nil-tolerant, marshal-error fallback preserves event boundary, single-line output pinned. - handlers: provlog_emit_test.go pins three call-site contracts: provisionWorkspaceAutoSync emits provision.start with sync=true, stopForRestart emits restart.pre_stop with backend=cp on SaaS, and backend=none when both backends are nil. Field taxonomy is convenience for ops, not contract — payload can grow additively without breaking callers. Behavior gate is the event name + boundary location, per feedback_behavior_based_ast_gates.md. Refs #2867 (PR-D structured logging at provisioning boundaries) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 12:30:11 -07:00
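A helper with the properties listed for `provlog.Event` in commit 83454e5efd (evt-prefixed, single-line JSON, nil-tolerant, with a marshal-error fallback) could be sketched as follows. The exact record layout is not given in the commit message, so the `evt <name> <json>` shape, the `formatEvent` helper, and the fallback payload are assumptions.

```go
package main

import (
	"encoding/json"
	"log"
)

// formatEvent renders one record. json.Marshal never emits raw newlines,
// so the result stays on a single line.
func formatEvent(name string, fields map[string]any) string {
	if fields == nil {
		fields = map[string]any{} // nil-tolerant: render {} rather than null
	}
	payload, err := json.Marshal(fields)
	if err != nil {
		// Keep the event boundary visible even when a field cannot be encoded.
		payload = []byte(`{"marshal_error":true}`)
	}
	return "evt " + name + " " + string(payload)
}

// Event writes the record to the standard logger.
func Event(name string, fields map[string]any) {
	log.Print(formatEvent(name, fields))
}

func main() {
	Event("provision.start", map[string]any{"sync": true})
	Event("restart.pre_stop", nil)
}
```

Splitting formatting from emission is a testing convenience in this sketch; it lets the record shape be asserted without capturing logger output.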
Hongming Wang	1bff419833	feat(provisioner): digest-pin workspace images via runtime_image_pins (#2272 layer 1) Layer 1 of the runtime-rollout plan. Decouples publish from promotion by giving operators a `runtime_image_pins` table the provisioner consults at container-create time. No row = legacy `:latest` behavior; row present = provisioner pulls `<base>@sha256:<digest>`. One bad publish no longer breaks every workspace simultaneously. Mechanics: - Migration 047: `runtime_image_pins` (template_name PK + sha256 digest + audit columns) and `workspaces.runtime_image_digest` (nullable, with partial index) for "show me workspaces still on the old digest" queries. - `resolveRuntimeImage` (handlers/runtime_image_pin.go): looks up the pin, returns `<base>@sha256:<digest>` on hit, "" on miss/error so the provisioner falls through to the legacy tag map. Availability over pinning — any DB error logs and returns "" rather than blocking the provision. `WORKSPACE_IMAGE_LOCAL_OVERRIDE=1` short-circuits the lookup so devs rebuilding template images locally see their fresh build. - `WorkspaceConfig.Image` carries the resolved value into the provisioner. `selectImage` honors it ahead of the runtime→tag map and falls back to DefaultImage on unknown runtime. - The existing `imageTagIsMoving` predicate (#215) already returns false on `@sha256:` form, so digest pins skip the force-pull path naturally. Tests: - Handler-side (sqlmock): no-pin/db-error/with-pin/empty/unknown/local- override paths cover every branch of `resolveRuntimeImage`. - Provisioner-side: `selectImage` table covers explicit-image preference, runtime-map fallback, unknown-runtime → default, empty-config → default. Plus a struct-literal compile-time pin on `Image` so a future refactor can't silently drop the field. Layer 2 (per-ring routing via `workspaces.runtime_image_digest`) and the admin promote/rollback endpoint ride on top of this and ship separately.	2026-05-03 02:30:00 -07:00
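The precedence described for `selectImage` in commit 1bff419833 (explicit image first, then the runtime map, then the default) is easy to show in isolation. This is a sketch under assumptions: the parameter lists are simplified, `pinnedRef` is a hypothetical helper, and the image names and digest are placeholders.

```go
package main

import "fmt"

// pinnedRef builds <base>@sha256:<digest>, or "" when there is no pin so
// the caller falls through to the tag map.
func pinnedRef(base, digest string) string {
	if digest == "" {
		return ""
	}
	return base + "@sha256:" + digest
}

// selectImage prefers an explicit (for example, digest-pinned) image, then
// the runtime's entry in the tag map, then the default image.
func selectImage(explicit, runtime string, runtimeImages map[string]string, defaultImage string) string {
	if explicit != "" {
		return explicit
	}
	if img, ok := runtimeImages[runtime]; ok {
		return img
	}
	return defaultImage
}

func main() {
	images := map[string]string{"hermes": "example.invalid/hermes:latest"}
	def := "example.invalid/default:latest"

	pin := pinnedRef("example.invalid/hermes", "abc123") // placeholder digest
	fmt.Println(selectImage(pin, "hermes", images, def))
	fmt.Println(selectImage("", "hermes", images, def))
	fmt.Println(selectImage("", "unknown", images, def))
}
```

An empty string as the "no pin" signal keeps the lookup failure path and the no-row path identical for the provisioner, which is the availability-over-pinning trade-off the commit describes.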
Hongming Wang	552602e462	fix(provisioner): force re-pull of moving image tags on workspace start Previously Start() only pulled when the image was missing locally (imgErr != nil). Once a tenant's Docker daemon had `:latest` cached, it stuck on that snapshot forever even after publish-runtime pushed a newer image with the same tag — the same image-cache class that sibling task #232 closed on the controlplane redeploy path. Now Start() additionally re-pulls when the tag is "moving" (`:latest`, no tag, `:staging`, `:main`, `:dev`, `:edge`, `:nightly`, `:rolling`). Pinned tags (semver, sha-prefixed, date-stamped, build-id) and digest-pinned references (`@sha256:...`) skip the pull because their contents are by definition immutable. The classifier (imageTagIsMoving) is deliberately conservative on the "moving" side — only the well-known moving tags trip it. Misclassifying a pinned tag as moving wastes bandwidth on every provision; misclassifying moving as pinned silently bricks the fleet on stale snapshots, which is exactly the bug class this fix closes. Edge cases handled: - Registry hostname with port (`localhost:5000/foo`) — the `:5000` is not mistaken for a tag. - Digest pinning (`image@sha256:...`) — never re-pulled even if a moving-looking tag is also present. - Legacy local-build tags (`workspace-template:hermes`) — treated as pinned (no registry to move from). Test coverage: 22 cases across all classifier shapes. No changes to the pull-failure path (still best-effort, ContainerCreate still surfaces the actionable "image not found" error if the pull failed and the cache is also empty). Task: #215. Companion to #232.	2026-05-02 23:56:32 -07:00
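The classification rules listed in commit 552602e462 can be approximated in a few lines. The sketch below reuses the name `imageTagIsMoving` and the moving-tag list from the commit message, but the parsing strategy (last colon after the last slash) is an assumption and the real classifier may handle more shapes.

```go
package main

import (
	"fmt"
	"strings"
)

// movingTags lists the tags the commit message names as moving.
var movingTags = map[string]bool{
	"latest": true, "staging": true, "main": true, "dev": true,
	"edge": true, "nightly": true, "rolling": true,
}

// imageTagIsMoving reports whether a reference should be re-pulled on start.
func imageTagIsMoving(ref string) bool {
	// Digest-pinned references are immutable, whatever tag accompanies them.
	if strings.Contains(ref, "@sha256:") {
		return false
	}
	lastSlash := strings.LastIndex(ref, "/")
	lastColon := strings.LastIndex(ref, ":")
	// A colon before the last slash belongs to a registry port, not a tag.
	// No tag at all means the implicit :latest, which is moving.
	if lastColon <= lastSlash {
		return true
	}
	return movingTags[ref[lastColon+1:]]
}

func main() {
	for _, ref := range []string{
		"ghcr.io/molecule-ai/foo:latest",
		"localhost:5000/foo",
		"localhost:5000/foo:v1.2.3",
		"foo:latest@sha256:abc",
		"workspace-template:hermes",
	} {
		fmt.Printf("%-35s moving=%v\n", ref, imageTagIsMoving(ref))
	}
}
```

Unknown tags default to pinned in this sketch, matching the commit's stated bias of listing only the well-known moving tags.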
Hongming Wang	5167e482d0	fix(cp-provisioner): surface CP non-2xx on Stop to plug EC2 leak http.Client.Do only errors on transport failure — a CP 5xx (AWS hiccup, missing IAM, transient outage) was silently treated as success. Workspace row then flipped to status='removed' and the EC2 stayed alive forever with no DB pointer (the "orphan EC2 on a 0-customer account" scenario flagged in workspace_crud.go #1843). Found while triaging 13 zombie workspace EC2s on demo-prep staging. Adds a status-code check that returns an error tagged with the workspace ID + status + bounded body excerpt, so the existing loud-fail path in workspace_crud.go's Delete handler can populate stop_failures and surface a 500. Body read is io.LimitReader-capped at 512 bytes to keep error logs sane during a CP outage. Tests: 4 new (5xx surfaces, 4xx surfaces, 2xx variants 200/202/204 all succeed, long body is truncated). Test-first verified — the first three fail on the buggy code and all four pass on the fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 22:59:01 -07:00
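The status-code check added in commit 5167e482d0 follows a common pattern: `http.Client.Do` returns a nil error for any completed HTTP exchange, so the caller has to inspect the status itself. The sketch below illustrates that pattern with the 512-byte body cap from the commit message; `checkStopResponse`, `fakeResponse`, and the error wording are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// checkStopResponse turns a non-2xx response into an error that carries the
// workspace ID, the status code, and at most 512 bytes of the body.
func checkStopResponse(workspaceID string, resp *http.Response) error {
	if resp.StatusCode >= 200 && resp.StatusCode < 300 {
		return nil
	}
	excerpt, _ := io.ReadAll(io.LimitReader(resp.Body, 512))
	return fmt.Errorf("stop workspace %s: status %d: %s", workspaceID, resp.StatusCode, excerpt)
}

// fakeResponse builds an in-memory response for the demo.
func fakeResponse(status int, body string) *http.Response {
	return &http.Response{
		StatusCode: status,
		Body:       io.NopCloser(strings.NewReader(body)),
	}
}

func main() {
	fmt.Println(checkStopResponse("ws-1", fakeResponse(202, "")))
	fmt.Println(checkStopResponse("ws-1", fakeResponse(500, "upstream failure")))
}
```

In real code the caller would still close `resp.Body`; the cap only bounds how much of it ends up in the error message.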
Hongming Wang	e081c8335f	refactor(handlers): widen WorkspaceHandler.provisioner to LocalProvisionerAPI interface (#2369) Symmetric with the existing CPProvisionerAPI interface. Closes the asymmetry where the SaaS provisioner field was an interface (mockable in tests) but the Docker provisioner field was a concrete pointer (not). ## Changes - New ``provisioner.LocalProvisionerAPI`` interface — the 7 methods WorkspaceHandler / TeamHandler call on h.provisioner today: Start, Stop, IsRunning, ExecRead, RemoveVolume, VolumeHasFile, WriteAuthTokenToVolume. Compile-time assertion confirms Provisioner satisfies it. Mirror of cp_provisioner.go's CPProvisionerAPI block. - ``WorkspaceHandler.provisioner`` and ``TeamHandler.provisioner`` re-typed from ``provisioner.Provisioner`` to ``provisioner.LocalProvisionerAPI``. Constructor parameter type is unchanged — the assignment widens to the interface, so the 200+ callers of ``NewWorkspaceHandler`` / ``NewTeamHandler`` are unaffected. - Constructors gain a ``if p != nil`` guard before assigning to the interface field. Without this, ``NewWorkspaceHandler(..., nil, ...)`` (the test fixture pattern across 200+ tests) yields a typed-nil interface value where ``h.provisioner != nil`` evaluates true, and the SaaS-vs-Docker fork incorrectly routes nil-fixture tests into the Docker code path. Documented inline with reference to the Go FAQ. - Hardened the 5 Provisioner methods that lacked nil-receiver guards (Start, ExecRead, WriteAuthTokenToVolume, RemoveVolume, VolumeHasFile) — return ErrNoBackend on nil receiver instead of panicking on p.cli dereference. Symmetric with Stop/IsRunning (already hardened in #1813). Defensive cleanup so a future caller that bypasses the constructor's nil-elision still degrades cleanly. - Extended TestZeroValuedBackends_NoPanic with 5 new sub-tests covering the newly-hardened nil-receiver paths. Defense-in-depth: a future refactor that drops one of the nil-checks fails red here before reaching production. 
## Why now - Provisioner orchestration has been touched in #2366 / #2368 — the interface symmetry is the natural follow-up captured in #2369. - Future work (CP fleet redeploy endpoint, multi-backend provisioners) wants this in place. Memory note ``project_provisioner_abstraction.md`` calls out pluggable backends as a north-star. - Memory note ``feedback_long_term_robust_automated.md`` — compile-time gates + ErrNoBackend symmetry > runtime panics. ## Verification - ``go build ./...`` clean. - ``go test ./...`` clean — 1300+ tests pass, including the previously-flaky Create-with-nil-provisioner paths that now exercise the constructor's nil-elision correctly. - ``go test ./internal/provisioner/ -run TestZeroValuedBackends_NoPanic -v`` — all 11 nil-receiver subtests green (was 6, +5 for the newly-hardened methods). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 09:18:16 -07:00
Hongming Wang	9f35788aee	fix(a2a): detect dead EC2 agents on upstream 5xx + reactive auto-restart for SaaS Class-of-bugs fix surfaced by hongmingwang.moleculesai.app's canvas chat to a dead workspace returning a generic Cloudflare 502 page on 2026-04-30. Three independent gaps in the reactive-health path that together leak dead-agent failures to canvas with no auto-recovery. ## Bug 1 — maybeMarkContainerDead is a no-op for SaaS tenants `maybeMarkContainerDead` only consulted `h.provisioner` (local Docker provisioner). SaaS tenants set `h.cpProv` (CP-backed EC2 provisioner) and leave `h.provisioner` nil — so the function early-returned false on every call and dead EC2 agents never triggered the offline-flip / broadcast / restart cascade. Fix: extend `CPProvisionerAPI` interface with `IsRunning(ctx, id) (bool, error)` (already implemented on `*CPProvisioner`; just needs to surface on the interface). `maybeMarkContainerDead` now branches: local-Docker path uses `h.provisioner.IsRunning`; SaaS path uses `h.cpProv.IsRunning` which calls the CP's `/cp/workspaces/:id/status` endpoint to read the EC2 state. ## Bug 2 — RestartByID short-circuits on `h.provisioner == nil` Same shape as Bug 1: the auto-restart cascade triggered by `maybeMarkContainerDead` calls `RestartByID` which short-circuited when the local Docker provisioner was missing. So even if Bug 1 were fixed, the workspace-offline state would never recover. Fix: change the gate to `h.provisioner == nil && h.cpProv == nil` and update `runRestartCycle` to branch on which provisioner is wired for the Stop call. (The HTTP `Restart` handler already does this branching correctly — we're just bringing the auto-restart path to parity.) ## Bug 3 — upstream 502/503/504 propagated as-is, masked by Cloudflare When the agent's tunnel returns 5xx (the "tunnel up but no origin" shape — agent process dead but cloudflared connection still healthy), `dispatchA2A` returns successfully at the HTTP layer with a 5xx body. 
`handleA2ADispatchError`'s reactive-health path doesn't run because that path is only triggered on transport-level errors. The pre-fix code propagated the 502 status to canvas; Cloudflare in front of the platform then masked the 502 with its own opaque "error code: 502" page, hiding any structured response and any Retry-After hint. Fix: in `proxyA2ARequest`, when the upstream returns 502/503/504, run `maybeMarkContainerDead` BEFORE propagating. If IsRunning confirms the agent is dead → return a structured 503 with restarting=true + Retry-After (CF doesn't mask 503s the same way). If running, propagate the original status (don't recycle a healthy agent on a transient hiccup — it might have legitimately returned 502). ## Drive-by — a2aClient transport timeouts a2aClient was `&http.Client{}` with no Transport timeouts. When a workspace's EC2 black-holes TCP connects (instance terminated mid-flight, SG flipped, NACL bug), the OS default is 75s on Linux / 21s on macOS — long enough for Cloudflare's ~100s edge timeout to fire first and surface a generic 502. Added DialContext (10s connect), TLSHandshake (10s), and ResponseHeaderTimeout (60s). Client.Timeout DELIBERATELY unset — that would pre-empt slow-cold-start flows (Claude Code OAuth first-token, multi-minute agent synthesis). Long-tail body streaming is still governed by per-request context deadline. ## Tests - `TestMaybeMarkContainerDead_CPOnly_NotRunning` — IsRunning(false) → marks workspace offline, returns true. - `TestMaybeMarkContainerDead_CPOnly_Running` — IsRunning(true) → no offline-flip, returns false (don't recycle a healthy agent). - `TestProxyA2A_Upstream502_TriggersContainerDeadCheck` — agent server returns 502 + cpProv reports dead → caller gets 503 with restarting= true and Retry-After: 15. - `TestProxyA2A_Upstream502_AliveAgent_PropagatesAsIs` — same upstream 502 but cpProv reports running → propagates 502 (existing behavior; safety check that prevents over-eager recycling). 
- Existing `TestMaybeMarkContainerDead_NilProvisioner` / `TestMaybeMarkContainerDead_ExternalRuntime` still pass. - Full handlers + provisioner test suites pass. ## Impact Pre-fix: dead EC2 agent on a SaaS tenant → CF-masked 502 to canvas, no auto-recovery, manual restart from canvas required. Post-fix: dead EC2 agent on a SaaS tenant → structured 503 with restarting=true + Retry-After to canvas, workspace flipped to offline, auto-restart cycle triggered. Canvas can show a user-actionable "agent is restarting, please wait" message instead of a generic 502. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 00:28:22 -07:00
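The transport timeouts described in the drive-by section of the commit above can be sketched with the standard library alone. This is a minimal illustration, not the repository's code: the constructor name `newA2AClient` and the helper `a2aTimeouts` are hypothetical, and only the three timeout values and the deliberately unset `Client.Timeout` are taken from the commit message.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// newA2AClient is a hypothetical constructor name. Only the timeout values
// and the unset Client.Timeout come from the commit message.
func newA2AClient() *http.Client {
	dialer := &net.Dialer{Timeout: 10 * time.Second} // TCP connect budget
	transport := &http.Transport{
		DialContext:           dialer.DialContext,
		TLSHandshakeTimeout:   10 * time.Second,
		ResponseHeaderTimeout: 60 * time.Second,
	}
	// Client.Timeout stays zero so slow cold starts are not cut off; the
	// per-request context deadline bounds long body streams instead.
	return &http.Client{Transport: transport}
}

// a2aTimeouts (hypothetical helper) reports the configured budgets so that
// callers and tests can inspect them.
func a2aTimeouts(c *http.Client) (tlsHandshake, responseHeader, total time.Duration) {
	if t, ok := c.Transport.(*http.Transport); ok {
		tlsHandshake, responseHeader = t.TLSHandshakeTimeout, t.ResponseHeaderTimeout
	}
	return tlsHandshake, responseHeader, c.Timeout
}

func main() {
	tlsHandshake, responseHeader, total := a2aTimeouts(newA2AClient())
	fmt.Println(tlsHandshake, responseHeader, total)
}
```

Keeping the budgets on the transport rather than on the client bounds connection setup and time-to-first-header without capping the total request duration.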
Hongming Wang	68f18424f5	test(arch): codify 4 module boundaries as architecture tests (#2344 ) Hard gate #4: codified module boundaries as Go tests, so a new contributor (or AI agent) can't silently land an import that crosses a layer. Boundaries enforced (one architecture_test.go per package): - wsauth has no internal/* deps — auth leaf, must be unit-testable in isolation - models has no internal/* deps — pure-types leaf, reverse dep would create cycles since most packages depend on models - db has no internal/* deps — DB layer below business logic, must be testable with sqlmock without spinning up handlers/provisioner - provisioner does not import handlers or router — unidirectional layering: handlers wires provisioner into HTTP routes; the reverse is a cycle Each test parses .go files in its package via go/parser (no x/tools dep needed) and asserts forbidden import paths don't appear. Failure messages name the rule, the offending file, and explain WHY the boundary exists so the diff reviewer learns the rule. Note: the original issue's first two proposed boundaries (provisioner-no-DB, handlers-no-docker) don't match the codebase today — provisioner already imports db (PR #2276 runtime-image lookup) and handlers hold *docker.Client directly (terminal, plugins, bundle, templates). I picked the four boundaries that actually hold; the first two are aspirational and would need a refactor before they could be codified. Hand-tested by injecting a deliberate wsauth -> orgtoken violation: the gate fires red with the rule message before merge. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 22:12:58 -07:00
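The boundary check described in the commit above (parse each file with go/parser, then reject forbidden import paths) might look roughly like the sketch below. The function name `forbiddenImports` and the module path `example.com/platform/internal/` are assumptions made for illustration; the real tests walk the package directory and report through `testing.T`.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)

// forbiddenImports (hypothetical name) parses only the import block of a Go
// source file and returns every import path that starts with a forbidden prefix.
func forbiddenImports(filename, src string, forbidden []string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, filename, src, parser.ImportsOnly)
	if err != nil {
		return nil, err
	}
	var hits []string
	for _, imp := range file.Imports {
		// Import paths are stored as quoted string literals.
		path, err := strconv.Unquote(imp.Path.Value)
		if err != nil {
			return nil, err
		}
		for _, prefix := range forbidden {
			if strings.HasPrefix(path, prefix) {
				hits = append(hits, path)
				break
			}
		}
	}
	return hits, nil
}

func main() {
	// The module path below is a placeholder, not the repository's real path.
	src := "package wsauth\n\nimport (\n\t\"fmt\"\n\t\"example.com/platform/internal/orgtoken\"\n)\n"
	hits, err := forbiddenImports("wsauth.go", src, []string{"example.com/platform/internal/"})
	fmt.Println(hits, err)
}
```

Using `parser.ImportsOnly` stops parsing after the import declarations, so the check stays cheap and needs no type information.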
Hongming Wang	92d99d96fe	fix(provisioner): treat "removal already in progress" as no-op success Cascade-deleting a 7-workspace org returned 500 with "workspace marked removed, but 2 stop call(s) failed — please retry: stop eeb99b5d-...: force-remove ws-eeb99b5d-607: Error response from daemon: removal of container ws-eeb99b5d-607 is already in progress" even though the DB-side post-condition succeeded (removed_count=7) and the containers WERE removed shortly after. The fanout fired Stop() on every workspace concurrently and the orphan sweeper happened to reap two of them at the same instant, so Docker rejected the second ContainerRemove with "removal already in progress" — a race-condition ack, not a real failure. Retrying just races the same in-flight removal. The post-condition we care about (the container WILL be gone) is identical to a successful removal, so Stop() should treat it the same way it already treats "No such container" — a no-op return nil that lets the caller proceed with volume cleanup. Real daemon failures (timeout, EOF, ctx cancel) still surface as errors. Two pieces: - New isRemovalInProgress() predicate using the same string-match approach as isContainerNotFound (docker/docker has no typed errdef for this; the CLI itself relies on the message). - Stop() now treats the predicate as success, with a log line distinct from the not-found path so debugging can tell which race fired. Both substrings ("removal of container" + "already in progress") must match — "already in progress" alone would false-positive on unrelated operations like image pulls. Truth table pinned in 7 new test cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 13:25:32 -07:00
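The two-substring predicate from the commit above could be sketched as follows. The split into a string helper (`isRemovalInProgressMsg`, a hypothetical name) is only there to make the truth table easy to exercise; the commit states that both substrings must match, which is all this sketch encodes.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// isRemovalInProgressMsg (hypothetical helper) requires BOTH substrings, so
// that "already in progress" on its own, for example from an image pull,
// does not count as a successful removal.
func isRemovalInProgressMsg(msg string) bool {
	return strings.Contains(msg, "removal of container") &&
		strings.Contains(msg, "already in progress")
}

// isRemovalInProgress applies the same check to an error value.
func isRemovalInProgress(err error) bool {
	return err != nil && isRemovalInProgressMsg(err.Error())
}

func main() {
	race := errors.New("Error response from daemon: removal of container ws-eeb99b5d-607 is already in progress")
	pull := errors.New("image pull already in progress")
	fmt.Println(isRemovalInProgress(race), isRemovalInProgress(pull), isRemovalInProgress(nil))
}
```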
Hongming Wang	e15d1182cd	test(provisioner): unblock TestProvisionWorkspaceCP_NoInternalErrorsInBroadcast (#1814 ) The skipped test exists to assert that provisionWorkspaceCP never leaks err.Error() in WORKSPACE_PROVISION_FAILED broadcasts (regression guard for #1206). Writing the test body required substituting a failing CPProvisioner — but the handler's `cpProv` field was the concrete CPProvisioner type, so a mock had nowhere to plug in. Refactor: - Add provisioner.CPProvisionerAPI interface with the 3 methods handlers actually call (Start, Stop, GetConsoleOutput) - Compile-time assertion `var _ CPProvisionerAPI = (CPProvisioner)(nil)` catches future method-signature drift at build time - WorkspaceHandler.cpProv narrowed to the interface; SetCPProvisioner accepts the interface (production caller passes *CPProvisioner from NewCPProvisioner unchanged) Test: - stubFailingCPProv whose Start returns a deliberately leaky error (machine_type=t3.large, ami=…, vpc=…, raw HTTP body fragment) - Drive provisionWorkspaceCP via the cpProv.Start failure path - Assert broadcast["error"] == "provisioning failed" (canned) - Assert no leak markers (machine type, AMI, VPC, subnet, HTTP body, raw error head) in any broadcast string value - Stop/GetConsoleOutput on the stub panic — flags a future regression that reaches into them on this path Verification: - Full workspace-server test suite passes (interface refactor is non-breaking; production caller path unchanged) - go build ./... clean - The other skipped test in this file (TestResolveAndStage_…) is a separate plugins.Registry refactor and remains skipped Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 03:28:25 -07:00
Hongming Wang	4915d1d59e	fix(orphan-sweeper): reap labeled containers with no DB row (wiped-DB) The existing sweeper only reaps ws-* containers whose workspace row has status='removed'. That misses the entire wiped-DB case: an operator does `docker compose down -v` (kills the postgres volume), the previous platform's ws-* containers keep running, the new platform boots into an empty workspaces table — first pass finds zero candidates and those containers leak forever. Symptom users hit today: 7 ws-* containers from 11h ago, no rows in DB, no visibility in Canvas, eating CPU + memory. Fix shape: 1. Provisioner stamps every ws-* container + volume with `molecule.platform.managed=true`. Without a label, the sweeper would have to assume any unlabeled ws-* container might belong to a sibling platform stack on a shared Docker daemon. 2. Provisioner exposes ListManagedContainerIDPrefixes — a label-filter counterpart to the existing name-filter. 3. Sweeper splits sweepOnce into two independent passes: - sweepRemovedRows (unchanged behavior; status='removed' only) - sweepLabeledOrphansWithoutRows (new; labeled containers whose workspace_id has no row in the table at all) Each pass has its own short-circuit so an empty result or transient error in one doesn't block the other — load-bearing because the wiped-DB pass exists precisely for cases where the removed-row pass finds nothing. Safe under multi-platform-on-shared-daemon: only containers carrying our label get reaped, sibling stacks' containers are invisible to this pass. (For now the label is a constant string; a future per-instance UUID layer can refine "ours" further if a real shared-daemon scenario emerges.) Migration: existing platforms running pre-PR builds have UNLABELED ws-* containers. After this lands they continue to NOT be reaped by the new path (no label = invisible). They'll only be cleaned via manual intervention or once the operator recreates them — same as today. No regression. 
Tests cover all five branches of the new pass: happy-path reap, no-reap when row exists, mixed reap-some-keep-some, Docker error short-circuits cleanly, non-UUID prefixes get filtered before the SQL query. Pairs with PR #2122 (script-level fix). Together they close the orphan-leak path for both `bash scripts/nuke-and-rebuild.sh` users (handled by the script) AND `docker compose down -v` users (handled by the runtime). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 14:33:41 -07:00
Hongming Wang	d0f198b24f	merge: resolve staging conflicts (a2a_proxy + workspace_crud) Three files conflicted with staging changes that landed while this PR sat open. Resolved each by combining both intents (not picking one side): - a2a_proxy.go: keep the branch's idle-timeout signature (workspaceID parameter + comment) AND apply staging's #1483 SSRF defense-in-depth check at the top of dispatchA2A. Type-assert h.broadcaster (now an EventEmitter interface per staging) back to Broadcaster for applyIdleTimeout's SubscribeSSE call; falls through to no-op when the assertion fails (test-mock case). - a2a_proxy_test.go: keep both new test suites — branch's TestApplyIdleTimeout_ (3 cases for the idle-timeout helper) AND staging's TestDispatchA2A_RejectsUnsafeURL (#1483 regression). Updated the staging test's dispatchA2A call to pass the workspaceID arg introduced by the branch's signature change. - workspace_crud.go: combine both Delete-cleanup intents: * Branch's cleanupCtx detachment (WithoutCancel + 30s) so canvas hang-up doesn't cancel mid-Docker-call (the container-leak fix) * Branch's stopAndRemove helper that skips RemoveVolume when Stop fails (orphan sweeper handles) * Staging's #1843 stopErrs aggregation so Stop failures bubble up as 500 to the client (the EC2 orphan-instance prevention) Both concerns satisfied: cleanup runs to completion past canvas hangup AND failed Stop calls surface to caller. Build clean, all platform tests pass. 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-04-26 10:43:22 -07:00
Hongming Wang	0de67cd379	feat(platform/admin): /admin/workspace-images/refresh + Docker SDK + GHCR auth The production-side end of the runtime CD chain. Operators (or the post- publish CI workflow) hit this after a runtime release to pull the latest workspace-template-* images from GHCR and recreate any running ws-* containers so they adopt the new image. Without this, freshly-published runtime sat in the registry but containers kept the old image until naturally cycled. Implementation notes: - Uses Docker SDK ImagePull rather than shelling out to docker CLI — the alpine platform container has no docker CLI installed. - ghcrAuthHeader() reads GHCR_USER + GHCR_TOKEN env, builds the base64- encoded JSON payload Docker engine expects in PullOptions.RegistryAuth. Both empty → public/cached images only; both set → private GHCR pulls. - Container matching uses ContainerInspect (NOT ContainerList) because ContainerList returns the resolved digest in .Image, not the human tag. Inspect surfaces .Config.Image which is what we need. - Provisioner.DefaultImagePlatform() exported so admin handler picks the same Apple-Silicon-needs-amd64 platform as the provisioner — single source of truth for the multi-arch override. Local-dev companion: scripts/refresh-workspace-images.sh runs on the host and inherits the host's docker keychain auth — alternate path for when GHCR_USER/TOKEN aren't set in the platform env. 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-04-26 10:17:21 -07:00
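The registry-auth payload mentioned in the commit above is, as far as the Docker Engine API documents it, a base64url-encoded JSON object. The sketch below builds it with the standard library only. The struct, the function names, and the decision to return an empty header unless both credentials are set are assumptions for illustration; the commit only specifies the both-empty and both-set cases.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"os"
)

// registryAuth mirrors the JSON field names the Docker engine accepts for
// registry credentials. The struct itself is a stand-in for this sketch.
type registryAuth struct {
	Username      string `json:"username"`
	Password      string `json:"password"`
	ServerAddress string `json:"serveraddress"`
}

// encodeRegistryAuth (hypothetical name) returns "" unless both credentials
// are set. Treating a half-configured pair as anonymous is an assumption.
func encodeRegistryAuth(user, token string) (string, error) {
	if user == "" || token == "" {
		return "", nil
	}
	payload, err := json.Marshal(registryAuth{Username: user, Password: token, ServerAddress: "ghcr.io"})
	if err != nil {
		return "", err
	}
	return base64.URLEncoding.EncodeToString(payload), nil
}

// decodeRegistryAuth reverses the encoding; it exists only to check the round trip.
func decodeRegistryAuth(encoded string) (registryAuth, error) {
	var auth registryAuth
	raw, err := base64.URLEncoding.DecodeString(encoded)
	if err != nil {
		return auth, err
	}
	err = json.Unmarshal(raw, &auth)
	return auth, err
}

func main() {
	header, err := encodeRegistryAuth(os.Getenv("GHCR_USER"), os.Getenv("GHCR_TOKEN"))
	fmt.Println(header != "", err)
}
```

The resulting string is what would be placed in the pull options' `RegistryAuth` field; an empty string leaves the pull anonymous.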
Hongming Wang	48b494def3	fix(provisioner): nil guards on Stop/IsRunning, unblock contract tests (closes #1813) Both backends panicked when called on a zero-valued or nil receiver: Provisioner.{Stop,IsRunning} dereferenced p.cli; CPProvisioner.{Stop, IsRunning} dereferenced p.httpClient. The orphan sweeper and shutdown paths can call these speculatively where the receiver isn't fully wired; the panic crashed the goroutine instead of the caller seeing a clean error. Three changes: 1. Add ErrNoBackend (typed sentinel) and nil-guard the four methods. - Provisioner.{Stop,IsRunning}: guard p == nil || p.cli == nil at the top. - CPProvisioner.Stop: guard p == nil up top, then httpClient nil AFTER resolveInstanceID + empty-instance check (the empty instance_id path doesn't need HTTP and stays a no-op success even on zero-valued receivers; this preserves the historical contract from TestIsRunning_EmptyInstanceIDReturnsFalse). - CPProvisioner.IsRunning: same shape: empty instance_id stays (false, nil); httpClient-nil with non-empty instance_id returns ErrNoBackend. 2. Flip the t.Skip on TestDockerBackend_Contract + TestCPProvisionerBackend_Contract: both contract tests run now that the panics are gone. Skipped scenarios were the regression guard for this fix. 3. Add TestZeroValuedBackends_NoPanic: explicit assertion that zero-valued and nil receivers return cleanly (no panic). Docker backend always returns ErrNoBackend on zero-valued; CPProvisioner may return (false, nil) when the DB-lookup layer absorbs the case (no instance to query → no HTTP needed). Both are acceptable per the issue's contract; the gate is no-panic. Tests: - 6 sub-cases across the new TestZeroValuedBackends_NoPanic - TestDockerBackend_Contract + TestCPProvisionerBackend_Contract now run their 2 scenarios (4 sub-cases each) - All existing provisioner tests still green - go build ./... + go vet ./... + go test ./... clean Closes drift-risk #6 in docs/architecture/backends.md. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 02:17:51 -07:00
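The nil-guard pattern from the commit above can be illustrated with a typed sentinel and a method that is safe on nil and zero-valued receivers. Everything apart from the `ErrNoBackend` name and the `p == nil || p.cli == nil` guard is an assumption: the `containerChecker` interface, the error text, and the `probe` helper are stand-ins, not the repository's types.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// ErrNoBackend is the typed sentinel named in the commit; the message text is assumed.
var ErrNoBackend = errors.New("provisioner: no backend configured")

// containerChecker is a hypothetical stand-in for the Docker client dependency.
type containerChecker interface {
	ContainerRunning(ctx context.Context, name string) (bool, error)
}

// Provisioner holds an optional backend client, as in the commit's description.
type Provisioner struct {
	cli containerChecker
}

// IsRunning guards both the nil receiver and the zero-valued struct before
// touching p.cli, so speculative callers get an error instead of a panic.
func (p *Provisioner) IsRunning(ctx context.Context, workspaceID string) (bool, error) {
	if p == nil || p.cli == nil {
		return false, ErrNoBackend
	}
	return p.cli.ContainerRunning(ctx, workspaceID)
}

// probe (hypothetical) calls IsRunning with a background context and returns only the error.
func probe(p *Provisioner) error {
	_, err := p.IsRunning(context.Background(), "ws-example")
	return err
}

// isNoBackend reports whether err wraps or equals ErrNoBackend.
func isNoBackend(err error) bool {
	return errors.Is(err, ErrNoBackend)
}

func main() {
	var nilProv *Provisioner
	fmt.Println(isNoBackend(probe(nilProv)), isNoBackend(probe(&Provisioner{})))
}
```

Calling a pointer-receiver method on a nil pointer is legal in Go as long as the method does not dereference the receiver before the guard, which is what makes this pattern work.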
Hongming Wang	cb12601414	fix(platform): make Provisioner.Stop return real errors so cleanup gates fire Review caught a critical issue with `12c49183`: the headline "skip RemoveVolume when Stop fails" guarantee was dead code. `Provisioner.Stop` unconditionally `return nil`'d after logging the underlying ContainerRemove error, so the new `if err := h.provisioner.Stop(...); err != nil { skip volume }` guard in workspace_crud.go AND the same guard in the orphan sweeper could never fire. RemoveVolume always ran, predictably failing with "volume in use" when Stop hadn't actually killed the container — which is the exact production bug the commit claimed to fix. Now Stop: - returns nil on successful remove (no change) - returns nil when the container is already gone (uses the existing isContainerNotFound helper — that's the cleanup post-condition, not a failure) - returns the wrapped Docker error otherwise (daemon timeout, ctx cancellation, socket EOF — anything that means the container might still be alive) Audited every Provisioner.Stop caller in the tree (team.go, workspace_restart.go ×4, workspace.go) — all of them already discard the return value, so the widened error surface is purely opt-in for the new cleanup paths and breaks no existing behaviour. Other review-driven fixes in this commit: - workspace_crud.go: detached `broadcaster.RecordAndBroadcast` from the request ctx too. RecordAndBroadcast does INSERT INTO structure_events + Redis Publish; if the canvas hangs up, a request-ctx-bound INSERT can be cancelled mid-write and the WORKSPACE_REMOVED event never lands, leaving other WS clients ignorant of the cascade. - orphan_sweeper.go: added isLikelyWorkspaceID guard before turning Docker container prefixes into SQL LIKE patterns. 
The Docker name filter is a SUBSTRING match (not prefix), so non-workspace containers like `my-ws-tool` slip through; the in-loop HasPrefix in provisioner trims most, but the in-sweeper alphabet check (hex + dashes only) is the second line of defence and also blocks SQL LIKE wildcards (`_`, `%`) from reaching the query. Two new tests pin this — TestSweepOnce_FiltersNonWorkspacePrefixes and TestIsLikelyWorkspaceID with 10 alphabet cases. - provisioner.go: comment added to ListWorkspaceContainerIDPrefixes flagging the substring/HasPrefix relationship as load-bearing. Verified: full Go test suite passes; all 8 sweeper tests pass (2 new for the LIKE-pattern guard); existing dispatch / delete / provisioner tests unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 23:32:48 -07:00
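The alphabet check described in the commit above (hex characters and dashes only) might be sketched like this. Accepting only lowercase hex and rejecting the empty string are assumptions of this sketch; the commit states only "hex + dashes only".

```go
package main

import "fmt"

// isLikelyWorkspaceID accepts only hex digits and dashes, which also keeps
// the SQL LIKE wildcards '_' and '%' away from the query.
// Assumptions: lowercase hex only, and the empty string is rejected.
func isLikelyWorkspaceID(prefix string) bool {
	if prefix == "" {
		return false
	}
	for _, r := range prefix {
		isHex := (r >= '0' && r <= '9') || (r >= 'a' && r <= 'f')
		if !isHex && r != '-' {
			return false
		}
	}
	return true
}

func main() {
	for _, candidate := range []string{"eeb99b5d-607", "my-ws-tool", "abc%", "a_b"} {
		fmt.Printf("%q -> %v\n", candidate, isLikelyWorkspaceID(candidate))
	}
}
```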
Hongming Wang	12c4918318	fix(platform): stop leaking workspace containers on delete Symptom: deleting workspaces from the canvas marked DB rows status='removed' but left Docker containers running indefinitely. After a session of org imports + cancellations, we counted 10 running ws-* containers all backed by 'removed' DB rows, eating ~1100% CPU on the Docker VM. Two compounding bugs in handlers/workspace_crud.go's delete cascade: 1. The cleanup loop used `c.Request.Context()` for the Docker stop/remove calls. When the canvas's `api.del` resolved on the platform's 200, gin cancelled the request ctx — and any in-flight Docker call cancelled with `context canceled`, leaving the container alive. Old logs: "Delete descendant <id> volume removal warning: ... context canceled" 2. `provisioner.Stop`'s error return was discarded and `RemoveVolume` ran unconditionally afterward. When Stop didn't actually kill the container (transient daemon error, ctx cancellation as in #1), the volume removal would predictably fail with "volume in use" and the container kept running with the volume mounted. Old logs: "Delete descendant <id> volume removal warning: Error response from daemon: remove ... volume is in use" Fix layered in two parts: - workspace_crud.go: detach cleanup with `context.WithoutCancel(ctx)` + a 30s bounded timeout. Stop's error is now checked and on failure we skip RemoveVolume entirely (the orphan sweeper below catches what we deferred). - New registry/orphan_sweeper.go: periodic reconcile pass (every 60s, initial run on boot). Lists running ws-* containers via Docker name filter, intersects with DB rows where status='removed', stops + removes volumes for the leaks. Defence in depth — even a brand-new Stop failure mode heals on the next sweep instead of leaking forever. 
Provisioner gains a tiny ListWorkspaceContainerIDPrefixes helper that wraps ContainerList with the `name=ws-` filter; the sweeper takes an OrphanReaper interface (matches the ContainerChecker pattern in healthsweep.go) so unit tests don't need a real Docker daemon. main.go wires the sweeper alongside the existing liveness + health-sweep + provisioning-timeout monitors, all under supervised.RunWithRecover so a panic restarts the goroutine. 6 new sweeper tests cover the reconcile path, the no-running-containers short-circuit, the daemon-error skip, the Stop-failure-leaves-volume invariant (the same trap that motivated this fix), the volume-remove-error-is-non-fatal continuation, and the nil-reaper no-op. Verified: full Go test suite passes; manually purged the 10 leaked containers + their orphan volumes from the dev host with `docker rm -f` + `docker volume rm` (one-off cleanup; the sweeper would have caught them on the next cycle once deployed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 12:36:22 -07:00
molecule-ai[bot]	b1dce3405c	Merge branch 'staging' into test/2026-04-23-regression-suite	2026-04-24 01:55:06 +00:00
Molecule AI Core-BE	b5e2142c46	fix(#1877): close token-rotation race on restart (Option A + Option B combined) Platform side (Option B): - provisioner.go: add WriteAuthTokenToVolume(), which writes .auth_token to the Docker named volume BEFORE ContainerStart using a throwaway alpine container, eliminating the race window where a restarted container could read a stale token before WriteFilesToContainer writes the new one. - workspace_provision.go: call WriteAuthTokenToVolume() in issueAndInjectToken as a best-effort pre-write before the container starts. Runtime side (Option A): - heartbeat.py: on HTTPStatusError 401 from /registry/heartbeat, call refresh_cache() to force re-read of /configs/.auth_token from disk, then retry the heartbeat once. Fall through to normal failure tracking if the retry also fails. - platform_auth.py: add refresh_cache() which discards the in-process _cached_token and calls get_token() to re-read from disk. Together these eliminate the >1 consecutive 401 window described in issue #1877. Pre-write (B) is the primary fix; runtime retry (A) is the self-healing fallback for any residual race. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-23 17:47:18 -07:00
Hongming Wang	9ce8d97448	test: regression guard for #1738 — cp-provisioner uses real instance_id Pins the fix-invariants from PR #1738 (merged 2026-04-23) against regression. Pre-fix, `CPProvisioner.Stop` and `IsRunning` both passed the workspace UUID as the `instance_id` query param: url := fmt.Sprintf("%s/cp/workspaces/%s?instance_id=%s", baseURL, workspaceID, workspaceID) ^ should be the real i-* ID AWS rejected downstream with InvalidInstanceID.Malformed, orphaned the EC2, and the next provision hit InvalidGroup.Duplicate on the leftover SG — full Save & Restart cascade failure. ## Tests added - TestStop_UsesRealInstanceIDNotWorkspaceUUID: stub resolveInstanceID to return an i-* ID, assert the CP request's instance_id query param carries that i-* value (not the workspace UUID). - TestStop_NoInstanceIDSkipsCPCall: empty DB lookup → no CP call at all (idempotent). Guards against re-introducing the "call CP with '' and let AWS reject" footgun. - TestIsRunning_UsesRealInstanceIDNotWorkspaceUUID: mirror for the /cp/workspaces/:id/status path — same bug shape. All 3 pass on current staging (which has the fix). Reverting either Stop or IsRunning to the pre-#1738 shape causes these to fail loud. Extends molecule-core#1902's regression suite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 17:45:13 -07:00
Hongming Wang	539e3483e4	fix(provisioner): force linux/amd64 pull + create on Apple Silicon hosts (#1875 ) On an Apple Silicon dev box, every `POST /workspaces` failed immediately with: no matching manifest for linux/arm64/v8 in the manifest list entries: no match for platform in manifest: not found because the GHCR workspace-template-* images ship only a linux/amd64 manifest today. `ImagePull` and `ContainerCreate` asked for the daemon's native arch and missed. The Canvas surfaced this as docker image "ghcr.io/molecule-ai/workspace-template-autogen:latest" not found after pull attempt — verify GHCR visibility for autogen — confusing because the image IS visible, just not for linux/arm64. ### Fix Add an auto-detect helper `defaultImagePlatform()` in `internal/provisioner/provisioner.go` that returns `"linux/amd64"` on Apple Silicon hosts and `""` (no preference) everywhere else, with an env override `MOLECULE_IMAGE_PLATFORM` for operators who want to pin or disable explicitly. The result is passed to both `ImagePull` (`PullOptions.Platform`) and `ContainerCreate` (4th arg `*ocispec.Platform`) so the pulled amd64 manifest matches the create-time platform spec. Docker Desktop transparently runs it under QEMU emulation on M-series Macs — slow (2–5× native) but functional. SaaS production (linux/amd64 EC2, `MOLECULE_ENV=production`) never hits the `runtime.GOARCH == "arm64"` branch, so the current behaviour on real tenants is byte-for-byte unchanged. Opt-in escape hatch for operators who want it off: export MOLECULE_IMAGE_PLATFORM="" # disable auto-force export MOLECULE_IMAGE_PLATFORM=linux/arm64 # pin alternate `ocispec` is `github.com/opencontainers/image-spec/specs-go/v1` — already in go.sum v1.1.1 as a transitive dependency of `github.com/docker/docker`, not a new import. 
### Tests `internal/provisioner/platform_test.go` exercises every branch: - `TestDefaultImagePlatform_EnvOverride_ExplicitValue` — env wins - `TestDefaultImagePlatform_EnvOverride_EmptyValue` — empty string disables the auto-force (operator escape hatch) - `TestDefaultImagePlatform_AutoDetect` — linux/amd64 on arm64 Mac, "" on every other host - `TestParseOCIPlatform` — 7 table-driven cases covering well-formed platforms, malformed inputs, and nil handling ### End-to-end verification Before this commit, `POST /workspaces` on my Apple Silicon box: workspace status transitioned: provisioning → failed (~1s) log: image pull for ... failed: no matching manifest for linux/arm64/v8 After this commit, fresh DB + fresh platform: workspace status transitioned: provisioning → online (~25s) log: attempting pull (platform=linux/amd64) pulled ghcr.io/molecule-ai/workspace-template-langgraph:latest docker ps: ws-7aa08951-00d Up 27 seconds The existing provisioner race-tested test suite (`go test -race ./internal/provisioner/`) still passes — the platform pointer defaults to nil on linux/amd64 hosts, so the CI-resolved test expectations don't change. Closes #1875 (arm64 image blocker). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 14:55:34 -07:00
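The platform auto-detection described in the commit above can be sketched with the decision logic pulled into a pure function so each branch is testable. The pure helper `imagePlatform` is hypothetical, and checking `GOOS == "darwin"` alongside `GOARCH == "arm64"` is an assumption about how "Apple Silicon host" is detected; the commit names only the `runtime.GOARCH == "arm64"` branch and the `MOLECULE_IMAGE_PLATFORM` override.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// imagePlatform (hypothetical helper) holds the decision logic.
// A set-but-empty override disables the auto-force, hence the envSet flag.
func imagePlatform(envValue string, envSet bool, goos, goarch string) string {
	if envSet {
		return envValue // the operator's choice wins, including ""
	}
	// Assumption: "Apple Silicon host" is detected as darwin/arm64.
	if goos == "darwin" && goarch == "arm64" {
		return "linux/amd64"
	}
	return "" // no preference: let the daemon use its native platform
}

// defaultImagePlatform reads the real environment and host architecture.
func defaultImagePlatform() string {
	value, set := os.LookupEnv("MOLECULE_IMAGE_PLATFORM")
	return imagePlatform(value, set, runtime.GOOS, runtime.GOARCH)
}

func main() {
	fmt.Printf("platform=%q\n", defaultImagePlatform())
}
```

`os.LookupEnv` is used rather than `os.Getenv` because the escape hatch depends on telling an unset variable apart from one set to the empty string.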
Hongming Wang	a56b765b2d	docs: testing strategy + PR hygiene + backend parity matrix + boot-event postmortem (#1824 ) Bundles the documentation and lightweight tooling landed during the 2026-04-23 ops/triage session. Pure additions — no behavior changes. ## Added ### docs/architecture/backends.md Parity matrix for Docker vs EC2 (SaaS) workspace backends. 18 features tabulated with current status; 6 ranked drift risks; enforcement hooks (parity-lint + contract tests). Living document — owners are workspace-server + controlplane teams. ### docs/engineering/testing-strategy.md Tiered test-coverage floors instead of a blanket 100% target. Seven tiers by code class (auth/crypto → generated DTOs). Per-package current-state snapshot + targets. Tracks the 3 biggest coverage gaps (tokens.go 0%, workspace_provision.go 0%, wsauth ~48%) against their tier-1/2 floors. ### docs/engineering/pr-hygiene.md Captures the patterns that keep diffs reviewable. Motivated by the 2026-04-23 backlog audit where 8 of 23 open PRs had 70-380-file bloat from stale branch drift. Covers: small-PR sizing, rebase-not-merge, cherry-pick-onto-fresh-base for recovery, targeting staging first, describing why-not-what. ### docs/engineering/postmortem-2026-04-23-boot-event-401.md Postmortem for the /cp/tenants/boot-event 401 race. Root cause (DB INSERT ordered AFTER readiness check), detection path (E2E + manual log inspection), lessons (write-before-read pattern, integration tests needed, E2E alerting gap, invariants-as-comments). ### tools/check-template-parity.sh CI lint for template repos — diffs the `${VAR:+VAR=${VAR}}` provider- key forwarders between install.sh (bare-host / EC2 path) and start.sh (Docker path). Catches the #5 drift risk from backends.md before it ships. ### workspace-server/internal/provisioner/backend_contract_test.go Shared behavioral contract scaffold for Provisioner + CPProvisioner. 
Compile-time assertions catch method-signature drift today; scenario- level runs are t.Skip'd pending backend nil-hardening (drift risk #6, see backends.md). ## Updated ### README.md Links the new engineering docs + backends parity matrix into the Documentation Map so agents and humans can actually find them. ## Related issues - #1814 — unblock workspace_provision_test.go (broadcaster interface) - #1813 — nil-client panic hardening (drift risk #6) - #1815 — Canvas vitest coverage instrumentation - #1816 — tokens.go 0% → 85% - #1817 — 5 sqlmock column-drift failures - #1818 — Python pytest-cov setup - #1819 — wsauth middleware coverage gap - #1821 — tiered coverage policy (meta) - #1822 — backend parity drift tracker Co-authored-by: Hongming Wang <hongmingwang.rabbit@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>	2026-04-23 19:59:38 +00:00
Hongming Wang	c23ff848aa	fix(cp-provisioner): look up real EC2 instance_id for Stop + IsRunning (#1738 ) Resolves a "Save & Restart cascade" failure on SaaS tenants. Observed 2026-04-22 on hongmingwang workspace a8af9d79 after a Config-tab save: 03:13:20 workspace deprovision: TerminateInstances InvalidInstanceID.Malformed: a8af9d79-... is malformed 03:13:21 workspace provision: CreateSecurityGroup InvalidGroup.Duplicate: workspace-a8af9d79-394 already exists for VPC vpc-09f85513b85d7acee Root cause: CPProvisioner.Stop and IsRunning passed the workspace UUID as the `instance_id` query param to CP. CP forwarded it to EC2 TerminateInstances, which rejected it (EC2 ids are i-…, not UUIDs). The failed terminate left the workspace's SG attached → the immediate re-provision hit InvalidGroup.Duplicate → user saw `provisioning failed`. Fix: both methods now call a new `resolveInstanceID` that reads `workspaces.instance_id` from the tenant DB and passes the real EC2 id downstream. When no row / no instance_id exists, Stop is a no-op and IsRunning returns (false, nil) so restart cascades can freshly re-provision. resolveInstanceID is exposed as a `var` package-level func so tests can swap it for a pairs-map stub without standing up sqlmock — the per-table DB scaffolding was a heavier price than the surface warranted given these tests are about the CP HTTP flow downstream of the lookup, not the lookup SQL itself. Adds regression tests: - TestStop_EmptyInstanceIDIsNoop: no DB row → no CP call - TestIsRunning_UsesDBInstanceID: DB id round-trips to CP - TestIsRunning_EmptyInstanceIDReturnsFalse: no instance → false/nil Updates existing tests to assert the resolved instance_id (i-abc123 variants) instead of the previous buggy workspaceID. After this lands, user's existing workspaces with stale instance_id bindings still need a manual cleanup of the orphaned EC2 + SG (done for a8af9d79 today). Future restarts use the correct id. 
Co-authored-by: Hongming Wang <hongmingwang.rabbit@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 18:25:29 +00:00
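The swappable lookup from the commit above (a package-level `var` func that tests replace with a stub) might be sketched as follows. The signature is simplified for illustration: the real `resolveInstanceID` presumably takes a context and reads the tenant DB, and `stopURL` is a hypothetical helper that stands in for the request construction inside `Stop`.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveInstanceID is a package-level var so tests can swap in a stub.
// This default is a placeholder for the DB lookup and has a simplified signature.
var resolveInstanceID = func(workspaceID string) (string, error) {
	return "", nil // "no instance bound" in this sketch
}

// stopURL (hypothetical) builds the CP request URL with the real EC2 id.
// The boolean is false when there is no instance, which makes Stop a no-op.
func stopURL(baseURL, workspaceID string) (string, bool, error) {
	instanceID, err := resolveInstanceID(workspaceID)
	if err != nil {
		return "", false, err
	}
	if instanceID == "" {
		return "", false, nil // nothing to terminate; skip the CP call entirely
	}
	target := fmt.Sprintf("%s/cp/workspaces/%s?instance_id=%s",
		baseURL, url.PathEscape(workspaceID), url.QueryEscape(instanceID))
	return target, true, nil
}

func main() {
	// Stub the lookup the way a test would.
	resolveInstanceID = func(string) (string, error) { return "i-abc123", nil }
	fmt.Println(stopURL("https://cp.example", "a8af9d79"))
}
```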
Hongming Wang	4c0cb487c1	fix(cp-provisioner): use CP_ADMIN_API_TOKEN bearer for /cp/admin/* routes Symptom (prod tenant hongmingwang, 2026-04-22): cp provisioner: console: unexpected 401 GET /workspaces/:id/console → 502 (View Logs broken) Root cause: the tenant's CPProvisioner.authHeaders sent the provision- gate shared secret as the Authorization bearer for every outbound CP call, including /cp/admin/workspaces/:id/console. But CP gates /cp/admin/* with CP_ADMIN_API_TOKEN — a distinct secret so a compromised tenant's provision credentials can't read other tenants' serial console output. Bearer mismatch → 401. Fix: split authHeaders into two methods — - provisionAuthHeaders(): Authorization: Bearer <MOLECULE_CP_SHARED_SECRET> for /cp/workspaces/* (Start, Stop, IsRunning) - adminAuthHeaders(): Authorization: Bearer <CP_ADMIN_API_TOKEN> for /cp/admin/* (GetConsoleOutput and future admin reads) Both still send X-Molecule-Admin-Token for per-tenant identity. When CP_ADMIN_API_TOKEN is unset (dev / self-hosted single-secret setups), cpAdminAPIKey falls back to sharedSecret so nothing regresses. Rollout requirement: the tenant EC2 needs CP_ADMIN_API_TOKEN in its env — this PR wires up the code, but CP's tenant-provision path must inject the value. Filed as follow-up; until then, operators can set it manually on existing tenants. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 17:13:38 -07:00
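The header split described in the commit above can be sketched as two small methods over a credentials struct, with the admin key falling back to the shared secret when unset. The `cpAuth` type and `newCPAuth` constructor are hypothetical; the method names, header names, and fallback rule follow the commit message.

```go
package main

import (
	"fmt"
	"net/http"
)

// cpAuth is a hypothetical holder for the credentials named in the commit.
type cpAuth struct {
	sharedSecret     string // MOLECULE_CP_SHARED_SECRET
	adminAPIKey      string // CP_ADMIN_API_TOKEN, or the shared secret as fallback
	tenantAdminToken string // sent as X-Molecule-Admin-Token on every call
}

// newCPAuth applies the fallback: an unset admin key reuses the shared secret.
func newCPAuth(sharedSecret, adminAPIKey, tenantAdminToken string) cpAuth {
	if adminAPIKey == "" {
		adminAPIKey = sharedSecret
	}
	return cpAuth{sharedSecret: sharedSecret, adminAPIKey: adminAPIKey, tenantAdminToken: tenantAdminToken}
}

// headers builds the common header set around the given bearer value.
func (a cpAuth) headers(bearer string) http.Header {
	h := http.Header{}
	h.Set("Authorization", "Bearer "+bearer)
	h.Set("X-Molecule-Admin-Token", a.tenantAdminToken)
	return h
}

// provisionAuthHeaders is for /cp/workspaces/* (Start, Stop, IsRunning).
func (a cpAuth) provisionAuthHeaders() http.Header { return a.headers(a.sharedSecret) }

// adminAuthHeaders is for /cp/admin/* (GetConsoleOutput and other admin reads).
func (a cpAuth) adminAuthHeaders() http.Header { return a.headers(a.adminAPIKey) }

func main() {
	auth := newCPAuth("shared", "", "tenant-token")
	fmt.Println(auth.provisionAuthHeaders().Get("Authorization"))
	fmt.Println(auth.adminAuthHeaders().Get("Authorization"))
}
```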
Hongming Wang	9df3159c59	feat(provisioner): pull workspace-template images from GHCR Every standalone workspace-template repo now publishes to ghcr.io/molecule-ai/workspace-template-<runtime>:latest via the reusable publish-template-image workflow in molecule-ci (landed today — one caller per template repo). This PR makes the provisioner actually use those images: - RuntimeImages map + DefaultImage switched from bare local tags (workspace-template:<runtime>) to their GHCR equivalents. - New ensureImageLocal step before ContainerCreate: if the image isn't present locally, attempt `docker pull` and drain the progress stream to completion. Best-effort — if the pull fails (network, auth, rate limit) the subsequent ContainerCreate still surfaces the actionable "No such image" error, now with a GHCR-appropriate hint instead of the defunct `bash workspace/build-all.sh <runtime>` advice. - runtimeTagFromImage now handles both forms: legacy `workspace-template:<runtime>` (local dev via build-all.sh / rebuild-runtime-images.sh) and the current GHCR shape. Keeps error hints sensible in both worlds. - Tests cover the GHCR path for tag extraction and the new error message shape. Legacy local tags still recognised. Local dev path unchanged — scripts/build-images.sh and workspace/rebuild-runtime-images.sh still produce locally-tagged `workspace-template:<runtime>` images, and Docker's image resolver matches them before any pull is attempted. So contributors can keep iterating on a template repo without round-tripping through GHCR. Follow-on impact: - hongmingwang.moleculesai.app (and any other tenant EC2) will auto-pull `ghcr.io/molecule-ai/workspace-template-hermes:latest` on the next hermes workspace provision — picking up the real Nous hermes-agent behind the A2A bridge (template-hermes v2.1.0) without any tenant-side rebuild step. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 12:39:56 -07:00
molecule-ai[bot]	732f65e8e1	fix(go): replace $1 literal with resp.Body.Close() in 7 files (#1247) The sed command in PR #1229 had no capture groups but used $1 in the replacement, committing the literal string "defer func() { _ = $1 }()" instead of "defer func() { _ = resp.Body.Close() }()". The Go code does not compile because $1 is not a valid identifier. Fixed with: sed -i 's/defer func() { _ = \$1 }()/defer func() { _ = resp.Body.Close() }()/g' Affected (all on origin/staging): workspace-server/cmd/server/cp_config.go workspace-server/internal/handlers/a2a_proxy.go workspace-server/internal/handlers/github_token.go workspace-server/internal/handlers/traces.go workspace-server/internal/handlers/transcript.go workspace-server/internal/middleware/session_auth.go workspace-server/internal/provisioner/cp_provisioner.go (3 occurrences) Closes: #1245 Co-authored-by: Molecule AI Core-BE <core-be@agents.moleculesai.app> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-21 03:18:21 +00:00
molecule-ai[bot]	2575960805	fix(errcheck): suppress unchecked resp.Body.Close() across workspace-server (#1229) Issue #1196: golangci-lint errcheck flags bare resp.Body.Close() calls because Body.Close() can return a non-nil error (e.g. when the server sent fewer bytes than Content-Length). All occurrences fixed: defer resp.Body.Close() → defer func() { _ = resp.Body.Close() }() resp.Body.Close() → _ = resp.Body.Close() 12 files affected across all Go packages: channels, handlers, middleware, provisioner, artifacts, and cmd. The body is already fully consumed at each call site, so the error is always safe to discard. 🤖 Generated with [Claude Code](https://claude.ai) Co-authored-by: Molecule AI Core-BE <core-be@agents.moleculesai.app>	2026-04-21 02:45:34 +00:00
Hongming Wang	731a9aef6e	feat(platform): bootstrap-failed + console endpoints for CP watcher Workspaces stuck in provisioning used to sit in "starting" for 10min until the sweeper flipped them. The real signal — a runtime crash at EC2 boot — lands on the serial console within seconds but nothing listened. These endpoints close the loop. 1. POST /admin/workspaces/:id/bootstrap-failed The control plane's bootstrap watcher posts here when it spots "RUNTIME CRASHED" in ec2:GetConsoleOutput. Handler: - UPDATEs workspaces SET status='failed' only when status was 'provisioning' (idempotent — a raced online/failed stays put) - Stores the error + log_tail in last_sample_error so the canvas can render the real stack trace, not a generic "timeout" string - Broadcasts WORKSPACE_PROVISION_FAILED with source='bootstrap_watcher' 2. GET /workspaces/:id/console Proxies to CP's new /cp/admin/workspaces/:id/console endpoint so the tenant platform can surface EC2 serial console output without holding AWS credentials. CPProvisioner.GetConsoleOutput is the client; returns 501 in non-CP deployments (docker-compose dev). Both gated by AdminAuth — CP holds the tenant ADMIN_TOKEN that the middleware accepts on its tier 2b branch. Tests cover: happy-path fail, already-transitioned no-op, empty id, log_tail truncation, and the 501 fallback when no CP is wired. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 17:11:34 -07:00
Hongming Wang	0a06cb4fc9	fix(cp_provisioner): cap IsRunning body read at 64 KiB IsRunning used an unbounded json.NewDecoder(resp.Body).Decode on CP status responses. Start already caps its body read at 64 KiB (cp_provisioner.go:137) to defend against a misconfigured or compromised CP streaming a huge body and exhausting memory. IsRunning is called reactively per-request from a2a_proxy and periodically from healthsweep, so it's a hotter path than Start and arguably deserves the same defense more. Adds TestIsRunning_BoundedBodyRead that serves a body padded past the cap and asserts the decode still succeeds on the JSON prefix. Follow-up to code-review Nit-2 on #1073. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 09:06:20 -07:00
Hongming Wang	cfa901b89a	fix(cp_provisioner): IsRunning returns (true, err) on transient failures My #1071 made IsRunning return (false, err) on all error paths, but that breaks a2a_proxy which depends on Docker provisioner's (true, err) contract. Without this fix, any brief CP outage causes a2a_proxy to mark workspaces offline and trigger restart cascades across every tenant. Contract now matches Docker.IsRunning: transport error → (true, err) — alive, degraded signal non-2xx response → (true, err) — alive, degraded signal JSON decode error → (true, err) — alive, degraded signal 2xx state!=running → (false, nil) 2xx state==running → (true, nil) healthsweep.go is also happy with this — it skips on err regardless. Adds TestIsRunning_ContractCompat_A2AProxy as regression guard that asserts each error path explicitly against the a2a_proxy expectations. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 08:58:18 -07:00
Hongming Wang	e502003c74	fix(workspace-server): IsRunning surfaces non-2xx + JSON errors Pre-existing silent-failure path: IsRunning decoded CP responses regardless of HTTP status, so a CP 500 → empty body → State="" → returned (false, nil). The sweeper couldn't distinguish "workspace stopped" from "CP broken" and would leave a dead row in place. ## Fix - Non-2xx → wrapped error, does NOT echo body (CP 5xx bodies may contain echoed headers; leaking into logs would expose bearer) - JSON decode error → wrapped error - Transport error → now wrapped with "cp provisioner: status:" prefix for easier log grepping ## Tests +7 cases (5-status table + malformed JSON + existing transport). IsRunning coverage 100%; overall cp_provisioner at 98%. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 08:47:55 -07:00
Hongming Wang	6c4d1ae4db	test(workspace-server): cover Stop/IsRunning/Close + auth-header + transport errors Closes review gap: pre-PR coverage on CPProvisioner was 37%. After this commit every exported method is exercised: - NewCPProvisioner 100% - authHeaders 100% - Start 91.7% (remainder: json.Marshal error path, unreachable with fixed-type request struct) - Stop 100% (new — header + path + error) - IsRunning 100% (new — 4-state matrix + auth) - Close 100% (new — contract no-op) New cases assert both auth headers (shared secret + admin_token) land on every outbound request, transport failures surface clear errors on Start/Stop, and IsRunning doesn't misreport on transport failure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 08:37:39 -07:00
Hongming Wang	d3386ad620	fix(workspace-server): send X-Molecule-Admin-Token on CP calls controlplane #118 + #130 made /cp/workspaces/* require a per-tenant admin_token header in addition to the platform-wide shared secret. Without it, every workspace provision / deprovision / status call now 401s. ADMIN_TOKEN is already injected into the tenant container by the controlplane's Secrets Manager bootstrap, so this is purely a header-plumbing change — no new config required on the tenant side. ## Change - CPProvisioner carries adminToken alongside sharedSecret - New authHeaders method sets BOTH auth headers on every outbound request (old authHeader deleted — single call site was misleading once the semantics changed) - Empty values on either header are no-ops so self-hosted / dev deployments without a real CP still work ## Tests Renamed + expanded cp_provisioner_test cases: - TestAuthHeaders_NoopWhenBothEmpty — self-hosted path - TestAuthHeaders_SetsBothWhenBothProvided — prod happy path - TestAuthHeaders_OnlyAdminTokenWhenSecretEmpty — transition window Full workspace-server suite green. ## Rollout Next tenant provision will ship an image with this commit merged. Existing tenants (none in prod right now — hongming was the only one and was purged earlier today) will auto-update via the 5-min image-pull cron. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 08:17:50 -07:00
Hongming Wang	296c52cb25	test(ws-server): cover CPProvisioner — auth, env fallback, error paths Post-merge audit flagged cp_provisioner.go as the only new file from the canary/C1 work without test coverage. Fills the gap: - NewCPProvisioner_RequiresOrgID — self-hosted without MOLECULE_ORG_ID refuses to construct (avoids silent phone-home to prod CP). - NewCPProvisioner_FallsBackToProvisionSharedSecret — the operator ergonomics of using one env-var name on both sides of the wire. - AuthHeader noop + happy path — bearer only set when secret is set. - Start_HappyPath — end-to-end POST to stubbed CP, bearer forwarded, instance_id parsed out of response. - Start_Non201ReturnsStructuredError — when CP returns structured {"error":"…"}, that message surfaces to the caller. - Start_NoStructuredErrorFallsBackToSize — regression gate for the anti-log-leak change from PR #980: raw upstream body must NOT appear in the error, only the byte count. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 03:41:16 -07:00
Hongming Wang	896a34429a	Merge pull request #981 from Molecule-AI/fix/security-tenant-cpprovisioner-bearer fix(security): tenant CPProvisioner sends CP bearer on provision / stop / status	2026-04-19 01:55:20 -07:00
Hongming Wang	a79366a04a	fix(security): tenant CPProvisioner attaches CP bearer on all calls Completes the C1 integration (PR #50 on molecule-controlplane). The CP now requires Authorization: Bearer <PROVISION_SHARED_SECRET> on all three /cp/workspaces/* endpoints; without this change the tenant-side Start/Stop/IsRunning calls would all 401 (or 404 when the CP's routes refused to mount) and every workspace provision from a SaaS tenant would silently fail. Reads MOLECULE_CP_SHARED_SECRET, falling back to PROVISION_SHARED_SECRET so operators can use one env-var name on both sides of the wire. Empty value is a no-op: self-hosted deployments with no CP or a CP that doesn't gate /cp/workspaces/* keep working as before. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 01:53:12 -07:00
Hongming Wang	365f13199e	fix(security): scrub workspace-server token + upstream error logs Two findings from the pre-launch log-scrub audit: 1. handlers/workspace_provision.go:548 logged `token[:8]` — the exact H1 pattern that panicked on short keys. Even with a length guard, leaking 8 chars of an auth token into centralized logs shortens the search space for anyone who gets log-read access. Now logs only `len(token)` as a liveness signal. 2. provisioner/cp_provisioner.go:101 fell back to logging the raw control-plane response body when the structured {"error":"..."} field was absent. If the CP ever echoed request headers (Authorization) or a portion of user-data back in an error path, the bearer token would end up in our tenant-instance logs. Now logs the byte count only; the structured error remains in place for the happy path. Also caps the read at 64 KiB via io.LimitReader to prevent log-flood DoS from a compromised upstream. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 01:33:47 -07:00
Hongming Wang	39074cc4ae	chore: final open-source cleanup — binary, stale paths, private refs - Remove compiled workspace-server/server binary from git - Fix .gitignore, .gitattributes, .githooks/pre-commit for renamed dirs - Fix CI workflow path filters (workspace-template → workspace) - Replace real EC2 IP and personal slug in test_saas_tenant.sh - Scrub molecule-controlplane references in docs - Fix stale workspace-template/ paths in provisioner, handlers, tests - Clean tracked Python cache files Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 00:38:55 -07:00
Hongming Wang	d8026347e5	chore: open-source restructure — rename dirs, remove internal files, scrub secrets Renames: - platform/ → workspace-server/ (Go module path stays as "platform" for external dep compat — will update after plugin module republish) - workspace-template/ → workspace/ Removed (moved to separate repos or deleted): - PLAN.md — internal roadmap (move to private project board) - HANDOFF.md, AGENTS.md — one-time internal session docs - .claude/ — gitignored entirely (local agent config) - infra/cloudflare-worker/ → Molecule-AI/molecule-tenant-proxy - org-templates/molecule-dev/ → standalone template repo - .mcp-eval/ → molecule-mcp-server repo - test-results/ — ephemeral, gitignored Security scrubbing: - Cloudflare account/zone/KV IDs → placeholders - Real EC2 IPs → <EC2_IP> in all docs - CF token prefix, Neon project ID, Fly app names → redacted - Langfuse dev credentials → parameterized - Personal runner username/machine name → generic Community files: - CONTRIBUTING.md — build, test, branch conventions - CODE_OF_CONDUCT.md — Contributor Covenant 2.1 All Dockerfiles, CI workflows, docker-compose, railway.toml, render.yaml, README, CLAUDE.md updated for new directory names. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 00:24:44 -07:00

41 Commits