Commit Graph

753 Commits

952bfb3ca2 fix(workspace): replace asyncio.get_event_loop().run_until_complete with asyncio.run() (#307) (#498)
Co-authored-by: core-be <core-be@agents.moleculesai.app>
Co-committed-by: core-be <core-be@agents.moleculesai.app>
2026-05-11 15:37:34 +00:00
aa49dbc728 fix(handlers): add rows.Err() checks after rows.Next() loops
Add deferred error checks following rows.Next() iteration in:
- ListDelegations (delegation.go): log on error, continue serving results
- org import reconcile orphan query (org.go): log + append to reconcileErrs

Fixes the rows.Err() gap identified in the delegated rows.Err() check PR
(#302, closed; replaced by this PR).  Two additional files already had
the check (activity.go, memories.go) — pattern applied consistently here.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-11 06:15:42 +00:00
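The deferred rows.Err() pattern this commit applies, as a minimal sketch (the interface and fake are illustrative, not the actual ListDelegations code):

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// rowIterator is the subset of *sql.Rows this sketch needs, so the
// pattern can be shown without a live database.
type rowIterator interface {
	Next() bool
	Scan(dest ...any) error
	Err() error
	Close() error
}

// collectIDs shows the gap the commit closes: iteration errors surface
// only via Err() after Next() returns false, so the check must follow
// the loop or a truncated result set looks like success.
func collectIDs(rows rowIterator) ([]string, error) {
	defer rows.Close()
	var ids []string
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	if err := rows.Err(); err != nil {
		// Log-and-continue, as ListDelegations does: partial results
		// are still served, but the truncation is visible.
		log.Printf("row iteration ended early: %v", err)
		return ids, err
	}
	return ids, nil
}

// fakeRows yields two IDs, then reports a mid-iteration failure.
type fakeRows struct{ i int }

func (f *fakeRows) Next() bool { f.i++; return f.i <= 2 }
func (f *fakeRows) Scan(dest ...any) error {
	*(dest[0].(*string)) = fmt.Sprintf("id-%d", f.i)
	return nil
}
func (f *fakeRows) Err() error   { return errors.New("connection reset") }
func (f *fakeRows) Close() error { return nil }

func main() {
	ids, err := collectIDs(&fakeRows{})
	fmt.Println(ids, err) // partial rows plus the iteration error
}
```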
706df19b43 [core-be-agent] fix(security#321): CWE-22 path traversal guards in loadWorkspaceEnv
Two vulnerable call sites confirmed on origin/main:

1. org_helpers.go:loadWorkspaceEnv (line 101): filesDir from untrusted org YAML
   joined directly with orgBaseDir without traversal guard. A malicious filesDir
   like "../../../etc" escapes the org root and reads arbitrary files.

2. org_import.go:createWorkspaceTree (line 494): same pattern directly in the
   env-loading block — not covered by staging-targeted PR #345.

Fix (both locations): call resolveInsideRoot(orgBaseDir, filesDir) before
filepath.Join. On traversal detection, org_helpers.go returns an empty map
(caller contract); org_import.go silently skips the workspace .env override
(matches existing template-resolution pattern in the same function).

Tests: org_helpers_test.go — 3 cases covering traversal rejection,
workspace-override happy path, and empty filesDir edge case.

Closes: molecule-core#362, molecule-core#321

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-11 03:34:55 +00:00
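A sketch of the resolveInsideRoot guard the fix describes — join the untrusted path onto the root, then verify the cleaned result is still under the root (the real helper's signature may differ):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// resolveInsideRoot rejects CWE-22 traversal: filepath.Join also Cleans
// the result, so "../../../etc" collapses to a path outside root, which
// filepath.Rel then exposes as a "../" prefix.
func resolveInsideRoot(root, untrusted string) (string, bool) {
	joined := filepath.Join(root, untrusted)
	rel, err := filepath.Rel(root, joined)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", false // traversal: caller returns empty map / skips override
	}
	return joined, true
}

func main() {
	for _, dir := range []string{"files", "../../../etc"} {
		p, ok := resolveInsideRoot("/org-templates/acme", dir)
		fmt.Println(dir, "->", p, ok)
	}
}
```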
d67c3da13e fix(platform): A2A proxy ResponseHeaderTimeout 60s -> 180s default, env-configurable 2026-05-11 02:09:06 +00:00
65f9df24b8 Merge branch 'main' into fix/external-connection-user-facing-urls 2026-05-10 11:37:44 +00:00
a355b6f0ad fix(workspace-server): emit Gitea/PyPI URLs for external user instructions (RFC #229 P2-5)
The Molecule-AI GitHub org was suspended 2026-05-06; canonical SCM is
now git.moleculesai.app. external_connection.go was still emitting
github.com URLs in operator-facing copy-paste blocks, breaking
external-agent onboarding silently.

Per-site decisions (8 emit sites in 1 file):

- L124 (channel template doc comment): swap source-of-truth comment to
  Gitea host.
- L137 /plugin marketplace add Molecule-AI/...: swap to explicit Gitea
  HTTPS URL form. End-to-end-verified path per internal#37 § 1.A.
- L138 /plugin install molecule@molecule-mcp-claude-channel: marketplace
  name is molecule-channel (per remote .claude-plugin/marketplace.json),
  not the repo name. Fix to molecule@molecule-channel.
- L157 --channels plugin:molecule@molecule-mcp-claude-channel: same
  marketplace-name fix.
- L179 user-facing GitHub URL: swap to Gitea.
- L261 pip install git+https://github.com/Molecule-AI/molecule-sdk-python:
  not on PyPI; swap to git+https://git.moleculesai.app/molecule-ai/...
- L310 hermes-channel doc comment: swap source-of-truth comment.
- L339 pip install git+https://github.com/Molecule-AI/hermes-channel-molecule:
  not on PyPI; swap to Gitea.
- L369 issue-tracker URL: swap to Gitea.

Verification:
- molecule-ai-workspace-runtime, codex-channel-molecule are on PyPI (200);
  no swap needed for those pip lines (they were already package-name form).
- molecule-mcp-claude-channel, molecule-sdk-python, hermes-channel-molecule
  are NOT on PyPI; swapped to git+https://git.moleculesai.app/molecule-ai/
  form. All three repos are public on Gitea (default branch main) and
  serve git-upload-pack unauthenticated (verified curl 200 against
  /info/refs?service=git-upload-pack).
- Third-party github URLs (gin import, openai/codex, NousResearch/
  hermes-agent upstream issue trackers, npm @openai/codex) intentionally
  preserved.

Adds TestExternalTemplates_NoBrokenMoleculeAIGitHubURLs regression guard
to prevent the same broken URLs from re-emerging on future template
edits.

go vet / go build / existing TestExternal* — all clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 04:23:46 -07:00
0846ebc1f6 fix(workspace-server): respect MOLECULE_IMAGE_REGISTRY in imagewatch + admin_workspace_images (RFC #229 P2-4)
Two surfaces in workspace-server hardcoded `ghcr.io` and silently bypassed
the `MOLECULE_IMAGE_REGISTRY` env override that flips every other image
operation to the configured private mirror (e.g. AWS ECR in production):

  1. internal/imagewatch/watch.go — image-auto-refresh polled
     `https://ghcr.io/v2/...` and `https://ghcr.io/token` directly. Post-
     suspension, with the platform pointed at ECR, the watcher silently
     stopped seeing digest changes (every poll either 404'd or hung on a
     registry it has no business talking to).

  2. internal/handlers/admin_workspace_images.go — Docker Engine auth
     payload pinned `serveraddress: "ghcr.io"`, so when the operator sets
     `MOLECULE_IMAGE_REGISTRY=…ecr…/molecule-ai` the engine matched the
     wrong credential entry on every authenticated pull.

Fix: extract `provisioner.RegistryHost()` returning the host portion of
`RegistryPrefix()` (e.g. `ghcr.io` ← `ghcr.io/molecule-ai`, or
`004947743811.dkr.ecr.us-east-2.amazonaws.com` ← the ECR mirror prefix),
and route both surfaces through it. Default behavior is unchanged for
OSS users on GHCR.

Tests
- New `TestRegistryHost_SplitsHostFromOrgPath` and
  `TestRegistryHost_NeverEmpty` pin the helper across GHCR / ECR /
  self-hosted Gitea / bare-host edge cases.
- New `TestGHCRAuthHeader_RespectsRegistryEnv` asserts the Docker auth
  payload's `serveraddress` follows MOLECULE_IMAGE_REGISTRY (and never
  leaks the org-path suffix).
- New `TestRemoteDigest_RegistryHostFollowsEnv` stands up an httptest
  server, points MOLECULE_IMAGE_REGISTRY at it, and confirms both the
  token endpoint and the manifest HEAD land there — i.e. the full image-
  watch loop respects the env override end-to-end.

Both new tests were verified to FAIL on the pre-fix code path before the
helper was wired in, so a future revert can't silently re-introduce the
bug.

Out of scope (followup needed)
ECR uses `aws ecr get-authorization-token` (SigV4 + basic-auth) instead
of GHCR's `/token?service=…&scope=…` flow. This PR makes the URL host-
configurable; the bearer-token negotiation in `fetchPullToken` still
speaks the GHCR flavor. On ECR with `IMAGE_AUTO_REFRESH=true`, the
watcher will now fail loudly at the token fetch (logged per tick) rather
than silently hitting ghcr.io. Operators on ECR should keep
IMAGE_AUTO_REFRESH=false until ECR auth is wired — tracked as a separate
task. Net effect of this PR alone is strictly better than pre-fix:
fail-loud > silent-broken.

Refs: RFC #229 P2-4
tier:low

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 04:21:27 -07:00
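A minimal sketch of the RegistryHost() behavior the commit and its tests describe — split the host off the org-path prefix, never return empty (assumed behavior, not the actual provisioner code):

```go
package main

import (
	"fmt"
	"strings"
)

// registryHost returns the host portion of a registry prefix such as
// "ghcr.io/molecule-ai". Falls back to the OSS default so the Docker
// auth payload's serveraddress is never empty.
func registryHost(prefix string) string {
	host, _, _ := strings.Cut(prefix, "/")
	if host == "" {
		return "ghcr.io"
	}
	return host
}

func main() {
	fmt.Println(registryHost("ghcr.io/molecule-ai"))
	fmt.Println(registryHost("004947743811.dkr.ecr.us-east-2.amazonaws.com/molecule-ai"))
	fmt.Println(registryHost("")) // never empty
}
```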
bc555aeb45 Merge pull request 'fix(provisioner): export MOLECULE_MODEL canonical env + read it first; drop stray brace in delegation_test.go' (#286) from fix/molecule-model-env-go into main 2026-05-10 10:52:22 +00:00
9b930d8e39 fix(provisioner): export MOLECULE_MODEL (canonical model env) + read it first; drop stray brace in delegation_test.go
internal#226 follow-up #1. `molecule_runtime.config` resolves the picked
model as `MOLECULE_MODEL` > `MODEL` > (legacy) `MODEL_PROVIDER` (#280) —
this side of the boundary now matches:

  - applyRuntimeModelEnv reads `MOLECULE_MODEL` ahead of `MODEL` /
    `MODEL_PROVIDER`, and exports BOTH `MOLECULE_MODEL` and `MODEL`
    (the latter kept for back-compat with everything that already reads
    `os.environ["MODEL"]`). So a workspace whose secrets carry
    `MOLECULE_MODEL` (the unambiguous name) is honoured, and the
    `MODEL_PROVIDER` misnomer — which got set to provider slugs
    ("minimax") and even runtime names ("claude-code") — is the lowest-
    priority fallback, exactly as on the runtime side.
  - the resolution-order comment is updated to flag MODEL_PROVIDER as the
    legacy-and-misleadingly-named var.

Also drops a stray trailing `}` in delegation_test.go (committed in
97768272 "test(delegation): add isDeliveryConfirmedSuccess helper") that
made `internal/handlers` fail to parse — one of the things keeping the
package from compiling for tests.

Tests: TestApplyRuntimeModelEnv_SetsUniversalMODELForAllRuntimes extended
to assert MOLECULE_MODEL mirrors MODEL on every case, plus two new cases
(MOLECULE_MODEL env fallback; MOLECULE_MODEL beats MODEL_PROVIDER). Could
not run `go test ./internal/handlers/` locally — the package is still
blocked behind `internal/plugins` `SourceResolver` redeclaration (the
#248 plugin-router/resolver refactor, Core-BE's lane); CI validates once
that lands. The applyRuntimeModelEnv change is mechanical (same shape as
the existing `MODEL` handling) — reviewer please eyeball.

Companion: molecule-core#280 (runtime config.py side), molecule-ai-workspace-template-claude-code#14 (CLI-stream-error surfacing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 03:11:41 -07:00
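The MOLECULE_MODEL > MODEL > MODEL_PROVIDER priority and the export-both behavior, sketched over a plain map instead of real workspace secrets (helper names are illustrative):

```go
package main

import "fmt"

// resolveModel applies the resolution order the commit describes: the
// canonical name wins, MODEL is next, and the legacy MODEL_PROVIDER
// misnomer is the lowest-priority fallback.
func resolveModel(env map[string]string) string {
	for _, key := range []string{"MOLECULE_MODEL", "MODEL", "MODEL_PROVIDER"} {
		if v := env[key]; v != "" {
			return v
		}
	}
	return ""
}

// exportModelEnv mirrors the "exports BOTH" behavior: canonical
// MOLECULE_MODEL plus MODEL for back-compat readers.
func exportModelEnv(env map[string]string) map[string]string {
	out := map[string]string{}
	if model := resolveModel(env); model != "" {
		out["MOLECULE_MODEL"] = model
		out["MODEL"] = model
	}
	return out
}

func main() {
	env := map[string]string{"MODEL_PROVIDER": "minimax", "MOLECULE_MODEL": "claude-opus-4"}
	fmt.Println(exportModelEnv(env)) // canonical name beats the legacy misnomer
}
```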
cc4d7fc2c1 Merge branch 'main' into fix/offsec-001-error-message-scrubbing 2026-05-10 10:01:43 +00:00
Molecule AI Core Platform Lead
14e3956d8a Merge branch 'main' into fix/core-248-pluginresolver-and-plgh 2026-05-10 09:51:14 +00:00
Molecule AI Core Platform Lead
9e3d420363 [core-lead-agent] fix(core#228): cascade fixes for PluginResolver — make main compile
PR #256 introduced PluginResolver to break the SourceResolver redeclaration
deadlock, but missed three downstream call-sites that left main uncompilable:

1. plugins/drift_sweeper.go: PluginResolver.Resolve was declared returning
   PluginResolver (recursive). *Registry.Resolve returns the production
   SourceResolver from source.go, so *Registry didn't satisfy PluginResolver.
   Fix: Resolve returns SourceResolver. Add compile-time assertion that
   *Registry satisfies PluginResolver so any future signature drift fails
   the build instead of router wiring.

2. plugins/drift_sweeper_test.go: stubResolver was still declared with the
   old SourceResolver shape AND asserted against SourceResolver — the
   assertion failed because stubResolver lacks Scheme()/Fetch(). Fix: stub
   is a PluginResolver; assertion targets PluginResolver. Drop the unused
   "database/sql" import that fails go vet.

3. router/router.go:
   - The 70f84823 reorder moved the plgh init block above its dockerCli
     dependency (line 538 used; line 594 declared). Moved the dockerCli
     declaration up so it's available where used; replaced the orphaned
     declaration in the terminal block with a comment.
   - Setup's pluginResolver param was typed plugins.SourceResolver — wrong
     for *plugins.Registry (Registry is not a per-scheme resolver). Retyped
     to plugins.PluginResolver, which *Registry actually satisfies.
   - Removed the broken `plgh.WithSourceResolver(pluginResolver)` call —
     WithSourceResolver expects a per-scheme SourceResolver, not a
     PluginResolver/registry. plgh has its own internal default registry
     (github+local) from NewPluginsHandler, so dropping the call is
     functionally a no-op vs the broken state. Kept the param so the
     drift sweeper (main.go) can share scheme enumeration when needed.

4. go.sum: add the content hash entry for go.moleculesai.app/plugin/
   gh-identity/pluginloader (only the /go.mod hash was present, breaking
   `go build ./cmd/server`).

Verified locally:
  go build ./...           ✓
  go vet ./...             ✓ (only pre-existing org_external append warning)
  go test ./internal/plugins/...  ✓
  go test ./internal/router/...   ✓

6 pre-existing handler test failures (TestExecuteDelegation_*,
TestHandleDiagnose_*) are orthogonal — they did not run before because the
package didn't compile. Out of scope for this fix; tracking separately.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 09:46:35 +00:00
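The compile-time assertion technique the fix adds for *Registry, in a stripped-down sketch (types here are stand-ins for the real interfaces):

```go
package main

import "fmt"

// SourceResolver and PluginResolver stand in for the two interfaces the
// commit disentangles: a per-scheme resolver vs. the registry that
// hands them out.
type SourceResolver interface{ Scheme() string }

type PluginResolver interface {
	Resolve(scheme string) SourceResolver
}

type Registry struct{}

type githubResolver struct{}

func (githubResolver) Scheme() string { return "github" }

func (*Registry) Resolve(scheme string) SourceResolver { return githubResolver{} }

// The assertion itself: zero runtime cost, and any future signature
// drift (e.g. Resolve returning the wrong type) fails the build instead
// of the router wiring.
var _ PluginResolver = (*Registry)(nil)

func main() {
	fmt.Println((&Registry{}).Resolve("github").Scheme())
}
```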
7d1a189f2e fix(mcp): scrub err.Error() from JSON-RPC error messages (OFFSEC-001)
Replace all three err.Error() leaks in mcp.go with constant strings,
consistent with the same fix applied to 22 other files in PRs #1193/#1206/#1219/#168.

- Call handler (line ~329): "parse error: " + err.Error() → "parse error"
- dispatchRPC params unmarshal (line ~417): "invalid params: " + err.Error()
  → "invalid parameters"
- dispatchRPC tool call (line ~422): err.Error() → "tool call failed"
  + log.Printf server-side for forensics

Routes protected by WorkspaceAuth (C1) and MCPRateLimiter (C2) — this is
defence-in-depth per OFFSEC-001 / #259.

Tests added:
- TestMCPHandler_Call_MalformedJSON_ReturnsConstantParseError
- TestMCPHandler_dispatchRPC_InvalidParams_ReturnsConstantMessage
- TestMCPHandler_dispatchRPC_UnknownTool_ReturnsConstantMessage
- TestMCPHandler_dispatchRPC_InvalidParams_ArrayInsteadOfObject

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 09:01:51 +00:00
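The scrub pattern in a minimal sketch — detailed error server-side for forensics, constant string to the client (handler name and shape are illustrative, not the actual mcp.go code):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// handleParams logs the real unmarshal error for forensics and returns
// only a constant message to the JSON-RPC client, so internal types and
// paths never leak into error responses.
func handleParams(raw []byte) (map[string]any, string) {
	var params map[string]any
	if err := json.Unmarshal(raw, &params); err != nil {
		log.Printf("mcp: params unmarshal failed: %v", err) // server-side only
		return nil, "invalid parameters"                    // constant, client-facing
	}
	return params, ""
}

func main() {
	_, msg := handleParams([]byte(`[1,2,3]`)) // array where an object is expected
	fmt.Println(msg)
}
```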
70f8482399 fix(core#248): reorder router.go plugin init before drift handler — plgh ordering fix
plgh was referenced at line 505 before it was created at line 632, causing
"undefined: plgh" on main. Moved the entire Plugins block to before the
drift handler block. No functional change to registered routes — only
declaration order. Combined with d88a320f (SourceResolver→PluginResolver
rename, SSRF guard placement, and test regressions) this makes main fully
compile again.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 08:08:09 +00:00
67840629eb fix(internal#214): refresh go.sum for the go.moleculesai.app/plugin/gh-identity vanity path
go.sum still carried the pre-suspension github.com/Molecule-AI/molecule-ai-plugin-gh-identity
entries while go.mod requires go.moleculesai.app/plugin/gh-identity — so `go build` failed
with 'missing go.sum entry'. With the go.moleculesai.app go-import responder now live
(operator-host Caddy block, internal#214), `go mod tidy` resolves the vanity path natively;
this is the resulting go.sum (no replace directive, no go.mod change beyond the tidy).

Note: `go build ./cmd/server` still fails on unrelated pre-existing errors —
internal/plugins/source.go vs drift_sweeper.go SourceResolver redeclaration (#123) and
internal/router/router.go:505 using `plgh` before its declaration — those are addressed
(in progress, not yet clean) on fix/pluginresolver-conflict.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 23:55:20 -07:00
d88a320f0c fix: resolve SourceResolver naming conflict, SSRF guard placement, and multiple test regressions
- plugins/drift_sweeper.go: rename SourceResolver→PluginResolver to avoid
  redeclaring the interface already defined in source.go (core#228)

- handlers/workspace.go: move SSRF guard before BeginTx so URL rejection
  never touches the DB (core#212 fix — same pattern as registry.go:324)

- handlers/restart_signals.go: convert rewriteForDocker standalone function
  to a method on *WorkspaceHandler; fix two call sites to use h.rewriteForDocker

- handlers/plugins.go: change Sources() return type from plugins.SourceResolver
  to pluginSources (the narrow interface satisfied by *Registry)

- handlers/admin_plugin_drift.go: remove unused "context" import

- handlers/delegation_test.go: remove stray closing brace

- handlers/restart_signals_test.go: rewrite with correct miniredis v2 API
  (mr.Get takes context, mr.Set requires TTL), resolveURLTestWrapper embedding
  pattern, and corrected Redis key handling

- handlers/workspace_test.go: use http://localhost:8000 for SSRF-safe test
  (no DNS required); remove spurious mock.ExpectExec for Redis CacheURL call

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 06:05:11 +00:00
Molecule AI Core Platform Lead
3f738e6ab5 Merge remote-tracking branch 'origin/main' into trig-223 2026-05-10 02:47:46 +00:00
12bb73d000 fix(dockerfile-tenant): chown /org-templates to canvas user so !external resolver can mkdir cache
Root cause:
  Dockerfile.tenant chowns /canvas /platform /memory-plugin /migrations
  to canvas:canvas (line ~119) but not /org-templates. The image is
  built as root, COPY-ed templates inherit root:root 0755. The platform
  binary then runs as the canvas user (uid 1000) because of the USER
  directive on line ~124, so when the !external resolver
  (org_external.go, internal#77 / task #222) tries
  os.MkdirAll("/org-templates/<tmpl>/.external-cache/<repo>") on first
  import, mkdir(2) returns EACCES and the import handler returns 400
  "org template expansion failed" (org.go:592). The user-facing error
  is generic; only the server log carries:

    Org import: refusing import: !include expansion failed:
    !external at line 156: fetch git.moleculesai.app/molecule-ai/molecule-dev-department@v1.0.0:
    mkdir cache root: mkdir /org-templates/molecule-dev/.external-cache: permission denied

Repro:
  Tenant staging-cplead-2 (canary AWS 004947743811, image SHA
  a93c4ce17725...). POST /org/import {"dir":"molecule-dev"} returns 400
  while POST /org/import {"dir":"free-beats-all"} returns 201 — only
  templates with !external trip the bug.

Fix:
  Add /org-templates to the chown -R argv. One-line change. Same
  ownership shape as the other writable platform-state dirs.

Why this is safe for prod:
  * The platform binary already needs read access to /org-templates,
    so canvas:canvas owning it doesn't widen any attack surface.
  * /org-templates is image-resident, not bind-mounted; chown applies
    inside the image layers and prod tenants get the fix on next
    image rebuild + redeploy. Live prod tenants are unaffected until
    the next deploy (no orgs currently using !external in prod —
    molecule-dev consumers are all internal staging).

Verification:
  After hand-applying the chown live (docker exec --user 0 ... chown -R
  canvas:canvas /org-templates/molecule-dev), POST /org/import
  {"dir":"molecule-dev"} returns 201 with 39 workspaces; cp-lead +
  CP-BE + CP-QA + CP-Security all reach status=online within ~2 min.

Refs:
  internal#77 — !external RFC (Phase 3a)
  task #222 — resolver PR (introduced the unflagged-permission
              dependency this fixes)
  Live incident 2026-05-10 — staging-cplead-2 import failed,
              chown-on-host workaround in place pending image rebuild
2026-05-09 19:40:52 -07:00
4474ddc189 fix(workspace): add SSRF validation before writing external workspace URL
Issue #212: POST /workspaces with runtime=external and a URL wrote the
URL directly to the DB without validateAgentURL checking (the same check
that registry.go:324 applies to the heartbeat path). An attacker with
AdminAuth could register a workspace URL at a cloud metadata endpoint
(169.254.169.254) and exfiltrate IAM credentials when the platform
fires pre-restart drain signals.

Changes:
- workspace.go: add validateAgentURL(payload.URL) guard before the
  UPDATE at line 386. 400 on unsafe URL, no DB write occurs.
- workspace_test.go: add 3 regression tests:
  - TestWorkspaceCreate_ExternalURL_SSRFSafe: safe public URL → 201
  - TestWorkspaceCreate_ExternalURL_SSRFMetadataBlocked: 169.254.169.254 → 400
  - TestWorkspaceCreate_ExternalURL_SSRFLoopbackBlocked: 127.0.0.1 → 400
  Both unsafe tests assert zero DB calls (the handler rejects before
  any transaction).

Ref: issue #212.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 02:30:18 +00:00
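A sketch of the guard wired in before the DB write. This covers only the IP-literal cases the regression tests pin (metadata and loopback); the real validateAgentURL likely also handles hostname resolution:

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/url"
)

// validateAgentURL rejects loopback and link-local literals so an
// AdminAuth attacker can't point a workspace URL at 169.254.169.254
// and harvest IAM credentials from drain-signal callbacks.
func validateAgentURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil || (u.Scheme != "http" && u.Scheme != "https") {
		return errors.New("unsupported URL")
	}
	if ip := net.ParseIP(u.Hostname()); ip != nil {
		if ip.IsLoopback() || ip.IsLinkLocalUnicast() {
			return errors.New("unsafe address")
		}
	}
	return nil
}

func main() {
	for _, raw := range []string{
		"https://agent.example.com/a2a",
		"http://169.254.169.254/latest/meta-data/", // cloud metadata endpoint
		"http://127.0.0.1:8080/",
	} {
		fmt.Println(raw, "->", validateAgentURL(raw))
	}
}
```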
b5d9f13ab1 docs(runbook): add admin-auth.md covering test-token route lockdown
Issue #214: documents the MOLECULE_ENV=production requirement for
staging/prod tenants to lock the /admin/workspaces/:id/test-token route.
Also adds a startup INFO log in main.go when the route is enabled, so
operators can confirm the setting in boot logs without having to probe
the endpoint directly.

Ref: issue #214.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 02:20:30 +00:00
d0126662c7 docs: cycle report 2026-05-10
Cycle summary:
- Assigned: core#125 (feat: preserve in-flight A2A messages across restart)
- Implemented: Phase 1 of #125 — pre-restart drain signal
- Opened: PR #207
- Reviewed: PR #140 (static-token fallback, approved)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 01:15:07 +00:00
ada1008012 feat(plugins): plugin drift detector + queue + admin apply endpoint (#123)
## Summary

Adds the version-subscription drift detection and operator-apply workflow for
per-workspace plugin tracking (core#113).

## Components

**Migration** (`20260510000000_plugin_drift_queue`):
- Adds `installed_sha` column to `workspace_plugins` — records the commit SHA
  installed so the drift sweeper can compare against upstream.
- Creates `plugin_update_queue` table with status: pending | applied | dismissed.
- Adds partial unique index to prevent duplicate pending rows per
  (workspace_id, plugin_name).

**GithubResolver** (`github.go`):
- `LastFetchSHA` field + `LastSHA()` getter — populated by `Fetch` after a
  successful shallow clone (captured before `.git` is stripped). Used by the
  install pipeline to seed `installed_sha`.
- `ResolveRef(ctx, spec)` method — resolves a plugin spec to its full commit
  SHA using `git fetch --depth=1 + git rev-parse`. Used by the drift sweeper
  to get the current upstream SHA for a tracked ref (tag:vX.Y.Z, tag:latest,
  sha:…, or bare branch).

**Drift sweeper** (`plugins/drift_sweeper.go`):
- Periodic sweep every 1h: SELECTs rows where `tracked_ref != 'none' AND
  installed_sha IS NOT NULL`, resolves upstream SHA, queues drift if different.
- `ListPendingUpdates()` — reads pending queue rows for the admin endpoint.
- `ApplyDriftUpdate()` — marks entry applied (idempotent).
- ctx.Err() guard on ticker arm to avoid post-shutdown work.

**Install pipeline** (`plugins_install_pipeline.go`, `plugins_tracking.go`,
`plugins_install.go`):
- `stageResult.InstalledSHA` field — carries the SHA from Fetch to the DB.
- `recordWorkspacePluginInstall` now accepts and stores `installed_sha`.
- `deleteWorkspacePluginRow` — removes tracking row on uninstall so a stale
  SHA doesn't prevent the next install from creating a fresh row.
- Both Docker and EIC uninstall paths call `deleteWorkspacePluginRow`.

**Admin endpoints** (`handlers/admin_plugin_drift.go`):
- `GET /admin/plugin-updates-pending` — list all pending drift entries.
- `POST /admin/plugin-updates/:id/apply` — re-installs plugin from source_raw
  (re-fetching the same tracked ref), records the new SHA, marks entry applied,
  triggers workspace restart. Idempotent (already-applied returns 200).

**Router wiring** (`router.go`, `cmd/server/main.go`):
- Plugin registry created in main.go and shared between PluginsHandler and drift
  sweeper.
- `router.Setup` accepts optional `pluginResolver` param.
- `PluginsHandler.Sources()` export for the sweeper wiring pattern.

## Tests

- `plugins/github_test.go` — `ResolveRef` coverage (invalid spec, git error,
  not-found mapping, no-panic for all ref shapes).
- `plugins/drift_sweeper_test.go` — `ResolveRef` happy path, stub resolver
  interface compliance.
- `handlers/admin_plugin_drift_test.go` — ListPending (empty, non-empty, DB
  error), Apply (not found, already applied, already dismissed, workspace_plugins
  missing).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 00:39:50 +00:00
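The sweeper's per-row decision, reduced to a pure function (field and value names mirror the commit text, not the real schema or code):

```go
package main

import "fmt"

// driftAction: rows with tracked_ref != "none" and a recorded
// installed_sha are compared against the resolved upstream SHA; a
// mismatch queues a pending update (the partial unique index dedupes
// repeat sweeps of the same workspace/plugin pair).
func driftAction(trackedRef, installedSHA, upstreamSHA string) string {
	if trackedRef == "none" || installedSHA == "" {
		return "skip" // untracked, or SHA never recorded: nothing to compare
	}
	if installedSHA != upstreamSHA {
		return "queue-pending"
	}
	return "up-to-date"
}

func main() {
	fmt.Println(driftAction("tag:latest", "abc123", "def456"))
	fmt.Println(driftAction("tag:v1.2.0", "abc123", "abc123"))
	fmt.Println(driftAction("none", "abc123", "def456"))
}
```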
1492b40b38 ci(docker): pin base image digests in all Dockerfiles
Pins all FROM image tags to exact SHA256 digests for reproducible
builds. Without digest pinning, a registry push of a new image to the
same tag can silently change the layer content between builds — a
supply-chain risk especially for prod-deployed images.

Pinned images (7 Dockerfiles):
- golang:1.25-alpine → sha256:c4ea15b... (workspace-server/Dockerfile,
  Dockerfile.dev, Dockerfile.tenant, tests/harness/cp-stub/Dockerfile)
- alpine:3.20 → sha256:c64c687c... (workspace-server/Dockerfile,
  tests/harness/cp-stub/Dockerfile)
- node:20-alpine → sha256:afdf982... (workspace-server/Dockerfile.tenant)
- node:22-alpine → sha256:cb15fca... (canvas/Dockerfile)
- python:3.11-slim → sha256:e78299e... (workspace/Dockerfile)
- nginx:1.27-alpine → sha256:62223d6... (tests/harness/cf-proxy/Dockerfile)

Note: docker-compose.yml service images (postgres, redis, clickhouse,
litellm, ollama) are intentionally left on major-version tags — those
are runtime-pulled and updated regularly for local-dev ergonomics.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 23:56:39 +00:00
e29b166f60 fix(test): poll error counter to 0 before asserting in RecordsMetricsOnSuccess
Race-detector CI runs (-race) slow goroutines enough that a
prior sweeper goroutine (e.g. TestStartSweeper_TransientErrorDoesNotCrashLoop)
can still be running and incrementing pendingUploadsSweepErrors after
metricDelta() captures its baseline, but before the success-path sweeper
records its success metrics. The test then reads deltaError=1 instead of 0.

Fix: add waitForMetricDelta(t, deltaError, 0, 2*time.Second) before the
assertion, matching the polling pattern already used in the error-path
test (TestStartSweeper_RecordsMetricsOnError). This ensures the error
counter has settled before we assert on it.

Fixes molecule-core#22.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 23:27:19 +00:00
Molecule AI Core Platform Lead
2a04233d5a Merge remote-tracking branch 'origin/main' into trig-189 2026-05-09 22:58:32 +00:00
Molecule AI Core Platform Lead
bd0a52a9a1 merge main into infra/fix-issue-151: keep PR #183 root-skip wording in local_test.go 2026-05-09 22:51:03 +00:00
ebc56a2ce6 fix(deps): migrate gh-identity from GitHub to Gitea module path
Update go.mod require and main.go import to use the Gitea-hosted
module path go.moleculesai.app/plugin/gh-identity (migrated from
github.com/Molecule-AI/molecule-ai-plugin-gh-identity).

Follows the pattern of the org-template URL migrations (github.com ->
git.moleculesai.app) applied to Go module imports.

Fixes molecule-core#91.
Ref: molecule-internal#71.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 22:50:45 +00:00
eaf7dbb7c4 fix(handlers): auto-restart workspace after file write/delete/replace
PUT /workspaces/:id/files and DELETE /workspaces/:id/files updated the
config volume but never restarted the container, so the running agent
continued serving stale file content from its in-memory cache. The
SecretsHandler already had this pattern (issue #15); TemplatesHandler
was missing it.

Fix: after every successful write/delete in WriteFile, DeleteFile, and
ReplaceFiles, call h.wh.RestartByID(workspaceID) asynchronously, guarded
by h.wh != nil (nil-tolerant for callers that only use read-only
surfaces). The RestartByID coalescing gate prevents thundering-herd on
concurrent requests.

Fixes #151.
Fixes #87 (duplicate effort closed — core-be also filed #183).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 22:43:27 +00:00
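The nil-tolerant async restart shape, sketched with stand-in types (no coalescing gate here; restarter and templatesHandler are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// restarter stands in for the WorkspaceHandler dependency.
type restarter interface{ RestartByID(id string) }

type templatesHandler struct{ wh restarter }

// afterWrite fires the restart asynchronously after a successful file
// mutation, guarded so read-only callers can wire no workspace handler.
func (h *templatesHandler) afterWrite(workspaceID string) {
	if h.wh == nil {
		return // nil-tolerant: read-only surfaces skip restarts
	}
	go h.wh.RestartByID(workspaceID) // async: don't block the file response
}

// recorder captures the restart call so main can observe it.
type recorder struct {
	wg sync.WaitGroup
	id string
}

func (r *recorder) RestartByID(id string) { r.id = id; r.wg.Done() }

func main() {
	rec := &recorder{}
	rec.wg.Add(1)
	h := &templatesHandler{wh: rec}
	h.afterWrite("ws-1")
	rec.wg.Wait()
	fmt.Println(rec.id)

	(&templatesHandler{}).afterWrite("ws-2") // nil handler: no panic
}
```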
Molecule AI Core Platform Lead
ede4551c73 Merge remote-tracking branch 'origin/main' into trig-185 2026-05-09 22:41:01 +00:00
2077cf4054 [core-be-agent] fix(pendinguploads/test): correct sweeper test isolation
Issue #86: TestStartSweeper_RecordsMetricsOnSuccess fails in full-suite.

Root cause: two cooperating bugs in the sweeper test harness.

1. Sweeper loop called sweepOnce after ctx cancellation (double-increment).
   When ctx was cancelled the loop's select received ctx.Done(), called
   sweepOnce with the cancelled ctx, storage.Sweep returned context error,
   and metrics.PendingUploadsSweepError() incremented the error counter a
   SECOND time before the loop exited. Subsequent tests captured a polluted
   error baseline and their deltaError assertions failed.

2. Tests called defer cancel() without waiting for the goroutine to exit.
   The goroutine could still be blocked on Sweep (waiting for the next
   ticker's C channel) when the next test called metricDelta(). If the
   goroutine's Sweep returned during the next test's measurement window,
   the shared metric counters mutated mid-baseline.

Fix (production code):
- Guard the ticker arm: if ctx.Err() != nil, continue instead of calling
  sweepOnce. This prevents the post-cancellation sweep from running.

Fix (test harness):
- startSweeperWithInterval gains a done chan struct{} parameter. When the
  loop exits the channel is closed exactly once.
- StartSweeperForTest starts the goroutine and returns the done channel,
  allowing tests to drain it with <-done after cancel() — guaranteeing
  the goroutine has fully terminated before the next test's baseline.

All 8 sweeper tests now use StartSweeperForTest and drain the done
channel before returning, ensuring stable metric baselines across the
full suite.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 22:30:28 +00:00
e65633bf15 fix(test): skip TestLocalResolver_BubblesUpCopyFailure when uid==0
os.Chmod(dst, 0o555) silently passes when os.Geteuid() == 0 because
root bypasses POSIX permission checks. A previous attempt to use a
symlink to /dev/full also fails: Go's os.MkdirAll resolves the symlink
during path traversal and the kernel allows mkdir("/dev/full") as a
device-table entry — io.Copy to /dev/full then succeeds with 0 bytes
written and returns nil.

The honest, consistent fix mirrors TestLocalResolver_CopyFileSourceUnreadable:
skip when running as root. The write-failure propagation logic is
exercised correctly in non-root CI environments.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 22:22:44 +00:00
e3ea8ff74a [core-be-agent] fix(plugins/test): skip TestLocalResolver_BubblesUpCopyFailure when running as root
Fixes issue #87: the test sets chmod(dst, 0o555) to make the
destination read-only and asserts the copy fails. On Linux, root
bypasses filesystem permissions and can write to 0o555 directories,
so the copy succeeds when running as root and the assertion fails.

Fix: check os.Getuid() == 0 at the start of the test and skip with
a clear message. Mirrors the existing skip in
TestLocalResolver_CopyFileSourceUnreadable (line 175) which already
handles the same root-bypass issue for unreadable source files.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 22:21:35 +00:00
Molecule AI Core Platform Lead
97768272a3 test(delegation): add isDeliveryConfirmedSuccess helper + 10-case table test
[core-lead-agent] Closes the regression-test gap on PR #170 (Core-BE's
fix for #159 retry-storm). Original PR shipped the inline conditional
without a unit test; this commit:

1. Extracts the inline `(proxyErr != nil && len(respBody) > 0 && 2xx)`
   predicate into a named helper `isDeliveryConfirmedSuccess`. Same
   behavior; the call site now reads `if isDeliveryConfirmedSuccess(...)`.

2. Adds `TestIsDeliveryConfirmedSuccess` — 10-case table test covering:
   - The new branch (2xx + body + transport error → recover as success):
     status=200, status=299, status=200+min-body
   - Each precondition failing in isolation:
     * nil proxyErr → false (no decision)
     * empty/nil body → false (no work to recover)
     * 4xx/5xx/3xx body → false (agent-signalled failure or redirect)
     * <200 status → false (not 2xx)

Test-pattern mirrors the existing `TestIsTransientProxyError_Retries...`
and `TestIsQueuedProxyResponse` table tests in the same file — same
file-local mock-error pattern, no new test infra.
2026-05-09 22:12:04 +00:00
21a5c31b85 [core-be-agent] fix: Treat delivery-confirmed proxy errors as delegation success
Two-part fix for issue #159 — successful delegation responses were
rendered as error banners:

PART 1 — a2a_proxy.go: When io.ReadAll fails mid-stream (e.g., TCP
connection drops after the agent sent its 200 OK response), the prior
code returned (0, nil, BadGateway) discarding both the HTTP status code
and any partial body bytes already received. Fix: return
(resp.StatusCode, respBody, error) so callers can inspect what was
delivered even when the body read failed.

PART 2 — delegation.go: New condition in executeDelegation after the
transient-error retry block:

    if proxyErr != nil && len(respBody) > 0 && status >= 200 && status < 300 {
        goto handleSuccess
    }

When proxyA2ARequest returns a delivery-confirmed error (status 2xx +
non-empty partial body), route to success instead of failure. This
prevents the retry-storm pattern where the canvas shows "error" with
a Restart-workspace suggestion even though the delegation actually
completed and the response is available.

Regression tests (delegation_test.go):
- TestExecuteDelegation_DeliveryConfirmedProxyError_TreatsAsSuccess:
  server sends 200 + partial body then closes; second attempt succeeds.
  Verifies the new condition fires for delivery-confirmed 2xx responses.
- TestExecuteDelegation_ProxyErrorNon2xx_RemainsFailed: server sends
  500 + partial body then closes. Verifies non-2xx routes to failure.
- TestExecuteDelegation_ProxyErrorEmptyBody_RemainsFailed: server
  returns 502 Bad Gateway (empty body, transient). Verifies empty-body
  errors still route to failure (condition len(respBody) > 0 guards it).
- TestExecuteDelegation_CleanProxyResponse_Unchanged: clean 200 OK.
  Verifies baseline (proxyErr == nil path) is unaffected.

Fixes issue #159.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 22:11:54 +00:00
c9cf240751 [core-be-agent]
fix(template_import): Remove silent template-dir fallback in ReplaceFiles offline path

When the workspace container is offline and writeViaEphemeral fails
(docker unavailable), ReplaceFiles previously fell back to writing
to the host-side template directory. This silently returned 200 with
"source: template" while the file change was invisible after restart
because the restart handler reads from the Docker volume, not the
template dir (issue #151).

Now returns 503 Service Unavailable with a message telling the caller
to retry after the workspace starts. The ephemeral write path is
the only correct mechanism for offline-container updates.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 21:58:34 +00:00
7079d4ba01 [core-be-agent]
fix: Treat delivery-confirmed proxy errors as delegation success

When proxyA2ARequest returns an error but we have a non-empty
response body with a 2xx status code, the agent completed the work
successfully. The error is a delivery/transport error (e.g., connection
reset after response was received).

Previously, executeDelegation would mark these as "failed" even though
the work was done, causing:
- Retry storms (canvas suggests restart, user retries)
- "error" rendering in canvas even though result is available
- Data loss risk from unnecessary restarts

Now we check for valid response data before marking as failed.

Fixes issue #159.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 21:52:09 +00:00
Molecule AI Core Platform Lead
ea8ac4f023 Merge remote-tracking branch 'origin/main' into tech-debt/rename-net 2026-05-09 21:19:28 +00:00
Molecule AI Core Platform Lead
ad89173f0f Merge remote-tracking branch 'origin/main' into tech-debt/rename-net 2026-05-09 21:18:46 +00:00
Molecule AI Core Platform Lead
7090eab0d5 fix(workspace-server): sanitize err.Error() leaks in CascadeDelete and OrgImport
[core-lead-agent] Closes Core-Security audit finding (2026-05-09 audit cycle, MEDIUM):

1. workspace-server/internal/handlers/workspace_crud.go:335
   `DELETE /workspaces/:id` returned `err.Error()` verbatim in the 500
   body, leaking wrapped lib/pq driver strings (schema column names,
   index hints) to HTTP clients. Replaced with sanitized message;
   raw error already logged server-side via the existing log.Printf
   immediately above.

2. workspace-server/internal/handlers/org.go:610
   `OrgImport` echoed the user-supplied `body.Dir` verbatim in the 404
   "org template not found: %s" response. Path traversal is already
   blocked by resolveInsideRoot earlier in the handler, but echoing
   raw input back lets a client probe filesystem layout (404-with-echo
   vs. 400-from-resolve is itself a signal). Dropped the input from the
   client-facing message; preserved full context in a new log.Printf
   (orgFile path + the requested body.Dir) for operator triage.

Both fixes preserve operator-side diagnostics (logs unchanged in
content, only client-facing JSON sanitized). No behavior change for
legitimate clients — error type, status code, and JSON shape all stay
the same.

Tier: low. Defensive hardening only; reduces info-disclosure surface
without altering control-flow or auth gates.
2026-05-09 21:01:40 +00:00
252f8d0c47 tech-debt: rename molecule-monorepo-net -> molecule-core-net
Renames Docker network across all code, configs, scripts, and docs.

Per issue #93: the network was named molecule-monorepo-net as a holdover
from when the repo was called molecule-monorepo. The canonical repo name is
now molecule-core, so the network should be molecule-core-net.

Files changed:
- docker-compose.yml, docker-compose.infra.yml: network definition
- infra/scripts/setup.sh: docker network create
- scripts/nuke-and-rebuild.sh: docker network rm
- workspace-server/internal/provisioner/provisioner.go: DefaultNetwork
- All comments/docs: updated wording

Acceptance: grep -rn 'molecule-monorepo-net' returns zero matches.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 20:51:48 +00:00
e8f521011f fix(mcp): write delegation activity row so canvas Agent Comms shows task text
MCP delegate_task and delegate_task_async bypassed the delegation activity
lifecycle entirely — no activity_log row was written for MCP-initiated
delegations. As a result the canvas Agent Comms tab rendered outbound
delegations as bare "Delegation dispatched" events with no task body.

Fix: insert a delegation row (mirroring insertDelegationRow from
delegation.go) before the A2A call so the canvas can show the task text.
The sync tool updates status to 'dispatched' after the HTTP call; the
async tool inserts with 'dispatched' directly (goroutine won't update).

Closes #158.
Closes #49 (partial — addresses the canvas-display gap; full lifecycle
parity requires DelegationWriter extraction, tracked separately).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 20:44:06 +00:00
claude-ceo-assistant
b3041c13d3 fix(org-import): emit started event after YAML parse so name is populated
The org.import.started event was firing immediately after request body
bind, before the YAML at body.Dir was loaded. Result: payload.name was
"" whenever the caller passed `dir` (the common path — the canvas and
all live imports use dir, not inline template). Three started rows
already in the local platform's structure_events have empty name.

Fix: move the started emit (and importStart timestamp) to after the
YAML unmarshal / inline-template fallthrough, where tmpl.Name is
guaranteed populated.

Bonus: pre-parse error returns (invalid body, traversal-rejected dir,
file-not-found, YAML expansion fail, YAML unmarshal fail, neither dir
nor template provided) no longer emit an orphan started row — every
started is now guaranteed a paired completed/failed.

Verified live against running platform: re-imported molecule-dev-only,
new started row in structure_events carries
"Molecule AI Dev Team (dev-only)" instead of "".

Tests: full handler suite green (`go test ./internal/handlers/`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 16:25:24 -07:00
claude-ceo-assistant
bfefcb315b refactor(handlers): Delete() delegates to CascadeDelete helper
Drops ~150 lines of duplicated cascade logic from the Delete HTTP
handler — workspace_crud.go's CascadeDelete (added in PR #137) and
Delete() were running the same #73 race-guard sequence (status update →
canvas_layouts → tokens → schedules → container stop → broadcast),
just with Delete() inlined and CascadeDelete owning the OrgImport
reconcile path.

CascadeDelete now returns the descendant id list (was: count) so
Delete() can drive the optional ?purge=true hard-delete against the
same set the cascade just touched.

Net diff: workspace_crud.go shrinks from ~270 lines in Delete() to
~75 lines (parse + 409 confirm gate + CascadeDelete call + stop-error
500 + purge block + 200 response). Behavior identical — same SQL
ordering, same #73 race guard, same response shapes. Three sqlmock
tests for the 0-children case gained one extra ExpectQuery for the
recursive-CTE descendants scan (the old inline code skipped that
query when len(children)==0; CascadeDelete walks unconditionally —
returns 0 rows, same end state, one extra cheap query).

Tests: full handler suite green (`go test ./internal/handlers/`).
Live-tested against the running local platform: DELETE on a fake
workspace returns `{"cascade_deleted":0,"status":"removed"}`,
fleet of 9 workspaces preserved, refactored handler matches the
prior wire-shape exactly.

Tracked as the PR #137 follow-up tech-debt item.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 15:47:51 -07:00
claude-ceo-assistant
3de51faa19 fix(org-import): reconcile mode + audit-event emission
Closes the additive-import zombie bug — re-running /org/import with a
tree shape that reparents same-named roles left the prior workspace
online because lookupExistingChild's dedupe is parent-scoped (different
parent_id → "different" workspace). Caught 2026-05-08 after a dev-tree
re-import left 8 orphans co-existing with the new tree on canvas until
manual cascade-delete.

Three layers in this PR:

- mode="reconcile" on /org/import — after the import loop, online
  workspaces whose name matches an imported name but whose id isn't in
  the result set are cascade-deleted. Default mode "" / "merge"
  preserves existing additive behavior. Empty-set guards prevent
  accidental "delete everything" if either array comes up empty.

- WorkspaceHandler.CascadeDelete extracted as a callable helper from
  the existing Delete HTTP handler so OrgImport's reconcile path shares
  the same teardown sequence (#73 race guard, container stop, volume
  removal, token revocation, schedule disable, event broadcast). The
  HTTP Delete handler still inlines the same logic; deduplication
  tracked as tech-debt follow-up.

- emitOrgEvent(structure_events) records org.import.started +
  org.import.completed with mode, created/skipped/reconcile_removed
  counts, duration_ms, error. Replaces the lost-on-restart stdout-only
  log shape for an audit-trail surface that's queryable by SQL. Closes
  the "what happened at 20:13?" debugging gap that motivated this fix.

Verified live against the local platform: cascade-delete on an old
tree's removed root cleared 8 surviving orphans; mode="reconcile" with
a freshly-INSERTed fake orphan removed exactly the fake; idempotent
re-run of reconcile is a no-op (0 removed, no errors); structure_events
captures every started+completed pair with full payload.

7 new unit tests (walkOrgWorkspaceNames flat/nested/spawning:false/
empty-name; emitOrgEvent success + DB-error-swallow; errString). Full
handler suite green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 15:04:47 -07:00
6f861926bd Merge pull request 'fix(workspace_provision): preserve MODEL secret over MODEL_PROVIDER slug on restart' (#136) from fix/preserve-model-secret-on-restart into main 2026-05-08 21:31:50 +00:00
15c5f32491 fix(workspace_provision): preserve MODEL secret over MODEL_PROVIDER slug on restart
Phase 4 follow-up to template-claude-code PR #9 (2026-05-08 dev-tree wedge).

Pre-fix: applyRuntimeModelEnv unconditionally overwrote envVars["MODEL"]
with the MODEL_PROVIDER slug whenever payload.Model was empty (the restart
path). This silently wiped the operator'\''s explicit per-persona MODEL
secret on every restart.

Symptom: dev-tree workspaces booted correctly on first /org/import (the
envVars map was populated direct from the persona env file with both
MODEL=MiniMax-M2.7-highspeed and MODEL_PROVIDER=minimax), then on the
next Restart the MODEL secret got clobbered to literal "minimax" — a
provider slug, not a valid model id — and the workspace template's
adapter failed to match any registry prefix, fell through to providers[0]
(anthropic-oauth), and wedged at SDK initialize.

Fix: resolution order in applyRuntimeModelEnv is now:
  1. payload.Model (caller passed the canvas-picked model id verbatim)
  2. envVars["MODEL"] (workspace_secret persisted from persona env)
  3. envVars["MODEL_PROVIDER"] (legacy canvas Save+Restart shape)

Tests
-----
TestApplyRuntimeModelEnv_PersonaEnvMODELSecretPreserved — locks in
the new resolution order with four cases:
  - MODEL secret wins over MODEL_PROVIDER slug (persona-env shape)
  - MODEL secret wins even when same as MODEL_PROVIDER
  - MODEL absent → fall back to MODEL_PROVIDER (legacy shape)
  - Both absent → no MODEL set (no-op)

Existing TestApplyRuntimeModelEnv_SetsUniversalMODELForAllRuntimes
continues to pass — fix is strictly additive on the precedence chain.
2026-05-08 14:31:14 -07:00
9b5e89bb42 Merge pull request 'feat(org-import): add spawning:false field to skip workspace + descendants' (#135) from feat/org-import-spawning-false into main 2026-05-08 21:20:56 +00:00
claude-ceo-assistant
b91da1ab77 feat(org-import): add spawning:false field to skip workspace + descendants
Lets a workspace declare it (and its entire subtree) should be skipped
during /org/import. Pointer-typed `*bool` so we distinguish "explicitly
false" from "unset" (default = spawn).

## Use case

The dev-tree org template ships the full role taxonomy (Dev Lead with
Core Platform / Controlplane / App & Docs / Infra / SDK Leads, each with
their own engineering / QA / security / UI-UX children — 27 personas
total in a single import). Some setups need a smaller set:

- Local dev on a memory-constrained machine
- Demo / smoke runs that don't need the full org breathing
- Customer trials starting with leadership-only before fan-out

Pre-fix the only options were:
- Edit the canonical template (mutates shared state)
- Author a parallel slimmer template (duplicates structure)
- Manual workspace deprovision after full import (wasteful — already paid
  the docker pull / build cost)

`spawning: false` is the per-workspace knob that solves this without
touching the canonical template structure.

## Semantics

- Unset: workspace spawns (current behaviour, no migration)
- `spawning: true`: explicitly spawns (same as unset)
- `spawning: false`: workspace is skipped AND every descendant is
  skipped. The guard sits BEFORE any side effect in
  createWorkspaceTree — no DB row, no docker provision, no children
  recursion. A false-spawning subtree is genuinely a no-op except for
  the log line. countWorkspaces still counts the subtree (so /org/templates
  numbers reflect the full structure).

## Stage A — verified

Local dev-only template that wraps teams/dev.yaml (Dev Lead) with
children:[] cleared on the 5 sub-team yaml files, plus 3 floater
personas (Release Manager / Integration Tester / Fullstack Engineer).
/org/import returned 9 workspaces. Drop-in: same result via
`spawning: false` on each sub-tree root in the future.

## Stage B — N/A

Pure additive feature on the org-template handler. No SaaS deploy chain
implications.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 14:20:14 -07:00
claude-ceo-assistant
c3596d6271 fix(org-import): use ws.FilesDir as persona-dir lookup, add docker-cli-buildx to dev image
## org_import.go — persona env injection root-cause fix

The Phase-3 fix from earlier today (`feedback/per-agent-gitea-identity-default`)
introduced loadPersonaEnvFile to inject persona-specific creds into
workspace_secrets on /org/import. It passed `ws.Role` as the persona-dir
lookup key, but in our dev-tree org.yaml shape `role:` carries the
multi-line descriptive text the agent reads from its prompt
("Engineering planning and team coordination — leads Core Platform,
Controlplane, ..."), while `files_dir:` holds the short slug
(`core-lead`, `dev-lead`, etc.) matching
`~/.molecule-ai/personas/<files_dir>/env`.

isSafeRoleName silently rejected the multi-word role text → no persona
env loaded → every imported workspace booted with zero
workspace_secrets rows → no ANTHROPIC / CLAUDE_CODE / MINIMAX auth in
the container env → claude_agent_sdk wedged on `query.initialize()`
with a 60s control-request timeout.

After the fix, /org/import on the dev tree (27 personas) populates
8 workspace_secrets per workspace (Gitea identity + MODEL/MODEL_PROVIDER
+ provider-specific token), 5 of 6 leads boot online, and the
remaining wedges trace to a separate runtime-template-repo bug
(workspace-template-claude-code's claude_sdk_executor.py doesn't
dispatch on MODEL_PROVIDER=minimax — filed separately).

## Dockerfile.dev — docker-cli + docker-cli-buildx

Without these, every claude-code/tier-2 workspace POST fails-fast:
- docker-cli alone produces `exec: "docker": executable file not found`
- docker-cli alone (no buildx) fails on `docker build` with
  `ERROR: BuildKit is enabled but the buildx component is missing or broken`

Both packages are now installed in the dev image; verified with
`docker exec molecule-core-platform-1 docker buildx version`.

## Stage A verified

Local /org/import dev-only path: 27 workspaces created, all 27 receive
persona env injection (8 secrets each — Gitea identity + provider creds).
Lead workspaces (claude-code-OAuth tier) boot online.

## Stage B — N/A

Local-dev-only path (docker-compose.dev.yml + dev image). Tenant EC2
provisioning uses Dockerfile.tenant (untouched).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 13:50:46 -07:00
claude-ceo-assistant
7eda8f510f feat(local-dev): containerize platform + canvas stack via docker-compose (closes #126)
Replaces the legacy nohup `go run ./cmd/server` setup with a fully
containerized local stack: postgres + redis + platform + canvas, all
with `restart: unless-stopped` so they survive Mac sleep/wake and
Docker Desktop daemon restarts.

## Changes

- **docker-compose.yml**
  - `restart: unless-stopped` on platform/postgres/redis
  - `BIND_ADDR=0.0.0.0` for platform — the dev-mode-fail-open default
    of 127.0.0.1 (PR #7) made the host unable to reach the container
    even with port mapping. Container netns is already isolated, so
    binding all interfaces inside is safe.
  - Healthchecks switched from `wget --spider` (HEAD → 404 forever
    because /health is GET-only) to `wget -qO /dev/null` (GET).
    Same regression existed on canvas; fixed both.

- **workspace-server/Dockerfile.dev**
  - `CGO_ENABLED=1` → `0` to match prod Dockerfile + Dockerfile.tenant.
    Without this, the alpine dev image fails with "gcc: not found"
    because workspace-server has no actual cgo deps but the env was
    forcing the cgo build path. Closes a divergence introduced in
    9d50a6da (today's air hot-reload PR).

- **canvas/Dockerfile**
  - `npm install` → `npm ci --include=optional` for lockfile-exact
    installs that include platform-specific @tailwindcss/oxide native
    binaries. Without these, `next build` fails with "Cannot read
    properties of undefined (reading 'All')" on the
    `@import "tailwindcss"` directive.

- **canvas/.dockerignore** (new)
  - Excludes `node_modules` and `.next` so the Dockerfile's
    `COPY . .` step doesn't clobber the freshly-installed container
    node_modules with the host's (potentially stale or wrong-arch)
    copy. This was the actual root cause of the canvas build break.

- **workspace-server/.gitignore**
  - Adds `/tmp/` for air's live-reload build cache.

## Stage A verified

```
container          status                    restart
postgres-1         Up (healthy)              unless-stopped
redis-1            Up (healthy)              unless-stopped
platform-1         Up (healthy, air-mode)    unless-stopped
canvas-1           Up (healthy)              unless-stopped

GET :8080/health  → 200
GET :3000/        → 200
DB preserved:     407 workspace rows + 5 named personas
Persona mount:    28 dirs at /etc/molecule-bootstrap/personas
```

## Stage B — N/A

This is local-dev infrastructure only. None of these files ship to
SaaS tenants — production EC2s use `Dockerfile.tenant` + `ec2.go`
user-data, not docker-compose.

## Out of scope

- The decorative-but-broken `wget --spider` healthcheck has presumably
  also been silently 404'ing on prod tenants. Ship a follow-up to
  audit + fix the prod path; not done here to keep the PR scoped.
- Docker Desktop "Start at login" is a per-machine GUI setting that
  must be toggled manually (Settings → General).
- The legacy heartbeat-all.sh that pinged 5 persona workspaces from
  the host has been deleted (~/.molecule-ai/heartbeat-all.sh).
  Per Hongming: each workspace is responsible for its own heartbeat.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 10:53:39 -07:00