Commit Graph

730 Commits

Author SHA1 Message Date
e29b166f60 fix(test): poll error counter to 0 before asserting in RecordsMetricsOnSuccess
Race-detector CI runs (-race) slow goroutines enough that a
prior sweeper goroutine (e.g. TestStartSweeper_TransientErrorDoesNotCrashLoop)
can still be running and incrementing pendingUploadsSweepErrors after
metricDelta() captures its baseline, but before the success-path sweeper
records its success metrics. The test then reads deltaError=1 instead of 0.

Fix: add waitForMetricDelta(t, deltaError, 0, 2*time.Second) before the
assertion, matching the polling pattern already used in the error-path
test (TestStartSweeper_RecordsMetricsOnError). This ensures the error
counter has settled before we assert on it.

Fixes molecule-core#22.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 23:27:19 +00:00
Molecule AI Core Platform Lead
2a04233d5a Merge remote-tracking branch 'origin/main' into trig-189 2026-05-09 22:58:32 +00:00
Molecule AI Core Platform Lead
bd0a52a9a1 merge main into infra/fix-issue-151: keep PR #183 root-skip wording in local_test.go 2026-05-09 22:51:03 +00:00
ebc56a2ce6 fix(deps): migrate gh-identity from GitHub to Gitea module path
Update go.mod require and main.go import to use the Gitea-hosted
module path go.moleculesai.app/plugin/gh-identity (migrated from
github.com/Molecule-AI/molecule-ai-plugin-gh-identity).

Follows the pattern of the org-template URL migrations (github.com ->
git.moleculesai.app) applied to Go module imports.

Fixes molecule-core#91.
Ref: molecule-internal#71.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 22:50:45 +00:00
eaf7dbb7c4 fix(handlers): auto-restart workspace after file write/delete/replace
PUT /workspaces/:id/files and DELETE /workspaces/:id/files updated the
config volume but never restarted the container, so the running agent
continued serving stale file content from its in-memory cache. The
SecretsHandler already had this pattern (issue #15); TemplatesHandler
was missing it.

Fix: after every successful write/delete in WriteFile, DeleteFile, and
ReplaceFiles, call h.wh.RestartByID(workspaceID) asynchronously, guarded
by h.wh != nil (nil-tolerant for callers that only use read-only
surfaces). The RestartByID coalescing gate prevents thundering-herd on
concurrent requests.

Fixes #151.
Fixes #87 (duplicate effort closed — core-be also filed #183).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 22:43:27 +00:00
Molecule AI Core Platform Lead
ede4551c73 Merge remote-tracking branch 'origin/main' into trig-185 2026-05-09 22:41:01 +00:00
2077cf4054 [core-be-agent] fix(pendinguploads/test): correct sweeper test isolation
Issue #86: TestStartSweeper_RecordsMetricsOnSuccess fails in full-suite.

Root cause: two cooperating bugs in the sweeper test harness.

1. Sweeper loop called sweepOnce after ctx cancellation (double-increment).
   When ctx was cancelled the loop's select received ctx.Done(), called
   sweepOnce with the cancelled ctx, storage.Sweep returned context error,
   and metrics.PendingUploadsSweepError() incremented the error counter a
   SECOND time before the loop exited. Subsequent tests captured a polluted
   error baseline and their deltaError assertions failed.

2. Tests called defer cancel() without waiting for the goroutine to exit.
   The goroutine could still be blocked on Sweep (waiting for the next
   ticker's C channel) when the next test called metricDelta(). If the
   goroutine's Sweep returned during the next test's measurement window,
   the shared metric counters mutated mid-baseline.

Fix (production code):
- Guard the ticker arm: if ctx.Err() != nil, continue instead of calling
  sweepOnce. This prevents the post-cancellation sweep from running.

Fix (test harness):
- startSweeperWithInterval gains a done chan struct{} parameter. When the
  loop exits the channel is closed exactly once.
- StartSweeperForTest starts the goroutine and returns the done channel,
  allowing tests to drain it with <-done after cancel() — guaranteeing
  the goroutine has fully terminated before the next test's baseline.

All 8 sweeper tests now use StartSweeperForTest and drain the done
channel before returning, ensuring stable metric baselines across the
full suite.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 22:30:28 +00:00
e65633bf15 fix(test): skip TestLocalResolver_BubblesUpCopyFailure when uid==0
os.Chmod(dst, 0o555) silently passes when os.Geteuid() == 0 because
root bypasses POSIX permission checks. A previous attempt to use a
symlink to /dev/full also fails: Go's os.MkdirAll resolves the symlink
during path traversal and the kernel allows mkdir("/dev/full") as a
device-table entry — io.Copy to /dev/full then succeeds with 0 bytes
written and returns nil.

The honest, consistent fix mirrors TestLocalResolver_CopyFileSourceUnreadable:
skip when running as root. The write-failure propagation logic is
exercised correctly in non-root CI environments.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 22:22:44 +00:00
e3ea8ff74a [core-be-agent]
fix(plugins/test): skip TestLocalResolver_BubblesUpCopyFailure when running as root

Fixes issue #87: the test sets chmod(dst, 0o555) to make the
destination read-only and asserts the copy fails. On Linux, root
bypasses filesystem permissions and can write to 0o555 directories,
so the copy succeeds when running as root and the assertion fails.

Fix: check os.Getuid() == 0 at the start of the test and skip with
a clear message. Mirrors the existing skip in
TestLocalResolver_CopyFileSourceUnreadable (line 175) which already
handles the same root-bypass issue for unreadable source files.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 22:21:35 +00:00
Molecule AI Core Platform Lead
97768272a3 test(delegation): add isDeliveryConfirmedSuccess helper + 10-case table test
[core-lead-agent] Closes the regression-test gap on PR #170 (Core-BE's
fix for #159 retry-storm). Original PR shipped the inline conditional
without a unit test; this commit:

1. Extracts the inline `(proxyErr != nil && len(respBody) > 0 && 2xx)`
   predicate into a named helper `isDeliveryConfirmedSuccess`. Same
   behavior; the call site now reads `if isDeliveryConfirmedSuccess(...)`.

2. Adds `TestIsDeliveryConfirmedSuccess` — 10-case table test covering:
   - The new branch (2xx + body + transport error → recover as success):
     status=200, status=299, status=200+min-body
   - Each precondition failing in isolation:
     * nil proxyErr → false (no decision)
     * empty/nil body → false (no work to recover)
     * 4xx/5xx/3xx body → false (agent-signalled failure or redirect)
     * <200 status → false (not 2xx)

Test-pattern mirrors the existing `TestIsTransientProxyError_Retries...`
and `TestIsQueuedProxyResponse` table tests in the same file — same
file-local mock-error pattern, no new test infra.
2026-05-09 22:12:04 +00:00
21a5c31b85 [core-be-agent]
fix: Treat delivery-confirmed proxy errors as delegation success

Two-part fix for issue #159 — successful delegation responses were
rendered as error banners:

PART 1 — a2a_proxy.go: When io.ReadAll fails mid-stream (e.g., TCP
connection drops after the agent sent its 200 OK response), the prior
code returned (0, nil, BadGateway) discarding both the HTTP status code
and any partial body bytes already received. Fix: return
(resp.StatusCode, respBody, error) so callers can inspect what was
delivered even when the body read failed.

PART 2 — delegation.go: New condition in executeDelegation after the
transient-error retry block:

    if proxyErr != nil && len(respBody) > 0 && status >= 200 && status < 300 {
        goto handleSuccess
    }

When proxyA2ARequest returns a delivery-confirmed error (status 2xx +
non-empty partial body), route to success instead of failure. This
prevents the retry-storm pattern where the canvas shows "error" with
a Restart-workspace suggestion even though the delegation actually
completed and the response is available.

Regression tests (delegation_test.go):
- TestExecuteDelegation_DeliveryConfirmedProxyError_TreatsAsSuccess:
  server sends 200 + partial body then closes; second attempt succeeds.
  Verifies the new condition fires for delivery-confirmed 2xx responses.
- TestExecuteDelegation_ProxyErrorNon2xx_RemainsFailed: server sends
  500 + partial body then closes. Verifies non-2xx routes to failure.
- TestExecuteDelegation_ProxyErrorEmptyBody_RemainsFailed: server
  returns 502 Bad Gateway (empty body, transient). Verifies empty-body
  errors still route to failure (condition len(respBody) > 0 guards it).
- TestExecuteDelegation_CleanProxyResponse_Unchanged: clean 200 OK.
  Verifies baseline (proxyErr == nil path) is unaffected.

Fixes issue #159.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 22:11:54 +00:00
c9cf240751 [core-be-agent]
fix(template_import): Remove silent template-dir fallback in ReplaceFiles offline path

When the workspace container is offline and writeViaEphemeral fails
(docker unavailable), ReplaceFiles previously fell back to writing
to the host-side template directory. This silently returned 200 with
"source: template" while the file change was invisible after restart
because the restart handler reads from the Docker volume, not the
template dir (issue #151).

Now returns 503 Service Unavailable with a message telling the caller
to retry after the workspace starts. The ephemeral write path is
the only correct mechanism for offline-container updates.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 21:58:34 +00:00
7079d4ba01 [core-be-agent]
fix: Treat delivery-confirmed proxy errors as delegation success

When proxyA2ARequest returns an error but we have a non-empty
response body with a 2xx status code, the agent completed the work
successfully. The error is a delivery/transport error (e.g., connection
reset after response was received).

Previously, executeDelegation would mark these as "failed" even though
the work was done, causing:
- Retry storms (canvas suggests restart, user retries)
- "error" rendering in canvas even though result is available
- Data loss risk from unnecessary restarts

Now we check for valid response data before marking as failed.

Fixes issue #159.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 21:52:09 +00:00
Molecule AI Core Platform Lead
ea8ac4f023 Merge remote-tracking branch 'origin/main' into tech-debt/rename-net 2026-05-09 21:19:28 +00:00
Molecule AI Core Platform Lead
ad89173f0f Merge remote-tracking branch 'origin/main' into tech-debt/rename-net 2026-05-09 21:18:46 +00:00
Molecule AI Core Platform Lead
7090eab0d5 fix(workspace-server): sanitize err.Error() leaks in CascadeDelete and OrgImport
[core-lead-agent] Closes Core-Security audit finding (2026-05-09 audit cycle, MEDIUM):

1. workspace-server/internal/handlers/workspace_crud.go:335
   `DELETE /workspaces/:id` returned `err.Error()` verbatim in the 500
   body, leaking wrapped lib/pq driver strings (schema column names,
   index hints) to HTTP clients. Replaced with sanitized message;
   raw error already logged server-side via the existing log.Printf
   immediately above.

2. workspace-server/internal/handlers/org.go:610
   `OrgImport` echoed the user-supplied `body.Dir` verbatim in the 404
   "org template not found: %s" response. Path traversal is already
   blocked by resolveInsideRoot earlier in the handler, but echoing
   raw input back lets a client probe filesystem layout (404-with-echo
   vs. 400-from-resolve is itself a signal). Dropped the input from the
   client-facing message; preserved full context in a new log.Printf
   (orgFile path + the requested body.Dir) for operator triage.

Both fixes preserve operator-side diagnostics (logs unchanged in
content, only client-facing JSON sanitized). No behavior change for
legitimate clients — error type, status code, and JSON shape all stay
the same.

Tier: low. Defensive hardening only; reduces info-disclosure surface
without altering control-flow or auth gates.
2026-05-09 21:01:40 +00:00
252f8d0c47 tech-debt: rename molecule-monorepo-net -> molecule-core-net
Renames Docker network across all code, configs, scripts, and docs.

Per issue #93: the network was named molecule-monorepo-net as a holdover
from when the repo was called molecule-monorepo. The canonical repo name is
now molecule-core, so the network should be molecule-core-net.

Files changed:
- docker-compose.yml, docker-compose.infra.yml: network definition
- infra/scripts/setup.sh: docker network create
- scripts/nuke-and-rebuild.sh: docker network rm
- workspace-server/internal/provisioner/provisioner.go: DefaultNetwork
- All comments/docs: updated wording

Acceptance: grep -rn 'molecule-monorepo-net' returns zero matches.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 20:51:48 +00:00
e8f521011f fix(mcp): write delegation activity row so canvas Agent Comms shows task text
MCP delegate_task and delegate_task_async bypassed the delegation activity
lifecycle entirely — no activity_log row was written for MCP-initiated
delegations. As a result the canvas Agent Comms tab rendered outbound
delegations as bare "Delegation dispatched" events with no task body.

Fix: insert a delegation row (mirroring insertDelegationRow from
delegation.go) before the A2A call so the canvas can show the task text.
The sync tool updates status to 'dispatched' after the HTTP call; the
async tool inserts with 'dispatched' directly (goroutine won't update).

Closes #158.
Closes #49 (partial — addresses the canvas-display gap; full lifecycle
parity requires DelegationWriter extraction, tracked separately).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 20:44:06 +00:00
claude-ceo-assistant
b3041c13d3 fix(org-import): emit started event after YAML parse so name is populated
The org.import.started event was firing immediately after request body
bind, before the YAML at body.Dir was loaded. Result: payload.name was
"" whenever the caller passed `dir` (the common path — the canvas and
all live imports use dir, not inline template). Three started rows
already in the local platform's structure_events have empty name.

Fix: move the started emit (and importStart timestamp) to after the
YAML unmarshal / inline-template fallthrough, where tmpl.Name is
guaranteed populated.

Bonus: pre-parse error returns (invalid body, traversal-rejected dir,
file-not-found, YAML expansion fail, YAML unmarshal fail, neither dir
nor template provided) no longer emit an orphan started row — every
started is now guaranteed a paired completed/failed.

Verified live against running platform: re-imported molecule-dev-only,
new started row in structure_events carries
"Molecule AI Dev Team (dev-only)" instead of "".

Tests: full handler suite green (`go test ./internal/handlers/`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 16:25:24 -07:00
claude-ceo-assistant
bfefcb315b refactor(handlers): Delete() delegates to CascadeDelete helper
Drops ~150 lines of duplicated cascade logic from the Delete HTTP
handler — workspace_crud.go's CascadeDelete (added in PR #137) and
Delete() were running the same #73 race-guard sequence (status update →
canvas_layouts → tokens → schedules → container stop → broadcast),
just with Delete() inlined and CascadeDelete owning the OrgImport
reconcile path.

CascadeDelete now returns the descendant id list (was: count) so
Delete() can drive the optional ?purge=true hard-delete against the
same set the cascade just touched.

Net diff: workspace_crud.go shrinks from ~270 lines in Delete() to
~75 lines (parse + 409 confirm gate + CascadeDelete call + stop-error
500 + purge block + 200 response). Behavior identical — same SQL
ordering, same #73 race guard, same response shapes. Three sqlmock
tests for the 0-children case gained one extra ExpectQuery for the
recursive-CTE descendants scan (the old inline code skipped that
query when len(children)==0; CascadeDelete walks unconditionally —
returns 0 rows, same end state, one extra cheap query).

Tests: full handler suite green (`go test ./internal/handlers/`).
Live-tested against the running local platform: DELETE on a fake
workspace returns `{"cascade_deleted":0,"status":"removed"}`,
fleet of 9 workspaces preserved, refactored handler matches the
prior wire-shape exactly.

Tracked as the PR #137 follow-up tech-debt item.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 15:47:51 -07:00
claude-ceo-assistant
3de51faa19 fix(org-import): reconcile mode + audit-event emission
Closes the additive-import zombie bug — re-running /org/import with a
tree shape that reparents same-named roles left the prior workspace
online because lookupExistingChild's dedupe is parent-scoped (different
parent_id → "different" workspace). Caught 2026-05-08 after a dev-tree
re-import left 8 orphans co-existing with the new tree on canvas until
manual cascade-delete.

Three layers in this PR:

- mode="reconcile" on /org/import — after the import loop, online
  workspaces whose name matches an imported name but whose id isn't in
  the result set are cascade-deleted. Default mode "" / "merge"
  preserves existing additive behavior. Empty-set guards prevent
  accidental "delete everything" if either array comes up empty.

- WorkspaceHandler.CascadeDelete extracted as a callable helper from
  the existing Delete HTTP handler so OrgImport's reconcile path shares
  the same teardown sequence (#73 race guard, container stop, volume
  removal, token revocation, schedule disable, event broadcast). The
  HTTP Delete handler still inlines the same logic; deduplication
  tracked as tech-debt follow-up.

- emitOrgEvent(structure_events) records org.import.started +
  org.import.completed with mode, created/skipped/reconcile_removed
  counts, duration_ms, error. Replaces the lost-on-restart stdout-only
  log shape for an audit-trail surface that's queryable by SQL. Closes
  the "what happened at 20:13?" debugging gap that motivated this fix.

Verified live against the local platform: cascade-delete on an old
tree's removed root cleared 8 surviving orphans; mode="reconcile" with
a freshly-INSERTed fake orphan removed exactly the fake; idempotent
re-run of reconcile is a no-op (0 removed, no errors); structure_events
captures every started+completed pair with full payload.

7 new unit tests (walkOrgWorkspaceNames flat/nested/spawning:false/
empty-name; emitOrgEvent success + DB-error-swallow; errString). Full
handler suite green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 15:04:47 -07:00
6f861926bd Merge pull request 'fix(workspace_provision): preserve MODEL secret over MODEL_PROVIDER slug on restart' (#136) from fix/preserve-model-secret-on-restart into main 2026-05-08 21:31:50 +00:00
15c5f32491 fix(workspace_provision): preserve MODEL secret over MODEL_PROVIDER slug on restart
Phase 4 follow-up to template-claude-code PR #9 (2026-05-08 dev-tree wedge).

Pre-fix: applyRuntimeModelEnv unconditionally overwrote envVars["MODEL"]
with the MODEL_PROVIDER slug whenever payload.Model was empty (the restart
path). This silently wiped the operator'\''s explicit per-persona MODEL
secret on every restart.

Symptom: dev-tree workspaces booted correctly on first /org/import (the
envVars map was populated direct from the persona env file with both
MODEL=MiniMax-M2.7-highspeed and MODEL_PROVIDER=minimax), then on the
next Restart the MODEL secret got clobbered to literal "minimax" — a
provider slug, not a valid model id — and the workspace template'\''s
adapter failed to match any registry prefix, fell through to providers[0]
(anthropic-oauth), and wedged at SDK initialize.

Fix: resolution order in applyRuntimeModelEnv is now:
  1. payload.Model (caller passed the canvas-picked model id verbatim)
  2. envVars["MODEL"] (workspace_secret persisted from persona env)
  3. envVars["MODEL_PROVIDER"] (legacy canvas Save+Restart shape)

Tests
-----
TestApplyRuntimeModelEnv_PersonaEnvMODELSecretPreserved — locks in
the new resolution order with four cases:
  - MODEL secret wins over MODEL_PROVIDER slug (persona-env shape)
  - MODEL secret wins even when same as MODEL_PROVIDER
  - MODEL absent → fall back to MODEL_PROVIDER (legacy shape)
  - Both absent → no MODEL set (no-op)

Existing TestApplyRuntimeModelEnv_SetsUniversalMODELForAllRuntimes
continues to pass — fix is strictly additive on the precedence chain.
2026-05-08 14:31:14 -07:00
9b5e89bb42 Merge pull request 'feat(org-import): add spawning:false field to skip workspace + descendants' (#135) from feat/org-import-spawning-false into main 2026-05-08 21:20:56 +00:00
claude-ceo-assistant
b91da1ab77 feat(org-import): add spawning:false field to skip workspace + descendants
Lets a workspace declare it (and its entire subtree) should be skipped
during /org/import. Pointer-typed `*bool` so we distinguish "explicitly
false" from "unset" (default = spawn).

## Use case

The dev-tree org template ships the full role taxonomy (Dev Lead with
Core Platform / Controlplane / App & Docs / Infra / SDK Leads, each with
their own engineering / QA / security / UI-UX children — 27 personas
total in a single import). Some setups need a smaller set:

- Local dev on a memory-constrained machine
- Demo / smoke runs that don't need the full org breathing
- Customer trials starting with leadership-only before fan-out

Pre-fix the only options were:
- Edit the canonical template (mutates shared state)
- Author a parallel slimmer template (duplicates structure)
- Manual workspace deprovision after full import (wasteful — already paid
  the docker pull / build cost)

`spawning: false` is the per-workspace knob that solves this without
touching the canonical template structure.

## Semantics

- Unset: workspace spawns (current behaviour, no migration)
- `spawning: true`: explicitly spawns (same as unset)
- `spawning: false`: workspace is skipped AND every descendant is
  skipped. The guard sits BEFORE any side effect in
  createWorkspaceTree — no DB row, no docker provision, no children
  recursion. A false-spawning subtree is genuinely a no-op except for
  the log line. countWorkspaces still counts the subtree (so /org/templates
  numbers reflect the full structure).

## Stage A — verified

Local dev-only template that wraps teams/dev.yaml (Dev Lead) with
children:[] cleared on the 5 sub-team yaml files, plus 3 floater
personas (Release Manager / Integration Tester / Fullstack Engineer).
/org/import returned 9 workspaces. Drop-in: same result via
`spawning: false` on each sub-tree root in the future.

## Stage B — N/A

Pure additive feature on the org-template handler. No SaaS deploy chain
implications.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 14:20:14 -07:00
claude-ceo-assistant
c3596d6271 fix(org-import): use ws.FilesDir as persona-dir lookup, add docker-cli-buildx to dev image
## org_import.go — persona env injection root-cause fix

The Phase-3 fix from earlier today (`feedback/per-agent-gitea-identity-default`)
introduced loadPersonaEnvFile to inject persona-specific creds into
workspace_secrets on /org/import. It passed `ws.Role` as the persona-dir
lookup key, but in our dev-tree org.yaml shape `role:` carries the
multi-line descriptive text the agent reads from its prompt
("Engineering planning and team coordination — leads Core Platform,
Controlplane, ..."), while `files_dir:` holds the short slug
(`core-lead`, `dev-lead`, etc.) matching
`~/.molecule-ai/personas/<files_dir>/env`.

isSafeRoleName silently rejected the multi-word role text → no persona
env loaded → every imported workspace booted with zero
workspace_secrets rows → no ANTHROPIC / CLAUDE_CODE / MINIMAX auth in
the container env → claude_agent_sdk wedged on `query.initialize()`
with a 60s control-request timeout.

After the fix, /org/import on the dev tree (27 personas) populates
8 workspace_secrets per workspace (Gitea identity + MODEL/MODEL_PROVIDER
+ provider-specific token), 5 of 6 leads boot online, and the
remaining wedges trace to a separate runtime-template-repo bug
(workspace-template-claude-code's claude_sdk_executor.py doesn't
dispatch on MODEL_PROVIDER=minimax — filed separately).

## Dockerfile.dev — docker-cli + docker-cli-buildx

Without these, every claude-code/tier-2 workspace POST fails-fast:
- docker-cli alone produces `exec: "docker": executable file not found`
- docker-cli alone (no buildx) fails on `docker build` with
  `ERROR: BuildKit is enabled but the buildx component is missing or broken`

Both packages are now installed in the dev image; verified with
`docker exec molecule-core-platform-1 docker buildx version`.

## Stage A verified

Local /org/import dev-only path: 27 workspaces created, all 27 receive
persona env injection (8 secrets each — Gitea identity + provider creds).
Lead workspaces (claude-code-OAuth tier) boot online.

## Stage B — N/A

Local-dev-only path (docker-compose.dev.yml + dev image). Tenant EC2
provisioning uses Dockerfile.tenant (untouched).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 13:50:46 -07:00
claude-ceo-assistant
7eda8f510f feat(local-dev): containerize platform + canvas stack via docker-compose (closes #126)
Replaces the legacy nohup `go run ./cmd/server` setup with a fully
containerized local stack: postgres + redis + platform + canvas, all
with `restart: unless-stopped` so they survive Mac sleep/wake and
Docker Desktop daemon restarts.

## Changes

- **docker-compose.yml**
  - `restart: unless-stopped` on platform/postgres/redis
  - `BIND_ADDR=0.0.0.0` for platform — the dev-mode-fail-open default
    of 127.0.0.1 (PR #7) made the host unable to reach the container
    even with port mapping. Container netns is already isolated, so
    binding all interfaces inside is safe.
  - Healthchecks switched from `wget --spider` (HEAD → 404 forever
    because /health is GET-only) to `wget -qO /dev/null` (GET).
    Same regression existed on canvas; fixed both.

- **workspace-server/Dockerfile.dev**
  - `CGO_ENABLED=1` → `0` to match prod Dockerfile + Dockerfile.tenant.
    Without this, the alpine dev image fails with "gcc: not found"
    because workspace-server has no actual cgo deps but the env was
    forcing the cgo build path. Closes a divergence introduced in
    9d50a6da (today's air hot-reload PR).

- **canvas/Dockerfile**
  - `npm install` → `npm ci --include=optional` for lockfile-exact
    installs that include platform-specific @tailwindcss/oxide native
    binaries. Without these, `next build` fails with "Cannot read
    properties of undefined (reading 'All')" on the
    `@import "tailwindcss"` directive.

- **canvas/.dockerignore** (new)
  - Excludes `node_modules` and `.next` so the Dockerfile's
    `COPY . .` step doesn't clobber the freshly-installed container
    node_modules with the host's (potentially stale or wrong-arch)
    copy. This was the actual root cause of the canvas build break.

- **workspace-server/.gitignore**
  - Adds `/tmp/` for air's live-reload build cache.

## Stage A verified

```
container          status                    restart
postgres-1         Up (healthy)              unless-stopped
redis-1            Up (healthy)              unless-stopped
platform-1         Up (healthy, air-mode)    unless-stopped
canvas-1           Up (healthy)              unless-stopped

GET :8080/health  → 200
GET :3000/        → 200
DB preserved:     407 workspace rows + 5 named personas
Persona mount:    28 dirs at /etc/molecule-bootstrap/personas
```

## Stage B — N/A

This is local-dev infrastructure only. None of these files ship to
SaaS tenants — production EC2s use `Dockerfile.tenant` + `ec2.go`
user-data, not docker-compose.

## Out of scope

- The decorative-but-broken `wget --spider` healthcheck has presumably
  also been silently 404'ing on prod tenants. Ship a follow-up to
  audit + fix the prod path; not done here to keep the PR scoped.
- Docker Desktop "Start at login" is a per-machine GUI setting that
  must be toggled manually (Settings → General).
- The legacy heartbeat-all.sh that pinged 5 persona workspaces from
  the host has been deleted (~/.molecule-ai/heartbeat-all.sh).
  Per Hongming: each workspace is responsible for its own heartbeat.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 10:53:39 -07:00
claude-ceo-assistant
120b3a25aa feat(workspaces): update_tier column for canary vs production fan-out
Closes core#115 partial. Schema-only change; the apply-endpoint filter
logic that reads this column lands with core#123 (drift detector +
queue + apply endpoint, the deferred follow-up of core#113).

Default 'production' so existing customers (Reno-Stars + any future
tenant) are default-safe. Synthetic dogfooding workspaces opt INTO
'canary' explicitly.

CHECK constraint pins the closed value set ('canary' | 'production') —
the apply endpoint's filter relies on the database to reject anything
else, so a future operator typo in PATCH /workspaces/:id ({update_tier:
'canery'}) returns a constraint violation, not silent fan-out to
nobody.

Partial index on canary rows since the apply-endpoint query path
('apply this update only to canary tier first') hits canary much more
often than production, and the production set is the much larger
default.

WHAT THIS DOES NOT DO (lands with core#123)
  - PATCH endpoint to flip a workspace to canary
  - The apply endpoint that consults the column
  - Tests that exercise canary-vs-production fan-out

Schema-only foundation; same pattern as core#113 (workspace_plugins).

PHASE 4 SELF-REVIEW
  Correctness: No finding — IF NOT EXISTS guards, DEFAULT clause means
    existing rows get 'production' on migration apply.
  Readability: No finding — comment block documents the tier semantics
    + the deferral to core#123.
  Architecture: No finding — additive ALTER, partial index for the
    expected access pattern.
  Security: No finding — no code path; column constraint reduces blast
    radius of bad PATCH input.
  Performance: No finding — partial index minimizes write amplification
    on the production-default rows.

REFS
  core#115 — this issue
  core#123 — apply endpoint follow-up (will exercise this column)
  core#113 — version subscription DB foundation (sibling pattern)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 08:55:19 -07:00
claude-ceo-assistant
72b0d4b1ab feat(plugins): workspace_plugins tracking table — version-subscription foundation
Closes core#113 partial. Adds the DB foundation for the
version-subscription model. Drift detection + queue + admin apply
endpoint are follow-up scope (separate PR; filed as a new issue).

WHY THIS PR ONLY GETS US PART-WAY
  Plugin install state today is filesystem-only — '/configs/plugins/<name>/'
  inside the container. There's no DB record of 'plugin X installed at
  workspace W from source S, tracking ref T'. That makes drift detection
  impossible: nothing to compare upstream tags against.

  This PR adds the table + the install-endpoint hook that writes to it.
  With baseline tags now on every plugin (post internal#92), the table
  starts collecting tracked-ref values immediately on the next install.
  The actual drift-check job + queue + apply endpoint layer on top.

WHAT THIS ADDS
  workspace_plugins table:
    workspace_id   FK → workspaces(id) ON DELETE CASCADE
    plugin_name    canonical name from plugin.yaml
    source_raw     full source URL the install used
    tracked_ref    'none' | 'tag:vX.Y.Z' | 'tag:latest' | 'sha:<full>'
    installed_at, updated_at

  installRequest gains optional 'track' field (defaults to 'none').
  Install handler upserts the workspace_plugins row after delivery
  succeeds. DB write failure is logged but doesn't fail the install
  (the plugin IS in the container; surfacing 500 misleads the caller).

  validateTrackedRef enforces the closed set of accepted shapes:
    'none' | 'tag:<non-empty>' | 'sha:<non-empty>'
  Bare values like 'latest' / 'main' / version-strings without
  prefix are rejected — the drift detector keys on prefix to know
  what kind of resolution to do.

WHAT THIS DOES NOT ADD (filed separately)
  - Drift detector job (cron / on-demand) that scans
    'WHERE tracked_ref != none' rows and queues updates on upstream drift
  - plugin_update_queue table (separate migration once detector lands)
  - GET /admin/plugin-updates-pending and POST .../apply endpoints
  - Tier-aware apply (core#115 — composes here)

PHASE 4 SELF-REVIEW (FIVE-AXIS)
  Correctness: No finding — install endpoint behavior unchanged for
    callers that don't pass 'track'. DB write is best-effort + logged
    on failure. validateTrackedRef rejects ambiguous bare strings.
  Readability: No finding — separate file plugins_tracking.go isolates
    the new concern; install handler delta is a single 4-line block.
  Architecture: No finding — additive table; existing schema untouched.
    Migration 20260508160000_* uses the timestamp-prefixed convention.
  Security: No finding — INSERT params via  placeholders (no string
    interpolation). validateTrackedRef rejects unexpected shapes before
    the column constraint would.
  Performance: No finding — one extra ExecContext per install. Install
    is already seconds-scale (network fetch + tar + docker exec); rounds
    to noise.

TESTS (1 new, all green)
  TestValidateTrackedRef — pin closed set + structural validators

REFS
  core#113 — this issue (foundation only; drift+queue+apply = follow-up)
  internal#92, internal#93 — plugin/template baseline tags (now exists for tracking)
  core#114 — atomic install (this PR composes — no atomicity regression)
  core#115 — canary tier filter (will key off the same DB foundation)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 08:52:35 -07:00
claude-ceo-assistant
249e760fbd feat(plugins): hot-reload classifier — skip restart on SKILL-content-only updates
Closes molecule-core#112. Composes with #114 (atomic install).

Before issuing restartFunc, classify the diff between staged and live:
  - skill-content-only: only **/SKILL.md content changed
                        → skip restart (Claude Code re-reads SKILL.md on
                          each Skill invocation; no in-memory cache)
  - cold: anything else
                        → restartFunc as before
                          (hooks/settings load at session start;
                          plugin.yaml is structural; added/removed files
                          require a fresh load)

DETECTION
  - Hash every regular file in staged tree (host filesystem, sha256)
  - Hash every regular file in live tree (in-container via docker exec
    sh -c 'cd <livePath> && find . -type f -print0 | xargs -0 sha256sum')
  - .complete marker dropped from comparison (mtime varies install-to-
    install; including it would force-cold every reinstall)
  - File added/removed → cold
  - File content differs but isn't SKILL.md → cold
  - All differences are SKILL.md basenames → skill-content-only

DEFAULTS COLD
  - First install (no live tree) → cold
  - Live tree read failure → cold (conservative; never hot-reload speculatively)
  - Symlinks skipped during hash (same posture as tar walker)

PHASE 4 SELF-REVIEW
  Correctness: No finding — all error paths default to cold; never
    falsely classify as skill-content-only. The .complete drop is
    a deliberate exception (the marker is bookkeeping, not content).
  Readability: No finding — single-purpose helpers (hashLocalTree,
    hashContainerTree, isSkillMarkdown, shQuote) each do one thing.
    The classifier itself reads as 'compare set, then walk diff with
    isSkillMarkdown gate.'
  Architecture: No finding — composes existing execAsRoot primitive;
    new helpers in plugins_classifier.go don't touch any other
    handler. Old behavior unchanged when live read fails.
  Security: No finding — shQuote single-quotes any non-trivial path,
    pluginName comes from validatePluginName-validated source, and
    the docker exec command takes the path as a single arg (xargs -0
    handles binary-safe path delimiting). Symlinks skipped.
  Performance: No finding — adds two tree walks (host + container)
    per install. Container walk is one docker exec call returning
    sha256 lines; for typical plugins (~10-50 files) round-trip is
    ~100ms. Versus the saved ~5-10s of restart on a hot-reloadable
    update, this is a clear win.

TESTS (4 new, all green; full handler suite green)
  TestIsSkillMarkdown        — basename match, case-sensitive
  TestHashLocalTree_StableHash — re-hash same dir = same map
  TestHashLocalTree_SymlinkSkipped — hostile link doesn't poison classifier
  TestShQuote                — quoting boundary for shell injection safety

REFS
  molecule-core#112 — this issue
  molecule-core#114 — atomic install (.complete marker added there)
  Reno-Stars iteration safety (Hongming 2026-05-08)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 08:26:05 -07:00
3e96184d6f Merge pull request 'feat(plugins): atomic install — stage→snapshot→swap→marker (docker path)' (#120) from feat/plugin-atomic-install into main 2026-05-08 15:23:31 +00:00
claude-ceo-assistant
7fbb8cb6e9 feat(plugins): atomic install — stage→snapshot→swap→marker (docker path)
Closes molecule-core#114 for the docker (local-OSS) path.
EIC (SaaS) path tracked as a follow-up — same shape, different
exec primitives (ssh vs docker exec); shipping both in one PR
doubles the test surface.

THE FOUR-STEP DANCE
  1. STAGE     — docker.CopyToContainer extracts tar into
                 /configs/plugins/.staging/<name>.<ts>/
  2. SNAPSHOT  — if /configs/plugins/<name>/ exists, mv to
                 /configs/plugins/.previous/<name>.<ts>/
  3. SWAP      — atomic mv staging → live (single rename(2))
  4. MARKER    — touch /configs/plugins/<name>/.complete

Workspace-side plugin loaders should refuse to load any plugin dir
without .complete (separate small change, not in this PR — the marker
write is the necessary precursor; consumer side is a follow-up so
existing-content plugins don't break before they're re-installed).

ROLLBACK
  - Stage failure: rm -rf staging dir; live untouched
  - Snapshot failure: rm -rf staging dir; live untouched (no rename happened)
  - Swap failure with snapshot present: mv previous back to live
  - Swap failure (no snapshot): rm -rf staging; live (which never
    existed) stays absent
  - Marker failure: content already in place, log loudly with manual
    recovery hint (touch <plugin>/.complete) — don't roll back since
    the new content is what we wanted, just unmarked

GC
  Best-effort delete of previous-version snapshot after successful
  marker write. Failures non-fatal — next install or a separate
  sweeper reclaims. Sweeper for stale .previous/* across reboots is
  follow-up scope.

CONCURRENCY
  Each install gets a unique stamp (UTC second precision), so two
  concurrent reinstalls land in distinct staging dirs and the second
  swap simply overwrites the first's live result. The atomicity is
  per-install, not cross-install — by design (the platform serializes
  POST /workspaces/:id/plugins via Go-side semaphore upstream of
  this code, so cross-install collisions don't reach here).

CHANGES
  + plugins_atomic.go        — installVersion + atomicCopyToContainer
  + plugins_atomic_tar.go    — tarWalk/tarHostDirWithPrefix helpers
  + plugins_atomic_test.go   — 5 unit tests (paths, stamp shape,
                               tar happy path, symlink-skip, prefix
                               normalization). All green.
  ~ plugins_install_pipeline.go::deliverToContainer — swap
    copyPluginToContainer call to atomicCopyToContainer

Old copyPluginToContainer is retained (still called by Download()) so
this PR is purely additive on the install path; no public API change.

PHASE 4 SELF-REVIEW (FIVE-AXIS)
  Correctness: Required (addressed) — swap-failure rollback writes mv
    of previous back to live before returning the error; if rollback
    itself fails, we wrap both errors and surface the combined fault.
    Marker-write failure is treated as content-landed-but-unmarked
    (LOG, don't roll back the new content).
  Readability: No finding — installVersion path methods make the
    /staging/.previous/live/marker layout obvious from one struct.
    tarWalk extracted from the inline filepath.Walk in
    plugins_install_pipeline.go for testability.
  Architecture: No finding — atomicCopyToContainer composes existing
    execAsRoot / docker.CopyToContainer primitives; no new dependencies.
    Old copyPluginToContainer kept for Download() — single responsibility
    per function.
  Security: No finding — symlinks still skipped during tar walk
    (defense vs hostile plugin escaping its own dir). Marker writes
    use composeable path.Join, no user input touches the path.
  Performance: No finding — adds ~3 docker exec calls per install
    (mkdir, mv-snapshot, mv-swap, touch — actually 4) on top of the
    one CopyToContainer. Each exec ~50-100ms in practice; install
    end-to-end was already seconds-scale, this rounds to noise.

REFS
  molecule-core#114 — this issue
  Companion: molecule-core#112 (hot-reload classifier — depends on .complete marker)
  Companion: molecule-core#113 (version subscription — uses install machinery)
  EIC follow-up: separate issue to be filed for SaaS path parity

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 08:22:52 -07:00
c3686a4bb3 Merge branch 'main' into fix/pendinguploads-test-isolation 2026-05-08 15:20:36 +00:00
e37a289eb6 Merge pull request 'feat(org-import): inject per-role persona env from operator-host bootstrap dir' (#110) from feat/persona-env-injection into main 2026-05-08 15:17:17 +00:00
claude-ceo-assistant
9d50a6dae4 feat(local-dev): air-based hot-reload for workspace-server
Closes core#116. Brings local-dev iteration parity with the canvas's
Turbopack HMR — edit a Go file, see the platform restart in <5s
instead of running 'docker compose up --build' (~30s) per change.

USAGE
  make dev   # docker compose with air-driven live reload
  make up    # production-shape stack (no air, normal Dockerfile)

WHAT THIS ADDS
  workspace-server/.air.toml      — air watch config
  workspace-server/Dockerfile.dev — air-on-golang:1.25-alpine, dev-only
  docker-compose.dev.yml          — overlay swapping platform service
                                    to Dockerfile.dev + bind-mounting
                                    workspace-server/ source
  Makefile                        — make {dev,up,down,logs,build,test}

WHAT THIS DOES NOT TOUCH
  workspace-server/Dockerfile (production multi-stage build)
  docker-compose.yml          (prod-shape stack)
  CI workflows                (build prod image directly)
  Tenant deployment / SaaS    (image swap stays the model)

Pure additive. Existing 'docker compose up' path unchanged; production
stays on the static binary. Air install pinned via go install at image
build time so the dev image is reproducible-enough for local use (we
don't pin air to a SHA — the dev image is rebuilt locally and updates
opportunistically).

PHASE 4 SELF-REVIEW (FIVE-AXIS)
  Correctness: No finding — additive change, no existing path modified.
    .air.toml watches .go + .yaml under workspace-server/, excludes
    _test.go and tests dir so test edits don't trigger rebuild.
    Dockerfile.dev mirrors prod's 'go mod download' so first rebuild
    is fast.
  Readability: No finding — three small files plus a Makefile, each
    with header comments explaining the WHY, not just the WHAT. The
    Makefile uses the standard ## help-target pattern.
  Architecture: No finding — overlay pattern (docker-compose.dev.yml
    on top of docker-compose.yml) is the standard compose convention
    for env-specific overrides. Doesn't fork the prod path.
  Security: No finding because no production code path; dev-only image
    isn't built in CI and isn't published to ECR.
  Performance: No finding — air debounce=500ms, exclude_unchanged=true
    so a save that doesn't change content is a no-op rebuild.

REFS
  core#116 — this issue
  Companion: core#117 (workspace-side config-watcher for hot-reload of
  config.yaml) — different scope; this issue is platform-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 08:10:50 -07:00
dev-lead
9e18ab4620 fix(pendinguploads): wait for error metric before test exit
TestStartSweeper_TransientErrorDoesNotCrashLoop leaks an in-flight
metric write across the test boundary: cycleDone fires inside the
fake's Sweep defer (before Sweep returns), waitForCycle returns
immediately after, cancel() lands, but the goroutine still has
metrics.PendingUploadsSweepError() to execute. Whether that write
happens before or after the next test's metricDelta() baseline read
is a coin-flip on slow CI hosts.

Outcome: TestStartSweeper_RecordsMetricsOnSuccess fails with
"error counter delta = 1, want 0" — looks like a real bug, isn't.
Instrumented analysis (per the file's existing waitForMetricDelta
docstring covering the same shape) confirms the metric IS getting
recorded, just AFTER the next test reads its baseline.

The Records* tests already use waitForMetricDelta to close this race
on their own assertions. This change extends the same shape to
TransientErrorDoesNotCrashLoop so it doesn't poison subsequent tests'
baselines.

Verified by running `go test -race -count=20 ./internal/pendinguploads/...`
locally — passes deterministically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 07:37:45 -07:00
claude-ceo-assistant
43b33bcaa5 feat(org-import): inject per-role persona env from operator-host bootstrap dir
Wires the 28 dev-tree persona credentials minted 2026-05-08 into the
workspace-secrets path used by org_import. When a workspace.yaml carries
`role: <name>`, the importer now reads
$MOLECULE_PERSONA_ROOT/<role>/env (default
/etc/molecule-bootstrap/personas/<role>/env, populated by the bootstrap
kit on the tenant host) and merges the role's GITEA_USER /
GITEA_TOKEN / GITEA_TOKEN_SCOPES / GITEA_USER_EMAIL /
GITEA_SSH_KEY_PATH into the same envVars map that already feeds
workspace_secrets via parseEnvFile + crypto.Encrypt + INSERT.

PRECEDENCE
  Persona env is the LOWEST layer:
    0. Persona env (per-role)
    1. Org root .env (shared)
    2. Workspace .env (per-workspace)
  Each later layer overrides the previous, so a workspace .env can
  pin a different GITEA_TOKEN if it ever needs to (testing, override).

WHY THIS LAYERING
  Workspaces should boot with the role's identity by default. .env
  files stay the explicit-override mechanism for the (rare) case where
  a workspace needs to deviate. No new behavior for workspaces with no
  role: persona load is silent no-op when ws.Role is empty or unsafe.

SECURITY
  isSafeRoleName accepts only [A-Za-z0-9_-]+ (no '..', '/', or
  separators) — admin-only construct, but defense-in-depth keeps the
  persona dir shape invariant. Test
  TestLoadPersonaEnvFile_RejectsTraversal pins the rejection set against
  a planted target file.

OPERATOR-HOST CONTRACT
  The 28 persona env files live at /etc/molecule-bootstrap/personas/<role>/env
  (mode 600, owner root:root) with the per-role token-scope tailoring
  Hongming approved 2026-05-08 (D5). Synced via task #241. Override via
  MOLECULE_PERSONA_ROOT for tests + non-prod hosts.

TESTS (7 new, all green)
  TestLoadPersonaEnvFile_HappyPath        — typical persona-env shape
  TestLoadPersonaEnvFile_MissingDir       — silent no-op when file absent
  TestLoadPersonaEnvFile_EmptyRole        — silent no-op when role empty
  TestLoadPersonaEnvFile_RejectsTraversal — planted file unreachable
                                            via '../../etc/passwd' etc.
  TestLoadPersonaEnvFile_DefaultRoot      — falls back to /etc/...
  TestLoadPersonaEnvFile_OverwritesEmptyMap
  TestIsSafeRoleName_Acceptance           — positive + negative role names

PHASE 4 SELF-REVIEW (FIVE-AXIS)
  Correctness: No finding — additive change, silent no-op on the ws.Role==''
    path covers every existing workspace; tests cover happy path + each
    rejection mode + missing-dir.
  Readability: No finding — helper sits next to parseEnvFile in
    org_helpers.go with a comment block explaining WHY persona is
    lowest precedence.
  Architecture: No finding — fits the existing 'merge .env into envVars
    then INSERT INTO workspace_secrets' pattern that's been in place
    since the .env-driven workspace secrets feature; no new dependencies,
    no new tables.
  Security: Required (addressed) — path traversal blocked by
    isSafeRoleName. No finding beyond that since persona files are
    admin-managed and the helper does not log token values.
  Performance: No finding — one extra os.ReadFile per workspace at
    import time; amortized over workspace lifetime, cost is negligible.

REFS
  internal#85 — RFC for SOP Phase 4 + structured Five-Axis (parent context)
  Saved memories: feedback_per_agent_gitea_identity_default,
                  feedback_unified_credentials_file
  Task #241 — operator-host sync (already DONE; populated 28 dirs)
  Task #242 — this PR

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 07:09:40 -07:00
claude-ceo-assistant
c72d0a5383 harden(org-external): token via http.extraHeader, .complete cache marker, ref '..' deny, naming cleanup
Self-review of molecule-core PR #105 + #106 (the !external resolver
chain) surfaced 3 real correctness/security gaps and 2 readability
nits. Fixes all four in one PR since they're the same file's hardening.

(1) TOKEN LEAKAGE — fixed
  Before: gitFetcher built clone URLs with auth in userinfo
    (https://oauth2:TOKEN@host/repo.git). Two leak paths:
      a. Token persisted in cloned repo's .git/config
      b. Token could appear in clone error output captured via
         cmd.CombinedOutput()
  After: clone URL has no userinfo (https://host/repo.git). Auth is
    layered on via -c http.extraHeader=Authorization: token ...
    which sends the header per-request without persisting. Plus a
    redactToken() pass over any error string before it surfaces in
    fmt.Errorf, as belt-and-braces.
  Tradeoff: token now visible in 'ps aux' for the duration of the
    git child process (same as before via env var), but no longer
    in any persistent state.

(2) CACHE-VALIDITY FOOTGUN — fixed
  Before: cache-hit was 'cacheDir/.git exists'. A clone interrupted
    after .git was created but before content finished writing would
    leave a partially-written cache that subsequent imports treated
    as hit, returning stale/incomplete content forever (no self-heal).
  After: cache-hit also requires a .complete marker file written
    only AFTER successful clone+rename. Partially-written cache is
    treated as cache-miss and re-fetched cleanly (after RemoveAll
    on the partial dir to avoid blocking the new clone's mkdir).

(3) REF '..' DENY — fixed
  Before: safeRefPattern '^[a-zA-Z0-9_./-]+$' allowed '..' as a
    substring. Git itself rejects most refs containing '..', but
    defense-in-depth says don't depend on the downstream tool's
    validation when sanitizing input at the boundary.
  After: explicit strings.Contains(ref.Ref, '..') check.

(4) NAMING CLEANUP — fixed
  Before: rewriteFilesDirAndIncludes() — name claims to rewrite
    !include scalars but doesn't (we removed that during PR-A
    development; double-prefix bug). Misleading for readers.
  After: rewriteFilesDir(). Docstring updated to explicitly explain
    why !include paths are NOT rewritten (relative to subDir, naturally
    inside cache).
  Also: removed unused buildAuthedURL() (replaced by
    buildExternalCloneURL + authConfigArgs split), removed unused
    shortHash() helper (replaced by os.MkdirTemp), removed unused
    crypto/sha1 + encoding/hex + fmt imports, removed stray
    '_ = fmt.Sprint' line in integration test.

NEW TESTS
  - TestGitFetcher_RejectsRefWithDoubleDot (defense-in-depth on ref input)
  - TestGitFetcher_CacheValidatedByCompleteMarker (partial cache → re-fetch)

VERIFIED LOCALLY 2026-05-08
  Full ./internal/handlers/ suite: ok (7.8s, 14 external-resolver tests
  + all existing tests). Two new tests cover the two new behaviors.

Refs:
  internal#77 — extraction RFC
  molecule-core#105 (resolver), #106 (tests) — original implementation
  Hongming code-review-and-quality skill invocation 2026-05-08 + 'fix all'
2026-05-08 05:54:54 -07:00
claude-ceo-assistant
89c5567d79 test(org-external): integration test against local bare-git + e2e against live Gitea (PR-B + PR-C)
PR-B (local bare-git integration, task #233):
  workspace-server/internal/handlers/org_external_integration_test.go
  Three tests using git's GIT_CONFIG_COUNT/KEY/VALUE env-var-injected
  insteadOf URL rewrite — process-scoped, no ~/.gitconfig pollution:

  - TestGitFetcher_RealClone_LocalRedirect: full resolver chain end-to-
    end with REAL git clone against a local bare-repo, asserts cache
    population + content materialization + path rewrite + cache-hit on
    second invocation.
  - TestGitFetcher_RealClone_BadRefFails: nonexistent ref surfaces
    git's error cleanly through the ls-remote step.
  - TestGitFetcher_DirectFetch_CacheHit: gitFetcher.Fetch direct
    invocation (no resolver wrapping); verifies cache-hit returns
    same dir + same SHA, no clobber.

  Production code untouched — insteadOf rewrite makes the production
  gitFetcher think it's cloning from Gitea, but git rewrites at clone
  time to file://<barePath>. Tests the real shell-out + parsing.

PR-C (live Gitea e2e, task #234):
  workspace-server/internal/handlers/local_e2e_dev_dept_test.go
  TestLocalE2E_ExternalDevDepartment — minimal parent template that
  uses !external against the LIVE molecule-ai/molecule-dev-department
  repo. No symlink, no /tmp/local-e2e-deploy fixture. Composition
  resolves over network at import time.

  Asserts:
    - 28+ dev-tree workspaces resolve through the fetched cache
      (matches the count from TestLocalE2E_DevDepartmentExtraction)
    - Q1 placement: 'Documentation Specialist' present (under app-lead)
    - Q2 placement: 'Triage Operator' present (under dev-lead)
    - Every workspace's files_dir is cache-prefixed (proves rewrite ran)
    - Every workspace's resolveInsideRoot+Stat succeeds
      (would fail provisioning if not)

  Skipped if Gitea unreachable (TCP probe to git.moleculesai.app:443)
  or git binary absent — won't false-fail offline runners.

VERIFIED LOCALLY 2026-05-08:
  --- PASS: TestGitFetcher_RealClone_LocalRedirect (0.26s)
  --- PASS: TestGitFetcher_RealClone_BadRefFails    (0.15s)
  --- PASS: TestGitFetcher_DirectFetch_CacheHit     (0.23s)
  --- PASS: TestLocalE2E_ExternalDevDepartment      (0.55s)
  workspaces resolved through !external: 28
  Full ./internal/handlers/ test suite: ok (no regressions)

Together with PR-A's unit tests (#105), the !external resolver is now
covered at three layers:
  - unit (fakeFetcher injection): allowlist, validation, path rewrite
  - integration (real git, local bare-repo): clone, cache, ls-remote
  - e2e (real git, live Gitea, live dev-department): full chain

Refs:
  internal#77 — extraction RFC (Phase 3a phasing in comment 1995)
  task #233 (PR-B), task #234 (PR-C)
  Hongming GO 2026-05-08 ('do PR-B/C/D')
2026-05-08 05:30:04 -07:00
claude-ceo-assistant
257d6c1b5a feat(org-import): !external cross-repo subtree resolver (Phase 3a, internal#77 / task #222)
Adds gitops-style cross-repo subtree composition to the platform's
org-template importer. Replaces (eventually) the operator-side
filesystem symlink approach shipped in PR #5.

DESIGN
  See internal#77 comment 1995 for the full design doc + decision points
  agreed with Hongming 2026-05-08.

  Schema: a `!external`-tagged mapping anywhere a workspace entry is
  allowed (workspaces:, roots:, children:):

    - !external
      repo: molecule-ai/molecule-dev-department
      ref: main
      path: dev-lead/workspace.yaml
      url: git.moleculesai.app    # optional; default = MOLECULE_EXTERNAL_GITEA_URL or git.moleculesai.app

  At resolve time the platform fetches the repo at ref into a content-
  addressable cache under <orgBaseDir>/.external-cache/<repo>/<sha>/,
  loads <cacheDir>/<path>, recursively resolves nested !include /
  !external in the loaded subtree, then rewrites every files_dir scalar
  in the fully-resolved subtree to be cache-prefixed. Downstream
  pipeline (resolveInsideRoot, plugin merge, CopyTemplateToContainer)
  sees ordinary in-tree paths.

IMPLEMENTATION
  - org_external.go: ExternalRef type, fetcher interface (gitFetcher
    production + injectable for tests), resolveExternalMapping resolver,
    rewriteFilesDirAndIncludes path-rewrite walker, allowlistedHostPath
    + safeRefPattern + safeRepoCacheDir validation helpers.
  - org_include.go: 4-line hook in expandNode dispatching MappingNode
    with Tag=="!external" to resolveExternalMapping.
  - org_external_test.go: 8 unit tests with fakeFetcher injection
    (no network):
      * happy path (top + nested workspace files_dir cache-prefixed)
      * allowlist rejection (github.com/foo/bar)
      * path-traversal rejection (../../etc/passwd)
      * malformed ref rejection ("main; rm -rf /")
      * missing required fields (repo / ref / path)
      * rewriteFilesDirAndIncludes basic + idempotent
      * allowlistedHostPath env-override + glob

  Path rewrite ONLY rewrites files_dir scalars. !include scalars are
  NOT rewritten — they resolve relative to their containing file's
  directory, which post-fetch is naturally inside the cache, so
  relative !includes Just Work without modification.

ALLOWLIST + AUTH
  - Default allowlist: git.moleculesai.app/molecule-ai/.
  - Override: MOLECULE_EXTERNAL_REPO_ALLOWLIST (comma-separated
    prefixes; trailing /* or / supported).
  - Auth: MOLECULE_GITEA_TOKEN env var injected into clone URL.
    Optional — falls back to unauthenticated for public repos.
  - Reject: malformed refs, path-traversal, non-allowlisted hosts.

CACHE
  - Location: <orgBaseDir>/.external-cache/<safe-repo>/<sha>/.
    Operators add to .gitignore.
  - Content-addressable: same (repo, sha) reuses cache, no overwrite.
  - Atomic clone via tmp-then-rename.
  - Concurrency: race-tolerant — last-writer-wins on same SHA.
    GC out of scope for v1 (filed as parked follow-up).

SECURITY (per SOP Phase 2)
  Untrusted yaml input — all validated:
    repo: allowlist (default molecule-ai/* on Gitea host)
    ref:  ^[a-zA-Z0-9_./-]+$ regex (rejects shell injection)
    path: relative-and-down-only (rejects ../escape)
  Auth: read-only token scoped to allowed orgs.
  Recursion: maxExternalDepth=4 (vs maxIncludeDepth=16) to limit
    network fan-out cost.
  Cache poisoning: per-(repo, sha) content-addressable; can't poison
    across SHAs.
  Trust boundary: cloned content treated identically to a sibling-
    cloned subtree (same model as current symlink approach).

VERSIONING / BACKWARDS COMPAT
  Pure additive. Existing !include and inline workspaces unchanged.
  Existing dev-lead symlink (parent template PR #5) keeps working.
  Migration of parent template to !external is a separate PR-D.
  No DB schema change. No public API change.

VERIFIED LOCALLY
  go test ./internal/handlers/ → ok (5.2s, all 8 new tests + existing)

  Stub fetcher injection lets unit tests cover the resolver +
  path-rewrite logic without network. PR-B (follow-up) adds an
  integration test against a local bare-git repo. PR-C adds the
  real-Gitea e2e test against the live dev-department repo.

Refs:
  internal#77 — extraction RFC (comment 1995 = Phase 1+2 design)
  task #222 — this PR is Phase 3a (PR-A in the design's phasing)
  Hongming GO 2026-05-08 ('go' on 4 decision points + design)
2026-05-08 05:17:55 -07:00
claude-ceo-assistant
3dcc7230f9 fix(provisioner)+test: EvalSymlinks templatePath; stage-2 e2e for files_dir consumption
Two changes that fall out of one root cause discovered while preparing
the local platform spin-up for the dev-department extraction (internal#77):

PROBLEM
  CopyTemplateToContainer's filepath.Walk is called with templatePath
  set to the workspace's resolved files_dir. With the cross-repo
  symlink composition shipped in PR #5 (parent template's
  dev-lead → ../molecule-dev-department/dev-lead/), the Dev Lead
  workspace's files_dir is literally 'dev-lead' — i.e. the symlink
  itself, not a path THROUGH the symlink.

  filepath.Walk does not descend into a symlink leaf — it Lstats the
  root, sees a symlink (mode bit set, not a directory), emits exactly
  one entry, and returns. Result: the workspace's /configs/ tar would
  ship empty. Other 38 workspaces are fine because their files_dir
  paths just TRAVERSE the symlink (path resolution handles intermediate
  symlinks via Lstat traversal); only the leaf-is-symlink case breaks.

FIX
  workspace-server/internal/provisioner/provisioner.go:
    Call filepath.EvalSymlinks on templatePath before filepath.Walk.
    Resolves the leaf-symlink case for ALL templates, not just dev-dept.
    Security: templatePath has already passed resolveInsideRoot's
    path-string check at the call site; the trust boundary is the
    operator-side /org-templates/ filesystem layout, not this
    resolution step.

TEST
  workspace-server/internal/handlers/local_e2e_dev_dept_test.go:
    New TestLocalE2E_FilesDirConsumption — stage-2 of the local e2e.
    For every workspace in the resolved OrgTemplate, asserts:
      1. resolveInsideRoot(orgBaseDir, ws.FilesDir) succeeds.
      2. os.Stat on the result returns a directory.
      3. filepath.Walk after EvalSymlinks (mirroring the platform fix)
         emits at least one file.
      4. At least one workspace marker exists (workspace.yaml,
         system-prompt.md, or initial-prompt.md).
    Exercises the SECOND half of POST /org/import that
    TestLocalE2E_DevDepartmentExtraction (PR #103) didn't cover.

VERIFIED LOCALLY (2026-05-08, against post-extraction Gitea state):
  --- PASS: TestLocalE2E_FilesDirConsumption (0.05s)
  checked 39 workspaces with files_dir
  All 39 walk paths emit non-empty file sets with valid workspace markers.

REGRESSION GUARD
  Without the EvalSymlinks fix, this test fails on Dev Lead with:
    files_dir 'dev-lead' at '/.../molecule-dev/dev-lead' is empty —
    CopyTemplateToContainer would produce empty /configs/

Refs:
  internal#77 — extraction RFC
  molecule-core#102 (resolver symlink contract test)
  molecule-core#103 (stage-1 e2e: include resolution)
  Hongming GO 2026-05-08 ('go' on the 3 pre-spin-up optimizations)
2026-05-08 04:46:33 -07:00
claude-ceo-assistant
3adbbacf2e test(local-e2e): verify dev-department extraction end-to-end via real resolveYAMLIncludes
Phase 4 (local-only) of internal#77 (dev-department extraction).

Adds TestLocalE2E_DevDepartmentExtraction that exercises the FULL platform
import path against the real molecule-ai-org-template-molecule-dev (post-slim)
and molecule-ai/molecule-dev-department (post-atomize) repos cloned as siblings
under /tmp/local-e2e-deploy/.

What it proves end-to-end:
  - The dev-lead symlink at parent's template root is followed by
    resolveYAMLIncludes (filepath.Abs/Rel-style security check passes,
    os.ReadFile follows the link).
  - Recursive !include chain through the symlinked subtree resolves:
    parent's org.yaml → !include dev-lead/workspace.yaml (symlinked)
    → !include ./core-lead/workspace.yaml → !include ./core-be/workspace.yaml
    (atomized children: paths, no '..').
  - 39 workspaces enumerate after resolution: 5 PM-tree + 6 Marketing-tree
    + 28 dev-tree (Dev Lead + 5 sub-team leads + 18 leaf workspaces +
    3 floaters + 1 triage-operator).
  - Q1+Q2 placements verified by sentinel name check: 'Documentation
    Specialist' is reachable (under app-lead via app-docs sub-team),
    'Triage Operator' is reachable (direct child of Dev Lead).

Test skips with t.Skipf if the local-e2e fixture isn't present on the
host — won't block CI on hosts that haven't set it up. To set up locally:

  TESTROOT=/tmp/local-e2e-deploy
  mkdir -p $TESTROOT && cd $TESTROOT
  git clone https://git.moleculesai.app/molecule-ai/molecule-ai-org-template-molecule-dev.git molecule-dev
  git clone https://git.moleculesai.app/molecule-ai/molecule-dev-department.git
  cd /Users/<you>/molecule-core/workspace-server
  go test -v -run TestLocalE2E_DevDepartmentExtraction ./internal/handlers/

Verified locally 2026-05-08:
  --- PASS: TestLocalE2E_DevDepartmentExtraction (0.01s)
  total workspaces (recursive): 39

Refs:
  internal#77 — extraction RFC
  molecule-core PR #102 — symlink-resolution contract test
  molecule-ai/molecule-dev-department PRs #1, #2, #3 (scaffold + extract + atomize)
  molecule-ai/molecule-ai-org-template-molecule-dev PR #5 (parent slim + symlink wire)
  Hongming GO 2026-05-08 ('lets not go for staging right now, we do local test first')
  SOP Phase 4 (local) — task #226
2026-05-08 04:24:47 -07:00
claude-ceo-assistant
78c4b9b74f test(org-include): pin symlink-based subtree composition contract
Two new tests in workspace-server/internal/handlers/org_include_test.go:

- TestResolveYAMLIncludes_FollowsDirectorySymlink: parent template's
  org.yaml `!include`s into a sibling-repo subtree via a relative
  directory symlink. The resolver's filepath.Abs/Rel security check
  operates on path strings (passes), and os.ReadFile follows the
  symlink at OS layer (file content delivered). Recursive nested
  `!include`s within the symlinked subtree resolve correctly because
  filepath.Dir(absTarget) keeps the literal symlink path as currentDir.

- TestResolveYAMLIncludes_RejectsSymlinkEscapingRoot: companion test
  pinning current behavior where a symlink target outside the parent
  root is followed (resolveInsideRoot doesn't EvalSymlinks). Asserted
  as 'should resolve' so future hardening (if filepath.EvalSymlinks
  is added) flips the test red and forces a coordinated update to the
  dev-department subtree-composition pattern.

Why now: internal#77 RFC (dev-department extraction) selects symlink-
based composition over a future platform-level external: ref. These
tests pin the contract before the operator-side symlink convention
gets shipped, so a refactor or hardening of the resolver can't
silently break the production org-import path.

No production code changes. Pure additive test coverage.

Refs: internal#77 (Phase 3b verification — task #223)
2026-05-07 20:42:38 -07:00
7f61206a18 Merge branch 'staging' into fix/saas-plugin-install-eic 2026-05-08 00:21:10 +00:00
6d7554d282 chore: sync main → staging (auto, d84d88ad) 2026-05-07 23:38:08 +00:00
6bb272360d Merge branch 'main' into feat/issue-63-local-build-from-gitea-v2 2026-05-07 23:33:03 +00:00
b664691051 fix(eic-tunnel-pool): capture poolJanitorInterval at pool construction
Closes the chronic -race flake on TestPooledWithEICTunnel_PanicPoisonsEntry
and the handlers package as a whole (CI / Platform (Go) was intermittent
on staging, ~50% red on workspace-server-touching commits since 2026-04).

The race: tests swap the package-level poolJanitorInterval via t.Cleanup
(eic_tunnel_pool_test.go:61) AFTER an earlier test caused the global pool's
janitor goroutine to start. The janitor loops on time.NewTicker(poolJanitorInterval)
on every tick — so the cleanup write races the goroutine read for the rest
of the process. Caught locally + on PR #84's CI run on Gitea.

Fix: capture the interval as a field on eicTunnelPool at newEICTunnelPool().
The janitor now reads p.janitorInterval, which never changes after construction.
Tests that override poolJanitorInterval before freshPool() still get the new
value (they set the package var before construction). The global pool's
janitor — created lazily once via sync.Once on first getEICTunnelPool() —
is now immune to t.Cleanup-driven swaps from later tests.

Surfaced while verifying #84 (SaaS plugin install via EIC SSH); folded
into this PR per the "fix root not symptom" rule rather than merging
around a chronic-red CI signal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 16:01:11 -07:00
7e2cca7fad chore: sync main → staging (auto, e7660618) 2026-05-07 23:00:21 +00:00
16868c4ec1 fix(plugins): SaaS (EC2-per-workspace) install/uninstall via EIC SSH
Closes the 🔴 docker-only row in docs/architecture/backends.md. Plugin
install on every SaaS tenant currently 503s with "workspace container
not running" because the handler is hardcoded to Docker exec but SaaS
workspaces live on per-workspace EC2s. Caught on hongming.moleculesai.app
when canvas POST /workspaces/<id>/plugins surfaced the error.

Mirrors the Files API PR #1702 pattern: dispatch on workspaces.instance_id
in deliverToContainer (and Uninstall). When set, push the staged plugin
tarball to the EC2 over the existing withEICTunnel primitive
(template_files_eic.go) and unpack into the runtime's bind-mounted config
dir (/configs for claude-code, /home/ubuntu/.hermes for hermes — see
workspaceFilePathPrefix). chown 1000:1000 to match the docker path's
agent-uid contract; restart via the existing dispatcher.

Direct host write rather than docker-cp via SSH because the runtime's
config dir is already bind-mounted into the workspace container — the
runtime sees the files on next start with no additional plumbing.

Adds InstanceIDLookup (parallel to RuntimeLookup) so unit tests don't
need a DB; production wires it in router.go like templates.go does.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 15:42:51 -07:00
d9e380c5bc feat(workspace-server): local-dev provisioner builds from Gitea source when MOLECULE_IMAGE_REGISTRY is unset (#63, Task #194)
OSS contributors who clone molecule-core and `go run ./workspace-server/cmd/server`
now get a working end-to-end provision without authenticating to GHCR or AWS ECR.

Pre-fix: with MOLECULE_IMAGE_REGISTRY unset, the provisioner attempted to pull
ghcr.io/molecule-ai/workspace-template-<runtime>:latest, which has been
returning 403 since the 2026-05-06 GitHub-org suspension.

Post-fix: when MOLECULE_IMAGE_REGISTRY is unset, the provisioner switches to
local-build mode — looks up the workspace-template-<runtime> repo's HEAD sha
on Gitea via a single API call, shallow-clones into ~/.cache/molecule/, and
runs `docker build --platform=linux/amd64`. SHA-pinned cache key skips the
clone+build entirely on subsequent provisions.

Production tenants are unaffected: every prod tenant sets the var to its
private ECR mirror, so the SaaS pull path is byte-for-byte identical.

SSOT for mode detection lives in Resolve() (registry_mode.go) returning a
discriminated RegistrySource{Mode, Prefix} so call sites that branch on
mode get a compile-time push instead of a string-equality footgun.

Coverage:
* registry_mode.go            — new SSOT (Resolve, RegistryMode, IsKnownRuntime)
* registry_mode_test.go       — 8 tests pinning mode-decision contract
* localbuild.go               — clone+build pipeline (570 LOC, fully unit-tested)
* localbuild_test.go          — 22 tests covering happy/sad paths, fail-closed
* provisioner.go              — Start() inserts ensureLocalImageHook in local mode
* docs/adr/ADR-002            — design rationale + alternatives + security review
* docs/development/local-development.md — local-build flow + env overrides

Security:
* Allowlist-only runtime names (knownRuntimes) gate the clone path.
* Repo prefix hardcoded to git.moleculesai.app/molecule-ai/molecule-ai-workspace-template-;
  forks via opt-in MOLECULE_LOCAL_TEMPLATE_REPO_PREFIX.
* MOLECULE_GITEA_TOKEN masked in every log line via maskTokenInURL/maskTokenInString.
* Fail-closed: Gitea unreachable / runtime not mirrored → clear error, never
  silently fall back to GHCR/ECR.
* docker build invocation passes no --build-arg from external input.
* HTTP body cap 64KB on Gitea API responses (defence vs malicious upstream).

Closes #63 / Task #194.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 15:16:51 -07:00