internal#189: replaces the OR-gate ("≥1 approver from eligible teams")
with an AND-gate ("all required clauses must each have ≥1 approver").
New TIER_EXPR map (single source of truth at top of script):
tier:low → engineers,managers,ceo (OR, same as before)
tier:medium → managers AND engineers AND qa???,security??? (AND)
tier:high → ceo (single-team, framework wired for future AND)
"???" suffix: teams not yet created in Gitea (qa, security). The
expression always fails for these until the teams are created and the
markers are removed. The clear error message guides ops to create them.
Expression syntax documented at top of script. Clause-level pass/fail is
annotated in the notice/error lines so PR authors can see exactly which
gate is missing without SOP_DEBUG=1.
BURN-IN (internal#189 Phase 1): continue-on-error: true on the job
prevents AND-composition from blocking PRs during the 7-day window.
Remove after 2026-05-17 per the workflow BURN-IN NOTE comment.
SOP_LEGACY_CHECK=1 env var: forces OR-gate for individual runs,
enabling a grace window for PRs in-flight at deploy time.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
SettingsButton: gear button render, aria-expanded, active class toggle,
openPanel/closePanel calls, forwardRef, Radix Tooltip mock.
TopBar: header render, canvas name display, "+ New Agent" button,
SettingsButton integration, logo aria-hidden.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Root cause:
Dockerfile.tenant chowns /canvas /platform /memory-plugin /migrations
to canvas:canvas (line ~119) but not /org-templates. The image is
built as root, COPY-ed templates inherit root:root 0755. The platform
binary then runs as the canvas user (uid 1000) because of the USER
directive on line ~124, so when the !external resolver
(org_external.go, internal#77 / task #222) tries
os.MkdirAll("/org-templates/<tmpl>/.external-cache/<repo>") on first
import, mkdir(2) returns EACCES and the import handler returns 400
"org template expansion failed" (org.go:592). The user-facing error
is generic; only the server log carries:
Org import: refusing import: !include expansion failed:
!external at line 156: fetch git.moleculesai.app/molecule-ai/molecule-dev-department@v1.0.0:
mkdir cache root: mkdir /org-templates/molecule-dev/.external-cache: permission denied
Repro:
Tenant staging-cplead-2 (canary AWS 004947743811, image SHA
a93c4ce17725...). POST /org/import {"dir":"molecule-dev"} returns 400
while POST /org/import {"dir":"free-beats-all"} returns 201 — only
templates with !external trip the bug.
Fix:
Add /org-templates to the chown -R argv. One-line change. Same
ownership shape as the other writable platform-state dirs.
Why this is safe for prod:
* The platform binary already needs read access to /org-templates,
so canvas:canvas owning it doesn't widen any attack surface.
* /org-templates is image-resident, not bind-mounted; chown applies
inside the image layers and prod tenants get the fix on next
image rebuild + redeploy. Live prod tenants are unaffected until
the next deploy (no orgs currently using !external in prod —
molecule-dev consumers are all internal staging).
Verification:
After hand-applying the chown live (docker exec --user 0 ... chown -R
canvas:canvas /org-templates/molecule-dev), POST /org/import
{"dir":"molecule-dev"} returns 201 with 39 workspaces; cp-lead +
CP-BE + CP-QA + CP-Security all reach status=online within ~2 min.
Refs:
internal#77 — !external RFC (Phase 3a)
task #222 — resolver PR (introduced the unflagged-permission
dependency this fixes)
Live incident 2026-05-10 — staging-cplead-2 import failed,
chown-on-host workaround in place pending image rebuild
Issue #212: POST /workspaces with runtime=external and a URL wrote the
URL directly to the DB without validateAgentURL checking (the same check
that registry.go:324 applies to the heartbeat path). An attacker with
AdminAuth could register a workspace URL at a cloud metadata endpoint
(169.254.169.254) and exfiltrate IAM credentials when the platform
fires pre-restart drain signals.
Changes:
- workspace.go: add validateAgentURL(payload.URL) guard before the
UPDATE at line 386. 400 on unsafe URL, no DB write occurs.
- workspace_test.go: add 3 regression tests:
- TestWorkspaceCreate_ExternalURL_SSRFSafe: safe public URL → 201
- TestWorkspaceCreate_ExternalURL_SSRFMetadataBlocked: 169.254.169.254 → 400
- TestWorkspaceCreate_ExternalURL_SSRFLoopbackBlocked: 127.0.0.1 → 400
Both unsafe tests assert zero DB calls (the handler rejects before
any transaction).
Ref: issue #212.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Issue #214: documents the MOLECULE_ENV=production requirement for
staging/prod tenants to lock the /admin/workspaces/:id/test-token route.
Also adds a startup INFO log in main.go when the route is enabled, so
operators can confirm the setting in boot logs without having to probe
the endpoint directly.
Ref: issue #214.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a 4th fallback step to the token chain (cache > API > env > static)
so workspace git/gh operations survive a platform outage without requiring
a restart or platform-side fix. Addresses the 2026-05-08 incident where
every workspace lost git+gh auth simultaneously when the
/github-installation-token endpoint returned 500.
Operator places a PAT in ${CONFIGS_DIR:-/configs}/.github-token
(no root needed — /configs is agent-writable). Both _fetch_token
(git credential helper path) and _refresh_gh (gh CLI daemon path)
gain the static fallback so git and gh both recover post-incident.
Pure additive — existing cache > API > env chain is unchanged.
Empty static file is rejected (whitespace-stripped before use).
Static path never writes the cache, so the API recovers transparently
on the next refresh cycle when it comes back online.
Ref: issue #140.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
StatusBadge: all 3 status variants, aria-label, role=status, config class names.
ValidationHint: error/valid/neutral states, warning icon, valid icon, class names.
Spinner: sm/md/lg size classes, aria-hidden, motion-safe:animate-spin.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Root cause of issue #213: canary-verify.yml still used GHCR
(ghcr.io/molecule-ai/platform-tenant) while
publish-workspace-server-image.yml migrated to ECR on 2026-05-07
(commit 10e510f5). Canary smoke tests were silently testing a stale
GHCR image while actual staging/prod tenants ran the ECR build.
The POST /org/import and POST /workspaces routes were missing from
the ECR binary (likely a Docker layer-caching artefact during the
staging push window) but smoke tests passed because they never tested
the ECR image at all.
Changes:
- canary-verify.yml: migrate promote-to-latest from GHCR crane tag
ops to the CP redeploy-fleet endpoint (same mechanism as
redeploy-tenants-on-main.yml). The wait-for-canaries step already
read SHA from the running tenant /health (registry-agnostic), so
no change needed there. Pre-fix promote step used `crane tag` against
GHCR, which was never updated after the ECR migration.
- redeploy-tenants-on-main.yml: update stale comments that reference
GHCR to reflect ECR; replace the 30s GHCR CDN propagation wait
with a no-op comment (ECR has no CDN cache to wait for).
- scripts/canary-smoke.sh: add POST /org/import and POST /workspaces
smoke tests (steps 6-8). These assert HTTP 401 unauthenticated
(proves AdminAuth enforced AND the route is compiled in — 404 would
mean route missing from binary). GET /workspaces was already covered;
POST was the untested gap.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
publish-runtime.yml was dead on Gitea Actions because Gitea reads
.gitea/workflows/, not .github/workflows/ (the GitHub Actions paths are
ignored). Issue #206 identified this as one of three bugs blocking the
runtime versioning pipeline.
Changes:
- Add .gitea/workflows/publish-runtime.yml (canonical Gitea version)
- Drop environment: + id-token: write (Gitea has no OIDC/OAuth)
- Replace pypa/gh-action-pypi-publish with twine upload using PYPI_TOKEN secret
- Replace github.ref_name with ${GITHUB_REF#refs/tags/} (Gitea exposes github.ref)
- Drop merge_group trigger (Gitea has no merge queue)
- Drop staging branch trigger (staging branch does not exist)
- Cascade step unchanged (DISPATCH_TOKEN + Gitea API already compatible)
- Add DEPRECATED notice to .github/workflows/publish-runtime.yml
Required secrets (repo Settings → Actions → Variables and Secrets):
PYPI_TOKEN: PyPI API token for molecule-ai-workspace-runtime
DISPATCH_TOKEN: Gitea PAT with write:repo on template repos (already used)
Closes#206 (publish-runtime Gitea port).
dorny/paths-filter is GitHub-Actions-only and does not work correctly on
Gitea Actions — it silently returns no file changes regardless of what
files were modified, causing the harness-replays workflow to silently
skip on Gitea even when workspace-server/** or canvas/** files change.
Verified: zero harness-replays statuses on PR #188 and #168 (both changed
workspace-server files) vs GitHub Actions where the same workflow
correctly detects changes.
Replace with a shell-based approach that uses:
- github.event.pull_request.base.sha (Gitea + GitHub: merge-base for PRs)
- github.event.before (Gitea + GitHub: previous tip for pushes)
- git diff --name-only <BASE> github.sha (portable git, works on both platforms)
Also adds detect-changes.debug output so future no-op passes show WHY
the workflow decided to skip, and the first real run on Gitea will
confirm the diff detection is working.
Closes#141 (followup: root-cause fix still TBD — failure logs
inaccessible via Gitea Actions API).
Adds a note to the audit doc footer tracking the new component tests
(PR #205: Tooltip, Legend, TermsGate, ApprovalBanner) and bumps the
updated date.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
## Summary
Adds the version-subscription drift detection and operator-apply workflow for
per-workspace plugin tracking (core#113).
## Components
**Migration** (`20260510000000_plugin_drift_queue`):
- Adds `installed_sha` column to `workspace_plugins` — records the commit SHA
installed so the drift sweeper can compare against upstream.
- Creates `plugin_update_queue` table with status: pending | applied | dismissed.
- Adds partial unique index to prevent duplicate pending rows per
(workspace_id, plugin_name).
**GithubResolver** (`github.go`):
- `LastFetchSHA` field + `LastSHA()` getter — populated by `Fetch` after a
successful shallow clone (captured before `.git` is stripped). Used by the
install pipeline to seed `installed_sha`.
- `ResolveRef(ctx, spec)` method — resolves a plugin spec to its full commit
SHA using `git fetch --depth=1 + git rev-parse`. Used by the drift sweeper
to get the current upstream SHA for a tracked ref (tag:vX.Y.Z, tag:latest,
sha:…, or bare branch).
**Drift sweeper** (`plugins/drift_sweeper.go`):
- Periodic sweep every 1h: SELECTs rows where `tracked_ref != 'none' AND
installed_sha IS NOT NULL`, resolves upstream SHA, queues drift if different.
- `ListPendingUpdates()` — reads pending queue rows for the admin endpoint.
- `ApplyDriftUpdate()` — marks entry applied (idempotent).
- ctx.Err() guard on ticker arm to avoid post-shutdown work.
**Install pipeline** (`plugins_install_pipeline.go`, `plugins_tracking.go`,
`plugins_install.go`):
- `stageResult.InstalledSHA` field — carries the SHA from Fetch to the DB.
- `recordWorkspacePluginInstall` now accepts and stores `installed_sha`.
- `deleteWorkspacePluginRow` — removes tracking row on uninstall so a stale
SHA doesn't prevent the next install from creating a fresh row.
- Both Docker and EIC uninstall paths call `deleteWorkspacePluginRow`.
**Admin endpoints** (`handlers/admin_plugin_drift.go`):
- `GET /admin/plugin-updates-pending` — list all pending drift entries.
- `POST /admin/plugin-updates/:id/apply` — re-installs plugin from source_raw
(re-fetching the same tracked ref), records the new SHA, marks entry applied,
triggers workspace restart. Idempotent (already-applied returns 200).
**Router wiring** (`router.go`, `cmd/server/main.go`):
- Plugin registry created in main.go and shared between PluginsHandler and drift
sweeper.
- `router.Setup` accepts optional `pluginResolver` param.
- `PluginsHandler.Sources()` export for the sweeper wiring pattern.
## Tests
- `plugins/github_test.go` — `ResolveRef` coverage (invalid spec, git error,
not-found mapping, no-panic for all ref shapes).
- `plugins/drift_sweeper_test.go` — `ResolveRef` happy path, stub resolver
interface compliance.
- `handlers/admin_plugin_drift_test.go` — ListPending (empty, non-empty, DB
error), Apply (not found, already applied, already dismissed, workspace_plugins
missing).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add 10 tests for StatusDot covering:
- All known STATUS_CONFIG statuses (online, offline, degraded,
failed, paused, not_configured, provisioning)
- Correct color class applied per status
- Glow class applied when declared in STATUS_CONFIG
- motion-safe:animate-pulse on provisioning status
- Fallback to bg-zinc-500 for unknown status
- size prop (sm/md) applies correct Tailwind dimension class
- aria-hidden="true" for accessibility tree isolation
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Pre-commit Hook: moved stray "Action:" line inside the section (was appended to
WCAG entry below it after a rebase conflict resolution)
- Removed duplicate text-ink-soft WCAG AA entry (lines 62-68 were a rebase artifact)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Controls: all three buttons (zoom in/out/fit) have aria-label attributes from
React Flow; verified from @xyflow/react source (index.mjs:4453). Removed "verify
if keyboard accessible" caveat.
- MiniMap: actually present in Canvas.tsx (rendered at line 310). The old audit
note "not present (mocked as null in tests)" referred to the minimap being absent
from unit test renders, not from production. Updated to reflect actual presence
and status-coloring behavior.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Fix arrow-key nudge description: was "20px/100px" (wrong), now "10px/50px" (matches useKeyboardShortcuts)
- Add Cmd/Ctrl+Arrow resize shortcut row to dialog (missing since PR #192)
- Fix 3 tests in useKeyboardShortcuts.test.tsx that asserted shrink below min dimensions:
"resizes height down" expected height:100, clamped to 110 (node starts at minHeight)
"resizes width down" expected width:200, clamped to 210 (node starts at minWidth)
"2px step with Shift" expected height:108, clamped to 110 (minHeight wins)
All three tests updated to assert clamped values with explanatory comments.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pins all FROM image tags to exact SHA256 digests for reproducible
builds. Without digest pinning, a registry push of a new image to the
same tag can silently change the layer content between builds — a
supply-chain risk especially for prod-deployed images.
Pinned images (7 Dockerfiles):
- golang:1.25-alpine → sha256:c4ea15b... (workspace-server/Dockerfile,
Dockerfile.dev, Dockerfile.tenant, tests/harness/cp-stub/Dockerfile)
- alpine:3.20 → sha256:c64c687c... (workspace-server/Dockerfile,
tests/harness/cp-stub/Dockerfile)
- node:20-alpine → sha256:afdf982... (workspace-server/Dockerfile.tenant)
- node:22-alpine → sha256:cb15fca... (canvas/Dockerfile)
- python:3.11-slim → sha256:e78299e... (workspace/Dockerfile)
- nginx:1.27-alpine → sha256:62223d6... (tests/harness/cf-proxy/Dockerfile)
Note: docker-compose.yml service images (postgres, redis, clickhouse,
litellm, ollama) are intentionally left on major-version tags — those
are runtime-pulled and updated regularly for local-dev ergonomics.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add 10 tests covering the Cmd/Ctrl+Arrow resize shortcut:
- ArrowUp/Down resizes height (−/+10px)
- ArrowLeft/Right resizes width (−/+10px)
- Shift modifier uses 2px step for fine control
- min-height constraint respected when shrinking
- Guard: no-op when no node selected
- Guard: skipped when modal dialog is open
- Plain arrow keys (no modifier) fire moveNode instead
- Alt+Arrow is skipped (not a resize combo)
Also extends the mock store state with `onNodesChange` and node
`width`/`height` fields needed for the resize tests.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Race-detector CI runs (-race) slow goroutines enough that a
prior sweeper goroutine (e.g. TestStartSweeper_TransientErrorDoesNotCrashLoop)
can still be running and incrementing pendingUploadsSweepErrors after
metricDelta() captures its baseline, but before the success-path sweeper
records its success metrics. The test then reads deltaError=1 instead of 0.
Fix: add waitForMetricDelta(t, deltaError, 0, 2*time.Second) before the
assertion, matching the polling pattern already used in the error-path
test (TestStartSweeper_RecordsMetricsOnError). This ensures the error
counter has settled before we assert on it.
Fixes molecule-core#22.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace all text-ink-soft usages across canvas components and app pages.
ink-soft (#8d92a0) on dark zinc (#0e1014) yields ~2.2:1 contrast,
failing WCAG 2.1 AA minimum of 4.5:1 for normal text.
ink-mid (#c8c2b4) on dark zinc yields ~7.6:1 — well above AA.
text-ink-mid is already the semantic token for secondary/caption text
in the warm-paper light mode; the dark-mode override was the gap.
52 files, 268 replacements. No functional change beyond contrast.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Canvas runs Next.js 15.5.15 (package-lock.json). Audit doc had
Next.js 14 App Router from before the upgrade. Also add
KeyboardShortcutsDialog.tsx to the directory structure tree.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Cmd/Ctrl+Arrow Up/Down resizes node height (±10px, ±2px with Shift).
Cmd/Ctrl+Arrow Left/Right resizes node width (±10px, ±2px with Shift).
Uses the same onNodesChange('dimensions') path that NodeResizer uses
— no new store action needed. Respects min-width/min-height matching
the NodeResizer constraints (360×200 with children, 210×110 without).
The Arrow-key move shortcut now skips when a modifier key is held,
so Cmd/Ctrl+Arrow unambiguously means resize (not move).
Updates canvas audit doc: Node Rendering section updated and
the LOW node-resize item marked done. All Remaining Gaps items
are now complete.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Issue #75 PR-D: two remaining `gh` CLI calls in .github/workflows/.
1. ci.yml canvas-deploy-reminder:
- Replaced `gh api POST repos/.../commits/.../comments` with writing
to GITHUB_STEP_SUMMARY. Gitea has no commit-comments API (confirmed
in issue #75), so the gh call always failed. GITHUB_STEP_SUMMARY works
on both GitHub Actions and Gitea Actions as the workflow-run summary
page, which is the natural place for post-deploy action items.
- Removed now-unnecessary GH_TOKEN env var and contents:write permission.
2. check-merge-group-trigger.yml:
- Converted to no-op stub. Gitea has no merge queue feature and no
merge_group: event type, so this workflow's lint would find nothing
to verify (all workflows vacuously pass). Keeping workflow+job name
unchanged preserves commit-status context names for branch protection
consumers. Dropped the merge_group: trigger since it would never fire
on Gitea. Dropped the full bash linter + gh api call.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Target handle (top of card): Enter/Space extracts this node from
its parent, moving it to the root level.
Source handle (bottom of card): Enter/Space nests the currently
selected node as a child of this node (requires another node to be
selected first).
Both handles gain tabIndex=0, role="button", a descriptive aria-label,
and a blue focus ring so keyboard-only users can navigate the
workspace hierarchy without a mouse. Uses the existing nestNode store
action — no new API surface needed.
Updates the canvas audit doc to mark the LOW edge-anchor item done.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Update go.mod require and main.go import to use the Gitea-hosted
module path go.moleculesai.app/plugin/gh-identity (migrated from
github.com/Molecule-AI/molecule-ai-plugin-gh-identity).
Follows the pattern of the org-template URL migrations (github.com ->
git.moleculesai.app) applied to Go module imports.
Fixes molecule-core#91.
Ref: molecule-internal#71.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PUT /workspaces/:id/files and DELETE /workspaces/:id/files updated the
config volume but never restarted the container, so the running agent
continued serving stale file content from its in-memory cache. The
SecretsHandler already had this pattern (issue #15); TemplatesHandler
was missing it.
Fix: after every successful write/delete in WriteFile, DeleteFile, and
ReplaceFiles, call h.wh.RestartByID(workspaceID) asynchronously, guarded
by h.wh != nil (nil-tolerant for callers that only use read-only
surfaces). The RestartByID coalescing gate prevents thundering-herd on
concurrent requests.
Fixes#151.
Fixes#87 (duplicate effort closed — core-be also filed #183).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The "Node Rendering" and "Drag and Drop" sections still said
"mouse only, no keyboard alternative" and "Keyboard alternative: None"
despite PR #182 (Arrow keys) being merged. Update both to reflect
the keyboard-accessible node drag.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Issue #86: TestStartSweeper_RecordsMetricsOnSuccess fails in full-suite.
Root cause: two cooperating bugs in the sweeper test harness.
1. Sweeper loop called sweepOnce after ctx cancellation (double-increment).
When ctx was cancelled the loop's select received ctx.Done(), called
sweepOnce with the cancelled ctx, storage.Sweep returned context error,
and metrics.PendingUploadsSweepError() incremented the error counter a
SECOND time before the loop exited. Subsequent tests captured a polluted
error baseline and their deltaError assertions failed.
2. Tests called defer cancel() without waiting for the goroutine to exit.
The goroutine could still be blocked on Sweep (waiting for the next
ticker's C channel) when the next test called metricDelta(). If the
goroutine's Sweep returned during the next test's measurement window,
the shared metric counters mutated mid-baseline.
Fix (production code):
- Guard the ticker arm: if ctx.Err() != nil, continue instead of calling
sweepOnce. This prevents the post-cancellation sweep from running.
Fix (test harness):
- startSweeperWithInterval gains a done chan struct{} parameter. When the
loop exits the channel is closed exactly once.
- StartSweeperForTest starts the goroutine and returns the done channel,
allowing tests to drain it with <-done after cancel() — guaranteeing
the goroutine has fully terminated before the next test's baseline.
All 8 sweeper tests now use StartSweeperForTest and drain the done
channel before returning, ensuring stable metric baselines across the
full suite.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
os.Chmod(dst, 0o555) silently passes when os.Geteuid() == 0 because
root bypasses POSIX permission checks. A previous attempt to use a
symlink to /dev/full also fails: Go's os.MkdirAll resolves the symlink
during path traversal and the kernel allows mkdir("/dev/full") as a
device-table entry — io.Copy to /dev/full then succeeds with 0 bytes
written and returns nil.
The honest, consistent fix mirrors TestLocalResolver_CopyFileSourceUnreadable:
skip when running as root. The write-failure propagation logic is
exercised correctly in non-root CI environments.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
fix(plugins/test): skip TestLocalResolver_BubblesUpCopyFailure when running as root
Fixes issue #87: the test sets chmod(dst, 0o555) to make the
destination read-only and asserts the copy fails. On Linux, root
bypasses filesystem permissions and can write to 0o555 directories,
so the copy succeeds when running as root and the assertion fails.
Fix: check os.Getuid() == 0 at the start of the test and skip with
a clear message. Mirrors the existing skip in
TestLocalResolver_CopyFileSourceUnreadable (line 175) which already
handles the same root-bypass issue for unreadable source files.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes canvas audit item: MEDIUM keyboard-accessible node drag.
- Arrow keys move the selected node by 10px per press; Shift+Arrow
moves by 50px. Position is persisted to the backend via savePosition.
- The modal-dialog guard (same pattern as ? shortcut) prevents Arrow
keys from moving nodes when a modal like KeyboardShortcutsDialog is
open — dialogs own their own arrow semantics.
- All shortcuts guarded by the inInput check so Arrow keys still work
for text navigation inside inputs/textareas.
Changes:
- canvas.ts: new moveNode(dx, dy) store action — updates position
directly without the grow-parents pass that onNodesChange runs on
every drag tick (avoids edge-chase flicker).
- useKeyboardShortcuts.ts: Arrow key handler added.
- canvas.test.ts: new moveNode unit tests (position update, no-op,
savePosition call).
- useKeyboardShortcuts.test.tsx: new integration tests for all
keyboard shortcuts including the new Arrow key handlers.
- canvas-audit-items.md: Keyboard Shortcuts section upgraded to ✅,
drag item marked done.
- canvas-events.test.ts: fix pre-existing double-}); syntax error.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
fix(tests): clear platform_auth cache before each test
Fixes issue #160: workspace tests fail when MOLECULE_WORKSPACE_TOKEN
is set in the environment.
The bug: platform_auth._cached_token is populated at module import or
first get_token() call and persists for the process lifetime. Tests
that use monkeypatch.delenv("MOLECULE_WORKSPACE_TOKEN") to simulate "no
token in env" were failing because delenv removes the env var but not
the module-level cache — subsequent get_token() calls returned the
stale cached value.
Fix: add a function-scoped autouse fixture in conftest.py that calls
platform_auth.clear_cache() before every test. The import is inside the
fixture to avoid collection-time import issues when platform_auth is
not yet available.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
[core-lead-agent] Closes the regression-test gap on PR #170 (Core-BE's
fix for #159 retry-storm). Original PR shipped the inline conditional
without a unit test; this commit:
1. Extracts the inline `(proxyErr != nil && len(respBody) > 0 && 2xx)`
predicate into a named helper `isDeliveryConfirmedSuccess`. Same
behavior; the call site now reads `if isDeliveryConfirmedSuccess(...)`.
2. Adds `TestIsDeliveryConfirmedSuccess` — 10-case table test covering:
- The new branch (2xx + body + transport error → recover as success):
status=200, status=299, status=200+min-body
- Each precondition failing in isolation:
* nil proxyErr → false (no decision)
* empty/nil body → false (no work to recover)
* 4xx/5xx/3xx body → false (agent-signalled failure or redirect)
* <200 status → false (not 2xx)
Test-pattern mirrors the existing `TestIsTransientProxyError_Retries...`
and `TestIsQueuedProxyResponse` table tests in the same file — same
file-local mock-error pattern, no new test infra.
fix: Treat delivery-confirmed proxy errors as delegation success
Two-part fix for issue #159 — successful delegation responses were
rendered as error banners:
PART 1 — a2a_proxy.go: When io.ReadAll fails mid-stream (e.g., TCP
connection drops after the agent sent its 200 OK response), the prior
code returned (0, nil, BadGateway) discarding both the HTTP status code
and any partial body bytes already received. Fix: return
(resp.StatusCode, respBody, error) so callers can inspect what was
delivered even when the body read failed.
PART 2 — delegation.go: New condition in executeDelegation after the
transient-error retry block:
if proxyErr != nil && len(respBody) > 0 && status >= 200 && status < 300 {
goto handleSuccess
}
When proxyA2ARequest returns a delivery-confirmed error (status 2xx +
non-empty partial body), route to success instead of failure. This
prevents the retry-storm pattern where the canvas shows "error" with
a Restart-workspace suggestion even though the delegation actually
completed and the response is available.
Regression tests (delegation_test.go):
- TestExecuteDelegation_DeliveryConfirmedProxyError_TreatsAsSuccess:
server sends 200 + partial body then closes; second attempt succeeds.
Verifies the new condition fires for delivery-confirmed 2xx responses.
- TestExecuteDelegation_ProxyErrorNon2xx_RemainsFailed: server sends
500 + partial body then closes. Verifies non-2xx routes to failure.
- TestExecuteDelegation_ProxyErrorEmptyBody_RemainsFailed: server
returns 502 Bad Gateway (empty body, transient). Verifies empty-body
errors still route to failure (condition len(respBody) > 0 guards it).
- TestExecuteDelegation_CleanProxyResponse_Unchanged: clean 200 OK.
Verifies baseline (proxyErr == nil path) is unaffected.
Fixes issue #159.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
fix(template_import): Remove silent template-dir fallback in ReplaceFiles offline path
When the workspace container is offline and writeViaEphemeral fails
(docker unavailable), ReplaceFiles previously fell back to writing
to the host-side template directory. This silently returned 200 with
"source: template" while the file change was invisible after restart
because the restart handler reads from the Docker volume, not the
template dir (issue #151).
Now returns 503 Service Unavailable with a message telling the caller
to retry after the workspace starts. The ephemeral write path is
the only correct mechanism for offline-container updates.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mark PR #175 (keyboard shortcuts dialog) as ✅ done.
Note that screen reader announcements (HIGH) is in progress by Core-FE.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Issue #160: workspace tests fail when MOLECULE_WORKSPACE_TOKEN is set in
the test environment (or when /configs/.auth_token exists on disk, as it
does in a container CI runner).
Root cause:
- test_resolve_token_returns_none_when_missing: monkeypatch.delenv()
removes the env var, but _resolve_token() falls through to
configs_dir.resolve()/.auth_token which exists in the container.
- Multi-workspace tests: clear_cache() resets _cached_token, but
get_token() immediately re-reads /configs/.auth_token and caches
the real token before the env var is even checked.
Fix:
- test_mcp_doctor: patch configs_dir.resolve() to return a bare tmp_path
so the disk-file fallback finds nothing.
- Multi-workspace tests: patch platform_auth._token_file() to return a
non-existent path (via tmp_path) alongside clear_cache(), ensuring
the env var wins as intended.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
fix: Treat delivery-confirmed proxy errors as delegation success
When proxyA2ARequest returns an error but we have a non-empty
response body with a 2xx status code, the agent completed the work
successfully. The error is a delivery/transport error (e.g., connection
reset after response was received).
Previously, executeDelegation would mark these as "failed" even though
the work was done, causing:
- Retry storms (canvas suggests restart, user retries)
- "error" rendering in canvas even though result is available
- Data loss risk from unnecessary restarts
Now we check for valid response data before marking as failed.
Fixes issue #159.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Issue #152: claude-code workspace plugin adapter import fails with
'No module named plugins_registry'. Plugin adapter code
(workspace-template-*) uses bare `from plugins_registry import ...`
but molecule-runtime only shipped it at
molecule_runtime/plugins_registry/ (the package namespace path).
Fix: copy workspace/plugins_registry/ to the top level of the wheel
in addition to molecule_runtime/plugins_registry/. Both copies coexist
— the top-level one satisfies bare imports from plugin adapters,
the nested one satisfies the rewritten
`from molecule_runtime.plugins_registry import ...` in adapter_base.py.
pyproject.toml updated to include plugins_registry* in the packages find
directive so setuptools ships it from the wheel root.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes the "no keyboard shortcut help dialog" audit gap (MEDIUM).
Changes:
- Add KeyboardShortcutsDialog component: portal-based, accessible
dialog listing all canvas + navigation + agent shortcuts grouped by
category. WCAG 2.1 compliant (focus trap, Esc close, aria-modal,
aria-labelledby, focus restoration on close).
- Add global ? shortcut: opens the dialog when pressed outside any
input field and no modal is already open.
- Add "See all shortcuts →" link in the Toolbar quick-start popup
linking to the dialog.
Test plan:
- [x] npx vitest run (182 tests pass)
- [x] tsc --noEmit (no type errors)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Issue: HIGH priority item from canvas accessibility audit (2026-05-09).
Screen reader users had no way to know when workspace status changed
— the canvas updated visually but no announcement was made.
Changes:
- canvas.ts: add `liveAnnouncement: string` + `setLiveAnnouncement` to
CanvasState so the store can hold the current announcement text.
- canvas-events.ts: set `liveAnnouncement` in handleCanvasEvent for 6
key status transitions: ONLINE, OFFLINE, PAUSED, DEGRADED, PROVISIONING,
REMOVED, PROVISION_FAILED. Names are looked up from store nodes so
announcements are human-readable ("Alpha is now online" not "ws-1").
TASK_UPDATED and AGENT_MESSAGE are intentionally excluded — they fire
on every heartbeat and would overwhelm the user.
- Canvas.tsx: subscribe to `liveAnnouncement` from the store; render a
visually-hidden `aria-live="polite" aria-atomic="true"` region that
speaks the announcement then clears it after 500 ms so the same
message doesn't re-announce on re-render. Fallback still announces
workspace count on initial load.
- canvas-events.test.ts: 12 new test cases covering announcement
content for all 6 event types, empty/no-announcement cases, and
payload-name fallback when a node isn't yet in the store.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Issue #159: successful delegation responses were rendered as error
banners because extractResponseText() only handled the A2A result
format (body.result.parts[].text) but delegation.go stores
response_body as {text: "...", delegation_id: "..."}. The error
status was set when the HTTP transport failed even though the actual
agent response was received.
Fixes:
1. extractResponseText: check body.text before the result path so
delegation response_body.text is extracted correctly
2. extractResponseText: also check body.response_preview (WS event shape
from DELEGATION_COMPLETE handler)
3. GroupedCommsView: render NormalMessage when status=error but
responseText is populated (delegation succeeded, transport failed)
instead of burying the content in an error banner
Tests: 8 new cases (4 extractResponseText + 2 extractRequestText
regression + 2 render tests). 189 tests pass across 10 files.
Closes#159.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
[core-lead-agent] Closes Core-Security audit finding (2026-05-09 audit cycle, MEDIUM):
1. workspace-server/internal/handlers/workspace_crud.go:335
`DELETE /workspaces/:id` returned `err.Error()` verbatim in the 500
body, leaking wrapped lib/pq driver strings (schema column names,
index hints) to HTTP clients. Replaced with sanitized message;
raw error already logged server-side via the existing log.Printf
immediately above.
2. workspace-server/internal/handlers/org.go:610
`OrgImport` echoed the user-supplied `body.Dir` verbatim in the 404
"org template not found: %s" response. Path traversal is already
blocked by resolveInsideRoot earlier in the handler, but echoing
raw input back lets a client probe filesystem layout (404-with-echo
vs. 400-from-resolve is itself a signal). Dropped the input from the
client-facing message; preserved full context in a new log.Printf
(orgFile path + the requested body.Dir) for operator triage.
Both fixes preserve operator-side diagnostics (logs unchanged in
content, only client-facing JSON sanitized). No behavior change for
legitimate clients — error type, status code, and JSON shape all stay
the same.
Tier: low. Defensive hardening only; reduces info-disclosure surface
without altering control-flow or auth gates.
Agent Comms tab rendered outbound delegations as blank bubbles because
extractRequestText only checked the A2A JSON-RPC format
(body.params.message.parts[].text) while delegation.go stores
request_body as {"task": "...", "delegation_id": "..."}.
Fix: check body.task first for delegation activities, then fall back to
the A2A format. Add six test cases covering the delegation shape,
precedence over A2A params when both present, empty-string guard, and
non-string type guard.
Closes#158.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
[FORCE-MERGE AUDIT — §SOP-7]
- Approver: hongming via chat-go ("go") in conversation transcript ~21:00 UTC on 2026-05-09
- Bypassed: required status checks (all pending — runner pickup issue, separate from PR correctness)
- Audit channel: orchestrator force-merge log + this commit message
Part of overnight team shipping cycle. PR authored by team persona under per-persona Gitea identity (post #156 merge).
Renames Docker network across all code, configs, scripts, and docs.
Per issue #93: the network was named molecule-monorepo-net as a holdover
from when the repo was called molecule-monorepo. The canonical repo name is
now molecule-core, so the network should be molecule-core-net.
Files changed:
- docker-compose.yml, docker-compose.infra.yml: network definition
- infra/scripts/setup.sh: docker network create
- scripts/nuke-and-rebuild.sh: docker network rm
- workspace-server/internal/provisioner/provisioner.go: DefaultNetwork
- All comments/docs: updated wording
Acceptance: grep -rn 'molecule-monorepo-net' returns zero matches.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
MCP delegate_task and delegate_task_async bypassed the delegation activity
lifecycle entirely — no activity_log row was written for MCP-initiated
delegations. As a result the canvas Agent Comms tab rendered outbound
delegations as bare "Delegation dispatched" events with no task body.
Fix: insert a delegation row (mirroring insertDelegationRow from
delegation.go) before the A2A call so the canvas can show the task text.
The sync tool updates status to 'dispatched' after the HTTP call; the
async tool inserts with 'dispatched' directly (goroutine won't update).
Closes#158.
Closes#49 (partial — addresses the canvas-display gap; full lifecycle
parity requires DelegationWriter extraction, tracked separately).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per issue #153: `docker compose up -d` (docker-compose.yml) did not start
Temporal because it lived only in docker-compose.infra.yml. Users had to know
to run `setup.sh` which explicitly uses `-f docker-compose.infra.yml`.
Adding `include: - docker-compose.infra.yml` makes the full infra stack
(starting with Temporal) start with the default `docker compose up` command.
Both compose files define postgres/redis — the main file's definitions take
precedence via compose merge semantics, so no service conflicts.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Major correction from Core-FE review:
- Canvas has THREE themes: System/Light/Dark, not dark-only
- Warm paper tones for light, zinc-adjacent dark for dark mode
- ThemeProvider handles switching, persisted in mol_theme cookie
- Use semantic tokens: bg-surface, bg-surface-card, border-line, text-ink
- NEVER use raw zinc for surfaces — only for borders/disabled/code
Updated:
- Section 1: Three-mode theme palette with exact hex values
- Section 4: Component patterns now use semantic tokens
- Added Section 4.6: ThemeProvider + useTheme() usage
- Section 7: Enforcement checklist now includes token rules
Co-Authored-By: Core-FE <core-fe@moleculesai.app>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Cross-reference the Core-FE draft against actual molecule-core/canvas/src/
codebase. Creates two new docs:
- canvas-design-system-v1.md: Full design system with verified color
palette, typography scale, animation tokens (from theme-tokens.css),
component patterns, WCAG 2.1 AA checklist. Marks all items as
VERIFIED with source file citations.
- canvas-audit-items.md: Updated architecture brain dump with verified
findings on React Flow canvas accessibility. Flags remaining gaps
(screen reader announcements, keyboard shortcuts help, keyboard drag).
Key verified discrepancies from draft:
- Font: system-ui stack (not Inter/Geist)
- Tooltip: uses aria-describedby + role=tooltip (not group-hover CSS)
- Animation tokens: already defined in theme-tokens.css
- ContextMenu: has full keyboard nav (arrow keys, wrap-around)
Co-Authored-By: Core-FE <core-fe@moleculesai.app>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes#155.
Without this, every commit from a workspace booted via the standard
provisioner lands with an empty `user.name`/`user.email` and Gitea
attributes the work to whichever PAT pushed (typically the founder's
`claude-ceo-assistant`), instead of the persona that actually authored
the commit. That's the same fingerprint pattern that got us suspended
on GitHub 2026-05-06.
GITEA_USER is already injected per-workspace by the provisioner from
workspace_secrets (verified: 8/8 Core-* workspaces have it set,
correctly-named, on operator + local). Boot picks it up unconditionally;
falls through cleanly if unset (e.g. legacy boxes without persona
identity wiring).
Email uses `bot.moleculesai.app` so agent commits are visually distinct
from human-authored commits in Gitea history. The `gitconfig` copy from
`/root/.gitconfig` to `/home/agent/.gitconfig` is now unconditional —
previously it was nested inside the `molecule-git-token-helper.sh`
block, which meant the per-persona identity wouldn't propagate to the
agent user when the helper was unavailable.
Also added an inline note that the github.com credential-helper block
is post-suspension legacy. Full removal tracked under #171; this PR
deliberately doesn't touch it (smaller blast radius).
Tested: docker exec sets the same config in 8 running Core-* workspaces
locally and they pick up correct identity for `git config -l`. Will
reset when those containers restart, hence this PR for the persistent
fix.
molecule-core/main branch protection requires the status-check context
'Secret scan / Scan diff for credential-shaped strings (pull_request)'
but the workflow lived only in .github/workflows/, which Gitea Actions
doesn't see — every PR's required-status-checks rollup left the context
in 'expected' / never-fires state, blocking merge.
Port to .gitea/workflows/secret-scan.yml. Drops:
- merge_group event (Gitea has no merge queue)
- workflow_call (no cross-repo reusable invocation on Gitea)
SELF exclude lists both .github/ and .gitea/ paths so a future sync
between them stays clean. Job + step names match the GitHub workflow
so the produced status-check context name matches branch protection
unchanged.
Same regex set as the runtime's pre-commit hook
(molecule-ai-workspace-runtime: molecule_runtime/scripts/pre-commit-checks.sh).
This unblocks PR #150 (audit-force-merge fan-out) and every future
PR on molecule-core/main.
Mirrors the canonical workflow shipped on internal#120 + #122. Same
shape: pull_request_target on closed, base.sha checkout, structured
JSON event to runner stdout that Vector ships to Loki on
molecule-canonical-obs.
REQUIRED_CHECKS env declares both molecule-core/main protected
contexts (sop-tier-check + Secret scan). Mirror against branch
protection if either is added/removed.
Verified end-to-end on internal: synthetic force-merge of internal#123
emitted incident.force_merge with all expected fields, indexable in
Loki via {host="molecule-canonical-1"} |= "incident.force_merge".
Tier: low (CI workflow, no platform code path).
Closes the post-PR-#174 self-review gap: the matched-pair contract
between ADMIN_TOKEN (server-side bearer gate) and NEXT_PUBLIC_ADMIN_TOKEN
(canvas client-side bearer attach) was descriptive only, living in a
.env file comment. Future agents/devs could re-misconfigure with one
of the two unset and silently 401 — every workspace API call refused
with no actionable diagnostic.
Adds checkAdminTokenPair() to canvas/next.config.ts, run after
loadMonorepoEnv() so it sees the post-load state. Two distinct
warnings (server-set/client-unset and the inverse) so an operator can
tell which half is missing without grep'ing. Empty string is treated
as unset so KEY= and unset KEY produce the same verdict.
Warn-only, not exit — production canvas Docker images bake these vars
at image-build time and a hard exit would turn a recoverable auth
issue into a crashloop. The console.error fires in `next dev`, the
standalone server's stdout, and the canvas Docker container logs —
the three places an operator looks when "everything 401s."
Tests pin exact stderr strings (per feedback_assert_exact_not_substring)
across 6 cases: both unset, both set, ADMIN_TOKEN-only, NEXT_PUBLIC-only,
empty-string-as-unset, and the empty-string-asymmetric mismatch.
Mutation-tested: flipping the if-condition from === to !== fails all 6.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The forks pool's implicit maxWorkers=1 (2-CPU runner) was insufficient
to prevent concurrent jsdom worker cold-starts. Each jsdom worker
allocates ~30-50 MB RSS at boot; multiple workers starting simultaneously
exhaust available memory, causing 5 test files to fail with:
[vitest-pool]: Failed to start forks worker for test files ...
[vitest-pool-runner]: Timeout waiting for worker to respond
Individual jsdom test files take 12-15 s in isolation and pass cleanly.
Failures only occur when 51 files are run together through the pool.
Fix: explicitly set maxWorkers:1 so a single worker processes all files
sequentially, eliminating concurrent jsdom bootstrap memory pressure.
With this change, all 51 files pass (was 46 pass + 5 fail), and suite
duration improves from ~5070 s to ~1117 s because workers no longer
compete for resources during startup.
Ref: issue #148
Ref: vitest-pool investigation for issue #22 (canvas side)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mirrors the canonical refactor: workflow YAML shrinks (env+invocation),
logic moves to .gitea/scripts/sop-tier-check.sh, debug echoes gated on
SOP_DEBUG, checkout@v6 pinned to base.sha.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fans the security fix from internal#116 (cce89067) to molecule-core. Same
rationale: pull_request loads workflow from PR HEAD, allowing any
write-access contributor to rewrite the workflow file in their PR and
exfiltrate SOP_TIER_CHECK_TOKEN. pull_request_target loads from base
(main), neutralising the attack.
Verified post-merge on internal: synthetic PR rewriting the workflow to
print the token did NOT execute the modified version — main's
pull_request_target version ran instead. ATTACK_PROBE never fired.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase-1 fan-out of §SOP-6 enforcement to molecule-core. No branch
protection change in this PR — workflow runs and reports a status,
doesn't block any merge yet.
Branch protection update is the follow-up PR after the workflow
demonstrates a green run on its own PR, per the Phase 2 plan.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The org.import.started event was firing immediately after request body
bind, before the YAML at body.Dir was loaded. Result: payload.name was
"" whenever the caller passed `dir` (the common path — the canvas and
all live imports use dir, not inline template). Three started rows
already in the local platform's structure_events have empty name.
Fix: move the started emit (and importStart timestamp) to after the
YAML unmarshal / inline-template fallthrough, where tmpl.Name is
guaranteed populated.
Bonus: pre-parse error returns (invalid body, traversal-rejected dir,
file-not-found, YAML expansion fail, YAML unmarshal fail, neither dir
nor template provided) no longer emit an orphan started row — every
started is now guaranteed a paired completed/failed.
Verified live against running platform: re-imported molecule-dev-only,
new started row in structure_events carries
"Molecule AI Dev Team (dev-only)" instead of "".
Tests: full handler suite green (`go test ./internal/handlers/`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops ~150 lines of duplicated cascade logic from the Delete HTTP
handler — workspace_crud.go's CascadeDelete (added in PR #137) and
Delete() were running the same #73 race-guard sequence (status update →
canvas_layouts → tokens → schedules → container stop → broadcast),
just with Delete() inlined and CascadeDelete owning the OrgImport
reconcile path.
CascadeDelete now returns the descendant id list (was: count) so
Delete() can drive the optional ?purge=true hard-delete against the
same set the cascade just touched.
Net diff: workspace_crud.go shrinks from ~270 lines in Delete() to
~75 lines (parse + 409 confirm gate + CascadeDelete call + stop-error
500 + purge block + 200 response). Behavior identical — same SQL
ordering, same #73 race guard, same response shapes. Three sqlmock
tests for the 0-children case gained one extra ExpectQuery for the
recursive-CTE descendants scan (the old inline code skipped that
query when len(children)==0; CascadeDelete walks unconditionally —
returns 0 rows, same end state, one extra cheap query).
Tests: full handler suite green (`go test ./internal/handlers/`).
Live-tested against the running local platform: DELETE on a fake
workspace returns `{"cascade_deleted":0,"status":"removed"}`,
fleet of 9 workspaces preserved, refactored handler matches the
prior wire-shape exactly.
Tracked as the PR #137 follow-up tech-debt item.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the additive-import zombie bug — re-running /org/import with a
tree shape that reparents same-named roles left the prior workspace
online because lookupExistingChild's dedupe is parent-scoped (different
parent_id → "different" workspace). Caught 2026-05-08 after a dev-tree
re-import left 8 orphans co-existing with the new tree on canvas until
manual cascade-delete.
Three layers in this PR:
- mode="reconcile" on /org/import — after the import loop, online
workspaces whose name matches an imported name but whose id isn't in
the result set are cascade-deleted. Default mode "" / "merge"
preserves existing additive behavior. Empty-set guards prevent
accidental "delete everything" if either array comes up empty.
- WorkspaceHandler.CascadeDelete extracted as a callable helper from
the existing Delete HTTP handler so OrgImport's reconcile path shares
the same teardown sequence (#73 race guard, container stop, volume
removal, token revocation, schedule disable, event broadcast). The
HTTP Delete handler still inlines the same logic; deduplication
tracked as tech-debt follow-up.
- emitOrgEvent(structure_events) records org.import.started +
org.import.completed with mode, created/skipped/reconcile_removed
counts, duration_ms, error. Replaces the lost-on-restart stdout-only
log shape for an audit-trail surface that's queryable by SQL. Closes
the "what happened at 20:13?" debugging gap that motivated this fix.
Verified live against the local platform: cascade-delete on an old
tree's removed root cleared 8 surviving orphans; mode="reconcile" with
a freshly-INSERTed fake orphan removed exactly the fake; idempotent
re-run of reconcile is a no-op (0 removed, no errors); structure_events
captures every started+completed pair with full payload.
7 new unit tests (walkOrgWorkspaceNames flat/nested/spawning:false/
empty-name; emitOrgEvent success + DB-error-swallow; errString). Full
handler suite green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 4 follow-up to template-claude-code PR #9 (2026-05-08 dev-tree wedge).
Pre-fix: applyRuntimeModelEnv unconditionally overwrote envVars["MODEL"]
with the MODEL_PROVIDER slug whenever payload.Model was empty (the restart
path). This silently wiped the operator'\''s explicit per-persona MODEL
secret on every restart.
Symptom: dev-tree workspaces booted correctly on first /org/import (the
envVars map was populated direct from the persona env file with both
MODEL=MiniMax-M2.7-highspeed and MODEL_PROVIDER=minimax), then on the
next Restart the MODEL secret got clobbered to literal "minimax" — a
provider slug, not a valid model id — and the workspace template'\''s
adapter failed to match any registry prefix, fell through to providers[0]
(anthropic-oauth), and wedged at SDK initialize.
Fix: resolution order in applyRuntimeModelEnv is now:
1. payload.Model (caller passed the canvas-picked model id verbatim)
2. envVars["MODEL"] (workspace_secret persisted from persona env)
3. envVars["MODEL_PROVIDER"] (legacy canvas Save+Restart shape)
Tests
-----
TestApplyRuntimeModelEnv_PersonaEnvMODELSecretPreserved — locks in
the new resolution order with four cases:
- MODEL secret wins over MODEL_PROVIDER slug (persona-env shape)
- MODEL secret wins even when same as MODEL_PROVIDER
- MODEL absent → fall back to MODEL_PROVIDER (legacy shape)
- Both absent → no MODEL set (no-op)
Existing TestApplyRuntimeModelEnv_SetsUniversalMODELForAllRuntimes
continues to pass — fix is strictly additive on the precedence chain.
Lets a workspace declare it (and its entire subtree) should be skipped
during /org/import. Pointer-typed `*bool` so we distinguish "explicitly
false" from "unset" (default = spawn).
## Use case
The dev-tree org template ships the full role taxonomy (Dev Lead with
Core Platform / Controlplane / App & Docs / Infra / SDK Leads, each with
their own engineering / QA / security / UI-UX children — 27 personas
total in a single import). Some setups need a smaller set:
- Local dev on a memory-constrained machine
- Demo / smoke runs that don't need the full org breathing
- Customer trials starting with leadership-only before fan-out
Pre-fix the only options were:
- Edit the canonical template (mutates shared state)
- Author a parallel slimmer template (duplicates structure)
- Manual workspace deprovision after full import (wasteful — already paid
the docker pull / build cost)
`spawning: false` is the per-workspace knob that solves this without
touching the canonical template structure.
## Semantics
- Unset: workspace spawns (current behaviour, no migration)
- `spawning: true`: explicitly spawns (same as unset)
- `spawning: false`: workspace is skipped AND every descendant is
skipped. The guard sits BEFORE any side effect in
createWorkspaceTree — no DB row, no docker provision, no children
recursion. A false-spawning subtree is genuinely a no-op except for
the log line. countWorkspaces still counts the subtree (so /org/templates
numbers reflect the full structure).
## Stage A — verified
Local dev-only template that wraps teams/dev.yaml (Dev Lead) with
children:[] cleared on the 5 sub-team yaml files, plus 3 floater
personas (Release Manager / Integration Tester / Fullstack Engineer).
/org/import returned 9 workspaces. Drop-in: same result via
`spawning: false` on each sub-tree root in the future.
## Stage B — N/A
Pure additive feature on the org-template handler. No SaaS deploy chain
implications.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## org_import.go — persona env injection root-cause fix
The Phase-3 fix from earlier today (`feedback/per-agent-gitea-identity-default`)
introduced loadPersonaEnvFile to inject persona-specific creds into
workspace_secrets on /org/import. It passed `ws.Role` as the persona-dir
lookup key, but in our dev-tree org.yaml shape `role:` carries the
multi-line descriptive text the agent reads from its prompt
("Engineering planning and team coordination — leads Core Platform,
Controlplane, ..."), while `files_dir:` holds the short slug
(`core-lead`, `dev-lead`, etc.) matching
`~/.molecule-ai/personas/<files_dir>/env`.
isSafeRoleName silently rejected the multi-word role text → no persona
env loaded → every imported workspace booted with zero
workspace_secrets rows → no ANTHROPIC / CLAUDE_CODE / MINIMAX auth in
the container env → claude_agent_sdk wedged on `query.initialize()`
with a 60s control-request timeout.
After the fix, /org/import on the dev tree (27 personas) populates
8 workspace_secrets per workspace (Gitea identity + MODEL/MODEL_PROVIDER
+ provider-specific token), 5 of 6 leads boot online, and the
remaining wedges trace to a separate runtime-template-repo bug
(workspace-template-claude-code's claude_sdk_executor.py doesn't
dispatch on MODEL_PROVIDER=minimax — filed separately).
## Dockerfile.dev — docker-cli + docker-cli-buildx
Without these, every claude-code/tier-2 workspace POST fails-fast:
- docker-cli alone produces `exec: "docker": executable file not found`
- docker-cli alone (no buildx) fails on `docker build` with
`ERROR: BuildKit is enabled but the buildx component is missing or broken`
Both packages are now installed in the dev image; verified with
`docker exec molecule-core-platform-1 docker buildx version`.
## Stage A verified
Local /org/import dev-only path: 27 workspaces created, all 27 receive
persona env injection (8 secrets each — Gitea identity + provider creds).
Lead workspaces (claude-code-OAuth tier) boot online.
## Stage B — N/A
Local-dev-only path (docker-compose.dev.yml + dev image). Tenant EC2
provisioning uses Dockerfile.tenant (untouched).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to the workspace-template visibility flip in 558e4fee. After
flipping the 5 private workspace-templates public (#192 root cause),
the harness-replays clone moved one step deeper to the org-templates
list, where 6 of 7 were also private. Hongming-confirmed flip plan:
- 5 of 6 (molecule-dev, free-beats-all, medo-smoke, molecule-worker-gemini,
ux-ab-lab) — flipped public per `feedback_oss_first_repo_visibility_default`.
These are unambiguously OSS-template-shape: generic README, no
customer-shaped names, no creds in content.
- 1 of 6 (reno-stars) — name itself is customer-shaped (would expose
customer/tenant identity). Kept private; removed from manifest.json
per Hongming. Will be handled at provision-time via the per-tenant
credential resolver designed in internal#102 (Layer-3 RFC).
Documents the OSS-surface contract in two places:
- manifest.json _comment: every entry MUST be public; Layer-3 lives elsewhere
- clone-manifest.sh comment block: rationale + the explicit ci-readonly
team-grant escape hatch (review-gated, not default).
Closes the second clone-fail layer of #192. Combined with 558e4fee +
the workspace-template visibility flips, the Pre-clone manifest deps
step should now succeed anonymously for the full registered set.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5 of 9 workspace-template repos (openclaw, codex, crewai, deepagents,
gemini-cli) had been marked private with no team grant for AUTO_SYNC_TOKEN
bearer (devops-engineer persona). Pre-clone manifest deps step 404'd on
the first private repo encountered, failing every Harness Replays run.
Resolution path taken:
1. Flipped the 5 to public per `feedback_oss_first_repo_visibility_default`
— runtime/template/plugin repos default public; that's what makes them
OSS surface.
2. Scoped existing `ci-readonly` org team to legitimately-internal repos
only (compliance docs, RFCs-in-flight). Workspace templates removed
from it.
3. Filed internal#102 RFC for Layer-3 (customer-owned + marketplace
third-party private repos) — that's a different shape entirely;
needs per-tenant credential-resolver, not org-team grants.
This commit is a documentation-only touch on the workflow file to (a)
record the root cause inline next to the existing pre-clone-fail
narrative, (b) trigger a fresh Harness Replays run that should now pass
the clone step.
Closes#192.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Investigating molecule-core#129 failure mode #1 (claude-code "Agent
error (Exception)") needs the workspace's docker logs to find the
actual exception. The canary tears down the tenant on every failure,
so the workspace container is destroyed before anyone can SSM in.
Add a workflow_dispatch input `keep_on_failure: bool` (default false).
When true, sets `E2E_KEEP_ORG=1` for the canary script — its existing
debug path skips teardown, leaving the tenant + EC2 + CF tunnel + DNS
alive. Operator can then SSM into the workspace EC2 (via the same
flow as recover-tunnels.py) and capture `docker logs` from the
claude-code container.
Cron-triggered runs never set the input (it only exists on dispatch),
so unattended scheduled canaries always tear down — no risk of
unattended cost leak.
Operator workflow:
1. Dispatch canary-staging.yml with keep_on_failure=true
2. Watch CI; on failure (likely, given the 38h chronic red),
note the SLUG / TENANT_URL printed at step 1/11
3. SSM exec into the workspace EC2 (us-east-2) and run
`docker logs <claude-code-container>` to find the actual
exception traceback
4. Manually delete via DELETE /cp/admin/tenants/<slug> when done
(the script logs this reminder on E2E_KEEP_ORG=1 path)
Refs: molecule-core#129 (canary investigation)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the legacy nohup `go run ./cmd/server` setup with a fully
containerized local stack: postgres + redis + platform + canvas, all
with `restart: unless-stopped` so they survive Mac sleep/wake and
Docker Desktop daemon restarts.
## Changes
- **docker-compose.yml**
- `restart: unless-stopped` on platform/postgres/redis
- `BIND_ADDR=0.0.0.0` for platform — the dev-mode-fail-open default
of 127.0.0.1 (PR #7) made the host unable to reach the container
even with port mapping. Container netns is already isolated, so
binding all interfaces inside is safe.
- Healthchecks switched from `wget --spider` (HEAD → 404 forever
because /health is GET-only) to `wget -qO /dev/null` (GET).
Same regression existed on canvas; fixed both.
- **workspace-server/Dockerfile.dev**
- `CGO_ENABLED=1` → `0` to match prod Dockerfile + Dockerfile.tenant.
Without this, the alpine dev image fails with "gcc: not found"
because workspace-server has no actual cgo deps but the env was
forcing the cgo build path. Closes a divergence introduced in
9d50a6da (today's air hot-reload PR).
- **canvas/Dockerfile**
- `npm install` → `npm ci --include=optional` for lockfile-exact
installs that include platform-specific @tailwindcss/oxide native
binaries. Without these, `next build` fails with "Cannot read
properties of undefined (reading 'All')" on the
`@import "tailwindcss"` directive.
- **canvas/.dockerignore** (new)
- Excludes `node_modules` and `.next` so the Dockerfile's
`COPY . .` step doesn't clobber the freshly-installed container
node_modules with the host's (potentially stale or wrong-arch)
copy. This was the actual root cause of the canvas build break.
- **workspace-server/.gitignore**
- Adds `/tmp/` for air's live-reload build cache.
## Stage A verified
```
container status restart
postgres-1 Up (healthy) unless-stopped
redis-1 Up (healthy) unless-stopped
platform-1 Up (healthy, air-mode) unless-stopped
canvas-1 Up (healthy) unless-stopped
GET :8080/health → 200
GET :3000/ → 200
DB preserved: 407 workspace rows + 5 named personas
Persona mount: 28 dirs at /etc/molecule-bootstrap/personas
```
## Stage B — N/A
This is local-dev infrastructure only. None of these files ship to
SaaS tenants — production EC2s use `Dockerfile.tenant` + `ec2.go`
user-data, not docker-compose.
## Out of scope
- The decorative-but-broken `wget --spider` healthcheck has presumably
also been silently 404'ing on prod tenants. Ship a follow-up to
audit + fix the prod path; not done here to keep the PR scoped.
- Docker Desktop "Start at login" is a per-machine GUI setting that
must be toggled manually (Settings → General).
- The legacy heartbeat-all.sh that pinged 5 persona workspaces from
the host has been deleted (~/.molecule-ai/heartbeat-all.sh).
Per Hongming: each workspace is responsible for its own heartbeat.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "Open issue on failure" step was failing on every canary run
because Gitea 1.22.6 doesn't expose /api/v1/actions endpoints
(per memory reference_gitea_actions_log_fetch). The threshold check
called github.rest.actions.listWorkflowRuns() to count consecutive
prior failures and gate issue creation behind 3 reds — that call
ALWAYS 404'd on Gitea, breaking the entire alerting step.
Net effect: the canary's own self-alerting was broken, so the
underlying staging regression went unflagged for 38h+
(2026-05-07 02:30 UTC → 2026-05-08 17:34 UTC, every cron tick red,
zero issues filed).
Fix: drop the consecutive-failures threshold entirely. File a
sticky issue on the FIRST failure; comment-on-existing handles
deduplication for subsequent failures. The auto-close-on-success
step is unchanged.
Why not a Gitea-compatible threshold (e.g., walk recent commit
statuses): comment-on-existing already gives ops a single
accumulating issue per regression streak. The threshold's purpose
was to avoid spamming on transient flakes — but with sticky issue
+ auto-close-on-green, transient flakes get one issue + one quick
close, which is fine signal. Filing on first failure is also
better UX: catches the regression in 30 min instead of 90 min.
Also: rewrote runURL from hardcoded https://github.com/... to
context.serverUrl so the link actually points at Gitea
(https://git.moleculesai.app) — was always broken on Gitea but
nobody noticed because the issue-filing step itself was broken.
Net: 21 insertions, 40 deletions. Removes WORKFLOW_PATH +
CONSECUTIVE_THRESHOLD env vars (no longer needed).
Tracked in: molecule-core#129 (failure mode 3 of 3)
Verification: yaml syntax-valid; no remaining github.rest.actions.*
calls; only github.rest.issues.* (all Gitea-supported per
memory feedback_persona_token_v2_scope).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes core#242 LOCAL surface. The PROD surface (CP user-data fetching
persona env files into tenant EC2's /etc/molecule-bootstrap/personas
via Secrets Manager) is filed as a follow-up.
WHAT THIS ADDS
Bind-mount on the platform service in docker-compose.yml:
${MOLECULE_PERSONA_ROOT_HOST:-${HOME}/.molecule-ai/personas}
→ /etc/molecule-bootstrap/personas (read-only)
Default source = ${HOME}/.molecule-ai/personas (the operator-host-mirrored
local dir populated by today's persona rotation work). Override via
MOLECULE_PERSONA_ROOT_HOST when running on a machine with a different
layout (CI runners, etc.).
WHY READ-ONLY
workspace-server only reads persona env files; never writes back. The
read-only mount enforces that contract — a hostile plugin install path
can't tamper with the persona credentials it's about to consume.
WHY THIS PATH MATCHES PROD
/etc/molecule-bootstrap/personas is the same in-container path the
prod tenant EC2 will use. Same code path (org_import.go::loadPersonaEnvFile)
reads the same file regardless of mode — local-dev parity with prod
per feedback_local_must_mimic_production.
STAGE A VERIFICATION
- docker compose config: resolves to /Users/hongming/.molecule-ai/personas
correctly (28 persona dirs visible at source path)
- Persona env file shape verified: dev-lead's env contains GITEA_USER,
GITEA_USER_EMAIL, GITEA_TOKEN_SCOPES, GITEA_SSH_KEY_PATH,
MODEL_PROVIDER=claude-code, MODEL=opus (lead tier matches Hongming's
2026-05-08 mapping)
- Full handler test suite green (TestLoadPersonaEnvFile_HappyPath +
7 sibling tests pass; rejection tests still catch path traversal)
- Build clean
STAGE B SKIPPED (with justification per § Skip conditions)
This change is config-only (docker-compose.yml volume addition). The
prod tenant EC2s do NOT use docker-compose.yml — they use CP user-data
+ ec2.go's docker run script. So this PR has no prod blast radius.
Stage B (staging tenant probe) would be checking 'is the platform
using the new compose mount' on a SaaS tenant — and SaaS tenants
don't run docker compose. The actual prod-surface change is the
follow-up issue.
PROD SURFACE — FOLLOW-UP FILED
Tenant EC2 user-data needs to fetch persona env files from operator
host (or AWS Secrets Manager per the established
feedback_unified_credentials_file pattern) and stage them at
/etc/molecule-bootstrap/personas inside the workspace-server container.
Touches molecule-controlplane/internal/provisioner/ec2.go user-data.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes core#115 partial. Schema-only change; the apply-endpoint filter
logic that reads this column lands with core#123 (drift detector +
queue + apply endpoint, the deferred follow-up of core#113).
Default 'production' so existing customers (Reno-Stars + any future
tenant) are default-safe. Synthetic dogfooding workspaces opt INTO
'canary' explicitly.
CHECK constraint pins the closed value set ('canary' | 'production') —
the apply endpoint's filter relies on the database to reject anything
else, so a future operator typo in PATCH /workspaces/:id ({update_tier:
'canery'}) returns a constraint violation, not silent fan-out to
nobody.
Partial index on canary rows since the apply-endpoint query path
('apply this update only to canary tier first') hits canary much more
often than production, and the production set is the much larger
default.
WHAT THIS DOES NOT DO (lands with core#123)
- PATCH endpoint to flip a workspace to canary
- The apply endpoint that consults the column
- Tests that exercise canary-vs-production fan-out
Schema-only foundation; same pattern as core#113 (workspace_plugins).
PHASE 4 SELF-REVIEW
Correctness: No finding — IF NOT EXISTS guards, DEFAULT clause means
existing rows get 'production' on migration apply.
Readability: No finding — comment block documents the tier semantics
+ the deferral to core#123.
Architecture: No finding — additive ALTER, partial index for the
expected access pattern.
Security: No finding — no code path; column constraint reduces blast
radius of bad PATCH input.
Performance: No finding — partial index minimizes write amplification
on the production-default rows.
REFS
core#115 — this issue
core#123 — apply endpoint follow-up (will exercise this column)
core#113 — version subscription DB foundation (sibling pattern)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes core#113 partial. Adds the DB foundation for the
version-subscription model. Drift detection + queue + admin apply
endpoint are follow-up scope (separate PR; filed as a new issue).
WHY THIS PR ONLY GETS US PART-WAY
Plugin install state today is filesystem-only — '/configs/plugins/<name>/'
inside the container. There's no DB record of 'plugin X installed at
workspace W from source S, tracking ref T'. That makes drift detection
impossible: nothing to compare upstream tags against.
This PR adds the table + the install-endpoint hook that writes to it.
With baseline tags now on every plugin (post internal#92), the table
starts collecting tracked-ref values immediately on the next install.
The actual drift-check job + queue + apply endpoint layer on top.
WHAT THIS ADDS
workspace_plugins table:
workspace_id FK → workspaces(id) ON DELETE CASCADE
plugin_name canonical name from plugin.yaml
source_raw full source URL the install used
tracked_ref 'none' | 'tag:vX.Y.Z' | 'tag:latest' | 'sha:<full>'
installed_at, updated_at
installRequest gains optional 'track' field (defaults to 'none').
Install handler upserts the workspace_plugins row after delivery
succeeds. DB write failure is logged but doesn't fail the install
(the plugin IS in the container; surfacing 500 misleads the caller).
validateTrackedRef enforces the closed set of accepted shapes:
'none' | 'tag:<non-empty>' | 'sha:<non-empty>'
Bare values like 'latest' / 'main' / version-strings without
prefix are rejected — the drift detector keys on prefix to know
what kind of resolution to do.
WHAT THIS DOES NOT ADD (filed separately)
- Drift detector job (cron / on-demand) that scans
'WHERE tracked_ref != none' rows and queues updates on upstream drift
- plugin_update_queue table (separate migration once detector lands)
- GET /admin/plugin-updates-pending and POST .../apply endpoints
- Tier-aware apply (core#115 — composes here)
PHASE 4 SELF-REVIEW (FIVE-AXIS)
Correctness: No finding — install endpoint behavior unchanged for
callers that don't pass 'track'. DB write is best-effort + logged
on failure. validateTrackedRef rejects ambiguous bare strings.
Readability: No finding — separate file plugins_tracking.go isolates
the new concern; install handler delta is a single 4-line block.
Architecture: No finding — additive table; existing schema untouched.
Migration 20260508160000_* uses the timestamp-prefixed convention.
Security: No finding — INSERT params via placeholders (no string
interpolation). validateTrackedRef rejects unexpected shapes before
the column constraint would.
Performance: No finding — one extra ExecContext per install. Install
is already seconds-scale (network fetch + tar + docker exec); rounds
to noise.
TESTS (1 new, all green)
TestValidateTrackedRef — pin closed set + structural validators
REFS
core#113 — this issue (foundation only; drift+queue+apply = follow-up)
internal#92, internal#93 — plugin/template baseline tags (now exists for tracking)
core#114 — atomic install (this PR composes — no atomicity regression)
core#115 — canary tier filter (will key off the same DB foundation)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes molecule-core#112. Composes with #114 (atomic install).
Before issuing restartFunc, classify the diff between staged and live:
- skill-content-only: only **/SKILL.md content changed
→ skip restart (Claude Code re-reads SKILL.md on
each Skill invocation; no in-memory cache)
- cold: anything else
→ restartFunc as before
(hooks/settings load at session start;
plugin.yaml is structural; added/removed files
require a fresh load)
DETECTION
- Hash every regular file in staged tree (host filesystem, sha256)
- Hash every regular file in live tree (in-container via docker exec
sh -c 'cd <livePath> && find . -type f -print0 | xargs -0 sha256sum')
- .complete marker dropped from comparison (mtime varies install-to-
install; including it would force-cold every reinstall)
- File added/removed → cold
- File content differs but isn't SKILL.md → cold
- All differences are SKILL.md basenames → skill-content-only
DEFAULTS COLD
- First install (no live tree) → cold
- Live tree read failure → cold (conservative; never hot-reload speculatively)
- Symlinks skipped during hash (same posture as tar walker)
PHASE 4 SELF-REVIEW
Correctness: No finding — all error paths default to cold; never
falsely classify as skill-content-only. The .complete drop is
a deliberate exception (the marker is bookkeeping, not content).
Readability: No finding — single-purpose helpers (hashLocalTree,
hashContainerTree, isSkillMarkdown, shQuote) each do one thing.
The classifier itself reads as 'compare set, then walk diff with
isSkillMarkdown gate.'
Architecture: No finding — composes existing execAsRoot primitive;
new helpers in plugins_classifier.go don't touch any other
handler. Old behavior unchanged when live read fails.
Security: No finding — shQuote single-quotes any non-trivial path,
pluginName comes from validatePluginName-validated source, and
the docker exec command takes the path as a single arg (xargs -0
handles binary-safe path delimiting). Symlinks skipped.
Performance: No finding — adds two tree walks (host + container)
per install. Container walk is one docker exec call returning
sha256 lines; for typical plugins (~10-50 files) round-trip is
~100ms. Versus the saved ~5-10s of restart on a hot-reloadable
update, this is a clear win.
TESTS (4 new, all green; full handler suite green)
TestIsSkillMarkdown — basename match, case-sensitive
TestHashLocalTree_StableHash — re-hash same dir = same map
TestHashLocalTree_SymlinkSkipped — hostile link doesn't poison classifier
TestShQuote — quoting boundary for shell injection safety
REFS
molecule-core#112 — this issue
molecule-core#114 — atomic install (.complete marker added there)
Reno-Stars iteration safety (Hongming 2026-05-08)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes molecule-core#114 for the docker (local-OSS) path.
EIC (SaaS) path tracked as a follow-up — same shape, different
exec primitives (ssh vs docker exec); shipping both in one PR
doubles the test surface.
THE FOUR-STEP DANCE
1. STAGE — docker.CopyToContainer extracts tar into
/configs/plugins/.staging/<name>.<ts>/
2. SNAPSHOT — if /configs/plugins/<name>/ exists, mv to
/configs/plugins/.previous/<name>.<ts>/
3. SWAP — atomic mv staging → live (single rename(2))
4. MARKER — touch /configs/plugins/<name>/.complete
Workspace-side plugin loaders should refuse to load any plugin dir
without .complete (separate small change, not in this PR — the marker
write is the necessary precursor; consumer side is a follow-up so
existing-content plugins don't break before they're re-installed).
ROLLBACK
- Stage failure: rm -rf staging dir; live untouched
- Snapshot failure: rm -rf staging dir; live untouched (no rename happened)
- Swap failure with snapshot present: mv previous back to live
- Swap failure (no snapshot): rm -rf staging; live (which never
existed) stays absent
- Marker failure: content already in place, log loudly with manual
recovery hint (touch <plugin>/.complete) — don't roll back since
the new content is what we wanted, just unmarked
GC
Best-effort delete of previous-version snapshot after successful
marker write. Failures non-fatal — next install or a separate
sweeper reclaims. Sweeper for stale .previous/* across reboots is
follow-up scope.
CONCURRENCY
Each install gets a unique stamp (UTC second precision), so two
concurrent reinstalls land in distinct staging dirs and the second
swap simply overwrites the first's live result. The atomicity is
per-install, not cross-install — by design (the platform serializes
POST /workspaces/:id/plugins via Go-side semaphore upstream of
this code, so cross-install collisions don't reach here).
CHANGES
+ plugins_atomic.go — installVersion + atomicCopyToContainer
+ plugins_atomic_tar.go — tarWalk/tarHostDirWithPrefix helpers
+ plugins_atomic_test.go — 5 unit tests (paths, stamp shape,
tar happy path, symlink-skip, prefix
normalization). All green.
~ plugins_install_pipeline.go::deliverToContainer — swap
copyPluginToContainer call to atomicCopyToContainer
Old copyPluginToContainer is retained (still called by Download()) so
this PR is purely additive on the install path; no public API change.
PHASE 4 SELF-REVIEW (FIVE-AXIS)
Correctness: Required (addressed) — swap-failure rollback writes mv
of previous back to live before returning the error; if rollback
itself fails, we wrap both errors and surface the combined fault.
Marker-write failure is treated as content-landed-but-unmarked
(LOG, don't roll back the new content).
Readability: No finding — installVersion path methods make the
/staging/.previous/live/marker layout obvious from one struct.
tarWalk extracted from the inline filepath.Walk in
plugins_install_pipeline.go for testability.
Architecture: No finding — atomicCopyToContainer composes existing
execAsRoot / docker.CopyToContainer primitives; no new dependencies.
Old copyPluginToContainer kept for Download() — single responsibility
per function.
Security: No finding — symlinks still skipped during tar walk
(defense vs hostile plugin escaping its own dir). Marker writes
use composeable path.Join, no user input touches the path.
Performance: No finding — adds ~3 docker exec calls per install
(mkdir, mv-snapshot, mv-swap, touch — actually 4) on top of the
one CopyToContainer. Each exec ~50-100ms in practice; install
end-to-end was already seconds-scale, this rounds to noise.
REFS
molecule-core#114 — this issue
Companion: molecule-core#112 (hot-reload classifier — depends on .complete marker)
Companion: molecule-core#113 (version subscription — uses install machinery)
EIC follow-up: separate issue to be filed for SaaS path parity
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes core#116. Brings local-dev iteration parity with the canvas's
Turbopack HMR — edit a Go file, see the platform restart in <5s
instead of running 'docker compose up --build' (~30s) per change.
USAGE
make dev # docker compose with air-driven live reload
make up # production-shape stack (no air, normal Dockerfile)
WHAT THIS ADDS
workspace-server/.air.toml — air watch config
workspace-server/Dockerfile.dev — air-on-golang:1.25-alpine, dev-only
docker-compose.dev.yml — overlay swapping platform service
to Dockerfile.dev + bind-mounting
workspace-server/ source
Makefile — make {dev,up,down,logs,build,test}
WHAT THIS DOES NOT TOUCH
workspace-server/Dockerfile (production multi-stage build)
docker-compose.yml (prod-shape stack)
CI workflows (build prod image directly)
Tenant deployment / SaaS (image swap stays the model)
Pure additive. Existing 'docker compose up' path unchanged; production
stays on the static binary. Air install pinned via go install at image
build time so the dev image is reproducible-enough for local use (we
don't pin air to a SHA — the dev image is rebuilt locally and updates
opportunistically).
PHASE 4 SELF-REVIEW (FIVE-AXIS)
Correctness: No finding — additive change, no existing path modified.
.air.toml watches .go + .yaml under workspace-server/, excludes
_test.go and tests dir so test edits don't trigger rebuild.
Dockerfile.dev mirrors prod's 'go mod download' so first rebuild
is fast.
Readability: No finding — three small files plus a Makefile, each
with header comments explaining the WHY, not just the WHAT. The
Makefile uses the standard ## help-target pattern.
Architecture: No finding — overlay pattern (docker-compose.dev.yml
on top of docker-compose.yml) is the standard compose convention
for env-specific overrides. Doesn't fork the prod path.
Security: No finding because no production code path; dev-only image
isn't built in CI and isn't published to ECR.
Performance: No finding — air debounce=500ms, exclude_unchanged=true
so a save that doesn't change content is a no-op rebuild.
REFS
core#116 — this issue
Companion: core#117 (workspace-side config-watcher for hot-reload of
config.yaml) — different scope; this issue is platform-only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TestStartSweeper_TransientErrorDoesNotCrashLoop leaks an in-flight
metric write across the test boundary: cycleDone fires inside the
fake's Sweep defer (before Sweep returns), waitForCycle returns
immediately after, cancel() lands, but the goroutine still has
metrics.PendingUploadsSweepError() to execute. Whether that write
happens before or after the next test's metricDelta() baseline read
is a coin-flip on slow CI hosts.
Outcome: TestStartSweeper_RecordsMetricsOnSuccess fails with
"error counter delta = 1, want 0" — looks like a real bug, isn't.
Instrumented analysis (per the file's existing waitForMetricDelta
docstring covering the same shape) confirms the metric IS getting
recorded, just AFTER the next test reads its baseline.
The Records* tests already use waitForMetricDelta to close this race
on their own assertions. This change extends the same shape to
TransientErrorDoesNotCrashLoop so it doesn't poison subsequent tests'
baselines.
Verified by running `go test -race -count=20 ./internal/pendinguploads/...`
locally — passes deterministically.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trunk-based migration final cleanup for molecule-core. The 6 workflows
deleted here all existed to manage the staging↔main branch dance that
trunk-based makes obsolete:
- auto-promote-staging.yml fast-forward staging→main on green
- auto-promote-on-e2e.yml alt promote path on E2E green
- auto-promote-stale-alarm.yml alarm if staging promotion stalls
- auto-sync-main-to-staging.yml sync main→staging after UI merges
- auto-sync-canary.yml dry-run probe of the auto-sync
token+push path
- retarget-main-to-staging.yml rebase open PRs onto staging
After Phase 3A (PR #108 promoted 5 staging-only feature PRs to main)
and Phase 3B (PR #109 dropped staging-branch triggers from the 4 e2e
workflows), main is the only branch the CI cares about. None of the
above workflows have anything to do; they're 1977 lines of dead Go-time-
no-Gitea-time-yes code.
Rollback: `git revert` this commit to restore the workflows. They still
work mechanically; trunk-based just doesn't need them.
The `staging` branch on the remote is deleted in a follow-up step
(`git push origin --delete staging`) after this PR merges, so reviewers
can confirm CI runs cleanly on the new shape before the ref disappears.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires the 28 dev-tree persona credentials minted 2026-05-08 into the
workspace-secrets path used by org_import. When a workspace.yaml carries
`role: <name>`, the importer now reads
$MOLECULE_PERSONA_ROOT/<role>/env (default
/etc/molecule-bootstrap/personas/<role>/env, populated by the bootstrap
kit on the tenant host) and merges the role's GITEA_USER /
GITEA_TOKEN / GITEA_TOKEN_SCOPES / GITEA_USER_EMAIL /
GITEA_SSH_KEY_PATH into the same envVars map that already feeds
workspace_secrets via parseEnvFile + crypto.Encrypt + INSERT.
PRECEDENCE
Persona env is the LOWEST layer:
0. Persona env (per-role)
1. Org root .env (shared)
2. Workspace .env (per-workspace)
Each later layer overrides the previous, so a workspace .env can
pin a different GITEA_TOKEN if it ever needs to (testing, override).
WHY THIS LAYERING
Workspaces should boot with the role's identity by default. .env
files stay the explicit-override mechanism for the (rare) case where
a workspace needs to deviate. No new behavior for workspaces with no
role: persona load is silent no-op when ws.Role is empty or unsafe.
SECURITY
isSafeRoleName accepts only [A-Za-z0-9_-]+ (no '..', '/', or
separators) — admin-only construct, but defense-in-depth keeps the
persona dir shape invariant. Test
TestLoadPersonaEnvFile_RejectsTraversal pins the rejection set against
a planted target file.
OPERATOR-HOST CONTRACT
The 28 persona env files live at /etc/molecule-bootstrap/personas/<role>/env
(mode 600, owner root:root) with the per-role token-scope tailoring
Hongming approved 2026-05-08 (D5). Synced via task #241. Override via
MOLECULE_PERSONA_ROOT for tests + non-prod hosts.
TESTS (7 new, all green)
TestLoadPersonaEnvFile_HappyPath — typical persona-env shape
TestLoadPersonaEnvFile_MissingDir — silent no-op when file absent
TestLoadPersonaEnvFile_EmptyRole — silent no-op when role empty
TestLoadPersonaEnvFile_RejectsTraversal — planted file unreachable
via '../../etc/passwd' etc.
TestLoadPersonaEnvFile_DefaultRoot — falls back to /etc/...
TestLoadPersonaEnvFile_OverwritesEmptyMap
TestIsSafeRoleName_Acceptance — positive + negative role names
PHASE 4 SELF-REVIEW (FIVE-AXIS)
Correctness: No finding — additive change, silent no-op on the ws.Role==''
path covers every existing workspace; tests cover happy path + each
rejection mode + missing-dir.
Readability: No finding — helper sits next to parseEnvFile in
org_helpers.go with a comment block explaining WHY persona is
lowest precedence.
Architecture: No finding — fits the existing 'merge .env into envVars
then INSERT INTO workspace_secrets' pattern that's been in place
since the .env-driven workspace secrets feature; no new dependencies,
no new tables.
Security: Required (addressed) — path traversal blocked by
isSafeRoleName. No finding beyond that since persona files are
admin-managed and the helper does not log token values.
Performance: No finding — one extra os.ReadFile per workspace at
import time; amortized over workspace lifetime, cost is negligible.
REFS
internal#85 — RFC for SOP Phase 4 + structured Five-Axis (parent context)
Saved memories: feedback_per_agent_gitea_identity_default,
feedback_unified_credentials_file
Task #241 — operator-host sync (already DONE; populated 28 dirs)
Task #242 — this PR
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Harness Replays job failed at "dependency failed to start: container
harness-tenant-alpha-1 is unhealthy" — that is not caused by this
merge (which adds workspace-server/internal/handlers code, not
container infra). Retry to confirm it was a transient environmental
issue (likely operator-host load/disk per internal#78).
Trunk-based migration: main is the only branch. Update 4 workflows
that fired on staging-branch pushes to fire on main instead.
- e2e-staging-canvas.yml: drop staging from push + pull_request
- e2e-staging-external.yml: drop staging from push + pull_request
- e2e-staging-saas.yml: drop staging from push + pull_request,
update header comment that references the (now-obsolete)
staging→main auto-promote flow
- redeploy-tenants-on-staging.yml: workflow_run.branches changes
from [staging] to [main] so the tenant redeploy fires when
publish-workspace-server-image runs on main
Workflows that target the staging tenant FLEET (canary-staging.yml,
e2e-staging-sanity.yml) are not changed — they fire on cron, the word
"staging" in their filenames refers to the deployment target environ-
ment, not the git branch.
Lands as Phase 3b after #108 promotes the 5 staging-only feature PRs
(Phase 3a). Phase 3c deletes the obsolete promote/sync workflows
(auto-promote-staging, auto-sync-main-to-staging, etc.) plus the
staging branch itself, after we no-op-verify both Phase 3a and 3b
green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was supposed to fast-forward when each PR merged on staging,
but auto-promote-staging.yml has not been firing reliably on Gitea
since the GitHub suspension. Result: main is missing 5 substantive
feature PRs that landed on staging between 2026-04-29 and 2026-05-07:
- #102: test(org-include) symlink-based subtree composition contract
- #103: test(local-e2e) dev-department extraction end-to-end
- #104: fix(provisioner)+test EvalSymlinks templatePath; stage-2 e2e
- #105: feat(org-import) !external cross-repo subtree resolver (#222)
- #106: test(org-external) integration + e2e for !external resolver
Each PR was independently reviewed and CI-green at staging-merge time;
this commit promotes the merged state atomically. Use git log on main
after the merge to see the original PR-merge commits preserved.
Sister work: Phase 3 of internal#81 (trunk-based migration). Workflow
trigger updates land in a follow-up PR; staging-branch deletion happens
after a no-op verification deploy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five-Axis self-review pass on the !external resolver work (PRs #105+#106) caught three real issues that the unstructured 3-weakest review missed:
1. Cache validity gap — partial cache writes looked complete
2. Token persistence — token in URL userinfo got persisted to .git/config
3. Misleading function name post-refactor
This PR fixes all three:
- .complete marker file written atomically; wipe-and-refetch on partial cache
- Token via -c http.extraHeader, never embedded in URL
- Defense-in-depth ref .. deny (was already validated by repoSafeRefRegex but explicit + tested)
- Renamed buildCloneURL -> buildExternalCloneURL (collision with artifacts.go), rewriteFilesDirAndIncludes -> rewriteFilesDir
- Removed unused redactToken/shortHash helpers and crypto/sha1, encoding/hex, fmt imports
Approved by platform-engineer 2026-05-08T12:55Z.
Self-review of molecule-core PR #105 + #106 (the !external resolver
chain) surfaced 3 real correctness/security gaps and 2 readability
nits. Fixes all four in one PR since they're the same file's hardening.
(1) TOKEN LEAKAGE — fixed
Before: gitFetcher built clone URLs with auth in userinfo
(https://oauth2:TOKEN@host/repo.git). Two leak paths:
a. Token persisted in cloned repo's .git/config
b. Token could appear in clone error output captured via
cmd.CombinedOutput()
After: clone URL has no userinfo (https://host/repo.git). Auth is
layered on via -c http.extraHeader=Authorization: token ...
which sends the header per-request without persisting. Plus a
redactToken() pass over any error string before it surfaces in
fmt.Errorf, as belt-and-braces.
Tradeoff: token now visible in 'ps aux' for the duration of the
git child process (same as before via env var), but no longer
in any persistent state.
(2) CACHE-VALIDITY FOOTGUN — fixed
Before: cache-hit was 'cacheDir/.git exists'. A clone interrupted
after .git was created but before content finished writing would
leave a partially-written cache that subsequent imports treated
as hit, returning stale/incomplete content forever (no self-heal).
After: cache-hit also requires a .complete marker file written
only AFTER successful clone+rename. Partially-written cache is
treated as cache-miss and re-fetched cleanly (after RemoveAll
on the partial dir to avoid blocking the new clone's mkdir).
(3) REF '..' DENY — fixed
Before: safeRefPattern '^[a-zA-Z0-9_./-]+$' allowed '..' as a
substring. Git itself rejects most refs containing '..', but
defense-in-depth says don't depend on the downstream tool's
validation when sanitizing input at the boundary.
After: explicit strings.Contains(ref.Ref, '..') check.
(4) NAMING CLEANUP — fixed
Before: rewriteFilesDirAndIncludes() — name claims to rewrite
!include scalars but doesn't (we removed that during PR-A
development; double-prefix bug). Misleading for readers.
After: rewriteFilesDir(). Docstring updated to explicitly explain
why !include paths are NOT rewritten (relative to subDir, naturally
inside cache).
Also: removed unused buildAuthedURL() (replaced by
buildExternalCloneURL + authConfigArgs split), removed unused
shortHash() helper (replaced by os.MkdirTemp), removed unused
crypto/sha1 + encoding/hex + fmt imports, removed stray
'_ = fmt.Sprint' line in integration test.
NEW TESTS
- TestGitFetcher_RejectsRefWithDoubleDot (defense-in-depth on ref input)
- TestGitFetcher_CacheValidatedByCompleteMarker (partial cache → re-fetch)
VERIFIED LOCALLY 2026-05-08
Full ./internal/handlers/ suite: ok (7.8s, 14 external-resolver tests
+ all existing tests). Two new tests cover the two new behaviors.
Refs:
internal#77 — extraction RFC
molecule-core#105 (resolver), #106 (tests) — original implementation
Hongming code-review-and-quality skill invocation 2026-05-08 + 'fix all'
PR-B (local bare-git integration, task #233):
workspace-server/internal/handlers/org_external_integration_test.go
Three tests using git's GIT_CONFIG_COUNT/KEY/VALUE env-var-injected
insteadOf URL rewrite — process-scoped, no ~/.gitconfig pollution:
- TestGitFetcher_RealClone_LocalRedirect: full resolver chain end-to-
end with REAL git clone against a local bare-repo, asserts cache
population + content materialization + path rewrite + cache-hit on
second invocation.
- TestGitFetcher_RealClone_BadRefFails: nonexistent ref surfaces
git's error cleanly through the ls-remote step.
- TestGitFetcher_DirectFetch_CacheHit: gitFetcher.Fetch direct
invocation (no resolver wrapping); verifies cache-hit returns
same dir + same SHA, no clobber.
Production code untouched — insteadOf rewrite makes the production
gitFetcher think it's cloning from Gitea, but git rewrites at clone
time to file://<barePath>. Tests the real shell-out + parsing.
PR-C (live Gitea e2e, task #234):
workspace-server/internal/handlers/local_e2e_dev_dept_test.go
TestLocalE2E_ExternalDevDepartment — minimal parent template that
uses !external against the LIVE molecule-ai/molecule-dev-department
repo. No symlink, no /tmp/local-e2e-deploy fixture. Composition
resolves over network at import time.
Asserts:
- 28+ dev-tree workspaces resolve through the fetched cache
(matches the count from TestLocalE2E_DevDepartmentExtraction)
- Q1 placement: 'Documentation Specialist' present (under app-lead)
- Q2 placement: 'Triage Operator' present (under dev-lead)
- Every workspace's files_dir is cache-prefixed (proves rewrite ran)
- Every workspace's resolveInsideRoot+Stat succeeds
(would fail provisioning if not)
Skipped if Gitea unreachable (TCP probe to git.moleculesai.app:443)
or git binary absent — won't false-fail offline runners.
VERIFIED LOCALLY 2026-05-08:
--- PASS: TestGitFetcher_RealClone_LocalRedirect (0.26s)
--- PASS: TestGitFetcher_RealClone_BadRefFails (0.15s)
--- PASS: TestGitFetcher_DirectFetch_CacheHit (0.23s)
--- PASS: TestLocalE2E_ExternalDevDepartment (0.55s)
workspaces resolved through !external: 28
Full ./internal/handlers/ test suite: ok (no regressions)
Together with PR-A's unit tests (#105), the !external resolver is now
covered at three layers:
- unit (fakeFetcher injection): allowlist, validation, path rewrite
- integration (real git, local bare-repo): clone, cache, ls-remote
- e2e (real git, live Gitea, live dev-department): full chain
Refs:
internal#77 — extraction RFC (Phase 3a phasing in comment 1995)
task #233 (PR-B), task #234 (PR-C)
Hongming GO 2026-05-08 ('do PR-B/C/D')
Adds gitops-style cross-repo subtree composition to the platform's
org-template importer. Replaces (eventually) the operator-side
filesystem symlink approach shipped in PR #5.
DESIGN
See internal#77 comment 1995 for the full design doc + decision points
agreed with Hongming 2026-05-08.
Schema: a `!external`-tagged mapping anywhere a workspace entry is
allowed (workspaces:, roots:, children:):
- !external
repo: molecule-ai/molecule-dev-department
ref: main
path: dev-lead/workspace.yaml
url: git.moleculesai.app # optional; default = MOLECULE_EXTERNAL_GITEA_URL or git.moleculesai.app
At resolve time the platform fetches the repo at ref into a content-
addressable cache under <orgBaseDir>/.external-cache/<repo>/<sha>/,
loads <cacheDir>/<path>, recursively resolves nested !include /
!external in the loaded subtree, then rewrites every files_dir scalar
in the fully-resolved subtree to be cache-prefixed. Downstream
pipeline (resolveInsideRoot, plugin merge, CopyTemplateToContainer)
sees ordinary in-tree paths.
IMPLEMENTATION
- org_external.go: ExternalRef type, fetcher interface (gitFetcher
production + injectable for tests), resolveExternalMapping resolver,
rewriteFilesDirAndIncludes path-rewrite walker, allowlistedHostPath
+ safeRefPattern + safeRepoCacheDir validation helpers.
- org_include.go: 4-line hook in expandNode dispatching MappingNode
with Tag=="!external" to resolveExternalMapping.
- org_external_test.go: 8 unit tests with fakeFetcher injection
(no network):
* happy path (top + nested workspace files_dir cache-prefixed)
* allowlist rejection (github.com/foo/bar)
* path-traversal rejection (../../etc/passwd)
* malformed ref rejection ("main; rm -rf /")
* missing required fields (repo / ref / path)
* rewriteFilesDirAndIncludes basic + idempotent
* allowlistedHostPath env-override + glob
Path rewrite ONLY rewrites files_dir scalars. !include scalars are
NOT rewritten — they resolve relative to their containing file's
directory, which post-fetch is naturally inside the cache, so
relative !includes Just Work without modification.
ALLOWLIST + AUTH
- Default allowlist: git.moleculesai.app/molecule-ai/.
- Override: MOLECULE_EXTERNAL_REPO_ALLOWLIST (comma-separated
prefixes; trailing /* or / supported).
- Auth: MOLECULE_GITEA_TOKEN env var injected into clone URL.
Optional — falls back to unauthenticated for public repos.
- Reject: malformed refs, path-traversal, non-allowlisted hosts.
CACHE
- Location: <orgBaseDir>/.external-cache/<safe-repo>/<sha>/.
Operators add to .gitignore.
- Content-addressable: same (repo, sha) reuses cache, no overwrite.
- Atomic clone via tmp-then-rename.
- Concurrency: race-tolerant — last-writer-wins on same SHA.
GC out of scope for v1 (filed as parked follow-up).
SECURITY (per SOP Phase 2)
Untrusted yaml input — all validated:
repo: allowlist (default molecule-ai/* on Gitea host)
ref: ^[a-zA-Z0-9_./-]+$ regex (rejects shell injection)
path: relative-and-down-only (rejects ../escape)
Auth: read-only token scoped to allowed orgs.
Recursion: maxExternalDepth=4 (vs maxIncludeDepth=16) to limit
network fan-out cost.
Cache poisoning: per-(repo, sha) content-addressable; can't poison
across SHAs.
Trust boundary: cloned content treated identically to a sibling-
cloned subtree (same model as current symlink approach).
VERSIONING / BACKWARDS COMPAT
Pure additive. Existing !include and inline workspaces unchanged.
Existing dev-lead symlink (parent template PR #5) keeps working.
Migration of parent template to !external is a separate PR-D.
No DB schema change. No public API change.
VERIFIED LOCALLY
go test ./internal/handlers/ → ok (5.2s, all 8 new tests + existing)
Stub fetcher injection lets unit tests cover the resolver +
path-rewrite logic without network. PR-B (follow-up) adds an
integration test against a local bare-git repo. PR-C adds the
real-Gitea e2e test against the live dev-department repo.
Refs:
internal#77 — extraction RFC (comment 1995 = Phase 1+2 design)
task #222 — this PR is Phase 3a (PR-A in the design's phasing)
Hongming GO 2026-05-08 ('go' on 4 decision points + design)
Two changes that fall out of one root cause discovered while preparing
the local platform spin-up for the dev-department extraction (internal#77):
PROBLEM
CopyTemplateToContainer's filepath.Walk is called with templatePath
set to the workspace's resolved files_dir. With the cross-repo
symlink composition shipped in PR #5 (parent template's
dev-lead → ../molecule-dev-department/dev-lead/), the Dev Lead
workspace's files_dir is literally 'dev-lead' — i.e. the symlink
itself, not a path THROUGH the symlink.
filepath.Walk does not descend into a symlink leaf — it Lstats the
root, sees a symlink (mode bit set, not a directory), emits exactly
one entry, and returns. Result: the workspace's /configs/ tar would
ship empty. Other 38 workspaces are fine because their files_dir
paths just TRAVERSE the symlink (path resolution handles intermediate
symlinks via Lstat traversal); only the leaf-is-symlink case breaks.
FIX
workspace-server/internal/provisioner/provisioner.go:
Call filepath.EvalSymlinks on templatePath before filepath.Walk.
Resolves the leaf-symlink case for ALL templates, not just dev-dept.
Security: templatePath has already passed resolveInsideRoot's
path-string check at the call site; the trust boundary is the
operator-side /org-templates/ filesystem layout, not this
resolution step.
TEST
workspace-server/internal/handlers/local_e2e_dev_dept_test.go:
New TestLocalE2E_FilesDirConsumption — stage-2 of the local e2e.
For every workspace in the resolved OrgTemplate, asserts:
1. resolveInsideRoot(orgBaseDir, ws.FilesDir) succeeds.
2. os.Stat on the result returns a directory.
3. filepath.Walk after EvalSymlinks (mirroring the platform fix)
emits at least one file.
4. At least one workspace marker exists (workspace.yaml,
system-prompt.md, or initial-prompt.md).
Exercises the SECOND half of POST /org/import that
TestLocalE2E_DevDepartmentExtraction (PR #103) didn't cover.
VERIFIED LOCALLY (2026-05-08, against post-extraction Gitea state):
--- PASS: TestLocalE2E_FilesDirConsumption (0.05s)
checked 39 workspaces with files_dir
All 39 walk paths emit non-empty file sets with valid workspace markers.
REGRESSION GUARD
Without the EvalSymlinks fix, this test fails on Dev Lead with:
files_dir 'dev-lead' at '/.../molecule-dev/dev-lead' is empty —
CopyTemplateToContainer would produce empty /configs/
Refs:
internal#77 — extraction RFC
molecule-core#102 (resolver symlink contract test)
molecule-core#103 (stage-1 e2e: include resolution)
Hongming GO 2026-05-08 ('go' on the 3 pre-spin-up optimizations)
Phase 4 (local-only) of internal#77 (dev-department extraction).
Adds TestLocalE2E_DevDepartmentExtraction that exercises the FULL platform
import path against the real molecule-ai-org-template-molecule-dev (post-slim)
and molecule-ai/molecule-dev-department (post-atomize) repos cloned as siblings
under /tmp/local-e2e-deploy/.
What it proves end-to-end:
- The dev-lead symlink at parent's template root is followed by
resolveYAMLIncludes (filepath.Abs/Rel-style security check passes,
os.ReadFile follows the link).
- Recursive !include chain through the symlinked subtree resolves:
parent's org.yaml → !include dev-lead/workspace.yaml (symlinked)
→ !include ./core-lead/workspace.yaml → !include ./core-be/workspace.yaml
(atomized children: paths, no '..').
- 39 workspaces enumerate after resolution: 5 PM-tree + 6 Marketing-tree
+ 28 dev-tree (Dev Lead + 5 sub-team leads + 18 leaf workspaces +
3 floaters + 1 triage-operator).
- Q1+Q2 placements verified by sentinel name check: 'Documentation
Specialist' is reachable (under app-lead via app-docs sub-team),
'Triage Operator' is reachable (direct child of Dev Lead).
Test skips with t.Skipf if the local-e2e fixture isn't present on the
host — won't block CI on hosts that haven't set it up. To set up locally:
TESTROOT=/tmp/local-e2e-deploy
mkdir -p $TESTROOT && cd $TESTROOT
git clone https://git.moleculesai.app/molecule-ai/molecule-ai-org-template-molecule-dev.git molecule-dev
git clone https://git.moleculesai.app/molecule-ai/molecule-dev-department.git
cd /Users/<you>/molecule-core/workspace-server
go test -v -run TestLocalE2E_DevDepartmentExtraction ./internal/handlers/
Verified locally 2026-05-08:
--- PASS: TestLocalE2E_DevDepartmentExtraction (0.01s)
total workspaces (recursive): 39
Refs:
internal#77 — extraction RFC
molecule-core PR #102 — symlink-resolution contract test
molecule-ai/molecule-dev-department PRs #1, #2, #3 (scaffold + extract + atomize)
molecule-ai/molecule-ai-org-template-molecule-dev PR #5 (parent slim + symlink wire)
Hongming GO 2026-05-08 ('lets not go for staging right now, we do local test first')
SOP Phase 4 (local) — task #226
Two new tests in workspace-server/internal/handlers/org_include_test.go:
- TestResolveYAMLIncludes_FollowsDirectorySymlink: parent template's
org.yaml `!include`s into a sibling-repo subtree via a relative
directory symlink. The resolver's filepath.Abs/Rel security check
operates on path strings (passes), and os.ReadFile follows the
symlink at OS layer (file content delivered). Recursive nested
`!include`s within the symlinked subtree resolve correctly because
filepath.Dir(absTarget) keeps the literal symlink path as currentDir.
- TestResolveYAMLIncludes_RejectsSymlinkEscapingRoot: companion test
pinning current behavior where a symlink target outside the parent
root is followed (resolveInsideRoot doesn't EvalSymlinks). Asserted
as 'should resolve' so future hardening (if filepath.EvalSymlinks
is added) flips the test red and forces a coordinated update to the
dev-department subtree-composition pattern.
Why now: internal#77 RFC (dev-department extraction) selects symlink-
based composition over a future platform-level external: ref. These
tests pin the contract before the operator-side symlink convention
gets shipped, so a refactor or hardening of the resolver can't
silently break the production org-import path.
No production code changes. Pure additive test coverage.
Refs: internal#77 (Phase 3b verification — task #223)
Closes the post-Task-#176 self-review gap: the bearer-token + tenant-
slug header construction was duplicated across 7 raw-fetch callsites
in the canvas (lib/api.ts request(), uploads.ts × 2, and 5 Attachment*
components). Each callsite read NEXT_PUBLIC_ADMIN_TOKEN, attached
Authorization: Bearer manually, computed getTenantSlug locally
(three of them inline-redefined it from /lib/tenant!), and attached
X-Molecule-Org-Slug. A new poller / raw-fetch added without going
through this exact recipe silently 401s against workspace-server when
ADMIN_TOKEN is set on the server side — the bug shape called out in
the original task.
Adds platformAuthHeaders() to lib/api.ts as the single source of truth
and routes all 7 raw-fetch callsites through it. Removes 4 duplicate
local getTenantSlug() copies (Image, Video, Audio, PDF, TextPreview)
that were inline-redefining what /lib/tenant.ts already exports.
Also preserves the AttachmentTextPreview off-platform branch — when
isPlatformAttachment() is false, headers is {} (no bearer leakage to
third-party URLs).
Tests:
- 6 unit tests in platform-auth-headers.test.ts covering: empty,
bearer-only, slug-only, both, empty-string-as-unset, fresh-object-
per-call. Mutation-tested: removing the bearer attach inside the
helper fails 2 of 6 tests immediately.
- All 1389 existing canvas vitest tests pass unchanged.
- npx tsc --noEmit clean.
- npm run build succeeds (canvas Next.js build).
Per feedback_assert_exact_not_substring: tests use exact toEqual()
equality, not substring/contains, so an extra-header bug also fails
the assertion. Per feedback_oss_design_philosophy: this is the
"plugin/abstract/modular/SSOT" move applied to the auth-header
construction surface — one helper, six call sites, no duplication.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 14:36:02 -07:00
212 changed files with 14060 additions and 2899 deletions
echo"::notice::clause [$_label]: PASS — satisfied by approving reviewer(s)"
else
_failed_clauses="${_failed_clauses:+, }$_label"
echo"::error::clause [$_label]: FAIL — no approving reviewer belongs to any of these teams${_clause_names}. Set SOP_DEBUG=1 to see per-team probe results."
fi
done
if[ -n "$_failed_clauses"];then
echo""
echo"::error::sop-tier-check FAILED for $TIER."
echo" Passed :${_passed_clauses}"
echo" Missing:${_failed_clauses}"
echo" All clauses must be satisfied. Each missing team needs an APPROVED review from one of its members."
exit1
fi
echo"::notice::sop-tier-check PASSED: $TIER — all required clauses satisfied [${_passed_clauses}]"
Automated promotion of \`staging\` (\`${TARGET_SHA:0:8}\`) to \`main\`. All required staging gates are green at this SHA (combined status reported success).
This PR is auto-generated by \`.github/workflows/auto-promote-staging.yml\` whenever every required gate completes green on the same staging SHA.
**Approvalgate:** \`main\` branch protection requires 1 approval before this can land. Once approved, Gitea will auto-merge (the workflow scheduled \`merge_when_checks_succeed:true\` immediately after open).
The reverse-direction sync (the merge commit on \`main\` → \`staging\`) is handled automatically by \`auto-sync-main-to-staging.yml\` after this PR lands.
---
- Source:staging at \`${TARGET_SHA}\`
- Opened by:\`devops-engineer\` persona (anti-bot-ring; never founder PAT)
if grep -qE '^[[:space:]]*merge_group:' "$owning_wf"; then
echo "OK: '${check}' (in $owning_wf) — has merge_group trigger"
else
echo "::error file=${owning_wf}::Required check '${check}' is produced by ${owning_wf}, but the workflow does not declare a 'merge_group:' trigger. With merge queue enabled on ${BRANCH}, this will deadlock the queue (every PR sits AWAITING_CHECKS forever). Add this to the workflow's 'on:' block:"
CMT_REQ=$(jq -n '{body:"[retarget-bot] Closing — another PR on the same head branch already targets `staging`. This PR is redundant. See issue#1884 for the rationale."}')
# (PRs ARE issues — same endpoint, different sub-resources
# for diffs/files/etc.). The body uses jq to safely
# encode the multi-line markdown without shell-quote
# nightmares.
REQ=$(jq -n '{body:"[retarget-bot] This PR was opened against `main` and has been retargeted to `staging` automatically.\n\n**Why:** per [SHARED_RULES rule 8](https://git.moleculesai.app/molecule-ai/molecule-ai-org-template-molecule-dev/src/branch/main/SHARED_RULES.md), all feature work targets `staging` first; the CEO promotes `staging → main` separately.\n\n**What changed:** just the base branch — no code change. CI will re-run against `staging`. If you get merge conflicts, rebase on `staging`.\n\n**If this PR is the CEO`s staging→main promotion:** the Action skipped you (only bot-authored PRs are retargeted, head=staging is also exempted). If you see this comment on your CEO PR, that`s a bug — please tag @hongmingwang."}')
<buttontype="button"onClick={onDownloadAll}aria-label="Download all files"className="text-[10px] text-ink-soft hover:text-ink-mid" title="Download all files">
<buttontype="button"onClick={onDownloadAll}aria-label="Download all files"className="text-[10px] text-ink-mid hover:text-ink-mid" title="Download all files">
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.