Phase 4 follow-up to template-claude-code PR #9 (2026-05-08 dev-tree wedge).
Pre-fix: applyRuntimeModelEnv unconditionally overwrote envVars["MODEL"]
with the MODEL_PROVIDER slug whenever payload.Model was empty (the restart
path). This silently wiped the operator'\''s explicit per-persona MODEL
secret on every restart.
Symptom: dev-tree workspaces booted correctly on first /org/import (the
envVars map was populated direct from the persona env file with both
MODEL=MiniMax-M2.7-highspeed and MODEL_PROVIDER=minimax), then on the
next Restart the MODEL secret got clobbered to literal "minimax" — a
provider slug, not a valid model id — and the workspace template'\''s
adapter failed to match any registry prefix, fell through to providers[0]
(anthropic-oauth), and wedged at SDK initialize.
Fix: resolution order in applyRuntimeModelEnv is now:
1. payload.Model (caller passed the canvas-picked model id verbatim)
2. envVars["MODEL"] (workspace_secret persisted from persona env)
3. envVars["MODEL_PROVIDER"] (legacy canvas Save+Restart shape)
Tests
-----
TestApplyRuntimeModelEnv_PersonaEnvMODELSecretPreserved — locks in
the new resolution order with four cases:
- MODEL secret wins over MODEL_PROVIDER slug (persona-env shape)
- MODEL secret wins even when same as MODEL_PROVIDER
- MODEL absent → fall back to MODEL_PROVIDER (legacy shape)
- Both absent → no MODEL set (no-op)
Existing TestApplyRuntimeModelEnv_SetsUniversalMODELForAllRuntimes
continues to pass — fix is strictly additive on the precedence chain.
Lets a workspace declare it (and its entire subtree) should be skipped
during /org/import. Pointer-typed `*bool` so we distinguish "explicitly
false" from "unset" (default = spawn).
## Use case
The dev-tree org template ships the full role taxonomy (Dev Lead with
Core Platform / Controlplane / App & Docs / Infra / SDK Leads, each with
their own engineering / QA / security / UI-UX children — 27 personas
total in a single import). Some setups need a smaller set:
- Local dev on a memory-constrained machine
- Demo / smoke runs that don't need the full org breathing
- Customer trials starting with leadership-only before fan-out
Pre-fix the only options were:
- Edit the canonical template (mutates shared state)
- Author a parallel slimmer template (duplicates structure)
- Manual workspace deprovision after full import (wasteful — already paid
the docker pull / build cost)
`spawning: false` is the per-workspace knob that solves this without
touching the canonical template structure.
## Semantics
- Unset: workspace spawns (current behaviour, no migration)
- `spawning: true`: explicitly spawns (same as unset)
- `spawning: false`: workspace is skipped AND every descendant is
skipped. The guard sits BEFORE any side effect in
createWorkspaceTree — no DB row, no docker provision, no children
recursion. A false-spawning subtree is genuinely a no-op except for
the log line. countWorkspaces still counts the subtree (so /org/templates
numbers reflect the full structure).
## Stage A — verified
Local dev-only template that wraps teams/dev.yaml (Dev Lead) with
children:[] cleared on the 5 sub-team yaml files, plus 3 floater
personas (Release Manager / Integration Tester / Fullstack Engineer).
/org/import returned 9 workspaces. Drop-in: same result via
`spawning: false` on each sub-tree root in the future.
## Stage B — N/A
Pure additive feature on the org-template handler. No SaaS deploy chain
implications.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes core#115 partial. Schema-only change; the apply-endpoint filter
logic that reads this column lands with core#123 (drift detector +
queue + apply endpoint, the deferred follow-up of core#113).
Default 'production' so existing customers (Reno-Stars + any future
tenant) are default-safe. Synthetic dogfooding workspaces opt INTO
'canary' explicitly.
CHECK constraint pins the closed value set ('canary' | 'production') —
the apply endpoint's filter relies on the database to reject anything
else, so a future operator typo in PATCH /workspaces/:id ({update_tier:
'canery'}) returns a constraint violation, not silent fan-out to
nobody.
Partial index on canary rows since the apply-endpoint query path
('apply this update only to canary tier first') hits canary much more
often than production, and the production set is the much larger
default.
WHAT THIS DOES NOT DO (lands with core#123)
- PATCH endpoint to flip a workspace to canary
- The apply endpoint that consults the column
- Tests that exercise canary-vs-production fan-out
Schema-only foundation; same pattern as core#113 (workspace_plugins).
PHASE 4 SELF-REVIEW
Correctness: No finding — IF NOT EXISTS guards, DEFAULT clause means
existing rows get 'production' on migration apply.
Readability: No finding — comment block documents the tier semantics
+ the deferral to core#123.
Architecture: No finding — additive ALTER, partial index for the
expected access pattern.
Security: No finding — no code path; column constraint reduces blast
radius of bad PATCH input.
Performance: No finding — partial index minimizes write amplification
on the production-default rows.
REFS
core#115 — this issue
core#123 — apply endpoint follow-up (will exercise this column)
core#113 — version subscription DB foundation (sibling pattern)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes core#113 partial. Adds the DB foundation for the
version-subscription model. Drift detection + queue + admin apply
endpoint are follow-up scope (separate PR; filed as a new issue).
WHY THIS PR ONLY GETS US PART-WAY
Plugin install state today is filesystem-only — '/configs/plugins/<name>/'
inside the container. There's no DB record of 'plugin X installed at
workspace W from source S, tracking ref T'. That makes drift detection
impossible: nothing to compare upstream tags against.
This PR adds the table + the install-endpoint hook that writes to it.
With baseline tags now on every plugin (post internal#92), the table
starts collecting tracked-ref values immediately on the next install.
The actual drift-check job + queue + apply endpoint layer on top.
WHAT THIS ADDS
workspace_plugins table:
workspace_id FK → workspaces(id) ON DELETE CASCADE
plugin_name canonical name from plugin.yaml
source_raw full source URL the install used
tracked_ref 'none' | 'tag:vX.Y.Z' | 'tag:latest' | 'sha:<full>'
installed_at, updated_at
installRequest gains optional 'track' field (defaults to 'none').
Install handler upserts the workspace_plugins row after delivery
succeeds. DB write failure is logged but doesn't fail the install
(the plugin IS in the container; surfacing 500 misleads the caller).
validateTrackedRef enforces the closed set of accepted shapes:
'none' | 'tag:<non-empty>' | 'sha:<non-empty>'
Bare values like 'latest' / 'main' / version-strings without
prefix are rejected — the drift detector keys on prefix to know
what kind of resolution to do.
WHAT THIS DOES NOT ADD (filed separately)
- Drift detector job (cron / on-demand) that scans
'WHERE tracked_ref != none' rows and queues updates on upstream drift
- plugin_update_queue table (separate migration once detector lands)
- GET /admin/plugin-updates-pending and POST .../apply endpoints
- Tier-aware apply (core#115 — composes here)
PHASE 4 SELF-REVIEW (FIVE-AXIS)
Correctness: No finding — install endpoint behavior unchanged for
callers that don't pass 'track'. DB write is best-effort + logged
on failure. validateTrackedRef rejects ambiguous bare strings.
Readability: No finding — separate file plugins_tracking.go isolates
the new concern; install handler delta is a single 4-line block.
Architecture: No finding — additive table; existing schema untouched.
Migration 20260508160000_* uses the timestamp-prefixed convention.
Security: No finding — INSERT params via placeholders (no string
interpolation). validateTrackedRef rejects unexpected shapes before
the column constraint would.
Performance: No finding — one extra ExecContext per install. Install
is already seconds-scale (network fetch + tar + docker exec); rounds
to noise.
TESTS (1 new, all green)
TestValidateTrackedRef — pin closed set + structural validators
REFS
core#113 — this issue (foundation only; drift+queue+apply = follow-up)
internal#92, internal#93 — plugin/template baseline tags (now exists for tracking)
core#114 — atomic install (this PR composes — no atomicity regression)
core#115 — canary tier filter (will key off the same DB foundation)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes molecule-core#112. Composes with #114 (atomic install).
Before issuing restartFunc, classify the diff between staged and live:
- skill-content-only: only **/SKILL.md content changed
→ skip restart (Claude Code re-reads SKILL.md on
each Skill invocation; no in-memory cache)
- cold: anything else
→ restartFunc as before
(hooks/settings load at session start;
plugin.yaml is structural; added/removed files
require a fresh load)
DETECTION
- Hash every regular file in staged tree (host filesystem, sha256)
- Hash every regular file in live tree (in-container via docker exec
sh -c 'cd <livePath> && find . -type f -print0 | xargs -0 sha256sum')
- .complete marker dropped from comparison (mtime varies install-to-
install; including it would force-cold every reinstall)
- File added/removed → cold
- File content differs but isn't SKILL.md → cold
- All differences are SKILL.md basenames → skill-content-only
DEFAULTS COLD
- First install (no live tree) → cold
- Live tree read failure → cold (conservative; never hot-reload speculatively)
- Symlinks skipped during hash (same posture as tar walker)
PHASE 4 SELF-REVIEW
Correctness: No finding — all error paths default to cold; never
falsely classify as skill-content-only. The .complete drop is
a deliberate exception (the marker is bookkeeping, not content).
Readability: No finding — single-purpose helpers (hashLocalTree,
hashContainerTree, isSkillMarkdown, shQuote) each do one thing.
The classifier itself reads as 'compare set, then walk diff with
isSkillMarkdown gate.'
Architecture: No finding — composes existing execAsRoot primitive;
new helpers in plugins_classifier.go don't touch any other
handler. Old behavior unchanged when live read fails.
Security: No finding — shQuote single-quotes any non-trivial path,
pluginName comes from validatePluginName-validated source, and
the docker exec command takes the path as a single arg (xargs -0
handles binary-safe path delimiting). Symlinks skipped.
Performance: No finding — adds two tree walks (host + container)
per install. Container walk is one docker exec call returning
sha256 lines; for typical plugins (~10-50 files) round-trip is
~100ms. Versus the saved ~5-10s of restart on a hot-reloadable
update, this is a clear win.
TESTS (4 new, all green; full handler suite green)
TestIsSkillMarkdown — basename match, case-sensitive
TestHashLocalTree_StableHash — re-hash same dir = same map
TestHashLocalTree_SymlinkSkipped — hostile link doesn't poison classifier
TestShQuote — quoting boundary for shell injection safety
REFS
molecule-core#112 — this issue
molecule-core#114 — atomic install (.complete marker added there)
Reno-Stars iteration safety (Hongming 2026-05-08)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes molecule-core#114 for the docker (local-OSS) path.
EIC (SaaS) path tracked as a follow-up — same shape, different
exec primitives (ssh vs docker exec); shipping both in one PR
doubles the test surface.
THE FOUR-STEP DANCE
1. STAGE — docker.CopyToContainer extracts tar into
/configs/plugins/.staging/<name>.<ts>/
2. SNAPSHOT — if /configs/plugins/<name>/ exists, mv to
/configs/plugins/.previous/<name>.<ts>/
3. SWAP — atomic mv staging → live (single rename(2))
4. MARKER — touch /configs/plugins/<name>/.complete
Workspace-side plugin loaders should refuse to load any plugin dir
without .complete (separate small change, not in this PR — the marker
write is the necessary precursor; consumer side is a follow-up so
existing-content plugins don't break before they're re-installed).
ROLLBACK
- Stage failure: rm -rf staging dir; live untouched
- Snapshot failure: rm -rf staging dir; live untouched (no rename happened)
- Swap failure with snapshot present: mv previous back to live
- Swap failure (no snapshot): rm -rf staging; live (which never
existed) stays absent
- Marker failure: content already in place, log loudly with manual
recovery hint (touch <plugin>/.complete) — don't roll back since
the new content is what we wanted, just unmarked
GC
Best-effort delete of previous-version snapshot after successful
marker write. Failures non-fatal — next install or a separate
sweeper reclaims. Sweeper for stale .previous/* across reboots is
follow-up scope.
CONCURRENCY
Each install gets a unique stamp (UTC second precision), so two
concurrent reinstalls land in distinct staging dirs and the second
swap simply overwrites the first's live result. The atomicity is
per-install, not cross-install — by design (the platform serializes
POST /workspaces/:id/plugins via Go-side semaphore upstream of
this code, so cross-install collisions don't reach here).
CHANGES
+ plugins_atomic.go — installVersion + atomicCopyToContainer
+ plugins_atomic_tar.go — tarWalk/tarHostDirWithPrefix helpers
+ plugins_atomic_test.go — 5 unit tests (paths, stamp shape,
tar happy path, symlink-skip, prefix
normalization). All green.
~ plugins_install_pipeline.go::deliverToContainer — swap
copyPluginToContainer call to atomicCopyToContainer
Old copyPluginToContainer is retained (still called by Download()) so
this PR is purely additive on the install path; no public API change.
PHASE 4 SELF-REVIEW (FIVE-AXIS)
Correctness: Required (addressed) — swap-failure rollback writes mv
of previous back to live before returning the error; if rollback
itself fails, we wrap both errors and surface the combined fault.
Marker-write failure is treated as content-landed-but-unmarked
(LOG, don't roll back the new content).
Readability: No finding — installVersion path methods make the
/staging/.previous/live/marker layout obvious from one struct.
tarWalk extracted from the inline filepath.Walk in
plugins_install_pipeline.go for testability.
Architecture: No finding — atomicCopyToContainer composes existing
execAsRoot / docker.CopyToContainer primitives; no new dependencies.
Old copyPluginToContainer kept for Download() — single responsibility
per function.
Security: No finding — symlinks still skipped during tar walk
(defense vs hostile plugin escaping its own dir). Marker writes
use composeable path.Join, no user input touches the path.
Performance: No finding — adds ~3 docker exec calls per install
(mkdir, mv-snapshot, mv-swap, touch — actually 4) on top of the
one CopyToContainer. Each exec ~50-100ms in practice; install
end-to-end was already seconds-scale, this rounds to noise.
REFS
molecule-core#114 — this issue
Companion: molecule-core#112 (hot-reload classifier — depends on .complete marker)
Companion: molecule-core#113 (version subscription — uses install machinery)
EIC follow-up: separate issue to be filed for SaaS path parity
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes core#116. Brings local-dev iteration parity with the canvas's
Turbopack HMR — edit a Go file, see the platform restart in <5s
instead of running 'docker compose up --build' (~30s) per change.
USAGE
make dev # docker compose with air-driven live reload
make up # production-shape stack (no air, normal Dockerfile)
WHAT THIS ADDS
workspace-server/.air.toml — air watch config
workspace-server/Dockerfile.dev — air-on-golang:1.25-alpine, dev-only
docker-compose.dev.yml — overlay swapping platform service
to Dockerfile.dev + bind-mounting
workspace-server/ source
Makefile — make {dev,up,down,logs,build,test}
WHAT THIS DOES NOT TOUCH
workspace-server/Dockerfile (production multi-stage build)
docker-compose.yml (prod-shape stack)
CI workflows (build prod image directly)
Tenant deployment / SaaS (image swap stays the model)
Pure additive. Existing 'docker compose up' path unchanged; production
stays on the static binary. Air install pinned via go install at image
build time so the dev image is reproducible-enough for local use (we
don't pin air to a SHA — the dev image is rebuilt locally and updates
opportunistically).
PHASE 4 SELF-REVIEW (FIVE-AXIS)
Correctness: No finding — additive change, no existing path modified.
.air.toml watches .go + .yaml under workspace-server/, excludes
_test.go and tests dir so test edits don't trigger rebuild.
Dockerfile.dev mirrors prod's 'go mod download' so first rebuild
is fast.
Readability: No finding — three small files plus a Makefile, each
with header comments explaining the WHY, not just the WHAT. The
Makefile uses the standard ## help-target pattern.
Architecture: No finding — overlay pattern (docker-compose.dev.yml
on top of docker-compose.yml) is the standard compose convention
for env-specific overrides. Doesn't fork the prod path.
Security: No finding because no production code path; dev-only image
isn't built in CI and isn't published to ECR.
Performance: No finding — air debounce=500ms, exclude_unchanged=true
so a save that doesn't change content is a no-op rebuild.
REFS
core#116 — this issue
Companion: core#117 (workspace-side config-watcher for hot-reload of
config.yaml) — different scope; this issue is platform-only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TestStartSweeper_TransientErrorDoesNotCrashLoop leaks an in-flight
metric write across the test boundary: cycleDone fires inside the
fake's Sweep defer (before Sweep returns), waitForCycle returns
immediately after, cancel() lands, but the goroutine still has
metrics.PendingUploadsSweepError() to execute. Whether that write
happens before or after the next test's metricDelta() baseline read
is a coin-flip on slow CI hosts.
Outcome: TestStartSweeper_RecordsMetricsOnSuccess fails with
"error counter delta = 1, want 0" — looks like a real bug, isn't.
Instrumented analysis (per the file's existing waitForMetricDelta
docstring covering the same shape) confirms the metric IS getting
recorded, just AFTER the next test reads its baseline.
The Records* tests already use waitForMetricDelta to close this race
on their own assertions. This change extends the same shape to
TransientErrorDoesNotCrashLoop so it doesn't poison subsequent tests'
baselines.
Verified by running `go test -race -count=20 ./internal/pendinguploads/...`
locally — passes deterministically.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trunk-based migration final cleanup for molecule-core. The 6 workflows
deleted here all existed to manage the staging↔main branch dance that
trunk-based makes obsolete:
- auto-promote-staging.yml fast-forward staging→main on green
- auto-promote-on-e2e.yml alt promote path on E2E green
- auto-promote-stale-alarm.yml alarm if staging promotion stalls
- auto-sync-main-to-staging.yml sync main→staging after UI merges
- auto-sync-canary.yml dry-run probe of the auto-sync
token+push path
- retarget-main-to-staging.yml rebase open PRs onto staging
After Phase 3A (PR #108 promoted 5 staging-only feature PRs to main)
and Phase 3B (PR #109 dropped staging-branch triggers from the 4 e2e
workflows), main is the only branch the CI cares about. None of the
above workflows have anything to do; they're 1977 lines of dead Go-time-
no-Gitea-time-yes code.
Rollback: `git revert` this commit to restore the workflows. They still
work mechanically; trunk-based just doesn't need them.
The `staging` branch on the remote is deleted in a follow-up step
(`git push origin --delete staging`) after this PR merges, so reviewers
can confirm CI runs cleanly on the new shape before the ref disappears.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires the 28 dev-tree persona credentials minted 2026-05-08 into the
workspace-secrets path used by org_import. When a workspace.yaml carries
`role: <name>`, the importer now reads
$MOLECULE_PERSONA_ROOT/<role>/env (default
/etc/molecule-bootstrap/personas/<role>/env, populated by the bootstrap
kit on the tenant host) and merges the role's GITEA_USER /
GITEA_TOKEN / GITEA_TOKEN_SCOPES / GITEA_USER_EMAIL /
GITEA_SSH_KEY_PATH into the same envVars map that already feeds
workspace_secrets via parseEnvFile + crypto.Encrypt + INSERT.
PRECEDENCE
Persona env is the LOWEST layer:
0. Persona env (per-role)
1. Org root .env (shared)
2. Workspace .env (per-workspace)
Each later layer overrides the previous, so a workspace .env can
pin a different GITEA_TOKEN if it ever needs to (testing, override).
WHY THIS LAYERING
Workspaces should boot with the role's identity by default. .env
files stay the explicit-override mechanism for the (rare) case where
a workspace needs to deviate. No new behavior for workspaces with no
role: persona load is silent no-op when ws.Role is empty or unsafe.
SECURITY
isSafeRoleName accepts only [A-Za-z0-9_-]+ (no '..', '/', or
separators) — admin-only construct, but defense-in-depth keeps the
persona dir shape invariant. Test
TestLoadPersonaEnvFile_RejectsTraversal pins the rejection set against
a planted target file.
OPERATOR-HOST CONTRACT
The 28 persona env files live at /etc/molecule-bootstrap/personas/<role>/env
(mode 600, owner root:root) with the per-role token-scope tailoring
Hongming approved 2026-05-08 (D5). Synced via task #241. Override via
MOLECULE_PERSONA_ROOT for tests + non-prod hosts.
TESTS (7 new, all green)
TestLoadPersonaEnvFile_HappyPath — typical persona-env shape
TestLoadPersonaEnvFile_MissingDir — silent no-op when file absent
TestLoadPersonaEnvFile_EmptyRole — silent no-op when role empty
TestLoadPersonaEnvFile_RejectsTraversal — planted file unreachable
via '../../etc/passwd' etc.
TestLoadPersonaEnvFile_DefaultRoot — falls back to /etc/...
TestLoadPersonaEnvFile_OverwritesEmptyMap
TestIsSafeRoleName_Acceptance — positive + negative role names
PHASE 4 SELF-REVIEW (FIVE-AXIS)
Correctness: No finding — additive change, silent no-op on the ws.Role==''
path covers every existing workspace; tests cover happy path + each
rejection mode + missing-dir.
Readability: No finding — helper sits next to parseEnvFile in
org_helpers.go with a comment block explaining WHY persona is
lowest precedence.
Architecture: No finding — fits the existing 'merge .env into envVars
then INSERT INTO workspace_secrets' pattern that's been in place
since the .env-driven workspace secrets feature; no new dependencies,
no new tables.
Security: Required (addressed) — path traversal blocked by
isSafeRoleName. No finding beyond that since persona files are
admin-managed and the helper does not log token values.
Performance: No finding — one extra os.ReadFile per workspace at
import time; amortized over workspace lifetime, cost is negligible.
REFS
internal#85 — RFC for SOP Phase 4 + structured Five-Axis (parent context)
Saved memories: feedback_per_agent_gitea_identity_default,
feedback_unified_credentials_file
Task #241 — operator-host sync (already DONE; populated 28 dirs)
Task #242 — this PR
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Harness Replays job failed at "dependency failed to start: container
harness-tenant-alpha-1 is unhealthy" — that is not caused by this
merge (which adds workspace-server/internal/handlers code, not
container infra). Retry to confirm it was a transient environmental
issue (likely operator-host load/disk per internal#78).
Trunk-based migration: main is the only branch. Update 4 workflows
that fired on staging-branch pushes to fire on main instead.
- e2e-staging-canvas.yml: drop staging from push + pull_request
- e2e-staging-external.yml: drop staging from push + pull_request
- e2e-staging-saas.yml: drop staging from push + pull_request,
update header comment that references the (now-obsolete)
staging→main auto-promote flow
- redeploy-tenants-on-staging.yml: workflow_run.branches changes
from [staging] to [main] so the tenant redeploy fires when
publish-workspace-server-image runs on main
Workflows that target the staging tenant FLEET (canary-staging.yml,
e2e-staging-sanity.yml) are not changed — they fire on cron, the word
"staging" in their filenames refers to the deployment target environ-
ment, not the git branch.
Lands as Phase 3b after #108 promotes the 5 staging-only feature PRs
(Phase 3a). Phase 3c deletes the obsolete promote/sync workflows
(auto-promote-staging, auto-sync-main-to-staging, etc.) plus the
staging branch itself, after we no-op-verify both Phase 3a and 3b
green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was supposed to fast-forward when each PR merged on staging,
but auto-promote-staging.yml has not been firing reliably on Gitea
since the GitHub suspension. Result: main is missing 5 substantive
feature PRs that landed on staging between 2026-04-29 and 2026-05-07:
- #102: test(org-include) symlink-based subtree composition contract
- #103: test(local-e2e) dev-department extraction end-to-end
- #104: fix(provisioner)+test EvalSymlinks templatePath; stage-2 e2e
- #105: feat(org-import) !external cross-repo subtree resolver (#222)
- #106: test(org-external) integration + e2e for !external resolver
Each PR was independently reviewed and CI-green at staging-merge time;
this commit promotes the merged state atomically. Use git log on main
after the merge to see the original PR-merge commits preserved.
Sister work: Phase 3 of internal#81 (trunk-based migration). Workflow
trigger updates land in a follow-up PR; staging-branch deletion happens
after a no-op verification deploy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five-Axis self-review pass on the !external resolver work (PRs #105+#106) caught three real issues that the unstructured 3-weakest review missed:
1. Cache validity gap — partial cache writes looked complete
2. Token persistence — token in URL userinfo got persisted to .git/config
3. Misleading function name post-refactor
This PR fixes all three:
- .complete marker file written atomically; wipe-and-refetch on partial cache
- Token via -c http.extraHeader, never embedded in URL
- Defense-in-depth ref .. deny (was already validated by repoSafeRefRegex but explicit + tested)
- Renamed buildCloneURL -> buildExternalCloneURL (collision with artifacts.go), rewriteFilesDirAndIncludes -> rewriteFilesDir
- Removed unused redactToken/shortHash helpers and crypto/sha1, encoding/hex, fmt imports
Approved by platform-engineer 2026-05-08T12:55Z.
Self-review of molecule-core PR #105 + #106 (the !external resolver
chain) surfaced 3 real correctness/security gaps and 2 readability
nits. Fixes all four in one PR since they're the same file's hardening.
(1) TOKEN LEAKAGE — fixed
Before: gitFetcher built clone URLs with auth in userinfo
(https://oauth2:TOKEN@host/repo.git). Two leak paths:
a. Token persisted in cloned repo's .git/config
b. Token could appear in clone error output captured via
cmd.CombinedOutput()
After: clone URL has no userinfo (https://host/repo.git). Auth is
layered on via -c http.extraHeader=Authorization: token ...
which sends the header per-request without persisting. Plus a
redactToken() pass over any error string before it surfaces in
fmt.Errorf, as belt-and-braces.
Tradeoff: token now visible in 'ps aux' for the duration of the
git child process (same as before via env var), but no longer
in any persistent state.
(2) CACHE-VALIDITY FOOTGUN — fixed
Before: cache-hit was 'cacheDir/.git exists'. A clone interrupted
after .git was created but before content finished writing would
leave a partially-written cache that subsequent imports treated
as hit, returning stale/incomplete content forever (no self-heal).
After: cache-hit also requires a .complete marker file written
only AFTER successful clone+rename. Partially-written cache is
treated as cache-miss and re-fetched cleanly (after RemoveAll
on the partial dir to avoid blocking the new clone's mkdir).
(3) REF '..' DENY — fixed
Before: safeRefPattern '^[a-zA-Z0-9_./-]+$' allowed '..' as a
substring. Git itself rejects most refs containing '..', but
defense-in-depth says don't depend on the downstream tool's
validation when sanitizing input at the boundary.
After: explicit strings.Contains(ref.Ref, '..') check.
(4) NAMING CLEANUP — fixed
Before: rewriteFilesDirAndIncludes() — name claims to rewrite
!include scalars but doesn't (we removed that during PR-A
development; double-prefix bug). Misleading for readers.
After: rewriteFilesDir(). Docstring updated to explicitly explain
why !include paths are NOT rewritten (relative to subDir, naturally
inside cache).
Also: removed unused buildAuthedURL() (replaced by
buildExternalCloneURL + authConfigArgs split), removed unused
shortHash() helper (replaced by os.MkdirTemp), removed unused
crypto/sha1 + encoding/hex + fmt imports, removed stray
'_ = fmt.Sprint' line in integration test.
NEW TESTS
- TestGitFetcher_RejectsRefWithDoubleDot (defense-in-depth on ref input)
- TestGitFetcher_CacheValidatedByCompleteMarker (partial cache → re-fetch)
VERIFIED LOCALLY 2026-05-08
Full ./internal/handlers/ suite: ok (7.8s, 14 external-resolver tests
+ all existing tests). Two new tests cover the two new behaviors.
Refs:
internal#77 — extraction RFC
molecule-core#105 (resolver), #106 (tests) — original implementation
Hongming code-review-and-quality skill invocation 2026-05-08 + 'fix all'
PR-B (local bare-git integration, task #233):
workspace-server/internal/handlers/org_external_integration_test.go
Three tests using git's GIT_CONFIG_COUNT/KEY/VALUE env-var-injected
insteadOf URL rewrite — process-scoped, no ~/.gitconfig pollution:
- TestGitFetcher_RealClone_LocalRedirect: full resolver chain end-to-
end with REAL git clone against a local bare-repo, asserts cache
population + content materialization + path rewrite + cache-hit on
second invocation.
- TestGitFetcher_RealClone_BadRefFails: nonexistent ref surfaces
git's error cleanly through the ls-remote step.
- TestGitFetcher_DirectFetch_CacheHit: gitFetcher.Fetch direct
invocation (no resolver wrapping); verifies cache-hit returns
same dir + same SHA, no clobber.
Production code untouched — insteadOf rewrite makes the production
gitFetcher think it's cloning from Gitea, but git rewrites at clone
time to file://<barePath>. Tests the real shell-out + parsing.
PR-C (live Gitea e2e, task #234):
workspace-server/internal/handlers/local_e2e_dev_dept_test.go
TestLocalE2E_ExternalDevDepartment — minimal parent template that
uses !external against the LIVE molecule-ai/molecule-dev-department
repo. No symlink, no /tmp/local-e2e-deploy fixture. Composition
resolves over network at import time.
Asserts:
- 28+ dev-tree workspaces resolve through the fetched cache
(matches the count from TestLocalE2E_DevDepartmentExtraction)
- Q1 placement: 'Documentation Specialist' present (under app-lead)
- Q2 placement: 'Triage Operator' present (under dev-lead)
- Every workspace's files_dir is cache-prefixed (proves rewrite ran)
- Every workspace's resolveInsideRoot+Stat succeeds
(would fail provisioning if not)
Skipped if Gitea unreachable (TCP probe to git.moleculesai.app:443)
or git binary absent — won't false-fail offline runners.
VERIFIED LOCALLY 2026-05-08:
--- PASS: TestGitFetcher_RealClone_LocalRedirect (0.26s)
--- PASS: TestGitFetcher_RealClone_BadRefFails (0.15s)
--- PASS: TestGitFetcher_DirectFetch_CacheHit (0.23s)
--- PASS: TestLocalE2E_ExternalDevDepartment (0.55s)
workspaces resolved through !external: 28
Full ./internal/handlers/ test suite: ok (no regressions)
Together with PR-A's unit tests (#105), the !external resolver is now
covered at three layers:
- unit (fakeFetcher injection): allowlist, validation, path rewrite
- integration (real git, local bare-repo): clone, cache, ls-remote
- e2e (real git, live Gitea, live dev-department): full chain
Refs:
internal#77 — extraction RFC (Phase 3a phasing in comment 1995)
task #233 (PR-B), task #234 (PR-C)
Hongming GO 2026-05-08 ('do PR-B/C/D')
Adds gitops-style cross-repo subtree composition to the platform's
org-template importer. Replaces (eventually) the operator-side
filesystem symlink approach shipped in PR #5.
DESIGN
See internal#77 comment 1995 for the full design doc + decision points
agreed with Hongming 2026-05-08.
Schema: a `!external`-tagged mapping anywhere a workspace entry is
allowed (workspaces:, roots:, children:):
- !external
repo: molecule-ai/molecule-dev-department
ref: main
path: dev-lead/workspace.yaml
url: git.moleculesai.app # optional; default = MOLECULE_EXTERNAL_GITEA_URL or git.moleculesai.app
At resolve time the platform fetches the repo at ref into a content-
addressable cache under <orgBaseDir>/.external-cache/<repo>/<sha>/,
loads <cacheDir>/<path>, recursively resolves nested !include /
!external in the loaded subtree, then rewrites every files_dir scalar
in the fully-resolved subtree to be cache-prefixed. Downstream
pipeline (resolveInsideRoot, plugin merge, CopyTemplateToContainer)
sees ordinary in-tree paths.
IMPLEMENTATION
- org_external.go: ExternalRef type, fetcher interface (gitFetcher
production + injectable for tests), resolveExternalMapping resolver,
rewriteFilesDirAndIncludes path-rewrite walker, allowlistedHostPath
+ safeRefPattern + safeRepoCacheDir validation helpers.
- org_include.go: 4-line hook in expandNode dispatching MappingNode
with Tag=="!external" to resolveExternalMapping.
- org_external_test.go: 8 unit tests with fakeFetcher injection
(no network):
* happy path (top + nested workspace files_dir cache-prefixed)
* allowlist rejection (github.com/foo/bar)
* path-traversal rejection (../../etc/passwd)
* malformed ref rejection ("main; rm -rf /")
* missing required fields (repo / ref / path)
* rewriteFilesDirAndIncludes basic + idempotent
* allowlistedHostPath env-override + glob
Path rewrite ONLY rewrites files_dir scalars. !include scalars are
NOT rewritten — they resolve relative to their containing file's
directory, which post-fetch is naturally inside the cache, so
relative !includes Just Work without modification.
ALLOWLIST + AUTH
- Default allowlist: git.moleculesai.app/molecule-ai/.
- Override: MOLECULE_EXTERNAL_REPO_ALLOWLIST (comma-separated
prefixes; trailing /* or / supported).
- Auth: MOLECULE_GITEA_TOKEN env var injected into clone URL.
Optional — falls back to unauthenticated for public repos.
- Reject: malformed refs, path-traversal, non-allowlisted hosts.
CACHE
- Location: <orgBaseDir>/.external-cache/<safe-repo>/<sha>/.
Operators add to .gitignore.
- Content-addressable: same (repo, sha) reuses cache, no overwrite.
- Atomic clone via tmp-then-rename.
- Concurrency: race-tolerant — last-writer-wins on same SHA.
GC out of scope for v1 (filed as parked follow-up).
SECURITY (per SOP Phase 2)
Untrusted yaml input — all validated:
repo: allowlist (default molecule-ai/* on Gitea host)
ref: ^[a-zA-Z0-9_./-]+$ regex (rejects shell injection)
path: relative-and-down-only (rejects ../escape)
Auth: read-only token scoped to allowed orgs.
Recursion: maxExternalDepth=4 (vs maxIncludeDepth=16) to limit
network fan-out cost.
Cache poisoning: per-(repo, sha) content-addressable; can't poison
across SHAs.
Trust boundary: cloned content treated identically to a sibling-
cloned subtree (same model as current symlink approach).
VERSIONING / BACKWARDS COMPAT
Pure additive. Existing !include and inline workspaces unchanged.
Existing dev-lead symlink (parent template PR #5) keeps working.
Migration of parent template to !external is a separate PR-D.
No DB schema change. No public API change.
VERIFIED LOCALLY
go test ./internal/handlers/ → ok (5.2s, all 8 new tests + existing)
Stub fetcher injection lets unit tests cover the resolver +
path-rewrite logic without network. PR-B (follow-up) adds an
integration test against a local bare-git repo. PR-C adds the
real-Gitea e2e test against the live dev-department repo.
Refs:
internal#77 — extraction RFC (comment 1995 = Phase 1+2 design)
task #222 — this PR is Phase 3a (PR-A in the design's phasing)
Hongming GO 2026-05-08 ('go' on 4 decision points + design)
Two changes that fall out of one root cause discovered while preparing
the local platform spin-up for the dev-department extraction (internal#77):
PROBLEM
CopyTemplateToContainer's filepath.Walk is called with templatePath
set to the workspace's resolved files_dir. With the cross-repo
symlink composition shipped in PR #5 (parent template's
dev-lead → ../molecule-dev-department/dev-lead/), the Dev Lead
workspace's files_dir is literally 'dev-lead' — i.e. the symlink
itself, not a path THROUGH the symlink.
filepath.Walk does not descend into a symlink leaf — it Lstats the
root, sees a symlink (mode bit set, not a directory), emits exactly
one entry, and returns. Result: the workspace's /configs/ tar would
ship empty. Other 38 workspaces are fine because their files_dir
paths just TRAVERSE the symlink (path resolution handles intermediate
symlinks via Lstat traversal); only the leaf-is-symlink case breaks.
FIX
workspace-server/internal/provisioner/provisioner.go:
Call filepath.EvalSymlinks on templatePath before filepath.Walk.
Resolves the leaf-symlink case for ALL templates, not just dev-dept.
Security: templatePath has already passed resolveInsideRoot's
path-string check at the call site; the trust boundary is the
operator-side /org-templates/ filesystem layout, not this
resolution step.
TEST
workspace-server/internal/handlers/local_e2e_dev_dept_test.go:
New TestLocalE2E_FilesDirConsumption — stage-2 of the local e2e.
For every workspace in the resolved OrgTemplate, asserts:
1. resolveInsideRoot(orgBaseDir, ws.FilesDir) succeeds.
2. os.Stat on the result returns a directory.
3. filepath.Walk after EvalSymlinks (mirroring the platform fix)
emits at least one file.
4. At least one workspace marker exists (workspace.yaml,
system-prompt.md, or initial-prompt.md).
Exercises the SECOND half of POST /org/import that
TestLocalE2E_DevDepartmentExtraction (PR #103) didn't cover.
VERIFIED LOCALLY (2026-05-08, against post-extraction Gitea state):
--- PASS: TestLocalE2E_FilesDirConsumption (0.05s)
checked 39 workspaces with files_dir
All 39 walk paths emit non-empty file sets with valid workspace markers.
REGRESSION GUARD
Without the EvalSymlinks fix, this test fails on Dev Lead with:
files_dir 'dev-lead' at '/.../molecule-dev/dev-lead' is empty —
CopyTemplateToContainer would produce empty /configs/
Refs:
internal#77 — extraction RFC
molecule-core#102 (resolver symlink contract test)
molecule-core#103 (stage-1 e2e: include resolution)
Hongming GO 2026-05-08 ('go' on the 3 pre-spin-up optimizations)
Phase 4 (local-only) of internal#77 (dev-department extraction).
Adds TestLocalE2E_DevDepartmentExtraction that exercises the FULL platform
import path against the real molecule-ai-org-template-molecule-dev (post-slim)
and molecule-ai/molecule-dev-department (post-atomize) repos cloned as siblings
under /tmp/local-e2e-deploy/.
What it proves end-to-end:
- The dev-lead symlink at parent's template root is followed by
resolveYAMLIncludes (filepath.Abs/Rel-style security check passes,
os.ReadFile follows the link).
- Recursive !include chain through the symlinked subtree resolves:
parent's org.yaml → !include dev-lead/workspace.yaml (symlinked)
→ !include ./core-lead/workspace.yaml → !include ./core-be/workspace.yaml
(atomized children: paths, no '..').
- 39 workspaces enumerate after resolution: 5 PM-tree + 6 Marketing-tree
+ 28 dev-tree (Dev Lead + 5 sub-team leads + 18 leaf workspaces +
3 floaters + 1 triage-operator).
- Q1+Q2 placements verified by sentinel name check: 'Documentation
Specialist' is reachable (under app-lead via app-docs sub-team),
'Triage Operator' is reachable (direct child of Dev Lead).
Test skips with t.Skipf if the local-e2e fixture isn't present on the
host — won't block CI on hosts that haven't set it up. To set up locally:
TESTROOT=/tmp/local-e2e-deploy
mkdir -p $TESTROOT && cd $TESTROOT
git clone https://git.moleculesai.app/molecule-ai/molecule-ai-org-template-molecule-dev.git molecule-dev
git clone https://git.moleculesai.app/molecule-ai/molecule-dev-department.git
cd /Users/<you>/molecule-core/workspace-server
go test -v -run TestLocalE2E_DevDepartmentExtraction ./internal/handlers/
Verified locally 2026-05-08:
--- PASS: TestLocalE2E_DevDepartmentExtraction (0.01s)
total workspaces (recursive): 39
Refs:
internal#77 — extraction RFC
molecule-core PR #102 — symlink-resolution contract test
molecule-ai/molecule-dev-department PRs #1, #2, #3 (scaffold + extract + atomize)
molecule-ai/molecule-ai-org-template-molecule-dev PR #5 (parent slim + symlink wire)
Hongming GO 2026-05-08 ('lets not go for staging right now, we do local test first')
SOP Phase 4 (local) — task #226
Two new tests in workspace-server/internal/handlers/org_include_test.go:
- TestResolveYAMLIncludes_FollowsDirectorySymlink: parent template's
org.yaml `!include`s into a sibling-repo subtree via a relative
directory symlink. The resolver's filepath.Abs/Rel security check
operates on path strings (passes), and os.ReadFile follows the
symlink at OS layer (file content delivered). Recursive nested
`!include`s within the symlinked subtree resolve correctly because
filepath.Dir(absTarget) keeps the literal symlink path as currentDir.
- TestResolveYAMLIncludes_RejectsSymlinkEscapingRoot: companion test
pinning current behavior where a symlink target outside the parent
root is followed (resolveInsideRoot doesn't EvalSymlinks). Asserted
as 'should resolve' so future hardening (if filepath.EvalSymlinks
is added) flips the test red and forces a coordinated update to the
dev-department subtree-composition pattern.
Why now: internal#77 RFC (dev-department extraction) selects symlink-
based composition over a future platform-level external: ref. These
tests pin the contract before the operator-side symlink convention
gets shipped, so a refactor or hardening of the resolver can't
silently break the production org-import path.
No production code changes. Pure additive test coverage.
Refs: internal#77 (Phase 3b verification — task #223)
Class B Hongming-owned CICD red sweep, e2e-api leg. Same substrate
hazard as PR #98 (handlers-postgres-integration) — Gitea act_runner
configures `container.network: host` operator-wide, so:
* Two concurrent e2e-api runs both attempted to bind `-p 15432:5432`
and `-p 16379:6379` on the operator host. Verified in run a7/2727
on 2026-05-07: `docker: Error response from daemon: Conflict. The
container name "/molecule-ci-redis" is already in use by container
af10f438...` — exit 125, job fails before any test runs.
* Hardcoded container names `molecule-ci-postgres` / `-redis` plus
the leading `docker rm -f` step meant a second job's startup also
KILLED the first job's still-running services.
Fix shape (mirrors PR #98 bridge-net pattern, adapted because the
platform-server is a Go binary on the host, not a containerised step):
1. Per-run unique container names: `pg-e2e-api-${RUN_ID}-${RUN_ATTEMPT}`,
`redis-e2e-api-${RUN_ID}-${RUN_ATTEMPT}`. Unique even across reruns
of the same run_id.
2. Ephemeral host port per run via `-p 0:5432` / `-p 0:6379` and
`docker port` lookup, exported as `DATABASE_URL` / `REDIS_URL` to
`$GITHUB_ENV`. No fixed host-port → no collision.
3. `127.0.0.1` (NOT `localhost`) in URLs — IPv6 first-resolve flake
fixed in #92 stays fixed.
4. `if: always()` cleanup so containers don't leak when test steps
fail.
Issue #94 items #2 + #3 also addressed:
* Pre-pull `alpine:latest` (provisioner uses it for ephemeral
token-write containers in `internal/handlers/container_files.go`).
* Idempotent `docker network create molecule-monorepo-net` (the
provisioner attaches workspace containers via that bridge —
`internal/provisioner/provisioner.go::DefaultNetwork`).
Issue #94 item #1 (timeouts) NOT bumped — recent log evidence shows
postgres ready in 3s, redis in 1s, platform in 1s when they DO come
up. Timeouts are not the bottleneck on the current substrate.
NOT addressed here (out of scope, separate change required):
* `Run E2E API tests` step has been failing on `Status back online`
because the platform's langgraph workspace template image
(`ghcr.io/molecule-ai/workspace-template-langgraph:latest`)
returns 403 Forbidden post-2026-05-06 GitHub org suspension. That
is a template-registry resolution issue (ADR-002 / local-build
mode) and belongs in a workspace-server change, not this workflow
file. This PR fixes the parallel-collision class and the workflow
setup hygiene; the langgraph-403 failure will still surface on
runs after this lands until template resolution is fixed
separately.
Verified manually on operator host 2026-05-08: docker now hands out
ephemeral ports on `-p 0:5432`, two parallel runs land on different
ports, both reach pg_isready GREEN.
Closes#94 (items #2 and #3; item #1 documented as not-bottleneck;
langgraph-template-403 referenced for follow-up).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switches from services: block to --network molecule-monorepo-net with unique per-run container names. Avoids port-5432 collision when parallel Handlers-Postgres jobs run on host-network act_runner. Approved by security-auditor.