When platform /github-installation-token returns 500 (GitHub App unconfigured
or token expired), operators can place a PAT in /configs/.github-token
to keep git/gh ops running. This is a purely additive step-4 fallback —
the cache is NEVER written for static tokens, so recovery always reads fresh.
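A minimal sketch of the fallback shape (resolver and helper names here are
illustrative, not the actual code):

```go
import (
	"context"
	"os"
	"strings"
)

// resolveGitHubToken sketches the chain; platformInstallationToken stands in
// for steps 1-3 (platform endpoint + cached installation token).
func resolveGitHubToken(ctx context.Context) (string, error) {
	tok, err := platformInstallationToken(ctx)
	if err == nil {
		return tok, nil
	}
	// Step 4: static PAT. Deliberately never cached, so fixing the GitHub App
	// later is picked up on the very next call.
	b, readErr := os.ReadFile("/configs/.github-token")
	if readErr != nil {
		return "", err // no fallback file: surface the original platform error
	}
	return strings.TrimSpace(string(b)), nil
}
```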
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes the additive-import zombie bug — re-running /org/import with a
tree shape that reparents same-named roles left the prior workspace
online because lookupExistingChild's dedupe is parent-scoped (different
parent_id → "different" workspace). Caught 2026-05-08 after a dev-tree
re-import left 8 orphans co-existing with the new tree on canvas until
manual cascade-delete.
Three layers in this PR:
- mode="reconcile" on /org/import — after the import loop, online
workspaces whose name matches an imported name but whose id isn't in
the result set are cascade-deleted (see the sketch after this list).
Default mode "" / "merge" preserves existing additive behavior.
Empty-set guards prevent accidental "delete everything" if either
array comes up empty.
- WorkspaceHandler.CascadeDelete extracted as a callable helper from
the existing Delete HTTP handler so OrgImport's reconcile path shares
the same teardown sequence (#73 race guard, container stop, volume
removal, token revocation, schedule disable, event broadcast). The
HTTP Delete handler still inlines the same logic; deduplication
tracked as tech-debt follow-up.
- emitOrgEvent(structure_events) records org.import.started +
org.import.completed with mode, created/skipped/reconcile_removed
counts, duration_ms, error. Replaces the lost-on-restart stdout-only
log shape for an audit-trail surface that's queryable by SQL. Closes
the "what happened at 20:13?" debugging gap that motivated this fix.
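A sketch of the reconcile pass, under assumed names (the real code runs inside
the org-import handler; cascadeDelete stands in for
WorkspaceHandler.CascadeDelete):

```go
type workspaceRow struct{ ID, Name string }

func reconcile(importedNames, importedIDs map[string]bool,
	online []workspaceRow, cascadeDelete func(workspaceRow)) {
	// Empty-set guards: never fan out to "delete everything".
	if len(importedNames) == 0 || len(online) == 0 {
		return
	}
	for _, ws := range online {
		// Name matches an imported workspace but the id is outside the
		// result set: a zombie left behind by parent-scoped dedupe.
		if importedNames[ws.Name] && !importedIDs[ws.ID] {
			cascadeDelete(ws)
		}
	}
}
```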
Verified live against the local platform: cascade-delete on an old
tree's removed root cleared 8 surviving orphans; mode="reconcile" with
a freshly-INSERTed fake orphan removed exactly the fake; idempotent
re-run of reconcile is a no-op (0 removed, no errors); structure_events
captures every started+completed pair with full payload.
7 new unit tests (walkOrgWorkspaceNames flat/nested/spawning:false/
empty-name; emitOrgEvent success + DB-error-swallow; errString). Full
handler suite green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 4 follow-up to template-claude-code PR #9 (2026-05-08 dev-tree wedge).
Pre-fix: applyRuntimeModelEnv unconditionally overwrote envVars["MODEL"]
with the MODEL_PROVIDER slug whenever payload.Model was empty (the restart
path). This silently wiped the operator's explicit per-persona MODEL
secret on every restart.
Symptom: dev-tree workspaces booted correctly on first /org/import (the
envVars map was populated direct from the persona env file with both
MODEL=MiniMax-M2.7-highspeed and MODEL_PROVIDER=minimax), then on the
next Restart the MODEL secret got clobbered to literal "minimax" — a
provider slug, not a valid model id — and the workspace template's
adapter failed to match any registry prefix, fell through to providers[0]
(anthropic-oauth), and wedged at SDK initialize.
Fix: resolution order in applyRuntimeModelEnv is now (sketched below):
1. payload.Model (caller passed the canvas-picked model id verbatim)
2. envVars["MODEL"] (workspace_secret persisted from persona env)
3. envVars["MODEL_PROVIDER"] (legacy canvas Save+Restart shape)
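A minimal sketch of the precedence chain (assuming the envVars map shape
described above; not the verbatim function body):

```go
func applyRuntimeModelEnv(payloadModel string, envVars map[string]string) {
	switch {
	case payloadModel != "":
		// 1. caller passed the canvas-picked model id verbatim
		envVars["MODEL"] = payloadModel
	case envVars["MODEL"] != "":
		// 2. persisted workspace_secret wins on restart: do NOT clobber
	case envVars["MODEL_PROVIDER"] != "":
		// 3. legacy canvas Save+Restart shape
		envVars["MODEL"] = envVars["MODEL_PROVIDER"]
	}
}
```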
Tests
-----
TestApplyRuntimeModelEnv_PersonaEnvMODELSecretPreserved — locks in
the new resolution order with four cases:
- MODEL secret wins over MODEL_PROVIDER slug (persona-env shape)
- MODEL secret wins even when same as MODEL_PROVIDER
- MODEL absent → fall back to MODEL_PROVIDER (legacy shape)
- Both absent → no MODEL set (no-op)
Existing TestApplyRuntimeModelEnv_SetsUniversalMODELForAllRuntimes
continues to pass — fix is strictly additive on the precedence chain.
Lets a workspace declare that it (and its entire subtree) should be skipped
during /org/import. Pointer-typed `*bool` so we distinguish "explicitly
false" from "unset" (default = spawn).
## Use case
The dev-tree org template ships the full role taxonomy (Dev Lead with
Core Platform / Controlplane / App & Docs / Infra / SDK Leads, each with
their own engineering / QA / security / UI-UX children — 27 personas
total in a single import). Some setups need a smaller set:
- Local dev on a memory-constrained machine
- Demo / smoke runs that don't need the full org breathing
- Customer trials starting with leadership-only before fan-out
Pre-fix the only options were:
- Edit the canonical template (mutates shared state)
- Author a parallel slimmer template (duplicates structure)
- Manual workspace deprovision after full import (wasteful — already paid
the docker pull / build cost)
`spawning: false` is the per-workspace knob that solves this without
touching the canonical template structure.
## Semantics
- Unset: workspace spawns (current behaviour, no migration)
- `spawning: true`: explicitly spawns (same as unset)
- `spawning: false`: workspace is skipped AND every descendant is
skipped. The guard sits BEFORE any side effect in createWorkspaceTree
(see sketch below) — no DB row, no docker provision, no children
recursion. A false-spawning subtree is genuinely a no-op except for
the log line. countWorkspaces still counts the subtree (so /org/templates
numbers reflect the full structure).
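Sketch of the guard (yaml field per this PR; the real createWorkspaceTree
signature, DB and docker plumbing are elided):

```go
import "log"

type orgWorkspace struct {
	Name     string         `yaml:"name"`
	Spawning *bool          `yaml:"spawning"` // nil = unset = spawn
	Children []orgWorkspace `yaml:"children"`
}

func createWorkspaceTree(ws orgWorkspace) {
	// BEFORE any side effect: no DB row, no docker provision, no recursion.
	if ws.Spawning != nil && !*ws.Spawning {
		log.Printf("org-import: skipping %q and its subtree (spawning: false)", ws.Name)
		return
	}
	// ... provision ws, then recurse over ws.Children ...
}
```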
## Stage A — verified
Local dev-only template that wraps teams/dev.yaml (Dev Lead) with
children:[] cleared on the 5 sub-team yaml files, plus 3 floater
personas (Release Manager / Integration Tester / Fullstack Engineer).
/org/import returned 9 workspaces. Drop-in equivalent: the same result
will come from `spawning: false` on each sub-tree root in the future.
## Stage B — N/A
Pure additive feature on the org-template handler. No SaaS deploy chain
implications.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## org_import.go — persona env injection root-cause fix
The Phase-3 fix from earlier today (`feedback/per-agent-gitea-identity-default`)
introduced loadPersonaEnvFile to inject persona-specific creds into
workspace_secrets on /org/import. It passed `ws.Role` as the persona-dir
lookup key, but in our dev-tree org.yaml shape `role:` carries the
multi-line descriptive text the agent reads from its prompt
("Engineering planning and team coordination — leads Core Platform,
Controlplane, ..."), while `files_dir:` holds the short slug
(`core-lead`, `dev-lead`, etc.) matching
`~/.molecule-ai/personas/<files_dir>/env`.
isSafeRoleName silently rejected the multi-word role text → no persona
env loaded → every imported workspace booted with zero
workspace_secrets rows → no ANTHROPIC / CLAUDE_CODE / MINIMAX auth in
the container env → claude_agent_sdk wedged on `query.initialize()`
with a 60s control-request timeout.
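The corrected lookup key, sketched (FilesDir is an assumed field name for the
yaml files_dir key):

```go
// Before (buggy): the descriptive role text as the key; isSafeRoleName
// silently rejected it, so no persona env ever loaded.
//   loadPersonaEnvFile(envVars, ws.Role)

// After: key on the short slug that actually names the persona dir.
loadPersonaEnvFile(envVars, ws.FilesDir) // "core-lead" → personas/core-lead/env
```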
After the fix, /org/import on the dev tree (27 personas) populates
8 workspace_secrets per workspace (Gitea identity + MODEL/MODEL_PROVIDER
+ provider-specific token), 5 of 6 leads boot online, and the
remaining wedges trace to a separate runtime-template-repo bug
(workspace-template-claude-code's claude_sdk_executor.py doesn't
dispatch on MODEL_PROVIDER=minimax — filed separately).
## Dockerfile.dev — docker-cli + docker-cli-buildx
Without these, every claude-code/tier-2 workspace POST fails-fast:
- missing docker-cli entirely produces `exec: "docker": executable file not found`
- docker-cli alone (no buildx) fails on `docker build` with
`ERROR: BuildKit is enabled but the buildx component is missing or broken`
Both packages are now installed in the dev image; verified with
`docker exec molecule-core-platform-1 docker buildx version`.
## Stage A verified
Local /org/import dev-only path: 27 workspaces created, all 27 receive
persona env injection (8 secrets each — Gitea identity + provider creds).
Lead workspaces (claude-code-OAuth tier) boot online.
## Stage B — N/A
Local-dev-only path (docker-compose.dev.yml + dev image). Tenant EC2
provisioning uses Dockerfile.tenant (untouched).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to the workspace-template visibility flip in 558e4fee. After
flipping the 5 private workspace-templates public (#192 root cause),
the harness-replays clone moved one step deeper to the org-templates
list, where 6 of 7 were also private. Hongming-confirmed flip plan:
- 5 of 6 (molecule-dev, free-beats-all, medo-smoke, molecule-worker-gemini,
ux-ab-lab) — flipped public per `feedback_oss_first_repo_visibility_default`.
These are unambiguously OSS-template-shape: generic README, no
customer-shaped names, no creds in content.
- 1 of 6 (reno-stars) — name itself is customer-shaped (would expose
customer/tenant identity). Kept private; removed from manifest.json
per Hongming. Will be handled at provision-time via the per-tenant
credential resolver designed in internal#102 (Layer-3 RFC).
Documents the OSS-surface contract in two places:
- manifest.json _comment: every entry MUST be public; Layer-3 lives elsewhere
- clone-manifest.sh comment block: rationale + the explicit ci-readonly
team-grant escape hatch (review-gated, not default).
Closes the second clone-fail layer of #192. Combined with the
workspace-template visibility flips in 558e4fee, the pre-clone manifest
deps step should now succeed anonymously for the full registered set.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5 of 9 workspace-template repos (openclaw, codex, crewai, deepagents,
gemini-cli) had been marked private with no team grant for AUTO_SYNC_TOKEN
bearer (devops-engineer persona). Pre-clone manifest deps step 404'd on
the first private repo encountered, failing every Harness Replays run.
Resolution path taken:
1. Flipped the 5 to public per `feedback_oss_first_repo_visibility_default`
— runtime/template/plugin repos default public; that's what makes them
OSS surface.
2. Scoped existing `ci-readonly` org team to legitimately-internal repos
only (compliance docs, RFCs-in-flight). Workspace templates removed
from it.
3. Filed internal#102 RFC for Layer-3 (customer-owned + marketplace
third-party private repos) — that's a different shape entirely;
needs per-tenant credential-resolver, not org-team grants.
This commit is a documentation-only touch on the workflow file to (a)
record the root cause inline next to the existing pre-clone-fail
narrative, (b) trigger a fresh Harness Replays run that should now pass
the clone step.
Closes #192.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Investigating molecule-core#129 failure mode #1 (claude-code "Agent
error (Exception)") needs the workspace's docker logs to find the
actual exception. The canary tears down the tenant on every failure,
so the workspace container is destroyed before anyone can SSM in.
Add a workflow_dispatch input `keep_on_failure: bool` (default false).
When true, sets `E2E_KEEP_ORG=1` for the canary script — its existing
debug path skips teardown, leaving the tenant + EC2 + CF tunnel + DNS
alive. Operator can then SSM into the workspace EC2 (via the same
flow as recover-tunnels.py) and capture `docker logs` from the
claude-code container.
Cron-triggered runs never set the input (it only exists on dispatch),
so unattended scheduled canaries always tear down — no risk of
unattended cost leak.
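The gate's contract, sketched in Go purely for illustration (the canary script
itself isn't Go; names here are hypothetical):

```go
if os.Getenv("E2E_KEEP_ORG") == "1" {
	log.Printf("E2E_KEEP_ORG=1: keeping tenant %s alive; delete via DELETE /cp/admin/tenants/%s when done", slug, slug)
	return nil // skip teardown: tenant + EC2 + CF tunnel + DNS stay up
}
return teardownTenant(slug)
```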
Operator workflow:
1. Dispatch canary-staging.yml with keep_on_failure=true
2. Watch CI; on failure (likely, given the 38h chronic red),
note the SLUG / TENANT_URL printed at step 1/11
3. SSM exec into the workspace EC2 (us-east-2) and run
`docker logs <claude-code-container>` to find the actual
exception traceback
4. Manually delete via DELETE /cp/admin/tenants/<slug> when done
(the script logs this reminder on E2E_KEEP_ORG=1 path)
Refs: molecule-core#129 (canary investigation)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the legacy nohup `go run ./cmd/server` setup with a fully
containerized local stack: postgres + redis + platform + canvas, all
with `restart: unless-stopped` so they survive Mac sleep/wake and
Docker Desktop daemon restarts.
## Changes
- **docker-compose.yml**
- `restart: unless-stopped` on platform/postgres/redis
- `BIND_ADDR=0.0.0.0` for platform — the dev-mode-fail-open default
of 127.0.0.1 (PR #7) made the host unable to reach the container
even with port mapping. Container netns is already isolated, so
binding all interfaces inside is safe.
- Healthchecks switched from `wget --spider` (HEAD → 404 forever
because /health is GET-only) to `wget -qO /dev/null` (GET).
Same regression existed on canvas; fixed both.
- **workspace-server/Dockerfile.dev**
- `CGO_ENABLED=1` → `0` to match prod Dockerfile + Dockerfile.tenant.
Without this, the alpine dev image fails with "gcc: not found"
because workspace-server has no actual cgo deps but the env was
forcing the cgo build path. Closes a divergence introduced in
9d50a6da (today's air hot-reload PR).
- **canvas/Dockerfile**
- `npm install` → `npm ci --include=optional` for lockfile-exact
installs that include platform-specific @tailwindcss/oxide native
binaries. Without these, `next build` fails with "Cannot read
properties of undefined (reading 'All')" on the
`@import "tailwindcss"` directive.
- **canvas/.dockerignore** (new)
- Excludes `node_modules` and `.next` so the Dockerfile's
`COPY . .` step doesn't clobber the freshly-installed container
node_modules with the host's (potentially stale or wrong-arch)
copy. This was the actual root cause of the canvas build break.
- **workspace-server/.gitignore**
- Adds `/tmp/` for air's live-reload build cache.
## Stage A verified
```
container status restart
postgres-1 Up (healthy) unless-stopped
redis-1 Up (healthy) unless-stopped
platform-1 Up (healthy, air-mode) unless-stopped
canvas-1 Up (healthy) unless-stopped
GET :8080/health → 200
GET :3000/ → 200
DB preserved: 407 workspace rows + 5 named personas
Persona mount: 28 dirs at /etc/molecule-bootstrap/personas
```
## Stage B — N/A
This is local-dev infrastructure only. None of these files ship to
SaaS tenants — production EC2s use `Dockerfile.tenant` + `ec2.go`
user-data, not docker-compose.
## Out of scope
- The decorative-but-broken `wget --spider` healthcheck has presumably
also been silently 404'ing on prod tenants. Ship a follow-up to
audit + fix the prod path; not done here to keep the PR scoped.
- Docker Desktop "Start at login" is a per-machine GUI setting that
must be toggled manually (Settings → General).
- The legacy heartbeat-all.sh that pinged 5 persona workspaces from
the host has been deleted (~/.molecule-ai/heartbeat-all.sh).
Per Hongming: each workspace is responsible for its own heartbeat.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "Open issue on failure" step was failing on every canary run
because Gitea 1.22.6 doesn't expose /api/v1/actions endpoints
(per memory reference_gitea_actions_log_fetch). The threshold check
called github.rest.actions.listWorkflowRuns() to count consecutive
prior failures and gate issue creation behind 3 reds — that call
ALWAYS 404'd on Gitea, breaking the entire alerting step.
Net effect: the canary's own self-alerting was broken, so the
underlying staging regression went unflagged for 38h+
(2026-05-07 02:30 UTC → 2026-05-08 17:34 UTC, every cron tick red,
zero issues filed).
Fix: drop the consecutive-failures threshold entirely. File a
sticky issue on the FIRST failure; comment-on-existing handles
deduplication for subsequent failures. The auto-close-on-success
step is unchanged.
Why not a Gitea-compatible threshold (e.g., walk recent commit
statuses): comment-on-existing already gives ops a single
accumulating issue per regression streak. The threshold's purpose
was to avoid spamming on transient flakes — but with sticky issue
+ auto-close-on-green, transient flakes get one issue + one quick
close, which is fine signal. Filing on first failure is also
better UX: catches the regression in 30 min instead of 90 min.
Also: rewrote runURL from hardcoded https://github.com/... to
context.serverUrl so the link actually points at Gitea
(https://git.moleculesai.app) — was always broken on Gitea but
nobody noticed because the issue-filing step itself was broken.
Net: 21 insertions, 40 deletions. Removes WORKFLOW_PATH +
CONSECUTIVE_THRESHOLD env vars (no longer needed).
Tracked in: molecule-core#129 (failure mode 3 of 3)
Verification: yaml syntax-valid; no remaining github.rest.actions.*
calls; only github.rest.issues.* (all Gitea-supported per
memory feedback_persona_token_v2_scope).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes core#242 LOCAL surface. The PROD surface (CP user-data fetching
persona env files into tenant EC2's /etc/molecule-bootstrap/personas
via Secrets Manager) is filed as a follow-up.
WHAT THIS ADDS
Bind-mount on the platform service in docker-compose.yml:
${MOLECULE_PERSONA_ROOT_HOST:-${HOME}/.molecule-ai/personas}
→ /etc/molecule-bootstrap/personas (read-only)
Default source = ${HOME}/.molecule-ai/personas (the operator-host-mirrored
local dir populated by today's persona rotation work). Override via
MOLECULE_PERSONA_ROOT_HOST when running on a machine with a different
layout (CI runners, etc.).
WHY READ-ONLY
workspace-server only reads persona env files; never writes back. The
read-only mount enforces that contract — a hostile plugin install path
can't tamper with the persona credentials it's about to consume.
WHY THIS PATH MATCHES PROD
/etc/molecule-bootstrap/personas is the same in-container path the
prod tenant EC2 will use. Same code path (org_import.go::loadPersonaEnvFile)
reads the same file regardless of mode — local-dev parity with prod
per feedback_local_must_mimic_production.
STAGE A VERIFICATION
- docker compose config: resolves to /Users/hongming/.molecule-ai/personas
correctly (28 persona dirs visible at source path)
- Persona env file shape verified: dev-lead's env contains GITEA_USER,
GITEA_USER_EMAIL, GITEA_TOKEN_SCOPES, GITEA_SSH_KEY_PATH,
MODEL_PROVIDER=claude-code, MODEL=opus (lead tier matches Hongming's
2026-05-08 mapping)
- Full handler test suite green (TestLoadPersonaEnvFile_HappyPath +
7 sibling tests pass; rejection tests still catch path traversal)
- Build clean
STAGE B SKIPPED (with justification per § Skip conditions)
This change is config-only (docker-compose.yml volume addition). The
prod tenant EC2s do NOT use docker-compose.yml — they use CP user-data
+ ec2.go's docker run script. So this PR has no prod blast radius.
Stage B (staging tenant probe) would be checking 'is the platform
using the new compose mount' on a SaaS tenant — and SaaS tenants
don't run docker compose. The actual prod-surface change is the
follow-up issue.
PROD SURFACE — FOLLOW-UP FILED
Tenant EC2 user-data needs to fetch persona env files from operator
host (or AWS Secrets Manager per the established
feedback_unified_credentials_file pattern) and stage them at
/etc/molecule-bootstrap/personas inside the workspace-server container.
Touches molecule-controlplane/internal/provisioner/ec2.go user-data.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes core#115 partial. Schema-only change; the apply-endpoint filter
logic that reads this column lands with core#123 (drift detector +
queue + apply endpoint, the deferred follow-up of core#113).
Default 'production' so existing customers (Reno-Stars + any future
tenant) are default-safe. Synthetic dogfooding workspaces opt INTO
'canary' explicitly.
CHECK constraint pins the closed value set ('canary' | 'production') —
the apply endpoint's filter relies on the database to reject anything
else, so a future operator typo in PATCH /workspaces/:id ({update_tier:
'canery'}) returns a constraint violation, not silent fan-out to
nobody.
Partial index on canary rows since the apply-endpoint query path
('apply this update only to canary tier first') hits canary much more
often than production, and the production set is the much larger
default.
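The migration shape this describes, sketched (the workspaces table name and
index name are assumptions; the committed migration file is the source of
truth):

```go
// Illustrative embedded DDL, not the actual migration file.
const addUpdateTier = `
ALTER TABLE workspaces
    ADD COLUMN IF NOT EXISTS update_tier TEXT NOT NULL DEFAULT 'production'
        CHECK (update_tier IN ('canary', 'production'));

CREATE INDEX IF NOT EXISTS idx_workspaces_update_tier_canary
    ON workspaces (update_tier)
    WHERE update_tier = 'canary';
`
```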
WHAT THIS DOES NOT DO (lands with core#123)
- PATCH endpoint to flip a workspace to canary
- The apply endpoint that consults the column
- Tests that exercise canary-vs-production fan-out
Schema-only foundation; same pattern as core#113 (workspace_plugins).
PHASE 4 SELF-REVIEW
Correctness: No finding — IF NOT EXISTS guards, DEFAULT clause means
existing rows get 'production' on migration apply.
Readability: No finding — comment block documents the tier semantics
+ the deferral to core#123.
Architecture: No finding — additive ALTER, partial index for the
expected access pattern.
Security: No finding — no code path; column constraint reduces blast
radius of bad PATCH input.
Performance: No finding — partial index minimizes write amplification
on the production-default rows.
REFS
core#115 — this issue
core#123 — apply endpoint follow-up (will exercise this column)
core#113 — version subscription DB foundation (sibling pattern)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes core#113 partial. Adds the DB foundation for the
version-subscription model. Drift detection + queue + admin apply
endpoint are follow-up scope (separate PR; filed as a new issue).
WHY THIS PR ONLY GETS US PART-WAY
Plugin install state today is filesystem-only — '/configs/plugins/<name>/'
inside the container. There's no DB record of 'plugin X installed at
workspace W from source S, tracking ref T'. That makes drift detection
impossible: nothing to compare upstream tags against.
This PR adds the table + the install-endpoint hook that writes to it.
With baseline tags now on every plugin (post internal#92), the table
starts collecting tracked-ref values immediately on the next install.
The actual drift-check job + queue + apply endpoint layer on top.
WHAT THIS ADDS
workspace_plugins table:
workspace_id FK → workspaces(id) ON DELETE CASCADE
plugin_name canonical name from plugin.yaml
source_raw full source URL the install used
tracked_ref 'none' | 'tag:vX.Y.Z' | 'tag:latest' | 'sha:<full>'
installed_at, updated_at
installRequest gains optional 'track' field (defaults to 'none').
Install handler upserts the workspace_plugins row after delivery
succeeds. DB write failure is logged but doesn't fail the install
(the plugin IS in the container; surfacing 500 misleads the caller).
validateTrackedRef enforces the closed set of accepted shapes:
'none' | 'tag:<non-empty>' | 'sha:<non-empty>'
Bare values like 'latest' / 'main' / version-strings without
prefix are rejected — the drift detector keys on prefix to know
what kind of resolution to do.
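A sketch of the validator (signature assumed; behavior per the closed set
above):

```go
import (
	"fmt"
	"strings"
)

func validateTrackedRef(ref string) error {
	switch {
	case ref == "none":
		return nil
	case strings.HasPrefix(ref, "tag:") && len(ref) > len("tag:"):
		return nil
	case strings.HasPrefix(ref, "sha:") && len(ref) > len("sha:"):
		return nil
	}
	// Bare "latest" / "main" / "v1.2.3" land here: the drift detector keys
	// on the prefix to know what kind of resolution to do.
	return fmt.Errorf("tracked_ref %q: want 'none', 'tag:<ref>' or 'sha:<ref>'", ref)
}
```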
WHAT THIS DOES NOT ADD (filed separately)
- Drift detector job (cron / on-demand) that scans
'WHERE tracked_ref != none' rows and queues updates on upstream drift
- plugin_update_queue table (separate migration once detector lands)
- GET /admin/plugin-updates-pending and POST .../apply endpoints
- Tier-aware apply (core#115 — composes here)
PHASE 4 SELF-REVIEW (FIVE-AXIS)
Correctness: No finding — install endpoint behavior unchanged for
callers that don't pass 'track'. DB write is best-effort + logged
on failure. validateTrackedRef rejects ambiguous bare strings.
Readability: No finding — separate file plugins_tracking.go isolates
the new concern; install handler delta is a single 4-line block.
Architecture: No finding — additive table; existing schema untouched.
Migration 20260508160000_* uses the timestamp-prefixed convention.
Security: No finding — INSERT params via placeholders (no string
interpolation). validateTrackedRef rejects unexpected shapes before
the column constraint would.
Performance: No finding — one extra ExecContext per install. Install
is already seconds-scale (network fetch + tar + docker exec); rounds
to noise.
TESTS (1 new, all green)
TestValidateTrackedRef — pin closed set + structural validators
REFS
core#113 — this issue (foundation only; drift+queue+apply = follow-up)
internal#92, internal#93 — plugin/template baseline tags (now exists for tracking)
core#114 — atomic install (this PR composes — no atomicity regression)
core#115 — canary tier filter (will key off the same DB foundation)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes molecule-core#112. Composes with #114 (atomic install).
Before issuing restartFunc, classify the diff between staged and live:
- skill-content-only: only **/SKILL.md content changed
→ skip restart (Claude Code re-reads SKILL.md on
each Skill invocation; no in-memory cache)
- cold: anything else
→ restartFunc as before
(hooks/settings load at session start;
plugin.yaml is structural; added/removed files
require a fresh load)
DETECTION
- Hash every regular file in staged tree (host filesystem, sha256)
- Hash every regular file in live tree (in-container via docker exec
sh -c 'cd <livePath> && find . -type f -print0 | xargs -0 sha256sum')
- .complete marker dropped from comparison (mtime varies install-to-
install; including it would force-cold every reinstall)
- File added/removed → cold
- File content differs but isn't SKILL.md → cold
- All differences are SKILL.md basenames → skill-content-only
DEFAULTS COLD
- First install (no live tree) → cold
- Live tree read failure → cold (conservative; never hot-reload speculatively)
- Symlinks skipped during hash (same posture as tar walker)
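The classification walk, sketched (hashLocalTree / hashContainerTree produce
the path→sha256 maps and isSkillMarkdown exists per the test list below; this
body is illustrative):

```go
import "path/filepath"

func classify(staged, live map[string]string) string {
	if len(live) == 0 {
		return "cold" // first install, or live-tree read failure upstream
	}
	marker := func(p string) bool { return filepath.Base(p) == ".complete" }
	for p, h := range staged {
		if marker(p) {
			continue // bookkeeping, not content
		}
		lh, ok := live[p]
		switch {
		case !ok:
			return "cold" // file added
		case lh != h && !isSkillMarkdown(p):
			return "cold" // non-SKILL.md content change
		}
	}
	for p := range live {
		if !marker(p) {
			if _, ok := staged[p]; !ok {
				return "cold" // file removed
			}
		}
	}
	return "skill-content-only" // every diff was a SKILL.md basename
}
```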
PHASE 4 SELF-REVIEW
Correctness: No finding — all error paths default to cold; never
falsely classify as skill-content-only. The .complete drop is
a deliberate exception (the marker is bookkeeping, not content).
Readability: No finding — single-purpose helpers (hashLocalTree,
hashContainerTree, isSkillMarkdown, shQuote) each do one thing.
The classifier itself reads as 'compare set, then walk diff with
isSkillMarkdown gate.'
Architecture: No finding — composes existing execAsRoot primitive;
new helpers in plugins_classifier.go don't touch any other
handler. Old behavior unchanged when live read fails.
Security: No finding — shQuote single-quotes any non-trivial path,
pluginName comes from validatePluginName-validated source, and
the docker exec command takes the path as a single arg (xargs -0
handles binary-safe path delimiting). Symlinks skipped.
Performance: No finding — adds two tree walks (host + container)
per install. Container walk is one docker exec call returning
sha256 lines; for typical plugins (~10-50 files) round-trip is
~100ms. Versus the saved ~5-10s of restart on a hot-reloadable
update, this is a clear win.
TESTS (4 new, all green; full handler suite green)
TestIsSkillMarkdown — basename match, case-sensitive
TestHashLocalTree_StableHash — re-hash same dir = same map
TestHashLocalTree_SymlinkSkipped — hostile link doesn't poison classifier
TestShQuote — quoting boundary for shell injection safety
REFS
molecule-core#112 — this issue
molecule-core#114 — atomic install (.complete marker added there)
Reno-Stars iteration safety (Hongming 2026-05-08)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes molecule-core#114 for the docker (local-OSS) path.
EIC (SaaS) path tracked as a follow-up — same shape, different
exec primitives (ssh vs docker exec); shipping both in one PR
doubles the test surface.
THE FOUR-STEP DANCE
1. STAGE — docker.CopyToContainer extracts tar into
/configs/plugins/.staging/<name>.<ts>/
2. SNAPSHOT — if /configs/plugins/<name>/ exists, mv to
/configs/plugins/.previous/<name>.<ts>/
3. SWAP — atomic mv staging → live (single rename(2))
4. MARKER — touch /configs/plugins/<name>/.complete
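Path shapes for one install, sketched (stamp format per the concurrency note
below; the docker-exec plumbing through execAsRoot is elided):

```go
stamp := time.Now().UTC().Format("20060102150405") // UTC second precision
staging := fmt.Sprintf("/configs/plugins/.staging/%s.%s", name, stamp)
previous := fmt.Sprintf("/configs/plugins/.previous/%s.%s", name, stamp)
live := "/configs/plugins/" + name

// 1. STAGE:    docker.CopyToContainer extracts the tar into staging
// 2. SNAPSHOT: mv live → previous (only when live already exists)
// 3. SWAP:     mv staging → live (a single rename(2) on one filesystem)
// 4. MARKER:   touch live/.complete
```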
Workspace-side plugin loaders should refuse to load any plugin dir
without .complete (separate small change, not in this PR — the marker
write is the necessary precursor; consumer side is a follow-up so
existing-content plugins don't break before they're re-installed).
ROLLBACK
- Stage failure: rm -rf staging dir; live untouched
- Snapshot failure: rm -rf staging dir; live untouched (no rename happened)
- Swap failure with snapshot present: mv previous back to live
- Swap failure (no snapshot): rm -rf staging; live (which never
existed) stays absent
- Marker failure: content already in place, log loudly with manual
recovery hint (touch <plugin>/.complete) — don't roll back since
the new content is what we wanted, just unmarked
GC
Best-effort delete of previous-version snapshot after successful
marker write. Failures non-fatal — next install or a separate
sweeper reclaims. Sweeper for stale .previous/* across reboots is
follow-up scope.
CONCURRENCY
Each install gets a unique stamp (UTC second precision), so two
concurrent reinstalls land in distinct staging dirs and the second
swap simply overwrites the first's live result. The atomicity is
per-install, not cross-install — by design (the platform serializes
POST /workspaces/:id/plugins via Go-side semaphore upstream of
this code, so cross-install collisions don't reach here).
CHANGES
+ plugins_atomic.go — installVersion + atomicCopyToContainer
+ plugins_atomic_tar.go — tarWalk/tarHostDirWithPrefix helpers
+ plugins_atomic_test.go — 5 unit tests (paths, stamp shape,
tar happy path, symlink-skip, prefix
normalization). All green.
~ plugins_install_pipeline.go::deliverToContainer — swap
copyPluginToContainer call to atomicCopyToContainer
Old copyPluginToContainer is retained (still called by Download()) so
this PR is purely additive on the install path; no public API change.
PHASE 4 SELF-REVIEW (FIVE-AXIS)
Correctness: Required (addressed) — swap-failure rollback writes mv
of previous back to live before returning the error; if rollback
itself fails, we wrap both errors and surface the combined fault.
Marker-write failure is treated as content-landed-but-unmarked
(LOG, don't roll back the new content).
Readability: No finding — installVersion path methods make the
/staging/.previous/live/marker layout obvious from one struct.
tarWalk extracted from the inline filepath.Walk in
plugins_install_pipeline.go for testability.
Architecture: No finding — atomicCopyToContainer composes existing
execAsRoot / docker.CopyToContainer primitives; no new dependencies.
Old copyPluginToContainer kept for Download() — single responsibility
per function.
Security: No finding — symlinks still skipped during tar walk
(defense vs hostile plugin escaping its own dir). Marker writes
use composable path.Join, no user input touches the path.
Performance: No finding — adds ~4 docker exec calls per install
(mkdir, mv-snapshot, mv-swap, touch) on top of the
one CopyToContainer. Each exec ~50-100ms in practice; install
end-to-end was already seconds-scale, this rounds to noise.
REFS
molecule-core#114 — this issue
Companion: molecule-core#112 (hot-reload classifier — depends on .complete marker)
Companion: molecule-core#113 (version subscription — uses install machinery)
EIC follow-up: separate issue to be filed for SaaS path parity
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes core#116. Brings local-dev iteration parity with the canvas's
Turbopack HMR — edit a Go file, see the platform restart in <5s
instead of running 'docker compose up --build' (~30s) per change.
USAGE
make dev # docker compose with air-driven live reload
make up # production-shape stack (no air, normal Dockerfile)
WHAT THIS ADDS
workspace-server/.air.toml — air watch config
workspace-server/Dockerfile.dev — air-on-golang:1.25-alpine, dev-only
docker-compose.dev.yml — overlay swapping platform service
to Dockerfile.dev + bind-mounting
workspace-server/ source
Makefile — make {dev,up,down,logs,build,test}
WHAT THIS DOES NOT TOUCH
workspace-server/Dockerfile (production multi-stage build)
docker-compose.yml (prod-shape stack)
CI workflows (build prod image directly)
Tenant deployment / SaaS (image swap stays the model)
Pure additive. Existing 'docker compose up' path unchanged; production
stays on the static binary. Air install pinned via go install at image
build time so the dev image is reproducible-enough for local use (we
don't pin air to a SHA — the dev image is rebuilt locally and updates
opportunistically).
PHASE 4 SELF-REVIEW (FIVE-AXIS)
Correctness: No finding — additive change, no existing path modified.
.air.toml watches .go + .yaml under workspace-server/, excludes
_test.go and tests dir so test edits don't trigger rebuild.
Dockerfile.dev mirrors prod's 'go mod download' so first rebuild
is fast.
Readability: No finding — three small files plus a Makefile, each
with header comments explaining the WHY, not just the WHAT. The
Makefile uses the standard ## help-target pattern.
Architecture: No finding — overlay pattern (docker-compose.dev.yml
on top of docker-compose.yml) is the standard compose convention
for env-specific overrides. Doesn't fork the prod path.
Security: No finding because no production code path; dev-only image
isn't built in CI and isn't published to ECR.
Performance: No finding — air debounce=500ms, exclude_unchanged=true
so a save that doesn't change content is a no-op rebuild.
REFS
core#116 — this issue
Companion: core#117 (workspace-side config-watcher for hot-reload of
config.yaml) — different scope; this issue is platform-only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TestStartSweeper_TransientErrorDoesNotCrashLoop leaks an in-flight
metric write across the test boundary: cycleDone fires inside the
fake's Sweep defer (before Sweep returns), waitForCycle returns
immediately after, cancel() lands, but the goroutine still has
metrics.PendingUploadsSweepError() to execute. Whether that write
happens before or after the next test's metricDelta() baseline read
is a coin-flip on slow CI hosts.
Outcome: TestStartSweeper_RecordsMetricsOnSuccess fails with
"error counter delta = 1, want 0" — looks like a real bug, isn't.
Instrumented analysis (per the file's existing waitForMetricDelta
docstring covering the same shape) confirms the metric IS getting
recorded, just AFTER the next test reads its baseline.
The Records* tests already use waitForMetricDelta to close this race
on their own assertions. This change extends the same shape to
TransientErrorDoesNotCrashLoop so it doesn't poison subsequent tests'
baselines.
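For reference, the waitForMetricDelta shape being extended (the real helper
already lives in this file; this sketch only shows the polling contract):

```go
import (
	"testing"
	"time"
)

func waitForMetricDelta(t *testing.T, read func() float64, base, want float64) {
	t.Helper()
	deadline := time.Now().Add(2 * time.Second) // generous for slow CI hosts
	for time.Now().Before(deadline) {
		if read()-base == want {
			return // in-flight write has landed; safe to proceed
		}
		time.Sleep(10 * time.Millisecond)
	}
	t.Fatalf("metric delta = %v, want %v", read()-base, want)
}
```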
Verified by running `go test -race -count=20 ./internal/pendinguploads/...`
locally — passes deterministically.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trunk-based migration final cleanup for molecule-core. The 6 workflows
deleted here all existed to manage the staging↔main branch dance that
trunk-based makes obsolete:
- auto-promote-staging.yml fast-forward staging→main on green
- auto-promote-on-e2e.yml alt promote path on E2E green
- auto-promote-stale-alarm.yml alarm if staging promotion stalls
- auto-sync-main-to-staging.yml sync main→staging after UI merges
- auto-sync-canary.yml dry-run probe of the auto-sync
token+push path
- retarget-main-to-staging.yml rebase open PRs onto staging
After Phase 3A (PR #108 promoted 5 staging-only feature PRs to main)
and Phase 3B (PR #109 dropped staging-branch triggers from the 4 e2e
workflows), main is the only branch the CI cares about. None of the
above workflows have anything to do; they're 1977 lines of dead code.
Rollback: `git revert` this commit to restore the workflows. They still
work mechanically; trunk-based just doesn't need them.
The `staging` branch on the remote is deleted in a follow-up step
(`git push origin --delete staging`) after this PR merges, so reviewers
can confirm CI runs cleanly on the new shape before the ref disappears.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires the 28 dev-tree persona credentials minted 2026-05-08 into the
workspace-secrets path used by org_import. When a workspace.yaml carries
`role: <name>`, the importer now reads
$MOLECULE_PERSONA_ROOT/<role>/env (default
/etc/molecule-bootstrap/personas/<role>/env, populated by the bootstrap
kit on the tenant host) and merges the role's GITEA_USER /
GITEA_TOKEN / GITEA_TOKEN_SCOPES / GITEA_USER_EMAIL /
GITEA_SSH_KEY_PATH into the same envVars map that already feeds
workspace_secrets via parseEnvFile + crypto.Encrypt + INSERT.
PRECEDENCE
Persona env is the LOWEST layer:
0. Persona env (per-role)
1. Org root .env (shared)
2. Workspace .env (per-workspace)
Each later layer overrides the previous, so a workspace .env can
pin a different GITEA_TOKEN if it ever needs to (testing, override).
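Merge order, sketched (mergeEnv is an illustrative name for a plain map
overlay; parseEnvFile, crypto.Encrypt and the INSERT are the existing
machinery):

```go
envVars := map[string]string{}
loadPersonaEnvFile(envVars, role)           // 0. persona env (lowest layer)
mergeEnv(envVars, parseEnvFile(orgRootEnv)) // 1. org root .env overrides persona
mergeEnv(envVars, parseEnvFile(wsEnv))      // 2. workspace .env overrides both
// envVars then feeds workspace_secrets exactly as before.
```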
WHY THIS LAYERING
Workspaces should boot with the role's identity by default. .env
files stay the explicit-override mechanism for the (rare) case where
a workspace needs to deviate. No new behavior for workspaces with no
role: persona load is silent no-op when ws.Role is empty or unsafe.
SECURITY
isSafeRoleName accepts only [A-Za-z0-9_-]+ (no '..', '/', or
separators) — admin-only construct, but defense-in-depth keeps the
persona dir shape invariant. Test
TestLoadPersonaEnvFile_RejectsTraversal pins the rejection set against
a planted target file.
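The accept set, sketched (the real implementation may not literally use a
regexp):

```go
import "regexp"

var roleNameRE = regexp.MustCompile(`^[A-Za-z0-9_-]+$`)

// isSafeRoleName rejects "", "..", anything with '/' or other separators.
func isSafeRoleName(role string) bool {
	return roleNameRE.MatchString(role)
}
```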
OPERATOR-HOST CONTRACT
The 28 persona env files live at /etc/molecule-bootstrap/personas/<role>/env
(mode 600, owner root:root) with the per-role token-scope tailoring
Hongming approved 2026-05-08 (D5). Synced via task #241. Override via
MOLECULE_PERSONA_ROOT for tests + non-prod hosts.
TESTS (7 new, all green)
TestLoadPersonaEnvFile_HappyPath — typical persona-env shape
TestLoadPersonaEnvFile_MissingDir — silent no-op when file absent
TestLoadPersonaEnvFile_EmptyRole — silent no-op when role empty
TestLoadPersonaEnvFile_RejectsTraversal — planted file unreachable
via '../../etc/passwd' etc.
TestLoadPersonaEnvFile_DefaultRoot — falls back to /etc/...
TestLoadPersonaEnvFile_OverwritesEmptyMap
TestIsSafeRoleName_Acceptance — positive + negative role names
PHASE 4 SELF-REVIEW (FIVE-AXIS)
Correctness: No finding — additive change, silent no-op on the ws.Role==''
path covers every existing workspace; tests cover happy path + each
rejection mode + missing-dir.
Readability: No finding — helper sits next to parseEnvFile in
org_helpers.go with a comment block explaining WHY persona is
lowest precedence.
Architecture: No finding — fits the existing 'merge .env into envVars
then INSERT INTO workspace_secrets' pattern that's been in place
since the .env-driven workspace secrets feature; no new dependencies,
no new tables.
Security: Required (addressed) — path traversal blocked by
isSafeRoleName. No finding beyond that since persona files are
admin-managed and the helper does not log token values.
Performance: No finding — one extra os.ReadFile per workspace at
import time; amortized over workspace lifetime, cost is negligible.
REFS
internal#85 — RFC for SOP Phase 4 + structured Five-Axis (parent context)
Saved memories: feedback_per_agent_gitea_identity_default,
feedback_unified_credentials_file
Task #241 — operator-host sync (already DONE; populated 28 dirs)
Task #242 — this PR
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Harness Replays job failed at "dependency failed to start: container
harness-tenant-alpha-1 is unhealthy" — that is not caused by this
merge (which adds workspace-server/internal/handlers code, not
container infra). Retry to confirm it was a transient environmental
issue (likely operator-host load/disk per internal#78).
Trunk-based migration: main is the only branch. Update 4 workflows
that fired on staging-branch pushes to fire on main instead.
- e2e-staging-canvas.yml: drop staging from push + pull_request
- e2e-staging-external.yml: drop staging from push + pull_request
- e2e-staging-saas.yml: drop staging from push + pull_request,
update header comment that references the (now-obsolete)
staging→main auto-promote flow
- redeploy-tenants-on-staging.yml: workflow_run.branches changes
from [staging] to [main] so the tenant redeploy fires when
publish-workspace-server-image runs on main
Workflows that target the staging tenant FLEET (canary-staging.yml,
e2e-staging-sanity.yml) are not changed — they fire on cron, the word
"staging" in their filenames refers to the deployment target environ-
ment, not the git branch.
Lands as Phase 3b after #108 promotes the 5 staging-only feature PRs
(Phase 3a). Phase 3c deletes the obsolete promote/sync workflows
(auto-promote-staging, auto-sync-main-to-staging, etc.) plus the
staging branch itself, after we no-op-verify both Phase 3a and 3b
green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Main was supposed to fast-forward when each PR merged on staging,
but auto-promote-staging.yml has not been firing reliably on Gitea
since the GitHub suspension. Result: main is missing 5 substantive
feature PRs that landed on staging between 2026-04-29 and 2026-05-07:
- #102: test(org-include) symlink-based subtree composition contract
- #103: test(local-e2e) dev-department extraction end-to-end
- #104: fix(provisioner)+test EvalSymlinks templatePath; stage-2 e2e
- #105: feat(org-import) !external cross-repo subtree resolver (#222)
- #106: test(org-external) integration + e2e for !external resolver
Each PR was independently reviewed and CI-green at staging-merge time;
this commit promotes the merged state atomically. Use git log on main
after the merge to see the original PR-merge commits preserved.
Sister work: Phase 3 of internal#81 (trunk-based migration). Workflow
trigger updates land in a follow-up PR; staging-branch deletion happens
after a no-op verification deploy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five-Axis self-review pass on the !external resolver work (PRs #105+#106) caught three real issues that the unstructured 3-weakest review missed:
1. Cache validity gap — partial cache writes looked complete
2. Token persistence — token in URL userinfo got persisted to .git/config
3. Misleading function name post-refactor
This PR fixes all three:
- .complete marker file written atomically; wipe-and-refetch on partial cache
- Token via -c http.extraHeader, never embedded in URL
- Defense-in-depth '..' deny on refs (already covered by repoSafeRefRegex, now explicit + tested)
- Renamed buildCloneURL -> buildExternalCloneURL (collision with artifacts.go), rewriteFilesDirAndIncludes -> rewriteFilesDir
- Removed unused redactToken/shortHash helpers and crypto/sha1, encoding/hex, fmt imports
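The header-based clone, sketched (the exact auth scheme, "token" vs "Bearer",
depends on the Gitea setup; illustrative only). Because -c is process-scoped
when it precedes the subcommand, nothing token-shaped is persisted to the
clone's .git/config:

```go
cmd := exec.CommandContext(ctx, "git",
	"-c", "http.extraHeader=Authorization: token "+token, // never URL userinfo
	"clone", cloneURL, destDir)
```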
Approved by platform-engineer 2026-05-08T12:55Z.