SecurityHeaders middleware widened its CSP to allow Next.js inline scripts
+ data:/blob: images (platform/internal/middleware/securityheaders.go:44,
canvas is reverse-proxied through the gin stack so it needs the permissive
policy). The two CSP asserts in securityheaders_test.go still hard-compared
against the old tight `default-src 'self'`, so they fail on main as of
this afternoon.
Fix: assert each expected CSP fragment is PRESENT in the header (substring
match) instead of byte-for-byte equality. Test intent is "CSP is set, starts
with tight default-src, contains the expected directives" — not "CSP matches
this exact string". Future subsource tuning (add a new CDN, bump blob:/data:
scope) won't re-break this test.
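The relaxed assertion style can be sketched like this (hypothetical Go helper; the real fragment list lives in securityheaders_test.go):

```go
package main

import (
	"fmt"
	"strings"
)

// assertCSP checks intent, not exact bytes: the header must start with the
// tight default-src and contain each expected directive fragment.
func assertCSP(header string, fragments []string) error {
	if !strings.HasPrefix(header, "default-src 'self'") {
		return fmt.Errorf("CSP does not start with tight default-src: %q", header)
	}
	for _, f := range fragments {
		if !strings.Contains(header, f) {
			return fmt.Errorf("CSP missing expected fragment %q", f)
		}
	}
	return nil
}

func main() {
	csp := "default-src 'self'; script-src 'self' 'unsafe-inline'; img-src 'self' data: blob:"
	err := assertCSP(csp, []string{"script-src", "'unsafe-inline'", "data:", "blob:"})
	fmt.Println(err == nil) // true
}
```

Adding a new CDN or widening blob:/data: scope only appends fragments; none of these assertions break.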
Caught because every PR touching anything in the monorepo currently fails
the Platform (Go) CI job on these two asserts. Fixing on a dedicated branch
so it can land ahead of every blocked PR in the queue.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Single-container tenant architecture: Go platform (:8080) + Canvas
Node.js (:3000) in one Fly machine, with Go's NoRoute handler reverse-
proxying non-API routes to the canvas. Browser only talks to :8080.
Changes:
platform/Dockerfile.tenant — multi-stage build (Go + Node + runtime).
Bakes workspace-configs-templates/ + org-templates/ into the image.
Build context: repo root.
platform/entrypoint-tenant.sh — starts both processes, kills both if
either exits. Fly health check on :8080 covers the Go binary; canvas
health is implicit (proxy returns 502 if canvas is down).
platform/internal/router/canvas_proxy.go — httputil.ReverseProxy that
forwards unmatched routes to CANVAS_PROXY_URL (http://localhost:3000).
Activated by NoRoute when CANVAS_PROXY_URL env is set.
platform/internal/router/router.go — wire NoRoute → canvasProxy when
CANVAS_PROXY_URL is present; no-op otherwise (local dev unchanged).
platform/internal/middleware/securityheaders.go — relaxed CSP to allow
Next.js inline scripts/styles/eval + WebSocket + data: URIs. The
strict `default-src 'self'` was blocking all canvas rendering.
canvas/src/lib/api.ts — changed `||` to `??` for NEXT_PUBLIC_PLATFORM_URL
so empty string means "same-origin" (combined image) instead of falling
back to localhost:8080.
canvas/src/components/tabs/TerminalTab.tsx — same `??` fix for WS URL.
Verified: tenant machine boots, canvas renders, 8 runtime templates +
4 org templates visible, API routes work through the same port.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes the silent-overwrite hole where two agents racing a read-modify-
write on the same memory key left only one agent's update. Relevant for
orchestrators (PM, Dev Lead, Marketing Lead) keeping structured running
state (delegation-result ledgers, task queues) in memory, and for the
``research-backlog:*`` keys that multiple idle loops write in parallel.
## Semantics
### Back-compat path (no if_match_version)
Unchanged: ``INSERT ... ON CONFLICT UPDATE`` last-write-wins. Every
existing agent tool, every existing ``commit_memory`` call, every
existing cron that writes memory — all continue to work with no edit.
### Optimistic-lock path (if_match_version set)
1. Client calls ``GET /memory/:key`` → ``{value, version: V}``
2. Client modifies value locally
3. Client ``POST /memory {key, value, if_match_version: V}``
4. Server: ``UPDATE ... WHERE version = V`` + RETURNING new version
5. On match → 200 + ``{version: V+1}``
6. On mismatch → 409 + ``{expected_version: V, current_version: <actual>}``
7. Client reads the actual version and retries.
### Create-only marker
``if_match_version: 0`` means "create iff the key doesn't exist yet".
Two agents simultaneously seeding a shared key will see exactly one
success + one 409 — no silent collision, no duplicate-init work.
### Schema
Migration 023 adds ``version BIGINT NOT NULL DEFAULT 1``. Existing rows
baseline at 1. New rows start at 1. Every successful write (both paths)
increments: ``version = version + 1`` on update, ``1`` on insert.
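The three write paths can be sketched as an in-memory model (hypothetical Go; the real handler expresses this as a single SQL statement per path, e.g. ``UPDATE ... WHERE version = $n RETURNING version``):

```go
package main

import (
	"fmt"
	"sync"
)

type entry struct {
	value   string
	version int64
}

type memStore struct {
	mu sync.Mutex
	m  map[string]entry
}

// set mirrors POST /memory. ifMatch == nil is the back-compat upsert;
// *ifMatch == 0 is create-only; any other value is an optimistic lock.
func (s *memStore) set(key, value string, ifMatch *int64) (int64, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	cur, exists := s.m[key]
	switch {
	case ifMatch == nil: // last-write-wins upsert, always succeeds
		v := int64(1)
		if exists {
			v = cur.version + 1
		}
		s.m[key] = entry{value, v}
		return v, nil
	case *ifMatch == 0: // create iff absent
		if exists {
			return cur.version, fmt.Errorf("409: key exists at version %d", cur.version)
		}
		s.m[key] = entry{value, 1}
		return 1, nil
	default: // update iff version matches
		if !exists || cur.version != *ifMatch {
			return cur.version, fmt.Errorf("409: expected %d, current %d", *ifMatch, cur.version)
		}
		s.m[key] = entry{value, cur.version + 1}
		return cur.version + 1, nil
	}
}

func main() {
	s := &memStore{m: map[string]entry{}}
	zero := int64(0)
	v, _ := s.set("k", "seed", &zero) // first seeder wins
	fmt.Println(v)                    // 1
	_, err := s.set("k", "seed", &zero) // racing seeder gets 409
	fmt.Println(err != nil)             // true
}
```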
## Why version, not updated_at
``updated_at`` has second-granularity and can collide between concurrent
writers on a fast clock. A monotonic counter is collision-free and more
readable in the 409 response body ("expected 5, current is 7 — you
missed 2 writes" tells an agent exactly what to re-read).
## Why ``if_match_version`` and not an ETag header
JSON field keeps it in the request body, visible alongside the value
payload. Agents assembling requests programmatically don't have to
remember to thread a header through their HTTP client wrapper; the
existing ``commit_memory`` tool can grow one optional kwarg and match
the existing signature shape.
## Tests
11 memory-handler cases covering every path:
- GET list / get (with version in response shape)
- Set with no version (back-compat upsert, returns new version)
- Set with if_match_version match (happy path, increment)
- Set with if_match_version mismatch (409 + expected/current fields)
- Set with if_match_version=0 on absent key (create-only success)
- Set with if_match_version=N on absent key (409 — caller's mental
model is wrong)
- Bad inputs (missing key, malformed JSON)
- Delete happy + error path
Full ``go test ./internal/handlers/`` green.
## Follow-up (not in this PR)
- Workspace-template tool update: ``commit_memory(content, *,
if_match_version=None)`` surfaces the new option + on 409 surfaces
the current_version so agents can retry without manual re-read.
- Named checkpoints table (``workspace_checkpoints``) for durable
orchestrator state snapshots. Different concern than per-key locking;
separate PR.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause of the 2026-04-16 09:10 UTC six-container restart cascade.
## Timeline
09:10:26 — PM sent a batch delegation to 15+ agents (Dev Lead coordinating).
09:10:26-27 — 4 leaders/auditors (Security, RL, BE, DevOps) simultaneously
hit "workspace agent unreachable — container restart triggered"
even though their containers were running fine. Another 2
(DL, UIUX) tripped in the next few seconds.
09:10:27 — Provisioner stopped + recreated 6 containers in parallel. A2A
callers got EOFs, PM's batch coordination stalled.
## Root cause
`provisioner.IsRunning` collapsed every ContainerInspect error into
`(false, nil)`, including transient Docker daemon hiccups:
    func IsRunning(...) (bool, error) {
        info, err := p.cli.ContainerInspect(ctx, name)
        if err != nil {
            return false, nil // Container doesn't exist ← MISREAD
        }
        return info.State.Running, nil
    }
The comment said "Container doesn't exist" but the error was actually
any of: daemon timeout, socket EOF, context deadline, connection
refused. Under load (batch delegation fan-out → 15 concurrent HTTP
inbound → 15 concurrent Claude Code subprocesses → Docker daemon CPU
pressure), ContainerInspect calls started failing transiently. All 6
calls returned `(false, nil)`. Caller `maybeMarkContainerDead` treated
`running=false` as "container is dead, restart it" → six parallel
restarts. This was exactly the destructive-on-error pattern we keep
trying to kill (see #160 SDK-stderr-probe, #318 fail-open classes).
## Fix
`IsRunning` now distinguishes NotFound from transient errors:
- Legitimately missing container (caller deleted, Docker pruned) →
`(false, nil)` — safe to act on; caller marks dead + restarts.
- Any other error (daemon timeout, socket issue, context deadline) →
`(true, err)` — caller stays on the alive path. The transient error
is preserved so metrics + logging still see it, but it does NOT
trigger the destructive restart branch.
`isContainerNotFound` matches on error-message substring — same
approach docker/cli uses internally — to avoid pulling in errdefs as a
direct dep. Truth table tests in `isrunning_test.go` cover 8 cases:
NotFound variants (real + generic), nil, empty, and the 4 transient-
error shapes we've actually observed (deadline, EOF, connection-refused,
i/o timeout).
## Caller update
`maybeMarkContainerDead` in a2a_proxy.go now logs the transient inspect
error (was silently discarded via `_`). Visibility without
destructiveness. If this error becomes persistent, we'll see it in
platform logs rather than diagnosing after another restart cascade.
## Expected impact
- Zero restart cascades from the current class of transient inspect
errors (EOF, timeout, connection refused).
- Dead containers still detected within the A2A layer because an actual
stopped container returns NotFound on inspect, and the TTL monitor
(180s post #386) catches anything that slips through.
- New visibility in platform logs when inspect has trouble — previously
silent.
Combined with the TTL fix in #386, the defense-in-depth on spurious
restart is now:
1. IsRunning only returns false for real NotFound
2. Liveness TTL is 180s, surviving 5+ missed heartbeats
3. A2A proxy 503-Busy path retries with backoff before touching
restart logic at all
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tenant machines were booting with no templates because the Dockerfile
only shipped the Go binary + migrations. The canvas showed "0 templates"
with an empty picker.
Changes:
- platform/Dockerfile: build context changed from ./platform to repo
root so COPY can reach workspace-configs-templates/ alongside the
Go source. COPY paths updated for platform/{go.mod,go.sum,*.go} and
platform/migrations/.
- .github/workflows/publish-platform-image.yml: context: . (was
./platform), paths trigger now includes workspace-configs-templates/
so template changes rebuild the image.
Phase A of the template-registry plan. Phase B adds a DB registry +
on-demand fetch for community templates (user pastes GitHub URL at
workspace creation time). The baked defaults always ship in the image
for zero-config tenant boot.
Verified: `docker build -f platform/Dockerfile -t test .` succeeds,
`docker run --rm test ls /workspace-configs-templates/` shows all 8
templates (autogen, claude-code-default, crewai, deepagents, gemini-cli,
hermes, langgraph, openclaw).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Part 3 of 4 in the scalability refactor. Adds YAML `!include` support
to the org importer and splits molecule-dev/org.yaml (676 lines post-
Phase 2) into 6 team / role files; top-level org.yaml drops to 114 lines
of pure scaffolding.
## Platform changes
New `platform/internal/handlers/org_include.go`:
- `resolveYAMLIncludes(data, baseDir)` — pre-processes a YAML document,
expanding any scalar tagged `!include <path>` with the parsed content
of the referenced file.
- Path resolution via `resolveInsideRoot` so a crafted `!include
../../etc/passwd` can't escape the org template directory (same
defense the existing `files_dir` copy uses).
- Nested includes supported: each included file carries its own search
root (its directory), so `teams/pm.yaml` with `!include research.yaml`
resolves to `teams/research.yaml` — matching the convention of
C-include / Sass @import / most package systems.
- Cycle detection via visited-set keyed on absolute path; belt-and-
braces `maxIncludeDepth = 16` cap in case symlinks or path
normalization defeats the set.
- Inline-template mode (POST /org/import with raw JSON body, no `dir`)
errors cleanly when a file ref is used — can't resolve without a
base.
Wired into both `ListTemplates` (so /org/templates shows an accurate
workspace count after the split) and `Import` (expansion happens before
unmarshal into OrgTemplate).
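The traversal defense can be sketched like this (hypothetical shape of `resolveInsideRoot`; the real helper is shared with the `files_dir` copy and is backstopped by the depth cap above):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// resolveInsideRoot joins ref onto root and rejects any result that
// escapes root, so `!include ../../etc/passwd` fails before any file I/O.
func resolveInsideRoot(root, ref string) (string, error) {
	rootAbs, err := filepath.Abs(root)
	if err != nil {
		return "", err
	}
	p := filepath.Join(rootAbs, ref) // Join also cleans ".." segments
	if p != rootAbs && !strings.HasPrefix(p, rootAbs+string(filepath.Separator)) {
		return "", fmt.Errorf("include %q escapes template root", ref)
	}
	return p, nil
}

func main() {
	p, err := resolveInsideRoot("/org-templates/molecule-dev", "teams/pm.yaml")
	fmt.Println(p, err) // resolved path, no error
	_, err = resolveInsideRoot("/org-templates/molecule-dev", "../../etc/passwd")
	fmt.Println(err != nil) // true: rejected
}
```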
## Template changes
molecule-dev/org.yaml now contains only:
- name + description
- defaults (runtime, plugins, category_routing, initial_prompt text)
- `workspaces: [!include teams/pm.yaml, !include teams/marketing.yaml]`
New files:
- `teams/pm.yaml` — PM top-level, children are !include refs
- `teams/research.yaml` — Research Lead + Market Analyst + Technical
Researcher + Competitive Intelligence (inline children)
- `teams/dev.yaml` — Dev Lead + FE/BE/DevOps/Security/QA/UIUX (inline)
- `teams/marketing.yaml` — Marketing Lead + DevRel/PMM/Content/
Community/SEO/Social (inline)
- `teams/documentation-specialist.yaml` — leaf
- `teams/triage-operator.yaml` — leaf
## File-size impact
| State | org.yaml lines | total config size |
|---|---:|---:|
| Before (main) | 1801 | 108 KB |
| After Phase 1 (#389) | 1687 | 101 KB |
| After Phase 2 (#390) | 676 | 35 KB |
| After this PR | **114** | **4 KB** (org.yaml only) |
With the 6 team files (total ~570 lines of structural YAML), every file
is now under 230 lines and individually readable without scrolling past
a single team's boundaries.
## Tests
`platform/internal/handlers/org_include_test.go` — 9 cases:
- Flat include (single file, single workspace)
- Nested include (file → file → file)
- Traversal rejection (`../secret.yaml`, `../../secret.yaml`)
- Cycle detection (a↔b)
- Empty path error
- Missing file error
- Inline-template error (baseDir empty)
- No-op when YAML has no includes (safety: we always run the preprocessor)
- **Integration**: load the real `org-templates/molecule-dev/org.yaml`,
resolve includes, unmarshal into OrgTemplate, verify PM + Marketing
Lead are top-level and PM has ≥4 children after expansion.
All 9 pass + existing `TestResolvePromptRef` + `TestOrgYAML` suites stay
green.
## Ownership implication
Each team file can now be owned + reviewed independently. When the
marketing team adds a 7th role, the diff is in `teams/marketing.yaml`
alone — no merge conflicts against PM or research changes in the same
review window. Same for the eventual engineer team, security team, etc.
## What's next
- **Phase 4 (queued):** per-workspace atomization. Each role gets
`<role>/workspace.yaml`; team files shrink to a list of !include
refs. Terminal step in the scalability arc — at that point adding a
new role is one new file under `org-templates/molecule-dev/<role>/`
plus one line in the team's manifest.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rebased cleanly onto current main (resolves the add/add conflicts that
blocked CI on PR #374 — the original branch diverged from a pre-repo-bootstrap
commit that predated most files).
Changes:
- schedules.go: add scheduleHealthResponse struct + Health handler
(mirrors A2A proxy auth pattern: X-Workspace-ID + CanCommunicate gate)
- router.go: register GET /workspaces/:id/schedules/health on r (not wsAuth)
so peer agents can query without holding the target workspace's bearer token
- schedules_test.go: 7 new tests (missing caller 401, self-call OK, legacy
peer grandfathered, non-peer 403, system caller bypass, no prompt exposure,
DB error 500)
isSystemCaller/validateCallerToken reused from a2a_proxy.go (same package).
registry.CanCommunicate import added to schedules.go.
Closes #249
Supersedes PR #374 (which could not pass CI due to the merge conflicts)
Co-authored-by: PM (Molecule AI) <pm@molecule-ai.internal>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Every workspace now commits under its own name. Step 3 of the three-
step agent-separation plan (platform-level git identity today;
GitHub App migration follows as Option 1).
## Problem
All 20+ agents in the molecule-dev template (PM, Dev Lead, Research
Lead, FE, BE, DevOps, Security, QA, UIUX, Marketing roles, etc.) share
a single GITHUB_TOKEN — specifically the CEO's personal PAT. So every
commit, PR, and issue across the live repos ends up attributed to
HongmingWang-Rabbit. `git log` can't distinguish "which agent wrote
this code" from "did the CEO write it"; neither can the authority-
verification rule in triage-operator/philosophy.md (rule #3).
## Fix
When the provisioner starts a workspace container, it now sets:
GIT_AUTHOR_NAME = "Molecule AI <Workspace Name>"
GIT_AUTHOR_EMAIL = <slug>@agents.moleculesai.app
GIT_COMMITTER_NAME = (same)
GIT_COMMITTER_EMAIL = (same)
Git prefers these env vars over `git config user.name` / `user.email`,
so no per-container git-config step is needed; every commit automatically
carries the right authorship.
Examples (20 agents, 20 distinct identities):
Frontend Engineer → frontend-engineer@agents.moleculesai.app
Backend Engineer → backend-engineer@agents.moleculesai.app
Product Marketing Manager → product-marketing-manager@agents.moleculesai.app
UIUX Designer → uiux-designer@agents.moleculesai.app
Domain `agents.moleculesai.app` is deliberate: marks the email as a
bot address without resembling a real inbox.
## Operator override preserved
`applyAgentGitIdentity` runs AFTER the secret-load loops in
`provisionWorkspaceOpts`, but uses `setIfEmpty` so any workspace_secret
with the same key wins. Teams that want custom authorship (shared org
signing identity, a person-on-the-loop owner) can still set
`GIT_AUTHOR_NAME` via /workspaces/:id/secrets and get their value
through to git.
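A minimal sketch of the override-preserving injection (hypothetical helper bodies mirroring `setIfEmpty` / the slugify step; the real code lives in the provisioner):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// setIfEmpty only writes the key when no secret-loaded value is present,
// so a workspace_secret with the same key always wins.
func setIfEmpty(env map[string]string, key, value string) {
	if env == nil {
		return
	}
	if strings.TrimSpace(env[key]) == "" {
		env[key] = value
	}
}

// slugify (illustrative version): lowercase, runs of non-alphanumerics
// collapse to hyphens, edge hyphens trimmed.
func slugify(name string) string {
	s := strings.ToLower(name)
	s = regexp.MustCompile(`[^a-z0-9]+`).ReplaceAllString(s, "-")
	return strings.Trim(s, "-")
}

func main() {
	env := map[string]string{"GIT_AUTHOR_NAME": "Custom Signer"} // operator override
	name := "Frontend Engineer"
	setIfEmpty(env, "GIT_AUTHOR_NAME", "Molecule AI "+name)
	setIfEmpty(env, "GIT_AUTHOR_EMAIL", slugify(name)+"@agents.moleculesai.app")
	fmt.Println(env["GIT_AUTHOR_NAME"])  // Custom Signer: override preserved
	fmt.Println(env["GIT_AUTHOR_EMAIL"]) // frontend-engineer@agents.moleculesai.app
}
```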
## What this does NOT solve (yet)
- PR / issue authorship is still whoever owns GITHUB_TOKEN (the shared
PAT). That needs the GitHub App migration (Option 1, next PR). The
commit-level split shipped here is the prerequisite: the App path
will keep these env vars and just swap the PAT for a short-lived
installation token.
- Existing containers continue with their pre-fix env (git env vars
are baked in at container-create time). Applying is one plain
`POST /workspaces/:id/restart` per agent after this merges +
deploys — the restart goes through provisionWorkspace which picks
up the new injection.
## Tests
`agent_git_identity_test.go` — 4 behavior tests + a 10-row slug test:
- fills all 4 env vars from a workspace name
- operator override via pre-set env is preserved (setIfEmpty semantics)
- empty / whitespace workspace name is a no-op (no `unknown@...` emails)
- nil map doesn't panic (defensive)
- slugify handles spaces / punctuation / edge hyphens / em-dashes
All 15 cases pass; platform build clean.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Part 1 of 4 in the scalability refactor. Each role can now keep its
initial_prompt / idle_prompt / schedule prompts as sibling .md files
under files_dir/; inline YAML literals still work for backwards-compat.
## What changes
**Platform (org.go importer):**
- `OrgWorkspace` gains `InitialPromptFile`, `IdlePrompt`, `IdlePromptFile`,
`IdleIntervalSeconds`. The idle_* fields were previously dropped by the
org importer entirely — struct didn't declare them — which is why
engineer idle_prompts never propagated from org.yaml to live /configs
(I've been manually docker-cp'ing them in every maintenance cron).
- `OrgSchedule` gains `PromptFile`. Hourly/weekly cron prompts are the
largest bodies in org.yaml (1-5 KB each) and get resolved at import
time just like initial_prompt.
- `OrgDefaults` gains the same idle_* + *_file fields for org-wide fallback.
- New `resolvePromptRef(inline, fileRef, orgBaseDir, filesDir)` helper —
the single chokepoint for inline-vs-file resolution. Inline wins when
both are set. File refs route through `resolveInsideRoot` so a crafted
ref can't escape the org template directory (same traversal defense as
files_dir).
- `createWorkspaceTree` now injects idle_prompt + idle_interval_seconds
into the workspace's config.yaml (previously missing — that's the
second half of the idle-prompt propagation bug).
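The single chokepoint can be sketched as (hypothetical simplified signature; the real helper also routes file refs through `resolveInsideRoot`):

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// resolvePromptRef: inline wins when both are set; a file ref without a
// base dir (inline-template mode) is a clean error, not a guess.
func resolvePromptRef(inline, fileRef, baseDir string) (string, error) {
	if inline != "" {
		return inline, nil
	}
	if fileRef == "" {
		return "", nil
	}
	if baseDir == "" {
		return "", errors.New("prompt_file requires an org template dir")
	}
	b, err := os.ReadFile(filepath.Join(baseDir, fileRef))
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// inline wins even when a file ref is also present
	got, _ := resolvePromptRef("inline body", "initial-prompt.md", "/tmpl")
	fmt.Println(got) // inline body
	_, err := resolvePromptRef("", "initial-prompt.md", "")
	fmt.Println(err != nil) // true: inline-template mode cannot resolve files
}
```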
**Tests:**
- `org_prompt_ref_test.go` — 10 cases: inline-wins, file-read-when-empty,
both-empty, defaults-level resolution, inline-template mode errors,
traversal rejection (via file ref AND via files_dir), missing-file
errors, and YAML-unmarshal parsing for each new field.
**Proof migration:**
- Documentation Specialist (biggest role at 6.9 KB of prompts) moves from
inline YAML to `documentation-specialist/{initial-prompt.md,
schedules/daily-docs-sync.md, schedules/weekly-terminology-audit.md}`.
- org.yaml drops 1801 → 1687 lines (-6.3%) from just this one role.
## Why this matters
org.yaml is 108 KB of which 67 KB (62%) is prompt text. At the current
12-role template size that's already unreadable; the marketing + triage-
operator additions pushed it to 1801 lines. The 4-phase refactor aims:
- **Phase 1 (this PR):** platform support + 1 role proof.
- **Phase 2:** migrate remaining ~20 roles to file refs. Target: org.yaml
at ~600 lines of pure structural scaffolding.
- **Phase 3:** YAML `!include` preprocessor — split org.yaml into
teams/{research,dev,marketing,ops}.yaml shards.
- **Phase 4:** per-workspace atomization — each role gets its own
workspace.yaml manifest; org.yaml composes them.
## Backwards compatibility
- Inline `initial_prompt: |` / `prompt: |` / `idle_prompt: |` all still work.
- Missing `prompt_file` refs are logged loudly and the schedule skipped
  (not fatal), so bugs surface in deployment logs rather than silent-drop.
- Inline-template mode (POST /org/import with raw JSON body, no `dir`)
errors cleanly when a file ref is used — can't resolve files without a
base dir, surface that rather than guessing.
## Test plan
- [x] `go build ./...` clean
- [x] `go test -run 'TestResolvePromptRef|TestOrgYAML' ./internal/handlers/`
— 10 tests pass
- [x] `python -c "yaml.safe_load(...)"` on the edited org.yaml — parses
- [ ] Post-merge: deploy platform rebuild, run `POST /org/import` against
a fresh workspace, verify Documentation Specialist's /configs/config.yaml
contains the initial_prompt body and workspace_schedules rows contain
the cron prompts (phantom-success check: grep the actual content, not
just the row count).
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Problem observed 2026-04-16: Research Lead, Dev Lead, Security Auditor,
and UIUX Designer were being auto-restarted by the liveness monitor every
~30 minutes, even though their containers were healthy and processing
real work. A2A callers (PM, children agents) saw regular EOFs:
A2A request to <leader-id> failed: Post http://ws-*:8000: EOF
Followed in platform logs by:
Liveness: workspace <id> TTL expired
Auto-restart: restarting <name> (was: offline)
Provisioner: stopped and removed container ws-*
Root cause: the liveness key `ws:{id}` in Redis has a 60s TTL
(platform/internal/db/redis.go). The workspace heartbeat loop
(workspace-template/heartbeat.py) refreshes it every 30s. That leaves
room for exactly ONE missed heartbeat before expiry.
A busy Claude Code Opus synthesis can starve the container's asyncio
scheduler for 60-120s (the SDK spawns the claude CLI subprocess and
blocks until the message-reader yields; the heartbeat coroutine doesn't
run during that window). Leaders running 5-minute orchestrator pulses
or processing deep delegations routinely hit this. The platform then
mistakes a busy-but-healthy container for a dead one, marks it offline,
tears it down, and re-provisions — interrupting whatever work was mid-
synthesis and generating a cascade of EOF errors on pending A2A calls.
Fix: hoist the TTL into a named `LivenessTTL` constant and raise it to
180s. With a 30s heartbeat interval this now tolerates up to ~5 missed
beats before expiry — comfortably longer than any realistic Opus stall,
while still detecting genuinely-dead containers within 3 minutes.
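The tolerance arithmetic in sketch form (constant names as described above; values from this PR):

```go
package main

import (
	"fmt"
	"time"
)

const (
	LivenessTTL       = 180 * time.Second // was 60s
	HeartbeatInterval = 30 * time.Second  // workspace-template/heartbeat.py
)

func main() {
	// The key survives as long as the gap since the last refresh stays
	// under the TTL, so consecutive missed beats tolerated is:
	missed := int(LivenessTTL/HeartbeatInterval) - 1
	fmt.Println(missed) // 5, versus exactly 1 under the old 60s TTL
}
```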
Safety: real crashes are still caught immediately by a2a_proxy's reactive
IsRunning() check (maybeMarkContainerDead in a2a_proxy.go:439). That path
doesn't depend on TTL; it fires on the first failed forward. So this PR
only relaxes the "slow but alive" false-positive — dead-container
detection is unchanged.
Observed impact before fix (2026-04-16 ~06:40–06:49 UTC, 10-minute
window, 4 containers affected):
| Container | EOF errors | Forced restart |
|-------------------|-----------:|:--------------:|
| Dev Lead | 5 | yes (06:48) |
| Research Lead | 5 | yes (06:47) |
| Security Auditor | 5 | yes (06:49) |
| UIUX Designer | 4 | no (not yet) |
Expected impact after merge + redeploy: drop to ~0 forced restarts on
healthy-busy leaders. If genuinely-stuck agents stop responding, the
IsRunning check still catches them on the next A2A forward.
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(adapters): add gemini-cli runtime adapter (closes #332)
Adds a `gemini-cli` workspace runtime backed by Google's Gemini CLI
(@google/gemini-cli, ~101k ★, Apache 2.0). Mirrors the claude-code
adapter pattern: Docker image installs the CLI, CLIAgentExecutor
drives the subprocess, A2A MCP tools wire via ~/.gemini/settings.json.
Changes:
- workspace-template/adapters/gemini_cli/ — new adapter (Dockerfile,
adapter.py, __init__.py, requirements.txt); setup() seeds GEMINI.md
from system-prompt.md and injects A2A MCP server into settings.json
- workspace-template/cli_executor.py — adds gemini-cli to
RUNTIME_PRESETS (--yolo flag, -p prompt, --model, GEMINI_API_KEY env
auth); adds mcp_via_settings preset flag to skip --mcp-config
injection for runtimes that own their own settings file
- workspace-configs-templates/gemini-cli/ — default config.yaml +
system-prompt.md template
- tests/test_adapters.py — adds gemini-cli to expected adapter set
- CLAUDE.md — documents new runtime row in the image table
Requires: GEMINI_API_KEY global secret. Build:
bash workspace-template/build-all.sh gemini-cli
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(provisioner): add gemini-cli to RuntimeImages map
Without this entry, POST /workspaces with runtime:gemini-cli falls back
to workspace-template:langgraph (wrong image, missing gemini dep) instead
of workspace-template:gemini-cli. Every runtime MUST have an entry here.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* config(org): add Telegram to Dev Lead and Research Lead (closes #383)
Completes leadership-tier Telegram coverage:
PM ✓ DevOps ✓ Security ✓ → Dev Lead ✓ Research Lead ✓
Both roles produce high-value async output (architecture decisions,
eco-watch summaries) that was invisible until the user polled the
canvas. Same bot_token/chat_id secrets as the other three roles —
no new credentials required.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: DevOps Engineer <devops@molecule.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
The platform's GET /workspaces/:id/transcript proxy was constructing the
outbound request without an Authorization header. The workspace's /transcript
endpoint (hardened in #287/#328) fails-closed when the header is absent,
so every transcript call in production returned 401 from the workspace.
Fix: after WorkspaceAuth validates the incoming bearer token, the handler
now forwards it verbatim via req.Header.Set("Authorization", ...).
Forwarding is safe — the token has already been validated by the middleware.
Tests:
- TestTranscript_ForwardsAuthHeader: was t.Skip'd as a bug marker; now
active. Verifies the Authorization header reaches the workspace stub.
- TestTranscript_NoAuthHeader_PassesThrough: new. Verifies that a missing
header produces no synthetic Authorization on the upstream call, and the
workspace 401 is faithfully relayed.
Identified by QA audit 2026-04-16.
Co-authored-by: QA Engineer <qa-engineer@molecule-ai.internal>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
PR #357 deleted the grace-period tests that used hasLiveTokenQuery and
workspaceExistsQuery, but the constants themselves (and the stale comment
describing the old HasAnyLiveToken-based dispatch) were not removed.
Remove both dead const declarations and update the header comment to
reflect the strict-enforcement contract introduced by #357.
Closes #358.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Severity HIGH. #318 closed the fake-UUID fail-open for WorkspaceAuth
but left the grace period intact for *real* workspaces with no live
tokens. Zombie test-artifact workspaces from prior DAST runs still
exist in the DB with empty configs and no tokens, so they pass
WorkspaceExists=true but HasAnyLiveToken=false — and fell through the
grace period, leaking every global-secret key name to any
unauthenticated caller on the Docker network.
Phase 30.1 shipped months ago; every production workspace has gone
through multiple boot cycles and acquired a token since. The
"legacy workspaces grandfathered" window no longer serves legitimate
traffic. Removing it entirely is the cleanest fix — and does NOT
affect registration (which is on /registry/register, outside this
middleware's scope).
New contract (strict):
every /workspaces/:id/* request MUST carry
Authorization: Bearer <token-for-this-workspace>
Any missing/mismatched/revoked/wrong-workspace bearer → 401. No
existence check, no fallback. The wsauth.WorkspaceExists helper is
kept in the package for any future caller but no longer used here.
Tests:
- TestWorkspaceAuth_351_NoBearer_Returns401_NoDBCalls — new, covers
fake UUID / zombie / pre-token in one sub-table. Asserts zero DB
calls on missing bearer.
- Existing C4/C8 + #170 tests updated to drop the stale
HasAnyLiveToken sqlmock expectations.
- Renamed TestWorkspaceAuth_Issue170_SecretDelete_FailOpen_NoTokens
to _NoTokensStillRejected and flipped the assertion from 200 to 401.
- Dropped TestWorkspaceAuth_318_ExistsQueryError_Returns500 — the
code path it covered no longer exists.
Full platform test sweep green.
Closes #351
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Severity LOW. The /webhooks/:type handler compared the Telegram
X-Telegram-Bot-Api-Secret-Token header against the decrypted
webhook_secret using Go's `!=` operator, which short-circuits on the
first mismatched byte. Under low-latency Docker-network conditions an
attacker could time response latency byte-by-byte and converge on the
real secret, then inject Telegram-formatted messages into any channel.
Fix: switch to crypto/subtle.ConstantTimeCompare, whose running time
depends only on the input lengths, never on where the bytes first differ
(a length mismatch returns 0 immediately, leaking only the secret's
length, not its content). Same posture as the cdp-proxy token compare in
host-bridge (which already used timingSafeEqual).
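The comparison in sketch form (hypothetical wrapper name; the handler compares the header against the decrypted webhook_secret):

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// secretsEqual compares in time independent of where the bytes differ.
// ConstantTimeCompare returns 1 on equality; on a length mismatch it
// returns 0 immediately, which leaks only the secret's length.
func secretsEqual(got, want string) bool {
	return subtle.ConstantTimeCompare([]byte(got), []byte(want)) == 1
}

func main() {
	fmt.Println(secretsEqual("tg-secret", "tg-secret")) // true
	fmt.Println(secretsEqual("tg-secreT", "tg-secret")) // false
}
```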
Risk profile over the public internet is low (Telegram webhooks have
natural jitter that masks the signal), but the defensive pattern
matters for consistency across all secret comparisons.
Closes #337
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI 5/6 pass (E2E cancel = run-supersession pattern). Dev Lead review 04:21: ✅ Approved. Fixes cross-tenant token exposure: PausePollersForToken is now scoped to the requesting workspace_id via a SQL WHERE clause. Closes #329.
`TranscriptHandler.Get` previously proxied `agent_card->>'url'` directly
to the outbound HTTP client with no validation. Since `agent_card` is
attacker-writable via /registry/register, a workspace-token holder
could point it at cloud metadata (169.254.169.254), link-local ranges,
or non-http schemes and pivot the platform container against internal
services (IMDS, Redis, Postgres, other containers on the Docker net).
Four required fixes per reviewer:
1. `validateWorkspaceURL(u *url.URL)` — runs before `httpClient.Do`:
- scheme must be http/https (rejects file://, gopher://, ftp://)
- cloud metadata hostname blocklist (GCP + Azure + plain "metadata")
- IMDS IP blocklist (169.254.169.254)
- IPv4/IPv6 link-local blocklist (169.254/16, fe80::/10, multicast)
- IPv6 unique-local fd00::/8 blocklist
- loopback + docker.internal still allowed for local dev
2. Query-param allowlist — `target.RawQuery = c.Request.URL.RawQuery`
forwarded everything verbatim, letting a caller smuggle params the
upstream transcript endpoint didn't intend to expose. Replaced with
an allowlist of `since` and `limit`.
3. Sanitized error string — `fmt.Sprintf("workspace unreachable: %v", err)`
leaked the actual internal host/IP via `net.OpError`. Now logs the
real error server-side and returns a plain "workspace unreachable"
to the caller.
4. 10 new regression test cases:
- `TestTranscript_Rejects{CloudMetadataIP,NonHTTPScheme,MetadataHostname,LinkLocalIPv6}`
exercise the handler end-to-end with each attack URL and assert
400 before the HTTP client fires.
- `TestValidateWorkspaceURL` table-drives the validator across
localhost/public/docker-internal (allowed) + IMDS/GCP/Azure/file/
gopher/link-local/multicast (rejected).
- `TestTranscript_ProxyPropagatesAllowlistedQueryParams` asserts
`secret=leak&cmd=rm` is stripped while `since=42&limit=7` pass
through.
Also fixed a pre-existing test bug: `seedWorkspace` was issuing a real
SQL Exec against sqlmock with no expectation set, so the prior test
helpers silently failed in CI. Replaced with `expectWorkspaceURLLookup`
which programs the mock correctly. All 11 tests now pass.
Closes#272
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes#282. CLAUDE.md documented the SecurityHeaders() middleware as
setting 6 headers (X-Content-Type-Options, X-Frame-Options, Referrer-
Policy, Content-Security-Policy, Permissions-Policy, HSTS) but the
implementation only set 4 — Referrer-Policy and Permissions-Policy
were silently missing.
Adds:
- Referrer-Policy: strict-origin-when-cross-origin — prevents
browsers from leaking full paths/queries in Referer on cross-
origin navigation. Particularly relevant for canvas embeds of
Langfuse trace URLs that may contain trace IDs.
- Permissions-Policy: camera=(), microphone=(), geolocation=() —
denies sensor access by default. Iframes the canvas embeds
(Langfuse trace viewer etc.) can no longer request these
without an explicit delegation.
Regression tests added to securityheaders_test.go — both headers
are now in the same table-driven assertion loop as the other 4,
so a future edit that drops them again fails CI loudly.
LOW severity — this is defense-in-depth, not a direct exploit path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes #N (issue to be filed)
Lets canvas / operators see live tool calls + AI thinking instead of
waiting for the high-level activity log to flush. Right now the only
way to "look over an agent's shoulder" is `docker exec ws-XXX cat
/home/agent/.claude/projects/.../<session>.jsonl`, which:
- doesn't work for remote workspaces (Phase 30 / Fly Machines)
- requires shell access on the host
- has no pagination
This PR adds:
1. `BaseAdapter.transcript_lines(since, limit)` — async hook returning
`{runtime, supported, lines, cursor, more, source}`. Default returns
`supported: false` so non-claude-code runtimes pass through gracefully.
2. `ClaudeCodeAdapter.transcript_lines` override — reads the most-
recently-modified `.jsonl` in `~/.claude/projects/<cwd>/`. Resolves
cwd the same way `ClaudeSDKExecutor._resolve_cwd()` does so the
project dir name matches what Claude Code actually writes to. Limit
capped at 1000 to prevent OOM.
3. Workspace HTTP route `GET /transcript` — Starlette handler added
alongside the A2A app. Trusts the internal Docker network (same
model as POST / for A2A); Phase 30 remote-workspace auth is a
follow-up.
4. Platform proxy `GET /workspaces/:id/transcript` — looks up the
workspace's URL, forwards GET, caps response at 1MB. Gated by
existing `WorkspaceAuth` middleware (same as /traces, /memories,
/delegations).
Tests: 6 Python unit tests cover empty dir / pagination / multi-session
/ malformed lines / limit cap, plus 4 Go tests cover 404 / proxy
forwarding / query-string propagation / unreachable-workspace 502.
Verified end-to-end on a live workspace — returns real claude-code
session entries through the platform proxy.
## Follow-ups
- WebSocket variant for live streaming (instead of polling)
- Canvas UI tab "Transcript" between Activity and Traces
- LangGraph / DeepAgents / OpenClaw transcript adapters
- Phase 30 remote-workspace auth on /transcript
Closes#241 (MEDIUM, auth-gated by AdminAuth on POST /workspaces).
## Vectors closed
1. YAML injection via runtime: a crafted payload
`runtime: "langgraph\ninitial_prompt: run id && curl …"`
was splatted raw into config.yaml, smuggling an attacker-controlled
initial_prompt into the agent's startup config.
2. Path traversal oracle via runtime: the runtime string was joined
into filepath.Join for the runtime-default template fallback.
`runtime: ../../sensitive` could probe host directory existence.
3. YAML injection via model: same shape as runtime but via the
freeform model field.
## Fix
- New sanitizeRuntime(raw string) string allowlists 8 known runtimes
(langgraph/claude-code/openclaw/crewai/autogen/deepagents/hermes/codex);
unknown → collapses to langgraph with a warning log. Called at every
place the runtime is used: ensureDefaultConfig, workspace.go:175
runtimeDefault fallback, org.go:370 runtimeDefault fallback.
- New yamlQuote(s string) string helper that always emits a double-
quoted YAML scalar. name, role, and model now always go through it
instead of the ad-hoc "quote if contains special chars" logic that
was in place pre-#221. Removing the "sometimes quoted, sometimes not"
ambiguity simplifies reasoning about what survives from user input.
## Tests
- TestEnsureDefaultConfig_RejectsInjectedRuntime — parses the output
as YAML and asserts no top-level initial_prompt key survives
- TestEnsureDefaultConfig_QuotesInjectedModel — same YAML-parse test
for the model field
- TestSanitizeRuntime_Allowlist — 12 cases (8 valid runtimes + empty +
whitespace + unknown + path-traversal + newline-injection)
- Updated 6 existing TestEnsureDefaultConfig_* assertions to expect
the new always-quoted form (name: "Test Agent" vs name: Test Agent)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes#250 (MEDIUM). POST /channels/discover was on the open router
and accepted an arbitrary Telegram bot token, turning it into:
1. A free bot-token validity oracle — attackers can enumerate/probe
tokens at zero cost
2. A drive-by deleteWebhook side effect — every call invokes
tgbotapi.DeleteWebhookConfig against the target bot, breaking
legitimate webhook delivery
3. A rate-limit amplifier — getMe + deleteWebhook + getUpdates per call
Fix: one-line addition of middleware.AdminAuth(db.DB) to the route,
matching its actual intent (platform-operator admin helper, not a
per-workspace route). Pattern mirrors /admin/liveness, /events, and
/bundles/export from PR #167.
No new test: AdminAuth behavior is covered by
wsauth_middleware_test.go; this PR only wires it onto an additional
route. The load-bearing code comment references #250 so future
reviewers can't revert without an issue citation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes#234 LOW. The security log I added in PR #228 (code-review
follow-up) echoed body.SourceID with %s, which preserves any \n / \r
that json.Unmarshal decoded from the attacker's JSON. An authenticated
workspace could have injected fake log entries by sending
source_id="evil\ntimestamp=FORGED level=INFO msg=fake".
Fix: use %q on both body_source_id and c.ClientIP(). Go-quoted string
escapes all control characters so multi-line payloads stay on a single
log line. One-line fix.
Regression test: TestActivityHandler_Report_SourceIDLogInjection
exercises the code path with a literal \n in source_id. Assertion is
limited to "handler returns 403 cleanly with no panic" because
capturing log output in Go tests requires a log.SetOutput swap, which
adds noise for little signal vs just reading the test log output
(visible when running with -v).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes#226 MEDIUM. WorkspaceHandler.Create joined payload.Template
directly into filepath.Join(configsDir, template) without validating
it stayed inside configsDir. An attacker posting Template="../../etc"
would have the provisioner walk and mount arbitrary host directories
into the workspace container.
Same fix as #103 (POST /org/import): use the existing resolveInsideRoot
helper to reject absolute paths and any ".." that escapes the root.
Applied at both call sites in workspace.go:
1. Synchronous runtime detection before DB insert — 400 on bad input
2. Async provisioning goroutine — early return, logs the rejection
(belt-and-suspenders; the create path already blocks)
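The helper's contract can be sketched as follows (the real resolveInsideRoot lives next to the org import handler; this signature and the lexical-clean approach are assumptions pinned only by the behaviours org_path_test.go covers):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// resolveInsideRoot rejects absolute paths and any relative path whose
// cleaned join escapes root — covering the absolute / traversal /
// prefix-sibling / empty-path cases the existing suite locks in.
func resolveInsideRoot(root, rel string) (string, error) {
	if rel == "" || filepath.IsAbs(rel) {
		return "", fmt.Errorf("invalid path %q", rel)
	}
	joined := filepath.Clean(filepath.Join(root, rel))
	// Separator guard: "/cfg-evil" must not pass as inside "/cfg".
	if joined != root && !strings.HasPrefix(joined, root+string(filepath.Separator)) {
		return "", fmt.Errorf("path %q escapes root", rel)
	}
	return joined, nil
}

func main() {
	_, err := resolveInsideRoot("/configs", "../../etc")
	fmt.Println(err != nil) // true: traversal rejected
	p, _ := resolveInsideRoot("/configs", "templates/default")
	fmt.Println(p)
}
```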
No test added inline because the existing resolveInsideRoot suite
(org_path_test.go) already covers absolute / traversal / prefix-sibling
/ empty-path / deep-subpath cases. A duplicate test for the workspace
handler wouldn't add signal.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The original fix stripped \n/\r but left other characters untouched,
and relied on a substring-based test that was over-strict (the escaped
fragment still contained the banned substring as raw bytes).
Better approach: emit the name as a double-quoted YAML scalar with all
escape sequences (\\, \", \n, \r, \t) handled inline. This is the
canonical YAML-safe way to embed user input — no injection possible
because every control character is either escaped or rejected by the
YAML parser inside the scalar context.
Test rewritten to parse the output as YAML and verify:
1. parsed[\"name\"] equals the literal attacker input (payload preserved)
2. no banned top-level keys leaked to the parsed map
3. legitimate default keys (description/version/tier/model) still present
Updated the two existing tests that asserted the unquoted name format.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses self-review of the 10-PR batch merged earlier this session.
Splits the follow-ups into this Go-side PR and a later Python/docs PR.
## Fixes
1. wsauth_middleware.go CanvasOrBearer — invalid bearer now hard-rejects
with 401 instead of falling through to the Origin check. Previous code
let an attacker with an expired token + matching Origin bypass auth.
Empty bearer still falls through to the Origin path (the intended
canvas path).
2. scheduler.go short() helper — extracts safe UUID prefix truncation.
Pre-existing unsafe [:12] and [:8] slices would panic on workspace IDs
shorter than the bound. #115's new skip path had the bounds check;
the happy-path log lines did not. One helper, three call sites.
3. activity.go security-event log on source_id spoof — #209 added the
403 but the attempt was invisible to any auditor cron. Stable
greppable log line with authed_workspace, body_source_id, client IP.
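Fix 2's helper is tiny; a sketch (the two-argument signature is an assumption — the description only fixes the helper name and the 12/8 prefix lengths at the call sites):

```go
package main

import "fmt"

// short returns at most n leading characters of id, replacing the
// pre-existing id[:12] / id[:8] slices that panicked on short IDs.
func short(id string, n int) string {
	if len(id) < n {
		return id
	}
	return id[:n]
}

func main() {
	fmt.Println(short("0f8d2c41-9a77-4b1e-8d32-aa51c0ffee00", 8)) // 0f8d2c41
	fmt.Println(short("tiny", 12))                                // tiny
}
```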
## New tests
- TestShort_helper — bounds-safety regression guard for the helper
- TestRecordSkipped_writesSkippedStatus — #115 coverage gap, exercises
UPDATE + INSERT via sqlmock
- TestRecordSkipped_shortWorkspaceIDNoPanic — short-ID crash regression
- TestActivityHandler_Report_SourceIDSpoofRejected — #209 403 path
- TestActivityHandler_Report_MatchingSourceIDAccepted — non-spoof path
- TestHistory_IncludesErrorDetail — #152 problem B coverage
go test -race ./... green locally.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The 13K-line plugins_install_pipeline.go had zero unit tests, making it
the highest-regression-risk file in the platform handlers package.
New test file covers all testable pure-function and integration paths
that do not require a live Docker daemon:
validatePluginName (8 cases)
- valid names, empty, forward slash, backslash, "..", embedded "..";
path-traversal variants ("../etc", "../../secrets")
dirSize (6 cases)
- empty dir, single file, multiple files, nested subdirectory,
exceeds limit (verifies error mentions "cap"), exactly at limit
httpErr / newHTTPErr (3 cases)
- Error() contains status code, all relevant HTTP codes preserved,
errors.As unwraps through fmt.Errorf %w chains
regexpEscapeForAwk (6 cases)
- alphanumeric names unchanged, slash escaped, dot escaped, + escaped,
full "# Plugin: name /" marker (space not escaped), backslash escaped
streamDirAsTar (4 cases)
- empty dir yields zero entries, single file round-trips content,
nested directory preserves relative path, entries have no absolute
or tempdir-leaking paths
resolveAndStage via stubResolver (10 cases)
- empty source → 400, unknown scheme → 400, happy path (result fields),
staged dir cleaned on fetch error, ErrPluginNotFound → 404,
DeadlineExceeded → 504, generic error → 502, resolver returns invalid
name → 400, local:// path traversal → 400 (pre-Fetch validation)
stubResolver implements plugins.SourceResolver as an in-process test
double — no network, no filesystem side-effects beyond the staging tempdir
that resolveAndStage creates and cleans up.
Closes#217
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A crafted workspace name containing a newline (e.g. "x\nmodel: evil")
could inject arbitrary YAML keys into the auto-generated config.yaml.
Strip \n and \r from the name before interpolation. YAML key injection
requires a newline to start a new mapping entry; other characters such
as `:` are safe in unquoted scalar values.
Adds TestGenerateDefaultConfig_YAMLInjection with three adversarial
inputs: bare \n injection, CRLF injection, and multi-key injection.
Closes#221
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Closes#211 HIGH ops/security. RunMigrations globbed `*.sql`, which
matches both `.up.sql` AND `.down.sql`. Alphabetical sort puts "d"
before "u", so every platform boot ran the rollback BEFORE the forward
migration for any pair starting with migration 018.
Net effect: every restart wiped workspace_auth_tokens (the 020 pair),
which in turn regressed AdminAuth to its fail-open bootstrap bypass for
every route protected by it — the live server was effectively
unauthenticated from restart until the next workspace re-registered.
Also wiped 018_secrets_encryption_version and 019_workspace_access
pairs silently.
Fix is a 3-line filter: skip files whose base name ends in `.down.sql`.
Down migrations remain on disk for operator-driven rollback via psql,
but are never picked up by the auto-run loop.
Added unit test against a tmp dir to lock the filter behaviour so this
can never regress: stages a mix of legacy plain .sql, matched up/down
pairs, asserts only forward files survive.
Follow-up (not in this PR): the runner still re-applies every migration
on every boot. Migrations must be idempotent. A proper schema_migrations
tracking table is tracked as a future cleanup.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cherry-picks the one genuinely new fix from #169 after confirming the
rest of that PR is already covered on main (C1/C3/C5 by wsAuth group,
C6 by #94+#119 SSRF blocklist, C4 ownership by existing WHERE filter).
Pre-existing middleware (WorkspaceAuth on /workspaces/:id/* sub-routes)
proves the caller owns the :id path param. But the body field
source_id was never validated — a workspace authenticated for its own
/activity endpoint could still attribute logs to a different workspace
by setting source_id=<foreign UUID>. Rejected with 403 now.
No schema change, no new middleware. 4-line handler delta. Closes the
only real gap in #169; #169 itself will be closed as superseded.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes#115. The Security Auditor hourly cron (and likely others) hit a
~36% miss rate because the platform's A2A proxy rejected fires with
"workspace agent busy — retry after a short backoff" while the agent was
still executing the prior audit. That error was recorded as a hard
failure and polluted last_error.
New behaviour:
Before fireSchedule calls into the A2A proxy, it reads
workspaces.active_tasks for the target. If >0, it:
- Advances next_run_at to the next cron slot (cron keeps ticking)
- Bumps run_count
- Sets last_status='skipped' + last_error=<reason>
- Inserts a cron_run activity_logs row with status='skipped' + error_detail
- Broadcasts CRON_SKIPPED for canvas + operators
Effect: busy-collision ceases to be an error. The history surface now
distinguishes "ran and failed" from "skipped because busy". Operators
can tell the difference at a glance, and the liveness view doesn't
stall waiting for the next ticker cycle.
Pairs with #149 (dedicated heartbeat pulse) and #152 problem B
(error_detail surfaced in history) for a coherent scheduler story.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes#152 problem B (schedule history API drops error detail).
Two tiny changes:
1. scheduler.fireSchedule now writes lastError into activity_logs.error_detail
when inserting the cron_run row. Previously the column was left NULL even
on failure because the INSERT didn't include it.
2. schedules.History SELECT now reads error_detail and includes it in the
JSON response under error_detail. Frontend + audit cron can now display
"why did this run fail" instead of just "status=error".
No schema change — activity_logs.error_detail already exists from
migration 009. This just starts using the column.
Problem A of #152 (Research Lead ecosystem-watch 50% error rate on its
own) is a separate ops investigation and stays open.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes#168 by the route-split path from #194's review. #167 put PUT
/canvas/viewport behind strict AdminAuth, breaking canvas drag/zoom
persist because the canvas uses session cookies not bearer tokens.
New narrow middleware CanvasOrBearer:
- Accepts a valid bearer (same contract as AdminAuth) OR
- Accepts a request whose Origin exactly matches CORS_ORIGINS
- Lazy-bootstrap fail-open preserved for fresh installs
Applied ONLY to PUT /canvas/viewport. The softer check is acceptable
there because viewport corruption is cosmetic-only — worst case a
user refreshes the page. This middleware must NOT be used on routes
that leak prompts (#165), create resources (#164), or write files
(#190) — see #194 review for why.
The other canvas-facing routes mentioned in #168 (Events tab, Bundle
Export/Import) remain behind strict AdminAuth pending a proper
session-cookie-accepting AdminAuth (#168 follow-up for Phase H).
6 new tests cover: bootstrap fail-open, no-creds 401, canvas origin
match, wrong origin 401, empty origin rejected, localhost default.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes#190 (HIGH). The route was registered on the root router with no
auth middleware, letting any unauthenticated caller write arbitrary files
into configsDir via a crafted template. Same vulnerability class as #164
(bundles/import) and path-traversal risk same as #103 (org/import).
One-line gate via the existing wsAdmin pattern. Lazy-bootstrap fail-open
preserved for fresh installs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The route wsAuth.DELETE("/secrets/:key", sech.Delete) was already moved
inside the WorkspaceAuth group in a prior commit, closing the CWE-306
unauthenticated-delete vector. This commit adds two regression tests to
lock that in:
- TestWorkspaceAuth_Issue170_SecretDelete_NoBearer_Returns401: workspace
with live tokens, no bearer header → 401 (blocks the attack).
- TestWorkspaceAuth_Issue170_SecretDelete_FailOpen_NoTokens: workspace
with no tokens (bootstrap/legacy) → 200 (fail-open preserved).
Mirrors the TestAdminAuth_Issue120_* and TestWorkspaceAuth_C4_C8_* patterns.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without this call, Gin's default configuration trusts all
X-Forwarded-For headers, letting any caller rotate their effective IP
and bypass per-IP rate limiting. SetTrustedProxies(nil) forces
c.ClientIP() to always return the real TCP RemoteAddr.
Adds two regression tests: one documenting the pre-fix bypass, one
asserting the spoofed header is ignored after the fix.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GET /approvals/pending was registered on the open router with no
middleware, allowing any unauthenticated caller to enumerate all pending
approvals across every workspace on the platform.
Fix: add inline middleware.AdminAuth(db.DB) to the route registration,
matching the pattern used in PR #167 for bundles, events, and viewport.
The three workspace-scoped approvals routes (POST/GET /approvals,
POST /approvals/:id/decide) were already correctly behind WorkspaceAuth
inside the wsAuth group — no change needed there.
Tests: two new regression tests in wsauth_middleware_test.go —
TestAdminAuth_Issue180_ApprovalsListing_NoBearer_Returns401
TestAdminAuth_Issue180_ApprovalsListing_FailOpen_NoTokens
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CRITICAL (#164):
POST /bundles/import — anon callers could create arbitrary workspaces
with user-supplied system prompts, plugins, and secrets envelopes.
Fixed by gating behind AdminAuth (bundleAdmin group).
HIGH (#165):
GET /bundles/export/:id — anon UUID probe leaked full system prompts,
agent_card, plugins, memory for any workspace.
GET /events + GET /events/:workspaceId — anon read of the append-only
event log leaked org topology, workspace names, card fragments.
Both moved into the same bundleAdmin / eventsAdmin groups.
MEDIUM (#166):
PUT /canvas/viewport — anon callers could reset shared viewport state.
Gated via a scoped viewportAdmin group; GET stays open so canvas
bootstraps without a bearer.
GET /admin/liveness — operational-intel leak (scheduler cadence
reveals work pattern). Inline AdminAuth on the single handler.
All 6 routes use the same lazy-bootstrap admin auth the rest of the
platform uses: zero-token installs fail-open, once any token exists
every request must present a valid bearer.
Known follow-up: canvas uses session cookies not bearer tokens (same
pattern as #138). In multi-tenant production these canvas features —
Events tab, Export/Duplicate, viewport persist — will return 401 once
a workspace is token-enrolled. Needs cookie-accepting AdminAuth as a
follow-up (tracked as option B in #138 triage discussion); a new issue
will be filed for that scope. The security gain from closing #164
CRITICAL outweighs the canvas UX regression for tonight.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes#138. #125 moved PATCH /workspaces/:id into the wsAdmin AdminAuth
group to close the #120 unauth vulnerability, but broke canvas drag-
reposition and inline rename because canvas uses session cookies not
bearer tokens. Multi-tenant deployments with any live token would have
seen every canvas PATCH 401.
Option A per #138 triage: PATCH goes back on the open router, but
WorkspaceHandler.Update now enforces field-level authz:
Cosmetic (no bearer required):
name, role, x, y, canvas
Sensitive (bearer required when any live token exists):
tier — resource escalation
parent_id — A2A hierarchy manipulation
runtime — container image swap
workspace_dir — host bind-mount redirection
Fail-open bootstrap: HasAnyLiveTokenGlobal = 0 → pass-through
(fresh install, pre-Phase-30 upgrade path). Matches the same
lazy-bootstrap contract WorkspaceAuth and AdminAuth use elsewhere.
3 new tests cover all three branches of the matrix (cosmetic
no-bearer, sensitive no-bearer-rejected, sensitive fail-open).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
#125 added a SELECT EXISTS guard before WorkspaceHandler.Update applies
any UPDATE so nonexistent workspace IDs return 404 instead of silent
zero-row successes. The 4 existing WorkspaceUpdate_* sqlmock tests
didn't mock the probe, so they broke on main. This was not caught
because CI is blocked by the Actions billing cap.
Adds ExpectQuery for the EXISTS probe to:
- TestWorkspaceUpdate_ParentID
- TestWorkspaceUpdate_NameOnly
- TestWorkspaceUpdate_MultipleFields
- TestWorkspaceUpdate_RuntimeField
TestWorkspaceUpdate_BadJSON doesn't need the fix — it aborts on
c.ShouldBindJSON before reaching the guard.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes#151. The middleware was already implemented + tested (3 passing
tests in securityheaders_test.go covering base set, multi-route, and
the don't-override-existing contract) but never registered in router.go.
One-line wire-up, runs after TenantGuard so rejected requests still
get the same headers as accepted ones, and before routes so handlers
can still opt out by setting their own header before c.Next() returns.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The #95 scheduler heartbeat scheme relied on:
1. Top of tick() (once per poll interval)
2. Per-fire goroutine entry + exit
That leaves a gap: tick() ends with wg.Wait(), so if a single fire takes
longer than pollInterval (UIUX audits routinely take 60-120s; max fireTimeout
is 5min), the next tick doesn't run and no top-of-tick heartbeat fires.
Per-fire heartbeats only bracket the fire — between entry and the HTTP
response returning, nothing heartbeats either.
Observed today: /admin/liveness reports seconds_ago=251 while docker logs
show the scheduler actively firing 'Hourly ecosystem watch'. Scheduler is
fine; liveness is lying.
Adds an independent 10s heartbeat pulse goroutine inside Start(), decoupled
from tick completion. The existing heartbeats at tick top + per-fire are
kept as redundant signals but this pulse is the one that guarantees liveness
freshness regardless of what tick is doing.
Ships the exact fix proposed in #140 body.
Closes#140.
Security fix merging despite CI outage (issue #136 — runner failing since 07:22, all jobs fail in 1-2s with no log output, infrastructure issue confirmed across 28 consecutive runs).
Issue #120 confirmed live by Security Auditor (cycle 3):
curl -X PATCH .../workspaces/00000000-... -d '{"name":"probe"}' → 200 (no token)
Code reviewed and approved by Security Auditor. Tests added in commit 76cb7c3 follow established AdminAuth/sqlmock patterns. CI outage is unrelated to these changes.
Two gaps identified by Security Auditor in PR #125 review cycle:
1. handlers_extended_test.go:
- Fix TestExtended_WorkspaceUpdate: add SELECT EXISTS mock expectation
so the test correctly reflects the #120 existence guard now running first.
- Add TestExtended_WorkspaceUpdate_NotFound: verifies PATCH returns 404
(not 200) for a nonexistent workspace ID — the core #120 behaviour fix.
2. wsauth_middleware_test.go:
- Add TestAdminAuth_Issue120_PatchWorkspace_NoBearer_Returns401: documents
the confirmed attack vector (PATCH without token must return 401) and
asserts AdminAuth is applied to PATCH /workspaces/:id per the router.go change.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Issue #120 (HIGH — immediately exploitable):
PATCH /workspaces/:id was registered on the root router with no auth
middleware. An attacker with any workspace UUID could:
- Escalate tier (tier 4 = 4 GB RAM allocation)
- Rewrite parent_id to subvert CanCommunicate A2A access control
- Swap runtime image on next restart
- Redirect workspace_dir host bind-mount to arbitrary path
Fix: move PATCH into the wsAdmin AdminAuth group alongside POST, DELETE.
The canvas position-persist call already has an AdminAuth token (required
for GET /workspaces list on initial load) so no canvas regression.
Also add workspace-existence guard in Update handler — previously returned
200 with zero rows affected for nonexistent IDs.
Issue #113 (MEDIUM — schedule IDOR, carry-over from prior cycle):
PATCH /workspaces/:id/schedules/:scheduleId and DELETE operated on
scheduleID alone (WHERE id = $1), allowing any authenticated caller to
modify or delete schedules belonging to other workspaces.
Fix: bind workspace_id = c.Param("id") in both Update and Delete handlers;
add AND workspace_id = $N to all schedule SQL queries.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PR #94 blocked 169.254.0.0/16 but left IPv6 equivalents fully open.
Go's (*IPNet).Contains() does not match pure IPv6 addresses against IPv4
CIDRs, so ::1, fe80::*, and fc00::/7 all bypassed the check.
Add three explicit IPv6 entries to blockedRanges:
- fe80::/10 (IPv6 link-local — cloud metadata analogue)
- ::1/128 (IPv6 loopback)
- fc00::/7 (IPv6 ULA — RFC-4193 private)
IPv4-mapped IPv6 (::ffff:169.254.x.x) is already safe: Go normalises
these to IPv4 via To4() before Contains() runs.
Tests: four new cases in TestValidateAgentURL covering all three blocked
IPv6 ranges plus the IPv4-mapped IPv6 auto-normalisation path.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Added scheduler_test.go with 8 test cases covering all previously untested
security-critical code paths from PR #90:
TestLastTickAt_zero — zero time before first tick
TestHealthy_beforeStart — false on fresh scheduler (zero lastTickAt)
TestHealthy_freshTick — true when lastTickAt == now
TestHealthy_stale — false when lastTickAt is 3×pollInterval ago
TestComputeNextRun_valid — "0 * * * *" / UTC returns top-of-hour future time
TestComputeNextRun_invalid — unparseable expression returns non-nil error
TestComputeNextRun_invalidTimezone — unrecognised IANA zone returns non-nil error
TestPanicRecovery — panicProxy crashes ProxyA2ARequest; scheduler
goroutine recovers and remains Healthy
To support these tests, scheduler.go gained four changes (minimal surface):
1. Added mu sync.RWMutex, lastTickAt time.Time, and tickInterval time.Duration
fields to Scheduler. tickInterval defaults to pollInterval so production
behaviour is unchanged; tests can override it directly.
2. Added LastTickAt() and Healthy() methods with read-lock protection.
3. tick() now records lastTickAt after wg.Wait() — a single atomic write under
the mutex, no hot-path cost.
4. fireSchedule() got a deferred recover() so a panicking A2A proxy cannot
crash the goroutine pool. Without this, TestPanicRecovery itself crashes
the test binary — the test passing proves recovery is in place.
Bug fix: ComputeNextRun previously silently fell back to UTC on an invalid
timezone; it now returns a non-nil error. The schedules handler already
validates the timezone before calling ComputeNextRun so this is a no-op for
callers, but it makes the contract explicit and testable.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The Delete handler marked workspaces 'removed' but never touched
workspace_auth_tokens. That left stale live tokens in the table, so
HasAnyLiveTokenGlobal stayed true after the last workspace was deleted.
AdminAuth then blocked the unauthenticated GET /workspaces in the E2E
count-zero assertion with 401, and the previous commit worked around it
by commenting out the assertion.
This commit fixes the root cause:
- workspace.go Delete: batch-revoke auth tokens for all deleted
workspace IDs (including descendants) immediately after the canvas_layouts
clean-up, using the same pq.Array pattern as the status update.
- workspace_test.go TestWorkspaceDelete_CascadeWithChildren: add the
expected UPDATE workspace_auth_tokens SET revoked_at sqlmock expectation.
- tests/e2e/test_api.sh: restore the count=0 post-delete assertion
(now passes because tokens are revoked → fail-open), capture NEW_TOKEN
from the re-imported workspace registration for the final cleanup call
(SUM_TOKEN is revoked after SUM_ID is deleted).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Closes#101 layer 1: buildGitHubA2APayload now handles workflow_run
events, routing failed CI runs to a workspace via the existing
X-Molecule-Workspace-ID / webhook path. Only completed runs with a
failure/cancelled/timed_out conclusion fan out — success/skipped/neutral
are dropped via errIgnoredGitHubAction.
Surface message is human-readable + includes the run URL so DevOps can
jump straight to the failing job. Metadata carries the full run context
(workflow_name, run_id, run_number, conclusion, head_branch, head_sha,
run_url, trigger_event) for programmatic handling.
4 new tests cover the failure path, success skip, non-completed action
skip, and short-SHA edge case.
Layer 2 (org.yaml wiring for DevOps workspace + GITHUB_WEBHOOK_SECRET
docs) stays as a follow-up PR.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes#93 and #105.
#93 — add research/plugins/template/channels entries to org.yaml
category_routing defaults. Without them, evolution crons firing with
these categories found no target and their audit summaries silently
dropped at PM. Routes each back to the role that generated it so the
author acts on their own findings.
#105 — emit X-RateLimit-Limit / -Remaining / -Reset on every response
(allowed and throttled) and Retry-After on 429s per RFC 6585. 2 tests
cover both paths. Clients and monitoring tools can now back off
proactively instead of polling into 429 walls.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes#103 (HIGH). Three attack surfaces on the import endpoint —
body.Dir, workspace.Template, workspace.FilesDir — were concatenated
via filepath.Join without validation, letting an unauthenticated
caller probe arbitrary filesystem paths with "../../../etc".
Two layers of defense:
1. resolveInsideRoot() rejects absolute paths and any relative path
whose lexically cleaned join escapes the provided root (Abs +
HasPrefix + separator guard). 6 tests cover happy path, traversal
attempts, absolute path, empty input, prefix-sibling escape, and
deep subpath resolution.
2. Route now runs behind middleware.AdminAuth so an unauthenticated
attacker can't reach the handler at all once a token exists.
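A minimal sketch of the layer-1 containment check, assuming the signature implied above (the shipped resolveInsideRoot may differ in details such as error messages):

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// resolveInsideRoot rejects empty and absolute inputs, then verifies that
// the lexically cleaned join of root+rel still lives under root. The
// separator guard ensures "/root-evil" does not count as inside "/root".
func resolveInsideRoot(root, rel string) (string, error) {
	if rel == "" {
		return "", errors.New("empty path")
	}
	if filepath.IsAbs(rel) {
		return "", errors.New("absolute path rejected")
	}
	absRoot, err := filepath.Abs(root)
	if err != nil {
		return "", err
	}
	joined := filepath.Join(absRoot, rel) // Join runs filepath.Clean, collapsing ".."
	if joined != absRoot && !strings.HasPrefix(joined, absRoot+string(filepath.Separator)) {
		return "", fmt.Errorf("path %q escapes root", rel)
	}
	return joined, nil
}

func main() {
	p, _ := resolveInsideRoot("/srv/templates", "org/config.yaml")
	fmt.Println(p) // /srv/templates/org/config.yaml
	_, err := resolveInsideRoot("/srv/templates", "../../../etc")
	fmt.Println(err != nil) // true — traversal rejected
}
```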
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Found via deep workspace inspection during a maintenance cycle: Security
Auditor's hourly cron correctly tries to delegate_task its audit_summary
to PM, the platform proxy rejects with "access denied: workspaces cannot
communicate per hierarchy", the agent falls back to delegating to its
direct parent (Dev Lead), and PM's category_routing dispatcher (#75) is
never reached.
This breaks the audit-routing contract end-to-end. Every audit cycle was
landing on Dev Lead instead of being fanned out via PM's category_routing
to the right dev role (security → BE+DevOps, ui/ux → FE, etc).
## Root cause
`registry.CanCommunicate()` only allowed:
- self → self
- siblings (same parent)
- root-level siblings
- direct parent → child
- direct child → parent
A grandchild → grandparent (Security Auditor → PM, where parent is Dev
Lead and grandparent is PM) was DENIED. The original design wanted strict
hierarchy to prevent rogue horizontal A2A — but it also broke the
fundamental "child can talk to its leadership chain" pattern that any
audit/escalation flow needs.
## Fix
Generalise to ancestor ↔ descendant. Any workspace can talk to any
ancestor (any depth) and any descendant (any depth). Direct parent/child
remains a fast path that avoids the walk. Sibling rules unchanged.
Cousins still cannot directly communicate (would need to go through their
shared ancestor). Cross-subtree A2A is still rejected.
Implementation: `isAncestorOf(ancestorID, childID)` walks the parent
chain in Go with a maxAncestorWalk=32 safety cap so a malformed cycle in
the workspaces table cannot loop forever. One DB lookup per step. For a
typical 3-deep tree, this adds 1-2 extra lookups vs the old direct-parent
fast path. Could be optimized to a single recursive CTE if profiling
shows it matters; not now.
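The walk can be sketched with an in-memory parent map standing in for the per-step DB lookup (the real isAncestorOf queries the workspaces table; this sketch also omits the sibling and self fast paths, which are unchanged):

```go
package main

import "fmt"

// maxAncestorWalk caps the parent-chain walk so a malformed cycle in the
// workspaces table cannot loop forever.
const maxAncestorWalk = 32

// isAncestorOf walks child→parent until it meets ancestorID, hits a root,
// or exhausts the safety cap. parents maps workspace ID → parent ID
// (empty/absent means root); it stands in for one DB lookup per step.
func isAncestorOf(parents map[string]string, ancestorID, childID string) bool {
	cur := childID
	for i := 0; i < maxAncestorWalk; i++ {
		p, ok := parents[cur]
		if !ok || p == "" {
			return false // reached a root without meeting ancestorID
		}
		if p == ancestorID {
			return true
		}
		cur = p
	}
	return false // cap hit: cycle or pathological depth
}

// canCommunicate sketches the generalised rule: ancestor ↔ descendant at
// any depth; cousins and cross-subtree pairs stay denied.
func canCommunicate(parents map[string]string, a, b string) bool {
	return a == b || isAncestorOf(parents, a, b) || isAncestorOf(parents, b, a)
}

func main() {
	// The 3-deep chain from the bug: PM → Dev Lead → Security Auditor.
	parents := map[string]string{"devlead": "pm", "auditor": "devlead"}
	fmt.Println(canCommunicate(parents, "auditor", "pm")) // true
}
```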
## Tests
- TestCanCommunicate_Denied_Grandchild → REPLACED with two new tests:
- TestCanCommunicate_Allowed_GrandparentToGrandchild
- TestCanCommunicate_Allowed_GrandchildToGrandparent (the actual bug)
- TestCanCommunicate_Allowed_DeepAncestor — 4-level chain
- TestCanCommunicate_Denied_UnrelatedAncestors — ensures cross-subtree
walks still terminate denied
- TestCanCommunicate_Denied_DifferentParents — extended with the walk
lookup mocks so sqlmock doesn't log warnings
- TestCanCommunicate_Denied_CousinToRoot — same
All 13 tests pass clean. The previous direct parent/child / siblings /
self tests are unchanged (fast paths preserved).
## Why platform-level
Per the "platform-wide fixes are mine to ship" rule. Every org template
hits the same broken audit-routing chain — fixing it at the platform
benefits all users, not just molecule-dev. This unblocks #50 (PM
dispatcher prompt) and #75 (category_routing).
Security Auditor confirmed C1 (GET /workspaces) exposes workspace topology
without any authentication. The endpoint was intentionally left open for
the canvas browser frontend; this PR closes that gap.
Router change:
- Move GET /workspaces from the bare root router into the wsAdmin AdminAuth
group alongside POST /workspaces and DELETE /workspaces/:id.
- AdminAuth uses the same fail-open bootstrap contract as all other auth
gates: fresh installs (no live tokens) pass through; once any workspace
has registered with a token, a valid bearer is required.
Status of findings C2–C11 (documented here for audit trail):
- C2 POST /workspaces/:id/activity → already in wsAuth group (Cycle 5)
- C3 POST /workspaces/:id/delegations/record → already in wsAuth group (Cycle 5)
- C4 POST /workspaces/:id/delegations/:id/update → already in wsAuth group (Cycle 5)
- C5 GET /workspaces/:id/delegations → already in wsAuth group (Cycle 5)
- C7 GET /workspaces/:id/memories → already in wsAuth group (Cycle 5)
- C8 POST /workspaces/:id/memories → already in wsAuth group (Cycle 5)
- C9 POST /workspaces/:id/delegate → already in wsAuth group (Cycle 5)
- C10 GET /admin/secrets → already in adminAuth group (Cycle 7)
- C11 POST+DELETE /admin/secrets → already in adminAuth group (Cycle 7)
Tests (platform/internal/middleware/wsauth_middleware_test.go — 13 new):
WorkspaceAuth:
- fail-open when workspace has no tokens (bootstrap path)
- C4: no bearer on /delegations/:id/update → 401
- C8: no bearer on /memories POST → 401
- invalid bearer → 401
- cross-workspace token replay → 401
- valid bearer for correct workspace → 200
AdminAuth:
- fail-open when no tokens exist globally (fresh install)
- C10: no bearer on GET /admin/secrets → 401
- C11: no bearer on POST /admin/secrets → 401
- C11: no bearer on DELETE /admin/secrets/:key → 401
- valid bearer → 200
- invalid bearer → 401
Note: did NOT touch DELETE /admin/secrets in production — no destructive
calls to live secrets endpoints were made during this work.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PR #94 only blocked 127.0.0.0/8 (loopback) and 169.254.0.0/16
(link-local/IMDS). An attacker could still register a workspace with
a URL in any RFC-1918 range (10.x, 172.16–31.x, 192.168.x) and
redirect A2A proxy traffic to internal services.
Block all five reserved ranges in validateAgentURL:
- 169.254.0.0/16 link-local (IMDS: AWS/GCP/Azure)
- 127.0.0.0/8 loopback (self-SSRF)
- 10.0.0.0/8 RFC-1918
- 172.16.0.0/12 RFC-1918 (includes Docker bridge networks)
- 192.168.0.0/16 RFC-1918
Agents must use DNS hostnames, not IP literals. The provisioner
still writes 127.0.0.1 URLs via direct SQL UPDATE (CASE guard
preserves those); this blocklist only applies to the /registry/register
request body.
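The blocklist check can be sketched like this — a minimal illustration assuming a hostIsBlocked helper (name hypothetical); the five CIDR ranges are the ones listed above:

```go
package main

import (
	"fmt"
	"net"
)

// blockedCIDRs mirrors the five reserved ranges above. Parsing once at init
// keeps per-request validation to a simple Contains loop.
var blockedCIDRs []*net.IPNet

func init() {
	for _, cidr := range []string{
		"169.254.0.0/16", // link-local (IMDS: AWS/GCP/Azure)
		"127.0.0.0/8",    // loopback (self-SSRF)
		"10.0.0.0/8",     // RFC 1918
		"172.16.0.0/12",  // RFC 1918 (includes Docker bridge networks)
		"192.168.0.0/16", // RFC 1918
	} {
		_, n, err := net.ParseCIDR(cidr)
		if err != nil {
			panic(err)
		}
		blockedCIDRs = append(blockedCIDRs, n)
	}
}

// hostIsBlocked (hypothetical helper) applies the IP-literal rule described
// above: DNS hostnames pass through; IP literals in any reserved range are
// rejected.
func hostIsBlocked(host string) bool {
	ip := net.ParseIP(host)
	if ip == nil {
		return false // hostname, not an IP literal — allowed here
	}
	for _, n := range blockedCIDRs {
		if n.Contains(ip) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hostIsBlocked("172.17.0.2"))     // true — Docker bridge
	fmt.Println(hostIsBlocked("agent.internal")) // false — DNS hostname
}
```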
Tests: updated 3 previously-allowed RFC-1918 cases to expect rejection;
added 9 new cases covering range boundaries and the Docker bridge range.
All 22 validateAgentURL subtests pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The first scheduler heartbeat (#95) only fired AFTER each tick completed.
A tick that runs fireSchedule for 110+ seconds (long agent prompts) would
make /admin/liveness report the scheduler as stale even though it was actively
working. Observed today: scheduler firing UIUX audit, last_tick_at lagged
by 95s+ and incrementing.
Three places now call Heartbeat:
1. Top of tick() — proves we're past the ticker.C wait
2. Inside each fire goroutine, before fireSchedule — ANY active fire
keeps the heartbeat fresh
3. Inside each fire goroutine, after fireSchedule — captures the moment
the per-fire work completes
(The post-tick Heartbeat in Start() is still there as the "all idle" case.)
Net result: /admin/liveness reports stale only if the scheduler genuinely
isn't doing anything for >2× pollInterval, which is the actual signal we
want.
Yesterday's scheduler-died incident (#85) was one instance of a systemic
bug: every long-running goroutine in the platform lacks panic recovery
and exposes no liveness signal. In a multi-tenant SaaS deployment, a
single tenant's bad data panicking any subsystem takes down the
subsystem for every tenant, silently, with all standard health probes
still green. That is a scale-of-one sev-1.
This PR:
1. Introduces `platform/internal/supervised/` with two primitives:
a. RunWithRecover(ctx, name, fn) — runs fn in a recover wrapper.
On panic it logs the stack and restarts with exponential backoff
(1s → 2s → 4s → … → 30s cap). On clean return (fn decided to stop)
it exits. On ctx.Done it stops cleanly.
b. Heartbeat(name) + LastTick(name) + Snapshot() + IsHealthy(names,
staleThreshold) — shared in-memory liveness registry. Every
subsystem calls Heartbeat(name) at the end of each tick so
operators can distinguish "goroutine alive and healthy" from
"alive but stuck inside a single tick".
2. Wraps every `go X.Start(ctx)` in main.go:
- broadcaster.Subscribe (Redis pub/sub relay → WebSocket)
- registry.StartLivenessMonitor
- registry.StartHealthSweep
- scheduler.Start (the one that died yesterday)
- channelMgr.Start (Telegram / Slack)
3. Adds `supervised.Heartbeat("scheduler")` inside the scheduler tick
loop as the first end-to-end demonstration. Follow-up PRs will add
heartbeats to the other four subsystems.
4. Adds `GET /admin/liveness` endpoint returning per-subsystem
last_tick_at + seconds_ago. Operators can poll this and alert on
any subsystem whose seconds_ago exceeds 2x its cron/tick interval.
5. Unit tests for RunWithRecover (clean return no restart; panic
restarts with backoff; ctx cancel stops restart loop) and for the
liveness registry.
Net new code: ~160 lines + ~100 lines of tests. The main.go refactor is
~10 changed lines. No behavior change on the happy path; only what
happens on a panic changes.
Closes #92. Supersedes the local recover added to scheduler.go in
#90 (kept conceptually, but now via the shared helper).
A workspace that self-registers with a 127.0.0.x URL on first INSERT
could redirect A2A proxy traffic back to the platform itself (SSRF).
The previous fix only blocked 169.254.0.0/16 (cloud metadata).
Add 127.0.0.0/8 to validateAgentURL's blocklist. RFC-1918 private
ranges (10.x, 172.16.x, 192.168.x) remain allowed — Docker container
networking depends on them.
Safe because the provisioner writes 127.0.0.1 URLs via direct SQL
UPDATE, not through /registry/register, so the UPSERT CASE that
preserves provisioner URLs is unaffected. Local-dev agents can still
register using "localhost" by name (hostname, not IP literal).
Tests: removed "valid localhost http" case (now correctly rejected),
added "valid localhost name" + three loopback-block assertions.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The scheduler died silently on 2026-04-14 14:21 UTC and stayed dead for
12+ hours. Platform restart didn't recover it. Root cause: tick() and
fireSchedule() goroutines have no panic recovery. A single bad row, bad
cron expression, DB blip, or transient panic anywhere in the chain
permanently kills the scheduler goroutine — and the only signal to an
operator is "no crons firing", which is invisible if you're not watching.
Specifically:
func (s *Scheduler) Start(ctx context.Context) {
    for {
        select {
        case <-ticker.C:
            s.tick(ctx) // <- if this panics, the for-loop exits forever
        }
    }
}
And inside tick:
go func(s2 scheduleRow) {
    defer wg.Done()
    defer func() { <-sem }()
    s.fireSchedule(ctx, s2) // <- panic here propagates up wg.Wait()
}(sched)
Two `defer recover()` additions:
1. In Start's tick wrapper — a panic in tick() (DB scan, cron parse,
row processing) is logged and the next tick fires normally.
2. In each fireSchedule goroutine — a single bad workspace can't take
the rest of the batch down.
Plus a liveness watchdog:
- Scheduler now records `lastTickAt` after each successful tick.
- New methods `LastTickAt()` and `Healthy()` (true if last tick within
2× pollInterval = 60s).
- Initialised at Start so Healthy() returns true on a fresh process.
Endpoint plumbing for /admin/scheduler/health is a follow-up — needs
threading the scheduler instance through router.Setup(). Documented
on #85.
Closes the silent-outage failure mode of #85. The other proposed
fixes (force-kill on /restart hang, active_tasks watchdog) are
separate concerns tracked in #85's comments.
Pair to molecule-controlplane PR #8. Fly's proxy returns 502 if the
fly-replay state value contains '=', so the control plane now puts the
bare UUID in state= (no 'org-id=' prefix). TenantGuard now treats the
whole 'state=...' value as the org id.
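Under this contract the parsing reduces to splitting the header on semicolons and taking the bare value of the state= segment. A sketch (the helper name matches the one introduced below; handling of whitespace and malformed segments is an assumption):

```go
package main

import (
	"fmt"
	"strings"
)

// orgIDFromReplaySrc extracts the org id from a Fly-Replay-Src header,
// treating the whole state= value as the org id (no 'org-id=' prefix,
// per this fix — Fly's proxy 502s on '=' inside the state value).
// Returns "" when the segment is absent or the header is malformed.
func orgIDFromReplaySrc(header string) string {
	for _, seg := range strings.Split(header, ";") {
		k, v, ok := strings.Cut(strings.TrimSpace(seg), "=")
		if ok && k == "state" {
			return v
		}
	}
	return ""
}

func main() {
	h := "instance=e286de4f;state=8a7b6c5d-0000-0000-0000-000000000000"
	fmt.Println(orgIDFromReplaySrc(h)) // the bare UUID from state=
}
```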
Phase B.3 pair-fix to the control plane's fly-replay state change.
Background: the private molecule-controlplane's router emits
`fly-replay: app=X;instance=Y;state=org-id=<uuid>`. Fly's edge replays
the request to the tenant and injects `Fly-Replay-Src: instance=Z;...;
state=org-id=<uuid>` on the replayed request. But response headers from
the cp (like X-Molecule-Org-Id) never travel to the replayed tenant —
only the state= param does.
TenantGuard now checks both paths in order:
1. Primary: X-Molecule-Org-Id header (direct-access path, e.g. molecli)
2. Secondary: Fly-Replay-Src's `state=org-id=<uuid>` segment
(production fly-replay path)
Either matching configured MOLECULE_ORG_ID → allow. Neither matches →
404 (still don't leak tenant existence).
New helper orgIDFromReplaySrc parses the semicolon-separated
Fly-Replay-Src header per Fly's format. Covered by a table-driven test
with 7 cases including malformed + empty-header + wrong-state-key.
Tests: +3 new TestTenantGuard_* (FlyReplaySrc match, mismatch, table).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 32 foundation. The SaaS control plane (private molecule-controlplane
repo) provisions one platform instance per customer org on Fly Machines
and sets MOLECULE_ORG_ID=<uuid> on the machine. Its subdomain router
forwards requests with X-Molecule-Org-Id=<uuid>.
TenantGuard:
- When MOLECULE_ORG_ID is set → every non-allowlisted request must carry a
matching X-Molecule-Org-Id header. Mismatched/missing header → 404 (not
403 — don't leak tenant existence by letting probers distinguish "wrong
org" from "route doesn't exist").
- When unset → passthrough. Self-hosted / dev / CI behavior unchanged.
- Allowlist is exact-match, not prefix — /health and /metrics only.
No orgs table, no signup, no billing, no Fly provisioning in this repo —
all that lives in the private control plane. The public repo's SaaS
surface is exactly this one middleware.
6 tests covering: unset-is-passthrough, matching header, mismatched
header 404 (with empty body), missing header 404, allowlist bypass, and
allowlist-is-exact-match.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses code-review warnings on PR #76:
- Migration 022 now backfills pre-existing workspace_schedules rows to
source='template' before flipping NOT NULL + DEFAULT 'runtime'. Legacy
rows (all seeded via org/import historically) stay refreshable on
re-import. Down migration drops the CHECK constraint too.
- Extracted the import UPSERT into const orgImportScheduleSQL so the shape
test asserts against the const directly instead of file-scraping org.go.
Removed the os.ReadFile helper.
- scheduleResponse.Source gets json:",omitempty" so old clients that
predate the migration don't see an empty string they can't explain.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses code-review warnings on PR #75:
- renderCategoryRoutingYAML now builds yaml.Node + yaml.Marshal, escaping
YAML-reserved chars in role names correctly (was JSON-as-YAML, fragile on
unicode line separators).
- New appendYAMLBlock helper guarantees a newline boundary when concatenating
YAML fragments into config.yaml (category_routing + initial_prompt both
used to risk merging into the previous line).
- Fixed struct comment (replace-per-key, not UNION).
- Added TestCategoryRouting_EscapesYAMLSpecials and TestAppendYAMLBlock_NewlineGuard.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolves #24 per CEO direction.
DB is source of truth for workspace_schedules. POST /org/import becomes
idempotent — only touches rows it owns (source='template'); runtime-added
schedules (Canvas / API) are preserved across re-imports.
- Migration 022: adds source TEXT NOT NULL DEFAULT 'runtime' CHECK in
('template','runtime'); unique index on (workspace_id, name) so the
org/import upsert can use ON CONFLICT.
- org.go: schedule INSERT becomes
INSERT ... 'template' ON CONFLICT (workspace_id, name) DO UPDATE
SET ... WHERE workspace_schedules.source='template'.
Never DELETEs.
- schedules.go: runtime POST writes 'runtime' explicitly; List handler
surfaces the source field on the response so Canvas can render badges.
- 3 new unit tests assert source='runtime' default for runtime CRUD,
the SQL shape contract for org/import (additive + idempotent +
runtime-preserving + never-DELETE), and List response surface.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a category_routing block to org.yaml schema (defaults + per-workspace,
UNION semantics with per-key replace). The merged routing table is rendered
into each workspace's config.yaml at import time.
PM's system prompt loses the hardcoded security/ui/infra → role mapping
from PR #50; instead it reads category_routing from /configs/config.yaml
and delegates to whatever roles the org template lists for the incoming
audit-summary's category. Future org templates ship their own routing
without prompt churn.
Tests: 4 new TestCategoryRouting_* cases covering YAML parse, UNION+drop
semantics, deterministic config.yaml render, and empty-map handling.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Per-workspace `plugins:` now UNIONS with `defaults.plugins` instead of
replacing. A leading `!` or `-` on a per-workspace entry opts a default
out. Backward-compatible: re-listing defaults still dedupes to the same
list.
Refactored the inline REPLACE logic into a pure helper `mergePlugins`
in org.go so it's unit-testable. Five TestPlugins_* cases added.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After a workspace restart (HTTP /restart or programmatic RestartByID) and
re-registration, the platform sends a synthetic A2A message/send to the
workspace containing:
- restart timestamp
- previous session end timestamp + human duration
- env-var keys now available (keys only — never values)
The message is rendered in the format proposed in #19 and marked with
metadata.kind=restart_context so agents can detect and handle it
specifically if they choose.
Skip path: if the workspace doesn't re-register within 30s, log and drop.
The Restart HTTP response is unaffected by delivery success.
Layer 2 (user-defined restart_prompt via config.yaml / org.yaml) is
deferred — tracked as a separate follow-up issue.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Global secrets (e.g. CLAUDE_CODE_OAUTH_TOKEN) are injected as container env
vars at Start() time. Until now, rotating one only propagated to a workspace
on the next full restart-from-zero, which manual ops had to drive via a
`POST /workspaces/:id/restart` loop. Tier-3 Claude Code agents hit the
stale-token path first, surfacing as 401s inside the SDK.
Restart-time re-read of global_secrets + workspace_secrets was already
correct in `provisionWorkspaceOpts` — the missing piece was the trigger.
SetGlobal / DeleteGlobal now enqueue RestartByID for every non-paused,
non-removed, non-external workspace that does NOT shadow the key with a
workspace-level override. Matches the existing behaviour of workspace-scoped
`Set` / `Delete`.
Adds two sqlmock-backed tests exercising both branches.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolves #14. ApplyTierConfig now reads TIER{2,3,4}_MEMORY_MB and
TIER{2,3,4}_CPU_SHARES env vars, falling back to the compiled defaults
agreed in the issue:
- T2: 512 MiB / 1024 shares (1 CPU) — unchanged baseline
- T3: 2048 MiB / 2048 shares (2 CPU) — new cap (previously uncapped)
- T4: 4096 MiB / 4096 shares (4 CPU) — new cap (previously uncapped)
CPU_SHARES follows Docker's 1024 = 1 CPU convention; internally the
value is translated to NanoCPUs for a hard allocation so behaviour
remains deterministic across hosts. Malformed or non-positive env
values silently fall back to the default.
Behaviour change note: T3 and T4 previously had no explicit cap.
Operators who relied on unlimited can set very large TIERn_MEMORY_MB /
TIERn_CPU_SHARES values; a follow-up can add unset-means-unlimited
semantics if required.
Tests:
- TestGetTierMemoryMB_DefaultsMatchLegacy
- TestGetTierMemoryMB_EnvOverride (covers malformed + zero fallback)
- TestGetTierCPUShares_EnvOverride
- TestApplyTierConfig_T3_UsesEnvOverride (wiring)
- TestApplyTierConfig_T3_DefaultCap (documents the new cap)
Docs: .env.example section + CLAUDE.md platform env-vars list updated.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolves #12. The claude-code SDK stores conversations in
/root/.claude/sessions/ and Postgres tracks current_session_id, but the
container filesystem was recreated on every restart — next agent message
failed with "No conversation found with session ID: <uuid>".
Add a per-workspace named Docker volume (ws-<id>-claude-sessions) mounted
read-write at /root/.claude/sessions. Gated by runtime=claude-code so
other runtimes don't pay for a path they don't use. Volume is cleaned up
in RemoveVolume alongside the config volume.
Two opt-outs discard the volume before restart for a fresh session:
- env WORKSPACE_RESET_SESSION=1 on the container
- POST /workspaces/:id/restart?reset=true (or {"reset": true} body)
Plumbed via new ResetClaudeSession field on WorkspaceConfig +
provisionWorkspaceOpts helper so the flag stays request-scoped (not
persisted on CreateWorkspacePayload).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a gated admin endpoint that mints a fresh workspace bearer token on
demand, eliminating the register-race currently used by
test_comprehensive_e2e.sh (PR #5 follow-up).
- New handler admin_test_token.go: returns 404 unless MOLECULE_ENV != production
or MOLECULE_ENABLE_TEST_TOKENS=1. Hides route existence in prod (404 not 403).
- Mints via wsauth.IssueToken; logs at INFO without the token itself.
- Verifies workspace exists before minting (missing -> 404, never 500).
- Tests cover prod-hidden, enable-flag-overrides-prod, missing workspace,
and happy-path + token-validates round trip.
- tests/e2e/_lib.sh gains e2e_mint_test_token helper for downstream adoption.
- CLAUDE.md updated with route + env vars.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolves #17.
Part A: scripts/cleanup-rogue-workspaces.sh deletes workspaces whose id
or name starts with known test placeholder prefixes (aaaaaaaa-, etc.)
and force-removes the paired Docker container. Documented in
tests/README.md.
Part B: add a pre-flight check in provisionWorkspace() — when neither a
template path nor in-memory configFiles supplies config.yaml, probe the
existing named volume via a throwaway alpine container. If the volume
lacks config.yaml, mark the workspace status='failed' with a clear
last_sample_error instead of handing it to Docker's unless-stopped
restart policy (which otherwise loops forever on FileNotFoundError).
New pure helper provisioner.ValidateConfigSource + unit tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
C18 — Workspace URL hijacking (CRITICAL, CONFIRMED LIVE):
POST /registry/register now calls requireWorkspaceToken() before
persisting anything. If the workspace has any live auth tokens, the
caller must supply a valid Bearer token matching that workspace ID.
First registration (no tokens yet) passes through — token is issued
at end of this function (unchanged bootstrap contract). Mirrors the
same pattern already applied to /registry/heartbeat and
/registry/update-card. Attacker POC — overwriting Backend Engineer URL
to http://attacker.example.com:9999/steal — now returns 401.
C20 — Unauthenticated workspace deletion (CRITICAL, CONFIRMED LIVE):
DELETE /workspaces/:id moved from bare router into AdminAuth group.
Any valid workspace bearer token grants access (same fail-open
bootstrap contract as /settings/secrets). Mass-deletion attack chain
(C19 list → C20 delete all) requires auth for the DELETE step.
POST /workspaces (create) also moved to AdminAuth to prevent
unauthenticated workspace creation.
C19 (GET /workspaces topology exposure) deferred — canvas browser
has no bearer token; fix requires canvas service-token refactor.
Tests: 2 new registry tests — C18 bootstrap (no tokens, passes
through and issues token), C18 hijack blocked (has tokens, no
bearer → 401).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
POST /registry/register accepted any URL string and persisted it as
the workspace's A2A endpoint — an attacker could register a workspace
with url=http://169.254.169.254/latest/meta-data/ and have the platform
forward A2A traffic to the cloud metadata service.
Fix: validateAgentURL() helper rejects:
- empty URL
- non-http/https schemes (file://, ftp://, etc.)
- 169.254.0.0/16 link-local IPs (AWS/GCP/Azure IMDS endpoints)
Allows RFC-1918 private ranges (Docker networking uses 172.16-31.x.x).
Adds 12 unit tests covering valid Docker-internal URLs and all SSRF vectors.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three unauthenticated routes allowed arbitrary read/write/delete of all
global platform secrets (API keys, provider credentials) with zero auth:
- GET/PUT/POST /settings/secrets
- DELETE /settings/secrets/:key
- GET/POST/DELETE /admin/secrets (legacy aliases)
Fix: new AdminAuth middleware with same lazy-bootstrap contract as
WorkspaceAuth — fail-open when no tokens exist (fresh install / pre-Phase-30
upgrade), enforce once any workspace has a live token. Any valid workspace
bearer token grants access (platform-wide scope, no workspace binding needed).
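The decision logic of that contract can be sketched as a pure function. This is an illustration only — adminAuthAllows is a hypothetical name, and the two booleans stand in for the DB-backed wsauth.HasAnyLiveTokenGlobal / wsauth.ValidateAnyToken calls:

```go
package main

import "fmt"

// adminAuthAllows sketches the lazy-bootstrap contract: fail-open while no
// live token exists anywhere (fresh install / pre-Phase-30 upgrade), then
// require any valid workspace bearer once one does.
func adminAuthAllows(hasAnyLiveToken bool, bearer string, validateAny func(string) bool) bool {
	if !hasAnyLiveToken {
		return true // bootstrap passthrough
	}
	return bearer != "" && validateAny(bearer)
}

func main() {
	valid := func(t string) bool { return t == "tok-123" }
	fmt.Println(adminAuthAllows(false, "", valid))       // true  — bootstrap
	fmt.Println(adminAuthAllows(true, "", valid))        // false — 401
	fmt.Println(adminAuthAllows(true, "tok-123", valid)) // true
}
```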
Changes:
wsauth/tokens.go — HasAnyLiveTokenGlobal + ValidateAnyToken functions
wsauth/tokens_test.go — 5 new tests covering both new functions
middleware/wsauth_middleware.go — AdminAuth middleware
router/router.go — global secrets routes now registered under adminAuth group
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fix A — platform/internal/middleware/wsauth_middleware.go (NEW):
WorkspaceAuth() gin middleware enforces per-workspace bearer-token auth on
ALL /workspaces/:id/* sub-routes. Same lazy-bootstrap contract as
secrets.Values: workspaces with no live token are grandfathered through.
Blocks C2, C3, C4, C5, C7, C8, C9, C12, C13 simultaneously.
Fix A — platform/internal/router/router.go:
Reorganised route registration: bare CRUD (/workspaces, /workspaces/:id)
and /a2a remain on root router; all other /workspaces/:id/* sub-routes
moved into wsAuth = r.Group("/workspaces/:id", middleware.WorkspaceAuth(db.DB)).
CORS AllowHeaders updated to include Authorization so browser/agent callers
can send the bearer token cross-origin.
Fix B — workspace-template/heartbeat.py:
_check_delegations(): validate source_id == self.workspace_id before
accepting a delegation result. Attacker-crafted records with a foreign
source_id are silently skipped with a WARNING log (injection attempt).
trigger_msg no longer embeds raw response_preview text; references
delegation_id + status only — removes the prompt-injection vector.
Fix C — workspace-template/skill_loader/loader.py:
load_skill_tools(): before exec_module(), verify script is within
scripts_dir (path traversal guard) and temporarily scrub sensitive env
vars (CLAUDE_CODE_OAUTH_TOKEN, ANTHROPIC_API_KEY, OPENAI_API_KEY,
WORKSPACE_AUTH_TOKEN, GITHUB_TOKEN, GH_TOKEN) from os.environ; restore
in finally block. Defence-in-depth even if /plugins auth gate is bypassed.
Fix D — platform/internal/handlers/socket.go:
HandleConnect(): agent connections (X-Workspace-ID present) validated via
wsauth.HasAnyLiveToken + wsauth.ValidateToken before WebSocket upgrade.
Canvas clients (no X-Workspace-ID) remain unauthenticated.
Fix D — workspace-template/events.py:
PlatformEventSubscriber._connect(): include platform_auth bearer token in
WebSocket upgrade headers alongside X-Workspace-ID.
Fix E — workspace-template/executor_helpers.py:
recall_memories() and commit_memory() now pass platform_auth bearer token
in Authorization header so WorkspaceAuth middleware allows access.
Fix F — workspace-template/a2a_client.py:
send_a2a_message(): timeout=None → httpx.Timeout(connect=30, read=300,
write=30, pool=30). Resolves H2 flagged across 5 consecutive audits.
Tests: 149/149 Python tests pass (test_heartbeat + test_events updated to
assert new source_id validation behaviour and allow Authorization header).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Follow-up to the quality-fixes-pass2 code review.
## Go: direct unit tests for PR #5 extracted helpers (~47 new tests)
a2a_proxy_test.go:
- resolveAgentURL: cache hit, cache-miss DB hit, not-found, null-URL,
docker-rewrite guard
- dispatchA2A: build error, canvas timeout, agent timeout, success
- handleA2ADispatchError: context deadline, generic error, build error
- maybeMarkContainerDead: nil-provisioner, runtime=external short-circuits
- logA2AFailure, logA2ASuccess: activity_logs row content + status
delegation_test.go:
- bindDelegateRequest: valid / malformed / bad-UUID
- lookupIdempotentDelegation: no-key / no-match / failed-row-deleted / existing-pending
- insertDelegationRow: insertOK / insertHandledByIdempotent /
insertTrackingUnavailable
- insertDelegationOutcome: zero-value is insertOutcomeUnknown sentinel
discovery_test.go:
- discoverWorkspacePeer: online / not-found / access-denied + 2 edges
- writeExternalWorkspaceURL: 3 cases
- discoverHostPeer: smoke test documents the unreachable-by-design path
activity_test.go:
- parseSessionSearchParams: defaults + custom limit/offset/q
- buildSessionSearchQuery: no-filters + with-query shapes
- scanSessionSearchRows: empty / single / multiple rows
Package coverage: 56.1% → 57.6%. Every helper extracted in PR #5 is
now at or near 100% line coverage (see PR notes for the 4 remaining
gaps, all blocked on provisioner interface mockability).
## Defensive enum zero-value fix
insertDelegationOutcome now starts with insertOutcomeUnknown=0 as a
sentinel so an un-initialized variable can't silently read as
"success". insertOK, insertHandledByIdempotent, insertTrackingUnavailable
shift to 1/2/3. No caller changes needed.
## Canvas: ConfirmDialog.singleButton test (5 cases)
canvas/src/components/__tests__/ConfirmDialog.test.tsx covers:
- default render (both buttons)
- singleButton hides Cancel
- singleButton: Escape still fires onCancel
- singleButton: backdrop-click still fires onCancel
- singleButton: onConfirm fires on click
vitest total: 352 → 357, all passing.
## Docstring clarity
ConfirmDialog.tsx: expanded singleButton prop comment to explicitly
instruct callers to pass the same handler for onConfirm/onCancel when
using it as an info toast (matches TemplatePalette usage).
## ErrorBoundary clipboard observability
.catch(() => {}) silently swallowed rejections. Now:
.catch((e) => console.warn("clipboard write failed:", e))
so permission-denied / insecure-context failures surface in the console.
## Verification
- go build ./... clean
- go vet ./... clean
- go test -race ./internal/... — all pass
- canvas npm run build — clean
- canvas npm test -- --run — 357/357 pass
- tests/e2e/test_api.sh — 46/62 pass; all 16 failures are pre-existing
(token-auth enforcement + stale test workspaces + missing Docker
network). None involve handlers touched in PR #5.
- Manual: platform + canvas running locally, title=Molecule AI,
/workspaces returns [], /health returns ok. Identified + killed a
stale Next.js server from the old Starfire-AgentTeam repo that was
serving the old brand on IPv4 port 3000.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Post-review fixes on top of the quality-pass-2 branch.
1. delegation.go: replaced insertDelegationRow's (bool, bool) return
with a typed insertDelegationOutcome enum (insertOK /
insertHandledByIdempotent / insertTrackingUnavailable). Eliminates
the positional-boolean decoding the caller had to do. Internal, no
behavior change.
2. ConfirmDialog.tsx: added singleButton prop. When true, hides the
Cancel button for single-action info toasts (Esc still dismisses
via onCancel). TemplatePalette's import notice uses it.
3. ErrorBoundary.tsx: fixed the floating clipboard promise. Added
.catch(() => {}) so a rejected writeText (permission denied,
insecure context) doesn't surface as unhandled rejection.
4. a2a_proxy_test.go: added 5 direct unit tests for
normalizeA2APayload (invalid JSON, wraps-bare, preserves-existing-
id, preserves-existing-messageId, missing-method). Fills the unit-
test gap for the helper extracted in the last pass.
Verification:
- go test -race ./internal/handlers/... passes (incl. 5 new tests)
- go build ./... clean
- canvas npm run build clean
- canvas npm test -- --run -> 352/352
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Delete empty platform/plugins/ (dead remnant; plugins/ at repo root is
the real registry; router.go comment updated)
- Gitignore local dev cruft: platform/workspace-configs-templates/,
.agents/ (codex/gemini skill cache), backups/
- Untrack .agents/skills/ (keep local, stop tracking)
- Move examples/remote-agent/ → sdk/python/examples/remote-agent/
(co-locate with the SDK it exercises); update refs in
molecule_agent README + __init__ + PLAN.md + the demo's own README
- Move docs/superpowers/plans/ → plugins/superpowers/plans/
(plans were written by the superpowers plugin's writing-plans
subskill; belong with the plugin, not under docs)
- Add tests/README.md explaining the unit-tests-per-package +
root-E2E split so new contributors don't ask
- Add docs/README.md explaining why site tooling lives under docs/
rather than a separate docs-site/ (VitePress ergonomics)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>