No env vars to configure. The platform auto-detects the backend:
MOLECULE_ORG_ID set → SaaS tenant → control plane provisioner
MOLECULE_ORG_ID empty → self-hosted → Docker provisioner
The control plane URL defaults to https://api.moleculesai.app (override
with CP_PROVISION_URL for testing). No FLY_API_TOKEN on the tenant.
Removed: direct Fly provisioner (FlyProvisioner) — all SaaS workspace
provisioning goes through the control plane which holds the Fly token
and manages billing, quotas, and cleanup.
Two backends: CPProvisioner (SaaS) and Docker Provisioner (self-hosted).
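The detection rule above can be sketched roughly as follows. This is a minimal illustration, not the platform's code: `Backend`, `selectBackend`, and `controlPlaneURL` are hypothetical names, and only the env var names and the default URL come from this message.

```go
package main

import "strings"

// Backend is an illustrative type; the message only names the two provisioners.
type Backend string

const (
	BackendControlPlane Backend = "control-plane" // SaaS tenant (CPProvisioner)
	BackendDocker       Backend = "docker"        // self-hosted (Docker provisioner)
)

const defaultControlPlaneURL = "https://api.moleculesai.app"

// selectBackend picks the provisioner from the value of MOLECULE_ORG_ID.
// Treating a whitespace-only value as empty is an assumption of this sketch.
func selectBackend(orgID string) Backend {
	if strings.TrimSpace(orgID) != "" {
		return BackendControlPlane
	}
	return BackendDocker
}

// controlPlaneURL returns the CP_PROVISION_URL override when set,
// otherwise the documented default.
func controlPlaneURL(override string) string {
	if override != "" {
		return override
	}
	return defaultControlPlaneURL
}
```

A caller would pass `os.Getenv("MOLECULE_ORG_ID")` and `os.Getenv("CP_PROVISION_URL")` once at boot.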
Closes #494
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PATCH /workspaces/:id field-level auth for parent_id/tier/runtime
required a bearer token, blocking canvas nesting (drag-to-nest).
Added IsSameOriginCanvas check so the tenant canvas can update
sensitive fields without a bearer.
Exported IsSameOriginCanvas from middleware package so workspace.go
can call it for the field-level auth path.
DELETE /workspaces/:id is behind AdminAuth which already has the
same-origin check — if delete still fails, it's a different issue.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When CONTAINER_BACKEND=flyio, workspaces are provisioned as Fly Machines
instead of local Docker containers. This enables workspace deployment
on SaaS tenants where no Docker daemon is available.
New files:
- provisioner/fly_provisioner.go: FlyProvisioner with Start/Stop/
IsRunning/Restart/Close via Fly Machines API (api.machines.dev/v1)
- FlyRuntimeImages maps runtimes to GHCR image tags
Changes:
- main.go: select Docker vs Fly based on CONTAINER_BACKEND env var
- workspace.go: SetFlyProvisioner() setter, Create checks flyProv first
- workspace_provision.go: provisionWorkspaceFly() loads secrets, calls
FlyProvisioner.Start, issues auth token for the new machine
Env vars for Fly backend:
- CONTAINER_BACKEND=flyio (activates Fly provisioner)
- FLY_API_TOKEN (Fly deploy token)
- FLY_WORKSPACE_APP (Fly app name for workspace machines)
- FLY_REGION (default: ord)
Closes #494
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Settings panel: wire TokensTab into "API Tokens" tab (was imported
but not rendered). Rename "API Keys" → "Secrets", add "API Tokens"
tab. Fix docs link → doc.moleculesai.app/docs/tokens.
2. Referer match hardening: require an exact match or a match followed
by "/", so a host that merely starts with the canvas URL (an
evil.com look-alike subdomain) no longer passes. Cache CANVAS_PROXY_URL
at init time instead of calling os.Getenv per request.
3. Extract shared deriveWsBaseUrl() to lib/ws-url.ts — eliminates
duplicate 12-line derivation in socket.ts and TerminalTab.tsx.
4. Token list pagination: add ?limit= and ?offset= params (default
50, max 200) to GET /workspaces/:id/tokens.
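The Referer hardening in item 2 could look roughly like the sketch below. The function name and signature are hypothetical, and `canvasURL` stands in for the cached CANVAS_PROXY_URL value.

```go
package main

import "strings"

// refererMatchesCanvas reports whether referer is exactly canvasURL or a
// path beneath it. Requiring the "/" boundary rejects look-alike hosts such
// as "https://canvas.example.com.evil.com".
func refererMatchesCanvas(referer, canvasURL string) bool {
	base := strings.TrimRight(canvasURL, "/")
	if base == "" {
		return false // no configured canvas origin: never match
	}
	return referer == base || strings.HasPrefix(referer, base+"/")
}
```

A plain `strings.HasPrefix(referer, canvasURL)` without the boundary check would accept the look-alike host, which is the bypass the fix closes.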
507/507 canvas tests pass, Go build + vet clean.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two adjacent fixes that surfaced trying to bring the molecule-dev org
template back up against the new standalone workspace-template-* repos.
1) handlers/org.go — expand ${VAR} in workspace_dir before validation.
The molecule-dev pm/workspace.yaml (and any operator's per-host
binding) ships `workspace_dir: ${WORKSPACE_DIR}` so each operator
can pick the host path PM bind-mounts. Without expansion the literal
"${WORKSPACE_DIR}" string reaches validateWorkspaceDir and fails with
"must be an absolute path", aborting the whole org import.
Other fields (channel config, prompts) already go through expandWithEnv;
workspace_dir was the last hold-out.
2) provisioner/provisioner.go — inject PYTHONPATH=/app for every
workspace container. Standalone template Dockerfiles COPY adapter.py
to /app and set ENV ADAPTER_MODULE=adapter, but molecule-runtime is
a pip console_script entry point so cwd isn't on sys.path
automatically. Setting PYTHONPATH here fixes every adapter image at
once instead of needing 8 PRs against template repos. Operator
override still wins (workspace EnvVars are appended after, so Docker
takes the later duplicate).
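A minimal sketch of the ordering in item 2, assuming a simplified `buildContainerEnv` signature (the real one is not shown here). `effectiveValue` is a hypothetical helper that only mimics the last-duplicate-wins behavior this message attributes to Docker.

```go
package main

import "strings"

// buildContainerEnv puts the platform default first and the workspace's own
// env vars after it, so an operator-supplied PYTHONPATH appears later.
func buildContainerEnv(workspaceEnv []string) []string {
	env := []string{"PYTHONPATH=/app"}
	return append(env, workspaceEnv...)
}

// effectiveValue resolves key with last-duplicate-wins semantics. It exists
// only to make the ordering visible in this sketch.
func effectiveValue(env []string, key string) (string, bool) {
	value, found := "", false
	for _, kv := range env {
		k, v, ok := strings.Cut(kv, "=")
		if ok && k == key {
			value, found = v, true
		}
	}
	return value, found
}
```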
Note: this unblocks the import path but does NOT make claude-code /
hermes / etc. boot. The runtime itself has a separate top-level
`from adapters import` that breaks against modular templates —
tracked at workspace-runtime#1.
Tests: TestBuildContainerEnv_InjectsPYTHONPATH +
TestBuildContainerEnv_WorkspaceEnvVarsCanOverridePYTHONPATH lock the
default + operator-override invariants. expandWithEnv is already covered
by TestExpandWithEnv_* — the workspace_dir use site is a one-line call
to that primitive.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add `provisionhook.EnvMutator` extension point so out-of-tree plugins
(e.g. github-app-auth, vault-secrets) can inject or override env vars
right before container Start, without forking core or piling more
provider-specific code into the handlers package.
WorkspaceHandler gains an optional `envMutators *provisionhook.Registry`
wired in via SetEnvMutators during boot. The hook fires after built-in
secret loads + per-agent git identity, so plugins can both read what's
already there and override anything they own (GIT_AUTHOR_*, GITHUB_TOKEN).
A nil registry is a no-op via Registry.Run's nil-receiver branch — keeps
the hot path a single nil compare and means existing flows stay green
even with zero plugins registered.
Mutator failure aborts provisioning and marks the workspace failed with
the wrapped error in last_sample_error. Failing fast surfaces the cause
to the operator instead of letting an agent boot into opaque "git push
401" loops it can never recover from on its own.
Tests cover ordered execution, chained env visibility, first-error abort,
nil-receiver no-op, nil-mutator drop, registration order, and concurrent
register-vs-run safety (-race clean).
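The registry behavior described above might be sketched like this. The interface shape is a guess (the real `provisionhook.EnvMutator` may take a context and other arguments), and `funcMutator` is a hypothetical adapter added only so the example is easy to exercise.

```go
package main

import (
	"fmt"
	"sync"
)

// EnvMutator is an assumed shape for the extension point.
type EnvMutator interface {
	Name() string
	MutateEnv(workspaceID string, env map[string]string) error
}

// Registry holds mutators in registration order.
type Registry struct {
	mu       sync.RWMutex
	mutators []EnvMutator
}

// Register appends m and silently drops nil mutators.
func (r *Registry) Register(m EnvMutator) {
	if m == nil {
		return
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	r.mutators = append(r.mutators, m)
}

// Run applies mutators in registration order and stops at the first error.
// A nil *Registry is a no-op, so callers need no plugin-presence check.
func (r *Registry) Run(workspaceID string, env map[string]string) error {
	if r == nil {
		return nil
	}
	r.mu.RLock()
	snapshot := append([]EnvMutator(nil), r.mutators...) // copy so Register can proceed
	r.mu.RUnlock()
	for _, m := range snapshot {
		if err := m.MutateEnv(workspaceID, env); err != nil {
			return fmt.Errorf("env mutator %q: %w", m.Name(), err)
		}
	}
	return nil
}

// funcMutator adapts a plain function to EnvMutator (illustration only).
type funcMutator struct {
	name string
	fn   func(env map[string]string) error
}

func (f funcMutator) Name() string { return f.name }

func (f funcMutator) MutateEnv(_ string, env map[string]string) error { return f.fn(env) }
```

Because each mutator receives the same map, a later plugin sees what earlier ones wrote and can override it.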
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes #460, #461.
**#460 — YAML injection via unquoted skill/prompt filenames**
`generateDefaultConfig` extracted skill directory names and prompt file
names from user-supplied `body.Files` keys and wrote them directly into
YAML list items without quoting:
cfg.WriteString(" - " + s + "\n")
`validateRelPath` only blocks path traversal (`../`); it does NOT block
YAML control characters including newlines. On Linux, filenames can
contain newlines, so an attacker with any live workspace bearer token
could submit:
{"files": {"skills/legit\nruntime: malicious/SKILL.md": "# skill"}}
The generated config.yaml would then contain `runtime: malicious` as a
top-level YAML key, overriding the runtime for workspaces provisioned
from the template.
Fix: extract `yamlEscape` as a reusable local from the same
`strings.NewReplacer` already used for the `name` field (#221) and apply
it to both the `skills:` and `prompt_files:` list items, wrapping each
in double-quotes.
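A sketch of the quoting fix, with assumptions flagged: the exact replacement set in the platform's `strings.NewReplacer` is not shown in this message, so the pairs below are a plausible guess, and `yamlListItem` is a hypothetical helper name.

```go
package main

import "strings"

// yamlEscaper escapes characters that would end or alter a double-quoted
// YAML scalar. The replacement set is an assumption of this sketch.
var yamlEscaper = strings.NewReplacer(
	`\`, `\\`,
	`"`, `\"`,
	"\n", `\n`,
	"\r", `\r`,
	"\t", `\t`,
)

// yamlListItem renders s as a single double-quoted YAML list item, so an
// embedded newline can no longer start a new top-level key.
func yamlListItem(s string) string {
	return `  - "` + yamlEscaper.Replace(s) + `"` + "\n"
}
```

With the payload from #460, the injected `runtime: malicious` text stays inside the quoted scalar on one line instead of becoming its own key.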
**#461 — Docker error details in ReplaceFiles 500 responses**
`ReplaceFiles` returned `fmt.Sprintf("failed to write files: %v", err)`
in two 500 paths, where `err` comes from Docker API calls and may include
internal container names, volume names, and daemon error messages.
Fix: log the full error server-side and return a static opaque string to
the caller.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Container rebuild or volume wipe caused workspaces to lose /configs/.auth_token.
On re-registration the platform returned no auth_token (HasAnyLiveToken==true →
no re-issue), leaving the workspace unable to authenticate any subsequent API call.
Fix: provisionWorkspaceOpts now calls issueAndInjectToken before Start(). This
revokes any existing live tokens (plaintext is irrecoverable from the stored hash,
so rotation is the only safe path) and issues a fresh token that is written into
cfg.ConfigFiles[".auth_token"]. WriteFilesToContainer delivers it to /configs
immediately after ContainerStart, racing safely ahead of the Python adapter's
1-2s startup time.
Failure modes are soft: revoke or issue errors skip injection with a warning;
provisioning continues and the workspace recovers on the next restart.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add an optional channel_budget (INTEGER, nullable) to workspace_channels
via migration 024. When channel_budget IS NOT NULL and message_count has
reached the budget, the Send handler returns 429 {"error":"channel budget
exceeded"} and aborts before calling SendOutbound.
Implementation details:
- Single SELECT query reads both message_count and channel_budget in one
round-trip, so the check never compares values from two separate reads
- Fail-open on DB error: transient failures log but don't block sends
- Early-return on budget hit is before SendOutbound so message_count
cannot be incremented past the limit by a concurrent send that slips
through the window (best-effort; atomic enforcement requires DB-level CAS)
- NULL channel_budget = unlimited (default, backward-compatible)
Migration is idempotent (ADD COLUMN IF NOT EXISTS). Down migration drops
the column cleanly.
Four sqlmock tests cover: at-limit → 429, above-limit → 429, NULL budget
passes through, under-limit passes through.
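The budget decision itself reduces to a small predicate. The sketch below is illustrative only: `budgetExceeded` is a hypothetical name, and a nil pointer stands in for a NULL channel_budget.

```go
package main

// budgetExceeded reports whether the Send handler should answer 429.
// A nil budget models NULL channel_budget, which means unlimited.
func budgetExceeded(messageCount int64, budget *int64) bool {
	if budget == nil {
		return false
	}
	return messageCount >= *budget
}
```

The four cases listed above map directly onto this predicate: at-limit and above-limit return true, NULL and under-limit return false.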
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduce `memoryRecallMaxLimit = 50` constant and honour the `?limit=N`
query parameter in Search. Values above 50 are silently clamped to 50;
absent or invalid values default to 50. The LIMIT clause is now a
parameterised argument (nextArg pattern) instead of a hardcoded literal.
Three sqlmock tests verify the cap, the explicit limit, and the default.
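A minimal sketch of the clamping rule, assuming a helper named `parseRecallLimit` (hypothetical). Treating zero and negative values as invalid is also an assumption; the message only says invalid values default to 50.

```go
package main

import "strconv"

const memoryRecallMaxLimit = 50

// parseRecallLimit maps the raw ?limit= value to the effective LIMIT.
// Absent, non-numeric, or non-positive values fall back to the maximum,
// and anything above the maximum is clamped to it.
func parseRecallLimit(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return memoryRecallMaxLimit
	}
	if n > memoryRecallMaxLimit {
		return memoryRecallMaxLimit
	}
	return n
}
```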
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The Register handler was serialising the raw Go error into the HTTP response:
c.JSON(500, gin.H{"error": fmt.Sprintf("failed to register: %v", err)})
PostgreSQL errors wrapped by lib/pq contain table names, constraint names, and
driver-version strings — enough for a caller to fingerprint the schema and craft
targeted attacks. The error is already logged at full detail with Printf before
this line, so callers only need the generic message.
Fix: replace the Sprintf with a static "registration failed" string (same pattern
the heartbeat and update-card handlers already used).
New test: TestRegister_DBErrorResponseIsOpaque verifies the response body is the
opaque string and that "sql:", "pq:", and "connection" substrings are absent.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a workspace container AND its /configs Docker volume are both destroyed,
the restart handler previously had no recovery path — findTemplateByName searched
only the top-level configsDir, which holds workspace-instance dirs (ws-{id[:12]}/),
not the role-named org-template source directories.
Fix: add `rebuild_config: true` to the POST /workspaces/:id/restart body struct.
When set, the handler falls back to searching configsDir/org-templates/ via the
existing findTemplateByName logic (which already handles name normalisation and
config.yaml name-field matching). The workspace can then self-recover with its own
bearer token — no admin intervention required.
New helper: resolveOrgTemplate(configsDir, wsName) — pure function, independently
tested (4 cases: hit-by-dir, hit-by-config-yaml, no org-templates dir, no match).
Usage:
curl -X POST -H "Authorization: Bearer $(cat /configs/.auth_token)" \
-d '{"rebuild_config": true}' \
http://platform:8080/workspaces/$WORKSPACE_ID/restart
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Part 4 of 4 — terminal step of the org.yaml scalability refactor. Each
role in the molecule-dev template now owns its own workspace.yaml file,
colocated with the existing system-prompt.md / initial-prompt.md /
idle-prompt.md / schedules/*.md. Team files shrink to a leader's own
definition plus a list of !include refs.
## Platform change
`resolveYAMLIncludes` now uses a TWO-ROOT model:
- Path resolution is relative to the INCLUDING file's directory
(natural sibling + cousin refs, C-include / Sass @import convention).
- Security bound is the ORIGINAL org root (`rootDir`), preserved across
all recursion depths. Sibling-dir refs like `../my-role/workspace.yaml`
from a team file are now allowed (they stay inside the org template);
refs that escape the root still error.
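The two-root model can be sketched as below. This is not the platform's `resolveInsideRoot`; `resolveIncludePath` is a hypothetical stand-in that works on cleaned paths only and does not resolve symlinks.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// resolveIncludePath resolves ref relative to the directory of
// includingFile, then rejects the result unless it stays inside rootDir.
func resolveIncludePath(rootDir, includingFile, ref string) (string, error) {
	if ref == "" {
		return "", fmt.Errorf("empty include path")
	}
	// Join cleans the result, collapsing any ".." segments.
	candidate := filepath.Join(filepath.Dir(includingFile), ref)
	rel, err := filepath.Rel(rootDir, candidate)
	if err != nil {
		return "", fmt.Errorf("include %q: %w", ref, err)
	}
	if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("include %q escapes the org root", ref)
	}
	return candidate, nil
}
```

The root stays fixed across recursion while `includingFile` changes at each level, which is what lets `../my-role/workspace.yaml` work from a team file without opening a path out of the template.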
Regression coverage: new `TestResolveYAMLIncludes_SiblingDirAccess`
reproduces the Phase 4 pattern (team file at `teams/x.yaml` referencing
`../<role>/workspace.yaml`) — fails without the fix, passes with.
## Template change
Atomized 15 child workspaces across 3 team files:
- `teams/research.yaml`: 58 → 30 lines; 3 children now !include refs
- `teams/dev.yaml`: 222 → 38 lines; 6 children now !include refs
- `teams/marketing.yaml`: 143 → 28 lines; 6 children now !include refs
Each role now has `<role>/workspace.yaml` colocated with its prompts.
Example `frontend-engineer/` directory:
frontend-engineer/
├── workspace.yaml (24 lines — name/role/tier/canvas/plugins/...)
├── system-prompt.md (from earlier phases)
├── initial-prompt.md
├── idle-prompt.md
└── (no schedules for this role — but if added, schedules/<slug>.md)
## File-size progression across all 4 phases
| State | org.yaml | total `.yaml` in tree |
|---|---:|---:|
| Before (main) | 1801 lines / 108 KB | 1801 / 108 KB (one file) |
| After Phase 1 (#389) | 1687 | 1687 / 101 KB |
| After Phase 2 (#390) | 676 | 676 / 35 KB |
| After Phase 3 (#393) | 114 | 683 (1 + 6 teams) / 33 KB |
| **After this PR** | **114** | **~698** (1 + 6 + 15 workspace) / 35 KB |
Aggregate size is flat — the decrease came from prompt externalization
in Phases 1/2; Phases 3/4 reorganize structure without adding content.
The win is readability and ownership:
- Every individual file fits on 1-2 screens.
- Adding a new role is now: create `<role>/` dir, add `workspace.yaml`
+ `system-prompt.md` + prompts, add ONE `!include` line to the team
file. No touching of aggregated mega-YAML.
- Team files can be reviewed + merged independently.
## Tests
All 10 `TestResolveYAMLIncludes_*` tests pass, including the real-template
integration test (`TestResolveYAMLIncludes_RealMoleculeDev`) which now
walks org.yaml → teams/pm.yaml → teams/research.yaml → ../market-analyst/
workspace.yaml and validates the full 21-role tree unmarshals cleanly.
Plus all existing `TestResolvePromptRef` + `TestOrgYAML` + `TestInitialPrompt`
suites stay green.
## Ops followup
After merging all 4 phases and deploying, the `POST /org/import`
endpoint should produce a workspace tree byte-identical to the
pre-refactor state. Verify with:
diff <(curl POST /org/import before) <(curl POST /org/import after)
or by spot-checking:
- `/configs/config.yaml` bodies across all 21 workspaces
- `workspace_schedules.prompt` row values
The externalization is lossless — YAML literal to file and back
recovers the same string modulo trailing-whitespace normalization.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes the silent-overwrite hole where two agents racing a read-modify-
write on the same memory key left only one agent's update. Relevant for
orchestrators (PM, Dev Lead, Marketing Lead) keeping structured running
state (delegation-result ledgers, task queues) in memory, and for the
``research-backlog:*`` keys that multiple idle loops write in parallel.
## Semantics
### Back-compat path (no if_match_version)
Unchanged: ``INSERT ... ON CONFLICT UPDATE`` last-write-wins. Every
existing agent tool, every existing ``commit_memory`` call, every
existing cron that writes memory — all continue to work with no edit.
### Optimistic-lock path (if_match_version set)
1. Client calls ``GET /memory/:key`` → ``{value, version: V}``
2. Client modifies value locally
3. Client ``POST /memory {key, value, if_match_version: V}``
4. Server: ``UPDATE ... WHERE version = V`` + RETURNING new version
5. On match → 200 + ``{version: V+1}``
6. On mismatch → 409 + ``{expected_version: V, current_version: <actual>}``
7. Client reads the actual version and retries.
### Create-only marker
``if_match_version: 0`` means "create iff the key doesn't exist yet".
Two agents simultaneously seeding a shared key will see exactly one
success + one 409 — no silent collision, no duplicate-init work.
### Schema
Migration 023 adds ``version BIGINT NOT NULL DEFAULT 1``. Existing rows
baseline at 1. New rows start at 1. Every successful write (both paths)
increments: ``version = version + 1`` on update, ``1`` on insert.
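The semantics above can be modelled with a small in-memory store. This is a sketch under stated assumptions, not the SQL implementation: `Store`, `ConflictError`, and `Set` are hypothetical names, and reporting an absent key as current version 0 in the conflict is a choice made here.

```go
package main

import "sync"

type entry struct {
	value   string
	version int64
}

// ConflictError mirrors the fields of the 409 body described above.
type ConflictError struct {
	Expected, Current int64
}

func (e *ConflictError) Error() string { return "memory version conflict" }

// Store is an in-memory stand-in for the memory table.
type Store struct {
	mu   sync.Mutex
	rows map[string]entry
}

func NewStore() *Store { return &Store{rows: make(map[string]entry)} }

// Set writes value under key and returns the new version. A nil ifMatch is
// the last-write-wins upsert, 0 means create-only, and any other value must
// equal the stored version.
func (s *Store) Set(key, value string, ifMatch *int64) (int64, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	cur, exists := s.rows[key]
	if ifMatch != nil {
		current := int64(0) // absent key is reported as version 0 (assumption)
		if exists {
			current = cur.version
		}
		if *ifMatch != current {
			return 0, &ConflictError{Expected: *ifMatch, Current: current}
		}
	}
	next := int64(1) // new rows start at 1
	if exists {
		next = cur.version + 1
	}
	s.rows[key] = entry{value: value, version: next}
	return next, nil
}
```

On a conflict the caller re-reads the key, reapplies its change, and retries with the `Current` value.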
## Why version, not updated_at
``updated_at`` has second-granularity and can collide between concurrent
writers on a fast clock. A monotonic counter is collision-free and more
readable in the 409 response body ("expected 5, current is 7 — you
missed 2 writes" tells an agent exactly what to re-read).
## Why ``if_match_version`` and not an ETag header
JSON field keeps it in the request body, visible alongside the value
payload. Agents assembling requests programmatically don't have to
remember to thread a header through their HTTP client wrapper; the
existing ``commit_memory`` tool can grow one optional kwarg and match
the existing signature shape.
## Tests
11 memory-handler cases covering every path:
- GET list / get (with version in response shape)
- Set with no version (back-compat upsert, returns new version)
- Set with if_match_version match (happy path, increment)
- Set with if_match_version mismatch (409 + expected/current fields)
- Set with if_match_version=0 on absent key (create-only success)
- Set with if_match_version=N on absent key (409 — caller's mental
model is wrong)
- Bad inputs (missing key, malformed JSON)
- Delete happy + error path
Full ``go test ./internal/handlers/`` green.
## Follow-up (not in this PR)
- Workspace-template tool update: ``commit_memory(content, *,
if_match_version=None)`` surfaces the new option + on 409 surfaces
the current_version so agents can retry without manual re-read.
- Named checkpoints table (``workspace_checkpoints``) for durable
orchestrator state snapshots. Different concern than per-key locking;
separate PR.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause of the 2026-04-16 09:10 UTC six-container restart cascade.
## Timeline
09:10:26 — PM sent a batch delegation to 15+ agents (Dev Lead coordinating).
09:10:26-27 — 4 leaders/auditors (Security, RL, BE, DevOps) simultaneously
hit "workspace agent unreachable — container restart triggered"
even though their containers were running fine. Another 2
(DL, UIUX) tripped in the next few seconds.
09:10:27 — Provisioner stopped + recreated 6 containers in parallel. A2A
callers got EOFs, PM's batch coordination stalled.
## Root cause
`provisioner.IsRunning` collapsed every ContainerInspect error into
`(false, nil)`, including transient Docker daemon hiccups:
    func IsRunning(...) (bool, error) {
        info, err := p.cli.ContainerInspect(ctx, name)
        if err != nil {
            return false, nil // Container doesn't exist ← MISREAD
        }
        return info.State.Running, nil
    }
The comment said "Container doesn't exist" but the error was actually
any of: daemon timeout, socket EOF, context deadline, connection
refused. Under load (batch delegation fan-out → 15 concurrent HTTP
inbound → 15 concurrent Claude Code subprocesses → Docker daemon CPU
pressure), ContainerInspect calls started failing transiently. All 6
calls returned `(false, nil)`. Caller `maybeMarkContainerDead` treated
`running=false` as "container is dead, restart it" → six parallel
restarts. This was exactly the destructive-on-error pattern we keep
trying to kill (see #160 SDK-stderr-probe, #318 fail-open classes).
## Fix
`IsRunning` now distinguishes NotFound from transient errors:
- Legitimately missing container (caller deleted, Docker pruned) →
`(false, nil)` — safe to act on; caller marks dead + restarts.
- Any other error (daemon timeout, socket issue, context deadline) →
`(true, err)` — caller stays on the alive path. The transient error
is preserved so metrics + logging still see it, but it does NOT
trigger the destructive restart branch.
`isContainerNotFound` matches on error-message substring — same
approach docker/cli uses internally — to avoid pulling in errdefs as a
direct dep. Truth table tests in `isrunning_test.go` cover 8 cases:
NotFound variants (real + generic), nil, empty, and the 4 transient-
error shapes we've actually observed (deadline, EOF, connection-refused,
i/o timeout).
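The fixed decision logic might look roughly like this. The matched substrings are assumptions (the message only says matching is done on the error text), `classifyInspect` is a hypothetical helper that isolates the branch from the Docker client, and `inspectError` is a stand-in error type for illustration.

```go
package main

import "strings"

// inspectError is a stand-in for errors returned by the Docker client.
type inspectError string

func (e inspectError) Error() string { return string(e) }

// isContainerNotFound reports whether err looks like a missing-container
// response. The substrings below are assumptions of this sketch.
func isContainerNotFound(err error) bool {
	if err == nil {
		return false
	}
	msg := strings.ToLower(err.Error())
	return strings.Contains(msg, "no such container") || strings.Contains(msg, "not found")
}

// classifyInspect turns an inspect outcome into the (running, err) pair
// described in the Fix section.
func classifyInspect(running bool, inspectErr error) (bool, error) {
	switch {
	case inspectErr == nil:
		return running, nil
	case isContainerNotFound(inspectErr):
		return false, nil // really gone: safe to mark dead and restart
	default:
		return true, inspectErr // transient: stay on the alive path, keep the error
	}
}
```

The important property is the default branch: an unknown error never produces `running=false`, so the caller cannot reach the restart path by accident.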
## Caller update
`maybeMarkContainerDead` in a2a_proxy.go now logs the transient inspect
error (was silently discarded via `_`). Visibility without
destructiveness. If this error becomes persistent, we'll see it in
platform logs rather than diagnosing after another restart cascade.
## Expected impact
- Zero restart cascades from the current class of transient inspect
errors (EOF, timeout, connection refused).
- Dead containers still detected within the A2A layer because an actual
stopped container returns NotFound on inspect, and the TTL monitor
(180s post #386) catches anything that slips through.
- New visibility in platform logs when inspect has trouble — previously
silent.
Combined with the TTL fix in #386, the defense-in-depth on spurious
restart is now:
1. IsRunning only returns false for real NotFound
2. Liveness TTL is 180s, surviving 5+ missed heartbeats
3. A2A proxy 503-Busy path retries with backoff before touching
restart logic at all
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Part 3 of 4 in the scalability refactor. Adds YAML `!include` support
to the org importer and splits molecule-dev/org.yaml (676 lines post-
Phase 2) into 6 team / role files; top-level org.yaml drops to 114 lines
of pure scaffolding.
## Platform changes
New `platform/internal/handlers/org_include.go`:
- `resolveYAMLIncludes(data, baseDir)` — pre-processes a YAML document,
expanding any scalar tagged `!include <path>` with the parsed content
of the referenced file.
- Path resolution via `resolveInsideRoot` so a crafted `!include
../../etc/passwd` can't escape the org template directory (same
defense the existing `files_dir` copy uses).
- Nested includes supported: each included file carries its own search
root (its directory), so `teams/pm.yaml` with `!include research.yaml`
resolves to `teams/research.yaml` — matching the convention of
C-include / Sass @import / most package systems.
- Cycle detection via visited-set keyed on absolute path; belt-and-
braces `maxIncludeDepth = 16` cap in case symlinks or path
normalization defeats the set.
- Inline-template mode (POST /org/import with raw JSON body, no `dir`)
errors cleanly when a file ref is used — can't resolve without a
base.
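The cycle and depth guards can be illustrated without a YAML parser. In the sketch below, `includes` is a hypothetical map from a file's absolute path to the paths it references, standing in for real parsing. Unmarking a path when its expansion finishes (so the same file can be included from two places) is an assumption about the visited-set behavior.

```go
package main

import "fmt"

const maxIncludeDepth = 16

// expandIncludes walks the include graph depth-first and returns files in
// the order they are opened. inProgress holds the paths currently being
// expanded; depth is the current nesting level.
func expandIncludes(includes map[string][]string, path string, inProgress map[string]bool, depth int) ([]string, error) {
	if depth > maxIncludeDepth {
		return nil, fmt.Errorf("include depth exceeds %d at %s", maxIncludeDepth, path)
	}
	if inProgress[path] {
		return nil, fmt.Errorf("include cycle detected at %s", path)
	}
	inProgress[path] = true
	defer delete(inProgress, path) // unmark on the way out

	order := []string{path}
	for _, child := range includes[path] {
		sub, err := expandIncludes(includes, child, inProgress, depth+1)
		if err != nil {
			return nil, err
		}
		order = append(order, sub...)
	}
	return order, nil
}
```

The depth cap is redundant when the set works, which is the point: it still terminates the walk if two spellings of the same file slip past the path-keyed check.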
Wired into both `ListTemplates` (so /org/templates shows an accurate
workspace count after the split) and `Import` (expansion happens before
unmarshal into OrgTemplate).
## Template changes
molecule-dev/org.yaml now contains only:
- name + description
- defaults (runtime, plugins, category_routing, initial_prompt text)
- `workspaces: [!include teams/pm.yaml, !include teams/marketing.yaml]`
New files:
- `teams/pm.yaml` — PM top-level, children are !include refs
- `teams/research.yaml` — Research Lead + Market Analyst + Technical
Researcher + Competitive Intelligence (inline children)
- `teams/dev.yaml` — Dev Lead + FE/BE/DevOps/Security/QA/UIUX (inline)
- `teams/marketing.yaml` — Marketing Lead + DevRel/PMM/Content/
Community/SEO/Social (inline)
- `teams/documentation-specialist.yaml` — leaf
- `teams/triage-operator.yaml` — leaf
## File-size impact
| State | org.yaml lines | total config size |
|---|---:|---:|
| Before (main) | 1801 | 108 KB |
| After Phase 1 (#389) | 1687 | 101 KB |
| After Phase 2 (#390) | 676 | 35 KB |
| After this PR | **114** | **4 KB** (org.yaml only) |
With the 6 team files (total ~570 lines of structural yaml), every file
is now under 230 lines and individually readable without scrolling past
a single team's boundaries.
## Tests
`platform/internal/handlers/org_include_test.go` — 9 cases:
- Flat include (single file, single workspace)
- Nested include (file → file → file)
- Traversal rejection (`../secret.yaml`, `../../secret.yaml`)
- Cycle detection (a↔b)
- Empty path error
- Missing file error
- Inline-template error (baseDir empty)
- No-op when YAML has no includes (safety: we always run the preprocessor)
- **Integration**: load the real `org-templates/molecule-dev/org.yaml`,
resolve includes, unmarshal into OrgTemplate, verify PM + Marketing
Lead are top-level and PM has ≥4 children after expansion.
All 9 pass + existing `TestResolvePromptRef` + `TestOrgYAML` suites stay
green.
## Ownership implication
Each team file can now be owned + reviewed independently. When the
marketing team adds a 7th role, the diff is in `teams/marketing.yaml`
alone — no merge conflicts against PM or research changes in the same
review window. Same for the eventual engineer team, security team, etc.
## What's next
- **Phase 4 (queued):** per-workspace atomization. Each role gets
`<role>/workspace.yaml`; team files shrink to a list of !include
refs. Terminal step in the scalability arc — at that point adding a
new role is one new file under `org-templates/molecule-dev/<role>/`
plus one line in the team's manifest.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rebased cleanly onto current main (resolves the add/add conflicts that
blocked CI on PR #374 — the original branch diverged from a pre-repo-bootstrap
commit that predated most files).
Changes:
- schedules.go: add scheduleHealthResponse struct + Health handler
(mirrors A2A proxy auth pattern: X-Workspace-ID + CanCommunicate gate)
- router.go: register GET /workspaces/:id/schedules/health on r (not wsAuth)
so peer agents can query without holding the target workspace's bearer token
- schedules_test.go: 7 new tests (missing caller 401, self-call OK, legacy
peer grandfathered, non-peer 403, system caller bypass, no prompt exposure,
DB error 500)
isSystemCaller/validateCallerToken reused from a2a_proxy.go (same package).
registry.CanCommunicate import added to schedules.go.
Closes #249
Supersedes PR #374 (which could not get CI due to merge conflict)
Co-authored-by: PM (Molecule AI) <pm@molecule-ai.internal>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Every workspace now commits under its own name. Step 3 of the three-
step agent-separation plan (platform-level git identity today;
GitHub App migration follows as Option 1).
## Problem
All 20+ agents in the molecule-dev template (PM, Dev Lead, Research
Lead, FE, BE, DevOps, Security, QA, UIUX, Marketing roles, etc.) share
a single GITHUB_TOKEN — specifically the CEO's personal PAT. So every
commit, PR, and issue across the live repos ends up attributed to
HongmingWang-Rabbit. `git log` can't distinguish "which agent wrote
this code" from "did the CEO write it"; neither can the authority-
verification rule in triage-operator/philosophy.md (rule #3).
## Fix
When the provisioner starts a workspace container, it now sets:
GIT_AUTHOR_NAME = "Molecule AI <Workspace Name>"
GIT_AUTHOR_EMAIL = <slug>@agents.moleculesai.app
GIT_COMMITTER_NAME = (same)
GIT_COMMITTER_EMAIL = (same)
Git prefers these env vars over `git config user.name` / `user.email`,
so no per-container git-config step is needed; every commit automatically
carries the right authorship.
Examples (20 agents, 20 distinct identities):
Frontend Engineer → frontend-engineer@agents.moleculesai.app
Backend Engineer → backend-engineer@agents.moleculesai.app
Product Marketing Manager → product-marketing-manager@agents.moleculesai.app
UIUX Designer → uiux-designer@agents.moleculesai.app
Domain `agents.moleculesai.app` is deliberate: marks the email as a
bot address without resembling a real inbox.
## Operator override preserved
`applyAgentGitIdentity` runs AFTER the secret-load loops in
`provisionWorkspaceOpts`, but uses `setIfEmpty` so any workspace_secret
with the same key wins. Teams that want custom authorship (shared org
signing identity, a person-on-the-loop owner) can still set
`GIT_AUTHOR_NAME` via /workspaces/:id/secrets and get their value
through to git.
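A sketch of the injection and override rule, with simplified signatures. The function names `applyAgentGitIdentity` and `setIfEmpty` come from this message, but the bodies, the `slugify` rules, and the map-based env are assumptions made for illustration.

```go
package main

import "strings"

// slugify lowercases name and collapses every run of characters outside
// [a-z0-9] into one hyphen, dropping hyphens at both ends.
func slugify(name string) string {
	var b strings.Builder
	pendingHyphen := false
	for _, r := range strings.ToLower(name) {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') {
			if pendingHyphen && b.Len() > 0 {
				b.WriteByte('-')
			}
			pendingHyphen = false
			b.WriteRune(r)
		} else {
			pendingHyphen = true
		}
	}
	return b.String()
}

// setIfEmpty writes value only when key has no value yet, so anything
// loaded earlier (for example a workspace secret) wins.
func setIfEmpty(env map[string]string, key, value string) {
	if env[key] == "" {
		env[key] = value
	}
}

// applyAgentGitIdentity fills the four git identity variables from the
// workspace name. A nil map or a name with no usable characters is a no-op.
func applyAgentGitIdentity(env map[string]string, workspaceName string) {
	name := strings.TrimSpace(workspaceName)
	slug := slugify(name)
	if env == nil || slug == "" {
		return
	}
	author := "Molecule AI " + name
	email := slug + "@agents.moleculesai.app"
	setIfEmpty(env, "GIT_AUTHOR_NAME", author)
	setIfEmpty(env, "GIT_AUTHOR_EMAIL", email)
	setIfEmpty(env, "GIT_COMMITTER_NAME", author)
	setIfEmpty(env, "GIT_COMMITTER_EMAIL", email)
}
```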
## What this does NOT solve (yet)
- PR / issue authorship is still whoever owns GITHUB_TOKEN (the shared
PAT). That needs the GitHub App migration (Option 1, next PR). The
commit-level split shipped here is the prerequisite: the App path
will keep these env vars and just swap the PAT for a short-lived
installation token.
- Existing containers continue with their pre-fix env (git env vars
are baked in at container-create time). Applying is one plain
`POST /workspaces/:id/restart` per agent after this merges +
deploys — the restart goes through provisionWorkspace which picks
up the new injection.
## Tests
`agent_git_identity_test.go` — 4 behavior tests + a 10-row slug test:
- fills all 4 env vars from a workspace name
- operator override via pre-set env is preserved (setIfEmpty semantics)
- empty / whitespace workspace name is a no-op (no `unknown@...` emails)
- nil map doesn't panic (defensive)
- slugify handles spaces / punctuation / edge hyphens / em-dashes
All 15 cases pass; platform build clean.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Part 1 of 4 in the scalability refactor. Each role can now keep its
initial_prompt / idle_prompt / schedule prompts as sibling .md files
under files_dir/; inline YAML literals still work for backwards-compat.
## What changes
**Platform (org.go importer):**
- `OrgWorkspace` gains `InitialPromptFile`, `IdlePrompt`, `IdlePromptFile`,
`IdleIntervalSeconds`. The idle_* fields were previously dropped by the
org importer entirely — struct didn't declare them — which is why
engineer idle_prompts never propagated from org.yaml to live /configs
(I've been manually docker-cp'ing them in every maintenance cron).
- `OrgSchedule` gains `PromptFile`. Hourly/weekly cron prompts are the
largest bodies in org.yaml (1-5 KB each) and get resolved at import
time just like initial_prompt.
- `OrgDefaults` gains the same idle_* + *_file fields for org-wide fallback.
- New `resolvePromptRef(inline, fileRef, orgBaseDir, filesDir)` helper —
the single chokepoint for inline-vs-file resolution. Inline wins when
both are set. File refs route through `resolveInsideRoot` so a crafted
ref can't escape the org template directory (same traversal defense as
files_dir).
- `createWorkspaceTree` now injects idle_prompt + idle_interval_seconds
into the workspace's config.yaml (previously missing — that's the
second half of the idle-prompt propagation bug).
**Tests:**
- `org_prompt_ref_test.go` — 10 cases: inline-wins, file-read-when-empty,
both-empty, defaults-level resolution, inline-template mode errors,
traversal rejection (via file ref AND via files_dir), missing-file
errors, and YAML-unmarshal parsing for each new field.
**Proof migration:**
- Documentation Specialist (biggest role at 6.9 KB of prompts) moves from
inline YAML to `documentation-specialist/{initial-prompt.md,
schedules/daily-docs-sync.md, schedules/weekly-terminology-audit.md}`.
- org.yaml drops 1801 → 1687 lines (-6.3%) from just this one role.
## Why this matters
org.yaml is 108 KB of which 67 KB (62%) is prompt text. At the current
12-role template size that's already unreadable; the marketing + triage-
operator additions pushed it to 1801 lines. The 4-phase refactor aims:
- **Phase 1 (this PR):** platform support + 1 role proof.
- **Phase 2:** migrate remaining ~20 roles to file refs. Target: org.yaml
at ~600 lines of pure structural scaffolding.
- **Phase 3:** YAML `!include` preprocessor — split org.yaml into
teams/{research,dev,marketing,ops}.yaml shards.
- **Phase 4:** per-workspace atomization — each role gets its own
workspace.yaml manifest; org.yaml composes them.
## Backwards compatibility
- Inline `initial_prompt: |` / `prompt: |` / `idle_prompt: |` all still work.
- Missing `prompt_file` refs are logged loudly and the schedule is skipped
(not fatal), so bugs surface during deployment instead of being silently dropped.
- Inline-template mode (POST /org/import with raw JSON body, no `dir`)
errors cleanly when a file ref is used — can't resolve files without a
base dir, surface that rather than guessing.
## Test plan
- [x] `go build ./...` clean
- [x] `go test -run 'TestResolvePromptRef|TestOrgYAML' ./internal/handlers/`
— 10 tests pass
- [x] `python -c "yaml.safe_load(...)"` on the edited org.yaml — parses
- [ ] Post-merge: deploy platform rebuild, run `POST /org/import` against
a fresh workspace, verify Documentation Specialist's /configs/config.yaml
contains the initial_prompt body and workspace_schedules rows contain
the cron prompts (phantom-success check: grep the actual content, not
just the row count).
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The platform's GET /workspaces/:id/transcript proxy was constructing the
outbound request without an Authorization header. The workspace's /transcript
endpoint (hardened in #287/#328) fails-closed when the header is absent,
so every transcript call in production returned 401 from the workspace.
Fix: after WorkspaceAuth validates the incoming bearer token, the handler
now forwards it verbatim via req.Header.Set("Authorization", ...).
Forwarding is safe — the token has already been validated by the middleware.
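A minimal sketch of the forwarding step, assuming a hypothetical `proxyTranscript` helper and a stub workspace that fails closed without a bearer. The real handler's names, routing, and error handling are not shown in this message, so everything below is illustrative only.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// proxyTranscript is a hypothetical stand-in for the platform handler. It
// copies the caller's Authorization header (assumed to be validated already
// by middleware) onto the outbound request and relays the upstream status.
func proxyTranscript(client *http.Client, upstreamURL string, in *http.Request) (int, error) {
	req, err := http.NewRequestWithContext(in.Context(), http.MethodGet, upstreamURL, nil)
	if err != nil {
		return 0, err
	}
	// Forward only when present, so a missing header never becomes a
	// synthetic one on the upstream call.
	if auth := in.Header.Get("Authorization"); auth != "" {
		req.Header.Set("Authorization", auth)
	}
	resp, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	_, _ = io.Copy(io.Discard, resp.Body)
	return resp.StatusCode, nil
}

// statusFor starts a stub workspace that returns 401 without a bearer and
// reports the status the proxy relays for the given incoming header value.
func statusFor(authHeader string) int {
	stub := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") == "" {
			w.WriteHeader(http.StatusUnauthorized)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer stub.Close()

	in := httptest.NewRequest(http.MethodGet, "/workspaces/ws-1/transcript", nil)
	if authHeader != "" {
		in.Header.Set("Authorization", authHeader)
	}
	code, err := proxyTranscript(stub.Client(), stub.URL+"/transcript", in)
	if err != nil {
		return -1
	}
	return code
}

func main() {
	fmt.Println(statusFor("Bearer example-token"), statusFor(""))
}
```

The two calls in `main` mirror the two test cases listed below: a forwarded header reaches the stub, and a missing header lets the workspace 401 pass through unchanged.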
Tests:
- TestTranscript_ForwardsAuthHeader: was t.Skip'd as a bug marker; now
active. Verifies the Authorization header reaches the workspace stub.
- TestTranscript_NoAuthHeader_PassesThrough: new. Verifies that a missing
header produces no synthetic Authorization on the upstream call, and the
workspace 401 is faithfully relayed.
Identified by QA audit 2026-04-16.
Co-authored-by: QA Engineer <qa-engineer@molecule-ai.internal>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Severity LOW. The /webhooks/:type handler compared the Telegram
X-Telegram-Bot-Api-Secret-Token header against the decrypted
webhook_secret using Go's `!=` operator, which short-circuits on the
first mismatched byte. Under low-latency Docker-network conditions an
attacker could time response latency byte-by-byte and converge on the
real secret, then inject Telegram-formatted messages into any channel.
Fix: switch to crypto/subtle.ConstantTimeCompare, whose running time
depends only on the length of the inputs, not on where they first
differ (it returns 0 immediately when the lengths differ, so only the
length can leak). Same posture as the cdp-proxy token compare in
host-bridge (which already used timingSafeEqual).
Risk profile over the public internet is low (Telegram webhooks have
natural jitter that masks the signal), but the defensive pattern
matters for consistency across all secret comparisons.
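For illustration, the comparison pattern looks roughly like this. `secretsMatch` is a hypothetical helper name; the actual handler code is not reproduced in this message.

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// secretsMatch reports whether the presented header value equals the stored
// secret without revealing the position of the first mismatched byte.
// ConstantTimeCompare returns 1 on equality and 0 otherwise.
func secretsMatch(presented, stored string) bool {
	return subtle.ConstantTimeCompare([]byte(presented), []byte(stored)) == 1
}

func main() {
	fmt.Println(secretsMatch("s3cret", "s3cret")) // equal
	fmt.Println(secretsMatch("s3creX", "s3cret")) // same length, differs
	fmt.Println(secretsMatch("short", "s3cret"))  // length mismatch
}
```

Note that two empty strings compare as equal, so a caller would still want to reject an unset secret before reaching this check.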
Closes #337
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI 5/6 pass (E2E cancel = run-supersession pattern). Dev Lead review 04:21: ✅ Approved.
Fixes cross-tenant token exposure: PausePollersForToken is now scoped to the
requesting workspace_id via a SQL WHERE clause.
Closes #329.
`TranscriptHandler.Get` previously proxied `agent_card->>'url'` directly
to the outbound HTTP client with no validation. Since `agent_card` is
attacker-writable via /registry/register, a workspace-token holder
could point it at cloud metadata (169.254.169.254), link-local ranges,
or non-http schemes and pivot the platform container against internal
services (IMDS, Redis, Postgres, other containers on the Docker net).
Four required fixes per reviewer:
1. `validateWorkspaceURL(u *url.URL)` — runs before `httpClient.Do`:
- scheme must be http/https (rejects file://, gopher://, ftp://)
- cloud metadata hostname blocklist (GCP + Azure + plain "metadata")
- IMDS IP blocklist (169.254.169.254)
- IPv4/IPv6 link-local blocklist (169.254/16, fe80::/10, multicast)
- IPv6 unique-local fd00::/8 blocklist
- loopback + docker.internal still allowed for local dev
2. Query-param allowlist — `target.RawQuery = c.Request.URL.RawQuery`
forwarded everything verbatim, letting a caller smuggle params the
upstream transcript endpoint didn't intend to expose. Replaced with
an allowlist of `since` and `limit`.
3. Sanitized error string — `fmt.Sprintf("workspace unreachable: %v", err)`
leaked the actual internal host/IP via `net.OpError`. Now logs the
real error server-side and returns a plain "workspace unreachable"
to the caller.
4. 10 new regression test cases:
- `TestTranscript_Rejects{CloudMetadataIP,NonHTTPScheme,MetadataHostname,LinkLocalIPv6}`
exercise the handler end-to-end with each attack URL and assert
400 before the HTTP client fires.
- `TestValidateWorkspaceURL` table-drives the validator across
localhost/public/docker-internal (allowed) + IMDS/GCP/Azure/file/
gopher/link-local/multicast (rejected).
- `TestTranscript_ProxyPropagatesAllowlistedQueryParams` asserts
`secret=leak&cmd=rm` is stripped while `since=42&limit=7` pass
through.
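The validator and the query allowlist from fixes 1 and 2 could be sketched as follows. The hostname blocklist entries, error strings, and the `filterQuery` / `allowed` helper names are assumptions for illustration; only the rules themselves come from the list above. This sketch checks IP literals only and does not re-check addresses that a hostname resolves to.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/url"
	"strings"
)

// blockedHosts is an illustrative metadata-hostname blocklist; the real
// list is not shown in the commit message.
var blockedHosts = map[string]bool{
	"metadata":                 true,
	"metadata.google.internal": true,
}

// uniqueLocal is fd00::/8, the IPv6 unique-local range named above.
var _, uniqueLocal, _ = net.ParseCIDR("fd00::/8")

// validateWorkspaceURL rejects non-http(s) schemes, metadata hostnames, and
// link-local, multicast, or unique-local IP literals. Loopback stays allowed.
func validateWorkspaceURL(u *url.URL) error {
	if u.Scheme != "http" && u.Scheme != "https" {
		return errors.New("scheme must be http or https")
	}
	host := strings.ToLower(u.Hostname())
	if host == "" {
		return errors.New("missing host")
	}
	if blockedHosts[host] {
		return errors.New("metadata hostname blocked")
	}
	ip := net.ParseIP(host)
	if ip == nil {
		return nil // a hostname rather than an IP literal
	}
	// 169.254.169.254 falls inside 169.254/16 (link-local unicast).
	if ip.IsLinkLocalUnicast() || ip.IsLinkLocalMulticast() || ip.IsMulticast() || uniqueLocal.Contains(ip) {
		return errors.New("address range blocked")
	}
	return nil
}

// filterQuery keeps only the allowlisted params, since and limit.
func filterQuery(in url.Values) url.Values {
	out := url.Values{}
	for _, key := range []string{"since", "limit"} {
		if v := in.Get(key); v != "" {
			out.Set(key, v)
		}
	}
	return out
}

// allowed parses raw and reports whether the validator accepts it.
func allowed(raw string) bool {
	u, err := url.Parse(raw)
	return err == nil && validateWorkspaceURL(u) == nil
}

// filteredQuery parses a raw query string and returns the encoded allowlisted subset.
func filteredQuery(raw string) string {
	q, err := url.ParseQuery(raw)
	if err != nil {
		return ""
	}
	return filterQuery(q).Encode()
}

func main() {
	fmt.Println(allowed("http://localhost:8000"), allowed("http://169.254.169.254/latest/meta-data"))
	fmt.Println(filteredQuery("secret=leak&cmd=rm&since=42&limit=7"))
}
```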
Also fixed a pre-existing test bug: `seedWorkspace` was issuing a real
SQL Exec against sqlmock with no expectation set, so the prior test
helpers silently failed in CI. Replaced with `expectWorkspaceURLLookup`
which programs the mock correctly. All 11 tests now pass.
Closes #272
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes #N (issue to be filed)
Lets canvas / operators see live tool calls + AI thinking instead of
waiting for the high-level activity log to flush. Right now the only
way to "look over an agent's shoulder" is `docker exec ws-XXX cat
/home/agent/.claude/projects/.../<session>.jsonl`, which:
- doesn't work for remote workspaces (Phase 30 / Fly Machines)
- requires shell access on the host
- has no pagination
This PR adds:
1. `BaseAdapter.transcript_lines(since, limit)` — async hook returning
`{runtime, supported, lines, cursor, more, source}`. Default returns
`supported: false` so non-claude-code runtimes pass through gracefully.
2. `ClaudeCodeAdapter.transcript_lines` override — reads the most-
recently-modified `.jsonl` in `~/.claude/projects/<cwd>/`. Resolves
cwd the same way `ClaudeSDKExecutor._resolve_cwd()` does so the
project dir name matches what Claude Code actually writes to. Limit
capped at 1000 to prevent OOM.
3. Workspace HTTP route `GET /transcript` — Starlette handler added
alongside the A2A app. Trusts the internal Docker network (same
model as POST / for A2A); Phase 30 remote-workspace auth is a
follow-up.
4. Platform proxy `GET /workspaces/:id/transcript` — looks up the
workspace's URL, forwards GET, caps response at 1MB. Gated by
existing `WorkspaceAuth` middleware (same as /traces, /memories,
/delegations).
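One way to cap a proxied response body is `io.LimitReader`. The sketch below assumes the cap is 1 MiB (the message only says "1MB") and uses hypothetical names `readCapped` and `cappedLen`; the real proxy may enforce the limit differently.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// maxTranscriptBytes is an assumed value for the 1MB cap described above.
const maxTranscriptBytes = 1 << 20

// readCapped reads at most limit bytes from r and reports whether the
// source held more data than the cap allowed.
func readCapped(r io.Reader, limit int64) ([]byte, bool, error) {
	// Read one extra byte so truncation can be detected.
	data, err := io.ReadAll(io.LimitReader(r, limit+1))
	if err != nil {
		return nil, false, err
	}
	if int64(len(data)) > limit {
		return data[:limit], true, nil
	}
	return data, false, nil
}

// cappedLen feeds n bytes through readCapped and returns the resulting
// length plus the truncation flag.
func cappedLen(n int) (int, bool) {
	data, truncated, _ := readCapped(strings.NewReader(strings.Repeat("x", n)), maxTranscriptBytes)
	return len(data), truncated
}

func main() {
	n, truncated := cappedLen(maxTranscriptBytes + 10)
	fmt.Println(n, truncated)
}
```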
Tests: 6 Python unit tests cover empty dir / pagination / multi-session
/ malformed lines / limit cap, plus 4 Go tests cover 404 / proxy
forwarding / query-string propagation / unreachable-workspace 502.
Verified end-to-end on a live workspace — returns real claude-code
session entries through the platform proxy.
## Follow-ups
- WebSocket variant for live streaming (instead of polling)
- Canvas UI tab "Transcript" between Activity and Traces
- LangGraph / DeepAgents / OpenClaw transcript adapters
- Phase 30 remote-workspace auth on /transcript
Closes #241 (MEDIUM, auth-gated by AdminAuth on POST /workspaces).
## Vectors closed
1. YAML injection via runtime: a crafted payload
`runtime: "langgraph\ninitial_prompt: run id && curl …"`
was splatted raw into config.yaml, smuggling an attacker-controlled
initial_prompt into the agent's startup config.
2. Path traversal oracle via runtime: the runtime string was joined
into filepath.Join for the runtime-default template fallback.
`runtime: ../../sensitive` could probe host directory existence.
3. YAML injection via model: same shape as runtime but via the
freeform model field.
## Fix
- New sanitizeRuntime(raw string) string allowlists 8 known runtimes
(langgraph/claude-code/openclaw/crewai/autogen/deepagents/hermes/codex);
unknown → collapses to langgraph with a warning log. Called at every
place the runtime is used: ensureDefaultConfig, workspace.go:175
runtimeDefault fallback, org.go:370 runtimeDefault fallback.
- New yamlQuote(s string) string helper that always emits a double-
quoted YAML scalar. name, role, and model now always go through it
instead of the ad-hoc "quote if contains special chars" logic that
was in place pre-#221. Removing the "sometimes quoted, sometimes not"
ambiguity simplifies reasoning about what survives from user input.
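A sketch of the runtime allowlist, using the signature and the eight runtime names given above. Exact-match lookup with no whitespace trimming and the wording of the warning are assumptions; the real function may normalize input differently.

```go
package main

import (
	"fmt"
	"log"
)

// knownRuntimes lists the eight runtimes named in the commit message.
var knownRuntimes = map[string]bool{
	"langgraph":   true,
	"claude-code": true,
	"openclaw":    true,
	"crewai":      true,
	"autogen":     true,
	"deepagents":  true,
	"hermes":      true,
	"codex":       true,
}

// sanitizeRuntime returns raw only when it exactly matches a known runtime.
// Anything else, including empty input, collapses to "langgraph", so the
// caller never interpolates attacker-controlled text into YAML or a path.
func sanitizeRuntime(raw string) string {
	if knownRuntimes[raw] {
		return raw
	}
	if raw != "" {
		log.Printf("unknown runtime %q, falling back to langgraph", raw)
	}
	return "langgraph"
}

func main() {
	fmt.Println(sanitizeRuntime("claude-code"))
	fmt.Println(sanitizeRuntime("../../sensitive"))
	fmt.Println(sanitizeRuntime("langgraph\ninitial_prompt: run id"))
}
```

Because the return value is always one of eight constants, both the YAML-injection and the path-traversal vectors are closed by the same check.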
## Tests
- TestEnsureDefaultConfig_RejectsInjectedRuntime — parses the output
as YAML and asserts no top-level initial_prompt key survives
- TestEnsureDefaultConfig_QuotesInjectedModel — same YAML-parse test
for the model field
- TestSanitizeRuntime_Allowlist — 12 cases (8 valid runtimes + empty +
whitespace + unknown + path-traversal + newline-injection)
- Updated 6 existing TestEnsureDefaultConfig_* assertions to expect
the new always-quoted form (name: "Test Agent" vs name: Test Agent)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes #234 LOW. The security log I added in PR #228 (code-review
follow-up) echoed body.SourceID with %s, which preserves any \n / \r
that json.Unmarshal decoded from the attacker's JSON. An authenticated
workspace could have injected fake log entries by sending
source_id="evil\ntimestamp=FORGED level=INFO msg=fake".
Fix: use %q on both body_source_id and c.ClientIP(). %q emits a Go-quoted
string, which escapes all control characters so multi-line payloads stay
on a single log line. One-line fix.
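The effect of `%q` can be seen in a small sketch. The log message text and the `spoofLogLine` / `isSingleLine` helpers are hypothetical; only the use of `%q` on the two fields comes from the fix described above.

```go
package main

import (
	"fmt"
	"strings"
)

// spoofLogLine formats a security log entry. %q quotes the two
// caller-influenced fields, so embedded newlines appear as the two
// characters backslash and n instead of breaking the line.
func spoofLogLine(authed, bodySourceID, clientIP string) string {
	return fmt.Sprintf("security: source_id spoof attempt authed_workspace=%s body_source_id=%q client_ip=%q",
		authed, bodySourceID, clientIP)
}

// isSingleLine reports whether s contains no raw CR or LF.
func isSingleLine(s string) bool {
	return !strings.ContainsAny(s, "\r\n")
}

func main() {
	line := spoofLogLine("ws-1", "evil\ntimestamp=FORGED level=INFO msg=fake", "10.0.0.5")
	fmt.Println(line)
	fmt.Println(isSingleLine(line))
}
```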
Regression test: TestActivityHandler_Report_SourceIDLogInjection
exercises the code path with a literal \n in source_id. Assertion is
limited to "handler returns 403 cleanly with no panic" because
capturing log output in Go tests requires a log.SetOutput swap, which
adds noise for little signal vs just reading the test log output
(visible when running with -v).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes #226 MEDIUM. WorkspaceHandler.Create joined payload.Template
directly into filepath.Join(configsDir, template) without validating
it stayed inside configsDir. An attacker posting Template="../../etc"
would have the provisioner walk and mount arbitrary host directories
into the workspace container.
Same fix as #103 (POST /org/import): use the existing resolveInsideRoot
helper to reject absolute paths and any ".." that escapes the root.
Applied at both call sites in workspace.go:
1. Synchronous runtime detection before DB insert — 400 on bad input
2. Async provisioning goroutine — early return, logs the rejection
(belt-and-suspenders; the create path already blocks)
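A containment check of this kind can be sketched as below. The signature and body are assumptions, since the real `resolveInsideRoot` is not reproduced here, and rejecting empty input is also assumed. The sketch is purely lexical and does not resolve symlinks.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// resolveInsideRoot joins ref onto root and rejects absolute paths and any
// ref whose cleaned result falls outside root (illustrative implementation).
func resolveInsideRoot(root, ref string) (string, error) {
	if ref == "" {
		return "", errors.New("empty path")
	}
	if filepath.IsAbs(ref) {
		return "", errors.New("absolute path not allowed")
	}
	joined := filepath.Join(root, ref) // Join also cleans ".." segments
	rel, err := filepath.Rel(root, joined)
	if err != nil {
		return "", err
	}
	// A relative path that starts with ".." points outside root. Comparing
	// via Rel also rejects prefix siblings such as /configs-evil.
	if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", errors.New("path escapes root")
	}
	return joined, nil
}

// inside reports whether ref resolves to a location under /configs.
func inside(ref string) bool {
	_, err := resolveInsideRoot("/configs", ref)
	return err == nil
}

func main() {
	for _, ref := range []string{"templates/a", "../../etc", "/etc", "../configs-evil/x"} {
		fmt.Println(ref, inside(ref))
	}
}
```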
No test added inline because the existing resolveInsideRoot suite
(org_path_test.go) already covers absolute / traversal / prefix-sibling
/ empty-path / deep-subpath cases. A duplicate test for the workspace
handler wouldn't add signal.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The original fix stripped \n/\r but left the rest in place, then relied
on a substring-based test which was over-strict (the escaped fragment
still contained the banned substring as bytes).
Better approach: emit the name as a double-quoted YAML scalar with all
escape sequences (\\, \", \n, \r, \t) handled inline. This is the
canonical YAML-safe way to embed user input — no injection possible
because every control character is either escaped or rejected by the
YAML parser inside the scalar context.
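A sketch of the quoting step, assuming a hypothetical helper named `quoteYAMLScalar` that handles exactly the five escapes listed above. The real helper's name and its treatment of other control characters are not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// yamlEscaper rewrites the five characters named above into their
// double-quoted YAML escape sequences. strings.Replacer scans the input once,
// so the backslashes it emits are not escaped a second time.
var yamlEscaper = strings.NewReplacer(
	`\`, `\\`,
	`"`, `\"`,
	"\n", `\n`,
	"\r", `\r`,
	"\t", `\t`,
)

// quoteYAMLScalar returns s as a double-quoted YAML scalar.
func quoteYAMLScalar(s string) string {
	return `"` + yamlEscaper.Replace(s) + `"`
}

func main() {
	// The injected newline stays inside the scalar as the two characters \n.
	fmt.Println("name: " + quoteYAMLScalar("x\nmodel: evil"))
}
```

A YAML parser reading that line back yields the original attacker string as the value of `name`, which is the property the rewritten test asserts.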
Test rewritten to parse the output as YAML and verify:
1. parsed["name"] equals the literal attacker input (payload preserved)
2. no banned top-level keys leaked to the parsed map
3. legitimate default keys (description/version/tier/model) still present
Updated the two existing tests that asserted the unquoted name format.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses self-review of the 10-PR batch merged earlier this session.
Splits the follow-ups into this Go-side PR and a later Python/docs PR.
## Fixes
1. wsauth_middleware.go CanvasOrBearer — invalid bearer now hard-rejects
with 401 instead of falling through to the Origin check. Previous code
let an attacker with an expired token + matching Origin bypass auth.
Empty bearer still falls through to the Origin path (the intended
canvas path).
2. scheduler.go short() helper — extracts safe UUID prefix truncation.
Pre-existing unsafe [:12] and [:8] slices would panic on workspace IDs
shorter than the bound. #115's new skip path had the bounds check;
the happy-path log lines did not. One helper, three call sites.
3. activity.go security-event log on source_id spoof — #209 added the
403 but the attempt was invisible to any auditor cron. Stable
greppable log line with authed_workspace, body_source_id, client IP.
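The bounds-safe truncation from fix 2 might look like the following. The two-argument signature is an assumption; the message only names the helper `short()` and the slice bounds 12 and 8.

```go
package main

import "fmt"

// short returns at most the first n bytes of id. Unlike id[:n], it never
// panics when id is shorter than n.
func short(id string, n int) string {
	if n < 0 {
		n = 0
	}
	if len(id) <= n {
		return id
	}
	return id[:n]
}

func main() {
	fmt.Println(short("3f2a9c1e-7b4d-4e8a-9c0f-1a2b3c4d5e6f", 8)) // 3f2a9c1e
	fmt.Println(short("ws-1", 12))                                 // ws-1
}
```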
## New tests
- TestShort_helper — bounds-safety regression guard for the helper
- TestRecordSkipped_writesSkippedStatus — #115 coverage gap, exercises
UPDATE + INSERT via sqlmock
- TestRecordSkipped_shortWorkspaceIDNoPanic — short-ID crash regression
- TestActivityHandler_Report_SourceIDSpoofRejected — #209 403 path
- TestActivityHandler_Report_MatchingSourceIDAccepted — non-spoof path
- TestHistory_IncludesErrorDetail — #152 problem B coverage
go test -race ./... green locally.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The 13K-line plugins_install_pipeline.go had zero unit tests, making it
the highest-regression-risk file in the platform handlers package.
New test file covers all testable pure-function and integration paths
that do not require a live Docker daemon:
validatePluginName (8 cases)
- valid names, empty, forward slash, backslash, "..", embedded "..",
path-traversal variants ("../etc", "../../secrets")
dirSize (6 cases)
- empty dir, single file, multiple files, nested subdirectory,
exceeds limit (verifies error mentions "cap"), exactly at limit
httpErr / newHTTPErr (3 cases)
- Error() contains status code, all relevant HTTP codes preserved,
errors.As unwraps through fmt.Errorf %w chains
regexpEscapeForAwk (6 cases)
- alphanumeric names unchanged, slash escaped, dot escaped, + escaped,
full "# Plugin: name /" marker (space not escaped), backslash escaped
streamDirAsTar (4 cases)
- empty dir yields zero entries, single file round-trips content,
nested directory preserves relative path, entries have no absolute
or tempdir-leaking paths
resolveAndStage via stubResolver (10 cases)
- empty source → 400, unknown scheme → 400, happy path (result fields),
staged dir cleaned on fetch error, ErrPluginNotFound → 404,
DeadlineExceeded → 504, generic error → 502, resolver returns invalid
name → 400, local:// path traversal → 400 (pre-Fetch validation)
stubResolver implements plugins.SourceResolver as an in-process test
double — no network, no filesystem side-effects beyond the staging tempdir
that resolveAndStage creates and cleans up.
Closes #217
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A crafted workspace name containing a newline (e.g. "x\nmodel: evil")
could inject arbitrary YAML keys into the auto-generated config.yaml.
Strip \n and \r from the name before interpolation. YAML key injection
requires a newline to start a new mapping entry; other characters such
as `:` cannot introduce a new top-level key on their own.
Adds TestGenerateDefaultConfig_YAMLInjection with three adversarial
inputs: bare \n injection, CRLF injection, and multi-key injection.
Closes #221
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cherry-picks the one genuinely new fix from #169 after confirming the
rest of that PR is already covered on main (C1/C3/C5 by wsAuth group,
C6 by #94+#119 SSRF blocklist, C4 ownership by existing WHERE filter).
Pre-existing middleware (WorkspaceAuth on /workspaces/:id/* sub-routes)
proves the caller owns the :id path param. But the body field
source_id was never validated — a workspace authenticated for its own
/activity endpoint could still attribute logs to a different workspace
by setting source_id=<foreign UUID>. Rejected with 403 now.
No schema change, no new middleware. 4-line handler delta. Closes the
only real gap in #169; #169 itself will be closed as superseded.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes #152 problem B (schedule history API drops error detail).
Two tiny changes:
1. scheduler.fireSchedule now writes lastError into activity_logs.error_detail
when inserting the cron_run row. Previously the column was left NULL even
on failure because the INSERT didn't include it.
2. schedules.History SELECT now reads error_detail and includes it in the
JSON response under error_detail. Frontend + audit cron can now display
"why did this run fail" instead of just "status=error".
No schema change — activity_logs.error_detail already exists from
migration 009. This just starts using the column.
Problem A of #152 (Research Lead ecosystem-watch 50% error rate on its
own) is a separate ops investigation and stays open.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes #138. #125 moved PATCH /workspaces/:id into the wsAdmin AdminAuth
group to close the #120 unauth vulnerability, but broke canvas drag-
reposition and inline rename because canvas uses session cookies not
bearer tokens. Multi-tenant deployments with any live token would have
seen every canvas PATCH 401.
Option A per #138 triage: PATCH goes back on the open router, but
WorkspaceHandler.Update now enforces field-level authz:
Cosmetic (no bearer required):
name, role, x, y, canvas
Sensitive (bearer required when any live token exists):
tier — resource escalation
parent_id — A2A hierarchy manipulation
runtime — container image swap
workspace_dir — host bind-mount redirection
Fail-open bootstrap: HasAnyLiveTokenGlobal = 0 → pass-through
(fresh install, pre-Phase-30 upgrade path). Matches the same
lazy-bootstrap contract WorkspaceAuth and AdminAuth use elsewhere.
3 new tests cover all three branches of the matrix (cosmetic
no-bearer, sensitive no-bearer-rejected, sensitive fail-open).
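The decision matrix can be sketched as a pure function. The field names come from the lists above; the helper names, the 401 status for a rejected sensitive update, and passing the live-token count as an integer are assumptions made for illustration.

```go
package main

import "fmt"

// sensitiveFields are the PATCH fields that require a bearer once any live
// token exists. All other fields are treated as cosmetic.
var sensitiveFields = map[string]bool{
	"tier":          true,
	"parent_id":     true,
	"runtime":       true,
	"workspace_dir": true,
}

// needsBearer reports whether a PATCH touching the given fields must carry
// a valid bearer token.
func needsBearer(fields []string, liveTokens int) bool {
	if liveTokens == 0 {
		return false // fail-open bootstrap: no tokens issued yet
	}
	for _, f := range fields {
		if sensitiveFields[f] {
			return true
		}
	}
	return false
}

// authorizePatch returns the status the handler would answer with before
// applying the update (401 is an assumed rejection code).
func authorizePatch(fields []string, liveTokens int, bearerValid bool) int {
	if needsBearer(fields, liveTokens) && !bearerValid {
		return 401
	}
	return 200
}

func main() {
	fmt.Println(authorizePatch([]string{"name", "x", "y"}, 3, false)) // cosmetic, no bearer
	fmt.Println(authorizePatch([]string{"tier"}, 3, false))           // sensitive, no bearer
	fmt.Println(authorizePatch([]string{"tier"}, 0, false))           // sensitive, fail-open
}
```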
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>