Commit Graph

143 Commits

Author SHA1 Message Date
Hongming Wang
737dd1999b fix: restore cp_provisioner.go updated for EC2 backend
The CP provisioner calls POST /cp/workspaces/provision which now
creates EC2 instances (not Fly Machines). The tenant platform
auto-activates this when MOLECULE_ORG_ID is set.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 14:25:43 -07:00
Molecule AI Backend Engineer
a84a33523c fix(middleware): split CSP by route type — strict for API, permissive for canvas (#450)
API routes return JSON and never need 'unsafe-inline' or 'unsafe-eval'.
Serving those directives globally defeated the purpose of CSP and gave
false security assurance. Canvas-proxied routes (NoRoute → Next.js) keep
'unsafe-inline' because React hydration requires it; 'unsafe-eval' was
already absent and is confirmed unnecessary in production builds.

Implementation:
- Add isAPIPath() helper with an explicit prefix allowlist that mirrors
  the routes registered in router/router.go
- Strict "default-src 'self'" on all /workspaces, /registry, /health,
  /admin, /metrics, /settings, /bundles, /org, /templates, /plugins,
  /webhooks, /channels, /ws, /events, /approvals paths
- Permissive CSP (unsafe-inline, no unsafe-eval) on canvas/NoRoute paths
- 4 new test functions: TestCSPAPIRoutesGetStrictPolicy (covers every
  prefix + sub-path), TestCSPCanvasRoutesGetPermissivePolicy, and
  TestIsAPIPath unit test including substring-non-match guard

Resolves #450

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 20:26:17 +00:00
rabbitblood
3609b7ab8c feat(platform): wire github-app-auth plugin for per-installation tokens
Integrates github.com/Molecule-AI/molecule-ai-plugin-github-app-auth.
When GITHUB_APP_ID is set, the platform constructs a plugin
Authenticator at boot and registers it as an EnvMutator on the
WorkspaceHandler. Every workspace provision then gets a fresh
GITHUB_TOKEN / GH_TOKEN injected from the App's installation token
(rotates ~hourly, refresh 5 min before expiry).

Verified live this turn:
- Platform boot log: `github-app-auth: registered, 1 mutator(s) in chain`
- `docker exec ws-<id> gh auth status` → `Logged in as molecule-ai[bot] (GH_TOKEN)`
- `gh issue list --repo Molecule-AI/molecule-core` returns real data
  (Hermes #498/#499/#500 visible from inside a workspace container)

## Changes
- platform/go.mod + go.sum: new dep on the plugin
- platform/cmd/server/main.go: import + conditional registration
  (soft-skip when GITHUB_APP_ID is unset for self-hosted/dev)
- docker-compose.yml: pass GITHUB_APP_* env + bind-mount private key

## Drive-by
.gitignore: exclude /org-templates /plugins /workspace-configs-templates
— these dirs are populated locally by clone-manifest.sh from the
standalone repos, should never be committed to core. Without this rule
my previous git add -A staged 33 embedded git dirs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 12:52:20 -07:00
Hongming Wang
b6e039cb49 fix: code review findings — dead code, DRY, rate limit, docs
1. Delete fly_provisioner.go — superseded by control plane architecture.
   Direct Fly provisioning from tenant was intentionally removed.

2. Extract loadWorkspaceSecrets() — shared by Docker + CP provisioner
   paths. Eliminates 30-line secret-loading duplication.

3. Token rate limit — max 50 active tokens per workspace. Returns 429
   if exceeded. Prevents unbounded token creation by compromised client.

4. CLAUDE.md — add GET/POST/DELETE /workspaces/:id/tokens to route table.

5. .env.example — document MOLECULE_ORG_ID and CP_PROVISION_URL.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 12:04:37 -07:00
Hongming Wang
1ea615df4c feat(platform): auto-detect SaaS tenant → control plane provisioner
No env vars to configure. The platform auto-detects the backend:

  MOLECULE_ORG_ID set → SaaS tenant → control plane provisioner
  MOLECULE_ORG_ID empty → self-hosted → Docker provisioner

The control plane URL defaults to https://api.moleculesai.app (override
with CP_PROVISION_URL for testing). No FLY_API_TOKEN on the tenant.

Removed: direct Fly provisioner (FlyProvisioner) — all SaaS workspace
provisioning goes through the control plane which holds the Fly token
and manages billing, quotas, and cleanup.

Two backends: CPProvisioner (SaaS) and Docker Provisioner (self-hosted).

Closes #494

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 11:50:52 -07:00
Hongming Wang
1949846001 fix(auth): allow nesting + delete from tenant canvas (same-origin)
PATCH /workspaces/:id field-level auth for parent_id/tier/runtime
required a bearer token, blocking canvas nesting (drag-to-nest).
Added IsSameOriginCanvas check so the tenant canvas can update
sensitive fields without a bearer.

Exported IsSameOriginCanvas from middleware package so workspace.go
can call it for the field-level auth path.

DELETE /workspaces/:id is behind AdminAuth which already has the
same-origin check — if delete still fails, it's a different issue.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 11:22:45 -07:00
Hongming Wang
7160d1a1a8 feat(platform): Fly Machines provisioner for SaaS workspace deployment
When CONTAINER_BACKEND=flyio, workspaces are provisioned as Fly Machines
instead of local Docker containers. This enables workspace deployment
on SaaS tenants where no Docker daemon is available.

New files:
- provisioner/fly_provisioner.go: FlyProvisioner with Start/Stop/
  IsRunning/Restart/Close via Fly Machines API (api.machines.dev/v1)
- FlyRuntimeImages maps runtimes to GHCR image tags

Changes:
- main.go: select Docker vs Fly based on CONTAINER_BACKEND env var
- workspace.go: SetFlyProvisioner() setter, Create checks flyProv first
- workspace_provision.go: provisionWorkspaceFly() loads secrets, calls
  FlyProvisioner.Start, issues auth token for the new machine

Env vars for Fly backend:
- CONTAINER_BACKEND=flyio (activates Fly provisioner)
- FLY_API_TOKEN (Fly deploy token)
- FLY_WORKSPACE_APP (Fly app name for workspace machines)
- FLY_REGION (default: ord)

Closes #494

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 10:51:15 -07:00
Hongming Wang
96b909b8f3 fix: code review findings — token UI, auth hardening, WS dedup
1. Settings panel: wire TokensTab into "API Tokens" tab (was imported
   but not rendered). Rename "API Keys" → "Secrets", add "API Tokens"
   tab. Fix docs link → doc.moleculesai.app/docs/tokens.

2. Referer match hardening: require exact host match or trailing slash
   to prevent evil.com subdomain bypass. Cache CANVAS_PROXY_URL at
   init time instead of per-request os.Getenv.

3. Extract shared deriveWsBaseUrl() to lib/ws-url.ts — eliminates
   duplicate 12-line derivation in socket.ts and TerminalTab.tsx.

4. Token list pagination: add ?limit= and ?offset= params (default
   50, max 200) to GET /workspaces/:id/tokens.

507/507 canvas tests pass, Go build + vet clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 10:42:26 -07:00
Hongming Wang
c4b56c6c84 fix(auth): allow same-origin canvas requests through WorkspaceAuth on tenant
WorkspaceAuth only accepted bearer tokens, blocking the canvas from
calling per-workspace routes (restart, config, secrets, chat) on the
tenant image where canvas + API share the same origin.

Added isSameOriginCanvas() fallback (same check used by AdminAuth):
checks Referer matches request Host, gated behind CANVAS_PROXY_URL
so only tenant deployments are affected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 10:06:33 -07:00
Hongming Wang
25bd9241d1 fix(tenant): WebSocket URL derivation + AdminAuth same-origin for tenant image
Two bugs on the combined tenant image (canvas + API same-origin):

1. WebSocket URL: NEXT_PUBLIC_WS_URL="" (empty string for same-origin)
   was preserved by ?? operator, producing an invalid WS URL. Now derives
   from window.location when both env vars are empty. Same fix applied
   to TerminalTab.

2. AdminAuth blocking canvas: same-origin requests have no Origin header,
   so neither AdminAuth nor CanvasOrBearer could authenticate the canvas.
   Added isSameOriginCanvas() that checks Referer against request Host,
   gated behind CANVAS_PROXY_URL (only active on tenant image). This
   lets the canvas create/list workspaces, view events, etc. without a
   bearer token when served from the same Go process.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 08:43:01 -07:00
Hongming Wang
de9f3d179c feat(platform): token management API + MCP setup + external agent guide
1. Token Management API (closes production gap):
   - GET /workspaces/:id/tokens — list tokens (prefix + metadata, never plaintext)
   - POST /workspaces/:id/tokens — create new token (plaintext returned once)
   - DELETE /workspaces/:id/tokens/:tokenId — revoke specific token
   - Behind WorkspaceAuth middleware (need existing token to manage tokens)
   - Tests skip gracefully when no DB available

2. MCP Server Setup:
   - Fix .mcp.json to use npx @molecule-ai/mcp-server (was referencing
     non-existent local ./mcp-server/dist/index.js)
   - Add comprehensive tool→API mapping doc (87 tools across 15 categories)

3. External Agent Registration Guide:
   - Step-by-step: create workspace, register, heartbeat, A2A messaging
   - Python (Flask) and Node.js (Express) complete working examples
   - Communication rules, lifecycle, security, troubleshooting

4. Token Management Guide:
   - Bootstrap flow, rotation procedure, security properties

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 08:37:42 -07:00
Hongming Wang
206564c90b
Merge pull request #483 from Molecule-AI/fix/platform-modular-template-support
fix(platform): unblock org-template imports against modular workspace templates
2026-04-16 07:55:26 -07:00
rabbitblood
ff2394c085 fix(platform): unblock org-template imports against modular workspace templates
Two adjacent fixes that surfaced trying to bring the molecule-dev org
template back up against the new standalone workspace-template-* repos.

1) handlers/org.go — expand ${VAR} in workspace_dir before validation.
   The molecule-dev pm/workspace.yaml (and any operator's per-host
   binding) ships `workspace_dir: ${WORKSPACE_DIR}` so each operator
   can pick the host path PM bind-mounts. Without expansion the literal
   "${WORKSPACE_DIR}" string reaches validateWorkspaceDir and fails with
   "must be an absolute path", aborting the whole org import.
   Other fields (channel config, prompts) already go through expandWithEnv;
   workspace_dir was the last hold-out.

2) provisioner/provisioner.go — inject PYTHONPATH=/app for every
   workspace container. Standalone template Dockerfiles COPY adapter.py
   to /app and set ENV ADAPTER_MODULE=adapter, but molecule-runtime is
   a pip console_script entry point so cwd isn't on sys.path
   automatically. Setting PYTHONPATH here fixes every adapter image at
   once instead of needing 8 PRs against template repos. Operator
   override still wins (workspace EnvVars are appended after, so Docker
   takes the later duplicate).

   Note: this unblocks the import path but does NOT make claude-code /
   hermes / etc. boot. The runtime itself has a separate top-level
   `from adapters import` that breaks against modular templates —
   tracked at workspace-runtime#1.

Tests: TestBuildContainerEnv_InjectsPYTHONPATH +
TestBuildContainerEnv_WorkspaceEnvVarsCanOverridePYTHONPATH lock the
default + operator-override invariants. expandWithEnv is already covered
by TestExpandWithEnv_* — the workspace_dir use site is a one-line call
to that primitive.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 07:49:45 -07:00
rabbitblood
e7710d2e6f feat(channels): Lark / Feishu adapter (outbound webhook + Events API inbound)
New ChannelAdapter implementation for Lark (international, open.larksuite.com)
and Feishu (China, open.feishu.cn). Both speak the same payload format —
only the host differs — so a single adapter covers both.

Outbound: POST text to a Custom Bot webhook URL with msg_type:"text".
Lark returns 200 OK even when delivery fails — the body's `code` field is
the truth. Adapter parses the response and returns a Go error when
code != 0 so callers don't think a revoked-webhook send succeeded.

Inbound: handles both v1 url_verification (handshake) and v2 event_callback
(im.message.receive_v1) shapes. Optional verify_token field — when set,
inbound payloads with mismatching tokens are rejected via constant-time
compare (#337 class — never raw == against a stored secret).

Sender ID resolution prefers user_id → falls back to open_id (open_id is
always present; user_id only when the bot has the contacts permission).
Non-text message types and non-message events return nil, nil so the
receiver responds 200 OK without dispatching.

Tests: 23 cases — identity, ValidateConfig (6 sub-cases incl. URL prefix
matrix), SendMessage (no URL / invalid prefix / happy-path body shape /
api-error-code surfacing), ParseWebhook (handshake + token mismatch +
text message + open_id fallback + non-message + non-text + token mismatch
+ malformed JSON + malformed content + empty text), StartPolling no-op,
registry presence.

Also: make migration 023 idempotent (ADD COLUMN IF NOT EXISTS) — the
platform's migration runner has no schema_migrations tracking table, so
every .up.sql replays on every boot. Without IF NOT EXISTS the second
boot against an existing volume crashes with "column already exists".
Followup issue to be filed for proper migration tracking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 07:10:58 -07:00
rabbitblood
e08f28c962 feat(platform): provision-time env mutator hook for plugins
Add `provisionhook.EnvMutator` extension point so out-of-tree plugins
(e.g. github-app-auth, vault-secrets) can inject or override env vars
right before container Start, without forking core or piling more
provider-specific code into the handlers package.

WorkspaceHandler gains an optional `envMutators *provisionhook.Registry`
wired in via SetEnvMutators during boot. The hook fires after built-in
secret loads + per-agent git identity, so plugins can both read what's
already there and override anything they own (GIT_AUTHOR_*, GITHUB_TOKEN).

A nil registry is a no-op via Registry.Run's nil-receiver branch — keeps
the hot path a single nil compare and means existing flows stay green
even with zero plugins registered.

Mutator failure aborts provisioning and marks the workspace failed with
the wrapped error in last_sample_error. Failing fast surfaces the cause
to the operator instead of letting an agent boot into opaque "git push
401" loops it can never recover from on its own.

Tests cover ordered execution, chained env visibility, first-error abort,
nil-receiver no-op, nil-mutator drop, registration order, and concurrent
register-vs-run safety (-race clean).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 06:47:09 -07:00
Security Auditor
284fb26558 fix(security): YAML-quote skill/prompt names in generateDefaultConfig + opaque file-write errors
Closes #460, #461.

**#460 — YAML injection via unquoted skill/prompt filenames**
`generateDefaultConfig` extracted skill directory names and prompt file
names from user-supplied `body.Files` keys and wrote them directly into
YAML list items without quoting:

  cfg.WriteString("  - " + s + "\n")

`validateRelPath` only blocks path traversal (`../`); it does NOT block
YAML control characters including newlines. On Linux, filenames can
contain newlines, so an attacker with any live workspace bearer token
could submit:

  {"files": {"skills/legit\nruntime: malicious/SKILL.md": "# skill"}}

The generated config.yaml would then contain `runtime: malicious` as a
top-level YAML key, overriding the runtime for workspaces provisioned
from the template.

Fix: extract `yamlEscape` as a reusable local from the same
`strings.NewReplacer` already used for the `name` field (#221) and apply
it to both the `skills:` and `prompt_files:` list items, wrapping each
in double-quotes.

**#461 — Docker error details in ReplaceFiles 500 responses**
`ReplaceFiles` returned `fmt.Sprintf("failed to write files: %v", err)`
in two 500 paths, where `err` comes from Docker API calls and may include
internal container names, volume names, and daemon error messages.

Fix: log the full error server-side and return a static opaque string to
the caller.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 05:40:45 -07:00
Hongming Wang
8b13fff355 fix(test): wrap httptest.ResponseRecorder with CloseNotify for canvas proxy tests
httputil.ReverseProxy calls CloseNotify() which httptest.ResponseRecorder
doesn't implement. Gin casts the writer, causing a panic. Added a
closeNotifyRecorder wrapper with a no-op channel.
2026-04-16 05:40:17 -07:00
Molecule AI Backend Engineer
eec59fe63b fix(auth): inject fresh bearer token into config volume on every provision (closes #418)
Container rebuild or volume wipe caused workspaces to lose /configs/.auth_token.
On re-registration the platform returned no auth_token (HasAnyLiveToken==true →
no re-issue), leaving the workspace unable to authenticate any subsequent API call.

Fix: provisionWorkspaceOpts now calls issueAndInjectToken before Start(). This
revokes any existing live tokens (plaintext is irrecoverable from the stored hash,
so rotation is the only safe path) and issues a fresh token that is written into
cfg.ConfigFiles[".auth_token"]. WriteFilesToContainer delivers it to /configs
immediately after ContainerStart, racing safely ahead of the Python adapter's
1-2s startup time.

Failure modes are soft: revoke or issue errors skip injection with a warning;
provisioning continues and the workspace recovers on the next restart.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 05:26:10 -07:00
Hongming Wang
7fca9723a0
Merge pull request #467 from Molecule-AI/feat/slack-webhook-validation
[Backend Engineer] feat(channels): Slack adapter with webhook URL validation (#384)
2026-04-16 05:22:47 -07:00
Hongming Wang
d6e7784f11
Merge pull request #469 from Molecule-AI/feat/per-channel-budget
[Backend Engineer] feat(channels): per-channel message budget with 429 enforcement (#368)
2026-04-16 05:22:39 -07:00
Hongming Wang
6c374833b0
Merge pull request #457 from Molecule-AI/fix/issue-451-strip-auth-header-canvas-proxy
[Backend Engineer] fix(security): strip Authorization + Cookie in canvas reverse proxy
2026-04-16 05:17:01 -07:00
Hongming Wang
1184232d86
Merge pull request #446 from Molecule-AI/fix/issue-435-registry-error-leak
fix(security): suppress raw DB error from /registry/register response
2026-04-16 05:16:57 -07:00
Hongming Wang
370fb151b2
Merge pull request #465 from Molecule-AI/fix/memory-recall-flood-limit
[Backend Engineer] fix(memories): hard cap of 50 on recall results (#377)
2026-04-16 05:16:49 -07:00
Hongming Wang
74e4f30216 fix: address all code review findings + remove exposed secrets
Code review fixes:
- 🟡 #1: Replace python3 with jq in Dockerfile template stages (~50MB → ~2MB)
- 🟡 #2: Add clone count verification to scripts/clone-manifest.sh
  (set -e + expected vs actual count check — fails build if any clone fails)
- 🟡 #3: Drop 'unsafe-eval' from CSP (not needed for Next.js production
  standalone builds, only dev mode). Updated test assertion.
- 🟡 #4: Remove broken pyproject.toml from workspace-template/ (it claimed
  to package as molecule-ai-workspace-runtime but the directory structure
  didn't match — the real package ships from the standalone repo)
- 🔵 #1: Add version-pinning TODO comment to manifest.json
- 🔵 #3: Add full repo URLs + test counts for SDK/MCP/CLI/runtime in CLAUDE.md

Security (GitGuardian alert):
- Removed Telegram bot token (8633739353:AA...) from template-molecule-dev
  pm/.env — replaced with ${TELEGRAM_BOT_TOKEN} placeholder
- Removed Claude OAuth token (sk-ant-oat01-...) from template-molecule-dev
  root .env — replaced with ${CLAUDE_CODE_OAUTH_TOKEN} placeholder
- Both tokens need immediate rotation by the operator

Tests: Platform middleware tests updated + all pass.
2026-04-16 05:05:49 -07:00
Molecule AI Backend Engineer
b021f85af9 feat(channels): per-channel message budget with 429 enforcement (#368)
Add an optional channel_budget (INTEGER, nullable) to workspace_channels
via migration 024. When channel_budget IS NOT NULL and message_count has
reached the budget, the Send handler returns 429 {"error":"channel budget
exceeded"} and aborts before calling SendOutbound.

Implementation details:
- Single SELECT query reads both message_count and channel_budget in one
  round-trip (avoids TOCTOU window between read and write)
- Fail-open on DB error: transient failures log but don't block sends
- Early-return on budget hit is before SendOutbound so message_count
  cannot be incremented past the limit by a concurrent send that slips
  through the window (best-effort; atomic enforcement requires DB-level CAS)
- NULL channel_budget = unlimited (default, backward-compatible)

Migration is idempotent (ADD COLUMN IF NOT EXISTS). Down migration drops
the column cleanly.

Four sqlmock tests cover: at-limit → 429, above-limit → 429, NULL budget
passes through, under-limit passes through.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 11:17:14 +00:00
Molecule AI Backend Engineer
68c9b37048 feat(channels): add Slack adapter with webhook URL validation (#384)
Implement SlackAdapter satisfying the ChannelAdapter interface:
- ValidateConfig: rejects any webhook_url that doesn't start with
  https://hooks.slack.com/ — returns "invalid Slack webhook URL" so
  the handler surfaces 400 {"error":"invalid config: invalid Slack webhook URL"}
- SendMessage: HTTP POST JSON {"text":"..."} to the webhook URL with a
  10s timeout; rejects invalid-prefix URLs at send time too (defence in depth)
- ParseWebhook: handles both slash-command (form-encoded) and Events API
  (JSON) payloads; no-ops on url_verification and non-message events
- StartPolling: returns nil immediately (Slack doesn't support polling via
  Incoming Webhooks)

Register "slack" in the adapter registry. Twelve unit tests cover
Type/DisplayName, happy-path validation, every bad-URL variant (wrong scheme,
wrong host, SSRF lookalike, empty string), empty webhook in SendMessage,
StartPolling nil return, and registry lookup/listing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 11:14:31 +00:00
Hongming Wang
8e304e69e8 chore: remove extracted directories, add manifest-driven Docker builds
Remove plugins/, workspace-configs-templates/, org-templates/ dirs (now
in standalone repos). Add manifest.json listing all 33 repos and
scripts/clone-manifest.sh to clone them. Both Dockerfiles now use the
manifest script instead of 33 hardcoded git-clone lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 04:13:29 -07:00
Molecule AI Backend Engineer
6fb4b7b282 fix(memories): add hard cap of 50 on recall results (#377)
Introduce `memoryRecallMaxLimit = 50` constant and honour the `?limit=N`
query parameter in Search. Values above 50 are silently clamped to 50;
absent or invalid values default to 50. The LIMIT clause is now a
parameterised argument (nextArg pattern) instead of a hardcoded literal.
Three sqlmock tests verify the cap, the explicit limit, and the default.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 11:12:35 +00:00
Molecule AI Backend Engineer
479b172b25 fix(security): strip Authorization + Cookie headers in canvas reverse proxy (closes #451)
The canvas proxy was forwarding all headers verbatim to the Next.js process.
Workspace bearer tokens sent by agents (e.g. during an A2A call that hit a
canvas-side route) could reach unvalidated Next.js handlers and be echoed back
to an attacker via an error page or a debug endpoint.

Fix: Director now calls Header.Del("Authorization") + Header.Del("Cookie")
before forwarding. Non-credential headers (Accept, X-Request-Id, etc.) are
unaffected — the strip is surgical.

Four unit tests added (strips Authorization, strips Cookie, forwards other
headers, strips both simultaneously).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 11:00:43 +00:00
Backend Engineer
b0381d656c fix(security): registry DB errors must not leak raw driver messages (closes #435)
The Register handler was serialising the raw Go error into the HTTP response:
  c.JSON(500, gin.H{"error": fmt.Sprintf("failed to register: %v", err)})

PostgreSQL errors wrapped by lib/pq contain table names, constraint names, and
driver-version strings — enough for a caller to fingerprint the schema and craft
targeted attacks. The error is already logged at full detail with Printf before
this line, so callers only need the generic message.

Fix: replace the Sprintf with a static "registration failed" string (same pattern
the heartbeat and update-card handlers already used).

New test: TestRegister_DBErrorResponseIsOpaque verifies the response body is the
opaque string and that "sql:", "pq:", and "connection" substrings are absent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 10:34:35 +00:00
Backend Engineer
2451b1acc0 fix(provisioner): rebuild_config flag on restart recovers from destroyed config volume (closes #239)
When a workspace container AND its /configs Docker volume are both destroyed,
the restart handler previously had no recovery path — findTemplateByName searched
only the top-level configsDir, which holds workspace-instance dirs (ws-{id[:12]}/),
not the role-named org-template source directories.

Fix: add `rebuild_config: true` to the POST /workspaces/:id/restart body struct.
When set, the handler falls back to searching configsDir/org-templates/ via the
existing findTemplateByName logic (which already handles name normalisation and
config.yaml name-field matching). The workspace can then self-recover with its own
bearer token — no admin intervention required.

New helper: resolveOrgTemplate(configsDir, wsName) — pure function, independently
tested (4 cases: hit-by-dir, hit-by-config-yaml, no org-templates dir, no match).

Usage:
  curl -X POST -H "Authorization: Bearer $(cat /configs/.auth_token)" \
       -d '{"rebuild_config": true}' \
       http://platform:8080/workspaces/$WORKSPACE_ID/restart

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 10:34:25 +00:00
Hongming Wang
3e50b95800
Merge pull request #433 from Molecule-AI/feat/externalize-prompts-phase4
feat(org-templates): Phase 4 — atomize each role to <role>/workspace.yaml
2026-04-16 03:19:43 -07:00
Hongming Wang
c545e3a276
Merge pull request #417 from Molecule-AI/feat/memory-checkpoint-reconciliation
feat(memory): optimistic-locking via if_match_version on workspace_memory writes
2026-04-16 03:18:09 -07:00
rabbitblood
40a69d6f87 feat(org-templates): Phase 4 — atomize each role to <role>/workspace.yaml
Part 4 of 4 — terminal step of the org.yaml scalability refactor. Each
role in the molecule-dev template now owns its own workspace.yaml file,
colocated with the existing system-prompt.md / initial-prompt.md /
idle-prompt.md / schedules/*.md. Team files shrink to a leader's own
definition plus a list of !include refs.

## Platform change

`resolveYAMLIncludes` now uses a TWO-ROOT model:
- Path resolution is relative to the INCLUDING file's directory
  (natural sibling + cousin refs, C-include / Sass @import convention).
- Security bound is the ORIGINAL org root (`rootDir`), preserved across
  all recursion depths. Sibling-dir refs like `../my-role/workspace.yaml`
  from a team file are now allowed (they stay inside the org template);
  refs that escape the root still error.

Regression coverage: new `TestResolveYAMLIncludes_SiblingDirAccess`
reproduces the Phase 4 pattern (team file at `teams/x.yaml` referencing
`../<role>/workspace.yaml`) — fails without the fix, passes with.

## Template change

Atomized 15 child workspaces across 3 team files:
- `teams/research.yaml`: 58 → 30 lines; 3 children now !include refs
- `teams/dev.yaml`: 222 → 38 lines; 6 children now !include refs
- `teams/marketing.yaml`: 143 → 28 lines; 6 children now !include refs

Each role now has `<role>/workspace.yaml` colocated with its prompts.
Example `frontend-engineer/` directory:
  frontend-engineer/
  ├── workspace.yaml        (24 lines — name/role/tier/canvas/plugins/...)
  ├── system-prompt.md      (from earlier phases)
  ├── initial-prompt.md
  ├── idle-prompt.md
  └── (no schedules for this role — but if added, schedules/<slug>.md)

## File-size progression across all 4 phases

| State | org.yaml | total `.yaml` in tree |
|---|---:|---:|
| Before (main) | 1801 lines / 108 KB | 1801 / 108 KB (one file) |
| After Phase 1 (#389) | 1687 | 1687 / 101 KB |
| After Phase 2 (#390) | 676 | 676 / 35 KB |
| After Phase 3 (#393) | 114 | 683 (1 + 6 teams) / 33 KB |
| **After this PR** | **114** | **~698** (1 + 6 + 15 workspace) / 35 KB |

Aggregate size is flat — the decrease came from prompt externalization
in Phases 1/2; Phases 3/4 reorganize structure without adding content.
The win is readability and ownership:
- Every individual file fits on 1-2 screens.
- Adding a new role is now: create `<role>/` dir, add `workspace.yaml`
  + `system-prompt.md` + prompts, add ONE `!include` line to the team
  file. No touching of aggregated mega-YAML.
- Team files can be reviewed + merged independently.

## Tests

All 10 `TestResolveYAMLIncludes_*` tests pass, including the real-template
integration test (`TestResolveYAMLIncludes_RealMoleculeDev`) which now
walks org.yaml → teams/pm.yaml → teams/research.yaml → ../market-analyst/
workspace.yaml and validates the full 21-role tree unmarshals cleanly.

Plus all existing `TestResolvePromptRef` + `TestOrgYAML` + `TestInitialPrompt`
suites stay green.

## Ops followup

After merging all 4 phases and deploying, the `POST /org/import`
endpoint should produce a workspace tree byte-identical to the
pre-refactor state. Verify with:
  diff <(curl POST /org/import before) <(curl POST /org/import after)
or by spot-checking:
  - `/configs/config.yaml` bodies across all 21 workspaces
  - `workspace_schedules.prompt` row values

The externalization is lossless — YAML literal to file and back
recovers the same string modulo trailing-whitespace normalization.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 03:09:56 -07:00
Hongming Wang
0c73810121
Merge pull request #404 from Molecule-AI/feat/externalize-prompts-phase3
feat(org-templates): Phase 3 — !include directive + split org.yaml into team files
2026-04-16 03:08:01 -07:00
Hongming Wang
db22b5d853
Merge pull request #413 from Molecule-AI/fix/isrunning-distinguish-notfound
fix(provisioner): IsRunning conservative on daemon errors to stop restart cascade
2026-04-16 03:07:54 -07:00
Hongming Wang
1e43e45de7
Merge pull request #402 from Molecule-AI/feat/per-agent-git-identity
feat(provisioner): per-agent git identity via GIT_AUTHOR_* env vars
2026-04-16 03:07:50 -07:00
rabbitblood
7debdb1676 fix(tests): CSP test now fragment-matches instead of exact-matches
SecurityHeaders middleware widened its CSP to allow Next.js inline scripts
+ data:/blob: images (platform/internal/middleware/securityheaders.go:44,
canvas is reverse-proxied through the gin stack so it needs the permissive
policy). The two CSP asserts in securityheaders_test.go still hard-compared
against the old tight `default-src 'self'`, so they fail on main as of
this afternoon.

Fix: assert each expected CSP fragment is PRESENT in the header (substring
match) instead of byte-for-byte equality. Test intent is "CSP is set, starts
with tight default-src, contains the expected directives" — not "CSP matches
this exact string". Future subsource tuning (add a new CDN, bump blob:/data:
scope) won't re-break this test.

Caught because every PR touching anything in the monorepo currently fails
the Platform (Go) CI job on these two asserts. Fixing on a dedicated branch
so it can land ahead of every blocked PR in the queue.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 02:59:06 -07:00
Hongming Wang
b8cb14f46e feat(tenant): combined platform + canvas Docker image with reverse proxy
Single-container tenant architecture: Go platform (:8080) + Canvas
Node.js (:3000) in one Fly machine, with Go's NoRoute handler reverse-
proxying non-API routes to the canvas. Browser only talks to :8080.

Changes:

platform/Dockerfile.tenant — multi-stage build (Go + Node + runtime).
  Bakes workspace-configs-templates/ + org-templates/ into the image.
  Build context: repo root.

platform/entrypoint-tenant.sh — starts both processes, kills both if
  either exits. Fly health check on :8080 covers the Go binary; canvas
  health is implicit (proxy returns 502 if canvas is down).

platform/internal/router/canvas_proxy.go — httputil.ReverseProxy that
  forwards unmatched routes to CANVAS_PROXY_URL (http://localhost:3000).
  Activated by NoRoute when CANVAS_PROXY_URL env is set.

platform/internal/router/router.go — wire NoRoute → canvasProxy when
  CANVAS_PROXY_URL is present; no-op otherwise (local dev unchanged).

platform/internal/middleware/securityheaders.go — relaxed CSP to allow
  Next.js inline scripts/styles/eval + WebSocket + data: URIs. The
  strict `default-src 'self'` was blocking all canvas rendering.

canvas/src/lib/api.ts — changed `||` to `??` for NEXT_PUBLIC_PLATFORM_URL
  so empty string means "same-origin" (combined image) instead of falling
  back to localhost:8080.

canvas/src/components/tabs/TerminalTab.tsx — same `??` fix for WS URL.

Verified: tenant machine boots, canvas renders, 8 runtime templates +
4 org templates visible, API routes work through the same port.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 02:46:47 -07:00
rabbitblood
73171532a1 feat(memory): optimistic-locking via if_match_version on workspace_memory writes
Closes the silent-overwrite hole where two agents racing a read-modify-
write on the same memory key left only one agent's update. Relevant for
orchestrators (PM, Dev Lead, Marketing Lead) keeping structured running
state (delegation-result ledgers, task queues) in memory, and for the
``research-backlog:*`` keys that multiple idle loops write in parallel.

## Semantics

### Back-compat path (no if_match_version)
Unchanged: ``INSERT ... ON CONFLICT UPDATE`` last-write-wins. Every
existing agent tool, every existing ``commit_memory`` call, every
existing cron that writes memory — all continue to work with no edit.

### Optimistic-lock path (if_match_version set)
1. Client calls ``GET /memory/:key`` → ``{value, version: V}``
2. Client modifies value locally
3. Client ``POST /memory {key, value, if_match_version: V}``
4. Server: ``UPDATE ... WHERE version = V`` + RETURNING new version
5. On match → 200 + ``{version: V+1}``
6. On mismatch → 409 + ``{expected_version: V, current_version: <actual>}``
7. Client reads the actual version and retries.

### Create-only marker
``if_match_version: 0`` means "create iff the key doesn't exist yet".
Two agents simultaneously seeding a shared key will see exactly one
success + one 409 — no silent collision, no duplicate-init work.

### Schema

Migration 023 adds ``version BIGINT NOT NULL DEFAULT 1``. Existing rows
baseline at 1. New rows start at 1. Every successful write (both paths)
increments: ``version = version + 1`` on update, ``1`` on insert.

## Why version, not updated_at

``updated_at`` has second-granularity and can collide between concurrent
writers on a fast clock. A monotonic counter is collision-free and more
readable in the 409 response body ("expected 5, current is 7 — you
missed 2 writes" tells an agent exactly what to re-read).

## Why ``if_match_version`` and not an ETag header

JSON field keeps it in the request body, visible alongside the value
payload. Agents assembling requests programmatically don't have to
remember to thread a header through their HTTP client wrapper; the
existing ``commit_memory`` tool can grow one optional kwarg and match
the existing signature shape.

## Tests

11 memory-handler cases covering every path:
- GET list / get (with version in response shape)
- Set with no version (back-compat upsert, returns new version)
- Set with if_match_version match (happy path, increment)
- Set with if_match_version mismatch (409 + expected/current fields)
- Set with if_match_version=0 on absent key (create-only success)
- Set with if_match_version=N on absent key (409 — caller's mental
  model is wrong)
- Bad inputs (missing key, malformed JSON)
- Delete happy + error path

Full ``go test ./internal/handlers/`` green.

## Follow-up (not in this PR)

- Workspace-template tool update: ``commit_memory(content, *,
  if_match_version=None)`` surfaces the new option + on 409 surfaces
  the current_version so agents can retry without manual re-read.
- Named checkpoints table (``workspace_checkpoints``) for durable
  orchestrator state snapshots. Different concern than per-key locking;
  separate PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 02:32:46 -07:00
rabbitblood
8bf27ae1d0 fix(provisioner): IsRunning conservative on daemon errors to stop restart cascade
Root cause of the 2026-04-16 09:10 UTC six-container restart cascade.

## Timeline

09:10:26 — PM sent a batch delegation to 15+ agents (Dev Lead coordinating).
09:10:26-27 — 4 leaders/auditors (Security, RL, BE, DevOps) simultaneously
              hit "workspace agent unreachable — container restart triggered"
              even though their containers were running fine. Another 2
              (DL, UIUX) tripped in the next few seconds.
09:10:27 — Provisioner stopped + recreated 6 containers in parallel. A2A
           callers got EOFs, PM's batch coordination stalled.

## Root cause

`provisioner.IsRunning` collapsed every ContainerInspect error into
`(false, nil)`, including transient Docker daemon hiccups:

  func IsRunning(...) (bool, error) {
      info, err := p.cli.ContainerInspect(ctx, name)
      if err != nil {
          return false, nil // Container doesn't exist ← MISREAD
      }
      return info.State.Running, nil
  }

The comment said "Container doesn't exist" but the error was actually
any of: daemon timeout, socket EOF, context deadline, connection
refused. Under load (batch delegation fan-out → 15 concurrent HTTP
inbound → 15 concurrent Claude Code subprocesses → Docker daemon CPU
pressure), ContainerInspect calls started failing transiently. All 6
calls returned `(false, nil)`. Caller `maybeMarkContainerDead` treated
`running=false` as "container is dead, restart it" → six parallel
restarts. This was exactly the destructive-on-error pattern we keep
trying to kill (see #160 SDK-stderr-probe, #318 fail-open classes).

## Fix

`IsRunning` now distinguishes NotFound from transient errors:

- Legitimately missing container (caller deleted, Docker pruned) →
  `(false, nil)` — safe to act on; caller marks dead + restarts.
- Any other error (daemon timeout, socket issue, context deadline) →
  `(true, err)` — caller stays on the alive path. The transient error
  is preserved so metrics + logging still see it, but it does NOT
  trigger the destructive restart branch.

`isContainerNotFound` matches on error-message substring — same
approach docker/cli uses internally — to avoid pulling in errdefs as a
direct dep. Truth table tests in `isrunning_test.go` cover 8 cases:
NotFound variants (real + generic), nil, empty, and the 4 transient-
error shapes we've actually observed (deadline, EOF, connection-refused,
i/o timeout).

## Caller update

`maybeMarkContainerDead` in a2a_proxy.go now logs the transient inspect
error (was silently discarded via `_`). Visibility without
destructiveness. If this error becomes persistent, we'll see it in
platform logs rather than diagnosing after another restart cascade.

## Expected impact

- Zero restart cascades from the current class of transient inspect
  errors (EOF, timeout, connection refused).
- Dead containers still detected within the A2A layer because an actual
  stopped container returns NotFound on inspect, and the TTL monitor
  (180s post #386) catches anything that slips through.
- New visibility in platform logs when inspect has trouble — previously
  silent.

Combined with the TTL fix in #386, the defense-in-depth on spurious
restart is now:
  1. IsRunning only returns false for real NotFound
  2. Liveness TTL is 180s, surviving 5+ missed heartbeats
  3. A2A proxy 503-Busy path retries with backoff before touching
     restart logic at all

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 02:21:25 -07:00
Hongming Wang
51e3393ec0 fix(ops): bake workspace-configs-templates into platform Docker image
Tenant machines were booting with no templates because the Dockerfile
only shipped the Go binary + migrations. The canvas showed "0 templates"
with an empty picker.

Changes:
- platform/Dockerfile: build context changed from ./platform to repo
  root so COPY can reach workspace-configs-templates/ alongside the
  Go source. COPY paths updated for platform/{go.mod,go.sum,*.go} and
  platform/migrations/.
- .github/workflows/publish-platform-image.yml: context: . (was
  ./platform), paths trigger now includes workspace-configs-templates/
  so template changes rebuild the image.

Phase A of the template-registry plan. Phase B adds a DB registry +
on-demand fetch for community templates (user pastes GitHub URL at
workspace creation time). The baked defaults always ship in the image
for zero-config tenant boot.

Verified: `docker build -f platform/Dockerfile -t test .` succeeds,
`docker run --rm test ls /workspace-configs-templates/` shows all 8
templates (autogen, claude-code-default, crewai, deepagents, gemini-cli,
hermes, langgraph, openclaw).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 01:54:47 -07:00
rabbitblood
112c28d885 feat(org-templates): Phase 3 — !include directive + split org.yaml into team files
Part 3 of 4 in the scalability refactor. Adds YAML `!include` support
to the org importer and splits molecule-dev/org.yaml (676 lines post-
Phase 2) into 6 team / role files; top-level org.yaml drops to 114 lines
of pure scaffolding.

## Platform changes

New `platform/internal/handlers/org_include.go`:

- `resolveYAMLIncludes(data, baseDir)` — pre-processes a YAML document,
  expanding any scalar tagged `!include <path>` with the parsed content
  of the referenced file.
- Path resolution via `resolveInsideRoot` so a crafted `!include
  ../../etc/passwd` can't escape the org template directory (same
  defense the existing `files_dir` copy uses).
- Nested includes supported: each included file carries its own search
  root (its directory), so `teams/pm.yaml` with `!include research.yaml`
  resolves to `teams/research.yaml` — matching the convention of
  C-include / Sass @import / most package systems.
- Cycle detection via visited-set keyed on absolute path; belt-and-
  braces `maxIncludeDepth = 16` cap in case symlinks or path
  normalization defeats the set.
- Inline-template mode (POST /org/import with raw JSON body, no `dir`)
  errors cleanly when a file ref is used — can't resolve without a
  base.

Wired into both `ListTemplates` (so /org/templates shows an accurate
workspace count after the split) and `Import` (expansion happens before
unmarshal into OrgTemplate).

## Template changes

molecule-dev/org.yaml now contains only:
- name + description
- defaults (runtime, plugins, category_routing, initial_prompt text)
- `workspaces: [!include teams/pm.yaml, !include teams/marketing.yaml]`

New files:
- `teams/pm.yaml` — PM top-level, children are !include refs
- `teams/research.yaml` — Research Lead + Market Analyst + Technical
  Researcher + Competitive Intelligence (inline children)
- `teams/dev.yaml` — Dev Lead + FE/BE/DevOps/Security/QA/UIUX (inline)
- `teams/marketing.yaml` — Marketing Lead + DevRel/PMM/Content/
  Community/SEO/Social (inline)
- `teams/documentation-specialist.yaml` — leaf
- `teams/triage-operator.yaml` — leaf

## File-size impact

| State | org.yaml lines | total config size |
|---|---:|---:|
| Before (main) | 1801 | 108 KB |
| After Phase 1 (#389) | 1687 | 101 KB |
| After Phase 2 (#390) | 676 | 35 KB |
| After this PR | **114** | **4 KB** (org.yaml only) |

With the 6 team files (total ~570 lines of structural yaml), every file
is now under 230 lines and individually readable without scrolling past
a single team's boundaries.

## Tests

`platform/internal/handlers/org_include_test.go` — 9 cases:
- Flat include (single file, single workspace)
- Nested include (file → file → file)
- Traversal rejection (`../secret.yaml`, `../../secret.yaml`)
- Cycle detection (a↔b)
- Empty path error
- Missing file error
- Inline-template error (baseDir empty)
- No-op when YAML has no includes (safety: we always run the preprocessor)
- **Integration**: load the real `org-templates/molecule-dev/org.yaml`,
  resolve includes, unmarshal into OrgTemplate, verify PM + Marketing
  Lead are top-level and PM has ≥4 children after expansion.

All 9 pass + existing `TestResolvePromptRef` + `TestOrgYAML` suites stay
green.

## Ownership implication

Each team file can now be owned + reviewed independently. When the
marketing team adds a 7th role, the diff is in `teams/marketing.yaml`
alone — no merge conflicts against PM or research changes in the same
review window. Same for the eventual engineer team, security team, etc.

## What's next

- **Phase 4 (queued):** per-workspace atomization. Each role gets
  `<role>/workspace.yaml`; team files shrink to a list of !include
  refs. Terminal step in the scalability arc — at that point adding a
  new role is one new file under `org-templates/molecule-dev/<role>/`
  plus one line in the team's manifest.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 07:49:56 +00:00
Hongming Wang
29044c3995
fix(#249): add /schedules/health endpoint accessible to CanCommunicate peers (#400)
Rebased cleanly onto current main (resolves the add/add conflicts that
blocked CI on PR #374 — the original branch diverged from a pre-repo-bootstrap
commit that predated most files).

Changes:
- schedules.go: add scheduleHealthResponse struct + Health handler
  (mirrors A2A proxy auth pattern: X-Workspace-ID + CanCommunicate gate)
- router.go: register GET /workspaces/:id/schedules/health on r (not wsAuth)
  so peer agents can query without holding the target workspace's bearer token
- schedules_test.go: 7 new tests (missing caller 401, self-call OK, legacy
  peer grandfathered, non-peer 403, system caller bypass, no prompt exposure,
  DB error 500)

isSystemCaller/validateCallerToken reused from a2a_proxy.go (same package).
registry.CanCommunicate import added to schedules.go.

Closes #249
Supersedes PR #374 (which could not get CI due to merge conflict)

Co-authored-by: PM (Molecule AI) <pm@molecule-ai.internal>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 00:45:30 -07:00
rabbitblood
c12d6436ab feat(provisioner): per-agent git identity via GIT_AUTHOR_* env vars
Every workspace now commits under its own name. Step 3 of the three-
step agent-separation plan (platform-level git identity today;
GitHub App migration follows as Option 1).

## Problem

All 20+ agents in the molecule-dev template (PM, Dev Lead, Research
Lead, FE, BE, DevOps, Security, QA, UIUX, Marketing roles, etc.) share
a single GITHUB_TOKEN — specifically the CEO's personal PAT. So every
commit, PR, and issue across the live repos ends up attributed to
HongmingWang-Rabbit. `git log` can't distinguish "which agent wrote
this code" from "did the CEO write it"; neither can the authority-
verification rule in triage-operator/philosophy.md (rule #3).

## Fix

When the provisioner starts a workspace container, it now sets:

  GIT_AUTHOR_NAME    = "Molecule AI <Workspace Name>"
  GIT_AUTHOR_EMAIL   = <slug>@agents.moleculesai.app
  GIT_COMMITTER_NAME  = (same)
  GIT_COMMITTER_EMAIL = (same)

Git prefers these env vars over `git config user.name` / `user.email`,
so no per-container git-config step is needed; every commit automatically
carries the right authorship.

Examples (20 agents, 20 distinct identities):
  Frontend Engineer         → frontend-engineer@agents.moleculesai.app
  Backend Engineer          → backend-engineer@agents.moleculesai.app
  Product Marketing Manager → product-marketing-manager@agents.moleculesai.app
  UIUX Designer             → uiux-designer@agents.moleculesai.app

Domain `agents.moleculesai.app` is deliberate: marks the email as a
bot address without resembling a real inbox.

## Operator override preserved

`applyAgentGitIdentity` runs AFTER the secret-load loops in
`provisionWorkspaceOpts`, but uses `setIfEmpty` so any workspace_secret
with the same key wins. Teams that want custom authorship (shared org
signing identity, a person-on-the-loop owner) can still set
`GIT_AUTHOR_NAME` via /workspaces/:id/secrets and get their value
through to git.

## What this does NOT solve (yet)

- PR / issue authorship is still whoever owns GITHUB_TOKEN (the shared
  PAT). That needs the GitHub App migration (Option 1, next PR). The
  commit-level split shipped here is the prerequisite: the App path
  will keep these env vars and just swap the PAT for a short-lived
  installation token.
- Existing containers continue with their pre-fix env (git env vars
  are baked in at container-create time). Applying is one plain
  `POST /workspaces/:id/restart` per agent after this merges +
  deploys — the restart goes through provisionWorkspace which picks
  up the new injection.

## Tests

`agent_git_identity_test.go` — 4 behavior tests + a 10-row slug test:
- fills all 4 env vars from a workspace name
- operator override via pre-set env is preserved (setIfEmpty semantics)
- empty / whitespace workspace name is a no-op (no `unknown@...` emails)
- nil map doesn't panic (defensive)
- slugify handles spaces / punctuation / edge hyphens / em-dashes

All 15 cases pass; platform build clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 00:45:26 -07:00
Hongming Wang
ce0e793673
feat(org-templates): Phase 1 — externalize prompt bodies to sibling files (#389)
Part 1 of 4 in the scalability refactor. Each role can now keep its
initial_prompt / idle_prompt / schedule prompts as sibling .md files
under files_dir/; inline YAML literals still work for backwards-compat.

## What changes

**Platform (org.go importer):**
- `OrgWorkspace` gains `InitialPromptFile`, `IdlePrompt`, `IdlePromptFile`,
  `IdleIntervalSeconds`. The idle_* fields were previously dropped by the
  org importer entirely — struct didn't declare them — which is why
  engineer idle_prompts never propagated from org.yaml to live /configs
  (I've been manually docker-cp'ing them in every maintenance cron).
- `OrgSchedule` gains `PromptFile`. Hourly/weekly cron prompts are the
  largest bodies in org.yaml (1-5 KB each) and get resolved at import
  time just like initial_prompt.
- `OrgDefaults` gains the same idle_* + *_file fields for org-wide fallback.
- New `resolvePromptRef(inline, fileRef, orgBaseDir, filesDir)` helper —
  the single chokepoint for inline-vs-file resolution. Inline wins when
  both are set. File refs route through `resolveInsideRoot` so a crafted
  ref can't escape the org template directory (same traversal defense as
  files_dir).
- `createWorkspaceTree` now injects idle_prompt + idle_interval_seconds
  into the workspace's config.yaml (previously missing — that's the
  second half of the idle-prompt propagation bug).

**Tests:**
- `org_prompt_ref_test.go` — 10 cases: inline-wins, file-read-when-empty,
  both-empty, defaults-level resolution, inline-template mode errors,
  traversal rejection (via file ref AND via files_dir), missing-file
  errors, and YAML-unmarshal parsing for each new field.

**Proof migration:**
- Documentation Specialist (biggest role at 6.9 KB of prompts) moves from
  inline YAML to `documentation-specialist/{initial-prompt.md,
  schedules/daily-docs-sync.md, schedules/weekly-terminology-audit.md}`.
- org.yaml drops 1801 → 1687 lines (-6.3%) from just this one role.

## Why this matters

org.yaml is 108 KB of which 67 KB (62%) is prompt text. At the current
12-role template size that's already unreadable; the marketing + triage-
operator additions pushed it to 1801 lines. The 4-phase refactor aims:

- **Phase 1 (this PR):** platform support + 1 role proof.
- **Phase 2:** migrate remaining ~20 roles to file refs. Target: org.yaml
  at ~600 lines of pure structural scaffolding.
- **Phase 3:** YAML `!include` preprocessor — split org.yaml into
  teams/{research,dev,marketing,ops}.yaml shards.
- **Phase 4:** per-workspace atomization — each role gets its own
  workspace.yaml manifest; org.yaml composes them.

## Backwards compatibility

- Inline `initial_prompt: |` / `prompt: |` / `idle_prompt: |` all still work.
- Missing `prompt_file` refs log + skip the schedule (not fatal) — fail
  loud so bugs surface during deployment rather than silent-drop.
- Inline-template mode (POST /org/import with raw JSON body, no `dir`)
  errors cleanly when a file ref is used — can't resolve files without a
  base dir, surface that rather than guessing.

## Test plan

- [x] `go build ./...` clean
- [x] `go test -run 'TestResolvePromptRef|TestOrgYAML' ./internal/handlers/`
      — 10 tests pass
- [x] `python -c "yaml.safe_load(...)"` on the edited org.yaml — parses
- [ ] Post-merge: deploy platform rebuild, run `POST /org/import` against
      a fresh workspace, verify Documentation Specialist's /configs/config.yaml
      contains the initial_prompt body and workspace_schedules rows contain
      the cron prompts (phantom-success check: grep the actual content, not
      just the row count).

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 00:32:09 -07:00
Hongming Wang
a9fdbe4185
fix(liveness): raise workspace TTL 60s → 180s to survive Opus synthesis (#386)
Problem observed 2026-04-16: Research Lead, Dev Lead, Security Auditor,
and UIUX Designer were being auto-restarted by the liveness monitor every
~30 minutes, even though their containers were healthy and processing
real work. A2A callers (PM, children agents) saw regular EOFs:

  A2A request to <leader-id> failed: Post http://ws-*:8000: EOF

Followed in platform logs by:

  Liveness: workspace <id> TTL expired
  Auto-restart: restarting <name> (was: offline)
  Provisioner: stopped and removed container ws-*

Root cause: the liveness key `ws:{id}` in Redis has a 60s TTL
(platform/internal/db/redis.go). The workspace heartbeat loop
(workspace-template/heartbeat.py) refreshes it every 30s. That leaves
room for exactly ONE missed heartbeat before expiry.

A busy Claude Code Opus synthesis can starve the container's asyncio
scheduler for 60-120s (the SDK spawns the claude CLI subprocess and
blocks until the message-reader yields; the heartbeat coroutine doesn't
run during that window). Leaders running 5-minute orchestrator pulses
or processing deep delegations routinely hit this. The platform then
mistakes a busy-but-healthy container for a dead one, marks it offline,
tears it down, and re-provisions — interrupting whatever work was mid-
synthesis and generating a cascade of EOF errors on pending A2A calls.

Fix: hoist the TTL into a named `LivenessTTL` constant and raise it to
180s. With a 30s heartbeat interval this now tolerates up to ~5 missed
beats before expiry — comfortably longer than any realistic Opus stall,
while still detecting genuinely-dead containers within 3 minutes.

Safety: real crashes are still caught immediately by a2a_proxy's reactive
IsRunning() check (maybeMarkContainerDead in a2a_proxy.go:439). That path
doesn't depend on TTL; it fires on the first failed forward. So this PR
only relaxes the "slow but alive" false-positive — dead-container
detection is unchanged.

Observed impact before fix (2026-04-16 ~06:40–06:49 UTC, 10-minute
window, 4 containers affected):

  | Container         | EOF errors | Forced restart |
  |-------------------|-----------:|:--------------:|
  | Dev Lead          | 5          | yes (06:48)    |
  | Research Lead     | 5          | yes (06:47)    |
  | Security Auditor  | 5          | yes (06:49)    |
  | UIUX Designer     | 4          | no (not yet)   |

Expected impact after merge + redeploy: drop to ~0 forced restarts on
healthy-busy leaders. If genuinely-stuck agents stop responding, the
IsRunning check still catches them on the next A2A forward.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 00:05:45 -07:00
Hongming Wang
0bcebff908
config(org): add Telegram to Dev Lead and Research Lead (#385)
* feat(adapters): add gemini-cli runtime adapter (closes #332)

Adds a `gemini-cli` workspace runtime backed by Google's Gemini CLI
(@google/gemini-cli, ~101k ★, Apache 2.0). Mirrors the claude-code
adapter pattern: Docker image installs the CLI, CLIAgentExecutor
drives the subprocess, A2A MCP tools wire via ~/.gemini/settings.json.

Changes:
- workspace-template/adapters/gemini_cli/ — new adapter (Dockerfile,
  adapter.py, __init__.py, requirements.txt); setup() seeds GEMINI.md
  from system-prompt.md and injects A2A MCP server into settings.json
- workspace-template/cli_executor.py — adds gemini-cli to
  RUNTIME_PRESETS (--yolo flag, -p prompt, --model, GEMINI_API_KEY env
  auth); adds mcp_via_settings preset flag to skip --mcp-config
  injection for runtimes that own their own settings file
- workspace-configs-templates/gemini-cli/ — default config.yaml +
  system-prompt.md template
- tests/test_adapters.py — adds gemini-cli to expected adapter set
- CLAUDE.md — documents new runtime row in the image table

Requires: GEMINI_API_KEY global secret. Build:
  bash workspace-template/build-all.sh gemini-cli

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(provisioner): add gemini-cli to RuntimeImages map

Without this entry, POST /workspaces with runtime:gemini-cli falls back
to workspace-template:langgraph (wrong image, missing gemini dep) instead
of workspace-template:gemini-cli. Every runtime MUST have an entry here.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* config(org): add Telegram to Dev Lead and Research Lead (closes #383)

Completes leadership-tier Telegram coverage:
  PM ✓ DevOps ✓ Security ✓ → Dev Lead ✓ Research Lead ✓

Both roles produce high-value async output (architecture decisions,
eco-watch summaries) that was invisible until the user polled the
canvas. Same bot_token/chat_id secrets as the other three roles —
no new credentials required.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: DevOps Engineer <devops@molecule.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 00:00:10 -07:00
Hongming Wang
52bdadbd6d
fix(security): forward Authorization header in transcript proxy (#405) (#380)
The platform's GET /workspaces/:id/transcript proxy was constructing the
outbound request without an Authorization header. The workspace's /transcript
endpoint (hardened in #287/#328) fails-closed when the header is absent,
so every transcript call in production returned 401 from the workspace.

Fix: after WorkspaceAuth validates the incoming bearer token, the handler
now forwards it verbatim via req.Header.Set("Authorization", ...).
Forwarding is safe — the token has already been validated by the middleware.

Tests:
- TestTranscript_ForwardsAuthHeader: was t.Skip'd as a bug marker; now
  active. Verifies the Authorization header reaches the workspace stub.
- TestTranscript_NoAuthHeader_PassesThrough: new. Verifies that a missing
  header produces no synthetic Authorization on the upstream call, and the
  workspace 401 is faithfully relayed.

Identified by QA audit 2026-04-16.

Co-authored-by: QA Engineer <qa-engineer@molecule-ai.internal>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 23:38:07 -07:00
PM Bot
e257cd80d4 chore(test): remove dead constants from wsauth_middleware_test.go (#358)
PR #357 deleted the grace-period tests that used hasLiveTokenQuery and
workspaceExistsQuery, but the constants themselves (and the stale comment
describing the old HasAnyLiveToken-based dispatch) were not removed.

Remove both dead const declarations and update the header comment to
reflect the strict-enforcement contract introduced by #357.

Closes #358.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 05:02:11 +00:00