Measured test counts (not guessed):
- Platform Go: 12 packages (was claiming 818 individual tests — now
reports package-level which is the go test output format)
- Canvas: 490 Vitest tests (33 files)
- workspace-template: 955 pytest tests (down from 1179 — 224 adapter-
specific tests moved to standalone template repos)
- molecule-app: 76 unit + 22 e2e (separate repo)
Architecture updates:
- CI section: documents manifest-driven Docker builds + reusable CI
workflows from molecule-ci repo for all 33 plugin/template repos
- Workspace Images section: already updated by prior PR (adapter repos)
- Test commands: accurate counts, standalone repo URLs with test counts
Code review fixes:
- 🟡#1: Replace python3 with jq in Dockerfile template stages (~50MB → ~2MB)
- 🟡#2: Add clone count verification to scripts/clone-manifest.sh
(set -e + expected vs actual count check — fails build if any clone fails)
- 🟡#3: Drop 'unsafe-eval' from CSP (not needed for Next.js production
standalone builds, only dev mode). Updated test assertion.
- 🟡#4: Remove broken pyproject.toml from workspace-template/ (it claimed
to package as molecule-ai-workspace-runtime but the directory structure
didn't match — the real package ships from the standalone repo)
- 🔵#1: Add version-pinning TODO comment to manifest.json
- 🔵#3: Add full repo URLs + test counts for SDK/MCP/CLI/runtime in CLAUDE.md
Security (GitGuardian alert):
- Removed Telegram bot token (8633739353:AA...) from template-molecule-dev
pm/.env — replaced with ${TELEGRAM_BOT_TOKEN} placeholder
- Removed Claude OAuth token (sk-ant-oat01-...) from template-molecule-dev
root .env — replaced with ${CLAUDE_CODE_OAUTH_TOKEN} placeholder
- Both tokens need immediate rotation by the operator
Tests: Platform middleware tests updated + all pass.
PR #471 removed Dockerfiles/requirements from adapters/ but left the
Python source files. This commit finishes the extraction:
1. Moved shared_runtime.py → workspace-template/shared_runtime.py
(used by prompt.py, a2a_executor.py, coordinator.py — not adapter-specific)
2. Moved base.py → workspace-template/adapter_base.py
(BaseAdapter + AdapterConfig — the interface adapters implement)
3. Updated imports in prompt.py, a2a_executor.py, coordinator.py
4. Rewritten adapters/__init__.py as a thin shim that:
- Reads ADAPTER_MODULE env var (production: standalone repos set this)
- Re-exports BaseAdapter/AdapterConfig for backward compat
5. adapters/base.py + adapters/shared_runtime.py remain as re-export shims
6. Deleted all 8 adapter subdirectories (autogen, claude_code, crewai,
deepagents, gemini_cli, hermes, langgraph, openclaw)
7. Removed 11 test files that imported adapter-specific code
Tests: 955 passed, 0 failed (down from 1216 — the difference is
adapter-specific tests that moved to standalone repos).
test_first_party_plugins.py, test_plugins_builtins_drift.py, and
test_hermes_adapter.py all referenced files under plugins/ and
adapters/ which were extracted to standalone repos. These tests
belong in those repos now, not in the core workspace-template.
1216 passed, 0 failed after removal.
These files have moved to the standalone template repos:
https://github.com/Molecule-AI/molecule-ai-workspace-template-<runtime>
Each adapter repo now has its own Dockerfile (FROM python:3.11-slim + pip install
molecule-ai-workspace-runtime) and requirements.txt. The adapter Python source
files (.py) stay in the monorepo for local development and testing.
Adapters removed from workspace-template/adapters/*/: Dockerfile, requirements.txt
Adapters retained: adapter.py, __init__.py (+ hermes extras: escalation.py, executor.py, providers.py)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Job-level `concurrency.cancel-in-progress: false` only prevents sibling jobs
from killing each other — it does not protect the parent workflow run from
being cancelled when a new push arrives. Every PR push was cancelling the
in-progress E2E run, forcing manual `gh run rerun` across 7+ active PRs.
Fix: move e2e-api into `.github/workflows/e2e-api.yml` with a workflow-level
concurrency group (`e2e-api-${{ github.ref }}`, cancel-in-progress: false).
New pushes now queue behind the running E2E job instead of cancelling it.
Fast jobs (platform-build, canvas-build, shellcheck, python-lint) stay in
ci.yml and retain normal run-level cancellation for quick iteration feedback.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove plugins/, workspace-configs-templates/, org-templates/ dirs (now
in standalone repos). Add manifest.json listing all 33 repos and
scripts/clone-manifest.sh to clone them. Both Dockerfiles now use the
manifest script instead of 33 hardcoded git-clone lines.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Introduce `memoryRecallMaxLimit = 50` constant and honour the `?limit=N`
query parameter in Search. Values above 50 are silently clamped to 50;
absent or invalid values default to 50. The LIMIT clause is now a
parameterised argument (nextArg pattern) instead of a hardcoded literal.
Three sqlmock tests verify the cap, the explicit limit, and the default.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Width was initialized to 480px on every render, so clicking a different
workspace node (which re-mounts SidePanel) discarded any resize the user
had done.
Fix:
- localStorage-backed useState initializer (SSR-safe typeof window guard)
- Validates the stored value: must be a finite integer ≥ 320px
- Persists the width in the mouseUp handler via a widthRef that stays in
sync with the live drag value — avoids spamming localStorage on every
pixel during the drag
- Extra guard: onMouseUp bails early if not actually dragging (prevents
spurious saves on unrelated window mouseup events)
- Named constants replace magic numbers 480 / 320
Tests: 5 new cases in SidePanel.tabs.test.tsx — default fallback, valid
saved value, too-small saved value, NaN saved value, drag-persist roundtrip.
Closes#425
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The platform Dockerfile COPYs paths relative to the repo root —
\`COPY platform/go.mod\`, \`COPY platform/migrations\`,
\`COPY workspace-configs-templates\`. The compose file was setting
\`context: ./platform\`, which silently caused those COPY layers to
miss + stop invalidating cache.
Symptom (caught 2026-04-16 10:22 UTC): after PR #417 (memory schema
migration 023) merged + I ran \`docker compose up -d --build platform\`,
the rebuild was a no-op. Image SHA didn't change, container booted with
old migration set, \`Applied 22 migrations\` instead of the expected 23.
Migration 023 file was on disk locally but never reached the image.
Workaround was \`docker build -t molecule-monorepo-platform:fresh -f
platform/Dockerfile .\` from repo root → SHA changed, migration 023
applied. This commit makes \`docker compose up -d --build platform\`
work correctly without the manual workaround.
CI workflow already builds with \`context: .\` + \`file: ./platform/Dockerfile\`
(per the comment at the top of platform/Dockerfile). This change just
aligns the local compose file with what CI does.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Every agent in the template currently uses the same GitHub PAT, so
\`gh pr list\` shows every PR as authored by the CEO's account with
no signal which agent opened each one. Commits already carry
per-agent authors (GIT_AUTHOR_NAME from #402). This wrapper extends
the identity split to the PR/issue metadata surface layer that
commit attribution can't reach.
## How it works
A tiny bash script installed at \`/usr/local/bin/gh\`, which sits
earlier in PATH than the real binary at \`/usr/bin/gh\`. For \`gh pr
create\` and \`gh issue create\`:
- Title gets prefixed with \`[Role Name]\` — e.g. \`[Frontend Engineer]
fix: canvas grid index\`
- Body gets \`\n\n---\n_Opened by: Molecule AI <Role>_\` appended
Role is read from \`GIT_AUTHOR_NAME\` which the platform provisioner
sets to \`Molecule AI <Role>\` (shipped with #402). Accepts both
\`--title X\` and \`--title=X\` forms. Same for \`--body\`.
Anything that isn't \`gh pr create\` or \`gh issue create\` (e.g.
\`gh pr list\`, \`gh issue view\`, \`gh run watch\`) passes through
untouched. No behaviour change for read-side operations.
## Idempotent
- If the title already starts with \`[...]\` the wrapper does not
re-prefix. \`gh pr edit\` flows that resubmit title won't layer
multiple tags.
- If the body already contains \`Opened by: Molecule AI\` the footer
is not re-appended.
## Fail-open
When \`GIT_AUTHOR_NAME\` is absent or doesn't start with \`Molecule
AI \`, the wrapper exec's the real gh with unchanged args. No call
is ever blocked by this script.
## Test coverage
\`tests/test_gh_wrapper.sh\` — 12 cases, no network, no Docker:
- Passthrough for non-create subcommands (pr list)
- pr create title prefix + body footer
- issue create with \`--title=X\` \`--body=X\` equals-form
- Idempotent title re-prefix
- Idempotent body footer (count = 1 after two applies)
- Missing GIT_AUTHOR_NAME → passthrough, title preserved
- Malformed GIT_AUTHOR_NAME (not "Molecule AI ...") → passthrough
All 12 pass. Test script is standalone bash + a temp fake gh binary
that echoes argv; safe to run in CI's Python Lint & Test job via
subprocess shell-out.
## Deployment note
This lands in the workspace image. Existing containers keep their
old /usr/bin/gh until the image is rebuilt and they're re-provisioned
(POST /workspaces/:id/restart {}). No migration required; the wrapper
just starts tagging PRs once the new image is rolled.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Part 4 of 4 — terminal step of the org.yaml scalability refactor. Each
role in the molecule-dev template now owns its own workspace.yaml file,
colocated with the existing system-prompt.md / initial-prompt.md /
idle-prompt.md / schedules/*.md. Team files shrink to a leader's own
definition plus a list of !include refs.
## Platform change
`resolveYAMLIncludes` now uses a TWO-ROOT model:
- Path resolution is relative to the INCLUDING file's directory
(natural sibling + cousin refs, C-include / Sass @import convention).
- Security bound is the ORIGINAL org root (`rootDir`), preserved across
all recursion depths. Sibling-dir refs like `../my-role/workspace.yaml`
from a team file are now allowed (they stay inside the org template);
refs that escape the root still error.
Regression coverage: new `TestResolveYAMLIncludes_SiblingDirAccess`
reproduces the Phase 4 pattern (team file at `teams/x.yaml` referencing
`../<role>/workspace.yaml`) — fails without the fix, passes with.
## Template change
Atomized 15 child workspaces across 3 team files:
- `teams/research.yaml`: 58 → 30 lines; 3 children now !include refs
- `teams/dev.yaml`: 222 → 38 lines; 6 children now !include refs
- `teams/marketing.yaml`: 143 → 28 lines; 6 children now !include refs
Each role now has `<role>/workspace.yaml` colocated with its prompts.
Example `frontend-engineer/` directory:
frontend-engineer/
├── workspace.yaml (24 lines — name/role/tier/canvas/plugins/...)
├── system-prompt.md (from earlier phases)
├── initial-prompt.md
├── idle-prompt.md
└── (no schedules for this role — but if added, schedules/<slug>.md)
## File-size progression across all 4 phases
| State | org.yaml | total `.yaml` in tree |
|---|---:|---:|
| Before (main) | 1801 lines / 108 KB | 1801 / 108 KB (one file) |
| After Phase 1 (#389) | 1687 | 1687 / 101 KB |
| After Phase 2 (#390) | 676 | 676 / 35 KB |
| After Phase 3 (#393) | 114 | 683 (1 + 6 teams) / 33 KB |
| **After this PR** | **114** | **~698** (1 + 6 + 15 workspace) / 35 KB |
Aggregate size is flat — the decrease came from prompt externalization
in Phases 1/2; Phases 3/4 reorganize structure without adding content.
The win is readability and ownership:
- Every individual file fits on 1-2 screens.
- Adding a new role is now: create `<role>/` dir, add `workspace.yaml`
+ `system-prompt.md` + prompts, add ONE `!include` line to the team
file. No touching of aggregated mega-YAML.
- Team files can be reviewed + merged independently.
## Tests
All 10 `TestResolveYAMLIncludes_*` tests pass, including the real-template
integration test (`TestResolveYAMLIncludes_RealMoleculeDev`) which now
walks org.yaml → teams/pm.yaml → teams/research.yaml → ../market-analyst/
workspace.yaml and validates the full 21-role tree unmarshals cleanly.
Plus all existing `TestResolvePromptRef` + `TestOrgYAML` + `TestInitialPrompt`
suites stay green.
## Ops followup
After merging all 4 phases and deploying, the `POST /org/import`
endpoint should produce a workspace tree byte-identical to the
pre-refactor state. Verify with:
diff <(curl POST /org/import before) <(curl POST /org/import after)
or by spot-checking:
- `/configs/config.yaml` bodies across all 21 workspaces
- `workspace_schedules.prompt` row values
The externalization is lossless — YAML literal to file and back
recovers the same string modulo trailing-whitespace normalization.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SecurityHeaders middleware widened its CSP to allow Next.js inline scripts
+ data:/blob: images (platform/internal/middleware/securityheaders.go:44,
canvas is reverse-proxied through the gin stack so it needs the permissive
policy). The two CSP asserts in securityheaders_test.go still hard-compared
against the old tight `default-src 'self'`, so they fail on main as of
this afternoon.
Fix: assert each expected CSP fragment is PRESENT in the header (substring
match) instead of byte-for-byte equality. Test intent is "CSP is set, starts
with tight default-src, contains the expected directives" — not "CSP matches
this exact string". Future subsource tuning (add a new CDN, bump blob:/data:
scope) won't re-break this test.
Caught because every PR touching anything in the monorepo currently fails
the Platform (Go) CI job on these two asserts. Fixing on a dedicated branch
so it can land ahead of every blocked PR in the queue.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Single-container tenant architecture: Go platform (:8080) + Canvas
Node.js (:3000) in one Fly machine, with Go's NoRoute handler reverse-
proxying non-API routes to the canvas. Browser only talks to :8080.
Changes:
platform/Dockerfile.tenant — multi-stage build (Go + Node + runtime).
Bakes workspace-configs-templates/ + org-templates/ into the image.
Build context: repo root.
platform/entrypoint-tenant.sh — starts both processes, kills both if
either exits. Fly health check on :8080 covers the Go binary; canvas
health is implicit (proxy returns 502 if canvas is down).
platform/internal/router/canvas_proxy.go — httputil.ReverseProxy that
forwards unmatched routes to CANVAS_PROXY_URL (http://localhost:3000).
Activated by NoRoute when CANVAS_PROXY_URL env is set.
platform/internal/router/router.go — wire NoRoute → canvasProxy when
CANVAS_PROXY_URL is present; no-op otherwise (local dev unchanged).
platform/internal/middleware/securityheaders.go — relaxed CSP to allow
Next.js inline scripts/styles/eval + WebSocket + data: URIs. The
strict `default-src 'self'` was blocking all canvas rendering.
canvas/src/lib/api.ts — changed `||` to `??` for NEXT_PUBLIC_PLATFORM_URL
so empty string means "same-origin" (combined image) instead of falling
back to localhost:8080.
canvas/src/components/tabs/TerminalTab.tsx — same `??` fix for WS URL.
Verified: tenant machine boots, canvas renders, 8 runtime templates +
4 org templates visible, API routes work through the same port.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 3 escalation ladder added `from .escalation import ...` to
executor.py. The phase-2 dispatch tests load executor.py via
`exec(compile(src, ...))` with the relative import rewritten — this
broke because (a) the rewrite didn't know about escalation and (b) the
exec namespace lacked `__name__`, which executor.py needs at import
time for `logging.getLogger(__name__)`.
Fix both in all 8 exec sites:
- Rewrite both `from .providers import` AND `from .escalation import`
- Pre-register escalation + providers in sys.modules under the fake
package name
- Seed the exec namespace with `__name__ = "hermes_executor_under_test"`
54/54 hermes tests pass (28 escalation truth-table + 6 ladder-integration
+ 20 existing phase-2 dispatch).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes the silent-overwrite hole where two agents racing a read-modify-
write on the same memory key left only one agent's update. Relevant for
orchestrators (PM, Dev Lead, Marketing Lead) keeping structured running
state (delegation-result ledgers, task queues) in memory, and for the
``research-backlog:*`` keys that multiple idle loops write in parallel.
## Semantics
### Back-compat path (no if_match_version)
Unchanged: ``INSERT ... ON CONFLICT UPDATE`` last-write-wins. Every
existing agent tool, every existing ``commit_memory`` call, every
existing cron that writes memory — all continue to work with no edit.
### Optimistic-lock path (if_match_version set)
1. Client calls ``GET /memory/:key`` → ``{value, version: V}``
2. Client modifies value locally
3. Client ``POST /memory {key, value, if_match_version: V}``
4. Server: ``UPDATE ... WHERE version = V`` + RETURNING new version
5. On match → 200 + ``{version: V+1}``
6. On mismatch → 409 + ``{expected_version: V, current_version: <actual>}``
7. Client reads the actual version and retries.
### Create-only marker
``if_match_version: 0`` means "create iff the key doesn't exist yet".
Two agents simultaneously seeding a shared key will see exactly one
success + one 409 — no silent collision, no duplicate-init work.
### Schema
Migration 023 adds ``version BIGINT NOT NULL DEFAULT 1``. Existing rows
baseline at 1. New rows start at 1. Every successful write (both paths)
increments: ``version = version + 1`` on update, ``1`` on insert.
## Why version, not updated_at
``updated_at`` has second-granularity and can collide between concurrent
writers on a fast clock. A monotonic counter is collision-free and more
readable in the 409 response body ("expected 5, current is 7 — you
missed 2 writes" tells an agent exactly what to re-read).
## Why ``if_match_version`` and not an ETag header
JSON field keeps it in the request body, visible alongside the value
payload. Agents assembling requests programmatically don't have to
remember to thread a header through their HTTP client wrapper; the
existing ``commit_memory`` tool can grow one optional kwarg and match
the existing signature shape.
## Tests
11 memory-handler cases covering every path:
- GET list / get (with version in response shape)
- Set with no version (back-compat upsert, returns new version)
- Set with if_match_version match (happy path, increment)
- Set with if_match_version mismatch (409 + expected/current fields)
- Set with if_match_version=0 on absent key (create-only success)
- Set with if_match_version=N on absent key (409 — caller's mental
model is wrong)
- Bad inputs (missing key, malformed JSON)
- Delete happy + error path
Full ``go test ./internal/handlers/`` green.
## Follow-up (not in this PR)
- Workspace-template tool update: ``commit_memory(content, *,
if_match_version=None)`` surfaces the new option + on 409 surfaces
the current_version so agents can retry without manual re-read.
- Named checkpoints table (``workspace_checkpoints``) for durable
orchestrator state snapshots. Different concern than per-key locking;
separate PR.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ships scoped Phase 3 of the Hermes multi-provider work. Every workspace
can now declare an ordered list of (provider, model) rungs; when the
pinned model hits rate-limit / 5xx / context-length / overload, the
executor advances to the next rung before raising.
## Why
3× Claude Max saturation is a routine occurrence now — the "first 429 on
a batch delegation" is the common path, not the exception. A workspace
pinned to Haiku that hits a context-length limit has no recovery today;
same for Sonnet hitting rate-limit mid-synthesis. Escalation promotes
to the next tier for that single call, preserves coordination, avoids
restart cascades.
## New module: adapters/hermes/escalation.py
- ``LadderRung(provider, model)`` — one config entry.
- ``parse_ladder(raw)`` — tolerant config parser; skips malformed rungs
with a warning rather than raising so boot stays resilient.
- ``should_escalate(exc) -> bool`` — truth table over 15+ error shapes:
- Typed classes (RateLimitError, OverloadedError, APITimeoutError,
APIConnectionError, InternalServerError)
- Context-length markers (each provider uses different phrasing)
- Gateway markers (502/503/504, overloaded, temporarily unavailable)
- Status-code substrings (429, 529, 5xx)
- Hard-rejects auth failures (401/403/invalid_api_key) even if the
outer exception class is RateLimitError — wrapping case matters.
## Executor wiring
``HermesA2AExecutor`` now accepts ``escalation_ladder`` in its
constructor + ``create_executor()`` factory. ``_do_inference()`` walks
the ladder:
1. First attempt = pinned provider:model (matches pre-ladder behaviour)
2. On escalatable error, try each rung in order
3. On non-escalatable error, raise immediately (auth, malformed payload)
4. On exhaustion, raise the last error
Rung switches temporarily rebind ``self.provider_cfg`` / ``self.model``
/ ``self.api_key`` / ``self.base_url`` in a try/finally, so any raised
error leaves the executor in its original state for the next call. Key
resolution for non-pinned rungs goes through ``resolve_provider`` which
reads the rung-provider's env vars fresh.
## Config shape
``config.yaml`` (rendered from ``org.yaml`` → workspace secrets):
runtime_config:
escalation_ladder:
- provider: gemini
model: gemini-2.5-flash
- provider: anthropic
model: claude-sonnet-4-5-20250929
- provider: anthropic
model: claude-opus-4-1-20250805
Empty / absent = single-shot behaviour, full backwards-compat with
every existing workspace.
## Tests
34 passing, all isolated (no network):
- ``test_hermes_escalation.py`` (28): parser + truth-table across
rate-limit, overload, context-length, gateway, auth-reject, unrelated
exceptions, and case-insensitivity.
- ``test_hermes_ladder_integration.py`` (6): no-ladder single call,
ladder-not-triggered on success, escalate-on-rate-limit-then-succeed,
stop-on-non-escalatable, raise-last-error-when-exhausted, skip-
unknown-provider-in-rung.
## Not in this PR
- Uncertainty-driven escalation (judge pass after successful reply).
- Per-workspace budget tracking (#305 covers this separately).
- Live streaming reuse across rungs (ladder retries the whole call).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes#399.
## Root cause
`publish-platform-image.yml` existed for the Go platform image but there
was no equivalent for the canvas. After every canvas PR merged, CI ran
`npm run build` and passed — but the live container at :3000 was never
updated. The `canvas-deploy-reminder` job only posted a comment asking
operators to manually rebuild, which was consistently missed.
## What this adds
- `.github/workflows/publish-canvas-image.yml`: triggers on `canvas/**`
changes to main (and `workflow_dispatch`). Mirrors the platform workflow:
macOS Keychain isolation, QEMU for linux/amd64, Buildx, GHCR push with
`:latest` + `:sha-<7>` tags.
- `NEXT_PUBLIC_PLATFORM_URL` / `NEXT_PUBLIC_WS_URL` resolve from
`workflow_dispatch` inputs → `CANVAS_PLATFORM_URL` / `CANVAS_WS_URL`
repo secrets → `localhost:8080` defaults (safe for self-hosted dev).
- Inputs are passed via env vars (not direct `${{ }}` interpolation) to
prevent shell injection from string inputs.
- `docker-compose.yml`: adds `image: ghcr.io/molecule-ai/canvas:latest`
to the canvas service so `docker compose pull canvas && docker compose
up -d canvas` applies the new image. `build:` is retained for local
development. Adds a comment clarifying that `NEXT_PUBLIC_*` runtime env
vars are ignored by the standalone bundle (build-time only).
- `ci.yml`: updates `canvas-deploy-reminder` commit comment to reference
`docker compose pull` as the fast path, with `docker compose build` as
the local-source fallback.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root cause of the 2026-04-16 09:10 UTC six-container restart cascade.
## Timeline
09:10:26 — PM sent a batch delegation to 15+ agents (Dev Lead coordinating).
09:10:26-27 — 4 leaders/auditors (Security, RL, BE, DevOps) simultaneously
hit "workspace agent unreachable — container restart triggered"
even though their containers were running fine. Another 2
(DL, UIUX) tripped in the next few seconds.
09:10:27 — Provisioner stopped + recreated 6 containers in parallel. A2A
callers got EOFs, PM's batch coordination stalled.
## Root cause
`provisioner.IsRunning` collapsed every ContainerInspect error into
`(false, nil)`, including transient Docker daemon hiccups:
func IsRunning(...) (bool, error) {
info, err := p.cli.ContainerInspect(ctx, name)
if err != nil {
return false, nil // Container doesn't exist ← MISREAD
}
return info.State.Running, nil
}
The comment said "Container doesn't exist" but the error was actually
any of: daemon timeout, socket EOF, context deadline, connection
refused. Under load (batch delegation fan-out → 15 concurrent HTTP
inbound → 15 concurrent Claude Code subprocesses → Docker daemon CPU
pressure), ContainerInspect calls started failing transiently. All 6
calls returned `(false, nil)`. Caller `maybeMarkContainerDead` treated
`running=false` as "container is dead, restart it" → six parallel
restarts. This was exactly the destructive-on-error pattern we keep
trying to kill (see #160 SDK-stderr-probe, #318 fail-open classes).
## Fix
`IsRunning` now distinguishes NotFound from transient errors:
- Legitimately missing container (caller deleted, Docker pruned) →
`(false, nil)` — safe to act on; caller marks dead + restarts.
- Any other error (daemon timeout, socket issue, context deadline) →
`(true, err)` — caller stays on the alive path. The transient error
is preserved so metrics + logging still see it, but it does NOT
trigger the destructive restart branch.
`isContainerNotFound` matches on error-message substring — same
approach docker/cli uses internally — to avoid pulling in errdefs as a
direct dep. Truth table tests in `isrunning_test.go` cover 8 cases:
NotFound variants (real + generic), nil, empty, and the 4 transient-
error shapes we've actually observed (deadline, EOF, connection-refused,
i/o timeout).
## Caller update
`maybeMarkContainerDead` in a2a_proxy.go now logs the transient inspect
error (was silently discarded via `_`). Visibility without
destructiveness. If this error becomes persistent, we'll see it in
platform logs rather than diagnosing after another restart cascade.
## Expected impact
- Zero restart cascades from the current class of transient inspect
errors (EOF, timeout, connection refused).
- Dead containers still detected within the A2A layer because an actual
stopped container returns NotFound on inspect, and the TTL monitor
(180s post #386) catches anything that slips through.
- New visibility in platform logs when inspect has trouble — previously
silent.
Combined with the TTL fix in #386, the defense-in-depth on spurious
restart is now:
1. IsRunning only returns false for real NotFound
2. Liveness TTL is 180s, surviving 5+ missed heartbeats
3. A2A proxy 503-Busy path retries with backoff before touching
restart logic at all
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both components use useState/useEffect/useCallback/useRef but were
missing the 'use client' directive. Without it Next.js App Router
renders them as server HTML — React never hydrates them and event
handlers are silently dropped.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Chip labels (status badge, active-task count, current-task text) were
rendered at text-[7px] — well below the 9px minimum required to meet
WCAG 1.4.3 readability. Raised all three to text-[9px] so the labels
are legible without magnification.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tenant machines were booting with no templates because the Dockerfile
only shipped the Go binary + migrations. The canvas showed "0 templates"
with an empty picker.
Changes:
- platform/Dockerfile: build context changed from ./platform to repo
root so COPY can reach workspace-configs-templates/ alongside the
Go source. COPY paths updated for platform/{go.mod,go.sum,*.go} and
platform/migrations/.
- .github/workflows/publish-platform-image.yml: context: . (was
./platform), paths trigger now includes workspace-configs-templates/
so template changes rebuild the image.
Phase A of the template-registry plan. Phase B adds a DB registry +
on-demand fetch for community templates (user pastes GitHub URL at
workspace creation time). The baked defaults always ship in the image
for zero-config tenant boot.
Verified: `docker build -f platform/Dockerfile -t test .` succeeds,
`docker run --rm test ls /workspace-configs-templates/` shows all 8
templates (autogen, claude-code-default, crewai, deepagents, gemini-cli,
hermes, langgraph, openclaw).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(a2a): add missing Authorization header to delegation and message calls
Three A2A client functions were missing the Bearer token on their HTTP calls
after the Phase 30.1 workspace-auth enforcement rollout:
1. send_a2a_message (a2a_client.py): POST to target workspace's /message/send
used WorkspaceAuth middleware that fails-closed on missing auth header.
Fix: headers=auth_headers() — auth_headers() already imported.
2. tool_delegate_task_async (a2a_tools.py): POST to platform /delegate endpoint
requires the caller's workspace bearer token since Phase 30.1.
Fix: headers=_auth_headers_for_heartbeat()
3. tool_check_task_status (a2a_tools.py): GET /delegations endpoint, same issue.
Fix: headers=_auth_headers_for_heartbeat()
tool_list_peers already uses _auth_headers_for_heartbeat() correctly —
that's why list_peers works while delegation returns 401/[A2A_ERROR].
Root cause of the multi-session A2A outage. PR #386 (TTL fix) addressed
the workspace-restart cascade; this fixes the underlying 401 on each call.
Closes#391
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(a2a): add missing auth headers to /activity and /notify endpoints
Two more Phase 30.1 regressions in a2a_tools.py found during send_message_to_user
debugging (it was returning 401):
- tool_report_activity: POST /workspaces/:id/activity missing headers
- tool_send_message_to_user: POST /workspaces/:id/notify missing headers
Both now use headers=_auth_headers_for_heartbeat() matching the pattern used
by commit_memory, recall_memory, and the heartbeat POST in the same file.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: PM (Molecule AI) <pm@molecule-ai.internal>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Part 3 of 4 in the scalability refactor. Adds YAML `!include` support
to the org importer and splits molecule-dev/org.yaml (676 lines post-
Phase 2) into 6 team / role files; top-level org.yaml drops to 114 lines
of pure scaffolding.
## Platform changes
New `platform/internal/handlers/org_include.go`:
- `resolveYAMLIncludes(data, baseDir)` — pre-processes a YAML document,
expanding any scalar tagged `!include <path>` with the parsed content
of the referenced file.
- Path resolution via `resolveInsideRoot` so a crafted `!include
../../etc/passwd` can't escape the org template directory (same
defense the existing `files_dir` copy uses).
- Nested includes supported: each included file carries its own search
root (its directory), so `teams/pm.yaml` with `!include research.yaml`
resolves to `teams/research.yaml` — matching the convention of
C-include / Sass @import / most package systems.
- Cycle detection via visited-set keyed on absolute path; belt-and-
braces `maxIncludeDepth = 16` cap in case symlinks or path
normalization defeats the set.
- Inline-template mode (POST /org/import with raw JSON body, no `dir`)
errors cleanly when a file ref is used — can't resolve without a
base.
Wired into both `ListTemplates` (so /org/templates shows an accurate
workspace count after the split) and `Import` (expansion happens before
unmarshal into OrgTemplate).
## Template changes
molecule-dev/org.yaml now contains only:
- name + description
- defaults (runtime, plugins, category_routing, initial_prompt text)
- `workspaces: [!include teams/pm.yaml, !include teams/marketing.yaml]`
New files:
- `teams/pm.yaml` — PM top-level, children are !include refs
- `teams/research.yaml` — Research Lead + Market Analyst + Technical
Researcher + Competitive Intelligence (inline children)
- `teams/dev.yaml` — Dev Lead + FE/BE/DevOps/Security/QA/UIUX (inline)
- `teams/marketing.yaml` — Marketing Lead + DevRel/PMM/Content/
Community/SEO/Social (inline)
- `teams/documentation-specialist.yaml` — leaf
- `teams/triage-operator.yaml` — leaf
## File-size impact
| State | org.yaml lines | total config size |
|---|---:|---:|
| Before (main) | 1801 | 108 KB |
| After Phase 1 (#389) | 1687 | 101 KB |
| After Phase 2 (#390) | 676 | 35 KB |
| After this PR | **114** | **4 KB** (org.yaml only) |
With the 6 team files (total ~570 lines of structural yaml), every file
is now under 230 lines and individually readable without scrolling past
a single team's boundaries.
## Tests
`platform/internal/handlers/org_include_test.go` — 9 cases:
- Flat include (single file, single workspace)
- Nested include (file → file → file)
- Traversal rejection (`../secret.yaml`, `../../secret.yaml`)
- Cycle detection (a↔b)
- Empty path error
- Missing file error
- Inline-template error (baseDir empty)
- No-op when YAML has no includes (safety: we always run the preprocessor)
- **Integration**: load the real `org-templates/molecule-dev/org.yaml`,
resolve includes, unmarshal into OrgTemplate, verify PM + Marketing
Lead are top-level and PM has ≥4 children after expansion.
All 9 pass + existing `TestResolvePromptRef` + `TestOrgYAML` suites stay
green.
## Ownership implication
Each team file can now be owned + reviewed independently. When the
marketing team adds a 7th role, the diff is in `teams/marketing.yaml`
alone — no merge conflicts against PM or research changes in the same
review window. Same for the eventual engineer team, security team, etc.
## What's next
- **Phase 4 (queued):** per-workspace atomization. Each role gets
`<role>/workspace.yaml`; team files shrink to a list of !include
refs. Terminal step in the scalability arc — at that point adding a
new role is one new file under `org-templates/molecule-dev/<role>/`
plus one line in the team's manifest.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Part 2 of 4 in the org.yaml scalability refactor. Follows PR #389 which
added platform support; this PR completes the migration for every role
in the `molecule-dev` template.
## Scope
All 20 remaining roles moved from inline YAML literals to sibling .md
files under their existing `files_dir`:
- PM, Research Lead, Dev Lead, Marketing Lead (4 leaders)
- Market Analyst, Technical Researcher, Competitive Intelligence (research)
- Frontend/Backend/DevOps Engineer, Security Auditor, QA Engineer, UIUX
Designer, Triage Operator (dev team)
- DevRel, PMM, Content Marketer, Community Manager, SEO Growth Analyst,
Social Media Brand (marketing team)
Per workspace, externalized (where present):
- `initial_prompt: |...` → `initial-prompt.md` + `initial_prompt_file:`
- `idle_prompt: |...` → `idle-prompt.md` + `idle_prompt_file:`
- `schedules[*].prompt: |...` → `schedules/<slug>.md` + `prompt_file:`
Totals: 17 initial-prompt files, 12 idle-prompt files, 18 schedule files
(47 new files).
## File-size impact
| Before (main) | After Phase 1 | After Phase 2 | Reduction |
|---|---|---|---|
| 1801 lines | 1687 lines | 676 lines | **-62.5%** |
| 108 KB | 101 KB | 35 KB | **-67%** |
org.yaml is now pure structural scaffolding (name / role / tier / model /
canvas / plugins / channels / children / category_routing / schedules
metadata). Readable end-to-end on one screen per team.
## How the migration was driven
A Python round-trip script (using `ruamel.yaml` to preserve comments +
formatting) walked the workspace tree recursively, wrote prompts to
files keyed by `files_dir`, and replaced inline keys with `*_file:` refs.
Zero manual YAML hand-editing beyond the Phase 1 Documentation Specialist
proof. Script is one-shot; not committed.
Slug convention for schedule files: lowercase the schedule name, replace
non-alphanumeric with `-`, collapse, cap 60 chars. Examples:
- "Orchestrator pulse" → `orchestrator-pulse.md`
- "Hourly template fitness audit" → `hourly-template-fitness-audit.md`
- "Code quality audit (every 12h)" → `code-quality-audit-every-12h.md`
## Backwards compatibility
Fully compatible — Phase 1's resolver prefers inline when both are set,
so a future one-off experiment can still drop inline YAML. The migration
doesn't remove inline support, just stops using it.
## Verification
- [x] `python -c "yaml.safe_load(...)"` on edited org.yaml — parses clean
- [x] Walk-and-inspect script: every workspace has exactly the expected
`*_file:` refs, zero `INLINE_*` markers remain
- [x] All 47 extracted .md files non-empty + trimmed
- [x] `go test -run 'TestResolvePromptRef|TestOrgYAML|TestInitialPrompt'`
passes (from Phase 1 platform work)
- [ ] Post-merge: live `POST /org/import` against a fresh workspace,
diff the resulting `/configs/config.yaml` + `workspace_schedules`
rows against the pre-migration values (should be identical bodies)
## What's next
- **Phase 3 (queued):** YAML `!include` directive for org.yaml; split the
remaining 676 lines into `teams/{research,dev,marketing,ops}.yaml`.
- **Phase 4 (queued):** per-workspace atomization; each role owns its
own `workspace.yaml` manifest.
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>