The CP provisioner calls POST /cp/workspaces/provision which now
creates EC2 instances (not Fly Machines). The tenant platform
auto-activates this when MOLECULE_ORG_ID is set.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No env vars to configure. The platform auto-detects the backend:
MOLECULE_ORG_ID set → SaaS tenant → control plane provisioner
MOLECULE_ORG_ID empty → self-hosted → Docker provisioner
The control plane URL defaults to https://api.moleculesai.app (override
with CP_PROVISION_URL for testing). No FLY_API_TOKEN on the tenant.
Removed: direct Fly provisioner (FlyProvisioner) — all SaaS workspace
provisioning goes through the control plane which holds the Fly token
and manages billing, quotas, and cleanup.
Two backends: CPProvisioner (SaaS) and DockerProvisioner (self-hosted).
Closes #494
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When CONTAINER_BACKEND=flyio, workspaces are provisioned as Fly Machines
instead of local Docker containers. This enables workspace deployment
on SaaS tenants where no Docker daemon is available.
New files:
- provisioner/fly_provisioner.go: FlyProvisioner with Start/Stop/
IsRunning/Restart/Close via Fly Machines API (api.machines.dev/v1)
- FlyRuntimeImages maps runtimes to GHCR image tags
Changes:
- main.go: select Docker vs Fly based on CONTAINER_BACKEND env var
- workspace.go: SetFlyProvisioner() setter, Create checks flyProv first
- workspace_provision.go: provisionWorkspaceFly() loads secrets, calls
FlyProvisioner.Start, issues auth token for the new machine
Env vars for Fly backend:
- CONTAINER_BACKEND=flyio (activates Fly provisioner)
- FLY_API_TOKEN (Fly deploy token)
- FLY_WORKSPACE_APP (Fly app name for workspace machines)
- FLY_REGION (default: ord)
Closes #494
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two adjacent fixes that surfaced while bringing the molecule-dev org
template back up against the new standalone workspace-template-* repos.
1) handlers/org.go — expand ${VAR} in workspace_dir before validation.
The molecule-dev pm/workspace.yaml (and any operator's per-host
binding) ships `workspace_dir: ${WORKSPACE_DIR}` so each operator
can pick the host path PM bind-mounts. Without expansion the literal
"${WORKSPACE_DIR}" string reaches validateWorkspaceDir and fails with
"must be an absolute path", aborting the whole org import.
Other fields (channel config, prompts) already go through expandWithEnv;
workspace_dir was the last hold-out.
2) provisioner/provisioner.go — inject PYTHONPATH=/app for every
workspace container. Standalone template Dockerfiles COPY adapter.py
to /app and set ENV ADAPTER_MODULE=adapter, but molecule-runtime is
a pip console_script entry point so cwd isn't on sys.path
automatically. Setting PYTHONPATH here fixes every adapter image at
once instead of needing 8 PRs against template repos. Operator
override still wins (workspace EnvVars are appended after, so Docker
takes the later duplicate).
Note: this unblocks the import path but does NOT make claude-code /
hermes / etc. boot. The runtime itself has a separate top-level
`from adapters import` that breaks against modular templates —
tracked at workspace-runtime#1.
Tests: TestBuildContainerEnv_InjectsPYTHONPATH +
TestBuildContainerEnv_WorkspaceEnvVarsCanOverridePYTHONPATH lock the
default + operator-override invariants. expandWithEnv is already covered
by TestExpandWithEnv_* — the workspace_dir use site is a one-line call
to that primitive.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause of the 2026-04-16 09:10 UTC six-container restart cascade.
## Timeline
09:10:26 — PM sent a batch delegation to 15+ agents (Dev Lead coordinating).
09:10:26-27 — 4 leaders/auditors (Security, RL, BE, DevOps) simultaneously
hit "workspace agent unreachable — container restart triggered"
even though their containers were running fine. Another 2
(DL, UIUX) tripped in the next few seconds.
09:10:27 — Provisioner stopped + recreated 6 containers in parallel. A2A
callers got EOFs, PM's batch coordination stalled.
## Root cause
`provisioner.IsRunning` collapsed every ContainerInspect error into
`(false, nil)`, including transient Docker daemon hiccups:
    func IsRunning(...) (bool, error) {
        info, err := p.cli.ContainerInspect(ctx, name)
        if err != nil {
            return false, nil // Container doesn't exist ← MISREAD
        }
        return info.State.Running, nil
    }
The comment said "Container doesn't exist" but the error was actually
any of: daemon timeout, socket EOF, context deadline, connection
refused. Under load (batch delegation fan-out → 15 concurrent inbound
HTTP requests → 15 concurrent Claude Code subprocesses → Docker daemon
CPU pressure), ContainerInspect calls started failing transiently. All 6
calls returned `(false, nil)`. Caller `maybeMarkContainerDead` treated
`running=false` as "container is dead, restart it" → six parallel
restarts. This was exactly the destructive-on-error pattern we keep
trying to kill (see #160 SDK-stderr-probe, #318 fail-open classes).
## Fix
`IsRunning` now distinguishes NotFound from transient errors:
- Legitimately missing container (caller deleted, Docker pruned) →
`(false, nil)` — safe to act on; caller marks dead + restarts.
- Any other error (daemon timeout, socket issue, context deadline) →
`(true, err)` — caller stays on the alive path. The transient error
is preserved so metrics + logging still see it, but it does NOT
trigger the destructive restart branch.
`isContainerNotFound` matches on error-message substring — same
approach docker/cli uses internally — to avoid pulling in errdefs as a
direct dep. Truth table tests in `isrunning_test.go` cover 8 cases:
NotFound variants (real + generic), nil, empty, and the 4 transient-
error shapes we've actually observed (deadline, EOF, connection-refused,
i/o timeout).
## Caller update
`maybeMarkContainerDead` in a2a_proxy.go now logs the transient inspect
error (was silently discarded via `_`). Visibility without
destructiveness. If this error becomes persistent, we'll see it in
platform logs rather than diagnosing after another restart cascade.
## Expected impact
- Zero restart cascades from the current class of transient inspect
errors (EOF, timeout, connection refused).
- Dead containers still detected within the A2A layer because an actual
stopped container returns NotFound on inspect, and the TTL monitor
(180s post #386) catches anything that slips through.
- New visibility in platform logs when inspect has trouble — previously
silent.
Combined with the TTL fix in #386, the defense-in-depth on spurious
restart is now:
1. IsRunning only returns false for real NotFound
2. Liveness TTL is 180s, surviving 5+ missed heartbeats
3. A2A proxy 503-Busy path retries with backoff before touching
restart logic at all
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(adapters): add gemini-cli runtime adapter (closes #332)
Adds a `gemini-cli` workspace runtime backed by Google's Gemini CLI
(@google/gemini-cli, ~101k ★, Apache 2.0). Mirrors the claude-code
adapter pattern: Docker image installs the CLI, CLIAgentExecutor
drives the subprocess, A2A MCP tools wire via ~/.gemini/settings.json.
Changes:
- workspace-template/adapters/gemini_cli/ — new adapter (Dockerfile,
adapter.py, __init__.py, requirements.txt); setup() seeds GEMINI.md
from system-prompt.md and injects A2A MCP server into settings.json
- workspace-template/cli_executor.py — adds gemini-cli to
RUNTIME_PRESETS (--yolo flag, -p prompt, --model, GEMINI_API_KEY env
auth); adds mcp_via_settings preset flag to skip --mcp-config
injection for runtimes that own their own settings file
- workspace-configs-templates/gemini-cli/ — default config.yaml +
system-prompt.md template
- tests/test_adapters.py — adds gemini-cli to expected adapter set
- CLAUDE.md — documents new runtime row in the image table
Requires: GEMINI_API_KEY global secret. Build:
bash workspace-template/build-all.sh gemini-cli
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(provisioner): add gemini-cli to RuntimeImages map
Without this entry, POST /workspaces with runtime:gemini-cli falls back
to workspace-template:langgraph (wrong image, missing gemini dep) instead
of workspace-template:gemini-cli. Every runtime MUST have an entry here.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* config(org): add Telegram to Dev Lead and Research Lead (closes #383)
Completes leadership-tier Telegram coverage:
PM ✓ DevOps ✓ Security ✓ → Dev Lead ✓ Research Lead ✓
Both roles produce high-value async output (architecture decisions,
eco-watch summaries) that was invisible until the user polled the
canvas. Same bot_token/chat_id secrets as the other three roles —
no new credentials required.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: DevOps Engineer <devops@molecule.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Resolves #14. ApplyTierConfig now reads TIER{2,3,4}_MEMORY_MB and
TIER{2,3,4}_CPU_SHARES env vars, falling back to the compiled defaults
agreed in the issue:
- T2: 512 MiB / 1024 shares (1 CPU) — unchanged baseline
- T3: 2048 MiB / 2048 shares (2 CPU) — new cap (previously uncapped)
- T4: 4096 MiB / 4096 shares (4 CPU) — new cap (previously uncapped)
CPU_SHARES follows Docker's 1024 = 1 CPU convention; internally the
value is translated to NanoCPUs for a hard allocation so behaviour
remains deterministic across hosts. Malformed or non-positive env
values silently fall back to the default.
Behaviour change note: T3 and T4 previously had no explicit cap.
Operators who relied on unlimited can set very large TIERn_MEMORY_MB /
TIERn_CPU_SHARES values; a follow-up can add unset-means-unlimited
semantics if required.
Tests:
- TestGetTierMemoryMB_DefaultsMatchLegacy
- TestGetTierMemoryMB_EnvOverride (covers malformed + zero fallback)
- TestGetTierCPUShares_EnvOverride
- TestApplyTierConfig_T3_UsesEnvOverride (wiring)
- TestApplyTierConfig_T3_DefaultCap (documents the new cap)
Docs: .env.example section + CLAUDE.md platform env-vars list updated.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolves #12. The claude-code SDK stores conversations in
/root/.claude/sessions/ and Postgres tracks current_session_id, but the
container filesystem was recreated on every restart — next agent message
failed with "No conversation found with session ID: <uuid>".
Add a per-workspace named Docker volume (ws-<id>-claude-sessions) mounted
read-write at /root/.claude/sessions. Gated by runtime=claude-code so
other runtimes don't pay for a path they don't use. Volume is cleaned up
in RemoveVolume alongside the config volume.
Two opt-outs discard the volume before restart for a fresh session:
- env WORKSPACE_RESET_SESSION=1 on the container
- POST /workspaces/:id/restart?reset=true (or {"reset": true} body)
Plumbed via new ResetClaudeSession field on WorkspaceConfig +
provisionWorkspaceOpts helper so the flag stays request-scoped (not
persisted on CreateWorkspacePayload).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolves #17.
Part A: scripts/cleanup-rogue-workspaces.sh deletes workspaces whose id
or name starts with known test placeholder prefixes (aaaaaaaa-, etc.)
and force-removes the paired Docker container. Documented in
tests/README.md.
Part B: add a pre-flight check in provisionWorkspace() — when neither a
template path nor in-memory configFiles supplies config.yaml, probe the
existing named volume via a throwaway alpine container. If the volume
lacks config.yaml, mark the workspace status='failed' with a clear
last_sample_error instead of handing it to Docker's unless-stopped
restart policy (which otherwise loops forever on FileNotFoundError).
New pure helper provisioner.ValidateConfigSource + unit tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>