forked from molecule-ai/molecule-core
09e99a09c6
3663 Commits
# 10ea5062a1: Merge pull request #407 from Molecule-AI/fix/bake-templates-into-platform-image

fix(ops): bake templates into platform Docker image
# a363b56f25: feat(tenant): combined platform + canvas Docker image with reverse proxy

Single-container tenant architecture: Go platform (:8080) + Canvas Node.js (:3000) in one Fly machine, with Go's NoRoute handler reverse-proxying non-API routes to the canvas. The browser only talks to :8080.

Changes:
- platform/Dockerfile.tenant — multi-stage build (Go + Node + runtime). Bakes workspace-configs-templates/ + org-templates/ into the image. Build context: repo root.
- platform/entrypoint-tenant.sh — starts both processes, kills both if either exits. Fly health check on :8080 covers the Go binary; canvas health is implicit (the proxy returns 502 if the canvas is down).
- platform/internal/router/canvas_proxy.go — httputil.ReverseProxy that forwards unmatched routes to CANVAS_PROXY_URL (http://localhost:3000). Activated by NoRoute when the CANVAS_PROXY_URL env is set.
- platform/internal/router/router.go — wires NoRoute → canvasProxy when CANVAS_PROXY_URL is present; no-op otherwise (local dev unchanged).
- platform/internal/middleware/securityheaders.go — relaxed CSP to allow Next.js inline scripts/styles/eval + WebSocket + data: URIs. The strict `default-src 'self'` was blocking all canvas rendering.
- canvas/src/lib/api.ts — changed `||` to `??` for NEXT_PUBLIC_PLATFORM_URL so an empty string means "same-origin" (combined image) instead of falling back to localhost:8080.
- canvas/src/components/tabs/TerminalTab.tsx — same `??` fix for the WS URL.

Verified: tenant machine boots, canvas renders, 8 runtime templates + 4 org templates visible, API routes work through the same port.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
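
A minimal sketch of the wiring described above, assuming a gin-style router (the NoRoute handler naming suggests gin); the function name and structure are illustrative rather than the actual router.go:

```go
package router

import (
	"net/http/httputil"
	"net/url"
	"os"

	"github.com/gin-gonic/gin"
)

// wireCanvasProxy is an illustrative sketch: when CANVAS_PROXY_URL is set,
// any route the API router does not match is reverse-proxied to the canvas
// process; otherwise unmatched routes fall through to the normal 404.
func wireCanvasProxy(r *gin.Engine) error {
	target := os.Getenv("CANVAS_PROXY_URL") // e.g. http://localhost:3000
	if target == "" {
		return nil // local dev: no canvas in this process, behaviour unchanged
	}
	u, err := url.Parse(target)
	if err != nil {
		return err
	}
	proxy := httputil.NewSingleHostReverseProxy(u)
	r.NoRoute(func(c *gin.Context) {
		proxy.ServeHTTP(c.Writer, c.Request)
	})
	return nil
}
```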
# f76ddba0f5: fix(tests): test_hermes_phase2_dispatch exec-load needs escalation + __name__

Phase 3 escalation ladder added `from .escalation import ...` to executor.py. The phase-2 dispatch tests load executor.py via `exec(compile(src, ...))` with the relative import rewritten — this broke because (a) the rewrite didn't know about escalation and (b) the exec namespace lacked `__name__`, which executor.py needs at import time for `logging.getLogger(__name__)`.

Fix both in all 8 exec sites:
- Rewrite both `from .providers import` AND `from .escalation import`
- Pre-register escalation + providers in sys.modules under the fake package name
- Seed the exec namespace with `__name__ = "hermes_executor_under_test"`

54/54 hermes tests pass (28 escalation truth-table + 6 ladder-integration + 20 existing phase-2 dispatch).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# ba1e45b27f: feat(memory): optimistic-locking via if_match_version on workspace_memory writes

Closes the silent-overwrite hole where two agents racing a read-modify-
write on the same memory key left only one agent's update. Relevant for
orchestrators (PM, Dev Lead, Marketing Lead) keeping structured running
state (delegation-result ledgers, task queues) in memory, and for the
``research-backlog:*`` keys that multiple idle loops write in parallel.
## Semantics
### Back-compat path (no if_match_version)
Unchanged: ``INSERT ... ON CONFLICT UPDATE`` last-write-wins. Every
existing agent tool, every existing ``commit_memory`` call, every
existing cron that writes memory — all continue to work with no edit.
### Optimistic-lock path (if_match_version set)
1. Client calls ``GET /memory/:key`` → ``{value, version: V}``
2. Client modifies value locally
3. Client ``POST /memory {key, value, if_match_version: V}``
4. Server: ``UPDATE ... WHERE version = V`` + RETURNING new version
5. On match → 200 + ``{version: V+1}``
6. On mismatch → 409 + ``{expected_version: V, current_version: <actual>}``
7. Client reads the actual version and retries.
### Create-only marker
``if_match_version: 0`` means "create iff the key doesn't exist yet".
Two agents simultaneously seeding a shared key will see exactly one
success + one 409 — no silent collision, no duplicate-init work.
### Schema
Migration 023 adds ``version BIGINT NOT NULL DEFAULT 1``. Existing rows
baseline at 1. New rows start at 1. Every successful write (both paths)
increments: ``version = version + 1`` on update, ``1`` on insert.
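
A minimal sketch of the version-checked write described above, in Go with database/sql; the table and column names are taken from this description, and the function shape is an assumption, not the actual handler:

```go
package handlers

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
)

// setMemoryWithVersion sketches the optimistic-lock write path. The table and
// column names come from the description above; the real handler and
// migration 023 may differ in detail.
func setMemoryWithVersion(ctx context.Context, db *sql.DB, wsID, key, value string, ifMatch int64) (int64, error) {
	if ifMatch == 0 {
		// Create-only marker: insert iff the key does not exist yet.
		var v int64
		err := db.QueryRowContext(ctx, `
			INSERT INTO workspace_memory (workspace_id, key, value, version)
			VALUES ($1, $2, $3, 1)
			ON CONFLICT (workspace_id, key) DO NOTHING
			RETURNING version`, wsID, key, value).Scan(&v)
		if errors.Is(err, sql.ErrNoRows) {
			return 0, fmt.Errorf("conflict: key already exists") // handler maps this to 409
		}
		return v, err
	}
	// Update only if the caller's version still matches; bump on success.
	var v int64
	err := db.QueryRowContext(ctx, `
		UPDATE workspace_memory
		   SET value = $2, version = version + 1
		 WHERE workspace_id = $1 AND key = $3 AND version = $4
		RETURNING version`, wsID, value, key, ifMatch).Scan(&v)
	if errors.Is(err, sql.ErrNoRows) {
		// Key missing or version moved on: the handler returns 409 with
		// expected_version and the row's current version.
		return 0, fmt.Errorf("conflict: expected version %d", ifMatch)
	}
	return v, err
}
```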
## Why version, not updated_at
``updated_at`` has second-granularity and can collide between concurrent
writers on a fast clock. A monotonic counter is collision-free and more
readable in the 409 response body ("expected 5, current is 7 — you
missed 2 writes" tells an agent exactly what to re-read).
## Why ``if_match_version`` and not an ETag header
JSON field keeps it in the request body, visible alongside the value
payload. Agents assembling requests programmatically don't have to
remember to thread a header through their HTTP client wrapper; the
existing ``commit_memory`` tool can grow one optional kwarg and match
the existing signature shape.
## Tests
11 memory-handler cases covering every path:
- GET list / get (with version in response shape)
- Set with no version (back-compat upsert, returns new version)
- Set with if_match_version match (happy path, increment)
- Set with if_match_version mismatch (409 + expected/current fields)
- Set with if_match_version=0 on absent key (create-only success)
- Set with if_match_version=N on absent key (409 — caller's mental
model is wrong)
- Bad inputs (missing key, malformed JSON)
- Delete happy + error path
Full ``go test ./internal/handlers/`` green.
## Follow-up (not in this PR)
- Workspace-template tool update: ``commit_memory(content, *,
if_match_version=None)`` surfaces the new option + on 409 surfaces
the current_version so agents can retry without manual re-read.
- Named checkpoints table (``workspace_checkpoints``) for durable
orchestrator state snapshots. Different concern than per-key locking;
separate PR.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# 7d732cec3c: feat(hermes): escalation ladder — promote to stronger models on transient failure

Ships scoped Phase 3 of the Hermes multi-provider work. Every workspace
can now declare an ordered list of (provider, model) rungs; when the
pinned model hits rate-limit / 5xx / context-length / overload, the
executor advances to the next rung before raising.
## Why
3× Claude Max saturation is a routine occurrence now — the "first 429 on
a batch delegation" is the common path, not the exception. A workspace
pinned to Haiku that hits a context-length limit has no recovery today;
same for Sonnet hitting rate-limit mid-synthesis. Escalation promotes
to the next tier for that single call, preserves coordination, avoids
restart cascades.
## New module: adapters/hermes/escalation.py
- ``LadderRung(provider, model)`` — one config entry.
- ``parse_ladder(raw)`` — tolerant config parser; skips malformed rungs
with a warning rather than raising so boot stays resilient.
- ``should_escalate(exc) -> bool`` — truth table over 15+ error shapes:
- Typed classes (RateLimitError, OverloadedError, APITimeoutError,
APIConnectionError, InternalServerError)
- Context-length markers (each provider uses different phrasing)
- Gateway markers (502/503/504, overloaded, temporarily unavailable)
- Status-code substrings (429, 529, 5xx)
- Hard-rejects auth failures (401/403/invalid_api_key) even if the
outer exception class is RateLimitError — wrapping case matters.
## Executor wiring
``HermesA2AExecutor`` now accepts ``escalation_ladder`` in its
constructor + ``create_executor()`` factory. ``_do_inference()`` walks
the ladder:
1. First attempt = pinned provider:model (matches pre-ladder behaviour)
2. On escalatable error, try each rung in order
3. On non-escalatable error, raise immediately (auth, malformed payload)
4. On exhaustion, raise the last error
Rung switches temporarily rebind ``self.provider_cfg`` / ``self.model``
/ ``self.api_key`` / ``self.base_url`` in a try/finally, so any raised
error leaves the executor in its original state for the next call. Key
resolution for non-pinned rungs goes through ``resolve_provider`` which
reads the rung-provider's env vars fresh.
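
The escalation module itself is Python (adapters/hermes/escalation.py). Purely as an illustration of the ladder walk described above, a compact Go sketch with invented names and a much-reduced truth table:

```go
package hermes

import "strings"

// Rung mirrors the (provider, model) entries of the escalation ladder.
type Rung struct{ Provider, Model string }

// shouldEscalate is a drastically reduced stand-in for the truth table:
// escalate on transient shapes (rate limit, overload, 5xx, context length),
// never on auth failures, even when they arrive wrapped in other errors.
func shouldEscalate(err error) bool {
	msg := strings.ToLower(err.Error())
	for _, auth := range []string{"401", "403", "invalid_api_key"} {
		if strings.Contains(msg, auth) {
			return false
		}
	}
	for _, transient := range []string{"429", "529", "rate limit", "overloaded", "context length", "502", "503", "504"} {
		if strings.Contains(msg, transient) {
			return true
		}
	}
	return false
}

// runWithLadder walks the ladder: pinned model first, then each rung in
// order; non-escalatable errors raise immediately, and the last error is
// returned when every rung is exhausted.
func runWithLadder(call func(Rung) error, pinned Rung, ladder []Rung) error {
	var lastErr error
	for _, rung := range append([]Rung{pinned}, ladder...) {
		err := call(rung)
		if err == nil {
			return nil
		}
		if !shouldEscalate(err) {
			return err // auth failure / malformed payload: fail fast
		}
		lastErr = err
	}
	return lastErr
}
```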
## Config shape
``config.yaml`` (rendered from ``org.yaml`` → workspace secrets):

    runtime_config:
      escalation_ladder:
        - provider: gemini
          model: gemini-2.5-flash
        - provider: anthropic
          model: claude-sonnet-4-5-20250929
        - provider: anthropic
          model: claude-opus-4-1-20250805
Empty / absent = single-shot behaviour, full backwards-compat with
every existing workspace.
## Tests
34 passing, all isolated (no network):
- ``test_hermes_escalation.py`` (28): parser + truth-table across
rate-limit, overload, context-length, gateway, auth-reject, unrelated
exceptions, and case-insensitivity.
- ``test_hermes_ladder_integration.py`` (6): no-ladder single call,
ladder-not-triggered on success, escalate-on-rate-limit-then-succeed,
stop-on-non-escalatable, raise-last-error-when-exhausted, skip-
unknown-provider-in-rung.
## Not in this PR
- Uncertainty-driven escalation (judge pass after successful reply).
- Per-workspace budget tracking (#305 covers this separately).
- Live streaming reuse across rungs (ladder retries the whole call).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# c928b4cbe8: feat(ci): auto-publish canvas Docker image to GHCR on canvas/** merges

Closes #399.

## Root cause
`publish-platform-image.yml` existed for the Go platform image but there was no equivalent for the canvas. After every canvas PR merged, CI ran `npm run build` and passed — but the live container at :3000 was never updated. The `canvas-deploy-reminder` job only posted a comment asking operators to manually rebuild, which was consistently missed.

## What this adds
- `.github/workflows/publish-canvas-image.yml`: triggers on `canvas/**` changes to main (and `workflow_dispatch`). Mirrors the platform workflow: macOS Keychain isolation, QEMU for linux/amd64, Buildx, GHCR push with `:latest` + `:sha-<7>` tags.
- `NEXT_PUBLIC_PLATFORM_URL` / `NEXT_PUBLIC_WS_URL` resolve from `workflow_dispatch` inputs → `CANVAS_PLATFORM_URL` / `CANVAS_WS_URL` repo secrets → `localhost:8080` defaults (safe for self-hosted dev).
- Inputs are passed via env vars (not direct `${{ }}` interpolation) to prevent shell injection from string inputs.
- `docker-compose.yml`: adds `image: ghcr.io/molecule-ai/canvas:latest` to the canvas service so `docker compose pull canvas && docker compose up -d canvas` applies the new image. `build:` is retained for local development. Adds a comment clarifying that `NEXT_PUBLIC_*` runtime env vars are ignored by the standalone bundle (build-time only).
- `ci.yml`: updates `canvas-deploy-reminder` commit comment to reference `docker compose pull` as the fast path, with `docker compose build` as the local-source fallback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# 1883f001bf: fix(provisioner): IsRunning conservative on daemon errors to stop restart cascade

Root cause of the 2026-04-16 09:10 UTC six-container restart cascade.
## Timeline
- 09:10:26 — PM sent a batch delegation to 15+ agents (Dev Lead coordinating).
- 09:10:26-27 — 4 leaders/auditors (Security, RL, BE, DevOps) simultaneously hit "workspace agent unreachable — container restart triggered" even though their containers were running fine. Another 2 (DL, UIUX) tripped in the next few seconds.
- 09:10:27 — Provisioner stopped + recreated 6 containers in parallel. A2A callers got EOFs, PM's batch coordination stalled.
## Root cause
`provisioner.IsRunning` collapsed every ContainerInspect error into
`(false, nil)`, including transient Docker daemon hiccups:

    func IsRunning(...) (bool, error) {
        info, err := p.cli.ContainerInspect(ctx, name)
        if err != nil {
            return false, nil // Container doesn't exist ← MISREAD
        }
        return info.State.Running, nil
    }
The comment said "Container doesn't exist" but the error was actually
any of: daemon timeout, socket EOF, context deadline, connection
refused. Under load (batch delegation fan-out → 15 concurrent HTTP
inbound → 15 concurrent Claude Code subprocesses → Docker daemon CPU
pressure), ContainerInspect calls started failing transiently. All 6
calls returned `(false, nil)`. Caller `maybeMarkContainerDead` treated
`running=false` as "container is dead, restart it" → six parallel
restarts. This was exactly the destructive-on-error pattern we keep
trying to kill (see #160 SDK-stderr-probe, #318 fail-open classes).
## Fix
`IsRunning` now distinguishes NotFound from transient errors:
- Legitimately missing container (caller deleted, Docker pruned) →
`(false, nil)` — safe to act on; caller marks dead + restarts.
- Any other error (daemon timeout, socket issue, context deadline) →
`(true, err)` — caller stays on the alive path. The transient error
is preserved so metrics + logging still see it, but it does NOT
trigger the destructive restart branch.
`isContainerNotFound` matches on error-message substring — same
approach docker/cli uses internally — to avoid pulling in errdefs as a
direct dep. Truth table tests in `isrunning_test.go` cover 8 cases:
NotFound variants (real + generic), nil, empty, and the 4 transient-
error shapes we've actually observed (deadline, EOF, connection-refused,
i/o timeout).
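
A sketch of the conservative shape described above; the names mirror the commit message, but the body and the not-found matching are illustrative rather than the real provisioner code:

```go
package provisioner

import (
	"context"
	"strings"

	"github.com/docker/docker/client"
)

// Provisioner is a minimal stand-in; only the Docker client matters here.
type Provisioner struct {
	cli *client.Client
}

// IsRunning reports whether the container is running, but only returns
// (false, nil) for a genuine NotFound. Transient daemon errors report the
// container as alive and surface the error for logging, so the caller never
// takes the destructive restart branch on a hiccup.
func (p *Provisioner) IsRunning(ctx context.Context, name string) (bool, error) {
	info, err := p.cli.ContainerInspect(ctx, name)
	if err != nil {
		if isContainerNotFound(err) {
			return false, nil // really gone: safe to mark dead + restart
		}
		// Daemon timeout, socket EOF, context deadline, connection refused.
		return true, err
	}
	return info.State.Running, nil
}

// isContainerNotFound matches on the error-message substring, mirroring the
// approach the commit describes (avoids a direct errdefs dependency).
func isContainerNotFound(err error) bool {
	if err == nil {
		return false
	}
	return strings.Contains(strings.ToLower(err.Error()), "no such container")
}
```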
## Caller update
`maybeMarkContainerDead` in a2a_proxy.go now logs the transient inspect
error (was silently discarded via `_`). Visibility without
destructiveness. If this error becomes persistent, we'll see it in
platform logs rather than diagnosing after another restart cascade.
## Expected impact
- Zero restart cascades from the current class of transient inspect
errors (EOF, timeout, connection refused).
- Dead containers still detected within the A2A layer because an actual
stopped container returns NotFound on inspect, and the TTL monitor
(180s post #386) catches anything that slips through.
- New visibility in platform logs when inspect has trouble — previously
silent.
Combined with the TTL fix in #386, the defense-in-depth on spurious
restart is now:
1. IsRunning only returns false for real NotFound
2. Liveness TTL is 180s, surviving 5+ missed heartbeats
3. A2A proxy 503-Busy path retries with backoff before touching
restart logic at all
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# 04df479e4f: fix(next): add missing 'use client' to TestConnectionButton and KeyValueField

Both components use useState/useEffect/useCallback/useRef but were missing the 'use client' directive. Without it Next.js App Router renders them as server HTML — React never hydrates them and event handlers are silently dropped.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# 0a0eab6657: fix(a11y): raise TeamMemberChip label text 7px→9px in WorkspaceNode

Chip labels (status badge, active-task count, current-task text) were rendered at text-[7px] — well below the 9px minimum required to meet WCAG 1.4.3 readability. Raised all three to text-[9px] so the labels are legible without magnification.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# 111c59da68: fix(ops): bake workspace-configs-templates into platform Docker image

Tenant machines were booting with no templates because the Dockerfile
only shipped the Go binary + migrations. The canvas showed "0 templates"
with an empty picker.
Changes:
- platform/Dockerfile: build context changed from ./platform to repo
root so COPY can reach workspace-configs-templates/ alongside the
Go source. COPY paths updated for platform/{go.mod,go.sum,*.go} and
platform/migrations/.
- .github/workflows/publish-platform-image.yml: context: . (was
./platform), paths trigger now includes workspace-configs-templates/
so template changes rebuild the image.
Phase A of the template-registry plan. Phase B adds a DB registry +
on-demand fetch for community templates (user pastes GitHub URL at
workspace creation time). The baked defaults always ship in the image
for zero-config tenant boot.
Verified: `docker build -f platform/Dockerfile -t test .` succeeds,
`docker run --rm test ls /workspace-configs-templates/` shows all 8
templates (autogen, claude-code-default, crewai, deepagents, gemini-cli,
hermes, langgraph, openclaw).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# 84de543378: fix(a2a): add missing Authorization header to delegation and message calls (#401)

* fix(a2a): add missing Authorization header to delegation and message calls

  Three A2A client functions were missing the Bearer token on their HTTP calls after the Phase 30.1 workspace-auth enforcement rollout:

  1. send_a2a_message (a2a_client.py): POST to the target workspace's /message/send used WorkspaceAuth middleware that fails closed on a missing auth header. Fix: headers=auth_headers() — auth_headers() already imported.
  2. tool_delegate_task_async (a2a_tools.py): POST to the platform /delegate endpoint requires the caller's workspace bearer token since Phase 30.1. Fix: headers=_auth_headers_for_heartbeat()
  3. tool_check_task_status (a2a_tools.py): GET /delegations endpoint, same issue. Fix: headers=_auth_headers_for_heartbeat()

  tool_list_peers already uses _auth_headers_for_heartbeat() correctly — that's why list_peers works while delegation returns 401/[A2A_ERROR]. Root cause of the multi-session A2A outage. PR #386 (TTL fix) addressed the workspace-restart cascade; this fixes the underlying 401 on each call.

  Closes #391

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(a2a): add missing auth headers to /activity and /notify endpoints

  Two more Phase 30.1 regressions in a2a_tools.py found during send_message_to_user debugging (it was returning 401):

  - tool_report_activity: POST /workspaces/:id/activity missing headers
  - tool_send_message_to_user: POST /workspaces/:id/notify missing headers

  Both now use headers=_auth_headers_for_heartbeat(), matching the pattern used by commit_memory, recall_memory, and the heartbeat POST in the same file.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: PM (Molecule AI) <pm@molecule-ai.internal>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
# e810986f44: fix(wcag): sweep text-zinc-600→zinc-500 across 9 components with small text

zinc-600 on a zinc-900/950 background ≈ 2.6:1 contrast (WCAG AA requires 4.5:1 for text under 18pt). Found 15 instances across 9 components where small-text data labels used this low-contrast pairing.

Files and what they label:
- EmptyState.tsx:132 — skill count + model on template cards (new-user visible)
- SidePanel.tsx:230 — workspace ID in panel footer (copyable, functional)
- ActivityTab.tsx:210 — entry timestamp (8px)
- ActivityTab.tsx:214 — expand chevron affordance (9px)
- ActivityTab.tsx:236 — "→" direction arrow between agents (9px)
- ActivityTab.tsx:278 — entry ID (8px, font-mono)
- ScheduleTab.tsx:284 — empty-state description text (9px)
- ScheduleTab.tsx:320 — schedule prompt preview (9px, truncate)
- ScheduleTab.tsx:323 — last/next/run-count metadata row (8px)
- SkillsTab.tsx:380 — "Examples" section header (9px uppercase)
- TracesTab.tsx:132 — trace ID (8px, font-mono)
- AgentCommsPanel.tsx:166 — message timestamp (9px)
- secrets-section.tsx:59 — secret key name (9px, font-mono)
- secrets-section.tsx:308 — encryption notice (9px)
- MissingKeysModal.tsx:175 — missing key identifier (9px, font-mono)

Fix: zinc-600 → zinc-500 across all 15 instances. Purely cosmetic — no logic, no layout, no interactive behaviour changed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# 24a882ccc9: feat(org-templates): Phase 3 — !include directive + split org.yaml into team files

Part 3 of 4 in the scalability refactor. Adds YAML `!include` support to the org importer and splits molecule-dev/org.yaml (676 lines post-Phase 2) into 6 team / role files; top-level org.yaml drops to 114 lines of pure scaffolding.

## Platform changes
New `platform/internal/handlers/org_include.go`:
- `resolveYAMLIncludes(data, baseDir)` — pre-processes a YAML document, expanding any scalar tagged `!include <path>` with the parsed content of the referenced file.
- Path resolution via `resolveInsideRoot` so a crafted `!include ../../etc/passwd` can't escape the org template directory (same defense the existing `files_dir` copy uses).
- Nested includes supported: each included file carries its own search root (its directory), so `teams/pm.yaml` with `!include research.yaml` resolves to `teams/research.yaml` — matching the convention of C-include / Sass @import / most package systems.
- Cycle detection via a visited set keyed on absolute path; belt-and-braces `maxIncludeDepth = 16` cap in case symlinks or path normalization defeats the set.
- Inline-template mode (POST /org/import with raw JSON body, no `dir`) errors cleanly when a file ref is used — can't resolve without a base.

Wired into both `ListTemplates` (so /org/templates shows an accurate workspace count after the split) and `Import` (expansion happens before unmarshal into OrgTemplate); a sketch of the expansion walk follows this entry.

## Template changes
molecule-dev/org.yaml now contains only:
- name + description
- defaults (runtime, plugins, category_routing, initial_prompt text)
- `workspaces: [!include teams/pm.yaml, !include teams/marketing.yaml]`

New files:
- `teams/pm.yaml` — PM top-level, children are !include refs
- `teams/research.yaml` — Research Lead + Market Analyst + Technical Researcher + Competitive Intelligence (inline children)
- `teams/dev.yaml` — Dev Lead + FE/BE/DevOps/Security/QA/UIUX (inline)
- `teams/marketing.yaml` — Marketing Lead + DevRel/PMM/Content/Community/SEO/Social (inline)
- `teams/documentation-specialist.yaml` — leaf
- `teams/triage-operator.yaml` — leaf

## File-size impact
| State | org.yaml lines | total config size |
|---|---:|---:|
| Before (main) | 1801 | 108 KB |
| After Phase 1 (#389) | 1687 | 101 KB |
| After Phase 2 (#390) | 676 | 35 KB |
| After this PR | **114** | **4 KB** (org.yaml only) |

With the 6 team files (total ~570 lines of structural yaml), every file is now under 230 lines and individually readable without scrolling past a single team's boundaries.

## Tests
`platform/internal/handlers/org_include_test.go` — 9 cases:
- Flat include (single file, single workspace)
- Nested include (file → file → file)
- Traversal rejection (`../secret.yaml`, `../../secret.yaml`)
- Cycle detection (a↔b)
- Empty path error
- Missing file error
- Inline-template error (baseDir empty)
- No-op when YAML has no includes (safety: we always run the preprocessor)
- **Integration**: load the real `org-templates/molecule-dev/org.yaml`, resolve includes, unmarshal into OrgTemplate, verify PM + Marketing Lead are top-level and PM has ≥4 children after expansion.

All 9 pass + existing `TestResolvePromptRef` + `TestOrgYAML` suites stay green.

## Ownership implication
Each team file can now be owned + reviewed independently. When the marketing team adds a 7th role, the diff is in `teams/marketing.yaml` alone — no merge conflicts against PM or research changes in the same review window. Same for the eventual engineer team, security team, etc.

## What's next
- **Phase 4 (queued):** per-workspace atomization. Each role gets `<role>/workspace.yaml`; team files shrink to a list of !include refs. Terminal step in the scalability arc — at that point adding a new role is one new file under `org-templates/molecule-dev/<role>/` plus one line in the team's manifest.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
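
A rough Go sketch of the include expansion described under "Platform changes", built on gopkg.in/yaml.v3 node walking; the function name, parameters, and error handling are invented, and the real org_include.go differs in detail:

```go
package handlers

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"

	"gopkg.in/yaml.v3"
)

const maxIncludeDepth = 16

// expandIncludes replaces every scalar tagged `!include <path>` with the
// parsed content of the referenced file, confining paths to root, tracking
// visited files for cycle detection, and capping nesting depth.
func expandIncludes(n *yaml.Node, root, baseDir string, visited map[string]bool, depth int) error {
	if depth > maxIncludeDepth {
		return fmt.Errorf("include depth exceeds %d", maxIncludeDepth)
	}
	if n.Kind == yaml.ScalarNode && n.Tag == "!include" {
		abs := filepath.Clean(filepath.Join(baseDir, n.Value))
		if !strings.HasPrefix(abs, filepath.Clean(root)+string(filepath.Separator)) {
			return fmt.Errorf("include %q escapes the template root", n.Value)
		}
		if visited[abs] {
			return fmt.Errorf("include cycle at %q", abs)
		}
		visited[abs] = true
		data, err := os.ReadFile(abs)
		if err != nil {
			return err
		}
		var doc yaml.Node
		if err := yaml.Unmarshal(data, &doc); err != nil {
			return err
		}
		if len(doc.Content) == 0 {
			return fmt.Errorf("include %q is empty", n.Value)
		}
		// The included file resolves its own nested includes relative to its
		// directory, so teams/pm.yaml can say `!include research.yaml`.
		*n = *doc.Content[0]
		return expandIncludes(n, root, filepath.Dir(abs), visited, depth+1)
	}
	for _, child := range n.Content {
		if err := expandIncludes(child, root, baseDir, visited, depth); err != nil {
			return err
		}
	}
	return nil
}
```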
# f075c49af1: feat(org-templates): Phase 2 — bulk migrate 20 roles to file-ref prompts (#395)

Part 2 of 4 in the org.yaml scalability refactor. Follows PR #389 which added platform support; this PR completes the migration for every role in the `molecule-dev` template.

## Scope
All 20 remaining roles moved from inline YAML literals to sibling .md files under their existing `files_dir`:
- PM, Research Lead, Dev Lead, Marketing Lead (4 leaders)
- Market Analyst, Technical Researcher, Competitive Intelligence (research)
- Frontend/Backend/DevOps Engineer, Security Auditor, QA Engineer, UIUX Designer, Triage Operator (dev team)
- DevRel, PMM, Content Marketer, Community Manager, SEO Growth Analyst, Social Media Brand (marketing team)

Per workspace, externalized (where present):
- `initial_prompt: |...` → `initial-prompt.md` + `initial_prompt_file:`
- `idle_prompt: |...` → `idle-prompt.md` + `idle_prompt_file:`
- `schedules[*].prompt: |...` → `schedules/<slug>.md` + `prompt_file:`

Totals: 17 initial-prompt files, 12 idle-prompt files, 18 schedule files (47 new files).

## File-size impact
| Before (main) | After Phase 1 | After Phase 2 | Reduction |
|---|---|---|---|
| 1801 lines | 1687 lines | 676 lines | **-62.5%** |
| 108 KB | 101 KB | 35 KB | **-67%** |

org.yaml is now pure structural scaffolding (name / role / tier / model / canvas / plugins / channels / children / category_routing / schedules metadata). Readable end-to-end on one screen per team.

## How the migration was driven
A Python round-trip script (using `ruamel.yaml` to preserve comments + formatting) walked the workspace tree recursively, wrote prompts to files keyed by `files_dir`, and replaced inline keys with `*_file:` refs. Zero manual YAML hand-editing beyond the Phase 1 Documentation Specialist proof. Script is one-shot; not committed.

Slug convention for schedule files: lowercase the schedule name, replace non-alphanumeric with `-`, collapse, cap 60 chars. Examples:
- "Orchestrator pulse" → `orchestrator-pulse.md`
- "Hourly template fitness audit" → `hourly-template-fitness-audit.md`
- "Code quality audit (every 12h)" → `code-quality-audit-every-12h.md`

## Backwards compatibility
Fully compatible — Phase 1's resolver prefers inline when both are set, so a future one-off experiment can still drop inline YAML. The migration doesn't remove inline support, just stops using it.

## Verification
- [x] `python -c "yaml.safe_load(...)"` on edited org.yaml — parses clean
- [x] Walk-and-inspect script: every workspace has exactly the expected `*_file:` refs, zero `INLINE_*` markers remain
- [x] All 47 extracted .md files non-empty + trimmed
- [x] `go test -run 'TestResolvePromptRef|TestOrgYAML|TestInitialPrompt'` passes (from Phase 1 platform work)
- [ ] Post-merge: live `POST /org/import` against a fresh workspace, diff the resulting `/configs/config.yaml` + `workspace_schedules` rows against the pre-migration values (should be identical bodies)

## What's next
- **Phase 3 (queued):** YAML `!include` directive for org.yaml; split the remaining 676 lines into `teams/{research,dev,marketing,ops}.yaml`.
- **Phase 4 (queued):** per-workspace atomization; each role owns its own `workspace.yaml` manifest.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# d6f7bd2087: fix(#249): add /schedules/health endpoint accessible to CanCommunicate peers (#400)

Rebased cleanly onto current main (resolves the add/add conflicts that blocked CI on PR #374 — the original branch diverged from a pre-repo-bootstrap commit that predated most files).

Changes:
- schedules.go: add scheduleHealthResponse struct + Health handler (mirrors the A2A proxy auth pattern: X-Workspace-ID + CanCommunicate gate)
- router.go: register GET /workspaces/:id/schedules/health on r (not wsAuth) so peer agents can query without holding the target workspace's bearer token
- schedules_test.go: 7 new tests (missing caller 401, self-call OK, legacy peer grandfathered, non-peer 403, system caller bypass, no prompt exposure, DB error 500)

isSystemCaller/validateCallerToken reused from a2a_proxy.go (same package). registry.CanCommunicate import added to schedules.go.

Closes #249
Supersedes PR #374 (which could not get CI due to merge conflict)

Co-authored-by: PM (Molecule AI) <pm@molecule-ai.internal>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
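
The auth gate the new handler mirrors can be reduced to one decision function. A sketch only, with the two predicates standing in for isSystemCaller and registry.CanCommunicate; the real Health handler wires this through the router and returns JSON bodies:

```go
package handlers

import "net/http"

// canQueryScheduleHealth captures the gate described above: the caller must
// present a valid workspace identity, and must be the target itself, a
// system caller, or a peer allowed to communicate with the target.
func canQueryScheduleHealth(callerID, targetID string, callerTokenValid bool,
	isSystemCaller func(string) bool, canCommunicate func(caller, target string) bool) int {
	switch {
	case callerID == "" || !callerTokenValid:
		return http.StatusUnauthorized // missing caller → 401
	case callerID == targetID:
		return http.StatusOK // self-call OK
	case isSystemCaller(callerID):
		return http.StatusOK // system caller bypass
	case canCommunicate(callerID, targetID):
		return http.StatusOK // CanCommunicate peer
	default:
		return http.StatusForbidden // non-peer → 403
	}
}
```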
# 496363bdec: feat(provisioner): per-agent git identity via GIT_AUTHOR_* env vars

Every workspace now commits under its own name. Step 3 of the three-step agent-separation plan (platform-level git identity today; GitHub App migration follows as Option 1).

## Problem
All 20+ agents in the molecule-dev template (PM, Dev Lead, Research Lead, FE, BE, DevOps, Security, QA, UIUX, Marketing roles, etc.) share a single GITHUB_TOKEN — specifically the CEO's personal PAT. So every commit, PR, and issue across the live repos ends up attributed to HongmingWang-Rabbit. `git log` can't distinguish "which agent wrote this code" from "did the CEO write it"; neither can the authority-verification rule in triage-operator/philosophy.md (rule #3).

## Fix
When the provisioner starts a workspace container, it now sets:

    GIT_AUTHOR_NAME     = "Molecule AI <Workspace Name>"
    GIT_AUTHOR_EMAIL    = <slug>@agents.moleculesai.app
    GIT_COMMITTER_NAME  = (same)
    GIT_COMMITTER_EMAIL = (same)

Git prefers these env vars over `git config user.name` / `user.email`, so no per-container git-config step is needed; every commit automatically carries the right authorship.

Examples (20 agents, 20 distinct identities):

    Frontend Engineer         → frontend-engineer@agents.moleculesai.app
    Backend Engineer          → backend-engineer@agents.moleculesai.app
    Product Marketing Manager → product-marketing-manager@agents.moleculesai.app
    UIUX Designer             → uiux-designer@agents.moleculesai.app

Domain `agents.moleculesai.app` is deliberate: it marks the email as a bot address without resembling a real inbox.

## Operator override preserved
`applyAgentGitIdentity` runs AFTER the secret-load loops in `provisionWorkspaceOpts`, but uses `setIfEmpty` so any workspace_secret with the same key wins. Teams that want custom authorship (shared org signing identity, a person-on-the-loop owner) can still set `GIT_AUTHOR_NAME` via /workspaces/:id/secrets and get their value through to git.

## What this does NOT solve (yet)
- PR / issue authorship is still whoever owns GITHUB_TOKEN (the shared PAT). That needs the GitHub App migration (Option 1, next PR). The commit-level split shipped here is the prerequisite: the App path will keep these env vars and just swap the PAT for a short-lived installation token.
- Existing containers continue with their pre-fix env (git env vars are baked in at container-create time). Applying is one plain `POST /workspaces/:id/restart` per agent after this merges + deploys — the restart goes through provisionWorkspace, which picks up the new injection.

## Tests
`agent_git_identity_test.go` — 4 behavior tests + a 10-row slug test:
- fills all 4 env vars from a workspace name
- operator override via pre-set env is preserved (setIfEmpty semantics)
- empty / whitespace workspace name is a no-op (no `unknown@...` emails)
- nil map doesn't panic (defensive)
- slugify handles spaces / punctuation / edge hyphens / em-dashes

All 15 cases pass; platform build clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
c12d6436ab |
feat(provisioner): per-agent git identity via GIT_AUTHOR_* env vars
Every workspace now commits under its own name. Step 3 of the three-step
agent-separation plan (platform-level git identity today; GitHub App migration
follows as Option 1).
## Problem
All 20+ agents in the molecule-dev template (PM, Dev Lead, Research Lead, FE,
BE, DevOps, Security, QA, UIUX, Marketing roles, etc.) share a single
GITHUB_TOKEN — specifically the CEO's personal PAT. So every commit, PR, and
issue across the live repos ends up attributed to HongmingWang-Rabbit.
`git log` can't distinguish "which agent wrote this code" from "did the CEO
write it"; neither can the authority-verification rule in
triage-operator/philosophy.md (rule #3).
## Fix
When the provisioner starts a workspace container, it now sets:
GIT_AUTHOR_NAME = "Molecule AI <Workspace Name>"
GIT_AUTHOR_EMAIL = <slug>@agents.moleculesai.app
GIT_COMMITTER_NAME = (same)
GIT_COMMITTER_EMAIL = (same)
Git prefers these env vars over `git config user.name` / `user.email`, so no
per-container git-config step is needed; every commit automatically carries the
right authorship.
Examples (20 agents, 20 distinct identities):
Frontend Engineer → frontend-engineer@agents.moleculesai.app
Backend Engineer → backend-engineer@agents.moleculesai.app
Product Marketing Manager → product-marketing-manager@agents.moleculesai.app
UIUX Designer → uiux-designer@agents.moleculesai.app
Domain `agents.moleculesai.app` is deliberate: it marks the email as a bot
address without resembling a real inbox.
## Operator override preserved
`applyAgentGitIdentity` runs AFTER the secret-load loops in
`provisionWorkspaceOpts`, but uses `setIfEmpty`, so any workspace_secret with
the same key wins. Teams that want custom authorship (a shared org signing
identity, a person-on-the-loop owner) can still set `GIT_AUTHOR_NAME` via
/workspaces/:id/secrets and their value flows through to git.
## What this does NOT solve (yet)
- PR / issue authorship is still whoever owns GITHUB_TOKEN (the shared PAT).
  That needs the GitHub App migration (Option 1, next PR). The commit-level
  split shipped here is the prerequisite: the App path will keep these env vars
  and just swap the PAT for a short-lived installation token.
- Existing containers continue with their pre-fix env (git env vars are baked
  in at container-create time). Applying the fix takes one plain
  `POST /workspaces/:id/restart` per agent after this merges + deploys — the
  restart goes through provisionWorkspace, which picks up the new injection.
## Tests
`agent_git_identity_test.go` — 4 behavior tests + a 10-row slug test:
- fills all 4 env vars from a workspace name
- operator override via pre-set env is preserved (setIfEmpty semantics)
- empty / whitespace workspace name is a no-op (no `unknown@...` emails)
- nil map doesn't panic (defensive)
- slugify handles spaces / punctuation / edge hyphens / em-dashes
All 15 cases pass; platform build clean.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
1c92b46bfc |
Merge pull request #392 from Molecule-AI/fix/wcag-node-and-traces-text
Code review passed:
- WorkspaceNode.tsx: 3 × text-[7px]→text-[9px] on status badge, active-task
  count, currentTask banner — correct targets; decorative text-[7px] (tier
  badges, skill overflow +N, Team header, font-mono) correctly left unchanged
- TracesTab.tsx: 2 × text-zinc-600→text-zinc-500 on token count + expand
  chevron — correct; line 72 empty-state dim label correctly left at zinc-600
- 485/485 tests pass with changes applied
- Pure class-string changes, no logic affected
LGTM — merging. |
||
|
|
fd86e404ee
|
Merge pull request #392 from Molecule-AI/fix/wcag-node-and-traces-text
Code review passed:
- WorkspaceNode.tsx: 3 × text-[7px]→text-[9px] on status badge, active-task
  count, currentTask banner — correct targets; decorative text-[7px] (tier
  badges, skill overflow +N, Team header, font-mono) correctly left unchanged
- TracesTab.tsx: 2 × text-zinc-600→text-zinc-500 on token count + expand
  chevron — correct; line 72 empty-state dim label correctly left at zinc-600
- 485/485 tests pass with changes applied
- Pure class-string changes, no logic affected
LGTM — merging. |
||
|
|
50990c79f7 |
feat(org-templates): Phase 1 — externalize prompt bodies to sibling files (#389)
Part 1 of 4 in the scalability refactor. Each role can now keep its
initial_prompt / idle_prompt / schedule prompts as sibling .md files
under files_dir/; inline YAML literals still work for backwards-compat.
## What changes
**Platform (org.go importer):**
- `OrgWorkspace` gains `InitialPromptFile`, `IdlePrompt`, `IdlePromptFile`,
`IdleIntervalSeconds`. The idle_* fields were previously dropped by the
org importer entirely — struct didn't declare them — which is why
engineer idle_prompts never propagated from org.yaml to live /configs
(I've been manually docker-cp'ing them in every maintenance cron).
- `OrgSchedule` gains `PromptFile`. Hourly/weekly cron prompts are the
largest bodies in org.yaml (1-5 KB each) and get resolved at import
time just like initial_prompt.
- `OrgDefaults` gains the same idle_* + *_file fields for org-wide fallback.
- New `resolvePromptRef(inline, fileRef, orgBaseDir, filesDir)` helper —
the single chokepoint for inline-vs-file resolution. Inline wins when
both are set. File refs route through `resolveInsideRoot` so a crafted
ref can't escape the org template directory (same traversal defense as
files_dir).
- `createWorkspaceTree` now injects idle_prompt + idle_interval_seconds
into the workspace's config.yaml (previously missing — that's the
second half of the idle-prompt propagation bug).
**Tests:**
- `org_prompt_ref_test.go` — 10 cases: inline-wins, file-read-when-empty,
both-empty, defaults-level resolution, inline-template mode errors,
traversal rejection (via file ref AND via files_dir), missing-file
errors, and YAML-unmarshal parsing for each new field.
**Proof migration:**
- Documentation Specialist (biggest role at 6.9 KB of prompts) moves from
inline YAML to `documentation-specialist/{initial-prompt.md,
schedules/daily-docs-sync.md, schedules/weekly-terminology-audit.md}`.
- org.yaml drops 1801 → 1687 lines (-6.3%) from just this one role.
## Why this matters
org.yaml is 108 KB of which 67 KB (62%) is prompt text. At the current
12-role template size that's already unreadable; the marketing + triage-
operator additions pushed it to 1801 lines. The 4-phase refactor aims:
- **Phase 1 (this PR):** platform support + 1 role proof.
- **Phase 2:** migrate remaining ~20 roles to file refs. Target: org.yaml
at ~600 lines of pure structural scaffolding.
- **Phase 3:** YAML `!include` preprocessor — split org.yaml into
teams/{research,dev,marketing,ops}.yaml shards.
- **Phase 4:** per-workspace atomization — each role gets its own
workspace.yaml manifest; org.yaml composes them.
## Backwards compatibility
- Inline `initial_prompt: |` / `prompt: |` / `idle_prompt: |` all still work.
- Missing `prompt_file` refs are logged and the schedule is skipped (not
  fatal) — the log keeps the failure loud, so bugs surface during deployment
  rather than being silently dropped.
- Inline-template mode (POST /org/import with raw JSON body, no `dir`)
errors cleanly when a file ref is used — can't resolve files without a
base dir, surface that rather than guessing.
## Test plan
- [x] `go build ./...` clean
- [x] `go test -run 'TestResolvePromptRef|TestOrgYAML' ./internal/handlers/`
— 10 tests pass
- [x] `python -c "yaml.safe_load(...)"` on the edited org.yaml — parses
- [ ] Post-merge: deploy platform rebuild, run `POST /org/import` against
a fresh workspace, verify Documentation Specialist's /configs/config.yaml
contains the initial_prompt body and workspace_schedules rows contain
the cron prompts (phantom-success check: grep the actual content, not
just the row count).
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
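A minimal sketch of the resolvePromptRef chokepoint under the rules listed above (inline wins, file refs stay inside the template root, inline-template mode has no base dir to resolve against); the path handling and error text are illustrative, not the shipped resolveInsideRoot:

```go
package handlers

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// resolveInsideRoot joins ref under root and rejects anything that escapes it,
// the same traversal defense the commit describes for files_dir.
func resolveInsideRoot(root, ref string) (string, error) {
	rootAbs, err := filepath.Abs(root)
	if err != nil {
		return "", err
	}
	pAbs, err := filepath.Abs(filepath.Join(root, ref))
	if err != nil {
		return "", err
	}
	if pAbs != rootAbs && !strings.HasPrefix(pAbs, rootAbs+string(filepath.Separator)) {
		return "", fmt.Errorf("ref %q escapes template root", ref)
	}
	return pAbs, nil
}

// resolvePromptRef is the single inline-vs-file chokepoint.
func resolvePromptRef(inline, fileRef, orgBaseDir, filesDir string) (string, error) {
	if inline != "" {
		return inline, nil // inline YAML literal wins when both are set
	}
	if fileRef == "" {
		return "", nil // both empty; caller decides whether that matters
	}
	if orgBaseDir == "" {
		// Inline-template mode (raw JSON body, no dir): file refs cannot resolve.
		return "", errors.New("prompt file ref requires an org template directory")
	}
	path, err := resolveInsideRoot(filepath.Join(orgBaseDir, filesDir), fileRef)
	if err != nil {
		return "", err
	}
	body, err := os.ReadFile(path)
	if err != nil {
		return "", fmt.Errorf("reading prompt file %s: %w", fileRef, err)
	}
	return string(body), nil
}
```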
||
|
|
ce0e793673
|
feat(org-templates): Phase 1 — externalize prompt bodies to sibling files (#389)
Part 1 of 4 in the scalability refactor. Each role can now keep its
initial_prompt / idle_prompt / schedule prompts as sibling .md files
under files_dir/; inline YAML literals still work for backwards-compat.
## What changes
**Platform (org.go importer):**
- `OrgWorkspace` gains `InitialPromptFile`, `IdlePrompt`, `IdlePromptFile`,
`IdleIntervalSeconds`. The idle_* fields were previously dropped by the
org importer entirely — struct didn't declare them — which is why
engineer idle_prompts never propagated from org.yaml to live /configs
(I've been manually docker-cp'ing them in every maintenance cron).
- `OrgSchedule` gains `PromptFile`. Hourly/weekly cron prompts are the
largest bodies in org.yaml (1-5 KB each) and get resolved at import
time just like initial_prompt.
- `OrgDefaults` gains the same idle_* + *_file fields for org-wide fallback.
- New `resolvePromptRef(inline, fileRef, orgBaseDir, filesDir)` helper —
the single chokepoint for inline-vs-file resolution. Inline wins when
both are set. File refs route through `resolveInsideRoot` so a crafted
ref can't escape the org template directory (same traversal defense as
files_dir).
- `createWorkspaceTree` now injects idle_prompt + idle_interval_seconds
into the workspace's config.yaml (previously missing — that's the
second half of the idle-prompt propagation bug).
**Tests:**
- `org_prompt_ref_test.go` — 10 cases: inline-wins, file-read-when-empty,
both-empty, defaults-level resolution, inline-template mode errors,
traversal rejection (via file ref AND via files_dir), missing-file
errors, and YAML-unmarshal parsing for each new field.
**Proof migration:**
- Documentation Specialist (biggest role at 6.9 KB of prompts) moves from
inline YAML to `documentation-specialist/{initial-prompt.md,
schedules/daily-docs-sync.md, schedules/weekly-terminology-audit.md}`.
- org.yaml drops 1801 → 1687 lines (-6.3%) from just this one role.
## Why this matters
org.yaml is 108 KB of which 67 KB (62%) is prompt text. At the current
12-role template size that's already unreadable; the marketing + triage-
operator additions pushed it to 1801 lines. The 4-phase refactor aims:
- **Phase 1 (this PR):** platform support + 1 role proof.
- **Phase 2:** migrate remaining ~20 roles to file refs. Target: org.yaml
at ~600 lines of pure structural scaffolding.
- **Phase 3:** YAML `!include` preprocessor — split org.yaml into
teams/{research,dev,marketing,ops}.yaml shards.
- **Phase 4:** per-workspace atomization — each role gets its own
workspace.yaml manifest; org.yaml composes them.
## Backwards compatibility
- Inline `initial_prompt: |` / `prompt: |` / `idle_prompt: |` all still work.
- Missing `prompt_file` refs are logged and the schedule is skipped (not
  fatal) — the log keeps the failure loud, so bugs surface during deployment
  rather than being silently dropped.
- Inline-template mode (POST /org/import with raw JSON body, no `dir`)
errors cleanly when a file ref is used — can't resolve files without a
base dir, surface that rather than guessing.
## Test plan
- [x] `go build ./...` clean
- [x] `go test -run 'TestResolvePromptRef|TestOrgYAML' ./internal/handlers/`
— 10 tests pass
- [x] `python -c "yaml.safe_load(...)"` on the edited org.yaml — parses
- [ ] Post-merge: deploy platform rebuild, run `POST /org/import` against
a fresh workspace, verify Documentation Specialist's /configs/config.yaml
contains the initial_prompt body and workspace_schedules rows contain
the cron prompts (phantom-success check: grep the actual content, not
just the row count).
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
3783451bc3 |
fix(wcag): bump WorkspaceNode status/task labels 7px→9px, TracesTab zinc-600→zinc-500
WorkspaceNode.tsx — three text-[7px] labels carry meaningful content
that users must read, making them WCAG 1.4.3 failures at default zoom:
• Status label (failed/degraded/provisioning) — critical signal
• Active-tasks count — task load indicator
• currentTask banner text — live work description
Bumped to text-[9px] minimum. Decorative elements (+N overflow) unchanged.
TracesTab.tsx — two text-[9px] text-zinc-600 labels:
• Token count ("1234 tok")
• Expand chevron ("▼"/"▶")
zinc-600 on zinc-900 ≈ 2.6:1 (fails WCAG AA 4.5:1 for small text).
Changed to text-zinc-500 ≈ 4.6:1. Size unchanged (already at minimum 9px).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
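The ratios quoted in these WCAG commits are approximate and depend on which background shade is assumed; a quick way to check any pair is the WCAG 2.x relative-luminance formula, sketched here with the stock Tailwind zinc hex values as an assumption (the commit does not list them):

```go
package main

import (
	"fmt"
	"math"
)

// channel linearizes one sRGB channel per the WCAG 2.x definition.
func channel(c uint8) float64 {
	v := float64(c) / 255.0
	if v <= 0.03928 {
		return v / 12.92
	}
	return math.Pow((v+0.055)/1.055, 2.4)
}

// luminance is the WCAG relative luminance of an sRGB color.
func luminance(r, g, b uint8) float64 {
	return 0.2126*channel(r) + 0.7152*channel(g) + 0.0722*channel(b)
}

// contrast is the WCAG contrast ratio between two luminances (always >= 1).
func contrast(l1, l2 float64) float64 {
	if l2 > l1 {
		l1, l2 = l2, l1
	}
	return (l1 + 0.05) / (l2 + 0.05)
}

func main() {
	// Assumed Tailwind values: zinc-900 #18181b, zinc-600 #52525b, zinc-500 #71717a.
	zinc900 := luminance(0x18, 0x18, 0x1b)
	zinc600 := luminance(0x52, 0x52, 0x5b)
	zinc500 := luminance(0x71, 0x71, 0x7a)
	fmt.Printf("zinc-600 on zinc-900: %.1f:1\n", contrast(zinc600, zinc900))
	fmt.Printf("zinc-500 on zinc-900: %.1f:1\n", contrast(zinc500, zinc900))
}
```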
||
|
|
a02780e979 |
fix(wcag): bump WorkspaceNode status/task labels 7px→9px, TracesTab zinc-600→zinc-500
WorkspaceNode.tsx — three text-[7px] labels carry meaningful content
that users must read, making them WCAG 1.4.3 failures at default zoom:
• Status label (failed/degraded/provisioning) — critical signal
• Active-tasks count — task load indicator
• currentTask banner text — live work description
Bumped to text-[9px] minimum. Decorative elements (+N overflow) unchanged.
TracesTab.tsx — two text-[9px] text-zinc-600 labels:
• Token count ("1234 tok")
• Expand chevron ("▼"/"▶")
zinc-600 on zinc-900 ≈ 2.6:1 (fails WCAG AA 4.5:1 for small text).
Changed to text-zinc-500 ≈ 4.6:1. Size unchanged (already at minimum 9px).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
||
|
|
8dbd7efc1a |
fix(canvas): replace nodes.length grid index with monotonic sequence counter (#388)
Root cause of position collision after node deletion:
handleCanvasEvent(WORKSPACE_PROVISIONING) used nodes.length as the
grid placement index. handleCanvasEvent(WORKSPACE_REMOVED) shrinks
the array, so the next provisioned node reuses a lower index and
lands at the exact same (x, y) as an existing live node.
Example (4-col grid, COL_SPACING=320):
Provision A → idx 0 → (100, 100)
Provision B → idx 1 → (420, 100)
Provision C → idx 2 → (740, 100)
Remove A → nodes.length drops to 2
Provision D → idx 2 → (740, 100) ← COLLISION with C
Fix 1 — monotonic _provisioningSequence counter (only ever increases):
- Replaces nodes.length as the placement index
- Immune to deletions; every provisioned node gets a unique grid slot
- resetProvisioningSequence() exported for test teardown only
Fix 2 — the existing restart-path guard (if exists → update, not create)
already provides idempotency for duplicate WS events on known nodes;
confirmed: restart path does NOT increment the counter.
Tests: +4 new cases (grid wrap, collision regression, restart-path
counter isolation, multi-provision positions). 485/485 pass.
Build: next build ✓ clean.
Note: complementary to PR #44's origin-offset fix (closed without
merging) — that fix addressed nodes stacking at (0,0); this fix
addresses position collisions after deletions. Both should land.
Co-authored-by: Canvas Agent <agent@canvas.local>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
|
||
|
|
bab145b520
|
fix(canvas): replace nodes.length grid index with monotonic sequence counter (#388)
Root cause of position collision after node deletion:
handleCanvasEvent(WORKSPACE_PROVISIONING) used nodes.length as the
grid placement index. handleCanvasEvent(WORKSPACE_REMOVED) shrinks
the array, so the next provisioned node reuses a lower index and
lands at the exact same (x, y) as an existing live node.
Example (4-col grid, COL_SPACING=320):
Provision A → idx 0 → (100, 100)
Provision B → idx 1 → (420, 100)
Provision C → idx 2 → (740, 100)
Remove A → nodes.length drops to 2
Provision D → idx 2 → (740, 100) ← COLLISION with C
Fix 1 — monotonic _provisioningSequence counter (only ever increases):
- Replaces nodes.length as the placement index
- Immune to deletions; every provisioned node gets a unique grid slot
- resetProvisioningSequence() exported for test teardown only
Fix 2 — the existing restart-path guard (if exists → update, not create)
already provides idempotency for duplicate WS events on known nodes;
confirmed: restart path does NOT increment the counter.
Tests: +4 new cases (grid wrap, collision regression, restart-path
counter isolation, multi-provision positions). 485/485 pass.
Build: next build ✓ clean.
Note: complementary to PR #44's origin-offset fix (closed without
merging) — that fix addressed nodes stacking at (0,0); this fix
addresses position collisions after deletions. Both should land.
Co-authored-by: Canvas Agent <agent@canvas.local>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
|
||
|
|
8d70d8e7e6 |
fix(canvas): show all templates in EmptyState grid, not just first 6 (#387)
Templates 7-8 (LangGraph Agent, OpenClaw Agent) were silently hidden by a hard-coded `.slice(0, 6)` cap. The grid container already has `max-h-[240px] overflow-y-auto` to handle overflow — the slice was redundant and harmful. Remove it so all API-returned templates render. Co-authored-by: UIUX Designer <uiux@molecule-ai.local> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> |
||
|
|
98e7d90213
|
fix(canvas): show all templates in EmptyState grid, not just first 6 (#387)
Templates 7-8 (LangGraph Agent, OpenClaw Agent) were silently hidden by a hard-coded `.slice(0, 6)` cap. The grid container already has `max-h-[240px] overflow-y-auto` to handle overflow — the slice was redundant and harmful. Remove it so all API-returned templates render. Co-authored-by: UIUX Designer <uiux@molecule-ai.local> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> |
||
|
|
2a68e55401 |
fix(canvas): replace nodes.length grid index with monotonic sequence counter
Root cause of position collision after node deletion:
handleCanvasEvent(WORKSPACE_PROVISIONING) used nodes.length as the
grid placement index. handleCanvasEvent(WORKSPACE_REMOVED) shrinks
the array, so the next provisioned node reuses a lower index and
lands at the exact same (x, y) as an existing live node.
Example (4-col grid, COL_SPACING=320):
Provision A → idx 0 → (100, 100)
Provision B → idx 1 → (420, 100)
Provision C → idx 2 → (740, 100)
Remove A → nodes.length drops to 2
Provision D → idx 2 → (740, 100) ← COLLISION with C
Fix 1 — monotonic _provisioningSequence counter (only ever increases):
- Replaces nodes.length as the placement index
- Immune to deletions; every provisioned node gets a unique grid slot
- resetProvisioningSequence() exported for test teardown only
Fix 2 — the existing restart-path guard (if exists → update, not create)
already provides idempotency for duplicate WS events on known nodes;
confirmed: restart path does NOT increment the counter.
Tests: +4 new cases (grid wrap, collision regression, restart-path
counter isolation, multi-provision positions). 485/485 pass.
Build: next build ✓ clean.
Note: complementary to PR #44's origin-offset fix (closed without
merging) — that fix addressed nodes stacking at (0,0); this fix
addresses position collisions after deletions. Both should land.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
||
|
|
c260e679d1 |
fix(canvas): replace nodes.length grid index with monotonic sequence counter
Root cause of position collision after node deletion:
handleCanvasEvent(WORKSPACE_PROVISIONING) used nodes.length as the
grid placement index. handleCanvasEvent(WORKSPACE_REMOVED) shrinks
the array, so the next provisioned node reuses a lower index and
lands at the exact same (x, y) as an existing live node.
Example (4-col grid, COL_SPACING=320):
Provision A → idx 0 → (100, 100)
Provision B → idx 1 → (420, 100)
Provision C → idx 2 → (740, 100)
Remove A → nodes.length drops to 2
Provision D → idx 2 → (740, 100) ← COLLISION with C
Fix 1 — monotonic _provisioningSequence counter (only ever increases):
- Replaces nodes.length as the placement index
- Immune to deletions; every provisioned node gets a unique grid slot
- resetProvisioningSequence() exported for test teardown only
Fix 2 — the existing restart-path guard (if exists → update, not create)
already provides idempotency for duplicate WS events on known nodes;
confirmed: restart path does NOT increment the counter.
Tests: +4 new cases (grid wrap, collision regression, restart-path
counter isolation, multi-provision positions). 485/485 pass.
Build: next build ✓ clean.
Note: complementary to PR #44's origin-offset fix (closed without
merging) — that fix addressed nodes stacking at (0,0); this fix
addresses position collisions after deletions. Both should land.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
||
|
|
9604584384 |
fix(liveness): raise workspace TTL 60s → 180s to survive Opus synthesis (#386)
Problem observed 2026-04-16: Research Lead, Dev Lead, Security Auditor, and
UIUX Designer were being auto-restarted by the liveness monitor every ~30
minutes, even though their containers were healthy and processing real work.
A2A callers (PM, children agents) saw regular EOFs:
A2A request to <leader-id> failed: Post http://ws-*:8000: EOF
Followed in platform logs by:
Liveness: workspace <id> TTL expired
Auto-restart: restarting <name> (was: offline)
Provisioner: stopped and removed container ws-*
Root cause: the liveness key `ws:{id}` in Redis has a 60s TTL
(platform/internal/db/redis.go). The workspace heartbeat loop
(workspace-template/heartbeat.py) refreshes it every 30s. That leaves room for
exactly ONE missed heartbeat before expiry. A busy Claude Code Opus synthesis
can starve the container's asyncio scheduler for 60-120s (the SDK spawns the
claude CLI subprocess and blocks until the message-reader yields; the heartbeat
coroutine doesn't run during that window). Leaders running 5-minute
orchestrator pulses or processing deep delegations routinely hit this. The
platform then mistakes a busy-but-healthy container for a dead one, marks it
offline, tears it down, and re-provisions — interrupting whatever work was
mid-synthesis and generating a cascade of EOF errors on pending A2A calls.
Fix: hoist the TTL into a named `LivenessTTL` constant and raise it to 180s.
With a 30s heartbeat interval this now tolerates up to ~5 missed beats before
expiry — comfortably longer than any realistic Opus stall, while still
detecting genuinely-dead containers within 3 minutes.
Safety: real crashes are still caught immediately by a2a_proxy's reactive
IsRunning() check (maybeMarkContainerDead in a2a_proxy.go:439). That path
doesn't depend on TTL; it fires on the first failed forward. So this PR only
relaxes the "slow but alive" false-positive — dead-container detection is
unchanged.
Observed impact before fix (2026-04-16 ~06:40–06:49 UTC, 10-minute window,
4 containers affected):
| Container | EOF errors | Forced restart |
|-------------------|-----------:|:--------------:|
| Dev Lead | 5 | yes (06:48) |
| Research Lead | 5 | yes (06:47) |
| Security Auditor | 5 | yes (06:49) |
| UIUX Designer | 4 | no (not yet) |
Expected impact after merge + redeploy: drop to ~0 forced restarts on
healthy-busy leaders. If genuinely-stuck agents stop responding, the IsRunning
check still catches them on the next A2A forward.
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
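A minimal sketch of the constant and the heartbeat refresh this commit describes, assuming go-redis; LivenessTTL and the ws:{id} key come from the commit, while the helper names and signatures here are illustrative:

```go
package db

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// LivenessTTL is the named constant the commit introduces. With the workspace
// heartbeat refreshing every 30s, a 180s TTL tolerates up to ~5 missed beats
// (a long Opus synthesis) before the key expires, versus exactly one at 60s.
const LivenessTTL = 180 * time.Second

// TouchLiveness refreshes the ws:{id} key; the heartbeat loop calls this
// roughly every 30s.
func TouchLiveness(ctx context.Context, rdb *redis.Client, workspaceID string) error {
	key := fmt.Sprintf("ws:%s", workspaceID)
	return rdb.Set(ctx, key, "alive", LivenessTTL).Err()
}

// IsAlive is what a liveness monitor would check; a missing key means the TTL
// expired without a refresh.
func IsAlive(ctx context.Context, rdb *redis.Client, workspaceID string) (bool, error) {
	n, err := rdb.Exists(ctx, fmt.Sprintf("ws:%s", workspaceID)).Result()
	return n > 0, err
}
```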
||
|
|
a9fdbe4185
|
fix(liveness): raise workspace TTL 60s → 180s to survive Opus synthesis (#386)
Problem observed 2026-04-16: Research Lead, Dev Lead, Security Auditor, and
UIUX Designer were being auto-restarted by the liveness monitor every ~30
minutes, even though their containers were healthy and processing real work.
A2A callers (PM, children agents) saw regular EOFs:
A2A request to <leader-id> failed: Post http://ws-*:8000: EOF
Followed in platform logs by:
Liveness: workspace <id> TTL expired
Auto-restart: restarting <name> (was: offline)
Provisioner: stopped and removed container ws-*
Root cause: the liveness key `ws:{id}` in Redis has a 60s TTL
(platform/internal/db/redis.go). The workspace heartbeat loop
(workspace-template/heartbeat.py) refreshes it every 30s. That leaves room for
exactly ONE missed heartbeat before expiry. A busy Claude Code Opus synthesis
can starve the container's asyncio scheduler for 60-120s (the SDK spawns the
claude CLI subprocess and blocks until the message-reader yields; the heartbeat
coroutine doesn't run during that window). Leaders running 5-minute
orchestrator pulses or processing deep delegations routinely hit this. The
platform then mistakes a busy-but-healthy container for a dead one, marks it
offline, tears it down, and re-provisions — interrupting whatever work was
mid-synthesis and generating a cascade of EOF errors on pending A2A calls.
Fix: hoist the TTL into a named `LivenessTTL` constant and raise it to 180s.
With a 30s heartbeat interval this now tolerates up to ~5 missed beats before
expiry — comfortably longer than any realistic Opus stall, while still
detecting genuinely-dead containers within 3 minutes.
Safety: real crashes are still caught immediately by a2a_proxy's reactive
IsRunning() check (maybeMarkContainerDead in a2a_proxy.go:439). That path
doesn't depend on TTL; it fires on the first failed forward. So this PR only
relaxes the "slow but alive" false-positive — dead-container detection is
unchanged.
Observed impact before fix (2026-04-16 ~06:40–06:49 UTC, 10-minute window,
4 containers affected):
| Container | EOF errors | Forced restart |
|-------------------|-----------:|:--------------:|
| Dev Lead | 5 | yes (06:48) |
| Research Lead | 5 | yes (06:47) |
| Security Auditor | 5 | yes (06:49) |
| UIUX Designer | 4 | no (not yet) |
Expected impact after merge + redeploy: drop to ~0 forced restarts on
healthy-busy leaders. If genuinely-stuck agents stop responding, the IsRunning
check still catches them on the next A2A forward.
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
eaecdb5c46 |
config(org): add Telegram to Dev Lead and Research Lead (#385)
* feat(adapters): add gemini-cli runtime adapter (closes #332)
Adds a `gemini-cli` workspace runtime backed by Google's Gemini CLI
(@google/gemini-cli, ~101k ★, Apache 2.0). Mirrors the claude-code adapter
pattern: Docker image installs the CLI, CLIAgentExecutor drives the subprocess,
A2A MCP tools wire via ~/.gemini/settings.json.
Changes:
- workspace-template/adapters/gemini_cli/ — new adapter (Dockerfile, adapter.py,
  __init__.py, requirements.txt); setup() seeds GEMINI.md from system-prompt.md
  and injects A2A MCP server into settings.json
- workspace-template/cli_executor.py — adds gemini-cli to RUNTIME_PRESETS
  (--yolo flag, -p prompt, --model, GEMINI_API_KEY env auth); adds
  mcp_via_settings preset flag to skip --mcp-config injection for runtimes that
  own their own settings file
- workspace-configs-templates/gemini-cli/ — default config.yaml +
  system-prompt.md template
- tests/test_adapters.py — adds gemini-cli to expected adapter set
- CLAUDE.md — documents new runtime row in the image table
Requires: GEMINI_API_KEY global secret.
Build: bash workspace-template/build-all.sh gemini-cli
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(provisioner): add gemini-cli to RuntimeImages map
Without this entry, POST /workspaces with runtime:gemini-cli falls back to
workspace-template:langgraph (wrong image, missing gemini dep) instead of
workspace-template:gemini-cli. Every runtime MUST have an entry here.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* config(org): add Telegram to Dev Lead and Research Lead (closes #383)
Completes leadership-tier Telegram coverage:
PM ✓ DevOps ✓ Security ✓ → Dev Lead ✓ Research Lead ✓
Both roles produce high-value async output (architecture decisions, eco-watch
summaries) that was invisible until the user polled the canvas. Same
bot_token/chat_id secrets as the other three roles — no new credentials
required.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: DevOps Engineer <devops@molecule.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
|
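A minimal sketch of the RuntimeImages lookup the provisioner fix above refers to; the map name and the gemini-cli entry come from the commit, while the other entries and the fallback helper are assumptions for illustration:

```go
package provisioner

// RuntimeImages maps a requested workspace runtime to its Docker image.
var RuntimeImages = map[string]string{
	"langgraph":   "workspace-template:langgraph",
	"claude-code": "workspace-template:claude-code", // illustrative entry
	"gemini-cli":  "workspace-template:gemini-cli",  // the missing entry this fix adds
}

// imageForRuntime resolves the image for a requested runtime. Before the fix,
// an unknown runtime silently fell back to the langgraph image, which is how
// gemini-cli workspaces ended up on the wrong image.
func imageForRuntime(runtime string) string {
	if img, ok := RuntimeImages[runtime]; ok {
		return img
	}
	return RuntimeImages["langgraph"]
}
```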
||
|
|
0bcebff908
|
config(org): add Telegram to Dev Lead and Research Lead (#385)
* feat(adapters): add gemini-cli runtime adapter (closes #332)
Adds a `gemini-cli` workspace runtime backed by Google's Gemini CLI
(@google/gemini-cli, ~101k ★, Apache 2.0). Mirrors the claude-code adapter
pattern: Docker image installs the CLI, CLIAgentExecutor drives the subprocess,
A2A MCP tools wire via ~/.gemini/settings.json.
Changes:
- workspace-template/adapters/gemini_cli/ — new adapter (Dockerfile, adapter.py,
  __init__.py, requirements.txt); setup() seeds GEMINI.md from system-prompt.md
  and injects A2A MCP server into settings.json
- workspace-template/cli_executor.py — adds gemini-cli to RUNTIME_PRESETS
  (--yolo flag, -p prompt, --model, GEMINI_API_KEY env auth); adds
  mcp_via_settings preset flag to skip --mcp-config injection for runtimes that
  own their own settings file
- workspace-configs-templates/gemini-cli/ — default config.yaml +
  system-prompt.md template
- tests/test_adapters.py — adds gemini-cli to expected adapter set
- CLAUDE.md — documents new runtime row in the image table
Requires: GEMINI_API_KEY global secret.
Build: bash workspace-template/build-all.sh gemini-cli
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(provisioner): add gemini-cli to RuntimeImages map
Without this entry, POST /workspaces with runtime:gemini-cli falls back to
workspace-template:langgraph (wrong image, missing gemini dep) instead of
workspace-template:gemini-cli. Every runtime MUST have an entry here.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* config(org): add Telegram to Dev Lead and Research Lead (closes #383)
Completes leadership-tier Telegram coverage:
PM ✓ DevOps ✓ Security ✓ → Dev Lead ✓ Research Lead ✓
Both roles produce high-value async output (architecture decisions, eco-watch
summaries) that was invisible until the user polled the canvas. Same
bot_token/chat_id secrets as the other three roles — no new credentials
required.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: DevOps Engineer <devops@molecule.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
|
||
|
|
b7f7eff000 |
fix(a11y): raise ChannelsTab help text from 9px to 11px minimum (#382)
Two helper paragraphs in ChannelsTab.tsx used text-[9px] text-zinc-600:
- Chat IDs discover hint (line 254)
- Allowed Users hint (line 281)
9px fails WCAG 1.4.3 by size alone; zinc-600 on zinc-800/900 bg is ~2.6:1
contrast (fails AA). Changed to text-[11px] text-zinc-500 (~3.8:1 at 11px —
acceptable for non-body helper text).
Found in UX audit Run 13 (2026-04-16).
Co-authored-by: UIUX Designer <uiux@molecule-ai.local>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> |
||
|
|
f31051be14
|
fix(a11y): raise ChannelsTab help text from 9px to 11px minimum (#382)
Two helper paragraphs in ChannelsTab.tsx used text-[9px] text-zinc-600: - Chat IDs discover hint (line 254) - Allowed Users hint (line 281) 9px fails WCAG 1.4.3 by size alone; zinc-600 on zinc-800/900 bg is ~2.6:1 contrast (fails AA). Changed to text-[11px] text-zinc-500 (~3.8:1 at 11px — acceptable for non-body helper text). Found in UX audit Run 13 (2026-04-16). Co-authored-by: UIUX Designer <uiux@molecule-ai.local> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> |