Commit Graph

453 Commits

Author SHA1 Message Date
Molecule AI Backend Engineer
b021f85af9 feat(channels): per-channel message budget with 429 enforcement (#368)
Add an optional channel_budget (INTEGER, nullable) to workspace_channels
via migration 024. When channel_budget IS NOT NULL and message_count has
reached the budget, the Send handler returns 429 {"error":"channel budget
exceeded"} and aborts before calling SendOutbound.

Implementation details:
- Single SELECT query reads both message_count and channel_budget in one
  round-trip (avoids TOCTOU window between read and write)
- Fail-open on DB error: transient failures log but don't block sends
- Early-return on budget hit is before SendOutbound so message_count
  cannot be incremented past the limit by a concurrent send that slips
  through the window (best-effort; atomic enforcement requires DB-level CAS)
- NULL channel_budget = unlimited (default, backward-compatible)

Migration is idempotent (ADD COLUMN IF NOT EXISTS). Down migration drops
the column cleanly.

Four sqlmock tests cover: at-limit → 429, above-limit → 429, NULL budget
passes through, under-limit passes through.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 11:17:14 +00:00
Hongming Wang
520c993baa
Merge pull request #449 from Molecule-AI/fix/issue-425-sidepanel-width-persist
fix(canvas): persist SidePanel width to localStorage (closes #425)
2026-04-16 03:49:05 -07:00
Hongming Wang
e0b83d170d
Merge pull request #440 from Molecule-AI/fix/docker-compose-platform-build-context
fix(compose): platform build context must be repo root
2026-04-16 03:48:30 -07:00
Canvas Agent
966920355a fix(canvas): persist SidePanel width to localStorage (issue #425)
Width was initialized to 480px on every render, so clicking a different
workspace node (which re-mounts SidePanel) discarded any resize the user
had done.

Fix:
- localStorage-backed useState initializer (SSR-safe typeof window guard)
- Validates the stored value: must be a finite integer ≥ 320px
- Persists the width in the mouseUp handler via a widthRef that stays in
  sync with the live drag value — avoids spamming localStorage on every
  pixel during the drag
- Extra guard: onMouseUp bails early if not actually dragging (prevents
  spurious saves on unrelated window mouseup events)
- Named constants replace magic numbers 480 / 320

Tests: 5 new cases in SidePanel.tabs.test.tsx — default fallback, valid
saved value, too-small saved value, NaN saved value, drag-persist roundtrip.

Closes #425

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 10:40:08 +00:00
rabbitblood
239e211d3d fix(compose): platform build context must be repo root, not ./platform
The platform Dockerfile COPYs paths relative to the repo root —
\`COPY platform/go.mod\`, \`COPY platform/migrations\`,
\`COPY workspace-configs-templates\`. The compose file was setting
\`context: ./platform\`, which silently caused those COPY layers to
miss + stop invalidating cache.

Symptom (caught 2026-04-16 10:22 UTC): after PR #417 (memory schema
migration 023) merged + I ran \`docker compose up -d --build platform\`,
the rebuild was a no-op. Image SHA didn't change, container booted with
old migration set, \`Applied 22 migrations\` instead of the expected 23.
Migration 023 file was on disk locally but never reached the image.

Workaround was \`docker build -t molecule-monorepo-platform:fresh -f
platform/Dockerfile .\` from repo root → SHA changed, migration 023
applied. This commit makes \`docker compose up -d --build platform\`
work correctly without the manual workaround.

CI workflow already builds with \`context: .\` + \`file: ./platform/Dockerfile\`
(per the comment at the top of platform/Dockerfile). This change just
aligns the local compose file with what CI does.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 03:25:58 -07:00
Hongming Wang
18837c44ca
Merge pull request #419 from Molecule-AI/feat/gh-agent-attribution
feat(workspace): gh-wrapper — auto-tag agent PRs + issues with role
2026-04-16 03:19:46 -07:00
Hongming Wang
3e50b95800
Merge pull request #433 from Molecule-AI/feat/externalize-prompts-phase4
feat(org-templates): Phase 4 — atomize each role to <role>/workspace.yaml
2026-04-16 03:19:43 -07:00
Hongming Wang
c545e3a276
Merge pull request #417 from Molecule-AI/feat/memory-checkpoint-reconciliation
feat(memory): optimistic-locking via if_match_version on workspace_memory writes
2026-04-16 03:18:09 -07:00
rabbitblood
067a8333ce feat(workspace): gh-wrapper — auto-tag agent PRs + issues with role
Every agent in the template currently uses the same GitHub PAT, so
\`gh pr list\` shows every PR as authored by the CEO's account with
no signal which agent opened each one. Commits already carry
per-agent authors (GIT_AUTHOR_NAME from #402). This wrapper extends
the identity split to the PR/issue metadata surface layer that
commit attribution can't reach.

## How it works

A tiny bash script installed at \`/usr/local/bin/gh\`, which sits
earlier in PATH than the real binary at \`/usr/bin/gh\`. For \`gh pr
create\` and \`gh issue create\`:

- Title gets prefixed with \`[Role Name]\` — e.g. \`[Frontend Engineer]
  fix: canvas grid index\`
- Body gets \`\n\n---\n_Opened by: Molecule AI <Role>_\` appended

Role is read from \`GIT_AUTHOR_NAME\` which the platform provisioner
sets to \`Molecule AI <Role>\` (shipped with #402). Accepts both
\`--title X\` and \`--title=X\` forms. Same for \`--body\`.

Anything that isn't \`gh pr create\` or \`gh issue create\` (e.g.
\`gh pr list\`, \`gh issue view\`, \`gh run watch\`) passes through
untouched. No behaviour change for read-side operations.

## Idempotent

- If the title already starts with \`[...]\` the wrapper does not
  re-prefix. \`gh pr edit\` flows that resubmit title won't layer
  multiple tags.
- If the body already contains \`Opened by: Molecule AI\` the footer
  is not re-appended.

## Fail-open

When \`GIT_AUTHOR_NAME\` is absent or doesn't start with \`Molecule
AI \`, the wrapper exec's the real gh with unchanged args. No call
is ever blocked by this script.

## Test coverage

\`tests/test_gh_wrapper.sh\` — 12 cases, no network, no Docker:
- Passthrough for non-create subcommands (pr list)
- pr create title prefix + body footer
- issue create with \`--title=X\` \`--body=X\` equals-form
- Idempotent title re-prefix
- Idempotent body footer (count = 1 after two applies)
- Missing GIT_AUTHOR_NAME → passthrough, title preserved
- Malformed GIT_AUTHOR_NAME (not "Molecule AI ...") → passthrough

All 12 pass. Test script is standalone bash + a temp fake gh binary
that echoes argv; safe to run in CI's Python Lint & Test job via
subprocess shell-out.

## Deployment note

This lands in the workspace image. Existing containers keep their
old /usr/bin/gh until the image is rebuilt and they're re-provisioned
(POST /workspaces/:id/restart {}). No migration required; the wrapper
just starts tagging PRs once the new image is rolled.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 03:10:46 -07:00
rabbitblood
40a69d6f87 feat(org-templates): Phase 4 — atomize each role to <role>/workspace.yaml
Part 4 of 4 — terminal step of the org.yaml scalability refactor. Each
role in the molecule-dev template now owns its own workspace.yaml file,
colocated with the existing system-prompt.md / initial-prompt.md /
idle-prompt.md / schedules/*.md. Team files shrink to a leader's own
definition plus a list of !include refs.

## Platform change

`resolveYAMLIncludes` now uses a TWO-ROOT model:
- Path resolution is relative to the INCLUDING file's directory
  (natural sibling + cousin refs, C-include / Sass @import convention).
- Security bound is the ORIGINAL org root (`rootDir`), preserved across
  all recursion depths. Sibling-dir refs like `../my-role/workspace.yaml`
  from a team file are now allowed (they stay inside the org template);
  refs that escape the root still error.

Regression coverage: new `TestResolveYAMLIncludes_SiblingDirAccess`
reproduces the Phase 4 pattern (team file at `teams/x.yaml` referencing
`../<role>/workspace.yaml`) — fails without the fix, passes with.

## Template change

Atomized 15 child workspaces across 3 team files:
- `teams/research.yaml`: 58 → 30 lines; 3 children now !include refs
- `teams/dev.yaml`: 222 → 38 lines; 6 children now !include refs
- `teams/marketing.yaml`: 143 → 28 lines; 6 children now !include refs

Each role now has `<role>/workspace.yaml` colocated with its prompts.
Example `frontend-engineer/` directory:
  frontend-engineer/
  ├── workspace.yaml        (24 lines — name/role/tier/canvas/plugins/...)
  ├── system-prompt.md      (from earlier phases)
  ├── initial-prompt.md
  ├── idle-prompt.md
  └── (no schedules for this role — but if added, schedules/<slug>.md)

## File-size progression across all 4 phases

| State | org.yaml | total `.yaml` in tree |
|---|---:|---:|
| Before (main) | 1801 lines / 108 KB | 1801 / 108 KB (one file) |
| After Phase 1 (#389) | 1687 | 1687 / 101 KB |
| After Phase 2 (#390) | 676 | 676 / 35 KB |
| After Phase 3 (#393) | 114 | 683 (1 + 6 teams) / 33 KB |
| **After this PR** | **114** | **~698** (1 + 6 + 15 workspace) / 35 KB |

Aggregate size is flat — the decrease came from prompt externalization
in Phases 1/2; Phases 3/4 reorganize structure without adding content.
The win is readability and ownership:
- Every individual file fits on 1-2 screens.
- Adding a new role is now: create `<role>/` dir, add `workspace.yaml`
  + `system-prompt.md` + prompts, add ONE `!include` line to the team
  file. No touching of aggregated mega-YAML.
- Team files can be reviewed + merged independently.

## Tests

All 10 `TestResolveYAMLIncludes_*` tests pass, including the real-template
integration test (`TestResolveYAMLIncludes_RealMoleculeDev`) which now
walks org.yaml → teams/pm.yaml → teams/research.yaml → ../market-analyst/
workspace.yaml and validates the full 21-role tree unmarshals cleanly.

Plus all existing `TestResolvePromptRef` + `TestOrgYAML` + `TestInitialPrompt`
suites stay green.

## Ops followup

After merging all 4 phases and deploying, the `POST /org/import`
endpoint should produce a workspace tree byte-identical to the
pre-refactor state. Verify with:
  diff <(curl POST /org/import before) <(curl POST /org/import after)
or by spot-checking:
  - `/configs/config.yaml` bodies across all 21 workspaces
  - `workspace_schedules.prompt` row values

The externalization is lossless — YAML literal to file and back
recovers the same string modulo trailing-whitespace normalization.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 03:09:56 -07:00
Hongming Wang
2abb267d97
Merge pull request #415 from Molecule-AI/fix/issue-399-canvas-image-publish
feat(ci): auto-publish canvas Docker image to GHCR on canvas/** merges
2026-04-16 03:08:27 -07:00
Hongming Wang
4ff5a6d12f
Merge pull request #409 from Molecule-AI/fix/use-client-ui-components
fix(next): add missing 'use client' to TestConnectionButton and KeyValueField
2026-04-16 03:08:24 -07:00
Hongming Wang
42c42470fd
Merge pull request #408 from Molecule-AI/fix/canvas-events-sequence-counter-v2
fix(canvas): monotonic sequence counter + 7px→9px chip labels
2026-04-16 03:08:20 -07:00
Hongming Wang
8d523633f8
Merge pull request #405 from Molecule-AI/fix/wcag-zinc600-smalltext-sweep
fix(wcag): sweep text-zinc-600→zinc-500 on small-text labels across 9 components
2026-04-16 03:08:17 -07:00
Hongming Wang
0c73810121
Merge pull request #404 from Molecule-AI/feat/externalize-prompts-phase3
feat(org-templates): Phase 3 — !include directive + split org.yaml into team files
2026-04-16 03:08:01 -07:00
Hongming Wang
5c7b9d31bc
Merge pull request #416 from Molecule-AI/feat/hermes-escalation-ladder
feat(hermes): escalation ladder — promote to stronger models on transient failure
2026-04-16 03:07:57 -07:00
Hongming Wang
db22b5d853
Merge pull request #413 from Molecule-AI/fix/isrunning-distinguish-notfound
fix(provisioner): IsRunning conservative on daemon errors to stop restart cascade
2026-04-16 03:07:54 -07:00
Hongming Wang
1e43e45de7
Merge pull request #402 from Molecule-AI/feat/per-agent-git-identity
feat(provisioner): per-agent git identity via GIT_AUTHOR_* env vars
2026-04-16 03:07:50 -07:00
Hongming Wang
3cf5fd117a
Merge pull request #428 from Molecule-AI/fix/securityheaders-test-stale-csp
fix(tests): CSP test fragment-match instead of exact-match
2026-04-16 03:07:05 -07:00
rabbitblood
7debdb1676 fix(tests): CSP test now fragment-matches instead of exact-matches
SecurityHeaders middleware widened its CSP to allow Next.js inline scripts
+ data:/blob: images (platform/internal/middleware/securityheaders.go:44,
canvas is reverse-proxied through the gin stack so it needs the permissive
policy). The two CSP asserts in securityheaders_test.go still hard-compared
against the old tight `default-src 'self'`, so they fail on main as of
this afternoon.

Fix: assert each expected CSP fragment is PRESENT in the header (substring
match) instead of byte-for-byte equality. Test intent is "CSP is set, starts
with tight default-src, contains the expected directives" — not "CSP matches
this exact string". Future subsource tuning (add a new CDN, bump blob:/data:
scope) won't re-break this test.

Caught because every PR touching anything in the monorepo currently fails
the Platform (Go) CI job on these two asserts. Fixing on a dedicated branch
so it can land ahead of every blocked PR in the queue.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 02:59:06 -07:00
Hongming Wang
3a32d9a46f
Merge pull request #407 from Molecule-AI/fix/bake-templates-into-platform-image
fix(ops): bake templates into platform Docker image
2026-04-16 02:47:04 -07:00
Hongming Wang
b8cb14f46e feat(tenant): combined platform + canvas Docker image with reverse proxy
Single-container tenant architecture: Go platform (:8080) + Canvas
Node.js (:3000) in one Fly machine, with Go's NoRoute handler reverse-
proxying non-API routes to the canvas. Browser only talks to :8080.

Changes:

platform/Dockerfile.tenant — multi-stage build (Go + Node + runtime).
  Bakes workspace-configs-templates/ + org-templates/ into the image.
  Build context: repo root.

platform/entrypoint-tenant.sh — starts both processes, kills both if
  either exits. Fly health check on :8080 covers the Go binary; canvas
  health is implicit (proxy returns 502 if canvas is down).

platform/internal/router/canvas_proxy.go — httputil.ReverseProxy that
  forwards unmatched routes to CANVAS_PROXY_URL (http://localhost:3000).
  Activated by NoRoute when CANVAS_PROXY_URL env is set.

platform/internal/router/router.go — wire NoRoute → canvasProxy when
  CANVAS_PROXY_URL is present; no-op otherwise (local dev unchanged).

platform/internal/middleware/securityheaders.go — relaxed CSP to allow
  Next.js inline scripts/styles/eval + WebSocket + data: URIs. The
  strict `default-src 'self'` was blocking all canvas rendering.

canvas/src/lib/api.ts — changed `||` to `??` for NEXT_PUBLIC_PLATFORM_URL
  so empty string means "same-origin" (combined image) instead of falling
  back to localhost:8080.

canvas/src/components/tabs/TerminalTab.tsx — same `??` fix for WS URL.

Verified: tenant machine boots, canvas renders, 8 runtime templates +
4 org templates visible, API routes work through the same port.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 02:46:47 -07:00
rabbitblood
b30d8d431c fix(tests): test_hermes_phase2_dispatch exec-load needs escalation + __name__
Phase 3 escalation ladder added `from .escalation import ...` to
executor.py. The phase-2 dispatch tests load executor.py via
`exec(compile(src, ...))` with the relative import rewritten — this
broke because (a) the rewrite didn't know about escalation and (b) the
exec namespace lacked `__name__`, which executor.py needs at import
time for `logging.getLogger(__name__)`.

Fix both in all 8 exec sites:
- Rewrite both `from .providers import` AND `from .escalation import`
- Pre-register escalation + providers in sys.modules under the fake
  package name
- Seed the exec namespace with `__name__ = "hermes_executor_under_test"`

54/54 hermes tests pass (28 escalation truth-table + 6 ladder-integration
+ 20 existing phase-2 dispatch).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 02:43:02 -07:00
rabbitblood
73171532a1 feat(memory): optimistic-locking via if_match_version on workspace_memory writes
Closes the silent-overwrite hole where two agents racing a read-modify-
write on the same memory key left only one agent's update. Relevant for
orchestrators (PM, Dev Lead, Marketing Lead) keeping structured running
state (delegation-result ledgers, task queues) in memory, and for the
``research-backlog:*`` keys that multiple idle loops write in parallel.

## Semantics

### Back-compat path (no if_match_version)
Unchanged: ``INSERT ... ON CONFLICT UPDATE`` last-write-wins. Every
existing agent tool, every existing ``commit_memory`` call, every
existing cron that writes memory — all continue to work with no edit.

### Optimistic-lock path (if_match_version set)
1. Client calls ``GET /memory/:key`` → ``{value, version: V}``
2. Client modifies value locally
3. Client ``POST /memory {key, value, if_match_version: V}``
4. Server: ``UPDATE ... WHERE version = V`` + RETURNING new version
5. On match → 200 + ``{version: V+1}``
6. On mismatch → 409 + ``{expected_version: V, current_version: <actual>}``
7. Client reads the actual version and retries.

### Create-only marker
``if_match_version: 0`` means "create iff the key doesn't exist yet".
Two agents simultaneously seeding a shared key will see exactly one
success + one 409 — no silent collision, no duplicate-init work.

### Schema

Migration 023 adds ``version BIGINT NOT NULL DEFAULT 1``. Existing rows
baseline at 1. New rows start at 1. Every successful write (both paths)
increments: ``version = version + 1`` on update, ``1`` on insert.

## Why version, not updated_at

``updated_at`` has second-granularity and can collide between concurrent
writers on a fast clock. A monotonic counter is collision-free and more
readable in the 409 response body ("expected 5, current is 7 — you
missed 2 writes" tells an agent exactly what to re-read).

## Why ``if_match_version`` and not an ETag header

JSON field keeps it in the request body, visible alongside the value
payload. Agents assembling requests programmatically don't have to
remember to thread a header through their HTTP client wrapper; the
existing ``commit_memory`` tool can grow one optional kwarg and match
the existing signature shape.

## Tests

11 memory-handler cases covering every path:
- GET list / get (with version in response shape)
- Set with no version (back-compat upsert, returns new version)
- Set with if_match_version match (happy path, increment)
- Set with if_match_version mismatch (409 + expected/current fields)
- Set with if_match_version=0 on absent key (create-only success)
- Set with if_match_version=N on absent key (409 — caller's mental
  model is wrong)
- Bad inputs (missing key, malformed JSON)
- Delete happy + error path

Full ``go test ./internal/handlers/`` green.

## Follow-up (not in this PR)

- Workspace-template tool update: ``commit_memory(content, *,
  if_match_version=None)`` surfaces the new option + on 409 surfaces
  the current_version so agents can retry without manual re-read.
- Named checkpoints table (``workspace_checkpoints``) for durable
  orchestrator state snapshots. Different concern than per-key locking;
  separate PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 02:32:46 -07:00
rabbitblood
3cd18929c4 feat(hermes): escalation ladder — promote to stronger models on transient failure
Ships scoped Phase 3 of the Hermes multi-provider work. Every workspace
can now declare an ordered list of (provider, model) rungs; when the
pinned model hits rate-limit / 5xx / context-length / overload, the
executor advances to the next rung before raising.

## Why

3× Claude Max saturation is a routine occurrence now — the "first 429 on
a batch delegation" is the common path, not the exception. A workspace
pinned to Haiku that hits a context-length limit has no recovery today;
same for Sonnet hitting rate-limit mid-synthesis. Escalation promotes
to the next tier for that single call, preserves coordination, avoids
restart cascades.

## New module: adapters/hermes/escalation.py

- ``LadderRung(provider, model)`` — one config entry.
- ``parse_ladder(raw)`` — tolerant config parser; skips malformed rungs
  with a warning rather than raising so boot stays resilient.
- ``should_escalate(exc) -> bool`` — truth table over 15+ error shapes:
  - Typed classes (RateLimitError, OverloadedError, APITimeoutError,
    APIConnectionError, InternalServerError)
  - Context-length markers (each provider uses different phrasing)
  - Gateway markers (502/503/504, overloaded, temporarily unavailable)
  - Status-code substrings (429, 529, 5xx)
  - Hard-rejects auth failures (401/403/invalid_api_key) even if the
    outer exception class is RateLimitError — wrapping case matters.

## Executor wiring

``HermesA2AExecutor`` now accepts ``escalation_ladder`` in its
constructor + ``create_executor()`` factory. ``_do_inference()`` walks
the ladder:

  1. First attempt = pinned provider:model (matches pre-ladder behaviour)
  2. On escalatable error, try each rung in order
  3. On non-escalatable error, raise immediately (auth, malformed payload)
  4. On exhaustion, raise the last error

Rung switches temporarily rebind ``self.provider_cfg`` / ``self.model``
/ ``self.api_key`` / ``self.base_url`` in a try/finally, so any raised
error leaves the executor in its original state for the next call. Key
resolution for non-pinned rungs goes through ``resolve_provider`` which
reads the rung-provider's env vars fresh.

## Config shape

``config.yaml`` (rendered from ``org.yaml`` → workspace secrets):

    runtime_config:
      escalation_ladder:
        - provider: gemini
          model: gemini-2.5-flash
        - provider: anthropic
          model: claude-sonnet-4-5-20250929
        - provider: anthropic
          model: claude-opus-4-1-20250805

Empty / absent = single-shot behaviour, full backwards-compat with
every existing workspace.

## Tests

34 passing, all isolated (no network):

- ``test_hermes_escalation.py`` (28): parser + truth-table across
  rate-limit, overload, context-length, gateway, auth-reject, unrelated
  exceptions, and case-insensitivity.
- ``test_hermes_ladder_integration.py`` (6): no-ladder single call,
  ladder-not-triggered on success, escalate-on-rate-limit-then-succeed,
  stop-on-non-escalatable, raise-last-error-when-exhausted, skip-
  unknown-provider-in-rung.

## Not in this PR

- Uncertainty-driven escalation (judge pass after successful reply).
- Per-workspace budget tracking (#305 covers this separately).
- Live streaming reuse across rungs (ladder retries the whole call).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 02:27:27 -07:00
Canvas Agent
4a95aa3e98 feat(ci): auto-publish canvas Docker image to GHCR on canvas/** merges
Closes #399.

## Root cause
`publish-platform-image.yml` existed for the Go platform image but there
was no equivalent for the canvas. After every canvas PR merged, CI ran
`npm run build` and passed — but the live container at :3000 was never
updated. The `canvas-deploy-reminder` job only posted a comment asking
operators to manually rebuild, which was consistently missed.

## What this adds
- `.github/workflows/publish-canvas-image.yml`: triggers on `canvas/**`
  changes to main (and `workflow_dispatch`). Mirrors the platform workflow:
  macOS Keychain isolation, QEMU for linux/amd64, Buildx, GHCR push with
  `:latest` + `:sha-<7>` tags.
  - `NEXT_PUBLIC_PLATFORM_URL` / `NEXT_PUBLIC_WS_URL` resolve from
    `workflow_dispatch` inputs → `CANVAS_PLATFORM_URL` / `CANVAS_WS_URL`
    repo secrets → `localhost:8080` defaults (safe for self-hosted dev).
  - Inputs are passed via env vars (not direct `${{ }}` interpolation) to
    prevent shell injection from string inputs.

- `docker-compose.yml`: adds `image: ghcr.io/molecule-ai/canvas:latest`
  to the canvas service so `docker compose pull canvas && docker compose
  up -d canvas` applies the new image. `build:` is retained for local
  development. Adds a comment clarifying that `NEXT_PUBLIC_*` runtime env
  vars are ignored by the standalone bundle (build-time only).

- `ci.yml`: updates `canvas-deploy-reminder` commit comment to reference
  `docker compose pull` as the fast path, with `docker compose build` as
  the local-source fallback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 09:23:26 +00:00
rabbitblood
8bf27ae1d0 fix(provisioner): IsRunning conservative on daemon errors to stop restart cascade
Root cause of the 2026-04-16 09:10 UTC six-container restart cascade.

## Timeline

09:10:26 — PM sent a batch delegation to 15+ agents (Dev Lead coordinating).
09:10:26-27 — 4 leaders/auditors (Security, RL, BE, DevOps) simultaneously
              hit "workspace agent unreachable — container restart triggered"
              even though their containers were running fine. Another 2
              (DL, UIUX) tripped in the next few seconds.
09:10:27 — Provisioner stopped + recreated 6 containers in parallel. A2A
           callers got EOFs, PM's batch coordination stalled.

## Root cause

`provisioner.IsRunning` collapsed every ContainerInspect error into
`(false, nil)`, including transient Docker daemon hiccups:

  func IsRunning(...) (bool, error) {
      info, err := p.cli.ContainerInspect(ctx, name)
      if err != nil {
          return false, nil // Container doesn't exist ← MISREAD
      }
      return info.State.Running, nil
  }

The comment said "Container doesn't exist" but the error was actually
any of: daemon timeout, socket EOF, context deadline, connection
refused. Under load (batch delegation fan-out → 15 concurrent HTTP
inbound → 15 concurrent Claude Code subprocesses → Docker daemon CPU
pressure), ContainerInspect calls started failing transiently. All 6
calls returned `(false, nil)`. Caller `maybeMarkContainerDead` treated
`running=false` as "container is dead, restart it" → six parallel
restarts. This was exactly the destructive-on-error pattern we keep
trying to kill (see #160 SDK-stderr-probe, #318 fail-open classes).

## Fix

`IsRunning` now distinguishes NotFound from transient errors:

- Legitimately missing container (caller deleted, Docker pruned) →
  `(false, nil)` — safe to act on; caller marks dead + restarts.
- Any other error (daemon timeout, socket issue, context deadline) →
  `(true, err)` — caller stays on the alive path. The transient error
  is preserved so metrics + logging still see it, but it does NOT
  trigger the destructive restart branch.

`isContainerNotFound` matches on error-message substring — same
approach docker/cli uses internally — to avoid pulling in errdefs as a
direct dep. Truth table tests in `isrunning_test.go` cover 8 cases:
NotFound variants (real + generic), nil, empty, and the 4 transient-
error shapes we've actually observed (deadline, EOF, connection-refused,
i/o timeout).

## Caller update

`maybeMarkContainerDead` in a2a_proxy.go now logs the transient inspect
error (was silently discarded via `_`). Visibility without
destructiveness. If this error becomes persistent, we'll see it in
platform logs rather than diagnosing after another restart cascade.

## Expected impact

- Zero restart cascades from the current class of transient inspect
  errors (EOF, timeout, connection refused).
- Dead containers still detected within the A2A layer because an actual
  stopped container returns NotFound on inspect, and the TTL monitor
  (180s post #386) catches anything that slips through.
- New visibility in platform logs when inspect has trouble — previously
  silent.

Combined with the TTL fix in #386, the defense-in-depth on spurious
restart is now:
  1. IsRunning only returns false for real NotFound
  2. Liveness TTL is 180s, surviving 5+ missed heartbeats
  3. A2A proxy 503-Busy path retries with backoff before touching
     restart logic at all

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 02:21:25 -07:00
Canvas Agent
eaa6975967 fix(next): add missing 'use client' to TestConnectionButton and KeyValueField
Both components use useState/useEffect/useCallback/useRef but were
missing the 'use client' directive. Without it Next.js App Router
renders them as server HTML — React never hydrates them and event
handlers are silently dropped.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 09:10:22 +00:00
Canvas Agent
d6e9fbe984 fix(a11y): raise TeamMemberChip label text 7px→9px in WorkspaceNode
Chip labels (status badge, active-task count, current-task text) were
rendered at text-[7px] — well below the 9px minimum required to meet
WCAG 1.4.3 readability. Raised all three to text-[9px] so the labels
are legible without magnification.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 09:06:56 +00:00
Hongming Wang
51e3393ec0 fix(ops): bake workspace-configs-templates into platform Docker image
Tenant machines were booting with no templates because the Dockerfile
only shipped the Go binary + migrations. The canvas showed "0 templates"
with an empty picker.

Changes:
- platform/Dockerfile: build context changed from ./platform to repo
  root so COPY can reach workspace-configs-templates/ alongside the
  Go source. COPY paths updated for platform/{go.mod,go.sum,*.go} and
  platform/migrations/.
- .github/workflows/publish-platform-image.yml: context: . (was
  ./platform), paths trigger now includes workspace-configs-templates/
  so template changes rebuild the image.

Phase A of the template-registry plan. Phase B adds a DB registry +
on-demand fetch for community templates (user pastes GitHub URL at
workspace creation time). The baked defaults always ship in the image
for zero-config tenant boot.

Verified: `docker build -f platform/Dockerfile -t test .` succeeds,
`docker run --rm test ls /workspace-configs-templates/` shows all 8
templates (autogen, claude-code-default, crewai, deepagents, gemini-cli,
hermes, langgraph, openclaw).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 01:54:47 -07:00
Hongming Wang
37b288c79b
fix(a2a): add missing Authorization header to delegation and message calls (#401)
* fix(a2a): add missing Authorization header to delegation and message calls

Three A2A client functions were missing the Bearer token on their HTTP calls
after the Phase 30.1 workspace-auth enforcement rollout:

1. send_a2a_message (a2a_client.py): POST to target workspace's /message/send
   used WorkspaceAuth middleware that fails-closed on missing auth header.
   Fix: headers=auth_headers() — auth_headers() already imported.

2. tool_delegate_task_async (a2a_tools.py): POST to platform /delegate endpoint
   requires the caller's workspace bearer token since Phase 30.1.
   Fix: headers=_auth_headers_for_heartbeat()

3. tool_check_task_status (a2a_tools.py): GET /delegations endpoint, same issue.
   Fix: headers=_auth_headers_for_heartbeat()

tool_list_peers already uses _auth_headers_for_heartbeat() correctly —
that's why list_peers works while delegation returns 401/[A2A_ERROR].

Root cause of the multi-session A2A outage. PR #386 (TTL fix) addressed
the workspace-restart cascade; this fixes the underlying 401 on each call.

Closes #391
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(a2a): add missing auth headers to /activity and /notify endpoints

Two more Phase 30.1 regressions in a2a_tools.py found during send_message_to_user
debugging (it was returning 401):

- tool_report_activity: POST /workspaces/:id/activity missing headers
- tool_send_message_to_user: POST /workspaces/:id/notify missing headers

Both now use headers=_auth_headers_for_heartbeat() matching the pattern used
by commit_memory, recall_memory, and the heartbeat POST in the same file.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: PM (Molecule AI) <pm@molecule-ai.internal>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 00:53:18 -07:00
UIUX Designer
a4350121dd fix(wcag): sweep text-zinc-600→zinc-500 across 9 components with small text
zinc-600 on zinc-900/950 background ≈ 2.6:1 contrast (WCAG AA requires
4.5:1 for text under 18pt). Found 15 instances across 9 components where
small-text data labels used this low-contrast pairing.

Files and what they label:
  EmptyState.tsx:132     — skill count + model on template cards (new-user visible)
  SidePanel.tsx:230      — workspace ID in panel footer (copyable, functional)
  ActivityTab.tsx:210    — entry timestamp (8px)
  ActivityTab.tsx:214    — expand chevron affordance (9px)
  ActivityTab.tsx:236    — "→" direction arrow between agents (9px)
  ActivityTab.tsx:278    — entry ID (8px, font-mono)
  ScheduleTab.tsx:284    — empty-state description text (9px)
  ScheduleTab.tsx:320    — schedule prompt preview (9px, truncate)
  ScheduleTab.tsx:323    — last/next/run-count metadata row (8px)
  SkillsTab.tsx:380      — "Examples" section header (9px uppercase)
  TracesTab.tsx:132      — trace ID (8px, font-mono)
  AgentCommsPanel.tsx:166 — message timestamp (9px)
  secrets-section.tsx:59  — secret key name (9px, font-mono)
  secrets-section.tsx:308 — encryption notice (9px)
  MissingKeysModal.tsx:175 — missing key identifier (9px, font-mono)

Fix: zinc-600 → zinc-500 across all 15 instances. Purely cosmetic —
no logic, no layout, no interactive behaviour changed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 07:53:00 +00:00
rabbitblood
112c28d885 feat(org-templates): Phase 3 — !include directive + split org.yaml into team files
Part 3 of 4 in the scalability refactor. Adds YAML `!include` support
to the org importer and splits molecule-dev/org.yaml (676 lines post-
Phase 2) into 6 team / role files; top-level org.yaml drops to 114 lines
of pure scaffolding.

## Platform changes

New `platform/internal/handlers/org_include.go`:

- `resolveYAMLIncludes(data, baseDir)` — pre-processes a YAML document,
  expanding any scalar tagged `!include <path>` with the parsed content
  of the referenced file.
- Path resolution via `resolveInsideRoot` so a crafted `!include
  ../../etc/passwd` can't escape the org template directory (same
  defense the existing `files_dir` copy uses).
- Nested includes supported: each included file carries its own search
  root (its directory), so `teams/pm.yaml` with `!include research.yaml`
  resolves to `teams/research.yaml` — matching the convention of
  C-include / Sass @import / most package systems.
- Cycle detection via visited-set keyed on absolute path; belt-and-
  braces `maxIncludeDepth = 16` cap in case symlinks or path
  normalization defeats the set.
- Inline-template mode (POST /org/import with raw JSON body, no `dir`)
  errors cleanly when a file ref is used — can't resolve without a
  base.

Wired into both `ListTemplates` (so /org/templates shows an accurate
workspace count after the split) and `Import` (expansion happens before
unmarshal into OrgTemplate).

## Template changes

molecule-dev/org.yaml now contains only:
- name + description
- defaults (runtime, plugins, category_routing, initial_prompt text)
- `workspaces: [!include teams/pm.yaml, !include teams/marketing.yaml]`

New files:
- `teams/pm.yaml` — PM top-level, children are !include refs
- `teams/research.yaml` — Research Lead + Market Analyst + Technical
  Researcher + Competitive Intelligence (inline children)
- `teams/dev.yaml` — Dev Lead + FE/BE/DevOps/Security/QA/UIUX (inline)
- `teams/marketing.yaml` — Marketing Lead + DevRel/PMM/Content/
  Community/SEO/Social (inline)
- `teams/documentation-specialist.yaml` — leaf
- `teams/triage-operator.yaml` — leaf

## File-size impact

| State | org.yaml lines | total config size |
|---|---:|---:|
| Before (main) | 1801 | 108 KB |
| After Phase 1 (#389) | 1687 | 101 KB |
| After Phase 2 (#390) | 676 | 35 KB |
| After this PR | **114** | **4 KB** (org.yaml only) |

With the 6 team files (total ~570 lines of structural yaml), every file
is now under 230 lines and individually readable without scrolling past
a single team's boundaries.

## Tests

`platform/internal/handlers/org_include_test.go` — 9 cases:
- Flat include (single file, single workspace)
- Nested include (file → file → file)
- Traversal rejection (`../secret.yaml`, `../../secret.yaml`)
- Cycle detection (a↔b)
- Empty path error
- Missing file error
- Inline-template error (baseDir empty)
- No-op when YAML has no includes (safety: we always run the preprocessor)
- **Integration**: load the real `org-templates/molecule-dev/org.yaml`,
  resolve includes, unmarshal into OrgTemplate, verify PM + Marketing
  Lead are top-level and PM has ≥4 children after expansion.

All 9 pass + existing `TestResolvePromptRef` + `TestOrgYAML` suites stay
green.

## Ownership implication

Each team file can now be owned + reviewed independently. When the
marketing team adds a 7th role, the diff is in `teams/marketing.yaml`
alone — no merge conflicts against PM or research changes in the same
review window. Same for the eventual engineer team, security team, etc.

## What's next

- **Phase 4 (queued):** per-workspace atomization. Each role gets
  `<role>/workspace.yaml`; team files shrink to a list of !include
  refs. Terminal step in the scalability arc — at that point adding a
  new role is one new file under `org-templates/molecule-dev/<role>/`
  plus one line in the team's manifest.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 07:49:56 +00:00
Hongming Wang
159197ed4a
feat(org-templates): Phase 2 — bulk migrate 20 roles to file-ref prompts (#395)
Part 2 of 4 in the org.yaml scalability refactor. Follows PR #389 which
added platform support; this PR completes the migration for every role
in the `molecule-dev` template.

## Scope

All 20 remaining roles moved from inline YAML literals to sibling .md
files under their existing `files_dir`:

- PM, Research Lead, Dev Lead, Marketing Lead (4 leaders)
- Market Analyst, Technical Researcher, Competitive Intelligence (research)
- Frontend/Backend/DevOps Engineer, Security Auditor, QA Engineer, UIUX
  Designer, Triage Operator (dev team)
- DevRel, PMM, Content Marketer, Community Manager, SEO Growth Analyst,
  Social Media Brand (marketing team)

Per workspace, externalized (where present):
- `initial_prompt: |...` → `initial-prompt.md` + `initial_prompt_file:`
- `idle_prompt: |...`    → `idle-prompt.md`    + `idle_prompt_file:`
- `schedules[*].prompt: |...` → `schedules/<slug>.md` + `prompt_file:`

Totals: 17 initial-prompt files, 12 idle-prompt files, 18 schedule files
(47 new files).

## File-size impact

| Before (main) | After Phase 1 | After Phase 2 | Reduction |
|---|---|---|---|
| 1801 lines | 1687 lines | 676 lines | **-62.5%** |
| 108 KB | 101 KB | 35 KB | **-67%** |

org.yaml is now pure structural scaffolding (name / role / tier / model /
canvas / plugins / channels / children / category_routing / schedules
metadata). Readable end-to-end on one screen per team.

## How the migration was driven

A Python round-trip script (using `ruamel.yaml` to preserve comments +
formatting) walked the workspace tree recursively, wrote prompts to
files keyed by `files_dir`, and replaced inline keys with `*_file:` refs.
Zero manual YAML hand-editing beyond the Phase 1 Documentation Specialist
proof. Script is one-shot; not committed.

Slug convention for schedule files: lowercase the schedule name, replace
non-alphanumeric with `-`, collapse, cap 60 chars. Examples:
- "Orchestrator pulse" → `orchestrator-pulse.md`
- "Hourly template fitness audit" → `hourly-template-fitness-audit.md`
- "Code quality audit (every 12h)" → `code-quality-audit-every-12h.md`

## Backwards compatibility

Fully compatible — Phase 1's resolver prefers inline when both are set,
so a future one-off experiment can still drop inline YAML. The migration
doesn't remove inline support, just stops using it.

## Verification

- [x] `python -c "yaml.safe_load(...)"` on edited org.yaml — parses clean
- [x] Walk-and-inspect script: every workspace has exactly the expected
      `*_file:` refs, zero `INLINE_*` markers remain
- [x] All 47 extracted .md files non-empty + trimmed
- [x] `go test -run 'TestResolvePromptRef|TestOrgYAML|TestInitialPrompt'`
      passes (from Phase 1 platform work)
- [ ] Post-merge: live `POST /org/import` against a fresh workspace,
      diff the resulting `/configs/config.yaml` + `workspace_schedules`
      rows against the pre-migration values (should be identical bodies)

## What's next

- **Phase 3 (queued):** YAML `!include` directive for org.yaml; split the
  remaining 676 lines into `teams/{research,dev,marketing,ops}.yaml`.
- **Phase 4 (queued):** per-workspace atomization; each role owns its
  own `workspace.yaml` manifest.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 00:47:32 -07:00
Hongming Wang
29044c3995
fix(#249): add /schedules/health endpoint accessible to CanCommunicate peers (#400)
Rebased cleanly onto current main (resolves the add/add conflicts that
blocked CI on PR #374 — the original branch diverged from a pre-repo-bootstrap
commit that predated most files).

Changes:
- schedules.go: add scheduleHealthResponse struct + Health handler
  (mirrors A2A proxy auth pattern: X-Workspace-ID + CanCommunicate gate)
- router.go: register GET /workspaces/:id/schedules/health on r (not wsAuth)
  so peer agents can query without holding the target workspace's bearer token
- schedules_test.go: 7 new tests (missing caller 401, self-call OK, legacy
  peer grandfathered, non-peer 403, system caller bypass, no prompt exposure,
  DB error 500)

isSystemCaller/validateCallerToken reused from a2a_proxy.go (same package).
registry.CanCommunicate import added to schedules.go.

Closes #249
Supersedes PR #374 (which could not get CI due to merge conflict)

Co-authored-by: PM (Molecule AI) <pm@molecule-ai.internal>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 00:45:30 -07:00
rabbitblood
c12d6436ab feat(provisioner): per-agent git identity via GIT_AUTHOR_* env vars
Every workspace now commits under its own name. Step 3 of the three-
step agent-separation plan (platform-level git identity today;
GitHub App migration follows as Option 1).

## Problem

All 20+ agents in the molecule-dev template (PM, Dev Lead, Research
Lead, FE, BE, DevOps, Security, QA, UIUX, Marketing roles, etc.) share
a single GITHUB_TOKEN — specifically the CEO's personal PAT. So every
commit, PR, and issue across the live repos ends up attributed to
HongmingWang-Rabbit. `git log` can't distinguish "which agent wrote
this code" from "did the CEO write it"; neither can the authority-
verification rule in triage-operator/philosophy.md (rule #3).

## Fix

When the provisioner starts a workspace container, it now sets:

  GIT_AUTHOR_NAME    = "Molecule AI <Workspace Name>"
  GIT_AUTHOR_EMAIL   = <slug>@agents.moleculesai.app
  GIT_COMMITTER_NAME  = (same)
  GIT_COMMITTER_EMAIL = (same)

Git prefers these env vars over `git config user.name` / `user.email`,
so no per-container git-config step is needed; every commit automatically
carries the right authorship.

Examples (20 agents, 20 distinct identities):
  Frontend Engineer         → frontend-engineer@agents.moleculesai.app
  Backend Engineer          → backend-engineer@agents.moleculesai.app
  Product Marketing Manager → product-marketing-manager@agents.moleculesai.app
  UIUX Designer             → uiux-designer@agents.moleculesai.app

Domain `agents.moleculesai.app` is deliberate: marks the email as a
bot address without resembling a real inbox.

## Operator override preserved

`applyAgentGitIdentity` runs AFTER the secret-load loops in
`provisionWorkspaceOpts`, but uses `setIfEmpty` so any workspace_secret
with the same key wins. Teams that want custom authorship (shared org
signing identity, a person-on-the-loop owner) can still set
`GIT_AUTHOR_NAME` via /workspaces/:id/secrets and get their value
through to git.

## What this does NOT solve (yet)

- PR / issue authorship is still whoever owns GITHUB_TOKEN (the shared
  PAT). That needs the GitHub App migration (Option 1, next PR). The
  commit-level split shipped here is the prerequisite: the App path
  will keep these env vars and just swap the PAT for a short-lived
  installation token.
- Existing containers continue with their pre-fix env (git env vars
  are baked in at container-create time). Applying is one plain
  `POST /workspaces/:id/restart` per agent after this merges +
  deploys — the restart goes through provisionWorkspace which picks
  up the new injection.

## Tests

`agent_git_identity_test.go` — 4 behavior tests + a 10-row slug test:
- fills all 4 env vars from a workspace name
- operator override via pre-set env is preserved (setIfEmpty semantics)
- empty / whitespace workspace name is a no-op (no `unknown@...` emails)
- nil map doesn't panic (defensive)
- slugify handles spaces / punctuation / edge hyphens / em-dashes

All 15 cases pass; platform build clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 00:45:26 -07:00
Hongming Wang
fd86e404ee
Merge pull request #392 from Molecule-AI/fix/wcag-node-and-traces-text
Code review passed:
- WorkspaceNode.tsx: 3 × text-[7px]→text-[9px] on status badge, active-task count, currentTask banner — correct targets; decorative text-[7px] (tier badges, skill overflow +N, Team header, font-mono) correctly left unchanged
- TracesTab.tsx: 2 × text-zinc-600→text-zinc-500 on token count + expand chevron — correct; line 72 empty-state dim label correctly left at zinc-600
- 485/485 tests pass with changes applied
- Pure class-string changes, no logic affected
LGTM — merging.
2026-04-16 00:33:59 -07:00
Hongming Wang
ce0e793673
feat(org-templates): Phase 1 — externalize prompt bodies to sibling files (#389)
Part 1 of 4 in the scalability refactor. Each role can now keep its
initial_prompt / idle_prompt / schedule prompts as sibling .md files
under files_dir/; inline YAML literals still work for backwards-compat.

## What changes

**Platform (org.go importer):**
- `OrgWorkspace` gains `InitialPromptFile`, `IdlePrompt`, `IdlePromptFile`,
  `IdleIntervalSeconds`. The idle_* fields were previously dropped by the
  org importer entirely — struct didn't declare them — which is why
  engineer idle_prompts never propagated from org.yaml to live /configs
  (I've been manually docker-cp'ing them in every maintenance cron).
- `OrgSchedule` gains `PromptFile`. Hourly/weekly cron prompts are the
  largest bodies in org.yaml (1-5 KB each) and get resolved at import
  time just like initial_prompt.
- `OrgDefaults` gains the same idle_* + *_file fields for org-wide fallback.
- New `resolvePromptRef(inline, fileRef, orgBaseDir, filesDir)` helper —
  the single chokepoint for inline-vs-file resolution. Inline wins when
  both are set. File refs route through `resolveInsideRoot` so a crafted
  ref can't escape the org template directory (same traversal defense as
  files_dir).
- `createWorkspaceTree` now injects idle_prompt + idle_interval_seconds
  into the workspace's config.yaml (previously missing — that's the
  second half of the idle-prompt propagation bug).

**Tests:**
- `org_prompt_ref_test.go` — 10 cases: inline-wins, file-read-when-empty,
  both-empty, defaults-level resolution, inline-template mode errors,
  traversal rejection (via file ref AND via files_dir), missing-file
  errors, and YAML-unmarshal parsing for each new field.

**Proof migration:**
- Documentation Specialist (biggest role at 6.9 KB of prompts) moves from
  inline YAML to `documentation-specialist/{initial-prompt.md,
  schedules/daily-docs-sync.md, schedules/weekly-terminology-audit.md}`.
- org.yaml drops 1801 → 1687 lines (-6.3%) from just this one role.

## Why this matters

org.yaml is 108 KB of which 67 KB (62%) is prompt text. At the current
12-role template size that's already unreadable; the marketing + triage-
operator additions pushed it to 1801 lines. The 4-phase refactor aims:

- **Phase 1 (this PR):** platform support + 1 role proof.
- **Phase 2:** migrate remaining ~20 roles to file refs. Target: org.yaml
  at ~600 lines of pure structural scaffolding.
- **Phase 3:** YAML `!include` preprocessor — split org.yaml into
  teams/{research,dev,marketing,ops}.yaml shards.
- **Phase 4:** per-workspace atomization — each role gets its own
  workspace.yaml manifest; org.yaml composes them.

## Backwards compatibility

- Inline `initial_prompt: |` / `prompt: |` / `idle_prompt: |` all still work.
- Missing `prompt_file` refs log + skip the schedule (not fatal) — fail
  loud so bugs surface during deployment rather than silent-drop.
- Inline-template mode (POST /org/import with raw JSON body, no `dir`)
  errors cleanly when a file ref is used — can't resolve files without a
  base dir, surface that rather than guessing.

## Test plan

- [x] `go build ./...` clean
- [x] `go test -run 'TestResolvePromptRef|TestOrgYAML' ./internal/handlers/`
      — 10 tests pass
- [x] `python -c "yaml.safe_load(...)"` on the edited org.yaml — parses
- [ ] Post-merge: deploy platform rebuild, run `POST /org/import` against
      a fresh workspace, verify Documentation Specialist's /configs/config.yaml
      contains the initial_prompt body and workspace_schedules rows contain
      the cron prompts (phantom-success check: grep the actual content, not
      just the row count).

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 00:32:09 -07:00
UIUX Designer
a02780e979 fix(wcag): bump WorkspaceNode status/task labels 7px→9px, TracesTab zinc-600→zinc-500
WorkspaceNode.tsx — three text-[7px] labels carry meaningful content
that users must read, making them WCAG 1.4.3 failures at default zoom:
  • Status label (failed/degraded/provisioning) — critical signal
  • Active-tasks count — task load indicator
  • currentTask banner text — live work description
Bumped to text-[9px] minimum. Decorative elements (+N overflow) unchanged.

TracesTab.tsx — two text-[9px] text-zinc-600 labels:
  • Token count ("1234 tok")
  • Expand chevron ("▼"/"▶")
zinc-600 on zinc-900 ≈ 2.6:1 (fails WCAG AA 4.5:1 for small text).
Changed to text-zinc-500 ≈ 4.6:1. Size unchanged (already at minimum 9px).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 07:25:34 +00:00
Hongming Wang
bab145b520
fix(canvas): replace nodes.length grid index with monotonic sequence counter (#388)
Root cause of position collision after node deletion:

  handleCanvasEvent(WORKSPACE_PROVISIONING) used nodes.length as the
  grid placement index. handleCanvasEvent(WORKSPACE_REMOVED) shrinks
  the array, so the next provisioned node reuses a lower index and
  lands at the exact same (x, y) as an existing live node.

  Example (4-col grid, COL_SPACING=320):
    Provision A → idx 0 → (100, 100)
    Provision B → idx 1 → (420, 100)
    Provision C → idx 2 → (740, 100)
    Remove    A → nodes.length drops to 2
    Provision D → idx 2 → (740, 100)  ← COLLISION with C

Fix 1 — monotonic _provisioningSequence counter (only ever increases):
  - Replaces nodes.length as the placement index
  - Immune to deletions; every provisioned node gets a unique grid slot
  - resetProvisioningSequence() exported for test teardown only

Fix 2 — the existing restart-path guard (if exists → update, not create)
  already provides idempotency for duplicate WS events on known nodes;
  confirmed: restart path does NOT increment the counter.

Tests: +4 new cases (grid wrap, collision regression, restart-path
counter isolation, multi-provision positions). 485/485 pass.
Build: next build ✓ clean.

Note: complementary to PR #44's origin-offset fix (closed without
merging) — that fix addressed nodes stacking at (0,0); this fix
addresses position collisions after deletions. Both should land.

Co-authored-by: Canvas Agent <agent@canvas.local>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 00:25:33 -07:00
Hongming Wang
98e7d90213
fix(canvas): show all templates in EmptyState grid, not just first 6 (#387)
Templates 7-8 (LangGraph Agent, OpenClaw Agent) were silently hidden
by a hard-coded `.slice(0, 6)` cap. The grid container already has
`max-h-[240px] overflow-y-auto` to handle overflow — the slice was
redundant and harmful. Remove it so all API-returned templates render.

Co-authored-by: UIUX Designer <uiux@molecule-ai.local>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 00:19:24 -07:00
Canvas Agent
c260e679d1 fix(canvas): replace nodes.length grid index with monotonic sequence counter
Root cause of position collision after node deletion:

  handleCanvasEvent(WORKSPACE_PROVISIONING) used nodes.length as the
  grid placement index. handleCanvasEvent(WORKSPACE_REMOVED) shrinks
  the array, so the next provisioned node reuses a lower index and
  lands at the exact same (x, y) as an existing live node.

  Example (4-col grid, COL_SPACING=320):
    Provision A → idx 0 → (100, 100)
    Provision B → idx 1 → (420, 100)
    Provision C → idx 2 → (740, 100)
    Remove    A → nodes.length drops to 2
    Provision D → idx 2 → (740, 100)  ← COLLISION with C

Fix 1 — monotonic _provisioningSequence counter (only ever increases):
  - Replaces nodes.length as the placement index
  - Immune to deletions; every provisioned node gets a unique grid slot
  - resetProvisioningSequence() exported for test teardown only

Fix 2 — the existing restart-path guard (if exists → update, not create)
  already provides idempotency for duplicate WS events on known nodes;
  confirmed: restart path does NOT increment the counter.

Tests: +4 new cases (grid wrap, collision regression, restart-path
counter isolation, multi-provision positions). 485/485 pass.
Build: next build ✓ clean.

Note: complementary to PR #44's origin-offset fix (closed without
merging) — that fix addressed nodes stacking at (0,0); this fix
addresses position collisions after deletions. Both should land.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 07:18:11 +00:00
Hongming Wang
a9fdbe4185
fix(liveness): raise workspace TTL 60s → 180s to survive Opus synthesis (#386)
Problem observed 2026-04-16: Research Lead, Dev Lead, Security Auditor,
and UIUX Designer were being auto-restarted by the liveness monitor every
~30 minutes, even though their containers were healthy and processing
real work. A2A callers (PM, children agents) saw regular EOFs:

  A2A request to <leader-id> failed: Post http://ws-*:8000: EOF

Followed in platform logs by:

  Liveness: workspace <id> TTL expired
  Auto-restart: restarting <name> (was: offline)
  Provisioner: stopped and removed container ws-*

Root cause: the liveness key `ws:{id}` in Redis has a 60s TTL
(platform/internal/db/redis.go). The workspace heartbeat loop
(workspace-template/heartbeat.py) refreshes it every 30s. That leaves
room for exactly ONE missed heartbeat before expiry.

A busy Claude Code Opus synthesis can starve the container's asyncio
scheduler for 60-120s (the SDK spawns the claude CLI subprocess and
blocks until the message-reader yields; the heartbeat coroutine doesn't
run during that window). Leaders running 5-minute orchestrator pulses
or processing deep delegations routinely hit this. The platform then
mistakes a busy-but-healthy container for a dead one, marks it offline,
tears it down, and re-provisions — interrupting whatever work was mid-
synthesis and generating a cascade of EOF errors on pending A2A calls.

Fix: hoist the TTL into a named `LivenessTTL` constant and raise it to
180s. With a 30s heartbeat interval this now tolerates up to ~5 missed
beats before expiry — comfortably longer than any realistic Opus stall,
while still detecting genuinely-dead containers within 3 minutes.

Safety: real crashes are still caught immediately by a2a_proxy's reactive
IsRunning() check (maybeMarkContainerDead in a2a_proxy.go:439). That path
doesn't depend on TTL; it fires on the first failed forward. So this PR
only relaxes the "slow but alive" false-positive — dead-container
detection is unchanged.

Observed impact before fix (2026-04-16 ~06:40–06:49 UTC, 10-minute
window, 4 containers affected):

  | Container         | EOF errors | Forced restart |
  |-------------------|-----------:|:--------------:|
  | Dev Lead          | 5          | yes (06:48)    |
  | Research Lead     | 5          | yes (06:47)    |
  | Security Auditor  | 5          | yes (06:49)    |
  | UIUX Designer     | 4          | no (not yet)   |

Expected impact after merge + redeploy: drop to ~0 forced restarts on
healthy-busy leaders. If genuinely-stuck agents stop responding, the
IsRunning check still catches them on the next A2A forward.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 00:05:45 -07:00
Hongming Wang
0bcebff908
config(org): add Telegram to Dev Lead and Research Lead (#385)
* feat(adapters): add gemini-cli runtime adapter (closes #332)

Adds a `gemini-cli` workspace runtime backed by Google's Gemini CLI
(@google/gemini-cli, ~101k ★, Apache 2.0). Mirrors the claude-code
adapter pattern: Docker image installs the CLI, CLIAgentExecutor
drives the subprocess, A2A MCP tools wire via ~/.gemini/settings.json.

Changes:
- workspace-template/adapters/gemini_cli/ — new adapter (Dockerfile,
  adapter.py, __init__.py, requirements.txt); setup() seeds GEMINI.md
  from system-prompt.md and injects A2A MCP server into settings.json
- workspace-template/cli_executor.py — adds gemini-cli to
  RUNTIME_PRESETS (--yolo flag, -p prompt, --model, GEMINI_API_KEY env
  auth); adds mcp_via_settings preset flag to skip --mcp-config
  injection for runtimes that own their own settings file
- workspace-configs-templates/gemini-cli/ — default config.yaml +
  system-prompt.md template
- tests/test_adapters.py — adds gemini-cli to expected adapter set
- CLAUDE.md — documents new runtime row in the image table

Requires: GEMINI_API_KEY global secret. Build:
  bash workspace-template/build-all.sh gemini-cli

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(provisioner): add gemini-cli to RuntimeImages map

Without this entry, POST /workspaces with runtime:gemini-cli falls back
to workspace-template:langgraph (wrong image, missing gemini dep) instead
of workspace-template:gemini-cli. Every runtime MUST have an entry here.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* config(org): add Telegram to Dev Lead and Research Lead (closes #383)

Completes leadership-tier Telegram coverage:
  PM ✓ DevOps ✓ Security ✓ → Dev Lead ✓ Research Lead ✓

Both roles produce high-value async output (architecture decisions,
eco-watch summaries) that was invisible until the user polled the
canvas. Same bot_token/chat_id secrets as the other three roles —
no new credentials required.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: DevOps Engineer <devops@molecule.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 00:00:10 -07:00
Hongming Wang
f31051be14
fix(a11y): raise ChannelsTab help text from 9px to 11px minimum (#382)
Two helper paragraphs in ChannelsTab.tsx used text-[9px] text-zinc-600:
- Chat IDs discover hint (line 254)
- Allowed Users hint (line 281)

9px fails WCAG 1.4.3 by size alone; zinc-600 on zinc-800/900 bg is
~2.6:1 contrast (fails AA). Changed to text-[11px] text-zinc-500
(~3.8:1 at 11px — acceptable for non-body helper text).

Found in UX audit Run 13 (2026-04-16).

Co-authored-by: UIUX Designer <uiux@molecule-ai.local>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 23:47:36 -07:00
Hongming Wang
16ae320bed
Merge pull request #381 from Molecule-AI/chore/triage-operator-handoff
chore(handoff): Triage Operator role + agent handoff package
2026-04-15 23:43:05 -07:00
Hongming Wang
df5821a251 chore(handoff): triage-operator role + agent handoff package
Wraps up a ~100-tick autonomous triage session by converting the prior
operator's institutional knowledge into standing, checked-in artifacts
so the next team picking up the hourly PR + issue cycle can drop in
without re-discovering everything from scratch.

## New role: Triage Operator

Peer to Dev Lead, Research Lead, Documentation Specialist under PM.
Owns the 7-gate PR verification + issue-pickup cycle across both
molecule-monorepo and molecule-controlplane. NOT an engineer — never
writes logic, never makes design calls. Mechanical fixes on other
people's branches + verified-merge only.

Runs on cron `17 * * * *`. On first boot reads four handoff files +
the last 20 lines of cron-learnings.jsonl, waits for the scheduled
tick (no first-boot triage — known stale-state footgun).

## Files

org-templates/molecule-dev/triage-operator/
- system-prompt.md (48 lines) — role prompt loaded at boot. Standing
  rules, verification discipline, escalation paths.
- philosophy.md (135 lines) — 10 principles each tied to a real
  incident. Rule 2 ("tool succeeded ≠ work done") references the
  WorkOS refresh-token + missing-migration saga. Rule 3 (authority
  verification) references PR #370 CEO directive hold.
- playbook.md (234 lines) — step-by-step tick flow (Step 0 guards →
  1 list → 2 seven-gate → 3 docs sync → 4 issue pickup → 5 report).
  Expected 5–30 min wall-clock. When-not-to-triage.
- handoff-notes.md (146 lines) — point-in-time state for the NEXT
  operator arriving fresh. 15 PRs merged this session, in-flight
  items, design-call backlog with recommendations per issue.
- SKILL.md (152 lines) — installable skill spec. Invocation, inputs,
  outputs, required composed skills, edge cases, output format.

.claude/AGENT_HANDOFF.md (206 lines) — top-level handoff for any
Claude Code agent working this repo (not just the triage operator).
The 10 principles (one-liners), communication style the user
expects, currently-live state, open items, what NOT to do, break-
glass escalation conditions. Points at triage-operator/philosophy.md
for full incident context.

## Wiring

org.yaml gains a Triage Operator workspace block under PM with:
- tier: 3, model: opus
- 8 plugins (careful-bash, session-context, cron-learnings,
  code-review, cross-vendor-review, llm-judge, update-docs, hitl)
- Hourly cron at `:17` with the full Step 0–5 flow inline as prompt
- canvas position (1150, 250) — peer to Documentation Specialist

## Why this ships now

The 30-min manual triage cron was cancelled per CEO direction. The
role moves to another team. Without this handoff package they'd be
rediscovering the same incident-classes I shipped fixes for
(#318 fail-open, #327 cross-tenant decrypt, #351 tokenless grace,
WorkOS refresh-token saga, missing migration runner). The philosophy
file gives them the scar tissue in ~10 min of reading; the playbook
gives them the steps; the SKILL gives them an invocable entry point.

No code changes outside org.yaml. Existing TestPlugins_UnionWithDefaults
still passes (verified in platform test run).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 23:41:01 -07:00
Hongming Wang
52bdadbd6d
fix(security): forward Authorization header in transcript proxy (#405) (#380)
The platform's GET /workspaces/:id/transcript proxy was constructing the
outbound request without an Authorization header. The workspace's /transcript
endpoint (hardened in #287/#328) fails-closed when the header is absent,
so every transcript call in production returned 401 from the workspace.

Fix: after WorkspaceAuth validates the incoming bearer token, the handler
now forwards it verbatim via req.Header.Set("Authorization", ...).
Forwarding is safe — the token has already been validated by the middleware.

Tests:
- TestTranscript_ForwardsAuthHeader: was t.Skip'd as a bug marker; now
  active. Verifies the Authorization header reaches the workspace stub.
- TestTranscript_NoAuthHeader_PassesThrough: new. Verifies that a missing
  header produces no synthetic Authorization on the upstream call, and the
  workspace 401 is faithfully relayed.

Identified by QA audit 2026-04-16.

Co-authored-by: QA Engineer <qa-engineer@molecule-ai.internal>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 23:38:07 -07:00
Hongming Wang
0aec76400a
feat(adapters): add gemini-cli runtime adapter (closes #332) (#379)
Adds a `gemini-cli` workspace runtime backed by Google's Gemini CLI
(@google/gemini-cli, ~101k ★, Apache 2.0). Mirrors the claude-code
adapter pattern: Docker image installs the CLI, CLIAgentExecutor
drives the subprocess, A2A MCP tools wire via ~/.gemini/settings.json.

Changes:
- workspace-template/adapters/gemini_cli/ — new adapter (Dockerfile,
  adapter.py, __init__.py, requirements.txt); setup() seeds GEMINI.md
  from system-prompt.md and injects A2A MCP server into settings.json
- workspace-template/cli_executor.py — adds gemini-cli to
  RUNTIME_PRESETS (--yolo flag, -p prompt, --model, GEMINI_API_KEY env
  auth); adds mcp_via_settings preset flag to skip --mcp-config
  injection for runtimes that own their own settings file
- workspace-configs-templates/gemini-cli/ — default config.yaml +
  system-prompt.md template
- tests/test_adapters.py — adds gemini-cli to expected adapter set
- CLAUDE.md — documents new runtime row in the image table

Requires: GEMINI_API_KEY global secret. Build:
  bash workspace-template/build-all.sh gemini-cli

Co-authored-by: DevOps Engineer <devops@molecule.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 23:30:00 -07:00
Hongming Wang
b2e1631640
feat(org-templates): add 7-role marketing team sub-tree (#373)
Add Marketing Lead + 6 reports as a peer sub-tree of PM under the CEO:
DevRel Engineer, Product Marketing Manager, Content Marketer, Community
Manager, SEO Growth Analyst, Social Media / Brand.

- Marketing Lead: tier-3 Opus CMO-equivalent with a 5-min orchestrator
  pulse (minutes 4/9/14/... offset from Dev Lead's 2/7/12/...) that
  dispatches cross-role work, reviews drafts, and routes cross-team
  asks back to PM.
- DevRel + PMM: tier-3 Opus (technical writing + positioning judgment).
  Each has an idle_prompt for proactive issue-claim plus an hourly
  evolution cron (DevRel = sample-coverage audit, PMM = competitor
  diff against docs/ecosystem-watch.md).
- Content / Community / SEO / Social: tier-2 Sonnet with idle_prompts
  for backlog-pull (matches the #205 idle-loop pattern proven on
  Technical Researcher + Market Analyst + Competitive Intelligence).
  Each has an hourly cron tuned to its surface.
- category_routing gets 6 new keys (content, positioning, community,
  growth, social, devrel) so audit_summary messages fan out correctly.
- Canvas positions lay out the marketing cluster to the right of
  PM/Dev Lead (x=1000-1300, y=50/250/400) so the graph stays readable.

Each role also gets a system-prompt.md under its files_dir with
responsibilities, team interfaces, conventions, and self-review gates
(molecule-skill-llm-judge or molecule-hitl depending on risk).

Per CEO directive 2026-04-16 ("comprehensive marketing team"). This is
PR 1 of 2 — follow-up will add cross-tree A2A conventions and wire
DevRel ↔ Backend Engineer / PMM ↔ Competitive Intelligence delegations.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 23:20:04 -07:00