The Register handler was serialising the raw Go error into the HTTP response:
c.JSON(500, gin.H{"error": fmt.Sprintf("failed to register: %v", err)})
PostgreSQL errors wrapped by lib/pq contain table names, constraint names, and
driver-version strings — enough for a caller to fingerprint the schema and craft
targeted attacks. The error is already logged at full detail with Printf before
this line, so callers only need the generic message.
Fix: replace the Sprintf with a static "registration failed" string (same pattern
the heartbeat and update-card handlers already used).
New test: TestRegister_DBErrorResponseIsOpaque verifies the response body is the
opaque string and that "sql:", "pq:", and "connection" substrings are absent.
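A minimal sketch of the handler and test shape, assuming gin and the
standard library (exact response format is illustrative):

// Handler: log full detail, return only the opaque message.
log.Printf("register failed: %v", err)
c.JSON(http.StatusInternalServerError, gin.H{"error": "registration failed"})

// Test: the body is the opaque string and carries no driver markers.
body := w.Body.String()
if body != `{"error":"registration failed"}` {
    t.Fatalf("unexpected body: %s", body)
}
for _, leak := range []string{"sql:", "pq:", "connection"} {
    if strings.Contains(body, leak) {
        t.Errorf("response leaks %q", leak)
    }
}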
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a workspace container AND its /configs Docker volume are both destroyed,
the restart handler previously had no recovery path — findTemplateByName searched
only the top-level configsDir, which holds workspace-instance dirs (ws-{id[:12]}/),
not the role-named org-template source directories.
Fix: add `rebuild_config: true` to the POST /workspaces/:id/restart body struct.
When set, the handler falls back to searching configsDir/org-templates/ via the
existing findTemplateByName logic (which already handles name normalisation and
config.yaml name-field matching). The workspace can then self-recover with its own
bearer token — no admin intervention required.
New helper: resolveOrgTemplate(configsDir, wsName) — pure function, independently
tested (4 cases: hit-by-dir, hit-by-config-yaml, no org-templates dir, no match).
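A sketch of the helper's shape (the findTemplateByName call is assumed
to take a root dir and a name):

// resolveOrgTemplate falls back to the org-template source tree when
// the workspace-instance dir is gone.
func resolveOrgTemplate(configsDir, wsName string) (string, error) {
    root := filepath.Join(configsDir, "org-templates")
    if _, err := os.Stat(root); err != nil {
        return "", fmt.Errorf("no org-templates dir under %s: %w", configsDir, err)
    }
    // Reuses the existing normalisation + config.yaml name-field match.
    return findTemplateByName(root, wsName)
}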
Usage:
curl -X POST -H "Authorization: Bearer $(cat /configs/.auth_token)" \
  -d '{"rebuild_config": true}' \
  http://platform:8080/workspaces/$WORKSPACE_ID/restart
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Closes #430.
While the session was being fetched on SaaS deployments, AuthGate
rendered null, causing a white screen to flash for 200–500ms before
the zinc-950 canvas background appeared.
Replace the null render with a fixed zinc-950 div so the browser always
paints the correct dark background from the first frame. The canvas
loading UI renders on top once the session resolves, with no visible
transition.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The platform Dockerfile COPYs paths relative to the repo root —
`COPY platform/go.mod`, `COPY platform/migrations`,
`COPY workspace-configs-templates`. The compose file was setting
`context: ./platform`, which silently pointed those COPY layers at the
wrong directory, so source changes stopped invalidating the build cache.
Symptom (caught 2026-04-16 10:22 UTC): after PR #417 (memory schema
migration 023) merged + I ran `docker compose up -d --build platform`,
the rebuild was a no-op. Image SHA didn't change, container booted with
old migration set, `Applied 22 migrations` instead of the expected 23.
Migration 023 file was on disk locally but never reached the image.
Workaround was `docker build -t molecule-monorepo-platform:fresh -f
platform/Dockerfile .` from repo root → SHA changed, migration 023
applied. This commit makes `docker compose up -d --build platform`
work correctly without the manual workaround.
CI workflow already builds with `context: .` + `file: ./platform/Dockerfile`
(per the comment at the top of platform/Dockerfile). This change just
aligns the local compose file with what CI does.
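A sketch of the corrected compose stanza (service name and surrounding
keys assumed):

services:
  platform:
    build:
      context: .                      # repo root, matching the COPY paths
      dockerfile: platform/Dockerfile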
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Every agent in the template currently uses the same GitHub PAT, so
`gh pr list` shows every PR as authored by the CEO's account with
no signal which agent opened each one. Commits already carry
per-agent authors (GIT_AUTHOR_NAME from #402). This wrapper extends
the identity split to the PR/issue metadata surface layer that
commit attribution can't reach.
## How it works
A tiny bash script is installed at `/usr/local/bin/gh`; it sits
earlier in PATH than the real binary at `/usr/bin/gh`. For `gh pr
create` and `gh issue create`:
- Title gets prefixed with `[Role Name]` — e.g. `[Frontend Engineer]
fix: canvas grid index`
- Body gets `\n\n---\n_Opened by: Molecule AI <Role>_` appended
Role is read from `GIT_AUTHOR_NAME`, which the platform provisioner
sets to `Molecule AI <Role>` (shipped with #402). Accepts both
`--title X` and `--title=X` forms. Same for `--body`.
Anything that isn't `gh pr create` or `gh issue create` (e.g.
`gh pr list`, `gh issue view`, `gh run watch`) passes through
untouched. No behaviour change for read-side operations.
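A condensed sketch of the wrapper (the shipped script also rewrites
--body and applies the idempotency guards described below):

#!/usr/bin/env bash
REAL_GH=/usr/bin/gh
# Fail-open: wrong identity or not a create subcommand → passthrough.
if [[ "${GIT_AUTHOR_NAME:-}" != "Molecule AI "* ]] \
   || { [[ "$1 $2" != "pr create" ]] && [[ "$1 $2" != "issue create" ]]; }; then
  exec "$REAL_GH" "$@"
fi
role="${GIT_AUTHOR_NAME#Molecule AI }"
args=()
while (($#)); do
  case "$1" in
    --title)   args+=("$1" "[$role] $2"); shift 2 ;;  # --title X form
    --title=*) args+=("--title=[$role] ${1#--title=}"); shift ;;  # --title=X form
    *)         args+=("$1"); shift ;;
  esac
done
exec "$REAL_GH" "${args[@]}"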
## Idempotent
- If the title already starts with `[...]`, the wrapper does not
re-prefix, so `gh pr edit` flows that resubmit the title won't layer
multiple tags.
- If the body already contains `Opened by: Molecule AI`, the footer
is not re-appended.
## Fail-open
When `GIT_AUTHOR_NAME` is absent or doesn't start with `Molecule
AI `, the wrapper execs the real gh with unchanged args. No call
is ever blocked by this script.
## Test coverage
`tests/test_gh_wrapper.sh` — 12 cases, no network, no Docker:
- Passthrough for non-create subcommands (pr list)
- pr create title prefix + body footer
- issue create with `--title=X` / `--body=X` equals-form
- Idempotent title re-prefix
- Idempotent body footer (count = 1 after two applies)
- Missing GIT_AUTHOR_NAME → passthrough, title preserved
- Malformed GIT_AUTHOR_NAME (not "Molecule AI ...") → passthrough
All 12 pass. Test script is standalone bash + a temp fake gh binary
that echoes argv; safe to run in CI's Python Lint & Test job via
subprocess shell-out.
## Deployment note
This lands in the workspace image. Existing containers keep their
old /usr/bin/gh until the image is rebuilt and they're re-provisioned
(POST /workspaces/:id/restart {}). No migration required; the wrapper
just starts tagging PRs once the new image is rolled.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Part 4 of 4 — terminal step of the org.yaml scalability refactor. Each
role in the molecule-dev template now owns its own workspace.yaml file,
colocated with the existing system-prompt.md / initial-prompt.md /
idle-prompt.md / schedules/*.md. Team files shrink to a leader's own
definition plus a list of !include refs.
## Platform change
`resolveYAMLIncludes` now uses a TWO-ROOT model:
- Path resolution is relative to the INCLUDING file's directory
(natural sibling + cousin refs, C-include / Sass @import convention).
- Security bound is the ORIGINAL org root (`rootDir`), preserved across
all recursion depths. Sibling-dir refs like `../my-role/workspace.yaml`
from a team file are now allowed (they stay inside the org template);
refs that escape the root still error.
Regression coverage: new `TestResolveYAMLIncludes_SiblingDirAccess`
reproduces the Phase 4 pattern (team file at `teams/x.yaml` referencing
`../<role>/workspace.yaml`) — fails without the fix, passes with.
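The bound check itself, sketched (helper name and exact API are
illustrative, not the real resolveYAMLIncludes internals):

// Resolve an !include ref against the including file's directory,
// then enforce the original org root as the security bound.
func resolveIncludePath(rootDir, includingFile, ref string) (string, error) {
    p := filepath.Clean(filepath.Join(filepath.Dir(includingFile), ref))
    rel, err := filepath.Rel(filepath.Clean(rootDir), p)
    if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
        return "", fmt.Errorf("include %q escapes org root %q", ref, rootDir)
    }
    return p, nil
}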
## Template change
Atomized 15 child workspaces across 3 team files:
- `teams/research.yaml`: 58 → 30 lines; 3 children now !include refs
- `teams/dev.yaml`: 222 → 38 lines; 6 children now !include refs
- `teams/marketing.yaml`: 143 → 28 lines; 6 children now !include refs
Each role now has `<role>/workspace.yaml` colocated with its prompts.
Example `frontend-engineer/` directory:
frontend-engineer/
├── workspace.yaml (24 lines — name/role/tier/canvas/plugins/...)
├── system-prompt.md (from earlier phases)
├── initial-prompt.md
├── idle-prompt.md
└── (no schedules for this role — but if added, schedules/<slug>.md)
## File-size progression across all 4 phases
| State | org.yaml | total `.yaml` in tree |
|---|---:|---:|
| Before (main) | 1801 lines / 108 KB | 1801 / 108 KB (one file) |
| After Phase 1 (#389) | 1687 | 1687 / 101 KB |
| After Phase 2 (#390) | 676 | 676 / 35 KB |
| After Phase 3 (#393) | 114 | 683 (1 + 6 teams) / 33 KB |
| **After this PR** | **114** | **~698** (1 + 6 + 15 workspace) / 35 KB |
Aggregate size is flat — the decrease came from prompt externalization
in Phases 1/2; Phases 3/4 reorganize structure without adding content.
The win is readability and ownership:
- Every individual file fits on 1-2 screens.
- Adding a new role is now: create `<role>/` dir, add `workspace.yaml`
+ `system-prompt.md` + prompts, add ONE `!include` line to the team
file. No touching of aggregated mega-YAML.
- Team files can be reviewed + merged independently.
## Tests
All 10 `TestResolveYAMLIncludes_*` tests pass, including the real-template
integration test (`TestResolveYAMLIncludes_RealMoleculeDev`) which now
walks org.yaml → teams/pm.yaml → teams/research.yaml → ../market-analyst/
workspace.yaml and validates the full 21-role tree unmarshals cleanly.
Plus all existing `TestResolvePromptRef` + `TestOrgYAML` + `TestInitialPrompt`
suites stay green.
## Ops followup
After merging all 4 phases and deploying, the `POST /org/import`
endpoint should produce a workspace tree byte-identical to the
pre-refactor state. Verify with:
diff <(curl POST /org/import before) <(curl POST /org/import after)
or by spot-checking:
- `/configs/config.yaml` bodies across all 21 workspaces
- `workspace_schedules.prompt` row values
The externalization is lossless — YAML literal to file and back
recovers the same string modulo trailing-whitespace normalization.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SecurityHeaders middleware widened its CSP to allow Next.js inline scripts
+ data:/blob: images (platform/internal/middleware/securityheaders.go:44;
the canvas is reverse-proxied through the gin stack, so it needs the
permissive policy). The two CSP asserts in securityheaders_test.go still
hard-compared against the old tight `default-src 'self'`, so they fail
on main as of this afternoon.
Fix: assert each expected CSP fragment is PRESENT in the header (substring
match) instead of byte-for-byte equality. Test intent is "CSP is set, starts
with tight default-src, contains the expected directives" — not "CSP matches
this exact string". Future subsource tuning (add a new CDN, bump blob:/data:
scope) won't re-break this test.
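The asserts now look roughly like this (the fragment list is
illustrative; the real test checks the directives the middleware
actually sets):

csp := w.Header().Get("Content-Security-Policy")
if !strings.HasPrefix(csp, "default-src 'self'") {
    t.Errorf("CSP does not start with the tight default-src: %q", csp)
}
for _, frag := range []string{"script-src", "img-src", "data:", "blob:"} {
    if !strings.Contains(csp, frag) {
        t.Errorf("CSP missing expected fragment %q: %q", frag, csp)
    }
}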
Caught because every PR touching anything in the monorepo currently fails
the Platform (Go) CI job on these two asserts. Fixing on a dedicated branch
so it can land ahead of every blocked PR in the queue.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Single-container tenant architecture: Go platform (:8080) + Canvas
Node.js (:3000) in one Fly machine, with Go's NoRoute handler reverse-
proxying non-API routes to the canvas. Browser only talks to :8080.
Changes:
platform/Dockerfile.tenant — multi-stage build (Go + Node + runtime).
Bakes workspace-configs-templates/ + org-templates/ into the image.
Build context: repo root.
platform/entrypoint-tenant.sh — starts both processes, kills both if
either exits. Fly health check on :8080 covers the Go binary; canvas
health is implicit (proxy returns 502 if canvas is down).
platform/internal/router/canvas_proxy.go — httputil.ReverseProxy that
forwards unmatched routes to CANVAS_PROXY_URL (http://localhost:3000).
Activated by NoRoute when CANVAS_PROXY_URL env is set.
platform/internal/router/router.go — wire NoRoute → canvasProxy when
CANVAS_PROXY_URL is present; no-op otherwise (local dev unchanged).
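Condensed sketch of that wiring (error handling and header tweaks
trimmed):

if target := os.Getenv("CANVAS_PROXY_URL"); target != "" {
    u, err := url.Parse(target) // e.g. http://localhost:3000
    if err != nil {
        log.Fatalf("bad CANVAS_PROXY_URL: %v", err)
    }
    proxy := httputil.NewSingleHostReverseProxy(u)
    r.NoRoute(func(c *gin.Context) {
        proxy.ServeHTTP(c.Writer, c.Request) // unmatched route → canvas
    })
}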
platform/internal/middleware/securityheaders.go — relaxed CSP to allow
Next.js inline scripts/styles/eval + WebSocket + data: URIs. The
strict `default-src 'self'` was blocking all canvas rendering.
canvas/src/lib/api.ts — changed `||` to `??` for NEXT_PUBLIC_PLATFORM_URL
so empty string means "same-origin" (combined image) instead of falling
back to localhost:8080.
canvas/src/components/tabs/TerminalTab.tsx — same `??` fix for WS URL.
Verified: tenant machine boots, canvas renders, 8 runtime templates +
4 org templates visible, API routes work through the same port.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 3 escalation ladder added `from .escalation import ...` to
executor.py. The phase-2 dispatch tests load executor.py via
`exec(compile(src, ...))` with the relative import rewritten — this
broke because (a) the rewrite didn't know about escalation and (b) the
exec namespace lacked `__name__`, which executor.py needs at import
time for `logging.getLogger(__name__)`.
Fix both in all 8 exec sites:
- Rewrite both `from .providers import` AND `from .escalation import`
- Pre-register escalation + providers in sys.modules under the fake
package name
- Seed the exec namespace with `__name__ = "hermes_executor_under_test"`
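Condensed sketch of one exec site after the fix (the fake package name
and stub registration are the tests' own devices; the real tests attach
stub attributes to the registered modules):

import sys, types

PKG = "fakehermes"
src = open("adapters/hermes/executor.py").read()
# (a) rewrite BOTH relative imports against the fake package
src = src.replace("from .providers import", f"from {PKG}.providers import")
src = src.replace("from .escalation import", f"from {PKG}.escalation import")
# pre-register the fake package + submodules so the imports resolve
for name in (PKG, f"{PKG}.providers", f"{PKG}.escalation"):
    sys.modules.setdefault(name, types.ModuleType(name))
# (b) seed __name__ so logging.getLogger(__name__) works at import time
ns = {"__name__": "hermes_executor_under_test"}
exec(compile(src, "executor.py", "exec"), ns)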
54/54 hermes tests pass (28 escalation truth-table + 6 ladder-integration
+ 20 existing phase-2 dispatch).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes the silent-overwrite hole where two agents racing a read-modify-
write on the same memory key left only one agent's update. Relevant for
orchestrators (PM, Dev Lead, Marketing Lead) keeping structured running
state (delegation-result ledgers, task queues) in memory, and for the
``research-backlog:*`` keys that multiple idle loops write in parallel.
## Semantics
### Back-compat path (no if_match_version)
Unchanged: ``INSERT ... ON CONFLICT UPDATE`` last-write-wins. Every
existing agent tool, every existing ``commit_memory`` call, every
existing cron that writes memory — all continue to work with no edit.
### Optimistic-lock path (if_match_version set)
1. Client calls ``GET /memory/:key`` → ``{value, version: V}``
2. Client modifies value locally
3. Client ``POST /memory {key, value, if_match_version: V}``
4. Server: ``UPDATE ... WHERE version = V`` + RETURNING new version
5. On match → 200 + ``{version: V+1}``
6. On mismatch → 409 + ``{expected_version: V, current_version: <actual>}``
7. Client reads the actual version and retries.
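Sketch of the server-side conditional write (table and column names are
assumptions, not the actual schema):

const updateQ = `
    UPDATE workspace_memory
       SET value = $1, version = version + 1
     WHERE key = $2 AND version = $3
 RETURNING version`

var newVersion int64
err := db.QueryRowContext(ctx, updateQ, value, key, ifMatchVersion).Scan(&newVersion)
if errors.Is(err, sql.ErrNoRows) {
    // No row matched: key absent or version moved on → read current, 409.
}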
### Create-only marker
``if_match_version: 0`` means "create iff the key doesn't exist yet".
Two agents simultaneously seeding a shared key will see exactly one
success + one 409 — no silent collision, no duplicate-init work.
### Schema
Migration 023 adds ``version BIGINT NOT NULL DEFAULT 1``. Existing rows
baseline at 1; new rows start at 1. Every successful update runs
``version = version + 1``; every successful insert writes ``1``.
## Why version, not updated_at
``updated_at`` has second-granularity and can collide between concurrent
writers on a fast clock. A monotonic counter is collision-free and more
readable in the 409 response body ("expected 5, current is 7 — you
missed 2 writes" tells an agent exactly what to re-read).
## Why ``if_match_version`` and not an ETag header
A JSON field keeps the precondition in the request body, visible
alongside the value payload. Agents assembling requests programmatically
don't have to remember to thread a header through their HTTP client
wrapper, and the existing ``commit_memory`` tool can grow one optional
kwarg while keeping its existing signature shape.
## Tests
11 memory-handler cases covering every path:
- GET list / get (with version in response shape)
- Set with no version (back-compat upsert, returns new version)
- Set with if_match_version match (happy path, increment)
- Set with if_match_version mismatch (409 + expected/current fields)
- Set with if_match_version=0 on absent key (create-only success)
- Set with if_match_version=N on absent key (409 — caller's mental
model is wrong)
- Bad inputs (missing key, malformed JSON)
- Delete happy + error path
Full ``go test ./internal/handlers/`` green.
## Follow-up (not in this PR)
- Workspace-template tool update: ``commit_memory(content, *,
if_match_version=None)`` surfaces the new option + on 409 surfaces
the current_version so agents can retry without manual re-read.
- Named checkpoints table (``workspace_checkpoints``) for durable
orchestrator state snapshots. Different concern than per-key locking;
separate PR.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ships scoped Phase 3 of the Hermes multi-provider work. Every workspace
can now declare an ordered list of (provider, model) rungs; when the
pinned model hits rate-limit / 5xx / context-length / overload, the
executor advances to the next rung before raising.
## Why
3× Claude Max saturation is a routine occurrence now — the "first 429 on
a batch delegation" is the common path, not the exception. A workspace
pinned to Haiku that hits a context-length limit has no recovery today;
same for Sonnet hitting rate-limit mid-synthesis. Escalation promotes
to the next tier for that single call, preserves coordination, avoids
restart cascades.
## New module: adapters/hermes/escalation.py
- ``LadderRung(provider, model)`` — one config entry.
- ``parse_ladder(raw)`` — tolerant config parser; skips malformed rungs
with a warning rather than raising so boot stays resilient.
- ``should_escalate(exc) -> bool`` — truth table over 15+ error shapes:
- Typed classes (RateLimitError, OverloadedError, APITimeoutError,
APIConnectionError, InternalServerError)
- Context-length markers (each provider uses different phrasing)
- Gateway markers (502/503/504, overloaded, temporarily unavailable)
- Status-code substrings (429, 529, 5xx)
- Hard-rejects auth failures (401/403/invalid_api_key) even if the
outer exception class is RateLimitError — wrapping case matters.
## Executor wiring
``HermesA2AExecutor`` now accepts ``escalation_ladder`` in its
constructor + ``create_executor()`` factory. ``_do_inference()`` walks
the ladder:
1. First attempt = pinned provider:model (matches pre-ladder behaviour)
2. On escalatable error, try each rung in order
3. On non-escalatable error, raise immediately (auth, malformed payload)
4. On exhaustion, raise the last error
Rung switches temporarily rebind ``self.provider_cfg`` / ``self.model``
/ ``self.api_key`` / ``self.base_url`` in a try/finally, so any raised
error leaves the executor in its original state for the next call. Key
resolution for non-pinned rungs goes through ``resolve_provider`` which
reads the rung-provider's env vars fresh.
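The walk itself, sketched (attribute and helper names approximate the
description above):

def _do_inference(self, request):
    # Rung 0 is the pinned provider:model; then the configured ladder.
    rungs = [(self.provider, self.model)] + [
        (r.provider, r.model) for r in self.escalation_ladder
    ]
    last_exc = None
    for provider, model in rungs:
        try:
            return self._call_model(provider, model, request)
        except Exception as exc:
            if not should_escalate(exc):
                raise          # auth / malformed payload: fail fast
            last_exc = exc     # escalatable: advance to the next rung
    raise last_exc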
## Config shape
``config.yaml`` (rendered from ``org.yaml`` → workspace secrets):
runtime_config:
  escalation_ladder:
    - provider: gemini
      model: gemini-2.5-flash
    - provider: anthropic
      model: claude-sonnet-4-5-20250929
    - provider: anthropic
      model: claude-opus-4-1-20250805
Empty / absent = single-shot behaviour, full backwards-compat with
every existing workspace.
## Tests
34 passing, all isolated (no network):
- ``test_hermes_escalation.py`` (28): parser + truth-table across
rate-limit, overload, context-length, gateway, auth-reject, unrelated
exceptions, and case-insensitivity.
- ``test_hermes_ladder_integration.py`` (6): no-ladder single call,
ladder-not-triggered on success, escalate-on-rate-limit-then-succeed,
stop-on-non-escalatable, raise-last-error-when-exhausted, skip-
unknown-provider-in-rung.
## Not in this PR
- Uncertainty-driven escalation (judge pass after successful reply).
- Per-workspace budget tracking (#305 covers this separately).
- Live streaming reuse across rungs (ladder retries the whole call).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes #399.
## Root cause
`publish-platform-image.yml` existed for the Go platform image but there
was no equivalent for the canvas. After every canvas PR merged, CI ran
`npm run build` and passed — but the live container at :3000 was never
updated. The `canvas-deploy-reminder` job only posted a comment asking
operators to manually rebuild, which was consistently missed.
## What this adds
- `.github/workflows/publish-canvas-image.yml`: triggers on `canvas/**`
changes to main (and `workflow_dispatch`); see the trigger sketch after
this list. Mirrors the platform workflow: macOS Keychain isolation, QEMU
for linux/amd64, Buildx, GHCR push with `:latest` + `:sha-<7>` tags.
- `NEXT_PUBLIC_PLATFORM_URL` / `NEXT_PUBLIC_WS_URL` resolve from
`workflow_dispatch` inputs → `CANVAS_PLATFORM_URL` / `CANVAS_WS_URL`
repo secrets → `localhost:8080` defaults (safe for self-hosted dev).
- Inputs are passed via env vars (not direct `${{ }}` interpolation) to
prevent shell injection from string inputs.
- `docker-compose.yml`: adds `image: ghcr.io/molecule-ai/canvas:latest`
to the canvas service so `docker compose pull canvas && docker compose
up -d canvas` applies the new image. `build:` is retained for local
development. Adds a comment clarifying that `NEXT_PUBLIC_*` runtime env
vars are ignored by the standalone bundle (build-time only).
- `ci.yml`: updates `canvas-deploy-reminder` commit comment to reference
`docker compose pull` as the fast path, with `docker compose build` as
the local-source fallback.
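Trigger sketch (input names assumed from this message; the real
workflow may differ):

on:
  push:
    branches: [main]
    paths: ["canvas/**"]
  workflow_dispatch:
    inputs:
      platform_url:
        description: NEXT_PUBLIC_PLATFORM_URL override
        required: false
      ws_url:
        description: NEXT_PUBLIC_WS_URL override
        required: false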
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root cause of the 2026-04-16 09:10 UTC six-container restart cascade.
## Timeline
09:10:26 — PM sent a batch delegation to 15+ agents (Dev Lead coordinating).
09:10:26-27 — 4 leaders/auditors (Security, RL, BE, DevOps) simultaneously
hit "workspace agent unreachable — container restart triggered"
even though their containers were running fine. Another 2
(DL, UIUX) tripped in the next few seconds.
09:10:27 — Provisioner stopped + recreated 6 containers in parallel. A2A
callers got EOFs, PM's batch coordination stalled.
## Root cause
`provisioner.IsRunning` collapsed every ContainerInspect error into
`(false, nil)`, including transient Docker daemon hiccups:
func IsRunning(...) (bool, error) {
    info, err := p.cli.ContainerInspect(ctx, name)
    if err != nil {
        return false, nil // Container doesn't exist ← MISREAD
    }
    return info.State.Running, nil
}
The comment said "Container doesn't exist" but the error was actually
any of: daemon timeout, socket EOF, context deadline, connection
refused. Under load (batch delegation fan-out → 15 concurrent HTTP
inbound → 15 concurrent Claude Code subprocesses → Docker daemon CPU
pressure), ContainerInspect calls started failing transiently. All 6
calls returned `(false, nil)`. Caller `maybeMarkContainerDead` treated
`running=false` as "container is dead, restart it" → six parallel
restarts. This was exactly the destructive-on-error pattern we keep
trying to kill (see #160 SDK-stderr-probe, #318 fail-open classes).
## Fix
`IsRunning` now distinguishes NotFound from transient errors:
- Legitimately missing container (caller deleted, Docker pruned) →
`(false, nil)` — safe to act on; caller marks dead + restarts.
- Any other error (daemon timeout, socket issue, context deadline) →
`(true, err)` — caller stays on the alive path. The transient error
is preserved so metrics + logging still see it, but it does NOT
trigger the destructive restart branch.
`isContainerNotFound` matches on error-message substring — same
approach docker/cli uses internally — to avoid pulling in errdefs as a
direct dep. Truth table tests in `isrunning_test.go` cover 8 cases:
NotFound variants (real + generic), nil, empty, and the 4 transient-
error shapes we've actually observed (deadline, EOF, connection-refused,
i/o timeout).
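The fixed shape, sketched from the description (receiver and helper
names per this message):

func (p *Provisioner) IsRunning(ctx context.Context, name string) (bool, error) {
    info, err := p.cli.ContainerInspect(ctx, name)
    if err != nil {
        if isContainerNotFound(err) {
            return false, nil // genuinely gone: safe for the caller to act on
        }
        return true, err // transient daemon trouble: stay on the alive path
    }
    return info.State.Running, nil
}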
## Caller update
`maybeMarkContainerDead` in a2a_proxy.go now logs the transient inspect
error (was silently discarded via `_`). Visibility without
destructiveness. If this error becomes persistent, we'll see it in
platform logs rather than diagnosing after another restart cascade.
## Expected impact
- Zero restart cascades from the current class of transient inspect
errors (EOF, timeout, connection refused).
- Dead containers still detected within the A2A layer because an actual
stopped container returns NotFound on inspect, and the TTL monitor
(180s post #386) catches anything that slips through.
- New visibility in platform logs when inspect has trouble — previously
silent.
Combined with the TTL fix in #386, the defense-in-depth on spurious
restart is now:
1. IsRunning only returns false for real NotFound
2. Liveness TTL is 180s, surviving 5+ missed heartbeats
3. A2A proxy 503-Busy path retries with backoff before touching
restart logic at all
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>