Security fix merging despite CI outage (issue #136 — runner failing since 07:22, all jobs fail in 1-2s with no log output, infrastructure issue confirmed across 28 consecutive runs).
Issue #120 confirmed live by Security Auditor (cycle 3):
curl -X PATCH .../workspaces/00000000-... -d '{"name":"probe"}' → 200 (no token)
Code reviewed and approved by Security Auditor. Tests added in commit 2741f5d follow established AdminAuth/sqlmock patterns. CI outage is unrelated to these changes.
Two gaps identified by Security Auditor in PR #125 review cycle:
1. handlers_extended_test.go:
- Fix TestExtended_WorkspaceUpdate: add SELECT EXISTS mock expectation
so the test correctly reflects the #120 existence guard now running first.
- Add TestExtended_WorkspaceUpdate_NotFound: verifies PATCH returns 404
(not 200) for a nonexistent workspace ID — the core #120 behaviour fix.
2. wsauth_middleware_test.go:
- Add TestAdminAuth_Issue120_PatchWorkspace_NoBearer_Returns401: documents
the confirmed attack vector (PATCH without token must return 401) and
asserts AdminAuth is applied to PATCH /workspaces/:id per the router.go change.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Issue #120 (HIGH — immediately exploitable):
PATCH /workspaces/:id was registered on the root router with no auth
middleware. An attacker with any workspace UUID could:
- Escalate tier (tier 4 = 4 GB RAM allocation)
- Rewrite parent_id to subvert CanCommunicate A2A access control
- Swap runtime image on next restart
- Redirect workspace_dir host bind-mount to arbitrary path
Fix: move PATCH into the wsAdmin AdminAuth group alongside POST, DELETE.
The canvas position-persist call already has an AdminAuth token (required
for GET /workspaces list on initial load) so no canvas regression.
Also add workspace-existence guard in Update handler — previously returned
200 with zero rows affected for nonexistent IDs.
Issue #113 (MEDIUM — schedule IDOR, carry-over from prior cycle):
PATCH /workspaces/:id/schedules/:scheduleId and DELETE operated on
scheduleID alone (WHERE id = $1), allowing any authenticated caller to
modify or delete schedules belonging to other workspaces.
Fix: bind workspace_id = c.Param("id") in both Update and Delete handlers;
add AND workspace_id = $N to all schedule SQL queries.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Closes#101 layer 1: buildGitHubA2APayload now handles workflow_run
events, routing failed CI runs to a workspace via the existing
X-Molecule-Workspace-ID / webhook path. Only completed runs with a
failure/cancelled/timed_out conclusion fan out — success/skipped/neutral
are dropped via errIgnoredGitHubAction.
Surface message is human-readable + includes the run URL so DevOps can
jump straight to the failing job. Metadata carries the full run context
(workflow_name, run_id, run_number, conclusion, head_branch, head_sha,
run_url, trigger_event) for programmatic handling.
4 new tests cover the failure path, success skip, non-completed action
skip, and short-SHA edge case.
Layer 2 (org.yaml wiring for DevOps workspace + GITHUB_WEBHOOK_SECRET
docs) stays as a follow-up PR.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes#93 and #105.
#93 — add research/plugins/template/channels entries to org.yaml
category_routing defaults. Without them, evolution crons firing with
these categories found no target and their audit summaries silently
dropped at PM. Routes each back to the role that generated it so the
author acts on their own findings.
#105 — emit X-RateLimit-Limit / -Remaining / -Reset on every response
(allowed and throttled) and Retry-After on 429s per RFC 6585. 2 tests
cover both paths. Clients and monitoring tools can now back off
proactively instead of polling into 429 walls.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes#103 (HIGH). Three attack surfaces on the import endpoint —
body.Dir, workspace.Template, workspace.FilesDir — were concatenated
via filepath.Join without validation, letting an unauthenticated
caller probe arbitrary filesystem paths with "../../../etc".
Two layers of defense:
1. resolveInsideRoot() rejects absolute paths and any relative path
whose lexically cleaned join escapes the provided root (Abs +
HasPrefix + separator guard). 6 tests cover happy path, traversal
attempts, absolute path, empty input, prefix-sibling escape, and
deep subpath resolution.
2. Route now runs behind middleware.AdminAuth so an unauthenticated
attacker can't reach the handler at all once a token exists.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Found via deep workspace inspection during a maintenance cycle: Security
Auditor's hourly cron correctly tries to delegate_task its audit_summary
to PM, the platform proxy rejects with "access denied: workspaces cannot
communicate per hierarchy", the agent falls back to delegating to its
direct parent (Dev Lead), and PM's category_routing dispatcher (#75) is
never reached.
This breaks the audit-routing contract end-to-end. Every audit cycle was
landing on Dev Lead instead of being fanned out via PM's category_routing
to the right dev role (security → BE+DevOps, ui/ux → FE, etc).
## Root cause
`registry.CanCommunicate()` only allowed:
- self → self
- siblings (same parent)
- root-level siblings
- direct parent → child
- direct child → parent
A grandchild → grandparent (Security Auditor → PM, where parent is Dev
Lead and grandparent is PM) was DENIED. The original design wanted strict
hierarchy to prevent rogue horizontal A2A — but it also broke the
fundamental "child can talk to its leadership chain" pattern that any
audit/escalation flow needs.
## Fix
Generalise to ancestor ↔ descendant. Any workspace can talk to any
ancestor (any depth) and any descendant (any depth). Direct parent/child
remains a fast path that avoids the walk. Sibling rules unchanged.
Cousins still cannot directly communicate (would need to go through their
shared ancestor). Cross-subtree A2A is still rejected.
Implementation: `isAncestorOf(ancestorID, childID)` walks the parent
chain in Go with a maxAncestorWalk=32 safety cap so a malformed cycle in
the workspaces table cannot loop forever. One DB lookup per step. For a
typical 3-deep tree, this adds 1-2 extra lookups vs the old direct-parent
fast path. Could be optimized to a single recursive CTE if profiling
shows it matters; not now.
## Tests
- TestCanCommunicate_Denied_Grandchild → REPLACED with two new tests:
- TestCanCommunicate_Allowed_GrandparentToGrandchild
- TestCanCommunicate_Allowed_GrandchildToGrandparent (the actual bug)
- TestCanCommunicate_Allowed_DeepAncestor — 4-level chain
- TestCanCommunicate_Denied_UnrelatedAncestors — ensures cross-subtree
walks still terminate denied
- TestCanCommunicate_Denied_DifferentParents — extended with the walk
lookup mocks so sqlmock doesn't log warnings
- TestCanCommunicate_Denied_CousinToRoot — same
All 13 tests pass clean. The previous direct parent/child / siblings /
self tests are unchanged (fast paths preserved).
## Why platform-level
Per the "platform-wide fixes are mine to ship" rule. Every org template
hits the same broken audit-routing chain — fixing it at the platform
benefits all users, not just molecule-dev. This unblocks #50 (PM
dispatcher prompt) and #75 (category_routing).
Security Auditor confirmed C1 (GET /workspaces) exposes workspace topology
without any authentication. The endpoint was intentionally left open for
the canvas browser frontend; this PR closes that gap.
Router change:
- Move GET /workspaces from the bare root router into the wsAdmin AdminAuth
group alongside POST /workspaces and DELETE /workspaces/:id.
- AdminAuth uses the same fail-open bootstrap contract as all other auth
gates: fresh installs (no live tokens) pass through; once any workspace
has registered with a token, a valid bearer is required.
Status of findings C2–C11 (documented here for audit trail):
- C2 POST /workspaces/:id/activity → already in wsAuth group (Cycle 5)
- C3 POST /workspaces/:id/delegations/record → already in wsAuth group (Cycle 5)
- C4 POST /workspaces/:id/delegations/:id/update → already in wsAuth group (Cycle 5)
- C5 GET /workspaces/:id/delegations → already in wsAuth group (Cycle 5)
- C7 GET /workspaces/:id/memories → already in wsAuth group (Cycle 5)
- C8 POST /workspaces/:id/memories → already in wsAuth group (Cycle 5)
- C9 POST /workspaces/:id/delegate → already in wsAuth group (Cycle 5)
- C10 GET /admin/secrets → already in adminAuth group (Cycle 7)
- C11 POST+DELETE /admin/secrets → already in adminAuth group (Cycle 7)
Tests (platform/internal/middleware/wsauth_middleware_test.go — 13 new):
WorkspaceAuth:
- fail-open when workspace has no tokens (bootstrap path)
- C4: no bearer on /delegations/:id/update → 401
- C8: no bearer on /memories POST → 401
- invalid bearer → 401
- cross-workspace token replay → 401
- valid bearer for correct workspace → 200
AdminAuth:
- fail-open when no tokens exist globally (fresh install)
- C10: no bearer on GET /admin/secrets → 401
- C11: no bearer on POST /admin/secrets → 401
- C11: no bearer on DELETE /admin/secrets/:key → 401
- valid bearer → 200
- invalid bearer → 401
Note: did NOT touch DELETE /admin/secrets in production — no destructive
calls to live secrets endpoints were made during this work.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PR #94 only blocked 127.0.0.0/8 (loopback) and 169.254.0.0/16
(link-local/IMDS). An attacker could still register a workspace with
a URL in any RFC-1918 range (10.x, 172.16–31.x, 192.168.x) and
redirect A2A proxy traffic to internal services.
Block all five reserved ranges in validateAgentURL:
- 169.254.0.0/16 link-local (IMDS: AWS/GCP/Azure)
- 127.0.0.0/8 loopback (self-SSRF)
- 10.0.0.0/8 RFC-1918
- 172.16.0.0/12 RFC-1918 (includes Docker bridge networks)
- 192.168.0.0/16 RFC-1918
Agents must use DNS hostnames, not IP literals. The provisioner
still writes 127.0.0.1 URLs via direct SQL UPDATE (CASE guard
preserves those); this blocklist only applies to the /registry/register
request body.
Tests: updated 3 previously-allowed RFC-1918 cases to expect rejection;
added 9 new cases covering range boundaries and the Docker bridge range.
All 22 validateAgentURL subtests pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The first scheduler heartbeat (#95) only fired AFTER each tick completed.
A tick that runs fireSchedule for 110+ seconds (long agent prompts) would
make /admin/liveness report scheduler as stale even though it was actively
working. Observed today: scheduler firing UIUX audit, last_tick_at lagged
by 95s+ and incrementing.
Three places now call Heartbeat:
1. Top of tick() — proves we're past the ticker.C wait
2. Inside each fire goroutine, before fireSchedule — ANY active fire
keeps the heartbeat fresh
3. Inside each fire goroutine, after fireSchedule — captures the moment
the per-fire work completes
(The post-tick Heartbeat in Start() is still there as the "all idle" case.)
Net result: /admin/liveness reports stale only if the scheduler genuinely
isn't doing anything for >2× pollInterval, which is the actual signal we
want.
Yesterday's scheduler-died incident (#85) was one instance of a systemic
bug: every long-running goroutine in the platform lacks panic recovery
and exposes no liveness signal. In a multi-tenant SaaS deployment, a
single tenant's bad data panicking any subsystem takes down the
subsystem for every tenant, silently, with all standard health probes
still green. That is a scale-of-one sev-1.
This PR:
1. Introduces `platform/internal/supervised/` with two primitives:
a. RunWithRecover(ctx, name, fn) — runs fn in a recover wrapper.
On panic logs the stack + exponential-backoff restart (1s → 2s →
4s → … → 30s cap). On clean return (fn decided to stop) returns.
On ctx.Done cancels cleanly.
b. Heartbeat(name) + LastTick(name) + Snapshot() + IsHealthy(names,
staleThreshold) — shared in-memory liveness registry. Every
subsystem calls Heartbeat(name) at the end of each tick so
operators can distinguish "goroutine alive and healthy" from
"alive but stuck inside a single tick".
2. Wraps every `go X.Start(ctx)` in main.go:
- broadcaster.Subscribe (Redis pub/sub relay → WebSocket)
- registry.StartLivenessMonitor
- registry.StartHealthSweep
- scheduler.Start (the one that died yesterday)
- channelMgr.Start (Telegram / Slack)
3. Adds `supervised.Heartbeat("scheduler")` inside the scheduler tick
loop as the first end-to-end demonstration. Follow-up PRs will add
heartbeats to the other four subsystems.
4. Adds `GET /admin/liveness` endpoint returning per-subsystem
last_tick_at + seconds_ago. Operators can poll this and alert on
any subsystem whose seconds_ago exceeds 2x its cron/tick interval.
5. Unit tests for RunWithRecover (clean return no restart; panic
restarts with backoff; ctx cancel stops restart loop) and for the
liveness registry.
Net new code: ~160 lines + ~100 lines of tests. Refactor of main.go:
~10 line changes. No behavior change on happy path; only lifts what
happens on a panic.
Closes#92. Supersedes the local recover added to scheduler.go in
#90 (kept conceptually, but now via the shared helper).
A workspace that self-registers with a 127.0.0.x URL on first INSERT
could redirect A2A proxy traffic back to the platform itself (SSRF).
The previous fix only blocked 169.254.0.0/16 (cloud metadata).
Add 127.0.0.0/8 to validateAgentURL's blocklist. RFC-1918 private
ranges (10.x, 172.16.x, 192.168.x) remain allowed — Docker container
networking depends on them.
Safe because the provisioner writes 127.0.0.1 URLs via direct SQL
UPDATE, not through /registry/register, so the UPSERT CASE that
preserves provisioner URLs is unaffected. Local-dev agents can still
register using "localhost" by name (hostname, not IP literal).
Tests: removed "valid localhost http" case (now correctly rejected),
added "valid localhost name" + three loopback-block assertions.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The scheduler died silently on 2026-04-14 14:21 UTC and stayed dead for
12+ hours. Platform restart didn't recover it. Root cause: tick() and
fireSchedule() goroutines have no panic recovery. A single bad row, bad
cron expression, DB blip, or transient panic anywhere in the chain
permanently kills the scheduler goroutine — and the only signal to an
operator is "no crons firing", which is invisible if you're not watching.
Specifically:
func (s *Scheduler) Start(ctx context.Context) {
for {
select {
case <-ticker.C:
s.tick(ctx) // <- if this panics, the for-loop exits forever
}
}
}
And inside tick:
go func(s2 scheduleRow) {
defer wg.Done()
defer func() { <-sem }()
s.fireSchedule(ctx, s2) // <- panic here propagates up wg.Wait()
}(sched)
Two `defer recover()` additions:
1. In Start's tick wrapper — a panic in tick() (DB scan, cron parse,
row processing) is logged and the next tick fires normally.
2. In each fireSchedule goroutine — a single bad workspace can't take
the rest of the batch down.
Plus a liveness watchdog:
- Scheduler now records `lastTickAt` after each successful tick.
- New methods `LastTickAt()` and `Healthy()` (true if last tick within
2× pollInterval = 60s).
- Initialised at Start so Healthy() returns true on a fresh process.
Endpoint plumbing for /admin/scheduler/health is a follow-up — needs
threading the scheduler instance through router.Setup(). Documented
on #85.
Closes the silent-outage failure mode of #85. The other proposed
fixes (force-kill on /restart hang, active_tasks watchdog) are
separate concerns tracked in #85's comments.
Pair to molecule-controlplane PR #8. Fly's proxy returns 502 if the
fly-replay state value contains '=', so the control plane now puts the
bare UUID in state= (no 'org-id=' prefix). TenantGuard now treats the
whole 'state=...' value as the org id.
Phase B.3 pair-fix to the control plane's fly-replay state change.
Background: the private molecule-controlplane's router emits
`fly-replay: app=X;instance=Y;state=org-id=<uuid>`. Fly's edge replays
the request to the tenant and injects `Fly-Replay-Src: instance=Z;...;
state=org-id=<uuid>` on the replayed request. But response headers from
the cp (like X-Molecule-Org-Id) never travel to the replayed tenant —
only the state= param does.
TenantGuard now checks both paths in order:
1. Primary: X-Molecule-Org-Id header (direct-access path, e.g. molecli)
2. Secondary: Fly-Replay-Src's `state=org-id=<uuid>` segment
(production fly-replay path)
Either matching configured MOLECULE_ORG_ID → allow. Neither matches →
404 (still don't leak tenant existence).
New helper orgIDFromReplaySrc parses the semicolon-separated Fly-Replay-
Src header per Fly's format. Covered by a table-driven test with 7 cases
including malformed + empty-header + wrong-state-key.
Tests: +3 new TestTenantGuard_* (FlyReplaySrc match, mismatch, table).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 32 foundation. The SaaS control plane (private molecule-controlplane
repo) provisions one platform instance per customer org on Fly Machines
and sets MOLECULE_ORG_ID=<uuid> on the machine. Its subdomain router
forwards requests with X-Molecule-Org-Id=<uuid>.
TenantGuard:
- When MOLECULE_ORG_ID is set → every non-allowlisted request must carry a
matching X-Molecule-Org-Id header. Mismatched/missing header → 404 (not
403 — don't leak tenant existence by letting probers distinguish "wrong
org" from "route doesn't exist").
- When unset → passthrough. Self-hosted / dev / CI behavior unchanged.
- Allowlist is exact-match, not prefix — /health and /metrics only.
No orgs table, no signup, no billing, no Fly provisioning in this repo —
all that lives in the private control plane. The public repo's SaaS
surface is exactly this one middleware.
6 tests covering: unset-is-passthrough, matching header, mismatched
header 404 (with empty body), missing header 404, allowlist bypass, and
allowlist-is-exact-match.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses code-review warnings on PR #76:
- Migration 022 now backfills pre-existing workspace_schedules rows to
source='template' before flipping NOT NULL + DEFAULT 'runtime'. Legacy
rows (all seeded via org/import historically) stay refreshable on
re-import. Down migration drops the CHECK constraint too.
- Extracted the import UPSERT into const orgImportScheduleSQL so the shape
test asserts against the const directly instead of file-scraping org.go.
Removed the os.ReadFile helper.
- scheduleResponse.Source gets json:\",omitempty\" so old clients that
predate the migration don't see an empty string they can't explain.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses code-review warnings on PR #75:
- renderCategoryRoutingYAML now builds yaml.Node + yaml.Marshal, escaping
YAML-reserved chars in role names correctly (was JSON-as-YAML, fragile on
unicode line separators).
- New appendYAMLBlock helper guarantees a newline boundary when concatenating
YAML fragments into config.yaml (category_routing + initial_prompt both
used to risk merging into the previous line).
- Fixed struct comment (replace-per-key, not UNION).
- Added TestCategoryRouting_EscapesYAMLSpecials and TestAppendYAMLBlock_NewlineGuard.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolves#24 per CEO direction.
DB is source of truth for workspace_schedules. POST /org/import becomes
idempotent — only touches rows it owns (source='template'); runtime-added
schedules (Canvas / API) are preserved across re-imports.
- Migration 022: adds source TEXT NOT NULL DEFAULT 'runtime' CHECK in
('template','runtime'); unique index on (workspace_id, name) so the
org/import upsert can use ON CONFLICT.
- org.go: schedule INSERT becomes
INSERT ... 'template' ON CONFLICT (workspace_id, name) DO UPDATE
SET ... WHERE workspace_schedules.source='template'.
Never DELETEs.
- schedules.go: runtime POST writes 'runtime' explicitly; List handler
surfaces the source field on the response so Canvas can render badges.
- 3 new unit tests assert source='runtime' default for runtime CRUD,
the SQL shape contract for org/import (additive + idempotent +
runtime-preserving + never-DELETE), and List response surface.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a category_routing block to org.yaml schema (defaults + per-workspace,
UNION semantics with per-key replace). The merged routing table is rendered
into each workspace's config.yaml at import time.
PM's system prompt loses the hardcoded security/ui/infra → role mapping
from PR #50; instead it reads category_routing from /configs/config.yaml
and delegates to whatever roles the org template lists for the incoming
audit-summary's category. Future org templates ship their own routing
without prompt churn.
Tests: 4 new TestCategoryRouting_* cases covering YAML parse, UNION+drop
semantics, deterministic config.yaml render, and empty-map handling.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Per-workspace `plugins:` now UNIONS with `defaults.plugins` instead of
replacing. A leading `!` or `-` on a per-workspace entry opts a default
out. Backward-compatible: re-listing defaults still dedupes to the same
list.
Refactored the inline REPLACE logic into a pure helper `mergePlugins`
in org.go so it's unit-testable. Five TestPlugins_* cases added.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After a workspace restart (HTTP /restart or programmatic RestartByID) and
re-registration, the platform sends a synthetic A2A message/send to the
workspace containing:
- restart timestamp
- previous session end timestamp + human duration
- env-var keys now available (keys only — never values)
The message is rendered in the format proposed in #19 and marked with
metadata.kind=restart_context so agents can detect and handle it
specifically if they choose.
Skip path: if the workspace doesn't re-register within 30s, log and drop.
The Restart HTTP response is unaffected by delivery success.
Layer 2 (user-defined restart_prompt via config.yaml / org.yaml) is
deferred — tracked as a separate follow-up issue.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Global secrets (e.g. CLAUDE_CODE_OAUTH_TOKEN) are injected as container env
vars at Start() time. Until now, rotating one only propagated to a workspace
on the next full restart-from-zero, which manual ops had to drive via a
`POST /workspaces/:id/restart` loop. Tier-3 Claude Code agents hit the
stale-token path first and surfaced as 401s inside the SDK.
Restart-time re-read of global_secrets + workspace_secrets was already
correct in `provisionWorkspaceOpts` — the missing piece was the trigger.
SetGlobal / DeleteGlobal now enqueue RestartByID for every non-paused,
non-removed, non-external workspace that does NOT shadow the key with a
workspace-level override. Matches the existing behaviour of workspace-scoped
`Set` / `Delete`.
Adds two sqlmock-backed tests exercising both branches.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolves#14. ApplyTierConfig now reads TIER{2,3,4}_MEMORY_MB and
TIER{2,3,4}_CPU_SHARES env vars, falling back to the compiled defaults
agreed in the issue:
- T2: 512 MiB / 1024 shares (1 CPU) — unchanged baseline
- T3: 2048 MiB / 2048 shares (2 CPU) — new cap (previously uncapped)
- T4: 4096 MiB / 4096 shares (4 CPU) — new cap (previously uncapped)
CPU_SHARES follows Docker's 1024 = 1 CPU convention; internally the
value is translated to NanoCPUs for a hard allocation so behaviour
remains deterministic across hosts. Malformed or non-positive env
values silently fall back to the default.
Behaviour change note: T3 and T4 previously had no explicit cap.
Operators who relied on unlimited can set very large TIERn_MEMORY_MB /
TIERn_CPU_SHARES values; a follow-up can add unset-means-unlimited
semantics if required.
Tests:
- TestGetTierMemoryMB_DefaultsMatchLegacy
- TestGetTierMemoryMB_EnvOverride (covers malformed + zero fallback)
- TestGetTierCPUShares_EnvOverride
- TestApplyTierConfig_T3_UsesEnvOverride (wiring)
- TestApplyTierConfig_T3_DefaultCap (documents the new cap)
Docs: .env.example section + CLAUDE.md platform env-vars list updated.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolves#12. The claude-code SDK stores conversations in
/root/.claude/sessions/ and Postgres tracks current_session_id, but the
container filesystem was recreated on every restart — next agent message
failed with "No conversation found with session ID: <uuid>".
Add a per-workspace named Docker volume (ws-<id>-claude-sessions) mounted
read-write at /root/.claude/sessions. Gated by runtime=claude-code so
other runtimes don't pay for a path they don't use. Volume is cleaned up
in RemoveVolume alongside the config volume.
Two opt-outs discard the volume before restart for a fresh session:
- env WORKSPACE_RESET_SESSION=1 on the container
- POST /workspaces/:id/restart?reset=true (or {"reset": true} body)
Plumbed via new ResetClaudeSession field on WorkspaceConfig +
provisionWorkspaceOpts helper so the flag stays request-scoped (not
persisted on CreateWorkspacePayload).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a gated admin endpoint that mints a fresh workspace bearer token on
demand, eliminating the register-race currently used by
test_comprehensive_e2e.sh (PR #5 follow-up).
- New handler admin_test_token.go: returns 404 unless MOLECULE_ENV != production
or MOLECULE_ENABLE_TEST_TOKENS=1. Hides route existence in prod (404 not 403).
- Mints via wsauth.IssueToken; logs at INFO without the token itself.
- Verifies workspace exists before minting (missing -> 404, never 500).
- Tests cover prod-hidden, enable-flag-overrides-prod, missing workspace,
and happy-path + token-validates round trip.
- tests/e2e/_lib.sh gains e2e_mint_test_token helper for downstream adoption.
- CLAUDE.md updated with route + env vars.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolves#17.
Part A: scripts/cleanup-rogue-workspaces.sh deletes workspaces whose id
or name starts with known test placeholder prefixes (aaaaaaaa-, etc.)
and force-removes the paired Docker container. Documented in
tests/README.md.
Part B: add a pre-flight check in provisionWorkspace() — when neither a
template path nor in-memory configFiles supplies config.yaml, probe the
existing named volume via a throwaway alpine container. If the volume
lacks config.yaml, mark the workspace status='failed' with a clear
last_sample_error instead of handing it to Docker's unless-stopped
restart policy (which otherwise loops forever on FileNotFoundError).
New pure helper provisioner.ValidateConfigSource + unit tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
C18 — Workspace URL hijacking (CRITICAL, CONFIRMED LIVE):
POST /registry/register now calls requireWorkspaceToken() before
persisting anything. If the workspace has any live auth tokens, the
caller must supply a valid Bearer token matching that workspace ID.
First registration (no tokens yet) passes through — token is issued
at end of this function (unchanged bootstrap contract). Mirrors the
same pattern already applied to /registry/heartbeat and
/registry/update-card. Attacker POC — overwriting Backend Engineer URL
to http://attacker.example.com:9999/steal — now returns 401.
C20 — Unauthenticated workspace deletion (CRITICAL, CONFIRMED LIVE):
DELETE /workspaces/:id moved from bare router into AdminAuth group.
Any valid workspace bearer token grants access (same fail-open
bootstrap contract as /settings/secrets). Mass-deletion attack chain
(C19 list → C20 delete all) requires auth for the DELETE step.
POST /workspaces (create) also moved to AdminAuth to prevent
unauthenticated workspace creation.
C19 (GET /workspaces topology exposure) deferred — canvas browser
has no bearer token; fix requires canvas service-token refactor.
Tests: 2 new registry tests — C18 bootstrap (no tokens, passes
through and issues token), C18 hijack blocked (has tokens, no
bearer → 401).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
POST /registry/register accepted any URL string and persisted it as
the workspace's A2A endpoint — an attacker could register a workspace
with url=http://169.254.169.254/latest/meta-data/ and cause the platform
to proxy requests to the cloud metadata service when proxying A2A traffic.
Fix: validateAgentURL() helper rejects:
- empty URL
- non-http/https schemes (file://, ftp://, etc.)
- 169.254.0.0/16 link-local IPs (AWS/GCP/Azure IMDS endpoints)
Allows RFC-1918 private ranges (Docker networking uses 172.16-31.x.x).
Adds 12 unit tests covering valid Docker-internal URLs and all SSRF vectors.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three unauthenticated routes allowed arbitrary read/write/delete of all
global platform secrets (API keys, provider credentials) with zero auth:
- GET/PUT/POST /settings/secrets
- DELETE /settings/secrets/:key
- GET/POST/DELETE /admin/secrets (legacy aliases)
Fix: new AdminAuth middleware with same lazy-bootstrap contract as
WorkspaceAuth — fail-open when no tokens exist (fresh install / pre-Phase-30
upgrade), enforce once any workspace has a live token. Any valid workspace
bearer token grants access (platform-wide scope, no workspace binding needed).
Changes:
wsauth/tokens.go — HasAnyLiveTokenGlobal + ValidateAnyToken functions
wsauth/tokens_test.go — 5 new tests covering both new functions
middleware/wsauth_middleware.go — AdminAuth middleware
router/router.go — global secrets routes now registered under adminAuth group
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fix A — platform/internal/middleware/wsauth_middleware.go (NEW):
WorkspaceAuth() gin middleware enforces per-workspace bearer-token auth on
ALL /workspaces/:id/* sub-routes. Same lazy-bootstrap contract as
secrets.Values: workspaces with no live token are grandfathered through.
Blocks C2, C3, C4, C5, C7, C8, C9, C12, C13 simultaneously.
Fix A — platform/internal/router/router.go:
Reorganised route registration: bare CRUD (/workspaces, /workspaces/:id)
and /a2a remain on root router; all other /workspaces/:id/* sub-routes
moved into wsAuth = r.Group("/workspaces/:id", middleware.WorkspaceAuth(db.DB)).
CORS AllowHeaders updated to include Authorization so browser/agent callers
can send the bearer token cross-origin.
Fix B — workspace-template/heartbeat.py:
_check_delegations(): validate source_id == self.workspace_id before
accepting a delegation result. Attacker-crafted records with a foreign
source_id are silently skipped with a WARNING log (injection attempt).
trigger_msg no longer embeds raw response_preview text; references
delegation_id + status only — removes the prompt-injection vector.
Fix C — workspace-template/skill_loader/loader.py:
load_skill_tools(): before exec_module(), verify script is within
scripts_dir (path traversal guard) and temporarily scrub sensitive env
vars (CLAUDE_CODE_OAUTH_TOKEN, ANTHROPIC_API_KEY, OPENAI_API_KEY,
WORKSPACE_AUTH_TOKEN, GITHUB_TOKEN, GH_TOKEN) from os.environ; restore
in finally block. Defence-in-depth even if /plugins auth gate is bypassed.
Fix D — platform/internal/handlers/socket.go:
HandleConnect(): agent connections (X-Workspace-ID present) validated via
wsauth.HasAnyLiveToken + wsauth.ValidateToken before WebSocket upgrade.
Canvas clients (no X-Workspace-ID) remain unauthenticated.
Fix D — workspace-template/events.py:
PlatformEventSubscriber._connect(): include platform_auth bearer token in
WebSocket upgrade headers alongside X-Workspace-ID.
Fix E — workspace-template/executor_helpers.py:
recall_memories() and commit_memory() now pass platform_auth bearer token
in Authorization header so WorkspaceAuth middleware allows access.
Fix F — workspace-template/a2a_client.py:
send_a2a_message(): timeout=None → httpx.Timeout(connect=30, read=300,
write=30, pool=30). Resolves H2 flagged across 5 consecutive audits.
Tests: 149/149 Python tests pass (test_heartbeat + test_events updated to
assert new source_id validation behaviour and allow Authorization header).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Follow-up to the quality-fixes-pass2 code review.
## Go: direct unit tests for PR #5 extracted helpers (~47 new tests)
a2a_proxy_test.go:
- resolveAgentURL: cache hit, cache-miss DB hit, not-found, null-URL,
docker-rewrite guard
- dispatchA2A: build error, canvas timeout, agent timeout, success
- handleA2ADispatchError: context deadline, generic error, build error
- maybeMarkContainerDead: nil-provisioner, runtime=external short-circuits
- logA2AFailure, logA2ASuccess: activity_logs row content + status
delegation_test.go:
- bindDelegateRequest: valid / malformed / bad-UUID
- lookupIdempotentDelegation: no-key / no-match / failed-row-deleted / existing-pending
- insertDelegationRow: insertOK / insertHandledByIdempotent /
insertTrackingUnavailable
- insertDelegationOutcome: zero-value is insertOutcomeUnknown sentinel
discovery_test.go:
- discoverWorkspacePeer: online / not-found / access-denied + 2 edges
- writeExternalWorkspaceURL: 3 cases
- discoverHostPeer: smoke test documents the unreachable-by-design path
activity_test.go:
- parseSessionSearchParams: defaults + custom limit/offset/q
- buildSessionSearchQuery: no-filters + with-query shapes
- scanSessionSearchRows: empty / single / multiple rows
Package coverage: 56.1% → 57.6%. Every helper extracted in PR #5 is
now at or near 100% line coverage (see PR notes for the 4 remaining
gaps, all blocked on provisioner interface mockability).
## Defensive enum zero-value fix
insertDelegationOutcome now starts with insertOutcomeUnknown=0 as a
sentinel so an un-initialized variable can't silently read as
"success". insertOK, insertHandledByIdempotent, insertTrackingUnavailable
shift to 1/2/3. No caller changes needed.
## Canvas: ConfirmDialog.singleButton test (5 cases)
canvas/src/components/__tests__/ConfirmDialog.test.tsx covers:
- default render (both buttons)
- singleButton hides Cancel
- singleButton: Escape still fires onCancel
- singleButton: backdrop-click still fires onCancel
- singleButton: onConfirm fires on click
vitest total: 352 → 357, all passing.
## Docstring clarity
ConfirmDialog.tsx: expanded singleButton prop comment to explicitly
instruct callers to pass the same handler for onConfirm/onCancel when
using it as an info toast (matches TemplatePalette usage).
## ErrorBoundary clipboard observability
.catch(() => {}) silently swallowed rejections. Now:
.catch((e) => console.warn("clipboard write failed:", e))
so permission-denied / insecure-context failures surface in the console.
## Verification
- go build ./... clean
- go vet ./... clean
- go test -race ./internal/... — all pass
- canvas npm run build — clean
- canvas npm test -- --run — 357/357 pass
- tests/e2e/test_api.sh — 46/62 pass; all 16 failures are pre-existing
(token-auth enforcement + stale test workspaces + missing Docker
network). None involve handlers touched in PR #5.
- Manual: platform + canvas running locally, title=Molecule AI,
/workspaces returns [], /health returns ok. Identified + killed a
stale Next.js server from the old Starfire-AgentTeam repo that was
serving the old brand on IPv4 port 3000.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Post-review fixes on top of the quality-pass-2 branch.
1. delegation.go: replaced insertDelegationRow's (bool, bool) return
with a typed insertDelegationOutcome enum (insertOK /
insertHandledByIdempotent / insertTrackingUnavailable). Eliminates
the positional-boolean decoding the caller had to do. Internal, no
behavior change.
2. ConfirmDialog.tsx: added singleButton prop. When true, hides the
Cancel button for single-action info toasts (Esc still dismisses
via onCancel). TemplatePalette's import notice uses it.
3. ErrorBoundary.tsx: fixed the floating clipboard promise. Added
.catch(() => {}) so a rejected writeText (permission denied,
insecure context) doesn't surface as unhandled rejection.
4. a2a_proxy_test.go: added 5 direct unit tests for
normalizeA2APayload (invalid JSON, wraps-bare, preserves-existing-
id, preserves-existing-messageId, missing-method). Fills the unit-
test gap for the helper extracted in the last pass.
Verification:
- go test -race ./internal/handlers/... passes (incl. 5 new tests)
- go build ./... clean
- canvas npm run build clean
- canvas npm test -- --run -> 352/352
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Delete empty platform/plugins/ (dead remnant; plugins/ at repo root is
the real registry; router.go comment updated)
- Gitignore local dev cruft: platform/workspace-configs-templates/,
.agents/ (codex/gemini skill cache), backups/
- Untrack .agents/skills/ (keep local, stop tracking)
- Move examples/remote-agent/ → sdk/python/examples/remote-agent/
(co-locate with the SDK it exercises); update refs in
molecule_agent README + __init__ + PLAN.md + the demo's own README
- Move docs/superpowers/plans/ → plugins/superpowers/plans/
(plans were written by the superpowers plugin's writing-plans
subskill; belong with the plugin, not under docs)
- Add tests/README.md explaining the unit-tests-per-package +
root-E2E split so new contributors don't ask
- Add docs/README.md explaining why site tooling lives under docs/
rather than a separate docs-site/ (VitePress ergonomics)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>