The canvas sends NEXT_PUBLIC_ADMIN_TOKEN on all API calls, but per-workspace
routes (/activity, /delegations, /traces) use WorkspaceAuth, which only
accepts per-workspace bearer tokens. As a result, the canvas dashboard
returned 401 on every workspace detail view.
Fix: WorkspaceAuth now accepts the admin token as a fallback after
workspace token validation fails. This lets the canvas read all workspace
data with a single admin credential.
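For illustration, the fallback order could look roughly like the sketch below. It is not the actual middleware: authorizeWorkspaceRequest and the validWorkspaceToken callback are hypothetical names, and the constant-time comparison is an assumption about how the admin token is checked.

```go
package main

import "crypto/subtle"

// authorizeWorkspaceRequest reports whether a bearer token may access a
// per-workspace route. The workspace token is checked first; the admin
// token is only consulted as a fallback.
//
// validWorkspaceToken is a hypothetical stand-in for the real token lookup.
func authorizeWorkspaceRequest(bearer, adminToken string, validWorkspaceToken func(string) bool) bool {
	if bearer == "" {
		return false
	}
	if validWorkspaceToken(bearer) {
		return true
	}
	// An unset admin token must never match anything.
	if adminToken == "" {
		return false
	}
	return subtle.ConstantTimeCompare([]byte(bearer), []byte(adminToken)) == 1
}
```

Checking the workspace token first keeps the existing per-workspace behaviour unchanged; the admin comparison only runs for requests that would otherwise have been rejected.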
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Soft-delete (status='removed') leaves orphan DB rows and FK data forever.
When ?purge=true is passed, after container cleanup the handler cascade-
deletes all leaf FK tables and hard-removes the workspace row.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The org import fired all workspace provisioning goroutines concurrently,
overwhelming Docker when creating 39+ containers. Containers timed out,
leaving workspaces stuck in 'provisioning' with no schedules or hooks.
Fix:
- Add provisionConcurrency=3 semaphore limiting concurrent Docker ops
- Increase workspaceCreatePacingMs from 50ms to 2000ms between siblings
- Pass semaphore through createWorkspaceTree recursion
With 39 workspaces at 3 concurrent + 2s pacing, import takes ~30s instead
of timing out. Each workspace gets its full template: schedules, hooks,
settings, hierarchy.
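A minimal sketch of the semaphore-plus-pacing idea, assuming a flat list of workspaces rather than the real createWorkspaceTree recursion. provisionAll and the provision callback are hypothetical; in the commit the limit is 3 and the pacing is 2000ms.

```go
package main

import (
	"sync"
	"time"
)

// provisionAll starts one goroutine per workspace but lets at most limit of
// them run provision at the same time. It returns the peak number of
// concurrent provision calls that was observed.
func provisionAll(n, limit int, pacing time.Duration, provision func(i int)) int {
	sem := make(chan struct{}, limit) // buffered channel as counting semaphore
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		running int
		peak    int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release the slot

			mu.Lock()
			running++
			if running > peak {
				peak = running
			}
			mu.Unlock()

			provision(i)

			mu.Lock()
			running--
			mu.Unlock()
		}(i)
		if i < n-1 {
			time.Sleep(pacing) // pace creation of the next sibling
		}
	}
	wg.Wait()
	return peak
}
```

The semaphore bounds how many Docker operations are in flight, while the pacing delay spreads out when they start; the two limits are independent.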
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
IsRunning used an unbounded json.NewDecoder(resp.Body).Decode on
CP status responses. Start already caps its body read at 64 KiB
(cp_provisioner.go:137) to defend against a misconfigured or
compromised CP streaming a huge body and exhausting memory.
IsRunning is called reactively per-request from a2a_proxy and
periodically from healthsweep, so it's a hotter path than Start
and arguably needs the same defense even more.
Adds TestIsRunning_BoundedBodyRead that serves a body padded past
the cap and asserts the decode still succeeds on the JSON prefix.
Follow-up to code-review Nit-2 on #1073.
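The bounded read can be sketched as follows. The statusResponse shape and its "state" JSON field are assumptions, and decodeStatus / decodeStatusBytes are hypothetical helper names; only the 64 KiB cap comes from the text above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
)

// maxStatusBody mirrors the 64 KiB cap mentioned above.
const maxStatusBody = 64 << 10

// statusResponse is a hypothetical shape for the CP status payload.
type statusResponse struct {
	State string `json:"state"`
}

// decodeStatus reads at most maxStatusBody bytes from r and decodes the
// first JSON value it finds there.
func decodeStatus(r io.Reader) (string, error) {
	var s statusResponse
	if err := json.NewDecoder(io.LimitReader(r, maxStatusBody)).Decode(&s); err != nil {
		return "", fmt.Errorf("decode status: %w", err)
	}
	return s.State, nil
}

// decodeStatusBytes is a convenience wrapper for in-memory bodies.
func decodeStatusBytes(body []byte) (string, error) {
	return decodeStatus(bytes.NewReader(body))
}
```

A body whose JSON fits inside the cap still decodes even when padding follows it, while a single JSON value larger than the cap is truncated and fails to decode.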
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
My #1071 made IsRunning return (false, err) on all error paths, but that
breaks a2a_proxy which depends on Docker provisioner's (true, err) contract.
Without this fix, any brief CP outage causes a2a_proxy to mark workspaces
offline and trigger restart cascades across every tenant.
Contract now matches Docker.IsRunning:
  transport error     → (true, err)    alive, degraded signal
  non-2xx response    → (true, err)    alive, degraded signal
  JSON decode error   → (true, err)    alive, degraded signal
  2xx state!=running  → (false, nil)
  2xx state==running  → (true, nil)
healthsweep.go is also happy with this — it skips on err regardless.
Adds TestIsRunning_ContractCompat_A2AProxy as regression guard that
asserts each error path explicitly against the a2a_proxy expectations.
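As a rough sketch, the contract above maps onto a single decision function like this. isRunningResult is a hypothetical name, the "state" JSON field is an assumption, and the real method performs the HTTP call itself rather than receiving its outcome as arguments.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// isRunningResult maps one CP status round trip onto the (running, err)
// contract. transportErr is the error from the HTTP call, if any;
// statusCode and body are only meaningful when transportErr is nil.
func isRunningResult(transportErr error, statusCode int, body []byte) (bool, error) {
	if transportErr != nil {
		return true, fmt.Errorf("cp provisioner: status: %w", transportErr)
	}
	if statusCode < 200 || statusCode > 299 {
		// Only the status code is reported; the body is not echoed.
		return true, fmt.Errorf("cp provisioner: status: unexpected HTTP %d", statusCode)
	}
	var resp struct {
		State string `json:"state"`
	}
	if err := json.Unmarshal(body, &resp); err != nil {
		return true, fmt.Errorf("cp provisioner: status: decode: %w", err)
	}
	return resp.State == "running", nil
}

// errExampleTransport is a placeholder error for demonstrations.
var errExampleTransport = errors.New("connection refused")
```

Every error path reports the workspace as alive, so a caller that treats (false, nil) as "stopped" never acts on a CP outage.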
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-existing silent-failure path: IsRunning decoded CP responses
regardless of HTTP status, so a CP 500 → empty body → State="" →
returned (false, nil). The sweeper couldn't distinguish "workspace
stopped" from "CP broken" and would leave a dead row in place.
## Fix
- Non-2xx → wrapped error, does NOT echo body (CP 5xx bodies may
contain echoed headers; leaking into logs would expose bearer)
- JSON decode error → wrapped error
- Transport error → now wrapped with "cp provisioner: status:"
prefix for easier log grepping
## Tests
+7 cases (5-status table + malformed JSON + existing transport).
IsRunning coverage 100%; overall cp_provisioner at 98%.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes review gap: pre-PR coverage on CPProvisioner was 37%.
After this commit every exported method is exercised:
- NewCPProvisioner  100%
- authHeaders       100%
- Start             91.7% (remainder: json.Marshal error path, unreachable
                    with fixed-type request struct)
- Stop              100% (new: header + path + error)
- IsRunning         100% (new: 4-state matrix + auth)
- Close             100% (new: contract no-op)
New cases assert both auth headers (shared secret + admin_token) land
on every outbound request, transport failures surface clear errors
on Start/Stop, and IsRunning doesn't misreport on transport failure.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #729 tightened AdminAuth to require ADMIN_TOKEN, breaking the
workspace credential helper which called /admin/github-installation-token
with a workspace bearer token. Tokens expired after 60 min with no refresh.
Fix: Add /workspaces/:id/github-installation-token under WorkspaceAuth
so any authenticated workspace can refresh its GitHub token. Keep the
admin path as backward-compatible alias.
Update molecule-git-token-helper.sh to use the workspace-scoped path
when WORKSPACE_ID is set.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
controlplane #118 + #130 made /cp/workspaces/* require a per-tenant
admin_token header in addition to the platform-wide shared secret.
Without it, every workspace provision / deprovision / status call
now 401s.
ADMIN_TOKEN is already injected into the tenant container by the
controlplane's Secrets Manager bootstrap, so this is purely a
header-plumbing change — no new config required on the tenant side.
## Change
- CPProvisioner carries adminToken alongside sharedSecret
- New authHeaders method sets BOTH auth headers on every outbound
request (old authHeader deleted — single call site was misleading
once the semantics changed)
- Empty values on either header are no-ops so self-hosted / dev
deployments without a real CP still work
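A sketch of the header plumbing, under stated assumptions: the cpAuth type and headersFor helper are hypothetical, and the admin token header name X-Molecule-Admin-Token is invented for illustration because the actual header name is not given here. The Authorization: Bearer form for the shared secret is assumed to carry over from the earlier C1 change.

```go
package main

import "net/http"

// adminTokenHeader is a hypothetical header name; the commit only says a
// per-tenant admin_token header is required.
const adminTokenHeader = "X-Molecule-Admin-Token"

// cpAuth holds the two credentials sent to the control plane.
type cpAuth struct {
	sharedSecret string
	adminToken   string
}

// authHeaders sets both auth headers on h, skipping any empty credential.
func (a cpAuth) authHeaders(h http.Header) {
	if a.sharedSecret != "" {
		h.Set("Authorization", "Bearer "+a.sharedSecret)
	}
	if a.adminToken != "" {
		h.Set(adminTokenHeader, a.adminToken)
	}
}

// headersFor returns a fresh header set for the given credentials.
func headersFor(sharedSecret, adminToken string) http.Header {
	h := http.Header{}
	cpAuth{sharedSecret: sharedSecret, adminToken: adminToken}.authHeaders(h)
	return h
}
```

Skipping empty values means a self-hosted deployment with neither credential sends no auth headers at all, rather than sending empty ones.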
## Tests
Renamed + expanded cp_provisioner_test cases:
- TestAuthHeaders_NoopWhenBothEmpty — self-hosted path
- TestAuthHeaders_SetsBothWhenBothProvided — prod happy path
- TestAuthHeaders_OnlyAdminTokenWhenSecretEmpty — transition window
Full workspace-server suite green.
## Rollout
Next tenant provision will ship an image with this commit merged.
Existing tenants (none in prod right now — hongming was the only
one and was purged earlier today) will auto-update via the 5-min
image-pull cron.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add MemorySeed model and initial_memories support at three levels:
- POST /workspaces payload: seed memories on workspace creation
- org.yaml workspace config: per-workspace initial_memories with
defaults fallback
- org.yaml global_memories: org-wide GLOBAL scope memories seeded
on the first root workspace during import
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The provisioner was unconditionally writing CLAUDE_CODE_OAUTH_TOKEN into
config.yaml's required_env for all claude-code workspaces. When the
baked token expired, preflight rejected every workspace — even those
with a valid token injected via the secrets API at runtime.
Changes:
- workspace_provision.go: remove hardcoded required_env for claude-code
and codex runtimes; tokens are injected at container start via secrets
- workspace_provision_test.go: flip assertion to reject hardcoded token
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a workspace is deleted (status set to 'removed'), its schedules
remained enabled, causing the scheduler to keep firing cron jobs for
non-existent containers. Add a cascade disable query alongside the
existing token revocation and canvas layout cleanup.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three changes to boost agent throughput:
1. Event-driven cron triggers (webhooks.go): GitHub issues/opened events
fire all "pick-up-work" schedules immediately. PR review/submitted
events fire "PR review" and "security review" schedules. Uses
next_run_at=now() so the scheduler picks them up on next tick.
2. Auto-push hook (executor_helpers.py): After every task completion,
agents automatically push unpushed commits and open a PR targeting
staging. Guards: only on non-protected branches with unpushed work.
Uses /usr/local/bin/git and /usr/local/bin/gh wrappers with baked-in
GH_TOKEN. Never crashes the agent — all errors logged and continued.
3. Integration (claude_sdk_executor.py): auto_push_hook() called in the
_execute_locked finally block after commit_memory.
Closes productivity gap where agents wrote code but never pushed,
and where work crons only fired on timers instead of reacting to events.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixes silent data loss on mid-cursor DB errors: Export() returned partial
sub-workspace bundles instead of surfacing the iteration error. Adds a
rows.Err() check after the SELECT id FROM workspaces query in Export(),
mirroring the pattern already used in scheduler.go and in handlers with
similar recursion.
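The pattern, sketched against a small interface instead of a live database. rowIterator lists the three *sql.Rows methods used here; collectIDs, fakeRows and errCursor are hypothetical names for the demonstration and do not appear in exporter.go.

```go
package main

import (
	"errors"
	"fmt"
)

// rowIterator is the subset of *sql.Rows this sketch needs.
type rowIterator interface {
	Next() bool
	Scan(dest ...any) error
	Err() error
}

// collectIDs drains rows and returns the IDs, or an error if iteration
// stopped early.
func collectIDs(rows rowIterator) ([]string, error) {
	var ids []string
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			return nil, fmt.Errorf("scan workspace id: %w", err)
		}
		ids = append(ids, id)
	}
	// Next returns false both at end of data and on a mid-cursor failure;
	// only Err tells the two apart.
	if err := rows.Err(); err != nil {
		return nil, fmt.Errorf("iterate workspaces: %w", err)
	}
	return ids, nil
}

// fakeRows is an in-memory rowIterator used to show both outcomes.
type fakeRows struct {
	ids []string
	pos int
	err error // reported by Err once the data runs out
}

func (f *fakeRows) Next() bool {
	if f.pos >= len(f.ids) {
		return false
	}
	f.pos++
	return true
}

func (f *fakeRows) Scan(dest ...any) error {
	if len(dest) != 1 {
		return errors.New("fakeRows: want exactly one destination")
	}
	p, ok := dest[0].(*string)
	if !ok {
		return errors.New("fakeRows: want *string")
	}
	*p = f.ids[f.pos-1]
	return nil
}

func (f *fakeRows) Err() error { return f.err }

// errCursor simulates a connection failure partway through the result set.
var errCursor = errors.New("connection reset mid-cursor")
```

Without the Err check, the second case would return one ID and no error, which is the partial-bundle behaviour described above.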
Closes: R1 MISSING-ROWS-ERR findings (bundle/exporter.go)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PR #881 closed SAFE-T1201 (#838) on the HTTP path by wiring redactSecrets()
into MemoriesHandler.Commit — but the sibling code path on the MCP bridge
(MCPHandler.toolCommitMemory) was left with only the TODO comment. Agents
calling commit_memory via the MCP tool bridge are the PRIMARY attack vector
for #838 (confused / prompt-injected agent pipes raw tool-response text
containing plain-text credentials into agent_memories, leaking into shared
TEAM scope). The HTTP path is only exercised by canvas UI posts, so the MCP
gap was the hotter one.
Change:
workspace-server/internal/handlers/mcp.go:725
- TODO(#838): run _redactSecrets(content) before insert — plain-text
- API keys from tool responses must not land in the memories table.
+ SAFE-T1201 (#838): scrub known credential patterns before persistence…
+ content, _ = redactSecrets(workspaceID, content)
Reuses redactSecrets (same package) so there's no duplicated pattern list —
a future-added pattern in memories.go automatically covers the MCP path too.
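To make the three pattern families concrete, here is an illustrative redactor. It is not the shipped redactSecrets: the regular expressions, the [REDACTED] placeholder and the redactSketch signature are all assumptions, since the real pattern list in memories.go is not reproduced in this message.

```go
package main

import "regexp"

// These patterns are illustrative only; the real list lives in memories.go.
var (
	envAssignRe = regexp.MustCompile(`\b([A-Z][A-Z0-9_]*(?:KEY|TOKEN|SECRET))=\S+`)
	bearerRe    = regexp.MustCompile(`(?i)\bbearer\s+[A-Za-z0-9._~+/=-]+`)
	skKeyRe     = regexp.MustCompile(`\bsk-[A-Za-z0-9]{16,}`)
)

// redactSketch returns content with the patterns above masked, plus a flag
// saying whether anything was changed.
func redactSketch(content string) (string, bool) {
	out := envAssignRe.ReplaceAllString(content, "${1}=[REDACTED]")
	out = bearerRe.ReplaceAllString(out, "Bearer [REDACTED]")
	out = skKeyRe.ReplaceAllString(out, "[REDACTED]")
	return out, out != content
}
```

Keeping the patterns in one place and calling the same function from both the HTTP and MCP paths is what lets a newly added pattern cover both.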
Tests added in mcp_test.go:
- TestMCPHandler_CommitMemory_SecretInContent_IsRedactedBeforeInsert
Exercises three patterns (env-var assignment, Bearer token, sk-…)
and uses sqlmock's WithArgs to bind the exact REDACTED form — so a
regression (removing the redactSecrets call) fails with arg-mismatch
rather than silently persisting the secret.
- TestMCPHandler_CommitMemory_CleanContent_PassesThrough
Regression guard — benign content must NOT be altered by the redactor.
NOTE: unable to run `go test -race ./...` locally (this container has no Go
toolchain). The change is mechanical reuse of an already-shipped function in
the same package; CI must validate. The sqlmock patterns mirror the existing
TestMCPHandler_CommitMemory_LocalScope_Success test exactly.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previously, the scheduler skipped cron fires entirely when a workspace
had active_tasks > 0 (#115). This caused permanent cron misses for
workspaces kept perpetually busy by the 5-min Orchestrator pulse — work
crons (pick-up-work, PR review) were skipped every fire because the
agent was always processing a delegation.
Measured impact on Dev Lead: 17 context-deadline-exceeded timeouts in
2 hours, ~30% of inter-agent messages silently dropped.
Fix: when workspace is busy, poll every 10s for up to 2 minutes waiting
for idle. If idle within the window, fire normally. If still busy after
2 min, fall back to the original skip behavior.
This is a minimal, safe change:
- No new goroutines or channels
- Same fire path once idle
- Bounded wait (2 min max, won't block the scheduler pool)
- Falls back to skip if workspace never becomes idle
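The bounded wait can be sketched as a plain polling loop. waitForIdle and the isBusy callback are hypothetical names; in the commit the interval is 10s and the window is 2 minutes, and the real check reads active_tasks.

```go
package main

import "time"

// waitForIdle polls isBusy every interval until it reports false or until
// another poll would run past maxWait. It returns true if the workspace
// became idle in time, false if the caller should fall back to skipping.
func waitForIdle(isBusy func() bool, interval, maxWait time.Duration) bool {
	deadline := time.Now().Add(maxWait)
	for isBusy() {
		if time.Now().Add(interval).After(deadline) {
			return false // the next poll would land past the window
		}
		time.Sleep(interval)
	}
	return true
}
```

A call such as waitForIdle(check, 10*time.Second, 2*time.Minute) blocks only the goroutine that is already handling the fire, which matches the "no new goroutines or channels" constraint.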
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The phantom-producer detector (#795) was doing UPDATE + SELECT in two
roundtrips — first incrementing consecutive_empty_runs, then re-
reading to check the stale threshold. Switch to UPDATE ... RETURNING
so the post-increment value comes back in one query.
Called once per schedule per cron tick. At 100 tenants × dozens of
schedules per tenant, the halved DB traffic on the empty-response
path is measurable, not just cosmetic.
Also logs properly if the bump itself fails. Previously the ExecContext
error was silently swallowed and the SELECT still ran, which made
debugging confusing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-merge audit flagged cp_provisioner.go as the only new file from
the canary/C1 work without test coverage. Fills the gap:
- NewCPProvisioner_RequiresOrgID — self-hosted without MOLECULE_ORG_ID
refuses to construct (avoids silent phone-home to prod CP).
- NewCPProvisioner_FallsBackToProvisionSharedSecret — the operator
ergonomics of using one env-var name on both sides of the wire.
- AuthHeader noop + happy path — bearer only set when secret is set.
- Start_HappyPath — end-to-end POST to stubbed CP, bearer forwarded,
instance_id parsed out of response.
- Start_Non201ReturnsStructuredError — when CP returns structured
{"error":"…"}, that message surfaces to the caller.
- Start_NoStructuredErrorFallsBackToSize — regression gate for the
anti-log-leak change from PR #980: raw upstream body must NOT
appear in the error, only the byte count.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Paired with molecule-controlplane PR #55 (GET /cp/tenants/config). Lets
existing tenants heal themselves when we rotate or add a CP-side env
var (e.g. MOLECULE_CP_SHARED_SECRET landing earlier today) without any
ssh or re-provision.
Flow: main() calls refreshEnvFromCP() before any other os.Getenv read.
The helper reads MOLECULE_ORG_ID + ADMIN_TOKEN from the baked-in
user-data env, GETs {MOLECULE_CP_URL}/cp/tenants/config with those
credentials, and applies the returned string map via os.Setenv so
downstream code (CPProvisioner, etc.) sees the fresh values.
Best-effort semantics:
- self-hosted / no MOLECULE_ORG_ID → no-op (return nil)
- CP unreachable / non-200 → log + return error (main keeps booting)
- oversized values (>4 KiB each) rejected to avoid env pollution
- body read capped at 64 KiB
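A sketch of the apply step with the per-value size guard. applyEnv is a hypothetical name, and skipping oversized values one by one (rather than rejecting the whole response) is an assumption; setenv would be os.Setenv in real use.

```go
package main

import "fmt"

// maxEnvValueLen mirrors the 4 KiB per-value limit described above.
const maxEnvValueLen = 4 << 10

// applyEnv passes each key/value pair to setenv, skipping values over the
// limit. It returns how many values were applied and which keys were
// rejected for size.
func applyEnv(values map[string]string, setenv func(key, value string) error) (int, []string, error) {
	applied := 0
	var rejected []string
	for k, v := range values {
		if len(v) > maxEnvValueLen {
			rejected = append(rejected, k)
			continue
		}
		if err := setenv(k, v); err != nil {
			return applied, rejected, fmt.Errorf("setenv %s: %w", k, err)
		}
		applied++
	}
	return applied, rejected, nil
}
```

Taking setenv as a parameter keeps the guard testable without mutating the process environment.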
Once this image hits GHCR, the 5-minute tenant auto-updater picks it
up, the container restarts, refresh runs, and every tenant has
MOLECULE_CP_SHARED_SECRET within ~5 minutes — no operator toil.
Also fixes workspace-server/.gitignore so `server` no longer matches
the cmd/server package dir. The pattern was meant to ignore only the
compiled binary but was too broad; it is now anchored to `/server`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Completes the C1 integration (PR #50 on molecule-controlplane). The CP
now requires Authorization: Bearer <PROVISION_SHARED_SECRET> on all
three /cp/workspaces/* endpoints; without this change the tenant-side
Start/Stop/IsRunning calls would all 401 (or 404 when the CP's routes
refused to mount) and every workspace provision from a SaaS tenant
would silently fail.
Reads MOLECULE_CP_SHARED_SECRET, falling back to PROVISION_SHARED_SECRET
so operators can use one env-var name on both sides of the wire. Empty
value is a no-op: self-hosted deployments with no CP or a CP that
doesn't gate /cp/workspaces/* keep working as before.
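The lookup order, as a small sketch. cpSharedSecret is a hypothetical name; getenv would be os.Getenv in real use and is a parameter here only so the order can be exercised in isolation.

```go
package main

// cpSharedSecret resolves the bearer secret for CP calls, preferring
// MOLECULE_CP_SHARED_SECRET and falling back to PROVISION_SHARED_SECRET.
// An empty result means no Authorization header should be sent.
func cpSharedSecret(getenv func(string) string) string {
	if v := getenv("MOLECULE_CP_SHARED_SECRET"); v != "" {
		return v
	}
	return getenv("PROVISION_SHARED_SECRET")
}
```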
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two findings from the pre-launch log-scrub audit:
1. handlers/workspace_provision.go:548 logged `token[:8]` — the exact
H1 pattern that panicked on short keys. Even with a length guard,
leaking 8 chars of an auth token into centralized logs shortens the
search space for anyone who gets log-read access. Now logs only
`len(token)` as a liveness signal.
2. provisioner/cp_provisioner.go:101 fell back to logging the raw
control-plane response body when the structured {"error":"..."}
field was absent. If the CP ever echoed request headers (Authorization)
or a portion of user-data back in an error path, the bearer token
would end up in our tenant-instance logs. Now logs the byte count
only; the structured error remains in place for the happy path.
Also caps the read at 64 KiB via io.LimitReader to prevent
log-flood DoS from a compromised upstream.
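Both findings come down to logging a size instead of content. A minimal sketch, with hypothetical helper names and message wording (cpErrorMessage, tokenLogField); only the {"error":"..."} field and the byte-count fallback come from the description above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cpErrorMessage builds a log-safe error string for a failed CP call.
func cpErrorMessage(statusCode int, body []byte) string {
	var structured struct {
		Error string `json:"error"`
	}
	if err := json.Unmarshal(body, &structured); err == nil && structured.Error != "" {
		return fmt.Sprintf("cp returned %d: %s", statusCode, structured.Error)
	}
	// No structured error: report the size only, never the raw body.
	return fmt.Sprintf("cp returned %d (%d-byte body omitted)", statusCode, len(body))
}

// tokenLogField describes a token for logs without revealing any of it.
// Unlike token[:8], it cannot panic on tokens shorter than 8 bytes.
func tokenLogField(token string) string {
	return fmt.Sprintf("token_len=%d", len(token))
}
```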
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-launch review blocker. AdminAuth's Tier-1 fail-open fired whenever
the workspace_auth_tokens table was empty — including the window between
a hosted tenant EC2 booting and the first workspace being created. In
that window, every admin-gated route (POST /org/import, POST /workspaces,
POST /bundles/import, etc.) was reachable without a bearer, letting an
attacker pre-empt the first real user by importing a hostile workspace
into a freshly provisioned instance.
Fix: fail-open is now ONLY applied when ADMIN_TOKEN is unset (self-
hosted dev with zero auth configured). Hosted SaaS always sets
ADMIN_TOKEN at provision time, so the branch never fires in prod and
requests with no bearer get 401 even before the first token is minted.
Tier-2 / Tier-3 paths unchanged.
The old TestAdminAuth_684_FailOpen_AdminTokenSet_NoGlobalTokens test
was codifying exactly this bug (asserting 200 on fresh install with
ADMIN_TOKEN set). Renamed and flipped to
TestAdminAuth_C4_AdminTokenSet_FreshInstall_FailsClosed asserting 401.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two HIGH-severity DoS surfaces: both handlers read the entire HTTP
body with io.ReadAll(r.Body) and no upper bound, so a caller streaming
a multi-gigabyte request could exhaust memory on the tenant instance
before we even validated the JSON.
H3 (Discord webhook): wrap Body in io.LimitReader with a 1 MiB cap.
Discord Interactions payloads are well under 10 KiB in practice.
H4 (workspace config PATCH): wrap Body in http.MaxBytesReader with a
256 KiB cap. Real configs are <10 KiB; jsonb handles the cap
comfortably. Returns 413 Request Entity Too Large on overflow.
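The H4 shape can be sketched with the standard library alone. patchConfig is a stand-in handler that only reads the body, and statusForBodySize is a hypothetical helper for trying it out; detecting the overflow via *http.MaxBytesError assumes Go 1.19 or newer.

```go
package main

import (
	"bytes"
	"errors"
	"io"
	"net/http"
	"net/http/httptest"
)

// maxConfigBody mirrors the 256 KiB cap described for the config PATCH.
const maxConfigBody = 256 << 10

// patchConfig is a stand-in handler: it only reads the body under a cap.
func patchConfig(w http.ResponseWriter, r *http.Request) {
	r.Body = http.MaxBytesReader(w, r.Body, maxConfigBody)
	if _, err := io.ReadAll(r.Body); err != nil {
		var tooBig *http.MaxBytesError
		if errors.As(err, &tooBig) {
			http.Error(w, "request body too large", http.StatusRequestEntityTooLarge)
			return
		}
		http.Error(w, "cannot read body", http.StatusBadRequest)
		return
	}
	w.WriteHeader(http.StatusNoContent)
}

// statusForBodySize sends an n-byte PATCH to patchConfig and returns the
// response status code.
func statusForBodySize(n int) int {
	req := httptest.NewRequest(http.MethodPatch, "/config", bytes.NewReader(make([]byte, n)))
	rec := httptest.NewRecorder()
	patchConfig(rec, req)
	return rec.Code
}
```

io.LimitReader, as used for H3, silently truncates instead of erroring, so it suits cases where an explicit 413 is not needed.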
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sub-task of #795 (phantom-busy post-mortem). Adds a last_outbound_at
TIMESTAMPTZ column to workspaces. It is bumped asynchronously on every
successful outbound A2A call from a real workspace (canvas and system
callers are skipped) and exposed in the GET /workspaces/:id response as
"last_outbound_at".
PM/Dev Lead orchestrators can now detect workspaces that have gone silent
despite being online (> 2h + active cron = phantom-busy warning).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- TestCommitMemory_GlobalScope_DelimiterSpoofingEscaped: verifies [MEMORY prefix
is escaped to [_MEMORY before DB insert (SAFE-T1201, #807)
- TestCommitMemory_LocalScope_NoDelimiterEscape: LOCAL scope stored verbatim
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SAFE-T1201 (#807): Escape [MEMORY prefix in GLOBAL memory content on
write to prevent delimiter-spoofing prompt injection. Content stored
as "[_MEMORY " so it renders as text, not structure, when wrapped with
the real delimiter on read.
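A sketch of the escape, assuming every occurrence of the marker is rewritten (the text above only says the prefix is escaped) and that the scope is passed as the string "GLOBAL". escapeMemoryDelimiter is a hypothetical name.

```go
package main

import "strings"

// escapeMemoryDelimiter neutralises the "[MEMORY" marker in GLOBAL-scope
// content; content in any other scope is returned verbatim.
func escapeMemoryDelimiter(scope, content string) string {
	if scope != "GLOBAL" {
		return content
	}
	return strings.ReplaceAll(content, "[MEMORY", "[_MEMORY")
}
```

Because "[_MEMORY" does not itself contain "[MEMORY", applying the escape twice leaves already-escaped content unchanged.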
SAFE-T1102 (#805): Pin @molecule-ai/mcp-server@1.0.0 in .mcp.json.example.
Prevents supply-chain attacks via unpinned npx -y.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>