PR #881 closed SAFE-T1201 (#838) on the HTTP path by wiring redactSecrets()
into MemoriesHandler.Commit — but the sibling code path on the MCP bridge
(MCPHandler.toolCommitMemory) was left with only the TODO comment. Agents
calling commit_memory via the MCP tool bridge are the PRIMARY attack vector
for #838 (confused / prompt-injected agent pipes raw tool-response text
containing plain-text credentials into agent_memories, leaking into shared
TEAM scope). The HTTP path is only exercised by canvas UI posts, so the MCP
gap was the hotter one.
Change:
workspace-server/internal/handlers/mcp.go:725
- TODO(#838): run _redactSecrets(content) before insert — plain-text
- API keys from tool responses must not land in the memories table.
+ SAFE-T1201 (#838): scrub known credential patterns before persistence…
+ content, _ = redactSecrets(workspaceID, content)
Reuses redactSecrets (same package), so there's no duplicated pattern
list — any pattern added to memories.go later automatically covers the
MCP path too.
Tests added in mcp_test.go:
- TestMCPHandler_CommitMemory_SecretInContent_IsRedactedBeforeInsert
Exercises three patterns (env-var assignment, Bearer token, sk-…)
and uses sqlmock's WithArgs to bind the exact REDACTED form — so a
regression (removing the redactSecrets call) fails with an arg
mismatch rather than silently persisting the secret (sketched below).
- TestMCPHandler_CommitMemory_CleanContent_PassesThrough
Regression guard — benign content must NOT be altered by the redactor.
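A rough shape of the arg-binding trick — the handler wiring, insert
shape, and [REDACTED] form are all assumed for illustration (the real
ones live in mcp.go / memories.go):

    import (
        "context"
        "testing"

        "github.com/DATA-DOG/go-sqlmock"
    )

    func TestRedactionBoundIntoInsertArgs(t *testing.T) {
        db, mock, err := sqlmock.New()
        if err != nil {
            t.Fatal(err)
        }
        defer db.Close()

        // Bind the exact post-redaction string: if the redactSecrets
        // call is ever removed, this fails with an arg mismatch instead
        // of letting the plain-text key reach the (mocked) insert.
        mock.ExpectExec(`INSERT INTO agent_memories`).
            WithArgs("ws-1", "API_KEY=[REDACTED]").
            WillReturnResult(sqlmock.NewResult(1, 1))

        h := &MCPHandler{db: db}
        if err := h.toolCommitMemory(context.Background(), "ws-1",
            "API_KEY=sk-live-abc123"); err != nil {
            t.Fatal(err)
        }
        if err := mock.ExpectationsWereMet(); err != nil {
            t.Fatal(err)
        }
    }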
NOTE: unable to run `go test -race ./...` locally (this container has no Go
toolchain). The change is mechanical reuse of an already-shipped function in
the same package; CI must validate. The sqlmock patterns mirror the existing
TestMCPHandler_CommitMemory_LocalScope_Success test exactly.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The phantom-producer detector (#795) was doing UPDATE + SELECT in two
roundtrips — first incrementing consecutive_empty_runs, then re-reading
to check the stale threshold. Switch to UPDATE ... RETURNING so the
post-increment value comes back in one query.
Called once per schedule per cron tick. At 100 tenants × dozens of
schedules per tenant, the halved DB traffic on the empty-response
path is measurable, not just cosmetic.
Also logs when the bump itself fails (previously the ExecContext error
was silently swallowed and the SELECT still ran, which confused
debugging).
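The shape of the change, sketched (the table name and everything around
the query are assumptions; only consecutive_empty_runs and the
RETURNING pattern come from this commit):

    // One roundtrip: increment and read the post-increment value together.
    var runs int
    err := db.QueryRowContext(ctx, `
        UPDATE schedules
           SET consecutive_empty_runs = consecutive_empty_runs + 1
         WHERE id = $1
     RETURNING consecutive_empty_runs`, scheduleID).Scan(&runs)
    if err != nil {
        // Previously swallowed; now surfaced so a failed bump is visible.
        log.Printf("empty-run bump failed for schedule %s: %v", scheduleID, err)
        return err
    }
    if runs >= staleThreshold {
        flagPhantomProducer(ctx, scheduleID) // hypothetical helper
    }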
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-merge audit flagged cp_provisioner.go as the only new file from
the canary/C1 work without test coverage. Fills the gap:
- NewCPProvisioner_RequiresOrgID — self-hosted without MOLECULE_ORG_ID
refuses to construct (avoids silent phone-home to prod CP).
- NewCPProvisioner_FallsBackToProvisionSharedSecret — the operator
ergonomics of using one env-var name on both sides of the wire.
- AuthHeader noop + happy path — bearer only set when secret is set.
- Start_HappyPath — end-to-end POST to stubbed CP, bearer forwarded,
instance_id parsed out of response.
- Start_Non201ReturnsStructuredError — when CP returns structured
{"error":"…"}, that message surfaces to the caller.
- Start_NoStructuredErrorFallsBackToSize — regression gate for the
anti-log-leak change from PR #980: raw upstream body must NOT
appear in the error, only the byte count.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the canary loop with the escape hatch and a single place to
read about the whole flow.
scripts/rollback-latest.sh <sha>
uses crane to retag :latest ← :staging-<sha> for BOTH the platform
and tenant images. Pre-checks the target tag exists and verifies
the :latest digest after the move so a bad ops typo doesn't
silently promote the wrong thing. Prod tenants auto-update to the
rolled-back digest within their 5-min cycle. Exit codes: 0 = both
retagged, 1 = registry/tag error, 2 = usage error.
docs/architecture/canary-release.md
The one-page map of the pipeline: how PR → main → staging-<sha> →
canary smoke → :latest promotion works end-to-end, how to add a
canary tenant, how to roll back, and what this gate explicitly does
NOT catch (prod-only data, config drift, cross-tenant bugs).
No code changes in the CP or workspace-server — this PR is shell
+ docs only, so it's safe to land independently of the other Phase
{1,1.5,2,3} PRs still in review.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Completes the canary release train. Before this,
publish-workspace-server-image.yml pushed both :staging-<sha> and
:latest on every main merge — meaning the prod tenant fleet auto-pulled
every image immediately, before any post-deploy smoke test. A broken
image (think: this morning's E2E current_task drift, but shipped at 3am
instead of caught in CI) would have fanned out to every running tenant
within 5 min.
Now:
- publish workflow pushes :staging-<sha> ONLY
- canary tenants are configured to track :staging-<sha>; they pick
up the new image on their next auto-update cycle
- canary-verify.yml runs the smoke suite (Phase 2) after the sleep
- on green: a new promote-to-latest job uses crane to remotely
retag :staging-<sha> → :latest for both platform and tenant images
- prod tenants auto-update to the newly-retagged :latest within
their usual 5-min window
- on red: :latest stays frozen on prior good digest; prod is untouched
crane is pulled onto the runner (~4 MB, from its GitHub release) rather
than doing a docker-daemon retag, so the workflow doesn't need a
privileged runner.
Rollback: if canary passed but something surfaces post-promotion,
operator runs "crane tag ghcr.io/molecule-ai/platform:<prior-good-sha>
latest" manually. A follow-up can wrap that in a Phase 4 admin
endpoint / script.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-deploy verification for staging tenant images. Runs against the
canary fleet after each publish-workspace-server-image build — catches
auto-update breakage (a la today's E2E current_task drift) before it
propagates to the prod tenant fleet that auto-pulls :latest every 5 min.
scripts/canary-smoke.sh iterates a space-sep list of canary base URLs
(paired with their ADMIN_TOKENs) and checks:
- /admin/liveness reachable with admin bearer (tenant boot OK)
- /workspaces list responds (wsAuth + DB path OK)
- /memories/commit + /memories/search round-trip (encryption + scrubber)
- /events admin read (AdminAuth C4 path)
- /admin/liveness without bearer returns 401 (C4 fail-closed regression)
.github/workflows/canary-verify.yml runs after publish succeeds:
- 6-min sleep (tenant auto-updater pulls every 5 min)
- bash scripts/canary-smoke.sh with secrets pulled from repo settings
- on failure: writes a Step Summary flagging that :latest should be
rolled back to prior known-good digest
Phase 3 follow-up will split the publish workflow so only
:staging-<sha> ships initially, and canary-verify's green gate is
what promotes :staging-<sha> → :latest. This commit lays the test
gate alone so we have something running against tenants immediately.
Secrets to set in GitHub repo settings before this workflow can run:
- CANARY_TENANT_URLS (space-sep list)
- CANARY_ADMIN_TOKENS (same order as URLs)
- CANARY_CP_SHARED_SECRET (matches staging CP PROVISION_SHARED_SECRET)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Paired with molecule-controlplane PR #55 (GET /cp/tenants/config). Lets
existing tenants heal themselves when we rotate or add a CP-side env
var (e.g. MOLECULE_CP_SHARED_SECRET landing earlier today) without any
ssh or re-provision.
Flow: main() calls refreshEnvFromCP() before any other os.Getenv read.
The helper reads MOLECULE_ORG_ID + ADMIN_TOKEN from the baked-in
user-data env, GETs {MOLECULE_CP_URL}/cp/tenants/config with those
credentials, and applies the returned string map via os.Setenv so
downstream code (CPProvisioner, etc.) sees the fresh values.
Best-effort semantics (see the sketch after this list):
- self-hosted / no MOLECULE_ORG_ID → no-op (return nil)
- CP unreachable / non-200 → log + return error (main keeps booting)
- oversized values (>4 KiB each) rejected to avoid env pollution
- body read capped at 64 KiB
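A sketch of the helper under those semantics — the auth header shape
and org-ID transport are assumptions; the endpoint, env-var names, and
caps are from this commit:

    import (
        "encoding/json"
        "fmt"
        "io"
        "net/http"
        "os"
    )

    func refreshEnvFromCP() error {
        orgID := os.Getenv("MOLECULE_ORG_ID")
        if orgID == "" {
            return nil // self-hosted: nothing to refresh
        }
        req, err := http.NewRequest(http.MethodGet,
            os.Getenv("MOLECULE_CP_URL")+"/cp/tenants/config", nil)
        if err != nil {
            return err
        }
        req.Header.Set("Authorization", "Bearer "+os.Getenv("ADMIN_TOKEN"))
        req.Header.Set("X-Org-ID", orgID) // assumed transport for the org id
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return err // main logs this and keeps booting
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusOK {
            return fmt.Errorf("cp config: status %d", resp.StatusCode)
        }
        body, err := io.ReadAll(io.LimitReader(resp.Body, 64<<10)) // 64 KiB cap
        if err != nil {
            return err
        }
        vars := map[string]string{}
        if err := json.Unmarshal(body, &vars); err != nil {
            return err
        }
        for k, v := range vars {
            if len(v) > 4<<10 {
                continue // reject oversized values to avoid env pollution
            }
            os.Setenv(k, v)
        }
        return nil
    }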
Once this image hits GHCR, the 5-minute tenant auto-updater picks it
up, the container restarts, refresh runs, and every tenant has
MOLECULE_CP_SHARED_SECRET within ~5 minutes — no operator toil.
Also fixes workspace-server/.gitignore: the `server` pattern was meant
to ignore only the compiled binary but also matched the cmd/server
package dir. Anchored to `/server`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the 10-PR staging→main cutover: what shipped, the three new
Railway prod env vars (PROVISION_SHARED_SECRET / EC2_VPC_ID /
CP_BASE_URL), and the sharp edge for existing tenants — their
containers pre-date PR #53 so they still need MOLECULE_CP_SHARED_SECRET
added manually (or a re-provision) before the new CPProvisioner's
outbound bearer works.
Also includes a post-deploy verification checklist and rollback plan.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #966 intentionally stripped current_task, last_sample_error, and
workspace_dir from the public GET /workspaces/:id response to avoid
leaking task bodies to anyone with a workspace bearer. The E2E smoke
test hadn't caught up — it was still asserting "current_task":"..."
on the single-workspace GET, which made every post-#966 CI run fail
with '60 passed, 2 failed'.
Swap the per-workspace asserts to check active_tasks (still exposed,
canonical busy signal) and keep the list-endpoint check that proves
admin-auth'd callers still see current_task end-to-end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-launch audit flagged api.ts as missing a timeout on every fetch.
A slow or hung CP response would leave the UI spinning indefinitely
with no way for the user to abort — effectively a client-side DoS.
15s is long enough for real CP queries (slowest observed is Stripe
portal redirect at ~3s) and short enough that a stalled backend
surfaces as a clear error with a retry affordance.
Uses AbortSignal.timeout (widely supported since 2023) so the
abort propagates through React Query / SWR consumers cleanly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Completes the C1 integration (PR #50 on molecule-controlplane). The CP
now requires Authorization: Bearer <PROVISION_SHARED_SECRET> on all
three /cp/workspaces/* endpoints; without this change the tenant-side
Start/Stop/IsRunning calls would all 401 (or 404 when the CP's routes
refused to mount) and every workspace provision from a SaaS tenant
would silently fail.
Reads MOLECULE_CP_SHARED_SECRET, falling back to PROVISION_SHARED_SECRET
so operators can use one env-var name on both sides of the wire. Empty
value is a no-op: self-hosted deployments with no CP or a CP that
doesn't gate /cp/workspaces/* keep working as before.
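The whole mechanism is a two-step env read, roughly like this (the
function framing is assumed; the env-var names and the empty-value
no-op are from this commit):

    // Prefer the tenant-side name, then fall back to the CP-side name so
    // operators can set a single env var on both sides of the wire.
    secret := os.Getenv("MOLECULE_CP_SHARED_SECRET")
    if secret == "" {
        secret = os.Getenv("PROVISION_SHARED_SECRET")
    }
    if secret == "" {
        return "" // no-op: ungated self-hosted setups keep working as before
    }
    return "Bearer " + secret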
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two findings from the pre-launch log-scrub audit:
1. handlers/workspace_provision.go:548 logged `token[:8]` — the exact
H1 pattern that panicked on short keys. Even with a length guard,
leaking 8 chars of an auth token into centralized logs shortens the
search space for anyone who gets log-read access. Now logs only
`len(token)` as a liveness signal.
2. provisioner/cp_provisioner.go:101 fell back to logging the raw
control-plane response body when the structured {"error":"..."}
field was absent. If the CP ever echoed request headers (Authorization)
or a portion of user-data back in an error path, the bearer token
would end up in our tenant-instance logs. Now logs the byte count
only; the structured error remains in place for the happy path.
Also caps the read at 64 KiB via io.LimitReader to prevent
log-flood DoS from a compromised upstream (sketched below).
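The fallback path now looks roughly like this (names assumed; the
64 KiB cap and the byte-count-only message are the substance):

    body, err := io.ReadAll(io.LimitReader(resp.Body, 64<<10)) // log-flood cap
    if err != nil {
        return fmt.Errorf("control plane: reading error body: %w", err)
    }
    var cpErr struct {
        Error string `json:"error"`
    }
    if json.Unmarshal(body, &cpErr) == nil && cpErr.Error != "" {
        return fmt.Errorf("control plane: %s", cpErr.Error) // structured path
    }
    // No structured error: report the size only, never the raw body.
    return fmt.Errorf("control plane: status %d, unparseable %d-byte body",
        resp.StatusCode, len(body))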
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-launch review blocker. AdminAuth's Tier-1 fail-open fired whenever
the workspace_auth_tokens table was empty — including the window between
a hosted tenant EC2 booting and the first workspace being created. In
that window, every admin-gated route (POST /org/import, POST /workspaces,
POST /bundles/import, etc.) was reachable without a bearer, letting an
attacker pre-empt the first real user by importing a hostile workspace
into a freshly provisioned instance.
Fix: fail-open is now ONLY applied when ADMIN_TOKEN is unset (self-
hosted dev with zero auth configured). Hosted SaaS always sets
ADMIN_TOKEN at provision time, so the branch never fires in prod and
requests with no bearer get 401 even before the first token is minted.
Tier-2 / Tier-3 paths unchanged.
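The narrowed branch, sketched (variable names assumed):

    // Fail-open is reserved for self-hosted dev with zero auth configured.
    if os.Getenv("ADMIN_TOKEN") == "" && !anyWorkspaceTokensExist {
        next.ServeHTTP(w, r)
        return
    }
    // Hosted SaaS sets ADMIN_TOKEN at provision time, so a fresh install
    // with an empty workspace_auth_tokens table now 401s without a bearer.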
The old TestAdminAuth_684_FailOpen_AdminTokenSet_NoGlobalTokens test
was codifying exactly this bug (asserting 200 on fresh install with
ADMIN_TOKEN set). Renamed and flipped to
TestAdminAuth_C4_AdminTokenSet_FreshInstall_FailsClosed asserting 401.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two HIGH-severity DoS surfaces: both handlers read the entire HTTP
body with io.ReadAll(r.Body) and no upper bound, so a caller streaming
a multi-gigabyte request could exhaust memory on the tenant instance
before we even validated the JSON.
H3 (Discord webhook): wrap Body in io.LimitReader with a 1 MiB cap.
Discord Interactions payloads are well under 10 KiB in practice.
H4 (workspace config PATCH): wrap Body in http.MaxBytesReader with a
256 KiB cap. Real configs are <10 KiB; jsonb handles the cap
comfortably. Returns 413 Request Entity Too Large on overflow (both
caps sketched below).
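The two caps, as separate fragments (handler internals assumed):

    // H3 — in the Discord webhook handler: hard 1 MiB read cap.
    payload, err := io.ReadAll(io.LimitReader(r.Body, 1<<20))

    // H4 — in the config PATCH handler: MaxBytesReader, so overflow
    // surfaces as a read error we translate to 413.
    r.Body = http.MaxBytesReader(w, r.Body, 256<<10)
    body, err := io.ReadAll(r.Body)
    if err != nil {
        http.Error(w, "config too large", http.StatusRequestEntityTooLarge)
        return
    }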
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sub of #795 (phantom-busy post-mortem). Adds a last_outbound_at
TIMESTAMPTZ column to workspaces, bumps it asynchronously on every
successful outbound A2A call from a real workspace (canvas + system
callers skipped), and exposes it in the GET /workspaces/:id response
as "last_outbound_at".
PM/Dev Lead orchestrators can now detect workspaces that have gone silent
despite being online (> 2h + active cron = phantom-busy warning).
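The bump, roughly (the caller-type check and names are assumptions;
async-on-success and the column name are from this commit):

    // Skip canvas + system callers; only real workspaces count as outbound.
    if caller.Type == CallerWorkspace {
        go func() {
            _, err := db.ExecContext(context.Background(),
                `UPDATE workspaces SET last_outbound_at = now() WHERE id = $1`,
                workspaceID)
            if err != nil {
                log.Printf("last_outbound_at bump failed: %v", err)
            }
        }()
    }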
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
amber-400 on zinc-900 is 5.4:1 (AA pass). amber-300 is 6.9:1 (AA+AAA pass)
and matches the rest of the amber usage in WorkspaceNode (currentTask,
error detail, badge chip).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- TestCommitMemory_GlobalScope_DelimiterSpoofingEscaped: verifies [MEMORY prefix
is escaped to [_MEMORY before DB insert (SAFE-T1201, #807)
- TestCommitMemory_LocalScope_NoDelimiterEscape: LOCAL scope stored verbatim
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>