Eliminate raw 'awaiting_agent'/'hibernating'/'failed'/etc string literals
from production status writes. Adds models.WorkspaceStatus typed alias and
models.AllWorkspaceStatuses canonical slice; every UPDATE workspaces SET
status = ... now passes a parameterized $N typed value rather than a
hard-coded SQL literal.
Defense-in-depth follow-up to migration 046 (#2388): the Postgres enum
type was missing 'awaiting_agent' + 'hibernating' for ~5 days because
sqlmock regex matching cannot enforce live enum constraints. The drift
gate is now a proper Go AST + SQL parser (no regex), asserting the
codebase ⊆ migration enum and every const appears in the canonical
slice. With status as a parameterized typed value, future enum mismatches
fail at the SQL layer in tests, not silently in prod.
Test coverage: full suite passes with -race; drift gate green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every workspace in the cross-EC2 SaaS provisioning shape was failing
registration, heartbeat, or A2A routing. Four distinct blockers sat
between "EC2 is up" and "agent responds"; three are platform-side and
fixed here (the fourth is in the CP user-data, separate PR).
1. SSRF validator blocked RFC-1918 (registry.go + mcp.go)
validateAgentURL and isPrivateOrMetadataIP rejected 172.16.0.0/12,
which contains the AWS default VPC range (172.31.x.x) that every
sibling workspace EC2 registers from. Registration returned 400 and
the 10-min provision sweep flipped status to failed. RFC-1918 +
IPv6 ULA are now gated behind saasMode(); link-local (169.254/16),
loopback, IPv6 metadata (fe80::/10, ::1), and TEST-NET stay blocked
unconditionally in both modes.
saasMode() resolution order:
1. MOLECULE_DEPLOY_MODE=saas|self-hosted (explicit operator flag)
2. MOLECULE_ORG_ID presence (legacy implicit signal, kept for
back-compat so existing deployments don't need a config change)
isPrivateOrMetadataIP now actually checks IPv6 — previously it
returned false on any non-IPv4 input, which would let a registered
[::1] or [fe80::...] URL bypass the SSRF check entirely.
2. Orphan auth-token minting (workspace_provision.go)
issueAndInjectToken mints a token and stuffs it into
cfg.ConfigFiles[".auth_token"]. The Docker provisioner writes that
file into the /configs volume — the CP provisioner ignores it
(only cfg.EnvVars crosses the wire). Result: live token in DB, no
plaintext on disk, RegistryHandler.requireWorkspaceToken 401s every
/registry/register attempt because the workspace is no longer in
the "no live token → bootstrap-allowed" state. Now no-ops in SaaS
mode; the register handler already mints on first successful
register and returns the plaintext in the response body for the
runtime to persist locally.
Also removes the redundant wsauth.IssueToken call at the bottom of
provisionWorkspaceCP, which created the same orphan-token pattern
a second time.
3. Compaction artefacts (bundle/importer.go, handlers/org_tokens.go,
scheduler.go, workspace_provision.go)
Four pre-existing compile errors on main from an earlier session's
code truncation: missing tuple destructuring on ExecContext /
redactSecrets / orgTokenActor, missing close-brace in
Scheduler.fireSchedule's panic recovery. All one-line mechanical
fixes; without them the binary would not build.
Tests
-----
ssrf_test.go adds:
* TestSaasMode — covers the env resolution ladder (explicit flag
wins over legacy signal, case-insensitive, whitespace tolerant)
* TestIsPrivateOrMetadataIP_SaaSMode — asserts RFC-1918 + IPv6 ULA
flip to allowed, metadata/loopback/TEST-NET still blocked
* TestIsPrivateOrMetadataIP_IPv6 — regression guard for the old
"returns false for all IPv6" behaviour
Follow-up issue for CP-sourced workspace_id attestation will be filed
separately — closes the residual intra-VPC SSRF + token-race windows
the SaaS-mode relaxation introduces.
Verified end-to-end today on workspace 6565a2e0 (hermes runtime, OpenAI
provider) — agent returned "PONG" in 1.4s after register → heartbeat →
A2A proxy → runtime.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Suppress errcheck warnings for calls where the return value is safely
ignored:
- resp.Body.Close() (artifacts/client.go): deferred cleanup — failure
to close a response body is non-critical; the defer itself is what
matters for connection reuse.
- rows.Close() (bundle/exporter.go): deferred cleanup in a loop where
rows.Err() already handles query errors.
- filepath.Walk (bundle/exporter.go): top-level walk call; errors in
sub-directory traversal are handled by the inner callback (which
returns nil for err != nil).
- broadcaster.RecordAndBroadcast (bundle/importer.go): fire-and-forget
event broadcast; errors are logged internally by the broadcaster.
- db.DB.ExecContext (bundle/importer.go): best-effort runtime column
update; non-critical auxiliary data that the provisioner re-extracts
if needed.
Fixes: #1143
CP-QA approved. golangci-lint fixes in bundle/exporter.go + bundle/importer.go, redactSecrets in admin_memories.go, plus 489-line admin_memories_test.go.
Silent data loss on mid-cursor DB errors — partial sub-workspace
bundles returned instead of surfacing the iteration error. Adds
rows.Err() check after the SELECT id FROM workspaces query in
Export(), mirroring the pattern already used in scheduler.go
and handlers with similar recursion patterns.
Closes: R1 MISSING-ROWS-ERR findings (bundle/exporter.go)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>