The new external-runtime regression test had two payload bugs that made
step 5 fail with HTTP 400 on its first run:
1. Field name: sent {"workspace_id":...} but RegisterPayload (workspace-
server/internal/models/workspace.go:58) declares `id` with
binding:"required" — workspace_id is the heartbeat payload's field,
not register's.
2. Missing required field: agent_card has binding:"required" and was
absent. ShouldBindJSON 400'd before any handler logic ran, which is
why the body said nothing useful.
Why this got past local verification: the test was written from memory
of the heartbeat shape, never run end-to-end before pushing, and curl
with --fail-with-body prints the body to stdout but exit-22's under
set -e — the body was suppressed before the log line could fire.
Fix:
- Send `id` + a minimal valid agent_card ({name, skills:[{id,name}]})
matching the canonical shape from tests/e2e/test_api.sh:96.
- Pull the body into REGISTER_BODY shared between steps 5 and 7 so
drift between the two register calls is impossible.
- Drop --fail-with-body for these two calls and append HTTP_CODE via
curl -w so the body is always visible when the call non-200s. The
explicit grep for HTTP_CODE=200 + ||true on curl preserves the
fail-fast contract.
- Inline payload contract comment pointing at RegisterPayload so the
next person editing this doesn't repeat the heartbeat-confusion
mistake.
The url=https://example.invalid:443 is fine: runtime=external resolves
to poll mode (registry.go:resolveDeliveryMode case 3), and validateAgentURL
only fires for push.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The harness had `STATUS == "ready"` as the terminal condition, but
/cp/admin/orgs returns `instance_status='running'` for the live tenant.
Test ran for 14 minutes seeing instance_status=running and timing out
because nothing matched 'ready'.
Mirrors test_staging_full_saas.sh:210-211 — the case "$STATUS" in
running) break path is the source of truth. Also adds the same
diagnostic burst on 'failed' so the next run surfaces last_error
instead of just "timed out."
Caught on the first dispatch run (id=25177415268) of this harness.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pins the four workspaces.status=awaiting_agent transitions on a real
staging tenant, end-to-end. Catches the class of silent enum failures
that migration 046 fix-forwarded — specifically:
1. workspace.go:333 — POST /workspaces with runtime=external + no URL
parks the row in 'awaiting_agent'. Pre-046 the UPDATE silently
failed and the row stuck on 'provisioning'.
2. registry.go:resolveDeliveryMode — registering an external workspace
defaults delivery_mode='poll' (PR #2382). The harness asserts the
poll default after register.
3. registry/healthsweep.go:sweepStaleRemoteWorkspaces — after
REMOTE_LIVENESS_STALE_AFTER (90s default) with no heartbeat, the
workspace transitions back to 'awaiting_agent'. Pre-046 the sweep
UPDATE silently failed and the workspace stuck on 'online' forever.
4. Re-register from awaiting_agent → 'online' confirms the state is
operator-recoverable, which is the whole reason for using
awaiting_agent (vs. 'offline') as the external-runtime stale state.
The harness mirrors test_staging_full_saas.sh: tenant create →
DNS/TLS wait → tenant token retrieve → exercise → idempotent teardown
via EXIT/INT/TERM trap. Exit codes match the documented contract
{0,1,2,3,4}; raw bash exit codes are normalized so the safety-net
sweeper doesn't open false-positive incident issues.
The companion workflow gates on the source files that touch this
lifecycle: workspace.go, registry.go, workspace_restart.go,
healthsweep.go, liveness.go, every migration, the static drift gate,
and the script + workflow themselves. Daily 07:30 UTC cron catches
infra drift on quiet days. cancel-in-progress=false because aborting
a half-rolled tenant leaves orphan resources for the safety-net to
clean.
Verification:
- bash -n: ok
- shellcheck: only the documented A && B || C pattern, identical to
test_staging_full_saas.sh.
- YAML parser: ok.
- Workflow path filter matches every site that writes to the
workspace_status enum (cross-checked against the drift gate's
UPDATE workspaces / INSERT INTO workspaces enumeration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Migration 043 (2026-04-25) introduced the workspace_status enum but
omitted two values application code had been writing for days, so every
UPDATE that tried to write either value failed silently in production:
'awaiting_agent' (since 2026-04-24, commit 1e8b5e01):
- handlers/workspace.go:333 — external workspace pre-register
- handlers/registry.go (via PR #2382) — liveness offline transition
- registry/healthsweep.go (via PR #2382) — heartbeat-staleness sweep
'hibernating' (since hibernation feature shipped):
- handlers/workspace_restart.go:271 — DB-level claim before stop
All four/five sites swallowed the enum-cast error. User-visible impact:
external workspaces never transition to a stale state when their agent
disconnects (canvas shows them stuck on 'online'/'degraded' indefinitely),
new external workspaces never advance past 'provisioning', and idle
workspaces never auto-hibernate (resources held forever).
PR #2382 didn't *cause* this — it inherited the gap and added two more
silent-fail paths on top. The pre-existing two had been broken for five
days and went unnoticed because:
1. sqlmock matches SQL by regex, not against the live enum constraint.
Every test passed despite the prod-only failure.
2. The handlers either drop the Exec error entirely (workspace.go:333)
or log+continue without an alert (the other three).
Fix in three pieces:
1. migrations/046_*.up.sql — ALTER TYPE workspace_status ADD VALUE
'awaiting_agent', 'hibernating'. IF NOT EXISTS makes it idempotent
across re-runs (RunMigrations re-applies until schema_migrations
records the file). ALTER TYPE ADD VALUE doesn't take a heavy lock
and commits immediately, safe under live traffic.
2. migrations/046_*.down.sql — full rename → recreate → cast → drop
recipe. Postgres has no DROP VALUE so this is the only honest
rollback. Pre-flights existing rows to compatible values
(awaiting_agent → offline, hibernating → hibernated) before the
type swap.
3. internal/db/workspace_status_enum_drift_test.go — static gate that
parses every UPDATE/INSERT against `workspaces` in workspace-server/
internal/, extracts every status literal, and asserts each is in
the enum union (CREATE TYPE + every ALTER TYPE ADD VALUE). The gate
runs in unit tests, no DB required, and would have caught both
omissions on the day they shipped. Pattern matches
feedback_behavior_based_ast_gates and feedback_mock_at_drifting_layer.
Verification:
- go test ./internal/db/ -count=1 -race ✓
- go vet ./... ✓
- Drift gate flips red if I delete either ADD VALUE from the migration
(validated via local mutation).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pin the 5 public functions adapters and the runtime hot-path import
through ``from platform_auth import``:
- ``auth_headers`` — every outbound httpx call merges this in
- ``self_source_headers`` — A2A peer + self-message header builder
- ``get_token`` — main.py reads on boot to decide register-vs-resume
- ``save_token`` — main.py persists the platform-issued token
- ``refresh_cache`` — 401-retry path drops in-process cache (#1877)
A grep across workspace/ shows 14+ runtime modules import these:
main.py, heartbeat.py, a2a_client.py, a2a_tools.py, consolidation.py,
events.py, executor_helpers.py (3 sites), molecule_ai_status.py,
builtin_tools/memory.py (3 sites), builtin_tools/temporal_workflow.py
(2 sites). Renaming any of the five (e.g. ``auth_headers`` →
``bearer_headers``) makes every one of those imports raise ImportError
at workspace boot — the failure surface is deep in heartbeat init,
nowhere near the rename site.
Same drift class as the BaseAdapter signature snapshot (#2378, #2380),
skill_loader gate (#2381), runtime_wedge gate (#2383). Reuses the
``_signature_snapshot.py`` helpers shipped in #2381.
Defense-in-depth: ``test_snapshot_has_required_functions`` asserts
the five names are still present, so removing one even with a
synchronized snapshot edit forces an explicit edit here with a
justification.
``clear_cache`` is intentionally NOT in the snapshot — it's a
test-only helper. Production code MUST NOT depend on it.
Verified red on deliberate rename: ``auth_headers`` →
``bearer_headers`` produces a clean diff of the missing function in
the failure message. Restored before commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "do request → check err → defer close → forward headers → set
status → io.Copy → log mid-stream errors" tail was duplicated between
Upload and Download. Each handler had ~12 lines that differed only in:
- the op label in log messages ("upload" vs "download")
- the set of response headers to forward verbatim
(Upload: Content-Type only; Download: Content-Type +
Content-Length + Content-Disposition)
Hoist into ChatFilesHandler.streamWorkspaceResponse(c, op,
workspaceID, forwardURL, req, forwardHeaders). Each call site
reduces to one line. Future changes — request-id forwarding,
observability metric, response-size cap, bytes-streamed log —
go in ONE place rather than two.
Same drift-prevention rationale as resolveWorkspaceForwardCreds
(#2372) and readOrLazyHealInboundSecret (#2376), applied to the
response-streaming layer of the same handlers.
Behavior preserved: existing TestChatUpload_* and TestChatDownload_*
integration tests (8 across both handlers) all pass unchanged. The
log message format is consistent across both handlers now (single
"chat_files {op}: ..." string template) — operators can grep one
prefix for both features instead of separate prefixes per handler.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BaseAdapter docstring tells adapter authors:
> ``runtime_wedge.mark_wedged()`` / ``clear_wedge()`` — flip the
> workspace to ``degraded`` + auto-recover when your SDK hits a
> non-recoverable error class. Import directly from ``runtime_wedge``;
> the heartbeat forwards the state to the platform automatically.
That's a contract — adapter templates depend on the four module-level
functions (``is_wedged``, ``wedge_reason``, ``mark_wedged``,
``clear_wedge``) being importable by those exact names with those
exact signatures. Renaming any silently breaks every adapter that
calls them: the import resolves the module fine, the
``AttributeError`` only surfaces when the adapter actually hits its
first SDK error — long after the rename merges.
Same drift class as #2378 / #2380 / #2381 (BaseAdapter, skill_loader)
applied to the module-level function surface.
Changes:
- tests/_signature_snapshot.py gains build_module_functions_record.
Walks a module's public top-level functions, optionally filtered
to a specific name list (used here — runtime_wedge has internal
helpers like reset_for_test that intentionally aren't part of
the contract). Skips re-exports via __module__ check so a
`from foo import bar` doesn't pollute the snapshot.
- tests/test_runtime_wedge_signature.py snapshots the four
contract functions. Plus a defense-in-depth required-functions
test that catches removal even when source + snapshot are
updated together.
Verified: deliberately renaming `mark_wedged` → `mark_wedged_RENAMED`
trips the gate with full snapshot diff in the failure message.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Paired molecule-core change for the molecule-cli `molecule connect`
RFC (https://github.com/Molecule-AI/molecule-cli/issues/10).
After this PR an `external`-runtime workspace's full lifecycle
matches the operator-driven model: it boots in awaiting_agent, the
CLI connects in poll mode without operator-side flag tuning, the
heartbeat-loss path lands back on awaiting_agent (re-registrable)
instead of the terminal-feeling 'offline'.
Two changes in workspace-server:
1) `resolveDeliveryMode` (registry.go) now reads `runtime` alongside
`delivery_mode`. Resolution order:
a. payload.delivery_mode if non-empty (operator override)
b. row's existing delivery_mode if non-empty (preserves prior
registration)
c. **NEW:** "poll" if row.runtime = "external" — external
operators run on laptops without public HTTPS; push-mode
would hard-fail at validateAgentURL anyway. (`molecule connect`
registers without --mode and expects this default.)
d. "push" otherwise (historical default for platform-managed
runtimes — langgraph, hermes, claude-code, etc.)
2) Heartbeat-loss for external workspaces lands them in
`awaiting_agent` instead of `offline`. Two code paths:
- `liveness.go` — Redis TTL expiration. Uses a CASE expression
so the conditional is one UPDATE (no extra round-trip for
non-external runtimes, no TOCTOU between runtime read and
status write).
- `healthsweep.go::sweepStaleRemoteWorkspaces` — DB-side
last_heartbeat_at age scan. This sweep is already external-
only by query filter, so the UPDATE just hard-codes the new
status.
The Docker-side `sweepOnlineWorkspaces` keeps `offline` —
recovery there is "restart the container", not "re-register from
the operator's box".
Why awaiting_agent over offline for external:
- Matches the status the workspace was created in (workspace.go:333).
- The CLI re-registers on every invocation; awaiting_agent → online
is the natural transition. offline is a terminal-feeling status
that implies operator intervention is needed.
- An operator who closed their laptop overnight should see
awaiting_agent in canvas, not 'offline (something is wrong)'.
Test plan:
- Existing: 9 `resolveDeliveryMode` test sites updated to the new
query shape. Sqlmock now reads `delivery_mode, runtime` columns.
- New: TestRegister_ExternalRuntime_DefaultsToPoll asserts the
external→poll branch. TestRegister_NonExternalRuntime_StillDefaultsToPush
guards against the new branch overshooting (langgraph keeps push).
- Liveness: regex updated to match the CASE expression.
- Healthsweep: `TestSweepStaleRemoteWorkspaces_MarksStaleAwaitingAgent`
(renamed for grep-ability), Docker-side sweepOnlineWorkspaces test
unchanged (verified to still match `'offline'`).
- Full handlers + registry suite green under -race (12.873s + 2.264s).
No migration needed — `status` is a free-form text column; both
'offline' and 'awaiting_agent' are existing values used elsewhere
(workspace.go uses awaiting_agent on initial external creation).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes in one PR (tightly coupled — the second wouldn't make
sense without the first):
1. Hoist the inspect-based snapshot helpers out of
test_adapter_base_signature.py into tests/_signature_snapshot.py
so future surfaces don't copy-paste introspection logic.
- build_class_signature_record(cls): walks public methods,
unwraps static/class/abstract methods, returns a stable
{class, methods: [...]} dict.
- build_dataclass_record(cls): walks dataclass fields via
dataclasses.fields(), returns {name, frozen, fields: [...]}.
- compare_against_snapshot(actual, path): writes-on-first-run +
diff-on-drift, with both expected and actual JSON in failure
message.
test_adapter_base_signature.py is rewritten to use the helpers;
the existing snapshot file is byte-identical (no behavior change).
2. New gate: tests/test_skill_loader_signature.py covers the
public dataclasses exported from skill_loader/loader.py:
- SkillMetadata: every adapter pattern-matches on .runtime for
skill-compat filtering. Renaming this field would silently
break per-adapter skill loading — the loader still returns
objects, but adapters' `if "*" in skill.metadata.runtime`
raises AttributeError at workspace boot.
- LoadedSkill: returned in SetupResult.loaded_skills.
Includes test_snapshot_has_required_skill_metadata_fields
defense-in-depth: ensures the runtime / id / name / description
fields stay even if both source and snapshot are updated together.
Verified: deliberately renaming SkillMetadata.runtime trips the
gate with full snapshot diff in the failure message.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follows up #2378. The BaseAdapter snapshot covers method signatures
but `adapter_base.py` also exports three public dataclasses that
form the call/return contract between the platform and every
adapter:
- SetupResult — returned by adapter._common_setup()
- AdapterConfig — passed into adapter setup hooks
- RuntimeCapabilities — returned by adapter.capabilities();
drives platform-side dispatch routing (#117)
Renaming a RuntimeCapabilities flag silently disables every
adapter's capability declaration (the platform fallback runs)
without an AttributeError to surface the breakage. That's exactly
the drift class the snapshot pattern is meant to catch.
Changes:
- _build_dataclass_snapshot walks SetupResult, AdapterConfig,
RuntimeCapabilities via dataclasses.fields(), capturing field
name + type annotation + has_default per field, plus the
@dataclass(frozen=...) flag.
- _build_full_snapshot composes method + dataclass records into
one stable JSON snapshot.
- test_snapshot_has_required_dataclass_fields — defense-in-depth
test parallel to test_snapshot_has_required_methods. Catches
field removal even when both source AND snapshot are updated
together. Required field set is intentionally short (the flags
that drive platform dispatch + the adapter-level config knobs).
Verified: deliberately renaming `provides_native_heartbeat` →
`provides_native_heartbeat_RENAMED` trips
test_base_adapter_signature_matches_snapshot with a full diff in
the failure message.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The check has been blocking the staging→main auto-promote PR (#2361)
since 2026-04-30T07:17Z with:
fatal: origin/main...<head>: no merge base
Root cause: the workflow does `git fetch origin <base> --depth=1`
which overwrites checkout@v4's full-history clone with a shallow
tip — destroying the ancestry the subsequent
`git diff origin/main...HEAD` (three-dot, merge-base form) needs.
This deadlocks every staging→main promote PR until manually fixed.
The auto-promote runs were succeeding at the gate-check phase but
the subsequent PR-merge step waited 30 min for the failing check
and timed out, skipping the publish + redeploy dispatch tail.
Fleet recovery for any production-only fix went through staging
fine but never reached main.
Fix: drop --depth=1 so the explicit fetch preserves full history.
The leading comment is updated to call out this trap so a future
maintainer doesn't re-add the flag thinking it's a perf win.
No test added: this is a workflow-config one-liner that the
existing PR check itself exercises end-to-end (the real signal is
PR #2361 going green after this lands).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every workspace template (langgraph, claude-code, hermes, etc.)
subclasses BaseAdapter. Renaming, removing, or re-typing a method
on the base class silently breaks templates: the override stops
being recognized as an override; the old method-name's caller
silently invokes the default no-op; the new method-name is
unimplemented in templates that haven't migrated.
Recent #87 universal-runtime + #1957 recordResource refactor both
renamed/added methods. Without a frozen snapshot, the next rename
ships quietly and surfaces only when a template's CI catches the
AttributeError days later — long after the merge window for an
easy revert.
This snapshot pins BaseAdapter's public method surface against a
checked-in JSON file. Same-shape pattern as PR #2363's A2A
protocol-compat replay gate, applied to a Python public-API
surface instead of JSON message shapes. Both close drift classes
by snapshotting the structural surface that consumers depend on.
Two tests:
1. test_base_adapter_signature_matches_snapshot — full
introspection diff against tests/snapshots/adapter_base_signature.json.
Drift = test failure with both expected + actual JSON in
the message so the reviewer sees what changed.
2. test_snapshot_has_required_methods — defense-in-depth: even
if both the source AND snapshot are updated together
(intentional API removal), this catches removal of the
short list of methods that EVERY template depends on (name,
display_name, description, capabilities, memory_filename).
Removing one of these requires explicit edit to the
`required` set with a justification.
Verified the gate fires red on a deliberate rename
(memory_filename → memory_filename_RENAMED) — failure message
shows the full snapshot diff including parameter shapes and
return annotations.
Updating the snapshot is the explicit acknowledgment that a
template-affecting API change is intentional. Reviewer of the
introducing PR sees the snapshot diff and decides whether
template repos need coordinated updates.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The helper landed in #2376 and is exercised via chat_files + registry
integration tests. Those tests conflate the helper's behavior with the
caller's response shape — a future refactor that broke the (secret,
healed, err) contract subtly (e.g. returning healed=true on a
read-success path, or swallowing a mint error) might still pass them.
Adds 4 direct sub-tests pinning each branch of the contract:
- secret already present → (s, false, nil)
- secret missing, mint succeeds → (minted, true, nil)
- secret missing, mint fails → ("", false, err)
- read fails (non-NoInboundSecret) → ("", false, err)
Each sub-case asserts the return tuple shape AND mock.ExpectationsWereMet
(for the success path) so a future helper change that skips a DB op
trips the gate immediately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The lazy-heal-on-miss pattern landed in two places this session:
PR #2372 (chat_files.go::resolveWorkspaceForwardCreds — Upload + Download)
and PR #2375 (registry.go::Register). Both implementations did the same
thing:
read → if ErrNoInboundSecret then mint inline → return outcome
Different response-shape requirements but the same core mechanic. Three
sites' worth of drift potential: any future heal-time condition we add
(audit log, alert, secret rotation, observability) had to be applied to
each site, with partial application silently re-opening the gap.
Fix: extract readOrLazyHealInboundSecret in workspace_provision_shared.go
returning (secret, healed, err). Each caller maps the outcome to its
response shape:
- chat_files: healed=true → 503 with retry hint; err != nil → 503 with
RFC-#2312 reprovision hint
- registry: healed=true|false + err==nil → include in response;
err != nil → omit field (workspace can retry on next register)
Net effect:
- Single source of truth for the read+heal mechanic
- Response-shape decisions stay in callers (they DO differ per feature)
- Future heal-time conditions go in one place
- Behavior preserved: existing TestRegister_NoInboundSecret_LazyHeals,
TestRegister_NoInboundSecret_LazyHealMintFailureOmitsField,
TestChatUpload_NoInboundSecret_LazyHeal*,
TestChatDownload_NoInboundSecret_LazyHeal* all pass unchanged
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-fix: a legacy SaaS workspace with NULL platform_inbound_secret
needed two round-trips before chat upload worked:
1. Workspace registers → response missing platform_inbound_secret
2. User attempts chat upload → chat_files lazy-heals platform-side
(RFC #2312 backfill) → 503 + retry-after
3. Workspace heartbeats → register response now includes the
freshly-minted secret → workspace writes /configs/.platform_inbound_secret
4. User retries chat upload → workspace bearer matches → 200
The platform-side lazy-heal in chat_files.go (#2366) closes the
existing-workspace gap, but the user-visible round-trip dance is
still ugly.
Fix: lazy-heal at register time too. When ReadPlatformInboundSecret
returns ErrNoInboundSecret, mint inline and include the freshly-
minted secret in the register response. Collapses the dance to a
single round-trip:
1. Workspace registers → response includes lazy-healed secret
2. User attempts chat upload → workspace bearer matches → 200
Failure model: best-effort. Mint failure logs and falls through to
omitting the field (workspace will retry on next register call).
The 200 response status is preserved — register success doesn't
hinge on the inbound-secret heal.
Tests:
- TestRegister_NoInboundSecret_LazyHeals: pins the success branch.
Mocks the UPDATE explicitly + asserts ExpectationsWereMet, so a
regression that skipped the mint would fail loudly. Replaces
the prior TestRegister_NoInboundSecret_OmitsField which
"passed" on this branch only because sqlmock-unmatched-UPDATE
coincidentally drove the omit-field error path.
- TestRegister_NoInboundSecret_LazyHealMintFailureOmitsField:
pins the failure branch — explicit UPDATE error → 200 + field
absent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ValidateToken, WorkspaceFromToken, and ValidateAnyToken each duplicated
the same JOIN+WHERE auth predicate:
FROM workspace_auth_tokens t
JOIN workspaces w ON w.id = t.workspace_id
WHERE t.token_hash = $1
AND t.revoked_at IS NULL
AND w.status != 'removed'
Same drift class as the SaaS provision-mint bug fixed in #2366. A
future safety addition (e.g. exclude paused workspaces from auth) had
to be applied to all three queries; a partial application would
silently re-open one auth path while closing the others.
Fix: hoist the predicate into lookupTokenByHash, which projects
(id, workspace_id) — the union of fields any caller needs. Each
public function picks what it uses:
- ValidateToken — needs both (compares workspaceID, updates last_used_at by id)
- WorkspaceFromToken — needs workspace_id
- ValidateAnyToken — needs id
The trivial perf cost of selecting one extra column per call is worth
the single-source-of-truth guarantee for the auth predicate.
Test mock updates: two upstream test files (a2a_proxy_test, middleware
wsauth_middleware_test{,_canvasorbearer_test}) had hand-typed regex
matchers and row shapes pinned to the per-function SELECT projection.
Updated to the unified shape; behavior is unchanged.
All wsauth + middleware + handlers + full-module tests green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The admin test-token endpoint has a critical security check at
admin_test_token.go:64-72 — the IDOR fix from #112 that requires an
explicit ADMIN_TOKEN bearer when the env var is set. Pre-fix, the
route accepted ANY bearer that matched a live org token, allowing
cross-org test-token minting (and therefore cross-org workspace
authentication). The current code uses subtle.ConstantTimeCompare
against ADMIN_TOKEN.
Test coverage was zero. The existing tests exercised the
ADMIN_TOKEN-unset path (local dev / CI) but never set ADMIN_TOKEN.
A regression that:
- removed the os.Getenv("ADMIN_TOKEN") check
- inverted the comparison
- replaced ConstantTimeCompare with bytes.Equal (timing leak)
- re-introduced the AdminAuth fallback that allows org tokens
would not fail any test, and the breakage would re-open the IDOR
that #112 closed.
Adds four tests covering the gate matrix:
- ADMIN_TOKEN set + no Authorization header → 401
- ADMIN_TOKEN set + wrong Authorization → 401
- ADMIN_TOKEN set + correct Authorization → 200
- ADMIN_TOKEN unset + no Authorization → 200 (gate bypassed safely)
The 4-row matrix pins the gate's full truth table: any regression in
either dimension (gate enabled/disabled, header correct/wrong) trips
exactly one test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 50-line "resolve URL + read inbound secret + lazy-heal on miss"
block was duplicated nearly verbatim between Upload and Download
handlers. Drift-prone — same class of risk as the original SaaS
provision drift fixed in #2366. A future change like:
- secret rotation (re-mint when the row's older than X)
- per-feature audit logging
- additional fail-closed conditions
would have to be applied to both handlers, and a partial application
that healed Upload but skipped Download would surface only at runtime.
Fix: hoist the shared logic into resolveWorkspaceForwardCreds. The
function takes an op label ("upload"/"download") used in log messages
+ the 503 RFC-#2312 detail copy so operators can still distinguish
which feature ran. Both handlers reduce to:
wsURL, secret, ok := resolveWorkspaceForwardCreds(c, ctx, workspaceID, "upload")
if !ok { return }
Net -20 lines (helper amortizes the 50-line block across both call
sites). Existing test coverage (TestChatUpload_NoInboundSecret_*,
TestChatDownload_NoInboundSecret_* from PR #2370) covers all four
branches of the shared helper.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#2367 moved PARENT_ID env injection from inline TeamHandler.Expand
into the shared prepareProvisionContext (sourced from
payload.ParentID). The test was missing — a regression that:
- dropped the injection
- inverted the nil-check
- leaked an empty PARENT_ID="" into env
would not fail any existing test, but workspace/coordinator.py reads
PARENT_ID on startup to track parent-child relationship, so the
breakage would surface only at runtime.
Adds TestPrepareProvisionContext_ParentIDInjection with three
sub-cases:
- nil ParentID → no PARENT_ID env
- empty-string ParentID → no PARENT_ID env (don't pollute)
- set ParentID → PARENT_ID env equals value
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 2026-04-30 lazy-heal fix in chat_files.go (PR #2366) ATTEMPTS to
mint platform_inbound_secret on miss so legacy workspaces self-heal
without requiring destructive reprovision. The pre-existing
TestChatUpload_NoInboundSecret + TestChatDownload_NoInboundSecret
tests asserted the 503 response shape but did NOT pin that the mint
UPDATE actually fires — they happened to exercise the mint-failure
branch (sqlmock unmatched UPDATE = error = "Failed to mint" code path
returns 503 with "RFC #2312" detail, which still passed the original
assertions).
This means a regression that:
- skipped the lazy-heal mint entirely
- inverted the success/failure response branches
- moved the mint to a different code path
would not fail those tests.
Fix:
- TestChatUpload_NoInboundSecret_LazyHeal: mock the UPDATE
successfully; assert sqlmock.ExpectationsWereMet (mint MUST run)
+ body contains "retry" + "30" (success branch).
- TestChatUpload_NoInboundSecret_LazyHealFailure: mock the UPDATE
to fail; assert body contains "Reprovision" (failure branch).
- Same pair for the Download handler — independent code path means
independent test.
Pins both branches of both handlers (4 tests) so future drift trips
the gate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes#2367.
TeamHandler.Expand provisioned child workspaces by directly calling
h.provisioner.Start, skipping mintWorkspaceSecrets and every other
preflight (secrets load, env mutators, identity injection, missing-env,
empty-config-volume auto-recover). Children shipped with NULL
platform_inbound_secret + never-issued auth_token — same drift class as
the SaaS bug just fixed in PR #2366, found while exercising a stronger
gate against this package.
Fix:
- TeamHandler now holds *WorkspaceHandler. Expand delegates each child
provision to wh.provisionWorkspace, picking up the shared
prepare/mint/preflight pipeline automatically. Future provision-time
steps go in ONE place and team-expand inherits them.
- prepareProvisionContext gains PARENT_ID env injection sourced from
payload.ParentID (which Expand now populates). This preserves the
signal workspace/coordinator.py reads on startup, without threading
env through provisioner.WorkspaceConfig manually.
- NewTeamHandler signature gains *WorkspaceHandler; router passes it.
Gate upgrade:
- TestProvisionFunctions_AllCallMintWorkspaceSecrets is now
behavior-based: it walks every FuncDecl in the package and flags any
function that calls h.provisioner.Start or h.cpProv.Start without
also calling mintWorkspaceSecrets. Drift-resistant by construction —
a future provision function with any name still trips the gate.
- Replaces the name-list version from PR #2366. The name list missed
Expand precisely because Expand wasn't named provision*; the
behavior-based detector caught it spontaneously when prototyped.
Tests: full workspace-server module green; gate previously verified to
fire red on Expand pre-fix and on deliberate mintWorkspaceSecrets
removal.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of 2026-04-30 silent-503 chat-upload bug: provisionWorkspaceCP
(SaaS) skipped issueAndInjectInboundSecret while provisionWorkspaceOpts
(Docker) called it. Every prod SaaS workspace provisioned with NULL
platform_inbound_secret → upload returned 503 with the v2-enrollment
message on every attempt.
Structural fix:
- Extract prepareProvisionContext (secrets load, env mutators, preflight,
cfg build), mintWorkspaceSecrets (auth_token + platform_inbound_secret),
markProvisionFailed (broadcast + DB update) into workspace_provision_shared.go
- Refactor both provision modes to call the shared helpers
- Add provisionAbort struct so the missing-env failure class can carry its
structured "missing" payload through the shared abort path
- Unify last_sample_error: previously the decrypt-fail path skipped it while
others set it; users now see every failure class in the UI
Drift prevention:
- AST gate TestProvisionFunctions_AllCallMintWorkspaceSecrets asserts every
function in the provisionFunctions set calls mintWorkspaceSecrets at least
once (same shape as the audit-coverage gate from #335). New provision paths
must either call mint or be added to provisionExemptFunctions with a
one-line justification
- Behavioral test TestMintWorkspaceSecrets_PersistsInboundSecretInSaaSMode
pins the contract: SaaS mode MUST persist platform_inbound_secret to the DB
column even though it skips file injection
Existing-workspace recovery (chat_files.go lazy-heal):
- Upload + Download handlers detect NULL platform_inbound_secret and call
IssuePlatformInboundSecret inline, returning 503 with retry_after_seconds=30
- Self-heals workspaces that were provisioned before this fix without
requiring destructive reprovision
Tests: full handlers + workspace-server module green; AST gate verified to
fire red on deliberate violation (commented-out mint call surfaces the
exact function name + actionable remediation message).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI failure: the Ops scripts (unittest) job runs `python -m unittest
discover` which doesn't have pytest installed. test_check_migration_
collisions.py imported pytest unconditionally, failing module import:
ImportError: Failed to import test module: test_check_migration_collisions
Traceback (most recent call last):
File ".../test_check_migration_collisions.py", line 12, in <module>
import pytest
ModuleNotFoundError: No module named 'pytest'
The tests use no pytest-specific features (just bare assert + plain
class). Sibling test_sweep_cf_decide.py in the same dir already uses
unittest.TestCase. Convert this one to match: drop the pytest import,
make TestMigrationFileRe inherit from unittest.TestCase.
unittest.TestLoader.discover() requires TestCase subclasses for
auto-discovery, so the fix is two lines (drop import, add base).
Bare assert statements work fine inside TestCase methods.
Verified: `python3 -m unittest scripts.ops.test_check_migration_collisions -v`
runs all 9 tests, all pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Independent review on PR #2362 caught: the dead-agent classifier at
a2a_proxy.go included 502/503/504/524 but missed the rest of the CF
origin-failure family (521/522/523), which are MORE indicative of a
dead EC2 than 524:
- 521 "Web server is down" — CF can't open TCP to origin (most direct
dead-EC2 signal; fires when the workspace EC2 has been terminated
and CF still has the CNAME pointing at it).
- 522 "Connection timed out" — TCP didn't complete in ~15s (typical
of SG/NACL flap or agent process hung on accept).
- 523 "Origin is unreachable" — CF can't route to origin (DNS gone,
network path broken).
Pre-fix any of these would propagate as-is to the canvas and the user
would see a 5xx without the reactive auto-restart firing — exactly
the SaaS-blind class of failure PR #2362 was meant to close.
Refactor: extracted isUpstreamDeadStatus(int) helper so the matrix is
in one place, with TestIsUpstreamDeadStatus locking in 18 status
codes (7 dead, 11 not-dead including 520 and 525 which look CF-shaped
but indicate different failures).
Also tightened TestStopForRestart_NoProvisioner_NoOp per the same
review: now uses sqlmock.ExpectationsWereMet to assert the dispatcher
doesn't touch the DB on the both-nil path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Backward-compat replay gate for the A2A JSON-RPC protocol surface.
Every PR that touches normalizeA2APayload OR bumps the a-2-a-sdk
version pin runs every shape in testdata/a2a_corpus/ through the
current code and asserts:
valid/ — every shape MUST parse without error and produce a
canonical v0.3 payload (params.message.parts list).
invalid/ — every shape MUST be rejected with the documented
status code and error substring.
What this prevents
The 2026-04-29 v0.2 → v0.3 silent-drop bug (PR #2349) shipped
because the SDK bump PR didn't replay v0.2-shaped inputs against
the new code; the shape-mismatch surfaced only in production when
the receiver's Pydantic validator silently rejected inbound
messages.
This gate would have caught it pre-merge. Hand-verified: reverting
the v0.2 string→parts shim in normalizeA2APayload fails 3 of the
v0.2 corpus entries with the exact rejection class the production
bug exhibited.
Corpus contents (11 entries)
valid/ (10):
v0_2_string_content — basic v0.2 (the broken case)
v0_2_string_content_no_message_id — v0.2 + auto-fill messageId
v0_2_list_content — v0.2 with content as Part list
v0_3_parts_text_only — canonical v0.3
v0_3_parts_multi_text — multi-Part list
v0_3_parts_with_file — multimodal (text + file)
v0_3_parts_with_context — contextId for multi-turn
v0_3_streaming_method — message/stream variant
v0_3_unicode_text — emoji + multi-script
v0_3_long_text — 10KB text Part
no_jsonrpc_envelope — bare params/method without
outer envelope (legacy senders)
invalid/ (3):
no_content_or_parts — message has neither field
content_is_integer — wrong type for v0.2 content
content_is_bool — wrong type, separate from int
so the failure msg identifies
which type-class regressed
Plus 4 inline malformed-JSON cases (truncated, not-JSON, empty,
whitespace) that can't be expressed as JSON corpus entries.
Coverage tests
The gate has 4 test functions:
1. TestA2ACorpus_ValidShapesParse — replay valid/ corpus,
assert no error + canonical v0.3 output (parts list non-empty,
messageId non-empty, content field deleted).
2. TestA2ACorpus_InvalidShapesRejected — replay invalid/ corpus,
assert rejection matches recorded status + error substring.
3. TestA2ACorpus_MalformedJSONRejected — inline cases for
non-parseable bodies.
4. TestA2ACorpus_HasMinimumCoverage — at least one v0.2 +
one v0.3 entry exists (loses neither side of the bridge).
5. TestA2ACorpus_EveryEntryHasMetadata — _comment/_added/_source
on every entry per the README policy; _expect_error and
_expect_status on invalid entries.
Documentation
testdata/a2a_corpus/README.md describes the corpus contract:
- When to add entries (new SDK shape, new production-observed
shape).
- When NOT to add (test scaffolding, hypothetical futures).
- Removal policy (breaking change, deprecation window required).
Verification
- All 24 corpus subtests pass on current main.
- Hand-test: revert the v0.2 compat shim → 3 v0.2 entries fail
the gate with the exact rejection class the production bug
exhibited. Confirmed.
- Whole-module go test ./... green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses code-review C1 (test goroutine race) and I2 (CF 524) on PR #2362.
C1: TestRunRestartCycle_SaaSPath_DispatchesViaCPProv invoked runRestartCycle
end-to-end, which spawns `go h.sendRestartContext(...)`. That goroutine
outlived the test, then read db.DB while the next test's setupTestDB wrote
to it — DATA RACE under -race, cascading 30+ failures across the handlers
suite. Refactored: extracted `stopForRestart(ctx, id)` from runRestartCycle
as a pure dispatcher, and rewrote the SaaS-path test to call it directly
(no async goroutine spawned). Added a no-provisioner no-op guard test.
I2: Cloudflare 524 ("origin timed out") now triggers maybeMarkContainerDead
alongside 502/503/504. Same upstream signal — origin agent unresponsive.
Verified `go test -race -count=1 ./internal/handlers/...` green locally.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Independent review of #2362 caught a Critical gap: the previous commit
fixed the Stop dispatch in runRestartCycle but left the provisionWorkspace
dispatch unconditionally Docker-only. So on SaaS the auto-restart cycle
would Stop the EC2 successfully (good), then NPE inside provisionWorkspace's
`h.provisioner.VolumeHasFile` call. coalesceRestart's recover()-without-
re-raise (a deliberate platform-stability safeguard) silently swallowed
the panic, leaving the workspace permanently stuck in status='provisioning'
because the UPDATE on workspace_restart.go:450 had already run.
Net pre-amendment effect on SaaS: dead agent → structured 503 (good) →
workspace flipped to 'offline' (good) → cpProv.Stop succeeded (good) →
provisionWorkspace NPE swallowed (bad) → workspace permanently
'provisioning' until manual canvas restart. The headline claim of #2362
("SaaS auto-restart now works") was false on the path it shipped.
Fix: dispatch the reprovision call the same way every other call site
in the package does (workspace.go:431-433, workspace_restart.go:197+596) —
branch on `h.cpProv != nil` and call provisionWorkspaceCP for SaaS,
provisionWorkspace for Docker.
Tests:
- New TestRunRestartCycle_SaaSPath_DispatchesViaCPProv asserts cpProv.Stop
is called when the SaaS path runs (would have caught the NPE if
provisionWorkspace had been called instead).
- fakeCPProv updated: methods record calls and return nil/empty by
default rather than panicking. The previous "panic on unexpected call"
pattern was unsafe — the panic fires on the async restart goroutine
spawned by maybeMarkContainerDead AFTER the test assertions ran, so
the test passed by accident even though the production path was
broken (which is exactly how the Critical bug landed).
- Existing tests still pass (full handlers + provisioner suites green).
Branch-count audit refresh:
runRestartCycle dispatch decisions:
1. h.provisioner != nil → provisioner.Stop + provisionWorkspace ✓ (existing tests)
2. h.cpProv != nil → cpProv.Stop + provisionWorkspaceCP ✓ (NEW test)
3. both nil → coalesceRestart never called (RestartByID gate) ✓
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Class-of-bugs fix surfaced by hongmingwang.moleculesai.app's canvas chat
to a dead workspace returning a generic Cloudflare 502 page on
2026-04-30. Three independent gaps in the reactive-health path that
together leak dead-agent failures to canvas with no auto-recovery.
## Bug 1 — maybeMarkContainerDead is a no-op for SaaS tenants
`maybeMarkContainerDead` only consulted `h.provisioner` (local Docker
provisioner). SaaS tenants set `h.cpProv` (CP-backed EC2 provisioner)
and leave `h.provisioner` nil — so the function early-returned false
on every call and dead EC2 agents never triggered the offline-flip /
broadcast / restart cascade.
Fix: extend `CPProvisionerAPI` interface with `IsRunning(ctx, id)
(bool, error)` (already implemented on `*CPProvisioner`; just needs
to surface on the interface). `maybeMarkContainerDead` now branches:
local-Docker path uses `h.provisioner.IsRunning`; SaaS path uses
`h.cpProv.IsRunning` which calls the CP's `/cp/workspaces/:id/status`
endpoint to read the EC2 state.
## Bug 2 — RestartByID short-circuits on `h.provisioner == nil`
Same shape as Bug 1: the auto-restart cascade triggered by
`maybeMarkContainerDead` calls `RestartByID` which short-circuited
when the local Docker provisioner was missing. So even if Bug 1 were
fixed, the workspace-offline state would never recover.
Fix: change the gate to `h.provisioner == nil && h.cpProv == nil`
and update `runRestartCycle` to branch on which provisioner is
wired for the Stop call. (The HTTP `Restart` handler already does
this branching correctly — we're just bringing the auto-restart path
to parity.)
## Bug 3 — upstream 502/503/504 propagated as-is, masked by Cloudflare
When the agent's tunnel returns 5xx (the "tunnel up but no origin"
shape — agent process dead but cloudflared connection still healthy),
`dispatchA2A` returns successfully at the HTTP layer with a 5xx body.
`handleA2ADispatchError`'s reactive-health path doesn't run because
that path is only triggered on transport-level errors. The pre-fix
code propagated the 502 status to canvas; Cloudflare in front of the
platform then masked the 502 with its own opaque "error code: 502"
page, hiding any structured response and any Retry-After hint.
Fix: in `proxyA2ARequest`, when the upstream returns 502/503/504, run
`maybeMarkContainerDead` BEFORE propagating. If IsRunning confirms
the agent is dead → return a structured 503 with restarting=true +
Retry-After (CF doesn't mask 503s the same way). If running, propagate
the original status (don't recycle a healthy agent on a transient
hiccup — it might have legitimately returned 502).
## Drive-by — a2aClient transport timeouts
a2aClient was `&http.Client{}` with no Transport timeouts. When a
workspace's EC2 black-holes TCP connects (instance terminated mid-flight,
SG flipped, NACL bug), the OS default is 75s on Linux / 21s on macOS —
long enough for Cloudflare's ~100s edge timeout to fire first and
surface a generic 502. Added DialContext (10s connect), TLSHandshake
(10s), and ResponseHeaderTimeout (60s). Client.Timeout DELIBERATELY
unset — that would pre-empt slow-cold-start flows (Claude Code OAuth
first-token, multi-minute agent synthesis). Long-tail body streaming
is still governed by per-request context deadline.
## Tests
- `TestMaybeMarkContainerDead_CPOnly_NotRunning` — IsRunning(false) →
marks workspace offline, returns true.
- `TestMaybeMarkContainerDead_CPOnly_Running` — IsRunning(true) →
no offline-flip, returns false (don't recycle a healthy agent).
- `TestProxyA2A_Upstream502_TriggersContainerDeadCheck` — agent server
returns 502 + cpProv reports dead → caller gets 503 with restarting=
true and Retry-After: 15.
- `TestProxyA2A_Upstream502_AliveAgent_PropagatesAsIs` — same upstream
502 but cpProv reports running → propagates 502 (existing behavior;
safety check that prevents over-eager recycling).
- Existing `TestMaybeMarkContainerDead_NilProvisioner` /
`TestMaybeMarkContainerDead_ExternalRuntime` still pass.
- Full handlers + provisioner test suites pass.
## Impact
Pre-fix: dead EC2 agent on a SaaS tenant → CF-masked 502 to canvas, no
auto-recovery, manual restart from canvas required.
Post-fix: dead EC2 agent on a SaaS tenant → structured 503 with
restarting=true + Retry-After to canvas, workspace flipped to offline,
auto-restart cycle triggered. Canvas can show a user-actionable
"agent is restarting, please wait" message instead of a generic 502.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>